Site Reliability Engineering
Wednesday, October 17

9:00am EDT

Monitoring The Easy Way
Observability is a hot new buzzword that comes from Control Theory, which isn't so new or hot. We'll look at how we can get more observability into our systems using Prometheus, Jaeger, OpenTracing, and Istio. We'll walk through a demo and deployment using the Operator pattern in Kubernetes.

Daniel Barker

Chief Architect, National Association of Insurance Commissioners
Dan spent 12 years in the military as a fighter jet mechanic before transitioning to a career in technology as a Software/DevOps Engineer/Manager. He’s now the Chief Architect at the National Association of Insurance Commissioners. He’s leading the technical and cultural transformation...

Wednesday October 17, 2018 9:00am - 9:30am EDT
Live, Online

9:30am EDT

Getting Started With Chaos Engineering
Chaos engineering is the practice of conducting thoughtful, planned experiments designed to reveal weaknesses in our systems. It can be thought of as the facilitation of experiments to uncover systemic weaknesses.

Ana Margarita Medina

Chaos Engineer, Gremlin
Ana is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. Previously, she worked at Uber as an engineer on the SRE and Infrastructure teams, where she specifically focused on chaos engineering and cloud...

Wednesday October 17, 2018 9:30am - 10:00am EDT
Live, Online

10:00am EDT

DevSecOps & Chaos Engineering: Knowing the Unknown
In this talk, Aaron Rinehart will uncover the importance of using Chaos Engineering in developing a learning culture in a DevOps world. Aaron will walk us through how to get started with Chaos Engineering and dive into some of the most popular and newest open source Chaos Engineering toolsets.

Aaron Rinehart

Founder, Chaos Engineering Meetup
Aaron has been expanding the possibilities of Chaos Engineering in its application to other safety-critical portions of the domain, notably Cyber Security. He began pioneering the application of Security in Chaos Engineering during his tenure as the Chief Security Architect at the largest...

Wednesday October 17, 2018 10:00am - 10:30am EDT
Live, Online

12:00pm EDT

Deploying SRE Training Best Practices To Production: What We Learned (A.K.A. Strapping Jetpacks On Unicorns, The Postmortem)
This talk addresses what we learned when scaling Site Reliability Engineering training best practices globally at Google. Along the way, we’ll share tips for small and large organizations alike on how you can learn from our experience and ensure that your SREs ramp up quickly.

Jennifer Petoff

Senior Program Manager, Site Reliability Engineering Team, Google Ireland
Jennifer Petoff is a Senior Program Manager for Google's Site Reliability Engineering team based in Dublin, Ireland. She is the global lead for Google’s SRE EDU program and is one of the co-editors of the best-selling book, Site Reliability Engineering: How Google Runs Production...

Wednesday October 17, 2018 12:00pm - 12:30pm EDT
Live, Online

12:30pm EDT

Practical, Team-Focused Operability Techniques For Distributed Systems
In this talk, we explore five practical, tried-and-tested, real world, team-focused techniques for improving operability with many kinds of software systems, including cloud, Serverless, on-premise, and IoT.

Matthew Skelton

Head of Consulting, Conflux
Matthew Skelton has been building, deploying, and operating commercial software systems since 1998. Head of Consulting at Conflux (confluxdigital.net), he specialises in Continuous Delivery, operability and organisation design for software in manufacturing, ecommerce, and online services...

Wednesday October 17, 2018 12:30pm - 1:00pm EDT
Live, Online

1:00pm EDT

Micro-Metrics To Forecast Performance Tsunamis
Tsunami waves travel at 500 - 600 miles/hr, while normal waves travel at 5 - 60 miles/hr. Due to technical limitations, even massive tsunamis are hard to forecast and detect beforehand. Recently, hypersensitive technologies that measure micro-metrics have been employed to forecast tsunamis. Similarly, it’s hard to forecast production performance problems beforehand. In this session you will learn which micro-metrics to measure in dev/test environments to forecast production performance problems with a fair level of accuracy.

Ram Lakshmanan

CEO, Tier1app
Every single day, millions and millions of people in North America travel, bank, and do commerce using the applications that Ram Lakshmanan has architected. He has developed one of the world’s largest banking applications, which is used by 1 in 3 USA households. He has designed a B2B...

Wednesday October 17, 2018 1:00pm - 1:30pm EDT
Live, Online

1:30pm EDT

Resolving Outages Faster With Better Debugging Strategies
When using tens or hundreds of microservices to provide an application's critical functionality, diagnosing what interaction between components is causing an outage can be challenging. Learn how SREs discover and debug problems at Google during outages, and hear real stories about our experiences.

Liz Fong-Jones

Developer Advocate, Activist, and Site Reliability Engineer, Google
Liz is a Staff Site Reliability Engineer at Google and works on the Google Cloud Customer Reliability Engineering team in New York. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates...

Wednesday October 17, 2018 1:30pm - 2:00pm EDT
Live, Online

3:30pm EDT

DevOps at Federal Reserve Bank of New York
As FRBNY moves workloads to cloud environments, we will discuss how we are achieving that while continuing to maintain the security levels that come with managing trillions of dollars. We will give an overview of the types of technologies for CI/CD and how we are delivering solutions to our customers.

Colin Wynd

VP & Head Common Services Organization, FRBNY
Head of the Common Services organization - a software house within the Federal Reserve Bank of New York (FRBNY) with ~100 staff. Common Services drives technology solutions across the Federal Reserve System (FRS) including DevOps, Cloud, NLP/AI, Engineering, Data Services including...

Wednesday October 17, 2018 3:30pm - 4:00pm EDT
Live, Online

4:00pm EDT

Smooth Sailing with Containers...Because Ship Happens
Today’s forward-thinking companies are leveraging container technology to deliver software faster. But most implementations and typical container tools focus primarily on the technology of creating and running containers. Now, as IT teams have learned how to work with containers, companies are challenged with how to run containers at scale and standardize and manage release processes across hundreds of applications. The ephemeral and disposable nature of containers requires significant effort to keep track of rapidly-changing container deployments, manage dependencies between multiple microservices, and enforce security, audit, and compliance processes.

Join this session and learn how to drive container and cloud deployments at scale while delivering the organizational benefits required—all while the business charges forward at high speed!

T.j. Randall

VP of Customer Success, XebiaLabs
T.j. Randall is the street-smart DevOps technologist and business-minded VP behind XebiaLabs’ Customer Success team. T.j. combines hands-on engineering experience with a passion for studying the intersections of business and technology. At XebiaLabs, he routinely partners with global...

Wednesday October 17, 2018 4:00pm - 4:30pm EDT
Live, Online

4:30pm EDT

Tickets Make Operations Unnecessarily Miserable
But what if we made a mistake letting tickets take over operations? What if these ticket queues are actually the source of much of the dysfunction, bottlenecks, and capacity issues that have traditionally plagued our organizations? This talk is going to make that case and discuss our alternatives.

Damon Edwards

Co-Founder and Chief Product Officer, Rundeck
Damon Edwards is a Co-Founder of Rundeck, Inc., the makers of Rundeck, the popular open source Self-Service Operations platform. Damon was previously a Managing Partner at DTO Solutions, a DevOps and IT Operations improvement consultancy. Damon has spent the past years working with...

Wednesday October 17, 2018 4:30pm - 5:00pm EDT
Live, Online

5:00pm EDT

What Makes A Good SRE - Findings From The SRE Survey
Site Reliability Engineering is a relatively new discipline when it comes to careers, having only been in existence for about 15 years. While 15 years may seem like an eternity, the SRE role can be considered to be in its infancy. This leads to challenges defining the role and understanding exactly what it is. Browsing through job descriptions or speaking with other SREs, you will see many different job descriptions and responsibilities.

Earlier this year Catchpoint conducted a survey of 416 professionals with the title or responsibility of an SRE in an attempt to create a real-world profile of the SRE and the organizations where they work. This session will review the findings of the survey. Learn what the top technical and non-technical skills are, and whether they vary by industry or size of the company. What surprised us in the findings? What additional questions arose while analyzing the results? And how can the survey results be used to help organizations building out an SRE team?

Dawn Parzych

Director, Catchpoint
Dawn is a Director at Catchpoint where she uses her storytelling prowess to write and speak about the intersection of technology and psychology. She makes technical information accessible avoiding buzzwords and jargon whenever possible. Dawn has spoken at DevOpsDays, Velocity, Interop...

Wednesday October 17, 2018 5:00pm - 5:30pm EDT
Live, Online

5:30pm EDT

What The NTSB Teaches Us About Incident Management & Postmortems
The NTSB are among the world's best-known incident management experts. They fly around the world to investigate accidents and write detailed recommendations. So what can we learn from them? This talk will deep-dive on the process the NTSB use and how it can improve your incident management process.

Michael Kehoe

Staff SRE, LinkedIn
Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles, and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automatio...

Wednesday October 17, 2018 5:30pm - 6:00pm EDT
Live, Online

7:30pm EDT

SLOs and Error Budgets
100% is almost never the right reliability target for a service, and service level agreements (SLAs) aren't the right tool for SREs to manage a service. These two (apparent) heresies are fundamental to how Google SRE thinks about running large-scale distributed computing services: we set service level objectives (SLOs) expressing how reliable a service needs to be and manage our service to maximize product development and feature velocity within the agreed "error budget."

We'll discuss the differences between indicators, objectives, and agreements; error budgets in practice; and how this brings product managers, product developers, and SREs together in a spirit of peaceful coexistence and cooperation.

Chris Jones

Senior Privacy Engineer, Google
Chris Jones is a Privacy Engineer. From 2007 to 2017, he was a Site Reliability Engineer at Google. Among other projects, he was tech lead and an editor for Google's book, "Site Reliability Engineering" (O'Reilly, 2016).

Wednesday October 17, 2018 7:30pm - 8:00pm EDT
Live, Online

8:00pm EDT

Who Wants PIE? A Series Of Post Incident Experiments
You'll likely have post-mortem or post incident review (PIR) processes in place at your company. I'll share a series of experiments we ran across our PIRs where we sought to: generate empathy for the customer's experience; reduce post incident review toil; and improve PIR quality.

Paul Greig

SRE Team Lead, Atlassian
Paul began his career in finance and operations in the late 90’s. He then moved into technology and specifically trading technology and reliability. Next, he flew into the world of hedge funds, ensuring the reliability of high-frequency ultra low latency trading engines and market...

Wednesday October 17, 2018 8:00pm - 8:30pm EDT
Live, Online

8:30pm EDT

Better On-Call The SRE Way
On-call work is often a significant negative influence on work/life balance and job satisfaction. Google's SRE organization has developed principles and practices to make their on-call work both easier for the engineering team and more effective in fixing problems. This talk shares the lessons we have learned along the way and offers you and your organization many ways to make on-call better.

Christopher Davis

SRE, Google
Christopher Davis has been part of Google SRE since 2011. His previous experience includes the Electronic Frontier Foundation (as their first full-time sysadmin), the Whitehead/MIT Center for Genome Research (where he built computer infrastructure for the Human Genome Project), and...

Wednesday October 17, 2018 8:30pm - 9:00pm EDT
Live, Online

9:00pm EDT

SRE101: Lessons from a Parallel Universe
Just within the last fifteen years, we have seen at least two separate communities evolve from the generic idea of Systems Administration/Operations. The first, DevOps, grew up very much in public; the second, SRE, germinated within the halls of “special” companies like Google and Facebook and is now starting to gain significant visibility and traction in the wider world.

Join me for an introduction to SRE: what it is, why it matters, how it relates to other operations practices like DevOps, and if/how you can get started with it in your organization.

David Blank-Edelman

Senior Cloud Ops Advocate, Microsoft
David has over thirty years of experience in the systems administration/DevOps/SRE field in large multiplatform environments. He is the curator/editor of the O’Reilly Book Seeking SRE: Conversations on Running Production Systems at Scale and author of the O’Reilly Otter Book...

Wednesday October 17, 2018 9:00pm - 9:30pm EDT
Live, Online

9:30pm EDT

War & Peace & IT
Businesses and IT organizations find themselves today in an environment of uncertainty, complexity, and rapid change – kind of like Napoleon at the battle of Borodino. The more complexity and uncertainty in our situation, the less appropriate are our traditional ways of making oversight and governance decisions for IT projects. DevOps and the cloud can help us avoid Napoleon’s mistakes and make sure that we don’t get exiled to a remote island.

Mark Schwartz

Enterprise Strategist, Amazon Web Services
Mark Schwartz joined AWS as an Enterprise Strategist and Evangelist in July 2017. In this role, Mark works with enterprise technology executives to share experiences and strategies for how the cloud can help them increase speed and agility while devoting more of their resources to...

Wednesday October 17, 2018 9:30pm - 10:00pm EDT
Live, Online