TRACK 2: Mysteries Solved
Thursday, June 10 2021 9:00 am - 9:30 am PT, Virtual

How Tracing Uncovers Half-truths in Slack’s CI Infrastructure

Traditional monitoring tools like logs and metrics were necessary but not sufficient to debug how and where systems failed in CI, which relies on multiple, interconnected critical systems (e.g. GHE, Checkpoint, Cypress).

In this talk, Frank Chen shares how traces gave us a critical and compounding capability to better understand where, when, how, and why faults occur for our customers in CI. He shows how shared tooling for high-dimensionality event traces (using SlackTrace and SpanEvents) can significantly increase our velocity in diagnosing code in flight and debugging complex system interactions. The talk moves from the early incidents that motivated further investment across Slack’s internal tooling teams to the resulting gains in performance and resiliency throughout our infrastructure.