Took over a stalled platform. 99.99% uptime, 22 features shipped, technical debt halved — in 6 months.
Northwind builds field-service-management software for HVAC, plumbing, and electrical contractors (1,400 active customers). Their lead engineer left abruptly in summer 2025, and the platform stalled — incidents up 4×, deploys down 60%, customer-success tickets up 2.7×. They needed an agency to take ownership of operations AND ship the product roadmap, not just fix bugs. We took over in 6 weeks, cleared the incident backlog in 90 days, shipped 22 customer-facing features in 6 months, and halved their technical debt while maintaining 99.99% uptime.
Northwind came to us with…
When their lead engineer left, Northwind had a 14-page Notion handover doc, two junior engineers who'd never owned production, and 1,400 customers who didn't know any of this. Within 6 weeks: P1 incidents went from 0.3/month to 1.2/month, dependency upgrades stopped, and the customer-success team was fielding 4× their usual ticket volume. The CTO needed to either rebuild the engineering team (12-month process) or hand off operations to a senior agency.
Inherited a stalled platform
47 known bugs in the backlog (some open for 14 months). 8 dependency-vulnerability alerts. Test coverage at 22%. CI/CD broken. Deployment wiki was 18 months stale.
Two junior engineers, no senior
Both juniors were strong but had never owned production incidents. We needed to handle on-call AND mentor them up to mid-level competence.
1,400 customers, zero notice
Couldn't take downtime windows. Migration had to happen in-flight without customer-visible disruption.
Roadmap commitments
8 features had been promised to customers with committed delivery dates. Slipping them would erode customer trust further. We had to ship AND clean up the platform simultaneously.
The approach
Discovery + emergency stabilization
Two staff engineers + one SRE shadowed every part of the platform for 2 weeks. Wrote 47 pages of runbooks. Set up OpenTelemetry + PagerDuty. Cleared the P1 backlog (4 critical bugs, all production-blocking) in week 5. Took over on-call in week 6.
- 47-page runbook library
- OpenTelemetry instrumentation across 12 services
- PagerDuty rotation + escalation
- P1 backlog cleared (4 bugs)
- Dependency upgrade plan (8 vulns)
Test coverage + CI/CD recovery
Got CI/CD back to green. Wrote tests for the 12 highest-risk modules. Test coverage went 22% → 64%. Deploy frequency went from 0.4/week to 11/week. Mean time to deploy a 1-line change went from 3 days to 22 minutes.
- Test coverage 22% → 64%
- CI/CD pipeline (GitHub Actions)
- Canary deployment infrastructure
- Feature flags (LaunchDarkly)
- Automated dependency upgrades (Renovate)
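A canary rollout needs a rule for when a new release is promoted or rolled back. The sketch below shows one common way to express that rule by comparing the canary's error rate against the stable fleet. It is illustrative only: the type names, the `canaryVerdict` function, and the default thresholds (500 requests, 0.5 percentage points of extra error rate) are assumptions, not the values used on the Northwind platform.

```typescript
// Request and error counts observed over the same time window.
interface WindowStats {
  requests: number;
  errors: number;
}

type Verdict = "promote" | "rollback" | "wait";

// Hypothetical promotion gate: thresholds are placeholders, not production values.
function canaryVerdict(
  baseline: WindowStats,
  canary: WindowStats,
  minRequests = 500,          // assumed minimum sample before judging the canary
  maxErrorRateDelta = 0.005,  // assumed tolerance: canary may be at most 0.5 pp worse
): Verdict {
  // Too little traffic to judge: keep the canary at its current share.
  if (canary.requests === 0 || canary.requests < minRequests) {
    return "wait";
  }
  const baselineRate =
    baseline.requests === 0 ? 0 : baseline.errors / baseline.requests;
  const canaryRate = canary.errors / canary.requests;
  // Roll back only when the canary is worse than baseline by more than the tolerance.
  return canaryRate - baselineRate > maxErrorRateDelta ? "rollback" : "promote";
}

console.log(
  canaryVerdict({ requests: 10000, errors: 20 }, { requests: 1000, errors: 3 }),
);
```

Comparing against the live baseline rather than a fixed error ceiling means a platform-wide blip does not get blamed on the canary.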
Roadmap + customer-facing features
Shipped 22 customer-facing features against the original roadmap commitments. Plus 6 unplanned features driven by data from the new observability stack (we found bugs that revealed unmet customer needs). Customer-success ticket volume halved.
- 22 customer-facing features shipped
- 6 unplanned features (from observability data)
- Mobile app v3 (React Native)
- API platform v2 (rate-limited, versioned)
- Performance audit + optimization
Steady-state operations + roadmap velocity
Now in steady state: 11 deploys/week, 99.99% uptime, customer-success ticket volume at 60% of the pre-stall level. Both junior engineers have been promoted to mid-level and own discrete services. We're 8 weeks into the year-2 roadmap with zero incidents.
- Steady-state operations playbook
- Junior engineer mentorship track
- Quarterly SLO + error-budget review
- Customer-facing status page
- Year-2 roadmap (scoped + estimated)
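For readers new to error budgets, the arithmetic behind a quarterly SLO review is short. The sketch below converts an availability target into allowed downtime and reports how much of that budget is left. The 99.99% target comes from this case study. The 90-day window, the function names, and the sample downtime figure are assumptions for illustration.

```typescript
// Minutes of downtime an availability target allows over a window of days.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  if (sloPercent < 0 || sloPercent > 100) {
    throw new RangeError("sloPercent must be between 0 and 100");
  }
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Fraction of the budget still unspent; a negative value means the SLO was missed.
function budgetRemaining(
  sloPercent: number,
  windowDays: number,
  downtimeMinutes: number,
): number {
  const budget = errorBudgetMinutes(sloPercent, windowDays);
  if (budget === 0) {
    // A 100% target leaves no budget: any downtime at all is a miss.
    return downtimeMinutes === 0 ? 1 : Number.NEGATIVE_INFINITY;
  }
  return (budget - downtimeMinutes) / budget;
}

// Assumed review window: one 90-day quarter. The 4 minutes is a made-up sample.
const quarterBudget = errorBudgetMinutes(99.99, 90);
console.log(`Budget: ${quarterBudget.toFixed(2)} min`);
console.log(`Remaining: ${(budgetRemaining(99.99, 90, 4) * 100).toFixed(0)}%`);
```

At 99.99% over a 90-day quarter the budget is roughly 13 minutes, so a single slow rollback can consume most of it. That is one reason to pair the SLO with canary deploys and feature flags.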
Before / after — every metric
Numbers verifiable with the client. Audit trail available on request.
| Metric | Before | After | Change |
|---|---|---|---|
| P1 incidents (monthly) | 1.2 | 0 | −100% |
| Mean time-to-resolve P1 | 8.4 hours | 1.1 hours | −87% |
| Test coverage | 22% | 64% | +191% |
| Deploy frequency (per week) | 0.4 | 11 | +2,650% |
| Mean time-to-deploy 1-line change | 3 days | 22 min | −99.5% |
| Open dependency vulnerabilities | 8 | 0 | −100% |
| Customer-success ticket volume | 240/wk | 118/wk | −51% |
| Features shipped (6 months) | — | 22 | — |
| Technical-debt score (SonarQube) | 47.2 | 21.8 | −54% |
| Uptime (6 months) | 99.2% | 99.99% | +0.79pp |
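The Change column can be reproduced from the Before and After values. A minimal sketch, assuming relative change is computed as (after − before) / before and uptime is reported as a percentage-point difference; the helper names are illustrative and not part of any client tooling.

```typescript
// Relative change from "before" to "after", as a percentage.
// Returns null when "before" is zero, because the ratio is undefined.
function percentChange(before: number, after: number): number | null {
  if (before === 0) return null;
  return ((after - before) / before) * 100;
}

// Absolute difference between two percentages, in percentage points.
function pointChange(beforePct: number, afterPct: number): number {
  return afterPct - beforePct;
}

// Values from the table above, with units normalized (3 days = 4,320 minutes).
const rows: Array<[string, number, number]> = [
  ["Mean time-to-resolve P1 (hours)", 8.4, 1.1],
  ["Test coverage (%)", 22, 64],
  ["Deploy frequency (per week)", 0.4, 11],
  ["Mean time-to-deploy (minutes)", 3 * 24 * 60, 22],
  ["Customer-success tickets (per week)", 240, 118],
  ["Technical-debt score", 47.2, 21.8],
];

for (const [metric, before, after] of rows) {
  const change = percentChange(before, after);
  console.log(`${metric}: ${change === null ? "n/a" : change.toFixed(1) + "%"}`);
}
console.log(`Uptime: ${pointChange(99.2, 99.99).toFixed(2)} pp`);
```

Uptime is handled separately because a relative change between two percentages close to 100 understates the improvement; percentage points are the clearer unit there.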
Stack, team, and tools
- Node.js + TypeScript
- Postgres
- Redis
- React Native (mobile)
- Next.js (web)
- AWS (ECS + RDS)
- OpenTelemetry
- 1 engineering manager (lead)
- 2 staff engineers
- 1 SRE / on-call lead
- 1 mobile engineer (RN)
- 1 QA engineer
- GitHub Actions
- PagerDuty
- Datadog
- SonarQube
- LaunchDarkly
- Renovate
- Sentry
“When our lead engineer left, I had three options: rebuild the team (12 months), accept slower delivery (board wouldn't), or find a senior agency to operate the platform. SERP Axis was option three. Six months later we have 99.99% uptime, 22 features shipped, and our two junior engineers have been promoted to mid-level. They didn't just operate it; they made our team better.”
“I went from 4× ticket volume back to under our pre-stall baseline. The customer-success team noticed the difference within a month. The retention math alone paid for the engagement.”
Year-2 plan: AI-assisted scheduling for field technicians (RAG over historical work-order data), plus a Power BI dashboard for ops + customer-success. Both scoped, kicking off month 9.