Tax / Fintech
Tax & compliance platform resilience
Four weeks to a sustainable reliability model.
4 weeks
StationOps helped SimpleCGT improve platform reliability, observability, and cost efficiency so the team could scale with confidence.
Engagement
4 weeks, reliability program
Team
SRE lead, Platform engineer, Observability specialist
Focus areas
Reliability, observability, cost optimisation
50%
Fewer P1/P2 incidents
3×
Faster root-cause isolation
22%
Infra cost saved
99.9%
Uptime target met
Situation
SimpleCGT is a tax and compliance platform helping users calculate capital gains obligations. Growing traffic and regulatory complexity were outpacing the team’s ability to maintain uptime, diagnose issues, and control infrastructure spend.
Reliability
No SLOs, alerting based on raw thresholds, frequent P1/P2 incidents during peak filing periods.
Observability
Basic CloudWatch metrics only — no distributed tracing, limited structured logging, slow root-cause isolation.
Incidents
Ad-hoc response with no severity model, unclear escalation paths, and no post-incident learning process.
Cost
Over-provisioned compute and database instances, no per-team spend visibility, costs growing faster than traffic.
What we did
01 Unified observability & SLOs
Deployed distributed tracing and structured logging across core services, then defined service-level objectives tied to real user-facing outcomes — calculation accuracy, API response time, and submission success rate. Rebuilt alerting around error budgets so the team gets a clear signal when user experience degrades, not when infrastructure metrics fluctuate.
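For readers who want the mechanics, the sketch below shows the kind of error-budget arithmetic that can sit behind SLO alerting. The targets, window sizes, and thresholds are illustrative assumptions for this example, not SimpleCGT's production configuration.

```python
# Illustrative error-budget math for SLO-based alerting.
# Targets and thresholds below are assumptions for this sketch,
# not SimpleCGT's production configuration.

def error_budget_remaining(slo_target: float, total: int, failures: int) -> float:
    """Fraction of the error budget left in the current SLO window (0.0 to 1.0)."""
    allowed_failures = (1 - slo_target) * total  # budget expressed in requests
    if allowed_failures == 0:
        return 1.0 if failures == 0 else 0.0
    return max(0.0, 1 - failures / allowed_failures)

def burn_rate(slo_target: float, total: int, failures: int) -> float:
    """How fast the budget is burning: 1.0 means on track to spend it exactly."""
    observed_error_rate = failures / total if total else 0.0
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate if allowed_error_rate else float("inf")

if __name__ == "__main__":
    # Example: a 99.9% submission-success SLO over a 30-day window.
    print(error_budget_remaining(0.999, total=1_000_000, failures=400))  # 0.6
    # A burn rate sustained above ~14 over an hour is a commonly cited
    # fast-burn paging threshold for a 30-day SLO window.
    print(burn_rate(0.999, total=50_000, failures=120))  # 2.4
```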
02 Incident response model
Introduced severity criteria aligned to business impact, on-call escalation paths, and a post-incident review framework with corrective-action tracking. Reduced mean time to recovery by giving responders structured runbooks and clear ownership from detection through resolution.
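To make the idea concrete, here is a minimal sketch of a business-impact severity model. The levels, criteria, and response targets are hypothetical examples, not the actual playbook handed to SimpleCGT's on-call teams.

```python
# Hypothetical severity model keyed to business impact. Levels, criteria,
# and response targets are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    level: str
    criteria: str
    escalation: str
    response_target_minutes: int

SEVERITY_MODEL = [
    Severity("P1", "Submissions failing or calculations wrong for most users",
             "Page on-call engineer and notify incident lead immediately", 15),
    Severity("P2", "Degraded performance or errors for a subset of users",
             "Page on-call engineer", 30),
    Severity("P3", "Minor defect or internal-only impact with a workaround",
             "Ticket for the owning team, triaged next business day", 480),
]

def classify(user_facing: bool, users_affected_pct: float) -> Severity:
    """Toy rule mapping observed impact onto a severity level."""
    if user_facing and users_affected_pct >= 50:
        return SEVERITY_MODEL[0]
    if user_facing:
        return SEVERITY_MODEL[1]
    return SEVERITY_MODEL[2]
```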
03 Cost & capacity optimisation
Right-sized compute and database instances across environments, removed idle resources, and introduced per-team cost dashboards with monthly review cadences — cutting infrastructure spend while maintaining headroom for peak filing periods.
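As an illustration of how per-team spend visibility can be built on AWS, the sketch below queries Cost Explorer grouped by a cost-allocation tag. The tag key, date range, and output format are assumptions for the example; the dashboards delivered in this engagement are not shown.

```python
# A minimal sketch of per-team spend reporting with AWS Cost Explorer,
# assuming resources carry a "team" cost-allocation tag. Tag key and dates
# are illustrative; pagination (NextPageToken) is omitted for brevity.
import boto3

ce = boto3.client("ce")

def monthly_spend_by_team(start: str, end: str) -> dict[str, float]:
    """Return unblended cost per 'team' tag value for the given ISO date range."""
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )
    spend: dict[str, float] = {}
    for period in resp["ResultsByTime"]:
        for group in period["Groups"]:
            team = group["Keys"][0]  # returned as "team$<value>", e.g. "team$calc-api"
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            spend[team] = spend.get(team, 0.0) + amount
    return spend

if __name__ == "__main__":
    for team, cost in sorted(monthly_spend_by_team("2024-01-01", "2024-02-01").items()):
        print(f"{team}: {cost:,.2f}")
```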
04 Platform governance & enablement
Embedded an operating cadence — weekly SLO reviews, monthly capacity planning, and quarterly reliability retrospectives — so the team could sustain the model without ongoing external support.
SimpleCGT — delivery summary
In four weeks we unified tracing and logging with SLOs and error-budget alerting, introduced an incident severity and runbook model, right-sized costs with per-team dashboards, and embedded governance (weekly SLO reviews, monthly capacity planning, quarterly retros) — plus hands-on enablement.
The sections below cover the deliverables, timeline, impact and ROI, and a customer quote.
Deliverables
- SLO framework — service-level objectives for calculation, API, and submission paths with error-budget tracking and alerting.
- Observability platform — distributed tracing, structured logging, and application-level metrics across all core services.
- Incident operations playbook — severity model, escalation paths, runbooks, and post-incident review framework with corrective-action tracking.
- Cost & capacity dashboards — per-team infrastructure spend visibility with right-sizing recommendations and monthly review cadence.
- Platform governance model — weekly SLO reviews, monthly capacity planning, and quarterly reliability retrospectives embedded into the team’s workflow.
- Internal enablement workshops — hands-on sessions covering observability, incident response, and cost controls so the team could operate independently.
Technologies & approach (at a glance)
Amazon CloudWatch as baseline; distributed tracing and structured logging; SLOs on calculation, API, and submission paths; severity-based incident response; runbooks and post-incident reviews; compute and database right-sizing; per-team cost visibility — with a governance cadence the team can sustain.
Timeline
Four weeks from discovery to full operating-model handoff, with SLO alerting and observability improvements live by week two.
Week 1 Discovery & mapping
Service dependency mapping, failure-mode inventory, SLO target definition, and observability gap analysis.
Week 2 Instrumentation & SLOs
Tracing and structured logging deployed, SLO alerting live, incident severity model and runbooks rolled out to on-call teams.
Week 3 Cost optimisation & hardening
Compute and database right-sizing, idle-resource removal, per-team cost dashboards deployed, and incident playbooks tested.
Week 4 Govern & embed
Operating cadence embedded, enablement workshops delivered, and full knowledge transfer to the SimpleCGT team.
Impact & ROI
SimpleCGT hit its 99.9% uptime target through peak filing season — the period where downtime directly costs revenue. The 22% infrastructure cost reduction and 50% drop in P1/P2 incidents started paying back the engagement within weeks, and the governance model means the gains compound rather than erode.
Hiring an SRE team, deploying observability, and building an incident response model internally would have taken months — and the platform would have been unprotected through at least one more filing season.
| Dimension | Typical internal, manual build | With StationOps engagement |
|---|---|---|
| Timeline | 3–5 months to hire SRE capacity, deploy tracing, define SLOs, and build incident response. | 4 weeks — SLO alerting live by week two, full model handed off by week four. |
| Engineering effort | 3–5 person-months (≈ 480–800 hours) of senior engineers diverted from product. | Engineering team stayed on product — reliability work ran in parallel. |
| Fully-loaded cost | Roughly €40k–€90k in engineering time, plus risk of another unprotected filing season. | Engagement paid for itself through 22% infra savings and avoided incident costs. |
Figures shown are typical ranges for comparable work and will vary by baseline maturity, constraints, and team size.
- Incident load halved — 50% fewer P1/P2 incidents, with the remaining issues resolved faster through structured runbooks and clear escalation paths.
- Root-cause isolation accelerated — 3× faster diagnosis via distributed tracing and correlated logs: from hours of guesswork to minutes of signal.
- Infrastructure spend reduced — 22% cost saving through right-sizing and idle-resource removal, with per-team dashboards preventing drift.
- 99.9% uptime target met — SLO-driven reliability model sustained through peak filing season with error budgets and proactive capacity planning.
“StationOps helped us scale reliability and observability without slowing down. We now diagnose issues faster and plan capacity with confidence.”
Scaling under regulatory pressure?
We map the real risks in your platform and build a reliability and cost model your team can own — no pitch deck required.
Related case studies
Assiduous
How StationOps delivered a six-account Control Tower Landing Zone, SLO-based operations, and ongoing managed AWS for an AI-enabled corporate finance platform — in weeks instead of months.
Auth.inc
How StationOps delivered a production multi-region AWS adtech platform — ECS, EKS, Aurora, MSK, CloudFormation, and CD from Azure Pipelines — in twelve weeks.
DigiPro
How StationOps helped DigiPro cut incidents, speed up safe releases, and reclaim engineering time — with SLOs, observability, CI/CD guardrails, and cost visibility in twelve weeks.
Flexiwage
How StationOps improved payroll pipeline availability, automated compliance evidence, cut MTTR and cloud spend, and doubled safe deploy frequency for Flexiwage in fourteen weeks.




