Tax / Fintech
Tax & compliance platform resilience
Four weeks to a sustainable reliability model.
4 weeks
StationOps helped SimpleCGT improve platform reliability, observability, and cost efficiency so the team could scale with confidence.
Engagement
4 weeks, reliability program
Team
SRE lead, Platform engineer, Observability specialist
Focus areas
Reliability, observability, cost optimisation
50%
Fewer P1/P2 incidents
3×
Faster root-cause isolation
22%
Infra cost saved
99.9%
Uptime target met
Situation
SimpleCGT is a tax and compliance platform helping users calculate capital gains obligations. Growing traffic and regulatory complexity were outpacing the team’s ability to maintain uptime, diagnose issues, and control infrastructure spend.
Reliability
No SLOs, alerting based on raw thresholds, frequent P1/P2 incidents during peak filing periods.
Observability
Basic CloudWatch metrics only — no distributed tracing, limited structured logging, slow root-cause isolation.
Incidents
Ad-hoc response with no severity model, unclear escalation paths, and no post-incident learning process.
Cost
Over-provisioned compute and database instances, no per-team spend visibility, costs growing faster than traffic.
What we did
01 Unified observability & SLOs
Deployed distributed tracing and structured logging across core services, then defined service-level objectives tied to real user-facing outcomes — calculation accuracy, API response time, and submission success rate. Rebuilt alerting around error budgets so the team gets a clear signal when user experience degrades, not when infrastructure metrics fluctuate.
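For readers who want the mechanics, the sketch below shows the kind of error-budget arithmetic that can sit behind SLO alerting. The targets, window sizes, and thresholds are illustrative assumptions for this example, not SimpleCGT's production configuration.

```python
# Illustrative error-budget math for SLO-based alerting.
# Targets and thresholds below are assumptions for this sketch,
# not SimpleCGT's production configuration.

def error_budget_remaining(slo_target: float, total: int, failures: int) -> float:
    """Fraction of the error budget left in the current SLO window (0.0 to 1.0)."""
    allowed_failures = (1 - slo_target) * total  # budget expressed in requests
    if allowed_failures == 0:
        return 1.0 if failures == 0 else 0.0
    return max(0.0, 1 - failures / allowed_failures)

def burn_rate(slo_target: float, total: int, failures: int) -> float:
    """How fast the budget is burning: 1.0 means on track to spend it exactly."""
    observed_error_rate = failures / total if total else 0.0
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate if allowed_error_rate else float("inf")

if __name__ == "__main__":
    # Example: a 99.9% submission-success SLO over a 30-day window.
    print(error_budget_remaining(0.999, total=1_000_000, failures=400))  # 0.6
    # A burn rate sustained above ~14 over an hour is a commonly cited
    # fast-burn paging threshold for a 30-day SLO window.
    print(burn_rate(0.999, total=50_000, failures=120))  # 2.4
```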
02 Incident response model
Introduced severity criteria aligned to business impact, on-call escalation paths, and a post-incident review framework with corrective-action tracking. Reduced mean time to recovery by giving responders structured runbooks and clear ownership from detection through resolution.
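To make the idea concrete, here is a minimal sketch of a business-impact severity model. The levels, criteria, and response targets are hypothetical examples, not the actual playbook handed to SimpleCGT's on-call teams.

```python
# Hypothetical severity model keyed to business impact. Levels, criteria,
# and response targets are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    level: str
    criteria: str
    escalation: str
    response_target_minutes: int

SEVERITY_MODEL = [
    Severity("P1", "Submissions failing or calculations wrong for most users",
             "Page on-call engineer and notify incident lead immediately", 15),
    Severity("P2", "Degraded performance or errors for a subset of users",
             "Page on-call engineer", 30),
    Severity("P3", "Minor defect or internal-only impact with a workaround",
             "Ticket for the owning team, triaged next business day", 480),
]

def classify(user_facing: bool, users_affected_pct: float) -> Severity:
    """Toy rule mapping observed impact onto a severity level."""
    if user_facing and users_affected_pct >= 50:
        return SEVERITY_MODEL[0]
    if user_facing:
        return SEVERITY_MODEL[1]
    return SEVERITY_MODEL[2]
```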
03 Cost & capacity optimisation
Right-sized compute and database instances across environments, removed idle resources, and introduced per-team cost dashboards with monthly review cadences — cutting infrastructure spend while maintaining headroom for peak filing periods.
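As an illustration of how per-team spend visibility can be built on AWS, the sketch below queries Cost Explorer grouped by a cost-allocation tag. The tag key, date range, and output format are assumptions for the example; the dashboards delivered in this engagement are not shown.

```python
# A minimal sketch of per-team spend reporting with AWS Cost Explorer,
# assuming resources carry a "team" cost-allocation tag. Tag key and dates
# are illustrative; pagination (NextPageToken) is omitted for brevity.
import boto3

ce = boto3.client("ce")

def monthly_spend_by_team(start: str, end: str) -> dict[str, float]:
    """Return unblended cost per 'team' tag value for the given ISO date range."""
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )
    spend: dict[str, float] = {}
    for period in resp["ResultsByTime"]:
        for group in period["Groups"]:
            team = group["Keys"][0]  # returned as "team$<value>", e.g. "team$calc-api"
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            spend[team] = spend.get(team, 0.0) + amount
    return spend

if __name__ == "__main__":
    for team, cost in sorted(monthly_spend_by_team("2024-01-01", "2024-02-01").items()):
        print(f"{team}: {cost:,.2f}")
```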
04 Platform governance & enablement
Embedded an operating cadence — weekly SLO reviews, monthly capacity planning, and quarterly reliability retrospectives — so the team could sustain the model without ongoing external support.
SimpleCGT — delivery summary
In four weeks we unified tracing and logging with SLOs and error-budget alerting, introduced an incident severity and runbook model, right-sized costs with per-team dashboards, and embedded governance (weekly SLO reviews, monthly capacity planning, quarterly retros) — plus hands-on enablement.
The sections below cover the deliverables, timeline, impact and ROI, and a customer quote.
Deliverables
- SLO framework — service-level objectives for calculation, API, and submission paths with error-budget tracking and alerting.
- Observability platform — distributed tracing, structured logging, and application-level metrics across all core services.
- Incident operations playbook — severity model, escalation paths, runbooks, and post-incident review framework with corrective-action tracking.
- Cost & capacity dashboards — per-team infrastructure spend visibility with right-sizing recommendations and monthly review cadence.
- Platform governance model — weekly SLO reviews, monthly capacity planning, and quarterly reliability retrospectives embedded into the team’s workflow.
- Internal enablement workshops — hands-on sessions covering observability, incident response, and cost controls so the team could operate independently.
Technologies & approach (at a glance)
Amazon CloudWatch as baseline; distributed tracing and structured logging; SLOs on calculation, API, and submission paths; severity-based incident response; runbooks and post-incident reviews; compute and database right-sizing; per-team cost visibility — with a governance cadence the team can sustain.
Timeline
Four weeks from discovery to full operating-model handoff, with SLO alerting and observability improvements live by week two.
Week 1 Discovery & mapping
Service dependency mapping, failure-mode inventory, SLO target definition, and observability gap analysis.
Week 2 Instrumentation & SLOs
Tracing and structured logging deployed, SLO alerting live, incident severity model and runbooks rolled out to on-call teams.
Week 3 Cost optimisation & hardening
Compute and database right-sizing, idle-resource removal, per-team cost dashboards deployed, and incident playbooks tested.
Week 4 Govern & embed
Operating cadence embedded, enablement workshops delivered, and full knowledge transfer to the SimpleCGT team.
Impact & ROI
SimpleCGT hit its 99.9% uptime target through peak filing season — the period where downtime directly costs revenue. The 22% infrastructure cost reduction and 50% drop in P1/P2 incidents started paying back the engagement within weeks, and the governance model means the gains compound rather than erode.
Hiring an SRE team, deploying observability, and building an incident response model internally would have taken months — and the platform would have been unprotected through at least one more filing season.
| Dimension | Typical internal, manual build | With StationOps engagement |
|---|---|---|
| Timeline | 3–5 months to hire SRE capacity, deploy tracing, define SLOs, and build incident response. | 4 weeks — SLO alerting live by week two, full model handed off by week four. |
| Engineering effort | 3–5 person-months (≈ 480–800 hours) of senior engineers diverted from product. | Engineering team stayed on product — reliability work ran in parallel. |
| Fully-loaded cost | Roughly €40k–€90k in engineering time, plus risk of another unprotected filing season. | Engagement paid for itself through 22% infra savings and avoided incident costs. |
Figures shown are typical ranges for comparable work and will vary by baseline maturity, constraints, and team size.
- Incident load halved — 50% fewer P1/P2 incidents, with the remaining issues resolved faster through structured runbooks and clear escalation paths.
- Root-cause isolation accelerated — 3× faster diagnosis via distributed tracing and correlated logs: from hours of guesswork to minutes of signal.
- Infrastructure spend reduced — 22% cost saving through right-sizing and idle-resource removal, with per-team dashboards preventing drift.
- 99.9% uptime target met — SLO-driven reliability model sustained through peak filing season with error budgets and proactive capacity planning.
“StationOps helped us scale reliability and observability without slowing down. We now diagnose issues faster and plan capacity with confidence.”
Scaling under regulatory pressure?
We map the real risks in your platform and build a reliability and cost model your team can own — no pitch deck required.
Related case studies
Assiduous
How StationOps delivered a six-account Control Tower Landing Zone, SLO-based operations, and ongoing managed AWS for an AI-enabled corporate finance platform — in weeks instead of months.
Auth.inc
How StationOps delivered a production multi-region AWS adtech platform — ECS, EKS, Aurora, MSK, CloudFormation, and CD from Azure Pipelines — in twelve weeks.
DigiPro
How StationOps helped DigiPro cut incidents, speed up safe releases, and reclaim engineering time — with SLOs, observability, CI/CD guardrails, and cost visibility in twelve weeks.
Flexiwage
How StationOps improved payroll pipeline availability, automated compliance evidence, cut MTTR and cloud spend, and doubled safe deploy frequency for Flexiwage in fourteen weeks.




