Technical Design Document

We want to enable decentralised, fast decision-making in engineering organisations, but we also want to make sure we're making high-quality decisions. Technical Design Documents (TDDs) support this by making sure we articulate our design approach, architecture and implementation details clearly. They also make sure we think through the sticky bits like migrations, testability and reliability. In turn, this process lets us collaborate on, refine and give feedback on design ideas. TDDs can also serve as snapshots of our thinking at pivotal moments, improving the long-term maintainability of the system. Below I've outlined a lightweight TDD template you can use as a baseline to get started. At the bottom I've added a quick FAQ that covers how the process around TDDs can work within an engineering organisation.

TDD Template

How to use this document (remove once read)

Use it as a checklist to make sure you cover all bases, and remove anything you don’t need.

Work fast and light. A good TDD is <1000 words and it includes one diagram (that’s two pages, not including the 525 words for the template).

Write it in the open (no private documents!). Once it’s in Ready state, share it with everyone.

Status: Draft, Ready, Accepted, Rejected, Superseded
Created: @June 1, 2023
Owners: Tag using @ to notify.
Reviewers: Tag using @ to notify.

TL;DR

Too Long; Didn’t Read. Be answer-first. If someone only reads this far, what do you want them to know?

Goals and Non Goals

What problems are you solving? What are you not trying to solve? What’s in scope and what’s out of scope? Prevents scope creep. Bullets only.

Context

How do things work now? What systems exist in this area already? Is there any important terminology that you need to define? May involve research.

Design

What are you going to build or change? What are the key components? How do they fit together? How do they communicate (sync/async)? Backend/frontend? What data storage systems will you use? What related systems will change? A high-level overview first, then get into detail. Miro diagrams are very useful here.

Migration

Does the design introduce non-backwards-compatible changes to an existing system? How will that system and its users migrate to the new design?

Failure

How could this system go wrong? How will you mitigate the failure? How will we know when the system has gone wrong? What will we alert on in OpsGenie? Will we use Sentry/Datadog? Are there any specific scalability concerns (e.g. dataset growth, elasticity of load)?

Alternatives

What alternative designs did you consider? Why didn’t you go for them? Use this to avoid bike shedding further down the line. Bullets.

Plan

How will you actually make this design happen? What order will you build things in? Can you choose a valuable slice to deliver first? Use bullets and update later to link any Jiras.

Testing

How will the work be tested? Can all tests be automated? Are any manual test steps required? If there is state, how is the test data generated?

Metrics

Does the design make it easy (or even possible) to measure success? Consider this from a product perspective first, then consider operational metrics that the team will use to monitor the service.

Data

Are we capturing any new data? Have we checked with the data team whether this should be forwarded to Redshift? How do we make sure we’re contributing data that is fresh and consistent? What’s the volume likely to be?

Security, Privacy and Risk

Is there any PII involved? Is the system exposed to the internet? Are we working with new third parties? Are we introducing new technologies? How are secrets going to be handled?

Release

How will you launch the change? How will you measure the results? Will you use an experiment? Feature toggle? How will you observe it in production? What are the implications of the launch on other dependency/dependent systems? What’s the first iteration for the system under design?

Further Reading

Any further reading or prior art goes here. E.g.: Design docs at Google (https://www.industrialempathy.com/posts/design-docs-at-google/)

# TDD Template

### How to use this document (remove once read)
Use it as a checklist to make sure you cover all bases, and remove anything you don’t need.

**Work fast and light**. A good TDD is <1000 words and it includes one diagram (that’s two pages, not including the 525 words for the template).

Write it in the open (no private documents!). Once it’s in Ready state, share it with everyone.

| Status | Draft, Ready, Accepted, Rejected, Superseded |
| --- | --- |
| Created | @June 1, 2023  |
| Owners | Tag using @ to notify. |
| Reviewers | Tag using @ to notify. |

## TL;DR

*Too Long; Didn’t Read. Be answer-first.* *If someone only reads this far, what do you want them to know?*

## Goals and Non Goals

*What problems are you solving? What are you not trying to solve? What’s in scope and what’s out of scope? Prevents scope creep. Bullets only.*

## Context

*How do things work now? What systems exist in this area already? Is there any important terminology that you need to define? May involve research.*

## Design

*What are you going to build or change? What are the key components? How do they fit together? How do they communicate (sync/async)? Backend/frontend? What data storage systems will you use? What related systems will change? A high-level overview first, then get into detail. Miro diagrams are very useful here.*

## Migration

*Does the design introduce non-backwards-compatible changes to an existing system? How will that system and its users migrate to the new design?*

## Failure

*How could this system go wrong? How will you mitigate the failure? How will we know when the system has gone wrong? What will we alert on in OpsGenie? Will we use Sentry/Datadog? Are there any specific scalability concerns (e.g. dataset growth, elasticity of load)?*

## Alternatives

*What alternative designs did you consider? Why didn’t you go for them? Use this to avoid bike shedding further down the line. Bullets.*

## Plan

*How will you actually make this design happen? What order will you build things in? Can you choose a valuable slice to deliver first? Use bullets and update later to link any Jiras.*

## Testing

*How will the work be tested? Can all tests be automated? Are any manual test steps required? If there is state, how is the test data generated?*

## Metrics

*Does the design make it easy (or even possible) to measure success? Consider this from a product perspective first, then consider operational metrics that the team will use to monitor the service.* 

## Data

*Are we capturing any new data? Have we checked with the data team whether this should be forwarded to Redshift? How do we make sure we’re contributing data that is fresh and consistent? What’s the volume likely to be?*

## Security, Privacy and Risk

*Is there any PII involved? Is the system exposed to the internet? Are we working with new third parties? Are we introducing new technologies? How are secrets going to be handled?*

## Release

*How will you launch the change? How will you measure the results? Will you use an experiment? Feature toggle? How will you observe it in production? What are the implications of the launch on other dependency/dependent systems? What’s the first iteration for the system under design?*

## Further Reading

*Any further reading or prior art goes here. E.g.:*

- [Design docs at Google](https://www.industrialempathy.com/posts/design-docs-at-google/)

What’s the lifecycle of a TDD?

We use five states: Draft → Ready → Accepted || Rejected || Superseded

Draft: it’s a work in progress and you’re collating ideas. Write the document in public and set the state to Draft.

Ready: the design is good to go from your perspective and you’d like a review. @ your reviewers and fire out the document to peers inside and outside your team so everyone has visibility on your work.

Accepted: all reviewers have got back to you, you’ve incorporated feedback into your design as appropriate and you’re ready to cut turf. All comments are closed/marked as resolved.

Rejected: for whatever reason, it didn’t pan out, and we’ve decided to not go ahead with the TDD.

Superseded: the design was replaced, and we’re marking our docs to make sure that’s clear. It might be helpful to add a link to the design that replaced the one discussed in the TDD.
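
As a rough illustration, the lifecycle above could be modelled as a small state machine. This is only a sketch: the text names the five states but not every allowed move between them, so the transitions below (Draft to Ready, Ready to Accepted or Rejected, Accepted to Superseded) are an assumption, and the names `Status`, `ALLOWED` and `advance` are hypothetical.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    READY = "Ready"
    ACCEPTED = "Accepted"
    REJECTED = "Rejected"
    SUPERSEDED = "Superseded"


# Assumed transitions: the article lists the states, not the exact edges.
ALLOWED = {
    Status.DRAFT: {Status.READY},
    Status.READY: {Status.ACCEPTED, Status.REJECTED},
    Status.ACCEPTED: {Status.SUPERSEDED},
    Status.REJECTED: set(),
    Status.SUPERSEDED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the target status if the move is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

An organisation that lets a Ready document drop back to Draft for rework would simply add that edge to `ALLOWED`.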

When should I write one?

This is subjective, so lean into your instinct. I’d suggest using four main dimensions for this decision: scope (does the change impact more than one team?), impact, effort and contagion (while a seemingly small change today, will the effects of this change spread across the system over time?).

How long should I spend writing the document?

Always work fast and light: trim sections you don’t need, and share early and often (TDDs should be written in public). A good TDD is under two pages (<1000 words), might have one diagram and takes around a day to write. Remember, what we’re trying to get out of this process is a disciplined approach to design that reduces waste by ruling out some mistakes up front, plus fast dissemination of and feedback on ideas.

Who should I share it with?

Share early (write it in the open) and often. Show the doc in Draft state to your team. Once Ready, share it again widely (I’d suggest with your engineering organisation as a whole, at least asynchronously). Once Accepted, consider a quick lightning talk in an engineering forum or an asynchronous video format like Loom.

Who should review it?

Generally, we want to encourage cross-team collaboration and as much rigour as possible in our design process, so a good practice is to have at least 3 reviewers: 2 from within your team (typically one of these would be a senior engineer, or a delegate) and at least 1 from outside your team (for example, a subject matter expert in the relevant layer of the architecture, or a senior engineer in another team). Think through who all of your stakeholders are and make sure they are aware of the changes.
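
If you wanted to check the reviewer guideline above automatically (say, before a document moves to Ready), a minimal sketch might look like the following. The `Reviewer` type, the `has_enough_reviewers` function and the team names are hypothetical and only illustrate the "at least 2 inside, at least 1 outside" rule of thumb; seniority is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewer:
    name: str
    team: str


def has_enough_reviewers(owner_team: str, reviewers: list[Reviewer]) -> bool:
    """Check for at least 2 reviewers inside the owner's team and 1 outside."""
    inside = sum(1 for r in reviewers if r.team == owner_team)
    outside = len(reviewers) - inside
    # 2 inside plus 1 outside also guarantees the minimum of 3 reviewers.
    return inside >= 2 and outside >= 1


# Hypothetical example: a TDD owned by the "payments" team.
example = [
    Reviewer("alice", "payments"),
    Reviewer("bob", "payments"),
    Reviewer("carol", "platform"),
]
print(has_enough_reviewers("payments", example))
```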

Where should I share it?

Default to making it as visible as possible: post it in Slack etc. and do a lightning talk in your engineering forum. Make sure it’s linked from a central repository so it becomes part of the historical record.

What happens once the TDD is Accepted, do I need to maintain it?

No. While you build, it’s helpful to cross off next steps or link something like a Jira story/epic (so in theory you can chain from TDD → Jira → PR), but the TDD is not meant to be living documentation for what you’ve built. Instead, it’s a snapshot of what we knew and what our thinking was at the point we designed the thing you’re working on.