Technical Design Document
Mar 10, 2023
TDD Template
How to use this document (remove once read)
Use it as a checklist to make sure you cover all bases, remove anything you don’t need.
Work fast and light. A good TDD is <1000 words and it includes one diagram (that’s two pages, not including the 525 words for the template).
Write it in the open (no private documents!). Once it’s in Ready state, share it with everyone.
| Status | Draft, Ready, Accepted, Rejected, Superseded |
| --- | --- |
| Created | @June 1, 2023 |
| Owners | Tag using @ to notify. |
| Reviewers | Tag using @ to notify. |
TL;DR
Too Long; Didn’t Read. Put the answer first. If someone only reads this far, what do you want them to know?
Goals and Non Goals
What problems are you solving? What are you not trying to solve? What’s in scope and what’s out of scope? Prevents scope creep. Bullets only.
Context
How do things work now? What systems exist in this area already? Is there any important terminology that you need to define? May involve research.
Design
What are you going to build or change? What are the key components? How do they fit together? How do they communicate (sync/async)? Backend/frontend? What data storage systems will you use? What related systems will change? A high-level overview first, then get into detail. Miro diagrams are very useful here.
Migration
Does the design incur non-backwards-compatible changes to an existing system? How will they migrate to the new design?
Failure
How could this system go wrong? How will you mitigate the failure? How will we know when the system has gone wrong? What will we alert on in OpsGenie? Will we use Sentry/Datadog? Are there any specific scalability concerns (e.g. dataset growth, elasticity of load)?
Alternatives
What alternative designs did you consider? Why didn’t you go for them? Use this to avoid bike shedding further down the line. Bullets.
Plan
How will you actually make this design happen? What order will you build things in? Can you choose a valuable slice to deliver first? Use bullets and update later to link any Jiras.
Testing
How will the work be tested? Can all tests be automated? Are any manual test steps required? If there is state, how is the test data generated?
Metrics
Does the design make it easy (or even possible) to measure success? Consider this from a product perspective first, then consider operational metrics that the team will use to monitor the service.
Data
Are we capturing any new data? Have we checked with the data team whether this should be forwarded to Redshift? How do we make sure we’re contributing data that is fresh and consistent? What’s the volume likely to be?
Security, Privacy and Risk
Is there any PII involved? Is the system exposed to the internet? Are we working with new third parties? Are we introducing new technologies? How are secrets going to be handled?
Release
How will you launch the change? How will you measure the results? Will you use an experiment? Feature toggle? How will you observe it in production? What are the implications of the launch on other dependency/dependent systems? What’s the first iteration for the system under design?
Further Reading
Any further reading or prior art goes here, e.g. Design docs at Google: https://www.industrialempathy.com/posts/design-docs-at-google/
```markdown
# TDD Template

### How to use this document (remove once read)

Use it as a checklist to make sure you cover all bases, remove anything you don’t need.

Work fast and light. A good TDD is <1000 words and it includes one diagram (that’s two pages, not including the 525 words for the template).

Write it in the open (no private documents!). Once it’s in Ready state, share it with everyone.

| Status | Draft, Ready, Accepted, Rejected, Superseded |
| --- | --- |
| Created | @June 1, 2023 |
| Owners | Tag using @ to notify. |
| Reviewers | Tag using @ to notify. |

## TL;DR

Too Long; Didn’t Read. Put the answer first. If someone only reads this far, what do you want them to know?

## Goals and Non Goals

What problems are you solving? What are you not trying to solve? What’s in scope and what’s out of scope? Prevents scope creep. Bullets only.

## Context

How do things work now? What systems exist in this area already? Is there any important terminology that you need to define? May involve research.

## Design

What are you going to build or change? What are the key components? How do they fit together? How do they communicate (sync/async)? Backend/frontend? What data storage systems will you use? What related systems will change? A high-level overview first, then get into detail. Miro diagrams are very useful here.

## Migration

Does the design incur non-backwards-compatible changes to an existing system? How will they migrate to the new design?

## Failure

How could this system go wrong? How will you mitigate the failure? How will we know when the system has gone wrong? What will we alert on in OpsGenie? Will we use Sentry/Datadog? Are there any specific scalability concerns (e.g. dataset growth, elasticity of load)?

## Alternatives

What alternative designs did you consider? Why didn’t you go for them? Use this to avoid bike shedding further down the line. Bullets.

## Plan

How will you actually make this design happen? What order will you build things in? Can you choose a valuable slice to deliver first? Use bullets and update later to link any Jiras.

## Testing

How will the work be tested? Can all tests be automated? Are any manual test steps required? If there is state, how is the test data generated?

## Metrics

Does the design make it easy (or even possible) to measure success? Consider this from a product perspective first, then consider operational metrics that the team will use to monitor the service.

## Data

Are we capturing any new data? Have we checked with the data team whether this should be forwarded to Redshift? How do we make sure we’re contributing data that is fresh and consistent? What’s the volume likely to be?

## Security, Privacy and Risk

Is there any PII involved? Is the system exposed to the internet? Are we working with new third parties? Are we introducing new technologies? How are secrets going to be handled?

## Release

How will you launch the change? How will you measure the results? Will you use an experiment? Feature toggle? How will you observe it in production? What are the implications of the launch on other dependency/dependent systems? What’s the first iteration for the system under design?

## Further Reading

Any further reading or prior art goes here. E.g.:

- [Design docs at Google](https://www.industrialempathy.com/posts/design-docs-at-google/)
```
What’s the lifecycle of a TDD?
We use five states: Draft ⏩ Ready ⏩ Accepted || Rejected || Superseded
Draft: it’s a work in progress and you’re collating ideas. Write the document in public and set the state to Draft.
Ready: the design is good to go from your perspective and you’d like a review. Tag your reviewers with @ to notify them, and send the document to peers inside and outside your team so everyone has visibility of your work.
Accepted: all reviewers have got back to you, you’ve incorporated feedback into your design as appropriate and you’re ready to cut turf. All comments are closed/marked as resolved.
Rejected: for whatever reason, it didn’t pan out, and we’ve decided to not go ahead with the TDD.
Superseded: the design was replaced, and we’re marking our docs to make sure that’s clear. It might be helpful to update the links to point to the design that replaced the one discussed in the TDD.
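The five states above can be read as a small state machine. The following Python sketch shows one way to encode them, for example in a script that checks a document index. The names `Status`, `ALLOWED` and `can_move` are hypothetical, and two choices are assumptions rather than rules stated here: an Accepted design may later become Superseded, and there is no path back to Draft.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    READY = "Ready"
    ACCEPTED = "Accepted"
    REJECTED = "Rejected"
    SUPERSEDED = "Superseded"


# Transitions read off "Draft -> Ready -> Accepted || Rejected || Superseded".
ALLOWED = {
    Status.DRAFT: {Status.READY},
    Status.READY: {Status.ACCEPTED, Status.REJECTED, Status.SUPERSEDED},
    # Assumption: an accepted design can still be replaced later.
    Status.ACCEPTED: {Status.SUPERSEDED},
    # Assumption: Rejected and Superseded are final states.
    Status.REJECTED: set(),
    Status.SUPERSEDED: set(),
}


def can_move(current: Status, target: Status) -> bool:
    """Return True if a document may move from `current` to `target`."""
    return target in ALLOWED[current]
```

If your team sends Ready documents back to Draft for rework, add that edge to `ALLOWED`; the table is the only place the rules live.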
When should I write one?
This is subjective, so lean into your instinct. I’d suggest using four main dimensions for this decision: scope (does the change impact more than one team?), impact, effort and contagion (while a seemingly small change today, will the effects of this change spread across the system over time?).
How long should I spend writing the document?
Always work fast and light: trim sections you don’t need, and share early and often (TDDs should be written in public). A good TDD is under two pages (<1000 words), might have one diagram and takes around a day to write. Remember that what we’re trying to get out of this process is a disciplined approach to design that reduces waste by ruling out some mistakes up front, plus fast dissemination of and feedback on ideas.
Who should I share it with?
Share early (write it in the open) and often. Show the doc in Draft state to your team. Once Ready, share it again widely (I’d suggest with your engineering organisation as a whole, at least asynchronously). Once Accepted, consider a quick lightning talk in an engineering forum or an asynchronous video format like Loom.
Who should review it?
Generally, we want to encourage cross-team collaboration and as much rigor as possible in our design process, so I think a good practice is to have at least 3 reviewers: 2 from within your team (typically one of these would be a senior engineer, or a delegate) and at least 1 from outside your team (for example, a subject matter expert in the relevant layer of the architecture, or a senior engineer in another team). Think through who all of your stakeholders are and make sure they are aware of the changes.
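The reviewer guideline above (at least 3 reviewers, 2 from inside the team and 1 from outside) is simple enough to check mechanically. This is a minimal Python sketch, not an existing tool: `Reviewer`, `reviewer_gaps` and the team names are hypothetical, and it assumes you can tell which team each reviewer belongs to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewer:
    handle: str  # e.g. "@ana" (hypothetical)
    team: str    # team the reviewer belongs to


def reviewer_gaps(owner_team: str, reviewers: list[Reviewer]) -> list[str]:
    """List which parts of the reviewer guideline are not yet met."""
    inside = sum(1 for r in reviewers if r.team == owner_team)
    outside = len(reviewers) - inside

    gaps = []
    if len(reviewers) < 3:
        gaps.append(f"need at least 3 reviewers, have {len(reviewers)}")
    if inside < 2:
        gaps.append(f"need at least 2 reviewers from the team, have {inside}")
    if outside < 1:
        gaps.append(f"need at least 1 reviewer from outside the team, have {outside}")
    return gaps
```

An empty list means the guideline is met. The check only counts heads; whether one of the in-team reviewers is a senior engineer or a delegate is still a judgment call.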
Where should I share it?
Default to making it as visible as possible: post it in Slack etc. and do a lightning talk in your engineering forum. Make sure it’s linked from a central repository so it becomes part of the historical record.
What happens once the TDD is Accepted, do I need to maintain it?
No. While you build, it’s helpful to cross off next steps or link something like a Jira story/epic (so in theory you can chain from RFC → Jira → PR), but the TDD is not meant to be living documentation for what you’ve built. Instead, it’s a snapshot of what we knew and what our thinking was at the point we designed the thing you’re working on.