Release It! Design and Deploy Production-Ready Software

Chapter 1: Living in Production

Aiming for the right target

A Million Dollars Here, A Million Dollars There

Pragmatic Architecture

Two types of architects:

  1. Ivory Tower
     a. More dogmatic; focused on standardization only: "All UIs will be built with Angular."
     b. When this architect is done, there is no room to admit the system can be improved.
  2. Coder
     a. Each component is good enough for the current stresses, and the architect knows which components will need to be changed if stresses increase.

Part I: Create Stability

Chapter 2: Case Study: The Exception that Grounded an Airline

The Smoking Gun

Chapter 3: Stabilize Your System

Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do.

A robust system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing.

If new code is deployed into production every week, then it doesn't matter if the system can run for two years without rebooting.

Failure Modes, and Chains of Failure

Extending Your Life Span

Major dangers to a system's longevity:

Chapter 4: Stability Antipatterns

Integration Points

Socket-Based Protocols

Transmission Control Protocol (TCP) Explained

  1. Client sends SYN (Synchronize) to server
  2. Server sends SYN/ACK (Synchronize/Acknowledgment) to client, which means it will accept connections
  3. Client sends ACK (Acknowledgment)
  4. The two applications can now send data back and forth
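From application code, the handshake in steps 1 to 3 happens inside the operating system when the client asks to connect. A minimal Python sketch, assuming a hypothetical helper name (`can_connect`) and an arbitrary default timeout; it only reports whether the handshake completed in time:

```python
import socket


def can_connect(host: str, port: int, timeout_seconds: float = 2.0) -> bool:
    """Return True if a TCP handshake with host:port completes in time.

    Hypothetical helper for illustration; the name and default timeout
    are assumptions, not something taken from the book.
    """
    try:
        # create_connection triggers the SYN, SYN/ACK, ACK exchange and
        # raises OSError if the connection is refused or times out.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False
```

A connection that is refused fails quickly, while one whose SYN is silently dropped only fails when the timeout expires, which is why an explicit timeout matters here.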

HTTP Protocols

Vendor API Libraries

Countering Integration Point Problems

Most effective methods:

Every integration point will fail in some way, and you need to be prepared for that failure.

You need to do more than handle error responses. You need to be able to handle slow responses and hangs, too.
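One way to cover hangs as well as error responses is to put a timeout on every remote call and treat a timeout like any other failure. A rough sketch, assuming a hypothetical `request_reply` helper that speaks a toy one-message protocol over a raw socket; real clients would use their library's own timeout settings:

```python
import socket
from typing import Optional


def request_reply(host: str, port: int, payload: bytes,
                  timeout_seconds: float = 2.0) -> Optional[bytes]:
    """Send payload and return the reply, or None on any failure.

    Hypothetical helper: refused connections, resets, and hangs all end
    up as None, so the caller has a single failure path to handle.
    """
    try:
        with socket.create_connection((host, port),
                                      timeout=timeout_seconds) as conn:
            # The timeout set by create_connection also applies to the
            # send and receive calls below.
            conn.sendall(payload)
            return conn.recv(4096)
    except OSError:
        # socket.timeout is a subclass of OSError, so a remote end that
        # accepts the connection but never answers lands here too.
        return None
```

Returning None is only one option; raising a dedicated exception would work equally well, as long as a slow remote system cannot block the caller indefinitely.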

Chain Reactions

If a defect such as a memory leak takes one server down, the other servers in the farm have to absorb its load, which makes each of them more likely to go down, and so on until the last one fails.

Cascading Failures



How does your system react to excessive demand?

This is where running in the cloud is your friend, because you can autoscale. But it is pretty easy to rack up a huge bill because of a buggy application.

Stateful sessions can lead to situations where the server runs out of memory by holding everyone's session data. This is why stateless sessions are nice.

If you have to keep data in the session, you should use weak references, which allow the garbage collector to eat them up when it needs more memory. And then you just need to make sure that the caller knows how to deal with a null.
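The caller-side handling can be sketched in Python with the standard `weakref` module. The class names here (`SessionData`, `WeakSessionStore`) are hypothetical. One caveat: in CPython a weakly referenced entry disappears as soon as the last strong reference is dropped, not only under memory pressure, so this illustrates the "be ready for a missing session" part rather than memory-sensitive caching:

```python
import weakref
from typing import Optional


class SessionData:
    """Plain class so that instances can be weakly referenced."""

    def __init__(self, user_id: str):
        self.user_id = user_id


class WeakSessionStore:
    """Hypothetical store that never keeps session data alive by itself."""

    def __init__(self) -> None:
        # Entries vanish automatically once nothing else references them.
        self._sessions = weakref.WeakValueDictionary()

    def put(self, session_id: str, data: SessionData) -> None:
        self._sessions[session_id] = data

    def get(self, session_id: str) -> Optional[SessionData]:
        # Callers must handle None: the data may have been collected.
        return self._sessions.get(session_id)
```

A caller would typically rebuild or reload the session data when `get` returns None instead of treating it as an error.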

Off-Heap Memory, Off-Host Memory

Memcached and Redis are popular tools for moving memory outside of your process. Many systems use Redis to store session data.

Expensive to Serve

Have load tests for your expensive transactions, or expensive user flows (when the user is doing a lot of stuff).

Expensive users are usually the ones that bring you the most revenue, because they're interacting with your system.

Unwanted Users

Sessions are the Achilles' heel of web applications. If you pick a deep link from a site and start sending requests to it over and over without cookies, it'll create a new session for every request.

There's an entire industry built on the idea of consuming resources from other companies' websites, called competitive intelligence. Bots & scrapers.

Session Tracking: Cookies

HTTP is stateless, so even if the same person makes the same request over and over, the server doesn't know that it's coming from the same place. Netscape found a way to add a little extra data into the protocol, called cookies. Cookies are mostly used to maintain the idea of a session.
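As an illustration of how a session ID travels in a cookie, here is a small sketch using Python's standard `http.cookies` module. The cookie name `session_id` and both function names are assumptions made for this example:

```python
import secrets
from http.cookies import SimpleCookie
from typing import Optional, Tuple


def new_session_cookie() -> Tuple[str, str]:
    """Create a session ID and the value for a Set-Cookie response header."""
    session_id = secrets.token_hex(16)  # random, hard-to-guess identifier
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    cookie["session_id"]["path"] = "/"
    cookie["session_id"]["httponly"] = True
    return session_id, cookie["session_id"].OutputString()


def session_id_from_request(cookie_header: str) -> Optional[str]:
    """Extract the session ID from a request's Cookie header, if present."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    return morsel.value if morsel is not None else None
```

A request that arrives without the cookie yields None, which is the point at which a server would normally create a brand-new session.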

Malicious Users

The most common attack is the distributed denial-of-service (DDoS) attack. The attacker causes many computers to start generating load on your site. They usually use a botnet: a group of compromised computers that take commands from a machine the attacker controls.

Most network vendors have software to help prevent DDoS attacks.

Blocked Threads

Self-Denial Attacks

Scaling Effects

Unbalanced Capacities


Force Multiplier

Slow Responses

Unbounded Result Sets

Chapter 5: Stability Patterns

"Not one of these will help your software pass QA, but they will help you get a full night's sleep ... once your software launches."


Circuit Breaker

  1. A call succeeds, the circuit breaker stays closed, and keeps failure count at 0.
  2. A call fails, the circuit breaker stays closed, but increments the failure count.
  3. n calls fail, the circuit breaker opens the circuit, immediately failing calls without even attempting to perform the operation.
  4. After a given period of time, the circuit goes "half-open," and tries to actually perform the next request.
  5. If it fails, the circuit breaker opens the circuit again. If it succeeds, the circuit breaker fully closes the circuit.
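The five steps above can be sketched as a small state machine. This is a minimal, single-threaded illustration, not a production implementation: the class and parameter names are hypothetical, the trial state from step 4 is labeled "half-open" (its usual name), and resetting the failure count on every success is an assumption about how faults are counted:

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # "n" in step 3
        self.reset_timeout = reset_timeout          # wait before a trial call
        self._clock = clock                         # injectable for tests
        self.failure_count = 0
        self.state = "closed"
        self._opened_at = 0.0

    def call(self, operation, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                # Step 3: fail immediately without attempting the operation.
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # Step 4: allow one trial call.
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Steps 1 and 5: a success closes the circuit and clears the count.
        self.failure_count = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self.failure_count += 1  # Step 2
        trial_failed = self.state == "half-open"  # Step 5, failure branch
        if trial_failed or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Usage would look like `breaker.call(fetch_inventory, item_id)`, where `fetch_inventory` stands for any hypothetical remote call; the caller catches `CircuitOpenError` and returns whatever fallback was agreed on for the open state.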

Options for what the Circuit Breaker can return when it's open

Involve the stakeholders to help you decide how to handle calls when the circuit is open.

Counting faults

Logging & Monitoring Circuit Breakers


Steady State

Fail Fast

A fast failure response is better than a slow response.

Let it Crash


Test Harnesses

Decoupling Middleware

Create Back Pressure

Chapter 7: Foundations

"Designing for production means designing for people who do operations."

NICs and Names


  1. The name an operating system uses to identify itself (run hostname command)
  2. The external name of a system which DNS takes and resolves to an IP address.

These are not the same thing.

DNS to IP address is a many-to-many relationship. A single domain name can map to multiple IP addresses (for example, to spread load across several hosts), and multiple domain names can point to the same IP.
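The "one name, many addresses" side of this can be observed with the standard resolver interface. A small sketch, assuming a hypothetical helper name `resolve_all`; which addresses come back depends entirely on the local resolver configuration:

```python
import socket
from typing import List


def resolve_all(hostname: str) -> List[str]:
    """Return every distinct IP address the resolver reports for hostname."""
    # Restricting to TCP avoids duplicate entries per socket type.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling it on a busy public domain name will often return several addresses, while `resolve_all("localhost")` typically returns the loopback addresses.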

A single server can have many network interfaces. NIC = Network Interface Controller.

You can have some interfaces for production traffic, and some for monitoring or operations. It's also good practice to perform backups on a separate network interface: backups are short bursts of large volume, so they can clog up production traffic.

You should specify not only the port number, but also the name or IP address that identifies the interface on which you want your server to listen for incoming traffic.
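In socket terms this means binding the listening socket to a specific address instead of to every interface. A minimal sketch, with a hypothetical helper name `listen_on` and IPv4 assumed for brevity:

```python
import socket


def listen_on(address: str, port: int) -> socket.socket:
    """Open a TCP listening socket bound to one specific address.

    Binding to "0.0.0.0" would instead accept traffic on every IPv4
    interface, including ones meant only for backups or administration.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((address, port))  # the address selects the interface
    server.listen()
    return server
```

For example, `listen_on("127.0.0.1", 8080)` only accepts connections arriving on the loopback interface; the port number is an arbitrary choice for illustration.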

Physical Hosts

Back in the day, you wanted each box in the data center to be designed for high reliability. Now we use load-balanced services with so much redundancy that the loss of a single box isn't a big deal, so they're designed to be as cheap as possible.

Virtual Machines in the Data Center

Containers in the Data Center

Notes from presentation