The DevOps Handbook

success in modern technical endeavors absolutely requires multiple perspectives and expertise to collaborate. (Location 491)

By adding the expertise of QA, IT Operations, and Infosec into delivery teams and automated self-service tools and platforms, teams are able to use that expertise in their daily work without being dependent on other teams. (Location 506)

Note: Embed QA, Infosec, and IT Operations into delivery teams

technical debt describes how decisions we make lead to problems that get increasingly more difficult to fix over time, continually reducing our available options (Location 559)

The first act begins in IT Operations, where our goal is to keep applications and infrastructure running so that our organization can deliver value to customers. In our daily work, many of our problems are due to applications and infrastructure that are complex, poorly documented, and incredibly fragile. This is the technical debt and daily workarounds that we live with constantly, always promising that we’ll fix the mess when we have a little more time. But that time never comes. (Location 576)

Tags: itops

Note: .itops IT Ops keep apps and infrastructure running

The second act begins when somebody has to compensate for the latest broken promise—it could be a product manager promising a bigger, bolder feature to dazzle customers with or a business executive setting an even larger revenue target. Then, oblivious to what technology can or can’t do, or what factors led to missing our earlier commitment, they commit the technology organization to deliver upon this new promise. As a result, Development is tasked with another urgent project that inevitably requires solving new technical challenges and cutting corners to meet the promised release date, further adding to our technical debt—made, of course, with the promise that we’ll fix any resulting problems when we have a little more time. (Location 582)

Note: IT is put under pressure by promises from product managers or business executives

Instead of a culture of fear, we have a high-trust, collaborative culture, where people are rewarded for taking risks. They are able to fearlessly talk about problems as opposed to hiding them or putting them on the backburner—after all, we must see problems in order to solve them. (Location 671)

Tags: culture, collaboration

Note: .collaboration .culture

And, because everyone fully owns the quality of their work, everyone builds automated testing into their daily work and uses peer reviews to gain confidence that problems are addressed long before they can impact a customer. These processes mitigate risk, as opposed to approvals from distant authorities, allowing us to deliver value quickly, reliably, and securely—even proving to skeptical auditors that we have an effective system of internal controls. (Location 673)

Tags: testing

We also hold internal technology conferences to elevate our skills and ensure that everyone is always teaching and learning. (Location 678)

Tags: teaching

Note: .teaching

when projects are late, adding more developers not only decreases individual developer productivity but also decreases overall productivity. (Location 708)

Note: Adding more developers to a project can slow things down

Part I Introduction

Three Ways: Flow, Feedback, and Continual Learning and Experimentation. (Location 799)

Two of Lean’s major tenets include the deeply held belief that manufacturing lead time required to convert raw materials into finished goods was the best predictor of quality, customer satisfaction, and employee happiness, and that one of the best predictors of short lead times was small batch sizes of work. (Location 819)

Note: Small batch sizes improve lead time and quality

1 Agile, Continuous Delivery, and the Three Ways

In DevOps, we typically define our technology value stream as the process required to convert a business hypothesis into a technology-enabled service that delivers value to the customer. (Location 876)

Tags: valuestream

Note: .valuestream the process to turn a business hypothesis into a tech-enabled service that provides value to the customer

Because value is created only when our services are running in production, we must ensure that we are not only delivering fast flow, but that our deployments can also be performed without causing chaos and disruptions such as service outages, service impairments, or security or compliance failures. (Location 881)

Tags: value, production

Note: Value is only created when our services are in production

The First Way enables fast left-to-right flow of work from Development to Operations to the customer. In order to maximize flow, we need to make work visible, reduce our batch sizes and intervals of work, build in quality by preventing defects from being passed to downstream work centers, and constantly optimize for the global goals. (Location 930)

Tags: flow

Note: .flow

The Second Way enables the fast and constant flow of feedback from right to left at all stages of our value stream. It requires that we amplify feedback to prevent problems from happening again, or enable faster detection and recovery. (Location 939)

Tags: feedback, flow

Note: .flow .feedback

The Third Way enables the creation of a generative, high-trust culture that supports a dynamic, disciplined, and scientific approach to experimentation and risk-taking, facilitating the creation of organizational learning, both from our successes and failures. (Location 945)

Tags: culture

Note: .culture

2 The First Way: The Principles of Flow

Ideally, our kanban board will span the entire value stream, defining work as completed only when it reaches the right side of the board. Work is not done when Development completes the implementation of a feature—rather, it is only done when our application is running successfully in production, delivering value to the customer. (Location 988)

Tags: kanban

Note: .kanban work is only complete when it is deployed to customers

To transmit code through the value stream requires multiple departments to work on a variety of tasks, including:

- functional testing

- integration testing

- environment creation

- server administration

- storage administration

- networking

- load balancing, and

- information security (Location 1059)

Tags: deployment

Note: .deployment there are often many teams involved

Dr. Goldratt defined the “five focusing steps”:

- Identify the system’s constraint.
- Decide how to exploit the system’s constraint.
- Subordinate everything else to the above decisions.
- Elevate the system’s constraint.
- If in the previous steps a constraint has been broken, go back to step one, but do not allow inertia to cause a system constraint. (Location 1079)

Environment creation: We cannot achieve deployments on-demand if we always have to wait weeks or months for production or test environments. The countermeasure is to create environments that are on demand and completely self-serviced, so that they are always available when we need them. (Location 1085)

Tags: environments, devops

Note: .devops

Code deployment: We cannot achieve deployments on demand if each of our production code deployments take weeks or months to perform (i.e., each deployment requires 1,300 manual, error-prone steps involving up to three hundred engineers). The countermeasure is to automate our deployments as much as possible, with the goal of being completely automated so they can be done self-service by any developer. (Location 1087)

Tags: devops

Note: Automate deployments so we can deploy more frequently

Test setup and run: We cannot achieve deployments on demand if every code deployment requires two weeks to set up our test environments and data sets, and another four weeks to manually execute all our regression tests. The countermeasure is to automate our tests so we can execute deployments safely and to parallelize them so the test rate can keep up with our code development rate. (Location 1090)

Tags: devops

Note: .devops

The following categories of waste and hardship come from Implementing Lean Software Development unless otherwise noted:

- Partially done work: This includes any work in the value stream that has not been completed (e.g., requirement documents or change orders not yet reviewed) and work that is sitting in queue (e.g., waiting for QA review or server admin ticket). Partially done work becomes obsolete and loses value as time progresses.
- Extra processes: Any additional work that is being performed in a process that does not add value to the customer. This may include documentation not used in a downstream work center, or reviews or approvals that do not add value to the output. Extra processes add effort and increase lead times.
- Extra features: Features built into the service that are not needed by the organization or the customer (e.g., “gold plating”). Extra features add complexity and effort to testing and managing functionality.
- Task switching: When people are assigned to multiple projects and value streams, requiring them to context switch and manage dependencies between work, adding additional effort and time into the value stream.
- Waiting: Any delays between work requiring resources to wait until they can complete the current work. Delays increase cycle time and prevent the customer from getting value. (Location 1112)

Tags: waste

Note: .waste

3 The Second Way: The Principles of Feedback

Swarming is necessary for the following reasons:

- It prevents the problem from progressing downstream, where the cost and effort to repair it increases exponentially and technical debt is allowed to accumulate.
- It prevents the work center from starting new work, which will likely introduce new errors into the system.
- If the problem is not addressed, the work center could potentially have the same problem in the next operation (e.g., fifty-five seconds later), requiring more fixes and work. (Location 1220)

Note: The further downstream a problem is resolved the greater the cost and effort to fix

According to Lean, our most important customer is our next step downstream. Optimizing our work for them requires that we have empathy for their problems in order to better identify the design problems that prevent fast and smooth flow. (Location 1275)

4 The Third Way: The Principles of Continual Learning and Experimentation

Dr. Westrum defined three types of culture:

- Pathological organizations are characterized by large amounts of fear and threat. People often hoard information, withhold it for political reasons, or distort it to make themselves look better. Failure is often hidden.
- Bureaucratic organizations are characterized by rules and processes, often to help individual departments maintain their “turf.” Failure is processed through a system of judgment, resulting in either punishment or justice and mercy.
- Generative organizations are characterized by actively seeking and sharing information to better enable the organization to achieve its mission. Responsibilities are shared throughout the value stream, and failure results in reflection and genuine inquiry. (Location 1334)

Tags: culture

Note: .culture in pathological organisations mistakes are buried. In bureaucratic organisations there are many rules and processes

We improve daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of our code and environments—we do this by reserving cycles in each development interval or by scheduling kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want. (Location 1361)

Tags: techdebt

Note: Set aside time for engineers to work on fixes

TRANSFORM LOCAL DISCOVERIES INTO GLOBAL IMPROVEMENTS

When new learnings are discovered locally, there must also be some mechanism to enable the rest of the organization to use and benefit from that knowledge.

In other words, when teams or individuals have experiences that create expertise, our goal is to convert that tacit knowledge (i.e., knowledge that is difficult to transfer to another person by means of writing it down or verbalizing) into explicit, codified knowledge, which becomes someone else’s expertise through practice. (Location 1382)

Tags: knowledge management, learning, favorite, notion, knowledge

Note: .knowledge convert local learning into global knowledge

In the technology value stream, we can introduce the same type of tension into our systems by seeking to always reduce deployment lead times, increase test coverage, decrease test execution times, and even by re-architecting if necessary to increase developer productivity or increase reliability. (Location 1411)

Part II Introduction

5 Selecting Which Value Stream to Start With

new ideas are often quickly embraced by innovators and early adopters, while others with more conservative attitudes resist them (the early majority, late majority, and laggards). Our goal is to find those teams that already believe in the need for DevOps principles and practices, and who possess a desire and demonstrated ability to innovate and improve their own processes. Ideally, these groups will be enthusiastic supporters of the DevOps journey. (Location 1598)

Note: New ideas are readily adopted by some but take longer to be adopted by others. Focus on the early adopters

Regardless of how we scope our initial effort, we must demonstrate early wins and broadcast our successes. We do this by breaking up our larger improvement goals into small, incremental steps. (Location 1609)

Note: Get some early wins and let those wins be known

6 Understanding the Work in Our Value Stream, Making it Visible, and Expanding it Across the Organization

Operations: the team often responsible for maintaining the production environment and helping ensure that required service levels are met (Location 1687)

Tags: ops

Note: .ops

In our value stream, work likely begins with the product owner in the form of a customer request or the formulation of a business hypothesis. Some time later, this work is accepted by Development, where features are implemented in code and checked in to our version control repository. Builds are then integrated, tested in a production-like environment, and finally deployed into production, where they (ideally) create value for our customer. (Location 1695)

Tags: valuestream

Note: .valuestream

we should focus our investigation and scrutiny on the following areas:

- Places where work must wait weeks or even months, such as getting production-like environments, change approval processes, or security review processes
- Places where significant rework is generated or received (Location 1709)

organizations need to create a dedicated transformation team that is able to operate outside of the rest of the organization that is responsible for daily operations (which they call the “dedicated team” and “performance engine” respectively). (Location 1740)

Note: Transformation team needs to be separate from the rest of the organisation

(NFRs, sometimes referred to as the “ilities”), such as maintainability, manageability, scalability, reliability, testability, deployability, and security. (Location 1786)

Tags: business analyst, nfr

Note: .nfr

The deal [between product owners and] engineering goes like this: Product management takes 20% of the team’s capacity right off the top and gives this to engineering to spend as they see fit. They might use it to rewrite, re-architect, or re-factor problematic parts of the code base...whatever they believe is necessary to avoid ever having to come to the team and say, ‘we need to stop and rewrite [all our code].’ (Location 1793)

Tags: developer

Note: Set aside 20% of dev time for devs to improve and maintain the code

One goal is that our tooling reinforces that Development and Operations not only have shared goals but have a common backlog of work, ideally stored in a common work system and using a shared vocabulary, so that work can be prioritized globally. (Location 1854)

Tags: backlog

Note: Have a shared backlog and work system for dev and ops

7 How to Design Our Organization and Architecture with Conway’s Law in Mind

When we have a tightly-coupled architecture, small changes can result in large scale failures. As a result, anyone working in one part of the system must constantly coordinate with anyone else working in another part of the system they may affect, including navigating complex and bureaucratic change management processes. (Location 2078)

Tags: architecture

Testing is done in scarce integration test environments, which often require weeks to obtain and configure. The result is not only long lead times for changes (typically measured in weeks or months) but also low developer productivity and poor deployment outcomes. (Location 2082)

Tags: environments

Note: .environments integration test environments often take a long time to get

when we have an architecture that enables small teams of developers to independently implement, test, and deploy code into production safely and quickly, we can increase and maintain developer productivity and improve deployment outcomes. These characteristics can be found in service-oriented architectures (SOAs) first described in the 1990s, in which services are independently testable and deployable. A key feature of SOAs is that they’re composed of loosely-coupled services with bounded contexts. (Location 2084)

Tags: architecture

Note: .architecture soa architecture

Having architecture that is loosely-coupled means that services can update in production independently, without having to update other services. Services must be decoupled from other services and, just as important, from shared databases (although they can share a database service, provided they don’t have any common schemas).

The idea is that developers should be able to understand and update the code of a service without knowing anything about the internals of its peer services. Services interact with their peers strictly through APIs and thus don’t share data structures, database schemata, or other internal representations of objects. (Location 2088)

Tags: apis, microservices

8 How to Get Great Outcomes by Integrating Operations into the Daily Work of Development

CONCLUSION

Throughout this chapter, we explored ways to integrate Operations into the daily work of Development and looked at how to make our work more visible to Operations. To accomplish this, we explored three broad strategies, including creating self-service capabilities to enable developers in service teams to be productive, embedding Ops engineers into the service teams, and assigning Ops liaisons to the service teams when embedding Ops engineers was not possible. Lastly, we described how Ops engineers can integrate with the Dev team through inclusion in their daily work, including daily standups, planning, and retrospectives. (Location 2347)

Tags: devops

Note: .devops integrate ops team members into development team and ceremonies

Part III Introduction (Location 2377)

reduce the risk associated with deploying and releasing changes into production. We will do this by implementing a set of technical practices known as continuous delivery. (Location 2380)

Tags: cd

Note: .cd reduce risks of deploying into production

Continuous delivery includes creating the foundations of our automated deployment pipeline, ensuring that we have automated tests that constantly validate that we are in a deployable state, having developers integrate their code in to trunk daily, and architecting our environments and code to enable low-risk releases. (Location 2381)

Tags: cd

Note: .cd automated deployment pipeline, automated testing, integrate code into trunk daily

9 Create the Foundations of Our Deployment Pipeline

In order to create fast and reliable flow from Dev to Ops, we must ensure that we always use production-like environments at every stage of the value stream. Furthermore, these environments must be created in an automated manner, ideally on demand from scripts and configuration information stored in version control and entirely self-serviced, without any manual work required from Operations. Our goal is to ensure that we can re-create the entire production environment based on what’s in version control. (Location 2393)

Tags: environments

Note: .environments all environments should be as close as possible to production
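
A tiny, hedged sketch of what “re-create the environment from version control” might look like for a single service, assuming Docker is available; the image name, settings file, and container name are hypothetical, and a real setup would more likely use dedicated infrastructure-as-code tooling:

```python
import json
import subprocess

# Everything needed to recreate the environment lives in version control:
# the Dockerfile plus a small settings file checked in next to the code.
IMAGE = "myapp-prod-like"                     # hypothetical image name
SETTINGS_FILE = "environments/staging.json"   # hypothetical versioned config

def recreate_environment():
    with open(SETTINGS_FILE) as f:
        settings = json.load(f)

    # Build the image from the versioned Dockerfile; no manual server setup.
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)

    # Throw away any previous instance and start a fresh one
    # (re-create rather than repair).
    subprocess.run(["docker", "rm", "-f", settings["container_name"]], check=False)
    subprocess.run(
        [
            "docker", "run", "-d",
            "--name", settings["container_name"],
            "-p", f"{settings['port']}:8080",
            IMAGE,
        ],
        check=True,
    )

if __name__ == "__main__":
    recreate_environment()
```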

variety of problems that can be traced back to inconsistently constructed environments and changes not being systematically put back into version control. (Location 2419)

Tags: environments

Note: .environments

why does using version control for our environments predict IT and organizational performance better than using version control for our code? Because in almost all cases, there are orders of magnitude more configurable settings in our environment than in our code. Consequently, it is the environment that needs to be in version control the most. (Location 2497)

Tags: environments

Note: .environments environments need version control most of all because they have so many configuration settings

Bill Baker, a distinguished engineer at Microsoft, quipped that we used to treat servers like pets: “You name them and when they get sick, you nurse them back to health. [Now] servers are [treated] like cattle. You number them and when they get sick, you shoot them.” By having repeatable environment creation systems, we are able to easily increase capacity by adding more servers into rotation (i.e., horizontal scaling). We also avoid the disaster that inevitably results when we must restore service after a catastrophic failure of irreproducible infrastructure, created through years of undocumented and manual production changes. (Location 2506)

Tags: environments

Note: .environments Recreate rather than repair

To ensure consistency of our environments, whenever we make production changes (configuration changes, patching, upgrading, etc.), those changes need to be replicated everywhere in our production and pre-production environments, as well as in any newly created environments. Instead of manually logging into servers and making changes, we must make changes in a way that ensures all changes are replicated everywhere automatically and that all our changes are put into version control. (Location 2511)

Tags: environments

Note: .environments any changes made to production should be replicated across all environments automatically

Now that our environments can be created on demand and everything is checked in to version control, our goal is to ensure that these environments are being used in the daily work of Development. We need to verify that our application runs as expected in a production-like environment long before the end of the project or before our first production deployment. (Location 2529)

Tags: environments

Note: .environments

at the end of each development interval, we have integrated, tested, working and potentially shippable code, demonstrated in a production-like environment. (Location 2538)

Tags: environments

Note: .environments get code working in a production like environment

CONCLUSION

The fast flow of work from Development to Operations requires that anyone can get production-like environments on demand. By allowing developers to use production-like environments, even at the earliest stages of a software project, we significantly reduce the risk of production problems later. This is one of many practices that demonstrate how Operations can make developers far more productive. We enforce the practice of developers running their code in production-like environments by incorporating it into the definition of “done.” Furthermore, by putting all production artifacts into version control, we have a “single source of truth” that allows us to re-create the entire production environment in a quick, repeatable, and documented way, using the same development practices for Operations work as we do for Development work. And by making production infrastructure easier to rebuild than to repair, we make resolving problems easier and faster, as well as making it easier to expand capacity. (Location 2553)

10 Enable Fast and Reliable Automated Testing

The deployment pipeline, first defined by Jez Humble and David Farley in their book Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, ensures that all code checked in to version control is automatically built and tested in a production-like environment. By doing this, we find any build, test, or integration errors as soon as a change is introduced, enabling us to fix them immediately. Done correctly, this allows us to always be assured that we are in a deployable and shippable state. (Location 2661)

Tags: deploymentpipeline

Note: .deploymentpipeline all code checked into version control is automatically built and tested in a production-like environment

Now that we have a working deployment pipeline infrastructure, we must create our continuous integration practices, which require three capabilities:

- A comprehensive and reliable set of automated tests that validate we are in a deployable state.
- A culture that “stops the entire production line” when our validation tests fail.
- Developers working in small batches on trunk rather than long-lived feature branches.

In the next section, we describe why fast and reliable automated testing is needed and how to build it. (Location 2699)

Tags: continuousintegration

Note: .continuousintegration

we need fast automated tests that run within our build and test environments whenever a new change is introduced into version control. In this way we can find and fix any problems immediately, as the Google Web Server example demonstrated. By doing this, we ensure our batches remain small, and, at any given point in time, we remain in a deployable state. (Location 2721)

Note: Have fast automated tests every time a change is made

Unit tests: These typically test a single method, class, or function in isolation, providing assurance to the developer that their code operates as designed. For many reasons, including the need to keep our tests fast and stateless, unit tests often “stub out” databases and other external dependencies (e.g., functions are modified to return static, predefined values, instead of calling the real database). (Location 2725)

Tags: test, unittest

Note: usually test a single class or method in isolation. Frequently stub out databases and dependencies
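
As an illustration of the stubbing idea above, a minimal sketch (the `OrderService` and `total_for_customer` names are hypothetical, not from the book) that replaces the real database with predefined values so the unit test stays fast and stateless:

```python
import unittest
from unittest.mock import Mock

class OrderService:
    """Hypothetical service whose logic we want to unit test in isolation."""
    def __init__(self, db):
        self.db = db  # in production this would be a real database client

    def total_for_customer(self, customer_id):
        orders = self.db.fetch_orders(customer_id)
        return sum(order["amount"] for order in orders)

class OrderServiceTest(unittest.TestCase):
    def test_total_sums_all_order_amounts(self):
        # Stub out the database: return static, predefined values instead of
        # calling a real data store, keeping the test fast and stateless.
        stub_db = Mock()
        stub_db.fetch_orders.return_value = [{"amount": 10}, {"amount": 15}]

        service = OrderService(stub_db)

        self.assertEqual(service.total_for_customer("customer-42"), 25)
        stub_db.fetch_orders.assert_called_once_with("customer-42")

if __name__ == "__main__":
    unittest.main()
```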

Acceptance tests: These typically test the application as a whole to provide assurance that a higher level of functionality operates as designed (e.g., the business acceptance criteria for a user story, the correctness of an API) and that regression errors have not been introduced (i.e., we broke functionality that was previously operating correctly). Humble and Farley define the difference between unit and acceptance testing as, “The aim of a unit test is to show that a single part of the application does what the programmer intends it to....The objective of acceptance tests is to prove that our application does what the customer meant it to, not that it works the way its programmers think it should.” After a build passes our unit tests, our deployment pipeline runs it against our acceptance tests. Any build that passes our acceptance tests is then typically made available for manual testing (e.g., exploratory testing, UI testing, etc.) as well as for integration testing. (Location 2729)

Tags: testing

Note: Acceptance testing checks if the software works the way users want it to, rather than how developers think it should work

Integration tests: Integration tests are where we ensure that our application correctly interacts with other production applications and services, as opposed to calling stubbed out interfaces.

As Humble and Farley observe, “Much of the work in the SIT environment involves deploying new versions of each of the applications until they all cooperate. In this situation the smoke test is usually a fully fledged set of acceptance tests that run against the whole application.”

Integration tests are performed on builds that have passed our unit and acceptance tests. Because integration tests are often brittle, we want to minimize the number of integration tests and find as many of our defects as possible during unit and acceptance testing. The ability to use virtual or simulated versions of remote services when running acceptance tests becomes an essential architectural requirement. (Location 2736)

Tags: stubs, integration, testing

Note: .testing test that our applications integrate with other systems rather than just stubs

CATCH ERRORS AS EARLY IN OUR AUTOMATED TESTING AS POSSIBLE

A specific design goal of our automated test suite is to find errors as early in the testing as possible. This is why we run faster-running automated tests (e.g., unit tests) before slower-running automated tests (e.g., acceptance and integration tests), which are both run before any manual testing. (Location 2751)

Tags: testing

Note: .testing catch bugs early. Unit tests run faster than acceptance or integration tests so we run them first
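
A minimal sketch of the “cheapest tests first” idea: a hypothetical pipeline runner that executes the fast unit-test suite before the slower acceptance and integration suites, stopping at the first failing stage (the stage commands are placeholders, not the book’s tooling):

```python
import subprocess
import sys

# Ordered from fastest to slowest so errors are caught as early as possible.
# The commands are placeholders; substitute your project's real test runners.
STAGES = [
    ("unit tests", ["pytest", "tests/unit", "-q"]),
    ("acceptance tests", ["pytest", "tests/acceptance", "-q"]),
    ("integration tests", ["pytest", "tests/integration", "-q"]),
]

def run_pipeline():
    for name, command in STAGES:
        print(f"Running {name}...")
        result = subprocess.run(command)
        if result.returncode != 0:
            # Fail fast: later, slower stages never run if a cheap stage fails.
            print(f"{name} failed; stopping the pipeline.")
            sys.exit(result.returncode)
    print("All stages passed; build is in a deployable state.")

if __name__ == "__main__":
    run_pipeline()
```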

Humble and Farley, Continuous Delivery, Kindle edition, location 3868.) (Location 2775)

Tags: toread

Note: .toread

Continuous integration was designed to solve this problem by making merging into trunk a part of everyone’s daily work. (Location 2947)

Tags: ci

Note: .ci we encourage continuous integration to trunk so that merge issues are discovered early

Gruver and his team created a goal of increasing the time spent on innovation and new functionality by a factor of ten. The team hoped this goal could be achieved through:

- Continuous integration and trunk-based development
- Significant investment in test automation
- Creation of a hardware simulator so tests could be run on a virtual platform
- The reproduction of test failures on developer workstations
- A new architecture to support running all printers off a common build and release (Location 2961)

Tags: ci

Note: .ci

“Without automated testing, continuous integration is the fastest way to get a big pile of junk that never compiles or runs correctly.” (Location 2975)

Tags: ci, testing

Note: .testing .ci

they created a culture that halted all work anytime a developer broke the deployment pipeline, ensuring that developers quickly brought the system back into a green state. (Location 2980)

Tags: pipeline

Note: .pipeline

Our countermeasure to large batch size merges is to institute continuous integration and trunk-based development practices, where all developers check in their code to trunk at least once per day. Checking code in this frequently reduces our batch size to the work performed by our entire developer team in a single day. The more frequently developers check in their code to trunk, the smaller the batch size and the closer we are to the theoretical ideal of single-piece flow. (Location 3022)

Tags: ci

Note: .ci each developer should commit to trunk at least once per day

including unit tests in JUnit, regression tests in Selenium, and getting a deployment pipeline running in TeamCity. (Location 3061)

Tags: testing

Note: .testing

12 Automate and Enable Low-Risk Releases

Once we have the process documented, our goal is to simplify and automate as many of the manual steps as possible, such as:

- Packaging code in ways suitable for deployment
- Creating pre-configured virtual machine images or containers
- Automating the deployment and configuration of middleware
- Copying packages or files onto production servers
- Restarting servers, applications, or services
- Generating configuration files from templates
- Running automated smoke tests to make sure the system is working and correctly configured
- Running testing procedures
- Scripting and automating database migrations (Location 3131)

Tags: deploying

Note: .deploying
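
To make the list above concrete, a hedged sketch of automating a few of those steps for one server; the host name, package path, service name, and health URL are placeholders, and a real pipeline would use proper deployment tooling rather than raw ssh:

```python
import subprocess
import urllib.request

HOST = "app-server-1.example.com"       # placeholder host
PACKAGE = "build/myapp-1.4.2.tar.gz"    # placeholder artifact from the pipeline
SERVICE = "myapp"                       # placeholder systemd service name
HEALTH_URL = f"https://{HOST}/health"   # placeholder smoke-test endpoint

def deploy():
    # Copy the package onto the production server.
    subprocess.run(["scp", PACKAGE, f"{HOST}:/opt/myapp/releases/"], check=True)

    # Restart the service so it picks up the new release.
    subprocess.run(["ssh", HOST, f"sudo systemctl restart {SERVICE}"], check=True)

    # Automated smoke test: the deployment only counts as done if the
    # system responds as healthy and correctly configured.
    with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
        if response.status != 200:
            raise RuntimeError(f"Smoke test failed: HTTP {response.status}")
    print("Deployment succeeded and smoke test passed.")

if __name__ == "__main__":
    deploy()
```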

the development environments were missing many production assets such as security, firewalls, load balancers, and a SAN.” (Location 3165)

Tags: environments

Note: .environments

Deployment is the installation of a specified version of software to a given environment (e.g., deploying code into an integration test environment or deploying code into production). Specifically, a deployment may or may not be associated with a release of a feature to customers. (Location 3276)

Tags: deployment

Note: .deployment install software in a given environment

Release is when we make a feature (or set of features) available to all our customers or a segment of customers (e.g., we enable the feature to be used by 5% of our customer base). Our code and environments should be architected in such a way that the release of functionality does not require changing our application code. (Location 3278)

Tags: release

Note: .release make feature available to customers
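
One common way to decouple release from deployment is a feature flag: the new feature’s code is already deployed, and releasing it to some or all customers is just a configuration change. A minimal sketch, with hypothetical flag names and percentages:

```python
import hashlib

# Flag configuration lives outside the application code (config service,
# database, etc.), so releasing to 5% or 100% of customers needs no redeploy.
FEATURE_FLAGS = {
    "new_checkout_page": {"enabled": True, "rollout_percent": 5},
}

def is_enabled(flag_name, customer_id):
    flag = FEATURE_FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Hash the customer id so each customer gets a stable yes/no answer and
    # the rollout percentage is roughly honored across the customer base.
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

# Usage inside the application:
if is_enabled("new_checkout_page", "customer-42"):
    print("render new checkout page")
else:
    print("render existing checkout page")
```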

Environment-based release patterns: This is where we have two or more environments that we deploy into, but only one environment is receiving live customer traffic (e.g., by configuring our load balancers). New code is deployed into a non-live environment, and the release is performed moving traffic to this environment. These are extremely powerful patterns, because they typically require little or no change to our applications. These patterns include blue-green deployments, canary releases, and cluster immune systems, all of which will be discussed shortly. (Location 3290)

Tags: release, environment

Note: .environment .release 2 environments. Deploy into non-live and then release by directing traffic to it

Decoupling deployments from our releases dramatically changes how we work. We no longer have to perform deployments in the middle of the night or on weekends to lower the risk of negatively impacting customers. Instead, we can do deployments during typical business hours, enabling Ops to finally have normal working hours, just like everyone else. (Location 3302)

Tags: deployments

Note: .deployments

Blue-green deployment.

In this pattern, we have two production environments: blue and green. At any time, only one of these is serving customer traffic.

To release a new version of our service, we deploy to the inactive environment where we can perform our testing without interrupting the user experience. When we are confident that everything is functioning as designed, we execute our release by directing traffic to the blue environment. Thus, blue becomes live and green becomes staging. Roll back is performed by sending customer traffic back to the green environment. (Location 3307)

Tags: environments, deployment

Note: .deployment .environments
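
A minimal sketch of the blue-green switch itself, assuming a load balancer that can be repointed through an admin API; the URL, payload, and server names here are hypothetical, not a specific product’s API:

```python
import json
import urllib.request

LOAD_BALANCER_API = "https://lb.example.com/api/active-pool"  # hypothetical admin endpoint

ENVIRONMENTS = {
    "blue": ["blue-1.example.com", "blue-2.example.com"],
    "green": ["green-1.example.com", "green-2.example.com"],
}

def release(target_env):
    """Point live customer traffic at the given environment (the release step).

    The new version was already deployed and tested in this environment while
    it was inactive; rollback is just calling release() with the other colour.
    """
    payload = json.dumps({"servers": ENVIRONMENTS[target_env]}).encode()
    request = urllib.request.Request(
        LOAD_BALANCER_API,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        if response.status != 200:
            raise RuntimeError(f"Switch to {target_env} failed: HTTP {response.status}")
    print(f"{target_env} is now live.")

# release("blue")   # blue becomes live, green becomes staging
# release("green")  # rollback: send customer traffic back to green
```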

For example, we may have 1% of our online users make invisible calls to a new feature scheduled to be launched to see how our new feature behaves under load. After we find and fix any problems, we progressively increase the simulated load by increasing the frequency and number of users exercising the new functionality. By doing this, we are able to safely simulate production-like loads, giving us confidence that our service will perform as it needs to. (Location 3407)

Tags: feature, deployment

Note: Make invisible calls to a new feature to test it under load
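
A hedged sketch of that dark-launch technique: a small percentage of real requests also make an invisible call to the not-yet-released feature, and the result is discarded so users never see it (the percentage, function names, and search example are illustrative):

```python
import hashlib
import logging

DARK_LAUNCH_PERCENT = 1  # start with 1% of users, then increase gradually

def in_dark_launch(user_id, percent=DARK_LAUNCH_PERCENT):
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def handle_search(user_id, query):
    results = old_search(query)  # what the user actually sees

    if in_dark_launch(user_id):
        # Invisible call: exercises the new feature under production load,
        # but any failure or slow response never affects the user.
        try:
            new_search(query)
        except Exception:
            logging.exception("dark-launched search failed")

    return results

# old_search / new_search are hypothetical stand-ins for the existing
# implementation and the feature scheduled to be launched.
def old_search(query):
    return [f"result for {query}"]

def new_search(query):
    return [f"new result for {query}"]

print(handle_search("customer-42", "hotels in lisbon"))
```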

13 Architect for Low-Risk Releases

As described earlier, the strangler application pattern involves placing existing functionality behind an API, where it remains unchanged, and implementing new functionality using our desired architecture, making calls to the old system when necessary. When we implement strangler applications, we seek to access all services through versioned APIs, also called versioned services or immutable services. (Location 3628)
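
A minimal sketch of the strangler pattern described above: a versioned API facade routes each capability to the new service once it exists, and otherwise falls through to the unchanged legacy system (the service names and routing table are hypothetical):

```python
# Routing table for /v1 of the API: capabilities migrate to the new
# architecture one at a time; everything else still goes to the legacy system.
NEW_SERVICE_ROUTES = {
    "invoicing": "https://invoicing.internal/v1",   # already re-implemented
    # "reporting" is not listed yet, so it still goes to the old system
}

LEGACY_BASE_URL = "https://legacy-monolith.internal"

def route(capability, path):
    """Return the backend URL that should serve this request."""
    if capability in NEW_SERVICE_ROUTES:
        return f"{NEW_SERVICE_ROUTES[capability]}/{path}"
    # Existing functionality stays behind the same API, unchanged.
    return f"{LEGACY_BASE_URL}/{capability}/{path}"

print(route("invoicing", "invoices/123"))   # served by the new service
print(route("reporting", "monthly/2024"))   # still served by the legacy system
```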

Part IV Introduction

14 Create Telemetry to Enable Seeing and Solving Problems

we should ensure that all potentially significant application events generate logging entries, including those provided on this list assembled by Anton A. Chuvakin, a research VP at Gartner’s GTP Security and Risk Management group:

- Authentication/authorization decisions (including logoff)
- System and data access
- System and application changes (especially privileged changes)
- Data changes, such as adding, editing, or deleting data
- Invalid input (possible malicious injection, threats, etc.)
- Resources (RAM, disk, CPU, bandwidth, or any other resource that has hard or soft limits)
- Health and availability
- Startups and shutdowns
- Faults and errors
- Circuit breaker trips
- Delays
- Backup success/failure

To make it easier to interpret and give meaning to all these log entries, we should (ideally) create logging hierarchical categories, such as for non-functional attributes (e.g., performance, security) and for attributes related to features (e.g., search, ranking). (Location 3843)

Tags: devops, logging

Note: .logging .devops
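
A small sketch of structured, categorized logging along those lines using Python’s standard logging module; the category names and event fields are illustrative, not a prescribed schema:

```python
import json
import logging

# Hierarchical logger names act as the categories: security vs. performance,
# and feature-level loggers such as search or ranking.
logging.basicConfig(level=logging.INFO, format="%(name)s %(levelname)s %(message)s")
security_log = logging.getLogger("app.security.auth")
performance_log = logging.getLogger("app.performance.search")

def log_event(logger, level, event, **fields):
    """Emit one significant application event as a structured log entry."""
    logger.log(level, json.dumps({"event": event, **fields}))

# Authentication/authorization decisions
log_event(security_log, logging.INFO, "login_succeeded", user="alice")
log_event(security_log, logging.WARNING, "login_failed", user="alice", reason="bad_password")

# Data changes and resource-related events
log_event(security_log, logging.INFO, "record_deleted", table="customers", record_id=42)
log_event(performance_log, logging.WARNING, "search_slow", duration_ms=2300, threshold_ms=500)
```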

In the previous steps, we enabled Development and Operations to create and improve production telemetry as part of their daily work. In this step, our goal is to radiate this information to the rest of the organization, ensuring that anyone who wants information about any of the services we are running can get it without needing production system access or privileged accounts, or having to open up a ticket and wait for days for someone to configure the graph for them. (Location 3893)

Tags: devops

Note: .devops make monitoring data widely available

We want our production telemetry to be highly visible, which means putting it in central areas where Development and Operations work, thus allowing everyone who is interested to see how our services are performing. At a minimum, this includes everyone in our value stream, such as Development, Operations, Product Management, and Infosec. (Location 3898)

Tags: devops

Note: .devops make producrion monitoring data clearly visible

we create enough telemetry at all levels of the application stack for all our environments, as well as for the deployment pipelines that support them. We need metrics from the following levels:

- Business level: Examples include the number of sales transactions, revenue of sales transactions, user signups, churn rate, A/B testing results, etc.
- Application level: Examples include transaction times, user response times, application faults, etc.
- Infrastructure level (e.g., database, operating system, networking, storage): Examples include web server traffic, CPU load, disk usage, etc.
- Client software level (e.g., JavaScript on the client browser, mobile application): Examples include application errors and crashes, user measured transaction times, etc.
- Deployment pipeline level: Examples include build pipeline status (e.g., red or green for our various automated test suites), change deployment lead times, deployment frequencies, test environment promotions, and environment status. (Location 3947)

Tags: reporting, devops

Note: .devops .reporting
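
For the business- and application-level metrics above, a hedged sketch of emitting counters and timings over the widely used StatsD line protocol; the metric names and the collector address are placeholders:

```python
import socket
import time

STATSD_ADDRESS = ("localhost", 8125)  # placeholder metrics collector
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def count(metric, value=1):
    # StatsD counter, e.g. business-level "business.sales.transactions"
    _sock.sendto(f"{metric}:{value}|c".encode(), STATSD_ADDRESS)

def timing(metric, milliseconds):
    # StatsD timer, e.g. application-level "app.checkout.response_time"
    _sock.sendto(f"{metric}:{milliseconds}|ms".encode(), STATSD_ADDRESS)

# Business level
count("business.sales.transactions")
count("business.user.signups")

# Application level
start = time.monotonic()
# ... handle a request ...
timing("app.checkout.response_time", int((time.monotonic() - start) * 1000))
```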

15 Analyze Telemetry to Better Anticipate Problems and Achieve Goals

16 Enable Feedback So Development and Operations Can Safely Deploy Code

Development management see that business goals are not achieved simply because features have been marked as “done.” Instead, the feature is only done when it is performing as designed in production, without causing excessive escalations or unplanned work for either Development or Operations. (Location 4307)

Tags: product development

Note: A feature is only done when working as expected in production

do what Google does, which is have Development groups self-manage their services in production before they become eligible for a centralized Ops group to manage. By having developers be responsible for deployment and production support, we are far more likely to have a smooth transition to Operations. (Location 4350)

Note: Developers manage deployment to production and support before handing over to ops

By creating launch guidance, we help ensure that every product team benefits from the cumulative and collective experience of the entire organization, especially Operations. Launch guidance and requirements will likely include the following:

- Defect counts and severity: Does the application actually perform as designed?
- Type/frequency of pager alerts: Is the application generating an unsupportable number of alerts in production?
- Monitoring coverage: Is the coverage of monitoring sufficient to restore service when things go wrong?
- System architecture: Is the service loosely-coupled enough to support a high rate of changes and deployments in production?
- Deployment process: Is there a predictable, deterministic, and sufficiently automated process to deploy code into production?
- Production hygiene: Is there evidence of enough good production habits that would allow production support to be managed by anyone else? (Location 4355)

Tags: deployment

Note: .deployment launch guidance checklist

“Every time we do a launch, we learn something. There will always be some people who are less experienced than others doing releases and launches. The LRR and HRR checklists are a way to create that organizational memory.” (Location 4419)

Tags: checklists

Note: .checklists checklists help create organisational memory

17 Integrate Hypothesis-Driven Development and A/B Testing into Our Daily Work

We Believe that increasing the size of hotel images on the booking page
Will Result in improved customer engagement and conversion
We Will Have Confidence To Proceed When we see a 5% increase in customers who review hotel images who then proceed to book in forty-eight hours. (Location 4554)

Tags: splittest

Note: .splittest state at what point you will have confidence to roll out to more customers
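
A toy sketch of checking that “confidence to proceed” criterion against experiment data; the numbers are hypothetical, and a real analysis would also test statistical significance (e.g., a two-proportion test):

```python
# Hypothetical 48-hour experiment results for the hotel-image hypothesis.
control = {"viewed_images": 4000, "booked": 320}     # current image size
treatment = {"viewed_images": 4100, "booked": 352}   # larger images

def conversion(group):
    return group["booked"] / group["viewed_images"]

baseline = conversion(control)
variant = conversion(treatment)
relative_lift = (variant - baseline) / baseline

print(f"control conversion:   {baseline:.1%}")
print(f"treatment conversion: {variant:.1%}")
print(f"relative lift:        {relative_lift:.1%}")

# The hypothesis stated the threshold up front: proceed only if the lift in
# image-viewers who go on to book within 48 hours is at least 5%.
if relative_lift >= 0.05:
    print("Confidence to proceed: roll the change out more widely.")
else:
    print("Hypothesis not supported: keep the existing design.")
```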

18 Create Review and Coordination Processes to Increase Quality of Our Current Work

One of the core beliefs in the Toyota Production System is that “people closest to a problem typically know the most about it.” (Location 4666)

Tags: problems, problemhacking

Note: Ask those closest to the problem what the issue is

high-performing organizations relied more on peer review and less on external approval of changes. (Location 4671)

Note: Those closest to the problem know it best

ENABLE PEER REVIEW OF CHANGES

Instead of requiring approval from an external body prior to deployment, we may require engineers to get peer reviews of their changes. In Development, this practice has been called code review, but it is equally applicable to any change we make to our applications or environments, including servers, networking, and databases.‡ The goal is to find errors by having fellow engineers close to the work scrutinize our changes. This review improves the quality of our changes, which also creates the benefits of cross-training, peer learning, and skill improvement. (Location 4695)

Tags: peerreview

Note: .peerreview

“There is a non-linear relationship between the size of the change and the potential risk of integrating that change—when you go from a ten line code change to a one hundred line code, the risk of something going wrong is more than ten times higher, and so forth.” This is why it’s so essential for developers to work in small, incremental steps rather than on long-lived feature branches. (Location 4706)

Part V Introduction

19 Enable and Inject Learning into Daily Work

Blameless post-mortems, a term coined by John Allspaw, help us examine “mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure.” To do this, we schedule the post-mortem as soon as possible after the accident occurs and before memories and the links between cause and effect fade or circumstances change. (Of course, we wait until after the problem has been resolved so as not to distract the people who are still actively working on the issue.) (Location 4968)

Dan Milstein, one of the principal engineers at Hubspot, writes that he begins all blameless post-mortem meetings by saying, “We’re trying to prepare for a future where we’re as stupid as we are today.” In other words, it is not acceptable to have a countermeasure to merely “be more careful” or “be less stupid”— instead, we must design real countermeasures to prevent these errors from happening again. Examples of such countermeasures include new automated tests to detect dangerous conditions in our deployment pipeline, adding further production telemetry, identifying categories of changes that require additional peer review, and conducting rehearsals of this category of failure as part of regularly scheduled Game Day exercises. (Location 5002)

Note: Don't have generic countermeasures like "be more careful". Be specific: add more tests

After we conduct a blameless post-mortem meeting, we should widely announce the availability of the meeting notes and any associated artifacts (e.g., timelines, IRC chat logs, external communications). This information should (ideally) be placed in a centralized location where our entire organization can access it and learn from the incident. Conducting post-mortems is so important that we may even prohibit production incidents from being closed until the post-mortem meeting has been completed. (Location 5009)

Tags: retrospectives, post mortem, meetings

Note: Publicise post-mortem meeting notes, including timelines

Resilience requires that we first define our failure modes and then perform testing to ensure that these failure modes operate as designed. One way we do this is by injecting faults into our production environment and rehearsing large-scale failures so we are confident we can recover from accidents when they occur, ideally without even impacting our customers. (Location 5086)

Tags: failure

Note: .failure plan and test your failure mode

By creating failure in a controlled situation, we can practice and create the playbooks we need. One of the other outputs of Game Days is that people actually know who to call and know who to talk to—by doing this, they develop relationships with people in other departments so they can work together during an incident, turning conscious actions into unconscious actions that are able to become routine. (Location 5135)

Note: Simulating failures ensures people know who to reach out to in other departments when real failures occur

20 Convert Local Discoveries into Global Improvements

21 Reserve Time to Create Organizational Learning and Improvement (Location 5365)

Note: .h2

improvement blitz (or sometimes a kaizen blitz), defined as a dedicated and concentrated period of time to address a particular issue, often over the course of several days. Dr. Spear explains, “...blitzes often take this form: A group is gathered to focus intently on a process with problems… The blitz lasts a few days, the objective is process improvement, and the means are the concentrated use of people from outside the process to advise those normally inside the process.” (Location 5367)

Tags: kaizen

Note: .kaizen focus intently on an improvement over a few days

CONCLUSION TO PART V

Throughout Part V, we explored the practices that help create a culture of learning and experimentation in your organization. Learning from incidents, creating shared repositories, and sharing learnings is essential when we work in complex systems, helping to make our work culture more just and our systems safer and more resilient. (Location 5530)

Tags: learning

Note: .learning

Part VI Introduction

22 Information Security as Everyone’s Job, Every Day

Ultimately, our goal is to provide the security libraries or services that every modern application or environment requires, such as enabling user authentication, authorization, password management, data encryption, and so forth. (Location 5621)

Tags: infosec

Note: .infosec authentication, password management, encryption
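
As one tiny example of the shared security services idea, a sketch of a password-hashing helper built on Python’s standard library so individual teams don’t roll their own; parameters such as the iteration count are illustrative, not a recommendation:

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # illustrative work factor; tune to your environment

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived key) for storage; never store the raw password."""
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, key

def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    # Constant-time comparison to avoid leaking information through timing.
    return hmac.compare_digest(key, expected_key)

salt, key = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, key))  # True
print(verify_password("wrong guess", salt, key))                   # False
```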

In order to detect problematic user behavior that could be an indicator or enabler of fraud and unauthorized access, we must create the relevant telemetry in our applications. Examples may include:

- Successful and unsuccessful user logins
- User password resets
- User email address resets
- User credit card changes

For instance, as an early indicator of brute-force login attempts to gain unauthorized access, we might display the ratio of unsuccessful login attempts to successful logins. (Location 5809)

Tags: security

Note: Log unsuccessful login attempts
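
A small sketch of the brute-force indicator mentioned above: track successful and unsuccessful logins and flag when the failure-to-success ratio crosses a threshold (the counts and threshold are illustrative):

```python
from collections import Counter

login_events = Counter()
RATIO_THRESHOLD = 3.0  # illustrative: alert when failures reach 3x successes

def record_login(successful):
    login_events["success" if successful else "failure"] += 1

def brute_force_suspected():
    successes = max(login_events["success"], 1)  # avoid division by zero
    ratio = login_events["failure"] / successes
    return ratio > RATIO_THRESHOLD, ratio

# Simulated telemetry for one time window
for _ in range(20):
    record_login(successful=True)
for _ in range(90):
    record_login(successful=False)

alert, ratio = brute_force_suspected()
print(f"failed/successful login ratio: {ratio:.1f}")
if alert:
    print("Possible brute-force login attempts: investigate.")
```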

23 Protecting the Deployment Pipeline

order for normal changes to be authorized, the CAB will almost certainly have a well-defined request for change (RFC) form that defines what information is required for the go/no-go decision. The RFC form usually includes the desired business outcomes, planned utility and warranty,† a business case with risks and alternatives, and a proposed schedule.‡ (Location 5914)

Tags: cab

Note: .cab you usually complete a request for change form for cab

as complexity and deployment frequency increase, performing production deployments successfully increasingly requires everyone in the value stream to quickly see the outcomes of their actions. Separation of duty often can impede this by slowing down and reducing the feedback engineers receive on their work. This prevents engineers from taking full responsibility for the quality of their work and reduces a firm’s ability to create organizational learning. (Location 5998)

Lean principles focus on creating value for the customer—thinking systematically, creating constancy of purpose, embracing scientific thinking, creating flow and pull (versus push), assuring quality at the source, leading with humility, and respecting every individual. (Location 6180)

Tags: lean

Note: .lean

THE CONTINUOUS DELIVERY MOVEMENT

Building upon the Development discipline of continuous build, test, and integration, Jez Humble and David Farley extended the concept of continuous delivery, which included a “deployment pipeline” to ensure that code and infrastructure are always in a deployable state and that all code checked in to trunk is deployed into production. (Location 6195)

For more on value stream mapping, see Value Stream Mapping: How to Visualize Work and Align Leadership for Organizational Transformation by Karen Martin and Mike Osterling. (Location 6359)

Tags: toread

Note: .toread