The Phoenix Project

My heart lurches as all the implications sink in. I’ve seen this movie before. The plot is simple: First, you take an urgent date-driven project, where the shipment date cannot be delayed because of external commitments made to Wall Street or customers. Then you add a bunch of developers who use up all the time in the schedule, leaving no time for testing or operations deployment. And because no one is willing to slip the deployment date, everyone after Development has to take outrageous and unacceptable shortcuts to hit the (Location 711)

I nod silently but refuse to say more. I always liked that phrase in Saving Private Ryan: “There’s a chain of command: gripes go up, not down.” (Location 821)

Tags: leadership

Note: Don't complain to people lower down the chain of command

I have a sinking feeling in the pit of my stomach. How can we manage production if we don’t know what the demand, priorities, status of work in process, and resource availability are? Suddenly, I’m kicking myself that I didn’t ask these questions on my first day. (Location 1033)

Tags: software delivery, kanban

Note: You need to know the demand, priority, status of WIP and resources available

“I merely want a one-line description about what all that work is and how long they think it will take!” Realizing how this might come across, I add, “Make sure you tell people that we’re doing this so we can get more resources. I don’t want anyone thinking that we’re outsourcing or firing anyone, okay?” Patty nods. “We should have done this a long time ago. We bump up the priorities of things all the time, but we never really know what just got bumped down. That is, until someone screams at us, demanding to know why we haven’t delivered something.” (Location 1060)

Is it a change or not? I can see both sides of the argument. After thirty minutes of arguing, it’s still not clear that we know the definition of what a “change” should be. Was rebooting a server a change? Yes, because we don’t want anyone rebooting servers willy-nilly, especially if it’s running a critical service. How about turning off a server? Yes, for the same reason. How about turning on a server? No, we all thought. That is, until someone came up with the example of turning on a duplicate DHCP server, which screwed up the entire enterprise network for twenty-four hours. A half hour later, we finally write on the whiteboard: “a ‘change’ is any activity that is physical, logical, or virtual to applications, databases, operating systems, networks, or hardware that could impact services being delivered.” (Location 1207)

Tags: change management

Note: Define your terms

“We spent a lot of blood, sweat, and tears creating our old change management policy, and everyone still blew it off. What makes you think this will be any different?” I shrug. “I don’t know. But we’ll keep trying things until we have a system that works, and I’m going to make sure everyone keeps helping us get there. It’s not just to satisfy the audit findings. We need some way to plan, communicate, and make our changes safely. I can guarantee you that if we don’t change the way we work, I’ll be soon out of a job.” (Location 1222)

Tags: change management

Note: Need to plan, communicate and make changes safely

“You probably don’t even see when work is committed to your organization. And if you can’t see it, you can’t manage it—let alone organize it, sequence it, and have any assurance that your resources can complete (Location 1333)

Tags: kanban

Note: If you can't see it you can't manage it

“In the 1980s, this plant was the beneficiary of three incredible scientifically-grounded management movements. You’ve probably heard of them: the Theory of Constraints, Lean production or the Toyota Production System, and Total Quality Management. Although each movement started in different places, they all agree on one thing: WIP is the silent killer. Therefore, one of the most critical mechanisms in the management of any plant is job and materials release. Without it, you can’t control WIP.” (Location 1355)

Tags: wip

Note: .wip
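Job and materials release is the mechanism that keeps WIP under control. One way to picture it is a board that stops releasing new work once a WIP limit is reached. This is a minimal sketch. The class name `KanbanBoard`, the limit of 3 and the card names are hypothetical and do not come from the book.

```python
from collections import deque


class KanbanBoard:
    """Hypothetical board that controls WIP by gating job release."""

    def __init__(self, wip_limit: int) -> None:
        self.wip_limit = wip_limit
        self.backlog: deque[str] = deque()  # committed work not yet released
        self.in_progress: list[str] = []    # released work, i.e. WIP
        self.done: list[str] = []

    def submit(self, card: str) -> None:
        # New requests wait in the backlog instead of starting immediately.
        self.backlog.append(card)

    def release(self) -> bool:
        # Job release: start the next card only while WIP is below the limit.
        if self.backlog and len(self.in_progress) < self.wip_limit:
            self.in_progress.append(self.backlog.popleft())
            return True
        return False

    def finish(self, card: str) -> None:
        # Finishing a card frees a slot for the next release.
        self.in_progress.remove(card)
        self.done.append(card)


board = KanbanBoard(wip_limit=3)
for card in ["vendor patch", "audit fix", "Phoenix task", "server build"]:
    board.submit(card)
while board.release():
    pass
print(board.in_progress)    # ['vendor patch', 'audit fix', 'Phoenix task']
print(list(board.backlog))  # ['server build']
```

The fourth card stays in the backlog until one of the three in-progress cards is finished.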

“Dr. Eliyahu M. Goldratt, who created the Theory of Constraints, showed us how any improvements made anywhere besides the bottleneck are an illusion. Astonishing, but true! Any improvement made after the bottleneck is useless, because it will always remain starved, waiting for work from the bottleneck. And any improvements made before the bottleneck merely results in more inventory piling up at the bottleneck. (Location 1375)

Tags: bottlenecks, kanban

Note: Improvements made anywhere but the bottleneck are an illusion
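A deliberately simplified model shows why improvements away from the bottleneck are an illusion. It treats a serial line as deterministic, so the whole line can only produce as fast as its slowest work center. The model ignores variability and the inventory that piles up in front of the bottleneck. The work-center names and capacities are hypothetical.

```python
def line_throughput(capacities: dict[str, float]) -> float:
    """A serial line can never produce faster than its slowest work center."""
    return min(capacities.values())


def bottleneck(capacities: dict[str, float]) -> str:
    """The constraint is the work center with the lowest capacity."""
    return min(capacities, key=lambda name: capacities[name])


# Hypothetical capacities in jobs per day for a server-build line.
line = {"procurement": 10, "OS install": 8, "Brent configures": 3, "validation": 9}
print(bottleneck(line), line_throughput(line))  # Brent configures 3

# Doubling a non-bottleneck step changes nothing...
faster_validation = {**line, "validation": 18}
print(line_throughput(faster_validation))  # 3

# ...while raising the bottleneck's capacity raises the whole line's throughput.
faster_brent = {**line, "Brent configures": 6}
print(line_throughput(faster_brent))  # 6
```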

the Three Ways,” he says. “The First Way helps us understand how to create fast flow of work as it moves from Development into IT Operations, because that’s what’s between the business and the customer. The Second Way shows us how to shorten and amplify feedback loops, so we can fix quality at the source and avoid rework. And the Third Way shows us how to create a culture that simultaneously fosters experimentation, learning from failure, and understanding that repetition and practice are the prerequisites to mastery.” (Location 1398)

Tags: kanban, agile

Note: 1 = Create fast flow from Dev to Ops; 2 = Shorten feedback loops, fix issues at the source and reduce rework; 3 = Create a culture that embraces learning and experimentation

We make a pile of nearly fifty cards proposing changes to the Rainbow, Saturn, and Taser applications, and also changes to the network and certain shared databases, which could impact a significant portion, or even all, of the business. “Even looking at those cards makes my heart palpitate,” Wes says. “These are some of the dangerous changes we make around here.” He’s right. I say, “Okay, let’s mark all of these as ‘fragile.’ These are high risk and must be authorized by the CAB. Patty, changes like this should be at the top of the pile during our meetings.” Patty nods, taking notes saying, “Got it. We’re predefining high-risk categories of change that not only must have change requests submitted, but must have authorization before being scheduled and implemented.” (Location 1544)

Tags: change management, risk, categorisation

Note: Categorise work items by risk level

Wes laughs and adds wryly, “Yeah, in the case of puccar, have the coroner stock up on a bunch of body bags, too. And a PR person ready to handle the angry phone calls from the business, saying that some customers were allergic to the foam we used.” I laugh. “You know, that’s an interesting idea. Let’s let the business choose the foam. There’s no reason why all the responsibility should rest on our shoulders. We can send an e-mail out to the business ahead of time and ask when the best implementation time would be. If we can give them data on the outcomes of previous changes, they may even withdraw the change.” Patty is typing away. “Got it. For these types of changes, I’ll have my staff generate some reports on the changes’ success rates and any associated downtime. This will help the business make more informed decisions around the changes.” (Location 1555)

Tags: deployment

Note: Monitor the success rates of deployments of risky applications

“That leaves about two hundred changes that are medium-risk changes that we still need to look at.” “I agree with Wes,” I respond. “For these, we need to trust that the manager knows what he or she is doing. But I’d like Patty to verify that people have appropriately informed anyone they could affect, and gotten the ‘okay to proceed’ from all of them.” (Location 1573)

Tags: change management
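The risk categories agreed above can be written down as scheduling rules. Fragile changes need CAB authorization. Medium-risk changes need an "okay to proceed" from everyone they could affect. This is a sketch under stated assumptions. The names `Risk`, `ChangeRequest` and `may_schedule` are hypothetical. The `LOW` rule is a placeholder, because these highlights only cover fragile and medium-risk changes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Risk(Enum):
    FRAGILE = "fragile"  # high risk: must be authorized by the CAB
    MEDIUM = "medium"    # manager is trusted, but affected parties must okay it
    LOW = "low"          # assumption: not covered in these notes, just recorded


@dataclass
class ChangeRequest:
    title: str
    risk: Risk
    cab_authorized: bool = False
    affected_parties: set[str] = field(default_factory=set)
    okays: set[str] = field(default_factory=set)


def may_schedule(change: ChangeRequest) -> bool:
    """Return True if the change can be scheduled under these hypothetical rules."""
    if change.risk is Risk.FRAGILE:
        return change.cab_authorized
    if change.risk is Risk.MEDIUM:
        # Everyone the change could affect must have given the okay to proceed.
        return change.affected_parties <= change.okays
    return True


network = ChangeRequest("core switch firmware", Risk.FRAGILE)
print(may_schedule(network))  # False until the CAB authorizes it
```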

Every time that we let Brent fix something that none of us can replicate, Brent gets a little smarter, and the entire system gets dumber. We’ve got to put an end to that. (Location 1817)

Note: Prevent bottlenecks by ensuring knowledge is shared amongst the team

I smirk at the reference to smoke tests, a term circuit designers use. The saying goes, “If you turn the circuit board on and no smoke comes out, it’ll probably work.” (Location 2002)

Tags: qa

Not surprisingly, Sarah is unimpressed. As soon as I stop talking, she says, “We’ve all been busting ass getting Phoenix this far. Marketing is ready, Development is ready. Everyone is ready but you. I’ve told you before, but apparently, you’re not listening: Perfection is the enemy of good. We’ve got to keep going.” (Location 2067)

“It’s harder than ever to convince the business to do the right thing. They’re like kids in a candy store. They read in an airline magazine that they can manage their whole supply chain in the cloud for $499 per year, and suddenly that’s the main company initiative. When we tell them it’s not actually that easy, and show them what it takes to do it right, they disappear. Where did they go? They’re talking to their Cousin Vinnie or some outsourcing sales guy who promises they can do it in a tenth of the time and cost.” (Location 2390)

“It’s like the free puppy,” I continue. “It’s not the upfront capital that kills you, it’s the operations and maintenance on the back end.” Chris cracks up. “Yes, exactly! They’ll say, ‘The puppy can’t quite do everything we need. Can you train it to fly airplanes? It’s just a simple matter of coding, right?’” (Location 2398)

Tags: it ops, maintenance

Note: Operations and maintenance kill you rather than the up front cost

“Yes, I think I can,” I say. “At the plant, I gave you one category, which was business projects, like Phoenix,” I say. “Later, I realized that I didn’t mention internal IT projects. A week after that, I realized that changes are another category of work. But it was only after the Phoenix fiasco that I saw the last one, because of how it prevented all other work from getting completed, and that’s the last category, isn’t it? Firefighting. Unplanned work.” (Location 2578)

Tags: kanban

Note: Business projects, IT projects, changes and unplanned work

“You’ve come much further than I thought: You’ve started to take steps to stabilize the operational environment, you’ve started to visually manage WIP within IT Operations, and you’ve started to protect your constraint, Brent. You’ve also reinforced a culture of operational rigor and discipline. Well done, Bill.” (Location 2597)

Tags: constraint, wip

“There are five focusing steps which Sensei Goldratt describes in The Goal: Step 1 is to identify the constraint. You’ve done that, so congratulations. Keep challenging yourself to really make sure that’s your organizational constraint, because if you’re wrong, nothing you do will matter. Remember, any improvement not made at the constraint is just an illusion, yes? “Step 2 is to exploit the constraint,” he continues. “In other words, make sure that the constraint is not allowed to waste any time. Ever. It should never be waiting on any other resource for anything, and it should always be working on the highest priority commitment the IT Operations organization has made to the rest of the enterprise. Always.” (Location 2612)

Tags: bottlenecks, constraint, kanban

Note: Always have your constraint working on the highest priority tasks
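Step 2, "exploit the constraint", says the constraint should always be working on the highest-priority commitment. One way to sketch that half of the rule is a priority queue that the constrained resource always pulls from the top. The "never waiting on other resources" half is organizational and is not modeled here. The class name, the numeric priority scheme and the task names are hypothetical.

```python
import heapq
from typing import Optional


class ConstraintQueue:
    """Hypothetical work queue for a constrained resource such as Brent."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = 0  # tie-breaker: equal priorities keep arrival order

    def add(self, priority: int, task: str) -> None:
        # Lower number means higher priority, because heapq is a min-heap.
        heapq.heappush(self._heap, (priority, self._counter, task))
        self._counter += 1

    def next_task(self) -> Optional[str]:
        # Whenever the constraint is free, hand it the top commitment, never a lower one.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


brent = ConstraintQueue()
brent.add(3, "tidy up monitoring dashboards")
brent.add(1, "Phoenix production deployment")
brent.add(2, "audit remediation")
print(brent.next_task())  # Phoenix production deployment
```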

Chester, your peer in Development, is spending all his cycles on features, instead of stability, security, scalability, manageability, operability, continuity, and all those other beautiful ’itties. (Location 2637)

Tags: it ops, it security, devops

Note: Don't just focus on new features. Think about security, scalability and stability

“On the other end of the assembly line, Jimmy keeps trying to retrofit production controls after the toothpaste is out of the tube,” he says, scoffing. “Hopeless! Futile! It’ll never work! You need to design these things, what some call ‘nonfunctional requirements,’ into the product. But your problem is that the person who knows the most about where your technical debt is and how to actually build code that is designed for Operations is too busy. You know who that person is, don’t you?” (Location 2638)

Tags: nonfunctional requirements

Five Dysfunctions of a Team, by Patrick Lencioni. (Location 2909)

Tags: toread

Note: .toread

I reply, “Erik has helped me understand that there are four types of IT Operations work: business projects, IT Operations projects, changes, and unplanned work. But, we’re only talking about the first type of work, and the unplanned work that gets created when we do it wrong. We’re only talking about half the work we do in IT Operations.” (Location 3048)

“Erik asked me how we made the same type of decision in IT,” I recall. “I told him then, and I’ll tell you now, I don’t know. I’m pretty sure we don’t do any sort of analysis of capacity and demand before we accept work. Which means we’re always scrambling, having to take shortcuts, which means more fragile applications in production. Which means more unplanned work and firefighting in the future. So, around and around we go.” (Location 3089)

Tags: capacity planning

“Everyone knows that in manufacturing, as WIP increases, due-date performance goes down. (Location 3175)

Tags: wip

Note: As WIP goes up, due dates get pushed out
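Little's Law from queueing theory makes the relationship between WIP and due dates precise. The law is not cited in this highlight. It says that in a stable system, average lead time equals average WIP divided by average throughput. So if throughput stays flat, every extra item of WIP lengthens how long everything waits. The team and its throughput below are hypothetical.

```python
def average_lead_time(wip: float, throughput_per_day: float) -> float:
    """Little's Law for a stable system: lead time = WIP / throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_day


# Hypothetical team that completes 5 items per day, whatever is in flight.
for wip in (10, 20, 40):
    print(wip, average_lead_time(wip, 5.0))  # 2.0, 4.0 and 8.0 days
```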

“every work center is made up of four things: the machine, the man, the method, and the measures. Suppose for the machine, we select the heat treat oven. The men are the two people required to execute the predefined steps, and we obviously will need measures based on the outcomes of executing the steps in the method.” (Location 3344)

“Okay. These ‘security’ projects decrease your project throughput, which is the constraint for the entire business. And swamp the most constrained resource in your organization. And they don’t do squat for scalability, availability, survivability, sustainability, security, supportability, or the defensibility of the organization.” He asks deadpan, “So, genius: Do Jimmy’s projects sound like a good use of time to you?” (Location 3449)

Tags: it security

“Let’s use the example of configuring a server. It involves procurement, installing the OS and applications on it according to some specification, and then getting it racked and stacked. Then we validate that it’s been built correctly. Each of these steps are typically done by different people. Maybe each step is like a work center, each with its own machines, methods, men, and measures.” (Location 3599)

Tags: it ops, server, it

Note: .it .server

We’re still struggling on how to prioritize our own seventy-three internal projects,” she says, her expression turning glum. “There’s still way too many. We’ve spent weeks with all the team leads trying to establish some sort of relative importance level, but that’s all we’ve done. Argue.” She flips to the second page. “The projects seem to fall into the following categories: replacing fragile infrastructure, vendor upgrades, or supporting some internal business requirement. The rest are a hodgepodge of audit and security work, data center upgrade work, and so forth.” (Location 3699)

Tags: it ops, it

Note: IT work includes: replacing fragile infrastructure, vendor upgrades, supporting internal business requirements, audit & security work, and data centre upgrades

Improving something anywhere not at the constraint is an illusion. You know, no offense, but you sort of sound like John right now.” (Location 3721)

Tags: constraint, bottlenecks

“What that graph says is that everyone needs idle time, or slack time. If no one has slack time, WIP gets stuck in the system. Or more specifically, stuck in queues, just waiting.” (Location 3795)

Tags: slack time
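The graph behind this passage uses a rule of thumb from the book: a resource's wait time is the percentage of time it is busy divided by the percentage of time it is idle. A minimal sketch of that ratio shows why queues explode as utilization approaches 100 percent. The function name is hypothetical, and the result is in relative units rather than hours.

```python
def relative_wait_time(percent_busy: float) -> float:
    """Rule of thumb: wait time = % busy / % idle (relative units)."""
    if not 0 <= percent_busy < 100:
        raise ValueError("percent_busy must be in [0, 100)")
    return percent_busy / (100 - percent_busy)


for busy in (50, 90, 99):
    print(busy, round(relative_wait_time(busy), 1))  # 1.0, 9.0 and 99.0 units
```

Going from 50 to 90 percent busy makes the wait nine times longer, and at 99 percent it is ninety-nine times longer.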

“Some of the wisest auditors say that there are only three internal control objectives: to gain assurance for reliability of financial reporting, compliance with laws and regulations, and efficiency and effectiveness of operations. That’s it. What you and John are talking about are just different slides of what is called the ‘COSO Cube.’” (Location 4079)

Tags: regulation, financial reporting, audit

Note: Internal controls = reliability of financial reporting, compliance with reg & laws and effectiveness of operations

“Metaphors like oil changes help people make that connection. Preventive oil changes and vehicle maintenance policies are like preventive vendor patches and change management policies. By showing how IT risks jeopardize business performance measures, you can start making better business decisions. (Location 4097)

Tags: it ops

“I want accurate and timely order information from our stores and online channels. I want to press a button and get it, instead of running it through the circus we’ve created. I’d use that data to create marketing campaigns that continually do A/B testing of offers, finding the ones that our customers jump at. When we find out what works, we’d replicate it across our entire customer list. By doing this, we’d be creating a huge and predictable sales funnel for Ron. “I’d use that information to drive our production schedule, so we can manage our supply and demand curves. We’d keep the right products on the right store shelves and keep them stocked. Our revenue per customer would go through the roof. Our average order sizes would go up. We’d finally increase our market share and start beating the competition again.” (Location 4201)

Tags: data, marketing, sales data

“In these competitive times, the name of the game is quick time to market and to fail fast. We just can’t have multiyear product development timelines, waiting until the end to figure out whether we have a winner or loser on our hands. We need short and quick cycle times to continually integrate feedback from the marketplace. “But that’s just half the picture,” she continues. “The longer the product development cycle, the longer the company capital is locked up and not giving us a return. Dick expects that on average, our R&D investments return more than ten percent. That’s the internal hurdle rate. If we don’t beat the hurdle rate, the company capital would have been better spent being invested in the stock market or gambled on racehorses. (Location 4224)

Tags: cycle time

Note: Decrease cycle times to stay competitive and reduce payback times

“When R&D capital is locked up as WIP for more than a year, not returning cash back to the business, it becomes almost impossible to pay back the business,” she continues. (Location 4230)

Tags: cycle time, wip
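Compound annualization is a standard finance convention that these highlights do not spell out. It makes the cost of locking R&D capital up as WIP concrete: the same total return clears the hurdle if delivered quickly and misses it if delivered late. In this sketch the 10 percent hurdle rate comes from the notes. The 30 percent total return and the delivery times are hypothetical.

```python
HURDLE_RATE = 0.10  # from the notes: R&D should return more than ten percent


def annualized_return(total_return: float, years: float) -> float:
    """Convert a total return earned over `years` into a compound annual rate."""
    if years <= 0:
        raise ValueError("years must be positive")
    return (1 + total_return) ** (1 / years) - 1


def beats_hurdle(total_return: float, years: float, hurdle: float = HURDLE_RATE) -> bool:
    return annualized_return(total_return, years) > hurdle


# The same hypothetical 30% total return, delivered after 1 year vs. 3 years.
print(round(annualized_return(0.30, 1), 3), beats_hurdle(0.30, 1))  # 0.3 True
print(round(annualized_return(0.30, 3), 3), beats_hurdle(0.30, 3))  # 0.091 False
```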

...data, which, incidentally, form two of the three legs of the ‘confidentiality, integrity, and availability triangle’ or CIA.” (Location 4280)

Tags: data

Note: .data

She didn’t look at any of the IT systems until she understood exactly where in the process material errors could occur and where they would be detected. She found that most of the time, we would detect it in a manual reconciliation step where account balances and values from one source were compared to another, usually on a weekly basis. “When this happens,” he says, with awe and wonder in his voice, “she knew the upstream IT systems should be out of scope of the audit.” “Here’s what she showed the auditors,” John says, excitedly flipping to the second page. “Quote: ‘The control being relied upon to detect material errors is the manual reconciliation step, not in the upstream IT systems.’ I went through all of Faye’s papers, and in every case, the auditors agreed, withdrawing their IT finding. “That’s why Erik called the pile of audit findings a ‘scoping error.’ He’s right. If the audit test plan was scoped correctly in the beginning, there wouldn’t have been any IT findings!” he concludes. (Location 4344)

Tags: audit

Because of our ever-improving production monitoring of the infrastructure and applications, more often than not, we know about the incidents before the business does. (Location 4416)

Tags: it ops, monitoring

Note: Have good production monitoring to shorten feedback loop for issues

Erik says that we are starting to master the First Way: We’re curbing the handoffs of defects to downstream work centers, managing the flow of work, setting the tempo by our constraints, and, based on our results from audit and from Dick, we’re understanding better than we ever have what is important versus what is not. (Location 4439)

Tags: kanban

“Think like a plant manager. When you see work going upstream, what does it mean to you?” He quickly responds, “The flow of work should ideally go in one direction only: forward. When I see work going backward, I think ‘waste.’ It might be because of defects, lack of specification, or rework… Regardless, it’s something we should fix.” (Location 4637)

“As part of the Second Way, you need to create a feedback loop that goes all the way back to the earliest parts of product definition, design, and development,” (Location 4711)

Tags: feedback

until code is in production, no value is actually being generated, because it’s merely WIP stuck in the system. (Location 4752)

Tags: product development, deployment

Continuous Delivery. (Location 4758)

Tags: toread

Note: .toread

“you need to create what Humble and Farley called a deployment pipeline. That’s your entire value stream from code check-in to production. That’s not an art. That’s production. You need to get everything in version control. Everything. Not just the code, but everything required to build the environment. Then you need to automate the entire environment creation process. You need a deployment pipeline where you can create test and production environments, and then deploy code into them, entirely on-demand. That’s how you reduce your setup times and eliminate errors, so you can finally match whatever rate of change Development sets the tempo at.” (Location 4761)

Tags: version control, deployment pipeline, deployment

Note: .deployment
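A deployment pipeline in this sense runs each commit through the same ordered stages, from building the environment out of version control to production, and stops at the first failure. The stage names below loosely follow the steps listed in these highlights. `run_pipeline` and `always_pass` are hypothetical stand-ins for real build, test and deployment tooling.

```python
from typing import Callable

# A stage is a name plus a check that returns True when the stage succeeds.
Stage = tuple[str, Callable[[str], bool]]


def run_pipeline(commit: str, stages: list[Stage]) -> list[str]:
    """Push one commit through every stage in order, stopping at the first failure."""
    completed: list[str] = []
    for name, step in stages:
        if not step(commit):
            print(f"{commit}: stopped at {name!r}")
            break
        completed.append(name)
    return completed


def always_pass(commit: str) -> bool:
    # Hypothetical stand-in for real build, test, and deploy tooling.
    return True


stages: list[Stage] = [
    ("build environment from version control", always_pass),
    ("run automated tests", always_pass),
    ("deploy to QA environment", always_pass),
    ("load test in staging", lambda commit: False),  # simulate a failure
    ("deploy to production", always_pass),
]
print(run_pipeline("abc123", stages))
```

Because every stage is automated and ordered, a failing commit never reaches production.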

Thinking about what Erik said, I add, “How about enabling Marketing to make their own changes to content or business rules or enabling faster experimentation and A/B split testing, to see what offers work best?” (Location 4807)

Tags: webflow

I think about Erik challenging me to think like a plant manager as opposed to a work center supervisor. I suddenly realize that he probably meant that I needed to span the departmental boundaries of Development and IT Operations. “You guys are both correct,” I say, interrupting Wes and William. “William, would you mind writing down all the steps on the whiteboard? I’d suggest starting at ‘code committed,’ and keep going until the handoff to our group.” He nods and walks to the whiteboard and starts drawing boxes, discussing the steps as he goes. Over the next ten minutes, he proves that there are likely over one hundred steps, including the automated tests run in the Dev environment, creating a QA environment that matches Dev, deploying code into it, running all the tests, deploying and migrating into a fresh staging environment that matches QA, load testing, and finally the baton being passed to IT Operations. (Location 4819)

Tags: dev environment, deployment

Note: .deployment

I indicate to Brent and Wes that one of them should continue where William left off. Brent gets up and starts drawing boxes to indicate the packaging of the code for deployment; preparing new server instances; loading and configuring the operating system, databases, and applications; making all the changes to the networks, firewalls, and load balancers; and then testing to make sure the deployment completed successfully. (Location 4829)

Tags: deployment

Note: .deployment

Realizing the importance and enormity of the challenge in front of us, I walk to the whiteboard, and pick up the red marker. I say, “I’m going to put a big red star on each step where we had problems during previous launches.” Starting to make marks on the whiteboard, I explain, “Because a fresh QA environment wasn’t available, we used an old version; because of all the test failures, we made code and environment changes to the QA environment, which never made it back into the Dev or Production environments; and because we never synchronized all the environments, we had the same problems the next time around, too.” (Location 4838)

Tags: deployment

Note: .deployment

“With the current process, two issues keep coming up: At every stage of the deployment process, environments are never available when we need them, and even when they are, there’s considerable rework required to get them all synchronized with one another. Yes?” Wes snorts, saying, “No reward for stating something that obvious, but you’re right.” She continues, “The other obvious source of rework and long setup time is in the code packaging process, where IT Operations takes what Development checks into version control and then generates the deployment packages. Although Chris and his team do their best to document the code and configurations, something always falls through the cracks, which are only exposed when the code fails to run in the environment after deployment. (Location 4863)

Tags: deployment

Note: .deployment syncing of environments

Right now, we focus mostly on having deployable code at the end of the project. I propose we change that requirement. At each three-week sprint interval, we not only need to have deployable code but also the exact environment that the code deploys into, and have that checked into version control, (Location 4891)

Tags: version control, deployment

Note: .deployment

Similarly, we were all amazed that we had a QA environment available that matched Dev so early in the project. That, too, was unprecedented. We needed to make a bunch of adjustments to reflect that the Dev systems had considerably less memory and storage than QA, and QA had less than those in Production. But the vast majority of the environments were identical and could be modified and spun up in minutes. (Location 4954)

Tags: deployment

Note: .deployment environments differing mostly in memory and storage
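Keeping every environment in version control also makes drift checkable. The sketch below compares two environment descriptions and reports every setting that differs. It ignores memory and storage, which the notes describe as the expected differences between Dev, QA and Production. The dictionary keys and values are hypothetical.

```python
# Assumption: sizing is the only acceptable difference between environments.
ALLOWED_DIFFERENCES = frozenset({"memory_gb", "storage_gb"})


def drift(
    reference: dict[str, object],
    other: dict[str, object],
    allowed: frozenset[str] = ALLOWED_DIFFERENCES,
) -> dict[str, tuple[object, object]]:
    """Return every setting that differs between two environments, except allowed ones."""
    keys = (set(reference) | set(other)) - allowed
    return {
        key: (reference.get(key), other.get(key))
        for key in sorted(keys)
        if reference.get(key) != other.get(key)
    }


# Hypothetical environment descriptions, as they might be kept in version control.
dev = {"os": "linux-5.10", "app_version": "2.3.1", "memory_gb": 8, "storage_gb": 100}
qa = {"os": "linux-5.10", "app_version": "2.3.0", "memory_gb": 32, "storage_gb": 500}
print(drift(dev, qa))  # {'app_version': ('2.3.1', '2.3.0')}
```

A check like this could run whenever an environment is changed. That would catch the kind of QA-only fix that previously never made it back into Dev or Production.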

‘Messiahs are good, but scripture is better.’” (Location 5446)

Tags: quotes

Note: .quotes

DevOps, but I suspect it’s something much more than that. It’s Product Management, Development, it Operations, and even Information Security all working together and supporting one another. Even Steve is a part of this super-tribe. (Location 5457)

Tags: devops

Beyond the Goal, (Location 5495)

Tags: tolistento

Note: .tolistento

Beyond the Phoenix Project, (Location 5504)

Tags: tolistento

Note: .tolistento

I’ve also come across otherwise smart [people] who are of the mistaken belief that if they hold on to a task, something only they know how to do, it’ll ensure job security. These people are knowledge Hoarders. This doesn’t work. Everyone is replaceable. No matter how talented they are. Sure it may take longer at first to find out how to do that special task, but it will happen without them. (Location 5530)

Note: Everyone is replaceable

Visible Ops Security. (Location 5541)

Tags: toread

Note: .toread

Continuous Delivery (Location 5677)

Tags: toread

Note: .toread

Many DevOps practices emerge if we continue to manage our work beyond the goal of “potentially shippable code” at the end of each iteration, extending it to having our code always in a deployable state, with developers checking into trunk daily, and that we demonstrate our features in production-like environments. (Location 5721)

Tags: deployment

Note: .deployment have code in a deployable state rather than potentially shippable

Myth—DevOps Means Eliminating IT Operations, or “NoOps:” Many misinterpret DevOps as the complete elimination of the IT Operations function. However, this is rarely the case. While the nature of IT Operations work may change, it remains as important as ever. IT Operations collaborates far earlier in the software life cycle with Development, who continues to work with IT Operations long after the code has been deployed into production. (Location 5737)

Tags: devops

Note: Operations work with Dev far earlier in the life cycle

Imagine a world where product owners, Development, QA, IT Operations, and Infosec work together, not only to help each other, but also to ensure that the overall organization succeeds. By working toward a common goal, they enable the fast flow of planned work into production (e.g., performing tens, hundreds, or even thousands of code deploys per day), while achieving world-class stability, reliability, availability, and security. (Location 5779)

Tags: devops

Note: .devops Product, Dev, Test, IT Operations and Infosec work together