Aviate, Navigate, Communicate

Written by Kendall Miller on March 27, 2008 – 12:29 am

If you’re involved in IT operations or even in business long enough, you’re going to experience some emergencies. During these emergencies, you’re going to have to balance several conflicting things that will demand your attention simultaneously:

  1. Cause of the problem: What is really happening? What device is at the root of the problem? (The network switch died because an admin configured a loop in the fabric and misconfigured the port.)
  2. Scope of the problem: Just how bad is it? Problems usually show up in one place (users can’t access Exchange) but those symptoms often represent a larger problem (network switch died)
  3. Communicate with users: First, people will be coming in the door to report the problem (do you know that Exchange is down?) and will be expecting updates on what’s going on and when it’ll be resolved (I really need to tell my friend about a party tonight, when will email be back up?)

Even in a shop with healthy staffing, this can be a lot to handle at once, particularly because your impulse is going to be to move between the root cause and communication. The first because it’s the real high-value item: fix the problem. The last because whenever someone walks in, you’ll want to tell them what’s going on. The higher up the chain of command, the better you’ll want it to sound.

Whenever I’m wondering how to look at an IT Operations problem from a different perspective to gain insight, aviation is the first place I go. Think about the modern air transport system in the United States not from your usual perspective (a passenger on a plane) but from the standpoint of the people that live within it and operate it. For example, the life of a flight deck crew isn’t that different than system support in the sense that you have long periods of routine punctuated by periods of high stress activity. A classic rule taught to pilots when they’re first being trained is Aviate, Navigate, and Communicate - in that order.

  1. First, fly the plane. (Be in the middle of the air, not the bottom)
  2. Figure out where you are. (Over the White House)
  3. Then communicate. (Sorry Tower, would you like us to land?)

To make things easier on commercial planes, you have a pilot and co-pilot that divide these responsibilities by having clear designation of one being the Pilot Flying and the other (called the Pilot Not Flying or Pilot Monitoring) responsible for navigation and communication. This is practiced carefully during training with different parts of each emergency checklist assigned to either the Pilot Flying or Pilot Monitoring.

Now apply this back to a system problem:

  1. Create Clear Roles: Have your team know who is going to take on the role of Admin Flying and Admin Monitoring. This shouldn’t always be the same - it may be based simply on rotation (who is “up”) or who gets the trouble ticket or whatever within your shop. The team should declare their role in a situation so everyone knows their role.
  2. Perform in Order: If you have an Admin Monitoring, it’s their role to intercept external communication while the Admin Flying is working on the problem.
  3. Make a Checklist: An emergency isn’t the time to be winging it. During quiet moments, talk as a team about what you would do in a hypothetical situation and work to distill out a basic checklist of things you’re going to run through. Focus on having it be the shortest list that verifies the largest set of items. When a problem shows up, use the checklist.

Problem Checklists

There are a few great advantages to using a checklist for problems:

  • Reduce Solution Focus: When diagnosing a problem, the general process is to propose a theory, then test it to either prove or disprove it. This creates cycles where you create theories you have to believe in, and then your job is to prove yourself wrong. It turns out that people naturally tend to favor information that proves them right and discount information that’s inconsistent with that diagnosis. Checklists for diagnostics can ensure that a significant breadth of information is available at the start of this process to enable the best theories to be created quickly.
  • Creates a Pace: It’s easy to get caught up in an emergency and start working at a pace that really isn’t necessary, but degrades your accuracy and effectiveness. Checklists stop the emotional cycle that reinforces the early stages of emergencies and instead create a steadily paced environment of gathering and verifying facts.
  • Establish a Baseline for Improvement: One of the most important parts of any emergency, and the least frequently used effectively, is an after action review. After you’re back up and everyone has calmed down, you want to learn as much as you can from what happened. The existence of a checklist creates a baseline for systematic (as opposed to random or by chance) improvement to your team’s ability to handle future problems. This is true even if the checklist wasn’t used; the fact it wasn’t used is itself an indictment of either the checklist itself or the team’s training.
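
As a rough illustration of the baseline idea, here is a minimal Python sketch of a checklist runner. It runs every diagnostic step in order, keeps going after a failure so that all the facts are gathered before anyone forms a theory, and records a timestamped result for each step that can be reviewed afterwards. The `run_checklist` helper, the `StepResult` record, and the step names are all hypothetical; real checks would query your own systems.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    passed: bool
    detail: str
    checked_at: datetime


def run_checklist(steps: List[Tuple[str, Callable[[], bool]]]) -> List[StepResult]:
    """Run every step, even after a failure, so all the facts are gathered first."""
    results = []
    for name, check in steps:
        try:
            passed, detail = bool(check()), ""
        except Exception as exc:  # a check that crashes is itself a finding
            passed, detail = False, f"check raised {exc!r}"
        results.append(StepResult(name, passed, detail, datetime.now(timezone.utc)))
    return results


if __name__ == "__main__":
    # Placeholder checks standing in for real probes (assumed names).
    demo = [
        ("core switch reachable", lambda: True),
        ("Exchange responding", lambda: False),
    ]
    for r in run_checklist(demo):
        status = "OK  " if r.passed else "FAIL"
        print(f"{r.checked_at:%H:%M:%S} {status} {r.name} {r.detail}")
```

Keeping the recorded results from each incident gives the after action review a concrete record of which steps were run and what they showed.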

While initially it may feel corny or even overly dramatic or bureaucratic to create checklists, there is real evidence to back up using them in environments where the downside cost (crash and death) is very steep, and if pressed to admit it, most engineers will confess they have a mental checklist they use for standard problems.

Plans are Useless, Planning is Priceless.

Just by creating the checklists (even if they were never used) your team can get a lot of value:

  • Cooperative learning: This is a great tool for the team to learn from each other. Each admin will share their best tips and tricks from their mental checklist and be surprised that they don’t line up. Where they don’t, the discussion on which approach is better and why is gold. It’s hard to get the same result with a contrived exercise, so use this opportunity to build the checklist and maintain it as a team.
  • Clarifies Automation: While creating the checklist, it will naturally precipitate ideas for how to automatically identify and possibly solve steps in the checklist itself. For example, if a step in the checklist is to verify Internet connectivity, how are you going to accomplish that? Instead of having an ad-hoc mechanism, can an automated mechanism be put in place so that you now can quickly check that data point without variation?
  • Encourages Collaboration: If the team collaborates to create the checklist, when a problem occurs they will be more likely to collaborate on resolving the problem because they already have had the experience of working together as a team. This will tend to replace individual ego with group esprit de corps.
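
To make the Internet connectivity example concrete, here is a minimal sketch of what an automated version of that checklist step might look like in Python. It assumes that "connectivity" means being able to open a TCP connection to at least one outside host; the function names and the idea of passing in a list of targets are illustrative choices, not a prescribed method. The probe is injectable so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Target = Tuple[str, int]  # (hostname, TCP port)


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


def check_connectivity(
    targets: Iterable[Target],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Dict[str, bool]:
    """Probe every target the same way each time and report per-target results."""
    return {f"{host}:{port}": probe(host, port) for host, port in targets}


def internet_reachable(results: Dict[str, bool]) -> bool:
    """Treat the Internet as reachable if any outside target answered."""
    return any(results.values())
```

In use, the team would agree once on the list of targets (for example, a couple of well-known external hosts on port 443) and call `check_connectivity` with that list, so the answer no longer depends on who ran the check or how.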

An Exercise Left to the Interested Student

A friend of mine also pointed out the principle that if you have a checklist that always ends in the same action, why not automate the action in response to the checklist? In other words, if you can automate the detection steps that lead up to the action, then find a way to automate the resolution. You will often find you get here in inches: You progressively improve your monitoring so that you can find problems faster. Once this is reliable, you start just hooking up alarms to the monitoring so you don’t wait for a call from a real user or a higher level system. Once that’s working well enough, you get tired of performing the resolution manually so you write a script that takes a few arguments to perform the resolution. Now, just connect them together.
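
The "connect them together" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each detector is a hypothetical callable that returns True when it sees its part of the known failure signature, `remediate` stands in for the resolution script, and the `dry_run` default reflects a cautious choice to report what would happen before letting automation act on its own.

```python
from typing import Callable, Dict


def respond(
    detectors: Dict[str, Callable[[], bool]],
    remediate: Callable[[], None],
    dry_run: bool = True,
) -> str:
    """Run the resolution only when every detection step matches the known failure."""
    findings = {name: bool(check()) for name, check in detectors.items()}
    # With no detectors, or any detector disagreeing, do nothing.
    if not findings or not all(findings.values()):
        return "no_action"
    if dry_run:
        return "would_remediate"
    remediate()
    return "remediated"
```

Running with `dry_run=True` for a while and comparing its answers to what the team actually did is one way to build confidence before switching the automation on.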

Move Forward One Step Today

The best part about this is that you can get there in small steps that even the busiest team can fit into their schedule with a confidence that they will pay back in time saved in the future. With practice, it will become second nature and make it easier for your team to accommodate new processes and service requirements with ease. In the end, isn’t that what you need to ensure your team is viewed as a vital part of your organization?


Posted in Management, Monitoring | No Comments »

Technology is not Scalable

Written by Kendall Miller on March 24, 2008 – 11:22 pm

I was watching Start-Up Junkies the other day, which is following a group of people attempting to get Earth Class Mail off the ground. On one recent episode, the main focus was converting their system from being PHP-based to Microsoft .NET.

As you watch the episode, you hear two different reasons given for this transition. The CEO advances the issue that they started with PHP because they needed to get something done quick, but now they need to switch to ASP.NET to make it scale. They always knew they’d have to do it, but now they’re against the wall because they have been invited to demo at a Microsoft conference. The lead engineering staff and operations manager advance a different point: We are about to be demoing in Europe in Microsoft’s booth at a major convention that’s key to our growth, we need to be using Microsoft’s technology for this to happen.

Of these two reasons, which do you think is the better reason for their technology choice:

  1. Convert to ASP.NET to scale up because PHP can’t scale.
  2. Convert to ASP.NET because we’re getting marketing and sales assistance from Microsoft, which we’ll only get on their platform.

If you picked #1, two things are likely true: First, you haven’t worked with enough technology to know that the choice between PHP and ASP.NET for scalability is pretty far down on the list of “things that control how much we can scale”. Second, you’re probably not balancing business and technology interests effectively.

Go and check out the technology portfolio used at the most scalable web sites in the world - say the top ten super scalable systems, systems that are going to be at least two orders of magnitude greater than anything you’re likely to create (and this isn’t a negative thing; it’s a liberating thing). Notice anything in particular? You don’t tend to see either Microsoft ASP.NET or J2EE in the web infrastructure. In fact, you tend to see a lot of… PHP.

There are a few key reasons that the super scalable sites like these solutions:

  1. Open source means they have the source: You can bet that these sites aren’t able to use anything off the shelf. Their needs so outstrip the normal system that it isn’t reasonable that any off-the-shelf framework is going to fit their needs entirely. In fact, if it did that framework is likely seriously over-engineered for the majority of its user base.
  2. Licensing costs add up: If you’re a small shop, licensing costs are highly unlikely to be a significant percentage of your total cost of goods for a technology product; bandwidth, hosting, and above all people are the big numbers. If you’re Google, you don’t want to pay even $10 in licensing per server. This is similar to a large manufacturer worrying about saving $.05 on a bolt; small incremental costs still add up.
  3. Scalability is their first concern: More important than ease of development, cool debuggers, third party component libraries, or anything else. If it can’t meet their scale, it isn’t even a potential solution. Perhaps more importantly, they have to have both the experience and human belief that it will scale. If you’re one of these sites, there is no way a vendor has tested their solution at your scale - you’ll be the first. If you’re going to be the first, you want to have a simple solution that you can adjust and correct.

If you’re honest, your decision matrix isn’t the same as this. It’s highly unlikely you’ll create the next MySpace, even if you are successful. While the principles of scalability are constant, the importance of scalability vs. other constraints changes. More likely, you need to base your technology choices on a mix of:

  1. What resources do you have? If you already have a staff of people experienced at technology X, they are likely to produce more results in any moderate interval of time (say one to three months) with this technology than any new one. If you have a large body of existing code in technology X, this is a big accelerator to your project.
  2. What resources can you get? When picking a technology, buy the community, not the product. If you can take a number of pieces off the shelf, particularly for things you aren’t attempting to innovate (such as security, content management, grid controls, reporting...) it will accelerate your product curve. Conversely, if you can’t get great people that want to work with a technology, it really doesn’t matter how great the technology is.
  3. What religion is your market? Many markets have a non-rational product selection bias. For example, if you want to sell your product primarily to Macintosh users, you probably shouldn’t use ASP.NET. It isn’t that it should make a difference to how the product works for them, but as a group Macintosh users tend to put “Not Microsoft” on their evaluation lists. The same goes for Linux users. Conversely, there are several products that are defined as “just like X, but in ASP.NET!” If your market typically has technology selection criteria that aren’t based on business or practical fundamentals, it’s best to respect them; otherwise you’ll have to focus additional energy during your sales and marketing efforts to overcome what your market will perceive as a natural disadvantage. The coolest technology, developed quickly and cheaply, is no good if your target customer won’t even invite you to the dance.

Back to Earth Class Mail. Could they scale using PHP? Absolutely, others have. Should they switch to ASP.NET? Probably - they wanted to leverage the marketing advantage of Microsoft. I suspect if IBM was the big animal in the space they wanted and a deal could have been made it would have been WebSphere instead of ASP.NET. Each of these technologies can scale, or not scale, depending on how they are used.


Posted in Infrastructure, Software Development | 3 Comments »

No drop of rain believes it’s responsible for the flood

Written by Kendall Miller on March 20, 2008 – 8:33 pm

I grew up as the third son in our family. When my oldest brother was a newly minted driver, like every new driver he was a little rough. And like any younger siblings, my other brother and I were kind and gentle in our commentary about it. This led him to declare his first driving rule: No comments on his driving when he was driving. He was in command, and that was it.

One day soon thereafter, he was backing the car out of the garage with my other brother and me in it. Now, he didn’t normally park on the right side of the garage - that was where my dad’s car went. But by whatever fluke, there the big Ford station wagon was - on the right side of the garage. When backing out, you have to start turning right away because the driveway isn’t straight. In fact, you have to start turning left, meaning the front of the car goes to the right. Normally this was not a problem since there was plenty of clearance. But when starting on the right side of the garage, on the right is… the garage. As he started backing out, my brother and I quietly sat there, watched the side of the garage come right up and *bang* *crunch* we hit it. In the “after action review” that followed, my brother exclaimed “Why didn’t you tell me I was going to hit the garage?” You can imagine our response – “Because you had said never comment on your driving.”

At the time, I was smug in my righteousness. We had done exactly what he’d asked, we weren’t the driver and the driver is responsible for the car, so it was a big ‘not-my-problem’.

We were dead wrong. We were in a position to have prevented the problem, and we should have spoken up. Ever since, we’ve had a rule in our family for riding in a car: speak up if you see a problem, without fear that the driver will be upset. The potential consequences of not calling a problem to the driver’s attention are too great.

How Do You Play the Blame Game?

The same story often plays out in the aftermath of a technology problem. Hang around a software development team long enough and you’re bound to hear a developer complain “Why didn’t QA find that defect? They should have found it before it shipped.” The difference between an experienced, healthy team and an amateur team is whether the developer is just venting or actually believes they are justified.

We often have a strong desire to try to reduce accountability for avoiding issues to a single party:

  • QA is responsible for finding all defects in the software before it is released.
  • IT Operations is responsible for keeping all of the servers running.
  • The receptionist is responsible for ensuring we don’t run out of coffee.

Before looking at the contentious examples, look at just the last one. Say that you noticed that you pulled the next-to-last box of K-cups out of the supply cabinet. You’re not out of coffee yet - there are 24 individual servings in the box you pulled, one more box on the shelf. In most small companies, that’s at least a day’s worth of coffee. Would you tell the receptionist that you need more coffee? Or just assume that it will be taken care of? Say you then run out of coffee two days later, and everyone has to run out to Starbucks to feed their habit. Would you feel at all responsible for not speaking up when the problem was still avoidable?

You probably would have spoken up - the receptionist is a nice person, it’s an easy enough thing to do and you like your coffee.

Now look at the other two scenarios. The only real difference between them and running out of coffee is that these two will tend to be political and possibly even contractual. While you’d likely also speak up if you saw a defect in your company’s product before it was released or if you saw that a server was just about out of disk space, you wouldn’t want to accept any accountability after the fact if things went bad.

Here’s the elephant in the room: Your customers don’t care who was accountable for avoiding a problem. They care that the problem happened. They pay you for something that works (and has to work according to their definition of what works means). Anything else is just internal noise. If you want to drive your business forward - and really, if you don’t, you need to look to work somewhere else - this needs to be your motivator.

Formal vs. Practical Accountability

What if, instead of looking at issues as someone else’s problem, you followed these two principles?

  • If you are in a position to prevent a problem, you are accountable for preventing it.
  • If you are responsible for ensuring a problem doesn’t happen, you need to stay in a position to prevent it.

This means that many different people and groups may each be 100% accountable for a problem, because the most useful way to look at accountability is based on the ability and responsibility for preventing the problem. Why the most useful? Because the problem happened, that’s a matter of fact. While recriminations, blame, and shame may be cathartic or fun, they aren’t useful because they don’t further the goals of the team or the company. Put simply, your customers don’t care who’s at fault within your organization, just that you get the seriousness of the problem and you’re making it right. When debriefing your team, the ideal outcome is that everyone in the room sees how they could have prevented the problem, and takes on that they should have prevented the problem. From that, you then work into who was in the best place to prevent it - who could have seen it first, and addressed it while it was cheapest to address. You want to have everyone walk out with a balanced perspective of how they could have prevented it and how to identify when you’re in the best spot to prevent it.

A natural concern with this approach, particularly if it’s new to your organization, is that after action reviews are often a game of musical chairs - while there’s a superficial impression of honesty and openness, the true goal is to not be left without a chair when the music stops. Far from a well-calculated political move, this is really an emotional and ego driven outcome. No one likes admitting they are wrong, and with practice people get very skilled at justifying their emotional responses with pseudo-intellectual reasoning – it is called rationalization.

The next time you’re in this situation, try being the first party to speak up about what you could have done to avoid the problem, and make sure you communicate sincere regret you didn’t catch it. If you are completely open in this - sticking just to what you could have done without any back handedness (that’s right - you can’t say “I couldn’t cover up his incompetence.” That doesn’t count.) you’ll be amazed at how quickly the mood in the room changes. Very quickly others will jump in with what they could have done. You’ve created an environment where people can speak the real fears that are on their mind without posturing.

Once you’ve established this environment, you need to be active in maintaining it. If someone jumps into the attack, speak up and redirect the conversation. This is true particularly if the attack isn’t directed at you. Keep listening to make sure the conversation stays in even tones and that each party is either talking about what their area could have done or constructively helping the overall conversation.

Eventually, there will be a fundamentally sticky conversation about which party was in the best position to avoid the problem. At this point it’s going to come down to culture - if your culture is one that learns from mistakes, it will be a clear and short conversation. Depending on how strong the duck-and-cover instinct is in your shop, it can be very painful. In the end - speak up if your team is the one that should have the spotlight. Fear of accountability is often overstated. In practice, managers know that in the end they need people that will be accountable for what happened, and the experience can still be positive in the long term. Great managers actively hunt out people that are quick to learn from their mistakes and own them.


Posted in Infrastructure, Management, Software Development | 1 Comment »

First, Fly the Plane

Written by Kendall Miller on March 16, 2008 – 8:45 pm

I used to work with a former Navy A-6 pilot and instructor. One of his standard techniques for helping pilots deal with emergencies was to train them to take an immediate action when they noticed the problem: an action that had no consequence but would fill the need to do something. What he trained them to do was reset the built-in timer clock as soon as they noticed the problem. Ostensibly, this was to help them know later how long the problem had been going on, but its true purpose was to give them a single, standard action to fill the human need to do something; then they could take time to reflect on the problem. Step two on the checklist was fly the plane. There have been several CFIT (controlled flight into terrain) accidents where pilots were too busy troubleshooting a problem to avoid the ground. The pilots forgot their first responsibility: make sure you put flying the plane in front of any other activity.

When doing IT Operations, there’s a lot you can learn from aviation.  I’ve seen several situations where technicians have caused much larger problems while troubleshooting small ones.  This comes from the same mindset that caused air crashes:  you become so focused on the immediate problem that you are no longer aware of your environment. The longer you work at a problem, the more likely this will happen.

A few team techniques you can use to help avoid this:

  • The Two Person Rule: Have two technicians involved in the problem with one taking the immediate actions and the other taking a longer view.
  • Separate Diagnostics from Remediation: Break your approach into non-invasive diagnostic activities before remediation attempts. This gives you a discrete point, before you start putting things at risk, to recheck your assumptions about dependencies and risks to other systems.
  • Peer Review: Before approaching a problem, discuss your approach with two other people on your team (at the same time). If that approach isn’t successful or you need to deviate from it, reconvene the group to discuss again.

In many ways this is an extension of Don’t Taunt the Bear. When working on a problem during business hours (or, if you like, non-maintenance hours), before taking anything offline, even for a moment, ask yourself: Do I need to take this action right now? How sure am I that it won’t have any unexpected consequences? Is the risk I’m wrong worth the benefit of doing this right now?

All of this may sound like it’s going to add time to problem resolution, and it might. However, remember that your first responsibility is to keep services flowing to your users. Most users will be unsympathetic if they lose access to their home directories because you were troubleshooting a problem with the printer in accounting and took down the same services that shared files.


Posted in Infrastructure | No Comments »

Don’t Taunt the Bear

Written by Kendall Miller on March 10, 2008 – 12:45 am

When I first started at John Deere, I was working in a division that deployed systems to dealerships. Up until that point, they hadn’t done anything with hardware RAID. Dealerships are extremely cost-conscious, and while I was a huge believer in the value of hardware RAID arrays, they needed to prove their merit. At that time, HP was the preferred vendor for dealership equipment so I had gotten them to provide us a demonstration server with a hardware RAID card so I could show it off. The high point of my demo to the service staff was when I pulled a drive out of the running server while it was in the middle of running a very visible, high load process - and to everyone’s surprise it would just keep running! The first time I did the demo, it worked great - I pulled out the first drive and the server didn’t miss a beat.

A day later I was doing the same demo for a group of managers. The previous day’s work had been fruitful - it had gotten the attention I wanted and now a higher group wanted to discuss it. This time around, someone raised the question “so, any drive can fail and the system keeps running?” With much bravado I replied “sure! Watch!” and pulled out the second drive. Two seconds later to my shock the system froze and then went to a blue screen.

This was when I discovered that, unlike the Compaq systems I was used to, the HP system didn’t automatically rebuild by default when you reinserted the drive.

I took a number of lessons away from this:

  1. Don’t assume each vendor’s equipment works the same way, even if that way seems to make a lot of sense.
  2. There is almost no amount of check & recheck that is too much when removing redundant components.

When you work with systems designed for high reliability, it’s often tempting to take advantage of the innate redundancy of the system to allow you to be somewhat more cavalier in your operational procedures. For example, if you have two web servers that are part of a load-balanced cluster, conceptually you can take one offline, reboot it and do whatever - right in the middle of the day when it’s convenient to your IT staff. On the surface, there’s nothing wrong with this - if everything operates as designed, you should be able to rip out the second server and do whatever you want without causing a problem. It’s very tempting to forget the cluster while working on the server.

However, it often pays to be vigilant in this circumstance. Don’t taunt the bear - just because it shouldn’t cause a problem, doesn’t mean it won’t cause a problem. For example - what if during the reboot the server comes back on line? Depending on how exactly your load balancing system works it may start getting new requests because it appears to be operational. It’s very hard to explain to your peers and the rest of the business why you went offline because you took a shortcut.

There is a fine line between taking advantage of redundancy and causing problems.

Don’t Count on Redundancy

At a SaaS company I worked for we had a highly redundant SAN. Each server had two cards, they connected to two independent switches which in turn each had a connection to the two storage processors that ran the array. The whole system was designed and certified by the vendor to operate without interruption in the face of a failure of a card, switch, storage processor, etc. It also was designed to be continuously operational while having every component upgraded - the firmware of the switch, the storage processors, etc.

This highly redundant design opens the possibility of performing configuration changes, firmware upgrades, even component replacement during the day while business is going on - after all, it should work just fine. This is a good example of being tempted to taunt the bear - just because a system should be redundant and not have a problem with what you’re doing, don’t bank on that capability if you don’t have a compelling reason to do so. If you have to do it, don’t rely on automatic redundancy behavior - manually take the component offline.

Treat the bear with respect. If you can, schedule work for maintenance time periods so that if there is a service interruption it will have the smallest impact. If you have a good deal of experience that a particular action won’t cause a problem then you might perform it just outside of business hours instead of during maintenance time periods (which are often in the dark of night).

Restoring Redundancy

The rules change a little when dealing with a failure. For example, if you have a drive fail in a redundant array and get in a new drive, you have to balance the competing goals of restoring redundancy and the risk of replacing the drive. There are a number of risk elements in replacing a failed drive:

  1. You could pull the wrong drive, causing the whole array to fail.
  2. The physical disconnection of the drive could cause a SCSI bus reset or some other momentary interruption of data on the array.
  3. The new drive could be electrically defective and short the bus.
  4. Mechanically inserting the drive could disrupt the bus or jar another drive or other physical part, causing the array to fail.

So, how do you balance the desire to replace the failed drive with the risks of causing the array to fail?

  1. If the system is stable and still redundant, wait until the next scheduled maintenance period to perform corrective action. There’s no rush.
  2. If it is not redundant, but operable, you need to balance risk with benefit. It is very unlikely that an independent part will fail within 24 hours of another failure, so you can almost always wait until a low activity time outside of business hours or even in the middle of the night to replace the component.
  3. If the system is not stable, you have the most difficult decision. First, don’t make this on your own. Get together at least the available IT engineers and, if at all possible, a representative of the business process(es) affected by the problem. You need to balance the current instability with the probability that you will make it worse by changing the system. If it’s just a dead drive, this is pretty easy: Low risk, high benefit (however it’s unlikely you’d be in an unstable situation if this happened).
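
The three cases above amount to a small decision table, which can be written down as a Python sketch so the team applies it the same way every time. The `Action` names and the `replacement_plan` function are made up for illustration; the mapping itself follows the list above, with an unstable system always routed to a group decision regardless of redundancy.

```python
from enum import Enum


class Action(Enum):
    WAIT_FOR_MAINTENANCE_WINDOW = "wait until the next scheduled maintenance period"
    REPLACE_IN_LOW_ACTIVITY_PERIOD = "replace at a low-activity time outside business hours"
    CONVENE_GROUP_DECISION = "gather IT engineers and a business representative to decide"


def replacement_plan(stable: bool, redundant: bool) -> Action:
    """Map the system's current state to the response described in the list above."""
    if not stable:
        # Unstable systems are never a solo call, whatever the redundancy state.
        return Action.CONVENE_GROUP_DECISION
    if redundant:
        return Action.WAIT_FOR_MAINTENANCE_WINDOW
    return Action.REPLACE_IN_LOW_ACTIVITY_PERIOD


if __name__ == "__main__":
    print(replacement_plan(stable=True, redundant=False).value)
```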

Lockout / Tagout

Clustering systems combine the ability to automatically recognize when a node is down (automatic failover) and be manually told to ignore a node (manual failover). Before performing invasive work on a node in the cluster that has been taken offline automatically, go back to the clustering system and place the node offline manually. Think of this as being the equivalent of procedures used when working with dangerous machinery - Lockout/Tagout. Straight from our friends at OSHA:

“Lockout/Tagout (LOTO)” refers to specific practices and procedures to safeguard employees from the unexpected energization or startup of machinery and equipment, or the release of hazardous energy during service or maintenance activities.

This is exactly what we want to do - make sure while we’re performing actions that impair the availability of part of a reliable system we have the cluster configured so that the part can’t be accidentally used. There are two parts of this: Lock out the item so it can’t be unintentionally accessed and tag the device so that everyone knows that it’s locked out. You want to be clear on how to accomplish both for each cluster you have. The latter may take the form of just notification - an email to your support team - or a post on a central site. The point is you need a big, visible way of clearly communicating the status of the device.

If your clustering mechanism doesn’t have a way of doing this, or it relies on the node itself (such as Windows NLB) you should consider it always live and dangerous.
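
To show the two halves of Lockout/Tagout side by side, here is a minimal in-memory sketch in Python. It is not tied to any real clustering product: `LockoutRegistry` and its methods are hypothetical, and in practice the "lock" would be applied through your cluster's own administrative interface. The sketch only illustrates the shape of the procedure: a lock that removes the node from the eligible set, a tag that says who locked it and why, and a rule that only the person who placed the lock removes it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class Tag:
    node: str
    owner: str
    reason: str
    locked_at: datetime


class LockoutRegistry:
    """Tracks which nodes are manually locked out, by whom, and why."""

    def __init__(self) -> None:
        self._tags: Dict[str, Tag] = {}

    def lock_out(self, node: str, owner: str, reason: str) -> Tag:
        if node in self._tags:
            raise ValueError(f"{node} is already locked out by {self._tags[node].owner}")
        tag = Tag(node, owner, reason, datetime.now(timezone.utc))
        self._tags[node] = tag
        return tag

    def release(self, node: str, owner: str) -> None:
        tag = self._tags.get(node)
        if tag is None:
            raise KeyError(f"{node} is not locked out")
        if tag.owner != owner:
            # Only the person who placed the lock may remove it.
            raise PermissionError(f"{node} was locked out by {tag.owner}")
        del self._tags[node]

    def eligible(self, nodes: List[str]) -> List[str]:
        """Lock: filter out nodes that must not receive traffic."""
        return [n for n in nodes if n not in self._tags]

    def notice(self) -> str:
        """Tag: a human-readable status to email or post on a central site."""
        if not self._tags:
            return "No nodes are locked out."
        return "\n".join(
            f"LOCKED OUT: {t.node} by {t.owner} ({t.reason}) "
            f"since {t.locked_at:%Y-%m-%d %H:%M} UTC"
            for t in self._tags.values()
        )
```

The text from `notice()` is the kind of thing that could go into the email or central post mentioned above, so that the lock and the tag always come from the same source.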

Nice Bear. Friendly Bear.

If the bear is working well, let him continue doing what he’s doing. Your running system should be treated with respect at all times, because there is a great deal of complexity that goes into each of the elements and how they work together, even if it appears simple on the outside. As a person responsible for a reliable system, you need to always be thinking in the long term. You don’t want to cause an outage just to deploy an upgraded component or firmware. Almost without question, the theoretical issues fixed by the firmware update aren’t going to be as important to your customers as the real issues caused by a service interruption.


Posted in Infrastructure | 1 Comment »