The overwhelmingly most important job of any tech company’s top architect is … [drum roll, please] … assuring the reliability and responsiveness of his company’s product and/or service.
This may not be what you think. It often is not what the top architect thinks. If that is the case, all I can say is, think again, and think better this time.
I find that the technical staff in computer-enabled companies have all sorts of ideas about which things are “important.” Generally speaking, “importance” seems to be correlated with distance from day-to-day events, distance from the data center, distance from anything concerning “operations,” and distance from the concerns that existing customers have when using the company’s product or service day to day. Anything “strategic” – that’s in; that’s a proper concern of the top thinkers in the company. Anything “tactical,” or that could conceivably be in the realm of the customer service or data center operations group – that’s out; that’s just a waste of time for the company’s best minds.
However common this way of thinking may be, I still find it to be not just bizarre, but perverse. The reason, simply put, is that it’s completely out of sync with what customers think the most important issues are.
If your product or service:
- just doesn’t work
- goes off-line unpredictably
- slows way down at key times of day
- has old, reliable features that suddenly stop working, or change their behavior unpleasantly

then what do you think will happen to your customer base? If you think you don’t care because you have such a great flow of new customers, think for a moment about what causes that flow: do you think reputation, references or word-of-mouth might have anything to do with it? Do I have to mention Toyota to remind you how fast a great service record can get destroyed?
Now, let’s get to the crux of the issue: when those bad things happen, whose fault is it, and whose actions and decisions are most highly correlated with creating the conditions that led to the problem? To make this simple, let’s turn again to Toyota.
- Is the driver (user) at fault due to improper use? Sure, that makes sense, it was the users that made the site crash.
- Are the technicians (for example in the data center) at fault? Sure, they can screw up; but the best-architected systems don’t depend on technician action to achieve reliability and response time.
- Are the customer service people at fault? Hmmm.
Here’s the reality: while anyone in an organization can screw up and cause problems for customers, the most serious issues I’ve seen in companies are the direct result of architectural decisions, or of the architects’ lack of attention and involvement. This includes slow response time, flakiness, downtime, and the other things that drive customers nuts, not to mention drive them to your competitors.
I’ll give a couple of illustrations.
A company’s web site was down for an hour or more at a time. Repeatedly. The only solution was to re-build a key component and its database on a new machine and bring it on-line. Right away, the finger of guilt pointed at data center operations for failing to deliver. But the root cause of the problem was a key application component that had no fail-over capability. How could that happen? Simple. The company’s top architects failed to make scalability and fault tolerance the non-negotiable, number one priority when selecting this key component. Instead they concentrated on all sorts of other things. Are the data center people at fault here? Hardly. They just had this crippled software tossed at them, and did their best to hold off the inevitable disaster.
A company was in a vise grip of pain. Existing customers were complaining that the service provided was slow and faulty. New partners were putting on the squeeze for new releases of functionality that they felt were crucial for winning business. The more new code the company released, the more bugs and customer problems were created. The more the company tried to slow things down to stabilize the service, the angrier the new partners got, accusing the company of failing to meet its commitments to them. The data center staff exploded, the QA staff grew, consultants were crawling around, and life was miserable. The root cause? Again, simple. The company’s top architects had completely ignored the whole release and go-live process, and built a software system that was designed for a set of unchanging requirements, instead of the fluid and constantly changing reality of the company’s customers and market. The whole nightmare was an architectural side-effect, and the solution was a change in architecture – the good, practical kind of architecture that encompasses everything about the company’s product/service, including releases and the data center.
I think the message is a clear and simple one: if your top minds are not already focused on the company’s most important issues, viz., those that are most important to your customers, get them focused on those seemingly mundane, tactical, near-term, nuts-and-bolts concerns. Now.