The Seven Dirty Words of IT23 Jun 2013
When you are thinking about IT often the first thought is hardware or software - some server in a backroom or a mission-critical application. When we speak of these important issues, critical processes, data integrity and security it is often in terms of software bugs, mean time between failure and other factors.
Today I’d like to bring up one area of IT that is often overlooked - the very language that we use. The words we choose to describe situations often have more of a “framing effect” than we’d like to admit.
The following list is my accumulation of what I’ve believe to be some of the dirtiest words in IT.
“Probably” comes first on the list because I feel it is the most dangerous. It’s so subtle that it slips into conversation imperceptible, however its effect is detrimental to the logical process needed to evaluate a situation. Often this is used when we are “reasonably certain” that a problem has been solved. The problem with this statement is that it short-circuits the logical process as “probably” always comes across as meaning “certainly”.
When we hear the word “probably” it can ring through with a certain air of assurance. We are placing our trust in the speaker that their experience and judgment is leading them to this conclusion and it is just a matter of formality to “prove” the circumstances. And herein lies the trap…
Why did we not just go the final 10% and prove conclusion is correct?
This is logical laziness!
True, there are circumstance that we cannot prove our assumptions - perhaps the necessary data is unavailable or the circumstances cannot be reproduced. We make our best guess and move on with life.
Unfortunately, “probably” is used in numerous situations where we can prove our assumptions. One classic example of this is in troubleshooting performance issues. Time and time again what I see the word “probably” used to mask or avoid the dirty work of truly tracing down the root cause of a performance issue. “It’s ‘probably’ Application FooBar, we just installed it yesterday and that’s when the performance issues started”. The best recent anti-example of this I’ve seen ( http://utcc.utoronto.ca/~cks/space/blog/sysadmin/MailProblemAnatomy ) goes to show that if we leave our assumptions at our first gut instinct we will never truly solve the issue.
Just remember “probably” comes from “probability”, so if you are saying something is “probably X” ask yourself what you are basing that opinion on. If you are basing it on a guess what is the harm of proving that assumption is correct?
Though less often spoken, the “never/always” dichotomy is entrenched in the thought process in IT. We are constantly planning for “future proof” systems, adding more drive space than we will “ever need” or building systems so resilient that they will “never” fail.
The root of this problem is that we are building an abstract logical system on top of a substrate that is firmly rooted in real world conditions. Variables such as component failure, unexpected latency and human creativity will continue to add unanticipated exceptions to your Platonic ideal of how a user will interact with your systems.
The key to keep in mind is that if something “never” or “always” happens then it would not need to be stated, it would be an unquestioned expectation. However, if you find yourself stating or thinking “X will never happen” or “Y will always do this” then ask yourself if this is a truism or if you are just taking the mental shortcut of “probably” and if so, what is the true probability?
The classic scenario for this would be a server with three drives in a RAID 5 configuration. RAID 5 is considered to provide durable storage since it is highly unlikely that multiple drives will fail at the same time, thus it is always true that we can survive a single drive failure. Or is it?
Consider the following, during the past decade drive speeds and capacities have grown leaps and bounds, but what has happened to the error-rate or mean time between failure? The “likelihood of read failure per storage unit” has not increased to compensate. Well that is what RAID 5 is to protect against, right? Check out http://www.standalone-sysadmin.com/blog/2012/08/i-come-not-to-praise-raid-5/ for a further explanation but generally at the capacity of over a few terabytes your likelihood of failure on rebuild is a coin-flip.
Again, calling back to our previous “dirty word” what is the true likelihood of failure in your particular environment?
Often when we are discussing cost we are exclaiming at the outrageous cost of the new Foobar Widget or how we got a great deal on last years discontinued model. Often these are just idle conversation starters, no one would really base business and technological decisions on the sticker price of an item. Surely there is other research involved…
Sadly too often this is the initial, and only, consideration made when evaluating a new technology. All things being equal, we should be able to make our decision on the intersection of “does what I want” and “cheap” and arrive at a logical conclusion - the correct conclusion. Ahh, but this is not the case. The sticker price of an item, any item, is far more complex than “the relative worth of the widget”, at it’s essence it is “what the market will bare”.
There are many factors that come into play when a manufacture is pricing any item. Some are very straightforward - the raw materials cost X, labor costs Y, we anticipate high demand for our new widget. We’d like to pretend that the retailer is covering their costs, making a fair profit and using that to base the cost on - this simply is not the case. In the world of retail there are many psychological pricing tricks that get used for various purposes. There is the “framing” effect of using Small, Medium and Large that will set guidelines for what you see as a realistic and fair price regardless of intrinsic value or other vendors pricing. Another common trick is simply pricing something higher to make it seem like a better item. “It costs more so it must be better, right?”
Once we get past the initial purchase price of an item there are the “hidden” costs. What are the maintenance costs, training costs, cost of consumables? What are the training costs to use the new widget? How long with Widget X last? How often does the manufacturer release a new model or for that matter how long will it be supported.
Again the running theme here is that we want to avoid “logical laziness” and truly evaluate the situation beyond our initial gut reactions. Even if the answers to these questions are unknown we still have made a more considered evaluation and are in a better position to understand the true costs involved.
“As Soon As Possible” has become one of the most trite expressions that has crept into professional language. Typically, this is used as a throwaway phrase or filler adding emphasis to the overall message being delivered (i.e. “Do It Now”). Sadly, this phrase is so lacking in information that it often comes across as more of a threat then any sort of actual priority assignment. When this phrase is used often what the speaker means to say is “this is the highest priority I have at the present moment and I’d like you to make it yours as well, the time frame for completion is limited”.
The problem with this phrase is that we are working backwards from our goal instead of the other way around. To say this another way - if you have made a convincing argument for why the other person should care about this issue, why it is important and time sensitive, the priority will be implicit and self-evident.
Whenever we are giving direction, making suggestion, etc. the global scope needs to be considered. Yes this problem is important, but at what scale? Is solving this problem?
Is it critical to continued service? Critical to continued business operation? Impacting one customer? Impacting multiple customers …
Having an idea of scale/scope helps us to start to prioritize this with other ongoing issues as well as determine how much “effort” to apply to the issue.
The difficulty with a statement like “X is complete” is that the very idea of “completion” can be elusive and have many meanings. Does “X is complete” mean the initial problem is circumvented for now? Is it fixed in the sense that the faulty part was replace? Have new pieces been put in place to predict or prevent failure of a similar scenario?
As you can see from just these few questions to idea of “complete” from person to person will have many different meanings in different situations. It is best to define the completion in terms of what has been accomplished and what can be expected in the future. After all, the very notion of “complete” or “finished” in IT is an asymptote. Given more time/money/desire surely a proceed or process could be made at least somewhat better.
When you are defining a product or solution it is best to come up with a list of agreed upon deliverables that way the “completeness” of the project is clear to everyone involved.
Saying something “just works” implies some sort of “magic” or an incomplete understanding of the system. If something really “just works” it could be that it is a truly durable and robust system - it is clearly thought out, has a high degree of reliability and transparency. Sadly, “just works” is often just the expression of “logical laziness” that we have discussed.
If one is afraid or unwilling to peel back the layers of the system it will appear as a simple “just works” black box. However, if you are willing to truly examine the foundations and inner workings of a system you will undoubtedly reveal a layer of complexity that you were previously unaware of.
Be careful not to confuse a truly elegant system where the pieces are well thought out and understandable with a system that is just opaque.
The very concept of a backup is a duality of backup/restore, there is no “backup”. Let me say this again, a “backup” is not a “backup” it’s only half of a thing that we typically call a “backup”. Without a verified restore we must consider the backup to be “write only”.
Regular and rigorous restore tests are the only way to ensure that you are truly able to retrieve the data you wish to share. Is there a bug in the vendor’s software? Does your filesystem checksum data read and written, does it do it correctly? What about your backup media? Are you even backing up the correct files? Server? None of these questions can truly be verified without testing the viability of the date you’ve attempted to backup.
Again the key to keep in mind is that every backup is a “backup attempt” and cannot be fully trusted until it’s been restored. The restore must be of the same level and quality you wish you use the backups for. If you only are truly concerned about recovering a text file, then restoring that file and verifying it’s contents is enough. However if you wish to use any data with any sort of complexity it needs to be loaded into the appropriate application. Do you know, truly know, that the data file you are backing up is in a consistent state on the disk - are you certain it was entirely flushed to disk when your backup ran? You’ll never really know until you do a full test restore and run it in at least an approximation of your production environment…
I hope you’ve enjoyed my tour through the seven “dirty words” of IT. The purpose of this is not to by hyper-analytical and just pick apart common mistakes or communication difficulties - it hopefully has left you with some scenarios in your life to think about and find new ways to think about and approach these problems.
If you’ve encountered any situations like theses, or have some “dirty words” of your own please share in the comments.
Thank you for reading