By Simon Bearne, Commercial Director, Next Generation Data www.ngd.co.uk
Data centres are quick to lay out their credentials but can be slow to prove them – black testing proves that critical infrastructure will actually work when needed
For cloud providers and their many customers, a robust and continuously available power supply is amongst the most important reasons for placing IT equipment in a data centre. It’s puzzling, therefore, why so many data centres repeatedly fail to measure up to such a mission-critical requirement.
Only recently, for example, cloud service providers and communications companies were hit by yet another protracted power outage affecting a data centre in London. It took time for engineers from the National Grid to restore power, and meanwhile thousands of end users were impacted.
Let’s face it. From time to time there will be Grid interruptions, but they shouldn’t be allowed to escalate into noticeable service interruptions for customers. Inevitably, such incidents create shockwaves among users and cloud service providers, their shareholders, suppliers, and anyone else touched by the inconvenience.
The buck stops here
While it’s clear something, someone, or both are at fault, the buck eventually has to stop at the door of the data centre provider. Outages are generally caused by a loss of supply in the power distribution network. This could be triggered by a range of factors, from construction workers accidentally cutting through cables – common in metro areas – to power equipment failure, adverse weather conditions, not to mention human error.
Mitigating such risks should be easy when choosing a data centre. Locate your data away from flood plains and ideally choose a site where power delivery from the utilities won’t be impaired; this is a critical point. Cloud providers and their customers need to fully appreciate how power is routed to their chosen data centre through the electricity distribution network – in some cases, the route is pretty tortuous.
Finding the ideal data centre location that ticks all the right boxes is often easier said than done, especially in the traditional data centre heartlands. Certainly, having an N+1 redundancy infrastructure in place is critical to mitigating outages due to equipment failure.
Simply put, N+1 means there’s more equipment deployed than needed and so allows for single component failure. The ‘N’ stands for the number of components necessary to run your system, and the ‘+1’ means there’s additional capacity should a single component fail. A handful of facilities go further. NGD, for example, has more than double the equipment needed to supply contracted power to customers, split into two power trains on either side of the building, each of which is N+1. Both are completely separated with no common points of failure.
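As a rough illustration, the redundancy arithmetic above can be sketched in a few lines of Python. All figures here are hypothetical and chosen purely for the example – they are not NGD’s actual plant ratings:

```python
# A minimal sketch of redundancy arithmetic under N+1 and under two
# fully separated N+1 power trains. All figures are assumptions.

def surviving_capacity_kw(units, unit_kw, failed):
    """Capacity (kW) still available after `failed` units are lost."""
    return max(units - failed, 0) * unit_kw

load_kw = 2000   # assumed contracted load
unit_kw = 1000   # assumed rating of each generator/UPS module
n = 2            # 'N': units needed to carry the load on their own

# N+1: one spare unit tolerates a single component failure...
assert surviving_capacity_kw(n + 1, unit_kw, failed=1) >= load_kw
# ...but a second simultaneous failure drops below the contracted load.
assert surviving_capacity_kw(n + 1, unit_kw, failed=2) < load_kw

# Two separated N+1 trains: even if one whole train is lost, the
# remaining train is still N+1 in its own right and can suffer a
# further single failure while carrying the full load.
remaining_train_units = n + 1
assert surviving_capacity_kw(remaining_train_units, unit_kw, failed=1) >= load_kw
print("redundancy checks passed")
```

The point the sketch makes is the one in the text: N+1 protects against a single component failure, while two independent N+1 trains protect against the loss of an entire power path.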
But even with all these precautions, a data centre still isn’t necessarily 100% ‘outage proof’. All data centre equipment has an inherent possibility of failure, and while N+1 massively reduces the risks you can never become complacent. After all, studies show that a proportion of failures are caused by human mismanagement of functioning equipment. This puts a huge emphasis on engineers being well trained, and critically, having the confidence and experience in knowing when to intervene and when to allow the automated systems to do their job. They must also be skilled in performing concurrent maintenance and minimising the time during which systems are running with limited resilience.
Prevention is always better than cure. Far greater emphasis should be placed on engineers reacting quickly when a component failure occurs rather than assuming that inbuilt resilience will solve all problems. This demands high-quality training for engineering staff, predictive diagnostics, watertight support contracts and sufficient on-site spares.
However, to be totally confident with data centre critical infrastructure come hell or high water, it should be rigorously tested. Not all data centres do this regularly. Some will have procedures to test their installations but rely on simulating the total loss of incoming power. This isn’t completely foolproof: the generators remain on standby and the equipment upstream of the UPS systems stays on, which means the cooling system and the lighting continue to function during testing.
Absolute proof comes with black testing. It’s not for the faint-hearted, and many data centres simply don’t do it. Every six months, incoming mains grid power is isolated and the UPS carries the full load for up to sixteen seconds while the emergency backup generators kick in. Clearly, power is only cut to one side of the infrastructure, and it’s done under strictly controlled conditions.
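The timing constraint behind a black test can be expressed as a simple sanity check. Only the sixteen-second generator start window comes from the description above; the battery autonomy and safety margin figures below are assumptions for illustration:

```python
# Back-of-envelope check that the UPS batteries can bridge the gap
# while the standby generators start and accept load. Only the
# 16-second window is from the text; other figures are assumptions.

ups_autonomy_s = 600        # assumed battery autonomy at full load (10 min)
generator_start_s = 16      # worst-case start-and-transfer window
required_margin = 5         # demand 5x headroom over the start window

bridge_ok = ups_autonomy_s >= generator_start_s * required_margin
print(f"UPS can bridge generator start: {bridge_ok}")  # prints True
```

In practice the margin matters because a generator may fail to start on the first attempt; the batteries must cover retries, not just the nominal window.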
When it comes to data centre critical power infrastructure, regular full-scale black testing is the only way to be sure the systems will function correctly in the event of a real problem. Hoping for the best in the event of real-life loss of mains power simply isn’t an option.
Uptime checklist
- Ensure N+1 redundancy at a minimum, but ideally 2N+x redundancy of critical systems to support separacy, testing and concurrent access
- Improving the MTTF (mean time to failure) of backup systems will deliver significant returns in availability, reliability and overall facility uptime performance
- Utilise predictive diagnostics, ensure fit for purpose support contracts, and hold appropriate spares stock on-site
- Regularly black test UPS and generator backup systems
- Drive a culture of continuous training, and practise regularly so staff are clear on spotting incipient problems and responding to live incidents – what to do, and when (and when not) to intervene
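The failure-and-repair trade-off in the checklist above can be made concrete with the standard steady-state availability formula, A = MTTF / (MTTF + MTTR), where MTTR is the mean time to repair. The figures below are invented purely for illustration – the point is that on-site spares and fast response shrink MTTR and lift availability:

```python
# Illustrative only: steady-state availability from MTTF and MTTR,
# using made-up figures, to show why reducing repair time matters.

def availability(mttf_h, mttr_h):
    """Fraction of time a component is operational."""
    return mttf_h / (mttf_h + mttr_h)

base = availability(50_000, 24)   # assumed 24 h to source parts and repair
fast = availability(50_000, 4)    # spares held on-site: 4 h to repair
print(f"{base:.5f} -> {fast:.5f}")  # prints 0.99952 -> 0.99992
```

The same MTTF with a six-fold faster repair turns roughly four hours of expected annual downtime into well under one – which is the return the checklist is pointing at.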