Planning and controlling data centre downtime

By Mike Jansma*
Tuesday, 03 September, 2013

Unplanned, expensive downtime is on the rise as data centres grow more complex. Trends such as mobile computing, server virtualisation and cloud storage are making demand more intricate, and system failures, human error and natural disasters are causing more unplanned outages. It’s been claimed that nearly a third of Australian businesses have experienced unplanned downtime in the last year. Such outages can significantly affect the viability of a business, damaging its reputation and costing millions in future business opportunities.

But a data centre can’t function without experiencing some downtime, when vital equipment needs to be replaced, upgraded or routinely maintained. The art lies in controlling this downtime, planning it effectively into the data centre schedule and understanding how to keep it to a minimum.

Avoid the unexpected

One specific cause of unexpected downtime is equipment failure, which can happen for a variety of reasons within the data centre environment. At some point, all kinds of IT gear, components, electrical kit and so on will fail - this can’t be avoided. But the key lies in fitting the data centre with back-up tools that will provide the intelligence and monitoring required to alert engineers before problems are likely to occur.
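As a rough illustration of alerting engineers before a problem occurs, the sketch below projects when a monitored value would reach its limit if its recent trend continued. It assumes evenly spaced readings and a simple linear trend; the function name, the sample readings and the 12-hour warning window are hypothetical, not taken from any particular monitoring product.

```python
from typing import Optional, Sequence


def hours_until_limit(readings: Sequence[float], limit: float,
                      interval_hours: float = 1.0) -> Optional[float]:
    """Estimate hours until `limit` is reached, from a simple linear trend.

    `readings` are evenly spaced samples (oldest first). Returns 0.0 if the
    latest reading is already at or above the limit, and None if there are
    fewer than two samples or the trend is flat or falling.
    """
    if len(readings) < 2:
        return None
    latest = readings[-1]
    if latest >= limit:
        return 0.0
    # Average change per sample between the first and last reading.
    slope = (latest - readings[0]) / (len(readings) - 1)
    if slope <= 0:
        return None
    return (limit - latest) / slope * interval_hours


# Hypothetical hourly current readings (amps) for one monitored circuit.
amps = [10.0, 10.5, 11.0, 11.5, 12.0]
eta = hours_until_limit(amps, limit=16.0)
if eta is not None and eta <= 12:
    print(f"Warning: limit projected to be reached in about {eta:.1f} hours")
```

A real monitoring tool would likely smooth noisy readings and use a more robust trend estimate, but the principle is the same: raise the alert while there is still time to act.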

Hot-swapping

Avoiding unplanned downtime and minimising scheduled outages relies on effective forward planning and futureproofing through purchasing decisions. It means investing in the right technologies to ensure that simple component failures won’t cripple operations needlessly. The ability to hot-swap power distribution unit (PDU) components allows engineers to perform crucial maintenance or replacements without having to power down servers.

The temptation, understandably, is to make savings where possible so that more of the data centre budget can go towards operating capacity, and this means the PDU is often the last item on the shopping list. However, this is a dangerous false economy. It is much like the budget airline equation: on the surface the price seems like a great deal, but once meals, priority boarding and seat fees have been added, the cost has risen above the scheduled airline’s set price. Once the huge costs to reputation and the loss of business due to unplanned downtime are factored in, failing to buy the right component is both a costly and a risky strategy.

Human error

Human error is one element that is impossible to completely remove from the downtime equation. The risk can be significantly reduced by providing comprehensive, regular training and clear and careful operating procedures, but it will not be completely eradicated.

But with the market demanding more and data centre complexity growing, the potential for human error is amplified. Working in tight spaces with multiple cords and cables, engineers face a growing risk of accidentally overloading branch circuits. Technology can help iron out these issues: selecting products with lockable cords, for example, protects the attached IT equipment against accidental power loss.
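Accidental branch circuit overloading is the kind of error a simple check can catch before a device is plugged in. The sketch below adds up the load on each circuit and flags any that exceed a chosen share of the breaker rating. The circuit names, the loads and the 80% ceiling are all illustrative assumptions; the appropriate ceiling depends on local electrical rules and site policy.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def overloaded_circuits(loads: List[Tuple[str, float]],
                        breaker_amps: Dict[str, float],
                        max_fraction: float = 0.8) -> Dict[str, float]:
    """Return {circuit: total_amps} for circuits above their allowed load.

    `loads` is a list of (circuit_name, device_amps) pairs. `max_fraction`
    is the share of the breaker rating treated as the safe ceiling; 0.8 is
    an assumed policy value, not a figure from the article.
    """
    totals: Dict[str, float] = defaultdict(float)
    for circuit, amps in loads:
        totals[circuit] += amps
    return {
        circuit: total
        for circuit, total in totals.items()
        if total > breaker_amps[circuit] * max_fraction
    }


# Hypothetical example: two devices on circuit A1, one on B1, 16 A breakers.
loads = [("A1", 6.0), ("A1", 7.5), ("B1", 4.0)]
breakers = {"A1": 16.0, "B1": 16.0}
print(overloaded_circuits(loads, breakers))  # A1 carries 13.5 A, above 12.8 A
```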

Communication is crucial

Measuring and understanding power consumption in a data centre is fundamental to reducing power-related disruption. Research by McKinsey & Company found that only 6 to 12% of energy was used to power operating servers. Idle servers that remain unmonitored are therefore wasting a large proportion of total energy consumption.

Equipped with this knowledge, managers can proactively ensure energy is distributed effectively throughout the data centre and make informed capacity planning decisions.
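To make the idea concrete, the sketch below estimates what share of measured power is drawn by servers that appear idle. It assumes per-server average power and CPU utilisation figures are available from metering; the `ServerReading` type, the server names and the 5% idle threshold are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerReading:
    name: str
    avg_watts: float        # average power draw over the period
    avg_cpu_percent: float  # average CPU utilisation over the period


def idle_power_share(readings: List[ServerReading],
                     idle_cpu_threshold: float = 5.0) -> float:
    """Fraction of total measured power drawn by servers deemed idle."""
    total = sum(r.avg_watts for r in readings)
    if total == 0:
        return 0.0
    idle = sum(r.avg_watts for r in readings
               if r.avg_cpu_percent < idle_cpu_threshold)
    return idle / total


# Hypothetical readings: one busy server and two near-idle ones.
fleet = [
    ServerReading("web-01", avg_watts=250.0, avg_cpu_percent=40.0),
    ServerReading("web-02", avg_watts=150.0, avg_cpu_percent=2.0),
    ServerReading("db-01", avg_watts=100.0, avg_cpu_percent=1.0),
]
print(f"Idle servers draw {idle_power_share(fleet):.0%} of measured power")
```

CPU utilisation is only a proxy for useful work, so a figure like this is best treated as a prompt for investigation rather than a verdict on any one server.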

Over-cautious cooling

Companies continue to put their IT resources at risk because they lack knowledge about downtime and so manage it in the wrong way. This is partly because vendors have fostered the attitude that downtime is unavoidable without extremely costly expenditure, when in fact small efficiencies can make a huge difference. Many data centres run at much lower temperatures than necessary because of a mistaken belief that IT equipment cannot withstand temperatures above 25°C. In sharp contrast, Dell research revealed that systems actually failed more often in colder data centres running below 16°C than in those running at 25°C.

The truth is that by raising the temperature of the data centre by as little as one degree, huge savings can be made on energy costs without putting operations at risk. In fact, ASHRAE (the American Society of Heating, Refrigerating and Air-Conditioning Engineers) has recently suggested that data centre managers should extend the recommended ranges for IT equipment to allow for aggressive economisation. But this is only possible with technology engineered to enable this flexibility.
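One way to spot over-cautious cooling is to compare rack inlet temperatures against a target band. The sketch below does this under stated assumptions: temperatures are in degrees Celsius, the lower bound of 16 echoes the Dell figure quoted above, and the upper bound of 27 is a placeholder to be replaced with the range recommended for the equipment actually installed. The function and rack names are hypothetical.

```python
from typing import Dict


def classify_inlet_temps(inlet_c: Dict[str, float],
                         low_c: float = 16.0,
                         high_c: float = 27.0) -> Dict[str, str]:
    """Label each rack 'overcooled', 'ok' or 'too warm' by inlet temperature.

    The default bounds are illustrative assumptions, not a standard.
    """
    labels = {}
    for rack, temp in inlet_c.items():
        if temp < low_c:
            labels[rack] = "overcooled"
        elif temp > high_c:
            labels[rack] = "too warm"
        else:
            labels[rack] = "ok"
    return labels


# Hypothetical inlet readings in degrees Celsius.
print(classify_inlet_temps({"rack-1": 14.5, "rack-2": 22.0, "rack-3": 29.0}))
```

Racks labelled as overcooled are candidates for a higher set point, provided the installed equipment is rated for it.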

Futureproofing

In technology, change is constant. If data centre stakeholders enable their infrastructure to incorporate new products, they will not limit their potential for growth. Left unattended, poor data centre infrastructure will continue to contribute to recurring downtime events and will limit growth. Many overlook the need to assess existing data centre infrastructure before considering a complete overhaul. But by investing in small but vital components, unnecessary large-scale investments can be avoided.

Collaboration between key stakeholders is crucial. PDUs can enable IT and facilities managers to understand each other and work together to manage server uptime and capacity planning. This in turn supports the crucial work of installers and engineers who need to operate within the data centre environment. By being proactive rather than reactive to downtime events, data centre operators can minimise the risk of downtime, even in areas that may seem small. If staff can control the products in their data centre and are fed accurate details about energy consumption, they are less likely to have to manage an unexpected downtime incident.

While risk cannot be entirely eliminated, the risk strategy budget can be reduced so that money is not wasted and funds can be released back into the business. Furthermore, by planning and buying the right small but vital components, managers can futureproof their data centre for stable growth.

*Mike Jansma is the chief marketing officer and co-founder of Enlogic.
