A Basic Guide To Data Center Maintenance (With Checklist)
Why Is the Maintenance of the Data Center Important?
The world of computers operates on 1s, 0s, and electrical pulses on circuit boards with no moving parts. So why is it so important to maintain a data center? Isn’t a data center just a collection of servers that are just big computers?
This premise is obviously ridiculous to anyone who has serviced IT equipment at all. Computers simply break in different ways than mechanical equipment, and a data center is a complex operation that relies on digital, electrical, and mechanical systems. Although a data center never moves, it’s more comparable to an automobile, at least from a maintenance perspective. If you ignore the signals your car sends you about maintenance, you’ll end up stranded and paying a lot of money to get up and running again. If you perform regular data center maintenance, including monitoring each system for issues, you avoid unplanned breakdowns and surprise expenses.
One of the first things you should consider is that human memory is inadequate for keeping up with all the procedures and to-dos that successful data center operation and maintenance require. You need to build systems and document procedures that take memory and guesswork out of the job. Below we’ll discuss some of the main areas you should cover in your data center maintenance schedule.
Consult Equipment Manuals
It’s a good idea to read equipment manuals and note the recommended maintenance procedures, duty cycles, and frequency. You may choose to deviate from the manufacturer’s recommendations, especially if you can combine maintenance tasks to gain efficiency, but the manual should be your first point of reference.
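One way to keep the intervals you note from the manuals is a small schedule that computes when each task is next due. The Python sketch below is only an illustration: the equipment names, procedures, intervals, and dates are made up, and the `MaintenanceTask` and `tasks_due` names are hypothetical. The `window_days` argument is an assumption added to show how tasks that fall due close together could be combined into one visit.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class MaintenanceTask:
    equipment: str        # hypothetical equipment label
    procedure: str        # procedure noted from the manual
    interval_days: int    # recommended frequency, in days
    last_done: date       # when the task was last performed

    def next_due(self) -> date:
        return self.last_done + timedelta(days=self.interval_days)


def tasks_due(tasks, today, window_days=0):
    """Return tasks due on or before today + window_days, soonest first."""
    cutoff = today + timedelta(days=window_days)
    due = [t for t in tasks if t.next_due() <= cutoff]
    return sorted(due, key=lambda t: t.next_due())


# Illustrative entries only; real intervals come from each manual.
tasks = [
    MaintenanceTask("CRAC unit 1", "replace air filter", 90, date(2024, 1, 10)),
    MaintenanceTask("UPS A", "inspect battery terminals", 180, date(2024, 1, 10)),
    MaintenanceTask("Generator", "run load test", 30, date(2024, 3, 1)),
]

for task in tasks_due(tasks, today=date(2024, 4, 5), window_days=7):
    print(task.next_due(), task.equipment, task.procedure)
```

With a seven-day window, the generator test and the filter change show up together, so they could be handled in the same visit.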
Data Center Maintenance Checklist
This checklist will help you plan your maintenance routines, especially if you’re starting from scratch.
Visual Inspection
It’s unreasonable to expect your employees to notice problems in the course of their day-to-day work. Humans are prone to target fixation and familiarity blindness. The stuff we see every day escapes notice, especially when we’re focused on a task.
Schedule time for employees to regularly walk the facility and look for issues. Include both the obvious areas (walkways, server cabinets) and the neglected ones (below, above, and behind equipment).
Use “sub-checklists” to help guide the inspection process:
- Is the area dirty, dusty, or wet?
- Are the proper lights illuminated?
- Are the mechanical components operating smoothly?
- Is anything loose, wiggling, or stuck?
- Are there any unusual noises?
- Does the area seem abnormally hot or cold?
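If you record inspections electronically, the sub-checklist above could be captured along these lines. This is a minimal Python sketch, not a prescribed format: the `CHECKLIST` structure, the `find_issues` helper, and the area name are all hypothetical. Each question is paired with the answer you would expect when nothing is wrong, because "yes" is good news for some questions and bad news for others.

```python
# Each question is paired with the answer expected when nothing is wrong.
CHECKLIST = [
    ("Is the area dirty, dusty, or wet?", False),
    ("Are the proper lights illuminated?", True),
    ("Are the mechanical components operating smoothly?", True),
    ("Is anything loose, wiggling, or stuck?", False),
    ("Are there any unusual noises?", False),
    ("Does the area seem abnormally hot or cold?", False),
]


def find_issues(area, answers):
    """Return (area, question) pairs where the answer differs from the expected one."""
    if len(answers) != len(CHECKLIST):
        raise ValueError("one answer is required per checklist question")
    return [
        (area, question)
        for (question, expected), answer in zip(CHECKLIST, answers)
        if answer != expected
    ]


# Hypothetical walk-through result for one area.
issues = find_issues("behind rack row B", [True, True, True, False, True, False])
for area, question in issues:
    print(f"{area}: follow up on '{question}'")
```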
Cleaning
Aside from the daily cleaning of your facility, which may be performed by a janitorial service, you need to regularly clean equipment and areas that don’t get daily attention. The accumulation of dirt and dust can cause overheating or other premature equipment failures. Some pieces of equipment may require special cleaning procedures to avoid static charge build-up, moisture exposure, or breakdown due to incompatible cleaning chemicals.
Testing
Some problems are easier to identify in advance if you regularly test for them. Stress tests, fail-over tests, and emergency backup tests are critical for long-term performance, which in data center terms means uptime. When you identify problems before they manifest as equipment failure, you have the option of bringing in redundant equipment and preventing any downtime. Some systems or pieces of equipment do not allow for failure testing. Fire suppression systems are a great example because triggering them would cause unnecessary damage. You may need to hire specialized professionals to test any system that has no redundancy or whose activation has irreversible effects.
Reporting & Monitoring
The best way to learn from mistakes is to examine history. In the case of a data center, you create that history by monitoring and reporting. If you can’t automatically monitor a piece of equipment from a central dashboard, then you should set up regular check-points to record functionality and flag abnormalities. These reports should become part of the service history for the equipment.
The better your system for monitoring and reporting, the more visibility you will have into the lifecycle of your equipment. IT personnel likely have anecdotal evidence for which pieces of equipment fail more often than they should, but historical data are the only way to know for certain.
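To show how a service history can replace anecdote, here is a small Python sketch that counts recorded failures per equipment model and divides by the number of installed units. Everything in it is assumed for illustration: the log format, the model names, the unit counts, and the `failures_per_unit` helper are hypothetical, and a real DCIM or ticketing export would look different.

```python
from collections import Counter

# Hypothetical service-history entries: (equipment model, event type).
service_log = [
    ("PDU model X", "failure"),
    ("PDU model X", "inspection"),
    ("PDU model X", "failure"),
    ("Fan tray model Y", "failure"),
    ("Fan tray model Y", "inspection"),
]

# Hypothetical count of installed units per model.
installed_units = {"PDU model X": 4, "Fan tray model Y": 20}


def failures_per_unit(log, units):
    """Return {model: recorded failures divided by installed units}."""
    failures = Counter(model for model, event in log if event == "failure")
    return {model: failures[model] / count for model, count in units.items()}


rates = failures_per_unit(service_log, installed_units)
for model, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {rate:.2f} failures per installed unit")
```

Normalizing by installed units matters because a model with many units in service will naturally log more failures in total, even if each unit is reliable.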
Repairs
Let’s return to the automotive analogy for a moment. While it may seem obvious that when something breaks, you should fix it, there’s plenty of evidence that humans will limp along, ignoring the problem as long as possible. This phenomenon is made worse if you don’t have the budget to conduct a comprehensive repair.
Predictive maintenance (where you replace something before it fails) and preventative maintenance (changing filters, fluids, and other consumables) can reduce the disruption from surprise failures. In any case, you need to allocate budget for both planned and unplanned repairs.
Security Checks
Cybersecurity is a major priority for data centers, but physical security should also be taken seriously. Performing perimeter checks and verifying that the building and grounds are properly protected is a vital maintenance task.
Disaster Preparedness
This is a category unto itself, but it’s another must-have item on any data center maintenance checklist. Do you have a disaster preparedness plan? Has your team practiced following it? Does equipment such as backup generators, battery banks, and HVAC systems work as intended when normal utilities are unavailable?
Server Room Maintenance Best Practices
While the majority of this list is aimed at organizations with data centers that require regular, comprehensive maintenance, it also applies if you are only managing a single server room.
If you’re co-locating your server at a larger data center and rely on other service providers to maintain your equipment, then you need to verify that all the maintenance is performed by qualified professionals.
Depending on the size of your organization, you may also want to consider hiring an outside IT consultancy to handle your maintenance needs.
Pay for a DCIM
If you don’t already have data center infrastructure management (DCIM) software in place and you’re managing your own server room or data center, then you need to shop for a DCIM soon. DCIM software will greatly simplify the process of cataloging equipment, monitoring duty cycles, scheduling maintenance, and managing documentation.
About i.e.Smart Systems
i.e.Smart Systems is a Houston, TX-based technology integration partner that specializes in the design and installation of audio/visual technology and structured cabling. For more than three decades, our team of in-house experts has partnered with business owners, architectural firms, general contractors, construction managers, real estate developers, and designers in the Houston market to deliver reliable, scalable solutions that align with their unique goals.