Application outages are going to happen no matter how much time, money, and effort you invest in people, processes, and technology. Even the largest and most technology-savvy companies have them with some regularity, though they spend millions to prevent them. Whatever the size of your organization or your budget, there are steps you can take to ensure that outages do minimal damage to your client relationships, your reputation, and your brand image.
This handy checklist will help you assess some of the most common causes of outages and remediate them.
Check your connections
It sounds rudimentary, but check that your servers and other essential network equipment have power and are running. In addition, make sure that your cables have not deteriorated or loosened over time.
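Before walking the server room, a quick reachability sweep can narrow down which machines to inspect. The following is a minimal sketch, assuming each device accepts TCP connections on a known port; the function names and the shape of the `targets` mapping are illustrative, not part of any particular tool.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_hosts(targets: dict[str, tuple[str, int]]) -> list[str]:
    """Return the names of targets that did not accept a connection.

    `targets` maps a friendly name to a (host, port) pair (hypothetical layout).
    """
    return [
        name
        for name, (host, port) in targets.items()
        if not is_reachable(host, port)
    ]
```

A host that fails this check is not necessarily unpowered or unplugged, since a firewall or a stopped service produces the same result, but it tells you where to look first.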
Check your external providers
Do you rely on cloud services or an external data center for your operations? Be sure there is not an outage there that is affecting you. Providers typically publish this information on a corporate status page, where users can find updates, notifications, issue classifications, and estimated timelines for when service will be restored.
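If a provider exposes its status in machine-readable form, the check can be automated. The sketch below assumes a JSON document with a `status.indicator` field whose value is `"none"` when everything is operational. That layout is an assumption for illustration, so check your provider's documentation for the real format and URL.

```python
import json
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 5.0) -> dict:
    """Download and decode a provider's JSON status document."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def provider_is_degraded(payload: dict) -> bool:
    """Return True unless the payload explicitly reports no problems.

    Assumes a payload shaped like {"status": {"indicator": "none"}}.
    A missing field is treated as degraded so that gaps get noticed.
    """
    indicator = payload.get("status", {}).get("indicator", "unknown")
    return indicator != "none"
```

Treating an unreadable or incomplete response as degraded is a deliberate choice here: during an outage, a false alarm costs less than a missed one.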
Has there been a cyber intrusion?
Security issues are a common and increasingly frequent cause of outages. Have a forensic plan in place to track down sneaky bugs and malware. Educate your employees about the risks of opening suspect emails or clicking on external links that may be phishing attempts. A DDoS attack is usually obvious while it is occurring, so make sure you have a remediation plan for it, which may involve your ISP. In many cases, preventing cyber attacks comes down to keeping your firewalls and whitelists updated, which most platforms now do automatically.
Are your software and OS updated?
Hundreds, if not thousands, of updates and patches are released every week, and any of them can cause the various software you use to stop playing nice together. Keep a maintenance schedule and a compatibility alert log so you can troubleshoot potential software compatibility issues more quickly.
Has there been a human error?
According to HP, almost 50% of IT outages are caused by some type of human error. Robust documentation can help, but also build a corporate culture where employees aren’t afraid of repercussions from reporting incidents. In any type of outage, the sooner an incident is reported, the faster it can be resolved, so time is of the essence. Make sure your employees know how to report an incident quickly. Nowadays, some leading companies run test scenarios (e.g., fake phishing emails) to see how their employees respond. In many cases, these tests identify employee training and threat awareness as areas that need more work.
Is a hardware failure to blame?
Physical failure is not the only concern; incompatibility with newer software or a newer OS can also be an issue. Inadequate storage, insufficient server memory, and power fluctuations are other possible causes.
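Of these causes, inadequate storage is the easiest to check in code. Here is a minimal sketch using Python's standard `shutil.disk_usage`; the 90% threshold and the function names are assumptions chosen for illustration, and memory or power monitoring would need platform-specific tooling that is not shown.

```python
import shutil
from typing import Optional


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def storage_warning(path: str = "/", threshold: float = 90.0) -> Optional[str]:
    """Return a warning message if usage exceeds threshold, otherwise None."""
    percent = disk_usage_percent(path)
    if percent > threshold:
        return f"{path} is {percent:.1f}% full (threshold {threshold:.0f}%)"
    return None
```

Run on a schedule, a check like this can flag a filling disk before it takes an application down.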
How’s the weather?
It may be sunny where you are, but weather could be affecting a data center or hub anywhere on the planet that is part of your system without your knowing it. Be sure you get reporting on your entire network. Sometimes these issues originate in a small geographic area, which can make it harder for enterprises to pinpoint the cause of the outage.
Create an Application Status Page
Some causes of downtime, such as weather or hardware failure, are beyond your control. Others can be prevented with policy changes or software updates. Planning ahead for all of them will help speed remediation of outages, but while they are occurring, your biggest problem is eliminating confusion and panic on the part of your clients and employees. This can be accomplished by only one method: communication. An Application Status Page can deliver monitoring alerts, notify users of incidents, report their ongoing status and expected duration, and announce when they are resolved.
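To make the idea concrete, here is a minimal sketch of the information a status page tracks for each incident: its current status, its expected duration, and a running log of updates. The `Incident` class, the status names, and the text layout are all hypothetical and not taken from any particular status page product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    title: str
    started_at: datetime
    status: str = "investigating"  # hypothetical values: investigating, identified, resolved
    expected_resolution: Optional[datetime] = None
    updates: list[tuple[datetime, str]] = field(default_factory=list)

    def post_update(
        self,
        message: str,
        status: Optional[str] = None,
        when: Optional[datetime] = None,
    ) -> None:
        """Record a timestamped update and optionally change the status."""
        if status is not None:
            self.status = status
        self.updates.append((when or datetime.now(timezone.utc), message))


def render(incident: Incident) -> str:
    """Format an incident as the plain text a status page might display."""
    lines = [
        f"{incident.title} [{incident.status}]",
        f"Started: {incident.started_at:%Y-%m-%d %H:%M} UTC",
    ]
    # An estimate is only useful while the incident is still open.
    if incident.expected_resolution and incident.status != "resolved":
        lines.append(
            f"Expected resolution: {incident.expected_resolution:%Y-%m-%d %H:%M} UTC"
        )
    for when, message in incident.updates:
        lines.append(f"- {when:%H:%M} {message}")
    return "\n".join(lines)
```

Each call to `post_update` adds one entry to the timeline that clients and employees see, which is what keeps them informed while the outage is still being worked on.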