The State of DR for Manufacturing Companies Today

In today’s highly connected IT world, prolonged downtime can be a manufacturing company’s nightmare. Whether it stems from a natural disaster or from human error, an outage can ripple throughout every aspect of your business, threatening operations, driving up costs, creating negative customer experiences and even putting your business’s reputation at stake, as we’ve seen with some highly publicized large-scale outages in recent years.

It’s no surprise, then, that disaster recovery (DR) has become a top consideration for manufacturing companies around the globe. Yet while many businesses are aware of the concerning implications of NOT having a DR plan in place, there’s still some confusion about what truly constitutes a sound DR plan. Common misconceptions, many of which relate to understanding your downtime tolerance, can give business owners a false sense of security: they believe they have a sufficient plan in place when in reality that plan is exposing them to risk.

In this blog, we’ll lay out everything you need to know, starting with a DR marketplace overview and ending with specific steps you can take, and considerations to weigh, to create a DR plan that’s a good fit for your unique business. We’ll highlight some major pitfalls along the way that every organization (particularly those running Power i systems) needs to be aware of, starting with the most basic: what constitutes a disaster.

What Constitutes a Disaster?

Thinking about “disasters” may conjure up images of fires, floods, storms, explosions, earthquakes, major power outages and the like. Yet too often we think of disasters as large-scale events, a misconception that can lead to complacency or an attitude of “it probably won’t happen to my business.”

While these types of natural occurrences are certainly a factor, disasters can also be triggered by seemingly benign events that lead to business disruptions. These include air conditioner leaks, overheated server spaces, a blown boiler, mold and other physical issues.

Additionally, hardware failures, software failures and human errors can always rear their heads, even among the most well-prepared IT teams. They may not cause major disruptions if your IBM i system health is in check, but for many organizations a variety of business processes depend on the continuity of small, seemingly insignificant, hard-to-troubleshoot add-ons.

How do you know if your IBM i system is in good health? An online self-diagnostic tool is an easy way to determine your system’s needs.

Whatever the scale of the disaster, downtime is downtime, and failure to have an adequate DR plan in place threatens your business’s ability to keep ticking along as it should. Essentially, the need to maintain business continuity is the driving force behind disaster recovery, a critical point to keep in mind as you approach building out a DR plan.

More SMBs Embracing DR

As the need for DR grows in importance, many SMBs who may have once dismissed it are now jumping on board. Cloud technology has played a big role in this, which we’ll cover in the next section. The need for DR is also driven by the competitive pressures of today’s manufacturing marketplace. Regardless of size, no company can afford to be the weak link in a highly connected chain of customers and suppliers. And when unplanned downtime does occur, it’s often small businesses who have the hardest time recovering. Consider the following unsettling statistics for SMBs:

    • According to the Small Business Administration (SBA), 90% of companies fail within a year unless they have the ability to resume operations within five days following a disaster.
    • In a study conducted by CA Technologies, about half of the small businesses surveyed said unplanned outages affected their reputation. Of those, 18% would describe the impact on reputation as “very damaging.”
    • In 2016, the Federal Emergency Management Agency reported that nearly 40% of small businesses are forced to shut down after a disaster.
    • 52% of SMBs say it would take at least three months to recover from a disaster, according to a survey of 500 small business owners conducted by Nationwide.

The Adoption of Cloud Technology

It wasn’t too long ago that the adoption of cloud technology was characterized by a cautious wait-and-see attitude, rife with perceived risks and unknowns. Today, of course, there’s been a dramatic shift in that attitude. Manufacturing companies are quick to embrace all things cloud, viewing “as-a-service” technologies as the most viable and secure option for their needs—and Disaster Recovery as a Service (DRaaS) is no exception.

DRaaS has exploded into the market in the last few years, breaking down many of the traditional barriers companies once faced when considering a DR plan, such as lack of budget and lack of resources. DRaaS has opened the doors for SMBs everywhere, making DR implementation easier, more accessible and often more cost-effective. DRaaS solutions essentially eliminate the need for large cash outlays for equipment, and they can be implemented quickly and without the need for complex IT infrastructure. Compared with managing everything in-house, they can also promise ready-to-go communication to multiple locations, faster testing times and fewer personnel requirements.

The Global Growth of DRaaS in Manufacturing

While some organizations still succeed in implementing and managing their DR efforts in-house, or with a “DIY” approach, the broader trend among manufacturing companies is to use a DRaaS solution for at least a subset of their DR needs. (Related: Still Not Sold on the Advantages of Cloud Solutions for IBM Power Systems? Keep Reading.) In fact, DRaaS is among the fastest-growing cloud-based services in the market today. Here’s a closer look at how that’s playing out in the manufacturing sector:

[Figure: PSGI bar chart and table chart on DRaaS growth in the manufacturing sector]

A Shifting Mindset

Along with the widespread acceptance of cloud solutions, the fast-growing DRaaS market can also be attributed to a more strategic mindset in how manufacturing businesses are thinking about their IT environments. DR was once a question of “What will it cost me to recover and rebuild?” Today, the notion of DR has evolved to reflect a more proactive mentality, with businesses now asking, “What’s it going to cost me if I CAN’T recover my systems?” and “What’s the true cost of that disruption to my customers and my business?” 

Here again, this changing mentality is a result of a highly competitive—and highly intertwined—supply chain. Our disasters are shared; the downstream effect of an outage requires more focus and proactive planning than ever before. As technology advances, players across the board expect more and have less tolerance for disruptions. The best partners are the ones who can fulfill your orders reliably, making 24/7/365 availability the new, inexorable standard.

Key Elements & Pitfalls of Disaster Recovery

There’s much to think about when it comes to implementing DR for your company, so in this section, we’ll break down some of the key terms, elements and pitfalls that surround a DR program.

DR and High Availability (HA)

While they often go hand in hand, it’s important to make the distinction between DR and High Availability (HA).  By definition, disaster recovery is an area of security that aims to protect and recover your IT infrastructure in the event of a human or natural disaster.

A DR program, then, is a pre-planned, documented set of policies and procedures. It ensures business continuity when the IT infrastructure that supports your mission-critical business systems is disrupted.

High Availability refers to the technology needed to minimize IT disruptions by ensuring continuity when critical parts of your environment are not available. HA solutions minimize downtime and data loss through replication: data is replicated to an environment in a separate geographic location. With an HA solution, as changes are made on a production system, they are replicated to a backup system in near-real-time. That means if something were to happen to your production system, such as a fire or flood, an HA solution could initiate a “failover,” in which your HA system assumes the role of your production system.

Together, HA and DR are critical components that ensure data integrity and business continuity around the clock.

Read more about how High Availability fits into your IBM i Disaster Recovery plan – check out this related article.

Recovery Point Objective & Recovery Time Objective

How long can your systems be down before your business is seriously impacted? Recovery time objective (RTO) is an organization’s best measure of this. By definition, RTO is the target duration within which a business process must be restored after a disruption to avoid unacceptable consequences to business continuity. In general, the shorter your RTO, the more robust your DR plan will need to be.

In the event of a disaster that leads to downtime, systems designers aim to recover to the point set by the recovery point objective (RPO): the point in time to which you need to be able to go back and retrieve your data, which in effect defines how much data loss, measured in time, you can tolerate. In the most simplified terms, if your business had a disaster two hours ago, your desired recovery point would be the moments immediately before the disaster struck.
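
To make the two objectives concrete, here is a minimal Python sketch that checks a single recovery event against both targets. It assumes RPO is measured as the gap between the last recoverable copy of your data and the moment of the disaster, and RTO as the time from the disaster until systems are restored; the function name, timestamps and targets are hypothetical.

```python
from datetime import datetime, timedelta


def meets_objectives(last_good_copy: datetime, disaster: datetime,
                     restored: datetime, rpo: timedelta, rto: timedelta) -> dict:
    """Check one recovery event against RPO and RTO targets."""
    # Anything written after the last recoverable copy is lost.
    data_loss = disaster - last_good_copy
    # Time from the disaster until the business process is usable again.
    downtime = restored - disaster
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "meets_rpo": data_loss <= rpo,
        "meets_rto": downtime <= rto,
    }


if __name__ == "__main__":
    # Hypothetical event: last good copy at 08:00, disaster at 10:00,
    # systems restored at 13:00, against a 15-minute RPO and a 4-hour RTO.
    result = meets_objectives(
        last_good_copy=datetime(2024, 1, 1, 8, 0),
        disaster=datetime(2024, 1, 1, 10, 0),
        restored=datetime(2024, 1, 1, 13, 0),
        rpo=timedelta(minutes=15),
        rto=timedelta(hours=4),
    )
    print(result["meets_rpo"], result["meets_rto"])  # False True
```

In this example the systems come back within the RTO, but two hours of data are lost against a 15-minute RPO: the two objectives are independent, and a plan can satisfy one while missing the other.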

Here are some key factors organizations use to determine an optimal RTO and RPO:

[Figure: PSGI comparison of RPO and RTO]

These questions demand a careful, coordinated approach, and it’s critical to note that this is not just a task for your IT team. Your approach to DR should involve key business stakeholders throughout your manufacturing firm. (In a later section, How to Create a Top-Down DR Playbook, we’ll take a deep dive into how to do this effectively.)

Keep In Mind These RTO Pitfalls

With RTO and RPO serving as major components of DR, manufacturing companies should be aware of a mistake that we see time and time again: in general, RTO estimates are too high.

Why? Too often companies approach RTO from an internal perspective. Business leaders assess their tolerance for downtime based on what’s happening “inside their own walls,” looking to their company’s systems, critical applications and internal processes to arrive at an estimate. What’s missing from that approach is customers. Customer impact and customer attrition are critical parts of this equation that are often overlooked.

When thinking about RTO, ask the following questions, and you’re likely to realize your tolerance for downtime is far lower than you assumed.

  • How will my customers be affected if my systems become unexpectedly unavailable?
  • What is the external impact of orders that are not accounted for during unplanned downtime?
  • What happens when those orders cannot ship?

Again, by creating a top-down DR plan that starts with your business requirements—not your applications—your company can gain a more reasonable and comprehensive understanding of downtime tolerances. Fortunately, we’ve seen the needle move in the right direction in general, with more manufacturing companies recognizing the importance of realistic RTO assessments.

In a recent survey conducted by Information Technology Intelligence Corp (ITIC), 72 percent of respondents consider 99.99 percent to be the minimum acceptable level of reliability for their main line of business servers, up from 49 percent in 2014. 
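
To put the 99.99 percent figure in perspective, the short Python sketch below converts an availability target into the unplanned downtime it permits per year. It assumes a 365-day year and treats availability simply as the share of time a server is up; at 99.99 percent, the annual budget works out to about 52.6 minutes.

```python
# Assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Unplanned downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}%: {allowed_downtime_minutes(target):,.1f} minutes per year")
```

Moving from 99.9 to 99.99 percent shrinks the budget from roughly 8.8 hours to under an hour a year, which is why so many businesses now treat even short outages as unacceptable.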

 

Cloud Backups and Tape Backups

While backups play an important role in recording your data, essentially creating “pulse points” in time to which you can go back and retrieve data, make no mistake: backups alone do not constitute a DR plan. They will help you recover data up to the point of your most recent backup, but they do little to cover you during and after downtime. An HA/DRaaS solution, on the other hand, entails a far more comprehensive approach, ensuring business continuity before, during and after a disaster. Through near-real-time data replication, it keeps mission-critical applications and business processes running smoothly until the production system is repaired and back online, something your company cannot achieve with backups alone.

Cloud Backups

Many organizations find that backing up data to the cloud is an easy and attainable option, making cloud-based backups (also referred to as backup as a service or BaaS) yet another fast-growing trend in cloud technologies. For companies considering this option, it’s important to think about not just the back-up process, but the recovery process as well. Be sure to ask potential vendors what the back-up and recovery sets look like. Does the provider solution offer fast restore times? And are those restore times guaranteed? Vendor credibility is extremely important—so as you would with any cloud provider, be sure you do your homework during the selection process.

Tape Backups

Of all the pitfalls to avoid in your approach to DR, reliance on tape backups is arguably the most concerning. Why? In the critical moments following an unplanned outage, recovered tapes are found not to work at all in roughly 30% of cases. Here are some additional findings from a global study that show how many IT administrators are still struggling with backup tapes for their legacy data:

  • 19% of companies asked how many legacy tapes they store said they hold more than 100.
  • 5% of respondents store more than 1,000 tapes.
  • 27% of companies surveyed admitted they do not have clear insight into what specific data is stored in their legacy tapes.

This leads to a scenario we’ve seen unfold all too often: a disaster occurs and a manufacturing business finds that it needs to restore, say, its Power i machine. In that moment, IT managers discover that even the best-case scenario is dismal: obtaining backup tapes from off-site storage, obtaining a new Power i machine, reconfiguring it and then restoring it from tape can take up to four days. And remember, a disconcerting number of tapes are found not to work at all. (Read more at 3 Major Pitfalls to Avoid With Your IBM i Disaster Recovery Plan.)

To make matters worse, companies often find they have no viable way to re-enter into their system all the transactions left unaccounted for from the point the downtime occurred. When you think about the number of transactions that occur within even one hour, let alone half a day, the magnitude of this problem becomes overwhelming.
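
A rough estimate helps size the re-entry problem. The Python sketch below is a back-of-the-envelope model under stated assumptions, not a standard formula: it counts both the transactions recorded after the last backup (which a tape restore cannot bring back) and the work handled manually during the outage. The function name and all inputs are hypothetical; the example uses half a day since the last backup and the four-day restore described above.

```python
def transactions_to_reenter(transactions_per_hour: float,
                            hours_since_last_backup: float,
                            hours_of_downtime: float,
                            manual_share: float = 1.0) -> int:
    """Estimate how many transactions must be keyed back in after a tape restore.

    Counts two groups (an assumption about what "unaccounted for" covers):
    - work recorded after the last backup, which the restore does not bring back
    - work handled manually while the system was down; manual_share is the
      fraction of normal volume the business kept processing by hand
    """
    lost_since_backup = transactions_per_hour * hours_since_last_backup
    handled_manually = transactions_per_hour * hours_of_downtime * manual_share
    return round(lost_since_backup + handled_manually)


if __name__ == "__main__":
    # Hypothetical plant: 200 transactions/hour, nightly backup taken 12 hours
    # before the failure, a four-day (96-hour) restore, half of normal volume
    # kept going on paper.
    print(transactions_to_reenter(200, 12, 96, manual_share=0.5))  # 12000
```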

How to Create A Top-Down DR Playbook

In this section, we’ll offer strategies to help you create a DR plan that’s sufficient for your specific manufacturing organization. Remember, a DR plan is essentially a prioritized set of policies and procedures that will enable you to continue doing business in the event of a disaster or system outage. A “top-down” DR program, then, as we’re referring to here, should start with an assessment of your business requirements. It should ultimately yield a matrix of business applications showing downtime tolerances, solutions and associated costs on a per-application basis. The following steps discuss each of these factors and how they work together.

Downtime Assumptions

In the event of a disaster, which processes are most vital to your operations? How quickly will the business applications supporting them need to be recovered? When thinking about DR, these questions are too often answered through assumptions. Businesses often believe 24 to 48 hours to be acceptable downtime for a given application, but rarely are those assumptions tested for accuracy.

Critical Tip: Ditch your downtime assumptions and keep an open mind! It’s nothing personal, but it’s important to note that as business leaders approach DR from the top down, more often than not they will find their previous assumptions to be incorrect.

By following the steps below, you can take a more realistic and strategic approach to DR. And that starts with prioritizing your business processes.

1: Prioritize Business Processes

What are your mission-critical business processes? Which ones are the most important to keep your business going? For example, your “Tier 1” processes are those which cannot stop even for a short period of time, such as your customer-facing systems or order shipping systems. These processes are likely to require a mirroring solution that gives your company near-immediate recovery.

A Tier 2 process, on the other hand, is one that may not need immediate recovery. Why? Customers probably will not feel the impact right away, and you can continue to do business when that process goes down. A Tier 3 business process, then, may be one that is important to employees but not essential to running your business… and so on and so forth. To assess your processes and assign these priorities, you’ll need to do the following:

  • Spend time interviewing key business teams. Prioritizing your business processes should NOT be viewed as an IT task. It should entail a coordinated, iterative approach among business decision-makers. Challenge your assumptions and discuss what would happen, not just at a system level but at a business level, if certain business functions were not operational. Identify interdependencies for a given process, as well as the impact on the resources involved. These discussions allow you to think comprehensively about the implications of that process not working. Be sure to document everything accordingly.
  • Think first about business requirements, then about applications. After reviewing your essential business functions, now it’s time to consider which applications can be mapped to those functions. Ask these questions:
    • What are the interdependencies for those processes (in terms of both people and functions)?
    • If downtime occurred, what will temporarily replace the systems that were in place?
    • Can any of the work be done manually -- and if so, how and for how long?
    • Can any of the work NOT be done even manually?
    • For processes that become manual, how can data be captured and reentered back into the system?

This analysis will essentially form the foundation of your DR plan and inform other criteria in your DR matrix, such as solutions and testing.

2: Assess Downtime Tolerances

At the same time you’re evaluating and prioritizing your business requirements, you will need to assign downtime tolerances for your business applications.  

  • Remember the critical tip: downtime assumptions for manufacturing companies are notoriously inaccurate. Many will find that seemingly small and innocuous applications (for example, a pick-list report for your warehouse team) will suddenly halt operations when they go down, turning that 48-hour window into what feels like an impossible scenario. To exacerbate the problem, business leaders often discover that the paper backups or manual processes they believed they had in place are simply not viable. Your interview process should surface such issues and inform your downtime requirements, which stakeholders should then formalize and agree upon.
  • Know that disasters go both ways. You may have decided that your firm can work around a 24- to 48-hour outage, but what happens to your partners in other locations if data is not being delivered to them? Working with your business stakeholders, think comprehensively about the downstream effect of an outage and the wide range of scenarios that can occur. Conversely, how will YOU be affected if one of your key suppliers is offline for 48 hours? Does that partner have a DR plan in place? The answers you get from your suppliers may influence your own downtime tolerances.
  • Consider downtime on a per-application basis. Remember, a top-down design looks at business requirements first, then the applications that support them. When you know which applications are linked to your Tier 1 business processes, you can assign downtime accordingly—for example, applications supporting Tier 1 processes will have zero tolerance.

[Figure: Sample DR matrix]
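
The roll-up from business processes to applications can be sketched in a few lines of Python. This is a hypothetical illustration, not a prescribed format: the tier tolerances, solutions and costs are placeholders your stakeholders would replace, and it assumes an application inherits the most critical tier of any process it supports, so a pick-list report that feeds order shipping is treated as Tier 1 with zero tolerance.

```python
from dataclasses import dataclass

# Hypothetical downtime tolerance, in hours, for each tier.
# Your stakeholders agree on the real values during the interview process.
TIER_TOLERANCE_HOURS = {1: 0, 2: 24, 3: 72}


@dataclass
class MatrixRow:
    application: str
    tier: int
    tolerance_hours: int
    solution: str
    annual_cost: float


def build_matrix(process_tiers: dict[str, int],
                 app_processes: dict[str, list[str]],
                 solutions: dict[int, tuple[str, float]]) -> list[MatrixRow]:
    """Roll business-process tiers up into a per-application DR matrix.

    An application takes the most critical (lowest-numbered) tier of any
    process it supports, so a small report feeding a Tier 1 process is Tier 1.
    """
    rows = []
    for app, processes in app_processes.items():
        tier = min(process_tiers[p] for p in processes)
        solution, cost = solutions[tier]
        rows.append(MatrixRow(app, tier, TIER_TOLERANCE_HOURS[tier], solution, cost))
    # Most critical applications first, then alphabetical.
    return sorted(rows, key=lambda r: (r.tier, r.application))


if __name__ == "__main__":
    process_tiers = {"order shipping": 1, "payroll": 2, "internal wiki": 3}
    app_processes = {
        "ERP": ["order shipping", "payroll"],
        "pick-list report": ["order shipping"],
        "HR system": ["payroll"],
        "wiki": ["internal wiki"],
    }
    # Placeholder solutions and costs, not real pricing.
    solutions = {
        1: ("HA replication with failover", 50_000),
        2: ("DRaaS recovery", 15_000),
        3: ("cloud backup", 2_000),
    }
    for row in build_matrix(process_tiers, app_processes, solutions):
        print(row)
```

Taking the strictest tier is a deliberately conservative choice; it reflects the point above that seemingly innocuous applications can halt Tier 1 operations.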

3: Factor In Solutions and Cost

Given our integrated IT environments, downtime costs can be far reaching. Customers are not always patient. If your manufacturing firm is down for a few days, delayed business can mean lost business – and the revenue lost during that time must be considered in your DR plan.  

Downtime costs, of course, extend beyond revenue lost. Be sure you factor in soft costs as well. For example, in the event you must revert to a manual process, what are the costs associated with having a less efficient workforce? Additionally, what costs are incurred not just during but after downtime? Ask questions such as:

  • What’s your playbook to recover work that is done manually?
  • What's the cost if manual processes don’t work?
  • Do you have a method for accounting for all transactions during downtime? (If you operate for years without testing your manual processes, you may be surprised at what your business cannot do and at the associated costs!)
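
Because downtime costs are unique to each business, there is no standard formula, but a simple model can keep the cost categories above from being overlooked. The Python sketch below is a hypothetical starting point: it adds lost revenue, the soft cost of a less efficient workforce during the outage, and the effort to recover manual work afterwards. Every parameter name and example value is an assumption to be replaced with your own figures.

```python
def downtime_cost(hours_down: float,
                  revenue_per_hour: float,
                  lost_business_share: float,
                  staff_affected: int,
                  loaded_hourly_rate: float,
                  efficiency_loss: float,
                  recovery_hours: float,
                  recovery_staff: int) -> dict[str, float]:
    """Rough downtime cost split into lost revenue and soft costs.

    lost_business_share: fraction of revenue during the outage that is lost
        for good (delayed orders that customers take elsewhere)
    efficiency_loss: fraction of productive time lost while staff work manually
    recovery_hours / recovery_staff: effort after the outage to re-enter and
        reconcile manually handled work
    """
    lost_revenue = hours_down * revenue_per_hour * lost_business_share
    reduced_productivity = hours_down * staff_affected * loaded_hourly_rate * efficiency_loss
    recovery_effort = recovery_hours * recovery_staff * loaded_hourly_rate
    return {
        "lost_revenue": lost_revenue,
        "reduced_productivity": reduced_productivity,
        "recovery_effort": recovery_effort,
        "total": lost_revenue + reduced_productivity + recovery_effort,
    }


if __name__ == "__main__":
    # Hypothetical inputs for a 48-hour outage; replace with your own figures.
    cost = downtime_cost(hours_down=48, revenue_per_hour=10_000, lost_business_share=0.25,
                         staff_affected=40, loaded_hourly_rate=50, efficiency_loss=0.5,
                         recovery_hours=16, recovery_staff=6)
    print(cost["total"])  # 172800.0
```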

Note that determining the true cost of downtime is perhaps the most difficult aspect of disaster recovery. Downtime costs are unique to your specific business. They cannot be measured in a standardized way, nor is there a one-size-fits-all solution. Whether it’s a simple solution or a sophisticated one, a solution that’s driven by your business requirements is the best-fit choice for your organization.

4: Test Your Top-Down Design

Ultimately, your DR matrix rolls up each of these considerations into a DR plan that shows how each piece comes together: your business applications, each with an acceptable downtime tolerance, a solution and associated cost. Now it’s time to test through simulation.  Testing, however, is another area of DR that’s commonly subject to pitfalls.

DR testing frequency is largely inadequate, with only about 40% of companies testing their DR plans once annually. 28% test their DR plan rarely, if ever.

Source: 2016 State of Disaster Recovery Survey

So where do most companies go wrong? Too often, applications are tested without considering connections and interdependencies, or the downstream/upstream effects. When considering a testing or backup plan, many companies will start by looking at their core ERP systems, believing that if they have a plan to get their main systems up and running, they’ll be okay.

But as we know, business processes today rely on outside systems. From Windows servers in data centers, to cloud-based systems, to banking systems, to freight company systems… there are far more dependencies between interfaces and integrations than a cursory test (or a core system test) could ever account for. Instead, you can do the following:

  • Use a coordinated approach. As with prioritizing your business requirements, recognize that DR testing is not a task for your IT team alone. It should be centered around business requirements and therefore be evaluated and tested among business stakeholders.
  • Test Tier 1 processes. Testing every single business application is simply not realistic. It is, however, critical to test your Tier 1 processes. Simulate a typical day’s activity, keeping in mind the potential impact on your customers. Do applications that are integrated to your core applications function properly?
  • Test people, too. In addition to testing processes, test “people” during downtime so you can build processes around them. Will employees know what they need to do during downtime? Do they have the tools they need and the processes in place to get back up to speed? How will affected resources manage workflows during downtime? Where will they get the documentation needed – is it stored in a system that might be down?
  • Don’t forget to document. Be sure to formalize the results of your testing plan. According to the 2016 State of Disaster Recovery survey, only 40% of companies have a documented disaster recovery plan to guide their staff in the event of an outage.
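
Testing Tier 1 processes with their interdependencies starts with knowing what those dependencies are. As a hypothetical illustration, the Python sketch below walks a dependency map breadth-first and lists every system a process relies on, directly or indirectly, so that the Windows servers, cloud services, banking links and freight systems feeding a Tier 1 process end up in the test scope. The function name, system names and the map itself are made up; in practice the map would come from your stakeholder interviews.

```python
from collections import deque


def dependency_scope(dependencies: dict[str, list[str]], process: str) -> list[str]:
    """List every system a process relies on, directly or indirectly.

    Breadth-first walk: direct dependencies come first, then theirs, and so
    on. The seen set keeps shared or circular dependencies from repeating.
    """
    seen = {process}
    scope = []
    queue = deque([process])
    while queue:
        current = queue.popleft()
        for dep in dependencies.get(current, []):
            if dep not in seen:
                seen.add(dep)
                scope.append(dep)
                queue.append(dep)
    return scope


if __name__ == "__main__":
    # Hypothetical dependency map for one Tier 1 process.
    dependencies = {
        "order shipping": ["ERP", "freight carrier system"],
        "ERP": ["Windows file server", "banking interface"],
        "freight carrier system": ["cloud label service"],
        "banking interface": ["bank file transfer"],
    }
    for system in dependency_scope(dependencies, "order shipping"):
        print(system)
```

A core-system test would cover only the ERP entry; the walk surfaces the five other systems a realistic day of order shipping also needs.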

Finally, keep in mind that failed tests are often documented but then nothing is done to correct them. Remember to be honest and keep an open mind about your results.

If you would like to take an even deeper dive on any of these steps, and learn how they have applied to other manufacturers, you can download our recorded webinar.

In-House DR vs. Outsourcing

Do-It-Yourself (DIY) Disaster Recovery? Third-Party Support? Or Both?

Once your organization has a good understanding of DR, from the big picture down to the nitty gritty details, the next question becomes: do I manage my DR efforts in-house or outsource to a provider? Or a combination of the two?

Most manufacturers will find that the decision does not have to be one or the other. A DIY approach to DR can take on many forms, and there are varying degrees of services you may or may not need to entrust to a third party.  Key points of consideration include:

  • Personnel. Do you have resources available to manage, coordinate and create a top-down DR plan? Once your DR plan/matrix is created, is there a resource in place who can execute on that plan over time? What processes do you have in place to fill the gap during downtime, and when that happens, who are the personnel involved and how available are they?
  • Your DR location/data centers. How does data get back and forth to your sites? What communication capabilities do you have to your failover sites? Do you use the cloud as your DR failover site? Do you use full replication to your own data centers?
  • Cost. When comparing the cost of a DIY approach to an outsourced one, what costs come with more internal control? Keeping DR responsibilities in-house often means more costly personnel requirements. There are also upfront costs for things like equipment, hardware, licensing, data center space and so on, whereas DRaaS solutions do not require large cash outlays.

These are just some of the many factors that will influence your decision, and every manufacturing organization must look to its unique needs to find the best-fit approach to DR. That said, the trend we’re seeing today with “aaS” solutions, whether for all or just a subset of DR needs, is impossible to ignore. The growing DRaaS market makes DR solutions more accessible and affordable for businesses large and small, and as that happens, solutions become more sophisticated and therefore better suited to protect you from incidents that threaten business continuity. In our tightly woven global network of manufacturers, customers and suppliers, the increased adoption of disaster recovery is good news all around.

Do You Have Questions We Didn’t Address? Ask Away!

Whatever phase you’re in with disaster recovery planning, your manufacturing organization can greatly benefit from partnering with a provider that has both industry-specific business experience and technical expertise. PSGi partners with Thrive to bring you the full breadth of skills and knowledge. From your apps to your servers, we can help with everything from handling security and audit requirements for your DRaaS site, to assessing and prioritizing your business’s DR needs, to finding the right technical solutions (across any platform) that meet your unique criteria. If you have questions, we’re here to help. Simply contact us or fill out the form below and let us know how we can get your DR plan on the right track.