Tuesday, May 18, 2010

Chapter 5. Disaster Recovery and Business Continuity

- Disaster recovery for systems typically focuses on making alternative processes and resources available for transaction processing. A disaster recovery plan (DRP) should reduce the length of recovery time necessary and also the costs associated with recovery.

- A disaster can be classified as a disruption that causes critical information resources to be inoperative for a period of time, adversely affecting business operations.

- Business continuity plans (BCP) are the result of a process of plan creation to ensure that critical business functions can withstand a variety of emergencies.

- Disaster-recovery plans deal with the immediate restoration of the organization’s business systems, while the business continuity plan also deals with the long-term issues before, during, and after the disaster. The BCP should include getting employees to the appropriate facilities; communicating with the public, partners, and customers; and making the transition from emergency recovery back to normal operations. The DRP is a part of the BCP and is the responsibility of senior management.

- These are the attributes of a disaster:
➤ Unplanned and unanticipated
➤ Impacts critical business functions
➤ Has the capacity for significant loss

- During the initiation of the business continuity planning process, the BCP team should prepare for a meeting with senior management to define the project goals and objectives, present the project schedule, and review the proposed interview schedule.

In preparation for this meeting, the BCP team should do the following:
➤ Review the organizational structure to determine what resources will be assigned to the team
➤ Review existing disaster-planning policies, strategies, and procedures
➤ Review existing continuity plans
➤ Research any events that have occurred previously (severe weather, fires, equipment or facility failures, and so on) and that had or could have had a negative effect on the organization
➤ Create a draft project schedule and associated documents (timing, resources, interview questionnaires, roles and responsibilities, and so on)

- Per ISACA, the business continuity planning process can be divided into the following phases:
➤ Analyze the business impact
➤ Develop business-recovery strategies
➤ Develop a detailed plan
➤ Implement the plan
➤ Test and maintain the plan

- A business impact analysis (BIA) is used to identify threats that can impact continuity of operations.

- The results of the BIA should provide a clear picture of the continuity impact in terms of the impact to human and financial resources, as well as the reputation of the organization.

- The BIA team should work with senior management, IT personnel, and end users to identify all resources used during normal operations. Although BCP and DRP are often implemented and tested by middle management and end users, the ultimate responsibility and accountability for the plans remain with executive management, such as the board of directors.

- The following steps can be used for the framework of business impact assessment:
➤ Gather business impact analysis data: Questionnaires or interviews
➤ Review the BIA results: Check for completeness and consistency, and follow up with interviews for areas of ambiguity or missing information
➤ Establish the recovery time for operations, processes, and systems
➤ Define recovery alternatives and costs

- End-user involvement is critical during the business impact assessment phase of business continuity planning.

- The BIA questionnaire and interviews should gather the following information from the business units:
➤ Financial impacts resulting from the inability to operate for prolonged periods of time
➤ Operational impacts within each business unit
➤ Expenses associated with continuing operations after a disruption
➤ Current policies and procedures to resume operations in the event of a disruption
➤ Technical requirements for recovery

- The BIA should include both quantitative and qualitative questions. Quantitative questions generally describe the economic or financial impacts of a potential disruption. Qualitative questions address impacts that cannot be quantified in monetary terms. These types of impacts are generally associated with the business impact of a disaster and include damage to reputation and loss of confidence in customer services or products.

- Before the development of a BCP/DRP, the BIA team should develop a recommendation or findings report for senior management. The purpose of this report is to provide senior management with a draft priority list of the business unit service and support recovery, as well as the financial and operational impacts that drive the prioritization.

- In reviewing the information gathered during the BIA, the team should determine which critical information resources support the organization’s critical business processes. This relationship is important because the disruption of an information resource is not a disaster unless that resource is critical to a business process. Per ISACA, each resource should be assessed to determine criticality. Indications of criticality might include these:
➤ The process supports lives or people’s health and safety.
➤ Disruption of the process would cause a loss of income to the organization or exceptional costs that are unacceptable.
➤ The process must meet legal or statutory requirements.

- In making this determination, the BIA team should consider two cost factors. The first is the cost associated with downtime, which grows over time until it reaches the point at which the business can no longer function. The second cost factor is the cost associated with recovery or resumption of services by implementing the business continuity plan. An optimal BCP and associated strategies should be based on the point in time at which the combined total of both cost factors is at a minimum.
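
The trade-off between the two cost factors can be illustrated with a small calculation. The sketch below is not an ISACA formula: both cost functions and every figure in it are invented for illustration. It assumes that downtime cost rises with the length of the outage, that a strategy with a longer recovery time is cheaper to buy, and it picks the recovery time at which the sum of the two is smallest.

```python
# Illustrative only: both cost functions and all numbers are assumptions.

def downtime_cost(hours: float) -> float:
    """Hypothetical cumulative loss from being down for `hours` hours."""
    return 2_000 * hours  # assumed: 2,000 currency units lost per hour


def recovery_cost(hours: float) -> float:
    """Hypothetical cost of a strategy that restores service in `hours` hours.

    Faster recovery (smaller `hours`) is assumed to be more expensive.
    """
    return 288_000 / hours


def cheapest_recovery_time(candidates: list[float]) -> float:
    """Return the candidate recovery time with the lowest combined cost."""
    return min(candidates, key=lambda h: downtime_cost(h) + recovery_cost(h))


candidate_hours = [1, 4, 8, 12, 24, 48, 72]
best = cheapest_recovery_time(candidate_hours)
print(best, downtime_cost(best) + recovery_cost(best))
```

With these made-up figures, very short recovery times are dominated by the recovery cost and very long ones by the downtime cost, so the minimum falls in between.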

- The next step in developing the business continuity plan is to identify recovery strategies and select the strategy or strategies that best meet the organization’s needs. It is important to remember that the strategy should include the technologies required for recovery and that the policies and procedures should include specific sequencing. The sequence in which systems are recovered is important for ensuring that the organization can function effectively following a disaster.

- The selection of the recovery strategy is based on the following:
➤ The criticality of the business process and the applications supporting the process
➤ The cost of the downtime and recovery
➤ Time required for recovery
➤ Security

- Critical: These functions cannot be performed unless they are replaced by identical capabilities. Critical applications cannot be replaced by manual methods. Tolerance to interruption is very low; therefore, cost of interruption is very high.

- Vital: These functions can be performed manually, but only for a brief period of time. There is a higher tolerance of interruption than with critical systems and, therefore, somewhat lower costs of interruption, provided that functions are restored within a certain time frame (usually five days or less).

- Sensitive: These functions can be performed manually, at a tolerable cost and for an extended period of time. Although they can be performed manually, it usually is a difficult process and requires additional staff to perform.

- Noncritical: These functions can be interrupted for an extended period of time, at little or no cost to the company, and require little or no catching up when restored.
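
The four classes can be expressed as a simple decision rule. The sketch below is one possible reading of the definitions above, not an official classification procedure; the function name, its two parameters, and the use of five days as the cut-off between vital and sensitive are assumptions made for illustration.

```python
from enum import Enum


class Criticality(Enum):
    CRITICAL = "critical"
    VITAL = "vital"
    SENSITIVE = "sensitive"
    NONCRITICAL = "noncritical"


def classify(manual_days: float, interruption_cost_negligible: bool) -> Criticality:
    """Classify a business function (hypothetical helper).

    manual_days: how long the function can be carried out manually
        (0 means there is no manual substitute).
    interruption_cost_negligible: True if a long interruption costs
        little or nothing and needs little catching up afterwards.
    """
    if interruption_cost_negligible:
        return Criticality.NONCRITICAL
    if manual_days <= 0:
        return Criticality.CRITICAL   # cannot be replaced by manual methods
    if manual_days <= 5:              # assumed threshold for "a brief period"
        return Criticality.VITAL
    return Criticality.SENSITIVE      # manual operation for an extended period
```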

- The best strategy is one that takes into account the cost of downtime and recovery, the criticality of the system, and the likelihood of occurrence determined during the BIA.

- In addition to actual recovery procedures, the organization should implement different levels of redundancy so that a relatively small event does not escalate to a full-blown disaster. An example of this type of control is to use redundant routing or fully meshed wide area networks.

- A hot site is a facility that is basically a mirror image of the organization’s current processing facility. It can be ready for use within a short period of time and contains the equipment, network, operating systems, and applications that are compatible with the primary facility being backed up. When hot sites are used, the staff, data files, and documentation are the only additional items needed in the facility.

- A hot site is generally the highest cost among recovery options, but it can be justified when critical applications and data need to resume operations in a short period of time. The costs associated include subscription costs, monthly fees, testing costs, activation costs, and hourly or daily charges (when activated). The physical facility should incorporate the same level of security as the primary facility and should not be easily identifiable externally (with signs or company logos, for example).

- Although hot sites are the most expensive type of alternate processing redundancy, they are very appropriate for operations that require immediate or very short recovery times.

- Warm sites are sites that contain only a portion of the equipment and applications required for recovery. In a warm site recovery, it is assumed that computer equipment and operating software can be procured quickly in the event of a disaster. The warm site might contain some computing equipment that is generally of a lower capacity than the equipment at the primary facility. The contracting and use of a warm site are generally lower cost than a hot site but take longer to get critical business functions back online. Because of the requirement of ordering, receiving, and installing equipment and operating systems, a warm site might be operational in days or weeks, as opposed to hours with a hot site.

- The costs associated with a warm site are similar to but lower than those of a hot site and include subscription costs, monthly fees, testing costs, activation costs, and hourly or daily charges (when activated).

- A cold site can be considered a basic recovery site, in that it has the required space for equipment and environmental controls (air conditioning, heating, power, and so on) but does not contain any equipment or connectivity. A cold site is ready to receive the equipment necessary for a recovery but will take several weeks to activate. Of the three major types of off-site processing facilities (hot, warm, and cold), a cold site is characterized by at least providing for electricity and HVAC. A warm site improves upon this by providing for redundant equipment and software that can be made operational within a short time.

- A cold site is often an acceptable solution for preparing for recovery of noncritical systems and data.
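
The relationship between site type, cost, and activation time can be summarized in a small lookup. In the sketch below, the activation times are assumptions chosen to match the rough time frames in these notes (hours for a hot site, days or weeks for a warm site, several weeks for a cold site); they are not ISACA figures, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SiteType:
    name: str
    activation_hours: float  # assumed illustrative figure
    relative_cost: int       # 1 = cheapest, 3 = most expensive


# Assumed activation times, loosely following the descriptions above.
SITES = [
    SiteType("hot", 4, 3),
    SiteType("warm", 7 * 24, 2),
    SiteType("cold", 21 * 24, 1),
]


def cheapest_site_meeting(max_recovery_hours: float) -> Optional[str]:
    """Return the least expensive site type that can be activated in time.

    Returns None when no listed site type is fast enough.
    """
    in_time = [s for s in SITES if s.activation_hours <= max_recovery_hours]
    if not in_time:
        return None
    return min(in_time, key=lambda s: s.relative_cost).name
```

For example, under these assumed figures `cheapest_site_meeting(8)` selects the hot site, while a function that can wait a month is served by a cold site.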

- Duplicate processing facilities are similar to hot site facilities, with the exception that they are completely dedicated, self-developed recovery facilities. The organization might have a primary site in Washington, D.C., and might designate a duplicate site at one of its own facilities in Utah. The duplicate facility would have the same equipment, operating systems, and applications and might have regularly synchronized data. In this example, the facility can be activated in a relatively short period of time and does not require the organization to notify a third party for activation.

- Reciprocal agreements are arrangements between two or more organizations with similar equipment and applications. In this type of agreement, the organizations agree to provide computer time (and sometimes facility space) to one another in the event of an emergency. These types of agreements are generally low cost and can be used between organizations that have unique hardware or software that cannot be maintained at a hot or warm site. The disadvantages of reciprocal agreements are that they are not enforceable, that hardware and software changes are generally not communicated over time (requiring significant reconfiguration in the event of an emergency), and that the sites generally do not employ capacity planning, which may render them useless in the event of an emergency.

- A reciprocal agreement is not usually appropriate as an alternate processing solution for organizations with large databases or live transaction processing.

- The BCP team should develop a detailed plan for recovery. The following factors should be considered when developing the detailed plan:
➤ Predisaster readiness: Contracts, maintenance and testing, policies, and procedures
➤ Evacuation procedures: Personnel, required company information
➤ Disaster declaration: What defines a disaster? Who is responsible for declaring?
➤ Identification of critical business processes and key personnel (business and IT)
➤ Plan responsibilities: Plan objectives
➤ Roles and responsibilities: Who is responsible for what?
➤ Contract information: Who maintains it, and where is it?
➤ Procedures for recovery: Step-by-step procedures with defined responsibilities
➤ Resource identification: Hardware, software, and personnel required for recovery

- The BCP should be written in clear, simple language and should be understandable to all in the organization. When the plan is complete, a copy should be maintained off-site and should be easily accessible.

- The business continuity plan should be created to minimize the effect of disruptions. The process associated with the development of the plan should include the following steps:
➤ Perform a business impact analysis to determine the effect of disruptions on critical business processes
➤ Identify, prioritize, and sequence resources (systems and personnel) required to support critical business processes in the event of a disruption
➤ Identify recovery strategies that meet the needs of the organization in resumption of critical business functions until permanent facilities are available
➤ Develop the detailed disaster-recovery plan for the IT systems and data that support the critical business functions
➤ Test both the business continuity and disaster recovery plans
➤ Maintain the plan and ensure that changes in business process, critical business functions, and systems assets, such as replacement of hardware, are immediately recorded within the business continuity plan

- As an IS auditor, you should review the plan to ensure that it will allow the organization to resume its critical business functions in the event of a disaster. ISACA states that the IS auditor’s tasks include the following:
➤ Evaluating the business continuity plans to determine their adequacy and currency, by reviewing the plans and comparing them to appropriate standards or government regulations
➤ Verifying that the business continuity plans are effective, by reviewing the results from previous tests performed by both IT and end-user personnel
➤ Evaluating off-site storage to ensure its adequacy, by inspecting the facility and reviewing its contents, security, and environmental controls
➤ Evaluating the ability of IT and user personnel to respond effectively in emergency situations, by reviewing emergency procedures, employee training, and results of their tests and drills

- The organization’s critical data should be stored both onsite, for quick recovery in nondisaster situations, and off-site, in case of a disaster. The Storage Networking Industry Association defines a backup as follows:
A collection of data stored on (usually removable) nonvolatile storage media for purposes of recovery in case the original copy of data is lost or becomes inaccessible.

- Three backup methods are used:
➤ Full backup—In a full backup, all the files (in some cases, applications) are backed up by copying them to a tape or other storage medium. This type of backup is the easiest backup to perform but requires the most time and space on the backup media.
➤ Differential backup—A differential backup is a procedure that backs up only the files that have been changed or added since the last full backup. This type of backup reduces the time and media required.
➤ Incremental backup—An incremental backup is a procedure that backs up only the files that have been added or changed since the last backup (whether full or incremental).

- For instance, the organization might choose to perform a single full weekly backup combined with daily incremental backups. This method decreases the time and media required for the daily backups but increases restoration time. This type of restoration requires more steps and, therefore, more time because the administrator will have to restore the full backup first and then apply the incremental backups sequentially until all the data is restored.
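
The restoration order described above can be sketched in a few lines. The function below is a simplified illustration, not a real backup tool: it assumes the schedule contains at least one full backup and combines full backups with either incrementals or differentials, not both. The labels and the function name are hypothetical.

```python
def restore_chain(backups: list[tuple[str, str]]) -> list[str]:
    """Return the labels of the backup sets to restore, in order.

    `backups` is a chronological list of (label, kind) pairs, where kind
    is "full", "incremental", or "differential".
    """
    # Start from the most recent full backup; older sets are not needed.
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full][0]]
    later = backups[last_full + 1:]

    differentials = [label for label, kind in later if kind == "differential"]
    incrementals = [label for label, kind in later if kind == "incremental"]

    if differentials:
        # A differential holds every change since the full backup,
        # so only the newest one has to be applied.
        chain.append(differentials[-1])
    else:
        # Incrementals each hold one step, so all are applied in sequence.
        chain.extend(incrementals)
    return chain


week = [("Fri-full", "full"), ("Mon", "incremental"),
        ("Tue", "incremental"), ("Wed", "incremental")]
print(restore_chain(week))
```

With the weekly full plus daily incremental schedule shown, a restore on Thursday morning needs four backup sets; the same week run with differentials would need only two.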

- Tape backup media is a magnetic medium and, as such, is susceptible to damage from both the environment in which it is stored (temperature, humidity, and so on) and physical damage to the tape through excessive use. For this reason, administrators use backup schemes that allow tapes to be regularly rotated and eventually retired from backup service.

- One popular scheme is the grandfather, father, and son scheme (GFS), in which the central server writes to a single tape or tape set per backup. When using the GFS scheme, the backup sets are daily (son), weekly (father), and monthly (grandfather).

- Daily backups come first. The four daily (son) tapes are usually labeled Mon–Thur and used on their corresponding days. The tape rotation is based on how long the organization wants to maintain file history. If a file history of one week is required, the tapes are overwritten each week; if a history of three weeks is required, each tape is overwritten every three weeks (requiring 12 tapes). The five father tapes (some months have five weeks) are used for the full weekly backups (Friday tapes).
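
The daily and weekly parts of the rotation can be sketched as follows. The tape labels are hypothetical, the sketch assumes no backups run at weekends, and it does not model the monthly (grandfather) set because the notes do not say on which day that backup is taken.

```python
from datetime import date

DAILY_LABELS = ["Mon", "Tue", "Wed", "Thu"]


def daily_tapes_needed(weeks_of_history: int) -> int:
    """Four daily (son) tapes are needed per week of file history kept."""
    return len(DAILY_LABELS) * weeks_of_history


def tape_for(day: date) -> str:
    """Return a hypothetical tape label for the backup taken on `day`."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday <= 3:
        return f"son-{DAILY_LABELS[weekday]}"
    if weekday == 4:
        # Number the Fridays within the month, 1 through 5.
        friday_number = (day.day - 1) // 7 + 1
        return f"father-{friday_number}"
    return "no backup scheduled"  # assumption: no weekend backups


print(daily_tapes_needed(3))
print(tape_for(date(2010, 5, 18)), tape_for(date(2010, 5, 21)))
```

Keeping three weeks of history gives the 12 daily tapes mentioned above, and a month with five Fridays uses all five father tapes.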

- Two types of tape storage are used:
➤ Onsite storage—One copy of the backup tapes should be stored onsite to effect quick recovery of critical files. Another copy should be moved to an off-site location as redundant storage. Onsite tapes should be stored in a secure fireproof vault, and all access to tapes should be logged.
➤ Off-site storage—The organization could contract with a reputable records storage company for off-site tape storage, or could maintain the facility itself. The physical and environmental controls for the off-site facility should be equal to those of the organization. The contract should stipulate who from the organization will have the authority to provide and pick up tapes, as well as the time frame in which tapes can be delivered in the event of a disaster.

- A storage area network (SAN) is a special-purpose network in which different types of data storage are associated with servers and users. A SAN can either interconnect attached storage on servers into a storage array or connect the servers and users to a storage device that contains disk arrays.

- If the organization cannot implement an off-site SAN, it might opt for an electronic vaulting option. With this option, the organization contracts with a vaulting provider that provides disk arrays for the backup and storage of the organization’s applications and data. Generally, the organization installs an agent on all the servers and workstations that require a backup and identifies the files to be included in the backup. The agent then performs full and incremental backups, and moves that data via a broadband connection to the electronic vault. Organizations that have a significant amount of data or high levels of change might incur issues in moving large amounts of data across a broadband connection.
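
A quick estimate shows why large data volumes can be a problem for electronic vaulting. The helper below is a back-of-the-envelope sketch: the function name is hypothetical, the data size and link speed in the example are made up, and the 80 percent efficiency factor is an assumption standing in for protocol overhead and contention.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to send `data_gb` gigabytes over a link.

    link_mbps is the nominal link speed in megabits per second, and
    efficiency is the assumed fraction of that speed actually achieved.
    """
    bits = data_gb * 8 * 1000**3  # decimal gigabytes to bits
    usable_bits_per_second = link_mbps * 1000**2 * efficiency
    return bits / usable_bits_per_second / 3600


# Example with made-up figures: 500 GB of changed data over a 100 Mbps link.
print(round(transfer_hours(500, 100), 1))
```

Under these assumptions the transfer takes longer than a typical overnight backup window, which is the kind of issue the paragraph above warns about.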

- As a part of regular testing and maintenance, organizations can opt to perform either full or partial testing of recovery and continuity plans, though most organizations do not perform full-scale tests because of resource constraints. To continue to improve recovery and continuity plans, organizations can perform a paper test, a walk-through test, or a preparedness test (a localized version of the full test).

- A paper test is the least complex test that can be performed. This test helps ensure that the plan is complete and that all team members are familiar with their responsibilities within the plan. With this type of test, the BCP/DRP plan documents are simply distributed to appropriate managers and BCP/DRP team members for review, markup, and comment.

- A walk-through test is an extension of the paper testing, in that the appropriate managers and BCP/DRP team members actually meet to discuss and walk through procedures of the plan, individual training needs, and clarification of critical plan elements.

- Of the three major types of BCP tests (paper, walk-through, and preparedness), a walk-through test requires only that representatives from each operational area meet to review the plan.

- A preparedness test is a localized version of the full test in which the team members and participants simulate an actual outage or disaster and simulate performing the steps necessary to effect recovery and continuity. This test can be performed against specific areas of the plan instead of the entire plan. This test validates response capability, demonstrates skills and training, and practices decision-making capabilities. Only the preparedness test actually takes the primary resources offline to test the capabilities of the backup resources and processing.

- Of the three major types of BCP tests (paper, walk-through, and preparedness), only the preparedness test uses actual resources to simulate a system crash and validate the plan’s effectiveness.

- A full operational test is the most comprehensive test and includes all team members and participants in the plan. The BCP team and participants should have multiple paper and preparedness tests completed before performing a full operational test. This test involves the mobilization of personnel, and disrupts and restores operations just as an outage or disaster would. This test extends the preparedness test by including actual notification, mobilization of resources, processing of data, and utilization of backup media for restoration.

- During the test, detailed documentation and observations should be maintained. Per ISACA, these measurements might include the following:
➤ Time—The elapsed time for completion of prescribed tasks, delivery of equipment, assembly of personnel, and arrival at a predetermined site.
➤ Amount—Amount of work performed at the backup site by clerical personnel and information systems processing operations.
➤ Count—The number of vital records successfully carried to the backup site versus the required number, and the number of supplies and equipment requested versus those actually received. Also, the number of critical systems successfully recovered can be measured with the number of transactions processed.
➤ Accuracy—Accuracy of the data entry at the recovery site versus normal accuracy. Also, the accuracy of actual processing cycles can be determined by comparing output results with those for the same period processed under normal conditions.

- It is important for organizations to remember that a BCP is a living document and will change according to the needs of the organization. The organization should appoint a business continuity coordinator to ensure that periodic testing and maintenance of the plan are implemented. The coordinator should also ensure that team members and participants receive regular training associated with their duties in the BCP and maintain records and results of testing.

- Business disruptions, as opposed to disasters, can be caused by a variety of internal and external factors, including these:
➤ Equipment failure (processors, hard drives, memory, and so on)
➤ Service failures (telecommunications outages, power outages, external application failure, and so on)
➤ Application or data corruption

- In addition to the disaster-recovery plan, the IT department should have policies and procedures for backup, storage of backup media (onsite and offsite), defined roles and responsibilities, and recovery. The IS auditor should review the following to ensure that the organization can recover data and applications in the event of a short-term disruption:
➤ Backup procedures—The procedures identify the backup scheme and define responsibilities for implementing backups.
➤ Onsite storage—All storage media should be stored in environmentally controlled facilities and should be secured in a fire-rated safe.
➤ Off-site storage—The off-site storage facility should have environmental and security controls that equal those of the onsite storage facility. The contract with the off-site facility should contain the points of contact within the organization who have the authority to check storage media in and out of the facility, as well as clearly defined response times for the delivery of storage media in the event of a disaster.

- The organization’s insurance coverage should take into account the actual cost of recovery and should include coverage for media damage, business interruption, and business continuity processing.

- There are two general types of insurance: property and liability. Property insurance can protect the organization from a wide variety of losses, including these:
➤ Buildings
➤ Personal property owned by the organization (tables, desks, chairs, and equipment)
➤ Loss of income
➤ Earthquake
➤ Flood (usually an additional rider on the policy)

- A general liability policy is designed to provide coverage for the following:
➤ Personal injury
➤ Fire liability
➤ Medical expenses
➤ General liability for accidents occurring on the organization’s premises

- The organization must ensure that all costs associated with a disaster and the recovery are included in its insurance policies. It might be necessary to purchase additional insurance policies to extend coverage (sometimes called umbrella policies) or purchase specific insurance coverage (flood or terrorism, for example) based on the needs of the organization.

- The BCP team should define key personnel within the business units and IT to implement the plan. These personnel should be a part of the planning, testing, and maintenance of the BCP. Key personnel should have alternates to function in their place, where necessary.

- Teams with responsibilities during recovery include the following:
➤ Salvage team—This team manages the relocation project. It also makes a more detailed assessment of the damage to the facilities and equipment than was performed initially, provides the emergency-management team with the information required to determine whether planning should be directed toward reconstruction or relocation, provides information necessary for filling out insurance claims, and coordinates the efforts necessary for immediate records salvage, such as restoring paper documents and electronic media.
➤ Relocation team—This team coordinates the process of moving from the hot site to a new location or to the restored original location.

- The most important control aspect of maintaining data backups at off-site storage facilities is ensuring that critical and time-sensitive data is kept current at the off-site storage facility.

- Duplicate logging of transactions, use of before-and-after images of master records, and time stamping of transactions and communications data are all recommended best practices for establishing effective redundancy of transaction databases.

- Electronic vaulting and remote journaling are both considered effective redundancy controls for backing up real-time transaction databases.
