Data Protection Diaries Reliability, Availability, Serviceability RAS Fundamentals

November 26, 2017 – 6:53 pm

Reliability, Availability, Serviceability RAS Fundamentals

Companion to Software Defined Data Infrastructure Essentials – Cloud, Converged, Virtual Fundamental Server Storage I/O Tradecraft ( CRC Press 2017)

server storage I/O data infrastructure trends

By Greg November 26, 2017

This is Part 2 of a multi-part series on Data Protection fundamental tools topics techniques terms technologies trends tradecraft tips as a follow-up to my Data Protection Diaries series, as well as a companion to my new book Software Defined Data Infrastructure Essentials – Cloud, Converged, Virtual Server Storage I/O Fundamental tradecraft (CRC Press 2017).

Software Defined Data Infrastructure Essentials Book SDDC

Click here to view the previous post Part 1 Data Infrastructure Data Protection Fundamentals, and click here to view the next post Part 3 Data Protection Access Availability RAID Erasure Codes (EC) including LRC.

Post in the series includes excerpts from Software Defined Data Infrastructure (SDDI) pertaining to data protection for legacy along with software defined data centers ( SDDC), data infrastructures in general along with related topics. In addition to excerpts, the posts also contain links to articles, tips, posts, videos, webinars, events and other companion material. Note that figure numbers in this series are those from the SDDI book and not in the order that they appear in the posts.

In this post the focus is around Data Protection availability from Chapter 9 which includes access, durability, RAS, RAID and Erasure Codes (including LRC), mirroring and replication along with related topics.

SDDC, SDI, SDDI data infrastructure
Figure 1.5 Data Infrastructures and other IT Infrastructure Layers

Reliability, Availability, Serviceability (RAS) Data Protection Fundamentals

Reliability, Availability Serviceability (RAS) and other access availability along with Data Protection topics are covered in chapter 9. A resilient data infrastructure (software-defined, SDDC and legacy) protects, preserves, secures and serves information involving various layers of technology. These technologies enable various layers ( altitudes) of functionality, from devices up to and through the various applications themselves.

SDDI SDDC Data Protection Big Picture
Figure 9.2 Various threat issues and challenges that drive the need for data protection

Some applications need a faster rebuild, while others need sustained performance (bandwidth, latency, IOPs, or transactions) with the slower rebuild; some need lower cost at the expense of performance; others are ok with more space if other objectives are meet. The result is that since everything is different yet there are similarities, there is also the need to tune how data Infrastructure protects, preserves, secures, and serves applications and data.

General reliability, availability, serviceability, and data protection functionality includes:

  • Manually or automatically via policies, start, stop, pause, resume protection
  • Adjust priorities of protection tasks, including speed, for faster or slower protection
  • Fast-reacting to changes, disruptions or failures, or slower cautious approaches
  • Workload and application load balancing (performance, availability, and capacity)

RAS can be optimized for:

  • Reduced redundancy for lower overall costs vs. resiliency
  • Basic or standard availability (leverage component plus)
  • High availability (use better components, multiple systems, multiple sites)
  • Fault-tolerant with no single points of failure (SPOF)
  • Faster restart, restore, rebuild, or repair with higher overhead costs
  • Lower overhead costs (space and performance) with lower resiliency
  • Lower impact to applications during rebuild vs. faster repair
  • Maintenance and planned outages or for continues operations

Common availability Data Protection related terms, technologies, techniques, trends and topics pertaining to data protection from availability and access to durability and consistency to point in time protection and security are shown below.

Data Protection Gaps and Air Gap

There are Good Data Protection Gaps that provide recovery points to a past time enabling recoverability in the future to move forward. Another good data protection gap is an Air Gap that isolates protection copies off-site or off-line so that they can not be tampered with enabling recovery from ransomware and other software defined threats. There are Bad data protection gaps including gaps in coverage where data is not protected or items are missing. Then there are Ugly data protecting gaps which include Bad gaps that result in what you think is protected are not and finding that your copies are bad when it is too late.

Data Protection Gaps Good Bad Ugly
Data Protection Gaps Good Bad and Ugly

The following figure shows good data protection gaps including recovery points (point in time protection) along with air gaps.

Good Data Protection Gaps
Figure 9.9 Air Gaps and Data Protection

Fault / Failures To Tolerate (FTT)

FTT is how many faults or failures to tolerate for a given solution or service which in turn determines what mode of protection, or fault tolerant mode ( FTM) to use.

Fault Tolerant Mode (FTM)

FTM is the mode or technique used to enable resiliency and protect against some number of faults.

Fault / Failure Domains

Fault or Failure domains are places and things that can fail from regions, data centers or availability zones, clusters, stamps, pods, servers, networks, storage, hardware (systems, components including SSD and HDDs, power supplies, adapters). Other fault domain topics and focus areas include facility power, cooling, software including applications, databases, operating systems and hypervisors among others.

SDDI SDDC Fault Domains Zones Regions
Figure 9.5 Various Fault and Failure Domains, Regions, Locations


Clustering is a technique and technology for enabling resiliency, as well as scaling performance, availability, and capacity. Clusters can be local, remote, or wide-area to support different data infrastructure objectives, combined with replication and other techniques.

SDDI SDDC Clustering
Figure 9.12 Clustering and Replication Examples

Another characteristic of clustering and resiliency techniques is the ability to detect and react quickly to failures to isolate and contain faults, as well as invoking automatic repair if needed. Different clustering technologies enable various approaches, from proprietary hardware and software tightly coupled to loosely coupled general-purpose hardware or software.

Clustering characteristics include:

  • Application, database, file system, operating system (Windows Storage Replica)
  • Storage systems, appliances, adapters and network devices
  • Hypervisors ( Hyper-V, VMware vSphere ESXi and vSAN among others)
  • Share everything, share some things, share nothing
  • Tightly or loosely coupled with common or individual system metadata
  • Local in a data center, campus, metro, or stretch cluster
  • Wide-area in different regions and availability zones
  • Active/active for fast fail over or restart, active/passive (standby) mode

Additional clustering considerations include:

  • How does performance scale as nodes are added, or what overhead exists?
  • How is cluster resource locking in shared environments handled?
  • How many (or few) nodes are needed for quorum to exist?
  • Network and I/O interface (and management) requirements
  • Cluster partition or split-brain (i.e., cluster splits into two)?
  • Fast-reacting fail over and resiliency vs. overhead of failing back
  • Locality of where applications are located vs. storage access and clustering

Where To Learn More

Continue reading additional posts in this series of Data Infrastructure Data Protection fundamentals and companion to Software Defined Data Infrastructure Essentials (CRC Press 2017) book, as well as the following links covering technology, trends, tools, techniques, tradecraft and tips.

Additional learning experiences along with common questions (and answers), as well as tips can be found in Software Defined Data Infrastructure Essentials book.

Software Defined Data Infrastructure Essentials Book SDDC

What This All Means

Everything is not the same across different environments, data centers, data infrastructures and applications. There are various performance, availability, capacity economic (PACE) considerations along with service level objectives (SLO). Availability means being able to access information resources (applications, data and underlying data infrastructure resources), as well as data being consistent along with durable. Being durable means enabling data to be accessible in the event of a device, component or other fault domain item failures (hardware, software, data center).

Just as everything is not the same across different environments, there are various techniques, technologies and tools that can be used in different ways to enable availability and accessibility. These include high availability (HA), RAS, mirroring, replication, parity along with derivative erasure code (EC), LRC, RS and other RAID implementations, along with clustering. Also keep in mind that pertaining to data protection, there are good gaps (e.g. time intervals for recovery points, air gaps), bad gaps (missed coverage or lack of protection), and ugly gaps (not being able to recover from a gap in time).

Note that mirroring, replication, EC, LRC, RS or other Parity and RAID approaches are not replacements for backup, rather they are companions to time interval based recovery point protection such as snapshots, backup, checkpoints, consistency points and versioning among others (discussed in follow-up posts in this series).

Which data protection tool, technology to trend is the best depends on what you are trying to accomplish and your application workload PACE requirements along with SLOs. Get your copy of Software Defined Data Infrastructure Essentials here at, at CRC Press among other locations and learn more here. Meanwhile, continue reading with the next post in this series, Part 3 Data Protection Access Availability RAID Erasure Codes (EC) including LRC.

Ok, nuff said, for now.


Greg Schulz – Microsoft MVP Cloud and Data Center Management, VMware vExpert 2010-2017 (vSAN and vCloud). Author of Software Defined Data Infrastructure Essentials (CRC Press), as well as Cloud and Virtual Data Storage Networking (CRC Press), The Green and Virtual Data Center (CRC Press), Resilient Storage Networks (Elsevier) and twitter @storageio. Courteous comments are welcome for consideration. First published on any reproduction in whole, in part, with changes to content, without source attribution under title or without permission is forbidden.

All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2018 Server StorageIO and UnlimitedIO. All Rights Reserved. StorageIO is a registered Trade Mark (TM) of Server StorageIO.