Reliability, Availability, Serviceability (RAS) Fundamentals
Companion to Software Defined Data Infrastructure Essentials – Cloud, Converged, Virtual Fundamental Server Storage I/O Tradecraft (CRC Press 2017)
This is Part 2 of a multi-part series on data protection fundamentals (tools, topics, techniques, terms, technologies, trends, tradecraft, and tips). It is a follow-up to my Data Protection Diaries series, as well as a companion to my new book Software Defined Data Infrastructure Essentials – Cloud, Converged, Virtual Server Storage I/O Fundamental tradecraft (CRC Press 2017).
Click here to view the previous post Part 1 Data Infrastructure Data Protection Fundamentals, and click here to view the next post Part 3 Data Protection Access Availability RAID Erasure Codes (EC) including LRC.
Posts in this series include excerpts from Software Defined Data Infrastructure (SDDI) pertaining to data protection for legacy as well as software-defined data centers (SDDC), data infrastructures in general, and related topics. In addition to excerpts, the posts also contain links to articles, tips, posts, videos, webinars, events and other companion material. Note that figure numbers in this series are those from the SDDI book and do not follow the order in which they appear in the posts.
In this post the focus is around Data Protection availability from Chapter 9 which includes access, durability, RAS, RAID and Erasure Codes (including LRC), mirroring and replication along with related topics.
Reliability, Availability, Serviceability (RAS) Data Protection Fundamentals
Reliability, Availability, Serviceability (RAS) and other access and availability topics, along with data protection, are covered in Chapter 9. A resilient data infrastructure (software-defined, SDDC and legacy) protects, preserves, secures and serves information involving various layers of technology. These technologies enable various layers (altitudes) of functionality, from devices up to and through the various applications themselves.
Figure 9.2 Various threat issues and challenges that drive the need for data protection
Some applications need a faster rebuild, while others need sustained performance (bandwidth, latency, IOPS, or transactions) with a slower rebuild; some need lower cost at the expense of performance; others are OK with using more space if other objectives are met. The result is that, since everything is different yet there are similarities, there is also a need to tune how the data infrastructure protects, preserves, secures, and serves applications and data.
General reliability, availability, serviceability, and data protection functionality includes:
- Manually or automatically via policies, start, stop, pause, resume protection
- Adjust priorities of protection tasks, including speed, for faster or slower protection
- Fast-reacting to changes, disruptions or failures, or slower cautious approaches
- Workload and application load balancing (performance, availability, and capacity)
RAS can be optimized for:
- Reduced redundancy for lower overall costs vs. resiliency
- Basic or standard availability (leverage component plus)
- High availability (use better components, multiple systems, multiple sites)
- Fault-tolerant with no single points of failure (SPOF)
- Faster restart, restore, rebuild, or repair with higher overhead costs
- Lower overhead costs (space and performance) with lower resiliency
- Lower impact to applications during rebuild vs. faster repair
- Maintenance and planned outages or for continuous operations
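As a rough illustration of the trade-off between redundancy and cost in the list above, the following sketch estimates how availability changes when redundant systems are added. It is not from the book: the function names are hypothetical, and it assumes each component fails independently and that any single surviving component can carry the workload, which real environments with shared fault domains do not guarantee.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components working in parallel.

    Assumption: failures are independent, so the service is down only
    when every copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


def annual_downtime_hours(availability: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability) * 24 * 365


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = combined_availability(0.99, copies)
        print(f"{copies} system(s): {a:.6f} available, "
              f"{annual_downtime_hours(a):.2f} hours/year down")
```

Under these assumptions, a second system takes a 99% available component to roughly 99.99%, at about twice the hardware cost. Components that share a power feed, site, or software bug will see less benefit than the formula suggests.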
Common availability and data protection terms, technologies, techniques, trends and topics are shown below. They range from availability and access, to durability and consistency, to point-in-time protection and security.
Data Protection Gaps and Air Gap
There are Good data protection gaps that provide recovery points to a past time, enabling recoverability so that you can move forward in the future. Another good data protection gap is an Air Gap, which isolates protection copies off-site or off-line so that they cannot be tampered with, enabling recovery from ransomware and other software-defined threats. There are Bad data protection gaps, including gaps in coverage where data is not protected or items are missing. Then there are Ugly data protection gaps, which include Bad gaps where what you think is protected is not, and finding that your copies are bad when it is too late.
Data Protection Gaps Good Bad and Ugly
The following figure shows good data protection gaps including recovery points (point in time protection) along with air gaps.
Figure 9.9 Air Gaps and Data Protection
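One way to look for bad gaps in coverage is to check the spacing between recovery points. The sketch below is an illustration only, not something from the book: `find_coverage_gaps` and `max_interval` are hypothetical names, and the maximum allowed interval is an assumed policy value that you would set from your own recovery objectives.

```python
from datetime import datetime, timedelta


def find_coverage_gaps(recovery_points, max_interval):
    """Return (earlier, later) pairs of consecutive recovery points that
    are further apart than max_interval (an assumed policy threshold)."""
    ordered = sorted(recovery_points)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_interval
    ]


if __name__ == "__main__":
    start = datetime(2017, 1, 1)
    # Hypothetical snapshot times, in hours after `start`.
    points = [start + timedelta(hours=h) for h in (0, 1, 5, 6)]
    for earlier, later in find_coverage_gaps(points, timedelta(hours=2)):
        print(f"Gap in coverage: {earlier} to {later}")
```

A check like this only finds gaps in time. It does not tell you whether a given copy is actually restorable, which is the ugly gap and still requires test restores.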
Fault / Failures To Tolerate (FTT)
Fault Tolerant Mode (FTM)
FTM is the mode or technique used to enable resiliency and protect against some number of faults.
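To make the relationship between FTT and FTM concrete, here is a minimal sketch of the space cost of two common modes. The function names are hypothetical. It assumes mirroring needs one more full copy than the number of faults to tolerate, and that the erasure code is one where any `parity` fragments can be lost out of `data + parity` (as with Reed-Solomon style codes). Specific products may add witnesses, spares, or other overhead not modeled here.

```python
def mirror_copies_needed(ftt: int) -> int:
    """Full copies required to survive `ftt` simultaneous faults."""
    if ftt < 0:
        raise ValueError("ftt must be zero or greater")
    return ftt + 1


def mirror_space_multiplier(ftt: int) -> float:
    """Raw capacity consumed per unit of usable capacity when mirroring."""
    return float(mirror_copies_needed(ftt))


def erasure_code_space_multiplier(data: int, parity: int) -> float:
    """Raw capacity per unit of usable capacity for a data+parity layout.

    Assumption: the layout tolerates the loss of any `parity` fragments.
    """
    if data < 1 or parity < 0:
        raise ValueError("data must be >= 1 and parity >= 0")
    return (data + parity) / data


if __name__ == "__main__":
    print("Mirror, FTT=2:", mirror_space_multiplier(2), "x raw capacity")
    print("4+2 erasure code, FTT=2:",
          erasure_code_space_multiplier(4, 2), "x raw capacity")
```

Both layouts in the example tolerate two faults, but the mirror uses three times the usable capacity while the 4+2 layout uses one and a half times. The lower space overhead typically comes with more rebuild work and performance impact, which is the tuning trade-off described earlier.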
Fault / Failure Domains
Fault or failure domains are places and things that can fail, ranging from regions, data centers or availability zones, to clusters, stamps, pods, servers, networks, storage and hardware (systems and components including SSDs and HDDs, power supplies, adapters). Other fault domain topics and focus areas include facility power, cooling, and software including applications, databases, operating systems and hypervisors, among others.
Figure 9.5 Various Fault and Failure Domains, Regions, Locations
Clustering is a technique and technology for enabling resiliency, as well as scaling performance, availability, and capacity. Clusters can be local, remote, or wide-area to support different data infrastructure objectives, combined with replication and other techniques.
Figure 9.12 Clustering and Replication Examples
Another characteristic of clustering and resiliency techniques is the ability to detect and react quickly to failures to isolate and contain faults, as well as invoking automatic repair if needed. Different clustering technologies enable various approaches, from proprietary hardware and software tightly coupled to loosely coupled general-purpose hardware or software.
Clustering characteristics include:
- Application, database, file system, operating system (Windows Storage Replica)
- Storage systems, appliances, adapters and network devices
- Hypervisors (Hyper-V, VMware vSphere ESXi and vSAN among others)
- Share everything, share some things, share nothing
- Tightly or loosely coupled with common or individual system metadata
- Local in a data center, campus, metro, or stretch cluster
- Wide-area in different regions and availability zones
- Active/active for fast failover or restart, active/passive (standby) mode
Additional clustering considerations include:
- How does performance scale as nodes are added, or what overhead exists?
- How is cluster resource locking in shared environments handled?
- How many (or few) nodes are needed for quorum to exist?
- Network and I/O interface (and management) requirements
- Cluster partition or split-brain (i.e., cluster splits into two)?
- Fast-reacting fail over and resiliency vs. overhead of failing back
- Locality of where applications are located vs. storage access and clustering
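The quorum and split-brain questions above can be sketched with simple majority voting. This is an assumption for illustration, with hypothetical function names: each node has one vote, and there is no witness, tie-breaker disk, or weighted voting, all of which real cluster products may use.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    return total_nodes // 2 + 1


def has_quorum(partition_nodes: int, total_nodes: int) -> bool:
    """True if a partition holds a majority of the configured nodes."""
    return partition_nodes >= quorum_size(total_nodes)


if __name__ == "__main__":
    # A five-node cluster splitting 3/2: only the three-node side continues.
    print(has_quorum(3, 5), has_quorum(2, 5))
    # A four-node cluster splitting 2/2: neither side has a majority.
    print(has_quorum(2, 4), has_quorum(2, 4))
```

Because a strict majority can exist on at most one side of a partition, this rule prevents two halves from both accepting writes. The even split in a four-node cluster shows why odd node counts or an external witness are commonly used.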
Where To Learn More
Continue reading additional posts in this series of Data Infrastructure Data Protection fundamentals and companion to Software Defined Data Infrastructure Essentials (CRC Press 2017) book, as well as the following links covering technology, trends, tools, techniques, tradecraft and tips.
- Part 1 – Data Infrastructure Data Protection Fundamentals
- Part 2 – Reliability, Availability, Serviceability (RAS) Data Protection Fundamentals
- Part 3 – Data Protection Access Availability RAID Erasure Codes (EC) including LRC
- Part 4 – Data Protection Recovery Points (Archive, Backup, Snapshots, Versions)
- Part 5 – Point In Time Data Protection Granularity Points of Interest
- Part 6 – Data Protection Security Logical Physical Software Defined
- Part 7 – Data Protection Tools, Technologies, Toolbox, Buzzword Bingo Trends
- Part 8 – Data Protection Diaries Walking Data Protection Talk
- Part 9 – Who’s Doing What (Toolbox Technology Tools)
- Part 10 – Data Protection Resources Where to Learn More
- Data Protection Diaries series
- Data Infrastructure server storage I/O network Recommended Reading List Book Shelf
- Software Defined Data Infrastructure Essentials (CRC 2017) Book
Additional learning experiences along with common questions (and answers), as well as tips can be found in Software Defined Data Infrastructure Essentials book.
What This All Means
Everything is not the same across different environments, data centers, data infrastructures and applications. There are various performance, availability, capacity, and economic (PACE) considerations along with service level objectives (SLO). Availability means being able to access information resources (applications, data and underlying data infrastructure resources), as well as data being consistent and durable. Being durable means data remains accessible in the event of a failure of a device, component or other fault domain item (hardware, software, data center).
Just as everything is not the same across different environments, there are various techniques, technologies and tools that can be used in different ways to enable availability and accessibility. These include high availability (HA), RAS, mirroring, replication, parity along with derivative erasure code (EC), LRC, RS and other RAID implementations, along with clustering. Also keep in mind that pertaining to data protection, there are good gaps (e.g. time intervals for recovery points, air gaps), bad gaps (missed coverage or lack of protection), and ugly gaps (not being able to recover from a gap in time).
Note that mirroring, replication, EC, LRC, RS or other Parity and RAID approaches are not replacements for backup, rather they are companions to time interval based recovery point protection such as snapshots, backup, checkpoints, consistency points and versioning among others (discussed in follow-up posts in this series).
Which data protection tool, technology or trend is best depends on what you are trying to accomplish and your application workload PACE requirements along with SLOs. Get your copy of Software Defined Data Infrastructure Essentials here at Amazon.com, at CRC Press among other locations and learn more here. Meanwhile, continue reading with the next post in this series, Part 3 Data Protection Access Availability RAID Erasure Codes (EC) including LRC.
Ok, nuff said, for now.
Greg Schulz – Microsoft MVP Cloud and Data Center Management, VMware vExpert 2010-2017 (vSAN and vCloud). Author of Software Defined Data Infrastructure Essentials (CRC Press), as well as Cloud and Virtual Data Storage Networking (CRC Press), The Green and Virtual Data Center (CRC Press), Resilient Storage Networks (Elsevier) and twitter @storageio. Courteous comments are welcome for consideration. First published on https://storageioblog.com any reproduction in whole, in part, with changes to content, without source attribution under title or without permission is forbidden.
All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2018 Server StorageIO and UnlimitedIO. All Rights Reserved. StorageIO is a registered Trade Mark (TM) of Server StorageIO.