Cloud conversations: Gaining cloud confidence from insights into AWS outages (Part II)

StorageIO industry trends cloud, virtualization and big data

This is the second in a two-part industry trends and perspective looking at learning from cloud incidents, view part I here.

There is good information, insight and lessons to be learned from cloud outages and other incidents.

Sorry cynics no that does not mean an end to clouds, as they are here to stay. However when and where to use them, along with what best practices, how to be ready and configure for use are part of the discussion. This means that clouds may not be for everybody or all applications, or at least today. For those who are into clouds for the long haul (either all in or partially) including current skeptics, there are many lessons to be  learned and leveraged.

In order to gain confidence in clouds, some questions that I routinely am asked include are clouds more or less reliable than what you are doing? Depends on what you are doing, and how you will be using the cloud services. If you are applying HA and other BC or resiliency best practices, you may be able to configure and isolate from the more common situations. On the other hand, if you are simply using the cloud services as a low-cost alternative selecting the lowest price and service class (SLAs and SLOs), you might get what you paid for. Thus, clouds are a shared responsibility, the service provider has things they need to do, and the user or person designing how the service will be used have some decisions making responsibilities.

Keep in mind that high availability (HA), resiliency, business continuance (BC) along with disaster recovery (DR) are the sum of several pieces. This includes people, best practices, processes including change management, good design eliminating points of failure and isolating or containing faults, along with how the components  or technology used (e.g. hardware, software, networks, services, tools). Good technology used in goods ways can be part of a highly resilient flexible and scalable data infrastructure. Good technology used in the wrong ways may not leverage the solutions to their full potential.

While it is easy to focus on the physical technologies (servers, storage, networks, software, facilities), many of the cloud services incidents or outages have involved people, process and best practices so those need to be considered.

These incidents or outages bring awareness, a level set, that this is still early in the cloud evolution lifecycle and to move beyond seeing clouds as just a way to cut cost, and seeing the importance and value HA, resiliency, BC and DR. This means learning from mistakes, taking action to correct or fix errors, find and cut points of failure are part of a technology maturing or the use of it. These all tie into having services with service level agreements (SLAs) with service level objectives (SLOs) for availability, reliability, durability, accessibility, performance and security among others to protect against mayhem or other things that can and do happen.

Images licensed for use by StorageIO via
Atomazul / Shutterstock.com

The reason I mentioned earlier that AWS had another incident is that like their peers or competitors who have incidents in the past, AWS appears to be going through some growing, maturing, evolution related activities. During summer 2012 there was an AWS incident that affected Netflix (read more here: AWS and the Netflix Fix?). It should also be noted that there were earlier AWS outages where Netflix (read about Netflix architecture here) leveraged resiliency designs to try and prevent mayhem when others were impacted.

Is AWS a lightning rod for things to happen, a point of attraction for Mayhem and others?

Granted given their size, scope of services and how being used on a global basis AWS is blazing new territory and experiences, similar to what other information services delivery platforms did in the past. What I mean is that while taken for granted today, open systems Unix, Linux, Windows-based along with client-server, midrange or distributed systems, not to mention mainframe hardware, software, networks, processes, procedures, best practices all went through growing pains.

There are a couple of interesting threads going on over in various LinkedIn Groups based on some reporters stories including on speculation of what happened, followed with some good discussions of what actually happened and how to prevent recurrence of them in the future.

Over in the Cloud Computing, SaaS & Virtualization group forum, this thread is based on a Forbes article (Amazon AWS Takes Down Netflix on Christmas Eve) and involves conversations about SLAs, best practices, HA and related themes. Have a look at the story the thread is based on and some of the assertions being made, and ensuing discussions.

Also over at LinkedIn, in the Cloud Hosting & Service Providers group forum, this thread is based on a story titled Why Netflix’ Christmas Eve Crash Was Its Own Fault with a good discussion on clouds, HA, BC, DR, resiliency and related themes.

Over at the Virtualization Practice, there is a piece titled Is Amazon Ruining Public Cloud Computing? with comments from me and Adrian Cockcroft (@Adrianco) a Netflix Architect (you can read his blog here). You can also view some presentations about the Netflix architecture here.

What this all means

Saying you get what you pay for would be too easy and perhaps not applicable.

There are good services free, or low-cost, just like good free content and other things, however vice versa, just because something costs more, does not make it better.

Otoh, there are services that charge a premium however may have no better if not worse reliability, same with content for fee or perceived value that is no better than what you get free.

Additional related material

Some closing thoughts:

  • Clouds are real and can be used safely; however, they are a shared responsibility.
  • Only you can prevent cloud data loss, which means do your homework, be ready.
  • If something can go wrong, it probably will, particularly if humans are involved.
  • Prepare for the unexpected and clarify assumptions vs. realities of service capabilities.
  • Leverage fault isolation and containment to prevent rolling or spreading disasters.
  • Look at cloud services beyond lowest cost or for cost avoidance.
  • What is your organizations culture for learning from mistakes vs. fixing blame?
  • Ask yourself if you, your applications and organization are ready for clouds.
  • Ask your cloud providers if they are ready for you and your applications.
  • Identify what your cloud concerns are to decide what can be done about them.
  • Do a proof of concept to decide what types of clouds and services are best for you.

Do not be scared of clouds, however be ready, do your homework, learn from the mistakes, misfortune and errors of others. Establish and leverage known best practices while creating new ones. Look at the past for guidance to the future, however avoid clinging to, and bringing the baggage of the past to the future. Use new technologies, tools and techniques in new ways vs. using them in old ways.

Ok, nuff said.

Cheers gs

Greg Schulz – Author Cloud and Virtual Data Storage Networking (CRC Press, 2011), The Green and Virtual Data Center (CRC Press, 2009), and Resilient Storage Networks (Elsevier, 2004)

twitter @storageio

All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2024 Server StorageIO and UnlimitedIO LLC All Rights Reserved

Clouds and Data Loss: Time for CDP (Commonsense Data Protection)?

Today SNIA released a press release pertaining to cloud storage timed to coincide with SNW where we can only presume vendors are talking about their cloud storage stories.

Yet chatter on the coconut wire along with various news (here and here and here) and social media sites is how could cloud storage and information service provider T-Mobile/Microsoft/Side-Kick loose customers data?

Data loss is a dangerous phrase, after all, your data may still be intact somewhere, however if you cannot get to it when needed, that may seem like data loss to you.

There are many types of data loss including loss of accessibility or availability along with flat out loss. Let me clarify, loss of data availability or accessibility means that somewhere, your data is still intact, perhaps off-line on a removable disk, optical, tape or at another site on-line, near-line or off-line, its just that you cannot get to it yet. There is also real data loss where both your primary copy and backup as well as archive data are lost, stolen, corrupted or never actually protected.

Clouds or managed service providers in general are getting beat up due to some loss of access, availability or actual data loss, however before jumping on that bandwagon and pointing fingers at the service, how about a step back for a minute. Granted, given all of the cloud hype and proliferation of managed service offerings on the web (excuse me cloud), there is a bit of a lightning rod backlash or see I told you so approach.

Whats different about this story compared to prior disruptions with Amazon, Google, Blackberry among others is that unlike where access to information or services ranging from calendar, emails, contacts or other documents is disrupted for a period of time, it sounds as those data may have been lost.

Lost data you should say? How can you lose data after all there are copies of copies of data that have been snapshot, replicated and deduplicated storage across different tiered storage right?

Certainly anyone involved in data management or data protection is asking the question; why not go back to a snapshot copy, replicated volute, backup copy on disk or tape?

Needless to say, finger pointing aerobics are or will be in full swing. Instead, lets ask the question, is it time for CDP as in Commonsense Data Protection?

However, rather than point blame or spout off about how bad clouds are, or, that they are getting an un-fair shake and un-due coverage, and that just because there might be a few bad ones, not all clouds are bad particularly with recent outages.

I can think of many ways on how to actually lose data, however, to totally lose data requires not a technology failure, it can be something much simpler and is equally applicable to cloud, virtual and physical data centers and storage environments from the largest to the smallest to the consumer. Its simple, common sense, best practices, making copies of all data and keeping extra copies around somewhere, with more frequent or recent data having copies readily available.

Some trends Im seeing include among others:

  • Low cost craze leveraging free or near free services and products
  • Cloud hype and cloud bashing and need to discuss wide area in between those extremes
  • Renewed need for basic data protection including BC/DR, HA, backup and security
  • Opportunity to re-architect data protection in conjunction with other initiatives
  • Lack of adequate funding for continued and proactive data protection

Just to be safe, lets revisit some common data protection best practices:

  • Learn from mistakes, preferable during testing with aim to avoid repeating them again
  • Most disasters in IT and elsewhere are the result of a chain of events not being contained
  • RAID is not a replacement for backup, it simply provides availability or accessibility
  • Likewise, mirroring or replication by themselves is not a replacement for backup.
  • Use point in time RPO based data protection such as snapshots or backup with replication
  • Maintain a master backup or gold copy that can be used to restore to a given point of time
  • Keep backup on another medium, also protect backup catalog or other configuration data
  • If using deduplication, make sure that indexes/dictionary or Meta data is also protected.
  • Moving your data into the cloud is not a replacement for a data protection strategy
  • Test restoration of backed data both locally, as well as from cloud services
  • Employ data protection management (DPM) tools for event correlation and analysis
  • Data stored in clouds need to be part of a BC/DR and overall data protection strategy
  • Have extra copy of data placed in clouds kept in alternate location as part of BC/DR
  • Ask yourself, what will do you when your cloud data goes away (note its not if, its when)
  • Combine multiple layers or rings of defines and assume what can break will break

Clouds should not be scary; Clouds do not magically solve all IT or consumer issues. However they can be an effective tool when of high caliber as part of a total data protection strategy.

Perhaps this will be a wake up call, a reminder, that it is time to think beyond cost savings and a shift back to basic data protection best practices. What good is the best or most advanced technology if you have less than adequate practices or polices? Bottom line, time for Commonsense Data Protection (CDP).

Ok, nuff said for now, I need to go and make sure I have a good removable backup in case my other local copies fail or Im not able to get to my cloud copies!

Ok, nuff said.

Cheers gs

Greg Schulz – Author Cloud and Virtual Data Storage Networking (CRC Press), The Green and Virtual Data Center (CRC Press) and Resilient Storage Networks (Elsevier)
twitter @storageio

All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2024 Server StorageIO and UnlimitedIO LLC All Rights Reserved