Data Infrastructure server storage I/O network Recommended Reading #blogtober

server storage I/O data infrastructure trends recommended reading list

Updated 7/30/2018

The following is an evolving recommended reading list of data infrastructure topics including server, storage I/O, networking, cloud, virtual, container, data protection and related topics. It includes books, blogs, podcasts, events and industry links among other resources.

Various data infrastructure hardware, software, and services related links:

Links A-E
Links F-J
Links K-O
Links P-T
Links U-Z
Other Links

In addition to my own books, including Software Defined Data Infrastructure Essentials (CRC Press 2017), the following are Server StorageIO recommended reading list items. The list covers various IT, data infrastructure and related topics.

Intel Recommended Reading List (IRRL) for developers is a good resource to check out.

Duncan Epping (@DuncanYB), Frank Denneman (@FrankDenneman) and Niels Hagoort (@NHagoort) have released their VMware vSphere 6.7 Clustering Deep Dive book, available at venues including Amazon.com. This is the latest in a series of clustering deep dive books from Frank and Duncan; if you are involved with VMware, SDDC and related software defined data infrastructures, these should be on your bookshelf.

Check out the Blogtober list of blogs and posts occurring during October 2017 here.

Preston De Guise (aka @backupbear), author of several books, has an interesting new site Foolsrushin.info that looks at topics including ethics in IT among others. Check out his new book Data Protection: Ensuring Data Availability (CRC Press 2017), available via Amazon.com here.

Brendan Gregg has a great site for Linux performance related topics here.

Greg Knieriemen has a must-read weekly blog, post and column collection of what's going on in and around the IT and data infrastructure related industries. Check it out here.

Interested in file systems, CIFS, SMB, SAMBA and related topics? Then check out Chris Hertel's book on implementing CIFS here at Amazon.com.

For those involved with VMware, check out Frank Denneman's VMware vSphere 6.5 host resources guide-book here at Amazon.com.

Docker: Up & Running: Shipping Reliable Containers in Production by Karl Matthias & Sean P. Kane via Amazon.com here.

Essential Virtual SAN (VSAN): Administrator’s Guide to VMware Virtual SAN, 2nd ed. by Cormac Hogan & Duncan Epping via Amazon.com here.

Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale by Tom White via Amazon.com here.

Systems Performance: Enterprise and the Cloud by Brendan Gregg Via Amazon.com here.

Implementing Cloud Storage with OpenStack Swift by Amar Kapadia, Sreedhar Varma, & Kris Rajana Via Amazon.com here.

The Human Face of Big Data by Rick Smolan & Jennifer Erwitt Via Amazon.com here.

VMware vSphere 5.1 Clustering Deepdive (Vol. 1) by Duncan Epping & Frank Denneman Via Amazon.com here. Note: This is an older title, but there are still good fundamentals in it.

Linux Administration: A Beginner’s Guide by Wale Soyinka Via Amazon.com here.

TCP/IP Network Administration by Craig Hunt Via Amazon.com here.

Cisco IOS Cookbook: Field tested solutions to Cisco Router Problems by Kevin Dooley and Ian Brown Via Amazon.com here.

I often mention in presentations a must-have for anybody involved with software defined anything (or programming in general): the Niklaus Wirth classic Algorithms + Data Structures = Programs, which you can get on Amazon.com here.

Seven Databases in Seven Weeks including NoSQL

Another great book to have is Seven Databases in Seven Weeks (here is a book review), which not only provides an overview of popular NoSQL databases such as Cassandra, MongoDB and HBase among others, but also includes lots of good examples and hands-on guides. Get your copy here at Amazon.com.

Additional Data Infrastructure and related topic sites

In addition to those mentioned above, other sites, venues and data infrastructure related resources include:

aiim.com – Archiving and records management trade group

apache.org – Various open-source software

blog.scottlowe.org – Scott Lowe VMware Networking and topics

blogs.msdn.microsoft.com/virtual_pc_guy – Ben Armstrong Hyper-V blog

brendangregg.com – Linux performance-related topics

cablemap.info – Global network maps

CMG.org – Computer Measurement Group (CMG)

communities.vmware.com – VMware technical community and resources

comptia.org – Various IT, cloud, and data infrastructure certifications

cormachogan.com – Cormac Hogan VMware and vSAN related topics

csrc.nist.gov – U.S. government cloud specifications

dmtf.org – Distributed Management Task Force (DMTF)

ethernetalliance.org – Ethernet industry trade group

fibrechannel.org – Fibre Channel trade group

github.com – Various open-source solutions and projects

Intel Reading List – recommended reading list for developers

ieee.org – Institute of Electrical and Electronics Engineers

ietf.org – Internet Engineering Task Force

iso.org – International Organization for Standardization (ISO)

it.toolbox.com – Various IT and data infrastructure topics forums

labs.vmware.com/flings – VMware Fling additional tools and software

nist.gov – National Institute of Standards and Technology

nvmexpress.org – NVM Express (NVMe) industry trade group

objectstoragecenter.com – Various object and cloud storage items

opencompute.org – Open Compute Project (OCP) servers and related topics

opendatacenteralliance.org – Open Data Center Alliance (ODCA)

openfabrics.org – Open-fabric software industry group

opennetworking.org – Open Networking Foundation (ONF)

openstack.org – OpenStack resources

pcisig.com – Peripheral Component Interconnect (PCI) trade group

reddit.com – Various IT, cloud, and data infrastructure topics

scsita.org – SCSI trade association (SAS and others)

SNIA.org – Storage Network Industry Association (SNIA)

Speakingintech.com – Popular industry and data infrastructure podcast

Storage Bibliography – Collection of Dr. J. Metz storage related content

technet.microsoft.com – Microsoft TechNet data infrastructure–related topics

thenvmeplace.com – Various NVMe and related tools, topics and links

thevpad.com – Collection of various virtualization and related sites

thessdplace.com – Various NVM, SSD, flash, 3D XPoint related topics, tools, links

tpc.org – Transaction Processing Performance Council benchmark site

vmug.org – VMware User Groups (VMUG)

wahlnetwork.com – Chris Wahl networking and related topics

yellow-bricks.com – Duncan Epping VMware and related topics

Additional Data Infrastructure Venues

Additional useful data infrastructure related information can be found at BizTechMagazine, BrightTalk, ChannelProNetwork, ChannelproSMB, ComputerWeekly, Computerworld, CRN, CruxialCIO, Data Center Journal (DCJ), Datacenterknowledge, and DZone. Other good sources include Edtechmagazine, Enterprise Storage Forum, EnterpriseTech, Eweek.com, FedTech, Google+, HPCwire, InfoStor, ITKE, LinkedIn, NAB, Network Computing, Networkworld, and nextplatform. Also check out Reddit, Redmond Magazine and Webinars, Spiceworks Forums, StateTech, techcrunch.com, TechPageOne, TechTarget Venues (various Search sites, e.g., SearchStorage, SearchSSD, SearchAWS, and others), theregister.co.uk, TheVarGuy, Tom’s Hardware, and zdnet.com, among many others.

Where To Learn More

Learn more about related technology, trends, tools, techniques, and tips with the following links.

Additional learning experiences along with common questions (and answers), as well as tips can be found in Software Defined Data Infrastructure Essentials book.

Software Defined Data Infrastructure Essentials Book SDDC

What This All Means

The above is an evolving collection of recommended reading including what I have on my physical and virtual bookshelves, as well as a list of web sites, blogs and podcasts worth listening to, reading or watching. Watch for more items to be added to the bookshelf soon, and if you have a suggested recommendation, add it to the comments below.

By the way, if you have not heard, it's #Blogtober; check out some of the other blogs and posts occurring during October here as part of your recommended reading list.

Ok, nuff said, for now.

Gs

Greg Schulz – Microsoft MVP Cloud and Data Center Management, VMware vExpert 2010-2017 (vSAN and vCloud). Author of Software Defined Data Infrastructure Essentials (CRC Press), as well as Cloud and Virtual Data Storage Networking (CRC Press), The Green and Virtual Data Center (CRC Press), Resilient Storage Networks (Elsevier) and twitter @storageio. Courteous comments are welcome for consideration. First published on https://storageioblog.com any reproduction in whole, in part, with changes to content, without source attribution under title or without permission is forbidden.

All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2023 Server StorageIO(R) and UnlimitedIO. All Rights Reserved.

Chelsio Storage over IP and other Networks Enable Data Infrastructures


server storage I/O data infrastructure trends

Chelsio and Storage over IP (SoIP) continue to enable data infrastructures from legacy to software defined virtual, container, cloud as well as converged. This past week I had a chance to visit with Chelsio to discuss data infrastructures, server storage I/O networking along with other related topics. More on Chelsio later in this post; however, for now let's take a quick step back and refresh what SoIP (Storage over IP) is, along with storage over Ethernet (among other networks).

Data Infrastructures Protect Preserve Secure and Serve Information
Various IT and Cloud Infrastructure Layers including Data Infrastructures

Server Storage over IP Revisited

There are many variations of SoIP, from network attached storage (NAS) file-based access including NFS and SAMBA/SMB (aka Windows file sharing) among others, to block access such as SCSI over IP (e.g. iSCSI), along with object access via HTTP/HTTPS, not to mention the buzzword bingo list of RoCE, iSER, iWARP, RDMA, DPDK, FTP, FCoE, iFCP and SMB3 Direct to name a few.
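To make that alphabet soup a bit more concrete, here is a sketch of what some of these access methods look like from a Linux host. The server addresses, share names and iSCSI IQN below are hypothetical placeholders, not a real configuration.

```shell
# NAS file access: mount an NFS export (addresses and paths are hypothetical)
mount -t nfs 192.168.1.50:/exports/data /mnt/nfs

# NAS file access via SMB/CIFS (aka Windows file sharing)
mount -t cifs //192.168.1.50/share /mnt/smb -o username=demo

# Block access, SCSI over IP (iSCSI): discover targets, then log in
iscsiadm -m discovery -t sendtargets -p 192.168.1.60:3260
iscsiadm -m node -T iqn.2017-01.com.example:target0 -p 192.168.1.60:3260 --login

# Object access over HTTP/HTTPS, e.g. fetching an object with curl
curl -O https://objects.example.com/bucket/item.dat
```

Same IP network underneath in each case; what differs is whether the server sees files, raw blocks, or objects.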

Who is Chelsio

For those who are not aware or need a refresher, Chelsio is involved with enabling server storage I/O by creating ASICs (Application Specific Integrated Circuits) that perform various functions, offloading those from the host server processor. What this means for some is a throwback to the early 2000s TCP Offload Engine (TOE) era, where various processing to handle regular network traffic, along with iSCSI and other storage over Ethernet and IP, could be accelerated.

Chelsio data infrastructure focus

Chelsio ecosystem across different data infrastructure focus areas and application workloads

As seen in the image above, certainly there is a server and storage I/O network play with Chelsio, along with traffic management, packet inspection, security (encryption, SSL and other offload), traditional, commercial, web, high performance compute (HPC) along with high profit or productivity compute (the other HPC). Chelsio also enables data infrastructures that are part of physical bare metal (BM), software defined virtual, container, cloud, serverless among others.

Chelsio server storage I/O focus

The above image shows how Chelsio enables initiators on server and storage appliances as well as targets via various storage over IP (or Ethernet) protocols.

Chelsio enabling various data center resources

Chelsio also plays in several different sectors, from *NIX to Windows, cloud to containers, and various processor architectures and hypervisors.

Chelsio ecosystem

Besides diverse server storage I/O enabling capabilities across various data infrastructure environments, what caught my eye with Chelsio is how far they, and storage over IP, have progressed over the past decade (or more). Granted there are faster underlying networks today; however, the offload and specialized chip sets (e.g. ASICs) have also progressed, as seen in the above and following series of images via Chelsio.

The above shows TCP and UDP acceleration; the following shows Microsoft SMB 3.1.1 performance, something important for Storage Spaces Direct (S2D) and Windows-based Converged Infrastructure (CI) along with Hyper-Converged Infrastructure (HCI) deployments.

Chelsio software environments

Something else that caught my eye was iSCSI performance, which in the following shows 4 initiators accessing a single target doing about 4 million IOPS (reads and writes) across various sizes and configurations. Granted that is with a 100Gb network interface; however, it also shows that potential bottlenecks are removed, enabling that faster network to be used more effectively.
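As a back-of-the-envelope sanity check (using hypothetical IO sizes, not Chelsio's actual test configuration), you can relate an IOPS rate to the wire bandwidth it implies and compare that against link capacity:

```python
def wire_gbps(iops, io_bytes):
    """Payload bandwidth in Gbit/s implied by an IOPS rate and IO size
    (ignores protocol framing and other overhead)."""
    return iops * io_bytes * 8 / 1e9

# Hypothetical: 4 million IOPS at a 2 KiB IO size
gbps = wire_gbps(4_000_000, 2048)
print(f"{gbps:.1f} Gbit/s of payload on a 100 GbE link")  # ~65.5 Gbit/s
```

At 2 KiB the payload fits within a single 100GbE link; at 4 KiB it would not, which is why small-IO tests at these rates also stress the host I/O stack rather than just the wire.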

Chelsio server storage I/O performance

Moving on from TCP, UDP and iSCSI: NVMe, and in particular NVMe over Fabrics (NVMeoF), have become popular industry topics, so check out the following. One of my comments to Chelsio was to add host (server) CPU usage to the following chart to help show the story and value proposition of NVMe in general: doing more I/O activity while consuming fewer server-side resources. Let's see what they put out in the future.

Chelsio

Ok, so Chelsio does storage over IP, storage over Ethernet and other interfaces, accelerating performance as well as regular TCP and UDP activity. One of the other benefits of what Chelsio and others are doing with their ASICs (or FPGAs for some) is to also offload security processing among other tasks. Given the increased focus on server storage I/O and data infrastructure security, from encryption to SSL and related usage that requires more resources, these new ASICs such as those from Chelsio help to offload various specialized processing from the server.

The customer benefit is that more productive application work can be done by their servers (or storage appliances). For example, if you have a database server, that means more productive database transactions per second per software license. Put another way, want to get more value out of your Oracle, Microsoft or other vendors' software licenses? Simple: get more work done per licensed server by offloading and eliminating waits or other bottlenecks.
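A rough sketch of that license-value arithmetic, using entirely hypothetical license cost and transaction-rate numbers:

```python
def cost_per_million_tx(license_cost, tps):
    """License cost attributed to each million transactions over a year."""
    tx_per_year = tps * 60 * 60 * 24 * 365
    return license_cost / (tx_per_year / 1_000_000)

# Hypothetical: a $47,500 per-processor database license
baseline = cost_per_million_tx(47_500, 5_000)   # CPU partly busy doing I/O stack work
offloaded = cost_per_million_tx(47_500, 6_500)  # 30% more TPS after I/O offload
print(f"${baseline:.2f} vs ${offloaded:.2f} per million transactions")
```

The license cost is fixed; every extra transaction the licensed server completes lowers the effective cost per unit of work, which is the point of removing I/O bottlenecks.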

Using offloads and removing server bottlenecks might seem like common sense; however, I'm still amazed at the number of organizations that are more focused on getting extra value out of their hardware vs. getting value out of their software licenses (which might be more expensive).

Chelsio

Where To Learn More

Learn more about related technology, trends, tools, techniques, and tips with the following links.

Data Infrastructures Protect Preserve Secure and Serve Information
Various IT and Cloud Infrastructure Layers including Data Infrastructures

What This All Means

Data Infrastructures exist to protect, preserve, secure and serve information along with the applications and data they depend on. With more data being created at a faster rate, along with the size of data becoming larger, increased application functionality to transform data into information means more demands on data infrastructures and their underlying resources.

This means more server I/O to storage systems and other servers, along with increased use of SoIP as well as storage over Ethernet and other interfaces including NVMe. Chelsio (and others) are addressing the various application and workload demands by enabling more robust, productive, effective and efficient data infrastructures.

Check out Chelsio and how they are enabling storage over IP (SoIP) to enable data infrastructures from legacy to software defined virtual, container, cloud as well as converged. Oh, and thanks Chelsio for being able to use the above images.

Ok, nuff said, for now.
Gs

Greg Schulz – Multi-year Microsoft MVP Cloud and Data Center Management, VMware vExpert (and vSAN). Author of Software Defined Data Infrastructure Essentials (CRC Press), as well as Cloud and Virtual Data Storage Networking (CRC Press), The Green and Virtual Data Center (CRC Press), Resilient Storage Networks (Elsevier) and twitter @storageio.

Courteous comments are welcome for consideration. First published on https://storageioblog.com any reproduction in whole, in part, with changes to content, without source attribution under title or without permission is forbidden.

All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2023 Server StorageIO(R) and UnlimitedIO. All Rights Reserved.

VMware vSAN V6.6 Part IV (HCI scaling ROBO and data centers today)

server storage I/O trends


In case you missed it, VMware announced vSAN v6.6, its hyper-converged infrastructure (HCI) software defined data infrastructure solution. This is the fourth of a five-part series about VMware vSAN V6.6. View Part I here, Part II (just the speeds and feeds please) here, Part III (reducing cost and complexity) here, as well as Part V (VMware vSAN evolution, where to learn more and summary) here.

VMware vSAN 6.6
Image via VMware

For those who are not aware, vSAN is VMware's software-defined virtual storage area network, part of a software-defined data infrastructure (SDDI) and software-defined data center (SDDC). Besides being software-defined, vSAN is HCI, combining compute (server), I/O networking, and storage (space and I/O) along with hypervisors, management, and other tools.

Scaling HCI for ROBO and data centers today and for tomorrow

Scaling with stability for today and tomorrow includes addressing your applications' Performance, Availability, Capacity and Economics (PACE) workload requirements now and in the future. Scaling with stability means boosting performance, availability (data protection, security, resiliency, durability, FTT) and effective capacity without one of those attributes compromising another.
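As a simple illustration of how the availability and capacity attributes interact, here is a rough capacity sketch. The cluster size and data reduction ratio are made-up example numbers; the two-copies figure reflects the general rule of thumb that FTT=1 with RAID-1 mirroring consumes twice the written capacity.

```python
def effective_capacity_tb(raw_tb, ftt_copies=2, data_reduction=1.0):
    """Rough usable capacity after protection copies and data reduction.
    ftt_copies=2 corresponds to FTT=1 with RAID-1 mirroring."""
    return raw_tb / ftt_copies * data_reduction

# Hypothetical cluster: 40 TB raw, FTT=1 mirroring, 1.5:1 dedup+compression
print(effective_capacity_tb(40, ftt_copies=2, data_reduction=1.5))  # 30.0
```

Raising FTT improves availability but costs capacity, while data reduction claws some of it back; scaling with stability means balancing all three rather than optimizing one.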

VMware vSAN data center scaling
Image via VMware

Scaling today for tomorrow also means adapting to today's needs while staying flexible enough to evolve with new application workloads, hardware, as well as cloud (public, private, hybrid, inter- and intra-cloud). As part of continued performance improvements, there are enhancements to optimize for higher performance flash SSDs, including NVMe-based devices.

VMware vSAN cloud analytics
Image via VMware

Part of scaling with stability means enhancing performance (as well as productivity), or the effectiveness of a solution. Keep in mind that efficiency is often associated with storage (or server or network) space capacity savings or reductions. In that context, effectiveness means performance and productivity, or how much work can be done with the least overhead impact. With vSAN V6.6, performance enhancements include reduced checksum overhead, enhanced compression and deduplication, along with destaging optimizations.

Other enhancements that help collectively contribute to vSAN performance improvements include VMware object handling (not to be confused with cloud or object storage S3 or Swift objects) as well as faster iSCSI for vSAN. Also improved are more accurate refined cache sizing guidelines. Keep in mind that a little bit of NAND flash SSD or SCM in the right place can have a significant benefit, while a lot of flash cache costs much cash.

Part of enabling and leveraging new technology today includes support for larger-capacity 1.6TB flash SSD drives for cache, as well as lower read latency with 3D XPoint and NVMe drives such as those from Intel among others. Refer to the VMware vSAN HCL for currently supported devices, which continues to evolve along with the partner ecosystem. Future proofing is also enabled: you can grow from today to tomorrow as new storage class memories (SCM), other flash SSDs, and NVMe-enhanced storage among other technologies are introduced into the market and onto the VMware vSAN HCL.

VMware vSAN and data center class applications
Image via VMware

Traditional CI, and in particular many HCI solutions, have been optimized or focused on smaller application workloads including VDI, resulting in the perception that HCI in general is only for smaller environments, or for larger environments' non-mission-critical workloads. With vSAN V6.6, VMware is addressing and enabling larger-environment mission critical applications, including InterSystems Caché medical health management software among others. Other application workload extensions include support for higher-performance-demanding Hadoop big data analytics, as well as extending virtual desktop infrastructure (VDI) workspaces with XenDesktop/XenApp, along with Photon 1.1 container support.

What about VMware vSAN 6.6 packaging and license options?

As part of vSAN 6.6, VMware offers several solution bundle packaging options for the data center as well as smaller ROBO environments. Contact your VMware representative or partner to learn more about specific details.

VMware vSAN cloud analytics
Image via VMware

VMware vSAN cloud analytics
Image via VMware

Where to Learn More

The following are additional resources to find out more about vSAN and related technologies.

What this all means

Continue reading more about VMware vSAN 6.6 in Part I here, Part II (just the speeds and feeds please) here, Part III (reducing cost and complexity) here, as well as Part V (VMware vSAN evolution, where to learn more and summary) here.

Ok, nuff said (for now…).

Cheers
Gs

Greg Schulz – Microsoft MVP Cloud and Data Center Management, VMware vExpert (and vSAN). Author Cloud and Virtual Data Storage Networking (CRC Press), The Green and Virtual Data Center (CRC Press), Resilient Storage Networks (Elsevier) and twitter @storageio. Watch for the Spring 2017 release of his new book “Software-Defined Data Infrastructure Essentials” (CRC Press).

Courteous comments are welcome for consideration. First published on https://storageioblog.com any reproduction in whole, in part, with changes to content, without source attribution under title or without permission is forbidden.

All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2023 Server StorageIO(R) and UnlimitedIO. All Rights Reserved.

Part II – EMC DSSD D5 Direct Attached Shared AFA


server storage I/O trends

This is the second post in a two-part series on the EMC DSSD D5 announcement, you can read part one here.

Let's take a closer look at how EMC DSSD D5 works, its hardware and software components, how it compares, and other considerations.

How Does DSSD D5 Work

Up to 48 Linux servers attach via dual-port PCIe Gen 3 x8 cards that are stateless. Stateless simply means they do not have any flash and are not being used as storage cards; rather, they are essentially just an NVMe adapter card. With the first release, block, HDFS file, along with object access and APIs are available for Linux systems. These drivers enable the shared NVMe storage to be accessed by applications using different streamlined server and storage I/O driver software stacks to cut latency. DSSD D5 is meant to be a rack-scale solution, so distance is measured as inside a rack (e.g. a couple of meters).

The 5U-tall DSSD D5 supports 48 servers via a pair of I/O Modules (IOM), each with 48 ports, that in turn attach to the data plane and on to the Flash Modules (FM). Also attached to the data plane are a pair of active/active controllers for performing management tasks; however, they do not sit in the data path. This means that host clients directly access the FMs without having to go through a controller, which is the case in traditional storage systems and AFAs. The controllers only get involved when there is some setup, configuration or other management activity; otherwise they get out of the way, kind of like how management should function: there when you need them to help, then out of the way so productive work can be done.

EMC DSSD shared ssd das
Pardon the following hand drawn sketches, you can see some nice pretty diagrams, videos and other content via the EMC Pulse Blog as well as elsewhere.

Note that the host client servers take on the responsibility for managing and coordinating data consistency, meaning data can be shared between servers assuming applicable software is used for implementing integrity. This means that clustering and other software that can support shared storage are able to drive low-latency, high-performance read and write activity to the DSSD D5, as opposed to relying on the underlying storage system for handling shared storage coordination such as in a NAS. Another note is that the DSSD D5 is optimized for concurrent multi-threaded and asynchronous I/O operations, along with atomic writes for data integrity, which enable the multiple cores in today's faster processors to be more effectively leveraged.

The data plane is a mesh, switch, or expander-based back plane enabling any of the north-bound (host client-server) 96 (2 x 48) PCIe Gen 3 x4 ports to reach the up to 36 (or as few as 18) FMs, which are also dual-pathed. Note that the host client-server PCIe dual-port cards are Gen 3 x8 while the DSSD D5 ports are Gen 3 x4. Simple math should tell you that if you are going to have 2 x PCIe Gen 3 x4 ports running at full speed, you want a Gen 3 x8 connection inside the server to get full performance.
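That simple math can be spelled out. The per-lane figure here assumes PCIe Gen 3's 8 GT/s signaling with 128b/130b encoding, which works out to roughly 0.985 GB/s of usable bandwidth per lane:

```python
def pcie_gen3_gbps(lanes):
    """Approximate usable PCIe Gen 3 bandwidth in GB/s:
    8 GT/s per lane with 128b/130b encoding, divided by 8 bits/byte."""
    return lanes * 8 * (128 / 130) / 8

d5_port = pcie_gen3_gbps(4)    # one DSSD D5 Gen 3 x4 port, ~3.94 GB/s
host_card = pcie_gen3_gbps(8)  # dual-port Gen 3 x8 host card, ~7.88 GB/s
print(round(2 * d5_port, 2), round(host_card, 2))  # 7.88 7.88
```

Two x4 ports at full rate exactly consume one x8 slot's bandwidth, which is why the host card needs the wider connection.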

Think of the data plane as similar to how a SAS expander works in an enclosure, or a SAS switch, the difference being it is PCIe and not SAS or another protocol. Note that even though the terms mesh, fabric, switch, and network are used, these are NOT attached to traditional LAN, SAN, NAS or other networks. Instead, this is a private "networked back plane" between the servers and storage devices (e.g. FMs).

EMC DSSD D5 details

The dual controllers (e.g. the control plane) oversee flash management, including garbage collection among other tasks, and storage is thin provisioned.

The dual controllers (active/active) are connected to each other (e.g. the control plane) as well as to the data path; however, they do not sit in the data path. Thus this is a fast-path/control-path approach, meaning the controllers can get involved to do management functions when needed, and get out of the way of work when not needed. The controllers are hot-swap and add global management functions including setting up and tearing down host client/server I/O paths, mappings and affinities. Controllers also support the management of CUBIC RAID data protection functions performed by the Flash Modules (FM).

Other functions the controllers implement, leveraging their CPUs and DRAM, include flash translation layer (FTL) functions normally handled by SSD cards, drives or other devices. These FTL functions include wear-leveling for durability, garbage collection, and voltage/power management among other tasks. The result is that the flash modules are able to spend more of their time and resources handling I/O operations vs. handling management tasks, compared with traditional off-the-shelf SSD drives, cards or devices.

The FMs insert from the front and come in two sizes of 2TB and 4TB of raw NAND capacity. What's different about the FMs vs. some other vendors' approaches is that these are not your traditional PCIe flash cards; instead they are custom cards with a proprietary ASIC and raw NAND dies. DRAM is used in the FM as a buffer to hold data for write optimization as well as to enhance wear-leveling to increase flash endurance.

The result is up to thousands of NAND dies spread over up to 36 FMs; more importantly, more performance is derived out of those resources. The increased performance comes from DSSD implementing its own flash translation layer, garbage collection, and power/voltage management among other techniques to derive more useful work per watt of energy consumed.

EMC DSSD performance claims:

  • 100 microsecond latency for small IOs
  • 100 GB/sec bandwidth for large IOs
  • 10 million small IO IOPS
  • Up to 144TB raw capacity

How Does It Compare To Other AFA and SSD solutions

There will be many apples to oranges comparisons as is often the case with new technologies or at least until others arrive in the market.

Some general comparisons that may be apples to oranges as opposed to apples to apples include:

  • Shared and dense fast NAND flash (eMLC) SSD storage
  • Disaggregated flash SSD storage from the server while enabling high performance and low latency
  • Eliminates pools or ponds of dedicated SSD storage capacity and performance
  • Not a SAN, yet more than server-side flash or a flash SSD JBOD
  • Underlying Flash Translation Layer (FTL) is disaggregated from SSD devices
  • Optimized hardware and software data path
  • Requires a special server-side stateless adapter for accessing shared storage

Some other comparisons include:

  • Hybrid and AFA shared via some server storage I/O network (good sharing, feature-rich, resilient; slower performance and higher latency due to hardware, network and server I/O software stacks). For example EMC VMAX, VNX, XtremIO among others.
  • Server-attached flash SSD, aka server SAN (flash SSD creates islands of technology, lower resource sharing, data shuffling between servers, limited or no data services, management complexity). For example PCIe flash SSD stateful (persistent) cards where data is stored or used as a cache, along with associated management tools and drivers.
  • DSSD D5 is a rack-scale hybrid approach combining direct attached shared flash with lower latency and higher performance vs. a traditional AFA or hybrid storage array, and better resource usage, sharing, management and performance vs. traditional dedicated server flash. It complements server-side data infrastructure and application scale-out software. Server applications can reach NVMe storage via user space with block, HDFS, Flood and other APIs.

Using EMC DSSD D5 in possible hybrid ways

What Happened to Server PCIe cards and Server SANs

If you recall, a few years ago the industry rage was flash SSD PCIe server cards from vendors such as EMC, FusionIO (now part of SanDisk), Intel (still Intel), LSI (now part of Seagate), Micron (still Micron) and STEC (now part of Western Digital) among others. Server-side flash SSD PCIe cards are still popular, particularly newer NVMe controller-based models that use the NVMe protocol stack instead of AHCI/SATA or others.

However, as is often the case, things evolve, and while there is still a place for server-side stateful PCIe flash cards either for data or as cache, there is also the need to combine and simplify management, as well as streamline the software I/O stacks, which is where EMC DSSD D5 comes into play. It enables consolidation of server-side SSD cards into a shared 5U chassis, giving up to 48 dual-pathed servers access to the flash pools while using streamlined server software stacks and drivers that leverage NVMe over PCIe.

Where to learn more

Continue reading with the following links about NVMe, flash SSD and EMC DSSD.

  • Part one of this series here and part two here.
  • Performance Redefined! Introducing DSSD D5 Rack-Scale Flash Solution (EMC Pulse Blog)
  • EMC Unveils DSSD D5: A Quantum Leap In Flash Storage (EMC Press Release)
  • EMC Declares 2016 The “Year of All-Flash” For Primary Storage (EMC Press Release)
  • EMC DSSD D5 Rack-Scale Flash (EMC PDF Overview)
  • EMC DSSD and Cloudera Evolve Hadoop (EMC White Paper Overview)
  • Software Aspects of The EMC DSSD D5 Rack-Scale Flash Storage Platform (EMC PDF White Paper)
  • EMC DSSD D5 (EMC PDF Architecture and Product Specification)
  • EMC VFCache respinning SSD and intelligent caching (Part II)
  • EMC To Acquire DSSD, Inc., Extends Flash Storage Leadership
  • Part II: XtremIO, XtremSW and XtremSF EMC flash ssd portfolio redefined
  • XtremIO, XtremSW and XtremSF EMC flash ssd portfolio redefined
  • Learn more about flash SSD here and NVMe here at thenvmeplace.com
What this all means

EMC with DSSD D5 now has another solution to offer clients. Granted, their challenge, as it has been over the past couple of decades, will be to educate and compensate their sales force and partners on which technology solution to put forward for different needs.

On one hand, life could be simpler for EMC if they only had one platform solution that would then be the answer to every problem, a situation some other vendors and startups find themselves in. Likewise, if all you have is one solution, then while you can try to make that solution fit different environments, or get the environment to adapt to the solution, having options is a good thing if those options can remove complexity along with cost while boosting productivity.

I would like to see support for other operating systems such as Windows, particularly with the future Windows Server 2016 based Nano Server, as well as hypervisors including VMware and Hyper-V among others. On the other hand, I also would like to see a Sharp Aquos Quattron 80" 1080p 240Hz 3D TV on my wall to watch HD videos from my DJI Phantom Drone. For now focusing on Linux makes sense, however, it would be nice to see some more platforms supported.

    Keep an eye on the NVMe space as we are seeing NVMe solutions appearing inside servers, storage system, external dedicated and shared, as well as some other emerging things including NVMe over Fabric. Learn more about EMC DSSD D5 here.

    Ok, nuff said (for now)

    Cheers
    Gs

    Greg Schulz – Author Cloud and Virtual Data Storage Networking (CRC Press), The Green and Virtual Data Center (CRC Press) and Resilient Storage Networks (Elsevier)
    twitter @storageio

    All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2023 Server StorageIO(R) and UnlimitedIO All Rights Reserved

EMC DSSD D5 Rack Scale Direct Attached Shared SSD All Flash Array Part I

    server storage I/O trends

    This is the first post in a two-part series pertaining to the EMC DSSD D5 announcement, you can read part two here.

    EMC announced today the general availability of their DSSD D5 Shared Direct Attached SSD (DAS) flash storage system (e.g. All Flash Array or AFA) which is a rack-scale solution. If you recall, EMC acquired DSSD back in 2014 which you can read more about here. EMC announced four configurations that include 36TB, 72TB and 144TB raw flash SSD capacity with support for up to 48 dual-ported host client servers.

    Via EMC Pulse Blog

    What Is DSSD D5

At a high level, EMC DSSD D5 is a PCIe direct attached SSD flash storage solution that aggregates the disparate SSD card functionality typically found in separate servers into a shared system, without causing aggravation. DSSD D5 helps to alleviate server-side I/O bottlenecks or aggravation issues that can result from aggregation of workloads or data. Think of DSSD D5 as a shared application server storage I/O accelerator enabling up to 48 servers to access up to 144TB of raw flash SSD to support various applications that have the need for speed.

Applications that have the need for speed, or that can benefit from less time waiting for results, where time is money and boosting productivity enables high profitability computing, are the target. This includes legacy as well as emerging applications and workloads spanning little data, big data and big fast structured and unstructured data: from Oracle to SAS to HBase and Hadoop among others, perhaps even Alluxio.

    Some examples include:

    • Clusters and scale-out grids
• High Performance Computing (HPC)
    • Parallel file systems
    • Forecasting and image processing
    • Fraud detection and prevention
    • Research and analytics
    • E-commerce and retail
    • Search and advertising
    • Legacy applications
    • Emerging applications
    • Structured database and key-value repositories
    • Unstructured file systems, HDFS and other data
    • Large undefined work sets
    • From batch stream to real-time
    • Reduces run times from days to hours

    Where to learn more

    Continue reading with the following links about NVMe, flash SSD and EMC DSSD.

  • Part one of this series here and part two here.
  • Performance Redefined! Introducing DSSD D5 Rack-Scale Flash Solution (EMC Pulse Blog)
  • EMC Unveils DSSD D5: A Quantum Leap In Flash Storage (EMC Press Release)
  • EMC Declares 2016 The “Year of All-Flash” For Primary Storage (EMC Press Release)
  • EMC DSSD D5 Rack-Scale Flash (EMC PDF Overview)
  • EMC DSSD and Cloudera Evolve Hadoop (EMC White Paper Overview)
  • Software Aspects of The EMC DSSD D5 Rack-Scale Flash Storage Platform (EMC PDF White Paper)
  • EMC DSSD D5 (EMC PDF Architecture and Product Specification)
  • EMC VFCache respinning SSD and intelligent caching (Part II)
  • EMC To Acquire DSSD, Inc., Extends Flash Storage Leadership
  • Part II: XtremIO, XtremSW and XtremSF EMC flash ssd portfolio redefined
  • XtremIO, XtremSW and XtremSF EMC flash ssd portfolio redefined
  • Learn more about flash SSD here and NVMe here at thenvmeplace.com
What this all means

Today’s legacy and emerging applications have the need for speed, and where the applications may not need speed, the users as well as the Internet of Things (IoT) devices that depend upon, or feed, those applications do need things to move faster. Fast applications need fast software and hardware to get the same amount of work done faster with fewer delays, as well as to process larger amounts of structured and unstructured little data, big data and very fast big data.

    Different applications along with the data infrastructures they rely upon including servers, storage, I/O hardware and software need to adapt to various environments, one size, one approach model does not fit all scenarios. What this means is that some applications and data infrastructures will benefit from shared direct attached SSD storage such as rack scale solutions using EMC DSSD D5. Meanwhile other applications will benefit from AFA or hybrid storage systems along with other approaches used in various ways.

    Continue reading part two of this series here including how EMC DSSD D5 works and more perspectives.

    Ok, nuff said (for now)

    Cheers
    Gs

    Greg Schulz – Author Cloud and Virtual Data Storage Networking (CRC Press), The Green and Virtual Data Center (CRC Press) and Resilient Storage Networks (Elsevier)
    twitter @storageio


    The Human Face of Big Data, a Book Review

    StorageIO industry trends cloud, virtualization and big data

    My copy of the new book The Human Face of Big Data created by Rick Smolan and Jennifer Erwitt arrived yesterday compliments of EMC (the lead sponsor). In addition to EMC, the other sponsors of the book are Cisco, VMware, FedEx, Originate and Tableau software.

To say this is a big book would be an understatement; then again, big data is a big topic with a lot of diversity if you open your eyes and think in a pragmatic way, which you will see once you open its pages. This is physically a big book (11 x 14 inches) with lots of pictures, text, stories, factoids and thought stimulating information on the many facets and dimensions of big data across 224 pages.

While Big Data as a buzzword and industry topic theme might be new, along with some of the related technologies, techniques and focus areas, other aspects have been around for some time. Big data means many things to various people depending on their focus or areas of interest, ranging from analytics to images, videos and other big files. A common theme is the fact that there is no such thing as an information or data recession, that people and data are living longer and getting larger, and that we are all addicted to information for various reasons.

    Big data needs to be protected and preserved as it has value, or its value can increase over time as new ways to leverage it are discovered which also leads to changing data access and life cycle patterns. With many faces, facets and areas of interests applying to various spheres of influence, big data is not limited to programmatic, scientific, analytical or research, yet there are many current and use cases in those areas.

Big data is not limited to videos for security surveillance, entertainment, telemetry, audio, social media, energy exploration, geosciences, seismic, forecasting or simulation, yet those have been areas of focus for years. Some big data files or objects are millions of bytes (MBytes), billions of bytes (GBytes) or trillions of bytes (TBytes) in size that, when put into file systems or object repositories, add up to Exabytes (EB – 1,000,000 TBytes) or Zettabytes (ZB – 1000 EBs). Now if you think those numbers are far-fetched, simply look back to when you thought a TByte, GByte, let alone a MByte, was big or a far-fetched future. Remember, there is no such thing as a data or information recession, and people and data are living longer and getting larger.
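Since the unit prefixes trip people up, here is a quick sketch in Python of the decimal scale jumps described above (the helper name is my own, purely for illustration):

```python
# Decimal (SI) storage units, smallest to largest.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given decimal unit to bytes."""
    return value * 1000 ** UNITS.index(unit)

# One Exabyte is a million TBytes; one Zettabyte is a thousand EBytes.
print(to_bytes(1, "EB") // to_bytes(1, "TB"))  # 1000000
print(to_bytes(1, "ZB") // to_bytes(1, "EB"))  # 1000
```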

Big data is more than Hadoop, MapReduce, SAS or other programmatic and analytics focused tools, solutions or platforms, yet those all have been and will be significant focus areas in the future. This also means big data is more than data warehouses, data marts, data mining, social media and event or activity log processing, which are also main parts with continued roles going forward. Just as there are large MByte, GByte or TByte sized files or objects, there are also millions and billions of smaller files, objects or pieces of information that are part of the big data universe.
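For readers who have not seen MapReduce up close, the pattern Hadoop popularized can be sketched in a few lines of plain Python. This is a simplified single-process illustration of the map, shuffle and reduce steps, not Hadoop itself:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line of input.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts per word (the shuffle step groups keys).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data is everywhere"]
result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(result["big"], result["data"])  # 2 2
```

In a real cluster the map and reduce phases run in parallel across many nodes over data in HDFS; the logic per record is the same.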

You can take a narrow product, platform, tool, process, approach, application, sphere of influence or domain of interest view towards big data, or a pragmatic view of its various faces and facets. Of course you can also spin everything that is not little data to be big data, and that is where some of the BS about big data comes from. Big data is not exclusive to data scientists, researchers, academia, governments or analysts, yet there are areas of focus where those are important. What this means is that there are other areas of big data that do not need a data science, computer science, mathematics, statistics, doctoral or other advanced degree or training; in other words, big data is for everybody.

    Cover image of Human Face of Big Data Book

    Back to how big this book is in both physical size, as well as rich content. Note the size of The Human Face of Big Data book in the adjacent image that for comparison purposes has a copy of my last book Cloud and Virtual Data Storage Networking (CRC), along with a 2.5 inch hard disk drive (HDD) and a growler. The Growler is from Lift Bridge Brewery (Stillwater, MN), after all, reading a big book about big data can create the need for a big beer to address a big thirst for information ;).

The Human Face of Big Data is more than a coffee table or picture book as it is full of information, factoids and perspectives on how information and data surround us every day. Check out the image below and note the 2.5 inch HDD sitting on the top right hand corner of the page above the text. Open up a copy of The Human Face of Big Data and you will see examples of how data and information are all around us, and our dependence upon them.

A look inside the book The Human Face of Big Data image

    Book Details:
    Copyright 2012
    Against All Odds Productions
    ISBN 978-1-4549-0827-2
    Hardcover 224 pages, 11 x 0.9 x 14 inches
    4.8 pounds, English

There is also an applet to view related videos and images found in the book at HumanFaceofBigData.com/viewer in addition to other material on the companion site www.HumanFaceofBigData.com.

Get your copy of The Human Face of Big Data at Amazon.com by clicking here or at other venues.

    Some added and related material:
    Little data, big data and very big data (VBD) or big BS?
    How many degrees separate you and your information?
    Hardware, Software, what about Valueware?
Changing Lifecycles and Data Footprint Reduction (Data doesn't have to lose value over time)
    Garbage data in, garbage information out, big data or big garbage?
    Industry adoption vs. industry deployment, is there a difference?
    Is There a Data and I/O Activity Recession?
    Industry trend: People plus data are aging and living longer
    Supporting IT growth demand during economic uncertain times
    No Such Thing as an Information Recession

For those who can see big data in a broad and pragmatic way, perhaps using the visualization aspect this book brings forth, there are and will be many opportunities. Then again, for those who have a narrow or specific view of what is or is not big data, there is so much of it around, of various types and focus areas, that you too will see some benefits.

Do you want to play in or be part of a big data puddle, pond, or lake, or sail and explore the oceans of big data and all the different aspects found in, under and around those bigger, broader bodies of water?

Bottom line, this is a great book and read regardless of whether you are involved with data and information related topics or themes; the format and design lend themselves to any audience. Broaden your horizons, open your eyes, ears and thinking to the many facets and faces of big data that are all around us by getting your copy of The Human Face of Big Data (click here to go to Amazon for your copy).

    Ok, nuff said.

    Cheers gs

    Greg Schulz – Author Cloud and Virtual Data Storage Networking (CRC Press, 2011), The Green and Virtual Data Center (CRC Press, 2009), and Resilient Storage Networks (Elsevier, 2004)

    twitter @storageio

    All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2024 Server StorageIO and UnlimitedIO LLC All Rights Reserved

    Ceph Day Amsterdam 2012 (Object and cloud storage)

    StorageIO industry trends cloud, virtualization and big data

    Recently while I was in Europe presenting some sessions at conferences and doing some seminars, I was invited by Ed Saipetch (@edsai) of Inktank.com to attend the first Ceph Day in Amsterdam.

    Ceph day image

As luck or fate would have it, I was in Nijkerk, which is about an hour train ride from Amsterdam Central station, with a free day in my schedule. After a morning train ride and nice walk from Amsterdam Central I arrived at the Tobacco Theatre (a former tobacco trading venue) where Ceph Day was underway, in time for a lunch of kroketten sandwiches.

    Attendees at Ceph Day

Let's take a quick step back and address, for those not familiar, what Ceph (short for cephalopod) is and why it was worth spending a day attending this event. Ceph is an open source distributed object scale-out (e.g. cluster or grid) storage software platform that runs on industry standard hardware.

    Dell server supporting ceph demoSketch of ceph demo configuration

Ceph is used for deploying object storage, cloud storage and managed services, general purpose storage for research, commercial, scientific, high performance computing (HPC) or high productivity computing (commercial) along with backup or data protection and archiving destinations. Other software similar in functionality or capabilities to Ceph includes OpenStack Swift, Basho Riak CS, Cleversafe, Scality and Caringo among others. There are also tin wrapped software (e.g. appliances or pre-packaged) solutions such as Dell DX (Caringo), DataDirect Networks (DDN) WOS, EMC ATMOS and Centera, Amplidata and HDS HCP among others. From a service standpoint, these solutions can be used to build services similar to Amazon S3 and Glacier, Rackspace Cloud Files and Cloud Block Storage, DreamHost DreamObjects and HP Cloud Storage among others.

    Ceph cloud and object storage architecture image

At the heart of Ceph is RADOS, a distributed object store that consists of peer nodes functioning as object storage devices (OSDs). Data can be accessed via REST (Amazon S3 like) APIs, libraries, CephFS and gateways, with information being spread across nodes and OSDs using a CRUSH based algorithm (note Sage Weil is one of the authors of CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data). Ceph is scalable in terms of performance, availability and capacity by adding extra nodes with hard disk drives (HDDs) or solid state devices (SSDs). One of the presentations pertained to DreamHost, an early adopter of Ceph, which used it to build their DreamObjects (cloud storage) offering.
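The key idea behind CRUSH style placement is that any client can compute where an object lives from the object name and the cluster map alone, with no central lookup table. The following is a deliberately simplified rendezvous-hash style sketch of that idea in Python; it is not the actual CRUSH algorithm (which also models failure domains and placement rules), and the names are my own:

```python
import hashlib

def place(object_name, osds, replicas=3):
    """Simplified stand-in for CRUSH: rank OSDs by a hash of
    (object, osd) and take the top N. Being deterministic, every
    client computes the same placement without a central lookup."""
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha256(
            f"{object_name}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(12)]
print(place("myphoto.jpg", osds))  # three distinct OSDs, same every run
```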

    Ceph cloud and object storage deployment image

In addition to storage nodes, there are also an odd number of monitor nodes to coordinate and manage the Ceph cluster, along with optional gateways for file access. In the above figure (via DreamHost), load balancers sit in front of gateways that interact with the storage nodes. The storage node in this example is a physical server with 12 x 3TB HDDs, each configured as an OSD.

    Ceph dreamhost dreamobject cloud and object storage configuration image

In the DreamHost example above, there are 90 storage nodes plus 3 management nodes; the total raw storage capacity (no RAID) is about 3PB (12 x 3TB = 36TB x 90 = 3.24PB). Instead of using RAID or mirroring, each object's data is replicated or copied to three (e.g. N=3) different OSDs (on separate nodes), where N is adjustable for a given level of data protection, for a usable storage capacity of about 1PB.

    Note that for more usable capacity and lower availability, N could be set lower, or a larger value of N would give more durability or data protection at higher storage capacity overhead cost. In addition to using JBOD configurations with replication, Ceph can also be configured with a combination of RAID and replication providing more flexibility for larger environments to balance performance, availability, capacity and economics.
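The replication arithmetic above is easy to check; here is a small sketch (the function name is my own, not anything from Ceph) mirroring the DreamHost numbers:

```python
def usable_capacity(nodes, drives_per_node, drive_tb, replicas):
    """Raw and usable (replicated) capacity in TB for an N-copy
    cluster: usable is raw divided by the replica count."""
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb, raw_tb / replicas

# DreamHost example: 90 nodes x 12 x 3TB drives, N=3 copies.
raw, usable = usable_capacity(nodes=90, drives_per_node=12,
                              drive_tb=3, replicas=3)
print(raw, round(usable))  # 3240 1080  (about 3.24PB raw, ~1PB usable)
```

Lowering N to 2 would raise usable capacity to 1.62PB at the cost of durability, which is the trade-off described above.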

    Ceph dreamhost and dreamobject cloud and object storage deployment image

    One of the benefits of Ceph is the flexibility to configure it how you want or need for different applications. This can be in a cost-effective hardware light configuration using JBOD or internal HDDs in small form factor generally available servers, or high density servers and storage enclosures with optional RAID adapters along with SSD. This flexibility is different from some cloud and object storage systems or software tools which take a stance of not using or avoiding RAID vs. providing options and flexibility to configure and use the technology how you see fit.

    Here are some links to presentations from Ceph Day:
    Introduction and Welcome by Wido den Hollander
    Ceph: A Unified Distributed Storage System by Sage Weil
    Ceph in the Cloud by Wido den Hollander
    DreamObjects: Cloud Object Storage with Ceph by Ross Turk
    Cluster Design and Deployment by Greg Farnum
    Notes on Librados by Sage Weil

    Presentations during ceph day

While at Ceph Day, I was able to spend a few minutes with Sage Weil, Ceph creator and founder of Inktank, to record a podcast (listen here) about what Ceph is, where and when to use it, along with other related topics. Also while at the event I had a chance to sit down with Curtis (aka Mr. Backup) Preston where we did a simulcast video and podcast. The simulcast involved Curtis recording this video with me as a guest discussing Ceph, cloud and object storage, backup, data protection and related themes while I recorded this podcast.

One of the interesting things I heard, or actually did not hear, at the Ceph Day event that I tend to hear at related conferences such as SNW, was a focus on where and how to use, configure and deploy Ceph, along with various configuration options and replication or copy modes, as opposed to going off on erasure codes or other tangents. In other words, instead of focusing on the data protection protocol and algorithms, or on what is wrong with the competition or other architectures, the Ceph Day focus was on removing cloud and object storage objections and on enablement.

    Where do you get Ceph? You can get it here, as well as via 42on.com and inktank.com.

    Thanks again to Sage Weil for taking time out of his busy schedule to record a pod cast talking about Ceph, as well 42on.com and inktank for hosting, and the invitation to attend the first Ceph Day in Amsterdam.

    View of downtown Amsterdam on way to train station to return to Nijkerk
    Returning to Amsterdam central station after Ceph Day

    Ok, nuff said.

    Cheers gs

    Greg Schulz – Author Cloud and Virtual Data Storage Networking (CRC Press, 2011), The Green and Virtual Data Center (CRC Press, 2009), and Resilient Storage Networks (Elsevier, 2004)

    twitter @storageio

    All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2012 StorageIO and UnlimitedIO All Rights Reserved

    Seven databases in seven weeks, a book review of NoSQL databases

    StorageIO industry trends cloud, virtualization and big data

Seven Databases in Seven Weeks (A Guide to Modern Databases and the NoSQL Movement) is a book written by Eric Redmond (@coderoshi) and Jim Wilson (@hexlib), part of The Pragmatic Programmers (@pragprog) series, that takes a look at several non-SQL based database systems.

    Cover image of seven databases in seven weeks book image

Coverage includes PostgreSQL, Riak, Apache HBase, MongoDB, Apache CouchDB, Neo4J and Redis with plenty of code and architecture examples. Also covered are relational vs. key-value, columnar and document based systems among others.

    The details: Seven Databases in Seven Weeks
    Paperback: 352 pages
    Publisher: Pragmatic Bookshelf (May 18, 2012)
    Language: English
    ISBN-10: 1934356921
    ISBN-13: 978-1934356920
    Product Dimensions: 7.5 x 0.8 x 9 inches

    Buzzwords (or keywords) include availability, consistency, performance and related themes. Others include MongoDB, Cassandra, Redis, Neo4J, JSON, CouchDB, Hadoop, HBase, Amazon Dynamo, Map Reduce, Riak (Basho) and Postgres along with data models including relational, key value, columnar, document and graph along with big data, little data, cloud and object storage.

    While this book is not a how to tutorial or installation guide, it does give a deep dive into the different databases covered. The benefit is gaining an understanding of what the different databases are good for, strengths, weakness, where and when to use or choose them for various needs.
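To make the data model distinctions the book covers concrete, here is a toy sketch of one record rendered three ways. These are illustrative stand-ins in plain Python, not actual Redis, MongoDB or Neo4J API calls:

```python
import json

# Key-value view (Redis style): opaque values behind composed keys.
kv_store = {"user:42:name": "Ada", "user:42:city": "London"}

# Document view (MongoDB/CouchDB style): one self-describing record.
document = json.dumps({"_id": 42, "name": "Ada", "city": "London"})

# Graph view (Neo4J style): nodes plus explicit, typed relationships.
graph = {"nodes": {42: "Ada", 7: "London"},
         "edges": [(42, "LIVES_IN", 7)]}

print(kv_store["user:42:name"], json.loads(document)["city"])  # Ada London
```

The same facts are present in each, but the queries each model makes cheap differ, which is the trade-off the book walks through database by database.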

    Look inside seven databases in seven weeks book image
A look inside my copy of Seven Databases in Seven Weeks

Who should read this book includes application developers, programmers, cloud, big data and IT/ICT architects, planners and designers along with database, server, virtualization and storage professionals. What I like about the book is that it is a great intro and overview with sufficient depth to understand what these different solutions can and cannot do, and when, where and why to use these tools for different situations, in a quick read format with plenty of detail.

    Would I recommend buying it: Yes, I bought a copy myself on Amazon.com, get your copy by clicking here.

    Ok, nuff said

    Cheers gs

    Greg Schulz – Author Cloud and Virtual Data Storage Networking (CRC Press, 2011), The Green and Virtual Data Center (CRC Press, 2009), and Resilient Storage Networks (Elsevier, 2004)

    twitter @storageio

    All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2012 StorageIO and UnlimitedIO All Rights Reserved

    Little data, big data and very big data (VBD) or big BS?

    StorageIO industry trends cloud, virtualization and big data

    This is an industry trends and perspective piece about big data and little data, industry adoption and customer deployment.

    If you are in any way associated with information technology (IT), business, scientific, media and entertainment computing or related areas, you may have heard big data mentioned. Big data has been a popular buzzword bingo topic and term for a couple of years now. Big data is being used to describe new and emerging along with existing types of applications and information processing tools and techniques.

    I routinely hear from different people or groups trying to define what is or is not big data and all too often those are based on a particular product, technology, service or application focus. Thus it should be no surprise that those trying to police what is or is not big data will often do so based on what their interest, sphere of influence, knowledge or experience and jobs depend on.

    Traveling and big data images

Not long ago while out traveling I ran into a person who told me that big data is new data that did not exist just a few years ago. It turns out this person was involved in geology, so I was surprised that somebody in that field was not aware of or working with geophysical, mapping, seismic and other legacy or traditional big data. This person was basing his statements on what he knew, heard or was told about, or on the sphere of influence around a particular technology, tool or approach.

Fwiw, if you have not figured it out already, like cloud, virtualization and other technology enabling tools and techniques, I tend to take a pragmatic approach vs. becoming latched onto a particular bandwagon (for or against) per se.

Not surprisingly there is confusion and debate about what is or is not big data, including whether it only applies to new vs. existing and old data. As with any new technology, technique or buzzword bingo topic theme, various parties will try to place what is or is not under the definition to align with their needs, goals and preferences. This is the case with big data, where you can routinely find proponents of Hadoop and MapReduce positioning big data as aligning with the capabilities and usage scenarios of those related technologies for business and other forms of analytics.

    SAS software for big data

Not surprisingly the granddaddy of all business analytics, data science and statistical analysis number crunching is the Statistical Analysis System (SAS) from the SAS Institute. If these types of technology solutions and their peers define what is big data, then SAS (not to be confused with Serial Attached SCSI, which can be found on the back-end of big data storage solutions) can be considered first generation big data analytics or Big Data 1.0 (BD1 ;) ). That makes Hadoop MapReduce Big Data 2.0 (BD2 ;) ;) ) if you like, or dislike for that matter.

Funny thing about some fans and proponents or surrogates of BD2 is that they may have heard of BD1 tools like SAS yet have a limited understanding of what they are or how they are or can be used. When I worked in IT as a performance and capacity planning analyst focused on servers, storage, network hardware, software and applications, I used SAS to crunch various data streams of event, activity and other data from diverse sources. This involved correlating data and running various analytic algorithms on it to determine response times, availability, usage and other things in support of modeling, forecasting, tuning and troubleshooting. Hmm, sound like first generation big data analytics or Data Center Infrastructure Management (DCIM) and IT Service Management (ITSM) to anybody?
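That kind of correlation and rollup work is easy to picture with a small example. The records and names below are hypothetical, a tiny Python stand-in for the per-server response time analysis described above, not anything from SAS:

```python
# Hypothetical activity-log samples: (timestamp, server, response_ms),
# the kind of event stream a capacity planner might collect.
events = [
    (1, "web01", 120), (2, "web01", 80),
    (3, "web02", 200), (4, "web02", 100), (5, "web01", 100),
]

def rollup(events):
    """Average response time per server -- a minimal stand-in for
    the correlation and trending work done with BD1 tools."""
    sums, counts = {}, {}
    for _, server, ms in events:
        sums[server] = sums.get(server, 0) + ms
        counts[server] = counts.get(server, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}

print(rollup(events))  # {'web01': 100.0, 'web02': 150.0}
```

Swap the averaging for regression or seasonal trending and you have the forecasting side of the same job.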

Now to be fair, comparing SAS, SPSS or any number of other BD1 generation tools to Hadoop and MapReduce or BD2 second generation tools is like comparing apples to oranges, or apples to pears.

Let's move on, as there is much more to big data than simply a focus on SAS or Hadoop.


Another type of big data is the information generated, processed, stored and used by applications that result in large files, data sets or objects. Large files, objects or data sets include low resolution and high-definition photos, videos, audio, security and surveillance, geophysical mapping and seismic exploration among others. Then there are data warehouses, where transactional data from databases gets moved for analysis in systems such as those from Oracle, Teradata, Vertica or FX among others. Some of those tools even play (or work) in both the traditional e.g. BD1 and the new or emerging BD2 worlds.

    This is where some interesting discussions, debates or disagreements can occur between those who latch onto or want to keep big data associated with being something new and usually focused around their preferred tool or technology. What results from these types of debates or disagreements is a missed opportunity for organizations to realize that they might already be doing or using a form of big data and thus have a familiarity and comfort zone with it.

By having a familiarity or comfort zone vs. seeing big data as something new, different, hype or full of FUD (or BS), an organization can be comfortable with the term big data. Often after taking a step back and looking at big data beyond the hype or FUD, the reaction is along the lines of: oh yeah, now we get it, sure, we are already doing something like that, so let's take a look at some of the new tools and techniques to see how we can extend what we are doing.

Likewise many organizations are doing big bandwidth already and may not realize it, thinking that is only what media and entertainment, government, technical or scientific computing, or high performance and high productivity computing (HPC) does. I’m assuming that some of the big data and big bandwidth pundits will disagree; however, if in your environment you are doing many large backups, archives, content distribution, or copying large amounts of data for different purposes, then you too consume big bandwidth and need big bandwidth solutions.

    Yes I know, that’s apples to oranges and perhaps stretching the limits of what is or can be called big bandwidth based on somebody’s definition, taxonomy or preference. Hopefully you get the point that there is diversity across various environments as well as types of data and applications, technologies, tools and techniques.


    What about little data then?

I often say that if big data is getting all the marketing dollars to generate industry adoption, then little data is generating all the revenue (and profit or margin) dollars from customer deployment. While tools and technologies related to Hadoop (or Haydoop if you are from HDS) are getting industry adoption attention (e.g. marketing dollars being spent), revenues from customer deployment are growing.

Where big data revenues are strongest for most vendors today is around solutions for hosting, storing, managing and protecting big files and big objects. These include scale-out NAS solutions for large unstructured data like those from Amplidata, Cray, Dell, Data Direct Networks (DDN), EMC (e.g. Isilon), HP X9000 (IBRIX), IBM SONAS, NetApp, Oracle and Xyratex among others. Then there are flexible converged compute storage platforms optimized for analytics and running different software tools, such as those from EMC (Greenplum), IBM (Netezza), NetApp (via partnerships) or Oracle among others, that can be used for different purposes in addition to supporting Hadoop and MapReduce.

If little data is databases and things not generally lumped into the big data bucket, and if you think or perceive big data only to be Hadoop MapReduce based data, then does that mean all the large unstructured non-little data is very big data or VBD?


    Of course the virtualization folks might want to corner (if they have not already) the V for Virtual Big Data. In that case, instead of Very Big Data, how about very very Big Data (vvBD)? Or Ultra-Large Big Data (ULBD), or High-Revenue Big Data (HRBD)? Granted, the HR might cause some to think it’s about Health Records or Human Resources, both of which, btw, leverage different forms of big data regardless of what you see or think big data is.

    Does that then mean we should really be calling video, audio, PACS, seismic, security surveillance video and related data VBD? Would this further confuse the market or the industry, or help elevate it to a grander status in terms of size (data file or object capacity, bandwidth, market size and application usage, market revenue and so forth)?

    Do we need various industry consortiums, lobbyists or trade groups to go off and create models, taxonomies, standards and dictionaries based on their constituents’ needs? And would those align with the needs of customers? After all, there are big dollars flowing around big data industry adoption (marketing).


    What does this all mean?

    Is Big Data BS?

    First let me be clear: big data is not BS. However, there is a lot of marketing BS by some, along with hype and FUD, adding to the confusion and chaos, and perhaps even to missed opportunities. Keep in mind that in chaos and confusion there can be opportunity for some.

    IMHO big data is real.

    There are different variations, use cases and types of products, technologies and services that fall under the big data umbrella. That does not mean everything can or should fall under the big data umbrella, as there is also little data.

    What this all means is that different industries have different types of applications with big and little data, as well as virtual and very big data spanning videos, photos, images, audio, documents and more.

    Big data is a big buzzword-bingo term these days, with big vendor marketing dollars being applied, so the buzz, hype, FUD and more should be no surprise.

    Ok, nuff said, for now.

    Cheers gs

    Greg Schulz – Author Cloud and Virtual Data Storage Networking (CRC Press, 2011), The Green and Virtual Data Center (CRC Press, 2009), and Resilient Storage Networks (Elsevier, 2004)

    twitter @storageio

    All Comments, (C) and (TM) belong to their owners/posters, Other content (C) Copyright 2006-2012 StorageIO and UnlimitedIO All Rights Reserved