Every five years the amount of information in the electronic world undergoes a ten-fold increase. This proliferation of data will soon eclipse 2000 exabytes, and the majority of it is unstructured. Most of this data is poorly suited to business intelligence (BI) or analytical applications, especially those rooted in the relational database and XML world. Traditional data storage and retrieval mechanisms that have come of age over the last decade struggle to store and index large collections of unstructured data elegantly. They are also chronically flawed in their ability to retrieve the data effectively, distribute it, and attach real-life meaning to it; the result is a loss of the data’s intrinsic value, or of any understanding of what it empirically represents.
Unstructured data comes from many sources and takes many forms: OLTP applications that store data inside text fields without proper validation, domain rules, or attached metadata and semantics; web logs; text messages; voice mails; self-contained (energy) usage meters; video files; and so forth. There are treasure troves of valuable wisdom, insights, and business intelligence contained in this unwieldy data, yet during our lifetime the majority of the world’s information and knowledge will remain in an unstructured or semi-structured format. Despite the promise of Web 3.0 and persistent improvements in information technology, data of a scrambled composition and ill-formed constitution will always pose large challenges, because the evolution of technology will always outpace formal methods to categorize, classify, and attach semantic consistency and clarity to data. However, there are technologies arriving in the BI marketplace that provide a way for companies to make better sense of the disorganized and formless tangles of knowledge that reside in enterprise applications.
To deal with their increasing load of unstructured information, many business enterprises with industrial-strength BI requirements are betting on Hadoop, a project administered by the Apache Software Foundation. Hadoop is an open source software framework that supports the lightning-fast processing of huge amounts of data (especially unstructured data) via massively parallel processing (MPP) on distributed clusters of servers: more specifically, “shared nothing” commodity servers which can host and process terabytes (dare I say petabytes) of data over hundreds or thousands of managed nodes.
To gain a high-level understanding of Hadoop one must comprehend the two main services that differentiate it from competing solutions: the Hadoop Distributed File System (HDFS), on which all data resides, and a high-performance computational technique called MapReduce, which processes enormously large amounts of data in parallel by allocating work across a wide landscape of physical processors.
Hadoop contains its own distributed file system (DFS, or as previously mentioned HDFS), which is based in no small part on the Google File System. The technical architecture of HDFS is engineered so that it can support large files (especially those which contain gig upon gig of unstructured data) while providing very high levels of throughput and fault tolerance. Each file is split into blocks, and every block is replicated across several compute nodes of the DFS, which can span a very large number of commodity servers. The cluster draws on the aggregate sum of all constituent CPUs and disks in order to deliver high data bandwidth for exceedingly demanding data processing tasks. Processing instructions are apportioned to nodes according to their abilities, so that compute nodes with free cycles can crunch through the data they hold.
Hadoop monitors activity across its clusters of servers and gracefully handles performance issues, from I/O bottlenecks to outright node failures. If a given node is running slowly, or if a number of nodes are deemed to be ineffectual or performing poorly, Hadoop can reschedule the affected operations and execute them on another server which holds a replica of the data being processed. When it comes to matters of redundancy, HDFS can handily survive disk failures because it automatically replicates data: by default it keeps three copies of each block (the original plus two replicas on other nodes), and it re-creates lost copies when a node fails, thereby drastically reducing the risk of catastrophic data loss.
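The replication behavior described above can be illustrated with a toy model. The sketch below is not HDFS code: the block size, the round-robin placement, and the function names `place_blocks` and `handle_node_failure` are assumptions chosen for illustration (real HDFS uses a rack-aware placement policy). It shows only the general idea of splitting a file into blocks, keeping three copies of each, and restoring the copy count after a node fails.

```python
import itertools

BLOCK_SIZE_MB = 64   # assumed block size for this toy model
REPLICATION = 3      # three copies of every block


def place_blocks(file_size_mb, nodes, replication=REPLICATION):
    """Split a file into blocks and assign each block to distinct nodes.

    Placement is simple round-robin (an assumption, not HDFS's policy).
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    ring = itertools.cycle(nodes)
    return {
        block_id: [next(ring) for _ in range(replication)]
        for block_id in range(n_blocks)
    }


def handle_node_failure(placement, failed, nodes, replication=REPLICATION):
    """Remove a failed node and re-replicate its blocks onto survivors."""
    survivors = [n for n in nodes if n != failed]
    for block_id, holders in placement.items():
        holders = [n for n in holders if n != failed]
        # Copy from a surviving replica to nodes that lack this block.
        candidates = [n for n in survivors if n not in holders]
        while len(holders) < replication and candidates:
            holders.append(candidates.pop(0))
        placement[block_id] = holders
    return placement


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(200, nodes)          # a 200 MB file
placement = handle_node_failure(placement, "node2", nodes)
for block_id, holders in placement.items():
    print(block_id, holders)
```

Every block still has three holders after `node2` is removed, which is the property that lets work be rescheduled onto another node holding the same data.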
The MapReduce paradigm runs against the data that is resident on the nodes of a DFS cluster. Like HDFS, MapReduce owes a great debt to Google. Both Google and Yahoo rely heavily on MapReduce technology for their search engine functionality and for the dynamic serving up of advertising content. With MapReduce, a job is divided into numerous small units of work (tasks), each of which handles one portion of the input data and can be allocated to (and executed on) any available node in a distributed cluster, preferably one that already holds that portion of the data. In the “map” phase, each task applies a function to its portion of the input and emits intermediate key-value pairs; in the “reduce” phase, all values that share a key are brought together and combined into the final results.
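To make the two phases concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using the customary word-count example. It is plain Python rather than the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical; on a real cluster each input split would be mapped on a different node and the grouping step would move data across the network.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run map over every split, group by key, then reduce each group."""
    mapped = []
    for split in splits:  # on a cluster, splits are processed in parallel
        mapped.extend(map_phase(split))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data cluster"]))
```

Because each call to `map_phase` depends only on its own split, and each call to `reduce_phase` only on one key's values, both phases can be spread across as many nodes as are available.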
Much noise has been made about the benefits of MapReduce technology versus the time-tested practice of storing data in a relational database management system (RDBMS) and issuing vanilla SQL queries against that data. The pros and cons of both approaches are too numerous to summarize in this article. Nevertheless, we can speak to this debate with a good degree of confidence by putting forth the following generalization: when structured data is being analyzed, the use of SQL is recommended, provided that the queries can be tuned and optimized if needed, and that the data has been properly indexed, striped, and partitioned. However, when there are large volumes of unstructured data to be processed and analyzed, MapReduce wins hands down. After all, no standard RDBMS can match its powerful and beastly capabilities for accessing, scanning, and processing data across hundreds or thousands of cluster nodes. And when it comes to loading, storing, and querying various types of unstructured data in an RDBMS (if it will even accept the data in the first place), there will almost always be hefty problems with performance.
As long as the human population on our planet increases (a pretty safe bet for the near term), the amount of unstructured data that exists in electronic and paper form will continue to be overwhelming. Data mining and BI journeys into the unstructured world will remain off-putting and oftentimes demoralizing. For the present, the Hadoop software framework, aided in large part by its innovative file system and MapReduce engine, will offer some relief. Hadoop’s open source origins allow a high level of customization, and its architecture is built to scale. Any variety of commodity servers can be used in a Hadoop cluster without negatively affecting the distributed file system’s distinguished high availability and fault tolerance features.
Despite the attractive cost savings and effectiveness of Hadoop, it is important to understand some of the more visible implementation challenges.
- There is a learning curve with Hadoop that can be quite steep, especially for those without a basic grasp of grid computing, or a decent understanding of how server architecture affects the performance of a data warehouse, data mart, or data service.
- There is a shortage of practitioners in the marketplace with verifiable experience in Hadoop. This is likely to change quickly as the global community of contributors and users of Hadoop grows.
- The initial cost of ownership and implementation for a Hadoop configuration can be quite steep for companies that do not already have a good-sized inventory of state-of-the-art servers and reliable processing power. Because HDFS keeps three copies of each block by default, a Hadoop cluster usually needs about three times the storage space of the files that it will process and manage. (Given the steeply declining cost of storage over the last few years, this consideration has diminished in impact.)
- In many IT organizations, fear of and bias against open source software are still alive and well. The truth of the matter, though, is that open source BI technologies now offer so many advantages that they are impossible to ignore. On that note, Hadoop’s distributed file system is highly portable. It can run on Windows, Mac, and Linux platforms.
- When conducting a Hadoop proof-of-concept (POC) exercise, there may be initial disappointment in query performance. It is important to understand that this may be because too few nodes have been configured to handle the processing demands thrown at the DFS. (General guidelines for cluster configuration and scalability can be found at the Hadoop homepage: hadoop.apache.org.)
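The storage point in the list above lends itself to a quick back-of-the-envelope calculation. The sketch below is only a rough estimate under stated assumptions: the replication factor of three comes from the text, while the `scratch_fraction` allowance for intermediate job output, the per-node disk size, and the function names are hypothetical values picked for illustration, not Hadoop sizing guidance.

```python
import math


def raw_storage_tb(data_tb, replication=3, scratch_fraction=0.25):
    """Estimate the raw disk (in TB) needed to hold `data_tb` of files.

    replication      -- copies kept of each block (three, per the text)
    scratch_fraction -- assumed extra space for intermediate job output
    """
    if data_tb < 0 or replication < 1 or scratch_fraction < 0:
        raise ValueError("inputs must be non-negative, replication >= 1")
    return data_tb * replication * (1 + scratch_fraction)


def nodes_needed(raw_tb, disk_per_node_tb):
    """Number of nodes required to supply `raw_tb` of raw disk."""
    if disk_per_node_tb <= 0:
        raise ValueError("disk_per_node_tb must be positive")
    return math.ceil(raw_tb / disk_per_node_tb)


if __name__ == "__main__":
    raw = raw_storage_tb(10)             # 10 TB of source files
    print(raw, nodes_needed(raw, 4))     # assuming 4 TB of disk per node
```

An estimate like this is a useful sanity check before a POC: a cluster sized only for the raw file volume will be short on both disk and the number of nodes available to share the processing.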
At present, Cloudera (http://www.cloudera.com/) is the most active and visible contributor to the Hadoop project, providing a commercial distribution of Hadoop that is ready for the enterprise. Cloudera is greatly mitigating the risks associated with the open source aspect of Hadoop by offering a stable, tested, and commercially supported version of the software.
About the Author
William Laurent is one of the world's leading experts in information strategy and governance. For 20 years, he has advised numerous businesses and governments on technology strategy, performance management, and best practices—across all market sectors. William currently runs an independent consulting company that bears his name. In addition, he frequently teaches classes, publishes books and magazine articles, and lectures on various technology and business topics worldwide. As Senior Contributing Author for Dashboard Insight, he would enjoy your comments at firstname.lastname@example.org
Copyright 2011 - Dashboard Insight - All Rights Reserved.