Give me a Buzz: mail2boopal@gmail.com
  • Youtube Video

    Buzzwords aside, Big Data is likely to have a major impact on our world, amounting to a significant technological shift over the next decade. Don't believe us? Watch this video to find out some real-world examples of how Big Data is already making an impact.

  • Welcome to the Era of Big Data

    Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured. Much has been written on the big data trend and how it can serve as the basis for innovation, differentiation and growth.

Big Data Open Source

It was not easy to select a few out of many Open Source projects. My objective was to choose the ones that fit Big Data’s needs most. What has changed in the world of Open Source is that the big players have become stakeholders: IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s OpenStack-Powered Cloud Solution, VMware and EMC partnering on Cloud, Oracle releasing its NoSql database as Open Source.
“If you can’t beat them, join them”. History has vindicated the Open Source visionaries and advocates.

Hadoop Distributions

Hortonworks

Cloud Operating System

Cloud Foundry — By VMware
OpenStack — Worldwide participation and well-known companies

Storage

Fusion-io — Not open source, but very supportive of Open Source projects; Flash-aware applications.

Development Platforms and Tools

REEF — Microsoft’s Hadoop development platform
Lingual — By Concurrent
Pattern — By Concurrent
Python — Awesome programming language
Mahout — Machine learning library
Impala — Cloudera
R — MVP among statistical tools
Storm — Stream processing by Twitter
LucidWorks — Search, based on Apache Solr
Giraph — Graph processing, used at Facebook

NoSql Databases

MongoDB, Cassandra, HBase

Sql Databases

MySQL — Belongs to Oracle
MariaDB — Partnered with SkySQL
PostgreSQL — Object Relational Database
TokuDB — Improves RDBMS performance

Server Operating Systems

Red Hat — The de facto OS for Hadoop servers

BI, Data Integration, and Analytics

Talend
Pentaho
Jaspersoft

Big Data Will Change Our World

How-to: Select the Right Hardware for Your New Hadoop Cluster

One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.
Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)
In this blog post, you’ll learn some of the principles of workload evaluation and the critical role it plays in hardware selection. You’ll also learn the various factors that Hadoop administrators should take into account during this process.

Marrying Storage with Compute

Over the past decade, IT organizations have standardized on blades and SANs (Storage Area Networks) to satisfy their grid and processing-intensive workloads. While this model makes a lot of sense for a number of standard applications such as web servers, app servers, smaller structured databases, and data movement, the requirements for infrastructure have changed as the amount of data and number of users has grown. Web servers now have caching tiers, databases have gone massively parallel with local disk, and data movement jobs are pushing more data than they can handle locally.

Most teams building a Hadoop cluster don’t yet know the eventual profile of their workload.

Hardware vendors have created innovative systems to address these requirements including storage blades, SAS (Serial Attached SCSI) switches, external SATA arrays, and larger capacity rack units. However, Hadoop is based on a new approach to storing and processing complex data, with data movement minimized. Instead of relying on a SAN for massive storage and reliability then moving it to a collection of blades for processing, Hadoop handles large data volumes and reliability in the software tier.
Hadoop distributes data across a cluster of balanced machines and uses replication to ensure data reliability and fault tolerance. Because data is distributed on machines with compute power, processing can be sent directly to the machines storing the data. Since each machine in a Hadoop cluster stores as well as processes data, those machines need to be configured to satisfy both data storage and processing requirements.
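As a rough illustration of that data-locality idea, here is a minimal Python sketch. The block IDs, node names, and replica placements are invented for the example, and the real Hadoop scheduler is considerably more sophisticated; the point is only that work is assigned to a machine that already holds the data whenever possible.

  # Illustrative sketch of data-locality-aware task assignment.
  # All block IDs, node names, and placements below are made up for the example.
  block_replicas = {
      "block-001": ["node-a", "node-c", "node-f"],
      "block-002": ["node-b", "node-d", "node-e"],
  }
  idle_nodes = {"node-c", "node-d", "node-g"}

  def assign_task(block_id):
      """Prefer an idle node that already stores a replica of the block;
      otherwise fall back to any idle node (a non-local read)."""
      local = [n for n in block_replicas.get(block_id, []) if n in idle_nodes]
      if local:
          return local[0], "data-local"
      return next(iter(idle_nodes)), "non-local"

  for blk in sorted(block_replicas):
      node, locality = assign_task(blk)
      print(blk, "->", node, "(" + locality + ")")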

Why Workloads Matter

In nearly all cases, a MapReduce job will either encounter a bottleneck reading data from disk or from the network (known as an IO-bound job) or in processing data (CPU-bound). An example of an IO-bound job is sorting, which requires very little processing (simple comparisons) and a lot of reading and writing to disk. An example of a CPU-bound job is classification, where some input data is processed in very complex ways to determine ontology.
Here are several more examples of IO-bound workloads:
  • Indexing
  • Grouping
  • Data importing and exporting
  • Data movement and transformation
Here are several more examples of CPU-bound workloads:
  • Clustering/Classification
  • Complex text mining
  • Natural-language processing
  • Feature extraction
Because Cloudera’s customers need to thoroughly understand their workloads in order to fully optimize Hadoop hardware, a classic chicken-and-egg problem ensues. Most teams looking to build a Hadoop cluster don’t yet know the eventual profile of their workload, and often the first jobs that an organization runs with Hadoop are far different than the jobs that Hadoop is ultimately used for as proficiency increases. Furthermore, some workloads might be bound in unforeseen ways. For example, some theoretical IO-bound workloads might actually be CPU-bound because of a user’s choice of compression, or different implementations of an algorithm might change how the MapReduce job is constrained. For these reasons, when the team is unfamiliar with the types of jobs it is going to run, as an initial approach it makes sense to invest in a balanced Hadoop cluster.
The next step would be to benchmark MapReduce jobs running on the balanced cluster to analyze how they’re bound. To achieve that goal, it’s straightforward to measure live workloads and determine bottlenecks by putting thorough monitoring in place. We recommend installing Cloudera Manager on the Hadoop cluster to provide real-time statistics about CPU, disk, and network load. (Cloudera Manager is an included component of Cloudera Standard and Cloudera Enterprise — in the latter case with enterprise functionality, such as support for rolling upgrades, in place.) With Cloudera Manager installed, Hadoop administrators can then run their MapReduce jobs and check the Cloudera Manager dashboard to see how each machine is performing.  
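To make the bound-ness call concrete, the following Python sketch applies a naive rule of thumb to averaged utilization figures collected during a benchmark run. The 80% and 70% thresholds are arbitrary illustrative choices, not recommendations from Cloudera, and the metric names are placeholders for whatever your monitoring actually reports.

  # Naive heuristic for labeling a benchmarked job as CPU- or IO-bound.
  # Inputs are average utilizations (0.0-1.0) over the job's runtime;
  # the threshold values are arbitrary illustrative choices.
  def classify_workload(avg_cpu, avg_disk, avg_net):
      io = max(avg_disk, avg_net)
      if avg_cpu >= 0.8 and avg_cpu > io:
          return "CPU-bound"
      if io >= 0.7 and io > avg_cpu:
          return "IO-bound"
      return "balanced / inconclusive"

  # Example: a sort-like job that saturates the disks but leaves CPUs mostly idle.
  print(classify_workload(avg_cpu=0.35, avg_disk=0.92, avg_net=0.40))  # IO-bound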

The first step is to know which hardware your operations team already manages.

In addition to building out a cluster appropriate for the workload, we encourage customers to work with their hardware vendor to understand the economics of power and cooling. Since Hadoop runs on tens, hundreds, or thousands of nodes, an operations team can save a significant amount of money by investing in power-efficient hardware. Each hardware vendor will be able to provide tools and recommendations for how to monitor power and cooling.

Selecting Hardware for Your CDH Cluster

The first step in choosing a machine configuration is to understand the type of hardware your operations team already manages. Operations teams often have opinions or hard requirements about new machine purchases, and will prefer to work with hardware with which they’re already familiar. Hadoop is not the only system that benefits from efficiencies of scale. Again, as a general suggestion, if the cluster is new or you can’t accurately predict your ultimate workload, we advise that you use balanced hardware.
There are four types of roles in a basic Hadoop cluster: NameNode (and Standby NameNode), JobTracker, TaskTracker, and DataNode. (A node is a machine performing a particular task.) Most machines in your cluster will perform two of these roles, functioning as both DataNode (for data storage) and TaskTracker (for data processing).
Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster:
  • 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
  • 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
  • 64-512GB of RAM
  • Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed)
The NameNode role is responsible for coordinating data storage on the cluster, and the JobTracker for coordinating data processing. (In HA clusters, the Standby NameNode should not be co-located with the NameNode and will run on hardware identical to that of the NameNode.) Cloudera recommends that customers purchase enterprise-class machines for running the NameNode and JobTracker, with redundant power and enterprise-grade disks in RAID 1 or 10 configurations.
The NameNode will also require RAM directly proportional to the number of data blocks in the cluster. A good rule of thumb is to assume 1GB of NameNode memory for every 1 million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64GB of RAM on the NameNode provides plenty of room to grow the cluster. We also recommend having HA configured on both the NameNode and JobTracker, features that have been available in the CDH4 line for some time.
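That rule of thumb reduces to simple arithmetic. The sketch below estimates NameNode heap from the amount of data to be stored; the 128MB average block size is an illustrative assumption (actual block counts depend on file sizes and configuration), and the 1GB-per-million-blocks figure is the guideline quoted above.

  # Back-of-the-envelope NameNode memory estimate.
  # Assumes an average block size of 128MB and ~1GB of NameNode heap
  # per 1 million blocks, per the rule of thumb above.
  def estimate_namenode_heap_gb(total_data_tb, avg_block_mb=128):
      total_data_mb = total_data_tb * 1024 * 1024
      num_blocks = total_data_mb / avg_block_mb
      return num_blocks / 1_000_000  # ~1GB per million blocks

  # Example: 1PB (1024TB) of data is roughly 8.4 million blocks, so about
  # 8-9GB of heap; a 64GB NameNode leaves ample room for the cluster to grow.
  print(round(estimate_namenode_heap_gb(1024), 1))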
Here are the recommended specifications for NameNode/JobTracker/Standby NameNode nodes. The drive count will fluctuate depending on the amount of redundancy:
  • 4–6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for Journal node)
  • 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
  • 64-128GB of RAM
  • Bonded Gigabit Ethernet or 10Gigabit Ethernet

Remember, the Hadoop ecosystem is designed with a parallel environment in mind.

If you expect your Hadoop cluster to grow beyond 20 machines, we recommend that the initial cluster be configured as if it were to span two racks, where each rack has a top-of-rack 10 GigE switch. As the cluster grows to multiple racks, you will want to add redundant core switches to connect the top-of-rack switches with 40GigE. Having two logical racks gives the operations team a better understanding of the network requirements for intra-rack and cross-rack communication.
With a Hadoop cluster in place, the team can start identifying workloads and prepare to benchmark those workloads to identify hardware bottlenecks. After some time benchmarking and monitoring, the team will understand how additional machines should be configured. Heterogeneous Hadoop clusters are common, especially as they grow in size and number of use cases – so starting with a set of machines that are not “ideal” for your workload will not be a waste of time. Cloudera Manager offers templates that allow different hardware profiles to be managed in groups, making it simple to manage heterogeneous clusters.
Below is a list of various hardware configurations for different workloads, including our original “balanced” recommendation:
  • Light Processing Configuration (1U/machine): Two hex-core CPUs, 24-64GB memory, and 8 disk drives (1TB or 2TB)
  • Balanced Compute Configuration (1U/machine): Two hex-core CPUs, 48-128GB memory, and 12 – 16 disk drives (1TB or 2TB) directly attached using the motherboard controller. These are often available as twins with two motherboards and 24 drives in a single 2U cabinet.
  • Storage Heavy Configuration (2U/machine): Two hex-core CPUs, 48-96GB memory, and 16-24 disk drives (2TB – 4TB). This configuration will cause high network traffic in case of multiple node/rack failures.
  • Compute Intensive Configuration (2U/machine): Two hex-core CPUs, 64-512GB memory, and 4-8 disk drives (1TB or 2TB)
(Note that Cloudera expects to adopt 2×8, 2×10, and 2×12 core configurations as they arrive.)
The following diagram shows how a machine should be configured according to workload:

Other Considerations

It is important to remember that the Hadoop ecosystem is designed with a parallel environment in mind. When purchasing processors, we do not recommend getting the highest-GHz chips, which draw high watts (130+). This will cause two problems: higher consumption of power and greater heat expulsion. The mid-range models tend to offer the best bang for the buck in terms of GHz, price, and core count.
When we encounter applications that produce large amounts of intermediate data — outputting data on the same order as the amount read in — we recommend two ports on a single Ethernet card or two channel-bonded Ethernet cards to provide 2 Gbps per machine. Bonded 2Gbps is tolerable for up to about 12TB of data per node. Once you move above 12TB, you will want to move to bonded 4Gbps (4x1Gbps). Alternatively, for customers that have already moved to 10 Gigabit Ethernet or InfiniBand, these solutions can be used to address network-bound workloads. Confirm that your operating system and BIOS are compatible if you’re considering switching to 10 Gigabit Ethernet.
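The per-node thresholds above amount to a simple lookup, sketched below in Python. The cut-off values come directly from the preceding paragraph; the 10GbE/InfiniBand branch applies only when that gear is already in place and compatibility has been confirmed.

  # Encoding of the per-node network sizing guidance above.
  def recommend_network(storage_tb_per_node, has_10gbe_or_infiniband=False):
      if has_10gbe_or_infiniband:
          return "10 Gigabit Ethernet / InfiniBand (confirm OS and BIOS compatibility)"
      if storage_tb_per_node <= 12:
          return "bonded 2Gbps (2 x 1Gbps)"
      return "bonded 4Gbps (4 x 1Gbps)"

  print(recommend_network(8))   # bonded 2Gbps (2 x 1Gbps)
  print(recommend_network(20))  # bonded 4Gbps (4 x 1Gbps)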
When computing memory requirements, remember that Java uses up to 10 percent of it for managing the virtual machine. We recommend configuring Hadoop to use strict heap size restrictions in order to avoid memory swapping to disk. Swapping greatly impacts MapReduce job performance and can be avoided by configuring machines with more RAM, as well as setting appropriate kernel settings on most Linux distributions.
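One way to make those strict heap limits concrete is to budget heaps so that they can never exceed physical RAM. The sketch below does that arithmetic; the OS reserve, daemon heap allowance, and 10 percent JVM overhead factor are illustrative assumptions, not CDH defaults.

  # Illustrative heap budget for a DataNode/TaskTracker machine.
  # The reserve figures below are example values, not CDH defaults.
  def max_task_heap_gb(total_ram_gb, task_slots,
                       os_reserve_gb=4, daemon_heaps_gb=2, jvm_overhead=0.10):
      """Largest per-task heap that keeps the node out of swap, leaving room
      for the OS, the DataNode/TaskTracker daemons, and ~10% JVM overhead."""
      available = total_ram_gb - os_reserve_gb - daemon_heaps_gb
      return (available / task_slots) / (1 + jvm_overhead)

  # Example: a 64GB node running 16 concurrent task slots.
  print(round(max_task_heap_gb(64, 16), 2))  # roughly 3.3GB of heap per task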
It is also important to optimize RAM for the memory channel width. For example, when using dual-channel memory, each machine should be configured with pairs of DIMMs. With triple-channel memory, each machine should have triplets of DIMMs. Similarly, quad-channel DIMMs should be installed in groups of four.

Beyond MapReduce

Hadoop is far bigger than HDFS and MapReduce; it’s an all-encompassing data platform. For that reason, CDH includes many different ecosystem products (and, in fact, is rarely used solely for MapReduce). Additional software components to consider when sizing your cluster include Apache HBase, Cloudera Impala, and Cloudera Search. They should all be run alongside the DataNode process to maintain data locality.

Focusing on resource management will be your key to success.

HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access. Cloudera Search solves the need for full-text search on content stored in CDH, which not only simplifies access for new types of users but also opens the door for new types of data storage inside Hadoop. Cloudera Search is based on Apache Lucene/Solr Cloud and Apache Tika and extends valuable functionality and flexibility for search through its wider integration with CDH. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and HBase without requiring data movement or transformation.
HBase users should be aware of heap-size limits due to garbage collector (GC) timeouts. Other JVM column stores also face this issue. Thus, we recommend a maximum of ~16GB heap per Region Server. HBase does not require too many other resources to run on top of Hadoop, but to maintain real-time SLAs you should use schedulers such as fair and capacity along with Linux Cgroups.
Impala uses memory for most of its functionality, consuming up to 80 percent of available RAM resources under default configurations, so we recommend at least 96GB of RAM per node. Users that run Impala alongside MapReduce should consult our recommendations in “Configuring Impala and MapReduce for Multi-tenant Performance.” It is also possible to specify a per-process or per-query memory limit for Impala.
Search is the most interesting component to size. The recommended sizing exercise is to purchase one node, install Solr and Lucene, and load your documents. Once the documents are indexed and can be searched in the desired manner, scalability comes into play. Keep loading documents until the indexing and query latency exceed the values required by the project; this will give you a baseline for the maximum number of documents per node based on available resources, and a baseline node count not including the desired replication factor.

Conclusions

Purchasing appropriate hardware for a Hadoop cluster requires benchmarking and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous, and Cloudera recommends deploying initial hardware with balanced specifications when getting started. It is important to remember that when using multiple ecosystem components, resource usage will vary, and focusing on resource management will be your key to success.
I would like to thank Kevin O’Dell, a Systems Engineer at Cloudera. This article was originally published on the Cloudera Blog.

Hadoop 2.0 makes Big Data more useful

It took a little longer than expected, but the Apache Software Foundation has announced the general availability of Apache Hadoop 2.0 five days ago, which will ultimately be an elephant-sized step in how Hadoop is used for managing big data collections.

The biggest change in Apache Hadoop 2.2.0, the first generally available version of the 2.x series, is the update of the MapReduce framework to Apache YARN, also known as MapReduce 2.0. MapReduce is a big feature in Hadoop—the batch processor that lines up search jobs that go into the Hadoop distributed file system (HDFS) to pull out useful information. In the previous version of MapReduce, jobs could only be done one at a time, in batches, because that's how the Java-based MapReduce tool worked. With the update, MapReduce 2.0 enables multiple search tools to hit the data within the HDFS storage system at the same time.
The new YARN/MapReduce 2.0 architecture.
What YARN does is divide the functionality of MapReduce even further, breaking the two major responsibilities of the MapReduce JobTracker component—resource management and job scheduling/monitoring—into separate applications: a global ResourceManager and per-application ApplicationMaster.

Splitting up these functions provides a more powerful way to manage a Hadoop cluster's resources than the current MapReduce systems can handle. It manages resources similar to the way an operating system handles jobs, which means no more one-at-a-time limitation.

With MapReduce 2.0, developers can now build apps directly within Hadoop, instead of bolting them on from the outside, as many third-party vendor tools have had to do in Hadoop 1.0. This will essentially establish Hadoop 2.0 as a platform on which developers can create applications that search for and manipulate data far more efficiently.
While YARN is the biggest change in the new version of Hadoop, there are some nice changes on the HDFS side of Hadoop, too, including high availability for HDFS, HDFS snapshots, and support for the NFSv3 filesystem to access data in HDFS, if need be.
Also, Hadoop 2.2 is now officially supported on Microsoft Windows, which will no doubt stir up interest from companies committed to Microsoft-only platforms.
There will no doubt be growing pains with Hadoop as companies migrate to the new release, but the fundamental changes to the MapReduce framework will mean even more usefulness for Hadoop in big-data scenarios moving forward. Expect a lot of new tools that will capitalize on the new capabilities in YARN, and soon.

Apache Hadoop 2 is Here and Will Transform the Ecosystem

The release of Apache Hadoop 2, as announced today by the Apache Software Foundation, is an exciting one for the entire Hadoop ecosystem.
Cloudera engineers have been working hard for many months with the rest of the vast Hadoop community to ensure that Hadoop 2 is the best it can possibly be, for the users of Cloudera’s platform as well as all Hadoop users generally. Hadoop 2 contains many major advances, including (but not limited to):
  • High availability for the HDFS NameNode, which eliminates the previous SPOF in HDFS.
  • Support for filesystem snapshots in HDFS, which brings native backup and disaster recovery processes to Hadoop.
  • Support for federated NameNodes, which allows for horizontal scaling of the filesystem namespace.
  • Support for NFS access to HDFS, which allows HDFS to be mounted as a standard filesystem.
  • Native network encryption, which secures data while in transit.
  • The YARN resource management system, which provides infrastructure for the creation of new Hadoop computing paradigms beyond MapReduce. This new flexibility will serve to expand the use cases for Hadoop, as well as improve the efficiency of certain types of processing over data already stored there.
  • Several performance-related enhancements, including more efficient (and secure) short-circuit local reads in HDFS.
Furthermore, a great deal of work has gone into stabilizing and maturing Hadoop’s APIs in preparation for this release, which should give all users and projects building on top of Hadoop confidence that what they’re creating today will work for years to come.
As for CDH, Cloudera’s distribution including Hadoop and related projects has already delivered several stable, high-value parts of Hadoop 2 in the current release (such as HDFS 2.0, network encryption, and performance improvements), and the next release (CDH 5) will be based entirely on Hadoop 2 — using YARN for resource coordination between MapReduce and other components.

What is Map/Reduce?

Map/Reduce is for huge data sets that have to be indexed, categorized, sorted, culled, analyzed, etc.  It can take a very long time to look through each record or file in a serial environment.  Map/Reduce allows data to be distributed across a large cluster, and can distribute out tasks across the data set to work on pieces of it independently, and in parallel.  This allows big data to be processed in relatively little time.

Laundromat analogy of Map/Reduce

Imagine that your data is laundry.  You wash this laundry by similar colors.  Then you dry this laundry by similar material (denims, towels, shirts, etc.)

Serial Operation:

Map/Reduce Operation:


Word Count example of Map/Reduce
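Below is a minimal, self-contained Python sketch of the classic word count expressed in MapReduce terms: the map step emits (word, 1) pairs, a shuffle groups them by word, and the reduce step sums each group. The input lines are hard-coded for illustration; a real Hadoop job would express the same logic as distributed map and reduce tasks reading from HDFS.

  # Local simulation of word count in MapReduce style.
  from collections import defaultdict

  lines = ["the quick brown fox", "the lazy dog", "the quick dog"]  # example input

  # Map: emit (word, 1) for every word in every line.
  mapped = [(word, 1) for line in lines for word in line.split()]

  # Shuffle: group the intermediate pairs by key (word).
  grouped = defaultdict(list)
  for word, one in mapped:
      grouped[word].append(one)

  # Reduce: sum the counts for each word.
  word_counts = {word: sum(ones) for word, ones in grouped.items()}
  print(word_counts)  # e.g. {'the': 3, 'quick': 2, 'dog': 2, ...}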


Other Potential uses of Map/Reduce

Since Map/Reduce takes a large data set and breaks it down into smaller data sets that can be processed independently, here are some potential uses:
  • indexing large data sets in a database
  • image recognition in large images
  • processing geographic information system (GIS) data - combining vector data with point data (Kerr, 2009)
  • analyzing unstructured data
  • analyzing stock data
  • machine learning tasks

Apache Hadoop Basics

“Big Data” 

A “big” shift is occurring. Today, the enterprise collects more data than ever before, from a wide variety of sources and in a wide variety of formats. Along with traditional transactional and analytics data stores, we now collect additional data across social media activity, web server log files, financial transactions and sensor data from equipment in the field. Deriving meaning from all this data using conventional database technology and analytics tools would be impractical.  A new set of technologies has enabled this shift.  
Now an extremely popular term, “big data” technology seeks to transform all this raw data into meaningful and actionable insights for the enterprise. In fact, big data is about more than just the “bigness” of the data. Its key characteristics, coined by industry analysts, are the “Three V’s”: volume (size), velocity (speed) and variety (type). While these terms are effective descriptors, big data also needs to provide value, and we often find that value at the intersection of transactions, interactions and observations.
Each of these data sources introduces its own set of complexities to data processing; combined, they can easily push a given application or data set beyond the pale of traditional data processing tools, methods or systems. In these cases, a new approach is required. Big data represents just that.
At its core then, big data is the collection, management and manipulation of data using a new generation of technologies. Driven by the growing importance of data to the enterprise, big data techniques have become essential to enterprise competitiveness.

Apache Hadoop: Platform for Big Data  

Apache Hadoop is a platform that offers an efficient and effective method for storing and processing massive amounts of data. Unlike traditional offerings, Hadoop was designed and built from the ground up to address the requirements and challenges of big data. Hadoop is powerful in its ability to allow businesses to stop worrying about building big-data-capable infrastructure and to focus on what really matters: extracting business value from the data.  
Apache Hadoop use cases are many, and show up in many industries, including: risk, fraud and portfolio analysis in financial services; behavior analysis and personalization in retail; social network, relationship and sentiment analysis for marketing; drug interaction modeling and genome data processing in healthcare and life sciences… to name a few.  
The first step in harnessing the power of Hadoop for your business is to understand the basics: what Hadoop is, where it comes from, how it can be applied to your business processes, and how to get started using it.

The Early Days of Hadoop at Yahoo! 

Hadoop has its roots at Yahoo!, whose Internet search engine business required the continuous processing of large amounts of Web page data. In 2005 Eric Baldeschwieler (aka “E14” and Hortonworks CTO) challenged Owen O’Malley (Hortonworks co-founder) and several others to solve a really hard problem: store and process the data on the internet in a simple, scalable and economically feasible way.  They looked at traditional storage approaches but quickly realized they just weren’t going to work for the type of data (much of it unstructured) and the sheer quantity Yahoo! would have to deal with.  
The team designed and prototyped a new framework for Yahoo! Search. This framework, called Dreadnaught, was implemented in the C++ programming language and modeled after research published by Google describing its Google File System and a distributed processing algorithm called MapReduce.
At the same time, an open-source search engine project called Nutch, led by Doug Cutting and Mike Cafarella, had implemented similar functions using Java. By the end of 2005, Doug and Mike had the Java version working on a small cluster of 20 nodes. Impressed by their progress, Yahoo! hired Cutting in January 2006 and merged the two teams together, choosing the Java version of MapReduce because it had some of the search functions already built out. With an expanded team including Owen O’Malley and Arun Murthy, the code base of this distributed processing system tripled by mid-year. Soon, a 200-node research cluster was up and running and the first real users were getting started. Apache Hadoop had liftoff.

The Power of Open Source 

Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF). As part of the ASF, development of Hadoop will remain open and transparent for all future users and, more importantly, freely available.
Eric and team realized that with a community of like-minded individuals, Hadoop would innovate far faster.  At the same time, they’d enable other organizations to realize some of the same benefits that they were starting to see from their early efforts.  When organizations such as Facebook, LinkedIn, eBay, Powerset, Quantcast and others began picking up Hadoop and innovating in areas beyond the initial focus, it reinforced the fact that the choice of community driven open source was the right one. 
A case in point: when a small startup (Powerset) started working on a project to support tables on HDFS, inspired by Google’s BigTable paper, that effort turned into what’s now Apache HBase! Further, Facebook started an effort to build a SQL layer on top of MapReduce, which became Apache Hive!
It has proven time and again when it comes to platform technologies like Hadoop that community-driven open source will always outpace the innovation of a single group of people or single company. 

Commercial Adoption 

Apache Hadoop became a foundational technology at Yahoo, underlying a wide range of business-critical applications.  Companies in nearly every vertical started to adopt Hadoop.  By 2010, the community had grown to thousands of users and broad enterprise momentum had been established.
After many years architecting and operating the Hadoop infrastructure at Yahoo! and contributing heavily to the open source community, E14 and 20+ Hadoop architects and engineers spun out of Yahoo! to form Hortonworks in 2011.  Having seen what it could do for Yahoo, Facebook, eBay, LinkedIn and others, their singular objective is to focus on making Apache Hadoop into a platform that is easy to use and consume by the broader market of enterprise customers and partners.  

Understanding Hadoop and Related Services 

At its core, Apache Hadoop is a framework for scalable and reliable distributed data storage and processing. It allows for the processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, aggregating the local computation and storage from each server. Rather than relying on expensive hardware, the Hadoop software detects and handles any failures that occur, allowing it to achieve high availability on top of inexpensive commodity computers, each individually prone to failure.  
At the core of Apache Hadoop are the Hadoop Distributed File System, or HDFS, and Hadoop MapReduce, which provides a framework for distributed processing.   

HDFS 

The Hadoop Distributed File System is a scalable and reliable distributed storage system that aggregates the storage of every node in a Hadoop cluster into a single global file system. HDFS stores individual files in large blocks, allowing it to efficiently store very large or numerous files across multiple machines and access individual chunks of data in parallel, without needing to read the entire file into a single computer’s memory. Reliability is achieved by replicating the data across multiple hosts, with each block of data being stored, by default, on three separate computers. If an individual node fails, the data remains available and an additional copy of any blocks it holds may be made on new machines to protect against future failures.  
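As a quick illustration of what the block-and-replica model means for capacity, the sketch below splits a file into blocks and tallies the raw storage its replicas consume. The 128MB block size and the 10GB file are example figures; the replication factor of three is the default described above.

  import math

  # Illustrative HDFS block/replica arithmetic (128MB block size assumed).
  def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
      num_blocks = math.ceil(file_size_mb / block_size_mb)
      # Upper bound: the final block of a file only occupies its actual length.
      raw_storage_mb = num_blocks * block_size_mb * replication
      return num_blocks, raw_storage_mb

  blocks, raw_mb = hdfs_footprint(10 * 1024)  # a 10GB file
  print(blocks, "blocks,", raw_mb / 1024, "GB of raw storage at 3x replication")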
This approach allows HDFS to dependably store massive amounts of data. For instance, in late 2012, the Apache Hadoop clusters at Yahoo had grown to hold over 350 petabytes (PB) of data across 40,000+ servers. Once data has been loaded into HDFS, we can begin to process it with MapReduce. 

MapReduce 

MapReduce is the programming model that allows Hadoop to efficiently process large amounts of data. MapReduce breaks large data processing problems into multiple steps, namely a set of Maps and Reduces, that can each be worked on at the same time (in parallel) on multiple computers.  
MapReduce is designed to work with HDFS. Apache Hadoop automatically optimizes the execution of MapReduce programs so that a given Map or Reduce step is run on the HDFS node that locally contains the blocks of data required to complete the step. Explaining the MapReduce algorithm in a few words can be difficult, but we provide an example in Appendix A for the curious or technically inclined. For the rest, suffice it to say that MapReduce has proven itself in its ability to allow data processing problems that once required many hours to complete on very expensive computers to be written as programs that run in minutes on a handful of rather inexpensive machines. And, while MapReduce can require a shift in thinking on the part of developers, many problems not traditionally solved this way are easily expressed as MapReduce programs.

Hadoop 2.0: YARN 

As previously noted, Hadoop was initially adopted by many of the large web properties and was designed to meet their needs for large web-scale batch type processing.  As clusters grew and adoption expanded, so did the number of ways that users wanted to interact with the data stored in Hadoop. As with any successful open-source project, the broader ecosystem of Hadoop users responded by contributing additional capabilities to the Hadoop community, with some of the most popular examples being Apache Hive for SQL-based querying, Apache Pig for scripted data processing and Apache HBase as a NoSQL database.  
These additional open source projects opened the door for a much richer set of applications to be built on top of Hadoop – but they didn’t really address the design limitations inherent in Hadoop; specifically, that it was designed as a single application system with MapReduce at the core (i.e. batch-oriented data processing).  Today, enterprise applications want to interact with Hadoop in a host of different ways: batch, interactive, analyzing data streams as they arrive, and more.  And most importantly, they need to be able to do this all simultaneously without any single application or query consuming all of the resources of the cluster to do so. Enter YARN. 
YARN is a key piece of Hadoop version 2, which is currently under development in the community. It provides a resource management framework for any processing engine, including MapReduce. It allows new processing frameworks to use the power of the distributed scale of Hadoop. It transforms Hadoop from a single-application to a multi-application data system. It is the future of Hadoop.

The Hadoop Project Ecosystem 

As Apache Hadoop has matured, a number of important tools have been built to support it. These tools may be categorized into Hadoop Data Services and Hadoop Operational Services based on the functionality they offer.  

Hadoop Data Services

Hadoop Data Services are tools that allow users to more easily manipulate and process data. They include the following:
  • Apache Hive — Data warehouse infrastructure built on top of Hadoop for providing data summarization, ad-hoc query, and the analysis of large datasets. It provides a mechanism for imparting structure onto the data in Hadoop and for querying that data using a SQL-like language called HiveQL (HQL). Hive eases integration between Hadoop and various business intelligence and visualization tools.
  • Apache Pig — Allows you to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig Latin defines a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin into Hadoop MapReduce so that it can be executed within HDFS.
  • Apache HCatalog — A table and storage management layer for Hadoop that enables users with different data processing tools – including Hive, Pig and MapReduce – to more easily read and write data. HCatalog’s table abstraction presents users with a relational view of data in HDFS and ensures that users need not worry about where or in what format their data is stored.
  • Apache HBase — A non-relational database that runs on top of HDFS. It provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
  • Apache Sqoop — A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop imports data from external sources either directly into HDFS or into systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external repositories such as relational databases and data warehouses.
  • Apache Flume — A service for efficiently collecting, aggregating, and moving large amounts of log data. Flume allows log data from many different sources, such as web servers, to be easily stored in a centralized HDFS repository.

Hadoop Operational Services

Distributed computing at scale can present significant operational challenges. Several projects have emerged to aid in the operations and management of a Hadoop cluster. These include:
  • Apache Ambari — Provides an intuitive set of tools to monitor, manage and efficiently provision an Apache Hadoop cluster. Ambari simplifies the operation and hides the complexity of Hadoop, making Hadoop work as a single, cohesive data platform.
  • Apache Oozie — A Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop.
  • Apache ZooKeeper — Provides operational services for a Hadoop cluster, including a distributed configuration service, a synchronization service and a naming registry. These services allow Hadoop’s distributed processes to more easily coordinate with one another.

Hadoop Distributions 

Apache Hadoop is open source.  Any interested party can visit the Hadoop web site to download the project’s source code or binaries directly from the ASF. Downloading directly, however, presents a number of challenges to enterprises that want to get started quickly. Modern Hadoop deployments require not only MapReduce and HDFS, but many—often all—of the supporting projects already discussed, along with a variety of other open source software. Assuming the enterprise even knows which software is required, this diversity adds complexity to the process of obtaining, installing and maintaining an Apache Hadoop deployment. Each of the aforementioned projects is developed independently, with its own release cycle and versioning scheme. Not all versions of all projects are API compatible, so care must be taken to choose versions of each project that are known to work together.  Achieving production-level stability for a given set of projects presents even more complexity. To ensure stability, an enterprise must attempt to determine which versions of which projects have been tested together, by whom, and under what conditions—information that is not often readily available.  
How then can an enterprise ensure that its first steps with Hadoop go smoothly, that the software they download is internally compatible and completely stable, and that support will be available when needed? The answer to these questions is a Hadoop Distribution, which provides an integrated, pre-packaged bundle of software that includes all required Hadoop components, related projects, and supporting software. Enterprise-focused Hadoop distributions integrate pre-certified and pre-tested versions of Apache Hadoop and related projects and make them available in a single easily installed and managed download.

The Future of Hadoop 

Hadoop has “crossed the chasm” from a framework for early adopters, developers and technology enthusiasts to a strategic data platform embraced by innovative CTOs and CIOs across mainstream enterprises. These people, who want to improve the performance of their companies and unlock new business opportunities, realize that including Apache Hadoop as a deeply integrated supplement to their current data architecture offers the fastest path to reaching their goals while maximizing their existing investments.

Appendix A – MapReduce Example 

As you might guess from the name MapReduce, the first of these steps is the Map; the next is the Reduce. To understand how Map and Reduce work, consider the following example, based on an extended explanation by Steve Krenzel.
Let’s say a social media site wants to calculate every member's common friends once a day and store those results. As is typical for inputs to and outputs from MapReduce tasks, friends are stored as a key-value pair expressed in the form Member->[List of Friends]: 
A -> B C D
B -> A C D E
C -> A B D E
…
The first step in calculating the members’ common friends using MapReduce is the Map, in which each of the friend lists above gets mapped to a new key-value pair. In this case, the mappers sort and group the input so that a new key is output for every combination of two members, whose value is simply a list of the first member’s friends followed by a list of the second member’s friends. In our example, (A B) forms a new key whose value is a list of A’s friends (BCD) and B’s friends (ACDE):  
(A B) -> (B C D) (A C D E)
…
All these intermediate key-value pairs are then sent to the Reduce step, which simply outputs for each key any friends in both members’ friend lists; in this case resulting in:  
(A B) : (C D)
…
which says that members A and B have C and D as common friends. Without MapReduce, this process would require a slow, serial process of comparing A’s friends with B’s friends, A’s friends with C’s friends, A’s friends with D’s friends, etc., for every one of A’s friends. And so on, for every member of the social media site. Easy when you’re just starting out and have five members, but the traditional approach quickly becomes untenable when membership gets into the millions.
With MapReduce and HDFS, the map and reduce tasks are distributed across the Hadoop cluster and many tasks run simultaneously across the cluster. Because of the data locality property of Hadoop, the map tasks each run on a node that contains the blocks of input data (lines in the Friends List file) that they will operate on. 
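For readers who prefer code, here is a small local Python rendering of the same common-friends computation, mirroring the map, shuffle and reduce steps described above. The friend lists for members D and E are filled in only to make the toy data set complete; on a real cluster the same logic would run as distributed map and reduce tasks.

  # Local MapReduce-style sketch of the common-friends example.
  from collections import defaultdict

  friends = {
      "A": ["B", "C", "D"],
      "B": ["A", "C", "D", "E"],
      "C": ["A", "B", "D", "E"],
      "D": ["A", "B", "C", "E"],   # D's and E's lists are illustrative additions
      "E": ["B", "C", "D"],
  }

  # Map: for each member, emit their friend list keyed by every (member, friend) pair.
  mapped = []
  for member, friend_list in friends.items():
      for friend in friend_list:
          pair = tuple(sorted((member, friend)))
          mapped.append((pair, set(friend_list)))

  # Shuffle: group the two friend lists that share the same pair key.
  grouped = defaultdict(list)
  for pair, friend_set in mapped:
      grouped[pair].append(friend_set)

  # Reduce: intersect the grouped lists to find each pair's common friends.
  common = {pair: sorted(set.intersection(*sets)) for pair, sets in grouped.items()}
  print(common[("A", "B")])  # ['C', 'D'], matching the worked example above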
