Hadoop
Hadoop is an open-source software platform designed to store and process quantities of data that are too large for any single device or server. The strength of Hadoop lies in its ability to scale across thousands of commodity servers that don’t share memory or disks.
Hadoop can be thought of as an ecosystem. Its two key functional components are:
Hadoop Distributed File System (HDFS)
It’s a scalable file system that distributes and stores data across all machines in a Hadoop cluster (a group of servers). Each HDFS cluster contains a NameNode, which manages the filesystem metadata, and DataNodes, which store the actual blocks of data.
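To make this concrete, here is a minimal sketch of writing a file to HDFS through Hadoop’s Java FileSystem API; the path and file contents are hypothetical, and the Configuration object picks up the cluster settings (core-site.xml, hdfs-site.xml) from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads cluster config from the classpath
    FileSystem fs = FileSystem.get(conf);          // handle to the cluster-wide file system
    Path file = new Path("/user/demo/hello.txt");  // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello, HDFS");                 // content is split into blocks by HDFS
    }
    // Blocks are replicated across DataNodes; the NameNode tracks where they live.
    System.out.println("Replication factor: " + fs.getFileStatus(file).getReplication());
  }
}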
MapReduce
The processing system. Its strength lies in its ability to divide a single large data processing job into smaller tasks. MapReduce jobs are written in Java, but other languages can be used via the Hadoop Streaming API, a utility that comes with Hadoop.
Once the tasks have been created, their delegation is handled by two “daemons”: the JobTracker, which assigns tasks to worker nodes, and the TaskTracker, which runs them and reports back.
Data locality: an important concept with HDFS and MapReduce. Rather than moving large volumes of data across the network to the computation, Hadoop brings the computation to the nodes where the data is stored.
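To illustrate how a job is divided into map and reduce tasks, here is the canonical WordCount example, closely following the version in the Hadoop documentation: mappers emit (word, 1) pairs for their local block of input, and reducers sum the counts per word. The input and output paths are passed as (hypothetical) command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: runs on the node holding the data block, emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregates on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}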
Yet Another Resource Negotiator (YARN)
YARN is an updated way of handling the delegation of resources for MapReduce jobs. It takes the place of the JobTracker and TaskTracker. If the JobTracker and TaskTracker can be thought of as the foreman, YARN is a foreman with an MBA - it’s a more advanced way of carrying out MapReduce jobs.
It also adds capabilities, such as the ability to work with frameworks other than MapReduce and to run jobs developed in languages other than Java.
HBase
HBase is a column-oriented, non-relational, distributed database that is built on top of Hadoop and runs on HDFS. The key difference between MapReduce and HBase is that HBase is intended for random-access workloads.
For example, if you have regular files that need to be processed, MapReduce works just fine. But if you have a table that is a petabyte in size and you need to process a single row from a random location within this table, you would use HBase. Another benefit of HBase is the extremely low latency, or time delay, it provides.
However, HBase and MapReduce are not mutually exclusive. You can often run them together.
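As a brief sketch of that random-access pattern, the following uses the HBase Java client to fetch a single row by key; the events table, the d column family, and the row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {  // hypothetical table
      // Fetch exactly one row by its key - no scan of the rest of the table.
      Get get = new Get(Bytes.toBytes("row-0042"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
      System.out.println(Bytes.toString(value));
    }
  }
}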
Hive
Hive is a data warehouse system for Hadoop. Hive allows users who aren’t familiar with programming to access and analyze big data using a SQL-like syntax called Hive Query Language (HiveQL). In general, Hive is used for complex, long-running tasks and analyses on large sets of data.
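As a minimal sketch, a HiveQL query can be submitted from Java through the HiveServer2 JDBC driver; the server host, credentials, and the weblogs table below are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host:10000/default", "analyst", "");  // hypothetical host/user
         Statement stmt = conn.createStatement();
         // HiveQL reads like SQL; Hive compiles it into jobs that run on the cluster.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}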
Impala
Impala also uses SQL syntax instead of a programming language. The difference between Hive and Impala is speed. Impala is used for analyses that you want to run and return quickly on a small subset of your data, e.g. analyzing company finances for a daily or weekly report. Both Hive and Impala work best on more structured or processed data; neither is ideal if you are still in the process of data preparation and complex data manipulation.
Pig
Pig’s language is called Pig Latin; it allows you to extract, transform and load (ETL) data at a very high level, meaning something that would require several hundred lines of Java code can be expressed in far fewer lines of Pig.
While Hive and Impala require data to be more structured in order to be analyzed, Pig allows you to work with unstructured data. In other words, while Hive and Impala are essentially query engines used for more straightforward analysis, Pig’s ETL capability means it can perform “grunt work” on unstructured data, cleaning it up and organizing it so that queries can be run against it.
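As a rough sketch of that compactness, the following embeds a two-line Pig Latin ETL script in Java via Pig’s PigServer API; the access.log input, its layout, and the error-filtering logic are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlExample {
  public static void main(String[] args) throws Exception {
    // ExecType.LOCAL runs in-process; use ExecType.MAPREDUCE on a real cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Load raw, space-separated log lines and impose a schema on the fly.
    pig.registerQuery(
        "logs = LOAD 'access.log' USING PigStorage(' ') AS (host:chararray, code:int);");
    // Keep only server-error lines.
    pig.registerQuery("errors = FILTER logs BY code >= 500;");
    // Pig builds a plan lazily; store() triggers the actual execution.
    pig.store("errors", "errors_out");
  }
}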
Spark
Spark is an alternative way to perform the type of batch-oriented processing that MapReduce does. Batch-oriented means that it will take a certain amount of time for a result to be returned, as opposed to returning it in real-time.
While MapReduce jobs use data that has been replicated and stored on disk within a cluster, Spark allows you to leverage the memory space on servers, performing in-memory computing, which allows for real-time data processing that is up to 100 times faster than MapReduce in some instances.
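As a sketch of that in-memory style, here is a word count using Spark’s Java RDD API; the HDFS paths are hypothetical, and the job would be launched with spark-submit. The cache() call is what keeps the dataset in memory across operations, which is where Spark’s speed advantage over on-disk MapReduce comes from.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");  // master set by spark-submit
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");  // hypothetical path
    lines.cache();  // keep the dataset in memory for reuse
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);
    counts.saveAsTextFile("hdfs:///user/demo/output");  // hypothetical path
    sc.close();
  }
}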
Hadoop Common
Contains the Java libraries and utilities needed by the other Hadoop modules. These libraries provide filesystem- and OS-level abstractions and comprise the essential Java files and scripts required to start Hadoop.
Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services, all of which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
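As a minimal sketch of the configuration-and-naming role, the following uses ZooKeeper’s Java client to store a small piece of shared configuration at a znode and read it back; the ensemble address, znode path, and value are hypothetical.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (hypothetical host:port; no-op watcher).
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {});
    // Store a small piece of shared configuration at a znode.
    zk.create("/config-demo", "batch.size=64".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Any process in the cluster can now read the same value by path.
    byte[] data = zk.getData("/config-demo", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}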
Mahout
Mahout is a scalable machine learning library that implements various approaches to machine learning. At present, Mahout contains four main groups of algorithms (a small example of the first group follows the list):
- Recommendations, also known as collaborative filtering
- Classification, also known as categorization
- Clustering
- Frequent itemset mining, also known as parallel frequent pattern mining
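As a small sketch of the recommendations group, here is a user-based recommender built with Mahout’s Taste API; the ratings.csv file (userID,itemID,preference rows), the neighborhood size, and the user ID are hypothetical.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv holds userID,itemID,preference rows (hypothetical file).
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Compare each user against their 10 most similar neighbors.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 recommendations for (hypothetical) user 42.
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " score=" + item.getValue());
    }
  }
}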
Sqoop (SQL-to-Hadoop)
Sqoop is a tool designed for efficiently transferring structured data from SQL Server and SQL Azure to HDFS, where it can then be used in MapReduce and Hive jobs. One can even use Sqoop to move data from HDFS to SQL Server.
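As a sketch, Sqoop imports are normally launched from the command line, but the same arguments can be passed programmatically, assuming Sqoop 1’s org.apache.sqoop.Sqoop.runTool entry point; the connection string, credentials, table, and target directory below are all hypothetical.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Equivalent to running `sqoop import ...` on the command line.
    int exitCode = Sqoop.runTool(new String[] {
        "import",
        "--connect", "jdbc:sqlserver://db-host:1433;databaseName=sales",  // hypothetical server
        "--username", "loader",
        "--password", "secret",
        "--table", "orders",                 // source table in SQL Server
        "--target-dir", "/user/demo/orders"  // destination directory in HDFS
    });
    System.exit(exitCode);
  }
}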