Hadoop
Hadoop is an open-source software platform designed to store and process quantities of data that are too large for any single device or server. The strength of Hadoop lies in its ability to scale across thousands of commodity servers that don’t share memory or disks.
Hadoop can be thought of as an ecosystem. Its two key functional components are:
Hadoop Distributed File System (HDFS)
It’s a scalable file system that distributes and stores data across all machines in a Hadoop cluster (a group of servers). Each HDFS cluster contains a NameNode, which manages the filesystem metadata, and DataNodes, which store the actual blocks of data.
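To make this concrete, here is a minimal sketch of writing a file to HDFS through Hadoop’s Java FileSystem API; the path and file contents are hypothetical, and the Configuration object picks up the cluster settings (core-site.xml, hdfs-site.xml) from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads cluster config from the classpath
    FileSystem fs = FileSystem.get(conf);          // handle to the cluster-wide file system
    Path file = new Path("/user/demo/hello.txt");  // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello, HDFS");                 // content is split into blocks by HDFS
    }
    // Blocks are replicated across DataNodes; the NameNode tracks where they live.
    System.out.println("Replication factor: " + fs.getFileStatus(file).getReplication());
  }
}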
MapReduce
The processing system. Its strength lies in its ability to divide a single large data processing job into smaller tasks. MapReduce jobs are written in Java, but other languages can be used via the Hadoop Streaming API, a utility that comes with Hadoop.
Once the tasks have been created, their delegation is handled by two “daemons”: the JobTracker, which assigns tasks to worker nodes, and the TaskTracker, which runs them and reports back.
Data locality: an important concept with HDFS and MapReduce. Rather than moving large volumes of data across the network to the computation, Hadoop brings the computation to the nodes where the data is stored.
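To illustrate how a job is divided into map and reduce tasks, here is the canonical WordCount example, closely following the version in the Hadoop documentation: mappers emit (word, 1) pairs for their local block of input, and reducers sum the counts per word. The input and output paths are passed as (hypothetical) command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: runs on the node holding the data block, emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregates on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}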
Yet Another Resource Negotiator (YARN)
YARN is an updated way of handling the delegation of resources for MapReduce jobs. It takes the place of the JobTracker and TaskTracker. If the JobTracker and TaskTracker can be thought of as the foreman, YARN is a foreman with an MBA - it’s a more advanced way of carrying out MapReduce jobs.
It also adds capabilities, such as the ability to work with frameworks other than MapReduce and to run jobs developed in languages other than Java.
HBase
HBase is a column-oriented, non-relational, distributed database that is built on top of Hadoop and runs on HDFS. The key difference between MapReduce and HBase is that HBase is intended for random-access workloads.
For example, if you have regular files that need to be processed, MapReduce works just fine. But if you have a table that is a petabyte in size and you need to process a single row from a random location within this table, you would use HBase. Another benefit of HBase is the extremely low latency, or time delay, it provides.
However, HBase and MapReduce are not mutually exclusive. You can often run them together.
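As a brief sketch of that random-access pattern, the following uses the HBase Java client to fetch a single row by key; the events table, the d column family, and the row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {  // hypothetical table
      // Fetch exactly one row by its key - no scan of the rest of the table.
      Get get = new Get(Bytes.toBytes("row-0042"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
      System.out.println(Bytes.toString(value));
    }
  }
}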
Hive
Hive is a data warehouse system for Hadoop. Hive allows users who aren’t familiar with programming to access and analyze big data using a SQL-like syntax called Hive Query Language (HiveQL). In general, Hive is used for complex, long-running tasks and analyses on large sets of data.
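As a minimal sketch, a HiveQL query can be submitted from Java through the HiveServer2 JDBC driver; the server host, credentials, and the weblogs table below are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host:10000/default", "analyst", "");  // hypothetical host/user
         Statement stmt = conn.createStatement();
         // HiveQL reads like SQL; Hive compiles it into jobs that run on the cluster.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}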
Impala
Impala also uses SQL syntax instead of a programming language. The difference between Hive and Impala is speed. Impala is used for analyses that you want to run and return quickly on a small subset of your data, e.g. analyzing company finances for a daily or weekly report. Both Hive and Impala work best on more structured or processed data; neither is ideal if you are still in the process of data preparation and complex data manipulation.
Pig
Pig’s language is called Pig Latin; it allows you to extract, transform and load (ETL) data at a very high level, meaning something that would require several hundred lines of Java code can be expressed in far fewer lines of Pig.
While Hive and Impala require data to be more structured in order to be analyzed, Pig allows you to work with unstructured data. In other words, while Hive and Impala are essentially query engines used for more straightforward analysis, Pig’s ETL capability means it can perform “grunt work” on unstructured data, cleaning it up and organizing it so that queries can be run against it.
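As a rough sketch of that compactness, the following embeds a two-line Pig Latin ETL script in Java via Pig’s PigServer API; the access.log input, its layout, and the error-filtering logic are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlExample {
  public static void main(String[] args) throws Exception {
    // ExecType.LOCAL runs in-process; use ExecType.MAPREDUCE on a real cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Load raw, space-separated log lines and impose a schema on the fly.
    pig.registerQuery(
        "logs = LOAD 'access.log' USING PigStorage(' ') AS (host:chararray, code:int);");
    // Keep only server-error lines.
    pig.registerQuery("errors = FILTER logs BY code >= 500;");
    // Pig builds a plan lazily; store() triggers the actual execution.
    pig.store("errors", "errors_out");
  }
}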
Spark
Spark is an alternative way to perform the type of batch-oriented processing that MapReduce does. Batch-oriented means that it will take a certain amount of time for a result to be returned, as opposed to returning it in real-time.
While MapReduce jobs use data that has been replicated and stored on disk within a cluster, Spark allows you to leverage the memory space on servers, performing in-memory computing, which allows for real-time data processing that is up to 100 times faster than MapReduce in some instances.
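As a sketch of that in-memory style, here is a word count using Spark’s Java RDD API; the HDFS paths are hypothetical, and the job would be launched with spark-submit. The cache() call is what keeps the dataset in memory across operations, which is where Spark’s speed advantage over on-disk MapReduce comes from.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");  // master set by spark-submit
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");  // hypothetical path
    lines.cache();  // keep the dataset in memory for reuse
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);
    counts.saveAsTextFile("hdfs:///user/demo/output");  // hypothetical path
    sc.close();
  }
}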
Hadoop Common
Contains the Java libraries and utilities needed by the other Hadoop modules. These libraries provide filesystem- and OS-level abstractions and comprise the essential Java files and scripts required to start Hadoop.
Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services, all of which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
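As a minimal sketch of the configuration-and-naming role, the following uses ZooKeeper’s Java client to store a small piece of shared configuration at a znode and read it back; the ensemble address, znode path, and value are hypothetical.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (hypothetical host:port; no-op watcher).
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {});
    // Store a small piece of shared configuration at a znode.
    zk.create("/config-demo", "batch.size=64".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Any process in the cluster can now read the same value by path.
    byte[] data = zk.getData("/config-demo", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}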
Mahout
Mahout is a scalable machine learning library that implements various approaches to machine learning. At present, Mahout contains four main groups of algorithms (a small example of the first group follows the list):
- Recommendations, also known as collaborative filtering
- Classification, also known as categorization
- Clustering
- Frequent itemset mining, also known as parallel frequent pattern mining
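As a small sketch of the recommendations group, here is a user-based recommender built with Mahout’s Taste API; the ratings.csv file (userID,itemID,preference rows), the neighborhood size, and the user ID are hypothetical.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv holds userID,itemID,preference rows (hypothetical file).
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Compare each user against their 10 most similar neighbors.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 recommendations for (hypothetical) user 42.
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " score=" + item.getValue());
    }
  }
}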
Sqoop (SQL-to-Hadoop)
Sqoop is a tool designed for efficiently transferring structured data from SQL Server and SQL Azure to HDFS, where it can then be used in MapReduce and Hive jobs. One can even use Sqoop to move data from HDFS to SQL Server.
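As a sketch, Sqoop imports are normally launched from the command line, but the same arguments can be passed programmatically, assuming Sqoop 1’s org.apache.sqoop.Sqoop.runTool entry point; the connection string, credentials, table, and target directory below are all hypothetical.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Equivalent to running `sqoop import ...` on the command line.
    int exitCode = Sqoop.runTool(new String[] {
        "import",
        "--connect", "jdbc:sqlserver://db-host:1433;databaseName=sales",  // hypothetical server
        "--username", "loader",
        "--password", "secret",
        "--table", "orders",                 // source table in SQL Server
        "--target-dir", "/user/demo/orders"  // destination directory in HDFS
    });
    System.exit(exitCode);
  }
}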