Goals
Install an RHadoop system to test R's ability to manage and analyze data in a Hadoop cluster.
Components
- Operating System: Ubuntu Server 14.04 LTS (HVM)
- Apache Hadoop 2.7.0 Single Node Cluster
- R and RStudio Server
- RHadoop packages: rhdfs, rmr2, plyrmr
Installation Steps:
- Install Ubuntu Server on Amazon EC2
- Set up the Ubuntu system
- Set up Apache Hadoop 2.7.0 Single Node Cluster
- Set up R
- Set up RStudio Server
- Install RHadoop packages
- Install pre-required R packages:
- "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc", "rjson", "memoise", "data.table", "rJava"
- Set up environment variables:
- Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
- Sys.setenv("LD_LIBRARY_PATH"="/usr/local/hadoop/lib/native/")
- Download RHadoop packages: https://github.com/RevolutionAnalytics/RHadoop/wiki
- Install RHadoop packages:
install.packages("<path>/rhdfs*.tar.gz", repos=NULL, type="source")
install.packages("<path>/rmr2*.tar.gz", repos=NULL, type="source")
install.packages("<path>/plyrmr*.tar.gz", repos=NULL, type="source")
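Before installing the RHadoop tarballs, it can help to confirm that the prerequisites and environment variables are in place. The following is a minimal sketch, not part of the original procedure: `missing_packages` is a hypothetical helper name, the package list and paths are copied from the steps above, and the `install.packages` call is left commented out on the assumption that you will install from whichever CRAN mirror you have configured.

```r
# Prerequisite packages, as listed in the installation steps.
required <- c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr",
              "plyr", "reshape2", "dplyr", "R.methodsS3", "caTools", "Hmisc",
              "rjson", "memoise", "data.table", "rJava")

# Hypothetical helper: which of `pkgs` are absent from `installed`?
missing_packages <- function(pkgs,
                             installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from your CRAN mirror
}

# Set the environment variables from the steps above, then report any
# path that does not exist on this machine.
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop",
           LD_LIBRARY_PATH = "/usr/local/hadoop/lib/native/")
for (var in c("HADOOP_CMD", "LD_LIBRARY_PATH")) {
  if (!file.exists(Sys.getenv(var))) {
    message(var, " points to a missing path: ", Sys.getenv(var))
  }
}
```

Running this in the RStudio Server console prints nothing when every package is installed and both paths exist.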
Testing
- Test a Hadoop MapReduce job with the bundled pi example
$ cd $HADOOP_COMMON_HOME
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 10 100

- Test R and RStudio Server
- Use a web browser to connect to RStudio Server (it listens on port 8787 by default)

- Demo with RHadoop packages
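
A demo could start with a small rmr2 job. The sketch below is illustrative only, not the original author's demo: it squares the integers 1 to 10 in a map step and reads the result back. It assumes rmr2 is installed and uses the "local" backend so that it runs without touching the cluster. `square` is a hypothetical helper introduced for the example.

```r
# Hypothetical helper used by the map step.
square <- function(v) v^2

if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)

  # Assumption: run in-process first; switch to "hadoop" to use the cluster.
  rmr.options(backend = "local")

  # Write a small vector to the (local or HDFS) file system.
  small.ints <- to.dfs(1:10)

  # Map-only job: emit each input value as the key and its square as the value.
  result <- mapreduce(input = small.ints,
                      map = function(k, v) keyval(v, square(v)))

  # Read the key-value pairs back into the R session.
  out <- from.dfs(result)
  print(values(out))
} else {
  message("rmr2 is not installed; skipping the MapReduce demo")
}
```

To run the same job on the single-node cluster, set the backend to "hadoop". In that mode rmr2 also expects the HADOOP_STREAMING environment variable to point at the Hadoop streaming jar, in addition to HADOOP_CMD.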