1 Hadoop/HDFS

1.1 Verify that Hadoop and YARN are running

Output of "hdfs dfsadmin -report"

Output of "yarn node -list"

1.2 Explore management tools

Screenshot of Datanode Information page

Screenshot of All Applications page

1.3 Load data files into HDFS

Output of putting tweet sentiment files into HDFS

You can observe from the screenshot that the file sizes reported on the local filesystem and in HDFS are the same, which suggests the files were copied over intact.

2 MapReduce

2.1 What does the console output of the MapReduce program look like?

[hadoop@localhost mapreduce_wordcount]$ time ./run.sh 
packageJobJar: [/tmp/hadoop-unjar8960290533764928209/] [] /tmp/streamjob358535730770061676.jar tmpDir=null
15/09/12 14:05:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/09/12 14:05:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/09/12 14:05:06 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/12 14:05:06 INFO mapreduce.JobSubmitter: number of splits:2
15/09/12 14:05:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442076969451_0002
15/09/12 14:05:06 INFO impl.YarnClientImpl: Submitted application application_1442076969451_0002
15/09/12 14:05:06 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1442076969451_0002/
15/09/12 14:05:06 INFO mapreduce.Job: Running job: job_1442076969451_0002
15/09/12 14:05:12 INFO mapreduce.Job: Job job_1442076969451_0002 running in uber mode : false
15/09/12 14:05:12 INFO mapreduce.Job:  map 0% reduce 0%
15/09/12 14:05:17 INFO mapreduce.Job:  map 50% reduce 0%
15/09/12 14:05:18 INFO mapreduce.Job:  map 100% reduce 0%
15/09/12 14:05:22 INFO mapreduce.Job:  map 100% reduce 100%
15/09/12 14:05:23 INFO mapreduce.Job: Job job_1442076969451_0002 completed successfully
15/09/12 14:05:23 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=269679
        FILE: Number of bytes written=866110
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=161699
        HDFS: Number of bytes written=90819
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=7882
        Total time spent by all reduces in occupied slots (ms)=2413
        Total time spent by all map tasks (ms)=7882
        Total time spent by all reduce tasks (ms)=2413
        Total vcore-seconds taken by all map tasks=7882
        Total vcore-seconds taken by all reduce tasks=2413
        Total megabyte-seconds taken by all map tasks=8071168
        Total megabyte-seconds taken by all reduce tasks=2470912
    Map-Reduce Framework
        Map input records=2001
        Map output records=27876
        Map output bytes=213921
        Map output materialized bytes=269685
        Input split bytes=230
        Combine input records=0
        Combine output records=0
        Reduce input groups=8929
        Reduce shuffle bytes=269685
        Reduce input records=27876
        Reduce output records=8929
        Spilled Records=55752
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=248
        CPU time spent (ms)=3250
        Physical memory (bytes) snapshot=686350336
        Virtual memory (bytes) snapshot=6302289920
        Total committed heap usage (bytes)=513802240
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=161469
    File Output Format Counters 
        Bytes Written=90819
15/09/12 14:05:23 INFO streaming.StreamJob: Output directory: /user/hadoop/output/outputPart2

real    0m20.019s
user    0m3.161s
sys 0m0.426s
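
The mapper and reducer scripts invoked by run.sh are not shown above. As a rough sketch, a Hadoop Streaming word count typically uses a pair of scripts along these lines (assuming Python is the streaming language; the file names and exact code below are illustrative, not the lab's actual scripts):

# mapper.py -- emit "<word>\t1" for every whitespace-separated token read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

# reducer.py -- sum counts per word; Hadoop Streaming delivers mapper output sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))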

2.2 How long did the word count program take?

The program took a little over 20 seconds of wall-clock time to execute (real 0m20.019s).

2.3 Provide a sample of the word count results

получится-    1
потестить  1
придется  1
принадлежит 1
прогу.  1
скачал 1
то  1
тупо    1
хочу    1
хренового   1
بابا    1
به  1
حسین    1
خیر.    1
سلام    1
صبح 1
نمونه   1
کارمند  1
—   1
♥   2
♪   2
♫   8
�   1
�rte... 1
��  1

2.4 How could the word count results be made more accurate?

The program could be made more accurate by stripping punctuation from each token and converting everything to lower case before counting, so that variants such as "Word", "word" and "word," are tallied as the same word.
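
For example, the mapper could normalize each token before emitting it. A minimal sketch of such a normalization step (the function and the exact rules are illustrative, not part of the lab code):

import string

def normalize(token):
    # Illustrative only: lower-case the token and strip surrounding punctuation
    return token.lower().strip(string.punctuation)

tokens = ["Happy!", "happy", "HAPPY,"]
print([normalize(t) for t in tokens])  # ['happy', 'happy', 'happy']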

3 Spark

3.1 pySpark Console

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Python version 2.7.5 (default, Jun 24 2015 00:41:19)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc
<pyspark.context.SparkContext object at 0x1c84e10>
>>> data = sc.textFile("hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv")
15/09/12 14:31:43 INFO storage.MemoryStore: ensureFreeSpace(130448) called with curMem=0, maxMem=278019440
15/09/12 14:31:43 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 265.0 MB)
15/09/12 14:31:43 INFO storage.MemoryStore: ensureFreeSpace(14397) called with curMem=130448, maxMem=278019440
15/09/12 14:31:43 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.1 KB, free 265.0 MB)
15/09/12 14:31:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37390 (size: 14.1 KB, free: 265.1 MB)
15/09/12 14:31:43 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
>>> data.count()
15/09/12 14:31:53 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/12 14:31:53 INFO spark.SparkContext: Starting job: count at <stdin>:1
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(count at <stdin>:1)
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Missing parents: List()
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] at count at <stdin>:1), which has no missing parents
15/09/12 14:31:53 INFO storage.MemoryStore: ensureFreeSpace(5936) called with curMem=144845, maxMem=278019440
15/09/12 14:31:53 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.8 KB, free 265.0 MB)
15/09/12 14:31:53 INFO storage.MemoryStore: ensureFreeSpace(3553) called with curMem=150781, maxMem=278019440
15/09/12 14:31:53 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.5 KB, free 265.0 MB)
15/09/12 14:31:53 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:37390 (size: 3.5 KB, free: 265.1 MB)
15/09/12 14:31:53 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (PythonRDD[2] at count at <stdin>:1)
15/09/12 14:31:53 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/09/12 14:31:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1436 bytes)
15/09/12 14:31:53 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1436 bytes)
15/09/12 14:31:53 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/09/12 14:31:53 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/09/12 14:31:54 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv:0+79549
15/09/12 14:31:54 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv:79549+79549
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/09/12 14:31:54 INFO python.PythonRDD: Times: total = 461, boot = 344, init = 109, finish = 8
15/09/12 14:31:54 INFO python.PythonRDD: Times: total = 502, boot = 358, init = 141, finish = 3
15/09/12 14:31:54 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 1871 bytes result sent to driver
15/09/12 14:31:54 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1871 bytes result sent to driver
15/09/12 14:31:54 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 617 ms on localhost (1/2)
15/09/12 14:31:54 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 626 ms on localhost (2/2)
15/09/12 14:31:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 0.646 s
15/09/12 14:31:54 INFO scheduler.DAGScheduler: Job 0 finished: count at <stdin>:1, took 0.718714 s
15/09/12 14:31:54 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
2001
>>> 

3.2 Python notebook

%time
from operator import add
data = sc.textFile('hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv')
counts = (data.flatMap(lambda x: x.split(' '))
          .map(lambda x: (x, 1))
          .reduceByKey(add))
counts.saveAsTextFile('spark_wordcount')
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 3.81 µs

The cell reported a wall time of 3.81 microseconds, versus roughly 20 seconds for the MapReduce program. I believe much of the real difference comes from Spark doing its work in memory rather than writing intermediate results to disk, but the 3.81 µs figure itself is misleading: %time is a line magic, so as written it times only an empty statement rather than the Spark job, whose transformations are lazy and only execute when saveAsTextFile is called.
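
To time the whole job, the %%time cell magic could be used instead. A minimal sketch, assuming the same notebook session and a fresh output path (spark_wordcount_timed is a placeholder to avoid clobbering the earlier output):

%%time
from operator import add

data = sc.textFile('hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv')
counts = (data.flatMap(lambda x: x.split(' '))
          .map(lambda x: (x, 1))
          .reduceByKey(add))
# saveAsTextFile is the action that triggers execution, so the reported
# wall time covers the full word count rather than an empty statement
counts.saveAsTextFile('spark_wordcount_timed')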

4 H2O

4.1 Initial AUC

The initial AUC with ntrees = 3 and max_depth = 10 is 0.586106.

4.2 Increased Performance

By increasing ntrees to 30 and max_depth to 20, I was able to get an AUC of 0.616722.

Changing these parameters resulted in a better model: increasing the depth allows each tree to capture more complicated interactions, and increasing the number of trees lets the ensemble explore more combinations of attributes (adding stochasticity).
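
For reference, a minimal sketch of how a model like this might be trained and scored with H2O's Python API. The dataset path, the column names, and the choice of a random forest estimator are assumptions for illustration, not taken from the lab:

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

# Hypothetical data and response column -- substitute the lab's actual frame
tweets = h2o.import_file("hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv")
tweets["sentiment"] = tweets["sentiment"].asfactor()   # AUC needs a categorical response
train, valid = tweets.split_frame(ratios=[0.8], seed=42)

# The tuned parameters from above: more trees, grown deeper
model = H2ORandomForestEstimator(ntrees=30, max_depth=20, seed=42)
model.train(x=[c for c in tweets.columns if c != "sentiment"],
            y="sentiment",
            training_frame=train,
            validation_frame=valid)

print(model.model_performance(valid).auc())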