Output of “hdfs dfsadmin -report”
Output of “yarn node -list”
Screenshot of Datanode Information page
Screenshot of All Applications page
Output of putting tweet sentiment files into HDFS
You can observe from the screenshot that the file sizes on the local filesystem and in HDFS are the same.
[hadoop@localhost mapreduce_wordcount]$ time ./run.sh
packageJobJar: [/tmp/hadoop-unjar8960290533764928209/] [] /tmp/streamjob358535730770061676.jar tmpDir=null
15/09/12 14:05:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/09/12 14:05:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/09/12 14:05:06 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/12 14:05:06 INFO mapreduce.JobSubmitter: number of splits:2
15/09/12 14:05:06 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1442076969451_0002
15/09/12 14:05:06 INFO impl.YarnClientImpl: Submitted application application_1442076969451_0002
15/09/12 14:05:06 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1442076969451_0002/
15/09/12 14:05:06 INFO mapreduce.Job: Running job: job_1442076969451_0002
15/09/12 14:05:12 INFO mapreduce.Job: Job job_1442076969451_0002 running in uber mode : false
15/09/12 14:05:12 INFO mapreduce.Job: map 0% reduce 0%
15/09/12 14:05:17 INFO mapreduce.Job: map 50% reduce 0%
15/09/12 14:05:18 INFO mapreduce.Job: map 100% reduce 0%
15/09/12 14:05:22 INFO mapreduce.Job: map 100% reduce 100%
15/09/12 14:05:23 INFO mapreduce.Job: Job job_1442076969451_0002 completed successfully
15/09/12 14:05:23 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=269679
FILE: Number of bytes written=866110
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=161699
HDFS: Number of bytes written=90819
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=7882
Total time spent by all reduces in occupied slots (ms)=2413
Total time spent by all map tasks (ms)=7882
Total time spent by all reduce tasks (ms)=2413
Total vcore-seconds taken by all map tasks=7882
Total vcore-seconds taken by all reduce tasks=2413
Total megabyte-seconds taken by all map tasks=8071168
Total megabyte-seconds taken by all reduce tasks=2470912
Map-Reduce Framework
Map input records=2001
Map output records=27876
Map output bytes=213921
Map output materialized bytes=269685
Input split bytes=230
Combine input records=0
Combine output records=0
Reduce input groups=8929
Reduce shuffle bytes=269685
Reduce input records=27876
Reduce output records=8929
Spilled Records=55752
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=248
CPU time spent (ms)=3250
Physical memory (bytes) snapshot=686350336
Virtual memory (bytes) snapshot=6302289920
Total committed heap usage (bytes)=513802240
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=161469
File Output Format Counters
Bytes Written=90819
15/09/12 14:05:23 INFO streaming.StreamJob: Output directory: /user/hadoop/output/outputPart2
real 0m20.019s
user 0m3.161s
sys 0m0.426s
The program took a little over 20 seconds of wall-clock time to execute (real 0m20.019s).
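The contents of run.sh and the streaming scripts are not reproduced here; a minimal sketch of the kind of Python mapper and reducer a Hadoop Streaming word count like this typically uses (names and details are my assumption, not the actual scripts) is:

#!/usr/bin/env python
# mapper.py (hypothetical): emit "<token>\t1" for every whitespace-separated token
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py (hypothetical): sum the counts for each token.
# Hadoop Streaming delivers the mapper output sorted by key, so equal
# tokens arrive as a contiguous run.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

The tail of the job's output below shows how unnormalized tokens (attached punctuation, mixed case, different scripts, and encoding artifacts) each receive their own count.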
получится- 1
потестить 1
придется 1
принадлежит 1
прогу. 1
скачал 1
то 1
тупо 1
хочу 1
хренового 1
بابا 1
به 1
حسین 1
خیر. 1
سلام 1
صبح 1
نمونه 1
کارمند 1
— 1
♥ 2
♪ 2
♫ 8
� 1
�rte... 1
�� 1
The program could be made more accurate by removing punctuation and maybe changing everything to lower case.
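For instance, the hypothetical mapper sketched earlier could normalize each token before emitting it (again an illustration under the same assumptions, not the script that was actually run; string.punctuation only covers ASCII punctuation, so symbols such as ♥ and non-Latin punctuation would still need extra handling):

#!/usr/bin/env python
# mapper_normalized.py (hypothetical): lower-case tokens and strip
# surrounding ASCII punctuation before counting
import sys
import string

for line in sys.stdin:
    for word in line.strip().lower().split():
        word = word.strip(string.punctuation)
        if word:
            print('%s\t%d' % (word, 1))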
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/
Using Python version 2.7.5 (default, Jun 24 2015 00:41:19)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc
<pyspark.context.SparkContext object at 0x1c84e10>
>>> data = sc.textFile("hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv")
15/09/12 14:31:43 INFO storage.MemoryStore: ensureFreeSpace(130448) called with curMem=0, maxMem=278019440
15/09/12 14:31:43 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 265.0 MB)
15/09/12 14:31:43 INFO storage.MemoryStore: ensureFreeSpace(14397) called with curMem=130448, maxMem=278019440
15/09/12 14:31:43 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.1 KB, free 265.0 MB)
15/09/12 14:31:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37390 (size: 14.1 KB, free: 265.1 MB)
15/09/12 14:31:43 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
>>> data.count()
15/09/12 14:31:53 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/12 14:31:53 INFO spark.SparkContext: Starting job: count at <stdin>:1
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(count at <stdin>:1)
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Missing parents: List()
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] at count at <stdin>:1), which has no missing parents
15/09/12 14:31:53 INFO storage.MemoryStore: ensureFreeSpace(5936) called with curMem=144845, maxMem=278019440
15/09/12 14:31:53 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.8 KB, free 265.0 MB)
15/09/12 14:31:53 INFO storage.MemoryStore: ensureFreeSpace(3553) called with curMem=150781, maxMem=278019440
15/09/12 14:31:53 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.5 KB, free 265.0 MB)
15/09/12 14:31:53 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:37390 (size: 3.5 KB, free: 265.1 MB)
15/09/12 14:31:53 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/09/12 14:31:53 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (PythonRDD[2] at count at <stdin>:1)
15/09/12 14:31:53 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/09/12 14:31:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1436 bytes)
15/09/12 14:31:53 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1436 bytes)
15/09/12 14:31:53 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/09/12 14:31:53 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/09/12 14:31:54 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv:0+79549
15/09/12 14:31:54 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv:79549+79549
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/09/12 14:31:54 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/09/12 14:31:54 INFO python.PythonRDD: Times: total = 461, boot = 344, init = 109, finish = 8
15/09/12 14:31:54 INFO python.PythonRDD: Times: total = 502, boot = 358, init = 141, finish = 3
15/09/12 14:31:54 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 1871 bytes result sent to driver
15/09/12 14:31:54 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1871 bytes result sent to driver
15/09/12 14:31:54 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 617 ms on localhost (1/2)
15/09/12 14:31:54 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 626 ms on localhost (2/2)
15/09/12 14:31:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 0.646 s
15/09/12 14:31:54 INFO scheduler.DAGScheduler: Job 0 finished: count at <stdin>:1, took 0.718714 s
15/09/12 14:31:54 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
2001
>>>
%time
from operator import add
data = sc.textFile('hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv')
counts = (data.flatMap(lambda x: x.split(' '))
              .map(lambda x: (x, 1))
              .reduceByKey(add))
counts.saveAsTextFile('spark_wordcount')
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 3.81 µs
The reported 3.81 microseconds versus roughly 20 seconds for the MapReduce program is not a fair comparison: the %time magic only times the single line it is attached to, and Spark transformations are lazy, so the real work happens later when saveAsTextFile runs and is not captured by that figure. Even so, Spark should come out well ahead here (counting the same 2001-line file took only about 0.7 seconds above), and I believe the difference is largely because Spark keeps its intermediate data in memory instead of writing it to disk between the map and reduce phases.
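A sketch of how the end-to-end job time could be measured instead, wrapping the action that actually triggers the computation (the output path here is hypothetical so it does not collide with the earlier run):

import time
from operator import add

start = time.time()
data = sc.textFile('hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv')
counts = (data.flatMap(lambda x: x.split(' '))
              .map(lambda x: (x, 1))
              .reduceByKey(add))
counts.saveAsTextFile('spark_wordcount_timed')  # saveAsTextFile is the action that runs the job
print('elapsed: %.2f s' % (time.time() - start))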
The initial AUC with ntrees = 3 and max_depth = 10 is 0.586106.
By increasing ntrees to 30 and max_depth to 20, I was able to raise the AUC to 0.616722.
Changing these parameters resulted in a better model: a larger max_depth allows more complicated trees to be fit, and a larger number of trees lets more combinations of attributes be explored (adding stochasticity).
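The training code is not shown in this report; the parameter names ntrees and max_depth match H2O's distributed random forest, so if that is the library in use, the tuning step would look roughly like the sketch below (file names and the response column are placeholders I have assumed):

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()
# placeholder files and column names -- the real ones are not shown above
train = h2o.import_file('train.csv')
valid = h2o.import_file('valid.csv')
train['label'] = train['label'].asfactor()   # binary target, so AUC applies
valid['label'] = valid['label'].asfactor()
features = [c for c in train.columns if c != 'label']

# initial model used ntrees=3, max_depth=10 (AUC ~0.586);
# the tuned settings below correspond to AUC ~0.617
model = H2ORandomForestEstimator(ntrees=30, max_depth=20)
model.train(x=features, y='label', training_frame=train, validation_frame=valid)
print(model.model_performance(valid).auc())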