Synopsis

The following data analysis is for the Coursera Data Science Specializaiton Course “Reproducible Research” final project. The dataset used in this assignment is the Storm Data set from the National Weather Service. The purpose of this data analysis is to answer two questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?
By summing up the total number of fatalities and injuries per event, it is determined that the following top 5 events were the most harmful: Tornado, Lightning,Heat,Flash Flood, and Excessive Heat. By summing up the total property damage and crop damange, it is determined that the top 5 types of events that cost most economic losts are: Tornado, Flash Flood, TSTM Wind,Flood, and Thunderstorm Wind. See below for etailed data processing and analyzing processes.

Data Processing

Step 1: Read the Data and Documentations

## read data
file.url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(file.url, destfile ="storm.csv.bz2",method = "auto")
storm<-read.csv("storm.csv.bz2")
# read documentations
file.url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(file.url,destfile="storm_doc.pdf",method="auto")
# read FAQ
file.url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(file.url,destfile = "FAQ.pdf",method="auto")
##Now let's see what's inside of the dataframe storm
head(storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

From the print out of head(storm) and the structure of storm, along with the documentation of storm_doc.pdf and FAQ.pdf, it seems that there are only five variables/columns that we will need, for the purpose of this assignment. They are: EVTYPE(factor; event type),FATALITIES(numeric;number of fatalities),INJURIES(numeric; number of injuries),PROPDMG(numeric; amount of property damage),and CROPDMG(numeric; amount of crop damage). Therefore, next step is to drop those unnecessary columns:

Step 2: Load the libraries

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(reshape)
## Warning: package 'reshape' was built under R version 3.2.5
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
## 
##     rename

Step 3: Drop the Unnecessary columns

Particularly, in order to answer the first quesiton, we need to know the total number of loss of lives, and injuries, across the U.S., by even type.

storm_lean<-subset(storm,select=c(EVTYPE,FATALITIES,INJURIES,PROPDMG,CROPDMG))

Step 4: Sum up the Four Numeric Columns by Event Types.

sum_storm<-storm_lean %>%
  group_by(EVTYPE) %>%
  summarise(total_fatalities=sum(FATALITIES),total_injuries=sum(INJURIES),total_propdmg=sum(PROPDMG),total_cropdmg=sum(CROPDMG))

Step 5: Find out the answer for Question 1

Question 1 is: Across the United States, which types of events are most harmful with respect to population health? To answer this question, we need to: first sort sum_storm in descendent order by columns of total_fatalities and total_injuries. Then look at the first few rows, depending on how many events you want to consider as “the most harmful” events across the U.S. Here in this assignment, only the first five rows are considered.

storm_harm<-arrange(sum_storm,desc(total_fatalities),desc(total_injuries))
head(storm_harm)
## # A tibble: 6 × 5
##           EVTYPE total_fatalities total_injuries total_propdmg
##           <fctr>            <dbl>          <dbl>         <dbl>
## 1        TORNADO             5633          91346     3212258.2
## 2 EXCESSIVE HEAT             1903           6525        1460.0
## 3    FLASH FLOOD              978           1777     1420124.6
## 4           HEAT              937           2100         298.5
## 5      LIGHTNING              816           5230      603351.8
## 6      TSTM WIND              504           6957     1335965.6
## # ... with 1 more variables: total_cropdmg <dbl>

Step 6: Find out the answer for Question 2

Question 2 is: Across the United States, which types of events have the greatest economic consequences? To answer this question, we first sort sum_storm in descendent order by columns of total_propdmg and total_cropdmg.Then look at the first few rows, depending on how many events you want to consider as “the most harmful” events across the U.S. Here in this assignment, only the first five rows are considered.

storm_damage<-arrange(sum_storm,desc(total_propdmg),desc(total_cropdmg))

Results

The five most harmful event types:

Now we can use head(storm_har,5L)to see which 5 events are the most harmful across U.S. They are: TORNADO, EXCESSIVE HEAT, FLASH FLOOD, HEAT, AND LIGHTNING. See below for the actual numbers:

head(storm_harm,5L)
## # A tibble: 5 × 5
##           EVTYPE total_fatalities total_injuries total_propdmg
##           <fctr>            <dbl>          <dbl>         <dbl>
## 1        TORNADO             5633          91346     3212258.2
## 2 EXCESSIVE HEAT             1903           6525        1460.0
## 3    FLASH FLOOD              978           1777     1420124.6
## 4           HEAT              937           2100         298.5
## 5      LIGHTNING              816           5230      603351.8
## # ... with 1 more variables: total_cropdmg <dbl>

But to be more intuitive, let’s use a bar chart.

most_harm<-subset(storm_harm[1:5,],select=c(EVTYPE,total_fatalities,total_injuries))
melted_harm<-melt(data.frame(most_harm), id="EVTYPE")
ggplot(melted_harm,aes(x=EVTYPE,y=value,fill=variable)) + geom_bar(stat="identity", position="dodge",alpha=.3)+ggtitle("Top Five Most Harmful Storm Events Across U.S.")+labs(x="Event Type", y="Number of Injuries and Fatalities")+scale_fill_discrete(name="Total number of Harms", breaks=c("total_fatalities","total_injuries"),labels=c("Fatalities","Injuries"))+
  theme(axis.text.x=element_text(angle=-60,hjust=-0.1))

From the plot above, we can tell that TORNADO cost the most injuries and fatalities in comparison to the other four types. Also the numbers of injuries and fatalities cost by TORNADO are far more than the other four types.

The five most damaging event types:

Now we can use head(storm_damage) to tell which 5 events are the most harmful across U.S. They are: TORNADO, FLASH FLOOD, TSTM WIND, FLOOD, and THUNDERSTORM WIND. See below for the actual numbers:

head(storm_damage,5L)
## # A tibble: 5 × 5
##              EVTYPE total_fatalities total_injuries total_propdmg
##              <fctr>            <dbl>          <dbl>         <dbl>
## 1           TORNADO             5633          91346     3212258.2
## 2       FLASH FLOOD              978           1777     1420124.6
## 3         TSTM WIND              504           6957     1335965.6
## 4             FLOOD              470           6789      899938.5
## 5 THUNDERSTORM WIND              133           1488      876844.2
## # ... with 1 more variables: total_cropdmg <dbl>

But to be more intuitive, let’s use a bar chart.

## Now let's plot. 
most_damage<-subset(storm_damage[1:5,],select=c(EVTYPE,total_propdmg,total_cropdmg))
melted_damage<-melt(data.frame(most_damage), id="EVTYPE")
ggplot(melted_damage,aes(x=EVTYPE,y=value,fill=variable)) + geom_bar(stat="identity", position="dodge",alpha=.3)+ggtitle("Top Five Most Damaging Storm Events Across U.S.")+labs(x="Event Type", y="Number of Damage")+scale_fill_discrete(name="Total Damage", breaks=c("total_propdmg","total_cropdmg"),labels=c("Property Damage","Crop Damage"))+
  theme(axis.text.x=element_text(angle=-60,hjust=-0.1))

Again, from the plot above, it can be told that TORNADO cost the most property and damage. However, the most crop damage are cost by FLASH FLOOD AND FLOOD. Overall, it seems that TORNADO still is the winner in terms of costing economics damage.