The following data analysis is for the Coursera Data Science Specializaiton Course “Reproducible Research” final project. The dataset used in this assignment is the Storm Data set from the National Weather Service. The purpose of this data analysis is to answer two questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?
By summing up the total number of fatalities and injuries per event, it is determined that the following top 5 events were the most harmful: Tornado, Lightning,Heat,Flash Flood, and Excessive Heat. By summing up the total property damage and crop damange, it is determined that the top 5 types of events that cost most economic losts are: Tornado, Flash Flood, TSTM Wind,Flood, and Thunderstorm Wind. See below for etailed data processing and analyzing processes.
## read data
file.url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(file.url, destfile ="storm.csv.bz2",method = "auto")
storm<-read.csv("storm.csv.bz2")
# read documentations
file.url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(file.url,destfile="storm_doc.pdf",method="auto")
# read FAQ
file.url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(file.url,destfile = "FAQ.pdf",method="auto")
##Now let's see what's inside of the dataframe storm
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
From the print out of head(storm) and the structure of storm, along with the documentation of storm_doc.pdf and FAQ.pdf, it seems that there are only five variables/columns that we will need, for the purpose of this assignment. They are: EVTYPE(factor; event type),FATALITIES(numeric;number of fatalities),INJURIES(numeric; number of injuries),PROPDMG(numeric; amount of property damage),and CROPDMG(numeric; amount of crop damage). Therefore, next step is to drop those unnecessary columns:
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(reshape)
## Warning: package 'reshape' was built under R version 3.2.5
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
Particularly, in order to answer the first quesiton, we need to know the total number of loss of lives, and injuries, across the U.S., by even type.
storm_lean<-subset(storm,select=c(EVTYPE,FATALITIES,INJURIES,PROPDMG,CROPDMG))
sum_storm<-storm_lean %>%
group_by(EVTYPE) %>%
summarise(total_fatalities=sum(FATALITIES),total_injuries=sum(INJURIES),total_propdmg=sum(PROPDMG),total_cropdmg=sum(CROPDMG))
Question 1 is: Across the United States, which types of events are most harmful with respect to population health? To answer this question, we need to: first sort sum_storm in descendent order by columns of total_fatalities and total_injuries. Then look at the first few rows, depending on how many events you want to consider as “the most harmful” events across the U.S. Here in this assignment, only the first five rows are considered.
storm_harm<-arrange(sum_storm,desc(total_fatalities),desc(total_injuries))
head(storm_harm)
## # A tibble: 6 × 5
## EVTYPE total_fatalities total_injuries total_propdmg
## <fctr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 3212258.2
## 2 EXCESSIVE HEAT 1903 6525 1460.0
## 3 FLASH FLOOD 978 1777 1420124.6
## 4 HEAT 937 2100 298.5
## 5 LIGHTNING 816 5230 603351.8
## 6 TSTM WIND 504 6957 1335965.6
## # ... with 1 more variables: total_cropdmg <dbl>
Question 2 is: Across the United States, which types of events have the greatest economic consequences? To answer this question, we first sort sum_storm in descendent order by columns of total_propdmg and total_cropdmg.Then look at the first few rows, depending on how many events you want to consider as “the most harmful” events across the U.S. Here in this assignment, only the first five rows are considered.
storm_damage<-arrange(sum_storm,desc(total_propdmg),desc(total_cropdmg))
Now we can use head(storm_har,5L)to see which 5 events are the most harmful across U.S. They are: TORNADO, EXCESSIVE HEAT, FLASH FLOOD, HEAT, AND LIGHTNING. See below for the actual numbers:
head(storm_harm,5L)
## # A tibble: 5 × 5
## EVTYPE total_fatalities total_injuries total_propdmg
## <fctr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 3212258.2
## 2 EXCESSIVE HEAT 1903 6525 1460.0
## 3 FLASH FLOOD 978 1777 1420124.6
## 4 HEAT 937 2100 298.5
## 5 LIGHTNING 816 5230 603351.8
## # ... with 1 more variables: total_cropdmg <dbl>
But to be more intuitive, let’s use a bar chart.
most_harm<-subset(storm_harm[1:5,],select=c(EVTYPE,total_fatalities,total_injuries))
melted_harm<-melt(data.frame(most_harm), id="EVTYPE")
ggplot(melted_harm,aes(x=EVTYPE,y=value,fill=variable)) + geom_bar(stat="identity", position="dodge",alpha=.3)+ggtitle("Top Five Most Harmful Storm Events Across U.S.")+labs(x="Event Type", y="Number of Injuries and Fatalities")+scale_fill_discrete(name="Total number of Harms", breaks=c("total_fatalities","total_injuries"),labels=c("Fatalities","Injuries"))+
theme(axis.text.x=element_text(angle=-60,hjust=-0.1))
From the plot above, we can tell that TORNADO cost the most injuries and fatalities in comparison to the other four types. Also the numbers of injuries and fatalities cost by TORNADO are far more than the other four types.
Now we can use head(storm_damage) to tell which 5 events are the most harmful across U.S. They are: TORNADO, FLASH FLOOD, TSTM WIND, FLOOD, and THUNDERSTORM WIND. See below for the actual numbers:
head(storm_damage,5L)
## # A tibble: 5 × 5
## EVTYPE total_fatalities total_injuries total_propdmg
## <fctr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 3212258.2
## 2 FLASH FLOOD 978 1777 1420124.6
## 3 TSTM WIND 504 6957 1335965.6
## 4 FLOOD 470 6789 899938.5
## 5 THUNDERSTORM WIND 133 1488 876844.2
## # ... with 1 more variables: total_cropdmg <dbl>
But to be more intuitive, let’s use a bar chart.
## Now let's plot.
most_damage<-subset(storm_damage[1:5,],select=c(EVTYPE,total_propdmg,total_cropdmg))
melted_damage<-melt(data.frame(most_damage), id="EVTYPE")
ggplot(melted_damage,aes(x=EVTYPE,y=value,fill=variable)) + geom_bar(stat="identity", position="dodge",alpha=.3)+ggtitle("Top Five Most Damaging Storm Events Across U.S.")+labs(x="Event Type", y="Number of Damage")+scale_fill_discrete(name="Total Damage", breaks=c("total_propdmg","total_cropdmg"),labels=c("Property Damage","Crop Damage"))+
theme(axis.text.x=element_text(angle=-60,hjust=-0.1))
Again, from the plot above, it can be told that TORNADO cost the most property and damage. However, the most crop damage are cost by FLASH FLOOD AND FLOOD. Overall, it seems that TORNADO still is the winner in terms of costing economics damage.