Synopsis

We use the NOAA Storm Database to analyze which types of events across the United States are most harmful to population health, and which have the greatest economic impact. From these data, we found that between 1950 and 2011 tornadoes caused the most injuries, and floods caused the greatest total damage cost.

Download and read the data

fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

download.file(fileURL, "stormData.csv.bz2", mode = "wb")
## read.csv() reads the bz2-compressed file directly; the file name must match the download above
data <- read.csv("stormData.csv.bz2", stringsAsFactors = FALSE)

dateDownloaded<-date()
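
Because the compressed file is large, it can be convenient to skip the download when a local copy already exists. The following is a minimal sketch of that idea; the helper name `download_if_missing` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (an assumption, not the original code):
# download `url` to `dest` only when `dest` does not exist yet,
# and return the local path either way.
download_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")
  }
  dest
}
```

It could be called as `download_if_missing(fileURL, "stormData.csv.bz2")` before `read.csv()`.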

Processing the raw data

The data set has 37 variables and 902297 observations. We keep the following 8 variables of interest:
BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP

data1 <- subset(data, 
                select= c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

Check whether any values are missing (i.e., coded as NA) in the observations.

colSums(is.na(data1))
##   BGN_DATE     EVTYPE FATALITIES   INJURIES    PROPDMG PROPDMGEXP 
##          0          0          0          0          0          0 
##    CROPDMG CROPDMGEXP 
##          0          0
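
A zero NA count does not by itself rule out blank entries: `read.csv()` reads blank fields in character columns as empty strings, not as NA. The sketch below shows one way to count empty strings next to the NA check. It runs on a small made-up data frame (`toy`), not on the storm data.

```r
# Toy data frame standing in for data1; the values are made up.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 0),
  PROPDMGEXP = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# Count NA values per column, as in the check above.
na_counts <- colSums(is.na(toy))

# Count empty strings per column; numeric values never compare equal to "".
empty_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))
empty_counts
```

The same `sapply()` call could be applied to `data1` to see how many PROPDMGEXP and CROPDMGEXP entries are blank.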

Property damage estimates are entered as actual dollar amounts rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, e.g., 1.55B for $1,550,000,000. The alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. In the two variables PROPDMGEXP and CROPDMGEXP, these codes are replaced by the corresponding numerical multipliers. In the code that follows, digits are read as powers of ten, and blank or unrecognized codes are given a multiplier of 1.

The character variable BGN_DATE is also converted to Date format.

unique(data1$PROPDMGEXP)
unique(data1$CROPDMGEXP)

## Map one magnitude code to its numeric multiplier; blank or unrecognized codes give 1
exp_to_multiplier <- function(x) {
  switch(as.character(x),
         "-" = 1, "?" = 1, "+" = 1,
         "0" = 1, "1" = 10^1, "2" = 10^2, "3" = 10^3, "4" = 10^4,
         "5" = 10^5, "6" = 10^6, "7" = 10^7, "8" = 10^8, "9" = 10^9,
         "h" = 10^2, "H" = 10^2, "k" = 10^3, "K" = 10^3,
         "m" = 10^6, "M" = 10^6, "b" = 10^9, "B" = 10^9,
         1)
}

## Decode the columns of data1 (not data) and drop the element names sapply() would add
data1$PROPDMGEXP <- sapply(data1$PROPDMGEXP, exp_to_multiplier, USE.NAMES = FALSE)
data1$CROPDMGEXP <- sapply(data1$CROPDMGEXP, exp_to_multiplier, USE.NAMES = FALSE)
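
As a cross-check on the switch()-based decoding above, the same mapping can be written as a vectorized lookup in a named vector. This is a sketch on made-up amounts and codes; the names `exp_lookup` and `decode_exp_vec` are hypothetical. It assumes, as the code above does, that digits mean powers of ten and that any other code gets a multiplier of 1.

```r
# Multipliers by code: digits 0-9 are powers of ten, letters as documented.
exp_lookup <- c(setNames(10^(0:9), as.character(0:9)),
                H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Vectorized decoder: unknown or blank codes fall back to a multiplier of 1.
decode_exp_vec <- function(code) {
  mult <- unname(exp_lookup[toupper(code)])
  mult[is.na(mult)] <- 1
  mult
}

# Made-up example values, including the 1.55B case from the text.
codes   <- c("K", "m", "B", "", "?", "5")
amounts <- c(25, 2.5, 1.55, 10, 3, 2)
decoded <- amounts * decode_exp_vec(codes)
decoded
```

Here 1.55 with code "B" decodes to 1,550,000,000, matching the example in the text. A lookup like this avoids calling a function once per row, which can matter with about 900,000 observations.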

data1$BGN_DATE<-as.Date(data1$BGN_DATE, "%m/%d/%Y %H:%M:%S")
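
The date conversion can be checked on a few sample strings. The sketch below assumes that BGN_DATE values are laid out as "month/day/year hour:minute:second"; the sample strings are illustrative.

```r
# Sample strings in the assumed layout (illustrative values).
raw_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# as.Date() keeps the calendar date and drops the time of day.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y %H:%M:%S")

# Strings that do not match the format become NA, so this counts failures.
n_failed <- sum(is.na(parsed))
format(parsed)
```

Running `sum(is.na(data1$BGN_DATE))` after the conversion gives the same kind of failure count for the real data.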

Data analysis

Now we calculate the total number of injuries (INJURIES) across all states for each event type (EVTYPE), keeping only event types whose total number of injuries is greater than zero. We then sort them in descending order of the number of injuries.

injury <- aggregate(INJURIES ~ EVTYPE, data = data1, sum)
injury <- subset(injury, INJURIES > 0)
injury <- injury[order(-injury[,2]), ] 
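
The aggregate, filter, and sort steps can be traced on a small made-up data frame (`toy` and `totals` are illustrative names):

```r
# Made-up records: two event types with injuries, one without.
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  INJURIES = c(10, 2, 5, 0, 1),
  stringsAsFactors = FALSE
)

# Sum injuries per event type, drop types with a zero total, sort descending.
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, sum)
totals <- subset(totals, INJURIES > 0)
totals <- totals[order(-totals$INJURIES), ]
totals
```

In this toy case TORNADO (15) sorts ahead of FLOOD (3), and HAIL is dropped because its total is zero.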

A bar plot of the 10 event types most harmful to population health across the United States.

library(ggplot2) ## needed for ggplot() and the plotting layers below

injury_top10 <- injury[1:10, ]
injury_top10 <- transform(injury_top10, EVTYPE = reorder(EVTYPE, INJURIES)) ## reorder `EVTYPE` on `INJURIES`

g1 <- ggplot(data = injury_top10, aes(x = EVTYPE, y = INJURIES)) +
  geom_bar(stat = "identity", fill = "grey") +
  xlab("Event type") +
  ylab("Number of injuries") +
  ggtitle("Top 10 most harmful event types across USA \n between 1950 and 2011") +
  coord_flip()
g1

[Figure: horizontal bar plot of the top 10 event types by number of injuries, 1950 to 2011]
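
The bar order in the plot comes from the factor levels of EVTYPE, since ggplot2 draws a discrete axis in level order. The sketch below uses made-up values to show how `reorder()` sets the levels by ascending injury count, which puts the largest bar at the top once the axes are flipped.

```r
# Made-up totals, deliberately not in alphabetical or numeric order.
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "HEAT"),
  INJURIES = c(15, 7, 3),
  stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels are sorted by the second argument.
toy$EVTYPE <- reorder(toy$EVTYPE, toy$INJURIES)
levels(toy$EVTYPE)
```

The levels come out as HEAT, FLOOD, TORNADO, i.e., from the fewest to the most injuries.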

Calculate the cost of the damage to crops and property caused by each event, and append the results to the data frame.
Then calculate the total cost across all states by adding the crop and property damage costs for each event type, and sort the totals in descending order.

## Arithmetic on columns is vectorized, so no mapply() is needed
data1$property_cost <- data1$PROPDMG * data1$PROPDMGEXP
data1$crop_cost <- data1$CROPDMG * data1$CROPDMGEXP
data1$total_cost <- data1$property_cost + data1$crop_cost

total_cost <- aggregate(total_cost ~ EVTYPE, data = data1, sum)
total_cost <- subset(total_cost, total_cost > 0)
total_cost <- total_cost[order(-total_cost[,2]), ] 
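
The cost arithmetic can be worked through by hand on a few made-up rows. In this sketch the multiplier columns are assumed to be already decoded to numbers, as in the processing step above; `toy` and `cost_by_type` are illustrative names.

```r
# Made-up rows with already-decoded multipliers.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD"),
  PROPDMG    = c(1.5, 25, 2),
  PROPDMGEXP = c(1e9, 1e3, 1e6),
  CROPDMG    = c(0, 5, 1),
  CROPDMGEXP = c(1, 1e3, 1e6),
  stringsAsFactors = FALSE
)

# Per-row total: property cost plus crop cost.
toy$total_cost <- toy$PROPDMG * toy$PROPDMGEXP + toy$CROPDMG * toy$CROPDMGEXP

# Sum per event type and sort in descending order of total cost.
cost_by_type <- aggregate(total_cost ~ EVTYPE, data = toy, sum)
cost_by_type <- cost_by_type[order(-cost_by_type$total_cost), ]
cost_by_type
```

For these rows, FLOOD totals 1.5e9 + 2e6 + 1e6 = 1,503,000,000 and TORNADO totals 25,000 + 5,000 = 30,000.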

A bar plot of the 10 costliest event types across the United States.

total_cost_top10 <- total_cost[1:10, ]
total_cost_top10 <- transform(total_cost_top10, EVTYPE = reorder(EVTYPE, total_cost)) ## reorder `EVTYPE` on `total_cost`

g2 <- ggplot(data = total_cost_top10, aes(x = EVTYPE, y = total_cost)) +
  geom_bar(stat = "identity", fill = "grey") +
  xlab("Event type") +
  ylab("Total damage cost (USD)") +
  ggtitle("Top 10 costliest event types across USA \n between 1950 and 2011") +
  coord_flip()
g2

[Figure: horizontal bar plot of the top 10 event types by total damage cost, 1950 to 2011]

Results

In our study we found that, across all states in the USA between 1950 and 2011, floods caused the greatest total damage cost and tornadoes caused the most injuries.