Reproducible Research: Peer Assessment 2

Puxin Xu

Thursday, December 23, 2015

Synopsis

In this analysis, we explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to answer two questions: across the United States, which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences? Using the records from 1995 to 2011, we found that tornadoes are the most harmful to population health and that floods have the greatest economic consequences.

Data Processing

Get the raw data

# Download the compressed data only if it is not already present
if (!file.exists("./tempdata.csv.bz2")){
    fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(fileUrl, destfile = "tempdata.csv.bz2", method = "curl")
}
# read.csv reads the bzip2-compressed file directly
data <- read.csv("./tempdata.csv.bz2")
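The read step relies on read.csv() handling a bzip2-compressed file without manual decompression. A minimal sketch of that behaviour on a made-up two-row file follows; the names toy, toy_path and toy_back are hypothetical and not part of the analysis.

```r
# Toy demonstration: read.csv() can read a bzip2-compressed CSV directly.
# The file name and the two-row data frame are made up for illustration.
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))
toy_path <- file.path(tempdir(), "toy_storm.csv.bz2")

# Write the toy data through a bzip2 connection
con <- bzfile(toy_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back by path; the compression is detected automatically
toy_back <- read.csv(toy_path)
print(toy_back)
```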

Get the cleaned data

After reading the Storm Data documentation and the assignment requirements, we identified the relevant columns: BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP. We subset the original dataset to these columns. Only the year in BGN_DATE is needed, so we extract it into a new column.

library(lubridate)
year <- year(mdy_hms(data$BGN_DATE))
data <- cbind(data,year)
cleaned_data <- subset(data,select = c(year,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
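For readers without lubridate, the year extraction can also be sketched in base R. This is an illustration under assumptions: the sample strings are hypothetical and assume the month/day/year hour:minute:second layout that mdy_hms expects, and sample_dates, parsed and sample_year are names introduced here only.

```r
# Hypothetical sample values in the "month/day/year hour:minute:second" layout
sample_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Parse the strings, then keep only the calendar year as an integer
parsed <- as.Date(sample_dates, format = "%m/%d/%Y %H:%M:%S")
sample_year <- as.integer(format(parsed, "%Y"))
print(sample_year)
```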

Select the more complete years

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. A histogram of the years in the cleaned data shows this.

with(cleaned_data,hist(year,breaks=60))

The histogram shows a large increase in the number of recorded events around 1995, so we select the data from 1995 to 2011.

completed_data <- subset(cleaned_data,year >= 1995)
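The same reasoning can be checked numerically by counting records per year before applying the cutoff. The sketch below uses made-up years in place of cleaned_data$year; toy_years, counts, cutoff and kept are hypothetical names.

```r
# Made-up years standing in for cleaned_data$year
toy_years <- c(1950, 1950, 1994, 1995, 1995, 1995, 2011, 2011)

# Number of records per year, the quantity the histogram visualises
counts <- table(toy_years)
print(counts)

# Keep only records from the cutoff year onward (1995 in the text)
cutoff <- 1995
kept <- toy_years[toy_years >= cutoff]
print(length(kept))
```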

Analysis Processing

Effect on population health

The FATALITIES and INJURIES columns are the most relevant to population health. We sum each of them by EVTYPE, add the two sums, and use the top 10 event types to show which cause the most harm to population health.

library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, mday, month, quarter, wday, week, yday, year
dt_data <- as.data.table(completed_data)
# Total fatalities per event type
Fatalities <- as.data.table(aggregate(FATALITIES~EVTYPE,dt_data,sum))
# Total injuries per event type
Injuries <- as.data.table(aggregate(INJURIES~EVTYPE,dt_data,sum))
total_effect_health <- merge(Fatalities,Injuries,by = "EVTYPE")
# V2 holds fatalities plus injuries; the plot in the Results section uses this column name
total_effect_health[,V2 := FATALITIES+INJURIES]
# Rank by the combined total, then keep the top 10 event types
setorder(total_effect_health,-V2)
total_effect_health <- head(total_effect_health,10)
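To make the aggregation step concrete, here is a minimal base R sketch on made-up data. It is not the code used for the results; toy_events, toy_totals and the TOTAL column are hypothetical names, and the numbers are invented.

```r
# Made-up event records for illustration only
toy_events <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 3, 4, 0),
  INJURIES   = c(10, 0, 20, 1, 2)
)

# Sum both columns per event type in one call
toy_totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy_events, FUN = sum)

# Combined harm, then rank from most to least harmful
toy_totals$TOTAL <- toy_totals$FATALITIES + toy_totals$INJURIES
toy_totals <- toy_totals[order(-toy_totals$TOTAL), ]
print(toy_totals)
```

Ranking on the combined total, rather than on fatalities alone, keeps the ordering consistent with the quantity that is plotted.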

Effect on economy

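Crop damage and property damage are the relevant measures for this section. The CROPDMGEXP and PROPDMGEXP columns hold a magnitude symbol (H, K, M, or B), which we convert to a power of 10 and multiply by the damage value.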

# Powers of 10 for the damage magnitude symbols
symbol_corp <- data.frame(CROPDMGEXP=c("M","m","K","k","B"),exp = c(6,6,3,3,9))
symbol_prop <- data.frame(PROPDMGEXP=c("M","m","K","H","B"),exp = c(6,6,3,2,9))
# Total damage in dollars per event type for one damage column (top 30 types)
getfull_num <- function(type,symbol,type_exp){
    # Rows whose symbol is not in the lookup table are dropped by the merge
    test <- merge(completed_data,symbol,by = type_exp)
    not_zero_data <- subset(test,test[[type]] != 0)
    full_num <- not_zero_data[[type]] * 10^not_zero_data[["exp"]]
    not_zero_data <- cbind(not_zero_data,full_num)
    data_df <- as.data.table(aggregate(not_zero_data$full_num,by = list(not_zero_data$EVTYPE),FUN = "sum"))
    setorder(data_df,-x)
    top_30 <- head(data_df,30)
    setnames(top_30,"Group.1","type")
    return (top_30)
}
Crop_effect <- getfull_num("CROPDMG",symbol_corp,"CROPDMGEXP")
Prop_effect <- getfull_num("PROPDMG",symbol_prop,"PROPDMGEXP")
total_effect_ecnomic <- merge(Crop_effect,Prop_effect,by = "type")
# V2 holds crop plus property damage; the plot in the Results section uses this column name
total_effect_ecnomic[,V2 := x.x+x.y]
# Rank by combined damage before keeping the top 10 event types
setorder(total_effect_ecnomic,-V2)
total_effect_ecnomic <- head(total_effect_ecnomic,10)
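The exponent conversion can be illustrated on a few made-up rows. This sketch uses a named lookup vector instead of a merge, which is an alternative technique and not the approach taken in the analysis code; exp_lookup, toy_damage and dollars are hypothetical names, and the symbol-to-power mapping (H = 2, K = 3, M = 6, B = 9) is the one used in the lookup tables above.

```r
# Power of 10 for each magnitude symbol (same mapping as the lookup tables)
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Made-up damage rows for illustration only
toy_damage <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 0),
  PROPDMGEXP = c("K", "m", "B", "K"),
  stringsAsFactors = FALSE
)

# Upper-case the symbol so "m" and "M" map to the same power, then scale
powers <- unname(exp_lookup[toupper(toy_damage$PROPDMGEXP)])
toy_damage$dollars <- toy_damage$PROPDMG * 10^powers
print(toy_damage)
```

For example, a PROPDMG of 25 with symbol K corresponds to 25 * 10^3 = 25,000 dollars.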

Results

The most harmful weather to population health

  • Question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
library(ggplot2)
ggplot(total_effect_health, aes(x = reorder(EVTYPE, -V2),y = V2, fill = EVTYPE))+
  geom_bar(stat = "identity")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+
  xlab("Types of events")+
  ylab("Total num")+
  ggtitle("Population health by types of events(Year:1995-2011)")

The plot shows that tornadoes are the event type most harmful to population health.

The weather events with the greatest economic consequences

  • Question: Across the United States, which types of events have the greatest economic consequences?
ggplot(total_effect_ecnomic, aes(x = reorder(type, -V2),y = V2/10^6, fill = type))+
  geom_bar(stat = "identity")+
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+
  xlab("Types of events")+
  ylab("Economic losses(Million dollars)")+
  ggtitle("Economic losses by types of events(Year:1995-2011)")

As the plot shows, floods have the greatest economic consequences.