Hsin-Yu Cheng
June 21, 2015
This report analyzes the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which records storms and other severe weather events between 1950 and 2011. These events affect human health through fatalities and injuries, and they damage property and crops, causing serious economic loss. Understanding which weather events are most harmful in these respects makes it possible to plan strategies that reduce harm and loss.
The purpose of the analysis is to answer two questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
The Storm Data file is downloaded directly from the Coursera website for the Reproducible Research course and read as a CSV file for the subsequent processing.
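library(dplyr)    # select, group_by, summarise, mutate, arrange, filter, left_join
library(tidyr)    # gather
library(ggplot2)  # plotting
library(scales)   # pretty_breaks
# setInternet2(TRUE) was only relevant on older Windows builds of R and is omitted here
temp <- tempfile(fileext = ".csv.bz2")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", temp, mode = "wb")
# bzfile() opens the bzip2-compressed file as a connection that read.csv can read
data <- read.csv(bzfile(temp))
unlink(temp)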
The dataset contains 37 variables and 902,297 observations. The Storm Data Documentation explains the meaning of each variable. To answer the two questions, the following variables are extracted based on their definitions in the documentation: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The data processing and code follow.
The variables related to human health are FATALITIES and INJURIES. Both are numeric and contain no missing values or unreasonable symbols. The numbers of fatalities and injuries are therefore added together to identify the 10 most harmful weather events.
Q1 <- data %>%
  select(EVTYPE, FATALITIES, INJURIES) %>%
  group_by(EVTYPE) %>%
  # Total fatalities and injuries per event type
  summarise(FATALITIES = sum(FATALITIES),
            INJURIES = sum(INJURIES)) %>%
  mutate(population = FATALITIES + INJURIES) %>%
  arrange(desc(population)) %>%
  # Keep the 10 event types with the most fatalities plus injuries
  mutate(Rank = rank(desc(population))) %>%
  filter(Rank < 11) %>%
  select(-Rank, -population)
# Reshape to long format (one row per event type and casualty type) for plotting
Q1_long <- Q1 %>% gather(Type, Total_Population, -EVTYPE)
ggplot(Q1_long, aes(x = reorder(EVTYPE, Total_Population), ymax = 110000, y = Total_Population, fill = Type)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(size = 10)) +
  xlab("Event Type") +
  ylab("Total Population of Fatalities and Injuries") +
  theme(legend.position = "top") +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Top 10 Harmful Weather Events for Population Health") +
  theme(plot.title = element_text(lineheight = .8, face = "bold", size = 15))
*Fig 1 Top 10 Harmful Weather Events for Human Fatality and Injury*
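The statement that FATALITIES and INJURIES contain no missing values or unreasonable entries can be checked before aggregating. The following is a minimal sketch and not part of the original analysis: `toy` is a hypothetical stand-in for `data` with made-up values, and the same calls could be applied to `data[, c("FATALITIES", "INJURIES")]`.

```r
# Hypothetical stand-in for the storm data (values are made up for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(2, 0, 1),
  INJURIES   = c(10, 3, 0)
)

# Count missing values per column; all zeros means nothing is missing
na_counts <- colSums(is.na(toy[, c("FATALITIES", "INJURIES")]))

# A negative casualty count would be an unreasonable value
all_non_negative <- all(toy$FATALITIES >= 0, toy$INJURIES >= 0)

print(na_counts)
print(all_non_negative)
```

If either check fails on the real data, the affected rows would need to be handled before summing by event type.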
The variables related to economic loss are PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
1. PROPDMG: the property damage amount.
2. PROPDMGEXP: the monetary unit (magnitude) of the property damage amount.
3. CROPDMG: the crop damage amount.
4. CROPDMGEXP: the monetary unit (magnitude) of the crop damage amount.
table(data$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      h      H      K      m      M
##      4      5      1     40      1      6 424665      7  11330
table(data$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
Strategy for data cleaning: According to the “Storm Data Documentation”, in PROPDMGEXP and CROPDMGEXP the character “B” means billions, “m” or “M” means millions, “K” or “k” means thousands, and “h” or “H” means hundreds. I create a new column that converts all observations to the same monetary unit.
However, certain symbols, such as “-”, “?”, “+”, and the digits 0 to 8, are not defined in the documentation, so I treat them as missing values and exclude them from the analysis. The code is as follows.
Q2_prop <- data %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP) %>%
  # Convert each amount to dollars; unit codes not handled below become NA
  mutate(New_PROPDMGEXP = case_when(
    PROPDMGEXP == "B" ~ PROPDMG * 1e9,
    PROPDMGEXP == "K" ~ PROPDMG * 1e3,
    PROPDMGEXP %in% c("m", "M") ~ PROPDMG * 1e6,
    PROPDMGEXP %in% c("h", "H") ~ PROPDMG * 1e2,
    TRUE ~ NA_real_)) %>%
  filter(!is.na(New_PROPDMGEXP)) %>%
  group_by(EVTYPE) %>%
  # Total property damage per event type, in millions of dollars
  summarise(PROP_DMG_EXP = sum(New_PROPDMGEXP) / 1e6) %>%
  arrange(desc(PROP_DMG_EXP))
Q2_crop <- data %>%
  select(EVTYPE, CROPDMG, CROPDMGEXP) %>%
  # Convert each amount to dollars; unit codes not handled below become NA
  mutate(New_CROPDMGEXP = case_when(
    CROPDMGEXP == "B" ~ CROPDMG * 1e9,
    CROPDMGEXP %in% c("m", "M") ~ CROPDMG * 1e6,
    CROPDMGEXP %in% c("k", "K") ~ CROPDMG * 1e3,
    TRUE ~ NA_real_)) %>%
  filter(!is.na(New_CROPDMGEXP)) %>%
  group_by(EVTYPE) %>%
  # Total crop damage per event type, in millions of dollars
  summarise(CROP_DMG_EXP = sum(New_CROPDMGEXP) / 1e6) %>%
  arrange(desc(CROP_DMG_EXP))
economics <- Q2_prop %>%
  # left_join keeps every event type in Q2_prop; event types without crop damage get NA
  left_join(Q2_crop, by = "EVTYPE") %>%
  mutate(CROP_DMG_EXP = ifelse(is.na(CROP_DMG_EXP), 0, CROP_DMG_EXP),
         Total_Eco = CROP_DMG_EXP + PROP_DMG_EXP) %>%
  arrange(desc(Total_Eco)) %>%
  # Keep the 10 event types with the largest combined damage
  mutate(Rank = rank(desc(Total_Eco))) %>%
  filter(Rank < 11) %>%
  select(-Rank, -Total_Eco)
# Reshape to long format (one row per event type and damage type) for plotting
gather_Q2 <- economics %>% gather(Type, Total_Exp, -EVTYPE)
ggplot(gather_Q2, aes(x = reorder(EVTYPE, Total_Exp), ymax = 110000, y = Total_Exp, fill = Type)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(size = 10)) +
  xlab("Event Type") +
  ylab("Total Damage Expense (Million Dollars)") +
  theme(legend.position = "top") +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  ggtitle("Total Damage Expense For Top 10 Weather Events") +
  theme(plot.title = element_text(lineheight = .8, face = "bold", size = 15)) +
  scale_y_continuous(breaks = pretty_breaks(n = 10))
*Fig 2 Total Damage Expense of Property and Crop For Top 10 Weather Events*
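The unit conversion described in the data-cleaning strategy can also be expressed as a lookup from unit code to multiplier. The following is a minimal sketch under stated assumptions, not the code used for the figures: `unit_multiplier` and `to_dollars` are hypothetical names, the example amounts are made up, and `toupper()` is used so that lower- and upper-case codes are treated alike.

```r
# Multipliers for the unit codes defined in the documentation
unit_multiplier <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)

# Hypothetical helper: convert an amount and its unit code to dollars.
# Codes that are not in the lookup (for example "", "?" or "5") yield NA.
to_dollars <- function(amount, unit_code) {
  multiplier <- unit_multiplier[toupper(as.character(unit_code))]
  unname(amount * multiplier)
}

# Made-up example values
amount  <- c(25, 2.5, 1, 10, 3)
code    <- c("K", "m", "B", "?", "")
dollars <- to_dollars(amount, code)
print(dollars)
```

For instance, an amount of 25 with code “K” corresponds to 25,000 dollars, while the rows with undefined codes come back as `NA` and can be dropped with `!is.na()`, which matches the treatment of undefined symbols as missing values.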
As can be seen from Figure 1, tornadoes are the most harmful event for human health. They caused 5,633 deaths and 91,346 injuries between 1950 and 2011.
Figure 2 shows the weather events that cause the most property and crop damage. The event “Flood” caused the most damage, at around 150 billion dollars between 1950 and 2011. In general, property damage is higher than crop damage.