In the United States, storms and other severe weather events occur across the whole country and cause substantial public health and economic damage, including fatalities, injuries, and property damage. This report analyses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks severe weather events in the United States, including storms.
The results of the analysis are presented in the following sections through figures. They show the scale of the impact of the highest-ranking severe weather events on public health and the economy.
The analysis of the public health data indicates that tornadoes are the most harmful severe weather event, causing by far the largest numbers of fatalities and injuries, while floods cause the largest property damage and, among the named event groups, the largest crop damage.
The data analysis must address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? Across the United States, which types of events have the greatest economic consequences?
Here is the main data pre-processing procedure.
#Downloading the data file if it is not already present
filename<-"repdata_data_StormData.csv.bz2"
if(!file.exists(filename)){
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,filename,method = "curl")
}
#Loading the raw data
storm_data <- read.csv(bzfile(filename))
weather_event <- as.data.frame(storm_data)
#Loading the relevant packages
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Quoting from the instructions of Course Project 2 on Coursera: “The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.” The size of this file is 47MB.
This file includes data that start in 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
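One way to check the claim about sparse early records is to count events per year. The sketch below uses a small made-up vector in the same text format as BGN_DATE; the vector `bgn_date` and its values are illustrative assumptions, not records taken from the database.

```r
# Toy dates in the BGN_DATE format ("month/day/year hour:minute:second").
# These values are illustrative only.
bgn_date <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
              "6/8/1951 0:00:00", "11/28/2011 0:00:00")

# Parse the text into Date objects, then pull out the year as an integer.
parsed <- as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S")
event_year <- as.integer(format(parsed, "%Y"))

# Count the number of records per year.
events_per_year <- table(event_year)
events_per_year
```

Applied to the full BGN_DATE column, the same steps would give a per-year count that could be plotted to see how coverage changes over time.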
The relevant documentation is also linked in the instructions, and some of it is available on the database website. It shows how some of the variables are constructed or defined.
Based on the loaded data, the cleaning process starts with the following code.
#Checking the whole data
str(weather_event)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
dim(weather_event)
## [1] 902297 37
#Subsetting the useful data fragment
weather_impact <- select(weather_event,c("EVTYPE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP"))
head(weather_impact)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
tail(weather_impact)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 902292 WINTER WEATHER 0 0 0 K 0 K
## 902293 HIGH WIND 0 0 0 K 0 K
## 902294 HIGH WIND 0 0 0 K 0 K
## 902295 HIGH WIND 0 0 0 K 0 K
## 902296 BLIZZARD 0 0 0 K 0 K
## 902297 HEAVY SNOW 0 0 0 K 0 K
#check the missing value in the subset
sum(is.na(weather_impact$FATALITIES))
## [1] 0
sum(is.na(weather_impact$INJURIES))
## [1] 0
sum(is.na(weather_impact$PROPDMG))
## [1] 0
sum(is.na(weather_impact$PROPDMGEXP))
## [1] 0
sum(is.na(weather_impact$CROPDMG))
## [1] 0
sum(is.na(weather_impact$CROPDMGEXP))
## [1] 0
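The per-column checks above can also be written as a single call. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration); it also counts blank strings separately, since an empty string such as those seen in CROPDMGEXP is not an NA.

```r
# Toy data frame standing in for weather_impact (values are made up).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(0, NA, 2),
  CROPDMGEXP = c("K", "", "M")
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))
na_counts

# Blank strings are not NA, so they need their own check.
blank_count <- sum(toy$CROPDMGEXP == "")
blank_count
```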
#Checking the 20 most frequent event types
sort(table(weather_impact$EVTYPE),decreasing=TRUE)[1:20]
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE
## 3392 2761
Here I rearrange and group the weather events by frequently occurring keywords such as “HAIL”, “WIND”, “FLOOD” and “RAIN”. The grouped event types are stored in a new variable called BRIEF_EVENT, which organises the data into a small number of categories. The category “OTHERS” covers every event type that matches none of the keywords. Because the assignments run in sequence, an event type that matches several keywords keeps the last match; for example, “THUNDERSTORM WIND” ends up in STORM and “WINTER STORM” in WINTER.
#Weather events re-group
weather_impact$BRIEF_EVENT<-"OTHERS"
weather_impact$BRIEF_EVENT[grep("HAIL", weather_impact$EVTYPE, ignore.case = TRUE)] <- "HAIL"
weather_impact$BRIEF_EVENT[grep("FLOOD", weather_impact$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
weather_impact$BRIEF_EVENT[grep("WIND", weather_impact$EVTYPE, ignore.case = TRUE)] <- "WIND"
weather_impact$BRIEF_EVENT[grep("STORM", weather_impact$EVTYPE, ignore.case = TRUE)] <- "STORM"
weather_impact$BRIEF_EVENT[grep("TORNADO", weather_impact$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
weather_impact$BRIEF_EVENT[grep("LIGHTNING", weather_impact$EVTYPE, ignore.case = TRUE)] <- "LIGHTNING"
weather_impact$BRIEF_EVENT[grep("SNOW", weather_impact$EVTYPE, ignore.case = TRUE)] <- "SNOW"
weather_impact$BRIEF_EVENT[grep("RAIN", weather_impact$EVTYPE, ignore.case = TRUE)] <- "RAIN"
weather_impact$BRIEF_EVENT[grep("HEAT", weather_impact$EVTYPE, ignore.case = TRUE)] <- "HEAT"
weather_impact$BRIEF_EVENT[grep("WINTER", weather_impact$EVTYPE, ignore.case = TRUE)] <- "WINTER"
#Note: this sorts the labels themselves, so only the alphabetically last group appears
sort(weather_impact$BRIEF_EVENT,decreasing = TRUE)[1:20]
## [1] "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER"
## [9] "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER" "WINTER"
## [17] "WINTER" "WINTER" "WINTER" "WINTER"
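The re-grouping logic can be sketched on a handful of event names to show how the keyword order decides the final group, and how the groups can be tabulated rather than sorted. The vector `evtype` below holds a few EVTYPE values from the frequency table above; the loop is an assumed compact equivalent of the line-by-line assignments, not the code used in the report.

```r
# A few EVTYPE values that appear in the frequency table above.
evtype <- c("HAIL", "TSTM WIND", "THUNDERSTORM WIND", "WINTER STORM",
            "FLASH FLOOD", "WATERSPOUT")

# Same keyword order as in the report; a later match overwrites an earlier one.
keywords <- c("HAIL", "FLOOD", "WIND", "STORM", "TORNADO",
              "LIGHTNING", "SNOW", "RAIN", "HEAT", "WINTER")

# Start everything as "OTHERS", then apply each keyword in turn.
brief_event <- rep("OTHERS", length(evtype))
for (kw in keywords) {
  brief_event[grepl(kw, evtype, ignore.case = TRUE)] <- kw
}

# Show the mapping, then count the records per group.
data.frame(evtype, brief_event)
group_counts <- sort(table(brief_event), decreasing = TRUE)
group_counts
```

Note that “TSTM WIND” stays in WIND while “THUNDERSTORM WIND” moves to STORM, because only the latter contains the keyword “STORM”.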
#Economic damage re-group
#check the contents distributions in PROPDMGEXP and CROPDMGEXP variable
sort(table(weather_impact$PROPDMGEXP),decreasing = TRUE)
##
## K M 0 B 5 1 2 ? m H
## 465934 424665 11330 216 40 28 25 13 8 7 6
## + 7 3 4 6 - 8 h
## 5 5 4 4 4 1 1 1
sort(table(weather_impact$CROPDMGEXP),decreasing = TRUE)
##
## K M k 0 B ? 2 m
## 618413 281832 1994 21 19 9 7 1 1
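According to the Storm Data Documentation, the characters in the PROPDMGEXP and CROPDMGEXP variables have the following meaning: “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions”. In the following sub-section, I convert the damage data to a common unit (US dollars). Any other code, such as a digit, “H”, “+”, “-”, “?” or a blank, is treated as an exponent of 0, so the recorded value is left unscaled.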
weather_impact$PROPDMGEXP <- as.character(weather_impact$PROPDMGEXP)
weather_impact$PROPDMGEXP[is.na(weather_impact$PROPDMGEXP)] <- 0
weather_impact$PROPDMGEXP[!grepl("K|M|B", weather_impact$PROPDMGEXP, ignore.case = TRUE)] <- 0
weather_impact$PROPDMGEXP[grep("K", weather_impact$PROPDMGEXP, ignore.case = TRUE)] <- "3"
weather_impact$PROPDMGEXP[grep("M", weather_impact$PROPDMGEXP, ignore.case = TRUE)] <- "6"
weather_impact$PROPDMGEXP[grep("B", weather_impact$PROPDMGEXP, ignore.case = TRUE)] <- "9"
weather_impact$PROPDMGEXP <- as.numeric(as.character(weather_impact$PROPDMGEXP))
weather_impact$PROPERTY <- weather_impact$PROPDMG * 10^weather_impact$PROPDMGEXP
weather_impact$CROPDMGEXP <- as.character(weather_impact$CROPDMGEXP)
weather_impact$CROPDMGEXP[is.na(weather_impact$CROPDMGEXP)] <- 0
weather_impact$CROPDMGEXP[!grepl("K|M|B", weather_impact$CROPDMGEXP, ignore.case = TRUE)] <- 0
weather_impact$CROPDMGEXP[grep("K", weather_impact$CROPDMGEXP, ignore.case = TRUE)] <- "3"
weather_impact$CROPDMGEXP[grep("M", weather_impact$CROPDMGEXP, ignore.case = TRUE)] <- "6"
weather_impact$CROPDMGEXP[grep("B", weather_impact$CROPDMGEXP, ignore.case = TRUE)] <- "9"
weather_impact$CROPDMGEXP <- as.numeric(as.character(weather_impact$CROPDMGEXP))
weather_impact$CROP <- weather_impact$CROPDMG * 10^weather_impact$CROPDMGEXP
sort(table(weather_impact$PROPERTY),decreasing = TRUE)[1:20]
##
## 0 5000 10000 1000 2000 25000 50000 3000 20000 15000
## 663123 31731 21787 17544 17186 17104 13596 10364 9179 8617
## 250000 500 1e+05 2500 30000 5e+05 4000 8000 75000 2500000
## 8439 6707 6302 5807 4391 4000 3202 2877 2419 2411
sort(table(weather_impact$CROP),decreasing = TRUE)[1:20]
##
## 0 5000 10000 50000 1e+05 1000 2000 25000 20000 5e+05 15000
## 880198 4097 2349 1984 1233 956 951 830 758 721 598
## 500 3000 250000 2e+05 1e+06 30000 75000 150000 3e+05
## 568 554 513 479 447 317 290 268 250
This section presents the data analysis.
#Clean and calculate the data of Public Health variables
#Calculate the total fatalities and injuries numbers
total_health_loss<-ddply(weather_impact, .(BRIEF_EVENT), summarize,
Total = sum(FATALITIES + INJURIES, na.rm = TRUE))
total_health_loss$Type <- "Fatalities and Injuries"
#Fatalities
Fatalities <- ddply(weather_impact, .(BRIEF_EVENT), summarize,
Total = sum(FATALITIES, na.rm = TRUE))
Fatalities$Type <- "Fatalities"
#Injuries
Injuries <- ddply(weather_impact, .(BRIEF_EVENT), summarize,
Total = sum(INJURIES, na.rm = TRUE))
Injuries$Type <- "Injuries"
#Health damage
Health_damage <- rbind(Fatalities,Injuries)
Health_Event<-join(Fatalities, Injuries, by="BRIEF_EVENT")
Health_Event
## BRIEF_EVENT Total Type Total Type
## 1 FLOOD 1524 Fatalities 8602 Injuries
## 2 HAIL 15 Fatalities 1371 Injuries
## 3 HEAT 3138 Fatalities 9224 Injuries
## 4 LIGHTNING 817 Fatalities 5232 Injuries
## 5 OTHERS 1809 Fatalities 6993 Injuries
## 6 RAIN 114 Fatalities 305 Injuries
## 7 SNOW 164 Fatalities 1164 Injuries
## 8 STORM 416 Fatalities 5338 Injuries
## 9 TORNADO 5661 Fatalities 91407 Injuries
## 10 WIND 1209 Fatalities 9001 Injuries
## 11 WINTER 278 Fatalities 1891 Injuries
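The joined table above ends up with two columns named Total and two named Type. As an alternative sketch, the two measures can be given their own column names before merging, which also makes it easy to rank the groups by combined casualties. The figures below are copied from the table above for four of the groups; the data frame names are hypothetical.

```r
# Totals copied from the table above for four of the groups.
fatalities <- data.frame(BRIEF_EVENT = c("FLOOD", "HEAT", "TORNADO", "WIND"),
                         Fatalities  = c(1524, 3138, 5661, 1209))
injuries   <- data.frame(BRIEF_EVENT = c("FLOOD", "HEAT", "TORNADO", "WIND"),
                         Injuries    = c(8602, 9224, 91407, 9001))

# Merge on the group label so each measure keeps its own column name.
health <- merge(fatalities, injuries, by = "BRIEF_EVENT")
health$Casualties <- health$Fatalities + health$Injuries

# Rank the groups from most to least harmful.
health <- health[order(health$Casualties, decreasing = TRUE), ]
health
```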
#Clean and calculate the data of economic variables
# total economic damage
total_economic_damage <- ddply(weather_impact, .(BRIEF_EVENT),
summarize, Total = sum(PROPERTY + CROP, na.rm = TRUE))
total_economic_damage$Type <- "Property and Crop Damage"
# Property Damage
PROPERTY <- ddply(weather_impact, .(BRIEF_EVENT),
summarize, Total = sum(PROPERTY, na.rm = TRUE))
PROPERTY$Type <- "Property"
# Crop Damage
CROP <- ddply(weather_impact, .(BRIEF_EVENT),
summarize, Total = sum(CROP, na.rm = TRUE))
CROP$Type <- "crop"
# Economic Damage
Economic_Damage <- rbind(PROPERTY, CROP)
Economic_Event <- join(PROPERTY, CROP,by="BRIEF_EVENT")
Economic_Event
## BRIEF_EVENT Total Type Total Type
## 1 FLOOD 167502193929 Property 12266906100 crop
## 2 HAIL 15733043048 Property 3046837473 crop
## 3 HEAT 20325750 Property 904469280 crop
## 4 LIGHTNING 933974947 Property 12097090 crop
## 5 OTHERS 96313022890 Property 23576788780 crop
## 6 RAIN 3270230192 Property 919315800 crop
## 7 SNOW 1024169752 Property 134683100 crop
## 8 STORM 66304209893 Property 6374469888 crop
## 9 TORNADO 58593098029 Property 417461520 crop
## 10 WIND 10847086618 Property 1403719150 crop
## 11 WINTER 6777295251 Property 47444000 crop
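Dollar totals of this size are easier to compare in billions. The sketch below takes the property and crop totals from the table above for the four largest groups, adds them, and ranks the groups; the data frame name and the TotalBillion column are assumptions for illustration.

```r
# Totals copied from the table above for four of the groups (US dollars).
economic <- data.frame(
  BRIEF_EVENT = c("FLOOD", "OTHERS", "STORM", "TORNADO"),
  Property    = c(167502193929, 96313022890, 66304209893, 58593098029),
  Crop        = c(12266906100, 23576788780, 6374469888, 417461520)
)

# Combine both damage types and express the result in billions of dollars.
economic$TotalBillion <- round((economic$Property + economic$Crop) / 1e9, 1)

# Rank the groups by combined damage.
economic <- economic[order(economic$TotalBillion, decreasing = TRUE), ]
economic[, c("BRIEF_EVENT", "TotalBillion")]
```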
Following the project goals, the results of this analysis answer two questions: 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?
For Question 1, the following plot is presented.
#Plot the health damage
library(ggplot2)
Health_damage$BRIEF_EVENT <- as.factor(Health_damage$BRIEF_EVENT)
HP <- ggplot(Health_damage,aes(x=BRIEF_EVENT,y=Total,fill=Type))+ geom_bar(stat = "identity") +
coord_flip() +
labs(x="Event",y="Number of Fatalities and Injuries",title="Severe Weather Impacts on Public Health") +
theme_bw() +
#theme() must come after theme_bw(), otherwise the complete theme resets the title alignment
theme(plot.title = element_text(hjust = 1))
HP
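By default the bars follow the alphabetical order of the factor levels. If a ranked chart is preferred, the levels can be reordered by total before plotting. This is a minimal base R sketch, assuming a data frame named `totals` whose values are the summed fatalities and injuries for four groups from the health table above.

```r
# Combined casualty totals for four groups (fatalities + injuries).
totals <- data.frame(
  BRIEF_EVENT = c("FLOOD", "HEAT", "TORNADO", "WIND"),
  Total       = c(10126, 12362, 97068, 10210)
)

# reorder() sets the factor levels by Total in ascending order, so a flipped
# bar chart would list the largest group at the top.
totals$BRIEF_EVENT <- reorder(factor(totals$BRIEF_EVENT), totals$Total)
levels(totals$BRIEF_EVENT)
```

A data frame prepared this way can be passed to `ggplot()` in place of the alphabetically ordered one.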
For Question 2, the following plot is offered.
#Plot the economic damage
Economic_Damage$BRIEF_EVENT <- as.factor(Economic_Damage$BRIEF_EVENT)
EP <- ggplot(Economic_Damage,aes(x=BRIEF_EVENT,y=Total,fill=Type))+ geom_bar(stat = "identity") +
coord_flip() +
labs(x="Event",y="Amount of Dollar Loss",title="Severe Weather Impacts on Economy") +
theme_bw() +
#theme() must come after theme_bw(), otherwise the complete theme resets the title alignment
theme(plot.title = element_text(hjust = 1))
EP
Based on these results, tornadoes are the most harmful severe weather events for population health, causing the most casualties. Floods cause the largest combined property and crop damage, threatening economic development.