Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project examines the damaging effects of severe weather conditions (e.g. hurricanes, tornadoes, thunderstorms etc.) on human populations and the econonomy in the U.S. for 1950-2010 period. The goal of the project is to highlight the severe weather events that produce the most impacts on population health and economic damages.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, which type of event, as well as the estimates of relevant fatalities, injuries, and property damage.
The dataset used in this project is provided by the U.S. National Oceanic and Atmospheric Administration (NOAA) and you could download it in the below link.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb] (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) National Climatic Data Center Storm Events FAQ The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
echo = TRUE # Always make code visible
options(scipen = 1) # Turn off scientific notations for numbers
library(grid)
library(plyr)
library(ggplot2)
library(data.table)
library(gridExtra)
setwd("~/Desktop/Coursera/D5_Reproducible Research/Week 2/Project_02/")
if (!"StormData.csv.bz2" %in% dir("./")) {
print("Loading File.....Please wait")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2")
}
if (!"StormData" %in% ls()){
StormData <- read.csv("StormData.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
}
dim(StormData)
## [1] 902297 37
names(StormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
DT_StromData<- data.table(StormData)
str(DT_StromData)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
if (dim(StormData)[2] == 37) {
StormData$Year <- as.numeric(format(as.Date(StormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
}
hist(StormData$Year,breaks = 50 ,
main = "Histogram of StormData by Year",
xlab = "Year",
ylab = "Frequency")
Based on this histogram, we could clearly observe that the number of events rise steadily from 1950, and start to rise rapidly around 1995 to 2012 (second bar next to 1990 from the histogram, scale = 2 Years). Thus, it’s approprite to use the subset data [1995, 2010] for our analsis.
StormData_ST <- StormData[StormData$Year >= 1995, ]
dim(StormData_ST)
## [1] 681500 38
We now narrow down our dataset to 681500 row and 38 columns
names(StormData_ST)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM" "Year"
Consult the Strom Data Documentation file, we could further filter the variables that affected most by Weather event. We keep only six columns (variables) for our analysis, see as below: BGN_DATE, EVTYPE, FATALITIES, INJURIES,PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP
Filtered_Col <- c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
Filtered_StormData <- StormData_ST[Filtered_Col]
Filtered_StormData$Year <- as.numeric(format(as.Date(Filtered_StormData$BGN_DATE,format = "%m/%d/%Y %H:%M:%S"), "%Y"))
unique(Filtered_StormData$PROPDMGEXP)
## [1] "" "B" "M" "K" "m" "+" "0" "5" "6" "?" "4" "2" "3" "7" "H" "-" "1"
## [18] "8"
Filtered_StormData$PROPDMGEXP <- as.character(Filtered_StormData$PROPDMGEXP)
Filtered_StormData$PROPDMGEXP[Filtered_StormData$PROPDMGEXP == 'B'] <- "9"
Filtered_StormData$PROPDMGEXP[Filtered_StormData$PROPDMGEXP == 'M'] <- "6"
Filtered_StormData$PROPDMGEXP[Filtered_StormData$PROPDMGEXP == 'K'] <- "3"
Filtered_StormData$PROPDMGEXP[Filtered_StormData$PROPDMGEXP == 'H'] <- "2"
Filtered_StormData$PROPDMGEXP[Filtered_StormData$PROPDMGEXP == ' '] <- "0"
Filtered_StormData$PROPDMGEXP<- as.numeric(Filtered_StormData$PROPDMGEXP)
## Warning: NAs introduced by coercion
Filtered_StormData$PROPDMGEXP[is.na(Filtered_StormData$PROPDMGEXP)] <- 0
Filtered_StormData$Total_PROPDMG <- Filtered_StormData$PROPDMG * 10^Filtered_StormData$PROPDMGEXP
unique(Filtered_StormData$CROPDMGEXP)
## [1] "" "M" "m" "K" "B" "?" "0" "k" "2"
Filtered_StormData$CROPDMGEXP <- as.character(Filtered_StormData$CROPDMGEXP)
Filtered_StormData$CROPDMGEXP[Filtered_StormData$CROPDMGEXP == 'B'] <- "9"
Filtered_StormData$CROPDMGEXP[Filtered_StormData$CROPDMGEXP == 'M'] <- "6"
Filtered_StormData$CROPDMGEXP[Filtered_StormData$CROPDMGEXP == 'K'] <- "3"
Filtered_StormData$CROPDMGEXP[Filtered_StormData$CROPDMGEXP == 'H'] <- "2"
Filtered_StormData$CROPDMGEXP[Filtered_StormData$CROPDMGEXP == ' '] <- "0"
Filtered_StormData$CROPDMGEXP<- as.numeric(Filtered_StormData$CROPDMGEXP)
## Warning: NAs introduced by coercion
Filtered_StormData$CROPDMGEXP[is.na(Filtered_StormData$CROPDMGEXP)] <- 0
Filtered_StormData$Total_CROPDMG <- Filtered_StormData$CROPDMG * 10^Filtered_StormData$CROPDMGEXP
head(Filtered_StormData)
## BGN_DATE EVTYPE FATALITIES INJURIES
## 187560 1/6/1995 0:00:00 FREEZING RAIN 0 0
## 187561 1/22/1995 0:00:00 SNOW 0 0
## 187563 2/6/1995 0:00:00 SNOW/ICE 0 0
## 187565 2/11/1995 0:00:00 SNOW/ICE 0 0
## 187566 10/4/1995 0:00:00 HURRICANE OPAL/HIGH WINDS 2 0
## 187575 3/20/1995 0:00:00 HAIL 0 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP Year Total_PROPDMG
## 187560 0.0 0 0 0 1995 0e+00
## 187561 0.0 0 0 0 1995 0e+00
## 187563 0.0 0 0 0 1995 0e+00
## 187565 0.0 0 0 0 1995 0e+00
## 187566 0.1 9 10 6 1995 1e+08
## 187575 0.0 0 0 0 1995 0e+00
## Total_CROPDMG
## 187560 0e+00
## 187561 0e+00
## 187563 0e+00
## 187565 0e+00
## 187566 1e+07
## 187575 0e+00
names(Filtered_StormData)
## [1] "BGN_DATE" "EVTYPE" "FATALITIES" "INJURIES"
## [5] "PROPDMG" "PROPDMGEXP" "CROPDMG" "CROPDMGEXP"
## [9] "Year" "Total_PROPDMG" "Total_CROPDMG"
Fatalities <- aggregate(FATALITIES ~ EVTYPE, data = Filtered_StormData, FUN = sum)
Top10_Fatalities <- Fatalities[order(-Fatalities$FATALITIES), ][1:10, ] # Remove decreasing = T
Top10_Fatalities
## EVTYPE FATALITIES
## 112 EXCESSIVE HEAT 1903
## 666 TORNADO 1545
## 134 FLASH FLOOD 934
## 231 HEAT 924
## 358 LIGHTNING 729
## 144 FLOOD 423
## 461 RIP CURRENT 360
## 288 HIGH WIND 241
## 683 TSTM WIND 241
## 16 AVALANCHE 223
Injuries <- aggregate(INJURIES ~ EVTYPE, data = Filtered_StormData, FUN = sum)
Top10_Injuries <- Injuries[order(-Injuries$INJURIES), ][1:10, ] # Remove decreasing = T
Top10_Injuries
## EVTYPE INJURIES
## 666 TORNADO 21765
## 144 FLOOD 6769
## 112 EXCESSIVE HEAT 6525
## 358 LIGHTNING 4631
## 683 TSTM WIND 3630
## 231 HEAT 2030
## 134 FLASH FLOOD 1734
## 607 THUNDERSTORM WIND 1426
## 787 WINTER STORM 1298
## 313 HURRICANE/TYPHOON 1275
propdmg <- aggregate(Total_PROPDMG ~ EVTYPE, data = Filtered_StormData, FUN = sum)
Top10_Propdmg <- propdmg[order(-propdmg$Total_PROPDMG), ][1:10, ] # Remove decreasing = T
Top10_Propdmg
## EVTYPE Total_PROPDMG
## 144 FLOOD 144022037057
## 313 HURRICANE/TYPHOON 69305840000
## 519 STORM SURGE 43193536000
## 666 TORNADO 24925439556
## 134 FLASH FLOOD 16047794571
## 206 HAIL 15043822108
## 306 HURRICANE 11812819010
## 677 TROPICAL STORM 7653335550
## 288 HIGH WIND 5259785375
## 773 WILDFIRE 4759064000
cropdmg <- aggregate(Total_CROPDMG ~ EVTYPE, data = Filtered_StormData, FUN = sum)
Top10_Cropdmg <- cropdmg[order(-cropdmg$Total_CROPDMG), ][1:10, ] # Remove decreasing = T
Top10_Cropdmg
## EVTYPE Total_CROPDMG
## 84 DROUGHT 13922066000
## 144 FLOOD 5422810400
## 306 HURRICANE 2741410000
## 206 HAIL 2613777420
## 313 HURRICANE/TYPHOON 2607872800
## 134 FLASH FLOOD 1343915000
## 121 EXTREME COLD 1292473000
## 179 FROST/FREEZE 1094086000
## 241 HEAVY RAIN 728399800
## 677 TROPICAL STORM 677836000
ggplot(data = Top10_Fatalities , aes(x = reorder(EVTYPE, FATALITIES), y = FATALITIES)) +
geom_bar(stat = "identity", fill = "red") + coord_flip() +
ggtitle("Top 10 Fatalitie Damages of Weather Event Types in USA") +
labs(x = "Weather Event Types", y = "Total Fatalitie Damage Expenses (US$) ")
### Plot of Top 10 Injuries of Weather Event Types
ggplot(data = Top10_Injuries, aes(x = reorder(EVTYPE, INJURIES), y = INJURIES)) +
geom_bar(stat = "identity", fill = "red") + coord_flip() +
ggtitle("Top 10 Property Damages of Weather Event Types in USA") +
labs(x = "Weather Event Types", y = "Total Property Damage Expenses (US$) ")
# Ok try another way, see wether it works if I merge two dataset
# Merge Fatalities dataset with Injuries dataset into Health Damage Expense
Top10_HealthDmg <- merge(x = Top10_Fatalities, y = Top10_Injuries, by = "EVTYPE", all = TRUE)
Top10_HealthDmg <- melt(Top10_HealthDmg, id.vars = 'EVTYPE')
Top10_HealthDmg
## EVTYPE variable value
## 1 AVALANCHE FATALITIES 223
## 2 EXCESSIVE HEAT FATALITIES 1903
## 3 FLASH FLOOD FATALITIES 934
## 4 FLOOD FATALITIES 423
## 5 HEAT FATALITIES 924
## 6 HIGH WIND FATALITIES 241
## 7 HURRICANE/TYPHOON FATALITIES NA
## 8 LIGHTNING FATALITIES 729
## 9 RIP CURRENT FATALITIES 360
## 10 THUNDERSTORM WIND FATALITIES NA
## 11 TORNADO FATALITIES 1545
## 12 TSTM WIND FATALITIES 241
## 13 WINTER STORM FATALITIES NA
## 14 AVALANCHE INJURIES NA
## 15 EXCESSIVE HEAT INJURIES 6525
## 16 FLASH FLOOD INJURIES 1734
## 17 FLOOD INJURIES 6769
## 18 HEAT INJURIES 2030
## 19 HIGH WIND INJURIES NA
## 20 HURRICANE/TYPHOON INJURIES 1275
## 21 LIGHTNING INJURIES 4631
## 22 RIP CURRENT INJURIES NA
## 23 THUNDERSTORM WIND INJURIES 1426
## 24 TORNADO INJURIES 21765
## 25 TSTM WIND INJURIES 3630
## 26 WINTER STORM INJURIES 1298
ggplot(Top10_HealthDmg, aes(EVTYPE, value)) +
xlab("Event Type") +
ylab("Damage Expenses (US $)") +
ggtitle("Impact on Top-10 Health Damage Per Event Type") +
geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
We could observe from the histograms,from 1995-2010, the Excessive hear and Tornade are the main cause of Fatalities; and Tornado is the main cause of Injuries in the US, which both have the highest figures.
Top10_Propdmg_Plot <- ggplot(data = Top10_Propdmg, aes(x = reorder(EVTYPE, Total_PROPDMG), y = Total_PROPDMG)) +
geom_bar(stat = "identity", fill = "steel blue") + coord_flip() +
ggtitle("Top 10 Property Damages of Weather Event Types in USA") +
labs(x = "Weather Event Types", y = "Total Property Damage Expenses (US$) ")
Top10_Propdmg_Plot
Top10_Cropdmg_Plot <- ggplot(data = Top10_Cropdmg, aes(x = reorder(EVTYPE, Total_CROPDMG),
y = Total_CROPDMG)) +
geom_bar(stat = "identity", fill = "red") + coord_flip() +
ggtitle("Top 10 Crop Damages of Weather Event Types in USA") +
labs(x = "Weather Event Types", y = "Total Crop Damage Expenses (US$) ")
Top10_Cropdmg_Plot
Top10_EconomicDmg <- merge(x = Top10_Propdmg, y = Top10_Cropdmg, by = "EVTYPE", all = TRUE)
Top10_EconomicDmg <- melt(Top10_EconomicDmg, id.vars = 'EVTYPE')
# Top10_EconomicDmg
ggplot(Top10_EconomicDmg, aes(EVTYPE, value)) +
xlab("Event Weather Type") +
ylab("Damage Expenses (US $)") +
ggtitle("Impact on Top-10 Ecomomic Damage Per Event Type") +
geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
According to the aboved histograms,from 1995-2010, the Flood and Hurricane/Typhoon lead to most of Property damages; and Flood cause most of Crop damages in the US, which both have the highest damage expenses figures.
From the cleaned data and merge graphics, we could conclure that Excessive Heat and tornado are most harmful to the population health, while Flood, Drought, and Hurricane/Typhoon have the greatest impact on economic damages .