This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to analyze the impact of storms and other severe weather events on population health and the economy in the USA. We compared the total number of injuries and fatalities caused by each event type and found that tornadoes had the biggest impact on population health. We also calculated the total property and crop damage caused by each event type and found that floods caused the greatest economic loss.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
We first download the data in the bz2 format.
# Download the bzip2-compressed file (skipped if it already exists locally)
url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("repdata_data_StormData.csv.bz2")){
download.file(url, destfile = "repdata_data_StormData.csv.bz2", method="curl")
}
We then read in the storm data from the csv file included in the bz2 file.
storm <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE)
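No separate decompression step is needed here because read.csv opens the file through a connection that detects bzip2 compression on its own. The following is a minimal sketch of that behaviour on a made-up two-row file; the temporary file and its contents are assumptions for illustration and are not part of the storm data.

```r
# Toy round trip: write a small CSV through a bzip2 connection, then read it
# back by file name alone. The file and its contents are made up.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# read.csv detects the compression from the file contents
toy <- read.csv(tmp, header = TRUE)
dim(toy)  # 2 rows, 2 columns
```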
After reading the data, we check the dimensions of the dataset.
dim(storm)
## [1] 902297 37
We then check the basic structure and the first few rows of the dataset.
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
We select EVTYPE (the event type) together with two health variables, FATALITIES and INJURIES (the approximate numbers of fatalities and injuries), and four economic variables: PROPDMG and PROPDMGEXP (property damage and its unit), and CROPDMG and CROPDMGEXP (crop damage and its unit).
We load the required libraries and select these columns from the original dataset.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
storm_damage<-select(storm, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
We then check whether there are missing values for FATALITIES, INJURIES, PROPDMG, and CROPDMG.
sum(is.na(storm_damage$FATALITIES)) ## check NA for fatalities
## [1] 0
sum(is.na(storm_damage$INJURIES))##check NA for injuries
## [1] 0
sum(is.na(storm_damage$PROPDMG))##check NA for property damage
## [1] 0
sum(is.na(storm_damage$CROPDMG))##check NA for crop damage
## [1] 0
There are no missing values in the data.
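The four checks above can also be run in one call. This is a minimal sketch on a made-up data frame, assuming the columns of interest are numeric; the name `toy` and its values are hypothetical.

```r
# Made-up data with one missing value, for illustration only
toy <- data.frame(FATALITIES = c(0, 1, NA), INJURIES = c(15, 0, 2))

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```

Applied to the real data, the same call would be `colSums(is.na(storm_damage[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")]))`.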
We group the dataset by EVTYPE and compute totals for each event type. We first sum FATALITIES and INJURIES for each EVTYPE and save the result in a new data frame (a tibble), health_damage.
storm_damage<-group_by(storm_damage, EVTYPE)
health_damage<-storm_damage%>%summarise(fatality_sum=sum(FATALITIES), injury_sum=sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
head(health_damage)
## # A tibble: 6 x 3
## EVTYPE fatality_sum injury_sum
## <fct> <dbl> <dbl>
## 1 " HIGH SURF ADVISORY" 0 0
## 2 " COASTAL FLOOD" 0 0
## 3 " FLASH FLOOD" 0 0
## 4 " LIGHTNING" 0 0
## 5 " TSTM WIND" 0 0
## 6 " TSTM WIND (G45)" 0 0
Before computing the economic damage, we check which unit codes appear in PROPDMGEXP and CROPDMGEXP.
table(storm_damage$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5      6
## 465934      1      8      5    216     25     13      4      4     28      4
##      7      8      B      h      H      K      m      M
##      5      1     40      1      6 424665      7  11330
table(storm_damage$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
We convert each damage value to dollars by multiplying it by the factor that its unit code encodes: h (or H) for 10^2, k (or K) for 10^3, m (or M) for 10^6, and B for 10^9. A digit code n is read as 10^n, and “-”, “?”, “+”, and blank codes are read as 10^0, so those values are left unchanged.
test<- storm_damage
test[grepl("B", test$PROPDMGEXP),]$PROPDMG<- test[grepl("B", test$PROPDMGEXP),]$PROPDMG*(10^9)
test[grepl("[Hh]", test$PROPDMGEXP),]$PROPDMG<- test[grepl("[Hh]", test$PROPDMGEXP),]$PROPDMG*(100)
test[grepl("[Mm]", test$PROPDMGEXP),]$PROPDMG<- test[grepl("[Mm]", test$PROPDMGEXP),]$PROPDMG*(10^6)
test[grepl("[Kk]", test$PROPDMGEXP),]$PROPDMG<- test[grepl("[Kk]", test$PROPDMGEXP),]$PROPDMG*(1000)
for (i in c("1", "2", "3","4", "5","6", "7", "8")) {
test[grepl(i, test$PROPDMGEXP),]$PROPDMG<- test[grepl(i, test$PROPDMGEXP),]$PROPDMG*(10^(as.numeric(i)))
}
test[grepl("B", test$CROPDMGEXP),]$CROPDMG<- test[grepl("B", test$CROPDMGEXP),]$CROPDMG*(10^9)
test[grepl("[Mm]", test$CROPDMGEXP),]$CROPDMG<- test[grepl("[Mm]", test$CROPDMGEXP),]$CROPDMG*(10^6)
test[grepl("[Kk]", test$CROPDMGEXP),]$CROPDMG<- test[grepl("[Kk]", test$CROPDMGEXP),]$CROPDMG*(1000)
test[grepl("2", test$CROPDMGEXP),]$CROPDMG<- test[grepl("2", test$CROPDMGEXP),]$CROPDMG*(100)
storm_damage<-test
rm(test)
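The same conversion rule can be written as a single vectorized lookup, which keeps the rule in one place. This is only a sketch under the rule stated above; `exp_multiplier` is a hypothetical helper and is not part of the analysis code.

```r
# Hypothetical helper: map each exponent code to its numeric multiplier.
exp_multiplier <- function(code) {
  code <- toupper(as.character(code))           # treat "k" and "K" alike
  letters_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- rep(1, length(code))                  # "", "-", "?", "+" stay at 10^0
  is_letter <- code %in% names(letters_map)
  mult[is_letter] <- letters_map[code[is_letter]]
  is_digit <- grepl("^[0-9]$", code)            # a digit n means 10^n
  mult[is_digit] <- 10^as.numeric(code[is_digit])
  mult
}

exp_multiplier(c("K", "m", "B", "h", "2", "", "?"))
```

With such a helper, the property damage in dollars would be `PROPDMG * exp_multiplier(PROPDMGEXP)`, and likewise for the crop columns.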
We create a new variable, totalDM, as the sum of property damage and crop damage, and save the total damage for each event type in ecomony_damage.
storm_damage$totalDM<-storm_damage$PROPDMG+storm_damage$CROPDMG
ecomony_damage<-storm_damage%>%summarise(damage_sum=sum(totalDM))
## `summarise()` ungrouping output (override with `.groups` argument)
head(ecomony_damage)
## # A tibble: 6 x 2
## EVTYPE damage_sum
## <fct> <dbl>
## 1 " HIGH SURF ADVISORY" 200000
## 2 " COASTAL FLOOD" 0
## 3 " FLASH FLOOD" 50000
## 4 " LIGHTNING" 0
## 5 " TSTM WIND" 8100000
## 6 " TSTM WIND (G45)" 8000
The factor variable EVTYPE has 985 levels. We sort the summary tables in descending order and take the top 5 event types for fatalities, injuries, and economic damage. The top 5 event types for fatalities are:
health_damage[order(health_damage$fatality_sum, decreasing = TRUE), ]$EVTYPE[1:5]
## [1] TORNADO EXCESSIVE HEAT FLASH FLOOD HEAT LIGHTNING
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD ... WND
top_f<- as.character(health_damage[order(health_damage$fatality_sum, decreasing = TRUE), ]$EVTYPE[1:5])
The top 5 events for injuries are:
health_damage[order(health_damage$injury_sum, decreasing = TRUE), ]$EVTYPE[1:5]
## [1] TORNADO TSTM WIND FLOOD EXCESSIVE HEAT LIGHTNING
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD ... WND
top_i<-as.character(health_damage[order(health_damage$injury_sum, decreasing = TRUE), ]$EVTYPE[1:5])
The top 5 events for economic damage are:
ecomony_damage[order(ecomony_damage$damage_sum, decreasing = TRUE), ]$EVTYPE[1:5]
## [1] FLOOD HURRICANE/TYPHOON TORNADO STORM SURGE
## [5] HAIL
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD ... WND
top_d<- as.character(ecomony_damage[order(ecomony_damage$damage_sum, decreasing = TRUE), ]$EVTYPE[1:5])
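The three rankings above repeat the same order-then-subset pattern, which could be wrapped in a small function. This is a sketch on made-up data; `top_events` and the toy values are hypothetical, and the function assumes the data frame has an EVTYPE column.

```r
# Hypothetical helper: names of the n event types with the largest value
top_events <- function(df, value_col, n = 5) {
  ranked <- df[order(df[[value_col]], decreasing = TRUE), ]
  as.character(head(ranked$EVTYPE, n))
}

# Made-up totals, for illustration only
toy <- data.frame(EVTYPE = c("HAIL", "TORNADO", "FLOOD"),
                  damage_sum = c(10, 30, 20))
top_events(toy, "damage_sum", n = 2)
```

On the real summaries, `top_events(health_damage, "fatality_sum")` would correspond to `top_f`.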
We draw bar plots of the top 5 events for fatalities and for injuries, with the events shown in descending order.
top_fatality<-health_damage[order(health_damage$fatality_sum, decreasing = TRUE), ][1:5,]
fatality_p <- ggplot(top_fatality, aes(x = EVTYPE, y = fatality_sum)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  xlab("Top 5 events for fatalities") +
  ylab("No. of fatalities") +
  scale_x_discrete(limits = top_f)  # keep bars in descending order
fatality_p
top_injury<-health_damage[order(health_damage$injury_sum, decreasing = TRUE), ][1:5,]
injury_p <- ggplot(top_injury, aes(x = EVTYPE, y = injury_sum))+geom_col(position="dodge")+ theme(axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Top 5 events for injuries") + ylab("No. of injuries")+scale_x_discrete (limits=top_i )
injury_p
Tornadoes are thus the most harmful event type for population health, causing both the most fatalities and the most injuries.
We draw a bar plot of the top 5 events for total economic damage, with the events shown in descending order.
top_damage<-ecomony_damage[order(ecomony_damage$damage_sum, decreasing = TRUE), ][1:5,]
damage_p <- ggplot(top_damage, aes(x = EVTYPE, y = damage_sum)) +
  geom_col(position = "dodge") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  xlab("Top 5 events for economic damage") +
  ylab("$ damage") +
  scale_x_discrete(limits = top_d)  # keep bars in descending order
damage_p
We thus conclude that floods have the greatest economic consequences in the USA.