Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many events can result in fatalities, injuries and property damage, and preventing such outcomes is a key concern for governments. The aim of this analysis is to explore the NOAA Storm Database in order to understand the impact of weather events on US public health and the US economy. To this end we analysed, on the one hand, the number of fatalities and injuries for each event type and, on the other, the overall economic impact in terms of property and crop damage. The NOAA dataset is publicly available and covers weather events in the United States between 1950 and 2011. The whole analysis was performed in the R programming language.
The following R libraries are required in order to perform the analysis.
library(R.utils) #to extract the NOAA archive.
library(dplyr) #to manipulate data.
library(ggplot2) #to plot data.
The NOAA dataset is publicly available here. It can be downloaded manually from the provided URL, or the R code below can be run to download it into the working directory.
#Set working directory for the analysis
setwd("C:/Users/Marco/Dropbox/Coursera/Data Science Specialization - JHU/Reproducible Research/week4")
#Download dataset, which is compressed via the bzip2 algorithm. Once downloaded we will use the bunzip2 function, from the R.utils library, to convert it to .csv
if(!file.exists("storm_data.csv")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","storm_data.csv.bz2")
bunzip2("storm_data.csv.bz2", "storm_data.csv",overwrite=TRUE, remove=FALSE)
}
#Download related documentation for the data
if(!file.exists("nsw_storm_sata_docuentation.pdf")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf","nsw_storm_sata_docuentation.pdf")
}
if(!file.exists("storm_events_faq.pdf")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf","storm_events_faq.pdf")
}
#Check that both data and documentation are available in the local folder
dir()
## [1] "Ass2_Storm.Rmd" "nsw_storm_sata_docuentation.pdf"
## [3] "proj2_storm.html" "proj2_storm.Rmd"
## [5] "storm_data.csv" "storm_data.csv.bz2"
## [7] "storm_events_faq.pdf"
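As a side note, decompressing the archive is not strictly required: base R's read.csv() opens a path through file(), which detects bzip2 compression on its own. The sketch below illustrates this on a small toy file; the data frame contents and the temporary file name are made up for the example and are not part of the NOAA data.

```r
# Toy stand-in for the compressed NOAA file (hypothetical contents).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0))
tmp <- tempfile(fileext = ".csv.bz2")

# Write the toy data through a bzip2 connection.
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file directly, without bunzip2().
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy_back)
```

Keeping an uncompressed copy, as done above, is still convenient when the file needs to be inspected with other tools.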
With the code below we import the dataset into the R environment. The dataset consists of 902297 rows and 37 columns. Some of the variables are not necessary for the goal of our analysis, hence we will drop them in the next step. We can see that data have been collected for 985 distinct event types (distinct values of EVTYPE).
if(!exists("storm_data")) {
storm_data<-read.csv("storm_data.csv")
}
dim(storm_data)
## [1] 902297 37
names(storm_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
head(storm_data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
#There is a total of 985 distinct event types in the dataset.
length(unique(storm_data$EVTYPE))
## [1] 985
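A count of distinct labels is sensitive to spelling variants such as differences in case or stray whitespace. The sketch below uses a handful of hypothetical labels (not taken from the dataset) to show how such variants inflate the count and how a simple normalization reduces it. Whether and how much this applies to the 985 values above is not checked here.

```r
# Hypothetical labels, chosen to illustrate spelling variants.
evtype_raw <- c("TORNADO", "Tornado", " TORNADO", "HAIL", "hail", "TSTM WIND")

# Every distinct string counts as its own event type.
n_raw <- length(unique(evtype_raw))

# Trim surrounding whitespace and convert to upper case before counting.
evtype_clean <- toupper(trimws(evtype_raw))
n_clean <- length(unique(evtype_clean))

print(c(raw = n_raw, clean = n_clean))
```

Normalization of this kind does not merge synonyms: labels such as TSTM WIND and THUNDERSTORM WIND, which both appear in the summary output further down, would still need an explicit mapping.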
After reading the documentation, we concluded that, for the purpose of this analysis, we can reduce the size of the dataset by keeping only the necessary variables.
Below we select our variables of interest.
sub_storm_data<- storm_data %>%
select(EVTYPE,FATALITIES,INJURIES,PROPDMG:CROPDMGEXP)
#Let's also put the column names in lower case.
names(sub_storm_data)<-tolower(names(sub_storm_data))
names(sub_storm_data)
## [1] "evtype" "fatalities" "injuries" "propdmg" "propdmgexp"
## [6] "cropdmg" "cropdmgexp"
summary(sub_storm_data)
## evtype fatalities injuries
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## propdmg propdmgexp cropdmg cropdmgexp
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
sum(is.na(sub_storm_data))
## [1] 0
#The dataset does not contain any NA values, hence we can use all observations in our analysis.
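Note that is.na() only counts values R has marked as NA. The summary above shows a large number of blank entries in propdmgexp and cropdmgexp, and blank strings are not NA. A minimal sketch with a hypothetical vector of exponent codes shows the difference:

```r
# Hypothetical exponent codes, including blank entries.
expo <- c("K", "", "M", "", "B")

# Blank strings are ordinary values, so they are not counted as missing.
n_na <- sum(is.na(expo))

# Counting blanks requires an explicit comparison.
n_blank <- sum(expo == "")

print(c(na = n_na, blank = n_blank))
```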
As stated in the documentation (page 12), the property and crop damage variables need to be scaled using the exponent variables “propdmgexp” and “cropdmgexp”, as follows: “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.”
First we check the possible values of the exponent variables and clean them where necessary.
unique(sub_storm_data$propdmgexp)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(sub_storm_data$cropdmgexp)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
#We can see that there are other values apart from the ones stated in the documentation. Since we have no information about how to decode the unknown values, we filter them out and keep just the codes we can interpret with confidence: "H", "K", "M", "B". A row is kept when at least one of its two exponents is one of these codes.
fil_sub_storm_data <- filter(sub_storm_data, propdmgexp %in% c("H","K","M","B") | cropdmgexp %in% c("H","K","M","B"))
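#Let's now create two new variables measuring the cost of property and crop damage, based on the exponent codes.
#The lookup vector is named multipliers rather than exp to avoid masking base R's exp() function.
multipliers <- c('H' = 100, 'K' = 1000, 'M' = 1000000, 'B' = 1000000000)
#toupper() means lower-case codes such as "k" or "m" are scaled as well; any other code contributes 0.
fil_sub_storm_data$CropCost <- ifelse(toupper(fil_sub_storm_data$cropdmgexp) %in% names(multipliers), fil_sub_storm_data$cropdmg * multipliers[toupper(fil_sub_storm_data$cropdmgexp)], 0)
fil_sub_storm_data$PropCost <- ifelse(toupper(fil_sub_storm_data$propdmgexp) %in% names(multipliers), fil_sub_storm_data$propdmg * multipliers[toupper(fil_sub_storm_data$propdmgexp)], 0)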
#Finally we sum property and crop cost to calculate the total economic impact.
fil_sub_storm_data <- fil_sub_storm_data %>%
mutate(economic_impact= PropCost + CropCost)
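The conversion above can be checked by hand on a few rows. For instance, a propdmg of 25.0 with exponent "K" stands for 25.0 × 1000 = 25,000 dollars. The sketch below applies the same lookup idea to a toy data frame; the rows, the name multipliers and the column prop_cost are chosen for the illustration only.

```r
# Toy rows; the "?" code stands in for an exponent we cannot interpret.
toy <- data.frame(
  propdmg = c(25.0, 2.5, 1.2, 3.0, 10.0),
  propdmgexp = c("K", "K", "M", "B", "?"),
  stringsAsFactors = FALSE
)

# Named vector mapping each exponent code to its multiplier.
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Look up the multiplier by name; codes without a match come back as NA.
mult <- unname(multipliers[toupper(toy$propdmgexp)])

# Treat uninterpretable codes as contributing no cost.
mult[is.na(mult)] <- 0

toy$prop_cost <- toy$propdmg * mult
print(toy)
```

Setting unknown codes to 0 mirrors the choice made in the analysis; it understates damage if those codes in fact carry information.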
To be consistent throughout the rest of the analysis, we use the new dataset named “fil_sub_storm_data” to answer both questions, about the public health impact and the economic impact of weather events.
Below we summarize fatalities, injuries and economic impact according to the type of event. We will use this summarized dataset to show results in the next section.
summarized_df<-fil_sub_storm_data %>%
group_by(evtype) %>%
summarize(tot_fatalities=sum(fatalities),tot_injuries=sum(injuries),
tot_economic_impact=sum(economic_impact))
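For readers who want to see what the grouped summary produces, the following sketch performs the same kind of aggregation in base R on a toy data frame. The rows are invented for the example, and aggregate() is used here only to keep the snippet free of package dependencies; it is not the method used in the analysis.

```r
# Toy observations: one row per recorded event (hypothetical values).
toy <- data.frame(
  evtype = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  fatalities = c(3, 1, 2, 0, 2),
  injuries = c(15, 0, 2, 1, 6),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries within each event type.
totals <- aggregate(cbind(fatalities, injuries) ~ evtype, data = toy, FUN = sum)

# Sort by total fatalities, largest first, and pick the top event type.
totals <- totals[order(-totals$fatalities), ]
top_event <- totals$evtype[1]

print(totals)
```

Each event type ends up on a single row holding its totals, which is the shape the plots in the next section rely on.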
In the plots that follow, we show the 15 most severe event types in the US for fatalities, injuries and economic impact.
plot_fatalities <- arrange(summarized_df,desc(tot_fatalities)) %>%
head(15) %>%
ggplot(aes(x=reorder(evtype,tot_fatalities), y=tot_fatalities)) +
geom_bar(fill="red",stat="identity") +
coord_flip() +
ylab("Total number of fatalities") + xlab("Event") +
ggtitle("Most severe weather events for public health - fatalities")
plot_fatalities
plot_injuries <- arrange(summarized_df,desc(tot_injuries)) %>%
head(15) %>%
ggplot(aes(x=reorder(evtype,tot_injuries), y=tot_injuries)) +
geom_bar(fill="orange",stat="identity") +
coord_flip() +
ylab("Total number of injuries") + xlab("Event") +
ggtitle("Most severe weather events for public health - injuries")
plot_injuries
plot_economic_impact <- arrange(summarized_df,desc(tot_economic_impact)) %>%
head(15) %>%
ggplot(aes(x=reorder(evtype,tot_economic_impact), y=tot_economic_impact)) +
geom_bar(fill="black",stat="identity") +
coord_flip() +
ylab("Total economic impact $") + xlab("Event") +
ggtitle("Most severe weather events for the economy")
plot_economic_impact
Tornado is by far the most harmful weather event for population health, both in terms of fatalities and of injuries caused.
In terms of economic damage, the results show that flood has the greatest economic consequences for the US economy, followed by hurricane/typhoon and tornado.