This is a second course project for Reproducible Research, which is the fifth course in Coursera Data Science Specialization.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, loss of crops and property damage. Preventing such outcomes as far as practical and possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The analysis of the data shows that tornadoes have the greatest human impact as measured by numbers of injuries and fatalities. The second biggest impact is heat. In terms of economic impact, as measured by damage to crops and property, flooding is the greatest threat.
The following libraries were used for the analysis
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The data for this assignment comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file may be downloaded from here:
See links below for documentation of this data:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ page
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The aim of this work is to answer the following questions.
Across the United States, which types of severe weather events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data was downloaded from the link above and saved into my R environment (use setwd command to set the file path of the working directory and use getwd command to check the current working directory). The read_csv command was used to import the zipped csv file. If object storm.data is already loaded, use the cached object insted of reloading each time the Rmd file is knitted.
storm.data <- read_csv("StormData.csv.bz2")
## Parsed with column specification:
## cols(
## .default = col_character(),
## STATE__ = col_double(),
## COUNTY = col_double(),
## BGN_RANGE = col_double(),
## COUNTY_END = col_double(),
## END_RANGE = col_double(),
## LENGTH = col_double(),
## WIDTH = col_double(),
## F = col_integer(),
## MAG = col_double(),
## FATALITIES = col_double(),
## INJURIES = col_double(),
## PROPDMG = col_double(),
## CROPDMG = col_double(),
## LATITUDE = col_double(),
## LONGITUDE = col_double(),
## LATITUDE_E = col_double(),
## LONGITUDE_ = col_double(),
## REFNUM = col_double()
## )
## See spec(...) for full column specifications.
head(storm.data)
str(storm.data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: chr NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 37
## .. ..$ STATE__ : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ BGN_DATE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BGN_TIME : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ TIME_ZONE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ COUNTY : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ COUNTYNAME: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ STATE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ EVTYPE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BGN_RANGE : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ BGN_AZI : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BGN_LOCATI: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ END_DATE : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ END_TIME : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ COUNTY_END: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ COUNTYENDN: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ END_RANGE : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ END_AZI : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ END_LOCATI: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LENGTH : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ WIDTH : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ F : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ MAG : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ FATALITIES: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ INJURIES : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ PROPDMG : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ PROPDMGEXP: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ CROPDMG : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ CROPDMGEXP: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ WFO : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ STATEOFFIC: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ ZONENAMES : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LATITUDE : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ LONGITUDE : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ LATITUDE_E: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ LONGITUDE_: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ REMARKS : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ REFNUM : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
The following variables are of most interest with regard to health and economic outcomes
Health variables:
FATALITIES: approx. number of deaths INJURIES: approx. number of injuries
Economic variables:
PROPDMG: approx. property damags PROPDMGEXP: the units for property damage value CROPDMG: approx. crop damages CROPDMGEXP: the units for crop damage value Events - target variable:
EVTYPE: weather event (Tornados, Wind, Snow, Flood, etc..)
Create a subset of data containing just these required variables:
vars <- c( "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
mydata <- storm.data[, vars]
Check if any values are missing from each of the numeric variables as a basic quality check.
incomplete <- sum(!complete.cases(mydata))
sum(is.na(mydata$FATALITIES))
## [1] 0
sum(is.na(mydata$INJURIES))
## [1] 0
sum(is.na(mydata$PROPDMG))
## [1] 0
sum(is.na(mydata$CROPDMG))
## [1] 0
Transform the units so actual dollar values are shown. The rules applied are as follows:
*K or k: thousand dollars (10^3)
*M or m: million dollars (10^6)
*B or b: billion dollars (10^9)
*those without the units above are considered as dollars
mydata$PROPDMGEXP <- as.character(mydata$PROPDMGEXP)
mydata$PROPDMGEXP[is.na(mydata$PROPDMGEXP)] <- 0 # NA's considered as dollars
mydata$PROPDMGEXP[!grepl("K|M|B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
mydata$PROPDMGEXP[grep("K", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$PROPDMGEXP[grep("M", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$PROPDMGEXP[grep("B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$PROPDMGEXP <- as.numeric(as.character(mydata$PROPDMGEXP))
mydata$property.damage <- mydata$PROPDMG * 10^mydata$PROPDMGEXP
mydata$CROPDMGEXP <- as.character(mydata$CROPDMGEXP)
mydata$CROPDMGEXP[is.na(mydata$CROPDMGEXP)] <- 0 # NA's considered as dollars
mydata$CROPDMGEXP[!grepl("K|M|B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
mydata$CROPDMGEXP[grep("K", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$CROPDMGEXP[grep("M", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$CROPDMGEXP[grep("B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$CROPDMGEXP <- as.numeric(as.character(mydata$CROPDMGEXP))
mydata$crop.damage <- mydata$CROPDMG * 10^mydata$CROPDMGEXP
#summarise data by event type; transform EVTYPE into a new variable called EVENT
mydata$EVENT <- "OTHER"
# group by keyword in EVTYPE
mydata$EVENT[grep("HAIL", mydata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
mydata$EVENT[grep("STORM", mydata$EVTYPE, ignore.case = TRUE)] <- "STORM"
mydata$EVENT[grep("WINTER", mydata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
mydata$EVENT[grep("RAIN", mydata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
mydata$EVENT[grep("ICE", mydata$EVTYPE, ignore.case = TRUE)] <- "ICE"
mydata$EVENT[grep("RAIN", mydata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
mydata$EVENT[grep("SNOW", mydata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
mydata$EVENT[grep("TORNADO", mydata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
mydata$EVENT[grep("HEAT", mydata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
mydata$EVENT[grep("FLOOD", mydata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
mydata$EVENT[grep("WIND", mydata$EVTYPE, ignore.case = TRUE)] <- "WIND"
# list the transformed event types
sort(table(mydata$EVENT), decreasing = TRUE)
##
## WIND HAIL FLOOD TORNADO OTHER WINTER SNOW RAIN HEAT
## 364901 289270 82713 60699 48878 19597 17677 12156 2648
## ICE STORM
## 2101 1657
mydata_summarised <- mydata %>%
group_by(EVENT, crop.damage, property.damage, FATALITIES, INJURIES) %>%
summarise ()
crop_damage_summary <- aggregate(crop.damage~EVENT, mydata_summarised, sum)
arrange(crop_damage_summary, desc(crop.damage))
property_damage_summary <- aggregate(property.damage~EVENT, mydata_summarised, sum)
arrange(property_damage_summary, desc(property.damage))
fatalities_summary <- aggregate(FATALITIES~EVENT, mydata_summarised, sum)
arrange(fatalities_summary, desc(FATALITIES))
injuries_summary <- aggregate(INJURIES~EVENT, mydata_summarised, sum)
arrange(injuries_summary, desc(INJURIES))
The barplot ranks top severe weather events by type according to the number of fatalities and injuries. We see that tornadoes are resposible for by far the highest proportion of injuries and fatalities.
human <- merge(fatalities_summary, injuries_summary)
totals1 <- arrange(human, desc(FATALITIES + INJURIES))
names_events <- totals1$EVENT
barplot(t(totals1[,-1]), names.arg = names_events, ylim = c(0,70000), beside = T, cex.names = 0.8, las=2, col = c("black", "red"), main="Number of fatalities and injuries by severe weather event")
legend("topright",c("FATALITIES","INJURIES"),fill=c("black", "red"),bty = "n")
economic <- merge(property_damage_summary, crop_damage_summary)
totals <- arrange(economic, desc(property.damage + crop.damage))
names_events <- totals$EVENT
barplot(t(totals[,-1]), names.arg = names_events, ylim = c(0,160000000000), beside = T, cex.names = 0.8, las=2, col = c("green", "grey"), main="Economic loss ($) by severe weather event")
legend("topright",c("Property damage","Crop loss"),fill=c("green","grey"),bty = "n")
The barplot ranks top severe weather events by type according to the financial impact. We see that both flood and
tornado pose the biggest risk to property; flood, hail and wind (in order) are the three biggest risks to crops.