In this report we aim to identify the most harmful weather events in the United States, in terms of health and economic consequences, from 1996 to 2011. To do so, we obtained the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. From these data, we found that tornadoes cause the most harm to population health, considering fatalities and injuries together. We also conclude that floods are the events with the worst economic consequences.
From the U.S. National Oceanic and Atmospheric Administration’s (NOAA) we obtained the storm database, which tracks characteristics of major storms and weather events in the United States from 1950 to November 2011.
First, we download the data from the URL and save it in the working directory.
fileurl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileurl,destfile="repdata_data_StormData.csv.bz2",method = "curl")
Second, we read the downloaded data as a CSV file using the readr package. The data are stored in a variable called “stormdata”. We set guess_max=1000000 so that the type of every column is guessed correctly.
library(readr)
stormdata<-read_csv("repdata_data_StormData.csv.bz2", guess_max=1000000)
According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and only from January 1996 onward were all event types recorded. Therefore, all records prior to 1996 are removed from the analysis.
library(dplyr)
library(lubridate)
stormdata$BGN_DATE<-mdy_hms(stormdata$BGN_DATE)
# keep events from 1996 onward (1996 itself is included)
stormdata<-stormdata %>% filter(year(BGN_DATE)>=1996)
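As a minimal sketch of the year-based filter, the following uses base R instead of lubridate and three made-up dates written in the month/day/year hour:minute:second layout that mdy_hms expects. The sample values and the names bgn, parsed, yr and keep are assumptions for illustration only.

```r
# Toy begin dates (hypothetical values, not taken from the storm database)
bgn <- c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "6/15/2011 0:00:00")

# Parse month/day/year timestamps; UTC avoids any local time zone shifts
parsed <- as.POSIXct(bgn, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Extract the year and keep events from 1996 onward
yr <- as.integer(format(parsed, "%Y"))
keep <- yr >= 1996
bgn[keep]
```

Only the 1995 record is dropped; the event dated 1 January 1996 is kept.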
The damage caused by each event is stored in the variables CROPDMG and PROPDMG, and the corresponding base-10 exponents are stored in CROPDMGEXP and PROPDMGEXP. The exponents are coded as symbols (for example k, m, b), so a transformation is needed to convert them to a numeric format.
First, we do it for PROPDMGEXP.
stormdata <- stormdata %>% mutate(PROPDMGEXP = tolower(PROPDMGEXP)) %>%
mutate(PROPDMGEXP=replace(PROPDMGEXP,PROPDMGEXP=="-",0)) %>%
mutate(PROPDMGEXP=replace(PROPDMGEXP,PROPDMGEXP=="+",0)) %>%
mutate(PROPDMGEXP=replace(PROPDMGEXP,PROPDMGEXP=="?",0)) %>%
mutate(PROPDMGEXP=replace(PROPDMGEXP,PROPDMGEXP=="b",9)) %>%
mutate(PROPDMGEXP=replace(PROPDMGEXP,PROPDMGEXP=="m",6)) %>%
mutate(PROPDMGEXP=replace(PROPDMGEXP,PROPDMGEXP=="k",3)) %>%
mutate(PROPDMGEXP=replace(PROPDMGEXP,PROPDMGEXP=="h",2)) %>%
mutate(PROPDMGEXP = as.numeric(PROPDMGEXP))
Second, we do it for CROPDMGEXP.
stormdata <- stormdata %>% mutate(CROPDMGEXP = tolower(CROPDMGEXP)) %>%
mutate(CROPDMGEXP=replace(CROPDMGEXP,CROPDMGEXP=="-",0)) %>%
mutate(CROPDMGEXP=replace(CROPDMGEXP,CROPDMGEXP=="+",0)) %>%
mutate(CROPDMGEXP=replace(CROPDMGEXP,CROPDMGEXP=="?",0)) %>%
mutate(CROPDMGEXP=replace(CROPDMGEXP,CROPDMGEXP=="b",9)) %>%
mutate(CROPDMGEXP=replace(CROPDMGEXP,CROPDMGEXP=="m",6)) %>%
mutate(CROPDMGEXP=replace(CROPDMGEXP,CROPDMGEXP=="k",3)) %>%
mutate(CROPDMGEXP=replace(CROPDMGEXP,CROPDMGEXP=="h",2)) %>%
mutate(CROPDMGEXP = as.numeric(CROPDMGEXP))
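Both pipelines above apply the same recoding. As a sketch of an alternative, the mapping could be written once as a lookup table and applied to either column. The helper name exp_to_power is hypothetical and is not part of the original analysis; it assumes, as the code above does, that digit codes are kept as they are and that any other unrecognized code becomes NA.

```r
# Hypothetical helper: convert exponent codes to numeric powers of 10
exp_to_power <- function(x) {
  x <- tolower(as.character(x))
  # Same mapping as the replace() calls: symbols to base-10 exponents
  lookup <- c("-" = 0, "+" = 0, "?" = 0, "h" = 2, "k" = 3, "m" = 6, "b" = 9)
  out <- unname(lookup[x])                      # NA for codes not in the table
  is_digit <- !is.na(x) & grepl("^[0-9]+$", x)  # digit codes are used directly
  out[is_digit] <- as.numeric(x[is_digit])
  out
}

exp_to_power(c("K", "m", "B", "h", "+", "5", NA, "x"))
```

With such a helper, each column would need a single call, for example mutate(PROPDMGEXP = exp_to_power(PROPDMGEXP)).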
In some observations, CROPDMG or PROPDMG is a number but the exponent is NA. In these cases we change the exponent from NA to 0.
stormdata[which(is.na(stormdata$PROPDMGEXP) & !is.na(stormdata$PROPDMG)), "PROPDMGEXP"]<-0
stormdata[which(is.na(stormdata$CROPDMGEXP) & !is.na(stormdata$CROPDMG)), "CROPDMGEXP"]<-0
As we only need the total damage, we create a new variable combining the previous four.
stormdata$totaldamage<-stormdata$PROPDMG*10^stormdata$PROPDMGEXP+stormdata$CROPDMG*10^stormdata$CROPDMGEXP
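To make the arithmetic concrete, here is a small worked example with two made-up observations (the values and variable names are assumptions, not records from the database). It also shows why a missing exponent has to be set to 0 before the total is computed.

```r
# Two hypothetical observations
propdmg <- c(25, 10)    # property damage mantissa
propexp <- c(3, NA)     # exponent: 3 means thousands; the second is missing
cropdmg <- c(1.5, 0)    # crop damage mantissa
cropexp <- c(6, 0)      # exponent: 6 means millions

# With an NA exponent the total for that observation is NA
total_raw <- propdmg * 10^propexp + cropdmg * 10^cropexp

# Replace the missing exponent by 0 (10^0 = 1), then recompute
propexp[is.na(propexp) & !is.na(propdmg)] <- 0
total <- propdmg * 10^propexp + cropdmg * 10^cropexp
total   # 25 * 10^3 + 1.5 * 10^6 = 1525000, and 10 * 10^0 + 0 = 10
```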
We are trying to find the most harmful events with respect to population health and economic consequences. Therefore, we filter out the events that did not cause any fatalities, injuries or economic damage.
stormdata<-stormdata %>% filter(FATALITIES>0 | INJURIES>0 | totaldamage>0)
We use only the Event Types described by the National Weather Service in point 7 of the Storm Data Documentation. Therefore, we need to match all the event types in the column EVTYPE to the ones in the documentation.
First, all the values of EVTYPE are transformed to lower case.
stormdata$EVTYPE<-tolower(stormdata$EVTYPE)
Then we create a vector with all the official event names and convert it to lower case.
events<-c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow",
"Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill",
"Flash Flood", "Flood","Freezing Fog","Frost/Freeze","Funnel Cloud", "Hail", "Heat", "Heavy Rain",
"Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon", "Ice Storm", "Lakeshore Flood",
"Lake-Effect Snow", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Tide", "Strong Wind",
"Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
"Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
events<-tolower(events)
Now we need to match the event names in the data frame with the names in the vector. To do that, we store all the distinct event names from the data frame in a new vector called “dfevents” and then match them to the official names using the function amatch from the stringdist package, with a maximum distance of 4. The results are stored in a new data frame called matchedevents.
dfevents<-names(table(stormdata$EVTYPE))
library(stringdist)
# nomatch must be an integer; names without a match within maxDist get NA
matchedevents<-data.frame(dfevents,events[amatch(dfevents,events,nomatch=NA_integer_, maxDist = 4)])
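To illustrate what a distance-limited match does, the sketch below uses base R's adist (Levenshtein distance) on a handful of made-up names. This is a stand-in for illustration, not the amatch call used above, whose default distance method differs slightly. The helper name nearest_within and the sample vectors are hypothetical.

```r
# Hypothetical official names and raw names
official <- c("flash flood", "flood", "hail", "tornado")
raw <- c("flash flooding", "hails", "landslide")

# For each raw name, return the closest official name, or NA when even
# the closest one is more than max_dist edits away
nearest_within <- function(x, table, max_dist = 4) {
  d <- adist(x, table)                      # rows: x, columns: table
  best <- apply(d, 1, which.min)            # index of the closest name
  best_dist <- d[cbind(seq_along(x), best)] # distance to that name
  ifelse(best_dist <= max_dist, table[best], NA_character_)
}

nearest_within(raw, official)
```

Here “flash flooding” is 3 edits from “flash flood” and “hails” is 1 edit from “hail”, so both are matched, while “landslide” is more than 4 edits from every official name and gets NA.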
We add a new variable called EVENT to the data frame by looking up each value of EVTYPE in the first column of matchedevents and taking the matched official name from the second column. Some events do not have any match; their value is therefore NA.
stormdata$EVENT<-matchedevents[match(stormdata$EVTYPE,matchedevents[,1]),2]
#fraction of NAs
sum(is.na(stormdata$EVENT))/length(stormdata$EVENT)
[1] 0.01209758
In this case, about 1.2% of the observations do not have an official event name, so we remove them.
stormdata<- stormdata %>% filter(!is.na(EVENT))
To answer the question of which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health and economic consequences across the United States, we summarize the data by event and plot the results.
The first step is to build a table with, for each event, the total number of fatalities, the total number of injuries and the total economic damage.
eventconseq<- stormdata %>% group_by(EVENT) %>%
summarise(FATALITIES = sum(FATALITIES, na.rm=TRUE),INJURIES = sum(INJURIES, na.rm=TRUE),
ECONOMICDMG = sum(totaldamage,na.rm=TRUE))
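The shape of the resulting table can be seen on a toy data frame. The sketch below uses base R's aggregate rather than the dplyr pipeline above, and all values are made up for illustration.

```r
# Hypothetical records: two tornado events, one flood, one heat event
toy <- data.frame(
  EVENT = c("tornado", "flood", "tornado", "heat"),
  FATALITIES = c(2, 0, 1, 5),
  INJURIES = c(10, 1, 4, 0),
  totaldamage = c(1e6, 5e8, 2e5, 0)
)

# One row per event, with each column summed within the event
toy_summary <- aggregate(cbind(FATALITIES, INJURIES, totaldamage) ~ EVENT,
                         data = toy, FUN = sum)

# Sort by fatalities, largest first
toy_summary[order(-toy_summary$FATALITIES), ]
```

The two tornado rows collapse into a single row with 3 fatalities and 14 injuries, and the heat event comes first once the table is sorted by fatalities.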
We only want the five worst events in terms of fatalities, so we sort the events by fatalities in descending order and plot the first five rows.
fatalities<-eventconseq %>% arrange(desc(FATALITIES))
library(ggplot2)
ggplot(fatalities[1:5,],aes(x=reorder(EVENT, FATALITIES), y=FATALITIES))+
geom_col() +
xlab("Weather event") +
ylab("Fatalities") +
ggtitle("Number of fatalities per weather event in the United States from 1996 to 2011") +
geom_text(aes(label=FATALITIES), vjust=-0.2)
We do the same with the injuries, selecting the top 5 events.
injuries<-eventconseq %>% arrange(desc(INJURIES))
ggplot(injuries[1:5,],aes(x=reorder(EVENT, INJURIES), y=INJURIES))+
geom_col() +
xlab("Weather event") +
ylab("Injuries") +
ggtitle("Number of injuries per weather event in the United States from 1996 to 2011") +
geom_text(aes(label=INJURIES), vjust=-0.2)
The event that causes the most fatalities is excessive heat, followed closely by tornadoes. In terms of injuries, however, tornadoes cause far more harm than any other event. Considering both factors, we conclude that tornadoes are the most harmful events with respect to population health.
We repeat the process with the economic damage, expressed in billions of USD, selecting the top 5 events and plotting them.
economic<-eventconseq %>% mutate(ECONOMICDMG = round(ECONOMICDMG/1000000000)) %>%
arrange(desc(ECONOMICDMG))
ggplot(economic[1:5,],aes(x=reorder(EVENT, ECONOMICDMG), y=ECONOMICDMG))+
geom_col() +
xlab("Weather event") +
ylab("Economic Consequences [Bn USD]") +
ggtitle("Economic consequences per weather event in the US from 1996 to 2011") +
geom_text(aes(label=ECONOMICDMG), vjust=-0.2)
The event with the greatest economic consequences is the flood.