The goal of this report is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm data database to ascertain which types of severe weather events in the US are most harmful with respect to population health, and which types of events have the greatest economic consequences.
The data from NOAA is provided as a CSV file and will be imported and processed.
To assess the effect on population health, the number of fatalities and injuries associated with each event type will be examined. The report will show that the events leading to the highest number of fatalities and injuries include Tornado and Excessive Heat events.
To assess the economic consequences, the total cost in terms of property and crop damage for each event type will be calculated. The report will show that the events leading to the greatest cost to property and crops include Flood and Hurricane events.
The data is supplied as a bzip2-compressed CSV file, downloaded from the URL given in the code below. Documentation on the data can be found in the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ.
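For this analysis, the CSV file is downloaded into a data folder in the project directory. If the folder or the file does not exist yet, it is created and the file is downloaded:
data_file <- "./data/repdata%2Fdata%2FStormData.csv.bz2"
# create the data folder if it is missing
if (!dir.exists("data")) {
  dir.create("data")
}
# download only when the file is not already present; "wb" keeps the archive intact on Windows
if (!file.exists(data_file)) {
  adres <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(adres, destfile = data_file, method = "auto", mode = "wb")
}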
We can verify that the download has completed:
list.files("./data")
## [1] "repdata%2Fdata%2FStormData.csv.bz2"
These are the packages needed for the analysis:
library(tidyverse) # for importing, manipulating and plotting
library(lubridate) # for handling dates
storm_data <- read_csv("./data/repdata%2Fdata%2FStormData.csv.bz2")
## Parsed with column specification:
## cols(
## .default = col_double(),
## BGN_DATE = col_character(),
## BGN_TIME = col_character(),
## TIME_ZONE = col_character(),
## COUNTYNAME = col_character(),
## STATE = col_character(),
## EVTYPE = col_character(),
## BGN_AZI = col_logical(),
## BGN_LOCATI = col_logical(),
## END_DATE = col_logical(),
## END_TIME = col_logical(),
## COUNTYENDN = col_logical(),
## END_AZI = col_logical(),
## END_LOCATI = col_logical(),
## PROPDMGEXP = col_character(),
## CROPDMGEXP = col_logical(),
## WFO = col_logical(),
## STATEOFFIC = col_logical(),
## ZONENAMES = col_logical(),
## REMARKS = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 5255570 parsing failures.
## row col expected actual file
## 1671 WFO 1/0/T/F/TRUE/FALSE NG './data/repdata%2Fdata%2FStormData.csv.bz2'
## 1673 WFO 1/0/T/F/TRUE/FALSE NG './data/repdata%2Fdata%2FStormData.csv.bz2'
## 1674 WFO 1/0/T/F/TRUE/FALSE NG './data/repdata%2Fdata%2FStormData.csv.bz2'
## 1675 WFO 1/0/T/F/TRUE/FALSE NG './data/repdata%2Fdata%2FStormData.csv.bz2'
## 1678 WFO 1/0/T/F/TRUE/FALSE NG './data/repdata%2Fdata%2FStormData.csv.bz2'
## .... ... .................. ...... ...........................................
## See problems(...) for more details.
str(storm_data)
## tibble [902,297 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ STATE__ : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr [1:902297] "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
## $ COUNTY : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr [1:902297] "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : logi [1:902297] NA NA NA NA NA NA ...
## $ BGN_LOCATI: logi [1:902297] NA NA NA NA NA NA ...
## $ END_DATE : logi [1:902297] NA NA NA NA NA NA ...
## $ END_TIME : logi [1:902297] NA NA NA NA NA NA ...
## $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
## $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : logi [1:902297] NA NA NA NA NA NA ...
## $ END_LOCATI: logi [1:902297] NA NA NA NA NA NA ...
## $ LENGTH : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
## $ F : num [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
## $ CROPDMG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: logi [1:902297] NA NA NA NA NA NA ...
## $ WFO : logi [1:902297] NA NA NA NA NA NA ...
## $ STATEOFFIC: logi [1:902297] NA NA NA NA NA NA ...
## $ ZONENAMES : logi [1:902297] NA NA NA NA NA NA ...
## $ LATITUDE : num [1:902297] 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
## $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
## $ REMARKS : logi [1:902297] NA NA NA NA NA NA ...
## $ REFNUM : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "problems")= tibble [5,255,570 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:5255570] 1671 1673 1674 1675 1678 1679 1680 1681 1682 1683 ...
## ..$ col : chr [1:5255570] "WFO" "WFO" "WFO" "WFO" ...
## ..$ expected: chr [1:5255570] "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
## ..$ actual : chr [1:5255570] "NG" "NG" "NG" "NG" ...
## ..$ file : chr [1:5255570] "'./data/repdata%2Fdata%2FStormData.csv.bz2'" "'./data/repdata%2Fdata%2FStormData.csv.bz2'" "'./data/repdata%2Fdata%2FStormData.csv.bz2'" "'./data/repdata%2Fdata%2FStormData.csv.bz2'" ...
## - attr(*, "spec")=
## .. cols(
## .. STATE__ = col_double(),
## .. BGN_DATE = col_character(),
## .. BGN_TIME = col_character(),
## .. TIME_ZONE = col_character(),
## .. COUNTY = col_double(),
## .. COUNTYNAME = col_character(),
## .. STATE = col_character(),
## .. EVTYPE = col_character(),
## .. BGN_RANGE = col_double(),
## .. BGN_AZI = col_logical(),
## .. BGN_LOCATI = col_logical(),
## .. END_DATE = col_logical(),
## .. END_TIME = col_logical(),
## .. COUNTY_END = col_double(),
## .. COUNTYENDN = col_logical(),
## .. END_RANGE = col_double(),
## .. END_AZI = col_logical(),
## .. END_LOCATI = col_logical(),
## .. LENGTH = col_double(),
## .. WIDTH = col_double(),
## .. F = col_double(),
## .. MAG = col_double(),
## .. FATALITIES = col_double(),
## .. INJURIES = col_double(),
## .. PROPDMG = col_double(),
## .. PROPDMGEXP = col_character(),
## .. CROPDMG = col_double(),
## .. CROPDMGEXP = col_logical(),
## .. WFO = col_logical(),
## .. STATEOFFIC = col_logical(),
## .. ZONENAMES = col_logical(),
## .. LATITUDE = col_double(),
## .. LONGITUDE = col_double(),
## .. LATITUDE_E = col_double(),
## .. LONGITUDE_ = col_double(),
## .. REMARKS = col_logical(),
## .. REFNUM = col_double()
## .. )
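The column specification above shows several text columns, including CROPDMGEXP and WFO, guessed as logical. This is the source of the parsing failures, and the affected values are read as NA. readr guesses column types from a limited number of rows (see its guess_max argument), and the earliest records leave these columns empty. One way to avoid this, sketched here on a hypothetical three-row extract rather than the real file, is to declare the text columns explicitly:

```r
library(readr)

# Hypothetical three-row extract written to a temporary file (illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,WFO",
  "TORNADO,25,K,0,,",
  "FLOOD,1.5,M,10,K,NG",
  "HAIL,0,,2,M,NG"
), tmp)

# Declare the text columns instead of relying on type guessing;
# every other column is still guessed
storm_sample <- read_csv(
  tmp,
  col_types = cols(
    .default   = col_guess(),
    PROPDMGEXP = col_character(),
    CROPDMGEXP = col_character(),
    WFO        = col_character()
  )
)

str(storm_sample$CROPDMGEXP)
```

The same col_types argument could be passed to the read_csv() call on the full file. The results reported here were produced with the guessed types shown above, so figures that depend on CROPDMGEXP may differ if the file is re-read this way.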
There are some processing steps to be taken. After an initial exploration of the data (not shown), the following changes have been made:
- BGN_DATE has been converted to a date variable.
- Events before 1980 have been filtered out.
- Because of the many different event types (EVTYPE) present in the data, I have used my best judgement to group some of the key event types together, without wanting to lose too much of the granularity that some descriptions provide.
- Property and crop damage costs (PROPDMG and CROPDMG) have been calculated using their respective magnitude indicators (PROPDMGEXP and CROPDMGEXP), e.g. K for thousands.
storm_data_processed <- storm_data %>%
  # format date
  mutate(BGN_DATE = mdy_hms(BGN_DATE)) %>%
  # filter out older years (compare date-times with a date-time, not a Date)
  filter(BGN_DATE >= ymd_hms("1980-01-01 00:00:00")) %>%
  # events grouped together
  mutate(EVTYPE = str_to_upper(EVTYPE),
         event = str_replace_all(EVTYPE, "TSTM|THUNDERSTORMS", "THUNDERSTORM"),
         event = str_replace_all(event, "WINDS", "WIND")) %>%
  # calculate property and crop damage - if no valid code present then set to 0
  mutate(prop_damage = case_when(str_to_upper(PROPDMGEXP) == "H" ~ PROPDMG*100,
                                 str_to_upper(PROPDMGEXP) == "K" ~ PROPDMG*1000,
                                 str_to_upper(PROPDMGEXP) == "M" ~ PROPDMG*10^6,
                                 str_to_upper(PROPDMGEXP) == "B" ~ PROPDMG*10^9,
                                 TRUE ~ 0),
         crop_damage = case_when(str_to_upper(CROPDMGEXP) == "H" ~ CROPDMG*100,
                                 str_to_upper(CROPDMGEXP) == "K" ~ CROPDMG*1000,
                                 str_to_upper(CROPDMGEXP) == "M" ~ CROPDMG*10^6,
                                 str_to_upper(CROPDMGEXP) == "B" ~ CROPDMG*10^9,
                                 TRUE ~ 0))
count(storm_data_processed, event, sort = TRUE) %>% View()
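To make the two less obvious processing steps concrete, here is a small sketch on hypothetical values. It applies the same event-label replacements as the pipeline, then converts damage figures with a lookup table. The lookup table and the helper damage_in_dollars() are illustrative alternatives to the case_when() calls, not part of the original analysis.

```r
library(stringr)

# Hypothetical raw labels, illustrating the kind of variants found in EVTYPE
raw_events <- c("TSTM WIND", "Thunderstorm Winds", "THUNDERSTORMS WINDS",
                "HIGH WINDS", "TORNADO")

# Same three steps as the pipeline: upper-case, unify thunderstorm labels,
# then reduce WINDS to WIND
event <- str_to_upper(raw_events)
event <- str_replace_all(event, "TSTM|THUNDERSTORMS", "THUNDERSTORM")
event <- str_replace_all(event, "WINDS", "WIND")
table(event)

# Multipliers for the magnitude codes handled in the processing step
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: any unrecognised or missing code yields zero damage,
# matching the TRUE ~ 0 fallback used in the pipeline
damage_in_dollars <- function(amount, exp_code) {
  mult <- unname(exp_multiplier[toupper(exp_code)])
  ifelse(is.na(mult), 0, amount * mult)
}

damage <- damage_in_dollars(c(25, 2.5, 1.2, 5), c("K", "m", "B", "?"))
damage
```

Treating unrecognised codes as zero keeps the totals conservative, at the cost of dropping any damage recorded with a non-standard indicator.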
Now that the data has been processed, let’s consider the two questions in turn.
To answer the first question, which types of events are most harmful to population health, let’s assess the events against two measures: fatalities and injuries.
First, let’s assess fatalities:
health_effect <- storm_data_processed %>%
  group_by(event) %>%
  summarise(fatalities_total = sum(FATALITIES),
            injuries_total = sum(INJURIES),
            events_total = n(),
            fatality_rate = fatalities_total / events_total,
            injury_rate = injuries_total / events_total)
top10_fatal <- health_effect %>%
  top_n(10, fatalities_total)
ggplot(top10_fatal, aes(x = fct_reorder(event, fatalities_total), y = fatalities_total)) +
  geom_col(colour = 'red') +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Tornado and heat have caused the highest number of fatalities",
       subtitle = "Weather events since 1980 in the US with the highest number of fatalities",
       y = "Total Fatalities",
       x = "Event Type")
Tornado events have led to the highest number of fatalities, with excessive heat/heat also featuring highly.
Now let’s assess the number of injuries:
top10_injury <- health_effect %>%
  top_n(10, injuries_total)
ggplot(top10_injury, aes(x = fct_reorder(event, injuries_total), y = injuries_total)) +
  geom_col(colour = 'red') +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Tornado and thunderstorm wind have caused\nthe highest number of injuries",
       subtitle = "Weather events since 1980 in the US with the highest number of injuries",
       y = "Total Injuries",
       x = "Event Type")
Once again, tornados have resulted in the highest number of injuries. They account for more than 3 times the number of the next highest cause of injuries, thunderstorm wind.
Note that here we have been looking at the total number of fatalities and injuries. These are likely to be higher overall for events which have occurred most often. It may also be insightful to assess the average number of fatalities and injuries per event. For example, although tornados have the highest number of fatalities, the fatality rate is only 0.06 fatalities per tornado event, whereas excessive heat has more than 1 fatality per event. Likewise, in terms of injuries, excessive heat has a much higher rate at almost 4 injuries per event, compared to tornados, which have less than 1 injury per event.
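The difference between ranking by totals and ranking by per-event rates can be seen on a toy example. The figures below are hypothetical and chosen only so that one event type is frequent with few fatalities per record, while the other is rare but more severe; the column names mirror those in health_effect.

```r
library(dplyr)

# Hypothetical records: one row per recorded event (values are made up)
toy <- tibble(
  event      = c(rep("TORNADO", 10), rep("EXCESSIVE HEAT", 2)),
  FATALITIES = c(0, 0, 1, 0, 2, 0, 0, 3, 0, 0, 1, 3)
)

toy_rates <- toy %>%
  group_by(event) %>%
  summarise(fatalities_total = sum(FATALITIES),
            events_total     = n(),
            fatality_rate    = fatalities_total / events_total) %>%
  # rank by rate per event rather than by total
  arrange(desc(fatality_rate))

toy_rates
```

In this toy data the tornado rows have the larger total (6 against 4), but excessive heat comes first once the table is sorted by fatality_rate (2 against 0.6 per event). Applying arrange(desc(fatality_rate)) to health_effect would give the equivalent ranking for the real data.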
To answer the second question, which types of events have the greatest economic consequences, let’s assess the cost of property and crop damage.
eco_cost <- storm_data_processed %>%
  group_by(event) %>%
  summarise(cost = sum(prop_damage + crop_damage),
            prop_cost = sum(prop_damage),
            crop_cost = sum(crop_damage),
            events_total = n(),
            cost_rate = cost / events_total)
eco_cost %>%
  top_n(10, cost) %>%
  ggplot(aes(x = fct_reorder(event, cost), y = cost/10^9)) +
  geom_col(colour = 'red') +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Flooding has resulted in the most\nfinancial damage to property and crops",
       subtitle = "Weather events since 1980 in the US with the most financial damage",
       y = "Cost in Billions of dollars",
       x = "Event Type")
Flooding has resulted in more than double the cost of property/crop damage compared to hurricanes/typhoons. However, the hurricane/typhoon event has only been recorded 88 times, so in terms of cost per event it is extremely costly. Tornado events rank highly again, as they did for population health impact. Excessive heat does not appear despite its high health impact; drought, however, does appear in the top 10, largely driven by crop damage costs.
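The two observations above, cost per event and the share of cost coming from crops, can be computed in the same summary. The sketch below uses hypothetical damage figures, not values from the storm data; crop_share is an illustrative column that does not exist in eco_cost.

```r
library(dplyr)

# Hypothetical damage records in dollars (values are made up)
toy_cost <- tibble(
  event       = c("FLOOD", "FLOOD", "FLOOD", "FLOOD",
                  "HURRICANE/TYPHOON", "DROUGHT", "DROUGHT"),
  prop_damage = c(2e6, 5e5, 1e6, 5e5, 3e6, 1e5, 0),
  crop_damage = c(0, 1e5, 0, 0, 2e5, 9e5, 1e6)
)

toy_summary <- toy_cost %>%
  group_by(event) %>%
  summarise(cost         = sum(prop_damage + crop_damage),
            crop_share   = sum(crop_damage) / cost,  # fraction of cost due to crops
            events_total = n(),
            cost_rate    = cost / events_total) %>%
  # rank by cost per event rather than by total cost
  arrange(desc(cost_rate))

toy_summary
```

Here the flood rows have the largest total cost, the single hurricane/typhoon row has the largest cost per event, and almost all of the drought cost comes from crops, which mirrors the pattern described for the real data.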
In conclusion, in terms of total impact on health, tornado events have the biggest adverse effects based on fatalities and injuries. However, in terms of the impact per event, excessive heat has a greater effect. Flooding has the biggest total impact on the economy, based on property and crop damage costs. However, hurricane/typhoon events have a greater cost per event.