This assignment explores the NOAA Storm Database and looks into the types of weather events that cause the most human and material damage across the United States. We conclude that, between 1950 and 2011, heat, tornadoes, thunderstorms, and floods have caused the highest numbers of fatalities and injuries. Similarly, tornadoes, floods, drought, and hurricanes have caused damage in billions of dollars to private property. States that are particularly affected by these events are Texas, Alabama, and Florida, for which we urge the authorities of these states to take the necessary precautions in order to avoid future disasters.
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(ggplot2)
We open the .zip file with the data. The columns we want to analyze are BGNDATE, EVTYPE, FATALITIES, INJURIES, CROPDMG, PROPDMG, CROPDMGEXP, and PROPDMGEXP.
dataurl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp_file <- tempfile(fileext = ".bz2")
download.file(dataurl, temp_file)
data <- read.csv(temp_file, header = TRUE, sep = ",")
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
We keep only the relevant columns. We will also turn the BGN_DATE column into date objects, add a separate column for year only, and change EVTYPE data into lowercase to avoid duplicates. Trailing and leading spaces were also removed.
nrow(data)
## [1] 902297
damage<-data[, c(2,6:8,23:28)] %>%
mutate(BGN_DATE= str_sub(BGN_DATE, end = -8)) %>%
mutate(BGN_DATE= mdy(BGN_DATE))%>%
mutate(year= year(BGN_DATE)) %>%
mutate(EVTYPE= tolower(EVTYPE)) %>% #lowercase
mutate(EVTYPE=trimws(EVTYPE)) #remove trailing and leading spaces
A quick look into unique(damage$EVTYPE) shows that many
events can be merged into one type. We collapse all events with the
words “heat” (e.g. “heat”, “excessive heat”) into one type of event; we
do the same with events containing “thunderstorm”, “hurricane/typhoon”,
and “flood”.
evtypes<-unique(damage$EVTYPE)
heat<-str_subset(evtypes, "heat")
thstm<-str_subset(evtypes, "thunderstorm")
flood<- str_subset(evtypes, "flood")
ht<-str_subset(evtypes, "hurricane|typhoon")
damage <- damage %>%
mutate(EVTYPE = str_replace_all(EVTYPE, "\\?", "unknown")) %>%
mutate(EVTYPE = str_replace_all(EVTYPE, "tstm", "thunderstorm")) %>%
mutate(EVTYPE = if_else(EVTYPE %in% heat, "heat", EVTYPE)) %>%
mutate(EVTYPE = if_else(EVTYPE %in% thstm, "thunderstorm", EVTYPE)) %>%
mutate(EVTYPE = if_else(EVTYPE %in% ht, "hurricane/typhoon", EVTYPE)) %>%
mutate(EVTYPE= if_else(EVTYPE %in% flood, "flood", EVTYPE)) #collapsed all evtypes with "heat", "thstm wind", and "flood"
The columns CROPDMGEXP and PROPDMGEXP seem to have different exponents:
unique(data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
unique(data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
It is unclear what the numbers encode, but the remarks column suggests that K, M, and B refer to thousands, millions, and billions. We will turn everything to lowercase and transform into thousands.
damage$PROPDMGEXP<-tolower(data$PROPDMGEXP)
damage$CROPDMGEXP<-tolower(data$CROPDMGEXP)
damage <- damage %>%
mutate(CROPS= if_else(CROPDMGEXP == "m", CROPDMG*1000,
if_else(CROPDMGEXP == "b", CROPDMG*1000000, CROPDMG))) %>%
mutate(PROPERTY= if_else(PROPDMGEXP == "m", PROPDMG*1000,
if_else(PROPDMGEXP == "b", PROPDMG*1000000, PROPDMG)))
First we will look into the number of fatalities per event.
summary(damage$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
We see that the data is highly skewed, with the mean above the 3rd quartile. This means that, from 1950 to date, more than 75% of the weather events registered in the database have caused zero fatalities. We look into the top 3 events that have caused the highest number of fatalities over time:
damage %>%
group_by(EVTYPE) %>%
summarise(mean.fatalities = mean(FATALITIES), total.fatalities= sum(FATALITIES), deadliest.single=max(FATALITIES), n=n()) %>% slice_max(total.fatalities, n=3)
## # A tibble: 3 × 5
## EVTYPE mean.fatalities total.fatalities deadliest.single n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 tornado 0.0929 5633 158 60652
## 2 heat 1.19 3138 583 2648
## 3 flood 0.0184 1525 20 82725
We see that tornadoes, heat, and floods are the events with the highest numbers fatalities. Of these, we see that tornadoes are frequent and deadly, but the mean fatalities are fewer than those left by heat waves. Excessive heat, while less frequent, is more likely to cause a higher number of deaths per event, where the deadliest heatwave caused 583 deaths. Floods, while very frequent, tend to cause fewer fatalities.
We now look into the events causing more injuries.
summary(damage$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
Again, we see that the mean is above the third quartile. We group the total of events by type and check for the three types of event that have caused the largest numbers of injuries over time:
damage %>%
group_by(EVTYPE) %>%
summarise(mean.injuries = mean(INJURIES), total.injuries= sum(INJURIES), worst.single=max(INJURIES), n=n()) %>% slice_max(total.injuries, n=3)
## # A tibble: 3 × 5
## EVTYPE mean.injuries total.injuries worst.single n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 tornado 1.51 91346 1700 60652
## 2 thunderstorm 0.0283 9540 70 336745
## 3 heat 3.48 9224 519 2648
We see that tornadoes are again on top of the list, causing the most total injuries from 1950 to date. Thunderstorms, while very frequent, are less likely to cause injuries. However, heat waves cause the largest number of mean injuries per event, despite being less frequent.
We will now look into damages.
summary(damage$PROPERTY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 4.74e+02 0.00e+00 1.15e+08
summary(damage$CROPS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 54 0 5000000
Again, we see that 75% of the events cause no damage.
We filter out those cases of property damage with unclear exponents, and leave only those with “k”, “m”, and “b”.
damage %>%
filter(PROPDMGEXP %in% c("b", "m", "k")) %>%
group_by(EVTYPE) %>%
summarise(mean.propdmg = mean(PROPERTY), sum.propdmg= sum(PROPERTY), worst.single=max(PROPERTY), n=n()) %>% slice_max(sum.propdmg, n=3)
## # A tibble: 3 × 5
## EVTYPE mean.propdmg sum.propdmg worst.single n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 flood 3219. 167529215. 115000000 52040
## 2 hurricane/typhoon 381055. 85356410. 16930000 224
## 3 tornado 1099. 56937160. 2800000 51827
We see that flood, hurricane/typhoon, and tornado are the event types that lead to the highest sums of property damage. While hurricanes and typhoons are less frequent, they leave the highest mean damage.
We now turn to see the damages in agriculture.
damage %>%
filter(CROPDMGEXP %in% c("b", "m", "k")) %>%
group_by(EVTYPE) %>%
summarise(mean.cropdmg = mean(CROPS), sum.cropdmg= sum(CROPS), worst.single=max(CROPS), n=n()) %>% slice_max(sum.cropdmg, n=3)
## # A tibble: 3 × 5
## EVTYPE mean.cropdmg sum.cropdmg worst.single n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 drought 9199. 13972566 1000000 1519
## 2 flood 343. 12380079. 5000000 36115
## 3 hurricane/typhoon 47146. 5516118. 1510000 117
Crops are affected by drought, flood, and hurricanes. Drought leads the sum of crop damages from 1950 to date, whereas floods are the most frequent. Hurricanes are the least frequent but the count of mean crop damage caused is the highest.
Now we visualize the differences between types of events and the number of fatalities and injuries caused.
worstevents <- c("tornado", "heat", "flood", "thunderstorm", "hurricane/typhoon", "drought")
topmost<-damage %>% filter(EVTYPE %in% worstevents) %>%
pivot_longer(cols=c(FATALITIES, INJURIES, PROPERTY, CROPS), names_to="TypeofDamage", values_to="counts") %>%
mutate(AFFECTED=if_else(TypeofDamage %in% c("FATALITIES", "INJURIES"), "human", "material"))
sum<-topmost %>% summarise(.by = c(EVTYPE, TypeofDamage, AFFECTED), counts=sum(counts), mean=sum(counts)/n(), n=n())
sum
## # A tibble: 24 × 6
## EVTYPE TypeofDamage AFFECTED counts mean n
## <chr> <chr> <chr> <dbl> <dbl> <int>
## 1 tornado FATALITIES human 5633 0.0929 60652
## 2 tornado INJURIES human 91346 1.51 60652
## 3 tornado PROPERTY material 56937459. 939. 60652
## 4 tornado CROPS material 415113. 6.84 60652
## 5 thunderstorm FATALITIES human 728 0.00216 336745
## 6 thunderstorm INJURIES human 9540 0.0283 336745
## 7 thunderstorm PROPERTY material 10982003. 32.6 336745
## 8 thunderstorm CROPS material 1271797. 3.78 336745
## 9 flood FATALITIES human 1525 0.0184 82725
## 10 flood INJURIES human 8604 0.104 82725
## # ℹ 14 more rows
Since the data is rather skewed, we log-transform for visualization purposes.
ggplot(topmost, aes(x=EVTYPE, y= log2(counts), color=TypeofDamage)) + geom_boxplot() + facet_grid(vars(AFFECTED)) + scale_color_discrete(breaks=c("FATALITIES","INJURIES","PROPERTY", "CROPS")) + labs(x= "Event type", y="total damage (log-transformed)")
## Warning: Removed 1727288 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Total damage by event, 1950-2011 (material damage in thousands of dollars)
Finally, what is the evolution of fatalities and injuries caused by these events over time? Have better measures been implemented after events that turned into massive disasters?
yearsum<-topmost %>%
filter(AFFECTED== "human") %>%
group_by(TypeofDamage, EVTYPE, year)%>%
filter_out(EVTYPE== "drought" | EVTYPE=="hurricane/typhoon") %>%
summarise(total=sum(counts), mean=mean(counts), n=n(), .groups = "drop_last")
ggplot(yearsum, aes(x=year, y= total, color=TypeofDamage)) + geom_line() + facet_wrap(vars(EVTYPE), scales="free")
Total fatalities and injuries caused by event types by year, 1960-2011
Here we see that disasters show a relatively steady trend regarding fatalities, except for heat, where the latest years show fewer deaths. Tornadoes seem to cause high numbers of fatalities even after 2010. Regarding injuries, floods seem to be relatively under control after 2000, but the other events show high numbers of injuries also in later years.
Regarding damages to crop and property, we summarize and look for total numbers:
yearsum2<-topmost %>%
filter(AFFECTED== "material") %>%
group_by(TypeofDamage, EVTYPE, year)%>%
filter_out(EVTYPE== "thunderstorm" | EVTYPE=="heat") %>%
summarise(total=sum(counts), mean=mean(counts), n=n(), .groups = "drop_last")
yearsum2 %>%
slice_max(total, n=1)
## # A tibble: 8 × 6
## # Groups: TypeofDamage, EVTYPE [8]
## TypeofDamage EVTYPE year total mean n
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 CROPS drought 2000 2438100 19662. 124
## 2 CROPS flood 1993 5179773. 3223. 1607
## 3 CROPS hurricane/typhoon 2005 2012682 42823. 47
## 4 CROPS tornado 2005 82105. 61.1 1343
## 5 PROPERTY drought 2003 645150 12407. 52
## 6 PROPERTY flood 2006 118704535. 30159. 3936
## 7 PROPERTY hurricane/typhoon 2005 49786635 1059290. 47
## 8 PROPERTY tornado 2011 9819601. 4480. 2192
ggplot(yearsum2, aes(x=year, y= total, color=TypeofDamage)) + geom_line() + facet_wrap(vars(EVTYPE), scales="free") + labs(y= "Annual damage (in thousands of dollars)")
Annual damage to crops and property, 1950-2011
Crop damage seems steady for all types of events over time, except for drought, where four peaks can be seen between 1995 and 2006. Floods and hurricanes/typhoons have caused the largest damage in property in years 2005 and 2006, where 47 hurricanes and typhoons left total property damages of over 49 billion dollars, and 3936 floods left a total damage of 118 billion dollars. in 2011, tornadoes were exceptionally destructive, causing 9 billion dollars worth of damages to property.
We look into the events that cause the most damage and check which states are more at risk:
topmost %>%
group_by(EVTYPE, STATE) %>%
summarise(n=n(), .groups = "drop_last") %>%
slice_max(n, n=1)
## # A tibble: 6 × 3
## # Groups: EVTYPE [6]
## EVTYPE STATE n
## <chr> <chr> <int>
## 1 drought TX 1732
## 2 flood TX 37988
## 3 heat AL 972
## 4 hurricane/typhoon FL 276
## 5 thunderstorm TX 93568
## 6 tornado TX 33152
Across the US, the state of Texas has been the most affected by drought, flood, thunderstorm, and tornadoes. Alabama shows the highest incidence of events related to heat. Finally, Florida shows the highest number of hurricanes/typhoons in the country.