This report is based on the Storm Data collected and disseminated by the National Oceanic and Atmospheric Administration (NOAA). The aim of the report is to explore this data set to find out which atmospheric events are the most damaging.
The data preparation is described in detail below.
The libraries used for this work are:
library(tidyverse)
library(here)
library(egg)
The data is provided as a bzip2-compressed CSV file (.csv.bz2). To access it, I create a temporary file to hold the download, read the data from that file into a data frame, and then delete the temporary file.
# Load data
# Create the data directory if it does not exist yet
if (!dir.exists(here("data"))) dir.create(here("data"))
# Download data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp <- tempfile() # Temporary file for the compressed download
download.file(url, temp, mode = "wb") # Binary mode keeps the archive intact on Windows
The data is then read, and the temporary file is deleted.
repdata <- read.csv(temp) # read.csv defaults to header = TRUE and sep = ","
unlink(temp); rm(temp, url) # Delete the temporary file and helper variables
str(repdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
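The download above goes to a temporary file, so the archive is fetched again on every run. A minimal sketch of a cached alternative follows, assuming it is acceptable to keep the compressed file on disk. The helper name `fetch_if_missing` and the destination path are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # Create the target directory if needed (no warning when it already exists)
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    # Binary mode keeps the compressed archive intact on Windows
    download.file(url, dest, mode = "wb")
  }
  dest
}

# Usage (not run here); the path is an assumption:
# path <- fetch_if_missing(url, file.path("data", "StormData.csv.bz2"))
# repdata <- read.csv(path)
```

With this approach the file is downloaded once and later runs read the local copy.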
Given the size of the data, I keep only the variables of interest (fatalities, magnitude, injuries and property damage) and sum them by event type.
eventype <- repdata %>%
  group_by(EVTYPE) %>%
  summarise(fatalities = sum(FATALITIES),
            mag = sum(MAG),
            injuries = sum(INJURIES),
            propmg = sum(PROPDMG))
head(eventype,5)
## # A tibble: 5 × 5
## EVTYPE fatalities mag injuries propmg
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 " HIGH SURF ADVISORY" 0 0 0 200
## 2 " COASTAL FLOOD" 0 0 0 0
## 3 " FLASH FLOOD" 0 0 0 50
## 4 " LIGHTNING" 0 0 0 0
## 5 " TSTM WIND" 0 171 0 108
str(eventype)
## tibble [985 × 5] (S3: tbl_df/tbl/data.frame)
## $ EVTYPE : chr [1:985] " HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
## $ fatalities: num [1:985] 0 0 0 0 0 0 0 0 0 0 ...
## $ mag : num [1:985] 0 0 0 0 171 0 0 0 0 0 ...
## $ injuries : num [1:985] 0 0 0 0 0 0 0 0 0 0 ...
## $ propmg : num [1:985] 200 0 50 0 108 8 0 0 5 0 ...
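The first rows above show EVTYPE labels with leading whitespace, which means that labels differing only in padding are grouped separately. A minimal sketch of a light clean-up step before grouping follows, using a toy data frame in place of repdata. The toy values and the name `cleaned` are illustrative assumptions.

```r
library(dplyr)

# Toy rows standing in for repdata; only the columns used here are included.
toy <- data.frame(
  EVTYPE     = c("   TSTM WIND", "TSTM WIND", "Flash Flood", "FLASH FLOOD"),
  FATALITIES = c(0, 2, 1, 3)
)

# Trim surrounding whitespace and upper-case the labels before grouping,
# so that variants differing only in padding or case are summed together.
cleaned <- toy %>%
  mutate(EVTYPE = toupper(trimws(EVTYPE))) %>%
  group_by(EVTYPE) %>%
  summarise(fatalities = sum(FATALITIES))

cleaned
```

This only handles padding and case. Merging different spellings of the same event, such as "TSTM WIND" and "THUNDERSTORM WIND", would need an explicit mapping.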
With the relevant data, I will plot the top 20 events by fatalities and injuries to see which are the most harmful.
First, let's look at the top events in general. The top_n function retrieves the top rows; when no ordering column is given, it selects by the last column, here propmg.
top_n(eventype, 20)
## Selecting by propmg
## # A tibble: 20 × 5
## EVTYPE fatalities mag injuries propmg
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 FLASH FLOOD 978 2987 1777 1420125.
## 2 FLASH FLOODING 19 0 8 28497.
## 3 FLOOD 470 102 6789 899938.
## 4 HAIL 15 28883944 1361 688693.
## 5 HEAVY RAIN 98 14556 251 50842.
## 6 HEAVY SNOW 127 179 1021 122252.
## 7 HIGH WIND 248 1062903 1137 324732.
## 8 HIGH WINDS 35 19220 302 55625
## 9 ICE STORM 89 0 1975 66001.
## 10 LIGHTNING 816 173 5230 603352.
## 11 STRONG WIND 103 139010 280 62994.
## 12 THUNDERSTORM WIND 133 4410745 1488 876844.
## 13 THUNDERSTORM WINDS 64 294020 908 446293.
## 14 TORNADO 5633 4331 91346 3212258.
## 15 TROPICAL STORM 58 303 340 48424.
## 16 TSTM WIND 504 6890911 6957 1335966.
## 17 URBAN/SML STREAM FLD 28 348 79 26052.
## 18 WILD/FOREST FIRE 12 10242 545 39345.
## 19 WILDFIRE 75 36924 911 84459.
## 20 WINTER STORM 206 728 1321 132721.
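Note that top_n does not sort its output, so the rows appear in alphabetical order rather than ranked. In this table, TORNADO has the highest totals for fatalities, injuries and property damage, while HAIL has the highest summed magnitude.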
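Since top_n keeps the rows in their existing order, the table above is not ranked. A minimal sketch of a ranked alternative with dplyr's slice_max follows, using a toy table that copies four rows (rounded as printed) from the output above. The names `toy_events` and `ranked` are illustrative.

```r
library(dplyr)

# Toy summary table with the same column names as eventype.
toy_events <- data.frame(
  EVTYPE     = c("FLASH FLOOD", "HAIL", "TORNADO", "TSTM WIND"),
  fatalities = c(978, 15, 5633, 504),
  propmg     = c(1420125, 688693, 3212258, 1335966)
)

# slice_max() keeps the n rows with the largest values of the given column
# and returns them in descending order of that column.
ranked <- toy_events %>%
  slice_max(fatalities, n = 3)

ranked
```

Here the rows come back as TORNADO, FLASH FLOOD and TSTM WIND, which makes the leading event visible at a glance.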
With the data ready, we can present the results.
The data indicates that tornadoes are the most harmful events for human life. As seen below, tornadoes rank first for both fatalities and injuries.
# Top 20 event types by fatalities
fat <- eventype %>%
  top_n(20, fatalities) %>%
  ggplot() +
  geom_bar(aes(x = fatalities,
               y = reorder(EVTYPE, fatalities)),
           stat = "identity") +
  labs(title = "Fatalities", x = "", y = "")
# Top 20 event types by injuries
inj <- eventype %>%
  top_n(20, injuries) %>%
  ggplot() +
  geom_bar(aes(x = injuries,
               y = reorder(EVTYPE, injuries)),
           stat = "identity") +
  labs(title = "Injuries", x = "", y = "")
# Show both plots side by side
ggarrange(fat, inj, ncol = 2)
In terms of material damage, tornadoes also rank first by summed PROPDMG, followed by flash floods and TSTM wind:
eventype %>%
  top_n(20, propmg) %>%
  ggplot() +
  geom_bar(aes(x = propmg,
               y = reorder(EVTYPE, propmg)),
           stat = "identity") +
  labs(title = "Property Damage", x = "", y = "")
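The property damage totals above sum PROPDMG directly, without the PROPDMGEXP column, which holds a magnitude code such as "K". A minimal sketch of scaling the amounts before summing follows, on toy data. It assumes K, M and B stand for thousands, millions and billions, and that any other code is left unscaled. That fallback, and the names `multiplier` and `prop_usd`, are assumptions for illustration.

```r
library(dplyr)

# Toy rows standing in for repdata; values are illustrative only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  PROPDMG    = c(25, 2.5, 1, 500),
  PROPDMGEXP = c("K", "M", "B", "")
)

# Assumed mapping: K = 1e3, M = 1e6, B = 1e9.
# Any other code (including the empty string) is left unscaled; this is an assumption.
damage <- toy %>%
  mutate(multiplier = case_when(
    toupper(PROPDMGEXP) == "K" ~ 1e3,
    toupper(PROPDMGEXP) == "M" ~ 1e6,
    toupper(PROPDMGEXP) == "B" ~ 1e9,
    TRUE ~ 1
  )) %>%
  group_by(EVTYPE) %>%
  summarise(prop_usd = sum(PROPDMG * multiplier))

damage
```

Applying such a scaling to the full data set could change the ranking in the plot above, since a single event recorded in billions outweighs many recorded in thousands.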