Synopsis

This report is based on the Storm Data collected and disseminated by the National Oceanic and Atmospheric Administration (NOAA). The aim of the report is to explore this data set and identify which atmospheric events are the most damaging.

Data Processing

Below follows a thorough description of the data preparation.

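The libraries used for this work are tidyverse, here, and egg. They are loaded at the top of the first code chunk in the Data Load section.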

Data Load

The data is provided as a bzip2-compressed CSV file (.csv.bz2). To access it, I download the raw data to a temporary file and then read that file with read.table.

The temporary file is deleted once the data has been read.

library(tidyverse)
library(here)
library(egg)
#Load data

if (!dir.exists(here("data"))) dir.create(here("data")) # Set up directory for work and data

#Download data

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
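temp <- tempfile(fileext = ".csv.bz2") # Temporary file to hold the bz2-compressed download
download.file(url, temp, mode = "wb") # Binary mode keeps the archive intact on Windows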

The data is then read.

repdata <- read.table(temp, sep = ",", header = TRUE)

unlink(temp); rm(temp, url) #Delete temporary files

str(repdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Subsetting data

Given the size of the data, I will subset the dataset to hold the important variables only, and then group these by event type.

eventype <- repdata %>% 
        group_by(EVTYPE) %>%
        summarise(fatalities = sum(FATALITIES), 
                  mag = sum(MAG),
                  injuries = sum(INJURIES),
                  propmg = sum(PROPDMG))

head(eventype,5)
## # A tibble: 5 × 5
##   EVTYPE                  fatalities   mag injuries propmg
##   <chr>                        <dbl> <dbl>    <dbl>  <dbl>
## 1 "   HIGH SURF ADVISORY"          0     0        0    200
## 2 " COASTAL FLOOD"                 0     0        0      0
## 3 " FLASH FLOOD"                   0     0        0     50
## 4 " LIGHTNING"                     0     0        0      0
## 5 " TSTM WIND"                     0   171        0    108
str(eventype)
## tibble [985 × 5] (S3: tbl_df/tbl/data.frame)
##  $ EVTYPE    : chr [1:985] "   HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
##  $ fatalities: num [1:985] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mag       : num [1:985] 0 0 0 0 171 0 0 0 0 0 ...
##  $ injuries  : num [1:985] 0 0 0 0 0 0 0 0 0 0 ...
##  $ propmg    : num [1:985] 200 0 50 0 108 8 0 0 5 0 ...
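The leading spaces in the EVTYPE values above suggest that the same event can appear under several labels, which splits its totals across groups. A minimal sketch of normalising the labels before grouping is shown here. The data frame `toy` and the helper `clean_evtype` are hypothetical, made up for illustration, and the sketch only trims whitespace and unifies case; it does not merge synonyms such as TSTM WIND and THUNDERSTORM WIND.

```r
library(dplyr)

# Hypothetical stand-in for repdata, with only the columns used here
toy <- data.frame(
  EVTYPE     = c(" TSTM WIND", "TSTM WIND", "tstm wind", " FLASH FLOOD", "FLASH FLOOD"),
  FATALITIES = c(0, 2, 1, 0, 3),
  INJURIES   = c(1, 0, 4, 0, 2)
)

# Hypothetical helper: strip surrounding whitespace and convert to upper case
clean_evtype <- function(x) toupper(trimws(x))

toy_summary <- toy %>%
  mutate(EVTYPE = clean_evtype(EVTYPE)) %>%
  group_by(EVTYPE) %>%
  summarise(fatalities = sum(FATALITIES),
            injuries   = sum(INJURIES),
            .groups    = "drop")

print(toy_summary)
```

Applied to repdata, the same mutate step would go before group_by(EVTYPE) in the chunk above, and it would likely reduce the 985 distinct event types.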

With the relevant data, I will plot the top 20 events by fatalities and injuries to see which are the most harmful.

Data exploration

Let’s look at the top events in general. The top_n function retrieves the top rows; when no ordering column is given, it selects by the last column, which here is propmg.

top_n(eventype, 20)
## Selecting by propmg
## # A tibble: 20 × 5
##    EVTYPE               fatalities      mag injuries   propmg
##    <chr>                     <dbl>    <dbl>    <dbl>    <dbl>
##  1 FLASH FLOOD                 978     2987     1777 1420125.
##  2 FLASH FLOODING               19        0        8   28497.
##  3 FLOOD                       470      102     6789  899938.
##  4 HAIL                         15 28883944     1361  688693.
##  5 HEAVY RAIN                   98    14556      251   50842.
##  6 HEAVY SNOW                  127      179     1021  122252.
##  7 HIGH WIND                   248  1062903     1137  324732.
##  8 HIGH WINDS                   35    19220      302   55625 
##  9 ICE STORM                    89        0     1975   66001.
## 10 LIGHTNING                   816      173     5230  603352.
## 11 STRONG WIND                 103   139010      280   62994.
## 12 THUNDERSTORM WIND           133  4410745     1488  876844.
## 13 THUNDERSTORM WINDS           64   294020      908  446293.
## 14 TORNADO                    5633     4331    91346 3212258.
## 15 TROPICAL STORM               58      303      340   48424.
## 16 TSTM WIND                   504  6890911     6957 1335966.
## 17 URBAN/SML STREAM FLD         28      348       79   26052.
## 18 WILD/FOREST FIRE             12    10242      545   39345.
## 19 WILDFIRE                     75    36924      911   84459.
## 20 WINTER STORM                206      728     1321  132721.

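The rows keep their original alphabetical order, so the first row is not the top-ranked event. Reading down the columns, TORNADO has the highest totals for fatalities, injuries, and property damage, while HAIL has the largest summed magnitude.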
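Because top_n does not sort its result, the ranking is easier to read when the rows are ordered explicitly. The sketch below uses slice_max, which supersedes top_n in dplyr 1.0.0 and later and returns rows in descending order of the chosen column. The tibble `toy` is a hypothetical four-row extract whose values are copied, rounded, from the table above.

```r
library(dplyr)

# Hypothetical extract: four rows taken from the top_n output above
toy <- tibble(
  EVTYPE     = c("FLASH FLOOD", "HAIL", "LIGHTNING", "TORNADO"),
  fatalities = c(978, 15, 816, 5633),
  propmg     = c(1420125, 688693, 603352, 3212258)
)

# Top 3 by each measure, returned in descending order
top_by_fatalities <- toy %>% slice_max(order_by = fatalities, n = 3)
top_by_propmg     <- toy %>% slice_max(order_by = propmg, n = 3)

print(top_by_fatalities)
print(top_by_propmg)
```

Naming the ordering column also avoids relying on the default of selecting by the last column.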

With data ready, we will be able to present the results.

Results

The data indicates that tornadoes are the most harmful events for human life. As seen below, tornadoes rank first for both fatalities and injuries.

fat <- eventype %>% 
        top_n(20, fatalities) %>%
        ggplot() + 
        geom_bar(aes(x=fatalities, 
                     y=reorder(EVTYPE,+fatalities)),
                 stat="identity") +
        labs(title="Fatalities",x="",y="")

inj <- eventype %>% 
        top_n(20, injuries) %>%
        ggplot() + 
        geom_bar(aes(x=injuries, 
                     y=reorder(EVTYPE,+injuries)),
                 stat="identity") +
        labs(title="Injuries",x="",y="")

# Plot code:

ggarrange(fat,inj, ncol = 2)

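In terms of material damage, tornadoes again rank first by summed PROPDMG, followed by flash floods and TSTM WIND. These sums do not apply the PROPDMGEXP multiplier, so they are raw magnitudes rather than dollar amounts.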

eventype %>% 
        top_n(20, propmg) %>%
        ggplot() + 
        geom_bar(aes(x=propmg, 
                     y=reorder(EVTYPE,+propmg)),
                 stat="identity") +
        labs(title="Property Damage",x="",y="")
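The PROPDMG column is meant to be read together with PROPDMGEXP, where K, M, and B stand for thousands, millions, and billions of dollars. A rough sketch of scaling the damage before summing follows. The data frame `toy` and the helper `exp_multiplier` are hypothetical, and treating every other code as a multiplier of 1 is an assumption made here for simplicity, not a documented rule.

```r
library(dplyr)

# Hypothetical rows with the same column names as repdata
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1.5, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map the exponent code to a numeric multiplier.
# Assumption: any code other than K, M, or B (in either case) counts as 1.
exp_multiplier <- function(code) {
  code <- toupper(code)
  case_when(
    code == "K" ~ 1e3,
    code == "M" ~ 1e6,
    code == "B" ~ 1e9,
    TRUE        ~ 1
  )
}

damage <- toy %>%
  mutate(prop_usd = PROPDMG * exp_multiplier(PROPDMGEXP)) %>%
  group_by(EVTYPE) %>%
  summarise(prop_usd = sum(prop_usd), .groups = "drop") %>%
  arrange(desc(prop_usd))

print(damage)
```

With this scaling, a single event recorded in billions can outweigh many events recorded in thousands, so the ranking in the plot above may change once the multiplier is applied to the full data.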