Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

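This analysis addresses the following questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

More information about the dataset can be found here and here.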


Data Processing

First, load the R libraries that will be used and set the system locale to avoid conflicts.

library(ggplot2, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(data.table, quietly = TRUE, warn.conflicts = FALSE)
library(ggpubr, quietly = TRUE, warn.conflicts = FALSE)
Sys.setlocale('LC_ALL','English')
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Next, create directories to store the data and figures, download the data from the web, and load it into R.
While downloading, write a text file that records the URL and the time and time zone of the download.

datadir <- './data'
figuredir <- './figures'
dataurl <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
datafilepath <- paste0(datadir,'/NOAAdata.csv.bz2')


if(!dir.exists(datadir)){
        dir.create(datadir)
}

if(!file.exists(datafilepath)){
        
        download.file(url = dataurl,destfile = datafilepath,method = 'curl')
        
        time <- as.character(Sys.time())
        timezone <- Sys.timezone()
        
        # Record the source URL together with the download time and time zone
        downloadinfo <- data.frame(url = dataurl,
                                   time = time,
                                   format = "%Y-%m-%d %H:%M:%S",
                                   timezone = timezone)
        
        write.table(x = downloadinfo,
                    file = paste0(datadir,'/downloadinfo.txt'),
                    row.names = FALSE)
}

NOAAdata <- fread(file = datafilepath,data.table = F)

if(!dir.exists(figuredir)){
        dir.create(figuredir)
}
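
As a small illustration of the download record described above, the sketch below writes a provenance file and reads it back. The helper name `write_download_info`, the placeholder URL, and the temporary directory are assumptions made for this example; they are not part of the analysis itself.

```r
# Hypothetical helper: record where and when a file was downloaded.
write_download_info <- function(url, dir) {
  info <- data.frame(url = url,
                     time = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                     timezone = Sys.timezone(),
                     stringsAsFactors = FALSE)
  path <- file.path(dir, "downloadinfo.txt")
  write.table(info, file = path, row.names = FALSE)
  path
}

# Demonstrate on a temporary directory with a placeholder URL.
demo_dir <- tempfile("provenance")
dir.create(demo_dir)
info_path <- write_download_info("https://example.com/data.csv.bz2", demo_dir)

# Read the record back to confirm that it round-trips.
recorded <- read.table(info_path, header = TRUE, stringsAsFactors = FALSE)
print(recorded)
```

Keeping the record next to the data file makes it possible to tell later which copy of the data an analysis was run on.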

Some preprocessing will need to be done.

First, convert the begin dates (BGN_DATE) from character to Date class.

# as.Date() parses the leading date and ignores the time part; unlike the
# POSIXlt object returned by strptime(), a Date vector is safe as a column
NOAAdata <- NOAAdata %>% mutate(BGN_DATE = as.Date(BGN_DATE, format = '%m/%d/%Y'))

Next, keep only the more recent events and the columns of interest. Here, filter the data to events that began from 2005 through 2010 and select the begin date, state, county code, event type, fatalities, injuries, and the property and crop damage values with their magnitude codes.

NOAAdata0510 <- NOAAdata %>% 
        filter(BGN_DATE >= as.Date('2005-01-01'),
               BGN_DATE <= as.Date('2010-12-31')) %>% 
        select(BGN_DATE,STATE,COUNTY,EVTYPE,
               FATALITIES,INJURIES,PROPDMG,
               PROPDMGEXP,CROPDMG,CROPDMGEXP)
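
To make the date handling concrete, the sketch below parses a few made-up strings in the same month/day/year layout as BGN_DATE and keeps those that fall within 2005 to 2010. The sample values and variable names are assumptions for illustration only.

```r
# Toy values in the same "month/day/year time" layout as BGN_DATE.
raw_dates <- c("12/31/2004 0:00:00", "1/1/2005 0:00:00",
               "8/29/2005 0:00:00", "12/31/2010 0:00:00",
               "1/1/2011 0:00:00")

# as.Date() parses the leading date part; the trailing time is ignored.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Keep dates inside the closed interval [2005-01-01, 2010-12-31].
keep <- parsed >= as.Date("2005-01-01") & parsed <= as.Date("2010-12-31")
print(parsed[keep])
```

Both end points are included, so events that began on the first or last day of the period are kept.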

Now give proper class to all the variables that need it.

NOAAdata0510 <- NOAAdata0510 %>%
        mutate(STATE = as.factor(STATE), COUNTY = as.factor(COUNTY), EVTYPE = as.factor(EVTYPE))

To handle property and crop damage correctly, we need a function that converts the recorded values into actual dollar amounts using their magnitude codes (K for thousands, M for millions, B for billions).

actualamount <-function(cost, unit){
        if(unit == 'K'){
                return((cost*1e3))
        }else if(unit == 'M'){
                return(cost*1e6)
        }else if(unit == 'B'){
                return(cost*1e9)
        }else{
                return(cost)
        }
}

Now apply this function to convert the recorded values into actual amounts.

NOAAdata0510 <- NOAAdata0510 %>% 
        mutate(prop = mapply(actualamount, PROPDMG,PROPDMGEXP),
               crop = mapply(actualamount,CROPDMG,CROPDMGEXP))
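
Calling a scalar function through mapply works, but it evaluates one row at a time. As a possible alternative, the sketch below applies the same rule to whole vectors at once with a named lookup table. The names `multipliers` and `actualamount_vec` and the sample inputs are assumptions introduced for this example.

```r
# Multipliers for the codes handled by actualamount(); anything else maps to 1.
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

actualamount_vec <- function(cost, unit) {
  # Look up each code by name; unknown or empty codes come back as NA.
  mult <- unname(multipliers[unit])
  # Treat unmatched codes as "no scaling", like the final else branch.
  mult[is.na(mult)] <- 1
  cost * mult
}

amounts <- actualamount_vec(c(25, 2.5, 1.2, 10), c("K", "M", "B", ""))
print(amounts)
```

With this version the mutate call could pass the columns directly, for example `prop = actualamount_vec(PROPDMG, PROPDMGEXP)`, without mapply.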

Results

Across the United States, which types of events are most harmful with respect to population health?

For that, we need a general look at the United States’ total injuries and fatalities for each type of event during the period from 2005 to 2010. Bar plots are appropriate to see this, but it is necessary to retrieve that information from the data set first.

health <- NOAAdata0510 %>% group_by(EVTYPE) %>%
        summarise(Injuries = sum(INJURIES), Fatalities = sum(FATALITIES) )
## `summarise()` ungrouping output (override with `.groups` argument)

Select the 15 event types with the most cases for each outcome and organize the data for plotting.

forinjuries <- health %>% select(EVTYPE, Cases = Injuries) %>%
  arrange(desc(Cases)) %>% 
  head(n = 15) %>% mutate(plot = 'Injuries')
forfatalities <- health %>% select(EVTYPE, Cases = Fatalities) %>%
  arrange(desc(Cases)) %>% 
  head(n = 15) %>% mutate(plot = 'Fatalities')

healthplot <- rbind(forinjuries,forfatalities)
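
The same top-n selection is repeated for each outcome, so it can be wrapped in a small helper. The sketch below shows one way to do this with dplyr's slice_max on toy data. The helper name `top_n_by` and the numbers are made up for illustration; they are not taken from the NOAA data.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy per-event totals (made-up numbers, not taken from the NOAA data).
toy <- data.frame(EVTYPE = c("A", "B", "C", "D"),
                  Injuries = c(5, 40, 12, 0),
                  Fatalities = c(9, 3, 0, 7),
                  stringsAsFactors = FALSE)

# Hypothetical helper: keep the n rows with the largest value in `column`.
top_n_by <- function(data, column, n, label) {
  data %>%
    select(EVTYPE, Cases = all_of(column)) %>%
    slice_max(Cases, n = n, with_ties = FALSE) %>%
    mutate(plot = label)
}

toyplot <- bind_rows(top_n_by(toy, "Injuries", 2, "Injuries"),
                     top_n_by(toy, "Fatalities", 2, "Fatalities"))
print(toyplot)
```

Setting `with_ties = FALSE` guarantees exactly n rows per outcome, which matches the behavior of arrange followed by head.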

Now, plot.

plt1<- ggplot(healthplot, aes(fill = EVTYPE,x = EVTYPE, y = Cases)) +
  geom_col() + theme_bw() + facet_wrap(. ~ plot,nrow = 2,scales = 'free_y') +
  scale_x_discrete(labels=NULL) + labs(x = '') +
  theme(legend.text = element_text(size=6),
        legend.title = element_blank(), legend.position = 'top')
print(plt1)
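
The figures directory is created earlier, but the code shown here does not write to it. The sketch below shows one possible way to save a plot with ggsave, using toy data and a temporary directory standing in for figuredir. It also sorts the bars by value with reorder, which the plot above does not do. The toy data, file name, and dimensions are all assumptions.

```r
library(ggplot2)

# Toy data standing in for healthplot (made-up numbers).
toy <- data.frame(EVTYPE = c("A", "B", "C"), Cases = c(3, 10, 6),
                  stringsAsFactors = FALSE)

# reorder() sorts the bars by descending Cases instead of alphabetically.
bar_order <- levels(reorder(toy$EVTYPE, -toy$Cases))

p <- ggplot(toy, aes(x = reorder(EVTYPE, -Cases), y = Cases, fill = EVTYPE)) +
  geom_col() + theme_bw() + labs(x = "")

# Write the plot to a file; a temporary directory stands in for figuredir.
out <- file.path(tempdir(), "toy_plot.pdf")
ggsave(filename = out, plot = p, width = 7, height = 5)
```

In the analysis itself, the file path would point into figuredir and the plot argument would be plt1 or plt2.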

The plots show that excessive heat and tornadoes caused the most injuries and the most fatalities. Flash floods also deserve attention: they were very deadly during the period, even though they caused comparatively few injuries.

Across the United States, which types of events have the greatest economic consequences?

For that we need a general look at the United States’ total amount of damage for properties and crops for each type of event during the period from 2005 to 2010. Bar plots are appropriate to see this, but it is necessary to retrieve that information from the data set first.

dmg <- NOAAdata0510 %>% group_by(EVTYPE) %>%
        summarise(Properties = sum(prop), Crops = sum(crop) )
## `summarise()` ungrouping output (override with `.groups` argument)

Select the 10 event types with the highest amount of damage for each damage category and organize the data for plotting.

forproperties <- dmg %>% select(EVTYPE, Amount = Properties) %>%
  arrange(desc(Amount)) %>% 
  head(n = 10) %>% mutate(plot = 'Damaged properties')
forcrop <- dmg %>% select(EVTYPE, Amount = Crops) %>%
  arrange(desc(Amount)) %>% 
  head(n = 10) %>% mutate(plot = 'Damaged crops')

dmgplot <- rbind(forproperties,forcrop)

Now, plot.

plt2 <- ggplot(dmgplot, aes(fill = EVTYPE,x = EVTYPE, y = Amount/1e6)) +
  geom_col() + theme_bw() + facet_wrap(. ~ plot,nrow = 2,scales = 'free_y') +
  scale_x_discrete(labels=NULL) + labs(x = '',y = 'Amount of damage (in millions of $)') +
  theme(legend.text = element_text(size=6),
        legend.title = element_blank(), legend.position = 'top')
print(plt2)

The plots show that floods caused by far the most property damage, exceeding 120 billion dollars during the period. Floods also caused heavy crop damage, but drought was the costliest event for crops, reaching as much as 4 billion dollars.