Synopsis

In this analysis, we’re going to get the Storm Data database from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) website. Then we’ll conduct some exploratory analysis to understand the data better. We’ll then modify the data set to better suit our analysis needs. Finally, we’ll proceed to answer the questions that were posed by the assignment:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Across the United States, which types of events have the greatest economic consequences?

Data Processing

Reading the Raw Data

The first step is to actually get the data, to do so, we’re going to download it from the NOAA website.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("./data/stormdata.csv.bz2")){
        download.file (url, "./data/stormdata.csv.bz2")
}
if(!exists("stormdata")){
        stormdata <- read.csv("./data/stormdata.csv.bz2")
}

Exploratory Data Analysis

Let’s now look at some properties of the data set.

library(tidyverse) # loading libraries that will
library(lubridate) # be useful during the analysis
dim(stormdata)
## [1] 902297     37
names(stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
as_tibble(stormdata)
## # A tibble: 902,297 × 37
##    STATE__ BGN_DATE  BGN_T…¹ TIME_…² COUNTY COUNT…³ STATE EVTYPE BGN_R…⁴ BGN_AZI
##      <dbl> <chr>     <chr>   <chr>    <dbl> <chr>   <chr> <chr>    <dbl> <chr>  
##  1       1 4/18/195… 0130    CST         97 MOBILE  AL    TORNA…       0 ""     
##  2       1 4/18/195… 0145    CST          3 BALDWIN AL    TORNA…       0 ""     
##  3       1 2/20/195… 1600    CST         57 FAYETTE AL    TORNA…       0 ""     
##  4       1 6/8/1951… 0900    CST         89 MADISON AL    TORNA…       0 ""     
##  5       1 11/15/19… 1500    CST         43 CULLMAN AL    TORNA…       0 ""     
##  6       1 11/15/19… 2000    CST         77 LAUDER… AL    TORNA…       0 ""     
##  7       1 11/16/19… 0100    CST          9 BLOUNT  AL    TORNA…       0 ""     
##  8       1 1/22/195… 0900    CST        123 TALLAP… AL    TORNA…       0 ""     
##  9       1 2/13/195… 2000    CST        125 TUSCAL… AL    TORNA…       0 ""     
## 10       1 2/13/195… 2000    CST         57 FAYETTE AL    TORNA…       0 ""     
## # … with 902,287 more rows, 27 more variables: BGN_LOCATI <chr>,
## #   END_DATE <chr>, END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>,
## #   END_RANGE <dbl>, END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>,
## #   WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>,
## #   PROPDMG <dbl>, PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>,
## #   WFO <chr>, STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>,
## #   LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, …

There are 902297 observations across 37 variables. The variable names are described above.

Understanding the Time Frame of the Data

BGN_DATE could be a variable of interest if we wanted to compare between different periods or if we wanted to look at a specific range of time. By now, we’re content to look at the data as a whole, but it is important to understand the time frame that we’re looking at.

stormdata %>% mutate (BGN_DATE = mdy_hms(BGN_DATE)) ->
        stormdata
range(stormdata$BGN_DATE)
## [1] "1950-01-03 UTC" "2011-11-30 UTC"

The above sets the time-frame of the observations between January of 1950 and November of 2011.

Answering the Proposed Questions

Question 1

Based on the following question raised by the assignment:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The variables of interest are: EVTYPE, FATALITIES and INJURIES

We can group the data by the event type while summarizing with the sum of the fatalities and injuries for each event type. We can then order the observations by FATALITIES first and then INJURIES to resolve ties.

stormdata %>% as_tibble() %>% group_by(EVTYPE) %>% summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES)) %>% arrange(desc(FATALITIES), desc(INJURIES)) ->
        healthdmg
healthdmg
## # A tibble: 985 × 3
##    EVTYPE         FATALITIES INJURIES
##    <chr>               <dbl>    <dbl>
##  1 TORNADO              5633    91346
##  2 EXCESSIVE HEAT       1903     6525
##  3 FLASH FLOOD           978     1777
##  4 HEAT                  937     2100
##  5 LIGHTNING             816     5230
##  6 TSTM WIND             504     6957
##  7 FLOOD                 470     6789
##  8 RIP CURRENT           368      232
##  9 HIGH WIND             248     1137
## 10 AVALANCHE             224      170
## # … with 975 more rows

The table above summarizes the data in the descending order by FATALITIES, and breaks ties by using INJURIES. The tie-breaking becomes most useful in the observations that didn’t lead to any FATALITIES. We can see only those observations by filtering.

healthdmg %>% filter(FATALITIES == 0) %>% select (EVTYPE,INJURIES) -> injuries
injuries
## # A tibble: 817 × 2
##    EVTYPE                 INJURIES
##    <chr>                     <dbl>
##  1 Heat Wave                    70
##  2 WINTER WEATHER MIX           68
##  3 SNOW/HIGH WINDS              36
##  4 THUNDERSTORMW                27
##  5 TORNADO F2                   16
##  6 GLAZE/ICE STORM              15
##  7 HEAVY SNOW/ICE               10
##  8 SMALL HAIL                   10
##  9 THUNDERSTORM  WINDS          10
## 10 NON-SEVERE WIND DAMAGE        7
## # … with 807 more rows

The table above summarizes the events that didn’t lead to any FATALITIES by descending order of INJURIES.

Question 2

Based on the following question raised by the assignment

Across the United States, which types of events have the greatest economic consequences?

The variables of interest are: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.

Let’s look at that segment of the data.

stormdata %>% select (EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP) %>% as_tibble() -> economicdmg
economicdmg
## # A tibble: 902,297 × 5
##    EVTYPE  PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
##    <chr>     <dbl> <chr>        <dbl> <chr>     
##  1 TORNADO    25   K                0 ""        
##  2 TORNADO     2.5 K                0 ""        
##  3 TORNADO    25   K                0 ""        
##  4 TORNADO     2.5 K                0 ""        
##  5 TORNADO     2.5 K                0 ""        
##  6 TORNADO     2.5 K                0 ""        
##  7 TORNADO     2.5 K                0 ""        
##  8 TORNADO     2.5 K                0 ""        
##  9 TORNADO    25   K                0 ""        
## 10 TORNADO    25   K                0 ""        
## # … with 902,287 more rows

The table above summarizes the economic damages. There’s a problem with it, though, the numbers in PROPDMG are absolute values that need to take into account the multiplicand in PROPDMGEXP. We can use the case_when function in dplyr to deal with this. But first, we need to see which cases exist.

unique(economicdmg$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
length(unique(economicdmg$PROPDMGEXP))
## [1] 19

From the above, we can see that there are many different cases. Some of them are quite obvious, such as “K” meaning thousands, “B” meaning billions and “M” meaning millions. The others are a little bit obscure.

After a quick online search, a report by Eddie Song reveals the meaning of each observation.

These are possible values of CROPDMGEXP and PROPDMGEXP:
H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character
H,h = hundreds = 100
K,k = kilos = thousands = 1,000
M,m = millions = 1,000,000
B,b = billions = 1,000,000,000
(+) = 1
(-) = 0
(?) = 0
black/empty character = 0
numeric 0..8 = 10

Now we multiply the values in PROPDMG to the according exponent as per the report above.

economicdmg %>% mutate (PROPDMG = case_when(
        tolower(PROPDMGEXP) == "k" ~ PROPDMG*1000,
        tolower(PROPDMGEXP) == "m" ~ PROPDMG*1000000,
        tolower(PROPDMGEXP) == "b" ~ PROPDMG*1000000000,
        tolower(PROPDMGEXP) == "h" ~ PROPDMG*100,
        PROPDMGEXP %in% 0:8 ~ PROPDMG*10,
        PROPDMGEXP == "-" ~ PROPDMG*0,
        PROPDMGEXP == "" ~ PROPDMG*0,
        PROPDMGEXP == "+" ~ PROPDMG*1,
        PROPDMGEXP == "?" ~ PROPDMG*0)) %>% select(-PROPDMGEXP) -> economicdmg
economicdmg
## # A tibble: 902,297 × 4
##    EVTYPE  PROPDMG CROPDMG CROPDMGEXP
##    <chr>     <dbl>   <dbl> <chr>     
##  1 TORNADO   25000       0 ""        
##  2 TORNADO    2500       0 ""        
##  3 TORNADO   25000       0 ""        
##  4 TORNADO    2500       0 ""        
##  5 TORNADO    2500       0 ""        
##  6 TORNADO    2500       0 ""        
##  7 TORNADO    2500       0 ""        
##  8 TORNADO    2500       0 ""        
##  9 TORNADO   25000       0 ""        
## 10 TORNADO   25000       0 ""        
## # … with 902,287 more rows

Now we need to do the same for CROPDMG and CROPDMGEXP.

## # A tibble: 902,297 × 3
##    EVTYPE  PROPDMG CROPDMG
##    <chr>     <dbl>   <dbl>
##  1 TORNADO   25000       0
##  2 TORNADO    2500       0
##  3 TORNADO   25000       0
##  4 TORNADO    2500       0
##  5 TORNADO    2500       0
##  6 TORNADO    2500       0
##  7 TORNADO    2500       0
##  8 TORNADO    2500       0
##  9 TORNADO   25000       0
## 10 TORNADO   25000       0
## # … with 902,287 more rows

Now that the data is tidy, we can group it by the EVTYPE and take the sum of the damages. We will then order by the TOTALDMG in descending order to answer the proposed question.

economicdmg %>% group_by(EVTYPE) %>% summarise(TOTALDMG = sum(PROPDMG,CROPDMG)) %>% arrange(desc(TOTALDMG)) -> totaleconomicdmg
totaleconomicdmg
## # A tibble: 985 × 2
##    EVTYPE                TOTALDMG
##    <chr>                    <dbl>
##  1 FLOOD             150319678250
##  2 HURRICANE/TYPHOON  71913712800
##  3 TORNADO            57352117607
##  4 STORM SURGE        43323541000
##  5 HAIL               18758224527
##  6 FLASH FLOOD        17562132111
##  7 DROUGHT            15018672000
##  8 HURRICANE          14610229010
##  9 RIVER FLOOD        10148404500
## 10 ICE STORM           8967041810
## # … with 975 more rows

Results

We are going to plot some plots to answer the proposed questions. Let’s start with the first question.

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

First, let’s demonstrate the correlation between these two variables across the different event types.

model <- lm(INJURIES ~ FATALITIES, data = healthdmg)
reg_summary <- summary(model)
healthdmg %>% 
        ggplot(aes(x = FATALITIES, y = INJURIES)) +
        geom_point() +
        geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
        labs(caption = paste0("Regression equation: y = ",
                        round(reg_summary$coefficients[2, 1], 2),
                        "x + ",
                        round(reg_summary$coefficients[1, 1], 2),
                        ", R^2 = ",
                        round(reg_summary$r.squared, 2)))

With the strong correlation demonstrated, the plot that represents the FATALITIES per EVTYPE is enough to answer the question above, as FATALITIES are much more important than INJURIES with respect to the health of population, and due to the fact that both these variables are so closely correlated.Let’s filter only the observations which have more than 500 fatalities, so that the plot is more readable.

healthdmg %>% filter (FATALITIES>500) %>% 
        ggplot(aes(x = reorder(EVTYPE,-FATALITIES), y = FATALITIES)) +
        geom_col() +
        theme(axis.text.x = element_text(size = 8,
                                         angle = 45,
                                         hjust = 1)) +
        labs (x = "Event Type", y = "Fatalities") +
        ggtitle ("Fatalities due to Severe Weather Events in the US", 
                 subtitle = "From 1950 to 2011")

The plot shows, in descending order the types of event that are most harmful to health of US citizens.

Question 2

Across the United States, which types of events have the greatest economic consequences?

To answer this question, we will plot as above, but considering the economic damage. We will filter only the causes that have surpassed 10 billion dollars in TOTALDMG during the period, so that the plot is more readable.

totaleconomicdmg %>% filter(TOTALDMG>10000000000) %>%
        ggplot(aes(x = reorder(EVTYPE,-TOTALDMG), y = TOTALDMG)) + 
        geom_col() +
        theme(axis.text.x = element_text(size = 8,
                                         angle = 45,
                                         hjust = 1)) +
        labs (x = "Event Type", y = "Total Economic Damage (US Dollars)") +
        ggtitle ("Economic Damage due to Severe Weather Events in the US", 
                 subtitle = "From 1950 to 2011")

The plot above shows the types of event, in descending order, that have caused the most economic damage in the studied period.