Synopsis

This analysis summarizes the impacts that storms have on people and property. The dataset is provided by the U.S. National Oceanic and Atmospheric Administration (NOAA) and covers the period from Jan 3, 1950 to Nov 30, 2011.
The first part of the analysis looks at what types of storms cause the most casualties and the second part looks at what types of storms cause the most property and crop damage.
Analysis showed that wind storms caused the vast majority of casualties, with nearly nine times the total of the second-highest cause.
Flooding caused the most damage to property and crops, at roughly 2.4 times the total of the next-highest cause.

Preliminaries

Start with an empty environment and set the working directory

rm(list = ls())
setwd("~/Analytics Course/04_ReproducibleResearch/Project2")

Load needed libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggplot2)
library(R.cache)
## R.cache v0.12.0 (2015-11-12) successfully loaded. See ?R.cache for help.

Load the data

# Location of data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#download.file(url, "storm_data.csv.bz2")

# Read data into variable
storm_data_raw <- read.csv("storm_data.csv.bz2")

# Parse the begin dates on a copy of the data
sdr <- storm_data_raw
sdr$BGN_DATE <- mdy_hms(sdr$BGN_DATE)

# Get range of dates in dataset
range(sdr$BGN_DATE)
## [1] "1950-01-03 UTC" "2011-11-30 UTC"
rm(sdr)
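
Reading the compressed CSV is slow. Since R.cache is already loaded, the read could be memoized so that repeated runs hit a disk cache; this is a sketch using R.cache's memoizedCall, not something run in this analysis.

# Sketch: cache the expensive read so subsequent knits load from cache
# storm_data_raw <- R.cache::memoizedCall(read.csv, "storm_data.csv.bz2")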

We can see that our data spans the period from 1950-01-03 to 2011-11-30.

Data Processing

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

For this part only three columns are needed: EVTYPE, FATALITIES, and INJURIES.
Fatalities and injuries are added together to form total casualties.
NOTE: Deaths are given no more weight than injuries.

storm_data <- storm_data_raw %>%
    select(EVTYPE,FATALITIES,INJURIES) %>%
    mutate(EVTYPE = toupper(EVTYPE)) %>%
    group_by(EVTYPE) %>%
    summarise(sum_fatal = sum(FATALITIES), sum_injured = sum(INJURIES)) %>%
    mutate(total_casualties = sum_fatal + sum_injured) %>%
    arrange(-total_casualties)
print(storm_data)
## # A tibble: 898 x 4
##               EVTYPE sum_fatal sum_injured total_casualties
##                <chr>     <dbl>       <dbl>            <dbl>
## 1            TORNADO      5633       91346            96979
## 2     EXCESSIVE HEAT      1903        6525             8428
## 3          TSTM WIND       504        6957             7461
## 4              FLOOD       470        6789             7259
## 5          LIGHTNING       816        5230             6046
## 6               HEAT       937        2100             3037
## 7        FLASH FLOOD       978        1777             2755
## 8          ICE STORM        89        1975             2064
## 9  THUNDERSTORM WIND       133        1488             1621
## 10      WINTER STORM       206        1321             1527
## # ... with 888 more rows
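
As noted above, deaths and injuries are weighted equally here. If you preferred to weight fatalities more heavily, the totals could be computed with a weight; the sketch below uses an arbitrary, illustrative weight of 10.

# Sketch: count each fatality as 10 injuries (weight chosen arbitrarily)
weighted <- storm_data_raw %>%
    select(EVTYPE, FATALITIES, INJURIES) %>%
    mutate(EVTYPE = toupper(EVTYPE)) %>%
    group_by(EVTYPE) %>%
    summarise(weighted_casualties = sum(10 * FATALITIES + INJURIES)) %>%
    arrange(-weighted_casualties)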

After getting this initial dataset I noticed that the top ten contained several event types that were very similar: #2 Excessive Heat and #6 Heat; #4 Flood and #7 Flash Flood; #3 TSTM Wind and #9 Thunderstorm Wind. Without another level of categorization this data is fairly meaningless, even though we can already see that tornadoes are, by far, the largest cause of casualties. The problem is that there are 898 unique EVTYPEs.

length(unique(storm_data$EVTYPE))  
## [1] 898

If we limit the events to only those that caused more than 50 casualties, we are left with a more manageable 52 EVTYPEs.

# Extract all unique EVTYPEs with more than 50 casualties
more_than_50 <- storm_data %>%
    filter(total_casualties > 50) %>%
    select(EVTYPE) %>%
    arrange(EVTYPE)

unique(more_than_50$EVTYPE)    
##  [1] "AVALANCHE"               "BLIZZARD"               
##  [3] "COLD"                    "COLD/WIND CHILL"        
##  [5] "DENSE FOG"               "DUST STORM"             
##  [7] "EXCESSIVE HEAT"          "EXTREME COLD"           
##  [9] "EXTREME COLD/WIND CHILL" "EXTREME HEAT"           
## [11] "FLASH FLOOD"             "FLOOD"                  
## [13] "FOG"                     "GLAZE"                  
## [15] "HAIL"                    "HEAT"                   
## [17] "HEAT WAVE"               "HEAVY RAIN"             
## [19] "HEAVY SNOW"              "HEAVY SURF/HIGH SURF"   
## [21] "HIGH SURF"               "HIGH WIND"              
## [23] "HIGH WINDS"              "HURRICANE"              
## [25] "HURRICANE/TYPHOON"       "ICE"                    
## [27] "ICE STORM"               "LANDSLIDE"              
## [29] "LIGHTNING"               "RECORD HEAT"            
## [31] "RIP CURRENT"             "RIP CURRENTS"           
## [33] "STORM SURGE"             "STRONG WIND"            
## [35] "THUNDERSTORM WIND"       "THUNDERSTORM WINDS"     
## [37] "TORNADO"                 "TROPICAL STORM"         
## [39] "TROPICAL STORM GORDON"   "TSTM WIND"              
## [41] "TSTM WIND/HAIL"          "TSUNAMI"                
## [43] "URBAN/SML STREAM FLD"    "WILD FIRES"             
## [45] "WILD/FOREST FIRE"        "WILDFIRE"               
## [47] "WIND"                    "WINTER STORM"           
## [49] "WINTER WEATHER"          "WINTER WEATHER MIX"     
## [51] "WINTER WEATHER/MIX"      "WINTRY MIX"

OK, this is where I diverge a little from data science norms…

I could categorize the EVTYPEs with a pile of grep calls searching for key words, but I'm on a deadline, so I extracted the unique event types to a CSV, opened it in Excel, added a "Category" column, and manually assigned each of the 52 EVTYPEs to a more general category. I then saved it as a CSV and imported it as storm_category.
NOTE: Since the categories are subjective, feel free to re-categorize them and see if you get different results.
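
For reference, the keyword-based approach might look something like the sketch below. The categorize helper and its keywords are illustrative assumptions, not the mapping actually used here.

# Sketch: grep-based categorization; later matches override earlier ones
categorize <- function(evtype) {
    category <- rep(NA_character_, length(evtype))
    category[grepl("LIGHTNING", evtype)]                <- "LIGHTNING"
    category[grepl("WINTER|BLIZZARD|SNOW|ICE", evtype)] <- "WINTER STORM"
    category[grepl("FLOOD|FLD", evtype)]                <- "FLOOD"
    category[grepl("HEAT", evtype)]                     <- "HEAT"
    category[grepl("TORNADO|WIND|TSTM", evtype)]        <- "WIND"
    category
}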

# write.csv(more_than_50, "storm_category.csv", row.names = FALSE)

# Read in category xref
storm_category <- read.csv("storm_category.csv", header=TRUE) 

Now let's merge the categories into the storm_data set to create the groupings.

merged <- merge(storm_data, storm_category, by="EVTYPE", all.x = TRUE)

# Now, further consolidate the total_casualties by their category 
consolidated <- merged %>%
    group_by(CATEGORY) %>%
    summarise(category_total = sum(total_casualties)) %>%
    arrange(-category_total)
print(consolidated)
## # A tibble: 20 x 2
##          CATEGORY category_total
##            <fctr>          <dbl>
## 1            WIND         109247
## 2            HEAT          12319
## 3           FLOOD          10267
## 4    WINTER STORM           7407
## 5       LIGHTNING           6046
## 6  TROPICAL STORM           2038
## 7            FIRE           1696
## 8            HAIL           1476
## 9              NA           1185
## 10            FOG           1156
## 11    RIP CURRENT           1101
## 12           COLD            509
## 13     DUST STORM            462
## 14      AVALANCHE            394
## 15           RAIN            353
## 16      HIGH SURF            350
## 17  COLD AND WIND            256
## 18          GLAZE            223
## 19      LANDSLIDE             90
## 20        DROUGHT              4
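
The NA row collects every EVTYPE that never received a category (those at or below the 50-casualty cutoff). If you want to see what falls into it, a quick check might look like:

# Sketch: list the uncategorized EVTYPEs that make up the NA row
merged %>%
    filter(is.na(CATEGORY)) %>%
    arrange(-total_casualties) %>%
    head()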

Since we're only interested in the categories with the most casualties, I want a cutoff that shows the most information. There is a big drop-off between #5 Lightning and #6 Tropical Storm, so let's use the top six to show that decline.

top_six <- consolidated[1:6, ]

# To plot the bars in rank order, set the levels of the CATEGORY
# factor to the order we want.
levels_top_six <- c("WIND", "HEAT", "FLOOD", "WINTER STORM", "LIGHTNING",  
               "TROPICAL STORM")
top_six$CATEGORY <- ordered(top_six$CATEGORY, levels=levels_top_six)
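
Equivalently, since top_six is already sorted by total, the levels could be set in one line with reorder; this is a sketch of an alternative, not what is used below.

# Sketch: order the levels by descending casualty total in one step
# top_six$CATEGORY <- reorder(top_six$CATEGORY, -top_six$category_total)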

Results

Now we are ready to plot out the results

ggplot(top_six, aes(CATEGORY, category_total, fill=category_total)) + 
    geom_bar(stat = "identity") + 
    labs(title = "Casualties per Storm Category", 
         x="Storm Category", 
         y="People Killed or Injured", 
         fill="Total\nCasualties") + 
    theme(axis.text.x = element_text(size = 8))

Data Processing

Question 2

Across the United States, which types of events have the greatest economic consequences?

Let’s get the data that deals with economics…

storm_data <- storm_data_raw %>%
    select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
    mutate(EVTYPE = toupper(EVTYPE), PROPDMGEXP = toupper(PROPDMGEXP), 
           CROPDMGEXP = toupper(CROPDMGEXP))

head(storm_data)
##    EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO    25.0          K       0           
## 2 TORNADO     2.5          K       0           
## 3 TORNADO    25.0          K       0           
## 4 TORNADO     2.5          K       0           
## 5 TORNADO     2.5          K       0           
## 6 TORNADO     2.5          K       0

Looking at the documentation, the PROPDMGEXP and CROPDMGEXP columns are important. Values of K, M, and B mean thousands, millions, and billions, and act as multipliers for the values in PROPDMG and CROPDMG. Rows with any other exponent value can use the damage value without modification.

Let's make these value adjustments now, before working any further with this dataset.

# Property damage adjustments
# Get the indexes for data with these multipliers
index_K <- which(storm_data$PROPDMGEXP == "K")
index_M <- which(storm_data$PROPDMGEXP == "M")
index_B <- which(storm_data$PROPDMGEXP == "B")

storm_data$PROPDMG[index_K] <- storm_data$PROPDMG[index_K] * 10^3
storm_data$PROPDMG[index_M] <- storm_data$PROPDMG[index_M] * 10^6
storm_data$PROPDMG[index_B] <- storm_data$PROPDMG[index_B] * 10^9

# Crop damage adjustments.  I'm re-using variables for conciseness.
index_K <- which(storm_data$CROPDMGEXP == "K")
index_M <- which(storm_data$CROPDMGEXP == "M")
index_B <- which(storm_data$CROPDMGEXP == "B")

storm_data$CROPDMG[index_K] <- storm_data$CROPDMG[index_K] * 10^3
storm_data$CROPDMG[index_M] <- storm_data$CROPDMG[index_M] * 10^6
storm_data$CROPDMG[index_B] <- storm_data$CROPDMG[index_B] * 10^9
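
For what it's worth, the same adjustment can be written without index vectors by using a named lookup table. This is a sketch of an equivalent vectorized alternative, meant to run instead of (not after) the block above.

# Sketch: map exponent letters to multipliers, defaulting to 1
mult <- c(K = 1e3, M = 1e6, B = 1e9)
prop_mult <- mult[storm_data$PROPDMGEXP]
prop_mult[is.na(prop_mult)] <- 1
crop_mult <- mult[storm_data$CROPDMGEXP]
crop_mult[is.na(crop_mult)] <- 1
storm_data$PROPDMG <- storm_data$PROPDMG * prop_mult
storm_data$CROPDMG <- storm_data$CROPDMG * crop_mult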

Group and re-summarise storm data. Add the total_damage variable.

grouped <- storm_data %>%
    group_by(EVTYPE) %>%
    summarise(sum_prop = sum(PROPDMG), sum_crop = sum(CROPDMG)) %>%
    mutate(total_damage = sum_prop + sum_crop) %>%
    arrange(-total_damage)

head(grouped, 10)
## # A tibble: 10 x 4
##               EVTYPE     sum_prop    sum_crop total_damage
##                <chr>        <dbl>       <dbl>        <dbl>
## 1              FLOOD 144657709807  5661968450 150319678257
## 2  HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 3            TORNADO  56937160779   414953270  57352114049
## 4        STORM SURGE  43323536000        5000  43323541000
## 5               HAIL  15732267048  3025954473  18758221521
## 6        FLASH FLOOD  16140812067  1421317100  17562129167
## 7            DROUGHT   1046106000 13972566000  15018672000
## 8          HURRICANE  11868319010  2741910000  14610229010
## 9        RIVER FLOOD   5118945500  5029459000  10148404500
## 10         ICE STORM   3944927860  5022113500   8967041360

We have the same problem with redundant EVTYPEs. I pulled out the top 52, which are very different from the top 52 for casualties, extracted them to Excel, manually grouped them in a 'Category' column, and then re-saved and imported the result as storm_category.
NOTE: This is the same file as above, so the import can be skipped if it has already been done.

# for_extract <- head(grouped, 52)
# write.csv(for_extract, "storm_type.csv", row.names = FALSE)

# Read in category xref
storm_category <- read.csv("storm_category.csv", header=TRUE) 

Merge the categories into the grouped data set.

merged <- merge(grouped, storm_category, by="EVTYPE", all.x = TRUE)

Consolidate the totals by category

consolidated <- merged %>%
    select(CATEGORY, total_damage) %>%
    group_by(CATEGORY) %>%
    summarise(sum_damage = sum(total_damage)) %>%
    arrange(-sum_damage)

consolidated
## # A tibble: 20 x 2
##          CATEGORY   sum_damage
##            <fctr>        <dbl>
## 1           FLOOD 227189427503
## 2  TROPICAL STORM  95059614060
## 3            WIND  75022527432
## 4            HAIL  18867253271
## 5    WINTER STORM  18402190803
## 6         DROUGHT  15018672000
## 7              NA  12489228304
## 8            FIRE   8793313130
## 9            COLD   2494076900
## 10           RAIN   1500432940
## 11      LIGHTNING    940751537
## 12           HEAT    924539250
## 13      LANDSLIDE    344613000
## 14      HIGH SURF     99825000
## 15            FOG     22829500
## 16  COLD AND WIND     11288000
## 17     DUST STORM      8649000
## 18      AVALANCHE      3721800
## 19          GLAZE      1000000
## 20    RIP CURRENT       163000

It looks like we have a nice break between #5 and #6 in this set as well.

top_six <- consolidated[1:6, ]

Change factor levels for plotting purposes.

# Wrap the two-word label onto two lines so it fits on the axis
top_six$CATEGORY <- as.character(top_six$CATEGORY)
top_six$CATEGORY[5] <- "WINTER\nSTORM"

# The data is already sorted by damage, so use that order as the levels
levels_dmg <- top_six$CATEGORY
top_six$CATEGORY <- ordered(top_six$CATEGORY, levels=levels_dmg)
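
Alternatively, the label wrapping could be done at plot time instead of by editing the data, using a labels function on the discrete scale; a sketch, not what is used below.

# Sketch: wrap multi-word labels inside the plot call instead
# scale_x_discrete(labels = function(x) gsub(" ", "\n", x))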

Results

Plot out our results

ggplot(top_six, aes(CATEGORY, sum_damage/10^9, fill=sum_damage/10^9)) + 
    geom_bar(stat = "identity") + 
    labs(title = "Storm Damage by Category", 
         x="Storm Category", 
         y="Damage in billions ($)", 
         fill="Damage\nBillions($)") + 
    theme(axis.text.x = element_text(size = 8))

The End