Synopsis

This analysis uses the NOAA storm data (repdata_data_StormData.csv.bz2) to analyse the effects of severe weather events on the population. It addresses two questions. First: across the United States, which types of events are most harmful with respect to population health? Second: which types of events have the greatest economic consequences in the USA? Answering them lets us assess the health and economic costs of storm events and rank the events by severity. We first process the raw data for analysis: injuries and fatalities are combined into a single measure of harm to population health, and property damage (PROPDMG) and crop damage (CROPDMG) are combined into total economic damage. We then plot the processed data as barplots showing the top 5 storm events by casualties and by economic damage.

Data Processing

Loading the Data

First we must load the data, which should be in the working directory. The download link for the data is: Storm Data [47Mb].

library(dplyr)  # provides %>%, mutate(), select(), arrange() and desc() used below
rawdata <- read.csv("repdata_data_StormData.csv.bz2")
str(rawdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Subsetting useful variables

For our analysis, we need variables corresponding to event type, population health, and economic consequences. To extract this information, we need the following variables: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. More information on these variables can be found in the documentation here: Storm Data Documentation

events <- subset(rawdata, select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
head(events)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0
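If memory is a concern, the unused columns could be skipped at read time instead of being dropped afterwards. The following is only a sketch on a small toy file: toy_path, needed, col_classes and events_small are names made up for illustration, and the toy columns are a reduced stand-in for the real file.

```r
# Toy stand-in for the storm file; the real path would be
# "repdata_data_StormData.csv.bz2".
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "STATE,EVTYPE,FATALITIES,INJURIES,REMARKS",
  "AL,TORNADO,0,15,none",
  "AL,FLOOD,1,2,none"
), toy_path)

needed <- c("EVTYPE", "FATALITIES", "INJURIES")

# Read only the header row to learn the column names.
hdr <- names(read.csv(toy_path, nrows = 1))

# "NULL" tells read.csv to skip a column; NA lets it guess the type.
col_classes <- rep("NULL", length(hdr))
col_classes[hdr %in% needed] <- NA

events_small <- read.csv(toy_path, colClasses = col_classes)
names(events_small)
```

The result contains only the requested columns, so a later subset() call would not be needed for this step.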

Cleaning up the data for analysis

We must now clean the data up a little. The first step is to combine fatalities and injuries together, so we can find total population health damage under one variable.

events <- events %>% 
        mutate(casualties = select(., FATALITIES, INJURIES) %>% rowSums(na.rm = FALSE))

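PROPDMGEXP and CROPDMGEXP hold the magnitude codes for the property and crop damage figures ("K" for thousands, "M" for millions), so we convert them to numeric multipliers. Any other code becomes NA in this conversion and therefore contributes nothing to the totals computed below.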

# substitute the exponential units with their actual numerical value for property damage exp:
events$PROPDMGEXP <- as.numeric(as.character(factor(events$PROPDMGEXP,
                                                    levels = c("K","M"),
                                                    labels = c(1000, 1000000))))

# do the same for crop damage exp:
events$CROPDMGEXP <- as.numeric(as.character(factor(events$CROPDMGEXP,
                                                    levels = c("K","M"),
                                                    labels = c(1000, 1000000))))
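Because only "K" and "M" are mapped above, it can be worth checking which other codes occur and how they might be handled. The sketch below is not the method used in this analysis: multipliers and exp_to_multiplier are hypothetical names, and treating "B" as billions and lower-case codes as equivalent to upper-case ones are assumptions. Applying it to the real data would change the totals reported in the Results.

```r
# Hypothetical lookup table: K and M match the conversion above;
# B (billions) is an assumption added for illustration.
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

exp_to_multiplier <- function(code) {
  # Upper-case first so "k" and "K" map alike; unknown codes give NA.
  unname(multipliers[toupper(code)])
}

# Toy codes, including an empty string and an unrecognised symbol.
codes <- c("K", "m", "B", "", "?")

# Tabulate which codes occur before deciding how to treat them.
table(codes, useNA = "ifany")

exp_to_multiplier(codes)
```

A named lookup vector keeps the mapping in one place, so adding or removing a code is a one-line change.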

Now we can compute the total damage by multiplying each damage variable by its multiplier column (EXP), and then adding the two results together.

events <- events %>% 
        mutate(prop_dmg = PROPDMG * PROPDMGEXP) %>%
        mutate(crop_dmg = CROPDMG * CROPDMGEXP) %>%
        mutate(econ_dmg = select(., prop_dmg, crop_dmg) %>% rowSums(na.rm = TRUE))
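The choice of na.rm = TRUE matters here, because a missing multiplier makes the corresponding product NA. A minimal base R sketch on made-up numbers (toy, econ_strict and the values are purely illustrative) shows the difference:

```r
# Made-up rows; NA multipliers stand for codes that were not "K" or "M".
toy <- data.frame(
  PROPDMG    = c(25, 5, 10),
  PROPDMGEXP = c(1000, NA, 1e6),
  CROPDMG    = c(0, 2, 1),
  CROPDMGEXP = c(NA, 1000, 1000)
)

prop_dmg <- toy$PROPDMG * toy$PROPDMGEXP  # second entry is NA
crop_dmg <- toy$CROPDMG * toy$CROPDMGEXP  # first entry is NA (0 * NA)

# na.rm = TRUE counts a missing component as zero ...
econ_dmg <- rowSums(cbind(prop_dmg, crop_dmg), na.rm = TRUE)

# ... whereas na.rm = FALSE would propagate the NA to the row total.
econ_strict <- rowSums(cbind(prop_dmg, crop_dmg), na.rm = FALSE)

econ_dmg
econ_strict
```

With na.rm = TRUE every row gets a total, at the cost of silently treating unmapped damage as zero.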

Finally, we can remove the columns we no longer need and keep only the newly aggregated columns.

events_clean <- subset(events, select = c(EVTYPE, casualties, econ_dmg))
head(events_clean)
##    EVTYPE casualties econ_dmg
## 1 TORNADO         15    25000
## 2 TORNADO          0     2500
## 3 TORNADO          2    25000
## 4 TORNADO          2     2500
## 5 TORNADO          2     2500
## 6 TORNADO          6     2500

Results

Question 1

Recall Question 1: Across the United States, which types of events are most harmful with respect to population health? To answer it, we first aggregate total casualties by event type. We then remove all event types with zero casualties and sort from most to least harmful.

agg_casualty <- aggregate(casualties~EVTYPE, events_clean, sum)

# remove all zero-damage events and sort from highest damage to lowest
top_health <- arrange(subset(agg_casualty, casualties > 0), desc(casualties))
head(top_health)
##           EVTYPE casualties
## 1        TORNADO      96979
## 2 EXCESSIVE HEAT       8428
## 3      TSTM WIND       7461
## 4          FLOOD       7259
## 5      LIGHTNING       6046
## 6           HEAT       3037
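The aggregate, filter and sort steps can be checked by hand on a tiny example. This is a base R sketch with made-up numbers (toy and totals are illustrative names), using order() in place of arrange() so that it runs without dplyr:

```r
# Made-up events: two tornado rows, two flood rows, one harmless hail row.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  casualties = c(15, 3, 2, 0, 4)
)

# Sum casualties within each event type.
totals <- aggregate(casualties ~ EVTYPE, toy, sum)

# Drop event types with no casualties, then sort in decreasing order.
totals <- totals[totals$casualties > 0, ]
totals <- totals[order(-totals$casualties), ]
totals
```

Here TORNADO (15 + 2 = 17) ranks ahead of FLOOD (3 + 4 = 7), and HAIL is dropped because its total is zero.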

Now we can plot the aggregated data as a barplot for the top 5 highest damaging events to population health.

x <- top_health[1:5,]
barplot(casualties~EVTYPE, x, col = "blue", xlab = "Storm Events", ylab = "Casualties (injuries & fatalities)",
        main = "Top Casualty Inducing Storm Events 1950-2011")
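As far as we can tell, the formula interface of barplot() tabulates by EVTYPE, so the bars may be drawn in alphabetical order of event name rather than in rank order. If rank order is preferred, one option is to pass a named vector instead. The sketch below rebuilds the top 5 from the table above; heights is an illustrative name, and las and cex.names are only cosmetic choices to keep the labels readable.

```r
# Top 5 event types and casualty totals, copied from the table above.
x <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING"),
  casualties = c(96979, 8428, 7461, 7259, 6046)
)

# A named vector keeps the data frame's row order (highest first).
heights <- setNames(x$casualties, x$EVTYPE)

barplot(heights, col = "blue",
        las = 2, cex.names = 0.7,  # rotate and shrink the bar labels
        ylab = "Casualties (injuries & fatalities)",
        main = "Top Casualty Inducing Storm Events 1950-2011")
```

The same approach would apply to the economic damage plot.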

Question 2

Recall Question 2: Which types of events have the greatest economic consequences in the USA? Using the same method from the previous analysis, we will aggregate and sort to show the top events.

agg_econ <- aggregate(econ_dmg~EVTYPE, events_clean, sum)

# remove all zero-damage events and sort from highest damage to lowest
top_econ <- arrange(subset(agg_econ, econ_dmg > 0), desc(econ_dmg))
head(top_econ)
##        EVTYPE    econ_dmg
## 1     TORNADO 52040613590
## 2       FLOOD 27819678250
## 3        HAIL 16952904170
## 4 FLASH FLOOD 16562128610
## 5     DROUGHT 13518672000
## 6   HURRICANE  8910229010

We then plot the top 5 events by economic damage as a barplot.

y <- top_econ[1:5,]
barplot((econ_dmg/1000000)~EVTYPE, y, col = "red", xlab = "Storm Events", ylab = "Economic Damage ($ millions)",
        main = "Top Economic Damages by Storm Events 1950-2011")

Conclusion

Based on the analysis and plots, tornadoes cause both the most casualties and the most economic damage. Floods also cause substantial economic damage and rank among the top 5 events by casualties. These two event types are therefore natural priorities for efforts to reduce future damage and casualties.