This analysis uses the NOAA storm data (repdata_data_StormData.csv.bz2) to analyse the effects of severe weather events on the population. The first question asks: Across the United States, which types of events are most harmful with respect to population health? The second asks: Which types of events have the greatest economic consequences in the USA? Together, these two questions let us assess the health and economic costs of storm events and rank the events by severity. To answer them, we first process the raw data for analysis. We add injuries and fatalities together to measure harm to the population, and we add property damage (PROPDMG) and crop damage (CROPDMG) together to measure total economic damage. We then visualize the processed data in barplots that show the top 5 storm events by casualties and by economic damage.
First we load the data file, which should be in the current working directory. The download link for the data is: Storm Data [47Mb].
library(dplyr)  # provides %>%, mutate(), select(), arrange() and desc() used below
rawdata <- read.csv("repdata_data_StormData.csv.bz2")
str(rawdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
To do our analysis, we need variables that describe event type, population health, and economic consequences. The relevant variables are: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. More information on these variables can be found in the documentation here: Storm Data Documentation
events <- subset(rawdata, select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
head(events)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
We must now clean the data up a little. The first step is to add fatalities and injuries together, so that total harm to population health is held in a single variable.
events <- events %>%
  # casualties = people killed or injured per event (NA if either count is missing)
  mutate(casualties = FATALITIES + INJURIES)
PROPDMGEXP and CROPDMGEXP hold the magnitude codes for the property and crop damage figures ("K" for thousands, "M" for millions), so we convert them to numeric multipliers. Any code other than "K" or "M" becomes NA in this step.
# substitute the exponential units with their actual numerical value for property damage exp:
events$PROPDMGEXP <- as.numeric(as.character(factor(events$PROPDMGEXP,
                                                    levels = c("K", "M"),
                                                    labels = c(1000, 1000000))))
# do the same for crop damage exp:
events$CROPDMGEXP <- as.numeric(as.character(factor(events$CROPDMGEXP,
                                                    levels = c("K", "M"),
                                                    labels = c(1000, 1000000))))
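The conversion above recognises only "K" and "M". Running `table(rawdata$PROPDMGEXP)` would show which codes actually occur in the raw column. The following is a minimal sketch of a broader lookup, assuming the column may also contain lowercase letters or codes such as "H" (hundreds) and "B" (billions); the helper name `exp_to_multiplier` and the sample vector are illustrative and not part of the original analysis. If such codes are present, using a lookup like this would change the totals reported in this report.

```r
# Hypothetical helper: translate magnitude codes into numeric multipliers.
# The set of codes is an assumption; inspect table(rawdata$PROPDMGEXP) first.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # toupper() folds "k"/"m" into "K"/"M"; codes not in the lookup yield NA
  unname(lookup[toupper(code)])
}

# Made-up sample of codes, including an empty string and an unknown symbol
codes <- c("K", "m", "B", "", "?")
multipliers <- exp_to_multiplier(codes)
print(multipliers)
```

Unknown codes are deliberately left as NA so that they stay visible and can be counted before deciding how to treat them.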
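Now we can compute the total damage by multiplying each damage variable by its multiplier column (EXP) and then adding the two products together. Because of na.rm = TRUE, a product that is NA (for example, from an NA multiplier) contributes zero to the total.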
events <- events %>%
  mutate(prop_dmg = PROPDMG * PROPDMGEXP) %>%
  mutate(crop_dmg = CROPDMG * CROPDMGEXP) %>%
  mutate(econ_dmg = select(., prop_dmg, crop_dmg) %>% rowSums(na.rm = TRUE))
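To see what `na.rm = TRUE` does in the last step, here is a small sketch in base R on made-up rows (the `toy` data frame is purely illustrative). A row whose multiplier is NA gives an NA product, and `rowSums(..., na.rm = TRUE)` then counts that product as zero instead of returning NA for the row.

```r
# Made-up rows: multipliers are already numeric, with NA for unrecognised codes
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c(1000, 1000, NA),
  CROPDMG    = c(0, 1, 0),
  CROPDMGEXP = c(NA, 1e6, NA)
)

# Same arithmetic as above, written without dplyr
toy$prop_dmg <- toy$PROPDMG * toy$PROPDMGEXP
toy$crop_dmg <- toy$CROPDMG * toy$CROPDMGEXP
toy$econ_dmg <- rowSums(toy[, c("prop_dmg", "crop_dmg")], na.rm = TRUE)

print(toy$econ_dmg)
```

The third row ends up with zero economic damage even though PROPDMG is 10, which is how damage recorded with an unrecognised code drops out of the totals.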
Finally, we drop the columns we no longer need and keep only the newly aggregated columns.
events_clean <- subset(events, select = c(EVTYPE, casualties, econ_dmg))
head(events_clean)
## EVTYPE casualties econ_dmg
## 1 TORNADO 15 25000
## 2 TORNADO 0 2500
## 3 TORNADO 2 25000
## 4 TORNADO 2 2500
## 5 TORNADO 2 2500
## 6 TORNADO 6 2500
Recall question 1: Across the United States, which types of events are most harmful with respect to population health? To answer it, we first aggregate total casualties by event type. We then remove all event types with zero casualties and sort from most to fewest casualties.
agg_casualty <- aggregate(casualties~EVTYPE, events_clean, sum)
# remove event types with zero casualties and sort from highest to lowest
top_health <- arrange(subset(agg_casualty, casualties > 0), desc(casualties))
head(top_health)
## EVTYPE casualties
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
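The table lists HEAT and EXCESSIVE HEAT as separate event types and uses the abbreviation TSTM WIND, which suggests that closely related labels may be counted apart. Below is a minimal sketch of cleaning labels before aggregation, assuming that case and whitespace variants and a THUNDERSTORM WIND spelling also occur in EVTYPE (this is not verified here). `normalize_evtype` is a hypothetical helper and the rows are made up.

```r
# Hypothetical helper: standardise event labels before aggregating.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))                         # fold case, drop outer spaces
  x <- gsub("[[:space:]]+", " ", x)               # collapse repeated spaces
  sub("^TSTM WIND$", "THUNDERSTORM WIND", x)      # expand one known abbreviation
}

# Made-up rows with inconsistent labels
toy <- data.frame(
  EVTYPE     = c("Tstm Wind", " TSTM WIND", "THUNDERSTORM  WIND", "Flood"),
  casualties = c(1, 2, 3, 4)
)
toy$EVTYPE <- normalize_evtype(toy$EVTYPE)

res <- aggregate(casualties ~ EVTYPE, toy, sum)
print(res)
```

Whether to merge categories such as HEAT and EXCESSIVE HEAT is a judgment call that depends on the documentation, so the sketch only handles spelling variants of a single event type.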
Now we can plot the aggregated data as a barplot of the 5 event types most harmful to population health.
x <- top_health[1:5,]
barplot(casualties ~ EVTYPE, x, col = "blue", xlab = "Storm Events",
        ylab = "Casualties (injuries & fatalities)",
        main = "Top Casualty Inducing Storm Events 1950-2011")
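The formula interface to `barplot()` needs a recent version of R and does not make the bar order explicit. As an alternative sketch, the same plot can be drawn from a named numeric vector, which keeps the bars in the order given. The values below are copied from the table above; the names `top5` and `heights` and the label settings (`las`, `cex.names`) are illustrative choices.

```r
# Top 5 event types and their casualty totals, taken from the table above
top5 <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING"),
  casualties = c(96979, 8428, 7461, 7259, 6046)
)

# A named vector gives barplot() both the bar heights and the bar labels
heights <- setNames(top5$casualties, top5$EVTYPE)
heights <- sort(heights, decreasing = TRUE)   # largest bar first

barplot(heights, col = "blue", las = 2, cex.names = 0.7,
        ylab = "Casualties (injuries & fatalities)",
        main = "Top Casualty Inducing Storm Events 1950-2011")
```

Rotating the labels with `las = 2` helps long event names such as EXCESSIVE HEAT fit under their bars.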
Recall question 2: Which types of events have the greatest economic consequences in the USA? Using the same method as in the previous analysis, we aggregate by event type and sort to show the top events.
agg_econ <- aggregate(econ_dmg~EVTYPE, events_clean, sum)
# remove event types with zero economic damage and sort from highest to lowest
top_econ <- arrange(subset(agg_econ, econ_dmg > 0), desc(econ_dmg))
head(top_econ)
## EVTYPE econ_dmg
## 1 TORNADO 52040613590
## 2 FLOOD 27819678250
## 3 HAIL 16952904170
## 4 FLASH FLOOD 16562128610
## 5 DROUGHT 13518672000
## 6 HURRICANE 8910229010
y <- top_econ[1:5,]
barplot((econ_dmg/1000000) ~ EVTYPE, y, col = "red", xlab = "Storm Events",
        ylab = "Economic Damage ($ millions)",
        main = "Top Economic Damages by Storm Events 1950-2011")
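Both questions use the same aggregate, filter, and sort steps, so they could share one function. The sketch below is one way to write it in base R; `top_events` is a hypothetical helper and the `toy` data are made up. Unlike the formula interface of `aggregate()` used above, this version does not drop rows with NA values, so it assumes the value column has no missing entries.

```r
# Hypothetical helper: total a value column by EVTYPE and return the top n rows.
top_events <- function(data, value_col, n = 5) {
  totals <- aggregate(data[[value_col]],
                      by = list(EVTYPE = as.character(data$EVTYPE)),
                      FUN = sum)
  names(totals)[2] <- value_col                       # restore the column name
  totals <- totals[totals[[value_col]] > 0, ]         # drop zero-total events
  totals <- totals[order(totals[[value_col]], decreasing = TRUE), ]
  head(totals, n)
}

# Made-up data to show both uses
toy <- data.frame(
  EVTYPE     = c("A", "B", "A", "C"),
  casualties = c(1, 5, 2, 0),
  econ_dmg   = c(100, 0, 50, 10)
)

health <- top_events(toy, "casualties")
econ   <- top_events(toy, "econ_dmg", n = 2)
print(health)
print(econ)
```

With the real data, calls such as `top_events(events_clean, "casualties")` and `top_events(events_clean, "econ_dmg")` would stand in for the two aggregate-and-sort blocks.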
Based on the analysis and plots, tornadoes cause both the most casualties and the most economic damage. Floods also cause large amounts of economic damage and rank among the top 5 event types for casualties. These two event types are therefore natural priorities for efforts to prevent damage and casualties in the future.