The goal behind this analysis is to answer two important questions related with the collected data from storms in the US:
which types of events are most harmful with respect to population health?
which types of events have the greatest economic consequences?
In the following chapters it will be described step by step the performed analysis on the data and the reached results that can give some evidence to answer these questions.
After setting the local working directory in RStudio, we download the file from the given url. This might take a while depending on the computer since the file is large.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile = "stormData.csv.bz2")
stormData <- read.csv("stormData.csv.bz2")
Let’s have a look at the data to see some information of the dataset.
summary(stormData)
## STATE__ BGN_DATE BGN_TIME
## Min. : 1.0 5/25/2011 0:00:00: 1202 12:00:00 AM: 10163
## 1st Qu.:19.0 4/27/2011 0:00:00: 1193 06:00:00 PM: 7350
## Median :30.0 6/9/2011 0:00:00 : 1030 04:00:00 PM: 7261
## Mean :31.2 5/30/2004 0:00:00: 1016 05:00:00 PM: 6891
## 3rd Qu.:45.0 4/4/2011 0:00:00 : 1009 12:00:00 PM: 6703
## Max. :95.0 4/2/2006 0:00:00 : 981 03:00:00 PM: 6700
## (Other) :895866 (Other) :857229
## TIME_ZONE COUNTY COUNTYNAME STATE
## CST :547493 Min. : 0.0 JEFFERSON : 7840 TX : 83728
## EST :245558 1st Qu.: 31.0 WASHINGTON: 7603 KS : 53440
## MST : 68390 Median : 75.0 JACKSON : 6660 OK : 46802
## PST : 28302 Mean :100.6 FRANKLIN : 6256 MO : 35648
## AST : 6360 3rd Qu.:131.0 LINCOLN : 5937 IA : 31069
## HST : 2563 Max. :873.0 MADISON : 5632 NE : 30271
## (Other): 3631 (Other) :862369 (Other):621339
## EVTYPE BGN_RANGE BGN_AZI
## HAIL :288661 Min. : 0.000 :547332
## TSTM WIND :219940 1st Qu.: 0.000 N : 86752
## THUNDERSTORM WIND: 82563 Median : 0.000 W : 38446
## TORNADO : 60652 Mean : 1.484 S : 37558
## FLASH FLOOD : 54277 3rd Qu.: 1.000 E : 33178
## FLOOD : 25326 Max. :3749.000 NW : 24041
## (Other) :170878 (Other):134990
## BGN_LOCATI END_DATE END_TIME
## :287743 :243411 :238978
## COUNTYWIDE : 19680 4/27/2011 0:00:00: 1214 06:00:00 PM: 9802
## Countywide : 993 5/25/2011 0:00:00: 1196 05:00:00 PM: 8314
## SPRINGFIELD : 843 6/9/2011 0:00:00 : 1021 04:00:00 PM: 8104
## SOUTH PORTION: 810 4/4/2011 0:00:00 : 1007 12:00:00 PM: 7483
## NORTH PORTION: 784 5/30/2004 0:00:00: 998 11:59:00 PM: 7184
## (Other) :591444 (Other) :653450 (Other) :622432
## COUNTY_END COUNTYENDN END_RANGE END_AZI
## Min. :0 Mode:logical Min. : 0.0000 :724837
## 1st Qu.:0 NA's:902297 1st Qu.: 0.0000 N : 28082
## Median :0 Median : 0.0000 S : 22510
## Mean :0 Mean : 0.9862 W : 20119
## 3rd Qu.:0 3rd Qu.: 0.0000 E : 20047
## Max. :0 Max. :925.0000 NE : 14606
## (Other): 72096
## END_LOCATI LENGTH WIDTH
## :499225 Min. : 0.0000 Min. : 0.000
## COUNTYWIDE : 19731 1st Qu.: 0.0000 1st Qu.: 0.000
## SOUTH PORTION : 833 Median : 0.0000 Median : 0.000
## NORTH PORTION : 780 Mean : 0.2301 Mean : 7.503
## CENTRAL PORTION: 617 3rd Qu.: 0.0000 3rd Qu.: 0.000
## SPRINGFIELD : 575 Max. :2315.0000 Max. :4400.000
## (Other) :380536
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
## WFO STATEOFFIC
## :142069 :248769
## OUN : 17393 TEXAS, North : 12193
## JAN : 13889 ARKANSAS, Central and North Central: 11738
## LWX : 13174 IOWA, Central : 11345
## PHI : 12551 KANSAS, Southwest : 11212
## TSA : 12483 GEORGIA, North and Central : 11120
## (Other):690738 (Other) :595920
## ZONENAMES
## :594029
## :205988
## GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639
## GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592
## JEFFERSON - JEFFERSON : 303
## MADISON - MADISON : 302
## (Other) :100444
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## Min. : 0 Min. :-14451 Min. : 0 Min. :-14455
## 1st Qu.:2802 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0
## Median :3540 Median : 8707 Median : 0 Median : 0
## Mean :2875 Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.:4019 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. :9706 Max. : 17124 Max. :9706 Max. :106220
## NA's :47 NA's :40
## REMARKS REFNUM
## :287433 Min. : 1
## : 24013 1st Qu.:225575
## Trees down.\n : 1110 Median :451149
## Several trees were blown down.\n : 568 Mean :451149
## Trees were downed.\n : 446 3rd Qu.:676723
## Large trees and power lines were blown down.\n: 432 Max. :902297
## (Other) :588295
There are some NA cases but apparently not in the relevant columns to answer the main questions of this analysis.
The event type is the column with the name “EVTYPE” and we will use it as a factor to distinguish the effects. We need to have a look at the different measures provided in this dataset to find out which variables are health related or are some health indicators. At the explanation document it is found in chapter 2.6 the description for “fatalities/injuries”. It is described “direct” and “indirect” fatalities/injuries however at the header names of the dataset we find only:
names(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
“FATALITIES” and “INJURIES” as two separate columns. Let’s filter the dataset to extract the information of interest to answer this question:
healthData <- stormData[,c("EVTYPE","FATALITIES","INJURIES")]
summary(healthData)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
sum(is.na(healthData$FATALITIES))
## [1] 0
sum(is.na(healthData$INJURIES))
## [1] 0
There are no missing values so we go on with the analysis. Since at the document these two factors are treated as one, we decide to merge both measures into one.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
healthData <- mutate(healthData, "FATALITIES/INJURIES" = FATALITIES + INJURIES)
# we take only the event type and the new merged column
healthData <- healthData[,c("EVTYPE","FATALITIES/INJURIES")]
names(healthData) <- c("eventType","totalInjuries")
Now the dataset is ready to apply some statistical function, in this case we are going to sum all these fatalities and injuries caused by each event type and see which one obtains the maximum number of fatalities and injuries.
injuriesPerEvent <- aggregate(totalInjuries ~ eventType, healthData, sum)
summary(injuriesPerEvent)
## eventType totalInjuries
## HIGH SURF ADVISORY: 1 Min. : 0
## COASTAL FLOOD : 1 1st Qu.: 0
## FLASH FLOOD : 1 Median : 0
## LIGHTNING : 1 Mean : 158
## TSTM WIND : 1 3rd Qu.: 0
## TSTM WIND (G45) : 1 Max. :96979
## (Other) :979
By having a quick view of the dataset we noticed a large number of events with 0 injuries. The mean is 158. We can take this reference to check which events have injuries above the mean value.
injuriesPerEvent <- filter(injuriesPerEvent,totalInjuries > 158)
Let’s compare now these worst events with a barplot graphic.
library(ggplot2)
ggplot(injuriesPerEvent, aes(x = eventType, y = totalInjuries))+geom_bar(stat = "identity", position = "dodge",aes(fill=eventType))+ coord_flip() + labs(x="Event Type")+labs(y= "Total number of fatalities/injuries")+labs(title="Storm events with worst health effects") + theme(legend.position="none")
We created a graphic using ggplot2 library. Use geom_plot to create the bar chart. In this case, we tell ggplot that the data is already summarised using stat=“identity”, since the default is to create a histogram. Though the font is a bit small we can see there is a clear winner in the number of fatalities/injuries. But we will introduce it at the Results chapter
We take again the “EVTYPE” column as a factor to distinguish the effects. We need to have a look at the different measures provided in this dataset to find out which variables are damage related or are some damage indicators. At the explanation document we find many columns. We found four indicators:
Flood-Related Damage
Crop Damage Data
Other Related Costs
Delayed Damage
The columns found at the data set are only two: PROPDMG and CROPDMG. At the document file it is explained also that alphabetical characters are used to signify magnitude, these include “K” for thousands, “M” for millions, and “B” for billions. The two columns providing this info are PROPDMGEXP and CROPDMGEXP.
Let’s filter the data of interest.
damageRawData <- stormData[,c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
summary(damageRawData)
## EVTYPE PROPDMG PROPDMGEXP
## HAIL :288661 Min. : 0.00 :465934
## TSTM WIND :219940 1st Qu.: 0.00 K :424665
## THUNDERSTORM WIND: 82563 Median : 0.00 M : 11330
## TORNADO : 60652 Mean : 12.06 0 : 216
## FLASH FLOOD : 54277 3rd Qu.: 0.50 B : 40
## FLOOD : 25326 Max. :5000.00 5 : 28
## (Other) :170878 (Other): 84
## CROPDMG CROPDMGEXP
## Min. : 0.000 :618413
## 1st Qu.: 0.000 K :281832
## Median : 0.000 M : 1994
## Mean : 1.527 k : 21
## 3rd Qu.: 0.000 0 : 19
## Max. :990.000 B : 9
## (Other): 9
The strategy now is to take those events with the higher costs and those are usually with the “B” unit at “PROPDMGEXP” or “CROPDMGEXP” columns. First we eliminate those cases where the damage costs are 0.
# eliminate 0 values for both
damageRawData <- filter(damageRawData,(PROPDMG>0 | CROPDMG>0))
# take only the billion costs events
billionDamageData <- filter(damageRawData,(PROPDMGEXP=="B" | CROPDMGEXP=="B"))
Now we sum the two columns by taking this approach: adding both values if both are in the same unit (billion) or taking the highest billion value from both. This approach is acceptable if we are comparing billions with millions or any other lower unit. For instance, if property damage = 2.1 billions and crop damage = 4 millions. When we transform the crop damage into billions then we will need to sum 0.0004 billions. Thus we are just saying that 2.1 billions is almost the same as 2.1004 billions which is a good approximation.
getBillionValue <- function(vector){
#vector[1] =PROPDMG, property damage, vector[2] = PROPDMGEXP
#vector[3] = CROPDMG, crop damage, vector[4] = CROPDMGEXP
if(vector[2] == "B"){
if(vector[4] == "B"){
return(as.numeric(vector[1]+vector[3]))
}
else{
# the first damage (PROP) is much more higher
# than the crop damage (CROP)
return(as.numeric(vector[1]))
}
}
return(as.numeric(vector[3]))
}
getTotalBillionDamage <- function(raw){
output <- raw
for(i in 1:nrow(raw)){
output[i,]$PROPDMG <- getBillionValue(raw[i,c("PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")])
}
return(output)
}
finalDamageData <- getTotalBillionDamage(billionDamageData)
We do some adjustments at the dataset to take only the columns of interest.
# we take only the event type and the new merged column costs in PROPDMG
damageData <- finalDamageData[,c("EVTYPE","PROPDMG")]
names(damageData) <- c("eventType","damageBillionCosts")
Now the dataset is ready to apply some statistical function, in this case we are going to sum all these billion damages caused by each event type and see which one obtains the maximum number.
billionDamagePerEvent <- aggregate(damageBillionCosts ~ eventType, damageData, sum)
summary(billionDamagePerEvent)
## eventType damageBillionCosts
## DROUGHT : 1 Min. : 0.100
## FLASH FLOOD: 1 1st Qu.: 1.250
## FLOOD : 1 Median : 2.500
## FREEZE : 1 Mean : 12.585
## HAIL : 1 3rd Qu.: 5.225
## HEAT : 1 Max. :122.500
## (Other) :17
Finally the graphic illustrating the damage costs.
library(ggplot2)
# Fitting Labels
ggplot(billionDamagePerEvent, aes(x = eventType, y = damageBillionCosts))+geom_bar(stat = "identity", position = "dodge",aes(fill=eventType))+ coord_flip() + labs(x="Event Type")+labs(y= "Damage costs in billions")+labs(title="Worst storm events for the US economy")+ theme(legend.position="none")
Tornado is the event with a huge number of injuries, far from the rest of events in the list. We take it from the injuriesPerEvent dataset computed before:
maxNrInjuries <- max(injuriesPerEvent$totalInjuries)
injuriesPerEvent[injuriesPerEvent$totalInjuries==maxNrInjuries,c("eventType","totalInjuries")]
## eventType totalInjuries
## 28 TORNADO 96979
Other bad storm events for the health are EXCESIVE HEAT, FLOOD or TSTM WIND.
From the displayed second graphic it is clearly seen FLOOD as the clear winner:
maxBillionCost <- max(billionDamagePerEvent$damageBillionCosts)
billionDamagePerEvent[billionDamagePerEvent$damageBillionCosts==maxBillionCost,c("eventType","damageBillionCosts")]
## eventType damageBillionCosts
## 3 FLOOD 122.5
Other bad storm events for the US economy are HURRICANE, RIVER FLOOD or STORM SURGE.