A simple analysis of the storm data found that tornadoes are the events that kill the most people and floods are the events that cost the most.
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
## R.utils v1.34.0 (2014-10-07) successfully loaded. See ?R.utils for help.
First things first. After a quick look at the data and a read through the codebooks, the variables we need to process to answer the given questions are:
- Types of events: EVTYPE
- Population health: FATALITIES, INJURIES
- Economic consequences: PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP
For that reason, I will make a new data frame with only the desired columns.
storm2 <- data.frame(EVTYPE = storm$EVTYPE,
                     FATALITIES = storm$FATALITIES,
                     INJURIES = storm$INJURIES,
                     PROPDMG = storm$PROPDMG,
                     PROPDMGEXP = storm$PROPDMGEXP,
                     CROPDMG = storm$CROPDMG,
                     CROPDMGEXP = storm$CROPDMGEXP)
rm(storm)
The important step is to convert the economic variables to full dollar amounts. By convention, B = billions, M = millions, K = thousands and H = hundreds. The following code replaces each of those letters with its corresponding power of 10 (9, 6, 3 and 2). Every other entry (digits, symbols and blanks) is given an exponent of zero, so those damage values are multiplied by 1 and kept as recorded.
# Convert the property damage exponent codes to powers of 10
storm2$PROPDMGEXP <- as.character(storm2$PROPDMGEXP)
# Symbols and digits get an exponent of zero (this must run before the letters are mapped)
storm2$PROPDMGEXP[storm2$PROPDMGEXP %in% c("-", "?", "+", as.character(0:8))] <- "0"
storm2$PROPDMGEXP[storm2$PROPDMGEXP == "K"] <- "3"
storm2$PROPDMGEXP[storm2$PROPDMGEXP %in% c("B", "b")] <- "9"
storm2$PROPDMGEXP[storm2$PROPDMGEXP %in% c("M", "m")] <- "6"
storm2$PROPDMGEXP[storm2$PROPDMGEXP %in% c("H", "h")] <- "2"
# Convert the crop damage exponent codes to powers of 10
storm2$CROPDMGEXP <- as.character(storm2$CROPDMGEXP)
# Symbols and digits get an exponent of zero (this must run before the letters are mapped)
storm2$CROPDMGEXP[storm2$CROPDMGEXP %in% c("-", "?", "+", as.character(0:8))] <- "0"
storm2$CROPDMGEXP[storm2$CROPDMGEXP == "K"] <- "3"
storm2$CROPDMGEXP[storm2$CROPDMGEXP %in% c("B", "b")] <- "9"
storm2$CROPDMGEXP[storm2$CROPDMGEXP %in% c("M", "m")] <- "6"
storm2$CROPDMGEXP[storm2$CROPDMGEXP %in% c("H", "h")] <- "2"
storm2$PROPDMGEXP <- as.numeric(storm2$PROPDMGEXP)
storm2$CROPDMGEXP<- as.numeric(storm2$CROPDMGEXP)
## Warning: NAs introduced by coercion
storm2$CROPDMGEXP[is.na(storm2$CROPDMGEXP)] <- 0
storm2$PROPDMGEXP[is.na(storm2$PROPDMGEXP)] <- 0
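The same mapping can be written more compactly with a named lookup vector. The following is only a minimal sketch on made-up values; dmg, exp_chr and exp_lookup are illustrative names and not part of the analysis above. It makes one assumption that differs from the step-by-step code: the codes are upper-cased first, so a lowercase k would also count as thousands.

```r
# Hypothetical toy values standing in for storm2$PROPDMG and storm2$PROPDMGEXP.
dmg     <- c(25, 2.5, 10, 5, 3)
exp_chr <- c("K", "M", "b", "", "7")

# Letter codes and their powers of ten.
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Look each code up by name (assumption: case is ignored).
# Anything that is not in the table, including blanks and digits, comes back as NA.
exp_num <- unname(exp_lookup[toupper(exp_chr)])

# Unknown codes get an exponent of zero, i.e. a multiplier of 1.
exp_num[is.na(exp_num)] <- 0

data.frame(dmg, exp_chr, exp_num, dollars = dmg * 10^exp_num)
```

Keeping the codes in one table makes it easier to see at a glance which letters are handled and which fall through to zero.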
After this, I will make a new column, NETDMG, by multiplying each damage value by 10 raised to its exponent, for both property and crops, and adding the two results together.
storm2$NETDMG <- (storm2$PROPDMG*10^storm2$PROPDMGEXP)+(storm2$CROPDMG*10^storm2$CROPDMGEXP)
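To make the arithmetic concrete, here is a small worked example for a single made-up event. The numbers and variable names are hypothetical and only illustrate the formula used for NETDMG.

```r
# Hypothetical single event: property damage 25 with exponent 3 (thousands),
# crop damage 1.5 with exponent 6 (millions).
prop_dmg <- 25
prop_exp <- 3
crop_dmg <- 1.5
crop_exp <- 6

# Same formula as NETDMG: each value times 10 to the power of its exponent, then summed.
net_dmg <- prop_dmg * 10^prop_exp + crop_dmg * 10^crop_exp
net_dmg  # 25,000 + 1,500,000 = 1,525,000 dollars
```

With an exponent of zero the multiplier is 10^0 = 1, so the damage value is kept as recorded rather than dropped.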
I will make a simpler data frame with the event types, the number of fatalities and injuries, and the total economic cost.
storm3 <- data.frame(EVTYPE=storm2$EVTYPE,FATALITIES=storm2$FATALITIES, INJURIES=storm2$INJURIES,NETDMG=storm2$NETDMG)
To finish this part, I will aggregate the data by event type and create a separate data frame for each of injuries, fatalities and net cost.
storm_conc <-aggregate(. ~ EVTYPE, storm3, sum)
str(storm_conc)
## 'data.frame': 985 obs. of 4 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NETDMG : num 200000 0 50000 0 8100000 8000 0 0 5000 0 ...
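For readers unfamiliar with the formula interface of aggregate, this is a minimal sketch of the same call on a few made-up records. The toy data frame is hypothetical; only the column names match storm3.

```r
# Hypothetical toy records using the same column names as storm3.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(3, 0, 2, 1, 0),
  INJURIES   = c(10, 4, 6, 0, 1),
  NETDMG     = c(5e5, 2e6, 1e5, 3e6, 2e4)
)

# "." stands for every column other than EVTYPE; each one is summed per event type.
toy_conc <- aggregate(. ~ EVTYPE, toy, sum)

# Sort by fatalities, largest first, in the same way as health_fatalities.
toy_conc[order(-toy_conc$FATALITIES), ]
```

Note that the formula method of aggregate drops rows containing NA by default, which is worth keeping in mind if any of the summed columns has missing values.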
# health_fatalities, health_injuries and net_costs: one data frame per measure, sorted in descending order
health_injuries <- storm_conc[order(-storm_conc$INJURIES),]
health_injuries <- data.frame(EVTYPE=as.character(health_injuries$EVTYPE), INJURIES=health_injuries$INJURIES)
health_fatalities <- storm_conc[order(-storm_conc$FATALITIES),]
health_fatalities <- data.frame(EVTYPE=as.character(health_fatalities$EVTYPE), FATALITIES=health_fatalities$FATALITIES)
net_costs <- storm_conc[order(-storm_conc$NETDMG),]
net_costs <- data.frame(EVTYPE=net_costs$EVTYPE, NETDMG=net_costs$NETDMG)
rm(storm2,storm3,storm_conc)
Now that our data set is tidy, I can proceed to answer the questions.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
After analysing the tidy data frame health_fatalities, I realised that only the first 10 event types have a large number of fatalities, and only the first 20 have more than 100 deaths.
head(health_fatalities, n=22)
## EVTYPE FATALITIES
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## 11 WINTER STORM 206
## 12 RIP CURRENTS 204
## 13 HEAT WAVE 172
## 14 EXTREME COLD 160
## 15 THUNDERSTORM WIND 133
## 16 HEAVY SNOW 127
## 17 EXTREME COLD/WIND CHILL 125
## 18 STRONG WIND 103
## 19 BLIZZARD 101
## 20 HIGH SURF 101
## 21 HEAVY RAIN 98
## 22 EXTREME HEAT 96
As can be seen, tornadoes and excessive heat are by far the two event types that killed the most people.
For the total number of injuries, I will only take into consideration the first 14 causes, because each of them injured more than 1000 people.
head(health_injuries, n=22)
## EVTYPE INJURIES
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## 11 WINTER STORM 1321
## 12 HURRICANE/TYPHOON 1275
## 13 HIGH WIND 1137
## 14 HEAVY SNOW 1021
## 15 WILDFIRE 911
## 16 THUNDERSTORM WINDS 908
## 17 BLIZZARD 805
## 18 FOG 734
## 19 WILD/FOREST FIRE 545
## 20 DUST STORM 440
## 21 WINTER WEATHER 398
## 22 DENSE FOG 342
As can be seen, tornadoes are again the event type with the most injuries, followed by four others (TSTM wind, flood, excessive heat and lightning).
Across the United States, which types of events have the greatest economic consequences?
As the output below shows, floods are by far the most expensive type of event.
head(net_costs, n=10)
## EVTYPE NETDMG
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57352114049
## 4 STORM SURGE 43323541000
## 5 HAIL 18757805433
## 6 FLASH FLOOD 17562129167
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041360
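# Bar plot of the five costliest event types (type="l" has no meaningful effect when x is a factor)
barplot(net_costs$NETDMG[1:5], names.arg=as.character(net_costs$EVTYPE[1:5]), las=2, main="Total costs per event")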
dev.off()
## null device
## 1
Several points limit the accuracy of this analysis. First, tornadoes are the longest-documented type of event; the others started being recorded years later, which gives tornadoes an advantage in the totals. Second, total economic cost is not comparable across years, because both the total population and the real value of money change over time. For example, 25k dollars in 1950 is not worth the same as 25k dollars in 1995. For those reasons it is very difficult to draw firm conclusions. I would suggest correcting for those variables: the economic and health impact could be adjusted per 100,000 people, and the event types could be compared decade by decade. That work far exceeds what was asked in this exercise, but I wanted to explain it briefly.
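As a rough illustration of the correction suggested above, the following sketch groups made-up records by decade and expresses fatalities per 100,000 people. Everything in it is hypothetical: the YEAR column, the toy records and the population figures are placeholders I invented for the example, not values from the storm data or from any census.

```r
# Hypothetical toy data: the YEAR column and all values are made up for illustration.
toy <- data.frame(
  YEAR       = c(1955, 1958, 1994, 1996, 1999),
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "TORNADO", "FLOOD"),
  FATALITIES = c(40, 20, 10, 15, 5)
)

# Placeholder population per decade (not census figures).
population <- data.frame(
  DECADE     = c(1950, 1990),
  POPULATION = c(150e6, 250e6)
)

# Group years into decades (1955 -> 1950, 1996 -> 1990).
toy$DECADE <- floor(toy$YEAR / 10) * 10

# Total fatalities per event type within each decade.
by_decade <- aggregate(FATALITIES ~ DECADE + EVTYPE, toy, sum)

# Attach the population and express fatalities per 100,000 people.
by_decade <- merge(by_decade, population, by = "DECADE")
by_decade$PER_100K <- by_decade$FATALITIES / by_decade$POPULATION * 1e5
by_decade
```

The same pattern would apply to the economic cost, with an additional step to convert each year's dollars to a common reference year.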
I conclude that tornadoes are the events that kill the most people and floods are the events that cost the most.
I'm sorry I ran out of time for the plots.