Prepared By: Partha Majumdar
Date Prepared: 14Jun2014
Purpose: Prepared as a part of Project Work (Project 2) for the Reproducible Research Course which is a part of the Data Science Specialisation.
The below analysis is focused on finding out the impact of Storms on Population Health and the economic consequences of Stroms. This analysis is consucted as a part of the Project for Reproducible Research Course which is a part of the Data Science Specialisation. The data analyses is provided by the instructors and is taken from data collected by National Oceanic & Atmospheric Administration. As there are lots of gap in the completion of the data, only the complete data is used for analysis.
The top two events effecting population health are “Tornadoes” and “Excessive Heat”. Impact of “Tornadoes” far overweighs all other events.
The biggest volume of Property Damages are caused by “Tornadoes” followed by “Flash Floods”. The biggest volume of Crop Damages are caused by “Hail”.
On an overall analysis, “Tornadoes” are the biggest natural causes of disasters impacting the population and economics in United States of America.
This section states verbatim the process used for analysing the data for coming to the conclusions in this report.
To start the analysis, we need loading the storm data and finding the characteristics of the data. Though much has been provided in the associated documentation, it is essential to see the physical data before proceeding to analyse the same.
Load the data.
stormData <- read.csv("StormData.csv")
colnames(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
nrow(stormData)
## [1] 902297
ncol(stormData)
## [1] 37
completeStormData <- stormData[complete.cases(stormData), ]
nrow(completeStormData)
## [1] 0
We find that there is not one row which has complete data from all aspects. So, we narrow down on the columns which contain the data for the population health impact and the the data for the economic impact. We make 2 sets of data from the original dataset.
The first data set only includes data for the population impact. We include the following fields - State, Begin Date, County Name, Event Type, Fatalities and Injuries.
populationImpactData <- stormData[, c(1,2,6,7,8,23,24)]
nrow(populationImpactData)
## [1] 902297
populationImpactData <- populationImpactData[complete.cases(populationImpactData), ]
nrow(populationImpactData)
## [1] 902297
The second data set only includes data for the economic impact. We include the following fields - State, Begin Date, County Name, Event Type, Property Damages and Crop Damages.
economicImpactData <- stormData[, c(1,2,6,7,8,25,27)]
nrow(economicImpactData)
## [1] 902297
economicImpactData <- economicImpactData[complete.cases(economicImpactData), ]
nrow(economicImpactData)
## [1] 902297
We see that with this vertical subsetting, we can use all of the available data for analysis.
For analysis the population impact, we determine an “population impact”. “Population Impact” is calculated as sum of impact due to fatalities and impact due to injuries. We assign a 70% weight to the impact due to fatalities and 30% weight to the impact due to injuries. So,
Population Impact = 70% * Impact due to Fatalities + 30% * Impact due to Injuries.
populationImpact <- (70/100*populationImpactData$FATALITIES) + (30/100*populationImpactData$INJURIES)
populationImpactData <- cbind(populationImpactData, populationImpact)
head(populationImpactData)
## STATE__ BGN_DATE COUNTYNAME STATE EVTYPE FATALITIES INJURIES
## 1 1 4/18/1950 0:00:00 MOBILE AL TORNADO 0 15
## 2 1 4/18/1950 0:00:00 BALDWIN AL TORNADO 0 0
## 3 1 2/20/1951 0:00:00 FAYETTE AL TORNADO 0 2
## 4 1 6/8/1951 0:00:00 MADISON AL TORNADO 0 2
## 5 1 11/15/1951 0:00:00 CULLMAN AL TORNADO 0 2
## 6 1 11/15/1951 0:00:00 LAUDERDALE AL TORNADO 0 6
## populationImpact
## 1 4.5
## 2 0.0
## 3 0.6
## 4 0.6
## 5 0.6
## 6 1.8
We now aggregate the data for FATALITIES, INJURIES and “Population Health Impact”.
aggregatedFatalitiesSum <- aggregate(FATALITIES ~ EVTYPE, data=populationImpactData, sum)
aggregatedInjuriesSum <- aggregate(INJURIES ~ EVTYPE, data=populationImpactData, sum)
aggregatedImpactSum <- aggregate(populationImpact ~ EVTYPE, data=populationImpactData, sum)
We find that there are 985 Event Types recorded in the data. So, we consider only the top 5 Event Types in terms of values of Fatalities and Injuries.
topFatalities <- aggregatedFatalitiesSum[c(aggregatedFatalitiesSum$FATALITIES %in% sort(aggregatedFatalitiesSum$FATALITIES, decreasing=TRUE)[1:5]),]
topInjuries <- aggregatedInjuriesSum[c(aggregatedInjuriesSum$INJURIES %in% sort(aggregatedInjuriesSum$INJURIES, decreasing=TRUE)[1:5]),]
topImpact <- aggregatedImpactSum[c(aggregatedImpactSum$populationImpact %in% sort(aggregatedImpactSum$populationImpact, decreasing=TRUE)[1:5]),]
#topFatalities
#topInjuries
#topImpact
We now plot this data to see the impact. Below is a set of 3 plots fit in one sheet. We have chosen Bar Graph to be able to display the volumes of impact.
par(mfrow = c(1,3), bg="Cyan", las=3)
barplot(height=topFatalities$FATALITIES, names.arg=topFatalities$EVTYPE, main="FATALITIES", ylab="Numbers")
barplot(height=topInjuries$INJURIES, names.arg=topInjuries$EVTYPE, main="INJURIES", ylab="Numbers")
barplot(height=topImpact$populationImpact, names.arg=topImpact$EVTYPE, main="Net Impact", ylab="Numbers")
From the graphs, we get evidence that the biggest number of Fatalities and Injuries are caused by “Tornadoes”. Though there are not comparable number of fatalities due to “Excessive Heat”, the number of Injuries due to “Excessive Heat” is significant second. In net terms, the maximum impact on population health is due to “Tornadoes”.
For analysis the economic impact, we determine an “economic impact”. “Economic Impact” is calculated as sum of impact due to Property Damages and impact due to Crop Damages. We assign a 55% weight to the impact due to Property Damages and 45% weight to the impact due to Crop Damages. So,
Economic Impact = 55% * Impact due to Property Damages + 45% * Impact due to Crop Damages.
economicImpact <- (55/100*economicImpactData$PROPDMG) + (45/100*economicImpactData$CROPDMG)
economicImpactData <- cbind(economicImpactData, economicImpact)
head(economicImpactData)
## STATE__ BGN_DATE COUNTYNAME STATE EVTYPE PROPDMG CROPDMG
## 1 1 4/18/1950 0:00:00 MOBILE AL TORNADO 25.0 0
## 2 1 4/18/1950 0:00:00 BALDWIN AL TORNADO 2.5 0
## 3 1 2/20/1951 0:00:00 FAYETTE AL TORNADO 25.0 0
## 4 1 6/8/1951 0:00:00 MADISON AL TORNADO 2.5 0
## 5 1 11/15/1951 0:00:00 CULLMAN AL TORNADO 2.5 0
## 6 1 11/15/1951 0:00:00 LAUDERDALE AL TORNADO 2.5 0
## economicImpact
## 1 13.750
## 2 1.375
## 3 13.750
## 4 1.375
## 5 1.375
## 6 1.375
We now aggregate the data for FATALITIES, INJURIES and “Population Health Impact”.
aggregatedPropertyDamageSum <- aggregate(PROPDMG ~ EVTYPE, data=economicImpactData, sum)
aggregatedCropDamageSum <- aggregate(CROPDMG ~ EVTYPE, data=economicImpactData, sum)
aggregatedEconomicImpactSum <- aggregate(economicImpact ~ EVTYPE, data=economicImpactData, sum)
We find that there are 985 Event Types recorded in the data. So, we consider only the top 5 Event Types in terms of values of Property Damage and Crop Damage.
topPropertyDamage <- aggregatedPropertyDamageSum[c(aggregatedPropertyDamageSum$PROPDMG %in% sort(aggregatedPropertyDamageSum$PROPDMG, decreasing=TRUE)[1:5]),]
topCropDamage <- aggregatedCropDamageSum[c(aggregatedCropDamageSum$CROPDMG %in% sort(aggregatedCropDamageSum$CROPDMG, decreasing=TRUE)[1:5]),]
topEconomicImpact <- aggregatedEconomicImpactSum[c(aggregatedEconomicImpactSum$economicImpact %in% sort(aggregatedEconomicImpactSum$economicImpact, decreasing=TRUE)[1:5]),]
#topPropertyDamage
#topCropDamage
#topEconomicImpact
We now plot this data to see the impact. Below is a set of 3 plots fit in one sheet. We have chosen Bar Graph to be able to display the volumes of impact.
par(mfrow = c(1,3), bg="Yellow", las=3)
barplot(height=topPropertyDamage$PROPDMG, names.arg=topPropertyDamage$EVTYPE, main="Property Damages", ylab="Numbers")
barplot(height=topCropDamage$CROPDMG, names.arg=topCropDamage$EVTYPE, main="Crop Damages", ylab="Numbers")
barplot(height=topEconomicImpact$economicImpact, names.arg=topEconomicImpact$EVTYPE, main="Net Economic Impact", ylab="Numbers")
From the graphs, we get evidence that the biggest volume of Property Damages are caused by “Tornadoes” and the biggest volume of Crop Damages are caused by “Hail”.
This section provides the top 10 reasons for Population Impact and top 10 reasons for Economic Impact.
top10Fatalities <- aggregatedFatalitiesSum[c(aggregatedFatalitiesSum$FATALITIES %in% sort(aggregatedFatalitiesSum$FATALITIES, decreasing=TRUE)[1:10]),]
top10Fatalities[order(-top10Fatalities[,2]), ]
## EVTYPE FATALITIES
## 826 TORNADO 5633
## 124 EXCESSIVE HEAT 1903
## 151 FLASH FLOOD 978
## 271 HEAT 937
## 453 LIGHTNING 816
## 846 TSTM WIND 504
## 167 FLOOD 470
## 572 RIP CURRENT 368
## 343 HIGH WIND 248
## 19 AVALANCHE 224
top10Injuries <- aggregatedInjuriesSum[c(aggregatedInjuriesSum$INJURIES %in% sort(aggregatedInjuriesSum$INJURIES, decreasing=TRUE)[1:10]),]
top10Injuries[order(-top10Injuries[,2]), ]
## EVTYPE INJURIES
## 826 TORNADO 91346
## 846 TSTM WIND 6957
## 167 FLOOD 6789
## 124 EXCESSIVE HEAT 6525
## 453 LIGHTNING 5230
## 271 HEAT 2100
## 422 ICE STORM 1975
## 151 FLASH FLOOD 1777
## 753 THUNDERSTORM WIND 1488
## 241 HAIL 1361
top10PropertyDamage <- aggregatedPropertyDamageSum[c(aggregatedPropertyDamageSum$PROPDMG %in% sort(aggregatedPropertyDamageSum$PROPDMG, decreasing=TRUE)[1:10]),]
top10PropertyDamage[order(-top10PropertyDamage[,2]), ]
## EVTYPE PROPDMG
## 826 TORNADO 3212258
## 151 FLASH FLOOD 1420125
## 846 TSTM WIND 1335966
## 167 FLOOD 899938
## 753 THUNDERSTORM WIND 876844
## 241 HAIL 688693
## 453 LIGHTNING 603352
## 779 THUNDERSTORM WINDS 446293
## 343 HIGH WIND 324732
## 962 WINTER STORM 132721
top10CropDamage <- aggregatedCropDamageSum[c(aggregatedCropDamageSum$CROPDMG %in% sort(aggregatedCropDamageSum$CROPDMG, decreasing=TRUE)[1:10]),]
top10CropDamage[order(-top10CropDamage[,2]), ]
## EVTYPE CROPDMG
## 241 HAIL 579596
## 151 FLASH FLOOD 179200
## 167 FLOOD 168038
## 846 TSTM WIND 109203
## 826 TORNADO 100019
## 753 THUNDERSTORM WIND 66791
## 91 DROUGHT 33899
## 779 THUNDERSTORM WINDS 18685
## 343 HIGH WIND 17283
## 281 HEAVY RAIN 11123