Storm Data Analysis

Prepared By: Partha Majumdar

Date Prepared: 14Jun2014

Purpose: Prepared as a part of Project Work (Project 2) for the Reproducible Research Course which is a part of the Data Science Specialisation.

Synopsis

Background

The below analysis is focused on finding out the impact of Storms on Population Health and the economic consequences of Stroms. This analysis is consucted as a part of the Project for Reproducible Research Course which is a part of the Data Science Specialisation. The data analyses is provided by the instructors and is taken from data collected by National Oceanic & Atmospheric Administration. As there are lots of gap in the completion of the data, only the complete data is used for analysis.

Salient Findings

The top two events effecting population health are “Tornadoes” and “Excessive Heat”. Impact of “Tornadoes” far overweighs all other events.

The biggest volume of Property Damages are caused by “Tornadoes” followed by “Flash Floods”. The biggest volume of Crop Damages are caused by “Hail”.

On an overall analysis, “Tornadoes” are the biggest natural causes of disasters impacting the population and economics in United States of America.

Data Processing

This section states verbatim the process used for analysing the data for coming to the conclusions in this report.

To start the analysis, we need loading the storm data and finding the characteristics of the data. Though much has been provided in the associated documentation, it is essential to see the physical data before proceeding to analyse the same.

Load the data.

stormData <- read.csv("StormData.csv")
colnames(stormData)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

nrow(stormData)

## [1] 902297

ncol(stormData)

## [1] 37

completeStormData <- stormData[complete.cases(stormData), ]
nrow(completeStormData)

## [1] 0

We find that there is not one row which has complete data from all aspects. So, we narrow down on the columns which contain the data for the population health impact and the the data for the economic impact. We make 2 sets of data from the original dataset.

The first data set only includes data for the population impact. We include the following fields - State, Begin Date, County Name, Event Type, Fatalities and Injuries.

populationImpactData <- stormData[, c(1,2,6,7,8,23,24)]
nrow(populationImpactData)

## [1] 902297

populationImpactData <- populationImpactData[complete.cases(populationImpactData), ]
nrow(populationImpactData)

## [1] 902297

The second data set only includes data for the economic impact. We include the following fields - State, Begin Date, County Name, Event Type, Property Damages and Crop Damages.

economicImpactData <- stormData[, c(1,2,6,7,8,25,27)]
nrow(economicImpactData)

## [1] 902297

economicImpactData <- economicImpactData[complete.cases(economicImpactData), ]
nrow(economicImpactData)

## [1] 902297

We see that with this vertical subsetting, we can use all of the available data for analysis.

Population Impact Analysis

For analysis the population impact, we determine an “population impact”. “Population Impact” is calculated as sum of impact due to fatalities and impact due to injuries. We assign a 70% weight to the impact due to fatalities and 30% weight to the impact due to injuries. So,

Population Impact = 70% * Impact due to Fatalities + 30% * Impact due to Injuries.

populationImpact <- (70/100*populationImpactData$FATALITIES) + (30/100*populationImpactData$INJURIES)
populationImpactData <- cbind(populationImpactData, populationImpact)
head(populationImpactData)

##   STATE__           BGN_DATE COUNTYNAME STATE  EVTYPE FATALITIES INJURIES
## 1       1  4/18/1950 0:00:00     MOBILE    AL TORNADO          0       15
## 2       1  4/18/1950 0:00:00    BALDWIN    AL TORNADO          0        0
## 3       1  2/20/1951 0:00:00    FAYETTE    AL TORNADO          0        2
## 4       1   6/8/1951 0:00:00    MADISON    AL TORNADO          0        2
## 5       1 11/15/1951 0:00:00    CULLMAN    AL TORNADO          0        2
## 6       1 11/15/1951 0:00:00 LAUDERDALE    AL TORNADO          0        6
##   populationImpact
## 1              4.5
## 2              0.0
## 3              0.6
## 4              0.6
## 5              0.6
## 6              1.8

We now aggregate the data for FATALITIES, INJURIES and “Population Health Impact”.

aggregatedFatalitiesSum <- aggregate(FATALITIES ~ EVTYPE, data=populationImpactData, sum)
aggregatedInjuriesSum <- aggregate(INJURIES ~ EVTYPE, data=populationImpactData, sum)
aggregatedImpactSum <- aggregate(populationImpact ~ EVTYPE, data=populationImpactData, sum)

We find that there are 985 Event Types recorded in the data. So, we consider only the top 5 Event Types in terms of values of Fatalities and Injuries.

topFatalities <- aggregatedFatalitiesSum[c(aggregatedFatalitiesSum$FATALITIES %in% sort(aggregatedFatalitiesSum$FATALITIES, decreasing=TRUE)[1:5]),]
topInjuries <- aggregatedInjuriesSum[c(aggregatedInjuriesSum$INJURIES %in% sort(aggregatedInjuriesSum$INJURIES, decreasing=TRUE)[1:5]),]
topImpact <- aggregatedImpactSum[c(aggregatedImpactSum$populationImpact %in% sort(aggregatedImpactSum$populationImpact, decreasing=TRUE)[1:5]),]
#topFatalities
#topInjuries
#topImpact

We now plot this data to see the impact. Below is a set of 3 plots fit in one sheet. We have chosen Bar Graph to be able to display the volumes of impact.

par(mfrow = c(1,3), bg="Cyan", las=3)
barplot(height=topFatalities$FATALITIES, names.arg=topFatalities$EVTYPE, main="FATALITIES", ylab="Numbers")
barplot(height=topInjuries$INJURIES, names.arg=topInjuries$EVTYPE, main="INJURIES", ylab="Numbers")
barplot(height=topImpact$populationImpact, names.arg=topImpact$EVTYPE, main="Net Impact", ylab="Numbers")

plot of chunk unnamed-chunk-7

From the graphs, we get evidence that the biggest number of Fatalities and Injuries are caused by “Tornadoes”. Though there are not comparable number of fatalities due to “Excessive Heat”, the number of Injuries due to “Excessive Heat” is significant second. In net terms, the maximum impact on population health is due to “Tornadoes”.

Economic Impact Analysis

For analysis the economic impact, we determine an “economic impact”. “Economic Impact” is calculated as sum of impact due to Property Damages and impact due to Crop Damages. We assign a 55% weight to the impact due to Property Damages and 45% weight to the impact due to Crop Damages. So,

Economic Impact = 55% * Impact due to Property Damages + 45% * Impact due to Crop Damages.

economicImpact <- (55/100*economicImpactData$PROPDMG) + (45/100*economicImpactData$CROPDMG)
economicImpactData <- cbind(economicImpactData, economicImpact)
head(economicImpactData)

##   STATE__           BGN_DATE COUNTYNAME STATE  EVTYPE PROPDMG CROPDMG
## 1       1  4/18/1950 0:00:00     MOBILE    AL TORNADO    25.0       0
## 2       1  4/18/1950 0:00:00    BALDWIN    AL TORNADO     2.5       0
## 3       1  2/20/1951 0:00:00    FAYETTE    AL TORNADO    25.0       0
## 4       1   6/8/1951 0:00:00    MADISON    AL TORNADO     2.5       0
## 5       1 11/15/1951 0:00:00    CULLMAN    AL TORNADO     2.5       0
## 6       1 11/15/1951 0:00:00 LAUDERDALE    AL TORNADO     2.5       0
##   economicImpact
## 1         13.750
## 2          1.375
## 3         13.750
## 4          1.375
## 5          1.375
## 6          1.375

We now aggregate the data for FATALITIES, INJURIES and “Population Health Impact”.

aggregatedPropertyDamageSum <- aggregate(PROPDMG ~ EVTYPE, data=economicImpactData, sum)
aggregatedCropDamageSum <- aggregate(CROPDMG ~ EVTYPE, data=economicImpactData, sum)
aggregatedEconomicImpactSum <- aggregate(economicImpact ~ EVTYPE, data=economicImpactData, sum)

We find that there are 985 Event Types recorded in the data. So, we consider only the top 5 Event Types in terms of values of Property Damage and Crop Damage.

topPropertyDamage <- aggregatedPropertyDamageSum[c(aggregatedPropertyDamageSum$PROPDMG %in% sort(aggregatedPropertyDamageSum$PROPDMG, decreasing=TRUE)[1:5]),]
topCropDamage <- aggregatedCropDamageSum[c(aggregatedCropDamageSum$CROPDMG %in% sort(aggregatedCropDamageSum$CROPDMG, decreasing=TRUE)[1:5]),]
topEconomicImpact <- aggregatedEconomicImpactSum[c(aggregatedEconomicImpactSum$economicImpact %in% sort(aggregatedEconomicImpactSum$economicImpact, decreasing=TRUE)[1:5]),]
#topPropertyDamage
#topCropDamage
#topEconomicImpact

We now plot this data to see the impact. Below is a set of 3 plots fit in one sheet. We have chosen Bar Graph to be able to display the volumes of impact.

par(mfrow = c(1,3), bg="Yellow", las=3)
barplot(height=topPropertyDamage$PROPDMG, names.arg=topPropertyDamage$EVTYPE, main="Property Damages", ylab="Numbers")
barplot(height=topCropDamage$CROPDMG, names.arg=topCropDamage$EVTYPE, main="Crop Damages", ylab="Numbers")
barplot(height=topEconomicImpact$economicImpact, names.arg=topEconomicImpact$EVTYPE, main="Net Economic Impact", ylab="Numbers")

plot of chunk unnamed-chunk-11

From the graphs, we get evidence that the biggest volume of Property Damages are caused by “Tornadoes” and the biggest volume of Crop Damages are caused by “Hail”.

Result

This section provides the top 10 reasons for Population Impact and top 10 reasons for Economic Impact.

Top 10 reasons for Fatalities

top10Fatalities <- aggregatedFatalitiesSum[c(aggregatedFatalitiesSum$FATALITIES %in% sort(aggregatedFatalitiesSum$FATALITIES, decreasing=TRUE)[1:10]),]
top10Fatalities[order(-top10Fatalities[,2]), ]

##             EVTYPE FATALITIES
## 826        TORNADO       5633
## 124 EXCESSIVE HEAT       1903
## 151    FLASH FLOOD        978
## 271           HEAT        937
## 453      LIGHTNING        816
## 846      TSTM WIND        504
## 167          FLOOD        470
## 572    RIP CURRENT        368
## 343      HIGH WIND        248
## 19       AVALANCHE        224

Top 10 reasons for Injuries

top10Injuries <- aggregatedInjuriesSum[c(aggregatedInjuriesSum$INJURIES %in% sort(aggregatedInjuriesSum$INJURIES, decreasing=TRUE)[1:10]),]
top10Injuries[order(-top10Injuries[,2]), ]

##                EVTYPE INJURIES
## 826           TORNADO    91346
## 846         TSTM WIND     6957
## 167             FLOOD     6789
## 124    EXCESSIVE HEAT     6525
## 453         LIGHTNING     5230
## 271              HEAT     2100
## 422         ICE STORM     1975
## 151       FLASH FLOOD     1777
## 753 THUNDERSTORM WIND     1488
## 241              HAIL     1361

Top 10 reasons for Property Damages

top10PropertyDamage <- aggregatedPropertyDamageSum[c(aggregatedPropertyDamageSum$PROPDMG %in% sort(aggregatedPropertyDamageSum$PROPDMG, decreasing=TRUE)[1:10]),]
top10PropertyDamage[order(-top10PropertyDamage[,2]), ]

##                 EVTYPE PROPDMG
## 826            TORNADO 3212258
## 151        FLASH FLOOD 1420125
## 846          TSTM WIND 1335966
## 167              FLOOD  899938
## 753  THUNDERSTORM WIND  876844
## 241               HAIL  688693
## 453          LIGHTNING  603352
## 779 THUNDERSTORM WINDS  446293
## 343          HIGH WIND  324732
## 962       WINTER STORM  132721

Top 10 reasons for Crop Damages

top10CropDamage <- aggregatedCropDamageSum[c(aggregatedCropDamageSum$CROPDMG %in% sort(aggregatedCropDamageSum$CROPDMG, decreasing=TRUE)[1:10]),]
top10CropDamage[order(-top10CropDamage[,2]), ]

##                 EVTYPE CROPDMG
## 241               HAIL  579596
## 151        FLASH FLOOD  179200
## 167              FLOOD  168038
## 846          TSTM WIND  109203
## 826            TORNADO  100019
## 753  THUNDERSTORM WIND   66791
## 91             DROUGHT   33899
## 779 THUNDERSTORM WINDS   18685
## 343          HIGH WIND   17283
## 281         HEAVY RAIN   11123