This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
In the following paragraphs, I take a first look at the whole dataset and then focus on the damage that weather events cause to population health and to the economy.
Load the required package:
library(plyr)
First, I download the data from the URL (unless the file already exists locally) and read it into R as a data frame.
filename <- "repdata_data_StormData.csv.bz2"
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(filename)) {
  download.file(fileURL, filename, method = "curl")
}
stormData <- read.csv(filename, sep = ",", header = TRUE)
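The read step relies on read.csv() decompressing the .bz2 file on the fly. As a minimal sketch, assuming a small made-up table in place of the real download, the snippet below writes a bzip2-compressed CSV to a temporary file, reads it back the same way, and keeps only a few columns. The toy rows and the `keep` selection are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for the storm file (made-up rows, not real NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(2, 0, 0),
  INJURIES   = c(10, 1, 0),
  REMARKS    = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Write it bzip2-compressed, mimicking the .csv.bz2 download
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file directly when given its path
toyRead <- read.csv(tmp, stringsAsFactors = FALSE)

# Keep only the columns of interest (hypothetical selection)
keep <- c("EVTYPE", "FATALITIES", "INJURIES")
toySmall <- toyRead[, keep]
dim(toySmall)
```

Dropping unused columns early is optional, but it can reduce memory use when working with a table of this size.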
To get a first impression of the data, we can call the names() function to see which columns are available:
names(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We can also print a summary of each column:
summary(stormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
We extract and process data for the following questions:
### Q1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
To answer this question, we first need to extract the relevant data. Looking through the column names, only “FATALITIES” and “INJURIES” relate to population health, so we sum these counts by event type:
fatal_byType <- aggregate(FATALITIES ~ EVTYPE, stormData, sum)
injury_byType <- aggregate(INJURIES ~ EVTYPE, stormData, sum )
Sort the data in descending order:
fatal_byType_Sorted <- fatal_byType[order(-fatal_byType$FATALITIES),]
injury_byType_Sorted <- injury_byType[order(-injury_byType$INJURIES),]
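The aggregate-then-sort pattern above can be illustrated on a few rows. This is a minimal sketch on a made-up data frame (the `toy` values are assumptions, not NOAA data); it uses the same aggregate() formula interface and order() call as the analysis.

```r
# Made-up events: two TORNADO rows, two FLOOD rows, one HEAT row
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities within each event type (one row per EVTYPE)
byType <- aggregate(FATALITIES ~ EVTYPE, toy, sum)

# Negating the column makes order() sort from largest to smallest
byTypeSorted <- byType[order(-byType$FATALITIES), ]
byTypeSorted
```

Subsetting with order() keeps the row names of the unsorted aggregate, which is why sorted tables produced this way carry non-consecutive row labels.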
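For the second question, on economic consequences, we proceed the same way and first extract the relevant data, which are “PROPDMG” (property damage) and “CROPDMG” (crop damage). One important detail is the exponent column that accompanies each damage value (“PROPDMGEXP” and “CROPDMGEXP”): it encodes the magnitude as H (hundreds), K (thousands), M (millions), or B (billions), so each value has to be multiplied accordingly. Values with any other code are left unscaled: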
stormData <- mutate(stormData, propertyDMG = ifelse(toupper(PROPDMGEXP) =='H', PROPDMG*1e+02,
ifelse(toupper(PROPDMGEXP) =='K', PROPDMG*1e+03,
ifelse(toupper(PROPDMGEXP) == 'M', PROPDMG*1e+06,
ifelse(toupper(PROPDMGEXP) == 'B', PROPDMG*1e+09, PROPDMG)))))
# The fallback must be CROPDMG; using PROPDMG here would leak property values into the crop totals.
stormData <- mutate(stormData, cropDMG = ifelse(toupper(CROPDMGEXP) =='H', CROPDMG*1e+02,
ifelse(toupper(CROPDMGEXP) =='K', CROPDMG*1e+03,
ifelse(toupper(CROPDMGEXP) == 'M', CROPDMG*1e+06,
ifelse(toupper(CROPDMGEXP) == 'B', CROPDMG*1e+09, CROPDMG)))))
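The nested ifelse() calls can also be expressed with a lookup table, which keeps the exponent codes in one place. The following is a minimal sketch, not the method used in this report: `expMultiplier` is a hypothetical helper and the `toy` rows are made up. It follows the same rule as the code above, where H, K, M and B (in either case) scale the value and every other code leaves it unchanged.

```r
# Hypothetical helper: map exponent codes to numeric multipliers
expMultiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(code)]   # unknown or empty codes yield NA
  mult[is.na(mult)] <- 1          # leave such values unscaled
  unname(mult)
}

# Made-up damage values with a mix of codes
toy <- data.frame(
  DMG    = c(25, 2.5, 10, 5, 1),
  DMGEXP = c("K", "m", "", "B", "?"),
  stringsAsFactors = FALSE
)

toy$damage <- toy$DMG * expMultiplier(toy$DMGEXP)
toy$damage
```

Because the same helper would be applied to both the property and the crop columns, a copy-and-paste slip between the two mutate() calls becomes less likely.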
Since both kinds of damage are equally important, I look at them separately and also add them up to get the total damage to the economy. As before, we sum the values by event type and sort them in descending order.
## Extracting and sum over types
propertyDMG_byType <- aggregate(propertyDMG ~ EVTYPE, stormData, sum)
cropDMG_byType <- aggregate(cropDMG ~ EVTYPE, stormData, sum)
totalDMG_byType <- merge(propertyDMG_byType, cropDMG_byType, by = "EVTYPE")
totalDMG_byType$TOTALDMG <- totalDMG_byType$propertyDMG + totalDMG_byType$cropDMG
## Sorting
propertyDMG_byType_Sorted <- propertyDMG_byType[order(-propertyDMG_byType$propertyDMG),]
cropDMG_byType_Sorted <- cropDMG_byType[order(-cropDMG_byType$cropDMG),]
totalDMG_byType_Sorted <- totalDMG_byType[order(-totalDMG_byType$TOTALDMG),]
Here are the top 10 weather event types by fatalities and by injuries:
fatal_byType_Sorted[1:10,]
## EVTYPE FATALITIES
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
## 856 TSTM WIND 504
## 170 FLOOD 470
## 585 RIP CURRENT 368
## 359 HIGH WIND 248
## 19 AVALANCHE 224
injury_byType_Sorted[1:10,]
## EVTYPE INJURIES
## 834 TORNADO 91346
## 856 TSTM WIND 6957
## 170 FLOOD 6789
## 130 EXCESSIVE HEAT 6525
## 464 LIGHTNING 5230
## 275 HEAT 2100
## 427 ICE STORM 1975
## 153 FLASH FLOOD 1777
## 760 THUNDERSTORM WIND 1488
## 244 HAIL 1361
To get a more intuitive impression, we can make a bar plot showing the fatality and injury numbers caused by different events:
par(mfrow = c(1, 2))
par(mar = c(10, 4, 4, 2), cex = 0.8, cex.main = 1.2, cex.lab = 1.2)
barplot(fatal_byType_Sorted$FATALITIES[1:10], names.arg = fatal_byType_Sorted$EVTYPE[1:10], col = 'blue',
main = 'Top 10 Weather Events for Fatalities', ylab = 'Number of Fatalities')
barplot(injury_byType_Sorted$INJURIES[1:10], names.arg = injury_byType_Sorted$EVTYPE[1:10], col = 'green',
main = 'Top 10 Weather Events for Injuries', ylab = 'Number of Injuries')
The tables and plots above make it clear that TORNADO is the event type most harmful to population health. With 5633 fatalities and 91346 injuries, it accounts for roughly three times the fatalities and thirteen times the injuries of the next-ranked event type.
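Long event names such as EXCESSIVE HEAT can overlap or be dropped when bar labels are drawn horizontally. One option, shown here as a minimal sketch rather than the plotting code of this report, is to rotate the labels with las = 2 so that they use the enlarged bottom margin. The three counts are the top fatality figures from the table above; the `counts` vector itself is an illustrative construction.

```r
# Top three fatality counts taken from the table above
counts <- c(TORNADO = 5633, `EXCESSIVE HEAT` = 1903, `FLASH FLOOD` = 978)

# Large bottom margin leaves room for the rotated labels
par(mar = c(10, 4, 4, 2))

# las = 2 draws axis labels perpendicular to the axis;
# barplot() uses the vector names as bar labels and returns the bar midpoints
mids <- barplot(counts, las = 2, col = "blue",
                main = "Top 3 Weather Events for Fatalities",
                ylab = "Number of Fatalities")
```

The returned midpoints can be passed to text() if value labels above the bars are wanted.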
Here are the top 10 weather event types by property damage, crop damage, and total damage:
propertyDMG_byType_Sorted[1:10,]
## EVTYPE propertyDMG
## 170 FLOOD 144657709807
## 411 HURRICANE/TYPHOON 69305840000
## 834 TORNADO 56937160779
## 670 STORM SURGE 43323536000
## 153 FLASH FLOOD 16140812067
## 244 HAIL 15732267543
## 402 HURRICANE 11868319010
## 848 TROPICAL STORM 7703890550
## 972 WINTER STORM 6688497251
## 359 HIGH WIND 5270046295
cropDMG_byType_Sorted[1:10,]
## EVTYPE cropDMG
## 95 DROUGHT 13972567047
## 170 FLOOD 5662327861
## 590 RIVER FLOOD 5029470938
## 427 ICE STORM 5022154924
## 244 HAIL 3026276714
## 402 HURRICANE 2741917138
## 411 HURRICANE/TYPHOON 2607874471
## 153 FLASH FLOOD 1422066007
## 140 EXTREME COLD 1292980301
## 212 FROST/FREEZE 1094086000
totalDMG_byType_Sorted[1:10,]
## EVTYPE propertyDMG cropDMG TOTALDMG
## 170 FLOOD 144657709807 5.662328e+09 150320037668
## 411 HURRICANE/TYPHOON 69305840000 2.607874e+09 71913714471
## 834 TORNADO 56937160779 4.175364e+08 57354697199
## 670 STORM SURGE 43323536000 2.408588e+04 43323560086
## 244 HAIL 15732267543 3.026277e+09 18758544256
## 153 FLASH FLOOD 16140812067 1.422066e+09 17562878074
## 95 DROUGHT 1046106000 1.397257e+10 15018673047
## 402 HURRICANE 11868319010 2.741917e+09 14610236148
## 590 RIVER FLOOD 5118945500 5.029471e+09 10148416438
## 427 ICE STORM 3944927860 5.022155e+09 8967082784
To get a more intuitive impression, we can make a bar plot showing the damages caused by different events:
par(mfrow = c(1, 3))
par(mar = c(10, 4, 4, 2), cex = 0.8, cex.main = 1.2, cex.lab = 1.2)
barplot(propertyDMG_byType_Sorted$propertyDMG[1:10], names.arg = propertyDMG_byType_Sorted$EVTYPE[1:10], col = 'blue',
main = 'Top 10 Property damages', ylab = 'Property Damages')
barplot(cropDMG_byType_Sorted$cropDMG[1:10], names.arg = cropDMG_byType_Sorted$EVTYPE[1:10], col = 'green',
main = 'Top 10 Crop damages', ylab = 'Crop Damages')
barplot(totalDMG_byType_Sorted$TOTALDMG[1:10], names.arg = totalDMG_byType_Sorted$EVTYPE[1:10], col = 'orange',
main = 'Top 10 Total damages', ylab = 'Total Damages')
As shown in the plots, FLOOD causes the most property damage, while DROUGHT causes the most crop damage. Overall, FLOOD has the greatest economic consequences: its total damage is about twice that of the next-ranked event type, HURRICANE/TYPHOON.