This report contains my analysis on Strom Dataset collected by NOAA National Weather Service between 1950 and 2011. The result shows that Tornado has the greatest impact on population health in the United Stated. Meanwhile, Excessive heat, which occures less than Tornado causes relativly a high number of injuries and fatalities in the US. In comparision, Flood causes the highst economic impact and then Hurricanes and Typhoons.
To analyze the data, we need plyr, reshape2, ggplot2, tm, wordcloud to be installed on the machine.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.2
## Loading required package: RColorBrewer
## Warning: package 'RColorBrewer' was built under R version 3.2.2
library(RColorBrewer)
library(plyr)
## Warning: package 'plyr' was built under R version 3.2.2
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.2
The data obtaind from Strom Data. This data contains the US National Oceanic and Atmospheric Administaration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States from the begining of 1950 to the end of November 2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The data for this project come in the form of a comma-seperated-value file compressed via the bzip2 algorithm to reduce its size. To read this we can use simply read.csv() function in R. This function can handle compressed files automatically. For this purpose, you can use read.table() function which is a generic version of read.csv(). Data loading is time-consuming process. To avoid this, we can use cache=TRUE option in code chunck.
stormData <- read.csv(bzfile("StormData.csv.bz2"))
After reading the storm data we can check the first few rows(this are 902,297 rows) of this datasets.
dim(stormData)
## [1] 902297 37
head(stormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
It would be useful to take a look at the format and type of considered variables in this dataset.
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The column we are interested in is the EVTYPE which contains the type of events happend during 1950-2011 in all across the US. Here we can see some type of them with their number of occurance.
stormData$EVTYPE <- toupper(stormData$EVTYPE)
events <- stormData$EVTYPE
summary(events)
## Length Class Mode
## 902297 character character
Here we can draw a fancy version of chart by wordcloud package.
wordcloud(as.character(unique(stormData$EVTYPE)), random.order=FALSE)
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.2.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.2
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
To answer this question that what type of events is the most harmufl one in the US, we should consider its impact on population health which is measurable by the number of Fatality and Injury. To do this, we need to compare the number of fatalities and injuries occured by most common type of events (top 20). Fist we need to subset and aggregate these data.
fatality <- aggregate(FATALITIES ~ EVTYPE, data = stormData, FUN = "sum")
fatality <- fatality[order(fatality$FATALITIES, decreasing = TRUE), ]
head(fatality,20)
## EVTYPE FATALITIES
## 758 TORNADO 5633
## 116 EXCESSIVE HEAT 1903
## 138 FLASH FLOOD 978
## 243 HEAT 937
## 418 LIGHTNING 816
## 779 TSTM WIND 504
## 154 FLOOD 470
## 524 RIP CURRENT 368
## 320 HIGH WIND 248
## 19 AVALANCHE 224
## 888 WINTER STORM 206
## 525 RIP CURRENTS 204
## 245 HEAT WAVE 172
## 125 EXTREME COLD 162
## 685 THUNDERSTORM WIND 133
## 274 HEAVY SNOW 127
## 126 EXTREME COLD/WIND CHILL 125
## 312 HIGH SURF 104
## 604 STRONG WIND 103
## 28 BLIZZARD 101
To compare different events, it’s nice to display the aggregated data on a bar chart.
topFatality <- fatality[1:20, ]
ggplot(data = topFatality, aes(EVTYPE, FATALITIES, fill = FATALITIES)) + geom_bar(stat = "identity") +
xlab("Event") + ylab("Fatalities") + ggtitle("Fatalities Caused by Different Events (top 20)") +
coord_flip() + theme(legend.position = "none")
The graph clearly shows that TORNADO is the most deadly event over the years from 1950 to 2011.
injury <- aggregate(INJURIES ~ EVTYPE, data = stormData, FUN = "sum")
injury <- injury[order(injury$INJURIES, decreasing = TRUE), ]
head(injury,20)
## EVTYPE INJURIES
## 758 TORNADO 91346
## 779 TSTM WIND 6957
## 154 FLOOD 6789
## 116 EXCESSIVE HEAT 6525
## 418 LIGHTNING 5230
## 243 HEAT 2100
## 387 ICE STORM 1975
## 138 FLASH FLOOD 1777
## 685 THUNDERSTORM WIND 1488
## 212 HAIL 1361
## 888 WINTER STORM 1321
## 372 HURRICANE/TYPHOON 1275
## 320 HIGH WIND 1137
## 274 HEAVY SNOW 1021
## 875 WILDFIRE 911
## 711 THUNDERSTORM WINDS 908
## 28 BLIZZARD 805
## 171 FOG 734
## 873 WILD/FOREST FIRE 545
## 105 DUST STORM 440
To compare different events, it’s nice to display the aggregated data on a bar chart.
topInjury <- injury[1:20, ]
ggplot(data = topInjury, aes(EVTYPE, INJURIES, fill = INJURIES)) + geom_bar(stat = "identity") +
xlab("Event") + ylab("Injuries") + ggtitle("Injuries Caused by Different Events (top 20)") +
coord_flip() + theme(legend.position = "none")
As you can see on the above chart, TORNADO is the most injuring event in the US.
Here we want to answer this question that what type of events has the most impact on economic. First we should cjeck the exponent data in PROPDMGEXP variable in the Storm dataset.
unique(stormData$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
It shows that we have mixed cases data, for example h and H or m and M. To fix this, it’s good idea to convert them to upper or lower case character.
stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
unique(stormData$PROPDMGEXP)
## [1] "K" "M" "" "B" "+" "0" "5" "6" "?" "4" "2" "3" "H" "7" "-" "1" "8"
Now we should convert exponent characters to numeric values. calcExp() function will perform this task.
calcExp <- function(x, exp = "") {
switch(exp, `-` = x * -1, `?` = x, `+` = x, `1` = x*10, `2` = x*(10^2), `3` = x*(10^3),
`4` = x * (10^4), `5` = x * (10^5), `6` = x * (10^6), `7` = x *(10^7),
`8` = x * (10^8), H = x * 100, K = x * 1000, M = x * 1e+06, B = x * 1e+09, x)
}
applyCalcExp <- function(vx, vexp) {
if (length(vx) != length(vexp))
stop("Not same size")
result <- rep(0, length(vx))
for (i in 1:length(vx)) {
result[i] <- calcExp(vx[i], vexp[i])
}
result
}
Now, we are able to calculate the cost of damage caused by each event.
stormData$damageCosts <- applyCalcExp(as.numeric(stormData$PROPDMG), stormData$PROPDMGEXP)
summary(stormData$damageCosts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.500e+01 0.000e+00 0.000e+00 4.746e+05 5.000e+02 1.150e+11
Here, we calculate the total damage cost for every event.
tDamagecosts <- subset(aggregate(damageCosts ~ EVTYPE, data = stormData, FUN = "sum"), damageCosts > 0)
tDamagecosts <- tDamagecosts[order(tDamagecosts$damageCosts, decreasing = TRUE), ]
To compare different events, it’s nice to display the aggregated data on a bar chart.
tDamagecosts_top20 <- tDamagecosts[1:20, ]
ggplot(data = tDamagecosts_top20, aes(EVTYPE, damageCosts, fill = damageCosts)) + geom_bar(stat = "identity") +
xlab("Event") + ylab("Economic costs in $US") +
ggtitle("Economic costs caused by Different Events (top 20) ") + coord_flip() + theme(legend.position = "none")
As it’s clear in the above char, FLOOD has the greatest impact on economic and then HURRICAN/TYPHOON, TORNADO and STORM SURGE respectfully.