Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The analysis on the storm event database revealed that tornadoes are the most dangerous weather event to the population health. The second most dangerous event type is the excessive heat. The economic impact of weather events was also analyzed. Flash floods and thunderstorm winds caused billions of dollars in property damages between 1950 and 2011. The largest crop damage caused by drought, followed by flood and hails.
The analysis was performed on Storm Events Database, provided by National Climatic Data Center. The data is from a comma-separated-value file available here. There is also some documentation of the data available here.
library(data.table)
storm <- fread("repdata%2Fdata%2FStormData.csv" )
##
Read 17.6% of 967216 rows
Read 34.1% of 967216 rows
Read 49.6% of 967216 rows
Read 60.0% of 967216 rows
Read 75.5% of 967216 rows
Read 82.7% of 967216 rows
Read 92.0% of 967216 rows
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:09
storm <- as.data.frame(storm)
Then, explore the structure and the first few observations of the dataset.
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : chr "3" "2" "2" "2" ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are 37 variables, and here we select those related to our analysis.
main <- storm[ , c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
length(unique((main$EVTYPE)))
## [1] 985
# translate all letters to lowercase
event_types <- tolower(storm$EVTYPE)
# replace all punct. characters with a space
event_types <- gsub("[[:blank:][:punct:]+]", " ", event_types)
length(unique(event_types))
## [1] 874
main$EVTYPE <- event_types
No further data preprocessing was performed although the event type field can be processed further to merge event types such as tstm wind and thunderstorm wind. After the cleaning, as expected, the number of unique event types reduce significantly. For further analysis, the cleaned event types are used.
library (dplyr)
main <- group_by(main, EVTYPE)
str(main)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "tornado" "tornado" "tornado" "tornado" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## - attr(*, "vars")=List of 1
## ..$ : symbol EVTYPE
## - attr(*, "drop")= logi TRUE
## - attr(*, "indices")=List of 874
## ..$ : int 246123
## ..$ : int 448451
## ..$ : int 265834
## ..$ : int 408332
## ..$ : int 311102
## ..$ : int 286076 354104 354971 384795
## ..$ : int 314005
## ..$ : int 418262
## ..$ : int 377548
## ..$ : int 317133 318247 322497 327547
## ..$ : int 416340 489974
## ..$ : int 453796
## ..$ : int 427305 427326 434839 434856
## ..$ : int 206502 245649 246813 291161 307730 308334
## ..$ : int 189190
## ..$ : int 451314 468698 488145 488630 488995 488996 489016 489018 489090 489098 ...
## ..$ : int 647989 648793 648794 648795 649611 649791 651162 651163 651165 651174 ...
## ..$ : int 214946
## ..$ : int 188985 197153 198143 198144 198145 198177 198178 245869 251400 263661 ...
## ..$ : int 231123
## ..$ : int 253530 312546 359346 435863
## ..$ : int 232609
## ..$ : int 239661 239662
## ..$ : int 217288 230923
## ..$ : int 324074
## ..$ : int 324077 330916 330920
## ..$ : int 267119 278569 284176 296231 296803 297295 301079 324112 339058 345010 ...
## ..$ : int 188773 188776 188791 188793 188794 188795 188796 188798 188808 188833 ...
## ..$ : int 210551 233563
## ..$ : int 210521
## ..$ : int 209768
## ..$ : int 192948 216768
## ..$ : int 188918
## ..$ : int 298057
## ..$ : int 188775
## ..$ : int 214965
## ..$ : int 296256
## ..$ : int 284188
## ..$ : int 192273 192275 438114 443079
## ..$ : int 192992 192994 198082 198085 198091 198186 198193 210560 221429 230915 ...
## ..$ : int 210564 216243
## ..$ : int 201214
## ..$ : int 233584
## ..$ : int 188779
## ..$ : int 248508 248510 435511
## ..$ : int 246723
## ..$ : int 330956
## ..$ : int 324060
## ..$ : int 188817 188854 189046 189075 192337 192651 192656 194071 194081 194083 ...
## ..$ : int 194118 194357 194389 207190 207717 208320 208733 208737 216666 217311 ...
## ..$ : int 313896 313901 330959 351804 365202
## ..$ : int 252624 252627 260029 265425 265817 265825 266919 266922 284072 296261
## ..$ : int 208153 220873
## ..$ : int 217275 217323
## ..$ : int 252630
## ..$ : int 296867
## ..$ : int 188093 188503 190674 194046 194048 194086 194087 194088 194415 195024 ...
## ..$ : int 200789 201130 201820 202202
## ..$ : int 201030 209196
## ..$ : int 202180
## ..$ : int 290347 298652 307783 308344 308346 308367 308605
## ..$ : int 282922
## ..$ : int 199737
## ..$ : int 279967 279974
## ..$ : int 323510 323970 330778 330780
## ..$ : int 198726 209858 213991
## ..$ : int 232879 232880 232990 451319
## ..$ : int 570799 570800 587890 593048 593049 601129 606118 606332 606440 606459 ...
## ..$ : int 323971 330781 499069 499071 506531 506532
## ..$ : int 232612
## ..$ : int 198197
## ..$ : int 350344
## ..$ : int 417607 434950
## ..$ : int 283092 283093 297372 392224
## ..$ : int 221548
## ..$ : int 197132 198113 202399 204462 216220 252787 252813 252817
## ..$ : int 213455
## ..$ : int 187616 189608 192263 192272 192276 192277 192278 192279 192280 192282 ...
## ..$ : int 626159 649577 677177 685298 725455 733335 780034 784173 839524 848497
## ..$ : int 192629 269335
## ..$ : int 192660 216979
## ..$ : int 359910
## ..$ : int 265183
## ..$ : int 194360 194361 197783 197785 197790 199732 208575 208577 208610 215081 ...
## ..$ : int 206503 208539 208546 208553 208560 208586 208589 208598 208601 208606 ...
## ..$ : int 474064
## ..$ : int 194402 377949 382044 382131 415946 416284 416434 450915 452927
## ..$ : int 359360 359394 364828 364838 427330 434862
## ..$ : int 232811
## ..$ : int 192410 192425 192714 193165 193166 193316 193575 193666 214720 215623 ...
## ..$ : int 192853
## ..$ : int 192713
## ..$ : int 193261 193672
## ..$ : int 192458
## ..$ : int 193583
## ..$ : int 192696 192708 193660 193705 193999
## ..$ : int 192697
## ..$ : int 246836
## ..$ : int 309727 413600 413603 435506
## .. [list output truncated]
## - attr(*, "group_sizes")= int 1 1 1 1 1 4 1 1 1 4 ...
## - attr(*, "biggest_group_size")= int 288661
## - attr(*, "labels")='data.frame': 874 obs. of 1 variable:
## ..$ EVTYPE: chr " " " high surf advisory" " coastal flood" " flash flood" ...
## ..- attr(*, "vars")=List of 1
## .. ..$ : symbol EVTYPE
## ..- attr(*, "drop")= logi TRUE
pophealth <- summarise(main, count = n(), totFat = sum(FATALITIES), totInj = sum(INJURIES), meanFat = mean(FATALITIES), meanInj = mean(INJURIES))
Top 10 eventtypes that leading to thelargest number of total deaths are:
totdeath <- head(arrange(pophealth[ , c("EVTYPE", "totFat")], desc(totFat)), 10)
totdeath
## # A tibble: 10 × 2
## EVTYPE totFat
## <chr> <dbl>
## 1 tornado 5633
## 2 excessive heat 1903
## 3 flash flood 978
## 4 heat 937
## 5 lightning 816
## 6 tstm wind 504
## 7 flood 470
## 8 rip current 368
## 9 high wind 248
## 10 avalanche 224
Top 10 eventtypes that leading to the largest number of total injuries are:
totinj <- head(arrange(pophealth[ , c("EVTYPE", "totInj")], desc(totInj)), 10)
totinj
## # A tibble: 10 × 2
## EVTYPE totInj
## <chr> <dbl>
## 1 tornado 91346
## 2 tstm wind 6957
## 3 flood 6789
## 4 excessive heat 6525
## 5 lightning 5230
## 6 heat 2100
## 7 ice storm 1975
## 8 flash flood 1777
## 9 thunderstorm wind 1488
## 10 hail 1361
Top 10 eventtypes that leading to the largest mean number of total deaths are:
meandeath <- head(arrange(pophealth[ , c("EVTYPE", "meanFat", "count")], desc(meanFat)), 10)
meandeath
## # A tibble: 10 × 3
## EVTYPE meanFat count
## <chr> <dbl> <int>
## 1 tornadoes tstm wind hail 25.000000 1
## 2 cold and snow 14.000000 1
## 3 tropical storm gordon 8.000000 1
## 4 record excessive heat 5.666667 3
## 5 extreme heat 4.363636 22
## 6 heat wave drought 4.000000 1
## 7 high wind seas 4.000000 1
## 8 marine mishap 3.500000 2
## 9 winter storms 3.333333 3
## 10 heavy surf and wind 3.000000 1
Top 10 eventtypes that leading to the largest mean number of total injuris are:
meaninj <- head(arrange(pophealth[ , c("EVTYPE", "meanInj", "count")], desc(meanInj)), 10)
meaninj
## # A tibble: 10 × 3
## EVTYPE meanInj count
## <chr> <dbl> <int>
## 1 tropical storm gordon 43.000000 1
## 2 wild fires 37.500000 4
## 3 thunderstormw 27.000000 1
## 4 high wind and seas 20.000000 1
## 5 snow high winds 18.000000 2
## 6 glaze ice storm 15.000000 1
## 7 heat wave drought 15.000000 1
## 8 hurricane typhoon 14.488636 88
## 9 winter storm high winds 7.500000 2
## 10 extreme heat 7.045455 22
To analyze the impact of weather events on the economy, available property damage and crop damage reportings/estimates were used.
In the raw data, the property damage is represented with two fields, a number PROPDMG in dollars and the exponent PROPDMGEXP. Similarly, the crop damage is represented using two fields, CROPDMG and CROPDMGEXP. The first step in the analysis is to calculate the property and crop damage for each event.
unique(main$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
We can see that there are many dirty data. We need to deal with it, transforming all the damage value to the same unit.
exp_transform <- function(e) {
# h -> hundred, k -> thousand, m -> million, b -> billion
if (e %in% c('h', 'H'))
return(2)
else if (e %in% c('k', 'K'))
return(3)
else if (e %in% c('m', 'M'))
return(6)
else if (e %in% c('b', 'B'))
return(9)
else if (!is.na(as.numeric(e))) # if a digit
return(as.numeric(e))
else if (e %in% c('', '-', '?', '+'))
return(0)
else {
stop("Invalid exponent value.")
}
}
prop_dmg_exp <- sapply(main$PROPDMGEXP, FUN=exp_transform)
main$prop_dmg <- main$PROPDMG * (10 ** prop_dmg_exp)
crop_dmg_exp <- sapply(main$CROPDMGEXP, FUN=exp_transform)
main$crop_dmg <- main$CROPDMG * (10 ** crop_dmg_exp)
To calculate the economic loss:
main <- group_by(main, EVTYPE)
eco_loss <- summarise(main, totProp = sum(prop_dmg), totCrop = sum(crop_dmg), count = n(), meanProp = mean(prop_dmg, na.rm = TRUE), meanCrop = mean(crop_dmg, na.rm = TRUE))
Top 10 eventtypes that leading to thelargest number of total property damage are:
totprop <- head(arrange(eco_loss[ , c("EVTYPE", "totProp")], desc(totProp)), 10)
totprop
## # A tibble: 10 × 2
## EVTYPE totProp
## <chr> <dbl>
## 1 flood 144657709807
## 2 hurricane typhoon 69305840000
## 3 tornado 56947380677
## 4 storm surge 43323536000
## 5 flash flood 16822673979
## 6 hail 15735267513
## 7 hurricane 11868319010
## 8 tropical storm 7703890550
## 9 winter storm 6688497251
## 10 high wind 5270046295
Top 10 eventtypes that leading to the largest number of total crop damage are:
totcrop <- head(arrange(eco_loss[ , c("EVTYPE", "totCrop")], desc(totCrop)), 10)
totcrop
## # A tibble: 10 × 2
## EVTYPE totCrop
## <chr> <dbl>
## 1 drought 13972566000
## 2 flood 5661968450
## 3 river flood 5029459000
## 4 ice storm 5022113500
## 5 hail 3025954473
## 6 hurricane 2741910000
## 7 hurricane typhoon 2607872800
## 8 flash flood 1421317100
## 9 extreme cold 1312973000
## 10 frost freeze 1094186000
Top 10 eventtypes that leading to the largest mean number of proper damage are:
meanpop <- head(arrange(eco_loss[ , c("EVTYPE", "meanProp", "count")], desc(meanProp)), 10)
meanpop
## # A tibble: 10 × 3
## EVTYPE meanProp count
## <chr> <dbl> <int>
## 1 tornadoes tstm wind hail 1600000000 1
## 2 heavy rain severe weather 1250000000 2
## 3 hurricane typhoon 787566364 88
## 4 hurricane opal 352538444 9
## 5 storm surge 165990559 261
## 6 wild fires 156025000 4
## 7 hurricane opal high winds 100000000 1
## 8 severe thunderstorm 92720000 13
## 9 hailstorm 80333333 3
## 10 hurricane 68208730 174
Top 10 eventtypes that leading to the largest mean number of crop damages are:
meancrop<- head(arrange(eco_loss[ , c("EVTYPE", "meanCrop", "count")], desc(meanCrop)), 10)
meancrop
## # A tibble: 10 × 3
## EVTYPE meanCrop count
## <chr> <dbl> <int>
## 1 excessive wetness 142000000 1
## 2 cold and wet conditions 66000000 1
## 3 damaging freeze 37028750 8
## 4 hurricane typhoon 29634918 88
## 5 river flood 29072017 173
## 6 early frost 21000000 2
## 7 hurricane erin 19430000 7
## 8 flood rain winds 18800000 6
## 9 hurricane 15758103 174
## 10 hurricane opal high winds 10000000 1
The following plots show the top dangerous weather events for population health in terms of total number of fatalities and injuries respectively:
library(ggplot2)
library(gridExtra)
p1 <- ggplot(totdeath, aes(x = reorder(EVTYPE,totFat), y = totFat, fill = totFat)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Total number of fatalities") +
xlab("Event type")
p2 <- ggplot(totinj, aes(x = reorder(EVTYPE,totInj), y = totInj, fill = totInj)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Total number of injuries") +
xlab("Event type")
grid.arrange(grobs = list(p1,p2), main = "top dangerous weather events for population health ")
Tornadoes cause most number of deaths and injuries among all event types. There are more than 5,000 deaths and more than 90,000 injuries in the last 60 years in US, due to tornadoes. The other event types that are most dangerous with respect to population health are excessive heat, heat and flash floods.
The following plots show the top dangerous weather events for economic loss in terms of total number of property damage and crop damage respectively:
p3 <- ggplot(totprop, aes(x = reorder(EVTYPE,totProp), y = log10(totProp), fill = totProp)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Property damage in dollars (log-scale)") +
xlab("Event type")
p4 <- ggplot(totcrop, aes(x = reorder(EVTYPE,totCrop), y = totCrop, fill = totCrop)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Crop damage in dollars") +
xlab("Event type")
grid.arrange(grobs = list(p3,p4), main="Weather costs to the US economy (1950-2011)")
Property damages are given in logarithmic scale due to large range of values. The data shows that flash floods and thunderstorm winds cost the largest property damages among weather-related natural diseasters. Note that, due to untidy nature of the available data, type flood and flash flood are separate values and should be merged for more accurate data-driven conclusions.
The most severe weather event in terms of crop damage is the drought. In the last half century, the drought has caused more than 10 billion dollars damage. Other severe crop-damage-causing event types are floods and hails.