GitHub repo for the Reproducible Research Course Project 2.
The goal of the assignment is to explore the NOAA Storm Database and assess the effects of severe weather events on both population and economy. The database covers the time period between 1950 and November 2011.
This analysis shows that, for events recorded in 1996 - 2011, the type of weather event most harmful to population health (fatalities and injuries combined) was “TORNADO” with 22,178 casualties, and the most costly to the economy (property and crop damage combined) was “FLOOD” with about $148,920 million.
The following analysis investigates which types of severe weather events are most harmful to population health and which have the greatest economic consequences.
Information on the data: documentation.
Download the raw data file and extract the data. The data source is in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
library("data.table")
# path <- getwd()
# downloading data
url_data <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file_data <- "StormData.csv.bz2"
if (!file.exists(file_data)) {
download.file(url_data, file_data, mode = "wb")
}
Reading data
# reading data
storm_data <- read.csv(file = file_data, header=TRUE, sep=",")
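No separate decompression step is needed above because R file connections detect bzip2 compression when reading. A minimal sketch of that behaviour on a throwaway file (the temporary file and the toy columns are illustrative assumptions, not part of the assignment data):

```r
# Write a tiny CSV through a bzip2 connection, then read it back directly.
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

con <- bzfile(tmp, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the compression.
toy_back <- read.csv(tmp, header = TRUE, sep = ",")
unlink(tmp)
```

The same call pattern is what reads `StormData.csv.bz2` in this analysis; only the file is different.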
Dimensions of the data
# dimensions
dim(storm_data)
## [1] 902297 37
Summary of storm_data
# summary of storm_data
summary(storm_data)
## STATE__ BGN_DATE BGN_TIME
## Min. : 1.0 5/25/2011 0:00:00: 1202 12:00:00 AM: 10163
## 1st Qu.:19.0 4/27/2011 0:00:00: 1193 06:00:00 PM: 7350
## Median :30.0 6/9/2011 0:00:00 : 1030 04:00:00 PM: 7261
## Mean :31.2 5/30/2004 0:00:00: 1016 05:00:00 PM: 6891
## 3rd Qu.:45.0 4/4/2011 0:00:00 : 1009 12:00:00 PM: 6703
## Max. :95.0 4/2/2006 0:00:00 : 981 03:00:00 PM: 6700
## (Other) :895866 (Other) :857229
## TIME_ZONE COUNTY COUNTYNAME STATE
## CST :547493 Min. : 0.0 JEFFERSON : 7840 TX : 83728
## EST :245558 1st Qu.: 31.0 WASHINGTON: 7603 KS : 53440
## MST : 68390 Median : 75.0 JACKSON : 6660 OK : 46802
## PST : 28302 Mean :100.6 FRANKLIN : 6256 MO : 35648
## AST : 6360 3rd Qu.:131.0 LINCOLN : 5937 IA : 31069
## HST : 2563 Max. :873.0 MADISON : 5632 NE : 30271
## (Other): 3631 (Other) :862369 (Other):621339
## EVTYPE BGN_RANGE BGN_AZI
## HAIL :288661 Min. : 0.000 :547332
## TSTM WIND :219940 1st Qu.: 0.000 N : 86752
## THUNDERSTORM WIND: 82563 Median : 0.000 W : 38446
## TORNADO : 60652 Mean : 1.484 S : 37558
## FLASH FLOOD : 54277 3rd Qu.: 1.000 E : 33178
## FLOOD : 25326 Max. :3749.000 NW : 24041
## (Other) :170878 (Other):134990
## BGN_LOCATI END_DATE END_TIME
## :287743 :243411 :238978
## COUNTYWIDE : 19680 4/27/2011 0:00:00: 1214 06:00:00 PM: 9802
## Countywide : 993 5/25/2011 0:00:00: 1196 05:00:00 PM: 8314
## SPRINGFIELD : 843 6/9/2011 0:00:00 : 1021 04:00:00 PM: 8104
## SOUTH PORTION: 810 4/4/2011 0:00:00 : 1007 12:00:00 PM: 7483
## NORTH PORTION: 784 5/30/2004 0:00:00: 998 11:59:00 PM: 7184
## (Other) :591444 (Other) :653450 (Other) :622432
## COUNTY_END COUNTYENDN END_RANGE END_AZI
## Min. :0 Mode:logical Min. : 0.0000 :724837
## 1st Qu.:0 NA's:902297 1st Qu.: 0.0000 N : 28082
## Median :0 Median : 0.0000 S : 22510
## Mean :0 Mean : 0.9862 W : 20119
## 3rd Qu.:0 3rd Qu.: 0.0000 E : 20047
## Max. :0 Max. :925.0000 NE : 14606
## (Other): 72096
## END_LOCATI LENGTH WIDTH
## :499225 Min. : 0.0000 Min. : 0.000
## COUNTYWIDE : 19731 1st Qu.: 0.0000 1st Qu.: 0.000
## SOUTH PORTION : 833 Median : 0.0000 Median : 0.000
## NORTH PORTION : 780 Mean : 0.2301 Mean : 7.503
## CENTRAL PORTION: 617 3rd Qu.: 0.0000 3rd Qu.: 0.000
## SPRINGFIELD : 575 Max. :2315.0000 Max. :4400.000
## (Other) :380536
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
## WFO STATEOFFIC
## :142069 :248769
## OUN : 17393 TEXAS, North : 12193
## JAN : 13889 ARKANSAS, Central and North Central: 11738
## LWX : 13174 IOWA, Central : 11345
## PHI : 12551 KANSAS, Southwest : 11212
## TSA : 12483 GEORGIA, North and Central : 11120
## (Other):690738 (Other) :595920
## ZONENAMES
## :594029
## :205988
## GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639
## GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592
## JEFFERSON - JEFFERSON : 303
## MADISON - MADISON : 302
## (Other) :100444
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## Min. : 0 Min. :-14451 Min. : 0 Min. :-14455
## 1st Qu.:2802 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0
## Median :3540 Median : 8707 Median : 0 Median : 0
## Mean :2875 Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.:4019 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. :9706 Max. : 17124 Max. :9706 Max. :106220
## NA's :47 NA's :40
## REMARKS REFNUM
## :287433 Min. : 1
## : 24013 1st Qu.:225575
## Trees down.\n : 1110 Median :451149
## Several trees were blown down.\n : 568 Mean :451149
## Trees were downed.\n : 446 3rd Qu.:676723
## Large trees and power lines were blown down.\n: 432 Max. :902297
## (Other) :588295
# examining column names
colnames(storm_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and all event types have been recorded only since January 1996. Since our objective is to compare the effects of different weather events, we include only events that began in January 1996 or later.
# subset by event start date
main_data <- storm_data
main_data$BGN_DATE <- strptime(storm_data$BGN_DATE, "%m/%d/%Y %H:%M:%S")
main_data <- subset(main_data, BGN_DATE > "1995-12-31")
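The date cut-off can be checked on a few toy rows. The sketch below is an alternative formulation, not the code used above: it swaps `strptime` for `as.Date`, which yields a plain `Date` column and makes the comparison explicit. The toy rows are made up for illustration.

```r
# Toy dates in the same "%m/%d/%Y %H:%M:%S" layout as BGN_DATE.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "12/31/1995 0:00:00",
               "1/1/1996 0:00:00", "5/25/2011 0:00:00"),
  EVTYPE = c("TORNADO", "HAIL", "WINTER STORM", "FLOOD"),
  stringsAsFactors = FALSE
)

# as.Date() keeps only the calendar day; the trailing time text is ignored.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep events that began on or after 1 January 1996.
toy_recent <- subset(toy, BGN_DATE >= as.Date("1996-01-01"))
```

Only the two rows dated 1996 or later remain in `toy_recent`.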
Based on the above-mentioned documentation and preliminary exploration of the raw data with ‘str’, ‘names’, ‘table’, ‘dim’, ‘head’, ‘range’ and other similar functions, we can conclude that there are 7 variables we are interested in regarding the two questions.
Namely: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.
Therefore, we can limit our data to these variables.
# select variables
main_data <- subset(main_data, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
EVTYPE - type of event
FATALITIES - number of fatalities
INJURIES - number of injuries
PROPDMG - the size of property damage
PROPDMGEXP - the exponent values for ‘PROPDMG’ (property damage)
CROPDMG - the size of crop damage
CROPDMGEXP - the exponent values for ‘CROPDMG’ (crop damage)
# cleaning event types names
main_data$EVTYPE <- toupper(main_data$EVTYPE)
# eliminating zero data
main_data <- main_data[main_data$FATALITIES != 0 |
main_data$INJURIES != 0 |
main_data$PROPDMG != 0|
main_data$CROPDMG != 0, ]
# unique event types
unique(main_data$EVTYPE)
## [1] "WINTER STORM" "TORNADO"
## [3] "TSTM WIND" "HIGH WIND"
## [5] "FLASH FLOOD" "FREEZING RAIN"
## [7] "EXTREME COLD" "LIGHTNING"
## [9] "HAIL" "FLOOD"
## [11] "TSTM WIND/HAIL" "EXCESSIVE HEAT"
## [13] "RIP CURRENTS" "OTHER"
## [15] "HEAVY SNOW" "WILD/FOREST FIRE"
## [17] "ICE STORM" "BLIZZARD"
## [19] "STORM SURGE" "ICE JAM FLOOD (MINOR"
## [21] "DUST STORM" "STRONG WIND"
## [23] "DUST DEVIL" "URBAN/SML STREAM FLD"
## [25] "FOG" "ROUGH SURF"
## [27] "HEAVY SURF" "HEAVY RAIN"
## [29] "MARINE ACCIDENT" "AVALANCHE"
## [31] "FREEZE" "DRY MICROBURST"
## [33] "WINDS" "COASTAL STORM"
## [35] "EROSION/CSTL FLOOD" "RIVER FLOODING"
## [37] "WATERSPOUT" "DAMAGING FREEZE"
## [39] "HURRICANE" "TROPICAL STORM"
## [41] "BEACH EROSION" "HIGH SURF"
## [43] "HEAVY RAIN/HIGH SURF" "UNSEASONABLE COLD"
## [45] "EARLY FROST" "WINTRY MIX"
## [47] "DROUGHT" "COASTAL FLOODING"
## [49] "TORRENTIAL RAINFALL" "LANDSLUMP"
## [51] "HURRICANE EDOUARD" "TIDAL FLOODING"
## [53] "STRONG WINDS" "EXTREME WINDCHILL"
## [55] "GLAZE" "EXTENDED COLD"
## [57] "WHIRLWIND" "HEAVY SNOW SHOWER"
## [59] "LIGHT SNOW" "COASTAL FLOOD"
## [61] "MIXED PRECIP" "COLD"
## [63] "FREEZING SPRAY" "DOWNBURST"
## [65] "MUDSLIDES" "MICROBURST"
## [67] "MUDSLIDE" "SNOW"
## [69] "SNOW SQUALLS" "WIND DAMAGE"
## [71] "LIGHT SNOWFALL" "FREEZING DRIZZLE"
## [73] "GUSTY WIND/RAIN" "GUSTY WIND/HVY RAIN"
## [75] "WIND" "COLD TEMPERATURE"
## [77] "HEAT WAVE" "COLD AND SNOW"
## [79] "RAIN/SNOW" "TSTM WIND (G45)"
## [81] "GUSTY WINDS" "GUSTY WIND"
## [83] "TSTM WIND 40" "TSTM WIND 45"
## [85] "HARD FREEZE" "TSTM WIND (41)"
## [87] "HEAT" "RIVER FLOOD"
## [89] "TSTM WIND (G40)" "RIP CURRENT"
## [91] "MUD SLIDE" "FROST/FREEZE"
## [93] "SNOW AND ICE" "AGRICULTURAL FREEZE"
## [95] "WINTER WEATHER" "SNOW SQUALL"
## [97] "ICY ROADS" "THUNDERSTORM"
## [99] "HYPOTHERMIA/EXPOSURE" "LAKE EFFECT SNOW"
## [101] "MIXED PRECIPITATION" "BLACK ICE"
## [103] "COASTALSTORM" "DAM BREAK"
## [105] "BLOWING SNOW" "FROST"
## [107] "GRADIENT WIND" "UNSEASONABLY COLD"
## [109] "TSTM WIND AND LIGHTNING" "WET MICROBURST"
## [111] "HEAVY SURF AND WIND" "FUNNEL CLOUD"
## [113] "TYPHOON" "LANDSLIDES"
## [115] "HIGH SWELLS" "HIGH WINDS"
## [117] "SMALL HAIL" "UNSEASONAL RAIN"
## [119] "COASTAL FLOODING/EROSION" " TSTM WIND (G45)"
## [121] "TSTM WIND (G45)" "HIGH WIND (G40)"
## [123] "TSTM WIND (G35)" "COASTAL EROSION"
## [125] "UNSEASONABLY WARM" "SEICHE"
## [127] "COASTAL FLOODING/EROSION" "HYPERTHERMIA/EXPOSURE"
## [129] "ROCK SLIDE" "GUSTY WIND/HAIL"
## [131] "HEAVY SEAS" " TSTM WIND"
## [133] "LANDSPOUT" "RECORD HEAT"
## [135] "EXCESSIVE SNOW" "FLOOD/FLASH/FLOOD"
## [137] "WIND AND WAVE" "FLASH FLOOD/FLOOD"
## [139] "LIGHT FREEZING RAIN" "ICE ROADS"
## [141] "HIGH SEAS" "RAIN"
## [143] "ROUGH SEAS" "TSTM WIND G45"
## [145] "NON-SEVERE WIND DAMAGE" "WARM WEATHER"
## [147] "THUNDERSTORM WIND (G40)" "LANDSLIDE"
## [149] "HIGH WATER" " FLASH FLOOD"
## [151] "LATE SEASON SNOW" "WINTER WEATHER MIX"
## [153] "ROGUE WAVE" "FALLING SNOW/ICE"
## [155] "NON-TSTM WIND" "NON TSTM WIND"
## [157] "BRUSH FIRE" "BLOWING DUST"
## [159] "VOLCANIC ASH" " HIGH SURF ADVISORY"
## [161] "HAZARDOUS SURF" "WILDFIRE"
## [163] "COLD WEATHER" "ICE ON ROAD"
## [165] "DROWNING" "EXTREME COLD/WIND CHILL"
## [167] "MARINE TSTM WIND" "HURRICANE/TYPHOON"
## [169] "DENSE FOG" "WINTER WEATHER/MIX"
## [171] "ASTRONOMICAL HIGH TIDE" "HEAVY SURF/HIGH SURF"
## [173] "TROPICAL DEPRESSION" "LAKE-EFFECT SNOW"
## [175] "MARINE HIGH WIND" "THUNDERSTORM WIND"
## [177] "TSUNAMI" "STORM SURGE/TIDE"
## [179] "COLD/WIND CHILL" "LAKESHORE FLOOD"
## [181] "MARINE THUNDERSTORM WIND" "MARINE STRONG WIND"
## [183] "ASTRONOMICAL LOW TIDE" "DENSE SMOKE"
## [185] "MARINE HAIL" "FREEZING FOG"
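Several labels in the list above differ only by leading whitespace (for example " TSTM WIND" and "TSTM WIND") and are therefore aggregated as separate event types. One possible extra cleaning step, which is not part of the original analysis and would slightly change the per-type totals, is to trim whitespace after upper-casing. A sketch on toy labels:

```r
# Toy labels mimicking variants that appear in the list above.
evtype <- c(" TSTM WIND", "TSTM WIND", "tstm wind",
            " FLASH FLOOD", "FLASH FLOOD")

# Upper-case, then strip leading and trailing whitespace.
evtype_clean <- trimws(toupper(evtype))

# Distinct labels left after cleaning.
evtype_unique <- unique(evtype_clean)
```

Merging synonyms such as "TSTM WIND" and "THUNDERSTORM WIND" would need an explicit mapping and is left out of this sketch.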
We aggregate the numbers of fatalities and injuries in order to identify the top 10 event types contributing to the total human loss:
# total people loss
health_data <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = main_data, FUN=sum)
health_data$PEOPLE_LOSS <- health_data$FATALITIES + health_data$INJURIES
health_data <- health_data[order(health_data$PEOPLE_LOSS, decreasing = TRUE), ]
Top10_events_people <- health_data[1:10,]
knitr::kable(Top10_events_people, format = "markdown")
| | EVTYPE | FATALITIES | INJURIES | PEOPLE_LOSS |
|---|---|---|---|---|
| 149 | TORNADO | 1511 | 20667 | 22178 |
| 39 | EXCESSIVE HEAT | 1797 | 6391 | 8188 |
| 48 | FLOOD | 414 | 6758 | 7172 |
| 107 | LIGHTNING | 651 | 4141 | 4792 |
| 153 | TSTM WIND | 241 | 3629 | 3870 |
| 46 | FLASH FLOOD | 887 | 1674 | 2561 |
| 146 | THUNDERSTORM WIND | 130 | 1400 | 1530 |
| 182 | WINTER STORM | 191 | 1292 | 1483 |
| 69 | HEAT | 237 | 1222 | 1459 |
| 88 | HURRICANE/TYPHOON | 64 | 1275 | 1339 |
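The aggregate, sum and sort pattern behind this table can be followed on a small made-up data frame. The values below are illustrative only.

```r
# Toy event records (values are invented for illustration).
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 0, 5, 0),
  INJURIES = c(10, 3, 4, 2, 1)
)

# Sum both columns within each event type.
toy_health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)

# Combine into one casualty count and sort from largest to smallest.
toy_health$PEOPLE_LOSS <- toy_health$FATALITIES + toy_health$INJURIES
toy_health <- toy_health[order(toy_health$PEOPLE_LOSS, decreasing = TRUE), ]

# head() returns at most n rows and does not pad with NA rows
# when fewer than n event types exist.
toy_top <- head(toy_health, 2)
```

The unnamed first column in the table above is the row index that `aggregate` assigned before sorting, which is why it is not in ascending order.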
The number or letter in the exponent columns (PROPDMGEXP and CROPDMGEXP) represents a power of ten (10^exponent). The total damage is therefore PROPDMG (or CROPDMG) multiplied by 10 raised to the corresponding exponent; for example, a PROPDMG of 25 with exponent “K” (3) is 25 * 10^3 = 25,000 dollars.
We transform letters and symbols to numbers:
# transform letters to numbers
main_data$PROPDMGEXP <- gsub("[Hh]", "2", main_data$PROPDMGEXP)
main_data$PROPDMGEXP <- gsub("[Kk]", "3", main_data$PROPDMGEXP)
main_data$PROPDMGEXP <- gsub("[Mm]", "6", main_data$PROPDMGEXP)
main_data$PROPDMGEXP <- gsub("[Bb]", "9", main_data$PROPDMGEXP)
main_data$PROPDMGEXP <- gsub("\\+", "1", main_data$PROPDMGEXP)
main_data$PROPDMGEXP <- gsub("\\?|\\-|\\ ", "0", main_data$PROPDMGEXP)
main_data$PROPDMGEXP <- as.numeric(main_data$PROPDMGEXP)
main_data$CROPDMGEXP <- gsub("[Hh]", "2", main_data$CROPDMGEXP)
main_data$CROPDMGEXP <- gsub("[Kk]", "3", main_data$CROPDMGEXP)
main_data$CROPDMGEXP <- gsub("[Mm]", "6", main_data$CROPDMGEXP)
main_data$CROPDMGEXP <- gsub("[Bb]", "9", main_data$CROPDMGEXP)
main_data$CROPDMGEXP <- gsub("\\+", "1", main_data$CROPDMGEXP)
main_data$CROPDMGEXP <- gsub("\\-|\\?|\\ ", "0", main_data$CROPDMGEXP)
main_data$CROPDMGEXP <- as.numeric(main_data$CROPDMGEXP)
main_data$PROPDMGEXP[is.na(main_data$PROPDMGEXP)] <- 0
main_data$CROPDMGEXP[is.na(main_data$CROPDMGEXP)] <- 0
#creating total damage values
library(dplyr)
main_data <- mutate(main_data,
PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP),
CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))
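The letter-to-exponent conversion can also be written as a named lookup vector. The sketch below is a simplified alternative on toy values, not the code used above: it covers only the letter codes H, K, M and B, and treats every other code (blank, symbols, digits) as 10^0, whereas the `gsub` version keeps digit codes as exponents.

```r
# Toy damage records with letter and blank exponent codes.
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Named lookup: exponent letter -> power of ten.
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Unmatched codes (including the empty string) come back as NA.
power <- unname(exp_lookup[toupper(toy$PROPDMGEXP)])
power[is.na(power)] <- 0  # simplification: treat them as 10^0

toy$PROPDMGTOTAL <- toy$PROPDMG * 10^power
```

Here 25 with "K" becomes 25,000 and 1.2 with "B" becomes 1.2 billion, while the blank code leaves 10 unchanged.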
#analyzing
economic_data <- aggregate(cbind(PROPDMGTOTAL, CROPDMGTOTAL) ~ EVTYPE, data = main_data, FUN=sum)
economic_data$ECONOMIC_LOSS <- economic_data$PROPDMGTOTAL + economic_data$CROPDMGTOTAL
economic_data <- economic_data[order(economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- economic_data[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")
| | EVTYPE | PROPDMGTOTAL | CROPDMGTOTAL | ECONOMIC_LOSS |
|---|---|---|---|---|
| 48 | FLOOD | 143944833550 | 4974778400 | 148919611950 |
| 88 | HURRICANE/TYPHOON | 69305840000 | 2607872800 | 71913712800 |
| 141 | STORM SURGE | 43193536000 | 5000 | 43193541000 |
| 149 | TORNADO | 24616945710 | 283425010 | 24900370720 |
| 66 | HAIL | 14595143420 | 2476029450 | 17071172870 |
| 46 | FLASH FLOOD | 15222203910 | 1334901700 | 16557105610 |
| 86 | HURRICANE | 11812819010 | 2741410000 | 14554229010 |
| 32 | DROUGHT | 1046101000 | 13367566000 | 14413667000 |
| 152 | TROPICAL STORM | 7642475550 | 677711000 | 8320186550 |
| 83 | HIGH WIND | 5247860360 | 633561300 | 5881421660 |
#plotting health loss
library(ggplot2)
g <- ggplot(data = Top10_events_people, aes(x = reorder(EVTYPE, PEOPLE_LOSS), y = PEOPLE_LOSS))
g <- g + geom_bar(stat = "identity", colour = "green", fill = "darkgreen")
g <- g + labs(title = "Total people loss in USA by weather events in 1996-2011")
g <- g + theme(plot.title = element_text(hjust = 0.5))
g <- g + labs(y = "Number of fatalities and injuries", x = "Event Type")
g <- g + coord_flip()
print(g)
#plotting economic loss
g <- ggplot(data = Top10_events_economy, aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS))
g <- g + geom_bar(stat = "identity", colour = "red", fill = "darkred")
g <- g + labs(title = "Total economic loss in USA by weather events in 1996-2011")
g <- g + theme(plot.title = element_text(hjust = 0.5))
g <- g + labs(y = "Size of property and crop loss", x = "Event Type")
g <- g + coord_flip()
print(g)