Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This analysis of the NOAA storm data examines the population health and economic consequences of weather events. It shows that TORNADO was the leading cause of INJURIES, while EXCESSIVE HEAT was the leading cause of FATALITIES. It also shows that DROUGHT was the leading cause of CROP damage, while FLASH FLOOD was the leading cause of PROPERTY damage.
Load the required packages
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.22.0 (2018-04-21) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
## R.utils v2.6.0 (2017-11-04) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Download the bzip2-compressed data file if it doesn't already exist, then decompress it if the CSV is not yet present.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file_name <- "StormData.csv.bz2"
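if (!file.exists(file_name)) {
  # mode = "wb" keeps the binary archive intact on Windows
  download.file(url, file_name, mode = "wb")
}
if (!file.exists("StormData.csv")) {
  bunzip2(file_name, overwrite = TRUE, remove = FALSE)
}
# The output below shows factor columns; R >= 4.0 needs this set explicitly
raw_storm_data <- read.csv("StormData.csv", stringsAsFactors = TRUE)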
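As a side note, `read.csv()` can usually read a bzip2-compressed file directly, because it opens the path through `file()`, which detects the compression when reading. The sketch below illustrates this on a small made-up data frame (`toy` is hypothetical, not part of the storm data), so the explicit `bunzip2()` step can be treated as optional under that assumption.

```r
# Hypothetical toy data standing in for the real storm file.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), INJURIES = c(15, 0))

# Write it as a bzip2-compressed CSV in a temporary location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path via file(), which detects the compression.
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy_back)
```

Reading the archive directly avoids keeping a second, much larger uncompressed copy on disk, at the cost of decompressing on every read.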
Run the required data exploration activities
dim(raw_storm_data)
## [1] 902297 37
head(raw_storm_data, n = 2)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14 100 3 0 0
## 2 NA 0 2 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
str(raw_storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
summary(raw_storm_data)
## STATE__ BGN_DATE BGN_TIME
## Min. : 1.0 5/25/2011 0:00:00: 1202 12:00:00 AM: 10163
## 1st Qu.:19.0 4/27/2011 0:00:00: 1193 06:00:00 PM: 7350
## Median :30.0 6/9/2011 0:00:00 : 1030 04:00:00 PM: 7261
## Mean :31.2 5/30/2004 0:00:00: 1016 05:00:00 PM: 6891
## 3rd Qu.:45.0 4/4/2011 0:00:00 : 1009 12:00:00 PM: 6703
## Max. :95.0 4/2/2006 0:00:00 : 981 03:00:00 PM: 6700
## (Other) :895866 (Other) :857229
## TIME_ZONE COUNTY COUNTYNAME STATE
## CST :547493 Min. : 0.0 JEFFERSON : 7840 TX : 83728
## EST :245558 1st Qu.: 31.0 WASHINGTON: 7603 KS : 53440
## MST : 68390 Median : 75.0 JACKSON : 6660 OK : 46802
## PST : 28302 Mean :100.6 FRANKLIN : 6256 MO : 35648
## AST : 6360 3rd Qu.:131.0 LINCOLN : 5937 IA : 31069
## HST : 2563 Max. :873.0 MADISON : 5632 NE : 30271
## (Other): 3631 (Other) :862369 (Other):621339
## EVTYPE BGN_RANGE BGN_AZI
## HAIL :288661 Min. : 0.000 :547332
## TSTM WIND :219940 1st Qu.: 0.000 N : 86752
## THUNDERSTORM WIND: 82563 Median : 0.000 W : 38446
## TORNADO : 60652 Mean : 1.484 S : 37558
## FLASH FLOOD : 54277 3rd Qu.: 1.000 E : 33178
## FLOOD : 25326 Max. :3749.000 NW : 24041
## (Other) :170878 (Other):134990
## BGN_LOCATI END_DATE END_TIME
## :287743 :243411 :238978
## COUNTYWIDE : 19680 4/27/2011 0:00:00: 1214 06:00:00 PM: 9802
## Countywide : 993 5/25/2011 0:00:00: 1196 05:00:00 PM: 8314
## SPRINGFIELD : 843 6/9/2011 0:00:00 : 1021 04:00:00 PM: 8104
## SOUTH PORTION: 810 4/4/2011 0:00:00 : 1007 12:00:00 PM: 7483
## NORTH PORTION: 784 5/30/2004 0:00:00: 998 11:59:00 PM: 7184
## (Other) :591444 (Other) :653450 (Other) :622432
## COUNTY_END COUNTYENDN END_RANGE END_AZI
## Min. :0 Mode:logical Min. : 0.0000 :724837
## 1st Qu.:0 NA's:902297 1st Qu.: 0.0000 N : 28082
## Median :0 Median : 0.0000 S : 22510
## Mean :0 Mean : 0.9862 W : 20119
## 3rd Qu.:0 3rd Qu.: 0.0000 E : 20047
## Max. :0 Max. :925.0000 NE : 14606
## (Other): 72096
## END_LOCATI LENGTH WIDTH
## :499225 Min. : 0.0000 Min. : 0.000
## COUNTYWIDE : 19731 1st Qu.: 0.0000 1st Qu.: 0.000
## SOUTH PORTION : 833 Median : 0.0000 Median : 0.000
## NORTH PORTION : 780 Mean : 0.2301 Mean : 7.503
## CENTRAL PORTION: 617 3rd Qu.: 0.0000 3rd Qu.: 0.000
## SPRINGFIELD : 575 Max. :2315.0000 Max. :4400.000
## (Other) :380536
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
## WFO STATEOFFIC
## :142069 :248769
## OUN : 17393 TEXAS, North : 12193
## JAN : 13889 ARKANSAS, Central and North Central: 11738
## LWX : 13174 IOWA, Central : 11345
## PHI : 12551 KANSAS, Southwest : 11212
## TSA : 12483 GEORGIA, North and Central : 11120
## (Other):690738 (Other) :595920
## ZONENAMES
## :594029
## :205988
## GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639
## GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592
## JEFFERSON - JEFFERSON : 303
## MADISON - MADISON : 302
## (Other) :100444
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## Min. : 0 Min. :-14451 Min. : 0 Min. :-14455
## 1st Qu.:2802 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0
## Median :3540 Median : 8707 Median : 0 Median : 0
## Mean :2875 Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.:4019 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. :9706 Max. : 17124 Max. :9706 Max. :106220
## NA's :47 NA's :40
## REMARKS REFNUM
## :287433 Min. : 1
## : 24013 1st Qu.:225575
## Trees down.\n : 1110 Median :451149
## Several trees were blown down.\n : 569 Mean :451149
## Trees were downed.\n : 446 3rd Qu.:676723
## Large trees and power lines were blown down.\n: 432 Max. :902297
## (Other) :588294
The raw storm data contains 902297 rows and 37 columns.
The following variables are sufficient for this analysis.
Health variables:
1 FATALITIES: approximate number of deaths
2 INJURIES: approximate number of injuries
Economic variables:
1 PROPDMG: approximate property damage
2 PROPDMGEXP: the units for the property damage value
3 CROPDMG: approximate crop damage
4 CROPDMGEXP: the units for the crop damage value
Event variables:
1 EVTYPE: weather event type (tornado, wind, snow, flood, etc.)
2 BGN_DATE: date the event began, used to derive the year
required_vars <- c( "BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
tidy_storm_data <- raw_storm_data[, required_vars]
tidy_storm_data$EVTYPE <- toupper(tidy_storm_data$EVTYPE)
Add a new variable called YEAR that holds the year extracted from BGN_DATE.
tidy_storm_data$YEAR <- as.numeric(format(as.Date(tidy_storm_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
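To make the date handling concrete, here is a minimal sketch of the same conversion applied to three BGN_DATE strings that appear in the output above. The variable names `bgn_date`, `parsed`, and `year` are illustrative only.

```r
# Three date strings in the same month/day/year format as BGN_DATE.
bgn_date <- c("4/18/1950 0:00:00", "1/25/1990 0:00:00", "5/25/2011 0:00:00")

# Parse to Date (the time part is matched by the format but then dropped),
# and pull out the four-digit year as a number.
parsed <- as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S")
year <- as.numeric(format(parsed, "%Y"))
print(year)
```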
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Plot a histogram of the number of events recorded per year:
hist(tidy_storm_data$YEAR, col = "blue", breaks = 30, main="Number of Weather EVENTs per Year", xlab="Year", ylab="Frequency")
The histogram shows that the number of recorded events starts to increase significantly around 1990. We therefore restrict the analysis to 1990 through 2011, where the records are more complete.
tidy_storm_data <- tidy_storm_data[tidy_storm_data$YEAR >= 1990, ]
dim(tidy_storm_data)
## [1] 751740 9
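The same idea can be checked numerically rather than visually. The following sketch tabulates events per year and applies the 1990 cutoff on a small made-up vector (`year` here is toy data, not the real YEAR column).

```r
# Toy vector of event years (hypothetical values).
year <- c(1950, 1989, 1990, 1990, 2011)

# Count events per year, then keep only 1990 and later.
events_per_year <- table(year)
recent <- year[year >= 1990]

print(events_per_year)
print(length(recent))
```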
Check for missing (NA) values in the selected variables:
sum(is.na(tidy_storm_data$BGN_DATE))
## [1] 0
sum(is.na(tidy_storm_data$EVTYPE))
## [1] 0
sum(is.na(tidy_storm_data$INJURIES))
## [1] 0
sum(is.na(tidy_storm_data$PROPDMG))
## [1] 0
sum(is.na(tidy_storm_data$PROPDMGEXP))
## [1] 0
sum(is.na(tidy_storm_data$CROPDMG))
## [1] 0
sum(is.na(tidy_storm_data$CROPDMGEXP))
## [1] 0
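The per-column checks above can also be collapsed into one call. The sketch below uses a small hypothetical data frame (`toy`) to show `colSums(is.na(...))`, and also illustrates that an empty string, such as a blank exponent, is not counted as NA.

```r
# Hypothetical data frame with a few missing values.
toy <- data.frame(
  EVTYPE = c("HAIL", NA, "TORNADO"),
  INJURIES = c(0, 2, NA),
  PROPDMGEXP = c("", "K", "M"),
  stringsAsFactors = FALSE
)

# One NA count per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Blank strings are not NA, so they need a separate check.
blank_exp <- sum(toy$PROPDMGEXP == "")
print(blank_exp)
```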
The PROPDMG and CROPDMG variables each have an associated EXP variable that encodes the magnitude of the recorded value. The two must be combined to obtain usable damage figures. The exponent levels we need to handle are:
levels(tidy_storm_data$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(tidy_storm_data$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
We can write a function that calculates the actual cost by applying the correct exponent. The conversion is based on the information found in "How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP".
# Scale a damage value x by the multiplier implied by its exponent code
exp_calc <- function(x, exponent) {
  if (exponent %in% c("h", "H"))
    return(x * 10^2)
  else if (exponent %in% c("k", "K"))
    return(x * 10^3)
  else if (exponent %in% c("m", "M"))
    return(x * 10^6)
  else if (exponent %in% c("b", "B"))
    return(x * 10^9)
  else if (!is.na(as.numeric(exponent)))
    return(x * 10^as.numeric(exponent))
  else if (exponent %in% c("+"))
    return(x * 10)
  else if (exponent %in% c("", "-", "?"))
    return(x) # Actually x * 10^0 = x * 1 = x
  else {
    stop("It is not a valid value")
  }
}
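For illustration, the same mapping can be sketched as a vectorized lookup that handles a whole column at once. This is an assumption-laden alternative, not the function used in this analysis: `exp_multiplier` is a hypothetical name, and the sketch converts the codes to character first because `as.numeric()` applied to a factor returns the level codes rather than the labels.

```r
# Hypothetical vectorized helper mirroring the rules in exp_calc.
# Assumes exponent codes may arrive as character or factor.
exp_multiplier <- function(exponent) {
  exponent <- toupper(as.character(exponent))
  mult <- rep(NA_real_, length(exponent))
  mult[exponent %in% c("", "-", "?")] <- 1
  mult[exponent == "+"] <- 10
  mult[exponent == "H"] <- 1e2
  mult[exponent == "K"] <- 1e3
  mult[exponent == "M"] <- 1e6
  mult[exponent == "B"] <- 1e9
  # Single digits are treated as powers of ten, as in exp_calc.
  is_digit <- grepl("^[0-9]$", exponent)
  mult[is_digit] <- 10^as.numeric(exponent[is_digit])
  if (anyNA(mult)) stop("It is not a valid value")
  mult
}

# Toy values: 25 K, 2.5 M, 1 with a blank code, 3 with code "2".
damage <- c(25, 2.5, 1, 3)
exponent <- c("K", "M", "", "2")
damage_calc <- damage * exp_multiplier(exponent)
print(damage_calc)
```

A lookup of this kind avoids calling a function once per row, which matters on a table with several hundred thousand rows.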
Now we can use the mapply function along with the exp_calc function to calculate the actual damage costs for crops and properties and store the results in two new variables CROPDMG_calc and PROPDMG_calc respectively:
tidy_storm_data$PROPDMG_calc <- mapply(exp_calc, tidy_storm_data$PROPDMG, tidy_storm_data$PROPDMGEXP)
tidy_storm_data$CROPDMG_calc <- mapply(exp_calc, tidy_storm_data$CROPDMG, tidy_storm_data$CROPDMGEXP)
dim(tidy_storm_data)
## [1] 751740 11
head(tidy_storm_data)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 4408 1/5/1990 0:00:00 HAIL 0 0 0.0
## 4409 1/20/1990 0:00:00 TSTM WIND 0 0 0.0
## 4410 1/20/1990 0:00:00 TSTM WIND 0 0 0.0
## 4411 1/25/1990 0:00:00 TSTM WIND 0 0 0.0
## 4412 1/25/1990 0:00:00 TORNADO 0 28 2.5 M
## 4413 1/25/1990 0:00:00 TSTM WIND 0 0 0.0
## CROPDMG CROPDMGEXP YEAR PROPDMG_calc CROPDMG_calc
## 4408 0 1990 0 0
## 4409 0 1990 0 0
## 4410 0 1990 0 0
## 4411 0 1990 0 0
## 4412 0 1990 2500000 0
## 4413 0 1990 0 0
Now the tidy Storm Data contains 751740 rows and 11 columns.
Now we can store the final required summarised data in a tibble.
tidy_storm_data_summary <- tidy_storm_data %>%
  group_by(EVTYPE) %>%
  summarise(
    total_fatalities = sum(FATALITIES),
    total_injuries = sum(INJURIES),
    total_properties_damage = sum(PROPDMG_calc),
    total_crop_damage = sum(CROPDMG_calc)
  )
dim(tidy_storm_data_summary)
## [1] 898 5
head(tidy_storm_data_summary)
## # A tibble: 6 x 5
## EVTYPE total_fatalities total_injuries total_propertie~ total_crop_dama~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 " H~ 0. 0. 200000. 0.
## 2 " COA~ 0. 0. 0. 0.
## 3 " FLA~ 0. 0. 50000. 0.
## 4 " LIG~ 0. 0. 0. 0.
## 5 " TST~ 0. 0. 8100000. 0.
## 6 " TST~ 0. 0. 8000. 0.
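The first rows of the summary show event types that begin with spaces, which means the same event can be split across several labels. One possible refinement, sketched here on made-up data with base R only, is to trim whitespace before grouping. The data frame `toy` and its values are hypothetical, and this step was not part of the analysis above.

```r
# Hypothetical events where the same type appears with and without
# leading spaces.
toy <- data.frame(
  EVTYPE = c("   TSTM WIND", "TSTM WIND", " FLASH FLOOD", "FLASH FLOOD"),
  FATALITIES = c(1, 2, 0, 3),
  stringsAsFactors = FALSE
)

# Normalise case and strip surrounding whitespace before aggregating.
toy$EVTYPE <- trimws(toupper(toy$EVTYPE))
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```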
Which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Fetch the top 5 causes of fatalities and store the results in a new tibble:
fatalities <- tidy_storm_data_summary %>%
  arrange(desc(total_fatalities)) %>%
  select(EVTYPE, total_fatalities) %>%
  slice(1:5)
Fetch the top 5 causes of injuries and store the results in a new tibble:
injuries <- tidy_storm_data_summary %>%
  arrange(desc(total_injuries)) %>%
  select(EVTYPE, total_injuries) %>%
  slice(1:5)
Plotting the data for visualization:
# First plot
plot1 <- ggplot(data = injuries, aes(x = reorder(EVTYPE, total_injuries), y = total_injuries)) +
  geom_bar(fill = "blue", stat = "identity") +
  coord_flip() +
  ylab("Total number of injuries") +
  xlab("Event type") +
  ggtitle("Fatalities and injuries caused by weather in the US - Top 5") +
  theme(legend.position = "none")
# Second plot
plot2 <- ggplot(data = fatalities, aes(x = reorder(EVTYPE, total_fatalities), y = total_fatalities)) + geom_bar(fill = "red", stat = "identity") + coord_flip() + ylab("Total number of fatalities") + xlab("Event type") + theme(legend.position = "none")
# Combine both plots in a grid
grid.arrange(plot1, plot2, nrow = 2)
The plot above shows that EXCESSIVE HEAT and TORNADO caused the most FATALITIES, while TORNADO caused the most INJURIES in the US from 1990 to 2011.
Fetch the top 5 causes of crop damage and store the results in a new tibble:
crop_damage <- tidy_storm_data_summary %>%
  arrange(desc(total_crop_damage)) %>%
  select(EVTYPE, total_crop_damage) %>%
  slice(1:5)
Fetch the top 5 causes of property damage and store the results in a new tibble:
prop_damage <- tidy_storm_data_summary %>%
  arrange(desc(total_properties_damage)) %>%
  select(EVTYPE, total_properties_damage) %>%
  slice(1:5)
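The four top-5 tables follow the same pattern: sort by one total, keep the event type and that total, and take the first rows. As a hedged illustration, that pattern can be captured in a small helper. `top_n_by` and `toy_summary` are hypothetical names, and the sketch uses base R so that it runs without dplyr.

```r
# Hypothetical helper: the n event types with the largest value in `column`.
top_n_by <- function(data, column, n = 5) {
  ordered <- data[order(data[[column]], decreasing = TRUE), c("EVTYPE", column)]
  head(ordered, n)
}

# Toy summary table with made-up totals.
toy_summary <- data.frame(
  EVTYPE = c("A", "B", "C", "D", "E", "F"),
  total_fatalities = c(10, 50, 30, 5, 40, 20),
  stringsAsFactors = FALSE
)

top3 <- top_n_by(toy_summary, "total_fatalities", n = 3)
print(top3)
```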
Plot the data for visualization:
# First plot
plot3 <- ggplot(data = crop_damage, aes(x = reorder(EVTYPE, total_crop_damage), y = total_crop_damage)) +
  geom_bar(fill = "blue", stat = "identity") +
  coord_flip() +
  ylab("Economic consequence of crop damage in USD") +
  xlab("Event type") +
  ggtitle("Crop and property damage caused by weather in the US - Top 5") +
  theme(legend.position = "none")
# Second plot
plot4 <- ggplot(data = prop_damage, aes(x = reorder(EVTYPE, total_properties_damage), y = total_properties_damage)) + geom_bar(fill = "red", stat = "identity") + coord_flip() + ylab("Economic consequence of property damage in USD") + xlab("Event type") + theme(legend.position = "none")
# Combine both plots in a grid
grid.arrange(plot3, plot4, nrow = 2)
The plot above shows that DROUGHT and FLOOD caused the most CROP damage, while FLASH FLOOD and THUNDERSTORM WINDS caused the most PROPERTY damage in the US from 1990 to 2011.
From the analysis, we can say that EXCESSIVE HEAT and TORNADO were most harmful with respect to population health, whereas DROUGHT and FLOOD (crop damage) and FLASH FLOOD and THUNDERSTORM WINDS (property damage) had the greatest economic consequences.