In this report we aim to find the weahter events that have been most harmful to public health and had greatest economic consequences in United States between the years 1950 and 2011. To find these events, we obtained National Oceanic and Atmospheric Administration’s (NOAA) storm database from Coursera web page of Reproducible Research course. From these data, we found that Tornado has by far done the most harm to public health in United States, with more than 58% of total population’s health damage. We also found that Flood and Hurrican/Typhoon had the greatest economic consequences in United States, with more than 33% and 19% of economic damage corresponding to property and crop damages.
This project involves exploring National Oceanic and Atmospheric Administration’s (NOAA) storm database from Coursera web page of Reproducible Research course. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
As part of data processing, we will load and clean this data set in the following sections.
First of all, we check to see if there is a folder named “data” in working directory. If this folder doesn’t exist, we create it. Next we check to see if NOAA Storm Database named “repdatadataStormData.csv.bz2” exist in “data” directory and download it otherwise.
#check and create data folder
if(!dir.exists("./data")){dir.create("./data")}
#check to see if "repdatadataStormData.csv.bz2" exist, and if not, dowload it
if(!file.exists("./data//repdatadataStormData.csv.bz2")){
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile = "./data/repdatadataStormData.csv.bz2", method = "curl")
downloadTime <- date()
}
When we are sure that “repdatadataStormData.csv.bz2” is in “data” folder, we read it simply using read.csv and store it in “stormDataFull” data frame.
#reading Storm Data
stormDataFull <- read.csv("./data/repdatadataStormData.csv.bz2")
After reading data, we will look at some summeries of it. We start by looking at its dimension.
dim(stormDataFull)
## [1] 902297 37
We can see that it has 902297 rows and 37 columns. Now we look at its column names.
names(stormDataFull)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
For this data analysis, we want to answer two specific questions which concerncs about events that affect public health and economy of U.S. So we can select only the columns that relate to these questions (with REFNUM as id) and discard other columns. We will use select function from “dplyr” package for this purpose.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
stormData <- select(stormDataFull, EVTYPE, FATALITIES:CROPDMGEXP, REFNUM)
We can look at first few rows to get a feeling of this data.
head(stormData)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REFNUM
## 1 TORNADO 0 15 25.0 K 0 1
## 2 TORNADO 0 0 2.5 K 0 2
## 3 TORNADO 0 2 25.0 K 0 3
## 4 TORNADO 0 2 2.5 K 0 4
## 5 TORNADO 0 2 2.5 K 0 5
## 6 TORNADO 0 6 2.5 K 0 6
As we can see, each row correspond to an event and has an event type (EVTYPE) that caused fatalities (FATALITIES), injuries (INJURIES), property damage (PROPDMG times 10 to the power of PROPDMGEXP) in dollars and crop damage (CROPDMG times 10 to the power of CROPDMGEXP) in dollars.
Lets look at summary and structure of this data set.
summary(stormData)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
str(stormData)
## 'data.frame': 902297 obs. of 8 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Now we look deeper into PROPDMGEXP and CROPDMGEXP.
table(stormData$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(stormData$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
We want these exponential multipliers to be numbers (and in numeric type). As “K” is the most common value for both of these variables, we will impute “K” for values that don’t have proper values. It is important to note that there is a probability that missing values for these multipiers were 0, which would result in 1$ multiplications, but we assume that they are equal to most common value which is “K” (x1000$).
#changing the type of PROPDMGEXP and CROPDMGEXP to character
stormData$PROPDMGEXP <- as.character(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- as.character(stormData$CROPDMGEXP)
#changing unknown exponential multiplires to the common one, which is "K"
stormData[stormData$PROPDMGEXP %in% c("", "-", "?", "+"), "PROPDMGEXP"] <- "K"
stormData[stormData$CROPDMGEXP %in% c("", "?"), "CROPDMGEXP"] <- "K"
#changing abbreviations of exponential multipliers to numbers for PROPDMGEXP
stormData[stormData$PROPDMGEXP %in% c("h", "H"), "PROPDMGEXP"] <- "2"
stormData[stormData$PROPDMGEXP == "K", "PROPDMGEXP"] <- "3"
stormData[stormData$PROPDMGEXP %in% c("m", "M"), "PROPDMGEXP"] <- "6"
stormData[stormData$PROPDMGEXP == "B", "PROPDMGEXP"] <- "9"
#changing abbreviations of exponential multipliers to numbers for CROPDMGEXP
stormData[stormData$CROPDMGEXP %in% c("k", "K"), "CROPDMGEXP"] <- "3"
stormData[stormData$CROPDMGEXP %in% c("m", "M"), "CROPDMGEXP"] <- "6"
stormData[stormData$CROPDMGEXP == "B", "CROPDMGEXP"] <- "9"
#changing the type of PROPDMGEXP and CROPDMGEXP to numeric
stormData$PROPDMGEXP <- as.numeric(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- as.numeric(stormData$CROPDMGEXP)
Now we want to clean EVTYPE, which according to documents from NOAA should have 48 unique values. Lets look at its number of unique values.
n_distinct(stormData$EVTYPE)
## [1] 985
It has 985 unique values, so we need to do a lot of cleaning to reduce these values to the original 48 values. A quick look at this variable show that there are lots of typos in it. for cleaning this variable, I first changed all of the values to upper case and removed the starting spaces from them.
stormData <- mutate(stormData, EVTYPE = toupper(EVTYPE))
stormData$EVTYPE <- sub("^ +", "", x = stormData$EVTYPE)
After this quick fix, I worked for several hours (about 6 hours) to reduce the number of values to near 48 proper values, specified by NOAA document. My strategy was to group the “stormData” data set by “EVTYPE”, then calculate number of rows for each group, arranged by both number of rows and EVTYPE name. I assumed that EVTYPE values that accured more, have higher probability that have typos in data set. So each time I looked at top of “EVTYPEnumber”, and then in the “EVTYPEname” searched for patterns that might be corresponded to those top values in “EVTYPEnumber”. Whenever I didn’t know what a value might be, I referred to document from NOAA and searched the values in it.
EVTYPEnumber <- group_by(stormData, EVTYPE) %>% summarize(n = n()) %>% arrange(desc(n))
EVTYPEname <- group_by(stormData, EVTYPE) %>% summarize(n = n()) %>% arrange(EVTYPE)
In every cycle of the above strategy, I found a pattern and used mostly gsub to correct the pattern for one value (out of total 48 values). For example, every EVTYPE value that started with “HAIL” and “SMALL HAIL” values corresponded to “HAIL” value. This long list is for correcting incorrect patterns:
#changing events for HAIL
stormData$EVTYPE <- gsub("^HAIL.+|SMALL HAIL", "HAIL", x = stormData$EVTYPE)
#changing events for THUNDERSTORM
stormData$EVTYPE <- gsub("^TSTM", "THUNDERSTORM", x = stormData$EVTYPE)
stormData$EVTYPE <- gsub("^THU.+|.*MICROBURST.*", "THUNDERSTORM WIND", x = stormData$EVTYPE)
stormData$EVTYPE <- gsub("TUNDERSTORM WIND", "THUNDERSTORM WIND", x = stormData$EVTYPE)
#changing events for TORNADO
stormData$EVTYPE <- gsub("^TORN.+", "TORNADO", x = stormData$EVTYPE)
#changing events for FLASH FLOOD
stormData$EVTYPE <- gsub("^FLASH.+", "FLASH FLOOD", x = stormData$EVTYPE)
#changing events for FLOOD
stormData$EVTYPE <- gsub("^FLOOD.+|^RIVER.+", "FLOOD", x = stormData$EVTYPE)
#changing events for HIGH WIND
stormData$EVTYPE <- gsub("^HIGH.+WIND.+", "HIGH WIND", x = stormData$EVTYPE)
#changing events for LIGHTNING
stormData$EVTYPE <- gsub("^LIGHTNING.+|LIGHTING", "LIGHTNING", x = stormData$EVTYPE)
#changing events for HEAVY SNOW
stormData$EVTYPE <- gsub("^HEAVY SNOW.+|HEAVY WET SNOW|HEAVY MIX|^SNOW.*|^EXCESSIVE SNOW.*", "HEAVY SNOW", x = stormData$EVTYPE)
#changing events for HEAVY RAIN
stormData$EVTYPE <- gsub("^HEAVY RAIN.+|^HEAVY PRECIP.+|^HEAVY SHOWER.*", "HEAVY RAIN", x = stormData$EVTYPE)
#changing events for WINTER STORM
stormData$EVTYPE <- gsub("^WINTER STORM.+|^FREEZING RAIN.*|^FREEZING DRIZZLE", "WINTER STORM", x = stormData$EVTYPE)
#changing events for WINTER WEATHER
stormData$EVTYPE <- gsub("^WINTER WEATHER.+|^WINT.+MIX|^LIGHT SNOW.*|^MODERATE SNOW.*", "WINTER WEATHER", x = stormData$EVTYPE)
#changing events for FUNNEL CLOUD
stormData$EVTYPE <- gsub(".*FUNNEL.*|.*CLOUD.*", "FUNNEL CLOUD", x = stormData$EVTYPE)
#changing TSTM in events to THUNDERSTORM
stormData$EVTYPE <- gsub("TSTM", "THUNDERSTORM", x = stormData$EVTYPE)
#changing events for WATERSPOUT
stormData$EVTYPE <- gsub("^WA.*TER.+", "WATERSPOUT", x = stormData$EVTYPE)
#changing events for STRONG WIND
stormData$EVTYPE <- gsub("^STRONG WIND.+|^WIND|^WIND [ADGS].+|^GUSTY WIND.*", "STRONG WIND", x = stormData$EVTYPE)
#changing events of URBAN... (small floods) to HEAVY RAIN
stormData$EVTYPE <- gsub("^URBAN.+|^SMALL STREAM.*|^STREET FLOOD.*", "HEAVY RAIN", x = stormData$EVTYPE)
#changing events for WILDFIRE
stormData$EVTYPE <- gsub("^WILD.+", "WILDFIRE", x = stormData$EVTYPE)
#changing events for BLIZZARD
stormData$EVTYPE <- gsub("^BLIZZARD.+", "BLIZZARD", x = stormData$EVTYPE)
#changing events for DROUGHT
stormData$EVTYPE <- gsub("^DROUGHT.+", "DROUGHT", x = stormData$EVTYPE)
#changing events for ICE STORM
stormData$EVTYPE <- gsub("^ICE.*", "ICE STORM", x = stormData$EVTYPE)
#changing events for EXCESSIVE HEAT
stormData$EVTYPE <- gsub("EXTREME HEAT|^EXCESSIVE HEAT.+", "EXCESSIVE HEAT", x = stormData$EVTYPE)
#changing events for HIGH SURF
stormData$EVTYPE <- gsub(".*SURF.*|^HIGH WA|^HIGH TIDES|^HIGH.+SWELLS", "HIGH SURF", x = stormData$EVTYPE)
#changing events for FROST/FREEZE
stormData$EVTYPE <- gsub(".*FROST.*|^FREEZE", "FROST/FREEZE", x = stormData$EVTYPE)
#changing events for DENSE FOG
stormData$EVTYPE <- gsub("^FOG.*", "DENSE FOG", x = stormData$EVTYPE)
#changing events for EXTREME COLD/WIND CHILL
stormData$EVTYPE <- gsub("^EXTREME WIND.+|^EXTREME.* COLD.*|EXCESSIVE COLD", "EXTREME COLD/WIND CHILL", x = stormData$EVTYPE)
#changing events for HEAT
stormData$EVTYPE <- gsub("^HEAT.+|^UNSEASONABLY WARM|^UNSEASONABLY HOT|^UNUSUAL.*WARM|^RECORD HEAT.*|^RECORD WARM.*", "HEAT", x = stormData$EVTYPE)
#changing events for TROPICAL STORM
stormData$EVTYPE <- gsub("^TROPICAL STORM.+", "TROPICAL STORM", x = stormData$EVTYPE)
#changing events for COASTAL FLOOD
stormData$EVTYPE <- gsub("^COASTAL.*FLOOD.*|^COASTAL.*STORM", "COASTAL FLOOD", x = stormData$EVTYPE)
#changing events for LAKE-EFFECT SNOW
stormData$EVTYPE <- gsub(".*LAKE.* SNOW.*", "LAKE-EFFECT SNOW", x = stormData$EVTYPE)
#changing events for LANDSLIDE (Debris Flow)
stormData$EVTYPE <- gsub("^LAND.+", "LANDSLIDE", x = stormData$EVTYPE)
#changing events for COLD/WIND CHILL
stormData$EVTYPE <- gsub("^COLD.*|^COOL.*|^WIND CHILL.*|^RECORD.*COLD.*|.*LOW.*TEMP.*", "COLD/WIND CHILL", x = stormData$EVTYPE)
#changing events for RIP CURRENT
stormData$EVTYPE <- gsub("RIP CURRENTS", "RIP CURRENT", x = stormData$EVTYPE)
#changing events for STORM SURGE/TIDE
stormData$EVTYPE <- gsub("^STORM SURGE.*", "STORM SURGE/TIDE", x = stormData$EVTYPE)
#changing events for ASTRONOMICAL LOW TIDE
stormData$EVTYPE <- gsub(".*ASTRONOMICAL.*", "ASTRONOMICAL LOW TIDE", x = stormData$EVTYPE)
#changing events for HURRICANE (Typhoon)
stormData$EVTYPE <- gsub("^HURRICANE.*|^TYPHOON", "HURRICANE/TYPHOON", x = stormData$EVTYPE)
#changing events for SLEET
stormData$EVTYPE <- gsub("^SLEET.*", "SLEET", x = stormData$EVTYPE)
#changing events for FREEZING FOG
stormData$EVTYPE <- gsub("^GLAZE.*", "FREEZING FOG", x = stormData$EVTYPE)
So, this way we reduced the number of unique EVTYPE values.
n_distinct(stormData$EVTYPE)
## [1] 347
We can see what proportion of rows are in the most accuring values. We check first 42 values:
#calculating percentage of cleaned data for EVTYPE
a <- group_by(stormData, EVTYPE) %>% summarize(n = n()) %>% arrange(desc(n))
sum(a[1:42, 2])/sum(a[, 2])
## [1] 0.9985404
So, the other 305 values have less than 0.15% of rows! This is enough for cleaning the data and we assume that other 305 values corresponding to 0.15% of rows wouldn’t affect our analysis.
A more solution oriented strategy to clean data could be to summerize data with respect to harm to population health and damage to economy for each EVTYPE, and then do the above mentioned steps; but with optimizing our summaries (instead of number of rows). We could also assume a threshold (like 95%) for our output summary and stop the EVTYPE data cleaning at that point.
While this strategy would be more efficient and capture only EVTYPE values that had great impact on our questions, that would be alot solution oriented and not a good data cleaning!
Our data analysis must address the following questions:
The first question is that: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
We can group our data by EVTYPE, then make summaries of fatalities and injuries to answer this question. I hypothesize that every fatality should have 3 times effect and every injury would have 1 time effect. It’s just my subjective hypothesis and by no means is an objective decision. You can make other decisions and with summarizing in other ways, continue with this analysis.
q1StormData <- group_by(stormData, EVTYPE) %>%
summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), population = sum(FATALITIES*3 + INJURIES)) %>%
arrange(desc(population))
We can look at events that are most harmful to population health (with our assumption about harmfulness criteria).
head(q1StormData, 10)
## Source: local data frame [10 x 4]
##
## EVTYPE FATALITIES INJURIES population
## 1 TORNADO 5658 91364 108338
## 2 EXCESSIVE HEAT 1999 6680 12677
## 3 THUNDERSTORM WIND 715 9537 11682
## 4 FLOOD 499 6809 8306
## 5 LIGHTNING 817 5232 7683
## 6 HEAT 1131 2561 5954
## 7 FLASH FLOOD 1018 1785 4839
## 8 ICE STORM 96 2115 2403
## 9 HIGH WIND 293 1471 2350
## 10 RIP CURRENT 572 529 2245
First 6 events do more that 83% of the harm to population health.
populationSum <- sum(q1StormData$population)
sum(q1StormData$population[1:6])*100/populationSum
## [1] 83.15633
As we want to find the types of events (as indicated in the EVTYPE variable) that are most harmful with respect to population health, we will plot the percentage of harm that each event has done to population health.
barplot(q1StormData$population[1:6]*100/populationSum, names.arg = q1StormData$EVTYPE[1:6], cex.names = 0.45, ylab = "Percentage of harm to population health", main = "Most harmful events for population health")
We can see that Tornado is by far the most harmful event for population health (with our criteria), with 58.26% of the harms, along with Exessive heat and Thundestorm wind with about 6.82% and 6.28% respectively.
The second question is that: Across the United States, which types of events have the greatest economic consequences?
We can groupd our data frame by EVTYPE, then calculate damage summaries for property and crop and total economic damage, then arrange our data frame by total economic damage. Here we assume that economic consequence of an event is sum of its damage to properties and crops and we store this value in “totalDMG” variable.
q2StormData <- group_by(stormData, EVTYPE) %>%
summarize(property = sum(PROPDMG*(10^PROPDMGEXP)), crop = sum(CROPDMG*(10^CROPDMGEXP)), totalDMG = sum(property + crop)) %>%
arrange(desc(totalDMG))
We can look at events that have implied the most damage to economy, with their damages in dollars.
head(q2StormData, 10)
## Source: local data frame [10 x 4]
##
## EVTYPE property crop totalDMG
## 1 FLOOD 150194585307 10936191950 161130777257
## 2 HURRICANE/TYPHOON 85356410010 5516117800 90872527810
## 3 TORNADO 58552215314 417461470 58969676784
## 4 STORM SURGE/TIDE 47964724000 855000 47965579000
## 5 HAIL 15977596956 3046890620 19024487576
## 6 FLASH FLOOD 17414948282 1437163150 18852111432
## 7 DROUGHT 1046106000 13972571780 15018677780
## 8 THUNDERSTORM WIND 9979128673 1225486980 11204615653
## 9 ICE STORM 3971716860 5027113500 8998830360
## 10 WILDFIRE 8491563500 402781630 8894345130
Again, first 6 events have done more than 83% of the damage to the economy.
damageSum <- sum(q2StormData$totalDMG)
sum(q2StormData$totalDMG[1:6])*100/damageSum
## [1] 83.1323
As we want to find the types of events that had the greatest economic consequences, we will plot the percentage of damage that each event has done to economy.
barplot(q2StormData$totalDMG[1:6]*100/damageSum, names.arg = q2StormData$EVTYPE[1:6], cex.names = 0.5, ylab = "Percentage of damage to economy", main = "Most harmful events for economy")
We can see that Flood and Hurricane/Typhoon have done the most damage to economy, with 33.76% and 19.04% of damages respectively. Again, we can see that Tornado is in the third place and has done 12.35% of damage to economy.