This report analyzes the NOAA Storm Events data set. Its main focus is to determine which types of weather events are most harmful to the health of the US population and which types have the greatest economic consequences. First we read the raw data into R and process it for further analysis. In this process, variables that are not needed are eliminated and the data is transformed into a clean and tidy data set. The 8 variables needed out of the 37 are: BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. The next step is to analyze the data with regard to the two main questions. Most fatalities are caused by excessive heat. However, when injuries are taken into account as well, the most harmful weather events are tornadoes. The most costly type of event in economic terms is hurricanes, followed by storm tides and floods.
## 2. Data Processing
Loading the required libraries
library(dplyr)
library(data.table)
library(knitr)
Downloading and importing the data into R
#set working directory
setwd("D:/Data Science/05 Reproducible Research/Course Project 2")
#download file
sourcefile <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "repdata-data-StormData.csv.bz2"
if(!file.exists(destfile)) {
  download.file(sourcefile, destfile=destfile, quiet=TRUE, method="curl")
}
#read file
data <- read.csv(bzfile(destfile))
data <- data.table(data)
As stated on the NOAA website http://www.ncdc.noaa.gov/stormevents/details.jsp, only the following event types were recorded before 1996: tornado, thunderstorm wind and hail. Therefore only the data from 1996 onwards will be used.
## convert column to date
data$BGN_DATE <- as.Date(data$BGN_DATE, "%m/%d/%Y")
## Remove all data before 1996 (compare against an explicit Date, not a string)
data <- filter(data, BGN_DATE >= as.Date("1996-01-01"))
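The date handling above can be tried on a few toy values. This is only a sketch: the sample strings assume that the raw BGN_DATE column uses a "month/day/year time" layout, and the variable names `raw_dates`, `parsed`, `cutoff` and `kept` are made up for the illustration.

```r
# Toy BGN_DATE strings (illustrative values, assumed "month/day/year time" layout)
raw_dates <- c("4/18/1950 0:00:00", "12/31/1995 0:00:00",
               "1/1/1996 0:00:00", "7/4/2011 0:00:00")

# as.Date() parses the part matched by the format and ignores the trailing time
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Compare against an explicit Date so that no implicit string conversion is needed
cutoff <- as.Date("1996-01-01")
kept <- parsed[parsed >= cutoff]
print(kept)
```

Only the dates on or after the cutoff remain, which mirrors the filter step on the full data set.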
One measurement was an outlier and had to be corrected: the property damage exponent of the record with REFNUM 605943 is changed to "M" (millions).
##Outlier correction
data$PROPDMGEXP<- as.character(data$PROPDMGEXP)
data$PROPDMGEXP <- ifelse(data$REFNUM == "605943", "M", data$PROPDMGEXP)
Next we keep only the rows where at least one of the variables INJURIES, FATALITIES, PROPDMG, CROPDMG is greater than zero, and select the variables we need for our analysis.
data_cleaned <- data %>%
  filter(PROPDMG > 0 | CROPDMG > 0 | INJURIES > 0 | FATALITIES > 0) %>%
  select(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REFNUM, REMARKS)
The EVTYPE variable is quite messy and contains many differently spelled duplicates, so we clean it up as suggested in the National Weather Service Instruction 10-1605 PDF, section 7 (Event Types).
## Have a look at the number of unique event types
length(unique(data_cleaned$EVTYPE))
## [1] 222
##Convert all lower case letters to upper case letters
data_cleaned$EVTYPE <- toupper(data_cleaned$EVTYPE)
length(unique(data_cleaned$EVTYPE))
## [1] 186
#Renaming the 186 different weather events to fit the 47 suggested ones from National Weather Service Instructions
data_cleaned$EVTYPE <- ifelse((grepl("TSTM", data_cleaned$EVTYPE, ignore.case=T)), "THUNDERSTORM WIND", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse((grepl("FLASH", data_cleaned$EVTYPE, ignore.case=T)), "FLASH FLOOD", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse((grepl("SURF", data_cleaned$EVTYPE, ignore.case=T)), "HIGH SURF", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse((grepl("TIDE", data_cleaned$EVTYPE, ignore.case=T)), "STORM TIDE", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("BLACK ICE","FREEZING DRIZZLE","FREEZING RAIN","FREEZING SPRAY","GLAZE","ICE ROADS","ICE ON ROAD","ICY ROADS","LIGHT FREEZING RAIN"), "FREEZING FOG", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("BLOWING SNOW","FALLING SNOW/ICE", "EXCESSIVE SNOW", "HEAVY SNOW SHOWER", "LATE SEASON SNOW","LIGHT SNOW","LIGHT SNOWFALL","SNOW","SNOW AND ICE","SNOW SQUALL","SNOW SQUALLS"), "HEAVY SNOW", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("BRUSH FIRE"), "WILDFIRE", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("BEACH EROSION", "COASTAL FLOODING", "COASTAL FLOODING/EROSION", "COASTAL STORM", "COASTALSTORM", "COASTAL FLOOD", "EROSION/CSTL FLOOD", "COASTAL EROSION"),"COASTAL FLOOD", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("COLD AND SNOW", "COLD TEMPERATURE", "COLD WEATHER", "COLD", "EXTENDED COLD","UNSEASONABLE COLD", "UNSEASONABLY COLD" ), "COLD/WIND CHILL", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("WET MICROBURST","DRY MICROBURST","THUNDERSTORM WIND (G40)","THUNDERSTORM", "MICROBURST", "URBAN/SML STREAM FLD"), "THUNDERSTORM WIND", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("EXTREME COLD", "EXTREME WINDCHILL"), "EXTREME COLD/WIND CHILL", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("FOG"), "DENSE FOG", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("FROST","AGRICULTURAL FREEZE", "DAMAGING FREEZE", "EARLY FROST", "FREEZE","HARD FREEZE"), "FROST/FREEZE", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("GUSTY WIND", "GUSTY WINDS", "DOWNBURST","GRADIENT WIND","GUSTY WIND/HAIL","GUSTY WIND/HVY RAIN","GUSTY WIND/RAIN","STRONG WINDS"), "STRONG WIND", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("HEAT WAVE", "WARM WEATHER"), "HEAT", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("BEACH EROSION", "TIDAL FLOODING", "DROWNING", "STORM SURGE"), "STORM TIDE", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("BLOWING DUST"), "DUST STORM", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("HEAVY SEAS","HIGH SWELLS","HIGH WATER","HIGH SEAS","ROUGH SEAS","ROGUE WAVE"), "HIGH SURF", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("HIGH WIND (G40)","HIGH WINDS"), "HIGH WIND", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("HURRICANE","HURRICANE EDOUARD"), "HURRICANE/TYPHOON", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("HYPERTHERMIA/EXPOSURE","HYPOTHERMIA/EXPOSURE", "RECORD HEAT", "UNSEASONABLY WARM"), "EXCESSIVE HEAT", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("ICE JAM FLOOD (MINOR"), "FLASH FLOOD", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("LAKE-EFFECT SNOW"), "LAKE EFFECT SNOW", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("ROCK SLIDE","LANDSPOUT","LANDSLUMP","LANDSLIDES","MUD SLIDE","MUDSLIDE","MUDSLIDES"), "LANDSLIDE", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("MIXED PRECIP", "MIXED PRECIPITATION"), "SLEET", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("NON-SEVERE WIND DAMAGE","WIND","WIND AND WAVE","WIND DAMAGE","WINDS"), "STRONG WIND", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("RAIN","RAIN/SNOW", "TORRENTIAL RAINFALL"), "HEAVY RAIN", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("RIP CURRENTS"), "RIP CURRENT", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("RIVER FLOODING"), "RIVER FLOOD", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("SMALL HAIL"), "HAIL", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("WHIRLWIND"), "TORNADO", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("WILD/FOREST FIRE"), "WILDFIRE", data_cleaned$EVTYPE)
data_cleaned$EVTYPE <- ifelse(data_cleaned$EVTYPE %in% c("WINTER WEATHER MIX", "WINTER WEATHER/MIX","WINTRY MIX"), "WINTER WEATHER", data_cleaned$EVTYPE)
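The exact-match renaming steps above could also be written as a single lookup table. The following is a minimal sketch of that alternative, not the code used for this report: `evtype_map` contains only a handful of the mappings listed above, and `recode_evtype()` is a hypothetical helper introduced for the illustration. The pattern-based steps (TSTM, FLASH, SURF, TIDE) are not covered by it.

```r
# Named character vector: names are raw spellings, values are the canonical types.
# Only a few of the mappings from the report are reproduced here.
evtype_map <- c(
  "BRUSH FIRE"       = "WILDFIRE",
  "WILD/FOREST FIRE" = "WILDFIRE",
  "RIP CURRENTS"     = "RIP CURRENT",
  "SMALL HAIL"       = "HAIL",
  "FOG"              = "DENSE FOG"
)

# Hypothetical helper: upper-case the input, then replace every value found in the map
recode_evtype <- function(x, map) {
  x <- toupper(x)
  hit <- x %in% names(map)
  x[hit] <- unname(map[x[hit]])
  x
}

raw <- c("Brush Fire", "RIP CURRENTS", "TORNADO", "fog")
recoded <- recode_evtype(raw, evtype_map)
print(recoded)
```

Values that are not in the map, such as TORNADO here, pass through unchanged. Keeping all mappings in one vector makes it easier to spot duplicated or conflicting entries.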
Now that the EVTYPE variable is cleaned up, we aggregate the data per event type to get to the results we want.
##Compute the final data frames regarding harm done to people, split by fatalities and injuries
data_fata <- data_cleaned %>% group_by(EVTYPE) %>% summarise(FATALITIES = sum(FATALITIES)) %>% arrange(desc(FATALITIES))
data_inju <- data_cleaned %>% group_by(EVTYPE) %>% summarise(INJURIES = sum(INJURIES)) %>% arrange(desc(INJURIES))
##Note: head() returns the first 6 rows by default, so these hold 6 event types
top5_fata <- head(data_fata)
top5_inju <- head(data_inju)
The most hazardous events which cause fatalities are the following:
head(data_fata)
## EVTYPE FATALITIES
## 1 EXCESSIVE HEAT 1807
## 2 TORNADO 1512
## 3 FLASH FLOOD 887
## 4 LIGHTNING 651
## 5 RIP CURRENT 542
## 6 THUNDERSTORM WIND 419
The most hazardous events which cause injuries are the following:
head(data_inju)
## EVTYPE INJURIES
## 1 TORNADO 20667
## 2 FLOOD 6758
## 3 EXCESSIVE HEAT 6408
## 4 THUNDERSTORM WIND 5242
## 5 LIGHTNING 4141
## 6 FLASH FLOOD 1674
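To compare overall harm, fatalities and injuries can be added up per event type. The sketch below does this only for the five event types that appear in both tables above, using the printed totals. The data frame `harm` and the column `CASUALTIES` are names chosen for this illustration, and the simple unweighted sum is an assumption rather than a method used elsewhere in this report.

```r
# Totals copied from the two tables above (event types present in both tables only)
harm <- data.frame(
  EVTYPE     = c("EXCESSIVE HEAT", "TORNADO", "FLASH FLOOD", "LIGHTNING", "THUNDERSTORM WIND"),
  FATALITIES = c(1807, 1512, 887, 651, 419),
  INJURIES   = c(6408, 20667, 1674, 4141, 5242),
  stringsAsFactors = FALSE
)

# Unweighted sum of fatalities and injuries (an assumption of this sketch)
harm$CASUALTIES <- harm$FATALITIES + harm$INJURIES

# Sort from most to least harmful
harm <- harm[order(-harm$CASUALTIES), ]
print(harm[, c("EVTYPE", "CASUALTIES")])
```

With this measure, tornadoes rank first by a wide margin, ahead of excessive heat.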
Now we have to process the data further to analyze the economic consequences.
#new data frame for economy damage
data_eco <- data_cleaned
#have a look at the PROPDMGEXP and CROPDMGEXP variables
unique(data_eco$PROPDMGEXP)
## [1] "K" "" "M" "B"
unique(data_eco$CROPDMGEXP)
## [1] K M B
## Levels: ? 0 2 B k K m M
# convert from factor to character
data_eco$PROPDMGEXP<- as.character(data_eco$PROPDMGEXP)
data_eco$CROPDMGEXP<- as.character(data_eco$CROPDMGEXP)
#replace k, m, b with 1000, 1000000, 1000000000
data_eco$PROPDMGEXP <- ifelse(data_eco$PROPDMGEXP == "", 0, data_eco$PROPDMGEXP)
data_eco$CROPDMGEXP <- ifelse(data_eco$CROPDMGEXP == "", 0, data_eco$CROPDMGEXP)
data_eco$PROPDMGEXP <- ifelse(data_eco$PROPDMGEXP == "K", 1000, data_eco$PROPDMGEXP)
data_eco$CROPDMGEXP <- ifelse(data_eco$CROPDMGEXP %in% c("K"), 1000, data_eco$CROPDMGEXP)
data_eco$PROPDMGEXP <- ifelse(data_eco$PROPDMGEXP %in% c("M"), 1000000, data_eco$PROPDMGEXP)
data_eco$CROPDMGEXP <- ifelse(data_eco$CROPDMGEXP %in% c("M"), 1000000, data_eco$CROPDMGEXP)
data_eco$PROPDMGEXP <- ifelse(data_eco$PROPDMGEXP %in% c("B"), 1000000000, data_eco$PROPDMGEXP)
data_eco$CROPDMGEXP <- ifelse(data_eco$CROPDMGEXP %in% c("B"), 1000000000, data_eco$CROPDMGEXP)
#convert back to numeric
data_eco$PROPDMGEXP<- as.numeric(data_eco$PROPDMGEXP)
data_eco$CROPDMGEXP<- as.numeric(data_eco$CROPDMGEXP)
#calculate PROPDMG and CROPDMG in millions of dollars
data_eco <- mutate(data_eco, PROPDMG = (PROPDMG * PROPDMGEXP)/1000000)
data_eco <- mutate(data_eco, CROPDMG = (CROPDMG * CROPDMGEXP)/1000000)
#calculate combined costs
data_eco$combined <- data_eco$PROPDMG + data_eco$CROPDMG
#calculate economic dmg per eventtype
data_combined <- data_eco %>% group_by(EVTYPE) %>% summarise(combined = sum(combined)) %>% arrange(desc(combined))
#note: head() returns the first 6 rows by default
top5_combined <- head(data_combined)
head(data_combined)
## EVTYPE combined
## 1 HURRICANE/TYPHOON 86467.94
## 2 STORM TIDE 47845.34
## 3 FLOOD 34034.61
## 4 TORNADO 24900.38
## 5 HAIL 17092.04
## 6 FLASH FLOOD 16557.17
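The exponent conversion used above can be summarized in a small lookup. This is a sketch under stated assumptions, not the code that produced the table: `exp_multiplier` and `to_millions()` are hypothetical names, and the input amounts are toy values. Empty codes count as 0, as in the report. Treating any other unknown code as 0 is an assumption of this sketch.

```r
# Multipliers for the exponent codes that occur in the filtered data
exp_multiplier <- c("K" = 1e3, "M" = 1e6, "B" = 1e9)

# Hypothetical helper: convert a damage value and its exponent code to millions of dollars
to_millions <- function(value, exp_code) {
  mult <- unname(exp_multiplier[as.character(exp_code)])
  mult[is.na(mult)] <- 0   # "" (and any unknown code) contributes no damage
  value * mult / 1e6
}

# Toy amounts: 25 K, 2.5 M, 1.1 B and a value without exponent
damage <- to_millions(c(25, 2.5, 1.1, 5), c("K", "M", "B", ""))
print(damage)
```

The four toy amounts come out as 0.025, 2.5, 1100 and 0 million dollars.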
When I first saw the results, I was surprised that floods came out as the most costly weather events, so I took a closer look at the data. Since I knew that the most costly event ever in the United States was Hurricane Katrina, I did some research and found out that the Napa River flood caused damage in the millions, not in the billions as recorded in our data set. So I corrected that outlier (see the outlier correction in the data processing section), and the results shown above already include this correction.
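One way to take such a closer look is to list the single records with the largest damage. The sketch below shows the idea on toy data. All REFNUM values and amounts in `records` are made up for illustration and do not come from the Storm Events data set.

```r
# Toy records; every value here is hypothetical
records <- data.frame(
  REFNUM     = c(101, 102, 103, 104),
  EVTYPE     = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "HAIL"),
  PROPDMG    = c(120, 30, 2.5, 500),
  PROPDMGEXP = c("B", "B", "B", "K"),
  stringsAsFactors = FALSE
)

# Convert to dollars and sort so the most expensive single records come first
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
records$dollars <- records$PROPDMG * unname(multiplier[records$PROPDMGEXP])
largest <- records[order(-records$dollars), ]
print(head(largest, 3))
```

A single record that dwarfs all others, like the toy flood record here, is a candidate for checking against the REMARKS field or external sources.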
As we can see in the plots below, most fatalities are caused by excessive heat. However, when we take injuries into account as well, the most harmful weather events are tornadoes.
##Weather events with most fatalities across the U.S.
barplot(top5_fata$FATALITIES, names.arg = top5_fata$EVTYPE,
        xlab = "Weather event", ylab = "Number of fatalities",
        main = "Weather events with most fatalities across the U.S.", border = "black", col = "blue", cex.names = 0.5)
##Weather events with most injuries across the U.S.
barplot(top5_inju$INJURIES, names.arg = top5_inju$EVTYPE,
        xlab = "Weather event", ylab = "Number of injuries",
        main = "Weather events with most injuries across the U.S.", border = "black", col = "blue", cex.names = 0.5)
##Most costly weather events across the U.S.
barplot(top5_combined$combined, names.arg = top5_combined$EVTYPE,
        xlab = "Weather event", ylab = "Damage in million dollars",
        main = "The most costly weather events in the US", border = "black", col = "blue", cex.names = 0.5)
As we can see in the plot above, the most costly type of event in economic terms is hurricanes, followed by storm tides and floods.