This analysis summarizes the impacts that storms have on people and property. The dataset is provided by ???? and represents the time period between Jan 3, 1950 to Nov 30, 2011.
The first part of the analysis looks at what types of storms cause the most casualties and the second part looks at what types of storms cause the most property and crop damage.
Analysis showed that wind storms caused the vast majority of casualties with 886% more than the second highest cause.
Flood caused the most damage to property and crops. It was 238% higher than the next highest cause. # Preliminaries ### Start with an empty environment and set the working directory
rm(list = ls())
setwd("~/Analytics Course/04_ReproducibleResearch/Project2")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
library(R.cache)
## R.cache v0.12.0 (2015-11-12) successfully loaded. See ?R.cache for help.
# Location of data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#download.file(url, "storm_data.csv.bz2")
# Read data into variable
storm_data_raw <- read.csv("storm_data.csv.bz2")
# Get range of dates covered by data
sdr <- storm_data_raw
sdr$BGN_DATE <- mdy_hms(sdr$BGN_DATE)
# Get range of dates in dataset
range(sdr$BGN_DATE)
## [1] "1950-01-03 UTC" "2011-11-30 UTC"
rm(sdr)
We can see that out data spans the time period of 1950-01-03 to 2011-11-30.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
For this part only three columns are needed: EVTYPE,FATALITIES,INJURIES
Fatalities and Injuries are added together to form total casualties.
NOTE: No more weight is given to deaths than injuries
storm_data <- storm_data_raw %>%
select(EVTYPE,FATALITIES,INJURIES) %>%
mutate(EVTYPE = toupper(EVTYPE)) %>%
group_by(EVTYPE) %>%
summarise(sum_fatal = sum(FATALITIES), sum_injured = sum(INJURIES)) %>%
mutate(total_casualties = sum_fatal + sum_injured) %>%
arrange(-total_casualties)
print(storm_data)
## # A tibble: 898 x 4
## EVTYPE sum_fatal sum_injured total_casualties
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
## # ... with 888 more rows
After getting this initial dataset for the question I noticed that the top ten had several event types that were very similar. For instance #2 Excessive Heat, and #6 Heat; #4 Flood, and #7 Flash Flood; #3 TSTM Wind, and #9 Thunderstorm Wind. This meant that without being categorized another level this data is fairly meaningless, even though we can see already that the tornado is, by far, the cause of most casualties. The problem is that there are 898 unique EVTYPEs.
length(unique(storm_data$EVTYPE))
## [1] 898
If we limit the events to only ones that caused more than 50 casualties then we are left with a more manageble 52 EVTYPEs.
# Extract all unique EVTYPEs with more than 50 casualties
more_than_50 <- storm_data %>%
filter(total_casualties > 50) %>%
select(EVTYPE) %>%
arrange(EVTYPE)
unique(more_than_50$EVTYPE)
## [1] "AVALANCHE" "BLIZZARD"
## [3] "COLD" "COLD/WIND CHILL"
## [5] "DENSE FOG" "DUST STORM"
## [7] "EXCESSIVE HEAT" "EXTREME COLD"
## [9] "EXTREME COLD/WIND CHILL" "EXTREME HEAT"
## [11] "FLASH FLOOD" "FLOOD"
## [13] "FOG" "GLAZE"
## [15] "HAIL" "HEAT"
## [17] "HEAT WAVE" "HEAVY RAIN"
## [19] "HEAVY SNOW" "HEAVY SURF/HIGH SURF"
## [21] "HIGH SURF" "HIGH WIND"
## [23] "HIGH WINDS" "HURRICANE"
## [25] "HURRICANE/TYPHOON" "ICE"
## [27] "ICE STORM" "LANDSLIDE"
## [29] "LIGHTNING" "RECORD HEAT"
## [31] "RIP CURRENT" "RIP CURRENTS"
## [33] "STORM SURGE" "STRONG WIND"
## [35] "THUNDERSTORM WIND" "THUNDERSTORM WINDS"
## [37] "TORNADO" "TROPICAL STORM"
## [39] "TROPICAL STORM GORDON" "TSTM WIND"
## [41] "TSTM WIND/HAIL" "TSUNAMI"
## [43] "URBAN/SML STREAM FLD" "WILD FIRES"
## [45] "WILD/FOREST FIRE" "WILDFIRE"
## [47] "WIND" "WINTER STORM"
## [49] "WINTER WEATHER" "WINTER WEATHER MIX"
## [51] "WINTER WEATHER/MIX" "WINTRY MIX"
OK, this is where I divert a little from data science norms…
I could categorize the EVTYPEs using a lot of grep calls searching for certain key words, but I’m on a deadline so I just extracted the unique event types to a csv, opened it up in excel, created another column for “Category” and manually assigned each of the 52 EVTYPEs to a more general category. I then saved it as a CSV and imported it as storm_category.
NOTE: Since the categories are subjective feel free to re-categorize them and see if you get different results.
# write.csv(more_than_50, "storm_category.csv", row.names = FALSE)
# Read in category xref
storm_category <- read.csv("storm_category.csv", header=TRUE)
Now let’s merge the categories into the storm_data set to give groupings.
merged <- merge(storm_data, storm_category, by="EVTYPE", all.x = TRUE)
# Now, further consolidate the total_casualties by their category
consolidated <- merged %>%
group_by(CATEGORY) %>%
summarise(category_total = sum(total_casualties)) %>%
arrange(-category_total)
print(consolidated)
## # A tibble: 20 x 2
## CATEGORY category_total
## <fctr> <dbl>
## 1 WIND 109247
## 2 HEAT 12319
## 3 FLOOD 10267
## 4 WINTER STORM 7407
## 5 LIGHTNING 6046
## 6 TROPICAL STORM 2038
## 7 FIRE 1696
## 8 HAIL 1476
## 9 NA 1185
## 10 FOG 1156
## 11 RIP CURRENT 1101
## 12 COLD 509
## 13 DUST STORM 462
## 14 AVALANCHE 394
## 15 RAIN 353
## 16 HIGH SURF 350
## 17 COLD AND WIND 256
## 18 GLAZE 223
## 19 LANDSLIDE 90
## 20 DROUGHT 4
Since we’re only interested in the ones with the most casualties I want to choose the grouping that shows the most info. There is a big dropoff between #5 Lightning, and #6 Tropical Storm. Let’s just use the top six to show that big decline.
top_six <- consolidated[1:6, ]
# In order to get them to print in the order of rank I need to set the levels of the EVTYPE factor to the order I want them.
levels_top_six <- c("WIND", "HEAT", "FLOOD", "WINTER STORM", "LIGHTNING",
"TROPICAL STORM")
top_six$CATEGORY <- ordered(top_six$CATEGORY, levels=levels_top_six)
Now we are ready to plot out the results
ggplot(top_six, aes(CATEGORY, category_total, fill=category_total)) +
geom_bar(stat = "identity") +
labs(title = "Casualties per Storm Category",
x="Storm Category",
y="People Killed or Injured",
fill="Total\nCasualties") +
theme(axis.text.x = element_text(size = 8))
Across the United States, which types of events have the greatest economic consequences?
Let’s get the data that deals with economics…
storm_data <- storm_data_raw %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
mutate(EVTYPE = toupper(EVTYPE), PROPDMGEXP = toupper(PROPDMGEXP),
CROPDMGEXP = toupper(CROPDMGEXP))
head(storm_data)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 25.0 K 0
## 2 TORNADO 2.5 K 0
## 3 TORNADO 25.0 K 0
## 4 TORNADO 2.5 K 0
## 5 TORNADO 2.5 K 0
## 6 TORNADO 2.5 K 0
Looking at the documentation. The PROPDMGEXP and CROPDMGEXP are important. Values of K, M, & B mean thousands, millions, and billions and are multipliers of the values in PROPDMG & CROPDMG. The others can use the value without modification.
Lets make any value adjustments now before working any further with this dataset.
# Property damage adjustments
# Get the indexes for data with these multipliers
index_K <- which(storm_data$PROPDMGEXP == "K")
index_M <- which(storm_data$PROPDMGEXP == "M")
index_B <- which(storm_data$PROPDMGEXP == "B")
storm_data[,2][index_K] <- storm_data[,2][index_K]*(10^3)
storm_data[,2][index_M] <- storm_data[,2][index_M]*(10^6)
storm_data[,2][index_B] <- storm_data[,2][index_B]*(10^9)
# Crop damage adjustments. I'm re-using variables for conciseness.
index_K <- which(storm_data$CROPDMGEXP == "K")
index_M <- which(storm_data$CROPDMGEXP == "M")
index_B <- which(storm_data$CROPDMGEXP == "B")
storm_data[,4][index_K] <- storm_data[,4][index_K]*(10^3)
storm_data[,4][index_M] <- storm_data[,4][index_M]*(10^6)
storm_data[,4][index_B] <- storm_data[,4][index_B]*(10^9)
Group and re-summarise storm data. Add the total_damage variable.
grouped <- storm_data %>%
group_by(EVTYPE) %>%
summarise(sum_prop = sum(PROPDMG), sum_crop = sum(CROPDMG)) %>%
mutate(total_damage = sum_prop + sum_crop) %>%
arrange(-total_damage)
head(grouped, 10)
## # A tibble: 10 x 4
## EVTYPE sum_prop sum_crop total_damage
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937160779 414953270 57352114049
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732267048 3025954473 18758221521
## 6 FLASH FLOOD 16140812067 1421317100 17562129167
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927860 5022113500 8967041360
We have the same problem with redundant EVTYPEs. I pulled out the top 52, which are very different than the top 52 for casualties, extracted them to excel, manually grouped them in a ‘Category’ column and then re-saved and imported as storm_category.
NOTE: This is the same file as above and can be skipped if already imported.
# for_extract <- head(grouped, 52)
# write.csv(for_extract, "storm_type.csv", row.names = FALSE)
# Read in category xref
storm_category <- read.csv("storm_category.csv", header=TRUE)
Merge the categories into the grouped data set.
merged <- merge(grouped, storm_category, by="EVTYPE", all.x = TRUE)
Consolidate the totals by category
consolidated <- merged %>%
select(CATEGORY, total_damage) %>%
group_by(CATEGORY) %>%
summarise(sum_damage = sum(total_damage)) %>%
arrange(-sum_damage)
consolidated
## # A tibble: 20 x 2
## CATEGORY sum_damage
## <fctr> <dbl>
## 1 FLOOD 227189427503
## 2 TROPICAL STORM 95059614060
## 3 WIND 75022527432
## 4 HAIL 18867253271
## 5 WINTER STORM 18402190803
## 6 DROUGHT 15018672000
## 7 NA 12489228304
## 8 FIRE 8793313130
## 9 COLD 2494076900
## 10 RAIN 1500432940
## 11 LIGHTNING 940751537
## 12 HEAT 924539250
## 13 LANDSLIDE 344613000
## 14 HIGH SURF 99825000
## 15 FOG 22829500
## 16 COLD AND WIND 11288000
## 17 DUST STORM 8649000
## 18 AVALANCHE 3721800
## 19 GLAZE 1000000
## 20 RIP CURRENT 163000
It looks like we have a nice break between 5 & 6 in this set also.
top_six <- consolidated[1:6, ]
Change factor levels for plotting purposes.
top_six$CATEGORY <- as.character(top_six$CATEGORY)
top_six$CATEGORY[5] <- "WINTER\nSTORM"
# Get the category order to assign as ordered levels
levels_dmg <- top_six$CATEGORY
top_six$CATEGORY <- ordered(top_six$CATEGORY, levels=levels_dmg)
Plot out our results
ggplot(top_six, aes(CATEGORY, sum_damage/10^9, fill=sum_damage/10^9)) +
geom_bar(stat = "identity") +
labs(x="Storm Category",
y="Damage in billions($)",
fill="Damage\nBillions($)") +
ggtitle("Storm Damage by Category") +
scale_x_discrete(name = "Storm Category") +
theme(axis.text.x = element_text(size = 8))