In this report my aim is to describe the total effects on human life and the total costs associated with different weather events in the United States between the years 1996 and 2011. To determine the effects to human health from weather events it is the total quantity of fatalities and injuries during this time period that will be used. To determine the costs it will be the total dollar value of property and crop damage during this time period. The data is from the NOAA website. From the data it was determined that TORNADOES have the highest total number of fatalities and injuries during the time period. It was found that the total property cost of FLOODING was greater than 25 Million Dollars; and the total cost of crop damage was approximately 25 Million Dollars from HURRICANES.
There are two additional supporting documents: the NOAA Storm Data Cheatsheet and the NOAA DATA FAQ. The actual raw data file can be found here.
library(dplyr)
library(stringr)
The raw data is comma delimited so we use read.csv. The data is reasonably clean so strings will be treated as factors. NA values are denoted “NA”.
raw_data <- read.csv("./StormData.csv", stringsAsFactors = TRUE, na.strings = "NA")
After reading in the data we check the first few rows (rows = 902297) in the data set.
dim(raw_data)
## [1] 902297 37
head(raw_data[,1:8])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
Data names and columns look okay. We will format the BGN_Date column.
The original data set began in 1950, however, since there are many missing values prior to 1996 then the subset we are interested in is only the values between the dates of 1996 and 2011. To make this a more workable set of data we are only interested in weather events where there are at least one fatality or injury; we will also eliminate the unnecessary columns.
health_data <- raw_data %>%
filter(BGN_DATE > "1996-01-01" & FATALITIES > 0 & INJURIES > 0) %>%
select(BGN_DATE | STATE:EVTYPE | FATALITIES:INJURIES)
sum(is.na(health_data$FATALITIES))
## [1] 0
sum(is.na(health_data$INJURIES))
## [1] 0
From this smaller set of data we see that we are missing no numerical fatality or injury values.
max(health_data$FATALITIES) - min(health_data$FATALITIES)
## [1] 157
The range of fatality data isn’t too wide, but the total number of named Event Types is. Rather than try and format all named events I will use kmeans and cluster the data using fatalities and injuries as related coordinates.
health_df <- data.frame(health_data$FATALITIES,health_data$INJURIES)
health_kmeans <- kmeans(health_df, 8, nstart = 25)
z <- data.frame(health_kmeans$centers)
z$cluster <- c(1:8)
z <- z %>% arrange(desc(health_data.FATALITIES))
z <- as.list(z)
top_3 <- z$cluster[1:3]
health_data <- mutate(health_data, cluster = health_kmeans$cluster)
health_data$cluster <- as.factor(health_data$cluster)
final_health <- health_data %>%
filter(cluster %in% top_3) %>%
group_by(EVTYPE) %>%
summarise(tot_fatal = sum(FATALITIES), tot_inj = sum(INJURIES))
final_health$EVTYPE <- final_health$EVTYPE %>%
str_replace_all(c("EXCESSIVE " = "",
"FLASH " = "",
"BLIZZARD" = "WINTER WEATHER",
"WINTER STORM" = "WINTER WEATHER",
"/TYPHOON" = ""))
final_health <- final_health %>%
group_by(EVTYPE) %>%
summarise(fatalities = sum(tot_fatal), injuries = sum(tot_inj))
Using multiple attempts we found that kmeans always converged when we have 8 centers and 25 random starts. Grouping the data this way significantly reduced the number of formatting discrepancies within the EVTYPE data.
damage_data <- raw_data %>%
filter(BGN_DATE >= "1996-01-01") %>%
filter(CROPDMG > 0 & PROPDMG > 0)
# Format the exponential column to numerical values
damage_data$PROPDMGEXP <- str_replace_all(damage_data$PROPDMGEXP, "K", "1000")
damage_data$PROPDMGEXP <- str_replace_all(damage_data$PROPDMGEXP, "M", "1000000")
damage_data$PROPDMGEXP <- str_replace_all(damage_data$PROPDMGEXP, "B", "1000000000")
damage_data$PROPDMGEXP <- as.numeric(damage_data$PROPDMGEXP)
damage_data$CROPDMGEXP <- str_replace_all(damage_data$CROPDMGEXP, "K", "1000")
damage_data$CROPDMGEXP <- str_replace_all(damage_data$CROPDMGEXP, "M", "1000000")
damage_data$CROPDMGEXP <- str_replace_all(damage_data$CROPDMGEXP, "B", "1000000000")
damage_data$CROPDMGEXP <- as.numeric(damage_data$CROPDMGEXP)
# Mutate the data so that crop and property damage are in the correct numerical values
damage_data <- mutate(damage_data, property_damage = PROPDMG*PROPDMGEXP)
damage_data <- mutate(damage_data, crop_damage = CROPDMG*CROPDMGEXP)
damage_data <- as.data.frame(damage_data)
# Limit the data to the columns of interest
damage_data <- damage_data %>% select(BGN_DATE | STATE | EVTYPE | property_damage | crop_damage)
However, when attempting to clean the data for analysis of the economic costs associated with weather events using kmeans didn’t work; we kept increasing the number of random starts but the data never converged. When reducing the numbers of clusters kmeans returned a single event as the largest property damage value. So we decided to use property and crop damage values that are in the 95th percentile in the data.
prop_cut_off <- max(quantile(damage_data$property_damage, probs = seq(0,1,.95)))
crop_cut_off <- max(quantile(damage_data$crop_damage, probs = seq(0,1,0.95)))
damage_subset <- damage_data %>%
filter(property_damage >= prop_cut_off | crop_damage >= crop_cut_off) %>%
group_by(EVTYPE) %>%
summarise(property_damage = sum(property_damage), crop_damage = sum(crop_damage))
damage_subset$EVTYPE <- damage_subset$EVTYPE %>%
str_replace_all(c("EXCESSIVE " = "",
"/High Surf" = "",
"/HAIL" = "",
"/FOREST " = "",
"BLIZZARD" = "WINTER WEATHER",
"WINTER STORM" = "WINTER WEATHER",
"FROST/FREEZE" = "WINTER WEATHER",
"HEAVY SNOW" = "WINTER WEATHER",
"ICE STORM" = "WINTER WEATHER",
"HIGH " = "",
"/TYPHOON" = "",
"TYPHOON" = "HURRICANE",
"TROPICAL STORM" = "HURRICANE",
"FLASH FLOOD" = "HEAVY RAIN",
"Heavy Rain" = "HEAVY RAIN",
"River Flooding" = "FLOOD",
"URBAN/SML STREAM FLD" = "FLOOD",
"THUNDERSTORM " = "",
"STRONG " = "",
"TSTM " = "",
"HAIL" = "WIND"))
Grouping the data this way gave me many more formatting issues than when I used kmeans on the health data frame. We filtered the data and re-formated some of the strings to make them easier to group. After completing the formatting we still had a lot of data. Therefore we filtered the rows to look at property and crop damage that were greater than tens of millions of dollars.
final_damage <- damage_subset %>%
group_by(EVTYPE) %>%
filter(property_damage > 10^8 & crop_damage > 10^8) %>%
summarise(property_damage = sum(property_damage),
crop_damage = sum(crop_damage))
par(mfrow = c(2,1))
barplot(final_health$fatalities,
names.arg = final_health$EVTYPE,
col = "light blue",
xlab = "Event Type",
ylab = "Total Fatalities",
main = "Health Effects by Weather Event from 1996 to 2011")
barplot(final_health$injuries,
names.arg = final_health$EVTYPE,
col = "thistle",
xlab = "Event Type",
ylab = "Total Injuries")
From the barplots we see for both fatalities and injuries that the highest total number was from the Tornado type weather events.
par(mfrow= c(1,1))
barplot(final_damage$property_damage,
names.arg = final_damage$EVTYPE,
col = "yellow",
xlab = "Event Type",
ylab = "Property Damage (Millions of Dollars)",
main = "Total Property Cost by Weather Event between 1996 to 2011")
barplot(final_damage$crop_damage,
names.arg = final_damage$EVTYPE,
col = "red",
xlab = "Event Type",
ylab = "Crop Damage (Millions of Dollars)",
main = "Total Crop Cost by Weather Event between 1996 to 2011")
The resulting plots show that FLOODING was the largest total cost associated with property and that HURRICANES had the largest effect on crops.