This data analysis uses data from U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, collected from 1950 to November 2011.
The goal of the data analysis is to investigate the effects of severe weather events on population health and economy.
We found that the weather events that cause the most harm to population health are tornados, as they caused the most fatalities, deaths and overall casualties. The weather events that caused the most severe agricultural damages are drought. The weather events that caused the most property damages and combined monetary damages are floods.
From the course website, we downloaded the dataset provided by U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, and loaded the raw data into R.
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(file_url, destfile="StormData.csv.bz2")
stormraw <- read.csv("StormData.csv.bz2")
After reading the raw dataset, we check the dataset dimension and column headers.
dim(stormraw)
## [1] 902297 37
names(stormraw)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
To analyze the impact of weather events on population health, we create a new variable casualties which is the sum of fatalities and injury.
##subset for Fatalities and Injuries
storm_population <- stormraw[, c("EVTYPE", "FATALITIES", "INJURIES")]
##calculate Casualties= FATALITEIS+INJURIES
storm_population$Casualties <- storm_population$FATALITIES + storm_population$INJURIES
##confirm that there is no missing values in this subset
sum(is.na(storm_population))
## [1] 0
We then grouped the data by event types and evaluated the total deaths, injuries and casualties for each type of events.
##group data by event types
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
storm_casualties_bytype <- storm_population %>%
group_by(EVTYPE) %>%
summarise(occurence=n(),
fatalities_total=sum(FATALITIES),
injuries_total=sum(INJURIES),
casualties_total=sum(Casualties),
)
##acquire top 5 event types that causes the highest deaths, injuries and casualties
top5fatalities_total <- storm_casualties_bytype %>%
slice_max(order_by = fatalities_total, n=5)
top5injuries_total <- storm_casualties_bytype %>%
slice_max(order_by = injuries_total, n=5)
top5casualties_total <- storm_casualties_bytype %>%
slice_max(order_by = casualties_total, n=5)
print(top5fatalities_total)
## # A tibble: 5 × 5
## EVTYPE occurence fatalities_total injuries_total casualties_total
## <chr> <int> <dbl> <dbl> <dbl>
## 1 TORNADO 60652 5633 91346 96979
## 2 EXCESSIVE HEAT 1678 1903 6525 8428
## 3 FLASH FLOOD 54277 978 1777 2755
## 4 HEAT 767 937 2100 3037
## 5 LIGHTNING 15754 816 5230 6046
print(top5injuries_total)
## # A tibble: 5 × 5
## EVTYPE occurence fatalities_total injuries_total casualties_total
## <chr> <int> <dbl> <dbl> <dbl>
## 1 TORNADO 60652 5633 91346 96979
## 2 TSTM WIND 219940 504 6957 7461
## 3 FLOOD 25326 470 6789 7259
## 4 EXCESSIVE HEAT 1678 1903 6525 8428
## 5 LIGHTNING 15754 816 5230 6046
print(top5casualties_total)
## # A tibble: 5 × 5
## EVTYPE occurence fatalities_total injuries_total casualties_total
## <chr> <int> <dbl> <dbl> <dbl>
## 1 TORNADO 60652 5633 91346 96979
## 2 EXCESSIVE HEAT 1678 1903 6525 8428
## 3 TSTM WIND 219940 504 6957 7461
## 4 FLOOD 25326 470 6789 7259
## 5 LIGHTNING 15754 816 5230 6046
##The event types that are in any of the above three top 5 rankings are:
TopEvents_population <- union(union(top5injuries_total$EVTYP, top5fatalities_total$EVTYPE), top5casualties_total$EVTYPE)
## Warning: Unknown or uninitialised column: `EVTYP`.
print(TopEvents_population)
## [1] "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD" "HEAT"
## [5] "LIGHTNING" "TSTM WIND" "FLOOD"
We see that the top 5 rankings of events that caused the most deaths, injuries and overall casualties mostly overlap, and we focus on the 7 event types and appear in these rankings.
##Prepare the dataset for plotting by subsetting for the above 7 top harmful events and reshape to long format
top_population_harm <- subset(storm_casualties_bytype, EVTYPE %in% TopEvents_population)
library(tidyr)
top_population_long <- top_population_harm %>%
pivot_longer(cols=c("fatalities_total", "injuries_total", "casualties_total"), names_to = "harm", values_to = "total_number")
To analyze the economic consequences of severe weather events, we look at the property damages, crop damages and total moneytary damages (sum of property and crop damages).
##subset for event types, property damages and crop damages.
storm_econ <- stormraw[,c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
In this dataset, each property damages are represented as a numeric value (“PROPDMG” and “CROPDMG) and a multiplier (”PROPDMGEXP” and “CROPDMGEXP”). We calculated the actual dollar amount is calculated by multiplying the numeric values by their multiplier. The total monetary damages are the sum of the property and crop damages after multiplier conversions.
##Create a function that read and convert the multiplier columns
unique(storm_econ$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(storm_econ$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
multiplier_conversion <- function(multiplier) {
case_when(
multiplier == "K" ~ 1e3,
multiplier == "k" ~ 1e3,
multiplier == "M" ~ 1e6,
multiplier == "m" ~ 1e3,
multiplier == "B" ~ 1e9,
multiplier == "b" ~ 1e3,
multiplier == "H" ~ 1e2,
multiplier == "h" ~ 1e2,
multiplier == "1" ~ 10,
multiplier == "2" ~ 1e2,
multiplier == "3" ~ 1e3,
multiplier == "4" ~ 1e4,
multiplier == "5" ~ 1e5,
multiplier == "6" ~ 1e6,
multiplier == "7" ~ 1e7,
multiplier == "8" ~ 1e8,
TRUE ~ 1 # Default if multiplier is not one of the above
)
}
##calculate property and crop damages by multiplying the numeric values by their corresponding multipliers
storm_econ$property_dollar <- storm_econ$PROPDMG * multiplier_conversion(storm_econ$PROPDMGEXP)
storm_econ$crop_dollar <- storm_econ$CROPDMG * multiplier_conversion(storm_econ$CROPDMGEXP)
##create add up the property and crop damages
storm_econ$total_dollar <- storm_econ$property_dollar+storm_econ$crop_dollar
##confirm that there is no missing data
sum(is.na(storm_econ))
## [1] 0
We then grouped the data by event types and evaluated the total dollar amount of property damages, crop damages and combined monetary damages.
##group data by event types
storm_econ_bytype <- storm_econ %>%
group_by(EVTYPE) %>%
summarise(occurence = n(),
property_total = sum(property_dollar),
crop_total = sum(crop_dollar),
money_total = sum(total_dollar)
)
##accquire top 5 events types that caused the highest property, crop and overall monetary damages.
top5prop_total <- storm_econ_bytype %>%
slice_max(order_by = property_total, n=5)
top5crop_total <- storm_econ_bytype %>%
slice_max(order_by = crop_total, n=5)
top5money <- storm_econ_bytype %>%
slice_max(order_by = money_total, n=5)
print(top5prop_total)
## # A tibble: 5 × 5
## EVTYPE occurence property_total crop_total money_total
## <chr> <int> <dbl> <dbl> <dbl>
## 1 FLOOD 25326 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 88 69305840000 2607872800 71913712800
## 3 TORNADO 60652 56935892176. 414953270 57350845446.
## 4 STORM SURGE 261 43323536000 5000 43323541000
## 5 FLASH FLOOD 54277 16822673978. 1421317100 18243991078.
print(top5crop_total)
## # A tibble: 5 × 5
## EVTYPE occurence property_total crop_total money_total
## <chr> <int> <dbl> <dbl> <dbl>
## 1 DROUGHT 2488 1046106000 13972566000 15018672000
## 2 FLOOD 25326 144657709807 5661968450 150319678257
## 3 RIVER FLOOD 173 5118945500 5029459000 10148404500
## 4 ICE STORM 2006 3944927860 5022113500 8967041360
## 5 HAIL 288661 15730372413. 3025954473 18756326886.
print(top5money)
## # A tibble: 5 × 5
## EVTYPE occurence property_total crop_total money_total
## <chr> <int> <dbl> <dbl> <dbl>
## 1 FLOOD 25326 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 88 69305840000 2607872800 71913712800
## 3 TORNADO 60652 56935892176. 414953270 57350845446.
## 4 STORM SURGE 261 43323536000 5000 43323541000
## 5 HAIL 288661 15730372413. 3025954473 18756326886.
##The event types that are in any of the above three top 5 rankings are:
TopEvents_econ <- union(union(top5prop_total$EVTYPE, top5crop_total$EVTYPE), top5money$EVTYPE)
print(TopEvents_econ)
## [1] "FLOOD" "HURRICANE/TYPHOON" "TORNADO"
## [4] "STORM SURGE" "FLASH FLOOD" "DROUGHT"
## [7] "RIVER FLOOD" "ICE STORM" "HAIL"
We see that the top 5 rankings by the three criteria have some overlaps and there are nine types of weather events that appear on these rankings.
##prepare the dataset for plotting by subsetting for the nine focus event types and reshape dataset to long format
top_econ_damage <- subset(storm_econ_bytype, EVTYPE %in% TopEvents_econ)
##convert the dollar amount to million
top_econ_damage$property_total_M <- top_econ_damage$property_total/1000000
top_econ_damage$crop_total_M <- top_econ_damage$crop_total/1000000
top_econ_damage$money_total_M <- top_econ_damage$money_total/1000000
top_econ_long <- top_econ_damage %>%
pivot_longer(cols=c("property_total_M", "crop_total_M", "money_total_M"), names_to = "damage", values_to = "million")
To show the harms to population health by severe weather events, we can make a clustered bar plots of the top event types that caused the most death, injuries or combined casualties.
##arrange order of clustered bars to display in order of 1.fatalities, 2.injuries, 3.combined casualties
top_population_long$harm <- factor(top_population_long$harm, levels = c("fatalities_total", "injuries_total", "casualties_total"))
##create clustered barplot
library(ggplot2)
g1 <- ggplot(top_population_long, aes(x=EVTYPE, y=total_number, fill=harm))
g1+geom_bar(stat="identity", position="dodge") +
labs(x="Events", y="Total Number", fill="Color Legend", title = "Top Harmful Weather Events for Population Health") +
scale_fill_discrete(labels = c("Fatalities", "Injuries", "Combined Casualties")) +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.key.size = unit(0.3,"cm"))
From this barplot, it is apparent that Tornados caused overwhelming the most deaths and injuries among all severe weather events, and therefore is the most harmful to population health.
To show the economic damages by severe weather events, we can make a clustered bar plots of the top event types that caused the most properties, crops and combined damages. The damage amount is shown in million dollars.
##arrange order of clustered bars to display in order of 1.property damages, 2.crop damages, 3.combined monetary damages
top_econ_long$damage <- factor(top_econ_long$damage, levels = c("property_total_M", "crop_total_M", "money_total_M"))
##create clustered barplot
g2 <- ggplot(top_econ_long, aes(x=EVTYPE, y=million, fill=damage))
g2 + geom_bar(stat="identity", position="dodge") +
labs(x="Events", y="Total Damage (million dollars)", fill="Damages", title="Top Damaging Weather Events for Economic Loss ") +
scale_fill_discrete(labels = c("Properties", "Crops","Combined Loss" )) +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.key.size = unit(0.3,"cm"))
From this barplot, it is apparent that floods caused caused the heaviest economic impact among all severe weather events, and this was mostly driven by the enormous property damages. Droughts, while caused less damages to properties, are the most severe events for agricultural losses.