The basic goal of this analysis is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database and answer the following questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The NOAA Storm Database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The database for this assignment was downloaded directly from the coursera assignment web page. For more details, access this link Storm Data Documentation
Results: The analysis showed that, across the United States, tornados are the most harmful event to population health, while floods have the greatest economic consequences.
Setting Rmarkdown global options and loading needed libraries.
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
library(dplyr)
library(knitr)
library(kableExtra)
library(ggplot2)
Let’s first load the data using the read.csv() function
and look at its variable names.
raw_data <- read.csv("repdata_data_StormData.csv.bz2", na.strings = "")
names(raw_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The 7 relevant Variables to answer the assignment questions are:
Creating a subset from the original database containing only these 7 variables. Changing variable names to more intuitive ones.
df <- raw_data %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
#renaming col names
colnames(df) <- c("event", "fatalities", "injuries", "property_dmg", "property_dmg_magnitute",
"crop_dmg", "crop_dmg_magnitute")
Some event types, representing the same event, are recorded differently due to typos or different labels.
groups <- df %>% group_by(event) %>% count()
groups [c(18,19,21,22, 42,43,44),] %>% kable() %>%
kable_styling(full_width = FALSE, position = "left")
| event | n |
|---|---|
| AVALANCE | 1 |
| AVALANCHE | 386 |
| Beach Erosion | 1 |
| BEACH EROSION | 3 |
| blowing snow | 2 |
| Blowing Snow | 3 |
| BLOWING SNOW | 12 |
Before proceeding to actual analysis, we need to correct this. Let’s
do it using grepl() function with
ignore.case = TRUE, so case is ignored during matching. We
finish this step transforming this variable in a factor.
df$event[grepl("AVALANCE", df$event, ignore.case = TRUE)] <- "AVALANCHE"
df$event[grepl("FLOOD|FLOOODING", df$event, ignore.case = TRUE)] <- "FLOOD"
df$event[grepl("TORN", df$event, ignore.case = TRUE)] <- "TORNADO"
df$event[grepl("TSTM|THUNDERSTORM", df$event, ignore.case = TRUE)] <- "TSTM"
df$event[grepl("TROPICAL|STORM", df$event, ignore.case = TRUE)] <- "STORM"
df$event[grepl("HURRICANE", df$event, ignore.case = TRUE)] <- "HURRICANE"
df$event[grepl("ICE|SNOW|FROST|SLEET", df$event, ignore.case = TRUE)] <- "SNOW"
df$event[grepl("FOG", df$event, ignore.case = TRUE)] <- "FOG"
df$event[grepl("COLD|WINDCHILL|FREEZE|WINTER|COOL|^LOW T", df$event, ignore.case = TRUE)] <- "COLD"
df$event[grepl("HEAT|WARM|HOT", df$event, ignore.case = TRUE)] <- "HEAT"
df$event[grepl("CLOUD|FUNNEL", df$event, ignore.case = TRUE)] <- "CLOUD"
df$event[grepl("HAIL", df$event, ignore.case = TRUE)] <- "HAIL"
df$event[grepl("DROUGHT|DRY", df$event, ignore.case = TRUE)] <- "DROUGHT"
df$event[grepl("LIGHTNING", df$event, ignore.case = TRUE)] <- "LIGHTNING"
df$event[grepl("FIRE", df$event, ignore.case = TRUE)] <- "FIRE"
df$event[grepl("RAIN|SHOWER", df$event, ignore.case = TRUE)] <- "RAIN"
df$event[grepl("WATERSPOUT", df$event, ignore.case = TRUE)] <- "WATERSPOUT"
df$event[grepl("SURF", df$event, ignore.case = TRUE)] <- "SURF"
df$event[grepl("CURRENT", df$event, ignore.case = TRUE)] <- "CURRENT"
df$event[grepl("WIND|MICROBURST", df$event, ignore.case = TRUE)] <- "WIND"
df$event[grepl("BLIZZARD", df$event, ignore.case = TRUE)] <- "BLIZZARD"
df$event[grepl("SLIDE", df$event, ignore.case = TRUE)] <- "LANDSLIDE"
df$event[grepl("DUST", df$event, ignore.case = TRUE)] <- "DUST"
df$event[grepl("^(BEACH ERO|COASTAL ERO)", df$event, ignore.case = TRUE)] <- "BEACH EROSION"
df$event[grepl("[?]|^SUMM", df$event, ignore.case = TRUE)] <- "UNKNOWN EVENT"
df$event[grepl("^BLOW-OUT TIDE", df$event, ignore.case = TRUE)] <- "BLOW-OUT TIDE"
df$event[grepl("^DAM", df$event, ignore.case = TRUE)] <- "DAM BREAK"
df$event[grepl("EXCESSIVE|^HIGH$|^OTH", df$event, ignore.case = TRUE)] <- "UNKNOWN EVENT"
df$event[grepl("^FREEZING DRI|Freezing Spray", df$event, ignore.case = TRUE)] <- "FREEZING DRIZZLE"
df$event[grepl("GLAZE", df$event, ignore.case = TRUE)] <- "GLAZE"
df$event[grepl("^GUSTNADO", df$event, ignore.case = TRUE)] <- "GUSTNADO"
df$event[grepl("^HEAVY PREC", df$event, ignore.case = TRUE)] <- "HEAVY PRECIPITATION"
df$event[grepl("^(HEAVY SE|HIGH SE|^ROUG)", df$event, ignore.case = TRUE)] <- "HEAVY SEAS"
df$event[grepl("^(HEAVY SWE|HIGH SWEL)", df$event, ignore.case = TRUE)] <- "HEAVY SWELLS"
df$event[grepl("^HIGH WAV|HIGH TID", df$event, ignore.case = TRUE)] <- "HIGH WAVES"
df$event[grepl("^HYPO", df$event, ignore.case = TRUE)] <- "HYPOTHERMIA/EXPOSURE"
df$event[grepl("^ICY R", df$event, ignore.case = TRUE)] <- "ICY ROADS"
df$event[grepl("^LANDSL", df$event, ignore.case = TRUE)] <- "LANDSLIDE"
df$event[grepl("^LIG", df$event, ignore.case = TRUE)] <- "LIGHTNING"
df$event[grepl("^MIXED PR", df$event, ignore.case = TRUE)] <- "MIXED PRECIPITATION"
df$event[grepl("^(RECORD HIGH$|RECORD LOW$)", df$event, ignore.case = TRUE)] <- "UNKNOWN EVENT"
df$event[grepl("^RECORD TEMP", df$event, ignore.case = TRUE)] <- "RECORD TEMPERATURE"
df$event[grepl("^(No Sev|NORMAL PREC)|NONE", df$event, ignore.case = TRUE)] <- "NO SEVERE WEATHER"
df$event[grepl("HIGH TEMP", df$event, ignore.case = TRUE)] <- "RECORD HIGH TEMPERATURE"
df$event[grepl("STREAM|STRM", df$event, ignore.case = TRUE)] <- "URBAN_SMALL STREAM"
df$event[grepl("^(TEMPERATURE RE|RECORD TEMP)", df$event, ignore.case = TRUE)] <- "TEMP RECORD"
df$event[grepl("ASH|PLUME", df$event, ignore.case = TRUE)] <- "VOLCANIC ASH"
df$event[grepl("SPOUT", df$event, ignore.case = TRUE)] <- "WATERSPOUT"
df$event[grepl("WINT", df$event, ignore.case = TRUE)] <- "WINTRY MIX"
df$event<-factor(df$event)
str(df$event)
## Factor w/ 86 levels "ABNORMALLY WET",..: 65 65 65 65 65 65 65 65 65 65 ...
This database has 2 different variables to register property and crop damage: one variable represents the numeric damage and the other indicates its magnitude, recorded as the following characters: “K” for thousands, “M” for millions, and “B” for billions. For some reason (maybe typos or other record errors) there are a few magnitude values that do not match the characters “K”, “M” and “B”. Before deciding what to do with them, let’s check the values proportions on that magnitudes strings.
#KMB index for magnitudes
magnitude.index <- df %>% mutate(
prop.mag.index = case_when(is.na(property_dmg_magnitute) ~ "NA",
property_dmg_magnitute == "K" ~ "KMB registered (Prop)",
property_dmg_magnitute == "M" ~ "KMB registered (Prop)",
property_dmg_magnitute == "B" ~ "KMB registered (Prop)",
TRUE ~ "other entries"),
crop.mag.index = case_when(is.na(crop_dmg_magnitute) ~ "NA",
crop_dmg_magnitute == "K" ~ "KMB registered (Crop)",
crop_dmg_magnitute == "M" ~ "KMB registered (Crop)",
crop_dmg_magnitute == "B" ~ "KMB registered (Crop)",
TRUE ~ "other entries"))
#property damage magnitude
magnitude.index %>% group_by(property_dmg_magnitute) %>% count() %>%
arrange(desc(n)) %>% head(10) %>% kable() %>%
kable_styling(full_width = FALSE, position = "left")
| property_dmg_magnitute | n |
|---|---|
| NA | 465934 |
| K | 424665 |
| M | 11330 |
| 0 | 216 |
| B | 40 |
| 5 | 28 |
| 1 | 25 |
| 2 | 13 |
| ? | 8 |
| m | 7 |
#crop damage magnitude
magnitude.index %>% group_by(crop_dmg_magnitute) %>% count() %>%
arrange(desc(n)) %>% head(10) %>% kable() %>%
kable_styling(full_width = FALSE, position = "left")
| crop_dmg_magnitute | n |
|---|---|
| NA | 618413 |
| K | 281832 |
| M | 1994 |
| k | 21 |
| 0 | 19 |
| B | 9 |
| ? | 7 |
| 2 | 1 |
| m | 1 |
t<-apply(magnitude.index[c("prop.mag.index", "crop.mag.index")], 2, function(x){
tbl <- table(x)
res <- cbind(tbl,paste0(round(prop.table(tbl)*100,2),"%"))
colnames(res) <- c('Count','Perc')
as.data.frame (res)})
kable(t, table.attr = "style='width:60%;'") %>%
kableExtra::kable_styling()
|
|
As we can see, the “other entries” category is basically irrelevant and we can just ignore it on further analysis. The NA entries usually occur when there’s no damage recorded to a specific event. No damage, no magnitude.
To compute economic damages, we are going to convert the KMB magnitude characters to their respective numeric values and multiply these values with property_dmg and crop_dmg.
To do this, we are going to use recode() function; it
will preserve the existing order of levels while changing the values.
Note that we decided to to set default=1, so all “other
entries” values will be given the value 1.
#Converting property magnitudes to numeric values
df$property_dmg_magnitute<-recode(df$property_dmg_magnitute,
'K'=1000,'M'=1000000,'B'=1000000000,.default=1)
#Converting crop magnitudes to numeric values
df$crop_dmg_magnitute<-recode(df$crop_dmg_magnitute,
'K'=1000,'M'=1000000,'B'=1000000000,.default=1)
# multiplying to the corresponding damage numeric variable. storing the result in new variables
df$PropDmgValue <- df$property_dmg*df$property_dmg_magnitute
df$CropDmgValue <- df$crop_dmg*df$crop_dmg_magnitute
head(df)[,-c(2,3)] %>% kable(table.attr = "style='width:100%;'") %>% kable_styling()
| event | property_dmg | property_dmg_magnitute | crop_dmg | crop_dmg_magnitute | PropDmgValue | CropDmgValue |
|---|---|---|---|---|---|---|
| TORNADO | 25.0 | 1000 | 0 | NA | 25000 | NA |
| TORNADO | 2.5 | 1000 | 0 | NA | 2500 | NA |
| TORNADO | 25.0 | 1000 | 0 | NA | 25000 | NA |
| TORNADO | 2.5 | 1000 | 0 | NA | 2500 | NA |
| TORNADO | 2.5 | 1000 | 0 | NA | 2500 | NA |
| TORNADO | 2.5 | 1000 | 0 | NA | 2500 | NA |
Assignment question 1, asks us to identify the most harmful events with respect to population health across the US.
To run this analysis, we will use the fatalities and the
injuries variables as proxies for population health and
look at the impact the event variable had on them.
Let’s first create a data frame with these 3 variables. We will use
the aggregate() function to sum fatalities and
injuries and place the result under the variable
fatalities_injures
The top 10 most harmful event types are illustrated below.
harmful <- aggregate(fatalities + injuries ~ event, df, FUN = sum)
harmful %>% slice_max(`fatalities + injuries`, n=10)
## event fatalities + injuries
## 1 TORNADO 97068
## 2 HEAT 12421
## 3 TSTM 10273
## 4 FLOOD 10129
## 5 LIGHTNING 6048
## 6 STORM 4634
## 7 WIND 2337
## 8 FIRE 1698
## 9 SNOW 1511
## 10 HURRICANE 1463
healthplot <- harmful %>% slice_max(`fatalities + injuries`, n=10)
ggplot(healthplot,
aes(x= reorder(event,desc(`fatalities + injuries`)), y= `fatalities + injuries`,
fill = event)) +
geom_bar(stat="identity") +
scale_fill_hue(c = 40) +
theme(legend.position="none")+
labs(title="10 Most Harmful Events to Population Health", x="Event Type", y="Count")
The data shows that the most harmful event to population health is a
Tornado, followed by Heat, Thunderstorms (TSTS) and Floods. Other event
impacts can be seen on the table and plot above.
Assignment question 2, asks us to identify the events economic consequences. For this analysis we will evaluated property and crop damage collectively. The top 10 event types with greatest economic consequences are shown below.
#new dataframe with economic damage
eco <- aggregate(PropDmgValue + CropDmgValue ~ event, df, FUN = sum)
eco %>% slice_max(`PropDmgValue + CropDmgValue`, n=10)
## event PropDmgValue + CropDmgValue
## 1 FLOOD 157764930803
## 2 HURRICANE 44283445830
## 3 TORNADO 18172843863
## 4 STORM 13215140400
## 5 HAIL 10046289587
## 6 TSTM 5513399354
## 7 FIRE 3838549570
## 8 WIND 3355326595
## 9 DROUGHT 1886540000
## 10 SNOW 1292667400
ecoplot <- eco %>% slice_max(`PropDmgValue + CropDmgValue`, n=10)
ggplot(ecoplot,
aes(x= reorder(event,desc(`PropDmgValue + CropDmgValue`)), y= `PropDmgValue + CropDmgValue`,
fill = event)) +
geom_bar(stat="identity") +
scale_fill_hue(c = 40) +
theme(legend.position="none")+
labs(title="Top 10 economically harmfull events", x="Event Type", y="Count")
According to NOAA Storm Database, the most economically harmful events are Floods, followed by Thunderstorms (TSTM), Hails and Tornados. Other relevant events can be seen on the table and plot above.