Storms and other severe weather events can cause both public health and economic problems for communities and municipalities.
This assignment explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The basic goal is to use the database to answer the questions below and show the code for the entire analysis.
The data for this assignment come as a CSV file compressed with the bzip2 algorithm to reduce its size. The file can be downloaded from here
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The analysis below aims to answer two questions:
1. Across the United States, which types of events are the most harmful to population health?
2. Across the United States, which types of events have the greatest economic consequences?
First, we want to load all the necessary packages.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Then, we import and read the dataset. We call it “weatherData”.
weatherData <- read_csv("~/Downloads/repdata_data_StormData.csv.bz2")
## Rows: 902297 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): BGN_DATE, BGN_TIME, TIME_ZONE, COUNTYNAME, STATE, EVTYPE, BGN_AZI,...
## dbl (18): STATE__, COUNTY, BGN_RANGE, COUNTY_END, END_RANGE, LENGTH, WIDTH, ...
## lgl (1): COUNTYENDN
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
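The compressed file does not have to be unpacked by hand before reading. As a minimal sketch of why, the base R example below writes a tiny made-up table to a bzip2-compressed CSV in a temporary location and reads it back through a `bzfile()` connection. The rows and the `toy` names are illustrative assumptions, not part of the NOAA data.

```r
# Illustrative rows only; these are not records from the NOAA database.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(2, 0, 1),
  INJURIES   = c(10, 3, 0)
)

# Write the toy table as a bzip2-compressed CSV in a temporary location.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile() connection; decompression happens on the fly.
toy_back <- read.csv(bzfile(path), stringsAsFactors = FALSE)
print(toy_back)
```

`read_csv()` from readr, as used above, handles the `.bz2` extension in the same transparent way.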
Now, we inspect the data.
names(weatherData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
For the sake of the analysis, we measure harm to population health by the fatalities and injuries caused by each type of weather event. Let’s look at them separately.
First, we create a smaller dataset grouped by event type, so we can see the total number of fatalities caused by each type of weather event.
fatalities_weatherData<-weatherData%>%
group_by(EVTYPE)%>%
summarise(totalFATALITIES=sum(FATALITIES))%>%
arrange(desc(totalFATALITIES))
Now let’s do the same for the total number of injuries, following the same method as above.
injuries_weatherData<-weatherData%>%
group_by(EVTYPE)%>%
summarise(totalINJURIES=sum(INJURIES))%>%
arrange(desc(totalINJURIES))
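Now let’s look at which kinds of weather events cause the most injuries and fatalities combined. For this, we add a column holding the per-record sum of fatalities and injuries, then aggregate it by event type.
weatherData$total <- rowSums(weatherData[, c("FATALITIES", "INJURIES")], na.rm = TRUE)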
topHealth<-weatherData%>%
group_by(EVTYPE)%>%
summarise(total=sum(total,na.rm=TRUE))%>%
arrange(desc(total))
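The three health summaries above could also be produced in a single pass. The sketch below shows the idea in base R with `aggregate()` on a small made-up table; the counts are invented for illustration, and `health` is a hypothetical name, not an object from the analysis.

```r
# Illustrative records only; the counts are made up for the example.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  INJURIES   = c(20, 5, 10, 1, 2)
)

# Sum fatalities and injuries per event type in a single call.
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined casualty count, then rank from most to least harmful.
health$total <- health$FATALITIES + health$INJURIES
health <- health[order(health$total, decreasing = TRUE), ]
print(health)
```

Note that the formula interface of `aggregate()` drops rows with missing values by default, which plays a role similar to `na.rm = TRUE` in the dplyr version.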
In this dataset, the economic consequences of weather events are captured through property and crop damage. The amounts are stored in the PROPDMG and CROPDMG columns, and their magnitudes are encoded in the corresponding “EXP” columns.
unique(weatherData$PROPDMGEXP)
## [1] "K" "M" NA "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(weatherData$CROPDMGEXP)
## [1] NA "M" "K" "m" "B" "?" "0" "k" "2"
Note that digits, letters, and upper- and lower-case variants are all mixed together. Therefore, we need a function that maps each code to the exponent of 10 it stands for.
getVal <- function(expType) {
  # Letter codes: hundreds, thousands, millions, billions
  if (expType %in% c('h', 'H')) {
    return(2)
  } else if (expType %in% c('k', 'K')) {
    return(3)
  } else if (expType %in% c('m', 'M')) {
    return(6)
  } else if (expType %in% c('b', 'B')) {
    return(9)
  } else if (suppressWarnings(!is.na(as.numeric(expType)))) {
    # Digit codes are used directly as the exponent
    return(as.numeric(expType))
  } else {
    # NA and symbols such as "?", "+", "-" leave the amount unscaled (10^0 = 1)
    return(0)
  }
}
c(10**getVal('h'), 10**getVal(4), 10**getVal('B'), 10**getVal('?'))
## [1] 1e+02 1e+04 1e+09 1e+00
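Because `getVal()` handles one code at a time, it has to be applied row by row. A vectorised lookup is one possible alternative for a table of this size. The sketch below is an assumption-laden variant rather than the method used in this analysis: `letter_exp` and `exp_of` are hypothetical names, and unlike `getVal()` it only treats single digits as numeric exponents.

```r
# Exponent implied by each letter code; codes are matched after upper-casing.
letter_exp <- c(H = 2, K = 3, M = 6, B = 9)

# Vectorised counterpart of getVal(): returns one exponent per input code.
exp_of <- function(code) {
  code <- toupper(as.character(code))
  out <- unname(letter_exp[code])          # NA for anything that is not H/K/M/B
  is_digit <- !is.na(code) & grepl("^[0-9]$", code)
  out[is_digit] <- as.numeric(code[is_digit])
  out[is.na(out)] <- 0                     # "?", "+", "-", NA and the like
  out
}

codes <- c("K", "m", "B", "h", "5", "?", NA)
print(10^exp_of(codes))
```

With such a helper, the damage columns could be computed in one step, for example `PROPDMG * 10^exp_of(PROPDMGEXP)`, without `rowwise()`.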
Next, we build a table that applies the function row by row and computes the actual value of property and crop damage.
weatherData_new<-weatherData[,c(8,25:28)]%>%
rowwise()%>%
mutate(PROP = PROPDMG*10**getVal(PROPDMGEXP),
CROP = CROPDMG*10**getVal(CROPDMGEXP))
head(weatherData_new)
## # A tibble: 6 × 7
## # Rowwise:
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP PROP CROP
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 TORNADO 25 K 0 <NA> 25000 0
## 2 TORNADO 2.5 K 0 <NA> 2500 0
## 3 TORNADO 25 K 0 <NA> 25000 0
## 4 TORNADO 2.5 K 0 <NA> 2500 0
## 5 TORNADO 2.5 K 0 <NA> 2500 0
## 6 TORNADO 2.5 K 0 <NA> 2500 0
weatherData_new_sum<-weatherData_new[,c(1,6,7)]%>%
group_by(EVTYPE)%>%
summarise_all(sum)
head(weatherData_new_sum)
## # A tibble: 6 × 3
## EVTYPE PROP CROP
## <chr> <dbl> <dbl>
## 1 ? 5000 0
## 2 ABNORMAL WARMTH 0 0
## 3 ABNORMALLY DRY 0 0
## 4 ABNORMALLY WET 0 0
## 5 ACCUMULATED SNOWFALL 0 0
## 6 AGRICULTURAL FREEZE 0 28820000
Rank the event types by property damage, so that the first rows are the types that cause the most property damage.
topProperty <- weatherData_new_sum[order(weatherData_new_sum$PROP,decreasing = TRUE),]
Rank the event types by crop damage in the same way.
topCrop <- weatherData_new_sum[order(weatherData_new_sum$CROP,decreasing = TRUE),]
Finally, add property and crop damage together and rank the event types by total damage.
weatherData_new_sum$total <- rowSums(weatherData_new_sum[, 2:3])
topEcon <- weatherData_new_sum[order(weatherData_new_sum$total, decreasing = TRUE), ]
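The sorted tables above still contain every event type; the top 10 are simply the first 10 rows. As a minimal sketch, a small helper can make that selection explicit. The table below uses invented dollar figures, and `top_n_by` is a hypothetical helper name, not a function from any package used here.

```r
# Illustrative totals in dollars; not values from the NOAA database.
toy_sum <- data.frame(
  EVTYPE = c("FLOOD", "HAIL", "DROUGHT", "TORNADO"),
  PROP   = c(900, 300, 50, 600),
  CROP   = c(100, 200, 700, 10)
)
toy_sum$total <- toy_sum$PROP + toy_sum$CROP

# Hypothetical helper: the n rows with the largest value in column `by`.
top_n_by <- function(df, by, n = 10) {
  head(df[order(df[[by]], decreasing = TRUE), ], n)
}

top_total <- top_n_by(toy_sum, "total", n = 2)  # ranked by combined damage
top_crop  <- top_n_by(toy_sum, "CROP", n = 2)   # ranked by crop damage only
print(top_total)
print(top_crop)
```

The same ranking logic serves all three tables; only the column passed as `by` changes.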
head(fatalities_weatherData)
## # A tibble: 6 × 2
## EVTYPE totalFATALITIES
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
head(injuries_weatherData)
## # A tibble: 6 × 2
## EVTYPE totalINJURIES
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
When fatalities and injuries are added together and event types are ranked by that total, the events most harmful to population health are as follows.
head(topHealth)
## # A tibble: 6 × 2
## EVTYPE total
## <chr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
The following figure depicts the top 10 event types that harm population health (sum of fatalities and injuries).
ggplot(topHealth[1:10,],aes(reorder(EVTYPE,total), total))+
geom_col(fill="darkred")+
coord_flip()+
labs(title = "Top 10 weather events that cause damage to public health",
x = "Event type",
y = "Total injuries and fatalities")
As the figure shows, tornadoes are by far the weather events that harm public health the most.
Ranked separately by property damage, by crop damage, and by their sum, the event types that cause the most damage are as follows.
head(topProperty)
## # A tibble: 6 × 3
## EVTYPE PROP CROP
## <chr> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450
## 2 HURRICANE/TYPHOON 69305840000 2607872800
## 3 TORNADO 56947380676. 414953270
## 4 STORM SURGE 43323536000 5000
## 5 FLASH FLOOD 16822723978. 1421317100
## 6 HAIL 15735267513. 3025954473
head(topCrop)
## # A tibble: 6 × 3
## EVTYPE PROP CROP
## <chr> <dbl> <dbl>
## 1 DROUGHT 1046106000 13972566000
## 2 FLOOD 144657709807 5661968450
## 3 RIVER FLOOD 5118945500 5029459000
## 4 ICE STORM 3944927860 5022113500
## 5 HAIL 15735267513. 3025954473
## 6 HURRICANE 11868319010 2741910000
head(topEcon)
## # A tibble: 6 × 4
## EVTYPE PROP CROP total
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56947380676. 414953270 57362333946.
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15735267513. 3025954473 18761221986.
## 6 FLASH FLOOD 16822723978. 1421317100 18244041078.
The following figure depicts the top 10 event types that cause economic damage (sum of crop and property damage).
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
ggplot(topEcon[1:10,], aes(reorder(EVTYPE, total), total)) +
geom_col(fill = "purple") +
coord_flip() +
scale_y_continuous(labels = comma) + # Use scale_y_continuous since total is on y-axis
labs(
title = "Top 10 Weather Events Causing Economic Damage",
x = "Weather Event",
y = "Total Property and Crop Damage")
As the figure shows, floods cause by far the largest amount of combined property and crop damage.
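The damage totals run to eleven or twelve digits, which is hard to read even with comma separators. One option, sketched here under the assumption that billions of dollars is the preferred unit, is a small formatting helper; `billions` is a hypothetical name, and the inputs are the top three totals from the `topEcon` table above.

```r
# Hypothetical helper: format dollar amounts as billions with one decimal place.
billions <- function(x) {
  paste0("$", formatC(x / 1e9, format = "f", digits = 1), "B")
}

labels <- billions(c(150319678257, 71913712800, 57362333946))
print(labels)
```

Since the `labels` argument of `scale_y_continuous()` also accepts a function, a helper of this kind could be passed in place of `comma`.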