This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to address the following questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Population health impact is determined by the number of fatalities and injuries caused by the severe weather, while economic damage is calculated by the total cost damage to crops and properties of the public.
The analysis shows that, across US, excessive heat caused the greatest number of fatalities, tornado caused the greatest number of injuries and flood was the most damaging severe weather in terms of economic consequences.
The database and variables used in the analysis are comprehensive described at:
Load the required libraries
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
Load the data
Download the compressed data file (bz2 file) and using read.csv to read ‘bz2’ directly without the need to decompress it. Note that, to prevent repeatedly download the compressed file everytime you run the code, add a command that make R check if the file does locally exist. Make sure the data file was downloaded correctly. It must be 46.8 MB (49,177,144 bytes) with 902297 records** on 37 variables.
if(!exists("./data")){dir.create("./data")}
stormDataFileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
stormDataFile <- "data/stormData.csv.bz2"
if (!file.exists(stormDataFile)) {
download.file(url = stormDataFileUrl, destfile = stormDataFile)
}
stopifnot(file.size(stormDataFile) == 49177144)
stormData <- read.csv(stormDataFile, header = TRUE)
stopifnot(dim(stormData) == c(902297,37))
Seeing structure of the storm dataset
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Data processing
The dataset is big and contains many data irrelevant to the purposes of the analysis. So subsetting only those that are needed.
The analysis want to address two questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
After reading the Storm Data Documentation and inspecting all the variables of the storm dataset, I decided that, variables listed below are those needed for doing the analysis:
| Variable | Description |
|---|---|
| EVTYPE | Event type (Flood, Heat, Hurricane, Tornado, …) |
| FATALITIES | Number of fatalities resulting from event |
| INJURIES | Number of injuries resulting from event |
| PROPDMG | Property damage in USD |
| PROPDMGEXP | Unit multiplier for property damage |
| CROPDMG | Crop damage in USD |
| CROPDMGEXP | Unit multiplier for property damage |
| BGN_DATE | Begin date of the event |
According to NOAA the data recording start from Jan. 1950. At that time they recorded one event type, tornado. They add more events gradually and only from Jan. 1996 they start recording all events type. Since my objective is comparing the effects of different weather events, I only included records from Jan. 1996 onwards.
The dataset contains a lot of observations without any information about health and/or economic damages. These observations are excluded from the analysis.
stormData$BGN_DATE <- mdy_hms(stormData$BGN_DATE)
interestedData <- stormData %>%
select(c("EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP",
"BGN_DATE")) %>%
filter(year(BGN_DATE) >= 1996) %>%
filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
The official events type are 48. However, there are total 222 unique event types in the current dataset.
length(unique(interestedData$EVTYPE))
## [1] 222
That is because, many values are very similar. Just some changes in the case, spelling that make them become different event type. Solving this problem by changing all the words in EVTYPE column to uppercase and using gsub to combine similar event type into unique one.
interestedData$EVTYPE <- toupper(interestedData$EVTYPE)
## create a new variable to store the event types after cleaning
interestedData$CLEAN_EVTYPE <- NULL
## combine similar event types into unique group
interestedData$CLEAN_EVTYPE <- gsub('.*HAIL.*', 'HAIL', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*HEAT.*', 'HEAT', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*FLOOD.*', 'FLOOD', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*WIND.*', 'WIND', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*STORM.*', 'STORM', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*SNOW.*', 'SNOW', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*TORNADO.*', 'TORNADO', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*WINTER.*', 'WINTER', interestedData$EVTYPE)
interestedData$CLEAN_EVTYPE <- gsub('.*RAIN.*', 'RAIN', interestedData$EVTYPE)
length(unique(interestedData$CLEAN_EVTYPE))
## [1] 177
The current data now has 177 event types.
According to National Weather Service Storm Data Documentation, the property damage and crop damages prodivded in the storm dataset comes with a seperate exponent that is PROPDMGEXP and CROPDMGEXP. The values of those variables are shown as below:
table(stormData$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(stormData$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
This in-depth work has shown us how to interpret those values.
Possible values of PROPDMGEXP and CROPDMGEXP:
H,h = hundreds = 100
K,k = kilos = thousands = 1,000
M,m = millions = 1,000,000
B,b = billions = 1,000,000,000
(+) = 1
(-) = 0
(?) = 0
black/empty character = 0
numeric 0..8 = 10
In order to calculate costs, the PROPDMGEXP and CROPDMGEXP variables will be mapped to a multiplier factor which will then be used to calculate the actual costs for both property and crop damage. Two new variables will be created to store damage costs:
# function to get multiplier factor
getMultiplier <- function(exp) {
exp <- toupper(exp);
if (exp == "") return (10^0);
if (exp == "-") return (10^0);
if (exp == "?") return (10^0);
if (exp == "+") return (10^0);
if (exp == "0") return (10^0);
if (exp == "1") return (10^1);
if (exp == "2") return (10^2);
if (exp == "3") return (10^3);
if (exp == "4") return (10^4);
if (exp == "5") return (10^5);
if (exp == "6") return (10^6);
if (exp == "7") return (10^7);
if (exp == "8") return (10^8);
if (exp == "9") return (10^9);
if (exp == "H") return (10^2);
if (exp == "K") return (10^3);
if (exp == "M") return (10^6);
if (exp == "B") return (10^9);
return (NA);
}
# calculate property damage and crop damage costs
interestedData$PROP_COST <- with(interestedData, as.numeric(PROPDMG) * sapply(PROPDMGEXP, getMultiplier))
interestedData$CROP_COST <- with(interestedData, as.numeric(CROPDMG) * sapply(CROPDMGEXP, getMultiplier))
healthImpact <- interestedData %>%
group_by(EVTYPE) %>%
summarise(FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES))
tidyHealthImpact <- pivot_longer(healthImpact,
cols = -EVTYPE,
names_to = "health.impact",
values_to = "values")
fatalityData <- tidyHealthImpact %>%
filter(health.impact == "FATALITIES") %>%
arrange(desc(values)) %>%
slice(1:10)
injuryData <- tidyHealthImpact %>%
filter(health.impact == "INJURIES") %>%
arrange(desc(values)) %>%
slice(1:10)
economicImpactData <- aggregate(x = list(DAMAGE_IMPACT = interestedData$PROP_COST + interestedData$CROP_COST),
by = list(EVENT_TYPE = interestedData$EVTYPE),
FUN = sum,
na.rm = TRUE)
economicImpactData <- economicImpactData[order(economicImpactData$DAMAGE_IMPACT, decreasing = TRUE),]
fatalityPlot <- ggplot(fatalityData)+
geom_bar(aes(x = EVTYPE, y = values), stat = "identity", fill = "red", alpha = 0.5)+
coord_flip()+
ggtitle ("Top 10 weather events with greatest number of fatalities")+
ylab("Number of fatalities")+
xlab("Event type")
injuryPlot <- ggplot(injuryData)+
geom_bar(aes(x = EVTYPE, y = values), stat = "identity", fill = "blue", alpha = 0.5)+
coord_flip()+
ggtitle ("Top 10 weather events with greatest number of injuries")+
ylab("Number of injuries")+
xlab("Event type")
gridExtra::grid.arrange(fatalityPlot, injuryPlot)
So far, excessive heat caused the greatest number of fatalities in the US, while tornado caused the greatest number of injuries.
economicImpactPlot <- ggplot(head(economicImpactData, 10),
aes(x = EVENT_TYPE, y = DAMAGE_IMPACT, fill = EVENT_TYPE)) +
coord_flip() +
geom_bar(stat = "identity") +
xlab("Event Type") +
ylab("Total Property / Crop Damage Cost") +
ggtitle("Top 10 Weather Events with\nGreatest Economic Consequences")+
theme(legend.position = "none")
print(economicImpactPlot)
Across the United States, flood have the greatest economic consequences.