This report is written in part of the Course Project for the Reproducible Research course on Coursera by Johns Hopkins University.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The analysis of the data is conducted to answer the following questions:
EVTYPE
variable) are most harmful with respect to population health?The selected packages (“dplyr” and “ggplot”) are loaded into the workspace before the data is loaded as well. Some data cleaning is conducted, where significant variables for the analysis are selected, any existing missing values are handled, and some variables are transformed so that the data is less complex.
The Results section is divided into two parts: one for each research question. In each section, weather events are grouped together, and the total number of fatalities and injuries, and property and crop damages are calculated, respectively, in each section. Stacked bar plots are created in each section depicting the values calculated and displayed in tables at each section.
The packages included in this analysis are dplyr for data cleaning and ggplot2 for data visualization.
if(!require("dplyr")){
install.packages("dplyr")
}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if(!require("ggplot2")){
install.packages("ggplot2")
}
## Loading required package: ggplot2
library(dplyr)
library(ggplot2)
The data is downloaded from this link. The data is a comma-separated-value (csv) file compressed via the bzip2 algorithm to reduce its size. After the data is loaded, the structure and dimension of the data is inspected.
df <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), header = TRUE)
# Structure of data
str(df)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
# Dimension of data
dim(df)
## [1] 902297 37
Only selected variables are used in the analysis, and they are:
FATALITIES
- approximate number of deathsINJURIES
- approximate number of injuriesPROPDMG
- approximate property damagePROPDMGEXP
- property damage valueCROPDMG
- approximate crop damagesCROPDMGEXP
- crop damage valueEVTYPE
- weather eventThese variables are selected using the select()
function from dplyr.
mydata <- select(df, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, EVTYPE)
head(mydata)
## FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP EVTYPE
## 1 0 15 25.0 K 0 TORNADO
## 2 0 0 2.5 K 0 TORNADO
## 3 0 2 25.0 K 0 TORNADO
## 4 0 2 2.5 K 0 TORNADO
## 5 0 2 2.5 K 0 TORNADO
## 6 0 6 2.5 K 0 TORNADO
To ensure a complete and thorough analysis, any missing values should be handled with care. The following code checks for any missing values.
sapply(mydata, function(x) sum(is.na(x)))
## FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP EVTYPE
## 0 0 0 0 0 0 0
Based on the output, no missing values are recorded in the data.
Some variables contain duplicate values. It is vital to transform them not only to standardize the data, but also to reduce the complexity of the data.
EVTYPE
The target variable EVTYPE
contains the type of weather events. According to the output of the code below, there are 985 unique weather events listed in the data.
length(unique(mydata$EVTYPE))
## [1] 985
Some of them can be categorized into one weather event (i.e. MARINE HIGH WIND
AND MARINE STRONG WIND
can be grouped together as WIND
). The following coding is used to transform the variable EVTYPE
by grouping together similar weather events.
#Ccreate a new variable `EVENT` to transform the variable into groups
mydata$EVENT <- "OTHER"
# Group by keyword in `EVTYPE`
mydata$EVENT[grep("HAIL", mydata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
mydata$EVENT[grep("HEAT", mydata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
mydata$EVENT[grep("FLOOD", mydata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
mydata$EVENT[grep("WIND", mydata$EVTYPE, ignore.case = TRUE)] <- "WIND"
mydata$EVENT[grep("STORM", mydata$EVTYPE, ignore.case = TRUE)] <- "STORM"
mydata$EVENT[grep("SNOW", mydata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
mydata$EVENT[grep("TORNADO", mydata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
mydata$EVENT[grep("WINTER", mydata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
mydata$EVENT[grep("RAIN", mydata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
# Check the variable
table(mydata$EVENT)
##
## FLOOD HAIL HEAT OTHER RAIN SNOW STORM TORNADO WIND WINTER
## 82686 289270 2648 48970 12241 17660 113156 60700 255362 19604
PROPDMGEXP
AND CROPDMGEXP
The variables PROPDMGEXP
and CROPDMGEXP
contains the unit property and crop damages, respectively, in dollars.
table(mydata$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(mydata$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
K
and k
represent values in thousand dollars, M
and m
represent million dollars, and B
represent billion dollars. Similar units are grouped together, whereas NA
s are considered as normal dollar values.
# Convert to character type
mydata$PROPDMGEXP <- as.character(mydata$PROPDMGEXP)
# NA's considered as dollars
mydata$PROPDMGEXP[is.na(mydata$PROPDMGEXP)] <- 0
# Everything exept K,M,B is dollar
mydata$PROPDMGEXP[!grepl("K|M|B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- 0
# Change values in the `PROPDMGEXP` variable
mydata$PROPDMGEXP[grep("K", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$PROPDMGEXP[grep("M", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$PROPDMGEXP[grep("B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$PROPDMGEXP <- as.numeric(as.character(mydata$PROPDMGEXP))
# Create new variable where the actual property damage value is calculated
mydata$property.damage <- mydata$PROPDMG * 10^mydata$PROPDMGEXP
# Look at first ten (sorted) property damage values
sort(table(mydata$property.damage), decreasing = TRUE)[1:10]
##
## 0 5000 10000 1000 2000 25000 50000 3000 20000 15000
## 663123 31731 21787 17544 17186 17104 13596 10364 9179 8617
# Do the same with `CROPDMGEXP`
mydata$CROPDMGEXP <- as.character(mydata$CROPDMGEXP)
mydata$CROPDMGEXP[is.na(mydata$CROPDMGEXP)] <- 0
mydata$CROPDMGEXP[!grepl("K|M|B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- 0
mydata$CROPDMGEXP[grep("K", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$CROPDMGEXP[grep("M", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$CROPDMGEXP[grep("B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$CROPDMGEXP <- as.numeric(as.character(mydata$CROPDMGEXP))
mydata$crop.damage <- mydata$CROPDMG * 10^mydata$CROPDMGEXP
sort(table(mydata$crop.damage), decreasing = TRUE)[1:10]
##
## 0 5000 10000 50000 1e+05 1000 2000 25000 20000 5e+05
## 880198 4097 2349 1984 1233 956 951 830 758 721
This section shows the analysis conducted and the results to answer the research questions.
EVTYPE
variable) are most harmful with respect to population health?fatal.injury <- mydata %>%
group_by(EVENT) %>%
summarize(Total = sum(FATALITIES + INJURIES, na.rm = TRUE)) %>%
mutate(Percent = round(Total/sum(Total) * 100, 3),
Type = rep("Fatalities and Injuries"))
fatal.injury %>% arrange(desc(Total))
## # A tibble: 10 x 4
## EVENT Total Percent Type
## <chr> <dbl> <dbl> <chr>
## 1 TORNADO 97068 62.4 Fatalities and Injuries
## 2 OTHER 14850 9.54 Fatalities and Injuries
## 3 HEAT 12362 7.94 Fatalities and Injuries
## 4 WIND 10210 6.56 Fatalities and Injuries
## 5 FLOOD 10126 6.50 Fatalities and Injuries
## 6 STORM 5755 3.70 Fatalities and Injuries
## 7 WINTER 2169 1.39 Fatalities and Injuries
## 8 HAIL 1386 0.89 Fatalities and Injuries
## 9 SNOW 1328 0.853 Fatalities and Injuries
## 10 RAIN 419 0.269 Fatalities and Injuries
The table shows the total number of fatalities and injuries suffered by the population based on weather events. Based on the table, it is shown that fatalities and injuries caused by tornado-related events are the highest, making up 62.3% of the data, whereas rain-related events caused the least amount of fatalities and injuries, making up less than 0.3% of the data. This is also shown in the plot below. The values at each bar are the total counts on top and the percentage at the bottom.
fatalities <- mydata %>%
group_by(EVENT) %>%
summarize(Total = sum(FATALITIES, na.rm = TRUE)) %>%
mutate(Percent = round(Total/sum(Total) * 100, 3),
Type = rep("Fatalities"))
fatalities %>% arrange(desc(Total))
## # A tibble: 10 x 4
## EVENT Total Percent Type
## <chr> <dbl> <dbl> <chr>
## 1 TORNADO 5661 37.4 Fatalities
## 2 HEAT 3138 20.7 Fatalities
## 3 OTHER 2626 17.3 Fatalities
## 4 FLOOD 1524 10.1 Fatalities
## 5 WIND 1209 7.98 Fatalities
## 6 STORM 416 2.75 Fatalities
## 7 WINTER 278 1.84 Fatalities
## 8 SNOW 164 1.08 Fatalities
## 9 RAIN 114 0.753 Fatalities
## 10 HAIL 15 0.099 Fatalities
The second table shows the number and percentage of fatalaties based on each weather events. Similar to the earlier table, fatalities caused by tornado-related events are the highest, making up 37.4% percent of the data. But this time, hail-related events caused the least amount of fatality, with less that 0.1% of the fatalities are caused by it.
injuries <- mydata %>%
group_by(EVENT) %>%
summarize(Total = sum(INJURIES, na.rm = TRUE)) %>%
mutate(Percent = round(Total/sum(Total) * 100, 3),
Type = rep("Injuries"))
injuries %>% arrange(desc(Total))
## # A tibble: 10 x 4
## EVENT Total Percent Type
## <chr> <dbl> <dbl> <chr>
## 1 TORNADO 91407 65.0 Injuries
## 2 OTHER 12224 8.70 Injuries
## 3 HEAT 9224 6.56 Injuries
## 4 WIND 9001 6.40 Injuries
## 5 FLOOD 8602 6.12 Injuries
## 6 STORM 5339 3.80 Injuries
## 7 WINTER 1891 1.35 Injuries
## 8 HAIL 1371 0.976 Injuries
## 9 SNOW 1164 0.828 Injuries
## 10 RAIN 305 0.217 Injuries
This third table, on the other hand, details the number and percentage of injuries caused by weather events. Again, tornado-related events top the list, and rain-related events caused the least amount of injury.
All of the results are also shown in the bar plots below. As shown in the plot, the number of fatalities is not as notable compared to injuries.
combined.plot <- rbind(fatal.injury, fatalities, injuries)
ggplot(combined.plot, aes(x = EVENT, y = Total, fill = Type)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Weather Events", title = "Total Number of Fatalities and Injuries", subtitle = "Based on Weather Events")
prop.crop <- mydata %>%
group_by(EVENT) %>%
summarize(Total = sum(property.damage + crop.damage, na.rm = TRUE)) %>%
mutate(Percent = round(Total / sum(Total) * 100, 3),
Type = rep("Property and Crop Damages"))
prop.crop %>% arrange(desc(Total))
## # A tibble: 10 x 4
## EVENT Total Percent Type
## <chr> <dbl> <dbl> <chr>
## 1 FLOOD 179769100029. 37.7 Property and Crop Damages
## 2 OTHER 120835593207. 25.4 Property and Crop Damages
## 3 STORM 72678890281. 15.3 Property and Crop Damages
## 4 TORNADO 59010559549. 12.4 Property and Crop Damages
## 5 HAIL 18779880521. 3.94 Property and Crop Damages
## 6 WIND 12250885768. 2.57 Property and Crop Damages
## 7 WINTER 6824739251 1.43 Property and Crop Damages
## 8 RAIN 4189545992 0.879 Property and Crop Damages
## 9 SNOW 1158852852. 0.243 Property and Crop Damages
## 10 HEAT 924795030 0.194 Property and Crop Damages
The table above shows the total amount of property and crop damages caused by each weather event. Miscellaneous weather events dominated the list, with nearly 50% of the damages are caused by them. On the other hand, winter-related events caused the least amount of damage, making up less than 0.1% of the data.
property <- mydata %>%
group_by(EVENT) %>%
summarize(Total = sum(property.damage, na.rm = TRUE)) %>%
mutate(Percent = round(Total / sum(Total) * 100, 3),
Type = rep("Property Damage"))
property %>% arrange(desc(Total))
## # A tibble: 10 x 4
## EVENT Total Percent Type
## <chr> <dbl> <dbl> <chr>
## 1 FLOOD 167502193929. 39.2 Property Damage
## 2 OTHER 97246712337. 22.8 Property Damage
## 3 STORM 66304415393. 15.5 Property Damage
## 4 TORNADO 58593098029. 13.7 Property Damage
## 5 HAIL 15733043048. 3.68 Property Damage
## 6 WIND 10847166618. 2.54 Property Damage
## 7 WINTER 6777295251 1.59 Property Damage
## 8 RAIN 3270230192 0.765 Property Damage
## 9 SNOW 1024169752. 0.24 Property Damage
## 10 HEAT 20325750 0.005 Property Damage
This table shows the amount of property damages caused by each weather event. Based on the table above, nearly 30% of the property damages suffered in the US is caused by tornado-related events, while heat-related events make up 0.03% of the total property damages suffered by the country.
crop <- mydata %>%
group_by(EVENT) %>%
summarize(Total = sum(crop.damage, na.rm = TRUE)) %>%
mutate(Percent = round(Total / sum(Total) * 100, 3),
Type = rep("Crop Damage"))
crop %>% arrange(desc(Total))
## # A tibble: 10 x 4
## EVENT Total Percent Type
## <chr> <dbl> <dbl> <chr>
## 1 OTHER 23588880870 48.0 Crop Damage
## 2 FLOOD 12266906100 25.0 Crop Damage
## 3 STORM 6374474888 13.0 Crop Damage
## 4 HAIL 3046837473 6.20 Crop Damage
## 5 WIND 1403719150 2.86 Crop Damage
## 6 RAIN 919315800 1.87 Crop Damage
## 7 HEAT 904469280 1.84 Crop Damage
## 8 TORNADO 417461520 0.85 Crop Damage
## 9 SNOW 134683100 0.274 Crop Damage
## 10 WINTER 47444000 0.097 Crop Damage
Based on the third table above, which shows the amount of crop damages suffered by the US, miscellaneous events make up nearly half of the total damages. Winter-related events caused the least amount of crop damage (less than 0.1%). We can see that this table is similar to the first table shown in this section.
The bar plot below shows the amount and distribution of the property and crop damages of each weather event. As shown in the plot, crop damages are not as prominent as property damages.
combined.plot2 <- rbind(prop.crop, property, crop)
ggplot(combined.plot2, aes(x = EVENT, y = Total/1000000, fill = Type)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Weather Events", y = "Total",
title = "Total Amount of Property and Crop Damages",
subtitle = "Based on Weather Events")