Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Set up directory for downloading data
dir.create("./data", showWarnings = FALSE)
downloadURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
downloadedFile <- "./data/repdata_data_StormData.csv.bz2"
Download and unzip data
library(R.utils)
if (!file.exists(downloadedFile)) {
  download.file(downloadURL, downloadedFile, method = "curl")
  bunzip2(downloadedFile, destname = "./data/stormdata.csv", remove = FALSE)
}
Check if required data is downloaded
file.exists(downloadedFile)
## [1] TRUE
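As a side note, base R can read bzip2-compressed text files directly, so the explicit decompression step is optional. The sketch below is not part of the original analysis: it writes a tiny made-up CSV to a temporary .bz2 file and reads it back with read.csv, which should detect the compression automatically.

```r
# Write a small, made-up CSV into a bzip2-compressed temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0", "HAIL,1"), con)
close(con)

# read.csv opens the path through file(), which detects bzip2 compression
toy <- read.csv(tmp, stringsAsFactors = FALSE)
toy
```

Under that assumption, the storm data could be loaded straight from downloadedFile, at the cost of decompressing it on every read.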
The full data set has 902,297 rows and 37 columns. To reduce loading time, we read only the columns needed for this analysis. These columns and their indexes are:
| Column name | Column index |
|---|---|
| EVTYPE | 8 |
| FATALITIES | 23 |
| INJURIES | 24 |
| PROPDMG | 25 |
| PROPDMGEXP | 26 |
| CROPDMG | 27 |
| CROPDMGEXP | 28 |
To load only these columns, we set their classes to numeric or character and set the class of every other column to "NULL", which tells read.csv to skip it:
stormdata <- read.csv("./data/stormdata.csv",
                      colClasses = c(rep("NULL", 7),
                                     "character",
                                     rep("NULL", 14),
                                     rep("numeric", 3),
                                     "character",
                                     "numeric",
                                     "character",
                                     rep("NULL", 9)),
                      sep = ",",
                      header = TRUE)
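Counting runs of "NULL" by hand is easy to get wrong. As a minimal sketch, the same colClasses vector can be built from the column indexes in the table above; the names n_cols and col_classes are illustrative and do not appear in the original code.

```r
n_cols <- 37  # total number of columns in the raw file

# Start by skipping every column, then switch on the ones we need
col_classes <- rep("NULL", n_cols)
col_classes[c(8, 26, 28)] <- "character"     # EVTYPE, PROPDMGEXP, CROPDMGEXP
col_classes[c(23, 24, 25, 27)] <- "numeric"  # FATALITIES, INJURIES, PROPDMG, CROPDMG

# Positions of the columns that will actually be read
which(col_classes != "NULL")  # 8 23 24 25 26 27 28
```

The resulting vector can be passed as colClasses = col_classes in the read.csv call.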
Take a quick look at the data:
head(stormdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
Summary:
summary(stormdata)
## EVTYPE FATALITIES INJURIES PROPDMG
## Length:902297 Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Mode :character Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
PROPDMGEXP and CROPDMGEXP hold a magnitude symbol (such as K, M, or B) that scales the corresponding PROPDMG or CROPDMG value.
For more information, see here or here.
1. Convert the exponent columns to numbers:
Create a converter data frame for each symbol and its value:
symbol <- c("B", "b", "M", "m", "K", "k", "H", "h",
            "-", "+", "?", as.character(0:10), "")
value <- c(rep(10^9, 2), rep(10^6, 2), rep(10^3, 2), rep(10^2, 2),
           rep(10^0, 3), 10^c(0:10), 10^0)
converter <- data.frame(Symbol = symbol, Value = value)
head(converter)
## Symbol Value
## 1 B 1e+09
## 2 b 1e+09
## 3 M 1e+06
## 4 m 1e+06
## 5 K 1e+03
## 6 k 1e+03
Replace each symbol by its value in the stormdata:
## process for PROPDMGEXP
temp <- sapply(stormdata$PROPDMGEXP,
               function(ele) converter[converter$Symbol == ele, 2])
stormdata$PROPDMGEXP <- unlist(temp, use.names = FALSE)
## process for CROPDMGEXP
temp <- sapply(stormdata$CROPDMGEXP,
               function(ele) converter[converter$Symbol == ele, 2])
stormdata$CROPDMGEXP <- unlist(temp, use.names = FALSE)
head(stormdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 1000 0 1
## 2 TORNADO 0 0 2.5 1000 0 1
## 3 TORNADO 0 2 25.0 1000 0 1
## 4 TORNADO 0 2 2.5 1000 0 1
## 5 TORNADO 0 2 2.5 1000 0 1
## 6 TORNADO 0 6 2.5 1000 0 1
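The row-by-row sapply lookup works but scans the converter once per record. A vectorized alternative, sketched here with base R's match(), uses the same symbol-to-value mapping; the function name exp_to_multiplier is hypothetical and not part of the original analysis.

```r
# Same mapping as the converter data frame
symbol <- c("B", "b", "M", "m", "K", "k", "H", "h",
            "-", "+", "?", as.character(0:10), "")
value <- c(rep(10^9, 2), rep(10^6, 2), rep(10^3, 2), rep(10^2, 2),
           rep(10^0, 3), 10^c(0:10), 10^0)

# Hypothetical helper: look up all multipliers in one vectorized call.
# Symbols missing from the mapping come back as NA instead of being dropped.
exp_to_multiplier <- function(x) value[match(x, symbol)]

exp_to_multiplier(c("K", "", "M", "5", "?"))  # 1e+03 1e+00 1e+06 1e+05 1e+00
```

With this helper, the conversion would read stormdata$PROPDMGEXP <- exp_to_multiplier(stormdata$PROPDMGEXP), and likewise for CROPDMGEXP.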
2. Recalculate the property and crop damage in the PROPDMG and CROPDMG columns:
Multiply each damage value by its multiplier:
stormdata$PROPDMG <- stormdata$PROPDMG * stormdata$PROPDMGEXP
stormdata$CROPDMG <- stormdata$CROPDMG * stormdata$CROPDMGEXP
head(stormdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25000 1000 0 1
## 2 TORNADO 0 0 2500 1000 0 1
## 3 TORNADO 0 2 25000 1000 0 1
## 4 TORNADO 0 2 2500 1000 0 1
## 5 TORNADO 0 2 2500 1000 0 1
## 6 TORNADO 0 6 2500 1000 0 1
Remove PROPDMGEXP and CROPDMGEXP columns:
stormdata$PROPDMGEXP <- NULL
stormdata$CROPDMGEXP <- NULL
head(stormdata)
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1 TORNADO 0 15 25000 0
## 2 TORNADO 0 0 2500 0
## 3 TORNADO 0 2 25000 0
## 4 TORNADO 0 2 2500 0
## 5 TORNADO 0 2 2500 0
## 6 TORNADO 0 6 2500 0
In this part, we group the records that share the same EVTYPE and then sum each remaining column within each group:
library(dplyr)
group <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise_all(sum)
head(group)
## # A tibble: 6 × 5
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 " HIGH SURF ADVISORY" 0 0 200000 0
## 2 " COASTAL FLOOD" 0 0 0 0
## 3 " FLASH FLOOD" 0 0 50000 0
## 4 " LIGHTNING" 0 0 0 0
## 5 " TSTM WIND" 0 0 8100000 0
## 6 " TSTM WIND (G45)" 0 0 8000 0
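The grouping step can be illustrated on a small made-up data frame. The sketch below uses base R's aggregate() instead of dplyr so that it runs without extra packages; all values are invented for illustration. It also shows an optional step that the analysis above does not perform: trimming whitespace from EVTYPE so that labels differing only by leading spaces fall into one group.

```r
# Made-up records; note the leading space in the first label
toy <- data.frame(
  EVTYPE = c(" TSTM WIND", "TSTM WIND", "HAIL", "HAIL"),
  FATALITIES = c(1, 2, 0, 1),
  PROPDMG = c(8000, 2000, 500, 0),
  stringsAsFactors = FALSE
)

# Sum every other column within each EVTYPE (same idea as group_by + sum)
raw_totals <- aggregate(. ~ EVTYPE, data = toy, FUN = sum)

# Optional: normalize labels first so whitespace variants are merged
toy$EVTYPE <- trimws(toy$EVTYPE)
clean_totals <- aggregate(. ~ EVTYPE, data = toy, FUN = sum)

nrow(raw_totals)    # 3 groups before trimming
nrow(clean_totals)  # 2 groups after trimming
```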
Function to generate a bar plot:
The function generate.bar takes the following inputs:
- df: data frame
- x: values for the x-axis
- y: values for the y-axis
- x.lab: label for the x-axis
- y.lab: label for the y-axis
- title: plot title
library(ggplot2)
generate.bar <- function(df, x, y, x.lab, y.lab, title) {
  p <- ggplot(df, aes(x = x, y = y, fill = x)) +
    geom_bar(stat = "identity") +
    xlab(x.lab) +
    ylab(y.lab) +
    ggtitle(title) +
    theme(legend.position = "none") +  ## remove legends
    theme(text = element_text(size = 12)) +  ## resize text
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))  ## rotate labels
  p  ## return the plot explicitly
}
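One design point worth noting: ggplot2 places the values of a character x variable in alphabetical order, so the bars will not appear in rank order by default. A minimal sketch, not part of the original analysis, is to convert the event type to a factor whose levels follow the totals before plotting. The figures used here are the fatality totals reported in this document.

```r
# Rows deliberately listed alphabetically, as ggplot2 would order them
top <- data.frame(
  EVTYPE = c("EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING", "TORNADO"),
  FATALITIES = c(1903, 978, 937, 816, 5633),
  stringsAsFactors = FALSE
)

# Set factor levels in descending order of the total so bars follow the ranking
top$EVTYPE <- factor(top$EVTYPE, levels = top$EVTYPE[order(-top$FATALITIES)])

levels(top$EVTYPE)
```

Passing the reordered top$EVTYPE as the x argument of generate.bar should then draw the bars from largest to smallest.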
Which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
In this part, we generate bar plots to find the 5 weather events that are most harmful to the US population for each type of harm:
In terms of fatalities:
Extract the 5 events with the most fatalities:
fatalities <- group[order(-group$FATALITIES), ]
fatalities <- fatalities[1:5, ]
head(fatalities)
## # A tibble: 5 × 5
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 56947380676. 414953270
## 2 EXCESSIVE HEAT 1903 6525 7753700 492402000
## 3 FLASH FLOOD 978 1777 16822673978. 1421317100
## 4 HEAT 937 2100 1797000 401461500
## 5 LIGHTNING 816 5230 930379430. 12092090
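The same order-then-subset pattern is repeated for each measure in this report. As a sketch only, it could be wrapped in a small helper; top_events is a hypothetical name, and the toy totals below are made up for illustration.

```r
# Hypothetical helper: return the n rows of df with the largest values in `column`
top_events <- function(df, column, n = 5) {
  ranked <- df[order(-df[[column]]), ]
  head(ranked, n)
}

# Made-up totals, for illustration only
toy <- data.frame(
  EVTYPE = c("A", "B", "C", "D"),
  FATALITIES = c(3, 10, 0, 7),
  INJURIES = c(50, 2, 9, 1),
  stringsAsFactors = FALSE
)

top_events(toy, "FATALITIES", n = 2)$EVTYPE  # "B" "D"
top_events(toy, "INJURIES", n = 2)$EVTYPE    # "A" "C"
```

Applied to the grouped data, top_events(group, "FATALITIES") would be expected to return the same five rows as the two-step extraction above.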
Create a bar plot of fatalities:
fatalities.plot <- generate.bar(fatalities,
                                fatalities$EVTYPE, fatalities$FATALITIES,
                                "Type of event", "Fatalities damage",
                                "Top 5 most harmful events damaged in Fatalities")
In terms of injuries:
Extract the 5 events with the most injuries:
injuries <- group[order(-group$INJURIES), ]
injuries <- injuries[1:5, ]
head(injuries)
## # A tibble: 5 × 5
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 56947380676. 414953270
## 2 TSTM WIND 504 6957 4484928495 554007350
## 3 FLOOD 470 6789 144657709807 5661968450
## 4 EXCESSIVE HEAT 1903 6525 7753700 492402000
## 5 LIGHTNING 816 5230 930379430. 12092090
Create a bar plot of injuries:
injuries.plot <- generate.bar(injuries,
                              injuries$EVTYPE, injuries$INJURIES,
                              "Type of event", "Injuries damage",
                              "Top 5 most harmful events damaged in Injuries")
Plot both types of harm side by side in one panel:
library(gridExtra)
grid.arrange(fatalities.plot, injuries.plot, ncol = 2)
→ As the plot shows, TORNADO is the most harmful event type for both fatalities and injuries. Its totals far exceed those of every other event type: about 3 times the fatalities of the second-ranked event (EXCESSIVE HEAT) and about 13 times the injuries of the second-ranked event (TSTM WIND).
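The ratios between the first- and second-ranked event types can be checked directly from the totals in the two tables above. The variable names in this short sketch are illustrative.

```r
# Totals copied from the top-5 tables above
fatalities_top2 <- c(TORNADO = 5633, `EXCESSIVE HEAT` = 1903)
injuries_top2 <- c(TORNADO = 91346, `TSTM WIND` = 6957)

fatality_ratio <- fatalities_top2[["TORNADO"]] / fatalities_top2[["EXCESSIVE HEAT"]]
injury_ratio <- injuries_top2[["TORNADO"]] / injuries_top2[["TSTM WIND"]]

round(c(fatality_ratio, injury_ratio), 2)  # 2.96 13.13
```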
In this part, we generate bar plots to find the 5 weather events with the greatest economic consequences for each type of damage:
In terms of property damage:
Extract the 5 events with the most property damage:
property <- group[order(-group$PROPDMG), ]
property <- property[1:5, ]
head(property)
## # A tibble: 5 × 5
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 FLOOD 470 6789 144657709807 5661968450
## 2 HURRICANE/TYPHOON 64 1275 69305840000 2607872800
## 3 TORNADO 5633 91346 56947380676. 414953270
## 4 STORM SURGE 13 38 43323536000 5000
## 5 FLASH FLOOD 978 1777 16822673978. 1421317100
Create a bar plot of property damage:
property.plot <- generate.bar(property,
                              property$EVTYPE, property$PROPDMG,
                              "Type of event", "Property consequences",
                              "Top 5 most harmful events damaged in Property")
In terms of crop damage:
Extract the 5 events with the most crop damage:
crop <- group[order(-group$CROPDMG), ]
crop <- crop[1:5, ]
head(crop)
## # A tibble: 5 × 5
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 DROUGHT 0 4 1046106000 13972566000
## 2 FLOOD 470 6789 144657709807 5661968450
## 3 RIVER FLOOD 2 2 5118945500 5029459000
## 4 ICE STORM 89 1975 3944927860 5022113500
## 5 HAIL 15 1361 15735267513. 3025954473
Create a bar plot of crop damage:
crop.plot <- generate.bar(crop,
                          crop$EVTYPE, crop$CROPDMG,
                          "Type of event", "Crop damage",
                          "Top 5 most harmful events damaged in Crop")
Plot both types of economic damage side by side in one panel:
grid.arrange(property.plot, crop.plot, ncol = 2)
→ As the plot shows, FLOOD and DROUGHT are the most harmful event types for property and crop damage, with totals of approximately 1.45 × 10^11 and 1.4 × 10^10 US dollars, respectively.
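The orders of magnitude quoted above can be read off the table totals by formatting them in scientific notation. This is a small check using base R's sprintf; the variable names are illustrative.

```r
# Totals copied from the property and crop tables above
flood_property <- 144657709807
drought_crop <- 13972566000

sprintf("%.2e", c(flood_property, drought_crop))  # "1.45e+11" "1.40e+10"
```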