## [1] "English_United States.1252"
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities like loss of life, injuries, significant property damage, and/or disruption to commerce. Preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database as a programming assignment for the Reproducible Research course at COURSERA. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this analysis come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It was downloaded from the course web site:
Documentation of the database available:
Detailed information about the fields/columns were not found (NO CODEBOOK AVAILABLE).
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. That may bias the results toward TORNADOS, since in years before 1996 no other weather event were measured. For that reason, observations before 1996 became out of scope.
The basic goal of this work is to explore the NOAA Storm Database and answer some basic questions about severe weather events. The database was used to answer the questions below and show the code for the entire analysis. The analysis consist of tables, figures, and other summaries.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
We were supposed to consider writing this report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there were no need to make any specific recommendations in the report.
The data was downloaded from Storm Data [47Mb] and loaded
This block checks if the file exists and downloads it once.
if(!file.exists("./data/repdata-data-StormData.csv.bz2")) {
if(!file.exists("./data")) {
dir.create("./data")
}
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL,
destfile = "./data/repdata-data-StormData.csv.bz2",
quiet = FALSE,
mode = "wb")
fileURL <- NULL
}
From the R Documentation of connections{base} at the end of “Compression” section:
“With current computers decompression times even with compress = 9 are typically modest and reading compressed files is usually faster than uncompressed ones because of the reduction in disc activity.”
So, we used this method: reading directly from the bzfile.
storms <- read.csv(bzfile("./data/repdata-data-StormData.csv.bz2"),
stringsAsFactors = FALSE)
After the data was loaded, we check the structure of the data set.
dim_storms <- dim(storms)
head(storms, n = 5)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
sapply(storms, class)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## "numeric" "character" "character" "character" "numeric" "character"
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## "character" "character" "numeric" "character" "character" "character"
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## "character" "numeric" "logical" "numeric" "character" "character"
## LENGTH WIDTH F MAG FATALITIES INJURIES
## "numeric" "numeric" "integer" "numeric" "numeric" "numeric"
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## "numeric" "character" "numeric" "character" "character" "character"
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## "character" "numeric" "numeric" "numeric" "numeric" "character"
## REFNUM
## "numeric"
summary(storms[ , c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
"CROPDMG", "CROPDMGEXP") ])
## EVTYPE FATALITIES INJURIES
## Length:902297 Min. : 0.0000 Min. : 0.0000
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000
## Mode :character Median : 0.0000 Median : 0.0000
## Mean : 0.0168 Mean : 0.1557
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :583.0000 Max. :1700.0000
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
There were 902297 rows and 37 columns in the data set.
Missing values can introduce bias in any data analysis. For that reason we check the proportion of the observations are missing (i.e. coded as NA).
storms_clean <- storms
mean( is.na( storms_clean$FATALITIES ))
## [1] 0
mean( is.na( storms_clean$INJURIES))
## [1] 0
mean( is.na( storms_clean$PROPDMG ))
## [1] 0
mean( is.na( storms_clean$CROPDMG ))
## [1] 0
So, there are no NAs in variables:
Our analysis focus on obervations above year 1996, so we filter data.
library(lubridate)
storms_clean$BGN_DATE_clean <- mdy_hms(storms_clean$BGN_DATE)
bf <- dim(storms_clean)[1] # Before filtering
storms_clean <- storms_clean[ storms_clean$BGN_DATE_clean > "1995-12-31", ]
af <- dim(storms_clean)[1] # After filtering
diff <- bf - af
The difference after selecting years above 1996 is 248767.
Now, we perform a very basic padronization by changing every EVTYPE to upper.
storms_clean$EVTYPE <- toupper(storms_clean$EVTYPE)
# Standardizes categories of events (fixes some typos)
storms_clean$EVTYPE <- sub("winds$", "wind", storms_clean$EVTYPE)
storms_clean$EVTYPE <- sub("HIGH WINDS", "HIGH WIND", storms_clean$EVTYPE)
storms_clean$EVTYPE <- sub("THUNDERSTORM WINDS", "THUNDERSTORM WIND", storms_clean$EVTYPE)
storms_clean$EVTYPE <- sub("WINTER STORMS", "WINTER STORM", storms_clean$EVTYPE)
summary(storms_clean[ , c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP") ])
## EVTYPE FATALITIES INJURIES
## Length:653530 Min. : 0.00000 Min. :0.00e+00
## Class :character 1st Qu.: 0.00000 1st Qu.:0.00e+00
## Mode :character Median : 0.00000 Median :0.00e+00
## Mean : 0.01336 Mean :8.87e-02
## 3rd Qu.: 0.00000 3rd Qu.:0.00e+00
## Max. :158.00000 Max. :1.15e+03
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:653530 Min. : 0.000 Length:653530
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 11.69 Mean : 1.839
## 3rd Qu.: 1.00 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
Translating PROPDMGEXP and CROPDMGEXP units into numeric exponents.
# Translates units into exponents
storms_clean$PROPDMGEXP <- sub("k", "3", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$PROPDMGEXP <- sub("m", "6", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$PROPDMGEXP <- sub("b", "9", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$CROPDMGEXP <- sub("k", "3", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$CROPDMGEXP <- sub("m", "6", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$CROPDMGEXP <- sub("b", "9", storms_clean$PROPDMGEXP, ignore.case = TRUE)
# Changes data type from char into numeric, in order to allow computations
storms_clean$PROPDMGEXP <- as.numeric(storms_clean$PROPDMGEXP)
storms_clean$CROPDMGEXP <- as.numeric(storms_clean$CROPDMGEXP)
Here we group data by EVTYPE, create the totals for FATALITIES, INJURIES, PROPRIETY DAMAGE and CROP DAMAGE. We also create a total for PUBLIC HEALTH PROBLEMS by adding TOTAL FATALITIES and TOTAL INJURIES and create a total for ECONOMIC PROBLEMS by adding TOTAL PROPDMG and TOTAL CROPDMG. Eventually, we created a TOTAL DAMAGE attribute by adding TOTAL PUBLIC HEALTH PROBLEMS and TOTAL ECONOMIC PROBLEMS.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
storms_clean_summary <- group_by(storms_clean, EVTYPE) %>%
summarize( TOTAL.FATALITIES = sum(FATALITIES), # as.integer(sum(FATALITIES)),
TOTAL.INJURIES = sum(INJURIES), # as.integer(sum(INJURIES)),
TOTAL.PROPDMG = sum(PROPDMG * (10 ^ PROPDMGEXP) ),
TOTAL.CROPDMG = sum(CROPDMG * (10 ^ CROPDMGEXP) )
) %>%
mutate(
TOTAL.PUBLIC.HEALTH.PROBLEMS = TOTAL.FATALITIES + TOTAL.INJURIES,
TOTAL.ECONOMIC.PROBLEMS = TOTAL.PROPDMG + TOTAL.CROPDMG,
TOTAL.DAMAGE = TOTAL.PUBLIC.HEALTH.PROBLEMS + TOTAL.ECONOMIC.PROBLEMS
)
# Number of observations to be ploted in top n rankings
N <- 10
storms_clean_summary_p1 <- top_n(storms_clean_summary, N, TOTAL.PUBLIC.HEALTH.PROBLEMS)
storms_clean_summary_p2 <- top_n(storms_clean_summary, N, TOTAL.ECONOMIC.PROBLEMS)
storms_clean_summary_p3 <- top_n(storms_clean_summary, N, TOTAL.DAMAGE)
The table below shows the top 10 weather events that caused both public health and economic problems for communities and municipalities in descending order by sum of TOTAL.PUBLIC.HEALTH.PROBLEMS and TOTAL.ECONOMIC.PROBLEMS.
library(xtable)
print(xtable(storms_clean_summary_p3), type = "html", include.rownames = FALSE)
| EVTYPE | TOTAL.FATALITIES | TOTAL.INJURIES | TOTAL.PROPDMG | TOTAL.CROPDMG | TOTAL.PUBLIC.HEALTH.PROBLEMS | TOTAL.ECONOMIC.PROBLEMS | TOTAL.DAMAGE |
|---|---|---|---|---|---|---|---|
| COASTAL FLOODING/EROSION | 0.00 | 0.00 | 15000000.00 | 0.00 | 0.00 | 15000000.00 | 15000000.00 |
| COASTAL EROSION | 0.00 | 0.00 | 766000.00 | 0.00 | 0.00 | 766000.00 | 766000.00 |
| EROSION/CSTL FLOOD | 0.00 | 0.00 | 16200000.00 | 0.00 | 0.00 | 16200000.00 | 16200000.00 |
| EXCESSIVE SNOW | 0.00 | 2.00 | 1935000.00 | 0.00 | 2.00 | 1935000.00 | 1935002.00 |
| HEAVY RAIN/HIGH SURF | 0.00 | 0.00 | 13500000.00 | 1500000.00 | 0.00 | 15000000.00 | 15000000.00 |
| LAKESHORE FLOOD | 0.00 | 0.00 | 7540000.00 | 0.00 | 0.00 | 7540000.00 | 7540000.00 |
| MARINE HIGH WIND | 1.00 | 1.00 | 1297010.00 | 0.00 | 2.00 | 1297010.00 | 1297012.00 |
| MARINE STRONG WIND | 14.00 | 22.00 | 418330.00 | 0.00 | 36.00 | 418330.00 | 418366.00 |
| MARINE THUNDERSTORM WIND | 10.00 | 26.00 | 436400.00 | 50000.00 | 36.00 | 486400.00 | 486436.00 |
| WIND AND WAVE | 0.00 | 0.00 | 1000000.00 | 0.00 | 0.00 | 1000000.00 | 1000000.00 |
The plots below shows which types of events have the greatest economic consequences, since crops affect directly the economy.
library(ggplot2)
library(RColorBrewer)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
## Loading required package: grid
# TOTAL.PUBLIC.HEALTH.PROBLEMS X EVTYPE
p1 <- ggplot(data = storms_clean_summary_p1, aes(x = reorder(EVTYPE, TOTAL.PUBLIC.HEALTH.PROBLEMS),
y = TOTAL.PUBLIC.HEALTH.PROBLEMS,
fill=TOTAL.PUBLIC.HEALTH.PROBLEMS)) +
scale_fill_continuous(low="grey", high="green", guide = FALSE) +
geom_bar(stat = 'identity') +
ggtitle(paste("Top", N, " most harmful storm events with respect to population helath")) +
labs(x = paste("Top",N,"Storm Events"), y = "Number of fatalities and injuries") +
coord_flip()
print(p1)
# TOTAL.ECONOMIC.PROBLEMS x EVTYPE
p2 <- ggplot(data = storms_clean_summary_p2, aes(x = reorder(EVTYPE,TOTAL.ECONOMIC.PROBLEMS),
y = TOTAL.ECONOMIC.PROBLEMS,
fill=TOTAL.ECONOMIC.PROBLEMS)) +
scale_fill_continuous(low="green", high="black", guide = FALSE) +
geom_bar(stat = 'identity') +
ggtitle(paste("Top", N, " storm events that have the greatest economic consequences")) +
labs(x = paste("Top",N,"Storm Events"), y = "Number of properties or crops damaged") +
coord_flip()
print(p2)
# TOTAL.PROPDMG x EVTYPE
p3 <- ggplot(data = storms_clean_summary_p3, aes(x = reorder(EVTYPE,TOTAL.DAMAGE),
y = TOTAL.DAMAGE,
fill=TOTAL.DAMAGE)) +
scale_fill_continuous(low="black", high="grey", guide = FALSE) +
geom_bar(stat = 'identity') +
ggtitle(paste("Top", N, " storm events that have the greatest overall consequences")) +
labs(x = paste("Top",N,"Storm Events"), y = "Total damage") +
coord_flip()
print(p3)
# grid.arrange(p1,p2,p3, nrow=3)
Sys.setlocale("LC_TIME","English")
## [1] "English_United States.1252"
time <- format(Sys.time(), "%F %T %Z")
tz <- Sys.timezone()
All content generated in 2015-03-22 18:22:55 BRT in America/Sao_Paulo. This work is an exercise of the course Reproducible Research - Peer Assignment 2