## [1] "English_United States.1252"

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities like loss of life, injuries, significant property damage, and/or disruption to commerce. Preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database as a programming assignment for the Reproducible Research course at COURSERA. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data for this analysis come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It was downloaded from the course web site:

Documentation of the database available:

Detailed information about the fields/columns were not found (NO CODEBOOK AVAILABLE).

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. That may bias the results toward TORNADOS, since in years before 1996 no other weather event were measured. For that reason, observations before 1996 became out of scope.

The basic goal of this work is to explore the NOAA Storm Database and answer some basic questions about severe weather events. The database was used to answer the questions below and show the code for the entire analysis. The analysis consist of tables, figures, and other summaries.

Questions

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

We were supposed to consider writing this report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there were no need to make any specific recommendations in the report.

Getting and Loading the Data Set

The data was downloaded from Storm Data [47Mb] and loaded

Get the Data Set

This block checks if the file exists and downloads it once.

if(!file.exists("./data/repdata-data-StormData.csv.bz2")) {
        if(!file.exists("./data")) {
                        dir.create("./data")
        }
        fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileURL,
                      destfile = "./data/repdata-data-StormData.csv.bz2",
                      quiet = FALSE,
                      mode = "wb")
        fileURL <- NULL
}

Load the Data Set

From the R Documentation of connections{base} at the end of “Compression” section:
“With current computers decompression times even with compress = 9 are typically modest and reading compressed files is usually faster than uncompressed ones because of the reduction in disc activity.”
So, we used this method: reading directly from the bzfile.

storms <- read.csv(bzfile("./data/repdata-data-StormData.csv.bz2"),
                   stringsAsFactors = FALSE)

After the data was loaded, we check the structure of the data set.

dim_storms <- dim(storms)
head(storms, n = 5)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
sapply(storms, class)
##     STATE__    BGN_DATE    BGN_TIME   TIME_ZONE      COUNTY  COUNTYNAME 
##   "numeric" "character" "character" "character"   "numeric" "character" 
##       STATE      EVTYPE   BGN_RANGE     BGN_AZI  BGN_LOCATI    END_DATE 
## "character" "character"   "numeric" "character" "character" "character" 
##    END_TIME  COUNTY_END  COUNTYENDN   END_RANGE     END_AZI  END_LOCATI 
## "character"   "numeric"   "logical"   "numeric" "character" "character" 
##      LENGTH       WIDTH           F         MAG  FATALITIES    INJURIES 
##   "numeric"   "numeric"   "integer"   "numeric"   "numeric"   "numeric" 
##     PROPDMG  PROPDMGEXP     CROPDMG  CROPDMGEXP         WFO  STATEOFFIC 
##   "numeric" "character"   "numeric" "character" "character" "character" 
##   ZONENAMES    LATITUDE   LONGITUDE  LATITUDE_E  LONGITUDE_     REMARKS 
## "character"   "numeric"   "numeric"   "numeric"   "numeric" "character" 
##      REFNUM 
##   "numeric"
summary(storms[ , c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
                    "CROPDMG", "CROPDMGEXP") ])
##     EVTYPE            FATALITIES          INJURIES        
##  Length:902297      Min.   :  0.0000   Min.   :   0.0000  
##  Class :character   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Mode  :character   Median :  0.0000   Median :   0.0000  
##                     Mean   :  0.0168   Mean   :   0.1557  
##                     3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##                     Max.   :583.0000   Max.   :1700.0000  
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000

There were 902297 rows and 37 columns in the data set.

Preprocessing, Data Cleansing, Padronization

Check for NAs proportion

Missing values can introduce bias in any data analysis. For that reason we check the proportion of the observations are missing (i.e. coded as NA).

storms_clean <- storms
mean( is.na( storms_clean$FATALITIES ))
## [1] 0
mean( is.na( storms_clean$INJURIES))
## [1] 0
mean( is.na( storms_clean$PROPDMG ))
## [1] 0
mean( is.na( storms_clean$CROPDMG ))
## [1] 0

So, there are no NAs in variables:

  • FATALITIES
  • INJURIES
  • PROPDMG
  • CROPDMG

Preprocessing

Our analysis focus on obervations above year 1996, so we filter data.

library(lubridate)
storms_clean$BGN_DATE_clean <- mdy_hms(storms_clean$BGN_DATE)

bf <- dim(storms_clean)[1] # Before filtering

storms_clean <- storms_clean[ storms_clean$BGN_DATE_clean > "1995-12-31", ]

af <- dim(storms_clean)[1] # After filtering

diff <- bf - af

The difference after selecting years above 1996 is 248767.

Now, we perform a very basic padronization by changing every EVTYPE to upper.

storms_clean$EVTYPE <- toupper(storms_clean$EVTYPE)

# Standardizes categories of events (fixes some typos)
storms_clean$EVTYPE <- sub("winds$", "wind", storms_clean$EVTYPE)
storms_clean$EVTYPE <- sub("HIGH WINDS", "HIGH WIND", storms_clean$EVTYPE)
storms_clean$EVTYPE <- sub("THUNDERSTORM WINDS", "THUNDERSTORM WIND", storms_clean$EVTYPE)
storms_clean$EVTYPE <- sub("WINTER STORMS", "WINTER STORM", storms_clean$EVTYPE)

summary(storms_clean[ , c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
                          "PROPDMGEXP", "CROPDMG", "CROPDMGEXP") ])
##     EVTYPE            FATALITIES           INJURIES       
##  Length:653530      Min.   :  0.00000   Min.   :0.00e+00  
##  Class :character   1st Qu.:  0.00000   1st Qu.:0.00e+00  
##  Mode  :character   Median :  0.00000   Median :0.00e+00  
##                     Mean   :  0.01336   Mean   :8.87e-02  
##                     3rd Qu.:  0.00000   3rd Qu.:0.00e+00  
##                     Max.   :158.00000   Max.   :1.15e+03  
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:653530      Min.   :  0.000   Length:653530     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  11.69                      Mean   :  1.839                     
##  3rd Qu.:   1.00                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000

Translating PROPDMGEXP and CROPDMGEXP units into numeric exponents.

# Translates units into exponents
storms_clean$PROPDMGEXP <- sub("k", "3", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$PROPDMGEXP <- sub("m", "6", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$PROPDMGEXP <- sub("b", "9", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$CROPDMGEXP <- sub("k", "3", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$CROPDMGEXP <- sub("m", "6", storms_clean$PROPDMGEXP, ignore.case = TRUE)
storms_clean$CROPDMGEXP <- sub("b", "9", storms_clean$PROPDMGEXP, ignore.case = TRUE)

# Changes data type from char into numeric, in order to allow computations
storms_clean$PROPDMGEXP <- as.numeric(storms_clean$PROPDMGEXP)
storms_clean$CROPDMGEXP <- as.numeric(storms_clean$CROPDMGEXP)

Creating new variables, aggregating and summarizing data.

Here we group data by EVTYPE, create the totals for FATALITIES, INJURIES, PROPRIETY DAMAGE and CROP DAMAGE. We also create a total for PUBLIC HEALTH PROBLEMS by adding TOTAL FATALITIES and TOTAL INJURIES and create a total for ECONOMIC PROBLEMS by adding TOTAL PROPDMG and TOTAL CROPDMG. Eventually, we created a TOTAL DAMAGE attribute by adding TOTAL PUBLIC HEALTH PROBLEMS and TOTAL ECONOMIC PROBLEMS.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
storms_clean_summary <- group_by(storms_clean, EVTYPE) %>%
        summarize( TOTAL.FATALITIES = sum(FATALITIES), # as.integer(sum(FATALITIES)),
                   TOTAL.INJURIES   = sum(INJURIES), # as.integer(sum(INJURIES)),
                   TOTAL.PROPDMG    = sum(PROPDMG * (10 ^ PROPDMGEXP) ),
                   TOTAL.CROPDMG    = sum(CROPDMG * (10 ^ CROPDMGEXP) )
        ) %>%
        mutate(
                TOTAL.PUBLIC.HEALTH.PROBLEMS    = TOTAL.FATALITIES + TOTAL.INJURIES,
                TOTAL.ECONOMIC.PROBLEMS         = TOTAL.PROPDMG + TOTAL.CROPDMG,
                TOTAL.DAMAGE                    = TOTAL.PUBLIC.HEALTH.PROBLEMS + TOTAL.ECONOMIC.PROBLEMS
                )



# Number of observations to be ploted in top n rankings
N <- 10

storms_clean_summary_p1 <- top_n(storms_clean_summary, N, TOTAL.PUBLIC.HEALTH.PROBLEMS)
storms_clean_summary_p2 <- top_n(storms_clean_summary, N, TOTAL.ECONOMIC.PROBLEMS)
storms_clean_summary_p3 <- top_n(storms_clean_summary, N, TOTAL.DAMAGE)

Results

Table Results

The table below shows the top 10 weather events that caused both public health and economic problems for communities and municipalities in descending order by sum of TOTAL.PUBLIC.HEALTH.PROBLEMS and TOTAL.ECONOMIC.PROBLEMS.

library(xtable)
print(xtable(storms_clean_summary_p3), type = "html", include.rownames = FALSE)
EVTYPE TOTAL.FATALITIES TOTAL.INJURIES TOTAL.PROPDMG TOTAL.CROPDMG TOTAL.PUBLIC.HEALTH.PROBLEMS TOTAL.ECONOMIC.PROBLEMS TOTAL.DAMAGE
COASTAL FLOODING/EROSION 0.00 0.00 15000000.00 0.00 0.00 15000000.00 15000000.00
COASTAL EROSION 0.00 0.00 766000.00 0.00 0.00 766000.00 766000.00
EROSION/CSTL FLOOD 0.00 0.00 16200000.00 0.00 0.00 16200000.00 16200000.00
EXCESSIVE SNOW 0.00 2.00 1935000.00 0.00 2.00 1935000.00 1935002.00
HEAVY RAIN/HIGH SURF 0.00 0.00 13500000.00 1500000.00 0.00 15000000.00 15000000.00
LAKESHORE FLOOD 0.00 0.00 7540000.00 0.00 0.00 7540000.00 7540000.00
MARINE HIGH WIND 1.00 1.00 1297010.00 0.00 2.00 1297010.00 1297012.00
MARINE STRONG WIND 14.00 22.00 418330.00 0.00 36.00 418330.00 418366.00
MARINE THUNDERSTORM WIND 10.00 26.00 436400.00 50000.00 36.00 486400.00 486436.00
WIND AND WAVE 0.00 0.00 1000000.00 0.00 0.00 1000000.00 1000000.00

Graph Results

The plots below shows which types of events have the greatest economic consequences, since crops affect directly the economy.

library(ggplot2)
library(RColorBrewer)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
## Loading required package: grid
# TOTAL.PUBLIC.HEALTH.PROBLEMS X EVTYPE
p1 <- ggplot(data = storms_clean_summary_p1, aes(x = reorder(EVTYPE, TOTAL.PUBLIC.HEALTH.PROBLEMS),
                                              y = TOTAL.PUBLIC.HEALTH.PROBLEMS,
                                              fill=TOTAL.PUBLIC.HEALTH.PROBLEMS)) + 
        scale_fill_continuous(low="grey", high="green", guide = FALSE) +
        geom_bar(stat = 'identity') + 
        ggtitle(paste("Top", N, " most harmful storm events with respect to population helath")) +
        labs(x = paste("Top",N,"Storm Events"), y = "Number of fatalities and injuries") +
        coord_flip()

print(p1)

# TOTAL.ECONOMIC.PROBLEMS x EVTYPE
p2 <- ggplot(data = storms_clean_summary_p2, aes(x = reorder(EVTYPE,TOTAL.ECONOMIC.PROBLEMS),
                                                 y = TOTAL.ECONOMIC.PROBLEMS,
                                                 fill=TOTAL.ECONOMIC.PROBLEMS)) + 
        scale_fill_continuous(low="green", high="black", guide = FALSE) +
        geom_bar(stat = 'identity') + 
        ggtitle(paste("Top", N, " storm events that have the greatest economic consequences")) +
        labs(x = paste("Top",N,"Storm Events"), y = "Number of properties or crops damaged") +
        coord_flip()
        
print(p2)

# TOTAL.PROPDMG x EVTYPE
p3 <- ggplot(data = storms_clean_summary_p3, aes(x = reorder(EVTYPE,TOTAL.DAMAGE),
                                                 y = TOTAL.DAMAGE,
                                                 fill=TOTAL.DAMAGE)) + 
        scale_fill_continuous(low="black", high="grey", guide = FALSE) +
        geom_bar(stat = 'identity') + 
        ggtitle(paste("Top", N, " storm events that have the greatest overall consequences")) +
        labs(x = paste("Top",N,"Storm Events"), y = "Total damage") +
        coord_flip()

print(p3)

# grid.arrange(p1,p2,p3, nrow=3)
Sys.setlocale("LC_TIME","English")
## [1] "English_United States.1252"
time <- format(Sys.time(), "%F %T %Z")
tz   <- Sys.timezone()

All content generated in 2015-03-22 18:22:55 BRT in America/Sao_Paulo. This work is an exercise of the course Reproducible Research - Peer Assignment 2