Reproducible Research Course Project 2

Synopsis

This R Markdown document walks step by step through the R code used to complete the second course project of the Reproducible Research course. The goal of the assignment is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database and determine which types of storms and weather events are most harmful to population health and which have the greatest economic consequences. Measured by total fatalities plus injuries, TORNADO was the most harmful event type for population health; measured by combined property and crop damage, FLOOD had the greatest economic consequences.

Introduction

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The goal of the assignment is to determine which types of storms and weather events are most harmful to population health and which have the greatest economic consequences. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern, hence the need to explore the database. This project addresses two main questions:

  • Across the United States, which types of events are most harmful with respect to population health (Q1)?
  • Across the United States, which types of events have the greatest economic consequences (Q2)?

Data Processing

  1. Download and load the data set

The following lines of code download the data set and load the compressed csv.bz2 file from the working directory.

No data transformation is performed at this step; transformations are applied as needed in each of the later steps.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "raw_data.csv.bz2")
data <- read.csv("raw_data.csv.bz2")
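
Re-knitting the document repeats the download every time. A minimal sketch of a guard that skips the download when the file is already present is shown below; the helper name fetch_once and its downloader argument are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): download a file only
# when it is not already present at the destination path.
fetch_once <- function(url, destfile, downloader = download.file) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the compressed archive as binary
    downloader(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage with the file name used above (commented out to avoid a network call):
# fetch_once(url, "raw_data.csv.bz2")
# data <- read.csv("raw_data.csv.bz2")
```

The downloader argument exists only so the function can be exercised without network access; in normal use the default download.file is called.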
  2. Exploring the dataset

Using the str() and colnames() functions we can inspect the variable names and then subset the variables needed to answer the two questions.

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
colnames(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

EVTYPE records the type of event. The variables associated with population health are FATALITIES and INJURIES; those associated with economic consequences are PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

selected <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
subset <- data[,selected]
summary(subset)
##     EVTYPE            FATALITIES          INJURIES            PROPDMG       
##  Length:902297      Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00  
##  Class :character   1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Mode  :character   Median :  0.0000   Median :   0.0000   Median :   0.00  
##                     Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06  
##                     3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##                     Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00  
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:902297      Min.   :  0.000   Length:902297     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  1.527                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000
str(subset)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
  3. Data Analysis

To answer Question 1 (Q1), we first create another variable (TOTAL_HEALTH) that sums the recorded FATALITIES and INJURIES. Then, using the dplyr R package, we summarize by event type, order the result, and store in a new dataset (Health) the 10 event types associated with the highest number of fatalities plus injuries (TOTAL_HEALTH).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
subset$TOTAL_HEALTH = subset$FATALITIES + subset$INJURIES
a = subset %>% group_by(EVTYPE) %>% summarize(sum(FATALITIES), sum(INJURIES), sum(TOTAL_HEALTH))
Health = a[order(-a$`sum(TOTAL_HEALTH)`),]
Health = Health[1:10,]
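
To make the group, sum, sort, and truncate steps concrete, here is a small sketch on made-up numbers. It uses base R's aggregate() rather than dplyr so that it runs without extra packages; the toy data and the helper name top_events are illustrative assumptions, not part of the original analysis.

```r
# Toy data (made-up numbers, for illustration only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 3, 4, 0),
  INJURIES   = c(10, 5, 20, 1, 2),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
totals$TOTAL_HEALTH <- totals$FATALITIES + totals$INJURIES

# Hypothetical helper: sort by the combined total, largest first, keep the top n
top_events <- function(df, n) head(df[order(-df$TOTAL_HEALTH), ], n)

top2 <- top_events(totals, 2)
print(top2)
```

In this toy table TORNADO comes first because its summed fatalities and injuries are the largest of the three event types.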

We can do the same for Question 2 (Q2), finding which types of events have the greatest economic consequences. Here, however, the crop and property damages are each encoded with two variables: one for the numerical estimate and one for a power-of-10 multiplier, e.g., K = 1000 = 10^3, M = 1000000 = 10^6, and so on.

library(dplyr)
unique(subset$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(subset$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

For the property damages we decode the exponent codes as K = 10^3, M = 10^6, B = 10^9, m = 10^6, + = 10^0, and so on. For the crop damages we use M = 10^6, K = 10^3, m = 10^6, B = 10^9, ? = 10^0, 0 = 10^0, k = 10^3, 2 = 10^2. An empty code (“”) has no entry in either lookup and is therefore set to 10^0 in the step that replaces missing values.

PROPDM_Decoded = c("K" = 10^3, "M" = 10^6, "B" = 10^9, "m" = 10^6, "+" = 10^0, "0" = 10^0, "5" = 10^5, "6" = 10^6, "?" = 10^0, "4" = 10^4, "2" = 10^2, "3" = 10^3, "h" = 10^2, "7" = 10^7, "H" = 10^2, "-" = 10^0, "1" = 10^1, "8" = 10^8)
CROPDM_Decoded = c("M" = 10^6,  "K" = 10^3, "m" = 10^6, "B" = 10^9, "?" = 10^0,  "0" = 10^0, "k" = 10^3, "2" = 10^2)

subset$PROPDMG_Decoded = PROPDM_Decoded[subset$PROPDMGEXP]
subset$PROPDMG_Decoded[which(is.na(subset$PROPDMG_Decoded))] = 10^0
subset$CROPDMG_Decoded = CROPDM_Decoded[subset$CROPDMGEXP]
subset$CROPDMG_Decoded[which(is.na(subset$CROPDMG_Decoded))] = 10^0
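
The lookup-with-default pattern used above can be sketched on a few made-up values. The helper name decode_exp and the reduced lookup table are assumptions for illustration; unlike the code above, this sketch uppercases the codes first so that "k" and "K" share one entry, which gives the same multipliers as the mappings above for the letter codes.

```r
# Reduced lookup table of multipliers (letter codes only, for illustration)
multipliers <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)

# Hypothetical helper: map exponent codes to multipliers.
# Codes without an entry (including "") yield NA from the lookup
# and are then replaced by 1, i.e. 10^0.
decode_exp <- function(codes, lookup = multipliers) {
  out <- unname(lookup[toupper(codes)])
  out[is.na(out)] <- 1
  out
}

# Made-up estimates and codes
amount <- c(25, 2.5, 10, 1, 3)
codes  <- c("K", "m", "", "B", "?")

damage <- amount * decode_exp(codes)
print(damage)
```

Indexing a named vector with a name it does not contain returns NA, which is why a single is.na() replacement is enough to handle both empty and unrecognized codes.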

After decoding the exponents we can compute the crop and property damage amounts by multiplying each numerical estimate by its decoded multiplier, and then compute the sum of crop and property damage (TOTAL_ECON). Using the same method as for population health, we summarize by event type, order the result, and store in a new dataset (Econ) the 10 event types associated with the highest combined crop and property damage (TOTAL_ECON).

b = subset
b$PROP_DMG = b$PROPDMG * b$PROPDMG_Decoded 
b$CROP_DMG = b$CROPDMG * b$CROPDMG_Decoded 
b$TOTAL_ECON = b$CROP_DMG + b$PROP_DMG

Econ = b %>% group_by(EVTYPE) %>% summarize(sum(CROP_DMG), sum(PROP_DMG), sum(TOTAL_ECON))
Econ = Econ[order(-Econ$`sum(TOTAL_ECON)`),]
Econ = Econ[1:10,]

Results

The datasets Health and Econ now contain the 10 event types that are most harmful with respect to population health and that have the greatest economic consequences, respectively.

By printing Health and Econ we can see how much each event type has affected population health (fatalities, injuries, and their total) and the economy (crop damage, property damage, and their total).

Health
## # A tibble: 10 x 4
##    EVTYPE            `sum(FATALITIES)` `sum(INJURIES)` `sum(TOTAL_HEALTH)`
##    <chr>                         <dbl>           <dbl>               <dbl>
##  1 TORNADO                        5633           91346               96979
##  2 EXCESSIVE HEAT                 1903            6525                8428
##  3 TSTM WIND                       504            6957                7461
##  4 FLOOD                           470            6789                7259
##  5 LIGHTNING                       816            5230                6046
##  6 HEAT                            937            2100                3037
##  7 FLASH FLOOD                     978            1777                2755
##  8 ICE STORM                        89            1975                2064
##  9 THUNDERSTORM WIND               133            1488                1621
## 10 WINTER STORM                    206            1321                1527
Econ
## # A tibble: 10 x 4
##    EVTYPE            `sum(CROP_DMG)` `sum(PROP_DMG)` `sum(TOTAL_ECON)`
##    <chr>                       <dbl>           <dbl>             <dbl>
##  1 FLOOD                  5661968450   144657709807      150319678257 
##  2 HURRICANE/TYPHOON      2607872800    69305840000       71913712800 
##  3 TORNADO                 414953270    56947380676.      57362333946.
##  4 STORM SURGE                  5000    43323536000       43323541000 
##  5 HAIL                   3025954473    15735267513.      18761221986.
##  6 FLASH FLOOD            1421317100    16822673978.      18243991078.
##  7 DROUGHT               13972566000     1046106000       15018672000 
##  8 HURRICANE              2741910000    11868319010       14610229010 
##  9 RIVER FLOOD            5029459000     5118945500       10148404500 
## 10 ICE STORM              5022113500     3944927860        8967041360

We can conclude that TORNADO was the most harmful event type for population health and that FLOOD had the greatest economic consequences.

Visualize data - Conclusions

We can construct two bar charts to visualize the impact each event type has on population health and on the economy.

library(ggplot2)
library(tidyr)
Health = gather(Health, key = "sum", value = "points", 2:4)
ggplot(Health, aes(x = EVTYPE, y = points, fill = sum))+
  geom_col(position = position_dodge()) +
  ylab("Total Fatalities / Injuries") +
  xlab("Event Type") +
  ggtitle("Top 10 event types associated \n with highest impact on Population Health") +
  theme(axis.text.x = element_text(angle=45, hjust=1))
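
The gather() call above reshapes Health from one column per measure to one row per event type and measure, so that the measure can be mapped to the fill aesthetic. A sketch of the same reshaping in base R on made-up numbers follows; the helper name to_long is hypothetical and is not how the original code performs the step.

```r
# Hypothetical helper: stack the measure columns of a wide data frame
# into key/value form, one block of rows per measure.
to_long <- function(wide, id, measures) {
  long <- data.frame(
    rep(wide[[id]], times = length(measures)),
    rep(measures, each = nrow(wide)),
    unlist(lapply(measures, function(m) wide[[m]])),
    stringsAsFactors = FALSE
  )
  names(long) <- c(id, "sum", "points")
  long
}

# Made-up wide table: one row per event type, one column per measure
wide <- data.frame(
  EVTYPE     = c("A", "B"),
  fatalities = c(1, 2),
  injuries   = c(30, 40),
  stringsAsFactors = FALSE
)

long <- to_long(wide, "EVTYPE", c("fatalities", "injuries"))
print(long)
```

Each event type now appears once per measure, with the measure name in sum and its value in points, matching the column names used in the plotting code.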

library(ggplot2)
library(tidyr)
Econ = gather(Econ, key = "sum", value = "points", 2:4)

ggplot(Econ, aes(x = EVTYPE, y = points, fill = sum))+
  geom_col(position = position_dodge()) +
  ylab("Property/Crop Damage (in USD)") +
  xlab("Event Type") +
  ggtitle("Top 10 event types associated \n with highest economic consequences") +
  theme(axis.text.x = element_text(angle=45, hjust=1))