STORMDATA Project Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

Loading and preprocessing the data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.1 (2020-08-26 16:20:06 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.24.0 (2020-08-26 16:11:58 UTC) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
## 
##     throw
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, load, save
## R.utils v2.10.1 (2020-08-26 22:50:31 UTC) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, nullfile, parse,
##     warnings
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)

## Machine-specific working directory; adjust or remove this line to run the analysis elsewhere
setwd("C:/Users/ventd/OneDrive/Escritorio/Coursera/Reproducible Research/RRProject2")

filename <- "Dataset.csv.bz2"

## Download the file only if it does not already exist locally
if (!file.exists(filename)) {
        fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        ## mode = "wb" keeps the compressed file intact on Windows
        download.file(fileURL, filename, mode = "wb")
}

## Read the data directly from the bzip2-compressed file
xdata <- read.csv(bzfile(filename))

Calculating and analyzing data

dim(xdata)
## [1] 902297     37
str(xdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Selecting the variables relevant to this project, which addresses the following questions:

    1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
    2. Across the United States, which types of events have the greatest economic consequences?

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

#Creating the table with variables we are going to use
pdata <- select(xdata, BGN_DATE, STATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)

Checking whether the selected data contain any NAs

sum(is.na(pdata))
## [1] 0
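
Note that is.na() does not flag empty strings, and the str() output above shows that columns such as CROPDMGEXP contain many blank values. A minimal sketch of a per-column check that counts both, using a small made-up data frame (`toy` is hypothetical and not part of the storm data):

```r
# Hypothetical toy data: one true NA and one blank exponent code
toy <- data.frame(PROPDMG = c(25, NA, 2.5),
                  PROPDMGEXP = c("K", "", "M"))

# NA count per column (blank strings are not counted here)
na_counts <- colSums(is.na(toy))
na_counts

# Blank strings have to be counted separately
blank_count <- sum(toy$PROPDMGEXP == "")
blank_count
```

A zero from sum(is.na(...)) therefore says nothing about blank exponent codes, which is why they are handled explicitly in the conversion step later on.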

Cleaning the Data

Format the BGN_DATE variable as a date. The most recent years hold the most complete information: all 48 event types have been recorded only since 1996, so the analysis keeps events from 1996 onward.

pdata$BGN_DATE <- as.Date(pdata$BGN_DATE, "%m/%d/%Y")
pdata$YEAR <- year(pdata$BGN_DATE)
pdata <- filter(pdata, YEAR >= 1996)
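
The date conversion works even though BGN_DATE carries a trailing "0:00:00", because as.Date() stops reading once the format is satisfied. A small sketch of the same parse-then-filter step in base R; the first two strings come from the str() output above, the third is a made-up example, and format(..., "%Y") stands in for lubridate's year():

```r
# Two values from the data plus one hypothetical 1996 record
raw <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "1/6/1996 0:00:00")

# Trailing time text is ignored once "%m/%d/%Y" has been matched
parsed <- as.Date(raw, format = "%m/%d/%Y")

# Extract the year as an integer (base R equivalent of lubridate::year)
yr <- as.integer(format(parsed, "%Y"))

# Keep only records from 1996 onward
kept <- parsed[yr >= 1996]
kept
```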

Only use events with either health impact or economic damage

pdata <- filter(pdata, PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)

Inspecting and cleaning the exponent variables needed for the economic consequences plot

table(pdata$PROPDMGEXP)
## 
##             B      K      M 
##   8448     32 185474   7364
table(pdata$CROPDMGEXP)
## 
##             B      K      M 
## 102767      2  96787   1762
pdata$PROPDMGEXP <- toupper(pdata$PROPDMGEXP)
pdata$CROPDMGEXP <- toupper(pdata$CROPDMGEXP)

Converting, calculating, and adding new variables needed to get the graphics.

pdata <- pdata %>% 
        mutate(CROPDMGFACTOR = case_when(CROPDMGEXP == "" ~ 10^0 * CROPDMG,
                                         CROPDMGEXP == "?" ~ 10^0 * CROPDMG,
                                         CROPDMGEXP == "0" ~ 10^0 * CROPDMG,
                                         CROPDMGEXP == "2" ~ 10^2 * CROPDMG,
                                         CROPDMGEXP == "K" ~ 10^3 * CROPDMG,
                                         CROPDMGEXP == "M" ~ 10^6 * CROPDMG,
                                         CROPDMGEXP == "B" ~ 10^9 * CROPDMG)) %>%
        mutate(PROPDMGFACTOR = case_when(PROPDMGEXP == "" ~ 10^0 * PROPDMG,
                                         PROPDMGEXP == "-" ~ 10^0 * PROPDMG,
                                         PROPDMGEXP == "?" ~ 10^0 * PROPDMG,
                                         PROPDMGEXP == "+" ~ 10^0 * PROPDMG,
                                         PROPDMGEXP == "0" ~ 10^0 * PROPDMG,
                                         PROPDMGEXP == "1" ~ 10^1 * PROPDMG,
                                         PROPDMGEXP == "2" ~ 10^2 * PROPDMG,
                                         PROPDMGEXP == "3" ~ 10^3 * PROPDMG,
                                         PROPDMGEXP == "4" ~ 10^4 * PROPDMG,
                                         PROPDMGEXP == "5" ~ 10^5 * PROPDMG,
                                         PROPDMGEXP == "6" ~ 10^6 * PROPDMG,
                                         PROPDMGEXP == "7" ~ 10^7 * PROPDMG,
                                         PROPDMGEXP == "8" ~ 10^8 * PROPDMG,
                                         PROPDMGEXP == "H" ~ 10^2 * PROPDMG,
                                         PROPDMGEXP == "K" ~ 10^3 * PROPDMG,
                                         PROPDMGEXP == "M" ~ 10^6 * PROPDMG,
                                         PROPDMGEXP == "B" ~ 10^9 * PROPDMG)) %>%
        mutate(SUMDMG = PROPDMGFACTOR + CROPDMGFACTOR) %>%
        mutate(SUMFATINJ = FATALITIES + INJURIES)
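
The conversion above multiplies each damage figure by the power of ten that its exponent code stands for. The same rule can be sketched more compactly as a helper function; `exp_multiplier` is a hypothetical name, not part of the original analysis, and it assumes the same mapping as the case_when() calls (letters H, K, M, B; single digits as powers of ten; anything else as 1):

```r
# Hypothetical helper: translate exponent codes into numeric multipliers
exp_multiplier <- function(code) {
  code <- toupper(code)
  out <- rep(1, length(code))            # "", "-", "?", "+" and unknowns -> 1

  # Letter codes: hundred, thousand, million, billion
  idx <- match(code, c("H", "K", "M", "B"))
  has_letter <- !is.na(idx)
  out[has_letter] <- c(1e2, 1e3, 1e6, 1e9)[idx[has_letter]]

  # Single digits are read as powers of ten
  is_digit <- grepl("^[0-9]$", code)
  out[is_digit] <- 10^as.numeric(code[is_digit])
  out
}

# Made-up damage values and codes for illustration
damage <- c(25, 2.5, 1.2, 3, 4)
codes  <- c("K", "M", "", "b", "2")
dollars <- damage * exp_multiplier(codes)
dollars
```

One difference worth noting: case_when() returns NA for any code it does not list, which the NA check afterwards would reveal, whereas this sketch silently treats unlisted codes as a multiplier of 1.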

Check whether the newly created columns contain any NAs

sum(is.na(pdata))
## [1] 0

Summarizing data after cleaning

sumpdata <- pdata %>%
        group_by(EVTYPE) %>%
        summarize(SUMFATALITIES = sum(FATALITIES),
                  SUMINJURIES = sum(INJURIES),
                  TOTALFATINJ = sum(SUMFATINJ),
                  SUMPROPDMG = sum(PROPDMGFACTOR),
                  SUMCROPDMG = sum(CROPDMGFACTOR),
                  TOTALDMG = sum(SUMDMG))
## `summarise()` ungrouping output (override with `.groups` argument)
head(sumpdata)
## # A tibble: 6 x 7
##   EVTYPE    SUMFATALITIES SUMINJURIES TOTALFATINJ SUMPROPDMG SUMCROPDMG TOTALDMG
##   <chr>             <dbl>       <dbl>       <dbl>      <dbl>      <dbl>    <dbl>
## 1 "   HIGH~             0           0           0     200000          0   200000
## 2 " FLASH ~             0           0           0      50000          0    50000
## 3 " TSTM W~             0           0           0    8100000          0  8100000
## 4 " TSTM W~             0           0           0       8000          0     8000
## 5 "AGRICUL~             0           0           0          0   28820000 28820000
## 6 "ASTRONO~             0           0           0    9425000          0  9425000
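
The first rows above show EVTYPE labels with leading spaces, so the same event can be split across several groups. This analysis uses the labels as recorded; a possible extra cleaning step, shown here only as a sketch on made-up labels, would trim whitespace and unify case before grouping:

```r
# Hypothetical labels illustrating whitespace and case variants
ev <- c("   HIGH SURF ADVISORY", " TSTM WIND", "Tstm Wind", "TSTM WIND")

# Trim surrounding whitespace and convert to upper case
clean <- toupper(trimws(ev))

# Variants of the same label now fall into one group
counts <- table(clean)
counts
```

Applying such a step to pdata before group_by(EVTYPE) would change the totals reported below, so the figures in the Results section refer to the uncleaned labels.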

Results

Population Health Impact Results

Impact <- arrange(sumpdata, desc(TOTALFATINJ))
ImpactData <- head(Impact,10)

ImpactData
## # A tibble: 10 x 7
##    EVTYPE   SUMFATALITIES SUMINJURIES TOTALFATINJ SUMPROPDMG SUMCROPDMG TOTALDMG
##    <chr>            <dbl>       <dbl>       <dbl>      <dbl>      <dbl>    <dbl>
##  1 TORNADO           1511       20667       22178    2.46e10  283425010  2.49e10
##  2 EXCESSI~          1797        6391        8188    7.72e 6  492402000  5.00e 8
##  3 FLOOD              414        6758        7172    1.44e11 4974778400  1.49e11
##  4 LIGHTNI~           651        4141        4792    7.43e 8    6898440  7.50e 8
##  5 TSTM WI~           241        3629        3870    4.48e 9  553915350  5.03e 9
##  6 FLASH F~           887        1674        2561    1.52e10 1334901700  1.66e10
##  7 THUNDER~           130        1400        1530    3.38e 9  398331000  3.78e 9
##  8 WINTER ~           191        1292        1483    1.53e 9   11944000  1.54e 9
##  9 HEAT               237        1222        1459    1.52e 6     176500  1.70e 6
## 10 HURRICA~            64        1275        1339    6.93e10 2607872800  7.19e10

Economic Impact Results

Economic <- arrange(sumpdata, desc(TOTALDMG))
EconomicData <- head(Economic,10)

EconomicData
## # A tibble: 10 x 7
##    EVTYPE   SUMFATALITIES SUMINJURIES TOTALFATINJ SUMPROPDMG SUMCROPDMG TOTALDMG
##    <chr>            <dbl>       <dbl>       <dbl>      <dbl>      <dbl>    <dbl>
##  1 FLOOD              414        6758        7172    1.44e11    4.97e 9  1.49e11
##  2 HURRICA~            64        1275        1339    6.93e10    2.61e 9  7.19e10
##  3 STORM S~             2          37          39    4.32e10    5.00e 3  4.32e10
##  4 TORNADO           1511       20667       22178    2.46e10    2.83e 8  2.49e10
##  5 HAIL                 7         713         720    1.46e10    2.48e 9  1.71e10
##  6 FLASH F~           887        1674        2561    1.52e10    1.33e 9  1.66e10
##  7 HURRICA~            61          46         107    1.18e10    2.74e 9  1.46e10
##  8 DROUGHT              0           4           4    1.05e 9    1.34e10  1.44e10
##  9 TROPICA~            57         338         395    7.64e 9    6.78e 8  8.32e 9
## 10 HIGH WI~           235        1083        1318    5.25e 9    6.34e 8  5.88e 9

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

ImpactData$EVTYPE <- with(ImpactData, reorder(EVTYPE, -TOTALFATINJ))
ImpactDataS <- ImpactData %>%
        gather(key = "Type", value = "TOTALIMPACT", c("SUMFATALITIES", "SUMINJURIES")) %>%
        select(EVTYPE, Type, TOTALIMPACT)
ImpactDataS$Type[ImpactDataS$Type %in% c("SUMFATALITIES")] <- "Deaths"
ImpactDataS$Type[ImpactDataS$Type %in% c("SUMINJURIES")] <- "Injuries"

plot1 <- ggplot(ImpactDataS, aes(x = EVTYPE, y = TOTALIMPACT, fill = Type)) +
        geom_bar(stat = "identity", position = "dodge2") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        xlab("Event Type") +
        ylab("Health Impact") +
        ggtitle("Health Impact Events") +
        theme(plot.title = element_text(hjust = 0.5))
plot1

Across the United States, which types of events have the greatest economic consequences?

EconomicData$EVTYPE <- with(EconomicData, reorder(EVTYPE, -TOTALDMG))
EconomicDataS <- EconomicData %>%
        gather(key = "Type", value = "TOTALDAMAGE", c("SUMPROPDMG", "SUMCROPDMG")) %>%
        select(EVTYPE, Type, TOTALDAMAGE)
EconomicDataS$Type[EconomicDataS$Type %in% c("SUMPROPDMG")] <- "Property damage"
EconomicDataS$Type[EconomicDataS$Type %in% c("SUMCROPDMG")] <- "Crop damage"

plot2 <- ggplot(EconomicDataS, aes(x = EVTYPE, y = TOTALDAMAGE, fill = Type)) +
        geom_bar(stat = "identity", position = "dodge2") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        xlab("Event Type") +
        ylab("Economic Impact") +
        ggtitle("Economic Impact Events") +
        theme(plot.title = element_text(hjust = 0.5))
plot2

Conclusion

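The plots show which event types are most dangerous for people's health and most costly for the economy. Since 1996, tornadoes have caused the most combined fatalities and injuries (22,178), while excessive heat has caused the most fatalities (1,797). Floods have had the greatest economic impact, with about 1.49e11 in combined property and crop damage, roughly twice that of the next event type. Better ways to anticipate these events and evacuate people could reduce the danger to health, and better preparation for events such as floods, including protective materials, could reduce the economic impact.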