Synopsis

This project aims to show which weather events caused the most economic and health damage in the U.S. between 1950 and 2011. The data were obtained from the U.S. National Oceanic and Atmospheric Administration (NOAA). All weather events were grouped into ten groups: flood, hail, heat, rain, snow, storm, tornado, wind, winter and others. The analysis found that every group caused substantial damage to crops and property; the group that caused the most property damage was “flood”, and the group that caused the most crop damage was “others” (the “others” group contains the weather events that fall outside flood, hail, heat, rain, snow, storm, tornado, wind and winter, such as wintry mix and waterspout). Tornadoes caused the most health damage (fatalities and injuries).

Loading and Processing the Raw Data.

The storm database was obtained from the U.S. National Oceanic and Atmospheric Administration (NOAA). This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Loading the necessary packages.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Reading the Storm database.

The data are downloaded directly from the website using its URL.

# Download the data if it is not already present.
if (!file.exists("StormData.csv.bz2")){
        DataUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(DataUrl, destfile = "StormData.csv.bz2", method = "curl")
        
        # Stop if the file is still missing after the download attempt.
        if(!file.exists("StormData.csv.bz2")){
                stop("Download failed: StormData.csv.bz2 was not found")
        }
}

# Reading downloaded data.

df <- read.csv("StormData.csv.bz2")
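The call above relies on read.csv() opening a bzip2-compressed file without a separate decompression step. A minimal sketch of that behaviour on a tiny made-up file follows; the object names toy, tmp and toy_back are hypothetical and not part of the report. Note also that the Factor columns shown in the str() output of this report come from an R version older than 4.0.0; on R 4.0.0 and later, text columns are read as character by default.

```r
# Toy data standing in for the storm file (values are made up).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write a bzip2-compressed CSV to a temporary path.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file directly from its path.
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)

print(toy_back)
```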

Now the head and tail of the first 10 columns are shown to see the general structure of the data.

head(df[,1:10])

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI
## 1         0        
## 2         0        
## 3         0        
## 4         0        
## 5         0        
## 6         0

tail(df[,1:10])

##        STATE__           BGN_DATE    BGN_TIME TIME_ZONE COUNTY
## 902292      47 11/28/2011 0:00:00 03:00:00 PM       CST     21
## 902293      56 11/30/2011 0:00:00 10:30:00 PM       MST      7
## 902294      30 11/10/2011 0:00:00 02:48:00 PM       MST      9
## 902295       2  11/8/2011 0:00:00 02:58:00 PM       AKS    213
## 902296       2  11/9/2011 0:00:00 10:21:00 AM       AKS    202
## 902297       1 11/28/2011 0:00:00 08:00:00 PM       CST      6
##                                  COUNTYNAME STATE         EVTYPE BGN_RANGE
## 902292 TNZ001>004 - 019>021 - 048>055 - 088    TN WINTER WEATHER         0
## 902293                         WYZ007 - 017    WY      HIGH WIND         0
## 902294                         MTZ009 - 010    MT      HIGH WIND         0
## 902295                               AKZ213    AK      HIGH WIND         0
## 902296                               AKZ202    AK       BLIZZARD         0
## 902297                               ALZ006    AL     HEAVY SNOW         0
##        BGN_AZI
## 902292        
## 902293        
## 902294        
## 902295        
## 902296        
## 902297

The dimensions of the data are also checked.

dim(df)

## [1] 902297     37

Number of rows = 902297; number of columns = 37.

Next are the questions that the analysis has to answer.

Questions

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Identifying the variables needed to answer the questions.

First, let's look at all the variables in the data and their characteristics.

str(df)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

According to the codebook, the necessary variables are:

– EVTYPE variable (the type of weather event)

– Related to people's health:

FATALITIES (approx. number of deaths)
INJURIES (approx. number of injuries)

– Related to economic consequences:

PROPDMG (approx. property damage)
PROPDMGEXP (the unit of the property damage value)
CROPDMG (approx. crop damage)
CROPDMGEXP (the unit of the crop damage value)

This is the structure of the variables needed for the project.

str(df[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")])

## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

Checking for missing values in the necessary variables.

Let's see the proportion of missing values in each variable.

mean(is.na(df$EVTYPE))

## [1] 0

mean(is.na(df$FATALITIES))

## [1] 0

mean(is.na(df$INJURIES))

## [1] 0

mean(is.na(df$PROPDMG))

## [1] 0

mean(is.na(df$PROPDMGEXP))

## [1] 0

mean(is.na(df$CROPDMG))

## [1] 0

mean(is.na(df$CROPDMGEXP))

## [1] 0

The proportion of missing values is 0 in every variable, so none of the needed variables contains NA values.
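The seven separate checks above can also be written as a single call. The following is a minimal sketch on a made-up data frame (toy and na_share are hypothetical names), with one NA placed on purpose so that the result is not all zeros.

```r
# Toy data frame (made-up values); one NA placed in INJURIES on purpose.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "WIND"),
  FATALITIES = c(0, 1, 0, 2),
  INJURIES   = c(15, NA, 2, 0)
)

# is.na() on a data frame returns a logical matrix; colMeans() then gives
# the proportion of missing values in each column.
na_share <- colMeans(is.na(toy))
print(na_share)
```

Applied to the report's data, the same idea would take the seven needed columns of df as input instead of toy.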

Next, the exponent codes in PROPDMGEXP and CROPDMGEXP have to be converted to numeric multipliers in order to calculate the economic damage. The conversion used is:

K or k = 10^3
M or m = 10^6
B or b = 10^9
H or h = 10^2
0,1,…,8 = 10^0
+ = 10^0
-, ? or blank = 0
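As a worked illustration of how a unit code turns a recorded value into dollars, the sketch below uses a named lookup vector for the K, M and B codes only. It is an illustration on made-up values, not the code used in this report; the names multiplier, dmg, unit and total are hypothetical.

```r
# Multipliers for the K, M and B codes from the list above.
multiplier <- c(K = 1e3, k = 1e3, M = 1e6, m = 1e6, B = 1e9, b = 1e9)

# Toy rows: recorded damage value plus its unit code (values are made up).
dmg  <- c(25, 2.5, 1.2)
unit <- c("K", "M", "B")

# Index the named vector by unit code, then multiply.
total <- dmg * unname(multiplier[unit])
print(total)
```

So a recorded value of 25 with code K stands for 25 × 10^3 = 25000 dollars.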

A data frame is created with the necessary variables.

DfWithVariables <-  df %>% select("EVTYPE","FATALITIES","INJURIES","PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP")

Here the exponent codes shown above are replaced by their numeric values.

PROPDMGEXP variable.

DfWithVariables$PROPDMGEXP <- as.character(DfWithVariables$PROPDMGEXP)

# Map each exponent code to its numeric multiplier.
DfWithVariables <- DfWithVariables %>% 
        mutate(PROPDMGEXP = case_when(
                PROPDMGEXP == "K" ~ 1e3,
                PROPDMGEXP %in% c("m", "M") ~ 1e6,
                PROPDMGEXP == "B" ~ 1e9,
                PROPDMGEXP %in% c("h", "H") ~ 1e2,
                PROPDMGEXP %in% c(as.character(0:8), "+") ~ 1,
                PROPDMGEXP %in% c("-", "?", "") ~ 0
        ))

CROPDMGEXP variable.

DfWithVariables$CROPDMGEXP <- as.character(DfWithVariables$CROPDMGEXP)

# Map each exponent code to its numeric multiplier.
DfWithVariables <- DfWithVariables %>% 
        mutate(CROPDMGEXP = case_when(
                CROPDMGEXP %in% c("k", "K") ~ 1e3,
                CROPDMGEXP %in% c("m", "M") ~ 1e6,
                CROPDMGEXP == "B" ~ 1e9,
                CROPDMGEXP %in% c("0", "2") ~ 1,
                CROPDMGEXP %in% c("", "?") ~ 0
        ))

Now the total crop and property damage is calculated.

DfWithVariables$PROTOTALDAMAGE <- DfWithVariables$PROPDMG * DfWithVariables$PROPDMGEXP
DfWithVariables$CROPTOTALDAMAGE <- DfWithVariables$CROPDMG * DfWithVariables$CROPDMGEXP

The codebook shows that the specific events in the data can be gathered into 10 broader types of events.

So the data are grouped according to the following events:

Hail
Heat
Flood
Wind
Storm
Snow
Tornado
Winter
Rain
Others

# The first 10 event types of the EVTYPE variable.

table(DfWithVariables$EVTYPE)[1:10]

## 
##    HIGH SURF ADVISORY         COASTAL FLOOD           FLASH FLOOD 
##                     1                     1                     1 
##             LIGHTNING             TSTM WIND       TSTM WIND (G45) 
##                     1                     4                     1 
##            WATERSPOUT                  WIND                     ? 
##                     1                     1                     1 
##       ABNORMAL WARMTH 
##                     4

# The last 11 event types of the EVTYPE variable.

table(DfWithVariables$EVTYPE)[975:985]

## 
## WINTER STORM/HIGH WINDS           WINTER STORMS          Winter Weather 
##                       1                       3                      19 
##          WINTER WEATHER      WINTER WEATHER MIX      WINTER WEATHER/MIX 
##                    7026                       6                    1104 
##             WINTERY MIX              Wintry mix              Wintry Mix 
##                       2                       3                       1 
##              WINTRY MIX                     WND 
##                      90                       1

The previous tables show the variety of event types that have to be grouped.

Now all the event types shown above are assigned to the 10 groups listed earlier. The technique is keyword matching: an event belongs to a group if its name contains the group's keyword. For example, the COASTAL FLOOD event is in the FLOOD group because it contains the word FLOOD. When a name contains several keywords, the last matching assignment in the code below decides the group.

# Create new variable GROUPEVENTS
DfWithVariables$GROUPEVENTS <- "OTHERS"
DfWithVariables$GROUPEVENTS[grep("RAIN",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "RAIN"
DfWithVariables$GROUPEVENTS[grep("WINTER",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "WINTER"
DfWithVariables$GROUPEVENTS[grep("TORNADO",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "TORNADO"
DfWithVariables$GROUPEVENTS[grep("SNOW",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "SNOW"
DfWithVariables$GROUPEVENTS[grep("STORM",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "STORM"
DfWithVariables$GROUPEVENTS[grep("WIND",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "WIND"
DfWithVariables$GROUPEVENTS[grep("FLOOD",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "FLOOD"
DfWithVariables$GROUPEVENTS[grep("HEAT",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "HEAT"
DfWithVariables$GROUPEVENTS[grep("HAIL",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "HAIL"

# Number of events in each group.
table(DfWithVariables$GROUPEVENTS)

## 
##   FLOOD    HAIL    HEAT  OTHERS    RAIN    SNOW   STORM TORNADO    WIND  WINTER 
##   82730  290401    2648   48970   12155   17652   15123   60699  363759    8160
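Because each grep() assignment above overwrites any earlier one, the order of the keywords decides the group of an event whose name contains more than one keyword. The sketch below reproduces that precedence on four event names taken from the tables above; assign_group() is a hypothetical helper written for this illustration, not part of the report's code.

```r
# Keywords in the same order as the grep() calls above; later matches
# overwrite earlier ones, so the last keyword found decides the group.
keywords <- c("RAIN", "WINTER", "TORNADO", "SNOW", "STORM",
              "WIND", "FLOOD", "HEAT", "HAIL")

# Hypothetical helper: start everything as OTHERS, then apply each keyword in turn.
assign_group <- function(evtype) {
  group <- rep("OTHERS", length(evtype))
  for (kw in keywords) {
    group[grepl(kw, evtype, ignore.case = TRUE)] <- kw
  }
  group
}

examples <- c("COASTAL FLOOD", "WINTER STORM/HIGH WINDS",
              "WINTER WEATHER", "WINTRY MIX")
groups <- assign_group(examples)
print(data.frame(EVTYPE = examples, GROUP = groups))
```

Here WINTER STORM/HIGH WINDS matches WINTER, STORM and WIND and ends up in WIND, while WINTRY MIX matches no keyword and stays in OTHERS.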

Analysis.

# Table with the total economic damage (In crops and properties) in dollars.

TotalEconomicDamage <- DfWithVariables %>% select(PROTOTALDAMAGE,CROPTOTALDAMAGE,GROUPEVENTS) %>%
        group_by(GROUPEVENTS) %>% summarise(TotalProp = sum(PROTOTALDAMAGE), TotalCrop = sum(CROPTOTALDAMAGE))

## `summarise()` ungrouping output (override with `.groups` argument)

# Table with the total Health damage(Fatalities and Injuries)
TotalHealthDamage <- DfWithVariables %>% select(FATALITIES,INJURIES,GROUPEVENTS) %>%
        group_by(GROUPEVENTS) %>% summarise(TotalFatalities = sum(FATALITIES), TotalInjuries = sum(INJURIES))

## `summarise()` ungrouping output (override with `.groups` argument)
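To read a ranking straight from a summary table like the ones above, the rows can be sorted by the total of interest. The following is a minimal sketch in base R (aggregate() and order() in place of the dplyr verbs used in this report) on made-up records; toy, totals and ranked are hypothetical names.

```r
# Toy per-event records (made-up values), one row per event.
toy <- data.frame(
  GROUPEVENTS = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES  = c(3, 1, 2, 0, 2),
  INJURIES    = c(10, 0, 5, 1, 2)
)

# Sum both columns within each group (base R counterpart of
# group_by() followed by summarise()).
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ GROUPEVENTS,
                    data = toy, FUN = sum)

# Sort so that the group with the most fatalities comes first.
ranked <- totals[order(-totals$FATALITIES), ]
print(ranked)
```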

Graphs

The first graph shows the total property damage caused by the weather events.

ggplot(data = TotalEconomicDamage, mapping = aes(x = GROUPEVENTS , y = TotalProp, fill = TotalProp )) +
        geom_bar(stat = "identity") + 
        # annotate() draws each label once; geom_text() would repeat it for every row.
        annotate("text", x = 3, y = 1.5e+10, label = "20325750", angle = 90) +
        annotate("text", x = 10, y = 1.5e+10, label = "27298000", angle = 90) +
        ggtitle("Total property damage (In dollars) by weather type of event.") +
        theme(plot.title = element_text(color = "red")) + 
        labs(x = "Weather type of event.", y = "Total Property damage in dollars.")+
        scale_fill_continuous(name = "Total property damage scale")

Note that the bars for HEAT and WINTER are not visible because the y-axis range is so large; their total property damage (in dollars) is therefore written on the graph.
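One possible alternative to writing the small values on the graph is a logarithmic y axis, which keeps bars of very different magnitude visible. The sketch below uses base R graphics rather than ggplot2 (where scale_y_log10() serves the same purpose). The HEAT and WINTER values are the ones printed on the graph above; PLACEHOLDER is a made-up large value standing in for the biggest group, not a result of this analysis.

```r
# HEAT and WINTER totals as printed on the graph; PLACEHOLDER is hypothetical.
totals <- c(HEAT = 20325750, WINTER = 27298000, PLACEHOLDER = 1.5e11)

# On a linear axis the small bars are thousands of times shorter.
ratio <- totals[["PLACEHOLDER"]] / totals[["HEAT"]]

# Draw on a null device so that no file is written; log = "y" puts the
# bar heights on a logarithmic scale.
pdf(NULL)
mids <- barplot(totals, log = "y",
                ylab = "Total property damage in dollars (log scale)")
invisible(dev.off())
```

A log axis changes how bar heights should be read, so it is a trade-off rather than a strict improvement.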

FLOOD was the event group that caused the most property damage between 1950 and 2011.

The second graph shows the total crop damage caused by the weather events.

ggplot(data = TotalEconomicDamage, mapping = aes(x = GROUPEVENTS , y = TotalCrop, fill = TotalCrop)) +
        geom_bar(stat = "identity", position = position_dodge()) +
        ggtitle("Total crop damage (In dollars) by weather type of event.") +
        theme(plot.title = element_text(color = "red"))+
        scale_fill_continuous(name = "Total Crop damage scale")+
        annotate("text", x = 10, y = 3.5e+9, label = "15000000", angle = 90)+
        labs(x = "Weather type of event.", y = "Total crop damage in dollars.")

Events other than flood, hail, heat, rain, snow, storm, tornado, wind and winter (such as wintry mix and waterspout) were the most damaging weather events for crops between 1950 and 2011.

The third graph shows the total fatalities caused by the weather events.

ggplot(data = TotalHealthDamage, mapping = aes(x = GROUPEVENTS , y = TotalFatalities, fill = TotalFatalities )) +
        geom_bar(stat = "identity", position = position_dodge()) + 
        ggtitle("Total fatalities by weather type of event.") +
        theme(plot.title = element_text(color = "red")) +
        labs(x = "Weather type of event.", y = "Total fatalities")+
        scale_fill_continuous(name = "Total Fatalities")

TORNADO was the event group that caused the most fatalities between 1950 and 2011.

The fourth graph shows the total number of people injured by the weather events.

ggplot(data = TotalHealthDamage, mapping = aes(x = GROUPEVENTS , y = TotalInjuries, fill = TotalInjuries )) +
        geom_bar(stat = "identity", position = position_dodge()) + 
        ggtitle("Total injuries by weather type of event.") +
        theme(plot.title = element_text(color = "red")) +
        labs(x = "Weather type of event.", y = "Total injuries")+
        scale_fill_continuous(name = "Total injured people")

TORNADO was the event group that injured the most people between 1950 and 2011.

Types of weather events that caused the most economic and health damage in the U.S. between 1950 and 2011.