Impact of weather events on USA Population Health and Economy

Synopsis

Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents the occurrence of storms and other significant weather phenomena in the United States. Data are public and accessible through the web.

This project analyzes the available data to evaluate which types of meteorological events are the most harmful to the population and which cause the greatest economic losses in the US.

Data Processing

Getting data

The source database can be downloaded from the URL below. After downloading, we read the compressed file directly into rawdata (read.csv decompresses .bz2 files on the fly):

    url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    
    # mode = "wb" keeps the compressed file intact; method = "wininet" only works on Windows
    download.file(url, 'stormdata.csv.bz2', mode = "wb")
    # Read data
    rawdata<- read.csv('stormdata.csv.bz2')
    dim(rawdata)
## [1] 902297     37

The raw data has 902297 entries and 37 variables. To reduce memory use, we select only the 8 columns needed for the analysis and store them in a new data frame:

    library(dtplyr)
    library(data.table)
    library(dplyr)
    data <- select(rawdata, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP,BGN_DATE)

We also convert the date column ‘BGN_DATE’ into a year (YEAR) for our purposes:

    # convert BGN_DATE to year (year() comes from data.table)
    data$BGN_DATE<-year(strptime((data$BGN_DATE), "%m/%d/%Y"))
    colnames(data)[8] <- "YEAR"
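
The year extraction can be checked on a single value. The sketch below uses only base R (as.Date and format) instead of data.table's year(). The sample string is an assumed value in the month/day/year layout that the "%m/%d/%Y" format expects, not a row quoted from the file.

```r
# Illustrative date string in month/day/year layout (assumed sample value)
sample_date <- "4/18/1950 0:00:00"

# as.Date() parses the leading date part and ignores the trailing time
parsed <- as.Date(sample_date, format = "%m/%d/%Y")

# format() returns the year as text, so convert it to a number
sample_year <- as.numeric(format(parsed, "%Y"))
print(sample_year)
```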

Cleaning data

The EVTYPE column contains many typos and inconsistent labels: mixed upper and lower case, stray punctuation and digits, and misspellings. The following code cleans the most frequent ones so that equivalent events are grouped under a single name.

    # Convert to uppercase
    data$EVTYPE <-  toupper(data$EVTYPE)
    data$CROPDMGEXP <-  toupper(data$CROPDMGEXP)
    data$PROPDMGEXP<-  toupper(data$PROPDMGEXP)
    
    # replace punctuation with spaces, then remove digits
    data$EVTYPE <- gsub("[[:punct:]]"," ", data$EVTYPE)
    data$EVTYPE <- gsub("\\d+","", data$EVTYPE )
    
    # fix some errors in EVTYPE
    data$EVTYPE <- trimws(data$EVTYPE) # leading and trailing spaces
    data$EVTYPE <- gsub("TORNDAO","TORNADO", data$EVTYPE)
    data$EVTYPE <- gsub("TORNADOES","TORNADO", data$EVTYPE)
    
    # misspellings and abbreviations of THUNDERSTORM
    thunder <- "THUNDERESTORM|THUNDEERSTORM|THUNDESTORM|THUNDERTORM|THUNERSTORM|THUDERSTORM|THUNDERSTROM|THUNDERTSORM|TSTM"
    data$EVTYPE <- gsub(thunder,"THUNDERSTORM", data$EVTYPE)
    data$EVTYPE <- gsub("THUNDERSTORMS","THUNDERSTORM", data$EVTYPE)
    data$EVTYPE <- gsub("THUNDERSTORMW","THUNDERSTORM W", data$EVTYPE) # restore the missing space
    data$EVTYPE <- gsub("SEVERE THUNDERSTORM","THUNDERSTORM", data$EVTYPE)
    
    data$EVTYPE <- gsub("WINS","WINDS", data$EVTYPE)
    # normalize misspellings of LIGHTNING
    data$EVTYPE <- gsub("LIGHTING|LIGNTNING","LIGHTNING", data$EVTYPE)
    
    data$EVTYPE <- gsub("TROPICAL STORM.*","TROPICAL STORM", data$EVTYPE)
    #remove duplicate snow
    data$EVTYPE <- gsub("* SNOW","SNOW", data$EVTYPE)
    # keep only first type
    data$EVTYPE <- gsub("^URBAN.*","FLOOD", data$EVTYPE)
    data$EVTYPE <- gsub("^FLASH FLOOD.*","FLOOD", data$EVTYPE)
    data$EVTYPE <- gsub("^FLOOD.*","FLOOD", data$EVTYPE)
    data$EVTYPE <- gsub("^COLD.*","COLD", data$EVTYPE)
    data$EVTYPE <- gsub("^THUNDERSTORM.*","THUNDERSTORM", data$EVTYPE)
    data$EVTYPE <- gsub("^SNOW.*","SNOW", data$EVTYPE)
    data$EVTYPE <- gsub("^WIND.*","WIND", data$EVTYPE)
    data$EVTYPE <- gsub("^LIGHTNING.*","LIGHTNING", data$EVTYPE)
    data$EVTYPE <- gsub("^ICE.*","ICE", data$EVTYPE)
    data$EVTYPE <- gsub("^HURRICANE.*","HURRICANE", data$EVTYPE)
    data$EVTYPE <- gsub("^HIGH WIND.*","HIGH WIND", data$EVTYPE)
    data$EVTYPE <- gsub("^HEAVY SNOW.*","HEAVY SNOW", data$EVTYPE)
    data$EVTYPE <- gsub("^HEAVY RAIN.*","HEAVY RAIN", data$EVTYPE)
    
    data$EVTYPE <- gsub("^TORNADO.*","TORNADO", data$EVTYPE)
    data$EVTYPE <- gsub("^HEAVYSNOW.*","HEAVYSNOW", data$EVTYPE)
    data$EVTYPE <- gsub("^BLIZZARD.*","BLIZZARD", data$EVTYPE)
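
To make the effect of these rules concrete, the following minimal sketch applies a reduced version of them to a handful of labels. The sample labels are made up for illustration and the `rules` vector is a hypothetical shorthand for the individual gsub calls, not part of the original analysis.

```r
# Made-up sample of raw labels (illustrative, not taken from the data set)
raw <- c("Tstm Wind", " THUNDERSTORM WINDS", "TORNADO F1",
         "Flash Flood/Flood", "tornado")

clean <- toupper(raw)
clean <- gsub("[[:punct:]]", " ", clean)      # punctuation becomes a space
clean <- gsub("\\d+", "", clean)              # digits are dropped
clean <- trimws(clean)                        # strip leading/trailing spaces
clean <- gsub("TSTM", "THUNDERSTORM", clean)  # expand the abbreviation

# Hypothetical shorthand: pattern -> label, keeping only the leading type
rules <- c("^THUNDERSTORM.*" = "THUNDERSTORM",
           "^TORNADO.*"      = "TORNADO",
           "^FLASH FLOOD.*"  = "FLOOD")
for (pattern in names(rules)) {
  clean <- gsub(pattern, rules[[pattern]], clean)
}

# Five raw labels collapse into three event types
print(table(clean))
```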

As detailed on Page 12 of the National Weather Service Storm Data Documentation, the property and crop damage figures need to be adjusted using the corresponding multiplier fields (PROPDMGEXP and CROPDMGEXP). More information about this point is available at https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html.

    # recalculate real value of damage
    data$PROPDMG[data$PROPDMGEXP=="K"] <- data$PROPDMG[data$PROPDMGEXP=="K"] * 1000
    data$PROPDMG[data$PROPDMGEXP=="M"] <- data$PROPDMG[data$PROPDMGEXP=="M"] * 1000000
    data$PROPDMG[data$PROPDMGEXP=="B"] <- data$PROPDMG[data$PROPDMGEXP=="B"] * 1000000000
    data$PROPDMG[data$PROPDMGEXP=="+"] <- data$PROPDMG[data$PROPDMGEXP=="+"] * 1
    data$PROPDMG[grep("[[:digit:]]",data$PROPDMGEXP)] <- data$PROPDMG[grep("[[:digit:]]",data$PROPDMGEXP)] * 10
    
    
    data$CROPDMG[data$CROPDMGEXP=="K"] <- data$CROPDMG[data$CROPDMGEXP=="K"] * 1000
    data$CROPDMG[data$CROPDMGEXP=="M"] <- data$CROPDMG[data$CROPDMGEXP=="M"] * 1000000
    data$CROPDMG[data$CROPDMGEXP=="B"] <- data$CROPDMG[data$CROPDMGEXP=="B"] * 1000000000
    data$CROPDMG[grep("[[:digit:]]",data$CROPDMGEXP)] <- data$CROPDMG[grep("[[:digit:]]",data$CROPDMGEXP)] * 10
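
The same adjustment can also be written as a single lookup, which avoids repeating each subsetting expression twice. This is only a sketch: the helper name `exp_multiplier` is hypothetical, the toy rows are made-up values, and the mapping simply mirrors the rules used above (K, M, B, digits), leaving every other symbol unchanged.

```r
# Hypothetical helper: translate an exponent symbol into a numeric multiplier.
# The mapping mirrors the replacement rules used in this report.
exp_multiplier <- function(exp) {
  exp <- toupper(as.character(exp))
  mult <- rep(1, length(exp))                # default: value stays unchanged
  mult[exp == "K"] <- 1e3
  mult[exp == "M"] <- 1e6
  mult[exp == "B"] <- 1e9
  mult[grepl("^[[:digit:]]$", exp)] <- 10    # any single digit
  mult
}

# Toy rows (made-up values) to show the effect
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "M", "B", "5", ""))
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

Writing the result to a new column keeps the original PROPDMG values available, so the conversion can be re-run without multiplying twice.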

Tidying data

As the next plot shows, the number of events recorded in Storm Data has been increasing since 1950, so we have to keep in mind that records are not uniformly distributed over time.

    hist(data$YEAR,
         border=FALSE, col='skyblue',
         main="number of events by year in storm data",
         xlab="year", ylab="number of events",
         cex.axis=0.8)

According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and only since January 1996 have all event types been recorded.

Therefore, we have selected only data from 1996 to 2011 for this project.

    # select only modern data records
    data<-data[data$YEAR>1995,]

Results

This is the final project for the Reproducible Research course of the Johns Hopkins Data Science specialization on Coursera. Its purpose is to process raw data into a reproducible analysis that answers two questions about severe weather events in the United States.

  1. Which types of severe weather events are the most harmful with respect to population health?
  2. Which types of severe weather events have the greatest economic consequences?

These are the conclusions we have reached:

Which types of events are most harmful with respect to population health?

We have two variables that express the effect on population health: FATALITIES and INJURIES. To include both in the analysis, we have created the following harmful index:

  • harmful index – HAR.INDX = FATALITIES + INJURIES * 0.2

This is a simplification that weights each injury as one fifth of a fatality. We considered it necessary in order to estimate the total harm to population health using all the data available in Storm Data.

Deeper studies have estimated this with economic equivalences of human life, but this is beyond our goal.

    # calculate harmful index:
    # new variable with the total harm to population health
    data$HAR.INDX <-data$FATALITIES + data$INJURIES*0.2
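
As a worked example of the index, the sketch below computes it for two made-up events (the event names and counts are illustrative only). It shows that an event with fewer fatalities can still score higher when it causes many injuries.

```r
# Two made-up events (illustrative values, not taken from Storm Data)
toy <- data.frame(EVTYPE     = c("EVENT A", "EVENT B"),
                  FATALITIES = c(10, 2),
                  INJURIES   = c(50, 100))

# Same formula as above: each injury counts as 0.2 of a fatality
toy$HAR.INDX <- toy$FATALITIES + toy$INJURIES * 0.2

# EVENT A: 10 + 50 * 0.2 = 20; EVENT B: 2 + 100 * 0.2 = 22
print(toy)
```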

To find the worst event types (EVTYPE), we first calculate the accumulated sum by event type and then plot the 20 with the highest harmful index.

    library(ggplot2)
    
    popdmg<-aggregate(HAR.INDX~EVTYPE,data,sum)
    popdmg <- popdmg[order(popdmg$HAR.INDX, decreasing=TRUE),]
    popdmg<-head(popdmg,20)
    
    barplot (
        height = popdmg$HAR.INDX,
        main = "Harmful index by Event Type (1996-2011)",
        ylab = "Harmful index",
        names.arg = popdmg$EVTYPE,
        col = rainbow (20),
        las = 2,
        cex.names= 0.6,
        cex.axis = 0.8
    )

According to the results, the worst phenomena for population health are tornadoes, heat (excessive heat and heat) and floods. Tornadoes are by far the most harmful.

Across the United States, which types of events have the greatest economic consequences?

Although harm to the population is also a fundamental part of the economic impact, it is not taken into account here: this analysis focuses only on damage to property and crops.

To evaluate the total economic damage, we will add the two variables that we have in the database that evaluate damage: property damage (PROPDMG) and crop damage (CROPDMG).

As described in the data cleaning section, these variables have been converted to monetary values by applying the multiplier given in the corresponding exponent column.

    # create new variable with total damage
    data$TOTALDMG <-data$PROPDMG + data$CROPDMG

A more in-depth review of the data would be necessary, because only flood events are required to carry a monetary estimate. The Storm Data preparer must enter monetary damage amounts for flood events, even if the amount is a “guesstimate”: the U.S. Army Corps of Engineers requires the NWS to provide monetary damage amounts (property and/or crop) resulting from any flood event, but not from other event types. Therefore only the flood records can be expected to have “good” damage data.

Another important factor is inflation: costs are not adjusted in this analysis, so a dollar amount recorded in an earlier year is not equivalent to the same amount in current dollars.
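
An inflation adjustment was not carried out in this project, but it could be sketched as follows. Both the helper `adjust_to_base` and the `price_index` values are hypothetical: the index numbers are invented placeholders and would have to be replaced with a real consumer price index series before drawing any conclusion.

```r
# Placeholder price index by year: these numbers are invented for illustration
# and must be replaced with a real consumer price index series.
price_index <- c("1996" = 100, "2011" = 150)

# Hypothetical helper: express an amount from `year` in `base_year` dollars
adjust_to_base <- function(amount, year, base_year = "2011") {
  amount * price_index[[base_year]] / price_index[as.character(year)]
}

# The same nominal amount recorded in 1996 and in 2011
adjusted <- unname(adjust_to_base(c(1e6, 1e6), c(1996, 2011)))
print(adjusted)
```

With the real index in place, the adjusted amounts would be summed by event type instead of the nominal TOTALDMG values.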

    totdmg<-aggregate(TOTALDMG~EVTYPE,data,sum)
    totdmg <- totdmg[order(totdmg$TOTALDMG, decreasing=TRUE),]
    totdmg<-head(totdmg,20)
    totdmg$TOTALDMG<-totdmg$TOTALDMG/1000000000
    
    barplot (
        height = totdmg$TOTALDMG,
        main = "Total damage (1996-2011)",
        ylab = "billion $",
        names.arg = totdmg$EVTYPE,
        col = rainbow (20),
        las = 2,
        cex.names= 0.6,
        cex.axis= 0.6
    )

From this analysis we can conclude that from 1996 to 2011, floods and hurricanes had the greatest cumulative impact on the economy.