Analysis of the effects of severe weather in health and economy in the United states

Synopsis

This analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database in other to identify the weather events that have the greatest impact in health and in ecconomic loss in the US. The necessary information was selected from downloaded data set, and data were prepared for analysis according to the documentation (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) Health damage was assessed using the quantification “Fatalities” and “Injuries”. Economic impact was evaluated using the quantification “Property damage” and “Crop damage”. The data set contains data since 1951 accross all the States. The sum of the damage variable across states and time was used for this analysis. Bar plots were used for the visualization of the results.

Data Processing

First, dataset was downloaded and read into r:

#packages needed for the program
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
#download data set
fileUrl <-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./stormdata.cvs.bz2", method = "curl")

#read files
storm <- read.csv("stormdata.cvs.bz2")

Next, a subset dataset was created by selecting the variables needed: STATE, BGN_DATE,EVTYPE (general info),FATALITIES,INJURIES (health), PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP (damage). BGN_DATE was transformed to data class

storm_wd <- select (storm,
                    STATE, BGN_DATE,EVTYPE,FATALITIES,INJURIES,
                    PROPDMG,PROPDMGEXP, CROPDMG, CROPDMGEXP)
#format BGN_DATE as date
storm_wd$BGN_DATE <- mdy_hms (storm_wd$BGN_DATE)

The STATES variables lists 72 states codes in the original dataframe, while there should be 53. Information about what the other codes were could not be found in the documentation. The proportion of the unknown codes was estimated.

#Evaluate the porporcion of observations from unknown codes:
#  us_states: vector with proper US codes (53)    
us_states <- c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL",
                "GA", "HI", "ID", "IL", "IN", "IA", "KS","KY","LA", "ME", 
                "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", 
                "NJ", "NM", "NY","NC", "ND", "OH", "OK", "OR", "PA", "RI", 
                "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV","WI",
                "WY", "GU", "PR", "VI")
us_states <- as.data.frame (us_states)
colnames(us_states) <- c("STATE")
storm_wd_us <- merge(us_states,storm_wd,by = "STATE")
# Number of observations with correct code:
dim(storm_wd)
## [1] 902297      9
dim(storm_wd_us)
## [1] 886845      9
#886845 out of 902297 have a valid code for united states or unicorporated regions
# % unknown codes: 
percentage_ukn_codes <- 100 - (length(storm_wd_us$STATE)*100)/length(storm_wd$STATE)

Unkown codes are 1.7125% of the dataset. As we are evaluating the effect in the US and we do not know what those codes indicate, we do not include them in the analysis. Also it is a low proportion of the data. The analysis will be done with storm_wd_us, which contains the data for the correct STATE codes

The variable EVTYPE, which indicates the type of event, contains multiple typos and different names for same types of events. Therefore an intense data cleaning procedure was needed to minimize these. Permitted Storm Data Events are shown in page 6 in documentation, so this was taken into account for the data cleaning.

#Examine EVTYPE: 
table(storm_wd_us$EVTYPE)
#Transform data EVTYPE. Transform all characters to upper. 
storm_wd_us$EVTYPE <- toupper(storm_wd_us$EVTYPE)
#Permitted Storm Data Events are shown in page 6 in documentation
#/,-, or 2 spaces together to " "
storm_wd_us$EVTYPE <- sub("/|-|  ", " ", storm_wd_us$EVTYPE)
#no plurals in data event table: remove "S" at the end
storm_wd_us$EVTYPE <- sub ("S$","", storm_wd_us$EVTYPE)
#unify variants of "STRONG WIND":
storm_wd_us$EVTYPE <- sub("(WIND)|(WND)|(STORM FORCE WIND)|(LOW WIND CHILL)","STRONG WIND",storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("(WIND DAMAGE)|(WIND GUST)|(WIND STORM)|(WIND ADVISORY)|(WIND CHILL)","STRONG WIND",storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("(.*STRONG WIND.*)","STRONG WIND",storm_wd_us$EVTYPE)
#unify variants of "THUNDERSTORM WIND"
storm_wd_us$EVTYPE <- sub("TSTM.*", "THUNDERSTORM WIND", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub(".*THUNDER.*", "THUNDERSTORM WIND", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("TUNDER.*", "THUNDERSTORM WIND", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("(TUNDER.*)|(GUSTNADO.*)|(THUNDEER.*)|(THUNER.*)|(THUDER.*)|(THUNDE.*)|(.*BURST.*)", "THUNDERSTORM WIND", storm_wd_us$EVTYPE)
#unify variants of "TORNADO":
storm_wd_us$EVTYPE <- sub("(TORNADO.*)|(TORNDAO.*)", "TORNADO", storm_wd_us$EVTYPE)
# group categories with "SUMMARY.." in "UNCLASSIFIED":
storm_wd_us$EVTYPE <- sub("SUMMARY.*", "UNCLASSIFIED", storm_wd_us$EVTYPE)
#unify variants of "WINTER WEATHER":
storm_wd_us$EVTYPE <- sub("(.*WINT.*)|(.*DRIZZLE.*)", "WINTER WEATHER", storm_wd_us$EVTYPE)
#unify variants of "WILDFIRE":
storm_wd_us$EVTYPE <- sub("(.*WILD.*)|(.*FIRE.*)", "WILDFIRE", storm_wd_us$EVTYPE)
#unify variants of "WATERSPOUT":
storm_wd_us$EVTYPE <- sub("(.*WATERSPOUT.*)|(WAYTERSPOUT)|(WATER SPOUT)|(WAYTERSPOUT)", 
                          "WATERSPOUT", storm_wd_us$EVTYPE)
#unify variants of "VOLCANIC ASH":
storm_wd_us$EVTYPE <- sub("(.*VOLCANIC.*)", "VOLCANIC ASH", storm_wd_us$EVTYPE)
#unify variants of "flash flood" (use lower case for now to differenctiate later from "FLOOD") :
storm_wd_us$EVTYPE <- sub("(.*FLASH.*)", "flash flood", storm_wd_us$EVTYPE)
#unify variants of "FLOOD":
storm_wd_us$EVTYPE <- sub("(.*FLOOD.*)|(.*URBAN.*)|(FLD)|(.*STREAM.*)", "FLOOD", storm_wd_us$EVTYPE)
#unify variants of "HAIL":
storm_wd_us$EVTYPE <- sub("(.*HAIL.*)", "HAIL", storm_wd_us$EVTYPE)
#unify variants of "TROPICAL STORM":
storm_wd_us$EVTYPE <- sub("(.*TROPICAL STORM.*)", "TROPICAL STORM", storm_wd_us$EVTYPE)
#unify variants of "HEAT":
storm_wd_us$EVTYPE <- sub("(.*WARMTH.*)|(HYPERTHERMIA.*)|(.*WARM.*)|(.*HOT.*)|(RECORD HIGH)","HEAT", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub(".*HEAT.*|.*HIGH TEMPERATURE","HEAT", storm_wd_us$EVTYPE)
#unify variants of "HEAVY SNOW":
storm_wd_us$EVTYPE <- sub("(.*SNOW.*)", "HEAVY SNOW", storm_wd_us$EVTYPE)
#unify variants of "SLEET":
storm_wd_us$EVTYPE <- sub(".*SLEET.*", "SLEET", storm_wd_us$EVTYPE)
#unify variants of "HURRICANE":
storm_wd_us$EVTYPE <- sub("(.*HURRICANE.*)|(.*TYPHOON.*)","HURRICANE",storm_wd_us$EVTYPE)
#unify variants of "COLD/WIND CHILL":
storm_wd_us$EVTYPE <- sub(".*COLD.*|.*COOL.*|.*LOW TEMP.*|RECORD LOW|.*HYPOTHERM.*","COLD/WIND CHILL",storm_wd_us$EVTYPE)
#unify variants of "HIGH WIND":
storm_wd_us$EVTYPE <- sub("(.*HIGH WIND.*)|(WIND GUST)","HIGH WIND",storm_wd_us$EVTYPE)
#unify variants of "FUNNEL":
storm_wd_us$EVTYPE <- sub(".*FUNNEL.*", "FUNNEL", storm_wd_us$EVTYPE)
#unify variants of "DROUGHT":
storm_wd_us$EVTYPE <- sub("(.*DRY.*)|(.*DROUGHT.*)|(DRIEST.*)", "DROUGHT", storm_wd_us$EVTYPE)
#unify variants of "HIGH SURF":
storm_wd_us$EVTYPE <- sub("(.*SWELL)|(.*SURF.*)|(.*WAVE.*)|(ROUGH SEA)", "HIGH SURF", storm_wd_us$EVTYPE)
#unify variants of "HEAVY RAIN":
storm_wd_us$EVTYPE <- sub("(.*RAIN.*)|(.*SHOWER.*)|(.*PRECIPITATION.*)|(.*PRECIPATATION.*)", "HEAVY RAIN", storm_wd_us$EVTYPE)
#unify variants of "DUST STORM":
storm_wd_us$EVTYPE <- sub("(DUSTSTORM)|(BLOWING DUST)", "DUST STORM", storm_wd_us$EVTYPE)
#unify variants of "HIGH TIDE":
storm_wd_us$EVTYPE <- sub("(HIGH SEA)|(OTHERS TIDE)|(HIGH WATER)|(.*TIDE.*)", "HIGH TIDE", storm_wd_us$EVTYPE)
#unify variants of "LIGHTNING":
storm_wd_us$EVTYPE <- sub("(.*LIGHTNING.*)|(LIGNTNING.*)|(LIGHTING)", "LIGHTNING", storm_wd_us$EVTYPE)
#unify variants of "DENSE SMOKE":
storm_wd_us$EVTYPE <- sub(".*SMOKE.*", "DENSE SMOKE",storm_wd_us$EVTYPE)
#unify variants of "DENSE FOG":
storm_wd_us$EVTYPE <- sub(".*FOG.*", "DENSE FOG",storm_wd_us$EVTYPE)
#unify variants of "DUST DEVIL":
storm_wd_us$EVTYPE <- sub("DUST DEVEL", "DUST DEVIL",storm_wd_us$EVTYPE)
#unify variants of "BLIZZARD":
storm_wd_us$EVTYPE <- sub(".*BLIZZARD.*", "BLIZZARD",storm_wd_us$EVTYPE)
#unify variants of "AVALANCHE":
storm_wd_us$EVTYPE <- sub(".*AVALANCE.*", "AVALANCHE",storm_wd_us$EVTYPE)
#unify variants of "FROST FREEZE":
storm_wd_us$EVTYPE <- sub(".*FROST.*|.*FREEZE.*|.*ICY.*","FROST FREEZE",storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("ICE|ICE JAM|ICE PELLET|ICE.*ROAD","FROST FREEZE",storm_wd_us$EVTYPE)
#other fixes:
storm_wd_us$EVTYPE <- sub(".*HEAT.*","HEAT", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub(".*FROST.*|.*FREEZE.*","FROST FREEZE",storm_wd_us$EVTYPE)

#unify events that are not clear/no present in the table in documentation in page 6 in "OTHERS"
storm_wd_us$EVTYPE <- sub("(.*WET.*)|(METRO STORM.*)|(.*APACHE.*)|
                        (RAPIDLY RISING WATER)|(HIGH$)|(EXCESSIVE)|(.*DAM.*)|
                        (SOUTHEAST)|(.*MUD.*)|(NORTHERN LIGHT)|
                        (LARGE WALL CLOUD)|(VOG)|(GROUND BLIZZARD)|
                        (COASTAL EROSION)|(LANDSLUMP)|(.*BEACH.*)|
                        (HEAVY SEA)|(HEAVY MIX)|(MILD PATTERN)|
                        (MARINE ACCIDENT)","OTHER", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("(NO SEVERE WEATHER)|(HEAVY SEA)|(NO SEVERE WEATHER)|(DROWNING)|(COASTAL EROSION)|
                        (NORTHERN LIGHT)|(NONE)|(ROCK SLIDE)|(SAHARAN DUST)|(COASTAL SURGE)|
                        (ROTATING WALL CLOUD)|(RED FLAG CRITERIA)|(\\?)",
                        "OTHER", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("(TURBULENCE)|(REMNANTS OF FLOYD)|(ROCK SLIDE)|(WALL CLOUD)|
                          |(SOUTHEAST)|(TEMPERATURE RECORD)|COASTALSTORM|
                          (.*RISING.*)|(MARINE ACCIDENT)|FREEZING SPRAY","OTHER", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("RAPIDLY RISING WATER","OTHER", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("ROTATING OTHER","OTHER", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("MARINE MISHAP","OTHER", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("LARGE OTHER","OTHER", storm_wd_us$EVTYPE)
storm_wd_us$EVTYPE <- sub("LANDSPOUT","OTHER", storm_wd_us$EVTYPE)
#Transform all characters to upper. 
storm_wd_us$EVTYPE <- toupper(storm_wd_us$EVTYPE)
#convert EVTYPE to factor
storm_wd_us$EVTYPE <- as.factor(storm_wd_us$EVTYPE)

Similarly, the economic loss variables, which are property damage and crop damage, were prepared for analysis.PROPDMGEXP and CROPDMGEXP were checked for unknown codes, transformed accordinf to the documentation(page 12) and new variables containing the absolute values (PROPDMG_TOTAL and CROPDMG_TOTAL) were calculated.

#Variables PROPDMG and CROPDMG: transform to include the exponentials from PROPDMGEXP and CROPDMGEXP.
#check PROPDMGEXP and CROPDMGEXP
table(storm_wd_us$PROPDMGEXP)
table(storm_wd_us$CROPDMGEXP)
#In the documentation (page 12) there is only explanation of H, K, M and B.
# We asume that 0 is log100, therefore considered at 1.
# For PROPDMGEXP 1,2,3,4,5,6,7,8,-,?,+ will be eliminated form the data set fo, 
#given its low representation of the data set
#Transform all characters to upper. 
storm_wd_us$PROPDMGEXP <- toupper(storm_wd_us$PROPDMGEXP)
storm_wd_us$CROPDMGEXP <- toupper(storm_wd_us$CROPDMGEXP)
#Remove unwanted values:
storm_wd_us$PROPDMGEXP <- sub("([1-8])|(\\?)|(\\+)|(\\-)", "EXCLUDE", storm_wd_us$PROPDMGEXP)
#% of undocumented values (1-8,?,+,-) of PROPDMGEXP
#98*100/45873 : 0.21% of the data have non-valid values for PROPDMGEXP

storm_wd_us$CROPDMGEXP <- sub("2|\\?", "EXCLUDE", storm_wd_us$CROPDMGEXP)
storm_wd_us <- subset(storm_wd_us,PROPDMGEXP != "EXCLUDE")
storm_wd_us <- subset(storm_wd_us,CROPDMGEXP != "EXCLUDE")
# Convert codes " " or 0 to value 1, and B,M,K,H to numeric values in PROPDMGEXP
storm_wd_us$PROPDMGEXP <- sub("", 1,storm_wd_us$PROPDMGEXP)
storm_wd_us$PROPDMGEXP <- sub("1|10", 1,storm_wd_us$PROPDMGEXP)
storm_wd_us$PROPDMGEXP <- sub("1B", 1000000000, 
                                   sub("1M", 1000000, 
                                       sub("1K", 1000,
                                           sub("1H", 100, storm_wd_us$PROPDMGEXP))))

# Convert codes " " or 0 to value 1, and B,M,K to numeric values in CROPDMGEXP
storm_wd_us$CROPDMGEXP <- sub("", 1,storm_wd_us$CROPDMGEXP)
storm_wd_us$CROPDMGEXP <- sub("1|10", 1,storm_wd_us$CROPDMGEXP)
storm_wd_us$CROPDMGEXP <- sub("1B", 1000000000, 
                               sub("1M", 1000000, 
                                   sub("1K", 1000, storm_wd_us$CROPDMGEXP)))
#converrt PROPDMGEXP and CROPDMGEXP to numeric 
storm_wd_us$PROPDMGEXP <- as.numeric(storm_wd_us$PROPDMGEXP)
storm_wd_us$CROPDMGEXP <- as.numeric(storm_wd_us$CROPDMGEXP)

# Create new variables for PROPDMG and CROPDMG multiplying 
#those by their EXP (PROPDMG_TOTAL and CRODMG_TOTAL):
storm_wd_us <- mutate(storm_wd_us, PROPDMG_TOTAL =PROPDMG*PROPDMGEXP)
storm_wd_us <- mutate(storm_wd_us, CROPDMG_TOTAL =CROPDMG*CROPDMGEXP)

Results

  1. Across the United States, which types of events as indicated in the EVTYPE variable) are most harmful with respect to population health?

Sum of all the fatalities and injuries accross states and dates were calculated for every type of weather event. For better visualization, only the 15 with highest number of injuries are displayed in the graph.

#calculate sums
EVTYPE_FATALITIES <- storm_wd_us %>% group_by(EVTYPE) %>% summarise(FATALITIES = sum(FATALITIES))
EVTYPE_INJURIES <- storm_wd_us %>% group_by(EVTYPE) %>% summarise(INJURIES = sum(INJURIES))
EVTYPE_HEALTH <- merge(EVTYPE_FATALITIES,EVTYPE_INJURIES,by = "EVTYPE")
EVTYPE_HEALTH<- EVTYPE_HEALTH[order(EVTYPE_HEALTH$INJURIES, decreasing = TRUE), ]
EVTYPE_HEALTH_TOP15INJ <- EVTYPE_HEALTH[1:15,]
#gather fatalities and injuries for plotting
EVTYPE_HEALTH_TOP15INJ_2 <- gather(EVTYPE_HEALTH_TOP15INJ, key= "fatalities_injuries", value = "number",FATALITIES, INJURIES)

#Make plot number of injuries and fatalities
ggplot(EVTYPE_HEALTH_TOP15INJ_2, aes(EVTYPE,number,fill=fatalities_injuries)) + 
        geom_bar (stat="identity") + 
        ylab("number") + 
        xlab("type of event") + 
        coord_flip()+
        ggtitle("Total number of injuries and fatalities caused by weather events in US") + 
        theme_bw() +theme(legend.position="right")

The results indicate that tornadoes, strong winds, heat, floods and lightning are the events that caused more injuries and fatalities fatalities, being therefore the most damaging for human health.

  1. Across the United States, which types of events have the greatest economic consequences? calculate the sum of crop damage and property damage across US since 1950

Sum of the property and crop damage accross states and dates were calculated for every type of weather event.For better visualization, only the 20 with highest number of injuries are displayed in the graph.

storm_economy <- select(storm_wd_us,STATE, BGN_DATE,EVTYPE,PROPDMG_TOTAL,CROPDMG_TOTAL)
#calculate sums
storm_economy_property <- storm_economy %>% group_by(EVTYPE) %>% summarise(property_damage = sum(PROPDMG_TOTAL))
storm_economy_crop <- storm_economy %>% group_by(EVTYPE) %>% summarise(crop_damage = sum(CROPDMG_TOTAL))
storm_economy_sum <- merge(storm_economy_property,storm_economy_crop,by = "EVTYPE")
storm_economy_sum <- storm_economy_sum[order(storm_economy_sum$property_damage, decreasing = TRUE), ]
#select top 20
storm_economy_sum_top20 <- storm_economy_sum[1:20,]
storm_economy_sum_top20_2 <- gather(storm_economy_sum_top20, key= "property_crop", value = "dollars",
                                    property_damage,crop_damage)

ggplot(storm_economy_sum_top20_2, aes(EVTYPE,dollars,fill=property_crop)) + 
        geom_bar (stat="identity") + 
        ylab("dollars") + 
        xlab("type of event") + 
        coord_flip()+
        ggtitle("Economic damage caused by weather events across US") + 
        theme_bw() +theme(legend.position="right")

The results indicate that floods, hurricane, tornadoes, and storm surge are the events that had the greatest economic consequences in the US since 1950.