Weather events and their effects in the United States

Synopsis

Weather events may have different consecuences in both economy and human wealth (injuries and even deaths), thus the U.S. National Oceanic and Atmospheric Administration’s (NOAA) began to create a data base where weather events are registered with different level of details such as the event type, duration, economic damage estimates. In particular for this study two main questions were adressed: What weather events have caused the most damage in human wealth? and second, what weather events have the greatest economic consequences? Storm Data (from NOAA) was used since year 1996, when all the event types began to be registered. The findings show that the most harmful events for human health in the U.S. are tonados and excessive heat. On the other hand the events with greater economic damage are: floods and hurricanes/typhoons.

Data Processing

The first step for making the analysis is the procesing of data, for this, first the raw data needs to be downloaded, decompressed and read. For this specific .bz2 file to be unzipped, the “R.utils” package is required.

## Create a working directory

dir.create("NOOAdataFolder")
setwd("./NOOAdataFolder")

## Load required packages

#install.packages("R.utils")
library(R.utils)

#install.packages("stringdist")
library(stringdist)

## Download and unzip file
URL <- 
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

download.file(URL,"repdata_data_StormData.csv.bz2")

bunzip2("repdata_data_StormData.csv.bz2","repdata_data_StormData.csv")

unload.Package(R.utils)

## Read table

stormData <- read.csv("repdata_data_StormData.csv",
                      header = T,colClasses = "character")

Now lets have a glance of how data looks like and what variables we will encounter.

names(stormData)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The variables we will be interested in are mainly 6: “EVTYPE” which names the type of weather event, “INJURIES” number of people injured, “FATALITIES” number of deaths caused directly or indirectly by the weather event, “PROPDMG” 3 significant figure property damage cost in US dollars,“PROPDMGEXP” corresponding exponent to the 3 significant figure vale in PROPDMG,“CROPDM” 3 significant figure crop damage cost in US dollars,“CROPDMEXP” corresponding exponent to the 3 significant figure vale in CROPDM.

For the matter of havig a copy of raw data this is stored in the variable “Original Data” then some exploratory analysis is done regarding the EVTYPE variable, which supossedly may only have 48 different correct values.

## Create a copy of the original data frame
originalData <- stormData

## Exploratory analysis
# Original event types characteristics
originalLength <- length(unique(stormData$EVTYPE))
originalTypes <- unique(stormData$EVTYPE)

Now the columns that will not be used for the analysis will be deleted.

stormData <- stormData[c("BGN_DATE", "EVTYPE","FATALITIES",
                  "INJURIES","PROPDMG","PROPDMGEXP","CROPDMG",
                  "CROPDMGEXP","REMARKS")]

head(stormData,10)

##              BGN_DATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1   4/18/1950 0:00:00 TORNADO       0.00    15.00   25.00          K    0.00
## 2   4/18/1950 0:00:00 TORNADO       0.00     0.00    2.50          K    0.00
## 3   2/20/1951 0:00:00 TORNADO       0.00     2.00   25.00          K    0.00
## 4    6/8/1951 0:00:00 TORNADO       0.00     2.00    2.50          K    0.00
## 5  11/15/1951 0:00:00 TORNADO       0.00     2.00    2.50          K    0.00
## 6  11/15/1951 0:00:00 TORNADO       0.00     6.00    2.50          K    0.00
## 7  11/16/1951 0:00:00 TORNADO       0.00     1.00    2.50          K    0.00
## 8   1/22/1952 0:00:00 TORNADO       0.00     0.00    2.50          K    0.00
## 9   2/13/1952 0:00:00 TORNADO       1.00    14.00   25.00          K    0.00
## 10  2/13/1952 0:00:00 TORNADO       0.00     0.00   25.00          K    0.00
##    CROPDMGEXP REMARKS
## 1                    
## 2                    
## 3                    
## 4                    
## 5                    
## 6                    
## 7                    
## 8                    
## 9                    
## 10

The number of unique elements for the EVTYPE variable is 985 which is far from the original 48 correct values. The next step is to identify why this is happening so “originalTypes” variable with the names of all this unique EVTYPE values is called. Due to the length of the list this will not be printed in the document, however the next step is to read the names in EVTYPE. Only 15 values of the list will be illustrated.

head(originalTypes,15)

##  [1] "TORNADO"                   "TSTM WIND"                
##  [3] "HAIL"                      "FREEZING RAIN"            
##  [5] "SNOW"                      "ICE STORM/FLASH FLOOD"    
##  [7] "SNOW/ICE"                  "WINTER STORM"             
##  [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"       
## [11] "RECORD COLD"               "HURRICANE ERIN"           
## [13] "HURRICANE OPAL"            "HEAVY RAIN"               
## [15] "LIGHTNING"

By inspection, the first problem that arises is that some EVTYPE values are begining with a blanck space " " and some are capitalized and other dont. To solve this problem the begining blank spaces will be deleted and all the characters will be transformed into lowercase.

# Delete blanks at the begining and mismatches due to capitalization
stormData$EVTYPE <- tolower(stormData$EVTYPE)
stormData$EVTYPE <- gsub("^ +","",stormData$EVTYPE)

# review the new event type characteristics
oldLength <- length(unique(stormData$EVTYPE))
newLength <- length(unique(stormData$EVTYPE))
newTypes <- unique(stormData$EVTYPE)

The new number of unique EVTYPE values is 890, however there are still plenty of typos and mismatch values that might be cleaned before analyzing.

At this point there is many data still with wrong EVTYPE coding, but let us have an initial guess of how the ranking at least on injuries and deaths would go like. For this we need to create a new variable that sums both injuries and death that will be counted as human wealth damage, this variable is called: “TotalWealth”.

# Make an initial ranking by adding injuries and fatalities
stormData["TotalWealth"]<-
  as.numeric(stormData$INJURIES)+as.numeric(stormData$FATALITIES)

deathorderTotal<-data.frame(Injuries_deaths=with(stormData,
        tapply(TotalWealth,as.factor(EVTYPE),sum)),EVTYPE =names(
        with(stormData,tapply(TotalWealth,as.factor(EVTYPE),sum))))

deathorderTotal$Injuries_deaths <-
  as.numeric(deathorderTotal$Injuries_deaths)

deathorderTotal <- 
  deathorderTotal[order(deathorderTotal$Injuries_deaths,
                                         decreasing = TRUE),]

head(deathorderTotal,10)

##                   Injuries_deaths            EVTYPE
## tornado                     96979           tornado
## excessive heat               8428    excessive heat
## tstm wind                    7461         tstm wind
## flood                        7259             flood
## lightning                    6046         lightning
## heat                         3037              heat
## flash flood                  2755       flash flood
## ice storm                    2064         ice storm
## thunderstorm wind            1621 thunderstorm wind
## winter storm                 1527      winter storm

Sum function is used rather than mean, this is because the mean function would make us loose one dimension of the data: frequency. An event type might occurr only twice in life, the first with null deaths but the second with 1’000,000 deaths, and the mean would be 500,000 so this might be the weather event that most deaths have caused, but since we calculated the mean this is missed. For this reason the total number of injuries and deaths per event type are adressed rather than their average.

It is observed that tornado is by far the most dangerous weather event. But let us continue the procedure of cleaning the data. This time, since it was decided to count the damage as the sum rather than the mean we are able to delete all entries whose: injuries + deaths + crop damage + property damage = 0.

## Delete the data that have values 0 in totalWealth 
## and whose monetary damage is 0

totalDamageTemp <- 
  (as.numeric(stormData$PROPDMG) + as.numeric(stormData$CROPDMG) +
     stormData$TotalWealth)

stormData <- stormData[-(which(totalDamageTemp == 0)),]

Now stormData contains only those rows that have a positive value in any category, crop damage, property damage, injuries, fatalities.
It is time to review if there is any relation between the names in the EVTYPE variable to make easier their correct classification. Because of the length of the list, only 20 elements will be printed in this document.

# try to find any relation between event types
orderedTypes <- newTypes[order(unique(stormData$EVTYPE))]

head(orderedTypes,20)

##  [1] "high winds/coastal flood"       "hail 150"                      
##  [3] "river flood"                    "lack of snow"                  
##  [5] "torrential rain"                "rip currents heavy surf"       
##  [7] "flash flooding/thunderstorm wi" "heavy snow andblowing snow"    
##  [9] "thunderstorm winds g60"         "dust storm"                    
## [11] "sleet/rain/snow"                "snow/cold"                     
## [13] "cold and wet conditions"        "flooding"                      
## [15] "snow squall"                    "tornado/waterspout"            
## [17] "thunderstorm winds 53"          "lightning/heavy rain"          
## [19] "extreme heat"                   "high wind 70"

There is a record that has a “?” as EVTYPE, this register is deleted.

# There is one event type labelled "?" delete this register

which(stormData$EVTYPE=="?")

## [1] 52498

stormData <- stormData[-which(stormData$EVTYPE=="?"),]

Now a data frame with the correct event types is created in order to compare the ones that are present in the data and the ones in the official list.

# create a DF with all the event types
Events48 <- c("Astronomical Low Tide","Avalanche","Blizzard",
              "Coastal Flood","Cold/Wind Chill","Debris Flow",
              "Dense Fog","Dense Smoke","Drought","Dust Devil",
              "Dust Storm","Excessive Heat",
              "Extreme Cold/Wind Chill",
              "Flash Flood","Flood","Frost/Freeze",
              "Funnel Cloud","Freezing Fog","Hail","Heat",
              "Heavy Rain","Heavy Snow","High Surf","High Wind",
              "Hurricane/Typhoon","Ice Storm",
              "Lake-Effect Snow","Lakeshore Flood", "Lightning",
              "Marine Hail","Marine High Wind",
              "Marine Strong Wind","Marine Thunderstorm Wind",
              "Rip Current","Seiche","Sleet",
              "Storm Surge/Tide","Strong Wind",
              "Thunderstorm Wind","Tornado",
              "Tropical Depression","Tropical Storm","Tsunami",
              "Volcanic Ash","Waterspout","Wildfire",
              "Winter Storm","Winter Weather")

Events48 <- as.data.frame(tolower(Events48))

Now it is interesting to determine since when all event types have been registered, because if we use the total rather than the mean, the years where only one EVTYPE event was being registered will affect the ranking and thus it would be an “unfair” comparison.

## Since when all the event types have been registered?

tempo <- as.factor(substring(
  as.Date(stormData$BGN_DATE,"%m/%d/%Y %H:%M:%S"),1,4))

tapply(X = stormData$EVTYPE,INDEX = tempo,FUN = function(x){length(
  subset(unique(x),unique(x) %in% Events48[,1]))})

## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 
##    1    1    2    2    2    2    2    2    2    2    2   23   25   28   21   26 
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 
##   25   22   22   21   25   25   25   32   38   36   39   38   37   38

# we see that more events are present since 1993, however it is since
# 1996 that all 48 events were taken into account.

after96 <- as.Date(stormData$BGN_DATE,"%m/%d/%Y %H:%M:%S") >=
  as.Date("1996/01/01","%Y/%m/%d")

stormData <- stormData[after96,]

The stormData now only have the registers of the data since 1996 which is the year where all the 48 EVTYPE events were established. One problem we can observe is that PROPDMG and CROPDMG dont have their corresponding exponents and they only have 3 significant figures. The exponents for this values are in the PROPDMGEXP and CROPDMGEXP. It is time to convert k,M,B into 1000, 1’000,000 and 1000’000,000 respectively. Then multiply this exponents by the 3 significant figure values and add them to get the total economic damage of each register.

# Replace the signs in the exponents by the correct values
stormData$PROPDMGEXP <- gsub("(\\+|-|\\?)","",stormData$PROPDMGEXP)
stormData$PROPDMGEXP <- sub("[Kk]","1000",stormData$PROPDMGEXP)
stormData$PROPDMGEXP <- sub("[Mm]","1000000",stormData$PROPDMGEXP)
stormData$PROPDMGEXP <- sub("[Bb]","1000000000",stormData$PROPDMGEXP)
stormData$PROPDMGEXP <- ifelse(stormData$PROPDMGEXP ==
                                 "","1",stormData$PROPDMGEXP)

stormData$CROPDMGEXP <- gsub("(\\+|-|\\?)","",stormData$CROPDMGEXP)
stormData$CROPDMGEXP <- sub("[Kk]","1000",stormData$CROPDMGEXP)
stormData$CROPDMGEXP <- sub("[Mm]","1000000",stormData$CROPDMGEXP)
stormData$CROPDMGEXP <- sub("[Bb]","1000000000",stormData$CROPDMGEXP)
stormData$CROPDMGEXP <- ifelse(stormData$CROPDMGEXP ==
                                 "","1",stormData$CROPDMGEXP)

## Calculate the values of damage with their corresponding exponents
cropDamage <- as.numeric(stormData$PROPDMG)*as.numeric(stormData$PROPDMGEXP)

propDamage <- as.numeric(stormData$CROPDMG)*as.numeric(stormData$CROPDMGEXP)

## Creates a new variable which registers this information
stormData["totalDamage"] <- cropDamage+propDamage 

stormData <- stormData[c("BGN_DATE","EVTYPE","TotalWealth",
                         "totalDamage","REMARKS")]

The next step is to compare the stormData data frame and Events48 data frame and create a matrix “a” with the registers that have a correct EVTYPE and a matrix “b” whose EVTYPE values might be revised.

# find exact coincidences
a <- subset(stormData,stormData$EVTYPE %in% Events48[,1])


b <- subset(stormData,!(stormData$EVTYPE %in% Events48[,1]))
b <- b[order(b$TotalWealth, decreasing = TRUE),]

Now that the “b” matrix have only registers that dont match exactly with the events in the oficial list, we will begin to make the required changes in the EVTYPE column in order to correct it’s names.

tstm refers really to thunderstorm
landslide is now called debris flow
hurricane is now categorized as hurricane/typhoon
typhoon is now categorized as hurricane/typhoon
Much of the “tornado” event type has the name of the tornado, so all this names are now converted only into “tornado”
tropical storms sometimes have their name or are written incorrectly, change this and call them only “tropical storm”
thunderstorm wind is sometimes misstyped, correct this for all that fit the metadata: “t(.+)u(.+)st(.+)m”
Correct all names containing fire into wildfire EVTYPE
Many fog categorized have dense fog in the remarks column, change this “fog” to “dense fog”
Many fog categorized have freezing fog in the remarks column, change this “fog” to “freezing fog”
Change “glaze” to “freezing fog”
Correct “rip currents” to “rip current”
Change EVTYPE “tide” and “storm surge” to “storm surge/tide”
Change “chill” registers to “extreme cold/wind chill”

Finally matrix “b” EVTYPE column is compared again with Events48 and the new matches are added to matrix “a” and matrix b is recalculated without the data already assigned to “a”.

b$EVTYPE <- sub("tstm","thunderstorm",b$EVTYPE)
b$EVTYPE <- sub("^landslide","debris flow",b$EVTYPE)
b$EVTYPE <- sub("^hurricane","hurricane/typhoon",b$EVTYPE)
b$EVTYPE <- sub("^typhoon","hurricane/typhoon",b$EVTYPE)
b$EVTYPE[grep("^torn",b$EVTYPE)] <- "tornado"
b$EVTYPE[grep("^trop.*st",b$EVTYPE)] <- "tropical storm"
b$EVTYPE[grep("t(.+)u(.+)st(.+)m",b$EVTYPE)] <- "thunderstorm wind"
b$EVTYPE[grep("fire",b$EVTYPE)] <- "wildfire"

## check winter weather and winter storm
b[grep("^winter s",b$EVTYPE),"EVTYPE"] <- "winter storm"

b[grep("^winter w",b$EVTYPE),"EVTYPE"] <- "winter weather"

b1 <- subset(b,b$EVTYPE %in% Events48[,1])
a <- rbind(a,b1)
a <- a[order(a$TotalWealth,decreasing = TRUE),]
b <- subset(b,!(b$EVTYPE %in% Events48[,1]))

Now since we will correct the events that have the word fog we want to see which ones have the word dense fog and frezing fog in the column “REMARKS” which contain the comments on the event.

## This is to see which event types are included in remarks that #contain word dense fog

unique(b$EVTYPE[grep("[Dd][Ee][Nn][Ss][Ee] [Ff][Oo][Gg]",b$REMARKS)])

## [1] "fog"

## Since all the event types listed that contain the word fog make
#reference to dense fog, this data will be subset.
fogTypes <- subset(grep("[Dd][Ee][Nn][Ss][Ee]
                [Ff][Oo][Gg]",b$REMARKS),grep("[Dd][Ee][Nn][Ss][Ee]
                [Ff][Oo][Gg]",b$REMARKS) %in% grep("^fog",b$EVTYPE))

b$EVTYPE[fogTypes] <- "dense fog"

## This is to see which event types are included in remarks that
#contain word freezing fog
unique(b$EVTYPE[grep("[Ff]reezing [Ff][Oo][Gg]",b$REMARKS)])

## [1] "fog"

## Since all the event types listed that contain the word fog make
#reference to freezing fog, this data will be subset.
fogTypes <- subset(grep("[Ff]reezing
              [Ff][Oo][Gg]",b$REMARKS),grep("[Ff]reezing 
              [Ff][Oo][Gg]",b$REMARKS) %in% grep("^fog",b$EVTYPE))

b$EVTYPE[fogTypes] <- "freezing fog"

## Other common name in evtype is glaze, however in the pdf guide
#glaze is the same as freezing fog
b$EVTYPE[grep("glaz",b$EVTYPE)] <- "freezing fog"

Continuing with other EVTYPE names:

## Other common typo is the "rip currents" since in the official list
#  it is called "rip current", lets correct this

b$EVTYPE[grep("^rip c(.*)r(.*)s",b$EVTYPE)]<- "rip current"

## correct those that are storm surges
b$EVTYPE[b$EVTYPE=="storm surge"] <- "storm surge/tide"
b$EVTYPE[b$EVTYPE=="tide"] <- "storm surge/tide"

## Correct chill registers to extreme cold/wind chill
b$EVTYPE[grep("chill",b$EVTYPE)] <- "extreme cold/wind chill"

## Re calculate matrices a and b

b1 <- subset(b,b$EVTYPE %in% Events48[,1])
a <- rbind(a,b1)
a <- a[order(a$TotalWealth,decreasing = TRUE),]
b <- subset(b,!(b$EVTYPE %in% Events48[,1]))

Now that the number of registers in matrix “b” are reduced it is interesting to arrange which names have the greater frecuency and which have the greater impact on the ranking:

## table to see which names are more important to fix.

hola <- as.data.frame(table(as.factor(b$EVTYPE)))
names(hola) <- c("EVTYPE","Frequency")
hola <- hola[order(hola$Freq,decreasing = TRUE),]



hola2 <- data.frame(totalInjuries=tapply(b$TotalWealth,
            as.factor(b$EVTYPE),sum), eventType = 
              names(tapply(b$TotalWealth,as.factor(b$EVTYPE),sum)))

hola2 <- hola2[order(hola2$totalInjuries,decreasing = TRUE),]

head(hola,15)

##                   EVTYPE Frequency
## 102 urban/sml stream fld       702
## 27          extreme cold       166
## 70            light snow       141
## 31                   fog       101
## 84           river flood        80
## 22        dry microburst        75
## 106                 wind        67
## 51  heavy surf/high surf        50
## 95          strong winds        45
## 9       coastal flooding        35
## 80                 other        34
## 42           gusty winds        30
## 49            heavy surf        29
## 25        excessive snow        25
## 69   light freezing rain        22

head(hola2,15)

##                      totalInjuries            eventType
## fog                            772                  fog
## extreme cold                   194         extreme cold
## urban/sml stream fld           107 urban/sml stream fld
## wind                           102                 wind
## heavy surf/high surf            90 heavy surf/high surf
## wintry mix                      78           wintry mix
## heat wave                       70            heat wave
## heavy surf                      46           heavy surf
## snow squall                     37          snow squall
## cold                            30                 cold
## dry microburst                  28       dry microburst
## mixed precip                    28         mixed precip
## strong winds                    28         strong winds
## icy roads                       26            icy roads
## black ice                       25            black ice

Based on this we know that the event types with the word “flood”, “extreme cold”, “high winds” are some of the ones that could have a big effect on the ranking.

The registers that contain flood/flashflood couldn’t be reassigned because the difference between flood and flash flood was not clear. However later we will deal with this problem.

The names with “urban/sml stream fld” and “urban flod” are categorized as heavy rain in the data frame PDF so this is how we will assign them here.

Heat is another one, however since heat wave and extreme heat can’t be categorized because the difference between excessive heat and heat registers is not totally clear in the PDF of the data base.

## Correct extreme cold and high winds and urban stream flood
b$EVTYPE[(b$EVTYPE=="extreme cold")] <- "extreme cold/wind chill"
b$EVTYPE[b$EVTYPE=="high winds"] <- "high wind"
b$EVTYPE[b$EVTYPE=="urban/sml stream fld"] <- "heavy rain" ## This is
# because in the manual says that this kind of issues should be
#adressed as heavy rain

b$EVTYPE[b$EVTYPE=="urban flood"] <- "heavy rain" ## This is because
#in the manual says that this kind of issues should be adressed
#as heavy rain.

## Correct flash flooding
b$EVTYPE[b$EVTYPE=="flash flooding"] <- "flash flood"

Finally we recalculate matrices “a” and “b”

b1 <- subset(b,b$EVTYPE %in% Events48[,1])
a <- rbind(a,b1)
a <- a[order(a$TotalWealth,decreasing = TRUE),]
b <- subset(b,!(b$EVTYPE %in% Events48[,1]))

The next step is to recheck which names of event types may worth correcting.

hola <- as.data.frame(table(as.factor(b$EVTYPE)))
names(hola) <- c("EVTYPE","Frequency")
hola <- hola[order(hola$Freq,decreasing = TRUE),]
                  
hola2 <- data.frame(totalInjuries=tapply(b$TotalWealth,
         as.factor(b$EVTYPE),sum), eventType = names(
           tapply(b$TotalWealth,as.factor(b$EVTYPE),sum)))

hola2 <- hola2[order(hola2$totalInjuries,decreasing = TRUE),]

head(hola,20)

##                   EVTYPE Frequency
## 68            light snow       141
## 30                   fog       101
## 82           river flood        80
## 22        dry microburst        75
## 103                 wind        67
## 50  heavy surf/high surf        50
## 93          strong winds        45
## 9       coastal flooding        35
## 78                 other        34
## 41           gusty winds        30
## 48            heavy surf        29
## 25        excessive snow        25
## 67   light freezing rain        22
## 13                  cold        20
## 62             icy roads        18
## 73   mixed precipitation        18
## 89                  snow        16
## 31                freeze        14
## 37            gusty wind        13
## 88            small hail        11

head(hola2,20)

##                      totalInjuries            eventType
## fog                            772                  fog
## wind                           102                 wind
## heavy surf/high surf            90 heavy surf/high surf
## wintry mix                      78           wintry mix
## heat wave                       70            heat wave
## heavy surf                      46           heavy surf
## snow squall                     37          snow squall
## cold                            30                 cold
## dry microburst                  28       dry microburst
## mixed precip                    28         mixed precip
## strong winds                    28         strong winds
## icy roads                       26            icy roads
## black ice                       25            black ice
## unseasonably warm               17    unseasonably warm
## freezing drizzle                15     freezing drizzle
## cold and snow                   14        cold and snow
## gusty winds                     14          gusty winds
## snow                            14                 snow
## rough seas                      13           rough seas
## high seas                       10            high seas

Also it is interesting to see the ranking of the data we have arranged in the matrix “a”, in order to see the possible effect of the registers in the matrix “b”

rankingWealth <-data.frame(total =
                tapply(a$TotalWealth,as.factor(a$EVTYPE),sum),
                  evtype = names(
                  tapply(a$TotalWealth,as.factor(a$EVTYPE),sum)))

rankingWealth<-rankingWealth[order(rankingWealth$total,decreasing = TRUE),]

head(rankingWealth,6)

##                   total            evtype
## tornado           22178           tornado
## excessive heat     8188    excessive heat
## flood              7172             flood
## thunderstorm wind  5526 thunderstorm wind
## lightning          4792         lightning
## flash flood        2561       flash flood

Following this, the last set of corrections to the EVTYPE data in the “b” matrix to make the data as accurrate as possible.

the first function will be used to identify which names may be includded as “high surf”

unique(b[grep("^(.*)surf",b$EVTYPE),"EVTYPE"])

## [1] "heavy surf/high surf" "heavy surf"           "rough surf"          
## [4] "heavy surf and wind"  "hazardous surf"       "heavy rain/high surf"
## [7] "high surf advisory"

Knowig this and reading the REMARK column for this EVTYPE names the following procedure is done.

b[grep("^(.*)surf",b$EVTYPE),"EVTYPE"] <- "high surf"

This time the names which correspond to winter storm and winter weather but are misstyped will be fixed.

Data in the “a” and “b” matrix is updated for last time.The frequency and total health damage of each remaining EVTYPE name in b is displayed.

b1 <- subset(b,b$EVTYPE %in% Events48[,1])
a <- rbind(a,b1)
a <- a[order(a$TotalWealth,decreasing = TRUE),]
b <- subset(b,!(b$EVTYPE %in% Events48[,1]))

hola <- as.data.frame(table(as.factor(b$EVTYPE)))
hola <- hola[order(hola$Freq,decreasing = TRUE),]

hola2 <- data.frame(totalInjuries=tapply(b$TotalWealth,
          as.factor(b$EVTYPE),sum), eventType = 
          names(tapply(b$TotalWealth,as.factor(b$EVTYPE),sum)))

hola2 <- hola2[order(hola2$totalInjuries,decreasing = TRUE),]

head(hola,10)

##                Var1 Freq
## 62       light snow  141
## 30              fog  101
## 76      river flood   80
## 22   dry microburst   75
## 96             wind   67
## 86     strong winds   45
## 9  coastal flooding   35
## 72            other   34
## 41      gusty winds   30
## 25   excessive snow   25

head(hola2,10)

##                totalInjuries      eventType
## fog                      772            fog
## wind                     102           wind
## wintry mix                78     wintry mix
## heat wave                 70      heat wave
## snow squall               37    snow squall
## cold                      30           cold
## dry microburst            28 dry microburst
## mixed precip              28   mixed precip
## strong winds              28   strong winds
## icy roads                 26      icy roads

Since “fog”, “wind” and “heat” events are the ones that most deaths and injuries could have caused we will check that the sum of all the injuries and deaths of EVTYPE names containing these words would not change the top 5 ranking.

## se how many injuries does fog entries have
unique(b[grep("fog",b$EVTYPE),"EVTYPE"])

## [1] "fog"

sum(b[grep("fog",b$EVTYPE),"TotalWealth"])

## [1] 772

more than 700 injuries/deaths is a decent number, however it is not enough to change the top 5 ranking, thus the remaining “fog” registers in the “b” matrix will not be taken into account.

## se how many injuries does wind entries have
unique(b[grep("wind",b$EVTYPE),"EVTYPE"])

##  [1] "wind"                   "non-severe wind damage" "gusty winds"           
##  [4] "strong winds"           "winds"                  "whirlwind"             
##  [7] "gusty wind"             "wind damage"            "gusty wind/rain"       
## [10] "gusty wind/hvy rain"    "gradient wind"          "high wind (g40)"       
## [13] "gusty wind/hail"        "wind and wave"

sum(b[grep("wind",b$EVTYPE),"TotalWealth"])

## [1] 155

Not too much weigth so trying to fix this does not change the ranking, thus the remaining “wind” registers in the “b” matrix will not be taken into account.

# Lets do the same with heat
unique(b[grep("heat",b$EVTYPE),"EVTYPE"])

## [1] "heat wave"   "record heat"

sum(b[grep("heat",b$EVTYPE),"TotalWealth"])

## [1] 72

Once again not too much weigth so trying to fix this does not change the ranking, thus the remaining “heat” registers in the “b” matrix will not be taken into account.

The last step for cleaning the data is to check the ranking but now regarding the economic costs of the weather events. These costs are stored in the “totalDamage” column.

costinB <- data.frame(totalDamage=tapply(b$totalDamage,
          as.factor(b$EVTYPE),sum), eventType = 
          names(tapply(b$totalDamage,as.factor(b$EVTYPE),sum)))

costinB <- costinB[order(costinB$totalDamage,decreasing = TRUE),]

head(costinB,20)

##                           totalDamage                 eventType
## freeze                      156925000                    freeze
## river flooding              134175000            river flooding
## coastal flooding            103809000          coastal flooding
## damaging freeze              42130000           damaging freeze
## early frost                  42000000               early frost
## agricultural freeze          28820000       agricultural freeze
## unseasonably cold            25042500         unseasonably cold
## river flood                  22157000               river flood
## small hail                   20863000                small hail
## coastal flooding/erosion     20030000  coastal flooding/erosion
## erosion/cstl flood           16200000        erosion/cstl flood
## coastal  flooding/erosion    15000000 coastal  flooding/erosion
## fog                          13145500                       fog
## hard freeze                  12900000               hard freeze
## unseasonal rain              10000000           unseasonal rain
## astronomical high tide        9425000    astronomical high tide
## unseasonable cold             5100000         unseasonable cold
## wind                          2589500                      wind
## snow                          2554000                      snow
## light snow                    2513000                light snow

Having this information is time to see the ranking with the already cleaned data:

temporal <- tapply(a$totalDamage,as.factor(a$EVTYPE),sum)

damageRank <- data.frame(cost=temporal,event=names(temporal))
damageRank <- damageRank[order(damageRank$cost,decreasing = TRUE),]

leastDif <- damageRank[5,1]- damageRank[6,1]  
  
head(damageRank,6)

##                           cost             event
## flood             148919611950             flood
## hurricane/typhoon  87068996810 hurricane/typhoon
## storm surge/tide   47835579000  storm surge/tide
## tornado            24900370720           tornado
## hail               17071172870              hail
## flash flood        16557155610       flash flood

Regarding the “totalDamage” variable in the remaining registers in the “b” matrix and the ranking for the matrix “a” by “totalDamage”, some key words (weather types) are identified that could affect the top 5 elements in the ranking. These words are: “flood”, “freeze” and “rain”. To analyze if this weather types could interfere in the analysis, some aditions are made.

Let’s check first the “flood” events.

flood <- sum(b$totalDamage[grep("flood",b$EVTYPE)])
flood

## [1] 311400000

Next, review the “freeze” events.

freeze <- sum(b$totalDamage[grep("freeze",b$EVTYPE)])
freeze

## [1] 240775000

Finally, events involving “rain” events are checked.

rain <- sum(b$totalDamage[grep("rain",b$EVTYPE)])
rain

## [1] 11631000

Taking into account that the smallest difference between each rank for the 6 most expensive weather events arrangement is between the 5th and 6th with a difference of 5.140172610^{8} then we see that neither of the event types remaining in the matrix “b” if corrected, would change or affect the top 5 most economic costly weather events in the United Stated. For this reason the remaining registers in the matrix “b” will be ommited.

Results

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Based on the processing process done in the section above it is possible to plot which type of weather events are the most harmful to population health.

g1 <- barplot(height = (rankingWealth[1:5,1]/1000),names.arg = 
          rankingWealth[1:5,2], xlab = "weather event" , 
          ylab = "Injuries and deaths (thousands)",
          ylim = c(0,25),cex.names = 0.7)


points(x = g1, y = rankingWealth[1:5,1]/1000,pch=19)

text(x = g1, y= rankingWealth[1:5,1]/1000 ,
     labels = as.character(round(rankingWealth[1:5,1]/1000,
                            digits =c(1,2,2,2,2))),pos= 3, cex=0.7)

title(main="Most harmful weather events for human health")

$\label{fig:figs} Top 5 weather event types with the greatest damage in human health in the U.S.$

Top 5 weather event types with the greatest damage in human health in the U.S.

This is mostly because of the criteria that the sum of the injuries and deaths for each event type was the correct meassurement, because here the frequency of occurrence is taken into account.

Question 2

Across the United States, which types of events have the greatest economic consequences?

For this question a new ranking might be used, in this case from highest economic cost to lowest economic cost. This was saved in the “damageRank” variable during the processing of the data in the previous section.

g2 <- barplot(height = (damageRank[1:5,1]/1000000000),names.arg = 
          damageRank[1:5,2], xlab = "weather event" , 
          ylab = "Cost U.S. Dollars (billions)",
          ylim = c(0,200),cex.names = 0.7)

points(x = g2, y = damageRank[1:5,1]/1000000000,pch=19)

text(x = g2, y = (damageRank[1:5,1]/1000000000) ,
     labels = as.character(round(damageRank[1:5,1]/1000000000,1)),
     pos= 3, cex=0.7) 

title(main="Events with greater economic consequences")

$\label{fig:figs} Top 5 weather event types with the greatest economic cost in the United States.$

Top 5 weather event types with the greatest economic cost in the United States.

Regarding this plot is clear which of the weather events are the most expensive for the U.S. It is interesting how this ranking differs from the one in the first question, in this case heat isn’t present, which makes sense, as heat may cause varius deaths, but in general economic damage may be only caused when fires are generated. The strategy should adress flood and tornado events, because they are both very expensive and also harmful adressing human health.