Weather events can have serious consequences for both the economy and human health (injuries and even deaths). For this reason the U.S. National Oceanic and Atmospheric Administration (NOAA) maintains a database in which weather events are recorded with details such as event type, duration and economic damage estimates. This study addresses two questions: which weather events have caused the most damage to human health, and which weather events have the greatest economic consequences? NOAA Storm Data from 1996 onward was used, since that is the year when all event types began to be recorded. The findings show that the most harmful events for human health in the U.S. are tornadoes and excessive heat, while the events with the greatest economic damage are floods and hurricanes/typhoons.
The first step of the analysis is processing the data: the raw data needs to be downloaded, decompressed and read. The “R.utils” package is required to decompress this .bz2 file.
## Create a working directory
dir.create("NOOAdataFolder")
setwd("./NOOAdataFolder")
## Load required packages
#install.packages("R.utils")
library(R.utils)
#install.packages("stringdist")
library(stringdist)
## Download and unzip file
URL <-
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(URL,"repdata_data_StormData.csv.bz2")
bunzip2("repdata_data_StormData.csv.bz2","repdata_data_StormData.csv")
unload.Package(R.utils)
## Read table
stormData <- read.csv("repdata_data_StormData.csv",
header = TRUE,colClasses = "character")
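As a side note, base R can also read a bz2-compressed CSV through a `bzfile()` connection, without decompressing it to disk first. The sketch below only illustrates this on a tiny temporary file; the file and its contents are made up for the example and are not the storm data.

```r
# Write a tiny bz2-compressed CSV to a temporary file (toy data, not Storm Data)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,INJURIES", "TORNADO,15", "HAIL,0"), con)
close(con)

# read.csv() accepts a connection, so the file is decompressed while reading
toy <- read.csv(bzfile(tmp), header = TRUE, colClasses = "character")
toy
unlink(tmp)
```

Whether this is preferable to an explicit `bunzip2()` step depends on whether the decompressed file is needed again later.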
Now let us take a look at the data and the variables it contains.
names(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The analysis mainly uses seven variables: “EVTYPE”, the type of weather event; “INJURIES”, the number of people injured; “FATALITIES”, the number of deaths caused directly or indirectly by the weather event; “PROPDMG”, the property damage cost in U.S. dollars to 3 significant figures; “PROPDMGEXP”, the exponent that corresponds to the 3-significant-figure value in PROPDMG; “CROPDMG”, the crop damage cost in U.S. dollars to 3 significant figures; and “CROPDMGEXP”, the exponent that corresponds to the 3-significant-figure value in CROPDMG.
To keep a copy of the raw data, it is stored in the variable “originalData”. Some exploratory analysis is then done on the EVTYPE variable, which is supposed to have only 48 different valid values.
## Create a copy of the original data frame
originalData <- stormData
## Exploratory analysis
# Original event types characteristics
originalLength <- length(unique(stormData$EVTYPE))
originalTypes <- unique(stormData$EVTYPE)
Now the columns that will not be used for the analysis will be deleted.
stormData <- stormData[c("BGN_DATE", "EVTYPE","FATALITIES",
"INJURIES","PROPDMG","PROPDMGEXP","CROPDMG",
"CROPDMGEXP","REMARKS")]
head(stormData,10)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 4/18/1950 0:00:00 TORNADO 0.00 15.00 25.00 K 0.00
## 2 4/18/1950 0:00:00 TORNADO 0.00 0.00 2.50 K 0.00
## 3 2/20/1951 0:00:00 TORNADO 0.00 2.00 25.00 K 0.00
## 4 6/8/1951 0:00:00 TORNADO 0.00 2.00 2.50 K 0.00
## 5 11/15/1951 0:00:00 TORNADO 0.00 2.00 2.50 K 0.00
## 6 11/15/1951 0:00:00 TORNADO 0.00 6.00 2.50 K 0.00
## 7 11/16/1951 0:00:00 TORNADO 0.00 1.00 2.50 K 0.00
## 8 1/22/1952 0:00:00 TORNADO 0.00 0.00 2.50 K 0.00
## 9 2/13/1952 0:00:00 TORNADO 1.00 14.00 25.00 K 0.00
## 10 2/13/1952 0:00:00 TORNADO 0.00 0.00 25.00 K 0.00
## CROPDMGEXP REMARKS
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
The number of unique values of the EVTYPE variable is 985, far more than the 48 official values. To identify why this happens, the “originalTypes” variable, which holds all these unique EVTYPE values, is inspected. Because of the length of the list, only the first 15 values are printed here.
head(originalTypes,15)
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [11] "RECORD COLD" "HURRICANE ERIN"
## [13] "HURRICANE OPAL" "HEAVY RAIN"
## [15] "LIGHTNING"
By inspection, the first problem that arises is that some EVTYPE values begin with a blank space " " and that some are capitalized while others are not. To solve this problem the leading blank spaces are deleted and all characters are converted to lowercase.
# Delete leading blanks and mismatches due to capitalization
stormData$EVTYPE <- tolower(stormData$EVTYPE)
stormData$EVTYPE <- gsub("^ +","",stormData$EVTYPE)
# review the new event type characteristics
oldLength <- originalLength # number of unique types before cleaning
newLength <- length(unique(stormData$EVTYPE))
newTypes <- unique(stormData$EVTYPE)
The new number of unique EVTYPE values is 890; however, there are still plenty of typos and mismatched values that should be cleaned before analyzing.
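The effect of these two cleaning steps can be seen on a handful of made-up labels. This is only a toy sketch; the vector `toyTypes` is hypothetical and not taken from the data set.

```r
# Toy EVTYPE values (made up) that differ only by case or leading blanks
toyTypes <- c(" TORNADO", "Tornado", "tornado", "  HAIL", "hail")

# Same two steps as above: lowercase, then strip leading blanks
cleanTypes <- gsub("^ +", "", tolower(toyTypes))

length(unique(toyTypes))   # number of distinct labels before cleaning
length(unique(cleanTypes)) # number of distinct labels after cleaning
```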
At this point many records still have a wrong EVTYPE coding, but let us make an initial estimate of how the ranking would look, at least for injuries and deaths. For this we create a new variable that sums injuries and deaths, which together are counted as damage to human health. This variable is called “TotalWealth”.
# Make an initial ranking by adding injuries and fatalities
stormData["TotalWealth"]<-
as.numeric(stormData$INJURIES)+as.numeric(stormData$FATALITIES)
deathorderTotal<-data.frame(Injuries_deaths=with(stormData,
tapply(TotalWealth,as.factor(EVTYPE),sum)),EVTYPE =names(
with(stormData,tapply(TotalWealth,as.factor(EVTYPE),sum))))
deathorderTotal$Injuries_deaths <-
as.numeric(deathorderTotal$Injuries_deaths)
deathorderTotal <-
deathorderTotal[order(deathorderTotal$Injuries_deaths,
decreasing = TRUE),]
head(deathorderTotal,10)
## Injuries_deaths EVTYPE
## tornado 96979 tornado
## excessive heat 8428 excessive heat
## tstm wind 7461 tstm wind
## flood 7259 flood
## lightning 6046 lightning
## heat 3037 heat
## flash flood 2755 flash flood
## ice storm 2064 ice storm
## thunderstorm wind 1621 thunderstorm wind
## winter storm 1527 winter storm
The sum is used rather than the mean because the mean would discard one dimension of the data: frequency. For example, an event type might occur only twice, the first time with no deaths and the second with 1,000,000 deaths. Its mean would be 500,000, which hides the fact that it caused 1,000,000 deaths in total and might be the deadliest event type. For this reason the total number of injuries and deaths per event type is used rather than the average.
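A small made-up example shows how the two measures can rank event types differently. The data frame `toy` and its values are hypothetical and serve only to illustrate the point about frequency.

```r
# Toy casualty counts (made up): one rare but severe event type and one
# frequent but mild event type
toy <- data.frame(
  EVTYPE = c(rep("rare", 2), rep("frequent", 100)),
  casualties = c(0, 50, rep(2, 100))
)

totals <- tapply(toy$casualties, toy$EVTYPE, sum)
means <- tapply(toy$casualties, toy$EVTYPE, mean)

totals # the frequent event type has the larger total
means  # the rare event type has the larger mean
```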
Tornado is observed to be by far the most dangerous weather event. Let us continue cleaning the data. Since it was decided to measure the damage as a sum rather than a mean, all entries for which injuries + deaths + crop damage + property damage = 0 can be deleted.
## Delete the rows whose TotalWealth is 0
## and whose monetary damage is 0
totalDamageTemp <-
(as.numeric(stormData$PROPDMG) + as.numeric(stormData$CROPDMG) +
stormData$TotalWealth)
## A logical index is used instead of -which(), which would drop
## every row if no row had a total of 0
stormData <- stormData[is.na(totalDamageTemp) | totalDamageTemp != 0,]
Now stormData contains only those rows that have a positive value in at least one category: crop damage, property damage, injuries or fatalities.
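A general caution about this kind of row filter in R: negative indexing with `which()` selects no rows at all when nothing matches the condition, whereas a logical condition keeps every row. The toy data frame below is made up to show the difference.

```r
# Toy data frame (made up) in which no row has a total of 0
toy <- data.frame(EVTYPE = c("tornado", "hail", "flood"), total = c(3, 1, 7))

# which() finds nothing, and negating an empty index selects no rows at all
dropWithWhich <- toy[-which(toy$total == 0), ]

# A logical condition keeps every row that does not match
dropWithLogical <- toy[toy$total != 0, ]

nrow(dropWithWhich)
nrow(dropWithLogical)
```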
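It is time to review whether there is any relation between the names in the EVTYPE variable that would make their correct classification easier. Because of the length of the list, only 20 elements are printed in this document.
# try to find any relation between event types
# Note: newTypes was computed before the rows above were removed, so this
# index does not sort the current values and the listing below is not
# alphabetical; sort(unique(stormData$EVTYPE)) would give a sorted list
orderedTypes <- newTypes[order(unique(stormData$EVTYPE))]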
head(orderedTypes,20)
## [1] "high winds/coastal flood" "hail 150"
## [3] "river flood" "lack of snow"
## [5] "torrential rain" "rip currents heavy surf"
## [7] "flash flooding/thunderstorm wi" "heavy snow andblowing snow"
## [9] "thunderstorm winds g60" "dust storm"
## [11] "sleet/rain/snow" "snow/cold"
## [13] "cold and wet conditions" "flooding"
## [15] "snow squall" "tornado/waterspout"
## [17] "thunderstorm winds 53" "lightning/heavy rain"
## [19] "extreme heat" "high wind 70"
One record has “?” as its EVTYPE; this record is deleted.
# There is one event type labelled "?"; delete this record
which(stormData$EVTYPE=="?")
## [1] 52498
stormData <- stormData[-which(stormData$EVTYPE=="?"),]
Now a data frame with the official event types is created in order to compare them with the ones present in the data.
# create a DF with all the event types
Events48 <- c("Astronomical Low Tide","Avalanche","Blizzard",
"Coastal Flood","Cold/Wind Chill","Debris Flow",
"Dense Fog","Dense Smoke","Drought","Dust Devil",
"Dust Storm","Excessive Heat",
"Extreme Cold/Wind Chill",
"Flash Flood","Flood","Frost/Freeze",
"Funnel Cloud","Freezing Fog","Hail","Heat",
"Heavy Rain","Heavy Snow","High Surf","High Wind",
"Hurricane/Typhoon","Ice Storm",
"Lake-Effect Snow","Lakeshore Flood", "Lightning",
"Marine Hail","Marine High Wind",
"Marine Strong Wind","Marine Thunderstorm Wind",
"Rip Current","Seiche","Sleet",
"Storm Surge/Tide","Strong Wind",
"Thunderstorm Wind","Tornado",
"Tropical Depression","Tropical Storm","Tsunami",
"Volcanic Ash","Waterspout","Wildfire",
"Winter Storm","Winter Weather")
Events48 <- as.data.frame(tolower(Events48))
It is now important to determine since when all event types have been recorded. Because totals are used rather than means, the years in which only one event type was being recorded would bias the ranking and make it an “unfair” comparison.
## Since when all the event types have been registered?
tempo <- as.factor(substring(
as.Date(stormData$BGN_DATE,"%m/%d/%Y %H:%M:%S"),1,4))
tapply(X = stormData$EVTYPE,INDEX = tempo,FUN = function(x){length(
subset(unique(x),unique(x) %in% Events48[,1]))})
## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## 1 1 2 2 2 2 2 2 2 2 2 23 25 28 21 26
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## 25 22 22 21 25 25 25 32 38 36 39 38 37 38
# we see that more events are present since 1993, however it is since
# 1996 that all 48 events were taken into account.
after96 <- as.Date(stormData$BGN_DATE,"%m/%d/%Y %H:%M:%S") >=
as.Date("1996/01/01","%Y/%m/%d")
stormData <- stormData[after96,]
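The date handling used above can be checked on two sample strings. The first is a BGN_DATE value shown earlier in the data; the second is made up for the illustration.

```r
# Two BGN_DATE strings in the format used by the data set
toyDates <- c("4/18/1950 0:00:00", "3/5/1996 0:00:00")

parsed <- as.Date(toyDates, "%m/%d/%Y %H:%M:%S")
format(parsed, "%Y")             # year as a character string
parsed >= as.Date("1996-01-01")  # logical filter that keeps 1996 onward
```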
stormData now contains only the records since 1996, the year in which all 48 EVTYPE event types were established. One remaining problem is that PROPDMG and CROPDMG hold only 3 significant figures, without their magnitude. The exponents for these values are stored in PROPDMGEXP and CROPDMGEXP. The codes K, M and B are converted into 1,000, 1,000,000 and 1,000,000,000 respectively. These multipliers are then applied to the 3-significant-figure values, and the property and crop results are added to get the total economic damage of each record.
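Before the actual replacement code, the idea can be illustrated on made-up values with a lookup vector. The helper `expToMultiplier` is hypothetical and is not part of the analysis; it assumes that any code other than K, M or B (including an empty string) counts as a multiplier of 1.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumption: any code other than K, M or B (including "") counts as 1.
expToMultiplier <- function(expCode) {
  lookup <- c(k = 1e3, m = 1e6, b = 1e9)
  mult <- unname(lookup[tolower(expCode)])
  mult[is.na(mult)] <- 1
  mult
}

# Toy values (made up)
dmg <- c(25, 2.5, 1, 10)
dmgExp <- c("K", "M", "", "B")
dmg * expToMultiplier(dmgExp)
```

The code that follows takes a different route and rewrites the exponent columns in place with `sub()`.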
# Replace the signs in the exponents by the correct values
stormData$PROPDMGEXP <- gsub("(\\+|-|\\?)","",stormData$PROPDMGEXP)
stormData$PROPDMGEXP <- sub("[Kk]","1000",stormData$PROPDMGEXP)
stormData$PROPDMGEXP <- sub("[Mm]","1000000",stormData$PROPDMGEXP)
stormData$PROPDMGEXP <- sub("[Bb]","1000000000",stormData$PROPDMGEXP)
stormData$PROPDMGEXP <- ifelse(stormData$PROPDMGEXP ==
"","1",stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- gsub("(\\+|-|\\?)","",stormData$CROPDMGEXP)
stormData$CROPDMGEXP <- sub("[Kk]","1000",stormData$CROPDMGEXP)
stormData$CROPDMGEXP <- sub("[Mm]","1000000",stormData$CROPDMGEXP)
stormData$CROPDMGEXP <- sub("[Bb]","1000000000",stormData$CROPDMGEXP)
stormData$CROPDMGEXP <- ifelse(stormData$CROPDMGEXP ==
"","1",stormData$CROPDMGEXP)
## Calculate the values of damage with their corresponding exponents
propDamage <- as.numeric(stormData$PROPDMG)*as.numeric(stormData$PROPDMGEXP)
cropDamage <- as.numeric(stormData$CROPDMG)*as.numeric(stormData$CROPDMGEXP)
## Create a new variable that stores this information
stormData["totalDamage"] <- cropDamage+propDamage
stormData <- stormData[c("BGN_DATE","EVTYPE","TotalWealth",
"totalDamage","REMARKS")]
The next step is to compare the stormData data frame with the Events48 data frame and create a data frame “a” with the records that have a correct EVTYPE and a data frame “b” with the records whose EVTYPE values need to be revised.
# find exact coincidences
a <- subset(stormData,stormData$EVTYPE %in% Events48[,1])
b <- subset(stormData,!(stormData$EVTYPE %in% Events48[,1]))
b <- b[order(b$TotalWealth, decreasing = TRUE),]
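The split into matched and unmatched records relies on `%in%`. A minimal sketch on made-up data, where `official`, `matched` and `unmatched` are hypothetical names standing in for Events48, “a” and “b”:

```r
# Toy official list and toy records (made up)
official <- c("tornado", "hail", "flood")
toy <- data.frame(EVTYPE = c("tornado", "tstm wind", "hail", "flooding"),
                  TotalWealth = c(5, 2, 0, 1),
                  stringsAsFactors = FALSE)

isOfficial <- toy$EVTYPE %in% official
matched <- toy[isOfficial, ]     # plays the role of "a"
unmatched <- toy[!isOfficial, ]  # plays the role of "b"
matched$EVTYPE
unmatched$EVTYPE
```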
Now that “b” contains only records that do not match the events in the official list exactly, we begin to make the required changes in the EVTYPE column in order to correct its names.
Afterwards the EVTYPE column of “b” is compared again with Events48, the new matches are added to “a”, and “b” is recalculated without the records already assigned to “a”.
b$EVTYPE <- sub("tstm","thunderstorm",b$EVTYPE)
b$EVTYPE <- sub("^landslide","debris flow",b$EVTYPE)
b$EVTYPE <- sub("^hurricane","hurricane/typhoon",b$EVTYPE)
b$EVTYPE <- sub("^typhoon","hurricane/typhoon",b$EVTYPE)
b$EVTYPE[grep("^torn",b$EVTYPE)] <- "tornado"
b$EVTYPE[grep("^trop.*st",b$EVTYPE)] <- "tropical storm"
b$EVTYPE[grep("t(.+)u(.+)st(.+)m",b$EVTYPE)] <- "thunderstorm wind"
b$EVTYPE[grep("fire",b$EVTYPE)] <- "wildfire"
## check winter weather and winter storm
b[grep("^winter s",b$EVTYPE),"EVTYPE"] <- "winter storm"
b[grep("^winter w",b$EVTYPE),"EVTYPE"] <- "winter weather"
b1 <- subset(b,b$EVTYPE %in% Events48[,1])
a <- rbind(a,b1)
a <- a[order(a$TotalWealth,decreasing = TRUE),]
b <- subset(b,!(b$EVTYPE %in% Events48[,1]))
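Broad patterns such as the thunderstorm one are worth checking on a few sample labels. The sketch below uses made-up labels; it suggests that the unanchored pattern would also catch a label such as "marine thunderstorm wind", which has its own category in the official list. The anchored variant is only a hypothetical alternative, not what the analysis used.

```r
# Toy labels after "tstm" has been expanded to "thunderstorm" (made up)
toyTypes <- c("thunderstorm wind", "thunderstorm winds",
              "marine thunderstorm wind", "tornado f1")

pattern <- "t(.+)u(.+)st(.+)m"    # pattern used in the step above
anchored <- paste0("^", pattern)  # hypothetical anchored variant

grepl(pattern, toyTypes)   # the marine label matches as well
grepl(anchored, toyTypes)  # only labels that start with the pattern match
```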
Next, the events that contain the word fog are corrected. To do so, we check which ones have the words dense fog or freezing fog in the “REMARKS” column, which contains the comments on the event.
## See which event types have remarks that contain the words "dense fog"
unique(b$EVTYPE[grep("[Dd][Ee][Nn][Ss][Ee] [Ff][Oo][Gg]",b$REMARKS)])
## [1] "fog"
## Since the only event type listed is "fog", the "fog" rows whose
## remarks mention dense fog are relabelled as "dense fog".
denseFogRows <- grep("[Dd][Ee][Nn][Ss][Ee] [Ff][Oo][Gg]",b$REMARKS)
fogTypes <- subset(denseFogRows,denseFogRows %in% grep("^fog",b$EVTYPE))
b$EVTYPE[fogTypes] <- "dense fog"
## See which event types have remarks that contain the words
## "freezing fog"
unique(b$EVTYPE[grep("[Ff]reezing [Ff][Oo][Gg]",b$REMARKS)])
## [1] "fog"
## Again the only event type listed is "fog", so the "fog" rows whose
## remarks mention freezing fog are relabelled as "freezing fog".
freezingFogRows <- grep("[Ff]reezing [Ff][Oo][Gg]",b$REMARKS)
fogTypes <- subset(freezingFogRows,
freezingFogRows %in% grep("^fog",b$EVTYPE))
b$EVTYPE[fogTypes] <- "freezing fog"
## Another common name in EVTYPE is glaze; according to the PDF guide,
## glaze is treated the same as freezing fog
b$EVTYPE[grep("glaz",b$EVTYPE)] <- "freezing fog"
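For searches like the ones on REMARKS, `grep()` also offers the `ignore.case` argument, which avoids spelling out a character class for every letter. A toy comparison on made-up remarks:

```r
# Toy remarks (made up)
toyRemarks <- c("Dense Fog reduced visibility.", "DENSE FOG on the highway.",
                "Light rain all day.", "Freezing fog overnight.")

# Character-class version, as in the step above
grep("[Dd][Ee][Nn][Ss][Ee] [Ff][Oo][Gg]", toyRemarks)

# Shorter alternative that ignores case for the whole pattern
grep("dense fog", toyRemarks, ignore.case = TRUE)
```

Both calls return the positions of the first two remarks in this toy vector.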
Continuing with other EVTYPE names:
## Another common mismatch is "rip currents": in the official list
# it is called "rip current", so let's correct this
b$EVTYPE[grep("^rip c(.*)r(.*)s",b$EVTYPE)]<- "rip current"
## correct those that are storm surges
b$EVTYPE[b$EVTYPE=="storm surge"] <- "storm surge/tide"
b$EVTYPE[b$EVTYPE=="tide"] <- "storm surge/tide"
## Correct chill registers to extreme cold/wind chill
b$EVTYPE[grep("chill",b$EVTYPE)] <- "extreme cold/wind chill"
## Recalculate data frames a and b
b1 <- subset(b,b$EVTYPE %in% Events48[,1])
a <- rbind(a,b1)
a <- a[order(a$TotalWealth,decreasing = TRUE),]
b <- subset(b,!(b$EVTYPE %in% Events48[,1]))
Now that the number of records in “b” has been reduced, it is useful to see which names have the greatest frequency and which have the greatest impact on the ranking:
## table to see which names are more important to fix.
hola <- as.data.frame(table(as.factor(b$EVTYPE)))
names(hola) <- c("EVTYPE","Frequency")
hola <- hola[order(hola$Frequency,decreasing = TRUE),]
hola2 <- data.frame(totalInjuries=tapply(b$TotalWealth,
as.factor(b$EVTYPE),sum), eventType =
names(tapply(b$TotalWealth,as.factor(b$EVTYPE),sum)))
hola2 <- hola2[order(hola2$totalInjuries,decreasing = TRUE),]
head(hola,15)
## EVTYPE Frequency
## 102 urban/sml stream fld 702
## 27 extreme cold 166
## 70 light snow 141
## 31 fog 101
## 84 river flood 80
## 22 dry microburst 75
## 106 wind 67
## 51 heavy surf/high surf 50
## 95 strong winds 45
## 9 coastal flooding 35
## 80 other 34
## 42 gusty winds 30
## 49 heavy surf 29
## 25 excessive snow 25
## 69 light freezing rain 22
head(hola2,15)
## totalInjuries eventType
## fog 772 fog
## extreme cold 194 extreme cold
## urban/sml stream fld 107 urban/sml stream fld
## wind 102 wind
## heavy surf/high surf 90 heavy surf/high surf
## wintry mix 78 wintry mix
## heat wave 70 heat wave
## heavy surf 46 heavy surf
## snow squall 37 snow squall
## cold 30 cold
## dry microburst 28 dry microburst
## mixed precip 28 mixed precip
## strong winds 28 strong winds
## icy roads 26 icy roads
## black ice 25 black ice
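The same two views, frequency and total impact per label, can be sketched compactly on made-up data. The data frame `toy` is hypothetical, and `sapply(split(...), sum)` is used here only as an alternative to the `tapply()` call above.

```r
# Toy unmatched records (made up)
toy <- data.frame(EVTYPE = c("fog", "fog", "wind", "fog", "wind", "cold"),
                  TotalWealth = c(5, 0, 2, 10, 1, 0),
                  stringsAsFactors = FALSE)

freq <- sort(table(toy$EVTYPE), decreasing = TRUE)
impact <- sort(sapply(split(toy$TotalWealth, toy$EVTYPE), sum),
               decreasing = TRUE)

freq   # how often each label appears
impact # injuries plus deaths attributed to each label
```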
Based on this we know that the event types containing “flood”, “extreme cold” and “high winds” are among the ones that could have a big effect on the ranking.
The records that contain flood/flash flood could not be reassigned because the difference between flood and flash flood was not clear. This problem is dealt with later.
The names “urban/sml stream fld” and “urban flood” are categorized as heavy rain in the database documentation (PDF), so this is how they are assigned here.
Heat is another candidate; however, “heat wave” and “extreme heat” cannot be reassigned because the difference between excessive heat and heat records is not entirely clear in the database documentation.
## Correct extreme cold and high winds and urban stream flood
b$EVTYPE[(b$EVTYPE=="extreme cold")] <- "extreme cold/wind chill"
b$EVTYPE[b$EVTYPE=="high winds"] <- "high wind"
## The manual says that urban and small stream flooding should be
## recorded as heavy rain
b$EVTYPE[b$EVTYPE=="urban/sml stream fld"] <- "heavy rain"
b$EVTYPE[b$EVTYPE=="urban flood"] <- "heavy rain"
## Correct flash flooding
b$EVTYPE[b$EVTYPE=="flash flooding"] <- "flash flood"
Finally we recalculate the data frames “a” and “b”.
b1 <- subset(b,b$EVTYPE %in% Events48[,1])
a <- rbind(a,b1)
a <- a[order(a$TotalWealth,decreasing = TRUE),]
b <- subset(b,!(b$EVTYPE %in% Events48[,1]))
The next step is to recheck which event type names may be worth correcting.
hola <- as.data.frame(table(as.factor(b$EVTYPE)))
names(hola) <- c("EVTYPE","Frequency")
hola <- hola[order(hola$Frequency,decreasing = TRUE),]
hola2 <- data.frame(totalInjuries=tapply(b$TotalWealth,
as.factor(b$EVTYPE),sum), eventType = names(
tapply(b$TotalWealth,as.factor(b$EVTYPE),sum)))
hola2 <- hola2[order(hola2$totalInjuries,decreasing = TRUE),]
head(hola,20)
## EVTYPE Frequency
## 68 light snow 141
## 30 fog 101
## 82 river flood 80
## 22 dry microburst 75
## 103 wind 67
## 50 heavy surf/high surf 50
## 93 strong winds 45
## 9 coastal flooding 35
## 78 other 34
## 41 gusty winds 30
## 48 heavy surf 29
## 25 excessive snow 25
## 67 light freezing rain 22
## 13 cold 20
## 62 icy roads 18
## 73 mixed precipitation 18
## 89 snow 16
## 31 freeze 14
## 37 gusty wind 13
## 88 small hail 11
head(hola2,20)
## totalInjuries eventType
## fog 772 fog
## wind 102 wind
## heavy surf/high surf 90 heavy surf/high surf
## wintry mix 78 wintry mix
## heat wave 70 heat wave
## heavy surf 46 heavy surf
## snow squall 37 snow squall
## cold 30 cold
## dry microburst 28 dry microburst
## mixed precip 28 mixed precip
## strong winds 28 strong winds
## icy roads 26 icy roads
## black ice 25 black ice
## unseasonably warm 17 unseasonably warm
## freezing drizzle 15 freezing drizzle
## cold and snow 14 cold and snow
## gusty winds 14 gusty winds
## snow 14 snow
## rough seas 13 rough seas
## high seas 10 high seas
It is also interesting to see the ranking of the data already arranged in “a”, in order to gauge the possible effect of the records still in “b”.
rankingWealth <-data.frame(total =
tapply(a$TotalWealth,as.factor(a$EVTYPE),sum),
evtype = names(
tapply(a$TotalWealth,as.factor(a$EVTYPE),sum)))
rankingWealth<-rankingWealth[order(rankingWealth$total,decreasing = TRUE),]
head(rankingWealth,6)
## total evtype
## tornado 22178 tornado
## excessive heat 8188 excessive heat
## flood 7172 flood
## thunderstorm wind 5526 thunderstorm wind
## lightning 4792 lightning
## flash flood 2561 flash flood
Following this, a last set of corrections is made to the EVTYPE data in “b” to make the data as accurate as possible.
The first command is used to identify which names may be included as “high surf”.
unique(b[grep("^(.*)surf",b$EVTYPE),"EVTYPE"])
## [1] "heavy surf/high surf" "heavy surf" "rough surf"
## [4] "heavy surf and wind" "hazardous surf" "heavy rain/high surf"
## [7] "high surf advisory"
Knowing this, and after reading the REMARKS column for these EVTYPE names, the following reassignment is made.
b[grep("^(.*)surf",b$EVTYPE),"EVTYPE"] <- "high surf"
The mistyped names that correspond to winter storm and winter weather were already fixed in an earlier step.
The data in “a” and “b” is updated for the last time. The frequency and total health damage of each remaining EVTYPE name in “b” is displayed.
b1 <- subset(b,b$EVTYPE %in% Events48[,1])
a <- rbind(a,b1)
a <- a[order(a$TotalWealth,decreasing = TRUE),]
b <- subset(b,!(b$EVTYPE %in% Events48[,1]))
hola <- as.data.frame(table(as.factor(b$EVTYPE)))
hola <- hola[order(hola$Freq,decreasing = TRUE),]
hola2 <- data.frame(totalInjuries=tapply(b$TotalWealth,
as.factor(b$EVTYPE),sum), eventType =
names(tapply(b$TotalWealth,as.factor(b$EVTYPE),sum)))
hola2 <- hola2[order(hola2$totalInjuries,decreasing = TRUE),]
head(hola,10)
## Var1 Freq
## 62 light snow 141
## 30 fog 101
## 76 river flood 80
## 22 dry microburst 75
## 96 wind 67
## 86 strong winds 45
## 9 coastal flooding 35
## 72 other 34
## 41 gusty winds 30
## 25 excessive snow 25
head(hola2,10)
## totalInjuries eventType
## fog 772 fog
## wind 102 wind
## wintry mix 78 wintry mix
## heat wave 70 heat wave
## snow squall 37 snow squall
## cold 30 cold
## dry microburst 28 dry microburst
## mixed precip 28 mixed precip
## strong winds 28 strong winds
## icy roads 26 icy roads
Since “fog”, “wind” and “heat” events are the remaining ones that could have caused the most deaths and injuries, we check that the sum of all the injuries and deaths for EVTYPE names containing these words would not change the top 5 ranking.
## see how many injuries the fog entries have
unique(b[grep("fog",b$EVTYPE),"EVTYPE"])
## [1] "fog"
sum(b[grep("fog",b$EVTYPE),"TotalWealth"])
## [1] 772
More than 700 injuries/deaths is a considerable number; however, it is not enough to change the top 5 ranking, so the remaining “fog” records in “b” will not be taken into account.
## see how many injuries the wind entries have
unique(b[grep("wind",b$EVTYPE),"EVTYPE"])
## [1] "wind" "non-severe wind damage" "gusty winds"
## [4] "strong winds" "winds" "whirlwind"
## [7] "gusty wind" "wind damage" "gusty wind/rain"
## [10] "gusty wind/hvy rain" "gradient wind" "high wind (g40)"
## [13] "gusty wind/hail" "wind and wave"
sum(b[grep("wind",b$EVTYPE),"TotalWealth"])
## [1] 155
This carries little weight, so fixing it would not change the ranking; the remaining “wind” records in “b” will not be taken into account.
# Let's do the same with heat
unique(b[grep("heat",b$EVTYPE),"EVTYPE"])
## [1] "heat wave" "record heat"
sum(b[grep("heat",b$EVTYPE),"TotalWealth"])
## [1] 72
Once again this carries little weight, so fixing it would not change the ranking; the remaining “heat” records in “b” will not be taken into account.
The last step for cleaning the data is to check the ranking but now regarding the economic costs of the weather events. These costs are stored in the “totalDamage” column.
costinB <- data.frame(totalDamage=tapply(b$totalDamage,
as.factor(b$EVTYPE),sum), eventType =
names(tapply(b$totalDamage,as.factor(b$EVTYPE),sum)))
costinB <- costinB[order(costinB$totalDamage,decreasing = TRUE),]
head(costinB,20)
## totalDamage eventType
## freeze 156925000 freeze
## river flooding 134175000 river flooding
## coastal flooding 103809000 coastal flooding
## damaging freeze 42130000 damaging freeze
## early frost 42000000 early frost
## agricultural freeze 28820000 agricultural freeze
## unseasonably cold 25042500 unseasonably cold
## river flood 22157000 river flood
## small hail 20863000 small hail
## coastal flooding/erosion 20030000 coastal flooding/erosion
## erosion/cstl flood 16200000 erosion/cstl flood
## coastal flooding/erosion 15000000 coastal flooding/erosion
## fog 13145500 fog
## hard freeze 12900000 hard freeze
## unseasonal rain 10000000 unseasonal rain
## astronomical high tide 9425000 astronomical high tide
## unseasonable cold 5100000 unseasonable cold
## wind 2589500 wind
## snow 2554000 snow
## light snow 2513000 light snow
With this information, it is time to see the ranking with the already cleaned data:
temporal <- tapply(a$totalDamage,as.factor(a$EVTYPE),sum)
damageRank <- data.frame(cost=temporal,event=names(temporal))
damageRank <- damageRank[order(damageRank$cost,decreasing = TRUE),]
leastDif <- damageRank[5,1]- damageRank[6,1]
head(damageRank,6)
## cost event
## flood 148919611950 flood
## hurricane/typhoon 87068996810 hurricane/typhoon
## storm surge/tide 47835579000 storm surge/tide
## tornado 24900370720 tornado
## hail 17071172870 hail
## flash flood 16557155610 flash flood
Comparing the “totalDamage” of the records remaining in “b” with the ranking of “a” by “totalDamage”, some keywords (weather types) are identified that could affect the top 5 elements in the ranking. These words are “flood”, “freeze” and “rain”. To analyze whether these weather types could interfere with the analysis, their totals are added up.
Let’s first check the “flood” events.
flood <- sum(b$totalDamage[grep("flood",b$EVTYPE)])
flood
## [1] 311400000
Next, review the “freeze” events.
freeze <- sum(b$totalDamage[grep("freeze",b$EVTYPE)])
freeze
## [1] 240775000
Finally, the events involving “rain” are checked.
rain <- sum(b$totalDamage[grep("rain",b$EVTYPE)])
rain
## [1] 11631000
The smallest difference between consecutive ranks among the 6 most expensive weather events is the one between the 5th and 6th, which is 5.1401726 × 10^8 (about 514 million U.S. dollars). None of the keyword totals remaining in “b” (flood, freeze or rain) reaches this amount, so correcting them would not change the top 5 most economically costly weather events in the United States. For this reason the remaining records in “b” will be omitted.
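The comparison in the paragraph above can be written out with the figures reported in the outputs. The variable names `fifth`, `sixth`, `leftover` and `gap` are introduced only for this illustration.

```r
# Values copied from the outputs above
fifth <- 17071172870   # hail, 5th place in damageRank
sixth <- 16557155610   # flash flood, 6th place in damageRank
leftover <- c(flood = 311400000, freeze = 240775000, rain = 11631000)

gap <- fifth - sixth
gap
leftover < gap  # TRUE means the leftover total is smaller than the gap
```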
Based on the data processing done in the section above, it is possible to plot which types of weather events are the most harmful to population health.
g1 <- barplot(height = (rankingWealth[1:5,1]/1000),names.arg =
rankingWealth[1:5,2], xlab = "weather event" ,
ylab = "Injuries and deaths (thousands)",
ylim = c(0,25),cex.names = 0.7)
points(x = g1, y = rankingWealth[1:5,1]/1000,pch=19)
text(x = g1, y= rankingWealth[1:5,1]/1000 ,
labels = as.character(round(rankingWealth[1:5,1]/1000,
digits =c(1,2,2,2,2))),pos= 3, cex=0.7)
title(main="Most harmful weather events for human health")
Top 5 weather event types with the greatest damage in human health in the U.S.
Tornadoes rank first largely because the sum of injuries and deaths for each event type was chosen as the measurement, which takes the frequency of occurrence into account.
For the second question a different ranking is used, ordered from highest to lowest economic cost. It was saved in the “damageRank” variable during the data processing in the previous section.
g2 <- barplot(height = (damageRank[1:5,1]/1000000000),names.arg =
damageRank[1:5,2], xlab = "weather event" ,
ylab = "Cost U.S. Dollars (billions)",
ylim = c(0,200),cex.names = 0.7)
points(x = g2, y = damageRank[1:5,1]/1000000000,pch=19)
text(x = g2, y = (damageRank[1:5,1]/1000000000) ,
labels = as.character(round(damageRank[1:5,1]/1000000000,1)),
pos= 3, cex=0.7)
title(main="Events with greater economic consequences")
Top 5 weather event types with the greatest economic cost in the United States.
This plot makes clear which weather events are the most expensive for the U.S. It is interesting how this ranking differs from the one for the first question: heat is not present here, which makes sense, as heat may cause many deaths but generally causes economic damage only when fires are generated. Any strategy should address flood and tornado events, because both are very expensive and also harmful to human health.