Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. The aim of this report is to find the most critical weather events in United States in terms of economical damage and impact to people health.
For this purpose, the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database is used. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, crop damage and property damage.
FALTA INDICAR AQUÍ ALGO DEL RESULTADO
This section includes reading, processing and transforming raw data that is available here.
The data is found as CSV format and zipped as bz2 file:
dataNOAA <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
head(dataNOAA)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
The analysis is done accross the United States and considers the information of the following columns:
The events in the database start in the year 1950 and end in November 2011.
# Loading library:
library(ggplot2)
library(lubridate)
library(gridExtra)
dataNOAA$Year <- year(as.Date(dataNOAA$BGN_DATE,"%m/%d/%Y"))
years <- table(dataNOAA$Year)
plot(years,ylab = "Number of records in database",xlab = "Year")
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. As can be seen in the previous plot, from 1995 the number of records increase considerably. The data recorded since this year is used for this study as statistical analysis such as the aggregated economical impact of a certain event is impacted due to missing records in previous years.
# Subsetting data for records from 1995 (including this year):
dataCP <- dataNOAA[dataNOAA$Year > 1994, ]
Additionally, spelling of weather events in EVTYPE column could be different due to spaces at the beginning or case sensitivity. Therefore, EVTYPE records are converted to upper case and spaces at the beginning are removed while spaces in the middle are transformed to “.” character as they will be used in R scripts as names during this process:
# EVTYPE column with names in capital letters
# (e.g. WINTRY MIX appears as three different EVTYPE due to mix of capital and
# lower case characters)
dataCP$EVTYPE <- toupper(dataCP$EVTYPE)
# Avoiding spaces at the begining of EVTYPE column
dataCP$EVTYPE <- sub("^ ","",dataCP$EVTYPE)
dataCP$EVTYPE <- sub("^ ","",dataCP$EVTYPE)
# After these operations, EVTYPE comes from 985 to 890 different EVTYPES in the complete database (from 1950 onwards)
# Now spaces are converted to "." as later I will use them as variable names:
dataCP$EVTYPE <- gsub(" ",".",dataCP$EVTYPE)
head(dataCP,2)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 187560 1 1/6/1995 0:00:00 0300 CST 0
## 187561 1 1/22/1995 0:00:00 1800 CST 0
## COUNTYNAME STATE EVTYPE BGN_RANGE
## 187560 ALZ001 - 011 - 014 - 021 - 026 AL FREEZING.RAIN 0
## 187561 ALZ001 - 021 - 024 - 027 AL SNOW 0
## BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 187560 1000CST 0 NA
## 187561 1/23/1995 0:00:00 0800CST 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 187560 0 0 0 NA 0 0
## 187561 0 0 0 NA 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 187560 0 0 0
## 187561 0 0 0
## ZONENAMES
## 187560 LAUDERDALE - LAUDERDALE - MARION - WINSTON - CLEBURNE - ST. CLAIR
## 187561 LAUDERDALE - LAUDERDALE - CLEBURNE - JEFFERSON - TALLADEGA
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 187560 0 0 0 0
## 187561 0 0 0 0
## REMARKS
## 187560 Light freezing rain coated bridges and overpasses across the northern third of Alabama during the early morning. The icing on bridges and roadways was responsible for a number of traffic accidents. Icing was apparently not severe enough to bring down trees or power lines. The death of a 33-year-old man in a single-car accident in Limestone County was attributed to icy road conditions.
## 187561 Light snow fell across the northern third of Alabama, but thanks to relatively warm ground temperatures, little accumulation occurred and travel was not significantly affected. Snow fell from about 1800 CST on January 22nd to 0300 CST on the 23rd across northwest and north-central Alabama and from 0300 CST to 0800 CST across northeast Alabama. Heaviest snowfall with amounts of one to two inches were reported in Franklin, Lawrence, and eastern Colbert Counties in the northwest and in Marshall and Jackson Counties in the northeast. Lauderdale, Limestone, Madison, Marion, Winston, Cullman, Morgan, Blount, Etowah, Dekalb, and northern Cherokee Counties reported varying amounts around an inch. Lamar, Fayette, Walker, Jefferson, northern Shelby, Talladega, St. Clair, Calhoun, and northern and western Cleburne Counties had snow flurries with spotty locations reporting up to an inch. Icing on bridges was spotty.
## REFNUM Year
## 187560 187560 1995
## 187561 187561 1995
In order to convert the property and crop damage estimations into dollars, the exponent values must be determined. The following code provides this information:
# dmgexp function provides the correct exponent value to consider according to the
# information extracted from the following link posted in discussion forums:
# https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html
# These are possible values of CROPDMGEXP and PROPDMGEXP:
# H,h,K,k,M,m,B,b,+,-,?,0,1,2,3,4,5,6,7,8, and blank-character
#
# H,h = hundreds = 100
# K,k = kilos = thousands = 1,000
# M,m = millions = 1,000,000
# B,b = billions = 1,000,000,000
# (+) = 1
# (-) = 0
# (?) = 0
# black/empty character = 0
# numeric 0..8 = 10
#
# Inputs: factor vector with the indicators of the exponent value to consider
# Outputs: numeric vector with the exponent value for each position of the input vector
dmgexp <- function(values){
output <- rep.int(0,length(values))
output[grep("h",values,ignore.case=TRUE)] <- 100
output[grep("k",values,ignore.case=TRUE)] <- 1000
output[grep("m",values,ignore.case=TRUE)] <- 1000000
output[grep("b",values,ignore.case=TRUE)] <- 1000000000
output[grep("\\+",values)] <- 1
output[grep("[0-9]",values)] <- 10
output
}
dataCP$PropertyDMGExponent <- dmgexp(dataCP$PROPDMGEXP)
dataCP$CropDMGExponent <- dmgexp(dataCP$CROPDMGEXP)
For every weather event, there are two indicators that needs to be computed: > Economical damage indicator (EDI) impact defined as the sum of the property economical damage and the crop economical damage in dollars. > People’s health indicator (PHI) defined as the sum of the fatalities and injuries due to the weather event.
dataCP$economicalDMG <- dataCP$PROPDMG*dataCP$PropertyDMGExponent + dataCP$CROPDMG*dataCP$CropDMGExponent
dataCP$impactPeopleHealth <- dataCP$INJURIES + dataCP$FATALITIES
Another important figure to consider when understanding the impact of certain weather events is how many times this event happens during the period of observation. For that reason, the total number of occurrences of every event is computed. The top 10 of the most common events is shown.
occ <- table(dataCP$EVTYPE)
##To get the numerical values
occdata <- as.vector(occ)
##To get the names
names(occdata) <- names(occ)
# Even if this command will work to show part of the data, is clearer using kable
# head(sort(occ, decreasing = TRUE),10)
library(knitr)
# Showing top 10 most frequent events
kable(head(sort(occdata, decreasing = TRUE),10))
| HAIL | 215932 |
| TSTM.WIND | 128929 |
| THUNDERSTORM.WIND | 81746 |
| FLASH.FLOOD | 52674 |
| FLOOD | 24642 |
| TORNADO | 24335 |
| HIGH.WIND | 19958 |
| HEAVY.SNOW | 14710 |
| LIGHTNING | 14281 |
| HEAVY.RAIN | 11640 |
Finally, to figure out the impact of every type of weather event, EDI and PHI are aggregated for weather event types following two methods:
Summing the values of EDI and PHI for all weather events belonging to the same type during the period of observation. The 6 events with higher values are stored.
Averaging the values of EDI and PHI per weather event type. This figure is more robust against possible missing records during the period of observation and is not biased by the number of times a weather event occurs. The 6 events with highest average values are stored.
# Aggregating economical damage per weather type
tmp1 <- aggregate(economicalDMG ~ EVTYPE, data = dataCP, sum)
names(tmp1) <- c("EVTYPE","AggEconDMG")
# Averaging economical damage per weather type
tmp2 <- aggregate(economicalDMG ~ EVTYPE, data = dataCP, mean)
names(tmp2) <- c("EVTYPE","AvgEconDMG")
econDMG <- merge(tmp1,tmp2,by = "EVTYPE")
# Including number of occurrences
econDMG$occurrences <- occdata[econDMG$EVTYPE]
# Aggregating people health impact per weather type
tmp1 <- aggregate(impactPeopleHealth ~ EVTYPE, data = dataCP, sum)
names(tmp1) <- c("EVTYPE","AggHealthImpact")
# Averaging people health impact per weather type
tmp2 <- aggregate(impactPeopleHealth ~ EVTYPE, data = dataCP, mean)
names(tmp2) <- c("EVTYPE","AvgHealthImpact")
healthIPCT <- merge(tmp1,tmp2,by = "EVTYPE")
# Including number of occurrences
healthIPCT$occurrences <- occdata[healthIPCT$EVTYPE]
# Top 6 in each category:
aggHealthTop <- head(healthIPCT[with(healthIPCT, order(-AggHealthImpact)), ])
avgHealthTop <- head(healthIPCT[with(healthIPCT, order(-AvgHealthImpact)), ])
aggEconTop <- head(econDMG[with(econDMG, order(-AggEconDMG)), ])
avgEconTop <- head(econDMG[with(econDMG, order(-AvgEconDMG)), ])
The following plot shows the weather event types with highest impact on people’s health using the value of the PHI indicator after averaging or summing the impact of every weather event of that type that happened since 1995. The number of times that event happened is also shown in the plot.
#Top 6 weather events on aggregated impact
scatterAgg <- ggplot(aggHealthTop,aes(EVTYPE,AggHealthImpact,color=EVTYPE, fill = EVTYPE)) +
geom_bar(aes(color=EVTYPE), stat="identity")+
theme(legend.position="none",text = element_text(size=8),
axis.text.x = element_text(angle=90, vjust=1)) +
xlab("Weather event") +
ylab("Aggregated PHI")
#Number of times top 6 on aggregated impact happened
plot_topAgg <- ggplot(aggHealthTop,aes(EVTYPE,occurrences,color=EVTYPE, fill = EVTYPE)) +
geom_bar(aes(color=EVTYPE), stat="identity")+
theme(legend.position="none",text = element_text(size=8),
axis.text.x = element_text(angle=90, vjust=1)) +
xlab("Weather event") +
ylab("Number of events")
#Top 6 weather events on averaged impact
scatterAvg <- ggplot(avgHealthTop,aes(EVTYPE,AvgHealthImpact, color=EVTYPE, fill = EVTYPE)) +
geom_bar(aes(color=EVTYPE), stat="identity")+
theme(legend.position="none",text = element_text(size=8),
axis.text.x = element_text(angle=90, vjust=1)) +
xlab("Weather event") +
ylab("Averaged PHI")
#Number of times top 6 on averaged impact happened
plot_topAvg <- ggplot(avgHealthTop,aes(EVTYPE,occurrences,color=EVTYPE, fill = EVTYPE)) +
geom_bar(aes(color=EVTYPE), stat="identity")+
theme(legend.position="none",text = element_text(size=8),
axis.text.x = element_text(angle=90, vjust=1)) +
xlab("Weather event") +
ylab("Number of events")
#arrange the plots together, with appropriate height and width for each row and column
grid.arrange(scatterAgg,scatterAvg,plot_topAgg, plot_topAvg,
ncol=2, nrow=2, widths=c(2, 2), heights=c(2, 2),
top = "Weather events with highest impact on people's health from 1995")
It can be observed that the weather event with, on average, the highest impact on people’s health since 1995 is the THUNDERSTORMW. Fortunately, THUNDERSTORMW is pretty uncommon in the United States and the weather event that causes the highest impact on people’s health due also to the number of times it occurs, is the TORNADO.
The following plot is analogous to previous plot but using EDI indicator instead of PHI.
#Top 6 weather events on aggregated impact
scatterAgg <- ggplot(aggEconTop,aes(EVTYPE,AggEconDMG,color=EVTYPE, fill = EVTYPE)) +
geom_bar(aes(color=EVTYPE), stat="identity")+
theme(legend.position="none",text = element_text(size=8),
axis.text.x = element_text(angle=90, vjust=1)) +
xlab("Weather event") +
ylab("Aggregated EDI")
#Number of times top 6 on aggregated impact happened
plot_topAgg <- ggplot(aggEconTop,aes(EVTYPE,occurrences,color=EVTYPE, fill = EVTYPE)) +
geom_bar(aes(color=EVTYPE), stat="identity")+
theme(legend.position="none",text = element_text(size=8),
axis.text.x = element_text(angle=90, vjust=1)) +
xlab("Weather event") +
ylab("Number of events")
#Top 6 weather events on averaged impact
scatterAvg <- ggplot(avgEconTop,aes(EVTYPE,AvgEconDMG, color=EVTYPE, fill = EVTYPE)) +
geom_bar(aes(color=EVTYPE), stat="identity")+
theme(legend.position="none",text = element_text(size=8),
axis.text.x = element_text(angle=90, vjust=1)) +
xlab("Weather event") +
ylab("Averaged EDI")
#Number of times top 6 on averaged impact happened
plot_topAvg <- ggplot(avgEconTop,aes(EVTYPE,occurrences,color=EVTYPE, fill = EVTYPE)) +
geom_bar(aes(color=EVTYPE), stat="identity")+
theme(legend.position="none",text = element_text(size=8),
axis.text.x = element_text(angle=90, vjust=1)) +
xlab("Weather event") +
ylab("Number of events")
#arrange the plots together, with appropriate height and width for each row and column
grid.arrange(scatterAgg,scatterAvg,plot_topAgg, plot_topAvg,
ncol=2, nrow=2, widths=c(2, 2), heights=c(2, 2),
top = "Weather events with highest economical damage in dollars from 1995")
It can be observed that the weather event with, on average, the highest economical damage in dollars since 1995 is the HEAVY RAIN/SEVERE WEATHER Fortunately, HEAVY RAIN/SEVERE WEATHER is pretty uncommon in the United States and the weather event that causes the highest economical damage in dollars since 1995 due also to the number of times it occurs, is the FLOOD.