In this project we aim to explore the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. We aim to understand which types of events are most harmful with respect to population health and which types of events have the greatest economic consequences.
We first read the data from the National Weather Service about storms and their effects on the U.S. economy. The data come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. We download it, decompress it and read it into a data frame.
#Download the data into a temporary file
my.file <- tempfile()
fileUrl <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#mode = "wb" keeps the compressed file intact on Windows
download.file(fileUrl, destfile = my.file, mode = "wb")
#read and decompress the file into a data frame
storm <- read.csv(bzfile(my.file))
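The read step can be tried without downloading the full file. The following is a minimal sketch on invented toy data: it writes a two-row CSV through a bzip2 connection and reads it back with the same `read.csv(bzfile(...))` call. The names `toy`, `toy.path` and `toy.back` are hypothetical and not part of the analysis.

```r
# Toy data frame standing in for the storm data (values are invented)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)

# Write the toy data as a bzip2-compressed CSV in a temporary location
toy.path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toy.path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back the same way the report reads the storm file
toy.back <- read.csv(bzfile(toy.path), stringsAsFactors = FALSE)
print(toy.back)
```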
After reading we check the data to see its structure:
#look at the dimensions of the data, i.e. number of observations and variables
dim(storm)
## [1] 902297 37
#look at the structure through the initial observations
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
#a look at the structure of the data
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The columns we are interested in are those that indicate the type of event and its effects: “EVTYPE” (event type), “FATALITIES” (fatalities), “INJURIES” (injuries), “PROPDMG” (property damage) and “CROPDMG” (crop damage).
We will also keep the following variables in case we want deeper insight in our analysis: “BGN_DATE” (day, month and year of the event), “COUNTYNAME” (county of the event) and “STATE” (U.S. state of the event).
We subset the data and transform the date variable of interest:
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.0.3
storm[, 2] <- mdy_hms(storm[, 2])
sum.storm <- storm[, c(2, 4, 5, 6, 7, 8, 22:27)]
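The subset above selects columns by position. As an alternative sketch, the same kind of subset can be written with column names, which keeps working if the column order changes. The example below uses invented toy rows and base R's `as.Date` instead of `lubridate`, only so that it runs without extra packages. The names `toy`, `keep` and `toy.sub` are hypothetical.

```r
# Toy rows shaped like a few columns of the storm data (values are invented)
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "6/8/1951 0:00:00"),
  STATE      = c("AL", "AL"),
  EVTYPE     = c("TORNADO", "TORNADO"),
  F          = c(3, 2),
  FATALITIES = c(0, 0),
  INJURIES   = c(15, 2),
  stringsAsFactors = FALSE
)

# Select the columns of interest by name rather than by index
keep <- c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES")
toy.sub <- toy[, keep]

# Parse the month/day/year part of the date string; the trailing time is ignored
toy.sub$BGN_DATE <- as.Date(toy.sub$BGN_DATE, format = "%m/%d/%Y")
str(toy.sub)
```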
Missing values are a common problem with environmental data, so we check what proportion of the observations are missing (i.e. coded as NA).
mean(is.na(sum.storm))
## [1] 0
As we can see, missing values are not a problem in the variables of interest, so we move on to the analysis of the data.
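A single overall proportion can be broken down per column, and it is worth remembering that `is.na` does not count empty strings, which do appear as a level in several columns of the `str` output. The following sketch uses invented toy data. The names `toy`, `na.share` and `blank.share` are hypothetical.

```r
# Toy data with one NA and one empty string (values are invented)
toy <- data.frame(EVTYPE = c("TORNADO", "", "HAIL"),
                  INJURIES = c(2, NA, 0),
                  stringsAsFactors = FALSE)

# Proportion of NA values in each column
na.share <- colMeans(is.na(toy))

# Proportion of empty strings in a text column (these are not NA)
blank.share <- mean(toy$EVTYPE == "")

print(na.share)
print(blank.share)
```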
We now examine the effects of storms on the health of Americans. We use the number of fatalities and injuries caused by storms to measure health effects, and the damage caused to property and agriculture to measure economic effects.
Using bar plots, we investigate the 10 event types with the highest total number of fatalities and injuries across all states of the United States. We use only the subset of the data containing the most harmful types of storms, since the variety of event types is large and many of them have very low impact.
The data are aggregated by event type to give better insight into the relation between the variables.
#aggregate the data by event type and sort it in decreasing order
agr.FATALITIES <- aggregate(FATALITIES ~ EVTYPE, data = sum.storm, FUN = sum)
agr.INJ <- aggregate(INJURIES ~ EVTYPE, data = sum.storm, FUN = sum)
sorted.FAT <- agr.FATALITIES[order(agr.FATALITIES$FATALITIES, decreasing = TRUE), ]
sorted.INJ <- agr.INJ[order(agr.INJ$INJURIES, decreasing = TRUE), ]
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.0.3
## Loading required package: grid
#barplots of deaths and injuries by storm type
plot1 <- ggplot(sorted.FAT[1:10, ], aes(x = EVTYPE, y = FATALITIES)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  ggtitle("Number of deaths in the U.S. by Storm Type") +
  xlab("Storm Type") + ylab("Number of Fatalities") +
  geom_text(aes(label = FATALITIES), size = 4, vjust = 1)
plot2 <- ggplot(sorted.INJ[1:10, ], aes(x = EVTYPE, y = INJURIES)) +
  geom_bar(stat = "identity", fill = "lightblue", color = "black") +
  ggtitle("Number of injuries caused by storms in the U.S.") +
  xlab("Storm Type") + ylab("Number of Injuries") +
  geom_text(aes(label = INJURIES), size = 4, vjust = -1)
grid.arrange(plot1, plot2)
paste("The most fatal type of storm is the", agr.FATALITIES[which.max(agr.FATALITIES$FATALITIES), "EVTYPE"], "with a total of", max(agr.FATALITIES$FATALITIES), "fatalities" )
## [1] "The most fatal type of storm is the TORNADO with a total of 5633 fatalities"
paste("The most harmful type of storm is the", agr.INJ[which.max(agr.INJ$INJURIES), "EVTYPE"], "with a total of", max(agr.INJ$INJURIES), "injuries" )
## [1] "The most harmful type of storm is the TORNADO with a total of 91346 injuries"
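The aggregate, sort and top-N steps used above can be followed on a small invented data set. The sketch below also shows two optional extras that are not part of the analysis: turning the event type into a factor whose levels follow the sorted order (ggplot2 draws a discrete axis in factor-level order, so this would order the bars by value), and computing how many times larger the top total is than the runner-up. All names starting with `toy` and `ratio.top2` are hypothetical.

```r
# Toy event records (values are invented)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL", "TORNADO", "FLOOD"),
                  FATALITIES = c(3, 0, 2, 1),
                  stringsAsFactors = FALSE)

# Sum fatalities per event type
toy.agg <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort by total, largest first, and keep the top 2 rows
toy.sorted <- toy.agg[order(toy.agg$FATALITIES, decreasing = TRUE), ]
toy.top <- head(toy.sorted, 2)

# Optional: make the factor levels follow the sorted order
toy.top$EVTYPE <- factor(toy.top$EVTYPE, levels = toy.top$EVTYPE)

# Ratio between the largest and the second-largest total
ratio.top2 <- toy.sorted$FATALITIES[1] / toy.sorted$FATALITIES[2]

print(toy.top)
print(ratio.top2)
```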
From the figures and the results we can see that the most harmful type of event across the United States is the tornado, which causes the largest number of both deaths and injuries. Tornadoes caused 5633 fatalities, about 3 times the number caused by the second most fatal type of storm, excessive heat. They also caused the most injuries, 91346 in the period analysed, about 13 times as many as TSTM WIND. Overall, we can see that tornadoes, TSTM WIND and floods in general are the biggest threats to the health of the American people.
Now we turn to the effects of storms on the American economy.
Using bar plots, we investigate the 10 event types that cause the most property damage and crop damage across all states of the United States. Again, we look only at a small subset of the kinds of storms (the most harmful), since the variety is large.
Again, the data are aggregated by event type to give better insight into the relation between the variables.
#aggregate the data by event type and sort it in decreasing order
agr.PROPDMG <- aggregate(PROPDMG ~ EVTYPE, data = sum.storm, FUN = sum)
agr.CROPDMG <- aggregate(CROPDMG ~ EVTYPE, data = sum.storm, FUN = sum)
sorted.PROPDMG <- agr.PROPDMG[order(agr.PROPDMG$PROPDMG, decreasing = TRUE), ]
sorted.CROPDMG <- agr.CROPDMG[order(agr.CROPDMG$CROPDMG, decreasing = TRUE), ]
#barplots of property damage and crop damage by storm type
plot3 <- ggplot(sorted.PROPDMG[1:10, ], aes(x = EVTYPE, y = PROPDMG)) +
  geom_bar(stat = "identity", fill = "red", color = "black") +
  ggtitle("Property damage caused in the U.S. by Storm Type") +
  xlab("Storm Type") + ylab("Damage") +
  geom_text(aes(label = PROPDMG), size = 4, vjust = -1)
plot4 <- ggplot(sorted.CROPDMG[1:10, ], aes(x = EVTYPE, y = CROPDMG)) +
  geom_bar(stat = "identity", fill = "#D55E00", color = "black") +
  ggtitle("Crop damage caused in the U.S. by Storm Type") +
  xlab("Storm Type") + ylab("Damage") +
  geom_text(aes(label = CROPDMG), size = 4, vjust = -1)
grid.arrange(plot3, plot4)
paste("The ratio between Agriculture Damage and Property Damage is :", sum(agr.CROPDMG$CROPDMG)/sum(agr.PROPDMG$PROPDMG))
## [1] "The ratio between Agriculture Damage and Property Damage is : 0.126586183906853"
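The sums above add the PROPDMG and CROPDMG values as recorded. The data set also carries the PROPDMGEXP and CROPDMGEXP columns, with codes such as K visible in the `head` output. One possible refinement is sketched below on invented toy data. It assumes that K, M and B stand for thousands, millions and billions, which should be checked against the NOAA documentation, and it treats any other code as a multiplier of 1, which is a simplification. The lookup `exp.mult` and the column `PROPDMG.USD` are hypothetical names, not part of the analysis.

```r
# Toy damage records (values are invented)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "HAIL", "FLOOD"),
                  PROPDMG = c(25, 1.5, 10, 2),
                  PROPDMGEXP = c("K", "B", "", "M"),
                  stringsAsFactors = FALSE)

# Assumed meaning of the exponent codes (an assumption, see the text above)
exp.mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up a multiplier for each row; unknown or empty codes fall back to 1
mult <- unname(exp.mult[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1

# Scale the recorded value and aggregate by event type
toy$PROPDMG.USD <- toy$PROPDMG * mult
toy.usd <- aggregate(PROPDMG.USD ~ EVTYPE, data = toy, FUN = sum)
print(toy.usd)
```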
With regard to economic damage, tornadoes are again the most serious issue, but TSTM WIND and floods also cause a lot of damage to property. In this case the damage is much more evenly distributed across event types than the injuries were. However, the storm type that causes the most harm to agriculture is not the tornado but hail. Since the property damage is much higher than the agricultural damage (the ratio computed above is about 0.13), we can still consider tornadoes the greatest threat in terms of harm to Americans.
As our study shows, of all kinds of storms the tornado is the one that causes the most harm both to the American economy and to the health of Americans. Further analysis is needed to investigate whether there are patterns and recurrences over the years and/or across the states that may suggest better policies to deal with the damage done by these events.
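As a starting point for that further analysis, totals per year and per state can be obtained with the same `aggregate` approach used throughout. The sketch below runs on invented toy data. The `YEAR` column and the names `toy`, `by.year` and `by.state.year` are hypothetical. `format(x, "%Y")` also works on the date-time values produced by `mdy_hms`.

```r
# Toy events with a date and a state (values are invented)
toy <- data.frame(
  BGN_DATE = as.Date(c("1950-04-18", "1950-06-08", "1951-02-20", "1951-11-15")),
  STATE = c("AL", "AL", "AL", "TX"),
  FATALITIES = c(0, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Extract the year of each event
toy$YEAR <- as.integer(format(toy$BGN_DATE, "%Y"))

# Total fatalities per year, and per state within each year
by.year <- aggregate(FATALITIES ~ YEAR, data = toy, FUN = sum)
by.state.year <- aggregate(FATALITIES ~ STATE + YEAR, data = toy, FUN = sum)

print(by.year)
print(by.state.year)
```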