by Aki Taniguchi, 04/11/2019
Using the National Oceanic and Atmospheric Administration (NOAA) Storm Event Database, we will try to identify which type of events has been the most harmful, first in terms of population health, and second from an economic perspective. In order to do so, we will first process the dataset and keep only meaningful variables and observations. We will be using dates between 1996 and 2011 only as we will see that these are the most meaningful data ranges for the exercuse. Second, we will create 2 new variables: Human Damange and Economic damage, expressed as the sum of Fatalities and Injuries, and as the sum of Crop Damage and Property Damage (expressed in 1 US Dollar) respectively. Finally, we will plot each impact (both human and economic) against the different types of event. It will become clear to us “Tornado” events impact the most human health, whereas “Hurricane” damges the most the economy. Please note that everything has been coded and processed using R for Reproducibility sake, and sources (mostly URL) will be listed at the end of the analysis.
Let’s load the data first (the path can be changed at user’s convenience). The URL to download the raw file can be found here.
setwd("C:/Users/tngch/Documents/R/Learning/5. Reproducible Research")
nameFile = "repdata_data_StormData.csv.bz2"
df = read.csv(nameFile)
dim(df)
## [1] 902297 37
dfSize = nrow(df) * ncol(df) * 8 / 2^20
print(paste("The dataframe size is", format(round(dfSize, 2), nsmall= 2), "Mb."))
## [1] "The dataframe size is 254.71 Mb."
Given the dimension of the datasets, it seems necessary to analyze the relevance of each observation and variable that will help us in our analysis. Reducing the size of our datasets will greatly enhance the processing speed of the exercise.
colnames(df)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
head(df)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
str(df)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
First, let us look at how many observations are done per year:
df$Year <- as.numeric(format(as.Date(df$BGN_DATE, "%m/%d/%Y"), "%Y"))
with(df, hist(Year, breaks = max(Year) - min(Year), main = "Number of observation per year", xlab = "Year", ylab = "Frequency"))
One bar in the histogram represents 1 year. We can clearly see that there is a huge jump in number of observation starting from 1996. As it can be inferred from the plot in NOAA’ s website, there has been a change in regulation (probably the implementation of Directive 10-1605), the NOAA started to increase its observation to 48 events. We will therefore subset data which starts from 1996 only as previous years might skew our analysis (we need to keep an apple-to-apple comparison).
df = df[df$Year == 1996,]
Looking at the variables, it seems that we only need:
1. Types of events: EVTYPE
2. Population health related variables: FATALITIES and INJURIES
3. Economic consequences related variables: PROPDMG and CROPDMG, and their respective exponent PROPDMGEXP and CROPDMGEXP.
df = df[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
dim(df)
## [1] 32270 7
dfSize = nrow(df) * ncol(df) * 8 / 2^20
print(paste("The dataframe size is", format(round(dfSize, 2), nsmall= 2), "Mb."))
## [1] "The dataframe size is 1.72 Mb."
We can see that the dataset size is greatly reduced.
Let’s now see whether the data looks good to be processed.
str(df)
## 'data.frame': 32270 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 972 834 856 856 856 244 359 856 856 856 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 380 100 3 5 2 0 400 12 8 12 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 1 17 17 17 17 ...
## $ CROPDMG : num 38 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 7 1 1 1 1 1 1 1 1 1 ...
sum(is.na(df))
## [1] 0
length(unique(df$EVTYPE))
## [1] 228
The good news is there is no missing values so no need to do anything special on that front. However, from the NOAA documentation we know that there are 48 types of events, but we can clearly see there are more event type in the dataset: this is due to typo. We will need to clean this variable.
We first quantify the impact on population health and the impact on the economy:
1/ We will define the impact on population health as the sum of Fatalities and Injuries for each event, e.g. the sum of FATALITIES and INJURIES per event types.
2/ We will define the economic damage as being the dollar impact (in US dollar unit) done to property and crops, e.g. the sum of PROPDMG and CROPDMG, modified to take into account their respective Exponent.
length(intersect(unique(toupper(df$PROPDMGEXP)), unique(toupper(df$CROPDMGEXP)))) == length(unique(toupper(df$CROPDMGEXP)))
## [1] TRUE
length(intersect(unique(toupper(df$PROPDMGEXP)), unique(toupper(df$CROPDMGEXP)))) == length(unique(toupper(df$PROPDMGEXP)))
## [1] TRUE
We can see that all values of CROPDMG EXP are included in PROPDMGEXP. In order to convert PROPDMG and CROPDMG, we need to convert their Exponent in the relevant unit size thanks to the table shown here, and then reisizing PROPDMG and CROPDMG to put them on the same basis (e.g. 1 USD).
unitType = c(unique(toupper(df$PROPDMGEXP)))
unitValue = c(1000, 0, 1000000)
unitConversionDf = data.frame(unitType = unitType, unitValue = unitValue)
df$PROPDMGEXP = toupper(df$PROPDMGEXP)
df$CROPDMGEXP = toupper(df$CROPDMGEXP)
df = merge(df, unitConversionDf, by.x = "PROPDMGEXP", by.y = "unitType")
df = merge(df, unitConversionDf, by.x = "CROPDMGEXP", by.y = "unitType")
df$ECONOMIC.DAMAGE = df$PROPDMG * df$unitValue.x + df$CROPDMG * df$unitValue.y
df$HUMAN.DAMAGE = df$FATALITIES + df$INJURIES
df = df[, c("EVTYPE", "HUMAN.DAMAGE", "ECONOMIC.DAMAGE")]
df = aggregate(list(df$HUMAN.DAMAGE, df$ECONOMIC.DAMAGE), by = list(df$EVTYPE), sum)
colnames(df) = c("EVTS", "TOTAL.HUMAN.DAMAGE", "TOTAL.ECONOMIC.DAMAGE")
df = df[df$TOTAL.HUMAN.DAMAGE != 0 & df$TOTAL.ECONOMIC.DAMAGE != 0,]
dfSize = nrow(df) * ncol(df) * 8
print(paste("The dataframe size is", format(round(dfSize, 2), nsmall= 2), "b. The number of column is", ncol(df)))
## [1] "The dataframe size is 912.00 b. The number of column is 3"
We now have a dataframe which expresses Human Damage and Economic Damage in function of Events. We can see the dataset has been drastically reduced, so we can now clean Event types following the original 48 defined by NOAA (given the size of the data is now a lot smaller, we will do it by hand).
df$EVTS = toupper(df$EVTS)
## Coastal flood / Coastal storm grouping
df[c(2, 3), "EVTS"]
## [1] "COASTAL FLOODING" "COASTAL STORM"
df$TOTAL.HUMAN.DAMAGE[2] = df$TOTAL.HUMAN.DAMAGE[2] + df$TOTAL.HUMAN.DAMAGE[3]
df$TOTAL.ECONOMIC.DAMAGE[2] = df$TOTAL.ECONOMIC.DAMAGE[2] + df$TOTAL.ECONOMIC.DAMAGE[3]
df = df[-c(3),]
## Extreme cold / Extended cold / Extreme Windchill grouping
df[c(6, 7, 8, 9), "EVTS"]
## [1] "EXTENDED COLD" "EXTREME COLD" "EXTREME COLD"
## [4] "EXTREME WINDCHILL"
df$TOTAL.HUMAN.DAMAGE[7] = df$TOTAL.HUMAN.DAMAGE[6] + df$TOTAL.HUMAN.DAMAGE[7] + df$TOTAL.HUMAN.DAMAGE[9] + df$TOTAL.HUMAN.DAMAGE[8]
df$TOTAL.ECONOMIC.DAMAGE[7] = df$TOTAL.ECONOMIC.DAMAGE[6] + df$TOTAL.ECONOMIC.DAMAGE[7] + df$TOTAL.ECONOMIC.DAMAGE[9] + df$TOTAL.HUMAN.DAMAGE[8]
df = df[-c(6, 8, 9),]
## Heavy snow grouping
df[c(12, 13, 23, 24), "EVTS"]
## [1] "HEAVY SNOW" "HEAVY SNOW SHOWER" "SNOW"
## [4] "SNOW"
df$TOTAL.HUMAN.DAMAGE[12] = df$TOTAL.HUMAN.DAMAGE[12] + df$TOTAL.HUMAN.DAMAGE[13] + df$TOTAL.HUMAN.DAMAGE[24] + df$TOTAL.HUMAN.DAMAGE[23]
df$TOTAL.ECONOMIC.DAMAGE[12] = df$TOTAL.ECONOMIC.DAMAGE[12] + df$TOTAL.ECONOMIC.DAMAGE[13] + df$TOTAL.ECONOMIC.DAMAGE[24] + df$TOTAL.ECONOMIC.DAMAGE[23]
df = df[-c(13, 23, 24),]
## Heavy surf / High surf grouping
df[c(13, 14, 21), "EVTS"]
## [1] "HEAVY SURF" "HIGH SURF" "ROUGH SURF"
df$TOTAL.HUMAN.DAMAGE[14] = df$TOTAL.HUMAN.DAMAGE[14] + df$TOTAL.HUMAN.DAMAGE[13] + df$TOTAL.HUMAN.DAMAGE[21]
df$TOTAL.ECONOMIC.DAMAGE[14] = df$TOTAL.ECONOMIC.DAMAGE[14] + df$TOTAL.ECONOMIC.DAMAGE[13] + df$TOTAL.ECONOMIC.DAMAGE[21]
df = df[-c(13, 21),]
## Strong wind grouping
df[c(20, 21), "EVTS"]
## [1] "STRONG WIND" "STRONG WINDS"
df$TOTAL.HUMAN.DAMAGE[20] = df$TOTAL.HUMAN.DAMAGE[21] + df$TOTAL.HUMAN.DAMAGE[20]
df$TOTAL.ECONOMIC.DAMAGE[20] = df$TOTAL.ECONOMIC.DAMAGE[21] + df$TOTAL.ECONOMIC.DAMAGE[20]
df = df[-c(21),]
## Thunderstorm grouping (TSTM)
df[c(23, 24), "EVTS"]
## [1] "TSTM WIND" "TSTM WIND/HAIL"
df$TOTAL.HUMAN.DAMAGE[23] = df$TOTAL.HUMAN.DAMAGE[24] + df$TOTAL.HUMAN.DAMAGE[23]
df$TOTAL.ECONOMIC.DAMAGE[23] = df$TOTAL.ECONOMIC.DAMAGE[24] + df$TOTAL.ECONOMIC.DAMAGE[23]
df = df[-c(24),]
The data is now fully processed:
dfSize = nrow(df) * ncol(df) * 8
print(paste("The dataframe size is", format(round(dfSize, 2), nsmall= 2), "b. The number of column is", ncol(df), "and the number of row is", nrow(df)))
## [1] "The dataframe size is 648.00 b. The number of column is 3 and the number of row is 27"
str(df)
## 'data.frame': 27 obs. of 3 variables:
## $ EVTS : chr "BLIZZARD" "COASTAL FLOODING" "DRY MICROBURST" "DUST DEVIL" ...
## $ TOTAL.HUMAN.DAMAGE : num 193 3 1 1 28 88 136 51 44 83 ...
## $ TOTAL.ECONOMIC.DAMAGE: num 40755000 6375000 277000 8300 30000 ...
We are now ready to get the results of our analysis.
Let’s plot the 2 graphs now that we have a nice and clean dataset.
library(ggplot2)
library(scales)
ggplot(data = df, aes(x = EVTS, y = TOTAL.HUMAN.DAMAGE)) + geom_bar(stat="identity") + theme_classic() + ggtitle("Number of Death/Injuries per Weather events") + xlab("Events Type") + ylab("Sum of Fatalities and Injuries") + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(face="bold", angle=90))
ggplot(data = df, aes(x = EVTS, y = TOTAL.ECONOMIC.DAMAGE)) + geom_bar(stat="identity") + theme_classic() + ggtitle("Economic Impact per Weather events (in USD)") + xlab("Events Type") + ylab("Amount of Economic Impact in USD") + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(face="bold", angle=90)) + scale_y_continuous(labels = comma)
paste(c("Total of deaths and injuries from 1996 to 2011:",df[which.max(df$TOTAL.HUMAN.DAMAGE), "TOTAL.HUMAN.DAMAGE"]))
## [1] "Total of deaths and injuries from 1996 to 2011:"
## [2] "731"
paste(c("Total economic impact:",format(df[which.max(df$TOTAL.ECONOMIC.DAMAGE), "TOTAL.ECONOMIC.DAMAGE"], big.mark = ","),"USD"))
## [1] "Total economic impact:" "1,732,035,000"
## [3] "USD"
Using these 2 plots, it is clear that the event which has the most important impact on population health from 1996 to 2011 is Tornado. This represents 731 deaths and injuries combined.
Additionnally, we can also see that the event which has the most important economic impact from 1996 to 2011 is Hurricane. The impact amounts to $1.7bn.
1/ Storm events Dataset: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
2/ Evolution of the dataset in NOAA’s website: https://www.ncdc.noaa.gov/stormevents/details.jsp
3/ NOAA documentation on variables: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
4/ Exponential conversion explanation: https://github.com/flyingdisc/RepData_PeerAssessment2/blob/master/how-to-handle-PROPDMGEXP.md
OS: Windows 10 Home
R.version: 3.6.1