This report presents an exploratory analysis of severe weather events affecting the United States. Specifically, its purpose is to generate insights into the economic and public health consequences of these events. The data on which this report is based come from the NOAA Storm Database.
The two fundamental questions are:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
The report offers a glimpse of possible ways to answer these questions; it is a starting point for more elaborate studies and ultimately aims to provide support to decision makers in public policy.
This report is designed so that all of the analysis it contains can be reproduced. For this reason, the document incorporates all the code used for the different tasks throughout its extension. If you want to read this report in a reproducible manner, you can run the following code, which downloads the data and loads it into the variable “data”.
Note: the code creates the folder “stormreport” in your working directory and stores the data there.
if(!file.exists("stormreport")){dir.create("stormreport")}
setwd("./stormreport")
data.url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(data.url, destfile = basename(data.url), mode = "wb")
#read.csv() decompresses the .bz2 archive transparently
data <- read.csv(basename(data.url))
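If you plan to run this code more than once, a small guard avoids downloading the archive again. A minimal sketch, assuming the same destination file name as above:
#Re-use a previously downloaded copy if one is already present
if(!file.exists(basename(data.url))){
    download.file(data.url, destfile = basename(data.url), mode = "wb")
}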
The report uses the following R packages:
library(dplyr)
library(ggplot2)
library(knitr)      #kable()
library(lubridate)  #year()
library(reshape2)   #melt()
For any questions regarding the variables in the data, you can consult the following sources:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The goal of this section is to identify which variables in the data are relevant to answering the two questions described in the synopsis. This makes the data more compact and avoids the noise of many irrelevant elements.
#Easier to manage the data with tbl_df() from the dplyr package
data <- tbl_df(data)
#Lowercase the column names to reduce typing
colnames(data) <- tolower(colnames(data))
#A look at the variables in the data
colnames(data)
## [1] "state__" "bgn_date" "bgn_time" "time_zone" "county"
## [6] "countyname" "state" "evtype" "bgn_range" "bgn_azi"
## [11] "bgn_locati" "end_date" "end_time" "county_end" "countyendn"
## [16] "end_range" "end_azi" "end_locati" "length" "width"
## [21] "f" "mag" "fatalities" "injuries" "propdmg"
## [26] "propdmgexp" "cropdmg" "cropdmgexp" "wfo" "stateoffic"
## [31] "zonenames" "latitude" "longitude" "latitude_e" "longitude_"
## [36] "remarks" "refnum"
Before addressing each question separately, it is convenient to clarify which variables are needed in both. These play the role of identifier variables. An undoubtedly relevant variable for both questions is evtype; indeed, it is referenced explicitly in the questions themselves. The rest of the useful variables are: state, bgn_date, end_date, zonenames, longitude and latitude.
Among these identifier variables, state has a related variable in the data: state__. This is because state__ is the index (or level) and state is the name of the state (or label). Given this relationship, each level must have exactly one label; otherwise there will be cross-referencing problems.
#Select distinct state index numbers; there are 70 distinct index numbers
state.unique.levels <- nrow(data %>% select(state__) %>% distinct(state__))
#Select distinct state name labels; there are 72 distinct name labels
state.unique.labels <- nrow(data %>% select(state) %>% distinct(state))
state.label.level <- cbind(state.unique.levels, state.unique.labels)
#You can print(state.label.level) if you follow the analysis in the console; the knitr package is not strictly required
kable(state.label.level, align = "c")
| state.unique.levels | state.unique.labels |
|---|---|
| 70 | 72 |
As you can see, the counts differ: 70 unique index numbers versus 72 unique labels. Selecting the distinct pairs of the two variables helps identify the problem.
variable.pairs <- data %>% distinct(state, state__) %>% count(state__)
variable.pairs %>% arrange(desc(n))
## Source: local data frame [70 x 2]
##
## state__ n
## (dbl) (int)
## 1 11 2
## 2 20 2
## 3 24 2
## 4 26 2
## 5 35 2
## 6 39 2
## 7 46 2
## 8 72 2
## 9 93 2
## 10 1 1
## .. ... ...
Indices 11, 20, 24, 26, 35, 39, 46, 72 and 93 each have more than one label. Here you can see this more clearly:
#The nine indices with more than one label
dup.indices <- c(11, 20, 24, 26, 35, 39, 46, 72, 93)
x <- do.call(rbind, lapply(dup.indices, function(i)
    data %>% filter(state__ == i) %>% select(state__, state) %>% count(state)))
#Each index contributes exactly two label rows
y <- rep(seq_along(dup.indices), each = 2)
xy <- cbind(y, x)
colnames(xy)[c(1,3)] <- c("pair number", "number of observations")
#You can print(xy) if you follow the analysis in the console; the knitr package is not strictly required
kable(xy, align = "c")
| pair number | state | number of observations |
|---|---|---|
| 1 | DC | 437 |
| 1 | MD | 13 |
| 2 | KS | 53440 |
| 2 | ST | 1 |
| 3 | MD | 8172 |
| 3 | OH | 1 |
| 4 | MA | 1 |
| 4 | MI | 17910 |
| 5 | NJ | 1 |
| 5 | NM | 7129 |
| 6 | ND | 2 |
| 6 | OH | 24921 |
| 7 | SC | 1 |
| 7 | SD | 21727 |
| 8 | AK | 1 |
| 8 | PR | 3015 |
| 9 | SL | 7 |
| 9 | XX | 2 |
As you may notice, the table shows how many observations belong to each of the nine cases of cross-referencing between state labels and state indices. So there are two problems:
1. Some state labels were entered with the wrong state index.
2. In addition, two state labels (ST and XX) are not associated with any valid state index.
The approach to solving this takes two subsets of level-label pairs: one pivoting on the unique values of the levels and the other on the unique values of the labels. Subtracting the pairs in the smaller subset (state.index) from the larger subset (state.label) yields the remaining pairs (rp), which will be removed from the data. In the remaining cases, the state index of the label with fewer observations is corrected.
state.index <- data %>% select(state__, state) %>% distinct(state__)
state.label <- data %>% select(state, state__) %>% distinct(state)
rp <- state.label[!(state.label$state %in% state.index$state), ]
#You can print(rp) if you follow the analysis in the console; the knitr package is not strictly required
kable(rp, align = "c")
| state | state__ |
|---|---|
| ST | 20 |
| XX | 93 |
Below is a summary of the changes:
case 1: the MD observations change index (13 observations vs. DC with 437 observations)
case 2: the label ST is removed
case 3: the OH observations change index (1 observation vs. MD with 8172 observations)
case 4: the MA observations change index (1 observation vs. MI with 17910 observations)
case 5: the NJ observations change index (1 observation vs. NM with 7129 observations)
case 6: the ND observations change index (2 observations vs. OH with 24921 observations)
case 7: the SC observations change index (1 observation vs. SD with 21727 observations)
case 8: the AK observations change index (1 observation vs. PR with 3015 observations)
case 9: the label XX is removed
#Remove ST and XX: 3 observations dropped
data <- data %>% filter( state != "XX" & state != "ST")
#Fix the state index
data[data$state__ == 11 & data$state == "MD", 1] <- 24
data[data$state__ == 24 & data$state == "OH", 1] <- 39
data[data$state__ == 26 & data$state == "MA", 1] <- 25
data[data$state__ == 35 & data$state == "NJ", 1] <- 34
data[data$state__ == 39 & data$state == "ND", 1] <- 38
data[data$state__ == 46 & data$state == "SC", 1] <- 45
data[data$state__ == 72 & data$state == "AK", 1] <- 2
Finally, the unique values of the levels now match the unique values of the labels, and each state index has exactly one label.
#Select distinct state index numbers; there are 70 distinct index numbers
nrow(data %>% select(state__) %>% distinct(state__))
## [1] 70
#Select distinct state name labels; now there are also 70 distinct name labels
nrow(data %>% select(state) %>% distinct(state))
## [1] 70
variable.pairs <- data %>% distinct(state, state__) %>% count(state__)
variable.pairs %>% arrange(desc(n))
## Source: local data frame [70 x 2]
##
## state__ n
## (dbl) (int)
## 1 1 1
## 2 2 1
## 3 4 1
## 4 5 1
## 5 6 1
## 6 8 1
## 7 9 1
## 8 10 1
## 9 11 1
## 10 12 1
## .. ... ...
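This invariant can also be asserted programmatically, so any regression in the cleaning step is caught immediately. A minimal sketch, assuming the cleaned data from above:
#Invariant check: every state index must now map to exactly one label
stopifnot(all((data %>% distinct(state__, state) %>% count(state__))$n == 1))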
“Across the United States, which types of events are most harmful with respect to population health?”
Starting with the first question, the relevant variables should be those related to the clause “…are most harmful with respect to population health?”. The variables fatalities and injuries clearly satisfy this criterion.
Therefore, the variable data.health contains the following variables from the original data:
data.health <- data %>% select(state,
bgn_date,
evtype,
fatalities,
injuries,
end_date,
zonenames,
longitude,
latitude)
“Across the United States, which types of events have the greatest economic consequences?”
For this question, the relevant variables are propdmg (property damage) and cropdmg (crop damage), together with their respective scale variables: propdmgexp and cropdmgexp. The damage variables propdmg and cropdmg need some treatment. Is each unit one dollar? That depends on what the accompanying scale variable says.
data %>% select(propdmgexp) %>% distinct(propdmgexp)
## Source: local data frame [19 x 1]
##
## propdmgexp
## (fctr)
## 1 K
## 2 M
## 3
## 4 B
## 5 m
## 6 +
## 7 0
## 8 5
## 9 6
## 10 ?
## 11 4
## 12 2
## 13 3
## 14 h
## 15 7
## 16 H
## 17 -
## 18 1
## 19 8
data %>% select(cropdmgexp) %>% distinct(cropdmgexp)
## Source: local data frame [9 x 1]
##
## cropdmgexp
## (fctr)
## 1
## 2 M
## 3 K
## 4 m
## 5 B
## 6 ?
## 7 0
## 8 k
## 9 2
To avoid misunderstandings and biases, only the following scalars are used: “0” (plain dollars), “K” (thousands), “M” (millions) and “B” (billions), all in uppercase.
The rest of the observations are omitted.
data.economic <- data %>% select(state,
bgn_date,
evtype,
propdmg,
propdmgexp,
cropdmg,
cropdmgexp,
end_date,
zonenames,
longitude,
latitude)
data.economic <- data.economic %>% filter(propdmgexp == "0" | propdmgexp == "K" | propdmgexp == "M" | propdmgexp == "B") %>% filter(cropdmgexp == "0" | cropdmgexp == "K" | cropdmgexp == "M" | cropdmgexp == "B")
#Property damage: scale propdmg according to propdmgexp
data.economic[data.economic$propdmgexp== "0", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "0") %>% select(propdmg) %>% transmute(prop.damage = propdmg)
data.economic[data.economic$propdmgexp== "K", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "K") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000)
data.economic[data.economic$propdmgexp== "M", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "M") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000000)
data.economic[data.economic$propdmgexp== "B", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "B") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000000000)
#Crop damage: scale cropdmg according to cropdmgexp
data.economic[data.economic$cropdmgexp== "0", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "0") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg)
data.economic[data.economic$cropdmgexp== "K", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "K") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000)
data.economic[data.economic$cropdmgexp== "M", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "M") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000000)
data.economic[data.economic$cropdmgexp== "B", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "B") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000000000)
#Normalize to millions of US dollars
data.economic <- mutate(data.economic, prop.damage = propdmg/(1000000))
data.economic <- mutate(data.economic, crop.damage = cropdmg/(1000000))
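As an aside, the per-scalar assignments above can be written more compactly with a named multiplier lookup. A minimal sketch that should produce the same result, assuming it is applied to the filtered but not yet scaled data.economic:
#Map each retained scalar to its dollar multiplier ("0" is treated as plain dollars)
mult <- c("0" = 1, "K" = 1e3, "M" = 1e6, "B" = 1e9)
#Scale and normalize to millions of US dollars in one step
data.economic <- data.economic %>%
    mutate(prop.damage = propdmg * mult[as.character(propdmgexp)] / 1000000,
           crop.damage = cropdmg * mult[as.character(cropdmgexp)] / 1000000)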
Therefore, the variable data.economic contains the following variables from the original data: state, bgn_date, evtype, propdmg, propdmgexp, cropdmg, cropdmgexp, end_date, zonenames, longitude and latitude, together with the derived variables prop.damage and crop.damage (damage in millions of US dollars).
“Across the United States, which types of events are most harmful with respect to population health?”
In order to answer this question, it is convenient to start by summing all the fatalities and injuries per type of event. The harmful.events variable is the result of this operation.
harmful.events <- data.health %>% group_by(evtype) %>% summarize(num_fatalities = sum(fatalities), num_injuries = sum(injuries))
Now, using the harmful.events variable, we construct a ranking of the 10 most dangerous events in terms of fatalities, and another in terms of injuries.
rank.fatalities <- harmful.events %>% select(evtype, num_fatalities) %>% arrange(desc(num_fatalities)) %>% head(n = 10)
rank.injuries <- harmful.events %>% select(evtype, num_injuries) %>% arrange(desc(num_injuries)) %>% head(n = 10)
kable(rank.fatalities, align = "c")
| evtype | num_fatalities |
|---|---|
| TORNADO | 5633 |
| EXCESSIVE HEAT | 1903 |
| FLASH FLOOD | 978 |
| HEAT | 937 |
| LIGHTNING | 816 |
| TSTM WIND | 504 |
| FLOOD | 470 |
| RIP CURRENT | 368 |
| HIGH WIND | 248 |
| AVALANCHE | 224 |
kable(rank.injuries, align = "c")
| evtype | num_injuries |
|---|---|
| TORNADO | 91346 |
| TSTM WIND | 6957 |
| FLOOD | 6789 |
| EXCESSIVE HEAT | 6525 |
| LIGHTNING | 5230 |
| HEAT | 2100 |
| ICE STORM | 1975 |
| FLASH FLOOD | 1777 |
| THUNDERSTORM WIND | 1488 |
| HAIL | 1361 |
Tornado heads both rankings robustly, with 5633 fatalities and 91346 injuries over the history of the data.
data.plot1 <- data.health %>% select(bgn_date, evtype, fatalities, injuries) %>% filter(evtype == "TORNADO") %>% mutate( year = as.Date(bgn_date, format = "%m/%d/%Y"))
data.plot1 <- data.plot1 %>% select(year, fatalities, injuries)
data.plot1$year <- year(data.plot1$year)
data.plot1 <- data.plot1 %>% group_by(year) %>% summarize( num_fatalities = sum(fatalities), num_injuries = sum(injuries))
data.plot1 <- melt(data.plot1, id.vars = "year", measure.vars = c("num_fatalities", "num_injuries"))
It might be useful to plot a time series of the tornado events.
p <- ggplot(data.plot1, aes(year, value, colour = variable))
p + geom_line() + ylab("Number of cases")
As you can see in the plot, injuries and fatalities apparently behave similarly over time, but shortly before 1990 a period of softer peaks in injuries began. The natural questions are: is this pattern due to some cyclical behavior of the weather, to public policy or a new approach to confronting this type of event, or perhaps to where the events occurred?
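That impression can be checked numerically by comparing the periods before and after 1990. A minimal sketch, assuming data.plot1 in its long format (columns year, variable and value):
#Compare mean and peak yearly injuries before and after 1990
data.plot1 %>%
    filter(variable == "num_injuries") %>%
    group_by(period = ifelse(year < 1990, "before 1990", "1990 onward")) %>%
    summarize(mean_injuries = mean(value), peak_injuries = max(value))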
“Across the United States, which types of events have the greatest economic consequences?”
We start by building a ranking of the highest average property damage by event type, and then repeat it for average crop damage.
economic.stats <- data.economic %>% select(evtype, prop.damage, crop.damage) %>% group_by(evtype) %>% summarize(
avg_prop_dmg = mean(prop.damage, na.rm = TRUE),
avg_crop_dmg = mean(crop.damage, na.rm = TRUE),
sd_prop_dmg = sd(prop.damage, na.rm = TRUE),
sd_crop_dmg = sd(crop.damage, na.rm = TRUE))
prop.dmg <- economic.stats %>% select(evtype, avg_prop_dmg, sd_prop_dmg) %>% arrange(desc(avg_prop_dmg)) %>% head(n = 10)
kable(prop.dmg, align = 'c')
| evtype | avg_prop_dmg | sd_prop_dmg |
|---|---|---|
| TORNADOES, TSTM WIND, HAIL | 1600.00000 | NaN |
| HURRICANE OPAL | 1074.00000 | 1450.98311 |
| HURRICANE/TYPHOON | 810.31197 | 1688.36816 |
| RIVER FLOOD | 317.47719 | 1248.73577 |
| HURRICANE | 142.88762 | 430.78119 |
| HURRICANE OPAL/HIGH WINDS | 100.00000 | NaN |
| HURRICANE ERIN | 85.33333 | 125.85839 |
| WINTER STORM HIGH WINDS | 60.00000 | NaN |
| River Flooding | 35.33000 | 37.70393 |
| STORM SURGE/TIDE | 34.12237 | 345.32862 |
crop.dmg <- economic.stats %>% select(evtype, avg_crop_dmg, sd_crop_dmg) %>% arrange(desc(avg_crop_dmg)) %>% head(n = 10)
kable(crop.dmg, align = 'c')
| evtype | avg_crop_dmg | sd_crop_dmg |
|---|---|---|
| RIVER FLOOD | 314.295875 | 1249.525051 |
| HURRICANE/TYPHOON | 79.026449 | 271.348245 |
| HURRICANE | 39.542794 | 96.357811 |
| SEVERE THUNDERSTORMS | 17.000000 | NaN |
| HEAVY RAINS | 15.125000 | 23.346574 |
| SEVERE THUNDERSTORM WINDS | 14.500000 | 16.263456 |
| HURRICANE OPAL/HIGH WINDS | 10.000000 | NaN |
| River Flooding | 9.340000 | 15.159720 |
| TROPICAL STORM JERRY | 8.000000 | 9.899495 |
| FLOOD/FLASH FLOOD | 6.788143 | 15.640531 |
Visualizing the average damage of each type, property and crop, can help show where the data are most concentrated (reusing the economic.stats computed above).
economic.plot <- ggplot(economic.stats)
economic.plot + geom_point(aes(avg_prop_dmg, sd_prop_dmg), color = "darkred", alpha = .5) + geom_point(aes(avg_crop_dmg, sd_crop_dmg), color = "steelblue", alpha = .5) + xlab("Average Damage") + ylab("Standard Deviation Damage") + xlim(0, 300)
## Warning: Removed 43 rows containing missing values (geom_point).
## Warning: Removed 41 rows containing missing values (geom_point).
As can be seen, the vast majority of event types are concentrated below an average damage of 200 million US dollars.
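That concentration can also be quantified directly. A minimal sketch, assuming economic.stats as computed above (damage in millions of US dollars):
#Share of event types whose average damage lies below 200 million US dollars
mean(economic.stats$avg_prop_dmg < 200, na.rm = TRUE)
mean(economic.stats$avg_crop_dmg < 200, na.rm = TRUE)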
Finally, we build one ranking of the total property damage and another of the total crop damage per event type.
sum.prop.dmg <- data.economic %>% select(evtype, prop.damage) %>% group_by(evtype) %>% summarize( total_prop_damage = sum(prop.damage, na.rm = TRUE)) %>% arrange(desc(total_prop_damage)) %>% head( n = 10)
sum.crop.dmg <- data.economic %>% select(evtype, crop.damage) %>% group_by(evtype) %>% summarize( total_crop_damage = sum(crop.damage, na.rm = TRUE)) %>% arrange(desc(total_crop_damage)) %>% head( n = 10)
kable(sum.prop.dmg, align = 'c')
| evtype | total_prop_damage |
|---|---|
| FLOOD | 132836.489 |
| HURRICANE/TYPHOON | 26740.295 |
| TORNADO | 16166.947 |
| HURRICANE | 9716.358 |
| HAIL | 7991.593 |
| FLASH FLOOD | 7327.856 |
| RIVER FLOOD | 5079.635 |
| STORM SURGE/TIDE | 4640.643 |
| WILDFIRE | 3498.365 |
| THUNDERSTORM WIND | 3398.942 |
kable(sum.crop.dmg, align = 'c')
| evtype | total_crop_damage |
|---|---|
| FLOOD | 5170.9555 |
| RIVER FLOOD | 5028.7340 |
| ICE STORM | 5022.1135 |
| HURRICANE | 2688.9100 |
| HURRICANE/TYPHOON | 2607.8728 |
| HAIL | 2028.3909 |
| DROUGHT | 1652.6960 |
| FLASH FLOOD | 1387.5991 |
| FROST/FREEZE | 931.8010 |
| HIGH WIND | 631.9243 |
In both rankings, flood is the most devastating event in economic terms.
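A natural follow-up is to combine both kinds of damage into a single total per event type. A minimal sketch, assuming data.economic as built above (damage in millions of US dollars):
#Combined property plus crop damage totals per event type, top 5
data.economic %>%
    group_by(evtype) %>%
    summarize(total_damage = sum(prop.damage, na.rm = TRUE) + sum(crop.damage, na.rm = TRUE)) %>%
    arrange(desc(total_damage)) %>%
    head(n = 5)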