Synopsis

This report make an exploratory analysis on severe weather events which affect on the United States. Specially the purpose is generate some insights over the consequences of this events on economics and health. The data on which based this report are from the NOAA Storm Database.

The two fundamental questions are:

1. Across the United States, which types of events are most harmful with respect to population health?

2. Across the United States, which types of events have the greatest economic consequences?

Glimpse possible way to answer these questions, it is a start for more elaborate studies and ultimately aim to provide support to decision makers in public policy

Data Processing

2.1 Regarding the reproducibility of the analysis

The following report has the capability to reproduce all the analysis that has in his content. For this reason the document has incorporate all the codes that use for differents tasks along its extension. If you want to read this report in a reproducibility manner, you can run the next code that download and load the data for you into the variable “data”.

Note: The code create the folder “stormreport” in your working directory and stored the data there

if(!file.exists("stormreport")){dir.create("stormreport")}
setwd("./stormreport")
data.url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(data.url, destfile = basename(data.url))
data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")

During the report are used the following R packages:

library(dplyr)
library(ggplot2)

For any doubt regarding the variables of the data, you can consulting the following sources:

2.2 Data inspection and pre-processing

The goal of this section is identify which variables of the data are relevant to answer the two questions that are described in the synopsis. Therefore the data will be more compact and thus the noise of many elements are avoided.

#More easy to manage data with tbl_df() of dplyr package
data <- tbl_df(data)
#To reduce time-consuming in writting
colnames(data) <- tolower(colnames(data))
#A look to the variables of the data
colnames(data)
##  [1] "state__"    "bgn_date"   "bgn_time"   "time_zone"  "county"    
##  [6] "countyname" "state"      "evtype"     "bgn_range"  "bgn_azi"   
## [11] "bgn_locati" "end_date"   "end_time"   "county_end" "countyendn"
## [16] "end_range"  "end_azi"    "end_locati" "length"     "width"     
## [21] "f"          "mag"        "fatalities" "injuries"   "propdmg"   
## [26] "propdmgexp" "cropdmg"    "cropdmgexp" "wfo"        "stateoffic"
## [31] "zonenames"  "latitude"   "longitude"  "latitude_e" "longitude_"
## [36] "remarks"    "refnum"

Before to see the case of each questions, it is convenient clarify which variables are need in both questions. This variables has the role of identifier variables. An undoubtly relevant variable to achieve a proper analysis in both questions is evtype, indeed, is appealed explicitly in the questions. The rest of the variables that are useful: state, bgn_date, end_date, zonenames, longitude and latitude.

Between this identifier variables, state, has another relative variable in the data: “state__“. That it is because”state__" is the index (or level) and state is the name of the state (or label). Given the relationship of the variables, each level must to have only one label otherwise there will be cross-referenced problems.

#Select distnct state index number, there are 70 distinct index number
state.unique.levels <- nrow(data %>% select(state__) %>% distinct(state__))
#Select distinct state name labels, there are 72 distinct name labels
state.unique.labels <- nrow(data %>% select(state) %>% distinct(state))

state.label.level <- cbind(state.unique.levels, state.unique.labels)
#you can print(state.label.level) if you follow the analysis in the console, it's not necessary use kniter() package
kable(state.label.level, align = "c")
state.unique.levels state.unique.labels
70 72

Like you see, differ 70 versus 72 unique observations. Selecting the distinct values between the two variables might help to identify the problem.

variable.pairs <- data %>% distinct(state, state__) %>% count(state__)
variable.pairs %>% arrange(desc(n))
## Source: local data frame [70 x 2]
## 
##    state__     n
##      (dbl) (int)
## 1       11     2
## 2       20     2
## 3       24     2
## 4       26     2
## 5       35     2
## 6       39     2
## 7       46     2
## 8       72     2
## 9       93     2
## 10       1     1
## ..     ...   ...

In the indeces 11, 20, 24, 26, 35, 39, 46, 72, 93 exist more than one reference. Here you can see more clearly:

x <- rbind(
    data %>% filter(state__ == 11) %>% select(state__, state) %>% count(state),
    data %>% filter(state__ == 20) %>% select(state__, state) %>% count(state),
    data %>% filter(state__ == 24) %>% select(state__, state) %>% count(state),
    data %>% filter(state__ == 26) %>% select(state__, state) %>% count(state),
    data %>% filter(state__ == 35) %>% select(state__, state) %>% count(state),
    data %>% filter(state__ == 39) %>% select(state__, state) %>% count(state),
    data %>% filter(state__ == 46) %>% select(state__, state) %>% count(state),
    data %>% filter(state__ == 72) %>% select(state__, state) %>% count(state),
    data %>% filter(state__ == 93) %>% select(state__, state) %>% count(state)
)

y <- rep(1:9, each = 2)
xy <- cbind(y, x)
colnames(xy)[c(1,3)] <- c("pair number", "number of observations")
#you can print(xy) if you follow the analysis in the console, it's not necessary use kniter() package
kable(xy, align = "c")
pair number state number of observations
1 DC 437
1 MD 13
2 KS 53440
2 ST 1
3 MD 8172
3 OH 1
4 MA 1
4 MI 17910
5 NJ 1
5 NM 7129
6 ND 2
6 OH 24921
7 SC 1
7 SD 21727
8 AK 1
8 PR 3015
9 SL 7
9 XX 2

As you may notice, the table shows how many observations which has each of the nine cases of cross-reference between state labels and the state index. So, there are two problems:

  1. Some state labels were introduced with the wrong state index

  2. Despite the above problem, there are two state labels that aren’t associated with a state index

The approach to solving this will take two subsets of pairs between level - label , one pivot to the unique values of the levels and the other with the unique values of the labels. Subtract then couples who are in the smaller subset (state.index) of the largest subset (state.label) . And based on this, the remaining pairs (rp) will be removed from the data, and the remaining cases are change the index number of the states with minor observations.

state.index <- data %>%  select(state__, state) %>%  distinct(state__) 
state.label <- data %>% select(state, state__) %>% distinct(state)
rp <- state.label[!(state.label$state %in% state.index$state), ]
#you can print(rp) if you follow the analysis in the console, it's not necessary use kniter() package
kable(rp, align = "c")
state state__
ST 20
XX 93

Below the summary of the changes:

#ST and XX out, 3 observations out
data <- data %>% filter( state != "XX" & state != "ST")

#Fix the state index
data[data$state__ == 11 & data$state == "MD", 1] <- 24
data[data$state__ == 24 & data$state == "OH", 1] <- 39
data[data$state__ == 26 & data$state == "MA", 1] <- 25
data[data$state__ == 35 & data$state == "NJ", 1] <- 34
data[data$state__ == 39 & data$state == "ND", 1] <- 38
data[data$state__ == 46 & data$state == "SC", 1] <- 45
data[data$state__ == 72 & data$state == "AK", 1] <- 2

Finally, the unique values of the levels are matching with the unique values of the labels. Also each state index has only one combination.

#Select distnct state index number, there are 70 distinct index number
nrow(data %>% select(state__) %>% distinct(state__))
## [1] 70
#Select distinct state name labels, there are 72 distinct name labels
nrow(data %>% select(state) %>% distinct(state))
## [1] 70
variable.pairs <- data %>% distinct(state, state__) %>% count(state__)
variable.pairs %>% arrange(desc(n))
## Source: local data frame [70 x 2]
## 
##    state__     n
##      (dbl) (int)
## 1        1     1
## 2        2     1
## 3        4     1
## 4        5     1
## 5        6     1
## 6        8     1
## 7        9     1
## 8       10     1
## 9       11     1
## 10      12     1
## ..     ...   ...

2.2.1 Data relevant for the first question: data.health

“Across the United States, which types of events are most harmful with respect to population health?”

Starting off with the first question, the relevant variables should be those that have a relationship with the part of the question: “…are most harmful with respect to population health?”. For example, the variables fatalities and injuries are relevant with the clause.

Therefore, variable data.health contained the following variables from the original data:

data.health <- data %>% select(state,
                                  bgn_date,
                                  evtype,
                                  fatalities,
                                  injuries,
                                  end_date,
                                  zonenames,
                                  longitude,
                                  latitude)

2.2.2 Data relevant for the second question: data.economic

“Across the United States, which types of events have the greatest economic consequences?”

In this questions, the relevants variables are propdmg (property damage) and cropdmg (crop damage) with his respectives scale variables: propdmgexp and cropdmgexp. So, is necessary give a treatment to the damage variable prodmg and cropdmg. Each unit is a dollar? That depends on who says the respective scale variable that accompanies each one.

data %>% select(propdmgexp) %>% distinct(propdmgexp)
## Source: local data frame [19 x 1]
## 
##    propdmgexp
##        (fctr)
## 1           K
## 2           M
## 3            
## 4           B
## 5           m
## 6           +
## 7           0
## 8           5
## 9           6
## 10          ?
## 11          4
## 12          2
## 13          3
## 14          h
## 15          7
## 16          H
## 17          -
## 18          1
## 19          8
data %>% select(cropdmgexp) %>% distinct(cropdmgexp)
## Source: local data frame [9 x 1]
## 
##   cropdmgexp
##       (fctr)
## 1           
## 2          M
## 3          K
## 4          m
## 5          B
## 6          ?
## 7          0
## 8          k
## 9          2

To avoid misunderstandings and biases , only the following scalars are used (in uppercase):

The rest of the observations are omitted.

data.economic <- data %>% select(state,
                                  bgn_date,
                                  evtype,
                                  propdmg,
                                  propdmgexp,
                                  cropdmg,
                                  cropdmgexp,
                                  end_date,
                                  zonenames,
                                  longitude,
                                  latitude) 

data.economic <- data.economic %>% filter(propdmgexp == "0" | propdmgexp == "K" | propdmgexp == "M" | propdmgexp == "B") %>% filter(cropdmgexp == "0" | cropdmgexp == "K" | cropdmgexp == "M" | cropdmgexp == "B")

#Prop

data.economic[data.economic$propdmgexp== "0", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "0") %>% select(propdmg) %>% transmute(prop.damage = propdmg)

data.economic[data.economic$propdmgexp== "K", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "K") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000)

data.economic[data.economic$propdmgexp== "M", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "M") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000000) 

data.economic[data.economic$propdmgexp== "B", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "B") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000000000) 

#Crop

data.economic[data.economic$cropdmgexp== "0", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "0") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg) 

data.economic[data.economic$cropdmgexp== "K", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "K") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000) 

data.economic[data.economic$cropdmgexp== "M", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "M") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000000) 

data.economic[data.economic$cropdmgexp== "B", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "B") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000000000) 


#Normalized in 1,000,000 US dollards
data.economic <- mutate(data.economic, prop.damage = propdmg/(1000000))
data.economic <- mutate(data.economic, crop.damage = cropdmg/(1000000))

Therefore, the variable data.economic contained the following variables from the original data:

Results

3.1 Consequences on health

“Across the United States, which types of events are most harmful with respect to population health?”

In order to answer this question, it might be convenient to starting by adding all the fatalities and injuries per type of event. The harmful.events variable is the result of this operation.

harmful.events <- data.health %>% group_by(evtype) %>% summarize(num_fatalities = sum(fatalities), num_injuries = sum(injuries))

Now with the variable harmful.events will be constructing a ranking of the 10 most dangerous events in terms of fatalities and other in term of injuries.

rank.fatalities <- harmful.events %>% select(evtype, num_fatalities) %>% arrange(desc(num_fatalities)) %>% head(n = 10)
rank.injuries <- harmful.events %>% select(evtype, num_injuries) %>% arrange(desc(num_injuries)) %>% head(n = 10)
kable(rank.fatalities, align = "c")
evtype num_fatalities
TORNADO 5633
EXCESSIVE HEAT 1903
FLASH FLOOD 978
HEAT 937
LIGHTNING 816
TSTM WIND 504
FLOOD 470
RIP CURRENT 368
HIGH WIND 248
AVALANCHE 224
kable(rank.injuries, align = "c")
evtype num_injuries
TORNADO 91346
TSTM WIND 6957
FLOOD 6789
EXCESSIVE HEAT 6525
LIGHTNING 5230
HEAT 2100
ICE STORM 1975
FLASH FLOOD 1777
THUNDERSTORM WIND 1488
HAIL 1361

Tornado heads the rank robustly in the two dimensions, with 5633 fatalities and with 91346 along the history of the data.

data.plot1 <- data.health %>% select(bgn_date, evtype, fatalities, injuries) %>% filter(evtype == "TORNADO") %>% mutate( year = as.Date(bgn_date, format = "%m/%d/%Y"))
data.plot1 <- data.plot1 %>% select(year, fatalities, injuries)
data.plot1$year <- year(data.plot1$year)
data.plot1 <- data.plot1 %>% group_by(year) %>% summarize( num_fatalities = sum(fatalities), num_injuries = sum(injuries))
data.plot1 <- melt(data.plot1, id = "year", measured = c("num_fatalities", "num_injuries"))

It might be useful to plot a time series of the tornado events.

p <- ggplot(data.plot1, aes(year, value, colour = variable))
p + geom_line() + ylab("Number of cases") 

Like you can see in the plot, the injuries and fatalities apparenty behave similar across the time, but just a little time before the 1990, started a period of more soft peaks in term of injuries cases. The logic questions are: this pattern is due some cyclical pattern of the weather? or some public policy or approach to confront this type of events? or maybe the location of that event occured?

3.2 Consequences on economics

“Across the United States, which types of events have the greatest economic consequences?”

It start building a ranking of the highest averages in property damage by event type and then repeating for damage to crops averages.

economic.stats <- data.economic %>% select(evtype, prop.damage, crop.damage) %>% group_by(evtype) %>% summarize(
avg_prop_dmg = mean(prop.damage, na.rm = TRUE),
avg_crop_dmg = mean(crop.damage, na.rm = TRUE),
sd_prop_dmg = sd(prop.damage, na.rm = TRUE),
sd_crop_dmg = sd(crop.damage, na.rm = TRUE))

prop.dmg <- economic.stats %>% select(evtype, avg_prop_dmg, sd_prop_dmg) %>% arrange(desc(avg_prop_dmg)) %>% head(n = 10)

kable(prop.dmg, align = 'c')
evtype avg_prop_dmg sd_prop_dmg
TORNADOES, TSTM WIND, HAIL 1600.00000 NaN
HURRICANE OPAL 1074.00000 1450.98311
HURRICANE/TYPHOON 810.31197 1688.36816
RIVER FLOOD 317.47719 1248.73577
HURRICANE 142.88762 430.78119
HURRICANE OPAL/HIGH WINDS 100.00000 NaN
HURRICANE ERIN 85.33333 125.85839
WINTER STORM HIGH WINDS 60.00000 NaN
River Flooding 35.33000 37.70393
STORM SURGE/TIDE 34.12237 345.32862
crop.dmg <- economic.stats %>% select(evtype, avg_crop_dmg, sd_crop_dmg) %>% arrange(desc(avg_crop_dmg)) %>% head(n = 10)

kable(crop.dmg, align = 'c')
evtype avg_crop_dmg sd_crop_dmg
RIVER FLOOD 314.295875 1249.525051
HURRICANE/TYPHOON 79.026449 271.348245
HURRICANE 39.542794 96.357811
SEVERE THUNDERSTORMS 17.000000 NaN
HEAVY RAINS 15.125000 23.346574
SEVERE THUNDERSTORM WINDS 14.500000 16.263456
HURRICANE OPAL/HIGH WINDS 10.000000 NaN
River Flooding 9.340000 15.159720
TROPICAL STORM JERRY 8.000000 9.899495
FLOOD/FLASH FLOOD 6.788143 15.640531

Visualize the average damage for each type: property and crop, can be help to know in which part the data are more concentrated.

economic.stats <- data.economic %>% select(evtype, prop.damage, crop.damage) %>% group_by(evtype) %>% summarize(
avg_prop_dmg = mean(prop.damage, na.rm = TRUE),
avg_crop_dmg = mean(crop.damage, na.rm = TRUE),
sd_prop_dmg = sd(prop.damage, na.rm = TRUE),
sd_crop_dmg = sd(crop.damage, na.rm = TRUE))


economic.plot <- ggplot(economic.stats)

economic.plot + geom_point(aes(avg_prop_dmg, sd_prop_dmg), color = "darkred", alpha = .5) + geom_point(aes(avg_crop_dmg, sd_crop_dmg), color = "steelblue", alpha = .5) + xlab("Average Damage") + ylab("Standard Deviation Damage") + xlim(0, 300)
## Warning: Removed 43 rows containing missing values (geom_point).
## Warning: Removed 41 rows containing missing values (geom_point).

As can be seen, the vast majority of event type are concentrated below the 200 of average damage.

Finally, building a rank for the total sum of damage property and other for the total sum of damage crop per event type.

sum.prop.dmg <- data.economic %>% select(evtype, prop.damage) %>% group_by(evtype) %>% summarize( total_prop_damage = sum(prop.damage, na.rm = TRUE)) %>% arrange(desc(total_prop_damage)) %>% head( n = 10)

sum.crop.dmg <- data.economic %>% select(evtype, crop.damage) %>% group_by(evtype) %>% summarize( total_crop_damage = sum(crop.damage, na.rm = TRUE)) %>% arrange(desc(total_crop_damage)) %>% head( n = 10)


kable(sum.prop.dmg, align = 'c')
evtype total_prop_damage
FLOOD 132836.489
HURRICANE/TYPHOON 26740.295
TORNADO 16166.947
HURRICANE 9716.358
HAIL 7991.593
FLASH FLOOD 7327.856
RIVER FLOOD 5079.635
STORM SURGE/TIDE 4640.643
WILDFIRE 3498.365
THUNDERSTORM WIND 3398.942
kable(sum.crop.dmg, align = 'c')
evtype total_crop_damage
FLOOD 5170.9555
RIVER FLOOD 5028.7340
ICE STORM 5022.1135
HURRICANE 2688.9100
HURRICANE/TYPHOON 2607.8728
HAIL 2028.3909
DROUGHT 1652.6960
FLASH FLOOD 1387.5991
FROST/FREEZE 931.8010
HIGH WIND 631.9243

In the two ranking, the event flood is the most devastating in economics terms.