This report presents an exploratory analysis of severe weather events affecting the United States. Specifically, its purpose is to generate insights into the economic and public health consequences of these events. The data on which this report is based come from the NOAA Storm Database.
The two fundamental questions are:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
The report offers a glimpse of possible ways to answer these questions; it is a starting point for more elaborate studies and ultimately aims to provide support to decision makers in public policy.
This report is designed so that all of the analysis it contains can be reproduced. For this reason, the document incorporates all the code used for the different tasks throughout its extension. If you want to read this report in a reproducible manner, you can run the following code, which downloads the data and loads it into the variable “data”.
Note: the code creates the folder “stormreport” in your working directory and stores the data there.
if(!file.exists("stormreport")){dir.create("stormreport")}
setwd("./stormreport")
data.url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(data.url, destfile = basename(data.url), mode = "wb")
#read.csv() decompresses the .bz2 archive transparently
data <- read.csv(basename(data.url))
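If you plan to run this code more than once, a small guard avoids downloading the archive again. A minimal sketch, assuming the same destination file name as above:
#Re-use a previously downloaded copy if one is already present
if(!file.exists(basename(data.url))){
    download.file(data.url, destfile = basename(data.url), mode = "wb")
}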
The report uses the following R packages:
library(dplyr)
library(ggplot2)
library(knitr)      #kable()
library(lubridate)  #year()
library(reshape2)   #melt()
For any questions regarding the variables in the data, you can consult the following sources:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The goal of this section is to identify which variables in the data are relevant to answering the two questions described in the synopsis. This makes the data more compact and avoids the noise of many irrelevant elements.
#Easier to manage the data with tbl_df() from the dplyr package
data <- tbl_df(data)
#Lowercase the column names to reduce typing
colnames(data) <- tolower(colnames(data))
#A look at the variables in the data
colnames(data)
## [1] "state__" "bgn_date" "bgn_time" "time_zone" "county"
## [6] "countyname" "state" "evtype" "bgn_range" "bgn_azi"
## [11] "bgn_locati" "end_date" "end_time" "county_end" "countyendn"
## [16] "end_range" "end_azi" "end_locati" "length" "width"
## [21] "f" "mag" "fatalities" "injuries" "propdmg"
## [26] "propdmgexp" "cropdmg" "cropdmgexp" "wfo" "stateoffic"
## [31] "zonenames" "latitude" "longitude" "latitude_e" "longitude_"
## [36] "remarks" "refnum"
Before addressing each question separately, it is convenient to clarify which variables are needed in both. These play the role of identifier variables. An undoubtedly relevant variable for both questions is evtype; indeed, it is referenced explicitly in the questions themselves. The rest of the useful variables are: state, bgn_date, end_date, zonenames, longitude and latitude.
Among these identifier variables, state has a related variable in the data: state__. This is because state__ is the index (or level) and state is the name of the state (or label). Given this relationship, each level must have exactly one label; otherwise there will be cross-referencing problems.
#Select distinct state index numbers; there are 70 distinct index numbers
state.unique.levels <- nrow(data %>% select(state__) %>% distinct(state__))
#Select distinct state name labels; there are 72 distinct name labels
state.unique.labels <- nrow(data %>% select(state) %>% distinct(state))
state.label.level <- cbind(state.unique.levels, state.unique.labels)
#You can print(state.label.level) if you follow the analysis in the console; the knitr package is not strictly required
kable(state.label.level, align = "c")
| state.unique.levels | state.unique.labels |
|---|---|
| 70 | 72 |
As you can see, the counts differ: 70 unique index numbers versus 72 unique labels. Selecting the distinct pairs of the two variables helps identify the problem.
variable.pairs <- data %>% distinct(state, state__) %>% count(state__)
variable.pairs %>% arrange(desc(n))
## Source: local data frame [70 x 2]
##
## state__ n
## (dbl) (int)
## 1 11 2
## 2 20 2
## 3 24 2
## 4 26 2
## 5 35 2
## 6 39 2
## 7 46 2
## 8 72 2
## 9 93 2
## 10 1 1
## .. ... ...
Indices 11, 20, 24, 26, 35, 39, 46, 72 and 93 each have more than one label. Here you can see this more clearly:
#The nine indices with more than one label
dup.indices <- c(11, 20, 24, 26, 35, 39, 46, 72, 93)
x <- do.call(rbind, lapply(dup.indices, function(i)
    data %>% filter(state__ == i) %>% select(state__, state) %>% count(state)))
#Each index contributes exactly two label rows
y <- rep(seq_along(dup.indices), each = 2)
xy <- cbind(y, x)
colnames(xy)[c(1,3)] <- c("pair number", "number of observations")
#You can print(xy) if you follow the analysis in the console; the knitr package is not strictly required
kable(xy, align = "c")
| pair number | state | number of observations |
|---|---|---|
| 1 | DC | 437 |
| 1 | MD | 13 |
| 2 | KS | 53440 |
| 2 | ST | 1 |
| 3 | MD | 8172 |
| 3 | OH | 1 |
| 4 | MA | 1 |
| 4 | MI | 17910 |
| 5 | NJ | 1 |
| 5 | NM | 7129 |
| 6 | ND | 2 |
| 6 | OH | 24921 |
| 7 | SC | 1 |
| 7 | SD | 21727 |
| 8 | AK | 1 |
| 8 | PR | 3015 |
| 9 | SL | 7 |
| 9 | XX | 2 |
As you may notice, the table shows how many observations belong to each of the nine cases of cross-referencing between state labels and state indices. So there are two problems:
1. Some state labels were entered with the wrong state index.
2. In addition, two state labels (ST and XX) are not associated with any valid state index.
The approach to solving this takes two subsets of level-label pairs: one pivoting on the unique values of the levels and the other on the unique values of the labels. Subtracting the pairs in the smaller subset (state.index) from the larger subset (state.label) yields the remaining pairs (rp), which will be removed from the data. In the remaining cases, the state index of the label with fewer observations is corrected.
state.index <- data %>% select(state__, state) %>% distinct(state__)
state.label <- data %>% select(state, state__) %>% distinct(state)
rp <- state.label[!(state.label$state %in% state.index$state), ]
#You can print(rp) if you follow the analysis in the console; the knitr package is not strictly required
kable(rp, align = "c")
| state | state__ |
|---|---|
| ST | 20 |
| XX | 93 |
Below is a summary of the changes:
case 1: the MD observations change index (13 observations vs. DC with 437 observations)
case 2: the label ST is removed
case 3: the OH observations change index (1 observation vs. MD with 8172 observations)
case 4: the MA observations change index (1 observation vs. MI with 17910 observations)
case 5: the NJ observations change index (1 observation vs. NM with 7129 observations)
case 6: the ND observations change index (2 observations vs. OH with 24921 observations)
case 7: the SC observations change index (1 observation vs. SD with 21727 observations)
case 8: the AK observations change index (1 observation vs. PR with 3015 observations)
case 9: the label XX is removed
#Remove ST and XX: 3 observations dropped
data <- data %>% filter( state != "XX" & state != "ST")
#Fix the state index
data[data$state__ == 11 & data$state == "MD", 1] <- 24
data[data$state__ == 24 & data$state == "OH", 1] <- 39
data[data$state__ == 26 & data$state == "MA", 1] <- 25
data[data$state__ == 35 & data$state == "NJ", 1] <- 34
data[data$state__ == 39 & data$state == "ND", 1] <- 38
data[data$state__ == 46 & data$state == "SC", 1] <- 45
data[data$state__ == 72 & data$state == "AK", 1] <- 2
Finally, the unique values of the levels now match the unique values of the labels, and each state index has exactly one label.
#Select distinct state index numbers; there are 70 distinct index numbers
nrow(data %>% select(state__) %>% distinct(state__))
## [1] 70
#Select distinct state name labels; now there are also 70 distinct name labels
nrow(data %>% select(state) %>% distinct(state))
## [1] 70
variable.pairs <- data %>% distinct(state, state__) %>% count(state__)
variable.pairs %>% arrange(desc(n))
## Source: local data frame [70 x 2]
##
## state__ n
## (dbl) (int)
## 1 1 1
## 2 2 1
## 3 4 1
## 4 5 1
## 5 6 1
## 6 8 1
## 7 9 1
## 8 10 1
## 9 11 1
## 10 12 1
## .. ... ...
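This invariant can also be asserted programmatically, so any regression in the cleaning step is caught immediately. A minimal sketch, assuming the cleaned data from above:
#Invariant check: every state index must now map to exactly one label
stopifnot(all((data %>% distinct(state__, state) %>% count(state__))$n == 1))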
“Across the United States, which types of events are most harmful with respect to population health?”
Starting with the first question, the relevant variables should be those related to the clause “…are most harmful with respect to population health?”. The variables fatalities and injuries clearly satisfy this criterion.
Therefore, the variable data.health contains the following variables from the original data:
data.health <- data %>% select(state,
bgn_date,
evtype,
fatalities,
injuries,
end_date,
zonenames,
longitude,
latitude)
“Across the United States, which types of events have the greatest economic consequences?”
For this question, the relevant variables are propdmg (property damage) and cropdmg (crop damage), together with their respective scale variables: propdmgexp and cropdmgexp. The damage variables propdmg and cropdmg need some treatment. Is each unit one dollar? That depends on what the accompanying scale variable says.
data %>% select(propdmgexp) %>% distinct(propdmgexp)
## Source: local data frame [19 x 1]
##
## propdmgexp
## (fctr)
## 1 K
## 2 M
## 3
## 4 B
## 5 m
## 6 +
## 7 0
## 8 5
## 9 6
## 10 ?
## 11 4
## 12 2
## 13 3
## 14 h
## 15 7
## 16 H
## 17 -
## 18 1
## 19 8
data %>% select(cropdmgexp) %>% distinct(cropdmgexp)
## Source: local data frame [9 x 1]
##
## cropdmgexp
## (fctr)
## 1
## 2 M
## 3 K
## 4 m
## 5 B
## 6 ?
## 7 0
## 8 k
## 9 2
To avoid misunderstandings and biases, only the following scalars are used: “0” (plain dollars), “K” (thousands), “M” (millions) and “B” (billions), all in uppercase.
The rest of the observations are omitted.
data.economic <- data %>% select(state,
bgn_date,
evtype,
propdmg,
propdmgexp,
cropdmg,
cropdmgexp,
end_date,
zonenames,
longitude,
latitude)
data.economic <- data.economic %>% filter(propdmgexp == "0" | propdmgexp == "K" | propdmgexp == "M" | propdmgexp == "B") %>% filter(cropdmgexp == "0" | cropdmgexp == "K" | cropdmgexp == "M" | cropdmgexp == "B")
#Property damage: scale propdmg according to propdmgexp
data.economic[data.economic$propdmgexp== "0", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "0") %>% select(propdmg) %>% transmute(prop.damage = propdmg)
data.economic[data.economic$propdmgexp== "K", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "K") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000)
data.economic[data.economic$propdmgexp== "M", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "M") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000000)
data.economic[data.economic$propdmgexp== "B", 4] <- data.economic %>% select(evtype, propdmg, propdmgexp) %>% filter(propdmgexp == "B") %>% select(propdmg) %>% transmute(prop.damage = propdmg * 1000000000)
#Crop damage: scale cropdmg according to cropdmgexp
data.economic[data.economic$cropdmgexp== "0", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "0") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg)
data.economic[data.economic$cropdmgexp== "K", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "K") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000)
data.economic[data.economic$cropdmgexp== "M", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "M") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000000)
data.economic[data.economic$cropdmgexp== "B", 6] <- data.economic %>% select(evtype, cropdmg, cropdmgexp) %>% filter(cropdmgexp == "B") %>% select(cropdmg) %>% transmute(crop.damage = cropdmg * 1000000000)
#Normalize to millions of US dollars
data.economic <- mutate(data.economic, prop.damage = propdmg/(1000000))
data.economic <- mutate(data.economic, crop.damage = cropdmg/(1000000))
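As an aside, the per-scalar assignments above can be written more compactly with a named multiplier lookup. A minimal sketch that should produce the same result, assuming it is applied to the filtered but not yet scaled data.economic:
#Map each retained scalar to its dollar multiplier ("0" is treated as plain dollars)
mult <- c("0" = 1, "K" = 1e3, "M" = 1e6, "B" = 1e9)
#Scale and normalize to millions of US dollars in one step
data.economic <- data.economic %>%
    mutate(prop.damage = propdmg * mult[as.character(propdmgexp)] / 1000000,
           crop.damage = cropdmg * mult[as.character(cropdmgexp)] / 1000000)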
Therefore, the variable data.economic contains the following variables from the original data: state, bgn_date, evtype, propdmg, propdmgexp, cropdmg, cropdmgexp, end_date, zonenames, longitude and latitude, together with the derived variables prop.damage and crop.damage (damage in millions of US dollars).
“Across the United States, which types of events are most harmful with respect to population health?”
In order to answer this question, it is convenient to start by summing all the fatalities and injuries per type of event. The harmful.events variable is the result of this operation.
harmful.events <- data.health %>% group_by(evtype) %>% summarize(num_fatalities = sum(fatalities), num_injuries = sum(injuries))
Now, using the harmful.events variable, we construct a ranking of the 10 most dangerous events in terms of fatalities, and another in terms of injuries.
rank.fatalities <- harmful.events %>% select(evtype, num_fatalities) %>% arrange(desc(num_fatalities)) %>% head(n = 10)
rank.injuries <- harmful.events %>% select(evtype, num_injuries) %>% arrange(desc(num_injuries)) %>% head(n = 10)
kable(rank.fatalities, align = "c")
| evtype | num_fatalities |
|---|---|
| TORNADO | 5633 |
| EXCESSIVE HEAT | 1903 |
| FLASH FLOOD | 978 |
| HEAT | 937 |
| LIGHTNING | 816 |
| TSTM WIND | 504 |
| FLOOD | 470 |
| RIP CURRENT | 368 |
| HIGH WIND | 248 |
| AVALANCHE | 224 |
kable(rank.injuries, align = "c")
| evtype | num_injuries |
|---|---|
| TORNADO | 91346 |
| TSTM WIND | 6957 |
| FLOOD | 6789 |
| EXCESSIVE HEAT | 6525 |
| LIGHTNING | 5230 |
| HEAT | 2100 |
| ICE STORM | 1975 |
| FLASH FLOOD | 1777 |
| THUNDERSTORM WIND | 1488 |
| HAIL | 1361 |
Tornado heads both rankings robustly, with 5633 fatalities and 91346 injuries over the history of the data.
data.plot1 <- data.health %>% select(bgn_date, evtype, fatalities, injuries) %>% filter(evtype == "TORNADO") %>% mutate( year = as.Date(bgn_date, format = "%m/%d/%Y"))
data.plot1 <- data.plot1 %>% select(year, fatalities, injuries)
data.plot1$year <- year(data.plot1$year)
data.plot1 <- data.plot1 %>% group_by(year) %>% summarize( num_fatalities = sum(fatalities), num_injuries = sum(injuries))
data.plot1 <- melt(data.plot1, id.vars = "year", measure.vars = c("num_fatalities", "num_injuries"))
It might be useful to plot a time series of the tornado events.
p <- ggplot(data.plot1, aes(year, value, colour = variable))
p + geom_line() + ylab("Number of cases")
As you can see in the plot, injuries and fatalities apparently behave similarly over time, but shortly before 1990 a period of softer peaks in injuries began. The natural questions are: is this pattern due to some cyclical behavior of the weather, to public policy or a new approach to confronting this type of event, or perhaps to where the events occurred?
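That impression can be checked numerically by comparing the periods before and after 1990. A minimal sketch, assuming data.plot1 in its long format (columns year, variable and value):
#Compare mean and peak yearly injuries before and after 1990
data.plot1 %>%
    filter(variable == "num_injuries") %>%
    group_by(period = ifelse(year < 1990, "before 1990", "1990 onward")) %>%
    summarize(mean_injuries = mean(value), peak_injuries = max(value))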
“Across the United States, which types of events have the greatest economic consequences?”
We start by building a ranking of the highest average property damage by event type, and then repeat it for average crop damage.
economic.stats <- data.economic %>% select(evtype, prop.damage, crop.damage) %>% group_by(evtype) %>% summarize(
avg_prop_dmg = mean(prop.damage, na.rm = TRUE),
avg_crop_dmg = mean(crop.damage, na.rm = TRUE),
sd_prop_dmg = sd(prop.damage, na.rm = TRUE),
sd_crop_dmg = sd(crop.damage, na.rm = TRUE))
prop.dmg <- economic.stats %>% select(evtype, avg_prop_dmg, sd_prop_dmg) %>% arrange(desc(avg_prop_dmg)) %>% head(n = 10)
kable(prop.dmg, align = 'c')
| evtype | avg_prop_dmg | sd_prop_dmg |
|---|---|---|
| TORNADOES, TSTM WIND, HAIL | 1600.00000 | NaN |
| HURRICANE OPAL | 1074.00000 | 1450.98311 |
| HURRICANE/TYPHOON | 810.31197 | 1688.36816 |
| RIVER FLOOD | 317.47719 | 1248.73577 |
| HURRICANE | 142.88762 | 430.78119 |
| HURRICANE OPAL/HIGH WINDS | 100.00000 | NaN |
| HURRICANE ERIN | 85.33333 | 125.85839 |
| WINTER STORM HIGH WINDS | 60.00000 | NaN |
| River Flooding | 35.33000 | 37.70393 |
| STORM SURGE/TIDE | 34.12237 | 345.32862 |
crop.dmg <- economic.stats %>% select(evtype, avg_crop_dmg, sd_crop_dmg) %>% arrange(desc(avg_crop_dmg)) %>% head(n = 10)
kable(crop.dmg, align = 'c')
| evtype | avg_crop_dmg | sd_crop_dmg |
|---|---|---|
| RIVER FLOOD | 314.295875 | 1249.525051 |
| HURRICANE/TYPHOON | 79.026449 | 271.348245 |
| HURRICANE | 39.542794 | 96.357811 |
| SEVERE THUNDERSTORMS | 17.000000 | NaN |
| HEAVY RAINS | 15.125000 | 23.346574 |
| SEVERE THUNDERSTORM WINDS | 14.500000 | 16.263456 |
| HURRICANE OPAL/HIGH WINDS | 10.000000 | NaN |
| River Flooding | 9.340000 | 15.159720 |
| TROPICAL STORM JERRY | 8.000000 | 9.899495 |
| FLOOD/FLASH FLOOD | 6.788143 | 15.640531 |
Visualizing the average damage of each type, property and crop, can help show where the data are most concentrated (reusing the economic.stats computed above).
economic.plot <- ggplot(economic.stats)
economic.plot + geom_point(aes(avg_prop_dmg, sd_prop_dmg), color = "darkred", alpha = .5) + geom_point(aes(avg_crop_dmg, sd_crop_dmg), color = "steelblue", alpha = .5) + xlab("Average Damage") + ylab("Standard Deviation Damage") + xlim(0, 300)
## Warning: Removed 43 rows containing missing values (geom_point).
## Warning: Removed 41 rows containing missing values (geom_point).
As can be seen, the vast majority of event types are concentrated below an average damage of 200 million US dollars.
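That concentration can also be quantified directly. A minimal sketch, assuming economic.stats as computed above (damage in millions of US dollars):
#Share of event types whose average damage lies below 200 million US dollars
mean(economic.stats$avg_prop_dmg < 200, na.rm = TRUE)
mean(economic.stats$avg_crop_dmg < 200, na.rm = TRUE)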
Finally, we build one ranking of the total property damage and another of the total crop damage per event type.
sum.prop.dmg <- data.economic %>% select(evtype, prop.damage) %>% group_by(evtype) %>% summarize( total_prop_damage = sum(prop.damage, na.rm = TRUE)) %>% arrange(desc(total_prop_damage)) %>% head( n = 10)
sum.crop.dmg <- data.economic %>% select(evtype, crop.damage) %>% group_by(evtype) %>% summarize( total_crop_damage = sum(crop.damage, na.rm = TRUE)) %>% arrange(desc(total_crop_damage)) %>% head( n = 10)
kable(sum.prop.dmg, align = 'c')
| evtype | total_prop_damage |
|---|---|
| FLOOD | 132836.489 |
| HURRICANE/TYPHOON | 26740.295 |
| TORNADO | 16166.947 |
| HURRICANE | 9716.358 |
| HAIL | 7991.593 |
| FLASH FLOOD | 7327.856 |
| RIVER FLOOD | 5079.635 |
| STORM SURGE/TIDE | 4640.643 |
| WILDFIRE | 3498.365 |
| THUNDERSTORM WIND | 3398.942 |
kable(sum.crop.dmg, align = 'c')
| evtype | total_crop_damage |
|---|---|
| FLOOD | 5170.9555 |
| RIVER FLOOD | 5028.7340 |
| ICE STORM | 5022.1135 |
| HURRICANE | 2688.9100 |
| HURRICANE/TYPHOON | 2607.8728 |
| HAIL | 2028.3909 |
| DROUGHT | 1652.6960 |
| FLASH FLOOD | 1387.5991 |
| FROST/FREEZE | 931.8010 |
| HIGH WIND | 631.9243 |
In both rankings, flood is the most devastating event in economic terms.
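A natural follow-up is to combine both kinds of damage into a single total per event type. A minimal sketch, assuming data.economic as built above (damage in millions of US dollars):
#Combined property plus crop damage totals per event type, top 5
data.economic %>%
    group_by(evtype) %>%
    summarize(total_damage = sum(prop.damage, na.rm = TRUE) + sum(crop.damage, na.rm = TRUE)) %>%
    arrange(desc(total_damage)) %>%
    head(n = 5)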