Synopsis

Analysing storms and other severe weather events across the United States, decoding their patterns and estimating their impacts is a challenging task. This short analysis addresses two questions in a limited, summarized manner. First, which types of events are most harmful with respect to population health? Second, which types of events have the greatest economic consequences? The data was limited to the period 1996-2011, when the recording of event types other than tornadoes had started. The analysis suggests that extreme cold weather, tornadoes and thunderstorm wind are the main causes of injuries and fatalities, whereas thunderstorm wind causes the most property damage across the United States.

Data processing

The data comes from the National Weather Service storm database. The file is downloaded and loaded, and the variables are converted to the correct classes. The data is then subsetted to the period 1996-2011. Apart from that, no specific transformation of the data was performed.

library(stringdist)
library(ggplot2)
library(dplyr)
library(reshape)

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              "indata.csv.bz2")
d <- read.csv("indata.csv.bz2")

# Converting BGN_DATE to class Date (any time part in the string is ignored)
d$BGN_DATE <- as.Date(d$BGN_DATE, format = "%m/%d/%Y")
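As a small illustration of the date conversion, the sketch below parses a few made-up strings and keeps only those from 1996 onwards. It assumes the raw values look like "4/18/1950 0:00:00"; the sample values are hypothetical and not taken from the data set.

```r
# Hypothetical sample of raw BGN_DATE strings (not taken from the data set)
raw_dates <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# Parse month/day/year; trailing characters (the time part) are ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Keep only the dates from 1996 onwards
recent <- parsed[parsed >= as.Date("1996-01-01")]
print(recent)
```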

The data set is reduced to events from 1996 onwards, when the different event types started being recorded. Moreover, typos in the event types are corrected and the data set is reduced to only the variables of interest for this particular analysis.

Unique event types and choice of variables for the analysis:

uniEvTy <- c("Astronomical Low Tide",
             "Avalanche",
             "Blizzard",
             "Coastal Flood",
             "Cold/Wind Chill",
             "Debris Flow",
             "Dense Fog",
             "Dense Smoke",
             "Drought",
             "Dust Devil",
             "Dust Storm",
             "Excessive Heat",
             "Extreme Cold/Wind Chill",
             "Flash Flood",
             "Flood",
             "Freezing Fog",
             "Frost/Freeze", 
             "Funnel Cloud",
             "Hail",
             "Heat",
             "Heavy Rain",
             "Heavy Snow",
             "High Surf",
             "High Wind",
             "Hurricane/Typhoon",
             "Ice Storm",
             "Lakeshore Flood",
             "Lake-Effect Snow",
             "Lightning",
             "Marine Hail",
             "Marine High Wind",
             "Marine Strong Wind",
             "Marine Thunderstorm Wind",
             "Rip Current",
             "Seiche",
             "Sleet",
             "Storm Tide", 
             "Strong Wind",
             "Thunderstorm Wind",
             "Tornado",
             "Tropical Depression",
             "Tropical Storm",
             "Tsunami",
             "Volcanic Ash",
             "Waterspout",
             "Wildfire",
             "Winter Storm",
             "Winter Weather")

col <- c("STATE__","BGN_DATE","COUNTY","COUNTYNAME","STATE","EVTYPE","LENGTH","WIDTH",
         "MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP",
         "WFO","STATEOFFIC")
dsub <- subset(d, d$BGN_DATE >= "1996-01-01")
dsub <- dsub[,col]
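# amatch() returns positions in uniEvTy, so the cleaned names are looked up there;
# comparing in lower case keeps capitalisation from counting as a difference
dsub$EVTYPE <- uniEvTy[amatch(tolower(dsub$EVTYPE), tolower(uniEvTy), maxDist = 50)]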
dsub$EVTYPE <- as.factor(dsub$EVTYPE)
dsub$BGN_DATE <- as.Date(dsub$BGN_DATE)

rm(d)
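The clean-up of event types relies on approximate string matching. The following is only a minimal sketch of that idea: it uses base R's `adist()` instead of `stringdist::amatch()` so that it runs without extra packages, and the raw labels are hypothetical examples rather than values from the data set.

```r
# Official names to map onto (a small subset, for illustration only)
official <- c("Thunderstorm Wind", "Tornado", "Flash Flood", "Hail")

# Hypothetical raw labels with typical spelling variations
raw <- c("THUNDERSTORM WINDS", "TORNADOS", "FLASH FLOODING", "HAIL")

# Edit distance between every raw label and every official name,
# compared in lower case so that capitalisation does not count as a difference
dist_mat <- adist(tolower(raw), tolower(official))

# For each raw label, pick the official name with the smallest distance
cleaned <- official[apply(dist_mat, 1, which.min)]
print(data.frame(raw, cleaned))
```

Picking the nearest name without any upper limit on the distance means every label gets assigned somewhere, even a label that resembles none of the official names. A stricter limit would leave such labels unmatched instead.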

Summarizing data for the first question about the population health:

# One way of aggregating 
# InjAgg <- aggregate(INJURIES ~ EVTYPE, data = dsub, sum)
# FatAgg <- aggregate(FATALITIES ~ EVTYPE, data = dsub, sum)

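# Another way of aggregating using dplyr, giving one total per event type
# (grouping by date as well would leave many rows per event type, which the
# dodged bar chart would draw on top of each other instead of adding up)
dsub_sum <- dsub %>% 
        group_by(EVTYPE) %>% 
        summarise(total_inj = sum(INJURIES),
                  total_fat = sum(FATALITIES))

# Reshaping data for the plot later
PopHealth <- melt(as.data.frame(dsub_sum), id="EVTYPE")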
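To make the reshaping step more concrete, here is a minimal base R sketch of what going from wide to long format means. The numbers are made up, and the sketch does not use `melt()` itself; it only mimics the resulting layout with the columns `variable` and `value`.

```r
# Toy per-event totals (made-up numbers, not results from the storm data)
wide <- data.frame(EVTYPE    = c("Tornado", "Flood"),
                   total_inj = c(10, 3),
                   total_fat = c(2, 1))

# Stack the two measure columns into one 'value' column and record
# in 'variable' which measure each value came from
long <- data.frame(
  EVTYPE   = rep(wide$EVTYPE, times = 2),
  variable = rep(c("total_inj", "total_fat"), each = nrow(wide)),
  value    = c(wide$total_inj, wide$total_fat)
)
print(long)
```

The long layout is what lets ggplot2 map `variable` to the fill colour and draw injuries and fatalities side by side.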

Summarizing data for the second question about the economic consequences:

# Converting the damage figures to dollars using the exponent columns
# (K = thousands, M = millions, B = billions); an empty exponent counts as
# zero and any other code becomes NA
dsub$CROPDMG_rev <- ifelse(dsub$CROPDMGEXP == "K", dsub$CROPDMG*1e+03,
                    ifelse(dsub$CROPDMGEXP == "M", dsub$CROPDMG*1e+06,
                    ifelse(dsub$CROPDMGEXP == "B", dsub$CROPDMG*1e+09,
                    ifelse(dsub$CROPDMGEXP == "", 0, NA_real_))))

dsub$PROPDMG_rev <- ifelse(dsub$PROPDMGEXP == "K", dsub$PROPDMG*1e+03,
                    ifelse(dsub$PROPDMGEXP == "M", dsub$PROPDMG*1e+06,
                    ifelse(dsub$PROPDMGEXP == "B", dsub$PROPDMG*1e+09,
                    ifelse(dsub$PROPDMGEXP == "0", dsub$PROPDMG*1e+01,
                    ifelse(dsub$PROPDMGEXP == "", 0, NA_real_)))))




# Aggregating with dplyr, giving one total per event type;
# damage values with an unknown exponent (NA) are left out of the sums
dsub_sum_1 <- dsub %>% 
        group_by(EVTYPE) %>% 
        summarise(total_crop = sum(CROPDMG_rev, na.rm = TRUE),
                  total_prop = sum(PROPDMG_rev, na.rm = TRUE))

# Reshaping data for the plot later
EcoDam <- melt(as.data.frame(dsub_sum_1), id="EVTYPE")
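The exponent handling can also be written as a lookup table, which may be easier to read and extend than nested `ifelse()` calls. The sketch below is one possible way of doing it, not the method used above: the helper `to_dollars()` and the `multiplier` table are hypothetical names, and the special "0" code for property damage is left out for brevity.

```r
# Multipliers for the exponent codes K, M and B;
# any other non-empty code is treated as unknown (NA)
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a damage figure and its exponent code to dollars
to_dollars <- function(amount, exp_code) {
  code <- toupper(as.character(exp_code))
  mult <- unname(multiplier[code])   # NA for codes not in the table
  mult[code == ""] <- 0              # empty code: counted as zero
  amount * mult
}

res <- to_dollars(c(25, 2.5, 1, 10, 5), c("K", "M", "B", "", "?"))
print(res)
```

Unknown codes stay `NA` rather than being silently turned into zero, so they have to be dropped explicitly (for example with `na.rm = TRUE`) when summing.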

Results

The following plot gives an overview of the number of events by type, to give an idea of which events occur most often since the recording of different event types started in 1996. Bear in mind that this analysis is intended to be very short and exploratory only.

ggplot(dsub, aes(x=BGN_DATE))+
        geom_bar()+
        facet_grid(cols=vars(EVTYPE))+
        theme(axis.text.x = element_text(angle = 30, vjust = 0.5))+ 
        xlab("Years")+
        ggtitle("Overview of event types as a function of year")

Across the United States, which types of events are most harmful with respect to population health?

Focusing on the variables injuries and fatalities, the following plot shows these two population health factors per event type.

ggplot(PopHealth, aes(x=EVTYPE,y=value,fill=variable))+ 
        geom_col(position = "dodge")+
        xlab("Event Type")+
        ylab("Amount of injuries/fatalities")+
        ggtitle("Amount of injuries and fatalities per event type")+
        theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

From the plot above it seems that tornadoes, thunderstorm wind and extreme cold cause the most injuries, whereas the most fatalities occur during tornadoes, over the period 1996-2011.
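With this many event types a bar chart can be hard to read precisely, so a sorted table of totals is one way to check the ranking. The sketch below shows the idea on made-up numbers; the data frame `totals` is a placeholder and not a result from the storm data.

```r
# Made-up totals per event type (placeholders, not results from the data)
totals <- data.frame(EVTYPE    = c("Flood", "Tornado", "Heat", "Lightning"),
                     total_inj = c(120, 900, 300, 150),
                     total_fat = c(30, 60, 80, 20))

# Sort by injuries, largest first, and keep the top three event types
top_inj <- head(totals[order(totals$total_inj, decreasing = TRUE), ], 3)
print(top_inj)
```

The same pattern with `total_fat` in place of `total_inj` would rank the event types by fatalities.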

Across the United States, which types of events have the greatest economic consequences?

Looking at the economic consequences of the events, thunderstorm wind seems to cause the most property damage across the United States (note: a deeper analysis of the impact on specific states is not performed here).

ggplot(EcoDam, aes(x=EVTYPE,y=value,fill=variable))+ 
        geom_col(position = "dodge")+
        xlab("Event Type")+
        ylab("Crop and property damage")+
        ggtitle("Economic damage as a function of event type")+
        theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

Limitations and further analysis

This is a short and limited exploratory analysis of the given data set, focused on reproducibility and the points required in this course. The analysis could be taken further, and many interesting questions could be asked. In this assignment the data set was limited to the period 1996-2011. It could be interesting to analyse the whole period, and to examine the consequences of leaving out the earlier recordings. It could also be interesting to look at other factors such as specific states: are some states more impacted than others? The data is interesting, but we will keep it short for now.
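As a starting point for the per-state question, the totals could be aggregated by the STATE column in much the same way as by event type. The following is only a sketch on made-up records; the data frame `events` is a placeholder and the numbers are not taken from the storm data.

```r
# Made-up records (placeholders, not taken from the storm data)
events <- data.frame(STATE      = c("TX", "TX", "OK", "FL", "OK"),
                     INJURIES   = c(5, 0, 12, 1, 3),
                     FATALITIES = c(1, 0, 2, 0, 0))

# Total injuries and fatalities per state
by_state <- aggregate(cbind(INJURIES, FATALITIES) ~ STATE, data = events, FUN = sum)

# States ordered from most to fewest injuries
by_state <- by_state[order(by_state$INJURIES, decreasing = TRUE), ]
print(by_state)
```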