A number of catastrophes over the years have had an adverse effect on the population health and economy of the United States of America. These include storms, hurricanes, floods etc. and have been a major deterrent in the progress of the country. These “severe events” can result in fatalities and damage to property and hence prevention or mainenance following such outcomes remains to be the key focus.
This project aims at understanding the effect of these severe outcomes on the population health and economy with an aim to better focus or prioritize the resources for research/maintenance within the general population or municipalities.
Based on the cursory data analysis, the fatalities incurred over the years from 1950 to 2011 were primarily caused by “Tornadoes” which also prove to be the major cause for injuries. However “Floods” turn out be the major deterrent of the US economy.
The data used for this analysis was obtained from the public domain provided by the National Weather Service (nws). The raw data is a zipped csv file and documentation pertaining to the data can be obtained from the NWS or the National Climatic Data Center.
After downloading the data it. We can read it with the read.csv command and look at the summary. (Note: read.csv can be used to read .csv.bz2 files.)
stormdata <- read.csv('./../repdata_data_StormData.csv.bz2', stringsAsFactors = F, header = T)
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are a total of 902297 observations with 37 variables.
We will be using only the data from the columns, evtype, propdmg, cropdmg, propdmgexp, cropdmgexp, state, fatalities and injuries. Therefore we will subset these variables into another dataframe reducing the number of variables to 8.
data <- stormdata[,c("STATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","CROPDMG","PROPDMGEXP","CROPDMGEXP")]
str(data)
## 'data.frame': 902297 obs. of 8 variables:
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMGEXP: chr "" "" "" "" ...
summary(data)
## STATE EVTYPE FATALITIES
## Length:902297 Length:902297 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.0000
## Mode :character Mode :character Median : 0.0000
## Mean : 0.0168
## 3rd Qu.: 0.0000
## Max. :583.0000
## INJURIES PROPDMG CROPDMG
## Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 Median : 0.000
## Mean : 0.1557 Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.0000 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :1700.0000 Max. :5000.00 Max. :990.000
## PROPDMGEXP CROPDMGEXP
## Length:902297 Length:902297
## Class :character Class :character
## Mode :character Mode :character
##
##
##
To ensure the data is complete, we will first check whether the data contains any NA values.
sum(is.na(data))
## [1] 0
Since there are no missing values, We can proceed with attempting a cursory analysis of the data.
We will first convert the EVTYPE variable to a factor, aggregate the fatalities and using dplyr order in descending order. We will then plot the data.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library (ggplot2)
data$EVTYPE <- as.factor(data$EVTYPE)
fatal_data <- aggregate(FATALITIES ~ EVTYPE,data,sum)
ordered_fatal_data <- arrange(fatal_data,desc(FATALITIES))
# png(filename = 'fatalities_plot.png')
ggplot(ordered_fatal_data[1:10,], aes(EVTYPE, FATALITIES, fill=FATALITIES)) + geom_bar(stat="identity", colour="red") + ylab("Fatalities") + xlab("Event type") + ggtitle('Fatalities caused by Event types(top 10)') + theme(axis.text.x = element_text(angle=90,hjust = 1,vjust = 0.5))
# dev.off()
The graph above indicates that HAIL is the deadliest of the event types causing the loss of life of millions of people. To obtain an idea about the event type causing most injuries,
library(dplyr)
library(ggplot2)
injury_data <- aggregate(INJURIES ~ EVTYPE, data, sum)
ordered_injury_data <- arrange(injury_data, desc(INJURIES))
ggplot(ordered_injury_data[1:10,], aes(EVTYPE, INJURIES, fill=INJURIES)) + geom_bar(stat="identity", colour = "red") + ylab("Injuries") + xlab("Event type") + ggtitle('Injuries caused by Event types(top 10)') + theme(axis.text.x = element_text(angle=90,hjust = 1,vjust = 0.5))
Here we see that it is TORNADOES that cause the most injuries.
The PROPDMGEXP and the CROPDMGEXP are scale factors to be used as exponents for the PROPDMG and CROPDMG variables. Let us first see what are the various values that PROPDMGEXP and CROPDMGEXP take:
sort(unique(data$PROPDMGEXP))
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
sort(unique(data$CROPDMGEXP))
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
To cover the exponent values, we will use the following substitutions: - “” unchanged - + unchanged - ? unchanged - - -1 - H hundred (x 100) - K/k thousand (x 1000) - M/m million (x 1000,000) - B billion (x 1000,000,000) - Any other number will be 10 raised to the power of that number.
We wil write two functions to calculate the economic impact based on the scale factor and the PROPDMG variable. We will call that economic_impact.
change_exp <- function(x, y){switch(y,
'+' = x*1,
'?' = x*1,
' ' = x*1,
'1' = x*1,
'2' = x*100,
'3' = x*1000,
'4' = x*10000,
'5' = x*100000,
'6' = x*1000000,
'7' = x*10000000,
'8' = x*100000000,
'H' = x*100,
'K' = x*1000,
'k' = x*1000,
'M' = x*1000000,
'm' = x*1000000,
'B' = x*1000000000,
'b' = x*1000000000,
x)
}
compute_economic_imp <- function(x,exp){
len <- length(x)
res <- rep(0,len)
for (i in 1:len){
res[i] <- change_exp(x[i],exp[i])
}
res
}
data$economic_impact <- compute_economic_imp(data$PROPDMG, data$PROPDMGEXP)
Finally we will compute the economic impact based on the event type and plot it.
eco_data <- aggregate(economic_impact ~ EVTYPE, data, sum)
ordered_eco_data <- arrange(eco_data, desc(economic_impact))
ggplot(ordered_eco_data[1:10,], aes(EVTYPE, economic_impact, fill=economic_impact)) + geom_bar(stat="identity", colour = "red") + ylab("Injuries") + xlab("Event type") + ggtitle('Economic impact of different Event types(top 10)') + theme(axis.text.x = element_text(angle=90,hjust = 1,vjust = 0.5))