Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
In this project I use the data described above, provided by NOAA via the Coursera Data Science Specialization. This R Markdown file documents the method, the code chunks used, the results, and my conclusions, with both visual and numerical presentations. Disclaimer: the conclusions and results are my own and should not be taken as the views of Coursera, Johns Hopkins, or their professors.
At the time of this assignment I was using RStudio Version 0.99.902 (© 2009-2016 RStudio, Inc.) on Windows, under R x64 version 3.3.1.
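For anyone reproducing this, the same environment details can be captured directly in the document with base R's sessionInfo(), which prints the R version, platform, and loaded packages:

sessionInfo()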
First things first, we have to download the data and process it. But before we do that, set your working directory to wherever you choose, then:
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("./Reproducible Research Assignment II")) {dir.create("./Reproducible Research Assignment II")}
download.file(url, destfile="./Reproducible Research Assignment II/StormData.csv.bz2") ## I am using a windows computer hence no curl method
StormData <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)
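If the download seems to have failed, a quick check on the local file can help before reading it (these calls assume the same destfile path used above):

file.exists("./Reproducible Research Assignment II/StormData.csv.bz2")  ## TRUE if the download landed
file.size("./Reproducible Research Assignment II/StormData.csv.bz2")    ## size in bytes; the compressed file is fairly large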
Per many lessons and classes (and common sense, really), it's always good to get a feel for the data, even if you have no clue what you're looking at. Sometimes you spot a pattern and it starts to make sense. Or you could just read the 97-page PDF file that came with the assignment. Below are a few commands I usually run before getting more specific.
str(StormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(StormData[, 1:10])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI
## 1 TORNADO 0
## 2 TORNADO 0
## 3 TORNADO 0
## 4 TORNADO 0
## 5 TORNADO 0
## 6 TORNADO 0
tail(StormData[, 1:10])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 902292 47 11/28/2011 0:00:00 03:00:00 PM CST 21
## 902293 56 11/30/2011 0:00:00 10:30:00 PM MST 7
## 902294 30 11/10/2011 0:00:00 02:48:00 PM MST 9
## 902295 2 11/8/2011 0:00:00 02:58:00 PM AKS 213
## 902296 2 11/9/2011 0:00:00 10:21:00 AM AKS 202
## 902297 1 11/28/2011 0:00:00 08:00:00 PM CST 6
## COUNTYNAME STATE EVTYPE BGN_RANGE
## 902292 TNZ001>004 - 019>021 - 048>055 - 088 TN WINTER WEATHER 0
## 902293 WYZ007 - 017 WY HIGH WIND 0
## 902294 MTZ009 - 010 MT HIGH WIND 0
## 902295 AKZ213 AK HIGH WIND 0
## 902296 AKZ202 AK BLIZZARD 0
## 902297 ALZ006 AL HEAVY SNOW 0
## BGN_AZI
## 902292
## 902293
## 902294
## 902295
## 902296
## 902297
It's clear that this is a huge data set: over 900k observations (rows) of 37 variables (columns), some of which appear to be multiple headers describing essentially the same information. Luckily, the assignment points us to event types (EVTYPE) and to recent events, so we will base the analysis on those. I have chosen to focus on data from 1995 onwards.
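Since EVTYPE is the variable we will group on, it is also worth a quick exploratory look at how many distinct event labels it contains (an optional check on the raw data):

length(unique(StormData$EVTYPE))   ## the raw labels are messy, with many near-duplicate spellings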
In order to select which data to focus on, let's consider the two questions posed by the assignment:

> 1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
> 2. Across the United States, which types of events have the greatest economic consequences?
Based on these questions, and reading through the .pdf file provided, we note that the answers lie in a few variables alongside EVTYPE, namely: FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
Let's start with question 1. Out of this list, two variables relate directly to population health: INJURIES and FATALITIES.
Before we begin, let's cull the data to start from 1995. The str() output above shows a "BGN_DATE" variable, so we first convert it to a Date and then create a new data set containing only events from 1995 onwards.
## convert BGN_DATE from character to Date (the time component is ignored)
StormData$BGN_DATE <- as.Date(StormData$BGN_DATE, format="%m/%d/%Y")
New <- as.Date("1995-01-01")
## keep only events beginning on or after 1 January 1995
NewStormData <- subset(StormData, StormData$BGN_DATE >= New)
This data set is now down to 681k observations (vs 902k previously).
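A quick sanity check that the cut-off behaved as intended (optional):

dim(NewStormData)             ## rows and columns after the subset
range(NewStormData$BGN_DATE)  ## earliest date should now be on or after 1995-01-01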
Next, let's aggregate the new data set by event type (EVTYPE):
Fatalities <- with(NewStormData, aggregate(FATALITIES, by = list(EVTYPE), sum)) ## total fatalities per event type
colnames(Fatalities) <- c("EVTYPE", "TOTAL") ## set readable column names
Fatal_sub <- subset(Fatalities, Fatalities$TOTAL > 0) ## only keep event types with at least one fatality
Fatal_sub_ord <- Fatal_sub[order(Fatal_sub$TOTAL, decreasing = TRUE),] ## order from highest to lowest
Printed on their own, these totals don't tell us much yet (we can't see which number belongs to which event type), but at least we can confirm they are numeric values rather than characters.
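To pair the totals with their event types before plotting, a quick peek at the ordered table works (optional):

str(Fatal_sub_ord)    ## check the column types: TOTAL should be numeric
head(Fatal_sub_ord)   ## the event types with the highest fatality totals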
Now let's plot. (Note: I did this many times and decided to limit the output to the top 20, as the plots ended up really messy and unclear otherwise.) I followed advice from here for the reordering, as my order() command wasn't giving me what I was looking for.
library(ggplot2)
## map event type (reordered by total) to x and total fatalities to y; set the bar colour via fill
g <- ggplot(Fatal_sub_ord[1:20, ], aes(x = reorder(EVTYPE, -TOTAL), y = TOTAL))
FatalPlot <- g + geom_bar(stat = "identity", fill = "purple") + coord_flip() +
    ylab("Number of Fatalities") + xlab("Event Type") + ggtitle("Top 20 Fatal Natural Events Across the US")
FatalPlot
Now we do the same for injuries.
Injuries <- with(NewStormData, aggregate(INJURIES, by = list(EVTYPE), sum))
colnames(Injuries) <- c("EVTYPE", "TOTAL")
Injur_sub <- subset(Injuries, Injuries$TOTAL > 0)
Injur_sub_ord <- Injur_sub[order(Injur_sub$TOTAL, decreasing=TRUE),]
g2 <- ggplot(Injur_sub_ord[1:20, ], aes(x = reorder(EVTYPE, -TOTAL), y = TOTAL))
InjuryPlot <- g2 + geom_bar(stat = "identity", fill = "brown") + coord_flip() +
    ylab("Number of Injuries") + xlab("Event Type") + ggtitle("Top 20 Injuries by Natural Events Across the US")
InjuryPlot
So far we have considered only the direct impact of weather events on human health; now we turn to the economic impact, which brings us to the second question:

> 2. Across the United States, which types of events have the greatest economic consequences?

For this we focus on the following variables: PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP. These refer to PROPerty DaMaGe and CROP DaMaGe, plus their EXP columns, which hold the exponent (multiplier) for the amounts. Per the documentation, page 12, H = hundreds, K = thousands, M = millions, and B = billions.
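Before converting, it can help to see which exponent codes actually appear in the data, since the raw file also contains some undocumented codes (digits, "+", "?", blanks) besides the letters above; a quick check on the 1995-onwards data:

table(NewStormData$PROPDMGEXP)
table(NewStormData$CROPDMGEXP)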
This part is more involved; after a fair amount of cross-referencing, here is how I tackled it. To keep things simple and consistent, we first convert the letter codes to numeric multipliers, using millions of dollars as the base reference value.
## select the columns we are interested in from the data set
Propdmg <- NewStormData$PROPDMG
Propdmgex <- NewStormData$PROPDMGEXP
Cropdmg <- NewStormData$CROPDMG
Cropdmgex <- NewStormData$CROPDMGEXP
## convert the damage amounts to millions of dollars using the exponent codes
Propdmg[Propdmgex %in% c("M", "m")] <- Propdmg[Propdmgex %in% c("M", "m")] * 1
Propdmg[Propdmgex %in% c("B", "b")] <- Propdmg[Propdmgex %in% c("B", "b")] * 1e+03
Propdmg[Propdmgex %in% c("K", "k")] <- Propdmg[Propdmgex %in% c("K", "k")] * 1e-03
Propdmg[Propdmgex %in% c("H", "h")] <- Propdmg[Propdmgex %in% c("H", "h")] * 1e-04
## any other code is treated as plain dollars, i.e. scaled straight down to millions
Propdmg[!(Propdmgex %in% c("H", "h", "K", "k", "B", "b", "M", "m"))] <- Propdmg[!(Propdmgex %in% c("H", "h", "K", "k", "B", "b", "M", "m"))] * 1e-06
Cropdmg[Cropdmgex %in% c("M", "m")] <- Cropdmg[Cropdmgex %in% c("M", "m")] * 1
Cropdmg[Cropdmgex %in% c("B", "b")] <- Cropdmg[Cropdmgex %in% c("B", "b")] * 1e+03
Cropdmg[Cropdmgex %in% c("K", "k")] <- Cropdmg[Cropdmgex %in% c("K", "k")] * 1e-03
Cropdmg[Cropdmgex %in% c("H", "h")] <- Cropdmg[Cropdmgex %in% c("H", "h")] * 1e-04
Cropdmg[!(Cropdmgex %in% c("H", "h", "K", "k", "B", "b", "M", "m"))] <- Cropdmg[!(Cropdmgex %in% c("H", "h", "K", "k", "B", "b", "M", "m"))] * 1e-06
CropProp <- Cropdmg + Propdmg
EcoDmg <- aggregate(CropProp~NewStormData$EVTYPE, FUN=sum)
EcoDmgOrd <- EcoDmg[order(EcoDmg$CropProp, decreasing=TRUE),]
colnames(EcoDmgOrd) <- c("EVTYPE", "TOTAL")
g3 <- ggplot(EcoDmgOrd[1:20, ], aes(x = reorder(EVTYPE, -TOTAL), y = TOTAL))
EcoDmgPlot <- g3 + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
    ylab("Total Cost (millions of USD)") + xlab("Event Type") + ggtitle("Top 20 Costliest Natural Events Across the US")
EcoDmgPlot
In conclusion, since 1995:

- Highest fatality event: Excessive Heat (heat stroke)
- Highest injury event: Tornadoes
- Highest economic impact: Floods
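These conclusions can be read straight off the ordered tables built above, for example:

head(Fatal_sub_ord, 3)   ## fatalities: EXCESSIVE HEAT at the top
head(Injur_sub_ord, 3)   ## injuries: TORNADO at the top
head(EcoDmgOrd, 3)       ## economic damage: FLOOD at the top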