This report is to submit an analysis to a hypothetical government or municipal manager who would be responsible in preparing for severe weather events and will need to prioritize resources for different types of events. Albeit, there are no specific recommendations in this report, the analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database is sufficient with the following two questions:
The data and meta data supporting this analysis is located in the following locations:
library(ggplot2)
library(reshape2)
setwd("/home/alanoakes/Documents/Programming Notes/DataScience/Mod5 - Repoducible Research/NoaaStormData")
url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url=url,destfile="repdata_data_StormData.csv.bz2")
NOAA.Data<-read.csv("repdata_data_StormData.csv.bz2")
list("Dimensions of Data Set"=dim(NOAA.Data),
"% NA's in Data Set"=mean(is.na(NOAA.Data)),
"Total Count of NA's of Data Set"=sum(is.na(NOAA.Data)),
"Variable Names"=names(NOAA.Data))
## $`Dimensions of Data Set`
## [1] 902297 37
##
## $`% NA's in Data Set`
## [1] 0.05229737
##
## $`Total Count of NA's of Data Set`
## [1] 1745947
##
## $`Variable Names`
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
As we can see with this data set provided by NOAA it has a decent amount of rows and columns. It has a total of 1,745,947 misssing data points throughout the data set making up 5.23% of the data provided. There are 37 columns (“variables”) to our data. This amount of variables gives us many things we could include in our total analysis. However, for our analysis we will need to focus on the variables supporting our analysis on population health and economic consquences. Before we are able to make any real measurements to our two questions we must determine a sample of most frequent and consequential storm types to evaluate in the variable “EVTYPE”.
Next, I will build a top-ten list of the most frequent and consequential storm types to appriately size our data for the initial two questions.
TopTen<-table(NOAA.Data$EVTYPE)
TopTenD<-as.data.frame(TopTen)
TopTenL<-head(TopTenD[order(-TopTenD$Freq),],10)
EventTypes<-TopTenL$Var1
EventTypes<-as.character(EventTypes)
NOAA.Data$EVTYPE<-as.character(NOAA.Data$EVTYPE)
head(TopTenD[order(-TopTenD$Freq),],10)
## Var1 Freq
## 244 HAIL 288661
## 856 TSTM WIND 219940
## 760 THUNDERSTORM WIND 82563
## 834 TORNADO 60652
## 153 FLASH FLOOD 54277
## 170 FLOOD 25326
## 786 THUNDERSTORM WINDS 20843
## 359 HIGH WIND 20212
## 464 LIGHTNING 15754
## 310 HEAVY SNOW 15708
As we can see with our above top-ten list, the frequency of occurences for these storm types is rather large. I notice in our top-ten items that wind has been notated four seperate times (e.g. “TSTM WIND”,“THUNDERSTORM WIND”,“THUNDERSTORM WINDS”,“HIGH WIND”). There is no absolute notation from the NWS to clarify these four wind categories, but suffice it to say this has been a vary frequent occurence in the United States.
Next we will pre-process our data with these top-ten storm types to perform our two question national impact analysis.
# Question 1
Question1<-NOAA.Data[NOAA.Data$EVTYPE==EventTypes,c(8,23,24)]
## Warning in NOAA.Data$EVTYPE == EventTypes: longer object length is not a
## multiple of shorter object length
names(Question1)<-c("EventType","Fatalities","Injuries")
Question1$EventType<-as.factor(Question1$EventType)
Question1$Fatalities<-as.numeric(Question1$Fatalities)
Question1$Injuries<-as.numeric(Question1$Injuries)
Q1Melt<-melt(Question1,na.rm=T)
## Using EventType as id variables
# Question 2
Question2<-NOAA.Data[NOAA.Data$EVTYPE==EventTypes,c(8,26,28)]
## Warning in NOAA.Data$EVTYPE == EventTypes: longer object length is not a
## multiple of shorter object length
names(Question2)<-c("EventType","PropDamage","CropDamage")
Question2$EventType<-as.factor(Question2$EventType)
Question2$PropDamage<-as.numeric(Question2$PropDamage)
Question2$CropDamage<-as.numeric(Question2$CropDamage)
Q2Melt<-melt(Question2,na.rm=T)
## Using EventType as id variables
# Summaries of Both Datasets
list("Question 1 Data Set"=summary(Q1Melt),
"Question 2 Data Set"=summary(Q2Melt))
## $`Question 1 Data Set`
## EventType variable value
## HAIL :57446 Fatalities:80517 Min. : 0.0000
## TSTM WIND :44456 Injuries :80517 1st Qu.: 0.0000
## THUNDERSTORM WIND:16618 Median : 0.0000
## TORNADO :12172 Mean : 0.0947
## FLASH FLOOD :10820 3rd Qu.: 0.0000
## FLOOD : 5092 Max. :1228.0000
## (Other) :14430
##
## $`Question 2 Data Set`
## EventType variable value
## HAIL :57446 PropDamage:80517 Min. : 1.00
## TSTM WIND :44456 CropDamage:80517 1st Qu.: 1.00
## THUNDERSTORM WIND:16618 Median : 1.00
## TORNADO :12172 Mean : 5.73
## FLASH FLOOD :10820 3rd Qu.: 7.00
## FLOOD : 5092 Max. :19.00
## (Other) :14430
Before we forward to our results section, we need to summarize our pre-processed data before we can visualize it for final analysis and ask this data our two questions. To state a simple summary, both of these two datasets now only have three variables instead of the original thrity-seven. Also with reducing our dataset with our top-ten storm types we now have a sampling size of n = 161,034 rows (“observations”) of the initial population size of N = 902,297 rows; giving us a sample size of n = 17.8471169%.
I put together two seperate stacked barplots to determine two variables for analysis:
# Plot
ggplot(data=Q1Melt,aes(x=EventType,y=value,fill=variable))+
geom_bar(stat="identity")+coord_flip()+theme_linedraw()
# Table
Q1<-as.data.frame(tapply(Q1Melt$value,Q1Melt$EventType,sum))
as.data.frame(Q1[order(-Q1),])
## Q1[order(-Q1), ]
## TORNADO 12308
## TSTM WIND 895
## LIGHTNING 653
## FLASH FLOOD 271
## FLOOD 263
## HEAVY SNOW 262
## THUNDERSTORM WIND 222
## HIGH WIND 165
## HAIL 141
## THUNDERSTORM WINDS 71
Question 1: Across the United States, which types of events are most harmful with respect to population health?
Analysis 1: Tornado frequencies of fatalities and injuries far out-weighed the other event types. Also, the four wind event types would easily come in second if they were combined together. Based off of our table which sums the fatalities and injuries per event type, we show nothing comes close to tornadoes with 12,308 counts of population health impacts. Our four wind event types equal a sum of 1,353 total injuries and fatalities.
# Plot
ggplot(data=Q2Melt,aes(x=EventType,y=value,fill=variable))+
geom_bar(stat="identity",position="stack")+
coord_flip()+theme_linedraw()
# Table
Q2<-as.data.frame(tapply(Q2Melt$value,Q2Melt$EventType,sum))
as.data.frame(Q2[order(-Q2),])
## Q2[order(-Q2), ]
## HAIL 255380
## THUNDERSTORM WIND 197902
## TSTM WIND 150818
## TORNADO 102388
## FLASH FLOOD 75968
## FLOOD 41609
## HIGH WIND 34270
## LIGHTNING 22878
## THUNDERSTORM WINDS 22843
## HEAVY SNOW 18622
Question 2: Across the United States, which types of events have the greatest economic consequences?
Analysis 2: Per the above stacked barplot and the table there are many event types which impacted our country economically in both property damage and crop damage. Cumulatively: Hail, Thunderstorm Wind, Tstm Wind and Tornadoes impacted the most with an above $100M worth of damage. The anlysis was conlcusive on event type frequency instead of economic values alone since resource allocation will be our judgement. Our two questions lends itself to most expensive by issue occurence rather than expense occurence alone so to have resources readily available for reocurrence. However, I have provided a top-ten table of event types after summing property damage expenses and crop damage expenses per event type exluding my previous event type sampling.
# compare to most expensive against most occuring
NOAA.Data$EVTYPE<-as.character(NOAA.Data$EVTYPE)
NOAA.Data$PROPDMGEXP<-as.numeric(NOAA.Data$PROPDMGEXP)
NOAA.Data$CROPDMGEXP<-as.numeric(NOAA.Data$CROPDMGEXP)
Expense<-NOAA.Data$PROPDMGEXP+NOAA.Data$CROPDMGE
ExpEvents<-cbind.data.frame(NOAA.Data$REFNUM,NOAA.Data$EVTYPE,NOAA.Data$PROPDMGEXP,NOAA.Data$CROPDMGEXP)
ExpEvents$Expense<-ExpEvents$`NOAA.Data$PROPDMGEXP`+ExpEvents$`NOAA.Data$CROPDMGEXP`
x1<-tapply(ExpEvents$Expense,ExpEvents$`NOAA.Data$EVTYPE`,sum)
x1.1<-as.data.frame(x1)
head(as.data.frame(x1.1[order(-x1),]),10)
## x1.1[order(-x1), ]
## HAIL 2545662
## THUNDERSTORM WIND 1965426
## TSTM WIND 1486346
## TORNADO 1017324
## FLASH FLOOD 769164
## FLOOD 414429
## HIGH WIND 336095
## THUNDERSTORM WINDS 235766
## LIGHTNING 232770
## WINTER STORM 183641
Comparatively, these two economic top-ten event type tables are the same. I actually expected to see hurricanes or volcanoes included in this list. Rather hail, flooding and wind have been the most expensive event types across our country.
We have drawn sufficient conclusions of the provided NOAA data. First we sampled the data per the event types that were most frequent. Our population health impact conclusion was very easily tornadoes and wind caused the most fatalities and injuries. Our economic conclusion was hail, variants of wind and tornadoes were the most expensive. When reanalyzing against most occuring and most expensive, our top-ten economic list output was the same.