This analysis uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The data spans 1950 to 2011, but earlier records are not as complete as those from later years.
The objectives of the study are to discover which types of event are most harmful to population health, and which types of event have the greatest economic consequences across the whole United States.
We read the data directly from a bz2-compressed CSV file and identified the key variables: number of Injuries, number of Fatalities, dollar cost of Property Damage, and dollar cost of Crop Damage. Looking at these variables over time, it was clear that before 1993 there was no data for Crop Damage, and that more data was recorded for all measures from 1993 onwards.
Injuries and deaths were added together as a “health” impact variable, and crop and property damage were added together as a “cost” impact variable. The damage figures were summed as recorded in PROPDMG and CROPDMG, without applying the magnitude codes in PROPDMGEXP and CROPDMGEXP. These values were grouped by event type, and the maximums were identified.
The event type with the biggest economic impact was Flash Floods, whereas the event type with the biggest public health impact was Tornados. However, Tornados were only number one when weighing injuries and deaths equally. If public health impact was judged by fatalities, Excessive Heat had the greatest impact and Tornados came second.
list.files()
## [1] "repdata-data-StormData.csv.bz2" "rsconnect"
## [3] "Weather data peer assessment.Rmd" "Weather project.Rproj"
## [5] "Weather_data_peer_assessment.html" "Weather_data_peer_assessment.Rmd"
## read.csv can handle files compressed in the bz2 format directly, so there's no need to decompress first
stormData<-read.csv("repdata-data-StormData.csv.bz2")
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We have 37 variables. Event type is called ‘EVTYPE’. There are several date/time and location variables that we need not be too concerned about for this simple analysis, as well as some on the characteristics of the weather event itself that are also not likely to be relevant.
INJURIES and FATALITIES seem to be the most relevant values for public health, whereas PROPDMG, CROPDMG, PROPDMGEXP and CROPDMGEXP, which describe property and crop damage respectively, seem most relevant to understanding economic consequences. The Data Documentation explains that PROPDMG and CROPDMG are estimated dollar amounts, with PROPDMGEXP and CROPDMGEXP holding a magnitude code (“K” for thousands, “M” for millions, “B” for billions), and explains how these estimates are made.
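The analysis in this report sums PROPDMG and CROPDMG as recorded. As a minimal sketch of how the magnitude codes could be applied instead, the helper below converts a code to a multiplier. The name `exp_multiplier` and the toy rows are hypothetical, and treating every code other than K, M and B as a multiplier of 1 is an assumption, not something the documentation specifies.

```r
# Hypothetical helper: map a magnitude code to a numeric multiplier.
# Assumption: only K, M and B are recognised; anything else
# (blank, "?", "+", digits, ...) is treated as 1.
exp_multiplier <- function(code) {
  code <- toupper(trimws(as.character(code)))
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[code])   # unmatched codes come back as NA
  mult[is.na(mult)] <- 1
  mult
}

# Toy rows, invented for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)
toy$PROP_DOLLARS <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

Summing a column like `PROP_DOLLARS` rather than PROPDMG would weight a 2.5M event far above a 25K event, which the raw sums do not.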
We may wish to use only recent years, depending on how far back good data on these variables go. It is worth plotting each over time.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# create variable of the year of the beginning date of weather events in the dataset
stormData$stormYear<- format(as.Date(stormData$BGN_DATE,"%m/%d/%Y"),"%Y")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
## dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
## Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
## Reason: image not found
## Could not load tcltk. Will use slower R code instead.
## Loading required package: RSQLite
## Loading required package: DBI
# summarize data by year
yearData<- sqldf("select
stormYear
, sum(FATALITIES) total_FATALITIES
, sum(INJURIES) total_INJURIES
, sum(PROPDMG) total_PROPERTY_DAMAGE
, sum(CROPDMG) total_CROP_DAMAGE
from
stormData
group by
stormYear
"
)
yearData$stormYearFull<-as.POSIXct(paste(yearData$stormYear,"-01-01",sep=""))
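The same year extraction and yearly totals can be sketched in base R without sqldf. The toy rows below are invented for illustration; only the date format is taken from BGN_DATE.

```r
# Toy rows in the same date format as BGN_DATE; the counts are invented.
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "6/8/1950 0:00:00", "1/1/1966 0:00:00"),
  FATALITIES = c(0, 1, 2),
  INJURIES   = c(15, 0, 3),
  stringsAsFactors = FALSE
)

# The format string only covers the date part; the trailing time is ignored.
toy$stormYear <- format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y")

# Sum both measures within each year
yearly <- aggregate(cbind(FATALITIES, INJURIES) ~ stormYear, data = toy, FUN = sum)
print(yearly)
```

Because `stormYear` is a four-digit character string, comparisons such as `stormYear >= '1993'` sort the same way as the numeric years would.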
We will now check how the totals for our key variables have changed over time to see whether we want to include all data or only certain years.
library(ggplot2)
injuries_deaths<-ggplot(yearData,aes(stormYearFull))+
geom_line(aes(y=total_INJURIES,colour="total_INJURIES"))+
geom_line(aes(y=total_FATALITIES,colour="total_FATALITIES"))+
xlab("year")+
ylab("Number of injuries or deaths")+
ggtitle("Number of deaths and injuries due to weather events by year")
costs<-ggplot(yearData,aes(stormYearFull))+
geom_line(aes(y=total_PROPERTY_DAMAGE,colour="total_PROPERTY_DAMAGE"))+
geom_line(aes(y=total_CROP_DAMAGE,colour="total_CROP_DAMAGE"))+
xlab("year")+
ylab("Estimated Dollar Cost")+
ggtitle("Estimated cost of weather events due to \ncrop and property damage by year")
require(gridExtra)
## Loading required package: gridExtra
grid.arrange(injuries_deaths, costs)
It appears that many more injuries occur than deaths, and that both have increased over time, with a few peaks in both presumably due to large events.
It appears that there is no crop damage data before 1993, and that property damage estimates also increased from that year. For this reason, we will only use damage data from 1993 onwards. To evaluate all measures over the same period, we will also limit our analysis of deaths and injuries to 1993 onwards.
Event type is the variable by which we want to find the highest costs and the highest risk to health. We will look at our key variables by event type for data from 1993 onwards.
## create dataset grouping fatalities, injuries, property damage and crop damage by event type
eventData<- sqldf("select
EVTYPE
, sum(FATALITIES) total_FATALITIES
, sum(INJURIES) total_INJURIES
, sum(FATALITIES) + sum(INJURIES) total_HEALTH
, sum(PROPDMG) total_PROPERTY_DAMAGE
, sum(CROPDMG) total_CROP_DAMAGE
, sum(PROPDMG)+sum(CROPDMG) total_COST
, avg(FATALITIES) avg_FATALITIES
, avg(INJURIES) avg_INJURIES
, avg(PROPDMG) avg_PROPERTY_DAMAGE
, avg(CROPDMG) avg_CROP_DAMAGE
from
stormData
where stormYear >= '1993'
group by
EVTYPE
order by
total_FATALITIES desc, total_INJURIES desc, total_COST desc
"
)
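The grouping above uses EVTYPE exactly as recorded. The `str` output shows 985 levels, the first of which (" HIGH SURF ADVISORY") begins with a space, so some labels may differ only in spacing or case. A minimal sketch of one possible clean-up step follows; apart from that first label, the example strings are hypothetical.

```r
# " HIGH SURF ADVISORY" appears in the str() output above;
# the other labels are hypothetical variants for illustration.
raw_types <- c(" HIGH SURF ADVISORY", "Tornado", "TORNADO",
               "flash  flood", "FLASH FLOOD")

# Collapse repeated whitespace, trim the ends, and upper-case
clean_types <- toupper(trimws(gsub("\\s+", " ", raw_types)))
print(table(clean_types))
```

Grouping on a cleaned column of this kind would merge such variants before totals are compared. Whether that changes any ranking in this report has not been checked here.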
We can look for the type of event that has caused the highest combined injuries and deaths to judge which has the biggest health impact.
max_health_impact<-eventData[eventData$total_HEALTH == max(eventData$total_HEALTH),]
max_health_impact
## EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 2 TORNADO 1621 23310 24931
## total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 2 1387757 100018.5 1487776 0.06261588
## avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 2 0.9004172 53.60619 3.863509
This gives us Tornado as the event type with the largest number of combined injuries and deaths. However, we might reasonably decide that deaths are more important than injuries, and choose instead to say that the event type with the most fatalities has the greatest impact on health.
max_fatalities<-eventData[eventData$total_FATALITIES == max(eventData$total_FATALITIES),]
max_fatalities
## EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 1 EXCESSIVE HEAT 1903 6525 8428
## total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 1 1460 494.4 1954.4 1.134088
## avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 1 3.888558 0.8700834 0.2946365
Looking at the event type with the maximum number of fatalities gives us excessive heat instead of Tornado as having the worst health impact.
max_injuries<-eventData[eventData$total_INJURIES == max(eventData$total_INJURIES),]
max_injuries
## EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 2 TORNADO 1621 23310 24931
## total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 2 1387757 100018.5 1487776 0.06261588
## avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 2 0.9004172 53.60619 3.863509
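The choice between the combined ranking and the fatalities ranking comes down to how heavily a death is weighted against an injury. As a rough sketch using only the Tornado and Excessive Heat totals printed above, we can score each event type as w × fatalities + injuries and find the weight w at which the two swap places. The weighting scheme is an illustrative assumption, and other event types are not considered.

```r
# Totals copied from the output above (1993 onwards)
health <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT"),
  fatalities = c(1621, 1903),
  injuries   = c(23310, 6525),
  stringsAsFactors = FALSE
)

# Hypothetical score: one death counts as w injuries
score <- function(w) {
  setNames(w * health$fatalities + health$injuries, health$EVTYPE)
}

# Weight at which the two scores are equal:
# 1621 * w + 23310 = 1903 * w + 6525
break_even <- (23310 - 6525) / (1903 - 1621)
print(round(break_even, 2))   # about 59.52

print(names(which.max(score(1))))     # equal weighting
print(names(which.max(score(100))))   # deaths weighted heavily
```

Under this scoring, Tornado stays ahead of Excessive Heat unless one death is counted as roughly 60 or more injuries.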
For economic impact it is more reasonable to combine the two damage variables, since both are recorded as dollar estimates and there is no reason to rate either crop damage or property damage as more important. The totals below use PROPDMG and CROPDMG as recorded, without the PROPDMGEXP and CROPDMGEXP magnitude codes.
max_cost<-eventData[eventData$total_COST == max(eventData$total_COST),]
max_cost
## EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 3 FLASH FLOOD 978 1777 2755
## total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 3 1420125 179200.5 1599325 0.01801868
## avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 3 0.03273947 26.16439 3.301591
Flash floods are the most expensive type of weather event. We can also look separately at the most expensive event type for crop damage and for property damage.
max_crop_dmg<-eventData[eventData$total_CROP_DAMAGE == max(eventData$total_CROP_DAMAGE),]
max_crop_dmg
## EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 55 HAIL 10 960 970
## total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 55 688693.4 579596.3 1268290 4.408607e-05
## avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 55 0.004232263 3.036179 2.555212
Hail had the biggest economic impact through crop damage.
max_property_dmg<-eventData[eventData$total_PROPERTY_DAMAGE == max(eventData$total_PROPERTY_DAMAGE),]
max_property_dmg
## EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 3 FLASH FLOOD 978 1777 2755
## total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 3 1420125 179200.5 1599325 0.01801868
## avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 3 0.03273947 26.16439 3.301591
However, flash floods had the biggest economic impact through property damage.
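The three separate look-ups for total cost, crop damage and property damage can be condensed into one step with `which.max`. The sketch below is limited to the four event types whose totals are printed in this report, so it illustrates the technique and does not replace the full ranking. The data frame name `costs` is hypothetical.

```r
# Damage totals copied from the outputs in this report (as recorded, 1993 onwards)
costs <- data.frame(
  EVTYPE   = c("FLASH FLOOD", "TORNADO", "HAIL", "EXCESSIVE HEAT"),
  property = c(1420125, 1387757, 688693.4, 1460),
  crop     = c(179200.5, 100018.5, 579596.3, 494.4),
  stringsAsFactors = FALSE
)
costs$total <- costs$property + costs$crop

# For each measure, pick the event type in the row with the largest value
top_by_measure <- sapply(
  costs[c("property", "crop", "total")],
  function(col) costs$EVTYPE[which.max(col)]
)
print(top_by_measure)
```

Unlike filtering with `== max(...)`, `which.max` returns a single row even if two event types tie.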
We can also create graphical representations to compare the event types with the biggest impacts.
library(scales)
# create plot of the top 5 events by total cost since 1993
# reorder() sets the bar order by cost; sorting rows alone does not change a discrete axis
costplot<-ggplot(data=top_n(eventData,5,total_COST),aes(reorder(EVTYPE,-total_COST),total_COST))+
geom_bar(stat="identity")+
ggtitle("Top 5 Event Types by Total Cost")+
xlab("Event Type")+
ylab("Total Dollar Cost")
costplot
Flash floods are the most expensive, closely followed by Tornados.
#create plot of the top 5 events by total number of injuries and deaths combined
# reorder() sets the bar order by combined injuries and deaths
healthplot<-ggplot(data=top_n(eventData,5,total_HEALTH),aes(reorder(EVTYPE,-total_HEALTH),total_HEALTH))+
geom_bar(stat="identity")+
ggtitle("Top 5 Event Types by Total Public Health Impact")+
xlab("Event Type")+
ylab("Total Injuries and Deaths")
healthplot
Tornados are a clear outlier for public health impact.