Sections

  1. Synopsis

  2. Data Processing

  3. Results

Synopsis

This analysis uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The data spans 1950 to 2011, but earlier records are not as complete as those from later years.

The objectives of the study are to discover which types of event are most harmful to population health, and which types of event have the greatest economic consequences across the whole United States.

We read the data directly from a bz2-compressed CSV file and identified the key variables: number of Injuries, number of Fatalities, dollar cost of Property Damage, and dollar cost of Crop Damage. Looking at these variables over time, it was clear that there was no data for Crop Damage before 1993, and that more data was recorded for all measures from 1993 onwards.

Injuries and deaths were added together as a “health” impact variable, and crop and property damage were added together as a “cost” impact variable. These values were grouped by event type, and the maximums were identified.

The event type with the biggest economic impact was Flash Floods, whereas the event type with the biggest public health impact was Tornados. However, Tornados only ranked first when injuries and deaths were weighted equally. When public health impact was judged by fatalities alone, Excessive Heat had the greatest impact and Tornados came second.

Extracting the data

list.files()
## [1] "repdata-data-StormData.csv.bz2"    "rsconnect"                        
## [3] "Weather data peer assessment.Rmd"  "Weather project.Rproj"            
## [5] "Weather_data_peer_assessment.html" "Weather_data_peer_assessment.Rmd"
## read.csv can read files compressed in the bz2 format directly, so there is no need to decompress first
stormData<-read.csv("repdata-data-StormData.csv.bz2")
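
As a quick check of the point made in the comment above, the sketch below writes a tiny made-up table to a temporary `.bz2` file and reads it back with `read.csv` without decompressing it first. The file and its two rows are invented for illustration and are not taken from the storm database.

```r
# Toy data, invented for illustration only
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write it through a bzip2-compressing connection
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the path with file(), which detects the compression itself
roundtrip <- read.csv(tmp, stringsAsFactors = FALSE)
roundtrip
```

The same mechanism should apply to gzip and xz files, since the detection happens in the underlying file connection rather than in `read.csv` itself.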

Exploring the Data

str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We have 37 variables. Event type is called ‘EVTYPE’. There are several date/time and location variables that we need not be too concerned about for this simple analysis, as well as some on the characteristics of the weather event itself that are also not likely to be relevant.

INJURIES and FATALITIES seem to be the most relevant values for public health, whereas PROPDMG, CROPDMG, PROPDMGEXP and CROPDMGEXP - referring to property and crop damage respectively - seem most relevant to understanding economic consequences. The Data Documentation explains that PROPDMG and CROPDMG should refer to estimated dollar amounts and explains how these estimates are made.
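
The two `EXP` columns carry a magnitude code for the matching damage figure; the Storm Data documentation describes "K" for thousands, "M" for millions and "B" for billions. The sketch below shows one way such codes could be turned into dollar amounts. The helper name `exp_to_multiplier`, the toy rows, and the choice to treat blank or unrecognised codes as a multiplier of 1 are all assumptions made for illustration; the analysis in this report sums the raw `PROPDMG` and `CROPDMG` values as they are.

```r
# Hypothetical helper: map a magnitude code to a numeric multiplier.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[code]
  # Assumption: blank or unrecognised codes leave the value unscaled
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy rows, invented for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)

toy$prop_dollars <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy
```

How the less common codes in these columns (digits, "?", "+", "-") should be interpreted is not settled here, which is why the sketch leaves them unscaled rather than guessing.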

We may wish to use only recent years, depending on how far back good data on these variables go. It is worth plotting each over time.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# create variable of the year of the beginning date of weather events in the dataset

stormData$stormYear<- format(as.Date(stormData$BGN_DATE,"%m/%d/%Y"),"%Y")

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
# summarize data by year
yearData<- sqldf("select 
stormYear
, sum(FATALITIES) total_FATALITIES
, sum(INJURIES) total_INJURIES
, sum(PROPDMG) total_PROPERTY_DAMAGE
, sum(CROPDMG) total_CROP_DAMAGE
from
stormData
group by 
stormYear
"
)

yearData$stormYearFull<-as.POSIXct(paste(yearData$stormYear,"-01-01",sep=""))   
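
The yearly summary above can also be sketched without SQL. The example below, which uses invented rows rather than the real data, extracts the year from a `BGN_DATE`-style string in the same way and then totals by year with base R's `aggregate`. It is an alternative illustration, not the method used in this report.

```r
# Toy rows, invented for illustration only
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
                 "1/5/1993 0:00:00", "7/1/1993 0:00:00"),
  FATALITIES = c(0, 1, 2, 3),
  INJURIES   = c(15, 0, 4, 1),
  stringsAsFactors = FALSE
)

# as.Date parses the date part and ignores the trailing time
toy$stormYear <- format(as.Date(toy$BGN_DATE, "%m/%d/%Y"), "%Y")

# One row per year, with summed fatalities and injuries
toy_by_year <- aggregate(cbind(FATALITIES, INJURIES) ~ stormYear,
                         data = toy, FUN = sum)
toy_by_year
```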

We will now check how the totals for our key variables have changed over time to see whether we want to include all data or only certain years.

library(ggplot2)



injuries_deaths<-ggplot(yearData,aes(stormYearFull))+
         geom_line(aes(y=total_INJURIES,colour="total_INJURIES"))+
         geom_line(aes(y=total_FATALITIES,colour="total_FATALITIES"))+
  xlab("year")+
  ylab("Number of injuries or deaths")+
  ggtitle("Number of deaths and injuries due to weather events by year")
  

  

costs<-ggplot(yearData,aes(stormYearFull))+
         geom_line(aes(y=total_PROPERTY_DAMAGE,colour="total_PROPERTY_DAMAGE"))+
         geom_line(aes(y=total_CROP_DAMAGE,colour="total_CROP_DAMAGE"))+
  xlab("year")+
  ylab("Estimated Dollar Cost")+
  ggtitle("Estimated cost of weather events due to \ncrop and property damage by year")



require(gridExtra)
## Loading required package: gridExtra
grid.arrange(injuries_deaths, costs)

It appears that many more injuries occur than deaths, and that both have increased over time, with a few peaks in both presumably due to large events.

It appears that there is no crop damage data before 1993, and that recorded property damage also increased from that year. For this reason, we will only use damage data from 1993 onwards. To evaluate all measures over the same period, we will also limit our analysis of deaths and injuries to 1993 onwards.
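
The cut-off year can be read from a yearly summary as well as from the plot. A minimal sketch, assuming a table shaped like `yearData` but filled with invented values:

```r
# Toy yearly totals, invented for illustration only
toy_year <- data.frame(
  stormYear         = c("1991", "1992", "1993", "1994"),
  total_CROP_DAMAGE = c(0, 0, 120.5, 300),
  stringsAsFactors  = FALSE
)

# First year with any recorded crop damage
first_crop_year <- min(toy_year$stormYear[toy_year$total_CROP_DAMAGE > 0])
first_crop_year
```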

Event type is the variable by which we want to compare costs and health risks. We will therefore summarize these variables by event type for data from 1993 onwards.

## create dataset grouping fatalities, injuries, property damage and crop damage by event type
eventData<- sqldf("select 
EVTYPE
, sum(FATALITIES) total_FATALITIES
, sum(INJURIES) total_INJURIES
, sum(FATALITIES) + sum(INJURIES) total_HEALTH
, sum(PROPDMG) total_PROPERTY_DAMAGE
, sum(CROPDMG) total_CROP_DAMAGE
, sum(PROPDMG)+sum(CROPDMG) total_COST
, avg(FATALITIES) avg_FATALITIES
, avg(INJURIES) avg_INJURIES
, avg(PROPDMG) avg_PROPERTY_DAMAGE
, avg(CROPDMG) avg_CROP_DAMAGE


from
stormData

where stormYear >= '1993'

group by 
EVTYPE

order by
total_FATALITIES desc, total_INJURIES desc, total_COST desc
"
)
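
The filter in the query above compares `stormYear` as text. That works here because every year is a four-digit string, so text order and numeric order agree. The sketch below illustrates the filter and the grouping on invented rows, using base R instead of `sqldf`; it is an illustration of the idea rather than a replacement for the query.

```r
# Toy rows, invented for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL", "HAIL"),
  stormYear  = c("1992", "1993", "1993", "2011"),
  FATALITIES = c(5, 2, 0, 1),
  INJURIES   = c(10, 3, 1, 0),
  stringsAsFactors = FALSE
)

# Text comparison and numeric comparison select the same rows
same_rows <- identical(toy$stormYear >= "1993",
                       as.integer(toy$stormYear) >= 1993)

# Keep 1993 onwards, total by event type, then combine into one health measure
recent   <- toy[toy$stormYear >= "1993", ]
by_event <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                      data = recent, FUN = sum)
by_event$total_HEALTH <- by_event$FATALITIES + by_event$INJURIES
by_event
```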

Results

We can look for the type of event that has caused the highest combined injuries and deaths to judge which has the biggest health impact.

max_health_impact<-eventData[eventData$total_HEALTH == max(eventData$total_HEALTH),]
max_health_impact
##    EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 2 TORNADO             1621          23310        24931
##   total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 2               1387757          100018.5    1487776     0.06261588
##   avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 2    0.9004172            53.60619        3.863509
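
Selecting rows with `== max(...)`, as above, returns every row that ties for the maximum. The sketch below contrasts this with `which.max`, which returns only the first such row. The data are invented so that a tie actually occurs.

```r
# Toy totals with a deliberate tie, invented for illustration only
toy <- data.frame(
  EVTYPE       = c("A", "B", "C"),
  total_HEALTH = c(10, 25, 25),
  stringsAsFactors = FALSE
)

# Keeps all rows that share the maximum (here B and C)
all_max <- toy[toy$total_HEALTH == max(toy$total_HEALTH), ]

# Keeps only the first row that reaches the maximum (here B)
first_max <- toy[which.max(toy$total_HEALTH), ]
```

For this report the distinction does not change the outcome, since each maximum shown is reached by a single event type.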

This gives us Tornado as the event type with the largest number of combined injuries and deaths. However, we might reasonably decide that deaths matter more than injuries, and instead say that the event type with the most fatalities has the greatest impact on health.

max_fatalities<-eventData[eventData$total_FATALITIES == max(eventData$total_FATALITIES),]
max_fatalities
##           EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 1 EXCESSIVE HEAT             1903           6525         8428
##   total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 1                  1460             494.4     1954.4       1.134088
##   avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 1     3.888558           0.8700834       0.2946365

Looking at the event type with the maximum number of fatalities gives us Excessive Heat instead of Tornado as having the worst health impact. Checking injuries alone puts Tornado back on top.

max_injuries<-eventData[eventData$total_INJURIES == max(eventData$total_INJURIES),]
max_injuries
##    EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 2 TORNADO             1621          23310        24931
##   total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 2               1387757          100018.5    1487776     0.06261588
##   avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 2    0.9004172            53.60619        3.863509

For economic impact it is more reasonable to combine the effects on both our dependent variables, since the amounts are all expressed in dollar costs and there is no reason to rate either crop damage or property damage as being more important.

max_cost<-eventData[eventData$total_COST == max(eventData$total_COST),]
max_cost
##        EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 3 FLASH FLOOD              978           1777         2755
##   total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 3               1420125          179200.5    1599325     0.01801868
##   avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 3   0.03273947            26.16439        3.301591

Flash floods are the most expensive type of weather event. We can also identify the most expensive event type separately for crop damage and for property damage.

max_crop_dmg<-eventData[eventData$total_CROP_DAMAGE == max(eventData$total_CROP_DAMAGE),]
max_crop_dmg
##    EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 55   HAIL               10            960          970
##    total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 55              688693.4          579596.3    1268290   4.408607e-05
##    avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 55  0.004232263            3.036179        2.555212

Hail had the biggest economic impact through crop damage.

max_property_dmg<-eventData[eventData$total_PROPERTY_DAMAGE == max(eventData$total_PROPERTY_DAMAGE),]
max_property_dmg
##        EVTYPE total_FATALITIES total_INJURIES total_HEALTH
## 3 FLASH FLOOD              978           1777         2755
##   total_PROPERTY_DAMAGE total_CROP_DAMAGE total_COST avg_FATALITIES
## 3               1420125          179200.5    1599325     0.01801868
##   avg_INJURIES avg_PROPERTY_DAMAGE avg_CROP_DAMAGE
## 3   0.03273947            26.16439        3.301591

However, flash floods had the biggest economic impact through property damage.

We can also create graphical representations to compare the event types with the biggest impacts.

library(scales)

# create plot of the top 5 events by total cost since 1993

costplot<-ggplot(data=arrange(top_n(eventData,5,total_COST),desc(total_COST)),aes(EVTYPE,total_COST))+geom_bar(stat="identity")+ggtitle("Top 5 Event Types by Total Cost")+xlab("Event Type")+ylab("Total Dollar Cost")

costplot

Flash floods are the most expensive, closely followed by Tornados.
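
One detail of bar charts like this one: ggplot2 places a discrete x variable in factor-level order (alphabetical for plain text), so sorting the rows with `arrange()` does not by itself sort the bars. A minimal sketch of one way to order the categories by value uses `reorder()`; the costs below are invented, and the column name `EVTYPE_ordered` is hypothetical.

```r
# Toy totals, invented for illustration only
top_toy <- data.frame(
  EVTYPE     = c("HAIL", "FLASH FLOOD", "TORNADO"),
  total_COST = c(1, 3, 2),
  stringsAsFactors = FALSE
)

# Build a factor whose levels run from highest to lowest cost
top_toy$EVTYPE_ordered <- reorder(top_toy$EVTYPE, -top_toy$total_COST)
levels(top_toy$EVTYPE_ordered)

# In a plot, the same idea could be written inline, for example:
# ggplot(top_toy, aes(reorder(EVTYPE, -total_COST), total_COST)) +
#   geom_bar(stat = "identity")
```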

# create plot of the top 5 events by total number of injuries and deaths combined

healthplot<-ggplot(data=arrange(top_n(eventData,5,total_HEALTH),desc(total_HEALTH)),aes(EVTYPE,total_HEALTH))+geom_bar(stat="identity")+ggtitle("Top 5 Event Types by Total Public Health Impact")+xlab("Event Type")+ylab("Total Injuries and Deaths")

healthplot

Tornados are a clear outlier for public health impact.