Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This data analysis addresses the following questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
The analysis shows that tornadoes, heat and lightning were the leading causes of injury and death, whereas property and crop damage was caused mostly by flash floods, tornadoes and thunderstorm wind.
We are using data made available by the National Weather Service at their Web site. See https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
The data was downloaded in its compressed format (bzip2) and extracted to a working directory. The extracted file used for analysis is called “repdata-data-StormData.csv”.
The first thing we’ll do is examine the first few rows of the data file to see if there are column headers and to get an idea of the kinds of data and the number of columns it contains.
## Load libraries
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Set working directory
setwd("/Users/mitchellfawcett/Documents/RProjects/StormActivity")
## Download compressed file from Web site.
url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, "stormdata.csv.bz2", method = "curl")
## This is a large csv file, so a few lines are read in first to see how many
## columns there are and whether there are column headers.
someData <- read.csv("stormdata.csv.bz2", nrows = 5)
someData
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 NA NA NA NA 0
## 2 TORNADO 0 NA NA NA NA 0
## 3 TORNADO 0 NA NA NA NA 0
## 4 TORNADO 0 NA NA NA NA 0
## 5 TORNADO 0 NA NA NA NA 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 NA NA 14.0 100 3 0 0
## 2 NA 0 NA NA 2.0 150 2 0 0
## 3 NA 0 NA NA 0.1 123 2 0 0
## 4 NA 0 NA NA 0.0 100 2 0 0
## 5 NA 0 NA NA 0.0 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0 NA NA NA NA
## 2 0 2.5 K 0 NA NA NA NA
## 3 2 25.0 K 0 NA NA NA NA
## 4 2 2.5 K 0 NA NA NA NA
## 5 2 2.5 K 0 NA NA NA NA
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 NA 1
## 2 3042 8755 0 0 NA 2
## 3 3340 8742 0 0 NA 3
## 4 3458 8626 0 0 NA 4
## 5 3412 8642 0 0 NA 5
The csv file has a first row containing column names so we’ll read in the file with header = TRUE.
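A lighter-weight way to check for a header row is to read just the first line of the compressed file. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary bzip2 file (the file contents and the names tmp, firstLine and columnNames are assumptions, not part of the original analysis) and then reads the first line back with base R's bzfile() and readLines().

```r
# Write a tiny made-up CSV through a bzip2 connection so the sketch is self-contained.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES,INJURIES", "TORNADO,0,15", "HAIL,0,0"), con)
close(con)

# Read only the first line of the compressed file to see whether it looks like a header.
con <- bzfile(tmp, open = "rt")
firstLine <- readLines(con, n = 1)
close(con)

# Naive split on commas; a real header with quoted names would need read.csv() instead.
columnNames <- strsplit(firstLine, ",", fixed = TRUE)[[1]]
columnNames
```

With the real download, the same two steps would presumably be applied to "stormdata.csv.bz2" in place of the temporary file.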
Next, explore the data a little by reading the first 1000 rows, with character values not converted to factors, so we can get an idea of the data they contain.
someData <- read.csv("stormdata.csv.bz2",
header = TRUE,
stringsAsFactors = FALSE,
nrows = 1000)
str(someData)
## 'data.frame': 1000 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : int 130 145 1600 900 1500 2000 100 900 2000 2000 ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : logi NA NA NA NA NA NA ...
## $ BGN_LOCATI: logi NA NA NA NA NA NA ...
## $ END_DATE : logi NA NA NA NA NA NA ...
## $ END_TIME : logi NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : logi NA NA NA NA NA NA ...
## $ END_LOCATI: logi NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: logi NA NA NA NA NA NA ...
## $ WFO : logi NA NA NA NA NA NA ...
## $ STATEOFFIC: logi NA NA NA NA NA NA ...
## $ ZONENAMES : logi NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : logi NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Based on our review of a few rows of data, some of the columns we’ll probably be interested in analyzing will be STATE, EVTYPE (storm event type), FATALITIES, INJURIES, PROPDMG (property damage), and CROPDMG (crop damage).
Read all of the data into a data frame called “allData”.
allData <- read.csv("stormdata.csv.bz2",
header = TRUE,
na.strings = "NA")
According to documentation at the National Weather Service Web site, the EVTYPE (event type) field has undergone significant data formatting changes over time. To quote from their Collection Sources Web site:
“From 1996-1999, the event type field was a free-text field so there were many, many variations of event types. Most of the events were standardized into the 48 current event types in 2013. In 2000 the NWS added a drop-down selector for Event Type on the data entry interface, which standardized the Event Type values sent to NCDC.”
See https://www.ncdc.noaa.gov/stormevents/details.jsp?type=collection
For this reason, we will focus on weather events that have occurred since 2000. This will help to maintain a level of consistency when comparing the types of severe weather and the damage they cause. Also by focusing on more recent storm data we take into account population distribution (people) and land use (crops) that are more representative of the present time.
## Find storm data that occurred in 2000 and later. Wrapping the comparison in which() also
## eliminates rows of storm data that do not have valid dates of occurrence.
recentData <- allData[which(as.Date(as.character(allData$BGN_DATE), format = "%m/%d/%Y") >= as.Date("2000-01-01")), ]
dim(recentData)
## [1] 523163 37
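The behavior of the date filter can be checked on a few made-up rows. This is a minimal sketch, not part of the original analysis: the toy data frame and the names toy, bgn and toyRecent are hypothetical. It relies on two base R behaviors: as.Date() with an explicit format ignores trailing text such as the time, and returns NA for strings it cannot parse. Using which() drops those NA comparisons instead of returning rows filled with NA.

```r
# Toy rows (made up) with BGN_DATE in the same text format as the storm data.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/2000 0:00:00",
               "11/15/2011 0:00:00", "not a date"),
  EVTYPE = c("TORNADO", "HAIL", "FLASH FLOOD", "TORNADO"),
  stringsAsFactors = FALSE
)

# Only the leading month/day/year part is parsed; unparseable text becomes NA.
bgn <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# which() keeps only positions where the comparison is TRUE, so NA dates are dropped.
toyRecent <- toy[which(bgn >= as.Date("2000-01-01")), ]
toyRecent$EVTYPE
```

Without which(), the row whose date failed to parse would come back as a row of NA values rather than being removed.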
Now reduce the dataset to the columns of interest to speed up later computations.
## Select columns of interest
narrowRecentData <- recentData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG", "STATE", "BGN_DATE")]
narrowRecentData$EVTYPE <- toupper(narrowRecentData$EVTYPE)
dim(narrowRecentData)
## [1] 523163 7
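Converting EVTYPE to upper case merges labels that differ only in letter case, but not labels that differ in surrounding whitespace. The sketch below, on made-up labels, shows how base R's trimws() could be combined with toupper() if such variants turn up. The vector ev and the name evClean are hypothetical, and the step itself is an optional extra rather than something the original analysis performs.

```r
# Made-up event labels with stray whitespace and mixed case.
ev <- c(" HIGH SURF ADVISORY", "  Flash Flood", "TSTM WIND ", "tornado")

# Trim leading/trailing whitespace first, then normalize case.
evClean <- toupper(trimws(ev))
evClean

# Upper-casing alone leaves two distinct labels; trimming as well leaves one.
nCaseOnly <- length(unique(toupper(c(" TSTM WIND", "TSTM WIND"))))
nTrimmed  <- length(unique(toupper(trimws(c(" TSTM WIND", "TSTM WIND")))))
c(nCaseOnly, nTrimmed)
```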
We present two main results:
1. What types of weather events have caused the greatest number of fatalities and injuries for the years 2000 to 2012?
2. What types of weather events have caused the greatest amount of crop damage and property damage for the years 2000 to 2012?
## Group the data by event type.
byEvtype <- group_by(narrowRecentData, EVTYPE)
## Calculate the sums of fatalities, injuries, crop damage, & property damage for each grouping of event type.
totalByEventType <- data.frame(summarise(byEvtype,
Fatalities = sum(FATALITIES),
Injuries = sum(INJURIES),
FatalitiesAndInjuries = sum(FATALITIES) + sum(INJURIES),
PropertyDamage = sum(PROPDMG),
CropDamage = sum(CROPDMG),
PropertyAndCropDamage = sum(PROPDMG) + sum(CROPDMG)))
str(totalByEventType)
## 'data.frame': 196 obs. of 7 variables:
## $ EVTYPE : chr " HIGH SURF ADVISORY" " FLASH FLOOD" " TSTM WIND" " WATERSPOUT" ...
## $ Fatalities : num 0 0 0 0 0 0 0 0 0 179 ...
## $ Injuries : num 0 0 0 0 0 0 0 0 0 126 ...
## $ FatalitiesAndInjuries: num 0 0 0 0 0 0 0 0 0 305 ...
## $ PropertyDamage : num 200 50 8 0 0 ...
## $ CropDamage : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PropertyAndCropDamage: num 200 50 8 0 0 ...
## Sort the event group sums and find the events with the highest combined fatalities and injuries.
health <- totalByEventType[order(totalByEventType$FatalitiesAndInjuries, decreasing = TRUE),][1:15, c("EVTYPE", "Fatalities", "Injuries")]
## Divide values by one thousand so x axis can be in units of thousands.
health$Fatalities <- health$Fatalities / 1000
health$Injuries <- health$Injuries / 1000
Create a stacked bar graph showing weather events with the greatest impact on population health.
## Transpose rows and columns of health dataframe.
# first remember the names
n <- health$EVTYPE
## transpose all but the first column (EVTYPE) of health dataframe.
health <- as.data.frame(t(health[,-1]))
colnames(health) <- n
## Set up parameters to be used for bar graph
par(las=2) # make label text perpendicular to axis
par(mar=c(5,10,4,0.1)) # increase y-axis margin.
barplot(as.matrix(health),
horiz = TRUE,
cex.names=0.8,
cex.lab = .8,
xlab = "Number Fatalities and Injuries (Thousands)",
main = "Figure 1 - Weather Effect on Population Health \n 2000 to 2012*",
col = gray.colors(2))
legend("topright", inset=.05,
c("Fatalities","Injuries"),
fill = gray.colors(2))
Figure 1 shows the top weather event types responsible for the greatest number of human fatalities and injuries between 2000 and 2012. The totals for each event type are cumulative for the entire period. Tornadoes are the greatest cause of death and injury, followed by excessive heat and lightning.
*Refer to the National Weather Service Web site for definitions of weather event types: http://w1.weather.gov/glossary/index.php?letter=t
Create a stacked bar graph showing weather events with the greatest impact on property and crops.
## Sort the event group sums and find the events with the highest combined property and crop damage.
damage <- totalByEventType[order(totalByEventType$PropertyAndCropDamage, decreasing = TRUE),][1:15, c("EVTYPE", "PropertyDamage", "CropDamage")]
## Divide dollar values by one million so the x axis can be in units of millions.
damage$PropertyDamage <- damage$PropertyDamage / 1000000
damage$CropDamage <- damage$CropDamage /1000000
## Transpose rows and columns of damage dataframe.
# first remember the names
n <- damage$EVTYPE
## transpose all but the first column (EVTYPE) of damage dataframe.
damage <- as.data.frame(t(damage[,-1]))
colnames(damage) <- n
## Set up parameters to be used for bar graph
par(las=2) # make label text perpendicular to axis
par(mar=c(5,10,4,0.1)) # increase y-axis margin.
barplot(as.matrix(damage),
horiz = TRUE,
cex.names=0.8,
cex.lab = .8,
xlab = "Property and Crop Damage (Millions of Dollars)",
xaxt="n",
main = "Figure 2 - Weather Effect on Property and Crops \n 2000 to 2012",
col = gray.colors(2))
legend("topright", inset=.05,
c("Property Damage","Crop Damage"),
fill = gray.colors(2))
options(scipen=5)
axis(1, at=axTicks(1), labels=sprintf("$%s", axTicks(1)), las = 2)
Figure 2 shows the top weather event types responsible for the greatest dollar loss of property and crops between 2000 and 2012. The totals for each event type are cumulative for the entire period. Flash floods were responsible for the greatest amount of property and crop damage, followed by tornadoes and thunderstorm wind.
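One caveat on the damage totals: the data set stores each damage figure as a number (PROPDMG, CROPDMG) alongside a magnitude code (PROPDMGEXP, CROPDMGEXP, for example "K" in the sample rows shown earlier), and the sums above use the numbers alone. The sketch below shows, on made-up rows, how the codes might be applied before summing. It assumes that K, M and B stand for thousands, millions and billions, and that any other code means no scaling; both are assumptions to verify against the NWS documentation. The names toy, multiplier and PropertyDollars are hypothetical.

```r
# Made-up rows: damage amount plus a magnitude code, as in PROPDMG / PROPDMGEXP.
toy <- data.frame(
  EVTYPE = c("TORNADO", "TORNADO", "FLASH FLOOD", "HAIL"),
  PROPDMG = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "K", "M", ""),
  stringsAsFactors = FALSE
)

# Assumed meaning of the codes: K = thousands, M = millions, B = billions.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each row's multiplier; codes not in the table come back as NA.
m <- unname(multiplier[toupper(toy$PROPDMGEXP)])
m[is.na(m)] <- 1  # assumption: unknown or blank codes mean "no scaling"

# Dollar value per row, then total per event type.
toy$PropertyDollars <- toy$PROPDMG * m
totals <- tapply(toy$PropertyDollars, toy$EVTYPE, sum)
totals
```

Applying the same idea to the real data would require keeping PROPDMGEXP and CROPDMGEXP when selecting the columns of interest, and it could change the ranking shown in Figure 2.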
As the National Weather Service mentions on their Web site, there is variability in how weather events are described during their data collection process. Further analysis of the data might be enhanced by grouping the events into broader categories such as Rain, High Wind, Cold, Flood, etc. For example, “Thunderstorm Wind” and “TSTM Wind” could represent the same type of event and so should be combined into one event type. For the purposes of this assignment it was not felt to be necessary to make these groupings.
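The grouping suggested above could start from a small lookup table of aliases. The following is a minimal sketch on made-up rows, assuming that "TSTM WIND" and "THUNDERSTORM WIND" do denote the same kind of event. The alias table, the toy data frame and the injury counts are all hypothetical.

```r
# Hypothetical alias table: name = label as recorded, value = label to use instead.
aliases <- c("TSTM WIND" = "THUNDERSTORM WIND")

# Made-up rows for illustration only.
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "TORNADO", "TSTM WIND"),
  INJURIES = c(3, 5, 10, 2),
  stringsAsFactors = FALSE
)

# Replace only the labels that appear in the alias table; leave the rest untouched.
hit <- toy$EVTYPE %in% names(aliases)
toy$EVTYPE[hit] <- unname(aliases[toy$EVTYPE[hit]])

# Totals per event type after merging the aliases.
totals <- tapply(toy$INJURIES, toy$EVTYPE, sum)
totals
```

Extending the alias table to broader categories such as Flood or High Wind would follow the same pattern, with one entry per recorded label.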