Analysis of NOAA Database to Determine Impact of Weather on Population Health and Economy

Synopsis

The aim of this report is to determine which severe weather events are the most harmful to population health and which have the greatest economic consequences. The database used for this analysis is the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database contains, among other things, data on severe weather events and the resulting fatalities, injuries, property damage, and crop damage. The analysis uses data from across the U.S. for the years 1990 to 2011. It was determined that excessive heat and tornadoes caused the most fatalities, and that tornadoes caused the most injuries. It was also determined that floods caused the most property damage and tornadoes caused the most crop damage.

Data Processing

The complete raw NOAA database is read into a dataframe.

# load in libraries needed for analysis
library(dplyr) 
library(ggplot2)
library(lubridate)
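# read in raw data; stringsAsFactors = TRUE keeps text columns as factors,
# matching the structure shown below (R >= 4.0 defaults to FALSE)
dataNOAAraw <- read.csv("repdata-data-StormData.csv.bz2", header = TRUE, stringsAsFactors = TRUE)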
str(dataNOAAraw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

As can be seen from the structure, the database has 902297 observations with 37 variables. It is of interest to reduce the size of the dataframe by filtering out unnecessary variables and dates.

The first step is to determine the number of storm event observations for each year. It is likely that due to lack of good records and record keeping, there is not as much storm data in early years. The intention is to remove years that contribute little to the data.

# extract the year from the date column and add column to database
dataNOAAraw$YEAR <- year(as.Date(dataNOAAraw$BGN_DATE, "%m/%d/%Y %H:%M:%S"))

# histogram to show frequency of events by year
bins <- seq(min(dataNOAAraw$YEAR), max(dataNOAAraw$YEAR)+1, 1)
hist(dataNOAAraw$YEAR, breaks = bins, xlab = "Year", ylab = "Number of Storm Events",
     main ="Total Number of Storm Events for Each Year \nAcross the US", col = "azure2")

As shown in the histogram, the majority of the data was collected from 1990 onward. Therefore, only the rows for the years 1990 and later are retained.

dataNOAA <- dataNOAAraw %>% filter(YEAR >= 1990)
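
The cutoff can also be backed by a number rather than by a visual impression alone. The following is a minimal sketch in base R, using a few made-up dates in the same format as BGN_DATE (they are illustrative and not taken from the database). The names `bgn_date`, `yr`, `events_per_year`, and `share_1990_on` are hypothetical.

```r
# Toy dates in the BGN_DATE format (illustrative values, not real records)
bgn_date <- c("4/18/1950 0:00:00", "6/5/1975 0:00:00",
              "1/1/1990 0:00:00", "7/14/1996 0:00:00",
              "11/28/2011 0:00:00")

# Parse the date part and pull out the year as a number
yr <- as.numeric(format(as.Date(bgn_date, "%m/%d/%Y %H:%M:%S"), "%Y"))

# Events per year, and the fraction of all events from 1990 onward
events_per_year <- table(yr)
share_1990_on <- mean(yr >= 1990)
print(events_per_year)
print(share_1990_on)
```

Applied to the real data, the same idea would be `mean(dataNOAAraw$YEAR >= 1990)`. Judging by the row counts reported by `str` in this report (751740 of 902297), roughly 83% of the records fall in the retained period.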

Additionally, the dataframe structure shows that property damage and crop damage are each represented by two variables. PROPDMGEXP and CROPDMGEXP hold the exponents of the power of ten by which the values in PROPDMG and CROPDMG, respectively, must be multiplied to obtain dollar values. For example, a PROPDMGEXP value of “2” corresponds to a 10^2 multiplier, and a value of “B” means the dollars are in “billions” and corresponds to a 10^9 multiplier. Investigation of the levels of PROPDMGEXP and CROPDMGEXP shows that there are some nonsense values, specifically {“”, “?”, “-”, “+”}.

levels(dataNOAA$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(dataNOAA$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

These nonsense values are assumed to mean an exponent of “0”, that is, a multiplier of 10^0 = 1. The PROPDMGEXP and CROPDMGEXP variables are converted to numeric exponents, and PROPDMG and CROPDMG are then multiplied by the corresponding powers of ten, giving a column of full dollar values for each kind of damage.

# Clean up PROPDMGEXP variable
dataNOAA$PROPDMGEXP <- gsub("^$|\\-|\\?|\\+", "0", dataNOAA$PROPDMGEXP)
dataNOAA$PROPDMGEXP <- gsub("[Bb]", "9", dataNOAA$PROPDMGEXP)
dataNOAA$PROPDMGEXP <- gsub("[Mm]", "6", dataNOAA$PROPDMGEXP)
dataNOAA$PROPDMGEXP <- gsub("[Kk]", "3", dataNOAA$PROPDMGEXP)
dataNOAA$PROPDMGEXP <- gsub("[Hh]", "2", dataNOAA$PROPDMGEXP)
dataNOAA$PROPDMGEXP <- as.numeric(dataNOAA$PROPDMGEXP)

# Clean up CROPDMGEXP variable
dataNOAA$CROPDMGEXP <- gsub("^$|\\-|\\?|\\+", "0", dataNOAA$CROPDMGEXP)
dataNOAA$CROPDMGEXP <- gsub("[Bb]", "9", dataNOAA$CROPDMGEXP)
dataNOAA$CROPDMGEXP <- gsub("[Mm]", "6", dataNOAA$CROPDMGEXP)
dataNOAA$CROPDMGEXP <- gsub("[Kk]", "3", dataNOAA$CROPDMGEXP)
dataNOAA$CROPDMGEXP <- gsub("[Hh]", "2", dataNOAA$CROPDMGEXP)
dataNOAA$CROPDMGEXP <- as.numeric(dataNOAA$CROPDMGEXP)

# Multiply PROPDMG and CROPDMG to get full dollar values
dataNOAA <- dataNOAA %>% mutate(PROPDMG = PROPDMG * 10^PROPDMGEXP, CROPDMG = CROPDMG * 10^CROPDMGEXP)
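
The exponent cleanup above could also be written as a single lookup instead of a chain of substitutions. The sketch below is one possible base R formulation, not the method used in this report. The names `exp_lookup` and `exp_to_power` are hypothetical, and the toy damage values are made up for illustration.

```r
# Hypothetical lookup from exponent symbol to power of ten.
# "-", "?" and "+" are treated as exponent 0, as assumed in the text.
exp_lookup <- c("-" = 0, "?" = 0, "+" = 0,
                "H" = 2, "K" = 3, "M" = 6, "B" = 9)

# Convert a vector of exponent codes to numeric powers of ten
exp_to_power <- function(code) {
  code <- toupper(as.character(code))
  power <- suppressWarnings(as.numeric(code))  # digit codes convert directly
  is_symbol <- is.na(power) & code != ""
  power[is_symbol] <- exp_lookup[code[is_symbol]]
  power[is.na(power)] <- 0                     # blank or unrecognised codes
  power
}

# Toy check: 25 with "K" is 25,000 dollars; 2.5 with "m" is 2,500,000 dollars
dmg  <- c(25, 2.5, 1, 3, 7)
code <- c("K", "m", "B", "", "2")
dollars <- dmg * 10^exp_to_power(code)
print(dollars)
```

One difference from the substitution approach is that a code missing from the lookup falls back to exponent 0 here, whereas `as.numeric` on an unconverted string would yield NA.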

Also, since the intent is to investigate the impact on population health and the economic consequences, only the variables relevant to the analysis are retained. These are:

  • STATE: State in which the event occurred.
  • EVTYPE: The type of storm/weather event.
  • FATALITIES: The number of fatalities due to the weather event.
  • INJURIES: The number of injuries due to the weather event.
  • PROPDMG: The amount in dollars of property damage.
  • CROPDMG: The amount in dollars of crop damage.
  • YEAR: The year in which the event began (derived from BGN_DATE).
# reduction of dataframe
dataNOAAreduced <- dataNOAA %>% select(c(STATE,EVTYPE,FATALITIES,INJURIES,CROPDMG,PROPDMG,YEAR))
str(dataNOAAreduced)
## 'data.frame':    751740 obs. of  7 variables:
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 244 856 856 856 834 856 856 856 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 28 0 0 0 0 0 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMG   : num  0 0 0 0 2500000 0 0 0 25000 25000 ...
##  $ YEAR      : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...

The resulting dataframe “dataNOAAreduced” is the tidy dataset used for the remaining analysis.

Results

The first question to answer with the data is which weather events were the most harmful with respect to population health. The variables pertaining to population health are measurements of fatalities and injuries. Since we are only interested in the most harmful weather events (and not all weather events), we limit our analysis to the top 15 weather events with highest number of fatalities, and the 15 with the highest number of injuries.

# Group the data by weather event and sum the number of fatalities and take only the top 15
topFatalities <- dataNOAAreduced %>% group_by(EVTYPE) %>% summarise(TOTAL = sum(FATALITIES)) %>% arrange(desc(TOTAL)) %>% top_n(15)
## Selecting by TOTAL
# Group the data by weather event and sum the number of injuries and take only the top 15
topInjuries <- dataNOAAreduced %>% group_by(EVTYPE) %>% summarise(TOTAL = sum(INJURIES)) %>% arrange(desc(TOTAL)) %>% top_n(15)
## Selecting by TOTAL
par(mfrow = c(1,2), mar = c(7.1, 5.1, 4.1, 3.1), oma = c(1, 1, 1, 1))
barplot(topFatalities$TOTAL, names.arg = topFatalities$EVTYPE, las=2, cex.names = 0.75, ylab = "Total Fatalities",  main = "Total Fatalities in all US \n for Top Weather Events \n from Years 1990 - 2011", col = "red")
barplot(topInjuries$TOTAL, names.arg = topInjuries$EVTYPE, las=2, cex.names = 0.75, ylab = "Total Injuries", main = "Total Injuries in all US \n for Top Weather Events \n from Years 1990 - 2011", col = "yellow")

The weather events that caused the most fatalities across the U.S. from 1990 to 2011 were excessive heat and tornadoes, with excessive heat slightly ahead. Tornadoes caused the most injuries over the same period, far more than any other weather event. It can be concluded that tornadoes are the most harmful weather event with respect to population health.
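
The ranking step used for these plots (group by event type, sum, sort, keep the largest) can be checked by hand on a small table. The following is a base R sketch on made-up numbers, not data from the database. The names `events`, `totals`, and `top` are hypothetical.

```r
# Toy event table (made-up numbers, for illustration only)
events <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0)
)

# Sum fatalities per event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = events, FUN = sum)

# Sort in decreasing order and keep the n largest totals
n <- 2
top <- head(totals[order(-totals$FATALITIES), ], n)
print(top)
```

Unlike `top_n`, which keeps every row tied for the last place, `head` returns exactly n rows, so the two can differ when totals are tied.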

The second question to answer with the data is which weather events caused the greatest economic consequences. The variables pertaining to economic consequences are measurements of property damage and crop damage. Since we are only interested in the most harmful weather events (and not all weather events), we limit our analysis to the 15 weather events with the highest dollar value of property damage and the 15 with the highest dollar value of crop damage.

# Group the data by weather event and sum the property damage and take only the top 15
topPropDam <- dataNOAAreduced %>% group_by(EVTYPE) %>% summarise(TOTAL = sum(PROPDMG)) %>% arrange(desc(TOTAL)) %>% top_n(15)
## Selecting by TOTAL
# Group the data by weather event and sum the crop damage and take only the top 15
topCropDam <- dataNOAAreduced %>% group_by(EVTYPE) %>% summarise(TOTAL = sum(CROPDMG)) %>% arrange(desc(TOTAL)) %>% top_n(15)
## Selecting by TOTAL
par(mfrow = c(1,2), mar = c(7.1, 4.1, 4.1, 3.1), oma = c(1, 2, 1, 1))
barplot(topPropDam$TOTAL, names.arg = topPropDam$EVTYPE, las=2, cex.names = 0.75, ylab = "Total Dollar Value of Property Damage",  main = "Total Dollar Value of Property Damage \n in all US for Top Weather Events \n from Years 1990 - 2011", col = "brown", cex.main = 0.8, cex.axis = 0.7, cex.lab = 0.7)
barplot(topCropDam$TOTAL, names.arg = topCropDam$EVTYPE, las=2, cex.names = 0.75, ylab = "Total Dollar Value of Crop Damage", main = "Total Dollar Value of Crop Damage \n in all US for Top Weather Events \n from Years 1990 - 2011", col = "green", cex.main =0.8, cex.axis = 0.7, cex.lab = 0.7)

The weather event that caused the most property damage across the U.S. from 1990 to 2011 was flood. The weather event that caused the most crop damage across the U.S. over the same period was tornadoes. It can be concluded that flood and tornadoes are the weather events with the most severe economic consequences.
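
Because this conclusion draws on two separate rankings, a single combined figure per event type can make the comparison more direct. The following is only a sketch on made-up numbers, not a result from the database: it adds property and crop damage per event type and ranks the total. The data frame `damage` and the column `TOTALDMG` are hypothetical names.

```r
# Toy damage table in dollars (made-up numbers, for illustration only)
damage <- data.frame(
  EVTYPE  = c("FLOOD", "TORNADO", "FLOOD", "HAIL", "TORNADO"),
  PROPDMG = c(5e6, 1e5, 2e6, 3e5, 0),
  CROPDMG = c(1e5, 4e6, 0, 2e5, 1e6)
)

# Sum both damage columns per event type
totals <- aggregate(cbind(PROPDMG, CROPDMG) ~ EVTYPE, data = damage, FUN = sum)

# Combined economic damage, ranked from largest to smallest
totals$TOTALDMG <- totals$PROPDMG + totals$CROPDMG
ranked <- totals[order(-totals$TOTALDMG), ]
print(ranked)
```

On the real data, the same two steps could be applied to `dataNOAAreduced`, assuming PROPDMG and CROPDMG have both been converted to full dollar values first.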