SUMMARY

This document analyses the United States (U.S.) National Oceanic and Atmospheric Administration’s (NOAA) storm database.

The NOAA database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The following questions and data analysis are presented in the code below:

Q1) What types of weather events are most harmful to population health in the U.S.?

  1. Sum the number of fatalities and injuries by event type and by state
  2. Calculate the maximum number of fatalities and injuries in each state
  3. Determine the event with the maximum number of fatalities and injuries in each state
  4. Determine the event with the maximum number of fatalities and injuries across all states

Q2) What types of weather events have the greatest economic consequences in the U.S.?

  1. Sum the amount of property damage (US Dollars) by event type and by state
  2. Calculate the maximum amount of property damage in each state
  3. Determine the event with the maximum amount of property damage in each state
  4. Determine the event with the maximum amount of property damage across all states
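
The four steps listed for each question can be illustrated on a small made-up table. This is only a sketch: the data frame `toy` and its values are hypothetical and are not taken from the NOAA database, and the variable names are chosen for illustration.

```r
# Toy data (hypothetical values, not taken from the NOAA database)
toy <- data.frame(
  STATE      = c("AL", "AL", "AL", "TX", "TX", "TX"),
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 6, 3),
  stringsAsFactors = FALSE
)

# Step 1: sum fatalities by event type (rows) and by state (columns)
totals <- tapply(toy$FATALITIES, list(toy$EVTYPE, toy$STATE), sum, na.rm = TRUE)

# Step 2: largest per-event total in each state
state_max <- apply(totals, 2, max, na.rm = TRUE)

# Step 3: event type that produced that maximum in each state
state_event <- rownames(totals)[apply(totals, 2, which.max)]
names(state_event) <- colnames(totals)

# Step 4: add up the state maxima by winning event type and pick the largest
national  <- tapply(state_max, state_event, sum)
top_event <- names(which.max(national))
```

In this toy table TORNADO leads in AL and FLOOD leads in TX, so step 4 compares one state maximum per event type.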

DATA PROCESSING

STEPS:

  1. Download NOAA storm data
  2. Examine data structure
#define working directory to store data and results
dname <- file.path("C:/Users/datacent52/Documents/Temilade Adelore_Office", "DataScienceCourse", "ReproducibleResearch") 

#set working directory
setwd(dname) 

#setup libraries
library(lubridate)
library(plyr)
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:lubridate':
## 
##     here
library(ggplot2)

#check to see if file exists in directory and download if it does not exist
destfile <- "./repdata-data-Stormdata.csv.bz2"
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(destfile)) {
    #mode = "wb" keeps the compressed file intact on Windows
    #(the Windows-only setInternet2() call was dropped; it is deprecated in recent R versions)
    download.file(fileURL, destfile, method = "auto", mode = "wb")
}

#read database from file
SD <- read.csv("./repdata-data-Stormdata.csv.bz2")

#what's the structure of the data
str(SD)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
#let's take a look at the data
head(SD)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
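
The storm data file is read without being decompressed first. A minimal sketch of why this works, assuming only base R: file connections opened for reading detect bzip2 compression, so `read.csv` can take the path of a `.bz2` file. The temporary file and its two rows are hypothetical.

```r
# Write a tiny, made-up CSV through a bzip2 connection
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# read.csv opens the compressed file directly
back <- read.csv(tmp, stringsAsFactors = FALSE)
```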

RESULTS

Q1: Across the United States, which events are most harmful to population health?

Storm Data variables that indicate population health include:

  1. Fatalities
  2. Injuries
#Calculate the total number of fatalities (FA) and injuries (IN) per event type (rows) and per state (columns)
FA <- data.frame(tapply(SD$FATALITIES, list(SD$EVTYPE, SD$STATE), sum, na.rm = TRUE))
IN <- data.frame(tapply(SD$INJURIES, list(SD$EVTYPE, SD$STATE), sum, na.rm = TRUE))

#Calculate the most harmful events to population health
#i.e. event type with the maximum number of fatalities 
# initialize a total number of fatalities per state variable (FS)
FS = NULL 

#get maximum number of fatalities per state
FS$max <- sapply(as.list(FA), max, na.rm = TRUE)

#get index (event type) with max number of fatalities in each state 
Fs_maxi <- sapply(as.list(FA), which.max)
Fs_maxi <- row.names(FA[Fs_maxi,])

#remove non alpha characters in name of event type
#(strips suffixes such as ".1" added to repeated row names; also strips spaces)
FS$maxi <- gsub('[^[:alpha:]]', "", Fs_maxi)

FS <- data.frame(FS) 

#Most Harmful event to population health across the U.S.
#Sum the maximum number of fatalities by event type across all U.S. states 
#Determine the event type with the highest sum across all U.S. states
MHF = tapply(FS$max, FS$maxi, sum, na.rm=TRUE)

#Calculate the most harmful events to population health
#i.e. event type with the maximum number of injuries
# initialize total number of injuries per state variable (IS)
IS = NULL

IS$max <- sapply(as.list(IN), max, na.rm = TRUE)

#get index (event type) with max number of injuries in each state
Is_maxi <- sapply(as.list(IN), which.max)
Is_maxi <- row.names(IN[Is_maxi,])

#remove non alpha characters in name of event type
#(strips suffixes such as ".1" added to repeated row names; also strips spaces)
IS$maxi <- gsub('[^[:alpha:]]', "", Is_maxi)

IS <- data.frame(IS) 

#Most Harmful event to population health across the U.S.
#Sum the maximum number of injuries by event type across all U.S. states 
#Determine the event type with the highest sum across all U.S. states
MHI <- tapply(IS$max, IS$maxi, sum, na.rm = TRUE)

#plot maximum number of fatalities in each U.S. state
#png(file = "plot1.png", width = 1006, height = 796, res = 55)
g1 <- ggplot(FS, aes(x = row.names(FS), y = max, fill = factor(maxi))) 
g1 + geom_bar(stat = "identity") + 
        labs(x="U.S. states", 
             y = "Maximum no. of fatalities",
             title = "Most harmful event to population health across the U.S.") 

dev.off()
## null device 
##           1
#plot maximum number of injuries in each U.S. state
#png(file = "plot2.png", width = 1006, height = 796, res = 55)
g2 <- ggplot(IS, aes(x = row.names(IS), y = max, fill = factor(maxi))) 
g2 + geom_bar(stat = "identity") + 
        labs(x="U.S. States", 
             y="Maximum no. of injuries",
             title = "Most harmful event to population health across the U.S.") 

dev.off()
## null device 
##           1

Across the U.S., the event most harmful to population health by fatalities (calculated as the sum of the maximum number of fatalities across U.S. states) is TORNADO, which accounts for 5,118 fatalities in the states where it is the deadliest event type.

Across the U.S., the event most harmful to population health by injuries (calculated as the sum of the maximum number of injuries across U.S. states) is also TORNADO, which accounts for 90,319 injuries in the states where it causes the most injuries.
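
The per-state lookup of the winning event type can be sketched on a toy table. Subsetting a data frame with repeated row indices makes the row names unique by adding suffixes such as ".1", which is why the analysis strips non-alphabetic characters afterwards. An alternative, shown here only as a possible variant and not as the method used above, is to index the vector of row names directly. `FA_toy` and its values are hypothetical.

```r
# Toy totals table (hypothetical values): rows are event types, columns are states
FA_toy <- data.frame(
  AL = c(1, 9, 2),
  GA = c(0, 7, 3),
  TX = c(8, 4, 1),
  row.names = c("FLASH FLOOD", "TORNADO", "TSTM WIND")
)

# Row index of the largest total in each state
idx <- sapply(FA_toy, which.max)

# Approach used in the analysis: subset the data frame, then clean the names
via_subset <- gsub("[^[:alpha:]]", "", row.names(FA_toy[idx, ]))

# Variant: index the row-name vector, so names keep their spaces
direct <- row.names(FA_toy)[idx]
```

Both approaches identify the same rows; they differ only in how multi-word event names such as "FLASH FLOOD" are spelled in the result.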

Q2: Across the U.S., which types of events have the greatest economic consequences?

Storm Data variables that indicate economic loss/damage in dollars include:

  1. Property damage (in dollars)
  2. Crop damage (in dollars)
#reformat (US dollar) amount of property damage 
unique(SD$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
#remove non alpha characters (digits and symbols), leaving the letter codes "K", "M" or "m", "B", and "H" or "h"
SD$PROPDMGEXP <- gsub('[^[:alpha:]]', "", SD$PROPDMGEXP)

#replace "m" with "M" in PROPDMGEXP variable
SD$PROPDMGEXP <- gsub('m', "M", SD$PROPDMGEXP)

#replace character with numeric values in PROPDMGEXP variable
#("H", "h" and empty codes are not mapped; they become NA in as.numeric() below)
SD$PROPDMGEXP <- mapvalues(SD$PROPDMGEXP, from = c("K", "M", "B"), to = c("1000", "1000000", "1000000000"))

#calculate total amount of property damage by event type and by state
PD <- SD$PROPDMG*as.numeric(SD$PROPDMGEXP)
## Warning: NAs introduced by coercion
PD <- data.frame(tapply(PD, list(SD$EVTYPE, SD$STATE), sum, na.rm = TRUE))

#Calculate event with most economic consequence 
#i.e. event types with the maximum amount of property damage per state
# initialize a total amount of property damage per state variable (PDS)
PDS = NULL 

#get maximum amount of property damage per state
PDS$max <- sapply(as.list(PD), max, na.rm = TRUE)

#get index (event type) with max amount of property damage per state
PDs_maxi <- sapply(as.list(PD), which.max)
PDs_maxi <- row.names(PD[PDs_maxi,])

#remove non alpha characters in name of event type
#(strips suffixes such as ".1" added to repeated row names; also strips spaces)
PDS$maxi <- gsub('[^[:alpha:]]', "", PDs_maxi)

PDS <- data.frame(PDS) 

#Event with the most economic consequence across the U.S.
#sum the maximum amounts of property damage across states by event
#Determine the event type with the highest sum of property damage across all U.S. states
MEC <- tapply(PDS$max, PDS$maxi, sum, na.rm = TRUE)

#plot maximum amount of property damage per state
#png(file = "plot3.png", width = 1006, height = 796, res = 55)
g3 <- ggplot(PDS, aes(x = row.names(PDS), y = max, fill = factor(maxi))) + 
                 geom_bar(stat = "identity") 
g3 + labs(x="U.S. States",
          y="Maximum amount of property damage in US Dollars",
        title = "Event with the most economic consequence across the U.S.")

dev.off()
## null device 
##           1

Across the U.S., the event with the greatest economic consequence, determined by the total maximum amount of property damage (in US Dollars), is FLOOD, with US $130,434,488,240 in property damage.
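
The conversion of damage amounts to US dollars can also be sketched with a named lookup vector in base R. This is an illustrative variant under stated assumptions, not the code used above: the sample values and the names `propdmg`, `propdmgexp` and `multiplier` are hypothetical. As in the analysis, only the codes K, M and B are converted, and any other code yields NA.

```r
# Hypothetical sample of damage values and exponent codes
propdmg    <- c(25, 2.5, 1.2, 10, 3)
propdmgexp <- c("K", "m", "B", "", "h")

# Multipliers for the three codes the analysis keeps (K, M, B)
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Upper-case the codes, then look each one up; unmatched codes become NA
mult <- unname(multiplier[toupper(propdmgexp)])

# Damage in US dollars (NA where the code is not K, M or B)
damage_usd <- propdmg * mult

# Total damage, ignoring the unconverted rows
total_usd <- sum(damage_usd, na.rm = TRUE)
```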