This document analyses the United States (U.S.) National Oceanic and Atmospheric Administration's (NOAA) storm database.
The NOAA database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This document addresses two questions; the analysis steps for each are listed here and implemented in the code that follows.
Q1) What types of weather events are most harmful to population health in the U.S.?
1. Sum the number of fatalities and injuries by event type and by state.
2. Calculate the maximum number of fatalities and injuries in each state.
3. Determine the event with the maximum number of fatalities and injuries in each state.
4. Determine the event with the maximum number of fatalities and injuries across all states.
Q2) What types of weather events have the greatest economic consequences in the U.S.?
1. Sum the amount of property damage (US Dollars) by event type and by state.
2. Calculate the maximum amount of property damage in each state.
3. Determine the event with the maximum amount of property damage in each state.
4. Determine the event with the maximum amount of property damage across all states.
#define working directory to store data and results
dname <- file.path("C:/Users/datacent52/Documents/Temilade Adelore_Office", "DataScienceCourse", "ReproducibleResearch")
#set working directory
setwd(dname)
#setup libraries
library(lubridate)
library(plyr)
##
## Attaching package: 'plyr'
## The following object is masked from 'package:lubridate':
##
## here
library(ggplot2)
#check to see if file exists in directory and download if it does not exist
destfile = "./repdata-data-Stormdata.csv.bz2"
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
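if (!file.exists(destfile)) {
  #setInternet2(TRUE) is omitted: it exists only on Windows builds of R
  #mode = "wb" keeps the compressed file intact on Windows
  download.file(fileURL, destfile, method = "auto", mode = "wb")
}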
#read database from file
SD <- read.csv("./repdata-data-Stormdata.csv.bz2")
#whats the structure of the data
str(SD)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
#lets take a look at the data
head(SD)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
#Calculate the total number of fatalities (FA) and injuries (IN) per state per event type
FA <- data.frame(tapply(SD$FATALITIES, list(SD$EVTYPE, SD$STATE), sum, na.rm = TRUE))
IN <- data.frame(tapply(SD$INJURIES, list(SD$EVTYPE, SD$STATE), sum, na.rm = TRUE))
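To make the shape of FA and IN concrete, here is a minimal sketch on a made-up four-row data frame. The `toy` object and its values are illustrative assumptions, not NOAA data. Passing a list of two grouping vectors to `tapply` returns an event-by-state table, with NA wherever a combination never occurs.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  STATE      = c("AL", "AL", "AL", "TX"),
  FATALITIES = c(1, 2, 4, 5)
)
# Rows are event types, columns are states
toy_FA <- data.frame(tapply(toy$FATALITIES, list(toy$EVTYPE, toy$STATE), sum, na.rm = TRUE))
print(toy_FA)
```

The NA cells are why the later `max` calls pass `na.rm = TRUE`.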
#Calculate the most harmful events to population health
#i.e. event type with the maximum number of fatalities
# initialize a total number of fatalities per state variable (FS)
FS = NULL
#get maximum number of fatalities per state
FS$max <- sapply(as.list(FA), max, na.rm = TRUE)
#get index (event type) with max number of fatalities in each state
Fs_maxi <- sapply(as.list(FA), which.max)
Fs_maxi <- row.names(FA[Fs_maxi,])
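#remove non alpha characters in name of event type: this strips the ".1", ".2", ... suffixes R appends to repeated row names above (and also any spaces or punctuation in the name)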
FS$maxi <- gsub('[^[:alpha:]]', "", Fs_maxi)
FS <- data.frame(FS)
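The event-name lookup above goes through `row.names(FA[Fs_maxi,])`, and subsetting a data frame with repeated row indices makes the row names unique by adding numeric suffixes. The sketch below, on an invented table (`toy_FA` and its values are hypothetical), shows that effect and one possible alternative that indexes the vector of row names directly. It is an illustration, not the method used in this analysis.

```r
# Hypothetical toy table: event types in rows, states in columns
toy_FA <- data.frame(AL = c(4, 9), GA = c(1, 7), TX = c(5, 2),
                     row.names = c("FLOOD", "HEAT WAVE"))
# Row index of the top event in each state
idx <- sapply(as.list(toy_FA), which.max)
# Subsetting rows repeats "HEAT WAVE", so R makes the row names unique
mangled <- row.names(toy_FA[idx, ])
# Stripping non-alpha characters removes the suffix, but also the space
stripped <- gsub('[^[:alpha:]]', "", mangled)
# Indexing the vector of names keeps them unchanged
direct <- row.names(toy_FA)[idx]
print(mangled)
print(stripped)
print(direct)
```

With the direct lookup, event names keep their original spelling, so labels such as "HEAT WAVE" are not collapsed into "HEATWAVE".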
#Most Harmful event to population health across the U.S.
#Sum the maximum number of fatalities by event type across all U.S. states
#Determine the event type with the highest sum across all U.S. states
MHF = tapply(FS$max, FS$maxi, sum, na.rm=TRUE)
#Calculate the most harmful events to population health
#i.e. event type with the maximum number of injuries
# initialize total number of injuries per state variable (IS)
IS = NULL
IS$max <- sapply(as.list(IN), max, na.rm = TRUE)
#get index (event type) with max number of injuries in each state
Is_maxi <- sapply(as.list(IN), which.max)
Is_maxi <- row.names(IN[Is_maxi,])
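#remove non alpha characters in name of event type: this strips the ".1", ".2", ... suffixes R appends to repeated row names above (and also any spaces or punctuation in the name)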
IS$maxi <- gsub('[^[:alpha:]]', "", Is_maxi)
IS <- data.frame(IS)
#Most Harmful event to population health across the U.S.
#Sum the maximum number of injuries by event type across all U.S. states
#Determine the event type with the highest sum across all U.S. states
MHI <- tapply(IS$max, IS$maxi, sum, na.rm = TRUE)
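MHF and MHI hold one sum per event type, but the final step named in the comments, picking the event type with the highest sum, is not shown in the code. A minimal sketch of that step, using an invented per-state table shaped like FS (`toy_FS`, `top_event` and `top_total` are hypothetical names):

```r
# Hypothetical per-state results, shaped like FS above (values invented)
toy_FS <- data.frame(max  = c(10, 3, 8),
                     maxi = c("TORNADO", "FLOOD", "TORNADO"),
                     row.names = c("AL", "LA", "TX"))
# Sum the state maxima by event type, as done for MHF and MHI
toy_MHF <- tapply(toy_FS$max, toy_FS$maxi, sum, na.rm = TRUE)
# Event type with the largest sum, and the sum itself
top_event <- names(which.max(toy_MHF))
top_total <- max(toy_MHF)
print(top_event)
print(top_total)
```

Applied to the real objects, the same two calls on MHF and MHI would give the event types and totals quoted in the results.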
#plot maximum number of fatalities in each U.S. state
#png(file = "plot1.png", width = 1006, height = 796, res = 55)
g1 <- ggplot(FS, aes(x = row.names(FS), y = max, fill = factor(maxi)))
g1 + geom_bar(stat = "identity") +
  labs(x = "U.S. states",
       y = "Maximum no. of fatalities",
       title = "Most harmful event to population health across the U.S.")
dev.off()
## null device
## 1
#plot maximum number of injuries in each U.S. state
#png(file = "plot2.png", width = 1006, height = 796, res = 55)
g2 <- ggplot(IS, aes(x = row.names(IS), y = max, fill = factor(maxi)))
g2 + geom_bar(stat = "identity") +
  labs(x = "U.S. States",
       y = "Maximum no. of injuries",
       title = "Most harmful event to population health across the U.S.")
dev.off()
## null device
## 1
Across the U.S., the most harmful event type to population health, measured by fatalities, is TORNADO: summing each state's highest per-event fatality total by event type gives 5,118 fatalities for TORNADO.
Measured by injuries, the most harmful event type is also TORNADO, with 90,319 injuries under the same calculation.
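These figures are sums of state-level maxima, which is not the same as a nationwide total per event type: casualties in states where an event type is not the top cause are left out. The sketch below contrasts the two calculations on invented numbers. The `toy` data frame and every value in it are hypothetical.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "FLOOD"),
  STATE      = c("AL", "AL", "TX", "TX"),
  FATALITIES = c(10, 2, 3, 8)
)
# Nationwide total per event type
national <- tapply(toy$FATALITIES, toy$EVTYPE, sum, na.rm = TRUE)
# Event-by-state totals, then the top event and its total in each state
by_state <- tapply(toy$FATALITIES, list(toy$EVTYPE, toy$STATE), sum, na.rm = TRUE)
top_idx <- apply(by_state, 2, which.max)
top_val <- apply(by_state, 2, max, na.rm = TRUE)
# Sum of state maxima, grouped by the event type that is top in each state
state_max_sum <- tapply(top_val, rownames(by_state)[top_idx], sum)
print(national)
print(state_max_sum)
```

In this toy case TORNADO totals 13 nationwide but only 10 as a sum of state maxima, because its 3 fatalities in TX, where FLOOD is the top event, are not counted.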
#reformat (US dollar) amount of property damage
unique(SD$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
#remove non alpha characters (digits and symbols), leaving only the letter codes "K", "M" or "m", "B" and "H" or "h"
SD$PROPDMGEXP <- gsub('[^[:alpha:]]', "", SD$PROPDMGEXP)
#replace "m" with "M" in PROPDMGEXP variable
SD$PROPDMGEXP <- gsub('m', "M", SD$PROPDMGEXP)
#replace "K", "M" and "B" with numeric multipliers in PROPDMGEXP variable; "H", "h" and blanks are left unmapped and become NA in the conversion below
SD$PROPDMGEXP <- mapvalues(SD$PROPDMGEXP, from = c("K", "M", "B"), to = c("1000", "1000000", "1000000000"))
#calculate total amount of property damage by event type and by state
PD <- SD$PROPDMG*as.numeric(SD$PROPDMGEXP)
## Warning: NAs introduced by coercion
PD <- data.frame(tapply(PD, list(SD$EVTYPE, SD$STATE), sum, na.rm = TRUE))
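The recoding above can also be written with a named lookup vector in base R, which avoids the coercion warning and makes the treatment of unmapped codes explicit. This is a sketch of an alternative, not the method used in this analysis. The names `dmg`, `exp_code` and `mult` and the sample values are hypothetical. As in the code above, codes other than K, M and B yield NA. Unlike the code above, lowercase "k" or "b" would also be mapped, although neither appears in the levels listed earlier.

```r
# Hypothetical sample of damage values and exponent codes
dmg <- c(25, 2.5, 1, 3, 7)
exp_code <- c("K", "m", "B", "", "h")
# Named lookup vector: exponent code -> multiplier
mult <- c(K = 1e3, M = 1e6, B = 1e9)
# toupper() folds "m" into "M"; match() returns NA for codes not in the lookup
idx <- match(toupper(exp_code), names(mult))
dollars <- dmg * unname(mult[idx])
print(dollars)
```

Rows with a blank or "h" code come out as NA and are dropped by any later `sum(..., na.rm = TRUE)`, matching the behaviour of the original pipeline.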
#Calculate event with most economic consequence
#i.e. event types with the maximum amount of property damage per state
# initialize a total amount of property damage per state variable (PDS)
PDS = NULL
#get maximum amount of property damage per state
PDS$max <- sapply(as.list(PD), max, na.rm = TRUE)
#get index (event type) with max amount of property damage per state
PDs_maxi <- sapply(as.list(PD), which.max)
PDs_maxi <- row.names(PD[PDs_maxi,])
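#remove non alpha characters in name of event type: this strips the ".1", ".2", ... suffixes R appends to repeated row names above (and also any spaces or punctuation in the name)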
PDS$maxi <- gsub('[^[:alpha:]]', "", PDs_maxi)
PDS <- data.frame(PDS)
#Event with the most economic consequence across the U.S.
#sum the maximum amounts of property damage across states by event
#Determine the event type with the highest sum of property damage across all U.S. states
MEC <- tapply(PDS$max, PDS$maxi, sum, na.rm = TRUE)
#plot maximum amount of property damage per state
#png(file = "plot3.png", width = 1006, height = 796, res = 55)
g3 <- ggplot(PDS, aes(x = row.names(PDS), y = max, fill = factor(maxi))) +
  geom_bar(stat = "identity")
g3 + labs(x = "U.S. States",
          y = "Maximum amount of property damage in US Dollars",
          title = "Event with the most economic consequence across the U.S.")
dev.off()
## null device
## 1
Across the U.S., the event type with the greatest economic consequences is FLOOD: summing each state's highest per-event property damage total by event type gives US $130,434,488,240 in property damage for FLOOD.