This analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, available at “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2”. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The following transformations were applied to the database:
Recoding the cost units of property and crop damage given in the variables PROPDMGEXP & CROPDMGEXP, and creating a new variable ECONOMICCOST that holds the sum of property and crop damage costs, after excluding all storm and weather events that had no population health or economic impact.
Mapping the 482 event types in the EVTYPE variable (including typos) to the 48 storm and weather event types listed in the National Weather Service Storm Data Documentation at “https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf”, by creating a variable EVTYPEM after summarizing the data by year.
Creating a new data table stmavgcost that summarizes the mean yearly fatalities, injuries and economic costs due to each of the 49 event types (including the “Others” category). Since data for the various event types is not available across all years, the mean is a more appropriate measure than the sum.
The final step of the analysis identifies the top ten events causing the most fatalities, injuries and economic losses, which addresses the following questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?
The following chunk downloads the database to the working directory.
download.file(
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"StormData.csv.bz2"
)
stmdata <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)
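Because the archive is large, re-running the download on every run can be slow. The following is a minimal sketch of a cached download, not part of the original analysis; the helper name download_if_missing is hypothetical, and mode = "wb" is used so that the binary .bz2 file is not corrupted on Windows.

```r
# Hypothetical helper: download a file only if it is not already present.
# Returns TRUE (invisibly) if a download happened, FALSE otherwise.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))  # already cached, nothing to do
  }
  download.file(url, destfile, mode = "wb")  # binary mode for the .bz2 archive
  invisible(TRUE)
}

# Usage (commented out so the sketch itself makes no network request):
# download_if_missing(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "StormData.csv.bz2"
# )
```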
The stmdata data frame has 902,297 weather event observations of 37 variables, recorded from 1950 to 2011.
In the following code chunk, all weather events that had no health or economic impact, along with the variables not required for the analysis, are removed to create the stormdata data frame, which has 254,633 observations and 9 variables.
The cost units in the ‘PROPDMGEXP’ & ‘CROPDMGEXP’ variables are recoded as numeric multipliers so that they can be multiplied with the cost figures in the ‘PROPDMG’ & ‘CROPDMG’ variables. This creates a new variable ECONOMICCOST, which gives the total value of economic loss due to a particular weather event. Another variable, YEAR, is created to identify the year of occurrence of the weather event. After these transformations, stormdata has 11 variables.
library(plyr)
library(dplyr)
library(stringdist)
library(data.table)
library(ggplot2)
stormdata <-
subset(
stmdata,
FATALITIES != 0 |
INJURIES != 0 |
PROPDMG != 0 |
CROPDMG != 0,
select = c(BGN_DATE, STATE:EVTYPE, FATALITIES:CROPDMGEXP)
)
# Recode the damage unit codes into numeric multipliers (still stored as text here).
stormdata$PROPDMGEXP[stormdata$PROPDMGEXP %in% c("+", "-", "0", "")] <- 1
stormdata$PROPDMGEXP[stormdata$PROPDMGEXP %in% c("H", "h", "2")] <- 100
stormdata$PROPDMGEXP[stormdata$PROPDMGEXP %in% c("K", "k", "3")] <- 1000
stormdata$PROPDMGEXP[toupper(stormdata$PROPDMGEXP) == "4"] <- 10000
stormdata$PROPDMGEXP[toupper(stormdata$PROPDMGEXP) == "5"] <- 100000
stormdata$PROPDMGEXP[stormdata$PROPDMGEXP %in% c("M", "m", "6")] <- 1000000
stormdata$PROPDMGEXP[toupper(stormdata$PROPDMGEXP) == "7"] <- 10000000
stormdata$PROPDMGEXP[toupper(stormdata$PROPDMGEXP) == "B"] <- 1000000000
stormdata$CROPDMGEXP[stormdata$CROPDMGEXP %in% c("0", "?", "")] <- 1
stormdata$CROPDMGEXP[stormdata$CROPDMGEXP %in% c("K", "k", "3")] <- 1000
stormdata$CROPDMGEXP[stormdata$CROPDMGEXP %in% c("M", "m", "6")] <- 1000000
stormdata$CROPDMGEXP[toupper(stormdata$CROPDMGEXP) == "B"] <- 1000000000
stormdata$EVTYPE <-
gsub("[Tt][Ss][Tt][Mm]", "THUNDERSTORM", stormdata$EVTYPE)
# Extract the calendar year from the event begin date
# (as.POSIXlt()$year counts years since 1900).
stormdata$YEAR <-
as.POSIXlt(as.Date(stormdata$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))$year +
1900
stormdata$CROPDMGEXP <- as.numeric(stormdata$CROPDMGEXP)
stormdata$PROPDMGEXP <- as.numeric(stormdata$PROPDMGEXP)
stormdata <-
stormdata %>% mutate(
ECONOMICCOST = (PROPDMG * PROPDMGEXP) + (CROPDMG * CROPDMGEXP)
)
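The PROPDMGEXP recoding above can also be expressed as a single named lookup vector. The sketch below is an alternative formulation rather than the code used in this analysis. The helper name exp_multiplier is hypothetical, it covers only the PROPDMGEXP rules, and it assumes that any code not listed in those rules should become NA.

```r
# Hypothetical helper: translate damage unit codes into numeric multipliers.
exp_multiplier <- function(code) {
  multipliers <- c(
    "+" = 1, "-" = 1, "0" = 1,
    "H" = 1e2, "2" = 1e2,
    "K" = 1e3, "3" = 1e3,
    "4" = 1e4, "5" = 1e5,
    "M" = 1e6, "6" = 1e6,
    "7" = 1e7, "B" = 1e9
  )
  code <- toupper(code)            # treat "k" and "K" alike
  out <- unname(multipliers[code]) # unknown codes become NA
  out[which(code == "")] <- 1      # an empty code cannot be used as a name, so handle it here
  out
}

exp_multiplier(c("K", "m", "B", "", "2", "?"))
# 1e+03 1e+06 1e+09 1e+00 1e+02 NA
```

Keeping the rules in one table makes it easier to see which codes are handled and which fall through as NA.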
The stmyrdata data frame created in the code chunk below has 1371 observations of 6 variables. It is a yearly summary of the weather event details based on the 482 EVTYPE values, and provides the yearly sums of fatalities, injuries and economic cost. Reducing the number of rows from 254,633 to 1371 improves the processing speed. The vector lookup contains all 48 weather events listed in the National Weather Service Storm Data Documentation. The 482 EVTYPE values are mapped to the 48 event types in the lookup vector using the amatch function of the stringdist package, and the mapped value is stored in the variable EVTYPEM. The method used in amatch is “jw”, and maxDist is given a low value of 0.25 to ensure a close match. The 158 EVTYPE values that remain unmatched are given the event type name “Others” in the EVTYPEM variable. Most of the event types in “Others” are then matched to the nearest of the 48 weather events using the ‘grepl’ function.
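The idea of matching each raw label to its nearest official label, subject to a distance cutoff, can be sketched in base R. This is only an illustration. It swaps in utils::adist (generalized Levenshtein edit distance), normalized by the length of the longer string, in place of the “jw” method used in this analysis, so its distances and the 0.25 cutoff are not comparable to the amatch results. The function nearest_label and the shortened lookup vector are hypothetical.

```r
# Hypothetical helper: nearest label by normalized edit distance, else "Others".
nearest_label <- function(x, lookup, max_dist = 0.25) {
  d <- adist(tolower(x), tolower(lookup))            # rows: x, columns: lookup
  longest <- outer(nchar(x), nchar(lookup), pmax)    # length of the longer string
  d <- d / longest                                   # scale distances to [0, 1]
  best <- apply(d, 1, which.min)                     # closest lookup entry per row
  ok <- d[cbind(seq_along(x), best)] <= max_dist     # keep only close matches
  unname(ifelse(ok, lookup[best], "Others"))
}

short_lookup <- c("Flood", "Flash Flood", "Thunderstorm Wind", "Tornado")
nearest_label(c("THUNDERSTORM WINDS", "FLOODS", "APACHE COUNTY"), short_lookup)
# "Thunderstorm Wind" "Flood" "Others"
```

A low cutoff trades coverage for precision: fewer labels are matched automatically, but those that are matched are close to an official name.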
stmyrdata <-
stormdata %>% group_by(YEAR, EVTYPE) %>% summarize(
FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES),
ECONOMICCOST = sum(ECONOMICCOST)
)
lookup <-
c(
"Astronomical Low Tide",
"Avalanche",
"Blizzard",
"Coastal Flood",
"Cold/Wind Chill",
"Debris Flow",
"Dense Fog",
"Dense Smoke",
"Drought",
"Dust Devil",
"Dust Storm",
"Excessive Heat",
"Extreme Cold/Wind Chill",
"Flash Flood",
"Flood",
"Freezing Fog",
"Frost/Freeze",
"Funnel Cloud",
"Hail",
"Heat",
"Heavy Rain",
"Heavy Snow",
"High Surf",
"High Wind",
"Hurricane/Typhoon",
"Ice Storm",
"Lakeshore Flood",
"Lake-Effect Snow",
"Lightning",
"Marine Hail",
"Marine High Wind",
"Marine Strong Wind",
"Marine Thunderstorm Wind",
"Rip Current",
"Seiche",
"Sleet",
"Storm Tide",
"Strong Wind",
"Thunderstorm Wind",
"Tornado",
"Tropical Depression",
"Tropical Storm",
"Tsunami",
"Volcanic Ash",
"Waterspout",
"Wildfire",
"Winter Storm",
"Winter Weather"
)
stmyrdata$EVTYPEM <-
lookup[amatch(
tolower(stmyrdata$EVTYPE),
tolower(lookup),
method = "jw",
maxDist = 0.25,
nomatch = 49
)]
stmyrdata$EVTYPEM[which(is.na(stmyrdata$EVTYPEM))] <- "Others"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("FLOOD$|FLD$|FLOODS", toupper(stmyrdata$EVTYPE))] <- "Flood"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("SLIDE|SLIDES", toupper(stmyrdata$EVTYPE))] <- "Debris Flow"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("WARM|HEAT", toupper(stmyrdata$EVTYPE))] <- "Excessive Heat"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("FREEZE|FROST|GLAZE", toupper(stmyrdata$EVTYPE))] <-
"Frost/Freeze"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("SNOW|COLD|ICE|ICY", toupper(stmyrdata$EVTYPE))] <-
"Winter Weather"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("THUNDERSTORM|MICROBURST|DOWNBURST",
toupper(stmyrdata$EVTYPE))] <- "Thunderstorm Wind"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("HAIL", toupper(stmyrdata$EVTYPE))] <- "Hail"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("GUSTY|WIND$|WINDS$|^WIND", toupper(stmyrdata$EVTYPE))] <-
"High Wind"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("SURF", toupper(stmyrdata$EVTYPE))] <- "High Surf"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("HURRICANE|TYPHOON", toupper(stmyrdata$EVTYPE))] <-
"Hurricane/Typhoon"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("RAIN$|RAINFALL$", toupper(stmyrdata$EVTYPE))] <- "Heavy Rain"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("FOG", toupper(stmyrdata$EVTYPE))] <- "Dense Fog"
stmyrdata$EVTYPEM[stmyrdata$EVTYPEM == "Others" &
grepl("FIRE", toupper(stmyrdata$EVTYPE))] <- "Wildfire"
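The chain of grepl assignments above behaves like an ordered rule list: each pattern is applied only to rows still labelled “Others”, so the first matching rule wins. A minimal sketch of that idea follows. The helper apply_fallback and the named vector fallback_rules are hypothetical and not part of the original analysis; the patterns and targets are copied from the code above.

```r
# Ordered fallback rules: names are regular expressions, values are event types.
fallback_rules <- c(
  "FLOOD$|FLD$|FLOODS" = "Flood",
  "SLIDE|SLIDES" = "Debris Flow",
  "WARM|HEAT" = "Excessive Heat",
  "FREEZE|FROST|GLAZE" = "Frost/Freeze",
  "SNOW|COLD|ICE|ICY" = "Winter Weather",
  "THUNDERSTORM|MICROBURST|DOWNBURST" = "Thunderstorm Wind",
  "HAIL" = "Hail",
  "GUSTY|WIND$|WINDS$|^WIND" = "High Wind",
  "SURF" = "High Surf",
  "HURRICANE|TYPHOON" = "Hurricane/Typhoon",
  "RAIN$|RAINFALL$" = "Heavy Rain",
  "FOG" = "Dense Fog",
  "FIRE" = "Wildfire"
)

# Hypothetical helper: apply the rules in order to rows still labelled "Others".
apply_fallback <- function(evtype, mapped, rules) {
  evtype <- toupper(evtype)
  for (i in seq_along(rules)) {
    hit <- mapped == "Others" & grepl(names(rules)[i], evtype)
    mapped[hit] <- rules[[i]]
  }
  mapped
}

apply_fallback(
  c("URBAN/SML STREAM FLD", "MUDSLIDE", "HEAVY SNOW AND HIGH WINDS", "APACHE COUNTY"),
  rep("Others", 4),
  fallback_rules
)
# "Flood" "Debris Flow" "Winter Weather" "Others"
```

Because earlier rules take precedence, the order of the rules matters: "HEAVY SNOW AND HIGH WINDS" is assigned to "Winter Weather" before the wind pattern is tried.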
In the code chunk below, the yearly sums of fatalities, injuries and economic cost are tabulated by the 49 event types of the EVTYPEM variable in the stmcost data table. Since data for the various event types is not available across all years, the mean is a more appropriate measure than the sum for identifying the major weather events contributing to fatalities, injuries and economic losses. The stmavgcost data table holds the mean yearly values of fatalities, injuries and economic losses for each of the 49 values of the EVTYPEM variable. The top ten weather events causing the most fatalities, injuries and economic losses are stored in the stmtopavgfatalities, stmtopavginjuries & stmtopavgecocost data tables.
stmcost <-
data.table(
stmyrdata %>% group_by(YEAR, EVTYPEM) %>% summarize(
FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES),
ECONOMICCOST = sum(ECONOMICCOST)
)
)
stmavgcost <-
data.table(
stmcost %>% group_by(EVTYPEM) %>% summarize(
FATALITIES = mean(FATALITIES),
INJURIES = mean(INJURIES),
ECONOMICCOST = mean(ECONOMICCOST)
)
)
stmtopavgecocost <-
subset(top_n(stmavgcost, 10, ECONOMICCOST),
select = c(EVTYPEM, ECONOMICCOST))
stmtopavgfatalities <-
subset(top_n(stmavgcost, 10, FATALITIES),
select = c(EVTYPEM, FATALITIES))
stmtopavginjuries <-
subset(top_n(stmavgcost, 10, INJURIES),
select = c(EVTYPEM, INJURIES))
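A toy example may make the two-stage aggregation clearer: first sum within each year and event type, then average those yearly sums over the years in which the event type appears. The sketch uses base R aggregate in place of the dplyr calls above, and the data frame toy is invented purely for illustration.

```r
# Invented data: Flood is recorded in two years, Tornado in only one.
toy <- data.frame(
  YEAR = c(2000, 2000, 2001, 2001, 2001),
  EVTYPEM = c("Flood", "Flood", "Flood", "Tornado", "Tornado"),
  FATALITIES = c(1, 3, 2, 10, 0)
)

# Stage 1: yearly sums per event type.
yearly <- aggregate(FATALITIES ~ YEAR + EVTYPEM, data = toy, FUN = sum)

# Stage 2: mean of the yearly sums per event type.
avg <- aggregate(FATALITIES ~ EVTYPEM, data = yearly, FUN = mean)
avg
# Flood: 3 (mean of 4 and 2); Tornado: 10 (only one year on record)
```

Note that the mean is taken only over years in which an event type was recorded, so Tornado averages 10 rather than 5.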
Figures 1 & 2 address the question of which types of events are most harmful to population health across the United States.
Figure 3 addresses the question of which types of events have the greatest economic consequences across the United States.
# Figure 1: mean yearly fatalities for the top ten event types.
g1 <- ggplot(stmtopavgfatalities, aes(x = reorder(EVTYPEM, FATALITIES), y = FATALITIES))
g1 + geom_bar(stat = "identity", fill = "red") + coord_flip() + theme_bw() + geom_text(aes(label = round(FATALITIES, 0)), vjust = "middle", size = 4) + labs(x = "Weather Event", y = "Average Yearly Fatalities", title = "Figure 1. Top Ten Weather Events Causing Maximum Fatalities Across USA")
# Figure 2: mean yearly injuries for the top ten event types.
g2 <- ggplot(stmtopavginjuries, aes(x = reorder(EVTYPEM, INJURIES), y = INJURIES))
g2 + geom_bar(stat = "identity", fill = "orange") + coord_flip() + theme_bw() + geom_text(aes(label = round(INJURIES, 0)), vjust = "middle", size = 4) + labs(x = "Weather Event", y = "Average Yearly Injuries", title = "Figure 2. Top Ten Weather Events Causing Maximum Injuries Across USA")
# Figure 3: mean yearly economic losses, in millions of USD, for the top ten event types.
g3 <- ggplot(stmtopavgecocost, aes(x = reorder(EVTYPEM, ECONOMICCOST), y = ECONOMICCOST / 1000000))
g3 + geom_bar(stat = "identity", fill = "light blue") + coord_flip() + theme_bw() + geom_text(aes(label = round(ECONOMICCOST / 1000000, 0)), vjust = "middle", size = 4) + labs(x = "Weather Event", y = "Average Yearly Economic Losses (Million USD)", title = "Figure 3. Top Ten Weather Events Causing Maximum Economic Losses Across USA")