This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The project aims to identify which types of weather events are most harmful to population health and which have the greatest economic consequences.
The storm database records events that caused fatalities, injuries, and damage to crops and property. The analysis uses data aggregated over the years 1950 to 2011. It shows that tornadoes are the most harmful events, causing the largest number of fatalities and injuries, while floods and droughts caused billions of dollars in property and crop damage, respectively.
dataUrl <-
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(dataUrl, "StormData.csv.bz2", method = "curl")
stormData <- read.csv(bzfile("StormData.csv.bz2"))
The storm database is very large. We need only the variables related to population health and to economic consequences, so we keep the following columns for analysis: “STATE”, “BGN_DATE”, “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP”. The columns are then renamed, and the begin date is reduced to its year.
reqdColumns <-
c("STATE","BGN_DATE","EVTYPE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
cleanedData <- stormData[, reqdColumns]
library(data.table)
setnames(
cleanedData,
c("STATE","BGN_DATE","EVTYPE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP"),
c("State","Year","EventType","Fatalities","Injuries",
"PropertiesDamaged","PropertyDamageExpenses",
"CropsDamaged","CropDamageExpenses")
)
cleanedData$Year <- as.numeric(format(as.Date(cleanedData$Year,
format = "%m/%d/%Y %H:%M:%S"), "%Y"))
2. The expense columns use a character code for the magnitude of the damage value, such as ‘K’ for thousands or ‘B’ for billions. Convert the damage figures to actual dollar amounts.
## Convert the character denoting the expense into a multiplier and calculate expenses
## using damage * expense.
## The following is the conversion routine used
## H = 100, K = 1000, M = 10^6, B = 10^9; everything else is treated as 1
calculateExpenses <- function(damage, expense) {
  ## Coerce to character so factor input is handled the same way as strings
  expense <- as.character(expense)
  expenseMultiplier <- 1
  if (!is.na(expense) && expense != "") {
    expenseMultiplier <- switch(toupper(expense),
                                H = 100, K = 1000, M = 10 ^ 6, B = 10 ^ 9, 1)
  }
  return(damage * expenseMultiplier)
}
cleanedData$PropertiesDamaged <- mapply(calculateExpenses,
cleanedData$PropertiesDamaged,
cleanedData$PropertyDamageExpenses)
cleanedData$CropsDamaged <-mapply(calculateExpenses,
cleanedData$CropsDamaged,
cleanedData$CropDamageExpenses)
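Because `mapply` calls the function once per row, the conversion above can be slow on a dataset of this size. The following is a minimal sketch of a vectorised alternative that applies the same rule through a named lookup vector. The names `expenseMultipliers` and `calculateExpensesVec` are hypothetical and not part of the original analysis.

```r
## Hypothetical vectorised variant of calculateExpenses(); illustrative only.
## Same rule as above: H = 100, K = 1000, M = 10^6, B = 10^9, anything else = 1.
expenseMultipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

calculateExpensesVec <- function(damage, expense) {
  ## Look up each code by name; empty or unrecognised codes give NA
  multiplier <- unname(expenseMultipliers[toupper(as.character(expense))])
  ## Fall back to a multiplier of 1 for everything that did not match
  multiplier[is.na(multiplier)] <- 1
  damage * multiplier
}

## 25 with "K" becomes 25000 and 2.5 with "m" becomes 2500000;
## the empty and unknown codes leave 10 and 3 unchanged
calculateExpensesVec(c(25, 2.5, 10, 3), c("K", "m", "", "?"))
```

Under these assumptions, the two `mapply` calls could be replaced by direct calls such as `calculateExpensesVec(cleanedData$PropertiesDamaged, cleanedData$PropertyDamageExpenses)`.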
For this analysis we sum fatalities and injuries over all recorded years to find which event types cause the most fatalities and which cause the most injuries. Only the top ten event types are kept, for easier plotting.
## Load the libraries required for analysis and plotting
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Calculate total fatalities
totalFatalitiesData <-
aggregate(Fatalities ~ EventType, cleanedData, sum) %>%
arrange(desc(Fatalities)) %>%
head(10)
## Set the factor for event type in just the order chosen for plotting
totalFatalitiesData$EventType <-
factor(totalFatalitiesData$EventType,
levels =
totalFatalitiesData[order(sort(totalFatalitiesData$Fatalities,
decreasing = TRUE)),
"EventType"])
## Plot the data
plotFatalities <- ggplot(totalFatalitiesData,
aes(x = EventType, y = Fatalities, fill = EventType)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Total Fatalities") +
xlab("Weather Type") +
ylab("No: of Fatalities") +
theme(legend.position = "none")
## Calculate total injuries
totalInjuryData <-
aggregate(Injuries ~ EventType, cleanedData, sum) %>%
arrange(desc(Injuries)) %>%
head(10)
## Set the factor for event type in just the order chosen for plotting
totalInjuryData$EventType <-
factor(totalInjuryData$EventType,
levels =
totalInjuryData[order(sort
(totalInjuryData$Injuries,
decreasing = TRUE)),
"EventType"])
## Plot the data
plotInjuries <- ggplot(totalInjuryData,
aes(x = EventType, y = Injuries, fill = EventType)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Total Injuries") +
xlab("Weather Type") +
ylab("No: of Injuries") +
theme(legend.position = "none")
From these two plots we can infer that tornadoes cause by far the most fatalities and injuries. For fatalities, excessive heat and flash flood follow; for injuries, TSTM Wind, flood, and excessive heat follow, each causing roughly the same number.
For this analysis we sum property damage and crop damage over all recorded years to find which event types cause the most property damage and which cause the most crop damage. Only the top ten event types are kept, for easier plotting.
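The aggregate, sort, and top-ten steps are repeated for each measure in this report. As a rough sketch, they could be wrapped in a single helper. The function name `topEvents` and the toy data below are hypothetical (the values are made up for illustration); the helper uses only base R and assumes a column named `EventType`, as in `cleanedData`.

```r
## Hypothetical helper: total one measure per event type and keep the n largest.
topEvents <- function(data, measure, n = 10) {
  totals <- aggregate(data[[measure]],
                      by = list(EventType = data$EventType),
                      FUN = sum)
  names(totals)[2] <- measure
  ## Sort by descending total and reset the row names
  totals <- totals[order(totals[[measure]], decreasing = TRUE), ]
  rownames(totals) <- NULL
  head(totals, n)
}

## Toy data standing in for cleanedData (made-up values, illustration only)
toy <- data.frame(
  EventType  = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  Fatalities = c(5, 1, 7, 4, 2)
)
topEvents(toy, "Fatalities", n = 2)
```

With the real data, a call such as `topEvents(cleanedData, "Injuries")` would be expected to give the same top-ten table as the `aggregate` and `dplyr` pipeline used in this report, apart from how ties are ordered.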
## Total property damage due to weather-related events
totalPropertyDamageData <-
aggregate(PropertiesDamaged ~ EventType, cleanedData, sum) %>%
arrange(desc(PropertiesDamaged)) %>%
head(10)
## Convert the expense to billions of dollars
totalPropertyDamageData$PropertiesDamaged <-
totalPropertyDamageData$PropertiesDamaged / 10 ^ 9
## Set the factor for event type in just the order chosen for plotting
totalPropertyDamageData$EventType <-
factor(totalPropertyDamageData$EventType,
levels =
totalPropertyDamageData[order(sort
(totalPropertyDamageData$PropertiesDamaged,
decreasing = TRUE)),
"EventType"])
## Plot the data
plotProperty <- ggplot(totalPropertyDamageData,
aes(x = EventType, y = PropertiesDamaged, fill =
EventType)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Total Expenses due to Property Damage") +
xlab("Weather Type") +
ylab("Expense in billions") +
theme(legend.position = "none")
## Total crop damage due to weather-related events
totalCropsDamageData <-
aggregate(CropsDamaged ~ EventType, cleanedData, sum) %>%
arrange(desc(CropsDamaged)) %>%
head(10)
## Convert the expense to billions of dollars
totalCropsDamageData$CropsDamaged <-
totalCropsDamageData$CropsDamaged / 10 ^ 9
## Set the factor for event type in just the order chosen for plotting
totalCropsDamageData$EventType <-
factor(totalCropsDamageData$EventType,
levels =
totalCropsDamageData[order(sort
(totalCropsDamageData$CropsDamaged,
decreasing = TRUE)),
"EventType"])
## Plot the data
plotCrop <- ggplot(totalCropsDamageData,
aes(x = EventType, y = CropsDamaged, fill = EventType)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Total Expenses due to Crops Damage") +
xlab("Weather Type") +
ylab("Expense in billions") +
theme(legend.position = "none")
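Each plotting chunk above reverses the top-ten order by hand (`order(sort(..., decreasing = TRUE))`) so that the largest bar ends up on top after `coord_flip()`. A more compact way to get a similar ordering is `reorder()` from the stats package, sketched here on made-up totals. The data frame `toyTotals` is hypothetical.

```r
## Toy totals standing in for one of the top-ten tables (made-up values)
toyTotals <- data.frame(
  EventType  = c("TORNADO", "HEAT", "FLOOD"),
  Fatalities = c(12, 4, 3)
)

## reorder() sorts the factor levels by ascending Fatalities, so with
## coord_flip() the event type with the largest total is drawn at the top
toyTotals$EventType <- reorder(toyTotals$EventType, toyTotals$Fatalities)
levels(toyTotals$EventType)
```

In ggplot2 the same call can be written directly inside the aesthetic mapping, for example `aes(x = reorder(EventType, Fatalities), y = Fatalities)`, which avoids modifying the data frame.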
## Print fatalities and injuries plot in a single grid
grid.arrange(plotFatalities,
plotInjuries,
ncol = 1,
bottom = "Population Health Damages")
A. Fatalities: Tornadoes cause the most fatalities, followed by excessive heat and flash flood.
B. Injuries: Tornadoes cause the most injuries, followed by TSTM Wind, flood, and excessive heat, which result in roughly the same number of injuries.
## Print property and crop damages in a single figure
grid.arrange(plotProperty,
plotCrop,
ncol = 1,
bottom = "Property and Crop Damage Expenses")
A. Expense due to Property Damage: Flood causes the most property damage, followed by hurricane and tornado.
B. Expense due to Crops Damage: Drought causes the most damage to crops, followed by flood, river flood, and ice storm.
Across the United States, tornadoes are the most harmful events with respect to population health, while floods and droughts have the greatest economic consequences.