The following analysis uses weather data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database to determine which weather events are most harmful to population health and which have the greatest economic impact.
Information about the data set is given by:
Further explanation of the data set can be obtained from the resources available at this link
The analysis performed below shows that tornadoes are the biggest danger to public health, while floods have the greatest economic impact.
To find the weather events that have the most adverse effect on the economy and on population health, we perform the following steps in order:
Check whether the compressed .csv.bz2 file is in the working directory; if not, download it from the following location.
Compute the total property and crop damage for each record by multiplying the damage value by its exponent code (h, k, m or b), and store the results in the new columns PROPDMGT and CROPDMGT.
Drop the original property and crop damage variables, which are no longer needed, and group the data by EVTYPE using the dplyr package.
At this point, the data is clean enough for the analysis that follows.
library(ggplot2)
library(dplyr)
colClass <- c(rep('NULL', 7), 'character', rep('NULL', 14), rep('numeric', 3), 'character', 'numeric', 'character', rep('NULL', 9))
if(!file.exists('storm_data.csv.bz2')){
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', 'storm_data.csv.bz2')
}
storm_data <- read.csv(bzfile('storm_data.csv.bz2'), colClasses = colClass)
# Normalize event names to lower case so that differently cased duplicates merge
storm_data$EVTYPE <- tolower(storm_data$EVTYPE)
storm_data$PROPDMGEXP <- tolower(storm_data$PROPDMGEXP)
storm_data$PROPDMGEXP <- factor(storm_data$PROPDMGEXP, levels = c('h', 'k', 'm', 'b'))
storm_data$CROPDMGEXP <- tolower(storm_data$CROPDMGEXP)
storm_data$CROPDMGEXP <- factor(storm_data$CROPDMGEXP, levels = c('h', 'k', 'm', 'b'))
multiplier <- c('h' = 100, 'k' = 1000, 'm' = 1e+06, 'b' = 1e+09)
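# Look the multiplier up by name rather than by factor code; unknown or empty exponents give NA
storm_data <- mutate(storm_data, PROPDMGT = PROPDMG*unname(multiplier[as.character(PROPDMGEXP)]))
storm_data <- mutate(storm_data, CROPDMGT = CROPDMG*unname(multiplier[as.character(CROPDMGEXP)]))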
storm_data <- select(storm_data, -(PROPDMG:CROPDMGEXP))
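The exponent-to-multiplier conversion used above can be sketched on a small toy data frame. The values below are made up for illustration and are not taken from the NOAA data; the sketch uses base R only and looks the multiplier up by name, so the result does not depend on the order of the elements in `multiplier`.

```r
# Toy records; the numbers are hypothetical and only illustrate the lookup.
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 3, 7),
  PROPDMGEXP = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)

# Same exponent codes as in the analysis above.
multiplier <- c(h = 100, k = 1000, m = 1e+06, b = 1e+09)

# Lower-case the code, then index the named vector by that character value.
# A code that is not one of h, k, m, b (here the empty string) yields NA.
toy$PROPDMGT <- toy$PROPDMG * unname(multiplier[tolower(toy$PROPDMGEXP)])

print(toy)
```

Records whose exponent is missing or unrecognized end up with an NA total, which is why the later steps of the analysis replace NA with 0 before summing.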
The data set is grouped by the event type recorded for each entry. Since there are many distinct event types, most of them without any significant impact on either the economy or population health, we concentrate only on the 10 most harmful weather events.
Harm to population health is measured here by adding the number of fatalities and injuries for each event type, with both given equal weight. Because these totals differ by several orders of magnitude between event types, we store the base-10 logarithm of the sum in a new column called CASUALTY.
The barplot given below shows the CASUALTY values of the 10 most harmful events.
storm_data <- group_by(storm_data, EVTYPE)
storm_data.health <- summarise(storm_data, CASUALTY = log10(sum(FATALITIES, INJURIES)))
storm_data.health <- arrange(storm_data.health, desc(CASUALTY))
#We will only concentrate on the top 10 casualty causing events
ggplot(storm_data.health[1:10,], aes(x = reorder(EVTYPE, -CASUALTY), y = CASUALTY)) + geom_bar(stat = 'identity', aes(fill = EVTYPE)) +
labs(title = 'Top 10 most harmful events to population health in US', x = 'WEATHER EVENT', y = 'CASUALTY (log10)') +
guides(fill = 'none') +
coord_flip()
rm(storm_data.health)
As shown by the barplot, the most harmful weather event to population health is TORNADO.
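As a minimal sketch of how the CASUALTY measure behaves, the following base R example applies the same definition, log10 of fatalities plus injuries per event type, to hypothetical counts. The event names and numbers are invented for illustration, and `aggregate` stands in for the dplyr calls used in the analysis.

```r
# Hypothetical counts, for illustration only.
toy <- data.frame(
  EVTYPE     = c("a", "a", "b", "c"),
  FATALITIES = c(10, 0, 1, 0),
  INJURIES   = c(90, 900, 9, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Equal weight for both counts, then a log10 transform.
totals$CASUALTY <- log10(totals$FATALITIES + totals$INJURIES)

# Sort from most to least harmful.
totals <- totals[order(-totals$CASUALTY), ]

print(totals)
```

Note that an event type with no recorded fatalities or injuries gets a CASUALTY of -Inf, so it sorts to the bottom and never enters the top 10.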
We create two new data sets, each containing only one of PROPDMGT or CROPDMGT, label them by damage type, and stack them; the rest of the processing is as follows:
The end result is a denormalized (long) form of the previous data set, which makes it easier to compare property and crop damage for each event type.
storm_data.prop <- mutate(storm_data, PROPDMGT = ifelse(is.na(PROPDMGT), 0, PROPDMGT)) %>%
select(EVTYPE, DMG = PROPDMGT)
storm_data.crop <- mutate(storm_data, CROPDMGT = ifelse(is.na(CROPDMGT), 0, CROPDMGT)) %>%
select(EVTYPE, DMG = CROPDMGT)
storm_data.prop$TYPE <- 'Property'
storm_data.crop$TYPE <- 'Crop'
storm_data.new <- rbind(storm_data.crop, storm_data.prop)
rm(storm_data.crop, storm_data.prop)
storm_data.new <- group_by(storm_data.new, EVTYPE, TYPE) %>%
summarise(TDMG = sum(DMG))
# Keep only the event/damage-type totals of at least 5 billion dollars
ggplot(storm_data.new[which(storm_data.new$TDMG >= 5e+9),], aes(x = EVTYPE, y = TDMG)) +
geom_bar(stat = 'identity', aes(fill = TYPE)) +
labs(title = 'Top 13 most costly weather events in US', x = 'WEATHER EVENT', y = 'TOTAL DAMAGE COST') +
guides(fill = guide_legend(title = 'Damages in $ made to')) +
coord_flip()
It is evident from the barplot that the weather event with the greatest total cost, in property and crop damage combined, is FLOOD, and that floods cause more damage to property than to crops. The most harmful weather event for crops is DROUGHT, which also features among the top 13 most costly weather events.
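The reshaping step used for the damage plot can be sketched on toy data as follows. This is an illustration under assumed values, not the analysis itself: the helper `to_long` is a hypothetical name introduced here, the numbers are invented, and base R's `aggregate` replaces the dplyr pipeline.

```r
# Toy per-record damage totals; values are hypothetical.
toy <- data.frame(
  EVTYPE   = c("flood", "drought", "flood"),
  PROPDMGT = c(100, NA, 50),
  CROPDMGT = c(10, 40, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one damage column, replace NA with 0,
# and tag every row with the kind of damage it represents.
to_long <- function(df, column, label) {
  dmg <- df[[column]]
  dmg[is.na(dmg)] <- 0
  data.frame(EVTYPE = df$EVTYPE, DMG = dmg, TYPE = label,
             stringsAsFactors = FALSE)
}

# Stack the two single-measure tables into one long table.
long <- rbind(to_long(toy, "PROPDMGT", "Property"),
              to_long(toy, "CROPDMGT", "Crop"))

# Total damage per event type and damage type.
totals <- aggregate(DMG ~ EVTYPE + TYPE, data = long, FUN = sum)

print(totals)
```

With one row per event type and damage type, the TYPE column can be mapped directly to the fill colour of a stacked bar chart, which is what the plot above relies on.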