This analysis answers important questions about weather events in the United States. The process consisted of cleaning and filtering the data, making some calculations, and plotting the results.
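The questions are: which types of events cause the most injuries and fatalities, and which cause the greatest damage value?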
First, we will talk about how the data was processed to reach the answers.
As a first step, we will download the dataset and read it into R. We will also load the dplyr package, which we will need later:
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "repdata_data_StormData.csv.bz2")
storm_data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
Next, we will clean the EVTYPE column, which is vital to our analysis, as you will see later.
# Cleaning the EVTYPE
new_storm_data <- filter(storm_data, !grepl('Summary', EVTYPE))
new_storm_data <- filter(new_storm_data, !grepl('SUMMARY', EVTYPE))
new_storm_data <- new_storm_data[new_storm_data$EVTYPE != "?",]
new_storm_data[grepl("TSTM WIND", new_storm_data$EVTYPE), "EVTYPE"] <- "TSTM WIND"
new_storm_data[grepl("THUNDERSTORM WIND", new_storm_data$EVTYPE), "EVTYPE"] <- "TSTM WIND"
new_storm_data[grepl("TORNADO", new_storm_data$EVTYPE), "EVTYPE"] <- "TORNADO"
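To make the effect of these cleaning rules easier to see, here is a minimal sketch on a small character vector. The event labels are made-up examples rather than rows taken from the dataset, and `ignore.case = TRUE` is a shortcut that also matches lowercase "summary", which the two separate filters above do not.

```r
# Toy event types (made-up values, not taken from the dataset)
evtype <- c("Summary of May 14", "TSTM WIND/HAIL", "THUNDERSTORM WINDS",
            "TORNADO F1", "?", "HAIL")

# Drop summary rows and the "?" placeholder
keep <- !grepl("summary", evtype, ignore.case = TRUE) & evtype != "?"
evtype <- evtype[keep]

# Collapse wind and tornado variants into a single label each
evtype[grepl("TSTM WIND|THUNDERSTORM WIND", evtype)] <- "TSTM WIND"
evtype[grepl("TORNADO", evtype)] <- "TORNADO"

print(evtype)
# "TSTM WIND" "TSTM WIND" "TORNADO" "HAIL"
```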
We will also extract the date and filter the data frame based on it. We do this because before 1996 only Tornado, Thunderstorm Wind and Hail events were recorded. So, to make an adequate comparison, we will only use the years in which data for all event types were collected.
new_storm_data$Date <- strptime(new_storm_data$BGN_DATE, format = "%m/%d/%Y")
new_storm_data <- new_storm_data[new_storm_data$Date$year >= 96,]
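The comparison against 96 works because the `year` field of a POSIXlt object counts years since 1900. The sketch below illustrates this on three made-up dates; `tz = "UTC"` is set only to keep the example independent of the local time zone.

```r
# Made-up dates in the same month/day/year layout as BGN_DATE
dates <- c("4/18/1950", "1/6/1996", "11/28/2011")
parsed <- strptime(dates, format = "%m/%d/%Y", tz = "UTC")

# POSIXlt stores the year as an offset from 1900
parsed$year          # 50 96 111
parsed$year + 1900   # 1950 1996 2011

# Keep only records from 1996 onwards
keep <- parsed$year >= 96
dates[keep]          # "1/6/1996" "11/28/2011"
```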
We will also calculate the damage value. According to the directive, B corresponds to billions, M to millions and K to thousands. Rows with any other value in the PROPDMGEXP and CROPDMGEXP columns will be dropped. First we will make the calculations:
if (new_storm_data$PROPDMGEXP == "b" | new_storm_data$PROPDMGEXP == "B") {
new_storm_data$DMGVALUE <- new_storm_data$PROPDMG * 1000000000
} else if (new_storm_data$PROPDMGEXP == "m" | new_storm_data$PROPDMGEXP == "M") {
new_storm_data$DMGVALUE <- new_storm_data$PROPDMG * 1000000
} else if (new_storm_data$PROPDMGEXP == "k" | new_storm_data$PROPDMGEXP == "K") {
new_storm_data$DMGVALUE <- new_storm_data$PROPDMG * 1000
}
## Warning in if (new_storm_data$PROPDMGEXP == "b" | new_storm_data$PROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used
## Warning in if (new_storm_data$PROPDMGEXP == "m" | new_storm_data$PROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used
## Warning in if (new_storm_data$PROPDMGEXP == "k" | new_storm_data$PROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used
if (new_storm_data$CROPDMGEXP == "b" | new_storm_data$CROPDMGEXP == "B") {
new_storm_data$DMGVALUE <- new_storm_data$DMGVALUE + new_storm_data$CROPDMG * 1000000000
} else if (new_storm_data$CROPDMGEXP == "m" | new_storm_data$CROPDMGEXP == "M") {
new_storm_data$DMGVALUE <- new_storm_data$DMGVALUE + new_storm_data$CROPDMG * 1000000
} else if (new_storm_data$CROPDMGEXP == "k" | new_storm_data$CROPDMGEXP == "K") {
new_storm_data$DMGVALUE <- new_storm_data$DMGVALUE + new_storm_data$CROPDMG * 1000
}
## Warning in if (new_storm_data$CROPDMGEXP == "b" | new_storm_data$CROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used
## Warning in if (new_storm_data$CROPDMGEXP == "m" | new_storm_data$CROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used
## Warning in if (new_storm_data$CROPDMGEXP == "k" | new_storm_data$CROPDMGEXP
## == : the condition has length > 1 and only the first element will be
## used
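The warnings above mean that each `if` looked only at the first row of the exponent column, so the branch chosen for that row was applied to the whole data frame (from R 4.2.0 onwards the same code stops with an error). The damage figures reported later come from the code as it was run. As a possible alternative, the sketch below applies the multiplier row by row with a lookup vector. The toy rows are made up, `damage_value` is a hypothetical helper, and treating unrecognised exponents as zero is an assumption about what "dropped" should mean.

```r
# Toy rows (made-up values); column names mirror the dataset
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "m", "B", ""),
  CROPDMG    = c(0, 5, 0, 3),
  CROPDMGEXP = c("", "K", "", "?"),
  stringsAsFactors = FALSE
)

# Multipliers for the three documented exponent codes
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: per-row value, 0 for any other exponent code
damage_value <- function(amount, exponent) {
  value <- amount * unname(multipliers[toupper(as.character(exponent))])
  value[is.na(value)] <- 0
  value
}

toy$DMGVALUE <- damage_value(toy$PROPDMG, toy$PROPDMGEXP) +
  damage_value(toy$CROPDMG, toy$CROPDMGEXP)

print(toy$DMGVALUE)
```

With this approach each row gets its own multiplier, so totals per event type would differ from the ones shown in this report.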
Then we will create a new data frame containing only the rows whose calculated damage value is greater than 0. This will be used to show the results for damage value. But first we need to convert the Date column to POSIXct, since dplyr's filter does not work with POSIXlt columns.
new_storm_data$Date <- as.POSIXct(new_storm_data$Date)
storm_data_with_value <- filter(new_storm_data, new_storm_data$DMGVALUE != 0)
Now we will calculate the total number of fatalities and injuries, and the total damage value, by type of event (EVTYPE).
injuries_per_type <- tapply(new_storm_data$INJURIES, new_storm_data$EVTYPE, sum)
fatalities_per_type <- tapply(new_storm_data$FATALITIES, new_storm_data$EVTYPE, sum)
values_per_type <- tapply(storm_data_with_value$DMGVALUE, storm_data_with_value$EVTYPE, sum)
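As a small illustration of what `tapply` returns here, the sketch below sums a made-up injuries column by event type. The result is a named array with one entry per type, ordered alphabetically by name.

```r
# Toy data (made-up values)
toy <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  INJURIES = c(3, 0, 5, 1, 2)
)

# One total per event type
injuries_by_type <- tapply(toy$INJURIES, toy$EVTYPE, sum)
print(injuries_by_type)
#   FLOOD    HAIL TORNADO
#       2       1       8
```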
Now we will check the results. First, the data about injuries and fatalities. We will select the 15 events that caused the most injuries and the 15 that caused the most fatalities. We will also see what share of all cases they cover compared with the remaining events.
worst_types_injuries <- sort(injuries_per_type, decreasing=TRUE)
worst_types_injuries <- worst_types_injuries[1:15]
worst_types_fatalities <- sort(fatalities_per_type, decreasing=TRUE)
worst_types_fatalities <- worst_types_fatalities[1:15]
sum(worst_types_fatalities) / sum(new_storm_data$FATALITIES)
## [1] 0.8508933
sum(worst_types_injuries) / sum(new_storm_data$INJURIES)
## [1] 0.9184131
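The coverage figure is simply the sum of the top entries divided by the overall total. A minimal sketch with made-up totals follows; it uses `head()` instead of `[1:15]`, a small variation that avoids NA padding when there are fewer types than requested.

```r
# Made-up totals per event type
totals <- c(TORNADO = 8, FLOOD = 2, HAIL = 1, FOG = 1)

# Take the two largest entries
top <- head(sort(totals, decreasing = TRUE), 2)

# Share of all cases covered by the top entries: (8 + 2) / 12
share <- sum(top) / sum(totals)
print(share)
```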
Since the top 15 events cover more than 85% of the cases in both rankings, we conclude that further cleaning of the EVTYPE column would cost considerable time and processing for an improvement that would not be significant. Let’s see which events are the worst and plot the data. First, the injuries:
print(worst_types_injuries)
##           TORNADO             FLOOD    EXCESSIVE HEAT         TSTM WIND
##             20667              6758              6391              5163
##         LIGHTNING       FLASH FLOOD      WINTER STORM HURRICANE/TYPHOON
##              4141              1674              1292              1275
##              HEAT         HIGH WIND          WILDFIRE              HAIL
##              1222              1083               911               713
##               FOG        HEAVY SNOW  WILD/FOREST FIRE
##               712               698               545
barplot(worst_types_injuries, las=2)
And then the fatalities:
print(worst_types_fatalities)
##          EXCESSIVE HEAT                 TORNADO             FLASH FLOOD
##                    1797                    1511                     887
##               LIGHTNING                   FLOOD               TSTM WIND
##                     651                     414                     397
##             RIP CURRENT                    HEAT               HIGH WIND
##                     340                     237                     235
##               AVALANCHE            RIP CURRENTS            WINTER STORM
##                     223                     202                     191
## EXTREME COLD/WIND CHILL            EXTREME COLD              HEAVY SNOW
##                     125                     113                     107
barplot(worst_types_fatalities, las=2)
Finally, we will check the results for damage value. Let’s select the 15 worst events and check what share of the total they cover.
worst_types_values <- sort(values_per_type, decreasing=TRUE)
worst_types_values <- worst_types_values[1:15]
sum(worst_types_values) / sum(new_storm_data$DMGVALUE)
## [1] 0.9644712
And then we will see which are the worst events based on damage value and plot them:
print(worst_types_values)
##        TSTM WIND      FLASH FLOOD          TORNADO             HAIL
##       2384933920       1408629250       1278006730       1073656400
##            FLOOD        LIGHTNING        HIGH WIND     WINTER STORM
##        976762890        490465290        332366270        128874480
##       HEAVY SNOW         WILDFIRE      STRONG WIND        ICE STORM
##         90984810         87371540         64110710         58202120
##       HEAVY RAIN   TROPICAL STORM WILD/FOREST FIRE
##         57980550         53071940         43492970
barplot(worst_types_values, las=2)