Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Needed libraries
library(plyr)
library(ggplot2)
library(scales)
library(gridExtra)
## Loading required package: grid
Download and Extract the file:
setwd("~/Repos/RepData_PeerAssessment2/")
if (!"repdata-data-StormData.csv.bz2" %in% dir("./data/")) {
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "data/repdata-data-StormData.csv.bz2")
bunzip2("./data/repdata-data-StormData.csv.bz2", overwrite=TRUE, remove=FALSE)
}else{
print("File already Downloaded and extracted.")
}
## [1] "File already Downloaded and extracted."
data <- read.csv("./data/repdata-data-StormData.csv")
dim(data)
## [1] 902297 37
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
There are a total of 902297 observations of 37 variables. For this analysis I will only consider the variables related to Event Type, Fatalities, Injuries and Damage.
First I will remove the unused variables to have a more readable console output:
data <- subset(data, select=c("EVTYPE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP", "CROPDMG","CROPDMGEXP"))
summary(data)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
str(data)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
In a first look we can say that there aren’t NA but both PROPDMGEXP and CROPDMGEXP have trash values that will need a further clean.
To answer the first question I will take a look to the columns FATALITIES and INJURIES. I will aggregate EVTYPE by these 2 variables and plot it.
pop.healt <- ddply(data, .(EVTYPE),summarize,
TOTAL = sum(FATALITIES) + sum(INJURIES))
head(pop.healt)
## EVTYPE TOTAL
## 1 ? 0
## 2 ABNORMALLY DRY 0
## 3 ABNORMALLY WET 0
## 4 ABNORMAL WARMTH 0
## 5 ACCUMULATED SNOWFALL 0
## 6 AGRICULTURAL FREEZE 0
dim(pop.healt)
## [1] 985 2
I’m not interested in events with zero casualties so I will remove it from the DataSet
pop.healt <- subset(pop.healt, TOTAL != 0)
dim(pop.healt)
## [1] 220 2
220 obs is still too much for a quick overview of the top events. Let’s look at the quantiles to see if we can narrow to the top events.
quantile(pop.healt$TOTAL)
## 0% 25% 50% 75% 100%
## 1.00 1.75 5.00 44.25 96979.00
It seems like the 25% of the Top events has more than 44.25 casualties. Let’s narrow it a little bit more to have a clear big picture of the top events…
pop.healt <- subset(pop.healt, TOTAL > 44.25)
dim(pop.healt)
## [1] 55 2
Let’s look at the top event:
pop.healt[which.max(pop.healt$TOTAL),]
## EVTYPE TOTAL
## 830 TORNADO 96979
Now let’s plot the Top 10 Events ordered by TOTAL
pop.healt <- arrange(pop.healt, TOTAL, decreasing = TRUE)[1:10,]
# Re-arrange the levels is mandatory for ggplot to plot the bars in the same
# order as the dataframe
pop.healt$EVTYPE <- factor(pop.healt$EVTYPE,levels = pop.healt$EVTYPE)
# Custom ggplot theme
theme <- theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme_minimal()
ggplot(pop.healt) + geom_bar(aes(EVTYPE,TOTAL),stat = "identity") +
xlab("Weather Events") + ylab("Number of people affected") +
ggtitle("Numbers of people Affected in the U.S. by Weather Events from 1995 to 2011") +
theme
That looks great. But I can’t sum apples and bananas (FATALITIES and INJURIES) so let’s look a little bit into the two variables by separated and how they impact in the number of people affected
fatalities <- ddply(data, .(EVTYPE),summarize,
FATALITIES = sum(FATALITIES))
injuries <- ddply(data, .(EVTYPE),summarize,
INJURIES = sum(INJURIES))
fatalities <- arrange(fatalities, FATALITIES, decreasing = TRUE)[1:10,]
injuries <- arrange(injuries, INJURIES, decreasing = TRUE)[1:10,]
fatalities$EVTYPE <- factor(fatalities$EVTYPE,levels = fatalities$EVTYPE)
injuries$EVTYPE <- factor(injuries$EVTYPE, levels = injuries$EVTYPE)
fatalities
## EVTYPE FATALITIES
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
injuries
## EVTYPE INJURIES
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
Perfect! A little bit of Copy-Paste but it’s fine so far. Now lets try to plot it together
injuries.plot <- ggplot(injuries) + geom_bar(aes(EVTYPE,INJURIES), stat = "identity")+
ggtitle("Numbers of people injured\n in the U.S.\n by Weather Events\n from 1995 to 2011") +
theme
fatalities.plot <- ggplot(fatalities) + geom_bar(aes(EVTYPE,FATALITIES), stat = "identity")+
ggtitle("Numbers of fatalities\n in the U.S.\n by Weather Events\n from 1995 to 2011") +
theme
grid.arrange(fatalities.plot, injuries.plot, ncol = 2)
Now let’s dig a little bit about the economic consequences. We have 4 columns:
- PROPDMG
- PROPDMGEXP
- CROPDMG
- CROPDMGEXP
Let’s try to unify the columns in just 1 numeric Value: TOTALDMG. Both PROPDMGEXP and CROPDMGEXP columns have a multiplier for each value in PROPDMG and CROPDMG. We have Hundred (H), Thousand (K), Million (M) and Billion (B). We also have trash values like -, +, ?
data$PROPDMGEXP <- tolower(as.character(data$PROPDMGEXP))
data$CROPDMGEXP <- tolower(as.character(data$CROPDMGEXP))
data[data$CROPDMGEXP == "b", "CROPDMGEXP"] <- 9
data[data$CROPDMGEXP == "m", "CROPDMGEXP"] <- 6
data[data$CROPDMGEXP == "k", "CROPDMGEXP"] <- 3
data[data$CROPDMGEXP == "", "CROPDMGEXP"] <- 0
data[data$CROPDMGEXP == "?", "CROPDMGEXP"] <- 0
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
table(data$CROPDMGEXP)
##
## 0 2 3 6 9
## 618439 1 281853 1995 9
data[data$PROPDMGEXP == "b", "PROPDMGEXP"] <- 9
data[data$PROPDMGEXP == "m", "PROPDMGEXP"] <- 6
data[data$PROPDMGEXP == "k", "PROPDMGEXP"] <- 3
data[data$PROPDMGEXP == "h", "PROPDMGEXP"] <- 2
data[data$PROPDMGEXP == "", "PROPDMGEXP"] <- 0
data[data$PROPDMGEXP == "?", "PROPDMGEXP"] <- 0
data[data$PROPDMGEXP == "-", "PROPDMGEXP"] <- 0
data[data$PROPDMGEXP == "+", "PROPDMGEXP"] <- 0
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
table(data$PROPDMGEXP)
##
## 0 1 2 3 4 5 6 7 8 9
## 466164 25 20 424669 4 28 11341 5 1 40
Excellent! Now we have to calculate the TOTALDMG:
crop.total.dmg <- data$CROPDMG * (10 ^ data$CROPDMGEXP)
prop.total.dmg <- data$PROPDMG * (10 ^ data$PROPDMGEXP)
total.dmg <- crop.total.dmg + prop.total.dmg
data$TOTALDMG <- total.dmg
Now we just need to summarize the data and create a nice plot.
event.dmg <- ddply(data, .(EVTYPE), summarize,
TOTALDMG = sum(TOTALDMG))
event.dmg <- arrange(event.dmg, TOTALDMG, decreasing = TRUE)[1:10,]
event.dmg$EVTYPE <- factor(event.dmg$EVTYPE, levels = event.dmg$EVTYPE)
ggplot(event.dmg) + geom_bar(aes(EVTYPE,TOTALDMG), stat="identity") +
xlab("Weather Events") +
scale_y_continuous("Total Damage in Dollars",labels = comma) +
ggtitle("Total Damage in Dollars in the U.S,\n by Weather Events from 1995 to 2011") +
theme
It’s safe to say that TORNADOS are the most harmful events, followed by EXCESIVE HEAT.
It’s safe to say that FLOODS are the most harmful events, followed by HURRICANE/TYPHOON