Author: Jawad Rashid
This analysis examines which weather events across the United States have the greatest impact on public health and on the economy. The data come from weather records; for each event type they report the impact in several areas, including crop damage, property damage, fatalities and injuries. We identify the most harmful event types by totaling the recorded impact of each event type. For public health we use fatalities and injuries as the measures, and for economic loss we use crop damage and property damage. These four attributes are the most relevant to the two questions.
The analysis uses the compressed (bzip2) CSV file downloaded from the source:
if (!file.exists("data")) {
    dir.create("data")
}
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
    destfile = "data/stormdata.csv.bz2", method = "curl")
The compressed file is read in directly for analysis; read.csv can read a bzip2 file without unpacking it first.
data <- read.csv("data/stormdata.csv.bz2", stringsAsFactors = FALSE, strip.white = TRUE,
na.strings = c("NA", ""))
Only the required fields are kept. Because we are interested in the effect of event type on fatalities, injuries, property damage and crop damage, we keep only those columns and drop all others. The retained columns are renamed to make them easier to understand.
processedData <- data[, c(8, 23, 24, 25, 27)]
names(processedData) <- c("eventtype", "fatalaties", "injuries", "propertydamage",
"cropdamage")
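Selecting columns by position works only while the file layout stays the same. As a hedged alternative, the sketch below selects by name on a small toy data frame. The original column names (EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG) are assumed to match the header of the storm data file, and `raw`, `wanted` and `selected` are hypothetical names used only for illustration.

```r
# Toy stand-in for the raw storm data; the column names are an assumption
# about the header of the downloaded CSV file.
raw <- data.frame(
  STATE = c("AL", "AL"),
  EVTYPE = c("TORNADO", "HAIL"),
  FATALITIES = c(0, 1),
  INJURIES = c(15, 0),
  PROPDMG = c(25, 2.5),
  CROPDMG = c(0, 5),
  stringsAsFactors = FALSE
)

# Map each original column name to the shorter name used in this analysis.
wanted <- c(EVTYPE = "eventtype", FATALITIES = "fatalaties",
            INJURIES = "injuries", PROPDMG = "propertydamage",
            CROPDMG = "cropdamage")

# Fail early if an expected column is missing from the input.
stopifnot(all(names(wanted) %in% names(raw)))

# Keep only the wanted columns, then rename them.
selected <- raw[, names(wanted)]
names(selected) <- unname(wanted)
```

Selecting by name makes the script fail loudly if a column is missing, instead of silently picking up the wrong field.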
The first rows of the processed dataset are shown below:
head(processedData)
##   eventtype fatalaties injuries propertydamage cropdamage
## 1   TORNADO          0       15           25.0          0
## 2   TORNADO          0        0            2.5          0
## 3   TORNADO          0        2           25.0          0
## 4   TORNADO          0        2            2.5          0
## 5   TORNADO          0        2            2.5          0
## 6   TORNADO          0        6            2.5          0
The structure of the processed dataset is shown below.
str(processedData)
## 'data.frame':    902297 obs. of  5 variables:
##  $ eventtype     : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ fatalaties    : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries      : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ propertydamage: num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ cropdamage    : num  0 0 0 0 0 0 0 0 0 0 ...
To determine which event type is most harmful to public health, we calculate the total fatalities and the total injuries for each event type. First we sum the fatalities for each event type and sort the totals in descending order. The top 10 events by fatalities are:
library(plyr)
fatalatiesSummary <- aggregate(processedData$fatalaties, list(event = processedData$eventtype),
sum)
fatalatiesSummary <- arrange(fatalatiesSummary, desc(x))
names(fatalatiesSummary) <- c("event", "fatalaties")
head(fatalatiesSummary, n = 10)
##             event fatalaties
## 1         TORNADO       5633
## 2  EXCESSIVE HEAT       1903
## 3     FLASH FLOOD        978
## 4            HEAT        937
## 5       LIGHTNING        816
## 6       TSTM WIND        504
## 7           FLOOD        470
## 8     RIP CURRENT        368
## 9       HIGH WIND        248
## 10      AVALANCHE        224
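To make the sum-and-sort step concrete, here is a minimal sketch on a toy data frame. It uses only base R (the formula interface of aggregate() and order()) rather than plyr, so it is an illustrative variant and not the code used above. The data values are made up.

```r
# Toy data: made-up fatalities for three event types.
toy <- data.frame(
  eventtype = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  fatalaties = c(3, 2, 4, 1, 3),
  stringsAsFactors = FALSE
)

# Sum fatalities per event type.
totals <- aggregate(fatalaties ~ eventtype, data = toy, FUN = sum)

# Sort from the largest total to the smallest and reset the row names.
totals <- totals[order(totals$fatalaties, decreasing = TRUE), ]
rownames(totals) <- NULL

print(totals)
```

In this toy case TORNADO comes first with 7, followed by HEAT with 5 and FLOOD with 1.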
We will do the same for injuries.
injurySummary <- aggregate(processedData$injuries, list(event = processedData$eventtype),
sum)
injurySummary <- arrange(injurySummary, desc(x))
names(injurySummary) <- c("event", "injuries")
head(injurySummary, n = 10)
##                event injuries
## 1            TORNADO    91346
## 2          TSTM WIND     6957
## 3              FLOOD     6789
## 4     EXCESSIVE HEAT     6525
## 5          LIGHTNING     5230
## 6               HEAT     2100
## 7          ICE STORM     1975
## 8        FLASH FLOOD     1777
## 9  THUNDERSTORM WIND     1488
## 10              HAIL     1361
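The table lists TSTM WIND and THUNDERSTORM WIND as separate events, although they appear to describe the same kind of event. The analysis above keeps the labels as recorded. As a hedged sketch of one possible cleanup, the hypothetical helper `normalize_event` below merges a few such variants before aggregating; the rules it applies are assumptions, not part of the original analysis, and the toy values are made up.

```r
# Hypothetical helper: map a few label variants to one canonical label.
normalize_event <- function(x) {
  x <- toupper(trimws(x))
  # Assumption: "TSTM" is an abbreviation of "THUNDERSTORM".
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)
  # Assumption: a trailing plural "WINDS" means the same as "WIND".
  x <- sub("WINDS$", "WIND", x)
  x
}

# Toy data with made-up injury counts.
toy <- data.frame(
  eventtype = c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", "HAIL"),
  injuries = c(5, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Normalize the labels, then sum injuries per cleaned label.
toy$eventtype <- normalize_event(toy$eventtype)
merged <- aggregate(injuries ~ eventtype, data = toy, FUN = sum)
print(merged)
```

With these rules the three wind variants collapse into a single THUNDERSTORM WIND row. Merging labels in the real data would change the rankings, so any such rule should be checked against the event type list first.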
We can visualize the results so far with bar plots of the top 10 event types by fatalities and by injuries.
library(ggplot2)
par(mfrow = c(1, 2))
barplot(fatalatiesSummary$fatalaties[1:10]/1000, main = "Top 10 Fatalities/Event",
    ylab = "Count (thousands)", names.arg = fatalatiesSummary$event[1:10],
    cex.names = 0.6, las = 2, cex.axis = 0.7, col = 3)
barplot(injurySummary$injuries[1:10]/1000, main = "Top 10 Injuries/Event", ylab = "Count (thousands)",
    names.arg = injurySummary$event[1:10], cex.names = 0.6, las = 2, cex.axis = 0.7,
    col = 2)
# These stats are used in the paragraphs below
firstFatalityPercentage <- fatalatiesSummary[1, 2]/sum(fatalatiesSummary[, 2]) *
100
secondFatalityPercentage <- fatalatiesSummary[2, 2]/sum(fatalatiesSummary[,
2]) * 100
firstInjuryPercentage <- injurySummary[1, 2]/sum(injurySummary[, 2]) * 100
secondInjuryPercentage <- injurySummary[2, 2]/sum(injurySummary[, 2]) * 100
As the figures and the table show, tornado is the event type most harmful to public health in terms of fatalities. It is responsible for 37.1938% of total fatalities, followed by excessive heat with 12.5652%.
The same analysis for injuries shows that tornado is also the most harmful event type in terms of injuries. It is responsible for 65.002% of total injuries, followed by TSTM WIND with 4.9506%.
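The percentages above are each event's total divided by the grand total over all event types. A minimal sketch of that calculation as a reusable function follows; `top_shares` is a hypothetical name, and the input values are made up.

```r
# Hypothetical helper: percentage share of the n largest totals,
# relative to the sum over all totals, rounded to 4 decimal places.
top_shares <- function(totals, n = 2) {
  sorted <- sort(totals, decreasing = TRUE)
  round(sorted[seq_len(n)] / sum(totals) * 100, 4)
}

# Made-up totals for three event types.
shares <- top_shares(c(20, 50, 30))
print(shares)
```

For the made-up totals the two largest shares are 50 and 30 percent.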
To determine which event type is most harmful to the economy, we calculate the total crop damage and the total property damage for each event type. First we sum the crop damage for each event type and sort the totals in descending order. The top 10 events by crop damage are:
cropSummary <- aggregate(processedData$cropdamage, list(event = processedData$eventtype),
sum)
cropSummary <- arrange(cropSummary, desc(x))
names(cropSummary) <- c("event", "crop")
head(cropSummary, n = 10)
##                 event   crop
## 1                HAIL 579596
## 2         FLASH FLOOD 179200
## 3               FLOOD 168038
## 4           TSTM WIND 109203
## 5             TORNADO 100019
## 6   THUNDERSTORM WIND  66791
## 7             DROUGHT  33899
## 8  THUNDERSTORM WINDS  18685
## 9           HIGH WIND  17283
## 10         HEAVY RAIN  11123
We will do the same for property damage.
propertySummary <- aggregate(processedData$propertydamage, list(event = processedData$eventtype),
sum)
propertySummary <- arrange(propertySummary, desc(x))
names(propertySummary) <- c("event", "property")
head(propertySummary, n = 10)
##                 event property
## 1             TORNADO  3212258
## 2         FLASH FLOOD  1420125
## 3           TSTM WIND  1335966
## 4               FLOOD   899938
## 5   THUNDERSTORM WIND   876844
## 6                HAIL   688693
## 7           LIGHTNING   603352
## 8  THUNDERSTORM WINDS   446293
## 9           HIGH WIND   324732
## 10       WINTER STORM   132721
We can visualize the results so far with bar plots of the top 10 event types by crop damage and by property damage.
par(mfrow = c(1, 2))
barplot(cropSummary$crop[1:10]/10^5, main = "Top 10 Crop Damage/Event", ylab = "Damage (units of 100,000)",
    names.arg = cropSummary$event[1:10], cex.names = 0.6, las = 2, cex.axis = 0.7,
    col = 3)
barplot(propertySummary$property[1:10]/10^5, main = "Top 10 Property Damage/Event",
    ylab = "Damage (units of 100,000)", names.arg = propertySummary$event[1:10],
    cex.names = 0.6, las = 2, cex.axis = 0.7, col = 2)
# These stats are used in the paragraphs below
firstCropPercentage <- cropSummary[1, 2]/sum(cropSummary[, 2]) * 100
secondCropPercentage <- cropSummary[2, 2]/sum(cropSummary[, 2]) * 100
firstPropertyPercentage <- propertySummary[1, 2]/sum(propertySummary[, 2]) *
100
secondPropertyPercentage <- propertySummary[2, 2]/sum(propertySummary[, 2]) *
100
As the figures and the table show, hail is the event type most harmful to the economy in terms of crop damage. It is responsible for 42.066% of total crop damage, followed by flash flood with 13.006%.
The same analysis for property damage shows that tornado is the most harmful event type in terms of property damage. It is responsible for 29.5122% of total property damage, followed by flash flood with 13.0472%.
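The damage totals above sum the raw damage figures as recorded. The storm data file is understood to carry separate magnitude columns for property and crop damage (assumed here to be PROPDMGEXP and CROPDMGEXP), which this analysis drops. The sketch below shows one way such codes could be applied before summing. The helper `exp_to_multiplier` is hypothetical; the mapping H, K, M, B to hundreds, thousands, millions and billions, and the choice to treat missing or unrecognized codes as a multiplier of 1, are assumptions that should be checked against the data documentation.

```r
# Hypothetical helper: convert a magnitude code to a numeric multiplier.
exp_to_multiplier <- function(exp_code) {
  # Assumed meaning of the codes; anything else falls back to 1.
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(exp_code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up damage figures with their magnitude codes.
damage <- c(25, 2.5, 1.2, 10)
exp_code <- c("K", "M", "b", NA)

# Scale each figure by its multiplier.
scaled <- damage * exp_to_multiplier(exp_code)
print(scaled)
```

Applying such a scaling to the real data could change the ranking of event types, so the percentages reported above should be read as shares of the unscaled totals.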