In this report, I aim to identify which types of storm events are most harmful to population health and which have the gravest economic consequences. My overall hypothesis is that, for each of these two kinds of consequence, some event types have a much higher impact than others. To investigate this hypothesis, I obtained the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), which covers events from 1950 to 2011. From these data, I found that, across the U.S., tornadoes have by a large margin the greatest public health impact (both fatalities and injuries), while floods have the highest economic impact.
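A quick exploratory analysis of NOAA’s Storm Database was carried out to identify which event types are most harmful to population health and which have the greatest economic consequences.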
# Loading Libraries
library(dplyr)
library(ggplot2)
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "noaa.csv", method = "curl")
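The download above runs every time the report is knitted. A minimal sketch of one way to skip it when the file is already on disk is given here; `download_if_missing` is a hypothetical helper name, not part of the original analysis, and `mode = "wb"` is an added assumption to keep the compressed file intact on Windows.

```r
# Hypothetical helper (not in the original report):
# download a file only when it is not already on disk
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))  # file already present, nothing downloaded
  }
  download.file(url, destfile = destfile, mode = "wb")  # binary mode for the bz2 archive
  invisible(TRUE)
}
```

With the names used in this report, the call would be `download_if_missing(fileUrl, "noaa.csv")`.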
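The file is read into `data`:
# read.csv() reads the bz2-compressed file directly; stringsAsFactors = TRUE keeps EVTYPE a factor, which levels(df$EVTYPE) relies on below
data <- read.csv(file = "noaa.csv", header = TRUE, stringsAsFactors = TRUE)
A copy of `data`, named `df`, is used for the analysis. This keeps a ready-to-use original version of the data.
df <- data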
dim(df)
## [1] 902297 37
names(df)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
df <- df %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
According to the National Weather Service Storm Data Documentation:
VARIABLE | DESCRIPTION |
---|---|
EVTYPE | Type of weather storm event |
FATALITIES | Number of people killed directly |
INJURIES | Number of people injured directly |
PROPDMG | Property damage |
PROPDMGEXP | Unit multiplier for PROPDMG: Hundred (H), Thousand (K), Million (M), Billion (B) |
CROPDMG | Crop damage |
CROPDMGEXP | Unit multiplier for CROPDMG: Hundred (H), Thousand (K), Million (M), Billion (B) |
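Property and crop damage values are not all recorded in the same unit (see the table of variables above). Each value has to be multiplied by the dollar multiplier that corresponds to its damage code: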
DAMAGE CODE | VALUE |
---|---|
B | 1,000,000,000 |
M | 1,000,000 |
K | 1,000 |
H | 100 |
NA or BLANK | 1 |
dfE <- df # create a copy before the transformations
# equivalency table: damage code -> dollar multiplier
tequiv <- data.frame(code = c("B", "M", "K", "H", "", NA),
                     value = c(1000000000, 1000000, 1000, 100, 1, 1))
# attach the property multiplier; merge() is an inner join by default, so rows with any other code are dropped
dfE1 <- merge(dfE, tequiv, by.x = "PROPDMGEXP", by.y = "code")
# attach the crop multiplier; the two multiplier columns are now named value.x (property) and value.y (crop)
dfE1 <- merge(dfE1, tequiv, by.x = "CROPDMGEXP", by.y = "code")
dfE1$property.damage <- dfE1$PROPDMG * dfE1$value.x # property damage in dollars
dfE1$crop.damage <- dfE1$CROPDMG * dfE1$value.y # crop damage in dollars
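The two merges above keep only rows whose code appears in the equivalency table. As an alternative, the sketch below maps codes to multipliers with a named-vector lookup, so no rows are dropped. This is not the method used in this report: `exp_to_multiplier` is a hypothetical helper, and treating lowercase codes like their uppercase form and every unrecognised code as a multiplier of 1 are assumptions that would need checking against the Storm Data documentation.

```r
# Hypothetical helper (not the method used in the report):
# map damage codes to dollar multipliers without dropping any rows
exp_to_multiplier <- function(code) {
  lookup <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)
  code <- toupper(as.character(code))  # assumption: "m" means the same as "M"
  mult <- unname(lookup[code])         # codes not in the lookup become NA
  mult[is.na(mult)] <- 1               # assumption: blank, NA or unknown code -> 1
  mult
}

# First record shown in this report: PROPDMG = 25.0 with code "K"
25.0 * exp_to_multiplier("K")  # 25000
```

Applied to the full dataset this would be `df$PROPDMG * exp_to_multiplier(df$PROPDMGEXP)`. The totals could differ from those reported here, because rows dropped by the merge would be retained.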
Inspecting `EVTYPE` reveals that many records have a description of the form `Summary...`, which shows that they are not events per se. A few examples are shown below:
levels(df$EVTYPE)[721:723]
## [1] "Summary of March 23" "Summary of March 24"
## [3] "SUMMARY OF MARCH 24-25"
These records are removed from the analysis dataset (`df`).
removeRows <- grep("^summary", df$EVTYPE, ignore.case = TRUE)
df <- df[-removeRows,]
# confirm that the number of rows removed is as intended: the expression should evaluate to TRUE
nrow(data) - nrow(df) == length(removeRows)
## [1] TRUE
75 rows (events) were removed from the dataset.
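The removal step uses negative indexing, which works here because matches exist. With no matches, `df[-integer(0), ]` would return zero rows instead of the full table. The sketch below shows the same filter written with a logical mask, which is safe in both cases. The labels are toy values invented for illustration.

```r
# Toy labels, invented for illustration only
evtype <- c("TORNADO", "Summary of March 23", "SUMMARY OF MARCH 24-25",
            "summary jan 17", "HAIL")

# TRUE for labels that start with "summary", ignoring case
is_summary <- grepl("^summary", evtype, ignore.case = TRUE)

# keep everything that is not a summary record
kept <- evtype[!is_summary]
kept  # "TORNADO" "HAIL"
```

On the real data the equivalent would be `df[!grepl("^summary", df$EVTYPE, ignore.case = TRUE), ]`.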
df[1:3,]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
Calculation of the total number of fatalities and injuries.
totalFatalities <- sum(df$FATALITIES)
totalFatalities
## [1] 15145
totalInjuries <- sum(df$INJURIES)
totalInjuries
## [1] 140528
There are over 800 storm types. However, a quick look at the histograms of the per-type totals (code below) shows that the vast majority of fatalities (and injuries) are caused by a few types.
# `fatalities` and `injuries` are the per-type summaries built in the top-10 steps further down; run those first
hist(fatalities$Total)
hist(injuries$Total)
Furthermore, there is a strong association between fatalities and injuries for all storm types.
df1 <- df %>%
group_by(EVTYPE) %>%
summarize(fatalities = sum(FATALITIES), injuries = sum(INJURIES))
correlation1 <- cor(df1$fatalities, df1$injuries)
correlation1
## [1] 0.9438477
The correlation between fatalities and injuries per event type is 0.94, as calculated above.
Identify the top 10 causes of fatalities and sort them in descending order.
fatalities <- df %>%
group_by(EVTYPE) %>%
summarize(Total = sum(FATALITIES), Percentage = (Total * 100 / totalFatalities)) %>%
arrange(desc(Percentage))
head(fatalities, 10)
## Source: local data frame [10 x 3]
##
## EVTYPE Total Percentage
## 1 TORNADO 5633 37.193793
## 2 EXCESSIVE HEAT 1903 12.565203
## 3 FLASH FLOOD 978 6.457577
## 4 HEAT 937 6.186860
## 5 LIGHTNING 816 5.387917
## 6 TSTM WIND 504 3.327831
## 7 FLOOD 470 3.103334
## 8 RIP CURRENT 368 2.429845
## 9 HIGH WIND 248 1.637504
## 10 AVALANCHE 224 1.479036
fatalities_10 <- fatalities[1:10,]
percentageTop10Fat <- sum(fatalities_10$Percentage)
percentageTop10Fat
## [1] 79.7689
The top 10 storm types account for 80% of fatalities. Tornado is the type of event responsible for the highest number of fatalities, as documented above. Tornado is also the event type responsible for the highest number of injuries, as can be seen below:
Identify the top 10 causes of injuries and sort them in descending order.
injuries <- df %>%
group_by(EVTYPE) %>%
summarize(Total = sum(INJURIES), Percentage = (Total * 100 / totalInjuries)) %>%
arrange(desc(Percentage))
head(injuries, 10)
## Source: local data frame [10 x 3]
##
## EVTYPE Total Percentage
## 1 TORNADO 91346 65.0019925
## 2 TSTM WIND 6957 4.9506148
## 3 FLOOD 6789 4.8310657
## 4 EXCESSIVE HEAT 6525 4.6432028
## 5 LIGHTNING 5230 3.7216782
## 6 HEAT 2100 1.4943641
## 7 ICE STORM 1975 1.4054139
## 8 FLASH FLOOD 1777 1.2645167
## 9 THUNDERSTORM WIND 1488 1.0588637
## 10 HAIL 1361 0.9684903
injuries_10 <- injuries[1:10,]
percentageTop10Inj <- sum(injuries_10$Percentage)
percentageTop10Inj
## [1] 89.3402
As calculated above, 89% of injuries are caused by the top 10 storm types.
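The injuries table lists TSTM WIND and THUNDERSTORM WIND as separate types, which suggests that some labels are spelling variants of the same event. The sketch below shows one possible way to collapse such variants before grouping. It is not applied in this report, `normalize_evtype` is a hypothetical helper, and the single abbreviation rule is an assumption; a full cleanup would need many more rules.

```r
# Hypothetical helper (not applied in the report):
# collapse obvious spelling variants of event labels
normalize_evtype <- function(evtype) {
  label <- toupper(trimws(as.character(evtype)))  # uniform case, no outer spaces
  label <- gsub("\\s+", " ", label)               # collapse repeated whitespace
  sub("^TSTM ", "THUNDERSTORM ", label)           # assumption: TSTM abbreviates THUNDERSTORM
}

normalize_evtype(c("TSTM WIND", " Thunderstorm  Wind", "HAIL"))
# "THUNDERSTORM WIND" "THUNDERSTORM WIND" "HAIL"
```

Grouping by the normalized label would merge the two thunderstorm rows and could change the ranking below tornadoes.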
total.property.damage <- sum(dfE1$property.damage)
total.crop.damage <- sum(dfE1$crop.damage)
economic <- dfE1 %>%
  group_by(EVTYPE) %>%
  summarize(totalProperty = sum(property.damage),
            percProperty = totalProperty * 100 / total.property.damage,
            totalCrop = sum(crop.damage),
            percCrop = totalCrop * 100 / total.crop.damage,
            overall.damage = totalProperty + totalCrop)
Both property damage and crop damage are strongly positively skewed (the histograms are not printed in this report, but the code is given below).
hist(dfE1$property.damage)
hist(dfE1$crop.damage)
correlation2 <- cor(economic$totalProperty, economic$totalCrop)
correlation2
## [1] 0.3784556
The correlation between total property damage and total crop damage per event type is modest (0.38), which suggests that a given event type can affect crops and property quite differently.
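Because both damage totals are strongly skewed, the Pearson coefficient can be dominated by a few extreme event types. A rank-based coefficient is one possible cross-check. The sketch below uses toy numbers invented for illustration, not values from the NOAA data.

```r
# Toy totals, invented for illustration only: one extreme value in x
x <- c(1, 2, 3, 4, 1000)
y <- c(2, 4, 5, 9, 10)

pearson  <- cor(x, y)                       # sensitive to the extreme value
spearman <- cor(x, y, method = "spearman")  # uses ranks only; equals 1 here

c(pearson = pearson, spearman = spearman)
```

For this report the equivalent call would be `cor(economic$totalProperty, economic$totalCrop, method = "spearman")`.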
property <- economic %>%
select(EVTYPE, totalProperty, percProperty) %>%
arrange(desc(totalProperty))
property_10 <- property[1:10,]
top_tenProperty <- sum(property_10$percProperty)
top_tenProperty
## [1] 88.37599
crop <- economic %>%
select(EVTYPE, totalCrop, percCrop) %>%
arrange(desc(totalCrop))
crop_10 <- crop[1:10,]
top_tenCrop <- sum(crop_10$percCrop)
top_tenCrop
## [1] 85.36471
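The top-ten share is computed four times above (fatalities, injuries, property damage, crop damage) with the same steps: sort, keep the first ten, sum the percentages. A minimal sketch of a reusable version follows. `top_n_share` is a hypothetical helper name and the totals are toy values invented for illustration.

```r
# Hypothetical helper (not in the original report):
# percentage of the grand total accounted for by the n largest groups
top_n_share <- function(totals, n = 10) {
  sorted <- sort(totals, decreasing = TRUE)
  100 * sum(head(sorted, n)) / sum(totals)
}

# Toy per-type totals, invented for illustration only
toy_totals <- c(TORNADO = 50, HEAT = 30, FLOOD = 15, HAIL = 5)
top_n_share(toy_totals, n = 2)  # 80
```

With the objects defined in this report, `top_n_share(economic$totalCrop)` should give the same figure as `top_tenCrop`.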
ggplot(data = fatalities_10, aes(x = reorder(EVTYPE, Total), y = Total)) +
geom_bar(stat = "identity", fill = "#333333") +
xlab("") +
ylab("Total Fatalities") +
coord_flip() +
ggtitle("Storm Types That Cause Highest Number of Fatalities")
The chart above illustrates the impact of each of the top-ten storm types which cause the highest number of fatalities.
ggplot(data = injuries_10, aes(x = reorder(EVTYPE, Total), y = Total)) +
geom_bar(stat = "identity", fill = "#468499") +
xlab("") +
ylab("Total Injuries") +
coord_flip() +
ggtitle("Storm Types That Cause Highest Number of Injuries")
The chart above illustrates the impact of each of the top-ten storm types which cause the highest number of injuries.
economicTotal <- economic %>%
arrange(desc(overall.damage))
economicTotal <- economicTotal[1:10,]
economicTotal$overall.damage.M <- (economicTotal$overall.damage / 1000000) # overall damage in millions of Dollars
ggplot(data = economicTotal, aes(x = reorder(EVTYPE, overall.damage.M), y = overall.damage.M)) +
geom_bar(stat = "identity", fill = "#ff4444") +
coord_flip() +
ylab("Overall Damage (millions of U.S. Dollars)") +
xlab("") +
ggtitle("Economic Impact of Top-ten Storm Event Types in the U.S.")
The chart above shows the top-ten storm types with the greatest economic impact (property + crops) in the U.S.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis. Source: assignment brief. The report is limited to three figures.