The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis:
knitr::opts_chunk$set(echo = TRUE) # Always make code visible
library(ggplot2)
library(plyr)
library(R.utils) # Provides bunzip2(), used below
The NOAA storm data file is downloaded from the internet and unzipped into the Coursera folder on the Desktop.
setwd("~/Desktop/Coursera/")
knitr::opts_chunk$set(cache = TRUE) # Cache results so the large file is not reprocessed on every knit
if (!"stormData.csv.bz2" %in% dir("~/Desktop/Coursera")) {
  download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "stormData.csv.bz2")
  bunzip2("stormData.csv.bz2", overwrite = TRUE, remove = FALSE)
}
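As an aside, `read.csv()` can usually read a bzip2-compressed file directly, because R's file connections detect the compression when a file is opened for reading, so the explicit `bunzip2()` step is optional. A minimal sketch on a small temporary file (the file and its contents are made up for illustration and are not the NOAA data):

```r
# Write a tiny bzip2-compressed CSV to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close(con)

# read.csv() opens the compressed file without a separate decompression step
sample.storm <- read.csv(tmp)
print(sample.storm)
```

Reading the compressed file directly avoids keeping a second, much larger uncompressed copy on disk, at the cost of a slower read.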
The dimensions and content of the NOAA data are then examined.
data.storm <- read.csv("~/Desktop/Coursera/stormData.csv", sep = ",")
dim(data.storm)
## [1] 902297 37
head(data.storm, n=3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
A histogram of storm events by year shows that the storm data is more complete in recent years.
# Since we only care about the year of the event date, extract the year as a number
data.storm$year <- as.numeric(format(as.Date(data.storm$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
# A quick histogram to see how data was distributed across the years
hist(data.storm$year, xlab= "Year", breaks = 30, main = "Distribution of Weather Events Data over Years", col="steel blue")
The histogram above shows that the number of weather events recorded since 1995 has almost doubled compared with earlier years. We believe this increase is due to advances in technology, which enabled scientists to record events that they were not able to capture before. Since our study focuses on the total impact of weather events over the years, we have chosen not to exclude data on the basis of this observation.
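The year extraction behind the histogram can be checked on a few values. The sketch below uses the three `BGN_DATE` strings shown in the `head()` output; the variable names `bgn.date`, `event.year` and `events.per.year` are introduced here for illustration only.

```r
# Sample BGN_DATE values in the format shown in the head() output
bgn.date <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:00")

# Parse the date part and keep only the four-digit year as a number
event.year <- as.numeric(format(as.Date(bgn.date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
print(event.year)

# Count events per year, which is the quantity the histogram bins
events.per.year <- table(event.year)
print(events.per.year)
```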
We are interested in the top 15 types of weather events that resulted in fatalities and injuries.
Top 15 types of weather events with the highest total fatalities:
# Subtotal of fatalities by event type
data.storm.fatality <- aggregate(data.storm[, "FATALITIES"], by = list(data.storm$EVTYPE), FUN = "sum", na.rm=T)
# Sorting the data by numbers of fatalities in descending order
data.storm.fatality <- arrange(data.storm.fatality, data.storm.fatality[,2], decreasing=T)
# Subsetting the top 15
data.storm.fatality <- data.storm.fatality[1:15,]
# Adding column headers
colnames(data.storm.fatality) <- c("Event.Type", "Fatalities")
# Setting levels of Event.Type
data.storm.fatality[,"Event.Type"] <- factor(data.storm.fatality[,"Event.Type"], levels = data.storm.fatality[,"Event.Type"])
# Show the top 15 fatality data
data.storm.fatality
## Event.Type Fatalities
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## 11 WINTER STORM 206
## 12 RIP CURRENTS 204
## 13 HEAT WAVE 172
## 14 EXTREME COLD 160
## 15 THUNDERSTORM WIND 133
# Plotting the figure
ggplot(data.storm.fatality, aes(Event.Type, Fatalities)) + geom_bar(stat = "identity", color="steel blue", fill="steel blue", width = 0.7) + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + labs(x="Event Type", y="Total No. of Fatalities", title="Total Fatalities by Severe Weather\n Events in the U.S.\n from 1950 - 2011\n")
Top 15 types of weather events with the highest total injuries:
# Subtotal of injuries by event type
data.storm.injury <- aggregate(data.storm[, "INJURIES"], by = list(data.storm$EVTYPE), FUN = "sum", na.rm=T)
# Sorting the data by numbers of injuries in descending order
data.storm.injury <- arrange(data.storm.injury, data.storm.injury[,2], decreasing=T)
# Subsetting the top 15
data.storm.injury <- data.storm.injury[1:15,]
# Adding column headers
colnames(data.storm.injury) <- c("Event.Type", "Injuries")
# Setting levels of Event.Type
data.storm.injury[,"Event.Type"] <- factor(data.storm.injury[,"Event.Type"], levels = data.storm.injury[,"Event.Type"])
# Show the top 15 injury data
data.storm.injury
## Event.Type Injuries
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## 11 WINTER STORM 1321
## 12 HURRICANE/TYPHOON 1275
## 13 HIGH WIND 1137
## 14 HEAVY SNOW 1021
## 15 WILDFIRE 911
# Plotting the figure
ggplot(data.storm.injury, aes(Event.Type, Injuries)) + geom_bar(stat = "identity", color="steel blue", fill="steel blue", width = 0.7) + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + labs(x="Event Type", y="Total No. of Injuries", title="Total Injuries by Severe Weather\n Events in the U.S.\n from 1950 - 2011\n")
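The fatality and injury steps above follow the same pattern: total a column by event type, sort in descending order, keep the top rows, and fix the factor levels for plotting. One possible way to express that pattern once is sketched below in base R. The helper `top.events()` and the toy data are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): total a numeric
# column by event type and keep the n largest totals
top.events <- function(data, value.col, n = 15) {
  totals <- aggregate(data[[value.col]], by = list(data$EVTYPE), FUN = sum, na.rm = TRUE)
  colnames(totals) <- c("Event.Type", value.col)
  totals <- totals[order(totals[[value.col]], decreasing = TRUE), ]
  totals <- head(totals, n)
  # Fix the factor levels in sorted order so a bar chart keeps the rank order
  totals$Event.Type <- factor(totals$Event.Type, levels = totals$Event.Type)
  rownames(totals) <- NULL
  totals
}

# Toy data for illustration only
toy.storm <- data.frame(
  EVTYPE = c("TORNADO", "HAIL", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(3, 0, 2, 4, NA)
)
print(top.events(toy.storm, "FATALITIES", n = 2))
```

Under these assumptions, a call such as `top.events(data.storm, "INJURIES")` would stand in for the aggregate, sort, subset and relevel steps, although the value column would then be named `INJURIES` rather than `Injuries`.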
We are interested in the top 15 types of weather events that caused losses through property damage and crop damage.
Top 15 types of weather events that have caused the highest property damage:
# Remove unnecessary columns from data
data.storm.property.damage <- subset(data.storm, select = c("EVTYPE", "PROPDMG", "PROPDMGEXP"))
# Transform the measurements to number
data.storm.property.damage <- mutate(data.storm.property.damage,
  PROPDMGTTL = ifelse(PROPDMGEXP=="k" | PROPDMGEXP=="K",
    PROPDMG * 1000,
    ifelse(PROPDMGEXP=="m" | PROPDMGEXP=="M",
      PROPDMG * 1000000,
      ifelse(PROPDMGEXP=="b" | PROPDMGEXP=="B",
        PROPDMG * 1000000000,
        ifelse(PROPDMGEXP=="h" | PROPDMGEXP=="H",
          PROPDMG * 100, PROPDMG)))))
# Calculating total by event types
data.storm.property.damage <- aggregate(data.storm.property.damage$PROPDMGTTL, by = list(data.storm.property.damage$EVTYPE), FUN = "sum", na.rm=T)
# Sort
data.storm.property.damage <- arrange(data.storm.property.damage, data.storm.property.damage[,2], decreasing=T)
# Select top 15
data.storm.property.damage <- data.storm.property.damage[1:15,]
# Adding column names
colnames(data.storm.property.damage) <- c("Event.Type", "Property.Damage.Total")
# Setting levels of Event Type
data.storm.property.damage[,"Event.Type"] <- factor(data.storm.property.damage[,"Event.Type"], levels = data.storm.property.damage[,"Event.Type"])
# Display the top 15 property damage table
data.storm.property.damage
## Event.Type Property.Damage.Total
## 1 FLOOD 144657709807
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56937160779
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16140812067
## 6 HAIL 15732267543
## 7 HURRICANE 11868319010
## 8 TROPICAL STORM 7703890550
## 9 WINTER STORM 6688497251
## 10 HIGH WIND 5270046295
## 11 RIVER FLOOD 5118945500
## 12 WILDFIRE 4765114000
## 13 STORM SURGE/TIDE 4641188000
## 14 TSTM WIND 4484928495
## 15 ICE STORM 3944927860
# Plotting the figure
ggplot(data.storm.property.damage, aes(Event.Type, Property.Damage.Total/1e9)) + geom_bar(stat = "identity", color="steel blue", fill="steel blue", width = 0.7) + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + labs(x="Event Type", y="Property Damage (in billion dollars)", title="Total Property Damages by Severe Weather\n Events in the U.S.\n from 1950 - 2011\n")
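The nested `ifelse()` calls used above to scale `PROPDMG` by its exponent code can also be written as a lookup table, which may be easier to read and extend. This is a minimal sketch rather than the method used in this report: the helper `exp.multiplier()` and the toy values are hypothetical, and it assumes, as the code above does, that any code other than H, K, M or B leaves the amount unchanged.

```r
# Hypothetical helper: map an exponent code to its multiplier.
# As in the nested ifelse() calls, only H, K, M and B (either case) are
# scaled; any other code leaves the amount unchanged (multiplier 1).
exp.multiplier <- function(exp.code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(exp.code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy values for illustration only
toy.damage <- data.frame(
  PROPDMG = c(25, 2.5, 1.5, 10, 7),
  PROPDMGEXP = c("K", "m", "B", "", "5")
)
toy.damage$PROPDMGTTL <- toy.damage$PROPDMG * exp.multiplier(toy.damage$PROPDMGEXP)
print(toy.damage)
```

The same helper would apply to `CROPDMG` and `CROPDMGEXP`, so the two damage sections could share one conversion step.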
Top 15 types of weather events that have caused the highest crop damage:
# Remove unnecessary columns from data
data.storm.crop.damage <- subset(data.storm, select = c("EVTYPE", "CROPDMG", "CROPDMGEXP"))
# Transform the measurements to number
data.storm.crop.damage <- mutate(data.storm.crop.damage,
  CROPDMGTTL = ifelse(CROPDMGEXP=="k" | CROPDMGEXP=="K",
    CROPDMG * 1000,
    ifelse(CROPDMGEXP=="m" | CROPDMGEXP=="M",
      CROPDMG * 1000000,
      ifelse(CROPDMGEXP=="b" | CROPDMGEXP=="B",
        CROPDMG * 1000000000,
        ifelse(CROPDMGEXP=="h" | CROPDMGEXP=="H",
          CROPDMG * 100, CROPDMG)))))
# Calculating total by event types
data.storm.crop.damage <- aggregate(data.storm.crop.damage$CROPDMGTTL, by = list(data.storm.crop.damage$EVTYPE), FUN = "sum", na.rm=T)
# Sort
data.storm.crop.damage <- arrange(data.storm.crop.damage, data.storm.crop.damage[,2], decreasing=T)
# Select top 15
data.storm.crop.damage <- data.storm.crop.damage[1:15,]
# Adding column names
colnames(data.storm.crop.damage) <- c("Event.Type", "Crop.Damage.Total")
# Setting levels of Event Type
data.storm.crop.damage[,"Event.Type"] <- factor(data.storm.crop.damage[,"Event.Type"], levels = data.storm.crop.damage[,"Event.Type"])
# Display the top 15 crop damage table
data.storm.crop.damage
## Event.Type Crop.Damage.Total
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025954473
## 6 HURRICANE 2741910000
## 7 HURRICANE/TYPHOON 2607872800
## 8 FLASH FLOOD 1421317100
## 9 EXTREME COLD 1292973000
## 10 FROST/FREEZE 1094086000
## 11 HEAVY RAIN 733399800
## 12 TROPICAL STORM 678346000
## 13 HIGH WIND 638571300
## 14 TSTM WIND 554007350
## 15 EXCESSIVE HEAT 492402000
# Plotting the figure
ggplot(data.storm.crop.damage, aes(Event.Type, Crop.Damage.Total/1e9)) + geom_bar(stat = "identity", color="steel blue", fill="steel blue", width = 0.7) + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + labs(x="Event Type", y="Crop Damage (in billion dollars)", title="Total Crop Damages by Severe Weather\n Events in the U.S.\n from 1950 - 2011\n")
Based on our analysis of the provided data, Tornado and Excessive Heat are the two severe weather events that have caused the most fatalities, and Tornado has also caused by far the most injuries, making them the most harmful to public health. In economic terms, Flood has caused the greatest property damage and Drought the greatest crop damage in the United States.
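The analysis reports property and crop damage separately. One possible way to combine them into a single economic figure is sketched below. It uses the totals from the two tables above, but only for three event types that appear in both top-15 lists, so it illustrates the calculation and is not a full ranking; the data frame names are introduced here for illustration.

```r
# Property and crop totals copied from the two tables above, limited to
# three event types that appear in both top-15 lists
property.totals <- data.frame(
  Event.Type = c("FLOOD", "HURRICANE/TYPHOON", "FLASH FLOOD"),
  Property.Damage.Total = c(144657709807, 69305840000, 16140812067)
)
crop.totals <- data.frame(
  Event.Type = c("FLOOD", "HURRICANE/TYPHOON", "FLASH FLOOD"),
  Crop.Damage.Total = c(5661968450, 2607872800, 1421317100)
)

# Join on event type and add the two damage columns
economic.totals <- merge(property.totals, crop.totals, by = "Event.Type")
economic.totals$Economic.Damage.Total <-
  economic.totals$Property.Damage.Total + economic.totals$Crop.Damage.Total

# Sort from largest to smallest combined damage
economic.totals <- economic.totals[order(economic.totals$Economic.Damage.Total, decreasing = TRUE), ]
print(economic.totals)
```

A complete ranking would need the per-event totals for every event type before the top-15 cut, since an event such as Drought appears in only one of the two tables.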