The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events.
1 Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2 Across the United States, which types of events have the greatest economic consequences?
Our analysis shows that tornadoes are the most harmful event type with respect to public health, while floods have the greatest economic consequences.
There are 6 basic steps required for loading and preprocessing the data:
1 Set the working directory to the project folder
echo = TRUE
setwd("~/Desktop/Coursera/ReproducibleResearch/PeerAssessment2")
2 Make sure the required libraries are loaded
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library("gridExtra")
## Loading required package: grid
3 Set the download file name and URL
downloadFile <- "data/repdata-data-StormData.csv.bz2"
downloadURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
4 Check for the data folder and the compressed file; create or download them if NOT found
if (!file.exists("./data")) { dir.create("./data") }
if (!file.exists(downloadFile)) {
  # mode = "wb" keeps the binary bz2 archive intact on all platforms
  download.file(downloadURL, downloadFile, method = "curl", mode = "wb")
}
5 Read in the csv data, and take a quick look at the file structure and contents
# The download is bzip2-compressed, not a zip archive, so unzip() cannot extract it;
# read.csv() decompresses .bz2 files on the fly
data <- read.csv(downloadFile, header = TRUE)
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(data, n=2)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14 100 3 0 0
## 2 NA 0 2 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
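A side note on the loading step: base R readers such as `read.csv` can read a bzip2-compressed file directly from its path, with no separate extraction step. The following is a minimal sketch using a throwaway temporary file; the file contents and the `tmp` and `toy` names are made up for illustration and are not part of the storm data.

```r
# Write a tiny bzip2-compressed CSV to a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "FLOOD,0"), con)
close(con)

# read.csv detects the compression and decompresses while reading
toy <- read.csv(tmp, header = TRUE)
print(toy)

# Clean up the temporary file
unlink(tmp)
```

The same call pattern should apply to the full storm file, although reading roughly 900,000 rows this way can take a while.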
There are 7 variables we are interested in for the two questions. They are:
EVTYPE as a measure of event type (e.g. tornado, flood, etc.)
FATALITIES as a measure of harm to human health
INJURIES as a measure of harm to human health
PROPDMG as a measure of property damage and hence economic damage in USD
PROPDMGEXP as a measure of magnitude of property damage (e.g. thousands, millions USD, etc.)
CROPDMG as a measure of crop damage and hence economic damage in USD
CROPDMGEXP as a measure of magnitude of crop damage (e.g. thousands, millions USD, etc.)
To make our analysis more efficient, we can select only the columns we need for computation and analysis.
6 Remove unwanted columns not used in this assignment
desiredColumns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm <- data[desiredColumns]
str(storm)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
We need to determine which weather events caused the highest number of fatalities, and the most injuries. There are 8 steps in determining weather impacts on Public Health:
1 Here are the top 10 weather events that caused the highest number of fatalities:
FATAL <- group_by(storm, EVTYPE)
FATAL10 <- summarise(FATAL, total = sum(FATALITIES)) %>%
  arrange(desc(total)) %>%
  top_n(10)
## Selecting by total
FATAL10
## Source: local data frame [10 x 2]
##
## EVTYPE total
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
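The plots further down also use an `INJURY10` table whose construction is not shown in this section. Presumably it follows the same group-and-sum pattern as `FATAL10`, applied to `INJURIES`. A minimal base R sketch of that pattern on a toy data frame; the values and the `toy_storm`, `injury_totals`, and `INJURY10_toy` names are hypothetical:

```r
# Toy stand-in for the `storm` data frame (values are made up for illustration)
toy_storm <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  INJURIES = c(15, 2, 6, 1, 0)
)

# Sum injuries per event type
injury_totals <- aggregate(INJURIES ~ EVTYPE, data = toy_storm, FUN = sum)
names(injury_totals)[2] <- "total"

# Sort by total, largest first, and keep at most the top 10 rows
injury_totals <- injury_totals[order(-injury_totals$total), ]
INJURY10_toy <- head(injury_totals, 10)
print(INJURY10_toy)
```

With dplyr loaded, the likely equivalent on the real data would mirror the `FATAL10` code with `sum(INJURIES)` in place of `sum(FATALITIES)`; this is an assumption about the omitted step.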
As shown above, tornadoes lead by a wide margin in both deaths and injuries. The two bar plots below reflect these results.
# Two side-by-side panels, with a wide bottom margin for the rotated event labels
par(mfrow = c(1, 2), mar = c(12, 6, 3, 2), mgp = c(4, 1, 0), cex = 0.7)
barplot(FATAL10$total,
        names.arg = FATAL10$EVTYPE,
        col = "red",
        ylab = "Total Deaths",
        main = "Top 10 Weather Events \n Resulting in a Fatality",
        las = 3, xpd = TRUE)
barplot(INJURY10$total,
        names.arg = INJURY10$EVTYPE,
        col = "yellow",
        ylab = "Total Injuries",
        main = "Top 10 Weather Events \n Resulting in Injuries",
        las = 3, xpd = TRUE)
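The damage tables that follow use `ActPropDMG` and `ActCropDMG` columns whose computation does not appear in this section. One plausible way to derive them is to translate the `PROPDMGEXP` and `CROPDMGEXP` codes into numeric multipliers and scale the raw damage figures. The sketch below assumes that H, K, M, and B stand for hundreds, thousands, millions, and billions, and that every other code (blank, "?", "+", "-", digits) is treated as a multiplier of 1. Both the mapping and the `exp_to_multiplier` helper are assumptions for illustration, not the author's confirmed method.

```r
# Hypothetical helper: map a damage exponent code to a numeric multiplier.
# Assumed convention: H = 1e2, K = 1e3, M = 1e6, B = 1e9 (case-insensitive);
# any other code falls back to a multiplier of 1.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy data (made-up values) standing in for the storm columns
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)

# Scale the raw figure by its multiplier to get damage in USD
toy$ActPropDMG <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)

# Total per event type, largest first
prop_totals <- aggregate(ActPropDMG ~ EVTYPE, data = toy, FUN = sum)
prop_totals <- prop_totals[order(-prop_totals$ActPropDMG), ]
print(prop_totals)
```

How the ambiguous codes are handled can noticeably change the totals, so the mapping deserves to be stated explicitly in any reproduction of this analysis.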
1 Property Damage due to Weather Events:
PropDMG10
## EVTYPE ActPropDMG
## 170 FLOOD 144657709807
## 411 HURRICANE/TYPHOON 69305840000
## 834 TORNADO 56947380676
## 670 STORM SURGE 43323536000
## 153 FLASH FLOOD 16822673978
## 244 HAIL 15735267513
## 402 HURRICANE 11868319010
## 848 TROPICAL STORM 7703890550
## 972 WINTER STORM 6688497251
## 359 HIGH WIND 5270046295
2 Crop Damage due to Weather Events:
CropDMG10
## EVTYPE ActCropDMG
## 95 DROUGHT 13972566000
## 170 FLOOD 5661968450
## 590 RIVER FLOOD 5029459000
## 427 ICE STORM 5022113500
## 244 HAIL 3025954473
## 402 HURRICANE 2741910000
## 411 HURRICANE/TYPHOON 2607872800
## 153 FLASH FLOOD 1421317100
## 140 EXTREME COLD 1292973000
## 212 FROST/FREEZE 1094086000
3 Total Economic Impact due to Weather Events:
TotalDMG10
## EVTYPE total
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57362333946
## 4 STORM SURGE 43323541000
## 5 HAIL 18761221986
## 6 FLASH FLOOD 18243991078
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041360
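The totals in this table are consistent with adding the property and crop figures per event type; for FLOOD, 144657709807 + 5661968450 = 150319678257. A minimal sketch of that combination step, using the five event types that appear in both tables above. The `merge`-based approach and the `prop`, `crop`, and `combined` names are assumptions, not necessarily how `TotalDMG10` was built.

```r
# Property and crop totals copied from the two tables above
prop <- data.frame(
  EVTYPE     = c("FLOOD", "HURRICANE/TYPHOON", "FLASH FLOOD", "HAIL", "HURRICANE"),
  ActPropDMG = c(144657709807, 69305840000, 16822673978, 15735267513, 11868319010)
)
crop <- data.frame(
  EVTYPE     = c("FLOOD", "HAIL", "HURRICANE", "HURRICANE/TYPHOON", "FLASH FLOOD"),
  ActCropDMG = c(5661968450, 3025954473, 2741910000, 2607872800, 1421317100)
)

# Join on event type and add the two damage columns
combined <- merge(prop, crop, by = "EVTYPE")
combined$total <- combined$ActPropDMG + combined$ActCropDMG

# Sort by total damage, largest first
combined <- combined[order(-combined$total), c("EVTYPE", "total")]
rownames(combined) <- NULL
print(combined)
```

An inner join like this drops event types that are missing from either table; on the full data, `merge(..., all = TRUE)` with missing values set to 0 would keep them.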
As shown in the tables above, floods were the leading cause of total economic impact, even though drought leads for crop damage alone. The three bar plots below reflect these results.
# Three side-by-side panels; damage values are divided by 10^6 to plot in millions of USD
par(mfrow = c(1, 3), mar = c(12, 6, 3, 2), mgp = c(3, 1, 0), cex = 0.6)
barplot(PropDMG10$ActPropDMG / (10^6),
        names.arg = PropDMG10$EVTYPE,
        col = "green",
        ylab = "Total Property Damage (million $)",
        main = "Top 10 Events \n Resulting in Property Damage",
        las = 3, xpd = TRUE)
barplot(CropDMG10$ActCropDMG / (10^6),
        names.arg = CropDMG10$EVTYPE,
        ylab = "Total Crop Damage (million $)",
        col = "purple",
        main = "Top 10 Events \n Resulting in Crop Damage",
        las = 3, xpd = TRUE)
barplot(TotalDMG10$total / (10^6),
        names.arg = TotalDMG10$EVTYPE,
        col = "orange",
        ylab = "Total Damage (million $)",
        main = "Top 10 Events \n Resulting in Total Damage",
        las = 3, xpd = TRUE)