1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
We first load and skim through the data, then extract the columns needed to answer the two questions above. After extracting, transforming, and summarizing the data, we obtain the answers and plot the results as figures.
Preparing data. Storm Data: click the link above and save the file (repdata_data_StormData.csv.bz2) into your working directory.
Preparing R setup and loading libraries.
## As a Korean user, I had trouble loading the CSV. To solve the issue, I needed to change the language setting.
Sys.setlocale("LC_ALL", "English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
##libraries
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.0 (2020-02-14 07:10:20 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.23.0 successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
##
## throw
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, load, save
## R.utils v2.9.2 successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, nullfile, parse,
## warnings
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Loading the data.
if (!exists("stormdata")) {
  if (!file.exists("stormdata.csv")) {
    bunzip2("repdata_data_StormData.csv.bz2", "stormdata.csv", remove = FALSE)
  }
  stormdata <- read.csv("stormdata.csv", sep = ",", header = TRUE)
}
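As a side note, the explicit decompression step may not be strictly necessary: read.csv() opens a path through file(), which detects bzip2 compression when reading. The sketch below illustrates this on a tiny temporary file; the file contents and the name `toy` are made up for illustration and are not part of the storm data.

```r
# Write a tiny CSV into a bzip2-compressed temporary file
# (a hypothetical stand-in for repdata_data_StormData.csv.bz2)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "HEAT,3"), con)
close(con)

# read.csv() reads the compressed file directly, without bunzip2()
toy <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)
print(toy)

# Clean up the temporary file
unlink(tmp)
```

Reading the archive directly avoids keeping an uncompressed copy on disk, at the cost of decompressing on every load.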
Examine the data
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
summary(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
After skimming the data, we find that the analysis only needs the following columns.
Required columns for analysis:
1. EVTYPE: event type
2. FATALITIES, INJURIES: population health
3. PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP: economic results. The impact is derived by multiplying the numeric values in PROPDMG and CROPDMG by the multiplier encoded in PROPDMGEXP and CROPDMGEXP, e.g. K (1,000) and M (1,000,000).
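Before applying any multipliers, it can help to tabulate which exponent codes actually occur, since codes other than K and M may be present. The sketch below uses a small made-up vector (`codes` is hypothetical, not the real PROPDMGEXP column); on the real data the same call could be applied to stormdata$PROPDMGEXP.

```r
# Hypothetical exponent codes, mimicking the kind of values a column
# such as PROPDMGEXP might contain (not the actual data)
codes <- c("K", "K", "M", "", "k", "B", "?", "5", NA)

# Count each code, ignoring case and keeping NA as its own category
code_counts <- table(toupper(codes), useNA = "ifany")
print(code_counts)
```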
Impact on health
##Which type of events are most harmful to population health?
fatalities <- stormdata %>% select(EVTYPE,FATALITIES) %>% group_by(EVTYPE) %>% summarise(total = sum(FATALITIES)) %>% arrange(-total)
## `summarise()` ungrouping output (override with `.groups` argument)
head(fatalities)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 2
## EVTYPE total
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
injuries <- stormdata %>% select(EVTYPE,INJURIES) %>% group_by(EVTYPE) %>% summarise(total = sum(INJURIES)) %>% arrange(-total)
## `summarise()` ungrouping output (override with `.groups` argument)
head(injuries)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 2
## EVTYPE total
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
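The two summaries above could also be computed in a single pass. The following is a minimal base R sketch using aggregate() on a made-up data frame (`toy` and its values are hypothetical); it is an alternative to the dplyr pipeline, not the method used to produce the tables above.

```r
# Hypothetical miniature version of the storm data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 3, 2, 1),
  INJURIES   = c(15, 2, 0, 4, 6),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type in one call
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by fatalities, largest first
health <- health[order(-health$FATALITIES), ]
print(health)
```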
Impact on economy: We need to transform the data slightly. According to the link, the exponent codes map to multipliers as follows:
(1) K, k: 10^3
(2) M, m: 10^6
(3) H, h: 10^2
(4) B, b: 10^9
(5) -, NA, ?: 0
stormdata$PROPDMGEXP<-tolower(stormdata$PROPDMGEXP)
stormdata$PROPDMGEXP <- gsub("k",1000,stormdata$PROPDMGEXP)
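## m = 10^6, as for CROPDMGEXP; the totals printed further down appear to come from a run that used 10^5 here
stormdata$PROPDMGEXP <- gsub("m",1000000,stormdata$PROPDMGEXP)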
stormdata$PROPDMGEXP <- gsub("h",100,stormdata$PROPDMGEXP)
stormdata$PROPDMGEXP <- gsub("b",1000000000,stormdata$PROPDMGEXP)
stormdata$PROPDMGACTUAL <- stormdata$PROPDMG * as.numeric(stormdata$PROPDMGEXP)
## Warning: NAs introduced by coercion
stormdata$PROPDMGACTUAL[is.na(stormdata$PROPDMGACTUAL)] <- 0
stormdata$CROPDMGEXP<-tolower(stormdata$CROPDMGEXP)
stormdata$CROPDMGEXP <- gsub("k",1000,stormdata$CROPDMGEXP)
stormdata$CROPDMGEXP <- gsub("m",1000000,stormdata$CROPDMGEXP)
stormdata$CROPDMGEXP <- gsub("h",100,stormdata$CROPDMGEXP)
stormdata$CROPDMGEXP <- gsub("b",1000000000,stormdata$CROPDMGEXP)
stormdata$CROPDMGACTUAL <- stormdata$CROPDMG * as.numeric(stormdata$CROPDMGEXP)
## Warning: NAs introduced by coercion
stormdata$CROPDMGACTUAL[is.na(stormdata$CROPDMGACTUAL)] <- 0
stormdata$economic.dmg <- stormdata$CROPDMGACTUAL + stormdata$PROPDMGACTUAL
dmg.total <- stormdata %>% select(EVTYPE, economic.dmg) %>% group_by(EVTYPE) %>% summarise(total = sum(economic.dmg)) %>% arrange(-total)
## `summarise()` ungrouping output (override with `.groups` argument)
head(dmg.total)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 2
## EVTYPE total
## <chr> <dbl>
## 1 FLOOD 131168416250
## 2 HURRICANE/TYPHOON 68490229800
## 3 STORM SURGE 42653104000
## 4 DROUGHT 14079927000
## 5 TORNADO 13725802101
## 6 RIVER FLOOD 10053724500
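As an alternative to the chained gsub() calls, the exponent rules can be written as a single lookup table, which makes the multipliers easier to check at a glance. This is a minimal sketch: the helper `exp_multiplier` and the `toy` data frame are hypothetical names introduced here. It assumes that any code outside h/k/m/b (including digits, which the gsub() approach would instead treat as literal multipliers) maps to 0.

```r
# Map an exponent code to its multiplier; unknown codes, "" and NA give 0
# (assumption: only h, k, m and b carry meaning)
exp_multiplier <- function(code) {
  lookup <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
  mult <- unname(lookup[tolower(code)])
  mult[is.na(mult)] <- 0
  mult
}

# Hypothetical rows covering each rule
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 10),
  PROPDMGEXP = c("K", "M", "B", "h", "?"),
  stringsAsFactors = FALSE
)
toy$PROPDMGACTUAL <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

Because the multipliers live in one named vector, a slip such as a missing zero is confined to one visible place and cannot differ between the property and crop columns.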
1. Across the United States, which types of events are most harmful with respect to population health?
According to the tables above, tornadoes are the most harmful event type with respect to population health: they rank first for both fatalities and injuries. The following figure shows this more clearly.
fatality_plot <- ggplot() +
  geom_bar(data = fatalities[1:5, ],
           aes(x = EVTYPE, y = total, fill = interaction(total, EVTYPE)),
           stat = "identity", show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  xlab("Events") + ylab("No. of fatalities") +
  ggtitle("Top 5 weather events causing fatalities")
injuries_plot <- ggplot() +
  geom_bar(data = injuries[1:5, ],
           aes(x = EVTYPE, y = total, fill = interaction(total, EVTYPE)),
           stat = "identity", show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  xlab("Events") + ylab("No. of Injuries") +
  ggtitle("Top 5 weather events causing Injuries")
grid.arrange(fatality_plot, injuries_plot, ncol = 2)
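With a character x variable, ggplot2 orders the bars alphabetically rather than by height. If bars sorted by total are preferred, one option is to convert EVTYPE to a factor with reorder(). The sketch below reuses the five fatality totals from the table above in a stand-alone data frame (`top5` is a name introduced here); the plotting part only runs if ggplot2 is installed.

```r
# Top five fatality totals, copied from the table above
top5 <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING"),
  total  = c(5633, 1903, 978, 937, 816),
  stringsAsFactors = FALSE
)

# Order the factor levels by decreasing total so bars appear largest first
top5$EVTYPE <- reorder(top5$EVTYPE, -top5$total)
print(levels(top5$EVTYPE))

# Draw the bar chart only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top5, aes(x = EVTYPE, y = total)) +
    geom_col() +
    theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
    xlab("Events") + ylab("No. of fatalities")
  print(p)
}
```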
2. Across the United States, which types of events have the greatest economic consequences?
The table above shows that floods have the greatest economic consequences. The following figure shows this more clearly.
ggplot() +
  geom_bar(data = dmg.total[1:5, ],
           aes(x = EVTYPE, y = total, fill = interaction(total, EVTYPE)),
           stat = "identity", show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) + xlab("Events") + ylab("Total Damage")