This R Markdown document provides step-by-step explanations and R code for completing the second course project of the Reproducible Research course. The goal of the assignment is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database and determine which types of storms and weather events are most harmful with respect to population health and which have the greatest economic consequences. TORNADO was found to be the event type most harmful to population health, and FLOOD the event type with the greatest economic consequences.
This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern, hence the need to explore the database. The analysis addresses two main questions: (1) which types of events are most harmful with respect to population health, and (2) which types of events have the greatest economic consequences?
The following lines of code download the data set and load the compressed csv.bz2 file from the working directory.
No data transformation is performed at this step; transformations are applied later, as each of the following steps requires them.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "raw_data.csv.bz2")
data <- read.csv("raw_data.csv.bz2")
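Because the raw file is large, it can help to skip the download when a local copy already exists. The following is a minimal sketch of that idea; the helper name `fetch_if_missing` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis):
# download the file only when no local copy exists yet.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the compressed file in binary mode
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage with the names from the text (commented out: needs network access)
# url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# fetch_if_missing(url, "raw_data.csv.bz2")
# data <- read.csv("raw_data.csv.bz2")
```

Re-knitting the document then reuses the cached file instead of downloading it again.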
Using the str() and colnames() functions we
can inspect the variable names and later subset the variables we
need to answer the two questions.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
colnames(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
EVTYPE records the type of each event. The variables
associated with population health are FATALITIES and
INJURIES, and the variables associated with economic
consequences are PROPDMG, PROPDMGEXP, CROPDMG and
CROPDMGEXP.
selected <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
subset <- data[,selected]
summary(subset)
## EVTYPE FATALITIES INJURIES PROPDMG
## Length:902297 Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Mode :character Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
str(subset)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
To answer Question 1 (Q1), finding which types of events are most
harmful to population health, we create a new variable
(TOTAL_HEALTH) that sums the number of recorded
FATALITIES and INJURIES. Then, using the
dplyr R package, we summarize, order and store in a new
dataset (Health) the 10 event types that were
associated with the highest number of fatalities plus injuries
(TOTAL_HEALTH).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
subset$TOTAL_HEALTH = subset$FATALITIES + subset$INJURIES
a = subset %>% group_by(EVTYPE) %>% summarize(sum(FATALITIES), sum(INJURIES), sum(TOTAL_HEALTH))
Health = a[order(-a$`sum(TOTAL_HEALTH)`),]
Health = Health[1:10,]
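The same group, sum, sort and truncate pattern can be sketched on a small toy table. The data below are invented for illustration, and the column names `fatalities`, `injuries` and `total_health` are assumptions chosen here; naming the summaries avoids the backtick-quoted `sum(...)` column names.

```r
library(dplyr)

# Toy data (invented for illustration; not taken from the NOAA file)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 3, 4, 0),
  INJURIES   = c(10, 5, 20, 1, 2)
)

top_health <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES),   # named summary columns
    injuries   = sum(INJURIES),
    .groups    = "drop"
  ) %>%
  mutate(total_health = fatalities + injuries) %>%
  arrange(desc(total_health)) %>%   # largest total first
  head(2)                           # keep the top 2 event types

print(top_health)
```

With the real data, `head(10)` would keep the ten event types with the highest totals.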
We can do the same for Question 2 (Q2), finding which types of events have the greatest economic consequences. Here, however, the crop and property damages are each encoded with two variables: one for the numerical estimate and one for a base-10 exponent code, e.g., K = 1,000 = 10^3, M = 1,000,000 = 10^6, and so on.
library(dplyr)
unique(subset$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(subset$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
For the property damages we’ll decode the base-10 exponent codes as K - 10^3, M - 10^6, B - 10^9, m - 10^6, + - 10^0, and so on, while for the crop damages we use “” - 10^0, M - 10^6, K - 10^3, m - 10^6, B - 10^9, ? - 10^0, 0 - 10^0, k - 10^3, 2 - 10^2.
PROPDM_Decoded = c("K" = 10^3, "M" = 10^6, "B" = 10^9, "m" = 10^6, "+" = 10^0, "0" = 10^0, "5" = 10^5, "6" = 10^6, "?" = 10^0, "4" = 10^4, "2" = 10^2, "3" = 10^3, "h" = 10^2, "7" = 10^7, "H" = 10^2, "-" = 10^0, "1" = 10^1, "8" = 10^8)
CROPDM_Decoded = c("M" = 10^6, "K" = 10^3, "m" = 10^6, "B" = 10^9, "?" = 10^0, "0" = 10^0, "k" = 10^3, "2" = 10^2)
subset$PROPDMG_Decoded = PROPDM_Decoded[subset$PROPDMGEXP]
subset$PROPDMG_Decoded[which(is.na(subset$PROPDMG_Decoded))] = 10^0
subset$CROPDMG_Decoded = CROPDM_Decoded[subset$CROPDMGEXP]
subset$CROPDMG_Decoded[which(is.na(subset$CROPDMG_Decoded))] = 10^0
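The lookup-and-default logic above can also be written as one small function that serves both columns. This is only a sketch under the same decoding assumptions as the text (letters are case-insensitive, a digit d means 10^d, anything else means 10^0); the names `exp_map` and `decode_exp` are hypothetical.

```r
# Hypothetical lookup table for the letter codes (case-insensitive)
exp_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: turn exponent codes into numeric multipliers
decode_exp <- function(code) {
  code <- toupper(code)
  mult <- unname(exp_map[code])              # NA for codes not in the table
  digit <- grepl("^[0-9]$", code)            # single digits 0-9
  mult[digit] <- 10^as.numeric(code[digit])  # digit d is read as 10^d
  mult[is.na(mult)] <- 1                     # "", "?", "+", "-" default to 10^0
  mult
}

decode_exp(c("K", "m", "", "B", "?", "5", "h"))
```

For instance, a PROPDMG value of 25 with the code "K" corresponds to 25 * decode_exp("K"), i.e. 25,000 USD.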
After decoding the exponent codes we can estimate the total crop
and property damages by multiplying each numerical estimate by its
decoded power of 10, and then compute the sum of crop and property damage
(TOTAL_ECON). Afterwards, using the same method as for
population health, we summarize, order and store in a new dataset
(Econ) the 10 event types that were associated with
the highest combined crop and property damage (TOTAL_ECON).
b = subset
b$PROP_DMG = b$PROPDMG * b$PROPDMG_Decoded
b$CROP_DMG = b$CROPDMG * b$CROPDMG_Decoded
b$TOTAL_ECON = b$CROP_DMG + b$PROP_DMG
Econ = b %>% group_by(EVTYPE) %>% summarize(sum(CROP_DMG), sum(PROP_DMG), sum(TOTAL_ECON))
Econ = Econ[order(-Econ$`sum(TOTAL_ECON)`),]
Econ = Econ[1:10,]
The datasets Health and Econ now contain the 10
event types that are most harmful with respect to population health and
that have the greatest economic consequences, respectively.
Printing Health and Econ shows how
much each event type has affected population health (fatalities,
injuries, and their total) and the economy (crop damage,
property damage, and their total).
Health
## # A tibble: 10 x 4
## EVTYPE `sum(FATALITIES)` `sum(INJURIES)` `sum(TOTAL_HEALTH)`
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
Econ
## # A tibble: 10 x 4
## EVTYPE `sum(CROP_DMG)` `sum(PROP_DMG)` `sum(TOTAL_ECON)`
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 5661968450 144657709807 150319678257
## 2 HURRICANE/TYPHOON 2607872800 69305840000 71913712800
## 3 TORNADO 414953270 56947380676. 57362333946.
## 4 STORM SURGE 5000 43323536000 43323541000
## 5 HAIL 3025954473 15735267513. 18761221986.
## 6 FLASH FLOOD 1421317100 16822673978. 18243991078.
## 7 DROUGHT 13972566000 1046106000 15018672000
## 8 HURRICANE 2741910000 11868319010 14610229010
## 9 RIVER FLOOD 5029459000 5118945500 10148404500
## 10 ICE STORM 5022113500 3944927860 8967041360
We can conclude that TORNADO is the event type most harmful to population health (96,979 fatalities and injuries combined) and that FLOOD is the event type with the greatest economic consequences (about 150 billion USD in combined property and crop damage).
We can construct two bar charts to visualize the impact of each event type on population health and on the economy.
library(ggplot2)
library(tidyr)
Health = gather(Health, key = "sum", value = "points", 2:4)
ggplot(Health, aes(x = EVTYPE, y = points, fill = sum))+
geom_col(position = position_dodge()) +
ylab("Total Fatalities / Injuries") +
xlab("Event Type") +
ggtitle("Top 10 events types associated \n with higher impact on Population Health") +
theme(axis.text.x = element_text(angle=45, hjust=1))
library(ggplot2)
library(tidyr)
Econ = gather(Econ, key = "sum", value = "points", 2:4)
ggplot(Econ, aes(x = EVTYPE, y = points, fill = sum))+
geom_col(position = position_dodge()) +
ylab("Property/Crop Damage (in USD)") +
xlab("Event Type") +
ggtitle("Top 10 events types associated \n with higher economic consequences") +
theme(axis.text.x = element_text(angle=45, hjust=1))
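By default ggplot2 orders a character x-axis alphabetically, so the bars do not follow the ranking in the tables. One possible adjustment, sketched here on invented toy values, is to convert EVTYPE to a factor whose levels are sorted by the total before plotting.

```r
# Toy totals (invented values, for illustration only)
toy <- data.frame(
  EVTYPE = c("FLOOD", "HEAT", "TORNADO"),
  total  = c(8, 5, 35)
)

# Re-level the factor so the largest total comes first on the x-axis
toy$EVTYPE <- factor(toy$EVTYPE, levels = toy$EVTYPE[order(-toy$total)])

levels(toy$EVTYPE)
```

Applied to the 10-row Health or Econ table before the gather() step, the same re-levelling would make the bars appear in descending order of total impact.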