Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Here we use part of this database to identify the 10 types of weather events that are most harmful with respect to population health across the United States, and the 10 types of weather events that have the greatest economic consequences.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data [47Mb], URL of the dataset: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
library(utils)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(dplyr)
The encoding of this dataset, as reported by guess_encoding(), is ASCII:
guess_encoding("/Users/Beeta/Rcodes/datasciencecoursera/course5-ReproducableResearch/project2/repdata_data_StormData.csv", n_max = 30000)
## # A tibble: 1 x 2
## encoding confidence
## <chr> <dbl>
## 1 ASCII 1
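As a cross-check that does not depend on readr, ASCII can also be verified directly, since every byte of an ASCII file is below 128. This is a minimal base-R sketch that uses a small temporary file with made-up contents in place of the real CSV; the helper name `is_ascii_file` and its `n_bytes` parameter are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: TRUE when every byte read from the file is in the ASCII range.
is_ascii_file <- function(path, n_bytes = 1e6) {
  bytes <- readBin(path, what = "raw", n = n_bytes)  # read up to n_bytes raw bytes
  all(as.integer(bytes) < 128L)
}

# Toy stand-in for the storm CSV (contents are made up for illustration).
tmp <- tempfile(fileext = ".csv")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0"), tmp)
is_ascii_file(tmp)
```

Like guess_encoding() with n_max, this only inspects the start of the file, so a larger `n_bytes` would be needed to cover the whole dataset.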
The file is also compressed with bzip2, and reading it directly from the URL caused warnings and errors, so I downloaded the file and ran this analysis on the local copy. Even though read_csv() is much faster than read.csv(), it produced many parsing failures when I tried it, so read.csv() is used here.
stormData <- read.csv("/Users/Beeta/Rcodes/datasciencecoursera/course5-ReproducableResearch/project2/repdata_data_StormData.csv")
head(stormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
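Base R can read a bzip2-compressed CSV through a `bzfile()` connection, so a downloaded `.bz2` file does not have to be decompressed by hand first. The sketch below shows this on a tiny made-up file (the file and its rows are stand-ins, not the storm data). I have not checked whether this avoids the warnings seen when reading the real file from the URL.

```r
# Write a tiny bzip2-compressed CSV to a temporary file (toy data, not the storm data).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile() connection; the data are decompressed on the fly.
toy_back <- read.csv(bzfile(path), stringsAsFactors = FALSE)
toy_back
```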
stormData %>% colnames()
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We do NOT need all the columns for this specific study, hence we select only those that are required for this report.
keeps <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data <- stormData[keeps]
head(data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
str(data)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
Note that PROPDMGEXP and CROPDMGEXP are factors.
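This matters because converting a factor straight to a number returns its internal level codes rather than its labels. The exponent columns are therefore best compared as text, or converted with `as.character()` first. A small illustration with a made-up factor (`exp_f` is a name introduced here):

```r
# Made-up exponent column stored as a factor, like PROPDMGEXP.
exp_f <- factor(c("K", "M", "2", "K"))

as.integer(exp_f)     # internal level codes (positions in levels(exp_f)), not the labels
as.character(exp_f)   # the original labels
exp_f == "K"          # comparing a factor with a string uses the labels
```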
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
fatal <- aggregate(FATALITIES ~ EVTYPE, data, FUN = sum)
injur <- aggregate(INJURIES ~ EVTYPE, data, FUN = sum)
fatal <- arrange(fatal, desc(FATALITIES)) %>% top_n(10, FATALITIES)
injur <- arrange(injur, desc(INJURIES)) %>% top_n(10, INJURIES)
head(fatal)
## EVTYPE FATALITIES
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
head(injur)
## EVTYPE INJURIES
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
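The aggregate-then-rank pattern used above can be tried on a small made-up data frame. This sketch uses only base R, with `order()` and `head()` in place of `arrange()` and `top_n()`; the toy values are invented for illustration and are not taken from the storm database.

```r
# Toy event records (invented numbers).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0)
)

# Sum fatalities per event type.
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort in decreasing order and keep the top n rows (n = 2 here, 10 in the report).
top <- head(totals[order(totals$FATALITIES, decreasing = TRUE), ], 2)
top
```

One difference worth knowing: `top_n()` keeps extra rows when values tie at the cutoff, while `head()` always returns exactly n rows.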
Plotting the top 10 weather events that are most harmful with respect to public health:
par(mfrow = c(1, 2), cex = 0.7, mar = c(10, 4, 3, 2))
barplot(fatal$FATALITIES, names.arg = fatal$EVTYPE, ylab = "Fatalities", main = "Events causing the most fatalities", col = "black", las = 3)
barplot(injur$INJURIES, names.arg = injur$EVTYPE, ylab = "Injuries", main = "Events causing the most injuries", col = "dark gray", las = 3)
Across the United States, which types of events have the greatest economic consequences?
We need the columns “EVTYPE”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, and “CROPDMGEXP”. Recall that the EXP columns are factors, so we first inspect their levels and then map each level to a numeric multiplier.
Finding the property levels and exponents
unique(data$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
Assigning a numeric multiplier to each valid exponent code so that the damage values can be calculated
data$PROPEXP[data$PROPDMGEXP == "K"] <- 1000
data$PROPEXP[data$PROPDMGEXP == "M"] <- 1e+06
data$PROPEXP[data$PROPDMGEXP == ""] <- 1
data$PROPEXP[data$PROPDMGEXP == "B"] <- 1e+09
data$PROPEXP[data$PROPDMGEXP == "m"] <- 1e+06
data$PROPEXP[data$PROPDMGEXP == "0"] <- 1
data$PROPEXP[data$PROPDMGEXP == "5"] <- 1e+05
data$PROPEXP[data$PROPDMGEXP == "6"] <- 1e+06
data$PROPEXP[data$PROPDMGEXP == "4"] <- 10000
data$PROPEXP[data$PROPDMGEXP == "2"] <- 100
data$PROPEXP[data$PROPDMGEXP == "3"] <- 1000
data$PROPEXP[data$PROPDMGEXP == "h"] <- 100
data$PROPEXP[data$PROPDMGEXP == "7"] <- 1e+07
data$PROPEXP[data$PROPDMGEXP == "H"] <- 100
data$PROPEXP[data$PROPDMGEXP == "1"] <- 10
data$PROPEXP[data$PROPDMGEXP == "8"] <- 1e+08
Assigning ‘0’ to invalid exponent data
data$PROPEXP[data$PROPDMGEXP == "+"] <- 0
data$PROPEXP[data$PROPDMGEXP == "-"] <- 0
data$PROPEXP[data$PROPDMGEXP == "?"] <- 0
Calculating the property damage value
data$PROPDMGVAL <- data$PROPDMG * data$PROPEXP
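The same mapping can be written more compactly as a named lookup vector indexed by the exponent code. The following is an alternative sketch, not the code used for the results in this report; `exp_lookup`, `codes`, and the toy data frame are names introduced here for illustration, and the multipliers mirror the assignments above.

```r
# Multipliers keyed by exponent code, mirroring the assignments above.
exp_lookup <- c(
  "H" = 1e2, "h" = 1e2, "K" = 1e3, "M" = 1e6, "m" = 1e6, "B" = 1e9,
  "0" = 1, "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
  "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
  "+" = 0, "-" = 0, "?" = 0
)

# Toy rows (invented values) with the exponent stored as a factor.
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "M", "", "?", "B"),
                  stringsAsFactors = TRUE)

# Index by the character label, not the factor's internal code.
codes <- as.character(toy$PROPDMGEXP)
toy$PROPEXP <- unname(exp_lookup[codes])
# An empty string never matches a name, so blank exponents are handled separately.
toy$PROPEXP[codes == ""] <- 1
toy$PROPDMGVAL <- toy$PROPDMG * toy$PROPEXP
toy
```

Any code missing from the lookup vector comes back as `NA`, which makes unexpected exponent values easy to spot.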
Finding the crop levels and exponents
unique(data$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
Assigning values for the crop exponent data to be able to calculate the numbers
data$CROPEXP[data$CROPDMGEXP == "M"] <- 1e+06
data$CROPEXP[data$CROPDMGEXP == "K"] <- 1000
data$CROPEXP[data$CROPDMGEXP == "m"] <- 1e+06
data$CROPEXP[data$CROPDMGEXP == "B"] <- 1e+09
data$CROPEXP[data$CROPDMGEXP == "0"] <- 1
data$CROPEXP[data$CROPDMGEXP == "k"] <- 1000
data$CROPEXP[data$CROPDMGEXP == "2"] <- 100
data$CROPEXP[data$CROPDMGEXP == ""] <- 1
Assigning ‘0’ to invalid exponent data
data$CROPEXP[data$CROPDMGEXP == "?"] <- 0
Calculating the crop damage value
data$CROPDMGVAL <- data$CROPDMG * data$CROPEXP
Calculating the top 10
prop <- aggregate(PROPDMGVAL ~ EVTYPE, data, FUN = sum)
crop <- aggregate(CROPDMGVAL ~ EVTYPE, data, FUN = sum)
prop <- arrange(prop, desc(PROPDMGVAL)) %>% top_n(10, PROPDMGVAL)
crop <- arrange(crop, desc(CROPDMGVAL)) %>% top_n(10, CROPDMGVAL)
head(prop)
## EVTYPE PROPDMGVAL
## 1 FLOOD 144657709807
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56947380616
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16822673978
## 6 HAIL 15735267513
head(crop)
## EVTYPE CROPDMGVAL
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025954473
## 6 HURRICANE 2741910000
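This report ranks property damage and crop damage separately. If a single ranking of total economic damage were wanted, one possible approach is to add the two dollar values for each record before aggregating. This is a sketch on made-up numbers; `TOTALDMGVAL` is a column name introduced here and is not part of the dataset.

```r
# Toy records with damage already converted to dollars (invented values).
toy <- data.frame(
  EVTYPE     = c("FLOOD", "DROUGHT", "FLOOD", "HAIL"),
  PROPDMGVAL = c(5e6, 1e5, 2e6, 3e6),
  CROPDMGVAL = c(1e6, 9e6, 0, 5e5)
)

# Total damage per record, then summed per event type and sorted.
toy$TOTALDMGVAL <- toy$PROPDMGVAL + toy$CROPDMGVAL
total <- aggregate(TOTALDMGVAL ~ EVTYPE, data = toy, FUN = sum)
total <- total[order(total$TOTALDMGVAL, decreasing = TRUE), ]
total
```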
Plotting the top 10 weather events with highest economic impacts
par(mfrow = c(1, 2), cex = 0.7, mar = c(10, 4, 3, 2))
barplot(prop$PROPDMGVAL/10^9, names.arg = prop$EVTYPE, ylab = "Property damage (billions of dollars)", main = "Events causing the highest property damage", col = "black", las = 3)
barplot(crop$CROPDMGVAL/10^9, names.arg = crop$EVTYPE, ylab = "Crop damage (billions of dollars)", main = "Events causing the highest crop damage", col = "dark gray", las = 3)
Weather events that cause the highest number of fatalities are:
1. Tornado
2. Excessive heat
3. Flash flood
Weather events that cause the highest number of injuries are:
1. Tornado
2. TSTM wind
3. Flood
Weather events that cause the most damage to property are:
1. Flood
2. Hurricane/Typhoon
3. Tornado
Weather events that cause the most damage to crops are:
1. Drought
2. Flood
3. River flood