Analyze the NOAA storm database to study the effects of weather events on the health (injuries and fatalities) and economy (property and crop damage) of the affected population


SYNOPSIS

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Here we use part of this database to identify the 10 types of weather events that are most harmful to population health across the United States, and the 10 types of weather events that have the greatest economic consequences.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data [47Mb]: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

library(utils)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)

Data Processing

The encoding of this dataset is ASCII, as guess_encoding() reports for the first 30,000 lines:

guess_encoding("/Users/Beeta/Rcodes/datasciencecoursera/course5-ReproducableResearch/project2/repdata_data_StormData.csv", n_max = 30000)
## # A tibble: 1 x 2
##   encoding confidence
##   <chr>         <dbl>
## 1 ASCII             1

The file is also compressed with bzip2, and reading it directly from the URL causes warnings and errors, so I downloaded it and ran this analysis on the local copy. Although read_csv() is much faster than read.csv(), it produced many parsing failures on this file, so read.csv() is used here.

stormData <- read.csv("/Users/Beeta/Rcodes/datasciencecoursera/course5-ReproducableResearch/project2/repdata_data_StormData.csv")
head(stormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
stormData %>% colnames()
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
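
As a possible alternative to decompressing the download by hand, base R can read a bzip2 file through a bzfile() connection. The sketch below is only an illustration: it builds a tiny hypothetical sample (`sample_rows`, `bz_path`, and `roundtrip` are made-up names, not part of this report) instead of touching the real storm file.

```r
# Hypothetical two-row sample standing in for the real storm data
sample_rows <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD"),
  FATALITIES = c(0, 1),
  stringsAsFactors = FALSE
)

# Write the sample through a bzip2 connection to a temporary file
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
write.csv(sample_rows, con, row.names = FALSE)
close(con)

# Read it back through bzfile(); read.csv() opens and closes the connection
roundtrip <- read.csv(bzfile(bz_path), stringsAsFactors = FALSE)
```

With the real data, the same read.csv(bzfile(...)) call would point at the downloaded .csv.bz2 file, assuming it is kept compressed on disk.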

We do NOT need all the columns for this specific study, hence we select only those that are required for this report.

keeps <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data <- stormData[keeps]
head(data)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0
str(data)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

Note that PROPDMGEXP and CROPDMGEXP are factors.
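
Why this matters for the later steps: a factor stores integer level indices behind its labels, so converting it straight to a number returns the index, not the value the label spells out. A minimal sketch with a hypothetical toy vector (`exp_codes` is not part of the report's data):

```r
# Hypothetical toy factor mimicking a few exponent codes
exp_codes <- factor(c("K", "M", "5"))

# as.integer() on a factor returns the level index, not the label's value
level_index <- as.integer(factor("5"))                # index of the only level
label_value <- as.numeric(as.character(factor("5")))  # the label read as a number

# Comparing a factor with a string works on the labels
is_thousand <- exp_codes == "K"
```

Label comparison of this kind is what the exponent assignments further down rely on. In R 4.0.0 and later, read.csv() defaults to stringsAsFactors = FALSE, so on a newer installation these columns would load as character vectors, and the same comparisons would still apply.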

Processing PART 1 - Fatalities and Injuries

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

fatal <- aggregate(FATALITIES ~ EVTYPE, data, FUN = sum)
injur <- aggregate(INJURIES ~ EVTYPE, data, FUN = sum)

fatal <- arrange(fatal, desc(FATALITIES)) %>% top_n(10, FATALITIES)
injur <- arrange(injur, desc(INJURIES)) %>% top_n(10, INJURIES)

head(fatal)
##           EVTYPE FATALITIES
## 1        TORNADO       5633
## 2 EXCESSIVE HEAT       1903
## 3    FLASH FLOOD        978
## 4           HEAT        937
## 5      LIGHTNING        816
## 6      TSTM WIND        504
head(injur)
##           EVTYPE INJURIES
## 1        TORNADO    91346
## 2      TSTM WIND     6957
## 3          FLOOD     6789
## 4 EXCESSIVE HEAT     6525
## 5      LIGHTNING     5230
## 6           HEAT     2100
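
The aggregate-then-rank pattern used above can be traced on a small example. This is a sketch in base R on hypothetical toy records (`toy`, `totals`, `ranked`, and `top2` are illustrative names), using order() and head() in place of arrange() and top_n():

```r
# Hypothetical toy records; the real analysis uses the `data` frame
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0)
)

# Sum fatalities per event type
totals <- aggregate(FATALITIES ~ EVTYPE, toy, FUN = sum)

# Sort in descending order and keep the n largest (n = 2 here, 10 in the report)
ranked <- totals[order(-totals$FATALITIES), ]
top2 <- head(ranked, 2)
```

One difference worth knowing: head() always returns exactly n rows, whereas top_n() keeps every row tied with the n-th value and can therefore return more than n.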

Results - Fatalities and Injuries

Plotting the top 10 weather events that are the most harmful with respect to public health

par(mfrow = c(1, 2), cex = 0.7, mar = c(10, 4, 3, 2))
barplot(fatal$FATALITIES, names.arg = fatal$EVTYPE, ylab = "Fatalities", main = "Events cause most fatalities", col = "black", las = 3)
barplot(injur$INJURIES, names.arg = injur$EVTYPE, ylab = "Injuries", main = "Events cause most injuries", col = "dark gray", las = 3)

Processing PART 2 - Property and Crop Damage

Across the United States, which types of events have the greatest economic consequences?

We need the columns “EVTYPE”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, and “CROPDMGEXP”. Recall that the EXP columns are factors, so we first inspect their levels and then map each level to a numeric multiplier.

Property damage

Finding the property levels and exponents

unique(data$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

Assigning a numeric multiplier to each valid exponent code so that the damage values can be calculated

data$PROPEXP[data$PROPDMGEXP == "K"] <- 1000
data$PROPEXP[data$PROPDMGEXP == "M"] <- 1e+06
data$PROPEXP[data$PROPDMGEXP == ""] <- 1
data$PROPEXP[data$PROPDMGEXP == "B"] <- 1e+09
data$PROPEXP[data$PROPDMGEXP == "m"] <- 1e+06
data$PROPEXP[data$PROPDMGEXP == "0"] <- 1
data$PROPEXP[data$PROPDMGEXP == "5"] <- 1e+05
data$PROPEXP[data$PROPDMGEXP == "6"] <- 1e+06
data$PROPEXP[data$PROPDMGEXP == "4"] <- 10000
data$PROPEXP[data$PROPDMGEXP == "2"] <- 100
data$PROPEXP[data$PROPDMGEXP == "3"] <- 1000
data$PROPEXP[data$PROPDMGEXP == "h"] <- 100
data$PROPEXP[data$PROPDMGEXP == "7"] <- 1e+07
data$PROPEXP[data$PROPDMGEXP == "H"] <- 100
data$PROPEXP[data$PROPDMGEXP == "1"] <- 10
data$PROPEXP[data$PROPDMGEXP == "8"] <- 1e+08

Assigning ‘0’ to invalid exponent data

data$PROPEXP[data$PROPDMGEXP == "+"] <- 0
data$PROPEXP[data$PROPDMGEXP == "-"] <- 0
data$PROPEXP[data$PROPDMGEXP == "?"] <- 0

Calculating the property damage value

data$PROPDMGVAL <- data$PROPDMG * data$PROPEXP
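
The line-by-line assignments above could also be collapsed into one helper. The function below is a sketch, not the method used in this report: `exp_to_multiplier` and `letter_map` are hypothetical names, and it assumes the same conventions as the assignments above (digits d mean 10^d, blank means 1, and "+", "-", "?" mean 0).

```r
# Hypothetical helper: map exponent codes to numeric multipliers
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))        # treat "k"/"K", "m"/"M", "h"/"H" alike
  letter_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

  out <- rep(0, length(code))                # default 0 covers "+", "-", "?"
  out[code == ""] <- 1                       # blank: amount is taken as-is

  is_digit <- code %in% as.character(0:8)    # "0".."8" are powers of ten
  out[is_digit] <- 10^as.numeric(code[is_digit])

  is_letter <- code %in% names(letter_map)
  out[is_letter] <- letter_map[code[is_letter]]
  out
}

# First row of the data: PROPDMG = 25.0 with PROPDMGEXP = "K"
first_row_value <- 25.0 * exp_to_multiplier("K")
```

Under these assumptions the first record works out to 25.0 × 1000 = 25,000 dollars of property damage. Because lower-case codes are upper-cased first, the same helper would also cover the "k" code that appears in CROPDMGEXP.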

Crop damage

Finding the crop levels and exponents

unique(data$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

Assigning values for the crop exponent data to be able to calculate the numbers

data$CROPEXP[data$CROPDMGEXP == "M"] <- 1e+06
data$CROPEXP[data$CROPDMGEXP == "K"] <- 1000
data$CROPEXP[data$CROPDMGEXP == "m"] <- 1e+06
data$CROPEXP[data$CROPDMGEXP == "B"] <- 1e+09
data$CROPEXP[data$CROPDMGEXP == "0"] <- 1
data$CROPEXP[data$CROPDMGEXP == "k"] <- 1000
data$CROPEXP[data$CROPDMGEXP == "2"] <- 100
data$CROPEXP[data$CROPDMGEXP == ""] <- 1

Assigning ‘0’ to invalid exponent data

data$CROPEXP[data$CROPDMGEXP == "?"] <- 0

Calculating the crop damage value

data$CROPDMGVAL <- data$CROPDMG * data$CROPEXP

Results - Property and Crop Damage

Calculating the top 10

prop <- aggregate(PROPDMGVAL ~ EVTYPE, data, FUN = sum)
crop <- aggregate(CROPDMGVAL ~ EVTYPE, data,  FUN = sum)


prop <- arrange(prop, desc(PROPDMGVAL)) %>% top_n(10, PROPDMGVAL)
crop <- arrange(crop, desc(CROPDMGVAL)) %>% top_n(10, CROPDMGVAL)

head(prop)
##              EVTYPE   PROPDMGVAL
## 1             FLOOD 144657709807
## 2 HURRICANE/TYPHOON  69305840000
## 3           TORNADO  56947380616
## 4       STORM SURGE  43323536000
## 5       FLASH FLOOD  16822673978
## 6              HAIL  15735267513
head(crop)
##        EVTYPE  CROPDMGVAL
## 1     DROUGHT 13972566000
## 2       FLOOD  5661968450
## 3 RIVER FLOOD  5029459000
## 4   ICE STORM  5022113500
## 5        HAIL  3025954473
## 6   HURRICANE  2741910000

Plotting the top 10 weather events with highest economic impacts

par(mfrow = c(1, 2), cex = 0.7, mar = c(10, 4, 3, 2))
barplot(prop$PROPDMGVAL/10^9, names.arg = prop$EVTYPE, ylab = "Billion Dollar in Property damages", main = "Events cause highest property damage", col = "black", las = 3)
barplot(crop$CROPDMGVAL/10^9, names.arg = crop$EVTYPE, ylab = "Billion Dollar in Crop damage", main = "Events cause highest crop damage",  col = "dark gray", las = 3)

Final Conclusion

Weather events that have the highest number of fatalities are: 1. Tornado 2. Excessive heat 3. Flash flood

Weather events that cause the highest number of injuries are: 1. Tornado 2. TSTM wind 3. Flood

Weather events that cause the most damage to properties are: 1. Flood 2. Hurricane/Typhoon 3. Tornado

Weather events that cause the most damage to crops are: 1. Drought 2. Flood 3. River flood