NOAA Storm Data Analysis: Effects on Population Health and Economic Damage

1. Synopsis

This analysis examines the Storm Data published by NOAA (National Oceanic and Atmospheric Administration). The meteorological events in this dataset start in 1950 and end in November 2011. Our purpose is to find which types of events are most harmful to population health and which cause the most economic damage. To investigate this, we downloaded the Storm Data and created two new datasets from it: storm_one for population health and storm_two for economic effects. We used the ggplot2 package to draw plots for these datasets.

2. Data Processing

First, we downloaded the data from the given URL and loaded it. Because loading this dataset is time consuming, we set cache=TRUE for the code chunks.

if(!file.exists("./data")){dir.create("./data")}
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "./data/stormdata.csv.bz2", method = 'curl')
storm <- read.csv("./data/stormdata.csv.bz2", header=T)
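
The download above runs every time the chunk is executed. A minimal sketch of a guard that skips the download when the file is already on disk is shown here. The helper name fetch_once is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): download `url` to
# `dest` only when `dest` does not exist yet.
fetch_once <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)  # existing copy is reused, nothing is downloaded
  }
  # Create the target directory if needed, then download in binary mode
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
  download.file(url, destfile = dest, mode = "wb")
  TRUE
}
```

It could be called as fetch_once(url, "./data/stormdata.csv.bz2") before read.csv(); the return value only reports whether a download took place.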

These are the columns in this dataset. For each question, we used only the columns it requires.

colnames(storm)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

2-1. 1st Question : Population Health

The first question is “Which types of events are most harmful to population health?” To answer it, we need the columns for event type and population health. According to ‘NATIONAL WEATHER SERVICE INSTRUCTION 10-1605’, the documentation for this dataset, the variables related to population health are injuries and fatalities. So we created a new dataset for question 1.

storm_one <- with(storm, data.frame(EVTYPE, FATALITIES, INJURIES))
str(storm_one)

## 'data.frame':    902297 obs. of  3 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...

The storm_one data contains the following variables.

EVTYPE : Type of meteorological event / Factor variable with 985 levels.
FATALITIES : Number of fatalities caused by the event / Numeric
INJURIES : Number of people injured by the event / Numeric

2-2. 2nd Question : Economic Damages

We also created a new dataset for question 2. According to ‘NATIONAL WEATHER SERVICE INSTRUCTION 10-1605’, the documentation for this dataset, the variables related to economic damage are PROPDMG, CROPDMG, PROPDMGEXP, and CROPDMGEXP. The two EXP columns hold the exponent codes for the damage amounts, and we took them as our indicator of direct financial damage. So this dataset keeps only those two EXP columns.

storm_two <- with(storm, data.frame(EVTYPE, PROPDMGEXP, CROPDMGEXP))
str(storm_two)

## 'data.frame':    902297 obs. of  3 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

EVTYPE : Type of meteorological event / Factor variable with 985 levels.
PROPDMGEXP : Exponent values for property damage.
CROPDMGEXP : Exponent values for crop damage.

PROPDMGEXP has 19 levels and CROPDMGEXP has 9 levels.

levels(storm_two$PROPDMGEXP) ; levels(storm_two$CROPDMGEXP)

##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

Among these levels, the letters (B, b, M, m, K, k, H, h) stand for billion, million, thousand (kilo), and hundred. The other codes represent smaller values than these letters, which is why we investigated the letters only. (Reference: https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html)
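
The letter-to-value reading described above can be sketched as a named lookup vector. This is only an illustration on made-up codes, not the code used in this analysis. The names multiplier, codes, and value are assumptions, and every non-letter code is mapped to 0 here.

```r
# Multiplier for each letter code (hundred, thousand, million, billion)
multiplier <- c(H = 1e2, h = 1e2, K = 1e3, k = 1e3,
                M = 1e6, m = 1e6, B = 1e9, b = 1e9)

# Toy stand-in for an EXP column (made-up values, not real storm records).
# A factor column would need as.character() first.
codes <- c("K", "m", "", "B", "5", "h", "?")

# Look each code up by name; codes with no entry come back as NA,
# and those are set to 0
value <- unname(multiplier[codes])
value[is.na(value)] <- 0
```

Keeping the mapping in one vector makes it easy to check that the upper-case and lower-case forms of a letter receive the same value.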

3. Results

3-1. Population Health : Which event affects most?

We need FATALITIES and INJURIES from the storm_one data, so we totaled each of them for each event type.

result1_0 <- aggregate(FATALITIES ~ EVTYPE, storm_one, sum)
result2_0 <- aggregate(INJURIES ~ EVTYPE, storm_one, sum)

To get the total number of victims for each event type, we used the aggregate() function. result1_0 and result2_0 hold the total fatalities and total injuries. To find the largest values, we ordered these tables by the number of victims and extracted only the top 5 rows.

result1_1 <- result1_0[order(result1_0$FATALITIES, 
                             decreasing = T),][1:5,]
result2_1 <- result2_0[order(result2_0$INJURIES, 
                             decreasing = T),][1:5,]
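
The aggregate-then-rank pattern used above can be tried on a small made-up table. The data frame toy and its values are invented for illustration and do not come from the storm data.

```r
# Made-up records: one row per event, with an event type and a count
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0)
)

# Sum the counts within each event type
totals <- aggregate(FATALITIES ~ EVTYPE, toy, sum)

# Sort from largest to smallest total and keep the first two rows
top2 <- totals[order(totals$FATALITIES, decreasing = TRUE), ][1:2, ]
```

Using head(sorted, n) instead of [1:n, ] would avoid rows of NA when the table has fewer than n event types.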

Then we used the ggplot2 package to draw bar plots of the totals on a log10 scale.

library(gridExtra)
library(ggplot2)

gg1 <- ggplot(result1_1, 
        aes(EVTYPE, log10(FATALITIES))) + geom_bar(stat="identity") + coord_flip()
gg2 <- ggplot(result2_1, 
        aes(EVTYPE, log10(INJURIES))) + geom_bar(stat="identity") + coord_flip()

grid.arrange(gg1, gg2, nrow=2)

For both fatalities and injuries, the tornado is by far the most harmful event to population health. Next come excessive heat for fatalities and thunderstorm wind for injuries.

3-2. Economic Damages : Which event has the largest effect?

We need PROPDMGEXP and CROPDMGEXP from the storm_two data, and we investigated them as in the previous question. We added two indicator columns, ‘PROPind’ and ‘CROPind’, that hold a value for each of the 8 letter DMGEXP codes. We assigned the values as follows.

1E02 to H and h
1E03 to K and k
1E06 to M and m
1E09 to B and b
0 to other values

need <- c("B","b","M","m","K","k","H","h")
# Keep rows where either EXP column holds one of the letter codes
have <- union(which(storm_two$PROPDMGEXP %in% need),
              which(storm_two$CROPDMGEXP %in% need))
storm_two_new <- storm_two[have,]

# Add indicator columns, defaulting to 0
storm_two_new[,c("PROPind","CROPind")] <- 0

# Assign the indicator value for each property damage code
storm_two_new[which(storm_two_new$PROPDMGEXP %in% c("H","h")), "PROPind"] <- 1E02
storm_two_new[which(storm_two_new$PROPDMGEXP %in% c("K","k")), "PROPind"] <- 1E03
storm_two_new[which(storm_two_new$PROPDMGEXP %in% c("M","m")), "PROPind"] <- 1E06
storm_two_new[which(storm_two_new$PROPDMGEXP %in% c("B","b")), "PROPind"] <- 1E09

# Assign the indicator value for each crop damage code
storm_two_new[which(storm_two_new$CROPDMGEXP %in% c("H","h")), "CROPind"] <- 1E02
storm_two_new[which(storm_two_new$CROPDMGEXP %in% c("K","k")), "CROPind"] <- 1E03
storm_two_new[which(storm_two_new$CROPDMGEXP %in% c("M","m")), "CROPind"] <- 1E06
storm_two_new[which(storm_two_new$CROPDMGEXP %in% c("B","b")), "CROPind"] <- 1E09

Again we used the aggregate() function, extracted the top 5 values, and drew bar plots with the ggplot2 package.

result3_0 <- aggregate(PROPind ~ EVTYPE, storm_two_new, sum)
result4_0 <- aggregate(CROPind ~ EVTYPE, storm_two_new, sum)

result3_1 <- result3_0[order(result3_0$PROPind, decreasing = TRUE)[1:5],]
result4_1 <- result4_0[order(result4_0$CROPind, decreasing = TRUE)[1:5],]

gg3 <- ggplot(result3_1, 
              aes(EVTYPE, PROPind)) + geom_bar(stat="identity") + coord_flip()

gg4 <- ggplot(result4_1, 
              aes(EVTYPE, CROPind)) + geom_bar(stat="identity") + coord_flip()
grid.arrange(gg3, gg4, nrow=2)

The events with the greatest economic consequences are the hurricane & typhoon for property damage and the drought for crop damage. Next come the flood for property damage and the hurricane & typhoon for crop damage.