In this analysis, we examine the NOAA Storm Data. The meteorological events in this dataset span from 1950 to November 2011. Our purpose is to find which of these phenomena most affect population health and which cause the most economic damage. To investigate this, we downloaded the Storm Data from NOAA (National Oceanic and Atmospheric Administration) and created two new datasets from it: storm_one for population health and storm_two for economic effects. We used the ggplot2 package to draw plots for these datasets.
First, we downloaded the data from the given URL and loaded it. Because loading this dataset is time-consuming, we set cache=TRUE for the code chunks.
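# Create the data directory and download the archive only if they are missing
if (!dir.exists("./data")) dir.create("./data")
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "./data/stormdata.csv.bz2"
if (!file.exists(destfile)) download.file(url, destfile = destfile, method = "curl")
# read.csv() reads the bzip2 file directly; stringsAsFactors = TRUE keeps the
# factor columns shown below on R >= 4.0, where the default became FALSE
storm <- read.csv(destfile, header = TRUE, stringsAsFactors = TRUE)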
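The loading step relies on read.csv() decompressing a bzip2 file on its own. A minimal sketch of that behavior on a small temporary file, so it can be tried without the full download (the toy data frame and the temporary path are assumptions made for illustration, not part of the Storm Data):

```r
# Hypothetical toy table standing in for the real Storm Data.
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write it as a bzip2-compressed CSV through a bzfile() connection.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses transparently,
# so no separate unzip step is needed.
back <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)
print(back)
```

The same call works on the downloaded stormdata.csv.bz2, only much more slowly, which is why caching the chunk helps.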
These are the columns in this dataset. For each question, we used only the columns we needed.
colnames(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The first question is “Which types of events are most harmful to population health?”. To answer it, we need the columns for the event type and for population health. According to ‘NATIONAL WEATHER SERVICE INSTRUCTION 10-1605’, the documentation for this dataset, the variables related to population health are injuries and fatalities. So we created a new dataset for question 1.
storm_one <- with(storm, data.frame(EVTYPE, FATALITIES, INJURIES))
str(storm_one)
## 'data.frame': 902297 obs. of 3 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
The storm_one data therefore holds the event type together with the fatality and injury counts.
We also created a new dataset for question 2. According to ‘NATIONAL WEATHER SERVICE INSTRUCTION 10-1605’, the documentation for this dataset, the variables related to economic damage are PROPDMG, CROPDMG, PROPDMGEXP, and CROPDMGEXP. Among them, the EXP columns encode the order of magnitude (hundreds, thousands, millions, billions) of the recorded damage figure, so we used only those two EXP columns as an indicator of the scale of the direct financial damage.
storm_two <- with(storm, data.frame(EVTYPE, PROPDMGEXP, CROPDMGEXP))
str(storm_two)
## 'data.frame': 902297 obs. of 3 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
PROPDMGEXP has 19 levels and CROPDMGEXP has 9.
levels(storm_two$PROPDMGEXP) ; levels(storm_two$CROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
Among these levels, the letters (B, b, M, m, K, k, H, h) stand for billion, million, thousand, and hundred. The other symbols represent smaller values than these letters, which is why we examined the letters only. (Reference: https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html)
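The letter-to-magnitude mapping described above can be sketched as a named-vector lookup. The names multiplier, codes, and value are hypothetical, and the sample codes are made up for illustration; treating every non-letter code as 0 is an assumption that mirrors the decision to examine the letters only.

```r
# Lookup table from exponent letter to multiplier (both cases accepted).
multiplier <- c(H = 1e2, h = 1e2, K = 1e3, k = 1e3,
                M = 1e6, m = 1e6, B = 1e9, b = 1e9)

# Hypothetical codes, including ones the analysis ignores ("", "?", "5").
# A factor column must be converted with as.character() first; indexing
# by a factor would use its integer codes instead of its labels.
codes <- c("K", "m", "", "B", "?", "h", "5")

# Index by name; codes without an entry come back as NA and are set to 0.
value <- unname(multiplier[codes])
value[is.na(value)] <- 0
print(value)
```

A lookup like this keeps all eight letters in one place, which makes a mistyped letter or multiplier easier to spot than in a series of separate assignments.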
For storm_one, we need FATALITIES and INJURIES, so we totaled each of them per event type.
result1_0 <- aggregate(FATALITIES ~ EVTYPE, storm_one, sum)
result2_0 <- aggregate(INJURIES ~ EVTYPE, storm_one, sum)
To get the total number of victims for each event type, we used the aggregate() function. result1_0 and result2_0 hold the total fatalities and injuries. We wanted the largest values, so we sorted these tables by number of victims and extracted only the top 5 rows.
result1_1 <- result1_0[order(result1_0$FATALITIES,
decreasing = TRUE),][1:5,]
result2_1 <- result2_0[order(result2_0$INJURIES,
decreasing = TRUE),][1:5,]
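The aggregate-then-sort pattern used here can be checked on a toy table. This is a minimal sketch with made-up counts; toy, totals, and top2 are hypothetical names, and head() is swapped in for the [1:5,] indexing.

```r
# Hypothetical event records with made-up fatality counts.
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 2, 4, 0, 0)
)

# Sum fatalities per event type, as in result1_0.
totals <- aggregate(FATALITIES ~ EVTYPE, toy, sum)

# Sort in decreasing order and keep the top rows, as in result1_1.
top2 <- head(totals[order(totals$FATALITIES, decreasing = TRUE), ], 2)
print(top2)
```

head() returns at most the requested number of rows, whereas indexing with [1:5,] pads the result with NA rows when fewer than five event types exist.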
Then we used the ggplot2 package to draw bar plots of the totals on a log10 scale.
library(gridExtra)
library(ggplot2)
gg1 <- ggplot(result1_1,
aes(EVTYPE, log10(FATALITIES))) + geom_bar(stat="identity") + coord_flip()
gg2 <- ggplot(result2_1,
aes(EVTYPE, log10(INJURIES))) + geom_bar(stat="identity") + coord_flip()
grid.arrange(gg1, gg2, nrow=2)
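ggplot2 orders bars by factor level, which is alphabetical by default, so the bars above are not sorted by size. One possible refinement, sketched here with made-up totals, is to reorder the event types by value with reorder() before plotting (top is a hypothetical name, and the numbers are not taken from the Storm Data):

```r
# Hypothetical top-3 table with made-up totals.
top <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
  FATALITIES = c(50, 20, 10)
)

# reorder() returns a factor whose levels are sorted by the given values,
# smallest first.
top$EVTYPE <- reorder(top$EVTYPE, top$FATALITIES)
print(levels(top$EVTYPE))
```

With coord_flip(), the first level is drawn at the bottom, so the largest total ends up as the top bar. The same effect is available inline by writing aes(reorder(EVTYPE, FATALITIES), log10(FATALITIES)).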
We can see that, for both fatalities and injuries, the tornado is by far the most harmful event to population health. The next most harmful is excessive heat for fatalities and thunderstorm wind for injuries.
For storm_two, we need PROPDMGEXP and CROPDMGEXP, which we examined as in the previous question. We added two indicator columns, ‘PROPind’ and ‘CROPind’, that hold a numeric value for each of the 8 letter DMGEXP codes. We assigned the values as follows.
need <- c("B","b","M","m","K","k","H","h")
have <- union(which(storm_two$PROPDMGEXP %in% need),
which(storm_two$CROPDMGEXP %in% need))
storm_two_new <- storm_two[have,]
# Add indicator columns
storm_two_new[,c("PROPind","CROPind")] <- 0
# Add indicating value
# H/h = hundred, K/k = thousand, M/m = million, B/b = billion
storm_two_new[which(storm_two_new$PROPDMGEXP %in% c("H","h")), "PROPind"] <- 1E02
storm_two_new[which(storm_two_new$PROPDMGEXP %in% c("K","k")), "PROPind"] <- 1E03
storm_two_new[which(storm_two_new$PROPDMGEXP %in% c("M","m")), "PROPind"] <- 1E06
storm_two_new[which(storm_two_new$PROPDMGEXP %in% c("B","b")), "PROPind"] <- 1E09
storm_two_new[which(storm_two_new$CROPDMGEXP %in% c("H","h")), "CROPind"] <- 1E02
storm_two_new[which(storm_two_new$CROPDMGEXP %in% c("K","k")), "CROPind"] <- 1E03
storm_two_new[which(storm_two_new$CROPDMGEXP %in% c("M","m")), "CROPind"] <- 1E06
storm_two_new[which(storm_two_new$CROPDMGEXP %in% c("B","b")), "CROPind"] <- 1E09
We again used the aggregate() function, extracted the top 5 values, and drew bar plots with the ggplot2 package.
result3_0 <- aggregate(PROPind ~ EVTYPE, storm_two_new, sum)
result4_0 <- aggregate(CROPind ~ EVTYPE, storm_two_new, sum)
result3_1 <- result3_0[order(result3_0$PROPind, decreasing = TRUE)[1:5],]
result4_1 <- result4_0[order(result4_0$CROPind, decreasing = TRUE)[1:5],]
gg3 <- ggplot(result3_1,
aes(EVTYPE, PROPind)) + geom_bar(stat="identity") + coord_flip()
gg4 <- ggplot(result4_1,
aes(EVTYPE, CROPind)) + geom_bar(stat="identity") + coord_flip()
grid.arrange(gg3, gg4, nrow=2)
We can see that the events with the greatest economic consequences are hurricane & typhoon for property damage and drought for crop damage. Next come flood for property damage and hurricane & typhoon for crop damage.