The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events.
This report will answer the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
There were two major data processing initiatives for this project. The first was to get the total number of injuries and the total number of fatlities by event type. This was done using dplyr to group and summarize by event type. The gather (tidyr) function was used to make this data graphable. The results were graphed on a stacked barplot.
The second action was to use dplyr to group and sort the data by property damage and crop damage. Upon getting the data grouped and sorted and summed, actions were taken to substiute out the multipliers for each so that total damage coule be calculated. The results were then graphed on two different barplots.
We first read in the US National Oceanic and Atmospheric Administrations (NOAA) storm database. The events in the database start in the year 1950 and end in November 2011.
The database was downloaded from the following LINK. Upon unzipping the .zip file, we read the .csv file into the variable stormdat using read.csv.
stormdat <- read.csv("storm_dat/repdata-data-StormData.csv")
There are 902297 rows and 37 columns in stormdat.
dim(stormdat)
## [1] 902297 37
head(stormdat)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
We will now properly format the column names for R data frames and rename the first state column (column 1) to “state.num”.
names(stormdat) <- gsub("__$", "", names(stormdat))
names(stormdat) <- gsub("_", ".", names(stormdat))
names(stormdat) <- tolower(names(stormdat))
names(stormdat)[1] <- "state.num"
head(stormdat)
## state.num bgn.date bgn.time time.zone county countyname state
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## evtype bgn.range bgn.azi bgn.locati end.date end.time county.end
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## countyendn end.range end.azi end.locati length width f mag fatalities
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## injuries propdmg propdmgexp cropdmg cropdmgexp wfo stateoffic zonenames
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## latitude longitude latitude.e longitude. remarks refnum
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Loading the “dplyr” package to assist with data manipulation.
suppressMessages(library(dplyr))
To determine the answer to the questions, I will first select only the columns required for this calculation, then I group the stormdata dataframe by “Event Type” which is column “evtype”.
stormdat_gp_evtype <- stormdat %>% select(evtype, fatalities:cropdmgexp) %>% group_by(evtype)
The following code will process the data for the purpose of answering the question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
To answer this question we will sum, for each event type, the following columns and sum those columns together to create “total_poph”. The columns are:
fatalities (creates “total_fat”) injuries (creates “total_inj”)
I will then sort the resulting data frame (“stormdata_poph”) by “total_poph” by total_poph in descending order. Only the top 10 event types will be stored and graphed.
stormdat_poph <- stormdat_gp_evtype %>% summarise(total_fat = sum(fatalities), total_inj = sum(injuries)) %>% mutate(total_poph = total_fat + total_inj) %>% arrange(desc(total_poph)) %>% top_n(10)
## Selecting by total_poph
head(stormdat_poph)
## Source: local data frame [6 x 4]
##
## evtype total_fat total_inj total_poph
## (fctr) (dbl) (dbl) (dbl)
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.4
Gather the data (using tidyr) for graphing a stack barplot and save it to “stormdat_popg”.
stormdat_popg <- stormdat_poph %>% select(evtype:total_inj) %>% gather(evtype, total)
names(stormdat_popg) <- c("evtype", "health_risk", "total")
The following code will process the data for the purpose of answering the question: Across the United States, which types of events have the greatest economic consequences?
To answer this question we will sum, for each event type, the following columns and sum those columns together to create “total_eco”. The columns are:
propdmg propdmgexp cropdmg cropdmgexp
The “*exp" columns are factors. Exploring those columns.
unique(stormdat_gp_evtype$propdmgexp)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(stormdat_gp_evtype$cropdmgexp)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
This extract from the following document LINK explains the (k,K,h,H,m,M,b,B) factors in columns “propdmgexp” and “cropdmgexp”.
Estimates can be obtained from emergency managers, U.S. Geological Survey, U.S. Army Corps of Engineers, power utility companies, and newspaper articles. If the values provided are rough estimates, then this should be stated as such in the narrative. Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If additional precision is available, it may be provided in the narrative part of the entry. When damage is due to more than one element of the storm, indicate, when possible, the amount of damage caused by each element. If the dollar amount of damage is unknown, or not available, check the “no informationavailable” box.
I will explore the frequency of the various levels in “propdmgexp” and “cropdmgexp”.
table(stormdat_gp_evtype$propdmgexp)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(stormdat_gp_evtype$cropdmgexp)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Since “cropdmgexp” and “propdmgexp” are considere multipliers, I will assign numeric multipliers to the “stormdat_gpevtype$propdmgexp”" following perform a replacement of the values based off of the following table.
Convert the factors for “propdmgexp” to numeric.
levels(stormdat_gp_evtype$propdmgexp) <- list("1"="0", "10"="1", "100"="2", "1000"="3", "10000"="4", "100000"="5", "1000000"="6", "10000000"="7", "100000000"="8", "100"="H", "100"="h", "1000"="K", "1000000"="m", "1000000"="M", "1000000000"="B")
levels(stormdat_gp_evtype$propdmgexp)
## [1] "1" "10" "100" "1000" "10000"
## [6] "100000" "1000000" "10000000" "100000000" "1000000000"
Convert the factors for “cropdmgexp” to numeric.
levels(stormdat_gp_evtype$cropdmgexp) <- list("1"="0","100"="2","1000"="K","1000"="k", "1000000"="m", "1000000"="M", "1000000000"="B")
Multiply “propdmg” by “propdmgexp” for each row and save the results to column “total_prop”. Multiply “cropdmg” by “cropdmgexp” for each row and save the results to “total_crop. The results will be added to stormdat_eco.
stormdat_eco <- stormdat_gp_evtype %>% mutate(total_prop = propdmg * as.numeric(as.character(propdmgexp))) %>% mutate(total_crop = cropdmg * as.numeric(as.character(cropdmgexp)))
I will now sum “total_prop” for each event type and save it to “stormdat_eco_p”. I will then filter “stormdat_eco_p” for the 10 largest instances of “sump”.
stormdat_eco_p <- stormdat_eco %>% group_by(evtype) %>% summarise(sump = sum(total_prop)) %>% arrange(desc(sump)) %>% top_n(10)
## Selecting by sump
I will now sum “total_crop” for each event type and save it to “stormdat_eco_c”. I will then filter “stormdat_eco_c” for the 10 largest instances of “sumc”.
stormdat_eco_c <- stormdat_eco %>% group_by(evtype) %>% summarise(sumc = sum(total_crop)) %>% arrange(desc(sumc)) %>% top_n(10)
## Selecting by sumc
ggplot2 will be required for final graphs.
suppressMessages(library(ggplot2))
Reorder the factors in column “evtype” and create a barplot using data frame “stormdat_popg”. The resulting graph shows the Top 10 event types from 1950-2011 based on health risks to humans. The data used came from the NOAA Storm Database. The x-axis is the event type and the y-axis represents the total injuries and fatilities cause by that event type from 1950-2011.
stormdat_popg$evtype <- factor(stormdat_popg$evtype, levels = stormdat_popg$evtype[order(-stormdat_popg$total)])
## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated
ggplot(stormdat_popg, aes(x=evtype, y=total, fill=health_risk)) + geom_bar(stat="identity") + xlab("Event types") + ylab("Health risk count") + scale_fill_discrete(name = "Health Risks", labels=c("Total Fatilities", "Total Injuries")) + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) + theme(axis.text.y = element_text(hjust = 1, size = 8))
## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated
## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated
Reorder the factors in column “evtype” and create a barplot using data frame “stormdat_eco_p”. The resulting graph shows the Top 10 event types from 1950-2011 based on property damage. The data used came from the NOAA Storm Database. The x-axis is the event type and the y-axis represents the property damages cause by that event type from 1950-2011 in dollars.
stormdat_eco_p$evtype <- factor(stormdat_eco_p$evtype, levels = stormdat_eco_p$evtype[order(-stormdat_eco_p$sump)])
options(scipen=10000)
ggplot(data = stormdat_eco_p, aes(x=evtype, y=sump)) + geom_bar(stat="identity", color="blue") + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) + theme(axis.text.y = element_text(hjust = 1, size = 8)) + scale_y_continuous() + ggtitle("Top 10 event types by property damage (1950-2011)") + labs(x="Event types", y="Property damage in dollars")
Reorder the factors in column “evtype” and create a barplot using data frame “stormdat_eco_c”. The resulting graph shows the Top 10 event types from 1950-2011 based on crop damage. The data used came from the NOAA Storm Database. The x-axis is the event type and the y-axis represents the property damages cause by that event type from 1950-2011 in dollars.
stormdat_eco_c$evtype <- factor(stormdat_eco_c$evtype, levels = stormdat_eco_c$evtype[order(-stormdat_eco_c$sumc)])
options(scipen=10000)
ggplot(data = stormdat_eco_c, aes(x=evtype, y=sumc)) + geom_bar(stat="identity", color="blue") + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) + theme(axis.text.y = element_text(hjust = 1, size = 8)) + scale_y_continuous() + ggtitle("Top 10 event types by crop damage (1950-2011)") + labs(x="Event types", y="Crop damage in dollars")