An Analysis of the NOAA Storm Database

Synopsis

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events.

This report will answer the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data processing synopsis

There were two major data processing initiatives for this project. The first was to get the total number of injuries and the total number of fatlities by event type. This was done using dplyr to group and summarize by event type. The gather (tidyr) function was used to make this data graphable. The results were graphed on a stacked barplot.

The second action was to use dplyr to group and sort the data by property damage and crop damage. Upon getting the data grouped and sorted and summed, actions were taken to substiute out the multipliers for each so that total damage coule be calculated. The results were then graphed on two different barplots.

Reading the NOAA Storm Database

We first read in the US National Oceanic and Atmospheric Administrations (NOAA) storm database. The events in the database start in the year 1950 and end in November 2011.

The database was downloaded from the following LINK. Upon unzipping the .zip file, we read the .csv file into the variable stormdat using read.csv.

stormdat <- read.csv("storm_dat/repdata-data-StormData.csv")

There are 902297 rows and 37 columns in stormdat.

dim(stormdat)

## [1] 902297     37

head(stormdat)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

We will now properly format the column names for R data frames and rename the first state column (column 1) to “state.num”.

names(stormdat) <- gsub("__$", "", names(stormdat))
names(stormdat) <- gsub("_", ".", names(stormdat))
names(stormdat) <- tolower(names(stormdat))
names(stormdat)[1] <- "state.num"

head(stormdat)

##   state.num           bgn.date bgn.time time.zone county countyname state
## 1         1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2         1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3         1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4         1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5         1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6         1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    evtype bgn.range bgn.azi bgn.locati end.date end.time county.end
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   countyendn end.range end.azi end.locati length width f mag fatalities
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   injuries propdmg propdmgexp cropdmg cropdmgexp wfo stateoffic zonenames
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   latitude longitude latitude.e longitude. remarks refnum
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Data Processing

Loading the “dplyr” package to assist with data manipulation.

suppressMessages(library(dplyr))

To determine the answer to the questions, I will first select only the columns required for this calculation, then I group the stormdata dataframe by “Event Type” which is column “evtype”.

stormdat_gp_evtype <- stormdat %>% select(evtype, fatalities:cropdmgexp) %>% group_by(evtype)

Data Processing Question 1

The following code will process the data for the purpose of answering the question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

To answer this question we will sum, for each event type, the following columns and sum those columns together to create “total_poph”. The columns are:

fatalities (creates “total_fat”) injuries (creates “total_inj”)

I will then sort the resulting data frame (“stormdata_poph”) by “total_poph” by total_poph in descending order. Only the top 10 event types will be stored and graphed.

stormdat_poph <- stormdat_gp_evtype %>% summarise(total_fat = sum(fatalities), total_inj = sum(injuries)) %>% mutate(total_poph = total_fat + total_inj) %>% arrange(desc(total_poph)) %>% top_n(10)

## Selecting by total_poph

head(stormdat_poph)

## Source: local data frame [6 x 4]
## 
##           evtype total_fat total_inj total_poph
##           (fctr)     (dbl)     (dbl)      (dbl)
## 1        TORNADO      5633     91346      96979
## 2 EXCESSIVE HEAT      1903      6525       8428
## 3      TSTM WIND       504      6957       7461
## 4          FLOOD       470      6789       7259
## 5      LIGHTNING       816      5230       6046
## 6           HEAT       937      2100       3037

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.2.4

Gather the data (using tidyr) for graphing a stack barplot and save it to “stormdat_popg”.

stormdat_popg <- stormdat_poph %>% select(evtype:total_inj) %>% gather(evtype, total)
names(stormdat_popg) <- c("evtype", "health_risk", "total")

Data Processing Question 2

The following code will process the data for the purpose of answering the question: Across the United States, which types of events have the greatest economic consequences?

To answer this question we will sum, for each event type, the following columns and sum those columns together to create “total_eco”. The columns are:

propdmg propdmgexp cropdmg cropdmgexp

The “*exp" columns are factors. Exploring those columns.

unique(stormdat_gp_evtype$propdmgexp)

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

unique(stormdat_gp_evtype$cropdmgexp)

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

This extract from the following document LINK explains the (k,K,h,H,m,M,b,B) factors in columns “propdmgexp” and “cropdmgexp”.

Estimates can be obtained from emergency managers, U.S. Geological Survey, U.S. Army Corps of Engineers, power utility companies, and newspaper articles. If the values provided are rough estimates, then this should be stated as such in the narrative. Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If additional precision is available, it may be provided in the narrative part of the entry. When damage is due to more than one element of the storm, indicate, when possible, the amount of damage caused by each element. If the dollar amount of damage is unknown, or not available, check the “no informationavailable” box.

I will explore the frequency of the various levels in “propdmgexp” and “cropdmgexp”.

table(stormdat_gp_evtype$propdmgexp)

## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

table(stormdat_gp_evtype$cropdmgexp)

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Since “cropdmgexp” and “propdmgexp” are considere multipliers, I will assign numeric multipliers to the “stormdat_gpevtype$propdmgexp”" following perform a replacement of the values based off of the following table.

Convert the factors for “propdmgexp” to numeric.

levels(stormdat_gp_evtype$propdmgexp) <- list("1"="0", "10"="1", "100"="2", "1000"="3", "10000"="4", "100000"="5", "1000000"="6", "10000000"="7", "100000000"="8", "100"="H", "100"="h", "1000"="K", "1000000"="m", "1000000"="M", "1000000000"="B")

levels(stormdat_gp_evtype$propdmgexp)

##  [1] "1"          "10"         "100"        "1000"       "10000"     
##  [6] "100000"     "1000000"    "10000000"   "100000000"  "1000000000"

Convert the factors for “cropdmgexp” to numeric.

levels(stormdat_gp_evtype$cropdmgexp) <- list("1"="0","100"="2","1000"="K","1000"="k", "1000000"="m", "1000000"="M", "1000000000"="B")

Multiply “propdmg” by “propdmgexp” for each row and save the results to column “total_prop”. Multiply “cropdmg” by “cropdmgexp” for each row and save the results to “total_crop. The results will be added to stormdat_eco.

stormdat_eco <- stormdat_gp_evtype %>% mutate(total_prop = propdmg * as.numeric(as.character(propdmgexp))) %>% mutate(total_crop = cropdmg * as.numeric(as.character(cropdmgexp)))

I will now sum “total_prop” for each event type and save it to “stormdat_eco_p”. I will then filter “stormdat_eco_p” for the 10 largest instances of “sump”.

stormdat_eco_p <- stormdat_eco %>% group_by(evtype) %>% summarise(sump = sum(total_prop)) %>% arrange(desc(sump)) %>% top_n(10)

## Selecting by sump

I will now sum “total_crop” for each event type and save it to “stormdat_eco_c”. I will then filter “stormdat_eco_c” for the 10 largest instances of “sumc”.

stormdat_eco_c <- stormdat_eco %>% group_by(evtype) %>% summarise(sumc = sum(total_crop)) %>% arrange(desc(sumc)) %>% top_n(10)

## Selecting by sumc

Results

ggplot2 will be required for final graphs.

suppressMessages(library(ggplot2))

Results for Question 1

Reorder the factors in column “evtype” and create a barplot using data frame “stormdat_popg”. The resulting graph shows the Top 10 event types from 1950-2011 based on health risks to humans. The data used came from the NOAA Storm Database. The x-axis is the event type and the y-axis represents the total injuries and fatilities cause by that event type from 1950-2011.

stormdat_popg$evtype <- factor(stormdat_popg$evtype, levels = stormdat_popg$evtype[order(-stormdat_popg$total)])

## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated

ggplot(stormdat_popg, aes(x=evtype, y=total, fill=health_risk)) + geom_bar(stat="identity") + xlab("Event types") + ylab("Health risk count") + scale_fill_discrete(name = "Health Risks", labels=c("Total Fatilities", "Total Injuries")) + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) + theme(axis.text.y = element_text(hjust = 1, size = 8))

## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated

## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated

Results for Question 2

Reorder the factors in column “evtype” and create a barplot using data frame “stormdat_eco_p”. The resulting graph shows the Top 10 event types from 1950-2011 based on property damage. The data used came from the NOAA Storm Database. The x-axis is the event type and the y-axis represents the property damages cause by that event type from 1950-2011 in dollars.

stormdat_eco_p$evtype <- factor(stormdat_eco_p$evtype, levels = stormdat_eco_p$evtype[order(-stormdat_eco_p$sump)])

options(scipen=10000)
ggplot(data = stormdat_eco_p, aes(x=evtype, y=sump)) + geom_bar(stat="identity", color="blue") + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) + theme(axis.text.y = element_text(hjust = 1, size = 8)) + scale_y_continuous() + ggtitle("Top 10 event types by property damage (1950-2011)") + labs(x="Event types", y="Property damage in dollars")

Reorder the factors in column “evtype” and create a barplot using data frame “stormdat_eco_c”. The resulting graph shows the Top 10 event types from 1950-2011 based on crop damage. The data used came from the NOAA Storm Database. The x-axis is the event type and the y-axis represents the property damages cause by that event type from 1950-2011 in dollars.

stormdat_eco_c$evtype <- factor(stormdat_eco_c$evtype, levels = stormdat_eco_c$evtype[order(-stormdat_eco_c$sumc)])

options(scipen=10000)
ggplot(data = stormdat_eco_c, aes(x=evtype, y=sumc)) + geom_bar(stat="identity", color="blue") + theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) + theme(axis.text.y = element_text(hjust = 1, size = 8)) + scale_y_continuous() + ggtitle("Top 10 event types by crop damage (1950-2011)") + labs(x="Event types", y="Crop damage in dollars")