Peer Assessment 2

Synopsis: This report analyzes storm data collected by the National Weather Service from 1950 through November 2011. The data record weather events and several measures associated with them, including property damage, injuries, and fatalities.

Data Processing

We need to analyze storm data to answer the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

The documentation for the storm data has been provided along with an FAQ on what the storm events mean.

The events in the database start in 1950 and end in November 2011.

First we must load the dataset, and cache it to save time later.

data <- read.csv("repdata_data_StormData.csv.bz2", stringsAsFactors=FALSE)
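The line above loads the data but does not show how the caching is done. One possible approach, sketched here under the assumption that a local .rds copy is acceptable, is to save the parsed data frame once and reuse it on later runs. The helper name read_cached and the temporary demo files are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: read a CSV once, then reuse a cached .rds copy.
read_cached <- function(csv_path, cache_path) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))
  }
  # read.csv() can also read bzip2-compressed files directly.
  df <- read.csv(csv_path, stringsAsFactors = FALSE)
  saveRDS(df, cache_path)
  df
}

# Small demonstration; a temporary file stands in for the storm data.
demo_csv <- tempfile(fileext = ".csv")
demo_rds <- tempfile(fileext = ".rds")
write.csv(data.frame(EVTYPE = c("HAIL", "FLOOD"), INJURIES = c(0, 2)),
          demo_csv, row.names = FALSE)

first  <- read_cached(demo_csv, demo_rds)  # parses the CSV and writes the cache
second <- read_cached(demo_csv, demo_rds)  # served from the cache
```

For the real file, the call would look like read_cached("repdata_data_StormData.csv.bz2", "StormData.rds"), where the cache file name is again an assumption.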

Load the dplyr package for data manipulation and ggplot2 for plotting.

library(dplyr)
library(ggplot2)

Find out how many unique weather event types (EVTYPE) there are.

# Create a frequency table of unique EVTYPES
EVcounts <- as.data.frame(table(data$EVTYPE))
# Sort frequency in descending order
EVcounts <- arrange(EVcounts, desc(Freq))
# Number of unique event types
uniqueEVTYPE <- nrow(EVcounts)

There are 985 unique codes for weather events! That’s too many for this analysis. Let’s clean it up and recode into more general categories.

# 985 is too many, with many codes being used only once.
# For simplicity keep only codes mentioned 30 or more times
EVkeep <- filter(EVcounts, Freq>=30)
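Many of the rare codes in data like this differ only in letter case or stray whitespace. As a minimal sketch of an optional pre-cleaning step (not something the original analysis does), normalizing the strings before counting collapses such near-duplicates. The toy vector below is made up for illustration and uses base R only.

```r
# Made-up event labels with case and whitespace variants
ev <- c("TSTM WIND", "tstm wind", " TSTM WIND", "Hail", "HAIL", "FLOOD")

# Normalize: trim surrounding whitespace and convert to upper case
ev_clean <- toupper(trimws(ev))

# Frequency tables before and after normalization
raw_counts   <- as.data.frame(table(ev), stringsAsFactors = FALSE)
clean_counts <- as.data.frame(table(ev_clean), stringsAsFactors = FALSE)

# Sort the cleaned table by descending frequency
clean_counts <- clean_counts[order(-clean_counts$Freq), ]

nrow(raw_counts)    # number of distinct raw labels
nrow(clean_counts)  # number of distinct labels after normalization
```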

# Test to see what share of the data is kept
leftover <- sum(EVkeep$Freq)
original <- nrow(data)

# Number of kept cases divided by total, expressed as a percentage
percent <- 100 * leftover / original

Using only codes that appear 30 or more times, we keep 99.7% of the data (a proportion of 0.9969888), which is good enough for this analysis.

Let’s subset the data to only those cases and clean up some of the variables. Make event a factor, fix the date variable, and create a damage multiplier from PROPDMGEXP.

# Keep only data where EVTYPE is mentioned 30 or more times
data2 <- filter(data, EVTYPE %in% EVkeep$Var1)

# Make event type a factor
data2$EVTYPE <- as.factor(data2$EVTYPE)

# Fix date variable
data2$fulldate <- paste(data2$BGN_DATE, data2$BGN_TIME)
data2$fulldate <- strptime(data2$fulldate, "%m/%d/%Y 0:00:00 %H%M")
data2$fulldate <- as.POSIXct(data2$fulldate)
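To make the date format string easier to follow, here is a small sketch of the same parsing on two hand-written values. The sample strings and the tz = "UTC" argument are assumptions added for a reproducible illustration; the last lines show that a time written in any other layout does not match the format and becomes NA.

```r
# Hand-written sample values in the layout the format string expects
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00")
bgn_time <- c("0130", "0900")

# Same steps as above: paste date and time, then parse with one format
stamp  <- paste(bgn_date, bgn_time)
parsed <- as.POSIXct(strptime(stamp, "%m/%d/%Y 0:00:00 %H%M", tz = "UTC"))
formatted <- format(parsed, "%Y-%m-%d %H:%M")
formatted

# A time in a different layout does not match "%H%M" and yields NA
bad <- strptime("4/18/1950 0:00:00 01:30:00 AM",
                "%m/%d/%Y 0:00:00 %H%M", tz = "UTC")
is.na(bad)
```

Counting sum(is.na(data2$fulldate)) after the conversion is a quick way to see how many rows, if any, failed to parse.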

# Create a multiplier that is a recoded PROPDMGEXP.
data2$multiplier[data2$PROPDMGEXP == "K"] <- 1E3
data2$multiplier[data2$PROPDMGEXP == "M"] <- 1E6
data2$multiplier[data2$PROPDMGEXP == "B"] <- 1E9
data2$multiplier[data2$PROPDMGEXP == "0"] <- 1
data2$multiplier[data2$PROPDMGEXP == "1"] <- 1E1
data2$multiplier[data2$PROPDMGEXP == "2"] <- 1E2
data2$multiplier[data2$PROPDMGEXP == "3"] <- 1E3
data2$multiplier[data2$PROPDMGEXP == "4"] <- 1E4
data2$multiplier[data2$PROPDMGEXP == "5"] <- 1E5
data2$multiplier[data2$PROPDMGEXP == "6"] <- 1E6
data2$multiplier[data2$PROPDMGEXP == "7"] <- 1E7
data2$multiplier[data2$PROPDMGEXP == "8"] <- 1E8
data2$multiplier[data2$PROPDMGEXP == "-"] <- 1
data2$multiplier[data2$PROPDMGEXP == "+"] <- 1
data2$multiplier[data2$PROPDMGEXP == ""] <- 1
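The same recoding can be written more compactly with a lookup. The sketch below is an alternative, not the code used in this report, and the function name exp_to_multiplier is hypothetical. It makes one assumption that differs from the code above: lower-case letters are treated like their upper-case forms. Codes it does not recognize stay NA, just as they do above.

```r
# Hypothetical helper: convert an exponent code to a numeric multiplier.
exp_to_multiplier <- function(exp_code) {
  code <- toupper(exp_code)              # assumption: "k" means the same as "K"
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  out <- unname(lookup[code])            # NA for anything that is not K, M, or B

  # A single digit d means 10^d
  is_digit <- grepl("^[0-9]$", code)
  out[is_digit] <- 10^as.numeric(code[is_digit])

  # Empty, "-" and "+" are treated as a multiplier of 1, as in the code above
  out[code %in% c("", "-", "+")] <- 1
  out
}

codes <- c("K", "m", "B", "0", "5", "", "-", "+", "?")
multipliers <- exp_to_multiplier(codes)
multipliers
```

Because unrecognized codes remain NA, the later sums with na.rm=TRUE silently drop those rows, so it is worth checking table(is.na(data2$multiplier)) before trusting the totals.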

Reading through the 106 kept EVTYPEs, we will code them into 20 more general categories using regular expressions. The final categories will be:

Lightning, Flood, Heat, Cold, Fog, Fire, Dry, Tidal Changes, Thunderstorm, Hurricane, Tornado, Tropical Storm, Wind, Hail, Snow, Winter Storm, Ice, Rain, Other, and Landslide

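# Clean EVTYPE: recode to group like things together.
# Rules are applied in order, so a later match overwrites an earlier one
# (for example, "TSTM WIND" first matches THUNDERSTORM and then ends up as WIND).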
data2$EVTYPER[grepl("lightning", data2$EVTYPE, ignore.case=TRUE)] <- "LIGHTNING"
data2$EVTYPER[grepl("flood|fld", data2$EVTYPE, ignore.case=TRUE)] <- "FLOOD"
data2$EVTYPER[grepl("heat|warmth|unseasonably warm|temperature record", data2$EVTYPE, ignore.case=TRUE)] <- "HEAT"
data2$EVTYPER[grepl("cold", data2$EVTYPE, ignore.case=TRUE)] <- "COLD"
data2$EVTYPER[grepl("fog", data2$EVTYPE, ignore.case=TRUE)] <- "FOG"
data2$EVTYPER[grepl("fire", data2$EVTYPE, ignore.case=TRUE)] <- "FIRE"
data2$EVTYPER[grepl("dust|drought|dry", data2$EVTYPE, ignore.case=TRUE)] <- "DRY"
data2$EVTYPER[grepl("surf|surge|rip|tide", data2$EVTYPE, ignore.case=TRUE)] <- "TIDAL CHANGES"
data2$EVTYPER[grepl("tstm|thunderstorm", data2$EVTYPE, ignore.case=TRUE)] <- "THUNDERSTORM"
data2$EVTYPER[grepl("hurricane", data2$EVTYPE, ignore.case=TRUE)] <- "HURRICANE"
data2$EVTYPER[grepl("funnel|tornado|spout", data2$EVTYPE, ignore.case=TRUE)] <- "TORNADO"
data2$EVTYPER[grepl("tropical", data2$EVTYPE, ignore.case=TRUE)] <- "TROPICAL STORM"
data2$EVTYPER[grepl("wind", data2$EVTYPE, ignore.case=TRUE)] <- "WIND"
data2$EVTYPER[grepl("hail", data2$EVTYPE, ignore.case=TRUE)] <- "HAIL"
data2$EVTYPER[grepl("snow|avala", data2$EVTYPE, ignore.case=TRUE)] <- "SNOW"
data2$EVTYPER[grepl("wint|blizz|ice storm|sleet|snow|mixed preci", data2$EVTYPE, ignore.case=TRUE)] <- "WINTER STORM"
data2$EVTYPER[grepl("freeze|ice|frost|glaze|snow and ice", data2$EVTYPE, ignore.case=TRUE)] <- "ICE"
data2$EVTYPER[grepl("rain|monthly preci", data2$EVTYPE, ignore.case=TRUE)] <- "RAIN"
data2$EVTYPER[grepl("other", data2$EVTYPE, ignore.case=TRUE)] <- "OTHER"
data2$EVTYPER[grepl("landslide", data2$EVTYPE, ignore.case=TRUE)] <- "LANDSLIDE"
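Because each assignment above can overwrite the previous one, the final category of an event is decided by the last pattern it matches. The sketch below contrasts that behavior with a first-match-wins variant on a few sample labels. It is an illustration only: the function names and the three-rule table are hypothetical, and switching the report to first-match-wins would change its results.

```r
# A small, ordered rule table: category name -> regular expression
rules <- c(THUNDERSTORM = "tstm|thunderstorm",
           WIND         = "wind",
           HAIL         = "hail")

# Last match wins: mirrors the sequence of assignments used above
recode_last_match <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (label in names(rules)) {
    out[grepl(rules[[label]], x, ignore.case = TRUE)] <- label
  }
  out
}

# First match wins: only fill in values that are still unassigned
recode_first_match <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (label in names(rules)) {
    hit <- is.na(out) & grepl(rules[[label]], x, ignore.case = TRUE)
    out[hit] <- label
  }
  out
}

events <- c("TSTM WIND", "HIGH WIND", "HAIL", "DENSE FOG")
last_result  <- recode_last_match(events, rules)
first_result <- recode_first_match(events, rules)
data.frame(events, last_result, first_result)
```

In this toy run only "TSTM WIND" differs between the two variants, and "DENSE FOG" stays NA in both because no rule in the small table matches it.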

To answer the two questions, let’s create two metrics: total casualties, which measures the effect on population health, and total property damage, which measures the economic effect.

# Calculate the total casualties (injuries + death)
data2$casualties <- data2$INJURIES + data2$FATALITIES

# Calculate the actual total property damage.
data2$propdamage <- data2$PROPDMG * data2$multiplier

Let’s group the data by the recoded event type and calculate the average damage, total damage, average casualties, and total casualties.

# Group by recoded event type and summarize damage and casualties
data3 <- data2 %>%
    group_by(EVTYPER) %>%
    summarize(AverageDamage = mean(propdamage, na.rm=TRUE),
              TotalDamage =  sum(propdamage, na.rm=TRUE),
              AverageCasualties = mean(casualties, na.rm=TRUE),
              TotalCasualties = sum(casualties, na.rm=TRUE))

To answer the first question, let’s arrange the data in descending order by total casualties and plot it against the event type, displaying the top 5 weather events.

data3 <- data3 %>% arrange(desc(TotalCasualties))

casualtiesplot <- ggplot(data3[1:5,], aes(reorder(EVTYPER, -TotalCasualties), TotalCasualties)) +
    geom_bar(stat="identity", fill="dark blue") +
    xlab("Weather Event") +
    ylab("Total Casualties") +
    ggtitle("Casualties by Weather Event")

To answer the second question, we will arrange the data in descending order by total property damage and plot it against the event type. Again, let’s display the top 5 weather events.

data3 <- data3 %>% arrange(desc(TotalDamage))

damageplot <- ggplot(data3[1:5,], aes(reorder(EVTYPER, -TotalDamage), TotalDamage)) +
    geom_bar(stat="identity", fill="dark green") +
    xlab("Weather Event") +
    ylab("Total Property Damage") +
    ggtitle("Total Property Damage by Weather Event")

Results

print(casualtiesplot)

print(damageplot)

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The data show that tornadoes have the biggest effect on population health, followed by high winds, floods, and lightning. It would be advisable to increase funding for tornado alerts and to expand advisory warnings for high winds and floods. Lightning strikes cannot be prevented, so the main mitigation is educating the public about the dangers.

  2. Across the United States, which types of events have the greatest economic consequences?

The data show that floods cause the most property damage, followed by hurricanes, tornadoes, sudden tide changes, and high winds. Only property damage is measured here; crop damage is not included. It would be advisable to make sure that flood-prone areas have adequate protection, including dams, levees, sandbags, and drainage to remove excess water.