Repro Course Project 2: Health and Economic Impact of U.S. Storm Events by Broad Category

Storms are bad. If you can’t stop them, try to live better with them.

Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The 985 event labels will be grouped into 7 broad categories. The costs for each broad category will be totaled and presented for comparison.

Questions

This data analysis addresses the following questions:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The Dataset has columns for “FATALITIES” and “INJURIES”. Those will be the categories for population health.

2. Across the United States, which types of events have the greatest economic consequences?

The Dataset has columns for Property Damage (“PROPDMG”) and Crop Damage (“CROPDMG”). Those will be the categories for economic consequences.

The dataset records 902297 separate events, each assigned an Event Type. There are 985 different Event Type names.

Broadly speaking, the events can be grouped into 7 categories, a far more manageable number than 985. Some of the 985 event names are typos or redundancies, evidence of data entry error. The importance of a broad category could be lost if its events were divided across labels that should be joined.
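
As a small illustration of how data-entry variation inflates the number of distinct labels, the sketch below counts unique labels before and after a simple cleanup. The `labels` vector is made-up toy data, not taken from the NOAA file, and trimming whitespace plus upper-casing is only one assumed cleanup step.

```r
# Toy labels (hypothetical, for illustration only)
labels <- c("TSTM WIND", "tstm wind", " TSTM WIND", "HAIL", "Hail", "FLOOD")

# Count distinct labels exactly as entered
n_raw <- length(unique(labels))

# Strip surrounding whitespace and unify case, then count again
labels_clean <- toupper(trimws(labels))
n_clean <- length(unique(labels_clean))

print(c(raw = n_raw, clean = n_clean))
```

The same two counts could be taken on `data$EVTYPE` to see how many of the 985 names differ only in case or spacing.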

Data Processing

Get the data

Get the data from the internet.

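file_loc <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
filename <- 'Storm_data.csv.bz2'
# Download only once; mode = "wb" keeps the compressed file intact on Windows
if (!file.exists(filename)) {
        download.file(file_loc, filename, mode = "wb")
}
# read.csv reads the bz2-compressed file directly
data <- read.csv(filename)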

ev_names <- data$EVTYPE
dim(data)
## [1] 902297     37

Most of the columns are not needed for this analysis. Pare down to just the variables that might be needed.

var_names <- names(data)
names_to_pair <- var_names[c(7,8,22,23,24,25,27)]
# [1] "STATE"      "EVTYPE"     "MAG"        "FATALITIES"
# [5] "INJURIES"   "PROPDMG"    "CROPDMG" 
data <- data[names_to_pair]
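
Selecting by position works only while the column order stays the same. A possible alternative, sketched here on a made-up data frame (`toy`, `keep` and `toy_small` are hypothetical names), is to list the wanted columns by name and check that they exist before subsetting.

```r
# Toy data frame standing in for the storm data (hypothetical values)
toy <- data.frame(
  STATE      = c("AL", "TX"),
  EVTYPE     = c("TORNADO", "HAIL"),
  BGN_DATE   = c("4/18/1950", "5/1/1995"),
  FATALITIES = c(0, 1),
  stringsAsFactors = FALSE
)

# Columns to keep, listed by name rather than by position
keep <- c("STATE", "EVTYPE", "FATALITIES")

# Stop early if an expected column is missing
stopifnot(all(keep %in% names(toy)))

toy_small <- toy[keep]
print(names(toy_small))
```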

# Keyword fragments used to assign each EVTYPE label to a broad category
storm_names <- c("wind", "tstm", "tornado", "tropical", "storm", "torrent", "thunder", "hurric", "blizzard")
big_wind_names <- c("tornado", "hurric", "gustnado", "funnel")
precip_names <- c("hail", "rain", "snow", "shower", "precip", "sleet")
mud_names <- c("mud")
flood_names <- c("flood", "surf", "surge", "current")
temper_names <- c("cold", "heat", "hot", "blizzard", "freez", "frost")

Add a new category column to the data data.frame: one of six keyword-based categories, or “other” if no keyword matches, for 7 categories in total. A helper function, get_tf, flags the event names that contain any keyword from a given list.

get_tf <- function(sub_names, ev_names) {
        # TRUE where the lower-cased event name contains any keyword in sub_names
        ev_lower <- tolower(ev_names)
        tf <- logical(length(ev_names))
        for (n in sub_names) {
                tf <- tf | grepl(n, ev_lower)
        }
        tf
}
storm_tf <- get_tf(storm_names,ev_names)
big_wind_tf <- get_tf(big_wind_names,ev_names)
precip_tf <- get_tf(precip_names,ev_names)
mud_tf <- get_tf(mud_names,ev_names)
flood_tf <- get_tf(flood_names,ev_names)
temper_tf <- get_tf(temper_names,ev_names)

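# Later assignments overwrite earlier ones, so an event name that matches
# several keyword lists ends up in the last matching category
# (e.g. "blizzard" matches both storm_names and temper_names and becomes "temper").
categ <- character(length(ev_names))
categ[storm_tf] <- "storm"
categ[big_wind_tf] <- "big_wind"
categ[precip_tf] <- "precip"
categ[mud_tf] <- "mud"
categ[flood_tf] <- "flood"
categ[temper_tf] <- "temper"
categ[!nzchar(categ)] <- "other"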

data$Broad_categ <- categ
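
The keyword matching can also be written without a loop by joining the keywords into one regular expression. This is only a sketch of an alternative, not the method used above: `match_any` is a hypothetical helper, and the labels and shortened keyword lists are toy values chosen to show how the order of assignment decides the final category.

```r
# Hypothetical helper: TRUE where a label contains any keyword (case-insensitive)
match_any <- function(keywords, labels) {
  pattern <- paste(keywords, collapse = "|")  # e.g. "hail|rain|snow"
  grepl(pattern, labels, ignore.case = TRUE)
}

# Toy labels (made up for illustration)
toy_labels <- c("HAIL", "Heavy Rain", "TORNADO", "Blizzard", "DUST DEVIL")

storm_hit  <- match_any(c("tornado", "blizzard"), toy_labels)
precip_hit <- match_any(c("hail", "rain", "snow"), toy_labels)
temper_hit <- match_any(c("cold", "blizzard"), toy_labels)

# Later assignments win, so "Blizzard" ends up as "temper", not "storm"
toy_categ <- rep("other", length(toy_labels))
toy_categ[storm_hit]  <- "storm"
toy_categ[precip_hit] <- "precip"
toy_categ[temper_hit] <- "temper"

print(data.frame(label = toy_labels, category = toy_categ))
```

Because the categories overlap, the result depends on the order of the assignment lines; reordering them would move labels such as "Blizzard" into a different category.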

Split the data by broad category. A “pairs” plot on the original “data” takes a very long time and may not be informative. “pairs” plots on the separate broad categories also take a long time; save them to file if you wish. The entire “for” loop (commented out below) took more than 45 minutes to write the files.

data_sub_categ_split <- split(data,data$Broad_categ)
categ_names <- names(data_sub_categ_split)
# 
# for(n in categ_names){
#         com <-paste0("pairs(data_sub_categ_split$",n,")")
#         plot_name <- paste0("plot_",n,".png")
#         png(file=plot_name)
#         eval(parse(text = com))
#         dev.off()
# }

# pairs(data_sub_categ_split$other)
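
One way to shorten the plotting time might be to draw the scatterplot matrix from a random sample of rows instead of every event. The sketch below uses simulated toy data rather than the storm data; the sample size `n_max`, the seed, and the use of a temporary PDF file are all assumptions made for illustration (`png()` could be used in the same place).

```r
set.seed(1)  # fixed seed so the sample is reproducible

# Simulated numeric data standing in for one broad-category subset (hypothetical)
toy <- data.frame(
  FATALITIES = rpois(5000, 0.1),
  INJURIES   = rpois(5000, 0.5),
  PROPDMG    = runif(5000, 0, 500)
)

# Keep at most n_max rows, chosen at random
n_max <- 1000
idx <- sample(nrow(toy), min(n_max, nrow(toy)))
toy_sample <- toy[idx, ]

# Draw the scatterplot matrix to a file; only numeric columns are passed to pairs()
plot_file <- tempfile(fileext = ".pdf")
pdf(file = plot_file)
pairs(toy_sample)
dev.off()
```

A sample shows the overall shape of the relationships but can miss rare extreme events, so it suits a quick look rather than the final figures.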

Now sum INJURIES, FATALITIES, PROPDMG and CROPDMG for each of the 7 broad categories.

FATALITIES <- data.frame(Category = categ_names)
FATALITIES$Total <-0
row.names(FATALITIES) <- categ_names
INJURIES <- FATALITIES
PROPDMG <- FATALITIES
CROPDMG <- FATALITIES

for(n in categ_names){
        FATALITIES[n,"Total"] <- sum(data_sub_categ_split[[n]]$FATALITIES)
        INJURIES[n,"Total"] <- sum(data_sub_categ_split[[n]]$INJURIES)
        PROPDMG[n,"Total"] <- sum(data_sub_categ_split[[n]]$PROPDMG)
        CROPDMG[n,"Total"] <- sum(data_sub_categ_split[[n]]$CROPDMG)
}
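
The per-category totals can also be computed in one call with `aggregate()`. The sketch below is an alternative shown on a made-up data frame (`toy` and `totals` are hypothetical names), not a replacement for the loop above.

```r
# Toy data standing in for the categorized storm data (hypothetical values)
toy <- data.frame(
  Broad_categ = c("flood", "flood", "precip", "storm"),
  FATALITIES  = c(1, 2, 0, 4),
  INJURIES    = c(0, 5, 1, 2),
  stringsAsFactors = FALSE
)

# Sum both columns within each broad category
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ Broad_categ, data = toy, FUN = sum)
print(totals)
```

This returns one data frame with a row per category and a column per variable, where the loop above builds four separate data frames.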


Results

Create bar charts for the four variables of interest. Notable features are discussed in the Analysis section.

library(ggplot2)
# Install gridExtra only if it is not already available
if (!requireNamespace("gridExtra", quietly = TRUE)) {
  install.packages("gridExtra", repos = "http://cran.us.r-project.org")
}
library(gridExtra)
p<-ggplot(data=FATALITIES, aes(x=Category, y=Total)) +
  geom_bar(stat="identity")+ggtitle("FATALITIES")

p2<-ggplot(data=INJURIES, aes(x=Category, y=Total)) +
  geom_bar(stat="identity")+ggtitle("INJURIES")

grid.arrange(p,p2,nrow = 1)

Figure 1: Fatalities and Injuries Bar Charts

p3<-ggplot(data=PROPDMG, aes(x=Category, y=Total)) +
  geom_bar(stat="identity")+ggtitle("PROPDMG")

p4<-ggplot(data=CROPDMG, aes(x=Category, y=Total)) +
  geom_bar(stat="identity")+ggtitle("CROPDMG")

grid.arrange(p3,p4,nrow = 1)

Figure 2: Property Damage and Crop Damage Bar Charts

Analysis

In Figure 1 we see that “Big Wind” type events, like tornadoes and hurricanes, are the leading cause of Fatalities, followed by extreme “Temperature” type weather events (with labels such as “Cold”, “Blizzard” and “Hot”).

Figure 1 also shows that “Big Wind” type events are the clear leading cause of weather-induced Injuries.

In Figure 2 we see that Property Damage is caused by various things, such as “Big Wind” type events or “Storms” (wind storms, thunderstorms, tropical storms). I suspect that those two event types are basically the same thing, with just a difference in degree. Individual “Storm” events probably cause less damage than “Big Wind” events, but are probably more frequent.
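
The guess about frequency versus per-event damage could be checked by putting event counts and per-event means next to the totals. The sketch below does this on made-up numbers, so its output says nothing about the real data; `toy` and `summary_tbl` are hypothetical names.

```r
# Toy data (hypothetical values, not from the storm database)
toy <- data.frame(
  Broad_categ = c("storm", "storm", "storm", "storm", "big_wind", "big_wind"),
  PROPDMG     = c(5, 10, 5, 20, 100, 60),
  stringsAsFactors = FALSE
)

# Number of events and total damage per category
n_events  <- tapply(toy$PROPDMG, toy$Broad_categ, length)
total_dmg <- tapply(toy$PROPDMG, toy$Broad_categ, sum)

summary_tbl <- data.frame(
  Category     = names(n_events),
  Events       = as.vector(n_events),
  Total        = as.vector(total_dmg),
  MeanPerEvent = as.vector(total_dmg / n_events)
)
print(summary_tbl)
```

Running the same summary on `data$PROPDMG` and `data$Broad_categ` would show whether “Storm” events are in fact more frequent but less damaging per event.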

We also see in Figure 2 that Precipitation events (hail, rain, snow, shower, precip, sleet) are the leading cause of Crop Damage, with Flood damage being a distant second.