This study looks at U.S weather data collected from 1950 to 2011. It focuses on the impacts of weather against human health and economic damage. The four principle metrics this study will judge weather events by are human injuries, human fatalities, property damage and crop damage. This study is conducted using the R programming language.
Load all neccessary library packages.
library(RCurl)
library(datasets)
library(dplyr)
library(ggplot2)
library(knitr)
library(tidyr)
The data was downloaded from the internet and saved to a local drive.
## Download the data
stormDataURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
stormDataFilePath <- "./Coursera/Reproducible Research/Course Project 2/repdata%2Fdata%2FStormData.csv.bz2"
if (url.exists(stormDataURL)) {
if (!file.exists(stormDataFilePath)) {
download.file(stormDataURL, method = "libcurl", destfile = stormDataFilePath)
}
}
The data was loaded into R using read.csv() and forced all text data types to be strings instead of factors.
## Load the data
stormData <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", header = TRUE, sep = ",",
na.strings = NA, fill = TRUE,
stringsAsFactors = FALSE)
Inspect what the data looks like.
dim(stormData)
## [1] 902297 37
glimpse(stormData)
## Observations: 902,297
## Variables: 37
## $ STATE__ (dbl) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ BGN_DATE (chr) "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/195...
## $ BGN_TIME (chr) "0130", "0145", "1600", "0900", "1500", "2000", "01...
## $ TIME_ZONE (chr) "CST", "CST", "CST", "CST", "CST", "CST", "CST", "C...
## $ COUNTY (dbl) 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, ...
## $ COUNTYNAME (chr) "MOBILE", "BALDWIN", "FAYETTE", "MADISON", "CULLMAN...
## $ STATE (chr) "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL...
## $ EVTYPE (chr) "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNAD...
## $ BGN_RANGE (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ BGN_AZI (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ BGN_LOCATI (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ END_DATE (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ END_TIME (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ COUNTY_END (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COUNTYENDN (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ END_RANGE (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ END_AZI (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ END_LOCATI (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ LENGTH (dbl) 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, ...
## $ WIDTH (dbl) 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400...
## $ F (int) 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, ...
## $ MAG (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FATALITIES (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, ...
## $ INJURIES (dbl) 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50...
## $ PROPDMG (dbl) 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25....
## $ PROPDMGEXP (chr) "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "...
## $ CROPDMG (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ CROPDMGEXP (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ WFO (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ STATEOFFIC (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ ZONENAMES (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ LATITUDE (dbl) 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 333...
## $ LONGITUDE (dbl) 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 874...
## $ LATITUDE_E (dbl) 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, ...
## $ LONGITUDE_ (dbl) 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, ...
## $ REMARKS (chr) "", "", "", "", "", "", "", "", "", "", "", "", "",...
## $ REFNUM (dbl) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
The important columns for data analysis are FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, and EVTYPE.
sum(is.na(stormData$FATALITIES))
## [1] 0
sum(is.na(stormData$INJURIES))
## [1] 0
sum(is.na(stormData$PROPDMG))
## [1] 0
sum(is.na(stormData$PROPDMGEXP))
## [1] 0
sum(is.na(stormData$CROPDMG))
## [1] 0
sum(is.na(stormData$CROPDMGEXP))
## [1] 0
sum(is.na(stormData$EVTYPE))
## [1] 0
unique(stormData$STATE)
## [1] "AL" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN"
## [15] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV"
## [29] "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
## [43] "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY" "PR" "AK" "ST" "AS" "GU" "MH"
## [57] "VI" "AM" "LC" "PH" "GM" "PZ" "AN" "LH" "LM" "LE" "LS" "SL" "LO" "PM"
## [71] "PK" "XX"
unique(stormData$TIME_ZONE)
## [1] "CST" "EST" "PST" "MST" "CDT" "PDT" "EDT" "UNK" "HST" "GMT" "MDT"
## [12] "AST" "ADT" "CSt" "ESt" "CSC" "SCT" "ESY" "UTC" "SST" "AKS" "GST"
Analysis:
From the STATE & TIME_ZONE columns it is obvious there is data either comes from outside the US or the state code was marked incorrectly. Regardless the data will be removed before determining the results. The data set includes a code “DC”. This most likely refers to Washington D.C. We will include this designation as a U.S. state.
unique(stormData$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(stormData$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
stormData %>% group_by(PROPDMGEXP) %>% summarize(Count = length(PROPDMGEXP))
## Source: local data frame [19 x 2]
##
## PROPDMGEXP Count
## (chr) (int)
## 1 465934
## 2 - 1
## 3 ? 8
## 4 + 5
## 5 0 216
## 6 1 25
## 7 2 13
## 8 3 4
## 9 4 4
## 10 5 28
## 11 6 4
## 12 7 5
## 13 8 1
## 14 B 40
## 15 h 1
## 16 H 6
## 17 K 424665
## 18 m 7
## 19 M 11330
stormData %>% group_by(CROPDMGEXP) %>% summarize(Count = length(CROPDMGEXP))
## Source: local data frame [9 x 2]
##
## CROPDMGEXP Count
## (chr) (int)
## 1 618413
## 2 ? 7
## 3 0 19
## 4 2 1
## 5 B 9
## 6 k 21
## 7 K 281832
## 8 m 1
## 9 M 1994
Analysis:
From the data there are many different damage exponents. From the documentation we learn that “K” stands for thousands, “M” for millions and “B” for billions. In order to provide accurate damage amounts new columns will need to be created that display the damage in dollars only. There are many exponents that could not be found in the documentation. We will remove them from the data set.
evnType <- unique(stormData$EVTYPE)
length(evnType)
## [1] 985
evnType[order(evnType)][1:100]
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD"
## [3] " FLASH FLOOD" " LIGHTNING"
## [5] " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND"
## [9] "?" "ABNORMAL WARMTH"
## [11] "ABNORMALLY DRY" "ABNORMALLY WET"
## [13] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [15] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [19] "AVALANCHE" "BEACH EROSIN"
## [21] "Beach Erosion" "BEACH EROSION"
## [23] "BEACH EROSION/COASTAL FLOOD" "BEACH FLOOD"
## [25] "BELOW NORMAL PRECIPITATION" "BITTER WIND CHILL"
## [27] "BITTER WIND CHILL TEMPERATURES" "Black Ice"
## [29] "BLACK ICE" "BLIZZARD"
## [31] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"
## [33] "Blizzard Summary" "BLIZZARD WEATHER"
## [35] "BLIZZARD/FREEZING RAIN" "BLIZZARD/HEAVY SNOW"
## [37] "BLIZZARD/HIGH WIND" "BLIZZARD/WINTER STORM"
## [39] "BLOW-OUT TIDE" "BLOW-OUT TIDES"
## [41] "BLOWING DUST" "blowing snow"
## [43] "Blowing Snow" "BLOWING SNOW"
## [45] "BLOWING SNOW & EXTREME WIND CH" "BLOWING SNOW- EXTREME WIND CHI"
## [47] "BLOWING SNOW/EXTREME WIND CHIL" "BREAKUP FLOODING"
## [49] "BRUSH FIRE" "BRUSH FIRES"
## [51] "COASTAL FLOODING/EROSION" "COASTAL EROSION"
## [53] "Coastal Flood" "COASTAL FLOOD"
## [55] "coastal flooding" "Coastal Flooding"
## [57] "COASTAL FLOODING" "COASTAL FLOODING/EROSION"
## [59] "Coastal Storm" "COASTAL STORM"
## [61] "COASTAL SURGE" "COASTAL/TIDAL FLOOD"
## [63] "COASTALFLOOD" "COASTALSTORM"
## [65] "Cold" "COLD"
## [67] "COLD AIR FUNNEL" "COLD AIR FUNNELS"
## [69] "COLD AIR TORNADO" "Cold and Frost"
## [71] "COLD AND FROST" "COLD AND SNOW"
## [73] "COLD AND WET CONDITIONS" "Cold Temperature"
## [75] "COLD TEMPERATURES" "COLD WAVE"
## [77] "COLD WEATHER" "COLD WIND CHILL TEMPERATURES"
## [79] "COLD/WIND CHILL" "COLD/WINDS"
## [81] "COOL AND WET" "COOL SPELL"
## [83] "CSTL FLOODING/EROSION" "DAM BREAK"
## [85] "DAM FAILURE" "Damaging Freeze"
## [87] "DAMAGING FREEZE" "DEEP HAIL"
## [89] "DENSE FOG" "DENSE SMOKE"
## [91] "DOWNBURST" "DOWNBURST WINDS"
## [93] "DRIEST MONTH" "Drifting Snow"
## [95] "DROUGHT" "DROUGHT/EXCESSIVE HEAT"
## [97] "DROWNING" "DRY"
## [99] "DRY CONDITIONS" "DRY HOT WEATHER"
Analysis:
There are 985 unique Event Types and many of them similar to each other (e.g. all the event types that start with “Cold”). For brevity we will only show the first 100 event types.
902,297 records
Keep rows with States codes that are actual US states plus Washington DC.
Using the state.abb vector from the datasets package to get the abbreviations for the US states.
stormData_Trans <- filter(stormData, STATE %in% c(state.abb, "DC"))
883,623 records
Create a new column for PROPDMG costs that displays the results in dollars (no abbreviations) called Prop_Damage. If the exponent column is “K” then multiply by one thousand. If “m” or “M” then multiply by one million. If “B” then multiply by one billion.
stormData_Trans <- stormData_Trans %>% mutate(Prop_Damage = ifelse(PROPDMGEXP == "K", PROPDMG * 1000,
ifelse(PROPDMGEXP == "M", PROPDMG * 1000000,
ifelse(PROPDMGEXP == "m", PROPDMG * 1000000,
ifelse(PROPDMGEXP == "B", PROPDMG * 1000000000,
PROPDMG)))))
Create a second new column for CROPDMG costs that displays the resutls in dollars (no abbreviations) called Crop_Damage. If the exponent column is “K” or “k” then multiply by one thousand. If “m” or “M” then multiply by one million. If “B” then multiply by one billion.
stormData_Trans <- stormData_Trans %>% mutate(Crop_Damage = ifelse(CROPDMGEXP == "K", CROPDMG * 1000,
ifelse(CROPDMGEXP == "k", CROPDMG * 1000,
ifelse(CROPDMGEXP == "M", CROPDMG * 1000000,
ifelse(CROPDMGEXP == "m", CROPDMG * 1000000,
ifelse(CROPDMGEXP == "B", CROPDMG * 1000000000,
CROPDMG))))))
Convert column BGN_DATE to date format.
The column is in the format of Month/Day/Year Time
stormData_Trans$BGN_DATE <- as.POSIXct(strptime(stormData_Trans$BGN_DATE, "%m/%d/%Y %H:%M:%S"))
Keep records that have at least one non-zero number in the following four columns: Fatalities, Injuries, Prop_Damage, and Crop_Damage. With 985 unique event types it would be cumbersome to plot all the event types. Also since I am not a meterologist I do not feel I have enough knowledge to condense the Event types into similar names. So I have decided to remove records that have a zero in all the columns Fatalities, Injuries, Prop_Damage, and Crop_Damage.
stormData_Trans <- stormData_Trans %>% filter(FATALITIES > 0 | INJURIES > 0 | Prop_Damage > 0 | Crop_Damage > 0)
253,352 records
## Event Types by Fatalities
stormData_Fatal <- stormData_Trans %>% group_by(EVTYPE) %>% summarize(Fatalities = sum(FATALITIES)) %>%
filter(Fatalities > 0) %>% arrange(desc(Fatalities))
## Event Types by Injuries
stormData_Inj <- stormData_Trans %>% group_by(EVTYPE) %>% summarize(Injuries = sum(INJURIES)) %>%
filter(Injuries > 0) %>% arrange(desc(Injuries))
## Combine Fatalities and Injuries summarized tables into one table (inner join)
stormData_PopHealth <- inner_join(stormData_Inj, stormData_Fatal, by = "EVTYPE")
## Event Type by Property Damage
stormData_PropDmg <- stormData_Trans %>% group_by(EVTYPE) %>% summarize(PropDamage = sum(Prop_Damage)) %>%
filter(PropDamage > 0) %>% arrange(desc(PropDamage))
## Event Type by Crop Damage
stormData_CropDmg <- stormData_Trans %>% group_by(EVTYPE) %>% summarize(CropDamage = sum(Crop_Damage)) %>%
filter(CropDamage > 0) %>% arrange(desc(CropDamage))
## Combine Property Damage and Crop Damage summarized tables into on table (inner join)
stormData_ECon <- inner_join(stormData_PropDmg, stormData_CropDmg, by = "EVTYPE")
Analysis:
These summarized tables are the final output needed to chart, plot or display the results.
To make the tables I:
1. Grouped the now clean and tidy stormData_Trans data frame by the Event Types (EVTYPE)
2. Summed the respective column (Injuries, Fatalties, Property Damage, or Crop Damage) for each group
3. Removed any Event Types that had zero Injuries, Fatalties, Property Damage, or Crop Damage
4. Sorted the table from highest to lowest
After cleaning, analyizing, and transforming the data it is obvious that weather have a tremendous impact on the health of humans and our economy from 1950 to 2011. Below is a table outlining the most dangerous/impactful weather event for the four categories: injuries, fatalities, property damage, and crop damage.
| Most Harmful Weather Event | Classification | Total |
|---|---|---|
| TORNADO | Fatalities | 5,633 people |
| TORNADO | Injuries | 91,346 people |
| FLOOD | Property Damage | 144,541,318,807 USD |
| DROUGHT | Crop Damage | 13,972,366,000 USD |
Below are the ten highest weather events by human fatalities.
g <- ggplot(data = stormData_Fatal[1:10,], aes(x = EVTYPE, y = Fatalities))
g + geom_bar(stat = "identity") +
geom_text(aes(label = stormData_Fatal$Fatalities[1:10]), vjust = -0.25, size = 5, color = "blue") +
xlab("Event Type") +
ggtitle("Highest 10 Weather Events by Fatalities") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = scales::comma) +
scale_x_discrete(limits = stormData_Fatal$EVTYPE[1:10]) +
theme(panel.background = element_rect(fill = "lightblue"))
Below are the ten highest weather events by human injuries.
p <- ggplot(data = stormData_Inj[1:10,], aes(x = EVTYPE, y = Injuries))
p + geom_bar(stat = "identity") + xlab("Event Type") +
ggtitle("Highest 10 Weather Events by Injuries") +
scale_y_continuous(labels = scales::comma) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_discrete(limits = stormData_Inj$EVTYPE[1:10]) +
geom_text(aes(label = stormData_Inj$Injuries[1:10]), vjust = -0.25, size = 4, color = "brown") +
theme(panel.background = element_rect(fill = "lightblue"))
Analysis:
From observing the graphs, tornado is the most harmful weather event as it relates to human injuries and fatalities. Since 1950, 96,979 people have been injured or killed by tornados. In fact, more people are injured or killed by tornados than the combined total of the rest of the top ten weather events. Which only amounts to 40,545 people.
So, more than twice as many people are effected by tornados than the combined total of the other top nine weather events!
Below is a table of the top 10 weather conditions by property damage.
kable(format(stormData_PropDmg[1:10,], big.mark=","), align = "l")
| EVTYPE | PropDamage |
|---|---|
| FLOOD | 144,541,318,807 |
| HURRICANE/TYPHOON | 69,033,100,000 |
| TORNADO | 56,936,690,779 |
| STORM SURGE | 43,323,536,000 |
| FLASH FLOOD | 15,884,265,417 |
| HAIL | 15,732,260,548 |
| HURRICANE | 9,913,998,010 |
| TROPICAL STORM | 7,602,721,550 |
| WINTER STORM | 6,688,497,251 |
| HIGH WIND | 5,269,120,295 |
Below is a table of the top 10 weather conditions by crop damage.
kable(format(stormData_CropDmg[1:10,], big.mark=","), align = "l")
| EVTYPE | CropDamage |
|---|---|
| DROUGHT | 13,972,366,000 |
| FLOOD | 5,613,968,450 |
| RIVER FLOOD | 5,029,459,000 |
| ICE STORM | 5,022,113,500 |
| HAIL | 3,025,954,473 |
| HURRICANE/TYPHOON | 2,603,500,800 |
| HURRICANE | 2,189,930,000 |
| FLASH FLOOD | 1,406,905,100 |
| EXTREME COLD | 1,292,973,000 |
| FROST/FREEZE | 1,094,086,000 |
Analysis:
After inspecting the two tables the highest costing weather event to property is flood while the highest costing weather event to crops is drought. Like the weather impacts to humans above, the highest costing weather event by property and crop damage is twice as costly than each second place weather event.
This analysis of weather data that spans over 60 years has been enlighting. We learned that tornados pose the greatest threat to humans safety. We also learned that farmers fear drought more than any other weather event and flooding poses the greatest threat to property.