This study has been performed for the Coursera “Reproducible Research” course which is part of the Data Science Specialization provided by the Johns Hopkins University.
The goal of this project is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The assignment focuses mainly on addressing these two questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from this link:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Note that the assignment sets only general rules, so the results can vary with the assumptions and approaches adopted.
3.1. The data has been downloaded from the “Storm Data” link above and stored in the working directory.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data_noaa <- read.csv("repdata-data-StormData.csv")
colnames(data_noaa) <- tolower(colnames(data_noaa))
str(data_noaa)
## 'data.frame': 902297 obs. of 37 variables:
## $ state__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ bgn_date : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ bgn_time : Factor w/ 3608 levels "000","0000","0001",..: 152 167 2645 1563 2524 3126 122 1563 3126 3126 ...
## $ time_zone : Factor w/ 22 levels "ADT","AKS","AST",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ county : num 97 3 57 89 43 77 9 123 125 57 ...
## $ countyname: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ state : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ evtype : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 826 826 826 826 826 826 826 826 826 826 ...
## $ bgn_range : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bgn_azi : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bgn_locati: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ end_date : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ end_time : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ county_end: num 0 0 0 0 0 0 0 0 0 0 ...
## $ countyendn: logi NA NA NA NA NA NA ...
## $ end_range : num 0 0 0 0 0 0 0 0 0 0 ...
## $ end_azi : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ end_locati: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ length : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ width : num 100 150 123 100 150 177 33 33 100 100 ...
## $ f : int 3 2 2 2 2 2 2 1 3 3 ...
## $ mag : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ propdmg : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ propdmgexp: Factor w/ 19 levels "","+","-","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cropdmgexp: Factor w/ 9 levels "","0","2","?",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ wfo : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ stateoffic: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ zonenames : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ latitude : num 3040 3042 3340 3458 3412 ...
## $ longitude : num 8812 8755 8742 8626 8642 ...
## $ latitude_e: num 3051 0 0 0 0 ...
## $ longitude_: num 8806 0 0 0 0 ...
## $ remarks : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ refnum : num 1 2 3 4 5 6 7 8 9 10 ...
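Because the source file is bzip2-compressed, it can also be read without decompressing it by hand. The following is a minimal sketch on toy data (the temporary file and its values are illustrative, not NOAA's), relying on the fact that R detects compressed files automatically when `read.csv()` opens them for reading:

```r
# Build a tiny bzip2-compressed CSV in a temporary file (toy data, not NOAA's).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the bzip2 compression.
toy_back <- read.csv(bz_path, stringsAsFactors = FALSE)

# Lower-case the column names, as done for data_noaa above.
colnames(toy_back) <- tolower(colnames(toy_back))
print(toy_back)
```

Passing `stringsAsFactors = FALSE` keeps text columns as character vectors, which avoids the factor levels visible in the `str()` output above.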
3.2 Although records are available from 1950, all event types have only been recorded since Jan. 1996. Hence, this analysis starts in Jan. 1996 and keeps only the columns needed to answer the two questions.
#subsetting the columns needed
storm_data <- data_noaa %>%
select(bgn_date, evtype, fatalities: cropdmgexp) %>%
filter(evtype != "?")
#subsetting the sample since 1996 (">=" keeps events dated 1 January 1996)
storm_data <- subset(storm_data,
as.Date(storm_data$bgn_date, format = "%m/%d/%Y") >= as.Date("1996-01-01"))
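The date filter relies on two details: `as.Date()` ignores the trailing time once the format string is consumed, and the comparison operator decides whether events dated exactly 1 January 1996 are kept. A small sketch on made-up rows (the dates and event types below are illustrative only):

```r
# Toy dates in the same "m/d/Y H:M:S" layout as bgn_date (values are illustrative).
toy <- data.frame(
  bgn_date = c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "4/18/2005 0:00:00"),
  evtype   = c("HAIL", "FLOOD", "TORNADO"),
  stringsAsFactors = FALSE
)

# Only the leading month/day/year part is parsed; the time part is ignored.
toy$date <- as.Date(toy$bgn_date, format = "%m/%d/%Y")

# An inclusive comparison keeps the row dated 1996-01-01; a strict ">" would drop it.
since_1996 <- toy[toy$date >= as.Date("1996-01-01"), ]
print(since_1996$evtype)
```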
3.3 ‘CROPDMGEXP’ holds the exponent codes for ‘CROPDMG’ (crop damage) and, in the same way, ‘PROPDMGEXP’ holds the exponent codes for ‘PROPDMG’ (property damage). Both need to be recoded because the notation in the original dataset is inconsistent. The recoding follows the approach described in: Handle Exponent Value of PROPDMGEXP and CROPDMGEXP
#Unique values of PROPDMGEXP
unique(storm_data$propdmgexp)
## [1] K M B 0
## Levels: + - 0 1 2 3 4 5 6 7 8 ? B H K M h m
The linked write-up justifies the treatment of the numeric codes and of the symbols “-”, “?” and “+”. Following it, I converted each exponent code to its multiplier and multiplied the damage figure by that value.
storm_data$propdmgexp <- toupper(storm_data$propdmgexp)
# recode the digit codes first, so the "0" and "1" assigned below are not overwritten
numbers_damage <- c("0","1","2","3","4","5","6","7","8")
storm_data$propdmgexp[storm_data$propdmgexp %in% numbers_damage ] <- 10
storm_data$propdmgexp[storm_data$propdmgexp == ""] <- 0
storm_data$propdmgexp[storm_data$propdmgexp == "?"] <- 0
storm_data$propdmgexp[storm_data$propdmgexp == "-"] <- 0
storm_data$propdmgexp[storm_data$propdmgexp == "+"] <- 1
storm_data$propdmgexp[storm_data$propdmgexp == "H"] <- 10^2 # = 100 (hundred)
storm_data$propdmgexp[storm_data$propdmgexp == "K"] <- 10^3 # = 1,000 (thousand)
storm_data$propdmgexp[storm_data$propdmgexp == "M"] <- 10^6 # = 1,000,000 (million)
storm_data$propdmgexp[storm_data$propdmgexp == "B"] <- 10^9 # = 1,000,000,000 (billion)
storm_data$propdmgexp <- as.numeric(storm_data$propdmgexp)
storm_data$property_damage <- storm_data$propdmg * storm_data$propdmgexp
#unique values of CROPDMGEXP
unique(storm_data$cropdmgexp)
## [1] K M B
## Levels: 0 2 ? B K M k m
storm_data$cropdmgexp <- toupper(storm_data$cropdmgexp)
# recode the digit codes first, so the "0" assigned to blanks and "?" is not overwritten
storm_data$cropdmgexp[storm_data$cropdmgexp == "0"] <- 10
storm_data$cropdmgexp[storm_data$cropdmgexp == "2"] <- 10
storm_data$cropdmgexp[storm_data$cropdmgexp == ""] <- 0
storm_data$cropdmgexp[storm_data$cropdmgexp == "?"] <- 0
storm_data$cropdmgexp[storm_data$cropdmgexp == "K"] <- 10^3 # = 1,000 (thousand)
storm_data$cropdmgexp[storm_data$cropdmgexp == "M"] <- 10^6 # = 1,000,000 (million)
storm_data$cropdmgexp[storm_data$cropdmgexp == "B"] <- 10^9 # = 1,000,000,000 (billion)
storm_data$cropdmgexp <- as.numeric(storm_data$cropdmgexp)
storm_data$crop_damage <- storm_data$cropdmg * storm_data$cropdmgexp
storm_data$evtype <- toupper(storm_data$evtype)
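The exponent recoding can also be written as a single lookup table, which avoids any dependence on the order of the assignments. This is only a sketch: `exp_lookup` and `exp_to_multiplier()` are hypothetical names, and the mapping assumes the usual reading of the codes (H = 10^2, K = 10^3, M = 10^6, B = 10^9, digits 0 to 8 = 10, "+" = 1, everything else = 0).

```r
# Multipliers for each exponent code (assumed mapping; digits 0-8 all map to 10).
exp_lookup <- c(
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
  "+" = 1, "-" = 0, "?" = 0,
  setNames(rep(10, 9), as.character(0:8))
)

# Hypothetical helper: convert a vector of exponent codes to numeric multipliers.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  mult <- unname(exp_lookup[code])  # blank or unknown codes come back as NA
  mult[is.na(mult)] <- 0            # treat them as a multiplier of 0
  mult
}

# Toy damage figures and codes (illustrative values only).
toy_dmg <- c(25, 2.5, 1, 3, 7)
toy_exp <- c("K", "m", "B", "", "5")
toy_total <- toy_dmg * exp_to_multiplier(toy_exp)
print(toy_total)
```

Because every code is translated in one step from its original value, a blank recoded to 0 can never be mistaken later for the digit code "0".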
Subsetting and compiling the information regarding Fatalities due to Weather Events
# obtaining the Top 5 causes of Fatalities
total_damage <- storm_data %>%
group_by(evtype) %>%
summarise(fatalities = sum(fatalities), injuries = sum(injuries))
total_damage_fatalities <- total_damage[order(total_damage$fatalities, decreasing = TRUE),]
total_damage_fatalities$injuries <- NULL
total_fatalities_head <- head(total_damage_fatalities,5)
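The same group, sum, sort and truncate steps recur for fatalities, injuries and economic loss, so they could be wrapped in one function. A minimal base-R sketch on made-up numbers, where `top_n_events()` is a hypothetical helper and not part of the analysis above:

```r
# Toy event records (made-up numbers, only to show the mechanics).
toy <- data.frame(
  evtype     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD", "HEAT"),
  fatalities = c(3, 1, 2, 0, 2, 6)
)

# Hypothetical helper: total a measure per event type and keep the n largest.
top_n_events <- function(df, measure, n = 5) {
  totals <- aggregate(df[[measure]], by = list(evtype = df$evtype), FUN = sum)
  names(totals)[2] <- measure
  totals <- totals[order(totals[[measure]], decreasing = TRUE), ]
  head(totals, n)
}

top2 <- top_n_events(toy, "fatalities", n = 2)
print(top2)
```

Calling it with `"injuries"` or a damage column instead of `"fatalities"` would reuse the same logic for the other rankings.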
The five event types with the most fatalities since 1996 were computed and plotted with R's base plotting system for easier reading.
barplot(total_fatalities_head$fatalities,las = 0,names.arg = total_fatalities_head$evtype,
main = "Top 5 of Fatalities by type of Weather Events in US since 1996",
ylab = "Number of Fatalities", xlab = "Weather Events",
col = "black", cex.names =0.6, cex = 0.6, ylim = c(0, 2000))
Subsetting and compiling the information regarding Injuries due to Weather Events
# obtaining the Top 5 causes of Injuries
total_damage_injuries <- filter(total_damage, injuries >0)
total_damage_injuries <- arrange(total_damage_injuries, desc(injuries))
total_damage_injuries$fatalities <- NULL
total_injuries_head <- head(total_damage_injuries,5)
The five event types with the most injuries since 1996 were computed and plotted with R's base plotting system for easier reading.
barplot(total_injuries_head$injuries,las = 0,names.arg = total_injuries_head$evtype,
main = "Top 5 of Injuries by type of Weather Events in US since 1996",
ylab = "Number of Injuries", xlab = "Weather Events",
col = "black", cex.names =0.6, cex.axis = 0.6)
Subsetting and compiling the information regarding the economic impact of major Weather Events. I summed both types of economic damage, property damage and crop damage.
# Top 5 in property and crop damage
storm_data$total_value <- storm_data$property_damage + storm_data$crop_damage
total_value <- storm_data[storm_data$total_value > 0,c("evtype", "total_value")]
total_value_head <- total_value %>% group_by(evtype) %>%
summarise(total_value =sum(total_value))
total_value_head <- head(arrange(total_value_head, desc(total_value)),5)
Plot of the Top 5 Weather Events by economic impact in the US since 1996.
barplot(total_value_head$total_value/10^9,las = 0,names.arg = total_value_head$evtype,
main = "Top 5 of Economic Loss by type of Weather Events in US since 1996",
ylab = "Total Cost (in billions of USD)",
col = "black", cex.names =0.6, cex.axis = 0.6) # y-axis range left to barplot() so no bar is clipped
EXCESSIVE HEAT and TORNADO have been the two largest causes of FATALITIES due to major Weather Events in the US since 1996.
TORNADO and FLOOD have been the two largest causes of INJURIES due to major Weather Events in the US since 1996.
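FLASH FLOOD and THUNDERSTORM (TSTM) WIND caused the largest economic losses in the US since 1996 when the “M” and “B” exponents are weighted as 10^4 and 10^5. This ranking is sensitive to those multipliers: reading them as millions (10^6) and billions (10^9) gives far more weight to the few events with very large recorded damage.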
While this project addresses the main questions set by the assignment, the analysis could be improved in two ways:
Because the database records monetary values over a long period, adjusting them for inflation would make the results more accurate. For the sake of simplicity, at least as a first step, I focused on the cleaning process and a first draft of the data manipulation.
Even in the subset since 1996, the year from which all major Weather Events were recorded, some events may be wrongly registered or contain typos. Further cleaning, and matching the raw data against a list of official Weather Event types, might improve the accuracy of the results.
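As an illustration of the last point, event labels could be normalised with a small set of pattern rules before aggregating. This is a sketch under stated assumptions: `evtype_rules` and `clean_evtype()` are hypothetical, the three rules are examples rather than a complete mapping, and the canonical names are my reading of the official event list.

```r
# Hypothetical recoding rules: regular expression -> canonical event name.
evtype_rules <- c(
  "^(TSTM|THUNDERSTORM) WIND.*" = "THUNDERSTORM WIND",
  "^HURRICANE.*|^TYPHOON.*"     = "HURRICANE (TYPHOON)",
  "^RIP CURRENTS?$"             = "RIP CURRENT"
)

# Hypothetical helper: trim, upper-case, then apply each rule in turn.
clean_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  for (pattern in names(evtype_rules)) {
    x <- sub(pattern, evtype_rules[[pattern]], x)
  }
  x
}

# Made-up raw labels in the style of the evtype column.
raw <- c("TSTM WIND", " tstm wind (g45)", "Hurricane/Typhoon", "RIP CURRENTS", "FLOOD")
cleaned <- clean_evtype(raw)
print(cleaned)
```

Labels that match no rule, such as FLOOD here, pass through unchanged, so the rule list can be grown gradually while checking the effect on the totals.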