This is a second course project for Reproducible Research course which is part of the Coursera’s Data Science Specialization.
Severe weather has serious economic and health impacts, causing property damage, crop damage, injury and even death. The purpose of the project was to determine which storm event(s) had the most significant economic and health effects.
This project involves exploring and analysing the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The analysis of the data revealed that floods cause the greatest economic impact as measured by property damage and crop damage. The analysis also indicated that tornadoes have the greatest health impact as measured by the number of injuries and fatalities.
This data analysis will address the following questions:
Across the United States, which types of events have the greatest economic consequences?
Across the United States, which types of events are most harmful with respect to population health?
The main objective of this project is to determine which storm event(s) have the most significant economic and health effects.
To identify the weather events that have the greatest economic consequences.
To identify the weather events that are most harmful with respect to population health.
This project used these libraries for data processing and analysis:
library(ggplot2)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(knitr)
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: storm data[47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Uncomment the code below to download and unzip the data. The data was downloaded from the link above and saved to current working directory. If object strom_df is already loaded, use that cached object instead of loading it each time where the Rmd file is knitted.
#URL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#destfile <- "repdata_data_StormData.csv.bz2"
#download.file(URL, destfile)
if(!exists("storm_df")){
storm_df <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
}
In storm_df there is 37 columns (variables) and 902,297 rows (observations).
dim(storm_df)
## [1] 902297 37
Examine the structure of the data
str(storm_df)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
From a list of variables in storm data set, the following are columns (variables) of interest:
Health variables:
Economic variables:
Events - target variable:
Extract variables of interest from original data set:
storm_data <- subset(storm_df, select = c("EVTYPE","FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
In every analysis we must check number of missing values in variables.
count_NAs <- function(df) {
for (col_Name in colnames(df)) {
NA_count <- 0
NA_count < as.numeric(sum(is.na(df[,col_Name])))
if(NA_count > 0) {
message(col_Name, ":", NA_count, "missing values")
} else {
message(col_Name, ":", "No NAs")
}
}
}
count_NAs(storm_data)
## EVTYPE:No NAs
## FATALITIES:No NAs
## INJURIES:No NAs
## PROPDMG:No NAs
## PROPDMGEXP:No NAs
## CROPDMG:No NAs
## CROPDMGEXP:No NAs
There is no missing values (no NAS), so moving on to examine data integrity.
There are 985 unique event types in EVTYPE column. Therefore, it is better to limit database to a reasonable number. We can make it by capitalizing all letters in EVTYPE column as well as subsetting only non-zero data regarding our target numbers.
## cleaning event type names
storm_data$EVTYPE <- toupper(storm_data$EVTYPE)
## eliminating zero data
storm_data <- storm_data[storm_data$FATALITIES !=0 |
storm_data$INJURIES !=0 |
storm_data$PROPDMG !=0 |
storm_data$CROPDMG !=0, ]
length(unique(storm_data$EVTYPE))
## [1] 447
Now we have 447 unique event types and it seems like something to work with.
To determine economic consequences of storms, we need to calculate the total amount of damage in US Dollars. The data provided one column indicating the amount and another indicating the unit: “K” for thousands,“M” for millions, or “B” for billions. There is some mess in units, so we transform those variables in one unit (dollar) variable by the following rule:
We need to combine these two columns into a single column representing the total cost dollars for each observation.
storm_data$PROPDMGEXP <- as.character(storm_data$PROPDMGEXP)
storm_data$PROPDMGEXP[is.na(storm_data$PROPDMGEXP)] <- 0 # NA's considered as dollars
storm_data$PROPDMGEXP[!grepl("K|M|B", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
storm_data$PROPDMGEXP[grep("K", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "3"
storm_data$PROPDMGEXP[grep("M", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "6"
storm_data$PROPDMGEXP[grep("B", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "9"
storm_data$PROPDMGEXP <- as.numeric(as.character(storm_data$PROPDMGEXP))
storm_data$PROPDMG_DOL <- storm_data$PROPDMG * 10^storm_data$PROPDMGEXP
storm_data$CROPDMGEXP <- as.character(storm_data$CROPDMGEXP)
storm_data$CROPDMGEXP[is.na(storm_data$CROPDMGEXP)] <- 0 # NA's considered as dollars
storm_data$CROPDMGEXP[!grepl("K|M|B", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
storm_data$CROPDMGEXP[grep("K", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "3"
storm_data$CROPDMGEXP[grep("M", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "6"
storm_data$CROPDMGEXP[grep("B", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "9"
storm_data$CROPDMGEXP <- as.numeric(as.character(storm_data$CROPDMGEXP))
storm_data$CROPDMG_DOL <- storm_data$CROPDMG * 10^storm_data$CROPDMGEXP
head(storm_data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 3 0 0
## 2 TORNADO 0 0 2.5 3 0 0
## 3 TORNADO 0 2 25.0 3 0 0
## 4 TORNADO 0 2 2.5 3 0 0
## 5 TORNADO 0 2 2.5 3 0 0
## 6 TORNADO 0 6 2.5 3 0 0
## PROPDMG_DOL CROPDMG_DOL
## 1 25000 0
## 2 2500 0
## 3 25000 0
## 4 2500 0
## 5 2500 0
## 6 2500 0
Now we aggregate property and crop damage numbers in order to identify TOP-10 events contributing the total economic loss:
Economic_data <- aggregate(cbind(PROPDMG_DOL, CROPDMG_DOL) ~ EVTYPE, data = storm_data, FUN=sum)
Economic_data$ECONOMIC_LOSS <- Economic_data$PROPDMG_DOL + Economic_data$CROPDMG_DOL
Economic_data <- Economic_data[order(Economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- Economic_data[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")
| EVTYPE | PROPDMG_DOL | CROPDMG_DOL | ECONOMIC_LOSS | |
|---|---|---|---|---|
| 78 | FLOOD | 144657709807 | 5661968450 | 150319678257 |
| 203 | HURRICANE/TYPHOON | 69305840000 | 2607872800 | 71913712800 |
| 371 | TORNADO | 56937160779 | 414953270 | 57352114049 |
| 317 | STORM SURGE | 43323536000 | 5000 | 43323541000 |
| 116 | HAIL | 15732267048 | 3025954473 | 18758221521 |
| 65 | FLASH FLOOD | 16140812067 | 1421317100 | 17562129167 |
| 43 | DROUGHT | 1046106000 | 13972566000 | 15018672000 |
| 194 | HURRICANE | 11868319010 | 2741910000 | 14610229010 |
| 280 | RIVER FLOOD | 5118945500 | 5029459000 | 10148404500 |
| 216 | ICE STORM | 3944927860 | 5022113500 | 8967041360 |
We aggregate fatalities and injuries numbers in order to identify TOP-10 events contributing the total people loss:
Health_data <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = storm_data, FUN=sum)
Health_data$TOTAL <- Health_data$FATALITIES + Health_data$INJURIES
Health_data <- Health_data[order(Health_data$TOTAL, decreasing = TRUE), ]
Top10_events_people <- Health_data[1:10,]
knitr::kable(Top10_events_people, format = "markdown")
| EVTYPE | FATALITIES | INJURIES | TOTAL | |
|---|---|---|---|---|
| 371 | TORNADO | 5633 | 91346 | 96979 |
| 54 | EXCESSIVE HEAT | 1903 | 6525 | 8428 |
| 386 | TSTM WIND | 504 | 6957 | 7461 |
| 78 | FLOOD | 470 | 6789 | 7259 |
| 233 | LIGHTNING | 816 | 5230 | 6046 |
| 133 | HEAT | 937 | 2100 | 3037 |
| 65 | FLASH FLOOD | 978 | 1777 | 2755 |
| 216 | ICE STORM | 89 | 1975 | 2064 |
| 329 | THUNDERSTORM WIND | 133 | 1488 | 1621 |
| 441 | WINTER STORM | 206 | 1321 | 1527 |
# aggregate PROPDMG_DOL, CROPDMG_DOL and ECONOMIC_LOSS by Event Type
Top10_events_economy2<- melt(Top10_events_economy, id=c("EVTYPE", "ECONOMIC_LOSS"), measure.vars=c("PROPDMG_DOL","CROPDMG_DOL"))
## transform EVTYPE to factor variable for economic variables
Top10_events_economy2$EVTYPE <- as.factor(Top10_events_economy2$EVTYPE)
## plotting economic loss
ggplot(data = Top10_events_economy2, aes(reorder(EVTYPE, ECONOMIC_LOSS), value, fill=variable)) +
geom_bar(stat = "identity")+
ggtitle("Crop and Property Damage by Weather Events")+
theme(plot.title = element_text(hjust = 0.5))+
labs(y = "Property and Crop Loss", x = "Event Type") +
coord_flip()
The result of economic impact from the graph revealed that FLOOD, HURRICANE/TYPHOON and TORNADO are the main contributors to severe economic consequences out of all event types of weather events. Thus, the most devastating weather event with the greatest economic loss (property and crop damage) is a flood.
# aggregate FATALITIES, INJURIES and TOTAL by EVTYPE
Top10_events_people2<- melt(Top10_events_people, id=c("EVTYPE", "TOTAL"), measure.vars=c("FATALITIES","INJURIES"))
## transform EVTYPE to factor variable for health variables
Top10_events_people2$EVTYPE <- as.factor(Top10_events_people2$EVTYPE)
## plot FATALITIES and INJURIES by Event Type
ggplot(data = Top10_events_people2, aes(x = reorder(EVTYPE, TOTAL), y = value, fill = variable)) +
geom_bar(stat = "identity") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Injuries and Fatalities by Weather Event") +
labs(y = "Number of fatalities and injuries", x = "Event Type") +
coord_flip()
The result of population health impact from the graph show that TORNADOS, EXCESSIVE HEAT and TSTM WIND are the main contributors to deaths and injuries out of all event types of weather events. Thus, the most harmful weather event for health (in number of total fatalites and injuries) is, by far, a tornado.
The analysis of the data revealed that floods cause the greatest economic impact as measured by property damage and crop damage. The analysis also indicated that tornadoes have the greatest health impact as measured by the number of injuries and fatalities.
floods caused the most economic damage, with over $150 billion dollars in economic loss (property and crop damage). Tornadoes are the most dangerous having caused 5,633 fatalities and 91,346 injuries from 1950 to 2011.