The following exersice has been requested within the course “Reproducible Research” (Coursera). The aim of it is to answer two questions:
To do so, it has been analysed data from NOAA Storm Database. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Data used in this exercise can be downloaded here.
Documentation about the data can be downloaded here.
Once datafile is in the working directory the data can be readed:
data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
During the exercise it has been remarked that there are several “misstakes” over the variable EVTYPE. The following steps have the purpose of cleaning duplicates, remove double spaces, last spaces, etc. After each of the processes, a message with the number of different Event types will be shown in order to see how data cleaning is useful to lighten data and get precise results.
data0 <- data
message(length(unique(data$EVTYPE))," different event types")
## 985 different event types
data0$EVTYPE <- toupper(data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 898 different event types
data0$EVTYPE <- gsub("/"," ", data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 882 different event types
data0$EVTYPE <- gsub("-"," ", data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 876 different event types
data0$EVTYPE <- trimws(data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 865 different event types
grepl create a logical vector to select values starting by SUMM. ! inverts the logical vector.
data0 <- data0[!grepl("^SUMM", data0$EVTYPE),]
message(length(unique(data0$EVTYPE))," different event types")
## 799 different event types
library(qdapRegex)
## Warning: package 'qdapRegex' was built under R version 3.3.3
data0$EVTYPE <- rm_white(data0$EVTYPE)
message(length(unique(data0$EVTYPE))," different event types")
## 784 different event types
library(stringr)
## Warning: package 'stringr' was built under R version 3.3.3
source("erratas.R")
for (i in 1:length(erratas)){
data0$EVTYPE <- gsub(names(erratas[i]),erratas[1], data0$EVTYPE)
}
message(length(unique(data0$EVTYPE))," different event types")
## 768 different event types
Prepare libraries for the whole exercise.
library(reshape)
## Warning: package 'reshape' was built under R version 3.3.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.3.3
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:reshape':
##
## colsplit, melt, recast
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:reshape':
##
## rename
## The following objects are masked from 'package:qdapRegex':
##
## escape, explain
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:qdapRegex':
##
## %+%
Preparing the data (creation of new variable result of the sum of FAT + INJ, selection of collums of interest and summary by event type):
data1 <- data0
data1 <-
data0 %>%
mutate(TFI = FATALITIES+INJURIES) %>%
select(EVTYPE, FATALITIES, INJURIES, TFI) %>%
group_by(EVTYPE) %>%
summarize(
FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES),
TFI = sum(TFI)
)
Selection of the top 10 most harmful on human health (fatalities+injuries)
data1sel <- head(arrange(data1, desc(TFI)),10)
data1sel <- as.data.frame(data1sel)
data1sel
## EVTYPE FATALITIES INJURIES TFI
## 1 TORNADO 5645 91376 97021
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
The function melt will allow us to create stacked barplots in the results.
data1melt <- melt(data1sel, id.vars = "EVTYPE")
To get the exact amount of money expended on each EVTYPE it is necesary to multiply the cost (PROPDMG,CROPDMG) by their respective exponent (PROPDMGEXP,CROPDMGEXP) defined as “k”, “m”, “b”. To do so, it has been created a function to facilitate the consequent mutation of data (THIS FUNCTION COMES FROM THE RPUB MADE BY Sergio L. B. Dalge (nice idea to do fast mutations).
exp_mutate <- function (x){
ifelse(x == "K" | x == "k", 1000,
ifelse(x == "M" | x == "m", 1000000,
ifelse(x == "B" | x =="b",1000000000, 1)))
}
Using the function before we can create a chain of actions over the data, using dplyr to do so. The aim here is to mutate the values to get the “real” money expended, select just the data of interest (TOTAL PROPERTIES DAMAGE, TOTAL CROP DAMAGE and TOTAL DAMAGE) and finally sumarize the values by EVTYPE.
data2 <- data0
data2 <-
data2 %>%
mutate(PROPDMGEXP=exp_mutate(PROPDMGEXP)) %>%
mutate(CROPDMGEXP=exp_mutate(CROPDMGEXP)) %>%
mutate(TPROPDMG = PROPDMG * PROPDMGEXP) %>%
mutate(TCROPDMG = CROPDMG * CROPDMGEXP) %>%
mutate(TOTALDMG = TPROPDMG + TCROPDMG) %>%
select(EVTYPE, TPROPDMG, TCROPDMG, TOTALDMG) %>%
group_by(EVTYPE) %>%
summarize(TPROPDMG = sum(TPROPDMG), TCROPDMG = sum(TCROPDMG), TOTALDMG=sum(TOTALDMG))
Arrange and select the top 10 costly (TOTALDMG) events.
data2sel <- head(arrange(data2, desc(TOTALDMG)),10)
data2sel<-as.data.frame(data2sel)
data2sel
## EVTYPE TPROPDMG TCROPDMG TOTALDMG
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 57162361285 452344770 57614706055
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732267048 3025954473 18758221521
## 6 FLASH FLOOD 16141362067 1421317100 17562679167
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927860 5022113500 8967041360
The function melt will allow us to create stacked barplots in the results.
data2melt <- melt(data2sel, id.vars = "EVTYPE")
Next plot shows the top 10 most harmful events for human health sorted by Event Type:
ggplot(data = filter(data1melt, variable != "TFI"),
aes(
x = reorder(EVTYPE,-value),
y = value,
fill = variable
)) +
geom_bar(position = "stack", stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "Event type (top 10 of total efect on human health)", y = "Human damage", title =
"Human health damage by Event Type") +
scale_fill_manual(
values = c("#CC0000", "#99CCFF"),
name = "Type of damage",
breaks = c("FATALITIES", "INJURIES"),
labels = c("FATALITIES", "INJURIES")
) +
guides(fill = guide_legend(title = NULL))
Next plot shows the top 10 most harmful events sorted by TOTAL DAMAGE:
ggplot(data=filter(data2melt,variable!="TOTALDMG"),
aes(x=reorder(EVTYPE,-value), y=value, fill=variable)) +
geom_bar(position="stack",stat="identity")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(x= "Event type (top 10 of total cost)",
y="Cost damage",
title="Economic effects on Properties and Crops by Event Type") +
scale_fill_manual(values=c("#999999", "#E69F00"),
name="Type of expenses",
breaks=c("TPROPDMG", "TCROPDMG"),
labels=c("Total properties dmg", "Total crops dmg"))+
guides(fill=guide_legend(title=NULL))