This document addresses the following questions:
1- Across the United States, which types of events are most harmful with respect to population health?
2- Across the United States, which types of events have the greatest economic consequences?
setwd("C:/Users/Alvaro/Google Drive/R programming/5. Reproducible Research/assignment_2")
stormData <- read.csv("repdata-data-StormData.csv")
stormData <- subset(stormData, select = -c(1:7, 9:22, 29:37)) # remove unnecessary columns
events <- levels(factor(stormData$EVTYPE))
event_types <- length(events)
four <- levels(factor(substr(x=toupper(levels(factor(stormData$EVTYPE))), start=1,stop=4)))
four_length <- length(four)
stormData$short<-substr(x=toupper(stormData$EVTYPE), start=1,stop=4)
The database contains 985 distinct event types, so the first task is to aggregate them into a smaller set of categories. We convert the event types to uppercase and use their first four letters to group, for instance, TORNADO, TORNADO F0, TORNADO F1, … under the factor TORN. This leaves a total of 162 factors.
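As a minimal sketch of this grouping step, the snippet below applies the same `toupper` and `substr` calls to a handful of made-up event names (the vector `ev` is illustrative and not taken from the dataset).

```r
# Toy event names (illustrative only, not read from the storm database)
ev <- c("TORNADO", "Tornado F0", "TORNADO F1", "Torrential Rain")

# Uppercase first, then keep the first four characters
prefix4 <- substr(toupper(ev), start = 1, stop = 4)

# Number of events falling under each four-letter prefix
counts <- table(prefix4)
print(counts)
```

The three tornado variants collapse into a single prefix, while the torrential rain entry keeps its own.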
Further inspection shows that more events can be grouped by keeping only three letters, which merges “ICE”, “ICE ”, “ICE/” and “ICES”. Some manual work is still required to fold “ICY” into that group, and to avoid aggregating TORNADO and TORRENTIAL. Some typos are also resolved this way, since “THUDERSTORM”, “THUNDEERSTORM” and “THUNDERSTORM” all fall under the factor “THU”.
So some manual recoding is needed before trimming the names to their first three letters.
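One way to find the prefixes that need manual attention is to list every three-letter code that would absorb more than one four-letter code. The sketch below does this on a small hand-written vector (`four_toy` is a hypothetical stand-in for the real list of four-letter prefixes).

```r
# Hypothetical sample of four-letter prefixes
four_toy <- c("TORN", "TORR", "HEAT", "HEAV", "ICE ", "ICY ", "ICES")

# Three-letter code each prefix would be trimmed to
three_toy <- substr(four_toy, start = 1, stop = 3)

# Group the four-letter prefixes by their three-letter code and
# keep only the codes shared by more than one prefix
groups <- split(four_toy, three_toy)
collisions <- groups[lengths(groups) > 1]
print(collisions)
```

Each entry of `collisions` is a candidate for review: some merges are wanted (the ICE variants), others are not (HEAT with HEAVY).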
# 'Extended cold' is deliberately allowed to merge with 'extreme cold'
stormData$short[stormData$short == 'DRIF'] <- "DRF" # Prevent merging 'DRIFTING' with 'DRIEST'
stormData$short[stormData$short == 'DROW'] <- "DRW" # Prevent merging 'DROUGHT' with 'DROWNING'
stormData$short[stormData$short == 'GRAS'] <- "GRS" # Prevent merging 'GRASS' with 'GRADIENT'
stormData$short[stormData$short == 'HVY '] <- "HEAV" # Merge values
stormData$short[stormData$short == 'HEAT'] <- "HET" # Prevent merging 'HEAT' and 'HEAVY'
stormData$short[stormData$short == 'ICY '] <- "ICE " # Merge values
stormData$short[stormData$short == 'LIGN'] <- "LIGH" # Correct typo
stormData$short[stormData$short == 'NO S'] <- "NON" # Merge 'No severe weather' with similar event types beginning with 'NON'
stormData$short[stormData$short == 'NORM'] <- "NRM" # Prevent merging 'NORMAL' with 'NORTHERN'
stormData$short[stormData$short == 'STRE'] <- "STE" # Prevent merging 'STREET' with 'STRONG'
stormData$short[stormData$short == 'WIND'] <- "WND" # Merge "WND" and "WIND"
And then event types are trimmed down to three letters, thus reducing the number of factors.
stormData$short<-substr(x=toupper(stormData$short), start=1,stop=3)
three <- levels(factor(stormData$short))
three_length <- length(three)
The number of factors is now 141.
Let us compute the total number of fatalities and injuries, and then the number of each per event type.
fatalities <- sum(stormData$FATALITIES)
injuries <-sum(stormData$INJURIES)
# Distribution of fatalities and injuries by event type
fatalities_evtype <- by(stormData$FATALITIES,stormData$short,sum)
injuries_evtype <- by(stormData$INJURIES,stormData$short,sum)
# Max fatalities and injuries
sort_fat <- sort(fatalities_evtype, decreasing=TRUE)
sort_inj <- sort(injuries_evtype, decreasing=TRUE)
perc_tor_fat <- format(round(sort_fat[1]/fatalities,2)*100, scientific=FALSE)
perc_tor_inj <- format(round(sort_inj[1]/injuries,2)*100, scientific =FALSE)
# accumulate the five worst event types
five_worst_fat <- sum(round(sort_fat[1:5]/fatalities,2)*100)
five_worst_inj <- sum(round(sort_inj[1:5]/injuries,2)*100)
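To illustrate the aggregation and percentage steps used for the health figures, here is a small sketch on a made-up data frame. The data frame `toy` and its values are hypothetical, and `tapply` is used here as a simple stand-in for the `by` call in the analysis.

```r
# Hypothetical records: event code and fatalities per record
toy <- data.frame(
  short = c("TOR", "TOR", "HET", "FLA", "HET"),
  FATALITIES = c(10, 5, 8, 2, 0)
)

# Sum fatalities within each event code
by_type <- tapply(toy$FATALITIES, toy$short, sum)

# Rank event codes from most to least harmful
sorted <- sort(by_type, decreasing = TRUE)

# Share of the total, as a whole-number percentage
share <- round(sorted / sum(toy$FATALITIES), 2) * 100
print(share)
```

Summing the leading entries of `share` gives the cumulative weight of the worst event types, in the same way as `five_worst_fat` does for the real data.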
# Count rows with nonzero damage whose unit code is not one of 'B', 'K', 'M', 'm'.
# Note: lowercase 'k' and 'b' are counted as unrecognised here, although the
# damage conversion further down accepts them.
missing_prop <- length(which(stormData$PROPDMG != 0 &
                             !(stormData$PROPDMGEXP %in% c('B', 'K', 'M', 'm'))))
missing_crop <- length(which(stormData$CROPDMG != 0 &
                             !(stormData$CROPDMGEXP %in% c('B', 'K', 'M', 'm'))))
total_rows <- dim(stormData)[1]
According to page 12 of the Data Preparation document provided, damage figures are given in the columns PROPDMG (property) and CROPDMG (crops), and their magnitude is indicated by a ‘K’ for thousands, an ‘M’ for millions and a ‘B’ for billions. However, some rows include other characters (like ‘-’, ‘?’, ‘+’ or a number). These are not treated as multipliers: for those rows the code below uses the raw figure unchanged (a multiplier of 1).
This decision has little impact, because only 327 nonzero values in the PROPDMG column and 36 nonzero values in the CROPDMG column carry an unrecognised unit, in a dataset with 902297 rows.
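The unit conversion can also be sketched with a lookup table instead of nested `ifelse` calls. The helper `exp_multiplier` below is hypothetical and not part of the analysis. It assumes, as the analysis code does, that upper and lower case codes are equivalent and that any unrecognised code maps to a multiplier of 1.

```r
# Hypothetical helper: map a unit code to its numeric multiplier
exp_multiplier <- function(exp) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(exp))]
  # Unrecognised or empty codes get no match (NA); fall back to 1
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up damage figures and unit codes
dmg <- c(25, 2.5, 1, 10, 3)
unit <- c("K", "m", "B", "?", "")
damage <- dmg * exp_multiplier(unit)
print(damage)
```

A lookup vector keeps the code-to-multiplier mapping in one place, which makes it easier to extend or audit than a chain of conditions.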
# Calculate property damages
stormData$PROPDAMAGE <- stormData$PROPDMG *
ifelse(stormData$PROPDMGEXP=='K' | stormData$PROPDMGEXP=='k',1000,
ifelse(stormData$PROPDMGEXP=='M' | stormData$PROPDMGEXP=='m',1000000,
ifelse(stormData$PROPDMGEXP=='B' | stormData$PROPDMGEXP=='b',1000000000,1)))
# Calculate crop damages
stormData$CROPDAMAGE <- stormData$CROPDMG *
ifelse(stormData$CROPDMGEXP=='K' | stormData$CROPDMGEXP=='k',1000,
ifelse(stormData$CROPDMGEXP=='M' | stormData$CROPDMGEXP=='m',1000000,
ifelse(stormData$CROPDMGEXP=='B' | stormData$CROPDMGEXP=='b',1000000000,1)))
# Aggregate per type of event
# Total losses
prop_losses <- sum(stormData$PROPDAMAGE)
crop_losses <-sum(stormData$CROPDAMAGE)
# Distribution of losses per event type
prop_losses_evtype <- by(stormData$PROPDAMAGE,stormData$short,sum)
crop_losses_evtype <- by(stormData$CROPDAMAGE,stormData$short,sum)
# Max losses
sort_prop_losses <- sort(prop_losses_evtype, decreasing=TRUE)
sort_crop_losses <- sort(crop_losses_evtype, decreasing=TRUE)
perc_1_prop_losses <- format(round(sort_prop_losses[1]/prop_losses,2)*100, scientific=FALSE)
perc_1_crop_losses <- format(round(sort_crop_losses[1]/crop_losses,2)*100, scientific =FALSE)
# accumulate the five worst event types
five_worst_prop_losses <- sum(round(sort_prop_losses[1:5]/prop_losses,2)*100)
five_worst_crop_losses <- sum(round(sort_crop_losses[1:5]/crop_losses,2)*100)
Tornados are the most harmful event type, with a total of 5658 fatalities (37% of total fatalities) and 91368 injuries (65% of total injuries). The five most harmful event types account for 69% of fatalities and 84% of injuries.
TOR: tornados, EXC: excessive weather (cold, heat, rain, snow, …), HET: heat (heat waves, heat drought), FLA: flash flood, LIG: lightning, RIP: rip current, TST: thunderstorm wind, FLO: flood, HIG: high (tides, surf, wind), EXT: extreme (cold, heat, wind).
Floods cause the greatest property damage, with total losses of 144957.5 M$ (34% of total property losses), whereas droughts cause the greatest crop damage, with total losses of 13972.57 M$ (28% of total crop losses). The five most harmful event types account for 83% of total property losses and 71% of total crop losses.
FLO: floods, HUR: hurricanes, TOR: tornados, STO: storms, FLA: flash flood, HAI: hail, WIL: wild fires, DRO: droughts, RIV: river flooding, ICE: ice conditions.