This report has the intetion to analysis the data for diferent natural disasters(available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) and try to answer 2 fundamental questions - Across the United States, which types of events (as indicated in the EVTYPE EVTYPEstart color red, start verbatim, EVTYPE, end verbatim, end color red variable) are most harmful with respect to population health? - Across the United States, which types of events have the greatest economic consequences? For that some preprocessing and group summarising is need as described along the document ## Data Proccessing
To load the data, the read.csv function is used. This function has the capabilites of reading files compressed with the bz2 algorithm. Note that for R to properly load the file, the folder in which the file is located should be the same one as the one for this markdown file
fileName <- "repdata_data_StormData.csv.bz2"
if (file.exists(fileName)){
stormsDF <- read.csv("repdata_data_StormData.csv.bz2")
}
Let’s look now at the first 5 rows of the dataset
head(stormsDF, 5)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
We can see clearly that the file has 37 columns. To determine the economic and health consequences of those storms we will focus on the EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. So let’s subset the dataset by those columns
stormsDF <- stormsDF %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
It is clear that for damage related columns with need to come with a combination of the number and the quantifier. For example, if we have 2.5 and M int the PROPDMG & PROPDMGEXP respectively, we are talking about 2,500,000. To write this function we need to first determine the quantifiers. That is accomplished with the following code.
print(unique(stormsDF$PROPDMGEXP))
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
print(unique(stormsDF$CROPDMGEXP))
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
That seems odd, why do we have “” or “?” as quantifiers. Let’s have a count of their frequency to see if that is related to dirty data in the database
print(table(stormsDF$PROPDMGEXP))
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
print(table(stormsDF$CROPDMGEXP))
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Oh! interesting, it is clear from the frequencies that stuff like numbers or dashes are clearly a mistake. Let’s discard those results and keep just the cleaned data. First we well map all characters to upercase and then filter by the ones not included in the inclusion list. This way we will only keep valid suffixes and empty ones
inclusionChars <- c('B','K','M','')
stormsDF$PROPDMGEXP <- toupper(stormsDF$PROPDMGEXP)
stormsDF$CROPDMGEXP <- toupper(stormsDF$CROPDMGEXP)
stormsDF <- stormsDF %>% filter(PROPDMGEXP %in%inclusionChars ) %>% filter(CROPDMGEXP %in%inclusionChars )
Now if we see again the table of frequencies of unique values we will see just values in the inclusionChars vector.
print(table(stormsDF$PROPDMGEXP))
##
## B K M
## 465928 40 424645 11336
print(table(stormsDF$CROPDMGEXP))
##
## B K M
## 618100 9 281847 1993
Now let’s have a valid modifier so we can easily multiply
inputinFunction <- function(val){
if(val == 'B')
{
return(1000000000)
}
if(val == 'M')
{
return(1000000)
}
if(val == 'K')
{
return(1000)
}
else{
return(1)
}
}
stormsDF <- stormsDF %>% rowwise() %>% mutate(propSufix =inputinFunction(PROPDMGEXP) ) %>% ungroup()
stormsDF <- stormsDF %>% rowwise() %>% mutate(cropSufix =inputinFunction(CROPDMGEXP) ) %>% ungroup()
Now let’s simply add an new column with the correct values for damages based on the prefixes
stormsDF <- stormsDF %>% mutate(propDamageReal = PROPDMG * propSufix, cropDamageReal = CROPDMG * cropSufix)
Finally for preprocessing let’s summarise the data by event type EVTYPE column so we can make some conclussions easily
sotrmsMetrics <- (stormsDF %>% group_by(EVTYPE) %>% summarise(deaths = sum(FATALITIES), inju = sum(INJURIES), propertyDmg = sum(propDamageReal), cropsDamage = sum(cropDamageReal)))
Let’s look at a graph of the top 10 eventypes for injury and fatalities. Our main goal here will be to try to see if there is some discrepancy between deaths and injuries. Mainly if there are some events that cause more deaths, even if they have less injuries, or the other way around
top10Fatalities <- head(sotrmsMetrics %>% arrange(desc(deaths)),10)
top10FInjuries <- head(sotrmsMetrics %>% arrange(desc(inju)),10)
par(mfrow = c(2, 1))
barplot(top10Fatalities$deaths, names.arg = top10Fatalities$EVTYPE, col = 'red', main = 'Top 10 fatalities by event type', cex.names = 0.6, las = 2)
barplot(top10FInjuries$inju, names.arg = top10FInjuries$EVTYPE, col = 'blue', main = 'Top 10 Injuries by event type', cex.names = 0.6, las = 2)
Interesting, it is clear from the graphs that Tornado is the most
dangerous events for the population by far and for both, injuries and
casualties as it ranks first in both metrics. It is also worth noting
that the even that the events diverge for second and third places, the
top 5 most dangerous events seem to be shared across metris: injuries
and casualties.
Let’s make the same analysis we did for injuries and casualties, both for crops and properties damages. Remember that we have preprocessed this specific inputs to acount for the multiplier K = x1000, M = x1000000, B = x1000000000
top10PropertyDMG <- head(sotrmsMetrics %>% arrange(desc(propertyDmg)),10)
top10CropDMG <- head(sotrmsMetrics %>% arrange(desc(cropsDamage)),10)
par(mfrow = c(2, 1))
barplot(top10PropertyDMG$propertyDmg, names.arg = top10PropertyDMG$EVTYPE, col = 'red', main = 'Top 10 Eventyypes with most property damage [US dollars$]', cex.names = 0.6, las = 2)
barplot(top10CropDMG$cropsDamage, names.arg = top10CropDMG$EVTYPE, col = 'blue', main = 'Top 10 Eventyypes with most crop damage [US dollars$]', cex.names = 0.6, las = 2)
That is pretty interesting, we have that FLOOD is the natural disaster with the most economic damages to the properties, while Droughts are the one most harmful to crops. Let’s see the raw number for damage in us dolars caused by this natural disasters
print(head(top10PropertyDMG %>% select(propertyDmg),1))
## # A tibble: 1 × 1
## propertyDmg
## <dbl>
## 1 144657709807
print(head(top10CropDMG %>% select(cropsDamage),1))
## # A tibble: 1 × 1
## cropsDamage
## <dbl>
## 1 13972566000
print(paste('The ratio between property damage and crops damage is',top10PropertyDMG$propertyDmg[1]/top10CropDMG$cropsDamage[1] ,sep = ' '))
## [1] "The ratio between property damage and crops damage is 10.3529809633392"
This confirm that the property damage by raw values are ten times higher than the crops damage. Yet this assumption is not that fair since we will need to consider the impact that crops and property damage have in a local economy. Unfortunately, that is out of scope the analysis of this report, and we will need to have way more data.
So in conclusion we can say that we must watch out for Tornados and try not to buy a property in a prone to flood zone!