StormsReport

Synopsis

This report has the intetion to analysis the data for diferent natural disasters(available at https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) and try to answer 2 fundamental questions - Across the United States, which types of events (as indicated in the EVTYPE EVTYPEstart color red, start verbatim, EVTYPE, end verbatim, end color red variable) are most harmful with respect to population health? - Across the United States, which types of events have the greatest economic consequences? For that some preprocessing and group summarising is need as described along the document ## Data Proccessing

To load the data, the read.csv function is used. This function has the capabilites of reading files compressed with the bz2 algorithm. Note that for R to properly load the file, the folder in which the file is located should be the same one as the one for this markdown file

fileName <- "repdata_data_StormData.csv.bz2"
if (file.exists(fileName)){
   stormsDF <- read.csv("repdata_data_StormData.csv.bz2")
}

Let’s look now at the first 5 rows of the dataset

head(stormsDF, 5)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5

We can see clearly that the file has 37 columns. To determine the economic and health consequences of those storms we will focus on the EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. So let’s subset the dataset by those columns

stormsDF <- stormsDF %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

It is clear that for damage related columns with need to come with a combination of the number and the quantifier. For example, if we have 2.5 and M int the PROPDMG & PROPDMGEXP respectively, we are talking about 2,500,000. To write this function we need to first determine the quantifiers. That is accomplished with the following code.

print(unique(stormsDF$PROPDMGEXP))

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

print(unique(stormsDF$CROPDMGEXP))

## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

That seems odd, why do we have “” or “?” as quantifiers. Let’s have a count of their frequency to see if that is related to dirty data in the database

print(table(stormsDF$PROPDMGEXP))

## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330

print(table(stormsDF$CROPDMGEXP))

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Oh! interesting, it is clear from the frequencies that stuff like numbers or dashes are clearly a mistake. Let’s discard those results and keep just the cleaned data. First we well map all characters to upercase and then filter by the ones not included in the inclusion list. This way we will only keep valid suffixes and empty ones

inclusionChars <- c('B','K','M','')
stormsDF$PROPDMGEXP <- toupper(stormsDF$PROPDMGEXP)
stormsDF$CROPDMGEXP <- toupper(stormsDF$CROPDMGEXP)
stormsDF <- stormsDF %>% filter(PROPDMGEXP %in%inclusionChars ) %>% filter(CROPDMGEXP %in%inclusionChars )

Now if we see again the table of frequencies of unique values we will see just values in the inclusionChars vector.

print(table(stormsDF$PROPDMGEXP))

## 
##             B      K      M 
## 465928     40 424645  11336

print(table(stormsDF$CROPDMGEXP))

## 
##             B      K      M 
## 618100      9 281847   1993

Now let’s have a valid modifier so we can easily multiply

inputinFunction <- function(val){ 
  if(val == 'B')
  {
    return(1000000000)
  }
  if(val == 'M')
  {
    return(1000000)
  }
  if(val == 'K')
  {
    return(1000)
  }
  else{
    return(1)
  }
}
stormsDF <- stormsDF %>% rowwise() %>% mutate(propSufix =inputinFunction(PROPDMGEXP) ) %>% ungroup()
stormsDF <- stormsDF %>% rowwise() %>% mutate(cropSufix =inputinFunction(CROPDMGEXP) ) %>% ungroup()

Now let’s simply add an new column with the correct values for damages based on the prefixes

stormsDF <- stormsDF %>% mutate(propDamageReal = PROPDMG * propSufix, cropDamageReal = CROPDMG * cropSufix)

Finally for preprocessing let’s summarise the data by event type EVTYPE column so we can make some conclussions easily

sotrmsMetrics <- (stormsDF %>% group_by(EVTYPE) %>% summarise(deaths = sum(FATALITIES), inju = sum(INJURIES), propertyDmg = sum(propDamageReal), cropsDamage = sum(cropDamageReal)))

Results

Damage to Population

Let’s look at a graph of the top 10 eventypes for injury and fatalities. Our main goal here will be to try to see if there is some discrepancy between deaths and injuries. Mainly if there are some events that cause more deaths, even if they have less injuries, or the other way around

top10Fatalities <- head(sotrmsMetrics %>% arrange(desc(deaths)),10)
top10FInjuries <- head(sotrmsMetrics %>% arrange(desc(inju)),10)
par(mfrow = c(2, 1))
barplot(top10Fatalities$deaths, names.arg = top10Fatalities$EVTYPE, col = 'red', main = 'Top 10 fatalities by event type', cex.names = 0.6, las = 2)
barplot(top10FInjuries$inju, names.arg = top10FInjuries$EVTYPE, col = 'blue', main = 'Top 10 Injuries by event type', cex.names = 0.6, las = 2)

Interesting, it is clear from the graphs that Tornado is the most dangerous events for the population by far and for both, injuries and casualties as it ranks first in both metrics. It is also worth noting that the even that the events diverge for second and third places, the top 5 most dangerous events seem to be shared across metris: injuries and casualties.

Economic inmpact

Let’s make the same analysis we did for injuries and casualties, both for crops and properties damages. Remember that we have preprocessed this specific inputs to acount for the multiplier K = x1000, M = x1000000, B = x1000000000

top10PropertyDMG <- head(sotrmsMetrics %>% arrange(desc(propertyDmg)),10)
top10CropDMG <- head(sotrmsMetrics %>% arrange(desc(cropsDamage)),10)
par(mfrow = c(2, 1))
barplot(top10PropertyDMG$propertyDmg, names.arg = top10PropertyDMG$EVTYPE, col = 'red', main = 'Top 10 Eventyypes with most property damage [US dollars$]', cex.names = 0.6, las = 2)
barplot(top10CropDMG$cropsDamage, names.arg = top10CropDMG$EVTYPE, col = 'blue', main = 'Top 10 Eventyypes with most crop damage [US dollars$]', cex.names = 0.6, las = 2)

That is pretty interesting, we have that FLOOD is the natural disaster with the most economic damages to the properties, while Droughts are the one most harmful to crops. Let’s see the raw number for damage in us dolars caused by this natural disasters

print(head(top10PropertyDMG %>% select(propertyDmg),1))

## # A tibble: 1 × 1
##    propertyDmg
##          <dbl>
## 1 144657709807

print(head(top10CropDMG %>% select(cropsDamage),1))

## # A tibble: 1 × 1
##   cropsDamage
##         <dbl>
## 1 13972566000

print(paste('The ratio between property damage and crops damage is',top10PropertyDMG$propertyDmg[1]/top10CropDMG$cropsDamage[1] ,sep = ' '))

## [1] "The ratio between property damage and crops damage is 10.3529809633392"

This confirm that the property damage by raw values are ten times higher than the crops damage. Yet this assumption is not that fair since we will need to consider the impact that crops and property damage have in a local economy. Unfortunately, that is out of scope the analysis of this report, and we will need to have way more data.

So in conclusion we can say that we must watch out for Tornados and try not to buy a property in a prone to flood zone!

StormsReport

IvanPoblette

2025-08-20

Synopsis

Results

Damage to Population

Economic inmpact