Reproducible Research Project 2

Andres Beltran

2022-07-01

Data analysis

Obtaining data

We can get the data file from the shared link. The file is compressed with the bzip2 algorithm to reduce its size.

url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'

File <- 'StormData.csv.bz2'

if(!file.exists(File)){
  download.file(url, File, mode = 'wb')
}

rawData <- read.csv(file = File, header = TRUE, sep = ',')
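
As a quick check that no manual decompression step is needed, the sketch below writes a tiny bzip2-compressed CSV to a temporary file and reads it back with read.csv, which detects the compression when given a file path. The toy rows and the names toy, tmp, con and demo are illustrative assumptions, not part of the storm data set.

```r
# Toy data standing in for the real storm file (illustrative values only)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write the toy data to a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the path through file(), which detects bzip2 compression
demo <- read.csv(tmp, header = TRUE, sep = ",")
print(demo)
```

The same call therefore works on StormData.csv.bz2 without unpacking it first.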

Documentation for the database is also available; it explains how some of the variables are constructed or defined.

Data processing

According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and from 1996 onward all types of events are recorded. Knowing that the objective is to compare the effects of weather events on the economy and public health, we can keep only the events that happened from 1996 onward:

mainEvents <- rawData
mainEvents$BGN_DATE <- strptime(rawData$BGN_DATE, "%m/%d/%Y %H:%M:%S")
mainEvents <- subset(mainEvents, BGN_DATE > "1995-12-31")
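
The same filter can also be written with Date objects instead of POSIXlt, which makes the cut-off explicit. This is a minimal sketch on made-up rows; toy and recent are hypothetical names, and it assumes the BGN_DATE strings follow the month/day/year layout used in the strptime call above.

```r
# Made-up rows in the same date layout as BGN_DATE (illustrative only)
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "12/31/1995 0:00:00", "1/6/1996 0:00:00"),
  EVTYPE   = c("TORNADO", "HAIL", "WINTER STORM"),
  stringsAsFactors = FALSE
)

# as.Date parses only as far as the format needs; the trailing time is ignored
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep events from 1 January 1996 onward
recent <- subset(toy, BGN_DATE >= as.Date("1996-01-01"))
print(recent)
```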

Now that we have the right time period, we can select the variables that best express the effect of natural disasters on society:

  • First, we can inspect the names of the variables, which should be self-explanatory:
colnames(mainEvents)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

From these we can select the following variables of interest:

  • EVTYPE: the type of event
  • FATALITIES: the number of fatalities
  • INJURIES: the number of injuries
  • PROPDMG: the size of property damage
  • PROPDMGEXP: the order of magnitude of PROPDMG
  • CROPDMG: the size of crop damage
  • CROPDMGEXP: the order of magnitude of CROPDMG

Now we can proceed to subset the data using only the selected variables:

mainEvents <- subset(mainEvents, select = c(EVTYPE, 
                                            FATALITIES, 
                                            INJURIES, 
                                            PROPDMG, 
                                            PROPDMGEXP, 
                                            CROPDMG, 
                                            CROPDMGEXP))
head(mainEvents)
##              EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 248768 WINTER STORM          0        0     380          K      38          K
## 248769      TORNADO          0        0     100          K       0           
## 248770    TSTM WIND          0        0       3          K       0           
## 248771    TSTM WIND          0        0       5          K       0           
## 248772    TSTM WIND          0        0       2          K       0           
## 248773         HAIL          0        0       0                  0

We can check how many different event types we have:

length(unique(mainEvents$EVTYPE))
## [1] 516

Some event types may be repeated with different letter case. To fix that, we can convert all values of the variable EVTYPE to upper case:

mainEvents$EVTYPE <- toupper(mainEvents$EVTYPE)
length(unique(mainEvents$EVTYPE))
## [1] 438
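
Letter case is probably not the only source of duplicates: labels may also differ by leading or trailing spaces. As a hedged sketch (this step is not part of the analysis above), trimws can be combined with toupper. The vector labels is made up for illustration.

```r
# Made-up labels that differ only by case or surrounding spaces
labels <- c("Heavy Rain", "HEAVY RAIN", " heavy rain", "Hail")

# Strip surrounding whitespace, then convert to upper case
cleaned <- toupper(trimws(labels))

n_unique <- length(unique(cleaned))
print(unique(cleaned))
```

Whether this is worth applying to EVTYPE depends on how many labels it actually merges in the real data.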

We can also keep only the events that had a nonzero outcome in at least one of the analyzed variables:

mainEvents <-   mainEvents[ mainEvents$FATALITIES !=0 | 
                            mainEvents$INJURIES !=0 | 
                            mainEvents$PROPDMG !=0 | 
                            mainEvents$CROPDMG !=0, ]
length(unique(mainEvents$EVTYPE))
## [1] 186

Once the data is cleaned, we can find out which event type affected the most people. This can be calculated by summing the variables FATALITIES and INJURIES for each event type and saving the result in the variable PEOPLEAFFECTED of the new data frame healthData:

healthData <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = mainEvents, FUN = sum)
head(healthData) 
##                   EVTYPE FATALITIES INJURIES
## 1     HIGH SURF ADVISORY          0        0
## 2            FLASH FLOOD          0        0
## 3              TSTM WIND          0        0
## 4        TSTM WIND (G45)          0        0
## 5    AGRICULTURAL FREEZE          0        0
## 6 ASTRONOMICAL HIGH TIDE          0        0
healthData$PEOPLEAFFECTED <- healthData$INJURIES + healthData$FATALITIES
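
To make the aggregation step concrete, here is a minimal sketch of the same aggregate call on three made-up rows. The names toy and toyHealth and all values are illustrative assumptions.

```r
# Three made-up events: two tornadoes and one flood
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO"),
  FATALITIES = c(1, 0, 2),
  INJURIES   = c(10, 3, 5),
  stringsAsFactors = FALSE
)

# Sum both outcome columns within each event type
toyHealth <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined count of people affected per event type
toyHealth$PEOPLEAFFECTED <- toyHealth$FATALITIES + toyHealth$INJURIES
print(toyHealth)
```

The two tornado rows collapse into a single row, so the result has one row per event type.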

Now we can order the data frame so that the first 10 rows hold the events that affected the greatest number of people:

healthData <- healthData[order(healthData$PEOPLEAFFECTED, decreasing = TRUE), ]
knitr::kable(healthData[1:10,])
|    |EVTYPE            | FATALITIES| INJURIES| PEOPLEAFFECTED|
|:---|:-----------------|----------:|--------:|--------------:|
|149 |TORNADO           |       1511|    20667|          22178|
|39  |EXCESSIVE HEAT    |       1797|     6391|           8188|
|48  |FLOOD             |        414|     6758|           7172|
|107 |LIGHTNING         |        651|     4141|           4792|
|153 |TSTM WIND         |        241|     3629|           3870|
|46  |FLASH FLOOD       |        887|     1674|           2561|
|146 |THUNDERSTORM WIND |        130|     1400|           1530|
|182 |WINTER STORM      |        191|     1292|           1483|
|69  |HEAT              |        237|     1222|           1459|
|88  |HURRICANE/TYPHOON |         64|     1275|           1339|

Transforming data for economic consequences into workable numbers

Since both crop damage and property damage are split into a number and an exponent, we can combine the two to get the values we need for comparison:

  • The order of magnitude is described by a letter key:
    • B/b - billion
    • M/m - million
    • K/k - thousand
    • H/h - hundred
  • Other symbols: -, + and ?, which refer to less than, greater than, and low certainty. They carry no magnitude information; the code below maps - and ? to exponent 0 and + to exponent 1.
mainEvents$PROPDMGEXP <- gsub("[Hh]", "2", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Kk]", "3", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Mm]", "6", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Bb]", "9", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("\\+", "1", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("\\?|\\-|\\ ", "0", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- as.numeric(mainEvents$PROPDMGEXP)

mainEvents$CROPDMGEXP <- gsub("[Hh]", "2", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Kk]", "3", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Mm]", "6", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Bb]", "9", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("\\+", "1", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("\\-|\\?|\\ ", "0", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- as.numeric(mainEvents$CROPDMGEXP)

mainEvents$PROPDMGEXP[is.na(mainEvents$PROPDMGEXP)] <- 0
mainEvents$CROPDMGEXP[is.na(mainEvents$CROPDMGEXP)] <- 0
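
Because the same substitutions are applied to both columns, the mapping could also be wrapped in one helper based on a lookup vector. The function below is a sketch, not the method used above: exp_to_power is a hypothetical name, and unlike the gsub chain it treats + like the other symbols and maps it to 0.

```r
# Hypothetical helper: convert exponent codes to powers of ten
exp_to_power <- function(code) {
  lookup <- c(H = 2, K = 3, M = 6, B = 9)   # letter keys described above
  code <- toupper(trimws(code))
  power <- unname(lookup[code])             # NA for anything that is not a letter key
  digit <- grepl("^[0-9]$", code)           # single digits are kept as exponents
  power[digit] <- as.numeric(code[digit])
  power[is.na(power)] <- 0                  # blanks and symbols fall back to 10^0
  power
}

powers <- exp_to_power(c("K", "m", "", "5", "?", "+"))
print(powers)
```

Under this assumption the helper could be applied as exp_to_power(mainEvents$PROPDMGEXP) and exp_to_power(mainEvents$CROPDMGEXP), in place of the two gsub chains.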

Once the order of magnitude is in numeric form, we can use it as follows:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mainEvents <- mutate(mainEvents,
                     PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP),
                     CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))
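
As a worked check of the arithmetic, take the first row printed by head(mainEvents) earlier: a WINTER STORM with PROPDMG 380 and CROPDMG 38, both with key K, that is, exponent 3. The sketch below reproduces the calculation in base R with transform instead of mutate; row1 is a hypothetical name.

```r
# Values copied from the WINTER STORM row, with K already converted to 3
row1 <- data.frame(PROPDMG = 380, PROPDMGEXP = 3, CROPDMG = 38, CROPDMGEXP = 3)

# Same formula as in the mutate call: number times ten to the exponent
row1 <- transform(row1,
                  PROPDMGTOTAL = PROPDMG * 10^PROPDMGEXP,
                  CROPDMGTOTAL = CROPDMG * 10^CROPDMGEXP)
print(row1)
```

So 380 K corresponds to 380 * 10^3 = 380000 in property damage, and 38 K to 38000 in crop damage.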

Now we can use both variables, crop damage and property damage, to find which events caused the greatest economic loss:

Economic_data <- aggregate(cbind(PROPDMGTOTAL, CROPDMGTOTAL) ~ EVTYPE, data = mainEvents, FUN = sum)
Economic_data$ECONOMIC_LOSS <- Economic_data$PROPDMGTOTAL + Economic_data$CROPDMGTOTAL
Economic_data <- Economic_data[order(Economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- Economic_data[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")
|    |EVTYPE            | PROPDMGTOTAL| CROPDMGTOTAL| ECONOMIC_LOSS|
|:---|:-----------------|------------:|------------:|-------------:|
|48  |FLOOD             | 143944833550|   4974778400|  148919611950|
|88  |HURRICANE/TYPHOON |  69305840000|   2607872800|   71913712800|
|141 |STORM SURGE       |  43193536000|         5000|   43193541000|
|149 |TORNADO           |  24616945710|    283425010|   24900370720|
|66  |HAIL              |  14595143420|   2476029450|   17071172870|
|46  |FLASH FLOOD       |  15222203910|   1334901700|   16557105610|
|86  |HURRICANE         |  11812819010|   2741410000|   14554229010|
|32  |DROUGHT           |   1046101000|  13367566000|   14413667000|
|152 |TROPICAL STORM    |   7642475550|    677711000|    8320186550|
|83  |HIGH WIND         |   5247860360|    633561300|    5881421660|
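
Totals of this size are easier to compare in billions. The sketch below rescales the three largest ECONOMIC_LOSS values copied from the table above. It assumes the damage figures are in US dollars, which the analysis does not state explicitly; top3 and LOSS_BILLIONS are hypothetical names.

```r
# Three largest totals, copied from the table above
top3 <- data.frame(
  EVTYPE        = c("FLOOD", "HURRICANE/TYPHOON", "STORM SURGE"),
  ECONOMIC_LOSS = c(148919611950, 71913712800, 43193541000)
)

# Express the loss in billions, rounded to one decimal place
top3$LOSS_BILLIONS <- round(top3$ECONOMIC_LOSS / 1e9, 1)
print(top3[, c("EVTYPE", "LOSS_BILLIONS")])
```

A column like this could also be used as the y variable in a plot so that the axis labels stay short.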

Results

Once we have the two tables needed to assess the effect of events on the population and on economic loss, we can plot the results using bar plots:

library(ggplot2)
g <- ggplot(data = healthData[1:10,], aes(x = reorder(EVTYPE, PEOPLEAFFECTED), y = PEOPLEAFFECTED))
g <- g + geom_bar(stat = "identity", colour = "black")
g <- g + labs(title = "Total people loss in USA by weather events in 1996-2011")
g <- g + theme(plot.title = element_text(hjust = 0.5))
g <- g + labs(y = "Number of fatalities and injuries", x = "Event Type")
g <- g + coord_flip()
print(g)

We can conclude from the graph that the events that affected the most people were TORNADO and EXCESSIVE HEAT.

g <- ggplot(data = Economic_data[1:10,], aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black")
g <- g + labs(title = "Total economic loss in USA by weather events in 1996-2011")
g <- g + theme(plot.title = element_text(hjust = 0.5))
g <- g + labs(y = "Size of property and crop loss", x = "Event Type")
g <- g + coord_flip()
print(g)

From the economic loss graph, we can conclude that the events that affected society the most in economic terms were FLOOD and HURRICANE/TYPHOON.