Reproducible Research Project 2

Andres Beltran

2022-07-01

Data analysis

Obtaining data

We can get the data file from the shared link. The file is compressed with the bzip2 algorithm to reduce its size.

url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
  

File <- 'StormData.csv.bz2'

if(!file.exists(File)){
  download.file(url, File, mode = 'wb')
}

rawData <- read.csv(file = File, header = TRUE, sep = ',')
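As a small illustration of why no separate decompression step is needed, the sketch below writes a made-up data frame to a temporary bzip2 file and reads it back with read.csv. The toy values and the temporary file are assumptions for the example, not part of the storm data.

```r
# Toy data standing in for the storm data (values are made up).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write the toy data through a bzip2 connection to a temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses while reading.
toyBack <- read.csv(tmp, header = TRUE)
print(toyBack)
```

The same mechanism is what lets the report pass the downloaded .bz2 file straight to read.csv.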

Documentation for the database is also available. It explains how some of the variables are constructed or defined:

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

Data processing

According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and from 1996 onward all event types can be found. Since the objective is to compare the effects of weather events on the economy and public health, we keep only the events that began in 1996 or later:

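mainEvents <- rawData
# Parse the date part of BGN_DATE; as.Date() ignores the trailing time of day
mainEvents$BGN_DATE <- as.Date(rawData$BGN_DATE, format = "%m/%d/%Y")
# Keep only the events that began after 31 December 1995
mainEvents <- subset(mainEvents, BGN_DATE > as.Date("1995-12-31"))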
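To make the date filter concrete, here is a minimal sketch on three made-up BGN_DATE strings written in the same text format as the raw data. It uses as.Date(), which is one way to do the conversion. The toy values are assumptions for illustration only.

```r
# Toy BGN_DATE values in the same text format as the raw data.
toyDates <- data.frame(BGN_DATE = c("4/18/1950 0:00:00",
                                    "12/31/1995 0:00:00",
                                    "1/6/1996 0:00:00"))

# as.Date() reads the month/day/year part and ignores the trailing time.
toyDates$BGN_DATE <- as.Date(toyDates$BGN_DATE, format = "%m/%d/%Y")

# Keep only events that began after 31 December 1995.
recent <- subset(toyDates, BGN_DATE > as.Date("1995-12-31"))
print(recent)
```

Only the 1996 row survives the filter. The event dated 31 December 1995 is excluded because the comparison is strict.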

Now that we have the correct time period, we can select the variables that best express the effect of natural disasters on society.

First, we can inspect the names of the variables; they should be self-explanatory:

colnames(mainEvents)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

From these we can select the following variables of interest:

EVTYPE: the type of event
FATALITIES: the number of fatalities
INJURIES: the number of injuries
PROPDMG: the size of property damage
PROPDMGEXP: the order of magnitude of PROPDMG
CROPDMG: the size of crop damage
CROPDMGEXP: the order of magnitude of CROPDMG

Now we can proceed to subset the data using only the selected variables:

mainEvents <- subset(mainEvents, select = c(EVTYPE, 
                                            FATALITIES, 
                                            INJURIES, 
                                            PROPDMG, 
                                            PROPDMGEXP, 
                                            CROPDMG, 
                                            CROPDMGEXP))
head(mainEvents)

##              EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 248768 WINTER STORM          0        0     380          K      38          K
## 248769      TORNADO          0        0     100          K       0           
## 248770    TSTM WIND          0        0       3          K       0           
## 248771    TSTM WIND          0        0       5          K       0           
## 248772    TSTM WIND          0        0       2          K       0           
## 248773         HAIL          0        0       0                  0

We can check how many different event types we have:

length(unique(mainEvents$EVTYPE))

## [1] 516

Some event types may be counted more than once because they differ only in letter case. To fix that, we can convert all values of the variable EVTYPE to upper case:

mainEvents$EVTYPE <- toupper(mainEvents$EVTYPE)
length(unique(mainEvents$EVTYPE))

## [1] 438
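The effect of the case conversion can be seen on a small made-up vector of event labels (the labels are assumptions chosen for illustration):

```r
# Toy event labels that differ only in letter case, plus one distinct type.
toyTypes <- c("Heavy Rain", "HEAVY RAIN", "heavy rain", "Hail")

# Count distinct labels before and after converting to upper case.
nBefore <- length(unique(toyTypes))
nAfter  <- length(unique(toupper(toyTypes)))

print(c(before = nBefore, after = nAfter))
```

Upper-casing only merges labels that differ in case. Labels that differ in spelling or abbreviation, such as TSTM WIND and THUNDERSTORM WIND, remain separate types.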

We can also keep only the events that had a non-zero outcome in at least one of the analyzed variables:

mainEvents <-   mainEvents[ mainEvents$FATALITIES !=0 | 
                            mainEvents$INJURIES !=0 | 
                            mainEvents$PROPDMG !=0 | 
                            mainEvents$CROPDMG !=0, ]
length(unique(mainEvents$EVTYPE))

## [1] 186
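To see what this filter keeps, here is a minimal sketch on a made-up data frame. The toy values are assumptions for illustration only.

```r
# Toy events: the first row has no recorded outcome at all.
toy <- data.frame(EVTYPE     = c("HAIL", "TORNADO", "FLOOD"),
                  FATALITIES = c(0, 1, 0),
                  INJURIES   = c(0, 3, 0),
                  PROPDMG    = c(0, 100, 25),
                  CROPDMG    = c(0, 0, 0))

# A row is kept if at least one of the four outcome variables is non-zero.
keep <- toy$FATALITIES != 0 | toy$INJURIES != 0 |
        toy$PROPDMG != 0 | toy$CROPDMG != 0
toyKept <- toy[keep, ]
print(toyKept)
```

The FLOOD row is kept even though it has no fatalities or injuries, because it has property damage. Rows like it explain why an event type can appear later with zero fatalities and zero injuries.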

Once the data is cleaned, we can ask which event type affected the most people. To answer this, we sum FATALITIES and INJURIES for each event type in a new data frame, healthData, and then add the two sums together in a new variable, PEOPLEAFFECTED:

healthData <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = mainEvents, FUN = sum)
head(healthData)

##                   EVTYPE FATALITIES INJURIES
## 1     HIGH SURF ADVISORY          0        0
## 2            FLASH FLOOD          0        0
## 3              TSTM WIND          0        0
## 4        TSTM WIND (G45)          0        0
## 5    AGRICULTURAL FREEZE          0        0
## 6 ASTRONOMICAL HIGH TIDE          0        0

healthData$PEOPLEAFFECTED <- healthData$INJURIES + healthData$FATALITIES

Now we can order the data frame so that the first 10 rows contain the events that affected the greatest number of people:

healthData <- healthData[order(healthData$PEOPLEAFFECTED, decreasing = TRUE), ]
knitr::kable(healthData[1:10,])

	EVTYPE	FATALITIES	INJURIES	PEOPLEAFFECTED
149	TORNADO	1511	20667	22178
39	EXCESSIVE HEAT	1797	6391	8188
48	FLOOD	414	6758	7172
107	LIGHTNING	651	4141	4792
153	TSTM WIND	241	3629	3870
46	FLASH FLOOD	887	1674	2561
146	THUNDERSTORM WIND	130	1400	1530
182	WINTER STORM	191	1292	1483
69	HEAT	237	1222	1459
88	HURRICANE/TYPHOON	64	1275	1339
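The aggregate, sum and sort steps used for the table above can be traced on a small made-up data frame. The toy values are assumptions for illustration only.

```r
# Toy events with repeated event types.
toy <- data.frame(EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
                  FATALITIES = c(1, 2, 3, 5),
                  INJURIES   = c(10, 0, 20, 1))

# Sum fatalities and injuries within each event type.
toyHealth <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                       data = toy, FUN = sum)

# Combine both outcomes into one measure and sort from largest to smallest.
toyHealth$PEOPLEAFFECTED <- toyHealth$FATALITIES + toyHealth$INJURIES
toyHealth <- toyHealth[order(toyHealth$PEOPLEAFFECTED, decreasing = TRUE), ]
print(toyHealth)
```

Here TORNADO comes first with 34 people affected (4 fatalities and 30 injuries), followed by HEAT with 6 and FLOOD with 2.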

Transforming data for economic consequences into workable numbers

Since both crop damage and property damage are stored as a number and an exponent, we can combine the two to get the values we need for comparison.

The order of magnitude is described by a letter key:
- B/b - billion
- M/m - million
- K/k - thousand
- H/h - hundred
Other symbols, -, + and ?, refer to less than, greater than, and low certainty. The code below maps + to an exponent of 1 and maps - and ? to an exponent of 0, the same value it assigns to blank entries.

# Translate the letter codes of PROPDMGEXP into powers of ten
mainEvents$PROPDMGEXP <- gsub("[Hh]", "2", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Kk]", "3", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Mm]", "6", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Bb]", "9", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("\\+", "1", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("\\?|\\-|\\ ", "0", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- as.numeric(mainEvents$PROPDMGEXP)

# Same translation for CROPDMGEXP
mainEvents$CROPDMGEXP <- gsub("[Hh]", "2", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Kk]", "3", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Mm]", "6", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Bb]", "9", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("\\+", "1", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("\\-|\\?|\\ ", "0", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- as.numeric(mainEvents$CROPDMGEXP)

# Blank exponents become NA in as.numeric(); treat them as an exponent of 0
mainEvents$PROPDMGEXP[is.na(mainEvents$PROPDMGEXP)] <- 0
mainEvents$CROPDMGEXP[is.na(mainEvents$CROPDMGEXP)] <- 0
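The same mapping can also be written with a named lookup vector instead of repeated gsub calls. The sketch below is an alternative formulation on made-up exponent codes, not the method used in this report. The names expKey, toyExp and toyNum are hypothetical.

```r
# Lookup table: exponent code -> power of ten (same mapping as described above).
expKey <- c(H = 2, K = 3, M = 6, B = 9, "+" = 1, "-" = 0, "?" = 0)

# Toy exponent codes, including a blank and a plain digit.
toyExp <- c("K", "m", "B", "", "+", "?", "5")

# Look up each upper-cased code; codes not in the table give NA.
code   <- toupper(toyExp)
toyNum <- unname(expKey[code])

# Plain digits are used as they are; anything still unknown becomes 0.
digits <- suppressWarnings(as.numeric(code))
toyNum <- ifelse(is.na(toyNum), digits, toyNum)
toyNum[is.na(toyNum)] <- 0
print(toyNum)
```

A lookup vector keeps the whole key in one place, which makes it easier to compare against the documentation and to apply to both PROPDMGEXP and CROPDMGEXP.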

Once the orders of magnitude are numeric, we can compute each total as the damage value multiplied by ten raised to the exponent:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mainEvents <- mutate(mainEvents,
                     PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP),
                     CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))
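As a worked example of this arithmetic, a property damage of 380 with exponent K (3) becomes 380 * 10^3 = 380,000 dollars. The sketch below repeats the calculation in base R on made-up rows. Apart from the 380 K case, the values are assumptions for illustration.

```r
# Toy damage values with numeric exponents (3 = thousand, 6 = million).
toy <- data.frame(PROPDMG    = c(380, 2.5, 0),
                  PROPDMGEXP = c(3, 6, 0))

# Total damage in dollars: value times ten to the power of the exponent.
toy$PROPDMGTOTAL <- toy$PROPDMG * 10^toy$PROPDMGEXP
print(toy)
```

An exponent of 0 leaves the value unchanged, since 10^0 is 1, so rows with a blank exponent keep their raw damage value.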

Now we can use both variables, crop and property damage, to find which events caused the greatest economic loss:

Economic_data <- aggregate(cbind(PROPDMGTOTAL, CROPDMGTOTAL) ~ EVTYPE, data = mainEvents, FUN = sum)
Economic_data$ECONOMIC_LOSS <- Economic_data$PROPDMGTOTAL + Economic_data$CROPDMGTOTAL
Economic_data <- Economic_data[order(Economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- Economic_data[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")

	EVTYPE	PROPDMGTOTAL	CROPDMGTOTAL	ECONOMIC_LOSS
48	FLOOD	143944833550	4974778400	148919611950
88	HURRICANE/TYPHOON	69305840000	2607872800	71913712800
141	STORM SURGE	43193536000	5000	43193541000
149	TORNADO	24616945710	283425010	24900370720
66	HAIL	14595143420	2476029450	17071172870
46	FLASH FLOOD	15222203910	1334901700	16557105610
86	HURRICANE	11812819010	2741410000	14554229010
32	DROUGHT	1046101000	13367566000	14413667000
152	TROPICAL STORM	7642475550	677711000	8320186550
83	HIGH WIND	5247860360	633561300	5881421660

Results

Once we have the two tables needed to assess the effect of events on the population and on economic loss, we can plot the results as bar plots:

library(ggplot2)
g <- ggplot(data = healthData[1:10,], aes(x = reorder(EVTYPE, PEOPLEAFFECTED), y = PEOPLEAFFECTED))
g <- g + geom_bar(stat = "identity", colour = "black")
g <- g + labs(title = "Total fatalities and injuries in USA by weather events in 1996-2011")
g <- g + theme(plot.title = element_text(hjust = 0.5))
g <- g + labs(y = "Number of fatalities and injuries", x = "Event Type")
g <- g + coord_flip()
print(g)

We can conclude from the graph that the events that affected the greatest number of people were TORNADO and EXCESSIVE HEAT.

g <- ggplot(data = Economic_data[1:10,], aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black")
g <- g + labs(title = "Total economic loss in USA by weather events in 1996-2011")
g <- g + theme(plot.title = element_text(hjust = 0.5))
g <- g + labs(y = "Size of property and crop loss", x = "Event Type")
g <- g + coord_flip()
print(g)

From the economic loss graph we can conclude that the events with the greatest economic impact on society were FLOOD and HURRICANE/TYPHOON.