Data analysis
Obtaining data
We can get the data file from the shared link. The file is compressed with the bzip2 algorithm to reduce its size.
url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
File <- 'StormData.csv.bz2'
if(!file.exists(File)){
download.file(url, File, mode = 'wb')
}
rawData <- read.csv(file = File, header = TRUE, sep = ',')
Documentation for the database is also available. It explains how some of the variables are constructed or defined:
- National Weather Service Storm Data Documentation
- National Climatic Data Center Storm Events FAQ
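read.csv can read a bzip2-compressed file directly, so no separate decompression step is needed. The following is a minimal sketch on a hypothetical temporary file (the toy data frame and file name are illustrative, not part of the storm data):

```r
# Hypothetical toy file, not the real storm data
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write the CSV through a bzip2-compressing connection
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the path in read mode, where the compression is
# detected from the file contents and undone on the fly
readBack <- read.csv(tmp, header = TRUE)
```

The same call pattern is what loads StormData.csv.bz2 above.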
Data processing
According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and from 1996 onward all types of events can be found. Since the objective is to compare the effects of weather events on the economy and on public health, we keep only the events that happened from 1996 onward:
mainEvents <- rawData
mainEvents$BGN_DATE <- strptime(rawData$BGN_DATE, "%m/%d/%Y %H:%M:%S")
mainEvents <- subset(mainEvents, BGN_DATE > "1995-12-31")
Now that we have the correct time period, we can select the variables that best express the effect of natural disasters on society:
- First, we can inspect the names of the variables, which should be self-explanatory:
colnames(mainEvents)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
From this we can select the following variables of interest:
- EVTYPE: the type of event
- FATALITIES: number of fatalities
- INJURIES: number of injuries
- PROPDMG: the size of property damage
- PROPDMGEXP: the order of magnitude of PROPDMG
- CROPDMG: the size of crop damage
- CROPDMGEXP: the exponent values for CROPDMG
Now we can proceed to subset the data using only the selected variables:
mainEvents <- subset(mainEvents, select = c(EVTYPE,
FATALITIES,
INJURIES,
PROPDMG,
PROPDMGEXP,
CROPDMG,
CROPDMGEXP))
head(mainEvents)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 248768 WINTER STORM 0 0 380 K 38 K
## 248769 TORNADO 0 0 100 K 0
## 248770 TSTM WIND 0 0 3 K 0
## 248771 TSTM WIND 0 0 5 K 0
## 248772 TSTM WIND 0 0 2 K 0
## 248773 HAIL 0 0 0 0
We can check how many different event types we have:
length(unique(mainEvents$EVTYPE))
## [1] 516
Some event types may be duplicated because they differ only in letter case. To fix that, we can convert all values of the variable EVTYPE to upper case:
mainEvents$EVTYPE <- toupper(mainEvents$EVTYPE)
length(unique(mainEvents$EVTYPE))
## [1] 438
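To see why upper-casing reduces the count, here is a minimal sketch on a hypothetical toy vector (the labels are illustrative, not taken from the data set). It also applies trimws(), an extra step that is not part of the analysis above and is only an assumption about how stray padding could be handled:

```r
# Hypothetical event labels that differ only in case or padding
ev <- c("Tstm Wind", "TSTM WIND", " TSTM WIND", "Hail", "HAIL")

n_raw     <- length(unique(ev))                   # every string is distinct
n_upper   <- length(unique(toupper(ev)))          # case differences collapsed
n_trimmed <- length(unique(trimws(toupper(ev))))  # padding collapsed as well
```

In this toy vector the counts go from 5 to 3 to 2.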
We can also keep only the events that had a non-zero outcome in at least one of the analyzed variables:
mainEvents <- mainEvents[ mainEvents$FATALITIES !=0 |
mainEvents$INJURIES !=0 |
mainEvents$PROPDMG !=0 |
mainEvents$CROPDMG !=0, ]
length(unique(mainEvents$EVTYPE))
## [1] 186
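The row filter can also be written with rowSums. The sketch below is an equivalent formulation on hypothetical toy data, assuming the four outcome columns are numeric:

```r
# Hypothetical toy data with the four outcome columns
toy <- data.frame(
  EVTYPE     = c("HAIL", "FLOOD", "TORNADO"),
  FATALITIES = c(0, 0, 2),
  INJURIES   = c(0, 0, 5),
  PROPDMG    = c(0, 10, 0),
  CROPDMG    = c(0, 0, 0)
)
outcomes <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# Keep rows where at least one outcome column is non-zero
kept <- toy[rowSums(toy[, outcomes] != 0) > 0, ]
```

Here the HAIL row is dropped because all four of its outcome values are zero.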
With the data cleaned, we can ask which event type affected the most people. We sum the variables FATALITIES and INJURIES over all events of each type in the new data.frame healthData, and then store their total in the variable PEOPLEAFFECTED:
healthData <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = mainEvents, FUN = sum)
head(healthData)
## EVTYPE FATALITIES INJURIES
## 1 HIGH SURF ADVISORY 0 0
## 2 FLASH FLOOD 0 0
## 3 TSTM WIND 0 0
## 4 TSTM WIND (G45) 0 0
## 5 AGRICULTURAL FREEZE 0 0
## 6 ASTRONOMICAL HIGH TIDE 0 0
healthData$PEOPLEAFFECTED <- healthData$INJURIES + healthData$FATALITIES
Now we can order the data frame so that its first 10 rows hold the events that affected the greatest number of people:
healthData <- healthData[order(healthData$PEOPLEAFFECTED, decreasing = TRUE), ]
knitr::kable(healthData[1:10,])
| | EVTYPE | FATALITIES | INJURIES | PEOPLEAFFECTED |
|---|---|---|---|---|
| 149 | TORNADO | 1511 | 20667 | 22178 |
| 39 | EXCESSIVE HEAT | 1797 | 6391 | 8188 |
| 48 | FLOOD | 414 | 6758 | 7172 |
| 107 | LIGHTNING | 651 | 4141 | 4792 |
| 153 | TSTM WIND | 241 | 3629 | 3870 |
| 46 | FLASH FLOOD | 887 | 1674 | 2561 |
| 146 | THUNDERSTORM WIND | 130 | 1400 | 1530 |
| 182 | WINTER STORM | 191 | 1292 | 1483 |
| 69 | HEAT | 237 | 1222 | 1459 |
| 88 | HURRICANE/TYPHOON | 64 | 1275 | 1339 |
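The aggregate, sum and order steps can be followed on a small example. This is a minimal sketch with hypothetical toy values, not figures from the storm data:

```r
# Hypothetical toy events
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "FLOOD"),
  FATALITIES = c(1, 0, 2, 3),
  INJURIES   = c(10, 4, 5, 0)
)

# Sum both columns within each event type
byType <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combine the two totals into one measure of people affected
byType$PEOPLEAFFECTED <- byType$FATALITIES + byType$INJURIES

# Put the event type with the largest total first
byType <- byType[order(byType$PEOPLEAFFECTED, decreasing = TRUE), ]
```

In this toy case TORNADO ends up first with 18 people affected, ahead of FLOOD with 7.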
Transforming data for economic consequences into workable numbers
Since both crop damage and property damage are split into a number and an exponent, we can combine the two to get the amounts we need for comparison:
- The order of magnitude is described by a letter key:
  - B/b - billion
  - M/m - million
  - K/k - thousand
  - H/h - hundred
- Other symbols: -, + and ?, which refer to less than, greater than, and low certainty. We can ignore these.
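As a compact illustration of the letter key above, the following sketch uses a named lookup vector on hypothetical codes. It is an alternative formulation, not the substitution chain used in this analysis, and it assumes that blank or unrecognised codes mean a multiplier of 1. Digit codes and + are not handled here.

```r
# Hypothetical magnitude codes and damage figures
expCode <- c("K", "m", "B", "", "?", "h")
dmg     <- c(25, 1.5, 2, 10, 3, 4)

# Named lookup vector: letter key -> power of ten
lookup <- c(H = 2, K = 3, M = 6, B = 9)
power  <- unname(lookup[toupper(expCode)])

# Assumption: blank or unrecognised codes count as 10^0
power[is.na(power)] <- 0

# Damage in plain units
total <- dmg * 10^power
```

For example, 25 with code K becomes 25000, and 1.5 with code m becomes 1500000.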
mainEvents$PROPDMGEXP <- gsub("[Hh]", "2", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Kk]", "3", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Mm]", "6", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("[Bb]", "9", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("\\+", "1", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- gsub("\\?|\\-|\\ ", "0", mainEvents$PROPDMGEXP)
mainEvents$PROPDMGEXP <- as.numeric( mainEvents$PROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Hh]", "2", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Kk]", "3", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Mm]", "6", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("[Bb]", "9", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("\\+", "1", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- gsub("\\-|\\?|\\ ", "0", mainEvents$CROPDMGEXP)
mainEvents$CROPDMGEXP <- as.numeric( mainEvents$CROPDMGEXP)
mainEvents$PROPDMGEXP[is.na( mainEvents$PROPDMGEXP)] <- 0
mainEvents$CROPDMGEXP[is.na( mainEvents$CROPDMGEXP)] <- 0
Once the order of magnitude is in a numeric format, we can use it as follows:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mainEvents <- mutate( mainEvents,
PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP),
CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))
Now we can use both variables, crop and property damage, to find which events caused the greatest economic loss:
Economic_data <- aggregate(cbind(PROPDMGTOTAL, CROPDMGTOTAL) ~ EVTYPE, data = mainEvents, FUN=sum)
Economic_data$ECONOMIC_LOSS <- Economic_data$PROPDMGTOTAL + Economic_data$CROPDMGTOTAL
Economic_data <- Economic_data[order(Economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- Economic_data[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")
| | EVTYPE | PROPDMGTOTAL | CROPDMGTOTAL | ECONOMIC_LOSS |
|---|---|---|---|---|
| 48 | FLOOD | 143944833550 | 4974778400 | 148919611950 |
| 88 | HURRICANE/TYPHOON | 69305840000 | 2607872800 | 71913712800 |
| 141 | STORM SURGE | 43193536000 | 5000 | 43193541000 |
| 149 | TORNADO | 24616945710 | 283425010 | 24900370720 |
| 66 | HAIL | 14595143420 | 2476029450 | 17071172870 |
| 46 | FLASH FLOOD | 15222203910 | 1334901700 | 16557105610 |
| 86 | HURRICANE | 11812819010 | 2741410000 | 14554229010 |
| 32 | DROUGHT | 1046101000 | 13367566000 | 14413667000 |
| 152 | TROPICAL STORM | 7642475550 | 677711000 | 8320186550 |
| 83 | HIGH WIND | 5247860360 | 633561300 | 5881421660 |
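Totals of this size are easier to compare when scaled. A minimal sketch, using the ECONOMIC_LOSS values of the first three rows of the table above and assuming that billions are a convenient unit:

```r
# ECONOMIC_LOSS values copied from the first three rows of the table
loss <- c(FLOOD               = 148919611950,
          `HURRICANE/TYPHOON` = 71913712800,
          `STORM SURGE`       = 43193541000)

# Express the losses in billions, rounded to two decimals
lossBillions <- round(loss / 1e9, 2)
```

This gives roughly 148.92, 71.91 and 43.19 billion respectively.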
Results
Once we have the two tables needed to assess the effect of events on the population and on economic loss, we can plot the results as bar plots:
library(ggplot2)
g <- ggplot(data = healthData[1:10,], aes(x = reorder(EVTYPE, PEOPLEAFFECTED), y = PEOPLEAFFECTED))
g <- g + geom_bar(stat = "identity", colour = "black")
g <- g + labs(title = "Total people loss in USA by weather events in 1996-2011")
g <- g + theme(plot.title = element_text(hjust = 0.5))
g <- g + labs(y = "Number of fatalities and injuries", x = "Event Type")
g <- g + coord_flip()
print(g)
We can conclude from the graph that the events that affected the most people were TORNADO and EXCESSIVE HEAT.
g <- ggplot(data = Economic_data[1:10,], aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black")
g <- g + labs(title = "Total economic loss in USA by weather events in 1996-2011")
g <- g + theme(plot.title = element_text(hjust = 0.5))
g <- g + labs(y = "Size of property and crop loss", x = "Event Type")
g <- g + coord_flip()
print(g)
From the economic loss graph we can conclude that the events with the greatest economic impact on society were flood and hurricane/typhoon.