Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents the occurrence of storms and other significant weather phenomena in the United States. Data are public and accessible through the web.
This project performs an analysis of the available data to evaluate what type of meteorological events are the most harmful to the population and which generate the greatest economic losses across the US. First we download data, then we repare lots of typos, and later we subset, calculate and plot to see the results.
The starting database can be downloaded from the specified website. After downloading, we decompressed and read as rawdata
url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, 'stormdata.csv.bz2', method = "wininet")
# Read data
rawdata<- read.csv('stormdata.csv.bz2')
dim(rawdata)
## [1] 902297 37
The rawdata has 902297 entries and 37 variables. To reduce memory we select only 8 columns that are necessary for the analisys and store in a new data frame:
library(dtplyr)
library(data.table)
library(dplyr)
data <- select(rawdata, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP,BGN_DATE)
We also transform the date column ‘BGN_DATE’ into YEAR for our purposes:
# conver BGN_DATE in year
data$BGN_DATE<-year(strptime((data$BGN_DATE), "%m/%d/%Y"))
#names(data)[8]="YEAR"
colnames(data)[8] <- "YEAR"
As we see there are a lot of typos in the data, we will clean some of them to make a better selection.
There is a huge disparity of cases, for example with the following code we see the names of column “EVTYPE”, and that there are upper and lowercase letters, inappropriate characters and errors.
We will use the amatchfunction to automatically correct errors (typos) and assimilate each typo value to the nearest official value (event data frame)
# Correcting the event names
# Convert to uppercase
data$EVTYPE <- toupper(data$EVTYPE)
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$PROPDMGEXP<- toupper(data$PROPDMGEXP)
# first remove number from type
data$EVTYPE <- gsub("[[:punct:]]"," ", data$EVTYPE)
data$EVTYPE <- gsub("\\d+","", data$EVTYPE )
# fix some errors in EVTYPE
data$EVTYPE <- gsub("^\\s","", data$EVTYPE) # first free space
data$EVTYPE <- gsub("TORNDAO","TORNADO", data$EVTYPE)
data$EVTYPE <- gsub("TORNADOES","TORNADO", data$EVTYPE)
data$EVTYPE <- gsub("THUNDERESTORM","THUNDERSTORM", data$EVTYPE)
# lets assimilate each type to official ones
library(stringdist)
library(knitr)
#OFFICIAL EVENTS TYPE pag 6 storm data documentation
event<-toupper(c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Freezing Fog","Frost/Freeze","Funnel Cloud","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane/Typhoon","Ice Storm","Lakeshore Flood","Lake-Effect Snow","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather"))
event<-data.frame(event)
# take all 985 levels (lots of typos) to 48 official ones using amatch
data$EVTYPE<-event$event[amatch(data$EVTYPE,event$event,maxDist = 20)]
As detailed on Page 12 of the National Weather Service Storm Data Documentation, the property and crop damage data needs to be adjusted based on the corresponding multiplier field. More information about this point is available in https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html.
# recalculate real value of damage
data$PROPDMG[data$PROPDMGEXP=="K"] <- data$PROPDMG[data$PROPDMGEXP=="K"] * 1000
data$PROPDMG[data$PROPDMGEXP=="M"] <- data$PROPDMG[data$PROPDMGEXP=="M"] * 1000000
data$PROPDMG[data$PROPDMGEXP=="B"] <- data$PROPDMG[data$PROPDMGEXP=="B"] * 1000000000
data$PROPDMG[data$PROPDMGEXP=="+"] <- data$PROPDMG[data$PROPDMGEXP=="+"] * 1
data$PROPDMG[grep("[[:digit:]]",data$PROPDMGEXP)] <- data$PROPDMG[grep("[[:digit:]]",data$PROPDMGEXP)] * 10
data$CROPDMG[data$CROPDMGEXP=="K"] <- data$CROPDMG[data$CROPDMGEXP=="K"] * 1000
data$CROPDMG[data$CROPDMGEXP=="M"] <- data$CROPDMG[data$CROPDMGEXP=="M"] * 1000000
data$CROPDMG[data$CROPDMGEXP=="B"] <- data$CROPDMG[data$CROPDMGEXP=="B"] * 1000000000
data$CROPDMG[grep("[[:digit:]]",data$CROPDMGEXP)] <- data$CROPDMG[grep("[[:digit:]]",data$CROPDMGEXP)] * 10
As we see in next plot, the number of events in storm data has been increasing since 1950, so we have to keep in mind that records are not uniformly distributed over time.
library(ggplot2)
qplot(data=data,data$YEAR, geom="histogram",binwidth=1,
main = "Number of events by year",alpha=I(.7), col=I("gray"),
xlab = "year", ylab="number of events",
fill=I("skyblue"))
According to NOAA the data recording start from Jan. 1950. At that time they recorded one event type, tornado. They add more events gradually and only from Jan. 1996 they start recording all events type.
Therefore, we have selected only data from 1996 to 2011 for this project.
# select only modern data records
data<-data[data$YEAR>1995,]
This project is the final project for the Reproducible Research Course through the John Hopkins Coursera for Data Science. The purpose of the project is to process raw data into a reproducible analysis in response to two questions with regards to severe weather events in the United States.
These are the conclusions we have reached:
We have two variables to express the affection to population health: FATALITIES and INJURIES. To include both in analisys We have created the following harmful index:
This is a simplification that we have considered necessary to estimate the total harmful with respect to population health including all data available in strom data.
Deeper studies have estimated this with economic equivalences of human life, but this is beyond our goal.
# cALCULATE harmful index
# create new variable with TOTAL harmful respect to population health
data$HAR.INDX <-data$FATALITIES + data$INJURIES * 0.2
To see which is the worst event type (EVTYPE), we first calculate the accumulated sum by event and then we make a plot with the 20 worst (with higher harmful index)
# aggregate
popdmg <- aggregate(HAR.INDX~EVTYPE,data,sum)
# order
popdmg <- popdmg[order(popdmg$HAR.INDX, decreasing=TRUE),]
# first 20
popdmg <- head(popdmg,20)
# print table
kable(popdmg)
| EVTYPE | HAR.INDX | |
|---|---|---|
| 40 | TORNADO | 5644.4 |
| 12 | EXCESSIVE HEAT | 3075.6 |
| 15 | FLOOD | 2074.6 |
| 29 | LIGHTNING | 1480.6 |
| 24 | HIGH WIND | 1431.2 |
| 14 | FLASH FLOOD | 1224.0 |
| 34 | RIP CURRENT | 642.6 |
| 20 | HEAT | 498.2 |
| 47 | WINTER STORM | 465.8 |
| 46 | WILDFIRE | 433.2 |
| 39 | THUNDERSTORM WIND | 412.2 |
| 25 | HURRICANE/TYPHOON | 319.4 |
| 22 | HEAVY SNOW | 281.6 |
| 2 | AVALANCHE | 260.0 |
| 23 | HIGH SURF | 199.4 |
| 38 | STRONG WIND | 169.8 |
| 48 | WINTER WEATHER | 162.0 |
| 26 | ICE STORM | 155.2 |
| 19 | HAIL | 149.6 |
| 13 | EXTREME COLD/WIND CHILL | 147.8 |
# Plot results
barplot (
height = popdmg$HAR.INDX,
main = "Harmful index by Event Type (1996-2011)",
ylab = "Harmful index",
names.arg = popdmg$EVTYPE,
col = rainbow (20),
las = 2,
cex.names= 0.6,
cex.axis = 0.8
)
According to the results the worst phenomena for population health are tornados, heat (excessive heat and heat) and floods. Tornadoes are by far the most harmful.
First of all and although the damages to the poputalion are a fundamental part of the economy, it will not be taken into account in this analysis focusing only on damages to properties and crops.
To evaluate the total economic damage, we will add the two variables that we have in the database that evaluate damage: property damage (PROPDMG) and crop damage (CROPDMG).
As we have seen in the section on data cleanning, these variables have been readjusted to monetary values by multiplying each by a given exponent.
# create new variable with total damage
data$TOTALDMG <-data$PROPDMG + data$CROPDMG
A more in-depth review of the data would be necessary, as only the floods data have a monetary estimation. The Storm Data preparer must enter monetary damage amounts for flood events, even if it is a “guesstimate.” The U.S. Army Corps of Engineers requires the NWS to provide monetary damage amounts (property and/or crop) resulting from any flood event but not for the rest. Therefore only the floods have “good” data.
Another important factor is the updating of costs with inflation, the 1950 cost in $ are not equivalent with actual cost of a dollar.
# aggregate
totdmg <- aggregate(TOTALDMG~EVTYPE,data,sum)
# order
totdmg <- totdmg[order(totdmg$TOTALDMG, decreasing=TRUE),]
# 20 firts
totdmg <- head(totdmg,20)
# billions
totdmg$TOTALDMG<-totdmg$TOTALDMG/1000000000
# print table
kable(totdmg)
| EVTYPE | TOTALDMG | |
|---|---|---|
| 15 | FLOOD | 149.5396750 |
| 25 | HURRICANE/TYPHOON | 71.9137128 |
| 37 | STORM TIDE | 47.8357290 |
| 40 | TORNADO | 24.9003707 |
| 19 | HAIL | 17.0722929 |
| 14 | FLASH FLOOD | 16.7135026 |
| 35 | SEICHE | 14.5564340 |
| 9 | DROUGHT | 14.4154366 |
| 24 | HIGH WIND | 10.9242760 |
| 46 | WILDFIRE | 8.5083096 |
| 42 | TROPICAL STORM | 8.3201866 |
| 39 | THUNDERSTORM WIND | 3.7809854 |
| 26 | ICE STORM | 3.6582520 |
| 47 | WINTER STORM | 1.5446997 |
| 21 | HEAVY RAIN | 1.3389342 |
| 18 | FUNNEL CLOUD | 1.3288675 |
| 17 | FROST/FREEZE | 1.1885160 |
| 29 | LIGHTNING | 0.7525685 |
| 22 | HEAVY SNOW | 0.7069546 |
| 3 | BLIZZARD | 0.5327390 |
# Plot results
barplot (
height = totdmg$TOTALDMG,
main = "Total damage (1996-2011)",
ylab = "bilion $",
names.arg = totdmg$EVTYPE,
col = rainbow (20),
las = 2,
cex.names= 0.6,
cex.axis= 0.6
)
With the data analisys we can conclude that from 2000-2011, floods and hurricanes have had the greatest cumulative impact on the economy.