Synopsis

Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents the occurrence of storms and other significant weather phenomena in the United States. Data are public and accessible through the web.

This project performs an analysis of the available data to evaluate what type of meteorological events are the most harmful to the population and which generate the greatest economic losses across the US. First we download data, then we repare lots of typos, and later we subset, calculate and plot to see the results.

Data Processing

Getting data

The starting database can be downloaded from the specified website. After downloading, we decompressed and read as rawdata

    url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(url, 'stormdata.csv.bz2', method = "wininet")
    # Read data
    rawdata<- read.csv('stormdata.csv.bz2')
    dim(rawdata)
## [1] 902297     37

The rawdata has 902297 entries and 37 variables. To reduce memory we select only 8 columns that are necessary for the analisys and store in a new data frame:

    library(dtplyr)
    library(data.table)
    library(dplyr)
    data <- select(rawdata, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP,BGN_DATE)

We also transform the date column ‘BGN_DATE’ into YEAR for our purposes:

    # conver BGN_DATE in year
    data$BGN_DATE<-year(strptime((data$BGN_DATE), "%m/%d/%Y"))
    #names(data)[8]="YEAR"
    colnames(data)[8] <- "YEAR"

Cleaning data

As we see there are a lot of typos in the data, we will clean some of them to make a better selection.

There is a huge disparity of cases, for example with the following code we see the names of column “EVTYPE”, and that there are upper and lowercase letters, inappropriate characters and errors.

We will use the amatchfunction to automatically correct errors (typos) and assimilate each typo value to the nearest official value (event data frame)

    # Correcting the event names

    # Convert to uppercase
    data$EVTYPE <-  toupper(data$EVTYPE)
    data$CROPDMGEXP <-  toupper(data$CROPDMGEXP)
    data$PROPDMGEXP<-  toupper(data$PROPDMGEXP)
    
    # first remove number from type
    data$EVTYPE <- gsub("[[:punct:]]"," ", data$EVTYPE)
    data$EVTYPE <- gsub("\\d+","", data$EVTYPE )
    
    # fix some errors in EVTYPE
    data$EVTYPE <- gsub("^\\s","", data$EVTYPE) # first free space
    data$EVTYPE <- gsub("TORNDAO","TORNADO", data$EVTYPE)
    data$EVTYPE <- gsub("TORNADOES","TORNADO", data$EVTYPE)
    data$EVTYPE <- gsub("THUNDERESTORM","THUNDERSTORM", data$EVTYPE)
  
    # lets assimilate each type to official ones     
    library(stringdist)
    library(knitr)
    
    #OFFICIAL EVENTS TYPE pag 6 storm data documentation

    event<-toupper(c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Freezing Fog","Frost/Freeze","Funnel Cloud","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane/Typhoon","Ice Storm","Lakeshore Flood","Lake-Effect Snow","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather"))
    event<-data.frame(event)
    
    # take all 985 levels (lots of typos) to 48 official ones using amatch
    data$EVTYPE<-event$event[amatch(data$EVTYPE,event$event,maxDist = 20)]

As detailed on Page 12 of the National Weather Service Storm Data Documentation, the property and crop damage data needs to be adjusted based on the corresponding multiplier field. More information about this point is available in https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html.

    # recalculate real value of damage
    data$PROPDMG[data$PROPDMGEXP=="K"] <- data$PROPDMG[data$PROPDMGEXP=="K"] * 1000
    data$PROPDMG[data$PROPDMGEXP=="M"] <- data$PROPDMG[data$PROPDMGEXP=="M"] * 1000000
    data$PROPDMG[data$PROPDMGEXP=="B"] <- data$PROPDMG[data$PROPDMGEXP=="B"] * 1000000000
    data$PROPDMG[data$PROPDMGEXP=="+"] <- data$PROPDMG[data$PROPDMGEXP=="+"] * 1
    data$PROPDMG[grep("[[:digit:]]",data$PROPDMGEXP)] <- data$PROPDMG[grep("[[:digit:]]",data$PROPDMGEXP)] * 10
    
    
    data$CROPDMG[data$CROPDMGEXP=="K"] <- data$CROPDMG[data$CROPDMGEXP=="K"] * 1000
    data$CROPDMG[data$CROPDMGEXP=="M"] <- data$CROPDMG[data$CROPDMGEXP=="M"] * 1000000
    data$CROPDMG[data$CROPDMGEXP=="B"] <- data$CROPDMG[data$CROPDMGEXP=="B"] * 1000000000
    data$CROPDMG[grep("[[:digit:]]",data$CROPDMGEXP)] <- data$CROPDMG[grep("[[:digit:]]",data$CROPDMGEXP)] * 10

tidying data

As we see in next plot, the number of events in storm data has been increasing since 1950, so we have to keep in mind that records are not uniformly distributed over time.

    library(ggplot2)

    qplot(data=data,data$YEAR, geom="histogram",binwidth=1,
      main = "Number of events by year",alpha=I(.7), col=I("gray"),
      xlab = "year", ylab="number of events",
      fill=I("skyblue"))

According to NOAA the data recording start from Jan. 1950. At that time they recorded one event type, tornado. They add more events gradually and only from Jan. 1996 they start recording all events type.

Therefore, we have selected only data from 1996 to 2011 for this project.

    # select only modern data records
    data<-data[data$YEAR>1995,]

Results

This project is the final project for the Reproducible Research Course through the John Hopkins Coursera for Data Science. The purpose of the project is to process raw data into a reproducible analysis in response to two questions with regards to severe weather events in the United States.

  1. Which types of severe weather events are the most harmful with respect to population health?
  2. Which types of severe weather events have the greatest economic consequences?

These are the conclusions we have reached:

Which types of events are most harmful with respect to population health?

We have two variables to express the affection to population health: FATALITIES and INJURIES. To include both in analisys We have created the following harmful index:

  • harmful index – HAR.INDX == FATALITIES + INJURIES * 0.2

This is a simplification that we have considered necessary to estimate the total harmful with respect to population health including all data available in strom data.

Deeper studies have estimated this with economic equivalences of human life, but this is beyond our goal.

    # cALCULATE harmful index
    # create new variable with TOTAL harmful respect to population health
    data$HAR.INDX <-data$FATALITIES + data$INJURIES * 0.2

To see which is the worst event type (EVTYPE), we first calculate the accumulated sum by event and then we make a plot with the 20 worst (with higher harmful index)

    # aggregate
    popdmg <- aggregate(HAR.INDX~EVTYPE,data,sum)
    # order
    popdmg <- popdmg[order(popdmg$HAR.INDX, decreasing=TRUE),]
    # first 20
    popdmg <- head(popdmg,20)
    # print table
    kable(popdmg)
EVTYPE HAR.INDX
40 TORNADO 5644.4
12 EXCESSIVE HEAT 3075.6
15 FLOOD 2074.6
29 LIGHTNING 1480.6
24 HIGH WIND 1431.2
14 FLASH FLOOD 1224.0
34 RIP CURRENT 642.6
20 HEAT 498.2
47 WINTER STORM 465.8
46 WILDFIRE 433.2
39 THUNDERSTORM WIND 412.2
25 HURRICANE/TYPHOON 319.4
22 HEAVY SNOW 281.6
2 AVALANCHE 260.0
23 HIGH SURF 199.4
38 STRONG WIND 169.8
48 WINTER WEATHER 162.0
26 ICE STORM 155.2
19 HAIL 149.6
13 EXTREME COLD/WIND CHILL 147.8
    # Plot results
    barplot (
        height = popdmg$HAR.INDX,
        main = "Harmful index by Event Type (1996-2011)",
        ylab = "Harmful index",
        names.arg = popdmg$EVTYPE,
        col = rainbow (20),
        las = 2,
        cex.names= 0.6,
        cex.axis = 0.8
    )

According to the results the worst phenomena for population health are tornados, heat (excessive heat and heat) and floods. Tornadoes are by far the most harmful.

Across the United States, which types of events have the greatest economic consequences?

First of all and although the damages to the poputalion are a fundamental part of the economy, it will not be taken into account in this analysis focusing only on damages to properties and crops.

To evaluate the total economic damage, we will add the two variables that we have in the database that evaluate damage: property damage (PROPDMG) and crop damage (CROPDMG).

As we have seen in the section on data cleanning, these variables have been readjusted to monetary values by multiplying each by a given exponent.

    # create new variable with total damage
    data$TOTALDMG <-data$PROPDMG + data$CROPDMG

A more in-depth review of the data would be necessary, as only the floods data have a monetary estimation. The Storm Data preparer must enter monetary damage amounts for flood events, even if it is a “guesstimate.” The U.S. Army Corps of Engineers requires the NWS to provide monetary damage amounts (property and/or crop) resulting from any flood event but not for the rest. Therefore only the floods have “good” data.

Another important factor is the updating of costs with inflation, the 1950 cost in $ are not equivalent with actual cost of a dollar.

    # aggregate
    totdmg <- aggregate(TOTALDMG~EVTYPE,data,sum)
    # order
    totdmg <- totdmg[order(totdmg$TOTALDMG, decreasing=TRUE),]
    # 20 firts
    totdmg <- head(totdmg,20)
    # billions
    totdmg$TOTALDMG<-totdmg$TOTALDMG/1000000000
    # print table
    kable(totdmg)
EVTYPE TOTALDMG
15 FLOOD 149.5396750
25 HURRICANE/TYPHOON 71.9137128
37 STORM TIDE 47.8357290
40 TORNADO 24.9003707
19 HAIL 17.0722929
14 FLASH FLOOD 16.7135026
35 SEICHE 14.5564340
9 DROUGHT 14.4154366
24 HIGH WIND 10.9242760
46 WILDFIRE 8.5083096
42 TROPICAL STORM 8.3201866
39 THUNDERSTORM WIND 3.7809854
26 ICE STORM 3.6582520
47 WINTER STORM 1.5446997
21 HEAVY RAIN 1.3389342
18 FUNNEL CLOUD 1.3288675
17 FROST/FREEZE 1.1885160
29 LIGHTNING 0.7525685
22 HEAVY SNOW 0.7069546
3 BLIZZARD 0.5327390
    # Plot results
    barplot (
        height = totdmg$TOTALDMG,
        main = "Total damage (1996-2011)",
        ylab = "bilion $",
        names.arg = totdmg$EVTYPE,
        col = rainbow (20),
        las = 2,
        cex.names= 0.6,
        cex.axis= 0.6
    )

With the data analisys we can conclude that from 2000-2011, floods and hurricanes have had the greatest cumulative impact on the economy.