Synopsis
Data Processing
- Getting data
- Cleaning data
  - Tidying data
Results
- Which types of events are most harmful with respect to population health?
- Across the United States, which types of events have the greatest economic consequences?

Synopsis

Storm Data is an official publication of the National Oceanic and Atmospheric Administration (NOAA) which documents the occurrence of storms and other significant weather phenomena in the United States. Data are public and accessible through the web.

This project analyzes the available data to evaluate which types of meteorological events are the most harmful to the population and which generate the greatest economic losses across the US. First we download the data, then we repair many typos, and finally we subset, calculate and plot to see the results.

Data Processing

Getting data

The starting database can be downloaded from the specified website. After downloading, we decompress and read the file into rawdata:

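    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    # mode = "wb" keeps the compressed file intact on every platform
    download.file(url, "stormdata.csv.bz2", mode = "wb")

    # Read data (read.csv decompresses .bz2 files transparently)
    rawdata <- read.csv("stormdata.csv.bz2")
    dim(rawdata)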

## [1] 902297     37

The raw data has 902297 entries and 37 variables. To reduce memory use, we select only the 8 columns that are necessary for the analysis and store them in a new data frame:

    library(dtplyr)
    library(data.table)
    library(dplyr)

    data <- select(rawdata, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP,BGN_DATE)

We also transform the date column ‘BGN_DATE’ into YEAR for our purposes:

    # convert BGN_DATE to year (year() comes from data.table)
    data$BGN_DATE <- year(strptime(data$BGN_DATE, "%m/%d/%Y"))
    # rename the column to reflect its new content
    colnames(data)[8] <- "YEAR"
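
As a minimal sketch of the year extraction, the same conversion can be tried on a few toy values using base R only. The sample strings below are assumptions about the BGN_DATE layout (month/day/year followed by a time), not rows copied from the data set.

```r
# Toy BGN_DATE values in the "month/day/year time" layout assumed here
bgn_date <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# Parse the date part (the trailing time is ignored), then keep the year
parsed <- as.Date(bgn_date, format = "%m/%d/%Y")
year_only <- as.integer(format(parsed, "%Y"))

print(year_only)
```

This avoids the dependency on data.table for a single helper, at the cost of one extra formatting step.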

Cleaning data

The data contain many typos, so we clean some of them to make a better selection.

The event labels are very inconsistent. Inspecting the values of the “EVTYPE” column shows a mix of uppercase and lowercase letters, inappropriate characters and spelling errors.
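
A quick way to see this kind of inconsistency is to count the distinct labels before and after a simple normalisation. The sketch below uses an invented toy vector (not values taken from the data set) so that it runs on its own.

```r
# Toy event labels imitating the kinds of inconsistencies described above
evtype <- c("TORNADO", "Tornado", " TORNADO", "TORNDAO", "HAIL 075", "Hail")

# Count occurrences of each distinct raw label
raw_counts <- table(evtype)
n_raw <- length(raw_counts)

# Normalising case and surrounding whitespace already merges some variants
normalised <- toupper(trimws(evtype))
n_normalised <- length(unique(normalised))

print(raw_counts)
print(c(raw = n_raw, normalised = n_normalised))
```

Spelling errors and stray digits survive this step, which is why a fuzzy-matching pass is still needed afterwards.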

We will use the amatch function from the stringdist package to correct errors (typos) automatically, mapping each misspelled value to the nearest official value (stored in the event data frame).

    # Correcting the event names

    # Convert to uppercase
    data$EVTYPE <-  toupper(data$EVTYPE)
    data$CROPDMGEXP <-  toupper(data$CROPDMGEXP)
    data$PROPDMGEXP<-  toupper(data$PROPDMGEXP)
    
    # replace punctuation with spaces, then remove digits
    data$EVTYPE <- gsub("[[:punct:]]"," ", data$EVTYPE)
    data$EVTYPE <- gsub("\\d+","", data$EVTYPE )
    
    # fix some errors in EVTYPE
    data$EVTYPE <- gsub("^\\s","", data$EVTYPE) # remove a single leading space
    data$EVTYPE <- gsub("TORNDAO","TORNADO", data$EVTYPE)
    data$EVTYPE <- gsub("TORNADOES","TORNADO", data$EVTYPE)
    data$EVTYPE <- gsub("THUNDERESTORM","THUNDERSTORM", data$EVTYPE)
  
    # map each type to the official ones
    library(stringdist)
    library(knitr)
    
    # official event types, page 6 of the Storm Data documentation

    event<-toupper(c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Freezing Fog","Frost/Freeze","Funnel Cloud","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane/Typhoon","Ice Storm","Lakeshore Flood","Lake-Effect Snow","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather"))
    event<-data.frame(event)
    
    # map all 985 distinct labels (many typos) to the 48 official ones using amatch
    data$EVTYPE<-event$event[amatch(data$EVTYPE,event$event,maxDist = 20)]
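
With maxDist = 20, almost any label will be assigned to some official type, so it can be useful to look at how far each label is from its nearest match. The following is only a sketch: it swaps in base R's adist (generalized Levenshtein distance) for amatch so that it runs without extra packages, and its distances are not guaranteed to equal those of amatch's default method. The labels and the shortened list of official types are illustrative.

```r
# Small subset of the official event types, for illustration only
official <- c("TORNADO", "HAIL", "FLOOD", "THUNDERSTORM WIND")

# Toy cleaned labels: an exact name, a typo, and an unrelated string
labels <- c("TORNADO", "THUNDERSTROM WIND", "SUMMARY OF MAY")

# Edit-distance matrix: one row per label, one column per official type
d <- adist(labels, official)

nearest <- official[apply(d, 1, which.min)]
distance <- apply(d, 1, min)

# A large distance signals a forced match that deserves a manual look
review <- data.frame(label = labels, nearest = nearest, distance = distance)
print(review)
```

Sorting such a table by distance gives a short list of labels whose automatic assignment is worth checking by hand.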

As detailed on page 12 of the National Weather Service Storm Data Documentation, the property and crop damage figures need to be adjusted based on the corresponding multiplier field. More information about this point is available at https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html.

    # recalculate real value of damage
    data$PROPDMG[data$PROPDMGEXP=="K"] <- data$PROPDMG[data$PROPDMGEXP=="K"] * 1000
    data$PROPDMG[data$PROPDMGEXP=="M"] <- data$PROPDMG[data$PROPDMGEXP=="M"] * 1000000
    data$PROPDMG[data$PROPDMGEXP=="B"] <- data$PROPDMG[data$PROPDMGEXP=="B"] * 1000000000
    data$PROPDMG[data$PROPDMGEXP=="+"] <- data$PROPDMG[data$PROPDMGEXP=="+"] * 1
    data$PROPDMG[grep("[[:digit:]]",data$PROPDMGEXP)] <- data$PROPDMG[grep("[[:digit:]]",data$PROPDMGEXP)] * 10
    
    
    data$CROPDMG[data$CROPDMGEXP=="K"] <- data$CROPDMG[data$CROPDMGEXP=="K"] * 1000
    data$CROPDMG[data$CROPDMGEXP=="M"] <- data$CROPDMG[data$CROPDMGEXP=="M"] * 1000000
    data$CROPDMG[data$CROPDMGEXP=="B"] <- data$CROPDMG[data$CROPDMGEXP=="B"] * 1000000000
    data$CROPDMG[grep("[[:digit:]]",data$CROPDMGEXP)] <- data$CROPDMG[grep("[[:digit:]]",data$CROPDMGEXP)] * 10
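
The same adjustment can also be written with a lookup table, which keeps the original amount and stores the converted value in a new column. This is a sketch on invented toy records; the column name PROPDMG_USD is hypothetical, and the multipliers mirror the ones used in the code above (K, M, B, and 10 for any digit, with every other code left unchanged).

```r
# Toy damage records; the values are invented for illustration
dmg <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10, 7),
  PROPDMGEXP = c("K", "M", "B", "", "5"),
  stringsAsFactors = FALSE
)

# Multipliers as applied in the code above
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each row's factor; codes other than K/M/B give NA at this point
factor_by_row <- unname(multiplier[dmg$PROPDMGEXP])
factor_by_row[grepl("[[:digit:]]", dmg$PROPDMGEXP)] <- 10
factor_by_row[is.na(factor_by_row)] <- 1  # blank or other codes: leave as is

dmg$PROPDMG_USD <- dmg$PROPDMG * factor_by_row
print(dmg)
```

Keeping the raw and converted amounts side by side makes it easier to check individual rows after the conversion.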

Tidying data

As the next plot shows, the number of events in Storm Data has been increasing since 1950, so we have to keep in mind that records are not uniformly distributed over time.

    library(ggplot2)

    # histogram of the number of events recorded per year
    ggplot(data, aes(x = YEAR)) +
      geom_histogram(binwidth = 1, fill = "skyblue", colour = "gray", alpha = 0.7) +
      labs(title = "Number of events by year", x = "year", y = "number of events")

According to NOAA, data recording started in January 1950. At that time only one event type, tornado, was recorded. More event types were added gradually, and all event types have been recorded only since January 1996.

Therefore, we have selected only data from 1996 to 2011 for this project.

    # select only modern data records
    data<-data[data$YEAR>1995,]

Results

This project is the final project for the Reproducible Research course in the Johns Hopkins Data Science Specialization on Coursera. The purpose of the project is to process raw data into a reproducible analysis that answers two questions about severe weather events in the United States:

- Which types of severe weather events are the most harmful with respect to population health?
- Which types of severe weather events have the greatest economic consequences?

These are the conclusions we have reached:

Which types of events are most harmful with respect to population health?

We have two variables that express the effect on population health: FATALITIES and INJURIES. To include both in the analysis, we have created the following harmful index:

Harmful index: HAR.INDX = FATALITIES + INJURIES * 0.2

This is a simplification that we considered necessary to estimate the total harm to population health using all the data available in Storm Data.

More detailed studies have estimated this using economic equivalences of human life, but that is beyond our scope.

    # calculate the harmful index
    # create a new variable with the total harm to population health
    data$HAR.INDX <- data$FATALITIES + data$INJURIES * 0.2
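
To make the weighting concrete, here is a small worked example on invented records. Under this index one fatality counts as much as five injuries; the toy data frame and its counts are assumptions for illustration only.

```r
# Toy records; the counts are invented for illustration
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT"),
  FATALITIES = c(2, 0, 5),
  INJURIES   = c(10, 5, 0)
)

# One fatality weighs as much as five injuries under this index
toy$HAR.INDX <- toy$FATALITIES + toy$INJURIES * 0.2

# Accumulate the index by event type
by_type <- aggregate(HAR.INDX ~ EVTYPE, toy, sum)
print(by_type)
```

In this toy case both event types end up with the same total, even though one is driven by fatalities and the other mostly by injuries, which shows how sensitive the ranking is to the chosen weight.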

To see which event type (EVTYPE) is the worst, we first calculate the accumulated sum by event type and then plot the 20 worst (those with the highest harmful index).

    # aggregate
    popdmg <- aggregate(HAR.INDX~EVTYPE,data,sum)
    # order
    popdmg <- popdmg[order(popdmg$HAR.INDX, decreasing=TRUE),]
    # first 20
    popdmg <- head(popdmg,20)
    # print table
    kable(popdmg)

	EVTYPE	HAR.INDX
40	TORNADO	5644.4
12	EXCESSIVE HEAT	3075.6
15	FLOOD	2074.6
29	LIGHTNING	1480.6
24	HIGH WIND	1431.2
14	FLASH FLOOD	1224.0
34	RIP CURRENT	642.6
20	HEAT	498.2
47	WINTER STORM	465.8
46	WILDFIRE	433.2
39	THUNDERSTORM WIND	412.2
25	HURRICANE/TYPHOON	319.4
22	HEAVY SNOW	281.6
2	AVALANCHE	260.0
23	HIGH SURF	199.4
38	STRONG WIND	169.8
48	WINTER WEATHER	162.0
26	ICE STORM	155.2
19	HAIL	149.6
13	EXTREME COLD/WIND CHILL	147.8

    # Plot results
    barplot (
        height = popdmg$HAR.INDX,
        main = "Harmful index by Event Type (1996-2011)",
        ylab = "Harmful index",
        names.arg = popdmg$EVTYPE,
        col = rainbow (20),
        las = 2,
        cex.names= 0.6,
        cex.axis = 0.8
    )

According to the results, the most harmful phenomena for population health are tornadoes, heat (excessive heat and heat) and floods. Tornadoes are by far the most harmful.

Across the United States, which types of events have the greatest economic consequences?

Although harm to the population is a fundamental part of the economy, it is not taken into account in this analysis, which focuses only on damage to property and crops.

To evaluate the total economic damage, we add the two variables in the database that measure damage: property damage (PROPDMG) and crop damage (CROPDMG).

As we have seen in the section on data cleaning, these variables have been converted to monetary values by multiplying each by the factor encoded in its exponent field.

    # create new variable with total damage
    data$TOTALDMG <-data$PROPDMG + data$CROPDMG

A more in-depth review of the data would be necessary, as a monetary estimate is required only for floods. The Storm Data preparer must enter monetary damage amounts for flood events, even if it is a “guesstimate.” The U.S. Army Corps of Engineers requires the NWS to provide monetary damage amounts (property and/or crop) resulting from any flood event, but not for other event types. Therefore only the floods have “good” data.

Another important factor is inflation: the costs are not adjusted for it, so a dollar of damage recorded in 1950 is not equivalent to a dollar today.

    # aggregate
    totdmg <- aggregate(TOTALDMG~EVTYPE,data,sum)
    # order
    totdmg <- totdmg[order(totdmg$TOTALDMG, decreasing=TRUE),]
    # first 20
    totdmg <- head(totdmg,20)
    # convert to billions of dollars
    totdmg$TOTALDMG<-totdmg$TOTALDMG/1000000000
    # print table
    kable(totdmg)

	EVTYPE	TOTALDMG
15	FLOOD	149.5396750
25	HURRICANE/TYPHOON	71.9137128
37	STORM TIDE	47.8357290
40	TORNADO	24.9003707
19	HAIL	17.0722929
14	FLASH FLOOD	16.7135026
35	SEICHE	14.5564340
9	DROUGHT	14.4154366
24	HIGH WIND	10.9242760
46	WILDFIRE	8.5083096
42	TROPICAL STORM	8.3201866
39	THUNDERSTORM WIND	3.7809854
26	ICE STORM	3.6582520
47	WINTER STORM	1.5446997
21	HEAVY RAIN	1.3389342
18	FUNNEL CLOUD	1.3288675
17	FROST/FREEZE	1.1885160
29	LIGHTNING	0.7525685
22	HEAVY SNOW	0.7069546
3	BLIZZARD	0.5327390

    # Plot results
    barplot (
        height = totdmg$TOTALDMG,
        main = "Total damage (1996-2011)",
        ylab = "billion $",
        names.arg = totdmg$EVTYPE,
        col = rainbow (20),
        las = 2,
        cex.names= 0.6,
        cex.axis= 0.6
    )

From this analysis we can conclude that from 1996 to 2011, floods and hurricanes had the greatest cumulative impact on the economy.

Impact of weather events on USA Population Health and Economy

fervilber

April 6, 2017