Peer Assessment 2 - Storm Data Analysis

This document analyses storms and other severe weather events, which cause both public health and economic problems for communities and municipalities. The project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

The objective of the project is to explore storm and weather related events with respect to their impact on population health and the economy. To measure the impact of weather events on population health, a composite index is introduced: a weighted sum of the fatalities and injuries attributed to each weather event type.

The impact on the economy is assessed with an index that sums the recorded property damage and crop damage values. By both indices, the event type “Tornado” has the biggest impact on population health and on economic loss.

Data Processing

The first step in data processing is to set the working directory for R. This is done as follows:

setwd("D:/R/Reproducible research")

The dataset for the analysis is available through the following URL: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

The data is downloaded and read using the following commands:

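# setInternet2() was a Windows-only setting that recent R versions no longer need, so it is omitted
fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# mode = "wb" keeps the compressed file intact when downloading on Windows
download.file(fileurl, destfile = "StormData.csv.bz2", mode = "wb")

# read.csv() reads the bz2-compressed file directly
stormdata <- read.csv("StormData.csv.bz2")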
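Because the compressed file is large, it can be useful to skip the download when a local copy already exists. The following is a minimal sketch of that idea; the helper name `fetch_once` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): download `url` to
# `destfile` only when the file is not already present.
fetch_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)  # nothing downloaded; the local copy is reused
  }
  # mode = "wb" writes the file in binary mode, which matters on Windows
  download.file(url, destfile = destfile, mode = "wb")
  TRUE
}

fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# fetch_once(fileurl, "StormData.csv.bz2")   # uncomment to run the download
```

The function returns TRUE when it downloaded the file and FALSE when it reused an existing copy, so a script can report which of the two happened.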

The data set has the following variables

names(stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Looking at the variables, “FATALITIES” and “INJURIES” are the ones relevant for assessing the impact on population health. Similarly, “PROPDMG” and “CROPDMG”, which stand for property damage and crop damage, are the pertinent variables for assessing the economic loss.

Let us first assess the impact of weather types on population health.

Effects of weather type on population health

As a first step the “stormdata” data set has to be transformed to take only the data relevant to the analysis. This can be done by subsetting the data set to include only data relevant to health.

pophealth <- stormdata[,c("COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_DATE","FATALITIES","INJURIES")]

head(pophealth)
##   COUNTY COUNTYNAME STATE  EVTYPE           BGN_DATE FATALITIES INJURIES
## 1     97     MOBILE    AL TORNADO  4/18/1950 0:00:00          0       15
## 2      3    BALDWIN    AL TORNADO  4/18/1950 0:00:00          0        0
## 3     57    FAYETTE    AL TORNADO  2/20/1951 0:00:00          0        2
## 4     89    MADISON    AL TORNADO   6/8/1951 0:00:00          0        2
## 5     43    CULLMAN    AL TORNADO 11/15/1951 0:00:00          0        2
## 6     77 LAUDERDALE    AL TORNADO 11/15/1951 0:00:00          0        6

As seen from the dataset, “FATALITIES” and “INJURIES” are the two variables that indicate the impact of a weather type on population health. In order to capture the total impact of these two indicators, we will introduce a new variable called “healthindex”. This variable is a weighted composite score derived from the two variables: a higher weight of 7 is given to “FATALITIES” and a lower weight of 3 is assigned to “INJURIES”. The composite variable is derived as follows.

pophealth$healthindex <- pophealth$FATALITIES*7 + pophealth$INJURIES*3

head(pophealth)
##   COUNTY COUNTYNAME STATE  EVTYPE           BGN_DATE FATALITIES INJURIES
## 1     97     MOBILE    AL TORNADO  4/18/1950 0:00:00          0       15
## 2      3    BALDWIN    AL TORNADO  4/18/1950 0:00:00          0        0
## 3     57    FAYETTE    AL TORNADO  2/20/1951 0:00:00          0        2
## 4     89    MADISON    AL TORNADO   6/8/1951 0:00:00          0        2
## 5     43    CULLMAN    AL TORNADO 11/15/1951 0:00:00          0        2
## 6     77 LAUDERDALE    AL TORNADO 11/15/1951 0:00:00          0        6
##   healthindex
## 1          45
## 2           0
## 3           6
## 4           6
## 5           6
## 6          18
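To make the arithmetic of the weighted index concrete, here is a small sketch on made-up rows. The data frame `toy` and the names `w_fatal` and `w_injury` are illustrative assumptions; only the weights 7 and 3 come from this report.

```r
# Toy rows (made-up values, not taken from the NOAA data)
toy <- data.frame(FATALITIES = c(0, 1, 2), INJURIES = c(15, 0, 4))

# Weights used in this report: 7 per fatality, 3 per injury
w_fatal <- 7
w_injury <- 3
toy$healthindex <- toy$FATALITIES * w_fatal + toy$INJURIES * w_injury

toy$healthindex
# 45 7 26
```

The first toy row mirrors the first record of the storm data (0 fatalities, 15 injuries), which gives 0 * 7 + 15 * 3 = 45.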

The next step is to rank the event types by their impact on population health. This is done by summing the composite score over the “EVTYPE” variable. The sum gives the composite index of an event type across all states and all time periods. The aggregation, together with the first and last rows of the result, is shown below.

aggpophealth <- aggregate(pophealth$healthindex,by=list(pophealth$EVTYPE),FUN=sum,na.rm=T)

names(aggpophealth) <- c("event","index")

head(aggpophealth)
##                   event index
## 1    HIGH SURF ADVISORY     0
## 2         COASTAL FLOOD     0
## 3           FLASH FLOOD     0
## 4             LIGHTNING     0
## 5             TSTM WIND     0
## 6       TSTM WIND (G45)     0
tail(aggpophealth)
##                  event index
## 980 WINTER WEATHER/MIX   412
## 981        WINTERY MIX     0
## 982         Wintry mix     0
## 983         Wintry Mix     0
## 984         WINTRY MIX   238
## 985                WND     0
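The rows above show the same event recorded under several spellings (“Wintry mix”, “Wintry Mix”, “WINTRY MIX”), and each spelling is aggregated separately. The analysis in this report keeps the labels as they are. As a hedged sketch of one possible clean-up, assuming that labels differing only in case or surrounding whitespace mean the same event, the labels could be normalised before aggregating. The toy values are made up.

```r
# Toy data with the kind of spelling variants seen in the output above
toy <- data.frame(
  EVTYPE = c("WINTRY MIX", "Wintry mix", " Wintry Mix", "WND"),
  healthindex = c(10, 2, 3, 0),
  stringsAsFactors = FALSE
)

# Assumption: labels that differ only in case or surrounding whitespace
# refer to the same event type
toy$event <- toupper(trimws(toy$EVTYPE))

# Sum the index per normalised label
merged <- aggregate(toy$healthindex, by = list(event = toy$event), FUN = sum)
names(merged) <- c("event", "index")
merged
```

This does not merge genuinely different spellings such as “WINTERY MIX”, or abbreviations such as “TSTM WIND” versus “THUNDERSTORM WIND”; those would need an explicit mapping.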

Once the composite score has been summed, the next step is to sort the data to identify the event types with the highest impact on population health. After sorting, the top 10 event types are listed below.

library(data.table)

sorted <- aggpophealth[order(aggpophealth[,"index"],decreasing=T),]

select <- sorted[1:10,]

data.table(select)
##                 event  index
##  1:           TORNADO 313469
##  2:    EXCESSIVE HEAT  32896
##  3:         TSTM WIND  24399
##  4:             FLOOD  23657
##  5:         LIGHTNING  21402
##  6:              HEAT  12859
##  7:       FLASH FLOOD  12177
##  8:         ICE STORM   6548
##  9:      WINTER STORM   5405
## 10: THUNDERSTORM WIND   5395

A plot of the data is shown below. The x axis shows the event type and the y axis shows the severity index.

library(ggplot2)
popplot <- ggplot(select,aes(event,index,fill=event))
popplot + geom_bar(stat="identity")+labs(title = "Event type v/s Severity to Population Health ")+labs(x="Event type",y="Severity Index")

[Figure: bar chart, “Event type v/s Severity to Population Health”; x axis: event type, y axis: severity index]
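In a bar chart of this kind, ggplot2 draws a discrete x variable in factor-level order, which is alphabetical by default, rather than by bar height. A minimal sketch, on made-up values, of setting the level order with `reorder()` from the stats package so that the largest index comes first:

```r
# Toy values (made up for illustration)
toy <- data.frame(
  event = c("FLOOD", "TORNADO", "HEAT"),
  index = c(20, 300, 10),
  stringsAsFactors = FALSE
)

# reorder() sets the factor levels by a summary (the mean by default) of the
# second argument; negating the index puts the largest value first
toy$event <- reorder(toy$event, -toy$index)

levels(toy$event)
# "TORNADO" "FLOOD" "HEAT"
```

Passing a data frame prepared this way to the same `ggplot(..., aes(event, index, fill = event))` call would draw the bars in descending order of the index.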

The analysis shows that the weather event “Tornado” has the biggest impact on population health. Let us go a step further and identify the states and counties across the country that are most affected by this event type. To do this, we take only the “Tornado” records from the data set.

tordata <- subset(pophealth,pophealth$EVTYPE=="TORNADO")

head(tordata)
##   COUNTY COUNTYNAME STATE  EVTYPE           BGN_DATE FATALITIES INJURIES
## 1     97     MOBILE    AL TORNADO  4/18/1950 0:00:00          0       15
## 2      3    BALDWIN    AL TORNADO  4/18/1950 0:00:00          0        0
## 3     57    FAYETTE    AL TORNADO  2/20/1951 0:00:00          0        2
## 4     89    MADISON    AL TORNADO   6/8/1951 0:00:00          0        2
## 5     43    CULLMAN    AL TORNADO 11/15/1951 0:00:00          0        2
## 6     77 LAUDERDALE    AL TORNADO 11/15/1951 0:00:00          0        6
##   healthindex
## 1          45
## 2           0
## 3           6
## 4           6
## 5           6
## 6          18
tail(tordata)
##        COUNTY   COUNTYNAME STATE  EVTYPE           BGN_DATE FATALITIES
## 901814     61       JASPER    MS TORNADO 11/16/2011 0:00:00          0
## 901815    143 PITTSYLVANIA    VA TORNADO 11/16/2011 0:00:00          0
## 901821    117       ORANGE    IN TORNADO 11/14/2011 0:00:00          0
## 901826    487    WILBARGER    TX TORNADO  11/7/2011 0:00:00          0
## 901827     65      JACKSON    OK TORNADO  11/7/2011 0:00:00          0
## 901829     81          LEE    AL TORNADO 11/16/2011 0:00:00          0
##        INJURIES healthindex
## 901814        0           0
## 901815        0           0
## 901821        0           0
## 901826        0           0
## 901827        0           0
## 901829        2           6

Once the “Tornado” data has been subsetted, the next step is to summarise the composite index by state and by county. The summarised data is then sorted, and the top ten states and counties across the country most affected by “Tornado” are listed. The results are shown below.

library(plyr)
## summarising based on county name
aggtorcon <- ddply(tordata,c("STATE","COUNTYNAME"),summarise,index=sum(healthindex))
## sorting the data according to the severity index
consort <- aggtorcon[order(aggtorcon[,"index"],decreasing=T),]

Top 10 counties across the country with the biggest impact of Tornado

data.table(consort[1:10,])
##     STATE COUNTYNAME index
##  1:    TX    WICHITA  5892
##  2:    AL  JEFFERSON  5336
##  3:    MO     JASPER  4915
##  4:    MA  WORCESTER  4403
##  5:    OH     GREENE  4084
##  6:    AL TUSCALOOSA  3584
##  7:    MI    GENESEE  3573
##  8:    AL    MADISON  2702
##  9:    TX   MCLENNAN  2628
## 10:    IL       COOK  2583

Identifying the states across the country with the biggest impact of Tornado

aggtorsta <- ddply(tordata,c("STATE"),summarise,index=sum(healthindex))
stasort <- aggtorsta[order(aggtorsta[,"index"],decreasing=T),]

Top 10 states across the country with the biggest impact of Tornado

data.table(stasort[1:10,])
##     STATE index
##  1:    TX 28387
##  2:    AL 28106
##  3:    MS 21882
##  4:    AR 18001
##  5:    TN 16820
##  6:    OK 16559
##  7:    MO 15706
##  8:    OH 14651
##  9:    IN 14436
## 10:    IL 13856
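The grouping and sorting steps above use `ddply` from plyr. As a sketch of the same idea in base R, on made-up rows, `aggregate()` with a formula can compute the per-county and per-state sums. Grouping counties by both STATE and COUNTYNAME matters because county names repeat across states (JASPER, for example, appears above for both MO and MS).

```r
# Toy tornado records (made-up index values)
toy <- data.frame(
  STATE = c("TX", "TX", "AL", "AL"),
  COUNTYNAME = c("WICHITA", "WICHITA", "JEFFERSON", "MADISON"),
  healthindex = c(10, 5, 7, 3),
  stringsAsFactors = FALSE
)

# Sum the index per (state, county) pair, then sort from largest to smallest
by_county <- aggregate(healthindex ~ STATE + COUNTYNAME, data = toy, FUN = sum)
by_county <- by_county[order(by_county$healthindex, decreasing = TRUE), ]

# Sum the index per state
by_state <- aggregate(healthindex ~ STATE, data = toy, FUN = sum)

by_county
by_state
```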

Effects of weather type on Economy

As a first step, the “stormdata” data set is reduced to the data relevant to the analysis. This is done by subsetting the data set to include only the columns relevant to the economy.

ecovalue <- stormdata[,c("COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_DATE","PROPDMG","CROPDMG")]

As seen from the dataset, “PROPDMG” and “CROPDMG” are the two variables that indicate the impact of a weather type on the economy. In order to capture the total impact of these two indicators, we will introduce a new variable called “ecoindex”. This variable is a score derived by summing the property damage and crop damage variables. The values are summed as recorded; the magnitude columns “PROPDMGEXP” and “CROPDMGEXP” are not applied. The ecoindex variable is derived as follows.

ecovalue$ecoindex <- ecovalue$PROPDMG + ecovalue$CROPDMG

head(ecovalue)
##   COUNTY COUNTYNAME STATE  EVTYPE           BGN_DATE PROPDMG CROPDMG
## 1     97     MOBILE    AL TORNADO  4/18/1950 0:00:00    25.0       0
## 2      3    BALDWIN    AL TORNADO  4/18/1950 0:00:00     2.5       0
## 3     57    FAYETTE    AL TORNADO  2/20/1951 0:00:00    25.0       0
## 4     89    MADISON    AL TORNADO   6/8/1951 0:00:00     2.5       0
## 5     43    CULLMAN    AL TORNADO 11/15/1951 0:00:00     2.5       0
## 6     77 LAUDERDALE    AL TORNADO 11/15/1951 0:00:00     2.5       0
##   ecoindex
## 1     25.0
## 2      2.5
## 3     25.0
## 4      2.5
## 5      2.5
## 6      2.5
tail(ecovalue)
##        COUNTY                           COUNTYNAME STATE         EVTYPE
## 902292     21 TNZ001>004 - 019>021 - 048>055 - 088    TN WINTER WEATHER
## 902293      7                         WYZ007 - 017    WY      HIGH WIND
## 902294      9                         MTZ009 - 010    MT      HIGH WIND
## 902295    213                               AKZ213    AK      HIGH WIND
## 902296    202                               AKZ202    AK       BLIZZARD
## 902297      6                               ALZ006    AL     HEAVY SNOW
##                  BGN_DATE PROPDMG CROPDMG ecoindex
## 902292 11/28/2011 0:00:00       0       0        0
## 902293 11/30/2011 0:00:00       0       0        0
## 902294 11/10/2011 0:00:00       0       0        0
## 902295  11/8/2011 0:00:00       0       0        0
## 902296  11/9/2011 0:00:00       0       0        0
## 902297 11/28/2011 0:00:00       0       0        0
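The index above adds the PROPDMG and CROPDMG figures as recorded. The data set also carries magnitude codes in PROPDMGEXP and CROPDMGEXP, which this analysis does not use. As a hedged sketch of how such codes could be turned into dollar amounts, on made-up rows: the helper `exp_multiplier` is hypothetical, the codes K, M and B (thousands, millions, billions) follow the NOAA documentation, and treating every other code as a multiplier of 1 is an assumption.

```r
# Toy rows (made-up values) using the column names from the storm data
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a magnitude code to a multiplier.
# K, M and B are the documented codes; any other code falls back to 1,
# which is an assumption made for this sketch.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

toy$propdollars <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy$propdollars
```

Applying multipliers like these changes the scale of individual records by several orders of magnitude, so a ranking built on dollar amounts may differ from the ranking by the raw index reported here.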

The next step is to rank the event types by the loss they inflict on the economy. This is done by summing the ecoindex score over the “EVTYPE” variable. The sum gives the composite index of an event type across all states and all time periods. The aggregation, together with the first and last rows of the result, is shown below.

aggecoind <- aggregate(ecovalue$ecoindex,by=list(ecovalue$EVTYPE),FUN=sum,na.rm=T)

names(aggecoind) <- c("event","index")

head(aggecoind)
##                   event index
## 1    HIGH SURF ADVISORY   200
## 2         COASTAL FLOOD     0
## 3           FLASH FLOOD    50
## 4             LIGHTNING     0
## 5             TSTM WIND   108
## 6       TSTM WIND (G45)     8
tail(aggecoind)
##                  event  index
## 980 WINTER WEATHER/MIX 4873.5
## 981        WINTERY MIX    0.0
## 982         Wintry mix    0.0
## 983         Wintry Mix    2.5
## 984         WINTRY MIX   10.0
## 985                WND    0.0

Once the ecoindex score has been summed, the next step is to sort the data to identify the event types with the biggest impact on the economy. After sorting, the top 10 event types are listed below.

library(data.table)
ecosorted <- aggecoind[order(aggecoind[,"index"],decreasing=T),]

ecosel <- ecosorted[1:10,]

data.table(ecosel)
##                  event   index
##  1:            TORNADO 3312277
##  2:        FLASH FLOOD 1599325
##  3:          TSTM WIND 1445168
##  4:               HAIL 1268290
##  5:              FLOOD 1067976
##  6:  THUNDERSTORM WIND  943636
##  7:          LIGHTNING  606932
##  8: THUNDERSTORM WINDS  464978
##  9:          HIGH WIND  342015
## 10:       WINTER STORM  134700

A plot of the data is shown below. The x axis shows the event type and the y axis shows the economic index.

library(ggplot2)
ecoplot <- ggplot(ecosel,aes(event,index,fill=event))
ecoplot + geom_bar(stat="identity")+labs(title = "Event type v/s Economic Consequence Index ")+labs(x="Event type",y="Consequence Index")

[Figure: bar chart, “Event type v/s Economic Consequence Index”; x axis: event type, y axis: consequence index]

As with population health, the weather event “Tornado” is the biggest contributor to economic loss as measured by the index. Let us go a step further and identify the states and counties across the country that are most affected by this event type. To do this, we take only the “Tornado” records from the data set.

ecotordata <- subset(ecovalue,ecovalue$EVTYPE=="TORNADO")

Once the “Tornado” data has been subsetted, the next step is to summarise the economic index by state and by county. The summarised data is then sorted, and the top ten states and counties across the country most affected by “Tornado” are listed. The results are shown below.

library(plyr)
library(data.table)
## summarising based on state and county name
ecotorcon <- ddply(ecotordata,c("STATE","COUNTYNAME"),summarise,index=sum(ecoindex))
ecotorsort <- ecotorcon[order(ecotorcon[,"index"],decreasing=T),]

## summarising based on state name
ecotorsta <- ddply(ecotordata,c("STATE"),summarise,index=sum(ecoindex))
ecostasort <- ecotorsta[order(ecotorsta[,"index"],decreasing=T),]

Top 10 counties across the country with the biggest economic loss due to Tornado

data.table(ecotorsort[1:10,])
##     STATE   COUNTYNAME index
##  1:    TX       HARRIS 14137
##  2:    AL    JEFFERSON  8570
##  3:    FL      BREVARD  8559
##  4:    FL HILLSBOROUGH  8516
##  5:    NE       THAYER  8215
##  6:    FL     PINELLAS  7634
##  7:    MS        SMITH  6922
##  8:    FL         POLK  6555
##  9:    TN   RUTHERFORD  6380
## 10:    KS     MITCHELL  6371

Top 10 states across the country with the biggest economic loss due to Tornado

data.table(ecostasort[1:10,])
##     STATE  index
##  1:    TX 287963
##  2:    MS 212805
##  3:    AL 169469
##  4:    OK 165774
##  5:    FL 159901
##  6:    IA 156895
##  7:    GA 155142
##  8:    KS 148692
##  9:    MO 134446
## 10:    LA 134320

Results

The results from the analysis can be summarised as follows:

  1. Top 10 event types having biggest impact on population health
data.table(select)
##                 event  index
##  1:           TORNADO 313469
##  2:    EXCESSIVE HEAT  32896
##  3:         TSTM WIND  24399
##  4:             FLOOD  23657
##  5:         LIGHTNING  21402
##  6:              HEAT  12859
##  7:       FLASH FLOOD  12177
##  8:         ICE STORM   6548
##  9:      WINTER STORM   5405
## 10: THUNDERSTORM WIND   5395
  2. Top 10 states across the country with biggest impact on population health because of Tornado
data.table(stasort[1:10,])
##     STATE index
##  1:    TX 28387
##  2:    AL 28106
##  3:    MS 21882
##  4:    AR 18001
##  5:    TN 16820
##  6:    OK 16559
##  7:    MO 15706
##  8:    OH 14651
##  9:    IN 14436
## 10:    IL 13856
  3. Top 10 counties across the country with biggest impact on population health because of Tornado
data.table(consort[1:10,])
##     STATE COUNTYNAME index
##  1:    TX    WICHITA  5892
##  2:    AL  JEFFERSON  5336
##  3:    MO     JASPER  4915
##  4:    MA  WORCESTER  4403
##  5:    OH     GREENE  4084
##  6:    AL TUSCALOOSA  3584
##  7:    MI    GENESEE  3573
##  8:    AL    MADISON  2702
##  9:    TX   MCLENNAN  2628
## 10:    IL       COOK  2583

  4. Top 10 event types with largest loss to the economy

data.table(ecosel)
##                  event   index
##  1:            TORNADO 3312277
##  2:        FLASH FLOOD 1599325
##  3:          TSTM WIND 1445168
##  4:               HAIL 1268290
##  5:              FLOOD 1067976
##  6:  THUNDERSTORM WIND  943636
##  7:          LIGHTNING  606932
##  8: THUNDERSTORM WINDS  464978
##  9:          HIGH WIND  342015
## 10:       WINTER STORM  134700
  5. Top 10 states across the country with largest loss to economy because of Tornado
data.table(ecostasort[1:10,])
##     STATE  index
##  1:    TX 287963
##  2:    MS 212805
##  3:    AL 169469
##  4:    OK 165774
##  5:    FL 159901
##  6:    IA 156895
##  7:    GA 155142
##  8:    KS 148692
##  9:    MO 134446
## 10:    LA 134320

  6. Top 10 counties across the country with the largest economic loss due to Tornado

data.table(ecotorsort[1:10,])
##     STATE   COUNTYNAME index
##  1:    TX       HARRIS 14137
##  2:    AL    JEFFERSON  8570
##  3:    FL      BREVARD  8559
##  4:    FL HILLSBOROUGH  8516
##  5:    NE       THAYER  8215
##  6:    FL     PINELLAS  7634
##  7:    MS        SMITH  6922
##  8:    FL         POLK  6555
##  9:    TN   RUTHERFORD  6380
## 10:    KS     MITCHELL  6371