Title: Analysis of Storm data from US NOAA for effects on population health and economic consequences

Synopsis

By ‘effects on population health’ we mean the counts of fatalities and injuries. By ‘economic consequences’ we mean property and crop damage, expressed as the dollar value of the damages.

Note that we have not accounted for any ‘bias’ in deriving the numbers. For example, fatalities and injuries may not have been documented properly in the 1950s and 1960s, or may have been attributed to other events. Similarly, valuations of damage may have been less rigorous in the early years and may have become more sophisticated and accurate since. We have also not adjusted the crop and property values for inflation.

We also need to note that, for both analyses, we have considered the top 5 events (in terms of health or economic effects) for identifying trends.

Data Processing

  1. We assume that the input file, repdata_data_StormData.csv.bz2, is present in the current directory.
  2. We then open the file with bzfile() using the open=“r” option (read-only) and read the content into the ‘file’ variable with read.csv(). Because read.csv() is given the file name rather than the connection, the connection opened by bzfile() is never used; the warning about closing an unused connection comes from that and can be ignored.
bzfile ("repdata_data_StormData.csv.bz2", open="r")
##                      description                            class 
## "repdata_data_StormData.csv.bz2"                         "bzfile" 
##                             mode                             text 
##                              "r"                           "text" 
##                           opened                         can read 
##                         "opened"                            "yes" 
##                        can write 
##                             "no"
file <- read.csv("repdata_data_StormData.csv.bz2")
## Warning: closing unused connection 5 (repdata_data_StormData.csv.bz2)
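As a minimal sketch of the reading step, the example below shows that read.csv() can read a bzip2-compressed file directly from its path, so a separate bzfile() call is not needed for reading. The temporary file, its two data rows and the names tmp_path and sample_data are illustrative assumptions, not part of the storm data set.

```r
# Build a tiny bzip2-compressed CSV in a temporary location (illustrative data only).
tmp_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp_path, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close(con)  # close explicitly so no unused connection is left open

# read.csv() opens the path itself and decompresses the content while reading.
sample_data <- read.csv(tmp_path, stringsAsFactors = FALSE)
print(sample_data)

unlink(tmp_path)  # remove the temporary file
```

Closing the write connection explicitly is what avoids the "closing unused connection" warning seen above.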
  3. We then keep only the columns relevant to our analysis, i.e., BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG and CROPDMG, in a data frame called df.
  4. We extract the year from BGN_DATE into a temporary ‘tmp’ variable and add it to df as a new column named YEAR.
df <- file[,c("BGN_DATE","EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")]
tmp<- format(as.POSIXlt(strptime(df$BGN_DATE,"%m/%d/%Y %H:%M:%S")),"%Y")
df <- data.frame(df, tmp)
colnames(df) <- c("BGN_DATE","EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG", "YEAR")
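The year extraction can also be sketched so that the year ends up as an integer, which is convenient when it is later used as the x-axis of a trend plot. This is only an illustration on made-up date strings; bgn_date, parsed and year are hypothetical names, and the date layout is assumed to match the format string used above.

```r
# Illustrative BGN_DATE strings in the "%m/%d/%Y %H:%M:%S" layout used above.
bgn_date <- c("4/18/1950 0:00:00", "11/15/1994 0:00:00", "8/29/2005 0:00:00")

# strptime() already returns a POSIXlt object, so no extra conversion is needed.
parsed <- strptime(bgn_date, "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Keep the year as an integer rather than as text or a factor.
year <- as.integer(format(parsed, "%Y"))
print(year)
```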
  5. To answer the first question, i.e., which types of events are most harmful to population health, we total FATALITIES and INJURIES for each EVTYPE and year. As fatalities are more harmful than injuries, we sort by fatalities first and use injuries only to break ties.
library(sqldf)
## Warning: package 'sqldf' was built under R version 3.1.1
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 3.1.1
## Loading required package: proto
## Warning: package 'proto' was built under R version 3.1.1
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 3.1.1
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
## Warning: package 'RSQLite.extfuns' was built under R version 3.1.1
dfh <- sqldf("select EVTYPE, sum(FATALITIES) as sfat, sum(INJURIES) as sinj, YEAR from df group by EVTYPE, YEAR")
## Loading required package: tcltk
dfht <- sqldf("select EVTYPE, sum(sfat) as ssfat, sum(sinj) as ssinj from dfh group by EVTYPE")
dfht <- dfht[order(dfht$ssfat, dfht$ssinj),]
tail(dfht)
##             EVTYPE ssfat ssinj
## 846      TSTM WIND   504  6957
## 453      LIGHTNING   816  5230
## 271           HEAT   937  2100
## 151    FLASH FLOOD   978  1777
## 124 EXCESSIVE HEAT  1903  6525
## 826        TORNADO  5633 91346
  6. As the list above shows, the 5 most harmful events to date are TORNADO, EXCESSIVE HEAT, FLASH FLOOD, HEAT and LIGHTNING, in that order. Note that the list is read from the bottom up, because the sorting was done in ascending order.

  7. Let us now examine the trend in fatalities for these 5 events together.

dfhp <- dfh[dfh$EVTYPE %in% c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING"),]
plot(dfhp$YEAR,dfhp$sfat,type="c", main="Fatalities of 5 top events over the years")
lines(lowess(dfhp$YEAR,dfhp$sfat))

[Plot: fatalities from the top 5 events over the years, with lowess trend line (chunk plot_h)]

  8. Fatalities from these 5 events show a very gradual downward trend since the 1980s. We would expect the same to hold for injuries, because measures taken to reduce fatalities should help reduce injuries too. The following plot of injuries supports this hypothesis.
plot(dfhp$YEAR,dfhp$sinj,type="c", main = "Injuries from top 5 events over the years")
lines(lowess(dfhp$YEAR,dfhp$sinj))

[Plot: injuries from the top 5 events over the years, with lowess trend line (chunk plot_inj)]
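The two-stage aggregation used for the health question (totals per event type, then yearly totals for the top events) can also be sketched in base R with aggregate(), without sqldf. This is an alternative illustration, not the method used in this report; the toy data frame and the names totals, top_events and yearly are assumptions made for the example.

```r
# Illustrative rows only; the real analysis uses the data frame `df` built above.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "HEAT", "HAIL"),
  YEAR       = c(1990L, 1991L, 1990L, 1991L, 1990L),
  FATALITIES = c(5, 2, 3, 1, 0),
  INJURIES   = c(40, 10, 6, 2, 1),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort in descending order: fatalities first, injuries as the tie-breaker.
totals <- totals[order(-totals$FATALITIES, -totals$INJURIES), ]
top_events <- as.character(head(totals$EVTYPE, 2))
print(totals)

# Yearly totals restricted to the top events, ready for a trend plot.
yearly <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE + YEAR,
                    data = toy[toy$EVTYPE %in% top_events, ], FUN = sum)
print(yearly)
```

Sorting in descending order puts the most harmful events at the top, so head() can be used instead of reading tail() from the bottom up.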

  9. To answer the second question, i.e., which types of events have the greatest economic consequences, we total the PROPDMG and CROPDMG figures for each event type and year.
dfe <- sqldf("select EVTYPE, sum(PROPDMG) as sprop, sum(CROPDMG) as scrop, YEAR from df group by EVTYPE, YEAR")
## Loading required package: tcltk
dfet <- sqldf("select EVTYPE, sum(sprop) as ssprop, sum(scrop) as sscrop from dfe group by EVTYPE")
totalDmg <- dfet$ssprop + dfet$sscrop
dfet <- data.frame(dfet,totalDmg)
dfet <- dfet[order(dfet$totalDmg, dfet$ssprop, dfet$sscrop),]
tail(dfet)
##                EVTYPE  ssprop sscrop totalDmg
## 753 THUNDERSTORM WIND  876844  66791   943636
## 167             FLOOD  899938 168038  1067976
## 241              HAIL  688693 579596  1268290
## 846         TSTM WIND 1335966 109203  1445168
## 151       FLASH FLOOD 1420125 179200  1599325
## 826           TORNADO 3212258 100019  3312277
  10. As the list above shows, the 5 events with the most adverse economic consequences are TORNADO, FLASH FLOOD, TSTM WIND, HAIL and FLOOD. Note that the list is read from the bottom up, because the sorting was done in ascending order.

  11. Let us now examine the trend in total damage (i.e., property and crop together) for these 5 events together.

dfep <- dfe[dfe$EVTYPE %in% c("TORNADO", "FLASH FLOOD", "TSTM WIND", "HAIL","FLOOD"),]
plot(dfep$YEAR, (dfep$sprop + dfep$scrop), type="c", main="Property and Crop Damages of top 5 events over the years")
lines(lowess(dfep$YEAR, (dfep$sprop + dfep$scrop)))

[Plot: property and crop damages of the top 5 events over the years, with lowess trend line (chunk plot_e)]

  12. The plot above shows that, since the 1980s, the economic consequences (in terms of dollar value) have been increasing very rapidly. Part of this may be due to inflation, but examining that is outside the scope of the current analysis.
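The damage totals above sum the raw PROPDMG and CROPDMG values. In the NOAA storm data these columns are usually accompanied by exponent columns (PROPDMGEXP and CROPDMGEXP) whose codes indicate a multiplier, and those columns are not among the ones kept in df. The sketch below shows one way such a multiplier could be applied before summing. The toy rows, the K/M/B mapping and the treatment of empty or unknown codes as 1 are all assumptions for illustration, not part of this analysis.

```r
# Illustrative rows; PROPDMGEXP is assumed here and is not a column of `df` above.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL", "HAIL"),
  PROPDMG    = c(25, 1.5, 500, 10),
  PROPDMGEXP = c("K", "B", "M", ""),
  stringsAsFactors = FALSE
)

# Assumed mapping from exponent code to multiplier.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code; empty or unrecognised codes give NA and are treated as 1.
scale <- unname(multiplier[toupper(toy$PROPDMGEXP)])
scale[is.na(scale)] <- 1

# Scaled dollar value per record, then the total per event type.
toy$PROPDMG_USD <- toy$PROPDMG * scale
by_event <- aggregate(PROPDMG_USD ~ EVTYPE, data = toy, FUN = sum)
print(by_event[order(-by_event$PROPDMG_USD), ])
```

Whether and how the ranking of events changes under such scaling would have to be checked against the full data set.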

Results

From our analysis we found that:
1. The effects on population health have reduced over time, though very slowly.
2. The economic consequences of the storms have been increasing rapidly since the 1980s.