Determining Worst Weather Events With Regard To Public Health And Economy

Working assumptions

We examine the data set contained in the file repdata_data_StormData.csv.bz2, which we assume is in the current working directory. To verify that the data file is identical to the one used in our analysis, we use the md5sum function from the tools package. In our analysis, md5sum returns the hash df4aa61fff89427db6b7f7b1113b5553.

library(tools)
md5sum("repdata_data_StormData.csv.bz2")
##     repdata_data_StormData.csv.bz2 
## "df4aa61fff89427db6b7f7b1113b5553"
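
The comparison against the expected hash can also be made explicit, so that a mismatch stops the analysis instead of relying on a visual check. The following is a minimal sketch; the helper name verify_md5 is hypothetical and not part of the original analysis, and the demonstration uses an empty temporary file because its MD5 hash is well known.

```r
library(tools)

# Hypothetical helper: stop with an error if the file's MD5 hash differs
# from the expected value; return TRUE invisibly otherwise.
verify_md5 <- function(path, expected) {
  actual <- unname(md5sum(path))   # md5sum() returns NA for unreadable files
  if (is.na(actual) || !identical(actual, tolower(expected))) {
    stop("MD5 mismatch for ", path, ": got ", actual)
  }
  invisible(TRUE)
}

# Demonstration on an empty temporary file.
empty_file <- tempfile()
file.create(empty_file)
verify_md5(empty_file, "d41d8cd98f00b204e9800998ecf8427e")
```

For the storm data, the call would be verify_md5("repdata_data_StormData.csv.bz2", "df4aa61fff89427db6b7f7b1113b5553").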

Introduction

We seek to examine the given historical data and make recommendations describing the weather events which pose the most risk to public health and the most risk to the economy. The data are a set of weather event records taken from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We will examine all of the columns in the provided data and determine which columns are most relevant to public health and which are most relevant to the economy. Once these columns have been identified, we will compute the total effects for each weather event type. We will also examine the commonality of each weather event to see if the likelihood of an event needs to be taken into consideration before making any recommendations.

Data Processing

To begin, we read the data from the given file and print out the names of the columns in the data.

stormdata <- read.table("repdata_data_StormData.csv.bz2", sep=",",header=TRUE)
print(names(stormdata))
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

From the list of columns, we see that we will likely be most interested in the column EVTYPE, which identifies the event type; the columns FATALITIES and INJURIES, which measure public health impact; and the columns PROPDMG and CROPDMG, which pertain to the economic question.
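
The column list also includes PROPDMGEXP and CROPDMGEXP, which carry magnitude codes for the two damage columns. The analysis in this report works with the PROPDMG and CROPDMG values as recorded. For readers who want dollar amounts, the following is a minimal sketch, assuming that the codes K, M, and B stand for thousands, millions, and billions and treating any other code as a multiplier of 1. The helper name exp_multiplier is hypothetical.

```r
# Hypothetical helper: map a magnitude code to a numeric multiplier.
# Assumption: K = 1e3, M = 1e6, B = 1e9 (case-insensitive); every other
# code, including the empty string, is treated as a multiplier of 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Invented example values: 2.5 with code "K" becomes 2500, and so on.
damage <- c(2.5, 1, 3, 7)
codes  <- c("K", "m", "B", "")
dollars <- damage * exp_multiplier(codes)
print(dollars)
```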

To reduce redundancy in the rows caused by similar names for events, we replace the EVTYPE column with a version in which every event name is in lower case. We do this because some event records are written in all caps, others with only the first letter capitalized, and others with no letters capitalized, and these variants sometimes refer to the same event type. Converting all event type characters to lower case eliminates this duplication.

stormdata$EVTYPE <- tolower(stormdata$EVTYPE)
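
The effect of this step can be seen on a small example. The labels below are invented for illustration and are not taken from the data set.

```r
# Toy event labels that differ only in capitalization.
ev <- c("TORNADO", "Tornado", "tornado", "Hail", "HAIL")

n_before <- length(unique(ev))           # distinct labels as written
n_after  <- length(unique(tolower(ev)))  # distinct labels after lower-casing
print(c(before = n_before, after = n_after))
```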

We notice that some repetition of events still remains. For example, there are rows corresponding to “winter weather mix” as well as “winter weather/mix”. We hope that such cases will not affect the outcome of our findings too greatly.
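
Variants that differ only in punctuation or spacing could be merged with a further normalization step. The following is a sketch of one possible rule, not something applied in this analysis; the helper name normalize_evtype is hypothetical.

```r
# Hypothetical helper: lower-case the label, replace every run of characters
# other than letters and digits with a single space, and trim the ends.
normalize_evtype <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", " ", x)
  trimws(x)
}

labels <- c("winter weather mix", "winter weather/mix", " Winter Weather/Mix ")
merged <- unique(normalize_evtype(labels))
print(merged)
```

A rule like this would not merge abbreviations such as "tstm wind" and "thunderstorm wind", which need a hand-written mapping.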

We are now in a position to filter our data down to the results that we would like to focus upon. We will create two new data frames. The first, total_public_health, contains four columns: EVTYPE, fatalities, injuries, and log_sum_casualty. Each row consists of a weather event type, the total number of fatalities for that type, the total number of injuries for that type, and the logarithm of the sum of those two totals (plus 1, to avoid taking the logarithm of zero). We form a second data frame, total_economic_health, in the same way from the property damage and crop damage columns.

library(plyr)
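# Per event type: total fatalities, total injuries, and log(total casualties + 1).
# summarize evaluates its arguments in order, so the lower-case names in the
# last argument refer to the totals just computed, not to the raw columns.
total_public_health <- ddply(stormdata, .(EVTYPE), summarize,
                             fatalities = sum(FATALITIES),
                             injuries = sum(INJURIES),
                             log_sum_casualty = log(fatalities + injuries + 1))
# Same construction for property damage and crop damage.
total_economic_health <- ddply(stormdata, .(EVTYPE), summarize,
                               propdmg = sum(PROPDMG),
                               cropdmg = sum(CROPDMG),
                               log_sum_damage = log(propdmg + cropdmg + 1))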
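
To make the construction concrete, the same per-type totals and log score can be reproduced on a toy data frame. This sketch uses base R's aggregate in place of ddply so that it runs without plyr; the values are invented.

```r
# Toy data with the same column names as the storm data (values invented).
toy <- data.frame(
  EVTYPE     = c("tornado", "tornado", "hail", "fog"),
  FATALITIES = c(2, 1, 0, 0),
  INJURIES   = c(10, 5, 3, 0)
)

# Sum fatalities and injuries within each event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Log score: log(total casualties + 1), so a type with no casualties scores 0.
totals$log_sum_casualty <- log(totals$FATALITIES + totals$INJURIES + 1)
print(totals)
```

As a check against the reported results, log(5633 + 91346 + 1) evaluates to about 11.48226, which is the log_sum_casualty value shown for tornado.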

Results

We begin by trimming our data so that we exclude from consideration all event types which have zero contribution towards casualties or damage. Because log(0 + 1) = 0, these are exactly the rows whose log score is zero.

trimmed_public_health <- total_public_health[total_public_health$log_sum_casualty > 0, ]
trimmed_economic_health <- total_economic_health[total_economic_health$log_sum_damage > 0, ]

We will identify our most significant events with respect to public health and economic health by computing the mean and standard deviation of the log_sum_casualty column and the log_sum_damage column in the trimmed data frames. Once we have these values, we will search for all log_sum_casualty values that lie more than two standard deviations above the mean. We apply the same test to log_sum_damage to find the most significant weather events with respect to economic impact.

mean_casualty <- mean( trimmed_public_health$log_sum_casualty, na.rm=TRUE )
mean_damage <- mean( trimmed_economic_health$log_sum_damage, na.rm=TRUE )
sd_casualty <- sd( trimmed_public_health$log_sum_casualty, na.rm=TRUE)
sd_damage <- sd( trimmed_economic_health$log_sum_damage, na.rm=TRUE )
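
The test itself is a one-line comparison. The following sketch wraps it in a hypothetical helper, above_threshold, and applies it to invented scores with one clear outlier.

```r
# Hypothetical helper: TRUE where a value lies more than k standard
# deviations above the mean of x.
above_threshold <- function(x, k = 2) {
  x > mean(x, na.rm = TRUE) + k * sd(x, na.rm = TRUE)
}

scores <- c(rep(1, 20), 10)   # invented scores; the last one is the outlier
flagged <- which(above_threshold(scores))
print(flagged)
```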

Now that we have computed the means and standard deviations, we look for all data points that lie more than two standard deviations above the mean. In the following plots, the mean is indicated in blue and the threshold of two standard deviations above the mean in red. Thus, the data points that lie above the red line are the weather events of most significance.

par(mfrow=c(1,2))
plot(trimmed_public_health$log_sum_casualty, 
     type='p', 
     ylab="log(total casualties + 1)" )
abline(h=mean_casualty,col='blue')
abline(h=mean_casualty+2*sd_casualty,col='red')
plot(trimmed_economic_health$log_sum_damage, 
     type='p', 
     ylab="log(total damage + 1)" )
abline(h=mean_damage,col='blue')
abline(h=mean_damage+2*sd_damage,col='red')

Histograms of the same values may be easier to read. Here the events of most significance lie to the right of the red line:

par(mfrow=c(1,2))
hist(trimmed_public_health$log_sum_casualty, 
        xlab='log score', 
        ylab='frequency',
        main="Health")
abline(v=mean_casualty + 2.0*sd_casualty, col='red')
abline(v=mean_casualty, col='blue')
hist(trimmed_economic_health$log_sum_damage, 
     xlab='log score', 
     ylab='frequency',
     main="Economy")
abline(v=mean_damage + 2.0*sd_damage, col='red')
abline(v=mean_damage, col='blue')

Finally, we explicitly subset out all of the values that lie more than two standard deviations above the mean and sort them in decreasing order of log score.

major_health_hazards <- trimmed_public_health[
        trimmed_public_health$log_sum_casualty > (mean_casualty + 2.0*sd_casualty),]
major_economic_hazards <- trimmed_economic_health[
        trimmed_economic_health$log_sum_damage > (mean_damage + 2.0*sd_damage),]
major_health_hazards <- major_health_hazards[
        order(-major_health_hazards$log_sum_casualty),]
major_economic_hazards <- major_economic_hazards[
        order(-major_economic_hazards$log_sum_damage),]
head(major_health_hazards)
##             EVTYPE fatalities injuries log_sum_casualty
## 758        tornado       5633    91346        11.482260
## 116 excessive heat       1903     6525         9.039433
## 779      tstm wind        504     6957         8.917579
## 154          flood        470     6789         8.890135
## 418      lightning        816     5230         8.707318
## 243           heat        937     2100         8.018955
head(major_economic_hazards)
##                EVTYPE   propdmg   cropdmg log_sum_damage
## 758           tornado 3212258.2 100018.52       15.01315
## 138       flash flood 1420124.6 179200.46       14.28509
## 779         tstm wind 1335995.6 109202.60       14.18376
## 212              hail  688693.4 579596.28       14.05318
## 154             flood  899938.5 168037.88       13.88128
## 685 thunderstorm wind  876844.2  66791.45       13.75750

As we can see, the top three most damaging weather events with respect to human health are:

major_health_hazards[1:3,]
##             EVTYPE fatalities injuries log_sum_casualty
## 758        tornado       5633    91346        11.482260
## 116 excessive heat       1903     6525         9.039433
## 779      tstm wind        504     6957         8.917579

and the top three most damaging weather events with respect to economic health are:

major_economic_hazards[1:3,]
##          EVTYPE propdmg  cropdmg log_sum_damage
## 758     tornado 3212258 100018.5       15.01315
## 138 flash flood 1420125 179200.5       14.28509
## 779   tstm wind 1335996 109202.6       14.18376