We examine the data set contained in the file repdata_data_StormData.csv.bz2, which we assume is in the current working directory. To verify that the data file is the same one used in our analysis, we use the md5sum function from the tools package. In our analysis, md5sum returns the hash df4aa61fff89427db6b7f7b1113b5553.
library(tools)
md5sum("repdata_data_StormData.csv.bz2")
## repdata_data_StormData.csv.bz2
## "df4aa61fff89427db6b7f7b1113b5553"
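The check above can also be made to fail loudly instead of relying on a visual comparison. The following is a minimal sketch, not part of the original analysis: verify_md5 is a hypothetical helper name, and the demonstration runs on a temporary file so that it does not depend on the storm data being present.

```r
library(tools)

# Hypothetical helper (an assumption, not from the original analysis):
# stop with an error if the file's MD5 hash differs from the expected one.
verify_md5 <- function(path, expected) {
  actual <- unname(md5sum(path))   # md5sum returns NA for unreadable files
  if (is.na(actual)) stop("file not found: ", path)
  if (!identical(actual, expected)) {
    stop("MD5 mismatch for ", path, ": got ", actual)
  }
  invisible(TRUE)
}

# Self-contained demonstration on a temporary file.
demo_file <- tempfile(fileext = ".txt")
writeLines("hello", demo_file)
demo_hash <- unname(md5sum(demo_file))
verify_md5(demo_file, demo_hash)   # passes silently
```

For the storm data, the call would be verify_md5("repdata_data_StormData.csv.bz2", "df4aa61fff89427db6b7f7b1113b5553"), placed before the file is read.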
We seek to examine the given historical data and make recommendations describing the weather events that pose the greatest risk to public health and the greatest risk to the economy. The data are a set of weather event records taken from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We will examine all of the columns in the provided data and determine which are most relevant to public health and which are most relevant to the economy. Once these columns have been identified, we will compute the total effect of each weather event type. We will also examine the commonality of each weather event to see whether the likelihood of an event needs to be taken into consideration before making any recommendations.
To begin, we read the data from the given file and print out the names of the columns in the data.
stormdata <- read.table("repdata_data_StormData.csv.bz2", sep=",",header=TRUE)
print(names(stormdata))
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
From the list of columns, we see that we will likely be most interested in the column EVTYPE, which identifies the event type; the columns FATALITIES and INJURIES, which determine the public health impact; and the columns PROPDMG and CROPDMG, which pertain to the economic question.
In an attempt to reduce some redundancy in the rows due to similar names for events, we will replace the EVTYPE column with the column consisting of events in all lower case. We do this because we noticed that there are some event records written in all caps, others with the first letter capitalized, and others with no letters capitalized. Sometimes these refer to the same event type. Replacing all of the event type characters with lower case characters will eliminate this duplication.
stormdata$EVTYPE <- tolower(stormdata$EVTYPE)
We notice that some repetition of events remains. For example, there are rows for “winter weather mix” as well as “winter weather/mix”. We hope that these cases will not affect the outcome of our findings too greatly.
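One possible way to reduce this kind of repetition further is to collapse punctuation and spacing as well as case. The sketch below is illustrative only and was not applied in this analysis; normalize_evtype is a hypothetical helper name, and the labels are made-up examples.

```r
# Hypothetical helper (an assumption, not used in the analysis):
# collapse differences in case, punctuation, and spacing.
normalize_evtype <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", " ", x)   # runs of punctuation/whitespace become one space
  trimws(x)                         # drop leading and trailing spaces
}

labels <- c("Winter Weather Mix", "WINTER WEATHER/MIX", " winter weather mix ")
unique(normalize_evtype(labels))
```

A rule of this kind only merges labels that differ in formatting. It would not merge abbreviations with their expansions, such as “tstm wind” and “thunderstorm wind”, which would need an explicit lookup table.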
We are now in a position to reduce our data to the quantities that we would like to focus upon. We will create two new data frames. The first, total_public_health, contains four columns: EVTYPE, fatalities, injuries, and log_sum_casualty. Each row consists of a weather event type, the total number of fatalities for that type, the total number of injuries for that type, and the natural logarithm of the sum of those two totals (plus 1, so that the logarithm is defined when the total is zero). The second data frame, total_economic_health, is constructed in the same way from the property and crop damage columns.
library(plyr)
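# Per-event totals; the log score is log(total fatalities + total injuries + 1).
# plyr's summarize evaluates columns in order, so fatalities and injuries
# below refer to the per-event totals just computed.
total_public_health <- ddply(stormdata, .(EVTYPE), summarize,
                             fatalities = sum(FATALITIES),
                             injuries = sum(INJURIES),
                             log_sum_casualty = log(fatalities + injuries + 1))
# The same construction for property and crop damage.
total_economic_health <- ddply(stormdata, .(EVTYPE), summarize,
                               propdmg = sum(PROPDMG),
                               cropdmg = sum(CROPDMG),
                               log_sum_damage = log(propdmg + cropdmg + 1))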
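To make the shape of these data frames concrete, here is a small sketch of the same per-event aggregation on made-up records, using base R's aggregate instead of plyr. The toy values and the names toy and toy_totals are assumptions for illustration only.

```r
# Toy records standing in for the storm data (values are made up).
toy <- data.frame(
  EVTYPE     = c("tornado", "tornado", "hail", "fog"),
  FATALITIES = c(2, 1, 0, 0),
  INJURIES   = c(10, 5, 1, 0)
)

# Per-event totals with base R rather than plyr.
toy_totals <- aggregate(toy[c("FATALITIES", "INJURIES")],
                        by = list(EVTYPE = toy$EVTYPE), FUN = sum)
names(toy_totals) <- c("EVTYPE", "fatalities", "injuries")

# log(total + 1); log1p(x) computes log(1 + x).
toy_totals$log_sum_casualty <- log1p(toy_totals$fatalities + toy_totals$injuries)
toy_totals
```

In this toy data, “fog” has no casualties and so receives a log score of exactly 0, while every event with at least one casualty receives a positive score.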
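Next, we trim the data so that we exclude from consideration all event types that contribute nothing to casualties or damage. Because the log score is the logarithm of the total plus 1, an event type with a total of zero has a log score of exactly 0, so we keep only the rows with a positive log score.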
trimmed_public_health <- total_public_health[total_public_health$log_sum_casualty > 0, ]
trimmed_economic_health <- total_economic_health[total_economic_health$log_sum_damage > 0, ]
We will identify the most significant events with respect to public health and economic health by computing the mean and standard deviation of the log_sum_casualty column and the log_sum_damage column in the trimmed data frames. Once we have these values, we will search for all log_sum_casualty values that lie more than two standard deviations above the mean. We apply the same simple test to log_sum_damage to find the most significant weather events with respect to economic impact.
mean_casualty <- mean( trimmed_public_health$log_sum_casualty, na.rm=TRUE )
mean_damage <- mean( trimmed_economic_health$log_sum_damage, na.rm=TRUE )
sd_casualty <- sd( trimmed_public_health$log_sum_casualty, na.rm=TRUE)
sd_damage <- sd( trimmed_economic_health$log_sum_damage, na.rm=TRUE )
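As a small worked example of this rule, the sketch below applies the same threshold, the mean plus two standard deviations, to a made-up vector of scores. The numbers are illustrative assumptions and are not taken from the storm data.

```r
# Made-up log scores; one value is clearly larger than the rest.
scores <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 12)

# One-sided threshold: mean plus two sample standard deviations.
# sd() uses the n - 1 denominator.
threshold <- mean(scores) + 2 * sd(scores)

# Only scores strictly above the threshold are flagged.
flagged <- scores[scores > threshold]
flagged
```

Here the mean is 3.9 and the standard deviation is about 3.07, so the threshold is about 10.04 and only the value 12 is flagged.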
Now that we have computed the means and standard deviations, we look for all events whose log score lies more than two standard deviations above the mean. In the following plots, the blue line marks the mean and the red line marks the threshold of two standard deviations above the mean. Thus, the points that lie above the red line are the weather events of most significance.
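par(mfrow=c(1,2))   # two plots side by side
# Log casualty score per event type (plotted against row index)
plot(trimmed_public_health$log_sum_casualty,
     type='p',
     ylab="log total casualties" )
abline(h=mean_casualty,col='blue')                 # mean
abline(h=mean_casualty+2*sd_casualty,col='red')    # mean + 2 sd
# Log damage score per event type
plot(trimmed_economic_health$log_sum_damage,
     type='p',
     ylab="log total damage" )
abline(h=mean_damage,col='blue')                   # mean
abline(h=mean_damage+2*sd_damage,col='red')        # mean + 2 sd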
Histograms of the same log scores may be easier to read:
par(mfrow=c(1,2))
hist(trimmed_public_health$log_sum_casualty,
xlab='log score',
ylab='frequency',
main="Health")
abline(v=mean_casualty + 2.0*sd_casualty, col='red')
abline(v=mean_casualty, col='blue')
hist(trimmed_economic_health$log_sum_damage,
xlab='log score',
ylab='frequency',
main="Economy")
abline(v=mean_damage + 2.0*sd_damage, col='red')
abline(v=mean_damage, col='blue')
Finally, we explicitly subset all of the values that lie more than two standard deviations above the mean and sort them in decreasing order of log score.
major_health_hazards <- trimmed_public_health[
trimmed_public_health$log_sum_casualty > (mean_casualty + 2.0*sd_casualty),]
major_economic_hazards <- trimmed_economic_health[
trimmed_economic_health$log_sum_damage > (mean_damage + 2.0*sd_damage),]
major_health_hazards <- major_health_hazards[
order(-major_health_hazards$log_sum_casualty),]
major_economic_hazards <- major_economic_hazards[
order(-major_economic_hazards$log_sum_damage),]
head(major_health_hazards)
## EVTYPE fatalities injuries log_sum_casualty
## 758 tornado 5633 91346 11.482260
## 116 excessive heat 1903 6525 9.039433
## 779 tstm wind 504 6957 8.917579
## 154 flood 470 6789 8.890135
## 418 lightning 816 5230 8.707318
## 243 heat 937 2100 8.018955
head(major_economic_hazards)
## EVTYPE propdmg cropdmg log_sum_damage
## 758 tornado 3212258.2 100018.52 15.01315
## 138 flash flood 1420124.6 179200.46 14.28509
## 779 tstm wind 1335995.6 109202.60 14.18376
## 212 hail 688693.4 579596.28 14.05318
## 154 flood 899938.5 168037.88 13.88128
## 685 thunderstorm wind 876844.2 66791.45 13.75750
As we can see, the top three most damaging weather events with respect to human health are:
major_health_hazards[1:3,]
## EVTYPE fatalities injuries log_sum_casualty
## 758 tornado 5633 91346 11.482260
## 116 excessive heat 1903 6525 9.039433
## 779 tstm wind 504 6957 8.917579
and the top three most damaging weather events with respect to economic health are:
major_economic_hazards[1:3,]
## EVTYPE propdmg cropdmg log_sum_damage
## 758 tornado 3212258 100018.5 15.01315
## 138 flash flood 1420125 179200.5 14.28509
## 779 tstm wind 1335996 109202.6 14.18376