This report aims to measure and describe the impact that storms and other severe weather events can have on public health and the economy in the U.S. Many severe events result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
library(ggplot2)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(scales)
From the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database we obtained the ‘Storm Data’ data set, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The events in the database start in the year 1950 and end in November 2011.
We read in the data, which comes as a comma-separated-value file compressed with the bzip2 algorithm.
data <- read.csv(bzfile("/Users/alex/Documents/R directory/RepData_PeerAssessment2/data.csv.bz2"))
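The call above depends on an absolute path on the author's machine. As a minimal sketch of the same read step that runs anywhere, the example below writes a tiny made-up table to a temporary bzip2-compressed CSV and reads it back through `bzfile()`. The toy data and the temporary file are assumptions for illustration only, not part of the storm data.

```r
# Write a tiny made-up table to a bzip2-compressed CSV in a temp location
# (a stand-in for the real storm data file).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through bzfile(), as the report does.
# stringsAsFactors = FALSE keeps text columns as character, not factor.
toy_read <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(toy_read)
```

Reading text columns as character is a design choice, not what the report does; the report's output shows the columns stored as factors.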
The data contains 902,297 observations of 37 variables. As a quick look at the contents, the call below displays the entry in row 3, column 10 (BGN_AZI), which is empty for this record.
dim(data)
## [1] 902297 37
head(data[3,10])
## [1]
## 35 Levels: N NW E Eas EE ENE ESE fee M mi mil Mil N nd Ne NE ... WSW
We notice that the column containing the date of the event is stored in a format that is unsuitable for further analysis. We therefore create new columns with appropriate formatting.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Here we see the class of the column containing the date of the event.
class(data$BGN_DATE)
## [1] "factor"
We derive “year” and “month” columns from the event date and store them as numeric rather than character values.
data$year <- as.numeric(format(as.Date(data$BGN_DATE,"%m/%d/%Y"),"%Y"))
class(data$year)
## [1] "numeric"
data$month <- as.numeric(format(as.Date(data$BGN_DATE,"%m/%d/%Y"),"%m"))
class(data$month)
## [1] "numeric"
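The conversion above parses `BGN_DATE` twice, once per derived column. A minimal sketch of the same idea with a single parse, using made-up sample strings in the same layout as `BGN_DATE` (the vector `bgn` and the names `parsed`, `year`, and `month` are illustrative):

```r
# Sample strings in the same layout as BGN_DATE (values made up for illustration).
bgn <- c("4/18/1950 0:00:00", "11/15/1993 0:00:00")

# Parse once; characters after the matched date format are ignored.
parsed <- as.Date(bgn, format = "%m/%d/%Y")

# Extract the numeric year and month from the parsed dates.
year  <- as.numeric(format(parsed, "%Y"))
month <- as.numeric(format(parsed, "%m"))

print(data.frame(bgn, year, month))
```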
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The histogram shows a strong upward trend in the number of recorded observations starting in 1993. We will therefore use only data from 1993 onward. The newly created subset has 714,738 observations of 39 variables.
par(mfrow=c(1,2))
with(data, {
hist(year,breaks=61, xlab="Year", ylab="Observations",main="Total number of observations in the dataset", col="grey")
abline(v=1993,lty=5,col="red")
hist(month,breaks=12,col="grey",xlab="Month",main="Severe Weather Events Density")
})
workData <- data[data$year>=1993,]
dim(workData)
## [1] 714738 39
In this section of the analysis, we focus on fatalities and injuries caused by major storms and weather events in the United States. Because missing values are a common issue, we start by checking these two indicators for completeness. As the results show, there are no missing values, so our measurement of the impact of weather events is not affected by missing entries.
sum(is.na(workData$FATALITIES))
## [1] 0
sum(is.na(workData$INJURIES))
## [1] 0
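The per-column checks above can also be run for several columns at once. A minimal sketch on a made-up data frame (the `toy` values are illustrative and, unlike the real data, deliberately contain missing entries):

```r
# Made-up data with a few missing entries, for illustration only.
toy <- data.frame(
  FATALITIES = c(0, 1, NA),
  INJURIES   = c(2, 0, 5),
  PROPDMG    = c(25, NA, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```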
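The EVTYPE column has 985 levels. Because subsetting keeps the factor levels of the full data set, this is an upper bound on the number of distinct event types actually present in our subset.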
head(unique(workData$EVTYPE),0)
## factor(0)
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
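Subsetting a data frame keeps all factor levels, so the level count reported for `EVTYPE` can include event types that no longer occur in the subset. A minimal sketch of the difference on a toy factor (the values are made up for illustration):

```r
# Toy factor with three event types.
ev <- factor(c("TORNADO", "FLOOD", "HAIL", "TORNADO"))

# Drop the HAIL observation; the HAIL level itself is kept.
sub_ev <- ev[ev != "HAIL"]

n_levels_kept    <- nlevels(sub_ev)              # counts unused levels too
n_levels_dropped <- nlevels(droplevels(sub_ev))  # only levels still in use
n_unique         <- length(unique(sub_ev))       # distinct observed values

print(c(n_levels_kept, n_levels_dropped, n_unique))
```

Applied to the report's subset, `length(unique(workData$EVTYPE))` would give the number of event types actually observed from 1993 onward.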
Here we create a table with the summary of health impact of major weather events from 1993 to 2011.
healthImp <- ddply(workData, .(EVTYPE), summarise,
                   TotalFat = sum(FATALITIES),
                   TotalInj = sum(INJURIES),
                   Month = round(mean(month), 0)) %>%
  mutate(TotalDam = TotalFat + TotalInj) %>%
  arrange(desc(TotalDam))
colnames(healthImp)[1:3] <- c("Event Type", "Total Fatalities", "Total Injuries")
colnames(healthImp)[5] <- "Total Damage"
healthImp <- healthImp[,c(1,4,2,3,5)]
head(healthImp,3)
## Event Type Month Total Fatalities Total Injuries Total Damage
## 1 TORNADO 6 1621 23310 24931
## 2 EXCESSIVE HEAT 7 1903 6525 8428
## 3 FLOOD 6 470 6789 7259
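The per-event totals can also be computed without plyr, which sidesteps the function masking between plyr and dplyr reported when the packages are loaded. A minimal base R sketch on made-up data (the `toy` values and the `totals` name are illustrative, and this is an alternative to the report's approach rather than a reproduction of it):

```r
# Made-up observations, for illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, 2, 0, 4),
  INJURIES   = c(10, 5, 3, 1),
  stringsAsFactors = FALSE
)

# Sum both indicators per event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined count, then sort from most to least harmful.
totals$Total <- totals$FATALITIES + totals$INJURIES
totals <- totals[order(-totals$Total), ]

print(totals)
```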
In this section of our analysis, we work with the Property Damage and Crop Damage indicators of severe weather events that occurred from 1993 to 2011 in the United States. First, we make sure that our analysis is not affected by missing values.
sum(is.na(workData$PROPDMG))
## [1] 0
sum(is.na(workData$CROPDMG))
## [1] 0
There are no missing values in the columns.
Then we create a new table with the total property damage, total crop damage, and overall total damage inflicted by major weather events from 1993 to 2011 in the United States. The table is sorted by overall total damage in descending order.
propImp <- ddply(workData, .(EVTYPE), summarise,
                 TotalProp = sum(PROPDMG),
                 TotalCrop = sum(CROPDMG),
                 Month = round(mean(month), 0)) %>%
  mutate(TotalDMG = TotalProp + TotalCrop) %>%
  arrange(desc(TotalDMG))
colnames(propImp)[1:3] <- c("Event Type","Total Property Damage","Total Crop Damage")
colnames(propImp)[5] <- "Total Damage"
propImp <- propImp[,c(1,4,2,3,5)]
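The sums above use the raw `PROPDMG` and `CROPDMG` values. The data set also carries `PROPDMGEXP` and `CROPDMGEXP` magnitude codes, which the table does not apply. A minimal sketch of one way to convert to dollar amounts, assuming that `K`, `M`, and `B` stand for thousands, millions, and billions and that every other code is treated as a multiplier of 1 (that fallback is an assumption of this sketch, as are the toy values and the `PropDollars` name):

```r
# Made-up records, for illustration only.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Lookup table for the alphabetic magnitude codes.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code; codes not in the table come back as NA.
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])

# Assumption: unknown or empty codes mean "no scaling".
mult[is.na(mult)] <- 1

toy$PropDollars <- toy$PROPDMG * mult
print(toy)
```

How to treat the remaining codes (digits and symbols such as `?`, `+`, `-`) is a judgment call that this sketch does not settle.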
In this section the findings of the performed analysis are presented.
To show the aggregate impact on population health of severe weather events that occurred from 1993 to 2011 in the United States, we present a table of the top 10 weather events ordered by the amount of damage inflicted.
healthImp1 <- healthImp
healthImp1[,3] <- format(healthImp1[,3],big.mark=",", preserve.width="none")
healthImp1[,4] <- format(healthImp1[,4],big.mark=",", preserve.width="none")
healthImp1[,5] <- format(healthImp1[,5],big.mark=",", preserve.width="none")
head(healthImp1,10)
The plot shows the breakdown of Total Damage into Injuries and Fatalities as the direct impact of severe weather events from 1993 to 2011 in the United States.
healthPlot <- healthImp[1:10,]
healthPlot <- melt(healthPlot, id = "Event Type", variable.name="Damage Type",value.name="Total Damage",measure.vars = c("Total Fatalities", "Total Injuries"))
ggplot(healthPlot,aes(healthPlot[,1],healthPlot[,3],fill=healthPlot[,2])) +
geom_bar(stat="identity") +
scale_fill_manual(values = alpha(c("red", "blue"),.3),name="Legend:") +
theme(axis.text.x = element_text(angle=35)) +
xlab("Event Type") + ylab("Total Damage")
To show the aggregate impact on the economy of the United States of severe weather events that occurred from 1993 to 2011, we present a table of the top 10 weather events ordered by the amount of damage inflicted.
propImp1 <- propImp
propImp1[,3] <- format(propImp1[,3],big.mark=",", preserve.width="none")
propImp1[,4] <- format(propImp1[,4],big.mark=",", preserve.width="none")
propImp1[,5] <- format(propImp1[,5],big.mark=",", preserve.width="none")
head(propImp1,10)
The plot shows the breakdown of Total Damage into Property Damage and Crop Damage as the direct impact of severe weather events from 1993 to 2011 in the United States.
propImpPlot <- propImp[1:10,]
propImpPlot <- melt(propImpPlot, id = "Event Type", variable.name="Damage Type",value.name="Total Damage",measure.vars = c("Total Property Damage", "Total Crop Damage"))
ggplot(propImpPlot,aes(propImpPlot[,1],propImpPlot[,3],fill=propImpPlot[,2])) +
geom_bar(stat="identity") +
scale_fill_manual(values = alpha(c("red", "blue"),.3),name="Legend:") +
theme(axis.text.x = element_text(angle=35)) +
xlab("Event Type") + ylab("Total Damage")