Saugata Ghosh September 3, 2016
In this report, the goal is to analyze the impact of different weather events on public health and economy based on the storm database collected from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) from 1950 - 2011. The data used will be estimates of fatalities, injuries, property and crop damage to decide which types of events are most harmful to the population health and economy. From these data, we found that tornados and heat are most harmful with respect to population health, while flood and hurricanes have the greatest economic impacts.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
library(knitr)
opts_chunk$set(echo = TRUE)
In this section we first load the downloaded csv file on storm data and examine its structure. We also verify that the data has no missing values
stormdata <- read.csv("stormdata.csv", header =TRUE, sep = ",", stringsAsFactors = FALSE)
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
colSums(is.na(stormdata))
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 0 0 0 0 0 0
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 0 0 0 0 0 0
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 0 0 902297 0 0 0
## LENGTH WIDTH F MAG FATALITIES INJURIES
## 0 0 843563 0 0 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 0 0 0 0 0 0
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## 0 47 0 40 0 0
## REFNUM
## 0
Next we need to view the distribution of the data over years. For this we transform the variable BGN_DATE into a suitable format, extract the year of the data and examine the distribution
stormdata$BGN_DATE <- mdy_hms(stormdata$BGN_DATE)
stormdata$YEAR <- year(stormdata$BGN_DATE)
hist(stormdata$YEAR, breaks = 30, xlab = "Year", main = "Distribution of Available data by Year")
From the distribution we can see that for years prior to the 1990s the data available is quiet sparse. Hence we decide to only consider the data from 1989 onwards(the first year for which we have more than 10000 records) and subset only the variables of interest to the analysis
storm <- subset(stormdata, YEAR >= 1989)
storm <- storm[,c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
We examine the type of events and remove those with the heading ‘Summary’ because that header is unclassifiable. Next we derive a measure of impact on population health which is a sum of the number of injuries and fatalities and name the new variable ‘casualties’. Then we aggregate ‘casualties’ data by event type. For the purpose of analysis we first filter out those events with total ‘casualties’ figure less than 10 between 1989 and 2011.
storm <- storm[!grepl("Summary|SUMMARY", storm$EVTYPE),]
storm$CASUALTIES <- storm$FATALITIES+storm$INJURIES
storm2 <- aggregate(CASUALTIES ~ EVTYPE, data = storm,sum)
storm2 <- storm2[storm2$CASUALTIES >=10,]
storm2 <- storm2[order(-storm2$CASUALTIES),]
We observe that many of the minor event types appear to be variants of the names of some of the major event types. Based on our understanding of the data dictionaries provided we club the different event type names into broader categories as follows:
storm2$EVTYPE[grep("TORNADO|WATERSPOUT",storm2$EVTYPE)]<- "TORNADO"
storm2$EVTYPE[grep("Heat|HEAT|WARM",storm2$EVTYPE)]<- "HEAT"
storm2$EVTYPE[grep("FLOOD|FLASH FLOOD|URBAN/SML STREAM FLD",storm2$EVTYPE)]<- "FLOOD"
storm2$EVTYPE[grep("TSTM|THUNDER",storm2$EVTYPE)]<- "THUNDER STORM"
storm2$EVTYPE[grep("ICE STORM|BLIZZARD|WINTER STORM",storm2$EVTYPE)]<- "BLIZZARD"
storm2$EVTYPE[grep("COLD|WINTER WEATHER|WINTRY",storm2$EVTYPE)]<- "COLD"
storm2$EVTYPE[grep("WILD",storm2$EVTYPE)]<- "WILDFIRE"
storm2$EVTYPE[grep("HURRICANE",storm2$EVTYPE)]<- "HURRICANE"
storm2$EVTYPE[grep("DENSE FOG",storm2$EVTYPE)]<- "FOG"
storm2$EVTYPE[grep("WIND|WINDS|STORM SURGE",storm2$EVTYPE)]<- "WIND"
storm2$EVTYPE[grep("SURF",storm2$EVTYPE)]<- "SURF"
storm2$EVTYPE[grep("TROPICAL",storm2$EVTYPE)]<- "TROPICAL STORM"
storm2$EVTYPE[grep("RIP",storm2$EVTYPE)]<- "CURRENT"
storm2$EVTYPE[grep("GLAZE|ICE|ICY",storm2$EVTYPE)]<- "ICE"
storm2$EVTYPE[grep("SNOW",storm2$EVTYPE)]<- "SNOW"
storm2$EVTYPE[grep("HAIL",storm2$EVTYPE)]<- "HAIL"
storm2$EVTYPE[grep("RAIN|DRIZZLE|PRECIP",storm2$EVTYPE)]<- "RAIN"
storm2$EVTYPE[grep("SEAS|MARINE",storm2$EVTYPE)]<- "SEAS"
storm2$EVTYPE[grep("DUST",storm2$EVTYPE)]<- "DUST STORM"
We aggregate the ‘casualties’ figure based on the new groupings and extract the top ten events with the highest casualties
storm3 <- aggregate(CASUALTIES ~ EVTYPE, data = storm2, sum)
storm4 <- storm3[order(-storm3$CASUALTIES),][1:10,]
Similarly we create a variable ‘PDMG’ in the storm dataset which multiplies the property damage (PROPDMG) variable by the relevant unit found in the ‘PROPDMGEXP’ variable (‘H’ or ‘h’ = 100, ‘K’ or ‘k’ = 1000, ‘M’ or ‘m’ = 100000, ‘B’ or ‘b’ = 1000000000). We perform the same transformation for crop damage and create a new variable CDMG. Finally we arrive at a measure of total economic damage TDMG (in ’000 $) by adding PDMG and CDMG
storm$PDMG[storm$PROPDMG==0] <- 0
storm$PDMG[storm$PROPDMGEXP=="H" | storm$PROPDMGEXP=="h"] <-
100*storm$PROPDMG[storm$PROPDMGEXP=="H" | storm$PROPDMGEXP=="h"]
storm$PDMG[storm$PROPDMGEXP=="K" | storm$PROPDMGEXP=="k"] <-
1000*storm$PROPDMG[storm$PROPDMGEXP=="K" | storm$PROPDMGEXP=="k"]
storm$PDMG[storm$PROPDMGEXP=="M" | storm$PROPDMGEXP=="m"] <-
1000000*storm$PROPDMG[storm$PROPDMGEXP=="M" | storm$PROPDMGEXP=="m"]
storm$PDMG[storm$PROPDMGEXP=="B" | storm$PROPDMGEXP=="b"] <-
1000000000*storm$PROPDMG[storm$PROPDMGEXP=="B" | storm$PROPDMGEXP=="b"]
storm$CDMG[storm$CROPDMG==0] <- 0
storm$CDMG[storm$CROPDMGEXP=="H" | storm$CROPDMGEXP=="h"] <-
100*storm$CROPDMG[storm$CROPDMGEXP=="H" | storm$CROPDMGEXP=="h"]
storm$CDMG[storm$CROPDMGEXP=="K" | storm$CROPDMGEXP=="k"] <-
1000*storm$CROPDMG[storm$CROPDMGEXP=="K" | storm$CROPDMGEXP=="k"]
storm$CDMG[storm$CROPDMGEXP=="M" | storm$CROPDMGEXP=="m"] <-
1000000*storm$CROPDMG[storm$CROPDMGEXP=="M" | storm$CROPDMGEXP=="m"]
storm$CDMG[storm$CROPDMGEXP=="B" | storm$CROPDMGEXP=="b"] <-
1000000000*storm$CROPDMG[storm$CROPDMGEXP=="B" | storm$CROPDMGEXP=="b"]
storm$TDMG <- (storm$PDMG + storm$CDMG)/1000
Next we aggregate the total economic damage data by event type and filter out events which have caused less than $ 1 million in damage over the years of study
storm5 <- aggregate(TDMG ~ EVTYPE, data = storm,sum)
storm5 <- storm5[storm5$TDMG >=1000,]
We perform a grouping of event types into broader categories as done earlier
storm5$EVTYPE[grep("TORNADO|WATERSPOUT",storm5$EVTYPE)]<- "TORNADO"
storm5$EVTYPE[grep("Heat|HEAT|WARM",storm5$EVTYPE)]<- "HEAT"
storm5$EVTYPE[grep("FLOOD|FLASH FLOOD|URBAN/SML STREAM FLD|Flood|Flooding",storm5$EVTYPE)]<- "FLOOD"
storm5$EVTYPE[grep("TSTM|THUNDER",storm5$EVTYPE)]<- "THUNDER STORM"
storm5$EVTYPE[grep("ICE STORM|BLIZZARD|WINTER STORM",storm5$EVTYPE)]<- "BLIZZARD"
storm5$EVTYPE[grep("COLD|WINTER WEATHER|WINTRY|Cold",storm5$EVTYPE)]<- "COLD"
storm5$EVTYPE[grep("WILD|FOREST",storm5$EVTYPE)]<- "WILDFIRE"
storm5$EVTYPE[grep("HURRICANE|TYPHOON",storm5$EVTYPE)]<- "HURRICANE"
storm5$EVTYPE[grep("DENSE FOG",storm5$EVTYPE)]<- "FOG"
storm5$EVTYPE[grep("WIND|WINDS|STORM SURGE",storm5$EVTYPE)]<- "WIND"
storm5$EVTYPE[grep("SURF|Surf",storm5$EVTYPE)]<- "SURF"
storm5$EVTYPE[grep("TROPICAL",storm5$EVTYPE)]<- "TROPICAL STORM"
storm5$EVTYPE[grep("RIP",storm5$EVTYPE)]<- "CURRENT"
storm5$EVTYPE[grep("GLAZE|ICE|ICY",storm5$EVTYPE)]<- "ICE"
storm5$EVTYPE[grep("SNOW",storm5$EVTYPE)]<- "SNOW"
storm5$EVTYPE[grep("HAIL",storm5$EVTYPE)]<- "HAIL"
storm5$EVTYPE[grep("RAIN|DRIZZLE|PRECIP",storm5$EVTYPE)]<- "RAIN"
storm5$EVTYPE[grep("SEAS|MARINE",storm5$EVTYPE)]<- "SEAS"
storm5$EVTYPE[grep("DUST",storm5$EVTYPE)]<- "DUST STORM"
storm5$EVTYPE[grep("FROST|FREEZE|Frost|Freeze",storm5$EVTYPE)]<- "FROST"
storm5$EVTYPE[grep("LIGHTNING",storm5$EVTYPE)]<- "LIGHTNING"
We aggregate the total economic data by event type and extract the top ten sources of economic damage
storm6 <- aggregate(TDMG ~ EVTYPE, data = storm5, sum)
storm7 <- storm6[order(-storm6$TDMG),][1:10,]
We create below a barplot showing the impact of the most harmful events on population health. From the plot we can see that tornadoes followed by heat are the two biggest harmful events
plot1<-ggplot(storm4, aes(x = reorder(factor(EVTYPE),-CASUALTIES), y = CASUALTIES))+geom_bar(stat = "identity", fill = "red", width = 0.5)+
xlab("Event Type") + ylab("Total Casualties") + ggtitle("Impact of harmful events on population health")
plot1 <- plot1+theme_bw()+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 10))
plot1
We create below a barplot showing the impact of the most harmful events on the economy.From the plot we can see that floods followed by hurricanes are the most harmful events to the economy.
plot2<-ggplot(storm7, aes(x = reorder(factor(EVTYPE),-TDMG), y = TDMG))+geom_bar(stat = "identity", fill = "blue", width = 0.5)+
xlab("Event Type") + ylab("Total Damage ('000 $") + ggtitle("Impact of harmful events on economy")
plot2 <- plot2+theme_bw()+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 10))
plot2