Analysis storm weather impact on population health and economics in United State

This is a analysis answers which types of events are most harmful to population health and which events have the greatest economic consequences.

Synopsis

The process of analysis is based on data from NOAA storm database, with National weather service documentation and national climate data center storm events FAQ. This analysis answers which types of events are most harmful to population health and which events have the greatest economic consequences. In this analysis, it uses processing raw data, clean data, sort data, create figures etc. It has 2 figures to demonstrate results.

Data processing

The packages we use.

setwd("C:/Users/stephanie song/Desktop")
echo = TRUE
library(R.utils)
library(ggplot2)
library(plyr)
library(knitr)
library(gtools)
## 
## Attaching package: 'gtools'
## 
## The following object is masked from 'package:R.utils':
## 
##     capture
library(gridExtra)
## Loading required package: grid
library(ggplot2)
library(grid)
opts_chunk$set(cache=TRUE, fig.align='center')

Unzip file.

url<-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, "C:/Users/stephanie song/Desktop/data.bz2", mode="wb")
bunzip2("C:/Users/stephanie song/Desktop/data.bz2","C:/Users/stephanie song/Desktop/data.csv")

Load data into R by using read.csv and carefully look at it.

data<-read.csv("C:/Users/stephanie song/Desktop/data.csv",sep=",",quote="\"")
head(data,2)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                        14   100 3   0          0
## 2         NA         0                         2   150 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2

Subset data

date<-as.Date(data$BGN_DATE, "%m/%d/%Y %H:%M:%S")
data$year<-as.numeric(format(date, "%Y"))
head(data,2)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                        14   100 3   0          0
## 2         NA         0                         2   150 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM year
## 1     3040      8812       3051       8806              1 1950
## 2     3042      8755          0          0              2 1950
sum(is.na(data$year))
## [1] 0
summary(data$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1950    1995    2002    1999    2007    2011
sum(data$year==1950)
## [1] 223
sum(data$year==1960)
## [1] 1945
sum(data$year==1970)
## [1] 3215
sum(data$year==1980)
## [1] 6146
hist(data$year, main="numbers of records of each year", xlab="year", ylab="Number of records",col="purple")

dev.copy(png, file="record_year.png",height=480, width=480)
## png 
##   3
dev.off
## function (which = dev.cur()) 
## {
##     if (which == 1) 
##         stop("cannot shut down device 1 (the null device)")
##     .External(C_devoff, as.integer(which))
##     dev.cur()
## }
## <bytecode: 0x0000000015c84ac0>
## <environment: namespace:grDevices>
subdata<-subset(data, data$year>=1970)
dim(subdata)
## [1] 866041     38

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.Then we decide which part of data is desirable for analysis.

We can see from year 1950-1970 there are no significant change on records, so subset data to be from year 1970-2010.

Transformation

Transformation for question 1:

First convert EVTYPE names to upper case. Clean data by simplify EVTYPE name. Originally there are more than 900 EVTYPE, by using grep we can shorten them into small amount. With code book , there are 48 EVYPE in the table.

subdata[,8]<-toupper(subdata[,8])
subdata[grep("AVALANCE*|AVALANCHE", subdata$EVTYPE),8]<-"AVALANCHE"
subdata[grep("BLIZZARD*|BLIZZARD/*", subdata$EVTYPE),8]<-"BLIZZARD"
subdata[grep("COASTAL FLOOD*|COASTAL/TIDAL FLOOD|COASTAL", subdata$EVTYPE),8]<-"COASTAL FLOOD"
subdata[grep("EXCESSIVE HEAT*", subdata$EVTYPE),8]<-"EXCESSIVE HEAT"
subdata[grep("FLASH FLOOD*|FLASH FLOOD/*", subdata$EVTYPE),8]<-"FLASH FLOOD"
subdata[grep("HAIL*", subdata$EVTYPE),8]<-"HAIL"
subdata[grep("FLOOD*|FLOOD/FLASH *|FLOOD/RAIN/*", subdata$EVTYPE),8]<-"FLOOD"
subdata[grep("DUSTSTORM*|DUST STORM*", subdata$EVTYPE),8]<-"DUST STORM"
subdata[grep("DUST DEVIL*", subdata$EVTYPE),8]<-"DUST DEVIL"
subdata[grep("HEAVY SNOW*", subdata$EVTYPE),8]<-"HEAVY SNOW"
subdata[grep("HEAVY WIND*", subdata$EVTYPE),8]<-"HEAVY WIND"
subdata[grep("THUNDERSTORM WIND*|THUNDERSTROM*|THUNERSTROM*|THUNDESTROM*|TSTMW*", subdata$EVTYPE),8]<-"THUNDERSTORM WIND"
subdata[grep("HIGH WIND*", subdata$EVTYPE),8]<-"HIGH WIND"
subdata[grep("HURRICANE*|TYPHOON*", subdata$EVTYPE),8]<-"HURRICANE"
subdata[grep("ICE STORM*|ICE STORM/*", subdata$EVTYPE),8]<-"ICE STORM"
subdata[grep("LIGHTNING*", subdata$EVTYPE),8]<-"LIGHTNING"
subdata[grep("RIP CURRENT*", subdata$EVTYPE),8]<-"RIP CURRENT"
subdata[grep("STRONG WIND*", subdata$EVTYPE),8]<-"STRONG WIND"
subdata[grep("TSTM WIND*|TSTM WIND/*", subdata$EVTYPE),8]<-"TSTM WIND"
subdata[grep("VALCANIC ASH*", subdata$EVTYPE),8]<-"VAOLCANIC ASH"
subdata[grep("WATERSPOUT*|WATER SPOUT*", subdata$EVTYPE),8]<-"WATERSPOUT"
subdata[grep("WINTER STORM*", subdata$EVTYPE),8]<-"WINTER STORM"
subdata[grep("WINTER WEATHER*", subdata$EVTYPE),8]<-"WINTER WEATHER"
subdata[grep("TORNADO*|TORNDAO*|TORNADO/WATERSPOUT*", subdata$EVTYPE),8]<-"TORNADO"
subdata[grep("TROPICAL STORM*", subdata$EVTYPE),8]<-"TROPICAL STORM"
subdata[grep("EXTREME COLD*", subdata$EVTYPE),8]<-"EXTREME COLD"
subdata[grep("DROUGHT*", subdata$EVTYPE),8]<-"DROUGHT"
subdata[grep("COLD*", subdata$EVTYPE),8]<-"COLD"
subdata[grep("HIGH WIND*", subdata$EVTYPE),8]<-"HIGH WIND"
subdata[grep("HIGH SURF*", subdata$EVTYPE),8]<-"HIGH SURF"
subdata[grep("SLEET*", subdata$EVTYPE),8]<-"SLEET"
subdata[grep("STORM SURGE*", subdata$EVTYPE),8]<-"STORM SURGE"
subdata[grep("WILDFIRE*|WILD/FOREST*", subdata$EVTYPE),8]<-"WILDFIRE"
subdata[grep("LAKE EFFECT*|LAKE-EFFECT*", subdata$EVTYPE),8]<-"LAKE EFFECT SNOW"

The column FATALITIES and INJURIES are what we want to analyze harmful event to population health. Create a new column called total_harm using transform(), which is the product of FATALITIES + INJURIES.

subdata<-transform(subdata, total_harm=FATALITIES+INJURIES)

Transformation for question 2:

Check levels of PROPDMGEXP & CROPDMGEXP, convert columns names to upper. Convert PROPDMGEXP and CROPDMGEXP levels from original letters into comparable numeric forms. After checking levels of PROPDMGEXP(column 26, which is multipier of PROPDMG) & CROPDMGEXP (column 28, which is multipier of CROPDMG) we find that H & h(Hundred–100), K & k (Thousand–10^3), M & m (Millon–10^6) and B (Billion–10^9), rest of them convert to 1. In this way we can solve how to sort mixed vectors.

levels(subdata$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(subdata$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"
subdata[,26]<-toupper(subdata[,26])
subdata[,28]<-toupper(subdata[,28])
subdata[grep(" |-|?|+|0", subdata$PROPDMGEXP),26]<-"1"
subdata[grep("1", subdata$PROPDMGEXP),26]<-"10"
subdata[grep("2|H|h", subdata$PROPDMGEXP),26]<-"100"
subdata[grep("3|K|k", subdata$PROPDMGEXP),26]<-"1000"
subdata[grep("4", subdata$PROPDMGEXP),26]<-"10000"
subdata[grep("5", subdata$PROPDMGEXP),26]<-"100000"
subdata[grep("6|m|M", subdata$PROPDMGEXP),26]<-"1000000"
subdata[grep("7", subdata$PROPDMGEXP),26]<-"10000000"
subdata[grep("8", subdata$PROPDMGEXP),26]<-"100000000"
subdata[grep("B", subdata$PROPDMGEXP),26]<-"1000000000"

subdata[grep(" |-|?|+|0", subdata$CROPDMGEXP),28]<-"1"
subdata[grep("1", subdata$CROPDMGEXP),28]<-"10"
subdata[grep("2|H|h", subdata$CROPDMGEXP),28]<-"100"
subdata[grep("3|K|k", subdata$CROPDMGEXP),28]<-"1000"
subdata[grep("4", subdata$CROPDMGEXP),28]<-"10000"
subdata[grep("5", subdata$CROPDMGEXP),28]<-"100000"
subdata[grep("6|m|M", subdata$CROPDMGEXP),28]<-"1000000"
subdata[grep("7", subdata$CROPDMGEXP),28]<-"10000000"
subdata[grep("8", subdata$CROPDMGEXP),28]<-"100000000"
subdata[grep("B", subdata$CROPDMGEXP),28]<-"1000000000"

Creat a column called prop (which is multiply product of PROPDMG & PROPDMGEXP) , also create a column called crop (which is multiply product of CROPDMG & CROPDMGEXP). Create a column called total_dmg(which is product of PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP )

subdata$PROPDMGEXP<-as.numeric(subdata$PROPDMGEXP)
subdata<-transform(subdata, propdmg=PROPDMG*PROPDMGEXP)
subdata$CROPDMGEXP<-as.numeric(subdata$CROPDMGEXP)
subdata<-transform(subdata, cropdmg=CROPDMG*CROPDMGEXP)
subdata<-transform(subdata, total_dmg=propdmg+cropdmg)

Results

Most harmful event to population health assessment

  • Analyze which event is the most harmful to population health by deal with variable total_harm(product of FATALITIES+INJURIES).

  • Calculate the total_harm by each type of events. And use arrange to rank the most harmful events.

harm_sum<-aggregate.data.frame(subdata$total_harm, list(subdata$EVTYPE), sum)
names(harm_sum)<-c("events_name1","total_harm")
head(arrange(harm_sum, -total_harm))
##        events_name1 total_harm
## 1           TORNADO      62901
## 2             FLOOD      10135
## 3 THUNDERSTORM WIND      10121
## 4    EXCESSIVE HEAT       8447
## 5         LIGHTNING       6048
## 6              HEAT       3037
harm_df<-data.frame(head(arrange(harm_sum, -total_harm)))

We see TORNADO is the most harmful.

Greatest economic consequence assessment

  • Analyze which event is the most harmful to population health by deal with variable total_harm(product of PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP).

  • Calculate the total_dmg by each type of events. And use arrange to rank the most economic consequence events.

dmg_sum<-aggregate.data.frame(subdata$total_dmg, list(subdata$EVTYPE), sum)
names(dmg_sum)<-c("events_name2","total_dmg")
head(arrange(dmg_sum, -total_dmg))
##        events_name2 total_dmg
## 1 THUNDERSTORM WIND  28601096
## 2             FLOOD  28019542
## 3           TORNADO  27484004
## 4              HAIL  12852770
## 5         LIGHTNING   6070054
## 6         HIGH WIND   4032891
damage_df<-data.frame(head(arrange(dmg_sum, -total_dmg)))

Plot a panel plot showing harmful events and greatest economic damage events:

plot1<-ggplot(harm_df, aes(events_name1, total_harm)) +
        geom_bar(stat="identity", colour="purple", fill="purple", width=0.5)+
        labs(title=" Total number of harmful events for human health(top 5)", x="event name",y="total number")

plot2<-ggplot(damage_df, aes(events_name2, total_dmg)) +
        geom_bar(stat="identity", colour="purple", fill="purple", width=0.5)+
        labs(title=" Total number of greatest economic consequense events(top 5)", x="event name",y="total number")

grid.arrange(plot1, plot2, nrow = 2)

dev.copy(png, file="results.png",height=480, width=480)
## png 
##   4
dev.off
## function (which = dev.cur()) 
## {
##     if (which == 1) 
##         stop("cannot shut down device 1 (the null device)")
##     .External(C_devoff, as.integer(which))
##     dev.cur()
## }
## <bytecode: 0x0000000015c84ac0>
## <environment: namespace:grDevices>

Conclusion

  • The most harmful event to population health is TORNADO.
  • The greatest economic consequense for crop and property damage is THUNDERSTORM WIND, then FLOOD is the second.