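This analysis aims to address two questions: which types of storm events are most harmful to population health, and which have the greatest economic consequences.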
The storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), covering 1950 to November 2011, was used. The analysis was done in RStudio.
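# Download the compressed data and read it into the workspace.
# mode = "wb" keeps the bz2 archive intact on Windows; read.csv reads .bz2 files directly.
download.file(url = "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "data.csv.bz2", mode = "wb")
# stringsAsFactors = TRUE reproduces the factor columns shown below on R 4.0 and later.
data <- read.csv("data.csv.bz2", stringsAsFactors = TRUE)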
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
str(data$EVTYPE)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
According to the National Weather Service Storm Data Documentation, p. 6, each event type carries one of three designators: Z (zone), C (county/parish), or M (marine). We therefore mapped each value of data$EVTYPE to its designator group by matching keywords, as follows:
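# Work on a lower-case character copy of EVTYPE. The assignments below run in order,
# so once a label is replaced by "m", "c", or "z" it no longer matches later keywords.
type <- tolower(as.character(data$EVTYPE))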
# M group
type[grep("marine",type)]<-"m"
type[grep("tstm",type)]<-"m"
type[grep("spout",type)]<-"m"
# C group
type[grep("debris",type)]<-"c"
type[grep("dust devil",type)]<-"c"
type[grep("flood",type)]<-"c"
type[grep("cloud",type)]<-"c"
type[grep("hail",type)]<-"c"
type[grep("rain",type)]<-"c"
type[grep("lightning",type)]<-"c"
type[grep("thunderstorm",type)]<-"c"
type[grep("tornado",type)]<-"c"
# Z group
type[grep("astronomical",type)]<-"z"
type[grep("avalanche",type)]<-"z"
type[grep("blizzard",type)]<-"z"
type[grep("coastal",type)]<-"z"
type[grep("chill",type)]<-"z"
type[grep("dense",type)]<-"z"
type[grep("drought",type)]<-"z"
type[grep("dust storm",type)]<-"z"
type[grep("excessive",type)]<-"z"
type[grep("extreme",type)]<-"z"
type[grep("frost",type)]<-"z"
type[grep("freeze",type)]<-"z"
type[grep("fog",type)]<-"z"
type[grep("heat",type)]<-"z"
type[grep("snow",type)]<-"z"
type[grep("high",type)]<-"z"
type[grep("hurricane",type)]<-"z"
type[grep("typhoon",type)]<-"z"
type[grep("ice",type)]<-"z"
type[grep("lake",type)]<-"z"
type[grep("current",type)]<-"z"
type[grep("seiche",type)]<-"z"
type[grep("sleet",type)]<-"z"
type[grep("surge",type)]<-"z"
type[grep("wind",type)]<-"z"
type[grep("tropical",type)]<-"z"
type[grep("tsunami",type)]<-"z"
type[grep("volcanic",type)]<-"z"
type[grep("wild",type)]<-"z"
type[grep("winter",type)]<-"z"
type[grep("wintry",type)]<-"z"
datam <- subset(x = data,subset = type=="m")
datac <- subset(x = data,subset = type=="c")
dataz <- subset(x = data,subset = type=="z")
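The keyword assignments above can also be written as a lookup table applied in order. The following is a minimal sketch on a made-up vector of event names. The helper `classify_events` and the shortened keyword table are hypothetical; they only illustrate the first-match-wins behaviour of the code above, and unmatched names are returned as NA rather than left unchanged.

```r
# Hypothetical helper: assign a designator group by the first matching keyword.
classify_events <- function(evtype, rules) {
  type <- tolower(as.character(evtype))
  group <- rep(NA_character_, length(type))
  for (i in seq_len(nrow(rules))) {
    # Only names that have not been classified yet can take this rule's group.
    hit <- is.na(group) & grepl(rules$keyword[i], type, fixed = TRUE)
    group[hit] <- rules$group[i]
  }
  group
}

# Shortened keyword table, in the same order as the assignments above (M, C, Z).
rules <- data.frame(
  keyword = c("marine", "tstm", "spout", "flood", "hail", "tornado", "heat", "wind"),
  group   = c("m", "m", "m", "c", "c", "c", "z", "z"),
  stringsAsFactors = FALSE
)

# Made-up event names for illustration only.
events <- c("TORNADO", "TSTM WIND", "Marine Hail", "HEAT", "FLASH FLOOD", "OTHER")
groups <- classify_events(events, rules)
print(data.frame(events, groups))
```

Keeping the keywords in a table makes the order of the rules explicit, which matters because a name such as "TSTM WIND" matches both "tstm" and "wind" and takes the group of whichever rule comes first.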
We extracted 896672 of the 902297 observations (about 99.38 percent); the remaining 5625 matched none of the keywords and were left out.
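The share quoted above follows directly from the two counts. A short check, using only the numbers reported in the text (the variable names are illustrative):

```r
matched <- 896672   # observations assigned to group M, C, or Z
total   <- 902297   # all observations in the database

coverage  <- 100 * matched / total   # share of classified observations, in percent
unmatched <- total - matched         # observations that matched no keyword

print(round(coverage, 2))
print(unmatched)
```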
According to the data, two recorded variables relate to population health: FATALITIES and INJURIES. This analysis considers only INJURIES.
# Injury counts for each designator group
tempm <- datam$INJURIES
tempc <- datac$INJURIES
tempz <- dataz$INJURIES
boxplot(tempm, tempc, tempz,
        horizontal = TRUE,
        names = c("M", "C", "Z"),
        xlab = "Number of Injuries", ylab = "Group", main = "Number of Injuries by Storm Type in the US (1950 - Nov 2011)")
To answer the question on population health, we fitted one linear regression per group, with the number of injuries as the response and a 0/1 indicator of group membership as the only predictor.
# Stack the injury counts with their group labels
tempdatam <- data.frame(Injuries = tempm, Group = "M")
tempdatac <- data.frame(Injuries = tempc, Group = "C")
tempdataz <- data.frame(Injuries = tempz, Group = "Z")
tempdata <- rbind(tempdatam, tempdatac, tempdataz)
# 0/1 indicator of membership for each group
M_GROUP <- as.numeric(tempdata$Group == "M")
C_GROUP <- as.numeric(tempdata$Group == "C")
Z_GROUP <- as.numeric(tempdata$Group == "Z")
tempdata <- data.frame(tempdata$Injuries, M_GROUP, C_GROUP, Z_GROUP)
# One regression per indicator
lm(tempdata$tempdata.Injuries ~ tempdata$M_GROUP)
lm(tempdata$tempdata.Injuries ~ tempdata$C_GROUP)
lm(tempdata$tempdata.Injuries ~ tempdata$Z_GROUP)
##
## Call:
## lm(formula = tempdata$tempdata.Injuries ~ tempdata$M_GROUP)
##
## Coefficients:
## (Intercept) tempdata$M_GROUP
## 0.2014 -0.1711
##
## Call:
## lm(formula = tempdata$tempdata.Injuries ~ tempdata$C_GROUP)
##
## Coefficients:
## (Intercept) tempdata$C_GROUP
## 0.09393 0.09760
##
## Call:
## lm(formula = tempdata$tempdata.Injuries ~ tempdata$Z_GROUP)
##
## Coefficients:
## (Intercept) tempdata$Z_GROUP
## 0.1442 0.1214
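From the results, the coefficient of Z_GROUP (0.1214493) is the highest of the three groups. Since the intercept plus the coefficient gives the mean for the group, events in group Z average about 0.27 injuries each, compared with about 0.19 for group C and 0.03 for group M. Hence, storm type Z is the most harmful to population health in terms of the number of injuries per event.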
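Because each regression uses a single 0/1 indicator, its coefficients have a direct reading: the intercept is the mean over all events outside the group, and the slope is the difference between the group mean and that mean. The sketch below checks this on made-up data; the data frame `toy` and its values are hypothetical and unrelated to the storm database.

```r
# Made-up injury counts for three groups (illustration only).
toy <- data.frame(
  Injuries = c(0, 1, 0, 2, 4, 0, 6, 2),
  Group    = c("M", "M", "C", "C", "C", "Z", "Z", "Z")
)

# 0/1 indicator for group Z, built the same way as in the analysis above.
toy$Z_GROUP <- as.numeric(toy$Group == "Z")
fit <- lm(Injuries ~ Z_GROUP, data = toy)

mean_z     <- mean(toy$Injuries[toy$Group == "Z"])
mean_other <- mean(toy$Injuries[toy$Group != "Z"])

# The fitted intercept and slope agree with the two means computed by hand.
print(coef(fit))
print(c(mean_other, mean_z - mean_other))
```

Read this way, comparing the three slopes compares each group against a different baseline (everything outside that group), so the group means themselves are the more direct quantity to compare.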
According to the data, four recorded variables relate to economic consequences: PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP. Only PROPDMG and CROPDMG are considered; the magnitude codes in PROPDMGEXP and CROPDMGEXP are ignored, so the figures below are unscaled values rather than dollar amounts. We combined the two variables with a non-weighted sum.
# Non-weighted accumulation of PROPDMG and CROPDMG
econm <- datam$PROPDMG + datam$CROPDMG
econc <- datac$PROPDMG + datac$CROPDMG
econz <- dataz$PROPDMG + dataz$CROPDMG
tempdatam <- data.frame(Damage = econm, Group = "M")
tempdatac <- data.frame(Damage = econc, Group = "C")
tempdataz <- data.frame(Damage = econz, Group = "Z")
tempdata <- rbind(tempdatam, tempdatac, tempdataz)
# The indicator vectors from the injuries analysis are reused; the row order (M, C, Z) is the same.
tempdata <- data.frame(tempdata$Damage, M_GROUP, C_GROUP, Z_GROUP)
lm(tempdata$tempdata.Damage ~ tempdata$M_GROUP)
lm(tempdata$tempdata.Damage ~ tempdata$C_GROUP)
lm(tempdata$tempdata.Damage ~ tempdata$Z_GROUP)
##
## Call:
## lm(formula = tempdata$tempdata.Damage ~ tempdata$M_GROUP)
##
## Coefficients:
## (Intercept) tempdata$M_GROUP
## 16.28 -10.08
##
## Call:
## lm(formula = tempdata$tempdata.Damage ~ tempdata$C_GROUP)
##
## Coefficients:
## (Intercept) tempdata$C_GROUP
## 8.364 8.236
##
## Call:
## lm(formula = tempdata$tempdata.Damage ~ tempdata$Z_GROUP)
##
## Coefficients:
## (Intercept) tempdata$Z_GROUP
## 13.5453 0.6522
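From the results, the coefficient of C_GROUP (8.2359209) is the highest of the three groups. Events in group C average about 16.6 units of recorded damage each, compared with about 14.2 for group Z and 6.2 for group M. Hence, storm type C has the greatest economic consequences per event under this non-weighted measure.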
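The sums above leave PROPDMGEXP and CROPDMGEXP out. As a sketch of a weighted alternative, the block below scales damage values by a magnitude code, assuming that K, M, and B stand for thousands, millions, and billions and that any other code leaves the value unscaled. The helper `scale_damage`, the multiplier table, and the sample values are hypothetical and are not part of the analysis above.

```r
# Hypothetical helper: scale a damage value by its magnitude code.
# Assumption: K = 1e3, M = 1e6, B = 1e9; any other code is treated as 1.
scale_damage <- function(value, exp_code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- multipliers[toupper(as.character(exp_code))]
  mult[is.na(mult)] <- 1   # unknown or empty codes: no scaling
  value * unname(mult)
}

# Made-up records for illustration only.
propdmg    <- c(25, 2.5, 1, 10)
propdmgexp <- c("K", "M", "B", "")
print(scale_damage(propdmg, propdmgexp))
```

With a helper of this kind, the scaled property and crop damage could be summed in place of PROPDMG + CROPDMG; whether the conclusion about group C would still hold is not something the results above can show.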