In this report, I use NOAA storm data to identify the most harmful types of weather events in the US. For the years 1950 to 2011, tornadoes were the most harmful with respect to population health, and floods had the greatest economic consequences.
The data on characteristics of major storms and weather events in the United States are available from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.
data <- read.csv("repdata-data-StormData.csv.bz2")
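The download itself is not shown above. A minimal sketch of one way to fetch the archive only when it is missing is given below; the helper name `fetch_if_missing` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: download the archive only when it is not already on disk.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the bz2 archive byte-for-byte intact on Windows
    download.file(url, destfile, mode = "wb")
  }
  destfile
}

storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Not run here because it needs network access:
# data <- read.csv(fetch_if_missing(storm_url, "repdata-data-StormData.csv.bz2"))
```

read.csv can read the .bz2 file directly, so no separate decompression step is needed.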
According to the National Weather Service Storm Data Documentation, there are only 48 permitted kinds of storm data events. However, the raw data contain 985 kinds of events. To make a meaningful comparison, I have to do my best to map the original 985 kinds of events onto the 48 permitted kinds.
First, I pick out the rows that contain a nonzero value in at least one of the four columns of interest (FATALITIES, INJURIES, PROPDMG and CROPDMG). Doing so reduces the 985 kinds to 488.
data1 <- subset(data,FATALITIES!=0 | INJURIES!=0 | PROPDMG!=0 | CROPDMG!=0)
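The counts of event kinds before and after filtering can be checked by counting distinct EVTYPE values. The sketch below does this on a small made-up data frame (`toy` is an illustration, not the storm data), using the same filter condition.

```r
# Toy stand-in for the storm data (values are made up for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "FUNNEL CLOUD"),
  FATALITIES = c(1, 0, 0, 0),
  INJURIES   = c(3, 0, 0, 0),
  PROPDMG    = c(25, 0, 10, 0),
  CROPDMG    = c(0, 0, 5, 0),
  stringsAsFactors = FALSE
)

# Same condition as the filter on the full data
toy1 <- subset(toy, FATALITIES != 0 | INJURIES != 0 | PROPDMG != 0 | CROPDMG != 0)

types_before <- length(unique(toy$EVTYPE))   # distinct event types before filtering
types_after  <- length(unique(toy1$EVTYPE))  # distinct event types after filtering
```

On the full data, the same two expressions applied to data and data1 should give the counts quoted in the text.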
Then I read in the 48 permitted event names and reorder them by length, because in general a longer name denotes a more specific category. This way each record is first tried against the more specific categories and falls back to a more general one only if no specific category fits. (The content of permittedevents.txt is copied from the table on page 6 of the National Weather Service Storm Data Documentation.)
pe <- readLines("permittedevents.txt")
pe1 <- substr(pe,1,nchar(pe)-2)
pe2 <- pe1[order(nchar(pe1),decreasing=T)]
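To illustrate the two steps above, here is a small sketch on four made-up lines. It assumes that each line of permittedevents.txt ends with a space and a one-letter designator, which is what dropping the last two characters suggests; the actual file content is not shown in this report.

```r
# Assumed file layout: event name, a space, then a one-letter designator
pe_toy <- c("Flood C", "Flash Flood C", "Hail C", "Marine Hail M")

names_only <- substr(pe_toy, 1, nchar(pe_toy) - 2)   # drop the trailing " C" / " M"
by_length  <- names_only[order(nchar(names_only), decreasing = TRUE)]
```

After sorting, "Flash Flood" comes before "Flood" and "Marine Hail" before "Hail", so the more specific names are tried first.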
Next, I do my best to assign the raw records to the 48 permitted categories. The detailed criteria are shown in the following code.
data1$newtype <- NA
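#Assign each still-unclassified record to the first (longest) permitted name found in its EVTYPE
for (i in seq_along(pe2)){
  m <- grep(pe2[i],data1$EVTYPE,ignore.case=T)
  n <- which(is.na(data1$newtype))
  m1 <- intersect(m,n)
  data1$newtype[m1] <- pe2[i]
}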
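The effect of trying longer names first can be seen on a toy example. The vectors `permitted` and `evtype` below are made up for illustration, and the loop is a compact variant of the one above rather than the exact code used on the full data.

```r
permitted <- c("Flash Flood", "Flood", "Hail")   # already sorted longest first
evtype    <- c("FLASH FLOODING", "RIVER FLOOD", "SMALL HAIL", "FOG")

newtype <- rep(NA_character_, length(evtype))
for (p in permitted) {
  # only rows that match the name and have no category yet are labeled
  hit <- grepl(p, evtype, ignore.case = TRUE) & is.na(newtype)
  newtype[hit] <- p
}
```

Here "FLASH FLOODING" is labeled "Flash Flood" because that name is tried before "Flood", and "FOG" stays NA because none of the three names matches it.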
m <- grep("hurricane|typhoon",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Hurricane (Typhoon)"
m <- grep("Non",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Strong Wind"
m <- grep("TSTM|Thunderstorm|thundertorm|tunderstorm|thuderstorm|thunderestorm|thundeerstorm|thunerstorm|thunderstrom",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Thunderstorm Wind"
m <- grep("wind",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Strong Wind"
m <- grep("extreme cold|extended cold",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Extreme Cold/Wind Chill"
m <- grep("cold",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Cold/Wind Chill"
m <- grep("freez",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Frost/Freeze"
m <- grep("snow",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Heavy Snow"
m <- grep("rain",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Heavy Rain"
m <- grep("lighting",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Lightning"
m <- grep("torndao",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Tornado"
m <- grep("avalance",data1$EVTYPE,ignore.case=T)
n <- which(is.na(data1$newtype))
m1 <- intersect(m,n)
data1$newtype[m1] <- "Avalanche"
n <- which(is.na(data1$newtype))
data1$newtype[n] <- "Other"
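The grep / which / intersect pattern repeated above could be wrapped in a small function. The sketch below is one possible refactoring, shown on made-up rows; the name `assign_if_unmatched` and the `rules` list are hypothetical and only cover a few of the patterns used above.

```r
# Hypothetical helper: label still-unclassified rows whose EVTYPE matches `pattern`
assign_if_unmatched <- function(df, pattern, label) {
  hit <- grepl(pattern, df$EVTYPE, ignore.case = TRUE) & is.na(df$newtype)
  df$newtype[hit] <- label
  df
}

toy <- data.frame(
  EVTYPE  = c("TYPHOON", "TSTM WIND", "GUSTY WINDS", "LIGHTING"),
  newtype = NA_character_,
  stringsAsFactors = FALSE
)

# Rules are applied in order, so earlier (more specific) rules take precedence
rules <- list(
  c("hurricane|typhoon", "Hurricane (Typhoon)"),
  c("TSTM|Thunderstorm", "Thunderstorm Wind"),
  c("wind",              "Strong Wind"),
  c("lighting",          "Lightning")
)
for (r in rules) toy <- assign_if_unmatched(toy, r[1], r[2])
```

As in the code above, "TSTM WIND" keeps the label "Thunderstorm Wind" because the later, more general "wind" rule only touches rows that are still NA.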
Please note that I put the remaining 90+ kinds of events into the “Other” category. This does not affect the final conclusion of the study, because these records account for less than 2% of the total effects caused by weather conditions.
To measure how harmful each event type is to population health, I sum the fatalities and injuries for each event type.
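The share of the “Other” category can be checked with a short calculation. The sketch below shows the idea for casualties on made-up totals; the numbers are for illustration only and are not results from the storm data.

```r
# Toy totals per category (made-up numbers, not results from the storm data)
toy <- data.frame(
  newtype    = c("Tornado", "Flood", "Other"),
  FATALITIES = c(50, 30, 1),
  INJURIES   = c(400, 100, 2),
  stringsAsFactors = FALSE
)
toy$casualties <- toy$FATALITIES + toy$INJURIES

# Fraction of all casualties that falls into "Other"
other_share <- sum(toy$casualties[toy$newtype == "Other"]) / sum(toy$casualties)
```

The same ratio could be computed for the damage amounts once they have been converted to dollars.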
library(ggplot2)
f1 <- aggregate(data1$FATALITIES,by=list(data1$newtype),sum)
f2 <- aggregate(data1$INJURIES,by=list(data1$newtype),sum)
f <- merge(f1,f2,by.x="Group.1",by.y="Group.1")
f$sum <- f[,2]+f[,3]
print("The most harmful event type goes to ")
## [1] "The most harmful event type goes to "
f[f$sum==max(f$sum),1]
## [1] "Tornado"
g <- ggplot(data=f,aes(x=Group.1,y=sum))
g+geom_bar(stat="identity")+coord_flip()+xlab("Types of weather events")+ylab("Number of people killed or injured")+ggtitle("The Most Harmful Event to Population Health")
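As an alternative to two aggregate calls followed by a merge, the formula interface of aggregate can sum both columns at once. The sketch below shows this on a made-up data frame; it is a possible simplification, not the code that produced the result above.

```r
toy <- data.frame(
  newtype    = c("Tornado", "Tornado", "Flood", "Heat"),
  FATALITIES = c(2, 1, 0, 4),
  INJURIES   = c(10, 5, 1, 0),
  stringsAsFactors = FALSE
)

# One call sums both columns per event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ newtype, data = toy, FUN = sum)
totals$sum <- totals$FATALITIES + totals$INJURIES

# Event type with the largest combined count
worst <- totals$newtype[which.max(totals$sum)]
```

Note that which.max returns only the first maximum, whereas the comparison with max() used above would return every tied event type.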
To find the event type with the greatest economic consequences, I need to sum the damage to both property and crops. The problem is that the damage amounts are recorded in different units, and some unit codes such as “-” or “H” are hard to interpret. So I ignore the records with such nonstandard units and keep only the records with the standard units “b” (billions), “k” (thousands) and “m” (millions).
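#Unit of the property
#Create the column first so that the indexed assignments below always have full length
data1$pu <- NA
m <- grep("k",data1$PROPDMGEXP,ignore.case=T)
data1$pu[m] <- 1000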
m <- grep("m",data1$PROPDMGEXP,ignore.case=T)
data1$pu[m] <- 1000000
m <- grep("b",data1$PROPDMGEXP,ignore.case=T)
data1$pu[m] <- 1000000000
#Property Damage
data1$pd <- data1$PROPDMG*data1$pu
data1$pd[which(is.na(data1$pd))] <- 0
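#Unit of the crops
#Create the column first so that the indexed assignments below always have full length
data1$cu <- NA
m <- grep("k",data1$CROPDMGEXP,ignore.case=T)
data1$cu[m] <- 1000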
m <- grep("m",data1$CROPDMGEXP,ignore.case=T)
data1$cu[m] <- 1000000
m <- grep("b",data1$CROPDMGEXP,ignore.case=T)
data1$cu[m] <- 1000000000
#Crops Damage
data1$cd <- data1$CROPDMG*data1$cu
data1$cd[which(is.na(data1$cd))] <- 0
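The unit conversion can also be written as a lookup in a named vector. The sketch below applies this to a made-up data frame; the name `unit_value` is hypothetical, and the lookup matches whole unit codes exactly, whereas the grep calls above match any code that contains the letter.

```r
# Multipliers for the standard unit codes; any other code maps to NA
unit_value <- c(K = 1e3, M = 1e6, B = 1e9)

toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 7),
  PROPDMGEXP = c("K", "m", "B", "-"),
  stringsAsFactors = FALSE
)

mult <- unname(unit_value[toupper(toy$PROPDMGEXP)])  # NA for "-" and other codes
toy$pd <- toy$PROPDMG * mult
toy$pd[is.na(toy$pd)] <- 0                           # ignore records with nonstandard units
```

As in the code above, a record with a nonstandard unit contributes zero to the total rather than being removed from the data frame.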
#Aggregate
f1 <- aggregate(data1$pd,by=list(data1$newtype),sum)
f2 <- aggregate(data1$cd,by=list(data1$newtype),sum)
f <- merge(f1,f2,by.x="Group.1",by.y="Group.1")
f$sum <- f[,2]+f[,3]
print("The most costly event type goes to ")
## [1] "The most costly event type goes to "
f[f$sum==max(f$sum),1]
## [1] "Flood"
g <- ggplot(data=f,aes(x=Group.1,y=sum))
g+geom_bar(stat="identity")+coord_flip()+xlab("Types of weather events")+ylab("The USD amount of damage caused")+ggtitle("The Type of Event with the Greatest Economic Consequence")
Based on the above analysis, across the United States tornadoes are the most harmful event type with respect to population health, and floods have the greatest economic consequences.