The National Oceanic and Atmospheric Administration’s (NOAA) Storm Events Database contains data on storms in the United States and its territories from 1951 to 2011. The data details the type of storm, location, duration, economic costs, and human costs. The purpose of this analysis is to analyze past events and present any trends or insights that would assist in preparing for natural disasters. The analysis includes details on data sources, data preparation, and data visualization. The visualizations reveal storm event types that are national in reach, while other are regional and have never affected other parts of the United States. While causation is discussed in the analysis, it is up to the reader to make their own judgments based on their knowledge of storms and their region.
We will be using the stringr and dplyr libraries to process the data into a format we can use. We will then use maps to visualize the data. We also disable scientific notation to ease the reading of tables.
library(stringr)
library(dplyr)
library(xtable)
library(maps)
options(scipen=999)
With the file in our working directory, we unzip and read in the data
data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), stringsAsFactors = FALSE) #, nrows =100)
We will now separate the date from the time, which is 0:00:00 throughout the variable. Reviewing the structure of the data on the NOAA Website indicates that there are three distinct time periods where there will be variations in how the data was recorded. Looking at the time periods and the amount of data in each shows that the time period from 1996 - 2011 contains 70% of the records in the data-set. Given that the more recent data has more complete information, is more representative of current population and other demographics, and has fewer corrections for inflation affects, we will confine the analysis to the data from 1996 to 2011 and place it in a data.frame called recent.
data$BGN_DATE <- gsub(" 0:00:00", "", data$BGN_DATE)
data$year <- as.numeric(str_sub(data$BGN_DATE, start= -4))
recent <- subset(data, year >= 1996)
There are several variables that are not related to the analysis, so we will subset out the variables we are interested in into a data.frame called recent.clean.
recent.clean <- subset(recent, select = c(BGN_DATE, COUNTYNAME, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, LATITUDE, LONGITUDE, year))
With the data subsetted into the most recent era of data collection, we will convert the Property and Crop damage values from exponential to absolute values. In the raw data, the damage is broken up into 3 significant figures in one column, with another column containing K, M, B, or nothing to indicate Thousands, Millions, or Billions of dollars.
recent.clean$PROPDMGEXP <- gsub("", "1", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- gsub("101", "1", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- gsub("1B1", "1000000000", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- gsub("1M1", "1000000", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- gsub("1K1", "1000", recent.clean$PROPDMGEXP)
recent.clean$PROPDMGEXP <- as.numeric(recent.clean$PROPDMGEXP)
recent.clean$CROPDMGEXP <- gsub("", "1", recent.clean$CROPDMGEXP)
recent.clean$CROPDMGEXP <- gsub("1B1", "1000000000", recent.clean$CROPDMGEXP)
recent.clean$CROPDMGEXP <- gsub("1M1", "1000000", recent.clean$CROPDMGEXP)
recent.clean$CROPDMGEXP <- gsub("1K1", "1000", recent.clean$CROPDMGEXP)
recent.clean$CROPDMGEXP <- as.numeric(recent.clean$CROPDMGEXP)
Reviewing the location data reveals that Latitude and Longitude are not in the correct format. We multiply each by 0.01 and change the sign on the Longitude values since the United States is West of the Prime Meridian.
recent.clean$PROPDMG <- recent.clean$PROPDMG*recent.clean$PROPDMGEXP
recent.clean$CROPDMG <- recent.clean$CROPDMG*recent.clean$CROPDMGEXP
recent.clean$LATITUDE <- recent.clean$LATITUDE *0.01
recent.clean$LONGITUDE <- recent.clean$LONGITUDE *-0.01
We wish to confine the study to the 50 states, so we use the matching function %in% to subset out only the events that match up with the abbreviations for the 50 states.
states <- c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL",
"GA", "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA",
"MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE",
"NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI",
"SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")
recent.clean <- subset(recent.clean, STATE %in% states)
Next we combine the counts for Fatalities and Injuries into one stat called Harm since it appears to be a variable that is (thankfully) sparse in the data-set.
recent.clean$harm <- recent.clean$FATALITIES + recent.clean$INJURIES
We then create a new variable with the full state names using the state.name and state.abb functions.
recent.clean$statename <- state.name[match(recent.clean$STATE, state.abb)]
While the NOAA website lists 48 Event Types, the data-set records over 400 unique events in the EVTYPE variable. Upon inspection we find alternative spelling, misspellings, lowercase vs. uppercase, etc. It would greatly expand the scope of the project to clean up every instance, and the author’s only rudimentary knowledge of text analytics does not contain an easy solution. The following text adjustments were made to the data and took the number of unique events down to only 348.
recent.clean$EVTYPE <- tolower(recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub(" ", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("[0-9]*", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\(g)", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\(.)", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\(", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\)", "", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("tstmwind", "thunderstormwind", recent.clean$EVTYPE)
recent.clean$EVTYPE <- gsub("\\/typhoon", "", recent.clean$EVTYPE)
Review of the updated EVTYPE variable shows that over 80% of the entries fall into the top 20 Events by count. Converting the events that have 1 entry into one of the 48 would have diminishing returns, hence we will work with the data set having the 348 unique event types.
With the tidy data-set names recent.clean, we can begin to look at the impact of the weather events on property/crop damage and harm to the population. The first aspect to look at is the top 6 events by count, Total Property Damage, Total Crop Damage, and Number (Count) of People Harmed from 1996 - 2011. We will use the group_by() and summarise() functions from dplyr to perform the analysis
events <- group_by(recent.clean, EVTYPE)
events_sum <- summarise(events, Count =n(), Total_Prop = sum(PROPDMG),
Total_Crop = sum(CROPDMG), Health = sum(harm))
With the data summarized in the data.table events_sum, we can start looking a the top events. Note that all of the monetary values have not be adjusted for inflation. In an expansion of this report, it would be recommended to get all monetary values in the terms of a base year.
Top 6 Events by Count
events_sum <- arrange(events_sum, desc(Count))
topsix_count <- events_sum[1:6,1]
events_sum[1:6,]
Source: local data frame [6 x 5]
EVTYPE Count Total_Prop Total_Crop Health
1 thunderstormwind 209947 7849636570 952241350 5372
2 hail 207685 14595136920 2476029450 720
3 flashflood 50012 14950618910 1320489700 2512
4 flood 23874 143818447550 4926778400 7163
5 tornado 23139 24616723710 283424010 22178
6 highwind 19878 5246651360 633287300 1315
The number of events is dominated by Thunder Storm Winds and Hail. The next closest category, at quarter of the numbers, is Flash Floods. Interestingly, the events with the most occurrences are nowhere near the top events in the Damage categories, except maybe for Hail, as we will examine next.
Top 6 Events by Total Property Damage ($)
events_sum <- arrange(events_sum, desc(Total_Prop))
topsix_prop <- events_sum[1:6,1]
events_sum[1:6,]
Source: local data frame [6 x 5]
EVTYPE Count Total_Prop Total_Crop Health
1 flood 23874 143818447550 4926778400 7163
2 hurricane 211 78942098010 4793430800 1069
3 stormsurge 251 43193461000 5000 39
4 tornado 23139 24616723710 283424010 22178
5 flashflood 50012 14950618910 1320489700 2512
6 hail 207685 14595136920 2476029450 720
Looking at Property Damage, Floods have had the most economic impact on Property. After that, each successive ‘event’ is almost half of the previous. One might have thought that Hurricanes would be the top event, but this may be explained when we look at the geographical distribution of events later.
Top 6 Events by Total Crop Damage ($)
events_sum <- arrange(events_sum, desc(Total_Crop))
topsix_crop <- events_sum[1:6,1]
events_sum[1:6,]
Source: local data frame [6 x 5]
EVTYPE Count Total_Prop Total_Crop Health
1 drought 2408 1041101000 13367361000 4
2 flood 23874 143818447550 4926778400 7163
3 hurricane 211 78942098010 4793430800 1069
4 hail 207685 14595136920 2476029450 720
5 flashflood 50012 14950618910 1320489700 2512
6 extremecold 617 19760400 1308973000 194
Crop Damage has mostly occurred from Drought for 1996 - 2011, with Floods, Hurricanes, and Storm Surge rounding out the top 4. Drought damage to crops is 4 times as high as the next Event on the list. If you add up the three Events immediately after Drought, they still would not be equal to the damage drought has done.
Top 6 Events by People Harmed (Count)
events_sum <- arrange(events_sum, desc(Health))
topsix_health <- events_sum[1:6,1]
events_sum[1:6,]
Source: local data frame [6 x 5]
EVTYPE Count Total_Prop Total_Crop Health
1 tornado 23139 24616723710 283424010 22178
2 excessiveheat 1638 7723700 492402000 7866
3 flood 23874 143818447550 4926778400 7163
4 thunderstormwind 209947 7849636570 952241350 5372
5 lightning 13152 742045580 6898440 4765
6 flashflood 50012 14950618910 1320489700 2512
Harmful events show that Tornadoes are by far and away the most dangerous of weather events in the United State. The next highest Event, Excessive Heat, is only 1 one third of the numbers. Floods, Thunder storm Winds, and Lightning round out the top 5. It would be an interesting additional study to look into the relationships between Tornadoes and Thunder Storm Winds to see if they are the same thing being reported differently.
While lists of the top events are informative, a more useful visualization would be to plot the levels of the each event on a map of the United States to get a feel for the geographic distribution of Storm Events. A choropleth is a plot utilizing Latitude and Longitude as the X and Y variables. For the purposes of this analysis, we will not be plotting at the point scale since storms cover a large geographic area. We will be plotting the Top 6 Events on a state level. We will create 6 maps per plot, each one representing each of the Top 6 Events for Property and Crop Damage, along the a count of the Harm incurred during the Event.
The code for the creating the plot is as follows: We initially create some color functions in order to distinguish between the various levels of each category. All of the data is skewed to the left, but Log based axis may not make much sense. We will use a set of bins that will use a pseudo-log style, but keep the numbers in an easily translatable form. Also, given the distance that U.S. territories, Alaska, and Hawai’i are from the U.S. mainland, only the 48 states and the district of Columbia will be shown in the choropleth plots.
getColor <- function(x) {
if (x > 1000000000) {
col <- "#13373e"
} else if (x > 100000000) {
col <- "#246571"
} else if (x > 1000000) {
col <- "#308898"
} else {
col <- "#7bc7d5"
}
return(col)
}
getColor.harm <- function(x) {
if (x > 600) {
col <- "#13373e"
} else if (x > 400) {
col <- "#246571"
} else if (x > 200) {
col <- "#308898"
} else {
col <- "#7bc7d5"
}
}
With the color functions set, we can start to lay out the plots for 6 charts on each plot.
par(mfrow=c(4,2), mar = c(1,1,1,1))
topevents <- 1:6
The heart of the process is taking each event in the Top 6, summarizing the relevant category, assigning a color to it, and then plotting it. The map() function in R needs some help because some state are broken up into multiple regions, hence the section with mapnames, regionlist, ect. Once the maps are on the plot, we create a legend based on our pseudo-log assignment of colors.
par(mfrow=c(4,2), mar = c(1,1,1,1), oma = c(0,0,2,0))
topevents <- 1:6
for (evt in topevents) {
# Select out the current Top Event, Sum up the category for each state
prop.state <- recent.clean[recent.clean$EVTYPE == topsix_prop[evt],]
prop.state.group <- group_by(prop.state, statename)
prop.state.group_sum <- summarize(prop.state.group, total = sum(PROPDMG))
# Match values to database region names
mapnames <- map("state", plot=FALSE)$names
regionlist <- strsplit(mapnames, ":")
mapnames.fin <- sapply(regionlist, "[", 1)
m <- match(mapnames.fin, tolower(prop.state.group_sum$statename))
maprates <- prop.state.group_sum$total[m]
maprates <- na.omit(maprates)
# Assign colors based on Event category and then plot eac state to that color
statecols <- sapply(maprates, FUN=getColor)
map("state", regions=prop.state.group_sum$statename[m], proj = "albers",
param = c(39,45), fill=TRUE, col=statecols, border=NA, resolution=0)
text(mapproject(-90,26), topsix_prop[evt], cex = 1)
}
mtext("Lattice Plot of Top 6 Events: Property Damage by State (1996-2011)", outer=TRUE)
# Make-shift legend
plot(0, 0, type="n", axes=FALSE, xlim=c(0,30), ylim=c(0,2))
rect(c(5,10,15,20), c(1,1,1,1), c(10,15,20,25), c(1.25,1.25,1.25,1.25),
col=sapply(c(0,1000001,100000001,1000000001), getColor), border="white", lwd=0.4)
text(15, 1.35, "Property Damage ($)")
text(c(5,10,15,20), c(0.9,0.9,0.9), c("$0", "$1M","$100M","$1B"), cex=0.8) # Tick labels
The choropleths reveal some interesting aspects of the data. Floods, which are the top property damage values are present in almost every state and may be the largest total damage from reports in California, Florida, and the Northeast corridor. These areas have high property values and would result in a high summation. Hurricanes on the other hand, while very destructive, are concentrated on the southern coasts. Note that Nevada shows up in the Hurricane plot. The bin goes from $0 to $1 Million, but while most state reported no Hurricane costs, Nevada reports $0, so the state gets assigned a color. Other interesting observations are that, as expected, Storm Surges are confined to the coasts (except for Montana), Tornadoes are predominately in the area known as ‘Tornado Alley’, and Flash Floods and Hail have been encountered in all states.
par(mfrow=c(4,2), mar = c(1,1,1,1), oma = c(0,0,2,0))
topevents <- 1:6
for (evt in topevents) {
crop.state <- recent.clean[recent.clean$EVTYPE == topsix_crop[evt],]
crop.state.group <- group_by(crop.state, statename)
crop.state.group_sum <- summarize(crop.state.group, total = sum(CROPDMG))
# Match values to database region names
mapnames <- map("state", plot=FALSE)$names
regionlist <- strsplit(mapnames, ":")
mapnames.fin <- sapply(regionlist, "[", 1)
m <- match(mapnames.fin, tolower(crop.state.group_sum$statename))
maprates <- crop.state.group_sum$total[m]
maprates <- na.omit(maprates)
statecols <- sapply(maprates, FUN=getColor)
map("state", regions=crop.state.group_sum$statename[m], proj = "albers",
param = c(39,45), fill=TRUE, col=statecols, border=NA, resolution=0)
text(mapproject(-90,26), topsix_crop[evt], cex = 1)
}
mtext("Lattice Plot of Top 6 Events: Crop Damage by State (1996-2011)", outer=TRUE)
# Make-shift legend
plot(0, 0, type="n", axes=FALSE, xlim=c(0,30), ylim=c(0,2))
rect(c(5,10,15,20), c(1,1,1,1), c(10,15,20,25), c(1.25,1.25,1.25,1.25),
col=sapply(c(0,1000001,100000001,1000000001), getColor), border="white", lwd=0.4)
text(15, 1.35, "Crop Damage ($)")
text(c(5,10,15,20), c(0.9,0.9,0.9), c("$0", "$1M","$100M","$1B"), cex=0.8) # Tick labels
The next plot of choropleths show the distribution of the Top 6 Events for crop damage. Crop damage is going to be influenced by the effect that most crops in the United States are grown in the South, Midwest, and California. Drought leads the way with concentrations in Texas and Oklahoma. The rest of the Midwest and states south of the Great Lakes also show effects from Drought. Floods occur in all states, with a Concentration in California, where Drought did not have such a large effect. Again Hurricanes show up on the southeast coastline. Hail and Flashfloods show up in the Midwest and South. Extreme Cold is interesting because it shows up in some states that usually do not get as cold as the rest of the country. Because of this, we can deduce more cold sensitive crops are grown in Texas, Louisiana, Florida, and California.
par(mfrow=c(4,2), mar = c(1,1,1,1), oma = c(0,0,2,0))
topevents <- 1:6
for (evt in topevents) {
health.state <- recent.clean[recent.clean$EVTYPE == topsix_health[evt],]
health.state.group <- group_by(health.state, statename)
health.state.group_sum <- summarize(health.state.group, total = sum(harm))
# Match values to database region names
mapnames <- map("state", plot=FALSE)$names
regionlist <- strsplit(mapnames, ":")
mapnames.fin <- sapply(regionlist, "[", 1)
m <- match(mapnames.fin, tolower(health.state.group_sum$statename))
maprates <- health.state.group_sum$total[m]
maprates <- na.omit(maprates)
statecols <- sapply(maprates, FUN=getColor.harm)
map("state", regions=health.state.group_sum$statename[m], proj = "albers",
param = c(39,45), fill=TRUE, col=statecols, border=NA, resolution=0)
text(mapproject(-90,26), topsix_health[evt], cex = 1)
}
mtext("Lattice Plot of Top 6 Events: Harmful Events by State (1996-2011)", outer = TRUE)
# Make-shift legend
plot(0, 0, type="n", axes=FALSE, xlim=c(0,30), ylim=c(0,2))
rect(c(5,10,15,20), c(1,1,1,1), c(10,15,20,25), c(1.25,1.25,1.25,1.25),
col=sapply(c(0,201,401,601), getColor.harm), border="white", lwd=0.4)
text(15, 1.35, "Harm (Count)")
text(c(5,10,15,20), c(0.9,0.9,0.9), c("0", "200","400","600"), cex=0.8) # Tick labels
The final plot shows the choropleths associated with Harm, injuries and fatalities, from Storm Events. Tornado deaths and injuries are concentrated in the ‘Tornado Alley’ and are practically non-existent in the West. Excessive Heat shows up in Northern Midwest and Northeast states that normally do not get as hot, along with California and Texas. Floods seem unusually concentrated in Texas, but with only a count of 7163 in 15 years, one or two events could skew Texas to the top. Thunder Storm Winds follow a path from Texas to New York, which could be a result of Jet Stream effects. Lightning is concentrated in Florida, with some additional cases in Texas, Georgia, and North Carolina. Flash Floods are similar to Floods and concentrated in Texas, with some in Arizona.