Brynjólfur Gauti Jónsson
Weather events can have disastrous consequences if we are ill prepared. Heat waves can cause loss of life and floods can cause property damage. If we had reliable information ranking the different types of events, the government could decide where to spend its budget most efficiently. The purpose of this analysis is to see which types of events do the most harm to the population and cause the most property damage. I will look at the mean and total damage caused by major events and plot them for easy viewing.
The database we work with is obtained and kept up to date by the National Climatic Data Center (NCDC). The NCDC receives the data from the National Weather Service, which in turn receives its data from a variety of sources.
The dataset was originally downloaded on September 11, 2017. I downloaded it into a temporary file and read it from that file into RStudio.
temp <- tempfile()
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', temp)
data <- read.csv(temp)
Let’s see what variables the data contains.
dim(data)
## [1] 902297 37
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We are not interested in all of the data; we only want the variables that concern population and property damage. Thus we will grab a subset of the data containing only the variables we need.
smalldata <- data[, c(8, 23, 24, 25, 27)]
names(smalldata)
## [1] "EVTYPE" "FATALITIES" "INJURIES" "PROPDMG" "CROPDMG"
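Note that this subset leaves out PROPDMGEXP and CROPDMGEXP. As far as I understand the Storm Data documentation, those columns give the magnitude of the damage figures (K for thousands, M for millions, B for billions), so PROPDMG and CROPDMG are only comparable across rows once they are scaled. The rest of this analysis uses the raw values. A minimal sketch of the scaling, where `exp_multiplier` is a hypothetical helper and treating H as hundreds and every other code as 1 is my own simplifying assumption:

```r
# Hypothetical helper: map a damage exponent code to a numeric multiplier.
# Assumption: H/K/M/B mean hundred/thousand/million/billion; any other
# code (blank, "+", "?", digits, ...) is treated as 1, a simplification.
exp_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy rows standing in for PROPDMG / PROPDMGEXP (not real records)
toy <- data.frame(PROPDMG = c(25, 2.5, 10, 5),
                  PROPDMGEXP = c("K", "M", "", "b"),
                  stringsAsFactors = FALSE)
toy$prop_dollars <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

With the real data, the same idea would be applied to both damage columns before computing any means or totals.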
suppressMessages(library(dplyr))
suppressMessages(library(tidyr))
suppressMessages(library(ggplot2))
suppressMessages(library(ggthemes))
suppressMessages(library(gridExtra))
suppressMessages(library(stringr))
Now let’s start out by looking at the mean effect of each type of event. We arrange the data by each variable one at a time and grab four tables, each containing the top 100 contenders for its respective measurement.
means <- smalldata %>%
  group_by(EVTYPE) %>%
  summarise(meanfat = mean(FATALITIES), meaninj = mean(INJURIES),
            meanprop = mean(PROPDMG), meancrop = mean(CROPDMG))
meanfatal <- arrange(means, desc(meanfat))[1:100,c(1,2)]
meaninj <- arrange(means, desc(meaninj))[1:100,c(1,3)]
meanprop <- arrange(means, desc(meanprop))[1:100,c(1,4)]
meancrop <- arrange(means, desc(meancrop))[1:100,c(1,5)]
cbind(head(meaninj, 10), head(meanfatal, 10))
## EVTYPE meaninj EVTYPE meanfat
## 1 Heat Wave 70.00000 TORNADOES, TSTM WIND, HAIL 25.000000
## 2 TROPICAL STORM GORDON 43.00000 COLD AND SNOW 14.000000
## 3 WILD FIRES 37.50000 TROPICAL STORM GORDON 8.000000
## 4 THUNDERSTORMW 27.00000 RECORD/EXCESSIVE HEAT 5.666667
## 5 HIGH WIND AND SEAS 20.00000 EXTREME HEAT 4.363636
## 6 SNOW/HIGH WINDS 18.00000 HEAT WAVE DROUGHT 4.000000
## 7 GLAZE/ICE STORM 15.00000 HIGH WIND/SEAS 4.000000
## 8 HEAT WAVE DROUGHT 15.00000 MARINE MISHAP 3.500000
## 9 WINTER STORM HIGH WINDS 15.00000 WINTER STORMS 3.333333
## 10 HURRICANE/TYPHOON 14.48864 Heavy surf and wind 3.000000
cbind(head(meancrop, 10), head(meanprop, 10))
## EVTYPE meancrop EVTYPE meanprop
## 1 DUST STORM/HIGH WINDS 500.0000 COASTAL EROSION 766
## 2 FOREST FIRES 500.0000 HEAVY RAIN AND FLOOD 600
## 3 TROPICAL STORM GORDON 500.0000 RIVER AND STREAM FLOOD 600
## 4 HIGH WINDS/COLD 401.0000 Landslump 570
## 5 HURRICANE FELIX 250.0000 BLIZZARD/WINTER STORM 500
## 6 River Flooding 241.3680 FLASH FLOOD/ 500
## 7 WINTER STORMS 166.6667 FLASH FLOODING/THUNDERSTORM WI 500
## 8 EXCESSIVE WETNESS 142.0000 FLOOD/RIVER FLOOD 500
## 9 Frost/Freeze 100.0000 FROST\\FREEZE 500
## 10 TYPHOON 75.0000 HEAVY PRECIPITATION 500
So we get a good first look at the data we need. One thing to keep in mind is that there are two named storms there, Tropical Storm Gordon and Hurricane Felix. Single named occurrences will dominate the data if we’re looking at means, but let’s let them go for now.
We do the same for the total measurements instead of means. We should see more events that occur often with less catastrophic effects, since their measurements would be washed out if we only looked at means.
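To see why an event type with very few records can top a list of means, it helps to keep the record count next to each mean. The sketch below uses made-up numbers and base R only; the column name `n` and the threshold `min_n` are arbitrary choices for illustration, not part of the analysis above.

```r
# Toy data: one rare, severe event and one common, milder event
toy <- data.frame(
  EVTYPE = c("NAMED STORM X", rep("HEAT", 4)),
  FATALITIES = c(8, 2, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Mean fatalities and number of records per event type
meanfat <- tapply(toy$FATALITIES, toy$EVTYPE, mean)
n <- tapply(toy$FATALITIES, toy$EVTYPE, length)
stats <- data.frame(EVTYPE = names(meanfat),
                    meanfat = as.vector(meanfat),
                    n = as.vector(n),
                    stringsAsFactors = FALSE)

# Keep only event types with at least `min_n` records (arbitrary threshold)
min_n <- 3
stable <- stats[stats$n >= min_n, ]
stable
```

The single-record event has the highest mean but disappears once a minimum count is required.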
sums <- smalldata %>%
  group_by(EVTYPE) %>%
  summarise(sumfat = sum(FATALITIES), suminj = sum(INJURIES),
            sumprop = sum(PROPDMG), sumcrop = sum(CROPDMG))
sumfat <- arrange(sums, desc(sumfat))[1:100, c(1,2)]
suminj <- arrange(sums, desc(suminj))[1:100, c(1,3)]
sumprop <- arrange(sums, desc(sumprop))[1:100, c(1,4)]
sumcrop <- arrange(sums, desc(sumcrop))[1:100, c(1,5)]
cbind(head(suminj, 10), head(sumfat, 10))
## EVTYPE suminj EVTYPE sumfat
## 1 TORNADO 91346 TORNADO 5633
## 2 TSTM WIND 6957 EXCESSIVE HEAT 1903
## 3 FLOOD 6789 FLASH FLOOD 978
## 4 EXCESSIVE HEAT 6525 HEAT 937
## 5 LIGHTNING 5230 LIGHTNING 816
## 6 HEAT 2100 TSTM WIND 504
## 7 ICE STORM 1975 FLOOD 470
## 8 FLASH FLOOD 1777 RIP CURRENT 368
## 9 THUNDERSTORM WIND 1488 HIGH WIND 248
## 10 HAIL 1361 AVALANCHE 224
cbind(head(sumcrop, 10), head(sumprop, 10))
## EVTYPE sumcrop EVTYPE sumprop
## 1 TORNADO 100018.52 TORNADO 3212258.2
## 2 FLASH FLOOD 179200.46 FLASH FLOOD 1420124.6
## 3 TSTM WIND 109202.60 TSTM WIND 1335965.6
## 4 FLOOD 168037.88 FLOOD 899938.5
## 5 THUNDERSTORM WIND 66791.45 THUNDERSTORM WIND 876844.2
## 6 HAIL 579596.28 HAIL 688693.4
## 7 LIGHTNING 3580.61 LIGHTNING 603351.8
## 8 THUNDERSTORM WINDS 18684.93 THUNDERSTORM WINDS 446293.2
## 9 HIGH WIND 17283.21 HIGH WIND 324731.6
## 10 WINTER STORM 1978.99 WINTER STORM 132720.6
So tornadoes have had a huge effect throughout history, but they didn’t show up at the top of the means lists. It seems that although tornadoes occur often, the mean amount of damage they cause is not high.
Now we’re going to make some plots. Let’s remove the uniquely named occurrences from the list of means before we proceed.
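The totals tables also list TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS as separate rows, and the means tables mix upper and lower case labels, so some totals are split across spelling variants. A rough sketch of how such labels could be collapsed before summing follows; `normalize_evtype` is a hypothetical helper, the patterns cover only the variants named here, and the numbers are invented.

```r
# Hypothetical helper: collapse a few spelling variants of one event type.
# The patterns are illustrative only, not a complete mapping.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- sub("^TSTM WIND$", "THUNDERSTORM WIND", x)
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)
  x
}

# Toy rows (not real records)
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS",
             "Hail", "HAIL"),
  INJURIES = c(5, 3, 2, 1, 4),
  stringsAsFactors = FALSE
)
toy$EVTYPE <- normalize_evtype(toy$EVTYPE)

# Totals per cleaned label
totals <- tapply(toy$INJURIES, toy$EVTYPE, sum)
totals
```

After cleaning, the three wind labels contribute to a single total instead of three smaller ones.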
# Event types that name an individual tropical storm or hurricane
named_events <- means[c(grep("TROPICAL STORM ", means$EVTYPE), grep("HURRICANE ", means$EVTYPE)), 1]
meanfatal <- meanfatal[!(meanfatal$EVTYPE %in% named_events$EVTYPE),]
meaninj <- meaninj[!(meaninj$EVTYPE %in% named_events$EVTYPE),]
meanprop <- meanprop[!(meanprop$EVTYPE %in% named_events$EVTYPE),]
meancrop <- meancrop[!(meancrop$EVTYPE %in% named_events$EVTYPE),]
# Population plot
meanpop <- arrange(merge(meanfatal, meaninj, by ='EVTYPE'), desc(meanfat))
sumpop <- arrange(merge(sumfat, suminj, by = 'EVTYPE'), desc(sumfat))
gpop <- ggplot(meanpop[1:10,]) + xlab('Event Type') + theme_tufte() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
gsumpop <- ggplot(sumpop[1:10,]) + xlab('Event Type') + theme_tufte() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
gfat <- gpop + geom_col(aes(x = EVTYPE, y = meanfat)) +
  ylab('Mean Fatality Count') + ggtitle('Mean Effect of Events on Population')
ginj <- gpop + geom_col(aes(x = EVTYPE, y = meaninj)) + ylab('Mean Injury Count')
gsumfat <- gsumpop + geom_col(aes(x = EVTYPE, y = sumfat)) +
  ylab('Total Fatalities') +
  ggtitle('Total Effect of Events on Population')
gsuminj <- gsumpop + geom_col(aes(x = EVTYPE, y = suminj)) +
  ylab('Total Injuries')
# Economy plot
meaneco <- arrange(merge(meanprop, meancrop, by = 'EVTYPE'), desc(meanprop))
sumeco <- arrange(merge(sumprop, sumcrop, by='EVTYPE'), desc(sumprop))
geco <- ggplot(meaneco[1:10,]) + xlab('Event Type') + theme_tufte() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
gsumeco <- ggplot(sumeco[1:10,]) + xlab('Event Type') + theme_tufte() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
gprop <- geco + geom_col(aes(x = EVTYPE, y = meanprop)) +
  ylab('Property damage (1000$)') +
  ggtitle('Mean Effect of Events on Economy')
gcrop <- geco + geom_col(aes(x = EVTYPE, y = meancrop)) +
  ylab('Crop Damage (1000$)')
gsumprop <- gsumeco + geom_col(aes(x = EVTYPE, y = sumprop)) +
  ylab('Property damage (1000$)') +
  ggtitle('Total Effect of Events on Economy')
gsumcrop <- gsumeco + geom_col(aes(x = EVTYPE, y = sumcrop)) +
  ylab('Crop Damage (1000$)')
# Summary Plot
meanall <- merge(meaneco, meanpop, by = 'EVTYPE')
gall <- ggplot(meanall[1:5,]) + xlab('Event Type') + theme_tufte() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
gpropall <- gall + geom_col(aes(x = EVTYPE, y = meanprop)) +
  ylab('Property damage (1000$)')
gcropall <- gall + geom_col(aes(x = EVTYPE, y = meancrop)) +
  ylab('Crop Damage (1000$)')
gfatall <- gall + geom_col(aes(x = EVTYPE, y = meanfat)) + ylab('Mean Fatality Count')
ginjall <- gall + geom_col(aes(x = EVTYPE, y = meaninj)) + ylab('Mean Injury Count')
We made a table where we selected the events that scored highly on both injuries and fatalities. Now we will plot the ten most destructive events ranked by number of fatalities.
grid.arrange(gfat, ginj, gsumfat, gsuminj, ncol = 1)
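One caveat about the bar order: ggplot2 lays out a discrete axis by factor level, and for event names that means alphabetical order unless the levels are set explicitly, so the bars in these plots are probably not drawn in rank order even though the rows were sorted. A small sketch of one way to fix the order, using base R's `reorder()` on three totals taken from the fatalities table above:

```r
# Three event types with their total fatalities from the table above
toy <- data.frame(
  EVTYPE = c("AVALANCHE", "HEAT", "TORNADO"),
  sumfat = c(224, 937, 5633),
  stringsAsFactors = FALSE
)

# reorder() sorts factor levels by the second argument in ascending order,
# so negate it to put the largest value first
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$sumfat)
levels(toy$EVTYPE)
```

In the plotting code this would amount to mapping `x = reorder(EVTYPE, -sumfat)` inside `aes()` instead of `x = EVTYPE`.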
Looking at the means, the most dangerous weather events appear to be extreme heat and the events that accompany heat waves. Winter storms also show up, so both extremes of hot and cold weather are represented. After that, many weather-related deaths and injuries are attributed to the sea, which may reflect how many people work or spend time on the water.
From the plot of the totals, we see that tornadoes have caused a huge loss of life and many injuries throughout history, but not on average. We might guess that this has something to do with how well different places are prepared for them.
Next we did the same for property and crop damage. We grabbed the events that caused a lot of damage to both property and crops. Now we will show the top ten contenders ranked by property and crop damage.
grid.arrange(gprop, gcrop, gsumprop, gsumcrop, ncol=1)
The plots of the economic consequences show a lot of events concerning rain, floods and water. Since water flooding into buildings can cause a large amount of damage in dollars, it makes sense that those types of events have historically caused a lot of damage. We also see tornadoes here. If a tornado damages a property at all, it probably causes a lot of damage at once, so the dollar amounts add up over time.
Then we merged the two newly created tables, meaneco and meanpop, which contain the top contenders for property damage and population harm respectively.
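By default `merge()` keeps only the rows whose key appears in both tables and returns them sorted by that key. So the merged table holds just the events that made both top lists, and its first rows are the alphabetically first events rather than the highest ranked ones. A toy illustration with invented numbers:

```r
# Two small "top contender" tables (made-up values)
top_eco <- data.frame(EVTYPE = c("FLOOD", "HAIL", "TORNADO"),
                      meanprop = c(500, 300, 200),
                      stringsAsFactors = FALSE)
top_pop <- data.frame(EVTYPE = c("HEAT", "TORNADO", "FLOOD"),
                      meanfat = c(4, 2, 1),
                      stringsAsFactors = FALSE)

# Default merge: only events present in both tables, sorted by EVTYPE
both <- merge(top_eco, top_pop, by = "EVTYPE")
both

# all = TRUE keeps every event and fills the gaps with NA
either <- merge(top_eco, top_pop, by = "EVTYPE", all = TRUE)
either
```

If a ranked summary is wanted, the merged table can be sorted again with `arrange()` before taking the first rows.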
grid.arrange(gpropall, gcropall, gfatall, ginjall, ncol = 2)
These are the events that cause a lot of property damage and pose a high risk to the population. If the government were to choose events against which to strengthen its safeguards, the above would be safe bets.