Severe weather events can cause massive damage to property and crops and can harm public health. Analyzing past events helps identify which event types are most harmful so that damage from future events can be reduced. This project uses data from the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database. The analysis is performed in the R programming language, using RStudio as the programming environment. The data was subset to the relevant columns (property damage, crop damage, fatalities, and injuries), summed by event type, and sorted to find the top 25 event types with respect to public health and economic damage. The analysis revealed that tornadoes had the greatest impact on public health, based on the number of fatalities and injuries, and that floods had the greatest economic impact, based on combined property and crop damage.
Raw data is often very large and messy, so the provided dataset must undergo some processing prior to being analyzed.
R has many “packages” (libraries of code) that can be installed but are not loaded into the environment by default. The packages used in this analysis must be loaded before any code that depends on them is run; otherwise that code will fail.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Since working with large datasets is very resource intensive, we will extract just the columns that we are interested in.
data <- read.csv("repdata_data_StormData.csv.bz2")
col_select <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP",
                "CROPDMG", "CROPDMGEXP")
tidy_data <- data[, col_select]
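As an alternative to reading every column and subsetting afterwards, base R's read.csv() can skip unwanted columns at read time through its colClasses argument. The following is a minimal sketch on a small temporary file, not the method used in this analysis; the toy file and its STATE and REMARKS columns are made up for illustration.

```r
# Toy CSV standing in for the real storm data (contents are illustrative only)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(STATE = c("AL", "TX"),
                     EVTYPE = c("TORNADO", "HAIL"),
                     FATALITIES = c(1, 0),
                     REMARKS = c("a", "b")),
          path, row.names = FALSE)

wanted <- c("EVTYPE", "FATALITIES")

# Read only the first row to learn the column names
header <- names(read.csv(path, nrows = 1))

# "NULL" tells read.csv() to skip a column; NA lets it guess the type
classes <- ifelse(header %in% wanted, NA, "NULL")

small <- read.csv(path, colClasses = classes)
names(small)
```

Whether this saves meaningful time or memory on the full dataset would need to be measured; the file still has to be decompressed and scanned in full.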
Each damage value is split across two columns. The PROPDMG and CROPDMG columns hold a numerical value, and the PROPDMGEXP and CROPDMGEXP columns hold a power-of-ten exponent code. The actual dollar amount is the numerical value multiplied by 10 raised to that exponent, so the two columns must be combined before the data can be analyzed.
The exponent columns contain letters and other characters in place of numerical exponents (for example, H, K, M, and B stand for 2, 3, 6, and 9). These codes need to be converted to numerical values before mathematical operations can be performed on them.
tidy_data$PROPDMGEXP <- gsub("[Hh]", "2", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[Kk]", "3", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[Mm]", "6", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[Bb]", "9", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[\\+]", "1", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("-", "0", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("[\\?]", "0", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- gsub("^$", "0", tidy_data$PROPDMGEXP)
tidy_data$PROPDMGEXP <- as.numeric(tidy_data$PROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("[Kk]", "3", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("[Mm]", "6", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("[Bb]", "9", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("[Hh]", "2", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("[\\?]", "0", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- gsub("^$", "0", tidy_data$CROPDMGEXP)
tidy_data$CROPDMGEXP <- as.numeric(tidy_data$CROPDMGEXP)
Now that all of the values are numerical and prepared, we can multiply the damage values by their exponents to get the accurate values.
tidy_data <- tidy_data %>% mutate(PROPDMG = PROPDMG * 10^PROPDMGEXP)
tidy_data <- tidy_data %>% mutate(CROPDMG = CROPDMG * 10^CROPDMGEXP)
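The conversion above can also be expressed with a lookup table instead of a chain of gsub() calls. The sketch below is an illustration on made-up rows, not the code used in this analysis; the helper name to_power and the toy data frame are hypothetical, and the mapping simply mirrors the substitutions shown above.

```r
# Named lookup vector mirroring the gsub() substitutions above
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9, "+" = 1, "-" = 0, "?" = 0)

# Hypothetical helper: turn exponent codes into numeric powers of ten
to_power <- function(code) {
  code <- toupper(code)
  out <- unname(exp_lookup[code])      # NA for codes not in the table
  is_digit <- grepl("^[0-9]$", code)   # digit codes are used at face value
  out[is_digit] <- as.numeric(code[is_digit])
  out[code == ""] <- 0                 # empty code: no multiplier
  out
}

# Made-up example rows
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "", "5"),
                  stringsAsFactors = FALSE)
toy$PROPDMG_USD <- toy$PROPDMG * 10^to_power(toy$PROPDMGEXP)
toy
```

Here 25 with code K becomes 25,000 and 2.5 with code M becomes 2,500,000. One difference from the gsub() chain is that any unexpected code yields NA rather than passing through silently, which makes unhandled codes easier to spot.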
As with the column subsetting above, many rows are not needed: they report no fatalities, no injuries, and no damage. Since we are only interested in non-zero values, we keep only the rows where at least one of these four measures is non-zero, which also improves computational efficiency.
tidy_data <- tidy_data[tidy_data$FATALITIES != 0 |
                       tidy_data$INJURIES != 0 |
                       tidy_data$PROPDMG != 0 |
                       tidy_data$CROPDMG != 0, ]
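To make the effect of this filter concrete, here is a small sketch on made-up rows. It uses rowSums() over the four measure columns, which should be equivalent to the chained `|` conditions above; the toy data frame and the names measures and keep are illustrative only.

```r
# Made-up rows: HAIL has zeros everywhere and should be dropped
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 0, 0),
  INJURIES   = c(0, 0, 0, 2),
  PROPDMG    = c(0, 0, 5000, 0),
  CROPDMG    = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

measures <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE for rows where at least one measure is non-zero
keep <- rowSums(toy[, measures] != 0) > 0
toy_nonzero <- toy[keep, ]
toy_nonzero$EVTYPE
```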
Now that the data has been filtered, we can sum each of the four measures by event type, store the results in separate variables, and sort them to keep only the top 25 event types in each category.
fatalities <- aggregate(FATALITIES ~ EVTYPE, data = tidy_data, sum)
injuries <- aggregate(INJURIES ~ EVTYPE, data = tidy_data, sum)
propdmg <- aggregate(PROPDMG ~ EVTYPE, data = tidy_data, sum)
cropdmg <- aggregate(CROPDMG ~ EVTYPE, data = tidy_data, sum)
## Combining economic data together
econ_dmg <- cbind(propdmg,cropdmg$CROPDMG)
colnames(econ_dmg)[3] <- "CROPDMG"
econ_dmg <- mutate(econ_dmg, TOTAL_DMG = PROPDMG + CROPDMG)
fatalities <- arrange(fatalities, desc(FATALITIES))[1:25,]
injuries <- arrange(injuries, desc(INJURIES))[1:25,]
econ_dmg <- arrange(econ_dmg, desc(TOTAL_DMG))[1:25,]
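The aggregate, combine, and rank steps above can be sketched on a handful of made-up rows. This version joins the two summaries with merge() on EVTYPE rather than cbind(), which is a swapped-in choice that does not depend on both tables listing event types in the same row order; the toy values are invented for illustration.

```r
# Made-up damage records
toy <- data.frame(
  EVTYPE  = c("FLOOD", "HAIL", "FLOOD", "DROUGHT", "HAIL"),
  PROPDMG = c(100, 20, 50, 5, 10),
  CROPDMG = c(10, 30, 0, 200, 5),
  stringsAsFactors = FALSE
)

# Sum each damage column by event type
propdmg <- aggregate(PROPDMG ~ EVTYPE, data = toy, sum)
cropdmg <- aggregate(CROPDMG ~ EVTYPE, data = toy, sum)

# Join on the event type, then add a combined total
econ <- merge(propdmg, cropdmg, by = "EVTYPE")
econ$TOTAL_DMG <- econ$PROPDMG + econ$CROPDMG

# Keep the two event types with the highest total (top 25 in the real data)
top <- head(econ[order(-econ$TOTAL_DMG), ], 2)
top
```

In this toy data DROUGHT ranks first with a total of 205, followed by FLOOD with 160.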
## Renaming column headers for clarity
colnames(fatalities)[1] <- "EVENT_TYPE"
colnames(injuries)[1] <- "EVENT_TYPE"
colnames(econ_dmg) <- c("EVENT_TYPE",
                        "PROPERTY_DAMAGE",
                        "CROP_DAMAGE",
                        "TOTAL_DAMAGE")
The data is now ready for analysis. First we will answer the question: which types of events are most harmful with respect to population health?
To answer this question, let’s look at the top 25 event types for fatalities and injuries. Figure 1 below shows the data in both tabular and graphical formats.
fatalities
## EVENT_TYPE FATALITIES
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## 11 WINTER STORM 206
## 12 RIP CURRENTS 204
## 13 HEAT WAVE 172
## 14 EXTREME COLD 160
## 15 THUNDERSTORM WIND 133
## 16 HEAVY SNOW 127
## 17 EXTREME COLD/WIND CHILL 125
## 18 STRONG WIND 103
## 19 BLIZZARD 101
## 20 HIGH SURF 101
## 21 HEAVY RAIN 98
## 22 EXTREME HEAT 96
## 23 COLD/WIND CHILL 95
## 24 ICE STORM 89
## 25 WILDFIRE 75
injuries
## EVENT_TYPE INJURIES
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## 11 WINTER STORM 1321
## 12 HURRICANE/TYPHOON 1275
## 13 HIGH WIND 1137
## 14 HEAVY SNOW 1021
## 15 WILDFIRE 911
## 16 THUNDERSTORM WINDS 908
## 17 BLIZZARD 805
## 18 FOG 734
## 19 WILD/FOREST FIRE 545
## 20 DUST STORM 440
## 21 WINTER WEATHER 398
## 22 DENSE FOG 342
## 23 TROPICAL STORM 340
## 24 HEAT WAVE 309
## 25 HIGH WINDS 302
ggplot(fatalities, aes(EVENT_TYPE, FATALITIES)) +
  geom_bar(stat = "identity", fill = "red3") +
  guides(x = guide_axis(angle = 30)) +
  xlab("Event Type") +
  ylab("Fatalities")
ggplot(injuries, aes(EVENT_TYPE, INJURIES)) +
  geom_bar(stat = "identity", fill = "red3") +
  guides(x = guide_axis(angle = 30)) +
  xlab("Event Type") +
  ylab("Injuries")
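As seen above, tornadoes are by far the most harmful type of event with respect to population health, with 5,633 fatalities and 91,346 injuries. The next-highest event types are excessive heat for fatalities (1,903) and thunderstorm wind (TSTM WIND) for injuries (6,957).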
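When EVENT_TYPE is a character column, ggplot2 orders a discrete x axis alphabetically, so the bars do not follow the ranking in the tables. One possible way to make the bars follow the ranking is to turn the event type into a factor ordered by the plotted value. The sketch below uses three rows taken from the fatalities table; the name fatalities_toy is hypothetical, and the plot is only built when ggplot2 is available.

```r
# Three rows copied from the fatalities table above
fatalities_toy <- data.frame(
  EVENT_TYPE = c("FLOOD", "TORNADO", "LIGHTNING"),
  FATALITIES = c(470, 5633, 816),
  stringsAsFactors = FALSE
)

# Order factor levels by descending fatalities
fatalities_toy$EVENT_TYPE <- reorder(fatalities_toy$EVENT_TYPE,
                                     -fatalities_toy$FATALITIES)
levels(fatalities_toy$EVENT_TYPE)

# A discrete axis follows factor level order, so bars appear largest first
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(fatalities_toy,
                       ggplot2::aes(EVENT_TYPE, FATALITIES)) +
    ggplot2::geom_col(fill = "red3")
}
```

The resulting level order is TORNADO, LIGHTNING, FLOOD.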
Next, let’s answer the second question: which types of events cause the most economic damage?
To answer this question, let’s look at the economic damage data. The values for property damage and crop damage have been combined to calculate the event types with the highest combined cost. Figure 2 below shows the top 25 event types with the highest damage costs in both tabular and graphical formats.
econ_dmg
## EVENT_TYPE PROPERTY_DAMAGE CROP_DAMAGE TOTAL_DAMAGE
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56947381217 414953270 57362334487
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15735267513 3025954473 18761221986
## 6 FLASH FLOOD 16822673979 1421317100 18243991079
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927860 5022113500 8967041360
## 11 TROPICAL STORM 7703890550 678346000 8382236550
## 12 WINTER STORM 6688497251 26944000 6715441251
## 13 HIGH WIND 5270046475 638571300 5908617775
## 14 WILDFIRE 4765114000 295472800 5060586800
## 15 TSTM WIND 4484928495 554007350 5038935845
## 16 STORM SURGE/TIDE 4641188000 850000 4642038000
## 17 THUNDERSTORM WIND 3483122472 414843050 3897965522
## 18 HURRICANE OPAL 3172846000 19000000 3191846000
## 19 WILD/FOREST FIRE 3001829500 106796830 3108626330
## 20 HEAVY RAIN/SEVERE WEATHER 2500000000 0 2500000000
## 21 THUNDERSTORM WINDS 1944590859 190654788 2135245647
## 22 TORNADOES, TSTM WIND, HAIL 1600000000 2500000 1602500000
## 23 HEAVY RAIN 694248090 733399800 1427647890
## 24 EXTREME COLD 67737400 1292973000 1360710400
## 25 SEVERE THUNDERSTORM 1205360000 200000 1205560000
ggplot(econ_dmg, aes(EVENT_TYPE, TOTAL_DAMAGE)) +
  geom_bar(stat = "identity", fill = "green4") +
  guides(x = guide_axis(angle = 30)) +
  xlab("Event Type") +
  ylab("Total Damage (in dollars)")
As shown above, floods cause the most economic damage by far, at roughly $150 billion in combined property and crop damage. Hurricanes/typhoons come second at about $72 billion, a little less than half that value.