Severe weather events harm both public health and the economy. Quantifying the damage caused by past events provides statistical information that can guide policy decisions and emergency management plans.
This report will focus on two key areas, public health and the economy, by specifically looking at which types of weather events have the greatest impact in each.
The data used in the analysis is the U.S. National Oceanic and Atmospheric Administration’s Storm database. The NOAA database compiles information on major storms and other weather events across the United States. The location, duration and date of events are recorded as well as fatalities, injuries, property damage and crop damage. The database covers the period from 1950 to 2011.
The report is structured in two sections. The first, ‘Data Processing’, details the steps taken to process the raw NOAA data and the calculations made to create analytical data. The R code chunks are given in sequence with text providing explanations and motivations for the methods used. The second section, ‘Results’, analyzes the data and provides a brief discussion of the results. Both sections are further divided in order to look at Public Health and Economic Impact separately.
The NOAA data is loaded and processed in R.
The majority of the data analysis is executed using base R, but these additional packages are also needed:
1: ‘tidyr’: used to pivot data tables before plotting the data
2: ‘ggplot2’: used for creating plots
library(ggplot2)
library(tidyr)
The raw data was made available through the course website and saved to the project directory. It is a csv file compressed in the .csv.bz2 format. The file is decompressed automatically when it is read with the read.csv function.
dirW <- "C:/Users/annal/Dropbox/03_Education/07_Data Science/Coursera/Course 5 - Reproducible Research/Assignment 2"
setwd(dirW)
folderName <- "repdata_data_StormData.csv.bz2"
dataCSV <- read.csv(folderName)
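As a small self-contained illustration of reading a bzip2-compressed csv file directly, the sketch below writes a toy table to a temporary .csv.bz2 file and reads it back with read.csv. The `toy` data frame and the temporary file are made up for the example and are not part of the NOAA data.

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("HAIL", "FLOOD"), FATALITIES = c(0, 2))

# Write the table through a bzip2 connection to a temporary .csv.bz2 file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv is given only the path; the compression is detected when the file is opened
toyBack <- read.csv(tmp)
print(toyBack)
```

No explicit call to a decompression tool is needed, which is why the report can pass the compressed file name straight to read.csv.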
There are two variables in the data set related to public health: the number of fatalities and the number of injuries for each event. To evaluate the impact of specific types of weather events it is helpful to start with summary statistics (total, median, mean). The first step is to look at the summary statistics for these two variables across all weather events.
print(summary(dataCSV$FATALITIES))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
print(summary(dataCSV$INJURIES))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
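Both summaries report a median and a third quartile of zero, which indicates that at least three quarters of the records list no fatalities or injuries. A quick way to check the share of zero records directly is sketched below on a made-up vector; `toyFatalities` is illustrative and not taken from the NOAA data.

```r
# Hypothetical fatality counts: mostly zeros with a few large values
toyFatalities <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 9)

# Share of records with no fatalities
zeroShare <- mean(toyFatalities == 0)
print(zeroShare)

# The median is zero even though the mean is not
toyMedian <- median(toyFatalities)
toyMean <- mean(toyFatalities)
print(toyMedian)
print(toyMean)
```

With data of this shape the mean and the total carry the information, while the median does not distinguish between event types.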
The next step is to calculate the summary statistics for each severe weather event type. This time the median can be left out, since the previous summaries have shown that it is zero for both variables and therefore not informative. We will calculate the number of events (count), the total value (sum) and the average (mean). The values are calculated using a for loop over the unique weather event types (EVTYPE).
# first extract all the unique values for events
events <- unique(dataCSV$EVTYPE)
# Second create an empty data frame
dataHealth <- data.frame(matrix(ncol = 6, nrow = length(events)))
colnames(dataHealth) <- c("Events", "Count", "F_Sum",
"F_Mean", "I_Sum", "I_Mean")
# lastly calculate the total and mean values for each event using a 'for' loop
for (rn in 1:length(events)) {
dataHealth[rn, 1] <- events[rn]
dataHealth[rn, 2] <- length(dataCSV$FATALITIES[dataCSV$EVTYPE==events[rn]])
dataHealth[rn, 3] <- sum(subset(dataCSV$FATALITIES, dataCSV$EVTYPE == events[rn]), na.rm = TRUE)
dataHealth[rn, 4] <- mean(subset(dataCSV$FATALITIES, dataCSV$EVTYPE == events[rn]), na.rm = TRUE)
dataHealth[rn, 5] <- sum(subset(dataCSV$INJURIES, dataCSV$EVTYPE == events[rn]), na.rm = TRUE)
dataHealth[rn, 6] <- mean(subset(dataCSV$INJURIES, dataCSV$EVTYPE == events[rn]), na.rm = TRUE)
}
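The loop above can also be expressed without explicit iteration. The following is a minimal sketch, not the method used in this report, that applies base R's aggregate to a made-up data frame; `toy`, `perEvent` and `toyHealth` are hypothetical names introduced for the example.

```r
# Toy records standing in for dataCSV (hypothetical values)
toy <- data.frame(
  EVTYPE     = c("HAIL", "FLOOD", "FLOOD", "TORNADO", "TORNADO", "TORNADO"),
  FATALITIES = c(0, 1, 3, 0, 2, 4),
  INJURIES   = c(1, 0, 2, 5, 0, 10)
)

# Count, sum and mean of one variable within one event type
perEvent <- function(x) c(Count = length(x), Sum = sum(x), Mean = mean(x))

# Apply perEvent to both variables for every event type.
# Note: the formula interface drops rows containing NA by default.
toyHealth <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = perEvent)

# aggregate returns matrix columns; flatten them into ordinary columns
toyHealth <- do.call(data.frame, toyHealth)
print(toyHealth)
```

The result has one row per event type and columns such as FATALITIES.Sum and INJURIES.Mean, which correspond to F_Sum and I_Mean in the loop-based table.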
There are over 900 unique event types, and many of them have no recorded impact on public health. This report focuses on the 20 events with the greatest impact in each of four rankings: total fatalities, mean fatalities, total injuries and mean injuries.
# Order the total fatalities and extract the top 20 rows
dataHealth <- dataHealth[order(dataHealth$F_Sum, decreasing = TRUE),]
resultsFatSum <- dataHealth[1:20, c(1, 3)]
# Order the fatalities means and extract the top 20 rows
dataHealth <- dataHealth[order(dataHealth$F_Mean, decreasing = TRUE),]
resultsFatMean <- dataHealth[1:20, c(1, 4)]
# Order the injuries sum and extract the top 20 rows
dataHealth <- dataHealth[order(dataHealth$I_Sum, decreasing = TRUE),]
resultsInjSum <- dataHealth[1:20, c(1, 5)]
# Order the injuries means and extract the top 20 rows
dataHealth <- dataHealth[order(dataHealth$I_Mean, decreasing = TRUE),]
resultsInjMean <- dataHealth[1:20, c(1, 6)]
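The four order-and-extract steps follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do this; `topN` is a hypothetical function written for this example and `toyHealth` is made-up data.

```r
# Hypothetical helper: the n rows of df with the largest values in column col,
# keeping only the event name and that column
topN <- function(df, col, n = 20) {
  ranked <- df[order(df[[col]], decreasing = TRUE), ]
  head(ranked[, c("Events", col)], n)
}

# Toy summary table (hypothetical values)
toyHealth <- data.frame(
  Events = c("A", "B", "C", "D"),
  F_Sum  = c(5, 40, 0, 12),
  F_Mean = c(2.5, 0.1, 0, 6)
)

topSum  <- topN(toyHealth, "F_Sum", n = 2)
topMean <- topN(toyHealth, "F_Mean", n = 2)
print(topSum)
print(topMean)
```

Using head instead of indexing with 1:n avoids rows of NA when the table has fewer than n rows.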
To display the results, a data frame is created that combines the values calculated above.
# Combine all the events in the 4 individual results data frames into one list
eventsHealth <- resultsFatSum$Events
eventsHealth <- append(eventsHealth, resultsFatMean$Events)
eventsHealth <- append(eventsHealth, resultsInjSum$Events)
eventsHealth <- append(eventsHealth, resultsInjMean$Events)
eventsHealth <- unique(eventsHealth)
# Create dataframe by subsetting dataHealth by only the events that have the
# highest impact (eventsHealth)
dataHealthHigh <- subset(dataHealth, Events %in% eventsHealth)
The highest mean values for fatalities and injuries can best be visualized as a plot.
# To plot the events with the highest mean for fatalities and injuries we need
# to first combine the events in resultsFatMean and resultsInjMean and
# then remove all duplicate event values
eventsMean <- resultsFatMean$Events
eventsMean <- append(eventsMean, resultsInjMean$Events)
eventsMean <- unique(eventsMean)
# Create dataframe by subsetting dataHealth by only the events that have the
# highest mean (eventsMean), only take the columns with events and mean values
dataHealthPlot1 <- subset(dataHealth, Events %in% eventsMean)
dataHealthPlot1 <- dataHealthPlot1[, c(1,4,6)]
# Change the col names to values that will read better in the plot
colnames(dataHealthPlot1) <- c("Events", "Fatalities", "Injuries" )
# To create the plots the data frame needs to pivot in order to combine 3 variables
# into 2 (Type and Value)
dataHealthPlot2 <- dataHealthPlot1 %>% pivot_longer(!Events, names_to = "Type", values_to = "Mean")
plotHealthMean <- ggplot(data = dataHealthPlot2, aes(x= Events, y=Mean, fill=Type)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
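To make the reshaping step concrete, the sketch below builds the same long layout by hand in base R on a made-up table. It is only an illustration of what the pivot produces; the report itself uses pivot_longer, and the row order produced here may differ from that function's output. `toyWide` and `toyLong` are hypothetical names.

```r
# Toy wide table: one row per event, one column per measure (hypothetical values)
toyWide <- data.frame(
  Events     = c("FLOOD", "HAIL"),
  Fatalities = c(2, 0),
  Injuries   = c(1, 1),
  stringsAsFactors = FALSE
)

# Long layout: one row per event and measure, with the measure name in Type
toyLong <- data.frame(
  Events = rep(toyWide$Events, times = 2),
  Type   = rep(c("Fatalities", "Injuries"), each = nrow(toyWide)),
  Mean   = c(toyWide$Fatalities, toyWide$Injuries),
  stringsAsFactors = FALSE
)
print(toyLong)
```

In the long layout the Type column can be mapped to the fill aesthetic, which is what allows a single stacked bar per event.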
Several variables in the raw data relate to the economic damage caused by weather events. PROPDMG provides property damage in a dollar amount, while CROPDMG records damage to crops, also in dollar amounts. Each of these variables is accompanied by a second variable that provides the magnitude (XP): K for thousands, M for millions, B for billions and T for trillions. Unfortunately, the unique values in PROPDMGEXP and CROPDMGEXP include a number of entries which fall outside these codes. Rows with such entries are removed first, before the XP code is replaced by the corresponding numeric value.
# Create data tables for property and crop damage that remove all rows with incorrect
# XP values
dataProp <- dataCSV[dataCSV$PROPDMGEXP %in% c("K", "M", "B", "T"),]
dataCrop <- dataCSV[dataCSV$CROPDMGEXP %in% c("K", "M", "B", "T"),]
# Create a lookup table for the magnitude values
XP <- c("K","M","B","T")
X <- c(1e3, 1e6, 1e9, 1e12)
lookup <- data.frame( XP = XP, X = X)
# Replace the XP code with the numeric value using the lookup table
dataProp$PROPDMGEXP <- lookup$X[match(dataProp$PROPDMGEXP, lookup$XP)]
dataCrop$CROPDMGEXP <- lookup$X[match(dataCrop$CROPDMGEXP, lookup$XP)]
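Before discarding rows it can be useful to see how many records carry each magnitude code and how many would be dropped. The sketch below shows one way to check this on a made-up vector of codes; `toyCodes` is hypothetical and does not reflect the actual distribution in the NOAA data.

```r
# Hypothetical magnitude codes, including some that are not K, M, B or T
toyCodes <- c("K", "K", "M", "", "B", "5", "k", "K")
valid <- c("K", "M", "B", "T")

# Frequency of every code that occurs
print(table(toyCodes, useNA = "ifany"))

# Number of records kept and dropped by the %in% filter
keep <- toyCodes %in% valid
nKept <- sum(keep)
nDropped <- sum(!keep)
print(nKept)
print(nDropped)
```

The comparison is case sensitive, so a lowercase code such as "k", if present, is dropped together with blank and numeric entries.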
The property and crop damage is calculated by multiplying the recorded value by the numeric value of the magnitude. The result is stored in a new column in each data table.
# Create a new empty column to contain the multiplied values for property and crop damage
dataProp[, "PropertyDamage"] = NA
dataCrop[, "CropDamage"] = NA
# Calculate the new property damage costs by multiplying the Cost with the Magnitude
for (nr in 1:nrow(dataProp)) {
dataProp$PropertyDamage[nr] <- dataProp$PROPDMG[nr] * dataProp$PROPDMGEXP[nr]
}
# Calculate the new crop damage costs by multiplying the Cost with the Magnitude
for (nr in 1:nrow(dataCrop)) {
dataCrop$CropDamage[nr] <- dataCrop$CROPDMG[nr] * dataCrop$CROPDMGEXP[nr]
}
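Because arithmetic in R is vectorized, the same multiplication can be written without a loop. The following is a minimal sketch on made-up data that uses a named vector as the lookup; `toyProp` and `magnitude` are hypothetical names and this is an alternative formulation, not the code used in the report.

```r
# Toy property damage records (hypothetical values); "?" is an invalid code
toyProp <- data.frame(
  PROPDMG    = c(2.5, 10, 1, 7),
  PROPDMGEXP = c("K", "M", "B", "?"),
  stringsAsFactors = FALSE
)

# Named vector mapping each magnitude code to its multiplier
magnitude <- c(K = 1e3, M = 1e6, B = 1e9, T = 1e12)

# Keep only rows with a valid code, then multiply whole columns at once
toyProp <- toyProp[toyProp$PROPDMGEXP %in% names(magnitude), ]
toyProp$PropertyDamage <- toyProp$PROPDMG * unname(magnitude[toyProp$PROPDMGEXP])
print(toyProp)
```

On a table with several hundred thousand rows a column-wise multiplication of this kind is typically much faster than assigning one element at a time.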
The total cost for property and crop damage caused by each type of weather event is calculated and added to a new data table.
# Now we create a data table where the total for each type of event is given
# First step is to calculate all the unique values for EVTYPE
events <- unique(dataCSV$EVTYPE)
# then a blank data frame is created with the column headings already given
dataCost <- data.frame(matrix(ncol = 3, nrow = length(events)))
colnames(dataCost) <- c("Events", "Property", "Crop")
# then the data frame is populated using a for loop
for (rn in 1:length(events)) {
dataCost[rn, 1] <- events[rn]
dataCost[rn, 2] <- sum(dataProp[which(dataProp$EVTYPE == events[rn]), "PropertyDamage"])
dataCost[rn, 3] <- sum(dataCrop[which(dataCrop$EVTYPE == events[rn]), "CropDamage"])
}
There are over 900 types of weather events, many of which have no property or crop damage recorded. To simplify, we will extract only the 20 events with the highest values for each type of damage.
# Order the property damage and extract the top 20 values
dataCost <- dataCost[order(dataCost$Property, decreasing = TRUE),]
resultsP <- dataCost[1:20, c(1, 2)]
# Order the crop damage and extract the top 20 values
dataCost <- dataCost[order(dataCost$Crop, decreasing = TRUE),]
resultsC <- dataCost[1:20, c(1, 3)]
The cost related to property and crop damage can best be visualized by a plot.
# Combine the events with the highest total property damage (resultsP) and crop
# damage (resultsC) into one list and remove all duplicate event values
eventsHigh <- resultsP$Events
eventsHigh <- append(eventsHigh, resultsC$Events)
eventsHigh <- unique(eventsHigh)
# Create a dataframe by subsetting dataCost by only the events that have the
# highest total costs (eventsHigh)
dataCostHigh <- subset(dataCost, Events %in% eventsHigh)
# To create the plots the data frame needs to pivot in order to combine 3 variables
# into 2 (Type and Value)
dataPlot <- dataCostHigh %>% pivot_longer(!Events, names_to = "Type", values_to = "Total")
# Create a plot file using ggplot2
plotCostSum <- ggplot(data = dataPlot, aes(x= Events, y=Total, fill=Type)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
There are over 900 specific types of weather events. To investigate the impact on public health, only the events with the highest mean or total fatalities (F) and injuries (I) are shown. The number of instances of each type of weather event is also given.
## Events Count F_Sum F_Mean I_Sum I_Mean
## 779 Heat Wave 1 0 0.000000 70 70.00000
## 210 TROPICAL STORM GORDON 1 8 8.000000 43 43.00000
## 114 WILD FIRES 4 3 0.750000 150 37.50000
## 540 THUNDERSTORMW 1 0 0.000000 27 27.00000
## 554 HIGH WIND AND SEAS 1 3 3.000000 20 20.00000
## 524 SNOW/HIGH WINDS 2 0 0.000000 36 18.00000
## 486 HEAT WAVE DROUGHT 1 4 4.000000 15 15.00000
## 119 WINTER STORM HIGH WINDS 1 1 1.000000 15 15.00000
## 364 GLAZE/ICE STORM 1 0 0.000000 15 15.00000
## 973 HURRICANE/TYPHOON 88 64 0.727273 1275 14.48864
## 934 WINTER WEATHER MIX 6 0 0.000000 68 11.33333
## 90 EXTREME HEAT 22 96 4.363636 155 7.04545
## 920 NON-SEVERE WIND DAMAGE 1 0 0.000000 7 7.00000
## 179 GLAZE 32 7 0.218750 216 6.75000
## 979 TSUNAMI 20 33 1.650000 129 6.45000
## 121 WINTER STORMS 3 10 3.333333 17 5.66667
## 442 TORNADO F2 3 0 0.000000 16 5.33333
## 200 WATERSPOUT/TORNADO 8 3 0.375000 42 5.25000
## 539 EXCESSIVE RAINFALL 4 2 0.500000 21 5.25000
## 182 HEAT WAVE 74 172 2.324324 309 4.17568
## 99 EXCESSIVE HEAT 1678 1903 1.134088 6525 3.88856
## 27 HEAT 767 937 1.221643 2100 2.73794
## 74 MARINE MISHAP 2 7 3.500000 5 2.50000
## 914 ROUGH SEAS 3 8 2.666667 5 1.66667
## 1 TORNADO 60652 5633 0.092874 91346 1.50607
## 276 FOG 538 62 0.115242 734 1.36431
## 92 DUST STORM 427 22 0.051522 440 1.03044
## 65 ICE STORM 2006 89 0.044367 1975 0.98455
## 443 RIP CURRENTS 304 204 0.671053 297 0.97697
## 18 RIP CURRENT 470 368 0.782979 232 0.49362
## 73 AVALANCHE 386 224 0.580311 170 0.44041
## 227 WILD/FOREST FIRE 1457 12 0.008236 545 0.37406
## 43 EXTREME COLD 655 160 0.244275 231 0.35267
## 15 LIGHTNING 15754 816 0.051796 5230 0.33198
## 221 WILDFIRE 2761 75 0.027164 911 0.32995
## 47 BLIZZARD 2719 101 0.037146 805 0.29606
## 36 FLOOD 25326 470 0.018558 6789 0.26806
## 111 HIGH SURF 725 101 0.139310 152 0.20966
## 8 WINTER STORM 11433 206 0.018018 1321 0.11554
## 137 STRONG WIND 3566 103 0.028884 280 0.07852
## 53 HEAVY SNOW 15708 127 0.008085 1021 0.06500
## 46 HIGH WIND 20212 248 0.012270 1137 0.05625
## 10 THUNDERSTORM WINDS 20843 64 0.003071 908 0.04356
## 20 FLASH FLOOD 54277 978 0.018019 1777 0.03274
## 2 TSTM WIND 219940 504 0.002292 6957 0.03163
## 967 EXTREME COLD/WIND CHILL 1002 125 0.124750 24 0.02395
## 16 THUNDERSTORM WIND 82563 133 0.001611 1488 0.01802
## 3 HAIL 288661 15 0.000052 1361 0.00471
## 207 TORNADOES, TSTM WIND, HAIL 1 25 25.000000 0 0.00000
## 786 COLD AND SNOW 1 14 14.000000 0 0.00000
## 409 RECORD/EXCESSIVE HEAT 3 17 5.666667 0 0.00000
## 82 HIGH WIND/SEAS 1 4 4.000000 0 0.00000
## 834 Heavy surf and wind 1 3 3.000000 0 0.00000
## 406 RIP CURRENTS/HEAVY SURF 2 5 2.500000 0 0.00000
## 410 HEAT WAVES 2 5 2.500000 0 0.00000
## 490 UNSEASONABLY WARM AND DRY 13 29 2.230769 0 0.00000
## 9 HURRICANE OPAL/HIGH WINDS 1 2 2.000000 0 0.00000
## 561 HEAVY SEAS 2 3 1.500000 0 0.00000
## 813 Hypothermia/Exposure 3 4 1.333333 0 0.00000
There is more than one way to determine which weather event has the highest impact on public health. In terms of the total number of fatalities, the three events with the highest overall impact are TORNADO, EXCESSIVE HEAT and FLASH FLOOD. When we consider the mean fatalities per event, however, the three events with the highest values are different: ‘TORNADOES, TSTM WIND, HAIL’, ‘COLD AND SNOW’ and ‘TROPICAL STORM GORDON’. The reason is that an event type with only a few occurrences but many fatalities in each has a small total and a high mean, whereas an event type with many occurrences and few fatalities in each has a low mean but a high total.
When looking at the impact of specific types of events in the past, it might be helpful to consider the total. When planning for future events, however, it is also important to know how severely an upcoming weather event of a given type is likely to affect public health. The mean value is helpful in this regard. To consider the mean values in more detail, we can look at the following graph, which plots the events with the highest mean for both fatalities and injuries.
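The difference between the two rankings can be checked with the counts from the table above. The short calculation below uses the Count and F_Sum values of TORNADO and of the single ‘TORNADOES, TSTM WIND, HAIL’ record; the variable names are introduced only for this example.

```r
# Count and total fatalities taken from the table above
tornado  <- c(count = 60652, fatalities = 5633)
combined <- c(count = 1, fatalities = 25)   # 'TORNADOES, TSTM WIND, HAIL'

# Mean fatalities per recorded event
tornadoMean  <- tornado[["fatalities"]] / tornado[["count"]]
combinedMean <- combined[["fatalities"]] / combined[["count"]]
print(tornadoMean)
print(combinedMean)

# TORNADO has the larger total, the single combined record the larger mean
print(tornado[["fatalities"]] > combined[["fatalities"]])
print(combinedMean > tornadoMean)
```

A mean based on a single record is also far less reliable than one based on tens of thousands of records, which is worth keeping in mind when reading the ranking by mean.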
Plot 1: Weather events with the highest public health cost
There are two variables which relate to economic impact, property damage and damage to crops. The types of weather events with the highest damage (calculated in dollar amounts) in both are given in the table below.
## Events Property Crop
## 194 DROUGHT 1046106000 13972566000
## 36 FLOOD 144657709800 5661968450
## 52 RIVER FLOOD 5118945500 5029459000
## 65 ICE STORM 3944927810 5022113500
## 3 HAIL 15727366720 3025537450
## 226 HURRICANE 11868319010 2741910000
## 973 HURRICANE/TYPHOON 69305840000 2607872800
## 20 FLASH FLOOD 16140811510 1421317100
## 43 EXTREME COLD 67737400 1292973000
## 960 FROST/FREEZE 9480000 1094086000
## 14 HEAVY RAIN 694248090 733399800
## 209 TROPICAL STORM 7703890550 678346000
## 46 HIGH WIND 5270046260 638571300
## 2 TSTM WIND 4484928440 554007350
## 99 EXCESSIVE HEAT 7753700 492402000
## 54 FREEZE 205000 446225000
## 1 TORNADO 56925660480 414953110
## 16 THUNDERSTORM WIND 3483121140 414843050
## 27 HEAT 1797000 401461500
## 221 WILDFIRE 4765114000 295472800
## 10 THUNDERSTORM WINDS 1733452850 190650700
## 227 WILD/FOREST FIRE 3001829500 106796830
## 8 WINTER STORM 6688497250 26944000
## 13 HURRICANE OPAL 3152846000 9000000
## 976 STORM SURGE/TIDE 4641188000 850000
## 204 STORM SURGE 43323536000 5000
## 313 HEAVY RAIN/SEVERE WEATHER 2500000000 0
The three weather events which have caused the highest property damage are FLOOD, HURRICANE/TYPHOON and TORNADO.
The three weather events which have caused the most severe crop damage are DROUGHT, FLOOD and RIVER FLOOD.
Events such as Drought have little effect in terms of property damage, but a significant impact on crops. Floods on the other hand cause significant damage to both.
Another way of visualizing the data in the table above is in a stacked bar plot:
Plot 2: Weather events with the highest economic impact