Synopsis

This report identifies the weather events most harmful to population health and those with the greatest economic impact in the U.S., based on data collected by the U.S. National Oceanic and Atmospheric Administration (NOAA).

The storm database covers weather events from 1950 to 2011 and records, for each event, estimates of the number of fatalities and injuries and of the cost of damage to property and crops.

The estimates for fatalities and injuries were used to determine the weather events with the most harmful impact to population health. Property damage and crop damage cost estimates were used to determine weather events with the greatest economic consequences.

Setup

Load packages

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(xtable)
## Warning: package 'xtable' was built under R version 4.1.3

Load data

stormData<-read.csv("C:/Users/AroRoseman/Downloads/repdata_data_StormData.csv.bz2")

Data Processing

Subset Data

When processing large datasets, computer performance can be improved by keeping only the variables required for the analysis. For this analysis, the dataset will be trimmed to include only the necessary variables (listed below). In addition, only observations with a known event type and at least one value greater than 0 for fatalities, injuries, property damage, or crop damage will be included.

EVTYPE (Event type (Flood, Heat, Hurricane, Tornado, etc.)), FATALITIES (Number of fatalities resulting from event), INJURIES (Number of injuries resulting from event), PROPDMG (Property damage in USD, before applying the unit multiplier), PROPDMGEXP (Unit multiplier for property damage (K, M, or B)), CROPDMG (Crop damage in USD, before applying the unit multiplier), CROPDMGEXP (Unit multiplier for crop damage (K, M, or B)), BGN_DATE (Begin date of the event), END_DATE (End date of the event), STATE (State where the event occurred)

stormDataTidy <- subset(stormData, EVTYPE != "?"
                                   &
                                   (FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0),
                                   select = c("EVTYPE",
                                              "FATALITIES",
                                              "INJURIES", 
                                              "PROPDMG",
                                              "PROPDMGEXP",
                                              "CROPDMG",
                                              "CROPDMGEXP",
                                              "BGN_DATE",
                                              "END_DATE",
                                              "STATE"))
dim(stormDataTidy)
## [1] 254632     10
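The filter condition can be checked on a small example. The sketch below uses an invented data frame (the rows and the name `toy` are illustrative, not taken from the NOAA data) and applies the same `subset()` logic: a row is kept only when its event type is known and at least one impact measure is positive.

```r
# Toy data frame; all rows are invented for illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "?", "HAIL", "FLOOD"),
  FATALITIES = c(1, 0, 0, 0),
  INJURIES   = c(0, 0, 0, 0),
  PROPDMG    = c(0, 5, 0, 2.5),
  CROPDMG    = c(0, 0, 0, 0),
  STATE      = c("TX", "OK", "KS", "MO")
)

# Keep rows with a known event type and at least one positive impact measure,
# and keep only a few of the columns.
toyTidy <- subset(toy,
                  EVTYPE != "?" &
                    (FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0),
                  select = c("EVTYPE", "FATALITIES", "PROPDMG"))
print(toyTidy)
```

The "?" row is dropped because of its event type, and the HAIL row is dropped because all four impact measures are zero.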

Check for missing values

sum(is.na(stormDataTidy))
## [1] 0

This dataset contains no missing (NA) values and has 254632 observations of 10 variables.
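Note that `is.na()` only counts values that R treats as NA. Blank strings, such as the empty unit multipliers that appear in the PROPDMGEXP and CROPDMGEXP tables later in this report, are not counted. A minimal sketch on an invented data frame (the name `toyNA` and its values are assumptions for illustration):

```r
# Invented example: one blank multiplier, no NA values.
toyNA <- data.frame(PROPDMG    = c(25, 0, 3),
                    PROPDMGEXP = c("K", "", "M"),
                    stringsAsFactors = FALSE)

naCount    <- sum(is.na(toyNA))            # counts true NA values only
blankCount <- sum(toyNA$PROPDMGEXP == "")  # counts empty strings

print(c(na = naCount, blank = blankCount))
```

A blank multiplier is therefore something the cost calculation has to handle explicitly rather than something the NA check would reveal.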

Clean event type data

There are 487 unique event types in this dataset.

length(unique(stormDataTidy$EVTYPE))
## [1] 487

The values in this dataset are not clean (e.g. one value might read “strong wind” and another might read “STRONG WINDS”). The dataset was normalized by converting all Event Types to uppercase and combining similar Event Types into unique categories.

stormDataTidy$EVTYPE <- toupper(stormDataTidy$EVTYPE)  

#AVALANCHE
stormDataTidy$EVTYPE <- gsub('.*AVALANCE.*', 'AVALANCHE', stormDataTidy$EVTYPE)

#BLIZZARD
stormDataTidy$EVTYPE <- gsub('.*BLIZZARD.*', 'BLIZZARD', stormDataTidy$EVTYPE)

#CLOUD
stormDataTidy$EVTYPE <- gsub('.*CLOUD.*', 'CLOUD', stormDataTidy$EVTYPE)

#COLD
stormDataTidy$EVTYPE <- gsub('.*COLD.*', 'COLD', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*FREEZ.*', 'COLD', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*FROST.*', 'COLD', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*ICE.*', 'COLD', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*LOW TEMPERATURE RECORD.*', 'COLD', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*LO.*TEMP.*', 'COLD', stormDataTidy$EVTYPE)

#DRY
stormDataTidy$EVTYPE <- gsub('.*DRY.*', 'DRY', stormDataTidy$EVTYPE)

#DUST
stormDataTidy$EVTYPE <- gsub('.*DUST.*', 'DUST', stormDataTidy$EVTYPE)

#FIRE
stormDataTidy$EVTYPE <- gsub('.*FIRE.*', 'FIRE', stormDataTidy$EVTYPE)

#FLOOD
stormDataTidy$EVTYPE <- gsub('.*FLOOD.*', 'FLOOD', stormDataTidy$EVTYPE)

#FOG
stormDataTidy$EVTYPE <- gsub('.*FOG.*', 'FOG', stormDataTidy$EVTYPE)

#HAIL
stormDataTidy$EVTYPE <- gsub('.*HAIL.*', 'HAIL', stormDataTidy$EVTYPE)

#HEAT
stormDataTidy$EVTYPE <- gsub('.*HEAT.*', 'HEAT', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*WARM.*', 'HEAT', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*HIGH.*TEMP.*', 'HEAT', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*RECORD HIGH TEMPERATURES.*', 'HEAT', stormDataTidy$EVTYPE)

#HYPOTHERMIA/EXPOSURE
stormDataTidy$EVTYPE <- gsub('.*HYPOTHERMIA.*', 'HYPOTHERMIA/EXPOSURE', stormDataTidy$EVTYPE)

#LANDSLIDE
stormDataTidy$EVTYPE <- gsub('.*LANDSLIDE.*', 'LANDSLIDE', stormDataTidy$EVTYPE)

#LIGHTNING
stormDataTidy$EVTYPE <- gsub('^LIGHTNING.*', 'LIGHTNING', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('^LIGNTNING.*', 'LIGHTNING', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('^LIGHTING.*', 'LIGHTNING', stormDataTidy$EVTYPE)

#MICROBURST
stormDataTidy$EVTYPE <- gsub('.*MICROBURST.*', 'MICROBURST', stormDataTidy$EVTYPE)

#MUDSLIDE
stormDataTidy$EVTYPE <- gsub('.*MUDSLIDE.*', 'MUDSLIDE', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*MUD SLIDE.*', 'MUDSLIDE', stormDataTidy$EVTYPE)

#RAIN
stormDataTidy$EVTYPE <- gsub('.*RAIN.*', 'RAIN', stormDataTidy$EVTYPE)

#RIP CURRENT
stormDataTidy$EVTYPE <- gsub('.*RIP CURRENT.*', 'RIP CURRENT', stormDataTidy$EVTYPE)

#STORM
stormDataTidy$EVTYPE <- gsub('.*STORM.*', 'STORM', stormDataTidy$EVTYPE)

#SUMMARY
stormDataTidy$EVTYPE <- gsub('.*SUMMARY.*', 'SUMMARY', stormDataTidy$EVTYPE)

#TORNADO
stormDataTidy$EVTYPE <- gsub('.*TORNADO.*', 'TORNADO', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*TORNDAO.*', 'TORNADO', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*LANDSPOUT.*', 'TORNADO', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*WATERSPOUT.*', 'TORNADO', stormDataTidy$EVTYPE)

#SURF
stormDataTidy$EVTYPE <- gsub('.*SURF.*', 'SURF', stormDataTidy$EVTYPE)

#VOLCANIC
stormDataTidy$EVTYPE <- gsub('.*VOLCANIC.*', 'VOLCANIC', stormDataTidy$EVTYPE)

#WET
stormDataTidy$EVTYPE <- gsub('.*WET.*', 'WET', stormDataTidy$EVTYPE)

#WIND
stormDataTidy$EVTYPE <- gsub('.*WIND.*', 'WIND', stormDataTidy$EVTYPE)

#WINTER
stormDataTidy$EVTYPE <- gsub('.*WINTER.*', 'WINTER', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*WINTRY.*', 'WINTER', stormDataTidy$EVTYPE)
stormDataTidy$EVTYPE <- gsub('.*SNOW.*', 'WINTER', stormDataTidy$EVTYPE)  

length(unique(stormDataTidy$EVTYPE))
## [1] 81

There are now 81 unique Event Types in this dataset.
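The sequence of `gsub()` calls above can also be written as an ordered table of rules applied in a loop. The following is a minimal sketch of that idea, not the code used to produce the results in this report; the names `rules` and `normalizeEvents` and the sample values are hypothetical, and only a few of the patterns are reproduced.

```r
# Ordered pattern -> category rules (a small subset of the patterns above).
rules <- c(".*TORNADO.*" = "TORNADO",
           ".*TORNDAO.*" = "TORNADO",
           ".*FLOOD.*"   = "FLOOD",
           ".*WIND.*"    = "WIND")

# Uppercase the values, then apply each rule in order.
normalizeEvents <- function(x, rules) {
  x <- toupper(x)
  for (pattern in names(rules)) {
    x <- gsub(pattern, rules[[pattern]], x)
  }
  x
}

# Invented sample values
events  <- c("Tornado F0", "TORNDAO", "flash flood", "Strong Winds", "HAIL")
cleaned <- normalizeEvents(events, rules)
print(cleaned)
```

Rules are applied in order, so a value that matches several patterns ends up in the category of the first matching rule, provided that category name does not itself match a later pattern. The same mechanism could absorb categories that the rules above leave separate, such as HURRICANE and HURRICANE/TYPHOON in the results table, although doing so would change the reported totals.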

Clean date data

The date variables are formatted to support optional reporting or further analysis. In the raw dataset, the BGN_DATE and END_DATE variables are stored as text (factors in older versions of R) and need to be converted to date types that can be manipulated and reported on. For now, the time variables will be ignored. Four new variables are created from the date variables in the tidy dataset: DATE_START (begin date of the event, stored as a date type), DATE_END (end date of the event, stored as a date type), YEAR (year the event started), and DURATION (duration of the event in hours).

stormDataTidy$DATE_START <- as.Date(stormDataTidy$BGN_DATE, format = "%m/%d/%Y")
stormDataTidy$DATE_END <- as.Date(stormDataTidy$END_DATE, format = "%m/%d/%Y")
stormDataTidy$YEAR <- as.integer(format(stormDataTidy$DATE_START, "%Y"))
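# Differences between Date values are measured in days; difftime() converts them to hours
stormDataTidy$DURATION <- as.numeric(difftime(stormDataTidy$DATE_END, stormDataTidy$DATE_START, units = "hours"))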
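Subtracting two `Date` values yields a difference in days, so a duration in hours requires multiplying by 24 or asking `difftime()` for hours directly. The sketch below illustrates the conversion on two invented date strings written in the same month/day/year layout as the raw data; a blank end date becomes NA.

```r
# Invented date strings in the raw "month/day/year time" layout.
bgn <- c("4/18/1950 0:00:00", "6/1/2011 0:00:00")
end <- c("4/19/1950 0:00:00", "")

# The format string only consumes the date part; the time part is ignored.
dateStart <- as.Date(bgn, format = "%m/%d/%Y")
dateEnd   <- as.Date(end, format = "%m/%d/%Y")

years         <- as.integer(format(dateStart, "%Y"))
durationHours <- as.numeric(difftime(dateEnd, dateStart, units = "hours"))

print(years)
print(durationHours)
```

The first event spans one calendar day, which corresponds to 24 hours; the second has no end date, so its duration is NA.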

Clean economic data

According to the “National Weather Service Storm Data Documentation” (page 12), information about property damage is logged using two variables: PROPDMG and PROPDMGEXP. PROPDMG is the mantissa (the significand) rounded to three significant digits and PROPDMGEXP is the exponent (the multiplier). The same approach is used for crop damage, where the CROPDMG variable is scaled by the CROPDMGEXP variable. The documentation also specifies that PROPDMGEXP and CROPDMGEXP should contain an alphabetical character signifying magnitude: “K” for thousands, “M” for millions, and “B” for billions. A quick review of the data, however, shows that several other characters are also present.

table(toupper(stormDataTidy$PROPDMGEXP))
## 
##             -      +      0      2      3      4      5      6      7      B 
##  11585      1      5    210      1      1      4     18      3      3     40 
##      H      K      M 
##      7 231427  11327
table(toupper(stormDataTidy$CROPDMGEXP))
## 
##             ?      0      B      K      M 
## 152663      6     17      7  99953   1986

To calculate costs, the PROPDMGEXP and CROPDMGEXP variables will be mapped to a multiplier, which is then applied to the PROPDMG and CROPDMG values. Two new variables will be created to store the resulting damage costs: PROP_COST and CROP_COST.

Function to get multiplier factor

getMultiplier <- function(exp) {
    exp <- toupper(exp);
    if (exp == "")  return (10^0);
    if (exp == "-") return (10^0);
    if (exp == "?") return (10^0);
    if (exp == "+") return (10^0);
    if (exp == "0") return (10^0);
    if (exp == "1") return (10^1);
    if (exp == "2") return (10^2);
    if (exp == "3") return (10^3);
    if (exp == "4") return (10^4);
    if (exp == "5") return (10^5);
    if (exp == "6") return (10^6);
    if (exp == "7") return (10^7);
    if (exp == "8") return (10^8);
    if (exp == "9") return (10^9);
    if (exp == "H") return (10^2);
    if (exp == "K") return (10^3);
    if (exp == "M") return (10^6);
    if (exp == "B") return (10^9);
    return (NA);
}

Calculate property damage and crop damage costs (in billions)

stormDataTidy$PROP_COST <- with(stormDataTidy, as.numeric(PROPDMG) * sapply(PROPDMGEXP, getMultiplier))/10^9
stormDataTidy$CROP_COST <- with(stormDataTidy, as.numeric(CROPDMG) * sapply(CROPDMGEXP, getMultiplier))/10^9
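As a worked example of the arithmetic, a PROPDMG of 25 with a PROPDMGEXP of "K" is 25 × 10^3 = 25,000 USD, or 0.000025 billion USD. The mapping can also be expressed as a vectorized lookup that avoids calling a function once per row. The sketch below is one possible alternative under the same mapping as getMultiplier, not the code used for the results; `multipliers`, `lookupMultiplier`, and the sample values are hypothetical names and data.

```r
# Letter codes and their multipliers, as in getMultiplier above.
multipliers <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)

lookupMultiplier <- function(exp) {
  exp <- toupper(exp)
  out <- rep(NA_real_, length(exp))            # unknown codes stay NA
  out[exp %in% c("", "-", "?", "+")] <- 1      # blank and symbol codes -> 10^0
  isDigit <- grepl("^[0-9]$", exp)
  out[isDigit] <- 10^as.numeric(exp[isDigit])  # digit d -> 10^d
  isLetter <- exp %in% names(multipliers)
  out[isLetter] <- unname(multipliers[exp[isLetter]])
  out
}

# Invented sample values
dmg     <- c(25, 2.5, 1.5, 10, 4)
expCode <- c("K", "m", "B", "", "5")
costBillions <- dmg * lookupMultiplier(expCode) / 1e9
print(costBillions)
```

Because the lookup operates on the whole column at once, it would typically be faster than `sapply()` on a dataset of this size, though both approaches should produce the same multipliers.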

Summarize data

Summary of health impact data (fatalities + injuries). Results are sorted in descending order by health impact.

healthImpactData <- aggregate(x = list(HEALTH_IMPACT = stormDataTidy$FATALITIES + stormDataTidy$INJURIES), 
                                  by = list(EVENT_TYPE = stormDataTidy$EVTYPE), 
                                  FUN = sum,
                                  na.rm = TRUE)
healthImpactData <- healthImpactData[order(healthImpactData$HEALTH_IMPACT, decreasing = TRUE),]

Summary of damage impact costs (property damage + crop damage). Results are sorted in descending order by damage cost.

damageCostImpactData <- aggregate(x = list(DAMAGE_IMPACT = stormDataTidy$PROP_COST + stormDataTidy$CROP_COST), 
                                  by = list(EVENT_TYPE = stormDataTidy$EVTYPE), 
                                  FUN = sum,
                                  na.rm = TRUE)
damageCostImpactData <- damageCostImpactData[order(damageCostImpactData$DAMAGE_IMPACT, decreasing = TRUE),]
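The aggregate-then-sort pattern used for both summaries can be illustrated on a small invented data frame (the name `toyEvents` and its values are assumptions for illustration): impacts are summed per event type, and the result is ordered from largest to smallest.

```r
# Invented events; TORNADO appears twice so its rows are combined.
toyEvents <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 0, 3),
  INJURIES   = c(10, 0, 5, 1)
)

# Sum fatalities + injuries within each event type.
toyImpact <- aggregate(x = list(HEALTH_IMPACT = toyEvents$FATALITIES + toyEvents$INJURIES),
                       by = list(EVENT_TYPE = toyEvents$EVTYPE),
                       FUN = sum,
                       na.rm = TRUE)

# Sort in descending order of total impact.
toyImpact <- toyImpact[order(toyImpact$HEALTH_IMPACT, decreasing = TRUE), ]
print(toyImpact)
```

Here TORNADO totals 17, HEAT 4, and FLOOD 1, so the sorted table lists them in that order.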

Results

Event Types Most Harmful to Population Health

Harm to population health is measured as the combined number of fatalities and injuries. The results below display the ten most harmful weather events in terms of population health in the U.S.

print(xtable(head(healthImpactData, 10),
             caption = "Top 10 Weather Events Most Harmful to Population Health"),
             caption.placement = 'top',
             type = "html",
             include.rownames = FALSE,
             html.table.attributes='class="table-bordered" width="100%"')
## <!-- html table generated in R 4.1.1 by xtable 1.8-4 package -->
## <!-- Sat Dec 31 20:53:47 2022 -->
## <table class="table-bordered" width="100%">
## <caption align="top"> Top 10 Weather Events Most Harmful to Population Health </caption>
## <tr> <th> EVENT_TYPE </th> <th> HEALTH_IMPACT </th>  </tr>
##   <tr> <td> TORNADO </td> <td align="right"> 97075.00 </td> </tr>
##   <tr> <td> HEAT </td> <td align="right"> 12392.00 </td> </tr>
##   <tr> <td> FLOOD </td> <td align="right"> 10127.00 </td> </tr>
##   <tr> <td> WIND </td> <td align="right"> 9893.00 </td> </tr>
##   <tr> <td> LIGHTNING </td> <td align="right"> 6049.00 </td> </tr>
##   <tr> <td> STORM </td> <td align="right"> 4780.00 </td> </tr>
##   <tr> <td> COLD </td> <td align="right"> 3100.00 </td> </tr>
##   <tr> <td> WINTER </td> <td align="right"> 1924.00 </td> </tr>
##   <tr> <td> FIRE </td> <td align="right"> 1698.00 </td> </tr>
##   <tr> <td> HAIL </td> <td align="right"> 1512.00 </td> </tr>
##    </table>
healthImpactChart <- ggplot(head(healthImpactData, 10),
                            aes(x = reorder(EVENT_TYPE, HEALTH_IMPACT), y = HEALTH_IMPACT, fill = EVENT_TYPE)) +
                            coord_flip() +
                            geom_bar(stat = "identity") + 
                            xlab("Event Type") +
                            ylab("Total Fatalities and Injuries") +
                            theme(plot.title = element_text(size = 14, hjust = 0.5)) +
                            ggtitle("Top 10 Weather Events Most Harmful to\nPopulation Health")
print(healthImpactChart)

Event Types Most Harmful to the Economy

Economic impact is measured as the combined cost of property and crop damage, in billions of USD. The results below display the ten most harmful weather events in terms of economic consequences in the U.S.

print(xtable(head(damageCostImpactData, 10),
             caption = "Top 10 Weather Events with Greatest Economic Consequences"),
             caption.placement = 'top',
             type = "html",
             include.rownames = FALSE,
             html.table.attributes='class="table-bordered" width="100%"')
## <!-- html table generated in R 4.1.1 by xtable 1.8-4 package -->
## <!-- Sat Dec 31 20:53:48 2022 -->
## <table class="table-bordered" width="100%">
## <caption align="top"> Top 10 Weather Events with Greatest Economic Consequences </caption>
## <tr> <th> EVENT_TYPE </th> <th> DAMAGE_IMPACT </th>  </tr>
##   <tr> <td> FLOOD </td> <td align="right"> 180.58 </td> </tr>
##   <tr> <td> HURRICANE/TYPHOON </td> <td align="right"> 71.91 </td> </tr>
##   <tr> <td> STORM </td> <td align="right"> 70.45 </td> </tr>
##   <tr> <td> TORNADO </td> <td align="right"> 57.43 </td> </tr>
##   <tr> <td> HAIL </td> <td align="right"> 20.74 </td> </tr>
##   <tr> <td> DROUGHT </td> <td align="right"> 15.02 </td> </tr>
##   <tr> <td> HURRICANE </td> <td align="right"> 14.61 </td> </tr>
##   <tr> <td> COLD </td> <td align="right"> 12.70 </td> </tr>
##   <tr> <td> WIND </td> <td align="right"> 12.01 </td> </tr>
##   <tr> <td> FIRE </td> <td align="right"> 8.90 </td> </tr>
##    </table>
damageCostImpactChart <- ggplot(head(damageCostImpactData, 10),
                            aes(x = reorder(EVENT_TYPE, DAMAGE_IMPACT), y = DAMAGE_IMPACT, fill = EVENT_TYPE)) +
                            coord_flip() +
                            geom_bar(stat = "identity") + 
                            xlab("Event Type") +
                            ylab("Total Property / Crop Damage Cost\n(in Billions)") +
                            theme(plot.title = element_text(size = 14, hjust = 0.5)) +
                            ggtitle("Top 10 Weather Events with\nGreatest Economic Consequences")
print(damageCostImpactChart)

Conclusion

This analysis leads to the following conclusions:

Which types of weather events are most harmful to population health?

Tornadoes are responsible for the greatest number of fatalities and injuries (97,075 combined), far more than heat (12,392) and floods (10,127).

Which types of weather events have the greatest economic consequences?

Floods cause the most economic harm in terms of property and crop damage, at an estimated 180.58 billion USD.