Synopsis

This project involves exploring NOAA storm database to answer two questions concerning population and economic harm from storm events.The analysis involves resolving missing data and data requiring data conversions to determine those storm events causing the greatest population and economic harm. Calculations are made for population harm based on the number of fatalities and number of injuries caused by each storm event type. The 10 highest event types are included in the results.

Calculations are made for economic harm based on the property and crop damage caused by each storm event type. Analysis is done for property damage, crop damage and the total of the two combined. The 10 highest event types are included in the results for property damage, crop damage, and the total of property and crop damage combined.

Data Processing

The data for this assignment come in the form of a comma-separated-value file in zip format of the NOAA storm database with 902,297 observations of 37 variables. The events in the database start in the year 1950 and end in November 2011.

The variables included in this dataset of interest to this analysis are the columns of data for:

  • EVTYPE: Storm types i.e. Tornado

  • FATALITIES: Number of fatalities

  • INJURIES: Number injured

  • PROPDMG: Property damage

  • PROPDMGEXP: Property damage exponent (multiplier for PROPDMG)

  • CROPDMG: Crop damage exponent

  • CROPDMGEXP: Croperty damage exponent (multiplier for CROPDMG)

Load the storm data and libraries

storm_data <- read.csv("repdata_data_StormData.csv.bz2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3

Previewing the data structure prior to later data manipulation: using the functions head and str on the storm_data.

902297 observations of 37 variables

Relevant str data for the two questions to be answered are:

  • $ EVTYPE : chr “TORNADO” “TORNADO” “TORNADO” “TORNADO” …

  • $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 …

  • $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 …

  • $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 …

  • $ PROPDMGEXP: chr “K” “K” “K” “K” …

  • $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 …

  • $ CROPDMGEXP: chr “” “” “” “” …

Analysis Summary

Summary of analysis done for assignment questions 1 and 2.

Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Question 2: Across the United States, which types of events have the greatest economic consequences?

The storm data is reduced to those columns of data required to answer the two assignment questions (sub_storm_data): EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.

QUESTION 1:

For question 1 the storm data is assigned to sub_storm_data_health. The three columns EVTYPE (storm event), FATALITIES and INJURIES are reduced to those columns containing values greater than one for either FATALITIES or INJURIES.

The columns for FATALITIES and INJURIES are checked for missing values and none found.

The value for the two columns are summed based on event. The summed values by event are sorted and the top ten events are listed and graphed.

QUESTION 2:

For question 2 the storm data is assigned to sub_storm_data_econ. The five columns EVTYPE (storm event), PROPDMG (property damage), PRODMGEXP (property damage exponent), CROPDMG (crop damage), CROPDMGEXP (crop damage exponent).

Create a data frame (sub_storm_data_econ) to address the needs of the question. Columns to include: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.

The columns for PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP are checked for missing values and none found.

The columns for PROPDMGEXP and CROPDMGEXP have values for exponents such as K, M, B,… that are converted to a common numeric base to be used in calculations to determine damage by event. The common base for conversion is that of 10.

Damage is determined for property, crop, and total for property and crop. To assess damage the product of the damage is multiplied with the value for the exponent (base 10 number). The summed values by event are sorted and the top ten events by event for each category is determined: top ten property damage, top ten crop damage, and top ten combined property and crop damage.

Results

Question 1

The top ten most harmful events across the United States with respect to population health:

Order Event Harm Total
1 Tornado 96979
2 Excessive Heat 8428
3 TSTM Wind 7461
4 Flood 7259
5 Lightning 6046
6 Heat 3037
7 Flash Flood 2755
8 Ice Storm 2064
9 Thunderstorm Wind 1621
10 Winter Storm 1527

Question 2

The top ten most harmful events across the United States with respect to property damage:

Order Event Harm Total
1 Flood 144,657,709,807
2 Hurricane/Typhoon 69,305,840,000
3 Tornado 56,947,380,677
4 Storm Surge 43,323,536,000
5 Flash Flood 16,822,673,979
6 Hail 15,735,267,018
7 Hurricane 11,868,319,010
8 Tropical Storm 7,703,890,550
9 Winter Storm 6,688,497,251
10 High Wind 5,270,046,295

The top ten most harmful events across the United States with respect to crop damage:

Order Event Harm Total
1 Drought 13,953,264,600
2 Flood 5,515,683,845
3 River Flood 5,026,345,900
4 Ice Storm 5,020,616,350
5 Hurricane 2,739,570,000
6 Hurricane/Typhoon 2,604,540,280
7 Hail 2,506,542,468
8 Extreme Cold 1,288,623,300
9 Flash Flood 1,261,155,710
10 Frost/Freeze 1,088,734,600

Analysis

Question 1

sub_storm_data <- storm_data[,c("EVTYPE","FATALITIES","INJURIES","PROPDMG",
                                "PROPDMGEXP", "CROPDMG","CROPDMGEXP")]

Create the subset data frame to address health – fatalities and injuries where there has been harm (sub_storm_data_health).

sub_storm_data_health <- subset(sub_storm_data, FATALITIES > 0 | INJURIES >0)

Include columns EVTYPE, FATALITIES, INJURIES where values are greater than 0. Check for missing values in the identified columns of interest

sum(is.na(sub_storm_data_health$FATALITIES))
## [1] 0

NOTE no missing data

sum(is.na(sub_storm_data_health$INJURIES))
## [1] 0

NOTE no missing data

The unique function is used for sub_storm_data_health$EVTYPE producing 220 unique event types in the health (FATALITIES and INJURIES) columns.

Sum the FATALITIES and INJURIES column values to determine the total harm by EVTYPE.

Select the top 10 most harmful EVTYPEs

sub_storm_data_health$sum_harm <- rowSums(sub_storm_data_health[, c(2,3)], na.rm=TRUE)
most_harmful <- aggregate(sum_harm ~ EVTYPE, data = sub_storm_data_health, FUN="sum")
most_harmful <- most_harmful[order(-most_harmful$sum_harm),]
top_ten <- head(most_harmful, n = 10)

Create a chart of the top ten harm (fatalities and injuries)

top_ten$EVTYPE <- factor(top_ten$EVTYPE, levels = unique(top_ten$EVTYPE))
ggplot(top_ten, aes(x = EVTYPE, y = sum_harm, fill = EVTYPE)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(title = "Population Harm by Storm Event", x = "Storm Event", 
           y = "Total Harm")  

QUESTION 2

Analysis includes the columns EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP for analysis.

The unique values in the “PROPDMGEXP” and “CROPDMGEXP” fields represent different exponents used to interpret property damage values in NOAA storm data.

Without a Code Book for the data some values were interpreted as a zero exponent or multiplication by 1.

Here’s how each value is interpreted:

• “K”, “k”: Denotes thousands.

• “M”, “m”: Denotes millions.

• “B”: Denotes billions.

• “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”: numeric values representing exponents.

• “h”, “H”, “-”, “?”, and “+”: could not be interpreted and are replaced with “0” equivalent to multiplication by 1.

Create the subset data frame to address economics

sub_storm_data_econ <- subset(sub_storm_data,PROPDMG > 0 | CROPDMG >0)

check for missing values in the identified columns of interest

sum(is.na(sub_storm_data_econ$PROPDMG))
## [1] 0

no missing data

sum(is.na(sub_storm_data_econ$PROPDMGEXP))
## [1] 0

no missing data

sum(is.na(sub_storm_data_econ$CROPDMG))
## [1] 0

no missing data

sum(is.na(sub_storm_data_econ$CROPDMGEXP))
## [1] 0

no missing data

The unique function was used for sub_storm_data_econ$EVTYPE to determine that there are 431 EVTYPE events for econ event type.

Determine the exponential indicators that need to be converted to numeric values.

unique(sub_storm_data_econ$PROPDMGEXP)
##  [1] "K" "M" "B" "m" ""  "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"

[1] “K” “M” “B” “m” “” “+” “0” “5” “6” “4” “h” “2” “7” “3” “H” “-”

setting k to 1,000,

M & m to 1,000,000,

B = 1,000,000,000,

numeric as the 10 exponent,

without a code book to clarify – ““, +, h, H, - to be set to the exponent 0 equivalent to multiplying by 1.

unique(sub_storm_data_econ$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k"

[1] “” “M” “K” “m” “B” “?” “0” “k”

setting “M” & “m” to 1,000,000,

“K” & “k” to 1,000,

“B” to 1,000,000,000,

numeric as the 10 exponent,

without a code book to clarify “?” to be set to 1.

Convert PROPDMGEXP exponents to common numeric values

sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="K" | sub_storm_data_econ$PROPDMGEXP == "k" 
                               |sub_storm_data_econ$PROPDMGEXP == "3"] <- 1000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="B" | sub_storm_data_econ$PROPDMGEXP == "9"] <- 1000000000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="M" | sub_storm_data_econ$PROPDMGEXP == "m" 
                               |sub_storm_data_econ$PROPDMGEXP == "6"] <- 1000000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="" | sub_storm_data_econ$PROPDMGEXP == "+" 
                               |sub_storm_data_econ$PROPDMGEXP == "0"
                    |sub_storm_data_econ$PROPDMGEXP == "-" |sub_storm_data_econ$PROPDMGEXP == "?"
                    |sub_storm_data_econ$PROPDMGEXP == "h"
                    |sub_storm_data_econ$PROPDMGEXP == "H"] <- 1
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="5"] <- 100000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="4"] <- 10000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="2"] <- 100
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="7"] <- 10000000

Convert CROPDMGEXP exponents to common numeric values

sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="K" | sub_storm_data_econ$CROPDMGEXP == "k"] <- 100
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="B"] <- 1000000000
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="M" | sub_storm_data_econ$CROPDMGEXP == "m"] <- 1000000
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="" |sub_storm_data_econ$CROPDMGEXP == "0"
                              |sub_storm_data_econ$CROPDMGEXP == "?" ] <- 1

create column for property damage consequences and crop damage conequences.

“prop_conseq” and “crop_conseq” as product of …DMG value and …EXP value.

create column for total property damage that is the sum of the property and crop consequences columns (prop_conseq and crop_conseq)

sub_storm_data_econ$PROPDMGEXP <- as.numeric(sub_storm_data_econ$PROPDMGEXP)
sub_storm_data_econ$CROPDMGEXP <- as.numeric(sub_storm_data_econ$CROPDMGEXP)
sub_storm_data_econ$prop_conseq <- sub_storm_data_econ$PROPDMG * sub_storm_data_econ$PROPDMGEXP 
sub_storm_data_econ$crop_conseq <- sub_storm_data_econ$CROPDMG * sub_storm_data_econ$CROPDMGEXP 
sub_storm_data_econ$total_conseq <- rowSums(sub_storm_data_econ[, c(8,9)], na.rm=TRUE)

Determine the most harmful total(property and crop)

most_harmful_total <- aggregate(total_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_total <- most_harmful_total[order(-most_harmful_total$total_conseq),]
top_ten_total <- head(most_harmful_total, n = 10)

The top ten most harmful events across the United States with respect to combined property and crop damage:

Order Event Harm Total
1 Flood 150,173,393,652
2 Hurricane/Typhoon 71,910,380,280
3 Tornado 57,272,745,148
4 Storm Surge 43,323,536,500
5 Hail 18,241,809,486
6 Flash Flood 18,083,829,689
7 Drought 14,999,370,600
8 Hurricane 14,607,889,010
9 River Flood 10,145,291,400
10 Ice Storm 8,965,544,210

Determine the most harmful property (prop_conseq)

most_harmful_prop <- aggregate(prop_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_prop <- most_harmful_prop[order(-most_harmful_prop$prop_conseq),]
top_ten_prop <- head(most_harmful_prop, n = 10)

Graph the top ten property damage events

top_ten_prop$EVTYPE <- factor(top_ten_prop$EVTYPE, levels = unique(top_ten_prop$EVTYPE))
ggplot(top_ten_prop, aes(x = EVTYPE, y = prop_conseq, fill = EVTYPE)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(title = "Total Property Harm by Storm Event",x = "EVENT (Property)",
           y = "Property Harm") 

Determine the most harmful crop (prop_conseq)

most_harmful_crop <- aggregate(crop_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_crop <- most_harmful_crop[order(-most_harmful_crop$crop_conseq),]
top_ten_crop <- head(most_harmful_crop, n = 10)

Graph the top ten crop damage events

top_ten_crop$EVTYPE <- factor(top_ten_crop$EVTYPE, levels = unique(top_ten_crop$EVTYPE))
ggplot(top_ten_crop, aes(x = EVTYPE, y = crop_conseq, fill = EVTYPE)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(title = "Total Crop Harm by Storm Event",x = "EVENT (Crop)", 
           y = "Crop Harm")