Synopsis

This project involves exploring NOAA storm database to answer two questions concerning population and economic harm from storm events.The analysis involves resolving missing data and data requiring data conversions to determine those storm events causing the greatest population and economic harm. Calculations are made for population harm based on the number of fatalities and number of injuries caused by each storm event type. The 10 highest event types are included in the results.

Calculations are made for economic harm based on the property and crop damage caused by each storm event type. Analysis is done for property damage, crop damage and the total of the two combined. The 10 highest event types are included in the results for property damage, crop damage, and the total of property and crop damage combined.

Data Processing

The data for this assignment come in the form of a comma-separated-value file in zip format of the NOAA storm database with 902,297 observations of 37 variables. The events in the database start in the year 1950 and end in November 2011.

The variables included in this dataset of interest to this analysis are the columns of data for:

EVTYPE: Storm types i.e. Tornado
FATALITIES: Number of fatalities
INJURIES: Number injured
PROPDMG: Property damage
PROPDMGEXP: Property damage exponent (multiplier for PROPDMG)
CROPDMG: Crop damage exponent
CROPDMGEXP: Croperty damage exponent (multiplier for CROPDMG)

Load the storm data and libraries

storm_data <- read.csv("repdata_data_StormData.csv.bz2")
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(data.table)

## Warning: package 'data.table' was built under R version 4.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

Previewing the data structure prior to later data manipulation: using the functions head and str on the storm_data.

902297 observations of 37 variables

Relevant str data for the two questions to be answered are:

$ EVTYPE : chr “TORNADO” “TORNADO” “TORNADO” “TORNADO” …
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 …
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 …
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 …
$ PROPDMGEXP: chr “K” “K” “K” “K” …
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 …
$ CROPDMGEXP: chr “” “” “” “” …

Analysis Summary

Summary of analysis done for assignment questions 1 and 2.

Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Question 2: Across the United States, which types of events have the greatest economic consequences?

The storm data is reduced to those columns of data required to answer the two assignment questions (sub_storm_data): EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.

QUESTION 1:

For question 1 the storm data is assigned to sub_storm_data_health. The three columns EVTYPE (storm event), FATALITIES and INJURIES are reduced to those columns containing values greater than one for either FATALITIES or INJURIES.

The columns for FATALITIES and INJURIES are checked for missing values and none found.

The value for the two columns are summed based on event. The summed values by event are sorted and the top ten events are listed and graphed.

QUESTION 2:

For question 2 the storm data is assigned to sub_storm_data_econ. The five columns EVTYPE (storm event), PROPDMG (property damage), PRODMGEXP (property damage exponent), CROPDMG (crop damage), CROPDMGEXP (crop damage exponent).

Create a data frame (sub_storm_data_econ) to address the needs of the question. Columns to include: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.

The columns for PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP are checked for missing values and none found.

The columns for PROPDMGEXP and CROPDMGEXP have values for exponents such as K, M, B,… that are converted to a common numeric base to be used in calculations to determine damage by event. The common base for conversion is that of 10.

Damage is determined for property, crop, and total for property and crop. To assess damage the product of the damage is multiplied with the value for the exponent (base 10 number). The summed values by event are sorted and the top ten events by event for each category is determined: top ten property damage, top ten crop damage, and top ten combined property and crop damage.

Results

Question 1

The top ten most harmful events across the United States with respect to population health:

Order	Event	Harm Total
1	Tornado	96979
2	Excessive Heat	8428
3	TSTM Wind	7461
4	Flood	7259
5	Lightning	6046
6	Heat	3037
7	Flash Flood	2755
8	Ice Storm	2064
9	Thunderstorm Wind	1621
10	Winter Storm	1527

Question 2

The top ten most harmful events across the United States with respect to property damage:

Order	Event	Harm Total
1	Flood	144,657,709,807
2	Hurricane/Typhoon	69,305,840,000
3	Tornado	56,947,380,677
4	Storm Surge	43,323,536,000
5	Flash Flood	16,822,673,979
6	Hail	15,735,267,018
7	Hurricane	11,868,319,010
8	Tropical Storm	7,703,890,550
9	Winter Storm	6,688,497,251
10	High Wind	5,270,046,295

The top ten most harmful events across the United States with respect to crop damage:

Order	Event	Harm Total
1	Drought	13,953,264,600
2	Flood	5,515,683,845
3	River Flood	5,026,345,900
4	Ice Storm	5,020,616,350
5	Hurricane	2,739,570,000
6	Hurricane/Typhoon	2,604,540,280
7	Hail	2,506,542,468
8	Extreme Cold	1,288,623,300
9	Flash Flood	1,261,155,710
10	Frost/Freeze	1,088,734,600

Analysis

Question 1

sub_storm_data <- storm_data[,c("EVTYPE","FATALITIES","INJURIES","PROPDMG",
                                "PROPDMGEXP", "CROPDMG","CROPDMGEXP")]

Create the subset data frame to address health – fatalities and injuries where there has been harm (sub_storm_data_health).

sub_storm_data_health <- subset(sub_storm_data, FATALITIES > 0 | INJURIES >0)

Include columns EVTYPE, FATALITIES, INJURIES where values are greater than 0. Check for missing values in the identified columns of interest

sum(is.na(sub_storm_data_health$FATALITIES))

## [1] 0

NOTE no missing data

sum(is.na(sub_storm_data_health$INJURIES))

## [1] 0

NOTE no missing data

The unique function is used for sub_storm_data_health$EVTYPE producing 220 unique event types in the health (FATALITIES and INJURIES) columns.

Sum the FATALITIES and INJURIES column values to determine the total harm by EVTYPE.

Select the top 10 most harmful EVTYPEs

sub_storm_data_health$sum_harm <- rowSums(sub_storm_data_health[, c(2,3)], na.rm=TRUE)
most_harmful <- aggregate(sum_harm ~ EVTYPE, data = sub_storm_data_health, FUN="sum")
most_harmful <- most_harmful[order(-most_harmful$sum_harm),]
top_ten <- head(most_harmful, n = 10)

Create a chart of the top ten harm (fatalities and injuries)

top_ten$EVTYPE <- factor(top_ten$EVTYPE, levels = unique(top_ten$EVTYPE))
ggplot(top_ten, aes(x = EVTYPE, y = sum_harm, fill = EVTYPE)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(title = "Population Harm by Storm Event", x = "Storm Event", 
           y = "Total Harm")

QUESTION 2

Analysis includes the columns EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP for analysis.

The unique values in the “PROPDMGEXP” and “CROPDMGEXP” fields represent different exponents used to interpret property damage values in NOAA storm data.

Without a Code Book for the data some values were interpreted as a zero exponent or multiplication by 1.

Here’s how each value is interpreted:

• “K”, “k”: Denotes thousands.

• “M”, “m”: Denotes millions.

• “B”: Denotes billions.

• “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”: numeric values representing exponents.

• “h”, “H”, “-”, “?”, and “+”: could not be interpreted and are replaced with “0” equivalent to multiplication by 1.

Create the subset data frame to address economics

sub_storm_data_econ <- subset(sub_storm_data,PROPDMG > 0 | CROPDMG >0)

check for missing values in the identified columns of interest

sum(is.na(sub_storm_data_econ$PROPDMG))

## [1] 0

no missing data

sum(is.na(sub_storm_data_econ$PROPDMGEXP))

## [1] 0

no missing data

sum(is.na(sub_storm_data_econ$CROPDMG))

## [1] 0

no missing data

sum(is.na(sub_storm_data_econ$CROPDMGEXP))

## [1] 0

no missing data

The unique function was used for sub_storm_data_econ$EVTYPE to determine that there are 431 EVTYPE events for econ event type.

Determine the exponential indicators that need to be converted to numeric values.

unique(sub_storm_data_econ$PROPDMGEXP)

##  [1] "K" "M" "B" "m" ""  "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"

[1] “K” “M” “B” “m” “” “+” “0” “5” “6” “4” “h” “2” “7” “3” “H” “-”

setting k to 1,000,

M & m to 1,000,000,

B = 1,000,000,000,

numeric as the 10 exponent,

without a code book to clarify – ““, +, h, H, - to be set to the exponent 0 equivalent to multiplying by 1.

unique(sub_storm_data_econ$CROPDMGEXP)

## [1] ""  "M" "K" "m" "B" "?" "0" "k"

[1] “” “M” “K” “m” “B” “?” “0” “k”

setting “M” & “m” to 1,000,000,

“K” & “k” to 1,000,

“B” to 1,000,000,000,

numeric as the 10 exponent,

without a code book to clarify “?” to be set to 1.

Convert PROPDMGEXP exponents to common numeric values

sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="K" | sub_storm_data_econ$PROPDMGEXP == "k" 
                               |sub_storm_data_econ$PROPDMGEXP == "3"] <- 1000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="B" | sub_storm_data_econ$PROPDMGEXP == "9"] <- 1000000000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="M" | sub_storm_data_econ$PROPDMGEXP == "m" 
                               |sub_storm_data_econ$PROPDMGEXP == "6"] <- 1000000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="" | sub_storm_data_econ$PROPDMGEXP == "+" 
                               |sub_storm_data_econ$PROPDMGEXP == "0"
                    |sub_storm_data_econ$PROPDMGEXP == "-" |sub_storm_data_econ$PROPDMGEXP == "?"
                    |sub_storm_data_econ$PROPDMGEXP == "h"
                    |sub_storm_data_econ$PROPDMGEXP == "H"] <- 1
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="5"] <- 100000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="4"] <- 10000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="2"] <- 100
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="7"] <- 10000000

Convert CROPDMGEXP exponents to common numeric values

sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="K" | sub_storm_data_econ$CROPDMGEXP == "k"] <- 100
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="B"] <- 1000000000
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="M" | sub_storm_data_econ$CROPDMGEXP == "m"] <- 1000000
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="" |sub_storm_data_econ$CROPDMGEXP == "0"
                              |sub_storm_data_econ$CROPDMGEXP == "?" ] <- 1

create column for property damage consequences and crop damage conequences.

“prop_conseq” and “crop_conseq” as product of …DMG value and …EXP value.

create column for total property damage that is the sum of the property and crop consequences columns (prop_conseq and crop_conseq)

sub_storm_data_econ$PROPDMGEXP <- as.numeric(sub_storm_data_econ$PROPDMGEXP)
sub_storm_data_econ$CROPDMGEXP <- as.numeric(sub_storm_data_econ$CROPDMGEXP)
sub_storm_data_econ$prop_conseq <- sub_storm_data_econ$PROPDMG * sub_storm_data_econ$PROPDMGEXP 
sub_storm_data_econ$crop_conseq <- sub_storm_data_econ$CROPDMG * sub_storm_data_econ$CROPDMGEXP 
sub_storm_data_econ$total_conseq <- rowSums(sub_storm_data_econ[, c(8,9)], na.rm=TRUE)

Determine the most harmful total(property and crop)

most_harmful_total <- aggregate(total_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_total <- most_harmful_total[order(-most_harmful_total$total_conseq),]
top_ten_total <- head(most_harmful_total, n = 10)

The top ten most harmful events across the United States with respect to combined property and crop damage:

Order	Event	Harm Total
1	Flood	150,173,393,652
2	Hurricane/Typhoon	71,910,380,280
3	Tornado	57,272,745,148
4	Storm Surge	43,323,536,500
5	Hail	18,241,809,486
6	Flash Flood	18,083,829,689
7	Drought	14,999,370,600
8	Hurricane	14,607,889,010
9	River Flood	10,145,291,400
10	Ice Storm	8,965,544,210

Determine the most harmful property (prop_conseq)

most_harmful_prop <- aggregate(prop_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_prop <- most_harmful_prop[order(-most_harmful_prop$prop_conseq),]
top_ten_prop <- head(most_harmful_prop, n = 10)

Graph the top ten property damage events

top_ten_prop$EVTYPE <- factor(top_ten_prop$EVTYPE, levels = unique(top_ten_prop$EVTYPE))
ggplot(top_ten_prop, aes(x = EVTYPE, y = prop_conseq, fill = EVTYPE)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(title = "Total Property Harm by Storm Event",x = "EVENT (Property)",
           y = "Property Harm")

Determine the most harmful crop (prop_conseq)

most_harmful_crop <- aggregate(crop_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_crop <- most_harmful_crop[order(-most_harmful_crop$crop_conseq),]
top_ten_crop <- head(most_harmful_crop, n = 10)

Graph the top ten crop damage events

top_ten_crop$EVTYPE <- factor(top_ten_crop$EVTYPE, levels = unique(top_ten_crop$EVTYPE))
ggplot(top_ten_crop, aes(x = EVTYPE, y = crop_conseq, fill = EVTYPE)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(title = "Total Crop Harm by Storm Event",x = "EVENT (Crop)", 
           y = "Crop Harm")

JHU Reproducible Research - Week 4 Project 2 to Analyze U.S. National Oceanic and Atmoshpheric Administration (NOAA) Storm Data for Population and Economic Harm

Anonymous

2024-05-24

Synopsis

Data Processing

Analysis Summary

QUESTION 1:

QUESTION 2:

Results

Question 1

Question 2

Analysis

Question 1

QUESTION 2