This project involves exploring NOAA storm database to answer two questions concerning population and economic harm from storm events.The analysis involves resolving missing data and data requiring data conversions to determine those storm events causing the greatest population and economic harm. Calculations are made for population harm based on the number of fatalities and number of injuries caused by each storm event type. The 10 highest event types are included in the results.
Calculations are made for economic harm based on the property and crop damage caused by each storm event type. Analysis is done for property damage, crop damage and the total of the two combined. The 10 highest event types are included in the results for property damage, crop damage, and the total of property and crop damage combined.
The data for this assignment come in the form of a comma-separated-value file in zip format of the NOAA storm database with 902,297 observations of 37 variables. The events in the database start in the year 1950 and end in November 2011.
The variables included in this dataset of interest to this analysis are the columns of data for:
EVTYPE: Storm types i.e. Tornado
FATALITIES: Number of fatalities
INJURIES: Number injured
PROPDMG: Property damage
PROPDMGEXP: Property damage exponent (multiplier for PROPDMG)
CROPDMG: Crop damage exponent
CROPDMGEXP: Croperty damage exponent (multiplier for CROPDMG)
Load the storm data and libraries
storm_data <- read.csv("repdata_data_StormData.csv.bz2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
Previewing the data structure prior to later data manipulation: using the functions head and str on the storm_data.
902297 observations of 37 variables
Relevant str data for the two questions to be answered are:
$ EVTYPE : chr “TORNADO” “TORNADO” “TORNADO” “TORNADO” …
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 …
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 …
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 …
$ PROPDMGEXP: chr “K” “K” “K” “K” …
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 …
$ CROPDMGEXP: chr “” “” “” “” …
Summary of analysis done for assignment questions 1 and 2.
Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Question 2: Across the United States, which types of events have the greatest economic consequences?
The storm data is reduced to those columns of data required to answer the two assignment questions (sub_storm_data): EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.
For question 1 the storm data is assigned to sub_storm_data_health. The three columns EVTYPE (storm event), FATALITIES and INJURIES are reduced to those columns containing values greater than one for either FATALITIES or INJURIES.
The columns for FATALITIES and INJURIES are checked for missing values and none found.
The value for the two columns are summed based on event. The summed values by event are sorted and the top ten events are listed and graphed.
For question 2 the storm data is assigned to sub_storm_data_econ. The five columns EVTYPE (storm event), PROPDMG (property damage), PRODMGEXP (property damage exponent), CROPDMG (crop damage), CROPDMGEXP (crop damage exponent).
Create a data frame (sub_storm_data_econ) to address the needs of the question. Columns to include: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.
The columns for PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP are checked for missing values and none found.
The columns for PROPDMGEXP and CROPDMGEXP have values for exponents such as K, M, B,… that are converted to a common numeric base to be used in calculations to determine damage by event. The common base for conversion is that of 10.
Damage is determined for property, crop, and total for property and crop. To assess damage the product of the damage is multiplied with the value for the exponent (base 10 number). The summed values by event are sorted and the top ten events by event for each category is determined: top ten property damage, top ten crop damage, and top ten combined property and crop damage.
The top ten most harmful events across the United States with respect to population health:
| Order | Event | Harm Total |
|---|---|---|
| 1 | Tornado | 96979 |
| 2 | Excessive Heat | 8428 |
| 3 | TSTM Wind | 7461 |
| 4 | Flood | 7259 |
| 5 | Lightning | 6046 |
| 6 | Heat | 3037 |
| 7 | Flash Flood | 2755 |
| 8 | Ice Storm | 2064 |
| 9 | Thunderstorm Wind | 1621 |
| 10 | Winter Storm | 1527 |
The top ten most harmful events across the United States with respect to property damage:
| Order | Event | Harm Total |
|---|---|---|
| 1 | Flood | 144,657,709,807 |
| 2 | Hurricane/Typhoon | 69,305,840,000 |
| 3 | Tornado | 56,947,380,677 |
| 4 | Storm Surge | 43,323,536,000 |
| 5 | Flash Flood | 16,822,673,979 |
| 6 | Hail | 15,735,267,018 |
| 7 | Hurricane | 11,868,319,010 |
| 8 | Tropical Storm | 7,703,890,550 |
| 9 | Winter Storm | 6,688,497,251 |
| 10 | High Wind | 5,270,046,295 |
The top ten most harmful events across the United States with respect to crop damage:
| Order | Event | Harm Total |
|---|---|---|
| 1 | Drought | 13,953,264,600 |
| 2 | Flood | 5,515,683,845 |
| 3 | River Flood | 5,026,345,900 |
| 4 | Ice Storm | 5,020,616,350 |
| 5 | Hurricane | 2,739,570,000 |
| 6 | Hurricane/Typhoon | 2,604,540,280 |
| 7 | Hail | 2,506,542,468 |
| 8 | Extreme Cold | 1,288,623,300 |
| 9 | Flash Flood | 1,261,155,710 |
| 10 | Frost/Freeze | 1,088,734,600 |
sub_storm_data <- storm_data[,c("EVTYPE","FATALITIES","INJURIES","PROPDMG",
"PROPDMGEXP", "CROPDMG","CROPDMGEXP")]
Create the subset data frame to address health – fatalities and injuries where there has been harm (sub_storm_data_health).
sub_storm_data_health <- subset(sub_storm_data, FATALITIES > 0 | INJURIES >0)
Include columns EVTYPE, FATALITIES, INJURIES where values are greater than 0. Check for missing values in the identified columns of interest
sum(is.na(sub_storm_data_health$FATALITIES))
## [1] 0
NOTE no missing data
sum(is.na(sub_storm_data_health$INJURIES))
## [1] 0
NOTE no missing data
The unique function is used for sub_storm_data_health$EVTYPE producing 220 unique event types in the health (FATALITIES and INJURIES) columns.
Sum the FATALITIES and INJURIES column values to determine the total harm by EVTYPE.
Select the top 10 most harmful EVTYPEs
sub_storm_data_health$sum_harm <- rowSums(sub_storm_data_health[, c(2,3)], na.rm=TRUE)
most_harmful <- aggregate(sum_harm ~ EVTYPE, data = sub_storm_data_health, FUN="sum")
most_harmful <- most_harmful[order(-most_harmful$sum_harm),]
top_ten <- head(most_harmful, n = 10)
Create a chart of the top ten harm (fatalities and injuries)
top_ten$EVTYPE <- factor(top_ten$EVTYPE, levels = unique(top_ten$EVTYPE))
ggplot(top_ten, aes(x = EVTYPE, y = sum_harm, fill = EVTYPE)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Population Harm by Storm Event", x = "Storm Event",
y = "Total Harm")
Analysis includes the columns EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP for analysis.
The unique values in the “PROPDMGEXP” and “CROPDMGEXP” fields represent different exponents used to interpret property damage values in NOAA storm data.
Without a Code Book for the data some values were interpreted as a zero exponent or multiplication by 1.
Here’s how each value is interpreted:
• “K”, “k”: Denotes thousands.
• “M”, “m”: Denotes millions.
• “B”: Denotes billions.
• “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”: numeric values representing exponents.
• “h”, “H”, “-”, “?”, and “+”: could not be interpreted and are replaced with “0” equivalent to multiplication by 1.
Create the subset data frame to address economics
sub_storm_data_econ <- subset(sub_storm_data,PROPDMG > 0 | CROPDMG >0)
check for missing values in the identified columns of interest
sum(is.na(sub_storm_data_econ$PROPDMG))
## [1] 0
no missing data
sum(is.na(sub_storm_data_econ$PROPDMGEXP))
## [1] 0
no missing data
sum(is.na(sub_storm_data_econ$CROPDMG))
## [1] 0
no missing data
sum(is.na(sub_storm_data_econ$CROPDMGEXP))
## [1] 0
no missing data
The unique function was used for sub_storm_data_econ$EVTYPE to determine that there are 431 EVTYPE events for econ event type.
Determine the exponential indicators that need to be converted to numeric values.
unique(sub_storm_data_econ$PROPDMGEXP)
## [1] "K" "M" "B" "m" "" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
[1] “K” “M” “B” “m” “” “+” “0” “5” “6” “4” “h” “2” “7” “3” “H” “-”
setting k to 1,000,
M & m to 1,000,000,
B = 1,000,000,000,
numeric as the 10 exponent,
without a code book to clarify – ““, +, h, H, - to be set to the exponent 0 equivalent to multiplying by 1.
unique(sub_storm_data_econ$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k"
[1] “” “M” “K” “m” “B” “?” “0” “k”
setting “M” & “m” to 1,000,000,
“K” & “k” to 1,000,
“B” to 1,000,000,000,
numeric as the 10 exponent,
without a code book to clarify “?” to be set to 1.
Convert PROPDMGEXP exponents to common numeric values
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="K" | sub_storm_data_econ$PROPDMGEXP == "k"
|sub_storm_data_econ$PROPDMGEXP == "3"] <- 1000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="B" | sub_storm_data_econ$PROPDMGEXP == "9"] <- 1000000000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="M" | sub_storm_data_econ$PROPDMGEXP == "m"
|sub_storm_data_econ$PROPDMGEXP == "6"] <- 1000000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="" | sub_storm_data_econ$PROPDMGEXP == "+"
|sub_storm_data_econ$PROPDMGEXP == "0"
|sub_storm_data_econ$PROPDMGEXP == "-" |sub_storm_data_econ$PROPDMGEXP == "?"
|sub_storm_data_econ$PROPDMGEXP == "h"
|sub_storm_data_econ$PROPDMGEXP == "H"] <- 1
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="5"] <- 100000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="4"] <- 10000
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="2"] <- 100
sub_storm_data_econ$PROPDMGEXP[sub_storm_data_econ$PROPDMGEXP =="7"] <- 10000000
Convert CROPDMGEXP exponents to common numeric values
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="K" | sub_storm_data_econ$CROPDMGEXP == "k"] <- 100
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="B"] <- 1000000000
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="M" | sub_storm_data_econ$CROPDMGEXP == "m"] <- 1000000
sub_storm_data_econ$CROPDMGEXP[sub_storm_data_econ$CROPDMGEXP =="" |sub_storm_data_econ$CROPDMGEXP == "0"
|sub_storm_data_econ$CROPDMGEXP == "?" ] <- 1
create column for property damage consequences and crop damage conequences.
“prop_conseq” and “crop_conseq” as product of …DMG value and …EXP value.
create column for total property damage that is the sum of the property and crop consequences columns (prop_conseq and crop_conseq)
sub_storm_data_econ$PROPDMGEXP <- as.numeric(sub_storm_data_econ$PROPDMGEXP)
sub_storm_data_econ$CROPDMGEXP <- as.numeric(sub_storm_data_econ$CROPDMGEXP)
sub_storm_data_econ$prop_conseq <- sub_storm_data_econ$PROPDMG * sub_storm_data_econ$PROPDMGEXP
sub_storm_data_econ$crop_conseq <- sub_storm_data_econ$CROPDMG * sub_storm_data_econ$CROPDMGEXP
sub_storm_data_econ$total_conseq <- rowSums(sub_storm_data_econ[, c(8,9)], na.rm=TRUE)
Determine the most harmful total(property and crop)
most_harmful_total <- aggregate(total_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_total <- most_harmful_total[order(-most_harmful_total$total_conseq),]
top_ten_total <- head(most_harmful_total, n = 10)
The top ten most harmful events across the United States with respect to combined property and crop damage:
| Order | Event | Harm Total |
|---|---|---|
| 1 | Flood | 150,173,393,652 |
| 2 | Hurricane/Typhoon | 71,910,380,280 |
| 3 | Tornado | 57,272,745,148 |
| 4 | Storm Surge | 43,323,536,500 |
| 5 | Hail | 18,241,809,486 |
| 6 | Flash Flood | 18,083,829,689 |
| 7 | Drought | 14,999,370,600 |
| 8 | Hurricane | 14,607,889,010 |
| 9 | River Flood | 10,145,291,400 |
| 10 | Ice Storm | 8,965,544,210 |
Determine the most harmful property (prop_conseq)
most_harmful_prop <- aggregate(prop_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_prop <- most_harmful_prop[order(-most_harmful_prop$prop_conseq),]
top_ten_prop <- head(most_harmful_prop, n = 10)
Graph the top ten property damage events
top_ten_prop$EVTYPE <- factor(top_ten_prop$EVTYPE, levels = unique(top_ten_prop$EVTYPE))
ggplot(top_ten_prop, aes(x = EVTYPE, y = prop_conseq, fill = EVTYPE)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Total Property Harm by Storm Event",x = "EVENT (Property)",
y = "Property Harm")
Determine the most harmful crop (prop_conseq)
most_harmful_crop <- aggregate(crop_conseq ~ EVTYPE, data = sub_storm_data_econ, FUN="sum")
most_harmful_crop <- most_harmful_crop[order(-most_harmful_crop$crop_conseq),]
top_ten_crop <- head(most_harmful_crop, n = 10)
Graph the top ten crop damage events
top_ten_crop$EVTYPE <- factor(top_ten_crop$EVTYPE, levels = unique(top_ten_crop$EVTYPE))
ggplot(top_ten_crop, aes(x = EVTYPE, y = crop_conseq, fill = EVTYPE)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Total Crop Harm by Storm Event",x = "EVENT (Crop)",
y = "Crop Harm")