#INTRODUCTION This report analyzes the NOAA Storm Database to determine which types of severe weather events are most harmful to population health and which have the greatest economic consequences across the United States. The analysis focuses on the EVTYPE variable to categorize events and uses FATALITIES and INJURIES to assess population health impacts. Economic consequences are evaluated using PROPDMG and CROPDMG, adjusted for their respective exponents. The data is loaded directly from the raw CSV file and processed within this document. The results are presented through tables and figures, highlighting the most significant event types in terms of both human and economic impact. This analysis aims to provide information relevant to government and municipal managers for resource allocation in severe weather preparedness.
The data was downloaded from the course website to a local computer. It was then loaded into R usint its format
data <- read.csv("repdata_data_stormData.csv.bz2")
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
summary(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
Since the data analysis is to evaluate the effect of severe weather on health and economics, only variables of interest to these were selected for the analysis.
Selected_data <- data %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
The Eventype variable was cleaned by converting it to an uppercase and removing leading/trailing spaces in between.
Selected_data$EVTYPE <- toupper(str_trim(Selected_data$EVTYPE))
Converting the damage exponent variables to numeric and create columns for the actual damage values
unique(Selected_data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(Selected_data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
# Standardize exponents to uppercase and handle unexpected values
Selected_data <- Selected_data %>%
mutate(
PROPDMGEXP = toupper(PROPDMGEXP),
CROPDMGEXP = toupper(CROPDMGEXP),
PROPDMGEXP_VAL = case_when(
PROPDMGEXP == "K" ~ 1000,
PROPDMGEXP == "M" ~ 1000000,
PROPDMGEXP == "B" ~ 1000000000,
PROPDMGEXP == "H" ~ 100,
PROPDMGEXP %in% c("", "+", "-", "?", "0", "1", "2", "3", "4", "5", "6", "7", "8") ~ 1, # Treat as 1
TRUE ~ 1 # Default to 1 for any other case
),
CROPDMGEXP_VAL = case_when(
CROPDMGEXP == "K" ~ 1000,
CROPDMGEXP == "M" ~ 1000000,
CROPDMGEXP == "B" ~ 1000000000,
CROPDMGEXP == "H" ~ 100,
CROPDMGEXP %in% c("", "+", "-", "?", "0", "1", "2", "3", "4", "5", "6", "7", "8") ~ 1, # Treat as 1
TRUE ~ 1 # Default to 1 for any other case
),
PROPDMG_ADJ = PROPDMG * PROPDMGEXP_VAL,
CROPDMG_ADJ = CROPDMG * CROPDMGEXP_VAL
)
calculating the total health impact and economic impact for each event type
health_impact <- Selected_data %>%
group_by(EVTYPE) %>%
summarize(
total_fatalities = sum(FATALITIES, na.rm = TRUE),
total_injuries = sum(INJURIES, na.rm = TRUE),
total_health_impact = total_fatalities + total_injuries
) %>%
arrange(desc(total_health_impact))
economic_impact <- Selected_data %>%
group_by(EVTYPE) %>%
summarize(
total_prop_dmg = sum(PROPDMG_ADJ, na.rm = TRUE),
total_crop_dmg = sum(CROPDMG_ADJ, na.rm = TRUE),
total_economic_impact = total_prop_dmg + total_crop_dmg
) %>%
arrange(desc(total_economic_impact))
Event type with high population health impact
head(health_impact, 10)
## # A tibble: 10 × 4
## EVTYPE total_fatalities total_injuries total_health_impact
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
Plot of eventypes with the greatest population health impact.
ggplot(head(health_impact, 10), aes(x = reorder(EVTYPE, total_health_impact), y = total_health_impact)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 Events by Population Health Impact", x = "Event Type", y = "Total Health Impact (Fatalities + Injuries)")
Event types with greatest economics consequences
head(economic_impact, 10)
## # A tibble: 10 × 4
## EVTYPE total_prop_dmg total_crop_dmg total_economic_impact
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937160779. 414953270 57352114049.
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732267543. 3025954473 18758222016.
## 6 FLASH FLOOD 16140862067. 1421317100 17562179167.
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927860 5022113500 8967041360
Plot of event types with the greatest economic impact
ggplot(head(economic_impact, 10), aes(x = reorder(EVTYPE, total_economic_impact), y = total_economic_impact)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 Events by Economic Impact", x = "Event Type", y = "Total Economic Impact (Property + Crop Damage)")
The analysis and the result above shows that Tornado events have the most significant impact on population health while flood events have the greatest economic impact and consequences. These findings can help municipal managers to prioritixe resources in preparing for severe weather. Also, it could help the Government not only in preparedness or resources allocation but in policies and efforts that could mitigate those events in the future.
##Sensitivity Analysis ###Sensitivity Analysis on Health Impact
The analysis above ignored the missing values and since the primary area of uncertainty in data analysis is how missing data are treated. Thus a sensitivity analysis is conducted to have a deeper understanding of the robustness of the result presented above and to ascertain the reliability of the analyis.
In the case of impact on population health, the sensitivity analysis is done by using regular expressions to standardize event names. The aim is to combine similar event types that might be recorded with slight variation.
# More aggressive EVTYPE cleaning using regular expressions
Selected_data_cleaned2 <- Selected_data %>%
mutate(
EVTYPE_CLEANED2 = case_when(
str_detect(EVTYPE, "TORNADO") ~ "TORNADO",
str_detect(EVTYPE, "FLOOD") ~ "FLOOD",
str_detect(EVTYPE, "HURRICANE|TYPHOON") ~ "HURRICANE/TYPHOON",
str_detect(EVTYPE, "EXCESSIVE HEAT") ~ "EXCESSIVE HEAT",
str_detect(EVTYPE, "LIGHTNING") ~ "LIGHTNING",
str_detect(EVTYPE, "TSTM WIND|THUNDERSTORM WIND") ~ "THUNDERSTORM WIND",
str_detect(EVTYPE, "FLASH FLOOD") ~ "FLASH FLOOD",
str_detect(EVTYPE, "HEAT") ~ "HEAT",
str_detect(EVTYPE, "ICE STORM") ~ "ICE STORM",
str_detect(EVTYPE, "WINTER STORM") ~ "WINTER STORM",
TRUE ~ "OTHER"
)
)
health_impact_cleaned2 <- Selected_data_cleaned2 %>%
group_by(EVTYPE_CLEANED2) %>%
summarize(
total_fatalities = sum(FATALITIES, na.rm = TRUE),
total_injuries = sum(INJURIES, na.rm = TRUE),
total_health_impact = total_fatalities + total_injuries
) %>%
arrange(desc(total_health_impact))
# Compare original and refined cleaning results
head(health_impact, 10)
## # A tibble: 10 × 4
## EVTYPE total_fatalities total_injuries total_health_impact
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
head(health_impact_cleaned2, 10)
## # A tibble: 10 × 4
## EVTYPE_CLEANED2 total_fatalities total_injuries total_health_impact
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5661 91407 97068
## 2 OTHER 2835 11892 14727
## 3 THUNDERSTORM WIND 728 9493 10221
## 4 FLOOD 1525 8604 10129
## 5 EXCESSIVE HEAT 1922 6525 8447
## 6 LIGHTNING 817 5232 6049
## 7 HEAT 1216 2699 3915
## 8 ICE STORM 89 1990 2079
## 9 WINTER STORM 217 1353 1570
## 10 HURRICANE/TYPHOON 135 1333 1468
The result above shows a difference in the first 10 event impact on population health when a different data cleaning method approach is used. Although it didn’t change the result of the first analyis in that Tornado is still the leading event that mostly impact population health but it is important to pay attention to “Other” event as well when considering allocation of resources or any Goverment policies that could mitigate the events.
ggplot(head(health_impact_cleaned2, 10), aes(x = reorder(EVTYPE_CLEANED2, total_health_impact), y = total_health_impact)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 Events by Population Health Impact (Refined EVTYPE Cleaning)", x = "Event Type", y = "Total Health Impact (Fatalities + Injuries)")
Comparing the figure above to the previous figure generated for the impact on population health, we will see clearyly that there are some discrepancy in the figures. This gives us more insight into how the result of an analyis could change when a missing data is handled differently.
# Assuming you have health_impact and health_impact_cleaned2 data frames
# Find common event types
common_events <- intersect(health_impact$EVTYPE, health_impact_cleaned2$EVTYPE_CLEANED2)
# Filter data frames to include only common events
health_impact_common <- health_impact %>%
filter(EVTYPE %in% common_events)
health_impact_cleaned2_common <- health_impact_cleaned2 %>%
filter(EVTYPE_CLEANED2 %in% common_events)
# Join the data frames for comparison (using a simplified join for demonstration)
comparison_data <- data.frame(
EVTYPE = health_impact_common$EVTYPE,
Original_Impact = health_impact_common$total_health_impact,
Cleaned_Impact = health_impact_cleaned2_common$total_health_impact
)
# Calculate percentage change
comparison_data <- comparison_data %>%
mutate(
Percentage_Change = ((Cleaned_Impact - Original_Impact) / Original_Impact) * 100
)
# Display the comparison data
print(comparison_data)
## EVTYPE Original_Impact Cleaned_Impact Percentage_Change
## 1 TORNADO 96979 97068 9.177245e-02
## 2 EXCESSIVE HEAT 8428 14727 7.473897e+01
## 3 FLOOD 7259 10221 4.080452e+01
## 4 LIGHTNING 6046 10129 6.753225e+01
## 5 HEAT 3037 8447 1.781363e+02
## 6 ICE STORM 2064 6049 1.930717e+02
## 7 THUNDERSTORM WIND 1621 3915 1.415176e+02
## 8 WINTER STORM 1527 2079 3.614931e+01
## 9 HURRICANE/TYPHOON 1339 1570 1.725168e+01
## 10 OTHER 4 1468 3.660000e+04
# You can also visualize this data, for example, using a bar chart
library(ggplot2)
ggplot(comparison_data, aes(x = EVTYPE, y = Percentage_Change)) +
geom_bar(stat = "identity") +
labs(
title = "Percentage Change in Health Impact After Cleaning",
x = "Event Type",
y = "Percentage Change"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
###Sensitivity Analysis on Economic Impact
Although, in the analyis above the exponent values both numerical and symbols are treated as multipliers of 1. It is important to explore if there will be a change in result if the missing data were handled differently.
In the sensitivity analysis, missing data were imputed by replacing missing numerical damage values with mean or median of the non-missing values for each event type
# Mean Imputation for PROPDMG_ADJ
Selected_data_imputed <- Selected_data %>%
group_by(EVTYPE) %>%
mutate(
PROPDMG_ADJ_IMPUTED = ifelse(is.na(PROPDMG_ADJ), mean(PROPDMG_ADJ, na.rm = TRUE), PROPDMG_ADJ),
CROPDMG_ADJ_IMPUTED = ifelse(is.na(CROPDMG_ADJ), mean(CROPDMG_ADJ, na.rm = TRUE), CROPDMG_ADJ)
) %>%
ungroup()
economic_impact_imputed <- Selected_data_imputed %>%
group_by(EVTYPE) %>%
summarize(
total_prop_dmg_imputed = sum(PROPDMG_ADJ_IMPUTED, na.rm = TRUE),
total_crop_dmg_imputed = sum(CROPDMG_ADJ_IMPUTED, na.rm = TRUE),
total_economic_impact_imputed = total_prop_dmg_imputed + total_crop_dmg_imputed
) %>%
arrange(desc(total_economic_impact_imputed))
# Compare original and imputed results
head(economic_impact, 10)
## # A tibble: 10 × 4
## EVTYPE total_prop_dmg total_crop_dmg total_economic_impact
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937160779. 414953270 57352114049.
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732267543. 3025954473 18758222016.
## 6 FLASH FLOOD 16140862067. 1421317100 17562179167.
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927860 5022113500 8967041360
head(economic_impact_imputed, 10)
## # A tibble: 10 × 4
## EVTYPE total_prop_dmg_imputed total_crop_dmg_imputed total_economic_impac…¹
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICA… 69305840000 2607872800 71913712800
## 3 TORNADO 56937160779. 414953270 57352114049.
## 4 STORM S… 43323536000 5000 43323541000
## 5 HAIL 15732267543. 3025954473 18758222016.
## 6 FLASH F… 16140862067. 1421317100 17562179167.
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICA… 11868319010 2741910000 14610229010
## 9 RIVER F… 5118945500 5029459000 10148404500
## 10 ICE STO… 3944927860 5022113500 8967041360
## # ℹ abbreviated name: ¹total_economic_impact_imputed
The result shows that there is no difference in the way the missing data was treated. Either by ignoring the data or by imputing the result seems the sames
ggplot(head(economic_impact_imputed, 10), aes(x = reorder(EVTYPE, total_economic_impact_imputed), y = total_economic_impact_imputed)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 Events by Economic Impact", x = "Event Type", y = "Total Economic Impact (Property + Crop Damage)")