This project involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The task of this project is to assess which events are the most harmful to people's health and lives, and which cause the most economic damage.
The issue with this dataset is that it is quite complex and messy: there are many entry errors and many variables to consider. Therefore, the analysis will proceed in two steps: first, a preliminary exploration of the data as they currently stand; second, cleaning of the EVTYPE variable and a re-analysis on the cleaned events.
The analysis has been broken down this way to make it a little easier to take in.
library(tidyverse)
library(R.utils)
library(stringr)
library(lattice)
filePath = "./data/stormdata.csv.bz2"
if(!file.exists(filePath)) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", filePath)
}
#Unzipping the file contents
bunzip2(filePath, "./data/stormdata.csv", remove = FALSE, skip = TRUE)
#Viewing extracted files
str_view(list.files("./data"), "stormdata.+", match = TRUE)
#Reading into R
Sys.time()
stormdata_table = read.csv("./data/stormdata.csv")
Sys.time()
[1] "2022-08-05 09:13:02 EAT"
[1] "2022-08-05 09:13:23 EAT"
head(stormdata_table)
| | STATE__ | BGN_DATE | BGN_TIME | TIME_ZONE | COUNTY | COUNTYNAME | STATE | EVTYPE | BGN_RANGE | ... | LATITUDE | LONGITUDE | LATITUDE_E | LONGITUDE_ | REFNUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 1 | 4/18/1950 0:00:00 | 0130 | CST | 97 | MOBILE | AL | TORNADO | 0 | ... | 3040 | 8812 | 3051 | 8806 | 1 |
| 2 | 1 | 4/18/1950 0:00:00 | 0145 | CST | 3 | BALDWIN | AL | TORNADO | 0 | ... | 3042 | 8755 | 0 | 0 | 2 |
| 3 | 1 | 2/20/1951 0:00:00 | 1600 | CST | 57 | FAYETTE | AL | TORNADO | 0 | ... | 3340 | 8742 | 0 | 0 | 3 |
| 4 | 1 | 6/8/1951 0:00:00 | 0900 | CST | 89 | MADISON | AL | TORNADO | 0 | ... | 3458 | 8626 | 0 | 0 | 4 |
| 5 | 1 | 11/15/1951 0:00:00 | 1500 | CST | 43 | CULLMAN | AL | TORNADO | 0 | ... | 3412 | 8642 | 0 | 0 | 5 |
| 6 | 1 | 11/15/1951 0:00:00 | 2000 | CST | 77 | LAUDERDALE | AL | TORNADO | 0 | ... | 3450 | 8748 | 0 | 0 | 6 |
Since cleaning is going to take a long time, I'm first going to explore the relationships between the events as they currently stand and the other variables of interest.
Variables of interest include EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
#filtering only for important variables
#Creating subset of data
min_stormdata = select(stormdata_table, EVTYPE, FATALITIES:CROPDMGEXP)
#Generating summary statistics
sum_df = group_by(min_stormdata, EVTYPE) %>%
summarize(fatalitiessum = sum(FATALITIES, na.rm = TRUE),
injuriessum = sum(INJURIES, na.rm = TRUE),
propertydamagesum = sum(PROPDMG, na.rm = TRUE),
cropdamagesum = sum(CROPDMG, na.rm = TRUE)
)
mean_df = group_by(min_stormdata, EVTYPE) %>%
summarize(fatalitiesmean = mean(FATALITIES, na.rm = TRUE),
injuriesmean = mean(INJURIES, na.rm = TRUE),
propertydamagemean = mean(PROPDMG, na.rm = TRUE),
cropdamagemean = mean(CROPDMG, na.rm = TRUE)
)
#Adding factor variable for creating panel plots
tt = mutate(sum_df, factor_var = rep_len(1:20, length.out = nrow(sum_df)))
tt2 = mutate(mean_df, factor_var = rep_len(1:20, length.out = nrow(mean_df)))
#Plotting
histogram(~ fatalitiessum | as.factor(factor_var), data = tt, layout = c(4,1))
##This is incomprehensible to me
##A plan B is needed
#creating a dataframe of each variable and its maximum value
results = data.frame(
events = names(tt[, 2:5]) ,
quantity = apply(tt[, 2:5], 2, max)
)
results2 = data.frame(
events = names(tt2[, 2:5]) ,
quantity = apply(tt2[, 2:5], 2, max)
)
#Plotting
ggplot(data = results, aes(events, quantity)) +
geom_histogram(stat = "identity")
ggplot(data = results2, aes(events, quantity)) +
geom_histogram(stat = "identity")
Warning message: "Ignoring unknown parameters: binwidth, bins, pad" Warning message: "Ignoring unknown parameters: binwidth, bins, pad"
#sorting to find the top event for each outcome, by totals
rbind(
arrange(tt, desc(fatalitiessum)) %>% head(n = 1),
arrange(tt, desc(injuriessum)) %>% head(n = 1),
arrange(tt, desc(propertydamagesum)) %>% head(n = 1),
arrange(tt, desc(cropdamagesum)) %>% head(n = 1)
)
#Repeating the ranking on the means; here the combined category TORNADOES, TSTM WIND, HAIL has the highest mean fatalities
rbind(
arrange(tt2, desc(fatalitiesmean)) %>% head(n = 1),
arrange(tt2, desc(injuriesmean)) %>% head(n = 1),
arrange(tt2, desc(propertydamagemean)) %>% head(n = 1),
arrange(tt2, desc(cropdamagemean)) %>% head(n = 1)
)
| EVTYPE | fatalitiessum | injuriessum | propertydamagesum | cropdamagesum | factor_var |
|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
| TORNADO | 5633 | 91346 | 3212258.2 | 100018.5 | 14 |
| TORNADO | 5633 | 91346 | 3212258.2 | 100018.5 | 14 |
| TORNADO | 5633 | 91346 | 3212258.2 | 100018.5 | 14 |
| HAIL | 15 | 1361 | 688693.4 | 579596.3 | 4 |
| EVTYPE | fatalitiesmean | injuriesmean | propertydamagemean | cropdamagemean | factor_var |
|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
| TORNADOES, TSTM WIND, HAIL | 25 | 0 | 1.6 | 2.5 | 2 |
| Heat Wave | 0 | 70 | 0.0 | 0.0 | 17 |
| COASTAL EROSION | 0 | 0 | 766.0 | 0.0 | 12 |
| DUST STORM/HIGH WINDS | 0 | 0 | 50.0 | 500.0 | 18 |
Preliminary findings are inconclusive. All that can be said for certain is that different events cause different types of damage, which is to be expected. The identity of these events remains unclear: the EVTYPE variable is very untidy, with the same events recorded under different spellings across observations. A lot of cleaning will be needed.
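To give a sense of how untidy EVTYPE is, a quick check along the following lines (a sketch only; its output is not recorded here) counts the distinct spellings that appear to refer to thunderstorm winds alone:
#Counting the distinct EVTYPE spellings that look like thunderstorm wind events
min_stormdata %>%
    filter(str_detect(EVTYPE, regex("THUNDERSTORM|TSTM", ignore_case = TRUE))) %>%
    distinct(EVTYPE) %>%
    nrow()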
Below are the conclusions of the first phase of exploration:

MEAN

- The combined category TORNADOES, TSTM WIND, HAIL has the highest mean fatalities.
- Heat Wave has the highest mean injuries.
- COASTAL EROSION has the highest mean property damage.
- DUST STORM/HIGH WINDS has the highest mean crop damage.

SUM

- TORNADO accounts for the largest totals of fatalities, injuries, and property damage.
- HAIL accounts for the largest total crop damage.
Cleaning the EVTYPE variable is going to take a lot of work. The expectation is that extensive use of regular expressions, built from the official event names, will suffice.
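As a minimal sketch of the idea (the event name and the spellings below are purely illustrative), turning the separators of an official event name into regex alternatives produces a pattern that matches many of the messy spellings:
#Building a pattern of alternatives from a clean event name...
pattern = str_replace_all("Thunderstorm Wind", "[ |/]", "|")
#...and testing it against some messy, made-up EVTYPE spellings
str_detect(c("TSTM WIND", "THUNDERSTORM WINDS", "Thunderstorm Winds."),
    regex(pattern, ignore_case = TRUE))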
#reading events data in
events_df = read.table("./data/events.csv", sep = "|", header = TRUE)
events_df %>% head
#replacing the spaces and slashes in each official event name with | so each name becomes a regex of alternatives
events_var = str_replace_all(events_df$Event.Name, "[ |/]", "|")
events_var
| | Event.Name | Designator |
|---|---|---|
| | <chr> | <chr> |
| 1 | Astronomical Low Tide | Z |
| 2 | Avalanche | Z |
| 3 | Blizzard | Z |
| 4 | Coastal Flood | Z |
| 5 | Cold/Wind Chill | Z |
| 6 | Debris Flow | C |
To reduce the complexity of the task, we will select a random sample of observations and work on that rather than on the entire dataframe.
#Taking a random sample of the dataframe observations to reduce complexity
set.seed(100)
sb_st = sample(x = 1:nrow(min_stormdata),
size = 10000,
replace = TRUE
)
rand_stormdata = min_stormdata[sb_st, ]
rand_stormdata %>% head
| | EVTYPE | FATALITIES | INJURIES | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP |
|---|---|---|---|---|---|---|---|
| | <chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <chr> |
| 606711 | WINTER WEATHER | 0 | 0 | 0 | | 0 | |
| 331376 | HAIL | 0 | 0 | 0 | | 0 | |
| 162777 | HAIL | 0 | 0 | 0 | | 0 | |
| 614094 | HAIL | 0 | 0 | 5 | K | 0 | |
| 732055 | THUNDERSTORM WIND | 0 | 0 | 3 | K | 0 | K |
| 353796 | HAIL | 0 | 0 | 0 | | 0 | |
#detecting all matching events in the EVTYPE variable
subset_matrix = sapply(events_var, function(x){
y = str_detect(rand_stormdata$EVTYPE, regex(x, ignore_case = TRUE))
data.frame(
filter(rand_stormdata, y)
)
}
) %>%
as.data.frame
The result of sapply() is messy, so it needs to be tidied (pun intended). We'll use the pivot_longer() function from the tidyr package.
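Before applying it to the real object, here is a toy sketch of the reshaping (the tibble below is invented purely for illustration):
#pivot_longer() stacks the event columns into rows;
#pivot_wider() then spreads the effect labels back out into columns
toy = tibble(Effects = c("FATALITIES", "INJURIES"), X1 = c(1, 2), X2 = c(3, 4))
toy %>%
    pivot_longer(c(X1, X2), names_to = "event", values_to = "figures") %>%
    pivot_wider(names_from = "Effects", values_from = "figures")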
has_rownames(subset_matrix)
subset_matrix2 = rownames_to_column(subset_matrix, var = "Effects")
#Moving identified observations into rows
tidy_ev_table = subset_matrix2 %>%
pivot_longer(names(subset_matrix) , names_to = "event", values_to = "figures") %>%
#Moving effect variables into columns
pivot_wider(names_from = "Effects", values_from = "figures")
#making a logical vector that flags events with at least one matching observation
sub_vect = sapply(tidy_ev_table$EVTYPE, length) > 0
tidy_ev_table2 = filter(tidy_ev_table, sub_vect)
summ_func = function(y) {
#Creating a dataframe for each specific event
ev_y = data.frame(
event = tidy_ev_table2[y,1],
fatalities = tidy_ev_table2$FATALITIES[[y]] %>% unlist,
injuries = tidy_ev_table2$INJURIES[[y]] %>% unlist,
propertydamage = tidy_ev_table2$PROPDMG[[y]] %>% unlist,
cropdamage = tidy_ev_table2$CROPDMG[[y]] %>% unlist,
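#NOTE: the exponent handling below is a simplifying assumption:
#PROPDMGEXP "K" is treated as thousands and every other code as millions,
#while CROPDMGEXP is treated as thousands throughout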
propertyexp = sapply(tidy_ev_table2$PROPDMGEXP[[y]] %>% unlist, function(x){
if(x == "K") {x = 1000}
else {x = 1000000}
}),
cropexp = sapply(tidy_ev_table2$CROPDMGEXP[[y]] %>% unlist, function(x){
x = 1000
})
)
ev_y %>% head
#Uniting the variables
mutate(ev_y, propertyactual = propertydamage * propertyexp, .keep = "unused") %>%
mutate(cropactual = cropdamage * cropexp, .keep = "unused") %>%
#Summing the outcomes
summarize(evs = event[1], totalfatalities = sum(fatalities), totalinjuries = sum(injuries),
totalprop = sum(propertyactual), totalcrop = sum(cropactual))
}
fin_tally_df = sapply(1:nrow(tidy_ev_table2), summ_func) %>% data.frame
fin_tally_df
| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X36 | X37 | X38 | X39 | X40 | X41 | X42 | X43 | X44 | X45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | ... | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> | <named list> |
| evs | Astronomical\|Low\|Tide | Avalanche | Blizzard | Coastal\|Flood | Cold\|Wind\|Chill | Dense\|Fog | Dense\|Smoke | Drought | Dust\|Devil | Dust\|Storm | ... | Strong\|Wind | Thunderstorm\|Wind | Tornado | Tropical\|Depression | Tropical\|Storm | Volcanic\|Ash | Waterspout | Wildfire | Winter\|Storm | Winter\|Weather |
| totalfatalities | 0 | 3 | 2 | 13 | 23 | 1 | 1 | 0 | 0 | 4 | ... | 12 | 12 | 60 | 0 | 4 | 11 | 0 | 0 | 5 | 1 |
| totalinjuries | 0 | 1 | 6 | 19 | 105 | 7 | 7 | 0 | 0 | 78 | ... | 105 | 106 | 641 | 0 | 78 | 18 | 0 | 43 | 79 | 60 |
| totalprop | 0 | 0 | 1e+05 | 171092250 | 259664720 | 12000 | 10000 | 508000 | 0 | 127436450 | ... | 259664720 | 259664720 | 382465070 | 34396000 | 127456450 | 100250250 | 550000 | 63832560 | 127543450 | 8540000 |
| totalcrop | 0 | 0 | 0 | 3300700 | 3428550 | 0 | 0 | 67900 | 0 | 1645550 | ... | 3428550 | 3428550 | 830000 | 0 | 1645550 | 762000 | 0 | 11000 | 1645550 | 0 |
#Tidying operations
fin_tally_df2 = rownames_to_column(fin_tally_df, var = "title") %>%
pivot_longer(X1:X29, names_to = "arbitrary", values_to = "realvalues") %>%
pivot_wider(names_from = title, values_from = realvalues, ) %>%
select(-(arbitrary)) %>%
apply(MARGIN = 2, unlist) %>%
as.data.frame %>%
mutate(totalfatalities = as.numeric(totalfatalities), totalinjuries = as.numeric(totalinjuries),
totalprop = as.numeric(totalprop), totalcrop = as.numeric(totalcrop)
)
fin_tally_df2 %>% head
| | X30 | X31 | X32 | X33 | X34 | X35 | X36 | X37 | X38 | X39 | ... | X41 | X42 | X43 | X44 | X45 | evs | totalfatalities | totalinjuries | totalprop | totalcrop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | ... | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | Marine\|High\|Wind | Marine\|Strong\|Wind | Marine\|Thunderstorm\|Wind | Rip\|Current | Sleet | Storm\|Surge\|Tide | Strong\|Wind | Thunderstorm\|Wind | Tornado | Tropical\|Depression | ... | Volcanic\|Ash | Waterspout | Wildfire | Winter\|Storm | Winter\|Weather | Astronomical\|Low\|Tide | 0 | 0 | 0 | 0 |
| 2 | Marine\|High\|Wind | Marine\|Strong\|Wind | Marine\|Thunderstorm\|Wind | Rip\|Current | Sleet | Storm\|Surge\|Tide | Strong\|Wind | Thunderstorm\|Wind | Tornado | Tropical\|Depression | ... | Volcanic\|Ash | Waterspout | Wildfire | Winter\|Storm | Winter\|Weather | Avalanche | 3 | 1 | 0 | 0 |
| 3 | Marine\|High\|Wind | Marine\|Strong\|Wind | Marine\|Thunderstorm\|Wind | Rip\|Current | Sleet | Storm\|Surge\|Tide | Strong\|Wind | Thunderstorm\|Wind | Tornado | Tropical\|Depression | ... | Volcanic\|Ash | Waterspout | Wildfire | Winter\|Storm | Winter\|Weather | Blizzard | 2 | 6 | 100000 | 0 |
| 4 | Marine\|High\|Wind | Marine\|Strong\|Wind | Marine\|Thunderstorm\|Wind | Rip\|Current | Sleet | Storm\|Surge\|Tide | Strong\|Wind | Thunderstorm\|Wind | Tornado | Tropical\|Depression | ... | Volcanic\|Ash | Waterspout | Wildfire | Winter\|Storm | Winter\|Weather | Coastal\|Flood | 13 | 19 | 171092250 | 3300700 |
| 5 | Marine\|High\|Wind | Marine\|Strong\|Wind | Marine\|Thunderstorm\|Wind | Rip\|Current | Sleet | Storm\|Surge\|Tide | Strong\|Wind | Thunderstorm\|Wind | Tornado | Tropical\|Depression | ... | Volcanic\|Ash | Waterspout | Wildfire | Winter\|Storm | Winter\|Weather | Cold\|Wind\|Chill | 23 | 105 | 259664720 | 3428550 |
| 6 | Marine\|High\|Wind | Marine\|Strong\|Wind | Marine\|Thunderstorm\|Wind | Rip\|Current | Sleet | Storm\|Surge\|Tide | Strong\|Wind | Thunderstorm\|Wind | Tornado | Tropical\|Depression | ... | Volcanic\|Ash | Waterspout | Wildfire | Winter\|Storm | Winter\|Weather | Dense\|Fog | 1 | 7 | 12000 | 0 |
Finally, after that whole process, it is now possible to make plots.
#Plotting
##Property Damage
ggplot(data = fin_tally_df2,
aes(evs, totalprop)) +
geom_histogram(stat = "identity") +
##Rotating label ticks
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
##Crop Damage
ggplot(data = fin_tally_df2,
aes(evs, totalcrop)) +
geom_histogram(stat = "identity") +
##Rotating label ticks
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
##Injuries
ggplot(data = fin_tally_df2,
aes(evs, totalinjuries)) +
geom_histogram(stat = "identity") +
##Rotating label ticks
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
##Fatalities
ggplot(data = fin_tally_df2,
aes(evs, totalfatalities)) +
geom_histogram(stat = "identity") +
##Rotating label ticks
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Warning message: "Ignoring unknown parameters: binwidth, bins, pad" Warning message: "Ignoring unknown parameters: binwidth, bins, pad"
Warning message: "Ignoring unknown parameters: binwidth, bins, pad"
Warning message: "Ignoring unknown parameters: binwidth, bins, pad"
There are about fifty separate event categories. The data we began with were rife with misspellings and misattributed entries. By using a relatively simple regular-expression pattern, we have been able to pick out the most impactful events from the data frame.
Also, because the dataframe is so large, we took a random sample of the observations to speed up the process. The expectation is that this sample is a faithful representation of the whole dataset.
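One way to check that assumption (a sketch only; the choice of the five most frequent event types is arbitrary) is to compare their share of observations in the sample against the full dataset:
#Comparing the share of the five most frequent event types in the sample and in the full data
count(min_stormdata, EVTYPE, sort = TRUE) %>% mutate(share = n / sum(n)) %>% head(n = 5)
count(rand_stormdata, EVTYPE, sort = TRUE) %>% mutate(share = n / sum(n)) %>% head(n = 5)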