Geochemistry - Crude Oil Assay

Libraries

library(rmdformats) # for bookdown
library(readxl) # for excel
library(tibble) # for embedded tables
#library(plyr) # for ddply
library(tidyr) # for gather
library(dplyr) # for %>%
library(ggplot2) # for ggplot
library(rworldmap) # for geo map
library(broom) # for tidy
library(infer) # for bootstrapping
require(scales)
library(purrr)

Data Source

The data source is an Excel file that has 316 observations and 220 variables.

The Excel file is a chemical evaluation of crude oil feedstocks by petroleum testing laboratories called a crude oil assay.

The samples were first taken from a ship’s expedition near the Caribbean where the ship’s crew gathered core samples from the bottom of the ocean. The samples were then sent to a lab in order to understand the geochemistry of the sediment at the ocean floor. This data is used by large oil companies for the purpose of discovering good locations for deep sea oil drilling.

geology_data <- readxl::read_xlsx('~/Downloads/Josef oil families-2.xlsx')


geology_data_answer_key <- readxl::read_xlsx('~/Downloads/Josef oil families-2.xlsx', sheet =5)

First Look + Reference Table

We first take a look at the data in its raw form. We read the data into RStudio using the readxl package. We then verify its size, class and structure. We examine the presence of N/A values and make some decisions about whether or not these N/A are trivial or non-trivial.

We see that our dataset contains unique core samples that have a designated latitude and longitude. We also take a peak at the length of every variable and the mean of every variables.

Then, in order to help with data analysis, we create a reference table which will be useful because this particular data set contains scientific terms and abbreviations.

nr <- nrow(geology_data) # 316 observations
lth <- length(geology_data) # 219 Crude Oil Components

cl <- class(geology_data) #data frame

# glimpse(geology_data) # from tibble package
anna <- any(is.na(geology_data)) # TRUE

# check for number of unique values in each column


head(geology_data, n = 4)

geology_data %>% summarise(across(where(is.numeric), ~ length(unique(.x))))

geology_data %>% summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

Abbreviation <- c('EOM','TSF','BIOD_','13Csats', '13Carom', 'NSO')
Meaning <- c('extractable organic matter','total spectral fluorescence', 'biodegredation','carbon isotope', 'carbon isotope', 'Nitrogen-Sulfur-Oxygen (NSO)')
Description <- c('fraction of organic matter extracted from soil','emission of light','decomposition by bacteria', 'ratio in carbon fraction', 'ratio in aromatic fraction', 'organic compounds that occur in crude oils')
reference_table <- data.frame(Abbreviation, Meaning, Description)
reference_table

Housekeeping + Fortifying

We do further work tidying the data. This includes renaming certain variables. We follow tidy principles by making sure that each column is one variable only. In order to enhance our data for further analysis, we also compute new variables and create new categories.

geology_data$OilSampleID <- as.factor(geology_data$OilSampleID) 

length(unique(geology_data$OilSampleID))

geology_data_selection <- geology_data %>%
  select(OilSampleID, Type, Country, `USGS Province`, Well, Latitude, Longitude, EOM, Misc...43, BIOD, `% Sat`, `% Aro`, `% NSO`, `% Asph`, BIOD_)


# let's add a new column, which is percent saturates plus percent aromatics
geology_data_selection <- geology_data_selection %>%
  mutate(Hydrocarbons = geology_data$`% Sat` + geology_data$`% Aro`)


colnames(geology_data_selection)[9] <- c("TSF")

colnames(geology_data_selection)[9] <- c("TSF")


# answer key
geology_data_answer_key <- geology_data_answer_key[, 1:2]

colnames(geology_data_answer_key) <- c("OilSampleID", "Value")

geology_data_answer_key <- geology_data_answer_key %>%
  mutate(Value = "Oily")

geology_data_selection <- left_join(geology_data_selection, geology_data_answer_key, by = "OilSampleID")

nrow(geology_data_selection)
length(geology_data_selection)

geology_data_selection$Value[is.na(geology_data_selection$Value)] <- "Not Oily"

geology_data_selection <- geology_data_selection %>%
  mutate(Numeric_Value = ifelse(Value == "Oily", 1, 0))

any(is.na(geology_data_selection$EOM))
any(is.na(geology_data_selection$TSF))
any(is.na(geology_data_selection$Hydrocarbons))
  
# geology_data_selection_complete_cases <- geology_data_selection[complete.cases(geology_data_selection[ , 8:9]), ]
# 
# geology_data_selection_complete_cases <- geology_data_selection[complete.cases(geology_data_selection[ , 15]), ]   

geology_data_selection <- geology_data_selection %>%
  drop_na("OilSampleID") %>%
  drop_na("EOM") %>%
  drop_na("TSF")

Core Sample Variables - EOM, TSF

Extractable Organic Matter (EOM)

We first look at extractable organic matter and the consideration that there is enough extractable organic matter at each core sampling site. EOM is to the weight of the sample parts-per-million; it is a measure of how rich the sample is in the contaminent. EOM also includes in the background organic material (dead stuff, leaves, bacteria, algae) that exists in the sediment but does not necessarily point to oil.

There are only two explanations for oil found on the sea floor: Either oil has been discharged from a passing ship and that oil finds its way to a see floor or else oil is seeping up through sediments from the sea floor. In order to eliminate the first possibility, core samples are typically taken several meters below the sea floor. If you go down a few meters, you are quite confident that oil deposits are not occurring on a human timescale.

Spectralflourescence (TSF)

Spectralflourescence is a measure that results when electromagnetic radiation (light) is shined on a sample and, when light hits the sample, some of that energy excites electrons to a higher level where they are not stable, after which they release energy and fall back down to ground state. The time gap in this measurement gives us flourescence. Importantly, with flourescence, there is a change in molecular structure and in particular this is characteristic of the aromatics in the oil mixture. The larger the time gap, the more flourescent (the more hydrocarbons or oil components) the material contains.

Distribution of EOM, TSF

We use histograms and boxplots in order to better understand the distribution of a single variable. In particular, we are looking to characterize the Extractable Organic Matter and Total Spectral Flourescence variables by talking about their center, their shape, and their spread.

We can see that the distributions of both of these variables are strongly right-skewed. The TSF variable in particular has a lot of outliers. Because the data is heavily skewed, we choose to report the median and interquartile range (IQR) instead of reporting the mean and standard deviation.

We next test the relationship of Extractable Organic Matter (EOM) as a function of Total spectral flourescence (TSF).

When looking at either EOM or TSF by themselves, we looked closely at the center, shape and spread of these variables. When we model these two variables together, we turn to discussing the form, direction and strength of the relationship as well as any unusual observations.

On the linear model graph below, we draw a circle around the core samples that failed both tests: EOM and TSF. These core samples are almost for certain not rich in oil. Instead, we call them “background” because they likely contain dead organic matter, leaves, bacteria, algae.

We take a look at our linear model and our residuals. We round to the nearest decimel place for readability.

geology_data_selection <- geology_data_selection %>%
  mutate(log_eom = log(EOM), 
         log_tsf = log(TSF))

eom_and_tsf <- geology_data_selection %>%
  dplyr::select(OilSampleID, log_tsf, log_eom, Value)

eom_and_tsf_gathered_transform <- gather(eom_and_tsf, "Measure", "Statistic", 2:3)

eom_df <- eom_and_tsf_gathered_transform %>%
  filter(Measure == "log_eom")

ggplot(geology_data_selection, aes(x = EOM, y = TSF, color = Value)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "Geochemistry Core Sampling", subtitle = "Histograms - EOM, TSF") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma) +
  geom_point(data = data.frame(x = 50000, y = 300000000), aes(x = x, y = y, pch = 1), size = 30, color = 'coral') +
  scale_shape_identity()

ggplot(data = eom_df, aes(x = Statistic, fill = Measure)) +
  facet_wrap(~ Value) +
  geom_histogram(position = "dodge", alpha = 0.6, fill = "light blue") +
  labs(title = "Geochemistry Core Sampling", subtitle = "Log EOM Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

tsf_df <- eom_and_tsf_gathered_transform %>%
  filter(Measure == "log_tsf")

ggplot(data = tsf_df, aes(x = Statistic, fill = Measure)) +
  facet_wrap(~ Value) +
  geom_histogram(position = "dodge", alpha = 0.6) +
  labs(title = "Geochemistry Core Sampling", subtitle = "Log TSF Histogram") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous("count left side", labels = comma, sec.axis = sec_axis(~ . * 1, name = "count right side")) + 
  theme(axis.text.x = element_text(angle = 45))

ggplot(eom_df, aes(x = Statistic, fill = Measure)) +
  geom_boxplot() +
  facet_wrap(~ Value) +
  labs(title = "Geochemistry Core Sampling", subtitle = "Log EOM Boxplot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater") +
  coord_cartesian(xlim = c(0, 15)) +
  coord_flip()

ggplot(tsf_df, aes(x = Statistic, fill = Measure)) +
  geom_boxplot(fill = "light blue") +
  facet_wrap(~ Value) +
  labs(title = "Geochemistry Core Sampling", subtitle = "Log TSF Boxplot") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater") +
  coord_cartesian(xlim = c(0, 25)) +
  coord_flip()

iqr <- eom_and_tsf %>%
  filter(Value == "Oily") %>%
  summarise(IQR_log_EOM = IQR(log_eom, na.rm = TRUE), IQR_log_TSF = IQR(eom_and_tsf$log_tsf, na.rm = TRUE))

median <- eom_and_tsf %>%
  summarise(median_log_EOM = median(log_eom, na.rm = TRUE), median_log_TSF = median(eom_and_tsf$log_tsf, na.rm = TRUE))

iqr_median <- cbind(iqr, median)

iqr_median <- gather(iqr_median, "Statistic", "Value", 1:4)
iqr_median <- iqr_median %>%
  mutate(Group = "Oily")

iqr_not_oily <- eom_and_tsf %>%
  filter(Value == "Not Oily") %>%
  summarise(IQR_log_EOM = IQR(log_eom, na.rm = TRUE), IQR_log_TSF = IQR(eom_and_tsf$log_tsf, na.rm = TRUE))

median_not_oily <- eom_and_tsf %>%
  summarise(median_log_EOM = median(log_eom, na.rm = TRUE), median_log_TSF = median(eom_and_tsf$log_tsf, na.rm = TRUE))

iqr_median_not_oily <- cbind(iqr, median)

iqr_median_not_oily <- gather(iqr_median_not_oily, "Statistic", "Value", 1:4)
iqr_median_not_oily <- iqr_median_not_oily %>%
  mutate(Group = "Not Oily")

iqr_median_overall <- rbind(iqr_median, iqr_median_not_oily)
iqr_median_overall

Assessing our Linear Model

In order to assess whether our linear model is reliable, we need to check for these three conditions: linearity, nearly normal residuals, constant variability.

We check for linearity by first looking at a scatterplot comparing EOM and TSF. We can also check for linearity by looking at a scatterplot comparing the x fitted values and the residuals.

We further check the conditions for nearly normal residuals and constant variability. We do this by looking first at a histogram of the residuals and also by looking at a normal probability (QQ) plot of the residuals.

Based on the histogram and the normal probability plot, the nearly normal residuals condition does appear to be met. Based on the residuals vs. fitted plot, the constant variability condition also appears to be met.

eom_and_tsf_oily <- geology_data_selection %>%
  filter(Value == "Oily")

eom_tsf_linear_model <- lm(TSF ~ EOM, eom_and_tsf_oily)

eom_tsf_linear_model_residuals <- fortify(eom_tsf_linear_model)

eom_tsf_linear_model_residuals$EOM <- round(eom_tsf_linear_model_residuals$EOM, 1)
eom_tsf_linear_model_residuals$.hat <- round(eom_tsf_linear_model_residuals$.hat, 1)
eom_tsf_linear_model_residuals$.sigma <- round(eom_tsf_linear_model_residuals$.sigma, 1)
eom_tsf_linear_model_residuals$.cooksd <-round( eom_tsf_linear_model_residuals$.cooksd, 1)
eom_tsf_linear_model_residuals$.fitted <- round(eom_tsf_linear_model_residuals$.fitted, 1)
eom_tsf_linear_model_residuals$.resid <- round(eom_tsf_linear_model_residuals$.resid, 1)
eom_tsf_linear_model_residuals$.stdresid <- round(eom_tsf_linear_model_residuals$.stdresid, 1)

head(eom_tsf_linear_model_residuals, n = 4)

tidy_eom_tsf_linear_model <- tidy(eom_tsf_linear_model)

tidy_eom_tsf_linear_model$estimate <- round(tidy_eom_tsf_linear_model$estimate, 1)
tidy_eom_tsf_linear_model$std.error <- round(tidy_eom_tsf_linear_model$std.error, 1)
tidy_eom_tsf_linear_model$statistic <- round(tidy_eom_tsf_linear_model$statistic, 1)
tidy_eom_tsf_linear_model$p.value <- round(tidy_eom_tsf_linear_model$p.value, 1)

tidy_eom_tsf_linear_model

ggplot(data = eom_tsf_linear_model, aes(x = .fitted, y = .resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed") +
    xlab("Fitted values") +
    ylab("Residuals") +
    labs(title = "Geochemistry Core Sampling", subtitle = "Scatterplot - EOM ~ TSF Linear Model - Oily") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

ggplot(data = eom_tsf_linear_model, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals") +
  labs(title = "Geochemistry Core Sampling", subtitle = "Histogram - EOM ~ TSF Linear Model - Oily") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

ggplot(data = eom_tsf_linear_model, aes(sample = .resid)) +
  stat_qq() +
  labs(title = "Geochemistry Core Sampling", subtitle = "QQ Plot - EOM ~ TSF Linear Model - Oily") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

SARA Analysis

Saturate, Aromatic, Resin and Asphaltene (SARA) is a method that divides crude oil components according to their polarizability and polarity.

SARA represents the four main fractions of an oil. Together, they account for 100%.

Saturates and aromatics are hydrocarbons, which to a chemist means their molecules contain only Hydrogen (H) and Carbon (C). NSOs are normal size molecules that contain one or more atoms of Nitrogen (N), Sulfur (S), and Oxygen (O). NSO is also sometimes referred to as Resin or Polars.

Asphaltenes are very heavy NSO molecules. NSOs are similar to the material from which asphalt roads are made. They occur in highly variable proportions in oils. The variations can be cause by the type of organic matter from which the oil was formed, the thermal history of the oil, or the separation of the different fractions as the oil moves from its point of origin to wherever it is today.

For an unaltered oil, the S/A ratio tells you something about the source and thus the family, though variations in thermal history (hotter history favors Saturates) can blur the trends. Low S/A could indicate significant biodegradation. All this is best viewed as just general background info rather than highly diagnostic.

Saturates and Aromatics both belong to a category called Hydrocarbons. NSO and Aromatics together can be called Non-Hydrocarbons.

As it turns out, crude oil is best for extraction when it is at least approximately two-thirds Hydrocarbons.

In order to assess the quality of our core samples overall, we are going to now look at the two main groupings of Hydrocarbons and Non-Hydrocarbons. In order to find these values, we will fortify out data frame by creating these new variables.

geology_data_selection <- geology_data_selection %>%
  mutate(Total = `% Sat` + `% Aro` + `% NSO` + `% Asph`) %>%
  mutate(Hydrocarbons = `% Sat` + `% Aro`) %>%
  mutate(Non_Hydrocarbons = `% NSO` + `% Asph`)


sara_data <- geology_data_selection %>%
  dplyr::select(OilSampleID, `% Sat`, `% Aro`, `% NSO`, `% Asph`, Total, Value)

sara_data_gathered <- gather(sara_data, 'Crude_Oil_Component', 'Percent', 2:5)

sara_data_gathered_oily <- sara_data_gathered %>%
  filter(Value == "Oily")

sara_data_gathered_not_oily <- sara_data_gathered %>%
  filter(Value == "Not Oily")

hydro_data <- geology_data_selection %>%
  dplyr::select(OilSampleID, Hydrocarbons, Non_Hydrocarbons, Value)

hydro_data_gathered <- gather(hydro_data, 'Crude_Oil_Component', 'Percent', 2:3)

hydro_data_gathered_oily <- hydro_data_gathered %>%
  filter(Value == "Oily")

hydro_data_gathered_not_oily <- hydro_data_gathered %>%
  filter(Value == "Not Oily")

ggplot(hydro_data_gathered_oily, aes(x = Percent, fill = Crude_Oil_Component)) +
  geom_density(alpha=0.2, position="identity") +
  labs(title = "Geochemistry Core Sampling", subtitle = "Overlayed Histograms - Distribution of Crude Oil Components - Oily") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

ggplot(hydro_data_gathered_not_oily, aes(x = Percent, fill = Crude_Oil_Component)) +
  geom_density(alpha=0.2, position="identity") +
  labs(title = "Geochemistry Core Sampling", subtitle = "Overlayed Histograms - Distribution of Crude Oil Components - Not Oily") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

ggplot(sara_data_gathered_oily, aes(x = Percent, fill = Crude_Oil_Component)) +
  geom_density(alpha=0.2, position="identity") +
  labs(title = "Geochemistry Core Sampling", subtitle = "Overlayed Histograms - Distribution of Crude Oil Components - Oily") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

ggplot(sara_data_gathered_not_oily, aes(x = Percent, fill = Crude_Oil_Component)) +
  geom_density(alpha=0.2, position="identity") +
  labs(title = "Geochemistry Core Sampling", subtitle = "Overlayed Histograms - Distribution of Crude Oil Components - Not Oily") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

sara_data_gathered_oily <- sara_data_gathered %>%
  filter(Value == "Oily")

donut_data <-sara_data_gathered_oily %>%
  group_by(Crude_Oil_Component) %>%
  summarize(avg = mean(Percent))

donut_data$fraction = donut_data$avg / sum(donut_data$avg)
donut_data = donut_data[order(donut_data$fraction), ]
donut_data$ymax = cumsum(donut_data$fraction)
donut_data$ymin = c(0, head(donut_data$ymax, n=-1))

ggplot(donut_data, aes(fill = Crude_Oil_Component, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(0, 4)) +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank()) +
annotate("text", x = 0, y = 0, label = "SARA Analysis") +
labs(title="Geochemistry Core Sampling", subtitle = "Relative Frequency of Crude Oil Components - Oily") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Deepwater")

sara_data_gathered_not_oily <- sara_data_gathered %>%
  filter(Value == "Not Oily")

donut_data <-sara_data_gathered_not_oily %>%
  group_by(Crude_Oil_Component) %>%
  summarize(avg = mean(Percent))

donut_data$fraction = donut_data$avg / sum(donut_data$avg)
donut_data = donut_data[order(donut_data$fraction), ]
donut_data$ymax = cumsum(donut_data$fraction)
donut_data$ymin = c(0, head(donut_data$ymax, n=-1))

ggplot(donut_data, aes(fill = Crude_Oil_Component, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(0, 4)) +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank()) +
annotate("text", x = 0, y = 0, label = "SARA Analysis") +
labs(title="Geochemistry Core Sampling", subtitle = "Relative Frequency of Crude Oil Components - Not Oily") +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5)) +
labs(caption = "Deepwater")

Biodegredation

Next we plot biodegredation as it relates to SARA analysis. We see that in the oily samples, biodegredation preferentially removed saturates and NSOs. We also see that the oily core samples had fewer cases of extreme biodegredation (max level 5 vs 8).

biodegredation_df <- geology_data_selection %>%
  select(OilSampleID, `% Sat`, `% Aro`, `% NSO`, `% Asph`, BIOD_, Value)

biodegredation_df <- pivot_longer(biodegredation_df, names_to = "Type", values_to = "Percent", 2:5)

ggplot(biodegredation_df, aes(x = BIOD_, y = Percent, color = Type)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ Value) +
  labs(title = "Geochemistry Core Sampling", subtitle = "Biodegreation") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

Bootstrapping

Using bootstrapping, we generate a 95% confidence for the likelihood of finding Oily wells. We determine the lower bound as .33 and the upper bound as .52. That is to say, if we sample core samples, 95% of our samples will contain the mean number of oily wells to be between 33% and 52%.

n <- 100

Oily_vs_Non_Oily <- as.data.frame(geology_data_selection$Value)

colnames(Oily_vs_Non_Oily) <- "Oily"

samp <- Oily_vs_Non_Oily %>%
  sample_n(size = n)

samp %>%
  specify(response = Oily, success = "Oily") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

Logistic Regression

Because we know the final results of the geochemical analysis for each core sample - oily vs. not oily - we can fit a logistic regression to our dataset. Our hope is to create a good predictive model for finding an oily core sample based on a variety of inputs and features.

We first determine our baseline for guessing: 55% of core samples in this study are not oily. Our most simple, naive model is to suggest that 100% of core samples are not oily, in which case we would be correct 55% of the time.

We create a logistic regression to see if we can do better. We model the binary value (oily vs. not oily) as a function of Hydrocarbons, Extractable Organic Matter, and Total Spectralfloursence.

We run out logistic regression model on a randomly selected sample of 70% of our data which we call our training data. We test our model against the remaining 30% of our data which we call our test data.

We create a confusion matrix in order to sort out the Type I and Type II errors from our model and we count our successes. As it turns out, our new model has a 65% accuracy which represents a 10% improvement over guessing.

list <- unique(geology_data_selection$`USGS Province`)

list

## [1] "Ocean"                       "Tampico-Misantla Basin"     
## [3] "Burgos Basin"                "Campeche-Sigsbee Salt Basin"
## [5] "Veracruz Basin"              "Gulf Cenozoic OCS"

geology_data_selection <- as_tibble(geology_data_selection)  # to play nicely with list-cols

regressions_base <- lm(Numeric_Value ~ Hydrocarbons + EOM + TSF, data = geology_data_selection)
regressions_base <- glance(regressions_base)
regressions_base$p.value <- round(regressions_base$p.value, 4)

regressions_base %>%
  select(adj.r.squared, p.value)

# regressions %>% 
#   select(tidied) %>% 
#   unnest(tidied)

geology_data_selection %>% 
  group_by(`USGS Province`) %>%
  tally(sort = TRUE) %>%
  ungroup() %>%
  arrange()

regressions <- geology_data_selection %>%
  nest(data = c(-`USGS Province`)) %>% 
  mutate(
    fit = map(data, ~ lm(Numeric_Value ~ Hydrocarbons + EOM + TSF, data = .x)),  # S3 list-col
    tidied = map(fit, tidy),
    glanced = map(fit, glance),
    augmented = map(fit, augment)
  )

r <- regressions %>% 
  select(glanced) %>% 
  unnest(glanced) %>%
  mutate(list = list) %>% 
  relocate(list, .before = r.squared)

r$p.value <- round(r$p.value, 4)

r %>% 
  select(list, adj.r.squared, p.value)

# regressions %>% 
#   select(augmented) %>% 
#   unnest(augmented)






train.rows <- sample(nrow(geology_data_selection), nrow(geology_data_selection) * .7)
geology_train <- geology_data_selection[train.rows,]
geology_test <- geology_data_selection[-train.rows,]


oily_train <- prop.table(table(geology_train$Numeric_Value))

oily_train <- as.data.frame(oily_train)


oily_train <- as_tibble(oily_train)


colnames(oily_train) <- c("Group", "Percent")

oily_train$Percent <- round(oily_train$Percent, 3)
oily_train$Group <- c("Not Oily", "Oily")
oily_train

geology_data_selection$TSF <- as.numeric(geology_data_selection$TSF)


lr.out <- glm(Numeric_Value ~ geology_train$`USGS Province` + TSF + EOM + Hydrocarbons + geology_train$`% Sat` + geology_train$`% Aro`, geology_train, family=binomial(link='logit'))

geology_train$prediction <- predict(lr.out, type = 'response', newdata = geology_train)

ggplot(geology_train, aes(x = prediction, color = Value)) + 
  geom_density() +
  labs(title = "Geochemistry Core Sampling", subtitle = "Density Plot of Predictive Modeling") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5)) +
  labs(caption = "Deepwater")

geology_train$prediction_class <- geology_train$prediction > 0.5


tab <- prop.table(table(geology_train$prediction_class, 
             geology_train$Numeric_Value))

rownames(tab) <- c("False Reading", "True Reading")
colnames(tab) <- c("False Value", "True Value")
tab

##                
##                 False Value True Value
##   False Reading   0.4162679  0.1818182
##   True Reading    0.1435407  0.2583732

# (oily_test <- table(geology_test$Numeric_Value) %>% prop.table())
# 
# geology_test$prediction <- predict(lr.out, newdata = geology_test, type = 'response')
# geology_test$prediction_class <- geology_test$prediction > 0.5
# tab_test <- table(geology_test$prediction_class, geology_test$Numeric_Value) %>%
#     prop.table() %>% print()
# 
# data(Orange)
# 
# 
# std <- function(x) sd(x)/sqrt(length(x))
# 
# std(Orange$age)
# 
# se <- std(Orange$age)
# 
# library(plotrix)
# 
# std.error(Orange$age)
# 
# 
# mean(Orange$age)
# 
# mean <- mean(Orange$age)
# confidence_interval_lower <- mean + 1.96 * se
# confidence_interval_upper <- mean - 1.96 * se
# 
# confidence_interval_lower
# confidence_interval_upper
# 
# 
# names <- 
# pokemon_data <- data_frame()