# The linear regression model assumes that the response variable Y is quantitative.
# But in many situations, the response variable is instead qualitative. For example, in this analysis,
# the response variable is reactivity, which is qualitative, as it categorizes elements as either reactive (Yes = 1)
# or non-reactive (No = 0). Often, qualitative variables are referred to as categorical; I will use these terms interchangeably.
# In this analysis, I study approaches for predicting qualitative responses, a process that is known as classification.
# Predicting a qualitative response for an observation can be referred to as classifying that observation,
# as it involves assigning the observation to a category, or class.
# Often, the methods used for classification first predict the probability that the observation belongs to each category
# of a qualitative variable, which then forms the basis for making the classification. In this sense, these methods behave
# like regression methods.
# Step 1: Create the dataset
# There are many possible classification techniques, or classifiers, that one might use to predict a qualitative response.
# For this project, I generated a simulated dataset that mimics properties of elements in the periodic table.
# This dataset includes atomic weight (X1), ionization energy (X2), and a binary response variable (Reactive),
# which indicates whether an element is reactive (Yes = 1) or non-reactive (No = 0).
set.seed(123)
periodic_data <- data.frame(
AtomicWeight = runif(1000, 1, 200), # Random atomic weights between 1 and 200
IonizationEnergy = runif(1000, 5, 25), # Random ionization energies between 5 and 25
Reactive = sample(c(0, 1), 1000, replace = TRUE, prob = c(0.7, 0.3)) # Reactive: 30% Yes
)
# Step 2: Visualize the data
# Just as in the regression setting, in the classification setting, I have a set of training observations
# (x1, y1), ..., (xn, yn) that I can use to build a classifier. I want my classifier to perform well not only on the training data
# but also on test observations that were not used to train the classifier. Before building the classifier,
# I explore the data visually to understand relationships between the predictors and the response.
# Scatterplot of AtomicWeight vs IonizationEnergy, colored by Reactive status
library(ggplot2)
ggplot(periodic_data, aes(x = AtomicWeight, y = IonizationEnergy, color = as.factor(Reactive))) +
geom_point(alpha = 0.6) +
scale_color_manual(values = c("blue", "orange"), labels = c("No", "Yes")) +
labs(
title = "Reactivity Based on Atomic Properties",
x = "Atomic Weight",
y = "Ionization Energy",
color = "Reactive"
) +
theme_minimal()

# The scatterplot displays the distribution of atomic weight and ionization energy for elements.
# Elements that are reactive (orange) cluster in specific regions, suggesting relationships between these predictors and reactivity.
# Boxplot of AtomicWeight split by Reactivity
ggplot(periodic_data, aes(x = as.factor(Reactive), y = AtomicWeight, fill = as.factor(Reactive))) +
geom_boxplot() +
scale_fill_manual(values = c("blue", "orange"), labels = c("No", "Yes")) +
labs(
title = "Distribution of Atomic Weight by Reactivity",
x = "Reactive",
y = "Atomic Weight",
fill = "Reactive"
) +
theme_minimal()

# Boxplot of IonizationEnergy split by Reactivity
ggplot(periodic_data, aes(x = as.factor(Reactive), y = IonizationEnergy, fill = as.factor(Reactive))) +
geom_boxplot() +
scale_fill_manual(values = c("blue", "orange"), labels = c("No", "Yes")) +
labs(
title = "Distribution of Ionization Energy by Reactivity",
x = "Reactive",
y = "Ionization Energy",
fill = "Reactive"
) +
theme_minimal()

# Step 3: Build the Logistic Regression Model
# Since the response variable is qualitative, I cannot use simple linear regression.
# Instead, I use logistic regression to model the probability that an element is reactive (Yes = 1).
logistic_model <- glm(Reactive ~ AtomicWeight + IonizationEnergy, data = periodic_data, family = binomial)
# Step 4: Evaluate the Model
# I examine the model coefficients to understand how the predictors influence the log-odds of reactivity.
# For example, a positive coefficient for AtomicWeight indicates that heavier elements are more likely to be reactive.
summary(logistic_model)
##
## Call:
## glm(formula = Reactive ~ AtomicWeight + IonizationEnergy, family = binomial,
## data = periodic_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.104157 0.235202 -4.695 2.67e-06 ***
## AtomicWeight 0.002650 0.001207 2.196 0.0281 *
## IonizationEnergy 0.001380 0.012038 0.115 0.9087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1233.4 on 999 degrees of freedom
## Residual deviance: 1228.5 on 997 degrees of freedom
## AIC: 1234.5
##
## Number of Fisher Scoring iterations: 4
# Step 5: Make Predictions
# Using the fitted model, I calculate the predicted probabilities of reactivity for each element:
# p̂ = exp(β0 + β1 * AtomicWeight + β2 * IonizationEnergy) / (1 + exp(β0 + β1 * AtomicWeight + β2 * IonizationEnergy)).
# Elements are classified as reactive if their predicted probability exceeds 0.5.
periodic_data$PredictedProb <- predict(logistic_model, type = "response")
periodic_data$PredictedClass <- ifelse(periodic_data$PredictedProb > 0.5, 1, 0)
# Step 6: Evaluate Classification Performance
# To assess the model's performance, I create a confusion matrix that compares the actual and predicted classifications.
confusion_matrix <- table(Actual = periodic_data$Reactive, Predicted = periodic_data$PredictedClass)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
## Predicted
## Actual 0
## 0 693
## 1 307
# Calculate overall accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Model Accuracy:", round(accuracy, 2)))
## [1] "Model Accuracy: 0.69"
# Step 7: Visualize the Predictions
# I create a scatterplot of predicted probabilities by atomic weight, colored by actual reactivity.
ggplot(periodic_data, aes(x = AtomicWeight, y = PredictedProb, color = as.factor(Reactive))) +
geom_point(alpha = 0.6) +
scale_color_manual(values = c("blue", "orange"), labels = c("No", "Yes")) +
labs(
title = "Predicted Probability of Reactivity",
x = "Atomic Weight",
y = "Predicted Probability",
color = "Actual Reactivity"
) +
theme_minimal()

# Step 8: Next Steps
# This logistic regression model provides a solid starting point for predicting reactivity.
# However, its performance can be improved by incorporating additional predictors,
# using feature engineering, or exploring advanced classification techniques such as decision trees or random forests.