An Overview of Classifcation

# The linear regression model assumes that the response variable Y is quantitative. 
# But in many situations, the response variable is instead qualitative. For example, in this analysis, 
# the response variable is reactivity, which is qualitative, as it categorizes elements as either reactive (Yes = 1) 
# or non-reactive (No = 0). Often, qualitative variables are referred to as categorical; I will use these terms interchangeably.

# In this analysis, I study approaches for predicting qualitative responses, a process that is known as classification. 
# Predicting a qualitative response for an observation can be referred to as classifying that observation, 
# as it involves assigning the observation to a category, or class. 

# Often, the methods used for classification first predict the probability that the observation belongs to each category 
# of a qualitative variable, which then forms the basis for making the classification. In this sense, these methods behave 
# like regression methods.

# Step 1: Create the dataset
# There are many possible classification techniques, or classifiers, that one might use to predict a qualitative response. 
# For this project, I generated a simulated dataset that mimics properties of elements in the periodic table. 
# This dataset includes atomic weight (X1), ionization energy (X2), and a binary response variable (Reactive), 
# which indicates whether an element is reactive (Yes = 1) or non-reactive (No = 0).
set.seed(123)
periodic_data <- data.frame(
  AtomicWeight = runif(1000, 1, 200),  # Random atomic weights between 1 and 200
  IonizationEnergy = runif(1000, 5, 25),  # Random ionization energies between 5 and 25
  Reactive = sample(c(0, 1), 1000, replace = TRUE, prob = c(0.7, 0.3))  # Reactive: 30% Yes
)

# Step 2: Visualize the data
# Just as in the regression setting, in the classification setting, I have a set of training observations 
# (x1, y1), ..., (xn, yn) that I can use to build a classifier. I want my classifier to perform well not only on the training data 
# but also on test observations that were not used to train the classifier. Before building the classifier, 
# I explore the data visually to understand relationships between the predictors and the response.

# Scatterplot of AtomicWeight vs IonizationEnergy, colored by Reactive status
library(ggplot2)
ggplot(periodic_data, aes(x = AtomicWeight, y = IonizationEnergy, color = as.factor(Reactive))) +
  geom_point(alpha = 0.6) +
  scale_color_manual(values = c("blue", "orange"), labels = c("No", "Yes")) +
  labs(
    title = "Reactivity Based on Atomic Properties",
    x = "Atomic Weight",
    y = "Ionization Energy",
    color = "Reactive"
  ) +
  theme_minimal()

# The scatterplot displays the distribution of atomic weight and ionization energy for elements. 
# Elements that are reactive (orange) cluster in specific regions, suggesting relationships between these predictors and reactivity.

# Boxplot of AtomicWeight split by Reactivity
ggplot(periodic_data, aes(x = as.factor(Reactive), y = AtomicWeight, fill = as.factor(Reactive))) +
  geom_boxplot() +
  scale_fill_manual(values = c("blue", "orange"), labels = c("No", "Yes")) +
  labs(
    title = "Distribution of Atomic Weight by Reactivity",
    x = "Reactive",
    y = "Atomic Weight",
    fill = "Reactive"
  ) +
  theme_minimal()

# Boxplot of IonizationEnergy split by Reactivity
ggplot(periodic_data, aes(x = as.factor(Reactive), y = IonizationEnergy, fill = as.factor(Reactive))) +
  geom_boxplot() +
  scale_fill_manual(values = c("blue", "orange"), labels = c("No", "Yes")) +
  labs(
    title = "Distribution of Ionization Energy by Reactivity",
    x = "Reactive",
    y = "Ionization Energy",
    fill = "Reactive"
  ) +
  theme_minimal()

# Step 3: Build the Logistic Regression Model
# Since the response variable is qualitative, I cannot use simple linear regression. 
# Instead, I use logistic regression to model the probability that an element is reactive (Yes = 1).
logistic_model <- glm(Reactive ~ AtomicWeight + IonizationEnergy, data = periodic_data, family = binomial)

# Step 4: Evaluate the Model
# I examine the model coefficients to understand how the predictors influence the log-odds of reactivity. 
# For example, a positive coefficient for AtomicWeight indicates that heavier elements are more likely to be reactive.
summary(logistic_model)

## 
## Call:
## glm(formula = Reactive ~ AtomicWeight + IonizationEnergy, family = binomial, 
##     data = periodic_data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.104157   0.235202  -4.695 2.67e-06 ***
## AtomicWeight      0.002650   0.001207   2.196   0.0281 *  
## IonizationEnergy  0.001380   0.012038   0.115   0.9087    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1233.4  on 999  degrees of freedom
## Residual deviance: 1228.5  on 997  degrees of freedom
## AIC: 1234.5
## 
## Number of Fisher Scoring iterations: 4

# Step 5: Make Predictions
# Using the fitted model, I calculate the predicted probabilities of reactivity for each element:
# p̂ = exp(β0 + β1 * AtomicWeight + β2 * IonizationEnergy) / (1 + exp(β0 + β1 * AtomicWeight + β2 * IonizationEnergy)).
# Elements are classified as reactive if their predicted probability exceeds 0.5.
periodic_data$PredictedProb <- predict(logistic_model, type = "response")
periodic_data$PredictedClass <- ifelse(periodic_data$PredictedProb > 0.5, 1, 0)

# Step 6: Evaluate Classification Performance
# To assess the model's performance, I create a confusion matrix that compares the actual and predicted classifications.
confusion_matrix <- table(Actual = periodic_data$Reactive, Predicted = periodic_data$PredictedClass)
print("Confusion Matrix:")

## [1] "Confusion Matrix:"

print(confusion_matrix)

##       Predicted
## Actual   0
##      0 693
##      1 307

# Calculate overall accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Model Accuracy:", round(accuracy, 2)))

## [1] "Model Accuracy: 0.69"

# Step 7: Visualize the Predictions
# I create a scatterplot of predicted probabilities by atomic weight, colored by actual reactivity.
ggplot(periodic_data, aes(x = AtomicWeight, y = PredictedProb, color = as.factor(Reactive))) +
  geom_point(alpha = 0.6) +
  scale_color_manual(values = c("blue", "orange"), labels = c("No", "Yes")) +
  labs(
    title = "Predicted Probability of Reactivity",
    x = "Atomic Weight",
    y = "Predicted Probability",
    color = "Actual Reactivity"
  ) +
  theme_minimal()

# Step 8: Next Steps
# This logistic regression model provides a solid starting point for predicting reactivity. 
# However, its performance can be improved by incorporating additional predictors, 
# using feature engineering, or exploring advanced classification techniques such as decision trees or random forests.

An Overview of Classifcation

Avery Holloman

2024-11-08