Prepared for the Dean of the Law School

1.0 Introduction

This report provides a detailed, step-by-step guide to analyzing factors that influence law students’ success on the bar exam using logistic regression. The analysis was conducted for a major university’s law school in Texas to help administrators understand which academic and preparatory factors most significantly impact bar passage rates.

Logistic regression is particularly suited for this analysis because our outcome variable (bar passage) is binary (Pass/Fail), and we need to understand how multiple predictor variables influence this outcome. This guide not only presents the final results but also explains each step of the process in detail, making it valuable both for decision-makers and for other data analysts who may need to replicate or build upon this work.

The analysis follows a rigorous statistical process:

  1. Data preparation and cleaning

  2. Exploratory data analysis

  3. Model development and refinement

  4. Comprehensive model diagnostics

  5. Interpretation of results

  6. Actionable recommendations

Each section includes both the R code implementation and detailed explanations of the statistical concepts and decisions involved.

2.0 Data Preparation

2.1 Data Import and Initial Inspection

Before any analysis can begin, we must properly import and examine our dataset. The data come from law students who took the bar exam between 2021 and 2024 and contain both academic records and bar exam results.

bar_data <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv")

Key Considerations:

  • The dataset contains both numerical variables (like LSAT scores and GPAs) and categorical variables (like whether students received accommodations)

  • We need to verify that all variables imported with the correct data types

  • Missing values must be identified and addressed before modeling
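
The checks listed above can be sketched as follows. This is a minimal illustration on a small made-up data frame (the name toy and all of its values are hypothetical, not the bar dataset); the same calls could be applied to bar_data.

```r
# Hypothetical stand-in for bar_data; values are invented for illustration.
toy <- data.frame(
  LSAT = c(155, 160, NA, 148),
  UGPA = c(3.2, 3.8, 3.5, NA),
  Accommodations = c("N", "Y", "N", "N"),
  stringsAsFactors = FALSE
)

# Column types as imported (character columns are not yet factors)
col_types <- sapply(toy, class)
print(col_types)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Rows that would survive a complete-case filter such as drop_na()
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Comparing the number of complete rows with the total row count shows how much data a complete-case filter would discard before it is applied.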

2.2 Data Cleaning and Transformation

Proper data cleaning is crucial for obtaining valid results. This involves:

  1. Converting categorical variables to factors

  2. Handling missing data

bar_data_clean <- bar_data %>%
  mutate(across(c("PassFail", "CivPro", "LPI", "LPII",
                  "Accommodations", "Probation", "LegalAnalysis_TexasPractice",
                  "AdvLegalPerfSkills", "AdvLegalAnalysis", "BarPrepCompany",
                  "OptIntoWritingGuide", "StudentSuccessInitiative", "BarPrepMentor"),
                as.factor)) %>%
  drop_na()

3.0 Exploratory Data Analysis

Exploratory data analysis lets us build models on a clear understanding of the data’s characteristics and relationships.

3.1 Key Predictor Distributions

bar1 <- barplot(table(bar_data_clean$PassFail),
        main = 'Pass/Fail Student Count',
        ylab = 'Number of Students',
        col = c('#BE8A60', '#3E5622'))

h1<- ggplot(bar_data_clean, aes(x = LSAT)) +
  geom_histogram(binwidth = 5, fill = "#BE8A60", color = "#3E5622") +
  ggtitle("LSAT Score Distribution") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))

h2 <- ggplot(bar_data_clean, aes(x = UGPA)) +
  geom_histogram(binwidth = 0.2, fill = "#BE8A60", color = "#3E5622") +
ggtitle("Undergraduate GPA Distribution") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))


h3<- ggplot(bar_data_clean, aes(x = GPA_Final)) +
  geom_histogram(binwidth = 0.2, fill = "#BE8A60", color = "#3E5622") +
ggtitle("Final GPA Distribution") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))

b1<- ggplot(bar_data_clean, aes(x = PassFail,y = LSAT, fill = PassFail)) +
  geom_boxplot() +
  ggtitle("LSAT Score For Passing the Bar") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))

b2<- ggplot(bar_data_clean, aes(x = PassFail,y = UGPA, fill = PassFail)) +
  geom_boxplot() +
 ggtitle("Undergraduate GPA For Passing the Bar") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))

b3<- ggplot(bar_data_clean, aes(x = PassFail,y = GPA_Final, fill = PassFail)) +
  geom_boxplot() + 
  ggtitle("Final GPA For Passing the Bar") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))


 (h1|h2|h3) /(b1 | b2 | b3)

3.2 How Many Passed Versus Failed the Bar Exam

The first bar plot shows how many students passed the bar exam and how many failed, which tells us how large each outcome group is. In this dataset more students passed than failed. The size of the smaller group matters for model building because it limits how many predictors the model can reliably support.
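
The group sizes read off the bar plot can also be computed directly. The sketch below uses a short invented outcome vector (the labels "P" and "F" are assumptions); in the analysis the input would be bar_data_clean$PassFail.

```r
# Hypothetical outcome vector; the real one is bar_data_clean$PassFail.
outcome <- factor(c("P", "P", "F", "P", "P", "F", "P", "P"))

# Counts per outcome and the share of each outcome
counts <- table(outcome)
shares <- prop.table(counts)

print(counts)
print(round(shares, 2))
```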

3.3 Histogram Analysis (Top Row)

LSAT Histogram (Left)

  • Approximately normally distributed

  • Centered around 155

  • Most scores fall between 150 and 160

  • Slightly asymmetric: more scores fall between 145 and 155 than between 155 and 165

Undergraduate GPA Histogram (Middle)

  • Approximately normally distributed

  • Centered around a GPA of 3.3

  • Most GPAs fall between 2.5 and 4.0

Final GPA Histogram (Right)

  • Approximately normally distributed, with a slightly flatter (more uniform) shape

  • Centered around 3.3

  • Most GPAs fall between 2.8 and 3.8

3.4 Box-Plot Analysis (Bottom Row)

LSAT Box-Plot (Left)

  • Median LSAT score is higher for students who passed

  • Passing students typically have an LSAT score between 155 and 160, while failing students typically score below 155

  • The passing group shows more variability, with longer whiskers and several outliers

Undergraduate GPA Box-Plot (Middle)

  • Median undergraduate GPA is approximately the same for passing and failing students, with the passing group slightly higher

  • Substantial overlap between passing and failing students with regard to undergraduate GPA

  • Low outliers among the passing students could lead to interesting conclusions later on

Final GPA Box-Plot (Right)

  • Median final GPA is higher for passing students than for failing students

  • Strong separation: passing GPAs are typically 3.4 and up, while failing GPAs are typically lower

  • The passing group shows more variability, with a longer box and whiskers

4.0 Model Development

4.1 Understanding Logistic Regression

Before implementing the model, it’s important to understand the statistical foundation. Logistic regression models the log-odds of an event (here, passing the bar) as a linear combination of predictors:

\[ \log\left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} \]

Where:

  • p is the probability of passing the bar

  • \(\beta_0\) is the intercept

  • \(\beta_1\) through \(\beta_k\) are the coefficients for predictors \(x_1\) through \(x_k\)

The logistic function constrains predicted probabilities between 0 and 1.
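
The mapping from log-odds to probability can be checked numerically. The sketch below defines an illustrative helper, inv_logit (a name introduced here, not part of the analysis script), and compares it with base R's plogis() and qlogis().

```r
# Inverse logit: maps any log-odds value to a probability in (0, 1).
inv_logit <- function(eta) 1 / (1 + exp(-eta))

log_odds <- c(-5, -1, 0, 1, 5)
probs <- inv_logit(log_odds)
print(round(probs, 3))

# Base R provides the same pair of functions as plogis() and qlogis().
print(all.equal(probs, plogis(log_odds)))
print(all.equal(qlogis(probs), log_odds))
```

A log-odds of 0 corresponds to a probability of 0.5, and large positive or negative log-odds approach 1 or 0 without reaching them.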

4.2 Building the Full Model

We start with a comprehensive model including all potential predictors:

## 
## Call:
## glm(formula = PassFail ~ LSAT + UGPA + CivPro + LPI + LPII + 
##     GPA_Final + FinalRankPercentile + Accommodations + Probation + 
##     BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + 
##     GPA_1L, family = binomial(link = "logit"), data = bar_data_clean)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                 -202.5384    63.9703  -3.166  0.00154 **
## LSAT                           0.5191     0.2033   2.553  0.01067 * 
## UGPA                           2.2903     1.8925   1.210  0.22620   
## CivProB                        2.5105     2.0769   1.209  0.22675   
## CivProB+                       1.0355     2.0354   0.509  0.61093   
## CivProC                       -1.7173     2.0920  -0.821  0.41170   
## CivProC+                       0.0652     1.7858   0.037  0.97088   
## CivProD                      -17.4291  3956.1814  -0.004  0.99648   
## CivProD+                       1.3254     2.8794   0.460  0.64530   
## LPIB                           4.2416     2.1432   1.979  0.04780 * 
## LPIB+                          1.1684     1.7035   0.686  0.49276   
## LPIC                           4.9234     2.6672   1.846  0.06490 . 
## LPIC+                          1.2036     1.7274   0.697  0.48594   
## LPID                          11.6141     4.2847   2.711  0.00672 **
## LPID+                         -3.7464     3.1948  -1.173  0.24094   
## LPIIB                          1.2679     1.7472   0.726  0.46803   
## LPIIB+                         3.5222     1.7878   1.970  0.04882 * 
## LPIIC                          7.3172     3.4554   2.118  0.03421 * 
## LPIIC+                         3.8994     2.1490   1.814  0.06960 . 
## LPIICR                         4.7372     2.1146   2.240  0.02508 * 
## LPIID                          7.1945  3956.1824   0.002  0.99855   
## LPIID+                        28.6420  2260.4946   0.013  0.98989   
## GPA_Final                     28.8233    10.6776   2.699  0.00695 **
## FinalRankPercentile          -21.9419    10.6663  -2.057  0.03967 * 
## AccommodationsY               -0.4341     1.2113  -0.358  0.72006   
## ProbationY                    -1.2369     1.3798  -0.896  0.37004   
## BarPrepCompanyThemis           4.1697     1.4643   2.848  0.00441 **
## BarPrepCompletion             16.7473     5.5161   3.036  0.00240 **
## X.LawSchoolBarPrepWorkshops    0.4062     0.2481   1.637  0.10161   
## GPA_1L                         4.0147     2.5137   1.597  0.11023   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 145.211  on 193  degrees of freedom
## Residual deviance:  55.938  on 164  degrees of freedom
## AIC: 115.94
## 
## Number of Fisher Scoring iterations: 16

Model Assessment:

  • We removed LegalAnalysis_TexasPractice, AdvLegalPerfSkills, and AdvLegalAnalysis because every student has the value Yes for each of them; a factor with only one level has no contrast to estimate

  • The model likely contains redundant predictors

  • We will use stepwise selection to refine the model
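
Single-level factors like the three removed above can be found programmatically before fitting. The sketch below runs on a hypothetical data frame (the column names and values are invented); the same check could be applied to bar_data_clean.

```r
# Hypothetical data frame; column names and values are invented.
toy <- data.frame(
  AdvCourse = factor(c("Yes", "Yes", "Yes", "Yes")),
  Probation = factor(c("N", "Y", "N", "N")),
  LSAT = c(150, 155, 160, 158)
)

# A factor needs at least two observed levels to enter a glm() formula.
is_constant_factor <- sapply(
  toy,
  function(col) is.factor(col) && nlevels(droplevels(col)) < 2
)
constant_factors <- names(toy)[is_constant_factor]
print(constant_factors)
```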

4.3 Model Selection Process

Stepwise selection helps identify the most important predictors while balancing model fit and complexity. It was chosen for this analysis because it offers a systematic way to reduce model complexity when there are many candidate predictors. Although it has limitations, it suits this context because the primary goal is to identify key factors influencing the outcome rather than to maximize predictive accuracy. Three searches were run with stepAIC from the MASS package, each starting from the full model: forward, backward, and bidirectional. Their summaries appear below in that order.

## 
## Call:
## glm(formula = PassFail ~ LSAT + UGPA + CivPro + LPI + LPII + 
##     GPA_Final + FinalRankPercentile + Accommodations + Probation + 
##     BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + 
##     GPA_1L, family = binomial(link = "logit"), data = bar_data_clean)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                 -202.5384    63.9703  -3.166  0.00154 **
## LSAT                           0.5191     0.2033   2.553  0.01067 * 
## UGPA                           2.2903     1.8925   1.210  0.22620   
## CivProB                        2.5105     2.0769   1.209  0.22675   
## CivProB+                       1.0355     2.0354   0.509  0.61093   
## CivProC                       -1.7173     2.0920  -0.821  0.41170   
## CivProC+                       0.0652     1.7858   0.037  0.97088   
## CivProD                      -17.4291  3956.1814  -0.004  0.99648   
## CivProD+                       1.3254     2.8794   0.460  0.64530   
## LPIB                           4.2416     2.1432   1.979  0.04780 * 
## LPIB+                          1.1684     1.7035   0.686  0.49276   
## LPIC                           4.9234     2.6672   1.846  0.06490 . 
## LPIC+                          1.2036     1.7274   0.697  0.48594   
## LPID                          11.6141     4.2847   2.711  0.00672 **
## LPID+                         -3.7464     3.1948  -1.173  0.24094   
## LPIIB                          1.2679     1.7472   0.726  0.46803   
## LPIIB+                         3.5222     1.7878   1.970  0.04882 * 
## LPIIC                          7.3172     3.4554   2.118  0.03421 * 
## LPIIC+                         3.8994     2.1490   1.814  0.06960 . 
## LPIICR                         4.7372     2.1146   2.240  0.02508 * 
## LPIID                          7.1945  3956.1824   0.002  0.99855   
## LPIID+                        28.6420  2260.4946   0.013  0.98989   
## GPA_Final                     28.8233    10.6776   2.699  0.00695 **
## FinalRankPercentile          -21.9419    10.6663  -2.057  0.03967 * 
## AccommodationsY               -0.4341     1.2113  -0.358  0.72006   
## ProbationY                    -1.2369     1.3798  -0.896  0.37004   
## BarPrepCompanyThemis           4.1697     1.4643   2.848  0.00441 **
## BarPrepCompletion             16.7473     5.5161   3.036  0.00240 **
## X.LawSchoolBarPrepWorkshops    0.4062     0.2481   1.637  0.10161   
## GPA_1L                         4.0147     2.5137   1.597  0.11023   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 145.211  on 193  degrees of freedom
## Residual deviance:  55.938  on 164  degrees of freedom
## AIC: 115.94
## 
## Number of Fisher Scoring iterations: 16
## 
## Call:
## glm(formula = PassFail ~ LSAT + LPII + GPA_Final + FinalRankPercentile + 
##     BarPrepCompany + BarPrepCompletion + GPA_1L, family = binomial(link = "logit"), 
##     data = bar_data_clean)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -82.9407    25.0713  -3.308 0.000939 ***
## LSAT                    0.2238     0.1033   2.167 0.030250 *  
## LPIIB                   1.5493     1.0455   1.482 0.138381    
## LPIIB+                  3.1707     1.4440   2.196 0.028115 *  
## LPIIC                   4.0318     1.6707   2.413 0.015812 *  
## LPIIC+                  2.7844     1.2422   2.242 0.024992 *  
## LPIICR                  2.1741     1.1347   1.916 0.055373 .  
## LPIID                  13.5340  2399.5452   0.006 0.995500    
## LPIID+                 20.4433  1507.9657   0.014 0.989184    
## GPA_Final              10.1632     5.1508   1.973 0.048481 *  
## FinalRankPercentile    -8.3049     5.7393  -1.447 0.147890    
## BarPrepCompanyThemis    2.0063     0.7917   2.534 0.011272 *  
## BarPrepCompletion       8.4250     2.8954   2.910 0.003617 ** 
## GPA_1L                  3.7736     1.4212   2.655 0.007927 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 145.211  on 193  degrees of freedom
## Residual deviance:  81.216  on 180  degrees of freedom
## AIC: 109.22
## 
## Number of Fisher Scoring iterations: 15
## 
## Call:
## glm(formula = PassFail ~ LSAT + LPII + GPA_Final + FinalRankPercentile + 
##     BarPrepCompany + BarPrepCompletion + GPA_1L, family = binomial(link = "logit"), 
##     data = bar_data_clean)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -82.9407    25.0713  -3.308 0.000939 ***
## LSAT                    0.2238     0.1033   2.167 0.030250 *  
## LPIIB                   1.5493     1.0455   1.482 0.138381    
## LPIIB+                  3.1707     1.4440   2.196 0.028115 *  
## LPIIC                   4.0318     1.6707   2.413 0.015812 *  
## LPIIC+                  2.7844     1.2422   2.242 0.024992 *  
## LPIICR                  2.1741     1.1347   1.916 0.055373 .  
## LPIID                  13.5340  2399.5452   0.006 0.995500    
## LPIID+                 20.4433  1507.9657   0.014 0.989184    
## GPA_Final              10.1632     5.1508   1.973 0.048481 *  
## FinalRankPercentile    -8.3049     5.7393  -1.447 0.147890    
## BarPrepCompanyThemis    2.0063     0.7917   2.534 0.011272 *  
## BarPrepCompletion       8.4250     2.8954   2.910 0.003617 ** 
## GPA_1L                  3.7736     1.4212   2.655 0.007927 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 145.211  on 193  degrees of freedom
## Residual deviance:  81.216  on 180  degrees of freedom
## AIC: 109.22
## 
## Number of Fisher Scoring iterations: 15
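
A minimal sketch of why the starting model and the search scope matter in stepwise selection. It uses base R's step(), a closely related AIC-based search, on simulated data; the variables x1 to x3, the simulated outcome y, and the object names are all hypothetical.

```r
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Simulated outcome depends on x1 only; x2 and x3 are noise.
sim$y <- rbinom(n, 1, plogis(0.5 + 1.2 * sim$x1))

full <- glm(y ~ x1 + x2 + x3, family = binomial, data = sim)
null <- glm(y ~ 1, family = binomial, data = sim)

# Forward search started at the full model with no scope: nothing is
# left to add, so the full model comes back unchanged.
fwd_from_full <- step(full, direction = "forward", trace = 0)

# Forward search started at the intercept-only model, with the full
# formula supplied as the upper bound of the search.
fwd_from_null <- step(
  null,
  scope = list(lower = formula(null), upper = formula(full)),
  direction = "forward",
  trace = 0
)

print(formula(fwd_from_full))
print(formula(fwd_from_null))
```

A forward search only has room to work when it starts from a smaller model and is given an upper scope; started from the full model, it returns its input.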

4.4 Analysis of the Models for Bar Passage Prediction

4.4.1 General Observations

The first summary (forward selection) is identical to the full model, because a forward search that starts from the full model has no terms left to add. The second and third summaries (backward and bidirectional selection) are identical to each other and give the reduced model, which retains seven predictors. The Akaike Information Criterion (AIC) falls from 115.94 (full model) to 109.22 (reduced model). The reduced model fits somewhat less closely (residual deviance 81.216 versus 55.938) but estimates 16 fewer coefficients, and AIC judges that trade-off favorable. However, some coefficients, such as those for CivProD, LPIID, and LPIID+, have extreme standard errors and negligible statistical significance, which usually indicates too few observations in those categories.

Both models explain a substantial share of the variation in outcomes: residual deviance falls from a null deviance of 145.211 to 55.938 in the full model and 81.216 in the reduced model. The number of Fisher scoring iterations (15 to 16) is high for a logistic regression but below glm's default limit of 25, so both fits converged. The extra iterations are consistent with the sparse grade categories noted above, whose estimates keep growing as the algorithm iterates.
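
The AIC values can be reproduced by hand from the deviances and degrees of freedom printed in the summaries. The sketch below relies on the fact that, for a 0/1 outcome, AIC equals the residual deviance plus twice the number of estimated coefficients; the object names are illustrative.

```r
# Residual deviances and residual degrees of freedom from the summaries
n_obs <- 194                       # null degrees of freedom (193) + 1
dev_full <- 55.938
df_full <- 164
dev_reduced <- 81.216
df_reduced <- 180

# Number of estimated coefficients, intercept included
k_full <- n_obs - df_full
k_reduced <- n_obs - df_reduced

# For ungrouped binary data the saturated log-likelihood is 0,
# so AIC = residual deviance + 2 * k.
aic_full <- dev_full + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced

print(c(k_full = k_full, k_reduced = k_reduced))
print(c(aic_full = aic_full, aic_reduced = aic_reduced))
```

The reduced model gives up about 25 points of deviance but saves 16 coefficients, a penalty reduction of 32, which is why its AIC is lower.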

4.4.2 Key Significant Predictors

Several variables emerged as statistically significant predictors of bar passage. LSAT score has a positive and significant association in both models (p < 0.05), indicating that higher scores correlate with increased odds of passing. Bar preparation is also strongly predictive: completion of the bar prep course is significant at p < 0.01 in both models, and using Themis rather than the reference company is significant in both (p = 0.004 in the full model, p = 0.011 in the reduced model). Academic performance in law school matters as well: final GPA (GPA_Final) is significant in both models (p < 0.05), and first-year GPA (GPA_1L) is significant in the reduced model (p = 0.008) but not in the full model (p = 0.110). Finally, several Legal Profession II (LPII) grade levels have significant positive coefficients in the reduced model (B+, C, and C+ at p < 0.05; CR at p = 0.055). These coefficients compare each grade with the reference grade, which does not appear in the output, while holding the other predictors constant.
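
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to communicate. The sketch below copies three reduced-model estimates from the summary above; the vector name is illustrative.

```r
# Selected reduced-model estimates (log-odds scale), copied from the summary
reduced_coefs <- c(
  LSAT = 0.2238,
  BarPrepCompanyThemis = 2.0063,
  GPA_1L = 3.7736
)

# Exponentiating a coefficient gives the multiplicative change in the odds
# of passing for a one-unit increase in that predictor.
odds_ratios <- exp(reduced_coefs)
print(round(odds_ratios, 2))
```

For LSAT the odds ratio is about 1.25, meaning each additional LSAT point multiplies the odds of passing by roughly 1.25 when the other predictors are held fixed. With the fitted model available, exp(coef(final_model)) gives the full set.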

4.4.3 Non-Significant and Problematic Predictors

In contrast, undergraduate GPA (UGPA) did not reach statistical significance (p = 0.226), suggesting it may not meaningfully predict bar passage after accounting for law school performance. Similarly, accommodations and probation status showed no significant effects (p > 0.3). The final class rank percentile (FinalRankPercentile) was significant in the full model (p = 0.040) but not in the reduced version (p = 0.148), which suggests collinearity with other academic metrics (e.g., GPA).

Several categorical predictors, including CivProD, LPIID, and LPIID+, produced extreme coefficient estimates with negligible significance, likely due to limited data in these categories. These findings suggest that collapsing or omitting such categories may improve model stability.
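
One way to collapse sparse categories is to rename the rare levels to a shared label, which merges them. The sketch below uses an invented grade vector and a hypothetical combined label; it illustrates the technique and is not the recoding used in this analysis.

```r
# Hypothetical grade factor with sparse low-grade categories
grade <- factor(c("A", "B", "B+", "C", "D", "B", "D+", "A", "C+", "B"))
print(table(grade))

# Merge the rare levels into a single category by giving them the same name
grade_collapsed <- grade
rare <- levels(grade_collapsed) %in% c("D", "D+")
levels(grade_collapsed)[rare] <- "D or D+"
print(table(grade_collapsed))
```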

4.4.4 Reviewing the Modified Model

# Converting factor PassFail (1, 2) to binary numeric (0, 1)
model_data <- bar_data_clean[, c(
  "PassFail", "LSAT", "GPA_Final", 
  "FinalRankPercentile", 
  "BarPrepCompletion", "GPA_1L"
)]

# Ensure PassFail is binary: 0 = Fail (original 1), 1 = Pass (original 2)
model_data$PassFail_binary <- ifelse(model_data$PassFail == 2, 1, 0)
model <- glm(
  PassFail_binary ~ LSAT + GPA_Final + FinalRankPercentile + BarPrepCompletion + GPA_1L,
  data = model_data,
  family = binomial(link = "logit")  # Logistic regression
)
## Warning: glm.fit: algorithm did not converge
summary(model)
## 
## Call:
## glm(formula = PassFail_binary ~ LSAT + GPA_Final + FinalRankPercentile + 
##     BarPrepCompletion + GPA_1L, family = binomial(link = "logit"), 
##     data = model_data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -2.657e+01  1.750e+06       0        1
## LSAT                -1.807e-15  7.266e+03       0        1
## GPA_Final            4.306e-14  4.703e+05       0        1
## FinalRankPercentile -1.012e-13  5.083e+05       0        1
## BarPrepCompletion    5.499e-14  2.211e+05       0        1
## GPA_1L               3.516e-14  1.148e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 193  degrees of freedom
## Residual deviance: 1.1255e-09  on 188  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25

Observations

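The modified logistic regression model produced degenerate results: every p-value equals 1, the slope estimates are effectively zero (e.g., LSAT: −1.807 × 10⁻¹⁵), the standard errors are enormous, and the fit stopped at glm's 25-iteration limit with a non-convergence warning. The decisive clue is the null deviance of 0, which can only occur when the response takes the same value for every student. The cause is the recoding step. PassFail is a factor, so the comparison model_data$PassFail == 2 tests its labels rather than its internal integer codes, returns FALSE for every row, and sets PassFail_binary to 0 throughout. The intercept of −26.57 is consistent with this, since it corresponds to a predicted pass probability of essentially zero. This output therefore says nothing about the predictors, and it is not evidence of complete separation. Recoding with as.integer(model_data$PassFail) == 2, or comparing against the label of the passing level, restores a usable response.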
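
The difference between comparing a factor's labels and comparing its integer codes can be seen on a small example. The labels "F" and "P" below are assumptions for illustration; the actual labels in the dataset may differ.

```r
# Hypothetical outcome factor with labels "F" and "P"
pass_fail <- factor(c("P", "F", "P", "P", "F"))

# Comparing the factor to 2 compares its labels, not its integer codes,
# so every element is FALSE and the recoded response is constant.
wrong <- ifelse(pass_fail == 2, 1, 0)

# Comparing the integer codes picks out the second level ("P").
right <- ifelse(as.integer(pass_fail) == 2, 1, 0)

# Naming the level explicitly does not depend on level ordering.
also_right <- as.integer(pass_fail == "P")

print(wrong)
print(right)
print(also_right)
```

A quick table() of the recoded response before fitting would reveal a constant outcome immediately.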

plot(
  model_data$LSAT, 
  as.numeric(model_data$PassFail) - 1,  # Convert factor to 0/1 for plotting
  xlab = "LSAT Score", 
  ylab = "Pass (1) / Fail (0)", 
  pch = 16, 
  col = ifelse(as.integer(model_data$PassFail) == 2, "blue", "red"),  # Compare integer codes, not labels
  yaxt = "n",  # Suppress default y-axis
  ylim = c(-0.1, 1.1),
  main = "Logistic Regression: LSAT vs. Pass Probability"
)


axis(2, at = c(0, 1))

# Add sigmoid curve
lines(LSAT_grid, predicted_probs, lwd = 3, col = "darkgreen")

5.3 Residual Analysis

plot(final_model)
## Warning: not plotting observations with leverage one:
##   91

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Observations

  • In the residuals vs. fitted plot, the residuals fall into two distinct bands, one for students who passed and one for students who failed. This pattern appears in any logistic regression with a binary outcome, so on its own it does not indicate a problem with the model. The warnings are more informative:

    • Observation 91 has a leverage of one, meaning its fitted value is determined entirely by its own outcome, as typically happens when a student is the only one in a category.

    • Together with the very large standard errors for LPIID and LPIID+, this points to quasi-complete separation in the sparse LPII grade categories.

    Coefficient estimates for those categories are therefore unreliable, and the remaining estimates should be interpreted with caution.
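
The symptoms of separation can be reproduced on toy data in which one predictor divides the outcomes perfectly. The data frame and object names below are hypothetical.

```r
# Toy data in which x separates the outcome perfectly (hypothetical values)
toy <- data.frame(x = 1:10, y = rep(c(0, 1), each = 5))

# glm() warns here; the warnings are suppressed only to keep the output short.
fit <- suppressWarnings(glm(y ~ x, family = binomial, data = toy))

# Standard error of the slope: very large under separation
slope_se <- summary(fit)$coefficients["x", "Std. Error"]

# Smallest distance of any fitted probability from 0.5: a value near 0.5
# means every fitted probability sits next to 0 or 1.
closest_to_half <- min(abs(fitted(fit) - 0.5))

print(slope_se)
print(closest_to_half)
```

Very large standard errors combined with fitted probabilities pinned at 0 or 1 are the pattern to look for in the affected categories of a real model.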

6.0 Recommendations

Academic Foundations and Entry Indicators

The LSAT score emerged as a statistically significant predictor of bar passage: students entering with lower LSAT scores are more likely to struggle. This underscores the importance of identifying at-risk students early, ideally before or at the beginning of their first year, and providing them with academic support such as skills workshops, personalized advising, and peer mentoring to close preparation gaps.

Bar Preparation Matters

One of the strongest predictors of bar exam success was participation in and completion of bar preparation programs, especially those offered by Themis. Students who fully completed bar prep programs had a significantly higher chance of passing. The institution could consider strengthening partnerships with such companies, subsidizing prep course access, and incorporating progress monitoring to ensure students remain on track throughout their preparation journey.

Law School Performance as a Barometer

First-year GPA and final GPA both significantly predicted bar passage in the reduced model. These findings suggest that academic performance throughout law school, not just in the final year, shapes bar readiness. The school should monitor GPA trends and provide timely academic interventions, particularly to students with below-average GPAs.

In addition, performance in Legal Profession II (LPII) was significantly associated with outcomes. The direction of the association is counterintuitive, however: grades of B+, C, and C+ carry positive coefficients relative to the reference grade once GPA and the other predictors are held constant, and the estimates for D and D+ are unstable. The LPII results should therefore be confirmed with collapsed grade categories before they guide policy. Reviewing instructional practices in such key courses and offering targeted academic support to students who underperform in them could still make a measurable difference.

Less Impact or Unreliable Predictors

Interestingly, undergraduate GPA, accommodations, and probation status did not significantly affect bar passage rates once law school variables were considered. This suggests that while these factors may warrant attention for other reasons, they should not be treated as strong indicators of bar risk. Final class rank percentile had inconsistent significance, possibly due to its correlation with GPA and other academic variables.

Conclusion

This study highlights that bar success is shaped primarily by standardized test readiness (LSAT), consistent academic performance during law school, and thorough bar exam preparation. By targeting resources toward these areas, law schools can better position their students for success on the bar exam and beyond.

Appendix: Complete R Script

The following code reproduces the entire analysis.


```{r include=FALSE}
library(MASS)       # stepAIC(); loaded first so it does not mask dplyr::select()
library(tidyverse)  # includes ggplot2, dplyr, and tidyr
library(car)
library(patchwork)  # combines ggplot objects with | and /

# Import dataset
bar_data <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv")


# Convert variables to factors and handle missing data

bar_data_clean <- bar_data %>%
  mutate(
    across(c('PassFail','CivPro', 'LPI', 'LPII', 'Accommodations',
             'Probation', 'LegalAnalysis_TexasPractice', 'AdvLegalPerfSkills',
             'AdvLegalAnalysis', 'BarPrepCompany', 'OptIntoWritingGuide',
             'StudentSuccessInitiative', 'BarPrepMentor'), 
           as.factor),
  ) %>%
  drop_na()  # Remove rows with missing values

# Verify structure
glimpse(bar_data_clean)


bar1 <- barplot(table(bar_data_clean$PassFail),
        main = 'Pass/Fail Student Count',
        ylab = 'Number of Students',
        col = c('#BE8A60', '#3E5622'))


h1<- ggplot(bar_data_clean, aes(x = LSAT)) +
  geom_histogram(binwidth = 5, fill = "#BE8A60", color = "#3E5622") +
  ggtitle("LSAT Score Distribution") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))

h2 <- ggplot(bar_data_clean, aes(x = UGPA)) +
  geom_histogram(binwidth = 0.2, fill = "#BE8A60", color = "#3E5622") +
ggtitle("Undergraduate GPA Distribution") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))


h3<- ggplot(bar_data_clean, aes(x = GPA_Final)) +
  geom_histogram(binwidth = 0.2, fill = "#BE8A60", color = "#3E5622") +
ggtitle("Final GPA Distribution") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))

b1<- ggplot(bar_data_clean, aes(x = PassFail,y = LSAT, fill = PassFail)) +
  geom_boxplot() +
  ggtitle("LSAT Score For Passing the Bar") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))

b2<- ggplot(bar_data_clean, aes(x = PassFail,y = UGPA, fill = PassFail)) +
  geom_boxplot() +
 ggtitle("Undergraduate GPA For Passing the Bar") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))

b3<- ggplot(bar_data_clean, aes(x = PassFail,y = GPA_Final, fill = PassFail)) +
  geom_boxplot() + 
  ggtitle("Final GPA For Passing the Bar") + theme(plot.title = element_text(size = 8, margin = margin(b = 10)))


 (h1|h2|h3) /(b1 | b2 | b3)



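# Full model with all usable predictors
full_model <- glm(PassFail ~ LSAT + UGPA + CivPro + LPI + LPII + GPA_Final + FinalRankPercentile + Accommodations + Probation + BarPrepCompany  + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + GPA_1L,
           family = binomial(link = "logit"), data = bar_data_clean)
summary(full_model)

# Forward selection. With no scope argument, stepAIC() treats the starting
# model as the upper bound, so nothing can be added to full_model and
# model1 is identical to it.
model1 <- stepAIC(full_model, direction = "forward", trace = FALSE)
summary(model1)

# Backward elimination from the full model
model2 <- stepAIC(full_model, direction = "backward", trace = FALSE)
summary(model2)

# Bidirectional selection; used as the final model
final_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(final_model)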

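# Keep the outcome and the numeric predictors used in the modified model
model_data <- bar_data_clean[, c(
  "PassFail", "LSAT", "GPA_Final", 
  "FinalRankPercentile", 
  "BarPrepCompletion", "GPA_1L"
)]

# Recode PassFail to 0/1 from the factor's integer codes:
# 0 = Fail (first level), 1 = Pass (second level).
# Writing PassFail == 2 compares the factor's labels with 2 and is FALSE for
# every row; that produced the all-zero response and the null deviance of 0
# reported in Section 4.4.4.
model_data$PassFail_binary <- ifelse(as.integer(model_data$PassFail) == 2, 1, 0)

# Fit the modified model (same call as in Section 4.4.4)
model <- glm(
  PassFail_binary ~ LSAT + GPA_Final + FinalRankPercentile + BarPrepCompletion + GPA_1L,
  data = model_data,
  family = binomial(link = "logit")
)
summary(model)

# LSAT values at which to evaluate the fitted curve.
# Assumption: the original definition of LSAT_grid is not shown, so 100 evenly
# spaced points over the observed range are used here.
LSAT_grid <- seq(min(model_data$LSAT), max(model_data$LSAT), length.out = 100)

# Set other predictors to fixed values
new_data <- data.frame(
  LSAT = LSAT_grid,
  GPA_Final = median(model_data$GPA_Final, na.rm = TRUE),
  FinalRankPercentile = median(model_data$FinalRankPercentile, na.rm = TRUE),
  BarPrepCompletion = median(model_data$BarPrepCompletion, na.rm = TRUE), 
  GPA_1L = median(model_data$GPA_1L, na.rm = TRUE)
)

# Get predicted probabilities
predicted_probs <- predict(model, newdata = new_data, type = "response")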

 
plot(
  model_data$LSAT, 
  as.numeric(model_data$PassFail) - 1,  # Convert factor to 0/1 for plotting
  xlab = "LSAT Score", 
  ylab = "Pass (1) / Fail (0)", 
  pch = 16, 
  col = ifelse(as.integer(model_data$PassFail) == 2, "blue", "red"),  # Compare integer codes, not labels
  yaxt = "n",  # Suppress default y-axis
  ylim = c(-0.1, 1.1),
  main = "Logistic Regression: LSAT vs. Pass Probability"
)


axis(2, at = c(0, 1))

# Add sigmoid curve
lines(LSAT_grid, predicted_probs, lwd = 3, col = "darkgreen")


plot(final_model)



 

```