1. Introduction

This report presents a logistic regression analysis predicting bar passage (PassFail) using data from a law school. The analysis includes data cleaning, model building, interpretation of significant predictors, visualizations, model diagnostics, and evidence-based recommendations.

Load packages

#install.packages("caret", dependencies = TRUE, type = "binary")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(pscl)
## Warning: package 'pscl' was built under R version 4.4.3
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(dplyr)

2. Data Cleaning and Transformation

2.1 Data Import

The dataset was imported from a CSV file hosted on GitHub, and its structure was examined with str() to check the variables and their types. The raw data contain 476 observations of 28 variables.

#import the dataset
data <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv")
str(data)
## 'data.frame':    476 obs. of  28 variables:
##  $ Year                       : int  2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
##  $ PassFail                   : chr  "F" "F" "F" "F" ...
##  $ Age                        : num  29.1 29.6 29 36.2 28.9 30.8 29.1 42.9 28.3 27.1 ...
##  $ LSAT                       : int  152 155 157 156 145 154 149 160 152 150 ...
##  $ UGPA                       : num  3.42 2.82 3.46 3.13 3.49 2.85 3.43 3.29 3.62 3.07 ...
##  $ CivPro                     : chr  "B+" "B+" "C" "D+" ...
##  $ LPI                        : chr  "A" "B" "B" "C" ...
##  $ LPII                       : chr  "A" "B" "B" "C+" ...
##  $ GPA_1L                     : num  3.21 2.43 2.62 2.27 2.29 ...
##  $ GPA_Final                  : num  3.29 3.2 2.91 2.77 2.9 2.82 3 3.09 3.21 2.74 ...
##  $ FinalRankPercentile        : num  0.46 0.33 0.08 0.02 0.08 0.05 0.15 0.22 0.34 0.01 ...
##  $ Accommodations             : chr  "N" "Y" "N" "N" ...
##  $ Probation                  : chr  "N" "Y" "N" "Y" ...
##  $ LegalAnalysis_TexasPractice: chr  "Y" "Y" "Y" "Y" ...
##  $ AdvLegalPerfSkills         : chr  "Y" "Y" "Y" "Y" ...
##  $ AdvLegalAnalysis           : chr  "Y" "Y" "Y" "Y" ...
##  $ BarPrepCompany             : chr  "Barbri" "Barbri" "Barbri" "Barbri" ...
##  $ BarPrepCompletion          : num  0.96 0.98 0.48 1 0.77 0.02 0.9 0.76 0.77 0.88 ...
##  $ OptIntoWritingGuide        : chr  "" "" "" "" ...
##  $ X.LawSchoolBarPrepWorkshops: int  3 0 3 0 5 1 5 5 1 5 ...
##  $ StudentSuccessInitiative   : chr  "N" "Cochran" "Smith" "Baldwin" ...
##  $ BarPrepMentor              : chr  "N" "N" "N" "N" ...
##  $ MPRE                       : num  103 76 99 81 99 NA 90 97 100 78 ...
##  $ MPT                        : num  3 3 3 2.5 3.5 3 2.5 2.5 3 2.5 ...
##  $ MEE                        : num  2.67 3.17 2.67 3 2.67 2 3.5 3 2.67 3.83 ...
##  $ WrittenScaledScore         : num  126 133 126 126 130 ...
##  $ MBE                        : num  133 133 118 140 125 ...
##  $ UBE                        : num  259 266 244 266 256 ...

2.2 Variable Selection

The following variables were selected for analysis:

Numeric variables: LSAT, UGPA, GPA_1L, GPA_Final, FinalRankPercentile, BarPrepCompletion, X.LawSchoolBarPrepWorkshops

Categorical variables: Accommodations, Probation, LegalAnalysis_TexasPractice, AdvLegalPerfSkills, AdvLegalAnalysis, BarPrepCompany, StudentSuccessInitiative, BarPrepMentor, CivPro, LPI, LPII

The target variable is PassFail.

# data clean

numeric_vars <- c("LSAT", "UGPA", "GPA_1L", "GPA_Final", "FinalRankPercentile", "BarPrepCompletion", "X.LawSchoolBarPrepWorkshops",
                    "LSAT", "UGPA")

categorical_vars <- c("Accommodations", "Probation", "LegalAnalysis_TexasPractice", 
                      "AdvLegalPerfSkills", "AdvLegalAnalysis", "BarPrepCompany", 
                      "StudentSuccessInitiative", "BarPrepMentor", "CivPro", "LPI", "LPII")
data <- data %>%
  dplyr::select(all_of(c(numeric_vars, categorical_vars, "PassFail")))

2.3 Data Transformation

The PassFail variable was converted to a factor to represent the binary outcome (pass/fail). The StudentSuccessInitiative and BarPrepMentor variables were recoded as binary factors: 0 when the recorded value is "N" or "No" (case-insensitive), and 1 for any other entry, such as a name.

# transform data
data$PassFail <- as.factor(data$PassFail)

data <- data %>%
  mutate(
    StudentSuccessInitiative = ifelse(toupper(StudentSuccessInitiative) %in% c("N", "NO"), 0, 1),
    BarPrepMentor = ifelse(toupper(BarPrepMentor) %in% c("N", "NO"), 0, 1)
  ) %>%
  mutate(across(c(StudentSuccessInitiative, BarPrepMentor), as.factor))

2.4 Missing Value Handling

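Rows containing NA values were removed with na.omit(), and the categorical variables were then converted to factors. Note that na.omit() drops only NA entries. Blank strings in character columns are kept and become a factor level of their own, which is the likely reason Probation later appears with two coefficients (ProbationN and ProbationY) measured against a third baseline level.

data_clean <- na.omit(data)

data_clean[categorical_vars] <- lapply(data_clean[categorical_vars], as.factor)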
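
One way to treat blank entries as missing before calling na.omit() is sketched below. The data frame is made up for illustration (only the column names come from this report), and whether blanks should be dropped or kept as their own category is an assumption that depends on what a blank means in the source records.

```r
# Made-up rows standing in for the bar dataset; only the column names are real.
toy <- data.frame(
  Probation = c("N", "", "Y", "N"),
  LSAT      = c(152, 155, 157, NA),
  stringsAsFactors = FALSE
)

# Recode blank strings in character columns as NA so na.omit() can see them.
toy[] <- lapply(toy, function(col) {
  if (is.character(col)) col[!is.na(col) & col == ""] <- NA
  col
})

toy_clean <- na.omit(toy)                        # drops rows 2 and 4
toy_clean$Probation <- factor(toy_clean$Probation)
levels(toy_clean$Probation)                      # only "N" and "Y" remain
```

Applied to the real data, the same recoding would go before the na.omit() call, at the cost of removing every row that has a blank in any selected column.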

3. Exploratory Data Analysis

Before building the logistic regression model, I performed several exploratory data analysis (EDA) steps to understand the bar exam outcomes and the factors affecting them.

3.1 Distribution of Bar Exam Outcomes

The first visualization is a bar chart of the bar exam outcomes (Pass/Fail). The majority of students passed, with approximately 400 passing and around 50 failing. This indicates a relatively high overall pass rate but also reveals a class imbalance: a model could score high accuracy simply by predicting a pass for every student, so accuracy alone is a weak measure of performance.

ggplot(data_clean, aes(x = PassFail)) + 
  geom_bar(fill = "steelblue") + 
  theme_minimal() + 
  labs(title = "Distribution of Bar Exam Outcomes")

3.2 Pairs Plot

The pairs plot is a matrix of scatterplots for LSAT, UGPA, GPA_1L, GPA_Final, and PassFail (pairs() plots the PassFail factor as its integer codes). It helps identify correlations among the numeric variables and their relationship with the outcome. For example, it suggests that higher LSAT scores are associated with a higher likelihood of passing the bar.

pairs(data_clean[, c("LSAT", "UGPA", "GPA_1L", "GPA_Final", "PassFail")], 
      lower.panel = NULL)

3.3 Pass Rates by CivPro Course Grades

A proportional stacked bar chart shows the share of passes and failures within each CivPro grade. Students with higher CivPro grades (such as A and B) have a higher pass rate, while those with lower grades (such as D and F) have a noticeably lower pass rate. This suggests that performance in CivPro may be a useful predictor of bar exam success.

ggplot(data_clean, aes(x = CivPro, fill = PassFail)) + 
  geom_bar(position = "fill") + 
  theme_minimal() + 
  labs(title = "Pass rates grouped by CivPro course grades")

3.4 LSAT Scores and Pass Rates

The histogram shows, for each LSAT bin, the proportion of students who passed and failed (position = "fill") rather than raw counts. The pass proportion tends to rise with LSAT score, which is consistent with LSAT being a useful predictor of bar passage. The warning about removed rows most likely comes from empty bins, where no proportion can be computed.

ggplot(data_clean, aes(x = LSAT, fill = PassFail)) + 
  geom_histogram(position = "fill", bins = 20) + 
  theme_minimal() + 
  labs(title = "The relationship between LSAT scores and pass rates")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

4. Logistic Regression Model

4.1 Data Partitioning

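The cleaned dataset was split into training (80%) and testing (20%) sets so that model performance could be evaluated on held-out data. createDataPartition() samples within each outcome class, so both sets keep roughly the same pass/fail ratio. The training set contains 360 observations.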

# train data and test data
set.seed(123)
train_index <- createDataPartition(data_clean$PassFail, p = 0.8, list = FALSE)
train_data <- data_clean[train_index, ]
test_data <- data_clean[-train_index, ]
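
The idea behind a stratified split can be sketched in base R as follows. This is an approximation for illustration, not caret's implementation, and the outcome vector is simulated to roughly match the imbalance seen in the bar chart.

```r
set.seed(123)

# Simulated outcome with roughly the imbalance seen in the bar chart (made up).
outcome <- factor(c(rep("P", 400), rep("F", 50)))

# Stratified 80/20 split: sample 80% of the row indices within each class.
idx_by_class <- split(seq_along(outcome), outcome)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

train_y <- outcome[train_idx]
test_y  <- outcome[-train_idx]

# Class proportions should be nearly identical in both sets.
round(prop.table(table(train_y)), 3)
round(prop.table(table(test_y)), 3)
```

Running the same prop.table(table(...)) check on train_data$PassFail and test_data$PassFail would confirm that the real split preserved the class balance.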

4.2 Base Model

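A base logistic regression model was fit using glm() with family = binomial and all candidate predictors (LSAT, UGPA, GPA_1L, GPA_Final, and the others listed in the formula below). With a factor response, glm() treats the first level as failure, so the coefficients describe the log-odds of the remaining level (here passing, assuming the levels are F and P).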

# define base formula
base_formula <- as.formula(
  PassFail ~ LSAT + UGPA + GPA_1L + GPA_Final + FinalRankPercentile + 
  Accommodations + Probation  + BarPrepCompany + 
  BarPrepCompletion + X.LawSchoolBarPrepWorkshops + StudentSuccessInitiative + 
    LegalAnalysis_TexasPractice+AdvLegalPerfSkills+AdvLegalAnalysis+
  BarPrepMentor + CivPro + LPI + LPII  
)

train_data$CivPro <- factor(train_data$CivPro)

levels(train_data$CivPro)
## [1] "A"  "B"  "B+" "C"  "C+" "D"  "D+" "F"
# base model
base_model <- glm(base_formula, data = train_data, family = binomial)

# model summary
summary(base_model)
## 
## Call:
## glm(formula = base_formula, family = binomial, data = train_data)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -104.60414   26.82664  -3.899 9.65e-05 ***
## LSAT                            0.36833    0.10132   3.635 0.000278 ***
## UGPA                            1.99226    1.01718   1.959 0.050158 .  
## GPA_1L                          1.01417    1.87737   0.540 0.589055    
## GPA_Final                      10.42274    5.12917   2.032 0.042148 *  
## FinalRankPercentile            -7.38388    5.77946  -1.278 0.201388    
## AccommodationsY                -0.08459    0.76237  -0.111 0.911652    
## ProbationN                      3.66255    2.43391   1.505 0.132375    
## ProbationY                     -0.31466    0.92319  -0.341 0.733225    
## BarPrepCompanyHelix            20.10688 3956.18073   0.005 0.995945    
## BarPrepCompanyKaplan           -1.68251    1.74302  -0.965 0.334401    
## BarPrepCompanyThemis            1.99871    0.72372   2.762 0.005750 ** 
## BarPrepCompletion              11.32612    2.50556   4.520 6.17e-06 ***
## X.LawSchoolBarPrepWorkshops    -0.14763    0.14938  -0.988 0.323001    
## StudentSuccessInitiative1      -0.75186    0.90683  -0.829 0.407042    
## LegalAnalysis_TexasPracticeY   -2.17007    0.95140  -2.281 0.022553 *  
## AdvLegalPerfSkillsY             0.54675    1.03335   0.529 0.596734    
## AdvLegalAnalysisY               0.06820    0.86919   0.078 0.937456    
## BarPrepMentor1                 -0.45159    0.78578  -0.575 0.565497    
## CivProB                         1.56339    1.51427   1.032 0.301866    
## CivProB+                        2.18549    1.62478   1.345 0.178593    
## CivProC                        -0.44261    1.61990  -0.273 0.784674    
## CivProC+                       -0.49890    1.47655  -0.338 0.735452    
## CivProD                       -17.86884 3956.18076  -0.005 0.996396    
## CivProD+                        0.15341    1.85498   0.083 0.934089    
## CivProF                        11.31209 3956.18100   0.003 0.997719    
## LPIB                            1.25940    1.15632   1.089 0.276088    
## LPIB+                          -0.02529    1.14724  -0.022 0.982411    
## LPIC                            1.33607    1.44735   0.923 0.355947    
## LPIC+                           0.98988    1.17358   0.843 0.398967    
## LPID                            4.06440    2.97710   1.365 0.172184    
## LPID+                          -0.50200    2.23772  -0.224 0.822495    
## LPIIB                           0.22056    1.28326   0.172 0.863535    
## LPIIB+                          0.85044    1.31570   0.646 0.518033    
## LPIIC                           2.96086    1.79341   1.651 0.098746 .  
## LPIIC+                          0.82102    1.38844   0.591 0.554304    
## LPIICR                          0.44386    1.44942   0.306 0.759429    
## LPIID                          20.49104 2071.19633   0.010 0.992106    
## LPIID+                         18.87357 1587.41827   0.012 0.990514    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 255.29  on 359  degrees of freedom
## Residual deviance: 111.69  on 321  degrees of freedom
## AIC: 189.69
## 
## Number of Fisher Scoring iterations: 16
# correlation (unique() drops the repeated LSAT and UGPA entries in numeric_vars,
# which would otherwise appear as duplicate LSAT.1 and UGPA.1 columns)
cor_matrix <- cor(train_data[, unique(numeric_vars)])
print(cor_matrix)
##                                    LSAT        UGPA     GPA_1L   GPA_Final
## LSAT                         1.00000000 -0.19714583  0.2047858  0.09983408
## UGPA                        -0.19714583  1.00000000  0.1982986  0.28036589
## GPA_1L                       0.20478577  0.19829855  1.0000000  0.86321663
## GPA_Final                    0.09983408  0.28036589  0.8632166  1.00000000
## FinalRankPercentile          0.12107579  0.28664445  0.8660633  0.98042850
## BarPrepCompletion           -0.14275201  0.14829724  0.1561402  0.25454658
## X.LawSchoolBarPrepWorkshops -0.14601247  0.05345049 -0.2079524 -0.09012703
##                             FinalRankPercentile BarPrepCompletion
## LSAT                                  0.1210758        -0.1427520
## UGPA                                  0.2866445         0.1482972
## GPA_1L                                0.8660633         0.1561402
## GPA_Final                             0.9804285         0.2545466
## FinalRankPercentile                   1.0000000         0.2421480
## BarPrepCompletion                     0.2421480         1.0000000
## X.LawSchoolBarPrepWorkshops          -0.1068939         0.1197472
##                             X.LawSchoolBarPrepWorkshops
## LSAT                                        -0.14601247
## UGPA                                         0.05345049
## GPA_1L                                      -0.20795243
## GPA_Final                                   -0.09012703
## FinalRankPercentile                         -0.10689386
## BarPrepCompletion                            0.11974716
## X.LawSchoolBarPrepWorkshops                  1.00000000
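
In the base model summary above, a few coefficients (BarPrepCompanyHelix, CivProD, CivProF, LPIID, LPIID+) have standard errors in the thousands. That pattern often points to sparse factor levels in which every student passed or every student failed (quasi-complete separation), although the report's data would need to be checked to confirm it. A minimal sketch of such a check, using made-up rows and a hypothetical helper vector named separated_levels:

```r
# Made-up rows; only the column names and level labels come from the report.
toy <- data.frame(
  BarPrepCompany = c("Barbri", "Barbri", "Barbri", "Themis", "Themis", "Helix", "Helix"),
  PassFail       = c("P", "F", "P", "P", "F", "P", "P")
)

# Cross-tabulate the predictor against the outcome.
tab <- table(toy$BarPrepCompany, toy$PassFail)
print(tab)

# Levels with an empty pass or fail cell tend to produce huge estimates
# and standard errors in a logistic regression.
separated_levels <- rownames(tab)[apply(tab == 0, 1, any)]
separated_levels
```

On the real data, table(train_data$BarPrepCompany, train_data$PassFail) and the equivalent tables for CivPro and LPII would show whether the affected levels are simply too small to estimate.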

4.3 Model Selection

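Stepwise selection in both directions (stepAIC() from the MASS package) was used to refine the model by minimizing AIC. The trace below shows AIC falling from 189.69 for the base model to 172.00 for the selected model.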

# step model
step_model <- stepAIC(base_model, direction = "both")
## Start:  AIC=189.69
## PassFail ~ LSAT + UGPA + GPA_1L + GPA_Final + FinalRankPercentile + 
##     Accommodations + Probation + BarPrepCompany + BarPrepCompletion + 
##     X.LawSchoolBarPrepWorkshops + StudentSuccessInitiative + 
##     LegalAnalysis_TexasPractice + AdvLegalPerfSkills + AdvLegalAnalysis + 
##     BarPrepMentor + CivPro + LPI + LPII
## 
##                               Df Deviance    AIC
## - LPI                          6   116.61 182.61
## - AdvLegalAnalysis             1   111.69 187.69
## - Accommodations               1   111.70 187.70
## - AdvLegalPerfSkills           1   111.97 187.97
## - GPA_1L                       1   111.98 187.99
## - BarPrepMentor                1   112.02 188.02
## - LPII                         7   124.13 188.13
## - StudentSuccessInitiative     1   112.39 188.39
## - CivPro                       7   124.60 188.60
## - X.LawSchoolBarPrepWorkshops  1   112.67 188.67
## - Probation                    2   114.77 188.77
## - FinalRankPercentile          1   113.36 189.36
## <none>                             111.69 189.69
## - UGPA                         1   115.61 191.61
## - GPA_Final                    1   116.17 192.17
## - LegalAnalysis_TexasPractice  1   117.36 193.36
## - BarPrepCompany               3   123.48 195.49
## - LSAT                         1   128.08 204.08
## - BarPrepCompletion            1   147.25 223.25
## 
## Step:  AIC=182.61
## PassFail ~ LSAT + UGPA + GPA_1L + GPA_Final + FinalRankPercentile + 
##     Accommodations + Probation + BarPrepCompany + BarPrepCompletion + 
##     X.LawSchoolBarPrepWorkshops + StudentSuccessInitiative + 
##     LegalAnalysis_TexasPractice + AdvLegalPerfSkills + AdvLegalAnalysis + 
##     BarPrepMentor + CivPro + LPII
## 
##                               Df Deviance    AIC
## - GPA_1L                       1   116.61 180.62
## - Accommodations               1   116.62 180.62
## - AdvLegalAnalysis             1   116.64 180.64
## - LPII                         7   128.95 180.95
## - BarPrepMentor                1   117.20 181.20
## - AdvLegalPerfSkills           1   117.37 181.37
## - FinalRankPercentile          1   117.53 181.53
## - StudentSuccessInitiative     1   118.05 182.05
## - X.LawSchoolBarPrepWorkshops  1   118.16 182.16
## <none>                             116.61 182.61
## - CivPro                       7   130.66 182.66
## - Probation                    2   121.00 183.00
## - UGPA                         1   119.53 183.53
## - GPA_Final                    1   120.06 184.06
## - LegalAnalysis_TexasPractice  1   123.66 187.66
## - BarPrepCompany               3   129.44 189.44
## + LPI                          6   111.69 189.69
## - LSAT                         1   133.50 197.50
## - BarPrepCompletion            1   151.85 215.85
## 
## Step:  AIC=180.61
## PassFail ~ LSAT + UGPA + GPA_Final + FinalRankPercentile + Accommodations + 
##     Probation + BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + 
##     StudentSuccessInitiative + LegalAnalysis_TexasPractice + 
##     AdvLegalPerfSkills + AdvLegalAnalysis + BarPrepMentor + CivPro + 
##     LPII
## 
##                               Df Deviance    AIC
## - Accommodations               1   116.62 178.62
## - AdvLegalAnalysis             1   116.64 178.64
## - BarPrepMentor                1   117.21 179.21
## - AdvLegalPerfSkills           1   117.38 179.38
## - FinalRankPercentile          1   117.53 179.53
## - StudentSuccessInitiative     1   118.05 180.05
## - LPII                         7   130.19 180.19
## - X.LawSchoolBarPrepWorkshops  1   118.23 180.23
## <none>                             116.61 180.62
## - Probation                    2   121.00 181.00
## - UGPA                         1   119.54 181.54
## - CivPro                       7   131.92 181.92
## - GPA_Final                    1   120.17 182.17
## + GPA_1L                       1   116.61 182.61
## - LegalAnalysis_TexasPractice  1   123.98 185.98
## - BarPrepCompany               3   129.51 187.51
## + LPI                          6   111.98 187.99
## - LSAT                         1   134.27 196.27
## - BarPrepCompletion            1   151.85 213.85
## 
## Step:  AIC=178.62
## PassFail ~ LSAT + UGPA + GPA_Final + FinalRankPercentile + Probation + 
##     BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + 
##     StudentSuccessInitiative + LegalAnalysis_TexasPractice + 
##     AdvLegalPerfSkills + AdvLegalAnalysis + BarPrepMentor + CivPro + 
##     LPII
## 
##                               Df Deviance    AIC
## - AdvLegalAnalysis             1   116.64 176.64
## - BarPrepMentor                1   117.22 177.22
## - AdvLegalPerfSkills           1   117.40 177.40
## - FinalRankPercentile          1   117.55 177.55
## - StudentSuccessInitiative     1   118.07 178.07
## - X.LawSchoolBarPrepWorkshops  1   118.26 178.26
## - LPII                         7   130.38 178.38
## <none>                             116.62 178.62
## - Probation                    2   121.02 179.02
## - UGPA                         1   119.54 179.54
## - CivPro                       7   131.99 179.99
## - GPA_Final                    1   120.21 180.21
## + Accommodations               1   116.61 180.62
## + GPA_1L                       1   116.62 180.62
## - LegalAnalysis_TexasPractice  1   124.18 184.18
## - BarPrepCompany               3   129.53 185.53
## + LPI                          6   112.02 186.02
## - LSAT                         1   134.27 194.27
## - BarPrepCompletion            1   151.90 211.90
## 
## Step:  AIC=176.64
## PassFail ~ LSAT + UGPA + GPA_Final + FinalRankPercentile + Probation + 
##     BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + 
##     StudentSuccessInitiative + LegalAnalysis_TexasPractice + 
##     AdvLegalPerfSkills + BarPrepMentor + CivPro + LPII
## 
##                               Df Deviance    AIC
## - BarPrepMentor                1   117.26 175.26
## - FinalRankPercentile          1   117.58 175.58
## - AdvLegalPerfSkills           1   117.85 175.85
## - StudentSuccessInitiative     1   118.15 176.15
## - X.LawSchoolBarPrepWorkshops  1   118.27 176.27
## - LPII                         7   130.38 176.38
## <none>                             116.64 176.64
## - UGPA                         1   119.55 177.55
## - Probation                    2   121.57 177.57
## - GPA_Final                    1   120.22 178.22
## - CivPro                       7   132.58 178.58
## + AdvLegalAnalysis             1   116.62 178.62
## + Accommodations               1   116.64 178.64
## + GPA_1L                       1   116.64 178.64
## - LegalAnalysis_TexasPractice  1   124.18 182.18
## + LPI                          6   112.05 184.05
## - BarPrepCompany               3   131.25 185.25
## - LSAT                         1   134.43 192.43
## - BarPrepCompletion            1   152.40 210.40
## 
## Step:  AIC=175.26
## PassFail ~ LSAT + UGPA + GPA_Final + FinalRankPercentile + Probation + 
##     BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + 
##     StudentSuccessInitiative + LegalAnalysis_TexasPractice + 
##     AdvLegalPerfSkills + CivPro + LPII
## 
##                               Df Deviance    AIC
## - FinalRankPercentile          1   118.15 174.15
## - StudentSuccessInitiative     1   118.62 174.62
## - AdvLegalPerfSkills           1   118.73 174.73
## - LPII                         7   130.93 174.93
## <none>                             117.26 175.26
## - X.LawSchoolBarPrepWorkshops  1   119.56 175.56
## - UGPA                         1   119.88 175.88
## - Probation                    2   122.08 176.08
## - GPA_Final                    1   120.64 176.64
## + BarPrepMentor                1   116.64 176.64
## - CivPro                       7   132.89 176.89
## + AdvLegalAnalysis             1   117.22 177.22
## + GPA_1L                       1   117.24 177.24
## + Accommodations               1   117.26 177.26
## - LegalAnalysis_TexasPractice  1   125.53 181.53
## + LPI                          6   112.59 182.59
## - BarPrepCompany               3   132.23 184.23
## - LSAT                         1   134.67 190.67
## - BarPrepCompletion            1   152.75 208.75
## 
## Step:  AIC=174.15
## PassFail ~ LSAT + UGPA + GPA_Final + Probation + BarPrepCompany + 
##     BarPrepCompletion + X.LawSchoolBarPrepWorkshops + StudentSuccessInitiative + 
##     LegalAnalysis_TexasPractice + AdvLegalPerfSkills + CivPro + 
##     LPII
## 
##                               Df Deviance    AIC
## - StudentSuccessInitiative     1   118.98 172.98
## - LPII                         7   131.63 173.63
## <none>                             118.15 174.15
## - X.LawSchoolBarPrepWorkshops  1   120.21 174.21
## - UGPA                         1   120.29 174.29
## - AdvLegalPerfSkills           1   120.44 174.44
## - Probation                    2   122.71 174.71
## + FinalRankPercentile          1   117.26 175.26
## + BarPrepMentor                1   117.58 175.58
## - CivPro                       7   133.68 175.68
## + AdvLegalAnalysis             1   118.10 176.10
## + Accommodations               1   118.14 176.14
## + GPA_1L                       1   118.14 176.14
## - GPA_Final                    1   124.10 178.10
## - LegalAnalysis_TexasPractice  1   126.44 180.44
## + LPI                          6   114.11 182.11
## - BarPrepCompany               3   133.79 183.79
## - LSAT                         1   134.69 188.69
## - BarPrepCompletion            1   153.06 207.06
## 
## Step:  AIC=172.98
## PassFail ~ LSAT + UGPA + GPA_Final + Probation + BarPrepCompany + 
##     BarPrepCompletion + X.LawSchoolBarPrepWorkshops + LegalAnalysis_TexasPractice + 
##     AdvLegalPerfSkills + CivPro + LPII
## 
##                               Df Deviance    AIC
## - LPII                         7   132.00 172.00
## - AdvLegalPerfSkills           1   120.91 172.91
## - UGPA                         1   120.94 172.94
## <none>                             118.98 172.98
## - Probation                    2   123.14 173.14
## - X.LawSchoolBarPrepWorkshops  1   121.52 173.51
## + StudentSuccessInitiative     1   118.15 174.15
## + BarPrepMentor                1   118.53 174.53
## + FinalRankPercentile          1   118.62 174.62
## + AdvLegalAnalysis             1   118.88 174.88
## - CivPro                       7   134.88 174.88
## + GPA_1L                       1   118.96 174.96
## + Accommodations               1   118.97 174.97
## - LegalAnalysis_TexasPractice  1   127.56 179.56
## + LPI                          6   114.45 180.45
## - BarPrepCompany               3   133.96 181.96
## - LSAT                         1   135.48 187.48
## - GPA_Final                    1   135.92 187.92
## - BarPrepCompletion            1   154.40 206.40
## 
## Step:  AIC=172
## PassFail ~ LSAT + UGPA + GPA_Final + Probation + BarPrepCompany + 
##     BarPrepCompletion + X.LawSchoolBarPrepWorkshops + LegalAnalysis_TexasPractice + 
##     AdvLegalPerfSkills + CivPro
## 
##                               Df Deviance    AIC
## <none>                             132.00 172.00
## - UGPA                         1   134.59 172.59
## - Probation                    2   136.90 172.90
## + LPII                         7   118.98 172.98
## + GPA_1L                       1   130.99 172.99
## - AdvLegalPerfSkills           1   135.09 173.09
## - X.LawSchoolBarPrepWorkshops  1   135.19 173.19
## + BarPrepMentor                1   131.48 173.48
## + StudentSuccessInitiative     1   131.63 173.63
## + FinalRankPercentile          1   131.66 173.66
## + Accommodations               1   131.80 173.80
## + AdvLegalAnalysis             1   132.00 174.00
## - CivPro                       7   149.26 175.26
## + LPI                          6   125.67 177.68
## - BarPrepCompany               3   144.03 178.03
## - LegalAnalysis_TexasPractice  1   143.28 181.28
## - GPA_Final                    1   147.62 185.62
## - LSAT                         1   150.31 188.31
## - BarPrepCompletion            1   164.88 202.88
# print result
print(step_model)
## 
## Call:  glm(formula = PassFail ~ LSAT + UGPA + GPA_Final + Probation + 
##     BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops + 
##     LegalAnalysis_TexasPractice + AdvLegalPerfSkills + CivPro, 
##     family = binomial, data = train_data)
## 
## Coefficients:
##                  (Intercept)                          LSAT  
##                    -69.76127                       0.30106  
##                         UGPA                     GPA_Final  
##                      1.20996                       4.37048  
##                  ProbationN                     ProbationY  
##                      3.56786                       0.46041  
##          BarPrepCompanyHelix          BarPrepCompanyKaplan  
##                     17.91199                      -2.10463  
##         BarPrepCompanyThemis             BarPrepCompletion  
##                      1.42878                       9.12065  
##  X.LawSchoolBarPrepWorkshops  LegalAnalysis_TexasPracticeY  
##                     -0.22325                      -2.38996  
##          AdvLegalPerfSkillsY                       CivProB  
##                      1.14968                       1.72565  
##                     CivProB+                       CivProC  
##                      2.19544                      -0.47593  
##                     CivProC+                       CivProD  
##                      0.13886                     -16.93465  
##                     CivProD+                       CivProF  
##                     -0.09315                      10.44633  
## 
## Degrees of Freedom: 359 Total (i.e. Null);  340 Residual
## Null Deviance:       255.3 
## Residual Deviance: 132   AIC: 172
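
The AIC values in the trace can be checked by hand. For a logistic regression on individual 0/1 outcomes, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below uses figures copied from the output in this report; aic_from_deviance is a hypothetical helper name.

```r
# AIC for a 0/1 logistic regression: residual deviance + 2 * number of coefficients.
aic_from_deviance <- function(deviance, n_coef) deviance + 2 * n_coef

# 360 training rows minus the residual degrees of freedom gives the coefficient count.
base_aic  <- aic_from_deviance(111.69, 360 - 321)  # 39 coefficients
final_aic <- aic_from_deviance(132.00, 360 - 340)  # 20 coefficients

c(base = base_aic, final = final_aic)
```

The selected model has a higher deviance than the base model but 19 fewer coefficients, and the penalty saved (2 x 19 = 38) outweighs the deviance lost, which is why its AIC is lower.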

4.4 Final Model

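The final model was refit using the predictors retained by stepwise selection. Because selection is based on AIC rather than p-values, some retained terms (for example UGPA and CivPro) are not individually significant at the 0.05 level.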

# final model
final_formula <- step_model$call$formula
final_model <- glm(final_formula, data = train_data, family = binomial)

summary(final_model)
## 
## Call:
## glm(formula = final_formula, family = binomial, data = train_data)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -69.76127   15.18731  -4.593 4.36e-06 ***
## LSAT                            0.30106    0.07666   3.927 8.60e-05 ***
## UGPA                            1.20996    0.75358   1.606  0.10836    
## GPA_Final                       4.37048    1.19666   3.652  0.00026 ***
## ProbationN                      3.56786    1.83827   1.941  0.05227 .  
## ProbationY                      0.46041    0.64796   0.711  0.47736    
## BarPrepCompanyHelix            17.91199 2399.54498   0.007  0.99404    
## BarPrepCompanyKaplan           -2.10463    1.29610  -1.624  0.10441    
## BarPrepCompanyThemis            1.42878    0.55756   2.563  0.01039 *  
## BarPrepCompletion               9.12065    1.91429   4.765 1.89e-06 ***
## X.LawSchoolBarPrepWorkshops    -0.22325    0.12557  -1.778  0.07542 .  
## LegalAnalysis_TexasPracticeY   -2.38996    0.75701  -3.157  0.00159 ** 
## AdvLegalPerfSkillsY             1.14968    0.66044   1.741  0.08172 .  
## CivProB                         1.72565    1.31153   1.316  0.18826    
## CivProB+                        2.19544    1.46578   1.498  0.13419    
## CivProC                        -0.47593    1.28004  -0.372  0.71003    
## CivProC+                        0.13886    1.24161   0.112  0.91095    
## CivProD                       -16.93465 2399.54515  -0.007  0.99437    
## CivProD+                       -0.09315    1.46006  -0.064  0.94913    
## CivProF                        10.44633 2399.54521   0.004  0.99653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 255.29  on 359  degrees of freedom
## Residual deviance: 132.00  on 340  degrees of freedom
## AIC: 172
## 
## Number of Fisher Scoring iterations: 15
anova(base_model, final_model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: PassFail ~ LSAT + UGPA + GPA_1L + GPA_Final + FinalRankPercentile + 
##     Accommodations + Probation + BarPrepCompany + BarPrepCompletion + 
##     X.LawSchoolBarPrepWorkshops + StudentSuccessInitiative + 
##     LegalAnalysis_TexasPractice + AdvLegalPerfSkills + AdvLegalAnalysis + 
##     BarPrepMentor + CivPro + LPI + LPII
## Model 2: PassFail ~ LSAT + UGPA + GPA_Final + Probation + BarPrepCompany + 
##     BarPrepCompletion + X.LawSchoolBarPrepWorkshops + LegalAnalysis_TexasPractice + 
##     AdvLegalPerfSkills + CivPro
##   Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
## 1       321     111.69                      
## 2       340     132.00 -19  -20.311   0.3761

Final Model Summary

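The final logistic regression model was fit using the formula derived from stepwise selection. Of the retained terms, the following are significant at the 0.05 level:

LSAT: Higher LSAT scores are associated with an increased likelihood of passing the bar (p-value = 8.60e-05).

GPA_Final: A higher final GPA is associated with a significantly higher likelihood of passing (p-value = 0.00026).

BarPrepCompletion: Higher bar preparation completion is associated with a higher likelihood of passing (p-value = 1.89e-06).

LegalAnalysis_TexasPracticeY: Taking LegalAnalysis_TexasPractice is associated with lower log-odds of passing (estimate = -2.39, p-value = 0.00159).

BarPrepCompanyThemis: Using Themis as a bar preparation company is positively associated with passing (p-value = 0.01039).

The remaining terms (UGPA, Probation, X.LawSchoolBarPrepWorkshops, AdvLegalPerfSkills, and the CivPro grade levels) are not significant at the 0.05 level. The very large standard errors (about 2,400) for BarPrepCompanyHelix, CivProD, and CivProF most likely reflect levels with very few observations or quasi-complete separation, so those estimates should not be interpreted.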
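
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the significant estimates, with values copied from the summary output rather than extracted from the fitted object. The 0.1-unit step for BarPrepCompletion assumes that variable is a proportion between 0 and 1, which the report does not state.

```r
# Estimates copied from summary(final_model) above (log-odds scale)
log_odds <- c(
  LSAT = 0.30106,
  GPA_Final = 4.37048,
  BarPrepCompanyThemis = 1.42878,
  BarPrepCompletion = 9.12065,
  LegalAnalysis_TexasPracticeY = -2.38996
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))

# Assumption: BarPrepCompletion is a 0-1 proportion, so a 0.1 step
# (10 percentage points) is a more realistic change than a full unit
or_completion_10pct <- exp(0.1 * log_odds[["BarPrepCompletion"]])
print(round(or_completion_10pct, 3))
```

With the fitted object in the session, exp(coef(final_model)) returns the same ratios for every term in the model.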

Model Performance

Null Deviance: 255.29 on 359 degrees of freedom

Residual Deviance: 132.00 on 340 degrees of freedom

AIC: 172

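The residual deviance is 123.29 lower than the null deviance, on 19 degrees of freedom. This is a highly significant reduction, indicating that the predictors contribute meaningfully to explaining bar passage outcomes. The analysis-of-deviance comparison with the base model (deviance difference 20.311 on 19 degrees of freedom, p = 0.3761) shows that dropping the excluded terms did not significantly worsen the fit. AIC balances fit against model complexity, with lower values indicating a better trade-off; it is meaningful only when comparing models fit to the same data.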
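
The deviance figures reported above can be turned into likelihood-ratio tests by hand. This is a minimal sketch using numbers copied from the output. McFadden's pseudo R-squared is an added summary that the original analysis does not report.

```r
# Deviances and degrees of freedom copied from the output above
null_dev <- 255.29
null_df <- 359
resid_dev <- 132.00
resid_df <- 340

# Likelihood-ratio test of the final model against the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df <- null_df - resid_df
p_vs_null <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Same kind of test for the final model against the larger base model,
# using the deviance difference from the anova() table
p_vs_base <- pchisq(20.311, df = 19, lower.tail = FALSE)

# McFadden's pseudo R-squared (assumption: not part of the original report)
mcfadden_r2 <- 1 - resid_dev / null_dev

print(c(lr_stat = lr_stat, lr_df = lr_df))
print(c(p_vs_null = p_vs_null, p_vs_base = p_vs_base))
print(round(mcfadden_r2, 3))
```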

5. Model diagnostics

5.1 Residuals vs Fitted plot

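# Model diagnostics
# Residuals vs Fitted plot
# Build an explicit data frame instead of passing the glm object to ggplot()
diag_data <- data.frame(
  fitted = fitted(final_model),
  resid = residuals(final_model)  # deviance residuals by default for a glm
)
residuals_vs_fitted <- ggplot(diag_data, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals")
print(residuals_vs_fitted)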

The Residuals vs Fitted Values plot for the logistic regression model shows a curved pattern: residuals are positive at lower fitted values and become more negative as fitted values increase. For a binary outcome this shape is largely expected. Each observation's residual is determined by its fitted probability and by whether the student passed or failed, so the points fall along two curved bands, one per outcome. The plot on its own is therefore weak evidence of misspecification. A binned residual plot, or a check of the continuous predictors for non-linearity on the logit scale, would be a more informative diagnostic.
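
One common alternative for binary outcomes is a binned residual plot, which averages residuals within groups of similar fitted probability. The sketch below runs on simulated data, not the law school data, so that it is self-contained. The variable names, the ten-bin choice, and the approximate ±2 standard error bands are illustrative assumptions.

```r
# Simulated data (assumption: stands in for the real training set)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

p_hat <- fitted(fit)
raw_resid <- y - p_hat  # response residuals: observed minus fitted probability

# Ten bins with roughly equal counts, ordered by fitted probability
breaks <- quantile(p_hat, probs = seq(0, 1, by = 0.1))
bins <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

binned <- data.frame(
  mean_fitted = as.numeric(tapply(p_hat, bins, mean)),
  mean_resid = as.numeric(tapply(raw_resid, bins, mean)),
  n = as.numeric(table(bins))
)
# Approximate standard error of the mean residual in each bin
binned$se <- sqrt(binned$mean_fitted * (1 - binned$mean_fitted) / binned$n)

y_range <- range(c(binned$mean_resid, 2 * binned$se, -2 * binned$se))
plot(binned$mean_fitted, binned$mean_resid, ylim = y_range,
     xlab = "Mean fitted probability", ylab = "Mean residual",
     main = "Binned residuals")
abline(h = 0, lty = 2)
lines(binned$mean_fitted, 2 * binned$se, lty = 3)
lines(binned$mean_fitted, -2 * binned$se, lty = 3)
```

If the model is adequate, most binned means should fall inside the dotted bands with no systematic trend across bins.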

5.2 Confusion Matrix

To evaluate the performance of the final model, I computed a confusion matrix on the test dataset.

test_data$PassFail <- factor(test_data$PassFail, levels = c("F", "P"))

test_data$CivPro <- as.character(test_data$CivPro)
test_data$CivPro[test_data$CivPro == ""] <- NA
test_data$CivPro <- factor(test_data$CivPro)

test_data <- na.omit(test_data)

predictions <- predict(final_model, newdata = test_data, type = "response")
predictions <- ifelse(predictions > 0.5, "P", "F")
predictions <- factor(predictions, levels = c("F", "P"))

conf_matrix <- confusionMatrix(predictions, test_data$PassFail)
conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  F  P
##          F  1  7
##          P  9 71
##                                           
##                Accuracy : 0.8182          
##                  95% CI : (0.7216, 0.8924)
##     No Information Rate : 0.8864          
##     P-Value [Acc > NIR] : 0.9803          
##                                           
##                   Kappa : 0.0112          
##                                           
##  Mcnemar's Test P-Value : 0.8026          
##                                           
##             Sensitivity : 0.10000         
##             Specificity : 0.91026         
##          Pos Pred Value : 0.12500         
##          Neg Pred Value : 0.88750         
##              Prevalence : 0.11364         
##          Detection Rate : 0.01136         
##    Detection Prevalence : 0.09091         
##       Balanced Accuracy : 0.50513         
##                                           
##        'Positive' Class : F               
## 

The model achieved an overall accuracy of 81.82% on the test set, but this is below the no-information rate of 88.64% (the accuracy obtained by predicting “Pass” for every student), so accuracy alone overstates its performance. With “Fail” as the positive class, sensitivity was only 10%: the model identified 1 of the 10 students who failed. Specificity was 91.03%, meaning 71 of the 78 students who passed were classified correctly. The balanced accuracy of 50.51% and the Kappa statistic of 0.0112 both indicate performance close to chance. One likely reason is class imbalance: only about 11% of the test cases are “Fail”, so the 0.5 cutoff rarely predicts that class. These results leave significant room for improvement, particularly in detecting “Fail” cases.
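
The reported statistics can be reproduced by hand from the four cell counts, which makes clear how each metric is defined when “F” is the positive class. This is a minimal sketch with counts copied from the confusion matrix output. The variable names are illustrative.

```r
# Counts copied from the confusion matrix above ("F" is the positive class)
tp <- 1   # predicted F, actually F
fp <- 7   # predicted F, actually P
fn <- 9   # predicted P, actually F
tn <- 71  # predicted P, actually P
total <- tp + fp + fn + tn

accuracy <- (tp + tn) / total
no_info_rate <- max(tp + fn, fp + tn) / total  # share of the majority class
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement relative to agreement expected by chance
expected <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, no_info_rate = no_info_rate,
              sensitivity = sensitivity, specificity = specificity,
              balanced_accuracy = balanced_accuracy,
              kappa = cohen_kappa), 4))
```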

6. Recommendations

Based on my final model and analysis, I propose the following evidence-based recommendations to increase the law school’s bar passage rate:

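First, enhance Bar Preparation Programs. BarPrepCompletion is the strongest predictor in the final model, so the school should prioritize helping students complete their bar preparation course. The law school workshops should be reviewed as well as strengthened, since the estimated effect of X.LawSchoolBarPrepWorkshops is negative, though not significant at the 0.05 level (p = 0.075).

Second, support Student Success Initiatives. Expand initiatives aimed at student success, such as mentorship programs and academic support services. StudentSuccessInitiative and BarPrepMentor were dropped during stepwise selection, so this recommendation rests on general practice rather than on the final model.

Third, focus on Academic Performance. Implement strategies to improve final law school GPA, which is a significant predictor of bar passage (p = 0.00026). Undergraduate GPA was retained in the model but is not significant (p = 0.108) and is fixed at admission, so it is better suited to early risk screening.

Fourth, utilize Predictive Analytics. Use the identified predictors to flag at-risk students early and provide targeted interventions.

Finally, improve CivPro Course Outcomes. CivPro grades were retained by stepwise selection, although no individual grade level is significant in the final model. Enhancing the CivPro curriculum and supporting students in this course may still help.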
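
The predictive analytics recommendation above could be put into practice by scoring students with a fitted model and flagging those whose predicted pass probability falls below a cutoff. The sketch below uses simulated data with only LSAT and GPA_Final. The coefficients used to simulate outcomes, the 0.70 cutoff, and the names students, risk_model, p_pass, and at_risk are all hypothetical and would need to be chosen from the school's own data.

```r
# Simulated cohort (assumption: not the law school's data)
set.seed(42)
n <- 300
students <- data.frame(
  LSAT = round(rnorm(n, mean = 152, sd = 5)),
  GPA_Final = pmin(4, pmax(2, rnorm(n, mean = 3.1, sd = 0.4)))
)
# Hypothetical data-generating coefficients, chosen for a high pass rate
true_logit <- -45 + 0.25 * students$LSAT + 3 * students$GPA_Final
students$pass <- rbinom(n, 1, plogis(true_logit))

# Fit a small logistic model and score every student
risk_model <- glm(pass ~ LSAT + GPA_Final, data = students, family = binomial)
students$p_pass <- predict(risk_model, type = "response")

# Flag students whose predicted pass probability is below the cutoff
cutoff <- 0.70  # hypothetical; higher values flag more students
students$at_risk <- students$p_pass < cutoff

# Students with the lowest predicted probability of passing
print(head(students[order(students$p_pass), ], 5))
print(table(at_risk = students$at_risk))
```

Using a cutoff above 0.5 for flagging trades more false alarms for fewer missed at-risk students, which suits an early-warning use when failures are the rare class.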

7. Conclusion

The logistic regression model provides valuable insights into the factors associated with bar passage. By addressing the significant predictors in the final model and implementing the recommended strategies, the law school can work towards improving its bar passage rate.
