As data analysts for a major university, we have been tasked by the Dean of the school with providing data-driven insights into the factors that influence whether a student passes the Uniform Bar Exam (UBE). Our study is limited to regression analysis that predicts bar exam passage from the provided variables. Whether the significant variables our model identifies are actionable by the law school is left to the discretion of the Dean and the school's leadership team. The dataset was provided to us in CSV format at the following link: https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv
A pull of historical data from the link above returned 476 observations across 26 variables, including the response variable PassFail, which denotes whether the student passed the exam. The data cover the years 2021 to 2024. The variables are summarized and defined below:
1. LSAT: Score on the LSAT entrance examination
2. UGPA: Undergraduate GPA
3. Class: Year student entered law school
4. CivPro, LP1, LP2: Scores in 1L core courses
5. OneCum, FGPA: Cumulative GPA at end of 1L year and 3L year, respectively
6. Accom: Received accommodations from Student Disability Services (Yes/No)
7. Probation: Ever placed on academic probation (Yes/No)
8. LegalAnalysis, AdvLegalPerf, AdvLegalAnalysis: Enrollment in various elective courses (Yes/No)
9. BarPrep: Type of bar preparation course taken
10. PctBarPrepComplete: Percent of bar prep course completed
11. NumPrepWorkshops: Number of bar prep workshops attended
12. StudentSuccessInitiative: Participation in academic support program (Yes/No)
13. BarPrepMentor: Whether student had a mentor for bar prep (Yes/No)
14. MPRE, MPT, MEE, MBE: Component scores of the bar exam
15. UBE: Composite Uniform Bar Exam score
16. Pass: Final outcome – Did the student pass the bar exam? (Yes/No)
We did not use all the variables above to build our model. For example, the bar exam component scores (MPRE, MPT, MEE, MBE) and the composite UBE score are removed in the code below, and we considered variables such as Age to be of little practical relevance to passing the bar exam. The following section on assumptions explains the logic of our variable selection.
Assumptions: According to the National Conference of Bar Examiners (https://www.ncbex.org/exams/ube), the Uniform Bar Exam (UBE) is composed of the Multistate Essay Examination (MEE), two Multistate Performance Test (MPT) tasks, and the Multistate Bar Examination (MBE). The Multistate Professional Responsibility Examination (MPRE) is a separate exam whose score is also recorded in our dataset. Passing or failing therefore depends on performance on these tests. Because the cutoff score can change from year to year, we decided to regress the PassFail response on the predictor variables instead of modeling UBE:
• PassFail will be our response variable, treated as a categorical variable with two outcomes: Yes or No. Our model therefore assumes a Bernoulli distribution, which is a particular case of the binomial distribution.
• CivPro, LP1 and LP2 are foundational courses in the law program, and one can logically assume that students who perform well in these classes will be well prepared for the Uniform Bar Exam. We have treated these as ordinal variables.
• We think that making accommodations for students with disabilities will help this segment of the student population prepare well for the exam. Issues related to moving around, hearing, typing, or any other physical ability related to learning can hinder students from studying properly. In our project we are therefore interested in knowing whether accommodations have been made for physically impaired students. The answer is Yes or No, so it is treated as a categorical variable.
• A student having been placed on probation is an indication of learning issues, whatever their causes. We have therefore included Probation in our initial set of potential predictor variables.
• Another set of courses, LegalAnalysis, AdvLegalPerf and AdvLegalAnalysis, has also been considered, since these courses may affect the outcome of the bar exam in a similar way to CivPro, LP1 and LP2. Although these are elective courses, they are of legal relevance. In our project, we want to know whether students have taken them (Yes or No), so we have treated each of them as a categorical variable.
• We have also assumed that BarPrepCompany, which indicates the type of bar preparation completed, can be an interesting factor to study, so we have included it in our initial set as a factor variable.
• PctBarPrepComplete can also positively influence bar exam success rates, since it indicates the percentage of the bar preparation course completed. The number of bar preparation workshops attended, represented by the variable NumPrepWorkshops, is also available and is similar to PctBarPrepComplete: one describes preparatory courses and the other preparatory workshops. To avoid potential correlation between two predictor variables, we decided to select the primary one, PctBarPrepComplete.
• Participation in an academic support program has also been assumed to be potentially relevant, so we have included it in our initial set as a categorical variable.
• Last but not necessarily least at this stage of our analysis is the variable BarPrepMentor, which indicates whether the student used a mentor in preparing for the exam. We assumed this to be relevant and have therefore included it in our initial model as a categorical variable.
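The Bernoulli assumption in the first bullet corresponds to the standard logistic regression form, written here as a sketch. The symbols $p_i$, $x_{ij}$ and $\beta_j$ are notation introduced only for this illustration: $p_i$ is the probability that student $i$ passes, $x_{ij}$ is the value of predictor $j$ for that student, and $\beta_j$ is the corresponding coefficient.

$$
Y_i \sim \mathrm{Bernoulli}(p_i), \qquad \log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j} \beta_j x_{ij}
$$

Under this form, $\exp(\beta_j)$ is the multiplicative change in the odds of passing for a one-unit increase in predictor $j$, holding the other predictors fixed.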
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(ggpubr)
# Download the dataset from GitHub
df <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv")
# Define a new dataset that leaves out MPRE, MPT, MEE, Writtenscalescore, MBE and UBE
new_df <- data.frame(df$PassFail, df$Age, df$LSAT, df$UGPA, df$CivPro, df$LPI, df$LPII, df$GPA_1L,
                     df$GPA_Final, df$FinalRankPercentile, df$Accommodations, df$Probation,
                     df$LegalAnalysis_TexasPractice, df$AdvLegalPerfSkills, df$AdvLegalAnalysis, df$BarPrepCompany,
                     df$BarPrepCompletion, df$X.LawSchoolBarPrepWorkshops, df$StudentSuccessInitiative,
                     df$BarPrepMentor)
# Clean the dataset
clear_df <- na.omit(new_df)  # drop rows with missing values
colSums(is.na(clear_df))     # confirm that no missing values remain
## df.PassFail df.Age
## 0 0
## df.LSAT df.UGPA
## 0 0
## df.CivPro df.LPI
## 0 0
## df.LPII df.GPA_1L
## 0 0
## df.GPA_Final df.FinalRankPercentile
## 0 0
## df.Accommodations df.Probation
## 0 0
## df.LegalAnalysis_TexasPractice df.AdvLegalPerfSkills
## 0 0
## df.AdvLegalAnalysis df.BarPrepCompany
## 0 0
## df.BarPrepCompletion df.X.LawSchoolBarPrepWorkshops
## 0 0
## df.StudentSuccessInitiative df.BarPrepMentor
## 0 0
# Remove the 'df.' prefix from all column names (the dot is escaped so that it matches literally)
names(clear_df) <- str_remove(names(clear_df), "^df\\.")
# Transform variable types
clear_df$PassFail <- as.factor(clear_df$PassFail)
# Course grades become ordered factors with the levels listed from A to F.
# Note: any grade that is not in this list (for example an uppercase "C",
# since the list contains a lowercase "c") is converted to NA.
grade_levels <- c("A", "B+", "B", "C+", "c", "D+", "D", "F")
clear_df$CivPro <- factor(clear_df$CivPro, levels = grade_levels, ordered = TRUE)
clear_df$LPI <- factor(clear_df$LPI, levels = grade_levels, ordered = TRUE)
clear_df$LPII <- factor(clear_df$LPII, levels = grade_levels, ordered = TRUE)
clear_df$Accommodations <- as.factor(clear_df$Accommodations)
clear_df$Probation <- factor(clear_df$Probation, levels = c("Y","N"))
clear_df$LegalAnalysis_TexasPractice <- as.factor(clear_df$LegalAnalysis_TexasPractice)
clear_df$AdvLegalPerfSkills <- as.factor(clear_df$AdvLegalPerfSkills)
clear_df$AdvLegalAnalysis <- as.factor(clear_df$AdvLegalAnalysis)
clear_df$BarPrepCompany <- as.factor(clear_df$BarPrepCompany)
clear_df$StudentSuccessInitiative <- as.factor(clear_df$StudentSuccessInitiative)
clear_df$BarPrepMentor <- as.factor(clear_df$BarPrepMentor)
# Take a look at our dataset
summary(clear_df)
## PassFail Age LSAT UGPA CivPro
## F: 51 Min. :23.10 Min. :145.0 Min. :2.010 B :141
## P:398 1st Qu.:26.70 1st Qu.:153.0 1st Qu.:3.260 B+ :106
## Median :28.20 Median :156.0 Median :3.500 A : 77
## Mean :29.08 Mean :155.3 Mean :3.454 C+ : 76
## 3rd Qu.:30.10 3rd Qu.:157.0 3rd Qu.:3.710 D+ : 11
## Max. :65.70 Max. :168.0 Max. :4.140 (Other): 2
## NA's : 36
## LPI LPII GPA_1L GPA_Final
## B :159 B :118 Min. :2.200 Min. :2.440
## B+ :101 B+ :108 1st Qu.:2.783 1st Qu.:3.060
## C+ : 84 C+ : 77 Median :3.086 Median :3.280
## A : 65 A : 69 Mean :3.091 Mean :3.288
## D+ : 7 D+ : 5 3rd Qu.:3.391 3rd Qu.:3.530
## (Other): 5 (Other): 3 Max. :4.000 Max. :3.990
## NA's : 28 NA's : 69
## FinalRankPercentile Accommodations Probation LegalAnalysis_TexasPractice
## Min. :0.0000 N:383 Y : 37 N:207
## 1st Qu.:0.2700 Y: 66 N :408 Y:242
## Median :0.5300 NA's: 4
## Mean :0.5137
## 3rd Qu.:0.7500
## Max. :0.9900
##
## AdvLegalPerfSkills AdvLegalAnalysis BarPrepCompany BarPrepCompletion
## N:204 N:155 Barbri:230 Min. :0.0200
## Y:245 Y:294 Helix : 1 1st Qu.:0.8000
## Kaplan: 14 Median :0.8900
## Themis:204 Mean :0.8639
## 3rd Qu.:0.9800
## Max. :1.0000
##
## X.LawSchoolBarPrepWorkshops StudentSuccessInitiative BarPrepMentor
## Min. :0.000 N :337 N :347
## 1st Qu.:0.000 Keffer : 12 ColleenPotts : 5
## Median :1.000 Christopher: 11 DeirdreWard : 5
## Mean :1.559 Sherwin : 9 ClayElliott : 4
## 3rd Qu.:3.000 Saavedra : 8 AshleyPirtle : 3
## Max. :5.000 Baldwin : 7 AshleySanders: 3
## (Other) : 65 (Other) : 82
Before modeling, we first examine the distribution of our target variable, PassFail, which indicates whether students passed (P) or failed (F) the bar exam.
# Calculate the passage rate
pass_rate <- clear_df %>%
  count(PassFail) %>%
  mutate(percentage = n / sum(n) * 100)
pass_rate
## PassFail n percentage
## 1 F 51 11.35857
## 2 P 398 88.64143
Looking at this result, F (Fail) represents 51 students (11.4%), while P (Pass) includes 398 students (88.6%), showing a strong majority of bar exam passers in the dataset. We now visualize this distribution.
# Plot the PassFail counts
ggplot(clear_df, aes(x = PassFail, fill = PassFail)) +
  geom_bar() +
  labs(title = "Bar Passage Outcomes", x = "Outcome", y = "Count")
88.6% of students (398 individuals) passed the bar exam. This high pass rate suggests either an academically strong cohort of students or an effective law school curriculum and bar preparation program. Only 11.4% of students (51 individuals) failed the bar exam; while relatively small, this group represents a significant opportunity for intervention.
Let's look at trends in academic performance.
# Plot key continuous predictors
p1 <- ggplot(clear_df, aes(x = PassFail, y = LSAT, fill = factor(PassFail))) +
  geom_boxplot() +
  labs(title = "LSAT Scores by Bar Outcome")
p2 <- ggplot(clear_df, aes(x = PassFail, y = GPA_Final, fill = factor(PassFail))) +
  geom_boxplot() +
  labs(title = "Final GPA by Bar Outcome")
ggarrange(p1, p2, ncol = 2)
Passing students tend to have higher LSAT scores and final GPAs.
In the plot on the left, the median LSAT for passers (“P”) is likely around 155-157, while the median for failers (“F”) is likely 3-5 points lower (150-153). This suggests that LSAT score is a moderate predictor of bar success: the gap between the Pass and Fail boxplots shows that LSAT correlates with bar passage but is not deterministic.
In the plot on the right, the median final GPA for passers (“P”) is likely 3.3–3.5 (strong performance), and the median for failers (“F”) is likely 0.3–0.5 points lower. Failers may include students with very low GPAs, while passers rarely dip below 3.0. GPA therefore appears to be a stronger predictor than LSAT, and a GPA of 3.0 or higher appears important for success. Although GPA and LSAT each explain part of the difference between passers and failers, the two alone do not fully explain outcomes; other factors may intervene.
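The medians discussed above are read off the boxplots. A minimal sketch of how they could be computed directly is shown here. It uses a small hypothetical data frame, toy_df, with made-up values standing in for clear_df, and base R's aggregate() rather than dplyr so that the example has no package dependencies.

```r
# Hypothetical toy data standing in for clear_df (values are made up for illustration)
toy_df <- data.frame(
  PassFail  = c("P", "P", "P", "F", "F", "P"),
  LSAT      = c(156, 158, 154, 150, 152, 157),
  GPA_Final = c(3.4, 3.6, 3.2, 2.8, 3.0, 3.5)
)

# Median of each numeric predictor within each outcome group
group_medians <- aggregate(cbind(LSAT, GPA_Final) ~ PassFail,
                           data = toy_df, FUN = median)
group_medians
```

Applying the same aggregate() call to clear_df would replace the "likely" ranges above with exact group medians.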
p3 <- ggplot(clear_df, aes(x = Probation, fill = factor(PassFail))) +
  geom_bar(position = "fill") +
  labs(title = "Bar Passage by Probation Status")
p4 <- ggplot(clear_df, aes(x = BarPrepMentor, fill = factor(PassFail))) +
  geom_bar(position = "fill") +
  labs(title = "Bar Passage by Mentor Status")
ggarrange(p3, p4, ncol = 2)
Students on probation typically have GPAs below 2.0–2.5, indicating persistent struggles with legal coursework (probation = failure of academic integration).
Mentors who recently passed the bar offer credible, relatable advice (e.g., how to approach exam questions) and can teach candidates to identify early signs of struggle (mentorship = belonging and guidance).
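The stacked bar charts show pass and fail shares within each group. As a rough sketch, the same shares can be tabulated numerically with table() and prop.table(). The data frame toy_df below is hypothetical and its values are made up; it only mimics the Probation and PassFail columns of clear_df.

```r
# Hypothetical toy data standing in for clear_df (values are made up for illustration)
toy_df <- data.frame(
  Probation = c("Y", "Y", "N", "N", "N", "N"),
  PassFail  = c("F", "P", "P", "P", "F", "P")
)

# Cross-tabulate the counts, then convert each row to proportions (margin = 1)
counts <- table(toy_df$Probation, toy_df$PassFail)
pass_share <- prop.table(counts, margin = 1)
pass_share
```

Each row of pass_share sums to 1, which is the quantity that geom_bar(position = "fill") displays.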
To identify key predictors of bar exam success, we begin by fitting a full logistic regression model using all available variables. This initial model serves as a baseline for evaluating predictor significance and guiding subsequent refinement through stepwise selection.
# Fit full logistic regression model
# Convert PassFail to binary (1 for Pass, 0 for Fail)
clear_df$PassFail <- as.numeric(clear_df$PassFail == "P")
# Keep complete cases only
model_data <- na.omit(clear_df)
full_model <- glm(PassFail ~ .,
                  family = binomial(link = "logit"),
                  data = model_data)
# Model summary
summary(full_model)
##
## Call:
## glm(formula = PassFail ~ ., family = binomial(link = "logit"),
## data = model_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.776e+03 1.962e+06 -0.001 0.999
## Age 1.828e+00 4.224e+03 0.000 1.000
## LSAT 1.281e+01 8.704e+03 0.001 0.999
## UGPA 6.148e+01 7.769e+04 0.001 0.999
## CivPro.L -2.239e+01 4.170e+05 0.000 1.000
## CivPro.Q -1.166e+02 3.086e+05 0.000 1.000
## CivPro.C 9.158e+00 5.049e+05 0.000 1.000
## CivPro^4 -2.987e+01 4.898e+05 0.000 1.000
## CivPro^5 -9.605e-01 3.260e+05 0.000 1.000
## CivPro^6 3.856e+01 2.094e+05 0.000 1.000
## LPI.L 2.110e+01 3.800e+05 0.000 1.000
## LPI.Q -1.290e+01 2.894e+05 0.000 1.000
## LPI.C -3.850e+01 4.412e+05 0.000 1.000
## LPI^4 -5.727e+01 4.314e+05 0.000 1.000
## LPI^5 -3.740e+01 2.338e+05 0.000 1.000
## LPII.L 1.427e+02 2.885e+05 0.000 1.000
## LPII.Q 2.708e+01 1.866e+05 0.000 1.000
## LPII.C 8.864e+01 2.451e+05 0.000 1.000
## LPII^4 4.821e+01 2.347e+05 0.000 1.000
## LPII^5 5.116e+01 1.294e+05 0.000 1.000
## GPA_1L 1.318e+02 1.550e+05 0.001 0.999
## GPA_Final -7.442e+00 4.311e+05 0.000 1.000
## FinalRankPercentile 7.955e+01 4.145e+05 0.000 1.000
## AccommodationsY -1.366e+01 9.136e+04 0.000 1.000
## ProbationN -7.968e+01 2.114e+05 0.000 1.000
## LegalAnalysis_TexasPracticeY -3.386e+01 6.933e+04 0.000 1.000
## AdvLegalPerfSkillsY -5.742e-01 1.052e+05 0.000 1.000
## AdvLegalAnalysisY 2.934e+00 4.521e+04 0.000 1.000
## BarPrepCompanyHelix 3.862e+01 3.683e+05 0.000 1.000
## BarPrepCompanyKaplan -7.907e+01 1.001e+05 -0.001 0.999
## BarPrepCompanyThemis 3.791e+01 7.220e+04 0.001 1.000
## BarPrepCompletion 2.089e+02 9.402e+04 0.002 0.998
## X.LawSchoolBarPrepWorkshops 5.895e-01 1.738e+04 0.000 1.000
## StudentSuccessInitiativeAycock 4.696e+01 4.133e+05 0.000 1.000
## StudentSuccessInitiativeBaldwin 6.588e-01 5.083e+05 0.000 1.000
## StudentSuccessInitiativeBeyer -4.626e+01 1.239e+06 0.000 1.000
## StudentSuccessInitiativeChapman 7.559e+01 4.414e+05 0.000 1.000
## StudentSuccessInitiativeChristopher 1.324e+02 2.189e+06 0.000 1.000
## StudentSuccessInitiativeCochran 2.763e+01 5.696e+05 0.000 1.000
## StudentSuccessInitiativeCorn 5.167e+01 4.840e+05 0.000 1.000
## StudentSuccessInitiativeGonzalez -1.642e+01 4.574e+05 0.000 1.000
## StudentSuccessInitiativeHardberger -2.850e+01 7.522e+05 0.000 1.000
## StudentSuccessInitiativeHumphrey 5.541e+01 5.480e+07 0.000 1.000
## StudentSuccessInitiativeKeffer 6.116e+01 2.878e+05 0.000 1.000
## StudentSuccessInitiativeLauriat -4.533e+00 3.274e+05 0.000 1.000
## StudentSuccessInitiativeLux -8.721e+01 3.986e+05 0.000 1.000
## StudentSuccessInitiativeMcDonald 8.733e+01 5.960e+05 0.000 1.000
## StudentSuccessInitiativeN 8.072e+01 3.018e+05 0.000 1.000
## StudentSuccessInitiativeRosen 9.118e+01 3.523e+05 0.000 1.000
## StudentSuccessInitiativeRSherwin 1.143e+02 3.226e+05 0.000 1.000
## StudentSuccessInitiativeSaavedra -1.075e+00 7.767e+05 0.000 1.000
## StudentSuccessInitiativeSherwin 2.124e+02 3.330e+05 0.001 0.999
## StudentSuccessInitiativeSmith 1.002e+02 3.614e+05 0.000 1.000
## BarPrepMentorAmberBeard 8.183e+01 5.293e+05 0.000 1.000
## BarPrepMentorAmberRich -8.478e+01 5.304e+05 0.000 1.000
## BarPrepMentorAshleyPirtle -1.353e+02 6.226e+05 0.000 1.000
## BarPrepMentorAshleySanders 2.865e+01 4.036e+05 0.000 1.000
## BarPrepMentorBrendaJohnson 5.660e+01 5.259e+05 0.000 1.000
## BarPrepMentorBryanGreer -1.018e+01 5.254e+05 0.000 1.000
## BarPrepMentorCadyMello -1.032e+02 1.574e+06 0.000 1.000
## BarPrepMentorChrisRhodes 7.792e+01 5.418e+05 0.000 1.000
## BarPrepMentorClayElliott -7.071e+01 4.589e+05 0.000 1.000
## BarPrepMentorColeShooter -4.321e+01 4.650e+05 0.000 1.000
## BarPrepMentorColleenByrom 1.551e+01 5.701e+05 0.000 1.000
## BarPrepMentorColleenElbe(Potts) -1.813e+02 5.570e+05 0.000 1.000
## BarPrepMentorColleenPotts -1.153e+02 4.430e+05 0.000 1.000
## BarPrepMentorDanielleSaavedra -5.354e+01 4.426e+05 0.000 1.000
## BarPrepMentorDavidHutchens 3.708e+01 5.139e+05 0.000 1.000
## BarPrepMentorDavidRice -1.973e+01 8.652e+05 0.000 1.000
## BarPrepMentorDeirdreWard -1.701e+01 3.846e+05 0.000 1.000
## BarPrepMentorDenetteVaughn -1.923e+02 5.399e+05 0.000 1.000
## BarPrepMentorGrantCoffey -1.753e+02 5.246e+05 0.000 1.000
## BarPrepMentorHaleyHickey 2.370e+01 4.819e+05 0.000 1.000
## BarPrepMentorHolleyMcDaniel -3.638e+01 4.448e+05 0.000 1.000
## BarPrepMentorHollyHaseloff -4.210e+01 4.369e+05 0.000 1.000
## BarPrepMentorHoltonWestbrook -8.914e+01 5.239e+05 0.000 1.000
## BarPrepMentorJacquelynnMayes -9.132e+01 5.276e+05 0.000 1.000
## BarPrepMentorJessicaAycock -1.864e+02 5.515e+05 0.000 1.000
## BarPrepMentorJohnMoore -9.737e+01 5.214e+05 0.000 1.000
## BarPrepMentorJordanChavez -1.397e+02 5.128e+05 0.000 1.000
## BarPrepMentorJosephAustin -9.056e+01 5.480e+07 0.000 1.000
## BarPrepMentorJulieDavis -3.598e+01 5.109e+05 0.000 1.000
## BarPrepMentorJustinPlescha -1.033e+01 4.349e+05 0.000 1.000
## BarPrepMentorKathleenGoegel -3.263e+02 5.417e+05 -0.001 1.000
## BarPrepMentorKatyCrocker 1.132e+02 5.835e+05 0.000 1.000
## BarPrepMentorKimberlyKelley -3.836e+01 5.259e+05 0.000 1.000
## BarPrepMentorLauraFidelie -1.415e+02 5.161e+05 0.000 1.000
## BarPrepMentorLauraMcDivitt 2.320e+01 5.107e+05 0.000 1.000
## BarPrepMentorLaurenWelch -1.400e+02 5.180e+05 0.000 1.000
## BarPrepMentorLeenaAl-Souki -4.049e+01 5.480e+07 0.000 1.000
## BarPrepMentorMadelynDeviney -9.884e+01 4.433e+05 0.000 1.000
## BarPrepMentorMariaOviedo -1.931e+01 6.814e+05 0.000 1.000
## BarPrepMentorMelissaWaggoner -2.272e+01 5.509e+05 0.000 1.000
## BarPrepMentorMerylBenham 5.990e+00 4.418e+05 0.000 1.000
## BarPrepMentorMichaelEconomidis -5.476e+01 3.877e+05 0.000 1.000
## BarPrepMentorMistyPratt -7.841e+01 7.000e+05 0.000 1.000
## BarPrepMentorMonicaReyes 2.330e+01 5.769e+05 0.000 1.000
## BarPrepMentorN 2.777e+01 3.788e+05 0.000 1.000
## BarPrepMentorPaulaMilan -5.839e+01 4.745e+07 0.000 1.000
## BarPrepMentorPaulBarkhurst -2.173e+02 7.428e+05 0.000 1.000
## BarPrepMentorQuentinWetsel -5.852e+01 5.183e+05 0.000 1.000
## BarPrepMentorScottKeffer -1.458e+02 5.259e+05 0.000 1.000
## BarPrepMentorScoutBlosser -1.639e+02 5.683e+05 0.000 1.000
## BarPrepMentorTasiaEaslon -7.416e+01 5.904e+05 0.000 1.000
## BarPrepMentorTomHall -4.313e+01 5.206e+05 0.000 1.000
## BarPrepMentorTravisWeibold -2.054e+02 5.446e+05 0.000 1.000
## BarPrepMentorTylynnPayne -1.014e+02 5.342e+05 0.000 1.000
## BarPrepMentorVictoriaWhitehead -1.929e+01 5.108e+05 0.000 1.000
## BarPrepMentorVictorMellinger -7.847e+01 5.382e+05 0.000 1.000
## BarPrepMentorWillRaftis -1.290e+02 4.642e+05 0.000 1.000
## BarPrepMentorY-DanielleSaavedra -7.018e+01 5.278e+05 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.8758e+02 on 333 degrees of freedom
## Residual deviance: 2.6310e-08 on 223 degrees of freedom
## AIC: 222
##
## Number of Fisher Scoring iterations: 25
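Running the full model, none of the predictors appears significant: all p-values are $\approx$ 1.000, the standard errors are enormous, and the residual deviance is essentially 0 (2.6310e-08). These are symptoms of an over-parameterized model with (quasi-)complete separation rather than of a meaningful fit: the model estimates 111 coefficients from 334 observations, and Fisher scoring stopped at 25 iterations, the default maximum for glm. With a deviance of nearly 0, AIC = 222 is simply twice the number of estimated coefficients, so it says nothing about the quality of the fit.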
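The pattern in the output above (deviance near 0, huge standard errors, p-values near 1) can be reproduced on a tiny example. The sketch below uses made-up data, x and y, in which the outcome is perfectly separated by the predictor; it is an illustration of the symptom, not a diagnosis of our dataset.

```r
# Toy data with complete separation: y is 0 whenever x <= 4 and 1 otherwise
x <- 1:8
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

# glm() warns about fitted probabilities of 0 or 1; the warnings are suppressed here
sep_fit <- suppressWarnings(glm(y ~ x, family = binomial(link = "logit")))

n_coef <- length(coef(sep_fit))   # number of estimated coefficients
deviance(sep_fit)                 # essentially 0
AIC(sep_fit)                      # essentially 2 * n_coef
summary(sep_fit)$coefficients     # very large standard errors, p-values near 1
```

When a fit looks like this, the AIC carries no information about fit quality, because it reduces to twice the coefficient count.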
To identify the most impactful predictors while maintaining model parsimony, we perform stepwise AIC-based selection on the full model.
library(MASS) # provides stepAIC()
# Perform stepwise selection in both directions, capped at 9 steps
step_results <- stepAIC(full_model, direction = "both", trace = FALSE, steps = 9)
# Extract the first 9 rows of the stepwise path (the initial model plus each removal)
step_path <- head(step_results$anova, 9)
step_path
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## PassFail ~ Age + LSAT + UGPA + CivPro + LPI + LPII + GPA_1L +
## GPA_Final + FinalRankPercentile + Accommodations + Probation +
## LegalAnalysis_TexasPractice + AdvLegalPerfSkills + AdvLegalAnalysis +
## BarPrepCompany + BarPrepCompletion + X.LawSchoolBarPrepWorkshops +
## StudentSuccessInitiative + BarPrepMentor
##
## Final Model:
## PassFail ~ LSAT + UGPA + LPI + LPII + GPA_1L + GPA_Final + Probation +
## LegalAnalysis_TexasPractice + BarPrepCompany + BarPrepCompletion +
## StudentSuccessInitiative
##
##
## Step Df Deviance Resid. Df Resid. Dev
## 1 223 2.630955e-08
## 2 - BarPrepMentor 58 3.818392e-08 281 6.449346e-08
## 3 - CivPro 6 2.260952e-07 287 2.905887e-07
## 4 - X.LawSchoolBarPrepWorkshops 1 5.447172e-09 288 2.960359e-07
## 5 - AdvLegalPerfSkills 1 6.465908e-09 289 3.025018e-07
## 6 - FinalRankPercentile 1 6.534497e-09 290 3.090363e-07
## 7 - Accommodations 1 1.025674e-08 291 3.192930e-07
## 8 - AdvLegalAnalysis 1 2.213149e-05 292 2.245078e-05
## 9 - Age 1 3.612132e-04 293 3.836640e-04
## AIC
## 1 222.00000
## 2 106.00000
## 3 94.00000
## 4 92.00000
## 5 90.00000
## 6 88.00000
## 7 86.00000
## 8 84.00002
## 9 82.00038
We started with 19 predictors, and the stepwise procedure removed 8 of them, leaving 11. AIC decreased from 222 to 82. Because the residual deviance is essentially 0 at every step, this decrease mostly reflects the smaller number of estimated coefficients rather than a better fit to the data. BarPrepMentor, which accounts for 58 coefficients, was removed first, and Age was removed last. Some academic metrics, such as CivPro and FinalRankPercentile, were removed early in the stepwise process, likely because removing them changed the deviance very little (2.26e-07 and 6.53e-09, respectively).
The final selected model regresses PassFail on LSAT + UGPA + LPI + LPII + GPA_1L + GPA_Final + Probation + LegalAnalysis_TexasPractice + BarPrepCompany + BarPrepCompletion + StudentSuccessInitiative.
For this model the residual deviance rose only slightly, from 2.63e-08 to 3.84e-04, and the AIC is 82 (lower is better). The model therefore fits the data about as closely as the full model while using 11 predictors instead of the original 19.
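The AIC column of the stepwise table can be checked by hand. For a binary response, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below copies the residual degrees of freedom and deviances from the table above; the names n_obs, resid_df, resid_dev and n_coef are introduced only for this illustration.

```r
# The null deviance has 333 degrees of freedom, so the model used 334 observations
n_obs <- 334

# Residual degrees of freedom and residual deviance at each step (from the table above)
resid_df  <- c(223, 281, 287, 288, 289, 290, 291, 292, 293)
resid_dev <- c(2.630955e-08, 6.449346e-08, 2.905887e-07, 2.960359e-07,
               3.025018e-07, 3.090363e-07, 3.192930e-07, 2.245078e-05,
               3.836640e-04)

# Number of estimated coefficients (including the intercept) at each step
n_coef <- n_obs - resid_df

# AIC = residual deviance + 2 * number of coefficients
aic <- resid_dev + 2 * n_coef
round(aic, 5)
```

The reconstructed values match the AIC column, which confirms that the drop from 222 to 82 comes almost entirely from the coefficient count (111 down to 41) and not from the deviance.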
To analyse the selected model, we convert LPI and LPII to numeric grade points and collapse StudentSuccessInitiative into its three most frequent levels plus an "Other_Programs" category.
# Create a new dataset for the analysis
analysis_df <- clear_df
# Grade-to-grade-point mapping used to convert the ordinal grades to numeric values.
# Grades that are missing from the map (for example "D+") are converted to NA.
grade_map <- c("A" = 4, "A-" = 3.7, "B+" = 3.3, "B" = 3.0, "B-" = 2.7,
               "C+" = 2.3, "C" = 2.0, "D" = 1.0, "F" = 0)
analysis_df$LPI <- grade_map[as.character(analysis_df$LPI)]
analysis_df$LPII <- grade_map[as.character(analysis_df$LPII)]
# Keep the three most frequent StudentSuccessInitiative levels and pool the rest
Initiative_group <- names(sort(table(analysis_df$StudentSuccessInitiative), decreasing = TRUE)[1:3])
analysis_df$StudentSuccessInitiative <- ifelse(
  analysis_df$StudentSuccessInitiative %in% Initiative_group,
  as.character(analysis_df$StudentSuccessInitiative),
  "Other_Programs"
) %>% factor()
Having refined our predictors through stepwise selection, we now analyze the final logistic regression model to identify which factors significantly impact bar exam success. This model balances complexity and predictive power by retaining only the most influential variables.
final_model <- glm(PassFail ~ LSAT + UGPA + LPI + LPII + GPA_1L + GPA_Final + Probation +
                     LegalAnalysis_TexasPractice + BarPrepCompany + BarPrepCompletion +
                     StudentSuccessInitiative,
                   family = binomial(link = "logit"),
                   data = analysis_df)
summary(final_model)
##
## Call:
## glm(formula = PassFail ~ LSAT + UGPA + LPI + LPII + GPA_1L +
## GPA_Final + Probation + LegalAnalysis_TexasPractice + BarPrepCompany +
## BarPrepCompletion + StudentSuccessInitiative, family = binomial(link = "logit"),
## data = analysis_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -77.89262 18.25574 -4.267 1.98e-05
## LSAT 0.36190 0.09740 3.716 0.000203
## UGPA 0.62718 0.79922 0.785 0.432610
## LPI -0.12250 0.51236 -0.239 0.811036
## LPII -2.40697 0.69935 -3.442 0.000578
## GPA_1L 5.93756 1.89619 3.131 0.001740
## GPA_Final 1.67400 2.04473 0.819 0.412964
## ProbationN -0.46980 0.86338 -0.544 0.586342
## LegalAnalysis_TexasPracticeY -0.38584 0.60509 -0.638 0.523701
## BarPrepCompanyHelix 15.83063 1455.39804 0.011 0.991321
## BarPrepCompanyKaplan -3.21891 1.12366 -2.865 0.004175
## BarPrepCompanyThemis 1.39532 0.61631 2.264 0.023576
## BarPrepCompletion 7.62277 1.87890 4.057 4.97e-05
## StudentSuccessInitiativeKeffer 0.01598 1.43917 0.011 0.991141
## StudentSuccessInitiativeN 1.61866 1.17559 1.377 0.168545
## StudentSuccessInitiativeOther_Programs 0.75744 1.14088 0.664 0.506748
##
## (Intercept) ***
## LSAT ***
## UGPA
## LPI
## LPII ***
## GPA_1L **
## GPA_Final
## ProbationN
## LegalAnalysis_TexasPracticeY
## BarPrepCompanyHelix
## BarPrepCompanyKaplan **
## BarPrepCompanyThemis *
## BarPrepCompletion ***
## StudentSuccessInitiativeKeffer
## StudentSuccessInitiativeN
## StudentSuccessInitiativeOther_Programs
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 236.67 on 351 degrees of freedom
## Residual deviance: 110.38 on 336 degrees of freedom
## (97 observations deleted due to missingness)
## AIC: 142.38
##
## Number of Fisher Scoring iterations: 14
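Significant variables (each effect holds the other predictors fixed):
LSAT: p-value = 0.0002; a 1-point increase in LSAT score increases the odds of passing the bar by approximately 44% (exp(0.3619) ≈ 1.44).
GPA_1L: p-value = 0.0017; a one-unit increase in first-year GPA multiplies the odds of passing the bar by about 379 (exp(5.938) ≈ 379). On a more realistic scale, a 0.1-point increase multiplies the odds by about 1.81 (exp(0.594)).
LPII: p-value = 0.0006; each one-unit increase in LPII (converted from the letter grade) decreases the odds of passing the bar by about 91% (exp(-2.407) ≈ 0.09).
BarPrepCompletion: p-value = 0.00005; because completion is recorded as a proportion, a one-unit increase means going from 0% to 100% completion, which multiplies the odds of passing by about 2,044 (exp(7.623)). A 10-percentage-point increase multiplies the odds by about 2.14 (exp(0.762)).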
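The odds interpretations above come from exponentiating the fitted coefficients. A minimal sketch of that arithmetic follows, with the estimates copied from the summary(final_model) output; the vector names coefs, odds_ratio and pct_change are introduced only for this illustration.

```r
# Estimates copied from the summary(final_model) output above
coefs <- c(LSAT = 0.36190, GPA_1L = 5.93756,
           LPII = -2.40697, BarPrepCompletion = 7.62277)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(coefs)

# The same effect expressed as a percentage change in the odds
pct_change <- (odds_ratio - 1) * 100

round(odds_ratio, 3)
round(pct_change, 1)

# Odds ratios for a 0.1-unit increase, a more realistic step for GPA and completion
round(exp(0.1 * coefs[c("GPA_1L", "BarPrepCompletion")]), 3)
```

With the fitted model object available, exp(coef(final_model)) gives the same odds ratios for every term at once.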
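Key Model Metrics
-AIC: 142.38, lower than the null model's AIC (236.67 + 2 = 238.67), which indicates an improved fit.
-Null deviance: 236.67, the baseline error of the intercept-only model (higher = worse fit).
-Residual deviance: 110.38, about 53% lower than the null deviance, which indicates a good fit.
-The difference between the null deviance and the residual deviance (236.67 - 110.38 = 126.29 on 351 - 336 = 15 degrees of freedom) is far greater than the critical value qchisq(.95, 15) ≈ 25.0, so the model is statistically significant.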
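The deviance comparison in the last bullet is a likelihood ratio test. As a sketch, it can be carried out with the values printed in the model summary; the degrees of freedom of the test equal the difference between the null and residual degrees of freedom. The variable names below are introduced only for this illustration.

```r
# Values copied from the summary(final_model) output above
null_dev  <- 236.67
null_df   <- 351
resid_dev <- 110.38
resid_df  <- 336

# Likelihood ratio statistic and its degrees of freedom
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

# Critical value at the 5% level and the p-value of the test
critical <- qchisq(0.95, df = lr_df)
p_value  <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(lr_stat = lr_stat, lr_df = lr_df, critical = critical, p_value = p_value)
```

The statistic is well above the critical value, so the selected predictors jointly improve on the intercept-only model.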
library(car)
vif(final_model)
## GVIF Df GVIF^(1/(2*Df))
## LSAT 1.785294 1 1.336149
## UGPA 1.389645 1 1.178832
## LPI 1.579703 1 1.256862
## LPII 2.670961 1 1.634308
## GPA_1L 4.942028 1 2.223067
## GPA_Final 3.550727 1 1.884337
## Probation 1.939126 1 1.392525
## LegalAnalysis_TexasPractice 1.450698 1 1.204449
## BarPrepCompany 2.032850 3 1.125514
## BarPrepCompletion 1.947905 1 1.395674
## StudentSuccessInitiative 2.676942 3 1.178347
GPA_1L: GVIF = 4.94, a moderate concern. This variable may be moderately correlated with GPA_Final, as both measure academic performance over time. No other variable raises a collinearity concern, since all GVIF values are below 5.
For the refined model we drop UGPA, LPI, LPII, GPA_Final, Probation, LegalAnalysis_TexasPractice, and StudentSuccessInitiative.
refine_model <- glm(
  PassFail ~ LSAT + GPA_1L + BarPrepCompletion + BarPrepCompany,
  family = binomial,
  data = analysis_df)
summary(refine_model)
##
## Call:
## glm(formula = PassFail ~ LSAT + GPA_1L + BarPrepCompletion +
## BarPrepCompany, family = binomial, data = analysis_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.67186 8.61617 -4.256 2.08e-05 ***
## LSAT 0.15775 0.05257 3.001 0.00269 **
## GPA_1L 3.09320 0.53538 5.778 7.58e-09 ***
## BarPrepCompletion 5.84549 1.22510 4.771 1.83e-06 ***
## BarPrepCompanyHelix 15.69717 882.74367 0.018 0.98581
## BarPrepCompanyKaplan -1.08325 0.78966 -1.372 0.17013
## BarPrepCompanyThemis 1.26794 0.41465 3.058 0.00223 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 317.84 on 448 degrees of freedom
## Residual deviance: 220.38 on 442 degrees of freedom
## AIC: 234.38
##
## Number of Fisher Scoring iterations: 13
vif(refine_model)
## GVIF Df GVIF^(1/(2*Df))
## LSAT 1.156420 1 1.075370
## GPA_1L 1.069834 1 1.034328
## BarPrepCompletion 1.352713 1 1.163062
## BarPrepCompany 1.344860 3 1.050621
We now have a model whose continuous predictors are all significant and whose predictors show no collinearity concerns (all GVIF values are below 1.4). The difference between the null deviance and the residual deviance (317.84 - 220.38 = 97.46 on 448 - 442 = 6 degrees of freedom) is greater than the critical value qchisq(.95, 6) ≈ 12.59, which means the model is statistically significant.
AIC = 234.38 is higher than final_model's AIC (142.38), but the two values are not directly comparable: final_model was fitted to 352 observations after 97 were deleted for missingness, whereas the refined model uses all 449. This refined model simplifies the analysis to just four predictors while retaining explanatory power and statistical significance. It supports the conclusion that early academic strength (first-year GPA), LSAT score, and bar prep completion are key predictors of bar exam success. The model also provides clarity and interpretability without collinearity concerns.
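To illustrate how the refined model turns predictors into a pass probability, the sketch below applies the fitted coefficients to a hypothetical student. The student's values (LSAT 155, first-year GPA 3.0, 90% bar prep completion) are assumptions chosen only for illustration; Barbri is the reference level of BarPrepCompany, so it contributes no extra term.

```r
# Coefficients copied from the summary(refine_model) output above
b <- c(intercept = -36.67186, LSAT = 0.15775, GPA_1L = 3.09320,
       BarPrepCompletion = 5.84549, Themis = 1.26794)

# Hypothetical student (values chosen only for illustration)
student <- c(LSAT = 155, GPA_1L = 3.0, BarPrepCompletion = 0.9)

# Linear predictor (log-odds) for a Barbri student, the reference level
logit_barbri <- b[["intercept"]] + sum(b[names(student)] * student)

# plogis() is the inverse logit: it converts log-odds to a probability
p_barbri <- plogis(logit_barbri)

# Same student with Themis: add the Themis coefficient to the log-odds
p_themis <- plogis(logit_barbri + b[["Themis"]])

round(c(Barbri = p_barbri, Themis = p_themis), 3)
```

With the fitted object available, predict(refine_model, newdata = ..., type = "response") performs the same calculation for any set of students.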