1 Background

Student-level data was reviewed for four academic years. The data included GPA, course scores, elective participation, and many other variables, and was analyzed in relation to pass/fail performance on the bar exam.

1.1 Objective

Use student-level data with logistic regression to identify predictors of bar exam performance, with the aim of identifying actionable plans to improve bar performance for future students.

2 Methods

2.1 Data Collection and Cleaning

Data was collected from https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv. The data included students who took the bar exam in the years 2021, 2022, 2023, and 2024. The variables collected are summarized in the table below:

Variables Explanation
PassFail Final Outcome - Did the student pass the exam? (Yes/No)
Year Year in which the student took the bar exam
Age Student’s age
LSAT Score on the LSAT entrance examination
UGPA Undergraduate GPA
CivPro Score in CivPro L1 core course
LPI Score in LPI L1 core course
LPII Score in LPII L1 core course
GPA_1L Cumulative GPA at the end of 1L year
GPA_Final Cumulative GPA at the end of 3L year
FinalRankPercentile Final percentile rank of student in program
Accommodations Whether the student received accommodations from Student Disability Services (Yes/No)
Probation Whether the student was ever placed on academic probation (Yes/No)
LegalAnalysis_TexasPractice Enrollment in Legal Analysis elective (Yes/No)
AdvLegalPerfSkills Enrollment in Advanced Legal Performance elective (Yes/No)
AdvLegalAnalysis Enrollment in Advanced Legal Analysis elective (Yes/No)
BarPrepCompany Type of bar preparation course taken
BarPrepCompletion Percent of bar prep course completed
X.LawSchoolBarPrepWorkshops Number of bar prep workshops attended
StudentSuccessInitiative Participation in academic support program
BarPrepMentor Whether the student had a bar prep mentor
MPRE Component score of the bar exam
MPT Component score of the bar exam
MEE Component score of the bar exam
WrittenScaledScore Component score of the bar exam
MBE Component score of the bar exam
UBE Composite Uniform Bar Exam score

One of the primary data cleaning steps was converting variable types. Most of the raw data was either numerical or character data. The course score variables were converted from character data to factors with an explicit grade-level order. The Yes/No character variables were converted to numerical data, with 0 for no and 1 for yes. The bar prep company data was converted from character data to a factor. The BarPrepMentor data was converted from character data to numerical data, where no mentor was mapped to 0 and any mentor was mapped to 1; because the majority of the mentors were unique, a trend specific to a particular mentor was unlikely to be detectable.

The component and composite bar exam scores were also removed from the data set because they are alternative forms of the primary response variable and are not independent or suitable as predictors. The Pass/Fail variable, the primary response, was converted from character to numerical data, with a fail mapped to 0 and a pass mapped to 1.

Finally, entries with missing values were removed from the data set.

2.2 Exploratory Data Analysis

Box plots of several continuous predictor variables were used to check for obvious differences between the group that passed the bar exam and the group that did not.

Pairwise correlations were examined between all numerical predictor variables using a correlation matrix.

2.3 Logistic Regression Modeling

Since our response variable, passing the bar exam, is a binary outcome rather than a normally distributed continuous variable, logistic regression will be used instead of simple linear regression. Instead of returning the response variable directly, the regression formula returns the logit, as shown below:

\[ \ln\left(\frac{p}{1-p}\right)= \beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\dots+\beta _{n}x_{n} \] The logit is the log odds of the outcome, where \(p\) is the probability of passing the bar exam.
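
To make the logit concrete, the short sketch below (an illustration only, not one of the fitted models in this report) converts a linear predictor back into a predicted probability of passing with the inverse logit; the coefficient and student values are hypothetical placeholders.

# Sketch: converting a logit (log odds) into a probability with the inverse logit
# The coefficients and inputs below are hypothetical placeholders, not fitted values
b0 <- -38; b_lsat <- 0.17; b_gpa <- 4.4    # hypothetical intercept and slopes
lsat <- 155; gpa <- 3.2                    # hypothetical student
logit <- b0 + b_lsat * lsat + b_gpa * gpa  # log odds of passing
p <- plogis(logit)                         # exp(logit) / (1 + exp(logit))
p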

Models will be compared to each other using the Akaike Information Criterion (AIC). The AIC is calculated with the formula below, where \(L\) is the maximized likelihood of the model and \(p\) is the number of estimated parameters. The AIC is an estimator of prediction error that is used to assess the quality of a model relative to other models.

\[ AIC=-2\ln(L) +2p \] Model significance will be evaluated using the deviance. Specifically, a pseudo R squared will compare the ratio of the residual deviance (\(D\)) to the null deviance (\(D_{0}\)) per the formula below:

\[ R^{2}=1-\frac{D}{D_{0}} \] Unlike the R squared used in simple linear regression, this R squared is not the percentage of variance explained by the model. In logistic regression, it indicates how close the fitted model is to a perfect fit (values near 1) versus the null model (values near 0).
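
For reference, both quantities can be computed directly from a fitted glm object in R. The snippet below is a minimal sketch, assuming the cleaned data frame df_trim constructed in Section 3.1; the model formula is only an example.

# Sketch: AIC and deviance-based pseudo R squared for a fitted logistic regression
fit <- glm(PassFail ~ LSAT, family = binomial(link = "logit"), data = df_trim)   # example fit
manual_aic <- -2 * as.numeric(logLik(fit)) + 2 * attr(logLik(fit), "df")         # -2 ln(L) + 2p
manual_aic          # matches AIC(fit) and the AIC reported by summary(fit)
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance                                # 1 - D / D0
pseudo_r2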

3 Exploratory Data Analysis

3.1 Data Cleaning

The data was originally extracted from https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv

# Load packages used for data wrangling (dplyr is attached as part of the tidyverse)
library(tidyverse)
library(dplyr)

# Read the raw data and keep a working copy
raw_df <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv")
df <- raw_df

Many of the variables required conversion to be suitable for analysis and regression. Many of the variables were Yes or No, represented by character data. These values were converted to binary numerical data.

# Converting Yes/No character predictors to binary numeric values (N = 0, Y = 1)
df <- df %>%
  mutate(Accommodations              = dplyr::recode(Accommodations, "N" = 0, "Y" = 1),
         Probation                   = dplyr::recode(Probation, "N" = 0, "Y" = 1),
         LegalAnalysis_TexasPractice = dplyr::recode(LegalAnalysis_TexasPractice, "N" = 0, "Y" = 1),
         AdvLegalPerfSkills          = dplyr::recode(AdvLegalPerfSkills, "N" = 0, "Y" = 1),
         AdvLegalAnalysis            = dplyr::recode(AdvLegalAnalysis, "N" = 0, "Y" = 1))

# Converting BarPrepMentor and OptIntoWritingGuide to binary (any mentor = 1, no mentor = 0)
df <- df %>%
  mutate(BarPrepMentor = if_else(BarPrepMentor == "N", 0, 1)) %>%
  mutate(OptIntoWritingGuide = if_else(OptIntoWritingGuide == "Y", 1, 0))

Other character variables, such as BarPrepCompany and StudentSuccessInitiative, were converted to factors. Course scores were converted from character data to factors with an explicit grade-level order.

# Converting predictors to factors
df$BarPrepCompany <- as.factor(df$BarPrepCompany)
df$StudentSuccessInitiative <- as.factor(df$StudentSuccessInitiative)

# Converting course scores to factors with an explicit grade-level order (best to worst)
df$CivPro <- factor(df$CivPro, levels = c("A", "B+", "B", "C+", "C", "D+", "D", "F"))
df$LPI    <- factor(df$LPI,    levels = c("A", "B+", "B", "C+", "C", "D+", "D", "F"))
df$LPII   <- factor(df$LPII,   levels = c("A", "B+", "B", "C+", "C", "D+", "D", "CR"))

BarPrepCompletion was a numerical variable, but many of the entries were NA when the student did not participate in a prep program. These NAs were converted to 0 so the variable could be treated as numerical.

# Converting Barprep Completion NA to 0
df$BarPrepCompletion <- df$BarPrepCompletion  %>% replace_na(0)

The response variable, PassFail, was converted from a character variable to a binary numerical variable, with F mapping to 0 and P mapping to 1. The other components of the bar exam score were removed from the data set, and any rows containing an NA value were removed.

# Converting Response to binary
df <- df %>% mutate(PassFail = dplyr::recode(PassFail, "F"=0, "P"=1))

# Removing the component and composite bar exam score columns (alternative forms of the response)
df_trim <- df[,-c(23:28)]

# Selecting rows with NA values in any column
rows_with_na <- df_trim[apply(
  df_trim, 
  1, 
  function(x) any(is.na(x))
), ]

# Printing selected rows (commented out for report readability)
#print(rows_with_na)

# Remove rows with NAs (indices identified from the printed rows_with_na output)
df_trim <- df_trim[-c(229, 307, 311, 349, 352, 415, 432, 451, 470),]
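
For reference, the same cleaning step can be expressed without hard-coded row indices. The following is a minimal sketch, not part of the original analysis, assuming the only rows to drop are exactly those containing an NA; the name df_trim_alt is used only for illustration.

# Alternative sketch: drop every row containing an NA without hard-coding indices
# (equivalent to the manual removal above under the stated assumption)
df_trim_alt <- tidyr::drop_na(df[,-c(23:28)])
nrow(df_trim_alt)   # should equal nrow(df_trim) after the manual removal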

3.2 Exploratory Data Plotting

Multiple continuous predictor variables were plotted on boxplots grouped by passing and failing the bar exam to see if there was an observable difference between the groups. While LSAT, age, and undergraduate GPA did not show a large difference between the pass and fail groups, both 1st year GPA and 3rd year GPA showed larger differences. This suggests that these variables could be useful predictors of passing the bar exam.

# Exploratory Data Analysis Plotting
boxplot(df_trim$LSAT ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="LSAT Score", xlab="", main="LSAT Score grouped by PassFail")

boxplot(df_trim$Age ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="Age", xlab="", main="Age grouped by PassFail")

boxplot(df_trim$UGPA ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="Undergrad GPA", xlab="", main="Undergrad GPA grouped by PassFail")

boxplot(df_trim$GPA_1L ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="1st Year GPA", xlab="", main="1st Year GPA grouped by PassFail")

boxplot(df_trim$GPA_Final ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="3rd Year GPA", xlab="", main="3rd Year GPA grouped by PassFail")

Next, the numerical variables were plotted on a correlation plot to check for high levels of multicollinearity. As expected, 1st year GPA, 3rd year GPA, and final rank percentile have higher correlations. There is also some correlation among the elective course enrollments. In general, though, most of the variables appear to have lower levels of correlation.

# Multicollinearity analysis
library(corrplot)
library(car)
df_trim_num <- df_trim[,-c(1,2,6,7,8,17,19,21)]
colnames(df_trim_num) <- c("Age", "LSAT", "UGPA", "GPA_1L", "GPA_Final",
                           "Rank%", "Accom", "Prob", "LegAnalysis", "AdvLegPerf", "AdvLegAn",
                           "BarPrepComp", "PrepWorkshops", "Mentor")
corrplot(cor(df_trim_num[,-c(1:2)]),method="number", type="upper",
         main="Correlation of Numerical Law Program Parameters", mar=c(0,0,2,0), tl.cex=.75, number.cex=.5)

4 Regression

Multiple logistic regression models were evaluated to determine the best relationship to use to predict the response.

4.1 Manual Logistic Regression Trials

Multiple logistic regression models were evaluated. The first model utilized a larger number of terms. From the first model, the terms showing the most significance were selected for a second model. Additional terms were added for a third model, and finally a fourth model was developed by regressing on LSAT score alone. The AIC was compared across these models. The second model, utilizing LSAT, GPA_Final, and BarPrepCompletion, resulted in the lowest AIC.

# GLM model trials
model <- glm(PassFail~LSAT+CivPro+LPI+LPII+GPA_Final+BarPrepCompletion+BarPrepMentor+
               X.LawSchoolBarPrepWorkshops+Probation+LegalAnalysis_TexasPractice+
               AdvLegalPerfSkills+AdvLegalAnalysis+UGPA,family=binomial(link="logit"),data=df_trim)
summary(model) #AIC 269.91
## 
## Call:
## glm(formula = PassFail ~ LSAT + CivPro + LPI + LPII + GPA_Final + 
##     BarPrepCompletion + BarPrepMentor + X.LawSchoolBarPrepWorkshops + 
##     Probation + LegalAnalysis_TexasPractice + AdvLegalPerfSkills + 
##     AdvLegalAnalysis + UGPA, family = binomial(link = "logit"), 
##     data = df_trim)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -52.93092   12.38472  -4.274 1.92e-05 ***
## LSAT                           0.18713    0.06291   2.975  0.00293 ** 
## CivProB+                      -0.39855    1.16375  -0.342  0.73200    
## CivProB                        0.17714    1.15844   0.153  0.87847    
## CivProC+                      -0.33557    1.17958  -0.284  0.77604    
## CivProC                       -1.37890    1.18884  -1.160  0.24610    
## CivProD+                      -0.64253    1.38249  -0.465  0.64210    
## CivProD                      -18.61508 3956.18059  -0.005  0.99625    
## CivProF                       14.35304 3956.18060   0.004  0.99711    
## LPIB+                          0.70609    0.79300   0.890  0.37325    
## LPIB                           1.35488    0.78470   1.727  0.08424 .  
## LPIC+                          1.02897    0.81236   1.267  0.20528    
## LPIC                           2.59949    1.03492   2.512  0.01201 *  
## LPID+                          1.80771    1.41726   1.275  0.20214    
## LPID                           2.20035    1.62548   1.354  0.17584    
## LPIF                         -14.55752 3956.18051  -0.004  0.99706    
## LPIIB+                         1.12117    0.82964   1.351  0.17657    
## LPIIB                          0.91476    0.80925   1.130  0.25831    
## LPIIC+                         1.78718    0.92992   1.922  0.05462 .  
## LPIIC                          2.41666    1.10381   2.189  0.02857 *  
## LPIID+                        18.19108 1619.78914   0.011  0.99104    
## LPIID                         20.22123 1684.32417   0.012  0.99042    
## LPIICR                         0.92036    0.95168   0.967  0.33350    
## GPA_Final                      6.34285    1.15763   5.479 4.27e-08 ***
## BarPrepCompletion              1.46270    0.69406   2.107  0.03508 *  
## BarPrepMentor                 -0.15815    0.47992  -0.330  0.74176    
## X.LawSchoolBarPrepWorkshops    0.09903    0.10605   0.934  0.35041    
## Probation                     -0.72214    0.58032  -1.244  0.21335    
## LegalAnalysis_TexasPractice   -0.49649    0.63517  -0.782  0.43441    
## AdvLegalPerfSkills            -0.06010    0.68742  -0.087  0.93033    
## AdvLegalAnalysis               0.41853    0.53947   0.776  0.43786    
## UGPA                           0.94925    0.60328   1.573  0.11561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 330.40  on 466  degrees of freedom
## Residual deviance: 205.91  on 435  degrees of freedom
## AIC: 269.91
## 
## Number of Fisher Scoring iterations: 16
model2 <- glm(PassFail~LSAT+GPA_Final+BarPrepCompletion,family=binomial(link="logit"),data=df_trim)
summary(model2) #AIC 256.33
## 
## Call:
## glm(formula = PassFail ~ LSAT + GPA_Final + BarPrepCompletion, 
##     family = binomial(link = "logit"), data = df_trim)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -38.58247    7.75546  -4.975 6.53e-07 ***
## LSAT                0.16922    0.04634   3.652  0.00026 ***
## GPA_Final           4.36227    0.68417   6.376 1.82e-10 ***
## BarPrepCompletion   1.06031    0.57624   1.840  0.06576 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 330.40  on 466  degrees of freedom
## Residual deviance: 248.33  on 463  degrees of freedom
## AIC: 256.33
## 
## Number of Fisher Scoring iterations: 6
model3 <- glm(PassFail~LSAT+GPA_Final+BarPrepCompletion+LPI+LPII,family=binomial(link="logit"),data=df_trim)
summary(model3) #AIC 260.17
## 
## Call:
## glm(formula = PassFail ~ LSAT + GPA_Final + BarPrepCompletion + 
##     LPI + LPII, family = binomial(link = "logit"), data = df_trim)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -47.82288    9.04941  -5.285 1.26e-07 ***
## LSAT                 0.17144    0.05003   3.427  0.00061 ***
## GPA_Final            6.64253    1.00496   6.610 3.85e-11 ***
## BarPrepCompletion    1.20740    0.64830   1.862  0.06254 .  
## LPIB+                0.45388    0.74096   0.613  0.54017    
## LPIB                 1.16167    0.73051   1.590  0.11179    
## LPIC+                0.75296    0.76098   0.989  0.32244    
## LPIC                 2.18059    0.96526   2.259  0.02388 *  
## LPID+                1.53832    1.25352   1.227  0.21975    
## LPID                 1.57623    1.53323   1.028  0.30393    
## LPIF               -15.35004 3956.18046  -0.004  0.99690    
## LPIIB+               0.92109    0.79506   1.159  0.24665    
## LPIIB                0.54041    0.76463   0.707  0.47972    
## LPIIC+               1.19529    0.84629   1.412  0.15784    
## LPIIC                2.00591    1.01368   1.979  0.04783 *  
## LPIID+              16.93209 1715.56339   0.010  0.99213    
## LPIID               19.40789 1656.17433   0.012  0.99065    
## LPIICR               0.44952    0.88680   0.507  0.61223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 330.40  on 466  degrees of freedom
## Residual deviance: 224.17  on 449  degrees of freedom
## AIC: 260.17
## 
## Number of Fisher Scoring iterations: 16
model4 <- glm(PassFail~LSAT,family=binomial(link="logit"), data=df_trim)
summary(model4) #AIC 320.73
## 
## Call:
## glm(formula = PassFail ~ LSAT, family = binomial(link = "logit"), 
##     data = df_trim)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.96190    6.24979  -3.354 0.000796 ***
## LSAT          0.14885    0.04059   3.667 0.000245 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 330.40  on 466  degrees of freedom
## Residual deviance: 316.73  on 465  degrees of freedom
## AIC: 320.73
## 
## Number of Fisher Scoring iterations: 5
#model 2 was the best manual trial with an AIC of 256.33

4.2 Stepwise Regression

Both backward and forward stepwise regression were used to find an optimal logistic regression model. The backward stepwise regression yielded a model with an AIC of 230.43. The forward method yielded a model with an AIC of 231.51.

# Backwards stepwise regression
model5 <- step(glm(PassFail~., family=binomial(link="logit"), data=df_trim), direction="backward")

summary(model5) #AIC 230.43.

# Forward stepwise regression
fitstart<-glm(PassFail~1, family=binomial(link="logit"),data=df_trim)
fitall <-glm(PassFail~., family=binomial(link="logit"), data=df_trim)
model6<- step(fitstart, scope=formula(fitall), direction="forward")
summary(model6) #AIC 231.51
 #Better AIC found with backward stepwise regression

The backwards regression method resulted in a model that had a lower AIC than all other models evaluated. The formula for this model is as follows:

PassFail ~ Year + Age + LSAT + LPII + GPA_1L + FinalRankPercentile + AdvLegalPerfSkills + BarPrepCompany + BarPrepCompletion + StudentSuccessInitiative

4.3 Comparing Deviance

The backward stepwise model resulted in the lowest AIC of all the models evaluated. Evaluating the null and residual deviance for this model gives an R squared of 0.56. The next best model according to AIC was the forward stepwise model, which also had an R squared of 0.56. The third best model according to AIC had an R squared of 0.25. This shows that the two stepwise models were roughly equivalent according to both AIC and R squared, and performed substantially better than the manual trials according to R squared.
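
The R squared values above follow the deviance formula from Section 2.3. A minimal sketch of the computation is shown below, assuming the model objects from Sections 4.1 and 4.2 are still in the workspace.

# Deviance-based pseudo R squared, per the formula in Section 2.3
pseudo_r2 <- function(fit) 1 - fit$deviance / fit$null.deviance   # 1 - D / D0
pseudo_r2(model5)   # backward stepwise model
pseudo_r2(model6)   # forward stepwise model
pseudo_r2(model2)   # best manual trial (LSAT + GPA_Final + BarPrepCompletion)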

5 Results and Conclusion

In conclusion, the best performing regression model shows that a number of predictors are significant. Among the most significant parameters are LSAT score, bar preparation company, and LPII score.

Recommendations for improving the proportion of students who pass the bar exam include the following:

  1. Raise the LSAT admission requirements: Increasing the required LSAT score would bias the program toward students capable of performing better on the LSAT, which could improve the proportion of students who pass the bar.
  2. Increase bar prep utilization: The law school can provide incentives to use the bar prep company that had the most significant positive impact on bar exam results (see the sketch after this list). Raising awareness of these options could also drive improvement.
  3. Improve LPII knowledge and performance: Student scores in the LPII course were shown to be a significant predictor of bar performance. The curriculum for this course could be expanded to improve knowledge in this area, and tutoring or other support in this subject could also be added to increase proficiency.
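
As a rough way to see which bar prep company (and which other predictors) is associated with the largest change in the odds of passing, the coefficients of the backward stepwise model can be exponentiated into odds ratios. This is only a sketch and assumes model5 from Section 4.2 is available in the workspace.

# Sketch: odds ratios from the backward stepwise model (model5, Section 4.2)
# exp(coefficient) is the multiplicative change in the odds of passing
odds_ratios <- exp(coef(model5))[-1]     # drop the intercept
sort(odds_ratios, decreasing = TRUE)     # larger values = stronger positive association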

6 Complete R Code

# Load packages used for data wrangling (dplyr is attached as part of the tidyverse)
library(tidyverse)
library(dplyr)

# Read the raw data and keep a working copy
raw_df <- read.csv("https://raw.githubusercontent.com/tmatis12/datafiles/refs/heads/main/Updated_Bar_Data_For_Review_Final.csv")
df <- raw_df

# Converting Yes/No character predictors to binary numeric values (N = 0, Y = 1)
df <- df %>%
  mutate(Accommodations              = dplyr::recode(Accommodations, "N" = 0, "Y" = 1),
         Probation                   = dplyr::recode(Probation, "N" = 0, "Y" = 1),
         LegalAnalysis_TexasPractice = dplyr::recode(LegalAnalysis_TexasPractice, "N" = 0, "Y" = 1),
         AdvLegalPerfSkills          = dplyr::recode(AdvLegalPerfSkills, "N" = 0, "Y" = 1),
         AdvLegalAnalysis            = dplyr::recode(AdvLegalAnalysis, "N" = 0, "Y" = 1))

# Converting BarPrepMentor and OptIntoWritingGuide to binary (any mentor = 1, no mentor = 0)
df <- df %>%
  mutate(BarPrepMentor = if_else(BarPrepMentor == "N", 0, 1)) %>%
  mutate(OptIntoWritingGuide = if_else(OptIntoWritingGuide == "Y", 1, 0))

# Converting predictors to factors
df$BarPrepCompany <- as.factor(df$BarPrepCompany)
df$StudentSuccessInitiative <- as.factor(df$StudentSuccessInitiative)


# Converting Barprep Completion NA to 0
df$BarPrepCompletion <- df$BarPrepCompletion  %>% replace_na(0)

# Converting course scores to factors with an explicit grade-level order (best to worst)
df$CivPro <- factor(df$CivPro, levels = c("A", "B+", "B", "C+", "C", "D+", "D", "F"))
df$LPI    <- factor(df$LPI,    levels = c("A", "B+", "B", "C+", "C", "D+", "D", "F"))
df$LPII   <- factor(df$LPII,   levels = c("A", "B+", "B", "C+", "C", "D+", "D", "CR"))

# Converting Response to binary
df <- df %>% mutate(PassFail = dplyr::recode(PassFail, "F"=0, "P"=1))

# Removing additional outputs
df_trim <- df[,-c(23:28)]

# Selecting rows with NA values in any column
rows_with_na <- df_trim[apply(
  df_trim, 
  1, 
  function(x) any(is.na(x))
), ]

# Printing selected rows
print(rows_with_na)
 
# Remove rows with NAs
df_trim <- df_trim[-c(229, 307, 311, 349, 352, 415, 432, 451, 470),]

# Exploratory Data Analysis Plotting
boxplot(df_trim$LSAT ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="LSAT Score", xlab="", main="LSAT Score grouped by PassFail")
boxplot(df_trim$Age ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="Age", xlab="", main="Age grouped by PassFail")
boxplot(df_trim$UGPA ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="Undergrad GPA", xlab="", main="Undergrad GPA grouped by PassFail")
boxplot(df_trim$GPA_1L ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="1st Year GPA", xlab="", main="1st Year GPA grouped by PassFail")
boxplot(df_trim$GPA_Final ~ df_trim$PassFail, names=c("Fail", "Pass"), ylab="3rd Year GPA", xlab="", main="3rd Year GPA grouped by PassFail")

# Multicollinearity analysis
library(corrplot)
library(car)
df_trim_num <- df_trim[,-c(1,2,6,7,8,17,19,21)]
colnames(df_trim_num) <- c("Age", "LSAT", "UGPA", "GPA_1L", "GPA_Final",
                           "Rank%", "Accom", "Prob", "LegAnalysis", "AdvLegPerf", "AdvLegAn",
                           "BarPrepComp", "PrepWorkshops", "Mentor")
corrplot(cor(df_trim_num[,-c(1:2)]),method="number", type="upper",
         main="Correlation of Numerical Law Program Parameters", mar=c(0,0,2,0), tl.cex=.75, number.cex=.5)

# GLM model trials
model <- glm(PassFail~LSAT+CivPro+LPI+LPII+GPA_Final+BarPrepCompletion+BarPrepMentor+
               X.LawSchoolBarPrepWorkshops+Probation+LegalAnalysis_TexasPractice+
               AdvLegalPerfSkills+AdvLegalAnalysis+UGPA,family=binomial(link="logit"),data=df_trim)
summary(model) #AIC 269.91

model2 <- glm(PassFail~LSAT+GPA_Final+BarPrepCompletion,family=binomial(link="logit"),data=df_trim)
summary(model2) #AIC 256.33

model3 <- glm(PassFail~LSAT+GPA_Final+BarPrepCompletion+LPI+LPII,family=binomial(link="logit"),data=df_trim)
summary(model3) #AIC 260.17

model4 <- glm(PassFail~LSAT,family=binomial(link="logit"), data=df_trim)
summary(model4) #AIC 320.73

#model 2 was the best manual trial with an AIC of 256.33

# Backwards stepwise regression
model5 <- step(glm(PassFail~., family=binomial(link="logit"), data=df_trim), direction="backward")

summary(model5) #AIC 230.43.

# Forward stepwise regression
fitstart<-glm(PassFail~1, family=binomial(link="logit"),data=df_trim)
fitall <-glm(PassFail~., family=binomial(link="logit"), data=df_trim)
model6<- step(fitstart, scope=formula(fitall), direction="forward")
summary(model6) #AIC 231.51
 #Better AIC found with backward stepwise regression

# Combine terms from backwards and forward regression results

model7 <-glm(formula = PassFail ~ Year + Age + LSAT + CivPro + LPII + 
               GPA_1L + LegalAnalysis_TexasPractice + BarPrepCompany + BarPrepCompletion + 
               StudentSuccessInitiative +GPA_Final, family=binomial(link="logit"), data = df_trim) 
summary(model7) #AIC 243.59
  #Combined terms model had higher AIC than result of backwards stepwise regression