#loading libraries
library(readr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
#loading data
df <- read_csv("Application2.csv")
## Rows: 2159 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Major, Degree, Condition
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#examining the data
describe(df)
## df 
## 
##  3  Variables      2159  Observations
## --------------------------------------------------------------------------------
## Major 
##        n  missing distinct 
##     2159        0        3 
##                                         
## Value       Business Education      Math
## Frequency        720       720       719
## Proportion     0.333     0.333     0.333
## --------------------------------------------------------------------------------
## Degree 
##        n  missing distinct 
##     2159        0        3 
##                                            
## Value        Bachelor HighSchool     Master
## Frequency         720        719        720
## Proportion      0.333      0.333      0.333
## --------------------------------------------------------------------------------
## Condition 
##        n  missing distinct 
##     2159        0        2 
##                       
## Value       High   Low
## Frequency   1432   727
## Proportion 0.663 0.337
## --------------------------------------------------------------------------------
summary(df)
##     Major              Degree           Condition        
##  Length:2159        Length:2159        Length:2159       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
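#data prep: converting the categorical variables to factors
#as.factor() orders levels alphabetically, so Condition has levels High, Low
#and glm() below models the probability of the second level, Low
df$Major <- as.factor(df$Major)
df$Degree <- as.factor(df$Degree)
df$Condition <- as.factor(df$Condition)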

#model
model <- glm(Condition ~ Major + Degree, data = df, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = Condition ~ Major + Degree, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0381  -0.9483  -0.7937   1.3456   1.6730  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.7388     0.1007  -7.336  2.2e-13 ***
## MajorEducation     0.1727     0.1137   1.519  0.12883    
## MajorMath          0.1225     0.1146   1.069  0.28524    
## DegreeHighSchool   0.2293     0.1100   2.085  0.03704 *  
## DegreeMaster      -0.3772     0.1172  -3.218  0.00129 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2758.5  on 2158  degrees of freedom
## Residual deviance: 2724.7  on 2154  degrees of freedom
## AIC: 2734.7
## 
## Number of Fisher Scoring iterations: 4
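
To read the signs of these coefficients, it helps to know which level of Condition is being modeled. For a factor response, glm() treats the first level as failure and the other level as success, so with the alphabetical levels High and Low the model describes the probability of Low. The sketch below illustrates this on a small made-up vector (toy_condition is hypothetical and is not the Application2.csv data):

```r
# Toy response, invented for illustration only (not the real data)
toy_condition <- as.factor(c("High", "High", "High", "Low"))

# as.factor() sorts levels alphabetically: "High" first, "Low" second
print(levels(toy_condition))

# Intercept-only logistic regression on the toy response
toy_fit <- glm(toy_condition ~ 1, family = "binomial")

# Back-transform the intercept from log-odds to a probability
modeled_prob <- unname(plogis(coef(toy_fit)))

# Share of "Low" in the toy vector, for comparison
share_low <- mean(toy_condition == "Low")
print(c(modeled = modeled_prob, share_low = share_low))
```

The modeled probability matches the share of Low rather than the share of High. The same reading fits the intercept above: plogis(-0.7388) is about 0.32 for the Business/Bachelor reference group, close to the 0.337 share of Low in the data.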

Summary

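A logistic regression model was fitted to predict the level of an individual's starting salary (Condition: High or Low) from Degree and Major. Because R models the second factor level, the coefficients describe the log-odds of Condition being Low, with Business and Bachelor as the reference categories. Both Degree terms were statistically significant: DegreeMaster (z = -3.218, p = .001) lowered the odds of a Low salary and DegreeHighSchool (z = 2.085, p = .037) raised them. Neither Major term was significant: MajorEducation (z = 1.519, p = .129) and MajorMath (z = 1.069, p = .285). The intercept was also significant (z = -7.336, p < .001). The difference between the null deviance and the residual deviance is 33.8 on 4 degrees of freedom, which indicates that the model fits significantly better than an intercept-only model; on its own it does not show that the overall fit is good.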
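
As a rough check of the deviance comparison and of the effect sizes, the sketch below works directly from the numbers printed in the summary(model) output, so it runs without the data file. The variable names are illustrative; with the fitted object available, exp(coef(model)) would give the odds ratios directly.

```r
# Deviances and degrees of freedom copied from the summary(model) output
null_dev  <- 2758.5
null_df   <- 2158
resid_dev <- 2724.7
resid_df  <- 2154

# Likelihood-ratio statistic: drop in deviance from the null model
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

# Upper-tail chi-square probability for that drop
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = lr_p))

# Slope estimates copied from the coefficient table (log-odds scale)
coefs <- c(MajorEducation   =  0.1727,
           MajorMath        =  0.1225,
           DegreeHighSchool =  0.2293,
           DegreeMaster     = -0.3772)

# Exponentiate to get odds ratios relative to the reference categories
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio below 1 (DegreeMaster) means lower odds of a Low salary than the Bachelor reference group, and an odds ratio above 1 (DegreeHighSchool) means higher odds.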

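Thus, by plugging an individual's degree and major into the fitted equation, logit(P(Condition = Low)) = -0.7388 + 0.1727*MajorEducation + 0.1225*MajorMath + 0.2293*DegreeHighSchool - 0.3772*DegreeMaster, we can estimate the probability that their starting salary level (Condition) is Low, with the probability of High as its complement.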
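
A minimal sketch of that plug-in step, again using the coefficients copied from the output above rather than the fitted object. The helper predict_low() is hypothetical and assumes, as R's default factor coding implies, that the model describes the probability of Condition being Low, with Business and Bachelor as the reference levels. With the model object in hand, predict(model, newdata, type = "response") does the same job.

```r
# Coefficients copied from the summary(model) output (log-odds scale)
coefs <- c("(Intercept)"    = -0.7388,
           MajorEducation   =  0.1727,
           MajorMath        =  0.1225,
           DegreeHighSchool =  0.2293,
           DegreeMaster     = -0.3772)

# Hypothetical helper: estimated probability that Condition is "Low"
predict_low <- function(major, degree) {
  stopifnot(major %in% c("Business", "Education", "Math"),
            degree %in% c("Bachelor", "HighSchool", "Master"))
  eta <- coefs[["(Intercept)"]]
  major_term  <- paste0("Major", major)
  degree_term <- paste0("Degree", degree)
  # Reference levels (Business, Bachelor) have no term and add nothing
  if (major_term %in% names(coefs))  eta <- eta + coefs[[major_term]]
  if (degree_term %in% names(coefs)) eta <- eta + coefs[[degree_term]]
  plogis(eta)  # inverse logit: log-odds to probability
}

# Example: a Business major with a Master's degree
p_low <- predict_low("Business", "Master")
print(c(P_Low = p_low, P_High = 1 - p_low))
```

The helper rejects values outside the levels seen in the data instead of silently treating them as the reference category.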