#loading libraries
library(readr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
#loading data
df<-read_csv("Application2.csv")
## Rows: 2159 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Major, Degree, Condition
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#examining the data
describe(df)
## df
##
## 3 Variables 2159 Observations
## --------------------------------------------------------------------------------
## Major
##        n  missing distinct
##     2159        0        3
##
## Value       Business Education      Math
## Frequency        720       720       719
## Proportion     0.333     0.333     0.333
## --------------------------------------------------------------------------------
## Degree
##        n  missing distinct
##     2159        0        3
##
## Value        Bachelor HighSchool     Master
## Frequency         720        719        720
## Proportion      0.333      0.333      0.333
## --------------------------------------------------------------------------------
## Condition
##        n  missing distinct
##     2159        0        2
##
## Value       High   Low
## Frequency   1432   727
## Proportion 0.663 0.337
## --------------------------------------------------------------------------------
summary(df)
## Major Degree Condition
## Length:2159 Length:2159 Length:2159
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
#data prep: applying the factor function for categorical variables
df$Major<-as.factor(df$Major)
df$Degree<-as.factor(df$Degree)
df$Condition<-as.factor(df$Condition)
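#model: logistic regression; with a factor response, glm() treats the first level (High) as failure and models the probability of the other level (Low)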
model <- glm(Condition ~ Major + Degree, data = df, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Condition ~ Major + Degree, family = "binomial",
## data = df)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.0381  -0.9483  -0.7937   1.3456   1.6730
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -0.7388     0.1007  -7.336  2.2e-13 ***
## MajorEducation     0.1727     0.1137   1.519  0.12883
## MajorMath          0.1225     0.1146   1.069  0.28524
## DegreeHighSchool   0.2293     0.1100   2.085  0.03704 *
## DegreeMaster      -0.3772     0.1172  -3.218  0.00129 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2758.5 on 2158 degrees of freedom
## Residual deviance: 2724.7 on 2154 degrees of freedom
## AIC: 2734.7
##
## Number of Fisher Scoring iterations: 4
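A logistic regression model was fitted to predict the level of an individual's starting salary (Condition: High or Low) from Degree and Major. Because R treats the first factor level (High) as the reference outcome, the coefficients describe the log-odds of Condition being Low, with Business and Bachelor as the reference categories. Both Degree terms were statistically significant: DegreeMaster (z = -3.218, p = .001) and DegreeHighSchool (z = 2.085, p = .04). Neither Major term was significant: MajorEducation (z = 1.519, p = .13) and MajorMath (z = 1.069, p = .29). The intercept was also significant (z = -7.336, p < .001). The residual deviance (2724.7 on 2154 degrees of freedom) is 33.8 lower than the null deviance (2758.5 on 2158 degrees of freedom), so the model with predictors fits significantly better than the intercept-only model (likelihood-ratio chi-square = 33.8 on 4 degrees of freedom, p < .001). This indicates an improvement over the null model rather than a good fit in an absolute sense.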
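The deviance comparison can be checked with a likelihood-ratio test. The sketch below is a minimal illustration that uses only the deviances and degrees of freedom printed in the `summary(model)` output; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(model) output
null_deviance  <- 2758.5
resid_deviance <- 2724.7
null_df        <- 2158
resid_df       <- 2154

# Likelihood-ratio statistic: drop in deviance when the predictors are added
lr_stat <- null_deviance - resid_deviance
lr_df   <- null_df - resid_df

# Upper-tail chi-square probability for that drop
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df; p-value:", format.pval(p_value), "\n")
```

With the fitted object available, `anova(model, test = "Chisq")` reports the same kind of comparison term by term.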
Thus, by plugging an individual's degree and major into the fitted equation, we can predict the probability that their salary level (Condition) is Low, and by complement the probability that it is High.
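As a rough illustration of that plugging-in step, the sketch below rebuilds the linear predictor by hand from the coefficients reported in the summary and converts it to a probability with `plogis()`. The helper `predict_low_prob()` is hypothetical, not part of the original analysis. It assumes Business and Bachelor are the reference categories and that the modeled outcome is Condition = Low, which follows from R's default alphabetical factor ordering.

```r
# Coefficients copied from the summary(model) output (log-odds scale)
coefs <- c(
  "(Intercept)"      = -0.7388,
  "MajorEducation"   =  0.1727,
  "MajorMath"        =  0.1225,
  "DegreeHighSchool" =  0.2293,
  "DegreeMaster"     = -0.3772
)

# Hypothetical helper: probability of Condition == "Low" for one profile.
# Reference levels (Business, Bachelor) have no coefficient and add nothing.
# Note: an unrecognized level is also treated as a reference level here.
predict_low_prob <- function(major, degree) {
  eta <- coefs[["(Intercept)"]]
  major_term  <- paste0("Major", major)
  degree_term <- paste0("Degree", degree)
  if (major_term %in% names(coefs))  eta <- eta + coefs[[major_term]]
  if (degree_term %in% names(coefs)) eta <- eta + coefs[[degree_term]]
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

# Odds ratios for each term relative to its reference category
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Example profiles
p_baseline <- predict_low_prob("Business", "Bachelor")
p_master   <- predict_low_prob("Business", "Master")
cat("P(Low | Business, Bachelor):", round(p_baseline, 3), "\n")
cat("P(Low | Business, Master):  ", round(p_master, 3), "\n")
```

With the fitted object in memory, `predict(model, newdata = data.frame(Major = "Business", Degree = "Master"), type = "response")` should give the same kind of probability without copying coefficients by hand.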