Simple Logistic Regression

Task 1. Determine whether the chance of getting a high salary differed between male and female

professors. The libraries needed for this task: table1, epiDisplay. 1.1 Provide a brief description of the project

The cross-sectional investigation of 397 professors to determine whether the chance to get a high salary differed between male and female professors.

Null hypothesis: Professors’ sexes were not associated with the chance to get a high salary.

Alternative hypothesis: Professors’ sexes were associated with the chance to get a high salary ###

1.2 Import the “Professorial Salaries” and name this dataset “salary”

salary = read.csv("C:\\Users\\User\\Documents\\UTS\\AUTUMN 2024\\TRM\\Data Analyst R Int\\Professorial Salaries.csv")

1.3 Create a new variable ‘high.salary’ using the threshold of $130,000 high.salary: 1 = Salary> $130,000; 0 = otherwise

salary$high.salary = ifelse(salary$Salary>= 130000, 1, 0)

1.4 Describe the characteristics of the study sample by sexes

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~ Rank + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary + Sex | high.salary, data = salary)

## Warning in table1.formula(~Rank + Discipline + Yrs.since.phd + Yrs.service + :
## Terms to the right of '|' in formula 'x' define table columns and are expected
## to be factors with meaningful labels.

	0 (N=287)	1 (N=110)	Overall (N=397)
Rank
AssocProf	64 (22.3%)	0 (0%)	64 (16.1%)
AsstProf	67 (23.3%)	0 (0%)	67 (16.9%)
Prof	156 (54.4%)	110 (100%)	266 (67.0%)
Discipline
A	140 (48.8%)	41 (37.3%)	181 (45.6%)
B	147 (51.2%)	69 (62.7%)	216 (54.4%)
Yrs.since.phd
Mean (SD)	19.7 (13.1)	29.2 (9.41)	22.3 (12.9)
Median [Min, Max]	17.0 [1.00, 56.0]	29.0 [11.0, 56.0]	21.0 [1.00, 56.0]
Yrs.service
Mean (SD)	15.4 (13.0)	23.3 (11.2)	17.6 (13.0)
Median [Min, Max]	11.0 [0, 57.0]	21.0 [2.00, 60.0]	16.0 [0, 60.0]
NPubs
Mean (SD)	17.3 (13.0)	20.3 (16.2)	18.2 (14.0)
Median [Min, Max]	13.0 [1.00, 69.0]	16.0 [1.00, 69.0]	13.0 [1.00, 69.0]
Ncits
Mean (SD)	38.7 (15.6)	44.3 (19.4)	40.2 (16.9)
Median [Min, Max]	35.0 [1.00, 90.0]	41.0 [1.00, 90.0]	35.0 [1.00, 90.0]
Salary
Mean (SD)	98400 (16900)	154000 (18900)	114000 (30300)
Median [Min, Max]	100000 [57800, 130000]	149000 [131000, 232000]	107000 [57800, 232000]
Sex
Female	34 (11.8%)	5 (4.5%)	39 (9.8%)
Male	253 (88.2%)	105 (95.5%)	358 (90.2%)

1.5 Fit a simple logistic regression to determine whether the chance of getting a high salary differed between male and female professors. Interpret the finding

m.1 = glm(high.salary ~ Sex, family = "binomial", data = salary)
summary(m.1)

## 
## Call:
## glm(formula = high.salary ~ Sex, family = "binomial", data = salary)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9169     0.4790  -4.002 6.28e-05 ***
## SexMale       1.0375     0.4928   2.105   0.0353 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 468.60  on 396  degrees of freedom
## Residual deviance: 463.11  on 395  degrees of freedom
## AIC: 467.11
## 
## Number of Fisher Scoring iterations: 4

# install.packages("epiDisplay")
library(epiDisplay)

## Loading required package: foreign

## Loading required package: survival

## Loading required package: MASS

## Loading required package: nnet

logistic.display(m.1)

## 
## Logistic regression predicting high.salary 
##  
##                  OR(95%CI)         P(Wald's test) P(LR-test)
## Sex (cont. var.) 2.82 (1.07,7.41)  0.035          0.019     
##                                                             
## Log-likelihood = -231.5529
## No. of observations = 397
## AIC value = 467.1058

Interpretation: There is evidence (P= .035) that male professors were associated with 2.8-fold greater odds of getting a high salary than their female counterparts, ranging from 7% to 7.4-fold greater.

Task 2: Save all your work and publish it into your Rpubs account.

Multiple Logistic Regression

Task 1. Determine whether the chance of getting a high salary differed between male and female

professors. The libraries needed for this task: table1, Publish. 1.1 Provide a brief description of the project

The cross-sectional investigation of 397 professors to determine whether the chance to get a high salary differed between male and female professors, accounting for potential confounding effects.

Null hypothesis: Professors’ sexes were not independently associated with the chance to get a high salary.

Alternative hypothesis: Professors’ sexes were independently associated with the chance to get a high salary ###

1.2 Import the “Professorial Salaries” and name this dataset “salary”

salary = read.csv("C:\\Users\\User\\Documents\\UTS\\AUTUMN 2024\\TRM\\Data Analyst R Int\\Professorial Salaries.csv")

1.3 Create a new variable ‘high.salary’ using the threshold of $120,000 high.salary: 1 = Salary> $120,000; 0 = otherwise

salary$high.salary = ifelse(salary$Salary> 120000, 1, 0)
salary$Prof[salary$Rank == "Prof"] = "Professor"
salary$Prof[salary$Rank != "Prof"] = "AssProfessor"

1.4 Describe the characteristics of the study sample by salary status and predefine potential covariates for multiple regression model

library(table1)
table1(~ Rank + Prof + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary + Sex | as.factor(high.salary), data = salary)

	0 (N=255)	1 (N=142)	Overall (N=397)
Rank
AssocProf	63 (24.7%)	1 (0.7%)	64 (16.1%)
AsstProf	67 (26.3%)	0 (0%)	67 (16.9%)
Prof	125 (49.0%)	141 (99.3%)	266 (67.0%)
Prof
AssProfessor	130 (51.0%)	1 (0.7%)	131 (33.0%)
Professor	125 (49.0%)	141 (99.3%)	266 (67.0%)
Discipline
A	127 (49.8%)	54 (38.0%)	181 (45.6%)
B	128 (50.2%)	88 (62.0%)	216 (54.4%)
Yrs.since.phd
Mean (SD)	19.1 (13.5)	28.1 (9.29)	22.3 (12.9)
Median [Min, Max]	15.0 [1.00, 56.0]	28.0 [11.0, 56.0]	21.0 [1.00, 56.0]
Yrs.service
Mean (SD)	14.9 (13.4)	22.5 (10.7)	17.6 (13.0)
Median [Min, Max]	9.00 [0, 57.0]	20.0 [2.00, 60.0]	16.0 [0, 60.0]
NPubs
Mean (SD)	17.6 (13.2)	19.2 (15.2)	18.2 (14.0)
Median [Min, Max]	13.0 [1.00, 69.0]	13.0 [1.00, 69.0]	13.0 [1.00, 69.0]
Ncits
Mean (SD)	38.6 (15.8)	43.1 (18.6)	40.2 (16.9)
Median [Min, Max]	35.0 [1.00, 90.0]	38.5 [1.00, 90.0]	35.0 [1.00, 90.0]
Salary
Mean (SD)	95000 (14600)	147000 (20400)	114000 (30300)
Median [Min, Max]	96200 [57800, 120000]	144000 [121000, 232000]	107000 [57800, 232000]
Sex
Female	30 (11.8%)	9 (6.3%)	39 (9.8%)
Male	225 (88.2%)	133 (93.7%)	358 (90.2%)

1.5 Fit a multiple logistic regression to determine whether the chance of getting a high salary differed between male and female professors after adjusting for potential covariates. Interpret the finding.

m.1 = glm(high.salary ~ Sex , family = "binomial", data = salary)
summary(m.1)

## 
## Call:
## glm(formula = high.salary ~ Sex, family = "binomial", data = salary)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.2040     0.3801  -3.168  0.00154 **
## SexMale       0.6782     0.3955   1.715  0.08636 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 517.75  on 396  degrees of freedom
## Residual deviance: 514.52  on 395  degrees of freedom
## AIC: 518.52
## 
## Number of Fisher Scoring iterations: 4

## install.packages("Publish")
library(Publish)

## Loading required package: prodlim

publish(m.1)

##  Variable  Units OddsRatio       CI.95   p-value 
##       Sex Female       Ref                       
##             Male      1.97 [0.91;4.28]   0.08636

## Multiple model
m.2 = glm(high.salary ~ Sex + Prof + Discipline + Yrs.since.phd + NPubs, family = "binomial", data = salary)
summary(m.2)

## 
## Call:
## glm(formula = high.salary ~ Sex + Prof + Discipline + Yrs.since.phd + 
##     NPubs, family = "binomial", data = salary)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -5.941036   1.140946  -5.207 1.92e-07 ***
## SexMale        0.215592   0.503442   0.428 0.668480    
## ProfProfessor  5.031301   1.037844   4.848 1.25e-06 ***
## DisciplineB    0.997854   0.259625   3.843 0.000121 ***
## Yrs.since.phd  0.008645   0.012943   0.668 0.504147    
## NPubs          0.004598   0.008735   0.526 0.598648    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 517.75  on 396  degrees of freedom
## Residual deviance: 363.49  on 391  degrees of freedom
## AIC: 375.49
## 
## Number of Fisher Scoring iterations: 7

publish(m.2)

##       Variable        Units OddsRatio           CI.95     p-value 
##            Sex       Female       Ref                             
##                        Male      1.24     [0.46;3.33]   0.6684800 
##           Prof AssProfessor       Ref                             
##                   Professor    153.13 [20.03;1170.79]     < 1e-04 
##     Discipline            A       Ref                             
##                           B      2.71     [1.63;4.51]   0.0001213 
##  Yrs.since.phd                   1.01     [0.98;1.03]   0.5041467 
##          NPubs                   1.00     [0.99;1.02]   0.5986477

Interpretation: There is no evidence (P= 0.64) that male professors were associated with different odds of getting a high salary than their female counterparts.

Task 2 Develop an algorithm to predict the chance of getting a high salary among professors.

The libraries needed for this task: table1, Publish, BMA, rms. 1.1 Provide a brief description of the project - What is the study design? - What is the null hypothesis? - What is the alternative hypothesis? ###

1.2 Import the “Professorial Salaries” and name this dataset “salary”

salary = read.csv("C:\\Users\\User\\Documents\\UTS\\AUTUMN 2024\\TRM\\Data Analyst R Int\\Professorial Salaries.csv")

1.3 Create a new variable ‘high.salary’ using the threshold of $120,000 high.salary: 1 = Salary> $120,000; 0 = otherwise

salary$high.salary = ifelse(salary$Salary> 120000, 1, 0)
salary$Prof[salary$Rank == "Prof"] = "Professor"
salary$Prof[salary$Rank != "Prof"] = "AssProfessor"

1.4 Describe the characteristics of the study sample by salary status and predefine potential predictors of high salary

## install.packages("BMA")
library(BMA)

## Loading required package: leaps

## Loading required package: robustbase

## 
## Attaching package: 'robustbase'

## The following object is masked from 'package:survival':
## 
##     heart

## Loading required package: inline

## Loading required package: rrcov

## Scalable Robust Estimators with High Breakdown Point (version 1.7-5)

m.3 = bic.glm(high.salary ~ Sex + Prof + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits, strict = F, OR = 20, glm.family = "binomial", data = salary)
summary(m.3)

## 
## Call:
## bic.glm.formula(f = high.salary ~ Sex + Prof + Discipline + Yrs.since.phd +     Yrs.service + NPubs + Ncits, data = salary, glm.family = "binomial",     strict = F, OR = 20)
## 
## 
##   9  models were selected
##  Best  5  models (cumulative posterior probability =  0.9093 ): 
## 
##                p!=0    EV         SD        model 1     model 2     model 3   
## Intercept      100    -5.978e+00  1.129453  -6.336e+00  -5.559e+00  -5.691e+00
## SexMale          4.6   1.053e-02  0.117232       .           .           .    
## ProfProfessor  100.0   5.202e+00  1.018200   5.222e+00   5.198e+00   5.033e+00
## DisciplineB    100.0   9.733e-01  0.254090   9.797e-01   9.618e-01   1.006e+00
## Yrs.since.phd    5.3   4.565e-04  0.003544       .           .       9.694e-03
## Yrs.service      2.1   1.670e-05  0.001626       .           .           .    
## NPubs            4.7   1.025e-05  0.002167       .           .           .    
## Ncits           51.8   9.540e-03  0.010718   1.837e-02       .           .    
##                                                                               
## nVar                                           3           2           3      
## BIC                                         -1.993e+03  -1.993e+03  -1.988e+03
## post prob                                    0.425       0.408       0.027    
##                model 4     model 5   
## Intercept      -6.421e+00  -6.542e+00
## SexMale             .       2.341e-01
## ProfProfessor   5.090e+00   5.206e+00
## DisciplineB     1.013e+00   9.837e-01
## Yrs.since.phd   7.581e-03       .    
## Yrs.service         .           .    
## NPubs               .           .    
## Ncits           1.802e-02   1.843e-02
##                                      
## nVar              4           4      
## BIC            -1.988e+03  -1.988e+03
## post prob       0.025       0.024    
## 
##   1  observations deleted due to missingness.

imageplot.bma(m.3)

1.5 Select the optimal prediction model using BMA approach. Interpret the finding

1.6 Fit the model. Write a prediction algorithm.

m.4 = glm(high.salary ~ Prof + Discipline, family = "binomial", data = salary)
summary(m.4)

## 
## Call:
## glm(formula = high.salary ~ Prof + Discipline, family = "binomial", 
##     data = salary)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.5592     1.0247  -5.425 5.78e-08 ***
## ProfProfessor   5.1983     1.0153   5.120 3.06e-07 ***
## DisciplineB     0.9618     0.2516   3.823 0.000132 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 517.75  on 396  degrees of freedom
## Residual deviance: 364.47  on 394  degrees of freedom
## AIC: 370.47
## 
## Number of Fisher Scoring iterations: 7

## install.packages("rms")
library(rms)

## Loading required package: Hmisc

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:table1':
## 
##     label, label<-, units

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Registered S3 method overwritten by 'rms':
##   method       from      
##   print.lrtest epiDisplay

## 
## Attaching package: 'rms'

## The following object is masked from 'package:epiDisplay':
## 
##     lrtest

ddist = datadist(salary)
options(datadist = 'ddist')
m = lrm(high.salary ~ Prof + Discipline, data = salary)
summary(m)

##              Effects              Response : high.salary 
## 
##  Factor                        Low High Diff. Effect     S.E.   Lower 0.95 
##  Prof - AssProfessor:Professor 2   1    NA    -5.1983000 1.0154 -7.18850000
##   Odds Ratio                   2   1    NA     0.0055257     NA  0.00075519
##  Discipline - A:B              2   1    NA    -0.9617800 0.2516 -1.45490000
##   Odds Ratio                   2   1    NA     0.3822100     NA  0.23342000
##  Upper 0.95
##  -3.208100 
##   0.040431 
##  -0.468660 
##   0.625840

nom.m = nomogram(m, fun = function(x)1/(1+exp(-x)), fun.at = c(.001, .01, .10, seq(.2, .8, by = .2), .90, .99, .999),               funlabel = "Predicted likelihood of a high salary")
plot(nom.m)

## Stepwise approach
m.sw = lm(Salary ~ Prof + Discipline + Yrs.since.phd + Yrs.service + Sex + NPubs + Ncits, data = salary)
step1 = step(m.sw)

## Start:  AIC=7972.16
## Salary ~ Prof + Discipline + Yrs.since.phd + Yrs.service + Sex + 
##     NPubs + Ncits
## 
##                 Df  Sum of Sq        RSS    AIC
## - NPubs          1 2.6173e+07 2.0064e+11 7970.2
## - Sex            1 7.5268e+08 2.0136e+11 7971.6
## <none>                        2.0061e+11 7972.2
## - Ncits          1 2.2974e+09 2.0291e+11 7974.7
## - Yrs.service    1 2.7077e+09 2.0332e+11 7975.5
## - Yrs.since.phd  1 3.7642e+09 2.0437e+11 7977.5
## - Discipline     1 1.9980e+10 2.2059e+11 8007.8
## - Prof           1 6.4497e+10 2.6511e+11 8080.8
## 
## Step:  AIC=7970.21
## Salary ~ Prof + Discipline + Yrs.since.phd + Yrs.service + Sex + 
##     Ncits
## 
##                 Df  Sum of Sq        RSS    AIC
## - Sex            1 7.7069e+08 2.0141e+11 7969.7
## <none>                        2.0064e+11 7970.2
## - Ncits          1 2.4053e+09 2.0304e+11 7972.9
## - Yrs.service    1 2.7116e+09 2.0335e+11 7973.5
## - Yrs.since.phd  1 3.7631e+09 2.0440e+11 7975.6
## - Discipline     1 1.9956e+10 2.2059e+11 8005.9
## - Prof           1 6.4471e+10 2.6511e+11 8078.8
## 
## Step:  AIC=7969.73
## Salary ~ Prof + Discipline + Yrs.since.phd + Yrs.service + Ncits
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        2.0141e+11 7969.7
## - Ncits          1 2.3585e+09 2.0377e+11 7972.4
## - Yrs.service    1 2.5792e+09 2.0399e+11 7972.8
## - Yrs.since.phd  1 3.7518e+09 2.0516e+11 7975.1
## - Discipline     1 2.0207e+10 2.2161e+11 8005.7
## - Prof           1 6.5729e+10 2.6714e+11 8079.9

summary(step1)

## 
## Call:
## lm(formula = Salary ~ Prof + Discipline + Yrs.since.phd + Yrs.service + 
##     Ncits, data = salary)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69513 -13212  -2228  10409 100191 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   69214.26    3918.27  17.665  < 2e-16 ***
## ProfProfessor 36815.51    3259.13  11.296  < 2e-16 ***
## DisciplineB   14759.23    2356.48   6.263 9.95e-10 ***
## Yrs.since.phd   644.97     238.99   2.699  0.00726 ** 
## Yrs.service    -477.09     213.21  -2.238  0.02581 *  
## Ncits           144.34      67.46   2.140  0.03299 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22700 on 391 degrees of freedom
## Multiple R-squared:  0.4456, Adjusted R-squared:  0.4385 
## F-statistic: 62.86 on 5 and 391 DF,  p-value: < 2.2e-16

1.7 Develop a nomogram for implicating the prediction model. What would be the probability that a professor in discipline A with 50 citations has a high salary?

Lab 9 Practical Data Analyst

2024-04-29

Simple Logistic Regression

Task 1. Determine whether the chance of getting a high salary differed between male and female

1.2 Import the “Professorial Salaries” and name this dataset “salary”

1.3 Create a new variable ‘high.salary’ using the threshold of $130,000 high.salary: 1 = Salary> $130,000; 0 = otherwise

1.4 Describe the characteristics of the study sample by sexes

1.5 Fit a simple logistic regression to determine whether the chance of getting a high salary differed between male and female professors. Interpret the finding

Interpretation: There is evidence (P= .035) that male professors were associated with 2.8-fold greater odds of getting a high salary than their female counterparts, ranging from 7% to 7.4-fold greater.

Task 2: Save all your work and publish it into your Rpubs account.

Multiple Logistic Regression

Task 1. Determine whether the chance of getting a high salary differed between male and female

1.2 Import the “Professorial Salaries” and name this dataset “salary”

1.3 Create a new variable ‘high.salary’ using the threshold of $120,000 high.salary: 1 = Salary> $120,000; 0 = otherwise

1.4 Describe the characteristics of the study sample by salary status and predefine potential covariates for multiple regression model

1.5 Fit a multiple logistic regression to determine whether the chance of getting a high salary differed between male and female professors after adjusting for potential covariates. Interpret the finding.

Interpretation: There is no evidence (P= 0.64) that male professors were associated with different odds of getting a high salary than their female counterparts.

Task 2 Develop an algorithm to predict the chance of getting a high salary among professors.

1.2 Import the “Professorial Salaries” and name this dataset “salary”

1.3 Create a new variable ‘high.salary’ using the threshold of $120,000 high.salary: 1 = Salary> $120,000; 0 = otherwise

1.4 Describe the characteristics of the study sample by salary status and predefine potential predictors of high salary

1.5 Select the optimal prediction model using BMA approach. Interpret the finding

1.6 Fit the model. Write a prediction algorithm.

1.7 Develop a nomogram for implicating the prediction model. What would be the probability that a professor in discipline A with 50 citations has a high salary?

Task 3: Save all your work and publish it into your Rpubs account.