TRM Practical Data Analysis - Intermediate level

Lecture 12. Prediction model - Validation

Task 1: Validation a prediction model for proffessorial salaries

1.1 Study design:

The cross-sectional investigation of 397 professors has been conducted to develop a prediction model for predicting professorial salaries. The current analysis aims to validate the prediction model.

1.2 Import the “Professorial Salaries” and name this dataset “salary”

salary = read.csv("C:\\Thach\\UTS\\Teaching\\TRM\\Practical Data Analysis\\2024_Autumn semester\\Data\\Professorial Salaries.csv")

1.3 Split data validation

1.3.1 Slit data

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

set.seed(1234)

index = createDataPartition(salary$Salary, p = 0.75, list = FALSE)
training = salary[index, ]
dim(training)

## [1] 300   9

testing = salary[-index, ]
dim(testing)

## [1] 97  9

1.3.2 Describe the baseline characteristics in the Training and Testing datasets

1.3.2.1 Training dataset

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~ Rank + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary + Sex, data = training)

	Overall (N=300)
Rank
AssocProf	50 (16.7%)
AsstProf	50 (16.7%)
Prof	200 (66.7%)
Discipline
A	137 (45.7%)
B	163 (54.3%)
Yrs.since.phd
Mean (SD)	22.0 (12.9)
Median [Min, Max]	20.5 [1.00, 56.0]
Yrs.service
Mean (SD)	17.7 (12.9)
Median [Min, Max]	17.0 [0, 57.0]
NPubs
Mean (SD)	18.0 (14.0)
Median [Min, Max]	13.0 [1.00, 69.0]
Ncits
Mean (SD)	39.9 (17.6)
Median [Min, Max]	35.0 [1.00, 90.0]
Salary
Mean (SD)	114000 (29800)
Median [Min, Max]	107000 [57800, 232000]
Sex
Female	24 (8.0%)
Male	276 (92.0%)

1.3.2.2 Testing dataset

table1(~ Rank + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary + Sex, data = testing)

	Overall (N=97)
Rank
AssocProf	14 (14.4%)
AsstProf	17 (17.5%)
Prof	66 (68.0%)
Discipline
A	44 (45.4%)
B	53 (54.6%)
Yrs.since.phd
Mean (SD)	23.2 (13.0)
Median [Min, Max]	22.0 [3.00, 56.0]
Yrs.service
Mean (SD)	17.2 (13.3)
Median [Min, Max]	14.0 [0, 60.0]
NPubs
Mean (SD)	18.7 (13.9)
Median [Min, Max]	14.0 [1.00, 50.0]
Ncits
Mean (SD)	41.2 (14.6)
Median [Min, Max]	37.0 [14.0, 83.0]
Salary
Mean (SD)	113000 (31900)
Median [Min, Max]	107000 [62900, 206000]
Sex
Female	15 (15.5%)
Male	82 (84.5%)

1.3.3 Develop a prediction model

1.3.3.1 Systematically search for the optimal prediction model

library(BMA)

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: leaps

## Loading required package: robustbase

## 
## Attaching package: 'robustbase'

## The following object is masked from 'package:survival':
## 
##     heart

## Loading required package: inline

## Loading required package: rrcov

## Scalable Robust Estimators with High Breakdown Point (version 1.7-4)

xvars = training[, c("Rank", "Discipline", "Yrs.since.phd", "Yrs.service", "Sex", "NPubs", "Ncits")]
yvar = training[,c("Salary")]
bma = bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
summary(bma)

## 
## Call:
## bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
## 
## 
##   7  models were selected
##  Best  5  models (cumulative posterior probability =  0.9347 ): 
## 
##                p!=0    EV         SD       model 1     model 2     model 3   
## Intercept      100.0   8.833e+04  4376.20   88094.11    91372.51    86697.82 
## RankAsstProf   100.0  -1.653e+04  4677.07  -16064.89   -18293.10   -16414.39 
## RankProf       100.0   3.256e+04  3896.77   31938.49    34788.76    32340.00 
## DisciplineB    100.0   1.315e+04  2697.84   13227.64    12527.42    13644.18 
## Yrs.since.phd   17.8   9.058e+01   254.77      .           .          648.48 
## Yrs.service     33.1  -1.546e+02   285.53      .         -249.58     -751.30 
## SexMale          3.8   1.281e+02  1153.12      .           .           .     
## NPubs            3.1  -7.687e-01    17.20      .           .           .     
## Ncits            4.5   3.017e+00    21.15      .           .           .     
##                                                                              
## nVar                                              3           4           5  
## r2                                              0.427       0.434       0.443
## BIC                                          -149.79     -147.75     -147.22 
## post prob                                       0.521       0.187       0.144
##                model 4     model 5   
## Intercept       85234.76    85203.55 
## RankAsstProf   -15556.44   -15926.67 
## RankProf        32013.72    31688.22 
## DisciplineB     13319.46    13128.83 
## Yrs.since.phd      .           .     
## Yrs.service        .           .     
## SexMale            .         3356.59 
## NPubs              .           .     
## Ncits              67.03       .     
##                                      
## nVar                  4           4  
## r2                  0.428       0.428
## BIC              -144.90     -144.57 
## post prob           0.045       0.038

imageplot.bma(bma)

1.3.3.2 Present the model

m1 = lm(Salary ~ Rank + Discipline, data = training)
summary(m1)

## 
## Call:
## lm(formula = Salary ~ Rank + Discipline, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65701 -13823   -778  11321  98285 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     88094       3632  24.258  < 2e-16 ***
## RankAsstProf   -16065       4535  -3.542 0.000461 ***
## RankProf        31939       3608   8.852  < 2e-16 ***
## DisciplineB     13228       2665   4.963 1.17e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22670 on 296 degrees of freedom
## Multiple R-squared:  0.4267, Adjusted R-squared:  0.4209 
## F-statistic: 73.44 on 3 and 296 DF,  p-value: < 2.2e-16

1.3.3.3 Check the model’s assumptions

par(mfrow = c(2,2))
plot(m1)

1.3.4 Validation of the model

fit.m1 = train(Salary ~ Rank + Discipline, data = training, method = "lm", metric = "Rsquared")
summary(fit.m1)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65701 -13823   -778  11321  98285 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     88094       3632  24.258  < 2e-16 ***
## RankAsstProf   -16065       4535  -3.542 0.000461 ***
## RankProf        31939       3608   8.852  < 2e-16 ***
## DisciplineB     13228       2665   4.963 1.17e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22670 on 296 degrees of freedom
## Multiple R-squared:  0.4267, Adjusted R-squared:  0.4209 
## F-statistic: 73.44 on 3 and 296 DF,  p-value: < 2.2e-16

pred.m1 = predict(fit.m1, testing)
testing.data = data.frame(obs = testing$Salary, pred = pred.m1)
defaultSummary(testing.data)

##         RMSE     Rsquared          MAE 
## 2.266292e+04 4.940016e-01 1.672117e+04

1.4 Bootstrapping

1.4.1 Describe the baseline characteristics

table1(~ Rank + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary + Sex, data = salary)

	Overall (N=397)
Rank
AssocProf	64 (16.1%)
AsstProf	67 (16.9%)
Prof	266 (67.0%)
Discipline
A	181 (45.6%)
B	216 (54.4%)
Yrs.since.phd
Mean (SD)	22.3 (12.9)
Median [Min, Max]	21.0 [1.00, 56.0]
Yrs.service
Mean (SD)	17.6 (13.0)
Median [Min, Max]	16.0 [0, 60.0]
NPubs
Mean (SD)	18.2 (14.0)
Median [Min, Max]	13.0 [1.00, 69.0]
Ncits
Mean (SD)	40.2 (16.9)
Median [Min, Max]	35.0 [1.00, 90.0]
Salary
Mean (SD)	114000 (30300)
Median [Min, Max]	107000 [57800, 232000]
Sex
Female	39 (9.8%)
Male	358 (90.2%)

1.4.2 Develop a prediction model

1.4.2.1 Search for the optimal prediction model

library(BMA) 
xvars = salary[, c("Rank", "Discipline", "Yrs.since.phd", "Yrs.service", "Sex", "NPubs", "Ncits")]
yvar = salary[,c("Salary")]
bma.2 = bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
summary(bma.2)

## 
## Call:
## bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
## 
## 
##   6  models were selected
##  Best  5  models (cumulative posterior probability =  0.9676 ): 
## 
##                p!=0    EV         SD       model 1     model 2     model 3   
## Intercept      100.0   8.428e+04  4246.08   85705.87    80159.33    81946.95 
## RankAsstProf   100.0  -1.360e+04  3992.00  -13761.54   -13011.52   -13723.42 
## RankProf       100.0   3.409e+04  3193.57   34082.30    34252.88    33679.90 
## DisciplineB    100.0   1.376e+04  2298.00   13760.96    13779.31    13708.69 
## Yrs.since.phd    3.7   2.628e+00    27.72      .           .           .     
## Yrs.service      3.9  -3.007e+00    26.61      .           .           .     
## SexMale          6.1   2.760e+02  1441.87      .           .         4491.80 
## NPubs            3.2   7.712e-01    15.32      .           .           .     
## Ncits           21.3   2.801e+01    62.18      .          131.65       .     
##                                                                              
## nVar                                              3           4           4  
## r2                                              0.445       0.450       0.447
## BIC                                          -215.78     -213.65     -211.17 
## post prob                                       0.618       0.213       0.061
##                model 4     model 5   
## Intercept       86736.76    84435.56 
## RankAsstProf   -14483.23   -13030.16 
## RankProf        34894.27    33181.41 
## DisciplineB     13561.43    14028.68 
## Yrs.since.phd      .           71.92 
## Yrs.service       -76.33       .     
## SexMale            .           .     
## NPubs              .           .     
## Ncits              .           .     
##                                      
## nVar                  4           4  
## r2                  0.446       0.445
## BIC              -210.28     -210.13 
## post prob           0.039       0.037

imageplot.bma(bma.2)

1.4.2.2 Present the model

m2 = lm(Salary ~ Rank + Discipline, data = salary)
summary(m2)

## 
## Call:
## lm(formula = Salary ~ Rank + Discipline, data = salary)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65990 -14049  -1288  10760  97996 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     85706       3142  27.273  < 2e-16 ***
## RankAsstProf   -13762       3961  -3.475 0.000569 ***
## RankProf        34082       3160  10.786  < 2e-16 ***
## DisciplineB     13761       2296   5.993 4.65e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22650 on 393 degrees of freedom
## Multiple R-squared:  0.445,  Adjusted R-squared:  0.4407 
## F-statistic:   105 on 3 and 393 DF,  p-value: < 2.2e-16

1.4.3.3 Check the model’s assumptions

par(mfrow = c(2,2))
plot(m2)

1.4.3 Validation of the model

1.4.3.1 Estimate the parameters of the model

library(rms)

## Loading required package: Hmisc

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:table1':
## 
##     label, label<-, units

## The following objects are masked from 'package:base':
## 
##     format.pval, units

fit.m2 = ols(Salary ~ Rank + Discipline, data = salary, x = TRUE, y = TRUE)
fit.m2

## Linear Regression Model
## 
## ols(formula = Salary ~ Rank + Discipline, data = salary, x = TRUE, 
##     y = TRUE)
## 
##                     Model Likelihood    Discrimination    
##                           Ratio Test           Indexes    
## Obs         397    LR chi2    233.73    R2       0.445    
## sigma22651.1854    d.f.            3    R2 adj   0.441    
## d.f.        393    Pr(> chi2) 0.0000    g    21717.086    
## 
## Residuals
## 
##    Min     1Q Median     3Q    Max 
## -65990 -14049  -1288  10760  97996 
## 
## 
##               Coef        S.E.      t     Pr(>|t|)
## Intercept      85705.8693 3142.5016 27.27 <0.0001 
## Rank=AsstProf -13761.5432 3960.6611 -3.47 0.0006  
## Rank=Prof      34082.2954 3159.8850 10.79 <0.0001 
## Discipline=B   13760.9570 2296.0309  5.99 <0.0001

1.4.3.2 Validate the model

set.seed(1234)
v.m2 = validate(fit.m2, B = 500)
v.m2

##             index.orig     training         test      optimism index.corrected
## R-square  4.450000e-01 4.467000e-01 4.412000e-01        0.0055    4.395000e-01
## MSE       5.079067e+08 5.039997e+08 5.114001e+08 -7400384.7194    5.153071e+08
## g         2.171709e+04 2.162071e+04 2.167298e+04      -52.2655    2.176935e+04
## Intercept 0.000000e+00 0.000000e+00 9.012950e+01      -90.1295    9.012950e+01
## Slope     1.000000e+00 1.000000e+00 9.998000e-01        0.0002    9.998000e-01
##             n
## R-square  500
## MSE       500
## g         500
## Intercept 500
## Slope     500

Task 2: Validation a prediction model for high salaries (> $120,000)

2.1 Study design:

The cross-sectional investigation of 397 professors has been conducted to develop a prediction model for predicting professorial salaries. The current analysis aims to validate the prediction model.

2.2 Import the “Professorial Salaries” and name this dataset “salary”

salary = read.csv("C:\\Thach\\UTS\\Teaching\\TRM\\Practical Data Analysis\\2024_Autumn semester\\Data\\Professorial Salaries.csv")

2.3 Create a new variable ‘high salary’ using the threshold of $120,000 and ‘Prof’ (AssProf = ‘AsstProfessor’ or ‘AssoProfessor’; Prof = ‘Professor’)

salary$high.salary = ifelse(salary$Salary> 120000, 1, 0)
salary$Prof[salary$Rank == "Prof"] = "Professor"
salary$Prof[salary$Rank != "Prof"] = "AssProfessor"

2.4 Split data validation

2.4.1 Slit data

library(caret)
set.seed(1234)

index = createDataPartition(salary$Salary, p = 0.75, list = FALSE)
training = salary[index, ]
dim(training)

## [1] 300  11

testing = salary[-index, ]
dim(testing)

## [1] 97 11

2.4.2 Describe the baseline characteristics in the Training and Testing datasets

2.4.2.1 Training dataset

library(table1)
table1(~ Rank + Prof + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary + Sex | as.factor(high.salary), data = training)

	0 (N=193)	1 (N=107)	Overall (N=300)
Rank
AssocProf	49 (25.4%)	1 (0.9%)	50 (16.7%)
AsstProf	50 (25.9%)	0 (0%)	50 (16.7%)
Prof	94 (48.7%)	106 (99.1%)	200 (66.7%)
Prof
AssProfessor	99 (51.3%)	1 (0.9%)	100 (33.3%)
Professor	94 (48.7%)	106 (99.1%)	200 (66.7%)
Discipline
A	92 (47.7%)	45 (42.1%)	137 (45.7%)
B	101 (52.3%)	62 (57.9%)	163 (54.3%)
Yrs.since.phd
Mean (SD)	19.0 (13.7)	27.6 (8.98)	22.0 (12.9)
Median [Min, Max]	15.0 [1.00, 56.0]	28.0 [11.0, 46.0]	20.5 [1.00, 56.0]
Yrs.service
Mean (SD)	15.1 (13.7)	22.5 (9.87)	17.7 (12.9)
Median [Min, Max]	9.00 [0, 57.0]	20.0 [2.00, 45.0]	17.0 [0, 57.0]
NPubs
Mean (SD)	17.5 (12.8)	18.9 (15.9)	18.0 (14.0)
Median [Min, Max]	13.0 [1.00, 69.0]	13.0 [1.00, 69.0]	13.0 [1.00, 69.0]
Ncits
Mean (SD)	38.3 (16.4)	42.8 (19.4)	39.9 (17.6)
Median [Min, Max]	35.0 [1.00, 90.0]	36.0 [1.00, 90.0]	35.0 [1.00, 90.0]
Salary
Mean (SD)	95600 (14400)	147000 (20300)	114000 (29800)
Median [Min, Max]	96600 [57800, 120000]	144000 [121000, 232000]	107000 [57800, 232000]
Sex
Female	19 (9.8%)	5 (4.7%)	24 (8.0%)
Male	174 (90.2%)	102 (95.3%)	276 (92.0%)

2.4.2.2 Testing dataset

table1(~ Rank + Prof + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary + Sex | as.factor(high.salary), data = testing)

	0 (N=62)	1 (N=35)	Overall (N=97)
Rank
AssocProf	14 (22.6%)	0 (0%)	14 (14.4%)
AsstProf	17 (27.4%)	0 (0%)	17 (17.5%)
Prof	31 (50.0%)	35 (100%)	66 (68.0%)
Prof
AssProfessor	31 (50.0%)	0 (0%)	31 (32.0%)
Professor	31 (50.0%)	35 (100%)	66 (68.0%)
Discipline
A	35 (56.5%)	9 (25.7%)	44 (45.4%)
B	27 (43.5%)	26 (74.3%)	53 (54.6%)
Yrs.since.phd
Mean (SD)	19.5 (13.0)	29.8 (10.2)	23.2 (13.0)
Median [Min, Max]	16.5 [3.00, 46.0]	31.0 [13.0, 56.0]	22.0 [3.00, 56.0]
Yrs.service
Mean (SD)	14.3 (12.6)	22.4 (13.0)	17.2 (13.3)
Median [Min, Max]	9.00 [0, 46.0]	20.0 [2.00, 60.0]	14.0 [0, 60.0]
NPubs
Mean (SD)	17.9 (14.4)	20.1 (13.2)	18.7 (13.9)
Median [Min, Max]	13.0 [1.00, 50.0]	18.0 [3.00, 50.0]	14.0 [1.00, 50.0]
Ncits
Mean (SD)	39.5 (13.6)	44.3 (16.0)	41.2 (14.6)
Median [Min, Max]	36.0 [14.0, 70.0]	41.0 [14.0, 83.0]	37.0 [14.0, 83.0]
Salary
Mean (SD)	93100 (15300)	149000 (21000)	113000 (31900)
Median [Min, Max]	95500 [62900, 120000]	145000 [121000, 206000]	107000 [62900, 206000]
Sex
Female	11 (17.7%)	4 (11.4%)	15 (15.5%)
Male	51 (82.3%)	31 (88.6%)	82 (84.5%)

2.4.3 Develop a prediction model

2.4.3.1 Systematically search for the optimal prediction model

library(BMA)
bma.bin1 = bic.glm(high.salary ~ Sex + Prof + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits, strict = F, OR = 20, glm.family = "binomial", data = training)
summary(bma.bin1)

## 
## Call:
## bic.glm.formula(f = high.salary ~ Sex + Prof + Discipline + Yrs.since.phd +     Yrs.service + NPubs + Ncits, data = training, glm.family = "binomial",     strict = F, OR = 20)
## 
## 
##   8  models were selected
##  Best  5  models (cumulative posterior probability =  0.9053 ): 
## 
##                p!=0    EV         SD        model 1     model 2     model 3   
## Intercept      100    -5.139e+00  1.085239  -5.184e+00  -4.595e+00  -5.700e+00
## SexMale          3.2   6.160e-03  0.121368       .           .           .    
## ProfProfessor  100.0   4.875e+00  1.024675   4.923e+00   4.715e+00   4.903e+00
## DisciplineB     77.9   6.213e-01  0.417734   7.970e-01       .       8.054e-01
## Yrs.since.phd    3.1   6.003e-05  0.002609       .           .           .    
## Yrs.service      3.3  -1.357e-04  0.002464       .           .           .    
## NPubs            3.1   4.455e-05  0.001748       .           .           .    
## Ncits           15.4   2.003e-03  0.005665       .           .       1.307e-02
##                                                                               
## nVar                                           2           1           3      
## BIC                                         -1.414e+03  -1.412e+03  -1.411e+03
## post prob                                    0.535       0.183       0.116    
##                model 4     model 5   
## Intercept      -5.098e+00  -5.144e+00
## SexMale             .           .    
## ProfProfessor   4.703e+00   4.987e+00
## DisciplineB         .       7.814e-01
## Yrs.since.phd       .           .    
## Yrs.service         .      -4.171e-03
## NPubs               .           .    
## Ncits           1.266e-02       .    
##                                      
## nVar              2           3      
## BIC            -1.409e+03  -1.409e+03
## post prob       0.038       0.033    
## 
##   1  observations deleted due to missingness.

imageplot.bma(bma.bin1)

2.4.3.2 Present the model

bin.m1 = glm(high.salary ~ Prof + Discipline, family = "binomial", data = training)
summary(bin.m1)

## 
## Call:
## glm(formula = high.salary ~ Prof + Discipline, family = "binomial", 
##     data = training)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.1843     1.0321  -5.023 5.09e-07 ***
## ProfProfessor   4.9231     1.0205   4.824 1.41e-06 ***
## DisciplineB     0.7970     0.2876   2.771  0.00558 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 390.89  on 299  degrees of freedom
## Residual deviance: 279.89  on 297  degrees of freedom
## AIC: 285.89
## 
## Number of Fisher Scoring iterations: 7

2.4.4 Validate the model

fit.bin.m1 = train(factor(high.salary) ~ Prof + Discipline + Ncits, data = training, method = "glm", family = "binomial")
summary(fit.bin.m1)

## 
## Call:
## NULL
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -5.700254   1.086959  -5.244 1.57e-07 ***
## ProfProfessor  4.903412   1.020676   4.804 1.55e-06 ***
## DisciplineB    0.805372   0.289619   2.781  0.00542 ** 
## Ncits          0.013071   0.008133   1.607  0.10800    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 390.89  on 299  degrees of freedom
## Residual deviance: 277.24  on 296  degrees of freedom
## AIC: 285.24
## 
## Number of Fisher Scoring iterations: 7

pred.bin.m1 = predict(fit.bin.m1, testing, type = "raw")
confusionMatrix(factor(testing$high.salary), pred.bin.m1, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 50 12
##          1  6 29
##                                           
##                Accuracy : 0.8144          
##                  95% CI : (0.7227, 0.8862)
##     No Information Rate : 0.5773          
##     P-Value [Acc > NIR] : 6.302e-07       
##                                           
##                   Kappa : 0.6122          
##                                           
##  Mcnemar's Test P-Value : 0.2386          
##                                           
##             Sensitivity : 0.7073          
##             Specificity : 0.8929          
##          Pos Pred Value : 0.8286          
##          Neg Pred Value : 0.8065          
##              Prevalence : 0.4227          
##          Detection Rate : 0.2990          
##    Detection Prevalence : 0.3608          
##       Balanced Accuracy : 0.8001          
##                                           
##        'Positive' Class : 1               
##

Calculating AUC

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

prob = predict(fit.bin.m1, newdata = testing, type = "prob")
pred.data = data.frame(prob, testing$high.salary)
head(pred.data)

##           X0         X1 testing.high.salary
## 5  0.3671548 0.63284524                   1
## 6  0.9880057 0.01199434                   0
## 16 0.3948900 0.60511003                   0
## 22 0.5840340 0.41596600                   0
## 27 0.6091936 0.39080637                   1
## 31 0.3402721 0.65972788                   1

roc(pred.data$testing.high.salary, pred.data$X1)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## 
## Call:
## roc.default(response = pred.data$testing.high.salary, predictor = pred.data$X1)
## 
## Data: pred.data$X1 in 62 controls (pred.data$testing.high.salary 0) < 35 cases (pred.data$testing.high.salary 1).
## Area under the curve: 0.8776

auc = smooth(roc(pred.data$testing.high.salary, pred.data$X1))

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

plot(auc)

2.5 Bootstrapping validation

2.5.1 Describe the baseline characteristics

table1(~ Rank + Prof + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits + Salary + Sex | as.factor(high.salary), data = salary)

	0 (N=255)	1 (N=142)	Overall (N=397)
Rank
AssocProf	63 (24.7%)	1 (0.7%)	64 (16.1%)
AsstProf	67 (26.3%)	0 (0%)	67 (16.9%)
Prof	125 (49.0%)	141 (99.3%)	266 (67.0%)
Prof
AssProfessor	130 (51.0%)	1 (0.7%)	131 (33.0%)
Professor	125 (49.0%)	141 (99.3%)	266 (67.0%)
Discipline
A	127 (49.8%)	54 (38.0%)	181 (45.6%)
B	128 (50.2%)	88 (62.0%)	216 (54.4%)
Yrs.since.phd
Mean (SD)	19.1 (13.5)	28.1 (9.29)	22.3 (12.9)
Median [Min, Max]	15.0 [1.00, 56.0]	28.0 [11.0, 56.0]	21.0 [1.00, 56.0]
Yrs.service
Mean (SD)	14.9 (13.4)	22.5 (10.7)	17.6 (13.0)
Median [Min, Max]	9.00 [0, 57.0]	20.0 [2.00, 60.0]	16.0 [0, 60.0]
NPubs
Mean (SD)	17.6 (13.2)	19.2 (15.2)	18.2 (14.0)
Median [Min, Max]	13.0 [1.00, 69.0]	13.0 [1.00, 69.0]	13.0 [1.00, 69.0]
Ncits
Mean (SD)	38.6 (15.8)	43.1 (18.6)	40.2 (16.9)
Median [Min, Max]	35.0 [1.00, 90.0]	38.5 [1.00, 90.0]	35.0 [1.00, 90.0]
Salary
Mean (SD)	95000 (14600)	147000 (20400)	114000 (30300)
Median [Min, Max]	96200 [57800, 120000]	144000 [121000, 232000]	107000 [57800, 232000]
Sex
Female	30 (11.8%)	9 (6.3%)	39 (9.8%)
Male	225 (88.2%)	133 (93.7%)	358 (90.2%)

2.5.3 Develop a prediction model

2.5.3.1 Systematically search for the optimal prediction model

bma.bin2 = bic.glm(high.salary ~ Sex + Prof + Discipline + Yrs.since.phd + Yrs.service + NPubs + Ncits, strict = F, OR = 20, glm.family = "binomial", data = salary)
summary(bma.bin2)

## 
## Call:
## bic.glm.formula(f = high.salary ~ Sex + Prof + Discipline + Yrs.since.phd +     Yrs.service + NPubs + Ncits, data = salary, glm.family = "binomial",     strict = F, OR = 20)
## 
## 
##   9  models were selected
##  Best  5  models (cumulative posterior probability =  0.9093 ): 
## 
##                p!=0    EV         SD        model 1     model 2     model 3   
## Intercept      100    -5.978e+00  1.129453  -6.336e+00  -5.559e+00  -5.691e+00
## SexMale          4.6   1.053e-02  0.117232       .           .           .    
## ProfProfessor  100.0   5.202e+00  1.018200   5.222e+00   5.198e+00   5.033e+00
## DisciplineB    100.0   9.733e-01  0.254090   9.797e-01   9.618e-01   1.006e+00
## Yrs.since.phd    5.3   4.565e-04  0.003544       .           .       9.694e-03
## Yrs.service      2.1   1.670e-05  0.001626       .           .           .    
## NPubs            4.7   1.025e-05  0.002167       .           .           .    
## Ncits           51.8   9.540e-03  0.010718   1.837e-02       .           .    
##                                                                               
## nVar                                           3           2           3      
## BIC                                         -1.993e+03  -1.993e+03  -1.988e+03
## post prob                                    0.425       0.408       0.027    
##                model 4     model 5   
## Intercept      -6.421e+00  -6.542e+00
## SexMale             .       2.341e-01
## ProfProfessor   5.090e+00   5.206e+00
## DisciplineB     1.013e+00   9.837e-01
## Yrs.since.phd   7.581e-03       .    
## Yrs.service         .           .    
## NPubs               .           .    
## Ncits           1.802e-02   1.843e-02
##                                      
## nVar              4           4      
## BIC            -1.988e+03  -1.988e+03
## post prob       0.025       0.024    
## 
##   1  observations deleted due to missingness.

imageplot.bma(bma.bin2)

2.5.3.2 Present the model

bin.m2 = glm(high.salary ~ Prof + Discipline + Ncits, family = "binomial", data = salary)
summary(bin.m2)

## 
## Call:
## glm(formula = high.salary ~ Prof + Discipline + Ncits, family = "binomial", 
##     data = salary)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.335608   1.083334  -5.848 4.97e-09 ***
## ProfProfessor  5.221741   1.016653   5.136 2.80e-07 ***
## DisciplineB    0.979729   0.254899   3.844 0.000121 ***
## Ncits          0.018368   0.007614   2.412 0.015853 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 517.75  on 396  degrees of freedom
## Residual deviance: 358.41  on 393  degrees of freedom
## AIC: 366.41
## 
## Number of Fisher Scoring iterations: 7

2.5.4 Validate the model

2.5.4.1 Estimate the parameters of the model

library(rms)
fit.bin.m2 = lrm(high.salary ~ Prof + Discipline + Ncits, data = salary, x = TRUE, y = TRUE)
fit.bin.m2

## Logistic Regression Model
## 
## lrm(formula = high.salary ~ Prof + Discipline + Ncits, data = salary, 
##     x = TRUE, y = TRUE)
## 
##                        Model Likelihood      Discrimination    Rank Discrim.    
##                              Ratio Test             Indexes          Indexes    
## Obs           397    LR chi2     159.34      R2       0.454    C       0.824    
##  0            255    d.f.             3      R2(3,397)0.326    Dxy     0.648    
##  1            142    Pr(> chi2) <0.0001    R2(3,273.6)0.435    gamma   0.653    
## max |deriv| 1e-05                            Brier    0.157    tau-a   0.298    
## 
##                Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept      -6.3356 1.0834 -5.85  <0.0001 
## Prof=Professor  5.2217 1.0168  5.14  <0.0001 
## Discipline=B    0.9797 0.2549  3.84  0.0001  
## Ncits           0.0184 0.0076  2.41  0.0159

2.5.4.2 Validate the model

set.seed(1234)
val.bin.m2 = validate(fit.bin.m2, B = 500)
val.bin.m2

##           index.orig training   test optimism index.corrected   n
## Dxy           0.6478   0.6494 0.6464   0.0030          0.6449 500
## R2            0.4537   0.4590 0.4462   0.0128          0.4409 500
## Intercept     0.0000   0.0000 0.0175  -0.0175          0.0175 500
## Slope         1.0000   1.0000 0.8564   0.1436          0.8564 500
## Emax          0.0000   0.0000 0.0374   0.0374          0.0374 500
## D             0.3988   0.4056 0.3907   0.0148          0.3840 500
## U            -0.0050  -0.0050 0.0050  -0.0100          0.0050 500
## Q             0.4039   0.4106 0.3857   0.0249          0.3790 500
## B             0.1567   0.1555 0.1584  -0.0029          0.1596 500
## g             2.6489   3.5705 2.7251   0.8454          1.8036 500
## gp            0.3031   0.3031 0.2935   0.0096          0.2935 500

2.5.4.3 Validation curve

pred.logit = predict(fit.bin.m2)
phat = 1/(1 + exp(-pred.logit))
val.prob(phat, salary$high.salary, m = 20, cex = .6)

##           Dxy       C (ROC)            R2             D      D:Chi-sq 
##  6.478597e-01  8.239299e-01  4.537261e-01  3.988291e-01  1.593352e+02 
##           D:p             U      U:Chi-sq           U:p             Q 
##            NA -5.037783e-03 -2.273737e-13  1.000000e+00  4.038669e-01 
##         Brier     Intercept         Slope          Emax           E90 
##  1.566567e-01 -6.466591e-09  1.000000e+00  7.602105e-02  1.641977e-02 
##          Eavg           S:z           S:p 
##  8.148218e-03  2.475030e-03  9.980252e-01

TRM Practical Data Analysis - Intermediate level

Thach Tran

2024-04-28

TRM Practical Data Analysis - Intermediate level

Lecture 12. Prediction model - Validation

Task 1: Validation a prediction model for proffessorial salaries

1.1 Study design:

1.2 Import the “Professorial Salaries” and name this dataset “salary”

1.3 Split data validation

1.3.1 Slit data

1.3.2 Describe the baseline characteristics in the Training and Testing datasets

1.3.2.1 Training dataset

1.3.2.2 Testing dataset

1.3.3 Develop a prediction model

1.3.3.1 Systematically search for the optimal prediction model

1.3.3.2 Present the model

1.3.3.3 Check the model’s assumptions

1.3.4 Validation of the model

1.4 Bootstrapping

1.4.1 Describe the baseline characteristics

1.4.2 Develop a prediction model

1.4.2.1 Search for the optimal prediction model

1.4.2.2 Present the model

1.4.3.3 Check the model’s assumptions

1.4.3 Validation of the model

1.4.3.1 Estimate the parameters of the model

1.4.3.2 Validate the model

Task 2: Validation a prediction model for high salaries (> $120,000)

2.1 Study design:

2.2 Import the “Professorial Salaries” and name this dataset “salary”

2.3 Create a new variable ‘high salary’ using the threshold of $120,000 and ‘Prof’ (AssProf = ‘AsstProfessor’ or ‘AssoProfessor’; Prof = ‘Professor’)

2.4 Split data validation

2.4.1 Slit data

2.4.2 Describe the baseline characteristics in the Training and Testing datasets

2.4.2.1 Training dataset

2.4.2.2 Testing dataset

2.4.3 Develop a prediction model

2.4.3.1 Systematically search for the optimal prediction model

2.4.3.2 Present the model

2.4.4 Validate the model

Calculating AUC

2.5 Bootstrapping validation

2.5.1 Describe the baseline characteristics

2.5.3 Develop a prediction model

2.5.3.1 Systematically search for the optimal prediction model

2.5.3.2 Present the model

2.5.4 Validate the model

2.5.4.1 Estimate the parameters of the model

2.5.4.2 Validate the model

2.5.4.3 Validation curve

Task 3. Save your work and upload it to your Rpubs account