Abstract

The objective of this project is to identify the association between prostate-specific antigen (PSA) and a variety of prognostic clinical measurements in men with advanced prostate cancer. Cancer volume, weight, age, benign prostatic hyperplasia, seminal vesicle invasion, capsular penetration, and gleason score are the set of clinical measurements included in this analysis. A variety of multiple linear regression models were investigated to identify the relationship between PSA and these clinical measurements. Using some of the predictor variables mentioned, a final linear regression model was developed in order to predict the level of PSA among men diagnosed with prostate cancer. The result of the analysis reveals that cancer volume, benign prostatic hyperplasia, and seminal vesicle invasion are useful factors in detecting the PSA level in men with prostate cancer. This analysis uses a dataset collected from men with prostate cancer and does not include any measurements on men who are not diagnosed with prostate cancer.

Introduction

Prostate cancer is one of the most common causes of cancer-related deaths among men. In the US, the American Cancer Society estimates approximately 248,530 new cases and 34,130 deaths from prostate cancer for 2021 (American Cancer Society, 2021). Additionally, there is a high risk of prostate cancer in men. The American Cancer Society estimates that approximately 1 in 8 men will be diagnosed with prostate cancer and 1 in 41 men will die of prostate cancer.


Prostate-specific antigen (PSA), an enzyme produced by the prostate, is commonly recommended as a screening mechanism for detecting prostate cancer. In order to become an efficient screening tool, it is important that we understand how PSA levels relate to factors that may determine prognosis and outcome. However, PSA screening does not always guarantee accurate results. In rare cases, some men with prostate cancer do not have an elevated PSA and some men with high level of PSA do not have prostate cancer. Moreover, PSA levels can change for variety of reasons other than cancer. Some of the other common causes of high PSA levels include benign prostatic hyperplasia (BPH), urinary tract infection (UTI), and prostatitis (prostate inflammation) (Whelan, 2017). Despite this, PSA level detection is still currently one of the most helpful method used in prostate cancer diagnosis.


The data used to analyze the relationship between PSA level and a number of prognostic clinical measurements were collected on 97 men between ages 41 and 79 who were about to undergo radical prostectomies (Stamey et al., 1989). The statistical analysis performed to analyze this dataset involves conducting a series of multiple regression analyses used to determine the best model predictor for PSA level in men who are diagnosed with prostate cancer using both quantitative and qualitative clinical measurements included in the data. The other clinical measurements included in the data aside from PSA level are commonly measured in men with prostate cancer.

Methods and Results

The goal of this analysis was to build a multiple regression model to predict the level of PSA from a subset of predictors: cancer volume, prostate weight, patient age, benign prostatic hyperplasia, seminal vesicle invasion, capsular penetration, and gleason score. In order to reach this goal, the project methodology was split into multiple parts. The relevant steps taken to perform the analysis includes exploratory data analysis, model building, model selection, model validation, and model diagnostics.



Part I: Exploratory Data Analysis

The purpose of the exploratory data analysis is to investigate the data to be used in the multiple regression analyses. The dataset does not include any missing values. Two variables (seminal vesicle invasion and gleason score) are converted to categorical (qualitative) variables. Out of the 97 advanced prostate cancer cases in the dataset, about 78% do not have seminal vesicle invasion, which suggests that the presence of seminal vesicle invasion does not imply prostate cancer is present but it is positively correlated with PSA level. The relationships between the quantitative variables does not show any obvious nonlinearity. The results reveal a high correlation between PSA level and predictor variables such as cancer volume, and capsular penetration. Since the PSA level have a severely right-skewed distribution as shown in Figure 3, it can be transformed using a log transformation, which has similar effect to PSA’s distribution as the power transformation of -0.1 suggested by the Box-cox log-likelihood procedure (Figure 5). The similar effect of log and power transformation is evident in Figure 6. Performing the log transformation increased the coefficient of determination \(R^2\) of the first-order model with all seven predictors. Additionally, exploring the dataset shows that PSA level increases from low to high gleason score (high prognosis) and is weakly correlated with prostate weight and patient age.

Data Processing and Manipulation

##Read data into R
data <- read.table("C:/Users/kayan/R-Projects/STA206/Project/data/prostate.txt", header=F)

##Assign header names 
names(data) = c("patient_id","psa_level","cancer_vol","weight", "age", "benign_prostatic_hyperplasia", "seminal_vesicle_invasion", "capsular_penetration", "gleason_score")

##Convert data to dataframe
data = as.data.frame(data)

##View data
library(DT)
datatable(data, caption = 'Table 1: Prostate Cancer Data', rownames = FALSE,filter="top", options = list(pageLength = 10, autoWidth = TRUE, scrollX=F, columnDefs = list(list(width = '50px', targets = "_all"))))
##Check summary statistic for each variable:
data_summary<-as.data.frame(apply(data,2,summary))
datatable(data_summary, caption = 'Table 2: Data Summary', rownames = TRUE, options = list(pageLength = 9, scrollX=F))
##check for variable types 
s <-sapply(data, class)
data_types<-as.data.frame(s)
datatable(data_types,  caption = 'Table 3: Data Types I', rownames = TRUE,colnames = "Type")
##Check number of missing values for each variable
n <-sapply(data, function(x) sum(is.na(x)))
missing_<-as.data.frame(n)
datatable(missing_,  caption = 'Table 4: Missing Values', rownames = TRUE,colnames = "NA")
##Drop irrelevant variable: patient_id 
drops=c("patient_id")
prostate=data[,!(names(data)%in%drops)]

##Change variables to categorical
prostate$gleason_score <- as.factor(prostate$gleason_score) 
prostate$seminal_vesicle_invasion <- as.factor(prostate$seminal_vesicle_invasion) 

##recheck variable types 
v <-sapply(prostate, class)
data_types1<-as.data.frame(v)
datatable(data_types1,  caption = 'Table 3: Data Types II', rownames = TRUE,colnames = "Type")

DATA VISUALIZATION

##Pie charts (with class percentage) for seminal_vesicle_invasion
#table(prostate$seminal_vesicle_invasion)
#levels(prostate$seminal_vesicle_invasion)
n <- nrow(prostate)
library(plotly)
svi <- data.frame("Seminal" = c("Absence", "Presence"), "Count" = c(76, 21))
colors <- c('#FC8D62', "#8DA0CB")

fig <- plot_ly(svi, labels = ~Seminal, values = ~Count, type = 'pie',
        textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = '#FFFFFF'),
        hoverinfo = 'text',
        text = ~paste(Count),
        marker = list(colors = colors,
                      line = list(color = '#FFFFFF', width = 1)),
                      #The 'pull' attribute can also be used to create space between the sectors
        showlegend = FALSE)
(fig <- fig %>% layout(title = 'Figure 1: Seminal Vesicle Invasion',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)))
##Pie charts (with class percentage) for gleason_score
#table(prostate$gleason_score)
#levels(prostate$gleason_score)
gs <- data.frame("Gleason" = c("Bad Prognosis (6)","Worse Prognosis (7)", "Worst Prognosis (8)"), "count" = c(33 , 43, 21  ))
colors <- c('#khaki', "palegreen", "plum")

fig1 <- plot_ly(gs, labels = ~Gleason, values = ~count, type = 'pie',
        textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = '#FFFFFF'),
        hoverinfo = 'text',
        text = ~paste(count),
        marker = list(colors = colors,
                      line = list(color = '#FFFFFF', width = 1)),
                      #The 'pull' attribute can also be used to create space between the sectors
        showlegend = FALSE)
(fig1 <- fig1 %>% layout(title = 'Figure 2: Gleason Score',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)))
##Histograms of each quantitative variable

##function to combine plots in one panel
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }
 if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

#Plot 1 - PSA level
library(ggplot2)
ggp1 <- ggplot(prostate, aes(x = psa_level)) + xlab("PSA Level") +
    geom_histogram(bins = 30, fill = "cornflowerblue", color = "white") +
    geom_vline(xintercept=mean(prostate$psa_level), col="red", lwd=1.5) 
#Plot 2 - Cancer Volume
ggp2<-ggplot(prostate, aes(x = cancer_vol)) + xlab("Cancer Volume") +
    geom_histogram(bins = 30, fill = "cornflowerblue", color = "white") +
    geom_vline(xintercept=mean(prostate$cancer_vol), col="red", lwd=1.5)
#Plot 3 - Weight
ggp3<-ggplot(prostate, aes(x = weight)) + xlab("Weight") +
    geom_histogram(bins = 30, fill = "cornflowerblue", color = "white") +
    geom_vline(xintercept=mean(prostate$weight), col="red", lwd=1.5)
#Plot 4 - age
ggp4<-ggplot(prostate, aes(x = age)) + xlab("Age") +
    geom_histogram(bins = 30, fill = "cornflowerblue", color = "white") +
    geom_vline(xintercept=mean(prostate$age), col="red", lwd=1.5)
#Plot 5 - Benign prostatic Hyperplasia
ggp5<-ggplot(prostate, aes(x = benign_prostatic_hyperplasia)) + xlab("Benign Prostatic Hyperplasia") +
    geom_histogram(bins = 30, fill = "cornflowerblue", color = "white") +
    geom_vline(xintercept=mean(prostate$benign_prostatic_hyperplasia), col="red", lwd=1.5)
#Plot 6 - Capsular Penetration
ggp6<-ggplot(prostate, aes(x = capsular_penetration)) + xlab("Capsular Penetration") +
    geom_histogram(bins = 30, fill = "cornflowerblue", color = "white") +
    geom_vline(xintercept=mean(prostate$capsular_penetration), col="red", lwd=1.5)

myplots <- list(ggp1, ggp2, ggp3, ggp4, ggp5, ggp6) 
multiplot(plotlist = myplots, cols = 3)
Figure 3: Histograms of quantitative variables

Figure 3: Histograms of quantitative variables

##Transform PSA level to eliminate right skewed distribution shown in Figure 3 

##Box-cox transformation
fit1<-lm(psa_level~., data=prostate) #first-order model with the (non-transformed) variables
par(mfrow=c(2,2))
plot(fit1,pch =20, cex = 2, col = "aquamarine2")
Figure 4

Figure 4

###The Box-Cox likelihood suggests a power transformation with power -0.1.
par(mfrow=c(1,2))
bc<-MASS::boxcox(fit1, plotit = TRUE)
library(car)
#with(prostate, boxCox(fit1, data = prostate, main= ""))
with(prostate, boxCox(fit1, data = prostate, main= "", lambda = seq(-0.25, 0.25, length = 10))) #zoomed in
Figure 5: Log-likelihood

Figure 5: Log-likelihood

(lambda <- bc$x[which.max(bc$y)]) #optimal lambda
## [1] 0.1010101
##psa_level transformation 
ggp0 <- ggplot(prostate, aes(x = psa_level^(-0.1))) + xlab("Histogram of psa_level^(-0.1))") +
    geom_histogram(aes(y = ..density..), bins = 30, fill = "gold", color = "white") + 
                stat_function(fun = dnorm,
                args = list(mean = mean(prostate$psa_level^(-0.1)),
                            sd = sd(prostate$psa_level^(-0.1))),
                col = "grey40",
                size = 0.5)
ggp7 <- ggplot(prostate, aes(x = log(psa_level))) + xlab("Histogram of log(psa_level)") +
    geom_histogram(aes(y = ..density..),bins = 30, fill = "rosybrown", color = "white") +
                stat_function(fun = dnorm,
                args = list(mean = mean(log(prostate$psa_level)),
                            sd = sd(log(prostate$psa_level))),
                col = "grey40",
                size = 0.5)
#Plot 2 - Cancer Volume
ggp8<-ggplot(prostate, aes(x = sqrt(psa_level))) + xlab("Histogram of sqrt(psa_level)") +
    geom_histogram(aes(y = ..density..),bins = 30, fill = "springgreen", color = "white")+
                stat_function(fun = dnorm,
                args = list(mean = mean(sqrt(prostate$psa_level)),
                            sd = sd(sqrt(prostate$psa_level))),
                col = "grey40",
                size = 0.5) 
#Plot 3 - Weight
ggp9<-ggplot(prostate, aes(x = 1/psa_level)) + xlab("Histogram of 1/psa_level") +
    geom_histogram(aes(y = ..density..),bins = 30, fill = "darkmagenta", color = "white")  +
                stat_function(fun = dnorm,
                args = list(mean = mean(1/prostate$psa_level),
                            sd = sd(1/prostate$psa_level)),
                col = "grey40",
                size = 0.5) 
myplots2 <- list(ggp0,ggp7, ggp8, ggp9) 
multiplot(plotlist = myplots2, cols = 2)
Figure 6: Histograms of PSA Level Transformations

Figure 6: Histograms of PSA Level Transformations

##Based on Figure 6, the best transformation is log(psa_level) 
prostate$psa_level<-log(prostate$psa_level)

##Model diagnostic - after variable transformation 
fit2 <-lm(psa_level~., data=prostate) #first-order model with the (transformed) variable
par(mfrow=c(2,2))
plot(fit2,pch =20, cex = 2, col = "aquamarine2")
Figure 7

Figure 7

##scatter plot matrix among quantitative variables with the lower panel showing correlation
# Correlation panel
panel.cor <- function(x, y) {
    #usr <- par('usr') on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- round(cor(x, y, use = "complete.obs"), 2)
    txt <- paste0("R = ", r)
    cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

# Customize upper panel
upper.panel<-function(x, y){
  #my_cols <- c("#00AFBB", "#E7B800", "#FC4E07") 
  points(x,y, pch = 19, col = "red")
}

# Create the plots
pairs(~psa_level + cancer_vol + weight + age + benign_prostatic_hyperplasia + capsular_penetration, data = prostate,lower.panel = panel.cor, upper.panel = upper.panel)
Figure 8: Correlation Plot I

Figure 8: Correlation Plot I

##Scatter plot 2
library(ggplot2)
library(GGally)
library(dplyr)
#p <- ggpairs(dat_psa)
# put scatterplots on top so y axis is vertical
(p <- ggpairs(prostate %>% select(psa_level, cancer_vol, weight, age, benign_prostatic_hyperplasia, capsular_penetration)
            #, upper = list(continuous = wrap("points", alpha = 0.2, size = 0.5))
            , aes(color = prostate$seminal_vesicle_invasion)
            , upper = list(continuous = "points")
            , lower = list(continuous = "cor")
            ))
Figure 9: Correlation Plot II

Figure 9: Correlation Plot II

##Pairwise correlation matrix for all quantitative variables
quantitative<-c('cancer_vol','weight','age','benign_prostatic_hyperplasia' , 'capsular_penetration')
c <-cor(prostate[,c("psa_level",quantitative)], use = "pairwise.complete.obs")
corr_df<-as.data.frame(c)
library(formattable)
library(kableExtra)
#as.datatable(formattable(corr_df, list(area(col = c(psa_level, cancer_vol, weight, age, benign_prostatic_hyperplasia, capsular_penetration)) ~ color_tile("white", "orange"))), rownames = TRUE, options = list(pageLength = 6, scrollX=T))

for(i in 1:nrow(corr_df)) corr_df[i,] <- color_tile("white", "orange")(corr_df[i,])

corr_df %>%
  kable(escape = F) %>%
  kable_styling(full_width = F, latex_options="scale_down")%>% scroll_box(width = "100%")
psa_level cancer_vol weight age benign_prostatic_hyperplasia capsular_penetration
psa_level 1 0.6570739 0.1217208 0.1699068 0.1574016 0.5180231
cancer_vol 0.657073936076788 1 0.0051071479125806 0.0390944231120364 -0.133209430563056 0.692896688349495
weight 0.121720757056228 0.0051071479125806 1 0.164323713744137 0.321848747536083 0.00157890459676328
age 0.169906822489551 0.0390944231120364 0.164323713744137 1 0.366341212237295 0.0995553505620938
benign_prostatic_hyperplasia 0.157401578697896 -0.133209430563056 0.321848747536083 0.366341212237295 1 -0.0830086487111463
capsular_penetration 0.518023107610481 0.692896688349495 0.00157890459676328 0.0995553505620938 -0.0830086487111463 1

Table 5: Pairwise correlation matrix

##side-by-side box plots for ”PSA” with respect to each categorical variable
fig4 <- plot_ly(prostate, y = ~psa_level, color = ~gleason_score, type = "box")
fig5 <- plot_ly(prostate, y = ~psa_level, color = ~seminal_vesicle_invasion, type = "box")

# subplot
(fig6 <- subplot(fig4, fig5, nrows = 1))

Figure 10: PSA Level: Side-by-Side Boxplot by Gleason Score (Left) and Seminal Vesicle Invasion (Right)

#fig6 <- fig6 %>% layout(title = "PSA Level: Side-by-Side Boxplot by Gleason Score (Left) and Seminal Vesicle Invasion (Right)")
##plot the distribution of PSA level 
##by seminal invasion in using kernel density plots
ggp7<-ggplot(prostate, 
       aes(x = psa_level, 
           fill = seminal_vesicle_invasion)) +
  geom_density(alpha = 0.4) +
  labs(title = "PSA Level distribution by Seminal Vesicle Invasion")
ggp8<-ggplot(prostate, 
       aes(x = psa_level, 
           fill = gleason_score)) +
  geom_density(alpha = 0.4) +
  labs(title = "PSA Level distribution by Gleason Score")

myplots1 <- list(ggp7, ggp8) 
multiplot(plotlist = myplots1, cols = 1)
Figure 11: PSA Level Distribution

Figure 11: PSA Level Distribution

Observation(s):

  • PSA level increases from low to high gleason score and from absence to presence of seminal vesicle invasion

  • There are no obvious nonlinearity

  • The Box-Cox likelihood suggests a power transformation with power -0.1 which produces similar results as log transformation of response variable so we can use either one

  • R-squared increases from Model 1 to Model 2 after log transformation


Part II: Multiple Regression with Categorical Variables

  • Response variable: psa_level
  • Predictor variables: cancer_vol, weight, age, benign_prostatic_hyperplasia, seminal_vesicle_invasion, capsular_penetration, gleason_score
  • Objective: Fit model with interactions and compare the two models using the function anova() and determine which variables are not significant
##Preliminary Model Fitting
summary(fit1) #model with all predictors (non-transformed psa_level) from Part I
## 
## Call:
## lm(formula = psa_level ~ ., data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.153  -7.323  -0.177   6.403 161.547 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                  31.849265  28.958981   1.100  0.27442   
## cancer_vol                    1.748107   0.615858   2.838  0.00563 **
## weight                       -0.004546   0.074038  -0.061  0.95118   
## age                          -0.537278   0.471991  -1.138  0.25808   
## benign_prostatic_hyperplasia  1.530782   1.201007   1.275  0.20581   
## seminal_vesicle_invasion1    21.108723  10.844893   1.946  0.05479 . 
## capsular_penetration          1.097882   1.322879   0.830  0.40883   
## gleason_score7               -1.661862   7.570741  -0.220  0.82676   
## gleason_score8               18.423157  10.661795   1.728  0.08750 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.91 on 88 degrees of freedom
## Multiple R-squared:  0.4733, Adjusted R-squared:  0.4254 
## F-statistic: 9.886 on 8 and 88 DF,  p-value: 1.037e-09
##Preliminary Model Fitting
summary(fit2) #model with all predictors (transformed psa_level) from part I
## 
## Call:
## lm(formula = psa_level ~ ., data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85118 -0.45409  0.07018  0.45549  1.50935 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.496462   0.722596   2.071  0.04129 *  
## cancer_vol                    0.067454   0.015367   4.389 3.15e-05 ***
## weight                        0.001268   0.001847   0.687  0.49419    
## age                          -0.002799   0.011777  -0.238  0.81267    
## benign_prostatic_hyperplasia  0.089106   0.029968   2.973  0.00380 ** 
## seminal_vesicle_invasion1     0.793175   0.270606   2.931  0.00430 ** 
## capsular_penetration         -0.026527   0.033009  -0.804  0.42378    
## gleason_score7                0.296768   0.188908   1.571  0.11978    
## gleason_score8                0.746606   0.266037   2.806  0.00617 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7714 on 88 degrees of freedom
## Multiple R-squared:  0.5902, Adjusted R-squared:  0.5529 
## F-statistic: 15.84 on 8 and 88 DF,  p-value: 3.135e-14

Observation(s):

  • First order model (fit2 - includes all predictors): \(\beta_2\), \(\beta_3\), and \(\beta_6\) are not significant

\(H_0:\beta_2=0~~vs.~~H_a:\beta_2\neq0,~T^*=0.687, ~Under H_0, T^*\sim t_{(88)}, p-value=0.494\) \(H_0:\beta_3=0~~vs.~~H_a:\beta_3\neq0,~T^*=-0.238, ~Under H_0, T^*\sim t_{(88)}, p-value=0.813\) \(H_0:\beta_6=0~~vs.~~H_a:\beta_6\neq0,~T^*=-0.804 , ~Under H_0, T^*\sim t_{(88)}, p-value=0.424\)

  • Log transformation of response variable increased \(R^2\)
##Check for multicollinearity in this data

##Variance inflation factors VIF_k
#Obtain R-squared_k by regressing X_k to {X_j : 1≤j not equal to k≤4} (k = 1, 2, 3, 4)
(vif_1 <- 1/(1-(summary(lm(cancer_vol~weight+age+benign_prostatic_hyperplasia+capsular_penetration, data = prostate))$r.squared)))
## [1] 1.948535
(vif_2 <- 1/(1-(summary(lm(weight~cancer_vol+age+benign_prostatic_hyperplasia+capsular_penetration, data = prostate))$r.squared)))
## [1] 1.121247
(vif_3 <- 1/(1-(summary(lm(age~cancer_vol+weight+benign_prostatic_hyperplasia+capsular_penetration, data = prostate))$r.squared)))
## [1] 1.181007
(vif_4 <- 1/(1-(summary(lm(benign_prostatic_hyperplasia~cancer_vol+weight+age+capsular_penetration, data = prostate))$r.squared)))
## [1] 1.293057
(vif_5 <- 1/(1-(summary(lm(capsular_penetration~cancer_vol+weight+age+ benign_prostatic_hyperplasia, data = prostate))$r.squared)))
## [1] 1.944823

Observation(s):

  • All five VIF values are a higher than 1 and far less than 10, so we can conclude that there is not much multicollinearity in the model
##First order model with significant X variables
fit3a <- lm(psa_level~ cancer_vol + benign_prostatic_hyperplasia + seminal_vesicle_invasion , data=prostate)
summary(fit3a)
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9867 -0.4996  0.1032  0.5545  1.4993 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.51484    0.13206  11.471  < 2e-16 ***
## cancer_vol                    0.07618    0.01256   6.067 2.78e-08 ***
## benign_prostatic_hyperplasia  0.09971    0.02674   3.729 0.000331 ***
## seminal_vesicle_invasion1     0.82194    0.23858   3.445 0.000858 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7861 on 93 degrees of freedom
## Multiple R-squared:  0.5502, Adjusted R-squared:  0.5357 
## F-statistic: 37.92 on 3 and 93 DF,  p-value: 4.247e-16
fit3aa <- lm(psa_level~ cancer_vol + benign_prostatic_hyperplasia + seminal_vesicle_invasion +gleason_score, data=prostate)
summary(fit3aa)
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion + gleason_score, data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85235 -0.45777  0.06741  0.51651  1.53204 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.38817    0.15609   8.894 5.27e-14 ***
## cancer_vol                    0.06241    0.01367   4.566 1.55e-05 ***
## benign_prostatic_hyperplasia  0.09265    0.02627   3.527  0.00066 ***
## seminal_vesicle_invasion1     0.69646    0.23837   2.922  0.00439 ** 
## gleason_score7                0.26028    0.18280   1.424  0.15790    
## gleason_score8                0.70545    0.25712   2.744  0.00732 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7636 on 91 degrees of freedom
## Multiple R-squared:  0.5848, Adjusted R-squared:  0.5619 
## F-statistic: 25.63 on 5 and 91 DF,  p-value: 4.722e-16
##Model with two-way interactions (significant predictors)
drops1=c("weight", "age", "capsular_penetration")
pd=prostate[,!(names(prostate)%in%drops1)]
fit3b <- lm(psa_level~ .^2, data=pd)
summary(fit3b)
## 
## Call:
## lm(formula = psa_level ~ .^2, data = pd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71044 -0.41531 -0.00348  0.48621  1.80370 
## 
## Coefficients:
##                                                         Estimate Std. Error
## (Intercept)                                             1.080795   0.202661
## cancer_vol                                              0.118417   0.027680
## benign_prostatic_hyperplasia                            0.141754   0.052268
## seminal_vesicle_invasion1                              -0.004472   1.263746
## gleason_score7                                          0.410471   0.312494
## gleason_score8                                          1.197254   0.518413
## cancer_vol:benign_prostatic_hyperplasia                 0.001194   0.008732
## cancer_vol:seminal_vesicle_invasion1                   -0.020573   0.036158
## cancer_vol:gleason_score7                              -0.020164   0.053961
## cancer_vol:gleason_score8                              -0.073630   0.042757
## benign_prostatic_hyperplasia:seminal_vesicle_invasion1 -0.166088   0.090083
## benign_prostatic_hyperplasia:gleason_score7            -0.053534   0.062762
## benign_prostatic_hyperplasia:gleason_score8            -0.052336   0.105125
## seminal_vesicle_invasion1:gleason_score7                1.141611   1.178306
## seminal_vesicle_invasion1:gleason_score8                1.493889   1.077867
##                                                        t value Pr(>|t|)    
## (Intercept)                                              5.333 8.36e-07 ***
## cancer_vol                                               4.278 5.07e-05 ***
## benign_prostatic_hyperplasia                             2.712  0.00814 ** 
## seminal_vesicle_invasion1                               -0.004  0.99719    
## gleason_score7                                           1.314  0.19267    
## gleason_score8                                           2.309  0.02343 *  
## cancer_vol:benign_prostatic_hyperplasia                  0.137  0.89159    
## cancer_vol:seminal_vesicle_invasion1                    -0.569  0.57093    
## cancer_vol:gleason_score7                               -0.374  0.70962    
## cancer_vol:gleason_score8                               -1.722  0.08883 .  
## benign_prostatic_hyperplasia:seminal_vesicle_invasion1  -1.844  0.06884 .  
## benign_prostatic_hyperplasia:gleason_score7             -0.853  0.39616    
## benign_prostatic_hyperplasia:gleason_score8             -0.498  0.61992    
## seminal_vesicle_invasion1:gleason_score7                 0.969  0.33547    
## seminal_vesicle_invasion1:gleason_score8                 1.386  0.16951    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7489 on 82 degrees of freedom
## Multiple R-squared:  0.6401, Adjusted R-squared:  0.5787 
## F-statistic: 10.42 on 14 and 82 DF,  p-value: 5.1e-13
##Model with two-way interactions 
fit3b <- lm(psa_level~ .^2, data=prostate)
summary(fit3b)
## 
## Call:
## lm(formula = psa_level ~ .^2, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6692 -0.2934  0.0000  0.3874  1.6488 
## 
## Coefficients:
##                                                         Estimate Std. Error
## (Intercept)                                             4.233800   2.570730
## cancer_vol                                              0.303276   0.173336
## weight                                                 -0.086133   0.081243
## age                                                    -0.064713   0.042944
## benign_prostatic_hyperplasia                            0.389719   0.435132
## seminal_vesicle_invasion1                              -1.420907   4.851800
## capsular_penetration                                   -0.049000   0.546142
## gleason_score7                                         -0.030655   1.802806
## gleason_score8                                          3.454056   4.332597
## cancer_vol:weight                                      -0.002764   0.001897
## cancer_vol:age                                         -0.000597   0.002230
## cancer_vol:benign_prostatic_hyperplasia                 0.011300   0.010969
## cancer_vol:seminal_vesicle_invasion1                   -0.031776   0.069049
## cancer_vol:capsular_penetration                         0.005215   0.005983
## cancer_vol:gleason_score7                              -0.125014   0.087691
## cancer_vol:gleason_score8                              -0.130589   0.058787
## weight:age                                              0.001707   0.001286
## weight:benign_prostatic_hyperplasia                    -0.003891   0.001476
## weight:seminal_vesicle_invasion1                       -0.043008   0.052745
## weight:capsular_penetration                             0.003770   0.005860
## weight:gleason_score7                                   0.004122   0.010257
## weight:gleason_score8                                   0.020481   0.029580
## age:benign_prostatic_hyperplasia                       -0.001387   0.006858
## age:seminal_vesicle_invasion1                           0.068016   0.076388
## age:capsular_penetration                               -0.004708   0.009147
## age:gleason_score7                                      0.009721   0.029847
## age:gleason_score8                                     -0.048420   0.079074
## benign_prostatic_hyperplasia:seminal_vesicle_invasion1  0.129029   0.180185
## benign_prostatic_hyperplasia:capsular_penetration      -0.034542   0.021356
## benign_prostatic_hyperplasia:gleason_score7            -0.066955   0.082549
## benign_prostatic_hyperplasia:gleason_score8            -0.057507   0.145909
## seminal_vesicle_invasion1:capsular_penetration         -0.135395   0.120265
## seminal_vesicle_invasion1:gleason_score7                0.017252   2.245275
## seminal_vesicle_invasion1:gleason_score8                0.076708   2.105042
## capsular_penetration:gleason_score7                     0.274529   0.213993
## capsular_penetration:gleason_score8                     0.309074   0.212771
##                                                        t value Pr(>|t|)  
## (Intercept)                                              1.647   0.1047  
## cancer_vol                                               1.750   0.0852 .
## weight                                                  -1.060   0.2932  
## age                                                     -1.507   0.1370  
## benign_prostatic_hyperplasia                             0.896   0.3740  
## seminal_vesicle_invasion1                               -0.293   0.7706  
## capsular_penetration                                    -0.090   0.9288  
## gleason_score7                                          -0.017   0.9865  
## gleason_score8                                           0.797   0.4284  
## cancer_vol:weight                                       -1.457   0.1503  
## cancer_vol:age                                          -0.268   0.7898  
## cancer_vol:benign_prostatic_hyperplasia                  1.030   0.3070  
## cancer_vol:seminal_vesicle_invasion1                    -0.460   0.6470  
## cancer_vol:capsular_penetration                          0.872   0.3868  
## cancer_vol:gleason_score7                               -1.426   0.1591  
## cancer_vol:gleason_score8                               -2.221   0.0300 *
## weight:age                                               1.327   0.1893  
## weight:benign_prostatic_hyperplasia                     -2.637   0.0106 *
## weight:seminal_vesicle_invasion1                        -0.815   0.4180  
## weight:capsular_penetration                              0.643   0.5224  
## weight:gleason_score7                                    0.402   0.6892  
## weight:gleason_score8                                    0.692   0.4913  
## age:benign_prostatic_hyperplasia                        -0.202   0.8404  
## age:seminal_vesicle_invasion1                            0.890   0.3768  
## age:capsular_penetration                                -0.515   0.6086  
## age:gleason_score7                                       0.326   0.7458  
## age:gleason_score8                                      -0.612   0.5426  
## benign_prostatic_hyperplasia:seminal_vesicle_invasion1   0.716   0.4767  
## benign_prostatic_hyperplasia:capsular_penetration       -1.617   0.1109  
## benign_prostatic_hyperplasia:gleason_score7             -0.811   0.4205  
## benign_prostatic_hyperplasia:gleason_score8             -0.394   0.6949  
## seminal_vesicle_invasion1:capsular_penetration          -1.126   0.2647  
## seminal_vesicle_invasion1:gleason_score7                 0.008   0.9939  
## seminal_vesicle_invasion1:gleason_score8                 0.036   0.9711  
## capsular_penetration:gleason_score7                      1.283   0.2044  
## capsular_penetration:gleason_score8                      1.453   0.1515  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7434 on 61 degrees of freedom
## Multiple R-squared:  0.7361, Adjusted R-squared:  0.5847 
## F-statistic: 4.862 on 35 and 61 DF,  p-value: 3.611e-08
fit3 <- lm(psa_level~ cancer_vol + benign_prostatic_hyperplasia + seminal_vesicle_invasion + gleason_score+ cancer_vol:benign_prostatic_hyperplasia + cancer_vol:seminal_vesicle_invasion + cancer_vol:gleason_score, data=prostate)
summary(fit3)
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion + gleason_score + cancer_vol:benign_prostatic_hyperplasia + 
##     cancer_vol:seminal_vesicle_invasion + cancer_vol:gleason_score, 
##     data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67536 -0.45692  0.05281  0.43829  1.70583 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              1.188041   0.182582   6.507 4.71e-09
## cancer_vol                               0.103718   0.024185   4.289 4.64e-05
## benign_prostatic_hyperplasia             0.129957   0.040745   3.190  0.00198
## seminal_vesicle_invasion1                0.946895   0.418419   2.263  0.02612
## gleason_score7                           0.229407   0.253334   0.906  0.36767
## gleason_score8                           1.105168   0.404354   2.733  0.00760
## cancer_vol:benign_prostatic_hyperplasia -0.006877   0.005386  -1.277  0.20507
## cancer_vol:seminal_vesicle_invasion1    -0.025957   0.030557  -0.849  0.39795
## cancer_vol:gleason_score7                0.003900   0.041014   0.095  0.92446
## cancer_vol:gleason_score8               -0.039213   0.032541  -1.205  0.23146
##                                            
## (Intercept)                             ***
## cancer_vol                              ***
## benign_prostatic_hyperplasia            ** 
## seminal_vesicle_invasion1               *  
## gleason_score7                             
## gleason_score8                          ** 
## cancer_vol:benign_prostatic_hyperplasia    
## cancer_vol:seminal_vesicle_invasion1       
## cancer_vol:gleason_score7                  
## cancer_vol:gleason_score8                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7541 on 87 degrees of freedom
## Multiple R-squared:  0.6128, Adjusted R-squared:  0.5728 
## F-statistic:  15.3 on 9 and 87 DF,  p-value: 1.251e-14
##Polynomial Model
fit4 <- lm(psa_level~ cancer_vol + benign_prostatic_hyperplasia + seminal_vesicle_invasion + gleason_score+cancer_vol:benign_prostatic_hyperplasia+I(cancer_vol^2)+I(benign_prostatic_hyperplasia^2), data=prostate)
summary(fit4)
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion + gleason_score + cancer_vol:benign_prostatic_hyperplasia + 
##     I(cancer_vol^2) + I(benign_prostatic_hyperplasia^2), data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63690 -0.47919  0.09209  0.50902  1.74263 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              1.1369391  0.1834337   6.198 1.80e-08
## cancer_vol                               0.1272796  0.0304641   4.178 6.91e-05
## benign_prostatic_hyperplasia             0.1947783  0.0794516   2.452  0.01620
## seminal_vesicle_invasion1                0.6221195  0.2399560   2.593  0.01115
## gleason_score7                           0.2325434  0.1807751   1.286  0.20169
## gleason_score8                           0.6879983  0.2573059   2.674  0.00894
## I(cancer_vol^2)                         -0.0017616  0.0007775  -2.266  0.02592
## I(benign_prostatic_hyperplasia^2)       -0.0077762  0.0087863  -0.885  0.37855
## cancer_vol:benign_prostatic_hyperplasia -0.0078307  0.0054541  -1.436  0.15462
##                                            
## (Intercept)                             ***
## cancer_vol                              ***
## benign_prostatic_hyperplasia            *  
## seminal_vesicle_invasion1               *  
## gleason_score7                             
## gleason_score8                          ** 
## I(cancer_vol^2)                         *  
## I(benign_prostatic_hyperplasia^2)          
## cancer_vol:benign_prostatic_hyperplasia    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7493 on 88 degrees of freedom
## Multiple R-squared:  0.6133, Adjusted R-squared:  0.5782 
## F-statistic: 17.45 on 8 and 88 DF,  p-value: 2.714e-15
par(mfrow = c(4, 2))
plot(fit1, which = 1, pch =20, cex = 2, col = "aquamarine2")
plot(fit2, which = 1, pch =20, cex = 2, col = "olivedrab1")
plot(fit3a, which = 1, pch =20, cex = 2, col = "goldenrod")
plot(fit3aa, which = 1, pch =20, cex = 2, col = "aquamarine2")
plot(fit3b, which = 1, pch =20, cex = 2, col = "dodgerblue")
plot(fit3, which = 1, pch =20, cex = 2, col = "coral")
plot(fit4, which = 1, pch =20, cex = 2, col = "orchid")
Figure 12

Figure 12


Part III: Model building and model selection

The goal of model building and model selection part is to determine the best predictive model for PSA Level and to verify whether the best model produced by the previous step is the similar to the best model produced by using criterion in the best subsets selection method. First, a first-order model with all 7 predictors for transformed PSA level is examined. Based on this, prostate weight, patient age, and capsular penetration are dropped from the model and considered to have no statistical significance. The Variance Inflation Factor (VIF) for the quantitative predictor variables (cancer volume, weight, age, benign prostatic hyperplasia, and capsular penetration) are higher than 1 and far less than 10, so it can be concluded that there is no substantial multicollinearity in the model. The multiple coefficients of determination \(R^2\) of the first order model is moderately good (59%). A first-order model with two-way interactions and all predictors is also fitted and resulted in a fairly large \(R^2\) (73.6%). However, this model includes various nuisance variables and interactions that are not statistically significant. The original data is then randomly split into training and validation datasets to be used in the next steps that follow to prevent the resulting model from overfitting and to accurately evaluate the models.

###split data into two halves (training and validation)
set.seed(10)
n <- nrow(prostate)/2
ind <- sample(1:(2*n), n, replace=FALSE)
prostate.t <- prostate[ind, ] #training set
prostate.v <- prostate[-ind, ] #validation/test set


###Draw side-by-side boxplots for PSA and other variables, in training data and validation data, respectively to see if they have similar distributions in these two sets

par(mfrow=c(2,3))
for (col_name in c('psa_level', 'cancer_vol', 'weight',
'age', 'benign_prostatic_hyperplasia', 'capsular_penetration')){
boxplot(prostate.t[, col_name],prostate.v[, col_name],main=col_name,names=c('training','validation'), col=c("coral","lightblue"))
}
Figure 13

Figure 13

Observation(s):

  • The training data and validation data appear to have similar distributions.

I. Best Subsets Regression

The primary goal of this step is to identify the best model according to each criterion: \(SSE_p\),\(R^2_p\), \(R^2_{a,p}\), \(C_p\), \(AIC_p\),\(BIC_p\). The Best Subsets Regression method is used to verify whether the reduced model with statistically significant predictor variables identified in the preceding step where a first-order model with all predictor variables is fitted will be the same as the “best” model identified through this step. The “best” model according to sum of squared estimates of errors (SSE) and multiple coefficient of determination \(R^2\) is the full model with all seven predictors while the “best” model using adjusted-\(R^2\) included some predictors (cancer volume, benign prostatic hyperplasia, seminal vesicle invasion1, gleason score7, and gleason score8). Using Mallows’s Cp, Akaike information criterion (AIC), and Bayesian information criterion (BIC), the “best” model includes cancer volume, benign prostatic hyperplasia, seminal vesicle invasion1, and gleason score8. The Cp, AIC, and BIC values for this model are 3.72, -34.79, and -25.43 respectively. The “best” model identified using this method is similar to the model that was identified in the previous step, which includes all significant predictor variables (cancer volume, benign prostatic hyperplasia, seminal vesicle invasion, and gleason score).

fit5<-lm(psa_level~.,data=prostate.t)
summary(fit5)
## 
## Call:
## lm(formula = psa_level ~ ., data = prostate.t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39781 -0.42545 -0.02081  0.43310  1.45613 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                   4.321e-01  1.229e+00   0.352  0.72698   
## cancer_vol                    4.602e-02  2.417e-02   1.904  0.06436 . 
## weight                        2.396e-05  1.694e-03   0.014  0.98879   
## age                           1.810e-02  1.956e-02   0.926  0.36033   
## benign_prostatic_hyperplasia  7.640e-02  4.039e-02   1.892  0.06596 . 
## seminal_vesicle_invasion1     1.011e+00  3.109e-01   3.251  0.00237 **
## capsular_penetration         -4.094e-02  4.545e-02  -0.901  0.37330   
## gleason_score7                2.291e-01  2.457e-01   0.933  0.35672   
## gleason_score8                7.121e-01  3.068e-01   2.321  0.02561 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6727 on 39 degrees of freedom
## Multiple R-squared:  0.6087, Adjusted R-squared:  0.5284 
## F-statistic: 7.584 on 8 and 39 DF,  p-value: 4.642e-06
length(fit5$coef) #8 regression coefficients
## [1] 9
anova(fit5)['Residuals',3] #MSE
## [1] 0.4525879
par(mfrow=c(2,2))
plot(fit5,which=1:3, pch =20, cex = 2, col = "aquamarine2")
MASS::boxcox(fit5)
Figure 14

Figure 14

library(leaps)
sub_set<-regsubsets(psa_level~.,data=prostate.t,nbest=3,nvmax=8,method="exhaustive")
(sum_sub<-summary(sub_set))
## Subset selection object
## Call: regsubsets.formula(psa_level ~ ., data = prostate.t, nbest = 3, 
##     nvmax = 8, method = "exhaustive")
## 8 Variables  (and intercept)
##                              Forced in Forced out
## cancer_vol                       FALSE      FALSE
## weight                           FALSE      FALSE
## age                              FALSE      FALSE
## benign_prostatic_hyperplasia     FALSE      FALSE
## seminal_vesicle_invasion1        FALSE      FALSE
## capsular_penetration             FALSE      FALSE
## gleason_score7                   FALSE      FALSE
## gleason_score8                   FALSE      FALSE
## 3 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          cancer_vol weight age benign_prostatic_hyperplasia
## 1  ( 1 ) "*"        " "    " " " "                         
## 1  ( 2 ) " "        " "    " " " "                         
## 1  ( 3 ) " "        " "    " " " "                         
## 2  ( 1 ) " "        " "    " " " "                         
## 2  ( 2 ) "*"        " "    " " " "                         
## 2  ( 3 ) " "        " "    " " "*"                         
## 3  ( 1 ) " "        " "    " " "*"                         
## 3  ( 2 ) "*"        " "    " " "*"                         
## 3  ( 3 ) " "        " "    "*" " "                         
## 4  ( 1 ) "*"        " "    " " "*"                         
## 4  ( 2 ) " "        " "    "*" "*"                         
## 4  ( 3 ) " "        " "    " " "*"                         
## 5  ( 1 ) "*"        " "    " " "*"                         
## 5  ( 2 ) "*"        " "    "*" "*"                         
## 5  ( 3 ) "*"        " "    " " "*"                         
## 6  ( 1 ) "*"        " "    "*" "*"                         
## 6  ( 2 ) "*"        " "    " " "*"                         
## 6  ( 3 ) "*"        " "    "*" "*"                         
## 7  ( 1 ) "*"        " "    "*" "*"                         
## 7  ( 2 ) "*"        "*"    "*" "*"                         
## 7  ( 3 ) "*"        "*"    " " "*"                         
## 8  ( 1 ) "*"        "*"    "*" "*"                         
##          seminal_vesicle_invasion1 capsular_penetration gleason_score7
## 1  ( 1 ) " "                       " "                  " "           
## 1  ( 2 ) "*"                       " "                  " "           
## 1  ( 3 ) " "                       " "                  " "           
## 2  ( 1 ) "*"                       " "                  " "           
## 2  ( 2 ) "*"                       " "                  " "           
## 2  ( 3 ) "*"                       " "                  " "           
## 3  ( 1 ) "*"                       " "                  " "           
## 3  ( 2 ) "*"                       " "                  " "           
## 3  ( 3 ) "*"                       " "                  " "           
## 4  ( 1 ) "*"                       " "                  " "           
## 4  ( 2 ) "*"                       " "                  " "           
## 4  ( 3 ) "*"                       " "                  "*"           
## 5  ( 1 ) "*"                       " "                  "*"           
## 5  ( 2 ) "*"                       " "                  " "           
## 5  ( 3 ) "*"                       "*"                  " "           
## 6  ( 1 ) "*"                       " "                  "*"           
## 6  ( 2 ) "*"                       "*"                  "*"           
## 6  ( 3 ) "*"                       "*"                  " "           
## 7  ( 1 ) "*"                       "*"                  "*"           
## 7  ( 2 ) "*"                       " "                  "*"           
## 7  ( 3 ) "*"                       "*"                  "*"           
## 8  ( 1 ) "*"                       "*"                  "*"           
##          gleason_score8
## 1  ( 1 ) " "           
## 1  ( 2 ) " "           
## 1  ( 3 ) "*"           
## 2  ( 1 ) "*"           
## 2  ( 2 ) " "           
## 2  ( 3 ) " "           
## 3  ( 1 ) "*"           
## 3  ( 2 ) " "           
## 3  ( 3 ) "*"           
## 4  ( 1 ) "*"           
## 4  ( 2 ) "*"           
## 4  ( 3 ) "*"           
## 5  ( 1 ) "*"           
## 5  ( 2 ) "*"           
## 5  ( 3 ) "*"           
## 6  ( 1 ) "*"           
## 6  ( 2 ) "*"           
## 6  ( 3 ) "*"           
## 7  ( 1 ) "*"           
## 7  ( 2 ) "*"           
## 7  ( 3 ) "*"           
## 8  ( 1 ) "*"
n <-nrow(prostate.t)
## number of coefficients in each model: p
p.m<-as.integer(as.numeric(rownames(sum_sub$which))+1)
sse<-sum_sub$rss
aic<-n*log(sse/n)+2*p.m
bic<-n*log(sse/n)+log(n)*p.m
res_sub<-cbind(sum_sub$which,sse,sum_sub$rsq,sum_sub$adjr2,sum_sub$cp, aic, bic)
fit<-lm(psa_level~1,data=prostate.t) ##fit the model with only intercept
full <- lm(psa_level ~., data = prostate.t)  #full first-order model
full1 <- lm(psa_level~ .^2, data=prostate.t) ##full first-order model with interactions 
sse1<-sum(fit$residuals^2)
p<-1
c1<-sse1/summary(full)$sigma^2 - (n - 2 * p)
aic1<-n*log(sse1/n)+2*p
bic1<-n*log(sse1/n)+log(n)*p
none<-c(1,rep(0,8),sse1,0,0,c1,bic1,aic1)
res_sub<-rbind(none,res_sub) ##combine the results with other models
colnames(res_sub)<-c(colnames(sum_sub$which),"sse", "R^2", "R^2_a", "Cp", "aic", "bic")
res_sub
##      (Intercept) cancer_vol weight age benign_prostatic_hyperplasia
## none           1          0      0   0                            0
## 1              1          1      0   0                            0
## 1              1          0      0   0                            0
## 1              1          0      0   0                            0
## 2              1          0      0   0                            0
## 2              1          1      0   0                            0
## 2              1          0      0   0                            1
## 3              1          0      0   0                            1
## 3              1          1      0   0                            1
## 3              1          0      0   1                            0
## 4              1          1      0   0                            1
## 4              1          0      0   1                            1
## 4              1          0      0   0                            1
## 5              1          1      0   0                            1
## 5              1          1      0   1                            1
## 5              1          1      0   0                            1
## 6              1          1      0   1                            1
## 6              1          1      0   0                            1
## 6              1          1      0   1                            1
## 7              1          1      0   1                            1
## 7              1          1      1   1                            1
## 7              1          1      1   0                            1
## 8              1          1      1   1                            1
##      seminal_vesicle_invasion1 capsular_penetration gleason_score7
## none                         0                    0              0
## 1                            0                    0              0
## 1                            1                    0              0
## 1                            0                    0              0
## 2                            1                    0              0
## 2                            1                    0              0
## 2                            1                    0              0
## 3                            1                    0              0
## 3                            1                    0              0
## 3                            1                    0              0
## 4                            1                    0              0
## 4                            1                    0              0
## 4                            1                    0              1
## 5                            1                    0              1
## 5                            1                    0              0
## 5                            1                    1              0
## 6                            1                    0              1
## 6                            1                    1              1
## 6                            1                    1              0
## 7                            1                    1              1
## 7                            1                    0              1
## 7                            1                    1              1
## 8                            1                    1              1
##      gleason_score8      sse       R^2     R^2_a        Cp         aic
## none              0 45.10856 0.0000000 0.0000000 53.668051   0.8890059
## 1                 0 28.50702 0.3680352 0.3542968 18.986704 -21.0104279
## 1                 0 30.61089 0.3213949 0.3066427 23.635243 -17.5925620
## 1                 1 34.90068 0.2262957 0.2094760 33.113600 -11.2973397
## 2                 1 24.26454 0.4620857 0.4381784 11.612865 -26.7448887
## 2                 0 25.18281 0.4417288 0.4169167 13.641802 -24.9618948
## 2                 0 25.96976 0.4242832 0.3986957 15.380576 -23.4848831
## 3                 1 20.21460 0.5518677 0.5213132  4.664474 -33.5101997
## 3                 0 21.01723 0.5340744 0.5023067  6.437896 -31.6412022
## 3                 1 21.55065 0.5222493 0.4896754  7.616483 -30.4381718
## 4                 1 18.88099 0.5814321 0.5424956  3.717845 -34.7861729
## 4                 1 19.61030 0.5652642 0.5248236  5.329271 -32.9669997
## 4                 1 19.87350 0.5594294 0.5184461  5.910815 -32.3270522
## 5                 1 18.35819 0.5930221 0.5445723  4.562699 -34.1340165
## 5                 1 18.47101 0.5905208 0.5417733  4.811991 -33.8399187
## 5                 1 18.51809 0.5894773 0.5406056  4.915995 -33.7177528
## 6                 1 18.02045 0.6005092 0.5420472  5.816467 -33.0252958
## 6                 1 18.03874 0.6001039 0.5415825  5.856864 -32.9766208
## 6                 1 18.05368 0.5997727 0.5412028  5.889876 -32.9368807
## 7                 1 17.65102 0.6086991 0.5402214  7.000200 -32.0195582
## 7                 1 18.01806 0.6005622 0.5306606  7.811184 -31.0316658
## 7                 1 18.03870 0.6001047 0.5301230  7.856788 -30.9767124
## 8                 1 17.65093 0.6087011 0.5284346  9.000000 -30.0198044
##              bic
## none  -0.9821951
## 1    -17.2680258
## 1    -13.8501600
## 1     -7.5549377
## 2    -21.1312857
## 2    -19.3482918
## 2    -17.8712800
## 3    -26.0253956
## 3    -24.1563981
## 3    -22.9533678
## 4    -25.4301678
## 4    -23.6109947
## 4    -22.9710472
## 5    -22.9068105
## 5    -22.6127127
## 5    -22.4905467
## 6    -19.9268888
## 6    -19.8782137
## 6    -19.8384737
## 7    -17.0499501
## 7    -16.0620577
## 7    -16.0071043
## 8    -13.1789953
##The best model according to each criterion are extracted as follows: 
#result.sum = summary(sub_set)
#(criteria = data.frame(Nvar = 1:7, R2adj=result.sum$adjr2, CP=result.sum$cp, BIC=result.sum$bic))
##Estimated best subset byeach criterion
#(which.best.subset = data.frame(R2adj=which.max(result.sum$adjr2), CP=which.min(result.sum$cp), BIC=which.min(result.sum$bic)))

(PRESS_none <- sum((fit$residuals/(1-influence(fit)$hat))^2))
## [1] 47.04849
(PRESS_full <- sum((full$residuals/(1-influence(full)$hat))^2))
## [1] 26.62755
#plot results
p.plot <- res_sub[, 1] + res_sub[, 2] + res_sub[, 3] + res_sub[, 4] + res_sub[, 5]+ res_sub[, 6]+ res_sub[, 7]+ res_sub[, 8]
res.sub.plot <- as.data.frame(cbind(p.plot, res_sub))
best.plot <- res.sub.plot[c(1), ]
par(mfrow = c(3, 2))
plot(res.sub.plot$p.plot, res.sub.plot$`R^2`, xlab = "p", ylab = "R^2")
lines(best.plot$p.plot, best.plot$`R^2`, lwd = 2)

plot(res.sub.plot$p.plot, res.sub.plot$`R^2_a`, xlab = "p", ylab = "R^2_a")
lines(best.plot$p.plot, best.plot$`R^2_a`, lwd = 2)

plot(res.sub.plot$p.plot, res.sub.plot$Cp, xlab = "p", ylab = "Cp")
lines(best.plot$p.plot, best.plot$Cp, lwd = 2)
lines(best.plot$p.plot, best.plot$p.plot, col = "red")

plot(res.sub.plot$p.plot, res.sub.plot$aic, xlab = "p", ylab = "aic")
lines(best.plot$p.plot, best.plot$aic, lwd = 2)

plot(res.sub.plot$p.plot, res.sub.plot$bic, xlab = "p", ylab = "bic")
lines(best.plot$p.plot, best.plot$bic, lwd = 2)
Figure 15

Figure 15

Observation(s):

  • Best subsets selection
  • Best model according to \(SSEp\): Model 8
  • Best model according to \(R^2\): Model 8
  • Best model according to \(R^2_{a}\): Model 5 ( cancer_vol + benign_prostatic_hyperplasia + seminal_vesicle_invasion1 + gleason_score7 + gleason_score8)
  • Best model according to \(C_p\), \(AIC_p\), \(BIC_p\): Model 4 (cancer_vol + benign_prostatic_hyperplasia + seminal_vesicle_invasion1 + gleason_score8)
  • The box-cox log-likelihood does not suggest any power transformation since lambda~1

II: Stepwise Procedure

The stepwise procedure is conducted in order to determine the “best” predictive model for PSA and if the resulting model would be similar to the best model identified using best subsets regression. The “best” model according to all stepwise procedures used (forward, backward, and bidirectional) using both AIC and BIC criterion include three predictors: cancer volume, seminal vesicle invasion, and benign prostatic hyperplasia. Compare to the models generated using the best subsets regression method, the “best” model using stepwise procedure does not include the variable gleason score. Therefore, a model validation is needed to compare the “best” two models generated by best subsets regression method and stepwise procedure in order to choose the best predictive model for PSA.

library(MASS)
n <-nrow(prostate.t)
## forward selection based on AIC:
(step.f<-stepAIC(fit, scope = list(upper = full, lower = ~1), direction = "forward", 
    k = 4, trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
##    seminal_vesicle_invasion1  benign_prostatic_hyperplasia  
##                      0.90993                       0.10150
a <- as.data.frame(step.f$anova)
datatable(a, rownames = FALSE)
(step.fa<-stepAIC(fit, scope = list(upper = full1, lower = ~1), direction = "forward", 
    k = 4, trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
##    seminal_vesicle_invasion1  benign_prostatic_hyperplasia  
##                      0.90993                       0.10150
b <- as.data.frame(step.fa$anova)
datatable(b, rownames = FALSE)
## backward elimination based on AIC
(step.b <- stepAIC(full, scope = list(upper = full, lower = ~1), direction = "backward", 
    k = 4, trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
## benign_prostatic_hyperplasia     seminal_vesicle_invasion1  
##                      0.10150                       0.90993
c <- as.data.frame(step.b$anova)
datatable(c, rownames = FALSE)
(step.ba <- stepAIC(full, scope = list(upper = full1, lower = ~1), direction = "backward", 
    k = 4, trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
## benign_prostatic_hyperplasia     seminal_vesicle_invasion1  
##                      0.10150                       0.90993
d <- as.data.frame(step.ba$anova)
datatable(d, rownames = FALSE)
## forward stepwise based on AIC
(step.fs<-stepAIC(fit, scope = list(upper = full, lower = ~1), direction = "both", 
    k = 4, trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
##    seminal_vesicle_invasion1  benign_prostatic_hyperplasia  
##                      0.90993                       0.10150
e <- as.data.frame(step.fs$anova)
datatable(e, rownames = FALSE)
(step.fsa<-stepAIC(fit, scope = list(upper = full1, lower = ~1), direction = "both", 
    k = 4, trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
##    seminal_vesicle_invasion1  benign_prostatic_hyperplasia  
##                      0.90993                       0.10150
f <- as.data.frame(step.fsa$anova)
datatable(f, rownames = FALSE)
## backward stepwise based on AIC
(step.bs<-stepAIC(full, scope = list(upper = full, lower = ~1), direction = "both", 
    k = 4, trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
## benign_prostatic_hyperplasia     seminal_vesicle_invasion1  
##                      0.10150                       0.90993
g <- as.data.frame(step.bs$anova)
datatable(g, rownames = FALSE)
(step.bsa<-stepAIC(full, scope = list(upper = full1, lower = ~1), direction = "both", 
    k = 4, trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
## benign_prostatic_hyperplasia     seminal_vesicle_invasion1  
##                      0.10150                       0.90993
h <- as.data.frame(step.bsa$anova)
datatable(h, rownames = FALSE)
## selection based on BIC: set option 'k=log(n)'
(step.f1<-stepAIC(fit, scope = list(upper = full, lower = ~1), direction = "forward", 
    k = log(n), trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
##    seminal_vesicle_invasion1  benign_prostatic_hyperplasia  
##                      0.90993                       0.10150
i <- as.data.frame(step.f1$anova)
datatable(i, rownames = FALSE)
(step.f1a<-stepAIC(fit, scope = list(upper = full1, lower = ~1), direction = "forward", 
    k = log(n), trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
##    seminal_vesicle_invasion1  benign_prostatic_hyperplasia  
##                      0.90993                       0.10150
j <- as.data.frame(step.f1a$anova)
datatable(j, rownames = FALSE)
(step.b1 <- stepAIC(full, scope = list(upper = full, lower = ~1), direction = "backward", 
    k = log(n), trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
## benign_prostatic_hyperplasia     seminal_vesicle_invasion1  
##                      0.10150                       0.90993
k <- as.data.frame(step.b1$anova)
datatable(k, rownames = FALSE)
(step.b1a <- stepAIC(full, scope = list(upper = full1, lower = ~1), direction = "backward", 
    k = log(n), trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
## benign_prostatic_hyperplasia     seminal_vesicle_invasion1  
##                      0.10150                       0.90993
l <- as.data.frame(step.b1a$anova)
datatable(l, rownames = FALSE)
(step.fs1<-stepAIC(fit, scope = list(upper = full, lower = ~1), direction = "both", 
    k = log(n), trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
##    seminal_vesicle_invasion1  benign_prostatic_hyperplasia  
##                      0.90993                       0.10150
m <- as.data.frame(step.fs1$anova)
datatable(m, rownames = FALSE)
(step.fs2<-stepAIC(fit, scope = list(upper = full1, lower = ~1), direction = "both", 
    k = log(n), trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
##    seminal_vesicle_invasion1  benign_prostatic_hyperplasia  
##                      0.90993                       0.10150
nn <- as.data.frame(step.fs2$anova)
datatable(nn, rownames = FALSE)
(step.bs1<-stepAIC(full, scope = list(upper = full1, lower = ~1), direction = "both", 
    k = log(n), trace = FALSE))
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate.t)
## 
## Coefficients:
##                  (Intercept)                    cancer_vol  
##                      1.69605                       0.05786  
## benign_prostatic_hyperplasia     seminal_vesicle_invasion1  
##                      0.10150                       0.90993
o <- as.data.frame(step.bs1$anova)
datatable(o, rownames = FALSE)

Observation(s):

  • The “best” model according to all stepwise procedures include predictors: cancer_vol, seminal_vesicle_invasion, and benign_prostatic_hyperplasia

Part IV: Model validation

In this part, the “best” models identified using best subsets regression and stepwise procedure are validated and compared. The main purpose of this part is to determine the final “best” model using the following methods. First, the “best” model according to stepwise procedures is used on validation data. The estimated coefficients as well as their standard errors agree quite closely on the training and validation data sets Table 6. Additionally, the sum of squared estimate of errors (SSE) and adjusted-\(R^2\) values are similar. Since the mean squared prediction error (MSPEv) is somewhat close to SSE dived by number of observations, then there is no severe overfitting in the model.


Using internal validation on the “best” models identified by best subsets regression and stepwise procedure, the calculated values for the predicted residual error sum of squares (Pressp) and sum of squared estimate of errors (SSEp) are somewhat close, which supports their validity, which means they have little bias and not much overfitting. Meanwhile, an external validation is also performed to further compare the two models. The model identified using stepwise procedures has smaller mean squared prediction error using validation data (MSPEv) compare to the model identified using best subsets regression. Therefore, this model is the final “best” model.

###use best model according to stepwise procedures 
train1 <- lm(psa_level ~ cancer_vol  + benign_prostatic_hyperplasia  + seminal_vesicle_invasion , data = prostate.t)
# re-run model using validation data
valid1 <- lm(psa_level ~ cancer_vol  + benign_prostatic_hyperplasia  + seminal_vesicle_invasion, data = prostate.v)

mod_sum <- cbind(coef(summary(train1))[, 1], coef(summary(valid1))[, 1], coef(summary(train1))[, 
    2], coef(summary(valid1))[, 2])
colnames(mod_sum) <- c("Train Est", "Valid Est", "Train s.e.", "Valid s.e.")
mod_sum1 <- as.data.frame(mod_sum)
#datatable(mod_sum1, rownames = TRUE, options = list(pageLength = 4, scrollX=T))
kbl(mod_sum1, caption = "Table 6") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Table 6
Train Est Valid Est Train s.e. Valid s.e.
(Intercept) 1.6960483 1.3858731 0.1780221 0.2019846
cancer_vol 0.0578564 0.0868547 0.0179680 0.0188434
benign_prostatic_hyperplasia 0.1014970 0.1040655 0.0343698 0.0428635
seminal_vesicle_invasion1 0.9099273 0.7142537 0.2798131 0.4402380
##examine the SSE and adjusted R squares using both the training data and validation data
sse_t <- sum(train1$residuals^2)
sse_v <- sum(valid1$residuals^2)
Radj_t <- summary(train1)$adj.r.squared
Radj_v <- summary(valid1)$adj.r.squared
train_sum <- c(sse_t,Radj_t)
valid_sum <- c(sse_v,Radj_v)
criteria <- rbind(train_sum,valid_sum)
colnames(criteria) <- c("SSE","R2_adj")
criteria1 <- as.data.frame(criteria)
#datatable(criteria1, rownames = TRUE, options = list(pageLength = 2, scrollX=T))
kbl(criteria1, caption = "Table 7") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Table 7
SSE R2_adj
train_sum 21.01723 0.5023067
valid_sum 35.18765 0.5333025
#Get MSPE_v from new data
newdata <- prostate.v[, -1]
newdata$seminal_vesicle_invasion<-as.factor(newdata$seminal_vesicle_invasion)
sv <-sapply(newdata, class)
sv1<-as.data.frame(sv)
#datatable(sv1,  rownames = TRUE,colnames = "Type")
kbl(sv1) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
sv
cancer_vol numeric
weight numeric
age integer
benign_prostatic_hyperplasia numeric
seminal_vesicle_invasion factor
capsular_penetration numeric
gleason_score factor
n1 <- nrow(prostate)/2
y.hat <- predict(train1, newdata)
(MSPE <- mean((prostate.v$psa_level - y.hat)^2))
## [1] 0.7873197
sse_t/n1
## [1] 0.433345

Observation(s):

  • Most of the estimated coefficients as well as their standard errors agree quite closely on the training and validation data sets
  • The SSE and adjusted R squares values are close
  • Since \(MSPE_v\) is somewhat close to \(SSE/n\), then there is no severe overfitting in the model.
##Internal validation  - 1
(names(step.fs1$coefficients)[-1])
## [1] "cancer_vol"                   "seminal_vesicle_invasion1"   
## [3] "benign_prostatic_hyperplasia"
modelfs1 <- lm(psa_level~cancer_vol+ benign_prostatic_hyperplasia  + seminal_vesicle_invasion , data = prostate.t)
drops2=c("weight", "age", "capsular_penetration", "gleason-score")
data.tt=prostate.t[,!(names(prostate)%in%drops2)]
#data.tt<-prostate.t[ c("psa_level","cancer_vol" , "benign_prostatic_hyperplasia", "seminal_vesicle_invasion1"),  ]
fit5<- lm (psa_level~., data = data.tt)
length(fit5$coef) #number of coefficents in Model
## [1] 6
mse5<-anova(fit5)["Residuals",3]
mse5#MSE for Model
## [1] 0.4370997
sse.fs1<-anova(step.fs1)["Residuals",2] #first order selected
sse.fs1
## [1] 21.01723
mse.fs1<-anova(step.fs1)["Residuals",3] #MSE for Model fs1
mse.fs1
## [1] 0.4776643
p.fs1<-length(step.fs1$coefficients) #4
p.fs1
## [1] 4
cp.fs1<-sse.fs1/mse5-(n-2*p.fs1) #C_p for Model fs1
cp.fs1
## [1] 8.08338
press.fs1<-sum(step.fs1$residuals^2/(1-influence(step.fs1)$hat)^2)
press.fs1
## [1] 24.58292
##Internal validation -2
###using best model according to best subset procedures based on criterion 
model2 <- lm(psa_level~cancer_vol+ benign_prostatic_hyperplasia  + seminal_vesicle_invasion + gleason_score, data = prostate.t)
drops2a=c("weight", "age", "capsular_penetration")
data.tt2=prostate.t[,!(names(prostate)%in%drops2a)]
#data.tt<-prostate.t[ c("psa_level","cancer_vol" , "benign_prostatic_hyperplasia", "seminal_vesicle_invasion1"),  ]
fit6<- lm (psa_level~., data = data.tt2)
length(fit6$coef) #number of coefficents in Model
## [1] 6
(mse6<-anova(fit6)["Residuals",3])#MSE for Model
## [1] 0.4370997
(sse.fs2<-anova(model2)["Residuals",2]) #first order selected
## [1] 18.35819
(mse.fs2<-anova(model2)["Residuals",3]) #MSE for Model2
## [1] 0.4370997
(p.fs2<-length(model2$coefficients)) #5
## [1] 6
(cp.fs2<-sse.fs2/mse6-(n-2*p.fs2)) #C_p for Model 2
## [1] 6
(press.fs2<-sum(model2$residuals^2/(1-influence(model2)$hat)^2))
## [1] 23.25294

Observation(s):

  • For both models, \(Cp ≈p\) and \(Press_p\) and \(SSE_p\) are somewhat close, which supports their validity: little bias and not much overfitting.
##External  validation - 1 
fit.fs1.v<-lm(step.fs1,data=prostate.v) #Model fs1 on validation data
summary(step.fs1) #summary on training data
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + seminal_vesicle_invasion + 
##     benign_prostatic_hyperplasia, data = prostate.t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62251 -0.48856 -0.01258  0.43652  1.39462 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.69605    0.17802   9.527 2.91e-12 ***
## cancer_vol                    0.05786    0.01797   3.220  0.00241 ** 
## seminal_vesicle_invasion1     0.90993    0.27981   3.252  0.00220 ** 
## benign_prostatic_hyperplasia  0.10150    0.03437   2.953  0.00503 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6911 on 44 degrees of freedom
## Multiple R-squared:  0.5341, Adjusted R-squared:  0.5023 
## F-statistic: 16.81 on 3 and 44 DF,  p-value: 2.022e-07
summary(fit.fs1.v) #summary on validation data
## 
## Call:
## lm(formula = step.fs1, data = prostate.v)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8638 -0.5089  0.1578  0.5782  1.5188 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.38587    0.20198   6.861 1.64e-08 ***
## cancer_vol                    0.08685    0.01884   4.609 3.34e-05 ***
## seminal_vesicle_invasion1     0.71425    0.44024   1.622   0.1117    
## benign_prostatic_hyperplasia  0.10407    0.04286   2.428   0.0193 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8843 on 45 degrees of freedom
## Multiple R-squared:  0.5625, Adjusted R-squared:  0.5333 
## F-statistic: 19.28 on 3 and 45 DF,  p-value: 3.471e-08
#percent change in parameter estimation
q<-as.data.frame(round(abs(coef(step.fs1)-coef(fit.fs1.v))/abs(coef(step.fs1))*100,3))
datatable(q, rownames = TRUE, colnames = "Percent change")
sd.fs1<- summary(step.fs1)$coefficients[,"Std. Error"]
sd.fs1.v<- summary(fit.fs1.v)$coefficients[,"Std. Error"]
#percent change in standard errors
p<-as.data.frame(round(abs(sd.fs1-sd.fs1.v)/sd.fs1*100,3))
datatable(p, rownames = TRUE, colnames = "Percent change")
##mean squared prediction error
pred.fs1<-predict.lm(step.fs1, prostate.v[,-3])
pred.fs1a<-as.data.frame(pred.fs1)
datatable(pred.fs1a, rownames = TRUE, colnames = "Predicted values")
mspe.fs1<-mean((pred.fs1-prostate.v[,3])^2)
mspe.fs1
## [1] 1698.921
press.fs1/n
## [1] 0.5121443
mse.fs1
## [1] 0.4776643
##External  validation - 2 
fit.fs2.v<-lm(model2,data=prostate.v) #Model2 on validation data
summary(model2) #summary on training data
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion + gleason_score, data = prostate.t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43793 -0.46912 -0.04104  0.43159  1.37643 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.55126    0.22390   6.928 1.84e-08 ***
## cancer_vol                    0.03759    0.02019   1.862  0.06962 .  
## benign_prostatic_hyperplasia  0.09697    0.03294   2.944  0.00526 ** 
## seminal_vesicle_invasion1     0.91864    0.27903   3.292  0.00202 ** 
## gleason_score7                0.25904    0.23686   1.094  0.28034    
## gleason_score8                0.73913    0.30001   2.464  0.01793 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6611 on 42 degrees of freedom
## Multiple R-squared:  0.593,  Adjusted R-squared:  0.5446 
## F-statistic: 12.24 on 5 and 42 DF,  p-value: 2.388e-07
summary(fit.fs2.v) #summary on validation data
## 
## Call:
## lm(formula = model2, data = prostate.v)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7509 -0.5067  0.1642  0.6007  1.6324 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.27905    0.22835   5.601 1.39e-06 ***
## cancer_vol                    0.07610    0.01970   3.863 0.000372 ***
## benign_prostatic_hyperplasia  0.10556    0.04407   2.395 0.021049 *  
## seminal_vesicle_invasion1     0.33023    0.48883   0.676 0.502942    
## gleason_score7                0.18969    0.28902   0.656 0.515109    
## gleason_score8                0.85107    0.48734   1.746 0.087888 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8741 on 43 degrees of freedom
## Multiple R-squared:  0.5915, Adjusted R-squared:  0.544 
## F-statistic: 12.45 on 5 and 43 DF,  p-value: 1.702e-07
#percent change in parameter estimation
r<-as.data.frame(round(abs(coef(model2)-coef(fit.fs2.v))/abs(coef(model2))*100,3))
datatable(r, rownames = TRUE, colnames = "Percent change")
sd.fs2<- summary(model2)$coefficients[,"Std. Error"]
sd.fs2.v<- summary(fit.fs2.v)$coefficients[,"Std. Error"]
#percent change in standard errors
z<-as.data.frame(round(abs(sd.fs2-sd.fs2.v)/sd.fs2*100,3))
datatable(z, rownames = TRUE, colnames = "Percent change")
##mean squared prediction error
pred.fs2<-predict.lm(model2, prostate.v[,-3])
pred.fs2a<-as.data.frame(pred.fs2)
datatable(pred.fs2a, rownames = TRUE, colnames = "Predicted values")
(mspe.fs2<-mean((pred.fs2-prostate.v[,3])^2))
## [1] 1703.327
(press.fs2/n)
## [1] 0.4844362
(mse.fs2)
## [1] 0.4370997

Observation(s):

  • Model 1 is preferred based on smaller \(MSPE_v\) and more consistent parameter estimation in training and validation data set

Part V: Model diagnostics: Outlying and influential cases

The purpose of this part of the analysis is to conduct a model diagnostic on the final “best” model identified in the preceding step shown below by fitting it on the entire original data set.


\[psa~level = 1.51484 + 0.07618(c𝑎𝑛𝑐𝑒𝑟 ~𝑣𝑜𝑙𝑢𝑚𝑒)\] \[+ 0.09971 (𝑏𝑒𝑛𝑖𝑔𝑛 ~p𝑟𝑜𝑠𝑡𝑎𝑡𝑖𝑐 ~ℎ𝑦𝑝𝑒𝑟𝑝𝑙𝑎𝑠𝑖𝑎) \] \[ + 0.82194 (𝑠𝑒𝑚𝑖𝑛𝑎𝑙 ~𝑣𝑒𝑠𝑖𝑐𝑙𝑒 𝑖𝑛𝑣𝑎𝑠𝑖𝑜𝑛1)\]

The multiple coefficients of determination \(R^2\) for this model is 55%, which does not drop significantly from the original model with all the 7 predictors, and is still moderately adequate. In addition, the assumption of multiple regression is tested for the selected equation above. The first test of normal distribution for the residuals, shown in Figure 6 (Normal Q-Q plot), indicates the equation above is in fact a normal distribution with minimal outliers. Furthermore, the residuals against fitted values is plotted as shown in Figure 16. The plot shows that the residual against the fitted values is moderately constant but still reasonable. Comparing the model diagnostic plot for the final model with three predictor variables (cancer volume, seminal vesicle invasion, and benign prostatic hyperplasia) to the model diagnostic plot for the original first-order model with all seven predictors (Figure 7), we can confirm that the final model is better than the original model.


Moreover, the studentized deleted residuals is obtained and the Bonferroni outlier test procedure at α = 0.1 is used in order to identify any outlying Y and X observations in the data. There are no outlying Y observations identified and the outlying X observations include cases 55, 64, 68, 73, 78, 86, 91, 94, and 97. Using the Cook’s distance plot (Figure 6), the potential most influential case identified is the 94th case. As a result, the final model is fitted without this case and the calculated average absolute difference in the fitted values between the final model with and without the 94th observation is 1.87%. For 94th case, the percentage change on the fitted value with or without the observation is extremely small. Therefore, it can be concluded that there are no significant influential cases on the prediction and all cases can be included.

final_mod <- lm(psa_level~cancer_vol+ benign_prostatic_hyperplasia + 
    seminal_vesicle_invasion, data = prostate)
summary(final_mod)
## 
## Call:
## lm(formula = psa_level ~ cancer_vol + benign_prostatic_hyperplasia + 
##     seminal_vesicle_invasion, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9867 -0.4996  0.1032  0.5545  1.4993 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.51484    0.13206  11.471  < 2e-16 ***
## cancer_vol                    0.07618    0.01256   6.067 2.78e-08 ***
## benign_prostatic_hyperplasia  0.09971    0.02674   3.729 0.000331 ***
## seminal_vesicle_invasion1     0.82194    0.23858   3.445 0.000858 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7861 on 93 degrees of freedom
## Multiple R-squared:  0.5502, Adjusted R-squared:  0.5357 
## F-statistic: 37.92 on 3 and 93 DF,  p-value: 4.247e-16
fm<-as.data.frame(anova(final_mod))
#datatable(fm, rownames = TRUE)
kbl(fm) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Df Sum Sq Mean Sq F value Pr(>F)
cancer_vol 1 55.163663 55.1636633 89.27125 0.0000000
benign_prostatic_hyperplasia 1 7.803408 7.8034079 12.62824 0.0005992
seminal_vesicle_invasion 1 7.333891 7.3338909 11.86842 0.0008583
Residuals 93 57.467780 0.6179331 NA NA
par(mfrow=c(3,2))
plot(final_mod,pch =20, cex = 2, col = "aquamarine2")

## check outliers in Y
res<-residuals(final_mod)# residuals of the final model
p <- length(final_mod$coefficients)
n.s<-nrow(prostate)
h1 <- influence(final_mod)$hat

#Obtain the studentized deleted residuals and identify any outlying Y observations using
#Bonferroni outlier test procedure at α = 0.1
d.res.std<-studres(final_mod) #studentized deleted residuals
qt(1-0.1/(2*n.s),n.s-p) # bonferronis thresh hold
## [1] 3.38885
idx.Y <- as.vector(which(abs(d.res.std)>=qt(1-0.1/(2*n.s),n.s-p)))
idx.Y ## outliers
## integer(0)
idx.X <- as.vector(which(h1>(2*p/n.s)))
idx.X ## outliers
## [1] 55 64 68 73 78 86 91 94 97
plot(h1,res,xlab="leverage",ylab="residuals", pch =20, cex = 2, col = "aquamarine2")
plot(final_mod, which=4, col = "aquamarine2") #Case 94 is an influential case according to Cook’s distance.
Figure 16

Figure 16

#calculate the average absolute percent difference in the fitted values with and without the
#most influential case identified 
final_mod2<-lm(final_mod, data=prostate[-94,])
f1<-fitted(final_mod)
f2<-fitted(final_mod2)
SUM<-sum(abs((f1[-94]-f2)/f1[-94]))
SUM<-SUM+abs((f1[94]-predict(final_mod,newdata = prostate[94,]))/f1[94])
per.average<-SUM/n.s
per.average
##         94 
## 0.01872144

Observation(s):

  • The potential influential case identified previously is the 94th case, we fit the model without 94th case and calculate the average absolute difference in the fitted values as 1.87%. For 94th case, the percentage change on the fitted value with or without the case is very small. Therefore, no case have a large influence on prediction. Hence, all cases can be retained.

Conclusion and Discussion

By using multiple regression analyses, it can be concluded that three predictors (cancer volume, seminal vesicle invasion, and benign prostatic hyperplasia) are important clinical measurements that can be used in the determination of prostate-specific antigen (PSA) level among men who are diagnosed with prostate cancer. However, the performance of the model that is identified in this analysis to predict PSA level using these predictors can be improved. Other important clinical measurements not included in the original dataset can be added to the analysis. Although seminal vesicle invasion is not present in most of the 97 men with advanced prostate cancer, it is still important in identifying PSA level. This supports the fact the PSA level does not always imply the presence of prostate cancer in men which is a limitation of PSA screening for cancer. The result of this analysis would generally not apply to every man with prostate cancer in rare cases and to those who are not diagnosed with prostate cancer. Since the dataset used for this analysis include a small group of men ages 41 to 79 diagnosed with prostate cancer, a larger collection of data which includes both men with and without prostate cancer from a wider age range might be more helpful in determining the relationship between PSA level and other clinical measurements. For further analysis of this data, other statistical analysis can also be used which may result in more accurate results.





References