Prostate cancer dataset

In the following project we’ll use the ‘Prostate Cancer dataset’.

The data come from the study “Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II”, published in the Journal of Urology, and the dataset also appears in the book The Elements of Statistical Learning.

The data were collected to examine the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy.

Variables

The attributes are defined as follows:

  • lcavol: log cancer volume

  • lweight: log prostate weight

  • age: in years

  • lbph: log of the amount of benign prostatic hyperplasia

  • svi: seminal vesicle invasion

  • lcp: log of capsular penetration (cancer that has reached the outer wall of the prostate)

  • gleason: a numeric vector; the Gleason score describes prostate cancer based on how abnormal the cancer cells in a biopsy sample look under a microscope and how quickly they are likely to grow and spread. It ranges from 2 to 10, and in this dataset it is never below 6.

  • pgg45: percent of Gleason score 4 or 5

  • lpsa: the response variable, the log of the prostate-specific antigen level

 head(dis)
##        lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1: -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2: -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3: -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4: -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5:  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6: -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

Let’s also take a closer, more precise look at our dataset.

psych::describe(dis)
##         vars  n  mean    sd median trimmed   mad   min    max  range  skew
## lcavol     1 97  1.35  1.18   1.45    1.39  1.28 -1.35   3.82   5.17 -0.24
## lweight    2 97  3.63  0.43   3.62    3.62  0.38  2.37   4.78   2.41  0.06
## age        3 97 63.87  7.45  65.00   64.47  5.93 41.00  79.00  38.00 -0.80
## lbph       4 97  0.10  1.45   0.30    0.03  2.50 -1.39   2.33   3.71  0.13
## svi        5 97  0.22  0.41   0.00    0.15  0.00  0.00   1.00   1.00  1.36
## lcp        6 97 -0.18  1.40  -0.80   -0.34  0.87 -1.39   2.90   4.29  0.71
## gleason    7 97  6.75  0.72   7.00    6.67  0.00  6.00   9.00   3.00  1.22
## pgg45      8 97 24.38 28.20  15.00   20.57 22.24  0.00 100.00 100.00  0.94
## lpsa       9 97  2.48  1.15   2.59    2.48  1.15 -0.43   5.58   6.01  0.00
##         kurtosis   se
## lcavol     -0.60 0.12
## lweight     0.36 0.04
## age         0.96 0.76
## lbph       -1.75 0.15
## svi        -0.16 0.04
## lcp        -1.01 0.14
## gleason     2.36 0.07
## pgg45      -0.37 2.86
## lpsa        0.43 0.12
str(dis)
## Classes 'data.table' and 'data.frame':   97 obs. of  9 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Before starting, let’s check whether there are any missing values.

# missmap() comes from the Amelia package
missmap(dis, col = c("yellow", "black"), y.at = 1, y.labels = '', legend = TRUE)

Luckily for us, the plot shows that there are no missing values in our data.

Let’s define some useful helper functions!

# Remove outliers: values further than 1.5 * IQR from the quartiles become NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Population standard deviation (square root of the population variance)
PopSD <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}

# Variable standardizer: center on the mean and scale by the population SD
Standardize <- function(x) (x - mean(x)) / PopSD(x)

# Min-max normalization to the [0, 1] range
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
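
As a quick sanity check, here is a minimal sketch of how these helpers can be applied to single columns of the dataset (the column choices are purely illustrative):

# Illustrative use of the helpers on individual columns
summary(normalize(dis$age))           # rescaled to the [0, 1] range
summary(Standardize(dis$lweight))     # mean 0, population SD 1
summary(remove_outliers(dis$pgg45))   # values beyond 1.5 * IQR replaced by NA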

Multiple linear regression

Introduction

Through this study we try to understand whether a response variable can be predicted using several explanatory (independent) variables.

Objective

The main objective of this project is to build a working model capable of predicting the value of the antigen.

The response of interest (Y) in this data set is lpsa (log prostate-specific antigen). The predictors are log cancer volume (lcavol), log prostate weight (lweight), age, log of the benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason) and the percentage of Gleason scores 4 or 5 (pgg45).

We use **lpsa** as the dependent variable, so we run our regression on it and try to understand whether it can be predicted by the other, independent, variables.

Let’s start!

Let’s first build a model that includes all the variables.

r1<- lm(lpsa~. ,data= dis)

summary(r1)
## 
## Call:
## lm(formula = lpsa ~ ., data = dis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76644 -0.35510 -0.00328  0.38087  1.55770 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.181561   1.320568   0.137  0.89096    
## lcavol       0.564341   0.087833   6.425 6.55e-09 ***
## lweight      0.622020   0.200897   3.096  0.00263 ** 
## age         -0.021248   0.011084  -1.917  0.05848 .  
## lbph         0.096713   0.057913   1.670  0.09848 .  
## svi          0.761673   0.241176   3.158  0.00218 ** 
## lcp         -0.106051   0.089868  -1.180  0.24115    
## gleason      0.049228   0.155341   0.317  0.75207    
## pgg45        0.004458   0.004365   1.021  0.31000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6328 
## F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16

… and plot it!

par(mfrow=c(2,2))
plot(r1, col = thematic::thematic_get_option("accent"))

Now let’s perform an automatic stepwise selection of the predictors, in both directions, using the AIC as the criterion.

 library(MASS)
 best_step_model <- stepAIC(r1, direction = "both", trace = F)
 summary(best_step_model)
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = dis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86717 -0.37725  0.01257  0.40252  1.44544 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.49473    0.87652   0.564  0.57385    
## lcavol       0.54400    0.07463   7.289 1.11e-10 ***
## lweight      0.58821    0.19790   2.972  0.00378 ** 
## age         -0.01644    0.01068  -1.540  0.12692    
## lbph         0.10122    0.05759   1.758  0.08215 .  
## svi          0.71490    0.20653   3.461  0.00082 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared:  0.6526, Adjusted R-squared:  0.6335 
## F-statistic: 34.19 on 5 and 91 DF,  p-value: < 2.2e-16
extractAIC(best_step_model)
## [1]   6.00000 -63.72263

Display this new regression

r2 <- lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = dis)

summary(r2)
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = dis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86717 -0.37725  0.01257  0.40252  1.44544 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.49473    0.87652   0.564  0.57385    
## lcavol       0.54400    0.07463   7.289 1.11e-10 ***
## lweight      0.58821    0.19790   2.972  0.00378 ** 
## age         -0.01644    0.01068  -1.540  0.12692    
## lbph         0.10122    0.05759   1.758  0.08215 .  
## svi          0.71490    0.20653   3.461  0.00082 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared:  0.6526, Adjusted R-squared:  0.6335 
## F-statistic: 34.19 on 5 and 91 DF,  p-value: < 2.2e-16

and plot it!

par(mfrow=c(2,2))
plot(r2, col = thematic::thematic_get_option("accent"))

Comparison

extractAIC(r1)
## [1]   9.00000 -60.77886
extractAIC(r2)
## [1]   6.00000 -63.72263

The AIC values here are negative: this simply happens when the residual sum of squares is small (the AIC is defined only up to an additive constant), and it is not a problem in itself; lower is still better.

According to the AIC, which estimates the relative information loss of each model, the second (stepwise) model is slightly better, since its AIC is lower.

We therefore continue with the second regression, which also drops the uninformative predictors and gives fewer problems with the model assumptions.

If a model were so good that it perfectly predicted the dependent variable, the error sum of squares would approach zero, and its natural logarithm, and hence the AIC, would approach negative infinity.
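
As a minimal sketch (assuming the formula used by stats::extractAIC() for linear models, which drops an additive constant), the value reported for the stepwise model can be reproduced by hand:

# extractAIC() for lm models: n * log(RSS / n) + 2 * edf (up to a constant)
n   <- nrow(dis)
rss <- sum(residuals(r2)^2)
edf <- length(coef(r2))        # 6 parameters: intercept + 5 predictors
n * log(rss / n) + 2 * edf     # should be close to -63.72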

Parameter interpretation

First of all, the R-squared is not particularly high (about 0.65), but since the p-value of the overall F-test is very small, we can still consider the model reliable.

For a given predictor, the t-statistic evaluates whether there is a significant association between the predictor and the outcome variable.
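
For instance, here is a minimal sketch of that computation for lcavol in the reduced model:

# t-value of lcavol = estimate / standard error
coef(summary(r2))["lcavol", "Estimate"] / coef(summary(r2))["lcavol", "Std. Error"]   # ~7.29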

We can see that lcavol, lweight, lbph and svi are the predictors most strongly associated with lpsa.

For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one-unit increase in that predictor, holding all the other predictors fixed.
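
As an illustrative sketch (the choice of the first observation is arbitrary), increasing lcavol by one unit while holding everything else fixed shifts the prediction by exactly the lcavol coefficient:

# Bump lcavol by 1 for one observation and compare the two predictions
base_row <- as.data.frame(dis)[1, ]
bumped   <- base_row
bumped$lcavol <- bumped$lcavol + 1
predict(r2, newdata = bumped) - predict(r2, newdata = base_row)   # equals coef(r2)["lcavol"], ~0.54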

Looking at the p-values, we can see that only lcavol, lweight and svi actually have a significant effect on lpsa.

Furthermore, age is inversely related to lpsa, but its effect is not significant.

Assumptions

Linearity assumption

# component + residual plots, from the car package
crPlots(r2, col = thematic::thematic_get_option("accent"))

Each component + residual plot looks approximately linear, so the linearity assumption seems reasonably satisfied.

Mean of the errors is zero and the errors are normally distributed

qqnorm(r2$residuals, col = thematic::thematic_get_option("accent"))

mean(r2$residuals)
## [1] -3.173694e-17
# jarque.bera.test() comes from the tseries package
jarque.bera.test(r2$residuals)
## 
##  Jarque Bera Test
## 
## data:  r2$residuals
## X-squared = 0.28992, df = 2, p-value = 0.8651

The mean of the residuals is essentially zero and the Jarque-Bera test does not reject normality (p-value = 0.87), so this assumption also seems to be respected.

Homoscedasticity of residuals: the variance must be constant

lmtest::bptest(r2)
## 
##  studentized Breusch-Pagan test
## 
## data:  r2
## BP = 6.6668, df = 5, p-value = 0.2466

Since the p-value is greater than 0.05, we cannot reject the null hypothesis that the variance of the residuals is constant.

Absence of perfect multicollinearity

pairs(dis, col = thematic::thematic_get_option("accent"))

car:: vif(r2)
##   lcavol  lweight      age     lbph      svi 
## 1.521255 1.413125 1.241850 1.372257 1.437259

Since the VIF is smaller than 2 for each predictor, we can conclude that multicollinearity is not a problem.

The VIF of a predictor is computed as 1 / (1 - R^2), where R^2 is obtained by regressing that predictor on all the other predictors in the model.
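
A minimal sketch of that computation for lcavol (the other predictors work the same way):

# VIF of lcavol by hand: regress lcavol on the other predictors kept in r2
aux_r2 <- summary(lm(lcavol ~ lweight + age + lbph + svi, data = dis))$r.squared
1 / (1 - aux_r2)   # should match car::vif(r2)["lcavol"], ~1.52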

Independence of error

lawstat::runs.test(r2$residuals)
## 
##  Runs Test - Two sided
## 
## data:  r2$residuals
## Standardized Runs Statistic = -1.1218, p-value = 0.2619

This assumption is also satisfied: the runs test does not reject independence of the residuals (p-value = 0.26).

Logistic regression

Introduction

Now we can perform a logistic regression, which is binary since the response variable has only two outcomes:

  • svi present = 1

  • svi not present = 0

In this case a logistic regression is a better fit, since the predicted probabilities are bounded between 0 and 1.

We therefore expect a “sigmoid”-shaped response function.
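
As a small illustrative sketch, this is the inverse-logit (sigmoid) curve that maps the linear predictor onto a probability:

# The inverse-logit maps any real-valued linear predictor into (0, 1)
sigmoid <- function(eta) 1 / (1 + exp(-eta))
curve(sigmoid(x), from = -6, to = 6, xlab = "linear predictor", ylab = "P(svi = 1)")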

Let’s start!

st1<- glm(svi ~ age + lbph + lpsa+ lcp, data = dis, family = "binomial")

summary(st1)
## 
## Call:
## glm(formula = svi ~ age + lbph + lpsa + lcp, family = "binomial", 
##     data = dis)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9170  -0.2205  -0.0753  -0.0114   2.0407  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.70154    5.43714  -2.336 0.019488 *  
## age           0.06581    0.06916   0.951 0.341352    
## lbph         -0.27403    0.29855  -0.918 0.358687    
## lpsa          2.16631    0.74973   2.889 0.003859 ** 
## lcp           1.32598    0.39263   3.377 0.000732 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.353  on 96  degrees of freedom
## Residual deviance:  38.102  on 92  degrees of freedom
## AIC: 48.102
## 
## Number of Fisher Scoring iterations: 7
coefs<- coef(st1)
exp(coefs)
##  (Intercept)          age         lbph         lpsa          lcp 
## 3.046436e-06 1.068021e+00 7.603124e-01 8.726012e+00 3.765881e+00
  • lpsa: statistically significant, positive effect; exp(coef) ≈ 8.7 is the average multiplicative change in the odds of svi for a one-unit increase in lpsa.

  • lcp: statistically significant, positive effect; exp(coef) ≈ 3.8 is the average multiplicative change in the odds for a one-unit increase in lcp.

  • age and lbph do not seem to have a statistically significant effect on svi.

par(mfrow=c(2,2))
plot(st1)

Odds interpretation

For the odds interpretation, we look at what happens to the odds when a predictor changes by one unit: the quantity (exp(b) - 1) * 100 gives the percentage change in the predicted odds of svi for each additional unit of the independent variable, holding the others fixed.

(exp(coef(st1))-1)*100
## (Intercept)         age        lbph        lpsa         lcp 
##  -99.999695    6.802066  -23.968764  772.601250  276.588133

That’s the percentage increase in the odds for a one-unit increase in that specific predictor: for example, a one-unit increase in lpsa raises the odds of seminal vesicle invasion by about 773%.

Let’s compute a pseudo R-squared.

library(pscl)
pR2(st1)
## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -19.0507887 -50.6762599  63.2509422   0.6240688   0.4790346   0.7389510

We get a McFadden pseudo R-squared of about 0.62, which indicates a good fit.
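
As a minimal sketch, the McFadden pseudo R-squared can be reproduced by hand from the log-likelihoods of the fitted and intercept-only models:

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null model)
null_model <- glm(svi ~ 1, data = dis, family = "binomial")
1 - as.numeric(logLik(st1)) / as.numeric(logLik(null_model))   # ~0.62, matching pR2(st1)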

Assumptions

# predicted probabilities from the logistic model and, for reference,
# the implied classes at a 0.5 cutoff
probabilities <- predict(st1, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "tru", "fal")
head(predicted.classes)
##     1     2     3     4     5     6 
## "fal" "fal" "fal" "fal" "fal" "fal"
library(broom)
library(dplyr)
library(tidyr)      # for gather()
library(ggplot2)

# keep only the numeric columns
mydata <- dis %>%
  dplyr::select_if(is.numeric)

predictors <- colnames(mydata)

# bind the logit and reshape to long format for plotting
mydata <- mydata %>%
  mutate(logit = log(probabilities / (1 - probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

# check that each numeric predictor is roughly linear in the logit
ggplot(mydata, aes(logit, predictor.value)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth() +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")


Influential values

plot(st1, which = 4, id.n = 3)

Extract model results

model.data <- augment(st1) %>%
  mutate(index = 1:n())

# Three observations with the largest Cook's distance
model.data %>% top_n(3, .cooksd)
## # A tibble: 3 × 12
##     svi   age  lbph  lpsa   lcp .fitted .resid .std.resid  .hat .sigma .cooksd
##   <int> <int> <dbl> <dbl> <dbl>   <dbl>  <dbl>      <dbl> <dbl>  <dbl>   <dbl>
## 1     1    68 1.37   2.21  1.83   -1.38   1.79       1.90 0.108  0.616   0.108
## 2     0    57 0.438  3.53  2.33    1.66  -1.92      -2.07 0.141  0.610   0.203
## 3     1    61 1.35   4.13 -1.39   -1.95   2.04       2.23 0.161  0.603   0.322
## # … with 1 more variable: index <int>
## # ℹ Use `colnames()` to see all variable names
# Potential outliers: standardized residuals larger than 3 in absolute value
model.data %>%
  filter(abs(.std.resid) > 3)
## # A tibble: 0 × 12
## # … with 12 variables: svi <int>, age <int>, lbph <dbl>, lpsa <dbl>, lcp <dbl>,
## #   .fitted <dbl>, .resid <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, index <int>
## # ℹ Use `colnames()` to see all variable names

The Cook’s distance plot highlights three influential observations, although none of them has a standardized residual larger than 3 in absolute value. Let’s try removing the most influential points anyway and refit the model.

influence <- cooks.distance(st1)   # Cook's distance of each observation
leverage  <- hatvalues(st1)        # leverage (hat values)

# keep only observations with small Cook's distance and low leverage
new <- dis %>%
  mutate(cooks_dist = influence, leverage = leverage) %>%
  filter(cooks_dist < 0.01) %>%
  filter(leverage < 0.05)

# refit the logistic regression on the filtered data
st2 <- glm(svi ~ age + lbph + lpsa + lcp, data = new, family = "binomial")

par(mfrow=c(2,2))
plot(st2)

Multicollinearity

car::vif(st1)
##      age     lbph     lpsa      lcp 
## 1.048707 1.051401 1.109449 1.103298
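
All the VIFs are close to 1, so multicollinearity is not a concern for the logistic model either.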