In the following project we'll use the 'Prostate Cancer' dataset.
The dataset comes from the study 'Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II', published in the Journal of Urology, and it also appears in the book The Elements of Statistical Learning.
The data examine the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy.
The attributes are defined as follows:
lcavol: log cancer volume
lweight: log prostate weight
age: in years
lbph: log of the amount of benign prostatic hyperplasia
svi: seminal vesicle invasion
lcp: log of capsular penetration (cancer that has reached the outer wall of the prostate)
gleason: a numeric vector; a way of describing prostate cancer based on how abnormal the cancer cells in a biopsy sample look under a microscope and how quickly they are likely to grow and spread. The scale runs up to 10, but in practice the score is essentially never less than 5.
pgg45: percent of Gleason score 4 or 5
lpsa: response (prostate-specific antigen)
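The code that originally created the `dis` object is not shown in this section; one possible way to recreate it, assuming the ElemStatLearn package (now archived on CRAN) is available, would be:
# Hypothetical loading step: the same data ship as `prostate` in ElemStatLearn,
# with an extra train/test indicator that we drop to match the columns above.
library(data.table)
data(prostate, package = "ElemStatLearn")
dis <- as.data.table(prostate[, setdiff(names(prostate), "train")])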
head(dis)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1: -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2: -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3: -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4: -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5: 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6: -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
Let's also take a closer and more precise look at our dataset.
psych::describe(dis)
## vars n mean sd median trimmed mad min max range skew
## lcavol 1 97 1.35 1.18 1.45 1.39 1.28 -1.35 3.82 5.17 -0.24
## lweight 2 97 3.63 0.43 3.62 3.62 0.38 2.37 4.78 2.41 0.06
## age 3 97 63.87 7.45 65.00 64.47 5.93 41.00 79.00 38.00 -0.80
## lbph 4 97 0.10 1.45 0.30 0.03 2.50 -1.39 2.33 3.71 0.13
## svi 5 97 0.22 0.41 0.00 0.15 0.00 0.00 1.00 1.00 1.36
## lcp 6 97 -0.18 1.40 -0.80 -0.34 0.87 -1.39 2.90 4.29 0.71
## gleason 7 97 6.75 0.72 7.00 6.67 0.00 6.00 9.00 3.00 1.22
## pgg45 8 97 24.38 28.20 15.00 20.57 22.24 0.00 100.00 100.00 0.94
## lpsa 9 97 2.48 1.15 2.59 2.48 1.15 -0.43 5.58 6.01 0.00
## kurtosis se
## lcavol -0.60 0.12
## lweight 0.36 0.04
## age 0.96 0.76
## lbph -1.75 0.15
## svi -0.16 0.04
## lcp -1.01 0.14
## gleason 2.36 0.07
## pgg45 -0.37 2.86
## lpsa 0.43 0.12
str(dis)
## Classes 'data.table' and 'data.frame': 97 obs. of 9 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## - attr(*, ".internal.selfref")=<externalptr>
Before starting, let's check whether there are any missing values.
missmap(dis, col = c("yellow", "black"), y.at=1, y.labels = '', legend = TRUE)
Luckily for us, the plot shows that there are no missing values in our data.
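As a quick numeric cross-check that needs no extra packages, we can also count the missing values per column:
# Should print 0 for every column, confirming what the missingness map shows.
colSums(is.na(dis))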
Let's load some useful functions!
# removing outliers (values outside 1.5 * IQR of the quartiles become NA)
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
# Population standard deviation function
PopSD <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}
# Variable standardizer function
Standardize <- function(x) {(x - mean(x)) / PopSD(x)}
# Normalization (rescale to the [0, 1] range)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
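Just to illustrate how these helpers behave (a quick sketch; these calls are not used in the rest of the analysis):
# Illustration only: apply the helpers to single columns of the dataset.
summary(remove_outliers(dis$lpsa))   # values outside 1.5 * IQR become NA
summary(Standardize(dis$lcavol))     # centred and divided by the population SD
summary(normalize(dis$age))          # rescaled to the [0, 1] range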
Introduction
Through this study we are trying to understand whether a response variable can be predicted using several explanatory (independent) variables.
The prime objective of this project is to construct a working model capable of predicting the value of the antigen.
The response of interest (Y) in this dataset is lpsa (log prostate-specific antigen). The predictors are log cancer volume (lcavol), log prostate weight (lweight), age, log of the benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason) and the percentage of Gleason scores 4 or 5 (pgg45).
We use **lpsa** as the dependent variable, so we run our regression on this variable and try to understand whether it can be predicted by the other (independent) variables.
Let’s start!
Let's first build our model with all the variables in it.
r1<- lm(lpsa~. ,data= dis)
summary(r1)
##
## Call:
## lm(formula = lpsa ~ ., data = dis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
… and plot it!
par(mfrow=c(2,2))
plot(r1, col = thematic::thematic_get_option("accent"))
Now let's perform an automatic stepwise selection of the predictors.
library(MASS)
best_step_model <- stepAIC(r1, direction = "both", trace = F)
summary(best_step_model)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = dis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86717 -0.37725 0.01257 0.40252 1.44544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49473 0.87652 0.564 0.57385
## lcavol 0.54400 0.07463 7.289 1.11e-10 ***
## lweight 0.58821 0.19790 2.972 0.00378 **
## age -0.01644 0.01068 -1.540 0.12692
## lbph 0.10122 0.05759 1.758 0.08215 .
## svi 0.71490 0.20653 3.461 0.00082 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.6335
## F-statistic: 34.19 on 5 and 91 DF, p-value: < 2.2e-16
extractAIC(best_step_model)
## [1] 6.00000 -63.72263
Display this new regression
r2 <- lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = dis)
summary(r2)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = dis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86717 -0.37725 0.01257 0.40252 1.44544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49473 0.87652 0.564 0.57385
## lcavol 0.54400 0.07463 7.289 1.11e-10 ***
## lweight 0.58821 0.19790 2.972 0.00378 **
## age -0.01644 0.01068 -1.540 0.12692
## lbph 0.10122 0.05759 1.758 0.08215 .
## svi 0.71490 0.20653 3.461 0.00082 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.6335
## F-statistic: 34.19 on 5 and 91 DF, p-value: < 2.2e-16
and plot it!
par(mfrow=c(2,2))
plot(r2, col = thematic::thematic_get_option("accent"))
extractAIC(r1)
## [1] 9.00000 -60.77886
extractAIC(r2)
## [1] 6.00000 -63.72263
Since we're modelling a log-transformed variable with a small residual variance, the AIC comes out negative.
According to the AIC, which is an estimator of relative information loss, the second model is slightly better (its AIC is lower).
In this case let's use the second regression, which drops the uninformative variables and avoids some assumption problems.
If a model were so good that it perfectly predicted the dependent variable, the error sum of squares would approach zero, and its natural logarithm (and hence the AIC) would approach negative infinity.
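As a quick check of where these numbers come from: for a linear model, extractAIC() returns the equivalent degrees of freedom together with n*log(RSS/n) + 2*edf, so a small residual sum of squares pushes the value below zero.
# Reproduce extractAIC(r2) by hand for the Gaussian linear model.
n   <- nrow(dis)
rss <- sum(residuals(r2)^2)
edf <- length(coef(r2))                           # intercept + 5 predictors = 6
c(edf = edf, AIC = n * log(rss / n) + 2 * edf)    # should match c(6, -63.72)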
Parameter interpretation
First of all, the R-squared is not particularly high, but since the p-value of the F-test is very small we can regard the model as reliable.
For a given predictor, the t-statistic evaluates whether there is a significant association between that predictor and the outcome variable.
We can see that cancer volume (lcavol), lweight, lbph and svi are the variables most related to lpsa.
For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one-unit increase in that predictor, holding all other predictors fixed.
Looking at the p-values, we can see that only lcavol, lweight and svi have a significant effect on lpsa.
Furthermore, age is inversely related to lpsa in this model, but its effect is not significant.
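As a concrete example of this interpretation, take the lcavol coefficient from the summary above:
# Holding the other predictors fixed, a one-unit increase in lcavol raises the
# predicted lpsa by about 0.544; since lpsa is a log, this multiplies the
# predicted PSA itself by roughly exp(0.544), i.e. about 1.7.
coef(r2)["lcavol"]
exp(coef(r2)["lcavol"])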
Linearity assumption
crPlots(r2, col = thematic::thematic_get_option("accent"))
Each predictor seems to have an approximately linear relationship with the response.
Mean of errors is zero and they are normally distributed
qqnorm(r2$residuals, col = thematic::thematic_get_option("accent"))
mean(r2$residuals)
## [1] -3.173694e-17
jarque.bera.test(r2$residuals)
##
## Jarque Bera Test
##
## data: r2$residuals
## X-squared = 0.28992, df = 2, p-value = 0.8651
This assumption also seems to be respected in this case.
Homoscedasticity of residuals: the variance must be constant
lmtest::bptest(r2)
##
## studentized Breusch-Pagan test
##
## data: r2
## BP = 6.6668, df = 5, p-value = 0.2466
Since the p-value is greater than 0.05, we cannot reject the null hypothesis that the variance of the residuals is constant.
Absence of perfect multicollinearity
pairs(dis, col = thematic::thematic_get_option("accent"))
car:: vif(r2)
## lcavol lweight age lbph svi
## 1.521255 1.413125 1.241850 1.372257 1.437259
Since the VIF is smaller than 2 in each case, we can conclude that multicollinearity is not a problem.
The VIF of a predictor is computed as 1 divided by (1 minus the R-squared obtained by regressing that predictor on all the others).
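Here is a sketch of that definition for a single predictor; the result should match the first entry of car::vif(r2) above.
# VIF of lcavol by hand: regress it on the other predictors in the model and
# plug the resulting R-squared into 1 / (1 - R^2). Expected value: ~1.52.
rsq_lcavol <- summary(lm(lcavol ~ lweight + age + lbph + svi, data = dis))$r.squared
1 / (1 - rsq_lcavol)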
Independence of error
lawstat::runs.test(r2$residuals)
##
## Runs Test - Two sided
##
## data: r2$residuals
## Standardized Runs Statistic = -1.1218, p-value = 0.2619
This assumption is also verified!
Introduction
Now we can perform a logistic regression, which we can call binary since the response variable has only two outcomes:
svi present = 1
svi not present = 0
In this case this kind of regression fits better, since the predicted probabilities are bounded between 0 and 1.
We therefore expect, in this case, a "sigmoid" curve.
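That shape comes from the inverse-logit link used by logistic regression; here is a quick sketch of the curve.
# plogis(x) = 1 / (1 + exp(-x)) maps the linear predictor onto the (0, 1) range.
curve(plogis, from = -6, to = 6,
      xlab = "linear predictor", ylab = "P(svi = 1)")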
Let’s start!
st1<- glm(svi ~ age + lbph + lpsa+ lcp, data = dis, family = "binomial")
summary(st1)
##
## Call:
## glm(formula = svi ~ age + lbph + lpsa + lcp, family = "binomial",
## data = dis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9170 -0.2205 -0.0753 -0.0114 2.0407
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.70154 5.43714 -2.336 0.019488 *
## age 0.06581 0.06916 0.951 0.341352
## lbph -0.27403 0.29855 -0.918 0.358687
## lpsa 2.16631 0.74973 2.889 0.003859 **
## lcp 1.32598 0.39263 3.377 0.000732 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.353 on 96 degrees of freedom
## Residual deviance: 38.102 on 92 degrees of freedom
## AIC: 48.102
##
## Number of Fisher Scoring iterations: 7
coefs<- coef(st1)
exp(coefs)
## (Intercept) age lbph lpsa lcp
## 3.046436e-06 1.068021e+00 7.603124e-01 8.726012e+00 3.765881e+00
lpsa has a statistically significant and positive effect: its exponentiated coefficient is the average multiplicative change in the odds when lpsa increases by 1 unit.
lcp also has a statistically significant and positive effect, interpreted in the same way.
age and lbph do not seem to have a statistically significant effect on svi.
par(mfrow=c(2,2))
plot(st1)
For the odds interpretation, instead, we look at the proportional rate at which the predicted odds change with each successive unit of x, i.e. how much the odds ratio varies when the independent variable increases by one unit.
(exp(coef(st1))-1)*100
## (Intercept) age lbph lpsa lcp
## -99.999695 6.802066 -23.968764 772.601250 276.588133
That's basically the percentage change in the odds for a one-unit increase in that specific predictor.
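A worked example for lpsa, reusing the coefficients extracted above:
# Each one-unit increase in lpsa multiplies the odds of seminal vesicle
# invasion by exp(2.166), i.e. roughly 8.7, an increase of about 773%.
exp(coefs["lpsa"])                  # odds ratio
(exp(coefs["lpsa"]) - 1) * 100      # percentage change in the odds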
Let's compute a pseudo R-squared.
library(pscl)
pR2(st1)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -19.0507887 -50.6762599 63.2509422 0.6240688 0.4790346 0.7389510
We seem to get a really good result.
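As a sanity check, McFadden's pseudo R-squared can also be computed by hand as 1 minus the ratio of the model's log-likelihood to that of the intercept-only model:
# Fit the null (intercept-only) model and reproduce the McFadden value (~0.624).
null_model <- glm(svi ~ 1, data = dis, family = "binomial")
1 - as.numeric(logLik(st1)) / as.numeric(logLik(null_model))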
Assumptions
#bind the logit and tidy the data for a plot
probabilities <- predict(st1, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "tru", "fal")
head(predicted.classes)
## 1 2 3 4 5 6
## "fal" "fal" "fal" "fal" "fal" "fal"
mydata <- dis %>%
dplyr::select_if(is.numeric)
predictors <- colnames(mydata)
library(broom)
library(dplyr)
library(tidyr)   # for gather()
mydata <- mydata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(mydata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth() +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
Influential values
plot(st1, which = 4, id.n = 3)
Extract model results
model.data <- augment(st1) %>%
mutate(index = 1:n())
# Find real outliers with standardised err >3
model.data %>% top_n(3, .cooksd)
## # A tibble: 3 × 12
## svi age lbph lpsa lcp .fitted .resid .std.resid .hat .sigma .cooksd
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 68 1.37 2.21 1.83 -1.38 1.79 1.90 0.108 0.616 0.108
## 2 0 57 0.438 3.53 2.33 1.66 -1.92 -2.07 0.141 0.610 0.203
## 3 1 61 1.35 4.13 -1.39 -1.95 2.04 2.23 0.161 0.603 0.322
## # … with 1 more variable: index <int>
## # ℹ Use `colnames()` to see all variable names
model.data %>%
filter(abs(.std.resid) > 3)
## # A tibble: 0 × 12
## # … with 12 variables: svi <int>, age <int>, lbph <dbl>, lpsa <dbl>, lcp <dbl>,
## # .fitted <dbl>, .resid <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, index <int>
## # ℹ Use `colnames()` to see all variable names
We can see three influential values (though none has a standardized residual above 3); let's try to remove the most influential points anyway.
influence <- cooks.distance(st1)
leverage<- hatvalues(st1)
new <- dis %>%
mutate(cooks_dist = influence, leverage = leverage)%>%
filter(cooks_dist < 0.01)%>%
filter(leverage < 0.05)
st2<- glm(svi ~ age + lbph + lpsa+ lcp, data = new, family = "binomial")
par(mfrow=c(2,2))
plot(st2)
Multicollinearity
car::vif(st1)
## age lbph lpsa lcp
## 1.048707 1.051401 1.109449 1.103298