cavol - cancer volume weight - prostate weight in grams age - age of patient bph - benign prostatic hyperplasia amount svi - seminal vesicle invasion cp - capsular penetration gleason - Gleason score pgg45 - percentage Gleason scores 4 or 5 psa - prostate specific antigen Source - Andrews DF and Herzberg AM (1985): Data. New York: Springer-Verlag Read the dataset into RStudio using the read.csv() function # 0. Setup
## Run
library(MASS)
library(ISLR)
library(lattice)
library(hexbin)
library(corrplot)
library(tidyverse)
library(coefplot)
library(GGally)
library(ggplot2)
library(gridExtra)
library(moments)
ps <- read.csv(file = 'prostate.csv', header = TRUE)
#head(prostate); # look at the data
str(ps)
## 'data.frame': 97 obs. of 10 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ cavol : num 0.56 0.37 0.6 0.3 2.12 ...
## $ weight : num 16 27.6 14.7 26.7 31 ...
## $ bph : num 0.25 0.25 0.25 0.25 0.25 ...
## $ cp : num 0.25 0.25 0.25 0.25 0.25 ...
## $ psa : num 0.65 0.85 0.85 0.85 1.45 ...
head(ps)
## X age svi gleason pgg45 cavol weight bph cp psa
## 1 1 50 0 6 0 0.56 15.95066 0.2500001 0.2500011 0.6500019
## 2 2 58 0 6 0 0.37 27.64929 0.2500001 0.2500011 0.8499991
## 3 3 74 0 7 20 0.60 14.74936 0.2500001 0.2500011 0.8499991
## 4 4 58 0 6 0 0.30 26.65029 0.2500001 0.2500011 0.8499991
## 5 5 62 0 6 0 2.12 30.95084 0.2500001 0.2500011 1.4499948
## 6 6 50 0 6 0 0.35 25.24934 0.2500001 0.2500011 2.1500046
### Exploratory Data Analysis (20 points)
########################################################################
### 1) Make a numerical and graphical summary of the data.
####### i. Convert the variable "svi" to a factor.
####### ii. Use the summary() function to produce numerical summary of the variables in the data set.
####### iii. Using the ggpairs() function from the GGaly to produce a scatter plot matrix of the variables. Make sure you address the issue of overplotting of the axes labels.
####### iv. Using ggplot, produce side-by-side boxplots of "psa" vs. "svi".
####### Using ggplot, produce histograms of the numeric variables with suitably chosen numbers of bins.
####### Provide a brief comment of what you observe about the data from the numeric and graphical summaries.
### Regression Model (30 points)
#################################################################################
# 2) Fit a multiple linear regression model with "psa" as the response variable and all other variables as the predictors.
# Store this linear regression model in an R object called 'lm.all'.
#################################################################################
# Write your code in this section below this line
### 1) Make a numerical and graphical summary of the data.
ps$svi <- factor(ps$svi)
summary(ps)
## X age svi gleason pgg45
## Min. : 1 Min. :41.00 0:76 Min. :6.000 Min. : 0.00
## 1st Qu.:25 1st Qu.:60.00 1:21 1st Qu.:6.000 1st Qu.: 0.00
## Median :49 Median :65.00 Median :7.000 Median : 15.00
## Mean :49 Mean :63.87 Mean :6.753 Mean : 24.38
## 3rd Qu.:73 3rd Qu.:68.00 3rd Qu.:7.000 3rd Qu.: 40.00
## Max. :97 Max. :79.00 Max. :9.000 Max. :100.00
## cavol weight bph cp
## Min. : 0.260 Min. : 10.75 Min. : 0.250 Min. : 0.250
## 1st Qu.: 1.670 1st Qu.: 29.25 1st Qu.: 0.250 1st Qu.: 0.250
## Median : 4.250 Median : 37.45 Median : 1.350 Median : 0.450
## Mean : 7.001 Mean : 45.48 Mean : 2.645 Mean : 2.362
## 3rd Qu.: 8.390 3rd Qu.: 48.35 3rd Qu.: 4.750 3rd Qu.: 3.250
## Max. :45.650 Max. :449.26 Max. :10.240 Max. :18.250
## psa
## Min. : 0.65
## 1st Qu.: 5.65
## Median : 13.35
## Mean : 23.74
## 3rd Qu.: 21.25
## Max. :265.85
#using the ggpairs() function from the GGaly to produce a scatter plot matrix of the variables. Make sure you address the issue of overplotting of the axes labels
ggpairs(ps[c("age", "svi", "gleason", "pgg45", "cavol", "weight", "bph", "cp", "psa")],
axisLabels = "showDiag",
diag = list(continuous = "density", bins = 20))
#Using ggplot, produce side-by-side boxplots of "psa" vs. "svi".
ggplot(ps, aes(x = factor(svi), y = psa)) +
geom_boxplot() +
labs(x = "SVI", y = "PSA") +
ggtitle("PSA vs. SVI") +
theme_bw()
#Using ggplot, produce histograms of the numeric variables with suitably chosen numbers of bins
ggplot(ps, aes(x = cavol)) +
geom_histogram(binwidth = 2, fill = "violet", alpha = 0.5) +
labs(x = "cavol", y = "Frequency") +
ggtitle("Cancer Volume Histogram") +
theme_bw()
ggplot(ps, aes(x = weight)) +
geom_histogram(binwidth = 20, fill = "blue", alpha = 0.5) +
labs(x = "weight", y = "Frequency") +
ggtitle("Prostate Weight Histogram") +
theme_bw()
ggplot(ps, aes(x = age)) +
geom_histogram(binwidth = 5, fill = "green", alpha = 0.5) +
labs(x = "Age", y = "Frequency") +
ggtitle("Age Histogram") +
theme_bw()
ggplot(ps, aes(x = pgg45)) +
geom_histogram(binwidth = 10, fill = "purple", alpha = 0.5) +
labs(x = "PGG45", y = "Frequency") +
ggtitle("PGG45 Histogram") +
theme_bw()
ggplot(ps, aes(x = bph)) +
geom_histogram(binwidth = 1, fill = "plum", alpha = 0.5) +
labs(x = "bph", y = "Frequency") +
ggtitle("Benign Prostatic Hyperplasia Amount Histogram") +
theme_bw()
ggplot(ps, aes(x = cp)) +
geom_histogram(binwidth = 2, fill = "orange", alpha = 0.5) +
labs(x = "cp", y = "Frequency") +
ggtitle("Capsular Penetration Histogram") +
theme_bw()
ggplot(ps, aes(x = gleason)) +
geom_histogram(binwidth = 0.5, fill = "orchid", alpha = 0.5) +
labs(x = "gleason", y = "Frequency") +
ggtitle("Gleason Score Histogram") +
theme_bw()
ggplot(ps, aes(x = pgg45)) +
geom_histogram(binwidth = 5, fill = "violet", alpha = 0.5) +
labs(x = "pgg45", y = "Frequency") +
ggtitle("Percentage Gleason Scores 4 or 5 Histogram") +
theme_bw()
ggplot(ps, aes(x = psa)) +
geom_histogram(binwidth = 20, fill = "navy", alpha = 0.5) +
labs(x = "psa", y = "Frequency") +
ggtitle("Prostate Specific Antigen Histogram") +
theme_bw()
# 2) Fit a multiple linear regression model with "psa" as the response variable and all other variables as the predictors.
# Store this linear regression model in an R object called 'lm.all'.
lm.all <- lm(psa ~ ., ps)
lm.all
##
## Call:
## lm(formula = psa ~ ., data = ps)
##
## Coefficients:
## (Intercept) X age svi1 gleason pgg45
## 46.11061 0.56055 -0.26667 9.22835 -6.80512 -0.10919
## cavol weight bph cp
## 1.18283 -0.02467 0.15786 2.63857
#################################################################################
# 3) Produce a model fit summary AND diagnostic plots associated with lm.all
#################################################################################
# Write your code in this section below this line
summary(lm.all)
##
## Call:
## lm(formula = psa ~ ., data = ps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.110 -9.244 -0.549 6.662 171.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.11061 47.11649 0.979 0.330466
## X 0.56055 0.16363 3.426 0.000938 ***
## age -0.26667 0.46204 -0.577 0.565329
## svi1 9.22835 11.14222 0.828 0.409805
## gleason -6.80512 6.51112 -1.045 0.298847
## pgg45 -0.10919 0.17981 -0.607 0.545267
## cavol 1.18283 0.63083 1.875 0.064143 .
## weight -0.02467 0.07135 -0.346 0.730336
## bph 0.15786 1.23843 0.127 0.898863
## cp 2.63857 1.35672 1.945 0.055027 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.72 on 87 degrees of freedom
## Multiple R-squared: 0.5198, Adjusted R-squared: 0.4701
## F-statistic: 10.46 on 9 and 87 DF, p-value: 8.417e-11
plot(lm.all)
# copy the summary output and plots into a separate document
#################################################################################
# 4) Using ggplot and any other useful functions, display any plots that are relevant for diagnostics on the model. Check the following assumptions:
# i. Check the structure of the relationship between the predictors
# and the response i.e. check the linearity assumption
# ii. The constant variance assumption for the errors.
# iii. Check the normality assumption.
# iv. Checks for large outlier.
# v. Check for influential points.
# Residual vs Fitted plot
ggplot(lm.all, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.5) +
labs(x = "Fitted values", y = "Residuals") +
ggtitle("Residual vs Fitted plot")
# Normal Q-Q plot
ggplot(lm.all, aes(sample = .stdresid)) +
stat_qq() +
stat_qq_line() +
labs(x = "Theoretical Quantiles", y = "Standardized Residuals") +
ggtitle("Normal Q-Q plot")
# Scale-Location plot
ggplot(lm.all, aes(x = .fitted, y = sqrt(abs(.stdresid)))) +
geom_point(alpha = 0.5) +
labs(x = "Fitted values", y = "sqrt(|Standardized residuals|)") +
ggtitle("Scale-Location plot")
# Cook's Distance plot
ggplot(lm.all, aes(x = .hat, y = .cooksd)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 2*mean(lm.all$.cooksd), intercept = 0, color = "red") +
labs(x = "Leverage", y = "Cook's distance") +
ggtitle("Cook's Distance plot")
#################################################################################
# Write your responses in the separate document containing the model summary and plots
##############################################################################################
# 4) Do you see any serious problems in any of the diagnostic plots?
# If so, how will you modify your data to address the problem?
# Important: Focus on one serious problem, rather than addressing every minor issue in the diagnostic plots.
##############################################################################################
# Write your responses in the separate document containing the model summary and plots
#############################################################################################
# 5) Perform backward model selection until all variables left in your model are significant.
# You can do this using an automated function in R or manually by repeatedly fitting models and dropping variables one at a time.
# Important: Be sure to include any modifications to the data from the previous question when fitting the model
#############################################################################################
# Write your code below this line to produce a summary of the model with all variables
lm.all <- lm(psa ~ ., ps)
summary(lm.all)
##
## Call:
## lm(formula = psa ~ ., data = ps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.110 -9.244 -0.549 6.662 171.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.11061 47.11649 0.979 0.330466
## X 0.56055 0.16363 3.426 0.000938 ***
## age -0.26667 0.46204 -0.577 0.565329
## svi1 9.22835 11.14222 0.828 0.409805
## gleason -6.80512 6.51112 -1.045 0.298847
## pgg45 -0.10919 0.17981 -0.607 0.545267
## cavol 1.18283 0.63083 1.875 0.064143 .
## weight -0.02467 0.07135 -0.346 0.730336
## bph 0.15786 1.23843 0.127 0.898863
## cp 2.63857 1.35672 1.945 0.055027 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.72 on 87 degrees of freedom
## Multiple R-squared: 0.5198, Adjusted R-squared: 0.4701
## F-statistic: 10.46 on 9 and 87 DF, p-value: 8.417e-11
# Write your code below this line all code created to perform backward selection
lm.backward <- step(lm.all, direction = "backward", trace = 0)
summary(lm.backward)
##
## Call:
## lm(formula = psa ~ X + gleason + cavol + cp, data = ps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.293 -8.813 -0.584 7.109 170.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.2464 29.9409 1.578 0.118000
## X 0.5671 0.1404 4.038 0.000111 ***
## gleason -9.9620 4.6670 -2.135 0.035453 *
## cavol 1.2974 0.5932 2.187 0.031271 *
## cp 2.9176 1.1549 2.526 0.013236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.15 on 92 degrees of freedom
## Multiple R-squared: 0.5114, Adjusted R-squared: 0.4902
## F-statistic: 24.08 on 4 and 92 DF, p-value: 1.203e-13
# write your code below this line to save the final model in an R object called 'lm.bw'
lm.bw <- lm.backward
par(mfrow=c(2,2))
#################################################################################
# 6) Produce a model fit summary AND diagnostic plots associated with lm.bw
#################################################################################
# Write your code in this section below this line
# Model fit summary for lm.bw
summary(lm.bw)
##
## Call:
## lm(formula = psa ~ X + gleason + cavol + cp, data = ps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.293 -8.813 -0.584 7.109 170.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.2464 29.9409 1.578 0.118000
## X 0.5671 0.1404 4.038 0.000111 ***
## gleason -9.9620 4.6670 -2.135 0.035453 *
## cavol 1.2974 0.5932 2.187 0.031271 *
## cp 2.9176 1.1549 2.526 0.013236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.15 on 92 degrees of freedom
## Multiple R-squared: 0.5114, Adjusted R-squared: 0.4902
## F-statistic: 24.08 on 4 and 92 DF, p-value: 1.203e-13
#Diagnostic plots for lm.bw
par(mfrow=c(2,2))
plot(lm.bw)
# copy the summary output and plots into a separate document
#################################################################################
# 7) Compare the diagnostic plots for lm.all and lm.bw. Which model and data appear to satisfy the linear regression model assumptions? Describe the characteristics of the diagnostic plots for lm.all and lm.bw and how they support your claims.
########################################################################
# Diagnostic Plot for lm.all
par(mfrow=c(2,2))
plot(lm.all)
# Diagnostic Plot for lm.bw
par(mfrow=c(2,2))
plot(lm.bw)
# Write your responses in the separate document containing the model summary and plots