The homework concerns data collected about Prostate cancer surgery. The prostate data frame has 97 rows and 9 columns. A study on 97 men with prostate cancer who were due to receive a radical prostatectomy.
Data Dictionary - This data frame contains the following columns:

cavol - cancer volume weight - prostate weight in grams age - age of patient bph - benign prostatic hyperplasia amount svi - seminal vesicle invasion cp - capsular penetration gleason - Gleason score pgg45 - percentage Gleason scores 4 or 5 psa - prostate specific antigen Source - Andrews DF and Herzberg AM (1985): Data. New York: Springer-Verlag Read the dataset into RStudio using the read.csv() function # 0. Setup

## Run

library(MASS)
library(ISLR)
library(lattice)
library(hexbin)
library(corrplot)
library(tidyverse)
library(coefplot)
library(GGally)
library(ggplot2)
library(gridExtra)
library(moments)

ps <- read.csv(file = 'prostate.csv', header = TRUE)

#head(prostate); # look at the data

str(ps)
## 'data.frame':    97 obs. of  10 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ cavol  : num  0.56 0.37 0.6 0.3 2.12 ...
##  $ weight : num  16 27.6 14.7 26.7 31 ...
##  $ bph    : num  0.25 0.25 0.25 0.25 0.25 ...
##  $ cp     : num  0.25 0.25 0.25 0.25 0.25 ...
##  $ psa    : num  0.65 0.85 0.85 0.85 1.45 ...
head(ps)
##   X age svi gleason pgg45 cavol   weight       bph        cp       psa
## 1 1  50   0       6     0  0.56 15.95066 0.2500001 0.2500011 0.6500019
## 2 2  58   0       6     0  0.37 27.64929 0.2500001 0.2500011 0.8499991
## 3 3  74   0       7    20  0.60 14.74936 0.2500001 0.2500011 0.8499991
## 4 4  58   0       6     0  0.30 26.65029 0.2500001 0.2500011 0.8499991
## 5 5  62   0       6     0  2.12 30.95084 0.2500001 0.2500011 1.4499948
## 6 6  50   0       6     0  0.35 25.24934 0.2500001 0.2500011 2.1500046
### Exploratory Data Analysis (20 points)
########################################################################
### 1) Make a numerical and graphical summary of the data.  
####### i. Convert the variable "svi" to a factor.
####### ii. Use the summary() function to produce numerical summary of the variables in the data set. 
####### iii. Using the ggpairs() function from the GGaly to produce a scatter plot matrix of the variables. Make sure you address the issue of overplotting of the axes labels. 
####### iv. Using ggplot, produce side-by-side boxplots of "psa" vs. "svi". 
####### Using ggplot, produce histograms of the numeric variables with suitably chosen numbers of bins. 
####### Provide a brief comment of what you observe about the data from the numeric and graphical summaries. 

### Regression Model (30 points)
#################################################################################
# 2) Fit a multiple linear regression model with "psa" as the response variable and all other variables as the predictors. 
#    Store this linear regression model in an R object called 'lm.all'.
#################################################################################

# Write your code in this section below this line
### 1) Make a numerical and graphical summary of the data. 
ps$svi <- factor(ps$svi)
summary(ps)
##        X           age        svi       gleason          pgg45       
##  Min.   : 1   Min.   :41.00   0:76   Min.   :6.000   Min.   :  0.00  
##  1st Qu.:25   1st Qu.:60.00   1:21   1st Qu.:6.000   1st Qu.:  0.00  
##  Median :49   Median :65.00          Median :7.000   Median : 15.00  
##  Mean   :49   Mean   :63.87          Mean   :6.753   Mean   : 24.38  
##  3rd Qu.:73   3rd Qu.:68.00          3rd Qu.:7.000   3rd Qu.: 40.00  
##  Max.   :97   Max.   :79.00          Max.   :9.000   Max.   :100.00  
##      cavol            weight            bph               cp        
##  Min.   : 0.260   Min.   : 10.75   Min.   : 0.250   Min.   : 0.250  
##  1st Qu.: 1.670   1st Qu.: 29.25   1st Qu.: 0.250   1st Qu.: 0.250  
##  Median : 4.250   Median : 37.45   Median : 1.350   Median : 0.450  
##  Mean   : 7.001   Mean   : 45.48   Mean   : 2.645   Mean   : 2.362  
##  3rd Qu.: 8.390   3rd Qu.: 48.35   3rd Qu.: 4.750   3rd Qu.: 3.250  
##  Max.   :45.650   Max.   :449.26   Max.   :10.240   Max.   :18.250  
##       psa        
##  Min.   :  0.65  
##  1st Qu.:  5.65  
##  Median : 13.35  
##  Mean   : 23.74  
##  3rd Qu.: 21.25  
##  Max.   :265.85
#using the ggpairs() function from the GGaly to produce a scatter plot matrix of the variables. Make sure you address the issue of overplotting of the axes labels
ggpairs(ps[c("age", "svi", "gleason", "pgg45", "cavol", "weight", "bph", "cp", "psa")], 
        axisLabels = "showDiag", 
        diag = list(continuous = "density", bins = 20))

#Using ggplot, produce side-by-side boxplots of "psa" vs. "svi".
ggplot(ps, aes(x = factor(svi), y = psa)) +
  geom_boxplot() +
  labs(x = "SVI", y = "PSA") +
  ggtitle("PSA vs. SVI") +
  theme_bw()

#Using ggplot, produce histograms of the numeric variables with suitably chosen numbers of bins

ggplot(ps, aes(x = cavol)) +
  geom_histogram(binwidth = 2, fill = "violet", alpha = 0.5) +
  labs(x = "cavol", y = "Frequency") +
  ggtitle("Cancer Volume Histogram") +
  theme_bw()

ggplot(ps, aes(x = weight)) +
  geom_histogram(binwidth = 20, fill = "blue", alpha = 0.5) +
  labs(x = "weight", y = "Frequency") +
  ggtitle("Prostate Weight Histogram") +
  theme_bw()

ggplot(ps, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "green", alpha = 0.5) +
  labs(x = "Age", y = "Frequency") +
  ggtitle("Age Histogram") +
  theme_bw()

ggplot(ps, aes(x = pgg45)) +
  geom_histogram(binwidth = 10, fill = "purple", alpha = 0.5) +
  labs(x = "PGG45", y = "Frequency") +
  ggtitle("PGG45 Histogram") +
  theme_bw()

ggplot(ps, aes(x = bph)) +
  geom_histogram(binwidth = 1, fill = "plum", alpha = 0.5) +
  labs(x = "bph", y = "Frequency") +
  ggtitle("Benign Prostatic Hyperplasia Amount Histogram") +
  theme_bw()

ggplot(ps, aes(x = cp)) +
  geom_histogram(binwidth = 2, fill = "orange", alpha = 0.5) +
  labs(x = "cp", y = "Frequency") +
  ggtitle("Capsular Penetration Histogram") +
  theme_bw()

ggplot(ps, aes(x = gleason)) +
  geom_histogram(binwidth = 0.5, fill = "orchid", alpha = 0.5) +
  labs(x = "gleason", y = "Frequency") +
  ggtitle("Gleason Score Histogram") +
  theme_bw()

ggplot(ps, aes(x = pgg45)) +
  geom_histogram(binwidth = 5, fill = "violet", alpha = 0.5) +
  labs(x = "pgg45", y = "Frequency") +
  ggtitle("Percentage Gleason Scores 4 or 5 Histogram") +
  theme_bw()

ggplot(ps, aes(x = psa)) +
  geom_histogram(binwidth = 20, fill = "navy", alpha = 0.5) +
  labs(x = "psa", y = "Frequency") +
  ggtitle("Prostate Specific Antigen Histogram") +
  theme_bw()

# 2) Fit a multiple linear regression model with "psa" as the response variable and all other variables as the predictors. 
#    Store this linear regression model in an R object called 'lm.all'.

lm.all <- lm(psa ~ ., ps)
lm.all
## 
## Call:
## lm(formula = psa ~ ., data = ps)
## 
## Coefficients:
## (Intercept)            X          age         svi1      gleason        pgg45  
##    46.11061      0.56055     -0.26667      9.22835     -6.80512     -0.10919  
##       cavol       weight          bph           cp  
##     1.18283     -0.02467      0.15786      2.63857
#################################################################################
# 3) Produce a model fit summary AND diagnostic plots associated with lm.all
#################################################################################

# Write your code in this section below this line
summary(lm.all)
## 
## Call:
## lm(formula = psa ~ ., data = ps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.110  -9.244  -0.549   6.662 171.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.11061   47.11649   0.979 0.330466    
## X            0.56055    0.16363   3.426 0.000938 ***
## age         -0.26667    0.46204  -0.577 0.565329    
## svi1         9.22835   11.14222   0.828 0.409805    
## gleason     -6.80512    6.51112  -1.045 0.298847    
## pgg45       -0.10919    0.17981  -0.607 0.545267    
## cavol        1.18283    0.63083   1.875 0.064143 .  
## weight      -0.02467    0.07135  -0.346 0.730336    
## bph          0.15786    1.23843   0.127 0.898863    
## cp           2.63857    1.35672   1.945 0.055027 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.72 on 87 degrees of freedom
## Multiple R-squared:  0.5198, Adjusted R-squared:  0.4701 
## F-statistic: 10.46 on 9 and 87 DF,  p-value: 8.417e-11
plot(lm.all)

# copy the summary output and plots into a separate document

#################################################################################
# 4) Using ggplot and any other useful functions, display any plots that are relevant for diagnostics on the model. Check the following assumptions:

#   i. Check the structure of the relationship between the predictors 
#     and the response i.e. check the linearity assumption
#   ii. The constant variance assumption for the errors.
#   iii. Check the normality assumption.
#   iv. Checks for large outlier.
#   v. Check for influential points.

# Residual vs Fitted plot
ggplot(lm.all, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  labs(x = "Fitted values", y = "Residuals") +
  ggtitle("Residual vs Fitted plot")

# Normal Q-Q plot
ggplot(lm.all, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical Quantiles", y = "Standardized Residuals") +
  ggtitle("Normal Q-Q plot")

# Scale-Location plot
ggplot(lm.all, aes(x = .fitted, y = sqrt(abs(.stdresid)))) +
  geom_point(alpha = 0.5) +
  labs(x = "Fitted values", y = "sqrt(|Standardized residuals|)") +
  ggtitle("Scale-Location plot")

# Cook's Distance plot
ggplot(lm.all, aes(x = .hat, y = .cooksd)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 2*mean(lm.all$.cooksd), intercept = 0, color = "red") +
  labs(x = "Leverage", y = "Cook's distance") +
  ggtitle("Cook's Distance plot")

#################################################################################

# Write your responses in the separate document containing the model summary and plots

##############################################################################################
# 4) Do you see any serious problems in any of the diagnostic plots?
#    If so, how will you modify your data to address the problem?
#    Important: Focus on one serious problem, rather than addressing every minor issue in the diagnostic plots.
##############################################################################################

# Write your responses in the separate document containing the model summary and plots

#############################################################################################
# 5) Perform backward model selection until all variables left in your model are significant.
#    You can do this using an automated function in R or manually by repeatedly fitting models and dropping variables one at a time.
#    Important: Be sure to include any modifications to the data from the previous question when fitting the model
#############################################################################################

# Write your code below this line to produce a summary of the model with all variables
lm.all <- lm(psa ~ ., ps)
summary(lm.all)
## 
## Call:
## lm(formula = psa ~ ., data = ps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.110  -9.244  -0.549   6.662 171.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.11061   47.11649   0.979 0.330466    
## X            0.56055    0.16363   3.426 0.000938 ***
## age         -0.26667    0.46204  -0.577 0.565329    
## svi1         9.22835   11.14222   0.828 0.409805    
## gleason     -6.80512    6.51112  -1.045 0.298847    
## pgg45       -0.10919    0.17981  -0.607 0.545267    
## cavol        1.18283    0.63083   1.875 0.064143 .  
## weight      -0.02467    0.07135  -0.346 0.730336    
## bph          0.15786    1.23843   0.127 0.898863    
## cp           2.63857    1.35672   1.945 0.055027 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.72 on 87 degrees of freedom
## Multiple R-squared:  0.5198, Adjusted R-squared:  0.4701 
## F-statistic: 10.46 on 9 and 87 DF,  p-value: 8.417e-11
# Write your code below this line all code created to perform backward selection
lm.backward <- step(lm.all, direction = "backward", trace = 0)
summary(lm.backward)
## 
## Call:
## lm(formula = psa ~ X + gleason + cavol + cp, data = ps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.293  -8.813  -0.584   7.109 170.267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.2464    29.9409   1.578 0.118000    
## X             0.5671     0.1404   4.038 0.000111 ***
## gleason      -9.9620     4.6670  -2.135 0.035453 *  
## cavol         1.2974     0.5932   2.187 0.031271 *  
## cp            2.9176     1.1549   2.526 0.013236 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.15 on 92 degrees of freedom
## Multiple R-squared:  0.5114, Adjusted R-squared:  0.4902 
## F-statistic: 24.08 on 4 and 92 DF,  p-value: 1.203e-13
# write your code below this line to save the final model in an R object called 'lm.bw'
lm.bw <- lm.backward
par(mfrow=c(2,2))

#################################################################################
# 6) Produce a model fit summary AND diagnostic plots associated with lm.bw
#################################################################################

# Write your code in this section below this line
# Model fit summary for lm.bw
summary(lm.bw)
## 
## Call:
## lm(formula = psa ~ X + gleason + cavol + cp, data = ps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.293  -8.813  -0.584   7.109 170.267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.2464    29.9409   1.578 0.118000    
## X             0.5671     0.1404   4.038 0.000111 ***
## gleason      -9.9620     4.6670  -2.135 0.035453 *  
## cavol         1.2974     0.5932   2.187 0.031271 *  
## cp            2.9176     1.1549   2.526 0.013236 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.15 on 92 degrees of freedom
## Multiple R-squared:  0.5114, Adjusted R-squared:  0.4902 
## F-statistic: 24.08 on 4 and 92 DF,  p-value: 1.203e-13
#Diagnostic plots for lm.bw
par(mfrow=c(2,2))
plot(lm.bw)

# copy the summary output and plots into a separate document

#################################################################################
# 7) Compare the diagnostic plots for lm.all and lm.bw. Which model and data appear to satisfy the linear regression model assumptions? Describe the characteristics of the diagnostic plots for lm.all and lm.bw and how they support your claims.
########################################################################

# Diagnostic Plot for lm.all
par(mfrow=c(2,2))
plot(lm.all)

# Diagnostic Plot for lm.bw
par(mfrow=c(2,2))
plot(lm.bw)

# Write your responses in the separate document containing the model summary and plots