This assignment uses a prostate cancer dataset to study how clinical and pathological variables are related to prostate-specific antigen (PSA) levels. The main goal is to explore the data, fit multiple regression models, compare models, and interpret the results in context.
The dataset is based on the prostate cancer case study discussed by Kutner et al.
id: identification number (1-97)
psa level: serum prostate-specific antigen level (mg/ml)
cancer volume: estimate of prostate cancer volume (cc)
weight: prostate weight (gm)
age: age of the patient (years)
benign prostatic hyperplasia: amount of benign prostatic hyperplasia (square cm)
seminal vesicle invasion: presence or absence of seminal vesicle invasion (1 = yes, 0 = no)
capsular penetration: degree of capsular penetration (cm)
gleason score: pathologically determined grade (higher score indicating worse prognosis)
The objective is to identify important predictors of PSA level and determine whether a multiple regression model provides a reasonable explanation of the variation in PSA.
data <- read.csv("prostate_cancer_data.csv")
# Viewing structure of the data
str(data)
## 'data.frame': 97 obs. of 9 variables:
## $ X1: int 1 2 3 4 5 6 7 8 9 10 ...
## $ X2: num 0.651 0.852 0.852 0.852 1.448 ...
## $ X3: num 0.56 0.372 0.601 0.301 2.117 ...
## $ X4: num 16 27.7 14.7 26.6 30.9 ...
## $ X5: int 50 58 74 58 62 50 64 58 47 63 ...
## $ X6: num 0 0 0 0 0 ...
## $ X7: int 0 0 0 0 0 0 0 0 0 0 ...
## $ X8: num 0 0 0 0 0 0 0 0 0 0 ...
## $ X9: int 6 7 7 6 6 6 6 6 7 6 ...
# Summary statistics
summary(data)
## X1 X2 X3 X4
## Min. : 1 Min. : 0.651 Min. : 0.2592 Min. : 10.70
## 1st Qu.:25 1st Qu.: 5.641 1st Qu.: 1.6653 1st Qu.: 29.37
## Median :49 Median : 13.330 Median : 4.2631 Median : 37.34
## Mean :49 Mean : 23.730 Mean : 6.9987 Mean : 45.49
## 3rd Qu.:73 3rd Qu.: 21.328 3rd Qu.: 8.4149 3rd Qu.: 48.42
## Max. :97 Max. :265.072 Max. :45.6042 Max. :450.34
## X5 X6 X7 X8
## Min. :41.00 Min. : 0.000 Min. :0.0000 Min. : 0.0000
## 1st Qu.:60.00 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median :65.00 Median : 1.350 Median :0.0000 Median : 0.4493
## Mean :63.87 Mean : 2.535 Mean :0.2165 Mean : 2.2454
## 3rd Qu.:68.00 3rd Qu.: 4.759 3rd Qu.:0.0000 3rd Qu.: 3.2544
## Max. :79.00 Max. :10.278 Max. :1.0000 Max. :18.1741
## X9
## Min. :6.000
## 1st Qu.:6.000
## Median :7.000
## Mean :6.876
## 3rd Qu.:7.000
## Max. :8.000
#Fixing Variable Names
names(data) <- c(
"id", "psa", "cancer_volume", "weight", "age",
"bph", "svi", "capsular_penetration", "gleason"
)
summary(data)
## id psa cancer_volume weight
## Min. : 1 Min. : 0.651 Min. : 0.2592 Min. : 10.70
## 1st Qu.:25 1st Qu.: 5.641 1st Qu.: 1.6653 1st Qu.: 29.37
## Median :49 Median : 13.330 Median : 4.2631 Median : 37.34
## Mean :49 Mean : 23.730 Mean : 6.9987 Mean : 45.49
## 3rd Qu.:73 3rd Qu.: 21.328 3rd Qu.: 8.4149 3rd Qu.: 48.42
## Max. :97 Max. :265.072 Max. :45.6042 Max. :450.34
## age bph svi capsular_penetration
## Min. :41.00 Min. : 0.000 Min. :0.0000 Min. : 0.0000
## 1st Qu.:60.00 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median :65.00 Median : 1.350 Median :0.0000 Median : 0.4493
## Mean :63.87 Mean : 2.535 Mean :0.2165 Mean : 2.2454
## 3rd Qu.:68.00 3rd Qu.: 4.759 3rd Qu.:0.0000 3rd Qu.: 3.2544
## Max. :79.00 Max. :10.278 Max. :1.0000 Max. :18.1741
## gleason
## Min. :6.000
## 1st Qu.:6.000
## Median :7.000
## Mean :6.876
## 3rd Qu.:7.000
## Max. :8.000
#Checking Missing Values
colSums(is.na(data))
## id psa cancer_volume
## 0 0 0
## weight age bph
## 0 0 0
## svi capsular_penetration gleason
## 0 0 0
#Extracting Numeric Variables
num_data <- data[, sapply(data, is.numeric)]
#Correlation Matrix
cor_matrix <- cor(num_data)
cor_matrix
## id psa cancer_volume weight
## id 1.0000000 0.60268375 0.620997842 0.113741022
## psa 0.6026837 1.00000000 0.624150588 0.026213430
## cancer_volume 0.6209978 0.62415059 1.000000000 0.005107148
## weight 0.1137410 0.02621343 0.005107148 1.000000000
## age 0.1965557 0.01719938 0.039094423 0.164323714
## bph 0.1650054 -0.01648649 -0.133209431 0.321848748
## svi 0.5667803 0.52861878 0.581741687 -0.002410475
## capsular_penetration 0.4767525 0.55079252 0.692896688 0.001578905
## gleason 0.5379241 0.42957975 0.481438397 -0.024206925
## age bph svi capsular_penetration
## id 0.19655569 0.16500536 0.566780347 0.476752459
## psa 0.01719938 -0.01648649 0.528618785 0.550792517
## cancer_volume 0.03909442 -0.13320943 0.581741687 0.692896688
## weight 0.16432371 0.32184875 -0.002410475 0.001578905
## age 1.00000000 0.36634121 0.117658038 0.099555351
## bph 0.36634121 1.00000000 -0.119553192 -0.083008649
## svi 0.11765804 -0.11955319 1.000000000 0.680284092
## capsular_penetration 0.09955535 -0.08300865 0.680284092 1.000000000
## gleason 0.22585181 0.02682555 0.428573479 0.461565896
## gleason
## id 0.53792405
## psa 0.42957975
## cancer_volume 0.48143840
## weight -0.02420693
## age 0.22585181
## bph 0.02682555
## svi 0.42857348
## capsular_penetration 0.46156590
## gleason 1.00000000
library(corrplot)
corrplot(cor_matrix, method = "number", number.cex = 0.8)
Observations:
The dataset contains 97 observations on nine variables: a patient identifier and eight clinical measurements. There are no missing values. All variables are stored as numeric; seminal vesicle invasion (SVI) is a binary indicator coded 0/1, and Gleason score takes integer values from 6 to 8.
PSA ranges from 0.651 to 265.072, and its mean (23.73) is well above its median (13.33), indicating a right-skewed distribution with potential outliers. Cancer volume and prostate weight also vary widely; the maximum weight (450.34) is far above the third quartile (48.42) and may be an outlier. Age is less dispersed, with the middle half of patients between 60 and 68 years. Capsular penetration is strongly right-skewed, with many zero or low values and a few large ones. Gleason scores range from 6 to 8 with a median of 7, indicating moderate- to high-grade disease.
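The correlation analysis shows that PSA is most strongly correlated with cancer volume (0.624), followed by capsular penetration (0.551), SVI (0.529), and Gleason score (0.430), suggesting that more advanced disease is associated with higher PSA levels. Cancer volume is also strongly related to capsular penetration (0.693) and SVI (0.582), and capsular penetration and SVI are themselves correlated (0.680), reflecting disease progression. Because id is only a record identifier, its correlations (for example, 0.603 with PSA) are not substantively meaningful and likely reflect the order in which records appear in the file.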
In contrast, weight (0.026), age (0.017), and BPH (-0.016) are almost uncorrelated with PSA, indicating they may not be strong predictors on their own. Overall, cancer-related variables appear to be the most important predictors of PSA, and the correlations among them suggest potential multicollinearity.
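The multicollinearity concern can be quantified with variance inflation factors (VIFs). As a rough sketch, the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictors' correlation matrix. The block below re-enters the predictor correlations printed above, rounded to four decimals, so that it runs on its own; the object names `R` and `vif` are illustrative, and the rounding makes the values approximate.

```r
# Predictor names in the order used by the correlation output
vars <- c("cancer_volume", "weight", "age", "bph",
          "svi", "capsular_penetration", "gleason")

# Correlation matrix of the predictors, transcribed from the cor() output
# (rounded to 4 decimals, so the VIFs are approximate)
R <- matrix(c(
   1.0000,  0.0051,  0.0391, -0.1332,  0.5817,  0.6929,  0.4814,
   0.0051,  1.0000,  0.1643,  0.3218, -0.0024,  0.0016, -0.0242,
   0.0391,  0.1643,  1.0000,  0.3663,  0.1177,  0.0996,  0.2259,
  -0.1332,  0.3218,  0.3663,  1.0000, -0.1196, -0.0830,  0.0268,
   0.5817, -0.0024,  0.1177, -0.1196,  1.0000,  0.6803,  0.4286,
   0.6929,  0.0016,  0.0996, -0.0830,  0.6803,  1.0000,  0.4616,
   0.4814, -0.0242,  0.2259,  0.0268,  0.4286,  0.4616,  1.0000
), nrow = 7, byrow = TRUE, dimnames = list(vars, vars))

# VIF_j is the j-th diagonal element of the inverse correlation matrix
vif <- diag(solve(R))
round(vif, 2)
```

Values near 1 indicate little inflation; a common rule of thumb treats values above about 5 or 10 as a cause for concern.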
library(ggplot2)
theme_set(theme_minimal(base_size = 14))
#Plot1-Distribution of PSA
ggplot(data, aes(x = psa)) +
geom_histogram(bins = 20, fill = "#2E86AB", color = "black", alpha = 0.8) +
labs(title = "Distribution of PSA Levels",
x = "PSA", y = "Frequency")
#Plot2-PSA vs Cancer Volume
ggplot(data, aes(x = cancer_volume, y = psa)) +
geom_point(color = "#E74C3C", size = 2, alpha = 0.7) +
geom_smooth(method = "lm", color = "#2E86AB", se = FALSE, linewidth = 1) +
labs(title = "PSA vs Cancer Volume",
x = "Cancer Volume", y = "PSA")
#Plot3-PSA vs Capsular Penetration
ggplot(data, aes(x = capsular_penetration, y = psa)) +
geom_point(color = "#27AE60", size = 2, alpha = 0.7) +
geom_smooth(method = "lm", color = "#8E44AD", se = FALSE, linewidth = 1) +
labs(title = "PSA vs Capsular Penetration",
x = "Capsular Penetration", y = "PSA")
#Plot4-PSA by SVI
ggplot(data, aes(x = factor(svi), y = psa, fill = factor(svi))) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("#3498DB", "#F39C12")) +
labs(title = "PSA by Seminal Vesicle Invasion",
x = "SVI (0 = No, 1 = Yes)", y = "PSA") +
theme(legend.position = "none")
The histogram shows that PSA levels are highly right-skewed, with most observations concentrated at lower values and a few extreme high values. This indicates the presence of outliers and shows that PSA is far from normally distributed. The skewness motivates a transformation of PSA, such as a log transformation, before regression modeling.
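A Box-Cox analysis is one way to check whether the log is a reasonable choice of transformation. The sketch below uses simulated data (not the prostate dataset), generated so that the log of the response is linear in a single predictor, and extracts the value of lambda that maximizes the profile log-likelihood. The variable names, the simulated coefficients, and the seed are all illustrative assumptions.

```r
library(MASS)

set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated data: log(y) is linear in x with normal errors,
# so the "true" Box-Cox lambda is 0 (log transformation)
n   <- 97
x   <- rexp(n, rate = 1 / 7)
y   <- exp(1.5 + 0.07 * x + rnorm(n, sd = 0.75))
sim <- data.frame(x = x, y = y)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(y ~ x, data = sim,
             lambda = seq(-3, 3, by = 0.1), plotit = FALSE)

# Grid value of lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

A `lambda_hat` close to 0 points toward the log transformation, while a value near 1 would suggest leaving the response untransformed.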
The scatter plot of PSA against cancer volume shows a clear positive relationship, indicating that larger tumor volumes are associated with higher PSA. However, the spread of points increases as cancer volume increases, suggesting heteroscedasticity and the presence of some potentially influential observations. Despite this variability, cancer volume appears to be a strong predictor of PSA.
The scatter plot of PSA against capsular penetration also indicates a positive association, where greater penetration tends to correspond to higher PSA. However, the points are more dispersed than in the cancer volume plot, suggesting that capsular penetration has a moderate rather than dominant association with PSA.
The boxplot shows that patients with seminal vesicle invasion (SVI = 1) have noticeably higher PSA levels than those without invasion (SVI = 0). The variability in PSA is also greater for patients with invasion, which may reflect more heterogeneous disease in this group. This suggests that SVI is an important categorical predictor of PSA.
Overall, the visualizations reinforce that cancer-related variables are strongly associated with PSA and support their inclusion in regression modeling.
1. There is a positive relationship between cancer volume and PSA levels. Based on the scatter plot and correlation analysis, higher cancer volume is associated with increased PSA levels, suggesting that tumor size is a significant predictor of PSA.
2. Patients with seminal vesicle invasion (SVI = 1) have higher PSA levels than those without invasion (SVI = 0). The boxplot indicates that PSA levels are generally higher for patients with SVI, implying that disease progression is linked to elevated PSA.
3. Capsular penetration is positively associated with PSA levels. The scatter plot shows an increasing trend between capsular penetration and PSA, suggesting that greater tumor spread is associated with higher PSA levels.
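library(MASS)
# Box-Cox profile likelihood for the full model; a peak near lambda = 0 would support a log transformation
boxcox(psa ~ cancer_volume + weight + age + bph + svi + capsular_penetration + gleason,
       lambda = seq(-3, 3, by = 0.1),
       data = data)
# Log-transform the response
data$log_psa <- log(data$psa)
# Model A: full model with all seven predictors
modelA <- lm(log_psa ~ cancer_volume + weight + age + bph + svi + capsular_penetration + gleason,
             data = data)
summary(modelA)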
##
## Call:
## lm(formula = log_psa ~ cancer_volume + weight + age + bph + svi +
## capsular_penetration + gleason, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88309 -0.46629 0.08045 0.47380 1.53219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.685796 0.998754 -0.687 0.49409
## cancer_volume 0.069454 0.014624 4.749 7.77e-06 ***
## weight 0.001380 0.001822 0.757 0.45079
## age -0.002799 0.011724 -0.239 0.81186
## bph 0.087470 0.029605 2.955 0.00401 **
## svi 0.782623 0.268339 2.917 0.00448 **
## capsular_penetration -0.026521 0.032860 -0.807 0.42177
## gleason 0.358153 0.127976 2.799 0.00629 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7679 on 89 degrees of freedom
## Multiple R-squared: 0.5893, Adjusted R-squared: 0.557
## F-statistic: 18.24 on 7 and 89 DF, p-value: 7.694e-15
The full regression model examines the relationship between log-transformed PSA and all available predictors. The results indicate that several variables are statistically significant in explaining PSA levels.
Cancer volume is a highly significant predictor (p < 0.001) with a positive coefficient, indicating that an increase in tumor size is associated with an increase in PSA levels. Similarly, benign prostatic hyperplasia (BPH), seminal vesicle invasion (SVI), and Gleason score are also significant predictors (p < 0.01), suggesting that both disease severity and prostate conditions contribute to higher PSA levels.
In contrast, variables such as prostate weight, age, and capsular penetration are not statistically significant, as their p-values are relatively large. This suggests that these variables do not provide substantial additional explanatory power in the presence of other predictors.
The model explains approximately 59% of the variation in log(PSA) (R² = 0.5893; adjusted R² = 0.557), indicating a reasonably good fit. The overall F-test is highly significant (p < 0.001), confirming that the predictors are collectively related to PSA.
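Because the response is log(PSA), a coefficient b corresponds to a multiplicative change of exp(b) in PSA for a one-unit increase in the predictor, holding the other predictors fixed. As an illustration, the block below converts the significant Model A estimates, copied from the summary output above, into approximate percentage changes; the vector name `coef_A` is just a label for this sketch.

```r
# Significant coefficient estimates copied from summary(modelA)
coef_A <- c(cancer_volume = 0.069454,
            bph           = 0.087470,
            svi           = 0.782623,
            gleason       = 0.358153)

# Multiplicative effect on PSA of a one-unit increase in each predictor
ratio <- exp(coef_A)

# Same effect expressed as a percentage change in PSA
pct_change <- 100 * (ratio - 1)
round(pct_change, 1)
```

For example, each additional cc of cancer volume is associated with roughly a 7% higher PSA, with the other predictors held fixed.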
# Model B: reduced model keeping only the predictors significant in Model A
modelB <- lm(log_psa ~ cancer_volume + bph + svi + gleason,
             data = data)
summary(modelB)
##
## Call:
## lm(formula = log_psa ~ cancer_volume + bph + svi + gleason, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88531 -0.50276 0.09885 0.53687 1.56621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.65013 0.80999 -0.803 0.424253
## cancer_volume 0.06488 0.01285 5.051 2.22e-06 ***
## bph 0.09136 0.02606 3.506 0.000705 ***
## svi 0.68421 0.23640 2.894 0.004746 **
## gleason 0.33376 0.12331 2.707 0.008100 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7606 on 92 degrees of freedom
## Multiple R-squared: 0.5834, Adjusted R-squared: 0.5653
## F-statistic: 32.21 on 4 and 92 DF, p-value: < 2.2e-16
The reduced model includes cancer volume, BPH, SVI, and Gleason score as predictors of log(PSA). All included variables are statistically significant (p < 0.01), indicating that they contribute meaningfully to explaining PSA levels.
Cancer volume remains the strongest predictor, with a highly significant positive effect on PSA. Similarly, BPH, SVI, and Gleason score also show positive and significant relationships with PSA, suggesting that both tumor size and disease severity are important determinants.
The model explains approximately 58.3% of the variability in log(PSA) (R² = 0.5834), which is very close to the full model. The overall F-test is highly significant (p < 0.001), confirming that the selected predictors are collectively related to PSA.
The reduced model performs comparably to the full model, with only a slight decrease in R-squared (from 0.5893 to 0.5834) and a slight increase in adjusted R-squared (from 0.557 to 0.5653). This indicates that removing the non-significant variables weight, age, and capsular penetration does not substantially reduce the model's explanatory power.
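The comparison can be made formal with a partial F-test for the three dropped predictors. Since both models are fitted to the same response, the statistic can be computed from the reported R-squared values alone; this is a sketch based on the rounded summary output, so the result is approximate. With the fitted objects available, `anova(modelB, modelA)` would carry out the same test directly.

```r
# Values read from the two summary() outputs
r2_full    <- 0.5893   # Model A, 7 predictors
r2_reduced <- 0.5834   # Model B, 4 predictors
df_full    <- 89       # residual degrees of freedom of Model A
q          <- 3        # predictors dropped: weight, age, capsular_penetration

# Partial F statistic for H0: the q dropped coefficients are all zero
f_stat <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_full)

# Upper-tail p-value from the F distribution with (q, df_full) degrees of freedom
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

An F statistic below 1, as obtained here, gives no evidence that the three dropped predictors improve the fit.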
# Model C: reduced model plus a cancer_volume-by-svi interaction
modelC <- lm(log_psa ~ cancer_volume * svi + bph + gleason,
             data = data)
summary(modelC)
##
## Call:
## lm(formula = log_psa ~ cancer_volume * svi + bph + gleason, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7854 -0.4707 0.1028 0.5205 1.6353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.55461 0.80222 -0.691 0.491110
## cancer_volume 0.08873 0.01840 4.823 5.66e-06 ***
## svi 1.12189 0.33808 3.318 0.001303 **
## bph 0.08983 0.02577 3.486 0.000757 ***
## gleason 0.30379 0.12300 2.470 0.015379 *
## cancer_volume:svi -0.04339 0.02423 -1.791 0.076617 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7517 on 91 degrees of freedom
## Multiple R-squared: 0.5976, Adjusted R-squared: 0.5755
## F-statistic: 27.03 on 5 and 91 DF, p-value: < 2.2e-16
The interaction model examines whether the relationship between cancer volume and PSA differs depending on the presence of seminal vesicle invasion (SVI). The results show that cancer volume, SVI, BPH, and Gleason score are all statistically significant predictors of PSA.
The interaction term between cancer volume and SVI is marginally significant (p ≈ 0.077), suggesting that the effect of cancer volume on PSA may differ depending on SVI status, although this effect is not strongly significant at the 5% level.
The positive coefficient for cancer volume (0.0887) is the slope for patients without SVI and indicates that PSA increases with tumor size, while the positive coefficient for SVI (1.122) is the estimated difference between the two groups when cancer volume is zero. The negative interaction term (-0.0434) means that the estimated slope for cancer volume drops to about 0.0453 when SVI is present, roughly half of its value without SVI.
The model explains approximately 59.8% of the variability in log(PSA) (R² = 0.5976), which is slightly higher than both the full and reduced models; its adjusted R² (0.5755) is also the highest of the three. The overall F-test is highly significant (p < 0.001), indicating a strong relationship between the predictors and PSA.
Although the interaction term is not strongly significant, it suggests a potential moderating effect of SVI on the relationship between cancer volume and PSA.
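To make the interaction concrete, the slope of log(PSA) on cancer volume can be computed separately for each SVI group from the Model C estimates above. The object names below are illustrative, and the calculation ignores the uncertainty in the estimates.

```r
# Estimates copied from summary(modelC)
b_volume      <- 0.08873   # cancer_volume slope when svi = 0
b_interaction <- -0.04339  # change in that slope when svi = 1

# Slope of log(PSA) on cancer volume within each SVI group
slope <- c(svi_0 = b_volume,
           svi_1 = b_volume + b_interaction)

# Approximate percentage change in PSA per additional cc of cancer volume
pct_per_cc <- 100 * (exp(slope) - 1)

round(rbind(slope, pct_per_cc), 4)
```

The estimated slope with SVI present (about 0.045) is roughly half of the slope without SVI (about 0.089), although the difference is not significant at the 5% level.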
Among the three models, the reduced model provides a good balance between simplicity and explanatory power, while the interaction model offers slightly improved fit but adds complexity. Since the interaction term is only marginally significant, the reduced model may be preferred for interpretation, while the interaction model provides additional insight into possible relationships between predictors.
### Model Diagnostics
par(mfrow = c(2, 2))
plot(modelB)
The diagnostic plots for the reduced model were examined to evaluate the assumptions of linear regression.
The Residuals vs Fitted plot shows a slight curved pattern, suggesting minor deviation from perfect linearity. However, no strong systematic pattern is observed, indicating that the linearity assumption is reasonably satisfied.
The Normal Q-Q plot shows that most residuals lie close to the reference line, although there are small deviations at the tails. This suggests that the normality assumption is approximately met, with minor departures due to extreme observations.
The Scale-Location plot shows a slight increase in the spread of residuals as fitted values increase, indicating mild heteroscedasticity. However, the trend is weak and is unlikely to seriously distort the standard errors or the conclusions drawn from the model.
The Residuals vs Leverage plot does not indicate any highly influential observations, although a few points have moderate leverage. None of these points exceed critical Cook’s distance thresholds, suggesting that no single observation unduly influences the model.
Overall, the regression assumptions are reasonably satisfied, and the model appears to be appropriate for explaining PSA levels. The log transformation of PSA helps in improving the validity of these assumptions.
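The visual checks can be complemented with numerical summaries. The sketch below uses simulated data (not the prostate dataset) and a hypothetical fitted model `fit` to show how Cook's distances and leverages can be extracted and compared with common rule-of-thumb cutoffs (4/n for Cook's distance and 2p/n for leverage). These cutoffs are conventions rather than formal tests.

```r
set.seed(2)  # arbitrary seed, for reproducibility only

# Simulated data standing in for the real dataset
n   <- 97
sim <- data.frame(x1 = rexp(n, rate = 1 / 7),
                  x2 = rbinom(n, size = 1, prob = 0.2))
sim$y <- 1.5 + 0.07 * sim$x1 + 0.7 * sim$x2 + rnorm(n, sd = 0.75)

fit <- lm(y ~ x1 + x2, data = sim)

# Influence measures for each observation
cooks <- cooks.distance(fit)
lev   <- hatvalues(fit)
p     <- length(coef(fit))   # number of estimated coefficients

# Observations exceeding the rule-of-thumb cutoffs
flag_cook <- which(cooks > 4 / n)
flag_lev  <- which(lev > 2 * p / n)

list(cook = flag_cook, leverage = flag_lev)
```

Applied to the reduced model, the same calls would be `cooks.distance(modelB)` and `hatvalues(modelB)`.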
new_data <- data.frame(
cancer_volume = 10,
bph = 2,
svi = 1,
gleason = 7
)
pred_log <- predict(modelB, newdata = new_data)
pred_log
## 1
## 3.201907
# Convert back to PSA
pred_psa <- exp(pred_log)
pred_psa
## 1
## 24.57935
Interpret the result: The model predicts a log(PSA) value of approximately 3.20 for a patient with a cancer volume of 10 cc, BPH of 2 square cm, SVI present, and a Gleason score of 7. After converting this back to the original scale, the predicted PSA level is approximately 24.58.
This means that for a patient with these clinical characteristics, the predicted PSA level is around 24.6. Because the model is fitted on the log scale, this back-transformed value is best read as an estimate of the typical (median) PSA for such a patient rather than the mean. The prediction reflects the combined effect of tumor size, cancer severity, and disease progression, demonstrating how the model can be used to estimate PSA levels for new patients.
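The prediction can be reproduced by hand from the Model B coefficients reported above, which makes the contribution of each predictor explicit. The block also shows, as an optional refinement that is not part of the original analysis, the lognormal adjustment exp(s^2 / 2), which converts the back-transformed value (an estimate of the median PSA) into an estimate of the mean PSA under the assumption of normal errors on the log scale. The object names are illustrative.

```r
# Coefficient estimates copied from summary(modelB)
b <- c(intercept = -0.65013, cancer_volume = 0.06488,
       bph = 0.09136, svi = 0.68421, gleason = 0.33376)

# Predictor values for the new patient (the leading 1 multiplies the intercept)
x_new <- c(1, 10, 2, 1, 7)

# Linear predictor on the log scale; matches predict(modelB, ...) up to rounding
pred_log <- sum(b * x_new)

# Back-transform: estimate of the median PSA for such a patient
pred_median <- exp(pred_log)

# Assumption: normal errors on the log scale, residual standard error s = 0.7606
s <- 0.7606
pred_mean <- pred_median * exp(s^2 / 2)

round(c(log_psa = pred_log, median_psa = pred_median, mean_psa = pred_mean), 2)
```

The adjusted mean is always larger than the back-transformed median, and the gap grows with the residual variance on the log scale.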
The analysis shows that cancer-related variables such as cancer volume, seminal vesicle invasion (SVI), and Gleason score are significantly associated with PSA levels. These findings are consistent with the exploratory analysis, where these variables showed strong relationships with PSA.
The regression models indicate that a reduced model with only significant predictors provides a good balance between simplicity and explanatory power. The use of a log transformation for PSA helped improve model assumptions and overall model performance.
Overall, the results suggest that PSA levels are influenced by both tumor characteristics and prostate conditions, making it a useful indicator for understanding disease progression.
One limitation of this analysis is the relatively small sample size, which may affect the generalizability of the results. Additionally, some variables showed moderate correlation, which may introduce multicollinearity and affect the stability of coefficient estimates.
The model also assumes a linear relationship between predictors and the log-transformed PSA, which may not fully capture complex relationships in the data. Furthermore, the presence of slight heteroscedasticity and minor deviations from normality suggests that the model assumptions are not perfectly satisfied.
Finally, the analysis is based on observational data, so causal relationships cannot be established.