## id lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0 -0.43078
## 2 2 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0 -0.16252
## 3 3 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20 -0.16252
## 4 4 -1.2039728 3.2828 58 -1.386294 0 -1.38629 6 0 -0.16252
## 5 5 0.7514161 3.4324 62 -1.386294 0 -1.38629 6 0 0.37156
## 6 6 -1.0498221 3.2288 50 -1.386294 0 -1.38629 6 0 0.76547
## [1] 97 10
## [1] "id" "lcavol" "lweight" "age" "lbph" "svi" "lcp"
## [8] "gleason" "pgg45" "lpsa"
lcavol: log of prostate cancer volume. Units are not specified, but probably milliliters
lweight: log of prostate weight. Units are not specified, but probably grams
age: age of subject in years
lbph: log of benign prostatic hyperplasia amount, units unclear
svi: seminal vesicle invasion, absence or presence
lcp: log of capsular penetration, units unclear
gleason: the Gleason score, a grading system for prostate cancer based on biopsy results. The score combines a primary and a secondary grade, assigned to the two most dominant tumor patterns, which are then added together. Each grade runs from 1 to 5, so the total score ranges from 2 to 10
pgg45: percentage of Gleason grades 4 or 5. These are the two highest grades and indicate undifferentiated cancer cells
lpsa: the log of prostate-specific antigen. This is measured by a blood test that may offer some benefit in identifying serious cases of prostate cancer
Predict the lweight using the age. How do the methods deal with the outlier?
##
## Call:
## lm(formula = lweight ~ age, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28054 -0.21276 -0.03244 0.23751 2.43165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.342572 0.418693 5.595 2.12e-07 ***
## age 0.020514 0.006512 3.150 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.475 on 95 degrees of freedom
## Multiple R-squared: 0.09457, Adjusted R-squared: 0.08504
## F-statistic: 9.923 on 1 and 95 DF, p-value: 0.002183
A simple linear regression shows only a weak linear relationship between age and lweight (R-squared of 0.0946), owing in part to a single outlier. Omitting this point raises R-squared only to 0.1209, a relative improvement of roughly 28%. In the residual plot, the outlier falls near the mean age, where there are many observations. There is also one large negative residual in this region, indicating high variance nearby.
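The with-and-without-outlier comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the output above: the data are simulated stand-ins (the real `prostate` data frame is not reproduced here), and the names `sim`, `out_idx`, `fit_full` and `fit_drop` are hypothetical. Only the last calculation uses the two R-squared values reported in the text.

```r
# Simulated stand-in for the prostate data (assumption: NOT the real values).
set.seed(1)
n <- 97
age <- round(rnorm(n, mean = 64, sd = 7))
lweight <- 2.3 + 0.02 * age + rnorm(n, sd = 0.4)

# Plant one outlier at the observation closest to the mean age.
out_idx <- which.min(abs(age - mean(age)))
lweight[out_idx] <- 2.3 + 0.02 * age[out_idx] + 2.5
sim <- data.frame(age, lweight)

# Fit the simple linear regression with and without the planted point.
fit_full <- lm(lweight ~ age, data = sim)
fit_drop <- lm(lweight ~ age, data = sim[-out_idx, ])

# Locate the observation with the largest absolute residual.
worst <- unname(which.max(abs(residuals(fit_full))))

r2 <- c(full = summary(fit_full)$r.squared,
        dropped = summary(fit_drop)$r.squared)
print(worst == out_idx)
print(round(r2, 4))

# Relative gain implied by the two R-squared values reported in the text.
reported_gain <- 100 * (0.1209 / 0.09457 - 1)
print(round(reported_gain, 1))
```

On the real data, the same pattern applies: replace `sim` with the actual data frame and `out_idx` with the row flagged by the residual plot.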
Note from book: We have introduced kernel smoothers, splines, local polynomials, wavelets and other smoothing methods in this chapter. Apply these methods to the following datasets. You must choose the amount of smoothing you think is appropriate. Compare the fits from the methods. Comment on the features of the fitted curves. Comment on the advantage of the nonparametric approach compared to a parametric one for the data and, in particular, whether the nonparametric fit reveals structure that a parametric approach would miss.
The cross-validation criterion does not have a local minimum as a function of the bandwidth h, so it does not single out a value. In this case I used an h of approximately 7, as this graphically keeps the estimate as close to monotonic as possible throughout the interval. My intuition for prostate weight is that it should only increase with age.
On the other hand, an h value of 0.7 gives a much more variable curve that fails to capture the impact of the outlier. Instead, greater variation appears elsewhere, because the outlier is near the middle of the dataset.
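The bandwidth choice can be explored with a hand-written kernel smoother. The sketch below is an assumption-laden illustration rather than the original analysis: it uses a Nadaraya-Watson estimator with a Gaussian kernel where h is the kernel standard deviation (the software used for the plots may scale h differently), simulated stand-in data, and hypothetical helper names `nw`, `loocv` and `roughness`.

```r
# Simulated stand-in for the prostate data (assumption: NOT the real values).
set.seed(1)
n <- 97
age <- round(rnorm(n, mean = 64, sd = 7))
lweight <- 2.3 + 0.02 * age + rnorm(n, sd = 0.4)
out_idx <- which.min(abs(age - mean(age)))
lweight[out_idx] <- lweight[out_idx] + 2.5  # planted outlier near the mean age

# Nadaraya-Watson estimate at each x0, Gaussian kernel with sd h.
nw <- function(x0, x, y, h) {
  sapply(x0, function(u) {
    w <- dnorm((u - x) / h)
    sum(w * y) / sum(w)
  })
}

# Leave-one-out cross-validation score for a given bandwidth.
loocv <- function(h, x, y) {
  pred <- sapply(seq_along(x), function(i) nw(x[i], x[-i], y[-i], h))
  mean((y - pred)^2)
}

# Tabulate the CV score on a small grid and look for an interior minimum.
h_grid <- c(0.7, 1, 2, 3, 5, 7, 10)
cv <- sapply(h_grid, loocv, x = age, y = lweight)
print(data.frame(h = h_grid, cv = round(cv, 4)))

# Total variation of the fitted curve: a rough measure of how wiggly it is.
grid <- seq(min(age), max(age), length.out = 200)
roughness <- function(h) sum(abs(diff(nw(grid, age, lweight, h))))
rough <- c(h_0.7 = roughness(0.7), h_7 = roughness(7))
print(round(rough, 3))
```

If the CV scores keep falling or rising across the whole grid, the criterion gives no interior minimum and the bandwidth has to be chosen by eye, as was done above.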
Using ggplot, I generated a LOESS (local polynomial regression) fit using three spans. Again, decreasing the span makes the curve more sensitive to nearby variance. At span = 0.7 there is a small but noticeable bump accommodating the outlier. At 0.3, the change is much more drastic and the line stays relatively flat at the edges of the graph. However, there is extreme fluctuation in the area surrounding the outlier, which does little to capture the overall pattern of the nearby points.
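The effect of the span can also be checked numerically with `stats::loess`, which is the fitting function behind ggplot2's `method = "loess"` smoother. The following is a sketch under assumptions: the data are simulated stand-ins, the spans 0.3 and 0.7 come from the text, and the third span of 1.0 is a guess, since the text does not state it.

```r
# Simulated stand-in for the prostate data (assumption: NOT the real values).
set.seed(1)
n <- 97
age <- round(rnorm(n, mean = 64, sd = 7))
lweight <- 2.3 + 0.02 * age + rnorm(n, sd = 0.4)
out_idx <- which.min(abs(age - mean(age)))
lweight[out_idx] <- lweight[out_idx] + 2.5  # planted outlier near the mean age
sim <- data.frame(age, lweight)

# 0.3 and 0.7 are from the text; 1.0 is an assumed third span.
spans <- c(0.3, 0.7, 1.0)
grid <- data.frame(age = seq(min(age), max(age), length.out = 100))

# One local quadratic fit (the loess default) per span.
fits <- lapply(spans, function(s) loess(lweight ~ age, data = sim, span = s))
preds <- sapply(fits, predict, newdata = grid)  # one column per span
colnames(preds) <- paste0("span_", spans)

# Sum of absolute second differences: larger means a wigglier curve.
rough <- colSums(abs(diff(preds, differences = 2)))
print(round(rough, 4))
```

A smaller span uses fewer neighbours per local fit, so the curve bends more readily toward individual points such as the outlier.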
All three of these methods appear insufficient for describing the variance of weight as a response to age. The LOESS model does the best job of modeling values far from the outlier, with a small accommodation for values nearby. This is in comparison to the kernel smoothing model, which has difficulty dealing with the sparse observations at the extreme ends of age. The ideal model should accommodate two important things about this relationship: that prostate weight increases with age overall, and that the majority of observations for age lie between 60 and 68. Prostate weights are likely only taken if there is reasonable suspicion of cancer, which also increases with age. Because of this, it’s reasonable to look for a model that captures both natural and pathological increases in prostate weight.