Name: Daniel Heller
NSHE ID: 5002428049
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file and raw Rmd file on Canvas, by Sunday 9/22, 11:59pm.
This lab starts the analysis of a prostate cancer data set and we’ll come back to it later.
The Huber loss function (or just Huber function, for short) is defined as: \[ \psi(x) = \begin{cases} x^2 & \text{if $|x| \leq 1$} \\ 2|x| - 1 & \text{if $|x| > 1$} \end{cases} \] This function is quadratic on the interval [-1,1], and linear outside of this interval. It transitions from quadratic to linear “smoothly”.
It is often used in place of the usual squared error loss for robust estimation. The sample average, \(\bar{X}\)—which given a sample \(X_1,\ldots,X_n\) minimizes the squared error loss \(\sum_{i=1}^n (X_i-m)^2\) over all choices of \(m\)—can be inaccurate as an estimate of \(\mathbb{E}(X)\) if the distribution of \(X\) is heavy-tailed. In such cases, minimizing Huber loss can give a better estimate.
huber() that takes as an input a number \(x\), and returns the Huber value \(\psi(x)\), as defined above. Hint: the body of a function is just a block of R code, i.e., in this code you can use if() and else() statements. Check that huber(1) returns 1, and huber(4) returns 7.huber <- function(x){
if(abs(x) <= 1){
x = x**2
return(x)
} else{
x = 2*abs(x)-1
return(x)
}
}
huber(1)
## [1] 1
huber(4)
## [1] 7
huber() function so that it takes two arguments: \(x\), a number at which to evaluate the loss, and \(a\) a number representing the cutoff value. It should now return \(\psi_a(x)\), as defined above. Check that huber(3, 2) returns 8, and huber(3, 4) returns 9.huber <- function(x, a){
if(abs(x) <= a){
x = x**2
return(x)
} else{
x = 2*a*abs(x)-a**2
return(x)
}
}
huber(3, 2)
## [1] 8
huber(3, 4)
## [1] 9
huber() function so that the default value of the cutoff \(a\) is 1. Check that huber(3) returns 5.huber <- function(x, a=1){
if(abs(x) <= a){
x = x**2
return(x)
} else{
x = 2*a*abs(x)-a**2
return(x)
}
}
huber(3)
## [1] 5
huber(a=1, x=3) returns 5. Check that huber(1, 3) returns 1. Explain why these are different.huber(x = 3, a=1)
## [1] 5
huber(1, 3)
## [1] 1
The two are different because in each call the values 1 and 3 are in different variables.
huber() function, so that the first input can actually be a vector of numbers, and what is returned is a vector whose elements give the Huber evaluated at each of these numbers. Hint: you might try using ifelse(), if you haven’t already, to vectorize nicely. Check that huber(x=1:6, a=3) returns the vector of numbers (1, 4, 9, 15, 21, 27).huber <- function(x, a){
for(i in length(x)){
x <- ifelse(abs(x)<= a, x**2, 2*a*abs(x)-a**2)
return(x)
}
}
huber(1:6, 3)
## [1] 1 4 9 15 21 27
huber_vals and x_vals, respectively. However, the cutoff \(a\) was, let’s say, lost. Using huber_vals, x_vals, and the definition of the Huber function, you should be able to figure out the cutoff value \(a\), at least roughly. Estimate \(a\) and explain how you got there. Hint: one way to estimate \(a\) is to do so visually, using plotting tools; there are other ways too.x.vals = seq(0, 5, length=21)
huber_vals = c(0.0000, 0.0625, 0.2500, 0.5625, 1.0000, 1.5625, 2.2500,
3.0625, 4.0000, 5.0625, 6.2500, 7.5625, 9.0000, 10.5000,
12.0000, 13.5000, 15.0000, 16.5000, 18.0000, 19.5000,
21.0000)
We’re going to look at a data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). There are 9 variables measured on these 97 men:
lpsa: log PSA scorelcavol: log cancer volumelweight: log prostate weightage: age of patientlbph: log of the amount of benign prostatic hyperplasiasvi: seminal vesicle invasionlcp: log of capsular penetrationgleason: Gleason scorepgg45: percent of Gleason scores 4 or 5To load this prostate cancer data set into your R session, and store it as a matrix pros_dat:
pros_dat <-
as.matrix(read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat"))
pros_dat (i.e., how many rows and how many columns)? Using integer indexing, print the first 6 rows and all columns; again using integer indexing, print the last 6 rows and all columns.dim(pros_dat)
## [1] 97 9
pros_dat[1:6,]
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
pros_dat[-1:-91,]
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 92 2.532903 3.677566 61 1.3480732 1 -1.386294 7 15 4.129551
## 93 2.830268 3.876396 68 -1.3862944 1 1.321756 7 60 4.385147
## 94 3.821004 3.896909 44 -1.3862944 1 2.169054 7 40 4.684443
## 95 2.907447 3.396185 52 -1.3862944 1 2.463853 7 10 5.143124
## 96 2.882564 3.773910 68 1.5581446 1 1.558145 7 80 5.477509
## 97 3.471966 3.974998 68 0.4382549 1 2.904165 7 20 5.582932
head() and tail() (i.e., do not use integer indexing), print the first 6 rows and all columns, and also the last 6 rows and all columns.head(pros_dat)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
tail(pros_dat)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 92 2.532903 3.677566 61 1.3480732 1 -1.386294 7 15 4.129551
## 93 2.830268 3.876396 68 -1.3862944 1 1.321756 7 60 4.385147
## 94 3.821004 3.896909 44 -1.3862944 1 2.169054 7 40 4.684443
## 95 2.907447 3.396185 52 -1.3862944 1 2.463853 7 10 5.143124
## 96 2.882564 3.773910 68 1.5581446 1 1.558145 7 80 5.477509
## 97 3.471966 3.974998 68 0.4382549 1 2.904165 7 20 5.582932
pros_dat have names assigned to its rows and columns, and if so, what are they? Use rownames() and colnames() to find out. Note: these would have been automatically created by the read.table() function that we used above to read the data file into our R session. To see where read.table() would have gotten these names from, open up the data file: http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat in your web browser. Only the column names here are actually informative.rownames(pros_dat)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97"
colnames(pros_dat)
## [1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason"
## [8] "pgg45" "lpsa"
pros_dat that measure the log cancer volume and the log cancer weight, and store the result as a matrix pros_dat_sub. (Recall the explanation of variables at the top of this lab.) Check that its dimensions make sense to you, and that its first 6 rows are what you’d expect. Did R automatically assign column names to pros_dat_sub?head(pros_dat_sub)
## lcavol lweight
## 1 -0.5798185 2.769459
## 2 -0.9942523 3.319626
## 3 -0.5108256 2.691243
## 4 -1.2039728 3.282789
## 5 0.7514161 3.432373
## 6 -1.0498221 3.228826
It turned out how I expected and yes R automatically assigned column names.
ldens <- pros_dat_sub[,2]/pros_dat_sub[,1]
head(ldens)
## 1 2 3 4 5 6
## -4.776424 -3.338817 -5.268418 -2.726631 4.567873 -3.075593
pros_dat, using cbind(). The new pros_dat matrix should now have 10 columns. Set the last column name to be ldens. Print its first 6 rows, to check that you’ve done all this right.pros_dat <- cbind(pros_dat, ldens)
head(pros_dat)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
## ldens
## 1 -4.776424
## 2 -3.338817
## 3 -5.268418
## 4 -2.726631
## 5 4.567873
## 6 -3.075593
hist(), produce a histogram of the log cancer volume measurements of the 97 men in the data set; also produce a histogram of the log cancer weight. In each case, use breaks=20 as an arugment to hist(). Comment just briefly on the distributions you see. Then, using plot(), produce a scatterplot of the log cancer volume (y-axis) versus the log cancer weight (x-axis). Do you see any kind of relationship? Would you expect to? Challenge: how would you measure the strength of this relationship formally? Note that there is certainly more than one way to do so. We’ll talk about statistical modeling tools later in the course.hist(pros_dat[,1], breaks = 20, xlab = "log cancer volume")
hist(pros_dat[,2], breaks = 20, xlab = "log cancer weight")
The plots both resemble a normal distribution of a large middle section and small tails, however there are a few spikes in both.
plot(pros_dat[,2],pros_dat[,1], xlab = "log cancer weight", ylab = "log cancer volume")
There does seem to be a relationship between the points as they all are clustered together around the center, we should expect this result after we look at the histograms where both were strongly centered around the center of their respective graphs.
plot(pros_dat[,3], pros_dat[,2], xlab = "age", ylab = "log cancer weight")
plot(pros_dat[,3], pros_dat[,1], xlab = "age", ylab = "log cancer volume")
There does seem to be a relationship corresponding to both graphs, which is that there is a bulk of the points in the top right corner of the graph indicating that there is a correlation between advanced age and cancer development.
hist(pros_dat[,10], xlab = "log cancer density")
plot(pros_dat[,3],pros_dat[,10], xlab = "age", ylab = "log cancer density")
The similarities between these plots and the ones produced before are that the histogram is still centered around the middle, however this time the majority are in the middle and there are some outliers in the graph on the ends. As well there is a large difference in the scatterplot in that it is all in a middle line with only two points off of that line.
pros_dat matrix, using negative integer indexing.pros_dat <- pros_dat[, -10]
head(pros_dat)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
svi variable in the pros_dat matrix is binary: 1 if the patient had a condition called “seminal vesicle invasion” or SVI, and 0 otherwise. SVI (which means, roughly speaking, that the cancer invaded into the muscular wall of the seminal vesicle) is bad: if it occurs, then it is believed the prognosis for the patient is poorer, and even once/if recovered, the patient is more likely to have prostate cancer return in the future. Compute a Boolean vector called has_svi, of length 97, that has a TRUE element if a row (patient) in pros_dat has SVI, and FALSE otherwise. Then using sum(), figure out how many patients have SVI.has_svi <- pros_dat[, 5] > 0
sum(has_svi)
## [1] 21
pros_dat that correspond to patients with SVI, and the rows that correspond to patients without it. Call the resulting matrices pros_dat_svi and pros_dat_no_svi, respectively. You can do this in two ways: using the has_svi Boolean vector created above, or using on-the-fly Boolean indexing, it’s up to you. Check that the dimensions of pros_dat_svi and pros_dat_no_svi make sense to you.pros_dat_svi <- pros_dat[has_svi,]
pros_dat_no_svi <- pros_dat[!has_svi,]
dim(pros_dat_svi)
## [1] 21 9
dim(pros_dat_no_svi)
## [1] 76 9
The dimensions make sense because there should only be 9 columns since we deleted the tenth and then the rows are equal to 97 as they should be.
pros_dat_svi and pros_dat_no_svi that you created above, compute the means of each variable in our data set for patients with SVI, and for patients without it. Store the resulting means into vectors called pros_dat_svi_avg and pros_dat_no_svi_avg, respectively. Hint: for each matrix, you can compute the means with a single call to a built-in R function. What variables appear to have different means between the two groups?pros_dat_svi_avg
## [1] 2.551959 3.754927 65.523810 -0.135346 1.000000 1.601858 7.190476
## [8] 48.809524 3.715360
pros_dat_no_svi_avg
## [1] 1.0178918 3.5941312 65.5238095 0.1654837 0.0000000 1.6018579
## [7] 6.6315789 17.6315789 2.1365916
Every variable is different except age andlcp, however lweight is only slightly different.
pros_dat_svi_sd of length ncol(pros_dat) (of length 9). The second line defines an index variable i and sets it equal to 1. Write a third line of code to compute the standard deviation of the ith column of pros_dat_svi, using a built-in R function, and store this value in the ith element of pros_dat_svi_sd.pros_dat_svi_sd = vector(length=ncol(pros_dat))
i = 1
pros_dat_svi_sd[i] <- sd(pros_dat_svi[i])
pros_dat_no_svi_sd of length ncol(pros_dat) (of length 9), the second should define an index variable i and set it equal to 1, and the third should fill the ith element of pros_dat_no_svi_sd with the standard deviation of the ith column of pros_dat_no_svi.pros_dat_no_svi_sd = vector(length=ncol(pros_dat))
i = 1
pros_dat_no_svi_sd[i] <- sd(pros_dat_no_svi[i])
for() loop to compute the standard deviations of the columns of pros_dat_svi and pros_dat_no_svi, and store the results in the vectors pros_dat_svi_sd and pros_dat_no_svi_sd, respectively, that were created above. Note: you should have a single for() loop here, not two for loops. And if it helps, consider breaking this task down into two steps: as the first step, write a for() loop that iterates an index variable i over the integers between 1 and the number of columns of pros_dat (don’t just manually write 9 here, pull out the number of columns programmatically), with an empty body. As the second step, paste relevant pieces of your solution code from Q5a and Q5b into the body of the for() loop. Print out the resulting vectors pros_dat_svi_sd and pros_dat_no_svi_sd to the console. Comment, just briefly (informally), by visually inspecting these standard deviations and the means you computed in Q4c: which variables exhibit large differences in means between the SVI and non-SVI patients, relative to their standard deviations?pros_dat_svi_sd
## [1] 0.6707867 0.3275689 7.8715885 1.3545258 0.0000000 1.0452899
## [7] 0.6015852 25.7344498 0.9251229
pros_dat_no_svi_sd
## [1] 1.0685730 0.4479291 7.3105907 1.4782007 0.0000000 1.0379398
## [7] 0.7088414 25.0667600 0.9646403
The pgg45 and lpsa have relatively high mean differences especially relative to their standard deviations, which aren’t too different.
pros_dat_svi and pros_dat_no_svi, and stores them in pros_dat_svi_sd_master and pros_dat_no_svi_sd_master, respectively, using apply(). Remove eval=FALSE as an option to the Rmd code chunk and complete the code. Then check using all.equal() that the standard deviations you computed in the previous question equal these “master” copies. Note: use check.names=FALSE as a third argument to all.equal(), which instructs it to ignore the names of its first two arguments. (If all.equal() doesn’t succeed in both cases, then you must have done something wrong in computing the standard deviations, so go back and fix them!)pros_dat_svi_sd_master = apply(pros_dat_svi, MARGIN = 2, FUN = sd)
pros_dat_no_svi_sd_master = apply(pros_dat_no_svi, MARGIN = 2, FUN = sd)
all.equal(pros_dat_svi_sd, pros_dat_svi_sd_master, check.names=FALSE)
## [1] TRUE
all.equal(pros_dat_no_svi_sd, pros_dat_no_svi_sd_master, check.names=FALSE)
## [1] TRUE