Lab 02: Functions and the prostate cancer set

Name: Daniel Heller

NSHE ID: 5002428049

Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file and raw Rmd file on Canvas, by Sunday 9/22, 11:59pm.

This lab starts the analysis of a prostate cancer data set and we’ll come back to it later.

Huber loss function

The Huber loss function (or just Huber function, for short) is defined as: \[ \psi(x) = \begin{cases} x^2 & \text{if $|x| \leq 1$} \\ 2|x| - 1 & \text{if $|x| > 1$} \end{cases} \] This function is quadratic on the interval [-1,1], and linear outside of this interval. It transitions from quadratic to linear “smoothly”.

It is often used in place of the usual squared error loss for robust estimation. The sample average, $\bar{X}$—which given a sample $X_1,\ldots,X_n$ minimizes the squared error loss $\sum_{i=1}^n (X_i-m)^2$ over all choices of $m$—can be inaccurate as an estimate of $\mathbb{E}(X)$ if the distribution of $X$ is heavy-tailed. In such cases, minimizing Huber loss can give a better estimate.

Some function tasks

1a. Write a function huber() that takes as an input a number $x$, and returns the Huber value $\psi(x)$, as defined above. Hint: the body of a function is just a block of R code, i.e., in this code you can use if() and else() statements. Check that huber(1) returns 1, and huber(4) returns 7.

huber <- function(x){
  if(abs(x) <= 1){
    x = x**2
    return(x)
  } else{
      x = 2*abs(x)-1
      return(x)
    }
}

huber(1)

## [1] 1

huber(4)

## [1] 7

1b. The Huber function can be modified so that the transition from quadratic to linear happens at an arbitrary cutoff value $a$, as in: \[ \psi_a(x) = \begin{cases} x^2 & \text{if $|x| \leq a$} \\ 2a|x| - a^2 & \text{if $|x| > a$} \end{cases} \] Starting with your solution code to the last question, update your huber() function so that it takes two arguments: $x$, a number at which to evaluate the loss, and $a$ a number representing the cutoff value. It should now return $\psi_a(x)$, as defined above. Check that huber(3, 2) returns 8, and huber(3, 4) returns 9.

huber <- function(x, a){
  if(abs(x) <= a){
    x = x**2
    return(x)
  } else{
      x = 2*a*abs(x)-a**2
      return(x)
    }
}

huber(3, 2)

## [1] 8

huber(3, 4)

## [1] 9

1c. Update your huber() function so that the default value of the cutoff $a$ is 1. Check that huber(3) returns 5.

huber <- function(x, a=1){
  if(abs(x) <= a){
    x = x**2
    return(x)
  } else{
      x = 2*a*abs(x)-a**2
      return(x)
    }
}

huber(3)

## [1] 5

1d. Check that huber(a=1, x=3) returns 5. Check that huber(1, 3) returns 1. Explain why these are different.

huber(x = 3, a=1)

## [1] 5

huber(1, 3)

## [1] 1

The two are different because in each call the values 1 and 3 are in different variables.

1e. Vectorize your huber() function, so that the first input can actually be a vector of numbers, and what is returned is a vector whose elements give the Huber evaluated at each of these numbers. Hint: you might try using ifelse(), if you haven’t already, to vectorize nicely. Check that huber(x=1:6, a=3) returns the vector of numbers (1, 4, 9, 15, 21, 27).

huber <- function(x, a){
  for(i in length(x)){
  x <- ifelse(abs(x)<= a, x**2, 2*a*abs(x)-a**2)
  return(x)
  }
}

huber(1:6, 3)

## [1]  1  4  9 15 21 27

Challenge. Your instructor computed the Huber function values $\psi_a(x)$ over a bunch of different $x$ values, stored in huber_vals and x_vals, respectively. However, the cutoff $a$ was, let’s say, lost. Using huber_vals, x_vals, and the definition of the Huber function, you should be able to figure out the cutoff value $a$, at least roughly. Estimate $a$ and explain how you got there. Hint: one way to estimate $a$ is to do so visually, using plotting tools; there are other ways too.

x.vals = seq(0, 5, length=21)
huber_vals = c(0.0000, 0.0625, 0.2500, 0.5625, 1.0000, 1.5625, 2.2500,
3.0625, 4.0000, 5.0625, 6.2500, 7.5625, 9.0000, 10.5000,
12.0000, 13.5000, 15.0000, 16.5000, 18.0000, 19.5000, 
21.0000)

Prostate cancer data set

We’re going to look at a data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). There are 9 variables measured on these 97 men:

lpsa: log PSA score
lcavol: log cancer volume
lweight: log prostate weight
age: age of patient
lbph: log of the amount of benign prostatic hyperplasia
svi: seminal vesicle invasion
lcp: log of capsular penetration
gleason: Gleason score
pgg45: percent of Gleason scores 4 or 5

To load this prostate cancer data set into your R session, and store it as a matrix pros_dat:

pros_dat <-
  as.matrix(read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat"))

Basic indexing and calculations

2a. What are the dimensions of pros_dat (i.e., how many rows and how many columns)? Using integer indexing, print the first 6 rows and all columns; again using integer indexing, print the last 6 rows and all columns.

dim(pros_dat)

## [1] 97  9

pros_dat[1:6,]

##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

pros_dat[-1:-91,]

##      lcavol  lweight age       lbph svi       lcp gleason pgg45     lpsa
## 92 2.532903 3.677566  61  1.3480732   1 -1.386294       7    15 4.129551
## 93 2.830268 3.876396  68 -1.3862944   1  1.321756       7    60 4.385147
## 94 3.821004 3.896909  44 -1.3862944   1  2.169054       7    40 4.684443
## 95 2.907447 3.396185  52 -1.3862944   1  2.463853       7    10 5.143124
## 96 2.882564 3.773910  68  1.5581446   1  1.558145       7    80 5.477509
## 97 3.471966 3.974998  68  0.4382549   1  2.904165       7    20 5.582932

2b. Using the built-in R functions head() and tail() (i.e., do not use integer indexing), print the first 6 rows and all columns, and also the last 6 rows and all columns.

head(pros_dat)

##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

tail(pros_dat)

##      lcavol  lweight age       lbph svi       lcp gleason pgg45     lpsa
## 92 2.532903 3.677566  61  1.3480732   1 -1.386294       7    15 4.129551
## 93 2.830268 3.876396  68 -1.3862944   1  1.321756       7    60 4.385147
## 94 3.821004 3.896909  44 -1.3862944   1  2.169054       7    40 4.684443
## 95 2.907447 3.396185  52 -1.3862944   1  2.463853       7    10 5.143124
## 96 2.882564 3.773910  68  1.5581446   1  1.558145       7    80 5.477509
## 97 3.471966 3.974998  68  0.4382549   1  2.904165       7    20 5.582932

2c. Does the matrix pros_dat have names assigned to its rows and columns, and if so, what are they? Use rownames() and colnames() to find out. Note: these would have been automatically created by the read.table() function that we used above to read the data file into our R session. To see where read.table() would have gotten these names from, open up the data file: http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat in your web browser. Only the column names here are actually informative.

rownames(pros_dat)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97"

colnames(pros_dat)

## [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason"
## [8] "pgg45"   "lpsa"

2d. Using named indexing, pull out the two columns of pros_dat that measure the log cancer volume and the log cancer weight, and store the result as a matrix pros_dat_sub. (Recall the explanation of variables at the top of this lab.) Check that its dimensions make sense to you, and that its first 6 rows are what you’d expect. Did R automatically assign column names to pros_dat_sub?

head(pros_dat_sub)

##       lcavol  lweight
## 1 -0.5798185 2.769459
## 2 -0.9942523 3.319626
## 3 -0.5108256 2.691243
## 4 -1.2039728 3.282789
## 5  0.7514161 3.432373
## 6 -1.0498221 3.228826

It turned out how I expected and yes R automatically assigned column names.

2e. Using the log cancer weights and log cancer volumes, calculate and store the log cancer density for the 97 men in the data set (note: by density here we mean weight divided by volume). Finally, display the first 6 values.

ldens <- pros_dat_sub[,2]/pros_dat_sub[,1]
head(ldens)

##         1         2         3         4         5         6 
## -4.776424 -3.338817 -5.268418 -2.726631  4.567873 -3.075593

2f. Append the log cancer density to the columns of pros_dat, using cbind(). The new pros_dat matrix should now have 10 columns. Set the last column name to be ldens. Print its first 6 rows, to check that you’ve done all this right.

pros_dat <- cbind(pros_dat, ldens)
head(pros_dat)

##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##       ldens
## 1 -4.776424
## 2 -3.338817
## 3 -5.268418
## 4 -2.726631
## 5  4.567873
## 6 -3.075593

Exploratory data analysis with plots

3a. Using hist(), produce a histogram of the log cancer volume measurements of the 97 men in the data set; also produce a histogram of the log cancer weight. In each case, use breaks=20 as an arugment to hist(). Comment just briefly on the distributions you see. Then, using plot(), produce a scatterplot of the log cancer volume (y-axis) versus the log cancer weight (x-axis). Do you see any kind of relationship? Would you expect to? Challenge: how would you measure the strength of this relationship formally? Note that there is certainly more than one way to do so. We’ll talk about statistical modeling tools later in the course.

hist(pros_dat[,1], breaks = 20, xlab = "log cancer volume")

hist(pros_dat[,2], breaks = 20, xlab = "log cancer weight")

The plots both resemble a normal distribution of a large middle section and small tails, however there are a few spikes in both.

plot(pros_dat[,2],pros_dat[,1], xlab = "log cancer weight", ylab = "log cancer volume")

There does seem to be a relationship between the points as they all are clustered together around the center, we should expect this result after we look at the histograms where both were strongly centered around the center of their respective graphs.

3b. Produce scatterplots of log cancer weight versus age, and log cancer volume versus age. Do you see relationships here between the age of a patient and the volume/weight of his cancer?

plot(pros_dat[,3], pros_dat[,2], xlab = "age", ylab = "log cancer weight")

plot(pros_dat[,3], pros_dat[,1], xlab = "age", ylab = "log cancer volume")

There does seem to be a relationship corresponding to both graphs, which is that there is a bulk of the points in the top right corner of the graph indicating that there is a correlation between advanced age and cancer development.

3c. Produce a histogram of the log cancer density, and a scatterplot of the log cancer density versus age. Comment on any similarities/differences you see between these plots, and the corresponding ones you produced above for log cancer volume/weight.

hist(pros_dat[,10], xlab = "log cancer density")

plot(pros_dat[,3],pros_dat[,10], xlab = "age", ylab = "log cancer density")

The similarities between these plots and the ones produced before are that the histogram is still centered around the middle, however this time the majority are in the middle and there are some outliers in the graph on the ends. As well there is a large difference in the scatterplot in that it is all in a middle line with only two points off of that line.

3d. Delete the last column, corresponding to the log cancer density, from the pros_dat matrix, using negative integer indexing.

pros_dat <- pros_dat[, -10]
head(pros_dat)

##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

A bit of Boolean indexing never hurt anyone

4a. The svi variable in the pros_dat matrix is binary: 1 if the patient had a condition called “seminal vesicle invasion” or SVI, and 0 otherwise. SVI (which means, roughly speaking, that the cancer invaded into the muscular wall of the seminal vesicle) is bad: if it occurs, then it is believed the prognosis for the patient is poorer, and even once/if recovered, the patient is more likely to have prostate cancer return in the future. Compute a Boolean vector called has_svi, of length 97, that has a TRUE element if a row (patient) in pros_dat has SVI, and FALSE otherwise. Then using sum(), figure out how many patients have SVI.

has_svi <- pros_dat[, 5] > 0
sum(has_svi)

## [1] 21

4b. Extract the rows of pros_dat that correspond to patients with SVI, and the rows that correspond to patients without it. Call the resulting matrices pros_dat_svi and pros_dat_no_svi, respectively. You can do this in two ways: using the has_svi Boolean vector created above, or using on-the-fly Boolean indexing, it’s up to you. Check that the dimensions of pros_dat_svi and pros_dat_no_svi make sense to you.

pros_dat_svi <- pros_dat[has_svi,]
pros_dat_no_svi <- pros_dat[!has_svi,]
dim(pros_dat_svi)

## [1] 21  9

dim(pros_dat_no_svi)

## [1] 76  9

The dimensions make sense because there should only be 9 columns since we deleted the tenth and then the rows are equal to 97 as they should be.

4c. Using the two matrices pros_dat_svi and pros_dat_no_svi that you created above, compute the means of each variable in our data set for patients with SVI, and for patients without it. Store the resulting means into vectors called pros_dat_svi_avg and pros_dat_no_svi_avg, respectively. Hint: for each matrix, you can compute the means with a single call to a built-in R function. What variables appear to have different means between the two groups?

pros_dat_svi_avg

## [1]  2.551959  3.754927 65.523810 -0.135346  1.000000  1.601858  7.190476
## [8] 48.809524  3.715360

pros_dat_no_svi_avg

## [1]  1.0178918  3.5941312 65.5238095  0.1654837  0.0000000  1.6018579
## [7]  6.6315789 17.6315789  2.1365916

Every variable is different except age andlcp, however lweight is only slightly different.

Computing standard deviations using iteration

5a. Take a look at the starter code below. The first line defines an empty vector pros_dat_svi_sd of length ncol(pros_dat) (of length 9). The second line defines an index variable i and sets it equal to 1. Write a third line of code to compute the standard deviation of the ith column of pros_dat_svi, using a built-in R function, and store this value in the ith element of pros_dat_svi_sd.

pros_dat_svi_sd = vector(length=ncol(pros_dat))
i = 1
pros_dat_svi_sd[i] <- sd(pros_dat_svi[i])

5b. Repeat the calculation as in the previous question, but for patients without SVI. That is, produce three lines of code: the first should define an empty vector pros_dat_no_svi_sd of length ncol(pros_dat) (of length 9), the second should define an index variable i and set it equal to 1, and the third should fill the ith element of pros_dat_no_svi_sd with the standard deviation of the ith column of pros_dat_no_svi.

pros_dat_no_svi_sd = vector(length=ncol(pros_dat))
i = 1
pros_dat_no_svi_sd[i] <- sd(pros_dat_no_svi[i])

5c. Write a for() loop to compute the standard deviations of the columns of pros_dat_svi and pros_dat_no_svi, and store the results in the vectors pros_dat_svi_sd and pros_dat_no_svi_sd, respectively, that were created above. Note: you should have a single for() loop here, not two for loops. And if it helps, consider breaking this task down into two steps: as the first step, write a for() loop that iterates an index variable i over the integers between 1 and the number of columns of pros_dat (don’t just manually write 9 here, pull out the number of columns programmatically), with an empty body. As the second step, paste relevant pieces of your solution code from Q5a and Q5b into the body of the for() loop. Print out the resulting vectors pros_dat_svi_sd and pros_dat_no_svi_sd to the console. Comment, just briefly (informally), by visually inspecting these standard deviations and the means you computed in Q4c: which variables exhibit large differences in means between the SVI and non-SVI patients, relative to their standard deviations?

pros_dat_svi_sd

## [1]  0.6707867  0.3275689  7.8715885  1.3545258  0.0000000  1.0452899
## [7]  0.6015852 25.7344498  0.9251229

pros_dat_no_svi_sd

## [1]  1.0685730  0.4479291  7.3105907  1.4782007  0.0000000  1.0379398
## [7]  0.7088414 25.0667600  0.9646403

The pgg45 and lpsa have relatively high mean differences especially relative to their standard deviations, which aren’t too different.

5d. The starter code below partially computes to calculate the standard deviations of the columns of pros_dat_svi and pros_dat_no_svi, and stores them in pros_dat_svi_sd_master and pros_dat_no_svi_sd_master, respectively, using apply(). Remove eval=FALSE as an option to the Rmd code chunk and complete the code. Then check using all.equal() that the standard deviations you computed in the previous question equal these “master” copies. Note: use check.names=FALSE as a third argument to all.equal(), which instructs it to ignore the names of its first two arguments. (If all.equal() doesn’t succeed in both cases, then you must have done something wrong in computing the standard deviations, so go back and fix them!)

pros_dat_svi_sd_master = apply(pros_dat_svi, MARGIN = 2, FUN = sd)
pros_dat_no_svi_sd_master = apply(pros_dat_no_svi, MARGIN = 2, FUN = sd)
all.equal(pros_dat_svi_sd, pros_dat_svi_sd_master, check.names=FALSE)

## [1] TRUE

all.equal(pros_dat_no_svi_sd, pros_dat_no_svi_sd_master, check.names=FALSE)

## [1] TRUE