Homework2

3.15 A paper by Robertson et al. [1976] discusses the level of plasma prostaglandin E (iPGE) in patients with cancer with and without hypercalcemia. The data are given in Table 3.20. Note that the variables are the mean plasma iPGE and mean serum Ca levels—presumably, more than one assay was carried out for each patient’s level. The number of such tests for each patient is not indicated, nor is the criterion for the number.

a) Calculate the mean and standard deviation of plasma iPGE level for patients with hypercalcemia; do the same for patients without hypercalcemia.

rm(list = setdiff(ls(), lsf.str()))
iPGE_wH <- c(500, 500, 301, 272, 226, 183, 183, 177, 136, 118, 60)
iPGE_woH <- c(254, 172, 168, 150, 148, 144, 130, 121, 100, 88)
mean_wH <- mean(iPGE_wH, na.rm = TRUE)
mean_woH <- mean(iPGE_woH, na.rm = TRUE)
sd_wH <- sd(iPGE_wH, na.rm = TRUE)
sd_woH <- sd(iPGE_woH, na.rm=TRUE)
# Report the mean and SD for both groups
cat("With hypercalcemia: mean =", round(mean_wH, digits = 2), "pg/mL, SD =",
    round(sd_wH, digits = 2), "pg/mL\n")
cat("Without hypercalcemia: mean =", round(mean_woH, digits = 2), "pg/mL, SD =",
    round(sd_woH, digits = 2), "pg/mL\n")

## With hypercalcemia: mean = 241.45 pg/mL, SD = 144.46 pg/mL
## Without hypercalcemia: mean = 147.5 pg/mL, SD = 46.17 pg/mL

b) Make box plots for plasma iPGE levels for each group. Can you draw any conclusions from these plots? Do they suggest that the two groups differ in plasma iPGE levels?

# Boxplot for plasma iPGE levels for patients with cancer with and without hypercalcemia
# A named list labels each box with its group
groups <- list("With Hypercalcemia" = iPGE_wH, "Without Hypercalcemia" = iPGE_woH)
boxplot(groups, main = "Box Plot of Plasma iPGE With/Without Hypercalcemia",
        ylab = "Mean Plasma iPGE (pg/mL)", las = 1)

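The box plots show that patients with hypercalcemia (Group A) have a higher median, first quartile, third quartile, and maximum plasma iPGE than patients without hypercalcemia (Group B), although the minimum of Group A is lower than that of Group B. Each group shows one outlier marker above the upper whisker; in Group A that marker represents two patients who both have a value of 500 pg/mL.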
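As a check on which observations the box plots flag, the sketch below uses boxplot.stats() from base R, which by default applies the same 1.5 x IQR rule that boxplot() uses. The names out_wH and out_woH are introduced here only for illustration.

```r
# Data as entered in part (a)
iPGE_wH  <- c(500, 500, 301, 272, 226, 183, 183, 177, 136, 118, 60)
iPGE_woH <- c(254, 172, 168, 150, 148, 144, 130, 121, 100, 88)

# $out holds the values lying beyond the whiskers (1.5 x IQR from the hinges)
out_wH  <- boxplot.stats(iPGE_wH)$out
out_woH <- boxplot.stats(iPGE_woH)$out

print(out_wH)   # 500 500: two patients plotted at the same position
print(out_woH)  # 254
```

Because the two 500 pg/mL values overlap exactly, the plot alone cannot show that Group A has two flagged observations.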
In the box plots, the third quartile (75th percentile) of Group B lies below the median of Group A, which suggests that plasma iPGE levels tend to be higher in Group A than in Group B. This impression still needs to be confirmed with a quantitative test such as a t-test.
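A minimal sketch of such a test follows. The write-up names only "a t-test", so the choice of Welch's unequal-variance version (the default of t.test()) is an assumption made here because the two groups have very different spreads. The Wilcoxon rank-sum test is added as a rank-based alternative that is less sensitive to the outliers. The names welch and rank_sum are illustrative.

```r
# Data as entered in part (a)
iPGE_wH  <- c(500, 500, 301, 272, 226, 183, 183, 177, 136, 118, 60)
iPGE_woH <- c(254, 172, 168, 150, 148, 144, 130, 121, 100, 88)

# Welch two-sample t-test (var.equal = FALSE is the default)
welch <- t.test(iPGE_wH, iPGE_woH)

# Wilcoxon rank-sum test; exact = FALSE because the data contain ties
rank_sum <- wilcox.test(iPGE_wH, iPGE_woH, exact = FALSE)

print(welch)
print(rank_sum)
```

With only 11 and 10 patients and clear outliers in both groups, either p-value should be read cautiously.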

c) The article states that normal limits for serum calcium levels are 8.5 to 10.5 mg/dL. It is clear that patients were classified as hypercalcemic if their serum calcium levels exceeded 10.5 mg/dL. Without classifying patients it may be postulated that high plasma iPGE levels tend to be associated with high serum calcium levels. Make a plot of the plasma iPGE and serum calcium levels to determine if there is a suggestion of a pattern relating these two variables.

# Scatter Plots for all types of patients
iPGE <- c(iPGE_wH, iPGE_woH)
calcium <- c(13.3, 11.2, 13.4, 11.5, 11.4, 11.6, 11.7, 12.1, 12.5, 12.2, 18.0, 
             10.1, 9.4, 9.3, 8.6, 10.5, 10.3, 10.5, 10.2, 9.7, 9.2)
plot(iPGE, calcium, ylab = "Serum Calcium (mg/dL)", xlab = "Plasma iPGE (pg/mL)", 
     main = "Scatter Plot for All The Patients")
fit <- lm(calcium ~ iPGE)  # fit once and reuse it for the line and the legend
abline(fit, col = "red")
legend("topright", bty = "n",
       legend = c(paste("Adjusted R2 =", format(summary(fit)$adj.r.squared, digits = 4)),
                  paste("y =", round(coef(fit)[[2]], digits = 4), "* x +",
                        round(coef(fit)[[1]], digits = 2))))

In the scatter plot above, three points are possible outliers: the two patients with a plasma iPGE of 500 pg/mL and the patient with a plasma iPGE of 60 pg/mL. The next scatter plot excludes these three points.

# Exclude the 3 possible outliers (plasma iPGE of 500, 500 and 60 pg/mL)
keep <- !(iPGE %in% c(500, 60))  # one mask keeps the iPGE/calcium pairs aligned
iPGEo <- iPGE[keep]
calciumo <- calcium[keep]
plot(iPGEo, calciumo, ylab = "Serum Calcium (mg/dL)", xlab = "Plasma iPGE (pg/mL)",
     main = "Scatter Plot with Possible Outliers Removed")
fit2 <- lm(calciumo ~ iPGEo)
abline(fit2, col = "red")
legend("topright", bty = "n",
       legend = c(paste("Adjusted R2 =", format(summary(fit2)$adj.r.squared, digits = 4)),
                  paste("y =", round(coef(fit2)[[2]], digits = 4), "* x +",
                        round(coef(fit2)[[1]], digits = 2))))

The scatter plot of all patients gives no clear indication that serum calcium and plasma iPGE are correlated: the adjusted R-squared of the linear fit is -0.04104. Even after excluding the three patients with apparently abnormal values, the adjusted R-squared is still low (0.1558). Thus, there is no obvious pattern relating these two variables.
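The adjusted R-squared describes the fit but is not a test of association. As a complement, the sketch below runs cor.test() on the same 21 pairs. Using Pearson and Spearman correlations here is an addition for illustration, not part of the original analysis, and the names pearson and spearman are hypothetical.

```r
# Paired data as entered in parts (a) and (c)
iPGE <- c(500, 500, 301, 272, 226, 183, 183, 177, 136, 118, 60,
          254, 172, 168, 150, 148, 144, 130, 121, 100, 88)
calcium <- c(13.3, 11.2, 13.4, 11.5, 11.4, 11.6, 11.7, 12.1, 12.5, 12.2, 18.0,
             10.1, 9.4, 9.3, 8.6, 10.5, 10.3, 10.5, 10.2, 9.7, 9.2)

# Pearson correlation: tests for a linear association
pearson <- cor.test(iPGE, calcium)

# Spearman rank correlation: less sensitive to the extreme points;
# exact = FALSE because both variables contain ties
spearman <- cor.test(iPGE, calcium, method = "spearman", exact = FALSE)

print(pearson)
print(spearman)
```

The rank-based version uses all patients without giving the extreme values undue weight, so it avoids having to decide which points to drop.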
