Take Home Final Notebook

0.1 Question 1

data_1<-read.csv("final_1.csv")
data_2<-read.csv("final_2.csv")
head(data_1)

##       Y     P1    P2     P3
## 1 0.019 24.554 1.296  4.629
## 2 0.010 25.855 1.474  0.849
## 3 0.008 26.119 0.071  2.685
## 4 0.021 23.372 0.760  6.048
## 5 0.028 25.437 0.874 12.132
## 6 0.009 26.844 1.420  1.391

Data file “final_1.csv”, defined as data_1, provides preclinical study data examining the relationship of drug concentration in liver (Y) and the concentration of 3 liver proteins (P1, P2, P3).

Create histograms of the four variables (Y, P1, P2, P3)

attach(data_1)

hist_Y<-hist(Y, col="red", xlab = "Drug Concentration (var. Y)",
main="Frequency Distribution of Var. 'Y' (Drug Concentration in Liver)")

hist_P1<-hist(P1, col="red", xlab = "P1 Liver Protein Level",
main="Freqency Distribution of Liver Protein P1")

hist_P2<-hist(P2, col="red", xlab = "P2 Liver Protein Level",
main="Freqency Distribution of Liver Protein P2")

hist_P3<-hist(P3, col="red", xlab = "P3 Liver Protein Level",
main="Freqency Distribution of Liver Protein P3")

P1 appear to be approximately normally distributed. Y, P2, P3 do not appear to be normally distributed.

# DispersionCalculations
sd_P1 <- sd(P1)
sd_P1

## [1] 2.139313

sd_P2 <- sd(P2)
sd_P2

## [1] 0.8802262

sd_P3 <- sd(P3)
sd_P3

## [1] 4.747751

sd_Y <- sd(Y)
sd_Y

## [1] 0.03168693

variance_Y <- (sd_Y)^2
variance_Y

## [1] 0.001004061

variance_P1 <- (sd_P1)^2
variance_P1

## [1] 4.576659

variance_P2 <- (sd_P2)^2
variance_P2

## [1] 0.7747982

variance_P3 <- (sd_P3)^2
variance_P3

## [1] 22.54114

#Range. Min/Max values pulled from Summary function
range_Y <- 0.395-0.003
range_P1 <- 30.30 - 18.83
range_P2 <- 5.037 - 0.004
range_P3 <- 28.181 - 0

range_Y

## [1] 0.392

range_P1

## [1] 11.47

range_P2

## [1] 5.033

range_P3

## [1] 28.181

Central Tendency Calculations: Mean and Median

# Central Tendency Calculations
summary(data_1)

##        Y                 P1              P2               P3        
##  Min.   :0.00300   Min.   :18.83   Min.   :0.0040   Min.   : 0.000  
##  1st Qu.:0.01000   1st Qu.:23.46   1st Qu.:0.3560   1st Qu.: 1.929  
##  Median :0.01500   Median :25.07   Median :0.7515   Median : 3.950  
##  Mean   :0.02141   Mean   :24.99   Mean   :0.9772   Mean   : 5.229  
##  3rd Qu.:0.02300   3rd Qu.:26.39   3rd Qu.:1.3650   3rd Qu.: 7.149  
##  Max.   :0.39500   Max.   :30.30   Max.   :5.0370   Max.   :28.181

Use a matrix plot to visualize associations among variables (Y, P1, P2, P3).

#Define Data Frame
df <- data.frame(Y, P1, P2, P3)

# Matrix Plot
plot(df, pch=18, col ="steelblue") #Basic Matrix pplot

# install and load GGolly for a better visual
# install.packages("GGally")
library(GGally)

## Warning: package 'GGally' was built under R version 4.2.2

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(df)

Estimate a multiple linear regression model for the outcome Y, using P1, P2, and P3 as predictors.

# Fitting the Model
fit<-lm(Y~ P1 + P2 + P3, data=data_1)
summary(fit)

## 
## Call:
## lm(formula = Y ~ P1 + P2 + P3, data = data_1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.042986 -0.008581 -0.001209  0.005788  0.272024 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0903518  0.0188092   4.804 3.09e-06 ***
## P1          -0.0039507  0.0007521  -5.253 3.88e-07 ***
## P2           0.0083762  0.0018286   4.581 8.23e-06 ***
## P3           0.0041287  0.0003386  12.195  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02261 on 196 degrees of freedom
## Multiple R-squared:  0.4985, Adjusted R-squared:  0.4909 
## F-statistic: 64.95 on 3 and 196 DF,  p-value: < 2.2e-16

Therefore, the multiple linear regression model can be expressed as: \[ Y = 0.90352 - 0.003951*P1 + 0.008376*P2 + 0.004129*P3 \]

Based on the regression model developed for part-d, evaluate the normality of residuals and constant error variance assumptions. State your conclusions.

qqnorm(residuals(fit))
qqline(residuals(fit))

The Q-Q plot gives us a graphical visualization of normality data. The data appears to be normal based off the the plot.

A more definitive, quantitative method of testing for normality is with the Shapiro-Wilk Hypothesis Test

#Shapiro Test for Normality
shapiro.test(residuals(fit))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit)
## W = 0.44738, p-value < 2.2e-16

Null Hypothesis: Normal Error Alternative Hypothesis: Non-Normal Error

With a significance level of $.01$, and $p-value < 0.01$, we reject the null hypothesis and conclude that errors are not normally distributed.

Upon further inspection of the QQ plot, i see how the points form an S (sigmoidal?) about the normal line. I see how this is not normally distributed

# Evaluate Constant Variance Assumption
plot(fitted(fit), residuals(fit), xlab="Predicted Values",
ylab="Residual",
main="Checking for Constant Variance",
pch=19, col="darkblue")
abline(h=0)

Based off the plot, this does not appear to have constant variance: far from it.

Evaluate Homoscedasticity - non-constant error variance test]

Null Hypotheses: Constant Error Variance Alt: Non-Constant Error Variance

library(car)

## Loading required package: carData

ncvTest(fit)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1289.161, Df = 1, p = < 2.22e-16

$p-value < α (.05)$ . Therefore, reject the constant error variance assumption at $5%$ significance level.

The model fails both the constant error variance assumption and the constant variance assumption.

Convert the original outcome variable Y by the natural log transformation (use “log()” command in R software). Comment on the distribution of the transformed variable log_Y, based on a histogram.

log_Y<-log10(df$Y)

#create histogram for original distribution
hist(df$Y, col='steelblue', main='Original')

#create histogram for log-transformed distribution 
hist(log_Y, col='coral2', main='Log Transformed')

The log transformed data appear to resemble a bell shaped curve more. More normally distributed now.

Estimate a multiple linear regression model for the transformed outcome log_Y, using P1, P2, and P3 as predictors.

fit_log<-lm(log_Y~ P1 + P2 + P3, data=data_1)
summary(fit_log)

## 
## Call:
## lm(formula = log_Y ~ P1 + P2 + P3, data = data_1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.200842 -0.059024  0.004205  0.065231  0.257750 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.102822   0.075298  -1.366    0.174    
## P1          -0.083224   0.003011 -27.641   <2e-16 ***
## P2           0.137679   0.007320  18.808   <2e-16 ***
## P3           0.045308   0.001355  33.428   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09051 on 196 degrees of freedom
## Multiple R-squared:  0.9165, Adjusted R-squared:  0.9152 
## F-statistic:   717 on 3 and 196 DF,  p-value: < 2.2e-16

Therefore the MLR Model for the transformed outcome is: \[ Y = -0.102822 - 0.083224*P1 +0.137679*P2 + 0.045308*P3 \]

Based on the model developed for part-g, evaluate the normality of errors and constant error variance assumptions. State your conclusions.

Evaluate Normality of Errors

Null Hypotheses: Normally distributed Error Alt: Error is not normally distributed

#Shapiro Test for Normality
shapiro.test(residuals(fit_log))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit_log)
## W = 0.9917, p-value = 0.3112

$p-value > α (0.05)$. Therefore fail to reject the null at the 5% significance level. Error of log transformed data are normally distributed.

Evaluate Homoscedasticity - non-constant error variance test

Null Hypotheses: Constant Error Variance Alt: Non-Constant Error Variance

library(car)
ncvTest(fit_log)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.03444392, Df = 1, p = 0.85277

$p > α (.05)$. Therefore, fail to reject null at $5%$ significance level. Insufficient evidence to conclude Non-constant error variance (error variance is constant).

summary(fit_log)$adj.r.squared

## [1] 0.9152081

Adjusted $r^2 = 0.9152$

It is a fairly good fit, but could be better. For calibration curves in the analytical chem laboratory, we use $r^2$ values of $0.98$. If calibration curve has $r^2 < .98$, we must remake the curve. A value of $1.0$ is a perfect fit.

0.2 Question 2

Dataset final_2.csv provides a summary of a randomized clinical trial conducted on comparing a new HIV antiretroviral therapy EFV-3TC-TDF (indicated by Group_1) with a currently marketed drug ABC-3TC-ZDV (indicated by Group_0). Investigators used CD4 counts (cell count/µL) 48 weeks after the treatment initiation as the primary outcome (indicated by CD4_X). The dataset also contains several other variables measured at the baseline.

• Baseline CD4 count (cell count/µL): CD4_0 • Log transformed baseline viral-load (copies/mL): log_vl • Baseline age (years): age • Gender (Male=1, Female=0): gender

Create a contingency table for treatment group and gender variable. Conduct the Chi-squared test to see if gender and treatment group are associated

attach(data_2)
# Preview Data
head(data_2)

##   ID treatment age gender log_vl CD4_0 CD4_X
## 1  1   Group_0  43      1   9.10   322   291
## 2  2   Group_0  31      1  10.35   311   292
## 3  3   Group_0  28      1  11.37   318   287
## 4  4   Group_0  45      0  10.16   480   488
## 5  5   Group_0  43      0   9.05   412   405
## 6  6   Group_0  34      1   7.69   308   293

Create contingency Table

table <- table(gender, treatment)
table

##       treatment
## gender Group_0 Group_1
##      0      77      61
##      1     281     281

Where Male when gender = 1, Female when gender=0

Add the margins. Why not?

table_w_margins<- addmargins(table)
table_w_margins

##       treatment
## gender Group_0 Group_1 Sum
##    0        77      61 138
##    1       281     281 562
##    Sum     358     342 700

Conduct the Chi-Squared test to see if gender and treatment are associated

Hypotheses • 𝐻0: there is no association between gender and treatment variables. • 𝐻1: there is an association between gender and treatment variables.

# Conduct chi Squared Test
chisq.test(gender,treatment)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  gender and treatment
## X-squared = 1.2672, df = 1, p-value = 0.2603

With p-value of 0.2603, we fail to reject the null hypothesis at the 5% significance level. There is insufficient evidence to conclude there is an association between gender and flavor. Therefore, there is no association between gender and flavor at the 5% significance level.

Assuming larger CD4 counts as better outcomes, investigators wanted to compare average values of the primary outcome (CD4_X) between Group_1 and Group_0. Perform a two-sample t-test to test if the new treatment had better performance compared to the current treatment (assume equal variances).

H0: µ1 = µ2 (the two population means are equal, ) HA: µ1 ≠ µ2 (the two population means are not equal)

# Subset the data to the two columns of interest
sub <-data_2[c("CD4_X","treatment")]
head(sub)

##   CD4_X treatment
## 1   291   Group_0
## 2   292   Group_0
## 3   287   Group_0
## 4   488   Group_0
## 5   405   Group_0
## 6   293   Group_0

Further subset the data, dividing Group_0 from Group_1

# Group_0
subset_ABC_G0 <- subset(sub, treatment=="Group_0", select=c('CD4_X'))
head(subset_ABC_G0)

##   CD4_X
## 1   291
## 2   292
## 3   287
## 4   488
## 5   405
## 6   293

# Group_1
subset_EFV_G1 <- subset(sub, treatment=="Group_1", select=c('CD4_X'))
head(subset_EFV_G1)

##     CD4_X
## 359   351
## 360   291
## 361   313
## 362   373
## 363   317
## 364   416

Perform the two-sample t-test

t.test(subset_ABC_G0, subset_EFV_G1, var.equal =TRUE)

## 
##  Two Sample t-test
## 
## data:  subset_ABC_G0 and subset_EFV_G1
## t = -2.6818, df = 698, p-value = 0.007497
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -17.40941  -2.69242
## sample estimates:
## mean of x mean of y 
##  338.6508  348.7018

With 5% significance level and p-value of 0.007497, we reject the null hypothesis and conclude that the true difference in means is not equal to zero. That is, there is a significant difference in treatment outcomes (CD_X, as measured by cell count/microliter) between Group_1 (New HIV antiretroviral therapy EFV-3TC-TDF) and Group_0 (currently marketed drug ABC-3TC-ZDV).

Create two subsets for two levels of the treatment (“Group_0”, “Group_1”) variable. Name two subsets as Data_0 and Data_1 (i.e., Data_1 should contain patients those who received the new treatment- EFV-3TC-TDF)

My previous subsets only included CD4_X values for Group_0 and Group_1. These subsets will include all information from the dataset, but subsetted based on treatment.

I already defined the data files for question 1 and question 2 as “data_1<-final_1.csv” and “data_2<-final_2.csv”. Therefore to avoid confusion, I will name these subsets slightly differently than requested.

Instead of “Data_0” this subset will be named Data_0_ABC Instead of “Data_1” this subset will be name Data_1_EFV

Data_0_ABC <- subset(data_2, treatment=="Group_0")
Data_1_EFV <- subset(data_2, treatment=="Group_1")

head(Data_0_ABC)

##   ID treatment age gender log_vl CD4_0 CD4_X
## 1  1   Group_0  43      1   9.10   322   291
## 2  2   Group_0  31      1  10.35   311   292
## 3  3   Group_0  28      1  11.37   318   287
## 4  4   Group_0  45      0  10.16   480   488
## 5  5   Group_0  43      0   9.05   412   405
## 6  6   Group_0  34      1   7.69   308   293

head(Data_1_EFV)

##      ID treatment age gender log_vl CD4_0 CD4_X
## 359 359   Group_1  35      1   9.98   344   351
## 360 360   Group_1  42      1  10.57   302   291
## 361 361   Group_1  27      1   8.79   325   313
## 362 362   Group_1  41      1   8.86   380   373
## 363 363   Group_1  28      1  10.11   326   317
## 364 364   Group_1  26      1   8.56   418   416

Suppose, investigators define a treatment response as having a 48-week CD4 value above 320 (cell count/µL). Using Data_1_EFV, determine binary outcomes (Y) that indicate the responses of patients in Group_1.

Because Y was already used as a variable for Question #1, I will be using a different variable than the request variable for this question.

Instead of Y, binary outcomes will be defined as Y_Q2_df

Y_Q2 = 1 when CD_X ≥ 320$ (cell count/µL) Y_Q2 = 0 when CD_X < 320$ (cell count/µL)

# Create new variable Y_Q2 according to rules defined above
Y_Q2_df<-ifelse(Data_1_EFV$CD4_X>=320, 1, 0)
Y_Q2_df

##   [1] 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1
##  [38] 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
##  [75] 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 1 0 1 1 1 0 1 1
## [112] 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 1 1
## [149] 0 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1
## [186] 0 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 0
## [223] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1
## [260] 1 0 1 1 0 1 1 0 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1
## [297] 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1
## [334] 0 1 1 1 0 1 1 0 1

Using “table( )” command obtain the frequency distribution for the binary variable Y.

Y_Q2_table<-table(Y_Q2_df)
addmargins(Y_Q2_table)

## Y_Q2_df
##   0   1 Sum 
##  92 250 342

Of the 342 participants in Treatment Group_1, 92 had CD_X values >= 320 (defined as 0) while 250 had CD_X values < 320 (defined as 1).

Estimate a logistic regression model for the binary outcome Y, using baseline CD4 count (CD4_0), log viral-load (log_vl), age, and gender, as predictors.

#data frame with Y_Q2 column based off CD4_X value
Y_Q2_df<-transform(
  Data_1_EFV, Y_Q2= ifelse(Data_1_EFV$CD4_X>=320, 1, 0))
head(Y_Q2_df)

##      ID treatment age gender log_vl CD4_0 CD4_X Y_Q2
## 359 359   Group_1  35      1   9.98   344   351    1
## 360 360   Group_1  42      1  10.57   302   291    0
## 361 361   Group_1  27      1   8.79   325   313    0
## 362 362   Group_1  41      1   8.86   380   373    1
## 363 363   Group_1  28      1  10.11   326   317    0
## 364 364   Group_1  26      1   8.56   418   416    1

L_reg <- glm(Y_Q2 ~ CD4_0 + log_vl + age + gender,
             family= "binomial", data=Y_Q2_df)
summary(L_reg)

## 
## Call:
## glm(formula = Y_Q2 ~ CD4_0 + log_vl + age + gender, family = "binomial", 
##     data = Y_Q2_df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.59862  -0.00698   0.01291   0.12287   2.82925  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -47.731520   8.019765  -5.952 2.65e-09 ***
## CD4_0         0.152220   0.023082   6.595 4.26e-11 ***
## log_vl        0.002586   0.270017   0.010  0.99236    
## age           0.021602   0.042732   0.506  0.61320    
## gender       -2.266893   0.711356  -3.187  0.00144 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 398.27  on 341  degrees of freedom
## Residual deviance: 110.55  on 337  degrees of freedom
## AIC: 120.55
## 
## Number of Fisher Scoring iterations: 8

Therefore, the multiple logistic regression model can be expressed as: \[ Y_Q2 = -47.732 + 0.1522*x_1 - 0.0026*x_2 + 0.0216*x_3 -2.267*x_4 \] where $x_1$ is CD4_0 $x_2$ is log_vl $x_3$ is age $x_4$ is gender

Interpret the coefficient estimated for the gender variable. Also indicate, if there is a statistically significant effect by the gender variables on the response Y?

IDK

With a p value of 0.0144, there is a statistically significant effect by the gender variable on response Y at the 5% level

Assume we have a new patient with the following details.

CD4_0: 315 Age: 32 Gender: Female log_vl: 9.80

How likely the new patient will observe a response from the new treatment?

Prediction using the model and given data outlined above

\[ Y_Q2 = -47.732 + 0.1522*315 - 0.0026*32 + 0.0216*0 -2.267*9.80 \]

x_1<- 315 # CD4_0
x_2<- 32 # Age
x_3<- 0 # Gender = Female = 0
x_4<- 9.80 # log_vl

Prediction<- -47.732 + 0.1522*x_1 - 0.0026*x_2 + 0.0216*x_3 -2.267*x_4
Prediction

## [1] -22.0888

Y_Q2 = -22.0888 given these values, using our model

# Calculate the probability
prob<-(exp(Prediction))/(1+exp(Prediction))
prob

## [1] 2.552443e-10

Prediction is essentially 0. Therefore, given the model and values, 0 is predicted. It is predicted in this instance that there will no treatment response (defined as CD_X < 320).

# Attempt to use the predict function. But I cannot get it to work for the life of me....
# Did it the longer, algebraic,way up above.



#m_r <- glm(Y_Q2_df$Y_Q2 ~ x_1 + x_2 + x_3 + x_4, family= "binomial", data=Y_Q2_df)


#predict(L_reg, newdata=data.frame(x_1,x_2,x_3,x_4), type="response")

Take Home Final Notebook

Steven Ferguson

2022-12-09

0.1 Question 1

0.2 Question 2