data_1<-read.csv("final_1.csv")
data_2<-read.csv("final_2.csv")
head(data_1)
## Y P1 P2 P3
## 1 0.019 24.554 1.296 4.629
## 2 0.010 25.855 1.474 0.849
## 3 0.008 26.119 0.071 2.685
## 4 0.021 23.372 0.760 6.048
## 5 0.028 25.437 0.874 12.132
## 6 0.009 26.844 1.420 1.391
Data file “final_1.csv”, loaded as data_1, provides preclinical study data examining the relationship between drug concentration in liver (Y) and the concentrations of three liver proteins (P1, P2, P3).
Create histograms of the four variables (Y, P1, P2, P3).
attach(data_1)
hist_Y<-hist(Y, col="red", xlab = "Drug Concentration (var. Y)",
main="Frequency Distribution of Var. 'Y' (Drug Concentration in Liver)")
hist_P1<-hist(P1, col="red", xlab = "P1 Liver Protein Level",
main="Frequency Distribution of Liver Protein P1")
hist_P2<-hist(P2, col="red", xlab = "P2 Liver Protein Level",
main="Frequency Distribution of Liver Protein P2")
hist_P3<-hist(P3, col="red", xlab = "P3 Liver Protein Level",
main="Frequency Distribution of Liver Protein P3")
P1 appears to be approximately normally distributed. Y, P2, and P3 do not appear to be normally distributed.
# Dispersion Calculations
sd_P1 <- sd(P1)
sd_P1
## [1] 2.139313
sd_P2 <- sd(P2)
sd_P2
## [1] 0.8802262
sd_P3 <- sd(P3)
sd_P3
## [1] 4.747751
sd_Y <- sd(Y)
sd_Y
## [1] 0.03168693
variance_Y <- (sd_Y)^2
variance_Y
## [1] 0.001004061
variance_P1 <- (sd_P1)^2
variance_P1
## [1] 4.576659
variance_P2 <- (sd_P2)^2
variance_P2
## [1] 0.7747982
variance_P3 <- (sd_P3)^2
variance_P3
## [1] 22.54114
# Range: min/max values pulled from the summary() output
range_Y <- 0.395-0.003
range_P1 <- 30.30 - 18.83
range_P2 <- 5.037 - 0.004
range_P3 <- 28.181 - 0
range_Y
## [1] 0.392
range_P1
## [1] 11.47
range_P2
## [1] 5.033
range_P3
## [1] 28.181
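As a quick check, the same ranges can be computed directly rather than retyped from the summary() output; a minimal sketch using base R's range() and diff():
# Compute each range programmatically (max - min)
range_Y <- diff(range(Y))
range_P1 <- diff(range(P1))
range_P2 <- diff(range(P2))
range_P3 <- diff(range(P3))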
Central Tendency Calculations: Mean and Median
# Central Tendency Calculations
summary(data_1)
## Y P1 P2 P3
## Min. :0.00300 Min. :18.83 Min. :0.0040 Min. : 0.000
## 1st Qu.:0.01000 1st Qu.:23.46 1st Qu.:0.3560 1st Qu.: 1.929
## Median :0.01500 Median :25.07 Median :0.7515 Median : 3.950
## Mean :0.02141 Mean :24.99 Mean :0.9772 Mean : 5.229
## 3rd Qu.:0.02300 3rd Qu.:26.39 3rd Qu.:1.3650 3rd Qu.: 7.149
## Max. :0.39500 Max. :30.30 Max. :5.0370 Max. :28.181
Use a matrix plot to visualize associations among the variables (Y, P1, P2, P3).
#Define Data Frame
df <- data.frame(Y, P1, P2, P3)
# Matrix Plot
plot(df, pch=18, col ="steelblue") # Basic matrix plot
# Install and load GGally for a better visual
# install.packages("GGally")
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(df)
Estimate a multiple linear regression model for the outcome Y, using P1, P2, and P3 as predictors.
# Fitting the Model
fit<-lm(Y~ P1 + P2 + P3, data=data_1)
summary(fit)
##
## Call:
## lm(formula = Y ~ P1 + P2 + P3, data = data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.042986 -0.008581 -0.001209 0.005788 0.272024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0903518 0.0188092 4.804 3.09e-06 ***
## P1 -0.0039507 0.0007521 -5.253 3.88e-07 ***
## P2 0.0083762 0.0018286 4.581 8.23e-06 ***
## P3 0.0041287 0.0003386 12.195 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02261 on 196 degrees of freedom
## Multiple R-squared: 0.4985, Adjusted R-squared: 0.4909
## F-statistic: 64.95 on 3 and 196 DF, p-value: < 2.2e-16
Therefore, the multiple linear regression model can be expressed as: \[ Y = 0.09035 - 0.003951*P1 + 0.008376*P2 + 0.004129*P3 \]
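To avoid transcription errors when writing out the fitted equation, the estimates can be pulled straight from the model object; a minimal sketch:
# Extract the estimated coefficients directly from the fitted model
round(coef(fit), 6)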
Based on the regression model developed for part-d, evaluate the normality of residuals and constant error variance assumptions. State your conclusions.
qqnorm(residuals(fit))
qqline(residuals(fit))
The Q-Q plot gives a graphical assessment of normality. Based on the plot alone, the residuals appear roughly normal.
A more definitive, quantitative method of testing for normality is the Shapiro-Wilk hypothesis test.
#Shapiro Test for Normality
shapiro.test(residuals(fit))
##
## Shapiro-Wilk normality test
##
## data: residuals(fit)
## W = 0.44738, p-value < 2.2e-16
Null hypothesis: errors are normally distributed. Alternative hypothesis: errors are not normally distributed.
With a significance level of \(\alpha = 0.01\) and \(p < 0.01\), we reject the null hypothesis and conclude that the errors are not normally distributed. Upon closer inspection of the Q-Q plot, the points form an S-shaped (sigmoidal) pattern about the reference line, which is consistent with non-normal errors.
# Evaluate Constant Variance Assumption
plot(fitted(fit), residuals(fit), xlab="Predicted Values",
ylab="Residual",
main="Checking for Constant Variance",
pch=19, col="darkblue")
abline(h=0)
Based on the plot, the residuals do not appear to have constant variance.
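For reference, R's built-in lm diagnostics draw the residuals-vs-fitted and Q-Q plots (plus scale-location and residuals-vs-leverage plots) in a single call; a minimal sketch:
# Standard four-panel regression diagnostics
par(mfrow=c(2,2))
plot(fit)
par(mfrow=c(1,1)) # reset the plotting layout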
Evaluate homoscedasticity with the non-constant error variance test.
Null hypothesis: constant error variance. Alternative hypothesis: non-constant error variance.
library(car)
## Loading required package: carData
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1289.161, Df = 1, p = < 2.22e-16
\(p < \alpha = 0.05\). Therefore, we reject the constant error variance assumption at the \(5\%\) significance level.
The model fails both the normality of errors assumption and the constant error variance assumption.
Convert the original outcome variable Y by the natural log transformation (use the “log()” command in R). Comment on the distribution of the transformed variable log_Y, based on a histogram.
log_Y<-log(df$Y) # natural log transformation
#create histogram for original distribution
hist(df$Y, col='steelblue', main='Original')
#create histogram for log-transformed distribution
hist(log_Y, col='coral2', main='Log Transformed')
The log-transformed data more closely resemble a bell-shaped curve; the distribution now appears approximately normal.
Estimate a multiple linear regression model for the transformed outcome log_Y, using P1, P2, and P3 as predictors.
fit_log<-lm(log_Y~ P1 + P2 + P3, data=data_1)
summary(fit_log)
##
## Call:
## lm(formula = log_Y ~ P1 + P2 + P3, data = data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.200842 -0.059024 0.004205 0.065231 0.257750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.102822 0.075298 -1.366 0.174
## P1 -0.083224 0.003011 -27.641 <2e-16 ***
## P2 0.137679 0.007320 18.808 <2e-16 ***
## P3 0.045308 0.001355 33.428 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09051 on 196 degrees of freedom
## Multiple R-squared: 0.9165, Adjusted R-squared: 0.9152
## F-statistic: 717 on 3 and 196 DF, p-value: < 2.2e-16
Therefore, the MLR model for the transformed outcome is: \[ \log(Y) = -0.102822 - 0.083224*P1 + 0.137679*P2 + 0.045308*P3 \]
Based on the model developed for part-g, evaluate the normality of errors and constant error variance assumptions. State your conclusions.
Evaluate normality of errors.
Null hypothesis: errors are normally distributed. Alternative hypothesis: errors are not normally distributed.
#Shapiro Test for Normality
shapiro.test(residuals(fit_log))
##
## Shapiro-Wilk normality test
##
## data: residuals(fit_log)
## W = 0.9917, p-value = 0.3112
\(p > \alpha = 0.05\). Therefore, we fail to reject the null hypothesis at the 5% significance level; the errors of the log-transformed model are normally distributed.
Evaluate homoscedasticity with the non-constant error variance test.
Null hypothesis: constant error variance. Alternative hypothesis: non-constant error variance.
library(car)
ncvTest(fit_log)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.03444392, Df = 1, p = 0.85277
\(p > \alpha = 0.05\). Therefore, we fail to reject the null hypothesis at the \(5\%\) significance level; there is insufficient evidence of non-constant error variance (the error variance is constant).
summary(fit_log)$adj.r.squared
## [1] 0.9152081
Adjusted \(r^2 = 0.9152\)
It is a fairly good fit, but it could be better. For calibration curves in the analytical chemistry laboratory, we require \(r^2 \geq 0.98\); if a calibration curve has \(r^2 < 0.98\), we must remake it. A value of \(1.0\) is a perfect fit.
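To quantify how much the log transformation improved the fit, the adjusted \(r^2\) values of the two models can be compared directly; a minimal sketch:
# Compare adjusted R-squared before and after the log transformation
summary(fit)$adj.r.squared # original model: ~0.49
summary(fit_log)$adj.r.squared # log-transformed model: ~0.92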
Dataset final_2.csv provides a summary of a randomized clinical trial comparing a new HIV antiretroviral therapy, EFV-3TC-TDF (indicated by Group_1), with a currently marketed drug, ABC-3TC-ZDV (indicated by Group_0). Investigators used CD4 counts (cell count/µL) 48 weeks after treatment initiation as the primary outcome (indicated by CD4_X). The dataset also contains several other variables measured at baseline:
• Baseline CD4 count (cell count/µL): CD4_0
• Log-transformed baseline viral load (copies/mL): log_vl
• Baseline age (years): age
• Gender (Male = 1, Female = 0): gender
Create a contingency table for the treatment group and gender variables. Conduct a Chi-squared test to see if gender and treatment group are associated.
attach(data_2)
# Preview Data
head(data_2)
## ID treatment age gender log_vl CD4_0 CD4_X
## 1 1 Group_0 43 1 9.10 322 291
## 2 2 Group_0 31 1 10.35 311 292
## 3 3 Group_0 28 1 11.37 318 287
## 4 4 Group_0 45 0 10.16 480 488
## 5 5 Group_0 43 0 9.05 412 405
## 6 6 Group_0 34 1 7.69 308 293
Create the contingency table.
table <- table(gender, treatment)
table
## treatment
## gender Group_0 Group_1
## 0 77 61
## 1 281 281
where gender = 1 indicates Male and gender = 0 indicates Female.
Add the margins to show the row and column totals.
table_w_margins<- addmargins(table)
table_w_margins
## treatment
## gender Group_0 Group_1 Sum
## 0 77 61 138
## 1 281 281 562
## Sum 358 342 700
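The same table can be expressed as column proportions, showing the gender split within each treatment group; a minimal sketch:
# Proportion of each gender within each treatment group (columns sum to 1)
prop.table(table, margin=2)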
Conduct the Chi-squared test to see if gender and treatment are associated.
Hypotheses:
• H0: there is no association between the gender and treatment variables.
• H1: there is an association between the gender and treatment variables.
# Conduct chi Squared Test
chisq.test(gender,treatment)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: gender and treatment
## X-squared = 1.2672, df = 1, p-value = 0.2603
With a p-value of 0.2603, we fail to reject the null hypothesis at the 5% significance level. There is insufficient evidence to conclude that gender and treatment group are associated.
Assuming larger CD4 counts are better outcomes, investigators wanted to compare average values of the primary outcome (CD4_X) between Group_1 and Group_0. Perform a two-sample t-test to test if the new treatment had better performance compared to the current treatment (assume equal variances).
H0: µ1 = µ2 (the two population means are equal). HA: µ1 ≠ µ2 (the two population means are not equal).
# Subset the data to the two columns of interest
sub <-data_2[c("CD4_X","treatment")]
head(sub)
## CD4_X treatment
## 1 291 Group_0
## 2 292 Group_0
## 3 287 Group_0
## 4 488 Group_0
## 5 405 Group_0
## 6 293 Group_0
Further subset the data, dividing Group_0 from Group_1
# Group_0
subset_ABC_G0 <- subset(sub, treatment=="Group_0", select=c('CD4_X'))
head(subset_ABC_G0)
## CD4_X
## 1 291
## 2 292
## 3 287
## 4 488
## 5 405
## 6 293
# Group_1
subset_EFV_G1 <- subset(sub, treatment=="Group_1", select=c('CD4_X'))
head(subset_EFV_G1)
## CD4_X
## 359 351
## 360 291
## 361 313
## 362 373
## 363 317
## 364 416
Perform the two-sample t-test
t.test(subset_ABC_G0, subset_EFV_G1, var.equal =TRUE)
##
## Two Sample t-test
##
## data: subset_ABC_G0 and subset_EFV_G1
## t = -2.6818, df = 698, p-value = 0.007497
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -17.40941 -2.69242
## sample estimates:
## mean of x mean of y
## 338.6508 348.7018
With a 5% significance level and a p-value of 0.007497, we reject the null hypothesis and conclude that the true difference in means is not equal to zero. That is, there is a significant difference in the primary outcome (CD4_X, measured in cell count/µL) between Group_1 (the new HIV antiretroviral therapy EFV-3TC-TDF) and Group_0 (the currently marketed drug ABC-3TC-ZDV). Since the Group_1 sample mean (348.70) exceeds the Group_0 sample mean (338.65), the new treatment performed better.
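Equivalently, the t-test can be run with the formula interface, which avoids the manual subsetting used above; a minimal sketch:
# Same equal-variance two-sample t-test via the formula interface
t.test(CD4_X ~ treatment, data=data_2, var.equal=TRUE)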
Create two subsets for the two levels of the treatment variable (“Group_0”, “Group_1”). Name the two subsets Data_0 and Data_1 (i.e., Data_1 should contain the patients who received the new treatment, EFV-3TC-TDF).
My previous subsets only included the CD4_X values for Group_0 and Group_1. These subsets will include all columns from the dataset, split by treatment.
I already loaded the data files for Questions 1 and 2 as data_1 and data_2. Therefore, to avoid confusion, I will name these subsets slightly differently than requested: instead of “Data_0” this subset will be named Data_0_ABC, and instead of “Data_1” it will be named Data_1_EFV.
Data_0_ABC <- subset(data_2, treatment=="Group_0")
Data_1_EFV <- subset(data_2, treatment=="Group_1")
head(Data_0_ABC)
## ID treatment age gender log_vl CD4_0 CD4_X
## 1 1 Group_0 43 1 9.10 322 291
## 2 2 Group_0 31 1 10.35 311 292
## 3 3 Group_0 28 1 11.37 318 287
## 4 4 Group_0 45 0 10.16 480 488
## 5 5 Group_0 43 0 9.05 412 405
## 6 6 Group_0 34 1 7.69 308 293
head(Data_1_EFV)
## ID treatment age gender log_vl CD4_0 CD4_X
## 359 359 Group_1 35 1 9.98 344 351
## 360 360 Group_1 42 1 10.57 302 291
## 361 361 Group_1 27 1 8.79 325 313
## 362 362 Group_1 41 1 8.86 380 373
## 363 363 Group_1 28 1 10.11 326 317
## 364 364 Group_1 26 1 8.56 418 416
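A quick row-count check confirms the subsets match the group sizes from the contingency-table margins (358 and 342); a minimal sketch:
# Verify subset sizes against the known group totals
nrow(Data_0_ABC) # expected: 358
nrow(Data_1_EFV) # expected: 342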
Suppose investigators define a treatment response as having a 48-week CD4 value above 320 (cell count/µL). Using Data_1_EFV, determine binary outcomes (Y) that indicate the responses of patients in Group_1.
Because Y was already used as a variable for Question #1, I will use a different name than the requested variable for this question. Instead of Y, the binary outcomes will be stored in Y_Q2_df:
Y_Q2 = 1 when CD4_X ≥ 320 (cell count/µL); Y_Q2 = 0 when CD4_X < 320 (cell count/µL)
# Create new variable Y_Q2 according to rules defined above
Y_Q2_df<-ifelse(Data_1_EFV$CD4_X>=320, 1, 0)
Y_Q2_df
## [1] 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1
## [38] 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
## [75] 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 1 0 1 1 1 0 1 1
## [112] 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 1 1
## [149] 0 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1
## [186] 0 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 0
## [223] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1
## [260] 1 0 1 1 0 1 1 0 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1
## [297] 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1
## [334] 0 1 1 1 0 1 1 0 1
Using the “table()” command, obtain the frequency distribution for the binary variable Y.
Y_Q2_table<-table(Y_Q2_df)
addmargins(Y_Q2_table)
## Y_Q2_df
## 0 1 Sum
## 92 250 342
Of the 342 participants in treatment Group_1, 250 had CD4_X values ≥ 320 (coded as 1), while 92 had CD4_X values < 320 (coded as 0).
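The same table can be expressed as proportions to give the observed response rate directly; a minimal sketch:
# Response rate in Group_1: roughly 0.73 (250/342); non-response roughly 0.27 (92/342)
prop.table(Y_Q2_table)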
Estimate a logistic regression model for the binary outcome Y, using baseline CD4 count (CD4_0), log viral-load (log_vl), age, and gender, as predictors.
# Data frame with a Y_Q2 column based on the CD4_X value
Y_Q2_df<-transform(
Data_1_EFV, Y_Q2= ifelse(CD4_X>=320, 1, 0))
head(Y_Q2_df)
## ID treatment age gender log_vl CD4_0 CD4_X Y_Q2
## 359 359 Group_1 35 1 9.98 344 351 1
## 360 360 Group_1 42 1 10.57 302 291 0
## 361 361 Group_1 27 1 8.79 325 313 0
## 362 362 Group_1 41 1 8.86 380 373 1
## 363 363 Group_1 28 1 10.11 326 317 0
## 364 364 Group_1 26 1 8.56 418 416 1
L_reg <- glm(Y_Q2 ~ CD4_0 + log_vl + age + gender,
family= "binomial", data=Y_Q2_df)
summary(L_reg)
##
## Call:
## glm(formula = Y_Q2 ~ CD4_0 + log_vl + age + gender, family = "binomial",
## data = Y_Q2_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.59862 -0.00698 0.01291 0.12287 2.82925
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -47.731520 8.019765 -5.952 2.65e-09 ***
## CD4_0 0.152220 0.023082 6.595 4.26e-11 ***
## log_vl 0.002586 0.270017 0.010 0.99236
## age 0.021602 0.042732 0.506 0.61320
## gender -2.266893 0.711356 -3.187 0.00144 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 398.27 on 341 degrees of freedom
## Residual deviance: 110.55 on 337 degrees of freedom
## AIC: 120.55
##
## Number of Fisher Scoring iterations: 8
Therefore, the fitted logistic regression model can be expressed as: \[ \text{logit}(P(Y_{Q2}=1)) = -47.732 + 0.1522*x_1 + 0.0026*x_2 + 0.0216*x_3 - 2.267*x_4 \] where \(x_1\) is CD4_0, \(x_2\) is log_vl, \(x_3\) is age, and \(x_4\) is gender.
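Because the model is on the log-odds scale, exponentiating the coefficients yields odds ratios, which are easier to interpret; a minimal sketch:
# Odds ratios for each predictor (exponentiated coefficients)
round(exp(coef(L_reg)), 4) # e.g., the gender odds ratio is exp(-2.2669), about 0.10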
Interpret the coefficient estimated for the gender variable. Also indicate whether there is a statistically significant effect of the gender variable on the response Y.
Holding the other predictors constant, the gender coefficient of -2.2669 means that males (gender = 1) have log-odds of response 2.2669 lower than females; equivalently, the odds of a response for males are exp(-2.2669) ≈ 0.10 times the odds for females.
With a p-value of 0.00144, gender has a statistically significant effect on the response Y at the 5% level.
Assume we have a new patient with the following details.
CD4_0: 315 Age: 32 Gender: Female log_vl: 9.80
How likely is the new patient to respond to the new treatment?
Prediction using the model and the patient values given above. Substituting each value into its correct slot (x_1 = CD4_0 = 315, x_2 = log_vl = 9.80, x_3 = age = 32, x_4 = gender = 0):
\[ \text{logit}(P(Y_{Q2}=1)) = -47.732 + 0.1522*315 + 0.0026*9.80 + 0.0216*32 - 2.267*0 \]
x_1<- 315 # CD4_0
x_2<- 9.80 # log_vl
x_3<- 32 # Age
x_4<- 0 # Gender = Female = 0
Prediction<- -47.732 + 0.1522*x_1 + 0.0026*x_2 + 0.0216*x_3 -2.267*x_4
Prediction
## [1] 0.92768
The predicted log-odds are 0.92768 given these values, using our model.
# Calculate the probability
prob<-(exp(Prediction))/(1+exp(Prediction))
prob
## [1] 0.7166044
The predicted probability of a response is approximately 0.72. Therefore, given the model and these values, it is predicted that the new patient will observe a treatment response (defined as CD4_X ≥ 320).
# Alternative: the predict() function. My earlier attempt failed because newdata
# must use the model's original predictor names (CD4_0, log_vl, age, gender),
# not the renamed x_1, ..., x_4:
# predict(L_reg, newdata=data.frame(CD4_0=315, log_vl=9.80, age=32, gender=0),
#         type="response")