Question

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Answer

library('ggplot2')

About the Data Set

The features provided in the data set are outlined below.

Age: Age of the patient
Sex: Gender of the patient
BP: Blood pressure of the patient
Cholesterol: Cholesterol level of the patient
Drug: Drug each patient responded to
Na_to_K: Sodium to potassium levels

Please note that all patients in the data set have the same illness.

df <- read.csv('drug200.csv')
head(df)
##   Age Sex     BP Cholesterol Na_to_K  Drug
## 1  23   F   HIGH        HIGH  25.355 drugY
## 2  47   M    LOW        HIGH  13.093 drugC
## 3  47   M    LOW        HIGH  10.114 drugC
## 4  28   F NORMAL        HIGH   7.798 drugX
## 5  61   F    LOW        HIGH  18.043 drugY
## 6  22   F NORMAL        HIGH   8.607 drugX
summary(df)
##       Age        Sex          BP     Cholesterol     Na_to_K      
##  Min.   :15.00   F: 96   HIGH  :77   HIGH  :103   Min.   : 6.269  
##  1st Qu.:31.00   M:104   LOW   :64   NORMAL: 97   1st Qu.:10.445  
##  Median :45.00           NORMAL:59                Median :13.937  
##  Mean   :44.31                                    Mean   :16.084  
##  3rd Qu.:58.00                                    3rd Qu.:19.380  
##  Max.   :74.00                                    Max.   :38.247  
##     Drug   
##  drugA:23  
##  drugB:16  
##  drugC:16  
##  drugX:54  
##  drugY:91  
## 
# Check for a linear relationship between Age and Na_to_K
g <- ggplot(df, aes(Age, Na_to_K))

# Scatterplot with a fitted least-squares line (no confidence band)
g + geom_point() +
  geom_smooth(method="lm", se=FALSE)

# Sex is already a two-level factor ("F", "M"), so no recoding is needed;
# lm() dummy-codes it with "F" as the reference level.
df$Sex <- factor(df$Sex, levels = c("F", "M"))
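
Sex is stored as a factor with the levels F and M, so lm() turns it into a 0/1 indicator on its own. The sketch below shows that coding on a small made-up vector (the values are illustrative assumptions, not rows from drug200.csv); it is meant only to show where a coefficient named SexM comes from.

```r
# Toy factor standing in for df$Sex (illustrative values, not from drug200.csv)
sex <- factor(c("F", "M", "M", "F"), levels = c("F", "M"))

# Treatment contrasts: the first level (F) is the reference, M gets a 0/1 column
contrasts(sex)

# Design matrix that lm() would build: an intercept plus an indicator for M
X <- model.matrix(~ sex)
X
```

With this coding, the coefficient on the indicator is the estimated difference in the response between M and F at fixed values of the other predictors.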
head(df)
##   Age Sex     BP Cholesterol Na_to_K  Drug
## 1  23   F   HIGH        HIGH  25.355 drugY
## 2  47   M    LOW        HIGH  13.093 drugC
## 3  47   M    LOW        HIGH  10.114 drugC
## 4  28   F NORMAL        HIGH   7.798 drugX
## 5  61   F    LOW        HIGH  18.043 drugY
## 6  22   F NORMAL        HIGH   8.607 drugX
df <- na.omit(df)
head(df)
##   Age Sex     BP Cholesterol Na_to_K  Drug
## 1  23   F   HIGH        HIGH  25.355 drugY
## 2  47   M    LOW        HIGH  13.093 drugC
## 3  47   M    LOW        HIGH  10.114 drugC
## 4  28   F NORMAL        HIGH   7.798 drugX
## 5  61   F    LOW        HIGH  18.043 drugY
## 6  22   F NORMAL        HIGH   8.607 drugX
summary(df)
##       Age        Sex          BP     Cholesterol     Na_to_K      
##  Min.   :15.00   F: 96   HIGH  :77   HIGH  :103   Min.   : 6.269  
##  1st Qu.:31.00   M:104   LOW   :64   NORMAL: 97   1st Qu.:10.445  
##  Median :45.00           NORMAL:59                Median :13.937  
##  Mean   :44.31                                    Mean   :16.084  
##  3rd Qu.:58.00                                    3rd Qu.:19.380  
##  Max.   :74.00                                    Max.   :38.247  
##     Drug   
##  drugA:23  
##  drugB:16  
##  drugC:16  
##  drugX:54  
##  drugY:91  
## 
str(df)
## 'data.frame':    200 obs. of  6 variables:
##  $ Age        : int  23 47 47 28 61 22 49 41 60 43 ...
##  $ Sex        : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 2 2 2 ...
##  $ BP         : Factor w/ 3 levels "HIGH","LOW","NORMAL": 1 2 2 3 2 3 3 2 3 2 ...
##  $ Cholesterol: Factor w/ 2 levels "HIGH","NORMAL": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Na_to_K    : num  25.4 13.1 10.1 7.8 18 ...
##  $ Drug       : Factor w/ 5 levels "drugA","drugB",..: 5 3 3 4 5 4 5 3 5 5 ...
linear_model <- lm(Age ~ Na_to_K, data=df)
summary(linear_model)
## 
## Call:
## lm(formula = Age ~ Na_to_K, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.3270 -13.4270  -0.3004  14.1205  30.3872 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.6401     2.8629   16.29   <2e-16 ***
## Na_to_K      -0.1446     0.1624   -0.89    0.375    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.55 on 198 degrees of freedom
## Multiple R-squared:  0.003984,   Adjusted R-squared:  -0.001046 
## F-statistic: 0.792 on 1 and 198 DF,  p-value: 0.3746
qqnorm(linear_model$residuals)
qqline(linear_model$residuals)

The p-value for Na_to_K (0.375) is well above 0.05, and the R-squared is very low (0.003984), so the model explains less than 1% of the variation in age. The Q-Q plot shows the residuals deviating from the normal line at both the lower and upper ends, so they are not normally distributed. Based on this, there is most likely no linear relationship between age and sodium to potassium levels. The linear model we picked is not appropriate.
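
The Q-Q plot is only one residual check. The sketch below adds two more, a residuals-versus-fitted plot and a Shapiro-Wilk test. It runs on simulated data that merely mimics the column names and ranges; the values are randomly generated assumptions, not the patients' data, and `sim`, `fit`, and `sw` are hypothetical names.

```r
set.seed(1)

# Simulated stand-in for the data set (assumed values, not drug200.csv):
# ages spread over 15-74 and an unrelated Na_to_K over 6-38
sim <- data.frame(Age = round(runif(200, 15, 74)),
                  Na_to_K = runif(200, 6, 38))
fit <- lm(Age ~ Na_to_K, data = sim)

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(resid(fit))
sw$p.value
```

On the real data the same calls could be applied to linear_model in place of fit.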

linear_model_2 <- lm(Age ~ Sex + Na_to_K, data=df)
summary(linear_model_2)
## 
## Call:
## lm(formula = Age ~ Sex + Na_to_K, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.652 -13.559  -0.737  14.073  31.897 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.5571     3.2524  13.700   <2e-16 ***
## SexM          3.1589     2.3566   1.340    0.182    
## Na_to_K      -0.1172     0.1634  -0.717    0.474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.52 on 197 degrees of freedom
## Multiple R-squared:  0.01299,    Adjusted R-squared:  0.002966 
## F-statistic: 1.296 on 2 and 197 DF,  p-value: 0.2759
qqnorm(linear_model_2$residuals)
qqline(linear_model_2$residuals)
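
One way to judge whether adding Sex improves on the single-predictor model is a nested F-test with anova(), alongside the adjusted R-squared of each fit. The sketch below uses simulated data (randomly generated, assumed values) and hypothetical names `m1`, `m2`, and `cmp`; with the real data the two fitted objects would be linear_model and linear_model_2.

```r
set.seed(2)

# Simulated stand-in for the data set (assumed values, not drug200.csv)
sim <- data.frame(Age = round(runif(200, 15, 74)),
                  Sex = factor(sample(c("F", "M"), 200, replace = TRUE)),
                  Na_to_K = runif(200, 6, 38))

m1 <- lm(Age ~ Na_to_K, data = sim)        # reduced model
m2 <- lm(Age ~ Sex + Na_to_K, data = sim)  # adds the dichotomous term

# F-test of the reduced model against the larger one
cmp <- anova(m1, m2)
cmp

# Adjusted R-squared penalizes the extra term
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
```

A large p-value in the second row of the anova table would indicate that the extra term does not reduce the residual sum of squares by more than chance alone would.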

The overall p-value (0.2759) is not low, and based on the R-squared the model explains only about 1.3% of the variation in age. It fits marginally better than the earlier model (adjusted R-squared of 0.002966 versus -0.001046), but it is still not an appropriate linear model for prediction. The second model is also not useful as