Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
library('ggplot2')
The features provided in the data set are outlined below.
Age: Age of the patient
Sex: Gender of the patient
BP: Blood pressure of the patient
Cholesterol: Cholesterol level of the patient
Drug: Drug each patient responded to
Na_to_K: Sodium-to-potassium ratio
Please note that all patients in the data set have the same illness.
df <- read.csv('drug200.csv')
head(df)
## Age Sex BP Cholesterol Na_to_K Drug
## 1 23 F HIGH HIGH 25.355 drugY
## 2 47 M LOW HIGH 13.093 drugC
## 3 47 M LOW HIGH 10.114 drugC
## 4 28 F NORMAL HIGH 7.798 drugX
## 5 61 F LOW HIGH 18.043 drugY
## 6 22 F NORMAL HIGH 8.607 drugX
summary(df)
## Age Sex BP Cholesterol Na_to_K
## Min. :15.00 F: 96 HIGH :77 HIGH :103 Min. : 6.269
## 1st Qu.:31.00 M:104 LOW :64 NORMAL: 97 1st Qu.:10.445
## Median :45.00 NORMAL:59 Median :13.937
## Mean :44.31 Mean :16.084
## 3rd Qu.:58.00 3rd Qu.:19.380
## Max. :74.00 Max. :38.247
## Drug
## drugA:23
## drugB:16
## drugC:16
## drugX:54
## drugY:91
##
# Check for a linear relationship between Age and Na_to_K
g <- ggplot(df, aes(Age, Na_to_K))
# Scatterplot with a fitted least-squares line (no confidence band)
g + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
# Sex is stored with the levels "F" and "M" (not "Male"/"Female"), so no manual
# recoding to 1/2 is needed: lm() creates the 0/1 dummy variable itself.
# Make the reference level explicit so the coefficient SexM compares M to F.
df$Sex <- factor(df$Sex, levels = c("F", "M"))
head(df)
## Age Sex BP Cholesterol Na_to_K Drug
## 1 23 F HIGH HIGH 25.355 drugY
## 2 47 M LOW HIGH 13.093 drugC
## 3 47 M LOW HIGH 10.114 drugC
## 4 28 F NORMAL HIGH 7.798 drugX
## 5 61 F LOW HIGH 18.043 drugY
## 6 22 F NORMAL HIGH 8.607 drugX
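The recoding step above aims to turn Sex into a numeric indicator. For a factor column, lm() builds that indicator on its own, and the encoding can be inspected with model.matrix(). The following is a minimal sketch on a toy four-row data frame (the name `toy` and its values are illustrative assumptions, not rows taken from drug200.csv):

```r
# Toy factor with the same two levels as df$Sex (illustrative values only)
toy <- data.frame(Sex = factor(c("F", "M", "M", "F"), levels = c("F", "M")))

# model.matrix() shows the design matrix lm() would build from a factor:
# the first level ("F") is the reference, and "SexM" is 1 for M and 0 for F
dummy <- model.matrix(~ Sex, data = toy)
print(dummy)
```

With this coding, a coefficient labelled SexM in a model summary is the estimated difference between males and females, with females as the baseline.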
df <- na.omit(df)
head(df)
## Age Sex BP Cholesterol Na_to_K Drug
## 1 23 F HIGH HIGH 25.355 drugY
## 2 47 M LOW HIGH 13.093 drugC
## 3 47 M LOW HIGH 10.114 drugC
## 4 28 F NORMAL HIGH 7.798 drugX
## 5 61 F LOW HIGH 18.043 drugY
## 6 22 F NORMAL HIGH 8.607 drugX
summary(df)
## Age Sex BP Cholesterol Na_to_K
## Min. :15.00 F: 96 HIGH :77 HIGH :103 Min. : 6.269
## 1st Qu.:31.00 M:104 LOW :64 NORMAL: 97 1st Qu.:10.445
## Median :45.00 NORMAL:59 Median :13.937
## Mean :44.31 Mean :16.084
## 3rd Qu.:58.00 3rd Qu.:19.380
## Max. :74.00 Max. :38.247
## Drug
## drugA:23
## drugB:16
## drugC:16
## drugX:54
## drugY:91
##
str(df)
## 'data.frame': 200 obs. of 6 variables:
## $ Age : int 23 47 47 28 61 22 49 41 60 43 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 2 2 2 ...
## $ BP : Factor w/ 3 levels "HIGH","LOW","NORMAL": 1 2 2 3 2 3 3 2 3 2 ...
## $ Cholesterol: Factor w/ 2 levels "HIGH","NORMAL": 1 1 1 1 1 1 1 1 1 2 ...
## $ Na_to_K : num 25.4 13.1 10.1 7.8 18 ...
## $ Drug : Factor w/ 5 levels "drugA","drugB",..: 5 3 3 4 5 4 5 3 5 5 ...
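Since Sex is a two-level factor and Age and Na_to_K are numeric, a model containing the three kinds of terms the assignment asks for (quadratic, dichotomous, and dichotomous-by-quantitative interaction) could be specified roughly as follows. This is only a sketch under stated assumptions: the helper name `fit_full_model` is hypothetical, Age is taken as the response to match the other models in this report, and the model is not fitted or evaluated here.

```r
# Hypothetical helper: fits a model with the three required kinds of terms.
# Assumes `data` has numeric columns Age and Na_to_K and a two-level factor Sex.
fit_full_model <- function(data) {
  lm(
    Age ~ Na_to_K +        # quantitative main effect
      I(Na_to_K^2) +       # quadratic term; I() keeps ^ as arithmetic
      Sex +                # dichotomous term (dummy-coded by lm)
      Na_to_K:Sex,         # dichotomous-by-quantitative interaction
    data = data
  )
}

# Usage with the data frame loaded above (not run here):
# full_model <- fit_full_model(df)
# summary(full_model)
```

In such a model the Sex coefficient shifts the intercept for males relative to females, while the interaction coefficient is the difference between the male and female slopes on Na_to_K.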
linear_model <- lm(Age ~ Na_to_K, data=df)
summary(linear_model)
##
## Call:
## lm(formula = Age ~ Na_to_K, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.3270 -13.4270 -0.3004 14.1205 30.3872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.6401 2.8629 16.29 <2e-16 ***
## Na_to_K -0.1446 0.1624 -0.89 0.375
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.55 on 198 degrees of freedom
## Multiple R-squared: 0.003984, Adjusted R-squared: -0.001046
## F-statistic: 0.792 on 1 and 198 DF, p-value: 0.3746
qqnorm(linear_model$residuals)
qqline(linear_model$residuals)
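The p-value (0.375) is well above 0.05, so the slope for Na_to_K (-0.1446, about a 0.14-year decrease in expected age per one-unit increase in Na_to_K) is not significantly different from zero. The R-squared is very low (0.003984), so the model explains less than 1% of the variation in age. The Q-Q plot shows that the residuals are not normally distributed: they deviate from the reference line at both the top and bottom ends. Based on this, there is no evidence of a linear relationship between age and sodium-to-potassium levels. The linear model we picked is not appropriate.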
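The normality assessment above rests on reading the Q-Q plot by eye. One possible way to back it up is a residuals-versus-fitted plot together with a Shapiro-Wilk test. The sketch below wraps both in a helper; the name `check_residuals` is hypothetical, and no results for this data set are implied.

```r
# Hypothetical helper: basic residual diagnostics for a fitted lm object
check_residuals <- function(model) {
  res <- residuals(model)

  # Residuals vs. fitted values: curvature or a funnel shape suggests a
  # misspecified mean structure or non-constant variance
  plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)

  # Shapiro-Wilk test: a small p-value is evidence against normal residuals
  shapiro.test(res)
}

# Usage with the model fitted above (not run here):
# check_residuals(linear_model)
```

The test complements the plots rather than replacing them, since with 200 observations even mild departures from normality can produce a small p-value.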
linear_model_2 <- lm(Age ~ Sex + Na_to_K, data=df)
summary(linear_model_2)
##
## Call:
## lm(formula = Age ~ Sex + Na_to_K, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.652 -13.559 -0.737 14.073 31.897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.5571 3.2524 13.700 <2e-16 ***
## SexM 3.1589 2.3566 1.340 0.182
## Na_to_K -0.1172 0.1634 -0.717 0.474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.52 on 197 degrees of freedom
## Multiple R-squared: 0.01299, Adjusted R-squared: 0.002966
## F-statistic: 1.296 on 2 and 197 DF, p-value: 0.2759
qqnorm(linear_model_2$residuals)
qqline(linear_model_2$residuals)
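The overall p-value (0.2759) is not low, and based on the R-squared (0.01299) the model explains only about 1.3% of the variation in age. Neither coefficient is significant: holding Na_to_K constant, males are estimated to be 3.16 years older than females (p = 0.182), and each one-unit increase in Na_to_K is associated with a 0.12-year decrease in expected age (p = 0.474). It is a slightly better fit than the earlier model; however, it is still not an appropriate linear model for prediction. The second model is also not useful as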