Answer

library('ggplot2')

About the Data Set

The features provided in the data set are outlined below:

Age: Age of the patient
Sex: Gender of the patient
BP: Blood pressure of the patient
Cholesterol: Cholesterol of the patient
Drug: Drug each patient responded to
Na_to_K: Sodium-to-potassium level

Please note that all patients in the dataset have the same illness.

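# stringsAsFactors = TRUE keeps Sex, BP, Cholesterol and Drug as factors,
# matching the output below (R >= 4.0.0 reads them as character by default)
df <- read.csv('drug200.csv', stringsAsFactors = TRUE)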
head(df)

##   Age Sex     BP Cholesterol Na_to_K  Drug
## 1  23   F   HIGH        HIGH  25.355 drugY
## 2  47   M    LOW        HIGH  13.093 drugC
## 3  47   M    LOW        HIGH  10.114 drugC
## 4  28   F NORMAL        HIGH   7.798 drugX
## 5  61   F    LOW        HIGH  18.043 drugY
## 6  22   F NORMAL        HIGH   8.607 drugX

summary(df)

##       Age        Sex          BP     Cholesterol     Na_to_K      
##  Min.   :15.00   F: 96   HIGH  :77   HIGH  :103   Min.   : 6.269  
##  1st Qu.:31.00   M:104   LOW   :64   NORMAL: 97   1st Qu.:10.445  
##  Median :45.00           NORMAL:59                Median :13.937  
##  Mean   :44.31                                    Mean   :16.084  
##  3rd Qu.:58.00                                    3rd Qu.:19.380  
##  Max.   :74.00                                    Max.   :38.247  
##     Drug   
##  drugA:23  
##  drugB:16  
##  drugC:16  
##  drugX:54  
##  drugY:91  
##

# see linear relationship between age and Na_to_K

g <- ggplot(df, aes(Age, Na_to_K))

# Scatterplot
g + geom_point() +
  geom_smooth(method = "lm", se = FALSE)

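# Attempt to recode Sex as 1/2. Note: the levels are "F" and "M", so
# Sex=="Male" matches no rows; the call only raises the warning below
# and leaves df unchanged. The same applies to the "Female" line further down.
df$Sex <- with(df, replace(Sex, Sex=="Male", 1))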

## Warning in `[<-.factor`(`*tmp*`, list, value = 1): invalid factor level, NA
## generated

df$Sex <- with(df, replace(Sex, Sex=="Female", 2))

## Warning in `[<-.factor`(`*tmp*`, list, value = 2): invalid factor level, NA
## generated

head(df)

##   Age Sex     BP Cholesterol Na_to_K  Drug
## 1  23   F   HIGH        HIGH  25.355 drugY
## 2  47   M    LOW        HIGH  13.093 drugC
## 3  47   M    LOW        HIGH  10.114 drugC
## 4  28   F NORMAL        HIGH   7.798 drugX
## 5  61   F    LOW        HIGH  18.043 drugY
## 6  22   F NORMAL        HIGH   8.607 drugX
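
The output above still shows Sex as F/M, because the two replace() calls compare against "Male" and "Female", which are not levels in this data. As a minimal sketch of how a numeric recoding could be done, assuming the intended coding is 1 for male and 2 for female, the example below uses a small toy data frame copied from the six rows printed by head(df). The names toy, sex_codes and Sex_num are hypothetical and not part of the original analysis.

```r
# Toy data: the Age and Sex columns of the six rows shown by head(df)
toy <- data.frame(
  Age = c(23, 47, 47, 28, 61, 22),
  Sex = factor(c("F", "M", "M", "F", "F", "F"))
)

# Lookup table keyed by the actual factor levels
# (assumed coding: 1 = male, 2 = female)
sex_codes <- c(M = 1, F = 2)

# Convert the factor to character first so the lookup is by label,
# not by the factor's internal integer code
toy$Sex_num <- unname(sex_codes[as.character(toy$Sex)])

print(toy)
```

For the regressions in this post the recoding is not strictly needed: lm() handles a factor predictor on its own, which is why the second model reports a SexM coefficient.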

df <- na.omit(df)
head(df)

##   Age Sex     BP Cholesterol Na_to_K  Drug
## 1  23   F   HIGH        HIGH  25.355 drugY
## 2  47   M    LOW        HIGH  13.093 drugC
## 3  47   M    LOW        HIGH  10.114 drugC
## 4  28   F NORMAL        HIGH   7.798 drugX
## 5  61   F    LOW        HIGH  18.043 drugY
## 6  22   F NORMAL        HIGH   8.607 drugX

summary(df)

##       Age        Sex          BP     Cholesterol     Na_to_K      
##  Min.   :15.00   F: 96   HIGH  :77   HIGH  :103   Min.   : 6.269  
##  1st Qu.:31.00   M:104   LOW   :64   NORMAL: 97   1st Qu.:10.445  
##  Median :45.00           NORMAL:59                Median :13.937  
##  Mean   :44.31                                    Mean   :16.084  
##  3rd Qu.:58.00                                    3rd Qu.:19.380  
##  Max.   :74.00                                    Max.   :38.247  
##     Drug   
##  drugA:23  
##  drugB:16  
##  drugC:16  
##  drugX:54  
##  drugY:91  
##

str(df)

## 'data.frame':    200 obs. of  6 variables:
##  $ Age        : int  23 47 47 28 61 22 49 41 60 43 ...
##  $ Sex        : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 2 2 2 ...
##  $ BP         : Factor w/ 3 levels "HIGH","LOW","NORMAL": 1 2 2 3 2 3 3 2 3 2 ...
##  $ Cholesterol: Factor w/ 2 levels "HIGH","NORMAL": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Na_to_K    : num  25.4 13.1 10.1 7.8 18 ...
##  $ Drug       : Factor w/ 5 levels "drugA","drugB",..: 5 3 3 4 5 4 5 3 5 5 ...

linear_model <- lm(Age ~ Na_to_K, data=df)
summary(linear_model)

## 
## Call:
## lm(formula = Age ~ Na_to_K, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.3270 -13.4270  -0.3004  14.1205  30.3872 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.6401     2.8629   16.29   <2e-16 ***
## Na_to_K      -0.1446     0.1624   -0.89    0.375    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.55 on 198 degrees of freedom
## Multiple R-squared:  0.003984,   Adjusted R-squared:  -0.001046 
## F-statistic: 0.792 on 1 and 198 DF,  p-value: 0.3746

qqnorm(linear_model$residuals)
qqline(linear_model$residuals)

The p-value for Na_to_K is not low: at 0.375 it is well above 0.05. The R-squared is very low (0.003984). In the Q-Q plot the residuals deviate from the reference line at both the top and bottom ends, so they do not appear to be normally distributed. Based on this, there is little evidence of a linear relationship between age and sodium-to-potassium levels, and the linear model we picked is not appropriate.
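
The three checks used here (slope p-value, R-squared, residual normality) can also be pulled out of the fitted model programmatically instead of being read off the printed summary. The sketch below is only an illustration on synthetic data, not drug200.csv: the data frame sim is simulated with two unrelated variables, and the Shapiro-Wilk test is one possible numeric companion to the Q-Q plot rather than something the original analysis used.

```r
set.seed(1)

# Synthetic stand-in data (NOT the drug200 data set):
# two independent variables with ranges similar to Age and Na_to_K
sim <- data.frame(
  Age = round(runif(200, min = 15, max = 74)),
  Na_to_K = runif(200, min = 6, max = 38)
)

fit <- lm(Age ~ Na_to_K, data = sim)
fit_summary <- summary(fit)

# R-squared and the p-value of the Na_to_K slope
r_squared <- fit_summary$r.squared
slope_p <- coef(fit_summary)["Na_to_K", "Pr(>|t|)"]

# Shapiro-Wilk test on the residuals; a small p-value suggests
# the residuals are not normally distributed
sw <- shapiro.test(residuals(fit))

cat("R-squared:          ", round(r_squared, 4), "\n")
cat("Slope p-value:      ", round(slope_p, 4), "\n")
cat("Shapiro-Wilk p-value:", signif(sw$p.value, 3), "\n")
```

Applied to the real model, the same accessors would be called on summary(linear_model) and linear_model$residuals.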

linear_model_2 <- lm(Age ~ Sex + Na_to_K, data=df)
summary(linear_model_2)

## 
## Call:
## lm(formula = Age ~ Sex + Na_to_K, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.652 -13.559  -0.737  14.073  31.897 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.5571     3.2524  13.700   <2e-16 ***
## SexM          3.1589     2.3566   1.340    0.182    
## Na_to_K      -0.1172     0.1634  -0.717    0.474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.52 on 197 degrees of freedom
## Multiple R-squared:  0.01299,    Adjusted R-squared:  0.002966 
## F-statistic: 1.296 on 2 and 197 DF,  p-value: 0.2759

qqnorm(linear_model_2$residuals)
qqline(linear_model_2$residuals)

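The overall p-value (0.2759) is not low, and neither predictor is significant (Sex: p = 0.182, Na_to_K: p = 0.474). Based on the R-squared of 0.01299, the model explains only about 1.3% of the variance in Age. It is a slightly better fit than the earlier model (adjusted R-squared 0.002966 versus -0.001046), but it is still not an appropriate linear model for prediction. The second model is also not useful as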

Discussion 12

Anil Akyildirim

11/13/2019
