Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

300k medical appointments of the public healthcare of the capital city of Espirito Santo State - Vitoria - Brazil and its 15 variables (characteristics) of each. The most important one if the patient show-up or no-show the appointment. Variable names are self-explanatory, if you have doubts, just let me know! Handcap is the total amount of handcaps a person presents, it is not binary.

What if that possible to predict someone to no-show an appointment?

Kaggle Dataset: https://www.kaggle.com/joniarroba/noshowappointments

Let’s obtain the data from my github account:

if (!require(RCurl)) install.packages("RCurl")
require(RCurl)

x <- getURL("https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Week%2012/NoShow.csv")
NoShow_data <- read.csv(text = x, stringsAsFactors = FALSE)
NoShow_data$Gender <- factor(NoShow_data$Gender, levels = c("M", "F"))

head(NoShow_data)
##      PatientId AppointmentID Gender         ScheduledDay
## 1 2.987250e+13       5642903      F 2016-04-29T18:38:08Z
## 2 5.589978e+14       5642503      M 2016-04-29T16:08:27Z
## 3 4.262962e+12       5642549      F 2016-04-29T16:19:04Z
## 4 8.679512e+11       5642828      F 2016-04-29T17:29:31Z
## 5 8.841186e+12       5642494      F 2016-04-29T16:07:23Z
## 6 9.598513e+13       5626772      F 2016-04-27T08:36:51Z
##         AppointmentDay Age     Neighbourhood Scholarship Hipertension
## 1 2016-04-29T00:00:00Z  62   JARDIM DA PENHA           0            1
## 2 2016-04-29T00:00:00Z  56   JARDIM DA PENHA           0            0
## 3 2016-04-29T00:00:00Z  62     MATA DA PRAIA           0            0
## 4 2016-04-29T00:00:00Z   8 PONTAL DE CAMBURI           0            0
## 5 2016-04-29T00:00:00Z  56   JARDIM DA PENHA           0            1
## 6 2016-04-29T00:00:00Z  76         REPÚBLICA           0            1
##   Diabetes Alcoholism Handcap SMS_received No.show
## 1        0          0       0            0      No
## 2        0          0       0            0      No
## 3        0          0       0            0      No
## 4        0          0       0            0      No
## 5        1          0       0            0      No
## 6        0          0       0            0      No

How many showed up? and how many skipped out?

if (!require(ggplot2)) install.packages("ggplot2")
library(ggplot2)

status_table <- table(NoShow_data$No.show)
status_table
## 
##    No   Yes 
## 88208 22319
ggplot(NoShow_data, aes(x=No.show, fill=No.show)) + geom_bar()

print(paste0("Percentage of people who do not show up to their appointments: ", 
             round(status_table["Yes"]/(status_table["Yes"] + status_table["No"]) * 100, 2), "%"))
## [1] "Percentage of people who do not show up to their appointments: 20.19%"

Not to be confusing, but to clarify the data. “Yes” means that the patient did NOT show up, and “No” means patient DID show up!

Let’s clean up some of the data by munging. Let’s take out some information that will not likely make a difference if the patient will show up or not. In this case, we will remove ‘PatientId’, ‘AppointmentID’. We will also simplify the data by removing the ‘ScheduledDay’, ‘Neighbourhood’, and ‘AppointmentDay’. We will also set the outcome variable in the ‘NoShow’ column to be 1 if the patient did NOT show up and 0 if the patient DID show up.

if (!require(tidyr)) install.packages("tidyr")
if (!require(dplyr)) install.packages("dplyr")
library(tidyr)
library(dplyr)

NoShow_rev <- NoShow_data %>% select(c("Gender", "Age", "Scholarship", "Hipertension", "Diabetes", "Alcoholism", "Handcap", "SMS_received", "No.show"))

# Converts No into 0 and Yes into 1 in the 'NoShow' Column
NoShow_rev[NoShow_rev$No.show == "No",]$No.show = 0
NoShow_rev[NoShow_rev$No.show == "Yes",]$No.show = 1

NoShow_rev$No.show <- sapply(NoShow_rev$No.show, as.numeric)

head(NoShow_rev)
##   Gender Age Scholarship Hipertension Diabetes Alcoholism Handcap
## 1      F  62           0            1        0          0       0
## 2      M  56           0            0        0          0       0
## 3      F  62           0            0        0          0       0
## 4      F   8           0            0        0          0       0
## 5      F  56           0            1        1          0       0
## 6      F  76           0            1        0          0       0
##   SMS_received No.show
## 1            0       0
## 2            0       0
## 3            0       0
## 4            0       0
## 5            0       0
## 6            0       0

How many total samples and variables do we have in this dataset? And let’s provide the summary of the details to get a better look at the data.

row <- nrow(NoShow_rev)
column <- ncol(NoShow_rev)
print(paste0("Samples: ", row))
## [1] "Samples: 110527"
print(paste0("Variables: ", column-1)) # Given that one of the column is value "No Show", which is what we are attempting to measure
## [1] "Variables: 8"
summary(NoShow_rev)
##  Gender         Age          Scholarship       Hipertension   
##  M:38687   Min.   : -1.00   Min.   :0.00000   Min.   :0.0000  
##  F:71840   1st Qu.: 18.00   1st Qu.:0.00000   1st Qu.:0.0000  
##            Median : 37.00   Median :0.00000   Median :0.0000  
##            Mean   : 37.09   Mean   :0.09827   Mean   :0.1972  
##            3rd Qu.: 55.00   3rd Qu.:0.00000   3rd Qu.:0.0000  
##            Max.   :115.00   Max.   :1.00000   Max.   :1.0000  
##     Diabetes         Alcoholism        Handcap         SMS_received  
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.000  
##  Mean   :0.07186   Mean   :0.0304   Mean   :0.02225   Mean   :0.321  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :4.00000   Max.   :1.000  
##     No.show      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2019  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

Now that we explored the data, let’s build a multiple regression model.

attach(NoShow_rev)
NoShow.lm <- lm(No.show ~ Gender + Age + Scholarship + Hipertension + Diabetes + Alcoholism + Handcap + SMS_received)
NoShow.lm
## 
## Call:
## lm(formula = No.show ~ Gender + Age + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + Handcap + SMS_received)
## 
## Coefficients:
##  (Intercept)       GenderF           Age   Scholarship  Hipertension  
##     0.200122      0.002627     -0.001021      0.030899     -0.009529  
##     Diabetes    Alcoholism       Handcap  SMS_received  
##     0.012640      0.020963      0.005224      0.109500
summary(NoShow.lm)
## 
## Call:
## lm(formula = No.show ~ Gender + Age + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + Handcap + SMS_received)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3432 -0.2132 -0.1700 -0.1272  0.9094 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.001e-01  2.848e-03  70.267  < 2e-16 ***
## GenderF       2.627e-03  2.563e-03   1.025  0.30538    
## Age          -1.021e-03  6.102e-05 -16.733  < 2e-16 ***
## Scholarship   3.090e-02  4.073e-03   7.586 3.33e-14 ***
## Hipertension -9.529e-03  3.717e-03  -2.564  0.01036 *  
## Diabetes      1.264e-02  5.161e-03   2.449  0.01432 *  
## Alcoholism    2.096e-02  7.066e-03   2.967  0.00301 ** 
## Handcap       5.224e-03  7.437e-03   0.702  0.48241    
## SMS_received  1.095e-01  2.565e-03  42.698  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3973 on 110518 degrees of freedom
## Multiple R-squared:  0.02053,    Adjusted R-squared:  0.02045 
## F-statistic: 289.5 on 8 and 110518 DF,  p-value: < 2.2e-16

It appears that neither ‘Handcap’ nor ‘Gender’ appear to have statistical significance when it comes to No Shows, so we will remove them from our linear regression model. Perhaps this will improve our adjusted R-squared from 0.02045 to a higher amount.

attach(NoShow_rev)
NoShow.lm <- lm(No.show ~ Age + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received)
NoShow.lm
## 
## Call:
## lm(formula = No.show ~ Age + Scholarship + Hipertension + Diabetes + 
##     Alcoholism + SMS_received)
## 
## Coefficients:
##  (Intercept)           Age   Scholarship  Hipertension      Diabetes  
##     0.201550     -0.001012      0.031441     -0.009415      0.012711  
##   Alcoholism  SMS_received  
##     0.020045      0.109567
summary(NoShow.lm)
## 
## Call:
## lm(formula = No.show ~ Age + Scholarship + Hipertension + Diabetes + 
##     Alcoholism + SMS_received)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3426 -0.2136 -0.1692 -0.1273  0.9149 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.015e-01  2.505e-03  80.471  < 2e-16 ***
## Age          -1.012e-03  6.058e-05 -16.709  < 2e-16 ***
## Scholarship   3.144e-02  4.038e-03   7.786 6.95e-15 ***
## Hipertension -9.415e-03  3.714e-03  -2.535  0.01126 *  
## Diabetes      1.271e-02  5.160e-03   2.463  0.01376 *  
## Alcoholism    2.005e-02  7.012e-03   2.859  0.00426 ** 
## SMS_received  1.096e-01  2.562e-03  42.774  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3973 on 110520 degrees of freedom
## Multiple R-squared:  0.02051,    Adjusted R-squared:  0.02046 
## F-statistic: 385.7 on 6 and 110520 DF,  p-value: < 2.2e-16

The linear multiple regression model now all have coefficients that have significant t values. However, the adjusted R-squared still remains low at 0.02046. Let us see if we can build a model that has at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.

We already have multiple dichotomous terms here with the exception of age, which is continuous discrete. Let us take a look at age vs. show vs. no show.

attach(NoShow_rev)
age.lm <- lm(No.show ~ Age)
age.lm
## 
## Call:
## lm(formula = No.show ~ Age)
## 
## Coefficients:
## (Intercept)          Age  
##    0.240794    -0.001048
plot(Age, No.show)
abline(age.lm)

summary(age.lm)
## 
## Call:
## lm(formula = No.show ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2418 -0.2167 -0.1905 -0.1633  0.8797 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.408e-01  2.279e-03  105.65   <2e-16 ***
## Age         -1.048e-03  5.216e-05  -20.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4007 on 110525 degrees of freedom
## Multiple R-squared:  0.003638,   Adjusted R-squared:  0.003629 
## F-statistic: 403.6 on 1 and 110525 DF,  p-value: < 2.2e-16

Again, as a reminder, the younger you are, the more likely you will skip out on your appointment (though age is statistically significant, it doesn’t appear to contribute much to the No Show as noted in the R-square value of 0.003.) Let’s choose this term and square it to obtain our quadratic term. I will also speculate and state that young people who are alcoholics are even more likely NOT to show up to their appointments, so I will use thaat as our one dichotomous vs. quantitative interaction term.

attach(NoShow_rev)
NoShow.lm <- lm(No.show ~ Age + Age**2 + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
NoShow.lm
## 
## Call:
## lm(formula = No.show ~ Age + Age^2 + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
## 
## Coefficients:
##    (Intercept)             Age     Scholarship    Hipertension  
##      0.2010302      -0.0009979       0.0306882      -0.0089410  
##       Diabetes      Alcoholism    SMS_received  Age:Alcoholism  
##      0.0127228       0.1240017       0.1094748      -0.0021011
summary(NoShow.lm)
## 
## Call:
## lm(formula = No.show ~ Age + Age^2 + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4221 -0.2147 -0.1691 -0.1272  0.9349 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.2010302  0.0025081  80.154  < 2e-16 ***
## Age            -0.0009979  0.0000607 -16.441  < 2e-16 ***
## Scholarship     0.0306882  0.0040423   7.592 3.18e-14 ***
## Hipertension   -0.0089410  0.0037162  -2.406 0.016133 *  
## Diabetes        0.0127228  0.0051596   2.466 0.013670 *  
## Alcoholism      0.1240017  0.0277624   4.467 7.96e-06 ***
## SMS_received    0.1094748  0.0025615  42.739  < 2e-16 ***
## Age:Alcoholism -0.0021011  0.0005429  -3.870 0.000109 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3973 on 110519 degrees of freedom
## Multiple R-squared:  0.02064,    Adjusted R-squared:  0.02058 
## F-statistic: 332.8 on 7 and 110519 DF,  p-value: < 2.2e-16

So again, let’s take a look at this. Overall, I would say that this model really is NOT a great fit to predict who will show or will not show. The coefficients, which I will explain, are all very close to zero, and appear to not really influence if the patient shows up or not show up. Age, Hipertension, Age:Alcoholism all are essentially zero, and it appears whether or not you have these disease processes, it isn’t likely to influence the decision to show up or not. Alcoholism and SMS_received have a positive correlation, so if for example, if the patient is an alcoholic, the patient will more likely NOT show up. Diabetes and scholarship are also very very minimally positive coefficient, but I would argue that these values are so small, that they are unlikely to influence the decision either.

Let’s take a look at the residuals:

plot(fitted(NoShow.lm), resid(NoShow.lm))

hist(resid(NoShow.lm))

qqnorm(resid(NoShow.lm))
qqline(resid(NoShow.lm))

As you can see, there is NO normal distribution of the residuals, and there is significant deviation on the Q-Q plot. The residuals demonstrate that once again, this linear regression does a poor job predicting who would or would not show up.