Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
300k medical appointments of the public healthcare of the capital city of Espirito Santo State - Vitoria - Brazil and its 15 variables (characteristics) of each. The most important one if the patient show-up or no-show the appointment. Variable names are self-explanatory, if you have doubts, just let me know! Handcap is the total amount of handcaps a person presents, it is not binary.
What if that possible to predict someone to no-show an appointment?
Kaggle Dataset: https://www.kaggle.com/joniarroba/noshowappointments
Let’s obtain the data from my github account:
if (!require(RCurl)) install.packages("RCurl")
require(RCurl)
x <- getURL("https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Week%2012/NoShow.csv")
NoShow_data <- read.csv(text = x, stringsAsFactors = FALSE)
NoShow_data$Gender <- factor(NoShow_data$Gender, levels = c("M", "F"))
head(NoShow_data)
## PatientId AppointmentID Gender ScheduledDay
## 1 2.987250e+13 5642903 F 2016-04-29T18:38:08Z
## 2 5.589978e+14 5642503 M 2016-04-29T16:08:27Z
## 3 4.262962e+12 5642549 F 2016-04-29T16:19:04Z
## 4 8.679512e+11 5642828 F 2016-04-29T17:29:31Z
## 5 8.841186e+12 5642494 F 2016-04-29T16:07:23Z
## 6 9.598513e+13 5626772 F 2016-04-27T08:36:51Z
## AppointmentDay Age Neighbourhood Scholarship Hipertension
## 1 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1
## 2 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0
## 3 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0
## 4 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0
## 5 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1
## 6 2016-04-29T00:00:00Z 76 REPÚBLICA 0 1
## Diabetes Alcoholism Handcap SMS_received No.show
## 1 0 0 0 0 No
## 2 0 0 0 0 No
## 3 0 0 0 0 No
## 4 0 0 0 0 No
## 5 1 0 0 0 No
## 6 0 0 0 0 No
How many showed up? and how many skipped out?
if (!require(ggplot2)) install.packages("ggplot2")
library(ggplot2)
status_table <- table(NoShow_data$No.show)
status_table
##
## No Yes
## 88208 22319
ggplot(NoShow_data, aes(x=No.show, fill=No.show)) + geom_bar()
print(paste0("Percentage of people who do not show up to their appointments: ",
round(status_table["Yes"]/(status_table["Yes"] + status_table["No"]) * 100, 2), "%"))
## [1] "Percentage of people who do not show up to their appointments: 20.19%"
Not to be confusing, but to clarify the data. “Yes” means that the patient did NOT show up, and “No” means patient DID show up!
Let’s clean up some of the data by munging. Let’s take out some information that will not likely make a difference if the patient will show up or not. In this case, we will remove ‘PatientId’, ‘AppointmentID’. We will also simplify the data by removing the ‘ScheduledDay’, ‘Neighbourhood’, and ‘AppointmentDay’. We will also set the outcome variable in the ‘NoShow’ column to be 1 if the patient did NOT show up and 0 if the patient DID show up.
if (!require(tidyr)) install.packages("tidyr")
if (!require(dplyr)) install.packages("dplyr")
library(tidyr)
library(dplyr)
NoShow_rev <- NoShow_data %>% select(c("Gender", "Age", "Scholarship", "Hipertension", "Diabetes", "Alcoholism", "Handcap", "SMS_received", "No.show"))
# Converts No into 0 and Yes into 1 in the 'NoShow' Column
NoShow_rev[NoShow_rev$No.show == "No",]$No.show = 0
NoShow_rev[NoShow_rev$No.show == "Yes",]$No.show = 1
NoShow_rev$No.show <- sapply(NoShow_rev$No.show, as.numeric)
head(NoShow_rev)
## Gender Age Scholarship Hipertension Diabetes Alcoholism Handcap
## 1 F 62 0 1 0 0 0
## 2 M 56 0 0 0 0 0
## 3 F 62 0 0 0 0 0
## 4 F 8 0 0 0 0 0
## 5 F 56 0 1 1 0 0
## 6 F 76 0 1 0 0 0
## SMS_received No.show
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
How many total samples and variables do we have in this dataset? And let’s provide the summary of the details to get a better look at the data.
row <- nrow(NoShow_rev)
column <- ncol(NoShow_rev)
print(paste0("Samples: ", row))
## [1] "Samples: 110527"
print(paste0("Variables: ", column-1)) # Given that one of the column is value "No Show", which is what we are attempting to measure
## [1] "Variables: 8"
summary(NoShow_rev)
## Gender Age Scholarship Hipertension
## M:38687 Min. : -1.00 Min. :0.00000 Min. :0.0000
## F:71840 1st Qu.: 18.00 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 37.00 Median :0.00000 Median :0.0000
## Mean : 37.09 Mean :0.09827 Mean :0.1972
## 3rd Qu.: 55.00 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :115.00 Max. :1.00000 Max. :1.0000
## Diabetes Alcoholism Handcap SMS_received
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.000
## Mean :0.07186 Mean :0.0304 Mean :0.02225 Mean :0.321
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.000
## Max. :1.00000 Max. :1.0000 Max. :4.00000 Max. :1.000
## No.show
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2019
## 3rd Qu.:0.0000
## Max. :1.0000
Now that we explored the data, let’s build a multiple regression model.
attach(NoShow_rev)
NoShow.lm <- lm(No.show ~ Gender + Age + Scholarship + Hipertension + Diabetes + Alcoholism + Handcap + SMS_received)
NoShow.lm
##
## Call:
## lm(formula = No.show ~ Gender + Age + Scholarship + Hipertension +
## Diabetes + Alcoholism + Handcap + SMS_received)
##
## Coefficients:
## (Intercept) GenderF Age Scholarship Hipertension
## 0.200122 0.002627 -0.001021 0.030899 -0.009529
## Diabetes Alcoholism Handcap SMS_received
## 0.012640 0.020963 0.005224 0.109500
summary(NoShow.lm)
##
## Call:
## lm(formula = No.show ~ Gender + Age + Scholarship + Hipertension +
## Diabetes + Alcoholism + Handcap + SMS_received)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3432 -0.2132 -0.1700 -0.1272 0.9094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.001e-01 2.848e-03 70.267 < 2e-16 ***
## GenderF 2.627e-03 2.563e-03 1.025 0.30538
## Age -1.021e-03 6.102e-05 -16.733 < 2e-16 ***
## Scholarship 3.090e-02 4.073e-03 7.586 3.33e-14 ***
## Hipertension -9.529e-03 3.717e-03 -2.564 0.01036 *
## Diabetes 1.264e-02 5.161e-03 2.449 0.01432 *
## Alcoholism 2.096e-02 7.066e-03 2.967 0.00301 **
## Handcap 5.224e-03 7.437e-03 0.702 0.48241
## SMS_received 1.095e-01 2.565e-03 42.698 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3973 on 110518 degrees of freedom
## Multiple R-squared: 0.02053, Adjusted R-squared: 0.02045
## F-statistic: 289.5 on 8 and 110518 DF, p-value: < 2.2e-16
It appears that neither ‘Handcap’ nor ‘Gender’ appear to have statistical significance when it comes to No Shows, so we will remove them from our linear regression model. Perhaps this will improve our adjusted R-squared from 0.02045 to a higher amount.
attach(NoShow_rev)
NoShow.lm <- lm(No.show ~ Age + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received)
NoShow.lm
##
## Call:
## lm(formula = No.show ~ Age + Scholarship + Hipertension + Diabetes +
## Alcoholism + SMS_received)
##
## Coefficients:
## (Intercept) Age Scholarship Hipertension Diabetes
## 0.201550 -0.001012 0.031441 -0.009415 0.012711
## Alcoholism SMS_received
## 0.020045 0.109567
summary(NoShow.lm)
##
## Call:
## lm(formula = No.show ~ Age + Scholarship + Hipertension + Diabetes +
## Alcoholism + SMS_received)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3426 -0.2136 -0.1692 -0.1273 0.9149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.015e-01 2.505e-03 80.471 < 2e-16 ***
## Age -1.012e-03 6.058e-05 -16.709 < 2e-16 ***
## Scholarship 3.144e-02 4.038e-03 7.786 6.95e-15 ***
## Hipertension -9.415e-03 3.714e-03 -2.535 0.01126 *
## Diabetes 1.271e-02 5.160e-03 2.463 0.01376 *
## Alcoholism 2.005e-02 7.012e-03 2.859 0.00426 **
## SMS_received 1.096e-01 2.562e-03 42.774 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3973 on 110520 degrees of freedom
## Multiple R-squared: 0.02051, Adjusted R-squared: 0.02046
## F-statistic: 385.7 on 6 and 110520 DF, p-value: < 2.2e-16
The linear multiple regression model now all have coefficients that have significant t values. However, the adjusted R-squared still remains low at 0.02046. Let us see if we can build a model that has at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.
We already have multiple dichotomous terms here with the exception of age, which is continuous discrete. Let us take a look at age vs. show vs. no show.
attach(NoShow_rev)
age.lm <- lm(No.show ~ Age)
age.lm
##
## Call:
## lm(formula = No.show ~ Age)
##
## Coefficients:
## (Intercept) Age
## 0.240794 -0.001048
plot(Age, No.show)
abline(age.lm)
summary(age.lm)
##
## Call:
## lm(formula = No.show ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2418 -0.2167 -0.1905 -0.1633 0.8797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.408e-01 2.279e-03 105.65 <2e-16 ***
## Age -1.048e-03 5.216e-05 -20.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4007 on 110525 degrees of freedom
## Multiple R-squared: 0.003638, Adjusted R-squared: 0.003629
## F-statistic: 403.6 on 1 and 110525 DF, p-value: < 2.2e-16
Again, as a reminder, the younger you are, the more likely you will skip out on your appointment (though age is statistically significant, it doesn’t appear to contribute much to the No Show as noted in the R-square value of 0.003.) Let’s choose this term and square it to obtain our quadratic term. I will also speculate and state that young people who are alcoholics are even more likely NOT to show up to their appointments, so I will use thaat as our one dichotomous vs. quantitative interaction term.
attach(NoShow_rev)
NoShow.lm <- lm(No.show ~ Age + Age**2 + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
NoShow.lm
##
## Call:
## lm(formula = No.show ~ Age + Age^2 + Scholarship + Hipertension +
## Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
##
## Coefficients:
## (Intercept) Age Scholarship Hipertension
## 0.2010302 -0.0009979 0.0306882 -0.0089410
## Diabetes Alcoholism SMS_received Age:Alcoholism
## 0.0127228 0.1240017 0.1094748 -0.0021011
summary(NoShow.lm)
##
## Call:
## lm(formula = No.show ~ Age + Age^2 + Scholarship + Hipertension +
## Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4221 -0.2147 -0.1691 -0.1272 0.9349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2010302 0.0025081 80.154 < 2e-16 ***
## Age -0.0009979 0.0000607 -16.441 < 2e-16 ***
## Scholarship 0.0306882 0.0040423 7.592 3.18e-14 ***
## Hipertension -0.0089410 0.0037162 -2.406 0.016133 *
## Diabetes 0.0127228 0.0051596 2.466 0.013670 *
## Alcoholism 0.1240017 0.0277624 4.467 7.96e-06 ***
## SMS_received 0.1094748 0.0025615 42.739 < 2e-16 ***
## Age:Alcoholism -0.0021011 0.0005429 -3.870 0.000109 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3973 on 110519 degrees of freedom
## Multiple R-squared: 0.02064, Adjusted R-squared: 0.02058
## F-statistic: 332.8 on 7 and 110519 DF, p-value: < 2.2e-16
So again, let’s take a look at this. Overall, I would say that this model really is NOT a great fit to predict who will show or will not show. The coefficients, which I will explain, are all very close to zero, and appear to not really influence if the patient shows up or not show up. Age, Hipertension, Age:Alcoholism all are essentially zero, and it appears whether or not you have these disease processes, it isn’t likely to influence the decision to show up or not. Alcoholism and SMS_received have a positive correlation, so if for example, if the patient is an alcoholic, the patient will more likely NOT show up. Diabetes and scholarship are also very very minimally positive coefficient, but I would argue that these values are so small, that they are unlikely to influence the decision either.
Let’s take a look at the residuals:
plot(fitted(NoShow.lm), resid(NoShow.lm))
hist(resid(NoShow.lm))
qqnorm(resid(NoShow.lm))
qqline(resid(NoShow.lm))
As you can see, there is NO normal distribution of the residuals, and there is significant deviation on the Q-Q plot. The residuals demonstrate that once again, this linear regression does a poor job predicting who would or would not show up.