Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

1 Data Acquisition

300k medical appointments and its 15 variables (characteristics) of each. The most important one if the patient show-up or no-show the appointment. Variable names are self-explanatory, if you have doubts, just let me know!

https://www.kaggle.com/joniarroba/noshowappointments

1.1 Data Dictionary

PatientId - Identification of a patient

AppointmentID - Identification of each appointment

Gender = Male or Female . Female is the greater proportion, woman takes way more care of they health in comparison to man.

DataMarcacaoConsulta = The day of the actuall appointment, when they have to visit the doctor.

DataAgendamento = The day someone called or registered the appointment, this is before appointment of course.

Age = How old is the patient. Neighbourhood = Where the appointment takes place.

Scholarship = Ture of False . Observation, this is a broad topic, consider reading this article https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia

Hipertension = True or False

Diabetes = True or False

Alcoholism = True or False

Handcap = True or False

SMS_received = 1 or more messages sent to the patient.

No-show = True or False.

library(RCurl)
## Loading required package: bitops
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.4.4
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.4
data_url <- 'https://raw.githubusercontent.com/niteen11/CUNY_DATA_605/master/dataset/Kaggle_NoShow.csv'
NoShow_data <- read.csv(data_url, stringsAsFactors = FALSE)
NoShow_data$Gender <- factor(NoShow_data$Gender, levels = c("M", "F"))
kable(head(NoShow_data))
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No.show
2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No
8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No
9.598513e+13 5626772 F 2016-04-27T08:36:51Z 2016-04-29T00:00:00Z 76 REPÚBLICA 0 1 0 0 0 0 No

2 Exploratory Data Analysis (EDA)

status_table <- table(NoShow_data$No.show)
status_table
## 
##    No   Yes 
## 88208 22319
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
ggplot(NoShow_data, aes(x=No.show, fill=No.show)) + geom_bar()

print(round(status_table["Yes"]/(status_table["Yes"] + status_table["No"]) * 100, 2))
##   Yes 
## 20.19
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
NoShow_rev <- NoShow_data %>% select(c("Gender", "Age", "Scholarship", "Hipertension", "Diabetes", "Alcoholism", "Handcap", "SMS_received", "No.show"))

# Converts No into 0 and Yes into 1 in the 'NoShow' Column
NoShow_rev[NoShow_rev$No.show == "No",]$No.show = 0
NoShow_rev[NoShow_rev$No.show == "Yes",]$No.show = 1

NoShow_rev$No.show <- sapply(NoShow_rev$No.show, as.numeric)

kable(head(NoShow_rev))
Gender Age Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No.show
F 62 0 1 0 0 0 0 0
M 56 0 0 0 0 0 0 0
F 62 0 0 0 0 0 0 0
F 8 0 0 0 0 0 0 0
F 56 0 1 1 0 0 0 0
F 76 0 1 0 0 0 0 0
summary(NoShow_rev)
##  Gender         Age          Scholarship       Hipertension   
##  M:38687   Min.   : -1.00   Min.   :0.00000   Min.   :0.0000  
##  F:71840   1st Qu.: 18.00   1st Qu.:0.00000   1st Qu.:0.0000  
##            Median : 37.00   Median :0.00000   Median :0.0000  
##            Mean   : 37.09   Mean   :0.09827   Mean   :0.1972  
##            3rd Qu.: 55.00   3rd Qu.:0.00000   3rd Qu.:0.0000  
##            Max.   :115.00   Max.   :1.00000   Max.   :1.0000  
##     Diabetes         Alcoholism        Handcap         SMS_received  
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.000  
##  Mean   :0.07186   Mean   :0.0304   Mean   :0.02225   Mean   :0.321  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :4.00000   Max.   :1.000  
##     No.show      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2019  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

3 Model

Now let’s build a multiple regression model.

attach(NoShow_rev)
NoShow.lm <- lm(No.show ~ Gender + Age + Scholarship + Hipertension + Diabetes + Alcoholism + Handcap + SMS_received)
NoShow.lm
## 
## Call:
## lm(formula = No.show ~ Gender + Age + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + Handcap + SMS_received)
## 
## Coefficients:
##  (Intercept)       GenderF           Age   Scholarship  Hipertension  
##     0.200122      0.002627     -0.001021      0.030899     -0.009529  
##     Diabetes    Alcoholism       Handcap  SMS_received  
##     0.012640      0.020963      0.005224      0.109500
summary(NoShow.lm)
## 
## Call:
## lm(formula = No.show ~ Gender + Age + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + Handcap + SMS_received)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3432 -0.2132 -0.1700 -0.1272  0.9094 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.001e-01  2.848e-03  70.267  < 2e-16 ***
## GenderF       2.627e-03  2.563e-03   1.025  0.30538    
## Age          -1.021e-03  6.102e-05 -16.733  < 2e-16 ***
## Scholarship   3.090e-02  4.073e-03   7.586 3.33e-14 ***
## Hipertension -9.529e-03  3.717e-03  -2.564  0.01036 *  
## Diabetes      1.264e-02  5.161e-03   2.449  0.01432 *  
## Alcoholism    2.096e-02  7.066e-03   2.967  0.00301 ** 
## Handcap       5.224e-03  7.437e-03   0.702  0.48241    
## SMS_received  1.095e-01  2.565e-03  42.698  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3973 on 110518 degrees of freedom
## Multiple R-squared:  0.02053,    Adjusted R-squared:  0.02045 
## F-statistic: 289.5 on 8 and 110518 DF,  p-value: < 2.2e-16

Let’s try to find influence of Age + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received

attach(NoShow_rev)
## The following objects are masked from NoShow_rev (pos = 3):
## 
##     Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
##     No.show, Scholarship, SMS_received
NoShow.lm <- lm(No.show ~ Age + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received)
NoShow.lm
## 
## Call:
## lm(formula = No.show ~ Age + Scholarship + Hipertension + Diabetes + 
##     Alcoholism + SMS_received)
## 
## Coefficients:
##  (Intercept)           Age   Scholarship  Hipertension      Diabetes  
##     0.201550     -0.001012      0.031441     -0.009415      0.012711  
##   Alcoholism  SMS_received  
##     0.020045      0.109567

4 Evaluate

summary(NoShow.lm)
## 
## Call:
## lm(formula = No.show ~ Age + Scholarship + Hipertension + Diabetes + 
##     Alcoholism + SMS_received)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3426 -0.2136 -0.1692 -0.1273  0.9149 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.015e-01  2.505e-03  80.471  < 2e-16 ***
## Age          -1.012e-03  6.058e-05 -16.709  < 2e-16 ***
## Scholarship   3.144e-02  4.038e-03   7.786 6.95e-15 ***
## Hipertension -9.415e-03  3.714e-03  -2.535  0.01126 *  
## Diabetes      1.271e-02  5.160e-03   2.463  0.01376 *  
## Alcoholism    2.005e-02  7.012e-03   2.859  0.00426 ** 
## SMS_received  1.096e-01  2.562e-03  42.774  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3973 on 110520 degrees of freedom
## Multiple R-squared:  0.02051,    Adjusted R-squared:  0.02046 
## F-statistic: 385.7 on 6 and 110520 DF,  p-value: < 2.2e-16

The linear multiple regression model now all have coefficients that have significant t values. However, the adjusted R-squared still remains low at 0.02046. Let us see if we can build a model that has at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.

Let us take a look at age vs. show vs. no show.

attach(NoShow_rev)
## The following objects are masked from NoShow_rev (pos = 3):
## 
##     Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
##     No.show, Scholarship, SMS_received
## The following objects are masked from NoShow_rev (pos = 4):
## 
##     Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
##     No.show, Scholarship, SMS_received
age.lm <- lm(No.show ~ Age)
age.lm
## 
## Call:
## lm(formula = No.show ~ Age)
## 
## Coefficients:
## (Intercept)          Age  
##    0.240794    -0.001048
summary(age.lm)
## 
## Call:
## lm(formula = No.show ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2418 -0.2167 -0.1905 -0.1633  0.8797 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.408e-01  2.279e-03  105.65   <2e-16 ***
## Age         -1.048e-03  5.216e-05  -20.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4007 on 110525 degrees of freedom
## Multiple R-squared:  0.003638,   Adjusted R-squared:  0.003629 
## F-statistic: 403.6 on 1 and 110525 DF,  p-value: < 2.2e-16
plot(Age, No.show)
abline(age.lm)

It looks like that yourger people are more likely to miss the appointment. So, let’s choose age and square it to obtain our quadratic term.

attach(NoShow_rev)
## The following objects are masked from NoShow_rev (pos = 3):
## 
##     Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
##     No.show, Scholarship, SMS_received
## The following objects are masked from NoShow_rev (pos = 4):
## 
##     Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
##     No.show, Scholarship, SMS_received
## The following objects are masked from NoShow_rev (pos = 5):
## 
##     Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
##     No.show, Scholarship, SMS_received
NoShow.lm <- lm(No.show ~ Age + Age**2 + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
NoShow.lm
## 
## Call:
## lm(formula = No.show ~ Age + Age^2 + Scholarship + Hipertension + 
##     Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
## 
## Coefficients:
##    (Intercept)             Age     Scholarship    Hipertension  
##      0.2010302      -0.0009979       0.0306882      -0.0089410  
##       Diabetes      Alcoholism    SMS_received  Age:Alcoholism  
##      0.0127228       0.1240017       0.1094748      -0.0021011

let us explore residuals:

plot(fitted(NoShow.lm), resid(NoShow.lm))

hist(resid(NoShow.lm))

qqnorm(resid(NoShow.lm))
qqline(resid(NoShow.lm))

5 Summary

There is NO normal distribution of the residuals, and there is significant deviation on the Q-Q plot. The residuals also ahows that linear regression not ideal for predicting who would or would not show up.