Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
300k medical appointments and its 15 variables (characteristics) of each. The most important one if the patient show-up or no-show the appointment. Variable names are self-explanatory, if you have doubts, just let me know!
https://www.kaggle.com/joniarroba/noshowappointments
PatientId - Identification of a patient
AppointmentID - Identification of each appointment
Gender = Male or Female . Female is the greater proportion, woman takes way more care of they health in comparison to man.
DataMarcacaoConsulta = The day of the actuall appointment, when they have to visit the doctor.
DataAgendamento = The day someone called or registered the appointment, this is before appointment of course.
Age = How old is the patient. Neighbourhood = Where the appointment takes place.
Scholarship = Ture of False . Observation, this is a broad topic, consider reading this article https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia
Hipertension = True or False
Diabetes = True or False
Alcoholism = True or False
Handcap = True or False
SMS_received = 1 or more messages sent to the patient.
No-show = True or False.
library(RCurl)
## Loading required package: bitops
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.4.4
library(knitr)
## Warning: package 'knitr' was built under R version 3.4.4
data_url <- 'https://raw.githubusercontent.com/niteen11/CUNY_DATA_605/master/dataset/Kaggle_NoShow.csv'
NoShow_data <- read.csv(data_url, stringsAsFactors = FALSE)
NoShow_data$Gender <- factor(NoShow_data$Gender, levels = c("M", "F"))
kable(head(NoShow_data))
PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No.show |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
9.598513e+13 | 5626772 | F | 2016-04-27T08:36:51Z | 2016-04-29T00:00:00Z | 76 | REPÃBLICA | 0 | 1 | 0 | 0 | 0 | 0 | No |
status_table <- table(NoShow_data$No.show)
status_table
##
## No Yes
## 88208 22319
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
ggplot(NoShow_data, aes(x=No.show, fill=No.show)) + geom_bar()
print(round(status_table["Yes"]/(status_table["Yes"] + status_table["No"]) * 100, 2))
## Yes
## 20.19
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
NoShow_rev <- NoShow_data %>% select(c("Gender", "Age", "Scholarship", "Hipertension", "Diabetes", "Alcoholism", "Handcap", "SMS_received", "No.show"))
# Converts No into 0 and Yes into 1 in the 'NoShow' Column
NoShow_rev[NoShow_rev$No.show == "No",]$No.show = 0
NoShow_rev[NoShow_rev$No.show == "Yes",]$No.show = 1
NoShow_rev$No.show <- sapply(NoShow_rev$No.show, as.numeric)
kable(head(NoShow_rev))
Gender | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No.show |
---|---|---|---|---|---|---|---|---|
F | 62 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
M | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
F | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
F | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
F | 56 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
F | 76 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
summary(NoShow_rev)
## Gender Age Scholarship Hipertension
## M:38687 Min. : -1.00 Min. :0.00000 Min. :0.0000
## F:71840 1st Qu.: 18.00 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 37.00 Median :0.00000 Median :0.0000
## Mean : 37.09 Mean :0.09827 Mean :0.1972
## 3rd Qu.: 55.00 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :115.00 Max. :1.00000 Max. :1.0000
## Diabetes Alcoholism Handcap SMS_received
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.000
## Mean :0.07186 Mean :0.0304 Mean :0.02225 Mean :0.321
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.000
## Max. :1.00000 Max. :1.0000 Max. :4.00000 Max. :1.000
## No.show
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2019
## 3rd Qu.:0.0000
## Max. :1.0000
Now let’s build a multiple regression model.
attach(NoShow_rev)
NoShow.lm <- lm(No.show ~ Gender + Age + Scholarship + Hipertension + Diabetes + Alcoholism + Handcap + SMS_received)
NoShow.lm
##
## Call:
## lm(formula = No.show ~ Gender + Age + Scholarship + Hipertension +
## Diabetes + Alcoholism + Handcap + SMS_received)
##
## Coefficients:
## (Intercept) GenderF Age Scholarship Hipertension
## 0.200122 0.002627 -0.001021 0.030899 -0.009529
## Diabetes Alcoholism Handcap SMS_received
## 0.012640 0.020963 0.005224 0.109500
summary(NoShow.lm)
##
## Call:
## lm(formula = No.show ~ Gender + Age + Scholarship + Hipertension +
## Diabetes + Alcoholism + Handcap + SMS_received)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3432 -0.2132 -0.1700 -0.1272 0.9094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.001e-01 2.848e-03 70.267 < 2e-16 ***
## GenderF 2.627e-03 2.563e-03 1.025 0.30538
## Age -1.021e-03 6.102e-05 -16.733 < 2e-16 ***
## Scholarship 3.090e-02 4.073e-03 7.586 3.33e-14 ***
## Hipertension -9.529e-03 3.717e-03 -2.564 0.01036 *
## Diabetes 1.264e-02 5.161e-03 2.449 0.01432 *
## Alcoholism 2.096e-02 7.066e-03 2.967 0.00301 **
## Handcap 5.224e-03 7.437e-03 0.702 0.48241
## SMS_received 1.095e-01 2.565e-03 42.698 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3973 on 110518 degrees of freedom
## Multiple R-squared: 0.02053, Adjusted R-squared: 0.02045
## F-statistic: 289.5 on 8 and 110518 DF, p-value: < 2.2e-16
Let’s try to find influence of Age + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received
attach(NoShow_rev)
## The following objects are masked from NoShow_rev (pos = 3):
##
## Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
## No.show, Scholarship, SMS_received
NoShow.lm <- lm(No.show ~ Age + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received)
NoShow.lm
##
## Call:
## lm(formula = No.show ~ Age + Scholarship + Hipertension + Diabetes +
## Alcoholism + SMS_received)
##
## Coefficients:
## (Intercept) Age Scholarship Hipertension Diabetes
## 0.201550 -0.001012 0.031441 -0.009415 0.012711
## Alcoholism SMS_received
## 0.020045 0.109567
summary(NoShow.lm)
##
## Call:
## lm(formula = No.show ~ Age + Scholarship + Hipertension + Diabetes +
## Alcoholism + SMS_received)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3426 -0.2136 -0.1692 -0.1273 0.9149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.015e-01 2.505e-03 80.471 < 2e-16 ***
## Age -1.012e-03 6.058e-05 -16.709 < 2e-16 ***
## Scholarship 3.144e-02 4.038e-03 7.786 6.95e-15 ***
## Hipertension -9.415e-03 3.714e-03 -2.535 0.01126 *
## Diabetes 1.271e-02 5.160e-03 2.463 0.01376 *
## Alcoholism 2.005e-02 7.012e-03 2.859 0.00426 **
## SMS_received 1.096e-01 2.562e-03 42.774 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3973 on 110520 degrees of freedom
## Multiple R-squared: 0.02051, Adjusted R-squared: 0.02046
## F-statistic: 385.7 on 6 and 110520 DF, p-value: < 2.2e-16
The linear multiple regression model now all have coefficients that have significant t values. However, the adjusted R-squared still remains low at 0.02046. Let us see if we can build a model that has at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term.
Let us take a look at age vs. show vs. no show.
attach(NoShow_rev)
## The following objects are masked from NoShow_rev (pos = 3):
##
## Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
## No.show, Scholarship, SMS_received
## The following objects are masked from NoShow_rev (pos = 4):
##
## Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
## No.show, Scholarship, SMS_received
age.lm <- lm(No.show ~ Age)
age.lm
##
## Call:
## lm(formula = No.show ~ Age)
##
## Coefficients:
## (Intercept) Age
## 0.240794 -0.001048
summary(age.lm)
##
## Call:
## lm(formula = No.show ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2418 -0.2167 -0.1905 -0.1633 0.8797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.408e-01 2.279e-03 105.65 <2e-16 ***
## Age -1.048e-03 5.216e-05 -20.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4007 on 110525 degrees of freedom
## Multiple R-squared: 0.003638, Adjusted R-squared: 0.003629
## F-statistic: 403.6 on 1 and 110525 DF, p-value: < 2.2e-16
plot(Age, No.show)
abline(age.lm)
It looks like that yourger people are more likely to miss the appointment. So, let’s choose age and square it to obtain our quadratic term.
attach(NoShow_rev)
## The following objects are masked from NoShow_rev (pos = 3):
##
## Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
## No.show, Scholarship, SMS_received
## The following objects are masked from NoShow_rev (pos = 4):
##
## Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
## No.show, Scholarship, SMS_received
## The following objects are masked from NoShow_rev (pos = 5):
##
## Age, Alcoholism, Diabetes, Gender, Handcap, Hipertension,
## No.show, Scholarship, SMS_received
NoShow.lm <- lm(No.show ~ Age + Age**2 + Scholarship + Hipertension + Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
NoShow.lm
##
## Call:
## lm(formula = No.show ~ Age + Age^2 + Scholarship + Hipertension +
## Diabetes + Alcoholism + SMS_received + Age * Alcoholism)
##
## Coefficients:
## (Intercept) Age Scholarship Hipertension
## 0.2010302 -0.0009979 0.0306882 -0.0089410
## Diabetes Alcoholism SMS_received Age:Alcoholism
## 0.0127228 0.1240017 0.1094748 -0.0021011
let us explore residuals:
plot(fitted(NoShow.lm), resid(NoShow.lm))
hist(resid(NoShow.lm))
qqnorm(resid(NoShow.lm))
qqline(resid(NoShow.lm))
There is NO normal distribution of the residuals, and there is significant deviation on the Q-Q plot. The residuals also ahows that linear regression not ideal for predicting who would or would not show up.