This homework is due by Thursday, February 9th, 8:00pm. Upload a pdf file to Canvas called
4_modeling_data.pdf
In this section and the next, we’ll be revisiting the credit dataset.
# Load data
df.credit = read_csv("data/credit.csv") %>%
clean_names()
Use the ggpairs() function from the GGally
package to make scatterplots of all pairwise combinations of the numeric
variables.
Tip: use
progress = FALSEwithin the function to suppress unwanted progress bars of each plot
### YOUR CODE HERE ###
sapply(df.credit, is.numeric)
## index income limit rating cards age education gender
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## student married ethnicity balance height
## FALSE FALSE FALSE TRUE TRUE
ggpairs(df.credit, columns=c(2,3,4,5,6,7,12,13), progress=F)
######################
That’s perhaps a bit too congested to be useful. Recreate the plot, focusing on just ~5 of the variables that seem interesting to you.
### YOUR CODE HERE ###
ggpairs(df.credit, columns=c(2,4,5,7,13), progress=F)
######################
Does your plot tell you anything interesting about the data? Briefly describe one observation.
Education appears to be positively correlated with height. The taller you are, the more likely you are to be educated. I wonder if physical height gives off a perception of power and superiority that leads to them receiving preferential treatment or higher expectations in life.
You decide you want to test the relationship between credit limit
(limit) and average credit card debt
(balance).
a) First, let’s visualize this relationship. Create
a scatterplot of balance (on the y-axis) as a function of
limit (on the x-axis). Set the transparency of the points
to 0.3.
### YOUR CODE HERE ###
df.credit %>%
ggplot(aes(x=limit,y=balance)) +
geom_point(alpha=.3)
######################
Model generalizability: When fitting models to data,
it’s important to not just evaluate how well the model fits the data,
but also how well the model predicts new data. Validating the
model on unseen data is critical to understanding its generalizability.
To create “unseen” data, we’ll divide our dataset into
training data and test data, randomly
designating 80% of observations as training data and 20% as
test data. Note that this approach previews the concept of
cross-validation, which we’ll cover later in the course. As you’ll see,
when cross-validating, performance metrics are computed on many
“splits”, where each “split” is a new division of the observed data into
training and test data. Here, we’ll only evaluate our model on what
amounts to one “split”. Once the model is fit to the training data, we
can generate predictions for the test data and evaluate model
performance.
We’ve split the dataset into training and test data for you:
# Set seed for reproducibility
set.seed(1)
# Split data into train and test
train_proportion = 0.8
train_data = df.credit %>%
sample_frac(train_proportion)
test_data = df.credit %>%
anti_join(train_data)
## Joining, by = c("index", "income", "limit", "rating", "cards", "age",
## "education", "gender", "student", "married", "ethnicity", "balance", "height")
b) You are interested in whether credit limit is a
significant predictor of average credit card debt. Build a simple linear
regression model to predict balance from limit
for the training data.
### YOUR CODE HERE ###
fit.lm1_train = lm(balance ~ 1 + limit, data=train_data)
summary(fit.lm1_train)
##
## Call:
## lm(formula = balance ~ 1 + limit, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -674.44 -144.65 -13.98 131.16 775.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -284.569658 28.738175 -9.902 <0.0000000000000002 ***
## limit 0.170409 0.005424 31.419 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 228.1 on 318 degrees of freedom
## Multiple R-squared: 0.7564, Adjusted R-squared: 0.7556
## F-statistic: 987.2 on 1 and 318 DF, p-value: < 0.00000000000000022
######################
c) Is credit limit a significant predictor of average credit card debt? What is the r-squared value, and what does it tell you about the model?
Yes, credit limit is a significant predictor of average credit card debt. The r-squared is .76, suggesting that 76% of the variance of credit card debt can be accounted for by credit limit.
d) Before visualizing model fit, let’s make a basic
plot of the training data that we can flexibly add to
later. Make a scatterplot showing balance (on the y-axis)
as a function of limit (on the x-axis) and assign this
scatterplot to the variable plot.balance_limit. Set the
transparency of the points to 0.2.
### YOUR CODE HERE ###
plot.balance_limit =
ggplot() +
geom_point(data=train_data, aes(x=limit, y=balance), alpha=.2)
######################
plot.balance_limit
e) Using plot.balance_limit, add a
linear regression line with confidence intervals.
Hint:
geom_smooth()will be useful here.
### YOUR CODE HERE ###
plot.balance_limit +
geom_smooth(data=train_data, aes(x=limit, y=balance), method="lm", color="black")
## `geom_smooth()` using formula = 'y ~ x'
######################
a) Use the trained model to predict the test data.
Hint: Take a look at the
predict()function here.
### YOUR CODE HERE ###
predictions.test = test_data %>%
mutate(predicted_balance = predict(fit.lm1_train, newdata=test_data))
######################
b) Using plot.balance_limit, overlay
the test data in blue and your predictions from 3a in red
(don’t alter their transparency). Add black vertical lines connecting
each predicted point to the observed point.
Hint: The function
geom_segment()will be useful for drawing the black vertical lines.
### YOUR CODE HERE ###
plot.balance_limit +
geom_point(data=predictions.test, aes(x=limit, y=balance), color = "blue") +
geom_point(data=predictions.test, aes(x=limit, y=predicted_balance), color= "red") +
geom_segment(data=predictions.test, aes(x=limit, y=balance, xend=limit, yend= predicted_balance))
######################
The black lines in your plot represent the residuals of the model (Y_true - Y_pred). Squaring and then summing these residuals produces an important quantity: the residual sum of squared errors (which we will use in the next problem).
c) Create a function that calculates r-squared from a set of TRUE values and a set of PREDICTED values.
Hint: you’ll need to calculate both the residual and total sum of squared errors. For the total sum of squared errors, you need to calculate the squared error between each observation and the mean of all of the observations.
r_squared = function(Y_true, Y_pred) {
# Input: Y_true and Y_pred should each be a numeric vector of the same length
### YOUR CODE HERE ###
r2 = 1 - (sum((Y_true - Y_pred)^2)/sum((Y_true - mean(Y_true))^2))
######################
return(r2)
}
Use your function to calculate r-squared for the test data.
### YOUR CODE HERE ###
r_squared_test = r_squared(predictions.test$balance, predictions.test$predicted_balance)
######################
r_squared_test
## [1] 0.6844749
Now, generate predictions (using predict()) for the data
your model was trained on, and calculate r-squared. [Note that this
value should be identical to what was calculated in
2b]
### YOUR CODE HERE ###
predictions_train = train_data %>%
mutate(predicted_balance = predict(fit.lm1_train, newdata=train_data))
r_squared_train = r_squared(predictions_train$balance, predictions_train$predicted_balance)
######################
r_squared_train
## [1] 0.7563526
d) Is r-squared higher for the training or test data? Why might that be?
The r-squared is higher for the training data. A larger sample size stabilized the data and reduced the variance of datapoints around the predicted line. Hence, the total sum of squared residuals decreased, leading to a higher r-squared value.
e) Based on r-squared for the test data, how well do you think the model generalized to new data?
The model generalized pretty well to the new data, accounting for almost 70% of the variance.
f) In theory, what is the minimum and maximum value r-squared can be on test data, and why? What does it mean if r-squared is 0?
The minimum value is 0 and the maximum value is 1. An r-squared of 0 means that the model accounted for 0% of the variance in the test data.
In psychological research, people often run linear regressions in which the goal is to assess the relationship between two variables while “controlling” for other variables. These control variables could, for example, be age and gender. But how should we decide whether and which variables to control for? In this exercise, we will see what potential effects controlling for variables can have in different situations.
Now you are interested in whether age is a significant predictor of credit limit.
Note: for this question, go ahead and use the full
df.credit dataset (instead of just the training
data).
a) Build a simple linear regression model to predict
limit from age. Is age a significant predictor
of credit limit?
# Simple regression without control variables
### YOUR CODE HERE ###
fit.lm1 = lm(limit ~ 1 + age, data=df.credit)
summary(fit.lm1)
##
## Call:
## lm(formula = limit ~ 1 + age, data = df.credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3877.1 -1628.3 -87.3 1244.3 8876.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3984.098 388.857 10.246 <0.0000000000000002 ***
## age 13.500 6.673 2.023 0.0437 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2299 on 398 degrees of freedom
## Multiple R-squared: 0.01018, Adjusted R-squared: 0.007691
## F-statistic: 4.093 on 1 and 398 DF, p-value: 0.04374
######################
Age is a significant predictor of credit limit.
Then you realize that age is actually related to income, which is a strong predictor of credit limit, so you are interested in seeing whether age is related to credit limit controlling for income.
b) Build a multiple regression model to predict
limit using both age and income
as predictors. Is age still a significant predictor? What could be an
explanation for the change, if there was any?
# Multiple regression with control variables
### YOUR CODE HERE ###
fit.lm2 = lm(income ~ 1 + age, data = df.credit)
summary(fit.lm2)
##
## Call:
## lm(formula = income ~ 1 + age, data = df.credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.24 -23.49 -10.69 10.29 146.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.2762 5.8755 4.302 0.0000213 ***
## age 0.3582 0.1008 3.553 0.000426 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.74 on 398 degrees of freedom
## Multiple R-squared: 0.03074, Adjusted R-squared: 0.02831
## F-statistic: 12.62 on 1 and 398 DF, p-value: 0.0004265
fit.lm3 = lm(limit ~ 1 + age + income, data = df.credit)
summary(fit.lm3)
##
## Call:
## lm(formula = limit ~ 1 + age + income, data = df.credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2701.44 -1116.82 41.84 1157.11 2611.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2661.516 243.880 10.913 <0.0000000000000002 ***
## age -5.245 4.156 -1.262 0.208
## income 52.325 2.034 25.727 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1410 on 397 degrees of freedom
## Multiple R-squared: 0.6289, Adjusted R-squared: 0.627
## F-statistic: 336.4 on 2 and 397 DF, p-value: < 0.00000000000000022
######################
Age is no longer a significant predictor of credit limit after controlling for income. Age spuriously predicted credit limit only because it was positively associated with income, which was the “true” driver of the positive effect on credit limit.
We will be using the following dataset.
families.csv:
Data from a study of 68 companies, examining relationships between the quality of family-friendly programs at each company, the percentage of employees with families who use these programs, and employee satisfaction (all continuous variables).
Fields:
famprog: the amount of family-friendly programs from (1 = Nothing at all to 9 = Amazing family-friendliness) perfam: the percentage of employees with families in the organization (from 0% to 100%) empastis: the average rating of employee satisfaction (1 = Extremely unsatisfied to 7 = Extremely satisfied)
df.families = read_csv("data/families.csv")
a) First, let’s visualize a few relationships, plotting regression lines with confidence intervals over scatterplots.
Does employee satisfaction depend on the amount of family programs?
### YOUR CODE HERE ###
df.families %>% ggplot(aes(x=famprog, y=empsatis)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
######################
Does employee satisfaction depend upon the percentage of employees with families?
### YOUR CODE HERE ###
df.families %>% ggplot(aes(x=perfam, y=empsatis)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
######################
Does the number of family programs depend on the percentage of employees with families?
### YOUR CODE HERE ###
df.families %>% ggplot(aes(x=famprog, y=perfam)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
######################
Doesn’t seem like there’s much going on in this dataset so far…
Let’s try visualizing all three variables at once, mapping one of the
three variables to color.
### YOUR CODE HERE ###
df.families %>% ggplot(aes(x=famprog, y=empsatis, color=perfam)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
######################
Still not super clear.
Sometimes the easiest way to visualize continuous data with 3 (or
more) continuous variables is to break some of them down into
categorical variables, for example by doing a median split, or breaking
into quantiles. Let’s break the perfam and
famprog columns into terciles (thirds).
a) First, create a function that cuts its input into
terciles [Hint: you may find the cut and
quantile functions useful here]. Then create two new
columns in df.families called perfam_terciles
and famprog_terciles by applying your function to the
relevant columns. [Hint: mutate and across
will be useful here]
Note: When using the quantile function, set include.lowest = T. This will make sure the lowest values in each column are included in the tercile cuts rather than being set to “NA”.
tercile_cuts = function(x) {
### YOUR CODE HERE ###
cut(x, quantile(x,c(0,1/3,2/3,1)), include.lowest=T, labels=c("low","medium","high"))
######################
}
# Mutate variables
### YOUR CODE HERE ###
df.families2 <- df.families %>%
mutate(across(c(perfam, famprog), tercile_cuts, .names = "{.col}_terciles"))
######################
b) Now, create two plots: first, how
empsatis changes with perfam (faceted by
famprog terciles), then how empsatis changes
with famprog (faceted by perfam terciles).
### YOUR CODE HERE ###
ggplot(df.families2, aes(x=perfam, y=empsatis)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(vars(famprog_terciles))
## `geom_smooth()` using formula = 'y ~ x'
######################
### YOUR CODE HERE ###
ggplot(df.families2, aes(x=famprog, y=empsatis)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(vars(perfam_terciles))
## `geom_smooth()` using formula = 'y ~ x'
######################
Now you should be able to see a story start to emerge.
c) Which of the two previous plots do you like best? Why? What story would you tell about it? Could you confirm it experimentally?
The second plot is better visually because of the completeness of values on the x-axis. It also tells us that there is a strong positive association between the number of family friendly programs offered by the organization and employee satisfaction when the organization contains a high percentage of employees who have families. It would be challenging, though possible, to experimentally manipulate the number of family programs in an organization. Organizations can be randomly assigned to either a high or low number of family friendly programs condition, and we can observe the interaction with percentage of employees who have families on employee satisfaction.
a) Create two linear models with employee
satisfaction as the outcome. One model should have both
perfam and famprog as well as their
interaction as predictors, and the other model should have both
variables but without the interaction. Perform an ANOVA on the two
models.
### YOUR CODE HERE ###
simple_model = lm(empsatis ~ 1 + perfam + famprog, data = df.families)
summary(simple_model)
##
## Call:
## lm(formula = empsatis ~ 1 + perfam + famprog, data = df.families)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.69259 -0.50280 -0.05507 0.43595 1.53689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.225338 0.532303 7.938 0.0000000000376 ***
## perfam -0.011641 0.008782 -1.326 0.1896
## famprog 0.066932 0.036396 1.839 0.0705 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7499 on 65 degrees of freedom
## Multiple R-squared: 0.07342, Adjusted R-squared: 0.04491
## F-statistic: 2.575 on 2 and 65 DF, p-value: 0.08388
interaction_model = lm(empsatis ~ 1 + perfam*famprog, data = df.families)
summary(interaction_model)
##
## Call:
## lm(formula = empsatis ~ 1 + perfam * famprog, data = df.families)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65923 -0.39193 -0.04313 0.46748 1.59056
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.728586 1.321895 5.090 0.00000338 ***
## perfam -0.056740 0.023517 -2.413 0.0187 *
## famprog -0.343845 0.202604 -1.697 0.0945 .
## perfam:famprog 0.007399 0.003593 2.059 0.0435 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7318 on 64 degrees of freedom
## Multiple R-squared: 0.131, Adjusted R-squared: 0.09027
## F-statistic: 3.216 on 3 and 64 DF, p-value: 0.02856
anova(simple_model,interaction_model)
######################
How do the results of the models relate to the above plots?
The plots presented a likely interaction pattern in the data, such that the relationship between the number of family programs offered and employee satisfaction was dependent on the percentage of employees with families within the organization. The models confirm the presence of this very interaction pattern.
Information about this R session including which version of R was used, and what packages were loaded.
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Singapore.utf8 LC_CTYPE=English_Singapore.utf8
## [3] LC_MONETARY=English_Singapore.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Singapore.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_1.0.0 stringr_1.5.0 dplyr_1.0.10 purrr_1.0.1
## [5] readr_2.1.3 tidyr_1.3.0 tibble_3.1.8 tidyverse_1.3.2
## [9] janitor_2.2.0 GGally_2.1.2 ggplot2_3.4.0 knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.4 sass_0.4.5 bit64_4.0.5
## [4] vroom_1.6.1 jsonlite_1.8.4 splines_4.2.2
## [7] modelr_0.1.10 bslib_0.4.2 assertthat_0.2.1
## [10] highr_0.10 googlesheets4_1.0.1 cellranger_1.1.0
## [13] yaml_2.3.7 lattice_0.20-45 pillar_1.8.1
## [16] backports_1.4.1 glue_1.6.2 digest_0.6.31
## [19] RColorBrewer_1.1-3 rvest_1.0.3 snakecase_0.11.0
## [22] colorspace_2.1-0 Matrix_1.5-1 htmltools_0.5.4
## [25] plyr_1.8.8 pkgconfig_2.0.3 broom_1.0.3
## [28] haven_2.5.1 scales_1.2.1 tzdb_0.3.0
## [31] timechange_0.2.0 googledrive_2.0.0 mgcv_1.8-41
## [34] generics_0.1.3 farver_2.1.1 ellipsis_0.3.2
## [37] cachem_1.0.6 withr_2.5.0 cli_3.6.0
## [40] magrittr_2.0.3 crayon_1.5.2 readxl_1.4.1
## [43] evaluate_0.20 fs_1.6.0 fansi_1.0.4
## [46] nlme_3.1-160 xml2_1.3.3 tools_4.2.2
## [49] hms_1.1.2 gargle_1.2.1 lifecycle_1.0.3
## [52] munsell_0.5.0 reprex_2.0.2 compiler_4.2.2
## [55] jquerylib_0.1.4 rlang_1.0.6 grid_4.2.2
## [58] rstudioapi_0.14 labeling_0.4.2 rmarkdown_2.20
## [61] gtable_0.3.1 DBI_1.1.3 reshape_0.8.9
## [64] R6_2.5.1 lubridate_1.9.1 fastmap_1.1.0
## [67] bit_4.0.5 utf8_1.2.2 stringi_1.7.12
## [70] parallel_4.2.2 Rcpp_1.0.10 vctrs_0.5.2
## [73] dbplyr_2.3.0 tidyselect_1.2.0 xfun_0.36