This homework is due by Thursday, February 9th, 8:00pm. Upload a PDF file to Canvas called 4_modeling_data.pdf

I. Simple linear regression and prediction

In this section and the next, we’ll be revisiting the credit dataset.

# Load packages and data
library(tidyverse)  # read_csv(), dplyr, ggplot2
library(janitor)    # clean_names()
library(GGally)     # ggpairs()

df.credit = read_csv("data/credit.csv") %>% 
  clean_names()

1. Explore and visualize the data

Use the ggpairs() function from the GGally package to make scatterplots of all pairwise combinations of the numeric variables.

Tip: use progress = FALSE within the function to suppress the progress bar that would otherwise be printed for each plot

### YOUR CODE HERE ###
sapply(df.credit, is.numeric)
##     index    income     limit    rating     cards       age education    gender 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE     FALSE 
##   student   married ethnicity   balance    height 
##     FALSE     FALSE     FALSE      TRUE      TRUE
ggpairs(df.credit, columns=c(2,3,4,5,6,7,12,13), progress=F)

######################

That’s perhaps a bit too congested to be useful. Recreate the plot, focusing on just ~5 of the variables that seem interesting to you.

### YOUR CODE HERE ###
ggpairs(df.credit, columns=c(2,4,5,7,13), progress=F)

######################

Does your plot tell you anything interesting about the data? Briefly describe one observation.

Education appears to be positively correlated with height: the taller you are, the more educated you tend to be. I wonder if physical height conveys a perception of power and authority that leads to taller people receiving preferential treatment or being held to higher expectations in life.

2. Simple linear regression

You decide you want to test the relationship between credit limit (limit) and average credit card debt (balance).

a) First, let’s visualize this relationship. Create a scatterplot of balance (on the y-axis) as a function of limit (on the x-axis). Set the transparency of the points to 0.3.

### YOUR CODE HERE ###
df.credit %>% 
  ggplot(aes(x=limit,y=balance)) +
  geom_point(alpha=.3)

######################

Model generalizability: When fitting models to data, it’s important to not just evaluate how well the model fits the data, but also how well the model predicts new data. Validating the model on unseen data is critical to understanding its generalizability. To create “unseen” data, we’ll divide our dataset into training data and test data, randomly designating 80% of observations as training data and 20% as test data. Note that this approach previews the concept of cross-validation, which we’ll cover later in the course. As you’ll see, when cross-validating, performance metrics are computed on many “splits”, where each “split” is a new division of the observed data into training and test data. Here, we’ll only evaluate our model on what amounts to one “split”. Once the model is fit to the training data, we can generate predictions for the test data and evaluate model performance.
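As a preview of that idea, here’s a minimal sketch of what computing r-squared over several random splits might look like (illustrative only, not part of the assignment; it assumes df.credit is loaded and uses the balance ~ limit model fit later in this section):

# Illustrative sketch: test r-squared across 10 random 80/20 splits
set.seed(2)
split_r2 = replicate(10, {
  train = df.credit %>% sample_frac(0.8)
  test  = df.credit %>% anti_join(train, by = "index")
  fit   = lm(balance ~ 1 + limit, data = train)
  preds = predict(fit, newdata = test)
  1 - sum((test$balance - preds)^2) / sum((test$balance - mean(test$balance))^2)
})
mean(split_r2)  # average test r-squared over the splits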

We’ve split the dataset into training and test data for you:

# Set seed for reproducibility
set.seed(1)

# Split data into train and test
train_proportion = 0.8
train_data = df.credit %>% 
  sample_frac(train_proportion)
test_data = df.credit %>%
  anti_join(train_data)
## Joining, by = c("index", "income", "limit", "rating", "cards", "age",
## "education", "gender", "student", "married", "ethnicity", "balance", "height")

b) You are interested in whether credit limit is a significant predictor of average credit card debt. Build a simple linear regression model to predict balance from limit for the training data.

### YOUR CODE HERE ###
fit.lm1_train = lm(balance ~ 1 + limit, data=train_data)
summary(fit.lm1_train)
## 
## Call:
## lm(formula = balance ~ 1 + limit, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -674.44 -144.65  -13.98  131.16  775.84 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept) -284.569658   28.738175  -9.902 <0.0000000000000002 ***
## limit          0.170409    0.005424  31.419 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 228.1 on 318 degrees of freedom
## Multiple R-squared:  0.7564, Adjusted R-squared:  0.7556 
## F-statistic: 987.2 on 1 and 318 DF,  p-value: < 0.00000000000000022
######################

c) Is credit limit a significant predictor of average credit card debt? What is the r-squared value, and what does it tell you about the model?

Yes, credit limit is a significant predictor of average credit card debt. The r-squared is .76, meaning that 76% of the variance in credit card debt in the training data is accounted for by credit limit.

d) Before visualizing model fit, let’s make a basic plot of the training data that we can flexibly add to later. Make a scatterplot showing balance (on the y-axis) as a function of limit (on the x-axis) and assign this scatterplot to the variable plot.balance_limit. Set the transparency of the points to 0.2.

### YOUR CODE HERE ###
plot.balance_limit = 
  ggplot() +
  geom_point(data=train_data, aes(x=limit, y=balance), alpha=.2)
######################
plot.balance_limit

e) Using plot.balance_limit, add a linear regression line with confidence intervals.

Hint: geom_smooth() will be useful here.

### YOUR CODE HERE ###
plot.balance_limit + 
  geom_smooth(data=train_data, aes(x=limit, y=balance), method="lm", color="black")
## `geom_smooth()` using formula = 'y ~ x'

######################

3. Prediction

a) Use the trained model to predict the test data.

Hint: Take a look at the predict() function here.

### YOUR CODE HERE ###
predictions.test = test_data %>% 
  mutate(predicted_balance = predict(fit.lm1_train, newdata=test_data))
######################

b) Using plot.balance_limit, overlay the test data in blue and your predictions from 3a in red (don’t alter their transparency). Add black vertical lines connecting each predicted point to the observed point.

Hint: The function geom_segment() will be useful for drawing the black vertical lines.

### YOUR CODE HERE ###
plot.balance_limit + 
  geom_point(data=predictions.test, aes(x=limit, y=balance), color = "blue") +
  geom_point(data=predictions.test, aes(x=limit, y=predicted_balance), color= "red") +
  geom_segment(data=predictions.test, aes(x=limit, y=balance, xend=limit, yend= predicted_balance))

######################

The black lines in your plot represent the residuals of the model (Y_true - Y_pred). Squaring and then summing these residuals produces an important quantity: the residual sum of squares, or RSS (which we will use in the next problem).
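For instance, with the predictions.test columns created above, the residual sum of squares can be computed directly (a quick sketch):

# Sketch: residual sum of squares (RSS) for the test predictions
rss_test = sum((predictions.test$balance - predictions.test$predicted_balance)^2)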

c) Create a function that calculates r-squared from a set of TRUE values and a set of PREDICTED values.

Hint: you’ll need to calculate both the residual and total sum of squared errors. For the total sum of squared errors, you need to calculate the squared error between each observation and the mean of all of the observations.

r_squared = function(Y_true, Y_pred) {
  # Input: Y_true and Y_pred should each be a numeric vector of the same length
  
  ### YOUR CODE HERE ###
  # r-squared = 1 - (residual sum of squares / total sum of squares)
  r2 = 1 - (sum((Y_true - Y_pred)^2) / sum((Y_true - mean(Y_true))^2))
  ######################
  return(r2)
}

Use your function to calculate r-squared for the test data.

### YOUR CODE HERE ###
r_squared_test = r_squared(predictions.test$balance, predictions.test$predicted_balance)
######################
r_squared_test
## [1] 0.6844749

Now, generate predictions (using predict()) for the data your model was trained on, and calculate r-squared. [Note that this value should be identical to what was calculated in 2b]

### YOUR CODE HERE ###
predictions_train = train_data %>% 
  mutate(predicted_balance = predict(fit.lm1_train, newdata=train_data))

r_squared_train = r_squared(predictions_train$balance, predictions_train$predicted_balance)
######################
r_squared_train
## [1] 0.7563526

d) Is r-squared higher for the training or test data? Why might that be?

The r-squared is higher for the training data. This is expected: the model’s coefficients were estimated by minimizing the residual sum of squares on the training data, so the fitted line matches those observations as closely as any line can. The test data played no role in fitting the model, so its predictions there tend to be somewhat further off, yielding a lower r-squared.

e) Based on r-squared for the test data, how well do you think the model generalized to new data?

The model generalized well to the new data, accounting for almost 70% of the variance in the test set.

f) In theory, what is the minimum and maximum value r-squared can be on test data, and why? What does it mean if r-squared is 0?

The maximum value is 1, which occurs when the predictions match the observed test values exactly. On test data there is no minimum: r-squared can be arbitrarily negative, because nothing guarantees that the model’s predictions beat simply predicting the mean of the test data (if the residual sum of squares exceeds the total sum of squares, r-squared drops below 0). An r-squared of 0 means the model’s predictions are no better than predicting the mean, i.e., the model accounts for none of the variance in the test data.
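To make the negative case concrete, here is a sketch using our r_squared() function with made-up (hypothetical) values whose predictions are far worse than simply guessing the mean:

# Hypothetical values: predictions far worse than predicting mean(Y_true)
r_squared(Y_true = c(1, 2, 3), Y_pred = c(30, -20, 50))
# RSS = 3534, TSS = 2, so r-squared = 1 - 3534/2 = -1766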

II. Multiple linear regression and controls

In psychological research, people often run linear regressions in which the goal is to assess the relationship between two variables while “controlling” for other variables. These control variables could, for example, be age and gender. But how should we decide whether and which variables to control for? In this exercise, we will see what potential effects controlling for variables can have in different situations.

Now you are interested in whether age is a significant predictor of credit limit.

Note: for this question, go ahead and use the full df.credit dataset (instead of just the training data).

4. Interpreting model parameters

a) Build a simple linear regression model to predict limit from age. Is age a significant predictor of credit limit?

# Simple regression without control variables
### YOUR CODE HERE ###
fit.lm1 = lm(limit ~ 1 + age, data=df.credit)
summary(fit.lm1)
## 
## Call:
## lm(formula = limit ~ 1 + age, data = df.credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3877.1 -1628.3   -87.3  1244.3  8876.4 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 3984.098    388.857  10.246 <0.0000000000000002 ***
## age           13.500      6.673   2.023              0.0437 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2299 on 398 degrees of freedom
## Multiple R-squared:  0.01018,    Adjusted R-squared:  0.007691 
## F-statistic: 4.093 on 1 and 398 DF,  p-value: 0.04374
######################

Yes, age is a significant predictor of credit limit (p = .044), although the effect is weak (r-squared ≈ .01).

Then you realize that age is actually related to income, which is a strong predictor of credit limit, so you are interested in seeing whether age is related to credit limit controlling for income.

b) Build a multiple regression model to predict limit using both age and income as predictors. Is age still a significant predictor? What could be an explanation for the change, if there was any?

# Multiple regression with control variables
### YOUR CODE HERE ###
# First, check whether age and income are related
fit.lm2 = lm(income ~ 1 + age, data = df.credit)
summary(fit.lm2)
## 
## Call:
## lm(formula = income ~ 1 + age, data = df.credit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.24 -23.49 -10.69  10.29 146.67 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  25.2762     5.8755   4.302 0.0000213 ***
## age           0.3582     0.1008   3.553  0.000426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.74 on 398 degrees of freedom
## Multiple R-squared:  0.03074,    Adjusted R-squared:  0.02831 
## F-statistic: 12.62 on 1 and 398 DF,  p-value: 0.0004265
# Now predict limit from age, controlling for income
fit.lm3 = lm(limit ~ 1 + age + income, data = df.credit)
summary(fit.lm3)
## 
## Call:
## lm(formula = limit ~ 1 + age + income, data = df.credit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2701.44 -1116.82    41.84  1157.11  2611.30 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2661.516    243.880  10.913 <0.0000000000000002 ***
## age           -5.245      4.156  -1.262               0.208    
## income        52.325      2.034  25.727 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1410 on 397 degrees of freedom
## Multiple R-squared:  0.6289, Adjusted R-squared:  0.627 
## F-statistic: 336.4 on 2 and 397 DF,  p-value: < 0.00000000000000022
######################

Age is no longer a significant predictor of credit limit after controlling for income. The apparent effect of age was spurious: age predicted credit limit only because it is positively associated with income, and income is the “true” driver of the positive effect on credit limit.
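One way to see what “controlling for income” means here is to residualize both limit and age on income and then regress one set of residuals on the other; by the Frisch-Waugh-Lovell theorem, the resulting slope equals the age coefficient from the multiple regression (a minimal sketch):

# Sketch: regress income-adjusted limit on income-adjusted age
limit_resid = resid(lm(limit ~ 1 + income, data = df.credit))
age_resid   = resid(lm(age ~ 1 + income, data = df.credit))
coef(lm(limit_resid ~ 1 + age_resid))  # slope matches the age coefficient in fit.lm3 (-5.245)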

III. Interactions

We will be using the following dataset.

families.csv:

Data from a study of 68 companies, examining relationships between the quality of family-friendly programs at each company, the percentage of employees with families who use these programs, and employee satisfaction (all continuous variables).

Fields:

famprog: the amount of family-friendly programs (from 1 = Nothing at all to 9 = Amazing family-friendliness)

perfam: the percentage of employees with families in the organization (from 0% to 100%)

empsatis: the average rating of employee satisfaction (1 = Extremely unsatisfied to 7 = Extremely satisfied)

df.families = read_csv("data/families.csv")

5. Visualizing simple relationships

a) First, let’s visualize a few relationships, plotting regression lines with confidence intervals over scatterplots.

Does employee satisfaction depend on the amount of family programs?

### YOUR CODE HERE ###
df.families %>% ggplot(aes(x=famprog, y=empsatis)) +
  geom_point() +
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

######################

Does employee satisfaction depend upon the percentage of employees with families?

### YOUR CODE HERE ###
df.families %>% ggplot(aes(x=perfam, y=empsatis)) +
  geom_point() +
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

######################

Does the number of family programs depend on the percentage of employees with families?

### YOUR CODE HERE ###
df.families %>% ggplot(aes(x=perfam, y=famprog)) +
  geom_point() +
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

######################

Doesn’t seem like there’s much going on in this dataset so far…

Let’s try visualizing all three variables at once, mapping one of the three variables to color.

### YOUR CODE HERE ###
df.families %>% ggplot(aes(x=famprog, y=empsatis)) +
  geom_point(aes(color=perfam)) +
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

######################

Still not super clear.

6. Visualizing interactions

Sometimes the easiest way to visualize data with 3 (or more) continuous variables is to break some of them into categorical variables, for example by doing a median split or by breaking them into quantiles. Let’s break the perfam and famprog columns into terciles (thirds).
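For example, a simple median split of perfam might look like this (a sketch; the column name perfam_split is hypothetical):

# Sketch: median split of perfam into a two-level categorical variable
df.families %>% 
  mutate(perfam_split = if_else(perfam > median(perfam), "high", "low"))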

a) First, create a function that cuts its input into terciles [Hint: you may find the cut and quantile functions useful here]. Then create two new columns in df.families called perfam_terciles and famprog_terciles by applying your function to the relevant columns. [Hint: mutate and across will be useful here]

Note: When using the quantile function, set include.lowest = T. This will make sure the lowest values in each column are included in the tercile cuts rather than being set to “NA”.

tercile_cuts = function(x) {
  ### YOUR CODE HERE ###
  cut(x, quantile(x,c(0,1/3,2/3,1)), include.lowest=T, labels=c("low","medium","high"))
  ######################
}

# Mutate variables
### YOUR CODE HERE ###
df.families2 <- df.families %>% 
  mutate(across(c(perfam, famprog), tercile_cuts, .names = "{.col}_terciles"))
######################

b) Now, create two plots: first, how empsatis changes with perfam (faceted by famprog terciles), then how empsatis changes with famprog (faceted by perfam terciles).

### YOUR CODE HERE ###
ggplot(df.families2, aes(x=perfam, y=empsatis)) +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap(vars(famprog_terciles))
## `geom_smooth()` using formula = 'y ~ x'

######################
### YOUR CODE HERE ###
ggplot(df.families2, aes(x=famprog, y=empsatis)) +
  geom_point() +
  geom_smooth(method="lm") +
  facet_wrap(vars(perfam_terciles))
## `geom_smooth()` using formula = 'y ~ x'

######################

Now you should be able to see a story start to emerge.

c) Which of the two previous plots do you like best? Why? What story would you tell about it? Could you confirm it experimentally?

The second plot is clearer because each facet covers the full range of famprog values on the x-axis. It also tells a story: there is a strong positive association between the number of family-friendly programs an organization offers and employee satisfaction when a high percentage of its employees have families. It would be challenging, though possible, to test this experimentally: organizations could be randomly assigned to offer either a high or a low number of family-friendly programs, and we could then observe how the effect on employee satisfaction interacts with the percentage of employees who have families.

7. Testing interactions

a) Create two linear models with employee satisfaction as the outcome. One model should have both perfam and famprog as well as their interaction as predictors, and the other model should have both variables but without the interaction. Perform an ANOVA on the two models.

### YOUR CODE HERE ###
simple_model = lm(empsatis ~ 1 + perfam + famprog, data = df.families)
summary(simple_model)
## 
## Call:
## lm(formula = empsatis ~ 1 + perfam + famprog, data = df.families)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.69259 -0.50280 -0.05507  0.43595  1.53689 
## 
## Coefficients:
##              Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)  4.225338   0.532303   7.938 0.0000000000376 ***
## perfam      -0.011641   0.008782  -1.326          0.1896    
## famprog      0.066932   0.036396   1.839          0.0705 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7499 on 65 degrees of freedom
## Multiple R-squared:  0.07342,    Adjusted R-squared:  0.04491 
## F-statistic: 2.575 on 2 and 65 DF,  p-value: 0.08388
interaction_model = lm(empsatis ~ 1 + perfam*famprog, data = df.families)
summary(interaction_model)
## 
## Call:
## lm(formula = empsatis ~ 1 + perfam * famprog, data = df.families)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65923 -0.39193 -0.04313  0.46748  1.59056 
## 
## Coefficients:
##                 Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)     6.728586   1.321895   5.090 0.00000338 ***
## perfam         -0.056740   0.023517  -2.413     0.0187 *  
## famprog        -0.343845   0.202604  -1.697     0.0945 .  
## perfam:famprog  0.007399   0.003593   2.059     0.0435 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7318 on 64 degrees of freedom
## Multiple R-squared:  0.131,  Adjusted R-squared:  0.09027 
## F-statistic: 3.216 on 3 and 64 DF,  p-value: 0.02856
anova(simple_model, interaction_model)
######################

How do the results of the models relate to the above plots?

The plots suggested an interaction in the data: the relationship between the number of family programs offered and employee satisfaction depended on the percentage of employees with families in the organization. The models confirm this pattern: the perfam:famprog interaction term is significant (p = .044), and adding it significantly improves the model’s fit.
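The direction flip seen in the faceted plots can also be read off the fitted coefficients: the simple slope of famprog at a given value of perfam is the famprog coefficient plus the interaction coefficient times perfam (a quick sketch):

# Sketch: simple slopes of famprog at illustrative perfam values
b = coef(interaction_model)
b["famprog"] + b["perfam:famprog"] * c(25, 50, 75)
# negative at low perfam (about -0.16), positive at high perfam (about 0.21)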

Grading rubric

Session info

Information about this R session including which version of R was used, and what packages were loaded.

sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Singapore.utf8  LC_CTYPE=English_Singapore.utf8   
## [3] LC_MONETARY=English_Singapore.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Singapore.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_1.0.0   stringr_1.5.0   dplyr_1.0.10    purrr_1.0.1    
##  [5] readr_2.1.3     tidyr_1.3.0     tibble_3.1.8    tidyverse_1.3.2
##  [9] janitor_2.2.0   GGally_2.1.2    ggplot2_3.4.0   knitr_1.42     
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.4          sass_0.4.5          bit64_4.0.5        
##  [4] vroom_1.6.1         jsonlite_1.8.4      splines_4.2.2      
##  [7] modelr_0.1.10       bslib_0.4.2         assertthat_0.2.1   
## [10] highr_0.10          googlesheets4_1.0.1 cellranger_1.1.0   
## [13] yaml_2.3.7          lattice_0.20-45     pillar_1.8.1       
## [16] backports_1.4.1     glue_1.6.2          digest_0.6.31      
## [19] RColorBrewer_1.1-3  rvest_1.0.3         snakecase_0.11.0   
## [22] colorspace_2.1-0    Matrix_1.5-1        htmltools_0.5.4    
## [25] plyr_1.8.8          pkgconfig_2.0.3     broom_1.0.3        
## [28] haven_2.5.1         scales_1.2.1        tzdb_0.3.0         
## [31] timechange_0.2.0    googledrive_2.0.0   mgcv_1.8-41        
## [34] generics_0.1.3      farver_2.1.1        ellipsis_0.3.2     
## [37] cachem_1.0.6        withr_2.5.0         cli_3.6.0          
## [40] magrittr_2.0.3      crayon_1.5.2        readxl_1.4.1       
## [43] evaluate_0.20       fs_1.6.0            fansi_1.0.4        
## [46] nlme_3.1-160        xml2_1.3.3          tools_4.2.2        
## [49] hms_1.1.2           gargle_1.2.1        lifecycle_1.0.3    
## [52] munsell_0.5.0       reprex_2.0.2        compiler_4.2.2     
## [55] jquerylib_0.1.4     rlang_1.0.6         grid_4.2.2         
## [58] rstudioapi_0.14     labeling_0.4.2      rmarkdown_2.20     
## [61] gtable_0.3.1        DBI_1.1.3           reshape_0.8.9      
## [64] R6_2.5.1            lubridate_1.9.1     fastmap_1.1.0      
## [67] bit_4.0.5           utf8_1.2.2          stringi_1.7.12     
## [70] parallel_4.2.2      Rcpp_1.0.10         vctrs_0.5.2        
## [73] dbplyr_2.3.0        tidyselect_1.2.0    xfun_0.36