Replace “Your Name” with your actual name.

Instructions

This lab will focus on conducting multiple regression analyses and interpreting the coefficients (main effects) with a special emphasis on handling categorical variables using effect coding. You will work with various datasets to predict different outcomes, interpret the results, and understand how effect coding influences the interpretation of categorical variables.

Exercise 1: Predicting Job Satisfaction

Dataset: You are given a dataset with variables Work_Hours, Job_Complexity, Salary, and Job_Satisfaction. Your task is to predict Job_Satisfaction based on the other three predictors.

Dataset Creation:

# Create the dataset
set.seed(100)
data_ex1 <- data.frame(
  Work_Hours = c(40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41),
  Job_Complexity = c(7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8),
  Salary = c(50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500),
  Job_Satisfaction = c(78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76)
)

# View the first few rows of the dataset
head(data_ex1)
##   Work_Hours Job_Complexity Salary Job_Satisfaction
## 1         40              7  50000               78
## 2         35              6  48000               72
## 3         45              8  52000               85
## 4         50              9  55000               80
## 5         38              5  47000               70
## 6         42              7  51000               82

Task:

1. Conduct a multiple regression analysis to predict Job_Satisfaction using Work_Hours, Job_Complexity, and Salary as predictors. Be sure to use the data argument in the lm() function.

2. Interpret the main effects of each predictor. What does each coefficient tell you about its relationship with Job_Satisfaction?

# Multiple regression model
model_ex1 <- lm(Job_Satisfaction ~ Work_Hours + Job_Complexity + Salary, data = data_ex1)
summary(model_ex1)
## 
## Call:
## lm(formula = Job_Satisfaction ~ Work_Hours + Job_Complexity + 
##     Salary, data = data_ex1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.367 -2.304 -0.491  2.131  5.056 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    28.3102797  7.1988322   3.933 0.000159 ***
## Work_Hours     -0.1148592  0.1737588  -0.661 0.510179    
## Job_Complexity  1.3367244  0.4796182   2.787 0.006411 ** 
## Salary          0.0008867  0.0002455   3.612 0.000485 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.852 on 96 degrees of freedom
## Multiple R-squared:  0.5925, Adjusted R-squared:  0.5798 
## F-statistic: 46.53 on 3 and 96 DF,  p-value: < 2.2e-16
  • Interpretation of Main Effects:
    • Work_Hours: The coefficient is about -0.115, but it is not statistically significant (p = 0.510). Holding Job_Complexity and Salary constant, work hours show no reliable relationship with job satisfaction in this data.
    • Job_Complexity: The coefficient is about 1.337, and it is significant (p = 0.006). Holding the other predictors constant, each 1-unit increase in job complexity is associated with an increase of about 1.34 points in job satisfaction. So, more complex jobs are linked with somewhat higher satisfaction.
    • Salary: The coefficient is 0.0008867, and it’s also significant (p = 0.000485). Holding the other predictors constant, each additional $1 of salary is associated with about 0.0009 more points of satisfaction, which is roughly 0.89 points per additional $1,000.
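
The per-dollar Salary slope is hard to read because the unit is so small. As a minimal sketch (not part of the assigned tasks), the model can be refit with salary expressed in thousands of dollars. Salary_k, base, model_raw, and model_k are names made up for this illustration; the data are the ten distinct rows from Exercise 1 repeated ten times, which matches the vectors above.

```r
# Rebuild the Exercise 1 data: ten distinct rows, repeated ten times
base <- data.frame(
  Work_Hours       = c(40, 35, 45, 50, 38, 42, 48, 37, 44, 41),
  Job_Complexity   = c(7, 6, 8, 9, 5, 7, 8, 6, 7, 8),
  Salary           = c(50000, 48000, 52000, 55000, 47000,
                       51000, 53000, 46000, 54000, 49500),
  Job_Satisfaction = c(78, 72, 85, 80, 70, 82, 79, 75, 81, 76)
)
data_ex1 <- base[rep(seq_len(nrow(base)), times = 10), ]

# Salary_k (hypothetical helper column): salary in thousands of dollars
data_ex1$Salary_k <- data_ex1$Salary / 1000

model_raw <- lm(Job_Satisfaction ~ Work_Hours + Job_Complexity + Salary,
                data = data_ex1)
model_k   <- lm(Job_Satisfaction ~ Work_Hours + Job_Complexity + Salary_k,
                data = data_ex1)

# The rescaled slope is 1000 times the per-dollar slope;
# the t value and p value for salary are unchanged by rescaling
coef(model_raw)["Salary"] * 1000
coef(model_k)["Salary_k"]
```

Rescaling changes only the unit of the coefficient, not the fit, so the choice between dollars and thousands of dollars is purely about readability.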

Exercise 2: Predicting Student Performance with Effect Coding

Dataset: You are provided with a dataset containing Study_Hours, Attendance, Parent_Education_Level, and GPA. Your task is to predict GPA based on the other predictors.

Dataset Creation:

# Create the dataset with a larger sample size
set.seed(200)
data_ex2 <- data.frame(
  Study_Hours = c(15, 12, 20, 18, 14, 17, 16, 13, 19, 14, 18, 16, 21, 13, 15, 20, 19, 18, 17, 16, 12, 14, 13, 20, 21, 22, 17, 19, 15, 16),
  Attendance = c(90, 85, 95, 92, 88, 91, 89, 87, 93, 86, 91, 89, 95, 87, 90, 96, 94, 93, 89, 90, 85, 88, 87, 95, 96, 97, 92, 94, 88, 89),
  Parent_Education_Level = factor(rep(c("High School", "College"), 15))
)

# Effect coding for Parent_Education_Level: -1 for High School, 1 for College
data_ex2$Parent_Education_Level <- ifelse(data_ex2$Parent_Education_Level == "High School", -1, 1)

# Create GPA with stronger relationships to predictors for significance
data_ex2$GPA <- 2.5 + 0.07 * data_ex2$Study_Hours + 0.03 * data_ex2$Attendance + 0.4 * data_ex2$Parent_Education_Level + rnorm(30, 0, 0.1)

# View the first few rows of the dataset
head(data_ex2)
##   Study_Hours Attendance Parent_Education_Level      GPA
## 1          15         90                     -1 5.858476
## 2          12         85                      1 6.312646
## 3          20         95                     -1 6.393256
## 4          18         92                      1 6.975807
## 5          14         88                     -1 5.725976
## 6          17         91                      1 6.808536

Task:

1. Conduct a multiple regression analysis to predict GPA using Study_Hours, Attendance, and Parent_Education_Level (coded as -1 for “High School” and 1 for “College”) as predictors.

2. Interpret the main effects. How does each predictor contribute to predicting GPA?

# Multiple regression model
model_ex2 <- lm(GPA ~ Study_Hours + Attendance + Parent_Education_Level, data = data_ex2)
summary(model_ex2)
## 
## Call:
## lm(formula = GPA ~ Study_Hours + Attendance + Parent_Education_Level, 
##     data = data_ex2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.208221 -0.058093 -0.001553  0.040198  0.143841 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.79220    1.32615   1.351   0.1882    
## Study_Hours             0.04868    0.02267   2.147   0.0413 *  
## Attendance              0.04170    0.01862   2.239   0.0339 *  
## Parent_Education_Level  0.40410    0.01574  25.669   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08593 on 26 degrees of freedom
## Multiple R-squared:  0.9731, Adjusted R-squared:   0.97 
## F-statistic: 313.4 on 3 and 26 DF,  p-value: < 2.2e-16
  • Interpretation of Main Effects:
    • Study_Hours: The coefficient is about 0.0487, and it’s statistically significant (p = 0.0413). Holding the other predictors constant, each extra hour spent studying is associated with a GPA about 0.049 points higher.
    • Attendance: The coefficient is 0.0417, and it’s also significant (p = 0.0339). Holding the other predictors constant, each 1-unit increase in attendance is associated with a GPA about 0.042 points higher.
    • Parent_Education_Level: The coefficient is 0.4041, and it’s very significant (p < 0.001). Because the variable is effect coded (-1 = High School, 1 = College), the coefficient is each group’s deviation from the average of the two groups: students whose parents went to college are predicted to be about 0.40 points above that average, and students whose parents only went to high school about 0.40 points below it. The College vs. High School difference is therefore 2 × 0.4041 ≈ 0.81 GPA points, holding the other predictors constant.
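
One way to check the "two times the coefficient" reading is to fit the same model with dummy coding (0 = High School, 1 = College), where the coefficient is the group difference directly. The sketch below rebuilds the Exercise 2 data with the same seed. PE_effect, PE_dummy, model_effect, and model_dummy are names introduced here for illustration and are not part of the lab template.

```r
# Rebuild the Exercise 2 data with the same seed as above
set.seed(200)
data_ex2 <- data.frame(
  Study_Hours = c(15, 12, 20, 18, 14, 17, 16, 13, 19, 14, 18, 16, 21, 13, 15,
                  20, 19, 18, 17, 16, 12, 14, 13, 20, 21, 22, 17, 19, 15, 16),
  Attendance  = c(90, 85, 95, 92, 88, 91, 89, 87, 93, 86, 91, 89, 95, 87, 90,
                  96, 94, 93, 89, 90, 85, 88, 87, 95, 96, 97, 92, 94, 88, 89),
  Parent_Education_Level = factor(rep(c("High School", "College"), 15))
)

# Effect coding (-1 / 1) and dummy coding (0 / 1) of the same factor
data_ex2$PE_effect <- ifelse(data_ex2$Parent_Education_Level == "High School", -1, 1)
data_ex2$PE_dummy  <- ifelse(data_ex2$Parent_Education_Level == "College", 1, 0)

data_ex2$GPA <- 2.5 + 0.07 * data_ex2$Study_Hours + 0.03 * data_ex2$Attendance +
  0.4 * data_ex2$PE_effect + rnorm(30, 0, 0.1)

model_effect <- lm(GPA ~ Study_Hours + Attendance + PE_effect, data = data_ex2)
model_dummy  <- lm(GPA ~ Study_Hours + Attendance + PE_dummy,  data = data_ex2)

# The dummy-coded coefficient (College minus High School) equals
# twice the effect-coded coefficient
2 * coef(model_effect)["PE_effect"]
coef(model_dummy)["PE_dummy"]
```

The slopes for Study_Hours and Attendance are the same in both models. Only the intercept and the coefficient for the categorical predictor change meaning.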

Exercise 3: Predicting Health Outcomes

Dataset: You are provided with a dataset containing Exercise_Frequency, Diet_Quality, Sleep_Duration, and Health_Index. Your task is to predict Health_Index based on the other predictors.

Dataset Creation:

# Create the dataset with a larger sample size
set.seed(300)
data_ex3 <- data.frame(
  Exercise_Frequency = c(4, 5, 3, 6, 2, 5, 4, 3, 5, 4, 6, 7, 3, 6, 2, 5, 7, 8, 4, 5, 3, 6, 7, 2, 4, 5, 6, 3, 7, 8),
  Diet_Quality = c(8, 7, 9, 6, 5, 8, 7, 6, 8, 7, 9, 8, 6, 7, 5, 8, 9, 7, 8, 7, 9, 6, 8, 5, 7, 6, 9, 8, 7, 6),
  Sleep_Duration = c(7, 8, 6, 7, 5, 8, 7, 6, 7, 7, 8, 7, 6, 7, 5, 8, 7, 8, 6, 7, 6, 7, 8, 5, 7, 8, 7, 6, 7, 8)
)

# Create Health_Index with stronger relationships to predictors for significance
data_ex3$Health_Index <- 50 + 2 * data_ex3$Exercise_Frequency + 1.5 * data_ex3$Diet_Quality + 1 * data_ex3$Sleep_Duration + rnorm(30, 0, 2)

# View the first few rows of the dataset
head(data_ex3)
##   Exercise_Frequency Diet_Quality Sleep_Duration Health_Index
## 1                  4            8              7     79.74758
## 2                  5            7              8     80.22421
## 3                  3            9              6     76.44698
## 4                  6            6              7     79.40253
## 5                  2            5              5     66.32989
## 6                  5            8              8     83.13740

Task:

1. Conduct a multiple regression analysis to predict Health_Index using Exercise_Frequency, Diet_Quality, and Sleep_Duration as predictors.

2. How do the coefficients inform you about the relative importance of each predictor in determining health outcomes?

# Multiple regression model
model_ex3 <- lm(Health_Index ~ Exercise_Frequency + Diet_Quality + Sleep_Duration, data = data_ex3)
summary(model_ex3)
## 
## Call:
## lm(formula = Health_Index ~ Exercise_Frequency + Diet_Quality + 
##     Sleep_Duration, data = data_ex3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.11901 -1.17265  0.03783  1.31807  2.86568 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         46.3914     2.9595  15.675 9.15e-15 ***
## Exercise_Frequency   1.8387     0.2829   6.500 6.87e-07 ***
## Diet_Quality         1.8522     0.2685   6.898 2.53e-07 ***
## Sleep_Duration       1.3717     0.5448   2.518   0.0183 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.684 on 26 degrees of freedom
## Multiple R-squared:  0.9225, Adjusted R-squared:  0.9136 
## F-statistic: 103.2 on 3 and 26 DF,  p-value: 1.467e-14
  • Interpretation of Main Effects:
    • Exercise_Frequency: The coefficient is 1.8387, and it’s highly significant (p < 0.001). This means that for every additional day or unit of exercise, the health index increases by about 1.84 points. So, more exercise is strongly linked with better health.
    • Diet_Quality: The coefficient is 1.8522, and it’s also highly significant (p < 0.001). Holding the other predictors constant, each 1-unit increase in diet score is associated with a health index about 1.85 points higher.
    • Sleep_Duration: The coefficient is 1.3717, and it’s significant (p = 0.0183). So, more sleep is also associated with a higher health index. For each extra hour of sleep, the health index rises by about 1.37 points, holding the other predictors constant.
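
The raw coefficients above are in the units of each predictor, so comparing them directly says little about relative importance unless the predictors share a scale. One common approach, sketched here as an assumption rather than a required step, is to standardize every variable and compare the resulting coefficients. data_std, model_std, and beta_by_hand are names introduced for this illustration.

```r
# Rebuild the Exercise 3 data with the same seed as above
set.seed(300)
data_ex3 <- data.frame(
  Exercise_Frequency = c(4, 5, 3, 6, 2, 5, 4, 3, 5, 4, 6, 7, 3, 6, 2,
                         5, 7, 8, 4, 5, 3, 6, 7, 2, 4, 5, 6, 3, 7, 8),
  Diet_Quality       = c(8, 7, 9, 6, 5, 8, 7, 6, 8, 7, 9, 8, 6, 7, 5,
                         8, 9, 7, 8, 7, 9, 6, 8, 5, 7, 6, 9, 8, 7, 6),
  Sleep_Duration     = c(7, 8, 6, 7, 5, 8, 7, 6, 7, 7, 8, 7, 6, 7, 5,
                         8, 7, 8, 6, 7, 6, 7, 8, 5, 7, 8, 7, 6, 7, 8)
)
data_ex3$Health_Index <- 50 + 2 * data_ex3$Exercise_Frequency +
  1.5 * data_ex3$Diet_Quality + 1 * data_ex3$Sleep_Duration + rnorm(30, 0, 2)

model_ex3 <- lm(Health_Index ~ Exercise_Frequency + Diet_Quality + Sleep_Duration,
                data = data_ex3)

# Standardize every column (mean 0, SD 1) and refit
data_std  <- as.data.frame(scale(data_ex3))
model_std <- lm(Health_Index ~ Exercise_Frequency + Diet_Quality + Sleep_Duration,
                data = data_std)
round(coef(model_std), 3)

# The same standardized slopes by hand: b * sd(x) / sd(y)
predictors   <- c("Exercise_Frequency", "Diet_Quality", "Sleep_Duration")
beta_by_hand <- coef(model_ex3)[predictors] *
  sapply(data_ex3[predictors], sd) / sd(data_ex3$Health_Index)
round(beta_by_hand, 3)
```

Each standardized coefficient is the expected change in Health_Index, in standard deviations, for a one standard deviation increase in that predictor with the others held constant. Standardized coefficients are only a rough guide to importance, especially when predictors are correlated.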

Exercise 4: Categorical Variables in Regression with Effect Coding

Dataset: You have a dataset with variables Work_Experience, Education_Level, Gender, and Salary. The Gender variable is categorical with levels “Male” and “Female”.

Dataset Creation:

# Create the dataset with a larger sample size
set.seed(400)
data_ex4 <- data.frame(
  Work_Experience = c(5, 7, 3, 6, 8, 4, 9, 6, 7, 5, 8, 9, 4, 6, 7, 5, 9, 10, 6, 7, 4, 5, 7, 6, 8, 9, 10, 5, 6, 8),
  Education_Level = c(12, 14, 10, 16, 13, 15, 17, 12, 16, 14, 18, 19, 11, 14, 15, 13, 18, 20, 14, 15, 11, 13, 15, 14, 17, 18, 19, 13, 15, 17),
  Gender = factor(rep(c("Male", "Female"), 15))
)

# Effect coding for Gender: -1 for Male, 1 for Female
data_ex4$Gender_Effect <- ifelse(data_ex4$Gender == "Male", -1, 1)

# Create Salary with stronger relationships to predictors for significance
data_ex4$Salary <- 30000 + 3000 * data_ex4$Work_Experience + 1500 * data_ex4$Education_Level + 5000 * data_ex4$Gender_Effect + rnorm(30, 0, 2000)

# View the first few rows of the dataset
head(data_ex4)
##   Work_Experience Education_Level Gender Gender_Effect   Salary
## 1               5              12   Male            -1 55926.90
## 2               7              14 Female             1 78230.57
## 3               3              10   Male            -1 51945.87
## 4               6              16 Female             1 75634.63
## 5               8              13   Male            -1 67296.32
## 6               4              15 Female             1 66794.78

Task:

1. Conduct a multiple regression analysis to predict Salary using Work_Experience, Education_Level, and Gender_Effect as predictors.

2. Interpret the coefficients, especially focusing on the effect of Gender_Effect.

3. Discuss how effect coding impacts the interpretation of the Gender_Effect variable.

# Multiple regression model with effect coding
model_ex4 <- lm(Salary ~ Work_Experience + Education_Level + Gender_Effect, data = data_ex4)
summary(model_ex4)
## 
## Call:
## lm(formula = Salary ~ Work_Experience + Education_Level + Gender_Effect, 
##     data = data_ex4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4401.7 -1568.7   165.7  1265.8  3439.3 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      30426.2     2574.8  11.817 5.89e-12 ***
## Work_Experience   3501.8      434.2   8.064 1.52e-08 ***
## Education_Level   1239.9      317.0   3.912 0.000588 ***
## Gender_Effect     4823.7      384.0  12.563 1.51e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2025 on 26 degrees of freedom
## Multiple R-squared:  0.9688, Adjusted R-squared:  0.9652 
## F-statistic:   269 on 3 and 26 DF,  p-value: < 2.2e-16
  • Interpretation of Main Effects:
    • Work_Experience: The coefficient is 3501.8, and it’s very significant (p < 0.001). Holding the other predictors constant, each additional year of work experience is associated with a salary about $3,502 higher.

    • Education_Level: The coefficient is 1239.9, and it’s also significant (p = 0.0006). Holding the other predictors constant, each additional unit of education is associated with a salary around $1,240 higher.

    • Gender_Effect: The coefficient is 4823.7, and it’s very significant (p < 0.001). With Gender coded -1 for males and 1 for females, this is the distance of each group from the average of the two groups, so the female vs. male difference is twice the coefficient: females are predicted to earn about $9,647 more than males, all else being equal.

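  • Interpretation of Categorical Variables with Effect Coding: In this regression, Gender was coded as -1 for males and 1 for females. With this coding, the intercept refers to the average of the two gender groups (at zero on the other predictors) rather than to one reference group, and the Gender_Effect coefficient is each group’s deviation from that average.
    • Gender_Effect: The positive coefficient (4823.7) means females are predicted to earn $4,823.70 more than the average of the two groups, while males are predicted to earn $4,823.70 less, holding Work_Experience and Education_Level constant. So in this dataset, women make about 2 × 4823.7 ≈ $9,647 more than men.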
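
The same effect coding can also be requested from R directly through sum-to-zero contrasts on the factor, instead of building the -1/1 column by hand. The sketch below is one possible check, not the lab’s required method. Gender_sum, model_manual, model_sum, new_people, and gap are names introduced here for illustration. Because factor levels are sorted alphabetically, contr.sum(2) assigns 1 to Female and -1 to Male, which matches the manual coding.

```r
# Rebuild the Exercise 4 data with the same seed as above
set.seed(400)
data_ex4 <- data.frame(
  Work_Experience = c(5, 7, 3, 6, 8, 4, 9, 6, 7, 5, 8, 9, 4, 6, 7,
                      5, 9, 10, 6, 7, 4, 5, 7, 6, 8, 9, 10, 5, 6, 8),
  Education_Level = c(12, 14, 10, 16, 13, 15, 17, 12, 16, 14, 18, 19, 11, 14, 15,
                      13, 18, 20, 14, 15, 11, 13, 15, 14, 17, 18, 19, 13, 15, 17),
  Gender = factor(rep(c("Male", "Female"), 15))
)
data_ex4$Gender_Effect <- ifelse(data_ex4$Gender == "Male", -1, 1)
data_ex4$Salary <- 30000 + 3000 * data_ex4$Work_Experience +
  1500 * data_ex4$Education_Level + 5000 * data_ex4$Gender_Effect +
  rnorm(30, 0, 2000)

# Manual effect coding, as in the exercise
model_manual <- lm(Salary ~ Work_Experience + Education_Level + Gender_Effect,
                   data = data_ex4)

# Same coding via sum-to-zero contrasts on a copy of the factor
data_ex4$Gender_sum <- data_ex4$Gender
contrasts(data_ex4$Gender_sum) <- contr.sum(2)
contrasts(data_ex4$Gender_sum)   # Female = 1, Male = -1
model_sum <- lm(Salary ~ Work_Experience + Education_Level + Gender_sum,
                data = data_ex4)
coef(model_sum)["Gender_sum1"]   # same value as Gender_Effect

# Predicted female minus male gap at identical experience and education
new_people <- data.frame(Work_Experience = 6, Education_Level = 14,
                         Gender_Effect = c(-1, 1))
gap <- unname(diff(predict(model_manual, newdata = new_people)))
gap                              # equals 2 * the Gender_Effect coefficient
```

The predicted gap does not depend on the values chosen for Work_Experience and Education_Level, because the model has no interaction terms.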

Submission Instructions:

Be sure to knit your document to HTML and check that all content displays correctly before submission.