This lab will focus on conducting multiple regression analyses and interpreting the coefficients (main effects) with a special emphasis on handling categorical variables using effect coding. You will work with various datasets to predict different outcomes, interpret the results, and understand how effect coding influences the interpretation of categorical variables.
Dataset: You are given a dataset with variables
Work_Hours, Job_Complexity,
Salary, and Job_Satisfaction. Your task is to
predict Job_Satisfaction based on the other three
predictors.
Dataset Creation:
# Create the dataset
set.seed(100)
data_ex1 <- data.frame(
Work_Hours = c(40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41, 40, 35, 45, 50, 38, 42, 48, 37, 44, 41),
Job_Complexity = c(7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8, 7, 6, 8, 9, 5, 7, 8, 6, 7, 8),
Salary = c(50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500, 50000, 48000, 52000, 55000, 47000, 51000, 53000, 46000, 54000, 49500),
Job_Satisfaction = c(78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76, 78, 72, 85, 80, 70, 82, 79, 75, 81, 76)
)
# View the first few rows of the dataset
head(data_ex1)
## Work_Hours Job_Complexity Salary Job_Satisfaction
## 1 40 7 50000 78
## 2 35 6 48000 72
## 3 45 8 52000 85
## 4 50 9 55000 80
## 5 38 5 47000 70
## 6 42 7 51000 82
Task:
1. Conduct a multiple regression analysis to predict
Job_Satisfaction using Work_Hours,
Job_Complexity, and Salary as predictors. Be
sure to use the data argument in the lm()
function.
2. Interpret the main effects of each predictor. What does each
coefficient tell you about its relationship with
Job_Satisfaction?
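One way to approach the first task is sketched below. This is a minimal sketch rather than an official solution: the object name model_ex1 is an assumption (the lab does not name the model), and the data are rebuilt with rep() only so the block runs on its own, assuming the vectors above are the same ten rows repeated ten times.

```r
# Rebuild the Exercise 1 data so this block is self-contained.
# Assumption: the original vectors are the same ten rows repeated ten times.
data_ex1 <- data.frame(
  Work_Hours       = rep(c(40, 35, 45, 50, 38, 42, 48, 37, 44, 41), 10),
  Job_Complexity   = rep(c(7, 6, 8, 9, 5, 7, 8, 6, 7, 8), 10),
  Salary           = rep(c(50000, 48000, 52000, 55000, 47000,
                           51000, 53000, 46000, 54000, 49500), 10),
  Job_Satisfaction = rep(c(78, 72, 85, 80, 70, 82, 79, 75, 81, 76), 10)
)

# Fit the multiple regression, passing the data frame via the data argument
# (model_ex1 is a hypothetical name chosen for this sketch)
model_ex1 <- lm(Job_Satisfaction ~ Work_Hours + Job_Complexity + Salary,
                data = data_ex1)

# Coefficients, standard errors, t values, and p values
summary(model_ex1)
```

Each slope in the summary is the expected change in Job_Satisfaction for a one-unit increase in that predictor, holding the other two constant. Because the rows repeat, the standard errors are likely smaller than ten independent observations would justify, so the p-values should be read with caution.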
# Create the dataset with a larger sample size
set.seed(200)
data_ex2 <- data.frame(
Study_Hours = c(15, 12, 20, 18, 14, 17, 16, 13, 19, 14, 18, 16, 21, 13, 15, 20, 19, 18, 17, 16, 12, 14, 13, 20, 21, 22, 17, 19, 15, 16),
Attendance = c(90, 85, 95, 92, 88, 91, 89, 87, 93, 86, 91, 89, 95, 87, 90, 96, 94, 93, 89, 90, 85, 88, 87, 95, 96, 97, 92, 94, 88, 89),
Parent_Education_Level = factor(rep(c("High School", "College"), 15))
)
# Effect coding for Parent_Education_Level: -1 for High School, 1 for College
data_ex2$Parent_Education_Level <- ifelse(data_ex2$Parent_Education_Level == "High School", -1, 1)
# Create GPA with stronger relationships to predictors for significance
data_ex2$GPA <- 2.5 + 0.07 * data_ex2$Study_Hours + 0.03 * data_ex2$Attendance + 0.4 * data_ex2$Parent_Education_Level + rnorm(30, 0, 0.1)
# View the first few rows of the dataset
head(data_ex2)
## Study_Hours Attendance Parent_Education_Level GPA
## 1 15 90 -1 5.858476
## 2 12 85 1 6.312646
## 3 20 95 -1 6.393256
## 4 18 92 1 6.975807
## 5 14 88 -1 5.725976
## 6 17 91 1 6.808536
Task:
1. Conduct a multiple regression analysis to predict GPA
using Study_Hours, Attendance, and
Parent_Education_Level (coded as -1 for “High School” and 1
for “College”) as predictors.
2. Interpret the main effects. How does each predictor contribute to predicting GPA?
# Multiple regression model
model_ex2 <- lm(GPA ~ Study_Hours + Attendance + Parent_Education_Level, data = data_ex2)
summary(model_ex2)
##
## Call:
## lm(formula = GPA ~ Study_Hours + Attendance + Parent_Education_Level,
## data = data_ex2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.208221 -0.058093 -0.001553 0.040198 0.143841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.79220 1.32615 1.351 0.1882
## Study_Hours 0.04868 0.02267 2.147 0.0413 *
## Attendance 0.04170 0.01862 2.239 0.0339 *
## Parent_Education_Level 0.40410 0.01574 25.669 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08593 on 26 degrees of freedom
## Multiple R-squared: 0.9731, Adjusted R-squared: 0.97
## F-statistic: 313.4 on 3 and 26 DF, p-value: < 2.2e-16
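The manual -1/1 coding used for Parent_Education_Level can be cross-checked against R's built-in sum-to-zero contrasts. The sketch below rebuilds the data under hypothetical names (data_chk, Parent_Edu, Parent_Effect, m_manual, m_factor are assumptions introduced here, not part of the lab) and compares the two fits.

```r
# Rebuild the Exercise 2 data, keeping the factor alongside the manual code
set.seed(200)
data_chk <- data.frame(
  Study_Hours = c(15, 12, 20, 18, 14, 17, 16, 13, 19, 14, 18, 16, 21, 13, 15,
                  20, 19, 18, 17, 16, 12, 14, 13, 20, 21, 22, 17, 19, 15, 16),
  Attendance  = c(90, 85, 95, 92, 88, 91, 89, 87, 93, 86, 91, 89, 95, 87, 90,
                  96, 94, 93, 89, 90, 85, 88, 87, 95, 96, 97, 92, 94, 88, 89),
  Parent_Edu  = factor(rep(c("High School", "College"), 15))
)

# Manual effect code, as in the lab: -1 for High School, 1 for College
data_chk$Parent_Effect <- ifelse(data_chk$Parent_Edu == "High School", -1, 1)
data_chk$GPA <- 2.5 + 0.07 * data_chk$Study_Hours + 0.03 * data_chk$Attendance +
  0.4 * data_chk$Parent_Effect + rnorm(30, 0, 0.1)

# Model 1: numeric effect-coded predictor
m_manual <- lm(GPA ~ Study_Hours + Attendance + Parent_Effect, data = data_chk)

# Model 2: factor with sum-to-zero contrasts.
# contr.sum(2) codes the first level ("College") as 1 and the last
# ("High School") as -1, matching the manual coding.
contrasts(data_chk$Parent_Edu) <- contr.sum(2)
m_factor <- lm(GPA ~ Study_Hours + Attendance + Parent_Edu, data = data_chk)

b_manual <- unname(coef(m_manual)["Parent_Effect"])
b_factor <- unname(coef(m_factor)[4])  # fourth coefficient is the contrast

all.equal(b_manual, b_factor)  # the two codings give the same estimate
2 * b_manual                   # College minus High School difference
```

The coefficient is each group's deviation from the mean of the two groups, so the gap between the groups is twice the coefficient.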
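Study_Hours: Holding attendance and parent education constant, each additional hour of study is associated with a 0.049-point increase in predicted GPA. The effect is statistically significant at the .05 level (p = 0.041).
Attendance: Holding the other predictors constant, each 1-percentage-point increase in attendance is associated with a 0.042-point increase in predicted GPA. It is also significant at the .05 level (p = 0.034).
Parent_Education_Level: Because this variable is effect coded (-1, 1), the coefficient of 0.404 is each group's deviation from the mean of the two groups, holding study hours and attendance constant. Students whose parents went to college have a predicted GPA 0.404 points above that mean, and students whose parents only finished high school have a predicted GPA 0.404 points below it, so the two groups differ by about 0.81 points. This effect is highly significant (p < 2e-16).
Dataset: You are provided with a dataset containing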
Exercise_Frequency, Diet_Quality,
Sleep_Duration, and Health_Index. Your task is
to predict Health_Index based on the other predictors.
Dataset Creation:
# Create the dataset with a larger sample size
set.seed(300)
data_ex3 <- data.frame(
Exercise_Frequency = c(4, 5, 3, 6, 2, 5, 4, 3, 5, 4, 6, 7, 3, 6, 2, 5, 7, 8, 4, 5, 3, 6, 7, 2, 4, 5, 6, 3, 7, 8),
Diet_Quality = c(8, 7, 9, 6, 5, 8, 7, 6, 8, 7, 9, 8, 6, 7, 5, 8, 9, 7, 8, 7, 9, 6, 8, 5, 7, 6, 9, 8, 7, 6),
Sleep_Duration = c(7, 8, 6, 7, 5, 8, 7, 6, 7, 7, 8, 7, 6, 7, 5, 8, 7, 8, 6, 7, 6, 7, 8, 5, 7, 8, 7, 6, 7, 8)
)
# Create Health_Index with stronger relationships to predictors for significance
data_ex3$Health_Index <- 50 + 2 * data_ex3$Exercise_Frequency + 1.5 * data_ex3$Diet_Quality + 1 * data_ex3$Sleep_Duration + rnorm(30, 0, 2)
# View the first few rows of the dataset
head(data_ex3)
## Exercise_Frequency Diet_Quality Sleep_Duration Health_Index
## 1 4 8 7 79.74758
## 2 5 7 8 80.22421
## 3 3 9 6 76.44698
## 4 6 6 7 79.40253
## 5 2 5 5 66.32989
## 6 5 8 8 83.13740
Task:
1. Conduct a multiple regression analysis to predict
Health_Index using Exercise_Frequency,
Diet_Quality, and Sleep_Duration as
predictors.
2. How do the coefficients inform you about the relative importance of each predictor in determining health outcomes?
# Multiple regression model
model_ex3 <- lm(Health_Index ~ Exercise_Frequency + Diet_Quality + Sleep_Duration, data = data_ex3)
summary(model_ex3)
##
## Call:
## lm(formula = Health_Index ~ Exercise_Frequency + Diet_Quality +
## Sleep_Duration, data = data_ex3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.11901 -1.17265 0.03783 1.31807 2.86568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.3914 2.9595 15.675 9.15e-15 ***
## Exercise_Frequency 1.8387 0.2829 6.500 6.87e-07 ***
## Diet_Quality 1.8522 0.2685 6.898 2.53e-07 ***
## Sleep_Duration 1.3717 0.5448 2.518 0.0183 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.684 on 26 degrees of freedom
## Multiple R-squared: 0.9225, Adjusted R-squared: 0.9136
## F-statistic: 103.2 on 3 and 26 DF, p-value: 1.467e-14
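The raw coefficients above are in the units of each predictor, so they are hard to compare directly. One common way to put them on a shared scale is to standardize every variable and refit. The sketch below does that under hypothetical names (data_std, data_z, m_raw, m_z are assumptions for illustration); it is one possible approach to the relative-importance question, not the lab's prescribed method.

```r
# Rebuild the Exercise 3 data so this block is self-contained
set.seed(300)
data_std <- data.frame(
  Exercise_Frequency = c(4, 5, 3, 6, 2, 5, 4, 3, 5, 4, 6, 7, 3, 6, 2,
                         5, 7, 8, 4, 5, 3, 6, 7, 2, 4, 5, 6, 3, 7, 8),
  Diet_Quality       = c(8, 7, 9, 6, 5, 8, 7, 6, 8, 7, 9, 8, 6, 7, 5,
                         8, 9, 7, 8, 7, 9, 6, 8, 5, 7, 6, 9, 8, 7, 6),
  Sleep_Duration     = c(7, 8, 6, 7, 5, 8, 7, 6, 7, 7, 8, 7, 6, 7, 5,
                         8, 7, 8, 6, 7, 6, 7, 8, 5, 7, 8, 7, 6, 7, 8)
)
data_std$Health_Index <- 50 + 2 * data_std$Exercise_Frequency +
  1.5 * data_std$Diet_Quality + 1 * data_std$Sleep_Duration + rnorm(30, 0, 2)

# Raw-unit model, as in the lab
m_raw <- lm(Health_Index ~ Exercise_Frequency + Diet_Quality + Sleep_Duration,
            data = data_std)

# z-score every column (mean 0, SD 1), then refit on the standardized data
data_z <- as.data.frame(scale(data_std))
m_z <- lm(Health_Index ~ Exercise_Frequency + Diet_Quality + Sleep_Duration,
          data = data_z)

coef(m_raw)[-1]  # change in Health_Index per one raw unit of each predictor
coef(m_z)[-1]    # change in SDs of Health_Index per one SD of each predictor
```

Each standardized slope equals the raw slope multiplied by sd(predictor) / sd(Health_Index), so a predictor with little spread in the sample can have a large raw coefficient but a modest standardized one.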
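Exercise_Frequency: Holding diet quality and sleep duration constant, each one-unit increase in exercise frequency is associated with an increase of about 1.84 points in Health_Index. The effect is highly significant (p < .001), so more frequent exercise is a reliable predictor of better health in this dataset.
Diet_Quality: Holding the other predictors constant, each 1-point increase in diet quality is associated with an increase of about 1.85 points in Health_Index. It is also highly significant (p < .001), meaning that a healthier diet is strongly associated with better health.
Sleep_Duration: Holding the other predictors constant, each additional hour of sleep is associated with a 1.37-point increase in Health_Index. It is statistically significant (p = 0.018), but it is estimated less precisely (standard error 0.54) than the exercise and diet effects. Because the predictors are measured in different units, the raw coefficients alone do not rank their importance; comparing t-values or standardized coefficients is more informative.
Dataset: You have a dataset with variables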
Work_Experience, Education_Level,
Gender, and Salary. The Gender
variable is categorical with levels “Male” and “Female”.
Dataset Creation:
# Create the dataset with a larger sample size
set.seed(400)
data_ex4 <- data.frame(
Work_Experience = c(5, 7, 3, 6, 8, 4, 9, 6, 7, 5, 8, 9, 4, 6, 7, 5, 9, 10, 6, 7, 4, 5, 7, 6, 8, 9, 10, 5, 6, 8),
Education_Level = c(12, 14, 10, 16, 13, 15, 17, 12, 16, 14, 18, 19, 11, 14, 15, 13, 18, 20, 14, 15, 11, 13, 15, 14, 17, 18, 19, 13, 15, 17),
Gender = factor(rep(c("Male", "Female"), 15))
)
# Effect coding for Gender: -1 for Male, 1 for Female
data_ex4$Gender_Effect <- ifelse(data_ex4$Gender == "Male", -1, 1)
# Create Salary with stronger relationships to predictors for significance
data_ex4$Salary <- 30000 + 3000 * data_ex4$Work_Experience + 1500 * data_ex4$Education_Level + 5000 * data_ex4$Gender_Effect + rnorm(30, 0, 2000)
# View the first few rows of the dataset
head(data_ex4)
## Work_Experience Education_Level Gender Gender_Effect Salary
## 1 5 12 Male -1 55926.90
## 2 7 14 Female 1 78230.57
## 3 3 10 Male -1 51945.87
## 4 6 16 Female 1 75634.63
## 5 8 13 Male -1 67296.32
## 6 4 15 Female 1 66794.78
Task:
1. Conduct a multiple regression analysis to predict
Salary using Work_Experience,
Education_Level, and Gender_Effect as
predictors.
2. Interpret the coefficients, especially focusing on the effect of
Gender_Effect.
3. Discuss how effect coding impacts the interpretation of the
Gender_Effect variable.
# Multiple regression model with effect coding
model_ex4 <- lm(Salary ~ Work_Experience + Education_Level + Gender_Effect, data = data_ex4)
summary(model_ex4)
##
## Call:
## lm(formula = Salary ~ Work_Experience + Education_Level + Gender_Effect,
## data = data_ex4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4401.7 -1568.7 165.7 1265.8 3439.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30426.2 2574.8 11.817 5.89e-12 ***
## Work_Experience 3501.8 434.2 8.064 1.52e-08 ***
## Education_Level 1239.9 317.0 3.912 0.000588 ***
## Gender_Effect 4823.7 384.0 12.563 1.51e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2025 on 26 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9652
## F-statistic: 269 on 3 and 26 DF, p-value: < 2.2e-16
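To see how effect coding changes what the gender coefficient and intercept mean, the same model can be refit with R's default dummy (treatment) coding and the two sets of estimates compared. The sketch below uses hypothetical names (data_cmp, m_effect, m_dummy, and the gap_* variables are assumptions for illustration) and assumes R's default contrast options.

```r
# Rebuild the Exercise 4 data so this block is self-contained
set.seed(400)
data_cmp <- data.frame(
  Work_Experience = c(5, 7, 3, 6, 8, 4, 9, 6, 7, 5, 8, 9, 4, 6, 7,
                      5, 9, 10, 6, 7, 4, 5, 7, 6, 8, 9, 10, 5, 6, 8),
  Education_Level = c(12, 14, 10, 16, 13, 15, 17, 12, 16, 14, 18, 19, 11, 14, 15,
                      13, 18, 20, 14, 15, 11, 13, 15, 14, 17, 18, 19, 13, 15, 17),
  Gender          = factor(rep(c("Male", "Female"), 15))
)
# Effect coding: -1 for Male, 1 for Female
data_cmp$Gender_Effect <- ifelse(data_cmp$Gender == "Male", -1, 1)
data_cmp$Salary <- 30000 + 3000 * data_cmp$Work_Experience +
  1500 * data_cmp$Education_Level + 5000 * data_cmp$Gender_Effect +
  rnorm(30, 0, 2000)

# Effect-coded model, as in the lab
m_effect <- lm(Salary ~ Work_Experience + Education_Level + Gender_Effect,
               data = data_cmp)

# Dummy-coded model: with default treatment contrasts, "Female" (first level
# alphabetically) is the reference and the coefficient is named "GenderMale"
m_dummy <- lm(Salary ~ Work_Experience + Education_Level + Gender,
              data = data_cmp)

b_effect <- unname(coef(m_effect)["Gender_Effect"])
b_dummy  <- unname(coef(m_dummy)["GenderMale"])

gap_effect <- 2 * b_effect  # Female minus Male, from the effect-coded model
gap_dummy  <- -b_dummy      # Female minus Male, from the dummy-coded model
all.equal(gap_effect, gap_dummy)

# Intercepts: average of the two groups vs. the Female reference group
coef(m_effect)["(Intercept)"]
coef(m_dummy)["(Intercept)"]
```

Both models make identical predictions. Only the reference point moves: the effect-coded coefficient is half the female-male gap, and the effect-coded intercept sits midway between the two groups instead of at the reference group.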
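Work_Experience: The coefficient is 3501.8, meaning that, holding education and gender constant, each additional year of work experience is associated with a $3,501.80 increase in salary. The t-value of 8.064 and p-value of 1.52e-08 indicate that the predictor is highly significant.
Education_Level: The coefficient is 1239.9, meaning that, holding the other predictors constant, each additional year of education is associated with a $1,239.90 increase in salary. The t-value of 3.912 and p-value of 0.000588 confirm that it is a significant predictor of salary.
Gender_Effect: The coefficient is 4823.7. Because Gender is effect coded (-1 for Male, 1 for Female), this is each group's deviation from the mean of the two groups, holding experience and education constant: females are predicted to earn $4,823.70 above that mean and males $4,823.70 below it. The difference in salary between females and males is therefore twice the coefficient, about $9,647.40, with females earning more. The t-value of 12.563 and p-value of 1.51e-12 indicate that gender is a highly significant factor in salary differences within this dataset.
Under effect coding, the intercept (30426.2) refers to the average of the two gender groups at zero experience and education, rather than to a single reference group as it would under dummy coding.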