Replace “Your Name” with your actual name.
In this lab assignment, you will apply what you’ve learned about correlation by calculating, visualizing, and interpreting correlations using different datasets. Make sure to follow the instructions carefully and complete all parts of each exercise. You will also practice creating and interpreting bivariate linear models using various datasets. For each exercise, follow the provided instructions to build a linear model, interpret the slope, and analyze the residuals.
Scenario: You are examining the relationship between the number of hours spent exercising per week and self-reported levels of happiness. The data is as follows:
Exercise hours per week: c(1, 3, 4, 6, 8, 10)
Happiness scores: c(50, 55, 60, 65, 70, 75)
Tasks:
1. Calculate the Pearson correlation coefficient using R.
2. Interpret the correlation coefficient in the context of the relationship between exercise hours and happiness scores.
# Sample data
exercise_hours <- c(1, 3, 4, 6, 8, 10)
happiness_scores <- c(50, 55, 60, 65, 70, 75)
# Calculate Pearson's correlation coefficient
cor.test(exercise_hours, happiness_scores)
##
## Pearson's product-moment correlation
##
## data: exercise_hours and happiness_scores
## t = 22.895, df = 4, p-value = 2.156e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9641148 0.9996047
## sample estimates:
## cor
## 0.9962062
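The coefficient can also be pulled out of the test object directly, which may help with task 2. The sketch below reuses the sample data above; the names `result`, `r`, and `r_squared` are illustrative and not part of the original lab.

```r
# Sample data from the exercise
exercise_hours <- c(1, 3, 4, 6, 8, 10)
happiness_scores <- c(50, 55, 60, 65, 70, 75)

# Store the test so individual pieces can be extracted
result <- cor.test(exercise_hours, happiness_scores)

r <- unname(result$estimate)   # Pearson's r
r_squared <- r^2               # proportion of shared variance

round(r, 3)
round(r_squared, 3)
```

An r of about 0.996 is a very strong positive correlation: in this sample, people who exercised more reported higher happiness, and the two variables share roughly 99% of their variance.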
Scenario: You are investigating the relationship between daily water intake (in liters) and energy levels throughout the day. The data is as follows:
Water intake (liters): c(0.5, 1, 1.5, 2, 2.5, 3)
Energy levels: c(40, 50, 60, 65, 70, 80)
Tasks:
1. Create a scatter plot using ggplot2 in R to visualize the relationship between water intake and energy levels.
2. Add a trend line to the scatter plot to show the direction and strength of the correlation.
R Code Chunk:
library(ggplot2)
# Sample data
water_intake <- c(0.5, 1, 1.5, 2, 2.5, 3)
energy_levels <- c(40, 50, 60, 65, 70, 80)
water_data <- data.frame(water_intake = water_intake, energy_levels = energy_levels)
# Create scatter plot with trend line
ggplot(water_data, aes(x = water_intake, y = energy_levels)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Scenario: You have calculated the Pearson correlation coefficient for two variables: screen time per day and sleep duration per night. The correlation coefficient is \(r = -0.4\).
Tasks:
1. Interpret the size of the correlation.
2. Discuss the practical significance of this correlation in the context of the relationship between screen time and sleep duration.
#1 The correlation shows a moderate inverse relationship: as screen time increases, sleep duration tends to decrease.
#2 In practical terms, people who want to sleep longer may benefit from cutting back on screen time. Because the relationship is only moderate, screen time is one of several factors tied to sleep duration, and the correlation alone does not show that reducing screen time causes longer sleep.
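One way to put a number on "moderate" is to square the coefficient, which gives the proportion of variance the two variables share. This is a small worked calculation using the r from the scenario; the variable names are illustrative.

```r
r <- -0.4          # correlation given in the scenario
r_squared <- r^2   # proportion of variance shared by the two variables
r_squared          # 0.16, i.e. about 16%
```

Roughly 16% of the variation in sleep duration is shared with screen time, which leaves most of the variation associated with other factors.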
Scenario: You observe a correlation between outdoor time and academic performance in a group of students. However, you suspect that physical health might be influencing both.
Tasks:
1. Discuss how physical health could act as a confounding variable affecting both outdoor time and academic performance.
2. Suggest methods to control for this third variable in future research.
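#1 Physical health could act as a confounding variable because it plausibly affects both measures. A student in good physical health can more freely and confidently go outside and be active, and is also better placed to do well in school. A student who is ill may be unable to spend time outside and may also perform worse in school, which would produce a correlation between outdoor time and academic performance even if neither affects the other.
#2 Measure physical health and run a multiple regression that includes it as a predictor alongside outdoor time, so that its effect is statistically controlled.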
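A minimal sketch of what controlling for a third variable with multiple regression could look like. All data here are simulated and hypothetical: the variable names, sample size, and effect sizes are assumptions chosen so that physical health drives both outdoor time and grades, while outdoor time has no direct effect.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 500

# Hypothetical data: physical health drives both variables,
# and outdoor time has no direct effect on grades
physical_health <- rnorm(n)
outdoor_time <- 2 + 0.8 * physical_health + rnorm(n)
grades <- 70 + 5 * physical_health + rnorm(n, sd = 5)
students <- data.frame(physical_health, outdoor_time, grades)

# Model without the confounder
unadjusted <- lm(grades ~ outdoor_time, data = students)

# Model that controls for physical health
adjusted <- lm(grades ~ outdoor_time + physical_health, data = students)

coef(unadjusted)["outdoor_time"]
coef(adjusted)["outdoor_time"]
```

In this simulated setup the outdoor time slope should shrink toward zero once physical health is in the model, which is the pattern expected when a confounder explains the original association.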
Scenario: A study finds a strong positive correlation between eating breakfast and overall cognitive performance in children. However, the study does not examine causality.
Tasks:
1. Discuss why the correlation between eating breakfast and cognitive performance does not necessarily imply that eating breakfast causes better cognitive performance.
2. Provide specific examples of how you could design a study to explore causality between these variables.
#1 A third variable could be responsible for both. For example, a person who exercises or is generally health-conscious might be more likely to eat breakfast and might also have better cognitive performance simply because of better general health.
#2 You could randomly assign participants to a group that never eats breakfast or a group that always eats breakfast, then give both groups cognitive tests over time and compare their scores.
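The two-group design described above could be sketched as follows. The data are simulated and hypothetical: the group labels, the sample size, and the 3-point breakfast benefit are assumptions made only to show random assignment and a group comparison.

```r
set.seed(2)  # arbitrary seed, for reproducibility only
n <- 100

# Randomly assign half of the children to each condition
group <- sample(rep(c("breakfast", "no_breakfast"), each = n / 2))

# Hypothetical cognitive scores; the 3-point benefit is an assumption
score <- 100 + 3 * (group == "breakfast") + rnorm(n, sd = 10)
trial <- data.frame(group, score)

table(trial$group)

# Compare mean scores between the two randomly assigned groups
t.test(score ~ group, data = trial)
```

Random assignment is the key design choice: it balances health, sleep, and other background factors across groups on average, so a difference in scores can more plausibly be attributed to breakfast itself.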
“For every one unit increase in (X/IV), Y (Outcome/DV) increases/decreases by (SLOPE VALUE).”
For the above format, it will be EITHER increases OR decreases depending on the sign of the slope (positive or negative).
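The sentence format above can be filled in programmatically. The helper below is a hypothetical convenience function, not part of the lab; `interpret_slope`, `x_label`, and `y_label` are made-up names, and the toy data are invented for illustration.

```r
# Hypothetical helper: builds the interpretation sentence from a
# bivariate lm() fit. x_label and y_label are plain-English names.
interpret_slope <- function(model, x_label, y_label) {
  slope <- unname(coef(model)[2])  # second coefficient is the slope
  direction <- if (slope >= 0) "increases" else "decreases"
  sprintf("For every one unit increase in %s, %s %s by %.2f.",
          x_label, y_label, direction, abs(slope))
}

# Small made-up example with a negative slope
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(10, 8, 7, 4, 3))
toy_model <- lm(y ~ x, data = toy)
interpret_slope(toy_model, "x", "y")
```

The sign of the slope picks the verb, and the absolute value is reported so the sentence never reads "decreases by a negative number".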
Dataset: You are provided with a dataset that includes the number of hours people spend exercising daily and their happiness scores on a scale from 0 to 100.
Simulate the Data:
set.seed(101)
daily_exercise <- rnorm(100, mean = 1, sd = 0.5) # Hours of daily exercise
happiness <- 50 + 5 * daily_exercise + rnorm(100, mean = 0, sd = 5) # Happiness scores
# Combine into a data frame
exercise_data <- data.frame(daily_exercise, happiness)
# View the first few rows
head(exercise_data)
## daily_exercise happiness
## 1 0.8369818 55.52524
## 2 1.2762309 53.42011
## 3 0.6625281 63.98007
## 4 1.1071797 61.39964
## 5 1.1553846 59.51073
## 6 1.5869831 56.78237
Tasks:
1. Create the Linear Model: Build a linear model to predict happiness based on daily exercise.
2. Interpret the Slope: Interpret the slope of the model using the provided format.
3. Plot the Residuals: Analyze the residuals and discuss whether the linear model is appropriate.
# Create the linear model
mod.1 <- lm(happiness ~ daily_exercise, data = exercise_data)
# View the summary of the model
summary(mod.1)
##
## Call:
## lm(formula = happiness ~ daily_exercise, data = exercise_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3174 -3.1136 -0.7943 3.1798 11.2470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.653 1.172 41.501 < 2e-16 ***
## daily_exercise 6.159 1.080 5.705 1.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.017 on 98 degrees of freedom
## Multiple R-squared: 0.2493, Adjusted R-squared: 0.2416
## F-statistic: 32.54 on 1 and 98 DF, p-value: 1.238e-07
# Interpret the slope
# The slope value is extracted from the summary output
# Plot the residuals
plot(mod.1$residuals)
abline(h = 0)
Slope Interpretation: “For every one-hour increase in daily exercise, happiness increases by 6.16 points.”
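Residuals Interpretation: The residuals are scattered above and below zero with no visible curve or funnel shape, so there is no sign of a problem and the linear model appears appropriate.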
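`plot(mod.1$residuals)` plots residuals in row order. A common alternative, sketched here as one option rather than the lab's required method, is to plot residuals against fitted values, which makes curvature or unequal spread easier to spot. The block re-creates the simulated data so it runs on its own; the axis labels are illustrative.

```r
# Re-create the simulated data and model from this exercise
set.seed(101)
daily_exercise <- rnorm(100, mean = 1, sd = 0.5)
happiness <- 50 + 5 * daily_exercise + rnorm(100, mean = 0, sd = 5)
exercise_data <- data.frame(daily_exercise, happiness)
mod.1 <- lm(happiness ~ daily_exercise, data = exercise_data)

# Residuals against fitted values: look for curves or funnel shapes
plot(fitted(mod.1), resid(mod.1),
     xlab = "Fitted happiness", ylab = "Residual")
abline(h = 0, lty = 2)
```

A roughly even band of points around the dashed line is what a well-behaved linear fit tends to produce.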
Dataset: You have data on the number of hours of screen time before bed and sleep quality scores (0 to 100).
Simulate the Data:
set.seed(102)
screen_time <- rnorm(100, mean = 3, sd = 1) # Hours of screen time before bed
sleep_quality <- 80 - 4 * screen_time + rnorm(100, mean = 0, sd = 8) # Sleep quality scores
# Combine into a data frame
sleep_data <- data.frame(screen_time, sleep_quality)
# View the first few rows
head(sleep_data)
## screen_time sleep_quality
## 1 3.180523 80.24487
## 2 3.784734 53.43548
## 3 1.646835 53.13154
## 4 4.983298 62.08685
## 5 4.238472 61.68009
## 6 4.200617 63.42158
Tasks:
1. Create the Linear Model: Build a linear model to predict sleep quality based on screen time.
2. Interpret the Slope: Interpret the slope of the model using the provided format.
3. Plot the Residuals: Analyze the residuals and discuss whether the linear model is appropriate.
# Create the linear model
mod.2 <- lm(sleep_quality ~ screen_time, data = sleep_data)
# View the summary of the model
summary(mod.2)
##
## Call:
## lm(formula = sleep_quality ~ screen_time, data = sleep_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4357 -5.7620 -0.2427 5.9956 17.8661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.6638 2.6209 29.633 < 2e-16 ***
## screen_time -3.1779 0.7974 -3.986 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.495 on 98 degrees of freedom
## Multiple R-squared: 0.1395, Adjusted R-squared: 0.1307
## F-statistic: 15.88 on 1 and 98 DF, p-value: 0.0001296
# Interpret the slope
# The slope value is extracted from the summary output
# Plot the residuals
plot(mod.2$residuals)
abline(h = 0)
Interpretation: “For every one-hour increase in screen time before bed, sleep quality decreases by 3.18 points.”
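Residuals Interpretation: The residuals are scattered above and below zero with no visible curve or funnel shape, so there is no sign of a problem and the linear model appears appropriate.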
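To see what the slope means in terms of predictions, the model can be asked for predicted sleep quality at two screen-time values one hour apart. This is an illustrative sketch that re-creates the simulated data above; the choice of 2 and 3 hours and the name `predicted` are arbitrary.

```r
# Re-create the simulated data and model from this exercise
set.seed(102)
screen_time <- rnorm(100, mean = 3, sd = 1)
sleep_quality <- 80 - 4 * screen_time + rnorm(100, mean = 0, sd = 8)
sleep_data <- data.frame(screen_time, sleep_quality)
mod.2 <- lm(sleep_quality ~ screen_time, data = sleep_data)

# Predicted sleep quality at 2 and 3 hours of screen time
predicted <- predict(mod.2, newdata = data.frame(screen_time = c(2, 3)))
predicted

# The gap between the two predictions equals the slope
unname(diff(predicted))
unname(coef(mod.2)["screen_time"])
```

The difference between the two predictions matches the slope coefficient, which is exactly what "for every one unit increase" describes.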
Dataset: You are given data on the number of cups of coffee consumed daily and productivity scores at work (0 to 100).
Simulate the Data:
set.seed(103)
coffee_consumption <- rpois(100, lambda = 3) # Number of cups of coffee
productivity <- 60 + 2.5 * coffee_consumption + rnorm(100, mean = 0, sd = 7) # Productivity scores
# Combine into a data frame
coffee_data <- data.frame(coffee_consumption, productivity)
# View the first few rows
head(coffee_data)
## coffee_consumption productivity
## 1 2 63.92415
## 2 1 53.20904
## 3 3 74.02782
## 4 3 69.43269
## 5 1 68.47399
## 6 1 59.23412
Tasks:
1. Create the Linear Model: Build a linear model to predict productivity based on coffee consumption.
2. Interpret the Slope: Interpret the slope of the model using the provided format.
3. Plot the Residuals: Analyze the residuals and discuss whether the linear model is appropriate.
# Create the linear model
mod.3 <- lm(productivity ~ coffee_consumption, data = coffee_data)
# View the summary of the model
summary(mod.3)
##
## Call:
## lm(formula = productivity ~ coffee_consumption, data = coffee_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6485 -4.6556 -0.5024 5.0264 13.4723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.8890 1.4471 42.075 < 2e-16 ***
## coffee_consumption 2.5653 0.4348 5.901 5.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.947 on 98 degrees of freedom
## Multiple R-squared: 0.2621, Adjusted R-squared: 0.2546
## F-statistic: 34.82 on 1 and 98 DF, p-value: 5.191e-08
# Interpret the slope
# The slope value is extracted from the summary output
# Plot the residuals
plot(mod.3$residuals)
abline(h = 0)
Interpretation: “For every one-cup increase in daily coffee consumption, productivity increases by 2.57 points.”
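Residuals Interpretation: The residuals are scattered above and below zero with no clear pattern, so there is no clear issue and the linear model appears appropriate.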
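Because coffee consumption is a count, the residuals can also be grouped by number of cups. The sketch below is one possible extra check, not part of the original lab; it re-creates the simulated data, and the `residual` column name and axis labels are illustrative.

```r
# Re-create the simulated data and model from this exercise
set.seed(103)
coffee_consumption <- rpois(100, lambda = 3)
productivity <- 60 + 2.5 * coffee_consumption + rnorm(100, mean = 0, sd = 7)
coffee_data <- data.frame(coffee_consumption, productivity)
mod.3 <- lm(productivity ~ coffee_consumption, data = coffee_data)

# Attach residuals so they can be grouped by number of cups
coffee_data$residual <- resid(mod.3)

# One box per cup count; boxes centered near zero support linearity
boxplot(residual ~ coffee_consumption, data = coffee_data,
        xlab = "Cups of coffee per day", ylab = "Residual")
abline(h = 0, lty = 2)
```

If the boxes drifted steadily above or below the dashed line as cups increase, that would suggest the straight-line model is missing some curvature.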