Replace “Your Name” with your actual name.

Instructions

In this lab assignment, you will apply what you’ve learned about correlation by calculating, visualizing, and interpreting correlations using different datasets. Make sure to follow the instructions carefully and complete all parts of each exercise. You will also practice creating and interpreting bivariate linear models using various datasets. For each exercise, follow the provided instructions to build a linear model, interpret the slope, and analyze the residuals.

Correlations

Exercise 1: Calculating Pearson Correlation Coefficient

Scenario: You are examining the relationship between the number of hours spent exercising per week and self-reported levels of happiness. The data is as follows:

  • Exercise Hours: c(1, 3, 4, 6, 8, 10)
  • Happiness Scores: c(50, 55, 60, 65, 70, 75)

Tasks:

1. Calculate the Pearson correlation coefficient using R.

2. Interpret the correlation coefficient in the context of the relationship between exercise hours and happiness scores.

# Sample data
exercise_hours <- c(1, 3, 4, 6, 8, 10)
happiness_scores <- c(50, 55, 60, 65, 70, 75)

# Calculate Pearson's correlation coefficient
correlation_coefficient <- cor(exercise_hours, happiness_scores, method = "pearson")
correlation_coefficient
## [1] 0.9962062
  • Calculated Pearson Correlation Coefficient: 0.9962
  • Interpretation: The Pearson correlation coefficient of 0.9962 indicates a very strong positive linear relationship between weekly exercise hours and happiness scores. As exercise hours increase, self-reported happiness tends to increase in a nearly perfectly linear fashion.
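
As a cross-check on what cor() computes, the sketch below rebuilds Pearson's r from its definition: the sum of cross-products of the deviations, divided by the square root of the product of the two sums of squared deviations. The names x_dev, y_dev, and r_manual are illustrative and not part of the assignment.

```r
# Same data as in the exercise
exercise_hours <- c(1, 3, 4, 6, 8, 10)
happiness_scores <- c(50, 55, 60, 65, 70, 75)

# Deviation of each observation from its variable's mean
x_dev <- exercise_hours - mean(exercise_hours)
y_dev <- happiness_scores - mean(happiness_scores)

# Pearson's r: shared variation divided by the product of the spreads
r_manual <- sum(x_dev * y_dev) / sqrt(sum(x_dev^2) * sum(y_dev^2))

# Compare the hand calculation with the built-in function
all.equal(r_manual, cor(exercise_hours, happiness_scores))
round(r_manual, 4)
```

If the two values agree, the coefficient reported in the interpretation can be copied straight from the printed output.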

Exercise 2: Visualizing Correlation with ggplot2

Scenario: You are investigating the relationship between daily water intake (in liters) and energy levels throughout the day. The data is as follows:

  • Water Intake (liters): c(0.5, 1, 1.5, 2, 2.5, 3)
  • Energy Levels: c(40, 50, 60, 65, 70, 80)

Tasks:

1. Create a scatter plot using ggplot2 in R to visualize the relationship between water intake and energy levels.

2. Add a trend line to the scatter plot to show the direction and strength of the correlation.

R Code Chunk:

library(ggplot2)

# Sample data
water_intake <- c(0.5, 1, 1.5, 2, 2.5, 3)
energy_levels <- c(40, 50, 60, 65, 70, 80)

water_data <- data.frame(water_intake = water_intake, energy_levels = energy_levels)

# Create scatter plot with trend line
ggplot(water_data, aes(x = water_intake, y = energy_levels)) +
  geom_point(color = "blue", size = 3) +    # scatter points
  geom_smooth(method = "lm", color = "red", se = FALSE) +  # trend line
  labs(title = "Relationship between Water Intake and Energy Levels",
       x = "Water Intake (liters)",
       y = "Energy Levels") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

  • Scatter Plot Interpretation: The scatter plot shows a positive relationship between water intake and energy levels. As water intake increases, energy levels also tend to increase. This suggests that drinking more water is associated with higher energy levels.

Trend Line Interpretation: The red trend line is straight because geom_smooth(method = "lm") fits a linear regression model. The points lie close to the line, so the relationship between water intake and energy levels is approximately linear and strong. The upward slope shows the direction of the relationship; the strength of the correlation is reflected in how tightly the points cluster around the line, not in how steep the line is. The plot shows an association and does not by itself establish that drinking more water causes higher energy levels.
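
To put numbers on what the plot shows, one option is to compute the correlation and the fitted slope side by side. The sketch below also checks the identity slope = r * sd(y) / sd(x), which holds for any bivariate linear regression, and rescales the outcome to show that the slope depends on the units of measurement while r does not. The variable energy_rescaled is a made-up 0 to 10 version of the scores, used only for illustration.

```r
water_intake <- c(0.5, 1, 1.5, 2, 2.5, 3)
energy_levels <- c(40, 50, 60, 65, 70, 80)

# Correlation and regression slope for the original data
r <- cor(water_intake, energy_levels)
slope <- unname(coef(lm(energy_levels ~ water_intake))[2])

# In a bivariate regression the slope equals r * sd(y) / sd(x)
all.equal(slope, r * sd(energy_levels) / sd(water_intake))

# Rescale the outcome (illustrative): the slope shrinks, r stays the same
energy_rescaled <- energy_levels / 10
slope_rescaled <- unname(coef(lm(energy_rescaled ~ water_intake))[2])
r_rescaled <- cor(water_intake, energy_rescaled)

c(slope = slope, slope_rescaled = slope_rescaled)
c(r = r, r_rescaled = r_rescaled)
```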

Exercise 3: Analyzing the Size of Correlation

Scenario: You have calculated the Pearson correlation coefficient for two variables: screen time per day and sleep duration per night. The correlation coefficient is \(r = -0.4\).

Tasks:

1. Interpret the size of the correlation.

2. Discuss the practical significance of this correlation in the context of the relationship between screen time and sleep duration.

  • Answer:
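
One common way to judge the size of a correlation is to square it, which gives the proportion of variance in one variable that is linearly associated with the other. Under Cohen's widely cited benchmarks (roughly 0.1 small, 0.3 medium, 0.5 large in absolute value), r = -0.4 falls in the moderate range. The short sketch below does the arithmetic; the variable names are illustrative.

```r
r <- -0.4

# Proportion of variance shared by the two variables
r_squared <- r^2

# Expressed as a percentage for reporting
shared_variance_pct <- 100 * r_squared
shared_variance_pct
```

A shared variance of about 16% means most of the variation in sleep duration is left unexplained by screen time, which is worth weighing when discussing practical significance.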

Exercise 4: Impact of a Third Variable (Confounder)

Scenario: You observe a correlation between outdoor time and academic performance in a group of students. However, you suspect that physical health might be influencing both.

Tasks:

1. Discuss how physical health could act as a confounding variable affecting both outdoor time and academic performance.

2. Suggest methods to control for this third variable in future research.

  • Answer:
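
A small simulation can illustrate how a confounder produces a correlation between two variables that have no direct link. The data below are entirely hypothetical: physical health is made to drive both outdoor time and academic performance, and the seed and sample size are arbitrary. Statistical control is sketched as a partial correlation, computed by correlating the residuals left after regressing each variable on the confounder.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 1000

# Hypothetical data: physical health influences both variables;
# outdoor time has no direct effect on academic performance here
physical_health <- rnorm(n)
outdoor_time <- physical_health + rnorm(n)
academic_performance <- physical_health + rnorm(n)

# Raw correlation: positive even though there is no direct link
raw_r <- cor(outdoor_time, academic_performance)

# Partial correlation: correlate what remains of each variable
# after removing the part predicted by physical health
outdoor_resid <- resid(lm(outdoor_time ~ physical_health))
academic_resid <- resid(lm(academic_performance ~ physical_health))
partial_r <- cor(outdoor_resid, academic_resid)

round(c(raw = raw_r, partial = partial_r), 2)
```

In this setup the raw correlation is clearly positive while the partial correlation is close to zero, which is the pattern expected when a third variable accounts for the association.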

Exercise 5: Evaluating Correlation and Causality

Scenario: A study finds a strong positive correlation between eating breakfast and overall cognitive performance in children. However, the study does not examine causality.

Tasks:

1. Discuss why the correlation between eating breakfast and cognitive performance does not necessarily imply that eating breakfast causes better cognitive performance.

2. Provide specific examples of how you could design a study to explore causality between these variables.

  • Answer:
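
Random assignment is the standard design for exploring causality, because it balances other influences (sleep, family income, and so on) across groups on average. The sketch below shows only the mechanics: the sample size, the seed, and the simulated 5-point benefit are assumptions made up for illustration, and a real study would use measured scores instead.

```r
set.seed(2)  # arbitrary seed, for reproducibility only
n_children <- 100

# Randomly assign half of the children to each condition
condition <- factor(sample(rep(c("breakfast", "no_breakfast"),
                               each = n_children / 2)))

# Placeholder outcome: simulated scores with an assumed 5-point benefit
cognitive_score <- rnorm(n_children, mean = 100, sd = 15) +
  ifelse(condition == "breakfast", 5, 0)

# Group sizes and a comparison of group means
table(condition)
t.test(cognitive_score ~ condition)
```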

Bivariate Regression

Interpretation Reminder: Remember to follow the format for interpreting the slope:

“For every one unit increase in (X/IV), Y (Outcome/DV) increases/decreases by (SLOPE VALUE).”

For the above format, it will be EITHER increases OR decreases depending on the sign of the slope (positive or negative).
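
To keep the reported slope in step with the model output, the sentence can be built directly from the fitted model. The helper below, interpret_slope(), is hypothetical and not part of the assignment; it assumes a bivariate model whose second coefficient is the slope.

```r
# Hypothetical helper: builds the interpretation sentence from a fitted model
interpret_slope <- function(model, x_label, y_label) {
  slope <- unname(coef(model)[2])  # second coefficient is the slope
  direction <- if (slope >= 0) "increases" else "decreases"
  sprintf("For every one unit increase in %s, %s %s by %.3f.",
          x_label, y_label, direction, abs(slope))
}

# Example with the small exercise and happiness data from the correlation section
exercise_hours <- c(1, 3, 4, 6, 8, 10)
happiness_scores <- c(50, 55, 60, 65, 70, 75)
fit <- lm(happiness_scores ~ exercise_hours)
interpret_slope(fit, "exercise hours", "happiness score")
```

For this example the helper returns "For every one unit increase in exercise hours, happiness score increases by 2.801."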

Exercise 1: Relationship Between Daily Exercise and Happiness

Dataset: You are provided with a dataset that includes the number of hours people spend exercising daily and their happiness scores on a scale from 0 to 100.

Simulate the Data:

set.seed(101)
daily_exercise <- rnorm(100, mean = 1, sd = 0.5)  # Hours of daily exercise
happiness <- 50 + 5 * daily_exercise + rnorm(100, mean = 0, sd = 5)  # Happiness scores

# Combine into a data frame
exercise_data <- data.frame(daily_exercise, happiness)

# View the first few rows
head(exercise_data)
##   daily_exercise happiness
## 1      0.8369818  55.52524
## 2      1.2762309  53.42011
## 3      0.6625281  63.98007
## 4      1.1071797  61.39964
## 5      1.1553846  59.51073
## 6      1.5869831  56.78237

Tasks:

1. Create the Linear Model: Build a linear model to predict happiness based on daily exercise.

2. Interpret the Slope: Interpret the slope of the model using the provided format.

3. Plot the Residuals: Analyze the residuals and discuss whether the linear model is appropriate.

# Create the linear model
happiness_model <- lm(happiness ~ daily_exercise, data = exercise_data)

# View the summary of the model
summary(happiness_model)
## 
## Call:
## lm(formula = happiness ~ daily_exercise, data = exercise_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3174  -3.1136  -0.7943   3.1798  11.2470 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      48.653      1.172  41.501  < 2e-16 ***
## daily_exercise    6.159      1.080   5.705 1.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.017 on 98 degrees of freedom
## Multiple R-squared:  0.2493, Adjusted R-squared:  0.2416 
## F-statistic: 32.54 on 1 and 98 DF,  p-value: 1.238e-07
# Plot residuals
plot(happiness_model$residuals,
     main = "Residuals Plot",
     ylab = "Residuals",
     xlab = "Index",
     pch = 19,
     col = "blue")
abline(h = 0, col = "red", lty = 2)

Slope Interpretation: For every one unit increase in daily exercise (one additional hour per day), happiness scores increase by 6.159 points.

Residuals Interpretation: The residuals should appear randomly scattered around zero with no clear pattern (no curves, no fanning out).

This suggests that a linear model is appropriate for the relationship between daily exercise and happiness. No obvious outliers or strong non-linear trends appear.
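
An index plot shows the residuals in row order, which mostly reveals outliers. Plotting residuals against fitted values is a common complement, since curvature or fanning tends to be easier to spot there. The sketch below re-simulates the exercise data so it runs on its own; the plot title and axis labels are illustrative choices.

```r
# Re-create the simulated data and model from this exercise
set.seed(101)
daily_exercise <- rnorm(100, mean = 1, sd = 0.5)
happiness <- 50 + 5 * daily_exercise + rnorm(100, mean = 0, sd = 5)
happiness_model <- lm(happiness ~ daily_exercise)

# Residuals against fitted values: look for curves or a funnel shape
plot(fitted(happiness_model), resid(happiness_model),
     main = "Residuals vs Fitted",
     xlab = "Fitted happiness",
     ylab = "Residuals",
     pch = 19,
     col = "blue")
abline(h = 0, col = "red", lty = 2)
```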

Exercise 2: Relationship Between Screen Time and Sleep Quality

Dataset: You have data on the number of hours of screen time before bed and sleep quality scores (0 to 100).

Simulate the Data:

set.seed(102)
screen_time <- rnorm(100, mean = 3, sd = 1)  # Hours of screen time before bed
sleep_quality <- 80 - 4 * screen_time + rnorm(100, mean = 0, sd = 8)  # Sleep quality scores

# Combine into a data frame
sleep_data <- data.frame(screen_time, sleep_quality)

# View the first few rows
head(sleep_data)
##   screen_time sleep_quality
## 1    3.180523      80.24487
## 2    3.784734      53.43548
## 3    1.646835      53.13154
## 4    4.983298      62.08685
## 5    4.238472      61.68009
## 6    4.200617      63.42158

Tasks:

1. Create the Linear Model: Build a linear model to predict sleep quality based on screen time.

2. Interpret the Slope: Interpret the slope of the model using the provided format.

3. Plot the Residuals: Analyze the residuals and discuss whether the linear model is appropriate.

# Create the linear model
sleep_model <- lm(sleep_quality ~ screen_time, data = sleep_data)

# View the summary of the model
summary(sleep_model)

# Plot the residuals
plot(sleep_model$residuals,
     main = "Residuals Plot - Screen Time vs Sleep Quality",
     ylab = "Residuals",
     xlab = "Index",
     pch = 19,
     col = "darkgreen")
abline(h = 0, col = "red", lty = 2)

Interpretation: For every one unit increase in screen time (one additional hour before bed), sleep quality decreases by the magnitude of the screen_time estimate reported by summary(sleep_model). The data were simulated as 80 - 4 * screen_time plus noise, so that estimate should be close to -4, a decrease of about 4 points per hour.

Residuals Interpretation: Residuals that are randomly scattered around zero with no clear pattern indicate that the linear model is appropriate for predicting sleep quality from screen time. Because the simulated noise is normal with constant spread, no curvature or fanning is expected.
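
Since the screen time data are simulated with a known slope of -4, one way to sanity-check a fitted model for this exercise is to look at the slope estimate together with its confidence interval. The sketch below rebuilds the data so it runs on its own; sleep_slope and sleep_ci are illustrative names.

```r
# Re-create the simulated data for this exercise
set.seed(102)
screen_time <- rnorm(100, mean = 3, sd = 1)
sleep_quality <- 80 - 4 * screen_time + rnorm(100, mean = 0, sd = 8)
sleep_data <- data.frame(screen_time, sleep_quality)

# Fit sleep quality on screen time
sleep_model <- lm(sleep_quality ~ screen_time, data = sleep_data)

# Slope estimate and its 95% confidence interval
sleep_slope <- coef(sleep_model)["screen_time"]
sleep_ci <- confint(sleep_model, "screen_time", level = 0.95)
sleep_slope
sleep_ci
```

A negative estimate whose interval lies near -4 would be consistent with how the data were generated.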

Exercise 3: Relationship Between Coffee Consumption and Productivity

Dataset: You are given data on the number of cups of coffee consumed daily and productivity scores at work (0 to 100).

Simulate the Data:

set.seed(103)
coffee_consumption <- rpois(100, lambda = 3)  # Number of cups of coffee
productivity <- 60 + 2.5 * coffee_consumption + rnorm(100, mean = 0, sd = 7)  # Productivity scores

# Combine into a data frame
coffee_data <- data.frame(coffee_consumption, productivity)

# View the first few rows
head(coffee_data)
##   coffee_consumption productivity
## 1                  2     63.92415
## 2                  1     53.20904
## 3                  3     74.02782
## 4                  3     69.43269
## 5                  1     68.47399
## 6                  1     59.23412

Tasks:

1. Create the Linear Model: Build a linear model to predict productivity based on coffee consumption.

2. Interpret the Slope: Interpret the slope of the model using the provided format.

3. Plot the Residuals: Analyze the residuals and discuss whether the linear model is appropriate.

# 1. Create the linear model
coffee_model <- lm(productivity ~ coffee_consumption, data = coffee_data)
# 2. View the summary of the model
summary(coffee_model)
## 
## Call:
## lm(formula = productivity ~ coffee_consumption, data = coffee_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6485  -4.6556  -0.5024   5.0264  13.4723 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         60.8890     1.4471  42.075  < 2e-16 ***
## coffee_consumption   2.5653     0.4348   5.901 5.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.947 on 98 degrees of freedom
## Multiple R-squared:  0.2621, Adjusted R-squared:  0.2546 
## F-statistic: 34.82 on 1 and 98 DF,  p-value: 5.191e-08
# Extract the slope (rounded to match the summary table)
round(coef(coffee_model)["coffee_consumption"], 4)
## coffee_consumption 
##             2.5653 
# Plot the residuals
plot(coffee_model$residuals,
     main = "Residuals Plot - Coffee Consumption vs Productivity",
     ylab = "Residuals",
     xlab = "Index",
     pch = 19,
     col = "purple")
abline(h = 0, col = "red", lty = 2)

Slope Interpretation:

For every one unit increase in coffee consumption (one additional cup per day), productivity increases by 2.565 points.

Residuals Interpretation:

The residuals are randomly scattered around zero with no clear pattern, indicating that the linear model is appropriate for predicting productivity based on coffee consumption.
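
The two halves of this lab connect here: in a bivariate regression, the Multiple R-squared in the model summary equals the squared Pearson correlation between the predictor and the outcome. The sketch below checks that for the coffee data, re-simulated so the block runs on its own; r and r_squared_model are illustrative names.

```r
# Re-create the simulated data and model from this exercise
set.seed(103)
coffee_consumption <- rpois(100, lambda = 3)
productivity <- 60 + 2.5 * coffee_consumption + rnorm(100, mean = 0, sd = 7)
coffee_model <- lm(productivity ~ coffee_consumption)

# Pearson correlation and the model's R-squared
r <- cor(coffee_consumption, productivity)
r_squared_model <- summary(coffee_model)$r.squared

# The squared correlation should match R-squared
all.equal(r^2, r_squared_model)
```

With the reported R-squared of 0.2621, this corresponds to a correlation of about 0.51 between cups of coffee and productivity.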

Exercise 4: Relationship Between Social Media Usage and Loneliness

Dataset: This dataset includes information on the number of hours spent on social media daily and participants’ loneliness scores (0 to 100).

Simulate the Data:

set.seed(104)
social_media <- rnorm(100, mean = 2, sd = 1)  # Hours of social media use
loneliness <- 40 + 7 * social_media + rnorm(100, mean = 0, sd = 6)  # Loneliness scores

# Combine into a data frame
social_media_data <- data.frame(social_media, loneliness)

# View the first few rows
head(social_media_data)
##   social_media loneliness
## 1     1.653416   53.37364
## 2     2.627636   51.37023
## 3     2.643783   63.35025
## 4     1.687444   60.31966
## 5     3.057881   57.82918
## 6     2.320207   63.09502

Tasks:

1. Create the Linear Model: Build a linear model to predict loneliness based on social media usage.

2. Interpret the Slope: Interpret the slope of the model using the provided format.

3. Plot the Residuals: Analyze the residuals and discuss whether the linear model is appropriate.

# 1. Create the linear model
social_media_model <- lm(loneliness ~ social_media, data = social_media_data)

# 2. View the summary of the model
summary(social_media_model)
## 
## Call:
## lm(formula = loneliness ~ social_media, data = social_media_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3983  -3.7439  -0.1331   4.2972  11.4068 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   40.7694     1.4539   28.04   <2e-16 ***
## social_media   6.6069     0.6448   10.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.811 on 98 degrees of freedom
## Multiple R-squared:  0.5172, Adjusted R-squared:  0.5123 
## F-statistic:   105 on 1 and 98 DF,  p-value: < 2.2e-16
# Plot the residuals
plot(social_media_model$residuals,
     main = "Residuals Plot - Social Media Usage vs Loneliness",
     ylab = "Residuals",
     xlab = "Index",
     pch = 19,
     col = "orange")
abline(h = 0, col = "red", lty = 2)

Interpretation: For every one unit increase in social media usage (one additional hour per day), loneliness increases by 6.607 points.

Residuals Interpretation:

The residuals are randomly scattered around zero, suggesting that the linear model is appropriate for predicting loneliness based on social media usage.
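
Another way to read the slope is through predictions. The sketch below uses predict() to estimate loneliness at 1, 2, and 3 hours of daily use and confirms that consecutive predictions differ by exactly the slope. The data are re-simulated so the block runs on its own, and the chosen hours in new_hours are arbitrary.

```r
# Re-create the simulated data and model from this exercise
set.seed(104)
social_media <- rnorm(100, mean = 2, sd = 1)
loneliness <- 40 + 7 * social_media + rnorm(100, mean = 0, sd = 6)
social_media_data <- data.frame(social_media, loneliness)
social_media_model <- lm(loneliness ~ social_media, data = social_media_data)

# Predicted loneliness at 1, 2, and 3 hours of daily use
new_hours <- data.frame(social_media = c(1, 2, 3))
predicted <- predict(social_media_model, newdata = new_hours)
predicted

# Each one-hour step changes the prediction by the slope
diff(predicted)
coef(social_media_model)["social_media"]
```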

Submission Instructions:

Knit your document to PDF and check that all content displays correctly before submitting. Submit the PDF to Canvas Assignments.