library(tidyverse)
library(tidymodels)
library(ggplot2)
Lab 05
Import
lifters <- read_csv("https://raw.githubusercontent.com/Aranaur/aranaur.rbind.io/main/datasets/ipf/ipf.csv")
lifters
Part 1
Exercise 1
ipf_squat <- lifters %>%
  filter(best3squat_kg > 0) %>%
  filter(!is.na(age)) %>%
  filter(!is.na(best3squat_kg)) %>%
  mutate(best3_squat_lbs = best3squat_kg * 2.20462)
ipf_squat
Exercise 2
ggplot(ipf_squat, aes(x = age, y = best3_squat_lbs)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Relationship Between Squat (lbs) and Age",
    x = "Age",
    y = "Best Squat (lbs)"
  ) +
  theme_minimal()
The graph shows a negative relationship between a lifter's age and best squat (lbs): the older the lifter, the lower the best squat.
Exercise 3
The linear model can be expressed as: \[ y = \beta_0 + \beta_1x + \epsilon \]
The Fitted Model: \[ \hat{y} = \hat{\beta_0} + \hat{\beta_1}x \]
Where:
- \(y\) is the best squat in lbs (best3_squat_lbs) and \(x\) is the lifter's age,
- \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are the intercept and slope estimates provided by the model.
Intercept (\(\hat{\beta_0}\)):
The predicted best squat (lbs) for a lifter of age 0. This is not meaningful in context, as no one competes at age 0.
Slope (\(\hat{\beta_1}\)):
The average change in best squat (lbs) per one-year increase in age. A positive value indicates an increase, while a negative value indicates a decrease.
Check the estimates:
- If the slope (\(\hat{\beta_1}\)) is small or close to 0, it suggests age has little impact on squat performance.
- If the p-value for the slope is significant (\(p < 0.05\)), it indicates a statistically significant linear relationship between age and best squat (lbs).
- Contextual check: Extreme or nonsensical values may suggest the need for further data cleaning or transformation.
age_fit <- lm(best3_squat_lbs ~ age, data = ipf_squat)
tidy(age_fit)
Exercise 4
ipf_squat <- ipf_squat %>%
  mutate(age2 = age^2)
ggplot(ipf_squat, aes(x = age2, y = best3_squat_lbs)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Relationship Between Squat (lbs) and Age squared",
    x = "Age squared",
    y = "Best Squat (lbs)"
  ) +
  theme_minimal()
The plot against age squared looks better than the previous one: the relationship between best squat (lbs) and age squared is closer to linear than the relationship between best squat (lbs) and age.
Exercise 5
age2_fit <- lm(best3_squat_lbs ~ age2, data = ipf_squat)
glance(age2_fit)$r.squared
[1] 0.03256833
glance(age_fit)$r.squared
[1] 0.01748617
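The model with age squared has a higher \(R^2\) (about 0.033) than the model with age (about 0.017), so it fits slightly better. Both values are small, so either predictor explains only a small share of the variability in best squat.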
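To make the comparison concrete, \(R^2\) can be computed by hand as \(1 - SS_{res} / SS_{tot}\). The following is a minimal base-R sketch on made-up data: the `toy` data frame and the `r_squared()` helper are hypothetical and are not part of the lab or the IPF dataset.

```r
# Hypothetical toy data, for illustration only (not from the IPF dataset)
toy <- data.frame(
  age   = c(20, 25, 30, 35, 40, 45, 50, 55),
  squat = c(400, 430, 445, 440, 420, 395, 360, 330)
)
toy$age2 <- toy$age^2

# Hypothetical helper: R^2 = 1 - SS_res / SS_tot
r_squared <- function(fit, y) {
  ss_res <- sum(residuals(fit)^2)   # unexplained variation
  ss_tot <- sum((y - mean(y))^2)    # total variation around the mean
  1 - ss_res / ss_tot
}

fit_age  <- lm(squat ~ age,  data = toy)
fit_age2 <- lm(squat ~ age2, data = toy)

r2_age  <- r_squared(fit_age,  toy$squat)
r2_age2 <- r_squared(fit_age2, toy$squat)

# The hand computation should agree with what summary() reports
all.equal(r2_age,  summary(fit_age)$r.squared)
all.equal(r2_age2, summary(fit_age2)$r.squared)
```

The same quantity is what `glance()` returns in its `r.squared` column, so the two models can be compared on either basis.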
Exercise 6
ipf_deadlift <- ipf_squat %>%
  filter(best3deadlift_kg > 0) %>%
  filter(!is.na(best3deadlift_kg)) %>%
  mutate(
    best3_deadlift_lbs = best3deadlift_kg * 2.20462,
    age_grouped = ifelse(age >= 40, "40 or older", "Under 40"))
ggplot(ipf_deadlift, aes(x = best3_deadlift_lbs, fill = sex)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~age_grouped, scales = "free") +
  labs(
    title = "Distribution of best deadlift in lbs by age grouping and sex",
    x = "Best deadlift (lbs)",
    y = "Density"
  ) +
  theme_minimal()
Based on this graph, sex appears to have a larger effect on best deadlift than age group does.
Exercise 7
ipf_bench <- ipf_deadlift %>%
  filter(best3bench_kg > 0) %>%
  filter(!is.na(best3bench_kg)) %>%
  filter(bodyweight_kg > 0) %>%
  filter(!is.na(bodyweight_kg)) %>%
  mutate(best3_bench_lbs = best3bench_kg * 2.20462,
         bodyweight_lbs = bodyweight_kg * 2.20462)
ggplot(ipf_bench, aes(x = bodyweight_lbs, y = best3_bench_lbs)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(
    title = "Relationship Between Bench Press (lbs) and Bodyweight (lbs)",
    x = "Bodyweight (lbs)",
    y = "Best Bench Press (lbs)"
  ) +
  theme_minimal()
As the graph shows, there is a positive relationship between bodyweight and best bench press (lbs): the heavier the lifter, the higher the best bench press.
Exercise 8
bench_model <- lm(best3_bench_lbs ~ bodyweight_lbs, data = ipf_bench)
tidy(bench_model)
The general form of the fitted model is: \[ \hat{y} = \hat{\beta_0} + \hat{\beta_1}x \]
Where:
- Intercept (\(\hat{\beta_0}\)) = 12.4
- Slope (\(\hat{\beta_1}\)) = 1.64
The fitted model equation is: \[ \hat{y} = 12.4 + 1.64 \cdot \text{bodyweight\_lbs} \]
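As an illustration of how the fitted equation can be used, the sketch below plugs a bodyweight into it. The coefficients are the rounded values reported above; the helper name `predict_bench()` and the 180 lb bodyweight are hypothetical choices made only for this example.

```r
# Rounded coefficients taken from the fitted equation in the text
b0 <- 12.4   # intercept (lbs)
b1 <- 1.64   # slope (lbs of bench press per lb of bodyweight)

# Hypothetical helper: predicted best bench press for a given bodyweight
predict_bench <- function(bodyweight_lbs) {
  b0 + b1 * bodyweight_lbs
}

# Assumed example input: a lifter weighing 180 lbs
predicted <- predict_bench(180)
predicted  # 12.4 + 1.64 * 180 = 307.6
```

With the fitted object in memory, `predict(bench_model, newdata = data.frame(bodyweight_lbs = 180))` should give nearly the same value, differing only by the rounding of the coefficients.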
Part 2
Exercise 9
Predicted Value (\(\hat{y}\)):
The predicted number of incidents of skin cancer is 1.5 per 1,000 people.
Residual (\(e\)):
The residual is 0.5. The residual is calculated as: \[ e = y - \hat{y} \]
where:
- \(e\) is the residual,
- \(y\) is the actual value,
- \(\hat{y}\) is the predicted value.
Rearranging the equation for \(y\): \[ y = e + \hat{y} \]
Substitute the values: \[ y = 0.5 + 1.5 = 2.0 \]
Interpretation:
The actual value (\(y\)) is 2.0 per 1,000 people, which is higher than the predicted value (\(\hat{y} = 1.5\)).
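The same arithmetic can be checked in a couple of lines of R. This is only a sketch of the calculation above; the variable names are chosen for illustration.

```r
y_hat <- 1.5   # predicted incidents per 1,000 people (from the exercise)
e     <- 0.5   # residual (from the exercise)

# Rearranged residual definition: y = e + y_hat
y <- e + y_hat
y  # 2

# Sanity check: the residual definition e = y - y_hat still holds
all.equal(y - y_hat, e)
```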
Exercise 10
The general form of the linear model is: \[ \hat{y} = \beta_0 + \beta_1x \]
From the table:
- \(\beta_0\) (Intercept) = \(-0.357\)
- \(\beta_1\) (Slope) = \(4.034\)
The linear model is: \[ \hat{Hwt} = -0.357 + 4.034 \cdot Bwt \]
Where:
- \(\hat{Hwt}\): Predicted heart weight (g)
- \(Bwt\): Body weight (kg)
The intercept (\(-0.357\)) represents the predicted heart weight of a cat with a body weight of \(0\) kg.
Interpretation: While mathematically valid, it doesn’t make biological sense because a cat cannot have a body weight of \(0\) kg.
The slope (\(4.034\)) represents the change in heart weight (g) for each 1 kg increase in body weight.
Interpretation: For every additional kilogram of body weight, the heart weight increases by approximately \(4.034\) grams.
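To see the slope interpretation in numbers, the sketch below evaluates the fitted line at two body weights one kilogram apart. The helper name `predict_hwt()` and the 2 kg and 3 kg inputs are hypothetical values picked for illustration.

```r
# Coefficients from the table in the exercise
b0 <- -0.357   # intercept (g)
b1 <- 4.034    # slope (g of heart weight per kg of body weight)

# Hypothetical helper: predicted heart weight (g) for a body weight (kg)
predict_hwt <- function(bwt_kg) {
  b0 + b1 * bwt_kg
}

hwt_2kg <- predict_hwt(2)   # -0.357 + 4.034 * 2 = 7.711
hwt_3kg <- predict_hwt(3)   # -0.357 + 4.034 * 3 = 11.745

# The difference between predictions one kilogram apart is the slope
hwt_3kg - hwt_2kg
```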
\(R^2 = 0.65\), or \(65\%\), means that \(65\%\) of the variability in heart weight (\(Hwt\)) is explained by the linear relationship with body weight (\(Bwt\)).
Interpretation: Body weight is a strong predictor of heart weight, but there are other factors accounting for the remaining \(35\%\) of variability.
For simple linear regression, the correlation coefficient (\(r\)) can be recovered from \(R^2\) as: \[ r = \pm\sqrt{R^2} \] where the sign matches the sign of the slope.
Since the slope (\(4.034\)) is positive, the correlation is positive: \[ r = \sqrt{0.65} \approx 0.806 \]
Correlation coefficient: \(r \approx 0.806\)
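The calculation can be reproduced in R by taking the square root of \(R^2\) and attaching the sign of the slope. This is a small sketch using the values from the exercise; the variable names are illustrative.

```r
r_squared <- 0.65    # from the exercise
slope     <- 4.034   # from the regression table

# In simple linear regression, r has the same sign as the slope
r <- sign(slope) * sqrt(r_squared)
round(r, 3)  # 0.806
```

Using `sign(slope)` keeps the calculation correct if the slope were negative, in which case \(r\) would be \(-0.806\).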