##1.0 Installation of packages.
#install.packages(c("readxl", "dplyr", "ggplot2", "nortest", "car"))
library(readxl)
library(dplyr)
library(ggplot2)
library(nortest)  # provides ad.test()
library(car)      # provides leveneTest()
##2.0 Setting the working directory and importing the data
setwd("C:/Users/emerald/1Main assignment/R scripts/")
#setwd ("D:/1Main assignment/R scripts")
data <- read_excel("3147534SportsPeople Data.xlsx")
##3.0 Inspecting the data
#3.1 Displaying the head
head(data)
#3.2 Structure of the data
str(data)
tibble [178 × 3] (S3: tbl_df/tbl/data.frame)
$ Sex: chr [1:178] "female" "male" "male" "female" ...
$ LBM: num [1:178] 56.5 75 76.6 49.4 48.8 ...
$ BMI: num [1:178] 20.8 21.2 24.5 19.8 20.7 ...
#3.3 Summary of the data
summary(data)
     Sex                 LBM             BMI
 Length:178         Min.   :39.22   Min.   :16.88
 Class :character   1st Qu.:52.30   1st Qu.:20.15
 Mode  :character   Median :58.56   Median :21.36
                    Mean   :59.98   Mean   :21.57
                    3rd Qu.:66.77   3rd Qu.:22.64
                    Max.   :91.16   Max.   :26.06
#3.4 Check for missing values
any(is.na(data))
[1] FALSE
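#3.5 Check for null values (is.null() tests the object as a whole, not individual cells, so FALSE here only confirms that `data` exists)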
any(is.null(data))
[1] FALSE
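any(is.na(data)) gives a single overall answer. When a data set does contain gaps, a per-column count is usually more useful. The sketch below shows one way to do this on a small made-up data frame (`toy` is hypothetical and is not the assignment data).

```r
# Hypothetical toy data frame, used only for illustration (not the assignment data)
toy <- data.frame(
  Sex = c("female", "male", NA, "male"),
  LBM = c(56.5, 75.0, 76.6, NA),
  BMI = c(20.8, 21.2, 24.5, 19.8)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(toy))
print(rows_with_na)
```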
#3.6 Check for outliers
boxplot(data$`LBM`, data$`BMI`, names = c("LBM", "BMI"))
outliers <- boxplot(data[, c("LBM", "BMI")], plot = FALSE)$out
print(outliers)
[1] 91.16
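The `$out` element pools the flagged values from both columns, so it does not say which variable the value 91.16 belongs to (the summary above shows it is the maximum LBM). A minimal sketch of a per-column check follows, using made-up numbers (`toy` is hypothetical) and `boxplot.stats()`, which applies the same 1.5 x hinge-spread rule as `boxplot()`.

```r
# Hypothetical toy values (not the assignment data); 91 is deliberately extreme
toy <- data.frame(
  LBM = c(50, 52, 55, 58, 60, 62, 65, 91),
  BMI = c(19.5, 20.1, 20.8, 21.4, 21.9, 22.3, 23.0, 23.6)
)

# One vector of flagged values per column, so the source of each outlier is clear
outliers_by_column <- lapply(toy, function(x) boxplot.stats(x)$out)
print(outliers_by_column)
```

With these toy values only the LBM column has a flagged point; the BMI entry is an empty vector.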
#3.7 Scatter plot of BMI vs LBM
with(data, plot(BMI,LBM,
main="BMI VS LBM",
xlab="BMI",
ylab="LBM"))
##3.8 Checking the distribution of the data (checking the nature of symmetry)
#3.81 Histogram of Lean Body Mass (LBM)
ggplot(data, aes(x = `LBM`)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of Lean Body Mass (LBM)")
#3.82 Histogram of Body Mass Index (BMI)
ggplot(data, aes(x = `BMI`)) +
geom_histogram(binwidth = 1, fill = "green", color = "black", alpha = 0.7) +
labs(title = "Histogram of Body Mass Index (BMI)")
# Correlation between LBM and BMI (both sexes combined)
cor(data[, c("LBM", "BMI")])
LBM BMI
LBM 1.0000000 0.5714248
BMI 0.5714248 1.0000000
# Number of males and females in the sample
males_count <- sum(data$Sex == "male")
print(males_count)
[1] 97
females_count <- sum(data$Sex == "female")
print(females_count)
[1] 81
#4.0 Checking the normality of LBM within each sex using the Anderson-Darling test
ad.test(data$LBM[data$Sex == "male"])
Anderson-Darling normality test
data: data$LBM[data$Sex == "male"]
A = 0.48534, p-value = 0.2218
ad.test(data$LBM[data$Sex == "female"])
Anderson-Darling normality test
data: data$LBM[data$Sex == "female"]
A = 0.71069, p-value = 0.06123
#5.0 Checking the nature of the variance of the groups using Levene’s Test
# leveneTest() comes from the car package; factor() avoids a coercion warning for the character column
levene_test_result <- car::leveneTest(LBM ~ factor(Sex), data = data)
levene_test_result
Levene’s test gave a p-value of 1.131e-05, far below the 0.05 (5%) significance level, indicating that the variance of LBM differs between the male and female groups. Because the equal-variance assumption does not hold, Welch’s two-sample t-test was used instead of the pooled-variance t-test.
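To make the idea behind Levene’s test concrete, here is a minimal base-R sketch on simulated data (the data frame `sim` and its parameters are made up). It assumes the median-centred variant, which to my understanding is the default in `car::leveneTest()`: take each value’s absolute deviation from its own group median, then run a one-way ANOVA on those deviations.

```r
set.seed(1)  # simulated data, for illustration only (not the assignment data)
sim <- data.frame(
  Sex = rep(c("female", "male"), times = c(40, 50)),
  LBM = c(rnorm(40, mean = 53, sd = 5), rnorm(50, mean = 65, sd = 9))
)

# Absolute deviation of each value from its own group's median
abs_dev <- abs(sim$LBM - ave(sim$LBM, sim$Sex, FUN = median))

# One-way ANOVA on the absolute deviations:
# a small p-value suggests the groups differ in spread
levene_by_hand <- anova(lm(abs_dev ~ factor(sim$Sex)))
print(levene_by_hand)
```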
#6.0 Welch’s t-test of the null hypothesis that mean LBM is the same in males and females
t_test_result <- t.test(LBM ~ Sex, data = data, var.equal = FALSE)
t_test_result
Welch Two Sample t-test
data: LBM by Sex
t = -10.508, df = 159.84, p-value < 2.2e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-14.447940 -9.876429
sample estimates:
mean in group female mean in group male
53.35644 65.51863
p_value <- t_test_result$p.value
if (p_value < 0.05) {
  # "\n" ends the printed line
  cat("There is a statistically significant difference in mean LBM between males and females.\n")
} else {
  cat("There is no statistically significant difference in mean LBM between males and females.\n")
}
There is a statistically significant difference in mean LBM between males and females.
The Welch two-sample t-test gave p < 2.2e-16, far below 0.05 (5%), so there is enough evidence to reject the null hypothesis that the true difference in mean LBM between females and males is 0. The sample means are 53.36 for females and 65.52 for males, and the 95% confidence interval for the difference (female minus male) runs from -14.45 to -9.88. In other words, mean LBM is higher in the male group than in the female group.
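The fractional degrees of freedom in the output (df = 159.84) come from the Welch-Satterthwaite approximation. The sketch below reproduces the Welch statistic, degrees of freedom and p-value by hand on simulated data (the vectors `female_lbm` and `male_lbm` are made up) and compares them with `t.test()`.

```r
set.seed(2)  # simulated data, for illustration only (not the assignment data)
female_lbm <- rnorm(40, mean = 53, sd = 5)
male_lbm   <- rnorm(50, mean = 65, sd = 9)

# Squared standard error of each group mean
se2_f <- var(female_lbm) / length(female_lbm)
se2_m <- var(male_lbm) / length(male_lbm)

# Welch t statistic (female minus male, the order t.test() reports)
t_stat <- (mean(female_lbm) - mean(male_lbm)) / sqrt(se2_f + se2_m)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_f + se2_m)^2 /
  (se2_f^2 / (length(female_lbm) - 1) + se2_m^2 / (length(male_lbm) - 1))

# Two-sided p-value
p_val <- 2 * pt(-abs(t_stat), df = df_welch)

# Compare with the built-in test
builtin <- t.test(female_lbm, male_lbm, var.equal = FALSE)
print(c(t = t_stat, df = df_welch, p = p_val))
print(builtin)
```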
#7.0 Calculating the correlation coefficient for each sex
# Calculate correlation coefficient for males
cor_male <- cor(data$LBM[data$Sex == "male"], data$BMI[data$Sex == "male"])
# Calculate correlation coefficient for females
cor_female <- cor(data$LBM[data$Sex == "female"], data$BMI[data$Sex == "female"])
# Print correlation coefficients
cat("Correlation coefficient for males:", cor_male, "\n")
Correlation coefficient for males: 0.7829081
cat("Correlation coefficient for females:", cor_female, "\n")
Correlation coefficient for females: 0.6703704
The correlation coefficient for males is 0.7829081; for females, it is 0.6703704. This indicates a positive linear relationship between BMI and LBM in both groups, strong in males and somewhat weaker in females. This is not to say that one causes the other; correlation is not the same as causation. There could be a component common to both BMI and LBM that causes a rise in both.
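A point estimate of r says nothing about its precision. As a sketch on simulated data (the data frame `sim` and its parameters are made up), the code below computes one correlation per sex in a single call and uses `cor.test()` to attach a 95% confidence interval to one of them.

```r
set.seed(3)  # simulated data, for illustration only (not the assignment data)
n <- 60
sim <- data.frame(
  Sex = rep(c("female", "male"), each = n),
  BMI = rnorm(2 * n, mean = 21.5, sd = 1.8)
)
sim$LBM <- 3 * sim$BMI + rnorm(2 * n, sd = 5)

# One Pearson correlation per sex, returned as a named numeric vector
cor_by_sex <- sapply(split(sim, sim$Sex), function(d) cor(d$LBM, d$BMI))
print(cor_by_sex)

# cor.test() adds a 95% confidence interval and a test of rho = 0
male_test <- cor.test(~ LBM + BMI, data = sim, subset = Sex == "male")
print(male_test$conf.int)
```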
#8.0 Creating the linear regression model
# Create a new data frame containing only the rows for males.
male_data <- data[data$Sex == "male", ]
# Build the model for the male rows (LBM to be predicted by BMI)
model <- lm(LBM ~ BMI, data = male_data)
#show summary of model
summary(model)
Call:
lm(formula = LBM ~ BMI, data = male_data)
Residuals:
Min 1Q Median 3Q Max
-13.9069 -4.2213 -0.6453 3.7428 14.1990
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.9376 6.2629 -1.746 0.084 .
BMI 3.5474 0.2892 12.266 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.981 on 95 degrees of freedom
Multiple R-squared: 0.6129, Adjusted R-squared: 0.6089
F-statistic: 150.4 on 1 and 95 DF, p-value: < 2.2e-16
The estimated intercept is -10.9376. This is the predicted value of the response variable (LBM) when the predictor variable (BMI) is 0. However, this interpretation has little practical value because BMI cannot be 0, and the intercept is not significantly different from 0 at the 5% level (p = 0.084).
The coefficient estimate for BMI is 3.5474. This is the expected change in LBM for a one-unit increase in BMI; it is not the value of LBM when BMI is 1. The p-value is < 2e-16 (far below 0.05), which indicates that BMI is a significant predictor of LBM in this model.
Residuals: These help to assess the model’s fit. The median residual (-0.6453) is close to zero and the quartiles (-4.2213 and 3.7428) are of similar size, which suggests that the residuals are roughly symmetric about zero. However, the residuals range from -13.9069 to 14.1990 and the residual standard error is 5.981, so the accuracy of individual predictions varies across observations.
R-squared: This shows the proportion of the variability in LBM (response variable) that is explained by BMI (predictor variable). The output shows that 61.29% of the variability in LBM is explained by BMI.
The significance code (***) on the BMI coefficient marks a p-value below 0.001, that is, high statistical significance.
F-statistic: This tests the overall significance of the model. In this case, the F-statistic is 150.4 on 1 and 95 degrees of freedom with a very low p-value (< 2.2e-16), indicating that the overall model is statistically significant. With a single predictor, this is the same test as the t-test on the BMI coefficient.
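As a worked check of these interpretations, the sketch below uses only numbers copied from the outputs in sections 7.0 and 8.0. The BMI value of 23 is an arbitrary illustrative choice, not a value taken from the data.

```r
# Values copied from the summary(model) output and the male correlation
intercept <- -10.9376
slope     <- 3.5474
t_bmi     <- 12.266
r_male    <- 0.7829081   # correlation between LBM and BMI for males

# Predicted LBM for a hypothetical male with BMI = 23 (illustrative value)
predicted_lbm <- intercept + slope * 23
print(predicted_lbm)   # 70.6526

# A one-unit increase in BMI changes the prediction by exactly the slope
print((intercept + slope * 24) - predicted_lbm)

# With a single predictor, F equals t squared and R-squared equals r squared
print(t_bmi^2)    # about 150.45, matching the reported F-statistic up to rounding
print(r_male^2)   # about 0.6129, the reported Multiple R-squared
```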
#9.0 Checking the assumptions of the model
#9.1 Q-Q plot of residuals (checking the normality of residuals)
qqnorm(residuals(model))
qqline(residuals(model), col = 2)
ad.test(residuals(model))
Anderson-Darling normality test
data: residuals(model)
A = 0.27025, p-value = 0.6697
#9.2 Checking the other assumptions (linearity, independence of residuals, and homoscedasticity)
plot(model)
The points on the Q-Q plot fell along a straight line, which suggests that the distribution of residuals is approximately normal. The Anderson-Darling test on the residuals (A = 0.27025, p-value = 0.6697) gives no evidence against normality, so the assumption that the residuals are normally distributed appears reasonable.
The random scattering on the “Residuals vs Fitted” and “Scale-Location” plots is consistent with linearity and homoscedasticity, and the “Residuals vs Leverage” plot shows whether any observation has an undue influence on the fit. These plots do not test independence of residuals directly; that assumption rests mainly on how the data were collected (for example, each row being a different person).
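Visual checks can be supplemented with a numerical one. The sketch below computes a studentized (Koenker-style) Breusch-Pagan statistic by hand in base R on simulated data (the data frame `sim` and its parameters are made up). This is a swapped-in technique for illustration and was not part of the original analysis.

```r
set.seed(4)  # simulated data, for illustration only (not the assignment data)
sim <- data.frame(BMI = runif(97, min = 18, max = 27))
sim$LBM <- -11 + 3.5 * sim$BMI + rnorm(97, sd = 6)
fit <- lm(LBM ~ BMI, data = sim)

# Auxiliary regression: do the squared residuals depend on the predictor?
aux <- lm(residuals(fit)^2 ~ BMI, data = sim)

# Studentized Breusch-Pagan statistic: n times the auxiliary R-squared,
# compared with a chi-squared distribution with 1 degree of freedom
bp_stat <- nrow(sim) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p.value = bp_p))
```

A large p-value here would be consistent with constant residual variance; a small one would point to heteroscedasticity.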
#10.0 Hypothesis testing
male_data <- data[data$Sex == "male",]
model = lm(LBM~BMI, data = male_data)
summary(model)
par(mfrow = c(1, 2))
# Plot the regression line (BMI on the x-axis, LBM on the y-axis)
plot(male_data$BMI, male_data$LBM, main = "Regression line",
     xlab = "BMI",
     ylab = "LBM")
abline(model, col = "blue")
# Plot the fitted vs actual values
plot(male_data$LBM, model$fitted.values, main = "Actual vs Fitted",
     xlab = "Actual LBM",
     ylab = "Fitted LBM")
abline(a = 0, b = 1)
Interpretation:
• Null hypothesis: there is no linear relationship between LBM and BMI in males (the slope for BMI is 0).
• The p-value for the BMI slope (< 2e-16) is less than 0.05 (or 5%). This provides enough evidence to reject the null hypothesis.
Test Conclusion: There is a linear relationship between LBM and BMI in males.
#11.0 Assessing the predictive performance of the model
# Refit the same model under a different name.
lm_model <- lm(LBM ~ BMI, data = male_data)
# Prediction and Confidence Intervals
# (length.out gives an evenly spaced grid that reaches the maximum BMI exactly)
new_data <- data.frame(BMI = seq(min(male_data$BMI), max(male_data$BMI), length.out = 100))
preds <- predict(lm_model, new_data, interval = "prediction")
confs <- predict(lm_model, new_data, interval = "confidence")
# Plotting Predictions and Intervals
plot(male_data$BMI, male_data$LBM, col = "blue",
     main = "Scatter Plot with Fitted Regression Line and Intervals")
abline(lm_model, col = "red")
# Plot the lower and upper prediction limits (preds) with dashed lines (lty = 2)
lines(new_data$BMI, preds[, "lwr"], col = "green", lty = 2)
lines(new_data$BMI, preds[, "upr"], col = "green", lty = 2)
# Plot the lower and upper confidence limits (confs)
lines(new_data$BMI, confs[, "lwr"], col = "orange")
lines(new_data$BMI, confs[, "upr"], col = "orange")
# Residual Analysis and Model Summary
# (named model_residuals so that the residuals() function is not masked)
model_residuals <- residuals(lm_model)
plot(male_data$BMI, model_residuals, col = "purple", main = "Residual Plot")
abline(h = 0, col = "red", lty = 2)
summary(lm_model)
• The scatter plot points (small blue circles) show the linear relationship between LBM and BMI.
• Most of the observed points fall within the prediction interval (dashed green lines), which is what a 95% prediction interval should do if the model is adequate.
• The narrow confidence band (orange lines) indicates that the mean LBM at a given BMI is estimated with little uncertainty. Predictions for an individual are less precise, as the wider prediction interval shows.
• The positive slope of the fitted line implies that LBM tends to increase as BMI increases. This does not by itself imply direct causation, though there is such a possibility.
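The difference between the two bands comes down to one extra term in the standard error. The sketch below works through both intervals by hand at a single BMI value and checks them against `predict()`. It uses simulated data (the data frame `sim`, its parameters and the value `x0 = 23` are made up for illustration).

```r
set.seed(5)  # simulated data, for illustration only (not the assignment data)
sim <- data.frame(BMI = runif(97, min = 18, max = 27))
sim$LBM <- -11 + 3.5 * sim$BMI + rnorm(97, sd = 6)
fit <- lm(LBM ~ BMI, data = sim)

x0     <- 23                               # hypothetical BMI at which to predict
n      <- nrow(sim)
s      <- summary(fit)$sigma               # residual standard error
sxx    <- sum((sim$BMI - mean(sim$BMI))^2)
y_hat  <- unname(coef(fit)[1] + coef(fit)[2] * x0)
t_crit <- qt(0.975, df = n - 2)

# Standard error of the fitted mean, and of a single new observation
se_mean <- s * sqrt(1 / n + (x0 - mean(sim$BMI))^2 / sxx)
se_new  <- s * sqrt(1 + 1 / n + (x0 - mean(sim$BMI))^2 / sxx)

conf_by_hand <- y_hat + c(-1, 1) * t_crit * se_mean
pred_by_hand <- y_hat + c(-1, 1) * t_crit * se_new

# Cross-check against predict()
conf_builtin <- predict(fit, data.frame(BMI = x0), interval = "confidence")
pred_builtin <- predict(fit, data.frame(BMI = x0), interval = "prediction")
print(rbind(conf_by_hand, pred_by_hand))
```

The extra 1 under the square root in `se_new` accounts for the scatter of an individual around the regression line, which is why the prediction interval is always wider than the confidence interval.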