Project_2_ytang76

Task 1

gym2$Workout_Type <- as.factor(gym2$Workout_Type)
gym2$Experience_Level <- factor(gym2$Experience_Level, ordered = TRUE)
gym2$Gender <- as.factor(gym2$Gender)
gym2$bmi_class <- factor(gym2$bmi_class, ordered = TRUE)
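By default, `factor()` sorts levels alphabetically, which for an ordered factor may not match the intended ranking. The following is a minimal sketch of setting the order explicitly. It uses the four BMI class labels that appear in Task 4 on a small made-up vector (`bmi_example` is hypothetical and is not the project data).

```r
# Hypothetical toy vector; the labels match the bmi_class values used in Task 4.
bmi_example <- c("Obese", "Healthy", "Underweight", "Overweight", "Healthy")

# Default: levels are sorted alphabetically.
default_order <- factor(bmi_example, ordered = TRUE)

# Explicit: levels follow the intended ranking from lowest to highest BMI.
explicit_order <- factor(bmi_example,
                         levels = c("Underweight", "Healthy", "Overweight", "Obese"),
                         ordered = TRUE)

levels(default_order)
levels(explicit_order)
```

The Tukey output in Task 4 lists comparisons such as Obese-Healthy, which appears consistent with the default alphabetical order having been used in this report.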

Task 2

IV: Workout_Frequency

DV: Calories_Burned

Hypothesis: H0: Beta = 0, Workout Frequency does not explain the variation in Calories Burned. H1: Beta != 0, Workout Frequency does explain the variation in Calories Burned.

Scatterplot for the data

plot(gym2$Workout_Frequency, gym2$Calories_Burned,
     pch = 16, col = "hotpink",
     xlab = "Workout Frequency",
     ylab = "Calories Burned",
     main = "Calories Burned vs. Workout Frequency")
abline(lm(Calories_Burned ~ Workout_Frequency, data = gym2), col = "lightgreen", lwd = 2)

Linear regression model

lm_modle <- lm(Calories_Burned~Workout_Frequency, data = gym2)
summary(lm_modle)

## 
## Call:
## lm(formula = Calories_Burned ~ Workout_Frequency, data = gym2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -547.08 -164.04  -11.12  147.88  760.88 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        333.953     26.981   12.38   <2e-16 ***
## Workout_Frequency  172.042      7.832   21.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 223 on 971 degrees of freedom
## Multiple R-squared:  0.3319, Adjusted R-squared:  0.3313 
## F-statistic: 482.5 on 1 and 971 DF,  p-value: < 2.2e-16

summary(lm_modle)$r.squared

## [1] 0.331949

coef(lm_modle)

##       (Intercept) Workout_Frequency 
##          333.9529          172.0420

Regression line equation from the model:

Calories_Burned = 333.953 + 172.042 * Workout_Frequency

R squared: 0.3319

Conclusion: The linear regression model predicts 333.953 calories burned when Workout_Frequency is 0 (the intercept), and each one-unit increase in Workout_Frequency is associated with an average increase of 172.042 calories burned (the slope). The R^2 is 0.3319, which means that about 33.19% of the variability in calories burned is explained by workout frequency. This R^2 is not very high, so there is still room to improve the model, for example by adding session duration, experience level or workout type as predictors.
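As a small worked example of how the fitted line is used, the sketch below plugs the reported coefficients into the regression equation. The helper `predict_calories()` is a hypothetical name introduced only for illustration.

```r
# Coefficients copied from the summary(lm_modle) output above.
intercept <- 333.953
slope     <- 172.042

# Hypothetical helper: predicted Calories_Burned for a given Workout_Frequency.
predict_calories <- function(workout_frequency) {
  intercept + slope * workout_frequency
}

predict_calories(3)                        # 333.953 + 172.042 * 3 = 850.079
predict_calories(4) - predict_calories(3)  # equals the slope, 172.042
```

With the fitted model object available, `predict(lm_modle, newdata = data.frame(Workout_Frequency = 3))` should give the same prediction up to rounding of the coefficients.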

Correlation

cor_gym <- cor(gym2$Workout_Frequency,
    gym2$Calories_Burned,
    use = "complete.obs",
    method = "pearson")
print(cor_gym)

## [1] 0.5761501

Conclusion: The correlation value is 0.5761501. It is positive, which means that there is a moderate positive correlation between calories burned and workout frequency. In other words, a higher workout frequency is associated with more calories burned.
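In a simple linear regression with one predictor, R^2 equals the square of the Pearson correlation, which ties this result back to the regression output. The sketch below checks this with the reported value and then with simulated data (`x_sim` and `y_sim` are made up for illustration and are not the gym2 data).

```r
# Correlation reported above.
r <- 0.5761501
r^2   # approximately 0.3319, matching the Multiple R-squared of lm_modle

# The same identity on simulated data.
set.seed(1)
x_sim <- sample(2:5, 100, replace = TRUE)
y_sim <- 300 + 170 * x_sim + rnorm(100, sd = 200)
fit_sim <- lm(y_sim ~ x_sim)

# TRUE when the squared correlation and R-squared agree up to numerical tolerance.
all.equal(cor(x_sim, y_sim)^2, summary(fit_sim)$r.squared)
```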

Independence and Normality Check

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

dwtest(lm_modle, alternative = "two.sided")

## 
##  Durbin-Watson test
## 
## data:  lm_modle
## DW = 2.0843, p-value = 0.188
## alternative hypothesis: true autocorrelation is not 0

qqnorm(lm_modle$residuals)
qqline(lm_modle$residuals, col = "lightblue")

shapiro.test(lm_modle$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  lm_modle$residuals
## W = 0.99324, p-value = 0.00021

par(mfrow = c(2,2))
plot(lm_modle)

par(mfrow = c(1,1))
plot(x = lm_modle$fitted.values, y = lm_modle$residuals, main = "Fitted values and Residuals", xlab = "predicted values", ylab = "Residuals", col = "hotpink")
abline(h = 0, lty = 2)

Conclusions: The linearity condition is met because the points in the residuals vs. fitted values plot are randomly scattered with no clear pattern.

For independence, the Durbin-Watson p-value (0.188) is greater than the significance level 0.06, so we fail to reject H0 of no autocorrelation, and the independence condition is met.

For normality, the Shapiro-Wilk p-value (0.00021) is less than the significance level 0.06, so we reject H0 that the residuals are normally distributed. Formally, the normality condition is not met. However, the points in the QQ plot follow the line fairly well, and with a sample this large (973 observations) the test detects even small deviations, so the departure from normality appears mild.

From the plot of fitted values and residuals, the points are randomly scattered around 0, so we conclude that the equal variance condition is met as well.

From our hypothesis test, the p-value for Workout_Frequency (< 2e-16) is less than the significance level 0.06. As a result, we reject H0 and conclude that Workout Frequency does explain variation in Calories Burned.
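The decisions in this task all follow the same rule: reject H0 when the p-value is below the significance level. For the assumption tests, H0 states that the assumption holds, so rejecting H0 is evidence against the assumption. A minimal sketch of that rule follows; the helper `decide()` is a hypothetical name, not part of any package.

```r
# Hypothetical helper: state the decision for a test at a given significance level.
decide <- function(p_value, alpha) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

alpha_task2 <- 0.06
decide(0.188,   alpha_task2)  # Durbin-Watson: H0 is no autocorrelation
decide(0.00021, alpha_task2)  # Shapiro-Wilk: H0 is normally distributed residuals
decide(2e-16,   alpha_task2)  # slope t-test: the output reports p < 2e-16
```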


Task 3

Hypothesis for the overall model: H0: beta_age = beta_fatpercentage = beta_weight = 0, none of the predictors explains the variation in Water Intake. H1: at least one beta != 0, at least one predictor explains the variation in Water Intake.

Hypotheses for the individual predictors:

H0: beta_age = 0 H1: beta_age != 0

H0: beta_fatpercentage = 0 H1: beta_fatpercentage != 0

H0: beta_weight = 0 H1: beta_weight != 0

Linear regression model

lm_mod3 <- lm(Water_Intake ~ Age + Fat_Percentage + Weight, data = gym2)
summary(lm_mod3)

## 
## Call:
## lm(formula = Water_Intake ~ Age + Fat_Percentage + Weight, data = gym2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20090 -0.31331 -0.01078  0.26757  1.38899 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.2060335  0.1018924  31.465   <2e-16 ***
## Age             0.0026045  0.0012048   2.162   0.0309 *  
## Fat_Percentage -0.0504558  0.0024050 -20.980   <2e-16 ***
## Weight          0.0078540  0.0007103  11.057   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4572 on 969 degrees of freedom
## Multiple R-squared:  0.4214, Adjusted R-squared:  0.4196 
## F-statistic: 235.2 on 3 and 969 DF,  p-value: < 2.2e-16

gym2$res3 <- residuals(lm_mod3)

plot(x = gym2$Age, y = gym2$res3, main = "Age vs. residuals", xlab = "age", ylab = "residuals", col = "lavender")
abline(h = 0, lty= 2)

plot(x = gym2$Fat_Percentage, y = gym2$res3, main = "Fat Percentage vs. residuals", xlab = "Fat Percentage", ylab = "residuals", col = "lavender")
abline(h = 0, lty= 2)

plot(x = gym2$Weight, y = gym2$res3, main = "Weight vs. residuals", xlab = "weight", ylab = "residuals", col = "lavender")
abline(h = 0, lty= 2)

Conclusion

For age: the p-value (0.0309) is greater than 0.03, so we fail to reject H0; Age is not a statistically significant predictor of water intake at this level.

For fat percentage: the p-value (< 2e-16) is less than 0.03, so we reject H0; fat percentage is a statistically significant predictor of water intake.

For weight: the p-value (< 2e-16) is less than 0.03, so we reject H0; weight is a statistically significant predictor of water intake.

The R^2 is 0.4214, which means that about 42.14% of the variability in water intake is explained by the model. The adjusted R^2 is 0.4196, so the explanatory power after accounting for the number of predictors is 41.96%. Because the adjusted R^2 is only slightly below the R^2, the model is penalized very little for including three predictors, although Age is not significant at the 0.03 level.
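The relationship between the two R^2 values can be checked by hand. The sketch below applies the standard adjusted R^2 formula to the numbers in the output above; the sample size is inferred from the 969 residual degrees of freedom plus the 4 estimated coefficients.

```r
r_squared <- 0.4214   # Multiple R-squared from summary(lm_mod3)
n <- 969 + 4          # residual df plus the 4 estimated coefficients
p <- 3                # predictors: Age, Fat_Percentage, Weight

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)   # 0.4196
```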

Independence

library(lmtest)
dwtest(lm_mod3, alternative = "two.sided")

## 
##  Durbin-Watson test
## 
## data:  lm_mod3
## DW = 1.9057, p-value = 0.1405
## alternative hypothesis: true autocorrelation is not 0

Conclusion: The p-value (0.1405) is greater than 0.03, so we do not reject H0. There is no evidence of autocorrelation in the residuals, and the independence condition is met.

Normality check

qqnorm(lm_mod3$residuals)
qqline(lm_mod3$residuals, col = "blue")

shapiro.test(lm_mod3$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  lm_mod3$residuals
## W = 0.99484, p-value = 0.002126

Equal Variance

plot(x = lm_mod3$fitted.values, y = lm_mod3$residuals, main = "Fitted values and Residuals", xlab = "predicted values", ylab = "Residuals", col = "hotpink")
abline(h = 0, lty = 2)

Conclusion: The linearity condition is met because the residuals are randomly scattered around 0 in the plot for each predictor (age, fat percentage, weight).

Based on the Shapiro-Wilk test, the p-value (0.002126) is less than the significance level 0.03, so we reject H0 that the residuals are normally distributed. Formally, the normality condition is not met. However, the points in the QQ plot follow the line fairly well, and with 973 observations the test detects even small deviations, so the departure from normality appears mild.

From the plot of fitted values and residuals, the points are randomly scattered around 0, so we conclude that the equal variance condition is met as well.

From our hypothesis test, the p-value of the overall F-test (< 2.2e-16) is less than the significance level 0.03. As a result, we reject H0 and conclude that at least one predictor significantly affects water intake.

Task 4

Hypothesis: H0: mu1 = mu2 = mu3 = mu4, there is no difference in mean water intake between the four BMI classes. H1: at least one of the means is different.

One-way ANOVA

aov4 <- aov(Water_Intake ~ bmi_class, data = gym2)
summary(aov4)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## bmi_class     3   18.7   6.248   18.27 1.55e-11 ***
## Residuals   969  331.4   0.342                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: From our hypothesis test, the p-value (1.55e-11) is less than the significance level 0.04. As a result, we reject H0 and conclude that at least one of the group means is different.
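To show where the F value and p-value in the ANOVA table come from, the sketch below recomputes them from the rounded mean squares printed above. Because the inputs are rounded, the results match the table only approximately.

```r
ms_between <- 6.248   # Mean Sq for bmi_class
ms_within  <- 0.342   # Mean Sq for Residuals

# The F statistic is the ratio of between-group to within-group mean squares.
f_value <- ms_between / ms_within
f_value               # about 18.27, up to rounding of the inputs

# Upper-tail probability of the F distribution with 3 and 969 degrees of freedom.
p_value <- pf(f_value, df1 = 3, df2 = 969, lower.tail = FALSE)
p_value < 0.04        # TRUE, so H0 is rejected at the 0.04 level
```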

Normality Check:

gym2$residuals <- aov4$residuals

qqnorm(aov4$residuals[gym2$bmi_class == "Underweight"], main = "Underweight")
qqline(aov4$residuals[gym2$bmi_class == "Underweight"], col = "blue")

qqnorm(aov4$residuals[gym2$bmi_class == "Healthy"], main = "Healthy")
qqline(aov4$residuals[gym2$bmi_class == "Healthy"], col = "blue")

qqnorm(aov4$residuals[gym2$bmi_class == "Overweight"], main = "Overweight")
qqline(aov4$residuals[gym2$bmi_class == "Overweight"], col = "blue")

qqnorm(aov4$residuals[gym2$bmi_class == "Obese"], main = "Obese")
qqline(aov4$residuals[gym2$bmi_class == "Obese"], col = "blue")

shapiro.test(aov4$residuals[gym2$bmi_class == "Underweight"])

## 
##  Shapiro-Wilk normality test
## 
## data:  aov4$residuals[gym2$bmi_class == "Underweight"]
## W = 0.97482, p-value = 0.003754

shapiro.test(aov4$residuals[gym2$bmi_class == "Healthy"])

## 
##  Shapiro-Wilk normality test
## 
## data:  aov4$residuals[gym2$bmi_class == "Healthy"]
## W = 0.9512, p-value = 1.026e-09

shapiro.test(aov4$residuals[gym2$bmi_class == "Overweight"])

## 
##  Shapiro-Wilk normality test
## 
## data:  aov4$residuals[gym2$bmi_class == "Overweight"]
## W = 0.93553, p-value = 7.8e-09

shapiro.test(aov4$residuals[gym2$bmi_class == "Obese"])

## 
##  Shapiro-Wilk normality test
## 
## data:  aov4$residuals[gym2$bmi_class == "Obese"]
## W = 0.92777, p-value = 3.786e-08

Conclusion: Based on the Shapiro-Wilk tests, the p-values (Underweight 0.003754, Healthy 1.026e-09, Overweight 7.8e-09, Obese 3.786e-08) are all less than the significance level 0.04. So we reject H0 of normality in every group, and the normality condition is not met. Additionally, in the QQ plots the points do not follow the line well. Because normality is not met, we use Levene's test, which is less sensitive to non-normality, to check equal variance.

Equal Variance

library(car)

## Loading required package: carData

leveneTest(Water_Intake ~ bmi_class, data = gym2)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   3  3.5513 0.01409 *
##       969                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

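Conclusion: The p-value (0.01409) is less than 0.04, so we reject H0 that the group variances are equal. The equal variance condition is not met, so the ANOVA result should be interpreted with caution.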
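When the equal variance condition fails, one commonly used alternative is Welch's one-way test, which base R provides through `oneway.test()` with `var.equal = FALSE`. This is not part of the original analysis. The sketch below only illustrates the call on simulated data (the data frame `sim` and its group labels are made up).

```r
# Simulated data for illustration only; not the gym2 data.
set.seed(42)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), each = 50)),
  y     = c(rnorm(50, mean = 2.5, sd = 0.4),
            rnorm(50, mean = 2.6, sd = 0.6),
            rnorm(50, mean = 2.6, sd = 0.8),
            rnorm(50, mean = 3.0, sd = 1.0))
)

# Welch's one-way test does not assume equal variances across groups.
welch <- oneway.test(y ~ group, data = sim, var.equal = FALSE)
welch
```

With the project data, the analogous call would be `oneway.test(Water_Intake ~ bmi_class, data = gym2, var.equal = FALSE)`.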

Box Plot

boxplot(Water_Intake ~ bmi_class, data = gym2,
        col = "lightpink", main = "Water Intake by BMI Class",
        ylab = "Water Intake")

Tukey Test

tukey4 <- TukeyHSD(aov4, conf.level = 0.96)
print(tukey4)

##   Tukey multiple comparisons of means
##     96% family-wise confidence level
## 
## Fit: aov(formula = Water_Intake ~ bmi_class, data = gym2)
## 
## $bmi_class
##                               diff        lwr         upr     p adj
## Obese-Healthy           0.31782376  0.1796462  0.45600136 0.0000000
## Overweight-Healthy      0.02459459 -0.1036819  0.15287112 0.9568729
## Underweight-Healthy    -0.10219112 -0.2467207  0.04233841 0.2380269
## Overweight-Obese       -0.29322917 -0.4432363 -0.14322206 0.0000015
## Underweight-Obese      -0.42001488 -0.5841369 -0.25589282 0.0000000
## Underweight-Overweight -0.12678571 -0.2826635  0.02909202 0.1350676

plot(tukey4)

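Conclusion: From the Tukey test, the Obese group has significantly higher mean water intake than each of the other three groups (adjusted p-values all below 0.04). The differences among the Healthy, Overweight and Underweight groups are not significant.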
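The significant pairs can be picked out by comparing each adjusted p-value with the 0.04 level. A minimal sketch, with the values copied from the Tukey output above (the vector name `p_adj` is introduced here for illustration):

```r
# Adjusted p-values copied from the TukeyHSD output above.
p_adj <- c("Obese-Healthy"          = 0.0000000,
           "Overweight-Healthy"     = 0.9568729,
           "Underweight-Healthy"    = 0.2380269,
           "Overweight-Obese"       = 0.0000015,
           "Underweight-Obese"      = 0.0000000,
           "Underweight-Overweight" = 0.1350676)

alpha <- 0.04
names(p_adj)[p_adj < alpha]   # pairs whose means differ significantly
```

All three significant pairs involve the Obese group. With the fitted object available, the same filter could be applied to `tukey4$bmi_class[, "p adj"]`.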

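Task 5

Hypothesis for workout type: H0: alpha1 = alpha2 = alpha3 = alpha4, there is no difference in mean calories burned between the workout types. H1: at least one of the workout type means is different.

Hypothesis for experience level: H0: Beta1 = Beta2 = Beta3, there is no difference in mean calories burned between the experience levels. H1: at least one of the experience level means is different.

Hypothesis for the interaction: H0: there is no interaction between workout type and experience level. H1: there is an interaction between workout type and experience level.

Barplot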

mean_calories <- tapply(gym2$Calories_Burned, gym2$Workout_Type, mean)

barplot(mean_calories,
        xlab = "Workout Type", 
        ylab = "Mean Calories Burned",
        main = "Average Calories Burned by Workout Type")
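The barplot above averages over workout type only. To display both factors at once, one option is a grouped barplot of the cell means. The sketch below shows the idea on simulated data; the data frame `sim` and its level names are made up and are not the gym2 values.

```r
# Simulated data for illustration only; the level names here are made up.
set.seed(7)
sim <- data.frame(
  workout    = factor(sample(c("TypeA", "TypeB", "TypeC", "TypeD"), 200, replace = TRUE)),
  experience = factor(sample(1:3, 200, replace = TRUE)),
  calories   = rnorm(200, mean = 900, sd = 200)
)

# Mean calories for every combination: rows are experience levels, columns are workout types.
cell_means <- tapply(sim$calories, list(sim$experience, sim$workout), mean)

# One cluster of bars per workout type, one bar per experience level.
barplot(cell_means, beside = TRUE, legend.text = rownames(cell_means),
        xlab = "Workout Type", ylab = "Mean Calories Burned")
```

Roughly parallel bar patterns across the clusters would suggest little interaction between the two factors.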

Two-way ANOVA

aov5 <- aov(Calories_Burned ~ Workout_Type * Experience_Level, data = gym2)
summary(aov5)

##                                Df   Sum Sq  Mean Sq F value Pr(>F)    
## Workout_Type                    3   211670    70557   1.936  0.122    
## Experience_Level                2 36723588 18361794 503.764 <2e-16 ***
## Workout_Type:Experience_Level   6   289090    48182   1.322  0.244    
## Residuals                     961 35027713    36449                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: For workout type, the p-value (0.122) is greater than the significance level 0.08. As a result, we fail to reject H0; there is no evidence that mean calories burned differ between workout types.

Normality Check

gym2$residuals <- aov5$residuals
qqnorm(aov5$residual)
qqline(aov5$residual, col = "blue")

shapiro.test(aov5$residual)

## 
##  Shapiro-Wilk normality test
## 
## data:  aov5$residual
## W = 0.98995, p-value = 3.286e-06

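Conclusion: Based on the Shapiro-Wilk test, the p-value (3.286e-06) is less than the significance level 0.08, so we reject H0 that the residuals are normally distributed. Formally, the normality condition is not met. However, the points in the QQ plot follow the line fairly well, so the departure from normality appears mild. Equal variance is checked next with Bartlett's test, keeping in mind that Bartlett's test is sensitive to non-normality.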

Equal Variance

bartlett.test(Calories_Burned ~ interaction(Workout_Type, Experience_Level), data = gym2)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  Calories_Burned by interaction(Workout_Type, Experience_Level)
## Bartlett's K-squared = 73.372, df = 11, p-value = 2.78e-11

main_effectsModel <- aov(Calories_Burned ~ Workout_Type*Experience_Level, data = gym2)
summary(main_effectsModel)

##                                Df   Sum Sq  Mean Sq F value Pr(>F)    
## Workout_Type                    3   211670    70557   1.936  0.122    
## Experience_Level                2 36723588 18361794 503.764 <2e-16 ***
## Workout_Type:Experience_Level   6   289090    48182   1.322  0.244    
## Residuals                     961 35027713    36449                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: For equal variance, the Bartlett p-value (2.78e-11) is less than 0.08, so we reject H0 that the group variances are equal; the equal variance condition is not met.

For workout type, the p-value (0.122) is greater than 0.08, so we do not reject H0. There is no strong evidence to suggest that workout type has a significant impact on calories burned.

For experience level, the p-value (< 2e-16) is less than 0.08, so we reject H0. There is strong evidence to suggest that experience level has a significant impact on calories burned.

For the interaction effect, the p-value (0.244) is greater than 0.08, so we do not reject H0. There is no strong evidence of an interaction between workout type and experience level. A significant interaction would mean that the effect of one factor depends on the level of the other. Because the interaction is not statistically significant, the main effects can be interpreted directly.
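Since the interaction is not significant, a model with main effects only (`+` instead of `*` in the formula) is a natural follow-up. The sketch below contrasts the two formulas on simulated data; the data frame `sim`, its level names and the effect sizes are made up for illustration.

```r
# Simulated data for illustration only; not the gym2 data.
set.seed(11)
sim <- data.frame(
  workout    = factor(sample(c("TypeA", "TypeB", "TypeC", "TypeD"), 300, replace = TRUE)),
  experience = factor(sample(1:3, 300, replace = TRUE))
)
sim$calories <- 700 + 150 * as.integer(sim$experience) + rnorm(300, sd = 150)

full_fit     <- aov(calories ~ workout * experience, data = sim)  # main effects and interaction
additive_fit <- aov(calories ~ workout + experience, data = sim)  # main effects only

summary(additive_fit)

# F-test comparing the two nested models; this tests the interaction term.
anova(additive_fit, full_fit)
```

With the project data, the main-effects-only model would be `aov(Calories_Burned ~ Workout_Type + Experience_Level, data = gym2)`.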


Dorothy Tang

2025-04-18