Task 1
gym2$Workout_Type <- as.factor(gym2$Workout_Type)
gym2$Experience_Level <- factor(gym2$Experience_Level, ordered = TRUE)
gym2$Gender <- as.factor(gym2$Gender)
gym2$bmi_class <- factor(gym2$bmi_class, ordered = TRUE)
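When no levels are given, factor() sorts the levels alphabetically, which is why the Tukey output in Task 4 lists the groups as Healthy, Obese, Overweight, Underweight. The sketch below shows how explicit levels could give the ordering a real meaning. The vector bmi_demo is a made-up example, and the proposed order (Underweight < Healthy < Overweight < Obese) is an assumption about what the classes mean.

```r
# Hypothetical labels; the class names are taken from the Tukey output in Task 4.
bmi_demo <- c("Obese", "Healthy", "Underweight", "Overweight", "Healthy")

# Default behaviour: levels are sorted alphabetically.
default_order <- factor(bmi_demo, ordered = TRUE)
levels(default_order)

# Explicit levels (assumed order) make comparisons such as ">" meaningful.
explicit_order <- factor(bmi_demo,
                         levels = c("Underweight", "Healthy", "Overweight", "Obese"),
                         ordered = TRUE)
levels(explicit_order)

# Is the first observation ("Obese") above the second ("Healthy")?
explicit_order[1] > explicit_order[2]
```

Changing the level order does not change the ANOVA F-test, but it does change the order and labels of the pairwise comparisons and the boxplot.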
Task 2
IV: Workout_Frequency
DV: Calories_Burned
Hypothesis:
H0: Beta = 0, Workout Frequency does not explain the variation in Calories Burned.
H1: Beta != 0, Workout Frequency does explain the variation in Calories Burned.
Scatterplot for the data
plot(gym2$Workout_Frequency, gym2$Calories_Burned,
     pch = 16, col = "hotpink",
     xlab = "Workout Frequency",
     ylab = "Calories Burned",
     main = "Calories Burned vs. Workout Frequency")
abline(lm(Calories_Burned ~ Workout_Frequency, data = gym2), col = "lightgreen", lwd = 2)
Linear regression model
lm_modle <- lm(Calories_Burned~Workout_Frequency, data = gym2)
summary(lm_modle)
##
## Call:
## lm(formula = Calories_Burned ~ Workout_Frequency, data = gym2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -547.08 -164.04 -11.12 147.88 760.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 333.953 26.981 12.38 <2e-16 ***
## Workout_Frequency 172.042 7.832 21.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 223 on 971 degrees of freedom
## Multiple R-squared: 0.3319, Adjusted R-squared: 0.3313
## F-statistic: 482.5 on 1 and 971 DF, p-value: < 2.2e-16
summary(lm_modle)$r.squared
## [1] 0.331949
coef(lm_modle)
## (Intercept) Workout_Frequency
## 333.9529 172.0420
Regression line equation from the model:
Calories_Burned = 333.953 + 172.042 * Workout_Frequency
R squared: 0.3319
Conclusion:
The linear regression model predicts 333.953 calories burned when Workout Frequency is 0 (the intercept), and each one-unit increase in Workout Frequency is associated with an average increase of 172.042 calories burned. The R^2 is 0.3319, which means that about 33.19% of the variability in calories burned is explained by workout frequency. Because this R^2 is only moderate, there is still room to improve the model, for example by adding session duration, experience level, or workout type as predictors.
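As a small worked example, the fitted coefficients can be used to compute predicted values by hand. The sketch below copies the two estimates from the summary output; the frequencies 2 to 5 are illustrative inputs chosen for this example, not values taken from the data.

```r
# Coefficients copied from the summary(lm_modle) output above.
b0 <- 333.953   # intercept
b1 <- 172.042   # slope for Workout_Frequency

# Illustrative workout frequencies (assumed values, not from the data set).
freq <- 2:5

# Predicted calories burned from the fitted line.
predicted <- b0 + b1 * freq
data.frame(Workout_Frequency = freq, Predicted_Calories = predicted)
```

With the fitted model in memory, predict(lm_modle, newdata = data.frame(Workout_Frequency = 2:5)) should give the same values up to rounding of the coefficients.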
Correlation
cor_gym <- cor(gym2$Workout_Frequency,
               gym2$Calories_Burned,
               use = "complete.obs",
               method = "pearson")
print(cor_gym)
## [1] 0.5761501
Conclusion: The correlation is 0.5761501, a positive value, which means there is a moderate positive correlation between calories burned and workout frequency. In other words, a higher workout frequency tends to go together with more calories burned, although correlation alone does not show that one causes the other.
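In a simple linear regression with one predictor, the R^2 equals the square of the Pearson correlation, so the two results above can be checked against each other. The sketch below uses the two numbers copied from the output; the tolerance of 1e-6 is an arbitrary choice to allow for rounding in the printed values.

```r
# Values copied from the output above.
r <- 0.5761501                 # Pearson correlation
r_squared_reported <- 0.331949 # summary(lm_modle)$r.squared

# Squared correlation and its difference from the reported R^2.
r^2
abs(r^2 - r_squared_reported) < 1e-6
```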
Independence, Normality and Equal Variance Check
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(lm_modle, alternative = "two.sided")
##
## Durbin-Watson test
##
## data: lm_modle
## DW = 2.0843, p-value = 0.188
## alternative hypothesis: true autocorrelation is not 0
qqnorm(lm_modle$residuals)
qqline(lm_modle$residuals, col = "lightblue")
shapiro.test(lm_modle$residuals)
##
## Shapiro-Wilk normality test
##
## data: lm_modle$residuals
## W = 0.99324, p-value = 0.00021
par(mfrow = c(2,2))
plot(lm_modle)
par(mfrow = c(1,1))
plot(x = lm_modle$fitted.values, y = lm_modle$residuals, main = "Fitted values and Residuals", xlab = "predicted values", ylab = "Residuals", col = "hotpink")
abline(h = 0, lty = 2)
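Conclusions:
Linearity: the residuals-versus-fitted plot shows points scattered randomly around 0 with no clear pattern, so we can assume the linearity condition is met.
Independence: the Durbin-Watson test gives DW = 2.0843 with a p-value of 0.188, which is greater than the significance level 0.06, so we do not reject H0 of no autocorrelation and the independence condition is met.
Normality: the Shapiro-Wilk p-value (0.00021) is less than the significance level 0.06, so we reject H0 that the residuals are normally distributed. However, W = 0.99324 is close to 1 and the points on the QQ plot follow the line fairly well, so the departure from normality appears mild; with a sample of this size the Shapiro-Wilk test can detect even small deviations.
Equal variance: in the plot of fitted values and residuals, the points are scattered randomly around 0, so we can conclude that the equal variance condition is met as well.
Hypothesis test: the p-value for Workout_Frequency (<2e-16) is less than the significance level 0.06. As a result, we reject H0 and conclude that Workout Frequency does explain the variation in Calories Burned.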
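The decision rule used for each test is the same: reject H0 when the p-value is below the significance level. For the assumption tests, H0 is the statement that the assumption holds, so rejecting H0 means the assumption is in doubt. The helper decide() below is a hypothetical function written only for this illustration.

```r
# Hypothetical helper: compare a p-value with the significance level alpha.
decide <- function(p_value, alpha) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

# p-values copied from the Task 2 output; alpha = 0.06 as used in this task.
decide(0.188, alpha = 0.06)    # Durbin-Watson: H0 = no autocorrelation
decide(0.00021, alpha = 0.06)  # Shapiro-Wilk: H0 = residuals are normal
```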
Task 3
Hypothesis (overall model):
H0: beta_age = beta_fatpercentage = beta_weight = 0, none of the predictors explains the variation in Water Intake.
H1: at least one beta is not 0.
Linear regression model (individual predictors)
H0: beta_age = 0 H1: beta_age != 0
H0: beta_fatpercentage = 0 H1: beta_fatpercentage != 0
H0: beta_weight = 0 H1: beta_weight != 0
lm_mod3 <- lm(Water_Intake ~ Age + Fat_Percentage + Weight, data = gym2)
summary(lm_mod3)
##
## Call:
## lm(formula = Water_Intake ~ Age + Fat_Percentage + Weight, data = gym2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20090 -0.31331 -0.01078 0.26757 1.38899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2060335 0.1018924 31.465 <2e-16 ***
## Age 0.0026045 0.0012048 2.162 0.0309 *
## Fat_Percentage -0.0504558 0.0024050 -20.980 <2e-16 ***
## Weight 0.0078540 0.0007103 11.057 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4572 on 969 degrees of freedom
## Multiple R-squared: 0.4214, Adjusted R-squared: 0.4196
## F-statistic: 235.2 on 3 and 969 DF, p-value: < 2.2e-16
gym2$res3 <- residuals(lm_mod3)
plot(x = gym2$Age, y = gym2$res3, main = "Age vs. residuals", xlab = "age", ylab = "residuals", col = "lavender")
abline(h = 0, lty= 2)
plot(x = gym2$Fat_Percentage, y = gym2$res3, main = "Fat Percentage vs. residuals", xlab = "Fat Percentage", ylab = "residuals", col = "lavender")
abline(h = 0, lty= 2)
plot(x = gym2$Weight, y = gym2$res3, main = "Weight vs. residuals", xlab = "weight", ylab = "residuals", col = "lavender")
abline(h = 0, lty= 2)
Conclusion:
For Age: the p-value (0.0309) is greater than 0.03, so we fail to reject H0; Age is not a statistically significant predictor of water intake at this level.
For fat percentage: the p-value (<2e-16) is less than 0.03, so we reject H0; fat percentage is a statistically significant predictor of water intake.
For weight: the p-value (<2e-16) is less than 0.03, so we reject H0; weight is a statistically significant predictor of water intake.
The R^2 is 0.4214, which means that about 42.14% of the variability in water intake is explained by the model. The adjusted R^2 is 0.4196 (41.96%). Since the adjusted R^2 is only slightly lower than the R^2, the model is not being penalized much for the number of predictors, although Age contributes little given that it is not significant at 0.03.
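The adjusted R^2 can be reproduced from the R^2, the sample size, and the number of predictors using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below takes its inputs from the summary output; n is recovered from the 969 residual degrees of freedom plus the four estimated coefficients.

```r
# Values copied from the summary(lm_mod3) output.
r2 <- 0.4214      # Multiple R-squared
p  <- 3           # number of predictors: Age, Fat_Percentage, Weight
n  <- 969 + p + 1 # residual df + predictors + intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```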
Independence
library(lmtest)
dwtest(lm_mod3, alternative = "two.sided")
##
## Durbin-Watson test
##
## data: lm_mod3
## DW = 1.9057, p-value = 0.1405
## alternative hypothesis: true autocorrelation is not 0
Conclusion: The p-value (0.1405) is greater than 0.03, so we do not reject H0 of no autocorrelation; the independence condition is met.
Normality check:
qqnorm(lm_mod3$residuals)
qqline(lm_mod3$residuals, col = "blue")
shapiro.test(lm_mod3$residuals)
##
## Shapiro-Wilk normality test
##
## data: lm_mod3$residuals
## W = 0.99484, p-value = 0.002126
Equal Variance:
plot(x = lm_mod3$fitted.values, y = lm_mod3$residuals, main = "Fitted values and Residuals", xlab = "predicted values", ylab = "Residuals", col = "hotpink")
abline(h = 0, lty = 2)
Conclusion:
Linearity: in the plots of residuals against each predictor (age, fat percentage, weight), the points are scattered randomly around 0, so we can assume the linearity condition is met.
Normality: the Shapiro-Wilk p-value (0.002126) is less than the significance level 0.03, so we reject H0 that the residuals are normally distributed. However, W = 0.99484 is close to 1 and the points on the QQ plot follow the line fairly well, so the departure from normality appears mild.
Equal variance: in the plot of fitted values and residuals, the points are scattered randomly around 0, so we can conclude that the equal variance condition is met as well.
Hypothesis test: the p-value of the overall F-test (< 2.2e-16) is less than the significance level 0.03. As a result, we reject H0 and conclude that at least one predictor significantly affects water intake.
Task 4
Hypothesis:
H0: mu1 = mu2 = mu3 = mu4, there is no difference in mean Water Intake between the BMI classes.
H1: at least one of the means is different.
One-way ANOVA
aov4 <- aov(Water_Intake ~ bmi_class, data = gym2)
summary(aov4)
## Df Sum Sq Mean Sq F value Pr(>F)
## bmi_class 3 18.7 6.248 18.27 1.55e-11 ***
## Residuals 969 331.4 0.342
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: From our hypothesis test, the p-value (1.55e-11) is less than the significance level 0.04. As a result, we reject H0 and conclude that at least one of the group means is different.
Normality Check:
gym2$residuals <- aov4$residuals
qqnorm(aov4$residuals[gym2$bmi_class == "Underweight"], main = "Underweight")
qqline(aov4$residuals[gym2$bmi_class == "Underweight"], col = "blue")
qqnorm(aov4$residuals[gym2$bmi_class == "Healthy"], main = "Healthy")
qqline(aov4$residuals[gym2$bmi_class == "Healthy"], col = "blue")
qqnorm(aov4$residuals[gym2$bmi_class == "Overweight"], main = "Overweight")
qqline(aov4$residuals[gym2$bmi_class == "Overweight"], col = "blue")
qqnorm(aov4$residuals[gym2$bmi_class == "Obese"], main = "Obese")
qqline(aov4$residuals[gym2$bmi_class == "Obese"], col = "blue")
shapiro.test(aov4$residuals[gym2$bmi_class == "Underweight"])
##
## Shapiro-Wilk normality test
##
## data: aov4$residuals[gym2$bmi_class == "Underweight"]
## W = 0.97482, p-value = 0.003754
shapiro.test(aov4$residuals[gym2$bmi_class == "Healthy"])
##
## Shapiro-Wilk normality test
##
## data: aov4$residuals[gym2$bmi_class == "Healthy"]
## W = 0.9512, p-value = 1.026e-09
shapiro.test(aov4$residuals[gym2$bmi_class == "Overweight"])
##
## Shapiro-Wilk normality test
##
## data: aov4$residuals[gym2$bmi_class == "Overweight"]
## W = 0.93553, p-value = 7.8e-09
shapiro.test(aov4$residuals[gym2$bmi_class == "Obese"])
##
## Shapiro-Wilk normality test
##
## data: aov4$residuals[gym2$bmi_class == "Obese"]
## W = 0.92777, p-value = 3.786e-08
Conclusion: Based on the Shapiro-Wilk test, the p-values (Underweight 0.003754, Healthy 1.026e-09, Overweight 7.8e-09, Obese 3.786e-08) are all less than the significance level 0.04, so we reject H0 for every group and the normality condition is not met. Additionally, in the QQ plots the points do not follow the line well. Because the data are not normal, we use the Levene test, which is less sensitive to non-normality, to check equal variance.
Equal Variance
library(car)
## Loading required package: carData
leveneTest(Water_Intake ~ bmi_class, data = gym2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 3.5513 0.01409 *
## 969
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
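Conclusion: The p-value (0.01409) is less than 0.04, so we reject H0 of equal variances; the equal variance condition is not met, and the ANOVA result should be interpreted with caution.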
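When normality and equal variance are both in doubt, two base R alternatives to the classical one-way ANOVA are Welch's ANOVA (oneway.test with var.equal = FALSE) and the Kruskal-Wallis rank test. The sketch below runs both on simulated data; the data frame demo, its group labels, and its means and standard deviations are invented for illustration and are not the gym2 data.

```r
set.seed(1)

# Simulated data (assumption): three groups with unequal spread.
demo <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  y = c(rnorm(30, mean = 2.5, sd = 0.4),
        rnorm(30, mean = 2.6, sd = 0.6),
        rnorm(30, mean = 3.0, sd = 0.8))
)

# Welch's ANOVA: does not assume equal variances across groups.
welch <- oneway.test(y ~ group, data = demo, var.equal = FALSE)

# Kruskal-Wallis: rank-based, does not assume normality.
kw <- kruskal.test(y ~ group, data = demo)

welch$p.value
kw$p.value
```

For this report the corresponding calls would use Water_Intake ~ bmi_class with data = gym2.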
Box Plot
boxplot(Water_Intake ~ bmi_class, data = gym2,
col = "lightpink", main = "Water Intake by BMI Class",
ylab = "Water Intake")
Tukey Test
tukey4 <- TukeyHSD(aov4, conf.level = 0.96)
print(tukey4)
## Tukey multiple comparisons of means
## 96% family-wise confidence level
##
## Fit: aov(formula = Water_Intake ~ bmi_class, data = gym2)
##
## $bmi_class
## diff lwr upr p adj
## Obese-Healthy 0.31782376 0.1796462 0.45600136 0.0000000
## Overweight-Healthy 0.02459459 -0.1036819 0.15287112 0.9568729
## Underweight-Healthy -0.10219112 -0.2467207 0.04233841 0.2380269
## Overweight-Obese -0.29322917 -0.4432363 -0.14322206 0.0000015
## Underweight-Obese -0.42001488 -0.5841369 -0.25589282 0.0000000
## Underweight-Overweight -0.12678571 -0.2826635 0.02909202 0.1350676
plot(tukey4)
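Conclusion: The Tukey test shows that the Obese group has significantly higher water intake than each of the other groups (Obese-Healthy diff = 0.318, Overweight-Obese diff = -0.293, Underweight-Obese diff = -0.420, all with adjusted p-values below 0.04). The differences among the Healthy, Overweight, and Underweight groups are not significant (adjusted p-values 0.957, 0.238, and 0.135).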
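Task 5
Hypothesis for workout type
H0: alpha1 = alpha2 = alpha3 = alpha4, there is no difference in mean Calories Burned between the workout types.
H1: at least one of the workout type means is different.
Hypothesis for experience level
H0: Beta1 = Beta2 = Beta3, there is no difference in mean Calories Burned between the experience levels.
H1: at least one of the experience level means is different.
Hypothesis for interaction
H0: there is no interaction between workout type and experience level.
H1: there is an interaction between workout type and experience level.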
Barplot
mean_calories <- tapply(gym2$Calories_Burned, gym2$Workout_Type, mean)
barplot(mean_calories,
        xlab = "Workout Type",
        ylab = "Mean Calories Burned",
        main = "Average Calories Burned by Workout Type")
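The barplot above summarizes workout type only. To show both factors of the two-way design, tapply() can compute a mean for every combination and barplot() can draw the bars side by side. The sketch below uses simulated data: the data frame demo, the four workout type names, and the numeric values are assumptions made for illustration.

```r
set.seed(2)

# Simulated data (assumption): 4 workout types and 3 experience levels.
demo <- data.frame(
  Workout_Type = factor(sample(c("Cardio", "HIIT", "Strength", "Yoga"),
                               200, replace = TRUE)),
  Experience_Level = factor(sample(1:3, 200, replace = TRUE)),
  Calories_Burned = rnorm(200, mean = 900, sd = 250)
)

# Matrix of cell means: rows = experience level, columns = workout type.
cell_means <- tapply(demo$Calories_Burned,
                     list(demo$Experience_Level, demo$Workout_Type),
                     mean)

# Grouped bars: one cluster per workout type, one bar per experience level.
barplot(cell_means, beside = TRUE,
        xlab = "Workout Type",
        ylab = "Mean Calories Burned",
        main = "Mean Calories Burned by Workout Type and Experience Level",
        legend.text = rownames(cell_means),
        args.legend = list(title = "Experience Level", x = "topright"))
```

Roughly parallel bar patterns across the clusters would be consistent with little or no interaction between the two factors.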
Two-way ANOVA
aov5 <- aov(Calories_Burned ~ Workout_Type * Experience_Level, data = gym2)
summary(aov5)
## Df Sum Sq Mean Sq F value Pr(>F)
## Workout_Type 3 211670 70557 1.936 0.122
## Experience_Level 2 36723588 18361794 503.764 <2e-16 ***
## Workout_Type:Experience_Level 6 289090 48182 1.322 0.244
## Residuals 961 35027713 36449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
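Conclusion: For workout type, the p-value (0.122) is greater than the significance level 0.08, so we fail to reject H0; there is no evidence that mean Calories Burned differs between workout types. For experience level, the p-value (<2e-16) is less than 0.08, so we reject H0 and conclude that at least one of the experience level means is different.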
Normality Check
gym2$residuals <- aov5$residuals
qqnorm(aov5$residual)
qqline(aov5$residual, col = "blue")
shapiro.test(aov5$residual)
##
## Shapiro-Wilk normality test
##
## data: aov5$residual
## W = 0.98995, p-value = 3.286e-06
Conclusion: Based on the Shapiro-Wilk test, the p-value (3.286e-06) is less than the significance level 0.08, so we reject H0 and the normality condition is not formally met. From the QQ plot, however, the points follow the line fairly well, and W = 0.98995 is close to 1, so the departure appears mild. Next we check equal variance with the Bartlett test; note that Bartlett's test is sensitive to non-normality.
Equal Variance
bartlett.test(Calories_Burned ~ interaction(Workout_Type, Experience_Level), data = gym2)
##
## Bartlett test of homogeneity of variances
##
## data: Calories_Burned by interaction(Workout_Type, Experience_Level)
## Bartlett's K-squared = 73.372, df = 11, p-value = 2.78e-11
Conclusion: The p-value (2.78e-11) is less than 0.08, so we reject H0 of equal variances; the equal variance condition is not met, and the ANOVA results should be interpreted with caution.
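# Note: the formula uses *, so this model still includes the interaction term and is identical to aov5.
main_effectsModel <- aov(Calories_Burned ~ Workout_Type*Experience_Level, data = gym2)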
summary(main_effectsModel)
## Df Sum Sq Mean Sq F value Pr(>F)
## Workout_Type 3 211670 70557 1.936 0.122
## Experience_Level 2 36723588 18361794 503.764 <2e-16 ***
## Workout_Type:Experience_Level 6 289090 48182 1.322 0.244
## Residuals 961 35027713 36449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: For workout type, the p-value (0.122) is greater than 0.08, so we do not reject H0; there is no strong evidence that workout type has a significant impact on calories burned. For experience level, the p-value (<2e-16) is less than 0.08, so we reject H0; there is strong evidence that experience level has a significant impact on calories burned. For the interaction effect, the p-value (0.244) is greater than 0.08, so we do not reject H0; there is no strong evidence of an interaction between workout type and experience level. An interaction would mean that the effect of one factor depends on the level of the other. Because the interaction is not statistically significant here, the main effects can be interpreted on their own.
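When the interaction is not significant, a common follow-up is to fit a main-effects-only model with + instead of * and compare it with the interaction model. The sketch below does this on simulated, balanced data; the data frame demo, the type labels Type1 to Type4, and the effect sizes are assumptions made for illustration and are not the gym2 data.

```r
set.seed(3)

# Simulated balanced design (assumption): 4 types x 3 levels x 20 replicates.
demo <- expand.grid(Workout_Type = paste0("Type", 1:4),
                    Experience_Level = 1:3,
                    rep = 1:20)

# Response depends on experience level only, plus random noise.
demo$Calories_Burned <- 600 + 300 * demo$Experience_Level +
  rnorm(nrow(demo), sd = 190)
demo$Workout_Type <- factor(demo$Workout_Type)
demo$Experience_Level <- factor(demo$Experience_Level)

# Main-effects-only model (+) versus model with interaction (*).
additive_model <- aov(Calories_Burned ~ Workout_Type + Experience_Level,
                      data = demo)
interaction_model <- aov(Calories_Burned ~ Workout_Type * Experience_Level,
                         data = demo)

# F-test for whether the interaction terms improve the fit.
anova(additive_model, interaction_model)
```

The two models differ by the (4 - 1) * (3 - 1) = 6 interaction degrees of freedom, which matches the Df shown for Workout_Type:Experience_Level in the ANOVA table above.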