setwd("/Users/dorothytang/Desktop")
gym <- read.csv("gym.csv", header = TRUE, stringsAsFactors = FALSE)
# or however you're loading it
# Create some sample data
x <- 1:15
y <- x + rnorm(15)
# Plot the data
plot(x, y)
# Add a vertical line at x = 2
abline(v = 2)
## For our code to run and appear in the knitted document, we need to
place it inside an R code chunk in the R Markdown file. ##
I will knit to HTML first, because an HTML file is easier to edit and adjust later and can still be converted to a PDF.
If we want to create a PDF after knitting to HTML, we can open the HTML file in a browser and print it to PDF.
gym$Workout_Type <- as.factor(gym$Workout_Type)
gym$Gender <- as.factor(gym$Gender)
gym$Experience_Level <- as.factor(
gym$Experience_Level)
sum(is.na(gym$Avg_BPM))
## [1] 0
#Histogram
hist(
gym$Avg_BPM,
main = "Histogram of Avg_BPM",
xlab = "Average Beats Per Minute",
col = "lightblue",
breaks = 15
)
#Boxplot
boxplot(
gym$Avg_BPM,
main = "Boxplot of Avg_BPM",
ylab = "Average Beats Per Minute",
col = "hotpink"
)
summary(gym$Avg_BPM)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 120.0 131.0 143.0 143.8 156.0 169.0
mean_val <- mean(gym$Avg_BPM, na.rm = TRUE)
median_val <- median(gym$Avg_BPM, na.rm = TRUE)
sd_val <- sd(gym$Avg_BPM, na.rm = TRUE)
min_val <- min(gym$Avg_BPM, na.rm = TRUE)
max_val <- max(gym$Avg_BPM, na.rm = TRUE)
quantiles <- quantile(gym$Avg_BPM, na.rm = TRUE)
my_stats <- list(
Mean = mean_val,
Median = median_val,
StdDev = sd_val,
Minimum = min_val,
Maximum = max_val,
Quantiles = quantiles
)
print(my_stats)
## $Mean
## [1] 143.7667
##
## $Median
## [1] 143
##
## $StdDev
## [1] 14.3451
##
## $Minimum
## [1] 120
##
## $Maximum
## [1] 169
##
## $Quantiles
## 0% 25% 50% 75% 100%
## 120 131 143 156 169
## Summary
First, we checked for missing values in Avg_BPM. The result is 0, so no data are missing for that variable. Second, we visualized the data. The histogram shows how many individuals fall into each heart rate range. Its roughly uniform, slightly skewed shape suggests that heart rates are spread fairly evenly across the lower and middle ranges, with fewer values in the highest ranges. The boxplot shows the minimum, quartiles, and maximum of Avg_BPM: half of the heart rates fall between 131 BPM (the 25th percentile) and 156 BPM (the 75th percentile). It also shows no outliers far from the rest of the data.
## Important numbers to summarize the data
- The lowest recorded heart rate (minimum) is 120 beats per minute (BPM).
- The highest recorded heart rate (maximum) is 169 BPM.
- The mean (average) heart rate is about 144 BPM.
- The median heart rate is 143 BPM.
- The standard deviation is about 14 BPM, so heart rates typically differ from the mean by roughly 14 BPM.

Overall, every average heart rate in the data lies between 120 and 169 BPM, and the middle half lies between 131 and 156 BPM. The mean and median are nearly equal (about 144 and 143 BPM), the spread is moderate (a standard deviation of about 14 BPM), and there are no unusual values. Heart rates in this group are therefore centered in the mid-140s, with enough variation to range from 120 up to the upper 160s.
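The statement that the boxplot shows no outliers can be double-checked by hand. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule in `boxplot()`, to the quartiles printed above. It types the summary values in directly rather than reading `gym.csv`, and the variable names are only illustrative. Note that `boxplot()` builds its box from hinges, which can differ slightly from `quantile()` results.

```r
# Five-number summary of Avg_BPM, copied from the output above
q1 <- 131
q3 <- 156
min_bpm <- 120
max_bpm <- 169

iqr <- q3 - q1                    # interquartile range: 25
lower_fence <- q1 - 1.5 * iqr     # 93.5
upper_fence <- q3 + 1.5 * iqr     # 193.5

# A value counts as an outlier only if it lies beyond a fence
has_outliers <- min_bpm < lower_fence || max_bpm > upper_fence
print(c(lower = lower_fence, upper = upper_fence))
print(has_outliers)  # FALSE: both extremes sit inside the fences
```

Because the minimum (120) and maximum (169) both lie inside the fences, no point would be drawn as an outlier.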
gym$Workout_Type <- as.factor(gym$Workout_Type)
workout_counts <- table(gym$Workout_Type)
print(workout_counts)
##
## Cardio HIIT Strength Yoga
## 255 221 258 239
barplot(
workout_counts,
main = "Barplot of Workout Types",
xlab = "Workout Type",
ylab = "Frequency",
col = "hotpink"
)
There is slight variation, with Strength the most common (258) and HIIT the least common (221), but the four counts are close overall. This suggests that all four workout types are about equally popular in this group.
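One way to put a number on how close the counts are is to convert them to percentages and, as an optional extra, run a chi-square goodness-of-fit test against equal popularity. This is a sketch that retypes the counts from the table above; the test is an addition for illustration, not part of the original analysis.

```r
# Workout counts copied from the table above
workout_counts <- c(Cardio = 255, HIIT = 221, Strength = 258, Yoga = 239)

# Share of each workout type, in percent
pct <- round(100 * workout_counts / sum(workout_counts), 1)
print(pct)

# Goodness-of-fit test: H0 says all four types are equally likely
# (chisq.test() assumes equal probabilities by default for a vector)
gof <- chisq.test(workout_counts)
print(gof)
```

The statistic works out to about 3.57 on 3 degrees of freedom, below the 5% critical value of 7.81, so the counts are consistent with the four types being equally popular.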
# Convert via character so the original codes (1, 2, 3) are kept
# rather than the factor's internal level indices
gym$Experience_Level <- as.numeric(as.character(gym$Experience_Level))
gym$Experience_Level <- factor(
gym$Experience_Level,
levels = c(1, 2, 3),
labels = c("Beginner", "Intermediate", "Advanced"),
ordered = TRUE
)
table_wt <- table(gym$Experience_Level, gym$Workout_Type)
print(table_wt)
##
## Cardio HIIT Strength Yoga
## Beginner 109 85 97 85
## Intermediate 102 87 116 101
## Advanced 44 49 45 53
prop_table_wt <- prop.table(table_wt, margin = 1)
print(prop_table_wt)
##
## Cardio HIIT Strength Yoga
## Beginner 0.2898936 0.2260638 0.2579787 0.2260638
## Intermediate 0.2512315 0.2142857 0.2857143 0.2487685
## Advanced 0.2303665 0.2565445 0.2356021 0.2774869
barplot(
table_wt,
beside = TRUE,
legend.text = TRUE,
main = "Workout Type by Experience Level",
xlab = "Workout Type",
ylab = "Frequency",
col = c("lavender","lightblue","hotpink")
)
# Summary
- Beginners choose Cardio most often (about 29%), followed by Strength (26%), with HIIT and Yoga at about 23% each.
- Intermediates choose Strength most often (about 29%, the largest single cell at 116 people), with Cardio and Yoga at about 25% each and HIIT at 21%.
- Advanced participants are the smallest group in every workout type. Within the group, Yoga (28%) and HIIT (26%) are slightly more common than Strength (24%) and Cardio (23%).
- Within each experience level, the four workout types are spread fairly evenly: no share falls below 21% or rises above 29%.
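To see what `margin = 1` does in `prop.table()`, the sketch below rebuilds the contingency table from the counts printed above and checks that each row of proportions sums to 1, so every share is a share within an experience level. The matrix is typed in by hand, so it stands in for `table_wt` rather than being computed from `gym.csv`.

```r
# Counts copied from the printed table (rows: experience, columns: workout)
counts <- matrix(
  c(109,  85,  97,  85,
    102,  87, 116, 101,
     44,  49,  45,  53),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Experience = c("Beginner", "Intermediate", "Advanced"),
    Workout    = c("Cardio", "HIIT", "Strength", "Yoga")
  )
)

# margin = 1 divides each cell by its row total
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
print(rowSums(row_props))   # each row sums to 1

# margin = 2 would instead give shares within each workout type
col_props <- prop.table(counts, margin = 2)
print(colSums(col_props))   # each column sums to 1
```

Row proportions answer "what do people at this level choose?", while column proportions answer "who does this workout?". The summary above reads the table the first way.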
hist(
gym$Calories_Burned,
main = "Histogram of Calories_Burned",
xlab = "Calories Burned",
col = "lavender",
breaks = 20
)
qqnorm(gym$Calories_Burned, main = "Q-Q Plot of Calories Burned")
qqline(gym$Calories_Burned, col = "hotpink", lwd = 2)
library(e1071)
cal_skew <- skewness(gym$Calories_Burned, na.rm = TRUE)
cal_kurt <- kurtosis(gym$Calories_Burned, na.rm = TRUE)
print(cal_skew)
## [1] 0.2774636
print(cal_kurt)
## [1] -0.06795862
shapiro.test(gym$Calories_Burned)
##
## Shapiro-Wilk normality test
##
## data: gym$Calories_Burned
## W = 0.99176, p-value = 2.982e-05
gym$bmi_class <- cut(
gym$BMI,
breaks = c(-Inf, 18.5, 25, 30, Inf),
labels = c("Underweight", "Healthy", "Overweight", "Obese"),
right = FALSE
)
gym$bmi_class <- ordered(
gym$bmi_class,
levels = c("Underweight", "Healthy", "Overweight", "Obese")
)
table(gym$bmi_class)
##
## Underweight Healthy Overweight Obese
## 168 370 243 192
boxplot(
Water_Intake ~ bmi_class,
data = gym,
main = "Water Intake by BMI Class",
xlab = "BMI Class",
ylab = "Water Intake",
col = c("lightblue", "lightyellow", "pink", "lavender")
)
tapply(gym$Water_Intake, gym$bmi_class, summary)
## $Underweight
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 2.100 2.400 2.473 2.800 3.700
##
## $Healthy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 2.100 2.600 2.575 2.900 3.700
##
## $Overweight
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.50 2.10 2.50 2.60 3.25 3.70
##
## $Obese
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 2.400 2.900 2.893 3.400 3.700
kruskal.test(Water_Intake ~ bmi_class, data = gym)
##
## Kruskal-Wallis rank sum test
##
## data: Water_Intake by bmi_class
## Kruskal-Wallis chi-squared = 50.401, df = 3, p-value = 6.562e-11
After grouping people by BMI class (Underweight, Healthy, Overweight, Obese), I made a side-by-side boxplot to compare how much water each group typically drinks. A boxplot is a quick way to compare medians and spot large differences. The Underweight group drinks the least water (median 2.4 liters, mean about 2.47), while the Obese group drinks the most (median 2.9 liters, mean about 2.89). The Healthy and Overweight groups fall in between. To check that this is not just random variation, we ran a Kruskal–Wallis test, which gave a very small p-value (6.562e-11, far below 0.05). Differences this large would be very unlikely if all four groups had the same water intake distribution, so we conclude that water intake does vary with BMI class in this dataset.
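A significant Kruskal–Wallis result says that at least one group differs, but not which ones. One possible follow-up, which is not part of the original analysis, is pairwise Wilcoxon rank-sum tests with a multiple-comparison correction. The sketch below uses simulated data (the `sim` data frame and its values are made up for illustration) so that it runs without `gym.csv`; with the real data, `gym$Water_Intake` and `gym$bmi_class` would take the place of the simulated columns.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical stand-in for the gym data: 4 BMI classes, 50 people each
classes <- c("Underweight", "Healthy", "Overweight", "Obese")
sim <- data.frame(
  bmi_class = factor(rep(classes, each = 50), levels = classes),
  Water_Intake = c(
    rnorm(50, mean = 2.4, sd = 0.5),
    rnorm(50, mean = 2.6, sd = 0.5),
    rnorm(50, mean = 2.6, sd = 0.5),
    rnorm(50, mean = 2.9, sd = 0.5)
  )
)

# Omnibus test, same call shape as in the analysis above
kw <- kruskal.test(Water_Intake ~ bmi_class, data = sim)
print(kw)

# Pairwise Wilcoxon tests with Holm correction for multiple comparisons
posthoc <- pairwise.wilcox.test(
  sim$Water_Intake, sim$bmi_class,
  p.adjust.method = "holm"
)
print(posthoc$p.value)  # lower-triangular matrix of adjusted p-values
```

Each entry of `posthoc$p.value` compares one pair of BMI classes, so small values point to the specific pairs that drive the overall difference.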
# One-sample t-test
t_result <- t.test(
gym$Calories_Burned,
mu = 890,
alternative = "greater",
conf.level = 0.98 # 98% confidence level aligns with 2% alpha
)
print(t_result)
##
## One Sample t-test
##
## data: gym$Calories_Burned
## t = 1.7645, df = 972, p-value = 0.03898
## alternative hypothesis: true mean is greater than 890
## 98 percent confidence interval:
## 887.4475 Inf
## sample estimates:
## mean of x
## 905.4224
We wanted to see if the true average number of calories burned is greater than 890. We set up two hypotheses:
H0: The average is 890.
H1: The average is greater than 890.
We tested this at a 2% significance level, meaning we would only conclude that the average is above 890 if the p-value was below 0.02. However, the test gave a p-value of 0.03898, which is above 0.02. We therefore fail to reject the null hypothesis: at this strict level, we cannot conclude that the average is above 890.
The one-sided 98% confidence interval runs from about 887.45 to infinity, so it includes 890. That means 890 is still a plausible value for the true average. Even though the sample mean is about 905 calories, we do not have enough evidence to claim that the population mean is above 890 at the 2% level.
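The p-value and the lower confidence bound can be reproduced from the numbers in the printed output, which shows how the two decision rules line up. This sketch backs the standard error out of the reported t statistic (an assumption made here because the sample standard deviation is not printed), so small rounding differences are expected.

```r
# Values copied from the t.test() output above
t_stat   <- 1.7645
deg_free <- 972
xbar     <- 905.4224
mu0      <- 890
alpha    <- 0.02

# One-sided p-value: P(T >= t_stat) under H0
p_value <- pt(t_stat, df = deg_free, lower.tail = FALSE)

# Standard error implied by t = (xbar - mu0) / se
se <- (xbar - mu0) / t_stat

# Lower bound of the one-sided 98% confidence interval
lower_bound <- xbar - qt(1 - alpha, df = deg_free) * se

print(round(p_value, 5))
print(round(lower_bound, 2))
print(p_value < alpha)       # FALSE: fail to reject H0
print(lower_bound < mu0)     # TRUE: 890 lies inside the interval
```

The two checks always agree for a one-sided test: the p-value exceeds 0.02 exactly when the lower bound falls below 890.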
gym_sub <- subset(gym, Gender %in% c(0,1))
gym_sub$Gender <- factor(gym_sub$Gender)
t_result2 <- t.test(
Session_Duration ~ Gender,
data = gym_sub,
alternative = "two.sided",
conf.level = 0.97
)
print(t_result2)
##
## Welch Two Sample t-test
##
## data: Session_Duration by Gender
## t = 0.38032, df = 962.47, p-value = 0.7038
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 97 percent confidence interval:
## -0.03948947 0.05624212
## sample estimates:
## mean in group 0 mean in group 1
## 1.260823 1.252446