setwd("/Users/dorothytang/Desktop")
gym <- read.csv("gym.csv", header = TRUE, stringsAsFactors = FALSE)
# or however you're loading it

Task 1

# Create some sample data
x <- 1:15
y <- x + rnorm(15)

# Plot the data
plot(x, y)

# Add a vertical line at x = 2
abline(v = 2)

## For our code to run and appear in a knitted document, we need to place it inside an R code chunk in the R Markdown file. ##

Knitting to PDF or HTML

I will knit to HTML first, because an HTML file can still be converted to a PDF later and is easier to edit and adjust.
If we want to create a PDF after knitting to HTML, we can open the HTML file in the browser and print it as a PDF.
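The chunk and knitting steps described above can be sketched in R itself. This is only a minimal illustration, not the project's actual file: the file name `demo.Rmd`, the chunk label `demo-summary`, and the helper `knit_to_html()` are hypothetical, and the knitting step assumes the `rmarkdown` package and pandoc are installed.

```r
# Build a tiny R Markdown file. The chunk fence is assembled with strrep()
# so this example does not contain literal chunk delimiters.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  "title: \"Demo\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r demo-summary}"),  # opening line of an R code chunk
  "x <- 1:15",
  "summary(x)",
  fence                               # closing line of the chunk
)
rmd_path <- file.path(tempdir(), "demo.Rmd")  # hypothetical file name
writeLines(rmd_lines, rmd_path)

# Hypothetical helper: knit an .Rmd file to HTML (assumes rmarkdown + pandoc).
knit_to_html <- function(path) {
  rmarkdown::render(path, output_format = "html_document")
}

# Usage, once the tools are installed:
# knit_to_html(rmd_path)
```

Only code placed between the chunk's opening and closing lines is executed when the file is knitted; everything else is treated as text.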

The difference between a PDF file and an RMD file

The PDF file is the final document. It can be opened without R or RStudio and is meant for people to read.
The RMD file is the source of our project, where we can edit the text and code later if we need to.

Task 2

gym$Workout_Type <- as.factor(gym$Workout_Type)
gym$Gender <- as.factor(gym$Gender)
gym$Experience_Level <- as.factor(
  gym$Experience_Level)

Task 3

sum(is.na(gym$Avg_BPM))

## [1] 0

#Histogram
hist(
  gym$Avg_BPM,
  main = "Histogram of Avg_BPM",
  xlab = "Average Beats Per Minute",
  col = "lightblue",
  breaks = 15
)

#Boxplot
boxplot(
  gym$Avg_BPM,
  main = "Boxplot of Avg_BPM",
  ylab = "Average Beats Per Minute",
  col = "hotpink"
)

summary(gym$Avg_BPM)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   120.0   131.0   143.0   143.8   156.0   169.0

mean_val   <- mean(gym$Avg_BPM, na.rm = TRUE)
median_val <- median(gym$Avg_BPM, na.rm = TRUE)
sd_val     <- sd(gym$Avg_BPM, na.rm = TRUE)
min_val    <- min(gym$Avg_BPM, na.rm = TRUE)
max_val    <- max(gym$Avg_BPM, na.rm = TRUE)
quantiles  <- quantile(gym$Avg_BPM, na.rm = TRUE)


my_stats <- list(
  Mean = mean_val,
  Median = median_val,
  StdDev = sd_val,
  Minimum = min_val,
  Maximum = max_val,
  Quantiles = quantiles
)
print(my_stats)

## $Mean
## [1] 143.7667
## 
## $Median
## [1] 143
## 
## $StdDev
## [1] 14.3451
## 
## $Minimum
## [1] 120
## 
## $Maximum
## [1] 169
## 
## $Quantiles
##   0%  25%  50%  75% 100% 
##  120  131  143  156  169

Summary

First, we checked whether there were any missing values in Avg_BPM. The result is 0, which means there is no missing data for that variable. Second, we visualized the data. The histogram shows how many individuals fall into each heart rate range. Its roughly uniform, slightly right-skewed shape suggests that heart rates are spread fairly evenly across the lower and middle ranges, with fewer in the higher ranges. The boxplot shows the minimum, quartiles, and maximum of Avg_BPM: half of the heart rates fall between 131 BPM (the 25th percentile) and 156 BPM (the 75th percentile). It also shows no extreme outliers far from the rest of the data.

Important numbers to summarize the data

- The lowest recorded heart rate (minimum) is 120 beats per minute (BPM).
- The highest recorded heart rate (maximum) is 169 BPM.
- The mean (average) heart rate is about 144 BPM.
- The median heart rate is 143 BPM.
- The standard deviation is around 14 BPM, which tells us the heart rates typically vary about 14 BPM from the mean.

Overall, all average heart rates fall between 120 and 169 BPM, with the center of the data in the mid-140s. The data spread out moderately (about 14 BPM around the average), and there are no major unusual values. This gives us a good picture of how heart rates are distributed in this group: typically in the mid-140s, with enough variation to cover the 120s up through the upper 160s.
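The "no extreme outliers" reading can be checked by hand with the usual 1.5 × IQR rule, using only the numbers reported by `summary()` above. This is a sketch: `boxplot()` computes its hinges slightly differently from `quantile()`, so the fences it draws may differ a little from these.

```r
# Quartiles, minimum, and maximum copied from the summary() output above.
q1 <- 131
q3 <- 156
min_bpm <- 120
max_bpm <- 169

iqr <- q3 - q1                   # interquartile range
lower_fence <- q1 - 1.5 * iqr    # values below this would be flagged
upper_fence <- q3 + 1.5 * iqr    # values above this would be flagged

# TRUE only if the smallest or largest value lies outside the fences.
has_outliers <- min_bpm < lower_fence || max_bpm > upper_fence
print(c(lower = lower_fence, upper = upper_fence))
print(has_outliers)
```

Because both the minimum and the maximum sit inside the fences, no observation can be flagged under this rule.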

Task 4

gym$Workout_Type <- as.factor(gym$Workout_Type)

workout_counts <- table(gym$Workout_Type)
print(workout_counts)

## 
##   Cardio     HIIT Strength     Yoga 
##      255      221      258      239

barplot(
  workout_counts,
  main = "Barplot of Workout Types",
  xlab = "Workout Type",
  ylab = "Frequency",
  col = "hotpink"
)

Summary

Although there is slight variation (Strength is highest with 258 and HIIT is lowest with 221), the four workout types have similar counts. This suggests that all four workouts are about equally popular in the group.
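To put "somewhat close" on a common scale, the counts can be turned into shares of the whole group. The sketch below re-enters the counts from the table above by hand, so it runs without the original data file.

```r
# Counts copied from the table() output above.
workout_counts <- c(Cardio = 255, HIIT = 221, Strength = 258, Yoga = 239)

# Share of all participants in each workout type.
workout_props <- prop.table(workout_counts)
print(round(workout_props, 3))

# Gap between the most and least common workout, in percentage points.
gap_points <- 100 * (max(workout_props) - min(workout_props))
print(round(gap_points, 1))
```

If the four types were exactly equally popular, each share would be 0.25.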

Task 5

# Experience_Level was converted to a factor in Task 2, so go through
# as.character() to recover the original 1/2/3 codes rather than level indices.
gym$Experience_Level <- as.numeric(as.character(gym$Experience_Level))
gym$Experience_Level <- factor(
  gym$Experience_Level,
  levels = c(1, 2, 3),
  labels = c("Beginner", "Intermediate", "Advanced"),
  ordered = TRUE
)

table_wt <- table(gym$Experience_Level, gym$Workout_Type)
print(table_wt)

##               
##                Cardio HIIT Strength Yoga
##   Beginner        109   85       97   85
##   Intermediate    102   87      116  101
##   Advanced         44   49       45   53

prop_table_wt <- prop.table(table_wt, margin = 1)
print(prop_table_wt)

##               
##                   Cardio      HIIT  Strength      Yoga
##   Beginner     0.2898936 0.2260638 0.2579787 0.2260638
##   Intermediate 0.2512315 0.2142857 0.2857143 0.2487685
##   Advanced     0.2303665 0.2565445 0.2356021 0.2774869

# Groups on the x-axis are workout types; bars within a group are experience levels.
barplot(
  table_wt,
  beside = TRUE,
  legend.text = TRUE,
  main = "Workout Type by Experience Level",
  xlab = "Workout Type",
  ylab = "Count",
  col = c("lavender", "lightblue", "hotpink")
)

Summary

- Beginners lean toward Cardio (about 29% of beginners) but also try other workouts such as Strength and Yoga, so they are spread fairly evenly.
- Intermediates are the largest group for Strength, but they also take part in Cardio, HIIT, and Yoga in similar numbers.
- Advanced participants are the smallest group in every workout type, and they are also spread fairly evenly across the workout types.
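The summary above mixes two kinds of statements: shares within an experience level and counts within a workout type. The `margin` argument of `prop.table()` controls which one is computed. The sketch below re-enters the counts from the cross table by hand so it can run on its own.

```r
# Counts copied from the cross table above (rows: experience, columns: workout).
counts <- matrix(
  c(109, 85,  97,  85,
    102, 87, 116, 101,
     44, 49,  45,  53),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Experience = c("Beginner", "Intermediate", "Advanced"),
    Workout = c("Cardio", "HIIT", "Strength", "Yoga")
  )
)

# margin = 1: each row sums to 1 (workout mix within an experience level).
row_props <- prop.table(counts, margin = 1)

# margin = 2: each column sums to 1 (experience mix within a workout type).
col_props <- prop.table(counts, margin = 2)

print(round(row_props, 3))
print(round(col_props, 3))
```

Row proportions answer "what do beginners choose?", while column proportions answer "who attends Strength sessions?".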

Task 6

hist(
  gym$Calories_Burned,
  main = "Histogram of Calories_Burned",
  xlab = "Calories Burned",
  col = "lavender",
  breaks = 20 
)

qqnorm(gym$Calories_Burned, main = "Q-Q Plot of Calories Burned")
qqline(gym$Calories_Burned, col = "hotpink", lwd = 2)

library(e1071)
cal_skew <- skewness(gym$Calories_Burned, na.rm = TRUE)
cal_kurt <- kurtosis(gym$Calories_Burned, na.rm = TRUE)

print(cal_skew)

## [1] 0.2774636

print(cal_kurt)

## [1] -0.06795862

shapiro.test(gym$Calories_Burned)

## 
##  Shapiro-Wilk normality test
## 
## data:  gym$Calories_Burned
## W = 0.99176, p-value = 2.982e-05

Summary

Looking at the histogram, we see that the data are not extremely skewed, but they do have a somewhat longer tail on the right side. That tail means the data aren't perfectly symmetrical like a typical bell shape. The Q-Q plot, which compares our data to a perfectly normal distribution, stays fairly close to the straight line in the middle but bends away at both ends, suggesting that the smallest and largest values don't follow a normal pattern.
In addition, the Shapiro-Wilk test gave a p-value far below 0.05, which means we reject the hypothesis that the data come from a normal distribution. While the skewness and kurtosis values are not extremely large, the test result indicates there is still a detectable departure from perfect normality. Therefore, based on these findings, we generally would not treat Calories Burned as if it were normally distributed.
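For readers unsure what the skewness and kurtosis numbers measure, here is a sketch of the plain moment-based definitions in base R. The function names `moment_skewness()` and `excess_kurtosis()` are made up for this illustration, and `e1071` applies a small sample-size adjustment by default, so its values can differ slightly from these.

```r
# Moment-based skewness: 0 for symmetric data, positive for a longer right tail.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: about 0 for normal data, negative for lighter tails.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)    # toy data, perfectly symmetric
right_tailed <- c(1, 2, 3, 4, 20)   # toy data with one large value

print(moment_skewness(symmetric))
print(moment_skewness(right_tailed))
print(excess_kurtosis(symmetric))
```

A positive skewness, like the value reported for Calories_Burned, points the same way as the longer right tail seen in the histogram.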

Task 7

gym$bmi_class <- cut(
  gym$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy", "Overweight", "Obese"),
  right = FALSE
)

gym$bmi_class <- ordered(
  gym$bmi_class,
  levels = c("Underweight", "Healthy", "Overweight", "Obese")
)

table(gym$bmi_class)

## 
## Underweight     Healthy  Overweight       Obese 
##         168         370         243         192

boxplot(
  Water_Intake ~ bmi_class, 
  data = gym,
  main = "Water Intake by BMI Class",
  xlab = "BMI Class",
  ylab = "Water Intake",
  col = c("lightblue", "lightyellow", "pink", "lavender")
)

tapply(gym$Water_Intake, gym$bmi_class, summary)

## $Underweight
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   2.100   2.400   2.473   2.800   3.700 
## 
## $Healthy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   2.100   2.600   2.575   2.900   3.700 
## 
## $Overweight
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.50    2.10    2.50    2.60    3.25    3.70 
## 
## $Obese
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   2.400   2.900   2.893   3.400   3.700

kruskal.test(Water_Intake ~ bmi_class, data = gym)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Water_Intake by bmi_class
## Kruskal-Wallis chi-squared = 50.401, df = 3, p-value = 6.562e-11

My Opinions on Water_Intake by bmi_class

After grouping people by their BMI (Underweight, Healthy, Overweight, Obese), I made a side-by-side boxplot to see how much water each group typically drinks. You can think of a boxplot as a quick way to compare medians and spot any big differences. I found that Underweight folks tend to drink the least water (a median of about 2.4 liters), while Obese folks tend to drink the most (a median of about 2.9 liters). The Healthy and Overweight groups land somewhere in between. To confirm we're not just seeing random variation, we ran a Kruskal–Wallis test, which gave a very small p-value (much lower than 0.05). That means differences this large would be extremely unlikely if all four groups had the same water intake distribution. So we can say with confidence that water intake does vary by BMI class in this dataset.
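The grouping step depends on how `cut()` treats values that fall exactly on a boundary. With `right = FALSE`, each interval includes its lower bound and excludes its upper bound, so a BMI of exactly 25 is classed as Overweight. The BMI values below are made up to show this; they are not taken from the dataset.

```r
# Hypothetical BMI values chosen to sit on or next to each boundary.
bmi_values <- c(18.4, 18.5, 24.9, 25, 29.9, 30)

bmi_class <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy", "Overweight", "Obese"),
  right = FALSE,          # intervals are [lower, upper)
  ordered_result = TRUE   # return an ordered factor in one step
)

print(data.frame(bmi = bmi_values, class = bmi_class))
```

Using `ordered_result = TRUE` is one way to get the ordered factor without a separate `ordered()` call.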

Task 8

# One-sample t-test
t_result <- t.test(
  gym$Calories_Burned, 
  mu = 890,
  alternative = "greater",
  conf.level = 0.98 # 98% confidence level aligns with 2% alpha
)
print(t_result)

## 
##  One Sample t-test
## 
## data:  gym$Calories_Burned
## t = 1.7645, df = 972, p-value = 0.03898
## alternative hypothesis: true mean is greater than 890
## 98 percent confidence interval:
##  887.4475      Inf
## sample estimates:
## mean of x 
##  905.4224

Summary

We wanted to see if the true average number of calories burned is greater than 890. We set up two hypotheses:
H0: The average is 890.
H1: The average is greater than 890.
We tested this at a 2% significance level, meaning we'd only claim the average is truly above 890 if our p-value was below 0.02. However, our test came out with a p-value of 0.03898, which is above 0.02. Because of that, we can't reject the null hypothesis, so we can't confidently say the average is above 890 under this strict standard.
The 98% confidence interval started at about 887.45 and went all the way to infinity, which includes 890. That means 890 is still a plausible value for the true average. Even though our sample’s average was about 905 calories, the test suggests we don’t have enough evidence to claim that the entire population definitively has an average above 890 at the 2% level.
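The p-value and the lower confidence bound can be reproduced from the reported t statistic, degrees of freedom, and sample mean alone. The sketch below uses only numbers printed in the test output above; the results are approximate because those inputs are rounded.

```r
# Values copied from the t.test() output above.
t_stat      <- 1.7645
df          <- 972
sample_mean <- 905.4224
mu0         <- 890     # hypothesized mean under H0
alpha       <- 0.02    # significance level

# One-sided p-value: probability of a t statistic at least this large under H0.
p_value <- pt(t_stat, df, lower.tail = FALSE)

# Standard error recovered from t = (mean - mu0) / SE.
std_error <- (sample_mean - mu0) / t_stat

# Lower end of the one-sided 98% confidence interval.
lower_bound <- sample_mean - qt(1 - alpha, df) * std_error

reject_h0 <- p_value < alpha
print(c(p_value = p_value, lower_bound = lower_bound))
print(reject_h0)
```

The two checks agree: the p-value is above 0.02 exactly when the lower bound falls below 890.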

Task 9

gym_sub <- subset(gym, Gender %in% c(0,1))
gym_sub$Gender <- factor(gym_sub$Gender)  

t_result2 <- t.test(
  Session_Duration ~ Gender, 
  data = gym_sub,
  alternative = "two.sided", 
  conf.level = 0.97
)
print(t_result2)

## 
##  Welch Two Sample t-test
## 
## data:  Session_Duration by Gender
## t = 0.38032, df = 962.47, p-value = 0.7038
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 97 percent confidence interval:
##  -0.03948947  0.05624212
## sample estimates:
## mean in group 0 mean in group 1 
##        1.260823        1.252446

Summary

At a 3% significance level, our test gives no evidence that the average session times differ between the two gender groups: the p-value of 0.7038 is far above 0.03, and the 97% confidence interval for the difference (about -0.04 to 0.06 hours) includes 0. So the small difference we observe in the sample (1.26 vs. 1.25 hours) might just be due to normal random variation. We would need more data, or a much bigger gap, to detect a difference. Therefore, we don't have statistical evidence that one gender consistently has longer or shorter sessions than the other.
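The two-sided p-value and the link between the confidence interval and the decision can be checked from the printed output. This is a sketch using only the reported numbers, so the results match the output up to rounding.

```r
# Values copied from the Welch t-test output above.
t_stat <- 0.38032
df     <- 962.47
ci     <- c(-0.03948947, 0.05624212)  # 97% CI for the difference in means
alpha  <- 0.03

# Two-sided p-value: both tails beyond |t|.
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# A two-sided CI that contains 0 corresponds to not rejecting H0 at level alpha.
ci_contains_zero <- ci[1] < 0 && ci[2] > 0

# The interval is centered on the observed difference in group means.
mean_diff <- mean(ci)

print(c(p_value = p_value, mean_diff = mean_diff))
print(ci_contains_zero)
```

The midpoint of the interval matches the difference between the two reported group means, 1.260823 and 1.252446.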

Project 1

Dorothy Tang

2025-03-15
