setwd("/Users/dorothytang/Desktop")
gym <- read.csv("gym.csv", header = TRUE, stringsAsFactors = FALSE)
# or however you're loading it

Task 1

# Create some sample data
x <- 1:15
y <- x + rnorm(15)

# Plot the data
plot(x, y)

# Add a vertical line at x = 2
abline(v = 2)

## For our code to run and appear in the knitted document, we need to place it inside an R code chunk in the R Markdown file. ##

Knitting to PDF or HTML

  • I will knit to HTML first, because this lets me convert to a PDF later as well. An HTML file is also easier to edit and adjust in the future.

  • If we want to create a PDF after knitting to HTML, we can open the HTML file in the browser and print it as a PDF.
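Knitting can also be started from the R console instead of the Knit button. This is a minimal sketch, assuming the rmarkdown package is installed; the file name gym_report.Rmd is a made-up placeholder, not the actual file for this project.

```r
# Hypothetical file name; replace it with the actual .Rmd file.
rmd_file <- "gym_report.Rmd"

# Knit to HTML only if the rmarkdown package and the file are both available.
if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)) {
  rmarkdown::render(rmd_file, output_format = "html_document")
} else {
  message("Skipping render: rmarkdown is not installed or the file was not found.")
}
```

Changing output_format to "pdf_document" should knit straight to PDF, but that route normally needs a LaTeX installation, which is one reason to start with HTML.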

The difference between a PDF file and an Rmd file

  • The PDF file is the final document. It can be opened without R or RStudio, so it is meant for people to read.
  • The Rmd file is the source of our project, where we can easily make edits later if we need to.

Task 2

gym$Workout_Type <- as.factor(gym$Workout_Type)
gym$Gender <- as.factor(gym$Gender)
gym$Experience_Level <- as.factor(
  gym$Experience_Level)
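The three conversions above could also be done in one step and then checked. This is only a sketch on a small made-up data frame (the values are invented, and toy and cat_cols are names chosen for illustration), not the gym data itself.

```r
# Toy data standing in for gym.csv (values are made up for illustration).
toy <- data.frame(
  Workout_Type     = c("Yoga", "HIIT", "Cardio", "Yoga"),
  Gender           = c(0, 1, 1, 0),
  Experience_Level = c(1, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Convert several columns to factors in one step.
cat_cols <- c("Workout_Type", "Gender", "Experience_Level")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Confirm the conversion and inspect the levels.
str(toy)
levels(toy$Workout_Type)
```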

Task 3

sum(is.na(gym$Avg_BPM))
## [1] 0
#Histogram
hist(
  gym$Avg_BPM,
  main = "Histogram of Avg_BPM",
  xlab = "Average Beats Per Minute",
  col = "lightblue",
  breaks = 15
)

#Boxplot
boxplot(
  gym$Avg_BPM,
  main = "Boxplot of Avg_BPM",
  ylab = "Average Beats Per Minute",
  col = "hotpink"
)

summary(gym$Avg_BPM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   120.0   131.0   143.0   143.8   156.0   169.0
mean_val   <- mean(gym$Avg_BPM, na.rm = TRUE)
median_val <- median(gym$Avg_BPM, na.rm = TRUE)
sd_val     <- sd(gym$Avg_BPM, na.rm = TRUE)
min_val    <- min(gym$Avg_BPM, na.rm = TRUE)
max_val    <- max(gym$Avg_BPM, na.rm = TRUE)
quantiles  <- quantile(gym$Avg_BPM, na.rm = TRUE)


my_stats <- list(
  Mean = mean_val,
  Median = median_val,
  StdDev = sd_val,
  Minimum = min_val,
  Maximum = max_val,
  Quantiles = quantiles
)
print(my_stats)
## $Mean
## [1] 143.7667
## 
## $Median
## [1] 143
## 
## $StdDev
## [1] 14.3451
## 
## $Minimum
## [1] 120
## 
## $Maximum
## [1] 169
## 
## $Quantiles
##   0%  25%  50%  75% 100% 
##  120  131  143  156  169

Summary

First, we checked for missing values in Avg_BPM. The result is 0, which means there is no missing data for that variable. Second, we visualized the data. I made a histogram to see how the Avg_BPM values are distributed. The histogram shows how many individuals fall into each heart rate range. Its roughly uniform, slightly skewed shape suggests that heart rates are spread fairly evenly across the lower and middle ranges, with fewer in the higher ranges. After that, we created a boxplot, which shows the minimum, quartiles, and maximum of Avg_BPM. The boxplot shows that half of the heart rates fall between 131 BPM (the 25th percentile) and 156 BPM (the 75th percentile). It also reveals that there are no extreme outliers far from the rest of the data.

Important numbers to summarize the data

  • The lowest recorded heart rate (minimum) is 120 beats per minute (BPM).
  • The highest recorded heart rate (maximum) is 169 BPM.
  • The mean (average) heart rate is about 144 BPM.
  • The median heart rate is 143 BPM.
  • The standard deviation is about 14 BPM, which tells us that heart rates typically differ from the mean by about 14 BPM.

Overall, everyone's average heart rate falls between 120 and 169 BPM, and the middle half falls between 131 and 156 BPM. The spread is moderate (a standard deviation of about 14 BPM), and there are no major unusual values. This gives us a good picture of how heart rates are distributed in this group: centered in the mid-140s, with enough variation to cover the 120s up through the upper 160s.
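The "no outliers" reading of the boxplot can be checked by hand with the usual 1.5 × IQR rule. This sketch uses only the quartiles, minimum, and maximum printed above. One caveat: boxplot() works from hinges, which can differ slightly from quantile() output, so treat this as an approximation of what the plot does.

```r
# Quartiles copied from the summary() output above.
q1 <- 131
q3 <- 156
iqr <- q3 - q1  # interquartile range

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are flagged.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observed extremes, also copied from the output above.
observed_min <- 120
observed_max <- 169

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)

# TRUE means every observation lies inside the fences (no flagged outliers).
observed_min >= lower_fence && observed_max <= upper_fence
```

The fences come out at 93.5 and 193.5 BPM, so both the minimum (120) and the maximum (169) sit well inside them.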

Task 4

gym$Workout_Type <- as.factor(gym$Workout_Type)

workout_counts <- table(gym$Workout_Type)
print(workout_counts)
## 
##   Cardio     HIIT Strength     Yoga 
##      255      221      258      239
barplot(
  workout_counts,
  main = "Barplot of Workout Types",
  xlab = "Workout Type",
  ylab = "Frequency",
  col = "hotpink"
)

Summary

There is slight variation: Strength is the most common workout type (258) and HIIT the least common (221). Overall, though, the four counts are close, which suggests that all four workout types are about equally popular in this group.
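One optional way to put a number on "close" is to look at the shares and run a chi-squared goodness-of-fit test against equal popularity. This was not part of the original task; it is a sketch that re-enters the counts from the table above.

```r
# Counts copied from the table() output above.
workout_counts <- c(Cardio = 255, HIIT = 221, Strength = 258, Yoga = 239)

# Share of each workout type.
round(prop.table(workout_counts), 3)

# Goodness-of-fit test against equal shares (chisq.test defaults to equal p).
gof <- chisq.test(workout_counts)
gof$statistic
gof$p.value
```

A p-value above 0.05 here would be consistent with the four workout types being equally popular.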

Task 5

# Go through as.character() so we recover the stored values (1, 2, 3),
# not the factor's internal level codes
gym$Experience_Level <- as.numeric(as.character(gym$Experience_Level))
gym$Experience_Level <- factor(
  gym$Experience_Level,
  levels = c(1, 2, 3),
  labels = c("Beginner", "Intermediate", "Advanced"),
  ordered = TRUE
)

table_wt <- table(gym$Experience_Level, gym$Workout_Type)
print(table_wt)
##               
##                Cardio HIIT Strength Yoga
##   Beginner        109   85       97   85
##   Intermediate    102   87      116  101
##   Advanced         44   49       45   53
prop_table_wt <- prop.table(table_wt, margin = 1)
print(prop_table_wt)
##               
##                   Cardio      HIIT  Strength      Yoga
##   Beginner     0.2898936 0.2260638 0.2579787 0.2260638
##   Intermediate 0.2512315 0.2142857 0.2857143 0.2487685
##   Advanced     0.2303665 0.2565445 0.2356021 0.2774869
barplot(
  table_wt,
  beside = TRUE,
  legend.text = TRUE,
  main = "Workout Type by Experience Level",
  xlab = "Workout Type",
  ylab = "Frequency",
  col = c("lavender","lightblue","hotpink")
)

Summary

  • Beginners lean most toward Cardio (about 29%), but they also try other workouts such as Strength and Yoga. Overall, they are spread fairly evenly.
  • Intermediates make up the largest group for Strength, but they also take part in Cardio, HIIT, and Yoga. They are also spread fairly evenly.
  • Advanced participants are the smallest group in every workout type, but they are also spread fairly evenly across the workout types.
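The points above can be double-checked from the printed counts alone. This sketch rebuilds the cross-tabulation as a plain matrix (the numbers are copied from the output above) and pulls out the group sizes, the row percentages, and the most common workout at each level.

```r
# Counts copied from the cross-tabulation output above.
table_wt <- matrix(
  c(109, 85,  97,  85,
    102, 87, 116, 101,
     44, 49,  45,  53),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Experience = c("Beginner", "Intermediate", "Advanced"),
    Workout    = c("Cardio", "HIIT", "Strength", "Yoga")
  )
)

# Group sizes: how many people are at each experience level.
rowSums(table_wt)

# Row percentages: workout mix within each experience level.
row_pct <- round(100 * prop.table(table_wt, margin = 1), 1)
row_pct

# Most common workout type at each level.
top_workout <- colnames(table_wt)[apply(table_wt, 1, which.max)]
top_workout
```

Row percentages matter here because the groups differ in size (376, 406, and 191 people), so raw counts alone make the Advanced group look less active in everything.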

Task 6

hist(
  gym$Calories_Burned,
  main = "Histogram of Calories_Burned",
  xlab = "Calories Burned",
  col = "lavender",
  breaks = 20 
)

qqnorm(gym$Calories_Burned, main = "Q-Q Plot of Calories Burned")
qqline(gym$Calories_Burned, col = "hotpink", lwd = 2)

library(e1071)
cal_skew <- skewness(gym$Calories_Burned, na.rm = TRUE)
cal_kurt <- kurtosis(gym$Calories_Burned, na.rm = TRUE)

print(cal_skew )
## [1] 0.2774636
print(cal_kurt)
## [1] -0.06795862
shapiro.test(gym$Calories_Burned)
## 
##  Shapiro-Wilk normality test
## 
## data:  gym$Calories_Burned
## W = 0.99176, p-value = 2.982e-05

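Summary

The skewness is about 0.28, which indicates a slight right skew, and the excess kurtosis is about -0.07, which is close to the value of 0 expected for a normal distribution. The Shapiro-Wilk test gives W = 0.992 with a p-value of about 0.00003, so we reject the hypothesis of normality at the 5% level. With a sample this large (973 observations), the test can flag even small departures from normality, so Calories_Burned is best described as approximately, but not exactly, normal.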
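To see what the skewness number measures, here is a hand-rolled version on made-up vectors. It is a sketch, not the e1071 source: sample_skewness is a hypothetical helper, and I am assuming its formula (third central moment divided by the cubed sample standard deviation) corresponds to the package's default type = 3.

```r
# Moment-based sample skewness: third central moment divided by sd^3.
# Assumption: this mirrors e1071::skewness() with its default type = 3.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

# Made-up vectors for illustration only.
symmetric_x    <- c(1, 2, 3, 4, 5)
right_skewed_x <- c(1, 1, 2, 2, 3, 10)

sample_skewness(symmetric_x)     # zero for perfectly symmetric data
sample_skewness(right_skewed_x)  # positive: long tail to the right
```

A small positive value, like the one reported for Calories_Burned, points to a mild right tail.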

Task 7

gym$bmi_class <- cut(
  gym$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy", "Overweight", "Obese"),
  right = FALSE
)

gym$bmi_class <- ordered(
  gym$bmi_class,
  levels = c("Underweight", "Healthy", "Overweight", "Obese")
)

table(gym$bmi_class)
## 
## Underweight     Healthy  Overweight       Obese 
##         168         370         243         192
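Because the cut() call uses right = FALSE, each interval includes its lower limit and excludes its upper limit. A quick sketch with made-up BMI values on and around the boundaries shows where the edge cases land.

```r
# Made-up BMI values chosen to sit on or near the class boundaries.
bmi_values <- c(18.4, 18.5, 24.9, 25, 29.9, 30)

bmi_class <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy", "Overweight", "Obese"),
  right = FALSE  # intervals are [lower, upper): a boundary value joins the higher class
)

data.frame(bmi = bmi_values, class = bmi_class)
```

So a BMI of exactly 18.5 counts as Healthy, 25 as Overweight, and 30 as Obese.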
boxplot(
  Water_Intake ~ bmi_class, 
  data = gym,
  main = "Water Intake by BMI Class",
  xlab = "BMI Class",
  ylab = "Water Intake",
  col = c("lightblue", "lightyellow", "pink", "lavender")
)

tapply(gym$Water_Intake, gym$bmi_class, summary)
## $Underweight
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   2.100   2.400   2.473   2.800   3.700 
## 
## $Healthy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   2.100   2.600   2.575   2.900   3.700 
## 
## $Overweight
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.50    2.10    2.50    2.60    3.25    3.70 
## 
## $Obese
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   2.400   2.900   2.893   3.400   3.700
kruskal.test(Water_Intake ~ bmi_class, data = gym)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Water_Intake by bmi_class
## Kruskal-Wallis chi-squared = 50.401, df = 3, p-value = 6.562e-11

My Opinions on Water_Intake by bmi_class

After grouping people by their BMI (Underweight, Healthy, Overweight, Obese), I made a side-by-side boxplot to see how much water each group typically drinks. You can think of a boxplot as a quick way to compare medians and spot any big differences. I found that Underweight folks tend to drink the least water (a median of about 2.4 liters), while Obese folks tend to drink the most (a median of about 2.9 liters). The Healthy and Overweight groups land somewhere in between. To check that we're not just seeing random variation, we ran a Kruskal-Wallis test, which gave a very small p-value (much lower than 0.05). That means differences this large would be extremely unlikely if all four groups had the same water intake distribution. So we can say with confidence that water intake varies with BMI class in this dataset.
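The Kruskal-Wallis test only says that at least one group differs, not which ones. A possible follow-up, which was not part of the original task, is pairwise Wilcoxon rank-sum tests with a multiple-comparison correction. The sketch below runs on simulated data (not the gym dataset), so its numbers say nothing about the real groups.

```r
set.seed(1)  # reproducible made-up data, not the gym dataset

# Simulated water intake for four groups; only the last group is shifted up.
toy <- data.frame(
  Water_Intake = c(rnorm(40, 2.5, 0.4), rnorm(40, 2.5, 0.4),
                   rnorm(40, 2.5, 0.4), rnorm(40, 3.0, 0.4)),
  bmi_class = factor(
    rep(c("Underweight", "Healthy", "Overweight", "Obese"), each = 40),
    levels = c("Underweight", "Healthy", "Overweight", "Obese")
  )
)

# Pairwise Wilcoxon rank-sum tests with Holm correction for multiple comparisons.
pw <- pairwise.wilcox.test(toy$Water_Intake, toy$bmi_class,
                           p.adjust.method = "holm")
pw$p.value
```

On the real data, the same call with gym$Water_Intake and gym$bmi_class would show which pairs of BMI classes drive the overall result.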

Task 8

# One-sample t-test
t_result <- t.test(
  gym$Calories_Burned, 
  mu = 890,
  alternative = "greater",
  conf.level = 0.98 # 98% confidence level aligns with 2% alpha
)
print(t_result)
## 
##  One Sample t-test
## 
## data:  gym$Calories_Burned
## t = 1.7645, df = 972, p-value = 0.03898
## alternative hypothesis: true mean is greater than 890
## 98 percent confidence interval:
##  887.4475      Inf
## sample estimates:
## mean of x 
##  905.4224

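Summary

The sample mean of Calories_Burned is about 905.4, which is above the hypothesized value of 890. However, the one-sided p-value is 0.039, which is larger than the 2% significance level, so we fail to reject the null hypothesis. The lower bound of the 98% one-sided confidence interval (887.4) is also below 890. At the 2% level, the data do not give enough evidence that the true mean is greater than 890.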
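The decision for this test can be reproduced from the printed t statistic and degrees of freedom. This is a sketch using only values copied from the output above; alpha = 0.02 follows from conf.level = 0.98.

```r
# Values copied from the t.test() output above.
t_stat <- 1.7645
df     <- 972
alpha  <- 0.02  # matches conf.level = 0.98

# One-sided p-value for alternative = "greater": area to the right of t.
p_value <- pt(t_stat, df = df, lower.tail = FALSE)

# Critical value the t statistic would need to exceed at this alpha.
t_crit <- qt(1 - alpha, df = df)

c(p_value = p_value, t_crit = t_crit)
p_value < alpha  # reject H0 only if TRUE
```

Comparing the p-value with alpha and comparing t with the critical value are two views of the same decision, so they always agree.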

Task 9

gym_sub <- subset(gym, Gender %in% c(0,1))
gym_sub$Gender <- factor(gym_sub$Gender)  

t_result2 <- t.test(
  Session_Duration ~ Gender, 
  data = gym_sub,
  alternative = "two.sided", 
  conf.level = 0.97
)
print(t_result2)
## 
##  Welch Two Sample t-test
## 
## data:  Session_Duration by Gender
## t = 0.38032, df = 962.47, p-value = 0.7038
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 97 percent confidence interval:
##  -0.03948947  0.05624212
## sample estimates:
## mean in group 0 mean in group 1 
##        1.260823        1.252446

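Summary

The mean session duration is about 1.26 in group 0 and 1.25 in group 1, a difference of less than 0.01. The Welch t-test gives a p-value of 0.70, far above the 3% significance level, and the 97% confidence interval for the difference (-0.039 to 0.056) contains 0. We therefore fail to reject the null hypothesis: there is no evidence that average session duration differs between the two gender groups.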
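As a cross-check, the two-sided p-value and the confidence-interval reading can be rebuilt from the printed output. This sketch uses only numbers copied from the Welch test above; alpha = 0.03 follows from conf.level = 0.97.

```r
# Values copied from the Welch t-test output above.
t_stat <- 0.38032
df     <- 962.47
ci     <- c(-0.03948947, 0.05624212)
alpha  <- 0.03  # matches conf.level = 0.97

# Two-sided p-value: both tails beyond |t|.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# The interval contains 0 exactly when the two-sided test fails to reject.
contains_zero <- ci[1] < 0 && ci[2] > 0

c(p_value = p_value)
contains_zero
p_value < alpha  # reject H0 only if TRUE
```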