What are calories? Simply put, calories are a measure of energy, commonly used to express the energy content of foods and beverages. To lose weight, you need to eat fewer calories than your body burns each day. Conversely, to gain weight, you need to consume more calories than you burn. Our bodies need energy to stay alive and to keep our organs functioning properly.
The number of calories burned depends on body weight and fitness, the type of exercise or activity, and its intensity level. Regular physical activity is essential to maintaining good health. This article uses two machine learning approaches, linear regression and K-Means clustering, to analyze and predict calories burned. Data preparation, cleaning and exploratory analysis are the main steps carried out before modelling.
Therefore, we will analyze calorie burn data from different sports and activities by grouping the data, such as weight and exercise type, into appropriate categories. At the same time, we will use machine learning algorithms to extract information on how to burn calories more efficiently.
The data comes from Kaggle. The dataset contains the number of calories a person burns while doing a given activity or exercise for one hour.
It currently contains 248 activities and exercises, including running, cycling, aerobics, and more.
The dataset includes 6 columns: Activity, Exercise or Sport (1 hour); 130 lb; 155 lb; 180 lb; 205 lb; and Calories per kg.
The data is sourced from:
https://www.kaggle.com/datasets/aadhavvignesh/calories-burned-during-exercise-and-activities
exercise_data <- read.csv('exercise_dataset.csv')
head(exercise_data)
## Activity..Exercise.or.Sport..1.hour. X130.lb X155.lb X180.lb X205.lb
## 1 Cycling, mountain bike, bmx 502 598 695 791
## 2 Cycling, <10 mph, leisure bicycling 236 281 327 372
## 3 Cycling, >20 mph, racing 944 1126 1308 1489
## 4 Cycling, 10-11.9 mph, light 354 422 490 558
## 5 Cycling, 12-13.9 mph, moderate 472 563 654 745
## 6 Cycling, 14-15.9 mph, vigorous 590 704 817 931
## Calories.per.kg
## 1 1.7507297
## 2 0.8232356
## 3 3.2949735
## 4 1.2348534
## 5 1.6478253
## 6 2.0594431
The following libraries are loaded in this project:
dplyr : used to manipulate data
factoextra : used in the clustering model
cluster : used in the clustering model
ggplot2 : graphics library to plot modelling results
tibble : provides tidy data frame utilities
library('dplyr')
library('factoextra')
library('cluster')
library('ggplot2')
library('tibble')
The structure of the dataset, including the number of rows and columns and the type of each column, is summarized with glimpse().
glimpse(exercise_data)
## Rows: 248
## Columns: 6
## $ Activity..Exercise.or.Sport..1.hour. <chr> "Cycling, mountain bike, bmx", "C~
## $ X130.lb <int> 502, 236, 944, 354, 472, 590, 708~
## $ X155.lb <int> 598, 281, 1126, 422, 563, 704, 84~
## $ X180.lb <int> 695, 327, 1308, 490, 654, 817, 98~
## $ X205.lb <int> 791, 372, 1489, 558, 745, 931, 11~
## $ Calories.per.kg <dbl> 1.7507297, 0.8232356, 3.2949735, ~
Objective 1: What is the amount of calories that a person of a certain weight can expect to burn doing X minutes of a certain exercise?
Objective 2: Can various types of exercise activity be grouped into distinct groups based on the calories burnt per kg of bodyweight per hour?
In pursuit of the two objectives outlined above, we notice that the raw data are given in calories per hour of exercise. The dataset has 4 attributes corresponding to the different bodyweights of the participants whose calories burnt during exercise were recorded. There is also one “Calories per kg” attribute, but it is not clear what it represents (e.g. the mean of the 4 participants’ calorie burn rates, the ideal calorie burn rate of a fit individual, etc.). The first data preprocessing step is to convert the participants’ bodyweights from pounds (lb) to kilograms (kg).
bodyweights_lbs <- c(130, 155, 180, 205)
# 1 lb is roughly equal to 0.453592 kg
lb_to_kg <- 0.453592
bodyweights_kgs <- lb_to_kg * bodyweights_lbs
bodyweights_kgs_labels <- paste("X", bodyweights_kgs, "kg", sep="")
new_colnames <- c(colnames(exercise_data)[1], bodyweights_kgs_labels,
colnames(exercise_data)[length(colnames(exercise_data))])
colnames(exercise_data) <- new_colnames
head(exercise_data)
## Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1 Cycling, mountain bike, bmx 502 598 695
## 2 Cycling, <10 mph, leisure bicycling 236 281 327
## 3 Cycling, >20 mph, racing 944 1126 1308
## 4 Cycling, 10-11.9 mph, light 354 422 490
## 5 Cycling, 12-13.9 mph, moderate 472 563 654
## 6 Cycling, 14-15.9 mph, vigorous 590 704 817
## X92.98636kg Calories.per.kg
## 1 791 1.7507297
## 2 372 0.8232356
## 3 1489 3.2949735
## 4 558 1.2348534
## 5 745 1.6478253
## 6 931 2.0594431
Next, we would like to express the data in calories per hour of exercise per kilogram of bodyweight so that we can more clearly see the impact of bodyweight on the rate of calories burnt in addition to the influence of the type of exercise on the rate of calories burnt.
for (column_index in seq_along(bodyweights_kgs)) {
  column_name <- bodyweights_kgs_labels[column_index]
  bodyweight <- bodyweights_kgs[column_index]
  for (row_index in seq_len(nrow(exercise_data))) {
    # Convert calories per hour into calories per hour per kg of bodyweight
    exercise_data[row_index, column_name] <-
      exercise_data[row_index, column_name] / bodyweight
  }
}
head(exercise_data)
## Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1 Cycling, mountain bike, bmx 8.513242 8.505583 8.512300
## 2 Cycling, <10 mph, leisure bicycling 4.002241 3.996771 4.005068
## 3 Cycling, >20 mph, racing 16.008965 16.015530 16.020271
## 4 Cycling, 10-11.9 mph, light 6.003362 6.002268 6.001478
## 5 Cycling, 12-13.9 mph, moderate 8.004483 8.007765 8.010135
## 6 Cycling, 14-15.9 mph, vigorous 10.005603 10.013262 10.006545
## X92.98636kg Calories.per.kg
## 1 8.506624 1.7507297
## 2 4.000587 0.8232356
## 3 16.013101 3.2949735
## 4 6.000880 1.2348534
## 5 8.011928 1.6478253
## 6 10.012221 2.0594431
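The same per-kilogram conversion can also be written without explicit loops. The sketch below is an illustrative alternative, not the code used to produce the results above: it applies base R's sweep() to a small matrix holding the first two rows of the raw data. The names raw_calories and per_kg are assumptions introduced for this example.

```r
# Bodyweights in kg, computed as in the preprocessing step (lb * 0.453592)
bodyweights_kgs <- 0.453592 * c(130, 155, 180, 205)

# First two rows of the raw data (calories per hour), copied from the table above
raw_calories <- matrix(c(502, 598, 695, 791,
                         236, 281, 327, 372),
                       nrow = 2, byrow = TRUE)

# Divide every column by its matching bodyweight in one vectorized call
per_kg <- sweep(raw_calories, MARGIN = 2, STATS = bodyweights_kgs, FUN = "/")
print(round(per_kg, 6))
```

A vectorized call like this avoids updating the data frame one cell at a time, which tends to matter more as the number of rows grows.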
Next, calculate the mean of each row across the four bodyweight categories. This step is done in the preprocessing stage rather than the exploratory data analysis (EDA) stage as we have yet to determine what the “Calories per kg” attribute refers to.
exercise_data$mean_calories_per_hour_per_kg <- apply(exercise_data[, bodyweights_kgs_labels],
1, mean)
head(exercise_data)
## Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1 Cycling, mountain bike, bmx 8.513242 8.505583 8.512300
## 2 Cycling, <10 mph, leisure bicycling 4.002241 3.996771 4.005068
## 3 Cycling, >20 mph, racing 16.008965 16.015530 16.020271
## 4 Cycling, 10-11.9 mph, light 6.003362 6.002268 6.001478
## 5 Cycling, 12-13.9 mph, moderate 8.004483 8.007765 8.010135
## 6 Cycling, 14-15.9 mph, vigorous 10.005603 10.013262 10.006545
## X92.98636kg Calories.per.kg mean_calories_per_hour_per_kg
## 1 8.506624 1.7507297 8.509437
## 2 4.000587 0.8232356 4.001167
## 3 16.013101 3.2949735 16.014467
## 4 6.000880 1.2348534 6.001997
## 5 8.011928 1.6478253 8.008578
## 6 10.012221 2.0594431 10.009408
Moving on, we plot our calculated mean vs the original “Calories per kg” attribute to identify if there is a noticeable relationship between the two.
plot(exercise_data$Calories.per.kg, exercise_data$mean_calories_per_hour_per_kg,
main="Almost Perfect Linear Relationship between Two Attributes")
There is an almost perfect linear relationship between the two. This likely means that they refer to the same underlying biological phenomenon (calories burnt per hour per kg) that we are studying. Therefore, we remove the original “Calories per kg” attribute since we are unsure what the units for that measure are, while we have a linearly correlated attribute (the computed means) for which we do know its associated unit (that is, calories per hour per kg).
exercise_data$Calories.per.kg <- NULL
head(exercise_data)
## Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1 Cycling, mountain bike, bmx 8.513242 8.505583 8.512300
## 2 Cycling, <10 mph, leisure bicycling 4.002241 3.996771 4.005068
## 3 Cycling, >20 mph, racing 16.008965 16.015530 16.020271
## 4 Cycling, 10-11.9 mph, light 6.003362 6.002268 6.001478
## 5 Cycling, 12-13.9 mph, moderate 8.004483 8.007765 8.010135
## 6 Cycling, 14-15.9 mph, vigorous 10.005603 10.013262 10.006545
## X92.98636kg mean_calories_per_hour_per_kg
## 1 8.506624 8.509437
## 2 4.000587 4.001167
## 3 16.013101 16.014467
## 4 6.000880 6.001997
## 5 8.011928 8.008578
## 6 10.012221 10.009408
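The near-linear relationship noted above can also be checked numerically rather than only by eye. The following is a minimal sketch using just the six rows printed by head(), so it illustrates the idea and does not replace a check on the full dataset. The variable names calories_per_kg, mean_rate and ratio are assumptions introduced here.

```r
# First six rows, copied from the head() output shown earlier
calories_per_kg <- c(1.7507297, 0.8232356, 3.2949735,
                     1.2348534, 1.6478253, 2.0594431)
mean_rate <- c(8.509437, 4.001167, 16.014467,
               6.001997, 8.008578, 10.009408)

# Pearson correlation between the original attribute and the computed mean
r <- cor(calories_per_kg, mean_rate)

# Ratio of the two attributes; a near-constant ratio indicates proportionality
ratio <- mean_rate / calories_per_kg

print(r)
print(range(ratio))
```

For these six rows the ratio is roughly constant, which is consistent with the two attributes measuring the same quantity in different units.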
For Objective 1, for a given exercise (in this case, the first one), we will plot calories_per_hour_per_kg against each bodyweight.
plot(bodyweights_kgs, exercise_data[1, bodyweights_kgs_labels],
ylim=c(0, max(exercise_data[1, bodyweights_kgs_labels])),
main=exercise_data$Activity..Exercise.or.Sport..1.hour.[1],
xlab="Bodyweight (kg)",
ylab="Calories per hour per kg")
Remarkably, the data points form a nearly flat straight line, which suggests that, once calories burnt are expressed per kilogram of bodyweight, bodyweight itself has little further impact on the rate of calories burnt. We will determine the range of the calories burnt per hour per kg for each type of exercise to confirm whether each of them has a minimal range.
exercise_data$range <- apply(exercise_data[, bodyweights_kgs_labels], 1, max) -
apply(exercise_data[, bodyweights_kgs_labels], 1, min)
summary(exercise_data$range)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001241 0.007445 0.008297 0.007852 0.011101 0.012248
Each row, representing a type of exercise, shows very little variation in the calories burnt per hour per kg across people of differing weights. This suggests that bodyweight is unlikely to be a significant factor in answering our question for Objective 1, a claim that will be tested in the Modelling stage.
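If the per-kilogram rate is roughly constant across bodyweights, Objective 1 reduces to simple arithmetic: rate × bodyweight × duration in hours. The sketch below illustrates this under that assumption. The helper estimate_calories() is hypothetical and not part of the original analysis, and the 70 kg, 30 minute scenario is only an example.

```r
# Hypothetical helper: calories = rate (cal/h/kg) * bodyweight (kg) * hours
estimate_calories <- function(rate_per_hour_per_kg, bodyweight_kg, minutes) {
  rate_per_hour_per_kg * bodyweight_kg * (minutes / 60)
}

# Mean rate for "Cycling, mountain bike, bmx", taken from the table above
mountain_bike_rate <- 8.509437

# Example: a 70 kg person cycling for 30 minutes
estimate <- estimate_calories(mountain_bike_rate,
                              bodyweight_kg = 70, minutes = 30)
print(estimate)
```

The estimate is only as good as the assumption that the rate scales linearly with bodyweight and time, which is what the Modelling stage sets out to test.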
For Objective 2, given that we know from above there is little variation in calories burnt as bodyweight varies, the mean of each row fairly represents the calories burnt per hour per kg of each exercise type.
plot(sort(exercise_data$mean_calories_per_hour_per_kg),
type="h",
main="Distribution of Calories Per Hour Per Kg by Exercise",
xlab="Exercise index (lowest to highest calories per hour per kg)",
ylab="Calories Per Hour Per Kg")
Interestingly, the sorted mean calories burnt per hour per kg data plotted above shows that there are indeed a few “low-intensity” exercises and a handful of “high-intensity” exercises, with a much larger gray area in between. This presents an interesting clustering problem for the modelling stage of our analysis.
The plot of the sorted means suggests that there are distinct groups of low-intensity and high-intensity exercises. To explore whether such grouping patterns exist, an unsupervised machine learning method called clustering is used to determine whether exercise types group meaningfully by calories burned per kg.
Clustering is an unsupervised learning technique in which the data set is partitioned into several groups, called clusters, based on similarity. The objects within a cluster share common characteristics.
The clustering technique applied is K-Means clustering, a hard clustering method in which each data point belongs to exactly one cluster. The value of K determines the number of clusters.
Before clustering, the data must be checked for missing values (NA). If any are present, the affected rows have to be removed or the missing values imputed, depending on further analysis.
In addition, only numeric columns are included in the data set used for clustering; the rows are named after the activity in column 1.
exercise_data$range <- NULL # removing the range feature from the data set
any(is.na(exercise_data)) # Verify if any NAs in dataset
## [1] FALSE
edkmeans1 = exercise_data[,-1] # Removing character column and defining new data set
row.names(edkmeans1) = exercise_data[,1] # Naming the rows based on the exercise type
head(edkmeans1)
## X58.96696kg X70.30676kg X81.64656kg
## Cycling, mountain bike, bmx 8.513242 8.505583 8.512300
## Cycling, <10 mph, leisure bicycling 4.002241 3.996771 4.005068
## Cycling, >20 mph, racing 16.008965 16.015530 16.020271
## Cycling, 10-11.9 mph, light 6.003362 6.002268 6.001478
## Cycling, 12-13.9 mph, moderate 8.004483 8.007765 8.010135
## Cycling, 14-15.9 mph, vigorous 10.005603 10.013262 10.006545
## X92.98636kg mean_calories_per_hour_per_kg
## Cycling, mountain bike, bmx 8.506624 8.509437
## Cycling, <10 mph, leisure bicycling 4.000587 4.001167
## Cycling, >20 mph, racing 16.013101 16.014467
## Cycling, 10-11.9 mph, light 6.000880 6.001997
## Cycling, 12-13.9 mph, moderate 8.011928 8.008578
## Cycling, 14-15.9 mph, vigorous 10.012221 10.009408
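The dataset here has no missing values, so neither removal nor imputation was needed. For completeness, the two options mentioned above can be sketched on a toy data frame. All values and column names below are made up for illustration, and mean imputation is just one possible strategy.

```r
# Toy data with one missing value (values are illustrative, not from the dataset)
toy <- data.frame(
  activity = c("A", "B", "C", "D"),
  rate_light = c(3.0, NA, 8.0, 12.0),
  rate_heavy = c(3.1, 5.0, 8.1, 12.1)
)

# Check for missing values anywhere in the data frame
has_na <- anyNA(toy)

# Option 1: drop every row that contains at least one NA
toy_dropped <- na.omit(toy)

# Option 2: replace NAs in a numeric column with that column's mean
toy_imputed <- toy
missing <- is.na(toy_imputed$rate_light)
toy_imputed$rate_light[missing] <- mean(toy_imputed$rate_light, na.rm = TRUE)

print(has_na)
print(nrow(toy_dropped))
print(toy_imputed$rate_light)
```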
The data is then scaled (standardized) and assigned to a new variable. If the data is not standardized, differences in the scale of the features will influence the output of the clustering model, since K-Means is based on cluster means and the distances of the values from them.
edkmeans2 = scale(edkmeans1) # Scaling the data and defining new scaled data set
head(edkmeans2)
## X58.96696kg X70.30676kg X81.64656kg
## Cycling, mountain bike, bmx 0.5772959 0.5745668 0.5764558
## Cycling, <10 mph, leisure bicycling -0.7907774 -0.7916999 -0.7893904
## Cycling, >20 mph, racing 2.8505605 2.8502411 2.8516291
## Cycling, 10-11.9 mph, light -0.1838877 -0.1839914 -0.1844096
## Cycling, 12-13.9 mph, moderate 0.4230019 0.4237171 0.4242828
## Cycling, 14-15.9 mph, vigorous 1.0298916 1.0314256 1.0292636
## X92.98636kg mean_calories_per_hour_per_kg
## Cycling, mountain bike, bmx 0.5748101 0.5757821
## Cycling, <10 mph, leisure bicycling -0.7906892 -0.7906395
## Cycling, >20 mph, racing 2.8495559 2.8504979
## Cycling, 10-11.9 mph, light -0.1845248 -0.1842035
## Cycling, 12-13.9 mph, moderate 0.4248985 0.4239754
## Cycling, 14-15.9 mph, vigorous 1.0310629 1.0304115
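To make the scaling step concrete, the sketch below shows that scale() with its default arguments subtracts each column's mean and divides by its standard deviation. The six values are illustrative roundings of rates seen earlier, and the variable names are assumptions for this example.

```r
# Six illustrative rates (calories per hour per kg) in a one-column matrix
rates <- matrix(c(8.51, 4.00, 16.01, 6.00, 8.00, 10.01), ncol = 1,
                dimnames = list(NULL, "rate"))

# scale() centres on the column mean and divides by the column standard deviation
scaled_builtin <- scale(rates)

# The same standardization written out by hand
scaled_manual <- (rates - mean(rates)) / sd(rates)

print(round(as.vector(scaled_builtin), 4))
print(round(as.vector(scaled_manual), 4))
```

After standardization each column has mean 0 and standard deviation 1, so no single feature dominates the distance calculations.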
In K-Means clustering, K refers to the number of clusters to be found in the data. The optimal K value can be determined in two ways:
1) Start with an initial guess of K and adjust it iteratively based on the result.
2) Estimate the optimal K by plotting the number of clusters against the total within-cluster sum of squares. This is the method applied here.
fviz_nbclust(edkmeans2, kmeans, method = "wss") # Plotting K vs Total Within Sum of Square
Based on the plot, the optimal K is estimated to be 4, where the total within-cluster sum of squares begins to level off. A lower sum of squares implies lower dissimilarity within clusters; beyond K = 4 the additional reduction is small, so K = 4 is taken as a suitable fit.
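The quantity plotted by the "wss" method can be reproduced by hand, which may help when reading the elbow plot. The following is a minimal sketch on made-up one-dimensional data, not the exercise data. It fits kmeans() for several values of K and collects tot.withinss from each fit.

```r
# Illustrative one-dimensional data with three visibly separated groups
toy <- matrix(c(3.0, 3.5, 4.0, 4.5,
                6.0, 7.0, 8.0, 8.5,
                12.0, 13.5, 16.0, 18.0), ncol = 1)

set.seed(1)  # kmeans() uses random starting centres
k_values <- 1:5

# Total within-cluster sum of squares for each candidate K
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

print(round(wss, 2))
```

The "elbow" is the value of K after which further increases bring only a small drop in these numbers.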
set.seed(1012) # Set the seed so that result is reproducible
kcluster = kmeans(edkmeans2,centers = 4, nstart = 30) # Code for K-Means cluster execution
kcluster
## K-means clustering with 4 clusters of sizes 105, 88, 11, 44
##
## Cluster means:
## X58.96696kg X70.30676kg X81.64656kg X92.98636kg mean_calories_per_hour_per_kg
## 1 -0.8968239 -0.8969867 -0.8966009 -0.8968998 -0.8968282
## 2 0.1417076 0.1420985 0.1417419 0.1417033 0.1418129
## 3 2.6579268 2.6574671 2.6572795 2.6566855 2.6573407
## 4 1.1922509 1.1919727 1.1918121 1.1927511 1.1921972
##
## Clustering vector:
## Cycling, mountain bike, bmx
## 2
## Cycling, <10 mph, leisure bicycling
## 1
## Cycling, >20 mph, racing
## 3
## Cycling, 10-11.9 mph, light
## 2
## Cycling, 12-13.9 mph, moderate
## 2
## Cycling, 14-15.9 mph, vigorous
## 4
## Cycling, 16-19 mph, very fast, racing
## 4
## Unicycling
## 1
## Stationary cycling, very light
## 1
## Stationary cycling, light
## 2
## Stationary cycling, moderate
## 2
## Stationary cycling, vigorous
## 4
## Stationary cycling, very vigorous
## 4
## Calisthenics, vigorous, pushups, situps…
## 2
## Calisthenics, light
## 1
## Circuit training, minimal rest
## 2
## Weight lifting, body building, vigorous
## 2
## Weight lifting, light workout
## 1
## Health club exercise
## 2
## Stair machine
## 4
## Rowing machine, light
## 1
## Rowing machine, moderate
## 2
## Rowing machine, vigorous
## 2
## Rowing machine, very vigorous
## 4
## Ski machine
## 2
## Aerobics, low impact
## 1
## Aerobics, high impact
## 2
## Aerobics, step aerobics
## 2
## Aerobics, general
## 2
## Jazzercise
## 2
## Stretching, hatha yoga
## 1
## Mild stretching
## 1
## Instructing aerobic class
## 2
## Water aerobics
## 1
## Ballet, twist, jazz, tap
## 1
## Ballroom dancing, slow
## 1
## Ballroom dancing, fast
## 2
## Running, 5 mph (12 minute mile)
## 2
## Running, 5.2 mph (11.5 minute mile)
## 4
## Running, 6 mph (10 min mile)
## 4
## Running, 6.7 mph (9 min mile)
## 4
## Running, 7 mph (8.5 min mile)
## 4
## Running, 7.5mph (8 min mile)
## 4
## Running, 8 mph (7.5 min mile)
## 3
## Running, 8.6 mph (7 min mile)
## 3
## Running, 9 mph (6.5 min mile)
## 3
## Running, 10 mph (6 min mile)
## 3
## Running, 10.9 mph (5.5 min mile)
## 3
## Running, cross country
## 4
## Running, general
## 2
## Running, on a track, team practice
## 4
## Running, stairs, up
## 3
## Track and field (shot, discus)
## 1
## Track and field (high jump, pole vault)
## 2
## Track and field (hurdles)
## 4
## Archery
## 1
## Badminton
## 1
## Basketball game, competitive
## 2
## Playing basketball, non game
## 2
## Basketball, officiating
## 2
## Basketball, shooting baskets
## 1
## Basketball, wheelchair
## 2
## Running, training, pushing wheelchair
## 2
## Billiards
## 1
## Bowling
## 1
## Boxing, in ring
## 4
## Boxing, punching bag
## 2
## Boxing, sparring
## 4
## Coaching: football, basketball, soccer…
## 1
## Cricket (batting, bowling)
## 1
## Croquet
## 1
## Curling
## 1
## Darts (wall or lawn)
## 1
## Fencing
## 2
## Football, competitive
## 4
## Football, touch, flag, general
## 2
## Football or baseball, playing catch
## 1
## Frisbee playing, general
## 1
## Frisbee, ultimate frisbee
## 2
## Golf, general
## 1
## Golf, walking and carrying clubs
## 1
## Golf, driving range
## 1
## Golf, miniature golf
## 1
## Golf, walking and pulling clubs
## 1
## Golf, using power cart
## 1
## Gymnastics
## 1
## Hacky sack
## 1
## Handball
## 4
## Handball, team
## 2
## Hockey, field hockey
## 2
## Hockey, ice hockey
## 2
## Riding a horse, general
## 1
## Horesback riding, saddling horse
## 1
## Horseback riding, grooming horse
## 1
## Horseback riding, trotting
## 2
## Horseback riding, walking
## 1
## Horse racing, galloping
## 2
## Horse grooming, moderate
## 2
## Horseshoe pitching
## 1
## Jai alai
## 4
## Martial arts, judo, karate, jujitsu
## 4
## Martial arts, kick boxing
## 4
## Martial arts, tae kwan do
## 4
## Krav maga training
## 4
## Juggling
## 1
## Kickball
## 2
## Lacrosse
## 2
## Orienteering
## 4
## Playing paddleball
## 2
## Paddleball, competitive
## 4
## Polo
## 2
## Racquetball, competitive
## 4
## Playing racquetball
## 2
## Rock climbing, ascending rock
## 4
## Rock climbing, rappelling
## 2
## Jumping rope, fast
## 4
## Jumping rope, moderate
## 4
## Jumping rope, slow
## 2
## Rugby
## 4
## Shuffleboard, lawn bowling
## 1
## Skateboarding
## 1
## Roller skating
## 2
## Roller blading, in-line skating
## 4
## Sky diving
## 1
## Soccer, competitive
## 4
## Playing soccer
## 2
## Softball or baseball
## 1
## Softball, officiating
## 1
## Softball, pitching
## 2
## Squash
## 4
## Table tennis, ping pong
## 1
## Tai chi
## 1
## Playing tennis
## 2
## Tennis, doubles
## 2
## Tennis, singles
## 2
## Trampoline
## 1
## Volleyball, competitive
## 2
## Playing volleyball
## 1
## Volleyball, beach
## 2
## Wrestling
## 2
## Wallyball
## 2
## Backpacking, Hiking with pack
## 2
## Carrying infant, level ground
## 1
## Carrying infant, upstairs
## 1
## Carrying 16 to 24 lbs, upstairs
## 2
## Carrying 25 to 49 lbs, upstairs
## 2
## Standing, playing with children, light
## 1
## Walk/run, playing with children, moderate
## 1
## Walk/run, playing with children, vigorous
## 1
## Carrying small children
## 1
## Loading, unloading car
## 1
## Climbing hills, carrying up to 9 lbs
## 2
## Climbing hills, carrying 10 to 20 lb
## 2
## Climbing hills, carrying 21 to 42 lb
## 2
## Climbing hills, carrying over 42 lb
## 4
## Walking downstairs
## 1
## Hiking, cross country
## 2
## Bird watching
## 1
## Marching, rapidly, military
## 2
## Children's games, hopscotch, dodgeball
## 1
## Pushing stroller or walking with children
## 1
## Pushing a wheelchair
## 1
## Race walking
## 2
## Rock climbing, mountain climbing
## 2
## Walking using crutches
## 1
## Walking the dog
## 1
## Walking, under 2.0 mph, very slow
## 1
## Walking 2.0 mph, slow
## 1
## Walking 2.5 mph
## 1
## Walking 3.0 mph, moderate
## 1
## Walking 3.5 mph, brisk pace
## 1
## Walking 3.5 mph, uphill
## 2
## Walking 4.0 mph, very brisk
## 1
## Walking 4.5 mph
## 2
## Walking 5.0 mph
## 2
## Boating, power, speed boat
## 1
## Canoeing, camping trip
## 1
## Canoeing, rowing, light
## 1
## Canoeing, rowing, moderate
## 2
## Canoeing, rowing, vigorous
## 4
## Crew, sculling, rowing, competition
## 4
## Kayaking
## 1
## Paddle boat
## 1
## Windsurfing, sailing
## 1
## Sailing, competition
## 1
## Sailing, yachting, ocean sailing
## 1
## Skiing, water skiing
## 2
## Ski mobiling
## 2
## Skin diving, fast
## 3
## Skin diving, moderate
## 4
## Skin diving, scuba diving
## 2
## Snorkeling
## 1
## Surfing, body surfing or board surfing
## 1
## Whitewater rafting, kayaking, canoeing
## 1
## Swimming laps, freestyle, fast
## 4
## Swimming laps, freestyle, slow
## 2
## Swimming backstroke
## 2
## Swimming breaststroke
## 4
## Swimming butterfly
## 4
## Swimming leisurely, not laps
## 2
## Swimming sidestroke
## 2
## Swimming synchronized
## 2
## Swimming, treading water, fast, vigorous
## 4
## Swimming, treading water, moderate
## 1
## Water aerobics, water calisthenics
## 1
## Water polo
## 4
## Water volleyball
## 1
## Water jogging
## 2
## Diving, springboard or platform
## 1
## Ice skating, < 9 mph
## 2
## Ice skating, average speed
## 2
## Ice skating, rapidly
## 4
## Speed skating, ice, competitive
## 3
## Cross country snow skiing, slow
## 2
## Cross country skiing, moderate
## 2
## Cross country skiing, vigorous
## 4
## Cross country skiing, racing
## 3
## Cross country skiing, uphill
## 3
## Snow skiing, downhill skiing, light
## 1
## Downhill snow skiing, moderate
## 2
## Downhill snow skiing, racing
## 2
## Sledding, tobagganing, luge
## 2
## Snow shoeing
## 2
## Snowmobiling
## 1
## General housework
## 1
## Cleaning gutters
## 1
## Painting
## 1
## Sit, playing with animals
## 1
## Walk / run, playing with animals
## 1
## Bathing dog
## 1
## Mowing lawn, walk, power mower
## 2
## Mowing lawn, riding mower
## 1
## Walking, snow blower
## 1
## Riding, snow blower
## 1
## Shoveling snow by hand
## 2
## Raking lawn
## 1
## Gardening, general
## 1
## Bagging grass, leaves
## 1
## Watering lawn or garden
## 1
## Weeding, cultivating garden
## 1
## Carpentry, general
## 1
## Carrying heavy loads
## 2
## Carrying moderate loads upstairs
## 2
## General cleaning
## 1
## Cleaning, dusting
## 1
## Taking out trash
## 1
## Walking, pushing a wheelchair
## 1
## Teach physical education,exercise class
## 1
##
## Within cluster sum of squares by cluster:
## [1] 33.982457 33.031002 7.847353 27.959329
## (between_SS / total_SS = 91.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
It is observed that the data has been clustered into 4 groups of the following sizes:
Cluster 1: 105
Cluster 2: 88
Cluster 3: 11
Cluster 4: 44
In addition, the ratio of the between-cluster sum of squares to the total sum of squares is 91.7%, implying that 91.7% of the variance in the data is explained by the clustering.
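The percentage reported by kmeans() can be recomputed from the components of the fitted object. The sketch below uses a small made-up data set rather than the exercise data. It shows that the total sum of squares splits into a within-cluster part and a between-cluster part, and that the reported figure is the between-cluster share.

```r
# Illustrative data: two well-separated groups in one dimension
toy <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)

set.seed(1)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Share of the total variance attributable to separation between clusters
explained <- fit$betweenss / fit$totss

# The total sum of squares equals the within-cluster plus between-cluster parts
print(c(total = fit$totss, within_plus_between = fit$tot.withinss + fit$betweenss))
print(round(100 * explained, 1))
```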
fviz_cluster(kcluster, data = edkmeans2, geom = 'point')
Clustering is an unsupervised learning method whose objective is to group the data based on a summary value such as the mean or median. Unlike classification, there is no label or target attribute, so accuracy or precision cannot be determined. It is, however, possible to evaluate the performance of clustering by using the silhouette coefficient.
Silhouette Coefficient
The silhouette coefficient of a data point compares the average dissimilarity between the point and the members of the closest neighbouring cluster (Ci) with the average dissimilarity between the point and the other members of its own cluster (Di), divided by the larger of the two measures.
Si = (Ci - Di)/max(Ci,Di)
Si Value Interpretation
Si > 0: suggests the observation is well clustered; the closer to 1, the better the fit
Si = 0: the observation is between 2 clusters
Si < 0: the observation is likely in the wrong cluster
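To make the formula concrete, the silhouette coefficient can be computed by hand on a tiny example. The sketch below is a base R illustration of the definition, using the Ci and Di notation from the text. The helper silhouette_width() is hypothetical, assumes every cluster has at least two members, and is not the cluster::silhouette() function used in the analysis.

```r
# Hypothetical helper implementing Si = (Ci - Di) / max(Ci, Di) for each point
silhouette_width <- function(x, cl) {
  d <- as.matrix(dist(x))  # pairwise Euclidean distances
  sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE                      # exclude the point itself
    Di <- mean(d[i, own])                # average distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    # average distance to each other cluster; keep the closest one
    Ci <- min(sapply(others, function(k) mean(d[i, cl == k])))
    (Ci - Di) / max(Ci, Di)
  })
}

# Two well-separated groups on a line (illustrative values)
x <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)

sil_manual <- silhouette_width(x, cl)
print(round(sil_manual, 3))
```

For the first point, Di = (1 + 2) / 2 = 1.5 and Ci = (9 + 10 + 11) / 3 = 10, giving Si = 8.5 / 10 = 0.85.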
sil = silhouette(kcluster$cluster, dist(edkmeans2))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 105 0.69
## 2 2 88 0.59
## 3 3 11 0.67
## 4 4 44 0.52
Based on the silhouette results, the data appear to be appropriately clustered, as the average silhouette width of every cluster is above 0.5, with an overall average of 0.63.
The result of the clustering is added to the pre-processed exercise_data data set.
exercise_data_clustered = bind_cols(exercise_data,kcluster$cluster)
names(exercise_data_clustered)[7] = 'Cluster'
head(filter(exercise_data_clustered,Cluster==1))
## Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1 Cycling, <10 mph, leisure bicycling 4.002241 3.996771 4.005068
## 2 Unicycling 5.002802 5.006631 5.009397
## 3 Stationary cycling, very light 3.001681 3.001134 3.000739
## 4 Calisthenics, light 3.510440 3.498952 3.502903
## 5 Weight lifting, light workout 3.001681 3.001134 3.000739
## 6 Rowing machine, light 3.510440 3.498952 3.502903
## X92.98636kg mean_calories_per_hour_per_kg Cluster
## 1 4.000587 4.001167 1
## 2 5.000733 5.004891 1
## 3 3.000440 3.000998 1
## 4 3.505891 3.504547 1
## 5 3.000440 3.000998 1
## 6 3.505891 3.504547 1
Based on the clustering of the exercise data, there are 4 different intensity levels of exercise, each with a different average of calories burned per kg. Based on the cluster means of each feature, the clusters are arranged in decreasing order of intensity as follows:
Cluster 3
Cluster 4
Cluster 2
Cluster 1
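K-Means assigns cluster numbers arbitrarily, so the labels themselves carry no ranking. A minimal sketch of how the intensity order can be read off the fitted centres is given below. The values are copied from the "Cluster means" output for mean_calories_per_hour_per_kg, and the variable names are assumptions for this example.

```r
# Cluster centres for mean_calories_per_hour_per_kg, copied from the kmeans output
centre_mean_rate <- c(`1` = -0.8968282, `2` = 0.1418129,
                      `3` = 2.6573407, `4` = 1.1921972)

# Cluster labels sorted from the highest to the lowest centre
intensity_order <- names(sort(centre_mean_rate, decreasing = TRUE))
print(intensity_order)
```

With the full model object, the same idea would apply to the relevant column of kcluster$centers.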
Based on the EDA done above, the data points form a straight line, which suggests that bodyweight is unlikely to be a significant factor in the calories burnt per hour of exercise. To explore further whether the claim is true, regression modelling is done to determine whether bodyweight is a significant factor in the calories burnt per hour of exercise.
In this section, we use the modified data set (exercise_data_clustered) defined previously in K-means clustering modelling.
cluster_1_exercise = filter(exercise_data_clustered,Cluster==1) # Filter for Cluster 1 only
cluster_1 = cluster_1_exercise[,-1] # Removing character column
cluster1 = cluster_1[,-6] # Removing last column
row.names(cluster1) = cluster_1_exercise[,1] # Naming the rows based on the exercise type
head(cluster1) # New data set with cluster 1 only
## X58.96696kg X70.30676kg X81.64656kg
## Cycling, <10 mph, leisure bicycling 4.002241 3.996771 4.005068
## Unicycling 5.002802 5.006631 5.009397
## Stationary cycling, very light 3.001681 3.001134 3.000739
## Calisthenics, light 3.510440 3.498952 3.502903
## Weight lifting, light workout 3.001681 3.001134 3.000739
## Rowing machine, light 3.510440 3.498952 3.502903
## X92.98636kg mean_calories_per_hour_per_kg
## Cycling, <10 mph, leisure bicycling 4.000587 4.001167
## Unicycling 5.000733 5.004891
## Stationary cycling, very light 3.000440 3.000998
## Calisthenics, light 3.505891 3.504547
## Weight lifting, light workout 3.000440 3.000998
## Rowing machine, light 3.505891 3.504547
cluster_2_exercise = filter(exercise_data_clustered,Cluster==2) # Filter for Cluster 2 only
cluster_2 = cluster_2_exercise[,-1] # Removing character column
cluster2 = cluster_2[,-6] # Removing last column
row.names(cluster2) = cluster_2_exercise[,1] # Naming the rows based on the exercise type
head(cluster2) # New data set with cluster 2 only
## X58.96696kg X70.30676kg X81.64656kg
## Cycling, mountain bike, bmx 8.513242 8.505583 8.512300
## Cycling, 10-11.9 mph, light 6.003362 6.002268 6.001478
## Cycling, 12-13.9 mph, moderate 8.004483 8.007765 8.010135
## Stationary cycling, light 5.511561 5.504449 5.499313
## Stationary cycling, moderate 7.003922 7.012128 7.005806
## Calisthenics, vigorous, pushups, situps… 8.004483 8.007765 8.010135
## X92.98636kg
## Cycling, mountain bike, bmx 8.506624
## Cycling, 10-11.9 mph, light 6.000880
## Cycling, 12-13.9 mph, moderate 8.011928
## Stationary cycling, light 5.506184
## Stationary cycling, moderate 7.001027
## Calisthenics, vigorous, pushups, situps… 8.011928
## mean_calories_per_hour_per_kg
## Cycling, mountain bike, bmx 8.509437
## Cycling, 10-11.9 mph, light 6.001997
## Cycling, 12-13.9 mph, moderate 8.008578
## Stationary cycling, light 5.505377
## Stationary cycling, moderate 7.005721
## Calisthenics, vigorous, pushups, situps… 8.008578
cluster_3_exercise = filter(exercise_data_clustered,Cluster==3) # Filter for Cluster 3 only
cluster_3 = cluster_3_exercise[,-1] # Removing character column
cluster3 = cluster_3[,-6] # Removing last column
row.names(cluster3) = cluster_3_exercise[,1] # Naming the rows based on the exercise type
head(cluster3) # New data set with cluster 3 only
## X58.96696kg X70.30676kg X81.64656kg
## Cycling, >20 mph, racing 16.00897 16.01553 16.02027
## Running, 8 mph (7.5 min mile) 13.51604 13.51221 13.50945
## Running, 8.6 mph (7 min mile) 14.00784 14.01003 14.01161
## Running, 9 mph (6.5 min mile) 15.00840 15.01989 15.01594
## Running, 10 mph (6 min mile) 16.00897 16.01553 16.02027
## Running, 10.9 mph (5.5 min mile) 18.01009 18.02103 18.01668
## X92.98636kg mean_calories_per_hour_per_kg
## Cycling, >20 mph, racing 16.01310 16.01447
## Running, 8 mph (7.5 min mile) 13.50736 13.51127
## Running, 8.6 mph (7 min mile) 14.01281 14.01057
## Running, 9 mph (6.5 min mile) 15.01295 15.01430
## Running, 10 mph (6 min mile) 16.01310 16.01447
## Running, 10.9 mph (5.5 min mile) 18.01339 18.01530
cluster_4_exercise = filter(exercise_data_clustered,Cluster==4) # Filter for Cluster 4 only
cluster_4 = cluster_4_exercise[,-1] # Removing character column
cluster4 = cluster_4[,-6] # Removing last column
row.names(cluster4) = cluster_4_exercise[,1] # Naming the rows based on the exercise type
head(cluster4) # New data set with cluster 4 only
## X58.96696kg X70.30676kg X81.64656kg
## Cycling, 14-15.9 mph, vigorous 10.005603 10.013262 10.006545
## Cycling, 16-19 mph, very fast, racing 12.006724 12.004536 12.015203
## Stationary cycling, vigorous 10.514363 10.511080 10.508710
## Stationary cycling, very vigorous 12.515483 12.516577 12.517368
## Stair machine 9.005043 9.003402 9.002216
## Rowing machine, very vigorous 12.006724 12.004536 12.015203
## X92.98636kg mean_calories_per_hour_per_kg
## Cycling, 14-15.9 mph, vigorous 10.012221 10.009408
## Cycling, 16-19 mph, very fast, racing 12.012515 12.009744
## Stationary cycling, vigorous 10.506917 10.510268
## Stationary cycling, very vigorous 12.507211 12.514160
## Stair machine 9.012074 9.005684
## Rowing machine, very vigorous 12.012515 12.009744
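The four per-cluster data sets above follow the same steps, so they could also be built in one pass. The sketch below is one possible way to do this with base R's split() and lapply(), shown on a made-up stand-in for exercise_data_clustered. The helper prepare_cluster() and the column names activity and mean_rate are assumptions for this example.

```r
# Illustrative stand-in for exercise_data_clustered (values are made up)
toy_clustered <- data.frame(
  activity = c("A", "B", "C", "D", "E"),
  mean_rate = c(3.0, 8.0, 16.0, 10.0, 4.0),
  Cluster = c(1, 2, 3, 4, 1)
)

# Hypothetical helper: drop the label and cluster columns, name rows by activity
prepare_cluster <- function(df) {
  out <- df[, setdiff(names(df), c("activity", "Cluster")), drop = FALSE]
  row.names(out) <- df$activity
  out
}

# One data frame per cluster, stored in a list named by cluster number
clusters <- lapply(split(toy_clustered, toy_clustered$Cluster), prepare_cluster)
print(names(clusters))
print(clusters[["1"]])
```

Keeping the subsets in a named list makes it straightforward to repeat the same analysis for every cluster with lapply().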
Here, we determine the correlation coefficients between the calories per hour per kg variables within each cluster, as a first step before regressing them.
Correlation coefficient on Cluster 1
cor(cluster1, method = "pearson")
## X58.96696kg X70.30676kg X81.64656kg X92.98636kg
## X58.96696kg 1.0000000 0.9999860 0.9999862 0.9999968
## X70.30676kg 0.9999860 1.0000000 0.9999871 0.9999891
## X81.64656kg 0.9999862 0.9999871 1.0000000 0.9999944
## X92.98636kg 0.9999968 0.9999891 0.9999944 1.0000000
## mean_calories_per_hour_per_kg 0.9999960 0.9999943 0.9999957 0.9999989
## mean_calories_per_hour_per_kg
## X58.96696kg 0.9999960
## X70.30676kg 0.9999943
## X81.64656kg 0.9999957
## X92.98636kg 0.9999989
## mean_calories_per_hour_per_kg 1.0000000
We determine the correlation coefficient between X58.96696kg and mean_calories_per_hour_per_kg.
cor.test(cluster1$X58.96696kg, cluster1$mean_calories_per_hour_per_kg, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cluster1$X58.96696kg and cluster1$mean_calories_per_hour_per_kg
## t = 3597.8, df = 103, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9999941 0.9999973
## sample estimates:
## cor
## 0.999996
The correlation coefficient of 0.999996 indicates a near-perfect positive correlation between X58.96696kg and mean_calories_per_hour_per_kg. Because the p-value is far below 0.05 (p < 2.2e-16), the null hypothesis of zero correlation is rejected. The 95% confidence interval runs from 0.9999941 to 0.9999973.
We determine the correlation coefficient between X70.30676kg and mean_calories_per_hour_per_kg.
cor.test(cluster1$X70.30676kg, cluster1$mean_calories_per_hour_per_kg, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cluster1$X70.30676kg and cluster1$mean_calories_per_hour_per_kg
## t = 3008.4, df = 103, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9999916 0.9999961
## sample estimates:
## cor
## 0.9999943
The correlation coefficient of 0.9999943 indicates a near-perfect positive correlation between X70.30676kg and mean_calories_per_hour_per_kg. Because the p-value is far below 0.05 (p < 2.2e-16), the null hypothesis of zero correlation is rejected. The 95% confidence interval runs from 0.9999916 to 0.9999961.
We determine the correlation coefficient between X81.64656kg and mean_calories_per_hour_per_kg.
cor.test(cluster1$X81.64656kg, cluster1$mean_calories_per_hour_per_kg, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cluster1$X81.64656kg and cluster1$mean_calories_per_hour_per_kg
## t = 3467.8, df = 103, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9999937 0.9999971
## sample estimates:
## cor
## 0.9999957
The correlation coefficient of 0.9999957 indicates a near-perfect positive correlation between X81.64656kg and mean_calories_per_hour_per_kg. Because the p-value is far below 0.05 (p < 2.2e-16), the null hypothesis of zero correlation is rejected. The 95% confidence interval runs from 0.9999937 to 0.9999971.
We determine the correlation coefficient between X92.98636kg and mean_calories_per_hour_per_kg.
cor.test(edkmeans1$X92.98636kg, edkmeans1$mean_calories_per_hour_per_kg, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: edkmeans1$X92.98636kg and edkmeans1$mean_calories_per_hour_per_kg
## t = 17929, df = 246, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9999995 0.9999997
## sample estimates:
## cor
## 0.9999996
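The correlation coefficient of 0.9999996 indicates a near-perfect positive correlation between X92.98636kg and mean_calories_per_hour_per_kg. Because the p-value is far below 0.05 (p < 2.2e-16), the null hypothesis of zero correlation is rejected. The 95% confidence interval runs from 0.9999995 to 0.9999997. Note that this test was run on edkmeans1 rather than cluster1: with df = 246 it covers 248 observations instead of the 105 in cluster 1, so it is not directly comparable with the three tests above. For cluster 1 alone, the correlation matrix above gives 0.9999989 for this pair.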
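The four tests above can also be collected into one table instead of being read off separate printouts. The following is a minimal sketch, not the authors' code: it uses only the six rows printed by head(cluster4) earlier, and the names cluster4_head and cor_summary are illustrative. It also recomputes the mean column with rowMeans(), which shows why the correlations are so close to 1: the mean is built from the very columns it is being correlated with.

```r
# Six rows copied from the head(cluster4) output (calories per hour per kg).
cluster4_head <- data.frame(
  X58.96696kg = c(10.005603, 12.006724, 10.514363, 12.515483, 9.005043, 12.006724),
  X70.30676kg = c(10.013262, 12.004536, 10.511080, 12.516577, 9.003402, 12.004536),
  X81.64656kg = c(10.006545, 12.015203, 10.508710, 12.517368, 9.002216, 12.015203),
  X92.98636kg = c(10.012221, 12.012515, 10.506917, 12.507211, 9.012074, 12.012515)
)
weight_cols <- names(cluster4_head)

# The mean column is the row-wise average of the four weight columns.
cluster4_head$mean_calories_per_hour_per_kg <- rowMeans(cluster4_head[weight_cols])

# Run one Pearson test per weight column and keep the key numbers.
cor_rows <- lapply(weight_cols, function(col) {
  test <- cor.test(cluster4_head[[col]],
                   cluster4_head$mean_calories_per_hour_per_kg,
                   method = "pearson")
  data.frame(column   = col,
             r        = unname(test$estimate),
             ci_lower = test$conf.int[1],
             ci_upper = test$conf.int[2],
             p_value  = test$p.value)
})
cor_summary <- do.call(rbind, cor_rows)
print(cor_summary, digits = 7)
```

Reporting the estimate, interval and p-value straight from the test object avoids rounding them by hand when writing up the results.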
Correlation coefficient on Cluster 2
cor(cluster2, method = "pearson")
## X58.96696kg X70.30676kg X81.64656kg X92.98636kg
## X58.96696kg 1.0000000 0.9999832 0.9999943 0.9999915
## X70.30676kg 0.9999832 1.0000000 0.9999926 0.9999789
## X81.64656kg 0.9999943 0.9999926 1.0000000 0.9999932
## X92.98636kg 0.9999915 0.9999789 0.9999932 1.0000000
## mean_calories_per_hour_per_kg 0.9999964 0.9999928 0.9999992 0.9999950
## mean_calories_per_hour_per_kg
## X58.96696kg 0.9999964
## X70.30676kg 0.9999928
## X81.64656kg 0.9999992
## X92.98636kg 0.9999950
## mean_calories_per_hour_per_kg 1.0000000
Correlation coefficient on Cluster 3
cor(cluster3, method = "pearson")
## X58.96696kg X70.30676kg X81.64656kg X92.98636kg
## X58.96696kg 1.0000000 0.9999909 0.9999883 0.9999957
## X70.30676kg 0.9999909 1.0000000 0.9999955 0.9999942
## X81.64656kg 0.9999883 0.9999955 1.0000000 0.9999939
## X92.98636kg 0.9999957 0.9999942 0.9999939 1.0000000
## mean_calories_per_hour_per_kg 0.9999963 0.9999977 0.9999970 0.9999985
## mean_calories_per_hour_per_kg
## X58.96696kg 0.9999963
## X70.30676kg 0.9999977
## X81.64656kg 0.9999970
## X92.98636kg 0.9999985
## mean_calories_per_hour_per_kg 1.0000000
Correlation coefficient on Cluster 4
cor(cluster4, method = "pearson")
## X58.96696kg X70.30676kg X81.64656kg X92.98636kg
## X58.96696kg 1.0000000 0.9999928 0.9999978 0.9999942
## X70.30676kg 0.9999928 1.0000000 0.9999920 0.9999893
## X81.64656kg 0.9999978 0.9999920 1.0000000 0.9999991
## X92.98636kg 0.9999942 0.9999893 0.9999991 1.0000000
## mean_calories_per_hour_per_kg 0.9999984 0.9999957 0.9999994 0.9999978
## mean_calories_per_hour_per_kg
## X58.96696kg 0.9999984
## X70.30676kg 0.9999957
## X81.64656kg 0.9999994
## X92.98636kg 0.9999978
## mean_calories_per_hour_per_kg 1.0000000
Once again, our modelling objective with regard to regression is to estimate how many calories a person of bodyweight X will burn over one hour of a certain exercise. From our EDA, we suspect that the calories burnt per hour per kg of bodyweight do not depend on bodyweight. Therefore, our first sub-objective is to test this claim.
# Reshape the per-kg table from wide (one column per bodyweight) to long (one row per exercise and bodyweight)
exercise_data_flattened <- data.frame(
  exercise = rep(exercise_data$Activity..Exercise.or.Sport..1.hour., length(bodyweights_kgs)),
  calories_per_hour_per_kg = unlist(exercise_data[, c(-1, -ncol(exercise_data))]),
  bodyweight = rep(bodyweights_kgs, each = nrow(exercise_data)),
  cluster = rep(exercise_data_clustered$Cluster, length(bodyweights_kgs))
)
ggplot(exercise_data_flattened[exercise_data_flattened$exercise == exercise_data_flattened$exercise[1], ],aes(x=bodyweight, y=calories_per_hour_per_kg)) +
geom_point() +
stat_smooth(method="lm",se=FALSE) +
ylim(0, max(exercise_data_flattened[exercise_data_flattened$exercise == exercise_data_flattened$exercise[1], ]$calories_per_hour_per_kg))
## `geom_smooth()` using formula 'y ~ x'
The chart above is very similar to the one in our EDA. It is a graph of
calories_per_hour_per_kg against bodyweight for the “Cycling, mountain
bike, bmx” exercise, and it forms a nearly perfect flat line. However,
we require more than just visual proof, so a linear regression model is
applied.
regression_model <- lm(calories_per_hour_per_kg~bodyweight,exercise_data_flattened[exercise_data_flattened$exercise == exercise_data_flattened$exercise[1], ])
summary(regression_model) #summary of regression model
##
## Call:
## lm(formula = calories_per_hour_per_kg ~ bodyweight, data = exercise_data_flattened[exercise_data_flattened$exercise ==
## exercise_data_flattened$exercise[1], ])
##
## Residuals:
## X58.96696kg1 X70.30676kg1 X81.64656kg1 X92.98636kg1
## 0.0018341 -0.0045109 0.0035194 -0.0008427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5182393 0.0130321 653.636 2.34e-06 ***
## bodyweight -0.0001159 0.0001692 -0.685 0.564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00429 on 2 degrees of freedom
## Multiple R-squared: 0.1899, Adjusted R-squared: -0.2151
## F-statistic: 0.4689 on 1 and 2 DF, p-value: 0.5642
From the results table above, the beta coefficient for bodyweight is not statistically significant at any conventional significance level (p = 0.564). We therefore find no evidence that bodyweight influences calories_per_hour_per_kg, although with only 4 data points the test has little power. Moreover, this is just one exercise, and we have 247 others to consider.
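A confidence interval for the slope says more than the p-value alone, because it shows how large an effect the data could still be hiding. The sketch below is an illustration under stated assumptions: it rebuilds the four points for "Cycling, mountain bike, bmx" from the calories-per-hour values printed by head(exercise_data) (502, 598, 695, 791) and the bodyweights in the column names, and the names mtb and mtb_fit are hypothetical.

```r
# Bodyweights in kg (taken from the column names) and calories per hour
# for "Cycling, mountain bike, bmx" (taken from head(exercise_data)).
mtb <- data.frame(
  bodyweight        = c(58.96696, 70.30676, 81.64656, 92.98636),
  calories_per_hour = c(502, 598, 695, 791)
)
mtb$calories_per_hour_per_kg <- mtb$calories_per_hour / mtb$bodyweight

# Same model as in the text: per-kg burn rate regressed on bodyweight.
mtb_fit <- lm(calories_per_hour_per_kg ~ bodyweight, data = mtb)

# 95% confidence interval and p-value for the bodyweight slope.
slope_ci <- confint(mtb_fit, "bodyweight", level = 0.95)
slope_p  <- summary(mtb_fit)$coefficients["bodyweight", "Pr(>|t|)"]
print(slope_ci)
print(slope_p)
```

If the interval contains zero and is narrow relative to the burn rate itself (about 8.5 here), the data are consistent with no meaningful bodyweight effect for this exercise.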
p_values <- numeric(nrow(exercise_data))
betas <- numeric(nrow(exercise_data))
# Fit one regression per exercise and keep the bodyweight slope and its p-value
for (index in seq_len(nrow(exercise_data))) {
  exercise_rows <- exercise_data_flattened$exercise == exercise_data_flattened$exercise[index]
  model <- lm(calories_per_hour_per_kg~bodyweight, exercise_data_flattened[exercise_rows, ])
  coefficients_table <- summary(model)$coefficients
  p_values[index] <- coefficients_table["bodyweight", "Pr(>|t|)"]
  betas[index] <- coefficients_table["bodyweight", "Estimate"]
}
summary(p_values) #summary of all bodyweight coefficients' p-values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.009059 0.009059 0.498406 0.397729 0.648576 0.977595
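Interestingly, the bodyweight coefficient is statistically significant for some exercises: the smallest p-value is 0.009059, and because the first quartile equals the minimum, at least a quarter of the models share that value. We examine the coefficients whose p-values are below 0.05.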
significance_level <- 0.05
significant_beta_indices <- p_values < significance_level
significant_betas <- betas[significant_beta_indices]
summary(significant_betas) #summary of significant beta coefficients
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.542e-04 -7.262e-05 -3.631e-05 4.256e-05 2.179e-04 2.542e-04
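All of the significant beta coefficients are very small, lying between -2.542e-04 and 2.542e-04 calories per hour per kg for each additional kg of bodyweight. Over the full range of bodyweights in the data (about 34 kg), this shifts calories_per_hour_per_kg by less than 0.01, so we conclude that bodyweight has no practically meaningful influence on calories_per_hour_per_kg.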
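Two quick calculations put these numbers in context. The sketch below uses only figures printed above (the largest absolute beta, the smallest p-value, the bodyweight range and the 248 exercises). The Bonferroni correction is our own illustrative choice for handling 248 simultaneous tests and is not part of the original analysis.

```r
# Largest absolute significant slope reported above.
max_abs_beta <- 2.542e-04

# Range of bodyweights in the data, in kg (from the column names).
weight_span_kg <- 92.98636 - 58.96696

# Largest change in calories per hour per kg across the whole weight range.
max_shift <- max_abs_beta * weight_span_kg
print(max_shift)

# Smallest p-value reported above, adjusted for 248 tests (one per exercise).
smallest_p <- 0.009059
n_tests <- 248
bonferroni_p <- p.adjust(smallest_p, method = "bonferroni", n = n_tests)
print(bonferroni_p)
```

The shift is under 0.01 calories per hour per kg, and after this adjustment even the smallest p-value no longer falls below 0.05, which agrees with treating the significant slopes as negligible.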
Nevertheless, the question remains: how many calories will a person of bodyweight X burn during an hour of a certain exercise? Given that bodyweight has no meaningful impact on calories_per_hour_per_kg, we could perform simple arithmetic calculations to arrive at the desired result. Instead, we fit a further set of linear regressions, this time on the unadjusted calories per hour.
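For reference, the simple arithmetic mentioned above amounts to multiplying an exercise's average burn rate per kg by the target bodyweight. A minimal sketch for "Cycling, mountain bike, bmx", using the calories-per-hour values printed by head(exercise_data) and a hypothetical target of 75 kg (the variable names are illustrative):

```r
# Bodyweights in kg and calories per hour for "Cycling, mountain bike, bmx".
weights_kg        <- c(58.96696, 70.30676, 81.64656, 92.98636)
calories_per_hour <- c(502, 598, 695, 791)

# Average burn rate per kg across the four bodyweights.
rate_per_kg <- mean(calories_per_hour / weights_kg)

# Estimated calories per hour for a 75 kg person.
target_kg <- 75
arithmetic_estimate <- rate_per_kg * target_kg
print(arithmetic_estimate)
```

The result is roughly 638 calories per hour, which gives a baseline for judging what the regression approach adds.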
original_exercise_data <- read.csv('exercise_dataset.csv')
# Same wide-to-long reshape as before, but on the unadjusted calories per hour
original_exercise_data_flattened <- data.frame(
  exercise = rep(original_exercise_data$Activity..Exercise.or.Sport..1.hour., length(bodyweights_kgs)),
  calories_per_hour = unlist(original_exercise_data[, c(-1, -ncol(exercise_data))]),
  bodyweight = rep(bodyweights_kgs, each = nrow(exercise_data)),
  cluster = rep(exercise_data_clustered$Cluster, length(bodyweights_kgs))
)
ggplot(original_exercise_data_flattened[original_exercise_data_flattened$exercise == original_exercise_data_flattened$exercise[1], ],aes(x=bodyweight, y=calories_per_hour)) +
geom_point() +
stat_smooth(method="lm",se=FALSE) +
ylim(0, max(original_exercise_data_flattened[original_exercise_data_flattened$exercise == original_exercise_data_flattened$exercise[1], ]$calories_per_hour))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing missing values (geom_smooth).
For this purpose, we used our original dataset whose values were
expressed in calories_per_hour rather than calories_per_hour_per_kg.
Here, for the “Cycling, mountain bike, bmx” exercise, we observe that an
increase in bodyweight is associated with an increase in
calories_per_hour of exercise, and the linear model has a very good fit.
Does this hold across all exercises?
r_squareds <- vector(length = nrow(exercise_data))
betas <- vector(length = nrow(exercise_data))
for (index in 1:nrow(exercise_data)) {
model <- lm(calories_per_hour~bodyweight,original_exercise_data_flattened[original_exercise_data_flattened$exercise == original_exercise_data_flattened$exercise[index], ])
r_squareds[index] <- summary(model)$r.squared
betas[index] <- summary(model)$coefficients["bodyweight", "Estimate"]
}
## Warning in summary.lm(model): essentially perfect fit: summary may be unreliable
summary(r_squareds) #summary of the R-squared values of all exercise models
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9999 1.0000 1.0000 1.0000 1.0000 1.0000
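Every one of our linear regression models has a very high R-squared value (the minimum is 0.9999). Granted, each model is fitted to only 4 data points, and R warns of essentially perfect fits, so the summary statistics should be read with caution. Even so, the goodness of fit of each model is compelling. We can now answer the original question.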
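As a small self-contained check of this claim, the sketch below fits the same kind of model to the four points for "Cycling, mountain bike, bmx" (values taken from head(exercise_data)) and predicts for a 75 kg person. The names mtb_raw and mtb_raw_fit are illustrative. Note that the column in newdata has to carry the predictor's name, bodyweight, for predict() to use it.

```r
# Bodyweights in kg and calories per hour for "Cycling, mountain bike, bmx".
mtb_raw <- data.frame(
  bodyweight        = c(58.96696, 70.30676, 81.64656, 92.98636),
  calories_per_hour = c(502, 598, 695, 791)
)

# Unadjusted calories per hour regressed on bodyweight.
mtb_raw_fit <- lm(calories_per_hour ~ bodyweight, data = mtb_raw)

# Goodness of fit of this single model.
r_squared <- summary(mtb_raw_fit)$r.squared
print(r_squared)

# Prediction for a 75 kg person; newdata uses the predictor's name.
pred_75 <- predict(mtb_raw_fit, newdata = data.frame(bodyweight = 75))
print(pred_75)
```

The prediction agrees with the 638.1974 reported for this exercise in the prediction table.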
predictions <- numeric(nrow(exercise_data))
bodyweight <- 75 # target bodyweight in kg
for (index in seq_len(nrow(exercise_data))) {
  model <- lm(calories_per_hour~bodyweight,original_exercise_data_flattened[original_exercise_data_flattened$exercise == original_exercise_data_flattened$exercise[index], ])
  # newdata must name its column after the predictor (bodyweight); otherwise predict() silently falls back on the global variable
  predictions[index] <- predict(model, data.frame(bodyweight = bodyweight))
}
pred_df <- data.frame(exercise = exercise_data$Activity..Exercise.or.Sport..1.hour., calories_per_hour = predictions)
pred_df
## exercise calories_per_hour
## 1 Cycling, mountain bike, bmx 638.1974
## 2 Cycling, <10 mph, leisure bicycling 300.0898
## 3 Cycling, >20 mph, racing 1201.1008
## 4 Cycling, 10-11.9 mph, light 450.1434
## 5 Cycling, 12-13.9 mph, moderate 600.6625
## 6 Cycling, 14-15.9 mph, vigorous 750.7160
## 7 Cycling, 16-19 mph, very fast, racing 900.7523
## 8 Unicycling 375.3666
## 9 Stationary cycling, very light 225.0717
## 10 Stationary cycling, light 412.8843
## 11 Stationary cycling, moderate 525.4201
## 12 Stationary cycling, vigorous 788.2509
## 13 Stationary cycling, very vigorous 938.5458
## 14 Calisthenics, vigorous, pushups, situps… 600.6625
## 15 Calisthenics, light 262.8308
## 16 Circuit training, minimal rest 600.6625
## 17 Weight lifting, body building, vigorous 450.1434
## 18 Weight lifting, light workout 225.0717
## 19 Health club exercise 412.8843
## 20 Stair machine 675.4392
## 21 Rowing machine, light 262.8308
## 22 Rowing machine, moderate 525.4201
## 23 Rowing machine, vigorous 638.1974
## 24 Rowing machine, very vigorous 900.7523
## 25 Ski machine 525.4201
## 26 Aerobics, low impact 375.3666
## 27 Aerobics, high impact 525.4201
## 28 Aerobics, step aerobics 638.1974
## 29 Aerobics, general 487.9025
## 30 Jazzercise 450.1434
## 31 Stretching, hatha yoga 300.0898
## 32 Mild stretching 187.8126
## 33 Instructing aerobic class 450.1434
## 34 Water aerobics 300.0898
## 35 Ballet, twist, jazz, tap 338.1075
## 36 Ballroom dancing, slow 225.0717
## 37 Ballroom dancing, fast 412.8843
## 38 Running, 5 mph (12 minute mile) 600.6625
## 39 Running, 5.2 mph (11.5 minute mile) 675.4392
## 40 Running, 6 mph (10 min mile) 750.7160
## 41 Running, 6.7 mph (9 min mile) 825.7342
## 42 Running, 7 mph (8.5 min mile) 863.2691
## 43 Running, 7.5mph (8 min mile) 938.5458
## 44 Running, 8 mph (7.5 min mile) 1013.3226
## 45 Running, 8.6 mph (7 min mile) 1050.8058
## 46 Running, 9 mph (6.5 min mile) 1126.0826
## 47 Running, 10 mph (6 min mile) 1201.1008
## 48 Running, 10.9 mph (5.5 min mile) 1351.1543
## 49 Running, cross country 675.4392
## 50 Running, general 600.6625
## 51 Running, on a track, team practice 750.7160
## 52 Running, stairs, up 1126.0826
## 53 Track and field (shot, discus) 300.0898
## 54 Track and field (high jump, pole vault) 450.1434
## 55 Track and field (hurdles) 750.7160
## 56 Archery 262.8308
## 57 Badminton 338.1075
## 58 Basketball game, competitive 600.6625
## 59 Playing basketball, non game 450.1434
## 60 Basketball, officiating 525.4201
## 61 Basketball, shooting baskets 338.1075
## 62 Basketball, wheelchair 487.9025
## 63 Running, training, pushing wheelchair 600.6625
## 64 Billiards 187.8126
## 65 Bowling 225.0717
## 66 Boxing, in ring 900.7523
## 67 Boxing, punching bag 450.1434
## 68 Boxing, sparring 675.4392
## 69 Coaching: football, basketball, soccer… 300.0898
## 70 Cricket (batting, bowling) 375.3666
## 71 Croquet 187.8126
## 72 Curling 300.0898
## 73 Darts (wall or lawn) 187.8126
## 74 Fencing 450.1434
## 75 Football, competitive 675.4392
## 76 Football, touch, flag, general 600.6625
## 77 Football or baseball, playing catch 187.8126
## 78 Frisbee playing, general 225.0717
## 79 Frisbee, ultimate frisbee 600.6625
## 80 Golf, general 338.1075
## 81 Golf, walking and carrying clubs 338.1075
## 82 Golf, driving range 225.0717
## 83 Golf, miniature golf 225.0717
## 84 Golf, walking and pulling clubs 322.8142
## 85 Golf, using power cart 262.8308
## 86 Gymnastics 300.0898
## 87 Hacky sack 300.0898
## 88 Handball 900.7523
## 89 Handball, team 600.6625
## 90 Hockey, field hockey 600.6625
## 91 Hockey, ice hockey 600.6625
## 92 Riding a horse, general 300.0898
## 93 Horesback riding, saddling horse 262.8308
## 94 Horseback riding, grooming horse 262.8308
## 95 Horseback riding, trotting 487.9025
## 96 Horseback riding, walking 187.8126
## 97 Horse racing, galloping 600.6625
## 98 Horse grooming, moderate 450.1434
## 99 Horseshoe pitching 225.0717
## 100 Jai alai 900.7523
## 101 Martial arts, judo, karate, jujitsu 750.7160
## 102 Martial arts, kick boxing 750.7160
## 103 Martial arts, tae kwan do 750.7160
## 104 Krav maga training 750.7160
## 105 Juggling 300.0898
## 106 Kickball 525.4201
## 107 Lacrosse 600.6625
## 108 Orienteering 675.4392
## 109 Playing paddleball 450.1434
## 110 Paddleball, competitive 750.7160
## 111 Polo 600.6625
## 112 Racquetball, competitive 750.7160
## 113 Playing racquetball 525.4201
## 114 Rock climbing, ascending rock 825.7342
## 115 Rock climbing, rappelling 600.6625
## 116 Jumping rope, fast 900.7523
## 117 Jumping rope, moderate 750.7160
## 118 Jumping rope, slow 600.6625
## 119 Rugby 750.7160
## 120 Shuffleboard, lawn bowling 225.0717
## 121 Skateboarding 375.3666
## 122 Roller skating 525.4201
## 123 Roller blading, in-line skating 900.7523
## 124 Sky diving 225.0717
## 125 Soccer, competitive 750.7160
## 126 Playing soccer 525.4201
## 127 Softball or baseball 375.3666
## 128 Softball, officiating 300.0898
## 129 Softball, pitching 450.1434
## 130 Squash 900.7523
## 131 Table tennis, ping pong 300.0898
## 132 Tai chi 300.0898
## 133 Playing tennis 525.4201
## 134 Tennis, doubles 450.1434
## 135 Tennis, singles 600.6625
## 136 Trampoline 262.8308
## 137 Volleyball, competitive 600.6625
## 138 Playing volleyball 225.0717
## 139 Volleyball, beach 600.6625
## 140 Wrestling 450.1434
## 141 Wallyball 525.4201
## 142 Backpacking, Hiking with pack 525.4201
## 143 Carrying infant, level ground 262.8308
## 144 Carrying infant, upstairs 375.3666
## 145 Carrying 16 to 24 lbs, upstairs 450.1434
## 146 Carrying 25 to 49 lbs, upstairs 600.6625
## 147 Standing, playing with children, light 210.2439
## 148 Walk/run, playing with children, moderate 300.0898
## 149 Walk/run, playing with children, vigorous 375.3666
## 150 Carrying small children 225.0717
## 151 Loading, unloading car 225.0717
## 152 Climbing hills, carrying up to 9 lbs 525.4201
## 153 Climbing hills, carrying 10 to 20 lb 563.1792
## 154 Climbing hills, carrying 21 to 42 lb 600.6625
## 155 Climbing hills, carrying over 42 lb 675.4392
## 156 Walking downstairs 225.0717
## 157 Hiking, cross country 450.1434
## 158 Bird watching 187.8126
## 159 Marching, rapidly, military 487.9025
## 160 Children's games, hopscotch, dodgeball 375.3666
## 161 Pushing stroller or walking with children 187.8126
## 162 Pushing a wheelchair 300.0898
## 163 Race walking 487.9025
## 164 Rock climbing, mountain climbing 600.6625
## 165 Walking using crutches 375.3666
## 166 Walking the dog 225.0717
## 167 Walking, under 2.0 mph, very slow 150.0535
## 168 Walking 2.0 mph, slow 187.8126
## 169 Walking 2.5 mph 225.0717
## 170 Walking 3.0 mph, moderate 247.7789
## 171 Walking 3.5 mph, brisk pace 285.2621
## 172 Walking 3.5 mph, uphill 450.1434
## 173 Walking 4.0 mph, very brisk 375.3666
## 174 Walking 4.5 mph 472.8506
## 175 Walking 5.0 mph 600.6625
## 176 Boating, power, speed boat 187.8126
## 177 Canoeing, camping trip 300.0898
## 178 Canoeing, rowing, light 225.0717
## 179 Canoeing, rowing, moderate 525.4201
## 180 Canoeing, rowing, vigorous 900.7523
## 181 Crew, sculling, rowing, competition 900.7523
## 182 Kayaking 375.3666
## 183 Paddle boat 300.0898
## 184 Windsurfing, sailing 225.0717
## 185 Sailing, competition 375.3666
## 186 Sailing, yachting, ocean sailing 225.0717
## 187 Skiing, water skiing 450.1434
## 188 Ski mobiling 525.4201
## 189 Skin diving, fast 1201.1008
## 190 Skin diving, moderate 938.5458
## 191 Skin diving, scuba diving 525.4201
## 192 Snorkeling 375.3666
## 193 Surfing, body surfing or board surfing 225.0717
## 194 Whitewater rafting, kayaking, canoeing 375.3666
## 195 Swimming laps, freestyle, fast 750.7160
## 196 Swimming laps, freestyle, slow 525.4201
## 197 Swimming backstroke 525.4201
## 198 Swimming breaststroke 750.7160
## 199 Swimming butterfly 825.7342
## 200 Swimming leisurely, not laps 450.1434
## 201 Swimming sidestroke 600.6625
## 202 Swimming synchronized 600.6625
## 203 Swimming, treading water, fast, vigorous 750.7160
## 204 Swimming, treading water, moderate 300.0898
## 205 Water aerobics, water calisthenics 300.0898
## 206 Water polo 750.7160
## 207 Water volleyball 225.0717
## 208 Water jogging 600.6625
## 209 Diving, springboard or platform 225.0717
## 210 Ice skating, < 9 mph 412.8843
## 211 Ice skating, average speed 525.4201
## 212 Ice skating, rapidly 675.4392
## 213 Speed skating, ice, competitive 1126.0826
## 214 Cross country snow skiing, slow 525.4201
## 215 Cross country skiing, moderate 600.6625
## 216 Cross country skiing, vigorous 675.4392
## 217 Cross country skiing, racing 1050.8058
## 218 Cross country skiing, uphill 1238.6185
## 219 Snow skiing, downhill skiing, light 375.3666
## 220 Downhill snow skiing, moderate 450.1434
## 221 Downhill snow skiing, racing 600.6625
## 222 Sledding, tobagganing, luge 525.4201
## 223 Snow shoeing 600.6625
## 224 Snowmobiling 262.8308
## 225 General housework 262.8308
## 226 Cleaning gutters 375.3666
## 227 Painting 338.1075
## 228 Sit, playing with animals 187.8126
## 229 Walk / run, playing with animals 300.0898
## 230 Bathing dog 262.8308
## 231 Mowing lawn, walk, power mower 412.8843
## 232 Mowing lawn, riding mower 187.8126
## 233 Walking, snow blower 262.8308
## 234 Riding, snow blower 225.0717
## 235 Shoveling snow by hand 450.1434
## 236 Raking lawn 322.8142
## 237 Gardening, general 300.0898
## 238 Bagging grass, leaves 300.0898
## 239 Watering lawn or garden 113.0358
## 240 Weeding, cultivating garden 338.1075
## 241 Carpentry, general 262.8308
## 242 Carrying heavy loads 600.6625
## 243 Carrying moderate loads upstairs 600.6625
## 244 General cleaning 262.8308
## 245 Cleaning, dusting 187.8126
## 246 Taking out trash 225.0717
## 247 Walking, pushing a wheelchair 300.0898
## 248 Teach physical education,exercise class 300.0898
The example above shows the calories that a person weighing 75 kg would burn during one hour of each exercise. Faced with the entire menu of 248 exercises, however, this person may find it difficult to make a choice. An interesting follow-up question is: what if we modelled calories_per_hour not just by individual exercise, but by cluster of exercises?
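One way to explore this question is to swap the per-activity term for a cluster label. The following is a minimal sketch, not the model fitted in this article: it uses only the six cycling rows from the head(exercise_data) output shown earlier, and the choice of k = 3, the additive model form, and the names cycling, long, cluster_model and new_person are assumptions made for illustration.

```r
# Six cycling rows copied from the head(exercise_data) output shown earlier.
cycling <- data.frame(
  activity = c("Cycling, mountain bike, bmx",
               "Cycling, <10 mph, leisure bicycling",
               "Cycling, >20 mph, racing",
               "Cycling, 10-11.9 mph, light",
               "Cycling, 12-13.9 mph, moderate",
               "Cycling, 14-15.9 mph, vigorous"),
  X130.lb = c(502, 236, 944, 354, 472, 590),
  X155.lb = c(598, 281, 1126, 422, 563, 704),
  X180.lb = c(695, 327, 1308, 490, 654, 817),
  X205.lb = c(791, 372, 1489, 558, 745, 931)
)

# Group activities by intensity with k-means.
# k = 3 is an illustrative assumption, not a value taken from the article.
set.seed(42)
km <- kmeans(cycling$X155.lb, centers = 3, nstart = 25)
cycling$cluster <- factor(km$cluster)

# Reshape to long format: one row per activity and body weight.
weight_cols <- c("X130.lb", "X155.lb", "X180.lb", "X205.lb")
long <- data.frame(
  activity = rep(cycling$activity, times = length(weight_cols)),
  cluster = rep(cycling$cluster, times = length(weight_cols)),
  weight_lb = rep(c(130, 155, 180, 205), each = nrow(cycling)),
  calories_per_hour = unlist(cycling[weight_cols], use.names = FALSE)
)

# Cluster-level model: one intercept per cluster plus a shared weight slope.
cluster_model <- lm(calories_per_hour ~ weight_lb + cluster, data = long)

# Predicted calories per hour for a 75 kg (about 165 lb) person in each cluster.
new_person <- data.frame(
  weight_lb = 75 * 2.20462,
  cluster = factor(levels(long$cluster), levels = levels(long$cluster))
)
new_person$calories_per_hour <- predict(cluster_model, newdata = new_person)
print(new_person)
```

With the full dataset the same steps would apply to all 248 rows, and k would normally be chosen from a diagnostic such as the silhouette coefficient rather than fixed in advance.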
Across the 248 activities and exercises in the dataset, the silhouette coefficient results showed every Si value above 0.5, with a mean of 0.63. We used a new dataset (edkmeans1), defined in K-means Clustering Data Preparation. We then ran a regression model to explore further, which identified body weight as a significant factor in calories burned per hour of exercise. We show the accuracy of these individual components as well as the overall accuracy. The main limitation of this work is the lack of a common dataset for comparison with other methods. In future work, we plan to collect and annotate larger datasets to create common ground for comparison and analysis. We also intend to add other features to the model, such as height and gender.
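The silhouette check can be sketched as follows. This is a minimal illustration under stated assumptions: it relies on the silhouette() function from the cluster package, uses nine activities copied from the 75 kg prediction table above rather than the edkmeans1 dataset, and fixes k = 3 for the example. It does not reproduce the article's reported values.

```r
library(cluster)  # provides silhouette()

# Nine activities copied from the 75 kg prediction table above.
toy <- data.frame(
  activity = c("Croquet", "Playing volleyball", "Trampoline",
               "Fencing", "Horseback riding, trotting", "Kickball",
               "Rock climbing, ascending rock", "Handball",
               "Skin diving, moderate"),
  calories_per_hour = c(187.8126, 225.0717, 262.8308,
                        450.1434, 487.9025, 525.4201,
                        825.7342, 900.7523, 938.5458)
)

# Cluster on calories per hour; k = 3 is an illustrative assumption.
set.seed(42)
km <- kmeans(toy$calories_per_hour, centers = 3, nstart = 25)

# Silhouette width per activity: close to 1 means the activity sits well
# inside its own cluster, close to 0 means it lies between two clusters.
sil <- silhouette(km$cluster, dist(toy$calories_per_hour))
toy$cluster <- km$cluster
toy$sil_width <- sil[, "sil_width"]

print(toy)
mean_sil <- mean(toy$sil_width)
print(mean_sil)
```

Comparing mean_sil across several candidate values of k is a common way to pick the number of clusters.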