Calorie Burn Prediction Analysis Using K-Means Clustering and Linear Regression Algorithms

Introduction

What are calories? Simply put, a calorie is a unit of energy, commonly used to measure the energy content of foods and beverages. To lose weight, you need to eat fewer calories than your body burns each day. Conversely, to gain weight, you need to eat more calories than you burn. Our bodies need energy to stay alive and to keep our organs functioning properly.

The number of calories burned depends on body weight and fitness, the type of exercise or activity, and its intensity. Regular physical activity is essential to maintaining good health. This article uses two machine learning algorithms, linear regression and K-Means clustering, to predict calories burned and to group activities. Data preparation, cleaning and analysis are the main steps carried out before modelling.

Therefore, we will analyze calorie burn data from different sports and activities by grouping data such as weight and exercise type into appropriate groups. At the same time, we will use machine learning algorithms to extract information about calorie burn more efficiently.

Description of Dataset

The data comes from Kaggle. The dataset contains the number of calories a person burns while doing a given activity or exercise.

It currently contains 248 activities and exercises, including running, cycling, aerobics, and more.

The dataset includes 6 columns: Activity, Exercise or Sport (1 hour); 130 lb; 155 lb; 180 lb; 205 lb; and Calories per kg.

Importing the Data

The data is sourced from:

https://www.kaggle.com/datasets/aadhavvignesh/calories-burned-during-exercise-and-activities

exercise_data <- read.csv('exercise_dataset.csv')
head(exercise_data)

##   Activity..Exercise.or.Sport..1.hour. X130.lb X155.lb X180.lb X205.lb
## 1          Cycling, mountain bike, bmx     502     598     695     791
## 2  Cycling, <10 mph, leisure bicycling     236     281     327     372
## 3             Cycling, >20 mph, racing     944    1126    1308    1489
## 4          Cycling, 10-11.9 mph, light     354     422     490     558
## 5       Cycling, 12-13.9 mph, moderate     472     563     654     745
## 6       Cycling, 14-15.9 mph, vigorous     590     704     817     931
##   Calories.per.kg
## 1       1.7507297
## 2       0.8232356
## 3       3.2949735
## 4       1.2348534
## 5       1.6478253
## 6       2.0594431

Loading the Relevant Libraries

The following libraries will be loaded in this project:
dplyr : used to manipulate data
factoextra : used in the clustering model
cluster : used in the clustering model
ggplot2 : graphics library to plot modelling results
tibble : provides the tibble data frame class

library('dplyr')      
library('factoextra')
library('cluster')
library('ggplot2')
library('tibble')

Description of Dataset

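The dataset has 248 rows (one per activity) and 6 columns: the activity name, the calories burnt in one hour at each of four bodyweights (130, 155, 180 and 205 lb), and a “Calories per kg” value.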

glimpse(exercise_data)

## Rows: 248
## Columns: 6
## $ Activity..Exercise.or.Sport..1.hour. <chr> "Cycling, mountain bike, bmx", "C~
## $ X130.lb                              <int> 502, 236, 944, 354, 472, 590, 708~
## $ X155.lb                              <int> 598, 281, 1126, 422, 563, 704, 84~
## $ X180.lb                              <int> 695, 327, 1308, 490, 654, 817, 98~
## $ X205.lb                              <int> 791, 372, 1489, 558, 745, 931, 11~
## $ Calories.per.kg                      <dbl> 1.7507297, 0.8232356, 3.2949735, ~

Objectives

What is the amount of calories that a person of a certain weight can expect to burn doing X minutes of a certain exercise?
Can various types of exercise activity be grouped into distinct groups based on the calories burnt per kg of bodyweight per hour?

Data Preprocessing

In pursuit of the two objectives outlined above, we notice that the raw data are given in calories per hour of exercise. The dataset has 4 attributes related to the different bodyweights of the participants whose calories burnt during exercise were recorded. There is also one “Calories per kg” attribute, but it is not clear what this is (e.g. mean of the 4 participants’ calorie burn rate, ideal calorie burn rate of a fit individual, etc.). The first step of data preprocessing that we would like to perform is to convert the participants’ bodyweights from pounds (lb) to kilograms (kg).

bodyweights_lbs <- c(130, 155, 180, 205)
# 1 lb is roughly equal to 0.453592 kg
lb_to_kg <- 0.453592
bodyweights_kgs <- lb_to_kg * bodyweights_lbs
bodyweights_kgs_labels <- paste("X", bodyweights_kgs, "kg", sep="")
new_colnames <- c(colnames(exercise_data)[1], bodyweights_kgs_labels, 
                  colnames(exercise_data)[length(colnames(exercise_data))])
colnames(exercise_data) <- new_colnames
head(exercise_data)

##   Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1          Cycling, mountain bike, bmx         502         598         695
## 2  Cycling, <10 mph, leisure bicycling         236         281         327
## 3             Cycling, >20 mph, racing         944        1126        1308
## 4          Cycling, 10-11.9 mph, light         354         422         490
## 5       Cycling, 12-13.9 mph, moderate         472         563         654
## 6       Cycling, 14-15.9 mph, vigorous         590         704         817
##   X92.98636kg Calories.per.kg
## 1         791       1.7507297
## 2         372       0.8232356
## 3        1489       3.2949735
## 4         558       1.2348534
## 5         745       1.6478253
## 6         931       2.0594431

Next, we would like to express the data in calories per hour of exercise per kilogram of bodyweight so that we can more clearly see the impact of bodyweight on the rate of calories burnt in addition to the influence of the type of exercise on the rate of calories burnt.

# Divide each bodyweight column by its bodyweight in kg,
# giving calories per hour per kg (one vectorised division per column)
for (column_index in seq_along(bodyweights_kgs)) {
  column_name <- bodyweights_kgs_labels[column_index]
  bodyweight <- bodyweights_kgs[column_index]
  exercise_data[[column_name]] <- exercise_data[[column_name]] / bodyweight
}
head(exercise_data)

##   Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1          Cycling, mountain bike, bmx    8.513242    8.505583    8.512300
## 2  Cycling, <10 mph, leisure bicycling    4.002241    3.996771    4.005068
## 3             Cycling, >20 mph, racing   16.008965   16.015530   16.020271
## 4          Cycling, 10-11.9 mph, light    6.003362    6.002268    6.001478
## 5       Cycling, 12-13.9 mph, moderate    8.004483    8.007765    8.010135
## 6       Cycling, 14-15.9 mph, vigorous   10.005603   10.013262   10.006545
##   X92.98636kg Calories.per.kg
## 1    8.506624       1.7507297
## 2    4.000587       0.8232356
## 3   16.013101       3.2949735
## 4    6.000880       1.2348534
## 5    8.011928       1.6478253
## 6   10.012221       2.0594431

Next, calculate the mean of each row across the four bodyweight categories. This step is done in the preprocessing stage rather than the exploratory data analysis (EDA) stage as we have yet to determine what the “Calories per kg” attribute refers to.

exercise_data$mean_calories_per_hour_per_kg <- apply(exercise_data[, bodyweights_kgs_labels], 
                                                     1, mean)
head(exercise_data)

##   Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1          Cycling, mountain bike, bmx    8.513242    8.505583    8.512300
## 2  Cycling, <10 mph, leisure bicycling    4.002241    3.996771    4.005068
## 3             Cycling, >20 mph, racing   16.008965   16.015530   16.020271
## 4          Cycling, 10-11.9 mph, light    6.003362    6.002268    6.001478
## 5       Cycling, 12-13.9 mph, moderate    8.004483    8.007765    8.010135
## 6       Cycling, 14-15.9 mph, vigorous   10.005603   10.013262   10.006545
##   X92.98636kg Calories.per.kg mean_calories_per_hour_per_kg
## 1    8.506624       1.7507297                      8.509437
## 2    4.000587       0.8232356                      4.001167
## 3   16.013101       3.2949735                     16.014467
## 4    6.000880       1.2348534                      6.001997
## 5    8.011928       1.6478253                      8.008578
## 6   10.012221       2.0594431                     10.009408

Moving on, we plot our calculated mean vs the original “Calories per kg” attribute to identify if there is a noticeable relationship between the two.

plot(exercise_data$Calories.per.kg, exercise_data$mean_calories_per_hour_per_kg,
     main="Almost Perfect Linear Relationship between Two Attributes")

There is an almost perfect linear relationship between the two. This likely means that they refer to the same underlying biological phenomenon (calories burnt per hour per kg) that we are studying. Therefore, we remove the original “Calories per kg” attribute since we are unsure what the units for that measure are, while we have a linearly correlated attribute (the computed means) for which we do know its associated unit (that is, calories per hour per kg).
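
One way to put a number on “almost perfect” is the Pearson correlation, together with the ratio between the two attributes. The sketch below uses only the six rows printed above, typed by hand into a small data frame (`first_six` is an illustrative name), so it is an illustration rather than a result for the full dataset.

```r
# The six rows printed by head(exercise_data) above, typed in by hand.
first_six <- data.frame(
  calories_per_kg = c(1.7507297, 0.8232356, 3.2949735, 1.2348534, 1.6478253, 2.0594431),
  mean_rate       = c(8.509437, 4.001167, 16.014467, 6.001997, 8.008578, 10.009408)
)

# Pearson correlation: values close to 1 indicate a near-perfect linear relationship.
correlation <- cor(first_six$calories_per_kg, first_six$mean_rate)

# Row-wise ratio: a near-constant ratio means one attribute is a rescaling of the other.
ratio <- first_six$mean_rate / first_six$calories_per_kg

print(correlation)
print(range(ratio))
```

In these six rows the ratio stays close to 4.86, which is consistent with the two attributes measuring the same quantity in different units.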

exercise_data$Calories.per.kg <- NULL
head(exercise_data)

##   Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1          Cycling, mountain bike, bmx    8.513242    8.505583    8.512300
## 2  Cycling, <10 mph, leisure bicycling    4.002241    3.996771    4.005068
## 3             Cycling, >20 mph, racing   16.008965   16.015530   16.020271
## 4          Cycling, 10-11.9 mph, light    6.003362    6.002268    6.001478
## 5       Cycling, 12-13.9 mph, moderate    8.004483    8.007765    8.010135
## 6       Cycling, 14-15.9 mph, vigorous   10.005603   10.013262   10.006545
##   X92.98636kg mean_calories_per_hour_per_kg
## 1    8.506624                      8.509437
## 2    4.000587                      4.001167
## 3   16.013101                     16.014467
## 4    6.000880                      6.001997
## 5    8.011928                      8.008578
## 6   10.012221                     10.009408

Exploratory Data Analysis

For Objective 1, for a given exercise (in this case, the first one), we will plot calories_per_hour_per_kg against each bodyweight.

plot(bodyweights_kgs, exercise_data[1, bodyweights_kgs_labels], 
     ylim=c(0, max(exercise_data[1, bodyweights_kgs_labels])),
     main=exercise_data$Activity..Exercise.or.Sport..1.hour.[1],
     xlab="Bodyweight (kg)",
     ylab="Calories per hour per kg")

Remarkably, the data points form an almost flat line, which suggests that bodyweight has little further impact on the calories burnt per hour of exercise once the participant’s bodyweight is taken into account. We will determine the range of the calories burnt per hour per kg for each type of exercise to confirm whether each of them has a minimal range.

exercise_data$range <- apply(exercise_data[, bodyweights_kgs_labels], 1, max) - 
  apply(exercise_data[, bodyweights_kgs_labels], 1, min)
summary(exercise_data$range)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001241 0.007445 0.008297 0.007852 0.011101 0.012248

Each row representing a type of exercise has very little variation in the calories burnt per hour per kg for people of differing weights. This suggests that bodyweight is unlikely to be a significant factor in answering our question for Objective 1, a claim that will be tested in the Modelling stage.
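
If the rate per kilogram is roughly constant for a given exercise, a first-pass answer to Objective 1 is a simple product of rate, bodyweight and duration. The helper below, `estimate_calories`, is a hypothetical name introduced for illustration; it is a back-of-the-envelope sketch under that constant-rate assumption, not the model built in the Modelling stage.

```r
# Hypothetical helper: estimate calories burnt, assuming the rate
# (calories per hour per kg) does not depend on bodyweight.
estimate_calories <- function(rate_per_hour_per_kg, bodyweight_kg, minutes) {
  rate_per_hour_per_kg * bodyweight_kg * (minutes / 60)
}

# Mean rate for "Cycling, mountain bike, bmx" taken from the table above.
mountain_bike_rate <- 8.509437

# A 70 kg person cycling for 30 minutes.
estimate <- estimate_calories(mountain_bike_rate, bodyweight_kg = 70, minutes = 30)
print(estimate)
```

As a sanity check, the same helper with a bodyweight of 70.30676 kg and 60 minutes gives roughly 598 calories, in line with the 155 lb column of the raw data for this activity.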

For Objective 2, given that we know from above there is little variation in calories burnt as bodyweight varies, the mean of each row fairly represents the calories burnt per hour per kg of each exercise type.

plot(sort(exercise_data$mean_calories_per_hour_per_kg), 
     type="h", 
     main="Distribution of Calories Per Hour Per Kg by Exercise", 
     xlab="Exercise index (lowest to highest calories per hour per kg)",
     ylab="Calories Per Hour Per Kg")

Interestingly, the sorted mean calories burnt per hour per kg data plotted above shows that there are indeed a few “low-intensity” exercises and a handful of “high-intensity” exercises, with a much larger gray area in between. This presents an interesting clustering problem for the modelling stage of our analysis.

Modelling - K-Means Clustering

The plot of the exercise data mean suggested that there are different groups of low-intensity and high-intensity exercises. To explore whether there are grouping patterns, an unsupervised machine learning method called clustering will be utilized to find if there is significant grouping of calories burned per kg for an exercise type.

Clustering is an unsupervised learning technique in which the data set is partitioned into several groups, called clusters, based on similarity. The objects within a cluster share common characteristics.

The clustering technique applied will be K-Means clustering, a hard clustering method in which each data point belongs to exactly one cluster. The value of K determines the number of clusters.
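
As a small illustration of this kind of assignment, the sketch below runs base R's `kmeans` on six made-up values (not taken from the dataset) that fall into two obvious groups. Each value receives exactly one cluster label.

```r
# Made-up one-dimensional values with two obvious groups (illustrative only).
toy_values <- c(1.0, 1.2, 0.9, 10.0, 10.5, 9.8)

set.seed(1)  # k-means starts from random centres, so fix the seed
toy_fit <- kmeans(matrix(toy_values, ncol = 1), centers = 2, nstart = 10)

# One label per value; the label numbers themselves (1 or 2) are arbitrary.
print(toy_fit$cluster)
print(toy_fit$centers)
```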

Preparing the data

Before clustering, the data has to be checked for missing values (NA). If any are present, the affected rows have to be removed or the missing values imputed, depending on further analysis.

In addition, only numeric columns will be included in the data set; the rows will be named after the activity, which was in column 1.

exercise_data$range <- NULL # removing the range feature from the data set
any(is.na(exercise_data)) # Verify if any NAs in dataset

## [1] FALSE
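
Had the check returned TRUE, the two options mentioned above could look like the following sketch. The data frame `toy` and its values are made up for illustration, and mean imputation is only one of several possible imputation choices.

```r
# Made-up data frame with one missing value (illustrative only).
toy <- data.frame(activity = c("a", "b", "c"),
                  rate     = c(4.0, NA, 8.0))

# Option 1: drop every row that contains at least one NA.
toy_dropped <- na.omit(toy)

# Option 2: replace missing rates with the mean of the observed rates.
toy_imputed <- toy
toy_imputed$rate[is.na(toy_imputed$rate)] <- mean(toy$rate, na.rm = TRUE)

print(toy_dropped)
print(toy_imputed)
```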

edkmeans1 = exercise_data[,-1] # Removing character column and defining new data set
row.names(edkmeans1) = exercise_data[,1] # Naming the rows based on the exercise type
head(edkmeans1)

##                                     X58.96696kg X70.30676kg X81.64656kg
## Cycling, mountain bike, bmx            8.513242    8.505583    8.512300
## Cycling, <10 mph, leisure bicycling    4.002241    3.996771    4.005068
## Cycling, >20 mph, racing              16.008965   16.015530   16.020271
## Cycling, 10-11.9 mph, light            6.003362    6.002268    6.001478
## Cycling, 12-13.9 mph, moderate         8.004483    8.007765    8.010135
## Cycling, 14-15.9 mph, vigorous        10.005603   10.013262   10.006545
##                                     X92.98636kg mean_calories_per_hour_per_kg
## Cycling, mountain bike, bmx            8.506624                      8.509437
## Cycling, <10 mph, leisure bicycling    4.000587                      4.001167
## Cycling, >20 mph, racing              16.013101                     16.014467
## Cycling, 10-11.9 mph, light            6.000880                      6.001997
## Cycling, 12-13.9 mph, moderate         8.011928                      8.008578
## Cycling, 14-15.9 mph, vigorous        10.012221                     10.009408

Scaling the Data

The data is scaled and assigned to a new variable in order to normalize it. If the data is not normalized, differences in the scale of the features will influence the output of the clustering model, since K-Means is based on distances between the data points and the cluster means.

edkmeans2 = scale(edkmeans1) # Scaling the data and defining new scaled data set
head(edkmeans2)

##                                     X58.96696kg X70.30676kg X81.64656kg
## Cycling, mountain bike, bmx           0.5772959   0.5745668   0.5764558
## Cycling, <10 mph, leisure bicycling  -0.7907774  -0.7916999  -0.7893904
## Cycling, >20 mph, racing              2.8505605   2.8502411   2.8516291
## Cycling, 10-11.9 mph, light          -0.1838877  -0.1839914  -0.1844096
## Cycling, 12-13.9 mph, moderate        0.4230019   0.4237171   0.4242828
## Cycling, 14-15.9 mph, vigorous        1.0298916   1.0314256   1.0292636
##                                     X92.98636kg mean_calories_per_hour_per_kg
## Cycling, mountain bike, bmx           0.5748101                     0.5757821
## Cycling, <10 mph, leisure bicycling  -0.7906892                    -0.7906395
## Cycling, >20 mph, racing              2.8495559                     2.8504979
## Cycling, 10-11.9 mph, light          -0.1845248                    -0.1842035
## Cycling, 12-13.9 mph, moderate        0.4248985                     0.4239754
## Cycling, 14-15.9 mph, vigorous        1.0310629                     1.0304115
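
For reference, `scale()` with its default arguments centres each column on its mean and divides it by its standard deviation. The sketch below checks this by hand on a small made-up matrix (`toy_matrix` is illustrative and not part of the dataset).

```r
# Made-up matrix with two columns on very different scales (illustrative only).
toy_matrix <- cbind(small = c(1, 2, 3, 4), large = c(100, 300, 200, 400))

# Manual z-score: subtract the column mean, divide by the column standard deviation.
manual <- apply(toy_matrix, 2, function(column) (column - mean(column)) / sd(column))

# scale() returns the same values, plus attributes recording the centre and scale used.
scaled <- scale(toy_matrix)

print(all.equal(manual, scaled, check.attributes = FALSE))
```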

Choosing K Value for Clustering

The K in K-Means clustering refers to the number of clusters the data is partitioned into. The optimal K value can be determined in two ways:
1) Start with an initial guess of K and adjust it iteratively based on the results
2) Estimate the optimal K value by plotting the number of clusters against the total within sum of squares; this is the method that will be applied

fviz_nbclust(edkmeans2, kmeans, method = "wss") # Plotting K vs Total Within Sum of Square

Based on the plot, the optimal K is estimated to be 4, where the total within sum of squares begins to level off. A lower sum of squares implies lower dissimilarity within clusters. Because the sum of squares always decreases as K grows, K is chosen at the point where adding further clusters yields little improvement, which here is K = 4.
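
`fviz_nbclust(..., method = "wss")` plots the total within-cluster sum of squares for a range of K values. A minimal sketch of the same quantity using only base R's `kmeans` is shown below; the data are randomly generated with three groups for illustration, so the elbow here says nothing about the exercise data.

```r
set.seed(42)  # reproducible toy data and k-means starts

# Made-up one-dimensional data with three well-separated groups (illustrative only).
toy_data <- matrix(c(rnorm(30, mean = 0), rnorm(30, mean = 10), rnorm(30, mean = 20)),
                   ncol = 1)

# Total within-cluster sum of squares for K = 1, ..., 6.
k_values <- 1:6
wss <- sapply(k_values, function(k) kmeans(toy_data, centers = k, nstart = 20)$tot.withinss)

# The curve drops sharply up to K = 3 and flattens afterwards.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within sum of squares")
```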

Performing K-Means Clustering

set.seed(1012) # Set the seed so that result is reproducible
kcluster = kmeans(edkmeans2,centers = 4, nstart = 30) # Code for K-Means cluster execution

Viewing the results of the clustering

kcluster

## K-means clustering with 4 clusters of sizes 105, 88, 11, 44
## 
## Cluster means:
##   X58.96696kg X70.30676kg X81.64656kg X92.98636kg mean_calories_per_hour_per_kg
## 1  -0.8968239  -0.8969867  -0.8966009  -0.8968998                    -0.8968282
## 2   0.1417076   0.1420985   0.1417419   0.1417033                     0.1418129
## 3   2.6579268   2.6574671   2.6572795   2.6566855                     2.6573407
## 4   1.1922509   1.1919727   1.1918121   1.1927511                     1.1921972
## 
## Clustering vector:
##                Cycling, mountain bike, bmx 
##                                          2 
##        Cycling, <10 mph, leisure bicycling 
##                                          1 
##                   Cycling, >20 mph, racing 
##                                          3 
##                Cycling, 10-11.9 mph, light 
##                                          2 
##             Cycling, 12-13.9 mph, moderate 
##                                          2 
##             Cycling, 14-15.9 mph, vigorous 
##                                          4 
##      Cycling, 16-19 mph, very fast, racing 
##                                          4 
##                                 Unicycling 
##                                          1 
##             Stationary cycling, very light 
##                                          1 
##                  Stationary cycling, light 
##                                          2 
##               Stationary cycling, moderate 
##                                          2 
##               Stationary cycling, vigorous 
##                                          4 
##          Stationary cycling, very vigorous 
##                                          4 
##   Calisthenics, vigorous, pushups, situps… 
##                                          2 
##                        Calisthenics, light 
##                                          1 
##             Circuit training, minimal rest 
##                                          2 
##    Weight lifting, body building, vigorous 
##                                          2 
##              Weight lifting, light workout 
##                                          1 
##                       Health club exercise 
##                                          2 
##                              Stair machine 
##                                          4 
##                      Rowing machine, light 
##                                          1 
##                   Rowing machine, moderate 
##                                          2 
##                   Rowing machine, vigorous 
##                                          2 
##              Rowing machine, very vigorous 
##                                          4 
##                                Ski machine 
##                                          2 
##                       Aerobics, low impact 
##                                          1 
##                      Aerobics, high impact 
##                                          2 
##                    Aerobics, step aerobics 
##                                          2 
##                          Aerobics, general 
##                                          2 
##                                 Jazzercise 
##                                          2 
##                     Stretching, hatha yoga 
##                                          1 
##                            Mild stretching 
##                                          1 
##                  Instructing aerobic class 
##                                          2 
##                             Water aerobics 
##                                          1 
##                   Ballet, twist, jazz, tap 
##                                          1 
##                     Ballroom dancing, slow 
##                                          1 
##                     Ballroom dancing, fast 
##                                          2 
##            Running, 5 mph (12 minute mile) 
##                                          2 
##        Running, 5.2 mph (11.5 minute mile) 
##                                          4 
##               Running, 6 mph (10 min mile) 
##                                          4 
##              Running, 6.7 mph (9 min mile) 
##                                          4 
##              Running, 7 mph (8.5 min mile) 
##                                          4 
##               Running, 7.5mph (8 min mile) 
##                                          4 
##              Running, 8 mph (7.5 min mile) 
##                                          3 
##              Running, 8.6 mph (7 min mile) 
##                                          3 
##              Running, 9 mph (6.5 min mile) 
##                                          3 
##               Running, 10 mph (6 min mile) 
##                                          3 
##           Running, 10.9 mph (5.5 min mile) 
##                                          3 
##                     Running, cross country 
##                                          4 
##                           Running, general 
##                                          2 
##         Running, on a track, team practice 
##                                          4 
##                        Running, stairs, up 
##                                          3 
##             Track and field (shot, discus) 
##                                          1 
##    Track and field (high jump, pole vault) 
##                                          2 
##                  Track and field (hurdles) 
##                                          4 
##                                    Archery 
##                                          1 
##                                  Badminton 
##                                          1 
##               Basketball game, competitive 
##                                          2 
##               Playing basketball, non game 
##                                          2 
##                    Basketball, officiating 
##                                          2 
##               Basketball, shooting baskets 
##                                          1 
##                     Basketball, wheelchair 
##                                          2 
##      Running, training, pushing wheelchair 
##                                          2 
##                                  Billiards 
##                                          1 
##                                    Bowling 
##                                          1 
##                            Boxing, in ring 
##                                          4 
##                       Boxing, punching bag 
##                                          2 
##                           Boxing, sparring 
##                                          4 
##    Coaching: football, basketball, soccer… 
##                                          1 
##                 Cricket (batting, bowling) 
##                                          1 
##                                    Croquet 
##                                          1 
##                                    Curling 
##                                          1 
##                       Darts (wall or lawn) 
##                                          1 
##                                    Fencing 
##                                          2 
##                      Football, competitive 
##                                          4 
##             Football, touch, flag, general 
##                                          2 
##        Football or baseball, playing catch 
##                                          1 
##                   Frisbee playing, general 
##                                          1 
##                  Frisbee, ultimate frisbee 
##                                          2 
##                              Golf, general 
##                                          1 
##           Golf, walking and carrying clubs 
##                                          1 
##                        Golf, driving range 
##                                          1 
##                       Golf, miniature golf 
##                                          1 
##            Golf, walking and pulling clubs 
##                                          1 
##                     Golf, using power cart 
##                                          1 
##                                 Gymnastics 
##                                          1 
##                                 Hacky sack 
##                                          1 
##                                   Handball 
##                                          4 
##                             Handball, team 
##                                          2 
##                       Hockey, field hockey 
##                                          2 
##                         Hockey, ice hockey 
##                                          2 
##                    Riding a horse, general 
##                                          1 
##           Horesback riding, saddling horse 
##                                          1 
##           Horseback riding, grooming horse 
##                                          1 
##                 Horseback riding, trotting 
##                                          2 
##                  Horseback riding, walking 
##                                          1 
##                    Horse racing, galloping 
##                                          2 
##                   Horse grooming, moderate 
##                                          2 
##                         Horseshoe pitching 
##                                          1 
##                                   Jai alai 
##                                          4 
##        Martial arts, judo, karate, jujitsu 
##                                          4 
##                  Martial arts, kick boxing 
##                                          4 
##                  Martial arts, tae kwan do 
##                                          4 
##                         Krav maga training 
##                                          4 
##                                   Juggling 
##                                          1 
##                                   Kickball 
##                                          2 
##                                   Lacrosse 
##                                          2 
##                               Orienteering 
##                                          4 
##                         Playing paddleball 
##                                          2 
##                    Paddleball, competitive 
##                                          4 
##                                       Polo 
##                                          2 
##                   Racquetball, competitive 
##                                          4 
##                        Playing racquetball 
##                                          2 
##              Rock climbing, ascending rock 
##                                          4 
##                  Rock climbing, rappelling 
##                                          2 
##                         Jumping rope, fast 
##                                          4 
##                     Jumping rope, moderate 
##                                          4 
##                         Jumping rope, slow 
##                                          2 
##                                      Rugby 
##                                          4 
##                 Shuffleboard, lawn bowling 
##                                          1 
##                              Skateboarding 
##                                          1 
##                             Roller skating 
##                                          2 
##            Roller blading, in-line skating 
##                                          4 
##                                 Sky diving 
##                                          1 
##                        Soccer, competitive 
##                                          4 
##                             Playing soccer 
##                                          2 
##                       Softball or baseball 
##                                          1 
##                      Softball, officiating 
##                                          1 
##                         Softball, pitching 
##                                          2 
##                                     Squash 
##                                          4 
##                    Table tennis, ping pong 
##                                          1 
##                                    Tai chi 
##                                          1 
##                             Playing tennis 
##                                          2 
##                            Tennis, doubles 
##                                          2 
##                            Tennis, singles 
##                                          2 
##                                 Trampoline 
##                                          1 
##                    Volleyball, competitive 
##                                          2 
##                         Playing volleyball 
##                                          1 
##                          Volleyball, beach 
##                                          2 
##                                  Wrestling 
##                                          2 
##                                  Wallyball 
##                                          2 
##              Backpacking, Hiking with pack 
##                                          2 
##              Carrying infant, level ground 
##                                          1 
##                  Carrying infant, upstairs 
##                                          1 
##            Carrying 16 to 24 lbs, upstairs 
##                                          2 
##            Carrying 25 to 49 lbs, upstairs 
##                                          2 
##     Standing, playing with children, light 
##                                          1 
##  Walk/run, playing with children, moderate 
##                                          1 
##  Walk/run, playing with children, vigorous 
##                                          1 
##                    Carrying small children 
##                                          1 
##                     Loading, unloading car 
##                                          1 
##       Climbing hills, carrying up to 9 lbs 
##                                          2 
##       Climbing hills, carrying 10 to 20 lb 
##                                          2 
##       Climbing hills, carrying 21 to 42 lb 
##                                          2 
##        Climbing hills, carrying over 42 lb 
##                                          4 
##                         Walking downstairs 
##                                          1 
##                      Hiking, cross country 
##                                          2 
##                              Bird watching 
##                                          1 
##                Marching, rapidly, military 
##                                          2 
##     Children's games, hopscotch, dodgeball 
##                                          1 
##  Pushing stroller or walking with children 
##                                          1 
##                       Pushing a wheelchair 
##                                          1 
##                               Race walking 
##                                          2 
##           Rock climbing, mountain climbing 
##                                          2 
##                     Walking using crutches 
##                                          1 
##                            Walking the dog 
##                                          1 
##          Walking, under 2.0 mph, very slow 
##                                          1 
##                      Walking 2.0 mph, slow 
##                                          1 
##                            Walking 2.5 mph 
##                                          1 
##                  Walking 3.0 mph, moderate 
##                                          1 
##                Walking 3.5 mph, brisk pace 
##                                          1 
##                    Walking 3.5 mph, uphill 
##                                          2 
##                Walking 4.0 mph, very brisk 
##                                          1 
##                            Walking 4.5 mph 
##                                          2 
##                            Walking 5.0 mph 
##                                          2 
##                 Boating, power, speed boat 
##                                          1 
##                     Canoeing, camping trip 
##                                          1 
##                    Canoeing, rowing, light 
##                                          1 
##                 Canoeing, rowing, moderate 
##                                          2 
##                 Canoeing, rowing, vigorous 
##                                          4 
##        Crew, sculling, rowing, competition 
##                                          4 
##                                   Kayaking 
##                                          1 
##                                Paddle boat 
##                                          1 
##                       Windsurfing, sailing 
##                                          1 
##                       Sailing, competition 
##                                          1 
##           Sailing, yachting, ocean sailing 
##                                          1 
##                       Skiing, water skiing 
##                                          2 
##                               Ski mobiling 
##                                          2 
##                          Skin diving, fast 
##                                          3 
##                      Skin diving, moderate 
##                                          4 
##                  Skin diving, scuba diving 
##                                          2 
##                                 Snorkeling 
##                                          1 
##     Surfing, body surfing or board surfing 
##                                          1 
##     Whitewater rafting, kayaking, canoeing 
##                                          1 
##             Swimming laps, freestyle, fast 
##                                          4 
##             Swimming laps, freestyle, slow 
##                                          2 
##                        Swimming backstroke 
##                                          2 
##                      Swimming breaststroke 
##                                          4 
##                         Swimming butterfly 
##                                          4 
##               Swimming leisurely, not laps 
##                                          2 
##                        Swimming sidestroke 
##                                          2 
##                      Swimming synchronized 
##                                          2 
##   Swimming, treading water, fast, vigorous 
##                                          4 
##         Swimming, treading water, moderate 
##                                          1 
##         Water aerobics, water calisthenics 
##                                          1 
##                                 Water polo 
##                                          4 
##                           Water volleyball 
##                                          1 
##                              Water jogging 
##                                          2 
##            Diving, springboard or platform 
##                                          1 
##                       Ice skating, < 9 mph 
##                                          2 
##                 Ice skating, average speed 
##                                          2 
##                       Ice skating, rapidly 
##                                          4 
##            Speed skating, ice, competitive 
##                                          3 
##            Cross country snow skiing, slow 
##                                          2 
##             Cross country skiing, moderate 
##                                          2 
##             Cross country skiing, vigorous 
##                                          4 
##               Cross country skiing, racing 
##                                          3 
##               Cross country skiing, uphill 
##                                          3 
##        Snow skiing, downhill skiing, light 
##                                          1 
##             Downhill snow skiing, moderate 
##                                          2 
##               Downhill snow skiing, racing 
##                                          2 
##                Sledding, tobagganing, luge 
##                                          2 
##                               Snow shoeing 
##                                          2 
##                               Snowmobiling 
##                                          1 
##                          General housework 
##                                          1 
##                           Cleaning gutters 
##                                          1 
##                                   Painting 
##                                          1 
##                  Sit, playing with animals 
##                                          1 
##           Walk / run, playing with animals 
##                                          1 
##                                Bathing dog 
##                                          1 
##             Mowing lawn, walk, power mower 
##                                          2 
##                  Mowing lawn, riding mower 
##                                          1 
##                       Walking, snow blower 
##                                          1 
##                        Riding, snow blower 
##                                          1 
##                     Shoveling snow by hand 
##                                          2 
##                                Raking lawn 
##                                          1 
##                         Gardening, general 
##                                          1 
##                      Bagging grass, leaves 
##                                          1 
##                    Watering lawn or garden 
##                                          1 
##                Weeding, cultivating garden 
##                                          1 
##                         Carpentry, general 
##                                          1 
##                       Carrying heavy loads 
##                                          2 
##           Carrying moderate loads upstairs 
##                                          2 
##                           General cleaning 
##                                          1 
##                          Cleaning, dusting 
##                                          1 
##                           Taking out trash 
##                                          1 
##              Walking, pushing a wheelchair 
##                                          1 
##    Teach physical education,exercise class 
##                                          1 
## 
## Within cluster sum of squares by cluster:
## [1] 33.982457 33.031002  7.847353 27.959329
##  (between_SS / total_SS =  91.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The data has been clustered into 4 groups of the following sizes:

Cluster 1: 105
Cluster 2: 88
Cluster 3: 11
Cluster 4: 44

In addition, the ratio of the between-cluster sum of squares to the total sum of squares is 91.7%, implying that 91.7% of the variance in the data is explained by the clustering.
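The percentage reported by kmeans() can be reproduced from the components of the fitted object. The following is a minimal sketch on a small made-up data set (toy_points and toy_fit are illustrative names, not part of the analysis); it only shows how betweenss, tot.withinss and totss relate to each other.

```r
# Toy data: two well-separated groups in one dimension (illustrative only)
toy_points <- matrix(c(1, 2, 3, 4, 5, 21, 22, 23, 24, 25), ncol = 1)

set.seed(1)  # kmeans() picks random starting centres
toy_fit <- kmeans(toy_points, centers = 2, nstart = 10)

# Share of the total variance explained by the clustering
explained <- toy_fit$betweenss / toy_fit$totss

# The total sum of squares splits into a between-cluster and a within-cluster part
decomposition_gap <- toy_fit$totss - (toy_fit$betweenss + toy_fit$tot.withinss)

round(100 * explained, 1)
```

For the exercise data, the same ratio would be kcluster$betweenss / kcluster$totss.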

Plotting the final result of the k-means model

fviz_cluster(kcluster, data = edkmeans2, geom = 'point')

K-Means Clustering Result Evaluation

Clustering is an unsupervised learning method whose objective is to group observations by similarity, here by their distance to a cluster mean. Unlike classification, there is no label or target attribute, so accuracy and precision cannot be determined. It is, however, possible to evaluate the quality of a clustering using the silhouette coefficient.

Silhouette Coefficient
The silhouette coefficient of a data point compares its average dissimilarity to the points of the closest neighbouring cluster (Ci) with its average dissimilarity to the other points within its own cluster (Di). The difference between the two is divided by the larger of the two measures.

Si = (Ci - Di)/max(Ci,Di)

Si Value Interpretation
Si > 0: suggests the observation is well clustered, the closer to 1, the better the fit
Si = 0: the observation is between 2 clusters
Si < 0: the observation is probably in the wrong cluster
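To make the formula concrete, the sketch below computes Si by hand in base R for six made-up points in two clusters. The function name silhouette_by_hand and the toy data are assumptions for illustration; setting Si to 0 for a point that is alone in its cluster is a convention chosen here.

```r
# Hand-rolled silhouette, using the Ci / Di notation from the text
silhouette_by_hand <- function(x, cluster) {
  d <- as.matrix(dist(x))                      # pairwise Euclidean distances
  sapply(seq_along(cluster), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                            # exclude the point itself
    if (!any(own)) return(0)                   # singleton cluster: Si set to 0
    Di <- mean(d[i, own])                      # average distance within own cluster
    others <- setdiff(unique(cluster), cluster[i])
    Ci <- min(sapply(others, function(k) mean(d[i, cluster == k])))  # closest other cluster
    (Ci - Di) / max(Ci, Di)
  })
}

toy_x <- c(1, 2, 3, 10, 11, 12)
toy_cluster <- c(1, 1, 1, 2, 2, 2)
toy_si <- silhouette_by_hand(toy_x, toy_cluster)
toy_si
```

For the first point, Di = (1 + 2) / 2 = 1.5 and Ci = (9 + 10 + 11) / 3 = 10, so Si = (10 - 1.5) / 10 = 0.85, which indicates a well-clustered observation.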

sil = silhouette(kcluster$cluster, dist(edkmeans2))
fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1  105          0.69
## 2       2   88          0.59
## 3       3   11          0.67
## 4       4   44          0.52

Based on the silhouette coefficient results, the data appear to be appropriately clustered: the average silhouette width of every cluster is above 0.5, with an overall average of 0.63.

Merging Clustering Result to Original Data Set

The clustering result is added as a new column to the pre-processed exercise_data data set.

exercise_data_clustered = bind_cols(exercise_data,kcluster$cluster)
names(exercise_data_clustered)[7] = 'Cluster'
head(filter(exercise_data_clustered,Cluster==1))

##   Activity..Exercise.or.Sport..1.hour. X58.96696kg X70.30676kg X81.64656kg
## 1  Cycling, <10 mph, leisure bicycling    4.002241    3.996771    4.005068
## 2                           Unicycling    5.002802    5.006631    5.009397
## 3       Stationary cycling, very light    3.001681    3.001134    3.000739
## 4                  Calisthenics, light    3.510440    3.498952    3.502903
## 5        Weight lifting, light workout    3.001681    3.001134    3.000739
## 6                Rowing machine, light    3.510440    3.498952    3.502903
##   X92.98636kg mean_calories_per_hour_per_kg Cluster
## 1    4.000587                      4.001167       1
## 2    5.000733                      5.004891       1
## 3    3.000440                      3.000998       1
## 4    3.505891                      3.504547       1
## 5    3.000440                      3.000998       1
## 6    3.505891                      3.504547       1

K-Means Clustering Result Interpretation

Based on the clustering of the exercise data, there are 4 intensity levels, each with a different average of calories burned per hour per kg. Based on the mean value of each feature, the clusters are arranged in decreasing intensity in the following order:
Cluster 3
Cluster 4
Cluster 2
Cluster 1
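One way to derive such an ordering is to average the burn rate within each cluster and sort the result. The sketch below does this for a handful of rows copied from the head() outputs printed in this document (two rows per cluster); toy_clusters is an illustrative subset, so the means are not those of the full clusters.

```r
# Two rows per cluster, copied from the head() outputs in this document
toy_clusters <- data.frame(
  mean_calories_per_hour_per_kg = c(4.001167, 3.000998,    # Cluster 1
                                    8.509437, 6.001997,    # Cluster 2
                                    16.01447, 13.51127,    # Cluster 3
                                    10.009408, 12.009744), # Cluster 4
  Cluster = c(1, 1, 2, 2, 3, 3, 4, 4)
)

# Mean burn rate per cluster
cluster_means <- aggregate(mean_calories_per_hour_per_kg ~ Cluster,
                           data = toy_clusters, FUN = mean)

# Cluster labels sorted from most to least intense
by_intensity <- order(-cluster_means$mean_calories_per_hour_per_kg)
intensity_order <- cluster_means$Cluster[by_intensity]
intensity_order
```

On the full data, the same comparison could be read from kcluster$centers, which holds the mean of each feature for every cluster.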

Modelling - Linear Regression

Based on the EDA done above, the data points form a flat straight line, which suggests that bodyweight is unlikely to be a significant factor in the calories burnt per hour per kg of bodyweight. To test this claim, regression modelling is used to determine whether bodyweight is a significant factor in the calories burnt per hour per kg.

Data Preparation

In this section, we use the modified data set (exercise_data_clustered) defined previously in K-means clustering modelling.

cluster_1_exercise = filter(exercise_data_clustered,Cluster==1) # Filter for Cluster 1 only
cluster_1 = cluster_1_exercise[,-1] # Removing character column
cluster1 = cluster_1[,-6] # Removing last column
row.names(cluster1) = cluster_1_exercise[,1] # Naming the rows based on the exercise type
head(cluster1) # New data set with cluster 1 only

##                                     X58.96696kg X70.30676kg X81.64656kg
## Cycling, <10 mph, leisure bicycling    4.002241    3.996771    4.005068
## Unicycling                             5.002802    5.006631    5.009397
## Stationary cycling, very light         3.001681    3.001134    3.000739
## Calisthenics, light                    3.510440    3.498952    3.502903
## Weight lifting, light workout          3.001681    3.001134    3.000739
## Rowing machine, light                  3.510440    3.498952    3.502903
##                                     X92.98636kg mean_calories_per_hour_per_kg
## Cycling, <10 mph, leisure bicycling    4.000587                      4.001167
## Unicycling                             5.000733                      5.004891
## Stationary cycling, very light         3.000440                      3.000998
## Calisthenics, light                    3.505891                      3.504547
## Weight lifting, light workout          3.000440                      3.000998
## Rowing machine, light                  3.505891                      3.504547

cluster_2_exercise = filter(exercise_data_clustered,Cluster==2) # Filter for Cluster 2 only
cluster_2 = cluster_2_exercise[,-1] # Removing character column
cluster2 = cluster_2[,-6] # Removing last column
row.names(cluster2) = cluster_2_exercise[,1] # Naming the rows based on the exercise type
head(cluster2) # New data set with cluster 2 only

##                                            X58.96696kg X70.30676kg X81.64656kg
## Cycling, mountain bike, bmx                   8.513242    8.505583    8.512300
## Cycling, 10-11.9 mph, light                   6.003362    6.002268    6.001478
## Cycling, 12-13.9 mph, moderate                8.004483    8.007765    8.010135
## Stationary cycling, light                     5.511561    5.504449    5.499313
## Stationary cycling, moderate                  7.003922    7.012128    7.005806
## Calisthenics, vigorous, pushups, situps…      8.004483    8.007765    8.010135
##                                            X92.98636kg
## Cycling, mountain bike, bmx                   8.506624
## Cycling, 10-11.9 mph, light                   6.000880
## Cycling, 12-13.9 mph, moderate                8.011928
## Stationary cycling, light                     5.506184
## Stationary cycling, moderate                  7.001027
## Calisthenics, vigorous, pushups, situps…      8.011928
##                                            mean_calories_per_hour_per_kg
## Cycling, mountain bike, bmx                                     8.509437
## Cycling, 10-11.9 mph, light                                     6.001997
## Cycling, 12-13.9 mph, moderate                                  8.008578
## Stationary cycling, light                                       5.505377
## Stationary cycling, moderate                                    7.005721
## Calisthenics, vigorous, pushups, situps…                        8.008578

cluster_3_exercise = filter(exercise_data_clustered,Cluster==3) # Filter for Cluster 3 only
cluster_3 = cluster_3_exercise[,-1] # Removing character column
cluster3 = cluster_3[,-6] # Removing last column
row.names(cluster3) = cluster_3_exercise[,1] # Naming the rows based on the exercise type
head(cluster3) # New data set with cluster 3 only

##                                  X58.96696kg X70.30676kg X81.64656kg
## Cycling, >20 mph, racing            16.00897    16.01553    16.02027
## Running, 8 mph (7.5 min mile)       13.51604    13.51221    13.50945
## Running, 8.6 mph (7 min mile)       14.00784    14.01003    14.01161
## Running, 9 mph (6.5 min mile)       15.00840    15.01989    15.01594
## Running, 10 mph (6 min mile)        16.00897    16.01553    16.02027
## Running, 10.9 mph (5.5 min mile)    18.01009    18.02103    18.01668
##                                  X92.98636kg mean_calories_per_hour_per_kg
## Cycling, >20 mph, racing            16.01310                      16.01447
## Running, 8 mph (7.5 min mile)       13.50736                      13.51127
## Running, 8.6 mph (7 min mile)       14.01281                      14.01057
## Running, 9 mph (6.5 min mile)       15.01295                      15.01430
## Running, 10 mph (6 min mile)        16.01310                      16.01447
## Running, 10.9 mph (5.5 min mile)    18.01339                      18.01530

cluster_4_exercise = filter(exercise_data_clustered,Cluster==4) # Filter for Cluster 4 only
cluster_4 = cluster_4_exercise[,-1] # Removing character column
cluster4 = cluster_4[,-6] # Removing last column
row.names(cluster4) = cluster_4_exercise[,1] # Naming the rows based on the exercise type
head(cluster4) # New data set with cluster 4 only

##                                       X58.96696kg X70.30676kg X81.64656kg
## Cycling, 14-15.9 mph, vigorous          10.005603   10.013262   10.006545
## Cycling, 16-19 mph, very fast, racing   12.006724   12.004536   12.015203
## Stationary cycling, vigorous            10.514363   10.511080   10.508710
## Stationary cycling, very vigorous       12.515483   12.516577   12.517368
## Stair machine                            9.005043    9.003402    9.002216
## Rowing machine, very vigorous           12.006724   12.004536   12.015203
##                                       X92.98636kg mean_calories_per_hour_per_kg
## Cycling, 14-15.9 mph, vigorous          10.012221                     10.009408
## Cycling, 16-19 mph, very fast, racing   12.012515                     12.009744
## Stationary cycling, vigorous            10.506917                     10.510268
## Stationary cycling, very vigorous       12.507211                     12.514160
## Stair machine                            9.012074                      9.005684
## Rowing machine, very vigorous           12.012515                     12.009744

Determine Correlation Coefficient

Here, we determine the pairwise Pearson correlation coefficients between the calories per hour per kg variables, separately for each cluster.

Correlation coefficient on Cluster 1

cor(cluster1, method = "pearson")

##                               X58.96696kg X70.30676kg X81.64656kg X92.98636kg
## X58.96696kg                     1.0000000   0.9999860   0.9999862   0.9999968
## X70.30676kg                     0.9999860   1.0000000   0.9999871   0.9999891
## X81.64656kg                     0.9999862   0.9999871   1.0000000   0.9999944
## X92.98636kg                     0.9999968   0.9999891   0.9999944   1.0000000
## mean_calories_per_hour_per_kg   0.9999960   0.9999943   0.9999957   0.9999989
##                               mean_calories_per_hour_per_kg
## X58.96696kg                                       0.9999960
## X70.30676kg                                       0.9999943
## X81.64656kg                                       0.9999957
## X92.98636kg                                       0.9999989
## mean_calories_per_hour_per_kg                     1.0000000

We determine the correlation coefficient between X58.96696kg and mean_calories_per_hour_per_kg.

cor.test(cluster1$X58.96696kg, cluster1$mean_calories_per_hour_per_kg, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  cluster1$X58.96696kg and cluster1$mean_calories_per_hour_per_kg
## t = 3597.8, df = 103, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9999941 0.9999973
## sample estimates:
##      cor 
## 0.999996

The correlation coefficient of 0.999996 indicates a near-perfect positive correlation between X58.96696kg and mean_calories_per_hour_per_kg. As the p-value for the test is far smaller than 0.05 (p < 2.2e-16), the null hypothesis of zero correlation is rejected. The 95% confidence interval runs from 0.9999941 to 0.9999973.

We determine the correlation coefficient between X70.30676kg and mean_calories_per_hour_per_kg.

cor.test(cluster1$X70.30676kg, cluster1$mean_calories_per_hour_per_kg, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  cluster1$X70.30676kg and cluster1$mean_calories_per_hour_per_kg
## t = 3008.4, df = 103, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9999916 0.9999961
## sample estimates:
##       cor 
## 0.9999943

The correlation coefficient of 0.9999943 indicates a near-perfect positive correlation between X70.30676kg and mean_calories_per_hour_per_kg. As the p-value for the test is far smaller than 0.05 (p < 2.2e-16), the null hypothesis of zero correlation is rejected. The 95% confidence interval runs from 0.9999916 to 0.9999961.

We determine the correlation coefficient between X81.64656kg and mean_calories_per_hour_per_kg.

cor.test(cluster1$X81.64656kg, cluster1$mean_calories_per_hour_per_kg, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  cluster1$X81.64656kg and cluster1$mean_calories_per_hour_per_kg
## t = 3467.8, df = 103, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9999937 0.9999971
## sample estimates:
##       cor 
## 0.9999957

The correlation coefficient of 0.9999957 indicates a near-perfect positive correlation between X81.64656kg and mean_calories_per_hour_per_kg. As the p-value for the test is far smaller than 0.05 (p < 2.2e-16), the null hypothesis of zero correlation is rejected. The 95% confidence interval runs from 0.9999937 to 0.9999971.

We determine the correlation coefficient between X92.98636kg and mean_calories_per_hour_per_kg.

cor.test(edkmeans1$X92.98636kg, edkmeans1$mean_calories_per_hour_per_kg, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  edkmeans1$X92.98636kg and edkmeans1$mean_calories_per_hour_per_kg
## t = 17929, df = 246, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9999995 0.9999997
## sample estimates:
##       cor 
## 0.9999996

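The correlation coefficient of 0.9999996 indicates a near-perfect positive correlation between X92.98636kg and mean_calories_per_hour_per_kg. As the p-value for the test is far smaller than 0.05 (p < 2.2e-16), the null hypothesis of zero correlation is rejected. The 95% confidence interval runs from 0.9999995 to 0.9999997. Note that, unlike the three tests above, this test was run on edkmeans1, which holds all 248 exercises (df = 246), rather than on cluster1. Within Cluster 1 alone, the correlation matrix above gives 0.9999989 for this pair.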
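Running the four tests in a loop keeps them on the same data frame and lets the estimate, confidence interval and p-value be read from the test object instead of being rounded by hand. The sketch below is an illustration under assumptions: toy_cluster1 holds only the six Cluster 1 rows printed by head(cluster1) earlier, and the mean column is recomputed with rowMeans(), so the numbers will differ from the full-cluster results above.

```r
# Six Cluster 1 rows as printed by head(cluster1); illustrative subset only
toy_cluster1 <- data.frame(
  X58.96696kg = c(4.002241, 5.002802, 3.001681, 3.510440, 3.001681, 3.510440),
  X70.30676kg = c(3.996771, 5.006631, 3.001134, 3.498952, 3.001134, 3.498952),
  X81.64656kg = c(4.005068, 5.009397, 3.000739, 3.502903, 3.000739, 3.502903),
  X92.98636kg = c(4.000587, 5.000733, 3.000440, 3.505891, 3.000440, 3.505891)
)
toy_cluster1$mean_calories_per_hour_per_kg <- rowMeans(toy_cluster1)

weight_columns <- setdiff(names(toy_cluster1), "mean_calories_per_hour_per_kg")

# One Pearson test per weight column, always against the same data frame
cor_summary <- t(sapply(weight_columns, function(column) {
  test <- cor.test(toy_cluster1[[column]],
                   toy_cluster1$mean_calories_per_hour_per_kg,
                   method = "pearson")
  c(cor      = unname(test$estimate),
    ci_lower = test$conf.int[1],
    ci_upper = test$conf.int[2],
    p_value  = test$p.value,
    df       = unname(test$parameter))
}))
cor_summary
```

Because the mean column is an average of the four weight columns, correlations this close to 1 are expected by construction.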

Correlation coefficient on Cluster 2

cor(cluster2, method = "pearson")

##                               X58.96696kg X70.30676kg X81.64656kg X92.98636kg
## X58.96696kg                     1.0000000   0.9999832   0.9999943   0.9999915
## X70.30676kg                     0.9999832   1.0000000   0.9999926   0.9999789
## X81.64656kg                     0.9999943   0.9999926   1.0000000   0.9999932
## X92.98636kg                     0.9999915   0.9999789   0.9999932   1.0000000
## mean_calories_per_hour_per_kg   0.9999964   0.9999928   0.9999992   0.9999950
##                               mean_calories_per_hour_per_kg
## X58.96696kg                                       0.9999964
## X70.30676kg                                       0.9999928
## X81.64656kg                                       0.9999992
## X92.98636kg                                       0.9999950
## mean_calories_per_hour_per_kg                     1.0000000

Correlation coefficient on Cluster 3

cor(cluster3, method = "pearson")

##                               X58.96696kg X70.30676kg X81.64656kg X92.98636kg
## X58.96696kg                     1.0000000   0.9999909   0.9999883   0.9999957
## X70.30676kg                     0.9999909   1.0000000   0.9999955   0.9999942
## X81.64656kg                     0.9999883   0.9999955   1.0000000   0.9999939
## X92.98636kg                     0.9999957   0.9999942   0.9999939   1.0000000
## mean_calories_per_hour_per_kg   0.9999963   0.9999977   0.9999970   0.9999985
##                               mean_calories_per_hour_per_kg
## X58.96696kg                                       0.9999963
## X70.30676kg                                       0.9999977
## X81.64656kg                                       0.9999970
## X92.98636kg                                       0.9999985
## mean_calories_per_hour_per_kg                     1.0000000

Correlation coefficient on Cluster 4

cor(cluster4, method = "pearson")

##                               X58.96696kg X70.30676kg X81.64656kg X92.98636kg
## X58.96696kg                     1.0000000   0.9999928   0.9999978   0.9999942
## X70.30676kg                     0.9999928   1.0000000   0.9999920   0.9999893
## X81.64656kg                     0.9999978   0.9999920   1.0000000   0.9999991
## X92.98636kg                     0.9999942   0.9999893   0.9999991   1.0000000
## mean_calories_per_hour_per_kg   0.9999984   0.9999957   0.9999994   0.9999978
##                               mean_calories_per_hour_per_kg
## X58.96696kg                                       0.9999984
## X70.30676kg                                       0.9999957
## X81.64656kg                                       0.9999994
## X92.98636kg                                       0.9999978
## mean_calories_per_hour_per_kg                     1.0000000

Regression Model and Summary of The Regression Model

Once again, our modelling objective with regard to regression is to estimate how many calories a person of bodyweight X will burn over one hour of a certain exercise. From our EDA, we suspect that bodyweight does not play a role in determining calories burnt per hour once the figure is adjusted for bodyweight. Therefore, our first sub-objective is to test the claim that bodyweight does not affect the calorie burn rate per kg.

exercise_data_flattened <- data.frame(
  exercise = rep(exercise_data$Activity..Exercise.or.Sport..1.hour., length(bodyweights_kgs)),
  calories_per_hour_per_kg = unlist(exercise_data[,c(-1, -ncol(exercise_data))]),
  bodyweight = rep(bodyweights_kgs, each = nrow(exercise_data)),
  cluster = rep(exercise_data_clustered$Cluster, length(bodyweights_kgs)))
ggplot(exercise_data_flattened[exercise_data_flattened$exercise == exercise_data_flattened$exercise[1], ],aes(x=bodyweight, y=calories_per_hour_per_kg)) + 
  geom_point() + 
  stat_smooth(method="lm",se=FALSE) + 
  ylim(0, max(exercise_data_flattened[exercise_data_flattened$exercise == exercise_data_flattened$exercise[1], ]$calories_per_hour_per_kg))

## `geom_smooth()` using formula 'y ~ x'

The chart above is very similar to the one in our EDA. It is a graph of calories_per_hour_per_kg against bodyweight for the “Cycling, mountain bike, bmx” exercise, and it forms a nearly perfect flat line. However, we require more than just visual proof, so a linear regression model is applied.

regression_model <- lm(calories_per_hour_per_kg~bodyweight,exercise_data_flattened[exercise_data_flattened$exercise == exercise_data_flattened$exercise[1], ])
summary(regression_model) #summary of regression model

## 
## Call:
## lm(formula = calories_per_hour_per_kg ~ bodyweight, data = exercise_data_flattened[exercise_data_flattened$exercise == 
##     exercise_data_flattened$exercise[1], ])
## 
## Residuals:
## X58.96696kg1 X70.30676kg1 X81.64656kg1 X92.98636kg1 
##    0.0018341   -0.0045109    0.0035194   -0.0008427 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.5182393  0.0130321 653.636 2.34e-06 ***
## bodyweight  -0.0001159  0.0001692  -0.685    0.564    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00429 on 2 degrees of freedom
## Multiple R-squared:  0.1899, Adjusted R-squared:  -0.2151 
## F-statistic: 0.4689 on 1 and 2 DF,  p-value: 0.5642

From the results table above, the beta coefficient for bodyweight (-0.0001159, p = 0.564) is not statistically significant at any conventional significance level. A non-significant coefficient estimated from only four data points does not prove the absence of an effect, but the result is consistent with bodyweight having no influence on calories_per_hour_per_kg. This is just one exercise, however, and there are 247 others to consider.

p_values <- vector(length = nrow(exercise_data))
betas <- vector(length = nrow(exercise_data))
for (index in 1:nrow(exercise_data)) {
  model <- lm(calories_per_hour_per_kg~bodyweight,exercise_data_flattened[exercise_data_flattened$exercise == exercise_data_flattened$exercise[index], ])
  p_values[index] <- summary(model)$coefficients["bodyweight", "Pr(>|t|)"]
  betas[index] <- summary(model)$coefficients["bodyweight", "Estimate"]
}  
summary(p_values) #summary of all bodyweight coefficients' p-values

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.009059 0.009059 0.498406 0.397729 0.648576 0.977595

Interestingly, the bodyweight coefficient is statistically significant for some exercises: the smallest p-value is 0.009059. We examine the values of the coefficients that have p-values of less than 0.05.

significance_level <- 0.05
significant_beta_indices <- p_values < significance_level
significant_betas <- betas[significant_beta_indices]
summary(significant_betas) #summary of significant beta coefficients

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -2.542e-04 -7.262e-05 -3.631e-05  4.256e-05  2.179e-04  2.542e-04

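All of the significant beta coefficients are very small in magnitude (at most 2.542e-04 calories per hour per kg for each additional kg of bodyweight). Although some are statistically significant, we therefore conclude that bodyweight has no practically meaningful influence on calories_per_hour_per_kg.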
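A quick calculation shows how small these coefficients are in practice. The sketch below takes the largest significant coefficient from the summary above and applies it across the full bodyweight range of the data set (58.96696 kg to 92.98636 kg, read from the column names). The comparison value of about 3 calories per hour per kg is the rate printed earlier for "Stationary cycling, very light"; the variable names are illustrative.

```r
largest_beta <- 2.542e-04                   # largest |beta| among the significant coefficients
bodyweight_range <- 92.98636 - 58.96696     # kg, lightest to heaviest weight column

# Largest change in calories per hour per kg across the whole weight range
max_shift <- largest_beta * bodyweight_range
max_shift

# The same change as a percentage of a light activity's burn rate
light_rate <- 3.000998                      # "Stationary cycling, very light"
percent_of_light_rate <- 100 * max_shift / light_rate
percent_of_light_rate
```

Even for one of the least intense activities, the shift across the whole weight range stays below one third of a percent of the burn rate.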

Prediction on Calories Burnt for Certain Weight

Nevertheless, the question still remains: how many calories will a person of bodyweight X burn during an hour of a certain exercise? With our findings above, in the absence of an impact from bodyweight on calories_per_hour_per_kg, we could perform simple arithmetic calculations to arrive at the desired result. However, we propose the use of more linear regressions.
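For reference, the simple arithmetic alternative mentioned above amounts to multiplying an exercise's burn rate per kg by the person's bodyweight. The sketch below is a minimal illustration: estimate_calories_per_hour is a hypothetical helper, and the rate of 8.509437 is the mean_calories_per_hour_per_kg value printed earlier for "Cycling, mountain bike, bmx".

```r
# Hypothetical helper: calories per hour = rate per kg * bodyweight in kg
estimate_calories_per_hour <- function(rate_per_kg, bodyweight_kg) {
  rate_per_kg * bodyweight_kg
}

mountain_bike_rate <- 8.509437   # mean calories per hour per kg, from the table above

# A 155 lb (70.30676 kg) person; the original data set lists 598 for this weight
estimate_155lb <- estimate_calories_per_hour(mountain_bike_rate, 70.30676)
estimate_155lb

# A bodyweight that is not one of the four columns in the data set
estimate_75kg <- estimate_calories_per_hour(mountain_bike_rate, 75)
estimate_75kg
```

The estimate for 155 lb lands within one calorie of the tabulated value, which is what a constant per-kg rate implies.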

original_exercise_data <- read.csv('exercise_dataset.csv')
# Column and row counts are taken from original_exercise_data itself rather than from exercise_data
original_exercise_data_flattened <- data.frame(
  exercise = rep(original_exercise_data$Activity..Exercise.or.Sport..1.hour., length(bodyweights_kgs)),
  calories_per_hour = unlist(original_exercise_data[,c(-1, -ncol(original_exercise_data))]),
  bodyweight = rep(bodyweights_kgs, each = nrow(original_exercise_data)),
  cluster = rep(exercise_data_clustered$Cluster, length(bodyweights_kgs)))
ggplot(original_exercise_data_flattened[original_exercise_data_flattened$exercise == original_exercise_data_flattened$exercise[1], ],aes(x=bodyweight, y=calories_per_hour)) + 
  geom_point() + 
  stat_smooth(method="lm",se=FALSE) + 
  ylim(0, max(original_exercise_data_flattened[original_exercise_data_flattened$exercise == original_exercise_data_flattened$exercise[1], ]$calories_per_hour))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1 rows containing missing values (geom_smooth).

For this purpose, we used our original dataset whose values were expressed in calories_per_hour rather than calories_per_hour_per_kg. Here, for the “Cycling, mountain bike, bmx” exercise, we observe that an increase in bodyweight is associated with an increase in calories_per_hour of exercise, and the linear model has a very good fit. Does this hold across all exercises?
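For a single exercise, a fitted line of this kind can be turned into a prediction with predict(). The sketch below is an illustration under assumptions: it rebuilds the four "Cycling, mountain bike, bmx" data points from the first row of the original data set (502, 598, 695 and 791 calories at 130, 155, 180 and 205 lb), and the names toy_bodyweights_kg, mountain_bike and toy_fit are introduced here only for the example.

```r
# Bodyweights in kg, as they appear in the column names (130, 155, 180, 205 lb)
toy_bodyweights_kg <- c(58.96696, 70.30676, 81.64656, 92.98636)

mountain_bike <- data.frame(
  bodyweight = toy_bodyweights_kg,
  calories_per_hour = c(502, 598, 695, 791)   # first row of the original data set
)

# Simple linear regression of calories per hour on bodyweight
toy_fit <- lm(calories_per_hour ~ bodyweight, data = mountain_bike)
toy_r_squared <- summary(toy_fit)$r.squared

# Predicted calories per hour for a 75 kg person
predicted_75kg <- unname(predict(toy_fit, newdata = data.frame(bodyweight = 75)))
predicted_75kg
```

The prediction falls between the tabulated values for 155 lb and 180 lb, as expected for a bodyweight between those two columns.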

r_squareds <- numeric(nrow(exercise_data))
betas <- numeric(nrow(exercise_data))
for (index in seq_len(nrow(exercise_data))) {
  # Select the rows (one per bodyweight) that belong to the current exercise
  exercise_rows <- original_exercise_data_flattened$exercise == original_exercise_data_flattened$exercise[index]
  model <- lm(calories_per_hour ~ bodyweight, original_exercise_data_flattened[exercise_rows, ])
  r_squareds[index] <- summary(model)$r.squared
  betas[index] <- summary(model)$coefficients["bodyweight", "Estimate"]
}

## Warning in summary.lm(model): essentially perfect fit: summary may be unreliable

summary(r_squareds) #summary of the r-squared values of all per-exercise models

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9999  1.0000  1.0000  1.0000  1.0000  1.0000
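The r-squared values summarised above are read from summary(model), which is also where the "essentially perfect fit" warning originates. As a cross-check that does not call summary() at all, r-squared can be computed from its definition, 1 - RSS/TSS. The sketch below does this for a single exercise, using the four "Cycling, mountain bike, bmx" values from the dataset preview. The conversion factor is an assumption and may differ slightly from the one behind bodyweights_kgs.

```r
# Four observations for "Cycling, mountain bike, bmx" (from the dataset preview).
lb_to_kg <- 0.45359237                          # assumed conversion factor
bodyweight <- c(130, 155, 180, 205) * lb_to_kg  # bodyweights in kg
calories_per_hour <- c(502, 598, 695, 791)

model <- lm(calories_per_hour ~ bodyweight)

# R^2 = 1 - RSS / TSS, computed directly from the residuals.
rss <- sum(residuals(model)^2)
tss <- sum((calories_per_hour - mean(calories_per_hour))^2)
r_squared <- 1 - rss / tss
print(r_squared)
```

For this exercise the value lands above 0.9999, in line with the summary above.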

Every one of our linear regression models has an r-squared value of at least 0.9999. Granted, each model is fitted to only 4 data points, but the goodness of fit is compelling. We can now answer our original question.

predictions <- numeric(nrow(exercise_data))
target_bodyweight <- 75  # bodyweight (kg) to predict for
for (index in seq_len(nrow(exercise_data))) {
  exercise_rows <- original_exercise_data_flattened$exercise == original_exercise_data_flattened$exercise[index]
  model <- lm(calories_per_hour ~ bodyweight, original_exercise_data_flattened[exercise_rows, ])
  # The newdata column must carry the predictor's name from the model formula (bodyweight)
  predictions[index] <- predict(model, data.frame(bodyweight = target_bodyweight))
}
pred_df <- data.frame(exercise = exercise_data$Activity..Exercise.or.Sport..1.hour., calories_per_hour = predictions)
pred_df

##                                       exercise calories_per_hour
## 1                  Cycling, mountain bike, bmx          638.1974
## 2          Cycling, <10 mph, leisure bicycling          300.0898
## 3                     Cycling, >20 mph, racing         1201.1008
## 4                  Cycling, 10-11.9 mph, light          450.1434
## 5               Cycling, 12-13.9 mph, moderate          600.6625
## 6               Cycling, 14-15.9 mph, vigorous          750.7160
## 7        Cycling, 16-19 mph, very fast, racing          900.7523
## 8                                   Unicycling          375.3666
## 9               Stationary cycling, very light          225.0717
## 10                   Stationary cycling, light          412.8843
## 11                Stationary cycling, moderate          525.4201
## 12                Stationary cycling, vigorous          788.2509
## 13           Stationary cycling, very vigorous          938.5458
## 14    Calisthenics, vigorous, pushups, situps…          600.6625
## 15                         Calisthenics, light          262.8308
## 16              Circuit training, minimal rest          600.6625
## 17     Weight lifting, body building, vigorous          450.1434
## 18               Weight lifting, light workout          225.0717
## 19                        Health club exercise          412.8843
## 20                               Stair machine          675.4392
## 21                       Rowing machine, light          262.8308
## 22                    Rowing machine, moderate          525.4201
## 23                    Rowing machine, vigorous          638.1974
## 24               Rowing machine, very vigorous          900.7523
## 25                                 Ski machine          525.4201
## 26                        Aerobics, low impact          375.3666
## 27                       Aerobics, high impact          525.4201
## 28                     Aerobics, step aerobics          638.1974
## 29                           Aerobics, general          487.9025
## 30                                  Jazzercise          450.1434
## 31                      Stretching, hatha yoga          300.0898
## 32                             Mild stretching          187.8126
## 33                   Instructing aerobic class          450.1434
## 34                              Water aerobics          300.0898
## 35                    Ballet, twist, jazz, tap          338.1075
## 36                      Ballroom dancing, slow          225.0717
## 37                      Ballroom dancing, fast          412.8843
## 38             Running, 5 mph (12 minute mile)          600.6625
## 39         Running, 5.2 mph (11.5 minute mile)          675.4392
## 40                Running, 6 mph (10 min mile)          750.7160
## 41               Running, 6.7 mph (9 min mile)          825.7342
## 42               Running, 7 mph (8.5 min mile)          863.2691
## 43                Running, 7.5mph (8 min mile)          938.5458
## 44               Running, 8 mph (7.5 min mile)         1013.3226
## 45               Running, 8.6 mph (7 min mile)         1050.8058
## 46               Running, 9 mph (6.5 min mile)         1126.0826
## 47                Running, 10 mph (6 min mile)         1201.1008
## 48            Running, 10.9 mph (5.5 min mile)         1351.1543
## 49                      Running, cross country          675.4392
## 50                            Running, general          600.6625
## 51          Running, on a track, team practice          750.7160
## 52                         Running, stairs, up         1126.0826
## 53              Track and field (shot, discus)          300.0898
## 54     Track and field (high jump, pole vault)          450.1434
## 55                   Track and field (hurdles)          750.7160
## 56                                     Archery          262.8308
## 57                                   Badminton          338.1075
## 58                Basketball game, competitive          600.6625
## 59                Playing basketball, non game          450.1434
## 60                     Basketball, officiating          525.4201
## 61                Basketball, shooting baskets          338.1075
## 62                      Basketball, wheelchair          487.9025
## 63       Running, training, pushing wheelchair          600.6625
## 64                                   Billiards          187.8126
## 65                                     Bowling          225.0717
## 66                             Boxing, in ring          900.7523
## 67                        Boxing, punching bag          450.1434
## 68                            Boxing, sparring          675.4392
## 69     Coaching: football, basketball, soccer…          300.0898
## 70                  Cricket (batting, bowling)          375.3666
## 71                                     Croquet          187.8126
## 72                                     Curling          300.0898
## 73                        Darts (wall or lawn)          187.8126
## 74                                     Fencing          450.1434
## 75                       Football, competitive          675.4392
## 76              Football, touch, flag, general          600.6625
## 77         Football or baseball, playing catch          187.8126
## 78                    Frisbee playing, general          225.0717
## 79                   Frisbee, ultimate frisbee          600.6625
## 80                               Golf, general          338.1075
## 81            Golf, walking and carrying clubs          338.1075
## 82                         Golf, driving range          225.0717
## 83                        Golf, miniature golf          225.0717
## 84             Golf, walking and pulling clubs          322.8142
## 85                      Golf, using power cart          262.8308
## 86                                  Gymnastics          300.0898
## 87                                  Hacky sack          300.0898
## 88                                    Handball          900.7523
## 89                              Handball, team          600.6625
## 90                        Hockey, field hockey          600.6625
## 91                          Hockey, ice hockey          600.6625
## 92                     Riding a horse, general          300.0898
## 93            Horesback riding, saddling horse          262.8308
## 94            Horseback riding, grooming horse          262.8308
## 95                  Horseback riding, trotting          487.9025
## 96                   Horseback riding, walking          187.8126
## 97                     Horse racing, galloping          600.6625
## 98                    Horse grooming, moderate          450.1434
## 99                          Horseshoe pitching          225.0717
## 100                                   Jai alai          900.7523
## 101        Martial arts, judo, karate, jujitsu          750.7160
## 102                  Martial arts, kick boxing          750.7160
## 103                  Martial arts, tae kwan do          750.7160
## 104                         Krav maga training          750.7160
## 105                                   Juggling          300.0898
## 106                                   Kickball          525.4201
## 107                                   Lacrosse          600.6625
## 108                               Orienteering          675.4392
## 109                         Playing paddleball          450.1434
## 110                    Paddleball, competitive          750.7160
## 111                                       Polo          600.6625
## 112                   Racquetball, competitive          750.7160
## 113                        Playing racquetball          525.4201
## 114              Rock climbing, ascending rock          825.7342
## 115                  Rock climbing, rappelling          600.6625
## 116                         Jumping rope, fast          900.7523
## 117                     Jumping rope, moderate          750.7160
## 118                         Jumping rope, slow          600.6625
## 119                                      Rugby          750.7160
## 120                 Shuffleboard, lawn bowling          225.0717
## 121                              Skateboarding          375.3666
## 122                             Roller skating          525.4201
## 123            Roller blading, in-line skating          900.7523
## 124                                 Sky diving          225.0717
## 125                        Soccer, competitive          750.7160
## 126                             Playing soccer          525.4201
## 127                       Softball or baseball          375.3666
## 128                      Softball, officiating          300.0898
## 129                         Softball, pitching          450.1434
## 130                                     Squash          900.7523
## 131                    Table tennis, ping pong          300.0898
## 132                                    Tai chi          300.0898
## 133                             Playing tennis          525.4201
## 134                            Tennis, doubles          450.1434
## 135                            Tennis, singles          600.6625
## 136                                 Trampoline          262.8308
## 137                    Volleyball, competitive          600.6625
## 138                         Playing volleyball          225.0717
## 139                          Volleyball, beach          600.6625
## 140                                  Wrestling          450.1434
## 141                                  Wallyball          525.4201
## 142              Backpacking, Hiking with pack          525.4201
## 143              Carrying infant, level ground          262.8308
## 144                  Carrying infant, upstairs          375.3666
## 145            Carrying 16 to 24 lbs, upstairs          450.1434
## 146            Carrying 25 to 49 lbs, upstairs          600.6625
## 147     Standing, playing with children, light          210.2439
## 148  Walk/run, playing with children, moderate          300.0898
## 149  Walk/run, playing with children, vigorous          375.3666
## 150                    Carrying small children          225.0717
## 151                     Loading, unloading car          225.0717
## 152       Climbing hills, carrying up to 9 lbs          525.4201
## 153       Climbing hills, carrying 10 to 20 lb          563.1792
## 154       Climbing hills, carrying 21 to 42 lb          600.6625
## 155        Climbing hills, carrying over 42 lb          675.4392
## 156                         Walking downstairs          225.0717
## 157                      Hiking, cross country          450.1434
## 158                              Bird watching          187.8126
## 159                Marching, rapidly, military          487.9025
## 160     Children's games, hopscotch, dodgeball          375.3666
## 161  Pushing stroller or walking with children          187.8126
## 162                       Pushing a wheelchair          300.0898
## 163                               Race walking          487.9025
## 164           Rock climbing, mountain climbing          600.6625
## 165                     Walking using crutches          375.3666
## 166                            Walking the dog          225.0717
## 167          Walking, under 2.0 mph, very slow          150.0535
## 168                      Walking 2.0 mph, slow          187.8126
## 169                            Walking 2.5 mph          225.0717
## 170                  Walking 3.0 mph, moderate          247.7789
## 171                Walking 3.5 mph, brisk pace          285.2621
## 172                    Walking 3.5 mph, uphill          450.1434
## 173                Walking 4.0 mph, very brisk          375.3666
## 174                            Walking 4.5 mph          472.8506
## 175                            Walking 5.0 mph          600.6625
## 176                 Boating, power, speed boat          187.8126
## 177                     Canoeing, camping trip          300.0898
## 178                    Canoeing, rowing, light          225.0717
## 179                 Canoeing, rowing, moderate          525.4201
## 180                 Canoeing, rowing, vigorous          900.7523
## 181        Crew, sculling, rowing, competition          900.7523
## 182                                   Kayaking          375.3666
## 183                                Paddle boat          300.0898
## 184                       Windsurfing, sailing          225.0717
## 185                       Sailing, competition          375.3666
## 186           Sailing, yachting, ocean sailing          225.0717
## 187                       Skiing, water skiing          450.1434
## 188                               Ski mobiling          525.4201
## 189                          Skin diving, fast         1201.1008
## 190                      Skin diving, moderate          938.5458
## 191                  Skin diving, scuba diving          525.4201
## 192                                 Snorkeling          375.3666
## 193     Surfing, body surfing or board surfing          225.0717
## 194     Whitewater rafting, kayaking, canoeing          375.3666
## 195             Swimming laps, freestyle, fast          750.7160
## 196             Swimming laps, freestyle, slow          525.4201
## 197                        Swimming backstroke          525.4201
## 198                      Swimming breaststroke          750.7160
## 199                         Swimming butterfly          825.7342
## 200               Swimming leisurely, not laps          450.1434
## 201                        Swimming sidestroke          600.6625
## 202                      Swimming synchronized          600.6625
## 203   Swimming, treading water, fast, vigorous          750.7160
## 204         Swimming, treading water, moderate          300.0898
## 205         Water aerobics, water calisthenics          300.0898
## 206                                 Water polo          750.7160
## 207                           Water volleyball          225.0717
## 208                              Water jogging          600.6625
## 209            Diving, springboard or platform          225.0717
## 210                       Ice skating, < 9 mph          412.8843
## 211                 Ice skating, average speed          525.4201
## 212                       Ice skating, rapidly          675.4392
## 213            Speed skating, ice, competitive         1126.0826
## 214            Cross country snow skiing, slow          525.4201
## 215             Cross country skiing, moderate          600.6625
## 216             Cross country skiing, vigorous          675.4392
## 217               Cross country skiing, racing         1050.8058
## 218               Cross country skiing, uphill         1238.6185
## 219        Snow skiing, downhill skiing, light          375.3666
## 220             Downhill snow skiing, moderate          450.1434
## 221               Downhill snow skiing, racing          600.6625
## 222                Sledding, tobagganing, luge          525.4201
## 223                               Snow shoeing          600.6625
## 224                               Snowmobiling          262.8308
## 225                          General housework          262.8308
## 226                           Cleaning gutters          375.3666
## 227                                   Painting          338.1075
## 228                  Sit, playing with animals          187.8126
## 229           Walk / run, playing with animals          300.0898
## 230                                Bathing dog          262.8308
## 231             Mowing lawn, walk, power mower          412.8843
## 232                  Mowing lawn, riding mower          187.8126
## 233                       Walking, snow blower          262.8308
## 234                        Riding, snow blower          225.0717
## 235                     Shoveling snow by hand          450.1434
## 236                                Raking lawn          322.8142
## 237                         Gardening, general          300.0898
## 238                      Bagging grass, leaves          300.0898
## 239                    Watering lawn or garden          113.0358
## 240                Weeding, cultivating garden          338.1075
## 241                         Carpentry, general          262.8308
## 242                       Carrying heavy loads          600.6625
## 243           Carrying moderate loads upstairs          600.6625
## 244                           General cleaning          262.8308
## 245                          Cleaning, dusting          187.8126
## 246                           Taking out trash          225.0717
## 247              Walking, pushing a wheelchair          300.0898
## 248    Teach physical education,exercise class          300.0898

The output above shows the calories that a person weighing 75 kg would burn during one hour of each exercise. However, a person looking at the entire menu of 248 exercises may find it difficult to choose. An interesting follow-up question is: what if we modelled calories_per_hour not just by individual exercise, but by cluster of exercises?
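As a rough illustration of that question, the sketch below fits a single linear regression in which the weight effect is allowed to differ by cluster. It is not the model used in this report. The three activities and their calorie values are taken from the first rows of the dataset, but the cluster labels, the column names weight_lb and cluster, and the kilogram-to-pound conversion are assumptions made only for this example.

```r
# Toy long-format data: three activities from the dataset at four body weights.
# The cluster labels ("1", "2") are hypothetical, not the report's k-means output.
toy <- data.frame(
  activity = rep(c("Cycling, mountain bike, bmx",
                   "Cycling, <10 mph, leisure bicycling",
                   "Cycling, >20 mph, racing"), each = 4),
  cluster = factor(rep(c("1", "1", "2"), each = 4)),
  weight_lb = rep(c(130, 155, 180, 205), times = 3),
  calories_per_hour = c(502, 598, 695, 791,
                        236, 281, 327, 372,
                        944, 1126, 1308, 1489)
)

# One regression with a separate intercept and weight slope for each cluster.
fit <- lm(calories_per_hour ~ weight_lb * cluster, data = toy)

# Predict calories per hour for a 75 kg person in each cluster.
# Assumption: weights are converted with 1 kg = 2.20462 lb.
new_person <- data.frame(
  weight_lb = 75 * 2.20462,
  cluster = factor(levels(toy$cluster), levels = levels(toy$cluster))
)
new_person$predicted <- predict(fit, newdata = new_person)
print(new_person)
```

With this layout, the person would read one predicted value per cluster instead of 248 values, at the cost of averaging over the activities inside each cluster.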

Conclusion

Across the 248 activities and exercises in the dataset, every silhouette coefficient (Si) was above 0.5, with a mean of 0.63, which suggests reasonably well-separated clusters. The clustering used a new dataset (edkmeans1), defined in the K-Means clustering data preparation step. A regression model was then fitted to explore further, and it identified body weight as a significant factor in calories burned per hour of exercise. We reported the accuracy of these individual components as well as the overall accuracy. The main limitation of this work is the lack of a general dataset for comparison with other methods. In the future, we plan to collect and annotate larger datasets to create a common ground for comparison and analysis. We also intend to add other features to the model, such as height and gender.

References

NHS Choices. (2022). Understanding calories. https://www.nhs.uk/live-well/healthy-weight/managing-your-weight/understanding-calories/

Nipas, M., Acoba, A. G., Mindoro, J. N., Malbog, M. A. F., Susa, J. A. B., & Gulmatico, J. S. (2022). Burned Calories Prediction using Supervised Machine Learning: Regression Algorithm. 2022 Second International Conference on Power, Control and Computing Technologies (ICPC2T). https://doi.org/10.1109/icpc2t53885.2022.9776710

Vinoy, S. P., & Joseph, B. (2022). Calorie Burn Prediction Analysis Using XGBoost Regressor and Linear Regression Algorithms. Zenodo. https://doi.org/10.5281/zenodo.6365018

Group 9 - Alif Leong (S2101190), Ahmad Faisal (S2125019), Lee Shin Ee (S2132380), Tang Jingfa (S2141959)

6/15/2022
