Julio Alvarez-Leija
11/06/2022
This collection of data records each participant's caloric output alongside the number of steps taken. It also records how much sleep participants get and how long they stay in bed (there is no indication whether time in bed counts only the time spent trying to sleep or also time spent awake). No specific objective was given; the goal is to demonstrate skills from the Data Analytics Course on Coursera. The data comes from a Kaggle dataset of FitBit watch users who agreed to share their personal data. It contains minute-level, hourly, and daily information. The minute-level data is left out of this analysis because of its sheer volume; the focus is on the daily data and, where useful, the hourly data.
The dataset was created to study behavioral patterns. It was generated by respondents to a survey distributed via Amazon Mechanical Turk from 03.12.2016-05.12.2016. 30 FitBit users consented to submit their personal data. Data points include daily steps, total distance, the time spent at each walking intensity, the amount of sleep, and the time in bed for each participant on each day.
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
## No renderer backend detected. gganimate will default to writing frames to separate files
## Consider installing:
## - the `gifski` package for gif output
## - the `av` package for video output
## and restarting the R session
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
setwd("/cloud/project/Capstone")
hourly_calories = read.csv("hourlyCalories_merged.csv")
hourly_intensities = read.csv("hourlyIntensities_merged.csv")
hourly_steps = read.csv("hourlySteps_merged.csv")
We are not working with the minute-level data: it has too many samples to inspect efficiently, and the hourly data already gives the finer-grained view.
We will also ignore the fat and logid columns in weight_log_info, since they add nothing meaningful to this analysis.
## # A tibble: 940 × 15
## id activity…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0 1.88 0.550 6.06
## 2 1503960366 4/13/2016 10735 6.97 6.97 0 1.57 0.690 4.71
## 3 1503960366 4/14/2016 10460 6.74 6.74 0 2.44 0.400 3.91
## 4 1503960366 4/15/2016 9762 6.28 6.28 0 2.14 1.26 2.83
## 5 1503960366 4/16/2016 12669 8.16 8.16 0 2.71 0.410 5.04
## 6 1503960366 4/17/2016 9705 6.48 6.48 0 3.19 0.780 2.51
## 7 1503960366 4/18/2016 13019 8.59 8.59 0 3.25 0.640 4.71
## 8 1503960366 4/19/2016 15506 9.88 9.88 0 3.53 1.32 5.03
## 9 1503960366 4/20/2016 10544 6.68 6.68 0 1.96 0.480 4.24
## 10 1503960366 4/21/2016 9819 6.34 6.34 0 1.34 0.350 4.65
## # … with 930 more rows, 6 more variables: sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## # abbreviated variable names ¹activity_date, ²total_steps, ³total_distance,
## # ⁴tracker_distance, ⁵logged_activities_distance, ⁶very_active_distance,
## # ⁷moderately_active_distance, ⁸light_active_distance
## # A tibble: 413 × 5
## id sleep_day total_sleep_records total_minutes_…¹ total…²
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # … with 403 more rows, and abbreviated variable names ¹total_minutes_asleep,
## # ²total_time_in_bed
## # A tibble: 940 × 10
## id activity…¹ seden…² light…³ fairl…⁴ very_…⁵ seden…⁶ light…⁷ moder…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 728 328 13 25 0 6.06 0.550
## 2 1503960366 4/13/2016 776 217 19 21 0 4.71 0.690
## 3 1503960366 4/14/2016 1218 181 11 30 0 3.91 0.400
## 4 1503960366 4/15/2016 726 209 34 29 0 2.83 1.26
## 5 1503960366 4/16/2016 773 221 10 36 0 5.04 0.410
## 6 1503960366 4/17/2016 539 164 20 38 0 2.51 0.780
## 7 1503960366 4/18/2016 1149 233 16 42 0 4.71 0.640
## 8 1503960366 4/19/2016 775 264 31 50 0 5.03 1.32
## 9 1503960366 4/20/2016 818 205 12 28 0 4.24 0.480
## 10 1503960366 4/21/2016 838 211 8 19 0 4.65 0.350
## # … with 930 more rows, 1 more variable: very_active_distance <dbl>, and
## # abbreviated variable names ¹activity_day, ²sedentary_minutes,
## # ³lightly_active_minutes, ⁴fairly_active_minutes, ⁵very_active_minutes,
## # ⁶sedentary_active_distance, ⁷light_active_distance,
## # ⁸moderately_active_distance
## # A tibble: 940 × 3
## id activity_day step_total
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
## 7 1503960366 4/18/2016 13019
## 8 1503960366 4/19/2016 15506
## 9 1503960366 4/20/2016 10544
## 10 1503960366 4/21/2016 9819
## # … with 930 more rows
## [1] 0
## [1] 0
## [1] 0
## [1] 3
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## # A tibble: 410 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep Total…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # … with 400 more rows, and abbreviated variable name ¹TotalTimeInBed
## [1] 0
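The code behind the counts above is not shown. They look like per-table duplicate checks, with the 410-row sleep table being the result of dropping three repeated rows, but that reading is an assumption. A minimal sketch of such a check in base R, using a made-up table (`toy_sleep` and its values are hypothetical):

```r
# Toy stand-in for the sleep table; the values are made up for illustration.
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  SleepDay = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016", "4/14/2016"),
  TotalMinutesAsleep = c(327, 384, 412, 412, 340)
)

# Count rows that are exact copies of an earlier row.
n_dupes <- sum(duplicated(toy_sleep))

# Keep only the first occurrence of each row.
toy_sleep_clean <- toy_sleep[!duplicated(toy_sleep), ]

# Count missing values as a second sanity check.
n_missing <- sum(is.na(toy_sleep_clean))
```

With tidyverse loaded, `dplyr::distinct()` would do the same de-duplication in a pipeline.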
##str(daily_activity)
##str(daily_calories)
##str(daily_intensities)
##str(daily_sleep)
str(daily_steps)
## spc_tbl_ [940 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ StepTotal : num [1:940] 13162 10735 10460 9762 12669 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDay = col_character(),
## .. StepTotal = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Seeing this makes it easier to see that we need to convert sleep day and activity day into actual dates. We should also change the column names to lower case so they are easier to type. There are also few to no values in daily_weight.Fat, so it is best to leave that column out.
## 'data.frame': 22099 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour: chr "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ StepTotal : int 373 160 151 0 0 0 0 0 250 1864 ...
We can also combine the daily tables into one data frame and the hourly tables into another, since each individual table contains only a small number of variables.
daily_activity = rename_with(daily_activity, tolower)
daily_calories = rename_with(daily_calories, tolower)
daily_intensities = rename_with(daily_intensities, tolower)
daily_sleep = rename_with(daily_sleep, tolower)
daily_steps = rename_with(daily_steps, tolower)
hourly_calories = rename_with(hourly_calories, tolower)
hourly_intensities = rename_with(hourly_intensities, tolower)
hourly_steps = rename_with(hourly_steps, tolower)
## # A tibble: 6 × 15
## id activ…¹ total…² total…³ track…⁴ logge…⁵ verya…⁶ moder…⁷ light…⁸ seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: veryactiveminutes <dbl>, fairlyactiveminutes <dbl>,
## # lightlyactiveminutes <dbl>, sedentaryminutes <dbl>, calories <dbl>, and
## # abbreviated variable names ¹activitydate, ²totalsteps, ³totaldistance,
## # ⁴trackerdistance, ⁵loggedactivitiesdistance, ⁶veryactivedistance,
## # ⁷moderatelyactivedistance, ⁸lightactivedistance, ⁹sedentaryactivedistance
daily_activity$date = as.Date(daily_activity$activitydate, "%m/%d/%Y")
daily_calories$date = as.Date(daily_calories$activityday, "%m/%d/%Y")
daily_steps$date = as.Date(daily_steps$activityday, "%m/%d/%Y")
daily_intensities$date = as.Date(daily_intensities$activityday, "%m/%d/%Y")
## # A tibble: 413 × 5
## id sleepday totalsleeprecords totalminutesasleep totaltimeinbed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 "4/12/2016 " 1 327 346
## 2 1503960366 "4/13/2016 " 2 384 407
## 3 1503960366 "4/15/2016 " 1 412 442
## 4 1503960366 "4/16/2016 " 2 340 367
## 5 1503960366 "4/17/2016 " 1 700 712
## 6 1503960366 "4/19/2016 " 1 304 320
## 7 1503960366 "4/20/2016 " 1 360 377
## 8 1503960366 "4/21/2016 " 1 325 364
## 9 1503960366 "4/23/2016 " 1 361 384
## 10 1503960366 "4/24/2016 " 1 430 449
## # … with 403 more rows
The reason I want to merge the data sets is to allow more statistical analysis without having to predict (impute) missing values. I prefer to work with rows where all the data was actually recorded rather than guess what the missing values are.
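The code that builds daily_log is not shown in this report. A minimal sketch of one way to do it, assuming an inner join on id and date and assuming percent_sleep means minutes asleep divided by time in bed (both are my assumptions; the toy tables and their values are hypothetical):

```r
# Toy activity table: one row per participant per day (values made up).
toy_activity <- data.frame(
  id = c(1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  totalsteps = c(13162, 10735, 4000, 8200),
  calories = c(1985, 1797, 1500, 1720)
)

# Toy sleep table: participant 2 has no sleep record for 2016-04-13.
toy_sleep <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalminutesasleep = c(327, 384, 412),
  totaltimeinbed = c(346, 407, 442)
)

# Inner join: only days present in BOTH tables survive, so nothing is imputed.
toy_log <- merge(toy_activity, toy_sleep, by = c("id", "date"))

# Assumed definition of percent_sleep: share of time in bed spent asleep.
toy_log$percent_sleep <- toy_log$totalminutesasleep / toy_log$totaltimeinbed
```

An inner join keeps the analysis to complete rows, at the cost of dropping every day without a sleep record.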
daily_id_log = daily_log %>%
group_by(id) %>%
summarise(avg_steps = mean(totalsteps), avg_cal = mean(calories), avg_sleep = mean(totalminutesasleep), avg_sleep_hour = (avg_sleep/60), avg_sleep_percent = mean(percent_sleep))
I would like to treat id as a factor so that R knows each id is a different person; in other words, id is a categorical variable, not a number. Grouping by it shows the average steps, calories, and sleep for each person.
daily_id_log = daily_id_log %>%
mutate(active_users = case_when(
# Neighbouring bins share their boundary so that no average falls between two levels.
avg_steps < 5000 ~ " Sedentary Inactive",
avg_steps >= 5000 & avg_steps < 9000 ~ "Light Active",
avg_steps >= 9000 & avg_steps < 12000 ~ "Moderately Active",
avg_steps >= 12000 ~ "Very Active"))
daily_id_log = daily_id_log %>%
mutate(sleep_rating = case_when(
avg_sleep_hour < 7 ~ "Insufficient Sleep",
avg_sleep_hour >= 7 & avg_sleep_hour <= 9 ~ "Recommended Amount",
avg_sleep_hour > 9 ~ "Too Much Sleep"))
steps_perc <- daily_id_log %>%
group_by(active_users) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(active_users) %>%
summarise(total_percent = total / totals) %>%
mutate(labels = scales::percent(total_percent))
sleep_perc <- daily_id_log %>%
group_by(sleep_rating) %>%
summarise(total = n()) %>%
mutate(totals = sum(total)) %>%
group_by(sleep_rating) %>%
summarise(total_percent = total / totals) %>%
mutate(labels = scales::percent(total_percent))
daily_log = daily_log %>%
mutate(activity_level = case_when(
# Neighbouring bins share their boundary so that no step count falls between two levels.
totalsteps < 5000 ~ "Sedentary Active",
totalsteps >= 5000 & totalsteps < 9000 ~ "Light Active",
totalsteps >= 9000 & totalsteps < 12000 ~ "Moderately Active",
totalsteps >= 12000 ~ "Very Active"))
## Id ActivityHour Calories
## 1 1503960366 4/12/2016 12:00:00 AM 81
## 2 1503960366 4/12/2016 1:00:00 AM 61
## 3 1503960366 4/12/2016 2:00:00 AM 59
## 4 1503960366 4/12/2016 3:00:00 AM 47
## 5 1503960366 4/12/2016 4:00:00 AM 48
## 6 1503960366 4/12/2016 5:00:00 AM 48
hourly_calories$date = as.Date(hourly_calories$ActivityHour, "%m/%d/%Y")
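# The timestamps use a 12-hour clock with AM/PM, so parse with %I and %p; %H would ignore the AM/PM marker.
hourly_calories$datetime = as.POSIXct(hourly_calories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())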
hourly_calories$time_component <- format(hourly_calories$datetime,'%H:%M:%S')
hourly_calories$day = weekdays(as.Date(hourly_calories$date))
hourly_calories$month = month(as.Date(hourly_calories$date))
## head(hourly_calories)
steps_perc %>%
ggplot(aes(x="",y=total_percent, fill=active_users))+
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0) +
geom_text(aes(label = labels),
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette="Blues")+
theme_minimal()+
theme(axis.text.x=element_blank())+
scale_fill_manual(values=c("#a4e9d5",
"#6e0d25",
"#c1aba6",
"#22577a")) +
labs(title = "User's Activity Level Based on Steps", caption = 'Source: FitBit Fitness Tracker Data') +
guides(fill = guide_legend(title = "Active User"))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## I was trying to get the code below to work, but had no success.
## geom_text(aes(y = total_percent/3 + c(0, cumsum(total_percent)[-length(total_percent)]),
## label = percent(total_percent/100)), size=5)
The participants' activity level is encouraging: over 79% of them fall into one of the active categories rather than the sedentary one.
sleep_perc %>%
ggplot(aes(x="",y=total_percent, fill=sleep_rating))+
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0) +
geom_text(aes(label = labels),
position = position_stack(vjust = 0.5)) +
scale_fill_brewer()+
theme_minimal()+
theme(axis.text.x=element_blank())+
scale_fill_manual(values=c(
"#a4e9d5",
"#6e0d25",
"#c1aba6"))+
labs(title = "User's Sleep Based on Total Amount of Sleeps in Minutes", caption = 'Source: FitBit Fitness Tracker Data')+
guides(fill = guide_legend(title = "Sleep Rating"))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
This pie chart surprised me more than the other one. I expected people to be active, since wearing a FitBit suggests wanting a healthier lifestyle. What is surprising is how many people are not getting a sufficient amount of sleep. Another thing I find surprising is that, according to (https://www.hopkinsmedicine.org/health/wellness-and-prevention/oversleeping-bad-for-your-health#:~:text=How%20Much%20Sleep%20Is%20Too,an%20underlying%20problem%2C%20Polotsky%20says.), oversleeping can even mean 8-9 hours of sleep.
We can now look at daily steps against calories burned with a scatter plot.
daily_log %>%
group_by(totalsteps, calories) %>%
ggplot(aes(x = totalsteps, y = calories, color = calories)) +
geom_point() +
scale_color_gradient(low = "blue",high = "red") +
geom_smooth(color = "beige") +
theme(legend.position = c(.8, .3),
legend.spacing.y = unit(2, "mm"),
panel.border = element_rect(colour = "black", fill=NA),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")) +
labs(title = 'Calories vs. Total Steps',
y = 'Calories',
x = 'Total Steps',
caption = 'Source: FitBit Fitness Tracker Data')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
##
## Pearson's product-moment correlation
##
## data: daily_log$calories and daily_log$totalsteps
## t = 9.1666, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3285635 0.4890481
## sample estimates:
## cor
## 0.4119959
There is a moderate positive correlation (r = 0.41) between these two variables. The correlation may be this modest because of the wide scatter of points in the middle range of steps. The p-value (< 2.2e-16) is below the significance level alpha = 0.05, so the correlation coefficient is statistically significant.
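The call that produced the Pearson output is not shown; `cor.test()` from base R prints output in this format. The sketch below runs it on simulated data (the variables and coefficients are made up, not the FitBit values) and recomputes the t statistic by hand, to show where the t and df values in the output come from:

```r
set.seed(1)

# Simulated stand-ins for steps and calories; these are NOT the FitBit data.
n <- 200
steps <- runif(n, 0, 20000)
calories <- 1700 + 0.08 * steps + rnorm(n, sd = 700)

# Pearson correlation test, as reported in the output above.
ct <- cor.test(calories, steps, method = "pearson")
r <- unname(ct$estimate)

# The t statistic is a function of r and the sample size, with n - 2 degrees of freedom.
t_by_hand <- r * sqrt((n - 2) / (1 - r^2))
```

For a regression with a single predictor, r squared equals the model's R^2; with the reported r of 0.412 this gives about 0.17.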
##
## Call:
## lm(formula = daily_log$calories ~ daily_log$totalsteps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1496.06 -571.36 -20.66 556.96 1875.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.752e+03 7.833e+01 22.363 <2e-16 ***
## daily_log$totalsteps 7.561e-02 8.248e-03 9.167 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 696 on 411 degrees of freedom
## Multiple R-squared: 0.1697, Adjusted R-squared: 0.1677
## F-statistic: 84.03 on 1 and 411 DF, p-value: < 2.2e-16
This shows that the linear regression above is statistically significant (p < 2.2e-16). However, the R^2 (0.1697) and adjusted R^2 (0.1677) are very low: steps alone explain about 17% of the variation in calories. Adding more variables can raise the R^2, but adding too many risks overfitting and makes the model harder to interpret, so each extra variable should earn its place.
##
## studentized Breusch-Pagan test
##
## data: reg_daily_steps
## BP = 34.703, df = 1, p-value = 3.84e-09
These R^2 and adjusted R^2 values tell us that more variables are needed in the model. Adjusted R^2 is the better guide when comparing models, because it takes into account how many independent variables are used, whereas plain R^2 never decreases when a variable is added.
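To make the difference between the two measures concrete, here is a small sketch on simulated data (all variable names are hypothetical). It computes adjusted R^2 from the usual formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), and adds a predictor that is pure noise:

```r
set.seed(2)

# Simulated data: y depends on x1 only; noise_var is unrelated by construction.
n <- 150
x1 <- rnorm(n)
noise_var <- rnorm(n)
y <- 2 + 1.5 * x1 + rnorm(n)

fit_small <- lm(y ~ x1)
fit_big <- lm(y ~ x1 + noise_var)

# Adjusted R^2 penalises each extra predictor (p = number of predictors).
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n_obs <- length(residuals(fit))
  p <- length(coef(fit)) - 1
  1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
}

r2_small <- summary(fit_small)$r.squared
r2_big <- summary(fit_big)$r.squared
```

Plugging the reported values into the formula (R^2 = 0.1697, n = 413, p = 1) gives about 0.1677, which matches the summary output.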
In the Residuals vs Fitted graph, we can see some heteroscedasticity: the spread of the residuals is not constant across the fitted values. This is not a surprise, since steps alone leave much of the variation in calories unexplained. It indicates that we need more variables to predict the calories burned, for example the distance and minutes recorded at each intensity level.
The Q-Q plot looks reasonable overall, although a few points in the tails stand out as possible outliers for some participants. Deleting them might improve the prediction, but none of them is extreme, so within the scope of this capstone I keep every data point.
The Scale-Location graph alone does not make it clear whether the data is heteroscedastic. Thankfully, we can run the Breusch-Pagan test (from the lmtest library). Its p-value (3.84e-09) is far below 0.05, so we reject the null hypothesis of constant variance: there is significant evidence of heteroscedasticity in this model.
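The idea behind the test can be sketched by hand in base R. The studentized Breusch-Pagan statistic is n times the R^2 of an auxiliary regression of the squared residuals on the predictors, compared against a chi-squared distribution. To my understanding this is what `bptest()` in lmtest computes by default, but treat that as an assumption; the data here is simulated so that the spread grows with x:

```r
set.seed(3)

# Simulated data whose noise grows with x, i.e. heteroscedastic by construction.
n <- 300
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression: do the squared residuals depend on the predictor?
aux <- lm(residuals(fit)^2 ~ x)

# Studentized BP statistic: n * R^2 of the auxiliary regression.
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
```

A small p-value means the residual variance changes with the predictors, so constant variance is rejected.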
reg_daily_steps_2 = lm(daily_log$calories ~ daily_log$totalsteps + daily_log$totaltimeinbed)
summary(reg_daily_steps_2)
##
## Call:
## lm(formula = daily_log$calories ~ daily_log$totalsteps + daily_log$totaltimeinbed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1568.30 -553.47 -46.58 540.94 1932.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1952.57561 157.71840 12.380 <2e-16 ***
## daily_log$totalsteps 0.07360 0.00835 8.814 <2e-16 ***
## daily_log$totaltimeinbed -0.40041 0.27309 -1.466 0.143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 695 on 410 degrees of freedom
## Multiple R-squared: 0.1741, Adjusted R-squared: 0.17
## F-statistic: 43.21 on 2 and 410 DF, p-value: < 2.2e-16
##
## studentized Breusch-Pagan test
##
## data: reg_daily_steps_2
## BP = 37.45, df = 2, p-value = 7.375e-09
Now we can bring the activity level (the intensity of the activity) into the analysis.
ggplot(data=daily_log,aes(x=totalsteps, y=calories, color = activity_level)) +
geom_point() +
geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
The moderately active trend line seems to behave very differently from the other groups. I find this extremely interesting, since I would expect a more intense workout to produce a higher caloric output. The very active group also has a very wide confidence band (its huge cone), which reflects more uncertainty in its fitted line.
reg_daily_steps_3 = lm(daily_log$calories ~ daily_log$veryactivedistance + daily_log$moderatelyactivedistance + daily_log$lightactivedistance + daily_log$sedentaryactivedistance)
summary(reg_daily_steps_3)
##
## Call:
## lm(formula = daily_log$calories ~ daily_log$veryactivedistance +
## daily_log$moderatelyactivedistance + daily_log$lightactivedistance +
## daily_log$sedentaryactivedistance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1372.9 -463.9 -102.7 466.5 2266.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1557.50 79.03 19.707 <2e-16 ***
## daily_log$veryactivedistance 180.99 15.98 11.329 <2e-16 ***
## daily_log$moderatelyactivedistance -66.00 32.25 -2.046 0.0414 *
## daily_log$lightactivedistance 164.56 18.11 9.088 <2e-16 ***
## daily_log$sedentaryactivedistance 627.86 3584.54 0.175 0.8610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 627.8 on 408 degrees of freedom
## Multiple R-squared: 0.3294, Adjusted R-squared: 0.3228
## F-statistic: 50.1 on 4 and 408 DF, p-value: < 2.2e-16
##
## studentized Breusch-Pagan test
##
## data: reg_daily_steps_3
## BP = 35.073, df = 4, p-value = 4.487e-07
We can see a clear increase in R^2 (0.3294) and adjusted R^2 (0.3228) compared with the previous models (about 0.17). However, the fit is still not great. We can add even more variables, such as the intensity minutes, to see whether they contribute more.
From what we can see, the model does not have outliers worth worrying about: no point falls beyond the Cook's distance contours (Cook's distance measures how much influence a point, or set of points, has on the fitted model).
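Cook's distance can also be inspected numerically rather than only on the plot. A minimal sketch on simulated data with one deliberately planted influential point; the 4/n cutoff is a common rule of thumb and my choice here, not something taken from this analysis:

```r
set.seed(4)

# Simulated data following y = 1 + 2x plus noise.
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Plant one point far from the rest in both x and y.
x[n] <- 8
y[n] <- -20

fit <- lm(y ~ x)

# Cook's distance for every observation, and the ones above the rule-of-thumb cutoff.
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)
```

On the real model, `cooks.distance(reg_daily_steps_3)` would give the values behind the Residuals vs Leverage plot.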
The Scale-Location plot helps us check the assumption of equal variance (also called homoscedasticity). The values do seem to be close together in the 2000-3000 range, which looks close to equal variance there. Note, however, that the Breusch-Pagan test above (p-value = 4.487e-07) still rejects constant variance for this model, so the plot should be read with caution.
This model is worth keeping, as it helps us understand which intensity has the greatest impact on calories burned while working out.
Now, I will try adding other variables: veryactiveminutes, fairlyactiveminutes, and sedentaryminutes.
reg_daily_steps_4 = lm(daily_log$calories ~ daily_log$veryactivedistance + daily_log$moderatelyactivedistance + daily_log$lightactivedistance + daily_log$sedentaryactivedistance + daily_log$veryactiveminutes + daily_log$fairlyactiveminutes + daily_log$sedentaryminutes)
summary(reg_daily_steps_4)
##
## Call:
## lm(formula = daily_log$calories ~ daily_log$veryactivedistance +
## daily_log$moderatelyactivedistance + daily_log$lightactivedistance +
## daily_log$sedentaryactivedistance + daily_log$veryactiveminutes +
## daily_log$fairlyactiveminutes + daily_log$sedentaryminutes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1017.40 -346.18 -20.55 320.20 2366.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.0157 133.0297 5.157 3.94e-07 ***
## daily_log$veryactivedistance -102.4485 24.2561 -4.224 2.97e-05 ***
## daily_log$moderatelyactivedistance -262.4289 86.1158 -3.047 0.00246 **
## daily_log$lightactivedistance 202.0325 14.5317 13.903 < 2e-16 ***
## daily_log$sedentaryactivedistance 2792.5046 2801.5681 0.997 0.31947
## daily_log$veryactiveminutes 17.7613 1.4319 12.404 < 2e-16 ***
## daily_log$fairlyactiveminutes 9.4607 3.9173 2.415 0.01617 *
## daily_log$sedentaryminutes 0.9369 0.1482 6.323 6.80e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 486.8 on 405 degrees of freedom
## Multiple R-squared: 0.5998, Adjusted R-squared: 0.5929
## F-statistic: 86.71 on 7 and 405 DF, p-value: < 2.2e-16
##
## studentized Breusch-Pagan test
##
## data: reg_daily_steps_4
## BP = 31.497, df = 7, p-value = 5.034e-05
The R^2 has increased considerably (to 0.5998), and the overall model is significant (p-value below alpha = 0.05), so this model is worth keeping. What may become an issue is how interpretable the model is, since a model with fewer variables is easier to interpret. The next model drops sedentary active distance, which is not significant (p = 0.319), along with sedentary minutes.
reg_daily_steps_5 = lm(daily_log$calories ~ daily_log$veryactivedistance + daily_log$moderatelyactivedistance + daily_log$lightactivedistance + daily_log$veryactiveminutes + daily_log$fairlyactiveminutes)
summary(reg_daily_steps_5)
##
## Call:
## lm(formula = daily_log$calories ~ daily_log$veryactivedistance +
## daily_log$moderatelyactivedistance + daily_log$lightactivedistance +
## daily_log$veryactiveminutes + daily_log$fairlyactiveminutes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1176.5 -364.3 -70.6 375.2 2278.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1431.641 65.415 21.886 < 2e-16 ***
## daily_log$veryactivedistance -95.262 25.375 -3.754 0.000199 ***
## daily_log$moderatelyactivedistance -246.836 89.528 -2.757 0.006095 **
## daily_log$lightactivedistance 182.447 14.881 12.261 < 2e-16 ***
## daily_log$veryactiveminutes 17.371 1.498 11.595 < 2e-16 ***
## daily_log$fairlyactiveminutes 8.711 4.078 2.136 0.033261 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 509.8 on 407 degrees of freedom
## Multiple R-squared: 0.5588, Adjusted R-squared: 0.5534
## F-statistic: 103.1 on 5 and 407 DF, p-value: < 2.2e-16
##
## studentized Breusch-Pagan test
##
## data: reg_daily_steps_5
## BP = 18.012, df = 5, p-value = 0.002932
The fit is only slightly worse (adjusted R^2 of 0.5534 versus 0.5929) with two fewer variables. This model is simple enough to serve as a starting point for machine learning techniques. Given the variables available to me, there is little left to test besides sleep and calories. I would have liked more variables, such as age group, sex, diet plan, and other lifestyle variables.
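One way to check whether dropping variables costs real explanatory power is an F-test between the nested models. This is not something the analysis above ran; it is a sketch on simulated data with hypothetical variable names, where x3 has no effect by construction:

```r
set.seed(5)

# Simulated data: y depends on x1 and x2; x3 has no effect by construction.
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

fit_full <- lm(y ~ x1 + x2 + x3)
fit_reduced <- lm(y ~ x1 + x2)

# F-test of the reduced model against the full one.
cmp <- anova(fit_reduced, fit_full)
p_drop <- cmp[["Pr(>F)"]][2]

# Adjusted R^2 of both models, for a side-by-side comparison.
adj_full <- summary(fit_full)$adj.r.squared
adj_reduced <- summary(fit_reduced)$adj.r.squared
```

A large `p_drop` suggests the dropped variables were not pulling their weight. On the real data the comparison would be `anova(reg_daily_steps_5, reg_daily_steps_4)`. Writing the models as `lm(calories ~ totalsteps, data = daily_log)` instead of repeating `daily_log$` in the formula would also keep the output easier to read.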
I want to see whether there is any relationship between calories burned and sleep.
daily_log %>%
group_by(totalminutesasleep, calories) %>%
ggplot(aes(x = totalminutesasleep, y = calories, color = calories)) +
geom_point() +
scale_color_gradient(low = "blue",high = "red") +
geom_smooth(color = "green") +
theme(legend.position = c(.8, .3),
legend.spacing.y = unit(2, "mm"),
panel.border = element_rect(colour = "black", fill=NA),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")) +
labs(title = 'Calories vs. Total Minutes Asleep',
y = 'Calories',
x = 'Total Minutes Asleep',
caption = 'Source: FitBit Fitness Tracker Data')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
##
## Pearson's product-moment correlation
##
## data: daily_log$calories and daily_log$totalminutesasleep
## t = -0.57854, df = 411, p-value = 0.5632
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12467707 0.06815644
## sample estimates:
## cor
## -0.02852571
The correlation is -0.03, which indicates essentially no linear relationship between calories burned and minutes of sleep. Perhaps the intensity of the activity matters more? The p-value (0.5632) is well above 0.05, so we cannot reject the null hypothesis that the true correlation is zero.
We will now see what happens when both totalminutesasleep and totaltimeinbed are added to the linear regression.
reg_daily_sleep = lm(daily_log$calories ~ daily_log$totalminutesasleep + daily_log$totaltimeinbed)
summary(reg_daily_sleep)
##
## Call:
## lm(formula = daily_log$calories ~ daily_log$totalminutesasleep +
## daily_log$totaltimeinbed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2258.1 -496.4 -169.6 457.6 2532.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2661.6674 136.2240 19.539 < 2e-16 ***
## daily_log$totalminutesasleep 4.5505 0.8314 5.473 7.71e-08 ***
## daily_log$totaltimeinbed -4.7376 0.7741 -6.120 2.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 731.7 on 410 degrees of freedom
## Multiple R-squared: 0.08445, Adjusted R-squared: 0.07999
## F-statistic: 18.91 on 2 and 410 DF, p-value: 1.395e-08
##
## studentized Breusch-Pagan test
##
## data: reg_daily_sleep
## BP = 5.4609, df = 2, p-value = 0.06519
Since the p-value (0.06519) is greater than 0.05, we cannot reject the null hypothesis that the residuals of the linear regression model have constant variance (homoscedasticity). Note that this test is not definitive: the p-value is close to 0.05, and some degree of heteroscedasticity may go undetected by the test.
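For readers who want to see what is behind the BP number, here is a rough sketch in base R. It assumes the usual studentized (Koenker) form of the statistic, which is n times the R^2 from regressing the squared residuals on the predictors; as far as I know this is what the test above reports by default, but treat that as an assumption. The x1, x2 and y variables are made up.

```r
# p-value rebuilt from the reported output: BP = 5.4609 on 2 degrees of freedom
p_doc <- pchisq(5.4609, df = 2, lower.tail = FALSE)

# Made-up data with constant error variance (NOT the FitBit data)
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(resid(fit)^2 ~ x1 + x2)

# Studentized statistic: n * R^2 of the auxiliary regression,
# compared to a chi-squared distribution with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_df   <- 2
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
```

A large bp_stat (small bp_p) would mean the size of the residuals depends on the predictors, which is exactly what heteroscedasticity is.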
## [1] 0.9304575
We can see that the correlation between totalminutesasleep and totaltimeinbed (0.93) is too high. This makes sense, since most participants seem to fall asleep soon after they go to bed, while a small number of them stay up a little longer once in bed.
sleep_vs_bed = ggplot(data =daily_log, aes(totalminutesasleep,totaltimeinbed))
sleep_vs_bed + geom_point(aes(colour = factor(id))) +
geom_smooth(aes(totalminutesasleep,totaltimeinbed)) +
labs(title = "Total minutes asleep vs Total time in bed", fill = "IDs", caption = 'Source: FitBit Fitness Tracker Data')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
sleep_box = ggplot(daily_log, aes(totalminutesasleep,totaltimeinbed, fill = factor(id))) +
geom_boxplot() +
labs(title = "Box Plot of Participants' Sleep in Minutes", fill = "IDs", caption = 'Source: FitBit Fitness Tracker Data')
sleep_box
## ggplotly(sleep_box) This works, but I cannot find a way to limit the bounds to get nicer-looking boxplots that can be interacted with
While the R^2 of this model is terrible (about 0.084), its p-value allows us to reject the null hypothesis. However, the two predictors are highly correlated with each other. If this correlation were around 0.5 to 0.75, we could of course include both of them in the same model, but given how high it is, it would be best not to include them together.
According to this, there is a tiny negative correlation between calories and the amount of sleep one gets, given the sample population. It is so small that we can simply say that there is no correlation.
While the regression is statistically significant, we cannot continue down this path, since the R^2 is small and the two predictors are highly correlated (we can see this with both the ggplot graph and the cor function). This holds even when we add both the time the person spent in bed and the time they were asleep. While we could add more variables, as mentioned before, we must keep the model flexible and robust.
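One common way to put a number on "too correlated" is the variance inflation factor (VIF), which is not used in the analysis above but follows directly from the reported correlation. With only two predictors, the VIF is 1 / (1 - r^2). The sketch below uses the reported 0.9304575 and then repeats the idea on made-up data; asleep and inbed are hypothetical stand-ins.

```r
# VIF from the reported correlation between the two sleep variables
r_doc   <- 0.9304575
vif_doc <- 1 / (1 - r_doc^2)

# Made-up data where time in bed is time asleep plus some awake time
# (NOT the FitBit data)
set.seed(7)
asleep <- rnorm(100, mean = 420, sd = 100)
inbed  <- asleep + rexp(100, rate = 1 / 40)

# With one other predictor, the R^2 of this regression equals the squared correlation
r2_toy  <- summary(lm(inbed ~ asleep))$r.squared
vif_toy <- 1 / (1 - r2_toy)
```

A VIF of roughly 7 means the variance of each coefficient is inflated about sevenfold compared to uncorrelated predictors. Where to draw the line is a rule of thumb (5 and 10 are both commonly quoted), so this is a guide rather than a hard test.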
daily_log %>%
group_by(totalminutesasleep, totalsteps) %>%
ggplot(aes(x = totalminutesasleep, y = totalsteps, color = calories)) +
geom_point() +
geom_smooth(color = "green") +
theme(legend.position = c(.8, .3),
legend.spacing.y = unit(2, "mm"),
panel.border = element_rect(colour = "black", fill=NA),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")) +
labs(title = 'Total Steps vs. Total Minutes Asleep',
y = 'Total Steps',
x = 'Total Minutes Asleep',
caption = 'Source: FitBit Fitness Tracker Data')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## [1] -0.1721427
We can see that there is a weak negative correlation (about -0.17) between the totalsteps and totalminutesasleep variables. This indicates that there is only a slight linear relationship between them.
It will be best to test more of these variables independently, that is, to see whether they differ statistically from one another.
Now we run an ANOVA test to find out whether the participants’ means are equal, testing calories and sleep independently. We do this to see whether each person is a unique case or whether all participants are relatively similar in terms of level.
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Here we have a MASSIVE difference between the participants in the study. It is amazing how much each participant’s caloric output differs. It is also concerning that each participant has a different number of samples, which makes this study a bit unbalanced. Another question is whether the plot alone is enough evidence against the idea that all the participants’ caloric outputs are equal. The answer is not really, so we check it with the help of an ANOVA (aov_count) in R.
## Df Sum Sq Mean Sq F value Pr(>F)
## daily_log$id 23 175773479 7642325 46.45 <2e-16 ***
## Residuals 389 64008686 164547
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We have a large F-value at 46.45 (the F-value tells us whether the variation among the sample means dominates the variation within the groups), and we have a very small p-value (p < 2e-16). This means that, at our significance level, we can reject the idea that all participants have the same mean calories: there is a significant relationship between calories and the participant.
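As a quick sanity check, the F-value can be rebuilt by hand from the table above: each Mean Sq is Sum Sq divided by Df, and F is the ratio of the two Mean Sq values. The second half of the sketch does the same on a small made-up data frame (toy is hypothetical, not the FitBit data) to show where these numbers live in the aov output.

```r
# Values copied from the calories ANOVA table above
ss_between <- 175773479
df_between <- 23
ss_within  <- 64008686
df_within  <- 389

ms_between <- ss_between / df_between   # variation between participants
ms_within  <- ss_within / df_within     # variation within participants
f_value    <- ms_between / ms_within
p_value    <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Made-up data: 4 participants, 10 days each (NOT the FitBit data)
set.seed(3)
toy <- data.frame(
  id       = factor(rep(1:4, each = 10)),
  calories = rnorm(40, mean = rep(c(1800, 2200, 2600, 3000), each = 10), sd = 300)
)

# summary() of an aov fit returns a list whose first element is the ANOVA table
tab   <- summary(aov(calories ~ id, data = toy))[[1]]
f_toy <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
```

The degrees of freedom also tell us the layout of the data: 23 = 24 participants - 1, and 389 = 413 daily rows - 24 participants.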
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Df Sum Sq Mean Sq F value Pr(>F)
## daily_log$id 23 2204023 95827 10.45 <2e-16 ***
## Residuals 389 3566228 9168
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This model has a lower F-value of 10.45 (this indicates that the variation between participants, relative to the variation within them, is much smaller than for calories above), so we would expect a larger corresponding p-value. However, the reported p-value is again below 2e-16, which tells us that there is a statistically significant difference between the participants’ means.
sleep_fig = plot_ly(data = daily_log, x = ~totaldistance, y = ~totaltimeinbed, color = ~id)
sleep_fig
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
More data visualization in terms of participants.
Here we have a better look at the distribution of each participant’s time in bed. It seems that one participant has fewer than 3 entries for when they sleep. What surprises me the most is how participants X-2279 and X-0313 have such a wide distribution.
t = plot_ly(data = daily_log, x = ~totalminutesasleep, y = ~totaltimeinbed, color = ~factor(id), type = "box")
t = t %>% layout(boxmode = "group", xaxis = list(range = c(220, 600)), yaxis = list(range = c(300, 600)))
t
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
While I do not know why the boxes are so small, we can see that each participant’s distribution is different from the others, as shown below. Based on observation (and not on calculations), it is a safe assumption that each participant’s sleep pattern is different. We can also see that participant X-3714 has a tendency to be awake while in bed rather than sleeping.
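## Note: the per-ID mean below is taken over totaltimeinbed, although it is named and labeled as minutes asleep
daily_log %>% group_by(id) %>% summarise(mean_minutes_asleep=mean(totaltimeinbed)) %>%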
ggplot(aes(x=mean_minutes_asleep, y=reorder(id, (mean_minutes_asleep)))) +
geom_col(fill = "#00abff", width=0.12) +
geom_point(color = "#00abff", size=2.5) +
geom_rect(aes(xmin=420, xmax=540, ymin=-Inf, ymax=Inf), fill="#59C96D", alpha=0.02) +
theme_light() +
theme(text = element_text(size=20)) +
labs(y = "FitBit IDs",
x = "Minutes Asleep",
title = "Average Daily Sleep by ID") +
annotate(geom="text", x=480, y=2, label="Ideal sleep \n duration", size=5)
Here we have a graph that displays each participant’s FitBit ID against their average minutes of sleep.
daily_log %>% group_by(id) %>% summarise(mean_steps=mean(totalsteps)) %>%
ggplot(aes(x=mean_steps, y=reorder(id, (mean_steps)))) +
geom_col(fill = "#00abff", width=0.12) +
geom_point(color = "#00abff", size=2.5) +
geom_rect(aes(xmin=420, xmax=540, ymin=-Inf, ymax=Inf), fill="#59C96D", alpha=0.02) +
theme_light() +
theme(text = element_text(size=20)) +
labs(y = "FitBit IDs",
x = "Steps",
title = "Figure 5: Average Daily Steps by ID") +
annotate(geom="text", x=6000, y=2, label="Ideal Steps \n duration", size=5)
## Found this function thanks to https://stackoverflow.com/questions/34093169/horizontal-vertical-line-in-plotly
hline <- function(y = 0, color = "#6e0d25") {
list(
type = "line",
x0 = 0,
x1 = 1,
xref = "paper",
y0 = y,
y1 = y,
line = list(color = color)
)
}
daily_log$weekday = weekdays(daily_log$date)
daily_log$month = month(daily_log$date)
daily_log$weekday <- ordered(daily_log$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))
weekday_log = daily_log %>%
group_by(weekday) %>%
summarize(avg_daily_steps = mean(totalsteps), avg_sleep_mins = mean(totalminutesasleep), avg_percent_sleep = mean(percent_sleep))
sleep_week_fig = plot_ly(weekday_log, x = ~weekday, y = ~(avg_sleep_mins),
type = 'bar',
name = "Average Amount of Sleep (Minutes)",
text = ~round((avg_sleep_mins), digits = 2))
sleep_week_fig = sleep_week_fig %>% layout(yaxis = list(title = "Average Sleep in Minutes"), barmode = 'group',
xaxis = list(title = ""))
sleep_week_fig = sleep_week_fig %>% layout(shapes = list(hline(420)))
sleep_week_fig
What we can see is that, as a collective, the participants are not getting the recommended amount of sleep at all. This is perhaps due to their schedule and their work-life balance. It would have been great to see what type of jobs they were working, or more information about their age and sex group. The red line is drawn at 420 minutes, or 7 hours, which is the lower end of the ideal 7-8 hours of sleep.
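One thing to watch with the weekday grouping behind this chart: weekdays() returns day names in the language of the R session, so the hard-coded English levels can turn into NA on a machine with a different locale. Below is a small sketch of a locale-independent alternative using the ISO weekday number. The toy_log data frame and its start date are hypothetical.

```r
# Hypothetical two weeks of dates and sleep minutes (NOT the FitBit data)
set.seed(5)
toy_log <- data.frame(
  date               = as.Date("2016-04-12") + 0:13,
  totalminutesasleep = round(rnorm(14, mean = 420, sd = 60))
)

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday),
# which does not depend on the language of the R session
iso_day <- as.integer(format(toy_log$date, "%u"))
toy_log$weekday <- factor(day_names[iso_day], levels = day_names, ordered = TRUE)

# Average sleep per weekday, returned in Monday-to-Sunday order
toy_week <- aggregate(totalminutesasleep ~ weekday, data = toy_log, FUN = mean)

# Share of weekdays whose average falls below the 420-minute line
below_line <- mean(toy_week$totalminutesasleep < 420)
```

The same pattern should drop into the daily_log pipeline by swapping toy_log for daily_log, assuming the date column is already a Date.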
Now we can look at the average number of steps for each day of the week.
step_week_fig = plot_ly(weekday_log, x = ~weekday, y = ~(avg_daily_steps),
type = 'bar',
name = "Average Amount of Steps",
text = ~round((avg_daily_steps), digits = 2))
step_week_fig = step_week_fig %>% layout(yaxis = list(title = "Average Steps"), barmode = 'group',
xaxis = list(title = ""))
step_week_fig = step_week_fig %>% layout(shapes = list(hline(6000)))
step_week_fig
I am surprised that, on average, a lot of people are getting in their steps each day! What surprises me the most is that Sunday has the fewest steps on average. This is probably because many participants treated Sunday as their true rest day and enjoyed staying indoors, while Saturday is often treated as a day to go out and handle personal matters (in this case, walking).
On average, plenty of participants are getting enough sleep to start their day. We can see that 15 of the 24 participants reach the “ideal” 7-8 hours of sleep, while one participant (X-5328) goes above and beyond in their rest! I would love to have more data in order to see why the other 8 participants do not meet the recommended amount of sleep on average.
As shown in the line graphs, we can see that Sunday is normally the day when our participants are catching up on their sleep. I assume that many of them have normal 9-5 full-time jobs and are getting ready for Monday. The one thing I found interesting is that Wednesday ranks second for the most sleep out of the 7-day week.
As shown in the line graphs, we can see that Saturday has the most steps on average, which means that many of these participants are very active on Saturdays! I am amazed that Sunday, on average, has the fewest steps. I assume it is because many participants use the day to prepare for the upcoming week.
totalminutesasleep and totalsteps are negatively correlated with one another (-0.1721427). I was expecting a stronger negative correlation, but it makes sense that the two variables are not highly correlated, given that there are other, more potent variables.
We ran a separate ANOVA test for each of two variables, calories and sleep, with the participant as the grouping factor. Both depend strongly on the participant: the calories burned, as well as the amount of sleep, differ significantly between participants. This means that at least one participant’s mean is significantly different from the others (for both variables, each in its own ANOVA test). We could follow up with a post-hoc test in order to see which participants are significantly different from one another.
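As a sketch of what such a follow-up could look like, here is Tukey's HSD from base R on a made-up data frame. This was not run on the FitBit data, and toy with its three IDs is hypothetical. Tukey's HSD is only one of several post-hoc options.

```r
# Made-up data: participants A and B are similar, C burns far more
# (NOT the FitBit data)
set.seed(11)
toy <- data.frame(
  id       = factor(rep(c("A", "B", "C"), each = 15)),
  calories = c(rnorm(15, mean = 1800, sd = 200),
               rnorm(15, mean = 1850, sd = 200),
               rnorm(15, mean = 2800, sd = 200))
)

fit <- aov(calories ~ id, data = toy)

# Tukey's honest significant differences for every pair of participants
tuk <- TukeyHSD(fit, "id", conf.level = 0.95)

# One row per pair, with the difference in means, its interval and adjusted p-value
pair_table <- tuk$id

# Pairs whose means differ at the 5% level after adjustment
significant_pairs <- rownames(pair_table)[pair_table[, "p adj"] < 0.05]

# With 24 participants there would be this many pairs to compare
n_pairs_24 <- choose(24, 2)
```

With 24 participants the pair table gets long (276 rows), so in practice it would probably be easier to read as a plot or to filter down to the significant pairs as done here.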
When plotting calories against the intensity of the activity level, we can see that, on average, the moderately intense walks are actually hurting the predictions of our linear regression. This is most interesting, since it would be safe to assume that the more intense a workout is, the more calories one will burn. But this is not the case, as shown in the calories vs totalsteps graph that has activity_level highlighted. Another thing to note is how huge the “cone” is for the very intense level. I assume this is because different bodies respond to very intense activities/workouts differently (and not in a good way).
Another thing to note is that a lot of data was probably taken out. Data such as the participant’s age, sex, lifestyle, sleep stages (REM, deep, and light sleep), and diet were missing from this dataset. I can understand that writing down one’s diet is very time consuming, but the other variables should not be. I would’ve appreciated a little more information about each participant’s lifestyle, since it would’ve helped me tell a better story in this statistical analysis. In addition, some fields are missing or depend on participants logging them. This is something that I would’ve enjoyed looking into, but I cannot draw any findings if the majority did not log it.