This document analyzes the Fitbit fitness tracker case-study data in R. It covers loading, cleaning, exploring, and visualizing the daily activity data to understand activity patterns.
In this step, we will load all the necessary libraries for data manipulation and visualization operations.
library(dplyr) # Data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(janitor) # Data cleaning
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) # Date handling
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(hms) # Time management
##
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
##
## hms
library(skimr) # Data summary
library(ggplot2) # Data visualization
library(tidyverse) # Unified ecosystem
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ✔ readr 2.1.5 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ hms::hms() masks lubridate::hms()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Next, we load the daily activity CSV file and inspect its structure and column names to understand the data.
# load data from csv files
daily_activity <- read.csv("D:\\Docs\\Google Data Analytics\\fitbit_fitness_tracker\\dailyActivity_merged.csv")
# Preview the first few rows and structure of the dataset
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
names(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
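In the structure above, `Id` is read as a number and printed as `1.5e+09`, which hides the individual identifiers. One possible alternative, shown here as a minimal sketch, is to read the column as text from the start. The inline CSV is a hypothetical three-row extract standing in for the real file, and the `colClasses` choice is an assumption, not what this analysis does.

```r
# Hypothetical three-row extract standing in for dailyActivity_merged.csv
csv_text <- "Id,ActivityDate,TotalSteps
1503960366,4/12/2016,13162
1503960366,4/13/2016,10735
8877689391,4/12/2016,23186"

# Default read: Id is parsed as a number
as_number <- read.csv(text = csv_text)

# Assumed alternative: force Id to character with a named colClasses entry
as_text <- read.csv(text = csv_text, colClasses = c(Id = "character"))

class(as_number$Id)
class(as_text$Id)
```

Reading the identifier as text keeps every digit visible and makes the later conversion to a factor a pure relabeling step.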
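More Data Exploration: We explore the data with several summary functions (glimpse(), skim_without_charts(), as_tibble(), and summary()) to check column types, missing values, and value ranges.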
# data exploration
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
skim_without_charts(daily_activity)
| | |
|---|---|
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 15 |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 |
as_tibble(daily_activity)
## # A tibble: 940 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <int> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## 7 1503960366 4/18/2016 13019 8.59 8.59
## 8 1503960366 4/19/2016 15506 9.88 9.88
## 9 1503960366 4/20/2016 10544 6.68 6.68
## 10 1503960366 4/21/2016 9819 6.34 6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <int>, FairlyActiveMinutes <int>,
## # LightlyActiveMinutes <int>, SedentaryMinutes <int>, Calories <int>
summary(daily_activity)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Length:940 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Mode :character Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :36019 Max. :28.030
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
View(daily_activity)
Convert Data Types: We convert the ‘Id’ column to a factor, so that users are treated as categories rather than numbers, and ‘ActivityDate’ to Date format, so that dates sort chronologically.
# Convert ID from numeric to factor
daily_activity$Id <- as.factor(daily_activity$Id)
# Convert Date from character to date format
daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate,format="%m/%d/%Y")
# Verify the changes in data structure
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : Factor w/ 33 levels "1503960366","1624580081",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ActivityDate : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
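`as.Date()` does not raise an error when a string fails to match the given format; it returns `NA` for that element. A quick check after conversion can catch such rows. The sketch below uses hypothetical date strings, one of them deliberately in a different layout.

```r
# Hypothetical date strings; the third one does not follow month/day/year
raw_dates <- c("4/12/2016", "5/1/2016", "2016-04-13")

parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Strings that do not match the format come back as NA instead of an error
bad_rows <- which(is.na(parsed))
bad_rows
```

In this analysis the later `is.na()` check returns zero missing values, which suggests every date in the file was parsed.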
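Check for Missing and Duplicate Values: We check for missing values and remove any duplicate rows to ensure the dataset is clean. No values are missing, and the data still has 940 rows after distinct(), so it contained no duplicate rows.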
# Check for missing values
any(is.na(daily_activity))
## [1] FALSE
# Check total missing values
sum(is.na(daily_activity))
## [1] 0
# Remove duplicate rows (distinct() keeps one copy of each row)
daily_activity_V2 <- distinct(daily_activity)
View(daily_activity_V2)
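`distinct()` removes duplicate rows silently, so it does not report how many there were. A minimal base R sketch of counting duplicates before dropping them is shown below; the `toy` data frame is hypothetical and stands in for `daily_activity`.

```r
# Hypothetical data with one fully duplicated row
toy <- data.frame(
  Id = c("A", "A", "B"),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  TotalSteps = c(100L, 100L, 250L)
)

# duplicated() flags every repeat of an earlier row
n_duplicates <- sum(duplicated(toy))

# Keep only the first copy of each row
deduplicated <- toy[!duplicated(toy), ]

n_duplicates
nrow(deduplicated)
```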
Check for Invalid Values: We verify if there are any invalid values (e.g., negative values) which may need correction.
# Check for values less than or equal to -1 in each column
apply(daily_activity_V2, 2, function(x) any(x <= -1))
## Id ActivityDate TotalSteps
## FALSE FALSE FALSE
## TotalDistance TrackerDistance LoggedActivitiesDistance
## FALSE FALSE FALSE
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## FALSE FALSE FALSE
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## FALSE FALSE FALSE
## LightlyActiveMinutes SedentaryMinutes Calories
## FALSE FALSE FALSE
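One caveat about the check above: `apply()` first converts the data frame to a matrix, and because `Id` and `ActivityDate` are not numeric, that matrix holds character strings. The comparison `x <= -1` is then a string comparison whose result depends on collation order, not on numeric value. A possible alternative, sketched here on a hypothetical `toy` data frame, restricts the check to numeric columns.

```r
# Hypothetical data with one negative step count
toy <- data.frame(
  Id = factor(c("A", "B")),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13")),
  TotalSteps = c(13162L, -5L),
  Calories = c(1985L, 1797L)
)

# Keep numeric columns only, so comparisons stay numeric
numeric_cols <- toy[vapply(toy, is.numeric, logical(1))]

# TRUE for each column that contains at least one negative value
has_negative <- vapply(numeric_cols, function(x) any(x < 0, na.rm = TRUE), logical(1))
has_negative
```

The `summary()` output earlier shows a minimum of 0 for every numeric column, so the conclusion that the data has no negative values still holds.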
Basic Statistics: We calculate the mean, maximum, median, and standard deviation of steps, distance, and calories to understand the central tendency and dispersion of the data.
# Average of Steps, Distance & Calorie
daily_activity_V2 %>%
summarise(avg_TotalSteps=mean(TotalSteps),avg_TotalDistance=mean(TotalDistance),
avg_Calories=mean(Calories))
## avg_TotalSteps avg_TotalDistance avg_Calories
## 1 7637.911 5.489702 2303.61
# Maximum of Steps, Distance & Calorie
daily_activity_V2 %>%
summarise(max_TotalSteps=max(TotalSteps),max_TotalDistance=max(TotalDistance),
max_Calories=max(Calories))
## max_TotalSteps max_TotalDistance max_Calories
## 1 36019 28.03 4900
# Median of Steps, Distance & Calorie
daily_activity_V2 %>%
summarise(med_TotalSteps=median(TotalSteps),med_TotalDistance=median(TotalDistance),
med_Calories=median(Calories))
## med_TotalSteps med_TotalDistance med_Calories
## 1 7405.5 5.245 2134
# Standard deviation of Steps, Distance & Calorie
daily_activity_V2 %>%
summarise(sd_TotalSteps= sd(TotalSteps),sd_TotalDistance=sd(TotalDistance),
sd_Calories=sd(Calories),sd_TrackerDistance=sd(TrackerDistance))
## sd_TotalSteps sd_TotalDistance sd_Calories sd_TrackerDistance
## 1 5087.151 3.924606 718.1669 3.907276
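The four summaries above repeat the same pattern once per statistic. As a rough sketch, the same figures can be collected in a single table by applying one function to each column. The values below are the first four rows of the dataset shown earlier, so the results differ from the full-data statistics; the `toy` and `stats` names are illustrative.

```r
# First four rows of the three columns of interest, copied from the preview above
toy <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762),
  TotalDistance = c(8.50, 6.97, 6.74, 6.28),
  Calories = c(1985, 1797, 1776, 1745)
)

# One row per statistic, one column per variable
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), max = max(x), sd = sd(x))
})

round(stats, 2)
```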
User-Specific Analysis: We summarize data by user to understand individual activity patterns.
daily_activity_V2 %>% group_by(Id) %>%
summarise(avg_steps=mean(TotalSteps),avg_distance=mean(TotalDistance),avg_calorie=mean(Calories))
## # A tibble: 33 × 4
## Id avg_steps avg_distance avg_calorie
## <fct> <dbl> <dbl> <dbl>
## 1 1503960366 12117. 7.81 1816.
## 2 1624580081 5744. 3.91 1483.
## 3 1644430081 7283. 5.30 2811.
## 4 1844505072 2580. 1.71 1573.
## 5 1927972279 916. 0.635 2173.
## 6 2022484408 11371. 8.08 2510.
## 7 2026352035 5567. 3.45 1541.
## 8 2320127002 4717. 3.19 1724.
## 9 2347167796 9520. 6.36 2043.
## 10 2873212765 7556. 5.10 1917.
## # ℹ 23 more rows
User Classification: We classify users by how many days they recorded activity during the 31-day period and calculate the percentage of users in each category.
# Classify users based on how many days they used their smart device during a 31-day survey period
# Active User
user_active_days <- daily_activity_V2 %>% group_by(Id) %>%
summarise(active_days=n_distinct(ActivityDate))
# Classify users based on active days
# (conditions are evaluated in order, so each user gets the first matching label)
user_classification <- user_active_days %>%
  mutate(usage_category = case_when(
    active_days >= 21 ~ "High user",
    active_days >= 10 ~ "Moderate user",
    active_days < 10 ~ "Low user"
  ))
# Count users in each category
user_counts <- user_classification %>% group_by(usage_category) %>%
  summarise(user_count = n())
# Calculate percentage
user_counts <- user_counts %>%
  mutate(percentage = user_count / sum(user_count) * 100)
user_counts
## # A tibble: 3 × 3
## usage_category user_count percentage
## <chr> <int> <dbl>
## 1 High user 29 87.9
## 2 Low user 1 3.03
## 3 Moderate user 3 9.09
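The thresholds used for the classification (fewer than 10 days, 10 to 20 days, 21 days or more) can also be written as explicit bins, which makes the boundary cases easy to check. This is a minimal base R sketch with hypothetical day counts, not the method used above.

```r
# Hypothetical active-day counts, including the boundary values 9, 10, 20, and 21
active_days <- c(4, 9, 10, 20, 21, 31)

# Right-closed bins: (-Inf, 9], (9, 20], (20, Inf]
usage_category <- cut(
  active_days,
  breaks = c(-Inf, 9, 20, Inf),
  labels = c("Low user", "Moderate user", "High user")
)

table(usage_category)
```

Because active days are whole numbers, these bins assign the same labels as the ordered conditions in the classification code.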
Trend Analysis: We sort the daily records to find the dates with the highest recorded values for steps, calories, and distance. Each row is one user's total for one day, so a date can appear more than once.
# Highest step counts by date
highest_steps <- daily_activity_V2 %>% select(ActivityDate,TotalSteps) %>% arrange(desc(TotalSteps))
as_tibble(highest_steps)
## # A tibble: 940 × 2
## ActivityDate TotalSteps
## <date> <int>
## 1 2016-05-01 36019
## 2 2016-04-16 29326
## 3 2016-04-30 27745
## 4 2016-04-27 23629
## 5 2016-04-12 23186
## 6 2016-04-24 22988
## 7 2016-05-07 22770
## 8 2016-04-23 22359
## 9 2016-04-16 22244
## 10 2016-05-08 22026
## # ℹ 930 more rows
# Highest number of calories by date
highest_calorie <- daily_activity_V2 %>% select(ActivityDate, Calories) %>% arrange(desc(Calories))
as_tibble(highest_calorie)
## # A tibble: 940 × 2
## ActivityDate Calories
## <date> <int>
## 1 2016-04-21 4900
## 2 2016-04-17 4552
## 3 2016-04-16 4547
## 4 2016-05-01 4546
## 5 2016-04-30 4501
## 6 2016-04-30 4398
## 7 2016-04-24 4392
## 8 2016-04-16 4274
## 9 2016-04-21 4236
## 10 2016-04-14 4163
## # ℹ 930 more rows
# Highest number of distance covered by date
highest_distance <- daily_activity_V2 %>% select(ActivityDate,TotalDistance) %>% arrange(desc(TotalDistance))
as_tibble(highest_distance)
## # A tibble: 940 × 2
## ActivityDate TotalDistance
## <date> <dbl>
## 1 2016-05-01 28.0
## 2 2016-04-30 26.7
## 3 2016-04-16 25.3
## 4 2016-04-27 20.6
## 5 2016-04-12 20.4
## 6 2016-05-11 19.6
## 7 2016-05-06 19.3
## 8 2016-04-14 19.0
## 9 2016-05-09 18.2
## 10 2016-04-20 18.1
## # ℹ 930 more rows
Create a Day-of-Week Column: We create a new column ‘week_name’ holding the day of the week for each date, then convert it to an ordered factor so that days appear in calendar order (Sunday to Saturday) in summaries and plots.
# Create a column for the day of the week
daily_activity_V2$week_name <- weekdays(daily_activity_V2$ActivityDate)
head(daily_activity_V2)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories week_name
## 1 13 328 728 1985 Tuesday
## 2 19 217 776 1797 Wednesday
## 3 11 181 1218 1776 Thursday
## 4 34 209 726 1745 Friday
## 5 10 221 773 1863 Saturday
## 6 20 164 539 1728 Sunday
# Arrange days of week in correct order
daily_activity_V2$week_name <- ordered(daily_activity_V2$week_name,
levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))
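`weekdays()` returns day names in the language of the current locale, so the English levels above only match on an English-language setup. A possible locale-independent variant, sketched below with three dates from the preview, derives the day from the numeric weekday instead. The names `day_levels` and `day_index` are illustrative.

```r
# Three dates from the preview above (a Tuesday, a Saturday, and a Sunday)
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday), independent of locale
day_index <- as.POSIXlt(dates)$wday + 1

week_name <- factor(day_levels[day_index], levels = day_levels, ordered = TRUE)
week_name
```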
Day-of-Week Aggregations: We aggregate the data by day of the week to compare activity across days, then compare weekdays with weekends.
# Total calories burnt by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_calorie=sum(Calories))
## # A tibble: 7 × 2
## week_name total_calorie
## <ord> <int>
## 1 Sunday 273823
## 2 Monday 278905
## 3 Tuesday 358114
## 4 Wednesday 345393
## 5 Thursday 323337
## 6 Friday 293805
## 7 Saturday 292016
# Total steps by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_Steps=sum(TotalSteps))
## # A tibble: 7 × 2
## week_name total_Steps
## <ord> <int>
## 1 Sunday 838921
## 2 Monday 933704
## 3 Tuesday 1235001
## 4 Wednesday 1133906
## 5 Thursday 1088658
## 6 Friday 938477
## 7 Saturday 1010969
# Total distance by day of the week
daily_activity_V2 %>% group_by(week_name) %>%
summarise(total_distance=sum(TotalDistance))
## # A tibble: 7 × 2
## week_name total_distance
## <ord> <dbl>
## 1 Sunday 608.
## 2 Monday 666.
## 3 Tuesday 886.
## 4 Wednesday 823.
## 5 Thursday 781.
## 6 Friday 669.
## 7 Saturday 726.
# Compare weekday and weekend averages
daily_activity_V2 %>%
mutate(day_type=if_else(week_name %in% c("Sunday","Saturday"),"WeekEnd","WeekDay")) %>%
group_by(day_type) %>%
summarise(avg_steps=mean(TotalSteps),avg_distance=mean(TotalDistance),avg_calorie=mean(Calories))
## # A tibble: 2 × 4
## day_type avg_steps avg_distance avg_calorie
## <chr> <dbl> <dbl> <dbl>
## 1 WeekDay 7669. 5.51 2302.
## 2 WeekEnd 7551. 5.45 2310.
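A total per day of the week grows with the number of records for that day as well as with activity. The data has 31 distinct dates starting on Tuesday 4/12/2016; if those dates run consecutively (an assumption, not checked here), Tuesday, Wednesday, and Thursday each occur five times and the other days four times, which alone would raise their totals. The sketch below uses hypothetical numbers to show how a total and a mean can rank days differently.

```r
# Hypothetical records: Tuesday has more rows but lower steps per row
toy <- data.frame(
  week_name = c("Tuesday", "Tuesday", "Tuesday", "Sunday", "Sunday"),
  TotalSteps = c(8000, 9000, 7000, 10000, 9000)
)

total_steps <- tapply(toy$TotalSteps, toy$week_name, sum)
mean_steps <- tapply(toy$TotalSteps, toy$week_name, mean)
n_records <- tapply(toy$TotalSteps, toy$week_name, length)

# Side-by-side view: total, mean, and number of records per day
by_day <- data.frame(
  day = names(total_steps),
  total = as.vector(total_steps),
  mean = as.vector(mean_steps),
  n = as.vector(n_records)
)
by_day
```

Here Tuesday has the larger total only because it has more records; its mean is lower. The weekday versus weekend comparison above already uses means, which avoids this effect.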
Summary, Correlation & Variance: We provide a summary, correlation, and variance analysis to understand relationships and variability within the data set.
# Summary stats for distance types
summary(daily_activity_V2[, c("TrackerDistance", "LoggedActivitiesDistance",
"VeryActiveDistance", "ModeratelyActiveDistance",
"LightActiveDistance", "SedentaryActiveDistance")])
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
# Correlation matrix for distance types
cor(daily_activity_V2[, c("TrackerDistance", "LoggedActivitiesDistance",
"VeryActiveDistance","ModeratelyActiveDistance",
"LightActiveDistance", "SedentaryActiveDistance")])
## TrackerDistance LoggedActivitiesDistance
## TrackerDistance 1.00000000 0.16258530
## LoggedActivitiesDistance 0.16258530 1.00000000
## VeryActiveDistance 0.79433807 0.15085226
## ModeratelyActiveDistance 0.47027739 0.07652693
## LightActiveDistance 0.66136481 0.13830151
## SedentaryActiveDistance 0.07459089 0.15499618
## VeryActiveDistance ModeratelyActiveDistance
## TrackerDistance 0.79433807 0.470277391
## LoggedActivitiesDistance 0.15085226 0.076526932
## VeryActiveDistance 1.00000000 0.192985874
## ModeratelyActiveDistance 0.19298587 1.000000000
## LightActiveDistance 0.15766926 0.237847447
## SedentaryActiveDistance 0.04611675 0.005793403
## LightActiveDistance SedentaryActiveDistance
## TrackerDistance 0.6613648 0.074590885
## LoggedActivitiesDistance 0.1383015 0.154996178
## VeryActiveDistance 0.1576693 0.046116748
## ModeratelyActiveDistance 0.2378474 0.005793403
## LightActiveDistance 1.0000000 0.099503204
## SedentaryActiveDistance 0.0995032 1.000000000
# Correlation between steps, distance and calories
cor(daily_activity_V2[,c("TotalSteps","TotalDistance","Calories")])
## TotalSteps TotalDistance Calories
## TotalSteps 1.0000000 0.9853688 0.5915681
## TotalDistance 0.9853688 1.0000000 0.6449619
## Calories 0.5915681 0.6449619 1.0000000
# Variance for distance types
daily_activity_V2 %>%
summarise(var_TrackerDistance = var(TrackerDistance), var_LoggedActivitiesDistance = var(LoggedActivitiesDistance),
var_VeryActiveDistance = var(VeryActiveDistance), var_ModeratelyActiveDistance = var(ModeratelyActiveDistance),
var_LightActiveDistance = var(LightActiveDistance), var_SedentaryActiveDistance = var(SedentaryActiveDistance))
## var_TrackerDistance var_LoggedActivitiesDistance var_VeryActiveDistance
## 1 15.26681 0.3842717 7.069968
## var_ModeratelyActiveDistance var_LightActiveDistance
## 1 0.7807142 4.164274
## var_SedentaryActiveDistance
## 1 5.396631e-05
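One way to read the correlations above: for a straight-line fit with a single predictor, the squared correlation equals the share of variance explained. With r = 0.59 between TotalSteps and Calories, steps account for roughly 35% of the variation in calories in a linear fit, and TotalDistance (r = 0.64) for roughly 42%. The sketch below checks the identity on the first six rows shown in the preview; it illustrates the relationship and is not a model fitted in this analysis.

```r
# First six rows of TotalSteps and Calories from the preview above
steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

r <- cor(steps, calories)

# R-squared of a one-predictor linear model
fit <- lm(calories ~ steps)
r_squared <- summary(fit)$r.squared

# The two quantities agree up to floating-point tolerance
all.equal(r^2, r_squared)
```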
Top users: We identify the 5 most active and 5 least active users by number of active days.
# Top 5 active users
daily_activity_V2 %>% group_by(Id) %>% summarise(active_days=n_distinct(ActivityDate)) %>%
arrange(desc(active_days)) %>% slice_head(n=5)
## # A tibble: 5 × 2
## Id active_days
## <fct> <int>
## 1 1503960366 31
## 2 1624580081 31
## 3 1844505072 31
## 4 1927972279 31
## 5 2022484408 31
# 5 least active users
daily_activity_V2 %>% group_by(Id) %>% summarise(active_days=n_distinct(ActivityDate)) %>%
arrange(active_days) %>% slice_head(n=5)
## # A tibble: 5 × 2
## Id active_days
## <fct> <int>
## 1 4057192912 4
## 2 2347167796 18
## 3 8253242879 19
## 4 3372868164 20
## 5 6775888955 26
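All five users in the top list have 31 active days. If more than five users share that count, the five shown are determined by row order, not by activity. A small sketch of listing everyone tied for the top value follows; the user labels and counts are hypothetical.

```r
# Hypothetical active-day counts per user
active_days <- c(u1 = 31, u2 = 31, u3 = 31, u4 = 30, u5 = 4)

top_value <- max(active_days)

# Every user tied at the maximum, however many there are
tied_for_top <- names(active_days)[active_days == top_value]
tied_for_top
```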
In this step, we plot the results to reveal patterns in the data and communicate insights.
# Daily steps over time
daily_activity_V2 %>%
group_by(ActivityDate) %>% summarise(totalsteps=sum(TotalSteps)) %>%
ggplot(aes(x=ActivityDate,y=totalsteps))+geom_line(color = "#1f77b4")+
labs(x="Date",y="Total Steps",title="Daily Steps Over Time")+theme_minimal()+
theme(axis.text.x = element_text(angle=45,hjust = 1))
# Calories vs Steps
daily_activity_V2 %>%
ggplot(aes(x=TotalSteps,y=Calories))+geom_jitter(color = "#ff7f0e")+geom_smooth(color = "#2ca02c")+
labs(x= "Total Steps",y="Calories",title = "Comparison of Steps & Calories")+
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Distance Vs calorie
daily_activity_V2 %>%
ggplot(aes(x= TotalDistance,y=Calories))+ geom_jitter(color = "#d62728") +
geom_smooth(color = "#9467bd") +
labs(x="Total Distance Covered",y="Calories",title = "Comparison of Distance & Calorie")+
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Calories burnt by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_calorie=sum(Calories)) %>%
ggplot(aes(x=week_name,y=total_calorie,fill = week_name))+geom_col()+
labs(x="Day of Week",y="Total Calories",title="Total Calories Burnt by Day of Week")+
theme_minimal()+
theme(axis.text.x = element_text(angle = 20))+scale_fill_brewer(palette = "Blues", name = "Day of Week")
# Total steps by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_Steps=sum(TotalSteps)) %>%
ggplot(aes(x=week_name,y=total_Steps,fill = week_name))+geom_col()+
labs(x="Day of Week",y="Total Steps",title = "Total Steps by Day of Week")+
theme_minimal()+scale_fill_brewer(palette = "Oranges", name = "Day of Week")+
theme(axis.text.x = element_text(angle = 20))
# Total distance by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_distance=sum(TotalDistance)) %>%
ggplot(aes(x=week_name,y=total_distance,fill = week_name))+geom_col()+
labs(x="Day of Week",y="Total Distance",title = "Total Distance Covered by Day of Week")+
theme_minimal()+scale_fill_brewer(palette = "Greens", name = "Day of Week")+
theme(axis.text.x = element_text(angle = 20))
# Weekday vs weekend
summary_week_dayEnd <- daily_activity_V2 %>%
mutate(day_type=if_else(week_name %in% c("Sunday","Saturday"),"WeekEnd","WeekDay")) %>%
group_by(day_type) %>%
summarise(avg_steps=mean(TotalSteps),avg_distance=mean(TotalDistance),avg_calorie=mean(Calories))
dayEnd_long <- summary_week_dayEnd %>%
pivot_longer(cols = c(avg_steps,avg_distance,avg_calorie),
names_to = "Metric",values_to ="Average")
dayEnd_long %>% ggplot(aes(x=day_type,y=Average,fill = Metric))+geom_col(position = "dodge2")+
labs(x="Day Type",y="Average",title = "Average Steps, Distance, and Calories by Day Type")+
theme_minimal()+ scale_fill_manual(values = c("avg_steps" = "#1f77b4", "avg_distance" = "#ff7f0e", "avg_calorie" = "#2ca02c"))
# Top 5 active users
daily_activity_V2 %>% group_by(Id) %>% summarise(active_days=n_distinct(ActivityDate)) %>%
arrange(desc(active_days)) %>% slice_head(n=5) %>%
ggplot(aes(x=Id,y=active_days))+geom_col(fill="#d62728")+coord_flip()+
labs(x="ID",y="Active Days",title="Top 5 Active Users")+theme_minimal()
# Least 5 active users
daily_activity_V2 %>% group_by(Id) %>% summarise(active_days=n_distinct(ActivityDate)) %>%
arrange(active_days) %>% slice_head(n=5) %>%
ggplot(aes(x=Id,y=active_days))+geom_col(fill="#d62728")+coord_flip()+
labs(x="ID",y="Active Days",title="5 Least Active Users")+theme_minimal()
# Users based on how many days they used their smart device during a 31-day survey period
user_counts %>% ggplot(aes(x=usage_category,y=percentage,fill = usage_category))+
geom_col()+
labs(x="User Classification",y="Percentage",title = "Percentage of Users by Usage Category")+
theme_minimal()+
geom_text(aes(label = sprintf("%.1f%%", percentage)), vjust = -0.2)+
scale_fill_manual(name = "User Type", values = c("High user" = "#1f77b4", "Moderate user" = "#ff7f0e", "Low user" = "#2ca02c"))
# Activity trend by user
daily_activity_V2 %>% group_by(Id,ActivityDate) %>%
summarise(totalsteps=sum(TotalSteps), .groups = "drop") %>%
ggplot(aes(x=ActivityDate,y=totalsteps,colour=Id))+geom_line()+
labs(x="Date",y="Total Steps",title = "Activity Trends by User")+theme_minimal()
# Comparison between calories & distance, colored by steps
# (points rather than a line: each row is an independent user-day, not a sequence)
daily_activity_V2 %>% ggplot(aes(x=TotalDistance,y=Calories,color=TotalSteps))+
geom_point()+labs(x="Total Distance",y="Calories",title="Calories vs Distance by Steps")+
theme_minimal()+scale_color_gradient(low = "#1f77b4", high = "#ff7f0e")
Finally, we export the cleaned data frame to CSV format for further analysis.
# Export the cleaned data frame to a CSV file in the working directory
# (write.csv() is part of base R, so no extra package is needed)
write.csv(daily_activity_V2,"daily_activity_V2.csv",row.names = FALSE)
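A CSV file stores only text, so the factor and Date types set during cleaning are not preserved on export. The sketch below shows one possible way to restore them when reading the file back. The `toy` data frame (including its second step count) and the temporary file path are hypothetical stand-ins for the exported data.

```r
# Hypothetical cleaned data with a factor Id and a Date column
toy <- data.frame(
  Id = factor(c("1503960366", "1624580081")),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13")),
  TotalSteps = c(13162L, 8163L)
)

# Write to a temporary file, then read back with explicit column classes
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

reloaded <- read.csv(path, colClasses = c(Id = "character", ActivityDate = "Date"))
reloaded$Id <- factor(reloaded$Id)

unlink(path)  # remove the temporary file
str(reloaded)
```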
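This document analyzed daily activity data from a fitness tracker. We examined basic statistics, user-specific patterns, and the dates with the highest activity. Most users (87.9%) recorded activity on at least 21 of the 31 days, total steps and total distance are almost perfectly correlated (r ≈ 0.99), calories correlate moderately with steps (r ≈ 0.59), and weekday and weekend averages are nearly identical. The visualizations show how activity varies across days of the week and user categories, and how steps, distance, and calories burned relate to one another. These findings can be used to understand user behavior and improve fitness tracking features.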