This document provides an analysis of Fitness Tracker Case Study data using R. The analysis covers data loading, cleaning, exploration, and visualization to understand activity patterns.

Step 1: Loading the Required Libraries

In this step, we will load all the necessary libraries for data manipulation and visualization operations.

library(dplyr)      # Data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(janitor)    # Data cleaning
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate)  # Date handling
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(hms)        # Time management
## 
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
## 
##     hms
library(skimr)      # Data summary
library(ggplot2)    # Data visualization
library(tidyverse)  # Unified ecosystem
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ stringr 1.5.1
## ✔ purrr   1.0.2     ✔ tibble  3.2.1
## ✔ readr   2.1.5     ✔ tidyr   1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ hms::hms()      masks lubridate::hms()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
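
The masking messages above mean that two packages export a function with the same name, and the one attached last wins. As a small illustrative sketch (the object names are made up, and it assumes the lubridate and hms packages loaded above are installed), calling each function through its namespace makes the choice explicit regardless of load order:

```r
# Illustrative sketch: namespace-qualified calls do not depend on load order.

# lubridate::hms() parses an "HH:MM:SS" string into a Period object
period_value <- lubridate::hms("01:30:00")

# hms::hms() builds a time-of-day value from numeric components
time_value <- hms::hms(seconds = 0, minutes = 30, hours = 1)

# Both describe one and a half hours, expressed here in seconds
lubridate::period_to_seconds(period_value)
as.numeric(time_value)
```

The same `package::function()` form works for `dplyr::filter()` and `stats::filter()`.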

Step 2: Loading the CSV files

Next, we load the daily activity CSV file and inspect its structure and column names to understand the data.

# load data from csv files
daily_activity <- read.csv("D:\\Docs\\Google Data Analytics\\fitbit_fitness_tracker\\dailyActivity_merged.csv")

# Preview the first few rows and structure of the dataset
head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
names(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
str(daily_activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
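
The `str()` output shows `Id` read as a number and displayed as `1.5e+09`. One possible alternative, sketched here on a tiny temporary file rather than the real dataset, is to ask `read.csv()` to read that column as text. The temporary path and the name `sample_activity` are made up for illustration; only the three column names come from the real file.

```r
# Illustrative sketch: write a tiny CSV to a temporary file, then read it back
# with the Id column forced to character instead of numeric.
tmp_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,ActivityDate,TotalSteps",
  "1503960366,4/12/2016,13162",
  "1503960366,4/13/2016,10735"
), tmp_path)

# colClasses can be a named vector; columns not named are type-guessed as usual
sample_activity <- read.csv(tmp_path, colClasses = c(Id = "character"))
str(sample_activity)

unlink(tmp_path)  # remove the temporary file
```

Reading the identifier as text avoids the scientific-notation display and makes the later conversion to a factor a one-step change.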

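More Data Exploration: We view the same data frame through several summary functions (glimpse, skim_without_charts, as_tibble, and summary) to check column types, value ranges, and missing values.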

# data exploration
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
skim_without_charts(daily_activity)
Data summary
Name                     daily_activity
Number of rows           940
Number of columns        15
_______________________
Column type frequency:
  character              1
  numeric                14
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
ActivityDate           0              1    8    9      0        31           0

Variable type: numeric

skim_variable             n_missing  complete_rate          mean            sd          p0           p25           p50           p75          p100
Id                                0              1  4.855407e+09  2.424805e+09  1503960366  2.320127e+09  4.445115e+09  6.962181e+09  8.877689e+09
TotalSteps                        0              1  7.637910e+03  5.087150e+03           0  3.789750e+03  7.405500e+03  1.072700e+04  3.601900e+04
TotalDistance                     0              1  5.490000e+00  3.920000e+00           0  2.620000e+00  5.240000e+00  7.710000e+00  2.803000e+01
TrackerDistance                   0              1  5.480000e+00  3.910000e+00           0  2.620000e+00  5.240000e+00  7.710000e+00  2.803000e+01
LoggedActivitiesDistance          0              1  1.100000e-01  6.200000e-01           0  0.000000e+00  0.000000e+00  0.000000e+00  4.940000e+00
VeryActiveDistance                0              1  1.500000e+00  2.660000e+00           0  0.000000e+00  2.100000e-01  2.050000e+00  2.192000e+01
ModeratelyActiveDistance          0              1  5.700000e-01  8.800000e-01           0  0.000000e+00  2.400000e-01  8.000000e-01  6.480000e+00
LightActiveDistance               0              1  3.340000e+00  2.040000e+00           0  1.950000e+00  3.360000e+00  4.780000e+00  1.071000e+01
SedentaryActiveDistance           0              1  0.000000e+00  1.000000e-02           0  0.000000e+00  0.000000e+00  0.000000e+00  1.100000e-01
VeryActiveMinutes                 0              1  2.116000e+01  3.284000e+01           0  0.000000e+00  4.000000e+00  3.200000e+01  2.100000e+02
FairlyActiveMinutes               0              1  1.356000e+01  1.999000e+01           0  0.000000e+00  6.000000e+00  1.900000e+01  1.430000e+02
LightlyActiveMinutes              0              1  1.928100e+02  1.091700e+02           0  1.270000e+02  1.990000e+02  2.640000e+02  5.180000e+02
SedentaryMinutes                  0              1  9.912100e+02  3.012700e+02           0  7.297500e+02  1.057500e+03  1.229500e+03  1.440000e+03
Calories                          0              1  2.303610e+03  7.181700e+02           0  1.828500e+03  2.134000e+03  2.793250e+03  4.900000e+03
as_tibble(daily_activity)
## # A tibble: 940 × 15
##            Id ActivityDate TotalSteps TotalDistance TrackerDistance
##         <dbl> <chr>             <int>         <dbl>           <dbl>
##  1 1503960366 4/12/2016         13162          8.5             8.5 
##  2 1503960366 4/13/2016         10735          6.97            6.97
##  3 1503960366 4/14/2016         10460          6.74            6.74
##  4 1503960366 4/15/2016          9762          6.28            6.28
##  5 1503960366 4/16/2016         12669          8.16            8.16
##  6 1503960366 4/17/2016          9705          6.48            6.48
##  7 1503960366 4/18/2016         13019          8.59            8.59
##  8 1503960366 4/19/2016         15506          9.88            9.88
##  9 1503960366 4/20/2016         10544          6.68            6.68
## 10 1503960366 4/21/2016          9819          6.34            6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <int>, FairlyActiveMinutes <int>,
## #   LightlyActiveMinutes <int>, SedentaryMinutes <int>, Calories <int>
summary(daily_activity)
##        Id            ActivityDate         TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Length:940         Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Mode  :character   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09                      Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09                      3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09                      Max.   :36019   Max.   :28.030  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900
View(daily_activity)

Step 3: Data Cleaning

Convert Data Types: We convert the ‘Id’ column to a factor and ‘ActivityDate’ to Date format for better analysis.

# Convert ID from numeric to factor
daily_activity$Id <- as.factor(daily_activity$Id)

# Convert Date from character to date format
daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate,format="%m/%d/%Y")

# Verify the changes in data structure
str(daily_activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : Factor w/ 33 levels "1503960366","1624580081",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ActivityDate            : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
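
The date conversion depends on the format string matching the raw text. As a brief sketch on made-up input (the vector names are hypothetical), the base R call used above and `lubridate::mdy()` give the same result, and a string that does not match the format silently becomes `NA` instead of raising an error:

```r
library(lubridate)

# Raw dates in month/day/year order, as in the ActivityDate column
raw_dates <- c("4/12/2016", "4/13/2016", "5/1/2016")

# Base R: the format string must describe the input exactly
parsed_base <- as.Date(raw_dates, format = "%m/%d/%Y")

# lubridate: mdy() infers the separators and zero padding
parsed_lubridate <- mdy(raw_dates)

all(parsed_base == parsed_lubridate)

# A mismatched input yields NA, so it is worth checking for NA after parsing
mismatched <- as.Date("2016-04-12", format = "%m/%d/%Y")
is.na(mismatched)
```

Checking `sum(is.na(...))` on the converted column is a cheap way to confirm that every date was parsed.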

Check for Missing and Duplicate Values: We check for missing values and remove any duplicate rows to ensure the dataset is clean.

# Check for missing values
any(is.na(daily_activity))
## [1] FALSE
# Check total missing values
sum(is.na(daily_activity))
## [1] 0
# Check and remove duplicate values
daily_activity_V2 <- distinct(daily_activity)
View(daily_activity_V2)
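
`distinct()` drops duplicate rows without reporting how many it removed. A minimal sketch on a made-up data frame (`toy_activity` and its values are hypothetical) shows one way to count them first:

```r
library(dplyr)

# Toy data: the second row is an exact copy of the first
toy_activity <- data.frame(
  Id = c("A", "A", "B", "B"),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13")),
  TotalSteps = c(100L, 100L, 250L, 300L)
)

# duplicated() flags every repeat of an earlier row
n_duplicates <- sum(duplicated(toy_activity))

# distinct() keeps the first occurrence of each row
toy_clean <- distinct(toy_activity)

n_duplicates
nrow(toy_activity) - nrow(toy_clean)
```

If the two numbers agree and are zero, the de-duplicated copy is identical to the original.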

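Check for Invalid Values: We verify whether any numeric column contains invalid (negative) values that would need correction.

# Check each numeric column for negative values. apply() would first coerce the
# data frame to a character matrix (Id is a factor, ActivityDate a Date) and
# compare strings, so we use sapply(), which keeps each column's own type.
sapply(daily_activity_V2, function(x) is.numeric(x) && any(x < 0))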
##                       Id             ActivityDate               TotalSteps 
##                    FALSE                    FALSE                    FALSE 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                    FALSE                    FALSE                    FALSE 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                    FALSE                    FALSE                    FALSE 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                    FALSE                    FALSE                    FALSE 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                    FALSE                    FALSE                    FALSE
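
To see how a negative-value check behaves on mixed column types, here is a small sketch on made-up data (`toy_activity` is hypothetical and deliberately contains one invalid value). It shows that a mixed-type data frame becomes text when turned into a matrix, which is what `apply()` does internally, and that a type-aware column check catches a fractional negative that a `<= -1` threshold would miss.

```r
# Toy data with one deliberately invalid distance (-0.5)
toy_activity <- data.frame(
  Id = factor(c("A", "B", "C")),
  TotalSteps = c(1200L, 0L, 5400L),
  TotalDistance = c(0.9, -0.5, 3.8)
)

# A data frame with a factor column becomes a character matrix
is.character(as.matrix(toy_activity))

# Column-wise check that only tests numeric columns, using their real type
has_negative <- sapply(toy_activity, function(x) is.numeric(x) && any(x < 0))
has_negative

# A threshold of -1 does not flag -0.5
any(toy_activity$TotalDistance <= -1)
```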

Step 4: Descriptive Analysis

Basic Statistics: We calculate various statistics to understand the central tendency and dispersion of the data.

# Average of Steps, Distance & Calorie
daily_activity_V2  %>% 
  summarise(avg_TotalSteps=mean(TotalSteps),avg_TotalDistance=mean(TotalDistance),
            avg_Calories=mean(Calories))
##   avg_TotalSteps avg_TotalDistance avg_Calories
## 1       7637.911          5.489702      2303.61
# Maximum of Steps, Distance & Calorie
daily_activity_V2 %>% 
  summarise(max_TotalSteps=max(TotalSteps),max_TotalDistance=max(TotalDistance),
            max_Calories=max(Calories))
##   max_TotalSteps max_TotalDistance max_Calories
## 1          36019             28.03         4900
# Median of Steps, Distance & Calorie
daily_activity_V2 %>% 
  summarise(med_TotalSteps=median(TotalSteps),med_TotalDistance=median(TotalDistance),
            med_Calories=median(Calories))
##   med_TotalSteps med_TotalDistance med_Calories
## 1         7405.5             5.245         2134
# Standard deviation of Steps, Distance & Calorie
daily_activity_V2 %>% 
  summarise(sd_TotalSteps= sd(TotalSteps),sd_TotalDistance=sd(TotalDistance),
            sd_Calories=sd(Calories),sd_TrackerDistance=sd(TrackerDistance))
##   sd_TotalSteps sd_TotalDistance sd_Calories sd_TrackerDistance
## 1      5087.151         3.924606    718.1669           3.907276
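
The four summaries above repeat the same column list. As an optional sketch on made-up numbers (`toy_activity` is hypothetical), `across()` can apply several functions to several columns in one call:

```r
library(dplyr)

# Toy data standing in for the real columns
toy_activity <- data.frame(
  TotalSteps = c(1000, 2000, 3000, 6000),
  Calories = c(1500, 1800, 2100, 2600)
)

# One summarise() call; result columns are named <column>_<function>
toy_stats <- toy_activity %>%
  summarise(across(
    c(TotalSteps, Calories),
    list(mean = mean, median = median, sd = sd, max = max)
  ))

names(toy_stats)
toy_stats$TotalSteps_mean
toy_stats$TotalSteps_median
```

The mean (3000) sitting above the median (2500) in the toy data mirrors the right skew seen in the real step counts, where the mean is 7637.9 and the median 7405.5.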

User-Specific Analysis: We summarize data by user to understand individual activity patterns.

daily_activity_V2 %>% group_by(Id) %>% 
  summarise(avg_steps=mean(TotalSteps),avg_distance=mean(TotalDistance),avg_calorie=mean(Calories))
## # A tibble: 33 × 4
##    Id         avg_steps avg_distance avg_calorie
##    <fct>          <dbl>        <dbl>       <dbl>
##  1 1503960366    12117.        7.81        1816.
##  2 1624580081     5744.        3.91        1483.
##  3 1644430081     7283.        5.30        2811.
##  4 1844505072     2580.        1.71        1573.
##  5 1927972279      916.        0.635       2173.
##  6 2022484408    11371.        8.08        2510.
##  7 2026352035     5567.        3.45        1541.
##  8 2320127002     4717.        3.19        1724.
##  9 2347167796     9520.        6.36        2043.
## 10 2873212765     7556.        5.10        1917.
## # ℹ 23 more rows

User Classification: We classify users by how many days they recorded activity during the 31-day period and calculate the percentage of users in each category.

# Count the distinct days on which each user recorded activity
# during the 31-day survey period
user_active_days <- daily_activity_V2 %>% group_by(Id) %>%
  summarise(active_days=n_distinct(ActivityDate))

# Classify users based on active days:
# 21 or more = High, 10 to 20 = Moderate, fewer than 10 = Low
user_classification <- user_active_days %>%
  mutate(usage_category = case_when(
    active_days >= 21 ~ "High user",
    active_days >= 10 ~ "Moderate user",
    active_days < 10 ~ "Low user"
  ))

# Count users in each category
user_counts <- user_classification %>% group_by(usage_category) %>% 
  summarise(n_users=n())

# Calculate the percentage of users in each category
user_counts <- user_counts %>%
  mutate(percentage = n_users / sum(n_users) * 100)
user_counts
## # A tibble: 3 × 3
##   usage_category n_users percentage
##   <chr>            <int>      <dbl>
## 1 High user           29      87.9 
## 2 Low user             1       3.03
## 3 Moderate user        3       9.09
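
Because `case_when()` returns the first matching branch, the category boundaries are worth checking at their edges. The sketch below wraps the same thresholds in a helper (`classify_usage` is a hypothetical name, not part of the analysis) and then reproduces the percentages from the three counts reported above.

```r
library(dplyr)

# Hypothetical helper using the same thresholds as the classification above
classify_usage <- function(active_days) {
  case_when(
    active_days >= 21 ~ "High user",
    active_days >= 10 ~ "Moderate user",
    TRUE ~ "Low user"
  )
}

# Boundary values on either side of each threshold
boundary_check <- classify_usage(c(4, 9, 10, 20, 21, 31))
boundary_check

# Percentages from the reported counts: 29 high, 1 low, 3 moderate (33 users)
reported_counts <- c(high = 29, low = 1, moderate = 3)
reported_pct <- round(reported_counts / sum(reported_counts) * 100, 2)
reported_pct
```

A user with exactly 10 active days falls in the moderate group, and one with exactly 21 in the high group.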

Trend Analysis: We sort the daily records to find the dates with the highest recorded values for steps, calories, and distance. Each row is one user's total for one day, so the same date can appear more than once.

# Highest number of step by date
highest_steps <- daily_activity_V2 %>% select(ActivityDate,TotalSteps) %>% arrange(desc(TotalSteps))
as_tibble(highest_steps)
## # A tibble: 940 × 2
##    ActivityDate TotalSteps
##    <date>            <int>
##  1 2016-05-01        36019
##  2 2016-04-16        29326
##  3 2016-04-30        27745
##  4 2016-04-27        23629
##  5 2016-04-12        23186
##  6 2016-04-24        22988
##  7 2016-05-07        22770
##  8 2016-04-23        22359
##  9 2016-04-16        22244
## 10 2016-05-08        22026
## # ℹ 930 more rows
# Highest number of calories by date
highest_calorie <- daily_activity_V2 %>% select(ActivityDate, Calories) %>% arrange(desc(Calories))
as_tibble(highest_calorie)
## # A tibble: 940 × 2
##    ActivityDate Calories
##    <date>          <int>
##  1 2016-04-21       4900
##  2 2016-04-17       4552
##  3 2016-04-16       4547
##  4 2016-05-01       4546
##  5 2016-04-30       4501
##  6 2016-04-30       4398
##  7 2016-04-24       4392
##  8 2016-04-16       4274
##  9 2016-04-21       4236
## 10 2016-04-14       4163
## # ℹ 930 more rows
# Highest number of distance covered by date
highest_distance <- daily_activity_V2 %>% select(ActivityDate,TotalDistance) %>% arrange(desc(TotalDistance))
as_tibble(highest_distance)
## # A tibble: 940 × 2
##    ActivityDate TotalDistance
##    <date>               <dbl>
##  1 2016-05-01            28.0
##  2 2016-04-30            26.7
##  3 2016-04-16            25.3
##  4 2016-04-27            20.6
##  5 2016-04-12            20.4
##  6 2016-05-11            19.6
##  7 2016-05-06            19.3
##  8 2016-04-14            19.0
##  9 2016-05-09            18.2
## 10 2016-04-20            18.1
## # ℹ 930 more rows
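
The rankings above are per user and per day. If the question is instead which date had the most activity overall, the rows need to be summed by date first. A minimal sketch on made-up data (`toy_activity` is hypothetical) shows that the two questions can have different answers:

```r
library(dplyr)

# Toy data: two users, two dates
toy_activity <- data.frame(
  Id = c("A", "B", "A", "B"),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-13")),
  TotalSteps = c(5000L, 7000L, 9000L, 1000L)
)

# Highest single user-day record, keeping the Id so the user is visible
top_user_day <- toy_activity %>% slice_max(TotalSteps, n = 1)

# Total steps per date across all users, highest first
daily_totals <- toy_activity %>%
  group_by(ActivityDate) %>%
  summarise(total_steps = sum(TotalSteps)) %>%
  arrange(desc(total_steps))

top_user_day
daily_totals
```

In the toy data the single highest record falls on 2016-04-13, while the date with the highest combined total is 2016-04-12.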

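Create a column for Name of Week: We create a new column ‘week_name’ holding the name of the day of the week, then convert it to an ordered factor (Sunday to Saturday) so that grouped output and plots follow calendar order. Note that weekdays() returns names in the current locale, so the English level names below assume an English locale.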

# Create a column for name of the week
daily_activity_V2$week_name <-  weekdays(daily_activity_V2$ActivityDate)
head(daily_activity_V2)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-04-12      13162          8.50            8.50
## 2 1503960366   2016-04-13      10735          6.97            6.97
## 3 1503960366   2016-04-14      10460          6.74            6.74
## 4 1503960366   2016-04-15       9762          6.28            6.28
## 5 1503960366   2016-04-16      12669          8.16            8.16
## 6 1503960366   2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories week_name
## 1                  13                  328              728     1985   Tuesday
## 2                  19                  217              776     1797 Wednesday
## 3                  11                  181             1218     1776  Thursday
## 4                  34                  209              726     1745    Friday
## 5                  10                  221              773     1863  Saturday
## 6                  20                  164              539     1728    Sunday
# Arrange days of week in correct order 
daily_activity_V2$week_name <- ordered(daily_activity_V2$week_name, 
                                   levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))
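
Day names can also be derived without relying on the session locale. The following sketch uses only base R; the vector names are hypothetical, and the three dates are taken from the rows shown above, where 2016-04-12 is a Tuesday.

```r
# Three dates from the data: a Tuesday, a Saturday, and a Sunday
activity_dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# Fixed English labels in calendar order, Sunday first
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# "%w" gives the weekday as a number from 0 (Sunday) to 6 (Saturday),
# which does not change with the locale
day_index <- as.integer(format(activity_dates, "%w")) + 1L

# Ordered factor, so sorting and plotting follow calendar order
week_name <- factor(day_levels[day_index], levels = day_levels, ordered = TRUE)
week_name
```

With an ordered factor, comparisons such as `week_name[3] < week_name[1]` follow the level order, so Sunday sorts before Tuesday.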

Weekly Aggregations: We aggregate the data by day of the week to compare activity across days.

# Total calories burnt by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_calorie=sum(Calories))
## # A tibble: 7 × 2
##   week_name total_calorie
##   <ord>             <int>
## 1 Sunday           273823
## 2 Monday           278905
## 3 Tuesday          358114
## 4 Wednesday        345393
## 5 Thursday         323337
## 6 Friday           293805
## 7 Saturday         292016
# Total steps by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_Steps=sum(TotalSteps))
## # A tibble: 7 × 2
##   week_name total_Steps
##   <ord>           <int>
## 1 Sunday         838921
## 2 Monday         933704
## 3 Tuesday       1235001
## 4 Wednesday     1133906
## 5 Thursday      1088658
## 6 Friday         938477
## 7 Saturday      1010969
# Total distance by day of the week
daily_activity_V2 %>% group_by(week_name) %>% 
  summarise(total_distance=sum(TotalDistance)) 
## # A tibble: 7 × 2
##   week_name total_distance
##   <ord>              <dbl>
## 1 Sunday              608.
## 2 Monday              666.
## 3 Tuesday             886.
## 4 Wednesday           823.
## 5 Thursday            781.
## 6 Friday              669.
## 7 Saturday            726.
# Calculate Weekday vs weekend 
daily_activity_V2 %>% 
  mutate(day_type=if_else(week_name %in% c("Sunday","Saturday"),"WeekEnd","WeekDay")) %>% 
  group_by(day_type) %>% 
  summarise(avg_steps=mean(TotalSteps),avg_distance=mean(TotalDistance),avg_calorie=mean(Calories))
## # A tibble: 2 × 4
##   day_type avg_steps avg_distance avg_calorie
##   <chr>        <dbl>        <dbl>       <dbl>
## 1 WeekDay      7669.         5.51       2302.
## 2 WeekEnd      7551.         5.45       2310.
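
Totals by day of the week depend on how many times each day occurs in the survey window, not only on how active users were. The sketch below counts the occurrences, assuming the 31 distinct dates run consecutively from 2016-04-12 (the first date shown in the data); that assumption and the variable names are ours.

```r
# Assumed survey window: 31 consecutive days starting 2016-04-12
survey_dates <- seq(as.Date("2016-04-12"), by = "day", length.out = 31)

day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# "%w" is 0 for Sunday through 6 for Saturday; shift to 1-7 for tabulate()
day_index <- as.integer(format(survey_dates, "%w")) + 1L

# Number of times each day of the week occurs in the window
day_counts <- setNames(tabulate(day_index, nbins = 7), day_levels)
day_counts
```

Under that assumption, Tuesday, Wednesday, and Thursday each occur five times and the other days four times, which by itself raises their totals. Comparing means per record, as the weekday versus weekend table does, is less affected by this imbalance.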

Summary, Correlation & Variance: We provide a summary, correlation, and variance analysis to understand relationships and variability within the data set.

# Summary stats for distance types
summary(daily_activity_V2[, c("TrackerDistance", "LoggedActivitiesDistance", 
                              "VeryActiveDistance", "ModeratelyActiveDistance",
                              "LightActiveDistance", "SedentaryActiveDistance")])
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000
# Correlation matrix for distance types
cor(daily_activity_V2[, c("TrackerDistance", "LoggedActivitiesDistance", 
                          "VeryActiveDistance","ModeratelyActiveDistance",
                          "LightActiveDistance", "SedentaryActiveDistance")])
##                          TrackerDistance LoggedActivitiesDistance
## TrackerDistance               1.00000000               0.16258530
## LoggedActivitiesDistance      0.16258530               1.00000000
## VeryActiveDistance            0.79433807               0.15085226
## ModeratelyActiveDistance      0.47027739               0.07652693
## LightActiveDistance           0.66136481               0.13830151
## SedentaryActiveDistance       0.07459089               0.15499618
##                          VeryActiveDistance ModeratelyActiveDistance
## TrackerDistance                  0.79433807              0.470277391
## LoggedActivitiesDistance         0.15085226              0.076526932
## VeryActiveDistance               1.00000000              0.192985874
## ModeratelyActiveDistance         0.19298587              1.000000000
## LightActiveDistance              0.15766926              0.237847447
## SedentaryActiveDistance          0.04611675              0.005793403
##                          LightActiveDistance SedentaryActiveDistance
## TrackerDistance                    0.6613648             0.074590885
## LoggedActivitiesDistance           0.1383015             0.154996178
## VeryActiveDistance                 0.1576693             0.046116748
## ModeratelyActiveDistance           0.2378474             0.005793403
## LightActiveDistance                1.0000000             0.099503204
## SedentaryActiveDistance            0.0995032             1.000000000
# Correlation between steps, distance and calories
cor(daily_activity_V2[,c("TotalSteps","TotalDistance","Calories")])
##               TotalSteps TotalDistance  Calories
## TotalSteps     1.0000000     0.9853688 0.5915681
## TotalDistance  0.9853688     1.0000000 0.6449619
## Calories       0.5915681     0.6449619 1.0000000
# Variance for distance types
daily_activity_V2 %>% 
  summarise(var_TrackerDistance = var(TrackerDistance), var_LoggedActivitiesDistance = var(LoggedActivitiesDistance),
            var_VeryActiveDistance = var(VeryActiveDistance), var_ModeratelyActiveDistance = var(ModeratelyActiveDistance),
            var_LightActiveDistance = var(LightActiveDistance), var_SedentaryActiveDistance = var(SedentaryActiveDistance))
##   var_TrackerDistance var_LoggedActivitiesDistance var_VeryActiveDistance
## 1            15.26681                    0.3842717               7.069968
##   var_ModeratelyActiveDistance var_LightActiveDistance
## 1                    0.7807142                4.164274
##   var_SedentaryActiveDistance
## 1                5.396631e-05
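
Variance and standard deviation carry the same information, and a correlation is a covariance rescaled by the two standard deviations. A short sketch on made-up vectors (`x` and `y` are hypothetical) checks both identities, then confirms that the variance reported above for TrackerDistance agrees with the standard deviation reported in Step 4.

```r
# Made-up sample values
x <- c(2.0, 4.5, 6.1, 8.3, 9.9)
y <- c(1500, 1800, 2050, 2400, 2300)

# Variance is the square of the standard deviation
var_matches_sd <- isTRUE(all.equal(var(x), sd(x)^2))

# Pearson correlation is covariance divided by the product of the sds
cor_matches_cov <- isTRUE(all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y))))

var_matches_sd
cor_matches_cov

# Reported values: sd_TrackerDistance = 3.907276, var_TrackerDistance = 15.26681
reported_gap <- abs(3.907276^2 - 15.26681)
reported_gap < 1e-4
```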

Top users: We identify the five most active and five least active users by number of active days.

# Top 5 active users
daily_activity_V2 %>% group_by(Id) %>% summarise(active_days=n_distinct(ActivityDate)) %>% 
  arrange(desc(active_days)) %>% slice_head(n=5)
## # A tibble: 5 × 2
##   Id         active_days
##   <fct>            <int>
## 1 1503960366          31
## 2 1624580081          31
## 3 1844505072          31
## 4 1927972279          31
## 5 2022484408          31
# Least 5 active users
daily_activity_V2 %>% group_by(Id) %>% summarise(active_days=n_distinct(ActivityDate)) %>% 
  arrange(active_days) %>% slice_head(n=5)
## # A tibble: 5 × 2
##   Id         active_days
##   <fct>            <int>
## 1 4057192912           4
## 2 2347167796          18
## 3 8253242879          19
## 4 3372868164          20
## 5 6775888955          26
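
All five users in the top list have 31 active days, so which five appear depends on row order rather than on any real difference. As a sketch on made-up data (`toy_days` is hypothetical), `slice_max()` with `with_ties = TRUE` returns every tied user instead of cutting the list at a fixed length:

```r
library(dplyr)

# Toy data: three users tied on the maximum
toy_days <- data.frame(
  Id = c("A", "B", "C", "D"),
  active_days = c(31L, 31L, 31L, 18L)
)

# Fixed-length cut: keeps exactly two rows, dropping one tied user
top_two_fixed <- toy_days %>%
  arrange(desc(active_days)) %>%
  slice_head(n = 2)

# Tie-aware cut: keeps every row tied with the top two
top_two_ties <- toy_days %>%
  slice_max(active_days, n = 2, with_ties = TRUE)

nrow(top_two_fixed)
nrow(top_two_ties)
```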

Step 5: Visualization

In this step, we plot the results to reveal patterns in the data and communicate insights.

# Daily steps over time
daily_activity_V2 %>% 
  group_by(ActivityDate) %>% summarise(totalsteps=sum(TotalSteps)) %>% 
  ggplot(aes(x=ActivityDate,y=totalsteps))+geom_line(color = "#1f77b4")+
  labs(x="Date",y="Total Steps",title="Daily Steps Over Time")+theme_minimal()+
  theme(axis.text.x = element_text(angle=45,hjust = 1))
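
For a date axis, the tick spacing and label format can be set explicitly with `scale_x_date()`. The sketch below uses made-up daily totals (`toy_daily` and its values are hypothetical) so that it runs without the dataset.

```r
library(ggplot2)

# Made-up daily totals over a 31-day window
toy_daily <- data.frame(
  ActivityDate = seq(as.Date("2016-04-12"), by = "day", length.out = 31),
  totalsteps = seq(200000, 260000, length.out = 31)
)

# One tick per week, labelled like "Apr 12"
steps_plot <- ggplot(toy_daily, aes(x = ActivityDate, y = totalsteps)) +
  geom_line(color = "#1f77b4") +
  scale_x_date(date_breaks = "1 week", date_labels = "%b %d") +
  labs(x = "Date", y = "Total Steps", title = "Daily Steps Over Time") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

steps_plot
```

Storing the plot in an object also makes it easy to save later with `ggsave()`.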

# Calories vs Steps
daily_activity_V2 %>% 
  ggplot(aes(x=TotalSteps,y=Calories))+geom_jitter(color = "#ff7f0e")+geom_smooth(color = "#2ca02c")+
  labs(x= "Total Steps",y="Calories",title = "Comparison of Steps & Calories")+
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Distance Vs calorie
daily_activity_V2 %>% 
  ggplot(aes(x= TotalDistance,y=Calories))+ geom_jitter(color = "#d62728") +
  geom_smooth(color = "#9467bd") +
  labs(x="Total Distance Covered",y="Calories",title = "Comparison of Distance & Calorie")+
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
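
The message above appears because `geom_smooth()` picked a smoothing method on its own. Stating the method and formula makes the choice explicit and silences the message. The sketch below fits a straight line to made-up data; the linear relation in `toy_activity` is invented purely for illustration and says nothing about the real dataset.

```r
library(ggplot2)

# Made-up data following an exact line: 1500 calories plus 0.08 per step
toy_activity <- data.frame(TotalSteps = seq(0, 20000, by = 1000))
toy_activity$Calories <- 1500 + 0.08 * toy_activity$TotalSteps

# Explicit linear smoother instead of the default choice
calorie_plot <- ggplot(toy_activity, aes(x = TotalSteps, y = Calories)) +
  geom_point(color = "#ff7f0e") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#2ca02c") +
  labs(x = "Total Steps", y = "Calories", title = "Steps vs Calories (toy data)") +
  theme_minimal()

# The same model fitted directly, to read off the slope
toy_fit <- lm(Calories ~ TotalSteps, data = toy_activity)
coef(toy_fit)

calorie_plot
```

Whether a straight line or the default smoother suits the real data better is a judgment call that depends on the shape of the scatter.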

# Calories burnt by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_calorie=sum(Calories)) %>% 
  ggplot(aes(x=week_name,y=total_calorie,fill = week_name))+geom_col()+
  labs(x="Day of Week",y="Total Calories",title="Calories Burnt by Day of Week")+
  theme_minimal()+scale_fill_brewer(palette = "Blues", name = "Day of Week")+
  theme(axis.text.x = element_text(angle = 20))

# Total steps by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_Steps=sum(TotalSteps)) %>% 
  ggplot(aes(x=week_name,y=total_Steps,fill = week_name))+geom_col()+
  labs(x="Day of Week",y="Total Steps",title = "Total Steps by Day of Week")+
  theme_minimal()+scale_fill_brewer(palette = "Oranges", name = "Day of Week")+
  theme(axis.text.x = element_text(angle = 20))

# Total distance by day of the week
daily_activity_V2 %>% group_by(week_name) %>% summarise(total_distance=sum(TotalDistance)) %>% 
  ggplot(aes(x=week_name,y=total_distance,fill = week_name))+geom_col()+
  labs(x="Day of Week",y="Total Distance",title = "Total Distance Covered by Day of Week")+
  theme_minimal()+
  scale_fill_brewer(palette = "Greens", name = "Day of Week")+
  theme(axis.text.x = element_text(angle = 20))
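
If `week_name` is stored as plain text, ggplot2 orders the bars alphabetically (Friday, Monday, Saturday, ...), which makes the weekly pattern harder to read. The sketch below shows one way to force calendar order with a factor. It assumes `week_name` holds English day names; `toy_activity` and `day_levels` are hypothetical names introduced for illustration. If `week_name` is already an ordered factor, this step is unnecessary.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for daily_activity_V2 (values are invented)
toy_activity <- data.frame(
  week_name  = c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday", "Monday"),
  TotalSteps = c(7000, 8000, 7500, 7200, 7400, 8100, 6900, 3000)
)

# Calendar order for the x-axis (assumed: the week starts on Monday)
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

weekly_steps <- toy_activity %>%
  mutate(week_name = factor(week_name, levels = day_levels)) %>%
  group_by(week_name) %>%
  summarise(total_steps = sum(TotalSteps), .groups = "drop")

# Bars now follow the factor levels instead of alphabetical order
ordered_plot <- ggplot(weekly_steps, aes(x = week_name, y = total_steps)) +
  geom_col(fill = "#ff7f0e") +
  labs(x = "Day of Week", y = "Total Steps") +
  theme_minimal()

ordered_plot
```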

# Weekday vs weekend 
summary_week_dayEnd <- daily_activity_V2 %>% 
  mutate(day_type=if_else(week_name %in% c("Sunday","Saturday"),"WeekEnd","WeekDay")) %>% 
  group_by(day_type) %>% 
  summarise(avg_steps=mean(TotalSteps),avg_distance=mean(TotalDistance),avg_calorie=mean(Calories)) 

# Reshape to long format so each metric becomes its own bar
dayEnd_long <- summary_week_dayEnd %>% 
  pivot_longer(cols = c(avg_steps,avg_distance,avg_calorie),
               names_to = "Metric",values_to ="Average")
dayEnd_long %>% ggplot(aes(x=day_type,y=Average,fill = Metric))+geom_col(position = "dodge2")+
  labs(x="Day Type",y="Average",title = "Average Steps, Distance, and Calories by Day Type")+
  theme_minimal()+ scale_fill_manual(values = c("avg_steps" = "#1f77b4", "avg_distance" = "#ff7f0e", "avg_calorie" = "#2ca02c"))
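
Steps, distance, and calories are measured on very different scales, so on a shared y-axis the distance bars can be too short to compare. One possible alternative, sketched below, gives each metric its own panel with an independent y-axis through `facet_wrap(scales = "free_y")`. `toy_summary` is a hypothetical stand-in for `summary_week_dayEnd`, and its values are invented.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for summary_week_dayEnd (values are invented)
toy_summary <- data.frame(
  day_type     = c("WeekDay", "WeekEnd"),
  avg_steps    = c(7600, 7500),
  avg_distance = c(5.4, 5.5),
  avg_calorie  = c(2300, 2310)
)

# Long format: one row per day type and metric
toy_long <- pivot_longer(
  toy_summary,
  cols = c(avg_steps, avg_distance, avg_calorie),
  names_to = "Metric",
  values_to = "Average"
)

# One panel per metric, each with its own y-axis range
facet_plot <- ggplot(toy_long, aes(x = day_type, y = Average, fill = day_type)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Metric, scales = "free_y") +
  labs(x = "Day Type", y = "Average") +
  theme_minimal()

facet_plot
```

The trade-off is that bar heights are no longer comparable across panels, only within each one.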

# Top 5 active users
daily_activity_V2 %>% group_by(Id) %>% summarise(active_days=n_distinct(ActivityDate)) %>% 
  arrange(desc(active_days)) %>% slice_head(n=5) %>% 
  # factor(Id) keeps the bars discrete even if Id is stored as a number
  ggplot(aes(x=factor(Id),y=active_days))+geom_col(fill="#d62728")+coord_flip()+
  labs(x="ID",y="Active Days",title="Top 5 Active Users")+theme_minimal()

# 5 least active users
daily_activity_V2 %>% group_by(Id) %>% summarise(active_days=n_distinct(ActivityDate)) %>% 
  arrange(active_days) %>% slice_head(n=5) %>% 
  # factor(Id) keeps the bars discrete even if Id is stored as a number
  ggplot(aes(x=factor(Id),y=active_days))+geom_col(fill="#d62728")+coord_flip()+
  labs(x="ID",y="Active Days",title="5 Least Active Users")+theme_minimal()

# Users based on how many days they used their smart device during a 31-day survey period
user_counts %>% ggplot(aes(x=usage_category,y=percentage,fill = usage_category))+
  geom_col()+
  labs(x="User Classification",y="Percentage",title = "Percentage of Users by Usage Category")+
  theme_minimal()+
  geom_text(aes(label = sprintf("%.1f%%", percentage)),vjust = -0.2)+
  scale_fill_manual(name="User Type",values = c("High user" = "#1f77b4", "Moderate user" = "#ff7f0e", "Low user" = "#2ca02c"))

# Activity trend by user
daily_activity_V2 %>% group_by(Id,ActivityDate) %>% 
  summarise(totalsteps=sum(TotalSteps), .groups = "drop") %>% 
  # factor(Id) gives each user a discrete colour and its own line, whether Id is numeric or text
  ggplot(aes(x=ActivityDate,y=totalsteps,colour=factor(Id)))+geom_line()+
  labs(x="Date",y="Total Steps",colour="User ID",title = "Activity Trends by User")+theme_minimal()

# Calories vs distance, coloured by steps
# Points are used instead of a line because the rows are independent daily observations
daily_activity_V2 %>% ggplot(aes(x=TotalDistance,y=Calories,color=TotalSteps))+
  geom_point()+labs(x="Total Distance",y="Calories",color="Total Steps",title="Calories vs Distance by Steps")+
  theme_minimal()+scale_color_gradient(low = "#1f77b4", high = "#ff7f0e")

Step 6: Export Combined DataFrame

Finally, we export the cleaned data frame to a CSV file so it can be reused in further analysis.

# write.csv() is part of base R, so no extra package is needed for CSV export
# Write the CSV file to the current working directory, without row names
write.csv(daily_activity_V2,"daily_activity_V2.csv",row.names = FALSE)
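
After exporting, it can be worth reading the file back to confirm that no rows or columns were lost. The sketch below illustrates such a round-trip check on a hypothetical two-row data frame (`toy_activity`) written to a temporary file, so it does not touch the real export.

```r
# Hypothetical toy data standing in for daily_activity_V2 (values are invented)
toy_activity <- data.frame(
  Id         = c("A", "B"),
  TotalSteps = c(1000L, 2000L),
  stringsAsFactors = FALSE
)

# Write to a temporary file instead of the working directory
out_path <- tempfile(fileext = ".csv")
write.csv(toy_activity, out_path, row.names = FALSE)

# Read the file back and compare shape and column names with the original
round_trip <- read.csv(out_path, stringsAsFactors = FALSE)
stopifnot(
  identical(dim(round_trip), dim(toy_activity)),
  identical(names(round_trip), names(toy_activity))
)
```

Column types can change on the way back in (dates, for example, are read as text by default), so a stricter check would compare classes as well.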

Summary

This document walked through an analysis of fitness tracker data, from loading and cleaning to exploration and visualization. We examined basic statistics, user-specific patterns, and activity trends. The visualizations show how activity varies across days of the week, how users fall into usage categories, and how steps, distance, and calories burned relate to one another. These findings can help explain user behavior and guide improvements to fitness tracking features.