Bellabeat is a small high-tech manufacturer of health-focused products, founded in 2013. By 2016, the company had opened offices worldwide.
● What are the new growth opportunities for our products?
● What modifications should be made to tailor the products more to customers?
● What insights are useful for developing effective future marketing strategies?
Data source: FitBit Fitness Tracker Data (CC0: Public Domain). The data is hosted on Kaggle as FitBit Fitness Tracker Data.
Copyright: The data is free to use. All details about licensing, privacy, security, and accessibility can be found at Deed - CC0 1.0 Universal - Creative Commons.
install.packages("tidyverse")
library(tidyverse)
library(dplyr)
library(tidyr)
install.packages("readr")
library(readr)
install.packages("lubridate")
library(lubridate)
install.packages("janitor")
library(janitor)
install.packages("knitr")
library(knitr)
install.packages("highcharter")
library(highcharter)
install.packages("ggplot2")
library(ggplot2)
install.packages("plotly")
library(plotly)
install.packages("EnvStats")
library(EnvStats)
install.packages("htmltools")
library(htmltools)
install.packages("DT")
library(DT)
Upload the data set to the project directory in Posit, then rename the folders Fitabase Data 3.12.16-4.11.16 and Fitabase Data 4.12.16-5.12.16 to March_data and April_data respectively. See Appendix 1.
Read the csv files and count the number of users in each. When reading, the “merged” suffix in each file name was replaced with “Apr” or “Mar” in the corresponding table name, for files starting in April or March respectively.
For example, dailyActivity_merged.csv in the folder April_data is read into the table dailyActivity_Apr. Then count the distinct user IDs in each data table.
# Read file and print the first 3 rows
dailyActivity_Apr<-read_csv("April_data/dailyActivity_merged.csv")
print(dailyActivity_Apr, n=3)
## # A tibble: 940 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## # ℹ 937 more rows
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
# Count number of users
n_distinct(dailyActivity_Apr$Id)
## [1] 33
Continue skimming through the rest, then aggregate the dataset information into a data overview.
*METs = Metabolic Equivalent of Task. Vigorous-intensity activities are defined as ≥ 6.0 METs; running at 10 minutes per mile (6.0 mph) is a 10-MET activity and is therefore classified as vigorous intensity (HHS 2008, 55). However, the maximum METs in both given tables significantly surpass these benchmarks.
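One plausible explanation, which would need confirming against the Fitabase data dictionary, is that the export stores METs multiplied by 10. The sketch below rescales on that assumption and classifies each minute against the HHS thresholds; the column name METs follows minuteMETsNarrow_Apr.
# Hedged sketch: rescale METs (assumed to be stored at 10x their true value)
# and classify each minute against the HHS intensity thresholds
mets_classified <- minuteMETsNarrow_Apr %>%
mutate(
METs_actual = METs / 10,
Intensity_class = case_when(
METs_actual >= 6 ~ "vigorous", # HHS 2008: >= 6.0 METs
METs_actual >= 3 ~ "moderate",
TRUE ~ "light or sedentary"
)
)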
The data set consists of 29 files, organised into two folders with different time ranges: one covers 3/12/2016 to 4/11/2016 and the other 4/12/2016 to 5/12/2016. The data is a mix of wide and long formats.
There are 35 distinct IDs in the whole data set. However, the number of IDs in each file is inconsistent. See Appendix 2
Column names are consistent between files, but the number of columns per file varies. See Appendix 3
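One way to automate this check with the already-loaded janitor package is compare_df_cols(), which lists each column name alongside its class in every supplied table; a sketch for the two daily activity tables:
# Compare column names and types across the two daily activity tables
compare_df_cols(dailyActivity_Apr, dailyActivity_Mar)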
Cross-verification: Data within one folder or between two folders in the same categories was compared to ensure that data from different tables matched. See Appendix 4
Audit trails: A change log was maintained to monitor changes to the dataset. See Appendix 1
● Abbreviations and calculation formulas in some tables are not clarified in the data source.
● The data does not include demographic information, so sampling bias cannot be ruled out.
● The data is relatively dated, as the survey was undertaken in 2016.
The results of the analysis should therefore be taken as a reference only. Further clarification and extra data sources are needed.
R is used to examine and then merge the needed tables before analysing.
● Time stamps differ between files and between the two folders.
● Dates are stored in mdy and mdy_hms formats (see the parsing sketch after this list).
● There are overlaps and inconsistencies in data covering the same period.
● The number of ID records in March is significantly smaller than in April.
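A quick illustration of the two layouts with lubridate:
# The daily tables store dates like "4/12/2016" (month/day/year), while the
# hourly tables store stamps like "4/12/2016 1:00:00 AM"
mdy("4/12/2016") # Date: "2016-04-12"
mdy_hms("4/12/2016 1:00:00 AM") # POSIXct: "2016-04-12 01:00:00 UTC"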
Clean files No. 1, 2, 4, 5, 12, 13, 24, 25 from the Dataset overview table.
First, the dailyActivity_Apr table was cleaned.
# Find the location of missing values
which(is.na(dailyActivity_Apr), arr.ind = TRUE)
## row col
# Count the total number of missing values
sum(is.na(dailyActivity_Apr))
## [1] 0
# Count the number of duplicate rows
sum(duplicated(dailyActivity_Apr))
## [1] 0
# Check the structure of the data frame
str(dailyActivity_Apr)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Trim whitespace from column names
colnames(dailyActivity_Apr) <- trimws(colnames(dailyActivity_Apr))
# Remove empty rows and columns
dailyActivity_Apr <- remove_empty(dailyActivity_Apr, which = c("rows", "cols"))
Then, apply the same method to all of the tables below. See more code in Appendix 5
Table hourlyCalories_Apr
Table hourlyIntensities_Apr
Table hourlySteps_Apr
Table dailyActivity_Mar
Table hourlyCalories_Mar
Table hourlyIntensities_Mar
Table hourlySteps_Mar
Join the daily Activity tables. Note: records on 5/12/2016 are excluded because they are incomplete and could skew the analysis results. The period analysed runs from 3/12/2016 to 5/11/2016, covering 60 days.
# Filter out data on 5/12/2016
dailyActivity_Apr_without_end_date <- dailyActivity_Apr[dailyActivity_Apr$ActivityDate != "5/12/2016", ]
dailyActivity_Mar_without_end_date <- dailyActivity_Mar[dailyActivity_Mar$ActivityDate != "4/12/2016", ]
# Merge tables to create a master data frame named dailyActivity
dailyActivity <- rbind(dailyActivity_Apr_without_end_date,dailyActivity_Mar_without_end_date)
# Change the data type of the column ActivityDate from character to date
dailyActivity$ActivityDate <- mdy(dailyActivity$ActivityDate)
Detect outliers:
# Detect outliers in IDs based on their frequency
detect_outlier_frequency <- dailyActivity %>%
group_by(Id) %>%
summarise(Frequency = n(), .groups = 'drop')
# Find outliers using Rosner's Test
result <- rosnerTest(detect_outlier_frequency$Frequency, k = 5)
# Extract the outlier values
outliers <- result$all.stats %>%
filter(Outlier) %>%
select(Value)
# Identify the IDs corresponding to the outlier frequencies
outlier_Ids <- detect_outlier_frequency %>%
filter(Frequency %in% outliers$Value)
# Print outliers
print(outlier_Ids)
## # A tibble: 3 × 2
## Id Frequency
## <dbl> <int>
## 1 2891001357 8
## 2 4020332650 61
## 3 6391747486 9
Three IDs were classified as outliers based on their frequency in the dataset. However, this could be due to the nature of the observations: the frequency simply reflects how many days each user recorded data.
Another test will be conducted to detect outlier values in the Calories column.
# Detect outliers based on values in Calories column
rosnerTest(dailyActivity$Calories,k = 10)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: dailyActivity$Calories
##
## Sample Size: 1352
##
## Test Statistics: R.1 = 3.626794
## R.2 = 3.254037
## R.3 = 3.268093
## R.4 = 3.282332
## R.5 = 3.296759
## R.6 = 3.237084
## R.7 = 3.243805
## R.8 = 3.224483
## R.9 = 3.223801
## R.10 = 3.230316
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 2312.760 713.3682 4900 592 3.626794 4.113355 FALSE
## 2 1 2310.845 710.1472 0 639 3.254037 4.113177 FALSE
## 3 2 2312.557 707.6168 0 800 3.268093 4.112999 FALSE
## 4 3 2314.271 705.0693 0 1148 3.282332 4.112821 FALSE
## 5 4 2315.988 702.5046 0 1227 3.296759 4.112643 FALSE
## 6 5 2317.707 699.9224 52 338 3.237084 4.112465 FALSE
## 7 6 2319.391 697.4497 57 889 3.243805 4.112286 FALSE
## 8 7 2321.073 694.9725 4562 1049 3.224483 4.112108 FALSE
## 9 8 2319.406 692.5348 4552 558 3.223801 4.111929 FALSE
## 10 9 2317.743 690.1050 4547 894 3.230316 4.111750 FALSE
Since the test did not show any outliers, the original observations in the column Calories will be retained for analysis. However, some insights will be re-tested if outliers are detected in the analyzed columns to ensure the results are reliable.
Combine the selected files from the overview table, using Id and ActivityHour as the merge keys. Then consolidate them into one master dataset.
# Merge hourlyCalories_Apr, hourlyIntensities_Apr, and hourlySteps_Apr
hourly_Apr <- hourlyCalories_Apr %>%
inner_join(hourlyIntensities_Apr, by = c("Id","ActivityHour")) %>%
inner_join(hourlySteps_Apr, by = c("Id","ActivityHour")) %>%
select(Id,ActivityHour,TotalIntensity,StepTotal,Calories)
# Merge hourlyCalories_Mar, hourlyIntensities_Mar, hourlySteps_Mar
hourly_Mar <- hourlyCalories_Mar %>%
inner_join(hourlyIntensities_Mar, by = c("Id","ActivityHour")) %>%
inner_join(hourlySteps_Mar, by = c("Id","ActivityHour")) %>%
select(Id,ActivityHour,TotalIntensity,StepTotal,Calories)
# Combine these two data frames into master data, then change data type of column ActivityHour
hourly_merged <- rbind(hourly_Apr,hourly_Mar)
hourly_merged$ActivityHour <- mdy_hms(hourly_merged$ActivityHour)
Then, arrange data in order from Monday to Sunday:
hourly_merged_V1 <- hourly_merged %>%
mutate(
hour = hour(ActivityHour),
weekday = factor(weekdays(ActivityHour), levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
) %>%
group_by(weekday, hour) %>%
summarise(across(TotalIntensity:Calories, \(x) sum(x, na.rm = TRUE)), .groups = 'drop') %>%
arrange(weekday, hour)
Next, check for outliers:
rosnerTest(hourly_merged_V1$TotalIntensity, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: hourly_merged_V1$TotalIntensity
##
## Sample Size: 168
##
## Test Statistics: R.1 = 1.749956
## R.2 = 1.645558
## R.3 = 1.606795
## R.4 = 1.605240
## R.5 = 1.597390
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 3135.179 1956.519 6559 67 1.749956 3.552401 FALSE
## 2 1 3114.677 1944.218 6314 43 1.645558 3.550554 FALSE
## 3 2 3095.404 1934.034 6203 134 1.606795 3.548694 FALSE
## 4 3 3076.570 1924.591 6166 66 1.605240 3.546821 FALSE
## 5 4 3057.732 1915.167 6117 68 1.597390 3.544935 FALSE
rosnerTest(hourly_merged_V1$StepTotal, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: hourly_merged_V1$StepTotal
##
## Sample Size: 168
##
## Test Statistics: R.1 = 2.004184
## R.2 = 1.726885
## R.3 = 1.703869
## R.4 = 1.715386
## R.5 = 1.686624
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 83146.71 56577.78 196539 67 2.004184 3.552401 FALSE
## 2 1 82467.71 56057.19 179272 43 1.726885 3.550554 FALSE
## 3 2 81884.55 55716.39 176818 68 1.703869 3.548694 FALSE
## 4 3 81309.20 55389.16 176323 135 1.715386 3.546821 FALSE
## 5 4 80729.85 55055.05 173587 66 1.686624 3.544935 FALSE
rosnerTest(hourly_merged_V1$Calories, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: hourly_merged_V1$Calories
##
## Sample Size: 168
##
## Test Statistics: R.1 = 1.958716
## R.2 = 1.865898
## R.3 = 1.825498
## R.4 = 1.833752
## R.5 = 1.834406
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 26324.22 5079.235 36273 67 1.958716 3.552401 FALSE
## 2 1 26264.65 5035.297 35660 135 1.865898 3.550554 FALSE
## 3 2 26208.05 4996.966 35330 134 1.825498 3.548694 FALSE
## 4 3 26152.76 4960.995 35250 66 1.833752 3.546821 FALSE
## 5 4 26097.29 4924.595 35131 43 1.834406 3.544935 FALSE
Conclusion: There are no outliers in the data table hourly_merged_V1.
# Count daily active users, total steps, and total calories for each of the 60 days
dailyActivity_sum1 <- dailyActivity %>%
group_by(ActivityDate) %>%
summarise(Total_Ids = n_distinct(Id),
Total_Steps = sum(TotalSteps),
Total_Calories = sum(Calories),
.groups = 'drop') %>%
select(ActivityDate, Total_Ids, Total_Steps,Total_Calories)
Insight 1: There was a surge in the number of daily users and in daily activity from the beginning of April.
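A minimal ggplot2 sketch of that daily trend (the published chart may use highcharter or plotly instead):
# Plot daily active users over the 60-day window
ggplot(dailyActivity_sum1, aes(x = ActivityDate, y = Total_Ids)) +
geom_line() +
labs(x = "Date", y = "Daily active users")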
# Total Ids and Total Steps
cor.test(dailyActivity_sum1$Total_Steps, dailyActivity_sum1$Total_Ids, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dailyActivity_sum1$Total_Steps and dailyActivity_sum1$Total_Ids
## t = 47.74, df = 59, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9788453 0.9923915
## sample estimates:
## cor
## 0.9873024
Insight 2: The correlation test (cor = 0.9873) indicates a very strong positive linear relationship between the surge in the number of users and the number of steps taken. See the chart below:
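A minimal sketch of that chart, plotting daily users against total daily steps with a linear trend line:
# Scatter of daily active users vs. total daily steps with a fitted line
ggplot(dailyActivity_sum1, aes(x = Total_Ids, y = Total_Steps)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Daily active users", y = "Total daily steps")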
However, the reasons behind this sudden increase require additional data for clarification. Possible explanations could include promotions, sales seasons, product launches, missing data, and other factors.
# Total calories and total steps
cor.test(dailyActivity_sum1$Total_Steps, dailyActivity_sum1$Total_Calories, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dailyActivity_sum1$Total_Steps and dailyActivity_sum1$Total_Calories
## t = 61.699, df = 59, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9872159 0.9954145
## sample estimates:
## cor
## 0.9923396
Insight 3: The results also show a strong positive correlation (cor = 0.9923) between the number of calories burned and total steps. See the chart below:
# Create dailyActivity_average table
dailyActivity_average <- dailyActivity %>%
group_by(Id) %>%
summarise(Average_Steps = sum(TotalSteps)/60,
Average_Distance = sum(TotalDistance)/60)
Insight 4: The vast majority of users are not active enough in terms of average steps per day. Only 5 users reach the NIH-recommended level of at least 7,500 steps, or approximately 3.4 miles, a day.
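A quick check of that count from the averages table:
# Count users whose 60-day average meets the 7,500-step guideline
dailyActivity_average %>%
filter(Average_Steps >= 7500) %>%
nrow()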
dailyActivity_sum2 <- dailyActivity%>%
summarize(
Very = sum(VeryActiveMinutes),
Fairly = sum(FairlyActiveMinutes),
Lightly = sum(LightlyActiveMinutes),
Sedentary = sum(SedentaryMinutes)
) %>%
pivot_longer(cols = everything(),names_to = "Level",values_to = "Minutes")
print(dailyActivity_sum2)
## # A tibble: 4 × 2
## Level Minutes
## <chr> <dbl>
## 1 Very 27195
## 2 Fairly 18633
## 3 Lightly 256181
## 4 Sedentary 1361157
Insight 5: Fitbit users typically spend most of their time on sedentary and lightly active activities. This suggests that they mainly use the products as casual accessories with lower activity levels, rather than for physically demanding tasks or training purposes.
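Expressing each level as a share of all recorded minutes makes this skew explicit; a one-line sketch:
# Add each activity level's percentage of total recorded minutes
dailyActivity_sum2 %>%
mutate(Share = round(100 * Minutes / sum(Minutes), 1))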
# Create dailyActivity_V2 by grouping by Id and summing the related columns
dailyActivity_V2 <- dailyActivity %>%
group_by(Id) %>%
summarise(across(c(TotalSteps, VeryActiveMinutes:Calories), \(x) sum(x, na.rm = TRUE)))
Check for outliers in dailyActivity_V2:
rosnerTest(dailyActivity_V2$VeryActiveMinutes, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: dailyActivity_V2$VeryActiveMinutes
##
## Sample Size: 35
##
## Test Statistics: R.1 = 2.810920
## R.2 = 3.239663
## R.3 = 3.053991
## R.4 = 3.032100
## R.5 = 2.681619
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 4
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 777.0000 984.7311 3545 30 2.810920 2.978183 TRUE
## 2 1 695.5882 871.8227 3520 22 3.239663 2.965315 TRUE
## 3 2 610.0000 725.9354 2827 35 3.053991 2.951949 TRUE
## 4 3 540.7188 616.8271 2411 32 3.032100 2.938048 TRUE
## 5 4 480.3871 522.3012 1881 1 2.681619 2.923571 FALSE
rosnerTest(dailyActivity_V2$FairlyActiveMinutes, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: dailyActivity_V2$FairlyActiveMinutes
##
## Sample Size: 35
##
## Test Statistics: R.1 = 3.677436
## R.2 = 2.025512
## R.3 = 1.819187
## R.4 = 1.922798
## R.5 = 1.852099
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 1
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 532.3714 457.0110 2213 13 3.677436 2.978183 TRUE
## 2 1 482.9412 356.4820 1205 22 2.025512 2.965315 FALSE
## 3 2 461.0606 338.0299 1076 3 1.819187 2.951949 FALSE
## 4 3 441.8438 324.6083 1066 20 1.922798 2.938048 FALSE
## 5 4 421.7097 308.9956 994 29 1.852099 2.923571 FALSE
Result: Outliers were detected in some columns. The results from this data table should be interpreted with caution due to the presence of outliers. These outliers could be due to natural variation in the data, recording errors, equipment malfunctions, or sample size issues. Further clarification is needed.
Plot the original data in the table dailyActivity_V2.
Insight 6: There is a clear positive correlation between active minutes and total steps: as active minutes increase, total steps also increase. The trend lines indicate that the increase in total steps is more pronounced for very active or fairly active users than for lightly active ones.
Insight 7: All activity levels show trend lines with a positive correlation between active minutes and total calories burned. The ‘Very Active’ trend line has the steepest slope, followed by ‘Fairly Active’, indicating a more rapid increase in calories burned with increased active minutes compared to ‘Lightly Active’. The slopes of ‘Very Active’ and ‘Fairly Active’ are close to each other.
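A minimal sketch of the trend-line comparison behind Insights 6 and 7 (the published charts may differ): reshape the per-user totals to long form and fit one linear trend per activity level.
# Per-user active minutes vs. total calories, one fitted line per level
dailyActivity_V2 %>%
pivot_longer(VeryActiveMinutes:LightlyActiveMinutes,
names_to = "Level", values_to = "Minutes") %>%
ggplot(aes(x = Minutes, y = Calories, colour = Level)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)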
The data below will be visualised in the next phase to reveal clearer trends and patterns.
Note: There are no outliers in this data set.
# Create a datatable with a search bar and no header wrap
data_hourly_merged_V1 <- datatable(hourly_merged_V1, options = list(
pageLength = 10,
scrollX = TRUE,
scrollY = "400px",
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'white-space': 'nowrap'});",
"}"
)
), rownames = FALSE)
# Print the datatable
data_hourly_merged_V1
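One sketch of how these weekday-by-hour totals can be visualised to surface the peak periods referenced below as Insight 8:
# Heat map of total intensity by weekday and hour of day
ggplot(hourly_merged_V1, aes(x = hour, y = weekday, fill = TotalIntensity)) +
geom_tile() +
scale_x_continuous(breaks = 0:23) +
labs(x = "Hour of day", y = NULL, fill = "Total intensity")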
1. What are the new growth opportunities for our products?
The analysis reveals two growth opportunities:
Targeting casual users: As Insight 5 shows, most Fitbit users engage in sedentary and lightly active activities, suggesting they use the products as casual accessories rather than for training. This presents an opportunity to expand marketing and product features toward casual users who may be seeking to improve their health gradually, rather than focusing solely on athletes or highly active users.
Increase engagement during peak times: Insights 1 and 8 highlight a surge in user activity starting in April, as well as peak hours during the day and week. These patterns could be leveraged to create time-specific promotions or challenges to engage users during periods of high activity.
2. What modifications should be made to tailor the products more to customers?
Several product modifications can improve user engagement:
Encouragement of more active behavior: Since Insight 4 shows that most users are not meeting the NIH-recommended daily step count, the product could include motivational reminders to increase activity. Insight 7 suggests that adding features to guide users toward “Very Active” or “Fairly Active” activities, instead of the “Very Active” level only, could significantly impact calories burned and overall fitness.
Design improvements: To cater to the majority of users who are using the product casually (Insight 5), the product design could be more aesthetically tailored for casual, everyday wear instead of focusing solely on athletic features.
3. What insights are useful for developing effective future marketing strategies?
Time-specific campaigns: Insight 8 shows the busiest periods of user activity. Marketing campaigns and promotions can be launched during peak times (e.g., 10:00 to 14:00 and 16:00 to 19:00) to maximize visibility and engagement. Additionally, campaigns can be adjusted to target weekdays or weekends depending on when users are most active.
Target casual users: Given that the majority of users are not highly active (Insight 5), marketing strategies should focus on promoting the benefits of moderate physical activity and how Bellabeat products can help users gradually improve their health, especially for those engaging in sedentary or lightly active lifestyles.
Regarding the data, it is essential to address some gaps in the analysis, such as the lack of demographic information. Incorporating these factors could provide deeper insights. Moreover, internal data from Bellabeat’s current users could be used to cross-check and validate the findings of this analysis.
The stakeholders, including the executive and marketing teams, will receive the analysis via the visualizations prepared in this R Markdown document. It will be uploaded to https://rpubs.com/HenryN9 for easy access and further reference. A short meeting would be the most effective way to communicate these findings and discuss their applications.
This analysis reveals two key findings:
● The current users in the data set exhibit relatively low levels of activity, suggesting that they tend to use the products more as casual accessories rather than for training purposes.
● The busiest times of activity occur at specific periods within a day or week.
Upload and Organize Data in Posit
The dataset was uploaded to Posit under the Project folder, and the directory path was set.
The folders “Fitabase Data 3.12.16-4.11.16” and “Fitabase Data 4.12.16-5.12.16” were moved to the Project folder and renamed “March_data” and “April_data” respectively.
The folders “mturkfitbit_export_3.12.16-4.11.16” and “mturkfitbit_export_4.12.16-5.12.16” were deleted.
Rename Data Tables
When reading each file, the “merged” suffix was replaced with “Apr” or “Mar” in the corresponding table name, for files with start dates in April or March respectively.
For example, “dailyActivity_merged.csv” in the folder “April_data” was read into the table “dailyActivity_Apr”. All other parts of the original names were left unchanged.
# Read the files in the folder April_data in ascending order
dailyActivity_Apr<-read_csv("April_data/dailyActivity_merged.csv")
dailyCalories_Apr<-read_csv("April_data/dailyCalories_merged.csv")
dailyIntensities_Apr<-read_csv("April_data/dailyIntensities_merged.csv")
dailySteps_Apr<-read_csv("April_data/dailySteps_merged.csv")
heartrate_Apr<-read_csv("April_data/heartrate_merged.csv")
hourlyCalories_Apr<-read_csv("April_data/hourlyCalories_merged.csv")
hourlyIntensities_Apr<-read_csv("April_data/hourlyIntensities_merged.csv")
hourlySteps_Apr<-read_csv("April_data/hourlySteps_merged.csv")
minuteCaloriesNarrow_Apr<-read_csv("April_data/minuteCaloriesNarrow_merged.csv")
minuteCaloriesWide_Apr<-read_csv("April_data/minuteCaloriesWide_merged.csv")
minuteIntensitiesNarrow_Apr<-read_csv("April_data/minuteIntensitiesNarrow_merged.csv")
minuteIntensitiesWide_Apr<-read_csv("April_data/minuteIntensitiesWide_merged.csv")
minuteMETsNarrow_Apr<-read_csv("April_data/minuteMETsNarrow_merged.csv")
minuteSleep_Apr<-read_csv("April_data/minuteSleep_merged.csv")
minuteStepsNarrow_Apr<-read_csv("April_data/minuteStepsNarrow_merged.csv")
minuteStepsWide_Apr<-read_csv("April_data/minuteStepsWide_merged.csv")
sleepDay_Apr<-read_csv("April_data/sleepDay_merged.csv")
weightLogInfo_Apr<-read_csv("April_data/weightLogInfo_merged.csv")
# Read the files in the folder March_data in ascending order
dailyActivity_Mar<-read_csv("March_data/dailyActivity_merged.csv")
heartrate_Mar<-read_csv("March_data/heartrate_merged.csv")
hourlyCalories_Mar<-read_csv("March_data/hourlyCalories_merged.csv")
hourlyIntensities_Mar<-read_csv("March_data/hourlyIntensities_merged.csv")
hourlySteps_Mar<-read_csv("March_data/hourlySteps_merged.csv")
minuteCaloriesNarrow_Mar<-read_csv("March_data/minuteCaloriesNarrow_merged.csv")
minuteIntensitiesNarrow_Mar<-read_csv("March_data/minuteIntensitiesNarrow_merged.csv")
minuteMETsNarrow_Mar<-read_csv("March_data/minuteMETsNarrow_merged.csv")
minuteSleep_Mar<-read_csv("March_data/minuteSleep_merged.csv")
minuteStepsNarrow_Mar<-read_csv("March_data/minuteStepsNarrow_merged.csv")
weightLogInfo_Mar<-read_csv("March_data/weightLogInfo_merged.csv")
# Tables in folder April_data
n_distinct(dailyActivity_Apr$Id)
n_distinct(dailyCalories_Apr$Id)
n_distinct(dailyIntensities_Apr$Id)
n_distinct(dailySteps_Apr$Id)
n_distinct(heartrate_Apr$Id)
n_distinct(hourlyCalories_Apr$Id)
n_distinct(hourlyIntensities_Apr$Id)
n_distinct(hourlySteps_Apr$Id)
n_distinct(minuteCaloriesNarrow_Apr$Id)
n_distinct(minuteCaloriesWide_Apr$Id)
n_distinct(minuteIntensitiesNarrow_Apr$Id)
n_distinct(minuteIntensitiesWide_Apr$Id)
n_distinct(minuteMETsNarrow_Apr$Id)
n_distinct(minuteSleep_Apr$Id)
n_distinct(minuteStepsNarrow_Apr$Id)
n_distinct(minuteStepsWide_Apr$Id)
n_distinct(sleepDay_Apr$Id)
n_distinct(weightLogInfo_Apr$Id)
# Tables in folder March_data
n_distinct(dailyActivity_Mar$Id)
n_distinct(heartrate_Mar$Id)
n_distinct(hourlyCalories_Mar$Id)
n_distinct(hourlyIntensities_Mar$Id)
n_distinct(hourlySteps_Mar$Id)
n_distinct(minuteCaloriesNarrow_Mar$Id)
n_distinct(minuteIntensitiesNarrow_Mar$Id)
n_distinct(minuteMETsNarrow_Mar$Id)
n_distinct(minuteSleep_Mar$Id)
n_distinct(minuteStepsNarrow_Mar$Id)
n_distinct(weightLogInfo_Mar$Id)
First, extract the column names of all the files.
# From folder April_data
colnames(minuteStepsNarrow_Apr)
colnames(minuteStepsWide_Apr)
colnames(sleepDay_Apr)
colnames(weightLogInfo_Apr)
colnames(minuteCaloriesWide_Apr)
colnames(minuteIntensitiesNarrow_Apr)
colnames(minuteIntensitiesWide_Apr)
colnames(minuteMETsNarrow_Apr)
colnames(minuteSleep_Apr)
colnames(dailyActivity_Apr)
colnames(dailyCalories_Apr)
colnames(dailyIntensities_Apr)
colnames(dailySteps_Apr)
colnames(heartrate_Apr)
colnames(hourlyCalories_Apr)
colnames(hourlyIntensities_Apr)
colnames(hourlySteps_Apr)
colnames(minuteCaloriesNarrow_Apr)
# From folder March_data
colnames(dailyActivity_Mar)
colnames(heartrate_Mar)
colnames(hourlyCalories_Mar)
colnames(hourlyIntensities_Mar)
colnames(hourlySteps_Mar)
colnames(minuteCaloriesNarrow_Mar)
colnames(minuteIntensitiesNarrow_Mar)
colnames(minuteMETsNarrow_Mar)
colnames(minuteSleep_Mar)
colnames(minuteStepsNarrow_Mar)
colnames(weightLogInfo_Mar)
Then aggregate all the column names into an Excel file. The results are as below:
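One programmatic way to build that aggregation is sketched here (exported as a CSV for brevity; writexl::write_xlsx would produce an Excel file, and only two tables are listed):
# Collect the column names of each loaded table into one long data frame
tables <- list(dailyActivity_Apr = dailyActivity_Apr,
hourlyCalories_Apr = hourlyCalories_Apr)
column_overview <- purrr::imap_dfr(tables, ~ tibble(table = .y, column = names(.x)))
write_csv(column_overview, "column_overview.csv")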
Compare files with the same name across the two folders
dailyActivity_Apr and dailyActivity_Mar
# dailyActivity_Apr
# Change the time stamp from 'character' to 'Date'
dailyActivity_Apr$ActivityDate <- as.Date(dailyActivity_Apr$ActivityDate, format='%m/%d/%Y')
# Find the start date and cutoff date
start_time <- min(dailyActivity_Apr$ActivityDate)
cutoff_time <- max(dailyActivity_Apr$ActivityDate)
# dailyActivity_Mar
dailyActivity_Mar$ActivityDate <- as.Date(dailyActivity_Mar$ActivityDate, format='%m/%d/%Y')
# Find the start date and cutoff date
min(dailyActivity_Mar$ActivityDate)
max(dailyActivity_Mar$ActivityDate)
There is a data overlap on 4/12/2016, and the data in dailyActivity_Mar is incomplete for that date. Therefore, the data from the dailyActivity_Apr table will be used, as its records start from 2016-04-12 00:00:00 UTC.
hourlyCalories_Apr and hourlyCalories_Mar
# Change the time stamp from 'character' to 'datetime'. Note the format of the ActivityHour column in the original files
hourlyCalories_Apr$ActivityHour = as_datetime(hourlyCalories_Apr$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find the start datetime and the end datetime
min(hourlyCalories_Apr$ActivityHour)
max(hourlyCalories_Apr$ActivityHour)
# Change the time stamp from 'character' to 'datetime'
hourlyCalories_Mar$ActivityHour = as_datetime(hourlyCalories_Mar$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find the start datetime and the end datetime
min(hourlyCalories_Mar$ActivityHour)
max(hourlyCalories_Mar$ActivityHour)
The cutoff time for the hourlyCalories_Mar table is 04/12/2016 10:00:00 UTC, while for the hourlyCalories_Apr table, it is 05/12/2016 15:00:00 UTC.
The data in the former table is incomplete for 04/12/2016, and the latter table has partial data for 05/12/2016.
hourlyIntensities_Apr and hourlyIntensities_Mar
# Change the time stamp
hourlyIntensities_Apr$ActivityHour = as_datetime(hourlyIntensities_Apr$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find the start datetime and the end datetime
min(hourlyIntensities_Apr$ActivityHour)
max(hourlyIntensities_Apr$ActivityHour)
# Change the time stamp
hourlyIntensities_Mar$ActivityHour = as_datetime(hourlyIntensities_Mar$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find the start datetime and the end datetime
min(hourlyIntensities_Mar$ActivityHour)
max(hourlyIntensities_Mar$ActivityHour)
hourlySteps_Apr and hourlySteps_Mar
# Change the time stamp
hourlySteps_Apr$ActivityHour = as_datetime(hourlySteps_Apr$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find start time and cutoff time
min(hourlySteps_Apr$ActivityHour)
max(hourlySteps_Apr$ActivityHour)
# Change the time stamp
hourlySteps_Mar$ActivityHour = as_datetime(hourlySteps_Mar$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find start time and cutoff time
min(hourlySteps_Mar$ActivityHour)
max(hourlySteps_Mar$ActivityHour)
minuteCaloriesWide and minuteCaloriesNarrow
# Change data type of column ActivityHour
minuteCaloriesWide_Apr$ActivityHour <- mdy_hms(minuteCaloriesWide_Apr$ActivityHour)
# Find start time and cutoff time
min(minuteCaloriesWide_Apr$ActivityHour)
max(minuteCaloriesWide_Apr$ActivityHour)
# Change data type of column ActivityMinute
minuteCaloriesNarrow_Apr$ActivityMinute <- mdy_hms(minuteCaloriesNarrow_Apr$ActivityMinute)
# Find start time and cutoff time
min(minuteCaloriesNarrow_Apr$ActivityMinute)
max(minuteCaloriesNarrow_Apr$ActivityMinute)
Due to the differences in starting and cutoff times of the files, data comparison between “2016-04-13 UTC” and “2016-05-11 UTC” was conducted to inspect data consistency. Activities within this period were recorded in both tables.
# Pivot table minuteCaloriesWide_Apr to longer format then add column date and hour
df_wide_new <- minuteCaloriesWide_Apr %>%
pivot_longer(
cols = starts_with("Calories"),
names_to = "Minute",
values_to = "Calories") %>%
mutate(Minute = as.numeric(sub("Calories", "", Minute))) %>%
mutate(date = date(ActivityHour), hour = hour(ActivityHour)) %>%
mutate(key = paste(Id, date, hour, Minute, sep = "-"))
# Transform minuteCaloriesNarrow_Apr by adding column date, hour and minute
df_narrow_new <- minuteCaloriesNarrow_Apr %>%
mutate(date = date(ActivityMinute), hour = hour(ActivityMinute), minute = minute(ActivityMinute)) %>%
mutate(key = paste(Id, date, hour, minute, sep = "-"))
# Filter and pull data from df_wide_new and df_narrow_new
df_wide_check <- df_wide_new %>%
filter(date <= ymd("2016-05-11") & date >= ymd("2016-04-13")) %>%
pull(key)
df_narrow_check <- df_narrow_new %>%
filter(date <= ymd("2016-05-11") & date >= ymd("2016-04-13")) %>%
pull(key)
# Single out unmatched keys (the pulled keys are plain vectors, so use setdiff)
df_unmatched <- setdiff(df_wide_check, df_narrow_check)
length(df_unmatched)
Result: the unmatched rows all fall on 2016-05-11, so the cutoff date is set to 2016-05-10 for both data frames.
# Check whether the Calories values of the two tables match over the new period
df_wide_check_new <- df_wide_new %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
pull(Calories)
df_narrow_check_new <- df_narrow_new %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
pull(Calories)
# Check if the data vectors from two files match
if (isTRUE(all.equal(df_wide_check_new, df_narrow_check_new))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Results: the data is consistent within the period.
minuteIntensitiesNarrow and minuteIntensitiesWide
# Pivot table minuteIntensitiesWide_Apr to longer format and transform minuteIntensitiesNarrow_Apr
df_wide_check <- minuteIntensitiesWide_Apr %>%
mutate(ActivityHour = mdy_hms(ActivityHour)) %>%
pivot_longer(
cols = starts_with("Intensity"),
names_to = "Minute",
values_to = "Intensities") %>%
mutate(Minute = as.numeric(sub("Intensity", "", Minute))) %>%
mutate(date = date(ActivityHour), hour = hour(ActivityHour)) %>%
mutate(key = paste(Id, date, hour, Minute, sep = "-")) %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
pull(Intensity)
df_narrow_check <- minuteIntensitiesNarrow_Apr %>%
mutate(ActivityMinute = mdy_hms(ActivityMinute)) %>%
mutate(date = date(ActivityMinute), hour = hour(ActivityMinute), minute = minute(ActivityMinute)) %>%
mutate(key = paste(Id, date, hour, minute, sep = "-")) %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
pull(Intensity)
# Check if the data vectors from two files match
if (isTRUE(all.equal(df_wide_check, df_narrow_check))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Results: the data is consistent within the period.
minuteStepsNarrow and minuteStepsWide
# Pivot table minuteStepsWide_Apr to longer format and transform minuteStepsNarrow_Apr
df_wide_check <- minuteStepsWide_Apr %>%
mutate(ActivityHour = mdy_hms(ActivityHour)) %>%
pivot_longer(
cols = starts_with("Steps"),
names_to = "Minute",
values_to = "Steps") %>%
mutate(Minute = as.numeric(sub("Steps", "", Minute))) %>%
mutate(date = date(ActivityHour), hour = hour(ActivityHour)) %>%
mutate(key = paste(Id, date, hour, Minute, sep = "-")) %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
select(Steps)
df_narrow_check <- minuteStepsNarrow_Apr %>%
mutate(ActivityMinute = mdy_hms(ActivityMinute)) %>%
mutate(date = date(ActivityMinute), hour = hour(ActivityMinute), minute = minute(ActivityMinute)) %>%
mutate(key = paste(Id, date, hour, minute, sep = "-")) %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
select(Steps)
# Check if the data vectors from two files match
if (isTRUE(all.equal(df_wide_check, df_narrow_check))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Results: the data is consistent within the period.
minuteStepsNarrow and hourlySteps
minuteStepsNarrow_Apr$ActivityMinute <- mdy_hms(minuteStepsNarrow_Apr$ActivityMinute)
min(minuteStepsNarrow_Apr$ActivityMinute)
max(minuteStepsNarrow_Apr$ActivityMinute)
minuteStepsNarrow_Apr_Test <-minuteStepsNarrow_Apr %>%
mutate(Activity_Date = date(ActivityMinute), Activity_Hour = hour(ActivityMinute)) %>%
filter(Activity_Date != "2016-05-12") %>%
group_by(Id,Activity_Date,Activity_Hour) %>%
summarise(StepTotal = sum(Steps), .groups = 'drop') %>%
pull(StepTotal)
hourlySteps_Apr$ActivityHour <- mdy_hms(hourlySteps_Apr$ActivityHour)
min(hourlySteps_Apr$ActivityHour)
max(hourlySteps_Apr$ActivityHour)
hourlySteps_Apr_Test <- hourlySteps_Apr %>%
mutate(Activity_Date = date(ActivityHour)) %>%
filter(Activity_Date != "2016-05-12") %>%
pull(StepTotal)
# Check if the StepTotal vectors of the two data frames are the same
if (isTRUE(all.equal(minuteStepsNarrow_Apr_Test, hourlySteps_Apr_Test))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Result: the data is consistent within the period.
hourlyIntensities and minuteIntensitiesNarrow
hourlyIntensities_Apr$ActivityHour <- mdy_hms(hourlyIntensities_Apr$ActivityHour)
min(hourlyIntensities_Apr$ActivityHour)
max(hourlyIntensities_Apr$ActivityHour)
hourlyIntensities_Apr_Test <- hourlyIntensities_Apr %>%
filter(date(ActivityHour) != "2016-05-12") %>%
select(TotalIntensity)
minuteIntensitiesNarrow_Apr$ActivityMinute <- mdy_hms(minuteIntensitiesNarrow_Apr$ActivityMinute)
min(minuteIntensitiesNarrow_Apr$ActivityMinute)
max(minuteIntensitiesNarrow_Apr$ActivityMinute)
minuteIntensitiesNarrow_Apr_Test <-minuteIntensitiesNarrow_Apr %>%
mutate(Activity_Date = date(ActivityMinute), Activity_Hour = hour(ActivityMinute)) %>%
filter(Activity_Date != "2016-05-12") %>%
group_by(Id,Activity_Date,Activity_Hour) %>%
summarise(TotalIntensity = sum(Intensity), .groups = 'drop') %>%
select(TotalIntensity)
# Check if the TotalIntensities vectors of the two data frames are the same
if (isTRUE(all.equal(hourlyIntensities_Apr_Test, minuteIntensitiesNarrow_Apr_Test))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Result: the data is consistent within the period.
hourlyCalories and minuteCaloriesNarrow
#hourlyCalories_Apr
hourlyCalories_Apr$ActivityHour <- mdy_hms(hourlyCalories_Apr$ActivityHour)
min(hourlyCalories_Apr$ActivityHour)
max(hourlyCalories_Apr$ActivityHour)
hourlyCalories_Apr_Test <- hourlyCalories_Apr %>%
filter(date(ActivityHour) != "2016-05-12") %>%
select(Calories)
# minuteCaloriesNarrow_Apr
minuteCaloriesNarrow_Apr$ActivityMinute <- mdy_hms(minuteCaloriesNarrow_Apr$ActivityMinute)
min(minuteCaloriesNarrow_Apr$ActivityMinute)
max(minuteCaloriesNarrow_Apr$ActivityMinute)
minuteCaloriesNarrow_Apr_Test <-minuteCaloriesNarrow_Apr %>%
mutate(Activity_Date = date(ActivityMinute), Activity_Hour = hour(ActivityMinute)) %>%
filter(Activity_Date != "2016-05-12") %>%
group_by(Id,Activity_Date,Activity_Hour) %>%
summarise(Calories = sum(Calories), .groups = 'drop') %>%
select(Calories) %>%
mutate(Calories = round(Calories,0))
# Check if the Calories vectors of the two data frames are the same
if (isTRUE(all.equal(hourlyCalories_Apr_Test, minuteCaloriesNarrow_Apr_Test))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Result: the data is consistent within the period.
Table hourlyCalories_Apr
# Find the location of missing values
which(is.na(hourlyCalories_Apr), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlyCalories_Apr))
# Count the number of duplicate rows
sum(duplicated(hourlyCalories_Apr))
# Check the structure of the data frame
str(hourlyCalories_Apr)
# Trim whitespace from column names
colnames(hourlyCalories_Apr) <- trimws(colnames(hourlyCalories_Apr))
# Remove empty rows and columns
hourlyCalories_Apr <- remove_empty(hourlyCalories_Apr, which = c("rows", "cols"))
Table hourlyIntensities_Apr
# Find the location of missing values
which(is.na(hourlyIntensities_Apr), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlyIntensities_Apr))
# Count the number of duplicate rows
sum(duplicated(hourlyIntensities_Apr))
# Check the structure of the data frame
str(hourlyIntensities_Apr)
# Trim whitespace from column names
colnames(hourlyIntensities_Apr) <- trimws(colnames(hourlyIntensities_Apr))
# Remove empty rows and columns
hourlyIntensities_Apr <- remove_empty(hourlyIntensities_Apr, which = c("rows", "cols"))
Table hourlySteps_Apr
# Find the location of missing values
which(is.na(hourlySteps_Apr), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlySteps_Apr))
# Count the number of duplicate rows
sum(duplicated(hourlySteps_Apr))
# Check the structure of the data frame
str(hourlySteps_Apr)
# Trim whitespace from column names
colnames(hourlySteps_Apr) <- trimws(colnames(hourlySteps_Apr))
# Remove empty rows and columns
hourlySteps_Apr <- remove_empty(hourlySteps_Apr, which = c("rows", "cols"))
Table dailyActivity_Mar
# Find the location of missing values
which(is.na(dailyActivity_Mar), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(dailyActivity_Mar))
# Count the number of duplicate rows
sum(duplicated(dailyActivity_Mar))
# Check the structure of the data frame
str(dailyActivity_Mar)
# Trim whitespace from column names
colnames(dailyActivity_Mar) <- trimws(colnames(dailyActivity_Mar))
# Remove empty rows and columns
dailyActivity_Mar <- remove_empty(dailyActivity_Mar, which = c("rows", "cols"))
Table hourlyCalories_Mar
# Find the location of missing values
which(is.na(hourlyCalories_Mar), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlyCalories_Mar))
# Count the number of duplicate rows
sum(duplicated(hourlyCalories_Mar))
# Check the structure of the data frame
str(hourlyCalories_Mar)
# Trim whitespace from column names
colnames(hourlyCalories_Mar) <- trimws(colnames(hourlyCalories_Mar))
# Remove empty rows and columns
hourlyCalories_Mar <- remove_empty(hourlyCalories_Mar, which = c("rows", "cols"))
Table hourlyIntensities_Mar
# Find the location of missing values
which(is.na(hourlyIntensities_Mar), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlyIntensities_Mar))
# Count the number of duplicate rows
sum(duplicated(hourlyIntensities_Mar))
# Check the structure of the data frame
str(hourlyIntensities_Mar)
# Trim whitespace from column names
colnames(hourlyIntensities_Mar) <- trimws(colnames(hourlyIntensities_Mar))
# Remove empty rows and columns
hourlyIntensities_Mar <- remove_empty(hourlyIntensities_Mar, which = c("rows", "cols"))
Table hourlySteps_Mar
# Find the location of missing values
which(is.na(hourlySteps_Mar), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlySteps_Mar))
# Count the number of duplicate rows
sum(duplicated(hourlySteps_Mar))
# Check the structure of the data frame
str(hourlySteps_Mar)
# Trim whitespace from column names
colnames(hourlySteps_Mar) <- trimws(colnames(hourlySteps_Mar))
# Remove empty rows and columns
hourlySteps_Mar <- remove_empty(hourlySteps_Mar, which = c("rows", "cols"))
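The seven cleaning blocks above repeat the same steps, so a small helper would do the same work once. clean_table() is a name introduced here for illustration only.
# Run the standard checks on any table, then return the cleaned table
clean_table <- function(df) {
print(which(is.na(df), arr.ind = TRUE)) # locate missing values
print(sum(is.na(df))) # total missing values
print(sum(duplicated(df))) # duplicate rows
str(df) # inspect the structure
colnames(df) <- trimws(colnames(df)) # trim whitespace from column names
remove_empty(df, which = c("rows", "cols")) # drop empty rows and columns
}
# Example: hourlySteps_Mar <- clean_table(hourlySteps_Mar)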