INTRODUCTION

Bellabeat is a high-tech company that manufactures health-focused smart products for women. The products were beautifully designed by Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer, who drew on her background as an artist. They collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.

ASK

Urška Sršen, one of the stakeholders, asked me, as a junior data analyst at the company, to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She also wants me to select one Bellabeat product to apply these insights to in my presentation.

PREPARE

I used a public dataset, suggested by Urška Sršen, that explores smart device users’ daily habits. Here is the dataset link. The dataset was uploaded by Mobius; it is not copyrighted, and its license allows anyone to use it for free.

This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

About the data: it contains a total of 18 wide datasets, generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016, with various records of participants’ activity and fitness data. I downloaded the files and saved them in a local folder on my laptop, used Import Dataset - From Text (readr) to import them into RStudio Desktop, and assigned appropriate new names to them.

Loaded necessary packages to make the analysis run smoothly

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(janitor)
## Warning: package 'janitor' was built under R version 4.2.2
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(Tmisc)
## Warning: package 'Tmisc' was built under R version 4.2.2
library(readr)

Loaded datasets into working environment

daily_activity <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/dailyActivity_merged.csv")
hourly_calories <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/hourlyCalories_merged.csv")
daily_sleep <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/sleepDay_merged.csv")
hourly_steps <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/hourlySteps_merged.csv")
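The four read.csv() calls above repeat the same folder path. A minimal sketch of one way to avoid that repetition follows. The names data_dir and load_fitbit() are hypothetical and not part of the original analysis, and the demo writes a tiny stand-in CSV to a temporary folder so that the snippet runs without the real files.

```r
# Hypothetical helper: read one of the Fitbit CSV files from a given folder.
load_fitbit <- function(data_dir, file_name) {
  read.csv(file.path(data_dir, file_name), stringsAsFactors = FALSE)
}

# Demo with a stand-in file in a temporary folder (not the real dataset).
data_dir <- tempdir()
write.csv(
  data.frame(Id = c(1503960366, 1503960366), TotalSteps = c(13162, 10735)),
  file.path(data_dir, "dailyActivity_merged.csv"),
  row.names = FALSE
)

daily_activity_demo <- load_fitbit(data_dir, "dailyActivity_merged.csv")
str(daily_activity_demo)
```

With the real files, data_dir would point at the folder used in the calls above.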

Cleaning the datasets

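Cleaning the datasets with the clean_names() function from the janitor package makes the column names unique and consistent, containing only letters, numbers and underscores, e.g. clean_names(daily_activity), clean_names(hourly_calories), etc. Note that clean_names() returns a new dataframe rather than modifying its input, so the result has to be assigned back for the cleaned names to persist; the outputs that follow still show the original column names.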
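As an illustration of how clean_names() behaves, here is a small sketch on a toy dataframe (the names demo and demo_clean are made up for this example). It shows that the cleaned names only persist when the result is assigned.

```r
library(janitor)

# Toy dataframe with the same CamelCase naming style as the Fitbit files.
demo <- data.frame(Id = 1, ActivityDate = "4/12/2016", TotalSteps = 13162)

# Calling clean_names() without assignment prints a cleaned copy;
# `demo` itself keeps its original column names.
clean_names(demo)
names(demo)        # "Id" "ActivityDate" "TotalSteps"

# Assign the result to keep the cleaned names.
demo_clean <- clean_names(demo)
names(demo_clean)  # "id" "activity_date" "total_steps"
```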

Get a glimpse of the kind of data contained in each of the loaded datasets

glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "04/12/2016", "4/13/2016", "4/14/2016", "4/15…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

The daily activity dataframe contains 15 columns and 940 observations.

glimpse(hourly_calories)
## Rows: 22,099
## Columns: 3
## $ Id           <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ Calories     <int> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …

The hourly calories dataframe contains 3 columns and 22,099 observations.

glimpse(daily_sleep)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

The daily sleep dataframe contains 5 columns and 413 observations.

glimpse(hourly_steps)
## Rows: 22,099
## Columns: 3
## $ Id           <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ StepTotal    <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 2…

The hourly steps dataframe contains 3 columns and 22,099 observations.
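The glimpse() output shows that ActivityDate, ActivityHour and SleepDay are stored as character (<chr>) columns. Any time-based grouping or joining needs them as real dates first. The sketch below is one possible way to convert them with lubridate, which is already loaded; the variable names are illustrative and the sample values are copied from the output above.

```r
library(lubridate)

# Sample values copied from the glimpse() output.
activity_date <- c("4/13/2016", "4/14/2016")
activity_hour <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM")

# mdy() parses month/day/year strings into Date objects.
parsed_date <- mdy(activity_date)

# mdy_hms() parses month/day/year plus a clock time into POSIXct (UTC by default).
parsed_hour <- mdy_hms(activity_hour)

class(parsed_date)
parsed_hour
```

On the real dataframes the same calls could be applied inside mutate(), for example to add a parsed date column alongside the original text column.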

Inspecting column names of each loaded dataframe.

colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(hourly_calories)
## [1] "Id"           "ActivityHour" "Calories"
colnames(daily_sleep)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(hourly_steps)
## [1] "Id"           "ActivityHour" "StepTotal"

After inspecting the column names, I realized that the datasets have one column name in common, the “Id” column. This means that the dataframes can be joined on the “Id” column to look for possible trends.
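Because the daily tables hold one row per user per day, joining on “Id” alone would pair each of a user's activity days with every one of that user's sleep days. A hedged sketch of a join on both “Id” and a date column is shown below. It uses toy stand-ins for daily_activity and daily_sleep and assumes a parsed date column named date, which does not exist in the raw files.

```r
library(dplyr)

# Toy stand-ins with an already-parsed `date` column (an assumption of this sketch).
activity <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9762)
)
sleep <- data.frame(
  Id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384)
)

# Join on user and day so that each result row is one user on one day.
# inner_join() keeps only user-days present in both tables.
activity_sleep <- inner_join(activity, sleep, by = c("Id", "date"))
activity_sleep
```

Users without sleep records drop out of an inner join; left_join() would keep them with NA in the sleep columns instead.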

Using the n_distinct() function to detect how many unique participants recorded their activities.

n_distinct(daily_activity$Id)
## [1] 33

Running the above code revealed a discrepancy: there were 33 users in the daily activity dataset, as opposed to the initial claim from the data uploader, who stated that “the data set contains personal fitness tracker from thirty fitbit users”.

n_distinct(hourly_calories$Id)
## [1] 33

Running the above code revealed the same discrepancy: there were 33 users in the hourly calories dataset, as opposed to the thirty users claimed by the data uploader.

n_distinct(daily_sleep$Id)
## [1] 24

Running the above code revealed that just 24 of the 33 unique users recorded their sleep information.

n_distinct(hourly_steps$Id)
## [1] 33

Running the above code revealed the same discrepancy: there were 33 users in the hourly steps dataset, as opposed to the thirty users claimed by the data uploader.
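The four n_distinct() checks can also be run in a single pass, and the users missing from one table can be listed explicitly. The sketch below uses toy stand-ins rather than the real dataframes; datasets, user_counts and missing_sleep are illustrative names.

```r
library(dplyr)

# Toy stand-ins; the real dataframes are loaded earlier in the report.
datasets <- list(
  daily_activity = data.frame(Id = c(1, 1, 2, 3)),
  daily_sleep    = data.frame(Id = c(1, 1, 2))
)

# Count unique participants in every dataframe with one call.
user_counts <- sapply(datasets, function(df) n_distinct(df$Id))
user_counts

# Ids that appear in the activity data but never in the sleep data.
missing_sleep <- base::setdiff(datasets$daily_activity$Id, datasets$daily_sleep$Id)
missing_sleep
```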

I cleaned the data further by removing observations that contain NA cells, using “daily_activity %>% filter_all(all_vars(!is.na(.)))”, “hourly_calories %>% filter_all(all_vars(!is.na(.)))”, “daily_sleep %>% filter_all(all_vars(!is.na(.)))” and “hourly_steps %>% filter_all(all_vars(!is.na(.)))”.
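For reference, tidyr's drop_na() removes incomplete rows more compactly than the filter_all() form, which is superseded in current dplyr. This is a sketch on a toy dataframe, not the code used for the original analysis. As with clean_names(), the result must be assigned for the cleaning to take effect.

```r
library(dplyr)
library(tidyr)

# Toy dataframe with one incomplete row.
demo <- data.frame(
  Id = c(1, 2, 3),
  TotalSteps = c(13162, NA, 9762)
)

# drop_na() removes every row that has an NA in any column.
demo_complete <- drop_na(demo)

# The filter_all() form from the text gives the same number of rows.
demo_complete_old <- demo %>% filter_all(all_vars(!is.na(.)))

nrow(demo)           # 3
nrow(demo_complete)  # 2
```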

NOTE

A good data source should be Reliable, Original, Comprehensive, Current, and Cited. In the case of the data available for this case study, reliability is low because it covers just 33 users; a larger sample would have been better. It was supplied by a third party (Amazon Mechanical Turk), so it is safe to say it is not original. It is neither comprehensive nor current: it is a 2016 dataset, collected 7 years ago. It is, however, cited: the source and its license were stated.

PROCESS AND ANALYZE

SHARE

Visualizations & Plots using R ggplot2 package.

Presented graphs from above via Google Slides to stakeholders (skipped displaying above plots again).

My analysis is complete. Despite some positive trends in the plotted graphs, which analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices, I realized that the data is not reliable: the data provider stated that there were 30 participants in the dataset, but I found 3 more. The data was obtained from a third party (Amazon Mechanical Turk), so it is safe to say it is not original, and it is neither comprehensive nor current, as it is a 2016 dataset.

Based on the above summary, I concluded that the dataset is of poor quality and outdated (released 7 years ago), so it is not advisable to make business recommendations based on these findings.

Recommendation(s)

  • I’d advise the Bellabeat stakeholders, Urška Sršen and Sando Mur, to obtain a larger, original and current dataset to work with, in order to drive positive business decisions that lead to profitable outcomes.