Introduction:

This case study is part of Google’s Data Analytics certification. Bellabeat aims to empower women by providing them with the data to discover themselves. In this case study, I will analyze 30 fitbit users’ data in order to gain insight into how users are using their smart devices. This will enable me to make recommendations to the marketing analytics team, the executive team, and the cofounder and Chief Creative Officer so that they can implement better marketing strategies.

Business Task

Business Task: Gain insight into how people use their smart devices and apply those trends to a Bellabeat product in order to influence Bellabeat marketing strategy

Key Stakeholders: Marketing Analytics Team, Executive Ream, Cofounder and Chief Creative Officer

Information About the Data set

Sources: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius)

Information about Data set: This dataset contains personal tracker data from a small sample size of fitbit users stored in 18 .csv files. The data is organized in wide format where each user’s information is stored in one row with multiple columns to hold various attributes. Data was collected through a survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016.

Limitations:

Based on the current products of Bellabeats and their dominant focus on activity, sleep, and stress, I’ve decided to use data from the dailyActivity_merged, sleepDay_merged, and weightLogInfo_merged datasets.

loading packages and libraries

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:lubridate':
## 
##     stamp

importing and reading data

dailyActivity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dailySleep <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weightLogInfo <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(dailyActivity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(dailySleep)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weightLogInfo)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

exploring the data

In order to get a better understanding of the organization and contents of the data file, I’ve chosen to see a glimpse of the data.

glimpse(dailyActivity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(dailySleep)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
glimpse(weightLogInfo)
## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2…
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…

The total number of rows for each data file is:

Within the dailySleep file, the column for SleepDay contains both a date and a time so it would be best to separate them. The weightLogInfo file appears to have some missing or unrecorded values.

I also know that each of the data files has an ID column, so I can use that to find the dinstint ID values for each data file:

finding how many distinct ID’s are there

n_distinct(dailyActivity$Id)
## [1] 33
n_distinct(dailySleep$Id)
## [1] 24
n_distinct(weightLogInfo$Id)
## [1] 8

Prior to cleaning the dataset, there are:

Since the dailySleep contains a column(sleepDay) that combines the date and time, it is best to separate them.

sleep_daily <- (dailySleep) %>% 
  separate(SleepDay, c("Date","Time")," ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
glimpse(sleep_daily)
## Rows: 413
## Columns: 6
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ Date               <chr> "4/12/2016", "4/13/2016", "4/15/2016", "4/16/2016",…
## $ Time               <chr> "12:00:00", "12:00:00", "12:00:00", "12:00:00", "12…
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

Now that Date and Time have been separated into individual columns, we can continue to explore the data.

Activity

dailyActivity %>% 
  select(TotalSteps, TotalDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories) %>% 
  summary()
##    TotalSteps    TotalDistance    VeryActiveMinutes FairlyActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :  0.00    Min.   :  0.00     
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:  0.00    1st Qu.:  0.00     
##  Median : 7406   Median : 5.245   Median :  4.00    Median :  6.00     
##  Mean   : 7638   Mean   : 5.490   Mean   : 21.16    Mean   : 13.56     
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 32.00    3rd Qu.: 19.00     
##  Max.   :36019   Max.   :28.030   Max.   :210.00    Max.   :143.00     
##  LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.0        Min.   :   0.0   Min.   :   0  
##  1st Qu.:127.0        1st Qu.: 729.8   1st Qu.:1828  
##  Median :199.0        Median :1057.5   Median :2134  
##  Mean   :192.8        Mean   : 991.2   Mean   :2304  
##  3rd Qu.:264.0        3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :518.0        Max.   :1440.0   Max.   :4900

Sleep

sleep_daily %>% 
  select(TotalMinutesAsleep, TotalTimeInBed, TotalSleepRecords) %>% 
  summary()
##  TotalMinutesAsleep TotalTimeInBed  TotalSleepRecords
##  Min.   : 58.0      Min.   : 61.0   Min.   :1.000    
##  1st Qu.:361.0      1st Qu.:403.0   1st Qu.:1.000    
##  Median :433.0      Median :463.0   Median :1.000    
##  Mean   :419.5      Mean   :458.6   Mean   :1.119    
##  3rd Qu.:490.0      3rd Qu.:526.0   3rd Qu.:1.000    
##  Max.   :796.0      Max.   :961.0   Max.   :3.000

Weight

weightLogInfo %>% 
  select(WeightPounds, BMI) %>% 
  summary()
##   WeightPounds        BMI       
##  Min.   :116.0   Min.   :21.45  
##  1st Qu.:135.4   1st Qu.:23.96  
##  Median :137.8   Median :24.39  
##  Mean   :158.8   Mean   :25.19  
##  3rd Qu.:187.5   3rd Qu.:25.56  
##  Max.   :294.3   Max.   :47.54

This summary tells us that the:

Finding duplicates

sum(duplicated(dailyActivity))
## [1] 0
sum(duplicated(dailySleep))
## [1] 3
sum(duplicated(weightLogInfo))
## [1] 0

Removing 3 duplicates from sleep_daily data file

sleepNoDup <- sleep_daily[!duplicated(sleep_daily),]

Checking to make sure there is no duplicates

sum(duplicated(sleepNoDup))
## [1] 0

Finding NA Values

sum(is.na(dailyActivity))
## [1] 0
sum(is.na(dailySleep))
## [1] 0
sum(is.na(weightLogInfo))
## [1] 65

Removing NA Values in weightLogInfo

weightClean <- na.omit(weightLogInfo)

Checking to ensure NA is removed

sum(is.na(weightClean))
## [1] 0

Merging Activity and Sleep

Most of the needed information is in the dailyActivity file so I’ve decided to merge together that file with the sleep_daily file as well as merging the combined file with the weightLogInfo file.

mergedSleepActivity <- merge(dailyActivity,sleepNoDup, by = c("Id"))
summary(mergedSleepActivity)
##        Id            ActivityDate         TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Length:12348       Min.   :    0   Min.   : 0.000  
##  1st Qu.:3.977e+09   Class :character   1st Qu.: 4660   1st Qu.: 3.160  
##  Median :4.703e+09   Mode  :character   Median : 8585   Median : 6.120  
##  Mean   :5.021e+09                      Mean   : 8108   Mean   : 5.722  
##  3rd Qu.:6.962e+09                      3rd Qu.:11317   3rd Qu.: 7.920  
##  Max.   :8.792e+09                      Max.   :22988   Max.   :17.950  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 3.160   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 6.120   Median :0.0000           Median : 0.530    
##  Mean   : 5.715   Mean   :0.1215           Mean   : 1.397    
##  3rd Qu.: 7.880   3rd Qu.:0.0000           3rd Qu.: 2.310    
##  Max.   :17.950   Max.   :4.9421           Max.   :13.400    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.0000000      
##  1st Qu.:0.0000           1st Qu.: 2.350      1st Qu.:0.0000000      
##  Median :0.4000           Median : 3.540      Median :0.0000000      
##  Mean   :0.7309           Mean   : 3.532      Mean   :0.0006795      
##  3rd Qu.:1.0000           3rd Qu.: 4.830      3rd Qu.:0.0000000      
##  Max.   :6.4800           Max.   :10.300      Max.   :0.1100000      
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:144.0        1st Qu.: 659.0  
##  Median :  8.00    Median : 10.00      Median :200.0        Median : 734.0  
##  Mean   : 23.94    Mean   : 17.34      Mean   :199.8        Mean   : 799.4  
##  3rd Qu.: 36.00    3rd Qu.: 24.00      3rd Qu.:258.0        3rd Qu.: 853.0  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##     Calories        Date               Time           TotalSleepRecords
##  Min.   :   0   Length:12348       Length:12348       Min.   :1.000    
##  1st Qu.:1776   Class :character   Class :character   1st Qu.:1.000    
##  Median :2158   Mode  :character   Mode  :character   Median :1.000    
##  Mean   :2323                                         Mean   :1.122    
##  3rd Qu.:2859                                         3rd Qu.:1.000    
##  Max.   :4900                                         Max.   :3.000    
##  TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.:402.0  
##  Median :432.0      Median :462.0  
##  Mean   :419.1      Mean   :458.2  
##  3rd Qu.:492.0      3rd Qu.:526.0  
##  Max.   :796.0      Max.   :961.0
glimpse(mergedSleepActivity)
## Rows: 12,348
## Columns: 20
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "5/7/2016", "5/7/2016", "5/7/2016", "5/7/2016…
## $ TotalSteps               <dbl> 11992, 11992, 11992, 11992, 11992, 11992, 119…
## $ TotalDistance            <dbl> 7.71, 7.71, 7.71, 7.71, 7.71, 7.71, 7.71, 7.7…
## $ TrackerDistance          <dbl> 7.71, 7.71, 7.71, 7.71, 7.71, 7.71, 7.71, 7.7…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 2.46, 2.46, 2.46, 2.46, 2.46, 2.46, 2.46, 2.4…
## $ ModeratelyActiveDistance <dbl> 2.12, 2.12, 2.12, 2.12, 2.12, 2.12, 2.12, 2.1…
## $ LightActiveDistance      <dbl> 3.13, 3.13, 3.13, 3.13, 3.13, 3.13, 3.13, 3.1…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 3…
## $ FairlyActiveMinutes      <dbl> 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 46, 4…
## $ LightlyActiveMinutes     <dbl> 175, 175, 175, 175, 175, 175, 175, 175, 175, …
## $ SedentaryMinutes         <dbl> 833, 833, 833, 833, 833, 833, 833, 833, 833, …
## $ Calories                 <dbl> 1821, 1821, 1821, 1821, 1821, 1821, 1821, 182…
## $ Date                     <chr> "4/12/2016", "4/13/2016", "4/15/2016", "4/16/…
## $ Time                     <chr> "12:00:00", "12:00:00", "12:00:00", "12:00:00…
## $ TotalSleepRecords        <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep       <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ TotalTimeInBed           <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, …

Visualization

Distance vs Calories

ggplot(data = mergedSleepActivity, aes(x=TotalDistance, y = Calories)) +
  geom_point() + geom_smooth(method="loess") + 
  labs(title = "Calories vs. TotalDistance")
## `geom_smooth()` using formula 'y ~ x'

This plot shows a fairly positive correlation between the TotalDistance and the Calories burnt, which makes sense since the longer you walk.

Checking for correlation coefficient

cor.test(mergedSleepActivity$TotalDistance, mergedSleepActivity$Calories, method = "pearson", conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  mergedSleepActivity$TotalDistance and mergedSleepActivity$Calories
## t = 73.306, df = 12346, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5382867 0.5628674
## sample estimates:
##       cor 
## 0.5506964

A r value of 0.55 shows a positive correlation,which makes sense since the further you walk, the more calories you burn.

Level of Activity vs Calories

very_active <- ggplot(data = mergedSleepActivity, aes(x = VeryActiveMinutes, y = Calories)) + 
  geom_point(color = 'red') + geom_smooth(method = loess) + 
  labs(title = "Calories vs Very Active Minutes", x = "Very Active", y = "Calories") 
fairly_active <- ggplot(data = mergedSleepActivity, aes(x = FairlyActiveMinutes, y = Calories)) + 
  geom_point(color = 'blue') + geom_smooth(method = loess) + 
  labs(title = "Calories vs Fairly Actiive Minutes", x = "Fairly Active",y = "Calories")
lightly_active<- ggplot(data = mergedSleepActivity, aes(x = LightlyActiveMinutes, y = Calories)) + 
  geom_point(color = 'yellow') + geom_smooth(method = loess) + 
  labs(title = "Calories vs Lightly Active Minutes", x = "Lightly Active", y = "Calories")
sedentary_minutes <- ggplot(data = mergedSleepActivity, aes(x=SedentaryMinutes, y = Calories)) +
  geom_point(color = 'orange') + geom_smooth(method = loess) +
  labs(title = "Calories vs Sedentary Minutes", x = "Sedentary Minutes", y = "Calories")

plot_grid(very_active,fairly_active,lightly_active, sedentary_minutes)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'