Introduction

Bellabeat is the case study I selected as part of the Google Data Analytics Professional Certificate. In the role of a junior data analyst, I am assigned to work with the marketing analyst team. Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful company with the potential to become a larger player in the global smart device market. I have been asked to focus on one of Bellabeat’s products and analyse smart device data to gain insight into how consumers use their smart devices. The products are as follows:

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Note: This is the first data analysis case study I am uploading to the Kaggle community. Viewers, please go through my work and share your genuine feedback.

Objective

  • What are the trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help Bellabeat’s marketing strategy?

These are the key questions to be answered.

Overview

  • RStudio was used for data processing and data analysis.
  • In the analysis, I address the correlation among various variables. Correlation gives a glimpse of how strongly two variables are related.
  • Gauging the daily trend of users shows how their activity varies over the course of a day.
  • Defining user types with cluster analysis, which classifies users into categories.
  • Comparing trends among the different user types.
  • Developing recommendations based on insights from the above analysis.

Data Description

Data Structure and Type

There are 18 datasets in .csv format. They can be segmented by granularity into daily, hourly, minute and second data. I have not included the wide datasets in the analysis, as the narrow datasets are sufficient. Let’s look at the structure to get a feel for the data, for example:

str(Daily_Activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : Factor w/ 31 levels "4/12/2016","4/13/2016",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

ActivityDate is stored as a factor, so we need to convert it to Date format. The same applies to the date or date-time column of every dataset we’ll be using, so the conversion is necessary before using the data.

library(lubridate)  # provides mdy()

#Daily
Daily_Activity$ActivityDate <- mdy(Daily_Activity$ActivityDate)
Daily_Calories$ActivityDay <- mdy(Daily_Calories$ActivityDay)
Daily_Intensities$ActivityDay <- mdy(Daily_Intensities$ActivityDay)
Daily_Steps$ActivityDay <- mdy(Daily_Steps$ActivityDay)

#Hour
Hourly_Calories$ActivityHour <- as.POSIXct(Hourly_Calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
Hourly_Intensities$ActivityHour <- as.POSIXct(Hourly_Intensities$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
Hourly_Steps$ActivityHour <- as.POSIXct(Hourly_Steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")

#Minutes
Minute_Calories_Narrow$ActivityMinute <- as.POSIXct(Minute_Calories_Narrow$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
Minute_Intensities_Narrow$ActivityMinute <- as.POSIXct(Minute_Intensities_Narrow$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
Minute_METs_Narrow$ActivityMinute <- as.POSIXct(Minute_METs_Narrow$ActivityMinute, format ="%m/%d/%Y %I:%M:%S %p")
Minute_Sleep$date <- as.POSIXct(Minute_Sleep$date, format = "%m/%d/%Y %I:%M:%S %p")
Minute_Steps_Narrow$ActivityMinute <- as.POSIXct(Minute_Steps_Narrow$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")

#Seconds
Heartrate_Seconds$Time <- as.POSIXct(Heartrate_Seconds$Time, format = "%m/%d/%Y %I:%M:%S %p")

#Others
Sleep_Day$SleepDay <- as.POSIXct(Sleep_Day$SleepDay, format = "%m/%d/%Y %I:%M:%S %p")
Weight_LogInfo$Date <- as.POSIXct(Weight_LogInfo$Date, format = "%m/%d/%Y %I:%M:%S %p")
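
As a quick sanity check of the format strings used above, here is a minimal sketch on made-up strings laid out like the raw columns. The values and the fixed UTC time zone are assumptions for illustration only, and as.Date() is used as a base R stand-in for mdy().

```r
# Illustrative strings in the same layout as the raw columns (made-up values)
raw_day  <- c("4/12/2016", "5/1/2016")
raw_hour <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# Base R stand-in for lubridate::mdy() on date-only columns
parsed_day <- as.Date(raw_day, format = "%m/%d/%Y")

# %I is the hour on a 12-hour clock and %p the AM/PM marker.
# tz = "UTC" is an assumption so the result does not depend on the local time zone.
parsed_hour <- as.POSIXct(raw_hour, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(parsed_day)
print(format(parsed_hour, "%Y-%m-%d %H:%M:%S"))
```

If a format string does not match the raw text, the result is NA, so counting NA values after conversion is a cheap way to catch a wrong format.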

After specifying the correct format, we check each dataset for NA values to see whether the data is complete.

## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 65
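
A check like the one above could be sketched as follows. The two small data frames are made-up stand-ins for the real datasets, and collecting them in a named list is my own convenience, not necessarily how the counts above were produced.

```r
# Toy stand-ins for the real data frames (values are made up)
datasets <- list(
  Daily_Activity = data.frame(Id = c(1, 2, 3), TotalSteps = c(100, 200, 300)),
  Weight_LogInfo = data.frame(Id = c(1, 2, 3), Fat = c(22, NA, NA))
)

# Count missing values in each data frame
na_counts <- sapply(datasets, function(d) sum(is.na(d)))
print(na_counts)
```

Keeping the datasets in a named list means the same check runs over all of them in one call and the output is labelled by dataset name.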

Only one dataset (Weight_LogInfo) has NA values, 65 in total. The other datasets are complete. Next, we check whether the row counts of the datasets match and how many distinct values each dataset has:

## [1] 940
## [1] 940
## [1] 940
## [1] 940
## [1] 22099
## [1] 22099
## [1] 22099
## [1] 1325580
## [1] 1325580
## [1] 1325580
## [1] 188521
## [1] 1325580
## [1] 2483658
## [1] 413
## [1] 67

Now checking for distinct users in each dataset:

## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 24
## [1] 33
## [1] 14
## [1] 24
## [1] 8
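
Row counts and distinct user counts can be collected in one small table. This is a minimal sketch on made-up data; the named list and the column names of the overview table are assumptions for illustration.

```r
# Toy stand-ins for the real data frames (values are made up)
datasets <- list(
  Daily_Activity = data.frame(Id = c(1, 1, 2, 3)),
  Sleep_Day      = data.frame(Id = c(1, 1, 2))
)

# One row per dataset: total rows and number of distinct user Ids
overview <- data.frame(
  rows  = sapply(datasets, nrow),
  users = sapply(datasets, function(d) length(unique(d$Id)))
)
print(overview)
```

Seeing both numbers side by side makes it easier to spot datasets that cover fewer users than the rest.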

The row counts show that the daily, hourly and minute data are consistent within each granularity. The distinct user counts are less consistent: Minute_Sleep and Sleep_Day each have 24 distinct users, while the Heartrate and WeightLog data have 14 and 8 respectively.

Note: There are 33 users in total.

Data Summary

A quick look at the data summary gives a feel for the data.

summary(Daily_Activity)
##        Id             ActivityDate          TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Min.   :2016-04-12   Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   1st Qu.:2016-04-19   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Median :2016-04-26   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09   Mean   :2016-04-26   Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09   3rd Qu.:2016-05-04   3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09   Max.   :2016-05-12   Max.   :36019   Max.   :28.030  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900
head(Daily_Activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-04-12      13162          8.50            8.50
## 2 1503960366   2016-04-13      10735          6.97            6.97
## 3 1503960366   2016-04-14      10460          6.74            6.74
## 4 1503960366   2016-04-15       9762          6.28            6.28
## 5 1503960366   2016-04-16      12669          8.16            8.16
## 6 1503960366   2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

Feel free to explore the data summary for the other datasets.

Note: We’ll merge datasets during the analysis as and when required.

Data Analysis

Correlation Chart

Correlation shows the strength of the relationship between two variables and is expressed numerically between -1 and +1.

##                          TotalSteps TotalDistance TrackerDistance
## TotalSteps                     1.00          0.99            0.98
## TotalDistance                  0.99          1.00            1.00
## TrackerDistance                0.98          1.00            1.00
## VeryActiveDistance             0.74          0.79            0.79
## ModeratelyActiveDistance       0.51          0.47            0.47
## LightActiveDistance            0.69          0.66            0.66
##                          VeryActiveDistance ModeratelyActiveDistance
## TotalSteps                             0.74                     0.51
## TotalDistance                          0.79                     0.47
## TrackerDistance                        0.79                     0.47
## VeryActiveDistance                     1.00                     0.19
## ModeratelyActiveDistance               0.19                     1.00
## LightActiveDistance                    0.16                     0.24
##                          LightActiveDistance SedentaryActiveDistance
## TotalSteps                              0.69                    0.07
## TotalDistance                           0.66                    0.08
## TrackerDistance                         0.66                    0.07
## VeryActiveDistance                      0.16                    0.05
## ModeratelyActiveDistance                0.24                    0.01
## LightActiveDistance                     1.00                    0.10
##                          VeryActiveMinutes FairlyActiveMinutes
## TotalSteps                            0.67                0.50
## TotalDistance                         0.68                0.46
## TrackerDistance                       0.68                0.46
## VeryActiveDistance                    0.83                0.21
## ModeratelyActiveDistance              0.23                0.95
## LightActiveDistance                   0.15                0.22
##                          LightlyActiveMinutes SedentaryMinutes Calories
## TotalSteps                               0.57            -0.33     0.59
## TotalDistance                            0.52            -0.29     0.64
## TrackerDistance                          0.51            -0.29     0.65
## VeryActiveDistance                       0.06            -0.06     0.49
## ModeratelyActiveDistance                 0.16            -0.22     0.22
## LightActiveDistance                      0.89            -0.41     0.47

In the correlation matrix, I have taken the variables of the Daily_Activity dataset to see how they are related to each other. Here I treat a correlation above 0.70 as a sign that two variables are strongly correlated. From the matrix and chart above we can infer that:

  • Total Steps are highly correlated with Total Distance (0.99)
  • Very Active Distance is correlated with Total Steps (0.74) and Total Distance (0.79)
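
A matrix like the one above can be produced with cor(). The sketch below uses simulated data, so the numbers will not match the real output, and the 0.70 cut-off is the rule of thumb used in this analysis rather than a fixed standard.

```r
set.seed(1)

# Simulated stand-in for Daily_Activity (made-up values)
steps <- round(runif(50, 0, 15000))
toy <- data.frame(
  TotalSteps       = steps,
  TotalDistance    = steps * 0.0007 + rnorm(50, sd = 0.2),
  SedentaryMinutes = round(runif(50, 500, 1300))
)

# Pearson correlation matrix, rounded to two decimals
corr <- round(cor(toy), 2)
print(corr)

# List each pair once (upper triangle) where |r| exceeds 0.70
strong <- which(abs(corr) > 0.70 & upper.tri(corr), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[strong[, "row"]],
  var2 = colnames(corr)[strong[, "col"]],
  r    = corr[strong]
)
print(strong_pairs)
```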

To check whether there is any relationship between heart rate and calories, we plot the two with a fitted linear trend line.

## `geom_smooth()` using formula 'y ~ x'

## [1] 0.9721095
## 
## Call:
## lm(formula = mergedf_hea2_Cal2$Mean_heartrate ~ mergedf_hea2_Cal2$Mean_Calories)
## 
## Coefficients:
##                     (Intercept)  mergedf_hea2_Cal2$Mean_Calories  
##                         35.2381                           0.4099

  • From the plot and the correlation coefficient, the mean calories are highly correlated with the mean heart rate
  • The Pearson correlation between the two is 0.972
  • The fitted linear function is: Mean heartrate = 35.2381 + 0.4099 × (Mean calories)

This is helpful when the mean heart rate is not available, as it can be estimated from the mean calories.
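
The correlation and the fitted line can be reproduced along these lines. This is a sketch on made-up numbers; the column names mirror the merged data frame shown in the output above, but the values and the example input of 105 are assumptions.

```r
# Made-up stand-in for the merged heart rate / calories data frame
toy <- data.frame(Mean_Calories = c(90, 100, 110, 120, 130))
toy$Mean_heartrate <- 35 + 0.4 * toy$Mean_Calories + c(0.5, -0.5, 0.2, -0.3, 0.1)

# Pearson correlation between the two variables
r <- cor(toy$Mean_Calories, toy$Mean_heartrate)
print(r)

# Linear model: heart rate as a function of calories
fit <- lm(Mean_heartrate ~ Mean_Calories, data = toy)
print(coef(fit))

# Estimate the mean heart rate for a given mean calories value
estimate <- predict(fit, newdata = data.frame(Mean_Calories = 105))
print(estimate)
```

Passing the data frame through the data argument, instead of writing df$column in the formula, is what allows predict() to work on new values.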

Activity level of users in a day

The plot shows:

  • Most of the time users are inactive (Sedentary). That said, most users are active for 15%-25% of the day (activity refers to VeryActive, ModerateActive and LightActive).
  • During active time, users are mostly engaged in light activity. Also, for most users the VeryActive % is greater than the Moderate Activity %.

Note: Activity levels of 33 users are displayed in the plot. There’s no categorization of user type. You’ll find categorization of the users through the unsupervised learning method of K-means clustering later.

Daily Trend

Keeping the main objective in mind, we need to check the user trend. In the daily trend chart, we explore four variables by hour of day: heart rate, calories, intensity and METs. Some transformation and calculation was required to build the dataset for the daily trend chart.

Transformation: The hour was extracted from the date-time and converted to a factor. A data frame was then built that groups each variable by hour and takes its mean value for each hour.
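
The transformation described above could look like this in base R. This is a sketch on four made-up rows; the real data frames are much larger, and the names toy, Hour and hourly_mean are mine.

```r
# Made-up hourly calories for two days
toy <- data.frame(
  ActivityHour = as.POSIXct(
    c("2016-04-12 00:00:00", "2016-04-12 18:00:00",
      "2016-04-13 00:00:00", "2016-04-13 18:00:00"),
    tz = "UTC"
  ),
  Calories = c(60, 140, 70, 160)
)

# Extract the hour of day and store it as a factor
toy$Hour <- factor(format(toy$ActivityHour, "%H"))

# Mean of the variable for each hour of the day
hourly_mean <- aggregate(Calories ~ Hour, data = toy, FUN = mean)
print(hourly_mean)
```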

For the METs dataset, a function was written to categorise the values into 4 intensity levels:

  • 1 (No activity) when METs is between 0 and 1
  • 2 (Light Intensity) when METs is between 1.1 and 2.9
  • 3 (Moderate Intensity) when METs is between 3 and 5.9
  • 4 (Vigorous Intensity) when METs is above 6

After categorisation, the same transformation and calculation described above were applied.
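
A categorisation function along these lines could be written with cut(). This is a sketch, not the exact function used in the analysis: the name categorise_mets is hypothetical, and treating a value of exactly 6 as vigorous is my assumption, since the ranges above do not say where 6 itself belongs.

```r
# Hypothetical helper: map MET values to the 4 intensity levels
categorise_mets <- function(mets) {
  cut(
    mets,
    breaks = c(0, 1.1, 3, 6, Inf),
    labels = c("No activity", "Light", "Moderate", "Vigorous"),
    right  = FALSE  # intervals are [0, 1.1), [1.1, 3), [3, 6), [6, Inf)
  )
}

# Made-up MET values around each boundary
mets   <- c(0.5, 1, 1.1, 2.9, 3, 5.9, 6, 8)
levels <- categorise_mets(mets)
print(data.frame(mets = mets, level = as.integer(levels), label = levels))
```

as.integer() on the result gives the numeric level 1 to 4, which can then be grouped by hour in the same way as the other variables.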

Now we plot the graph

These are some interesting findings that can be inferred from the Daily trend plot:

  • Across all 4 graphs, the values are lowest between midnight and 4 am.
  • The values peak around 6 pm to 8 pm. This suggests users are most active around 6 pm-8 pm, when calories burned and heart rate are highest.

The above graph shows the summed intensity trend of users by intensity level. A similar inference can be drawn from this plot.

Cluster Analysis

As there was no categorization of user types in the dataset, I have used an unsupervised learning technique, K-means clustering, to find out how these 33 users group into different categories.

I created the clusters based on activity distance. There are 4 activity distances:

  • VeryActive_Distance
  • ModerateActive_Distance
  • LightActive_Distance
  • Sedentary_Distance

Note: Sedentary distance was excluded, as it occurs while you are inactive or seated (refer to head(Daily_Activity) in the Data Summary, where it is 0).

Extraction and data manipulation steps:

  • For each of the 33 unique users (grouped by Id), extracted the mean total distance and the mean distance per activity level.

  • Calculated each activity distance as a percentage of the total distance.

  • Calculation:
      VA_Distance_percent = mean(VeryActiveDistance) / mean(TotalDistance) * 100
      MA_Distance_percent = mean(ModeratelyActiveDistance) / mean(TotalDistance) * 100
      LA_Distance_percent = mean(LightActiveDistance) / mean(TotalDistance) * 100
    Note: VA_Distance + MA_Distance + LA_Distance is approximately equal to Total Distance (refer to head(Daily_Activity) in the Data Summary).

    This normalises the data into the share of the total distance covered at each activity level.
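
The steps above can be sketched in base R as follows. The four rows are made up, and the intermediate name means is mine; the resulting data frame is called df to match the kmeans() call used later.

```r
# Made-up daily records for two users
toy <- data.frame(
  Id                       = c(1, 1, 2, 2),
  TotalDistance            = c(8, 6, 4, 2),
  VeryActiveDistance       = c(4, 3, 0, 0),
  ModeratelyActiveDistance = c(1, 1, 1, 0),
  LightActiveDistance      = c(3, 2, 3, 2)
)

# Mean of every distance column per user
means <- aggregate(. ~ Id, data = toy, FUN = mean)

# Share of the mean total distance covered at each activity level
df <- data.frame(
  VA_Distance_percent = means$VeryActiveDistance       / means$TotalDistance * 100,
  MA_Distance_percent = means$ModeratelyActiveDistance / means$TotalDistance * 100,
  LA_Distance_percent = means$LightActiveDistance      / means$TotalDistance * 100
)
print(df)
```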

To choose the number of clusters I used the elbow method, a heuristic for determining the number of clusters in cluster analysis. It consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. Based on this, I chose 3 clusters for K-means.
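
An elbow graph can be drawn in base R as sketched below. It plots the total within-cluster sum of squares, which is the unexplained counterpart of the explained variation, so the curve falls rather than rises and the elbow is where it flattens. The data is simulated with three well-separated groups, and the range of 1 to 6 clusters is an arbitrary choice for illustration.

```r
set.seed(42)

# Simulated 2-D data with three well-separated groups (made-up values)
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# The elbow is the k after which the curve stops dropping sharply
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```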

#Kmeans
km.out <- kmeans(df, centers = 3, nstart = 100)
print(km.out)
## K-means clustering with 3 clusters of sizes 6, 15, 12
## 
## Cluster means:
##   VA_Distance_percent MA_Distance_percent LA_Distance_percent
## 1           50.772950           10.839925            38.15899
## 2            6.762606            5.624851            86.39107
## 3           24.502970           14.320024            58.99094
## 
## Clustering vector:
##  [1] 3 3 3 2 2 3 2 2 3 2 2 3 2 2 2 3 2 2 2 3 1 2 2 1 3 3 1 1 1 3 3 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 1365.847 1781.477 2088.814
##  (between_SS / total_SS =  79.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
#visualizing (fviz_cluster() comes from the factoextra package)
km.cluster <- km.out$cluster
fviz_cluster(list(data = df, cluster = km.cluster))

Three clusters were formed, with sizes 6, 15 and 12.

From the cluster means we can infer that:

  • Cluster 1 covers about 50.77% of its distance as VA_Distance and 38.16% as LA_Distance
  • Cluster 2 covers 86.39% of its distance as LA_Distance
  • Cluster 3 covers 24.50% of its distance as VA_Distance, 58.99% as LA_Distance and 14.32% as MA_Distance

Based on this, we can name the clusters:

  • 1 as Runners
  • 2 as Walkers
  • 3 as Joggers

The cluster plot shows the cluster centroids and which cluster each user falls into. The users are clearly separated by cluster.

After naming the clusters, the users are classified into 3 categories.

## # A tibble: 6 x 5
##   VA_Distance_percent MA_Distance_percent LA_Distance_percent cluster cluster_name
##                 <dbl>               <dbl>               <dbl>   <int> <fct>       
## 1              36.6                 10.2                 53.2       3 Joggers     
## 2              24.0                  9.21                66.6       3 Joggers     
## 3              13.8                 18.0                 68.2       3 Joggers     
## 4               0.492                2.87                96.6       2 Walkers     
## 5              15.1                  4.93                79.9       2 Walkers     
## 6              30.0                  8.91                61.1       3 Joggers
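
Attaching the names to the cluster numbers can be done with factor(). The sketch below uses the clustering vector printed in the K-means output above. Note that K-means numbers its clusters arbitrarily, so this mapping from number to name is only valid for that particular run.

```r
# Clustering vector as printed in the K-means output above
cluster <- c(3, 3, 3, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2,
             2, 2, 3, 1, 2, 2, 1, 3, 3, 1, 1, 1, 3, 3, 2, 1)

# Map cluster numbers to the names chosen above
cluster_name <- factor(cluster,
                       levels = c(1, 2, 3),
                       labels = c("Runners", "Walkers", "Joggers"))

# Number of users in each named group
group_sizes <- table(cluster_name)
print(group_sizes)
```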

Now that the clusters are defined, we can check the mean calories burned by each group.

Runners burn more calories than Joggers and Walkers. Runners use about 42% more calories than Walkers.

Here is another plot showing the mean distance covered per day by each group.

Runners cover more than 7 miles, and Joggers cover more than 6 but a little less than 7 miles, whereas Walkers cover a mean distance of less than 4 miles in a day.

Recommendations

  • User tiers can be developed, where users can move up to a higher tier or down to a lower one. By my classification, there can be three such tiers: Walkers, Joggers and Runners (with more considerable user data and more complex techniques such as ANN, the classification would be more effective). Users can be classified based on the miles they achieve: users covering more than 7 miles and burning more than 2500 calories daily would be tier Runners; users covering more than 4 but less than 7 miles and burning more than 2200 calories would be tier Joggers; users covering less than 4 miles and burning fewer than 2200 calories would be tier Walkers. A set of achievable tasks should be developed for each tier.
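
The tier rule described above could be prototyped as a small function. This is only a sketch: the name assign_tier is hypothetical, the example users are made up, and combinations that fit none of the three rules (for example 5 miles with 2000 calories) are returned as NA, because the rules as stated do not cover them.

```r
# Hypothetical helper: assign a tier from daily miles and calories
assign_tier <- function(miles, calories) {
  ifelse(miles > 7 & calories > 2500, "Runners",
  ifelse(miles > 4 & miles < 7 & calories > 2200, "Joggers",
  ifelse(miles < 4 & calories < 2200, "Walkers",
         NA_character_)))  # not covered by the stated rules
}

# Made-up example users
tiers <- assign_tier(miles    = c(8, 5, 3, 5),
                     calories = c(2600, 2300, 2000, 2000))
print(tiers)
```

The NA cases show where the thresholds would need to be extended before the tiers could be rolled out to all users.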

  • In the early morning, the Runners category shows a first spike of 95 bpm while steps in that time frame are only 0-7 per minute. This raises the question of what activity other than running they were doing at that time. The activity type should be logged in the data. To accomplish this, we can create activity tags with which users select the type of activity they performed during high-intensity, moderate-intensity and light-intensity periods. This will help users identify their activity intensity levels. The data will also be useful in developing recommended activities for different intensity levels.