Bellabeat is the case study I've selected as part of the Google Data Analytics Professional Certificate. In the role of junior data analyst, I'm assigned to work with the marketing analytics team. Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful company with the potential to become a larger player in the global smart device market. I've been asked to focus on one of Bellabeat's products and analyse smart device data to gain insight into how consumers use their smart devices. The products are as follows:
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
Note: This is the very first data analysis case study I'm uploading to the Kaggle community. Please go through my work and share your genuine feedback.
These are the key questions to be answered.
There are 18 datasets in .csv format, which can be segmented into daily, hourly, minute and second granularity. A few wide-format datasets were not included in the analysis, as the narrow datasets are sufficient to move forward. Now let's look at the structure to get a feel for the data. For example:
str(Daily_Activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : Factor w/ 31 levels "4/12/2016","4/13/2016",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
We see that ActivityDate is of factor type, so we need to convert it to Date format. The same is true of the date/time columns across every dataset we'll be using, so it's necessary to fix the formats before working with the data.
library(lubridate)   # mdy() below comes from the lubridate package

#Daily
Daily_Activity$ActivityDate <- mdy(Daily_Activity$ActivityDate)
Daily_Calories$ActivityDay <- mdy(Daily_Calories$ActivityDay)
Daily_Intensities$ActivityDay <- mdy(Daily_Intensities$ActivityDay)
Daily_Steps$ActivityDay <- mdy(Daily_Steps$ActivityDay)
#Hour
Hourly_Calories$ActivityHour <- as.POSIXct(Hourly_Calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
Hourly_Intensities$ActivityHour <- as.POSIXct(Hourly_Intensities$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
Hourly_Steps$ActivityHour <- as.POSIXct(Hourly_Steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
#Minutes
Minute_Calories_Narrow$ActivityMinute <- as.POSIXct(Minute_Calories_Narrow$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
Minute_Intensities_Narrow$ActivityMinute <- as.POSIXct(Minute_Intensities_Narrow$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
Minute_METs_Narrow$ActivityMinute <- as.POSIXct(Minute_METs_Narrow$ActivityMinute, format ="%m/%d/%Y %I:%M:%S %p")
Minute_Sleep$date <- as.POSIXct(Minute_Sleep$date, format = "%m/%d/%Y %I:%M:%S %p")
Minute_Steps_Narrow$ActivityMinute <- as.POSIXct(Minute_Steps_Narrow$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
#Seconds
Heartrate_Seconds$Time <- as.POSIXct(Heartrate_Seconds$Time, format = "%m/%d/%Y %I:%M:%S %p")
#Others
Sleep_Day$SleepDay <- as.POSIXct(Sleep_Day$SleepDay, format = "%m/%d/%Y %I:%M:%S %p")
Weight_LogInfo$Date <- as.POSIXct(Weight_LogInfo$Date, format = "%m/%d/%Y %I:%M:%S %p")
Now, after specifying the correct formats, we need to check for any NAs and see whether the data is complete.
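The exact NA-check calls aren't reproduced in this write-up; a minimal sketch of how the counts below can be produced, one call per data frame, is:

# Count missing values in each data frame
sum(is.na(Daily_Activity))
sum(is.na(Daily_Calories))
sum(is.na(Daily_Intensities))
sum(is.na(Daily_Steps))
sum(is.na(Hourly_Calories))
# ...and so on for the remaining datasets, ending with Sleep_Day and Weight_LogInfo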
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 65
We see that only one dataset (Weight_LogInfo) has missing values, 65 of them; the other datasets look complete. Further, we need to investigate whether the row counts of related datasets match up and how many records each dataset contains:
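The counts below were most likely produced with nrow(); a minimal sketch is:

# Number of rows in each dataset (daily, hourly, minute, second and others)
nrow(Daily_Activity)
nrow(Daily_Calories)
nrow(Hourly_Calories)
nrow(Minute_Calories_Narrow)
# ...and so on for the remaining datasets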
## [1] 940
## [1] 940
## [1] 940
## [1] 940
## [1] 22099
## [1] 22099
## [1] 22099
## [1] 1325580
## [1] 1325580
## [1] 1325580
## [1] 188521
## [1] 1325580
## [1] 2483658
## [1] 413
## [1] 67
Now checking for distinct users in each dataset:
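A sketch of this check, using n_distinct() from dplyr on the Id column of each dataset, is:

library(dplyr)

# Number of distinct users (Id) in each dataset
n_distinct(Daily_Activity$Id)
n_distinct(Minute_Sleep$Id)
n_distinct(Heartrate_Seconds$Id)
n_distinct(Weight_LogInfo$Id)
# ...and so on for the remaining datasets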
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 24
## [1] 33
## [1] 14
## [1] 24
## [1] 8
Looking at the row counts, we observe that the daily, hourly and minute datasets are each internally consistent (940, 22,099 and 1,325,580 rows respectively). However, the distinct user counts are not consistent across datasets: Minute_Sleep and Sleep_Day have only 24 distinct users, while the heart rate and weight log data have even fewer (14 and 8 respectively).
Note: There are 33 distinct users in total.
We can take a glimpse at the data summary just to get a feel for the data.
summary(Daily_Activity)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Median :2016-04-26 Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean :2016-04-26 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019 Max. :28.030
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
head(Daily_Activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
Feel free to explore the data summaries of the other datasets.
Note: We'll merge datasets during the analysis as and when required.
Correlation measures the strength of the relationship between two variables and is expressed as a number between -1 and +1.
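The matrix below was computed over the numeric Daily_Activity columns; a sketch of the call (the exact column selection used in the notebook isn't shown) is:

# Correlation matrix of the Daily_Activity measures, rounded to two decimals
round(cor(Daily_Activity[, c("TotalSteps", "TotalDistance", "TrackerDistance",
                             "VeryActiveDistance", "ModeratelyActiveDistance",
                             "LightActiveDistance", "SedentaryActiveDistance",
                             "VeryActiveMinutes", "FairlyActiveMinutes",
                             "LightlyActiveMinutes", "SedentaryMinutes",
                             "Calories")]), 2)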
## TotalSteps TotalDistance TrackerDistance
## TotalSteps 1.00 0.99 0.98
## TotalDistance 0.99 1.00 1.00
## TrackerDistance 0.98 1.00 1.00
## VeryActiveDistance 0.74 0.79 0.79
## ModeratelyActiveDistance 0.51 0.47 0.47
## LightActiveDistance 0.69 0.66 0.66
## VeryActiveDistance ModeratelyActiveDistance
## TotalSteps 0.74 0.51
## TotalDistance 0.79 0.47
## TrackerDistance 0.79 0.47
## VeryActiveDistance 1.00 0.19
## ModeratelyActiveDistance 0.19 1.00
## LightActiveDistance 0.16 0.24
## LightActiveDistance SedentaryActiveDistance
## TotalSteps 0.69 0.07
## TotalDistance 0.66 0.08
## TrackerDistance 0.66 0.07
## VeryActiveDistance 0.16 0.05
## ModeratelyActiveDistance 0.24 0.01
## LightActiveDistance 1.00 0.10
## VeryActiveMinutes FairlyActiveMinutes
## TotalSteps 0.67 0.50
## TotalDistance 0.68 0.46
## TrackerDistance 0.68 0.46
## VeryActiveDistance 0.83 0.21
## ModeratelyActiveDistance 0.23 0.95
## LightActiveDistance 0.15 0.22
## LightlyActiveMinutes SedentaryMinutes Calories
## TotalSteps 0.57 -0.33 0.59
## TotalDistance 0.52 -0.29 0.64
## TrackerDistance 0.51 -0.29 0.65
## VeryActiveDistance 0.06 -0.06 0.49
## ModeratelyActiveDistance 0.16 -0.22 0.22
## LightActiveDistance 0.89 -0.41 0.47
In the correlation matrix, I've taken the variables of the Daily_Activity dataset to see how they relate to each other. A correlation above 0.70 indicates that one variable is strongly correlated with the other. From the matrix and chart above we can infer that:
Also, to check whether there's any relationship between heart rate and calories, we fit a linear model and plot it.
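The plotting and model code isn't shown above; a minimal sketch, assuming a merged data frame mergedf_hea2_Cal2 with per-user means Mean_heartrate and Mean_Calories (names taken from the output below), is:

library(ggplot2)

# Scatter plot of mean calories vs. mean heart rate with a fitted regression line
ggplot(mergedf_hea2_Cal2, aes(x = Mean_Calories, y = Mean_heartrate)) +
  geom_point() +
  geom_smooth(method = "lm")

# Strength of the linear relationship
cor(mergedf_hea2_Cal2$Mean_heartrate, mergedf_hea2_Cal2$Mean_Calories)

# Simple linear model: mean heart rate explained by mean calories
lm(mergedf_hea2_Cal2$Mean_heartrate ~ mergedf_hea2_Cal2$Mean_Calories)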
## `geom_smooth()` using formula 'y ~ x'
## [1] 0.9721095
##
## Call:
## lm(formula = mergedf_hea2_Cal2$Mean_heartrate ~ mergedf_hea2_Cal2$Mean_Calories)
##
## Coefficients:
## (Intercept) mergedf_hea2_Cal2$Mean_Calories
## 35.2381 0.4099
This is helpful when the mean heart rate is not available: its value can be estimated from the mean calories using the fitted model.
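For example, plugging a value into the fitted coefficients above gives a rough estimate; estimate_heartrate() below is a hypothetical helper, not part of the original notebook.

# Rough estimate of mean heart rate from mean calories, using the fitted
# intercept (35.2381) and slope (0.4099); hypothetical helper for illustration
estimate_heartrate <- function(mean_calories) {
  35.2381 + 0.4099 * mean_calories
}

estimate_heartrate(100)  # ~76.2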
The Plot shows:
Keeping the main objective in mind and not swaying away from it, we need to look at user trends. In the daily trend chart, we'll explore four variables across the hours of the day: heart rate, calories, intensity and METs. Some transformation and calculation was required to build the dataset for the daily trend chart.
Transformation: the hour was extracted from the date-time and converted to a factor. A data frame was then built that groups each variable by hour and takes the mean value within each hour; a sketch of this step is shown below.
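A minimal sketch of this transformation for the hourly calories data, assuming dplyr/lubridate and the columns shown earlier (the Calories column name is an assumption), is:

library(dplyr)
library(lubridate)

# Extract the hour, convert it to a factor, then take the mean per hour of day
hourly_calorie_trend <- Hourly_Calories %>%
  mutate(Hour = factor(hour(ActivityHour))) %>%
  group_by(Hour) %>%
  summarise(Mean_Calories = mean(Calories))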
For the METs dataset, a function was written to categorise the values into four intensity levels:
- 1 (No activity): METs between 0 and 1
- 2 (Light intensity): METs between 1.1 and 2.9
- 3 (Moderate intensity): METs between 3 and 5.9
- 4 (Vigorous intensity): METs above 6
After categorisation, the same hourly transformation and mean calculation were applied; a sketch of the categorisation function is shown below.
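A minimal sketch of that function, assuming the Minute_METs_Narrow data frame has a METs column (the column name, the Intensity_Level output column and the met_to_intensity() helper are assumptions), is:

library(dplyr)

# Map METs values to the four intensity levels described above
# (hypothetical helper; boundaries follow the description in the text)
met_to_intensity <- function(mets) {
  case_when(
    mets <= 1 ~ 1,  # No activity
    mets < 3  ~ 2,  # Light intensity
    mets < 6  ~ 3,  # Moderate intensity
    TRUE      ~ 4   # Vigorous intensity
  )
}

Minute_METs_Narrow <- Minute_METs_Narrow %>%
  mutate(Intensity_Level = met_to_intensity(METs))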
Now we plot the graph
These are some interesting findings that can be inferred from the Daily trend plot:
The graph above shows the sum of users' intensity across the day, broken down by intensity level. A similar inference can be drawn from this plot.
As there was no categorisation of user types in the dataset, I've employed an unsupervised learning technique, K-means clustering, to find out how these 33 users can be grouped into different categories.
I've created the clusters based on activity distance. There are four activity distance variables: VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance and SedentaryActiveDistance.
Note: SedentaryActiveDistance was excluded, as it accumulates while you are inactive or seated; refer to the data summary and head(Daily_Activity) above.
Extraction and data manipulation steps:
- Extracted, for each of the 33 unique users (grouped by Id), the mean total distance travelled and the mean distance for each activity type.
- Calculated the percentage each activity distance contributes to the total distance:
  VA_Distance_percent = mean(VeryActiveDistance) / mean(TotalDistance) * 100
  MA_Distance_percent = mean(ModeratelyActiveDistance) / mean(TotalDistance) * 100
  LA_Distance_percent = mean(LightActiveDistance) / mean(TotalDistance) * 100
  Note: VeryActiveDistance + ModeratelyActiveDistance + LightActiveDistance ≈ TotalDistance (refer to the head(Daily_Activity) output above).
- This normalises the data into the percentage of each activity type within the total distance. A sketch of this aggregation is shown below.
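The manipulation code isn't reproduced here; a minimal sketch that produces the df object passed to kmeans() below is shown here (whether Id was dropped before clustering is an assumption):

library(dplyr)

# Per-user mean distances, expressed as a percentage of the mean total distance
df <- Daily_Activity %>%
  group_by(Id) %>%
  summarise(
    VA_Distance_percent = mean(VeryActiveDistance) / mean(TotalDistance) * 100,
    MA_Distance_percent = mean(ModeratelyActiveDistance) / mean(TotalDistance) * 100,
    LA_Distance_percent = mean(LightActiveDistance) / mean(TotalDistance) * 100
  ) %>%
  ungroup() %>%
  select(-Id)   # keep only the three percentage columns for clustering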
To choose the number of clusters, I've used the elbow method. In cluster analysis, the elbow method is a heuristic for determining the number of clusters: plot the explained variation (here, the within-cluster sum of squares) as a function of the number of clusters and pick the "elbow" of the curve. Based on the elbow plot (a sketch of the computation is shown below), I've chosen 3 as the number of clusters for K-means.
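The elbow plot code isn't reproduced here; a minimal base-R sketch of the method (not necessarily the notebook's exact code) is:

# Total within-cluster sum of squares for k = 1..10
set.seed(123)   # a fixed seed is assumed for reproducibility
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 100)$tot.withinss)

# Elbow plot: pick the k where the curve starts to flatten out
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")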
#Kmeans
km.out <- kmeans(df, centers = 3, nstart = 100)
print(km.out)
## K-means clustering with 3 clusters of sizes 6, 15, 12
##
## Cluster means:
## VA_Distance_percent MA_Distance_percent LA_Distance_percent
## 1 50.772950 10.839925 38.15899
## 2 6.762606 5.624851 86.39107
## 3 24.502970 14.320024 58.99094
##
## Clustering vector:
## [1] 3 3 3 2 2 3 2 2 3 2 2 3 2 2 2 3 2 2 2 3 1 2 2 1 3 3 1 1 1 3 3 2 1
##
## Within cluster sum of squares by cluster:
## [1] 1365.847 1781.477 2088.814
## (between_SS / total_SS = 79.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Visualizing the clusters; fviz_cluster() comes from the factoextra package
library(factoextra)
km.cluster <- km.out$cluster
fviz_cluster(list(data = df, cluster = km.cluster))
We see that 3 clusters were formed, with sizes of 6, 15 and 12.
By cluster means we can infer that:
Based on the cluster means above, we can name the clusters:
The cluster plot shows the cluster centroids and which category each user falls into. There is a clear differentiation of users across the clusters.
After naming the clusters, we can see how the users are now classified into three categories; a sketch of the labelling step is shown below.
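The labelling code isn't shown above; a minimal sketch, assuming the mapping cluster 1 = Runners, 2 = Walkers, 3 = Joggers (an inference from the cluster means and the table that follows), is:

library(dplyr)

# Attach the cluster assignment and a readable name to each user
df_clustered <- df %>%
  mutate(
    cluster = km.out$cluster,
    cluster_name = factor(cluster, levels = c(1, 2, 3),
                          labels = c("Runners", "Walkers", "Joggers"))
  )

head(df_clustered)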
## # A tibble: 6 x 5
## VA_Distance_percent MA_Distance_percent LA_Distance_percent cluster cluster_name
## <dbl> <dbl> <dbl> <int> <fct>
## 1 36.6 10.2 53.2 3 Joggers
## 2 24.0 9.21 66.6 3 Joggers
## 3 13.8 18.0 68.2 3 Joggers
## 4 0.492 2.87 96.6 2 Walkers
## 5 15.1 4.93 79.9 2 Walkers
## 6 30.0 8.91 61.1 3 Joggers
Now that the clusters are well defined, we can check the mean calorie usage of each group.
We see that Runners burn more calories than Joggers and Walkers; Runners use about 42% more calories than Walkers.
Here is another plot showing the mean distance covered by each group in a day.
Here we see that Runners cover more than 7 miles and Joggers cover more than 6 but a little less than 7 miles, whereas Walkers cover a mean distance of less than 4 miles in a day.
Now that the groups have been defined, we can look at more focused trends for each group.
The image below shows the mean heart rate per minute across a day for each user group. This trend graph shows how heart rate changes over the 24 hours of the day for the different groups.
From the above, we have the following findings:
Runners
Walkers
Joggers
| User Group | Early Morning (5-9 a.m.) | Late Morning (9 a.m.-12 p.m.) | Afternoon (12-4 p.m.) | Evening (4-8 p.m.) | Night (8 p.m.-12 a.m.) |
|---|---|---|---|---|---|
| Runners | 50-95 | 75-100 | 72-100 | 75-110 | 75-65 |
| Walkers | 60-85 | 75-85 | 80-88 | 85-90 | 70-80 |
| Joggers | 60-85 | 75-85 | 75-85 | 75-80 | 75-60 |
The image below shows each group's step trend across a day and details at what time of day each group took the most, the fewest and a moderate number of steps.
From the above, we have the following findings:
Runners
Walkers
Joggers
| User Group | Early Morning (5-9 a.m.) | Late Morning (9 a.m.-12 p.m.) | Afternoon (12-4 p.m.) | Evening (4-8 p.m.) | Night (8 p.m.-12 a.m.) |
|---|---|---|---|---|---|
| Runners | 0-7 | 2-12 | 2-25 | 2-26 | 0-10 |
| Walkers | 0-7 | 2-8 | 3-7 | 3-9 | 0-5 |
| Joggers | 0-16 | 7-14 | 5-7 | 7-13 | 2-8 |
The figure below shows the calories used per minute by each group. These trends give insight into the times of day when calorie usage was highest and lowest.
From the above trend chart, we have the following findings:
Runners
Walkers
Joggers
| User Group | Early Morning (5-9 a.m.) | Late Morning (9 a.m.-12 p.m.) | Afternoon (12-4 p.m.) | Evening (4-8 p.m.) | Night (8 p.m.-12 a.m.) |
|---|---|---|---|---|---|
| Runners | 2.25 | 2 | 3.25 | 3.45 | 2.25 |
| Walkers | 1.5 | 1.75 | 1.75 | 2 | 1.25 |
| Joggers | 1.75 | 2.25 | 2 | 2.20 | 2 |
A user tier/level system can be developed, where users can move up to a higher tier or drop to a lower one. Based on my classification, there can be three such tiers: Walkers, Joggers and Runners (with more users' data and more sophisticated techniques such as neural networks, the classification would be more effective). Users can be classified by the distance they achieve: users covering more than 7 miles and burning more than 2,500 calories daily would be tier Runners; users covering more than 4 but less than 7 miles and burning more than 2,200 calories would be tier Joggers; and users covering less than 4 miles and burning fewer than 2,200 calories would be tier Walkers. A set of achievable tasks should be developed for each tier; a sketch of the rule is shown below.
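As an illustration only, the tier rule above could be sketched as follows; assign_tier() and its daily_miles/daily_calories inputs are hypothetical, representing per-user daily summary values rather than columns from the dataset.

library(dplyr)

# Hypothetical tier rule based on the thresholds proposed above
assign_tier <- function(daily_miles, daily_calories) {
  case_when(
    daily_miles > 7 & daily_calories > 2500                     ~ "Runners",
    daily_miles > 4 & daily_miles <= 7 & daily_calories > 2200  ~ "Joggers",
    TRUE                                                        ~ "Walkers"
  )
}

assign_tier(7.5, 2600)  # "Runners"
assign_tier(5.0, 2300)  # "Joggers"
assign_tier(3.2, 2000)  # "Walkers"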
The Runners category has a first spike of 95 bpm in the early morning, while steps during that time frame are only 0-7 per minute. This raises the question of what activity they were actually doing during that time other than running. The activity type should be logged in the data. To accomplish this, we can create activity tags that let users select what type of activity they performed during high-intensity, moderate-intensity and light-intensity periods. This will help users identify their activity intensity levels, and this user-provided data will also be useful for developing recommended activities for different intensity levels.