This is a capstone project that I recently completed as part of the Google Data Analytics Professional Certificate. The analysis follows the six steps of data analysis taught in the Google programme: Ask, Prepare, Process, Analyse, Share, and Act.
I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. I have been asked to focus on one of Bellabeat’s products and analyse smart device data to gain insight into how consumers are using their smart devices. The insights gained from this analysis will help guide marketing strategy for Bellabeat. I will present my analysis to the Bellabeat executive team, together with high-level recommendations for Bellabeat’s marketing strategy.
Bellabeat was founded by Urška Sršen and Sando Mur in 2014. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website.
To identify and analyse trends in smart device usage data in order to gain insights into how consumers use non-Bellabeat smart devices. These insights will then guide the marketing strategy for the company.
The data source used for this case study is the FitBit Fitness Tracker Data. This dataset is hosted on Kaggle and was made available through Möbius under a public domain licence. Variation between output represents the use of different types of Fitbit trackers and individual tracking behaviours or preferences.
This dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 12/3/2016 and 12/5/2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
The data contains 18 CSV files of anonymised user information, and is organised in long format. Each row represents a specific time interval with corresponding data points such as activity level, heart rate, and sleep patterns.
The data was downloaded from Kaggle and stored on the local drive of my Windows laptop.
A good dataset is ROCCC, which stands for Reliable, Original, Comprehensive, Current, and Cited. Let us examine the FitBit Fitness Tracker Data against these ROCCC criteria.
Sršen mentioned to me that this dataset might have some limitations, and encouraged me to consider adding other data to help address them as I work more with it. While no specific additional data sources were mentioned, exploring and incorporating complementary datasets could provide a more comprehensive understanding of smart device usage and enhance the credibility of the analysis. I searched for similar datasets that might make up for the deficiencies in this one, but could not find any suitable candidates.
For this analysis, I will make use of the following datasets: daily activity, daily sleep, weight log info, hourly steps, and hourly intensities.
The file names of some of the data already indicated that I would need to rename the tables, and even the columns, to conform to accepted dataset naming conventions.
I have opted to use MySQL Workbench to begin the data wrangling process. I previously used SQLite in an earlier introductory SQL course I took at the end of 2024, and learnt to use Google BigQuery during this Google Data Analytics course. For this analysis, I am using MySQL to demonstrate my versatility with various SQL platforms.
I uploaded the datasets from Kaggle into MySQL and placed them under the database/schema fitabase_data, which I had created. Using the table import wizard in MySQL, I created the following tables: daily_activity, daily_sleep, weight_info, hourly_steps, and hourly_intensities.
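The import wizard needs an existing target schema to import into; a minimal sketch of that setup, assuming default server settings, looks like this:
-- Create the schema that will hold the imported Fitbit tables, then switch to it.
CREATE DATABASE IF NOT EXISTS fitabase_data;
USE fitabase_data;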
After completing the upload of the datasets, I proceeded to explore the raw data in each table using the SELECT statement.
SELECT * FROM daily_activity LIMIT 6;
Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 |
1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0 | 2.71 | 0.41 | 5.04 | 0 | 36 | 10 | 221 | 773 | 1863 |
1503960366 | 4/17/2016 | 9705 | 6.48 | 6.48 | 0 | 3.19 | 0.78 | 2.51 | 0 | 38 | 20 | 164 | 539 | 1728 |
SELECT * FROM daily_sleep LIMIT 6;
Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
---|---|---|---|---|
1503960366 | 4/12/2016 12:00:00 AM | 1 | 327 | 346 |
1503960366 | 4/13/2016 12:00:00 AM | 2 | 384 | 407 |
1503960366 | 4/15/2016 12:00:00 AM | 1 | 412 | 442 |
1503960366 | 4/16/2016 12:00:00 AM | 2 | 340 | 367 |
1503960366 | 4/17/2016 12:00:00 AM | 1 | 700 | 712 |
1503960366 | 4/19/2016 12:00:00 AM | 1 | 304 | 320 |
SELECT * FROM weight_info LIMIT 6;
Id | Date | WeightKg | WeightPounds | Fat | BMI | IsManualReport | LogId |
---|---|---|---|---|---|---|---|
1503960366 | 5/2/2016 11:59:59 PM | 52.6 | 115.9631 | 22 | 22.65 | True | 1462233599000 |
1503960366 | 5/3/2016 11:59:59 PM | 52.6 | 115.9631 | | 22.65 | True | 1462319999000 |
1927972279 | 4/13/2016 1:08:52 AM | 133.5 | 294.3171 | | 47.54 | False | 1460509732000 |
2873212765 | 4/21/2016 11:59:59 PM | 56.7 | 125.0021 | | 21.45 | True | 1461283199000 |
2873212765 | 5/12/2016 11:59:59 PM | 57.3 | 126.3249 | | 21.69 | True | 1463097599000 |
4319703577 | 4/17/2016 11:59:59 PM | 72.4 | 159.6147 | 25 | 27.45 | True | 1460937599000 |
SELECT * FROM hourly_steps LIMIT 6;
Id | ActivityHour | StepTotal |
---|---|---|
1503960366 | 4/12/2016 12:00:00 AM | 373 |
1503960366 | 4/12/2016 1:00:00 AM | 160 |
1503960366 | 4/12/2016 2:00:00 AM | 151 |
1503960366 | 4/12/2016 3:00:00 AM | 0 |
1503960366 | 4/12/2016 4:00:00 AM | 0 |
1503960366 | 4/12/2016 5:00:00 AM | 0 |
SELECT * FROM hourly_intensities LIMIT 6;
Id | ActivityHour | TotalIntensity | AverageIntensity |
---|---|---|---|
1503960366 | 4/12/2016 12:00:00 AM | 20 | 0.333333 |
1503960366 | 4/12/2016 1:00:00 AM | 8 | 0.133333 |
1503960366 | 4/12/2016 2:00:00 AM | 7 | 0.116667 |
1503960366 | 4/12/2016 3:00:00 AM | 0 | 0.000000 |
1503960366 | 4/12/2016 4:00:00 AM | 0 | 0.000000 |
1503960366 | 4/12/2016 5:00:00 AM | 0 | 0.000000 |
I used the COUNT function with the DISTINCT clause to find out how many users contributed to the various datasets.
SELECT
(SELECT COUNT(DISTINCT Id) FROM daily_activity) AS users_activity,
(SELECT COUNT(DISTINCT Id) FROM daily_sleep) AS users_sleep,
(SELECT COUNT(DISTINCT Id) FROM weight_info) AS users_weight;
users_activity | users_sleep | users_weight |
---|---|---|
33 | 24 | 8 |
The collected data only contains records from 33 users, and not all of them used every function. Of the 33 users, only 24 used the sleep tracker, while only eight used the weight log. The number of weight log users is too small to perform any significant analysis. The limited number of contributors also introduces bias: 33 users is a very small sample that cannot be taken to reflect the whole population.
However, some preliminary insights can already be drawn from this:
Seeing that the column names are in camel case, I decided to change them to snake case, the standard naming convention in many programming languages such as R and Python. I did this by creating a new table and using the ALTER TABLE statement with the RENAME COLUMN clause.
CREATE TABLE activity_daily AS (SELECT * FROM daily_activity);
ALTER TABLE activity_daily
RENAME COLUMN Id to user_id,
RENAME COLUMN ActivityDate to activity_date,
RENAME COLUMN TotalSteps to total_steps,
RENAME COLUMN TotalDistance to total_distance,
RENAME COLUMN TrackerDistance to tracker_distance,
RENAME COLUMN LoggedActivitiesDistance to logged_activities_distance,
RENAME COLUMN VeryActiveDistance to very_active_distance,
RENAME COLUMN ModeratelyActiveDistance to moderately_active_distance,
RENAME COLUMN LightActiveDistance to light_active_distance,
RENAME COLUMN SedentaryActiveDistance to sedentary_active_distance,
RENAME COLUMN VeryActiveMinutes to very_active_minutes,
RENAME COLUMN FairlyActiveMinutes to fairly_active_minutes,
RENAME COLUMN LightlyActiveMinutes to lightly_active_minutes,
RENAME COLUMN SedentaryMinutes to sedentary_minutes,
RENAME COLUMN Calories to calories;
CREATE TABLE sleep_daily AS (SELECT * FROM daily_sleep);
ALTER TABLE sleep_daily
RENAME COLUMN Id to user_id,
RENAME COLUMN SleepDay to sleep_day,
RENAME COLUMN TotalSleepRecords to total_sleep_records,
RENAME COLUMN TotalMinutesAsleep to total_minutes_asleep,
RENAME COLUMN TotalTimeInBed to total_time_in_bed;
CREATE TABLE steps_hourly AS (SELECT * FROM hourly_steps);
ALTER TABLE steps_hourly
RENAME COLUMN Id to user_id,
RENAME COLUMN ActivityHour to activity_hour,
RENAME COLUMN StepTotal to steps_total;
CREATE TABLE intensities_hourly AS (SELECT * FROM hourly_intensities);
ALTER TABLE intensities_hourly
RENAME COLUMN Id to user_id,
RENAME COLUMN ActivityHour to activity_hour,
RENAME COLUMN TotalIntensity to total_intensity,
RENAME COLUMN AverageIntensity to average_intensity;
The new table names to be used are activity_daily, sleep_daily, steps_hourly, and intensities_hourly.
SELECT * FROM activity_daily LIMIT 6;
user_id | activity_date | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 |
1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0 | 2.71 | 0.41 | 5.04 | 0 | 36 | 10 | 221 | 773 | 1863 |
1503960366 | 4/17/2016 | 9705 | 6.48 | 6.48 | 0 | 3.19 | 0.78 | 2.51 | 0 | 38 | 20 | 164 | 539 | 1728 |
SELECT * FROM sleep_daily LIMIT 6;
user_id | sleep_day | total_sleep_records | total_minutes_asleep | total_time_in_bed |
---|---|---|---|---|
1503960366 | 4/12/2016 12:00:00 AM | 1 | 327 | 346 |
1503960366 | 4/13/2016 12:00:00 AM | 2 | 384 | 407 |
1503960366 | 4/15/2016 12:00:00 AM | 1 | 412 | 442 |
1503960366 | 4/16/2016 12:00:00 AM | 2 | 340 | 367 |
1503960366 | 4/17/2016 12:00:00 AM | 1 | 700 | 712 |
1503960366 | 4/19/2016 12:00:00 AM | 1 | 304 | 320 |
SELECT * FROM steps_hourly LIMIT 6;
user_id | activity_hour | steps_total |
---|---|---|
1503960366 | 4/12/2016 12:00:00 AM | 373 |
1503960366 | 4/12/2016 1:00:00 AM | 160 |
1503960366 | 4/12/2016 2:00:00 AM | 151 |
1503960366 | 4/12/2016 3:00:00 AM | 0 |
1503960366 | 4/12/2016 4:00:00 AM | 0 |
1503960366 | 4/12/2016 5:00:00 AM | 0 |
SELECT * FROM intensities_hourly LIMIT 6;
user_id | activity_hour | total_intensity | average_intensity |
---|---|---|---|
1503960366 | 4/12/2016 12:00:00 AM | 20 | 0.333333 |
1503960366 | 4/12/2016 1:00:00 AM | 8 | 0.133333 |
1503960366 | 4/12/2016 2:00:00 AM | 7 | 0.116667 |
1503960366 | 4/12/2016 3:00:00 AM | 0 | 0.000000 |
1503960366 | 4/12/2016 4:00:00 AM | 0 | 0.000000 |
1503960366 | 4/12/2016 5:00:00 AM | 0 | 0.000000 |
Now I would like to know the data types of each column. To do that, I used the DESCRIBE statement.
DESCRIBE activity_daily;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
user_id | bigint | YES | NA | ||
activity_date | text | YES | NA | ||
total_steps | int | YES | NA | ||
total_distance | double | YES | NA | ||
tracker_distance | double | YES | NA | ||
logged_activities_distance | int | YES | NA | ||
very_active_distance | double | YES | NA | ||
moderately_active_distance | double | YES | NA | ||
light_active_distance | double | YES | NA | ||
sedentary_active_distance | int | YES | NA |
DESCRIBE sleep_daily;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
user_id | bigint | YES | NA | ||
sleep_day | text | YES | NA | ||
total_sleep_records | int | YES | NA | ||
total_minutes_asleep | int | YES | NA | ||
total_time_in_bed | int | YES | NA |
DESCRIBE steps_hourly;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
user_id | bigint | YES | NA | ||
activity_hour | text | YES | NA | ||
steps_total | int | YES | NA |
DESCRIBE intensities_hourly;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
user_id | bigint | YES | NA | ||
activity_hour | text | YES | NA | ||
total_intensity | int | YES | NA | ||
average_intensity | double | YES | NA |
I noticed that the recorded dates in each table are not stored in the DATE data type, but rather as TEXT. Before this data can be used for analysis, I will have to convert the TEXT columns to the DATE or DATETIME data type using the STR_TO_DATE function.
Apart from the activity_daily table, the relevant date columns of the other tables contain both the date and the time.
I then proceeded to replace the string data in the date columns with the correct format for the data type conversion. I also added a day-of-the-week column to each dataset by applying the DAYNAME function to the dates.
For the activity_daily dataset, only date data is available.
CREATE TABLE activity_daily_dtype_conv AS (SELECT * FROM activity_daily);
UPDATE activity_daily_dtype_conv SET activity_date=(STR_TO_DATE(activity_date, '%m/%d/%Y'));
ALTER TABLE activity_daily_dtype_conv MODIFY COLUMN activity_date date;
CREATE TABLE activity_daily_date AS
(SELECT user_id,
activity_date,
DAYNAME(activity_date) AS day_name,
total_steps,
total_distance,
tracker_distance,
logged_activities_distance,
very_active_distance,
moderately_active_distance,
light_active_distance,
sedentary_active_distance,
very_active_minutes,
fairly_active_minutes,
lightly_active_minutes,
sedentary_minutes,
calories
FROM activity_daily_dtype_conv);
As for the other datasets, both date and time data are available. After converting the date and time strings in the relevant columns to a suitable format, I converted the data type from string to DATETIME. I then split the date and time into separate columns by using the CAST function on the datetime columns.
CREATE TABLE sleep_daily_dtype_conv AS (SELECT * FROM sleep_daily);
UPDATE sleep_daily_dtype_conv SET sleep_day=(STR_TO_DATE(sleep_day, '%m/%d/%Y %r'));
ALTER TABLE sleep_daily_dtype_conv MODIFY COLUMN sleep_day datetime;
CREATE TABLE sleep_daily_date_time AS
(SELECT user_id,
CAST(sleep_day AS DATE) AS sleep_date,
CAST(sleep_day AS TIME) AS sleep_time,
DAYNAME(CAST(sleep_day AS DATE)) AS day_name,
total_sleep_records,
total_minutes_asleep,
total_time_in_bed
FROM sleep_daily_dtype_conv);
CREATE TABLE steps_hourly_dtype_conv AS (SELECT * FROM steps_hourly);
UPDATE steps_hourly_dtype_conv SET activity_hour=(STR_TO_DATE(activity_hour, '%m/%d/%Y %r'));
ALTER TABLE steps_hourly_dtype_conv MODIFY COLUMN activity_hour datetime;
CREATE TABLE steps_hourly_date_time AS
(SELECT user_id,
CAST(activity_hour AS DATE) AS activity_date,
CAST(activity_hour AS TIME) AS activity_time,
DAYNAME(CAST(activity_hour AS DATE)) AS day_name,
steps_total
FROM steps_hourly_dtype_conv);
CREATE TABLE intensities_hourly_dtype_conv AS (SELECT * FROM intensities_hourly);
UPDATE intensities_hourly_dtype_conv SET activity_hour=(STR_TO_DATE(activity_hour, '%m/%d/%Y %r'));
ALTER TABLE intensities_hourly_dtype_conv MODIFY COLUMN activity_hour datetime;
CREATE TABLE intensities_hourly_date_time AS
(SELECT user_id,
CAST(activity_hour AS DATE) AS activity_date,
CAST(activity_hour AS TIME) AS activity_time,
DAYNAME(CAST(activity_hour AS DATE)) AS day_name,
total_intensity,
average_intensity
FROM intensities_hourly_dtype_conv);
I then examined the datasets again to confirm that the correct format and data types are used.
SELECT * FROM activity_daily_date LIMIT 6;
user_id | activity_date | day_name | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1503960366 | 2016-04-12 | Tuesday | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
1503960366 | 2016-04-13 | Wednesday | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
1503960366 | 2016-04-14 | Thursday | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
1503960366 | 2016-04-15 | Friday | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 |
1503960366 | 2016-04-16 | Saturday | 12669 | 8.16 | 8.16 | 0 | 2.71 | 0.41 | 5.04 | 0 | 36 | 10 | 221 | 773 | 1863 |
1503960366 | 2016-04-17 | Sunday | 9705 | 6.48 | 6.48 | 0 | 3.19 | 0.78 | 2.51 | 0 | 38 | 20 | 164 | 539 | 1728 |
DESCRIBE activity_daily_date;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
user_id | bigint | YES | NA | ||
activity_date | date | YES | NA | ||
day_name | varchar(9) | YES | NA | ||
total_steps | int | YES | NA | ||
total_distance | double | YES | NA | ||
tracker_distance | double | YES | NA | ||
logged_activities_distance | int | YES | NA | ||
very_active_distance | double | YES | NA | ||
moderately_active_distance | double | YES | NA | ||
light_active_distance | double | YES | NA |
SELECT * FROM sleep_daily_date_time LIMIT 6;
user_id | sleep_date | sleep_time | day_name | total_sleep_records | total_minutes_asleep | total_time_in_bed |
---|---|---|---|---|---|---|
1503960366 | 2016-04-12 | 00:00:00 | Tuesday | 1 | 327 | 346 |
1503960366 | 2016-04-13 | 00:00:00 | Wednesday | 2 | 384 | 407 |
1503960366 | 2016-04-15 | 00:00:00 | Friday | 1 | 412 | 442 |
1503960366 | 2016-04-16 | 00:00:00 | Saturday | 2 | 340 | 367 |
1503960366 | 2016-04-17 | 00:00:00 | Sunday | 1 | 700 | 712 |
1503960366 | 2016-04-19 | 00:00:00 | Tuesday | 1 | 304 | 320 |
DESCRIBE sleep_daily_date_time;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
user_id | bigint | YES | NA | ||
sleep_date | date | YES | NA | ||
sleep_time | time | YES | NA | ||
day_name | varchar(9) | YES | NA | ||
total_sleep_records | int | YES | NA | ||
total_minutes_asleep | int | YES | NA | ||
total_time_in_bed | int | YES | NA |
SELECT * FROM steps_hourly_date_time LIMIT 6;
user_id | activity_date | activity_time | day_name | steps_total |
---|---|---|---|---|
1503960366 | 2016-04-12 | 00:00:00 | Tuesday | 373 |
1503960366 | 2016-04-12 | 01:00:00 | Tuesday | 160 |
1503960366 | 2016-04-12 | 02:00:00 | Tuesday | 151 |
1503960366 | 2016-04-12 | 03:00:00 | Tuesday | 0 |
1503960366 | 2016-04-12 | 04:00:00 | Tuesday | 0 |
1503960366 | 2016-04-12 | 05:00:00 | Tuesday | 0 |
DESCRIBE steps_hourly_date_time;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
user_id | bigint | YES | NA | ||
activity_date | date | YES | NA | ||
activity_time | time | YES | NA | ||
day_name | varchar(9) | YES | NA | ||
steps_total | int | YES | NA |
SELECT * FROM intensities_hourly_date_time LIMIT 6;
user_id | activity_date | activity_time | day_name | total_intensity | average_intensity |
---|---|---|---|---|---|
1503960366 | 2016-04-12 | 00:00:00 | Tuesday | 20 | 0.333333 |
1503960366 | 2016-04-12 | 01:00:00 | Tuesday | 8 | 0.133333 |
1503960366 | 2016-04-12 | 02:00:00 | Tuesday | 7 | 0.116667 |
1503960366 | 2016-04-12 | 03:00:00 | Tuesday | 0 | 0.000000 |
1503960366 | 2016-04-12 | 04:00:00 | Tuesday | 0 | 0.000000 |
1503960366 | 2016-04-12 | 05:00:00 | Tuesday | 0 | 0.000000 |
DESCRIBE intensities_hourly_date_time;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
user_id | bigint | YES | NA | ||
activity_date | date | YES | NA | ||
activity_time | time | YES | NA | ||
day_name | varchar(9) | YES | NA | ||
total_intensity | int | YES | NA | ||
average_intensity | double | YES | NA |
Now that the data has been formatted appropriately, with the right data types and suitably named columns, I can proceed to clean it for analysis.
I will need to check the datasets for duplicate and empty rows before analysis. Firstly, I will remove duplicate rows in the datasets. After that, I will remove rows with empty values.
I will use the SELECT DISTINCT statement to extract only the distinct rows from the datasets, thereby removing duplicate rows.
I used the COUNT function to find the number of rows in each dataset, and combined it with the SELECT DISTINCT statement to obtain the number of distinct rows for comparison.
SELECT COUNT(*) FROM activity_daily_date;
COUNT(*) |
---|
940 |
SELECT COUNT(*) FROM (SELECT DISTINCT * FROM activity_daily_date) AS distinct_rows;
COUNT(*) |
---|
940 |
SELECT COUNT(*) FROM sleep_daily_date_time;
COUNT(*) |
---|
413 |
SELECT COUNT(*) FROM (SELECT DISTINCT * FROM sleep_daily_date_time) AS distinct_rows;
COUNT(*) |
---|
410 |
SELECT COUNT(*) FROM steps_hourly_date_time;
COUNT(*) |
---|
22099 |
SELECT COUNT(*) FROM (SELECT DISTINCT * FROM steps_hourly_date_time) AS distinct_rows;
COUNT(*) |
---|
22099 |
SELECT COUNT(*) FROM intensities_hourly_date_time;
COUNT(*) |
---|
22099 |
SELECT COUNT(*) FROM (SELECT DISTINCT * FROM intensities_hourly_date_time) AS distinct_rows;
COUNT(*) |
---|
22099 |
There were three duplicate rows in the sleep_daily dataset, while there were no duplicate rows in the other datasets.
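For reference, the specific duplicated records can be surfaced with a GROUP BY check; a minimal sketch on the sleep table, grouping on all of its columns, is:
-- List sleep records that appear more than once in the table.
SELECT user_id, sleep_date, sleep_time, day_name,
total_sleep_records, total_minutes_asleep, total_time_in_bed,
COUNT(*) AS occurrences
FROM sleep_daily_date_time
GROUP BY user_id, sleep_date, sleep_time, day_name,
total_sleep_records, total_minutes_asleep, total_time_in_bed
HAVING COUNT(*) > 1;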
I then used the IS NULL operator to check the datasets for missing or empty values.
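A minimal sketch of such a check on the daily activity table (the same pattern can be repeated for the relevant columns of every table) is:
-- Count rows with NULLs in key columns of the daily activity table;
-- a result of zero confirms there are no missing values in those columns.
SELECT COUNT(*) AS null_rows
FROM activity_daily_date
WHERE user_id IS NULL
OR activity_date IS NULL
OR total_steps IS NULL
OR calories IS NULL;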
Finding that there are no empty fields in the datasets, I created finalised tables to be used for analysis.
CREATE TABLE daily_activity_final AS (SELECT * FROM activity_daily_date);
CREATE TABLE daily_sleep_final AS (SELECT DISTINCT * FROM sleep_daily_date_time);
CREATE TABLE hourly_steps_final AS (SELECT * FROM steps_hourly_date_time);
CREATE TABLE hourly_intensities_final AS (SELECT * FROM intensities_hourly_date_time);
Finally, I exported the finalised tables as .csv files. The datasets to be used for analysis are daily_activity_final.csv, daily_sleep_final.csv, hourly_steps_final.csv, and hourly_intensities_final.csv.
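For reference, an export like this can also be scripted directly in MySQL rather than done through the client interface; a minimal sketch, assuming the server's secure_file_priv setting allows writing to the chosen path (and noting that INTO OUTFILE does not write a header row), is:
-- Scripted CSV export of one finalised table (the path is illustrative).
SELECT * INTO OUTFILE '/tmp/daily_activity_final.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM daily_activity_final;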
Now that the data has been cleaned and organised, we can proceed to the analysis step. I imported the .csv files exported from MySQL into R to carry out the analysis, ensuring that the right data types and formats are specified for the various columns.
library(tidyverse)
library(ggplot2)
library(lubridate)
library(dplyr)
library(janitor)
library(skimr)
library(readr)
daily_activity_final <- read_csv("daily_activity_final.csv", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d")));
daily_sleep_final <- read_csv("daily_sleep_final.csv", col_types = cols(user_id = col_character(), sleep_date = col_date(format = "%Y-%m-%d"), sleep_time = col_time(format = "%X")));
hourly_steps_final <- read_csv("hourly_steps_final.csv", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d"), activity_time = col_time(format = "%X")));
hourly_intensities_final <- read_csv("hourly_intensities_final.csv", col_types = cols(user_id = col_character(), activity_date = col_date(format = "%Y-%m-%d"), activity_time = col_time(format = "%X")));
Now that the data has been imported into R, I proceeded to find out what type of users are represented in the dataset. To do this, I defined a new dataframe called user_types, which I then exported as a .csv file to be used in Tableau.
user_types <- daily_activity_final %>% group_by(user_id) %>% summarise(days_used=sum(n())) %>% mutate(user_type = case_when(
days_used >= 1 & days_used <= 10 ~ "Light User",
days_used >= 11 & days_used <= 20 ~ "Moderate User",
days_used >= 21 & days_used <= 29 ~ "Heavy User",
days_used >= 30 ~ "Everyday User"))
head(user_types)
## # A tibble: 6 × 3
## user_id days_used user_type
## <chr> <int> <chr>
## 1 1503960366 31 Everyday User
## 2 1624580081 31 Everyday User
## 3 1644430081 30 Everyday User
## 4 1844505072 31 Everyday User
## 5 1927972279 31 Everyday User
## 6 2022484408 31 Everyday User
write.csv(user_types, "user_types.csv", row.names=FALSE)
I used the summary() function to obtain statistical information about the datasets.
summary(daily_activity_final)
## user_id activity_date day_name total_steps
## Length:940 Min. :2016-04-12 Length:940 Min. : 0
## Class :character 1st Qu.:2016-04-19 Class :character 1st Qu.: 3790
## Mode :character Median :2016-04-26 Mode :character Median : 7406
## Mean :2016-04-26 Mean : 7638
## 3rd Qu.:2016-05-04 3rd Qu.:10727
## Max. :2016-05-12 Max. :36019
## total_distance tracker_distance logged_activities_distance
## Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 2.620 1st Qu.: 2.620 1st Qu.:0.0000
## Median : 5.245 Median : 5.245 Median :0.0000
## Mean : 5.490 Mean : 5.475 Mean :0.1064
## 3rd Qu.: 7.713 3rd Qu.: 7.710 3rd Qu.:0.0000
## Max. :28.030 Max. :28.030 Max. :5.0000
## very_active_distance moderately_active_distance light_active_distance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 1.945
## Median : 0.210 Median :0.2400 Median : 3.365
## Mean : 1.503 Mean :0.5675 Mean : 3.341
## 3rd Qu.: 2.052 3rd Qu.:0.8000 3rd Qu.: 4.783
## Max. :21.920 Max. :6.4800 Max. :10.710
## sedentary_active_distance very_active_minutes fairly_active_minutes
## Min. :0 Min. : 0.00 Min. : 0.00
## 1st Qu.:0 1st Qu.: 0.00 1st Qu.: 0.00
## Median :0 Median : 4.00 Median : 6.00
## Mean :0 Mean : 21.16 Mean : 13.56
## 3rd Qu.:0 3rd Qu.: 32.00 3rd Qu.: 19.00
## Max. :0 Max. :210.00 Max. :143.00
## lightly_active_minutes sedentary_minutes calories
## Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.:127.0 1st Qu.: 729.8 1st Qu.:1828
## Median :199.0 Median :1057.5 Median :2134
## Mean :192.8 Mean : 991.2 Mean :2304
## 3rd Qu.:264.0 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :518.0 Max. :1440.0 Max. :4900
summary(daily_sleep_final)
## user_id sleep_date sleep_time day_name
## Length:410 Min. :2016-04-12 Min. :00:00:00 Length:410
## Class :character 1st Qu.:2016-04-19 1st Qu.:00:00:00 Class :character
## Mode :character Median :2016-04-27 Median :00:00:00 Mode :character
## Mean :2016-04-26 Mean :00:00:00
## 3rd Qu.:2016-05-04 3rd Qu.:00:00:00
## Max. :2016-05-12 Max. :00:00:00
## total_sleep_records total_minutes_asleep total_time_in_bed
## Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :1.00 Median :432.5 Median :463.0
## Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.00 Max. :796.0 Max. :961.0
summary(hourly_steps_final)
## user_id activity_date activity_time
## Length:22099 Min. :2016-04-12 Min. :00:00:00.000000
## Class :character 1st Qu.:2016-04-19 1st Qu.:05:00:00.000000
## Mode :character Median :2016-04-26 Median :11:00:00.000000
## Mean :2016-04-26 Mean :11:24:56.755509
## 3rd Qu.:2016-05-03 3rd Qu.:17:00:00.000000
## Max. :2016-05-12 Max. :23:00:00.000000
## day_name steps_total
## Length:22099 Min. : 0.0
## Class :character 1st Qu.: 0.0
## Mode :character Median : 40.0
## Mean : 320.2
## 3rd Qu.: 357.0
## Max. :10554.0
summary(hourly_intensities_final)
## user_id activity_date activity_time
## Length:22099 Min. :2016-04-12 Min. :00:00:00.000000
## Class :character 1st Qu.:2016-04-19 1st Qu.:05:00:00.000000
## Mode :character Median :2016-04-26 Median :11:00:00.000000
## Mean :2016-04-26 Mean :11:24:56.755509
## 3rd Qu.:2016-05-03 3rd Qu.:17:00:00.000000
## Max. :2016-05-12 Max. :23:00:00.000000
## day_name total_intensity average_intensity
## Length:22099 Min. : 0.00 Min. :0.0000
## Class :character 1st Qu.: 0.00 1st Qu.:0.0000
## Mode :character Median : 3.00 Median :0.0500
## Mean : 12.04 Mean :0.2006
## 3rd Qu.: 16.00 3rd Qu.:0.2667
## Max. :180.00 Max. :3.0000
Based on the statistical summaries, we can make the following observations:
Bellabeat has a mission to provide women with activity, sleep, stress, and reproductive health data, so as to help them discover more about their health and habits.
A larger sample size would be necessary in order to carry out more in-depth analysis. Different types of data would also need to be collected to broaden the scope of the analysis; this could include user demographics such as gender, age, and occupation. Additional demographic data would help reduce bias in the datasets and the analysis. Collecting more recent, up-to-date data would also make the analysis more reliable and better reflect current trends.
With that said, here are the key findings from this analysis:
Based on this analysis, I would propose the following recommendations:
Further Analysis on the Weight Log Function
Seeing that only eight of the 33 users used the weight log function, further data collection (e.g. in the form of a survey) should be conducted to understand why users do not use it. Such a study would provide user insights into the function and inform Bellabeat on how the weight log feature could be improved.
Promote Device Use while Sleeping
People tend to remove watches before going to sleep. Bellabeat should encourage users to keep wearing their devices when going to bed, so that Bellabeat can collect more data on users’ sleep patterns. Users will also be able to gain insights from their sleep data and make informed decisions that help them get enough sleep.
Sleep Notifications
Since users often do not get the recommended amount of sleep per day, Bellabeat could consider adding a feature to the Bellabeat app that allows users to set a target bedtime. The app can then notify the user some time beforehand to start preparing for sleep. Since users also tend to struggle to fall asleep, prompting them to go to bed earlier than the set time can help them fall asleep on time and get sufficient rest.
Step and Activity Notifications
In order to encourage users to take more steps, Bellabeat can set up app notifications at different times of the day to make users aware of the number of steps they have taken and motivate them to meet the daily recommendation of 8,000 steps from the U.S. CDC. Additional information on the health benefits of walking the daily recommended number of steps can also be included in the app to educate users.
Alerts for prolonged sedentary time can also be set up, prompting users to get up and engage in some form of physical activity, such as walking. Additional information about the risks of prolonged sedentary behaviour, such as that published by the WHO, can also be included in the app.
Online Campaign Recommendations
The online campaign should not portray the Bellabeat app as merely a fitness activity app. Rather, it should promote the app as a guide that will allow women to strike a balance between their personal and work lives, and to improve their health habits through education and daily app notifications.