I joined a fiction company six months ago called Bellabeat. I joined as a junior data analyst. Bellabeat is a high-tech company that manufactures health-focused smart products for women since 2013.This product collects data on sleep,stress and reproductive activities in women.Thus empowering and informing women about their health and habits.
The stakeholders wants to focus on analyzing how consumers use non-Bellabeat smart devices.. Then recommendations to improve the marketing strategy of Bellabeat products would be suggested from the insight gotten.
As adviced by Chief Creative Officer, I used a public data set from a public domain called kaggle.The data set “FitBit Fitness Tracker Data”contains personal fitness tracker from thirty fitbit users.
A good data source is ROCCC which stands for Reliable, Original, Comprehensive, Current, and Cited. 1. Reliable - The data set is not so reliable because the sample selection bias does not reflect the overall population. In this data set the sample size is only 30 users. 2. Original - The originality of the data set is also low because the data set is gotten from a third party provider “Amazon Mechanical Turk”. 3. Comprehensive - The comprehensiveness of the data set is okay because the data set contains the important information and parameters needed to answer the business task 4. Current - The data set is not current as it was last updated in December 2020. 5. Cited - The data set was gotten from a third party so we cannot identify the source of the data.
I am using R programming language tool to process the dataset because it is easy to use and i am very familiar with it
library(readr)
Warning: package ‘readr’ was built under R version 4.1.2
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages --------------------------------------------- tidyverse 1.3.1 --
√ ggplot2 3.3.5 √ dplyr 1.0.8
√ tibble 3.1.6 √ stringr 1.4.0
√ tidyr 1.2.0 √ forcats 0.5.1
√ purrr 0.3.4
Warning: package ‘ggplot2’ was built under R version 4.1.2
Warning: package ‘tibble’ was built under R version 4.1.2
Warning: package ‘tidyr’ was built under R version 4.1.2
Warning: package ‘dplyr’ was built under R version 4.1.2
-- Conflicts ------------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(ggplot2)
library(lubridate)
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
library(janitor)
Warning: package ‘janitor’ was built under R version 4.1.2
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
library(skimr)
Warning: package ‘skimr’ was built under R version 4.1.2
library(tinytex)
Warning: package ‘tinytex’ was built under R version 4.1.2
#Now to load the data sets
daily_Activity <- read_csv('dailyActivity_merged.csv')
Rows: 940 Columns: 15
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityDate
dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_Calories <- read_csv('dailyCalories_merged.csv')
Rows: 940 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityDay
dbl (2): Id, Calories
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_Intensities <- read_csv('dailyIntensities_merged.csv')
Rows: 940 Columns: 10
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityDay
dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActi...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_Steps <- read_csv('dailySteps_merged.csv')
Rows: 940 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityDay
dbl (2): Id, StepTotal
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate_seconds <- read_csv('heartrate_seconds_merged.csv')
Rows: 2483658 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): Time
dbl (2): Id, Value
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_Calories <- read_csv('hourlyCalories_merged.csv')
Rows: 22099 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityHour
dbl (2): Id, Calories
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_Intensities <- read_csv('hourlyIntensities_merged.csv')
Rows: 22099 Columns: 4
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityHour
dbl (3): Id, TotalIntensity, AverageIntensity
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_Steps <- read_csv('hourlySteps_merged.csv')
Rows: 22099 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityHour
dbl (2): Id, StepTotal
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_cal_narrow <- read_csv('minuteCaloriesNarrow_merged.csv')
Rows: 1325580 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityMinute
dbl (2): Id, Calories
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_cal_wide <- read_csv('minuteCaloriesWide_merged.csv')
Rows: 21645 Columns: 62
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityHour
dbl (61): Id, Calories00, Calories01, Calories02, Calories03, Calories04, Calories...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_int_narrow <- read_csv('minuteIntensitiesNarrow_merged.csv')
Rows: 1325580 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityMinute
dbl (2): Id, Intensity
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_int_wide <- read_csv('minuteIntensitiesWide_merged.csv')
Rows: 21645 Columns: 62
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityHour
dbl (61): Id, Intensity00, Intensity01, Intensity02, Intensity03, Intensity04, Int...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_met_narrow <- read_csv('minuteMETsNarrow_merged.csv')
Rows: 1325580 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityMinute
dbl (2): Id, METs
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_sleep <- read_csv('minuteSleep_merged.csv')
Rows: 188521 Columns: 4
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): date
dbl (3): Id, value, logId
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_step_narrow <- read_csv('minuteStepsNarrow_merged.csv')
Rows: 1325580 Columns: 3
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityMinute
dbl (2): Id, Steps
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
minute_step_wide <- read_csv('minuteStepsWide_merged.csv')
Rows: 21645 Columns: 62
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): ActivityHour
dbl (61): Id, Steps00, Steps01, Steps02, Steps03, Steps04, Steps05, Steps06, Steps...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_day <- read_csv('sleepDay_merged.csv')
Rows: 413 Columns: 5
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): SleepDay
dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_log_info <- read_csv('weightLogInfo_merged.csv')
Rows: 67 Columns: 8
-- Column specification --------------------------------------------------------------
Delimiter: ","
chr (1): Date
dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
lgl (1): IsManualReport
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(daily_Activity)
[1] "Id" "ActivityDate" "TotalSteps"
[4] "TotalDistance" "TrackerDistance" "LoggedActivitiesDistance"
[7] "VeryActiveDistance" "ModeratelyActiveDistance" "LightActiveDistance"
[10] "SedentaryActiveDistance" "VeryActiveMinutes" "FairlyActiveMinutes"
[13] "LightlyActiveMinutes" "SedentaryMinutes" "Calories"
colnames(daily_Calories)
[1] "Id" "ActivityDay" "Calories"
colnames(daily_Intensities)
[1] "Id" "ActivityDay" "SedentaryMinutes"
[4] "LightlyActiveMinutes" "FairlyActiveMinutes" "VeryActiveMinutes"
[7] "SedentaryActiveDistance" "LightActiveDistance" "ModeratelyActiveDistance"
[10] "VeryActiveDistance"
colnames(daily_Steps)
[1] "Id" "ActivityDay" "StepTotal"
colnames(heartrate_seconds)
[1] "Id" "Time" "Value"
colnames(hourly_Calories)
[1] "Id" "ActivityHour" "Calories"
colnames(hourly_Intensities)
[1] "Id" "ActivityHour" "TotalIntensity" "AverageIntensity"
colnames(hourly_Steps)
[1] "Id" "ActivityHour" "StepTotal"
colnames(minute_cal_narrow)
[1] "Id" "ActivityMinute" "Calories"
colnames(minute_cal_wide)
[1] "Id" "ActivityHour" "Calories00" "Calories01" "Calories02"
[6] "Calories03" "Calories04" "Calories05" "Calories06" "Calories07"
[11] "Calories08" "Calories09" "Calories10" "Calories11" "Calories12"
[16] "Calories13" "Calories14" "Calories15" "Calories16" "Calories17"
[21] "Calories18" "Calories19" "Calories20" "Calories21" "Calories22"
[26] "Calories23" "Calories24" "Calories25" "Calories26" "Calories27"
[31] "Calories28" "Calories29" "Calories30" "Calories31" "Calories32"
[36] "Calories33" "Calories34" "Calories35" "Calories36" "Calories37"
[41] "Calories38" "Calories39" "Calories40" "Calories41" "Calories42"
[46] "Calories43" "Calories44" "Calories45" "Calories46" "Calories47"
[51] "Calories48" "Calories49" "Calories50" "Calories51" "Calories52"
[56] "Calories53" "Calories54" "Calories55" "Calories56" "Calories57"
[61] "Calories58" "Calories59"
colnames(minute_int_narrow)
[1] "Id" "ActivityMinute" "Intensity"
colnames(minute_int_wide)
[1] "Id" "ActivityHour" "Intensity00" "Intensity01" "Intensity02"
[6] "Intensity03" "Intensity04" "Intensity05" "Intensity06" "Intensity07"
[11] "Intensity08" "Intensity09" "Intensity10" "Intensity11" "Intensity12"
[16] "Intensity13" "Intensity14" "Intensity15" "Intensity16" "Intensity17"
[21] "Intensity18" "Intensity19" "Intensity20" "Intensity21" "Intensity22"
[26] "Intensity23" "Intensity24" "Intensity25" "Intensity26" "Intensity27"
[31] "Intensity28" "Intensity29" "Intensity30" "Intensity31" "Intensity32"
[36] "Intensity33" "Intensity34" "Intensity35" "Intensity36" "Intensity37"
[41] "Intensity38" "Intensity39" "Intensity40" "Intensity41" "Intensity42"
[46] "Intensity43" "Intensity44" "Intensity45" "Intensity46" "Intensity47"
[51] "Intensity48" "Intensity49" "Intensity50" "Intensity51" "Intensity52"
[56] "Intensity53" "Intensity54" "Intensity55" "Intensity56" "Intensity57"
[61] "Intensity58" "Intensity59"
colnames(minute_met_narrow)
[1] "Id" "ActivityMinute" "METs"
colnames(minute_sleep)
[1] "Id" "date" "value" "logId"
colnames(minute_step_narrow)
[1] "Id" "ActivityMinute" "Steps"
colnames(minute_step_wide)
[1] "Id" "ActivityHour" "Steps00" "Steps01" "Steps02"
[6] "Steps03" "Steps04" "Steps05" "Steps06" "Steps07"
[11] "Steps08" "Steps09" "Steps10" "Steps11" "Steps12"
[16] "Steps13" "Steps14" "Steps15" "Steps16" "Steps17"
[21] "Steps18" "Steps19" "Steps20" "Steps21" "Steps22"
[26] "Steps23" "Steps24" "Steps25" "Steps26" "Steps27"
[31] "Steps28" "Steps29" "Steps30" "Steps31" "Steps32"
[36] "Steps33" "Steps34" "Steps35" "Steps36" "Steps37"
[41] "Steps38" "Steps39" "Steps40" "Steps41" "Steps42"
[46] "Steps43" "Steps44" "Steps45" "Steps46" "Steps47"
[51] "Steps48" "Steps49" "Steps50" "Steps51" "Steps52"
[56] "Steps53" "Steps54" "Steps55" "Steps56" "Steps57"
[61] "Steps58" "Steps59"
colnames(sleep_day)
[1] "Id" "SleepDay" "TotalSleepRecords"
[4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weight_log_info)
[1] "Id" "Date" "WeightKg" "WeightPounds"
[5] "Fat" "BMI" "IsManualReport" "LogId"
# First method
sum(is.na(daily_Activity))
[1] 0
sum(is.na(daily_Calories))
[1] 0
sum(is.na(daily_Intensities))
[1] 0
sum(is.na(daily_Steps))
[1] 0
# Second method
daily_Activity %>% duplicated () %>% sum()
[1] 0
daily_Calories %>% duplicated () %>% sum()
[1] 0
daily_Intensities %>% duplicated () %>% sum()
[1] 0
# Third method
daily_Activity %>% is.na() %>% sum()
[1] 0
daily_Calories %>% is.na() %>% sum()
[1] 0
sleep_day %>% duplicated() %>% sum()
[1] 3
View(sleep_day)
# Running these methods i found out that there were no missing data values in the data set but there were three duplicates found in sleep_day data set
#To remove the duplicates found in sleep_day
cleaned_sleep_day <- sleep_day[!duplicated(sleep_day),]
#To verify we have removed the duplicates
cleaned_sleep_day %>% duplicated() %>% sum()
[1] 0
# For daily_activity: I convert the string column (ActivityDate) to date format
daily_Activity$ActivityDate <- mdy(daily_Activity$ActivityDate)
head(daily_Activity)
# For heart_rate: I converted the Time column to date format and i removed the time stamp
heartrate_seconds$Time <- as_date(mdy_hms(heartrate_seconds$Time))
head(heartrate_seconds)
#For Sleep day: I converted the SleepDay column to date format and i removed the time stamp
sleep_day$SleepDay <- as_date(mdy_hms(sleep_day$SleepDay))
head(sleep_day)
#For weight_log_info: I converted the Date column to date format and i removed the time stamp
weight_log_info$Date <- as_date(mdy_hms(weight_log_info$Date))
View(weight_log_info)
# For daily Activity: We rename the ActivityDate column to Date
clean_daily_activity <- rename(daily_Activity, Date = ActivityDate)
View(clean_daily_activity)
# For heart rate: We rename the ActivityDate column to Date
clean_heart_rate_sec <- rename(heartrate_seconds, Date = Time)
View(clean_heart_rate_sec)
# For sleep_day: We rename the Sleepday column to Date
clean_sleep_day <- rename(sleep_day, Date = SleepDay)
View(clean_sleep_day)
# I did not rename the date variable in "clean_weight_log" because it is already named "Date"
clean_weight_log <- weight_log_info
clean_daily_activity <- clean_daily_activity %>% janitor::clean_names()
clean_heart_rate_sec <- clean_heart_rate_sec %>% janitor::clean_names()
clean_sleep_day <- clean_sleep_day %>% janitor::clean_names()
clean_weight_log <- clean_weight_log %>% janitor::clean_names()
# For clean_daily_activity: Number of distinct participants
daily_disinct_id <- n_distinct(clean_daily_activity$id)
# There were 33 distinct participants in daily_activity
daily_distinct_date <- n_distinct(clean_daily_activity$date)
# The daily_activity data was tracked for 31 days
daily_min_date <- min(clean_daily_activity$date)
#The earliest date was 2016-04-12
daily_max_date <- max(clean_daily_activity$date)
#The last date was 2016-05-12
#So from the calculations above we can conclude that the daily_activity data was tracked for a duration of 31days between 2016-04-12 nd 2016-05-12 and we had 33 distinct participants/Id's in the data set.
#For clean_heart_rate_sec: Number of distinct participants
heart_distinct_id <- n_distinct(clean_heart_rate_sec$id)
# There were 14 distinct participants in heart_rate_sec
heart_distinct_date <- n_distinct(clean_heart_rate_sec$date)
# The heart_rate_sec data was tracked for 31 days
heart_min_date <- min(clean_heart_rate_sec$date)
# The earliest date was 2016-04-12
heart_max_date <- max(clean_heart_rate_sec$date)
# The last date was 2016-05-12
# So from the calculations above we can conclude that the heart_rate data was tracked for a duration of 31days between 2016-04-12 nd 2016-05-12 and we had 14 distinct participants/Id's in the data set.
#For weight_log: Number of distinct participants
weight_distinct_id <- n_distinct(clean_weight_log$id)
# There were 8 distinct participants
weight_distinct_date <- n_distinct(clean_weight_log$date)
# The data was tracked for 31 days
weight_min_date <- min(clean_weight_log$date)
# The earliest date was 2016-04-12
weight_max_date <- max(clean_weight_log$date)
# The last date was 2016-05-12
# So from the calculations above we can conclude that the weight_log data was tracked for a duration of 31days between 2016-04-12 nd 2016-05-12 and we had 8 distinct participants/Id's in the data set.
#For sleep_day: Number of distinct participants
sleep_distinct_id <- n_distinct(clean_sleep_day$id)
# There were 24 distinct participants
sleep_distinct_date <- n_distinct(clean_sleep_day$date)
# The data was tracked for 31 days
sleep_min_date <- min(clean_sleep_day$date)
# The earliest date was 2016-04-12
sleep_max_date <- max(clean_sleep_day$date)
# The last date was 2016-05-12
# So from the calculations above we can conclude that the sleep_day data was tracked for a duration of 31days between 2016-04-12 nd 2016-05-12 and we had 24 distinct participants/Id's in the data set.
Thank you