Urška Sršen and Sandro Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(ggplot2)
library(readr)
dailyActivity_merged <- read_csv("/cloud/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourlySteps_merged <- read_csv("/cloud/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleepDay_merged <- read_csv("/cloud/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weightLogInfo_merged <- read_csv("/cloud/weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The hourlySteps_merged data was converted to an average number of steps per minute (the hourly total divided by 60) so that step rates are on a consistent per-minute scale when comparing. The following mutation was applied:
minutesSteps <- mutate(hourlySteps_merged, StepTotalPerMin = StepTotal/60)
head(minutesSteps)
## # A tibble: 6 × 4
## Id ActivityHour StepTotal StepTotalPerMin
## <dbl> <chr> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373 6.22
## 2 1503960366 4/12/2016 1:00:00 AM 160 2.67
## 3 1503960366 4/12/2016 2:00:00 AM 151 2.52
## 4 1503960366 4/12/2016 3:00:00 AM 0 0
## 5 1503960366 4/12/2016 4:00:00 AM 0 0
## 6 1503960366 4/12/2016 5:00:00 AM 0 0
The date columns in the dailyActivity, hourlySteps (from which minutesSteps is then rebuilt), sleepDay, and weightLogInfo data were converted from character strings to a standard date-time format, and the date and time were then split into separate columns:
dailyActivity_merged$ActivityDate=as.POSIXct(dailyActivity_merged$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
dailyActivity_merged$date <- format(dailyActivity_merged$ActivityDate, format = "%m/%d/%y")
head(dailyActivity_merged)
## # A tibble: 6 × 16
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <dttm> <dbl> <dbl> <dbl>
## 1 1503960366 2016-04-12 00:00:00 13162 8.5 8.5
## 2 1503960366 2016-04-13 00:00:00 10735 6.97 6.97
## 3 1503960366 2016-04-14 00:00:00 10460 6.74 6.74
## 4 1503960366 2016-04-15 00:00:00 9762 6.28 6.28
## 5 1503960366 2016-04-16 00:00:00 12669 8.16 8.16
## 6 1503960366 2016-04-17 00:00:00 9705 6.48 6.48
## # … with 11 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## # date <chr>
hourlySteps_merged$ActivityHour = as.POSIXct(hourlySteps_merged$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourlySteps_merged$time <- format(hourlySteps_merged$ActivityHour, format = "%H:%M:%S")
hourlySteps_merged$date <- format(hourlySteps_merged$ActivityHour, format = "%m/%d/%y")
head(hourlySteps_merged)
## # A tibble: 6 × 5
## Id ActivityHour StepTotal time date
## <dbl> <dttm> <dbl> <chr> <chr>
## 1 1503960366 2016-04-12 00:00:00 373 00:00:00 04/12/16
## 2 1503960366 2016-04-12 01:00:00 160 01:00:00 04/12/16
## 3 1503960366 2016-04-12 02:00:00 151 02:00:00 04/12/16
## 4 1503960366 2016-04-12 03:00:00 0 03:00:00 04/12/16
## 5 1503960366 2016-04-12 04:00:00 0 04:00:00 04/12/16
## 6 1503960366 2016-04-12 05:00:00 0 05:00:00 04/12/16
## Code re-entered to recreate the dataset with new format
minutesSteps <- mutate(hourlySteps_merged, StepTotalPerMin = StepTotal/60)
head(minutesSteps)
## # A tibble: 6 × 6
## Id ActivityHour StepTotal time date StepTotalPerMin
## <dbl> <dttm> <dbl> <chr> <chr> <dbl>
## 1 1503960366 2016-04-12 00:00:00 373 00:00:00 04/12/16 6.22
## 2 1503960366 2016-04-12 01:00:00 160 01:00:00 04/12/16 2.67
## 3 1503960366 2016-04-12 02:00:00 151 02:00:00 04/12/16 2.52
## 4 1503960366 2016-04-12 03:00:00 0 03:00:00 04/12/16 0
## 5 1503960366 2016-04-12 04:00:00 0 04:00:00 04/12/16 0
## 6 1503960366 2016-04-12 05:00:00 0 05:00:00 04/12/16 0
sleepDay_merged$SleepDay =as.POSIXct(sleepDay_merged$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleepDay_merged$time <- format(sleepDay_merged$SleepDay, format = "%H:%M:%S")
sleepDay_merged$date <- format(sleepDay_merged$SleepDay, format = "%m/%d/%y")
head(sleepDay_merged)
## # A tibble: 6 × 7
## Id SleepDay TotalSleepRecords TotalMinutesAsl… TotalTimeInBed
## <dbl> <dttm> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 00:00:00 1 327 346
## 2 1.50e9 2016-04-13 00:00:00 2 384 407
## 3 1.50e9 2016-04-15 00:00:00 1 412 442
## 4 1.50e9 2016-04-16 00:00:00 2 340 367
## 5 1.50e9 2016-04-17 00:00:00 1 700 712
## 6 1.50e9 2016-04-19 00:00:00 1 304 320
## # … with 2 more variables: time <chr>, date <chr>
weightLogInfo_merged$Date = as.POSIXct(weightLogInfo_merged$Date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
weightLogInfo_merged$time <- format(weightLogInfo_merged$Date, format = "%H:%M:%S")
weightLogInfo_merged$date <- format(weightLogInfo_merged$Date, format = "%m/%d/%y")
head(weightLogInfo_merged)
## # A tibble: 6 × 10
## Id Date WeightKg WeightPounds Fat BMI IsManualReport
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 1.50e9 2016-05-02 23:59:59 52.6 116. 22 22.6 TRUE
## 2 1.50e9 2016-05-03 23:59:59 52.6 116. NA 22.6 TRUE
## 3 1.93e9 2016-04-13 01:08:52 134. 294. NA 47.5 FALSE
## 4 2.87e9 2016-04-21 23:59:59 56.7 125. NA 21.5 TRUE
## 5 2.87e9 2016-05-12 23:59:59 57.3 126. NA 21.7 TRUE
## 6 4.32e9 2016-04-17 23:59:59 72.4 160. 25 27.5 TRUE
## # … with 3 more variables: LogId <dbl>, time <chr>, date <chr>
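The conversions above all follow the same pattern: parse the character column with a format string, then derive separate date and time strings from the parsed value. A minimal, self-contained sketch of that pattern is shown below. The two input strings are illustrative, and `tz = "UTC"` is an assumption made here so the result does not depend on the machine's time zone (the report itself uses `Sys.timezone()`). Parsing `%p` assumes an English-style AM/PM locale.

```r
# Illustrative timestamps in the same layout as the ActivityHour column
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# %I is the 12-hour clock and %p the AM/PM marker; tz = "UTC" is an assumption
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Split the parsed value into a date string and a 24-hour time string
split_cols <- data.frame(
  date = format(parsed, "%m/%d/%y"),
  time = format(parsed, "%H:%M:%S")
)
split_cols
```

Any row that fails to match the format string becomes `NA`, so checking `sum(is.na(parsed))` after a conversion is a cheap way to catch a wrong format.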
The n_distinct() function was used to determine the number of participants in each dataset:
n_distinct(dailyActivity_merged$Id)
## [1] 33
n_distinct(minutesSteps$Id)
## [1] 33
n_distinct(sleepDay_merged$Id)
## [1] 24
n_distinct(weightLogInfo_merged$Id)
## [1] 8
The results show 33 participants in dailyActivity, 33 in minutesSteps, 24 in sleepDay, and 8 in weightLogInfo.
Summing these per-dataset counts gives:
Total_participants <- n_distinct(dailyActivity_merged$Id) + n_distinct(minutesSteps$Id) +
n_distinct(sleepDay_merged$Id) +
n_distinct(weightLogInfo_merged$Id)
Total_participants
## [1] 98
This figure (98) is the sum of the per-dataset counts, not the number of unique people: a participant who appears in several datasets is counted once for each. No single dataset contains more than 33 distinct Ids, and only 8 participants logged their weight, so the sample is small and any conclusions drawn from it should be treated as indicative rather than unbiased.
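Because the same Id can appear in more than one file, the number of unique participants is better taken over the combined Ids than by adding per-file counts. A minimal sketch of the difference, using made-up Id vectors (the values below are hypothetical, not taken from the Fitbit files):

```r
# Hypothetical Ids: the sleep and weight participants also appear in activity
ids_activity <- c(101, 102, 103, 104)
ids_sleep    <- c(102, 103)
ids_weight   <- c(103)

# Adding per-dataset counts counts overlapping participants more than once
summed_counts <- length(unique(ids_activity)) +
  length(unique(ids_sleep)) +
  length(unique(ids_weight))

# Counting distinct Ids over the combined vector counts each person once
unique_participants <- length(unique(c(ids_activity, ids_sleep, ids_weight)))

summed_counts
unique_participants
```

With the real data, the same idea would apply to the Id columns of the four data frames, for example by passing their concatenation to `n_distinct()`.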
The dailyActivity and minutesSteps datasets, and the dailyActivity and weightLogInfo datasets, were merged on Id and date to make it easier to compare variables and identify trends when plotting.
merged_activity_minuteSteps <- merge(dailyActivity_merged, minutesSteps,by=c('Id', 'date'))
head(merged_activity_minuteSteps)
## Id date ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 04/12/16 2016-04-12 13162 8.5 8.5
## 2 1503960366 04/12/16 2016-04-12 13162 8.5 8.5
## 3 1503960366 04/12/16 2016-04-12 13162 8.5 8.5
## 4 1503960366 04/12/16 2016-04-12 13162 8.5 8.5
## 5 1503960366 04/12/16 2016-04-12 13162 8.5 8.5
## 6 1503960366 04/12/16 2016-04-12 13162 8.5 8.5
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.88 0.55
## 3 0 1.88 0.55
## 4 0 1.88 0.55
## 5 0 1.88 0.55
## 6 0 1.88 0.55
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 6.06 0 25
## 3 6.06 0 25
## 4 6.06 0 25
## 5 6.06 0 25
## 6 6.06 0 25
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 13 328 728 1985
## 3 13 328 728 1985
## 4 13 328 728 1985
## 5 13 328 728 1985
## 6 13 328 728 1985
## ActivityHour StepTotal time StepTotalPerMin
## 1 2016-04-12 00:00:00 373 00:00:00 6.216667
## 2 2016-04-12 01:00:00 160 01:00:00 2.666667
## 3 2016-04-12 02:00:00 151 02:00:00 2.516667
## 4 2016-04-12 03:00:00 0 03:00:00 0.000000
## 5 2016-04-12 04:00:00 0 04:00:00 0.000000
## 6 2016-04-12 05:00:00 0 05:00:00 0.000000
merged_activity_weight <- merge(dailyActivity_merged, weightLogInfo_merged, by=c('Id', 'date'))
head(merged_activity_weight)
## Id date ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 05/02/16 2016-05-02 14727 9.71 9.71
## 2 1503960366 05/03/16 2016-05-03 15103 9.66 9.66
## 3 1927972279 04/13/16 2016-04-13 356 0.25 0.25
## 4 2873212765 04/21/16 2016-04-21 8859 5.98 5.98
## 5 2873212765 05/12/16 2016-05-12 7566 5.11 5.11
## 6 4319703577 04/17/16 2016-04-17 29 0.02 0.02
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 3.21 0.57
## 2 0 3.73 1.05
## 3 0 0.00 0.00
## 4 0 0.13 0.37
## 5 0 0.00 0.00
## 6 0 0.00 0.00
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 5.92 0.00 41
## 2 4.88 0.00 50
## 3 0.25 0.00 0
## 4 5.47 0.01 2
## 5 5.11 0.00 0
## 6 0.02 0.00 0
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 15 277 798 2004
## 2 24 254 816 1990
## 3 0 32 986 2151
## 4 10 371 1057 1970
## 5 0 268 720 1431
## 6 0 3 1363 1464
## Date WeightKg WeightPounds Fat BMI IsManualReport
## 1 2016-05-02 23:59:59 52.6 115.9631 22 22.65 TRUE
## 2 2016-05-03 23:59:59 52.6 115.9631 NA 22.65 TRUE
## 3 2016-04-13 01:08:52 133.5 294.3171 NA 47.54 FALSE
## 4 2016-04-21 23:59:59 56.7 125.0021 NA 21.45 TRUE
## 5 2016-05-12 23:59:59 57.3 126.3249 NA 21.69 TRUE
## 6 2016-04-17 23:59:59 72.4 159.6147 25 27.45 TRUE
## LogId time
## 1 1.462234e+12 23:59:59
## 2 1.462320e+12 23:59:59
## 3 1.460510e+12 01:08:52
## 4 1.461283e+12 23:59:59
## 5 1.463098e+12 23:59:59
## 6 1.460938e+12 23:59:59
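One point worth keeping in mind with the first merge: joining a daily table to an hourly table on Id and date repeats each daily row once per matching hourly row, which is why TotalSteps is identical in the six rows shown for merged_activity_minuteSteps. A small sketch with hypothetical toy tables illustrates the effect and one way to get back to a single row per participant-day:

```r
# Hypothetical toy tables: one daily row and three hourly rows for the same day
daily <- data.frame(Id = 1, date = "04/12/16", TotalSteps = 684)
hourly <- data.frame(
  Id = c(1, 1, 1),
  date = "04/12/16",
  StepTotal = c(373, 160, 151)
)

# merge() performs an inner join on the shared key columns by default
merged <- merge(daily, hourly, by = c("Id", "date"))

# The single daily row now appears once per hourly row
nrow(merged)

# Keep one row per participant-day before summarising daily columns
daily_again <- unique(merged[, c("Id", "date", "TotalSteps")])
nrow(daily_again)
```

Summaries of daily columns computed on the merged table are therefore weighted by the number of hourly rows per day; summarising the daily table directly avoids that.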
The minimum, median, maximum, and other summary statistics for the datasets were obtained with the summary() function.
merged_activity_minuteSteps %>%
select(Id, date, TotalSteps, TotalDistance, Calories, StepTotalPerMin, StepTotal) %>%
summary()
## Id date TotalSteps TotalDistance
## Min. :1.504e+09 Length:22099 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3974 1st Qu.: 2.680
## Median :4.445e+09 Mode :character Median : 7604 Median : 5.320
## Mean :4.848e+09 Mean : 7752 Mean : 5.572
## 3rd Qu.:6.962e+09 3rd Qu.:10771 3rd Qu.: 7.750
## Max. :8.878e+09 Max. :36019 Max. :28.030
## Calories StepTotalPerMin StepTotal
## Min. : 120 Min. : 0.0000 Min. : 0.0
## 1st Qu.:1841 1st Qu.: 0.0000 1st Qu.: 0.0
## Median :2162 Median : 0.6667 Median : 40.0
## Mean :2336 Mean : 5.3361 Mean : 320.2
## 3rd Qu.:2799 3rd Qu.: 5.9500 3rd Qu.: 357.0
## Max. :4900 Max. :175.9000 Max. :10554.0
merged_activity_weight %>%
select(Id, TotalSteps, TotalDistance, Calories, WeightKg, Fat, BMI) %>%
summary()
## Id TotalSteps TotalDistance Calories
## Min. :1.504e+09 Min. : 29 Min. : 0.020 Min. : 928
## 1st Qu.:6.962e+09 1st Qu.: 8477 1st Qu.: 5.945 1st Qu.:1998
## Median :6.962e+09 Median :11101 Median : 8.110 Median :2174
## Mean :7.009e+09 Mean :12102 Mean : 9.211 Mean :2545
## 3rd Qu.:8.878e+09 3rd Qu.:14996 3rd Qu.: 9.710 3rd Qu.:3258
## Max. :8.878e+09 Max. :29326 Max. :26.720 Max. :4552
##
## WeightKg Fat BMI
## Min. : 52.60 Min. :22.00 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:22.75 1st Qu.:23.96
## Median : 62.50 Median :23.50 Median :24.39
## Mean : 72.04 Mean :23.50 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:24.25 3rd Qu.:25.56
## Max. :133.50 Max. :25.00 Max. :47.54
## NA's :65
sleepDay_merged %>%
select(Id, TotalMinutesAsleep, TotalTimeInBed) %>%
summary()
## Id TotalMinutesAsleep TotalTimeInBed
## Min. :1.504e+09 Min. : 58.0 Min. : 61.0
## 1st Qu.:3.977e+09 1st Qu.:361.0 1st Qu.:403.0
## Median :4.703e+09 Median :433.0 Median :463.0
## Mean :5.001e+09 Mean :419.5 Mean :458.6
## 3rd Qu.:6.962e+09 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :8.792e+09 Max. :796.0 Max. :961.0
The median sleep time converted to hours (433 / 60):
SleepTimeHr <- 433/60
SleepTimeHr
## [1] 7.216667
Observations from the summaries are:
On average, participants fell slightly short of the activity benchmark: the mean was 7,752 total steps per day (median 7,604), compared with the roughly 8,000 steps per day used here as the recommended level.
The mean logged weight and BMI were 72.04 kg and 25.19, respectively, which is just above the healthy BMI range of 18.5 to <25. The median BMI (24.39) is within that range, however, and only 8 participants logged their weight, so this does not show that the majority of participants were overweight.
Sleep was close to the 7-8 hours recommended for adults: the median was 7.22 hours (433 minutes) and the mean was about 6.99 hours (419.5 minutes).
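A mean on its own does not show what the majority of participants do. One way to check a "majority" claim is to average steps per participant and then compute the share of participants below the benchmark. The sketch below uses hypothetical data and base R only; the 8,000-step threshold is the benchmark mentioned in the observations, and the data frame and column values are invented for illustration.

```r
# Hypothetical daily records for three participants
daily <- data.frame(
  Id = c(1, 1, 2, 2, 3, 3),
  TotalSteps = c(9000, 11000, 4000, 6000, 7000, 8000)
)

benchmark <- 8000  # steps per day, as used in the observations

# Mean daily steps per participant
mean_steps <- tapply(daily$TotalSteps, daily$Id, mean)

# Share of participants whose mean falls below the benchmark
share_below <- mean(mean_steps < benchmark)

mean_steps
share_below
```

A share above 0.5 would support saying that most participants fall below the benchmark; the overall mean cannot establish that by itself.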
The following plots show the trends of Calories burnt with the distance walked by individuals.
ggplot(data = dailyActivity_merged) +
geom_point(mapping = aes(x = TotalDistance, y = Calories)) +
geom_smooth(mapping = aes(x = TotalDistance, y = Calories)) +
labs(title="Total Distance vs Calories")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There is a positive relationship between total distance and calories, which suggests that the more active individuals are, the more calories they burn.
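The strength of a relationship like this can be quantified with a correlation coefficient rather than read off the plot alone. A minimal sketch using base R's cor() on hypothetical values (with the real data, the inputs would be the TotalDistance and Calories columns of dailyActivity_merged):

```r
# Hypothetical paired observations, invented for illustration
total_distance <- c(0.5, 2.7, 5.3, 7.8, 9.7)
calories       <- c(1500, 1850, 2150, 2800, 3200)

# Pearson correlation (the default) measures linear association
r_pearson <- cor(total_distance, calories)

# Spearman correlation measures monotonic association using ranks
r_spearman <- cor(total_distance, calories, method = "spearman")

r_pearson
r_spearman
```

A coefficient near 1 indicates a strong positive association, but it does not show that distance alone drives calories burned, since calorie estimates also depend on factors such as body weight.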
Each participant's total daily steps were compared with their average steps per minute in each hour, with points shaded by very active minutes:
ggplot(data = merged_activity_minuteSteps) +
geom_point(mapping = aes(x = TotalSteps, y = StepTotalPerMin, color = VeryActiveMinutes)) +
labs(title="Total Steps vs StepTotalPerMin")
The scatter plot above approximates the pace at which individuals walked or ran. The most active individuals (those with more very active minutes) reach higher steps per minute, meaning they covered more steps in a shorter time, which is consistent with running or moving at a greater speed. Conversely, less active individuals recorded fewer steps over the same time frame.
The following plot shows the weight distribution of the participants:
ggplot(data = weightLogInfo_merged) +
geom_histogram(mapping = aes(x = WeightKg)) +
labs(title= "Weight (kg) distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the weight distribution we can see that most logged weights were between 60 and 90 kg, which agrees with the median weight of 62.5 kg. The lowest weight was 52.60 kg and the highest was 133.50 kg.
The relationship between BMI and weight was also investigated.
ggplot(data = merged_activity_weight) +
geom_point(mapping = aes(x = WeightKg, y = BMI)) +
geom_smooth(mapping = aes(x = WeightKg, y = BMI)) +
labs(title= "BMI vs Weight (kg)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot of BMI against weight shows a positive trend: as weight increases, so does BMI.
For reference, a BMI below 18.5 falls within the underweight range, 18.5 to <25 within the healthy weight range, 25.0 to <30 within the overweight range, and 30.0 or higher within the obesity range.
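These ranges can be applied to a BMI column with base R's cut(). The sketch below is one possible way to do it, not part of the original analysis; the four BMI values are taken from the first rows of the weight log shown earlier, and the label names are chosen here for illustration.

```r
# BMI values from the first rows of the weight log output
bmi <- c(21.45, 22.65, 27.45, 47.54)

# right = FALSE makes each interval closed on the left: [18.5, 25), [25, 30), ...
bmi_category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy weight", "Overweight", "Obesity"),
  right = FALSE
)

# Count how many values fall into each range
table(bmi_category)
```

Using `right = FALSE` matters at the boundaries: a BMI of exactly 25.0 is then classed as overweight, matching the "18.5 to <25" wording.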
From the data we also analysed the activity of individuals relative to their weight to determine which ones were more active and the following plot shows our findings:
ggplot(data = merged_activity_weight) +
geom_point(mapping = aes(x = WeightKg, y = FairlyActiveMinutes)) +
geom_smooth(mapping = aes(x = WeightKg, y = FairlyActiveMinutes)) +
labs(title= "Fairly Active Minutes vs Weight (kg)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From these findings, participants weighing approximately 60 kg appear to be more active than those at higher weights. There is, however, a peak of activity among participants weighing approximately 90 kg, which may be due to them wanting to lose weight.
In conclusion, the participants were on average slightly less active than recommended, and fitness companies such as Bellabeat could find innovative ways to encourage their target audience, women, to become healthier. They could also use smartphone notifications to alert users about their health and suggest steps to improve it.
In answer to the case study question, I would recommend that Bellabeat take advantage of these findings and use innovative ways to help women improve their lifestyle and health. It should use these insights in its marketing and encourage women to use its products to improve their health and lifestyle through daily notifications of their steps, sleep, calorie, and other metrics.
I would like to thank the Google Data Analytics course for the opportunity to exercise my skills. This is my project using R.
Regards, Tshepo Molefe