R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Case Study: Bellabeat Fitness Application Data.How Can a Wellness Technology Company Play It Smart?

Introduction

This Case Study project to fulfill Google Data Analytics Capstone project on Coursera. This is my first ever deep dive into a data analytics project. For the purposes of this exercise, we will be using public data from a bike share company called Bellabeat. In this scenario, and from now on, Bellabeat will be referred to as Beat. In this exercise, I work for a fictional company, Bellabeat. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

About the company

Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

Questions for the analysis

1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy

Business task

Identify potential opportunities for growth and recommendations for the Bellabeat marketing strategy improvement based on trends in smart device usage.

Process Phase:

In this phase we will process the data by cleaning and ensuring that it is correct,relevant,complete and error free.

-We have to check if data contains any missing or null values -Transform the data into format we want for the analysis

Loading packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(timechange)
library(ggplot2)
library(dplyr)
library(gsubfn)
## Loading required package: proto
library(proto)
library(RSQLite)
library(sqldf)
library(plotrix)
library(skimr)

Upload the dataset

The data that is provided is FitBit Fitness Tracker Data . This dataset has 18 different csv files that range from Daily activity, calories, steps; hourly calories, intensities, and steps; and heart rate, sleep data and weight logs. After looking at the types of data collected by these 30 fitbit users, i discovered: * No water intake data has been collected

These data may not actually assist me, but that will come with exploration. Load the CSV files The data frames I’ll be working with in this case study will be creating objects for:

Daily_Activity <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\case_study2\\Fitabase Data 4.12.16-5.12.16\\dailyActivity_merged.csv")
Sleep_Day <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\case_study2\\Fitabase Data 4.12.16-5.12.16\\sleepDay_merged.csv")
Weight_Log <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\case_study2\\Fitabase Data 4.12.16-5.12.16\\weightLogInfo_merged.csv")
Daily_calories <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\case_study2\\Fitabase Data 4.12.16-5.12.16\\dailyCalories_merged.csv")
Daily_intensities <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\case_study2\\Fitabase Data 4.12.16-5.12.16\\dailyIntensities_merged.csv")

Check and Removing ‘unwanted bad’ data

We need to see if there are any null or missing values in the data. We can check this using the following commands.

str(Daily_Activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
skim(Daily_Activity)
Data summary
Name Daily_Activity
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDate 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09 ▇▅▃▅▅
TotalSteps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04 ▇▇▁▁▁
TotalDistance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01 ▇▆▁▁▁
TrackerDistance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01 ▇▆▁▁▁
LoggedActivitiesDistance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00 ▇▁▁▁▁
VeryActiveDistance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01 ▇▁▁▁▁
ModeratelyActiveDistance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00 ▇▁▁▁▁
LightActiveDistance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01 ▆▇▆▁▁
SedentaryActiveDistance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01 ▇▁▁▁▁
VeryActiveMinutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02 ▇▁▁▁▁
FairlyActiveMinutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02 ▇▁▁▁▁
LightlyActiveMinutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02 ▅▇▇▃▁
SedentaryMinutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03 ▁▁▇▅▇
Calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03 ▁▆▇▃▁
head(Sleep_Day)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
str(Weight_Log)
## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
skim(Weight_Log)
Data summary
Name Weight_Log
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 19 21 0 56 0
IsManualReport 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09 ▁▁▂▇▆
WeightKg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02 ▇▃▃▁▁
WeightPounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02 ▇▃▃▁▁
Fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01 ▇▁▁▁▇
BMI 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01 ▇▁▁▁▁
LogId 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12 ▇▇▆▇▇
head(Weight_Log)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12

After executing these commands we found out the:

So we get to know that there are 940 records in daily calories, daily intensities, and daily_activity data, 413 in daily_sleep and 67 in weight_log. There are no null values present in any of the data set, So there is no requirement to clean the data. But the date column is in character format, so we need to convert it into datetime64 type. I have also created month and day of week column as we need them in analysis.

Daily_Activity$Rec_Date <- as.Date(Daily_Activity$ActivityDate,"%m/%d/%y")
Daily_Activity$month <- format(Daily_Activity$Rec_Date,"%B")
Daily_Activity$day_of_week <- format(Daily_Activity$Rec_Date,"%A")

We are also going to count unique IDs to confirm whether data has 30 IDs as claimed by the survey. We can do this with 2 approaches first using direct function to calculate and second using SQL query.

n_distinct(Daily_Activity$Id)
## [1] 33

There are 33 unique IDs, instead of 30 unique IDs as expected. Some users may have created additional IDs during the survey period.

Now the data cleaning and manipulation is done.Now data is ready to be analyzed.

Analyze Phase:

Let prepare the manipulated data for proper analysis

Now, we need to summarize the data. So that we can derive insights from the ready clean data to help drive decision making. To do - find the suammry of Daily_activity and Weight_log

Daily_Activity %>%  select(TotalSteps,TotalDistance,SedentaryMinutes,    VeryActiveMinutes) %>% summary()
##    TotalSteps    TotalDistance    SedentaryMinutes VeryActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :  0.00   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:  0.00   
##  Median : 7406   Median : 5.245   Median :1057.5   Median :  4.00   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   : 21.16   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.: 32.00   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :210.00
Weight_Log%>%  select(WeightKg,BMI) %>% summary()
##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

statistical interpretation:

1.Findings from Daily Activity data:

2.Findings from weight log:

Avg_minutes_asleep <- sqldf("SELECT SUM(TotalSleepRecords),SUM(TotalMinutesAsleep)/SUM(TotalSleepRecords) as avg_sleeptime
                            FROM Sleep_Day")
Avg_minutes_asleep
##   SUM(TotalSleepRecords) avg_sleeptime
## 1                    462           374
Avg_TimeInBed <- sqldf("SELECT SUM(TotalTimeInBed)/SUM(TotalSleepRecords) as avg_timeInBed
                       FROM Sleep_Day")

Avg_TimeInBed
##   avg_timeInBed
## 1           409

We will also calculate number of distinct records in daily sleep and weight log data.

n_distinct(Sleep_Day$Id)
## [1] 24

Share Phase: Data visualizations

In this step, we will create some visualizations based on our analysis and goal of project.

Let check and count the activity by the week days

Daily_Activity$day_of_week <- ordered(Daily_Activity$day_of_week,levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))

ggplot(data=Daily_Activity) + geom_bar(mapping = aes(x=day_of_week),fill="blue") +
  labs(x="Day of week",y="Count",title="No. of times users used tracker across week")

As we can see, the frequency of usage of FitBit fitness tracker application as showed on the bar graph tend to be high on sunday, monday, and tuesday than other week days. This behaviour is as result of more activity as the week go by that make people get busy in mid week towards the weekend days which can be due to work pressure and they don’t get enough time to track their activity.That’s why people are more active on sunday and two days that follws.

Let correlate the daily_step with the total_calories burns.

mean_steps <- mean(Daily_Activity$TotalSteps)
mean_steps
## [1] 7637.911
mean_calories <- mean(Daily_Activity$Calories)
mean_calories
## [1] 2303.61
Avg_Calories_burn_per_step_taken <-mean_steps/mean_calories
Avg_Calories_burn_per_step_taken
## [1] 3.315627
ggplot(data=Daily_Activity) + geom_point(mapping=aes(x=TotalSteps, y=Calories, color=Calories)) +
geom_hline(mapping = aes(yintercept=mean_calories),color="yellow",lwd=1.0) +
  geom_vline(mapping = aes(xintercept=mean_steps),color="red",lwd=1.0) +
  geom_text(mapping = aes(x=10000,y=500,label="Average Steps",srt=-90)) +
  geom_text(mapping = aes(x=29000,y=2500,label="Average Calories")) +
  labs(x="Steps Taken",y="Calories Burned",title = "Calories burned vs steps taken")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

Calories burned for every step taken

ggplot(data=Daily_Activity, aes(x=TotalSteps, y=SedentaryMinutes, color = Calories)) + geom_point() +
geom_smooth(method = "loess",color="green") + 
labs(x="Total Steps",y="Sedentary Minutes",title="Total Steps vs Sedentary Minutes")
## `geom_smooth()` using formula = 'y ~ x'

Total steps taken vs sedentary minutes

I was expecting a totally inverse relationship between steps taken and sedentary minutes.

Let compare the sleeps time with the time in beds

ggplot(data=Sleep_Day, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point() + stat_smooth(method = lm) +
  labs(x="Total Minutes a sleep", y="Total Time in Bed", title = "Sleep Time vs Time in Bed")
## `geom_smooth()` using formula = 'y ~ x'

Relation between sleep and time in bed

Let compare the Active minutes with the calories burn

ggplot(data=Daily_Activity,aes(x = VeryActiveMinutes, y = Calories, color = Calories)) + geom_point() + 
geom_smooth(method = "loess",color="orange") +
labs(x="Very Active Minutes",y="Calories",title = "Very Active Minutes vs Calories Burned")
## `geom_smooth()` using formula = 'y ~ x'

Relation between very active minutes and calories burned

Let compare the sedentary minutes with the calories burned

ggplot(data=Daily_Activity,aes(x=SedentaryMinutes,y=Calories,color=Calories)) + geom_point() + 
geom_smooth(method="loess",color="red") + 
labs(y="Calories", x="Sedentary Minutes", title="Calories vs. Sedentary Minutes")
## `geom_smooth()` using formula = 'y ~ x'

## Relation between sedentary minutes and calories burned:

I was expecting the relation between sedentary minutes and calories burned to be totally inverse in nature.

Now,we will calculate the sum of individual minute column from daily activity data

This visualization will show the relationships between minutes in Daily Activity for proper data correlation

activity_min <- sqldf("SELECT SUM(VeryActiveMinutes),SUM(FairlyActiveMinutes),
      SUM(LightlyActiveMinutes),SUM(SedentaryMinutes)
      FROM Daily_Activity")
activity_min
##   SUM(VeryActiveMinutes) SUM(FairlyActiveMinutes) SUM(LightlyActiveMinutes)
## 1                  19895                    12751                    181244
##   SUM(SedentaryMinutes)
## 1                931738

As we got the values, we will use these values to plot a 3D pie chart to compare the percentage of activity by minutes.

x <- c(19895,12751,181244,931738)
piepercent <- round(100*x / sum(x), 1)
colors = c("red","blue","green","yellow")
 
pie3D(x,labels = paste0(piepercent,"%"),col = colors,main = "Percentage of Activity in Minutes")
legend("topright",c("VeryActiveMinutes","FairlyActiveMinutes","LightlyActiveMinutes","SedentaryMinutes"), cex=0.75, fill = colors)

## Percentage of activity in Minutes:

As we can see,

Now, we will calculate sum of different distance values from daily activity data:

Activity_dist <- sqldf("SELECT SUM(ModeratelyActiveDistance),SUM(LightActiveDistance),
      SUM(VeryActiveDistance),SUM(SedentaryActiveDistance)
      FROM Daily_Activity")
Activity_dist
##   SUM(ModeratelyActiveDistance) SUM(LightActiveDistance)
## 1                        533.49                  3140.37
##   SUM(VeryActiveDistance) SUM(SedentaryActiveDistance)
## 1                 1412.52                         1.51

As we can see that the values of sedentaryActiveDistance is very less as compare to other distances,So I am excluding it in drawing a 3D pie chart to compare the percentage of activity in minutes.

y <- c(533.49,3140.37,1412.52)

piepercent <- round(100*y / sum(y), 1)
colors = c("orange","green","blue")
pie3D(y,labels = paste0(piepercent,"%"),col=colors,main = "Percentage of Activity in Distance")
legend("topright",c("ModeratelyActiveDistance","LightlyActiveDistance","VeryActiveDistance"),cex=0.75,fill = colors)

## Percentage of activity in distance:

As we can see,

Now, we will calculate the count of people with over weight: The BMI for healthy person is between 18.5 and 24.9 and the persons who’s BMI is above 24.9 are considered to be overweight.(source:CDC)

count_overweight <- sqldf("SELECT COUNT(DISTINCT(Id))
                          FROM Weight_Log
                          WHERE BMI > 24.9")
count_overweight
##   COUNT(DISTINCT(Id))
## 1                   5

As we got the values, we will use these values to plot a 3D pie chart to compare the percentage of people with overweight vs healthy weight.

z <- c(5,3)
piepercent <- round(100*z / sum(z),1)
colors = c("red","green")
pie3D(z,labels=paste0(piepercent,"%"),explode=0.1,col=colors,radius=1,main="Percentage of people with Over Weight vs Healthy Weight")
legend("topright",c("OverWeight","HealthyWeight"), cex=0.75,fill=colors)

We have less number of records,but we can see:

Act Phase:

The goal of analysis is correct as we got many useful insights from the FitBit data,which will help us to make data driven decision. Both companies develop similar kind of products.So,the common trends surrounding health and fitness can also be applied to Bellabeat customers.

Based on the result and analysis I make following recommendations:

1.We have analyzed that most of the people use application to track the steps and calories burned;less number of people use it to track sleep and very few use it to track weight records.So, I will suggest to focus on step,calories and sleep tracking more in application.

2.People prefer to track their activities on Sunday, Monday and Tuesday than other week days.I think this behavior is because people get busier in week end days due to work pressure and they don’t get enough time to track their activity. i recommend using the Bellabeat Application to send notification to remind the potential customers to perpetually track their daily activity to know the actual calories being burn.

3.The relation between steps taken vs calories burned and very active minutes vs calories burned shows positive correlation.So, this can be a good marketing strategy.

4.Majority of users 81.3% who are using the FitBit app are inactive for longer period of time and not using it for tracking their health habits.So, this can be a great chance to use this information for market strategy as Bellabeat can alert people about their sedentary behavior time to time either on application or on tracker itself .

5.Majority of the users 62.5% who are using fitness tracker are overweight.So, that is an opportunity to recommend the of the Bellabeat Application to people and outline how it can help them monitor the weight loss and as well to make them get better healthy conditions.

6.Bellabeat marketing team can encourage users by educating and equipping their potential customers with knowledge about fitness benefits, suggest different types of exercises, calories intake and burn rate information on Bellabeat application.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.