Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
In this case study, the datasets come from the FitBit Fitness Tracker Data.
activity <- read_csv("dailyActivity_merged.csv")
calories <- read_csv("dailyCalories_merged.csv")
heartrate <- read_csv("heartrate_seconds_merged.csv")
intensities <- read_csv("dailyIntensities_merged.csv")
sleep <- read_csv("sleepDay_merged.csv")
steps <- read_csv("dailySteps_merged.csv")
steps_hourly <- read_csv("hourlySteps_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")
Now that we have imported the relevant datasets into our working environment, let’s look at the basic structure of each one.
head(activity)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
colSums(is.na(activity))
## Id ActivityDate TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
head(calories)
## # A tibble: 6 × 3
## Id ActivityDay Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
colSums(is.na(calories))
## Id ActivityDay Calories
## 0 0 0
head(intensities)
## # A tibble: 6 × 10
## Id Activ…¹ Seden…² Light…³ Fairl…⁴ VeryA…⁵ Seden…⁶ Light…⁷ Moder…⁸ VeryA…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 728 328 13 25 0 6.06 0.550 1.88
## 2 1.50e9 4/13/2… 776 217 19 21 0 4.71 0.690 1.57
## 3 1.50e9 4/14/2… 1218 181 11 30 0 3.91 0.400 2.44
## 4 1.50e9 4/15/2… 726 209 34 29 0 2.83 1.26 2.14
## 5 1.50e9 4/16/2… 773 221 10 36 0 5.04 0.410 2.71
## 6 1.50e9 4/17/2… 539 164 20 38 0 2.51 0.780 3.19
## # … with abbreviated variable names ¹ActivityDay, ²SedentaryMinutes,
## # ³LightlyActiveMinutes, ⁴FairlyActiveMinutes, ⁵VeryActiveMinutes,
## # ⁶SedentaryActiveDistance, ⁷LightActiveDistance, ⁸ModeratelyActiveDistance,
## # ⁹VeryActiveDistance
colSums(is.na(intensities))
## Id ActivityDay SedentaryMinutes
## 0 0 0
## LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes
## 0 0 0
## SedentaryActiveDistance LightActiveDistance ModeratelyActiveDistance
## 0 0 0
## VeryActiveDistance
## 0
head(steps)
## # A tibble: 6 × 3
## Id ActivityDay StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
colSums(is.na(steps))
## Id ActivityDay StepTotal
## 0 0 0
head(weight)
## # A tibble: 6 × 8
## Id Date WeightKg Weight…¹ Fat BMI IsMan…² LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 160. 25 27.5 TRUE 1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport
colSums(is.na(weight))
## Id Date WeightKg WeightPounds Fat
## 0 0 0 0 65
## BMI IsManualReport LogId
## 0 0 0
The Fat column has 65 missing values, which is the majority of the weight log, so we will not use it in this project.
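As a rough sketch of how that decision could be made explicit, the share of missing values per column can be computed directly. The data frame below is a toy stand-in with made-up values, and the 50% cut-off is an assumption chosen for illustration, not a rule from the case study.

```r
# Toy stand-in for the weight log; values are illustrative only.
weight_demo <- data.frame(
  Id  = c(1, 1, 2, 3),
  Fat = c(22, NA, NA, NA),
  BMI = c(22.6, 22.6, 47.5, 21.5)
)

# colMeans() on a logical matrix gives the fraction of TRUE values per column,
# which here is the share of missing entries.
na_share <- colMeans(is.na(weight_demo))
print(na_share)

# Keep only columns with at most half of their values missing
# (the 0.5 threshold is an assumption for this sketch).
keep <- names(na_share)[na_share <= 0.5]
weight_kept <- weight_demo[, keep, drop = FALSE]
print(names(weight_kept))
```

Working with shares rather than raw counts makes columns comparable across tables of different sizes.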
sum(duplicated(activity))
## [1] 0
sum(duplicated(calories))
## [1] 0
sum(duplicated(heartrate))
## [1] 0
sum(duplicated(intensities))
## [1] 0
sum(duplicated(sleep))
## [1] 3
sum(duplicated(steps))
## [1] 0
sum(duplicated(steps_hourly))
## [1] 0
sum(duplicated(weight))
## [1] 0
Since the sleep dataset contains three duplicate rows, we have to remove them.
sleep <- sleep %>%
  distinct()
# recheck for duplicates
sum(duplicated(sleep))
## [1] 0
Rechecking after running the code confirms that the duplicates were removed.
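Before dropping duplicates it can help to look at which rows are affected. The sketch below uses a small made-up table (sleep_demo is a hypothetical name) to show that duplicated() flags only the repeated copies and that distinct() keeps the first occurrence of each row.

```r
library(dplyr)

# Made-up rows; the second row is an exact copy of the first.
sleep_demo <- tibble(
  id = c(1, 1, 1, 2),
  date = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  totalminutesasleep = c(327, 327, 384, 400)
)

# Rows that repeat an earlier row exactly
dupes <- sleep_demo[duplicated(sleep_demo), ]
print(dupes)

# distinct() keeps one copy of every unique row
sleep_clean <- distinct(sleep_demo)
print(nrow(sleep_clean))
```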
n_distinct(activity$Id)
## [1] 33
n_distinct(calories$Id)
## [1] 33
n_distinct(heartrate$Id)
## [1] 14
n_distinct(intensities$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(steps$Id)
## [1] 33
n_distinct(steps_hourly$Id)
## [1] 33
n_distinct(weight$Id)
## [1] 8
These counts show how many distinct users contributed to each dataset. The heartrate dataset covers only 14 users and the weight dataset only 8, compared with 33 for the activity tables, so we will not use them in this project.
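The same check can be run over all tables in one pass. This is a minimal sketch with toy tables; the list name tables and the 50% coverage threshold are assumptions made for the example.

```r
# Toy tables standing in for the imported data frames (illustrative values).
tables <- list(
  activity  = data.frame(Id = c(1, 1, 2, 3)),
  heartrate = data.frame(Id = c(1, 1, 1)),
  weight    = data.frame(Id = c(2, 2))
)

# Number of distinct users in each table
users_per_table <- vapply(tables, function(df) length(unique(df$Id)), integer(1))
print(users_per_table)

# Distinct users across all tables combined
total_users <- length(unique(unlist(lapply(tables, `[[`, "Id"))))

# Tables that cover fewer than half of all users (assumed threshold)
low_coverage <- names(users_per_table)[users_per_table / total_users < 0.5]
print(low_coverage)
```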
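Next we standardize the column names across all tables. The clean_names() calls below only print a snake_case preview because their results are not assigned. The names that are actually stored come from rename_with(tolower), so the columns end up lowercase without underscores (for example, activitydate rather than activity_date), and the later code relies on those names.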
clean_names(activity)
## # A tibble: 940 × 15
## id activity…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5 0 1.88 0.550 6.06
## 2 1503960366 4/13/2016 10735 6.97 6.97 0 1.57 0.690 4.71
## 3 1503960366 4/14/2016 10460 6.74 6.74 0 2.44 0.400 3.91
## 4 1503960366 4/15/2016 9762 6.28 6.28 0 2.14 1.26 2.83
## 5 1503960366 4/16/2016 12669 8.16 8.16 0 2.71 0.410 5.04
## 6 1503960366 4/17/2016 9705 6.48 6.48 0 3.19 0.780 2.51
## 7 1503960366 4/18/2016 13019 8.59 8.59 0 3.25 0.640 4.71
## 8 1503960366 4/19/2016 15506 9.88 9.88 0 3.53 1.32 5.03
## 9 1503960366 4/20/2016 10544 6.68 6.68 0 1.96 0.480 4.24
## 10 1503960366 4/21/2016 9819 6.34 6.34 0 1.34 0.350 4.65
## # … with 930 more rows, 6 more variables: sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## # abbreviated variable names ¹activity_date, ²total_steps, ³total_distance,
## # ⁴tracker_distance, ⁵logged_activities_distance, ⁶very_active_distance,
## # ⁷moderately_active_distance, ⁸light_active_distance
activity <- rename_with(activity, tolower)
clean_names(calories)
## # A tibble: 940 × 3
## id activity_day calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
## 7 1503960366 4/18/2016 1921
## 8 1503960366 4/19/2016 2035
## 9 1503960366 4/20/2016 1786
## 10 1503960366 4/21/2016 1775
## # … with 930 more rows
calories <- rename_with(calories, tolower)
clean_names(intensities)
## # A tibble: 940 × 10
## id activity…¹ seden…² light…³ fairl…⁴ very_…⁵ seden…⁶ light…⁷ moder…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 728 328 13 25 0 6.06 0.550
## 2 1503960366 4/13/2016 776 217 19 21 0 4.71 0.690
## 3 1503960366 4/14/2016 1218 181 11 30 0 3.91 0.400
## 4 1503960366 4/15/2016 726 209 34 29 0 2.83 1.26
## 5 1503960366 4/16/2016 773 221 10 36 0 5.04 0.410
## 6 1503960366 4/17/2016 539 164 20 38 0 2.51 0.780
## 7 1503960366 4/18/2016 1149 233 16 42 0 4.71 0.640
## 8 1503960366 4/19/2016 775 264 31 50 0 5.03 1.32
## 9 1503960366 4/20/2016 818 205 12 28 0 4.24 0.480
## 10 1503960366 4/21/2016 838 211 8 19 0 4.65 0.350
## # … with 930 more rows, 1 more variable: very_active_distance <dbl>, and
## # abbreviated variable names ¹activity_day, ²sedentary_minutes,
## # ³lightly_active_minutes, ⁴fairly_active_minutes, ⁵very_active_minutes,
## # ⁶sedentary_active_distance, ⁷light_active_distance,
## # ⁸moderately_active_distance
intensities <- rename_with(intensities, tolower)
clean_names(sleep)
## # A tibble: 410 × 5
## id sleep_day total_sleep_records total_minutes_…¹ total…²
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # … with 400 more rows, and abbreviated variable names ¹total_minutes_asleep,
## # ²total_time_in_bed
sleep <- rename_with(sleep, tolower)
clean_names(steps)
## # A tibble: 940 × 3
## id activity_day step_total
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
## 7 1503960366 4/18/2016 13019
## 8 1503960366 4/19/2016 15506
## 9 1503960366 4/20/2016 10544
## 10 1503960366 4/21/2016 9819
## # … with 930 more rows
steps <- rename_with(steps, tolower)
clean_names(steps_hourly)
## # A tibble: 22,099 × 3
## id activity_hour step_total
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
## 7 1503960366 4/12/2016 6:00:00 AM 0
## 8 1503960366 4/12/2016 7:00:00 AM 0
## 9 1503960366 4/12/2016 8:00:00 AM 250
## 10 1503960366 4/12/2016 9:00:00 AM 1864
## # … with 22,089 more rows
steps_hourly <- rename_with(steps_hourly, tolower)
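The calls above print the output of clean_names() but assign the result of rename_with(tolower). A small sketch of how the two naming schemes differ, using make_clean_names() from janitor on a few column names taken from the tables:

```r
library(janitor)

original <- c("Id", "ActivityDate", "TotalMinutesAsleep")

# Names as clean_names() would produce them (snake_case)
snake <- make_clean_names(original)

# Names as rename_with(tolower) produces them (lowercase, no underscores)
lower <- tolower(original)

print(snake)
print(lower)
```

Either scheme works as long as the later code uses the same one consistently.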
While checking the datasets above, I noticed that every date column was stored as character (chr), so the next step converts them to proper date types.
# activity
activity <- activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# calories
calories <- calories %>%
  rename(date = activityday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# intensities
intensities <- intensities %>%
  rename(date = activityday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# sleep
sleep <- sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# steps
steps <- steps %>%
  rename(date = activityday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
# steps_hourly
steps_hourly <- steps_hourly %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = mdy_hms(date_time))
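A failed parse produces NA rather than an error, so it is worth checking the result after a conversion like the ones above. The following is a small sketch on hand-written strings in the same formats as the source files; the variable names are hypothetical.

```r
library(lubridate)

raw_day  <- c("4/12/2016", "4/13/2016")
raw_hour <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# Daily tables: month/day/year strings to Date
day_parsed <- as_date(raw_day, format = "%m/%d/%Y")

# Hourly table: month/day/year plus a 12-hour clock time to POSIXct
hour_parsed <- mdy_hms(raw_hour)

# Any value that could not be parsed would show up as NA here
stopifnot(!anyNA(day_parsed), !anyNA(hour_parsed))

print(class(day_parsed))
print(hour(hour_parsed))
```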
Since I will be merging some of the datasets, I want the date columns to be consistent across all of them to avoid errors later.
steps_hourly <- steps_hourly %>%
  separate(date_time, into = c("date", "time"), sep = " ") %>%
  mutate(date = ymd(date))
Separating the date and the time makes later analysis easier, for example when grouping by date or by time of day.
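An alternative way to get the same two columns, sketched here in base R on made-up values, is to format the date and time parts explicitly instead of splitting the printed timestamp on a space. This is not the approach used above, just a variant that does not depend on how the timestamp is converted to text.

```r
# Made-up hourly records; steps_demo is a hypothetical name.
date_time <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 13:00:00"), tz = "UTC")
steps_demo <- data.frame(date_time = date_time, steptotal = c(373, 1864))

# Date part as a Date, time part as an "HH:MM:SS" string
steps_demo$date <- as.Date(steps_demo$date_time, tz = "UTC")
steps_demo$time <- format(steps_demo$date_time, "%H:%M:%S", tz = "UTC")

print(steps_demo)
```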
activity_sleep <- merge(activity, sleep, by = c("id", "date"))
calories_intensities <- merge(calories, intensities, by = c("id", "date"))
calories_intensities$totalminutes <- calories_intensities$lightlyactiveminutes + calories_intensities$fairlyactiveminutes + calories_intensities$veryactiveminutes
calories_steps <- merge(calories, steps, by = c("id", "date"))
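Note that merge() performs an inner join by default, so only id and date pairs present in both tables survive. The toy example below (made-up values, hypothetical table names) shows how the row count changes when all.x = TRUE is used to keep unmatched rows from the first table.

```r
# Three activity days, but sleep was logged on only two of them.
activity_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162, 10735, 5000)
)
sleep_demo <- data.frame(
  id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  totalminutesasleep = c(327, 400)
)

# Default: inner join, unmatched activity days are dropped
inner <- merge(activity_demo, sleep_demo, by = c("id", "date"))

# all.x = TRUE: keep every activity day, filling missing sleep values with NA
left <- merge(activity_demo, sleep_demo, by = c("id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```

For the plots that follow, the inner join is a reasonable choice because each point needs both values.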
ggplot(data = calories_intensities, mapping = aes(x = totalminutes, y = calories)) +
geom_jitter() + geom_smooth(method = lm) + labs(title = "Active Minutes VS Calories")
## `geom_smooth()` using formula 'y ~ x'
This plot shows a positive relationship between total active minutes and calories burned.
ggplot(data = calories_steps, mapping = aes(x = steptotal, y = calories)) +
geom_point() + geom_smooth(method = lm) + labs(title = "Total Steps VS Calories")
## `geom_smooth()` using formula 'y ~ x'
This helps confirm the first graph: the more active users are, the more calories they burn.
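To put a number on a trend like this, one option is to compute the correlation and the slope of the fitted line that geom_smooth(method = lm) draws. The sketch below runs on made-up values, not on the FitBit data, so its numbers say nothing about the real relationship.

```r
# Made-up daily totals for illustration only.
demo <- data.frame(
  steptotal = c(2000, 5000, 8000, 11000, 14000),
  calories  = c(1500, 1800, 2000, 2300, 2600)
)

# Pearson correlation between steps and calories
r <- cor(demo$steptotal, demo$calories)

# Linear model equivalent to the smoothing line in the plot
fit <- lm(calories ~ steptotal, data = demo)
slope <- coef(fit)[["steptotal"]]

print(r)
print(slope)  # extra calories per additional step in this toy data
```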
ggplot(data = activity, mapping = aes(x = totalsteps, y = totaldistance)) +
geom_point() + geom_smooth(method = lm) + labs(title = "Total Steps VS Total Distance")
## `geom_smooth()` using formula 'y ~ x'
This graph serves as a check that the tracker is functioning: more steps should mean more distance travelled, and that is what we see.
ggplot(data = activity_sleep, mapping = aes(x = totalminutesasleep, y = sedentaryminutes )) +
geom_point() + geom_smooth(method = lm) + labs(title = "Total Minutes Asleep VS Sedentary Minutes")
## `geom_smooth()` using formula 'y ~ x'
The plot shows a negative relationship between the two: users with more sedentary minutes tend to sleep less. One possible explanation is that they work longer hours, although the data cannot confirm this.
steps_hourly %>%
  group_by(time) %>%
  summarize(avg_steps = mean(steptotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x = time, y = avg_steps)) +
  labs(title = "Average Steps Hourly") +
  theme(axis.text.x = element_text(angle = 45))
As we can see, people tend to take the most steps around lunchtime and after office hours.
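The busiest hours can also be read off as a table rather than from the chart. This is a base R sketch on made-up hourly values (hourly_demo is a hypothetical name) that averages steps per hour and sorts the result from most to least active.

```r
# Made-up hourly step counts over two days.
hourly_demo <- data.frame(
  time = c("08:00:00", "12:00:00", "18:00:00",
           "08:00:00", "12:00:00", "18:00:00"),
  steptotal = c(200, 600, 700, 300, 500, 900)
)

# Average steps for each hour of the day
avg_by_hour <- aggregate(steptotal ~ time, data = hourly_demo, FUN = mean)
names(avg_by_hour)[2] <- "avg_steps"

# Sort so the most active hours come first
avg_by_hour <- avg_by_hour[order(-avg_by_hour$avg_steps), ]
print(head(avg_by_hour, 2))
```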
This is my first R case study project. Thank you!