Business Task: Analyze smart device data usage to gain insight on how consumers use non-Bellabeat smart devices and apply these insights to one of Bellabeat’s product.
Questions to be answered are as follows:
Key Stakeholders include:
Data source: FitBit Fitness Tracker dataset available on Kaggle (CCO: Public Domain, dataset made available via Mobius). To navigate to the website, click link here. This dataset contains smart device data usage stored on Kaggle made open to the public. Data were initially stored on the local drive and then imported in Google’s BigQuery SQL database. The specifed files were utilized:
Data Organization: smart device data usages pertinent to daily activity, daily calories, daily intensities, and more are divided into two separate folders; each folder containing various usage data formatted via csv files for the dates of March 12th, 2016 through April 11th, 2016 and April 12th, 2016 through May 12th, 2016.
Data was sorted and filtered using Google’s BigQuery Cloud Data Warehouse
Data Credibility: data is from open data available on the Kaggle platform. Due to the nature of the program and directions of the capstone, secondary data pertinent to smart device usages are to be used for the data analysis process.
Consistent errors discovered among data files which includes, but were not limited to:
Tools Selected for Data Analysis Process:
knitr::include_graphics("Union_all_example.png", error=FALSE)
Saved combined table results using ‘BigQuery Tables’ for subsequent query
For files that contained timestamp data
(i.e. “2016-03-19 13:00:00 UTC” ; hourlyintensities & minutesleep
csv files), converted timestamp fields into both
“date” and “time” fields using the
following SQL syntax (example below):
Renamed field names, standardized & summarized value (numeric data) formats, and created new column to subsequently join datasets for daily activity, intensities, and sleep:
knitr::include_graphics("innerjoin_example.png", error=FALSE)
NOTE: Innerjoin used to establish finalized dataset containing rows/observations where all fields were entered into for ensuring completeness of variable/fields being measured (daily activities, intensities, sleep, etc.)
6.Final query executed to yield cleaned/transformed dataset in SQL and subsequently exported as CSV file for subsequent analysis to be performed (and further clean/transform) in RStudio.
Data organized in SQL as shown above in ‘Process’ Phase.
To set up R environment, install, if necessary, and load packages for data analysis: tidyverse, skimr, and janitor.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Next, follow the analysis steps below:
ais_df <- read_csv('activity_intensity_sleep_joined.csv')
## Rows: 644 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): ActivityDate, new_id, ActivityHour1, new_id_1, date1, new_id_2
## dbl (20): Id, TotalSteps, total_distance, tracker_distance, very_active_dist...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(ais_df)
head(ais_df)
## # A tibble: 6 × 26
## Id ActivityDate TotalSteps total_distance tracker_distance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 3977333714 4/17/2016 10415 6.97 6.97
## 2 3977333714 4/20/2016 11658 7.83 7.83
## 3 3977333714 5/6/2016 11677 8.28 8.28
## 4 1503960366 4/24/2016 10039 6.41 6.41
## 5 4319703577 4/14/2016 10210 6.88 6.88
## 6 8378563200 5/1/2016 11419 9.06 9.06
## # ℹ 21 more variables: very_active_distance <dbl>,
## # moderate_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, Calories <dbl>, new_id <chr>, Id_1 <dbl>,
## # ActivityHour1 <chr>, total_daily_intensity <dbl>,
## # average_daily_intensity <dbl>, new_id_1 <chr>, Id_2 <dbl>, date1 <chr>, …
colnames(ais_df)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "total_distance"
## [5] "tracker_distance" "very_active_distance"
## [7] "moderate_active_distance" "light_active_distance"
## [9] "sedentary_active_distance" "very_active_minutes"
## [11] "fairly_active_minutes" "lightly_active_minutes"
## [13] "sedentary_minutes" "Calories"
## [15] "new_id" "Id_1"
## [17] "ActivityHour1" "total_daily_intensity"
## [19] "average_daily_intensity" "new_id_1"
## [21] "Id_2" "date1"
## [23] "value" "value_minutes"
## [25] "value_hours" "new_id_2"
ais_df$activitydate_1 <- mdy(ais_df$ActivityDate)
trimmed_ais_df <- ais_df %>%
select(Id, ActivityDate, TotalSteps, total_distance, tracker_distance, very_active_distance,
moderate_active_distance, light_active_distance, sedentary_active_distance, very_active_minutes,
fairly_active_minutes, lightly_active_minutes, sedentary_minutes, Calories,
total_daily_intensity, average_daily_intensity, value_minutes, value_hours, activitydate_1) %>%
rename_with(tolower) %>%
clean_names() %>%
rename(sleep_minutes=value_minutes, sleep_hours=value_hours)
trimmed_ais_df_v1<- trimmed_ais_df %>%
mutate(trimmed_ais_df, total_active_minutes=very_active_minutes+
fairly_active_minutes+lightly_active_minutes) %>%
arrange(id, activitydate_1)
view(trimmed_ais_df_v1)
head(trimmed_ais_df_v1)
## # A tibble: 6 × 20
## id activitydate totalsteps total_distance tracker_distance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 3/25/2016 11004 7.11 7.11
## 2 1503960366 3/26/2016 17609 11.6 11.6
## 3 1503960366 3/27/2016 12736 8.53 8.53
## 4 1503960366 3/28/2016 13231 8.93 8.93
## 5 1503960366 3/29/2016 12041 7.85 7.85
## 6 1503960366 3/31/2016 12256 7.86 7.86
## # ℹ 15 more variables: very_active_distance <dbl>,
## # moderate_active_distance <dbl>, light_active_distance <dbl>,
## # sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## # fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## # sedentary_minutes <dbl>, calories <dbl>, total_daily_intensity <dbl>,
## # average_daily_intensity <dbl>, sleep_minutes <dbl>, sleep_hours <dbl>,
## # activitydate_1 <date>, total_active_minutes <dbl>
colnames(trimmed_ais_df_v1)
## [1] "id" "activitydate"
## [3] "totalsteps" "total_distance"
## [5] "tracker_distance" "very_active_distance"
## [7] "moderate_active_distance" "light_active_distance"
## [9] "sedentary_active_distance" "very_active_minutes"
## [11] "fairly_active_minutes" "lightly_active_minutes"
## [13] "sedentary_minutes" "calories"
## [15] "total_daily_intensity" "average_daily_intensity"
## [17] "sleep_minutes" "sleep_hours"
## [19] "activitydate_1" "total_active_minutes"
mdy("05-09-2016") - mdy("03-12-2016")
## Time difference of 58 days
trimmed_ais_df_v1$weekdays <- weekdays(as.Date(trimmed_ais_df_v1$activitydate_1))
trimmed_ais_df_v2 <- trimmed_ais_df_v1
days_of_the_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
view(days_of_the_week)
summary(trimmed_ais_df_v2)
## id activitydate totalsteps total_distance
## Min. :1.504e+09 Length:644 Min. : 0 Min. : 0.000
## 1st Qu.:3.977e+09 Class :character 1st Qu.: 4740 1st Qu.: 3.268
## Median :4.559e+09 Mode :character Median : 8371 Median : 6.065
## Mean :4.862e+09 Mean : 8222 Mean : 5.792
## 3rd Qu.:6.776e+09 3rd Qu.:11184 3rd Qu.: 7.860
## Max. :8.792e+09 Max. :22770 Max. :17.540
## tracker_distance very_active_distance moderate_active_distance
## Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 3.268 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 6.065 Median : 0.450 Median :0.4000
## Mean : 5.779 Mean : 1.335 Mean :0.7291
## 3rd Qu.: 7.860 3rd Qu.: 2.215 3rd Qu.:1.0425
## Max. :17.540 Max. :12.540 Max. :6.4800
## light_active_distance sedentary_active_distance very_active_minutes
## Min. : 0.000 Min. :0.0000000 Min. : 0.00
## 1st Qu.: 2.425 1st Qu.:0.0000000 1st Qu.: 0.00
## Median : 3.645 Median :0.0000000 Median : 8.00
## Mean : 3.692 Mean :0.0009938 Mean : 22.95
## 3rd Qu.: 4.885 3rd Qu.:0.0000000 3rd Qu.: 35.00
## Max. :12.510 Max. :0.1100000 Max. :210.00
## fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 0.00 1st Qu.:153.0 1st Qu.: 638.8 1st Qu.:1860
## Median : 11.00 Median :208.5 Median : 728.0 Median :2210
## Mean : 17.37 Mean :212.1 Mean : 733.4 Mean :2353
## 3rd Qu.: 26.00 3rd Qu.:266.0 3rd Qu.: 815.0 3rd Qu.:2859
## Max. :143.00 Max. :518.0 Max. :1440.0 Max. :4900
## total_daily_intensity average_daily_intensity sleep_minutes
## Min. : 0.0 Min. :0.0000 Min. : 2.0
## 1st Qu.:224.0 1st Qu.:0.1600 1st Qu.: 335.5
## Median :326.5 Median :0.2200 Median : 415.0
## Mean :322.0 Mean :0.2232 Mean : 401.4
## 3rd Qu.:413.2 3rd Qu.:0.2900 3rd Qu.: 484.2
## Max. :904.0 Max. :0.6300 Max. :1108.0
## sleep_hours activitydate_1 total_active_minutes weekdays
## Min. : 0.033 Min. :2016-03-12 Min. : 0.0 Length:644
## 1st Qu.: 5.592 1st Qu.:2016-04-09 1st Qu.:194.0 Class :character
## Median : 6.917 Median :2016-04-20 Median :262.0 Mode :character
## Mean : 6.690 Mean :2016-04-19 Mean :252.4
## 3rd Qu.: 8.071 3rd Qu.:2016-04-30 3rd Qu.:317.0
## Max. :18.467 Max. :2016-05-12 Max. :540.0
Note: Prior calculations, via aggregation of data, were performed on data in SQL under the Process phase prior analysis being performed in RStudio.
Questions to be answered:
Answer Summary: Users are tracking total distance, total steps taken, states of consciousness (sleep, restless, awake), intensity, caloric expenditures, and more to stay informed about their health and how their body responds to activities, stress, emotions, and illness.
High Level Recommendations include: