This is my final Google career project. The objective is to identify trends so marketing company can work efficient in a campaign for new clients. We are going to use a free and public data offered by Möbius. The data set is: https://www.kaggle.com/arashnic/fitbit.

Some of the data was modified by hand on excel spreadsheet and almost all the details were done in R.

Packages installed and libraries

  library(tidyverse)
  library(here)
  library(ggplot2)
  library(skimr)
  library(janitor)
  library(readxl)

Importing first data set:

dailiy_Activity <- read_excel("dailiy_Activity.xlsx")

Cleaning and analyzing data

## Cleaning column names
da <- clean_names(dailiy_Activity)
## Selecting only the columns I need for analysis. 
da_selected <- da %>%
    select(total_steps,
           total_distance,
           very_active_distance,
           moderately_active_distance,
           light_active_distance,
           very_active_minutes,
           lightly_active_minutes,
           fairly_active_minutes,
           calories) 
First 5 data set values
total_steps total_distance very_active_distance moderately_active_distance light_active_distance very_active_minutes lightly_active_minutes fairly_active_minutes calories
10602 6.81 2.29 1.60 2.92 33 246 35 1820
36019 28.03 21.92 4.19 1.91 186 171 63 2690
6132 4.46 0.24 0.99 3.23 3 146 24 2696
2573 1.70 0.00 0.26 1.45 0 75 7 1541
2704 1.87 1.01 0.03 0.83 14 70 1 2411
## Data summary 
summary(da_selected)
##   total_steps    total_distance   very_active_distance
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000      
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 0.000      
##  Median : 7406   Median : 5.245   Median : 0.210      
##  Mean   : 7638   Mean   : 5.490   Mean   : 1.503      
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 2.053      
##  Max.   :36019   Max.   :28.030   Max.   :21.920      
##  moderately_active_distance light_active_distance very_active_minutes
##  Min.   :0.0000             Min.   : 0.000        Min.   :  0.00     
##  1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:  0.00     
##  Median :0.2400             Median : 3.365        Median :  4.00     
##  Mean   :0.5675             Mean   : 3.341        Mean   : 21.16     
##  3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.: 32.00     
##  Max.   :6.4800             Max.   :10.710        Max.   :210.00     
##  lightly_active_minutes fairly_active_minutes    calories   
##  Min.   :  0.0          Min.   :  0.00        Min.   :   0  
##  1st Qu.:127.0          1st Qu.:  0.00        1st Qu.:1828  
##  Median :199.0          Median :  6.00        Median :2134  
##  Mean   :192.8          Mean   : 13.56        Mean   :2304  
##  3rd Qu.:264.0          3rd Qu.: 19.00        3rd Qu.:2793  
##  Max.   :518.0          Max.   :143.00        Max.   :4900

Importing new data set

sleep_Day <- read_excel("sleep_Day.xlsx")

Cleaning and analyzing it

sl <- clean_names(sleep_Day) %>%
  select(-id_len)
  ##Only for future datasets merges & analysis, I will change the column named as SleepDay for activity_date
  sld <- rename(sl, activity_date = sleep_day)
First 5 data set values
id activity_date total_sleep_records hour_and_minutes_asleep total_minutes_asleep hours_and_minutos_in_bed total_time_in_bed
1503960366 2016-12-04 1 1899-12-31 05:27:00 327 1899-12-31 05:46:00 346
1503960366 2016-04-13 2 1899-12-31 06:24:00 384 1899-12-31 06:47:00 407
1503960366 2016-04-15 1 1899-12-31 06:52:00 412 1899-12-31 07:22:00 442
1503960366 2016-04-16 2 1899-12-31 05:40:00 340 1899-12-31 06:07:00 367
1503960366 2016-04-17 1 1899-12-31 11:40:00 700 1899-12-31 11:52:00 712
## Data summaries 
sld %>%
  select(hour_and_minutes_asleep,
  hours_and_minutos_in_bed,
  total_minutes_asleep,
  total_time_in_bed) %>%
  summary()
##  hour_and_minutes_asleep       hours_and_minutos_in_bed     
##  Min.   :1899-12-31 00:58:00   Min.   :1899-12-31 01:01:00  
##  1st Qu.:1899-12-31 06:01:00   1st Qu.:1899-12-31 06:43:00  
##  Median :1899-12-31 07:13:00   Median :1899-12-31 07:43:00  
##  Mean   :1899-12-31 06:59:28   Mean   :1899-12-31 07:38:38  
##  3rd Qu.:1899-12-31 08:10:00   3rd Qu.:1899-12-31 08:46:00  
##  Max.   :1899-12-31 13:16:00   Max.   :1899-12-31 16:01:00  
##  total_minutes_asleep total_time_in_bed
##  Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:361.0        1st Qu.:403.0    
##  Median :433.0        Median :463.0    
##  Mean   :419.5        Mean   :458.6    
##  3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :796.0        Max.   :961.0

Trend

Minutes asleep vs in bed

We found a trend that humans spend more time in bed than only sleeping. When I saw this data, I remembered the time I spend maybe with the phone or the notebook in the bed before or after sleeping.

Merge time!

We arrive to the merge time. We are going to merge the data sets, so we can find new trends and compare both data. How are we going to merge? We are going to use the inner join way, because we need only the information that match in both data sets. Ok, let’s do it!

da_sl <- merge(da, sld, by = c("id","activity_date"))
head(da_sl)
##           id activity_date len total_steps total_distance tracker_distance
## 1 1503960366    2016-01-05  10       10602           6.81             6.81
## 2 1503960366    2016-02-05  10       14727           9.71             9.71
## 3 1503960366    2016-03-05  10       15103           9.66             9.66
## 4 1503960366    2016-04-13  10       10735           6.97             6.97
## 5 1503960366    2016-04-15  10        9762           6.28             6.28
## 6 1503960366    2016-04-16  10       12669           8.16             8.16
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 2.29                       1.60
## 2                          0                 3.21                       0.57
## 3                          0                 3.73                       1.05
## 4                          0                 1.57                       0.69
## 5                          0                 2.14                       1.26
## 6                          0                 2.71                       0.41
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  2.92                         0                  33
## 2                  5.92                         0                  41
## 3                  4.88                         0                  50
## 4                  4.71                         0                  21
## 5                  2.83                         0                  29
## 6                  5.04                         0                  36
##   fairly_active_minutes lightly_active_minutes total_activity_minutes
## 1                    35                    246                    314
## 2                    15                    277                    333
## 3                    24                    254                    328
## 4                    19                    217                    257
## 5                    34                    209                    272
## 6                    10                    221                    267
##   sedentary_minutes total_minutes calories x19 total_distance_2
## 1               730          1044     1820  NA             6.81
## 2               798          1131     2004  NA             9.70
## 3               816          1144     1990  NA             9.66
## 4               776          1033     1797  NA             6.97
## 5               726           998     1745  NA             6.23
## 6               773          1040     1863  NA             8.16
##   total_sleep_records hour_and_minutes_asleep total_minutes_asleep
## 1                   1     1899-12-31 06:09:00                  369
## 2                   1     1899-12-31 04:37:00                  277
## 3                   1     1899-12-31 04:33:00                  273
## 4                   2     1899-12-31 06:24:00                  384
## 5                   1     1899-12-31 06:52:00                  412
## 6                   2     1899-12-31 05:40:00                  340
##   hours_and_minutos_in_bed total_time_in_bed
## 1      1899-12-31 06:36:00               396
## 2      1899-12-31 05:09:00               309
## 3      1899-12-31 04:56:00               296
## 4      1899-12-31 06:47:00               407
## 5      1899-12-31 07:22:00               442
## 6      1899-12-31 06:07:00               367

Cleaning and analyzing

Important: I want to know the people who sleep more and less than 8 hours and compare the data so maybe we can find new trends.

filter_8hours <- da_sl %>%
  filter(total_minutes_asleep >= 480)
  ##Conclusion: I had 413 rows in the data set and 117 people slept more than 8 hours. 
filter_less8hourse <- da_sl %>%
  filter(total_minutes_asleep < 480)
  ##Conclusion_ From 413 rows data, 296 slept less than 8 hours.

Interesting graphs

Sedentary Minutes vs Tota Activity Minutes

We can find imbalance in sedentary minutes comparing to activity minutes in people who sleep less than 8 hours. In people who sleep 8 hours or more we find a balance and the sedentary minutes are not so high, they trend to be more balance and center.

Balance between the total distance and the calories.

We can see a balance also in people who sleep 8 hours or more according to the different type of activities.

Blue = Very Active Green = Moderately active Red = Light Activity

Conclusion

We can see in the summaries that people who sleep less has higher mean values, almost in all the columns, but according to the graphs, we can see that the people who sleep 8 or more hours are more balance than the people who sleep less. This means, we can find a breakeven in +8 hours, and in - 8 hours we find lot of peaks values.

Importing more data

We are going to import the weight (kg) values, including bmi also (body mass index).

weight_Log_Info <- read_excel("weight_Log_Info.xlsx")

Cleaning and analyzing

head(weight_Log_Info)
## # A tibble: 6 x 10
##        Id ...2     ...3                Date    WeightKg WeightPounds   Fat   BMI
##     <dbl> <chr>    <dttm>              <chr>      <dbl>        <dbl> <dbl> <dbl>
## 1  1.50e9 42405.9… 2016-02-05 00:00:00 42405.…     52.6         116.    22  22.6
## 2  1.50e9 42434.9… 2016-03-05 00:00:00 42434.…     52.6         116.    NA  22.6
## 3  1.93e9 4/13/20… 2016-04-13 00:00:00 4/13/2…    134.          294.    NA  47.5
## 4  2.87e9 4/21/20… 2016-04-21 00:00:00 4/21/2…     56.7         125.    NA  21.5
## 5  2.87e9 42709.9… 2016-12-05 00:00:00 42709.…     57.3         126.    NA  21.7
## 6  4.32e9 4/17/20… 2016-04-17 00:00:00 4/17/2…     72.4         160.    25  27.5
## # … with 2 more variables: IsManualReport <lgl>, LogId <dbl>
weightt <- weight_Log_Info %>%
  clean_names() %>%
  rename(activity_date = x3) 
weight <- weightt %>%
  select(id,
    activity_date,
    weight_kg,
    fat,
    bmi,)
colnames(weight)
## [1] "id"            "activity_date" "weight_kg"     "fat"          
## [5] "bmi"

Merging da_sl data set with the new weight data

The data sets are going to be merge with inner join, we need only the information that match.

da_sl_we <- merge(da_sl, weight, by = c("id", "activity_date"))

Trend

We can see a trend when weight increment, the bmi also increment.

Importing new data sets, cleaning, analyzing and merging

Now, we want to analyze the data sets per hour and merge them for trends findings. We are going to import data sets, clean it and analyze it so we can see visual trends.

hourly_Intensities <- read_excel("hourly_Intensities.xlsx")
hourly_Calories <- read_excel("hourly_Calories.xlsx")
hourly_Steps <- read_excel("hourly_Steps.xlsx") 

hi =hourly_Itennsities hc = hourly_Calories hs = hourly_Steps

Cleaning

hi <- hourly_Intensities %>%
  clean_names()
    
hc <- hourly_Calories %>%
  clean_names()
      
hs <- hourly_Steps %>%
  clean_names()

Merging

hi_hc <- merge(hi, hc, by = c("id", "activity_date"))

Cleaning and analyzing

This merge is so extensive, like 254979 observations, so we are going to take a sample, according to the population so we can analyze it and do future merges. We choose the sample size from a 254979 population, 95% confidence and 3% margin error.

random_hi_hc <- hi_hc %>%
  sample_n(1063, replace = FALSE)

Merging with hours_steps

hi_hc_hs <- merge(random_hi_hc, hs, by = c("id", "activity_date"))

This new merge has duplicates on same dates and minutes, so we are going to clean this duplicates.

hi_hc_hs_nodup <- hi_hc_hs %>%
  distinct(id, activity_date, .keep_all = TRUE)
ggplot(data = hi_hc_hs_nodup) +
  geom_point(mapping = aes(x= step_total, y = calories))