This is my final Google career project. The objective is to identify trends so marketing company can work efficient in a campaign for new clients. We are going to use a free and public data offered by Möbius. The data set is: https://www.kaggle.com/arashnic/fitbit.
Some of the data was modified by hand on excel spreadsheet and almost all the details were done in R.
library(tidyverse)
library(here)
library(ggplot2)
library(skimr)
library(janitor)
library(readxl)
dailiy_Activity <- read_excel("dailiy_Activity.xlsx")
## Cleaning column names
da <- clean_names(dailiy_Activity)
## Selecting only the columns I need for analysis.
da_selected <- da %>%
select(total_steps,
total_distance,
very_active_distance,
moderately_active_distance,
light_active_distance,
very_active_minutes,
lightly_active_minutes,
fairly_active_minutes,
calories)
total_steps | total_distance | very_active_distance | moderately_active_distance | light_active_distance | very_active_minutes | lightly_active_minutes | fairly_active_minutes | calories |
---|---|---|---|---|---|---|---|---|
10602 | 6.81 | 2.29 | 1.60 | 2.92 | 33 | 246 | 35 | 1820 |
36019 | 28.03 | 21.92 | 4.19 | 1.91 | 186 | 171 | 63 | 2690 |
6132 | 4.46 | 0.24 | 0.99 | 3.23 | 3 | 146 | 24 | 2696 |
2573 | 1.70 | 0.00 | 0.26 | 1.45 | 0 | 75 | 7 | 1541 |
2704 | 1.87 | 1.01 | 0.03 | 0.83 | 14 | 70 | 1 | 2411 |
## Data summary
summary(da_selected)
## total_steps total_distance very_active_distance
## Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 0.000
## Median : 7406 Median : 5.245 Median : 0.210
## Mean : 7638 Mean : 5.490 Mean : 1.503
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.: 2.053
## Max. :36019 Max. :28.030 Max. :21.920
## moderately_active_distance light_active_distance very_active_minutes
## Min. :0.0000 Min. : 0.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.: 0.00
## Median :0.2400 Median : 3.365 Median : 4.00
## Mean :0.5675 Mean : 3.341 Mean : 21.16
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.: 32.00
## Max. :6.4800 Max. :10.710 Max. :210.00
## lightly_active_minutes fairly_active_minutes calories
## Min. : 0.0 Min. : 0.00 Min. : 0
## 1st Qu.:127.0 1st Qu.: 0.00 1st Qu.:1828
## Median :199.0 Median : 6.00 Median :2134
## Mean :192.8 Mean : 13.56 Mean :2304
## 3rd Qu.:264.0 3rd Qu.: 19.00 3rd Qu.:2793
## Max. :518.0 Max. :143.00 Max. :4900
We can see trends in light distance and minutes means, the values are bigger compared with the other columns. This means that people use to do more exercise distance and minutes in a lightly way. Pay attention at the mean data of the next columns and compare it with the columns at the above.
## light_active_distance lightly_active_minutes
## Min. : 0.000 Min. : 0.0
## 1st Qu.: 1.945 1st Qu.:127.0
## Median : 3.365 Median :199.0
## Mean : 3.341 Mean :192.8
## 3rd Qu.: 4.782 3rd Qu.:264.0
## Max. :10.710 Max. :518.0
In the next graph, we can see an important trend. People who do more active distance exercise, burn more calories than lighter and fairer exercises.
We can see that if you do light activity you need more time to burn calories, we can see a strong light activity at minute 200 and calories between 1500 - 3000.
But if you do an extensive activity, you need less time to burn calories. We can see that the strong values are between 0 - 100 minutes and 1000 - 4000 calories. So, if you are doing activity exercise, it is a good news.
We can see that humans used to spend more time sedentary than active. When I saw this data, I thought in the time I spend working, studying, playing pc (sedentary) and the time I am doing exercise or moving.
Total minutes = sedentary + active
sleep_Day <- read_excel("sleep_Day.xlsx")
sl <- clean_names(sleep_Day) %>%
select(-id_len)
##Only for future datasets merges & analysis, I will change the column named as SleepDay for activity_date
sld <- rename(sl, activity_date = sleep_day)
id | activity_date | total_sleep_records | hour_and_minutes_asleep | total_minutes_asleep | hours_and_minutos_in_bed | total_time_in_bed |
---|---|---|---|---|---|---|
1503960366 | 2016-12-04 | 1 | 1899-12-31 05:27:00 | 327 | 1899-12-31 05:46:00 | 346 |
1503960366 | 2016-04-13 | 2 | 1899-12-31 06:24:00 | 384 | 1899-12-31 06:47:00 | 407 |
1503960366 | 2016-04-15 | 1 | 1899-12-31 06:52:00 | 412 | 1899-12-31 07:22:00 | 442 |
1503960366 | 2016-04-16 | 2 | 1899-12-31 05:40:00 | 340 | 1899-12-31 06:07:00 | 367 |
1503960366 | 2016-04-17 | 1 | 1899-12-31 11:40:00 | 700 | 1899-12-31 11:52:00 | 712 |
## Data summaries
sld %>%
select(hour_and_minutes_asleep,
hours_and_minutos_in_bed,
total_minutes_asleep,
total_time_in_bed) %>%
summary()
## hour_and_minutes_asleep hours_and_minutos_in_bed
## Min. :1899-12-31 00:58:00 Min. :1899-12-31 01:01:00
## 1st Qu.:1899-12-31 06:01:00 1st Qu.:1899-12-31 06:43:00
## Median :1899-12-31 07:13:00 Median :1899-12-31 07:43:00
## Mean :1899-12-31 06:59:28 Mean :1899-12-31 07:38:38
## 3rd Qu.:1899-12-31 08:10:00 3rd Qu.:1899-12-31 08:46:00
## Max. :1899-12-31 13:16:00 Max. :1899-12-31 16:01:00
## total_minutes_asleep total_time_in_bed
## Min. : 58.0 Min. : 61.0
## 1st Qu.:361.0 1st Qu.:403.0
## Median :433.0 Median :463.0
## Mean :419.5 Mean :458.6
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
We found a trend that humans spend more time in bed than only sleeping. When I saw this data, I remembered the time I spend maybe with the phone or the notebook in the bed before or after sleeping.
We arrive to the merge time. We are going to merge the data sets, so we can find new trends and compare both data. How are we going to merge? We are going to use the inner join way, because we need only the information that match in both data sets. Ok, let’s do it!
da_sl <- merge(da, sld, by = c("id","activity_date"))
head(da_sl)
## id activity_date len total_steps total_distance tracker_distance
## 1 1503960366 2016-01-05 10 10602 6.81 6.81
## 2 1503960366 2016-02-05 10 14727 9.71 9.71
## 3 1503960366 2016-03-05 10 15103 9.66 9.66
## 4 1503960366 2016-04-13 10 10735 6.97 6.97
## 5 1503960366 2016-04-15 10 9762 6.28 6.28
## 6 1503960366 2016-04-16 10 12669 8.16 8.16
## logged_activities_distance very_active_distance moderately_active_distance
## 1 0 2.29 1.60
## 2 0 3.21 0.57
## 3 0 3.73 1.05
## 4 0 1.57 0.69
## 5 0 2.14 1.26
## 6 0 2.71 0.41
## light_active_distance sedentary_active_distance very_active_minutes
## 1 2.92 0 33
## 2 5.92 0 41
## 3 4.88 0 50
## 4 4.71 0 21
## 5 2.83 0 29
## 6 5.04 0 36
## fairly_active_minutes lightly_active_minutes total_activity_minutes
## 1 35 246 314
## 2 15 277 333
## 3 24 254 328
## 4 19 217 257
## 5 34 209 272
## 6 10 221 267
## sedentary_minutes total_minutes calories x19 total_distance_2
## 1 730 1044 1820 NA 6.81
## 2 798 1131 2004 NA 9.70
## 3 816 1144 1990 NA 9.66
## 4 776 1033 1797 NA 6.97
## 5 726 998 1745 NA 6.23
## 6 773 1040 1863 NA 8.16
## total_sleep_records hour_and_minutes_asleep total_minutes_asleep
## 1 1 1899-12-31 06:09:00 369
## 2 1 1899-12-31 04:37:00 277
## 3 1 1899-12-31 04:33:00 273
## 4 2 1899-12-31 06:24:00 384
## 5 1 1899-12-31 06:52:00 412
## 6 2 1899-12-31 05:40:00 340
## hours_and_minutos_in_bed total_time_in_bed
## 1 1899-12-31 06:36:00 396
## 2 1899-12-31 05:09:00 309
## 3 1899-12-31 04:56:00 296
## 4 1899-12-31 06:47:00 407
## 5 1899-12-31 07:22:00 442
## 6 1899-12-31 06:07:00 367
Important: I want to know the people who sleep more and less than 8 hours and compare the data so maybe we can find new trends.
filter_8hours <- da_sl %>%
filter(total_minutes_asleep >= 480)
##Conclusion: I had 413 rows in the data set and 117 people slept more than 8 hours.
filter_less8hourse <- da_sl %>%
filter(total_minutes_asleep < 480)
##Conclusion_ From 413 rows data, 296 slept less than 8 hours.
We are going to select only the columns we need for analysis and show a summary of both data sets ( x < 8 && x >= 8). In these summaries, we can compare the information and see some trends. We are going to focus in means values.
I created 2 data sets. One for the people who sleep equal or more than 8 hours, and the other one for people who sleep less than 8 hours. We can see from the original data set, 117 people sleep 8 or more hours and 296 less than 8 hours.
So, if we only focus on means values, we are going to face an issue (bias), cause is not going to be the same for both data sets. We need to be partial. In this case, I am going to pick a 117 random sample from my 296 rows data set. Now, both data sets are going to have 117 rows and we can compare means.
##I already created both data sets, this is only the random formula
random_selectfl8 <- select_fl8 %>%
sample_n(117, replace = FALSE)
## total_steps total_distance very_active_distance
## Min. : 42 Min. : 0.030 Min. :0.000
## 1st Qu.: 5325 1st Qu.: 3.610 1st Qu.:0.000
## Median : 9524 Median : 6.700 Median :0.850
## Mean : 8671 Mean : 6.064 Mean :1.457
## 3rd Qu.:11584 3rd Qu.: 8.030 3rd Qu.:2.500
## Max. :22770 Max. :17.540 Max. :9.450
## moderately_active_distance light_active_distance very_active_minutes
## Min. :0.0000 Min. :0.030 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:2.540 1st Qu.: 0.00
## Median :0.5000 Median :3.910 Median : 14.00
## Mean :0.7687 Mean :3.813 Mean : 25.38
## 3rd Qu.:1.1100 3rd Qu.:4.910 3rd Qu.: 45.00
## Max. :4.2200 Max. :9.460 Max. :120.00
## fairly_active_minutes lightly_active_minutes total_activity_minutes
## Min. : 0.00 Min. : 4.0 Min. : 4.0
## 1st Qu.: 0.00 1st Qu.:169.0 1st Qu.:215.0
## Median :12.00 Median :215.0 Median :263.0
## Mean :17.71 Mean :212.9 Mean :255.9
## 3rd Qu.:29.00 3rd Qu.:263.0 3rd Qu.:310.0
## Max. :92.00 Max. :432.0 Max. :482.0
## sedentary_minutes calories total_minutes_asleep total_time_in_bed
## Min. : 2.0 Min. : 403 Min. : 59.0 Min. : 65.0
## 1st Qu.: 692.0 1st Qu.:1954 1st Qu.:337.0 1st Qu.:372.0
## Median : 749.0 Median :2182 Median :404.0 Median :430.0
## Mean : 760.6 Mean :2410 Mean :369.3 Mean :402.2
## 3rd Qu.: 816.0 3rd Qu.:2929 3rd Qu.:442.0 3rd Qu.:470.0
## Max. :1265.0 Max. :4157 Max. :479.0 Max. :555.0
## total_steps total_distance very_active_distance
## Min. : 678 Min. : 0.470 Min. :0.000
## 1st Qu.: 3844 1st Qu.: 2.670 1st Qu.:0.000
## Median : 6564 Median : 4.530 Median :0.000
## Mean : 7328 Mean : 5.204 Mean :1.059
## 3rd Qu.:10060 3rd Qu.: 6.960 3rd Qu.:0.980
## Max. :20031 Max. :15.010 Max. :9.890
## moderately_active_distance light_active_distance very_active_minutes
## Min. :0.0000 Min. :0.470 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:2.250 1st Qu.: 0.00
## Median :0.1500 Median :3.260 Median : 0.00
## Mean :0.4453 Mean :3.629 Mean : 16.65
## 3rd Qu.:0.6800 3rd Qu.:5.180 3rd Qu.: 12.00
## Max. :5.1200 Max. :8.970 Max. :210.00
## fairly_active_minutes lightly_active_minutes total_activity_minutes
## Min. : 0.00 Min. : 55.0 Min. : 55.0
## 1st Qu.: 0.00 1st Qu.:148.0 1st Qu.:174.0
## Median : 4.00 Median :200.0 Median :240.0
## Mean :10.62 Mean :224.6 Mean :251.9
## 3rd Qu.:17.00 3rd Qu.:294.0 3rd Qu.:330.0
## Max. :95.00 Max. :512.0 Max. :512.0
## sedentary_minutes calories total_minutes_asleep total_time_in_bed
## Min. :125 Min. : 928 Min. :480 Min. :489.0
## 1st Qu.:542 1st Qu.:1692 1st Qu.:504 1st Qu.:535.0
## Median :603 Median :2105 Median :524 Median :552.0
## Mean :610 Mean :2284 Mean :543 Mean :583.2
## 3rd Qu.:709 3rd Qu.:2739 3rd Qu.:552 3rd Qu.:600.0
## Max. :862 Max. :4900 Max. :796 Max. :961.0
We can see a trend that people who sleep -8 hours use to do more active, lightly and moderately exercise and also used to spend more time doing this. Be careful with this information, cause it is not the final result.
Finally, after analyze data with numbers, it is time to see visual trends. We are going to merge the two 117 rows data sets, the people who sleep equal or more than 8 hours and less than 8 hours, so we have a new data set with 234 rows and with the information combine.
In this case, we are going to merge with outer join, cause we need both data and not only the ones who are the same.
merge_8h_fl8 <- merge(select_8h, random_selectfl8, all = TRUE)
eight_hours <- merge_8h_fl8 %>%
mutate(eight_hours_or_not = if_else(total_minutes_asleep >= 480, 'TRUE', 'FALSE'))
glimpse(eight_hours)
## Rows: 234
## Columns: 14
## $ total_steps <dbl> 42, 254, 678, 768, 980, 1202, 1219, 1251, 1…
## $ total_distance <dbl> 0.03, 0.16, 0.47, 0.52, 0.68, 0.78, 0.78, 0…
## $ very_active_distance <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ moderately_active_distance <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ light_active_distance <dbl> 0.03, 0.16, 0.47, 0.52, 0.68, 0.78, 0.78, 0…
## $ very_active_minutes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ fairly_active_minutes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
## $ lightly_active_minutes <dbl> 4, 17, 55, 58, 51, 84, 84, 67, 82, 86, 79, …
## $ total_activity_minutes <dbl> 4, 17, 55, 58, 51, 84, 84, 67, 82, 86, 79, …
## $ sedentary_minutes <dbl> 2, 1002, 734, 380, 941, 506, 853, 836, 806,…
## $ calories <dbl> 403, 1141, 2220, 1212, 2221, 1463, 1963, 15…
## $ total_minutes_asleep <dbl> 411, 357, 750, 483, 475, 775, 486, 484, 531…
## $ total_time_in_bed <dbl> 473, 380, 775, 501, 499, 843, 503, 500, 552…
## $ eight_hours_or_not <chr> "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", …
We can find imbalance in sedentary minutes comparing to activity minutes in people who sleep less than 8 hours. In people who sleep 8 hours or more we find a balance and the sedentary minutes are not so high, they trend to be more balance and center.
We can see a balance also in people who sleep 8 hours or more according to the different type of activities.
Blue = Very Active Green = Moderately active Red = Light Activity
We can see in the summaries that people who sleep less has higher mean values, almost in all the columns, but according to the graphs, we can see that the people who sleep 8 or more hours are more balance than the people who sleep less. This means, we can find a breakeven in +8 hours, and in - 8 hours we find lot of peaks values.
We are going to import the weight (kg) values, including bmi also (body mass index).
weight_Log_Info <- read_excel("weight_Log_Info.xlsx")
head(weight_Log_Info)
## # A tibble: 6 x 10
## Id ...2 ...3 Date WeightKg WeightPounds Fat BMI
## <dbl> <chr> <dttm> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 42405.9… 2016-02-05 00:00:00 42405.… 52.6 116. 22 22.6
## 2 1.50e9 42434.9… 2016-03-05 00:00:00 42434.… 52.6 116. NA 22.6
## 3 1.93e9 4/13/20… 2016-04-13 00:00:00 4/13/2… 134. 294. NA 47.5
## 4 2.87e9 4/21/20… 2016-04-21 00:00:00 4/21/2… 56.7 125. NA 21.5
## 5 2.87e9 42709.9… 2016-12-05 00:00:00 42709.… 57.3 126. NA 21.7
## 6 4.32e9 4/17/20… 2016-04-17 00:00:00 4/17/2… 72.4 160. 25 27.5
## # … with 2 more variables: IsManualReport <lgl>, LogId <dbl>
weightt <- weight_Log_Info %>%
clean_names() %>%
rename(activity_date = x3)
weight <- weightt %>%
select(id,
activity_date,
weight_kg,
fat,
bmi,)
colnames(weight)
## [1] "id" "activity_date" "weight_kg" "fat"
## [5] "bmi"
The data sets are going to be merge with inner join, we need only the information that match.
da_sl_we <- merge(da_sl, weight, by = c("id", "activity_date"))
We can see a trend when weight increment, the bmi also increment.
Now, we want to analyze the data sets per hour and merge them for trends findings. We are going to import data sets, clean it and analyze it so we can see visual trends.
hourly_Intensities <- read_excel("hourly_Intensities.xlsx")
hourly_Calories <- read_excel("hourly_Calories.xlsx")
hourly_Steps <- read_excel("hourly_Steps.xlsx")
hi =hourly_Itennsities hc = hourly_Calories hs = hourly_Steps
hi <- hourly_Intensities %>%
clean_names()
hc <- hourly_Calories %>%
clean_names()
hs <- hourly_Steps %>%
clean_names()
hi_hc <- merge(hi, hc, by = c("id", "activity_date"))
This merge is so extensive, like 254979 observations, so we are going to take a sample, according to the population so we can analyze it and do future merges. We choose the sample size from a 254979 population, 95% confidence and 3% margin error.
random_hi_hc <- hi_hc %>%
sample_n(1063, replace = FALSE)
hi_hc_hs <- merge(random_hi_hc, hs, by = c("id", "activity_date"))
This new merge has duplicates on same dates and minutes, so we are going to clean this duplicates.
hi_hc_hs_nodup <- hi_hc_hs %>%
distinct(id, activity_date, .keep_all = TRUE)
ggplot(data = hi_hc_hs_nodup) +
geom_point(mapping = aes(x= step_total, y = calories))