Final Project Data Analyst Career

This is my final Google career project. The objective is to identify trends so marketing company can work efficient in a campaign for new clients. We are going to use a free and public data offered by Möbius. The data set is: https://www.kaggle.com/arashnic/fitbit.

Some of the data was modified by hand on excel spreadsheet and almost all the details were done in R.

Packages installed and libraries

  library(tidyverse)
  library(here)
  library(ggplot2)
  library(skimr)
  library(janitor)
  library(readxl)

Importing first data set:

dailiy_Activity <- read_excel("dailiy_Activity.xlsx")

Cleaning and analyzing data

## Cleaning column names
da <- clean_names(dailiy_Activity)

## Selecting only the columns I need for analysis. 
da_selected <- da %>%
    select(total_steps,
           total_distance,
           very_active_distance,
           moderately_active_distance,
           light_active_distance,
           very_active_minutes,
           lightly_active_minutes,
           fairly_active_minutes,
           calories)

First 5 data set values
total_steps	total_distance	very_active_distance	moderately_active_distance	light_active_distance	very_active_minutes	lightly_active_minutes	fairly_active_minutes	calories
10602	6.81	2.29	1.60	2.92	33	246	35	1820
36019	28.03	21.92	4.19	1.91	186	171	63	2690
6132	4.46	0.24	0.99	3.23	3	146	24	2696
2573	1.70	0.00	0.26	1.45	0	75	7	1541
2704	1.87	1.01	0.03	0.83	14	70	1	2411

## Data summary 
summary(da_selected)

##   total_steps    total_distance   very_active_distance
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000      
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 0.000      
##  Median : 7406   Median : 5.245   Median : 0.210      
##  Mean   : 7638   Mean   : 5.490   Mean   : 1.503      
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 2.053      
##  Max.   :36019   Max.   :28.030   Max.   :21.920      
##  moderately_active_distance light_active_distance very_active_minutes
##  Min.   :0.0000             Min.   : 0.000        Min.   :  0.00     
##  1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:  0.00     
##  Median :0.2400             Median : 3.365        Median :  4.00     
##  Mean   :0.5675             Mean   : 3.341        Mean   : 21.16     
##  3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.: 32.00     
##  Max.   :6.4800             Max.   :10.710        Max.   :210.00     
##  lightly_active_minutes fairly_active_minutes    calories   
##  Min.   :  0.0          Min.   :  0.00        Min.   :   0  
##  1st Qu.:127.0          1st Qu.:  0.00        1st Qu.:1828  
##  Median :199.0          Median :  6.00        Median :2134  
##  Mean   :192.8          Mean   : 13.56        Mean   :2304  
##  3rd Qu.:264.0          3rd Qu.: 19.00        3rd Qu.:2793  
##  Max.   :518.0          Max.   :143.00        Max.   :4900

Trends

Light exercise is the activity with more people

We can see trends in light distance and minutes means, the values are bigger compared with the other columns. This means that people use to do more exercise distance and minutes in a lightly way. Pay attention at the mean data of the next columns and compare it with the columns at the above.

##  light_active_distance lightly_active_minutes
##  Min.   : 0.000        Min.   :  0.0         
##  1st Qu.: 1.945        1st Qu.:127.0         
##  Median : 3.365        Median :199.0         
##  Mean   : 3.341        Mean   :192.8         
##  3rd Qu.: 4.782        3rd Qu.:264.0         
##  Max.   :10.710        Max.   :518.0

People burn more calories doing active excercise

In the next graph, we can see an important trend. People who do more active distance exercise, burn more calories than lighter and fairer exercises.

How many time do I need to burn calories? And which excercise is the bestone?

We can see that if you do light activity you need more time to burn calories, we can see a strong light activity at minute 200 and calories between 1500 - 3000.

But if you do an extensive activity, you need less time to burn calories. We can see that the strong values are between 0 - 100 minutes and 1000 - 4000 calories. So, if you are doing activity exercise, it is a good news.

Humans spend more time as active or sedentary?

We can see that humans used to spend more time sedentary than active. When I saw this data, I thought in the time I spend working, studying, playing pc (sedentary) and the time I am doing exercise or moving.

Total minutes = sedentary + active

Importing new data set

sleep_Day <- read_excel("sleep_Day.xlsx")

Cleaning and analyzing it

sl <- clean_names(sleep_Day) %>%
  select(-id_len)
  ##Only for future datasets merges & analysis, I will change the column named as SleepDay for activity_date
  sld <- rename(sl, activity_date = sleep_day)

First 5 data set values
id	activity_date	total_sleep_records	hour_and_minutes_asleep	total_minutes_asleep	hours_and_minutos_in_bed	total_time_in_bed
1503960366	2016-12-04	1	1899-12-31 05:27:00	327	1899-12-31 05:46:00	346
1503960366	2016-04-13	2	1899-12-31 06:24:00	384	1899-12-31 06:47:00	407
1503960366	2016-04-15	1	1899-12-31 06:52:00	412	1899-12-31 07:22:00	442
1503960366	2016-04-16	2	1899-12-31 05:40:00	340	1899-12-31 06:07:00	367
1503960366	2016-04-17	1	1899-12-31 11:40:00	700	1899-12-31 11:52:00	712

## Data summaries 
sld %>%
  select(hour_and_minutes_asleep,
  hours_and_minutos_in_bed,
  total_minutes_asleep,
  total_time_in_bed) %>%
  summary()

##  hour_and_minutes_asleep       hours_and_minutos_in_bed     
##  Min.   :1899-12-31 00:58:00   Min.   :1899-12-31 01:01:00  
##  1st Qu.:1899-12-31 06:01:00   1st Qu.:1899-12-31 06:43:00  
##  Median :1899-12-31 07:13:00   Median :1899-12-31 07:43:00  
##  Mean   :1899-12-31 06:59:28   Mean   :1899-12-31 07:38:38  
##  3rd Qu.:1899-12-31 08:10:00   3rd Qu.:1899-12-31 08:46:00  
##  Max.   :1899-12-31 13:16:00   Max.   :1899-12-31 16:01:00  
##  total_minutes_asleep total_time_in_bed
##  Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:361.0        1st Qu.:403.0    
##  Median :433.0        Median :463.0    
##  Mean   :419.5        Mean   :458.6    
##  3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :796.0        Max.   :961.0

Trend

Minutes asleep vs in bed

We found a trend that humans spend more time in bed than only sleeping. When I saw this data, I remembered the time I spend maybe with the phone or the notebook in the bed before or after sleeping.

Merge time!

We arrive to the merge time. We are going to merge the data sets, so we can find new trends and compare both data. How are we going to merge? We are going to use the inner join way, because we need only the information that match in both data sets. Ok, let’s do it!

da_sl <- merge(da, sld, by = c("id","activity_date"))
head(da_sl)

##           id activity_date len total_steps total_distance tracker_distance
## 1 1503960366    2016-01-05  10       10602           6.81             6.81
## 2 1503960366    2016-02-05  10       14727           9.71             9.71
## 3 1503960366    2016-03-05  10       15103           9.66             9.66
## 4 1503960366    2016-04-13  10       10735           6.97             6.97
## 5 1503960366    2016-04-15  10        9762           6.28             6.28
## 6 1503960366    2016-04-16  10       12669           8.16             8.16
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 2.29                       1.60
## 2                          0                 3.21                       0.57
## 3                          0                 3.73                       1.05
## 4                          0                 1.57                       0.69
## 5                          0                 2.14                       1.26
## 6                          0                 2.71                       0.41
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  2.92                         0                  33
## 2                  5.92                         0                  41
## 3                  4.88                         0                  50
## 4                  4.71                         0                  21
## 5                  2.83                         0                  29
## 6                  5.04                         0                  36
##   fairly_active_minutes lightly_active_minutes total_activity_minutes
## 1                    35                    246                    314
## 2                    15                    277                    333
## 3                    24                    254                    328
## 4                    19                    217                    257
## 5                    34                    209                    272
## 6                    10                    221                    267
##   sedentary_minutes total_minutes calories x19 total_distance_2
## 1               730          1044     1820  NA             6.81
## 2               798          1131     2004  NA             9.70
## 3               816          1144     1990  NA             9.66
## 4               776          1033     1797  NA             6.97
## 5               726           998     1745  NA             6.23
## 6               773          1040     1863  NA             8.16
##   total_sleep_records hour_and_minutes_asleep total_minutes_asleep
## 1                   1     1899-12-31 06:09:00                  369
## 2                   1     1899-12-31 04:37:00                  277
## 3                   1     1899-12-31 04:33:00                  273
## 4                   2     1899-12-31 06:24:00                  384
## 5                   1     1899-12-31 06:52:00                  412
## 6                   2     1899-12-31 05:40:00                  340
##   hours_and_minutos_in_bed total_time_in_bed
## 1      1899-12-31 06:36:00               396
## 2      1899-12-31 05:09:00               309
## 3      1899-12-31 04:56:00               296
## 4      1899-12-31 06:47:00               407
## 5      1899-12-31 07:22:00               442
## 6      1899-12-31 06:07:00               367

Cleaning and analyzing

Important: I want to know the people who sleep more and less than 8 hours and compare the data so maybe we can find new trends.

filter_8hours <- da_sl %>%
  filter(total_minutes_asleep >= 480)
  ##Conclusion: I had 413 rows in the data set and 117 people slept more than 8 hours. 
filter_less8hourse <- da_sl %>%
  filter(total_minutes_asleep < 480)
  ##Conclusion_ From 413 rows data, 296 slept less than 8 hours.

Important trends from last merge and filter

We are going to select only the columns we need for analysis and show a summary of both data sets ( x < 8 && x >= 8). In these summaries, we can compare the information and see some trends. We are going to focus in means values.

I created 2 data sets. One for the people who sleep equal or more than 8 hours, and the other one for people who sleep less than 8 hours. We can see from the original data set, 117 people sleep 8 or more hours and 296 less than 8 hours.

So, if we only focus on means values, we are going to face an issue (bias), cause is not going to be the same for both data sets. We need to be partial. In this case, I am going to pick a 117 random sample from my 296 rows data set. Now, both data sets are going to have 117 rows and we can compare means.

##I already created both data sets, this is only the random formula
random_selectfl8 <- select_fl8 %>%
  sample_n(117, replace = FALSE)

Less than 8 hours

##   total_steps    total_distance   very_active_distance
##  Min.   :   42   Min.   : 0.030   Min.   :0.000       
##  1st Qu.: 5325   1st Qu.: 3.610   1st Qu.:0.000       
##  Median : 9524   Median : 6.700   Median :0.850       
##  Mean   : 8671   Mean   : 6.064   Mean   :1.457       
##  3rd Qu.:11584   3rd Qu.: 8.030   3rd Qu.:2.500       
##  Max.   :22770   Max.   :17.540   Max.   :9.450       
##  moderately_active_distance light_active_distance very_active_minutes
##  Min.   :0.0000             Min.   :0.030         Min.   :  0.00     
##  1st Qu.:0.0000             1st Qu.:2.540         1st Qu.:  0.00     
##  Median :0.5000             Median :3.910         Median : 14.00     
##  Mean   :0.7687             Mean   :3.813         Mean   : 25.38     
##  3rd Qu.:1.1100             3rd Qu.:4.910         3rd Qu.: 45.00     
##  Max.   :4.2200             Max.   :9.460         Max.   :120.00     
##  fairly_active_minutes lightly_active_minutes total_activity_minutes
##  Min.   : 0.00         Min.   :  4.0          Min.   :  4.0         
##  1st Qu.: 0.00         1st Qu.:169.0          1st Qu.:215.0         
##  Median :12.00         Median :215.0          Median :263.0         
##  Mean   :17.71         Mean   :212.9          Mean   :255.9         
##  3rd Qu.:29.00         3rd Qu.:263.0          3rd Qu.:310.0         
##  Max.   :92.00         Max.   :432.0          Max.   :482.0         
##  sedentary_minutes    calories    total_minutes_asleep total_time_in_bed
##  Min.   :   2.0    Min.   : 403   Min.   : 59.0        Min.   : 65.0    
##  1st Qu.: 692.0    1st Qu.:1954   1st Qu.:337.0        1st Qu.:372.0    
##  Median : 749.0    Median :2182   Median :404.0        Median :430.0    
##  Mean   : 760.6    Mean   :2410   Mean   :369.3        Mean   :402.2    
##  3rd Qu.: 816.0    3rd Qu.:2929   3rd Qu.:442.0        3rd Qu.:470.0    
##  Max.   :1265.0    Max.   :4157   Max.   :479.0        Max.   :555.0

8 Hours or more

##   total_steps    total_distance   very_active_distance
##  Min.   :  678   Min.   : 0.470   Min.   :0.000       
##  1st Qu.: 3844   1st Qu.: 2.670   1st Qu.:0.000       
##  Median : 6564   Median : 4.530   Median :0.000       
##  Mean   : 7328   Mean   : 5.204   Mean   :1.059       
##  3rd Qu.:10060   3rd Qu.: 6.960   3rd Qu.:0.980       
##  Max.   :20031   Max.   :15.010   Max.   :9.890       
##  moderately_active_distance light_active_distance very_active_minutes
##  Min.   :0.0000             Min.   :0.470         Min.   :  0.00     
##  1st Qu.:0.0000             1st Qu.:2.250         1st Qu.:  0.00     
##  Median :0.1500             Median :3.260         Median :  0.00     
##  Mean   :0.4453             Mean   :3.629         Mean   : 16.65     
##  3rd Qu.:0.6800             3rd Qu.:5.180         3rd Qu.: 12.00     
##  Max.   :5.1200             Max.   :8.970         Max.   :210.00     
##  fairly_active_minutes lightly_active_minutes total_activity_minutes
##  Min.   : 0.00         Min.   : 55.0          Min.   : 55.0         
##  1st Qu.: 0.00         1st Qu.:148.0          1st Qu.:174.0         
##  Median : 4.00         Median :200.0          Median :240.0         
##  Mean   :10.62         Mean   :224.6          Mean   :251.9         
##  3rd Qu.:17.00         3rd Qu.:294.0          3rd Qu.:330.0         
##  Max.   :95.00         Max.   :512.0          Max.   :512.0         
##  sedentary_minutes    calories    total_minutes_asleep total_time_in_bed
##  Min.   :125       Min.   : 928   Min.   :480          Min.   :489.0    
##  1st Qu.:542       1st Qu.:1692   1st Qu.:504          1st Qu.:535.0    
##  Median :603       Median :2105   Median :524          Median :552.0    
##  Mean   :610       Mean   :2284   Mean   :543          Mean   :583.2    
##  3rd Qu.:709       3rd Qu.:2739   3rd Qu.:552          3rd Qu.:600.0    
##  Max.   :862       Max.   :4900   Max.   :796          Max.   :961.0

First conclusion

We can see a trend that people who sleep -8 hours use to do more active, lightly and moderately exercise and also used to spend more time doing this. Be careful with this information, cause it is not the final result.

Trends

Finally, after analyze data with numbers, it is time to see visual trends. We are going to merge the two 117 rows data sets, the people who sleep equal or more than 8 hours and less than 8 hours, so we have a new data set with 234 rows and with the information combine.

In this case, we are going to merge with outer join, cause we need both data and not only the ones who are the same.

merge_8h_fl8 <- merge(select_8h, random_selectfl8, all = TRUE)
      
eight_hours <- merge_8h_fl8 %>%
  mutate(eight_hours_or_not = if_else(total_minutes_asleep >= 480, 'TRUE', 'FALSE'))
glimpse(eight_hours)

## Rows: 234
## Columns: 14
## $ total_steps                <dbl> 42, 254, 678, 768, 980, 1202, 1219, 1251, 1…
## $ total_distance             <dbl> 0.03, 0.16, 0.47, 0.52, 0.68, 0.78, 0.78, 0…
## $ very_active_distance       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ moderately_active_distance <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ light_active_distance      <dbl> 0.03, 0.16, 0.47, 0.52, 0.68, 0.78, 0.78, 0…
## $ very_active_minutes        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ fairly_active_minutes      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
## $ lightly_active_minutes     <dbl> 4, 17, 55, 58, 51, 84, 84, 67, 82, 86, 79, …
## $ total_activity_minutes     <dbl> 4, 17, 55, 58, 51, 84, 84, 67, 82, 86, 79, …
## $ sedentary_minutes          <dbl> 2, 1002, 734, 380, 941, 506, 853, 836, 806,…
## $ calories                   <dbl> 403, 1141, 2220, 1212, 2221, 1463, 1963, 15…
## $ total_minutes_asleep       <dbl> 411, 357, 750, 483, 475, 775, 486, 484, 531…
## $ total_time_in_bed          <dbl> 473, 380, 775, 501, 499, 843, 503, 500, 552…
## $ eight_hours_or_not         <chr> "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", …

Interesting graphs

Sedentary Minutes vs Tota Activity Minutes

We can find imbalance in sedentary minutes comparing to activity minutes in people who sleep less than 8 hours. In people who sleep 8 hours or more we find a balance and the sedentary minutes are not so high, they trend to be more balance and center.

Balance between the total distance and the calories.

We can see a balance also in people who sleep 8 hours or more according to the different type of activities.

Blue = Very Active Green = Moderately active Red = Light Activity

Conclusion

We can see in the summaries that people who sleep less has higher mean values, almost in all the columns, but according to the graphs, we can see that the people who sleep 8 or more hours are more balance than the people who sleep less. This means, we can find a breakeven in +8 hours, and in - 8 hours we find lot of peaks values.

Importing more data

We are going to import the weight (kg) values, including bmi also (body mass index).

weight_Log_Info <- read_excel("weight_Log_Info.xlsx")

Cleaning and analyzing

head(weight_Log_Info)

## # A tibble: 6 x 10
##        Id ...2     ...3                Date    WeightKg WeightPounds   Fat   BMI
##     <dbl> <chr>    <dttm>              <chr>      <dbl>        <dbl> <dbl> <dbl>
## 1  1.50e9 42405.9… 2016-02-05 00:00:00 42405.…     52.6         116.    22  22.6
## 2  1.50e9 42434.9… 2016-03-05 00:00:00 42434.…     52.6         116.    NA  22.6
## 3  1.93e9 4/13/20… 2016-04-13 00:00:00 4/13/2…    134.          294.    NA  47.5
## 4  2.87e9 4/21/20… 2016-04-21 00:00:00 4/21/2…     56.7         125.    NA  21.5
## 5  2.87e9 42709.9… 2016-12-05 00:00:00 42709.…     57.3         126.    NA  21.7
## 6  4.32e9 4/17/20… 2016-04-17 00:00:00 4/17/2…     72.4         160.    25  27.5
## # … with 2 more variables: IsManualReport <lgl>, LogId <dbl>

weightt <- weight_Log_Info %>%
  clean_names() %>%
  rename(activity_date = x3) 
weight <- weightt %>%
  select(id,
    activity_date,
    weight_kg,
    fat,
    bmi,)
colnames(weight)

## [1] "id"            "activity_date" "weight_kg"     "fat"          
## [5] "bmi"

Merging da_sl data set with the new weight data

The data sets are going to be merge with inner join, we need only the information that match.

da_sl_we <- merge(da_sl, weight, by = c("id", "activity_date"))

Trend

We can see a trend when weight increment, the bmi also increment.

Importing new data sets, cleaning, analyzing and merging

Now, we want to analyze the data sets per hour and merge them for trends findings. We are going to import data sets, clean it and analyze it so we can see visual trends.

hourly_Intensities <- read_excel("hourly_Intensities.xlsx")
hourly_Calories <- read_excel("hourly_Calories.xlsx")
hourly_Steps <- read_excel("hourly_Steps.xlsx")

hi =hourly_Itennsities hc = hourly_Calories hs = hourly_Steps

Cleaning

hi <- hourly_Intensities %>%
  clean_names()
    
hc <- hourly_Calories %>%
  clean_names()
      
hs <- hourly_Steps %>%
  clean_names()

Merging

hi_hc <- merge(hi, hc, by = c("id", "activity_date"))

Cleaning and analyzing

This merge is so extensive, like 254979 observations, so we are going to take a sample, according to the population so we can analyze it and do future merges. We choose the sample size from a 254979 population, 95% confidence and 3% margin error.

random_hi_hc <- hi_hc %>%
  sample_n(1063, replace = FALSE)

Merging with hours_steps

hi_hc_hs <- merge(random_hi_hc, hs, by = c("id", "activity_date"))

This new merge has duplicates on same dates and minutes, so we are going to clean this duplicates.

hi_hc_hs_nodup <- hi_hc_hs %>%
  distinct(id, activity_date, .keep_all = TRUE)

ggplot(data = hi_hc_hs_nodup) +
  geom_point(mapping = aes(x= step_total, y = calories))

Final Project Data Analyst Career

Alexander Nicolás Tisko

Packages installed and libraries

Importing first data set:

Cleaning and analyzing data

Trends

Light exercise is the activity with more people

People burn more calories doing active excercise

How many time do I need to burn calories? And which excercise is the bestone?

Humans spend more time as active or sedentary?

Importing new data set

Cleaning and analyzing it

Trend

Minutes asleep vs in bed

Merge time!

Cleaning and analyzing

Important trends from last merge and filter

Less than 8 hours

8 Hours or more

First conclusion

Trends

Interesting graphs

Sedentary Minutes vs Tota Activity Minutes

Balance between the total distance and the calories.

Conclusion

Importing more data

Cleaning and analyzing

Merging da_sl data set with the new weight data

Trend

Importing new data sets, cleaning, analyzing and merging

Cleaning

Merging

Cleaning and analyzing

Merging with hours_steps