---
title: "FitBit Analysis"
author: "Bouba Ismaila"
date: "2024-05-24"
output: html_notebook
---

# Statement of the business task

Assess consumer usage trends of smart devices to inform Bellabeat's marketing strategy.

# Data sources used

The FitBit Fitness Tracker Data set was downloaded from Kaggle (FitBit Fitness Tracker Data, kaggle.com) and stored on my laptop in two sub-folders. For the analysis, I used the tables (dataframes) needed to examine relationships between total steps, calories, activity minutes, time in bed and sleep time. The data set contains many dataframes, and most of them repeat the same information using different measures. The dailyActivity_merged dataframe has the advantage of covering many measures in a single table, so it was retained. The minuteSleep_merged dataframe was also added. Every dataframe name carries the word "merged" with no distinct or specific meaning, so the names were left as they were.
The dataframes used are in long format.

# Data integrity was ensured by the following steps

* All dates stated in different forms were converted to one common format: MM.DD.YYYY.
* Data was verified for completeness to make sure that nothing was left out during replication (copying or importing).
* Date fields were also converted to the date type rather than string/character.
* During cleaning, care was taken not to remove data just because it appeared to be a duplicate; rows were removed only after verification.

# Credibility of data

# Reliable? 

The data was provided by a public data source (Kaggle). It was assumed to be reliable but was not verified.
Original: the data was provided voluntarily by people who used the smart devices for a certain period.

# Comprehensive

The data used covers the aspects of the business task under analysis.

# Current

The data is static and covers a fixed period.

# Complete

Completeness was verified during the data integrity check.

# Data cleaning and transformation

Data cleaning follows the installation and loading of the necessary packages.

The function sample() was used to check for data bias.
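
The notebook does not show how sample() was applied, so the following is only one possible reading: draw a random subset of rows and compare its mean with the mean of the full column. The vector `toy_steps` and its values are hypothetical stand-ins for a column such as TotalSteps.

```r
# Hypothetical stand-in for one numeric column (values are made up)
toy_steps <- c(1200, 3800, 5200, 6100, 7400, 8100, 9400, 10700, 12000, 15000)

set.seed(42)  # make the random draw repeatable

# Draw a random half of the positions, without replacement
idx <- sample(seq_along(toy_steps), size = 5)

subset_mean <- mean(toy_steps[idx])
full_mean   <- mean(toy_steps)

# A large gap would hint that a subset is not representative of the whole
abs(subset_mean - full_mean)
```

A check like this only compares a subset with the data already collected. It cannot detect bias in how the participants were recruited.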


```{r}
library(ggplot2)
library(tidyverse)
library(janitor)
library(lubridate)
library(dplyr)
library(reshape2)
library(tidyr)
library(SimDesign)
```


```{r}
dailyactivity <- read.csv("C:/Users/nokwanda/Documents/dailyactivity.csv")
```
```{r}
minutesleep <- read.csv("C:/Users/nokwanda/Documents/minutesleep.csv")
```


```{r}
missing_data_count <- colSums(is.na(dailyactivity))
missing_data_count  # all zeros means no missing values
```

```{r}
sum(duplicated(dailyactivity))
```
```{r}
sum(duplicated(minutesleep))
```

There is no missing data in the two dataframes. However, because the minuteSleep dataframe has 543 duplicates, it was dropped and replaced with the sleepDay_merged dataframe, which has only 3 duplicates. These were removed, and the resulting dataframe was renamed sleepDay_merged_unique (sleepdayunique in the code).

```{r}
sleepday <- read.csv("C:/Users/nokwanda/Documents/sleepday.csv")
missing_data_count <- colSums(is.na(sleepday))
missing_data_count  # all zeros means no missing values
```

```{r}
sum(duplicated(sleepday))
```
```{r}
sleepdayunique <- sleepday %>% 
  unique()
sum(duplicated(sleepdayunique))
```
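
Since duplicates were removed only after verification, a sketch of how repeated rows could be inspected before dropping them may be useful. The dataframe `toy_dups` and its values are hypothetical; only `duplicated()` and `unique()` come from the code above.

```r
# Hypothetical sleep records; rows 1 and 2 are exact copies of each other
toy_dups <- data.frame(
  Id = c(1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(420, 420, 380, 400)
)

# Flag every copy of a repeated row, not only the later occurrences
is_dup <- duplicated(toy_dups) | duplicated(toy_dups, fromLast = TRUE)

# Rows to check by eye before deciding to drop anything
toy_dups[is_dup, ]

# unique() keeps the first copy of each repeated row
toy_dups_unique <- unique(toy_dups)
sum(duplicated(toy_dups_unique))
```

Looking at both copies side by side makes it easier to confirm that the rows are true repeats and not two legitimate records that happen to share some values.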

The date column (sleepDay) of the sleepDay_merged_unique dataframe was converted to date format to match the converted dates in the other dataframes.

The date and/or time columns in the two dataframes were formatted and given a uniform name, activityDate, and other columns were renamed as necessary.
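
The conversion itself is not shown in the notebook. A minimal sketch in base R, assuming the raw columns are named `ActivityDate` and `SleepDay` and hold month/day/year strings (both the names and the sample values are assumptions), could look like this:

```r
# Hypothetical raw rows; column names and string formats are assumed
toy_activity_raw <- data.frame(
  Id = c(1, 1),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  stringsAsFactors = FALSE
)
toy_sleep_raw <- data.frame(
  Id = c(1, 1),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  stringsAsFactors = FALSE
)

# Parse month/day/year; any trailing time text is ignored by the parser
toy_activity_raw$activityDate <- as.Date(toy_activity_raw$ActivityDate,
                                         format = "%m/%d/%Y")
toy_sleep_raw$activityDate <- as.Date(toy_sleep_raw$SleepDay,
                                      format = "%m/%d/%Y")

# Drop the original string columns so both tables share one date column name
toy_activity_raw$ActivityDate <- NULL
toy_sleep_raw$SleepDay <- NULL

str(toy_activity_raw)
str(toy_sleep_raw)
```

With both tables carrying a column of class Date under the same name, they can later be matched on user and day.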

## Bias

After this data cleaning and transformation process, I assume that any possible bias in the data has been reduced to a minimum.

# Insufficient data

Because there is no time to collect new data and no extra data is available, the analysis is based on the existing data only.

# Data analysis

## Summary statistics about each dataframe

### For the daily activity dataframe:

```{r}
dailyactivity %>%  
  select(TotalSteps,  SedentaryMinutes, FairlyActiveMinutes, VeryActiveMinutes,
         Calories) %>%
  summary()
```

### For the sleepday unique dataframe:

```{r}
sleepdayunique %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
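```

### Exploring some plots

In these summaries, the median day has 7,406 steps, 1,057.5 sedentary minutes and 2,134 calories, and the median night has 432.5 minutes asleep out of 463 minutes in bed. Let’s see what the data tells us about this sample of people’s activities, starting with the relationship between total steps taken and calories used.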

```{r}
ggplot(dailyactivity) + 
  geom_smooth(mapping = aes(x=TotalSteps, y=Calories)) +
  labs(title = 'Total steps taken vs calories burned')
```

This plot shows that the more steps are taken, the more calories are burned, although the increase seems to slow down at very high step counts. This can help direct marketing toward customers who want to burn more calories for any reason, such as wanting to lose weight, or those who have lost their appetite and want to regain it, by encouraging them to walk more. The marketing team could also use this information to advertise to athletes who want to compare the steps they take in a day against the calories they need.
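
To put a number on the trend seen in the smoothed curve, one option is a correlation coefficient and a simple linear fit. The sketch below uses a small hypothetical dataframe `toy_steps_cal` with made-up values; it is not the notebook's data and not a step the original analysis performed.

```r
# Hypothetical daily records (values are made up for illustration)
toy_steps_cal <- data.frame(
  TotalSteps = c(2000, 4000, 6000, 8000, 10000, 12000),
  Calories   = c(1700, 1900, 2050, 2250, 2400, 2600)
)

# Pearson correlation: close to 1 means a strong positive linear association
r <- cor(toy_steps_cal$TotalSteps, toy_steps_cal$Calories)

# Linear fit: the slope is the average change in calories per extra step
fit <- lm(Calories ~ TotalSteps, data = toy_steps_cal)
calories_per_1000_steps <- coef(fit)[["TotalSteps"]] * 1000

r
calories_per_1000_steps
```

A straight-line fit would miss the flattening at very high step counts noted above, so it complements the smoothed curve and does not replace it.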

Let’s see what the relationship is between total minutes asleep and total time in bed.

```{r}
ggplot(data = sleepdayunique) + geom_smooth(mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) +
  labs(title = 'Total minutes asleep vs total minutes in bed')
```


The curve is not as linear as one would expect, which shows that time in bed is not equal to time asleep. Further exploration is needed before drawing a conclusion here.
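
One way to start that further exploration is to compute, for each record, the minutes spent awake in bed and the share of time in bed actually spent asleep. The sketch below uses a hypothetical dataframe `toy_bed` with made-up values; the derived column names `MinutesAwakeInBed` and `SleepEfficiency` are illustrative and do not appear in the original data.

```r
# Hypothetical sleep records (values are made up for illustration)
toy_bed <- data.frame(
  TotalMinutesAsleep = c(360, 430, 490),
  TotalTimeInBed     = c(400, 460, 530)
)

# Minutes in bed but not asleep
toy_bed$MinutesAwakeInBed <- toy_bed$TotalTimeInBed - toy_bed$TotalMinutesAsleep

# Share of time in bed spent asleep (1 would mean asleep the whole time)
toy_bed$SleepEfficiency <- toy_bed$TotalMinutesAsleep / toy_bed$TotalTimeInBed

toy_bed
```

Plotting or summarising these derived columns would show directly how far users are from sleeping the whole time they are in bed.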

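Let’s now merge these two data sets. Note that merging on Id alone pairs every activity day of a user with every sleep day of the same user, so the combined table has many more rows than either input, and the plots that follow should be read with that in mind.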

```{r}
combineddata <- merge(dailyactivity, sleepdayunique, by="Id")
head(combineddata)
```
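
The merge above joins on Id only. For a day-level comparison, a possible alternative is to join on both the user and the date. The sketch below assumes both tables carry a date column named `activityDate`, as described in the cleaning section; the toy dataframes and their values are hypothetical.

```r
# Hypothetical activity and sleep tables sharing Id and activityDate
toy_act <- data.frame(
  Id = c(1, 1, 2),
  activityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(8000, 9500, 4000)
)
toy_slp <- data.frame(
  Id = c(1, 1, 2),
  activityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(420, 390, 450)
)

# Joining on Id only: each of user 1's two days meets both sleep days
by_id <- merge(toy_act, toy_slp, by = "Id")

# Joining on Id and date: one row per user per day
by_day <- merge(toy_act, toy_slp, by = c("Id", "activityDate"))

nrow(by_id)   # 5 rows: 2 x 2 for user 1, plus 1 for user 2
nrow(by_day)  # 3 rows: one per user-day
```

With the day-level join, each point in a steps-versus-sleep plot would compare a day's activity with that same day's sleep record.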


Now, let’s look at the relationship between steps taken and sleep time.

```{r}
ggplot(data = combineddata, mapping = aes(x = TotalSteps, y = TotalMinutesAsleep)) + geom_smooth() +
  labs(title = 'Total steps taken vs total minutes asleep')
```

There is no clear relationship between these two variables, such as more steps going with more or less sleep time: the curve starts high, then goes down and up again. This information may help the marketing team draw customers’ attention to the fact that taking more steps will not necessarily make them sleep more. Advice on an average number of steps could be recommended.

Let’s also look at the relationship between very active minutes and minutes asleep.

```{r}
ggplot(data = combineddata, mapping = aes(x=VeryActiveMinutes, y=TotalMinutesAsleep)) + geom_smooth() +
  labs(title = 'Very active minutes vs total minutes asleep')
```

Here too, no clear conclusion can be drawn since the relationship fluctuates a lot: sleep time is high at zero very active minutes, very low at around 40 minutes, high again at around an hour, and then drops again.

Let’s now look at what is happening between sedentary minutes and total minutes asleep.

```{r}
ggplot(data = combineddata, mapping = aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) + geom_smooth() +
  labs(title = 'Sedentary minutes vs total minutes asleep')
```

Here again, no clear conclusion can be drawn since the relationship fluctuates a lot: sleep time is relatively high at zero sedentary minutes, peaks at around 400 sedentary minutes, drops to a very low level at around 1150 sedentary minutes, and then rises again.

Finally, let’s see what happens between very active minutes and sedentary minutes by looking at fairly active minutes against total minutes asleep.

```{r}
ggplot(data = combineddata, mapping = aes(x=FairlyActiveMinutes, y=TotalMinutesAsleep)) + geom_smooth() +
  labs(title = 'Fairly active minutes vs total minutes asleep')
```

This graph shows a clearer relationship, though not a perfect one: low to moderate numbers of fairly active minutes go with longer sleep time. Sleep time gradually decreases as active minutes increase, before picking up slightly again after about 100 active minutes. From the analysis of these last three graphs, the marketing team can advise customers on how much activity is enough for good sleep and recommend an average number of active minutes.
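
To turn the curve into a figure that could be recommended, one option is to group days into bands of fairly active minutes and average the sleep time in each band. The sketch below uses a hypothetical dataframe `toy_bands`; the values and the band limits (20 and 60 minutes) are illustrative assumptions, not results from the notebook.

```r
# Hypothetical combined records (values are made up for illustration)
toy_bands <- data.frame(
  FairlyActiveMinutes = c(0, 5, 12, 25, 40, 60, 90, 120),
  TotalMinutesAsleep  = c(450, 440, 430, 410, 400, 380, 370, 390)
)

# Assign each day to an activity band; the limits are arbitrary examples
toy_bands$ActivityBand <- cut(
  toy_bands$FairlyActiveMinutes,
  breaks = c(0, 20, 60, Inf),
  labels = c("0-20", "21-60", "61+"),
  include.lowest = TRUE
)

# Mean sleep time per band
band_means <- aggregate(TotalMinutesAsleep ~ ActivityBand,
                        data = toy_bands, FUN = mean)
band_means
```

A table like this is easier to quote in marketing material than a smoothed curve, although the choice of band limits affects the averages.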

In conclusion, analysing and visualizing these activities gives the Bellabeat marketing team a good idea of the strategies to adopt for their advertising campaign. For better decision making, Bellabeat should consider asking more of its clients to regularly report data on an online portal, or requesting that they connect their various apps to the company's server so that dashboards can be created to monitor how the tools are used.