This example case study demonstrates my knowledge and skills in data analytics with R. In the role of a junior data analyst at Bellabeat, a high-tech company that manufactures health-focused smart products, I analyze and present smart device data to gain insight into how consumers use their smart products.
Upload CSV files to the project from the relevant data source: https://www.kaggle.com/arashnic/fitbit
In this project the following two CSV files will be used:
* Daily Activity
* Daily Calories
Several packages could be installed and loaded at this stage. You can also install and load packages along the way, as you may discover that you need different packages after you start your analysis.
Here, one of the CSV files from the dataset is read into a data frame named ‘daily_activity’. The “readr” package must be installed and loaded first, because it provides the read_csv() function used to read the file.
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(readr)
daily_activity <- read_csv("daily_activity.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This stage provides a quick overview of the dataset.
head(daily_activity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
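The preview shows that ActivityDate is stored as text (chr). A minimal sketch of converting such a column to a proper Date, assuming the month/day/year layout seen in the preview. The small `sample_activity` data frame is a made-up stand-in for `daily_activity`, not the real file:

```r
# Hypothetical stand-in for daily_activity: a few rows in the same layout
sample_activity <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  TotalSteps = c(13162, 10735, 10460),
  stringsAsFactors = FALSE
)

# Convert the text column to Date, assuming month/day/year order
sample_activity$ActivityDate <- as.Date(sample_activity$ActivityDate,
                                        format = "%m/%d/%Y")

# The column is now a Date, so it sorts chronologically and supports date arithmetic
str(sample_activity$ActivityDate)
diff(range(sample_activity$ActivityDate))
```

With text dates, sorting is alphabetical ("4/2/2016" would come after "4/12/2016"), so a conversion like this is worth considering before any time-based analysis.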
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(dplyr)
daily_activity %>%
select(Id, ActivityDate, TotalSteps, Calories)
## # A tibble: 940 × 4
## Id ActivityDate TotalSteps Calories
## <dbl> <chr> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 1985
## 2 1503960366 4/13/2016 10735 1797
## 3 1503960366 4/14/2016 10460 1776
## 4 1503960366 4/15/2016 9762 1745
## 5 1503960366 4/16/2016 12669 1863
## 6 1503960366 4/17/2016 9705 1728
## 7 1503960366 4/18/2016 13019 1921
## 8 1503960366 4/19/2016 15506 2035
## 9 1503960366 4/20/2016 10544 1786
## 10 1503960366 4/21/2016 9819 1775
## # … with 930 more rows
Here, the second CSV file from the dataset is read into a data frame named ‘daily_calories’.
library(readr)
daily_calories <- read_csv("daily_calories.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A brief preview of the dataset can be shown using the following code chunk.
head(daily_calories)
## # A tibble: 6 × 3
## Id ActivityDay Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
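The six rows shown here carry the same Calories values as the first rows of daily_activity, which suggests that the two files overlap. A sketch of how such an overlap could be checked by matching rows on user and date. The two small data frames below reuse only the values visible in the previews above, so this is an illustration of the check and not a result for the full dataset:

```r
# Stand-ins built from the first three preview rows of each file
activity_sample <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985, 1797, 1776),
  stringsAsFactors = FALSE
)
calories_sample <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985, 1797, 1776),
  stringsAsFactors = FALSE
)

# Match rows on user and date; the date column has a different name in each file
merged <- merge(activity_sample, calories_sample,
                by.x = c("Id", "ActivityDate"),
                by.y = c("Id", "ActivityDay"),
                suffixes = c("_activity", "_calories"))

# TRUE when every matched day reports the same calories in both files
all_match <- all(merged$Calories_activity == merged$Calories_calories)
all_match
```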
This stage of the data analysis is very important because it reduces the chance of errors appearing during the analysis stage.
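Typical checks at this stage include counting unique users, duplicated rows and missing values. A minimal sketch in base R, using a made-up `sample_activity` data frame (the values and the two-user layout are assumptions for illustration):

```r
# Made-up data with one duplicated row and one missing value
sample_activity <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735, 9000, 9000, NA),
  stringsAsFactors = FALSE
)

# Number of distinct users
n_users <- length(unique(sample_activity$Id))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(sample_activity))

# Missing values per column
na_per_column <- colSums(is.na(sample_activity))

# Keep only unique rows without missing values
clean <- sample_activity[!duplicated(sample_activity) &
                           complete.cases(sample_activity), ]

n_users
n_duplicates
na_per_column
nrow(clean)
```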
library(tidyverse)
daily_activity %>% arrange(Id)
## # A tibble: 940 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## 7 1.50e9 4/18/2016 13019 8.59 8.59 0
## 8 1.50e9 4/19/2016 15506 9.88 9.88 0
## 9 1.50e9 4/20/2016 10544 6.68 6.68 0
## 10 1.50e9 4/21/2016 9819 6.34 6.34 0
## # … with 930 more rows, and 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
daily_activity %>% group_by(Id, ActivityDate) %>% drop_na() %>% summarize(average_TotalSteps=mean(TotalSteps))
## `summarise()` has grouped output by 'Id'. You can override using the `.groups`
## argument.
## # A tibble: 940 × 3
## # Groups: Id [33]
## Id ActivityDate average_TotalSteps
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
## 7 1503960366 4/18/2016 13019
## 8 1503960366 4/19/2016 15506
## 9 1503960366 4/20/2016 10544
## 10 1503960366 4/21/2016 9819
## # … with 930 more rows
daily_users <- daily_activity %>% distinct(Id, .keep_all=TRUE)
users_steps <- daily_users %>%
select(Id, TotalSteps)
users_activity <- daily_users %>%
select(Id, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>% drop_na()
users_activity$Max<-pmax(users_activity$VeryActiveMinutes, users_activity$FairlyActiveMinutes, users_activity$LightlyActiveMinutes, users_activity$SedentaryMinutes)
users_status <- users_activity %>%
select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes)
users_status$Largest_Column<-colnames(users_status)[apply(users_status,1,which.max)]
status <- table(users_status['Largest_Column'])
total_activity <- daily_activity %>%
select(Id, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>% drop_na()
total_activity$sum <- total_activity$VeryActiveMinutes+total_activity$FairlyActiveMinutes+total_activity$LightlyActiveMinutes+total_activity$SedentaryMinutes
total_activity2 <- total_activity %>% distinct(Id, .keep_all=TRUE)
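Note that distinct(Id, .keep_all=TRUE) keeps the first row found for each Id, so the resulting tables hold one day per user and not a per-user total. If per-user totals or averages are wanted instead, grouping and summarising is one option. A sketch on made-up data (the `sample_activity` values are assumptions for illustration):

```r
library(dplyr)

# Made-up daily records: three days for user 1, two days for user 2
sample_activity <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  TotalSteps = c(100, 200, 300, 50, 150)
)

# distinct() keeps only the first row per Id
first_day <- sample_activity %>% distinct(Id, .keep_all = TRUE)

# group_by() + summarise() aggregates over all rows per Id
per_user <- sample_activity %>%
  group_by(Id) %>%
  summarise(total_steps = sum(TotalSteps),
            mean_steps = mean(TotalSteps),
            .groups = "drop")

first_day
per_user
```

Which of the two is appropriate depends on the question: a single-day snapshot per user, or a summary over the whole tracking period.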
total_steps <- daily_activity %>%
select(Id, ActivityDate, TotalSteps)
total_calories <- daily_calories %>%
select(Id, ActivityDay, Calories)
# Treat zero-calorie days as missing and drop them
total_calories$Calories[total_calories$Calories==0] <- NA
total_calories2 <- total_calories[complete.cases(total_calories),]
# Join on user and date so that each day is matched once; joining on Id alone
# would pair every day of a user with every other day and inflate the sums
steps_vs_calories <- inner_join(x = total_steps, y = total_calories2, by = c("Id", "ActivityDate" = "ActivityDay"))
# Sum steps and calories per user
steps_vs_calories2 <- aggregate(cbind(TotalSteps, Calories) ~ Id,
steps_vs_calories,
sum)
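When both tables hold one row per user per day, the choice of join keys determines how many rows come out of the join. A sketch on made-up data comparing a join on Id alone with a join on Id and date (the `steps` and `calories` data frames and the `Day` column name are assumptions for illustration):

```r
library(dplyr)

# Made-up daily tables: two users, two days each
steps <- data.frame(
  Id = c(1, 1, 2, 2),
  Day = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(100, 200, 50, 150),
  stringsAsFactors = FALSE
)
calories <- data.frame(
  Id = c(1, 1, 2, 2),
  Day = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  Calories = c(1000, 1100, 900, 950),
  stringsAsFactors = FALSE
)

# Join on Id only: every day of a user is paired with every day of the same user
# (recent dplyr versions warn about this many-to-many match)
by_id <- suppressWarnings(inner_join(steps, calories, by = "Id"))

# Join on Id and Day: each day is matched exactly once
by_id_day <- inner_join(steps, calories, by = c("Id", "Day"))

nrow(by_id)
nrow(by_id_day)

# Per-user step totals differ between the two joins
sum(by_id$TotalSteps[by_id$Id == 1])
sum(by_id_day$TotalSteps[by_id_day$Id == 1])
```

Any sum computed after the Id-only join counts each value several times, so per-user totals should be built from a join that matches on the date as well.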
The visualization stage requires some packages to be installed and loaded.
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(ggplot2)
install.packages("colorspace")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(colorspace)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(tidyverse)
plot(users_steps$TotalSteps,type = "o", col = "black", xlab = "Users", ylab = "Steps",
main = "Total Steps")
The plot above shows the step counts of 33 unique customers, one value per customer (the first recorded day, because distinct() keeps the first row for each Id).
ggplot(data=users_steps) +
geom_line(mapping = aes(x=Id,y=TotalSteps)) +
geom_hline(yintercept = mean(users_steps$TotalSteps), color="blue")+
labs(title="Total Steps")+
annotate("text",x=5.0e+09,y=20000,label="The blue line indicates the average number of steps")
The plot above shows the average number of steps across customers (blue line), which is close to 10,000 steps.
barplot(total_activity2$sum, col = "grey", xlab = "Users", ylab = "Minutes",
main = "Activity Minutes")
The bar chart above shows the activity minutes of 33 unique customers, summed over the very active, fairly active, lightly active and sedentary categories. For most customers the total is more than 800 minutes.
slices <- c(29, 4)
lbls <- c("Lightly Active", "Very Active")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(slices,labels = lbls, col=grey.colors(length(lbls)),
main="Pie Chart of Activity")
The pie chart above shows the most frequent activity status per customer, that is, the non-sedentary category with the most minutes. According to this chart, 88% of customers are lightly active users and 12% of customers are very active users.
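The slice sizes and percentage labels of such a chart can also be derived from a frequency table instead of being typed in by hand. A minimal sketch, assuming a vector with one dominant activity column name per user; the `largest_column` values below are made up and are not the Fitbit results:

```r
# Made-up dominant activity level for four hypothetical users
largest_column <- c("LightlyActiveMinutes", "LightlyActiveMinutes",
                    "VeryActiveMinutes", "LightlyActiveMinutes")

# Count how often each level occurs
status <- table(largest_column)

# Convert counts to rounded percentages and build the labels
pct <- round(100 * prop.table(status))
lbls <- paste0(names(status), " ", pct, "%")

# Draw the pie chart from the computed counts
pie(as.vector(status), labels = lbls, col = grey.colors(length(lbls)),
    main = "Pie Chart of Activity")
```

Deriving the slices this way keeps the chart in step with the data if the input changes.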
ggplot(data=steps_vs_calories2)+
geom_point(mapping=aes(x=TotalSteps,y=Calories))+
labs(title="Steps vs. Calories")
The scatter plot above shows a positive relationship between steps and calories: the more steps are made, the more calories are burned.
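A relationship like this can be quantified with a correlation coefficient and a simple linear fit. A sketch in base R on made-up per-user totals; the numbers in `sample_totals` are invented for illustration and say nothing about the actual dataset:

```r
# Made-up per-user totals, for illustration only
sample_totals <- data.frame(
  TotalSteps = c(5000, 8000, 10000, 12000, 15000),
  Calories = c(1700, 1850, 2000, 2100, 2300)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(sample_totals$TotalSteps, sample_totals$Calories)

# Linear model: the slope estimates additional calories per additional step
fit <- lm(Calories ~ TotalSteps, data = sample_totals)
slope <- coef(fit)[["TotalSteps"]]

r
slope
```

A positive correlation describes an association only; calories burned also depend on factors such as body weight and activity intensity, which this sketch does not model.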
Bellabeat is a successful company focused on producing smart products for women: Bellabeat app, Leaf, Time, Spring and Bellabeat membership.
The analysis of smart device fitness data from FitBit customers gives insight into how users of smart trackers use their devices. The following trends were identified:
Based on the identified trends, the following changes to the marketing strategy of the Bellabeat app and Time could be made in order to increase users’ activity and help unlock new growth opportunities for the company: