Case study using R and its IDE RStudio, with R Markdown and R Notebook
“From Paris, my hometown, I’ve decided to make it and land a job here in the Paris area, or remote, as a rookie Data Analyst. I’ve successfully completed all 8 courses required by the Google Data Analytics Certificate, all in English of course, over a full period of 6 months. But “impossible n’est pas français” (“impossible is not French”), as we say!”
Here we go for the case study!
I won’t use pipes (“%>%”), in order to help people understand R by learning it through hands-on practice, as I did and will do from now on. I’ll also use the sessionInfo function to record the current R version and packages, so that viewers, including future me looking back on my progress, can understand the work done even in the future.
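To show what working without pipes looks like in practice, here is a minimal sketch. The `steps` vector holds made-up values for illustration only; it is not taken from the data set.

```r
# Hypothetical step counts for one user (illustrative values only)
steps <- c(4000, 12000, 8000)

# Without pipes: store each intermediate result in its own object
average_steps <- mean(steps)
rounded_average <- round(average_steps, 1)
print(rounded_average)

# The same work written with the pipe would read:
# steps %>% mean() %>% round(1)
```

Naming each intermediate object makes every step visible in the RStudio Environment pane, which is handy while learning.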
We are performing data analysis with RStudio, organized like a to-do list of 33 questions to answer across the 6 phases of the Google Data Analytics Certificate roadmap: Ask, Prepare, Process, Analyze, Share, Act.
Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. Sando Mur is a mathematician, Bellabeat’s cofounder, and a key member of the Bellabeat executive team. The Bellabeat marketing analytics team is a team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.
I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide high-level recommendations for Bellabeat’s marketing strategy.
Sršen asks me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one Bellabeat product to apply these insights to in my presentation.
I’ve chosen the Bellabeat-made device Leaf.
Leaf is Bellabeat’s classic wellness tracker that can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
My analysis focuses on the following data points and topics: daily activity, daily sleep, and daily calories burned. The insights will help the primary stakeholders, Urška Sršen and Sando Mur, make data-driven decisions. My team is the Bellabeat marketing analytics team.
Sršen encourages me to use public data that explores smart device users’ daily habits. She points me to a specific data set: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius). This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. Sršen tells me that this data set might have some limitations, and encourages me to consider adding other data to help address those limitations as I begin to work more with this data.
I’ll load the data and create some R objects to work with throughout my case study.
Even though the data set is not perfect, this kind of data will be my day-to-day work, so the best practice is to get used to it with the right tools: a programming language like R, a query language like SQL on platforms like MySQL, Tableau for visualizations and database manipulation, and spreadsheets like Excel and Google Sheets.
With R I can do it all within one tool. The challenge is to understand R, its code, and its impressive packages and libraries: tidyverse, lubridate, and more.
Let’s bring into our workflow the needed packages and libraries so far.
library(tidyverse)
library(lubridate)
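Reading the files.
# read_csv() comes from readr (loaded with tidyverse); the files must be in the working directory
dailyActivity_merged <- read_csv("dailyActivity_merged.csv")
dailyCalories_merged <- read_csv("dailyCalories_merged.csv")
dailySteps_merged <- read_csv("dailySteps_merged.csv")
sleepDay_merged <- read_csv("sleepDay_merged.csv")
weightLogInfo_merged <- read_csv("weightLogInfo_merged.csv")
Viewing the files
view(dailyActivity_merged)
view(sleepDay_merged)
view(dailySteps_merged)
Checking the names of the variables or columns.
names(dailyCalories_merged)
names(dailySteps_merged)
names(sleepDay_merged)
Creating my R objects from the data set once downloaded, cleaned, and loaded.
dailyActivity <- dailyActivity_merged
dailyCalories <- dailyCalories_merged
dailySteps <- dailySteps_merged
dailySleeping <- sleepDay_merged
weightLogInfo <- weightLogInfo_merged
summary(dailyCalories)
summary(dailyActivity)
summary(dailySleeping)
summary(dailySteps)
summary(weightLogInfo)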
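Note that names() only lists the column names. One possible way to actually standardize them is sketched below in base R. The small data frame is a made-up stand-in for sleepDay_merged, and lower-casing is just one convention among others.

```r
# Hypothetical two-row stand-in for sleepDay_merged (values are made up)
sleep_example <- data.frame(
  Id = c(1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384)
)

# names() lists the columns; assigning to names() changes them
names(sleep_example) <- tolower(names(sleep_example))
names(sleep_example)
```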
My business task is to analyze data on how consumers use non-Bellabeat smart devices and apply those insights to one Bellabeat device, here the Leaf, a wearable wellness tracker.
1 What are some trends in smart device usage?
Some users in this data set didn’t log their records. Maybe this is due to a lack of information about how to use their devices. Bellabeat can fix this and gain these customers.
2 How could these trends apply to Bellabeat customers?
Bellabeat could offer these customers its Leaf device, which is easy to use and connects to the app to track sleep, activity, calories burned, and stress, awake or asleep. It is designed like a leaf and is fashionable.
3 How could these trends help influence Bellabeat marketing strategy?
Bellabeat can give customers free trials of its Leaf device, knowing that the customers who try it will love it for its outstanding and trendy efficiency. A win-win deal.
The data is stored on Kaggle and is in the public domain, free to use.
4 Where is your data stored?
Here https://www.kaggle.com/datasets/arashnic/fitbit
5 How is the data organized?
It is organized in tables, stored as merged CSV files.
6 Is it in long or wide format?
In a wide format, like usual Excel spreadsheets, Google Sheets, or R’s tibbles.
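To make the difference between wide and long formats concrete, here is a minimal sketch using pivot_longer() from tidyr. The `wide_steps` table and its column names are invented for illustration and are not part of the Kaggle files.

```r
library(tidyr)

# Hypothetical wide table: one row per user, one column per day
wide_steps <- data.frame(
  Id = c(1, 2),
  day_1 = c(5000, 7000),
  day_2 = c(6000, 8000)
)

# Long format: one row per user per day
long_steps <- pivot_longer(
  wide_steps,
  cols = c(day_1, day_2),
  names_to = "day",
  values_to = "steps"
)
print(long_steps)
```

The wide table has 2 rows and the long one has 4, but both hold the same four step counts.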
7 Are there issues with bias or credibility in this data?
The data set is biased and lacks some credibility: it covers only thirty users. However, I can work with it.
8 Does your data ROCCC?
It doesn’t meet the 5 ROCCC points, meaning it is not Reliable, Original, Comprehensive, Current, and Cited.
9 How are you addressing licensing, privacy, security, and accessibility?
It’s free to use (CC0: Public Domain), so no licensing is needed, and it is secure and accessible.
10 How did you verify the data’s integrity?
By first downloading it into a folder, checking its structure, and making a copy and then a backup in case we mistakenly lose some data points.
11 How does it help you answer your question?
By its trends and patterns that showed up after filtering and sorting with R functions.
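As an illustration of filtering and sorting without pipes, here is a small sketch with dplyr. The `daily_example` data frame is made up, and the `StepTotal` column name is an assumption about the steps file.

```r
library(dplyr)

# Hypothetical daily step records (values are made up)
daily_example <- data.frame(
  Id = c(1, 1, 2, 2),
  StepTotal = c(0, 9500, 12000, 3000)
)

# Keep only days with at least one recorded step
active_days <- filter(daily_example, StepTotal > 0)

# Sort from the most to the fewest steps
sorted_days <- arrange(active_days, desc(StepTotal))
print(sorted_days)
```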
12 Are there any problems with the data?
Some data points are missing. We will handle them so that they do not skew the results, removing them if necessary or treating them as NA, R’s marker for a missing value (NULL in SQL). Note that a missing value means no value was entered; it is not equal to zero.
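A short sketch of how R treats missing entries, using a made-up vector of minutes asleep. It shows why a missing value should not be silently treated as zero.

```r
# Hypothetical minutes asleep with one missing entry (made-up values)
minutes_asleep <- c(420, NA, 380)

# How many values are missing
sum(is.na(minutes_asleep))

# The missing value propagates unless it is removed explicitly
mean(minutes_asleep)
mean(minutes_asleep, na.rm = TRUE)

# Treating the missing value as zero would drag the average down
as_zero <- ifelse(is.na(minutes_asleep), 0, minutes_asleep)
mean(as_zero)
```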
We go further with R to check for errors, choosing the right tools to clean and transform the data so that it is tidy and easy to handle with R, using tidyr and ggplot2.
Tidy data is when we get each variable in its own column, each observation in its own row, and each value in its own cell, to quote Hadley Wickham and Garrett Grolemund.
13 What tools are you choosing and why?
My main tools are tidyr, to clean and organize the data, and ggplot2, to visualize it, while documenting the work for the future or for my colleagues when they want to work with the data set.
14 Have you ensured your data’s integrity?
Data is secure and easy to handle.
15 What steps have you taken to ensure that your data is clean?
Tidying and filtering made it easy to process and then analyze.
16 How can you verify that your data is clean and ready to analyze?
With R tools from packages dedicated to cleaning large data sets quickly.
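A few base R checks of this kind can be sketched as follows. The `activity_example` data frame is a made-up stand-in with one duplicated row and one missing value; the column names are assumptions.

```r
# Hypothetical records with one duplicated row and one missing value
activity_example <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  Calories = c(1985, 1985, 2100, NA)
)

# Count fully duplicated rows
sum(duplicated(activity_example))

# Count missing values in each column
colSums(is.na(activity_example))

# Count distinct users
length(unique(activity_example$Id))

# Drop the duplicated rows
deduplicated <- unique(activity_example)
nrow(deduplicated)
```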
17 Have you documented your cleaning process so you can review and share those results?
I’ve created folders to document all the steps of my work for me and for my teammates and stakeholders in case they want to work with the data I’ve cleaned, wrangled and made tidy.
Once I had aggregated my data and made it useful, accessible, organized, and formatted, I did some calculations to see the trends and patterns shown by the data itself. Data tells its own story.
18 How should you organize your data to perform analysis on it?
Data is organized by aggregate and calculation functions using R programming language.
19 Has your data been properly formatted?
I’ve formatted the needed variables for this analysis.
20 What surprises did you discover in the data?
Some characters, doubles (numeric values or, in mainstream terms, numbers), and dates in mm/dd/yyyy format were mixed in the same strings. I fixed this using the tidyverse and lubridate packages after loading the appropriate libraries.
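Here is a minimal sketch of how lubridate can turn such text into real dates. The sample strings are made up; the date-time variant with an AM/PM stamp is an assumption about how the sleep file stores its dates.

```r
library(lubridate)

# Hypothetical raw values stored as text
activity_day <- c("4/12/2016", "4/13/2016")
sleep_day <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# mdy() parses month/day/year text into Date values
activity_date <- mdy(activity_day)

# mdy_hms() parses the date-time text; as_date() then drops the time part
sleep_date <- as_date(mdy_hms(sleep_day))

class(activity_date)
all(activity_date == sleep_date)
```

Once both columns are real Date values, the tables can be compared or joined day by day.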
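# mean() needs a numeric column, not a whole data frame (column names as in the Kaggle files)
mean(dailyCalories$Calories, na.rm = TRUE)
mean(dailySteps$StepTotal, na.rm = TRUE)
mean(dailySleeping$TotalMinutesAsleep, na.rm = TRUE)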
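mean() works on a numeric vector, so it needs a single column rather than a whole data frame. The sketch below shows an overall average and a per-user average with base R’s aggregate(), without pipes. The `steps_example` data and the `StepTotal` column name are assumptions for illustration.

```r
# Hypothetical daily steps for two users (made-up values)
steps_example <- data.frame(
  Id = c(1, 1, 2, 2),
  StepTotal = c(4000, 6000, 10000, 12000)
)

# Overall average of one numeric column
overall_average <- mean(steps_example$StepTotal, na.rm = TRUE)
print(overall_average)

# Average per user, without pipes
average_by_user <- aggregate(StepTotal ~ Id, data = steps_example, FUN = mean)
print(average_by_user)
```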
21 What trends or relationships did you find in the data?
Some users found it difficult to log and note down their daily records. This could be an opportunity for Bellabeat to showcase its Leaf device, which can fix this for customers.
22 How will these insights help answer your business questions?
The main insight is filling the gap between customers’ goal of losing weight in an easygoing way and Bellabeat’s targets in the global smart device market.
With the Leaf wearable bracelet, customers can achieve their weight-loss goals without frustration or stress, as part of a healthy life.
My top high-level recommendation is to help my stakeholders make this data-driven decision: give customers in the global market the opportunity to use the Leaf wearable wellness tracker to achieve their goal of losing weight without frustration and in good health.
30 What is your final conclusion based on your analysis?
Bellabeat can easily gain customers and make it in the global smart device market with Leaf, an attractive wearable wellness tracker dedicated to women.
31 How could your team and business apply your insights?
By offering high-quality, nicely designed services with the launch of Leaf and other Bellabeat devices to showcase our brand.
32 What next steps would you or your stakeholders take based on your findings?
Run ads online and in magazines specialized in wellness and diet, focusing on women’s wellness.
33 Is there additional data you could use to expand on your findings?
I would use free and public data from the World Health Organization (WHO), along with its worldwide recommendations (https://www.who.int/news-room/fact-sheets/detail/physical-activity), to get insights into global populations’ habits, needs, and uses in their daily lives, to see patterns, and to study markets that are different from the US.
Even though the data set is not perfect, we were eventually able to get insights by using the 6 phases of the Google Data Analytics process, which I’ve studied for 6 months and which can be used to analyze any data available to me, even NoSQL data.
This case study helped me showcase the courses and my workflow so far as a data analyst.
Bellabeat can make it in the global market of smart devices dedicated to women by offering them the Leaf tracker.
sessionInfo() is an R function that will help future viewers see the version of R used on December 14, 2022, along with the packages used in this case study.
sessionInfo()