Bellabeat is a small high-tech manufacturer of health-focused products, founded in 2013. By 2016, the company had opened offices worldwide.
● What are the new growth opportunities for our products?
● What modifications should be made to tailor the products more to customers?
● What insights are useful for developing effective future marketing strategies?
Data source: FitBit Fitness Tracker Data (CC0: Public Domain). The data is hosted on Kaggle as FitBit Fitness Tracker Data.
Copyright: The data is free to use. All details about licensing, privacy, security, and accessibility can be found at Deed - CC0 1.0 Universal - Creative Commons.
install.packages("tidyverse")
library(tidyverse)
library(dplyr)
library(tidyr)
install.packages("readr")
library(readr)
install.packages("lubridate")
library(lubridate)
install.packages("janitor")
library(janitor)
install.packages("knitr")
library(knitr)
install.packages("highcharter")
library(highcharter)
install.packages("ggplot2")
library(ggplot2)
install.packages("plotly")
library(plotly)
install.packages("EnvStats")
library(EnvStats)
install.packages("htmltools")
library(htmltools)
install.packages("DT")
library(DT)
Upload the data set to the project directory in Posit, then rename the folders Fitabase Data 3.12.16-4.11.16 and Fitabase Data 4.12.16-5.12.16 to March_data and April_data respectively. See Appendix 1.
Read the csv files and count the number of users in each. When reading, the “merged” suffix in each file name was replaced with “Apr” or “Mar” in the corresponding table name, for files starting in April or March respectively.
For example, dailyActivity_merged.csv in the folder April_data is read into the table dailyActivity_Apr. Then count the distinct user IDs in each data table.
# Read file and print the first 3 rows
dailyActivity_Apr<-read_csv("April_data/dailyActivity_merged.csv")
print(dailyActivity_Apr, n=3)
## # A tibble: 940 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## # ℹ 937 more rows
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
# Count number of users
n_distinct(dailyActivity_Apr$Id)
## [1] 33
Continue skimming through the rest, then aggregate the dataset information into a data overview.
*METs = Metabolic Equivalent of Task. Vigorous-intensity activities are defined as ≥ 6.0 METs; running at 10 minutes per mile (6.0 mph) is a 10-MET activity and is therefore classified as vigorous intensity (HHS 2008, 55). However, the maximum METs in both given tables significantly surpass these benchmarks.
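One plausible explanation, which would need confirming against the Fitabase data dictionary, is that the export stores METs multiplied by 10. The sketch below rescales on that assumption and classifies each minute against the HHS thresholds; the column name METs follows minuteMETsNarrow_Apr.
# Hedged sketch: rescale METs (assumed to be stored at 10x their true value)
# and classify each minute against the HHS intensity thresholds
mets_classified <- minuteMETsNarrow_Apr %>%
mutate(
METs_actual = METs / 10,
Intensity_class = case_when(
METs_actual >= 6 ~ "vigorous", # HHS 2008: >= 6.0 METs
METs_actual >= 3 ~ "moderate",
TRUE ~ "light or sedentary"
)
)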
The data set consists of 29 files, organised into two folders with different time ranges: one covers 3/12/2016 to 4/11/2016 and the other 4/12/2016 to 5/12/2016. The data is a mix of wide and long formats.
There are 35 distinct IDs in the whole data set. However, the number of IDs in each file is inconsistent. See Appendix 2
Column names are consistent between files, but the number of columns per file varies. See Appendix 3
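One way to automate this check with the already-loaded janitor package is compare_df_cols(), which lists each column name alongside its class in every supplied table; a sketch for the two daily activity tables:
# Compare column names and types across the two daily activity tables
compare_df_cols(dailyActivity_Apr, dailyActivity_Mar)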
Cross-verification: Data within one folder or between two folders in the same categories was compared to ensure that data from different tables matched. See Appendix 4
Audit trails: A change log was maintained to monitor changes to the dataset. See Appendix 1
● Abbreviations and calculation formulas in some tables are not clarified in the data source.
● The data does not include demographic information, so sampling bias cannot be ruled out.
● The data is relatively dated, as the survey was undertaken in 2016.
The results of the analysis should therefore be taken as a reference only. Further clarification and extra data sources are needed.
R is used to examine and then merge the needed tables before analysing.
● Time stamps differ between files and between the two folders.
● Dates are stored in mdy and mdy_hms formats (see the parsing sketch after this list).
● There are overlaps and inconsistencies in data covering the same period.
● The number of ID records in March is significantly smaller than in April.
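A quick illustration of the two layouts with lubridate:
# The daily tables store dates like "4/12/2016" (month/day/year), while the
# hourly tables store stamps like "4/12/2016 1:00:00 AM"
mdy("4/12/2016") # Date: "2016-04-12"
mdy_hms("4/12/2016 1:00:00 AM") # POSIXct: "2016-04-12 01:00:00 UTC"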
Clean files No. 1, 2, 4, 5, 12, 13, 24, 25 from the Dataset overview table.
First, the dailyActivity_Apr table was cleaned.
# Find the location of missing values
which(is.na(dailyActivity_Apr), arr.ind = TRUE)
## row col
# Count the total number of missing values
sum(is.na(dailyActivity_Apr))
## [1] 0
# Count the number of duplicate rows
sum(duplicated(dailyActivity_Apr))
## [1] 0
# Check the structure of the data frame
str(dailyActivity_Apr)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Trim whitespace from column names
colnames(dailyActivity_Apr) <- trimws(colnames(dailyActivity_Apr))
# Remove empty rows and columns
dailyActivity_Apr <- remove_empty(dailyActivity_Apr, which = c("rows", "cols"))
Then, apply the same method to all of the tables below. See more code in Appendix 5
Table hourlyCalories_Apr
Table hourlyIntensities_Apr
Table hourlySteps_Apr
Table dailyActivity_Mar
Table hourlyCalories_Mar
Table hourlyIntensities_Mar
Table hourlySteps_Mar
Join the daily Activity tables. Note: records on 5/12/2016 are excluded because they are incomplete and could skew the analysis results. The period analysed runs from 3/12/2016 to 5/11/2016, covering 60 days.
# Filter out data on 5/12/2016
dailyActivity_Apr_without_end_date <- dailyActivity_Apr[dailyActivity_Apr$ActivityDate != "5/12/2016", ]
dailyActivity_Mar_without_end_date <- dailyActivity_Mar[dailyActivity_Mar$ActivityDate != "4/12/2016", ]
# Merge tables to create a master data frame named dailyActivity
dailyActivity <- rbind(dailyActivity_Apr_without_end_date,dailyActivity_Mar_without_end_date)
# Change the data type of the column ActivityDate from character to date
dailyActivity$ActivityDate <- mdy(dailyActivity$ActivityDate)
Detect outliers:
# Detect outliers in IDs based on their frequency
detect_outlier_frequency <- dailyActivity %>%
group_by(Id) %>%
summarise(Frequency = n(), .groups = 'drop')
# Find outliers using Rosner's Test
result <- rosnerTest(detect_outlier_frequency$Frequency, k = 5)
# Extract the outlier values
outliers <- result$all.stats %>%
filter(Outlier) %>%
select(Value)
# Identify the IDs corresponding to the outlier frequencies
outlier_Ids <- detect_outlier_frequency %>%
filter(Frequency %in% outliers$Value)
# Print outliers
print(outlier_Ids)
## # A tibble: 3 × 2
## Id Frequency
## <dbl> <int>
## 1 2891001357 8
## 2 4020332650 61
## 3 6391747486 9
Three IDs were classified as outliers based on their frequency in the dataset. However, this could be due to the nature of the observations: the frequency simply reflects how many days each user recorded data.
Another test will be conducted to detect outlier values in the Calories column.
# Detect outliers based on values in Calories column
rosnerTest(dailyActivity$Calories,k = 10)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: dailyActivity$Calories
##
## Sample Size: 1352
##
## Test Statistics: R.1 = 3.626794
## R.2 = 3.254037
## R.3 = 3.268093
## R.4 = 3.282332
## R.5 = 3.296759
## R.6 = 3.237084
## R.7 = 3.243805
## R.8 = 3.224483
## R.9 = 3.223801
## R.10 = 3.230316
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 2312.760 713.3682 4900 592 3.626794 4.113355 FALSE
## 2 1 2310.845 710.1472 0 639 3.254037 4.113177 FALSE
## 3 2 2312.557 707.6168 0 800 3.268093 4.112999 FALSE
## 4 3 2314.271 705.0693 0 1148 3.282332 4.112821 FALSE
## 5 4 2315.988 702.5046 0 1227 3.296759 4.112643 FALSE
## 6 5 2317.707 699.9224 52 338 3.237084 4.112465 FALSE
## 7 6 2319.391 697.4497 57 889 3.243805 4.112286 FALSE
## 8 7 2321.073 694.9725 4562 1049 3.224483 4.112108 FALSE
## 9 8 2319.406 692.5348 4552 558 3.223801 4.111929 FALSE
## 10 9 2317.743 690.1050 4547 894 3.230316 4.111750 FALSE
Since the test did not show any outliers, the original observations in the column Calories will be retained for analysis. However, some insights will be re-tested if outliers are detected in the analyzed columns to ensure the results are reliable.
Combine the selected files from the overview table, using Id and ActivityHour as the merge keys. Then consolidate them into one master dataset.
# Merge hourlyCalories_Apr, hourlyIntensities_Apr, and hourlySteps_Apr
hourly_Apr <- hourlyCalories_Apr %>%
inner_join(hourlyIntensities_Apr, by = c("Id","ActivityHour")) %>%
inner_join(hourlySteps_Apr, by = c("Id","ActivityHour")) %>%
select(Id,ActivityHour,TotalIntensity,StepTotal,Calories)
# Merge hourlyCalories_Mar, hourlyIntensities_Mar, hourlySteps_Mar
hourly_Mar <- hourlyCalories_Mar %>%
inner_join(hourlyIntensities_Mar, by = c("Id","ActivityHour")) %>%
inner_join(hourlySteps_Mar, by = c("Id","ActivityHour")) %>%
select(Id,ActivityHour,TotalIntensity,StepTotal,Calories)
# Combine these two data frames into master data, then change data type of column ActivityHour
hourly_merged <- rbind(hourly_Apr,hourly_Mar)
hourly_merged$ActivityHour <- mdy_hms(hourly_merged$ActivityHour)
Then, arrange data in order from Monday to Sunday:
hourly_merged_V1 <- hourly_merged %>%
mutate(
hour = hour(ActivityHour),
weekday = factor(weekdays(ActivityHour), levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
) %>%
group_by(weekday, hour) %>%
summarise(across(TotalIntensity:Calories, \(x) sum(x, na.rm = TRUE)), .groups = 'drop') %>%
arrange(weekday, hour)
Next, check for outliers:
rosnerTest(hourly_merged_V1$TotalIntensity, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: hourly_merged_V1$TotalIntensity
##
## Sample Size: 168
##
## Test Statistics: R.1 = 1.749956
## R.2 = 1.645558
## R.3 = 1.606795
## R.4 = 1.605240
## R.5 = 1.597390
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 3135.179 1956.519 6559 67 1.749956 3.552401 FALSE
## 2 1 3114.677 1944.218 6314 43 1.645558 3.550554 FALSE
## 3 2 3095.404 1934.034 6203 134 1.606795 3.548694 FALSE
## 4 3 3076.570 1924.591 6166 66 1.605240 3.546821 FALSE
## 5 4 3057.732 1915.167 6117 68 1.597390 3.544935 FALSE
rosnerTest(hourly_merged_V1$StepTotal, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: hourly_merged_V1$StepTotal
##
## Sample Size: 168
##
## Test Statistics: R.1 = 2.004184
## R.2 = 1.726885
## R.3 = 1.703869
## R.4 = 1.715386
## R.5 = 1.686624
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 83146.71 56577.78 196539 67 2.004184 3.552401 FALSE
## 2 1 82467.71 56057.19 179272 43 1.726885 3.550554 FALSE
## 3 2 81884.55 55716.39 176818 68 1.703869 3.548694 FALSE
## 4 3 81309.20 55389.16 176323 135 1.715386 3.546821 FALSE
## 5 4 80729.85 55055.05 173587 66 1.686624 3.544935 FALSE
rosnerTest(hourly_merged_V1$Calories, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: hourly_merged_V1$Calories
##
## Sample Size: 168
##
## Test Statistics: R.1 = 1.958716
## R.2 = 1.865898
## R.3 = 1.825498
## R.4 = 1.833752
## R.5 = 1.834406
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 26324.22 5079.235 36273 67 1.958716 3.552401 FALSE
## 2 1 26264.65 5035.297 35660 135 1.865898 3.550554 FALSE
## 3 2 26208.05 4996.966 35330 134 1.825498 3.548694 FALSE
## 4 3 26152.76 4960.995 35250 66 1.833752 3.546821 FALSE
## 5 4 26097.29 4924.595 35131 43 1.834406 3.544935 FALSE
Conclusion: There are no outliers in the data table hourly_merged_V1.
# Count daily active users, total steps, and total calories for each of the 60 days
dailyActivity_sum1 <- dailyActivity %>%
group_by(ActivityDate) %>%
summarise(Total_Ids = n_distinct(Id),
Total_Steps = sum(TotalSteps),
Total_Calories = sum(Calories),
.groups = 'drop') %>%
select(ActivityDate, Total_Ids, Total_Steps,Total_Calories)
Insight 1: There was a surge in the number of daily users and in daily activity from the beginning of April.
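A minimal ggplot2 sketch of that daily trend (the published chart may use highcharter or plotly instead):
# Plot daily active users over the 60-day window
ggplot(dailyActivity_sum1, aes(x = ActivityDate, y = Total_Ids)) +
geom_line() +
labs(x = "Date", y = "Daily active users")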
# Total Ids and Total Steps
cor.test(dailyActivity_sum1$Total_Steps, dailyActivity_sum1$Total_Ids, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dailyActivity_sum1$Total_Steps and dailyActivity_sum1$Total_Ids
## t = 47.74, df = 59, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9788453 0.9923915
## sample estimates:
## cor
## 0.9873024
Insight 2: The correlation test (cor = 0.9873) indicates a very strong positive linear relationship between the surge in the number of users and the number of steps taken. See the chart below:
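A minimal sketch of that chart, plotting daily users against total daily steps with a linear trend line:
# Scatter of daily active users vs. total daily steps with a fitted line
ggplot(dailyActivity_sum1, aes(x = Total_Ids, y = Total_Steps)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Daily active users", y = "Total daily steps")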
However, the reasons behind this sudden increase require additional data for clarification. Possible explanations could include promotions, sales seasons, product launches, missing data, and other factors.
# Total calories and total steps
cor.test(dailyActivity_sum1$Total_Steps, dailyActivity_sum1$Total_Calories, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: dailyActivity_sum1$Total_Steps and dailyActivity_sum1$Total_Calories
## t = 61.699, df = 59, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9872159 0.9954145
## sample estimates:
## cor
## 0.9923396
Insight 3: The results also show a strong positive correlation (cor = 0.9923) between the number of calories burned and total steps. See the chart below:
# Create dailyActivity_average table
dailyActivity_average <- dailyActivity %>%
group_by(Id) %>%
summarise(Average_Steps = sum(TotalSteps)/60,
Average_Distance = sum(TotalDistance)/60)
Insight 4: The vast majority of users are not active enough in terms of average steps per day. Only 5 users reach the NIH-recommended level of at least 7,500 steps, or approximately 3.4 miles, a day.
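A quick check of that count from the averages table:
# Count users whose 60-day average meets the 7,500-step guideline
dailyActivity_average %>%
filter(Average_Steps >= 7500) %>%
nrow()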
dailyActivity_sum2 <- dailyActivity%>%
summarize(
Very = sum(VeryActiveMinutes),
Fairly = sum(FairlyActiveMinutes),
Lightly = sum(LightlyActiveMinutes),
Sedentary = sum(SedentaryMinutes)
) %>%
pivot_longer(cols = everything(),names_to = "Level",values_to = "Minutes")
print(dailyActivity_sum2)
## # A tibble: 4 × 2
## Level Minutes
## <chr> <dbl>
## 1 Very 27195
## 2 Fairly 18633
## 3 Lightly 256181
## 4 Sedentary 1361157
Insight 5: Fitbit users typically spend most of their time on sedentary and lightly active activities. This suggests that they mainly use the products as casual accessories with lower activity levels, rather than for physically demanding tasks or training purposes.
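Expressing each level as a share of all recorded minutes makes this skew explicit; a one-line sketch:
# Add each activity level's percentage of total recorded minutes
dailyActivity_sum2 %>%
mutate(Share = round(100 * Minutes / sum(Minutes), 1))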
# Create dailyActivity_V2 by grouping by Id and summing the related columns
dailyActivity_V2 <- dailyActivity %>%
group_by(Id) %>%
summarise(across(c(TotalSteps, VeryActiveMinutes:Calories), \(x) sum(x, na.rm = TRUE)))
Check for outliers in dailyActivity_V2:
rosnerTest(dailyActivity_V2$VeryActiveMinutes, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: dailyActivity_V2$VeryActiveMinutes
##
## Sample Size: 35
##
## Test Statistics: R.1 = 2.810920
## R.2 = 3.239663
## R.3 = 3.053991
## R.4 = 3.032100
## R.5 = 2.681619
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 4
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 777.0000 984.7311 3545 30 2.810920 2.978183 TRUE
## 2 1 695.5882 871.8227 3520 22 3.239663 2.965315 TRUE
## 3 2 610.0000 725.9354 2827 35 3.053991 2.951949 TRUE
## 4 3 540.7188 616.8271 2411 32 3.032100 2.938048 TRUE
## 5 4 480.3871 522.3012 1881 1 2.681619 2.923571 FALSE
rosnerTest(dailyActivity_V2$FairlyActiveMinutes, k = 5)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: dailyActivity_V2$FairlyActiveMinutes
##
## Sample Size: 35
##
## Test Statistics: R.1 = 3.677436
## R.2 = 2.025512
## R.3 = 1.819187
## R.4 = 1.922798
## R.5 = 1.852099
##
## Test Statistic Parameter: k = 5
##
## Alternative Hypothesis: Up to 5 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 1
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 532.3714 457.0110 2213 13 3.677436 2.978183 TRUE
## 2 1 482.9412 356.4820 1205 22 2.025512 2.965315 FALSE
## 3 2 461.0606 338.0299 1076 3 1.819187 2.951949 FALSE
## 4 3 441.8438 324.6083 1066 20 1.922798 2.938048 FALSE
## 5 4 421.7097 308.9956 994 29 1.852099 2.923571 FALSE
Result: Outliers were detected in some columns. The results from this data table should be interpreted with caution due to the presence of outliers. These outliers could be due to natural variation in the data, recording errors, equipment malfunctions, or sample size issues. Further clarification is needed.
Plot the original data in the table dailyActivity_V2.
Insight 6: There is a clear positive correlation between active minutes and total steps: as active minutes increase, total steps also increase. The trend lines indicate that the increase in total steps is more pronounced for very active or fairly active users than for lightly active ones.
Insight 7: All activity levels show trend lines with a positive correlation between active minutes and total calories burned. The ‘Very Active’ trend line has the steepest slope, followed by ‘Fairly Active’, indicating a more rapid increase in calories burned with increased active minutes compared to ‘Lightly Active’. The slopes of ‘Very Active’ and ‘Fairly Active’ are close to each other.
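A minimal sketch of the trend-line comparison behind Insights 6 and 7 (the published charts may differ): reshape the per-user totals to long form and fit one linear trend per activity level.
# Per-user active minutes vs. total calories, one fitted line per level
dailyActivity_V2 %>%
pivot_longer(VeryActiveMinutes:LightlyActiveMinutes,
names_to = "Level", values_to = "Minutes") %>%
ggplot(aes(x = Minutes, y = Calories, colour = Level)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)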
The data below will be visualised in the next phase to reveal clearer trends and patterns.
Note: There are no outliers in this data set.
# Create a datatable with a search bar and no header wrap
data_hourly_merged_V1 <- datatable(hourly_merged_V1, options = list(
pageLength = 10,
scrollX = TRUE,
scrollY = "400px",
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'white-space': 'nowrap'});",
"}"
)
), rownames = FALSE)
# Print the datatable
data_hourly_merged_V1
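One sketch of how these weekday-by-hour totals can be visualised to surface the peak periods referenced below as Insight 8:
# Heat map of total intensity by weekday and hour of day
ggplot(hourly_merged_V1, aes(x = hour, y = weekday, fill = TotalIntensity)) +
geom_tile() +
scale_x_continuous(breaks = 0:23) +
labs(x = "Hour of day", y = NULL, fill = "Total intensity")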
1. What are the new growth opportunities for our products?
The analysis reveals two growth opportunities:
Targeting casual users: As Insight 5 shows, most Fitbit users engage in sedentary and lightly active activities, suggesting they use the products as casual accessories rather than for training. This presents an opportunity to expand marketing and product features toward casual users who may be seeking to improve their health gradually, rather than focusing solely on athletes or highly active users.
Increase engagement during peak times: Insights 1 and 8 highlight a surge in user activity starting in April, as well as peak hours during the day and week. These patterns could be leveraged to create time-specific promotions or challenges to engage users during periods of high activity.
2. What modifications should be made to tailor the products more to customers?
Several product modifications can improve user engagement:
Encouragement of more active behavior: Since Insight 4 shows that most users are not meeting the NIH-recommended daily step count, the product could include motivational reminders to increase activity. Insight 7 suggests that adding features to guide users toward “Very Active” or “Fairly Active” activities, instead of the “Very Active” level only, could significantly impact calories burned and overall fitness.
Design improvements: To cater to the majority of users who are using the product casually (Insight 5), the product design could be more aesthetically tailored for casual, everyday wear instead of focusing solely on athletic features.
3. What insights are useful for developing effective future marketing strategies?
Time-specific campaigns: Insight 8 shows the busiest periods of user activity. Marketing campaigns and promotions can be launched during peak times (e.g., 10:00 to 14:00 and 16:00 to 19:00) to maximize visibility and engagement. Additionally, campaigns can be adjusted to target weekdays or weekends depending on when users are most active.
Target casual users: Given that the majority of users are not highly active (Insight 5), marketing strategies should focus on promoting the benefits of moderate physical activity and how Bellabeat products can help users gradually improve their health, especially for those engaging in sedentary or lightly active lifestyles.
Regarding the data, it is essential to address some gaps in the analysis, such as the lack of demographic information. Incorporating these factors could provide deeper insights. Moreover, internal data from Bellabeat’s current users could be used to cross-check and validate the findings of this analysis.
The stakeholders, including the executive and marketing teams, will receive the analysis via the visualizations prepared in this R Markdown document. It will be uploaded to https://rpubs.com/HenryN9 for easy access and further reference. A short meeting would be the most effective way to communicate these findings and discuss their applications.
This analysis reveals two key findings:
● The current users in the data set exhibit relatively low levels of activity, suggesting that they tend to use the products more as casual accessories rather than for training purposes.
● The busiest times of activity occur at specific periods within a day or week.
Upload and Organize Data in Posit
The dataset was uploaded to Posit under the Project folder, and the directory path was set.
The folders “Fitabase Data 3.12.16-4.11.16” and “Fitabase Data 4.12.16-5.12.16” were moved to the Project folder and renamed “March_data” and “April_data” respectively.
The folders “mturkfitbit_export_3.12.16-4.11.16” and “mturkfitbit_export_4.12.16-5.12.16” were deleted.
Rename Data Tables
When reading each file, the “merged” suffix was replaced with “Apr” or “Mar” in the corresponding table name, for files with start dates in April or March respectively.
For example, “dailyActivity_merged.csv” in the folder “April_data” was read into the table “dailyActivity_Apr”. All other parts of the original names were left unchanged.
# Read the files in the folder April_data in ascending order
dailyActivity_Apr<-read_csv("April_data/dailyActivity_merged.csv")
dailyCalories_Apr<-read_csv("April_data/dailyCalories_merged.csv")
dailyIntensities_Apr<-read_csv("April_data/dailyIntensities_merged.csv")
dailySteps_Apr<-read_csv("April_data/dailySteps_merged.csv")
heartrate_Apr<-read_csv("April_data/heartrate_merged.csv")
hourlyCalories_Apr<-read_csv("April_data/hourlyCalories_merged.csv")
hourlyIntensities_Apr<-read_csv("April_data/hourlyIntensities_merged.csv")
hourlySteps_Apr<-read_csv("April_data/hourlySteps_merged.csv")
minuteCaloriesNarrow_Apr<-read_csv("April_data/minuteCaloriesNarrow_merged.csv")
minuteCaloriesWide_Apr<-read_csv("April_data/minuteCaloriesWide_merged.csv")
minuteIntensitiesNarrow_Apr<-read_csv("April_data/minuteIntensitiesNarrow_merged.csv")
minuteIntensitiesWide_Apr<-read_csv("April_data/minuteIntensitiesWide_merged.csv")
minuteMETsNarrow_Apr<-read_csv("April_data/minuteMETsNarrow_merged.csv")
minuteSleep_Apr<-read_csv("April_data/minuteSleep_merged.csv")
minuteStepsNarrow_Apr<-read_csv("April_data/minuteStepsNarrow_merged.csv")
minuteStepsWide_Apr<-read_csv("April_data/minuteStepsWide_merged.csv")
sleepDay_Apr<-read_csv("April_data/sleepDay_merged.csv")
weightLogInfo_Apr<-read_csv("April_data/weightLogInfo_merged.csv")
# Read the files in the folder March_data in ascending order
dailyActivity_Mar<-read_csv("March_data/dailyActivity_merged.csv")
heartrate_Mar<-read_csv("March_data/heartrate_merged.csv")
hourlyCalories_Mar<-read_csv("March_data/hourlyCalories_merged.csv")
hourlyIntensities_Mar<-read_csv("March_data/hourlyIntensities_merged.csv")
hourlySteps_Mar<-read_csv("March_data/hourlySteps_merged.csv")
minuteCaloriesNarrow_Mar<-read_csv("March_data/minuteCaloriesNarrow_merged.csv")
minuteIntensitiesNarrow_Mar<-read_csv("March_data/minuteIntensitiesNarrow_merged.csv")
minuteMETsNarrow_Mar<-read_csv("March_data/minuteMETsNarrow_merged.csv")
minuteSleep_Mar<-read_csv("March_data/minuteSleep_merged.csv")
minuteStepsNarrow_Mar<-read_csv("March_data/minuteStepsNarrow_merged.csv")
weightLogInfo_Mar<-read_csv("March_data/weightLogInfo_merged.csv")
# Tables in folder April_data
n_distinct(dailyActivity_Apr$Id)
n_distinct(dailyCalories_Apr$Id)
n_distinct(dailyIntensities_Apr$Id)
n_distinct(dailySteps_Apr$Id)
n_distinct(heartrate_Apr$Id)
n_distinct(hourlyCalories_Apr$Id)
n_distinct(hourlyIntensities_Apr$Id)
n_distinct(hourlySteps_Apr$Id)
n_distinct(minuteCaloriesNarrow_Apr$Id)
n_distinct(minuteCaloriesWide_Apr$Id)
n_distinct(minuteIntensitiesNarrow_Apr$Id)
n_distinct(minuteIntensitiesWide_Apr$Id)
n_distinct(minuteMETsNarrow_Apr$Id)
n_distinct(minuteSleep_Apr$Id)
n_distinct(minuteStepsNarrow_Apr$Id)
n_distinct(minuteStepsWide_Apr$Id)
n_distinct(sleepDay_Apr$Id)
n_distinct(weightLogInfo_Apr$Id)
# Tables in folder March_data
n_distinct(dailyActivity_Mar$Id)
n_distinct(heartrate_Mar$Id)
n_distinct(hourlyCalories_Mar$Id)
n_distinct(hourlyIntensities_Mar$Id)
n_distinct(hourlySteps_Mar$Id)
n_distinct(minuteCaloriesNarrow_Mar$Id)
n_distinct(minuteIntensitiesNarrow_Mar$Id)
n_distinct(minuteMETsNarrow_Mar$Id)
n_distinct(minuteSleep_Mar$Id)
n_distinct(minuteStepsNarrow_Mar$Id)
n_distinct(weightLogInfo_Mar$Id)
First, extract the column names of all the files.
# From folder April_data
colnames(minuteStepsNarrow_Apr)
colnames(minuteStepsWide_Apr)
colnames(sleepDay_Apr)
colnames(weightLogInfo_Apr)
colnames(minuteCaloriesWide_Apr)
colnames(minuteIntensitiesNarrow_Apr)
colnames(minuteIntensitiesWide_Apr)
colnames(minuteMETsNarrow_Apr)
colnames(minuteSleep_Apr)
colnames(dailyActivity_Apr)
colnames(dailyCalories_Apr)
colnames(dailyIntensities_Apr)
colnames(dailySteps_Apr)
colnames(heartrate_Apr)
colnames(hourlyCalories_Apr)
colnames(hourlyIntensities_Apr)
colnames(hourlySteps_Apr)
colnames(minuteCaloriesNarrow_Apr)
# From folder March_data
colnames(dailyActivity_Mar)
colnames(heartrate_Mar)
colnames(hourlyCalories_Mar)
colnames(hourlyIntensities_Mar)
colnames(hourlySteps_Mar)
colnames(minuteCaloriesNarrow_Mar)
colnames(minuteIntensitiesNarrow_Mar)
colnames(minuteMETsNarrow_Mar)
colnames(minuteSleep_Mar)
colnames(minuteStepsNarrow_Mar)
colnames(weightLogInfo_Mar)
Then aggregate all the column names into an Excel file. The results are as below:
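One programmatic way to build that aggregation is sketched here (exported as a CSV for brevity; writexl::write_xlsx would produce an Excel file, and only two tables are listed):
# Collect the column names of each loaded table into one long data frame
tables <- list(dailyActivity_Apr = dailyActivity_Apr,
hourlyCalories_Apr = hourlyCalories_Apr)
column_overview <- purrr::imap_dfr(tables, ~ tibble(table = .y, column = names(.x)))
write_csv(column_overview, "column_overview.csv")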
Compare files with the same name across the two folders
dailyActivity_Apr and dailyActivity_Mar
# dailyActivity_Apr
# Change the time stamp from 'character' to 'Date'
dailyActivity_Apr$ActivityDate <- as.Date(dailyActivity_Apr$ActivityDate, format='%m/%d/%Y')
# Find the start date and cutoff date
start_time <- min(dailyActivity_Apr$ActivityDate)
cutoff_time <- max(dailyActivity_Apr$ActivityDate)
# dailyActivity_Mar
dailyActivity_Mar$ActivityDate <- as.Date(dailyActivity_Mar$ActivityDate, format='%m/%d/%Y')
# Find the start date and cutoff date
min(dailyActivity_Mar$ActivityDate)
max(dailyActivity_Mar$ActivityDate)
There is a data overlap on 4/12/2016, and the data in dailyActivity_Mar is incomplete for that date. Therefore, the data from the dailyActivity_Apr table will be used, as its records start from 2016-04-12 00:00:00 UTC.
hourlyCalories_Apr and hourlyCalories_Mar
# Change the time stamp from 'character' to 'datetime'. Note the format of the ActivityHour column in the original files
hourlyCalories_Apr$ActivityHour = as_datetime(hourlyCalories_Apr$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find the start datetime and the end datetime
min(hourlyCalories_Apr$ActivityHour)
max(hourlyCalories_Apr$ActivityHour)
# Change the time stamp from 'character' to 'datetime'
hourlyCalories_Mar$ActivityHour = as_datetime(hourlyCalories_Mar$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find the start datetime and the end datetime
min(hourlyCalories_Mar$ActivityHour)
max(hourlyCalories_Mar$ActivityHour)
The cutoff time for the hourlyCalories_Mar table is 04/12/2016 10:00:00 UTC, while for the hourlyCalories_Apr table, it is 05/12/2016 15:00:00 UTC.
The data in the former table is incomplete for 04/12/2016, and the latter table has partial data for 05/12/2016.
hourlyIntensities_Apr and hourlyIntensities_Mar
# Change the time stamp
hourlyIntensities_Apr$ActivityHour = as_datetime(hourlyIntensities_Apr$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find the start datetime and the end datetime
min(hourlyIntensities_Apr$ActivityHour)
max(hourlyIntensities_Apr$ActivityHour)
# Change the time stamp
hourlyIntensities_Mar$ActivityHour = as_datetime(hourlyIntensities_Mar$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find the start datetime and the end datetime
min(hourlyIntensities_Mar$ActivityHour)
max(hourlyIntensities_Mar$ActivityHour)
hourlySteps_Apr and hourlySteps_Mar
# Change the time stamp
hourlySteps_Apr$ActivityHour = as_datetime(hourlySteps_Apr$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find start time and cutoff time
min(hourlySteps_Apr$ActivityHour)
max(hourlySteps_Apr$ActivityHour)
# Change the time stamp
hourlySteps_Mar$ActivityHour = as_datetime(hourlySteps_Mar$ActivityHour, tz = "UTC", format='%m/%d/%Y %I:%M:%S %p')
# Find start time and cutoff time
min(hourlySteps_Mar$ActivityHour)
max(hourlySteps_Mar$ActivityHour)
minuteCaloriesWide and minuteCaloriesNarrow
# Change data type of column ActivityHour
minuteCaloriesWide_Apr$ActivityHour <- mdy_hms(minuteCaloriesWide_Apr$ActivityHour)
# Find start time and cutoff time
min(minuteCaloriesWide_Apr$ActivityHour)
max(minuteCaloriesWide_Apr$ActivityHour)
# Change data type of column ActivityMinute
minuteCaloriesNarrow_Apr$ActivityMinute <- mdy_hms(minuteCaloriesNarrow_Apr$ActivityMinute)
# Find start time and cutoff time
min(minuteCaloriesNarrow_Apr$ActivityMinute)
max(minuteCaloriesNarrow_Apr$ActivityMinute)
Due to the differences in starting and cutoff times of the files, data comparison between “2016-04-13 UTC” and “2016-05-11 UTC” was conducted to inspect data consistency. Activities within this period were recorded in both tables.
# Pivot table minuteCaloriesWide_Apr to longer format then add column date and hour
df_wide_new <- minuteCaloriesWide_Apr %>%
pivot_longer(
cols = starts_with("Calories"),
names_to = "Minute",
values_to = "Calories") %>%
mutate(Minute = as.numeric(sub("Calories", "", Minute))) %>%
mutate(date = date(ActivityHour), hour = hour(ActivityHour)) %>%
mutate(key = paste(Id, date, hour, Minute, sep = "-"))
# Transform minuteCaloriesNarrow_Apr by adding column date, hour and minute
df_narrow_new <- minuteCaloriesNarrow_Apr %>%
mutate(date = date(ActivityMinute), hour = hour(ActivityMinute), minute = minute(ActivityMinute)) %>%
mutate(key = paste(Id, date, hour, minute, sep = "-"))
# Filter and pull data from df_wide_new and df_narrow_new
df_wide_check <- df_wide_new %>%
filter(date <= ymd("2016-05-11") & date >= ymd("2016-04-13")) %>%
pull(key)
df_narrow_check <- df_narrow_new %>%
filter(date <= ymd("2016-05-11") & date >= ymd("2016-04-13")) %>%
pull(key)
# Single out unmatched keys (the pulled keys are plain vectors, so use setdiff)
df_unmatched <- setdiff(df_wide_check, df_narrow_check)
length(df_unmatched)
Result: the unmatched rows all fall on 2016-05-11, so the cutoff date is set to 2016-05-10 for both data frames.
# Check whether the Calories values of the two tables match over the new period
df_wide_check_new <- df_wide_new %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
pull(Calories)
df_narrow_check_new <- df_narrow_new %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
pull(Calories)
# Check if the data vectors from two files match
if (isTRUE(all.equal(df_wide_check_new, df_narrow_check_new))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Results: the data is consistent within the period.
minuteIntensitiesNarrow and minuteIntensitiesWide
# Pivot table minuteIntensitiesWide_Apr to longer format and transform minuteIntensitiesNarrow_Apr
df_wide_check <- minuteIntensitiesWide_Apr %>%
mutate(ActivityHour = mdy_hms(ActivityHour)) %>%
pivot_longer(
cols = starts_with("Intensity"),
names_to = "Minute",
values_to = "Intensities") %>%
mutate(Minute = as.numeric(sub("Intensity", "", Minute))) %>%
mutate(date = date(ActivityHour), hour = hour(ActivityHour)) %>%
mutate(key = paste(Id, date, hour, Minute, sep = "-")) %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
pull(Intensity)
df_narrow_check <- minuteIntensitiesNarrow_Apr %>%
mutate(ActivityMinute = mdy_hms(ActivityMinute)) %>%
mutate(date = date(ActivityMinute), hour = hour(ActivityMinute), minute = minute(ActivityMinute)) %>%
mutate(key = paste(Id, date, hour, minute, sep = "-")) %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
pull(Intensity)
# Check if the data vectors from two files match
if (isTRUE(all.equal(df_wide_check, df_narrow_check))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Results: the data is consistent within the period.
minuteStepsNarrow and minuteStepsWide
# Pivot table minuteStepsWide_Apr to longer format and transform minuteStepsNarrow_Apr
df_wide_check <- minuteStepsWide_Apr %>%
mutate(ActivityHour = mdy_hms(ActivityHour)) %>%
pivot_longer(
cols = starts_with("Steps"),
names_to = "Minute",
values_to = "Steps") %>%
mutate(Minute = as.numeric(sub("Steps", "", Minute))) %>%
mutate(date = date(ActivityHour), hour = hour(ActivityHour)) %>%
mutate(key = paste(Id, date, hour, Minute, sep = "-")) %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
select(Steps)
df_narrow_check <- minuteStepsNarrow_Apr %>%
mutate(ActivityMinute = mdy_hms(ActivityMinute)) %>%
mutate(date = date(ActivityMinute), hour = hour(ActivityMinute), minute = minute(ActivityMinute)) %>%
mutate(key = paste(Id, date, hour, minute, sep = "-")) %>%
filter(date <= ymd("2016-05-10") & date >= ymd("2016-04-13")) %>%
select(Steps)
# Check if the data vectors from two files match
if (isTRUE(all.equal(df_wide_check, df_narrow_check))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Results: the data is consistent within the period.
minuteStepsNarrow and hourlySteps
minuteStepsNarrow_Apr$ActivityMinute <- mdy_hms(minuteStepsNarrow_Apr$ActivityMinute)
min(minuteStepsNarrow_Apr$ActivityMinute)
max(minuteStepsNarrow_Apr$ActivityMinute)
minuteStepsNarrow_Apr_Test <-minuteStepsNarrow_Apr %>%
mutate(Activity_Date = date(ActivityMinute), Activity_Hour = hour(ActivityMinute)) %>%
filter(Activity_Date != "2016-05-12") %>%
group_by(Id,Activity_Date,Activity_Hour) %>%
summarise(StepTotal = sum(Steps), .groups = 'drop') %>%
pull(StepTotal)
hourlySteps_Apr$ActivityHour <- mdy_hms(hourlySteps_Apr$ActivityHour)
min(hourlySteps_Apr$ActivityHour)
max(hourlySteps_Apr$ActivityHour)
hourlySteps_Apr_Test <- hourlySteps_Apr %>%
mutate(Activity_Date = date(ActivityHour)) %>%
filter(Activity_Date != "2016-05-12") %>%
pull(StepTotal)
# Check if the StepTotal vectors of the two data frames are the same
if (isTRUE(all.equal(minuteStepsNarrow_Apr_Test, hourlySteps_Apr_Test))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Result: the data is consistent within the period.
hourlyIntensities and minuteIntensitiesNarrow
hourlyIntensities_Apr$ActivityHour <- mdy_hms(hourlyIntensities_Apr$ActivityHour)
min(hourlyIntensities_Apr$ActivityHour)
max(hourlyIntensities_Apr$ActivityHour)
hourlyIntensities_Apr_Test <- hourlyIntensities_Apr %>%
filter(date(ActivityHour) != "2016-05-12") %>%
select(TotalIntensity)
minuteIntensitiesNarrow_Apr$ActivityMinute <- mdy_hms(minuteIntensitiesNarrow_Apr$ActivityMinute)
min(minuteIntensitiesNarrow_Apr$ActivityMinute)
max(minuteIntensitiesNarrow_Apr$ActivityMinute)
minuteIntensitiesNarrow_Apr_Test <-minuteIntensitiesNarrow_Apr %>%
mutate(Activity_Date = date(ActivityMinute), Activity_Hour = hour(ActivityMinute)) %>%
filter(Activity_Date != "2016-05-12") %>%
group_by(Id,Activity_Date,Activity_Hour) %>%
summarise(TotalIntensity = sum(Intensity), .groups = 'drop') %>%
select(TotalIntensity)
# Check if the TotalIntensities vectors of the two data frames are the same
if (isTRUE(all.equal(hourlyIntensities_Apr_Test, minuteIntensitiesNarrow_Apr_Test))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Result: the data is consistent within the period.
hourlyCalories and minuteCaloriesNarrow
#hourlyCalories_Apr
hourlyCalories_Apr$ActivityHour <- mdy_hms(hourlyCalories_Apr$ActivityHour)
min(hourlyCalories_Apr$ActivityHour)
max(hourlyCalories_Apr$ActivityHour)
hourlyCalories_Apr_Test <- hourlyCalories_Apr %>%
filter(date(ActivityHour) != "2016-05-12") %>%
select(Calories)
# minuteCaloriesNarrow_Apr
minuteCaloriesNarrow_Apr$ActivityMinute <- mdy_hms(minuteCaloriesNarrow_Apr$ActivityMinute)
min(minuteCaloriesNarrow_Apr$ActivityMinute)
max(minuteCaloriesNarrow_Apr$ActivityMinute)
minuteCaloriesNarrow_Apr_Test <-minuteCaloriesNarrow_Apr %>%
mutate(Activity_Date = date(ActivityMinute), Activity_Hour = hour(ActivityMinute)) %>%
filter(Activity_Date != "2016-05-12") %>%
group_by(Id,Activity_Date,Activity_Hour) %>%
summarise(Calories = sum(Calories), .groups = 'drop') %>%
select(Calories) %>%
mutate(Calories = round(Calories,0))
# Check if the Calories vectors of the two data frames are the same
if (isTRUE(all.equal(hourlyCalories_Apr_Test, minuteCaloriesNarrow_Apr_Test))) {print("The vectors are equal.")} else {print("The vectors are not equal.")}
Result: the data is consistent within the period.
Table hourlyCalories_Apr
# Find the location of missing values
which(is.na(hourlyCalories_Apr), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlyCalories_Apr))
# Count the number of duplicate rows
sum(duplicated(hourlyCalories_Apr))
# Check the structure of the data frame
str(hourlyCalories_Apr)
# Trim whitespace from column names
colnames(hourlyCalories_Apr) <- trimws(colnames(hourlyCalories_Apr))
# Remove empty rows and columns
hourlyCalories_Apr <- remove_empty(hourlyCalories_Apr, which = c("rows", "cols"))
Table hourlyIntensities_Apr
# Find the location of missing values
which(is.na(hourlyIntensities_Apr), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlyIntensities_Apr))
# Count the number of duplicate rows
sum(duplicated(hourlyIntensities_Apr))
# Check the structure of the data frame
str(hourlyIntensities_Apr)
# Trim whitespace from column names
colnames(hourlyIntensities_Apr) <- trimws(colnames(hourlyIntensities_Apr))
# Remove empty rows and columns
hourlyIntensities_Apr <- remove_empty(hourlyIntensities_Apr, which = c("rows", "cols"))
Table hourlySteps_Apr
# Find the location of missing values
which(is.na(hourlySteps_Apr), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlySteps_Apr))
# Count the number of duplicate rows
sum(duplicated(hourlySteps_Apr))
# Check the structure of the data frame
str(hourlySteps_Apr)
# Trim whitespace from column names
colnames(hourlySteps_Apr) <- trimws(colnames(hourlySteps_Apr))
# Remove empty rows and columns
hourlySteps_Apr <- remove_empty(hourlySteps_Apr, which = c("rows", "cols"))
Table dailyActivity_Mar
# Find the location of missing values
which(is.na(dailyActivity_Mar), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(dailyActivity_Mar))
# Count the number of duplicate rows
sum(duplicated(dailyActivity_Mar))
# Check the structure of the data frame
str(dailyActivity_Mar)
# Trim whitespace from column names
colnames(dailyActivity_Mar) <- trimws(colnames(dailyActivity_Mar))
# Remove empty rows and columns
dailyActivity_Mar <- remove_empty(dailyActivity_Mar, which = c("rows", "cols"))
Table hourlyCalories_Mar
# Find the location of missing values
which(is.na(hourlyCalories_Mar), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlyCalories_Mar))
# Count the number of duplicate rows
sum(duplicated(hourlyCalories_Mar))
# Check the structure of the data frame
str(hourlyCalories_Mar)
# Trim whitespace from column names
colnames(hourlyCalories_Mar) <- trimws(colnames(hourlyCalories_Mar))
# Remove empty rows and columns
hourlyCalories_Mar <- remove_empty(hourlyCalories_Mar, which = c("rows", "cols"))
Table hourlyIntensities_Mar
# Find the location of missing values
which(is.na(hourlyIntensities_Mar), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlyIntensities_Mar))
# Count the number of duplicate rows
sum(duplicated(hourlyIntensities_Mar))
# Check the structure of the data frame
str(hourlyIntensities_Mar)
# Trim whitespace from column names
colnames(hourlyIntensities_Mar) <- trimws(colnames(hourlyIntensities_Mar))
# Remove empty rows and columns
hourlyIntensities_Mar <- remove_empty(hourlyIntensities_Mar, which = c("rows", "cols"))
Table hourlySteps_Mar
# Find the location of missing values
which(is.na(hourlySteps_Mar), arr.ind = TRUE)
# Count the total number of missing values
sum(is.na(hourlySteps_Mar))
# Count the number of duplicate rows
sum(duplicated(hourlySteps_Mar))
# Check the structure of the data frame
str(hourlySteps_Mar)
# Trim whitespace from column names
colnames(hourlySteps_Mar) <- trimws(colnames(hourlySteps_Mar))
# Remove empty rows and columns
hourlySteps_Mar <- remove_empty(hourlySteps_Mar, which = c("rows", "cols"))
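The seven cleaning blocks above repeat the same steps, so a small helper would do the same work once. clean_table() is a name introduced here for illustration only.
# Run the standard checks on any table, then return the cleaned table
clean_table <- function(df) {
print(which(is.na(df), arr.ind = TRUE)) # locate missing values
print(sum(is.na(df))) # total missing values
print(sum(duplicated(df))) # duplicate rows
str(df) # inspect the structure
colnames(df) <- trimws(colnames(df)) # trim whitespace from column names
remove_empty(df, which = c("rows", "cols")) # drop empty rows and columns
}
# Example: hourlySteps_Mar <- clean_table(hourlySteps_Mar)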