DATA 606 Data Final Project - Self Control and its Impacts on Habit Execution
library(tidyverse)
library(infer)
library(ggplot2)
library(cowplot)
library(rmarkdown)
knitr::opts_chunk$set(include = TRUE, message = FALSE, echo = TRUE)daily <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA606/main/Project/3.RawData/APPData/BehaviorQuestionsDaily.csv")
#habitReport <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA606/main/Project/3.RawData/APPData/QuestionsHabitReport.csv")
selfControl <- read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA606/main/Project/3.RawData/APPData/QuestionsSelfControlScale.csv")Part 1 - Introduction & Abstract
Please note the differences between the above proposed question and the study’s. The question this analysis is focused on is the impact of an individual’s level of self control, as measured by the Brief Self Control Scale, and their ability to execute the desired action. The study’s focus was on the development of habits over time, where as this analysis is focused on a point of time. Since the following work is an unrelated venture, the phrashing “this analysis” or “this work” will be used to refer to any work contained within this notebook and resulting presentation to further make clear the differences.
Foreword on Criticisms and Limitations
Part 2 - Data
Summary
1. Segment the data into three factors of time
2. Establish an individual’s over self control capacity scores across the entire duration of the study to determine quantiles and thus classify if a user is high, mid, or low in self control capacity.
3. Establish an individual’s self control capacity within the time segment
4. Establish average scores of the daily habit questions procured from the phone app will be used as measures of one’s ability to execute a habit.
5. Classify the average scores of daily habit execution by the participant’s self control capacity level
6. Attempt inference at the relationship between habit execution and self control capacity
Collection
Metholodogy of Metrics
Habit execution was measured with by the daily phone app questions in 3 questions. These questions are translated and while a part of their meaning may be lost, their rough effects for the study are maintained.
1. Did the context you chose for your habit occur today? (ie. Did you have breakfast today?)
2. Did you perform your chosen habit today? (ie. Did you eat fruit toay?)
3. Did you perform your chosen habit in the chosen context today? (ie. Did you eat fruit at breakfast today?)
The series of questioning for habit execution expanded into 3 more questions if habit execution failed and were meant to measure attribution of failure. This analysis will focus on the first 3 as sources for measurement of habit execution.
Data Source
Part 3 - Exploratory data analysis
This first code chunk is adding variables we require such as the time segment and a transformation into a longer format that was easier to process.
### Daily Habit Execution Segment Variable Creation
# We segmented the data into thirds, determined by number of day (Max day was 147, chose to split on 48s)
# We do this to allow for drawing relationships over time
#range(daily$DAY)
daily <- daily %>%
group_by(PPN) %>%
mutate(SEGMENT = ifelse(DAY < 48, '1', ifelse(DAY >= 48 & DAY < 96, '2', '3')))
dailyAVG <- daily %>%
mutate(avg = mean(c(Q1,Q2,Q3)))
# Creating a new DF per segment
dailySeg1 <- daily %>%
filter(SEGMENT == '1')
dailySeg2 <- daily %>%
filter(SEGMENT == '2')
dailySeg3 <- daily %>%
filter(SEGMENT == '3')
## Longer Transformation
dailyLonger <- daily %>% select(-Q4,-Q5,-Q6) %>%
pivot_longer(cols = starts_with("Q"),
names_to = "Question",
values_to = "Response")
dailyLongerSeg1 <- dailyLonger %>%
filter(SEGMENT == '1')
dailyLongerSeg2 <- dailyLonger %>%
filter(SEGMENT == '2')
dailyLongerSeg3 <- dailyLonger %>%
filter(SEGMENT == '3')
paged_table(daily)paged_table(dailyLongerSeg1)## -------------------------------------- Self Control
# selfControl <- selfControl %>% #Calculate the row-wise averages of each self control report
# rowwise() %>%
# arrange(PPN) %>%
# mutate(avg = mean(c(Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13)))
### Self Control Segment Variable Creation
#range(selfControl$DAY)
#We are segmenting the data into thirds to remove the time variable, determined by number of day (Max day was 137, chose to split on 48s)
selfControl <- selfControl %>%
group_by(PPN) %>%
mutate(SEGMENT = ifelse(DAY < 48, '1', ifelse(DAY >= 48 & DAY < 96, '2', '3')))
#Below we have our table of participants' average self control scores across the entire study, we will utilize this to determine our breakpoints of low, mid, high classification
#Below code will take the average of column
selfControlAVG <- selfControl %>%
group_by(PPN) %>%
summarise(pAVG = mean(c(Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13)))
paged_table(selfControl)paged_table(selfControlAVG)Self Control Capacity EDA
From here, the next step is to classify each individual a classifier of being a low, mid, or high self control capacity class based on the above distribution
starting with the distribution of average daily habit scores within each time segment is natural. The higher averages mean the individual performed their habit more often during that segment of time. Segments of time are 1 - 47, 48 - 95, and 96+ days. They are titled Segments 1, 2, and 3 in the dataframes, respectively.
#Create New Table for Each individual's control class at each segment, create a segment average, and a segment control class
#Does it matter that I kept the same IQR from all reports instead of just Segment Averages as opposed to running another IQR on Segment Averages and classifying the segment that way?
summary(selfControlAVG$pAVG) #If >= 1.74 they're high, 1.74 > x <= 1.30 mid, 1.3>x then low## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8077 1.3013 1.4945 1.5359 1.7436 2.5897
#Average the scores within each segment, classify the segment's avg score
selfControlClass <- selfControl %>%
#select(PPN,DATE,avg,SEGMENT,CONTROLCLASS) %>%
group_by(PPN,SEGMENT) %>%
mutate(SEGAVG = mean(c(Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13))) %>%
mutate(SEGCLASS = ifelse(SEGAVG <1.30, 'Low', ifelse(SEGAVG >= 1.30 & SEGAVG < 1.75, 'Mid', 'High'))) %>%
mutate(SEGMENT = factor(SEGMENT, levels = c('1','2','3'))) %>%
mutate(SEGCLASS = factor(SEGCLASS, levels = c('Low','Mid','High')))
ggplot(data = selfControlAVG, aes(x = pAVG)) +
geom_histogram(fill = 'lightblue',color='grey',binwidth = .05) + labs( x = "Avg Score of Self Control Capacity", y = "# of People",) + theme_light() + geom_vline(xintercept = mean(selfControlAVG$pAVG, na.rm = TRUE), lty = 'dashed', color = 'red')paged_table(selfControlClass)Since these questions could be seen as a progression of habit execution, this analysis will compare the ability of each self control capacity class to execute between the same questions, rather than performing an analysis in aggregate of questions (ie. comparing how a high self control participant’s likelihood to answer Q1+Q2+Q3 to a low self control participant). This decision was made mainly in the interest of work scope. Null values are removed that were introduced by individuals that may not have completed a Self Control Capacity report within that segment of time.
#Create the score average for each question within a segment to represent that participants' ability to execute their habit
dailySeg1Avg <- dailyLongerSeg1 %>%
group_by(PPN, Question) %>%
summarize(Avg_Response = mean(Response, na.rm = TRUE))
dailySeg2Avg <- dailyLongerSeg2 %>%
group_by(PPN, Question) %>%
summarize(Avg_Response = mean(Response, na.rm = TRUE))
dailySeg3Avg <- dailyLongerSeg3 %>%
group_by(PPN, Question) %>%
summarize(Avg_Response = mean(Response, na.rm = TRUE))
#Joining the average scores to the classifiers of Self Control Capacity of that person during that segment
dailySeg1AvgClass <- left_join(dailySeg1Avg,selfControlClass
%>% filter(SEGMENT == '1')
%>% select(SEGMENT,PPN,SEGCLASS)
%>% unique(), by = "PPN")
dailySeg1AvgClass <- dailySeg1AvgClass[complete.cases(dailySeg1AvgClass),] #Remove nulls from not filling out Self Control Index Report
dailySeg2AvgClass <- left_join(dailySeg2Avg,selfControlClass
%>% filter(SEGMENT == '2')
%>% select(SEGMENT,PPN,SEGCLASS)
%>% unique(), by = "PPN")
dailySeg2AvgClass <- dailySeg2AvgClass[complete.cases(dailySeg2AvgClass),]
dailySeg3AvgClass <- left_join(dailySeg3Avg,selfControlClass
%>% filter(SEGMENT == '3')
%>% select(SEGMENT,PPN,SEGCLASS)
%>% unique(), by = "PPN")
dailySeg3AvgClass <- dailySeg3AvgClass[complete.cases(dailySeg3AvgClass),]
paged_table(dailySeg1Avg)paged_table(dailySeg1AvgClass)q1Seg1Plot <- ggplot(data = dailySeg1AvgClass[dailySeg1AvgClass$Question == 'Q1',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Habit Score", title = "Q1 | Day 1 - 47: Daily Habit Response Scores")
q2Seg1Plot <- ggplot(data = dailySeg1AvgClass[dailySeg1AvgClass$Question == 'Q2',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Score", title = "Q2 | Day 1 - 47: Daily Habit Response Scores")
q3Seg1Plot <- ggplot(data = dailySeg1AvgClass[dailySeg1AvgClass$Question == 'Q3',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Score", title = "Q3 | Day 1 - 47: Daily Habit Response Scores")
plot_grid(q1Seg1Plot,q2Seg1Plot,q3Seg1Plot, nrow = 1)summary(dailySeg1Avg[dailySeg1Avg$Question == 'Q1',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.7259 0.8970 0.8255 0.9773 1.0000
summary(dailySeg1Avg[dailySeg1Avg$Question == 'Q2',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.6028 0.8048 0.7312 0.9231 1.0000
summary(dailySeg1Avg[dailySeg1Avg$Question == 'Q3',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 -0.2176 0.2071 0.1714 0.5861 1.0000
q1Seg2Plot <- ggplot(data = dailySeg2AvgClass[dailySeg2AvgClass$Question == 'Q1',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Habit Score", title = "Q1 | Day 48 - 95: Daily Habit Response Scores")
q2Seg2Plot <- ggplot(data = dailySeg2AvgClass[dailySeg2AvgClass$Question == 'Q2',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Score", title = "Q2 | Day 48 - 95: Daily Habit Response Scores")
q3Seg2Plot <- ggplot(data = dailySeg2AvgClass[dailySeg2AvgClass$Question == 'Q3',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Score", title = "Q3 | Day 48 - 95: Daily Habit Response Scores")
plot_grid(q1Seg2Plot,q2Seg2Plot,q3Seg2Plot, nrow = 1)summary(dailySeg2Avg[dailySeg2Avg$Question == 'Q1',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.7267 0.9286 0.8037 1.0000 1.0000
summary(dailySeg2Avg[dailySeg2Avg$Question == 'Q2',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.6974 0.8889 0.7845 0.9765 1.0000
summary(dailySeg2Avg[dailySeg2Avg$Question == 'Q3',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 -0.2952 0.3333 0.2296 0.7661 1.0000
# Violin plot
q1Seg3Plot <- ggplot(data = dailySeg3AvgClass[dailySeg3AvgClass$Question == 'Q1',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Habit Score", title = "Q1 | Day 96+: Daily Habit Response Scores")
q2Seg3Plot <- ggplot(data = dailySeg3AvgClass[dailySeg3AvgClass$Question == 'Q2',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Score", title = "Q2 | Day 96+: Daily Habit Response Scores")
q3Seg3Plot <- ggplot(data = dailySeg3AvgClass[dailySeg3AvgClass$Question == 'Q3',], aes(x = SEGCLASS, y = Avg_Response)) +
geom_violin(fill = "lightblue", trim = TRUE) +
labs(x = "Question", y = "Average Score", title = "Q3 | Day 96+: Daily Habit Response Scores")
plot_grid(q1Seg3Plot,q2Seg3Plot,q3Seg3Plot, nrow = 1)summary(dailySeg3Avg[dailySeg3Avg$Question == 'Q1',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.7500 1.0000 0.8084 1.0000 1.0000
summary(dailySeg3Avg[dailySeg3Avg$Question == 'Q2',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.6667 1.0000 0.8097 1.0000 1.0000
summary(dailySeg3Avg[dailySeg3Avg$Question == 'Q3',]$Avg_Response)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 -0.3333 0.5000 0.2842 1.0000 1.0000
Part 4 - Inference
infer package to run ANOVA. First, question averages within
Days 1 - 47anovaSeg1Q1 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg1AvgClass[dailySeg1AvgClass$Question == 'Q1',])
summary(anovaSeg1Q1)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 0.048 0.02414 0.709 0.494
## Residuals 145 4.935 0.03403
anovaSeg1Q2 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg1AvgClass[dailySeg1AvgClass$Question == 'Q2',])
summary(anovaSeg1Q2)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 0.001 0.00049 0.011 0.989
## Residuals 145 6.567 0.04529
anovaSeg1Q3 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg1AvgClass[dailySeg1AvgClass$Question == 'Q3',])
summary(anovaSeg1Q3)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 0.16 0.07952 0.322 0.725
## Residuals 145 35.78 0.24676
anovaSeg2Q1 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg2AvgClass[dailySeg2AvgClass$Question == 'Q1',])
summary(anovaSeg2Q1)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 0.040 0.01998 0.376 0.687
## Residuals 112 5.951 0.05313
anovaSeg2Q2 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg2AvgClass[dailySeg2AvgClass$Question == 'Q2',])
summary(anovaSeg2Q2)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 0.047 0.02355 0.451 0.638
## Residuals 112 5.855 0.05227
anovaSeg2Q3 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg2AvgClass[dailySeg2AvgClass$Question == 'Q3',])
summary(anovaSeg2Q3)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 0.76 0.3819 1.094 0.338
## Residuals 112 39.09 0.3490
anovaSeg3Q1 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg3AvgClass[dailySeg3AvgClass$Question == 'Q1',])
summary(anovaSeg3Q1)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 0.326 0.1631 1.577 0.216
## Residuals 55 5.691 0.1035
anovaSeg3Q2 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg3AvgClass[dailySeg3AvgClass$Question == 'Q2',])
summary(anovaSeg3Q2)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 0.229 0.11442 1.35 0.268
## Residuals 55 4.660 0.08473
anovaSeg3Q3 <- aov(Avg_Response ~ SEGCLASS, data = dailySeg3AvgClass[dailySeg3AvgClass$Question == 'Q3',])
summary(anovaSeg3Q3)## Df Sum Sq Mean Sq F value Pr(>F)
## SEGCLASS 2 1.668 0.8342 1.517 0.228
## Residuals 55 30.241 0.5498
Part 5 - Conclusion
Potential Extensions
References
Main Study van der Weiden A, Benjamins J, Gillebaart M, Ybema JF and de Ridder D (2020) How to Form Good Habits? A Longitudinal Field Study on the Role of Self-Control in Habit Formation. Front. Psychol. 11:560. doi: 10.3389/fpsyg.2020.00560
Brief Self Control Scale Tangney, J.P., Baumeister, R.F. and Boone, A.L. (2004), High Self-Control Predicts Good Adjustment, Less Pathology, Better Grades, and Interpersonal Success. Journal of Personality, 72: 271-324. https://doi.org/10.1111/j.0022-3506.2004.00263.x