The Quantified Self is a growing phenomenon of self-tracking various elements of daily life, with the most popular areas focusing on health: physical, dietary, mental and emotional. The digitisation of many parts of people’s lives has led to the development of self-tracking tools that collect data on one’s health, habits, well-being, productivity, etc. (Lupton, 2016), in both passive and active ways.
A common and early example of self-tracking tools is wearable fitness and sleep trackers such as the Fitbit or the Apple Watch, a market still set to grow over the next 5 years (Lamkin, 2018). However, self-tracking can also involve other wearables, phone apps and manual self-documentation of data points. Self-analysis can be undertaken for the purpose of self-improvement or for future research (Lupton, 2016).
Personal data was collected from a small group of 6 members. The data collected consisted of transaction spend data (structured data) and mood records (unstructured data). This data was collected over 38 days, from 20th August 2021 to 26th September 2021 inclusive.
A shared Google Drive was used to aggregate each group member’s data. This gave every member equal visibility of the group’s data. At the end of the data collection period, each member could copy or export the group data for analysis.
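A minimal sketch of how the exported member files could be combined is shown below. It assumes each member’s spend export is a CSV in a local copy of the shared Drive folder; the folder path, file naming and column layout are assumptions for illustration, not the group’s exact setup.

```r
library(dplyr)
library(readr)

# Assumed layout: one CSV per member in a local copy of the shared folder.
spend_files <- list.files("data/shared_drive", pattern = "^spend_.*\\.csv$",
                          full.names = TRUE)

spend <- spend_files %>%
  lapply(read_csv) %>%
  bind_rows()   # one row per transaction, across all members
```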
There were some problems stemming from a mix of human error, measurement error and weaknesses in the data collection methodology that caused issues and limitations in the datasets. Data cleaning was necessary to prepare the data for analysis.
The best laid plans of mice and men often go awry. -Robert Burns, 1786
Data collection was possibly disrupted at the start of the 5th week. Initially, the team had planned to record 4 weeks of data, from 20th August 2021 to 19th September 2021. As the final day of collection passed, concerns were raised that this was not enough, and a decision was made to extend collection to 26th September 2021.
An additional ‘Week Commencing’ column was added, derived from the date of each record, to allow analysis grouped by week.
A single record was noted to have a value of 0. As 0 cannot be considered spend, this record was filtered out of the analysis.
A reclassification was done on the ‘Groceries’ label. ‘Groceries’ was used interchangeably as both a category and a subcategory, so all ‘Groceries’ subcategory labels were promoted to the category level.
Subcategory labels needed heavy editing to aggregate variants of the same label. Ultimately it was decided that this field would not be used in the analysis, as too many records had missing subcategory labels.
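The sketch below illustrates the spend-cleaning steps described above (week grouping, removing the $0 record, and promoting ‘Groceries’ to category level). The column names `date`, `amount`, `category` and `subcategory` are assumptions for illustration, as is the Monday week start.

```r
library(dplyr)
library(lubridate)

spend_clean <- spend %>%
  # derive the 'Week Commencing' column from each record's date (Monday start assumed)
  mutate(week_commencing = floor_date(as.Date(date), unit = "week", week_start = 1)) %>%
  # drop the single $0 record, as 0 cannot be considered spend
  filter(amount > 0) %>%
  # promote 'Groceries' from subcategory to category where it was used there
  mutate(category = if_else(!is.na(subcategory) & subcategory == "Groceries",
                            "Groceries", category))
```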
Another consideration that was noted, but not actioned, is that one member of the team was recording spend in Hong Kong. This may introduce error from currency conversion rates and unaccounted-for variance in cost of living.
The ‘time’ information is highly likely to be incorrect, as the app it was collected with defaulted to 20:00 unless a time was entered manually. This field was not used in the analysis.
Human fallibility in committing to a daily schedule for recording mood data led to missing records in the dataset.
The app allowed keywords to be saved as tags for reuse, potentially increasing the likelihood of repeated descriptors and reducing the number of unique keywords.
In the spend dataset, an outlier was noted for member ‘Bob’. It is not considered an error and was simply a large home office expense.
There is one record with missing data that returns NA; it was removed from the analysis.
The mood dataset was planned to have a single record per member each day to keep it consistent for analysis; however, a misunderstanding led one member to make several entries per day during the first week of collection. The multiple records were kept for the analysis, as aggregating them would have meant losing some mood data.
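A minimal sketch of the mood cleaning described above, assuming a `mood` data frame with `member`, `date` and `mood` columns (the names are assumptions for illustration).

```r
library(dplyr)

# Drop the single record that returned NA.
mood_clean <- mood %>%
  filter(!is.na(mood))

# Same-day duplicates from the first week are kept rather than aggregated;
# this simply flags where they occur for awareness during analysis.
mood_clean %>%
  count(member, date) %>%
  filter(n > 1)
```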
The exercise dataset had a single outlier. One record reported an uncharacteristically low distance with a near-average duration. This was due to an in-app error where GPS tracking failed, resulting in an inaccurate reading. Since the cause of the error is known, this record was removed from the analysis.
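One way to drop the failed-GPS record is sketched below, assuming an `exercise` data frame with `distance_km` and `duration_min` columns (names are assumptions) and that the bad record can be identified by its implausibly slow pace; the threshold value is illustrative only.

```r
library(dplyr)

# A low distance against a near-average duration implies an implausibly
# slow pace, so the failed-GPS record can be removed by a pace threshold.
exercise_clean <- exercise %>%
  mutate(pace_min_per_km = duration_min / distance_km) %>%
  filter(pace_min_per_km < 20)   # illustrative cut-off only
```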
The data collected from the group was de-identified before being recorded and used in this report. This was done by assigning a random alias to each member.
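A minimal sketch of that de-identification step follows. The data frame and column names are assumptions, and the alias pool is a placeholder standing in for the one the group actually used.

```r
library(dplyr)

# Reproducible random assignment of aliases to member names.
set.seed(42)
members <- unique(spend$member)                                  # real names collected
aliases <- sample(c("Bob", "Dan", "Eli", "Fox", "Gus", "Ivy"),   # placeholder alias pool
                  length(members))

alias_map    <- setNames(aliases, members)
spend$member <- unname(alias_map[spend$member])                  # replace names with aliases
```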
Privacy concerns are also relevant to this report. Some members raised an issue with their financial data being collected, so data collection was limited to discretionary spend, excluding income, bills, debt, etc.
All analysis of the dataset was done individually, with no collaboration among the team members.
Even within the small group of six members, there was a wide range in total spend and in where it was spent. Fox was the leading spender of the group, with over $4000 in spend distributed over several categories.
Looking at the distribution of spend for each member, they all showed the same pattern of a large volume of small transactions; however, each member’s tail of high-spend purchases fell in different categories, reflecting their hobbies and lifestyle preferences.
Interestingly, no single category had significantly greater spend than the rest, with over $2000 spent collectively on the top 5 categories. However, looking at the distribution of spend within the categories, different patterns emerge:

- Presents were few in number, but each at around the same cost.
- Food and Drink and Entertainment purchases (excluding the outliers) had higher counts at cheaper costs.
- Groceries bundled around similar costs.
- Health/Beauty had the most even distribution of cost of all the categories.
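One way these within-category distributions could be drawn is sketched below, assuming the cleaned spend data holds `amount` and `category` columns (names are assumptions for illustration).

```r
library(ggplot2)

# Histogram of transaction amounts, faceted by category, to compare the
# within-category spend distributions described above.
ggplot(spend_clean, aes(x = amount)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ category, scales = "free_y") +
  labs(x = "Transaction amount ($)", y = "Count",
       title = "Distribution of spend within each category")
```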
Notably, when it came to food, there was near-equal spend on Groceries and Takeaway. Although, by looking at each member, …
Each member had different word choices and range of vocabulary. Dan and Fox had the most limited range of moods in the data collected, while Eli had the widest variety.
‘Bad’ was the most frequently recorded mood, followed by ‘Meh’. Since the data was only collected over a 5-week period during the city-wide lockdown orders, it is impossible to state the cause of this finding.
Word frequency indicates that the main topic most members noted was whether they ‘stayed inside’ or ‘went out’. It seems likely that the data collection period coinciding with a lockdown and stay-at-home orders had an impact on the members’ day-to-day activities and thoughts.
These limitations are also compounded by the limited pool of participants, all students in the same course; all members show similar lifestyles, with key words such as ‘study’, ‘work’, ‘exercise’, ‘gaming’ and ‘tv’.
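The keyword counts behind these observations could be produced along the lines below, assuming each mood entry carries a free-text `note` field of tags (the column name and tokenisation are assumptions for illustration).

```r
library(dplyr)
library(tidyr)
library(stringr)

# Split each entry's free-text note into lower-case words and count them.
word_counts <- mood_clean %>%
  mutate(note = str_to_lower(note)) %>%
  separate_rows(note, sep = "[^a-z']+") %>%
  filter(note != "") %>%
  count(note, sort = TRUE)

head(word_counts)   # most frequent keywords, such as those discussed above
```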
On each occasion, different routes/distances were chosen. The data indicates that there was no pattern or increasing preference in running distance over time.
There are 3 distinct clusters of distances run. Most recorded jogs were around 2.9 km, with an average performance of 9.1 minutes per km. Performance begins to decline as the average mins/km rises.
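A sketch of the pace calculation and one way of recovering three distance clusters is shown below. It assumes the cleaned exercise data has `distance_km` and `duration_min` columns (names are assumptions); k-means is used here purely as an illustrative grouping method, not necessarily the one used in the report.

```r
library(dplyr)

runs <- exercise_clean %>%
  mutate(pace_min_per_km = duration_min / distance_km)

# Group the recorded distances into three clusters and summarise each.
set.seed(42)
runs$cluster <- kmeans(runs$distance_km, centers = 3)$cluster

runs %>%
  group_by(cluster) %>%
  summarise(mean_distance_km = mean(distance_km),
            mean_pace_min_km = mean(pace_min_per_km))
```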
During the data collection period, I noticed that when my behaviour was being recorded, I was likely to make different decisions about that behaviour. A key example: once I started using the app to track my jogging performance, I was influenced by poor performances, which affected my choices of when I would go jogging and how often I took a break.
As it is natural to alter undesirable behaviour when it is put in the spotlight, this supports the use of the quantified self as a method of self-improvement. A small study found that quantified feedback created a sense of achievement and acted as motivation for participants (Fritz, Huang, Murphy, & Zimmerman, 2013).
However, the mood data is in line with reports on the negative emotional impact lockdowns have on younger people. One way this impact was seen was through increased use of mental health services. At the start of 2021, Lifeline had an increase of 10% and 21% in calls compared to the same time in 2020 and 2019 respectively; for BeyondBlue, the increases were 27% and 23% for the same period comparisons (Australian Institute of Health and Welfare, 2021).
Common advice for improving one’s mental health is to have routine and structure, such as keeping to schedules, allocating time for self-care tasks and chores, and connecting with people (The Australian Psychological Society, 2020). Many self-tracking apps are known to boost self-improvement progress by having a positive impact on users’ perceived health and sense of accomplishment (Stiglbauer, Weber and Batinica, 2018).
This exercise has made me more aware of the privacy of my personal data. The team had raised concerns about sharing information about parts of their lives, which prompted me to read the privacy policies of some self-tracking apps. It seems that most app creators have access to all content that users input into their trackers. The tracker we used, Daylio, was the only one that does not access content: “The application does NOT collect any user-generated content… Your entries and memories are truly private and fully under your control.” (Daylio Privacy Policy, 2021)
It is increasingly important to know where the line is when it comes to ‘privacy’, as it is now common practice for many data companies to take anonymised data and use key indicators to ‘re-profile’ individuals with attributes not explicitly shared, e.g. marital status, gender, location, pet ownership or household size.
Although it hovers on the line of data ethics, this practice of profiling is common in marketing. It is best illustrated by retailers analysing sales patterns to predict customers’ future needs and lifestyles; even so, many consumers are willing to trade information for benefits or conveniences (Artun & Levin, 2015).
Data ethics is important beyond compliance with regulations and law. Expectations of what is fair and reasonable data usage have always outpaced the law; while we can use data ethics as a way to bridge the gap, we also need to consider how the digital world impacts our sense of morality.
If ethics is defined as shared values that help humanity differentiate right from wrong, the increasing digitalization of human activity shapes the very definitions of how we evaluate the world around us. - Schlenker
This shift to a data-centric world means we have to focus on the distinct ethical challenges of data (Taddeo & Floridi, 2015). As tracking tools become more integrated into our day-to-day lives, we need to not just be cautious, but continue to set high ethical expectations for those that look to benefit from our information.
This section applies the Data Ethics Canvas template from the Open Data Institute (ODI, 2019) to complete a broad evaluation of ethical issues in data science practice. Equally important, it includes the legal framing for this QS project.