Introduction

According to the Quantified Self website, “THE QUANTIFIED SELF is about making personally relevant discoveries using our own self-collected data.”

This sums up the purpose of this assignment. Our team, named Innodata, comprises three members. All three of us agreed to collect our individual data and share it within the group for the purpose of the assignment. A WhatsApp group was set up to stay connected and share updates on the assignment.

Explanation of Data process and methods

Data Collection :

During our first group meeting we discussed the different types of personal data and the methods through which they could be collected. Our challenge was to identify, for each type of data, a collection method that would be uniform and accessible across all mobile operating systems.

Our group agreed unanimously to collect the data listed below, and we gathered it from 5th August 2019 to 14th September 2019.

Structured Data :

  1. Data Collected : GPS location

Measures : Latitude, Longitude, Date and time

Method : Timestamp and location coordinates of the places travelled or visited are tracked by smartphone.

Application : Google Account location history

Comment : Google location history was enabled in the Google account settings on each member's smartphone. A zipped file containing the location history in JSON format was downloaded from https://takeout.google.com/settings/takeout by signing in with individual Google account credentials. Each group member then uploaded their individual data files to Google Drive.

  2. Data Collected : Steps and calories burned

Measures : Steps count, Distance travelled, Calories burned, highest and lowest latitude and longitude for each day, Speed

Method : Smartphone application tracks the steps count on a daily basis.

Application : Initially we considered apps such as Pedometer and MyFitnessPal, but we discarded them when we realised that MyFitnessPal doesn’t record steps and Pedometer requires the application to be open every time we walk or run in order to count the steps. We finally settled on Google Fit, as it can track both steps and calories burned even when the application is closed.

Comment : A zip file with the steps count data in CSV format was downloaded from https://takeout.google.com/settings/takeout by signing in with individual Google account credentials. The individual data files of each member were uploaded to Google Drive.

Unstructured Data

  1. Data Collected : Selfies

Measures : Date, Mood rating (Happy, Angry, Neutral, Disgust, Surprise, Sad, Fear)

Method : Selfies were clicked daily by each member through their respective smartphone camera application and uploaded to Google Drive.

Application : Initially we used an application called Feely, but we later dropped it because it displayed only the emotion with the highest score. We then decided that each member would choose a tool of their own preference to analyse the selfies for mood detection. I used ParallelDots, an online facial emotion detection tool available at https://www.paralleldots.com/facial-emotion. The other team members each wrote their own scripts that call facial mood detection APIs to analyse the mood of each selfie.

Comment : The selfies of all the team members were uploaded one by one to the ParallelDots facial emotion detection tool at https://www.paralleldots.com/facial-emotion. The mood ratings (Happy, Angry, Neutral, Disgust, Surprise, Sad, Fear) for each selfie were recorded in an Excel sheet.

Individual Data

  1. Data collected : Daily Expenses

Measures : Daily Total, Shopping, Transport, Eating out, Utilities, Groceries, Education, Rent expenses

Method : Debit card spending was tracked using the bank's mobile application.

Application : Commonwealth Bank Mobile Application

Comment : The daily expenses tracked by the mobile application were manually entered into an Excel sheet at the end.

External Data

Data Collected : Temperature

Measures : Daily maximum temperature

Source : Australian Government - Bureau of Meteorology (link : http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=122&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=066062)

Method : Sydney’s daily maximum temperature, recorded by the Sydney (Observatory Hill) weather station, was imported from the Australian Government - Bureau of Meteorology website.

Data Anonymity

To confirm that each member had provided their data, every shared file was labelled with the member's name and the type of data. I realised the need to anonymise our data only when we were asked to share anonymised data for the mystery box challenge. Hence, although the data shared within the group was not anonymised, I have replaced the member names with generic references in this report. I am Member 1 in this report.

Data size and storage

A shared folder was created in Google Drive so that group members could upload and share the individual data they had collected. As shown below, the data size varied by member and by type of data.

  1. Data : Steps and calories burned

Format : .csv

Data size :

Member 1 : 6KB

Member 2 : 10KB

Member 3 : 7KB

  2. Data : GPS location

Format : .json

Data size :

Member 1 : 129MB

Member 2 : 214MB

Member 3 : 6MB

  3. Data : Selfies

Format : .jpg

Data size :

Member 1 : 234MB

Member 2 : 54MB

Member 3 : 73MB

The steps and calories burned data for each group member was small enough to upload to Google Drive easily. However, the GPS location and selfie data of most group members were large and therefore took longer to upload. It was agreed that each group member would clean and merge the available datasets according to their own requirements.

Data Collection and Quality Issues :

The steps count data was of good quality and ready to use. However, the GPS location, selfie and daily expenses data had some collection and quality issues, described below.

  1. Data : GPS Location

Issues : The GPS location data needed a lot of cleaning. The application saves the latitude and longitude co-ordinates in E7 format (degrees multiplied by 10^7), so they had to be converted to standard decimal-degree co-ordinates for mapping. Similarly, the recorded timestamp had to be converted from POSIX millisecond format to a human-readable date/time format. I wrote an R script to convert both sets of values.

  2. Data : Selfie

Issues : Initially it was agreed that selfies would be clicked three times a day: morning, afternoon and night. Although members clicked three selfies a day at the start, the number dropped on subsequent days as members began to forget. It was therefore decided that one selfie a day would suffice. As a result, the early part of the collection period has multiple selfie entries per day. To resolve this, I used only the last selfie clicked on days with multiple entries. The online facial emotion detection tool I was using predicted moods for most of the selfies, but it could not predict moods for 3 of Member 3's selfies.

  3. Data : Daily Expenses

Issues : Sometimes a payment takes a couple of days to clear after it is made. In these cases the expense is recorded on the date the transfer completed rather than the date of payment. To overcome this, I kept track of pending transfers and corrected the entries by assigning each expense to the date the payment was made.
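
The two GPS conversions described above can be sketched in a few lines of base R. This is a minimal illustration and not the script used for the report: the column names (timestampMs, latitudeE7, longitudeE7), the sample values and the Australia/Sydney time zone are all assumptions.

```r
# Toy records in the shape assumed for the location history export.
# Column names and values are illustrative, not the group's data.
raw <- data.frame(
  timestampMs = c("1567216260000", "1567216320000"),
  latitudeE7  = c(-338832000, -338835500),
  longitudeE7 = c(1512005000, 1512009000),
  stringsAsFactors = FALSE
)

# E7 co-ordinates are degrees multiplied by 10^7.
e7_to_degrees <- function(x) x / 1e7

# POSIX milliseconds -> date-time. The default time zone is an assumption.
ms_to_datetime <- function(ms, tz = "Australia/Sydney") {
  as.POSIXct(as.numeric(ms) / 1000, origin = "1970-01-01", tz = tz)
}

clean <- data.frame(
  time = ms_to_datetime(raw$timestampMs),
  lat  = e7_to_degrees(raw$latitudeE7),
  lon  = e7_to_degrees(raw$longitudeE7)
)
print(clean)
```

Keeping the two conversions as small functions makes it easy to apply them to each member's file in turn.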

Analysis

GPS Location Data

The GPS location data contains enough location co-ordinates to track the locations and paths taken by all team members within and outside Sydney. The location co-ordinates of each member are colour coded with circle markers on the interactive map as follows:

Member 1 - Blue

Member 2 - Red

Member 3 - Green

The GPS location visualisations can be found at the links below.

http://rpubs.com/ganesharun237/530989

http://rpubs.com/ganesharun237/530990

Overlapping circle markers for a member at a particular location show that the member visits that location often. Continuous stretches of overlapping co-ordinates show the paths taken by the members.

To check the quality of the data, I plotted the location co-ordinates of each member for 31st August 2019. On that day all the members of our team came to university to attend Seminar 2 for the subject Data Science Practice. The seminar started at 9am and all three of us were late: I arrived at approximately 9.30am, whereas Members 2 and 3 arrived at around 10am. On checking the data, however, I was disappointed to see that my location co-ordinates for that day were logged only from 12:11pm, whereas the others' were available from 10am. Although my co-ordinates before 12pm were missing, the circle markers for all members overlapped at UTS during the seminar hours for which data was available, up to 5pm. Member 3 left UTS when the seminar ended, whereas Member 2 and I stayed at UTS until 11pm to finish Assignment Task 2 for Statistical Thinking for Data Science. After that, we went together to Central station to board our trains at 11.30pm. I was surprised to see that our circle markers stayed close together and overlapped for the entire period from 5pm to 11.30pm, until we boarded our respective trains.

Steps count Data

## -- Attaching packages ------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v purrr   0.3.2     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   Day = col_character(),
##   Member = col_character(),
##   `Calories (kcal)` = col_double(),
##   `Distance (m)` = col_double(),
##   `Low latitude (deg)` = col_double(),
##   `Low longitude (deg)` = col_double(),
##   `High latitude (deg)` = col_double(),
##   `High longitude (deg)` = col_double(),
##   `Average speed (m/s)` = col_double(),
##   `Max speed (m/s)` = col_double(),
##   `Min speed (m/s)` = col_double(),
##   `Step count` = col_double(),
##   `Average weight (kg)` = col_double(),
##   `Max weight (kg)` = col_double(),
##   `Min weight (kg)` = col_double(),
##   `Inactive duration (ms)` = col_double(),
##   `Walking duration (ms)` = col_double(),
##   `Running duration (ms)` = col_double()
## )

On examining the group's steps count data, I observed a similar daily trend for each member.

I found similar correlations between steps count, calories burned and distance walked for every member. Each pair of measures was plotted against each other, and a regression line was added to reveal the correlations listed below.

The steps count rises with the increase in distance. Calories burned rises with the increase in distance. Calories burned rises with the increase in steps count.

In summary, a person has to take more steps to cover a larger distance, and as the distance covered increases, more calories are burned.
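
A check of this kind can be sketched with base R's cor() and lm(). The column names follow the Google Fit export shown earlier, but the values are invented for illustration, and using Pearson correlation is an assumption, since the report does not name the measure.

```r
# Toy daily records. Column names follow the Google Fit export;
# the values are invented for illustration only.
fit <- data.frame(
  `Step count`      = c(1200, 3500, 5200, NA, 8100, 10400),
  `Distance (m)`    = c(900, 2600, 3900, 4100, 6200, 7900),
  `Calories (kcal)` = c(1650, 1720, 1790, 1800, 1905, 2010),
  check.names = FALSE
)

# Pairwise Pearson correlations, skipping rows with missing values.
cors <- cor(fit, use = "pairwise.complete.obs")
print(round(cors, 3))

# Least-squares regression line of steps on distance.
model <- lm(`Step count` ~ `Distance (m)`, data = fit)
print(coef(model))
```

A positive slope and a correlation close to 1 would match the pattern described above, where steps, distance and calories rise together.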

According to the Physical Activity & Sedentary Behaviour Guidelines for Adults published by the Australian Department of Health, it is recommended to undergo a “minimum of 150 minutes of moderate intensity physical activity a week. This equates to 30 minutes on most days. A half hour of activity corresponds to about 3,000 to 4,000 dedicated steps at a moderate pace. In Australia, the average adult accumulates about 7,400 steps a day.” (The Conversation 2019).

A quick summary of my steps count shows that my mean daily steps count for the period is just above the 3,000 to 4,000 steps implied by the guideline but well below the 7,400-step average for Australian adults.

summary(gstep$"Step count")
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    12.0   664.2  3549.0  4191.9  6625.5 15683.0       1
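
The comparison can be made explicit with a little arithmetic. The sketch below uses only the mean and median from the summary output above and the figures quoted from the guideline. The variable names are illustrative.

```r
mean_steps       <- 4191.9  # mean from the summary output
median_steps     <- 3549.0  # median from the summary output
guideline_range  <- c(3000, 4000)  # steps in a half hour of activity
national_average <- 7400    # quoted average for Australian adults

# Steps above the upper end of the guideline range.
above_guideline <- mean_steps - guideline_range[2]

# Shortfall against the national average, as a percentage.
below_average_pct <- 100 * (national_average - mean_steps) / national_average

# Does the median day fall inside the guideline range?
median_in_range <- median_steps >= guideline_range[1] &&
  median_steps <= guideline_range[2]

cat("Above guideline by", above_guideline, "steps\n")
cat("Below national average by", round(below_average_pct, 1), "percent\n")
cat("Median within guideline range:", median_in_range, "\n")
```

Because the mean is pulled up by a few very active days, the median is worth reading alongside it.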

Selfies Data (Mood/Emotion)

## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   Day = col_character(),
##   Member = col_character(),
##   Happy = col_double(),
##   Angry = col_double(),
##   Neutral = col_double(),
##   Disgust = col_double(),
##   Surprise = col_double(),
##   Sad = col_double(),
##   Fear = col_double()
## )

The averages of both happiness and sadness were higher on weekends, with Saturday recording the highest average for both emotions. As the weekends are holidays, members might be happy to have time to go out and do the things they like. The higher happiness average on Saturday compared with Sunday could be because the following day is also a holiday. Conversely, the higher sadness average on Saturday could be due to worry about having only a day left before assignment submissions, as most assignments are due either at midnight on Sunday or on Monday morning.
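
Weekday averages like these can be computed with base R's aggregate(). The sketch below also keeps only the last selfie on days with several entries, as described under the selfie quality issues. The mood scores are invented, and the rows are assumed to be ordered by capture time within each day.

```r
# Toy mood scores. Values are invented; column names follow the report.
mood <- data.frame(
  Date  = as.Date(c("2019-08-05", "2019-08-06", "2019-08-10",
                    "2019-08-10", "2019-08-11", "2019-08-17")),
  Happy = c(0.20, 0.10, 0.30, 0.60, 0.40, 0.50),
  Sad   = c(0.10, 0.30, 0.20, 0.30, 0.10, 0.40)
)

# Keep the last entry for any day with several selfies.
# Assumes rows are ordered by capture time within each day.
mood <- mood[!duplicated(mood$Date, fromLast = TRUE), ]

# Weekday number: "%u" gives 1 = Monday ... 7 = Sunday,
# which avoids locale-dependent day names.
mood$DayNum <- as.integer(format(mood$Date, "%u"))

# Mean score per weekday.
by_day <- aggregate(cbind(Happy, Sad) ~ DayNum, data = mood, FUN = mean)
print(by_day)
```

Grouping on a weekday number rather than a day name keeps the rows in calendar order without extra factor handling.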

The sadness of members decreases as the distance covered, steps count and calories burned increase. Members might feel refreshed and active after burning more calories by walking long distances, which helps to reduce their sadness.

It can also be observed that the happiness of members decreases as the distance covered, steps count and calories burned increase. Group members might feel exhausted after walking long distances, which might be the cause of the decrease in happiness.

Daily Expenses

## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   Day = col_character(),
##   Shopping = col_double(),
##   Transport = col_double(),
##   `Eating Out` = col_double(),
##   Utilities = col_double(),
##   Groceries = col_double(),
##   Education = col_double(),
##   Rent = col_double(),
##   Total = col_double()
## )

I found a true “quantified self” correlation between my happy mood and my eating out expenses: my eating out expenses increase when I am happier. This matches what I know about myself, as I prefer going out to restaurants for good food when I am happy, and I also tend to feel a lot happier when I eat out. This finding was further supported by the correlation between eating out expenses and anger, where I found that my anger decreases when I spend more on eating out. Although spending more on eating out doesn’t guarantee better quality or quantity of food, I do tend to feel a lot happier when I eat out, which might be the reason for the decrease in my anger.

Another correlation was identified between my sadness and total expenses. My total expenses decrease when my sadness increases. This is also true about myself as I prefer staying at home rather than going outdoors when I feel sad. My tendency to avoid going outdoors when feeling sad might be the reason for my reduced total expenses.

My rental expenses are higher on Wednesdays and Saturdays, as those are the days on which I pay my rent each fortnight. Although I do not normally pay rent on Mondays, Monday shows high rental expenses because on 12th August, which was a Monday, I paid a lease deposit of 6 weeks' rent for the house in Westmead that I had recently moved to. The high rent payments on Monday, Wednesday and Saturday have increased the overall total expenses on those days.

Temperature

## Parsed with column specification:
## cols(
##   Year = col_double(),
##   Month = col_double(),
##   Day = col_double(),
##   `Maximum temperature (Degree C)` = col_double(),
##   Date = col_date(format = "")
## )

The correlation between daily maximum temperature and my individual distance walked, steps count and calories burned shows that they all decrease as the temperature rises.
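
Relating the external temperature data to the activity data requires joining the two tables on their date column first. A minimal base R sketch is shown here. The values are invented, the column names follow the report's data, and Pearson correlation is again an assumption.

```r
# Toy tables. Values are invented; column names follow the report's data.
steps <- data.frame(
  Date = as.Date("2019-08-05") + 0:4,
  `Step count` = c(8200, 6900, 5400, NA, 3100),
  check.names = FALSE
)
temp <- data.frame(
  Date = as.Date("2019-08-05") + 0:4,
  `Maximum temperature (Degree C)` = c(17.2, 18.9, 21.4, 22.0, 24.8),
  check.names = FALSE
)

# Inner join on Date so each day's activity sits next to that day's maximum.
joined <- merge(steps, temp, by = "Date")

# Correlation, ignoring days with a missing step count.
r <- cor(joined$`Step count`, joined$`Maximum temperature (Degree C)`,
         use = "complete.obs")
print(r)
```

A negative value of r would correspond to the pattern described above, with fewer steps on warmer days.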

The temperature data also shows that my happiness increases and my sadness decreases as the temperature rises.

Finally, my total and eating out expenses decrease when the maximum temperature increases.

Findings and Conclusions

A similar increasing trend is observed for all members in the correlations between distance and steps count, distance and calories burned, and steps count and calories burned. Among the weekdays, Monday is the happiest day while Tuesday and Wednesday are the saddest. On the weekend, both Saturday and Sunday show the highest happiness, while Saturday also appears to be the saddest day. Both the sadness and the happiness of members decrease when the distance they cover increases.

Moreover, I tend to spend more on eating out when I am happy, which in turn reduces my anger. I reduce my expenditure when I feel sad. I tend to spend the most on rent. I walk less and therefore burn fewer calories when the temperature is high. I tend to be happier when the temperature increases, as it makes me feel warm. I spend less on hot days, as I prefer not to walk or travel much on those days.

Discussion

The most debatable point in our data collection and analysis process was that the shared data was not anonymised. None of our team members felt the need to anonymise our data until we started our analysis. When analysing the GPS location data in particular, the places a person had travelled to were easily traceable, which is a possible threat to one’s personal life. To prevent any further exposure of our unanonymised data, it was decided that each member of the team would anonymise the data for their report. The importance of anonymising and securing our data can be understood from the lawsuit accusing Google of tracking people's location history even when the setting is turned off (Tung 2018). We also made sure that nobody apart from the group and the instructors had access to the data stored on Google Drive, as it was important to ensure data security.

Reflection

Initially, we decided to collect data that could be tracked automatically with minimal manual effort. During my assignment for Statistical Thinking for Data Science, I realised that choosing data based on ease of collection was the wrong approach. It is important to first define research questions, which then lead to the decision on what data to collect and how. Once these are decided, it is important to identify how the different datasets can be related to one another. Our failure to follow this approach led us to choose GPS location data which, despite being large, could not be correlated with the other data. This restricted us to analysing the GPS data on its own. Given a second attempt at this assignment, I would strongly argue against collecting GPS data.

Initially I was very worried about what to look for and how to analyse the collected data. My worries disappeared when I attended the mystery box challenge, as I learnt a great deal about data analysis through it. In particular, I learnt how to analyse different datasets by trying to correlate them. After the mystery box challenge I framed my research questions and decided what I would look for in the data.

Although I was unfamiliar with Excel, I was learning R as part of the Statistical Thinking for Data Science subject, which helped me to perform data wrangling and visualisation. I realised that most of my peers used Tableau for their visualisations, as its drag and drop user interface makes this easy. My lack of skills in Excel and Tableau forced me to stick with R. Despite having good knowledge of data wrangling in R, I struggled with making visualisations. It was at this point that I found out about an R package called esquisse, which provides a drag and drop interface similar to Tableau's. I have used this package to plot all the visualisations for this report.

References

Quantified Self 2010, ‘Design and implementation of participant-led research’, Quantified Self, 17 October, http://quantifiedself.com/about/

Tung 2018, ‘Google sued for tracking you, even when ‘location history’ is off’, ZDNet, 21 August, viewed 19 September 2019, https://www.zdnet.com/article/google-sued-for-tracking-you-even-when-location-history-is-off/

The Conversation 2019, ‘Health Check: do we really need to take 10,000 steps a day?’, Better Health Channel, 10 April, viewed 19 September 2019, https://www.betterhealth.vic.gov.au/blog/blogcollectionpage/Conversation-10000steps

Others

The GPS location visualisations have been published on RPubs at the links below.

http://rpubs.com/ganesharun237/530989

http://rpubs.com/ganesharun237/530990