The purpose of this project is rooted in the objectives of the international community known as Quantified Self. Quantified Self is a group of users from around the world who share an interest in improving self-knowledge through data analysis, as described by Quantified Self (2019). This report applies the Quantified Self concept to a group of 7 people, analysing shared data to shed light on one question: are there any relationships between public transport use and our mood/perceptions of public transport? The shared group data set includes multiple variables from Opal fares and a log of daily written journal entries.
2.1 Data Collected & Associated Justifications
Three different types of data were collected during this project: Opal fare data, a daily journal entry reflecting on individual public transport use, and morning commute time data. A published cohort level data set was also used as a reference point.
Opal fare dataset (group level): Collected daily from 25/3/19 to 6/5/19. The data is collected by Opal and is downloadable from their website. This data was selected because we needed a fundamental measure of public transport use each day. This is a structured dataset.
Daily journal entry of public transport experience (group level): Data has been collected in the form of a short journal entry reflecting on mood/perception towards public transport for the period 25/3/19 to 6/5/19. Collection frequency is daily. The entries have been recorded in the DayOne application or Excel. The justification for collecting this data set is that it creates a ‘value’ for mood/perception each day. This is an unstructured data set.
Morning commute time (personal level): I also personally collected data on my commuting time each morning. Collection frequency is daily. The data has been recorded in Excel. The reason for collecting this data is that it could be compared against the daily mood/perception of public transport to look for a relationship. This is a structured data set.
Customer Satisfaction Index (cohort level): This is a report published by the NSW Government which details customer satisfaction metrics for different modes of transport. This cohort level dataset will serve as a reference point for the personal and group level data analysis. This is an unstructured data set.
2.2 Data Quality & Collection Issues
Opal fare data;
Issues
Management of Issues
Daily journal entry of public transport experience;
Issues
Management of Issues
Morning commute time;
Issues
Management of Issues
Global data set;
Issues
Management of Issues
2.3 Anonymisation of Datasets
One of the primary reasons for anonymising the data is that the Opal fare data tracks tap-on and tap-off locations. Leakage of this sensitive information could therefore reveal a group member’s movements and compromise their privacy.
Anonymisation of the group level data sets was of utmost importance to our group in order to preserve the privacy of everyone’s personal data. To anonymise the data sets, the group leader assigned each person a number from 1 to 7 at the beginning of the project. When each person uploaded their data sets to Google Drive, they labelled them with this number, so that people were de-identified from the data and everyone could see whether all 7 data sets had been uploaded.
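The labelling scheme described above can be sketched in Python. This is a minimal illustration only: the member names, the function `assign_labels` and the file-naming pattern are hypothetical, and the report does not state whether the numbers were assigned at random, so the shuffle is an assumption.

```python
import random

def assign_labels(members, seed=None):
    """Map each member name to a distinct number from 1 to len(members)."""
    rng = random.Random(seed)
    labels = list(range(1, len(members) + 1))
    rng.shuffle(labels)  # assumption: random assignment, so list order reveals nothing
    return dict(zip(members, labels))

# Hypothetical placeholder names, not the real group members.
members = ["member_a", "member_b", "member_c", "member_d",
           "member_e", "member_f", "member_g"]
labels = assign_labels(members, seed=42)

# Uploaded files carry the number only, e.g. "3_opal.pdf" (hypothetical pattern).
filenames = {name: f"{number}_opal.pdf" for name, number in labels.items()}
```

Only the person holding the `labels` mapping can link a number back to a name, which is why that mapping should not be stored alongside the shared files.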
2.4 Storage of Data and Group Access to Shared Data
The storage and sharing of the group data sets was facilitated by Google Drive. Each group member uploaded their Opal fare PDFs and their JSON journal entry files to Google Drive so that we could see who had uploaded their files. From here, the PDFs and JSON files were run through a Python script written by another group member to convert them to XLS format. Finally, the group leader aggregated all individual files into one master XLS file which included both the Opal fare and daily journal entry data sets. Each group member therefore has access to all uncleaned and cleaned data sets for the group and to the final aggregated, cleaned data set.
The analysis of the personally collected data and group level data was done using the RStudio software. The first step in the analysis was to assess the completeness of the data sets. Figure 1 shows that 25% of the observations were missing, Figure 2 shows that all observations were observed (100%), and Figure 3 shows that 99% of the observations were complete. The cohort level data set (Customer Satisfaction Index) is not structured data as such, and was therefore not analysed. Note that all NAs in the data sets were removed before the analysis began, and no data imputation was performed.
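The aggregation step can be illustrated with a short Python sketch. It is not the group's actual script: the function `aggregate`, the field names and the sample rows are all hypothetical, and it assumes the converted files have already been read into lists of dictionaries.

```python
def aggregate(per_member_rows):
    """Combine {member_id: [row dicts]} into one list, tagging each row with its member id."""
    master = []
    for member_id in sorted(per_member_rows):
        for row in per_member_rows[member_id]:
            # Keep the anonymous number with every row so rows stay traceable to a data set.
            master.append({"member_id": member_id, **row})
    return master

# Hypothetical sample rows standing in for the converted Opal and journal files.
per_member_rows = {
    2: [{"date": "2019-03-25", "fare": 3.66, "journal": "Bus was fast"}],
    1: [{"date": "2019-03-25", "fare": 4.40, "journal": "Train was late"},
        {"date": "2019-03-26", "fare": 4.40, "journal": "Average trip"}],
}
master = aggregate(per_member_rows)
```

Carrying the anonymous member number into every row keeps the master table de-identified while still allowing personal and group level results to be compared.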
Figure 1: 75% of the data set is observed
Figure 2: 100% of the data set is observed
Figure 3: 99% of the data set is observed
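The completeness percentages reported in Figures 1 to 3 were produced in RStudio. The same calculation can be sketched in Python as follows; the function names and the sample values are hypothetical, and missing observations (NA) are represented here by `None`.

```python
def completeness(values):
    """Return the percentage of entries that are observed (not None)."""
    if not values:
        return 0.0
    observed = sum(v is not None for v in values)
    return 100.0 * observed / len(values)

def drop_missing(values):
    """Remove missing entries, mirroring the removal of NAs before analysis."""
    return [v for v in values if v is not None]

# Hypothetical commute times in minutes; None marks a day with no record.
commute_minutes = [35, None, 42, 38]
pct_observed = completeness(commute_minutes)
cleaned = drop_missing(commute_minutes)
```

Dropping missing entries keeps the analysis simple, but it shrinks the sample, which matters for a collection period as short as this one.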
The next phase of the analysis was to combine the personal data collection element (total commute time to work in the morning) with my group level data (journal entries). Figure 4 displays a graph of Sentiment (sentiment analysis of daily journal entries) versus Response (commute time in minutes). The graph appears to show a negative relationship between commute time and sentiment: shorter commutes tend to coincide with more positive sentiment about the public transport experience. However, the points are quite dispersed across the graph, so it is difficult to determine how strong this relationship is, or whether it reflects a causal effect.
Figure 4:
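One way to put a number on the relationship shown in Figure 4 is the Pearson correlation coefficient. The report does not state which statistic, if any, was computed, so the following Python sketch is an assumption about how it could be done, and the sample values are hypothetical rather than the project's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: commute time in minutes and a daily sentiment score.
commute_minutes = [30, 35, 40, 45, 50]
sentiment = [0.8, 0.5, 0.6, 0.2, 0.1]
r = pearson_r(commute_minutes, sentiment)
```

A value of `r` close to -1 would support a strong negative relationship, while a value near 0 would match the dispersed points noted above. In either case, correlation alone does not establish a causal effect.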
My personally recorded journal entries were compared against the combined journal entries of the group. For this analysis of sentiment, I used a Wordcloud. Figure 5 represents the personal data Wordcloud and Figure 6 represents the group data Wordcloud. The first key insight derived from these Wordclouds is the identification of transport modes. I only ever catch buses around the city, and rarely drive. The personal data Wordcloud has identified ‘bus’ as a significant word in my journal entries. The group data Wordcloud has identified ‘bus’, ‘train’ and ‘drive’ as significant words in the group journal entries. The Wordclouds have therefore been able to extract different modes of transport based on the differences in word frequencies between my personal data and the group level data. Note that the group level data did not include mode of transport, so the Wordclouds have provided transparency here. Moreover, the words ‘good’, ‘average’, ‘comfortable’ and ‘fast’ in Figure 5 can be cross-examined with the cohort level data, i.e., bus use has a positive 89% customer satisfaction rating, as stated by Transport for NSW (2018).
Figure 5:
Figure 6:
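A Wordcloud is driven by word frequencies, and the counting step behind Figures 5 and 6 can be sketched in Python. The Wordclouds in this report were generated in RStudio, so this is an illustration only: the stopword list, the function `word_frequencies` and the sample entries are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list; real tools ship much larger ones.
STOPWORDS = {"the", "a", "was", "and", "to", "i", "it", "on", "of", "in", "today", "my", "is"}

def word_frequencies(entries, stopwords=STOPWORDS):
    """Count how often each non-stopword appears across all journal entries."""
    counts = Counter()
    for entry in entries:
        words = re.findall(r"[a-z']+", entry.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts

# Hypothetical journal entries.
entries = ["The bus was fast today", "Bus was average", "Good bus, comfortable and fast"]
freq = word_frequencies(entries)
top_words = freq.most_common(2)
```

The most frequent words would be drawn largest in the Wordcloud. Frequency says nothing about context, for example whether ‘fast’ was preceded by ‘not’.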
A final sentiment analysis was done across the two group level data sets, the journal entries and the Opal fare data. Two further Wordclouds were generated based on the time that group members boarded public transport each day (Opal data), i.e., midnight to midday (morning) and midday to midnight (afternoon). Figure 7 displays significant words such as ‘bus’, ‘average’, ‘fast’ and ‘think’, whereas Figure 8 displays significant words such as ‘train’, ‘walk’, ‘work’ and ‘early’. Based on interpretation of the most significant words in each Wordcloud, people who commuted on public transport in the morning had more positive experiences, morning commuters used buses more frequently, and afternoon commuters used trains more frequently. In a recent report by Transport for NSW (2018), train commuters had an 86% customer satisfaction rating, compared with 89% for bus users. The Wordclouds are therefore consistent with the cohort level finding that train commutes are rated somewhat less favourably than bus commutes.
Figure 7:
Figure 8:
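The split of journal entries by boarding time can be sketched in Python as follows. The record layout, the function names and the sample entries are hypothetical. Treating a tap-on at exactly midday as ‘afternoon’ is also an assumption, since the report does not say how that boundary was handled.

```python
from datetime import datetime

def period_of_day(tap_on):
    """Return 'morning' for 00:00-11:59 and 'afternoon' for 12:00-23:59."""
    return "morning" if tap_on.hour < 12 else "afternoon"

def split_entries(records):
    """Group journal texts by the period in which the member tapped on."""
    groups = {"morning": [], "afternoon": []}
    for tap_on, text in records:
        groups[period_of_day(tap_on)].append(text)
    return groups

# Hypothetical (tap-on time, journal text) pairs.
records = [
    (datetime(2019, 3, 25, 7, 45), "Bus was fast"),
    (datetime(2019, 3, 25, 17, 30), "Train was crowded"),
    (datetime(2019, 3, 26, 12, 0), "Walked to the train early"),
]
groups = split_entries(records)
```

Each group of texts can then be passed separately to the word-frequency step to produce one Wordcloud per period.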
The primary finding from the analysis was that there was a reasonably strong negative relationship between commute time and sentiment, i.e., mood. Given a longer timespan of data collection, this phenomenon may become more evident and more statistically significant. As explained by Morris & Guerra (2015), there is a statistically significant and negative association between trip duration and mood, primarily because of rising stress, fatigue and sadness on long trips.
The way in which the personal data was collected raises data quality issues. The time interval measurements were taken by reading a digital clock, and the time was rounded to the nearest minute. This limited timing accuracy may therefore have affected the analysis and results.
Another key finding of the project came from the sentiment analysis performed on the journal data. The Wordcloud generated in RStudio was able to determine, through word frequency analysis, that I had used the bus as my primary mode of public transport, and that my trips were typically ‘good’, ‘fast’ and/or ‘average’. This appears consistent with the finding from the analysis of the individual data, that short commutes on the bus can result in a more positive mood. However, relationships identified using this tool should be examined strictly in context. As BetterEvaluation (2019) points out, a primary challenge associated with Wordcloud use is interpretation, because Wordclouds convey the frequency of words and not their importance.
It should be noted that the way the journal entry data was collected has likely had a negative effect on this project. Once the data had been cleaned and aggregated for the group, it became evident that group members wrote journal entries of different lengths, described different aspects of the commute, and in some instances did not report on their experience/perception of public transport each day.
Based on these findings, and the issues with the data, it is difficult to come to any one particular conclusion from this analysis. There is research to suggest that longer commute times lead to poorer moods, such as Hansson et al. (2011), who found associations between increased commute time and negative mental health outcomes. Moreover, the Wordcloud sentiment analysis has extracted the most frequently used words and presented them in an interpretable format. Since mood is not an easily quantifiable notion, the results of this analysis should be interpreted strictly in context. Finally, the findings of this analysis must be viewed in light of the discussed data quality issues of missingness and consistency of collection.
A primary result of the data analysis was that the data collected was not sufficient to answer the objective question of this project. I had to change the way in which I framed the objective question in order to suit the data that we had collected. It became increasingly difficult to make meaningful use of the Opal fare data in the analysis, as it captured little meaningful information apart from broad geolocation data, and the journal entries were very heterogeneous. The geolocation data was of concern. Handling this data so as to retain its anonymity and manage ethical, legal and privacy risks was quite time consuming and burdensome. If I were to do this project again, I would think more deeply about data types that would allow me to measure sentiment more accurately, and use better measurement tools such as dedicated applications. One particular area for further development would be ethical and privacy risks.
The concept of the quantified self is inherently intrusive, which raises questions around the ethics of human data collection. At the beginning of this project, we elected to collect Opal fare data, not realising that we would be collecting geolocation data every time we tapped on and off. We were therefore unaware, until data aggregation, that we were releasing this information to the group leader to anonymise. The ethical risks of the data sensitivity, compounded by the privacy concerns of the group members, made us think more deeply about data consent. Informed consent is an important part of the human data collection process in order to prevent legal risk and data privacy concerns, as described by Connelly (2014). To reiterate, if we were to redo a similar project, we would look more deeply into what levels of data we could collect.
This project has highlighted the power of the Quantified Self concept. In a time period of just 2 months we were able to collect two different types of data on a group of 7 people and draw some inferences about the group being studied. I am quite surprised by the insights that the analysis provided, such as that shorter commute times may lead to more positive mood and perception of public transport. I have been able to learn and develop the tools and methodologies a data scientist must apply in order to practise effectively. Once the analysis had been completed, it became evident how important data collection and management are, along with their inherent issues.
Over the course of this project, I have developed my understanding of data collection in order to investigate a topic. If I were to redo this quantified self project, I would select a better method for measuring my personal level data (commute times), such as an application that records in HH:MM:SS accuracy along with geolocation. However, I do not believe that HH:MM:SS timing accuracy would significantly change the results of this analysis.
During the analysis I found that approximately 25% of my personal data was missing. This is because NA values were recorded when public transport was not used in the morning. If I were to do this project again, I would try to learn and employ data imputation, but this is not without inherent risks. As Janssen et al. (2015) explain, data imputation can yield misleading results, and it is important to remember that the aim of imputation is not to create data, but to prevent the exclusion of observed data. I believe that, used carefully, this could improve the rigour of the analysis.
At the beginning of this project I had little to no knowledge of the privacy, legal and ethical risks stemming from personal data collection. This quantified self project has heightened my awareness of these concerns. One such issue is the collection of the Opal fare data at the group level, as it records your location each time you tap on and tap off Sydney transport vehicles. If I were to redo this project, I would seek to mitigate these risks by having each user delete the locational variables in the dataset before uploading the anonymised data sets to Google Drive. There would likely be some information loss, but geolocation has not proved to be an important part of this analysis and has not helped me answer the question leading this project.
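The mitigation proposed above, removing locational variables before upload, can be sketched in Python. The column names and the function `strip_location` are hypothetical, since the report does not list the fields in the Opal export.

```python
# Hypothetical column names for the tap-on and tap-off locations.
LOCATION_FIELDS = {"tap_on_location", "tap_off_location"}

def strip_location(rows, location_fields=LOCATION_FIELDS):
    """Return copies of the rows with all location fields removed."""
    return [
        {key: value for key, value in row.items() if key not in location_fields}
        for row in rows
    ]

# Hypothetical Opal rows for one anonymised member.
opal_rows = [
    {"date": "2019-03-25", "fare": 3.66,
     "tap_on_location": "Stop A", "tap_off_location": "Stop B"},
    {"date": "2019-03-26", "fare": 3.66,
     "tap_on_location": "Stop B", "tap_off_location": "Stop A"},
]
safe_rows = strip_location(opal_rows)
```

Because the function returns new rows rather than editing in place, each member keeps their original file locally and shares only the stripped copy.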
BetterEvaluation 2019, Word Cloud, viewed 28 March 2019, https://www.betterevaluation.org/en/evaluation-options/wordcloud.
Connelly, M 2014, ‘Ethical Considerations in Research Studies’, Medsurg Nursing, vol. 23, no. 1, pp. 54-55.
Hansson, E, Mattisson, K, Bjork, J, Ostergren, P, Jakobsson, K 2011, ‘Relationship between commuting and health outcomes in a cross-sectional population survey in southern Sweden’, BMC Public Health, vol. 11, p. 834.
Janssen, K, Donders, R, Harrell, F, Vergouwe, Y, Chen, Q, Grobbee, G & Moons, K 2015, ‘Missing covariate data in medical research’, Journal of Clinical Epidemiology, vol. 7, no. 9, pp. 990-994.
Morris, E & Guerra, E 2015, ‘Are we there yet? Trip duration and mood during travel’, Transportation Research Part F: Traffic Psychology and Behaviour, vol. 33, pp. 38-47.
Quantified Self 2019, What is Quantified Self?, viewed 16 April 2019, https://quantifiedself.com/about/what-is-quantified-self/.
Transport for NSW 2018, Customer Satisfaction Index, NSW Government, Haymarket NSW.