About a year and a half ago, I started learning about data science in my free time.
But what have I actually learned about data? Am I even good at it?
I started learning online, where there’s no shortage of great free resources. I read Wickham’s R for Data Science (https://r4ds.hadley.nz). I downloaded RStudio, completed the swirl package’s exercises and started exploring open datasets like Gapminder’s.
Last month I completed the Google Data Analytics Professional Certificate, but sometimes I feel very green at this.
Most of what I have learned so far has been from exploring things that interest me.
For example, I documented my exercise habits for 400 days in a row and then I looked at the data I generated - very basic things, like sorting different exercises by repetitions and time, subsetting by week and month, etc.
I asked Spotify for my streaming data and did more basic analysis - things like finding the artists I listened to the most each month, summarizing time spent listening to podcasts versus music, etc.
I also documented every bill that was discussed in the Portuguese Parliament during the first quarter of 2023 by manually creating a database - then tried to find patterns in the types of proposals favored by each party, approval and rejection percentages by party, etc.
(Thanks to the Parliament’s open data policy, all of this is available on its website, updated daily, and only a slice of it ever reaches citizens via the media.)
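To give an idea of the kind of summary I was after, here is a minimal sketch in R - the data frame and its column names (party, bill_type, outcome) are made-up stand-ins for my hand-built database, not the Parliament’s actual schema.

library(dplyr)

# Made-up stand-in for the hand-built database of bills
bills <- data.frame(
  party     = c("A", "A", "B", "B", "C"),
  bill_type = c("labour", "housing", "labour", "environment", "housing"),
  outcome   = c("approved", "rejected", "approved", "approved", "rejected")
)

# Approval and rejection percentages by party
bills %>%
  count(party, outcome) %>%
  group_by(party) %>%
  mutate(pct = round(100 * n / sum(n), 1))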
Data science is a very big field.
As I understand it, data are values of qualitative or quantitative variables belonging to a set of items.
I came to understand data itself as the second most important thing in data science, the most important being the question. And I am learning about all sorts of questions - descriptive, exploratory, inferential, predictive, causal, mechanistic.
I see the data analysis process as a sequence of phases: it starts with the question, moves on to preparing and processing the data, then to the actual analysis, and ends with sharing and acting on the conclusions reached.
The most interesting practical skill I have gained so far has been learning the R programming language.
At this stage, I am reasonably comfortable with importing data from a spreadsheet or a database into R, performing basic analysis and creating average data visualizations.
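A typical snippet from inside that comfort zone looks something like this - the file name and column names are made up for illustration:

library(readr)
library(dplyr)
library(ggplot2)

# Hypothetical CSV with one row per workout (file and columns are invented)
workouts <- read_csv("workouts.csv")

# Basic summary: total repetitions per exercise
workout_summary <- workouts %>%
  group_by(exercise) %>%
  summarise(total_reps = sum(reps, na.rm = TRUE)) %>%
  arrange(desc(total_reps))

# An average-looking bar chart of the same summary
ggplot(workout_summary, aes(x = exercise, y = total_reps)) +
  geom_col() +
  coord_flip()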
I’m okay with basic data consistency checks, like making sure all data types are fit for analysis, coercing them when needed, or removing missing values.
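For example (the values are invented):

# A numeric variable that was read in as character
reps <- c("12", "15", "ten", NA)

# Coerce to numeric; anything that can't be parsed becomes NA (with a warning)
reps_num <- as.numeric(reps)

# Count the missing values, then keep only the usable ones
sum(is.na(reps_num))
reps_clean <- reps_num[!is.na(reps_num)]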
If, however, the data is hierarchical - stored in JSON and parsed into nested lists - then there’s still plenty of steady progress to be made.
Working with lists is an area that is very much at the limits of what I am able to do in R.
In R, everything is an object and not every object is of the same data structure. There’s a handful of atomic classes of objects: character, numeric (real numbers), integer, complex, and logical (TRUE/FALSE).
Variables, vectors, data frames, matrices, lists… these are all objects in R. A vector, for example, can only contain objects of the same type/class. A data frame, however, can contain columns of different data types. You can have a data frame with a character vector and a numeric vector as columns, for example.
Lists are an exception, because a list’s elements can be of any dimension and type.
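A quick console session makes these differences concrete:

# A vector coerces everything to one shared class
c(1, "two", TRUE)        # becomes a character vector: "1" "two" "TRUE"

# A data frame keeps one class per column
df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))
str(df)                  # two columns, each with its own class

# A list keeps each element's class intact
str(list("one", 2, TRUE))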
As I understand it, the degree of complexity of each class of object is related to the number of dimensions it holds. A vector, for example, has a single dimension, and every element in it is of the same class.
Matrices, on the other hand, are vectors too, but with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (one for the number of rows, the other for the number of columns).
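It only takes a couple of lines in the console to see this:

v <- 1:6            # an integer vector
dim(v) <- c(2, 3)   # add a dimension attribute: 2 rows, 3 columns
class(v)            # "matrix" "array" in recent versions of R
attributes(v)       # the dim attribute is just an integer vector of length 2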
A list can contain elements of different classes. This means the same list can store a character vector, an integer, a sequence of logical values - even multiple lists!
Those lists inside the list can store multiple lists and inside those lists there can be… well, you get the picture. It’s a chest of drawers sketched by MC Escher.
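Something like this is a perfectly legal R object (all the names are made up):

drawer <- list(
  socks   = c("black", "grey"),
  numbers = 1:3,
  smaller_drawer = list(
    coins = c(0.5, 1, 2),
    even_smaller_drawer = list(note = "you get the picture")
  )
)

# Reaching the innermost element means chaining [[ after [[ after [[
drawer[["smaller_drawer"]][["even_smaller_drawer"]][["note"]]

str(drawer)   # str() shows the whole chest of drawers at once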
I know there are R packages like jsonlite that are especially tailored to help with parsing JSON files and nested lists. So I need to go over the documentation in order to improve my list skills.
The particular list I am struggling with comes from a JSON file that I have imported into R and whose overall structure I have inspected, but which I have yet to fully untangle and make sense of.
The source of this data is the Portuguese Parliament.
# Import the Parliament's open-data JSON ("IniciativasXV_json.txt") straight from its URL
library(jsonlite)
DadosAbertosParlamento <- fromJSON(
"https://app.parlamento.pt/webutils/docs/doc.txt?path=LySdCyiaGb5D%2blB0%2bf0Y8g57BaQzyN%2bM0d%2bjV3egDf7sK9otw09cd4VFXCMZmLGv5z6UP7r5U7zYkdMbd9fqcpqpGMnamAHcmurl1EO%2bjDM3dZC7lY%2f%2b6EPANtOSL7ublPJfry2KPcl9AcfxvY8H4ZpFHphiI%2bfMFOr%2fqDTl1ANpIJApSzoDkgyzrN7ig9BdbiH%2bketEsotLU6pohnvQmqKp15OsJ7JSGZLS8TJd%2fckhrWfiN8Sm96c3ZHO6%2b5Lrn9PXIdqhzLOErCN5qccRM06U50XP910hntCkY6himienV8Kz2LnwEgVnVAcJFANyNUsbBem4%2bpSiYqt%2b9U0ByjZA%2btcA5J03wuq1tFRH8Rf0dCEPWQRFtpgxfYqN4NWE&fich=IniciativasXV_json.txt&Inline=true"
)
typeof(DadosAbertosParlamento)
## [1] "list"
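Since I haven’t mapped the structure yet, my plan is to lean on generic inspection functions that don’t require knowing the element names in advance - a sketch of the next steps rather than a finished recipe:

# Peek at the top level only, without drowning in nested output
str(DadosAbertosParlamento, max.level = 1)

# What are the top-level elements called, and how big is each one?
names(DadosAbertosParlamento)
lengths(DadosAbertosParlamento)

# purrr::pluck() can then drill one level at a time,
# e.g. purrr::pluck(DadosAbertosParlamento, 1, 1)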
My ultimate goal with this project is to create a place where citizens can find unbiased information, built on descriptive statistics, about how all the political parties represented in Parliament behave.
The idea is to present the open data stored on the Parliament’s website in an informative way, depending on the basic type of question a citizen might want answered - such as what is proposed by whom, how each party votes, what type of language each uses, etc.
Not clickbait, not guesswork. Just what happened in Parliament.
But I’m still stuck on this puzzle.