Week 2 Learning Log

Julia Chen

13/06/2021

This Week’s Coding Goals

  • Get started on Danielle’s videos earlier in the week, as I anticipate that this week’s workload will be heavier than last week’s
  • Follow Danielle’s videos and exercises to learn and practice my data visualization skills
  • Further develop my data visualization skills by using other data from the tidytuesday website

How did I go with the videos and exercises?

Exercise 5:

When first previewing the slides, I was most excited about creating this dinosaur in the plot. After watching the first few parts of the videos, I got straight into this exercise. The instructions were clear and easy to follow, so I decided to make the plot a little more interesting by making the dinosaur pink and adding a minimal theme for a clear background!
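As a rough sketch of how the pink dinosaur and minimal theme fit together: the `dino` data frame below, with its `horizontal` and `vertical` columns, is a made-up stand-in (a handful of invented points), not the actual exercise data.

```r
library(ggplot2)

# Hypothetical stand-in for the exercise data: a few invented points,
# NOT the real dinosaur data set from the exercises
dino <- data.frame(
  horizontal = c(20, 35, 50, 65, 80, 60, 40),
  vertical   = c(30, 60, 85, 70, 40, 15, 10)
)

# Pink points on a clean background
dino_plot <- ggplot(dino) +
  geom_point(aes(x = horizontal, y = vertical), colour = "pink") +
  theme_minimal()

print(dino_plot)
```

Putting `colour = "pink"` outside `aes()` sets one fixed colour for every point, rather than mapping colour to a variable in the data.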

Exercise 9 and 10: Handwriting data

Okay, this part took by far the longest to work through, but hey I got there in the end.

# load the packages I need
library(tidyverse)
## ── Attaching packages ───────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
forensic <- read_csv("data_forensic.csv")
## Parsed with column specification:
## cols(
##   participant = col_double(),
##   handwriting_expert = col_character(),
##   us = col_character(),
##   condition = col_character(),
##   age = col_double(),
##   forensic_scientist = col_character(),
##   forensic_specialty = col_character(),
##   handwriting_reports = col_double(),
##   confidence = col_double(),
##   familiarity = col_double(),
##   feature = col_character(),
##   est = col_double(),
##   true = col_double(),
##   band = col_character()
## )
print(forensic)
## # A tibble: 5,700 x 14
##    participant handwriting_exp… us    condition   age forensic_scient…
##          <dbl> <chr>            <chr> <chr>     <dbl> <chr>           
##  1           1 HW Expert        Non-… Non-US H…    52 Yes             
##  2           1 HW Expert        Non-… Non-US H…    52 Yes             
##  3           1 HW Expert        Non-… Non-US H…    52 Yes             
##  4           1 HW Expert        Non-… Non-US H…    52 Yes             
##  5           1 HW Expert        Non-… Non-US H…    52 Yes             
##  6           1 HW Expert        Non-… Non-US H…    52 Yes             
##  7           1 HW Expert        Non-… Non-US H…    52 Yes             
##  8           1 HW Expert        Non-… Non-US H…    52 Yes             
##  9           1 HW Expert        Non-… Non-US H…    52 Yes             
## 10           1 HW Expert        Non-… Non-US H…    52 Yes             
## # … with 5,690 more rows, and 8 more variables: forensic_specialty <chr>,
## #   handwriting_reports <dbl>, confidence <dbl>, familiarity <dbl>,
## #   feature <chr>, est <dbl>, true <dbl>, band <chr>
#constructing the plot to make it interpretable and pretty

picture <- ggplot(forensic) +
  geom_boxplot(aes(band,est, fill = band))+
  facet_wrap(facets = vars(handwriting_expert))+
  theme_minimal()+
  scale_x_discrete(name = NULL,labels = NULL) +
  scale_y_continuous(name = "estimate") +
  ggtitle(label = "Handwriting estimate for experts and novices", subtitle = "Source: Martire et al.") +
  scale_fill_viridis_d(alpha = .5, name = NULL)

print(picture)
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
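
The warning above presumably means that four rows have a missing or otherwise non-finite `est`, so the boxplot drops them. A small sketch of how that could be checked, using an invented `mini` data frame as a stand-in for `forensic` (the values are made up):

```r
library(dplyr)

# Made-up miniature of the forensic data; the values are invented
mini <- data.frame(
  band = c("a", "a", "b", "b"),
  est  = c(10, NA, 35, NA)
)

# Count the estimates that a boxplot would have to drop
n_dropped <- sum(!is.finite(mini$est))

# Keep only the rows with a usable estimate before plotting
mini_complete <- filter(mini, is.finite(est))
```

Running the same two steps on the real data should show whether the count matches the four rows mentioned in the warning.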

Extra exercise: Spotify songs

Out of all the data sets I looked at on the tidytuesday website, this one seemed the most interesting.

#read in the data
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   track_id = col_character(),
##   track_name = col_character(),
##   track_artist = col_character(),
##   track_album_id = col_character(),
##   track_album_name = col_character(),
##   track_album_release_date = col_character(),
##   playlist_name = col_character(),
##   playlist_id = col_character(),
##   playlist_genre = col_character(),
##   playlist_subgenre = col_character()
## )
## See spec(...) for full column specifications.
print(spotify_songs)
## # A tibble: 32,833 x 23
##    track_id track_name track_artist track_popularity track_album_id
##    <chr>    <chr>      <chr>                   <dbl> <chr>         
##  1 6f807x0… I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
##  2 0r7CVbZ… Memories … Maroon 5                   67 63rPSO264uRjW…
##  3 1z1Hg7V… All the T… Zara Larsson               70 1HoSmj2eLcsrR…
##  4 75Fpbth… Call You … The Chainsm…               60 1nqYsOef1yKKu…
##  5 1e8PAfc… Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
##  6 7fvUMiy… Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
##  7 2OAylPU… Never Rea… Katy Perry                 62 7INHYSeusaFly…
##  8 6b1RNvA… Post Malo… Sam Feldt                  69 6703SRPsLkS4b…
##  9 7bF6tCO… Tough Lov… Avicii                     68 7CvAfGvq4RlIw…
## 10 1IXGILk… If I Can'… Shawn Mendes               67 4QxzbfSsVryEQ…
## # … with 32,823 more rows, and 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>
glimpse(spotify_songs)
## Observations: 32,833
## Variables: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdf…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lu…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "T…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, …
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Lux…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "2…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop …
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7c…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "p…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "danc…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.3…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.12…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030,…
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.14…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, …
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 1630…

Okay nice, now let’s see if different song genres have different popularity ratings:

#constructing the plot 

picture <- ggplot(data = spotify_songs) +
  geom_point(mapping = aes(x = playlist_genre, y = track_popularity))+
  ggtitle(label = "Popularity of different song genres on spotify")+
  theme_minimal()

print(picture)

Alright, that’s a good start. But seeing as song genre is categorical data, let’s use a different kind of plot to better visualise the data, so that we can actually interpret which genre is most popular.

#constructing the plot 

picture <- ggplot(data = spotify_songs) +
  geom_boxplot(mapping = aes(x = playlist_genre, y = track_popularity))+
  ggtitle(label = "Popularity of different song genres on spotify")+
  theme_minimal()

print(picture)

Now we can tell that these genres are not that different from each other in popularity. It seems that pop is the most popular, followed by latin, which surprised me a little, and edm seems to be the least popular genre here.
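
Reading the ranking off the boxes by eye is a bit fiddly, so one way to double-check it would be to compute the median popularity per genre. This is only a sketch: `mini_songs` is a made-up miniature of `spotify_songs` with invented numbers, so its output says nothing about the real data.

```r
library(dplyr)

# Made-up miniature of spotify_songs; the numbers are invented for illustration
mini_songs <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "latin", "latin", "latin",
                       "edm", "edm", "edm"),
  track_popularity = c(50, 60, 70, 45, 55, 65, 20, 40, 60)
)

# Median popularity for each genre, highest first
genre_summary <- mini_songs %>%
  group_by(playlist_genre) %>%
  summarise(median_popularity = median(track_popularity)) %>%
  arrange(desc(median_popularity))

print(genre_summary)
```

Swapping `mini_songs` for `spotify_songs` should give the medians that the thick lines in the boxplot represent.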

Challenges

  • I struggled a lot with remembering the order of the functions, as well as where and how to add what I want to happen in the graph into my code. More practice should reduce this problem

  • I made a lot of silly mistakes, such as mismatching parentheses, not spacing things out where needed, forgetting commas, and leaving out the print(picture) at the end that actually displays the plot

  • Doing the extra exercise was definitely challenging. There are still some things I tried that didn’t work, so I’ll be doing some googling as well as asking for help during the Q&A sessions
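
To make the ordering and the print(picture) point concrete, here is a small sketch of the pattern. It uses the `mpg` data set that ships with ggplot2 purely as a stand-in for my own data.

```r
library(ggplot2)

# Start with ggplot(), then add a geom, then a theme;
# each `+` goes at the END of a line, never at the start of the next one
picture <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  theme_minimal()
# Nothing is drawn yet: the plot is only stored in `picture`

print(picture)  # this is the step that actually draws the plot
```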

Successes

  • Despite all the challenges, I am happy with my progress. I didn’t give up on the extra exercise and powered through to finally complete it

  • I was able to condense some parts of my code once I got comfortable and familiar with the functions, e.g. writing ggplot(blah) instead of ggplot(data = blah)
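
A quick sketch of why the shorter version works: R matches unnamed arguments by position, so the first argument of `ggplot()` is taken as `data`, and the first two arguments of `aes()` are taken as `x` and `y`. The `mpg` data set from ggplot2 is used here only as a stand-in.

```r
library(ggplot2)

# Fully spelled-out version
long_form <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Condensed version relying on argument position
short_form <- ggplot(mpg) +
  geom_point(aes(displ, hwy))

# Both versions end up plotting the same x and y values
same_points <- identical(
  layer_data(long_form)[c("x", "y")],
  layer_data(short_form)[c("x", "y")]
)
```

Other arguments, such as `fill` or `colour`, still need to be named, since only the first few positions have an obvious meaning.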

Moving Forward

There is definitely a lot more practice needed to remember everything I learnt this week, so I plan on revising this week’s content, perhaps with some extra exercises.

I will also work on the data wrangling videos and exercises, and play around a bit more with coding to get used to googling for answers, which I’m sure the assessment will involve.