This Week’s Coding Goals
- Get started on Danielle’s videos earlier in the week, as I anticipate that this week’s workload will be heavier than last week’s
- Follow Danielle’s videos and exercises to learn and practice my data visualization skills
- Further develop my data visualization skills by using other data from here
How did I go with the videos and exercises?
Exercise 5:
When first previewing the slides, I was most excited about creating the dinosaur plot. After watching the first few parts of the videos, I got straight into this exercise. The instructions were clear and easy to follow, so I decided to make the plot a little more interesting by making the dinosaur pink and adding a minimal theme for a clean background.
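For anyone curious, the pink-plus-minimal-theme idea can be sketched roughly like this. The data frame below is a made-up stand-in (the real exercise uses the dinosaur data from the slides), and the column names horizontal and vertical are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical stand-in for the dinosaur data: any data frame with two
# numeric columns works. The column names are assumptions, not the real ones.
dino <- data.frame(
  horizontal = c(1, 2, 3, 4, 5, 4, 3, 2),
  vertical   = c(2, 4, 5, 4, 2, 1, 0.5, 1)
)

picture <- ggplot(dino) +
  # a fixed colour goes outside aes(), because it is not mapped to a variable
  geom_point(aes(x = horizontal, y = vertical), colour = "pink") +
  # theme_minimal() removes the grey panel background
  theme_minimal()

print(picture)
```

The key detail is that colour = "pink" sits outside aes(). Putting it inside aes() would treat "pink" as a data value and add a legend instead of setting the colour.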
Exercise 9 and 10: Handwriting data
Okay, this part took by far the longest to work through, but hey, I got there in the end.
# load the packages I need
library(tidyverse)
## ── Attaching packages ───────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
forensic <- read_csv("data_forensic.csv")
## Parsed with column specification:
## cols(
## participant = col_double(),
## handwriting_expert = col_character(),
## us = col_character(),
## condition = col_character(),
## age = col_double(),
## forensic_scientist = col_character(),
## forensic_specialty = col_character(),
## handwriting_reports = col_double(),
## confidence = col_double(),
## familiarity = col_double(),
## feature = col_character(),
## est = col_double(),
## true = col_double(),
## band = col_character()
## )
print(forensic)
## # A tibble: 5,700 x 14
## participant handwriting_exp… us condition age forensic_scient…
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1 HW Expert Non-… Non-US H… 52 Yes
## 2 1 HW Expert Non-… Non-US H… 52 Yes
## 3 1 HW Expert Non-… Non-US H… 52 Yes
## 4 1 HW Expert Non-… Non-US H… 52 Yes
## 5 1 HW Expert Non-… Non-US H… 52 Yes
## 6 1 HW Expert Non-… Non-US H… 52 Yes
## 7 1 HW Expert Non-… Non-US H… 52 Yes
## 8 1 HW Expert Non-… Non-US H… 52 Yes
## 9 1 HW Expert Non-… Non-US H… 52 Yes
## 10 1 HW Expert Non-… Non-US H… 52 Yes
## # … with 5,690 more rows, and 8 more variables: forensic_specialty <chr>,
## # handwriting_reports <dbl>, confidence <dbl>, familiarity <dbl>,
## # feature <chr>, est <dbl>, true <dbl>, band <chr>
# constructing the plot to make it interpretable and pretty
picture <- ggplot(forensic) +
  geom_boxplot(aes(band, est, fill = band)) +
  facet_wrap(facets = vars(handwriting_expert)) +
  theme_minimal() +
  scale_x_discrete(name = NULL, labels = NULL) +
  scale_y_continuous(name = "estimate") +
  ggtitle(label = "Handwriting estimate for experts and novices", subtitle = "Source: Martire et al.") +
  scale_fill_viridis_d(alpha = .5, name = NULL)
print(picture)
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
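That warning means some values of est could not be plotted, most likely because they are missing. A quick way to check is to count the non-finite values before plotting. This is only a sketch on made-up data: the column names band and est come from the data set above, but the values (including the band labels) are invented.

```r
# Hypothetical toy data standing in for forensic; only the column names
# band and est are taken from the real data, the values are made up.
toy <- data.frame(
  band = c("Band 01", "Band 01", "Band 02", "Band 02", "Band 03"),
  est  = c(10, NA, 35, 40, NA)
)

# Count the values a boxplot would drop (NA, NaN, Inf and -Inf)
n_dropped <- sum(!is.finite(toy$est))
print(n_dropped)

# Look at the affected rows before deciding what to do with them
dropped_rows <- toy[!is.finite(toy$est), ]
print(dropped_rows)
```

Running the same check on forensic$est should show which rows the warning refers to, assuming missing estimates are the cause.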
Extra exercise: Spotify songs
Out of all the different data sets I looked at on the tidytuesday website, this one seemed the most interesting.
#read in the data
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Parsed with column specification:
## cols(
## .default = col_double(),
## track_id = col_character(),
## track_name = col_character(),
## track_artist = col_character(),
## track_album_id = col_character(),
## track_album_name = col_character(),
## track_album_release_date = col_character(),
## playlist_name = col_character(),
## playlist_id = col_character(),
## playlist_genre = col_character(),
## playlist_subgenre = col_character()
## )
## See spec(...) for full column specifications.
print(spotify_songs)
## # A tibble: 32,833 x 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0… I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 2 0r7CVbZ… Memories … Maroon 5 67 63rPSO264uRjW…
## 3 1z1Hg7V… All the T… Zara Larsson 70 1HoSmj2eLcsrR…
## 4 75Fpbth… Call You … The Chainsm… 60 1nqYsOef1yKKu…
## 5 1e8PAfc… Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
## 6 7fvUMiy… Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
## 7 2OAylPU… Never Rea… Katy Perry 62 7INHYSeusaFly…
## 8 6b1RNvA… Post Malo… Sam Feldt 69 6703SRPsLkS4b…
## 9 7bF6tCO… Tough Lov… Avicii 68 7CvAfGvq4RlIw…
## 10 1IXGILk… If I Can'… Shawn Mendes 67 4QxzbfSsVryEQ…
## # … with 32,823 more rows, and 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
glimpse(spotify_songs)
## Observations: 32,833
## Variables: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdf…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lu…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "T…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, …
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Lux…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "2…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop …
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7c…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "p…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "danc…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.3…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.12…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030,…
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.14…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, …
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 1630…
Okay, nice. Now let’s see whether different song genres have different popularity ratings:
# constructing the plot
picture <- ggplot(data = spotify_songs) +
  geom_point(mapping = aes(x = playlist_genre, y = track_popularity)) +
  ggtitle(label = "Popularity of different song genres on spotify") +
  theme_minimal()
print(picture)
Alright, that’s a good start. But since song genre is categorical data, let’s use a different kind of plot to better visualise the data, so that we can actually see which genre is most popular.
# constructing the plot
picture <- ggplot(data = spotify_songs) +
  geom_boxplot(mapping = aes(x = playlist_genre, y = track_popularity)) +
  ggtitle(label = "Popularity of different song genres on spotify") +
  theme_minimal()
print(picture)
Now we can tell that these genres are not that different from each other in popularity. Pop seems to be the most popular, followed by latin, which surprised me a little, and edm seems to be the least popular genre here.
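Reading the ranking off a boxplot by eye is a bit fiddly, so one option is to compute the median per genre and sort the boxes by it. The sketch below uses made-up numbers; only the column names playlist_genre and track_popularity match spotify_songs, so the output says nothing about the real data.

```r
library(ggplot2)

# Made-up numbers; only the column names match spotify_songs
toy_songs <- data.frame(
  playlist_genre   = rep(c("pop", "latin", "edm"), each = 4),
  track_popularity = c(60, 70, 50, 52,  45, 55, 48, 50,  30, 40, 35, 45)
)

# Median popularity per genre, sorted from highest to lowest
genre_medians <- sort(
  tapply(toy_songs$track_popularity, toy_songs$playlist_genre, median),
  decreasing = TRUE
)
print(genre_medians)

# reorder() arranges the genres on the x axis from lowest to highest median
picture <- ggplot(toy_songs) +
  geom_boxplot(aes(x = reorder(playlist_genre, track_popularity, FUN = median),
                   y = track_popularity)) +
  scale_x_discrete(name = "playlist_genre") +
  theme_minimal()
print(picture)
```

Swapping toy_songs for spotify_songs should give the same kind of summary for the real data.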
Challenges
I struggled a lot with remembering the order of the functions, as well as where and how to add what I want to happen in the graph into my code, so practice should reduce this problem
I made a lot of silly mistakes, such as mismatching parentheses, missing spaces where they were needed, forgetting commas, and leaving out the print(picture) at the end that actually displays the plot
Doing the extra exercise was definitely challenging. There are still some things I tried that didn’t work, so I’ll be doing some googling and also asking for help during the Q&A sessions
Successes
Despite all the challenges, I am happy with my progress: I didn’t give up on the extra exercise and powered through to finally complete it
I was able to condense some parts of my code once I got comfortable with the functions, e.g. writing ggplot(blah) instead of ggplot(data = blah)
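The shorter form works because data is the first argument of ggplot(), so R can match it by position. A small sketch to convince myself, where blah is just a hypothetical data frame:

```r
library(ggplot2)

# Hypothetical data frame, named blah as in the text above
blah <- data.frame(x = 1:3, y = c(2, 4, 6))

# data is the first argument of ggplot(), so it can be passed by position
short_form <- ggplot(blah)
long_form  <- ggplot(data = blah)

# Both calls store the same data frame in the plot object
same_data <- identical(short_form$data, long_form$data)
print(same_data)
```

The same idea applies to aes(), where the first two arguments are x and y, which is why aes(band, est) worked in the handwriting plot.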
Moving Forward
There is definitely a lot more practice needed to remember everything I learnt this week, so I plan on revising this week’s content, perhaps with some extra exercises.
I will also work on the data wrangling videos and exercises, and play around a bit more with coding to get used to googling for answers, which I’m sure the assessment will involve.