What were the goals for this week?

The goal for this week was to continue to learn about R. This week we covered data wrangling, the process by which we transform our raw data into something meaningful or useful. As always, the aim was to overcome any challenges by consulting the internet as well as peers in the course.

How did I achieve my goals?

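As always, the first step is to load the tidyverse. Loading tidyverse already attaches dplyr (it appears in the package list below), so the separate library(dplyr) call is redundant but harmless.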

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)

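This week we learned about the pipe, %>%. The pipe passes the result on its left as the first argument to the function on its right, which lets us chain several data-wrangling steps together and read them from top to bottom. Speaking of data, let’s use the Small World of Words dataset introduced to us this week.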

data <- 'data_swow.csv.zip' %>% 
  read_tsv() %>%
  mutate(id = 1:n()) %>% 
  rename(
    n_response = R1,
    n_total = N,
    strength = R1.Strength
  )
## ! Multiple files in zip: reading ''swow.csv''
## 
## -- Column specification --------------------------------------------------------
## cols(
##   cue = col_character(),
##   response = col_character(),
##   R1 = col_double(),
##   N = col_double(),
##   R1.Strength = col_double()
## )
print(data)
## # A tibble: 483,636 x 6
##    cue   response n_response n_total strength    id
##    <chr> <chr>         <dbl>   <dbl>    <dbl> <int>
##  1 a     one              21      97   0.216      1
##  2 a     the              16      97   0.165      2
##  3 a     b                 9      97   0.0928     3
##  4 a     an                4      97   0.0412     4
##  5 a     first             3      97   0.0309     5
##  6 a     letter            3      97   0.0309     6
##  7 a     alphabet          2      97   0.0206     7
##  8 a     apple             2      97   0.0206     8
##  9 a     article           2      97   0.0206     9
## 10 a     bat               2      97   0.0206    10
## # ... with 483,626 more rows
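
As a quick illustration of what the pipe is doing in the code above, here is a minimal sketch on a toy vector. The vector `strengths` and the names `nested` and `piped` are made up for this example and are not part of the course material.

```r
library(dplyr)  # attaching dplyr also makes the %>% pipe available

# Toy values, made up for illustration
strengths <- c(0.455, 0.0808, 0.0606)

# Nested call: has to be read from the inside out
nested <- round(sort(strengths), 1)

# Piped call: the same steps, read from left to right
piped <- strengths %>% sort() %>% round(1)

# Both forms produce the same result
identical(nested, piped)
```

The piped version does exactly the same work as the nested one; the benefit is readability once several steps are chained.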

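I want to look at the associations for the words cat and dog, so let’s filter the dataset down to those rows. Forward associations (fwd) are rows where the word is the cue, and backward associations (bck) are rows where it is the response. In both cases I keep only responses that were given more than once (n_response > 1).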

dog_bck <- data %>% 
  filter(response == "dog", n_response > 1)

dog_fwd <- data %>% 
  filter(cue == "dog", n_response > 1)

cat_bck <- data %>% 
  filter(response == "cat", n_response > 1)

cat_fwd <- data %>% 
  filter(cue == "cat", n_response > 1)

dog_bck
## # A tibble: 202 x 6
##    cue        response n_response n_total strength    id
##    <chr>      <chr>         <dbl>   <dbl>    <dbl> <int>
##  1 a          dog               2      97   0.0206    11
##  2 aggressive dog               2     100   0.02    8209
##  3 agility    dog               2      97   0.0206  8344
##  4 animal     dog              17     100   0.17   14979
##  5 animals    dog               9     100   0.09   15035
##  6 attack     dog               2     100   0.02   25347
##  7 backyard   dog               2     100   0.02   28815
##  8 bad        dog               3     100   0.03   28925
##  9 barf       dog               3      94   0.0319 31894
## 10 bark       dog              50      99   0.505  32030
## # ... with 192 more rows
dog_fwd
## # A tibble: 8 x 6
##   cue   response n_response n_total strength     id
##   <chr> <chr>         <dbl>   <dbl>    <dbl>  <int>
## 1 dog   cat              45      99   0.455  126251
## 2 dog   pet               8      99   0.0808 126252
## 3 dog   canine            6      99   0.0606 126253
## 4 dog   bark              3      99   0.0303 126254
## 5 dog   puppy             3      99   0.0303 126255
## 6 dog   friend            2      99   0.0202 126256
## 7 dog   Labrador          2      99   0.0202 126257
## 8 dog   pup               2      99   0.0202 126258
cat_bck
## # A tibble: 123 x 6
##    cue             response n_response n_total strength    id
##    <chr>           <chr>         <dbl>   <dbl>    <dbl> <int>
##  1 adopt           cat               2     100   0.02    5427
##  2 agility         cat               2      97   0.0206  8343
##  3 alley           cat              17      99   0.172  10493
##  4 ally            cat               2      98   0.0204 10815
##  5 animal          cat               6     100   0.06   14980
##  6 animals         cat               5     100   0.05   15038
##  7 aristocrat      cat               6      95   0.0632 20423
##  8 aristocratic    cat               2      97   0.0206 20482
##  9 bad luck        cat               5     100   0.05   29106
## 10 black and white cat               2     100   0.02   40810
## # ... with 113 more rows
cat_fwd
## # A tibble: 11 x 6
##    cue   response n_response n_total strength    id
##    <chr> <chr>         <dbl>   <dbl>    <dbl> <int>
##  1 cat   dog              42     100     0.42 63164
##  2 cat   feline           13     100     0.13 63165
##  3 cat   meow              8     100     0.08 63166
##  4 cat   mouse             5     100     0.05 63167
##  5 cat   animal            3     100     0.03 63168
##  6 cat   kitty             3     100     0.03 63169
##  7 cat   furry             2     100     0.02 63170
##  8 cat   kitten            2     100     0.02 63171
##  9 cat   pussy             2     100     0.02 63172
## 10 cat   sat               2     100     0.02 63173
## 11 cat   soft              2     100     0.02 63174

This is still pretty untidy though. I want to see the top associations for cats and dogs ranked from strongest to weakest.

dog_bck <- data %>% 
  filter(response == "dog", n_response > 1) %>%
  arrange(desc(strength))

dog_fwd <- data %>% 
  filter(cue == "dog", n_response > 1) %>% 
  arrange(desc(strength))

cat_bck <- data %>% 
  filter(response == "cat", n_response > 1) %>% 
  arrange(desc(strength))

cat_fwd <- data %>% 
  filter(cue == "cat", n_response > 1) %>% 
  arrange(desc(strength))

dog_bck
## # A tibble: 202 x 6
##    cue       response n_response n_total strength     id
##    <chr>     <chr>         <dbl>   <dbl>    <dbl>  <int>
##  1 hound     dog              80      96    0.833 209149
##  2 beagle    dog              75      95    0.789  33997
##  3 canine    dog              74      97    0.763  59553
##  4 leash     dog              74      97    0.763 243783
##  5 barking   dog              74     100    0.74   32043
##  6 poodle    dog              71      98    0.724 322834
##  7 woof      dog              71      98    0.724 478229
##  8 dalmatian dog              69      99    0.697 105340
##  9 husky     dog              62      97    0.639 211580
## 10 pup       dog              60      95    0.632 337475
## # ... with 192 more rows
dog_fwd
## # A tibble: 8 x 6
##   cue   response n_response n_total strength     id
##   <chr> <chr>         <dbl>   <dbl>    <dbl>  <int>
## 1 dog   cat              45      99   0.455  126251
## 2 dog   pet               8      99   0.0808 126252
## 3 dog   canine            6      99   0.0606 126253
## 4 dog   bark              3      99   0.0303 126254
## 5 dog   puppy             3      99   0.0303 126255
## 6 dog   friend            2      99   0.0202 126256
## 7 dog   Labrador          2      99   0.0202 126257
## 8 dog   pup               2      99   0.0202 126258
cat_bck
## # A tibble: 123 x 6
##    cue      response n_response n_total strength     id
##    <chr>    <chr>         <dbl>   <dbl>    <dbl>  <int>
##  1 feline   cat              81      94    0.862 158255
##  2 meow     cat              81      98    0.827 266740
##  3 purr     cat              78      99    0.788 337950
##  4 kitty    cat              77     100    0.77  238392
##  5 whiskers cat              66      99    0.667 473410
##  6 whisker  cat              55      97    0.567 473379
##  7 pussy    cat              51     100    0.51  338221
##  8 puma     cat              46     100    0.46  337022
##  9 dog      cat              45      99    0.455 126251
## 10 puss     cat              41      94    0.436 338192
## # ... with 113 more rows
cat_fwd
## # A tibble: 11 x 6
##    cue   response n_response n_total strength    id
##    <chr> <chr>         <dbl>   <dbl>    <dbl> <int>
##  1 cat   dog              42     100     0.42 63164
##  2 cat   feline           13     100     0.13 63165
##  3 cat   meow              8     100     0.08 63166
##  4 cat   mouse             5     100     0.05 63167
##  5 cat   animal            3     100     0.03 63168
##  6 cat   kitty             3     100     0.03 63169
##  7 cat   furry             2     100     0.02 63170
##  8 cat   kitten            2     100     0.02 63171
##  9 cat   pussy             2     100     0.02 63172
## 10 cat   sat               2     100     0.02 63173
## 11 cat   soft              2     100     0.02 63174
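
The four blocks above differ only in the target word and in which column is matched, so one possible way to cut the repetition is a small helper function. This is only a sketch: `get_associations()` and the toy table `toy` are hypothetical names invented here, and the toy values are made up (only the column names follow the SWOW table).

```r
library(dplyr)

# Hypothetical helper (not part of the course material).
# direction = "forward"  keeps rows where `target` is the cue;
# direction = "backward" keeps rows where `target` is the response.
# Rows are returned strongest first.
get_associations <- function(df, target, direction = c("forward", "backward")) {
  direction <- match.arg(direction)
  keep <- if (direction == "forward") df$cue == target else df$response == target
  df %>%
    filter(keep, n_response > 1) %>%
    arrange(desc(strength))
}

# Toy data with the same column names as the SWOW table (values made up)
toy <- tibble(
  cue        = c("dog", "dog", "cat", "bark"),
  response   = c("cat", "pet", "dog", "dog"),
  n_response = c(45, 8, 42, 50),
  strength   = c(0.45, 0.08, 0.42, 0.50)
)

toy_dog_bck <- get_associations(toy, "dog", "backward")
toy_dog_fwd <- get_associations(toy, "dog", "forward")
```

With the real data, a call such as get_associations(data, "dog", "backward") should give the same rows as the dog_bck block above, assuming the column names are unchanged.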

Let’s transform our data so we know whether dog/cat is the cue or the response, record the other word in the pair as the associate, and use select() to keep only some of the columns so the data is easier to read.

dog_bck <- data %>%
  filter(response == "dog", n_response > 1) %>%
  select(cue, response, strength, id) %>%
  mutate(
    rank = rank(-strength), 
    type = "backward",        
    word = "dog",          
    associate = cue     
  )

dog_fwd <- data %>%
  filter(cue == "dog", n_response > 1) %>%
  select(cue, response, strength, id) %>%
  mutate(
    rank = rank(-strength),  
    type = "forward",        
    word = "dog",         
    associate = response    
  )

cat_bck <- data %>%
  filter(response == "cat", n_response > 1) %>%
  select(cue, response, strength, id) %>%
  mutate(
    rank = rank(-strength), 
    type = "backward",        
    word = "cat",         
    associate = cue     
  )

cat_fwd <- data %>%
  filter(cue == "cat", n_response > 1) %>%
  select(cue, response, strength, id) %>%
  mutate(
    rank = rank(-strength), 
    type = "forward",        
    word = "cat",         
    associate = response
  )

dog_bck
## # A tibble: 202 x 8
##    cue        response strength    id  rank type     word  associate 
##    <chr>      <chr>       <dbl> <int> <dbl> <chr>    <chr> <chr>     
##  1 a          dog        0.0206    11  150. backward dog   a         
##  2 aggressive dog        0.02    8209  193  backward dog   aggressive
##  3 agility    dog        0.0206  8344  150. backward dog   agility   
##  4 animal     dog        0.17   14979   47  backward dog   animal    
##  5 animals    dog        0.09   15035   75  backward dog   animals   
##  6 attack     dog        0.02   25347  193  backward dog   attack    
##  7 backyard   dog        0.02   28815  193  backward dog   backyard  
##  8 bad        dog        0.03   28925  136  backward dog   bad       
##  9 barf       dog        0.0319 31894  118. backward dog   barf      
## 10 bark       dog        0.505  32030   14  backward dog   bark      
## # ... with 192 more rows
dog_fwd
## # A tibble: 8 x 8
##   cue   response strength     id  rank type    word  associate
##   <chr> <chr>       <dbl>  <int> <dbl> <chr>   <chr> <chr>    
## 1 dog   cat        0.455  126251   1   forward dog   cat      
## 2 dog   pet        0.0808 126252   2   forward dog   pet      
## 3 dog   canine     0.0606 126253   3   forward dog   canine   
## 4 dog   bark       0.0303 126254   4.5 forward dog   bark     
## 5 dog   puppy      0.0303 126255   4.5 forward dog   puppy    
## 6 dog   friend     0.0202 126256   7   forward dog   friend   
## 7 dog   Labrador   0.0202 126257   7   forward dog   Labrador 
## 8 dog   pup        0.0202 126258   7   forward dog   pup
cat_bck
## # A tibble: 123 x 8
##    cue             response strength    id  rank type     word  associate      
##    <chr>           <chr>       <dbl> <int> <dbl> <chr>    <chr> <chr>          
##  1 adopt           cat        0.02    5427 116   backward cat   adopt          
##  2 agility         cat        0.0206  8343  96   backward cat   agility        
##  3 alley           cat        0.172  10493  24   backward cat   alley          
##  4 ally            cat        0.0204 10815 100   backward cat   ally           
##  5 animal          cat        0.06   14980  52.5 backward cat   animal         
##  6 animals         cat        0.05   15038  63   backward cat   animals        
##  7 aristocrat      cat        0.0632 20423  48   backward cat   aristocrat     
##  8 aristocratic    cat        0.0206 20482  96   backward cat   aristocratic   
##  9 bad luck        cat        0.05   29106  63   backward cat   bad luck       
## 10 black and white cat        0.02   40810 116   backward cat   black and white
## # ... with 113 more rows
cat_fwd
## # A tibble: 11 x 8
##    cue   response strength    id  rank type    word  associate
##    <chr> <chr>       <dbl> <int> <dbl> <chr>   <chr> <chr>    
##  1 cat   dog          0.42 63164   1   forward cat   dog      
##  2 cat   feline       0.13 63165   2   forward cat   feline   
##  3 cat   meow         0.08 63166   3   forward cat   meow     
##  4 cat   mouse        0.05 63167   4   forward cat   mouse    
##  5 cat   animal       0.03 63168   5.5 forward cat   animal   
##  6 cat   kitty        0.03 63169   5.5 forward cat   kitty    
##  7 cat   furry        0.02 63170   9   forward cat   furry    
##  8 cat   kitten       0.02 63171   9   forward cat   kitten   
##  9 cat   pussy        0.02 63172   9   forward cat   pussy    
## 10 cat   sat          0.02 63173   9   forward cat   sat      
## 11 cat   soft         0.02 63174   9   forward cat   soft
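
The fractional ranks in the output above (such as 4.5) come from how rank() handles ties: by default, tied values share the average of the positions they occupy. The sketch below reproduces this with the strengths copied from the dog_fwd output; the variable names are made up for this example, and the "min" alternative is shown only for comparison, not as something the course used.

```r
# Strengths copied from the dog_fwd output above
strength <- c(0.455, 0.0808, 0.0606, 0.0303, 0.0303, 0.0202, 0.0202, 0.0202)

# Default ties.method = "average": tied values share the mean of their positions
avg_rank <- rank(-strength)

# ties.method = "min": tied values all take the best (lowest) position instead
low_rank <- rank(-strength, ties.method = "min")

avg_rank
low_rank
```

Negating strength is what makes the strongest association rank first, since rank() orders from smallest to largest.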

Now we’ll combine the data sets, removing the unsurprising associations between dog and cat themselves.

pet <- bind_rows(dog_bck, dog_fwd,
                 cat_bck, cat_fwd) %>%
  select(id:associate) %>%
  filter(associate != "cat", associate != "dog")

pet
## # A tibble: 340 x 5
##       id  rank type     word  associate 
##    <int> <dbl> <chr>    <chr> <chr>     
##  1    11  150. backward dog   a         
##  2  8209  193  backward dog   aggressive
##  3  8344  150. backward dog   agility   
##  4 14979   47  backward dog   animal    
##  5 15035   75  backward dog   animals   
##  6 25347  193  backward dog   attack    
##  7 28815  193  backward dog   backyard  
##  8 28925  136  backward dog   bad       
##  9 31894  118. backward dog   barf      
## 10 32030   14  backward dog   bark      
## # ... with 330 more rows

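We can then use pivot_wider() to put the cat and dog ranks side by side for each associate. Taking 1/rank gives a score that is largest for the top-ranked associates, replace_na() sets the score to 0 when an associate appears for only one of the pets, and diff is the cat score minus the dog score. Finally, we plot the results with ggplot.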

pet_fwd <- pet %>% 
  filter(
    type == "forward"
  ) %>% 
  pivot_wider(
    id_cols = associate, 
    names_from = word, 
    values_from = rank
  ) %>% 
  mutate(
    cat = replace_na(1/cat, 0),
    dog = replace_na(1/dog, 0),
    diff = cat - dog
  ) %>% 
  arrange(diff)

picture_fwd <- ggplot(
  data = pet_fwd,
  mapping = aes(
    x = associate %>% reorder(diff),
    y = diff
  )) +
  geom_col() +
  coord_flip()
 
plot(picture_fwd)

pet_bck <- pet %>% 
  filter(
    type == "backward"
  ) %>% 
  pivot_wider(
    id_cols = associate, 
    names_from = word, 
    values_from = rank
  ) %>% 
  mutate(
    cat = replace_na(1/cat, 0),
    dog = replace_na(1/dog, 0),
    diff = cat - dog
  ) %>% 
  filter(diff < -0.05 | diff > 0.05) %>% 
  arrange(diff) 

picture_bck <- ggplot(
  data = pet_bck,
  mapping = aes(
    x = associate %>% reorder(diff),
    y = diff
  )) + 
  geom_col() + 
  coord_flip()

plot(picture_bck)
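
To make the pivot step easier to follow, here is a minimal sketch of the same pivot_wider(), replace_na() and diff steps on a toy table. The table `toy_pet` and its values are made up for illustration; only the column names mirror the pet data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the `pet` table (values made up for illustration)
toy_pet <- tibble(
  word      = c("cat", "cat", "dog", "dog"),
  associate = c("animal", "meow", "animal", "bark"),
  rank      = c(2, 1, 4, 1)
)

toy_wide <- toy_pet %>%
  pivot_wider(
    id_cols = associate,   # one row per associate
    names_from = word,     # one column per pet
    values_from = rank     # filled with that pet's rank (NA if missing)
  ) %>%
  mutate(
    cat = replace_na(1 / cat, 0),  # reciprocal rank, 0 if not a cat associate
    dog = replace_na(1 / dog, 0),  # reciprocal rank, 0 if not a dog associate
    diff = cat - dog               # positive = more cat, negative = more dog
  ) %>%
  arrange(diff)

toy_wide
```

In this toy case "bark" ends up at the dog end of the scale, "meow" at the cat end, and "animal", which appears for both pets, sits in between.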

What are the next steps?

For next week I need to really work on time management. As things are getting busy with uni and my personal life, it is imperative that I maximise my productivity. I suspect we will be diving into the group assignment next week, so I need to keep up to date with the workshops to make sure I do not drag my group down.