What were the goals for this week?
The goal for this week was to continue to learn about R. This week we covered data wrangling, the process by which we transform our raw data into something meaningful or useful. As always, the aim was to overcome any challenges by consulting the internet as well as peers in the course.
How did I achieve my goals?
As always, the first step to success is to load tidyverse and dplyr. (dplyr is already attached by tidyverse, so the second call is redundant but harmless.)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
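library(dplyr)
This week we learned about the pipe operator, %>%. The pipe passes the result on its left to the function on its right as that function’s first argument, so a chain of transformations can be read from top to bottom. Speaking of data, let’s use the Small World of Words dataset introduced to us this week.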
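data <- 'data_swow.csv.zip' %>%
  read_tsv() %>%            # the file parses as tab-separated despite its .csv name
  mutate(id = 1:n()) %>%    # add a row id
  rename(                   # rename(new_name = old_name)
    n_response = R1,
    n_total = N,
    strength = R1.Strength
  )
## ! Multiple files in zip: reading ''swow.csv''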
##
## -- Column specification --------------------------------------------------------
## cols(
## cue = col_character(),
## response = col_character(),
## R1 = col_double(),
## N = col_double(),
## R1.Strength = col_double()
## )
print(data)
## # A tibble: 483,636 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 a one 21 97 0.216 1
## 2 a the 16 97 0.165 2
## 3 a b 9 97 0.0928 3
## 4 a an 4 97 0.0412 4
## 5 a first 3 97 0.0309 5
## 6 a letter 3 97 0.0309 6
## 7 a alphabet 2 97 0.0206 7
## 8 a apple 2 97 0.0206 8
## 9 a article 2 97 0.0206 9
## 10 a bat 2 97 0.0206 10
## # ... with 483,626 more rows
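To make the pipe a little more concrete, here is a minimal sketch comparing a nested call with the equivalent piped call. The `toy` tibble is made up purely for illustration and is not part of the Small World of Words data.

```r
library(dplyr)
library(tibble)

# Toy data, invented for this illustration only
toy <- tibble(word = c("a", "b", "c"), count = c(3, 1, 2))

# Nested call: has to be read from the inside out
nested <- arrange(filter(toy, count > 1), desc(count))

# Piped call: the same steps, read from top to bottom
piped <- toy %>%
  filter(count > 1) %>%
  arrange(desc(count))

# Both versions keep the same rows in the same order
identical(nested$word, piped$word)
```

Both versions do the same work; the piped one simply lists the steps in the order they happen.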
I want to look at the associations to the words cat and dog, so let’s tidy up our dataset.
dog_bck <- data %>%
  filter(response == "dog", n_response > 1)
dog_fwd <- data %>%
  filter(cue == "dog", n_response > 1)
cat_bck <- data %>%
  filter(response == "cat", n_response > 1)
cat_fwd <- data %>%
  filter(cue == "cat", n_response > 1)
dog_bck
## # A tibble: 202 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 a dog 2 97 0.0206 11
## 2 aggressive dog 2 100 0.02 8209
## 3 agility dog 2 97 0.0206 8344
## 4 animal dog 17 100 0.17 14979
## 5 animals dog 9 100 0.09 15035
## 6 attack dog 2 100 0.02 25347
## 7 backyard dog 2 100 0.02 28815
## 8 bad dog 3 100 0.03 28925
## 9 barf dog 3 94 0.0319 31894
## 10 bark dog 50 99 0.505 32030
## # ... with 192 more rows
dog_fwd
## # A tibble: 8 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 dog cat 45 99 0.455 126251
## 2 dog pet 8 99 0.0808 126252
## 3 dog canine 6 99 0.0606 126253
## 4 dog bark 3 99 0.0303 126254
## 5 dog puppy 3 99 0.0303 126255
## 6 dog friend 2 99 0.0202 126256
## 7 dog Labrador 2 99 0.0202 126257
## 8 dog pup 2 99 0.0202 126258
cat_bck
## # A tibble: 123 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 adopt cat 2 100 0.02 5427
## 2 agility cat 2 97 0.0206 8343
## 3 alley cat 17 99 0.172 10493
## 4 ally cat 2 98 0.0204 10815
## 5 animal cat 6 100 0.06 14980
## 6 animals cat 5 100 0.05 15038
## 7 aristocrat cat 6 95 0.0632 20423
## 8 aristocratic cat 2 97 0.0206 20482
## 9 bad luck cat 5 100 0.05 29106
## 10 black and white cat 2 100 0.02 40810
## # ... with 113 more rows
cat_fwd
## # A tibble: 11 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 cat dog 42 100 0.42 63164
## 2 cat feline 13 100 0.13 63165
## 3 cat meow 8 100 0.08 63166
## 4 cat mouse 5 100 0.05 63167
## 5 cat animal 3 100 0.03 63168
## 6 cat kitty 3 100 0.03 63169
## 7 cat furry 2 100 0.02 63170
## 8 cat kitten 2 100 0.02 63171
## 9 cat pussy 2 100 0.02 63172
## 10 cat sat 2 100 0.02 63173
## 11 cat soft 2 100 0.02 63174
This is still pretty untidy though. I want to see the top associations for cats and dogs ranked from strongest to weakest.
dog_bck <- data %>%
  filter(response == "dog", n_response > 1) %>%
  arrange(desc(strength))
dog_fwd <- data %>%
  filter(cue == "dog", n_response > 1) %>%
  arrange(desc(strength))
cat_bck <- data %>%
  filter(response == "cat", n_response > 1) %>%
  arrange(desc(strength))
cat_fwd <- data %>%
  filter(cue == "cat", n_response > 1) %>%
  arrange(desc(strength))
dog_bck
## # A tibble: 202 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 hound dog 80 96 0.833 209149
## 2 beagle dog 75 95 0.789 33997
## 3 canine dog 74 97 0.763 59553
## 4 leash dog 74 97 0.763 243783
## 5 barking dog 74 100 0.74 32043
## 6 poodle dog 71 98 0.724 322834
## 7 woof dog 71 98 0.724 478229
## 8 dalmatian dog 69 99 0.697 105340
## 9 husky dog 62 97 0.639 211580
## 10 pup dog 60 95 0.632 337475
## # ... with 192 more rows
dog_fwd
## # A tibble: 8 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 dog cat 45 99 0.455 126251
## 2 dog pet 8 99 0.0808 126252
## 3 dog canine 6 99 0.0606 126253
## 4 dog bark 3 99 0.0303 126254
## 5 dog puppy 3 99 0.0303 126255
## 6 dog friend 2 99 0.0202 126256
## 7 dog Labrador 2 99 0.0202 126257
## 8 dog pup 2 99 0.0202 126258
cat_bck
## # A tibble: 123 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 feline cat 81 94 0.862 158255
## 2 meow cat 81 98 0.827 266740
## 3 purr cat 78 99 0.788 337950
## 4 kitty cat 77 100 0.77 238392
## 5 whiskers cat 66 99 0.667 473410
## 6 whisker cat 55 97 0.567 473379
## 7 pussy cat 51 100 0.51 338221
## 8 puma cat 46 100 0.46 337022
## 9 dog cat 45 99 0.455 126251
## 10 puss cat 41 94 0.436 338192
## # ... with 113 more rows
cat_fwd
## # A tibble: 11 x 6
## cue response n_response n_total strength id
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 cat dog 42 100 0.42 63164
## 2 cat feline 13 100 0.13 63165
## 3 cat meow 8 100 0.08 63166
## 4 cat mouse 5 100 0.05 63167
## 5 cat animal 3 100 0.03 63168
## 6 cat kitty 3 100 0.03 63169
## 7 cat furry 2 100 0.02 63170
## 8 cat kitten 2 100 0.02 63171
## 9 cat pussy 2 100 0.02 63172
## 10 cat sat 2 100 0.02 63173
## 11 cat soft 2 100 0.02 63174
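If only the few strongest associations are wanted, one possible alternative to sorting the whole table is dplyr's `slice_max()`. The sketch below uses a handful of rows copied from the cat_fwd output above; the names `cat_fwd_toy` and `top3` are placeholders for illustration.

```r
library(dplyr)
library(tibble)

# A few rows copied from the cat_fwd output above
cat_fwd_toy <- tibble(
  response = c("dog", "feline", "meow", "mouse", "animal"),
  strength = c(0.42, 0.13, 0.08, 0.05, 0.03)
)

# Keep only the three rows with the largest strength values
top3 <- cat_fwd_toy %>%
  slice_max(strength, n = 3)

top3
```

By default `slice_max()` keeps tied rows, so it can return more than `n` rows when strengths are equal.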
Let’s transform our data so we know whether dog/cat is the cue (a forward association) or the response (a backward association), and also drop some of the columns to make the data easier to read.
dog_bck <- data %>%
  filter(response == "dog", n_response > 1) %>%
  select(cue, response, strength, id) %>%
  mutate(
    rank = rank(-strength),
    type = "backward",
    word = "dog",
    associate = cue
  )
dog_fwd <- data %>%
  filter(cue == "dog", n_response > 1) %>%
  select(cue, response, strength, id) %>%
  mutate(
    rank = rank(-strength),
    type = "forward",
    word = "dog",
    associate = response
  )
cat_bck <- data %>%
  filter(response == "cat", n_response > 1) %>%
  select(cue, response, strength, id) %>%
  mutate(
    rank = rank(-strength),
    type = "backward",
    word = "cat",
    associate = cue
  )
cat_fwd <- data %>%
  filter(cue == "cat", n_response > 1) %>%
  select(cue, response, strength, id) %>%
  mutate(
    rank = rank(-strength),
    type = "forward",
    word = "cat",
    associate = response
  )
dog_bck
## # A tibble: 202 x 8
## cue response strength id rank type word associate
## <chr> <chr> <dbl> <int> <dbl> <chr> <chr> <chr>
## 1 a dog 0.0206 11 150. backward dog a
## 2 aggressive dog 0.02 8209 193 backward dog aggressive
## 3 agility dog 0.0206 8344 150. backward dog agility
## 4 animal dog 0.17 14979 47 backward dog animal
## 5 animals dog 0.09 15035 75 backward dog animals
## 6 attack dog 0.02 25347 193 backward dog attack
## 7 backyard dog 0.02 28815 193 backward dog backyard
## 8 bad dog 0.03 28925 136 backward dog bad
## 9 barf dog 0.0319 31894 118. backward dog barf
## 10 bark dog 0.505 32030 14 backward dog bark
## # ... with 192 more rows
dog_fwd
## # A tibble: 8 x 8
## cue response strength id rank type word associate
## <chr> <chr> <dbl> <int> <dbl> <chr> <chr> <chr>
## 1 dog cat 0.455 126251 1 forward dog cat
## 2 dog pet 0.0808 126252 2 forward dog pet
## 3 dog canine 0.0606 126253 3 forward dog canine
## 4 dog bark 0.0303 126254 4.5 forward dog bark
## 5 dog puppy 0.0303 126255 4.5 forward dog puppy
## 6 dog friend 0.0202 126256 7 forward dog friend
## 7 dog Labrador 0.0202 126257 7 forward dog Labrador
## 8 dog pup 0.0202 126258 7 forward dog pup
cat_bck
## # A tibble: 123 x 8
## cue response strength id rank type word associate
## <chr> <chr> <dbl> <int> <dbl> <chr> <chr> <chr>
## 1 adopt cat 0.02 5427 116 backward cat adopt
## 2 agility cat 0.0206 8343 96 backward cat agility
## 3 alley cat 0.172 10493 24 backward cat alley
## 4 ally cat 0.0204 10815 100 backward cat ally
## 5 animal cat 0.06 14980 52.5 backward cat animal
## 6 animals cat 0.05 15038 63 backward cat animals
## 7 aristocrat cat 0.0632 20423 48 backward cat aristocrat
## 8 aristocratic cat 0.0206 20482 96 backward cat aristocratic
## 9 bad luck cat 0.05 29106 63 backward cat bad luck
## 10 black and white cat 0.02 40810 116 backward cat black and white
## # ... with 113 more rows
cat_fwd
## # A tibble: 11 x 8
## cue response strength id rank type word associate
## <chr> <chr> <dbl> <int> <dbl> <chr> <chr> <chr>
## 1 cat dog 0.42 63164 1 forward cat dog
## 2 cat feline 0.13 63165 2 forward cat feline
## 3 cat meow 0.08 63166 3 forward cat meow
## 4 cat mouse 0.05 63167 4 forward cat mouse
## 5 cat animal 0.03 63168 5.5 forward cat animal
## 6 cat kitty 0.03 63169 5.5 forward cat kitty
## 7 cat furry 0.02 63170 9 forward cat furry
## 8 cat kitten 0.02 63171 9 forward cat kitten
## 9 cat pussy 0.02 63172 9 forward cat pussy
## 10 cat sat 0.02 63173 9 forward cat sat
## 11 cat soft 0.02 63174 9 forward cat soft
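The fractional ranks in the output above (4.5, 5.5 and so on) come from how base R's `rank()` handles ties: by default, tied values share the average of the positions they occupy. The sketch below reproduces this with the strengths copied from the dog_fwd output, and shows `ties.method = "min"` as one possible alternative if whole-number ranks are preferred.

```r
# Strengths copied from the dog_fwd output above
strength <- c(0.455, 0.0808, 0.0606, 0.0303, 0.0303, 0.0202, 0.0202, 0.0202)

# Default ties.method = "average": tied values share the mean of their positions
rank_avg <- rank(-strength)
rank_avg
# 1.0 2.0 3.0 4.5 4.5 7.0 7.0 7.0

# ties.method = "min": tied values all take the best position in the tie
rank_min <- rank(-strength, ties.method = "min")
rank_min
# 1 2 3 4 4 6 6 6
```

Negating `strength` is what makes the strongest association rank first, since `rank()` otherwise ranks from smallest to largest.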
Now we’ll combine the data sets, removing boring associations such as dog/cat.
pet <- bind_rows(dog_bck, dog_fwd,
                 cat_bck, cat_fwd) %>%
  select(id:associate) %>%
  filter(associate != "cat", associate != "dog")
pet
## # A tibble: 340 x 5
## id rank type word associate
## <int> <dbl> <chr> <chr> <chr>
## 1 11 150. backward dog a
## 2 8209 193 backward dog aggressive
## 3 8344 150. backward dog agility
## 4 14979 47 backward dog animal
## 5 15035 75 backward dog animals
## 6 25347 193 backward dog attack
## 7 28815 193 backward dog backyard
## 8 28925 136 backward dog bad
## 9 31894 118. backward dog barf
## 10 32030 14 backward dog bark
## # ... with 330 more rows
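The four tables combined above were built with nearly identical code, so the shared steps could be wrapped in a function. The sketch below is one way this might look; `get_associates()` and `swow_toy` are hypothetical names invented here, and the toy rows are copied from the outputs above rather than read from the real file.

```r
library(dplyr)
library(tibble)

# Hypothetical helper (not from the course material): builds one of the four
# tables for a given target word and direction
get_associates <- function(data, target_word, direction = c("forward", "backward")) {
  direction <- match.arg(direction)
  # Forward: the target is the cue. Backward: the target is the response.
  rows <- if (direction == "forward") {
    filter(data, cue == target_word, n_response > 1)
  } else {
    filter(data, response == target_word, n_response > 1)
  }
  rows %>%
    select(cue, response, strength, id) %>%
    mutate(
      rank = rank(-strength),
      type = direction,
      word = target_word,
      associate = if (direction == "forward") response else cue
    )
}

# Four rows copied from the outputs above, standing in for the full dataset
swow_toy <- tibble(
  cue = c("dog", "dog", "cat", "cat"),
  response = c("cat", "pet", "dog", "feline"),
  n_response = c(45, 8, 42, 13),
  n_total = c(99, 99, 100, 100),
  strength = c(0.455, 0.0808, 0.42, 0.13),
  id = c(126251L, 126252L, 63164L, 63165L)
)

pet_toy <- bind_rows(
  get_associates(swow_toy, "dog", "backward"),
  get_associates(swow_toy, "dog", "forward"),
  get_associates(swow_toy, "cat", "backward"),
  get_associates(swow_toy, "cat", "forward")
)
pet_toy
```

With the real data, `swow_toy` would be replaced by `data`; the function assumes the data has no column called `target_word` or `direction`.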
We can then use pivot_wider() to compare the top-rated associations for both pets, and then plot the results with ggplot.
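As a small illustration of that reshaping step before it is applied to the full data, the sketch below widens a tiny table so that each associate gets one row with a column per pet. The ranks are copied from the forward tables above; `toy_long` and `toy_wide` are placeholder names. Taking 1/rank turns rank 1 into the largest score, and an associate that is missing for one pet gets a score of 0.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Ranks copied from the dog_fwd and cat_fwd outputs above
toy_long <- tibble(
  associate = c("pet", "canine", "feline", "meow"),
  word      = c("dog", "dog", "cat", "cat"),
  rank      = c(2, 3, 2, 3)
)

toy_wide <- toy_long %>%
  # One row per associate, one column per word, filled with the rank
  pivot_wider(id_cols = associate, names_from = word, values_from = rank) %>%
  mutate(
    cat = replace_na(1 / cat, 0),  # reciprocal rank, 0 where there is no cat rank
    dog = replace_na(1 / dog, 0),  # reciprocal rank, 0 where there is no dog rank
    diff = cat - dog               # positive leans cat, negative leans dog
  )
toy_wide
```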
pet_fwd <- pet %>%
  filter(
    type == "forward"
  ) %>%
  pivot_wider(
    id_cols = associate,
    names_from = word,
    values_from = rank
  ) %>%
  mutate(
    cat = replace_na(1/cat, 0),
    dog = replace_na(1/dog, 0),
    diff = cat - dog
  ) %>%
  arrange(diff)
picture_fwd <- ggplot(
  data = pet_fwd,
  mapping = aes(
    x = associate %>% reorder(diff),
    y = diff
  )) +
  geom_col() +
  coord_flip()
plot(picture_fwd)
pet_bck <- pet %>%
  filter(
    type == "backward"
  ) %>%
  pivot_wider(
    id_cols = associate,
    names_from = word,
    values_from = rank
  ) %>%
  mutate(
    cat = replace_na(1/cat, 0),
    dog = replace_na(1/dog, 0),
    diff = cat - dog
  ) %>%
  filter(diff < -0.05 | diff > 0.05) %>%
  arrange(diff)
picture_bck <- ggplot(
  data = pet_bck,
  mapping = aes(
    x = associate %>% reorder(diff),
    y = diff
  )) +
  geom_col() +
  coord_flip()
plot(picture_bck)
What are the next steps?
For next week I need to really work on time management. As things are getting busy with uni and my personal life, it is imperative I maximise my productivity. Next week I suspect we will really be diving into the group assignment so I need to ensure I keep up to date with the workshops in order to make sure I do not drag my group down.