MATH217 Dog Project

Author

Julian Beckert

Libraries and dataset

library(tidyverse)
library(tidymodels)
library(ggridges)
library(ggdogs) 
doggies <- read_csv("dog_and_non_dog_owners.csv")

Data cleaning

Image rating columns

names(doggies)[73:120]
 [1] "cute1"                 "cute2"                 "cute3"                
 [4] "cute4"                 "cute5"                 "cute6"                
 [7] "cute7"                 "cute_total"            "cute_mean"            
[10] "excite1"               "excite2"               "excite3"              
[13] "excite4"               "excite5"               "excite6"              
[16] "excite7"               "excite_total"          "excite_mean"          
[19] "infan1"                "infan2"                "infan3"               
[22] "infan4"                "infan5"                "infan6"               
[25] "infan7"                "infan_total"           "infan_mean"           
[28] "pleasant1"             "pleasant2"             "pleasant3"            
[31] "pleasant4"             "pleasant5"             "pleasnat6"            
[34] "pleasant7"             "pleasant_total"        "pleasant_mean"        
[37] "close1"                "close2"                "close3"               
[40] "close4"                "close5"                "close6"               
[43] "close7"                "close_total"           "close_mean"           
[46] "love_dogs"             "other_pet_owner_group" "dog_owner_group"      
# I will need to select these columns later:
cat(rep(paste0("cute", 1:7)))
cute1 cute2 cute3 cute4 cute5 cute6 cute7
cat(rep(paste0("excite", 1:7)))
excite1 excite2 excite3 excite4 excite5 excite6 excite7
cat(rep(paste0("infan", 1:7)))
infan1 infan2 infan3 infan4 infan5 infan6 infan7
cat(rep(paste0("pleasant", 1:7)))
pleasant1 pleasant2 pleasant3 pleasant4 pleasant5 pleasant6 pleasant7
cat(rep(paste0("close", 1:7)))
close1 close2 close3 close4 close5 close6 close7
# mutate values in the columns above to be categorical: not cute, somewhat cute, very cute
dogratings <- doggies %>% 
  select(2, 74:118)


# I might also make groups based on their ratings of the dog pictures

Task performance columns

These columns are confusing, I’m going to have to read the publication thoroughly to really understand them. I also might not use them.

#

New dataframe with dog opinions

This could also just be a new column in the main dataset + remove the 2 rows with NA for love_dogs from the entire dataset.

sum(is.na(doggies$love_dogs)) # 2 rows
[1] 2
doggieopinions <- doggies %>% 
  filter(!is.na(love_dogs)) %>%
  mutate(love_binary = 
    if_else(love_dogs <= 4, 0, 1)
    )

Exploratory plots

Dog visualizations

doggies %>% 
  ggplot() +
  geom_density(aes(x = age))

The multimodal age distribution shows that the participants are all relatively young, and based on the peaks, the majority are in their early 20s or about 30. The second peak at 30 is somewhat interesting, but likely due to the small number of participants.

doggies %>% 
  ggplot() +
  geom_point(aes(x = age, y = cute_total, color = sex))

It really doesn’t look like there’s any correlation at all with any of these variables (age, sex, and total cuteness ratings).

lovedog0to3 <- sum(doggieopinions$love_dogs == 0)+sum(doggieopinions$love_dogs == 1)+sum(doggieopinions$love_dogs == 2)+sum(doggieopinions$love_dogs == 3)

lovedog0to4 <- sum(doggieopinions$love_dogs == 0)+sum(doggieopinions$love_dogs == 1)+sum(doggieopinions$love_dogs == 2)+sum(doggieopinions$love_dogs == 3)+sum(doggieopinions$love_dogs == 4)

doggies %>% 
  group_by(love_dogs) %>%
  summarise(count = n()) %>% 
  ggplot() +
  geom_bar(aes(x = love_dogs, y = count), stat = "identity") +
  geom_text(aes(label = count, x = love_dogs, y = count+5)) +
  geom_text(label = paste0(lovedog0to3, "  values"), x = 1.5, y = 60) +
  geom_text(label = paste0(lovedog0to4, "  values"), x = 2.5, y = 85) +
  
  # score 0 (Vertical 1)
  geom_segment(aes(x = 0, y = 15, xend = 0, yend = 50), linewidth = 0.5, linetype = "dashed") +
  # score 3 (Vertical 2)
    geom_segment(aes(x = 3, y = 47, xend = 3, yend = 50), linewidth = 0.5, linetype = "dashed") +
  # score 0-3 (Horizontal)
  geom_segment(aes(x = 0, y = 50, xend = 3, yend = 50), linewidth = 0.5, linetype = "dashed") +
  # score 0-4 (Vertical 1)
  geom_segment(aes(x = 0, y = 15, xend = 0, yend = 75), linewidth = 0.5, linetype = "dashed") + 
  # score 0-4 (Vertical 2)
  geom_segment(aes(x = 4, y = 49, xend = 4, yend = 75), linewidth = 0.5, linetype = "dashed") +
  # score 0-4 (Horizontal)
  geom_segment(aes(x = 0, y = 75, xend = 4, yend = 75), linewidth = 0.5, linetype = "dashed") +
  
  scale_x_continuous(breaks = 0:5) + # display all tick marks on x axis
  
  labs(
    title = "Dog opinion totals",
    subtitle = "+ grouping options by size",
    y = "Count",
    x = "Self reported love of dogs (0-5)"
  )

# the purpose of the original study was to see if looking at dog pictures [positively] impacts people's performance on tasks, I wonder if the person who answered 0 here (who also answered 0 for every single cuteness metric on every single dog picture) performed worse after looking at dog pictures

There’s a very strong bias in participants’ opinions on dogs. This could be response bias, from the study authors mentioning the topic of the study while recruiting participants, but I didn’t notice any mention of that having been done. My plan was to break this variable into three groups, but it looks like I might need to break it into 2 groups instead (loves dogs and does not love dogs) to have sufficient data for analysis. The score of 5 has so many more responses than the rest; this makes me think that there is a difference between people who chose to answer 4 and those who chose to answer 5. That is one thing I might want to investigate.

doggieopinions$love_binary <- as.character(doggieopinions$love_binary)

doggieopinions$love_dogs_char <- as.character(doggieopinions$love_dogs)

doggieopinions %>% 
  filter(love_dogs != 0) %>% 
  ggplot(aes(x=love_dogs_char, y=cute_total, fill= love_dogs_char)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#2C6E49", "#4C956C", "#FEFEE3", "#FFC9B9", "#D68C45")) +
  geom_dog(dog = "gabe", position = position_jitter(width = 0.1)) + 
  labs(
    title = "Total points given to dog pictures for 'cuteness', by reported love of dogs",
    x = "Self-reported love of dogs (0-5)",
    y = "Total cuteness points"
  ) +
  theme_light() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 12),
    plot.background = element_rect(color = "#39a3aa", fill = "#FAFEFF"),
  ) 

There’s noticeable variety in opinions across the dog love groups for how cute participants thought the dog images were overall. The shape of the boxes in succession implies that there may be a positive correlation between cuteness scores and dog love score. Returning to my previous thought, that people who answered 4 and 5 are different, there is a clear difference between their responses to cuteness as seen here. The interquartile range of the respondents for 5 is higher overall, with the lower quartile landing relatively close to the median of the group for 4; the upper quartile for 5 also appears to be about even with the top of the upper whisker for 4. It might be possible to predict if participants answered 5 for their love of dogs based on their opinions about dog pictures.

Examine correlated variables

Hypothesis test

Any needed subsetting/cleaning

#

Set up hypothesis

#

Test hypothesis

#

Create binary classifier

#