Week 8: Visualising Regression in R

01. Visualising Relationships in Gaming, Social Isolation, and Gender

01.2 But it can’t all be down to one single factor

Let’s go back to our original plot again. If you are thinking that surely social isolation cannot all be a function of gaming, then you are correct. Social isolation will have several determinants, and we cannot tell the direction of causality (which came first) nor the mechanism (the story of how more gaming translates into greater isolation) without considering more variables. Doing so is a process known as exerting statistical control. It involves introducing additional variables into our analysis to see if they ‘explain’ more of the phenomenon of interest. Again, we will do this formally later. For now, let’s take a look. Two other variables in the dataset are gender and game genre, the latter entered as game_genre. Both are categorical, or nominal, variables, where the categories have no inherent order or hierarchy. The commands below call on the summarytools package to produce a simple frequency table for each.

freq(gaming$gender)
## Frequencies  
## gaming$gender  
## Type: Character  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##       Female    331     33.10          33.10     33.10          33.10
##         Male    647     64.70          97.80     64.70          97.80
##        Other     22      2.20         100.00      2.20         100.00
##         <NA>      0                               0.00         100.00
##        Total   1000    100.00         100.00    100.00         100.00
freq(gaming$game_genre)
## Frequencies  
## gaming$game_genre  
## Type: Character  
## 
##                       Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------- ------ --------- -------------- --------- --------------
##       Battle Royale    141     14.10          14.10     14.10          14.10
##                 FPS    134     13.40          27.50     13.40          27.50
##                 MMO    143     14.30          41.80     14.30          41.80
##                MOBA    156     15.60          57.40     15.60          57.40
##        Mobile Games    139     13.90          71.30     13.90          71.30
##                 RPG    146     14.60          85.90     14.60          85.90
##            Strategy    141     14.10         100.00     14.10         100.00
##                <NA>      0                               0.00         100.00
##               Total   1000    100.00         100.00    100.00         100.00

Reading the results above, we see that 647 (64.7%) of our sample are male, 331 (33.1%) are female, with 22 (2.2%) identifying as ‘other’. The seven game genres are spread fairly evenly, each accounting for between 13.4% and 15.6% of the sample. At this point, we could approach the visualisation in several ways. We could split our output for the scatterplot by gender or genre using facet_grid and supplying either gender or game_genre as a faceting variable. This would draw a separate plot for each group, and allow us to check if the correlation holds within each group. This is fine for variables with a small number of categories like gender, less so for variables with more. We are interested in whether social isolation might vary according to something other than gaming - and while the initial indications are strong, we know that other factors are likely to be involved. This is where boxplots might be useful, since we can stack them neatly side by side as we split the output by categories of a second variable. Let’s try this now.

ggplot(gaming, aes(y = social_isolation_score, 
                   x = reorder(game_genre, social_isolation_score, FUN = median),
                   fill = reorder(game_genre, social_isolation_score, FUN = median))) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5, shape = 21, color = "gray15") +
  theme_gray(base_size = 12) +
  ylab("Social isolation score") +
  xlab("Preferred game genre") +
  labs(title = "Game Genre and Social Isolation", 
       caption = "Source: Kaggle") +
  guides(fill = "none")
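
The call to reorder() in the plot code above decides the left-to-right order of the boxes. As a minimal sketch of what it does, the example below uses a small made-up data frame (the values in toy are invented for illustration and are not taken from the gaming dataset) and shows that the factor levels follow the group medians rather than the alphabet.

```r
# Toy data: invented values for illustration only, NOT the course dataset.
toy <- data.frame(
  game_genre = c("FPS", "FPS", "FPS", "RPG", "RPG", "RPG", "MMO", "MMO", "MMO"),
  social_isolation_score = c(5, 6, 7, 2, 3, 4, 1, 4, 9)
)

# Median score within each genre, sorted from lowest to highest.
genre_medians <- tapply(toy$social_isolation_score, toy$game_genre, median)
print(sort(genre_medians))

# reorder() returns a factor whose levels are sorted by those medians.
# ggplot draws the levels of a factor in order along the x axis.
ordered_genre <- reorder(toy$game_genre, toy$social_isolation_score, FUN = median)
print(levels(ordered_genre))
```

Swapping FUN = median for FUN = mean would order the boxes by group mean instead, which can give a different order when a group has outliers.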

There is a slight tendency toward lower social isolation for FPS (First-Person Shooter) and MOBA (Multiplayer Online Battle Arena) gamers. This makes sense if we consider the community nature of these genres, especially as they tend to be more interaction-heavy. But the same community nature can also apply to MMO (Massively Multiplayer Online) games. Overall, there doesn’t seem to be much going on here. The same appears to hold true if we check variation in gaming hours by genre.

ggplot(gaming, aes(y = daily_gaming_hours, 
                   x = reorder(game_genre, daily_gaming_hours, FUN = median),
                   fill = reorder(game_genre, daily_gaming_hours, FUN = median))) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5, shape = 21, color = "gray15") +
  theme_gray(base_size = 12) +
  ylab("Gaming hours") +
  xlab("Preferred game genre") +
  labs(title = "Game Genre and Gaming Hours", 
       caption = "Source: Kaggle") +
  guides(fill = "none")

At this point, genre doesn’t seem to show a large disparity - although this question will come up again. How much of a difference is a ‘big’ difference? If the difference in median gaming time for FPS vs Strategy players is less than 2 hours, is this a big or small difference in ‘real world’ terms? There is no easy way to answer this - it may mean little to a social researcher, but a lot to a marketing analyst or psychologist. What about gender? Let’s consider gender both descriptively and visually.

sumtable(gaming, vars = c('daily_gaming_hours', 'social_isolation_score', 'exercise_hours_weekly'),
        digits = 3, 
        add.median = TRUE, 
        title = 'Gaming Hours and Social Isolation by Gender',
        group = 'gender')
Gaming Hours and Social Isolation by Gender

                          Female                    Male                      Other
Variable                  N    Mean  SD    Median   N    Mean  SD    Median   N    Mean  SD    Median
daily_gaming_hours        331  5.97  2.7   5.9      647  6.23  2.94  6        22   6.63  2.9   5.9
social_isolation_score    331  3.72  1.95  4        647  3.95  2.17  4        22   3.86  1.83  4
exercise_hours_weekly     331  6.94  1.81  6.9      647  6.96  1.81  7        22   6.53  1.61  6.75
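
One rough way to judge whether a gap between group means is ‘big’ is to express it in standard deviation units. The sketch below does this for daily gaming hours, using only the male and female figures from the table above. The standardised difference (often called Cohen’s d) is brought in here purely as an illustration and is not a measure the text itself relies on; the variable names are made up for the example.

```r
# Summary statistics for daily_gaming_hours, copied from the table above.
n_male <- 647;   mean_male <- 6.23;   sd_male <- 2.94
n_female <- 331; mean_female <- 5.97; sd_female <- 2.70

# Pooled standard deviation across the two groups.
sd_pooled <- sqrt(((n_male - 1) * sd_male^2 + (n_female - 1) * sd_female^2) /
                    (n_male + n_female - 2))

# Raw gap in hours per day, and the same gap in standard deviation units.
raw_difference <- mean_male - mean_female
standardised_difference <- raw_difference / sd_pooled

print(round(raw_difference, 2))           # 0.26 hours per day
print(round(standardised_difference, 2))  # about 0.09 standard deviations
```

A gap of about a quarter of an hour per day is less than a tenth of a standard deviation here, which is why it is hard to see in a cloud of points.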

Gaming time seems to vary a little by gender, with mean gaming hours for females at 5.97, vs 6.23 for males. There are also small variations in social isolation, but little in exercise. Splitting the original scatterplot might help here, but with a difference so small, it is likely to be difficult to diagnose from the distribution of points alone.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  theme_gray(base_size=12) +
  labs(x = "Hours gaming daily", y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle") +
  facet_grid(.~gender)

Notice a few things in this plot. You can see the differences in sample size for each of the subgroups reflected in the density of the points. Remember, we have 331 females, 647 males, and 22 identifying as other. This will be important when it comes to formal analysis later - how confident can we be when inferring the characteristics of the ‘other’ group, when there are so few represented in the data? At this point, we can add some fit lines, as before, to help us draw some conclusions. Adding a fit line to each group might show some variation in slope between groups.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  geom_smooth(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), method='lm') +
  theme_gray(base_size=12) +
  labs(x = "Hours gaming daily", y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle") +
  facet_grid(.~gender)
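
By default, geom_smooth() draws a shaded confidence band around each fit line, and that band will be widest where the data are thinnest. As a rough, related illustration of the small-sample problem (it is not the calculation behind the band itself), the sketch below computes the standard error of the mean social isolation score for each gender, using the N and SD values from the summary table earlier. The object names are made up for the example.

```r
# Group sizes and standard deviations of social_isolation_score,
# copied from the summary table by gender.
group_n  <- c(Female = 331, Male = 647, Other = 22)
group_sd <- c(Female = 1.95, Male = 2.17, Other = 1.83)

# Standard error of the mean: SD divided by the square root of N.
standard_error <- group_sd / sqrt(group_n)
print(round(standard_error, 2))  # Female 0.11, Male 0.09, Other 0.39
```

The ‘other’ group has a similar spread of scores to the rest, but its mean is estimated several times less precisely simply because only 22 people are in it.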

01.3 Understanding statistical control

It is a bit difficult to tell from this plot. Combining the plots into a single plot field with overlaid fit lines could help, as we will then be able to tell if the slopes are diverging, and if there is a difference in the impact of gaming on social isolation for each gender.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, 
                          y = social_isolation_score, 
                          color = gender), 
             position = position_jitter(width = 0.5, height = 0.5),
             alpha = 0.5) +
  geom_smooth(mapping = aes(x = daily_gaming_hours, 
                           y = social_isolation_score, 
                           color = gender,
                           fill = gender,
                           linetype = gender), 
              method = 'lm') +
  theme_gray(base_size = 12) +
  labs(x = "Hours gaming daily", 
       y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle",
       color = "Gender",
       fill = "Gender",
       linetype = "Gender")
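
Drawing one fit line per gender amounts to estimating a separate slope for each group. A minimal sketch of the same idea as a single model is given below, using an interaction between gaming hours and gender. The data are simulated, and the slopes in true_slope are invented for illustration; they are not estimates from the gaming dataset.

```r
set.seed(42)  # simulated data, for illustration only

n <- 300
toy <- data.frame(
  gender = rep(c("Female", "Male", "Other"), each = n / 3),
  daily_gaming_hours = runif(n, min = 0, max = 12)
)

# Assumed (made-up) slope of isolation on gaming hours for each group.
true_slope <- c(Female = 0.4, Male = 0.5, Other = 0.2)
toy$social_isolation_score <- 1 +
  unname(true_slope[toy$gender]) * toy$daily_gaming_hours +
  rnorm(n, sd = 1)

# The * operator adds both main effects and their interaction,
# so each gender gets its own intercept and its own slope.
fit <- lm(social_isolation_score ~ daily_gaming_hours * gender, data = toy)
coefs <- coef(fit)

# Female is the reference group; the other slopes are offsets from it.
slopes <- c(
  Female = unname(coefs["daily_gaming_hours"]),
  Male   = unname(coefs["daily_gaming_hours"] + coefs["daily_gaming_hours:genderMale"]),
  Other  = unname(coefs["daily_gaming_hours"] + coefs["daily_gaming_hours:genderOther"])
)
print(round(slopes, 2))
```

Because this model lets every group have its own intercept and slope, the three slopes match the lines that a per-group linear fit would draw. The interaction coefficients are what a formal test of ‘diverging slopes’ would examine.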

The difference is subtle, but we can see a slightly lesser tendency toward social isolation for each additional unit of gaming for those of ‘other’ gender, relative to males and females. The difference is small, and we need to be mindful of the problem of small sample size here. For now, treat it as an exercise in searching visually through your data for a possible analytical strategy. From this, we can tentatively suggest a few things about the relationship between gaming, social isolation, genre, and gender.

  1. Gaming is positively associated with social isolation - the longer the gaming time, the greater the social isolation.
  2. Gender appears to play a small role in explaining differences in the effect of gaming time on social isolation.
  3. In plainer English, the ‘penalty’ (increase) in social isolation at greater levels of gaming time seems to be greater for males than for females, and smallest for those of ‘other’ gender.
  4. All genders still experience the impact of gaming time on isolation, but it seems to be slightly weaker for the ‘other’ group.
  5. FPS and MOBA players play for slightly less time on average than players of other genres, but the difference in median play time is small, whilst FPS and MOBA players are less isolated. Here, gaming time and genre may interact in their effect on social isolation.
  6. Exercise is negatively related to social isolation - the longer the average exercise, the lower the social isolation.

Where might we go from here? Readers with some familiarity with the subject will know that the visuals point toward a multivariate analysis: enter all of the possibly relevant variables into a statistical model, examine the resulting parameters, check the effect sizes, and draw inferences about the slopes and the between-group differences in means. This jargon will become clearer later - but you already understand it (seriously). The ‘slope’ is a more precise way of quantifying how much social isolation changes with every single-unit increase in x (either exercise, or gaming time). We have already examined the ‘between-group differences’ in our plots of gender and genre, where we can see the differences in average isolation or gaming time for each genre or gender. And while this example draws on somewhat unusual, curated teaching data, it will give us a good foundation for seeking out relationships in the wider world. We will try this now with macrodata (country-level data) from Eurostat.
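
Before moving on, the idea of statistical control described above can be sketched in a few lines. The example compares the slope for gaming hours with and without exercise in the model. The data are simulated, and every number in the data-generating step is an assumption chosen for illustration, not a finding from the gaming dataset.

```r
set.seed(1)  # simulated data, for illustration only

n <- 500
exercise_hours_weekly <- rnorm(n, mean = 7, sd = 2)

# Assumption: people who exercise more tend to game less.
daily_gaming_hours <- 10 - 0.5 * exercise_hours_weekly + rnorm(n, sd = 2)

# Assumption: isolation rises with gaming (0.3 per hour)
# and falls with exercise (0.4 per hour).
social_isolation_score <- 2 + 0.3 * daily_gaming_hours -
  0.4 * exercise_hours_weekly + rnorm(n, sd = 1)

toy <- data.frame(daily_gaming_hours, exercise_hours_weekly, social_isolation_score)

# Model 1: gaming hours only.
simple <- lm(social_isolation_score ~ daily_gaming_hours, data = toy)

# Model 2: gaming hours, controlling for exercise.
controlled <- lm(social_isolation_score ~ daily_gaming_hours + exercise_hours_weekly,
                 data = toy)

print(round(coef(simple)["daily_gaming_hours"], 2))
print(round(coef(controlled)["daily_gaming_hours"], 2))
```

In this simulation the gaming slope shrinks once exercise is included, because part of the apparent effect of gaming was really the effect of exercising less. Categorical variables such as gender or game_genre can be added to the formula in the same way.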

02. Visualising Relationships with Macrodata, or Telling Interesting Stories About Countries

02.1 Units of analysis and the ecological problem (fallacy)