Week 8: Visualising Regression in R

01. Visualising Relationships in Gaming, Social Isolation, and Gender

01.1 Checking if two things are related (by looking first)

You are already familiar with the concept of correlation, as you have been looking at it for some time now. The scatterplots from the previous section showed a relationship between life expectancy and income. On these plots we saw visual evidence of correlation: change in average life expectancy seems to be positively related to change in income. We could also intuit this on the basis of theory. You might suspect that income influences life expectancy because you could explain how the two are related without seeing any data. You might even be able to build a detailed narrative about why this might be the case. This story might include remarks on how income per capita translates into gains in other social domains such as healthcare systems, healthcare knowledge, and disposable income.

Going a step further, try a quick thought experiment: could you explain this connection between life expectancy and income without reference to quantitative data at all? Although not impossible, it is not desirable to try to account for this with narrative alone. But the thought experiment does raise the question of the relative value of theoretical vs quantitative accounts of social causality. We obviously need both in this case, as one cannot be explained without the other. But some have raised the question of how much, and what kind of, theory we need. Peter Hedström (Hedström, 2005; Hedström and Ylikoski, 2010) tackled this question when he developed the ‘Analytical Sociology’ framework. This is a perspective on social analysis that takes a very specific approach to social theory, using it to explain how specific properties relate to each other in the social world. It distinguishes between abstract and middle-range theories, arguing that for theory to be useful, it should explain mechanisms in the social world. To the analytical sociologist, someone like Bourdieu is too diffuse to be testable, and therefore of limited explanatory value.

A mechanism-based theory of life expectancy and income would focus clearly and specifically on how income translates into longer average life. It might call for models of greater complexity with additional variables, or try to account for the phenomenon of life expectancy by focusing on how income affects the action of individuals, ‘aggregating up’ into an effect that is greater than the sum of its parts (longer life). In sociology we call this emergence, and the idea is invoked in methods such as agent-based modelling. As you know by now, sociology is unfortunately defined to a certain extent by a division between qualitative and quantitative approaches, one that has fortunately softened, since the 2000s in particular. While we might not argue about the inherent worth of one over the other, or which is more ‘scientific’ (whatever that may mean), we do come back to the question of the value of reductionist measurement, given the inherent complexity of social life.

For a simple example of this, consider the phenomenon of social media addiction. The nature of quantitative analysis calls for standardised measurement of a sort. We have to ask the same question of all respondents, unless we are doing something like a list experiment, which is quite rare. We might ask several questions about social media habits - frequency of use, feelings of connection, extent of ability to switch off - and sum these together into a single index of addiction. A counter-argument to this kind of measurement might point out that addiction itself is socially and culturally subjective. Does the form or wording of the question itself bias the response? Even if we measure something in this way, how can we possibly explain or appreciate the complex meanings, determinants, and impacts of social media use with a measure that reduces this complexity to a narrow set of questions?

These are all valid criticisms, even if they sometimes overlook the limits that quantitative researchers typically place on their own capacity to explain. Using standardised measurement of this kind allows us to do certain advantageous things. If we sample appropriately, we might be able to draw conclusions about a much larger group of people than we could with a smaller or more limited sample. We could address the question of group differences in a fine-grained way if we measure other properties such as age, gender, and education. We might even be able to sort out the causal ordering of things if we collect data at multiple time points.
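To make the idea of summing several survey questions into a single index concrete, here is a minimal sketch. The item names, the 1-5 scoring, and the responses are all invented for illustration. It also assumes every item is scored so that a higher value means a stronger sign of addiction, which a real questionnaire would need to check.

```r
# Hypothetical survey responses: three items, each scored 1 (low) to 5 (high).
# Column names and values are invented for illustration only.
survey <- data.frame(
  frequency_of_use   = c(5, 2, 4, 1),
  feels_disconnected = c(4, 1, 5, 2),
  cannot_switch_off  = c(5, 2, 3, 1)
)

# Sum the items row by row to give each respondent a single index score.
items <- c("frequency_of_use", "feels_disconnected", "cannot_switch_off")
survey$addiction_index <- rowSums(survey[, items])

survey
```

Each respondent ends up with one number. That is what makes comparison across people possible, and it is also the reduction that the criticisms above are aimed at.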

The main point here is that there are certain things we can only know through measurement. David Byrne once referred to this faith in measurement as a kind of throwback to the Enlightenment. We might now be conscious of the limits to scientific explanation, and also understand that science takes place in a social context, but we can also have faith that measurement can tell us real things about a real world that exists somewhat independent of ourselves. Maybe we can even figure things out about this world to help us improve or change people’s material lot. To do this, we need measurement and theory. We might have well-defined theories about how individual reproductive choice conditioned the demographic transition, but we did not discover it through intuition alone. Its ubiquity and generality were only clarified as the collection of statistics at country level became more widespread.

This might be overstretching the division somewhat. But it does describe a uniquely recurrent issue in sociology - when we refer to historically pervasive and stable social properties such as class, gender, or ethnic inequalities, these are only apparent to their full extent due to measurement. Unfortunately we cannot surmise why these inequalities arise and how they persist - no matter how committed we might be to saturating models with multiple parameters, simulating from agents to the emergent, or unleashing AI on the now considerable ocean of available global data. Which is to say, neither the qualitative nor the quantitative perspective has any monopoly on explanation as such. So, a responsible approach to quantitative analysis in sociology might begin by drawing limits around what we can explain, and to what extent.

This is a good place for us to start - how much are things in the social world related, and, in the case of data drawn through sampling, how certain are we? What we will do in this section is address the first: how much? Sometimes this question of degree will express itself as a difference in means between groups - by how much does mean income differ between men and women? Does this depend on the nature of work - the sector of employment, whether full or part time, or whether public or private sector? What about differences between a wider range of groups - such as those with different levels of education? Sometimes it will express itself as covariance, correlation, or the tendency of one variable to change alongside another. We might diagnose this visually as a sort of ‘direction of travel’ in the arrangement of points on a scatterplot. Our life expectancy and income plot showed such a positive correlation. Sometimes we will need to construct a table, or a different kind of plot such as a stacked or grouped bar chart. These might let us assess questions such as whether conservative voters favour certain types of media, or whether a student’s programme of study is associated with use of certain social media platforms.

Let’s start by looking at a more (socially) familiar example - gaming. Familiar to some of you at least. If you are not familiar with gaming, take this as an opportunity to exercise your intuition, and appreciate why imagination is as important as computation. Kaggle hosts a gaming dataset with a number of useful variables. Download the data from Moodle to follow along with this example, set your working directory, and save the dataset there. The code below loads the data and assigns it to an object called gaming in the Global Environment, so make sure you have downloaded it first. It then returns the names of the first 10 columns (variables). If you want to see the names for the whole set, just leave out the [1:10] from the code below.

library(tidyverse)  # loads readr (read_csv) and ggplot2 (ggplot)

gaming <- read_csv("Gaming and Mental Health.csv")

colnames(gaming)[1:10]

##  [1] "record_id"                  "age"                       
##  [3] "gender"                     "daily_gaming_hours"        
##  [5] "game_genre"                 "primary_game"              
##  [7] "gaming_platform"            "sleep_hours"               
##  [9] "sleep_quality"              "sleep_disruption_frequency"

You can probably get a good sense of the variables from their names alone - a reminder of the importance of spending time naming your variables appropriately. We could start by checking all pairwise correlations between all suitable variables, and this will become part of our recommended workflow in time. For now, let’s check if hours spent gaming daily is related to social isolation. Remember to consider the causal order here - would you expect more hours spent gaming to lead to greater social isolation? Or are those more socially isolated also more likely to game? I will add some jitter to the plot in the code below. This will add some random variation to the points to make the graph easier to read.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  theme_gray(base_size=12) +
  labs(x = "Hours gaming daily", y = "Social isolation score",
       title = "Gaming and Social Isolation",
       caption = "Source: Kaggle")

What do you notice? We can tell a lot from the direction of the points. Plots as clear as this one are rare, so please do not get a false sense of optimism. Unfortunately it won’t always be like this. Correlations in the wider world are often more subtle. This one is especially strong, and positive. Can you see how we identify this? As the number of hours increases, so too does the score on social isolation. Pick two points on the graph along the x (hours) axis and compare the isolation scores you find there. At around 2-3 hours of average play, social isolation is around 2-2.5. Social isolation is scored from 1-10, and we can confirm the range from our summary statistics. Let’s do this for both variables.

summary(gaming[, c("daily_gaming_hours", "social_isolation_score")])

##  daily_gaming_hours social_isolation_score
##  Min.   : 0.500     Min.   : 1.000        
##  1st Qu.: 4.100     1st Qu.: 2.000        
##  Median : 6.000     Median : 4.000        
##  Mean   : 6.151     Mean   : 3.872        
##  3rd Qu.: 8.025     3rd Qu.: 5.000        
##  Max.   :15.100     Max.   :10.000

We always need to interpret our visualisations alongside descriptive statistics or tables for each of our variables. They tell us something about how to interpret the results. Median daily_gaming_hours is 6, and the largest value in the dataset is 15.1. This means that at least one person gamed for around 15 hours daily on average. Even without this, it seems like the gamers in the dataset are on the higher side of play time. Median social_isolation_score comes in at 4, with scores running from 1 (lowest) to 10 (highest). This looks like pretty good evidence of correlation. For now, limited to this sample data alone, we can say that it appears to be strongly correlated. Why? The clustering of the points answers this partly. The pattern seems clear. The direction of travel seems linear. For this combination at least, it is easy. Conversely, there are some variables that might be negatively correlated with social isolation. One of these is exercise, entered in the dataset as exercise_hours_weekly.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = exercise_hours_weekly, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  theme_gray(base_size=12) +
  labs(x = "Exercise hours weekly", y = "Social isolation score",
       title = "Exercise and Social Isolation",
       caption = "Source: Kaggle")

As expected, the correlation is negative. Reading the x-axis (exercise) from left to right, we see the tendency for average isolation to decrease as exercise increases. Those points (people) at the top of the social isolation scale on the y-axis, with scores close to 10, also fall lowest on exercise - around 1-2.5 hours. We can be more precise with these estimates later; for now we can just read them roughly from the graph. Those at the bottom of the social isolation scale also tend to fall within the higher bands of exercise: 7.5 hours +, and much more clearly so at 10+ hours. We are starting to piece together the puzzle of gaming’s impact (sort of).
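The direction we read from the two scatterplots can also be summarised as a single number, the correlation coefficient. The sketch below uses simulated values rather than the gaming dataset, so the variable names, slopes, and noise level are all assumptions chosen only to produce one positive and one negative relationship.

```r
set.seed(42)  # make the simulated values reproducible

# Simulated stand-ins for the real variables (NOT the Kaggle data).
n <- 200
hours     <- runif(n, min = 0.5, max = 15)
exercise  <- runif(n, min = 0, max = 14)
isolation <- 1 + 0.5 * hours - 0.2 * exercise + rnorm(n, sd = 1)

# Pearson's r runs from -1 (perfect negative) to +1 (perfect positive).
r_hours    <- cor(hours, isolation)
r_exercise <- cor(exercise, isolation)

r_hours     # positive: isolation rises with hours
r_exercise  # negative: isolation falls with exercise
```

On the real data, the equivalent call would be something like cor(gaming$daily_gaming_hours, gaming$social_isolation_score). The sign of the result matches the direction of travel in the plot, and values closer to -1 or +1 indicate a tighter pattern.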

01.2 But it can’t all be down to one single factor

Let’s go back to our original plot again. If you are thinking that surely social isolation cannot all be a function of gaming, then you are correct. Social isolation will have several determinants, and we cannot tell the direction of causality (which came first) nor the mechanism (the story of how more gaming translates into greater isolation) without considering more variables. This is a process known as exerting statistical control. It involves introducing additional variables into our analysis to see if they ‘explain’ more of the phenomenon of interest. Again, we will do this formally later. For now, let’s take a look. One other variable in the dataset is game genre, entered as game_genre. This is a categorical, or nominal, variable where the categories have no inherent order or hierarchy. Gender, entered as gender, is another. The commands below call on the summarytools package to produce a simple frequency table for each.

library(summarytools)  # provides freq()

freq(gaming$gender)

## Frequencies  
## gaming$gender  
## Type: Character  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##       Female    331     33.10          33.10     33.10          33.10
##         Male    647     64.70          97.80     64.70          97.80
##        Other     22      2.20         100.00      2.20         100.00
##         <NA>      0                               0.00         100.00
##        Total   1000    100.00         100.00    100.00         100.00

freq(gaming$game_genre)

## Frequencies  
## gaming$game_genre  
## Type: Character  
## 
##                       Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------- ------ --------- -------------- --------- --------------
##       Battle Royale    141     14.10          14.10     14.10          14.10
##                 FPS    134     13.40          27.50     13.40          27.50
##                 MMO    143     14.30          41.80     14.30          41.80
##                MOBA    156     15.60          57.40     15.60          57.40
##        Mobile Games    139     13.90          71.30     13.90          71.30
##                 RPG    146     14.60          85.90     14.60          85.90
##            Strategy    141     14.10         100.00     14.10         100.00
##                <NA>      0                               0.00         100.00
##               Total   1000    100.00         100.00    100.00         100.00

Reading the results above, we see that 647 (64.7%) of our sample are male, 331 (33.1%) are female, with 22 (2.2%) identifying as ‘other’. At this point, we could approach the visualisation in several ways. We could split our output for the scatterplot by gender or genre using facet_grid and supplying either gender or game_genre as a faceting variable. This would draw a separate plot for each group, and allow us to check if the correlation holds within every group. This is fine for variables with a small number of categories like gender, less so for variables with more. We are interested in whether social isolation might vary according to something other than gaming - and while the initial indications are strong, we know that other factors are likely to be involved. This is where boxplots might be useful, since we can stack them neatly side by side as we split the output by categories of a second variable. Let’s try this now.

ggplot(gaming, aes(y = social_isolation_score, 
                   x = reorder(game_genre, social_isolation_score, FUN = median),
                   fill = reorder(game_genre, social_isolation_score, FUN = median))) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5, shape = 21, color = "gray15") +
  theme_gray(base_size = 12) +
  ylab("Social isolation score") +
  xlab("Preferred game genre") +
  labs(title = "Game Genre and Social Isolation", 
       caption = "Source: Kaggle") +
  guides(fill = "none")

There is a slight tendency toward lower social isolation for FPS (First-Person Shooter) and MOBA (Multiplayer Online Battle Arena) gamers. This makes sense if we consider the community nature of these, especially as they tend to be more interaction-heavy. But this can also be the case for MMO (Massively Multiplayer Online) games. Overall, there doesn’t seem to be much going on here. The same appears to hold true if we check variation in gaming hours by genre.

ggplot(gaming, aes(y = daily_gaming_hours, 
                   x = reorder(game_genre, daily_gaming_hours, FUN = median),
                   fill = reorder(game_genre, daily_gaming_hours, FUN = median))) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5, shape = 21, color = "gray15") +
  theme_gray(base_size = 12) +
  ylab("Gaming hours") +
  xlab("Preferred game genre") +
  labs(title = "Game Genre and Gaming Hours", 
       caption = "Source: Kaggle") +
  guides(fill = "none")
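Both boxplots pass the genre through reorder() before plotting, which is why the boxes appear sorted rather than in alphabetical order. A minimal sketch of what that call does, using an invented three-genre toy dataset (the genre labels and scores are made up for illustration):

```r
# Toy data: invented scores for three made-up genres.
toy <- data.frame(
  genre = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  score = c(7, 8, 9,  1, 2, 3,  4, 5, 6)
)

# Default factor levels are alphabetical.
levels(factor(toy$genre))

# reorder() sorts the levels by a summary of a second variable,
# here the median score within each genre (lowest first).
by_median <- reorder(toy$genre, toy$score, FUN = median)
levels(by_median)
```

Here the medians are 8 for A, 2 for B, and 5 for C, so the reordered levels run B, C, A. In the plots, ggplot draws the boxes in level order, which places the genre with the lowest median on the left.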

At this point, genre doesn’t seem to show a large disparity - although this question will come up again. How much of a difference is a ‘big’ difference? If the difference in median gaming time for FPS vs Strategy players is less than 2 hours, is this a big or small difference in ‘real world’ terms? There is no easy way to answer this - it may mean little to a social researcher, but a lot to a marketing analyst or psychologist. What about gender? Let’s consider gender both descriptively and visually.

library(vtable)  # provides sumtable()

sumtable(gaming, vars = c('daily_gaming_hours', 'social_isolation_score', 'exercise_hours_weekly'),
        digits = 3, 
        add.median = TRUE, 
        title = 'Gaming Hours and Social Isolation by Gender',
        group = 'gender')

Gaming Hours and Social Isolation by Gender

                          Female                     Male                       Other
Variable                  N    Mean  SD    Median    N    Mean  SD    Median    N    Mean  SD    Median
daily_gaming_hours        331  5.97  2.7   5.9       647  6.23  2.94  6         22   6.63  2.9   5.9
social_isolation_score    331  3.72  1.95  4         647  3.95  2.17  4         22   3.86  1.83  4
exercise_hours_weekly     331  6.94  1.81  6.9       647  6.96  1.81  7         22   6.53  1.61  6.75

Gaming time seems to vary a little by gender, with mean gaming hours for females at 5.97, vs 6.23 for males. There are also small variations in social isolation, but little in exercise. Splitting the original scatterplot might help here, but with a difference so small, it is likely to be difficult to diagnose from the distribution of points alone.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  theme_gray(base_size=12) +
  labs(x = "Hours gaming daily", y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle") +
  facet_grid(.~gender)

Notice a few things in this plot. You can see the differences in sample size for each of the subgroups reflected in the density of the points. Remember, we have 331 females, 647 males, and 22 identifying as other. This will be important when it comes to formal analysis later - how confident can we be when inferring the characteristics of the ‘other’ group, when there are so few represented in the data? At this point, we can add some fit lines, as before, to help us draw some conclusions. Adding a fit line to each group might show some variations in slope between each group.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  geom_smooth(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), method='lm') +
  theme_gray(base_size=12) +
  labs(x = "Hours gaming daily", y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle") +
  facet_grid(.~gender)
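By default, geom_smooth() draws a shaded band around each fit line, and that band tends to be wider where a group has fewer observations. As a rough illustration of why sample size matters, the sketch below computes the standard error of the mean social isolation score for each gender, using the N and SD values from the summary table above. This is a simpler quantity than the band around a fitted line, but it shrinks with sample size in a similar way.

```r
# N and SD of social_isolation_score by gender, copied from the summary table.
n_obs <- c(Female = 331, Male = 647, Other = 22)
sds   <- c(Female = 1.95, Male = 2.17, Other = 1.83)

# Standard error of the mean: SD divided by the square root of N.
se <- sds / sqrt(n_obs)
round(se, 2)
```

The standard error for the ‘other’ group comes out several times larger than for males or females, even though the standard deviations are similar. The difference is driven almost entirely by the group having only 22 members.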

01.3 Understanding statistical control

It is a bit difficult to tell from this plot. Combining the plots into a single plot field with overlaid fit lines could help, as we will then be able to tell if the slopes are diverging, and if there is a difference in the impact of gaming on social isolation for each gender.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, 
                          y = social_isolation_score, 
                          color = gender), 
             position = position_jitter(width = 0.5, height = 0.5),
             alpha = 0.5) +
  geom_smooth(mapping = aes(x = daily_gaming_hours, 
                           y = social_isolation_score, 
                           color = gender,
                           fill = gender,
                           linetype = gender), 
              method = 'lm') +
  theme_gray(base_size = 12) +
  labs(x = "Hours gaming daily", 
       y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle",
       color = "Gender",
       fill = "Gender",
       linetype = "Gender")

The difference is subtle, but we can see a slightly lesser tendency toward social isolation for each additional unit of gaming for those of ‘other’ gender, relative to males and females. The difference is small, and we need to be mindful of the problem of small sample size here. For now, treat it as an exercise in searching visually through your data for a possible analytical strategy. From this, we can deduce a few things about the relationship between gaming, social isolation, genre, and gender.

Gaming is positively associated with social isolation - the longer the gaming time, the greater the social isolation.
Gender appears to play a small role in explaining differences in the effect of gaming time on social isolation.
In plainer English, the ‘penalty’ (increase) in social isolation at greater levels of gaming time seems to be greater for males than females, and smaller for those of ‘other’ gender.
All genders still experience the impact of gaming time on isolation, but it seems to be slightly less so for the ‘other’ group.
FPS and MOBA players play for slightly less time on average than players of other genres, although the difference in median play time is small, and they are also slightly less isolated. Here, gaming time and genre may interact in their effect on social isolation.
Exercise is negatively related to social isolation - the longer the average exercise, the lower the social isolation.
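The comparison of fit lines by group can also be made numerically by fitting a separate straight line within each group and comparing the slopes. The sketch below does this on simulated data, not the gaming dataset: the two groups, their true slopes of 0.6 and 0.3, and the noise level are all assumptions chosen so that the difference is easy to see.

```r
set.seed(1)  # reproducible simulated values

# Simulated data (NOT the Kaggle data): two groups whose true slopes differ.
n <- 300
group <- rep(c("A", "B"), each = n / 2)
hours <- runif(n, min = 0.5, max = 15)
true_slope <- ifelse(group == "A", 0.6, 0.3)
isolation <- 1 + true_slope * hours + rnorm(n, sd = 1)
sim <- data.frame(group, hours, isolation)

# Fit a separate straight line within each group and keep only the slope.
slopes <- sapply(split(sim, sim$group), function(d) {
  coef(lm(isolation ~ hours, data = d))[["hours"]]
})
round(slopes, 2)
```

Each slope is the estimated change in isolation for one extra hour of gaming within that group. On the real data, the same idea could be tried by splitting on gaming$gender, bearing in mind that a slope estimated from only 22 people will be much less stable than one estimated from several hundred.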

This example is unusual in terms of its clarity. It will not always be this clear when you are working with real, messy data. By way of example, look at the plot below. Here, we have a plot examining the relationship between women’s education and childhood immunization rates with data from the World Bank. We know these to be related through several mechanisms as gender and education impact health-seeking behaviors at the individual level, whilst overall levels of gender equality influence national service coverage and access (Tracey et al., 2024). Yet in the plot below we see little evidence of a connection, at least for this set of measures. Here, we infer from the ‘flatness’ of the slope, but we might also remark on the distribution of points. For the set of countries included, female labour force participation for the highly educated is high. Yet at lower levels of education, say, below 55% (the red dashed line in the plot) immunization rates are still relatively high. This is a good example of two variables that are seemingly uncorrelated. Remember - variables. That’s about as much as we can say. Immunization and gender may be correlated, but this is not apparent from the measures we have available here in this plot. We would have a long way to go before eliminating gender as a factor influencing immunization.

# Draw the plot (assumes the World Bank data is already loaded as wb_renamed)
ggplot(data = wb_renamed) +
  geom_point(mapping = aes(x = labor_force_ed, y = immunization)) +
  geom_smooth(mapping = aes(x = labor_force_ed, y = immunization), method = "gam") +
  theme_gray(base_size = 12) +
  ylim(50, 100) +
  labs(
    title = "Women's Education and Infant Immunization Rates",
    subtitle = "Data from World Bank, 2024",
    x = "Female labour force w. high education (%)",
    y = "Measles immunization rate (%)",
    caption = "Measles immunization uptake, % all children 12-23 months"
  ) +
  # linewidth replaces size for line thickness in ggplot2 3.4.0 and later
  geom_vline(xintercept = 55, linetype = "dashed", color = "red", linewidth = 0.5)

Where might we go from here? Readers with some familiarity with the subject will know that the visuals point toward a multivariate analysis. Enter all of the possibly relevant conditions into a statistical model and examine the resulting parameters. Check the effect sizes, and draw inferences about the slopes and the between-group differences in means. This jargon will become clearer later - but you already understand it (seriously). The ‘slope’ is a more precise way of quantifying how much social isolation changes with every single-unit increase in x (either exercise, or gaming time). The ‘between-group differences’ we have already examined in our plots of gender and genre. We can see the differences in average isolation or gaming time for each genre or gender. And while this example draws on somewhat unusual, curated teaching data, it will give us a good foundation for seeking out relationships in the wider world. We will try this now with macrodata (country-level data) from Eurostat.

02. Visualising Relationships with Macrodata, or Telling Interesting Stories About Countries

References

Hedström, P. (2005) Dissecting the Social: On the Principles of Analytical Sociology. Cambridge University Press.

Hedström, P. and Ylikoski, P. (2010) Causal Mechanisms in the Social Sciences. Annual Review of Sociology, 36, pp. 49–67.

Tracey, G. et al. (2024) Why does gender matter for immunization? Vaccine, 42, pp. S91–S97.