Week 8: Visualising Regression in R

01. Visualising Relationships in Gaming, Social Isolation, and Gender

01.1 Checking if two things are related (by looking first)

You are already familiar with the concept of correlation, as you have been looking at it for some time now. The scatterplots of life expectancy and income show evidence of correlation visually because we can clearly see how change in average life expectancy seems to be positively related to change in income. We could also intuit this on the basis of theory. You might suspect that income influences life expectancy, because you could explain - without seeing any data - how the two are related. You might even be able to construct a detailed narrative about why this might be the case, and this story might include remarks on how income translates into gains in other social domains such as healthcare systems, healthcare knowledge, and disposable income. You might also question the value of the data analysis at all, given that a reasonable account of life expectancy’s relationship with income could be given with little reference to statistical data. Although this might be a minority opinion, it does raise a question about the relative value of theoretical (perhaps qualitative) accounts of social causality versus quantitative ones. Some (me) would claim you cannot have one without the other. We might have well-defined theories about how individual reproductive choice conditioned the demographic transition, but we did not discover it through intuition alone. Its ubiquity and generality were only clarified as the collection of statistics at country level became more widespread.

This might be overstretching the division somewhat. But it does describe a uniquely recurrent issue in sociology - that most references to historically pervasive and stable social properties such as class, gender, or ethnic inequalities, are only apparent to their full extent due to measurement. Unfortunately we cannot surmise why this might be the case - no matter how committed we might be to saturating models with multiple parameters, simulating from agents to the emergent, or unleashing AI on the now considerable ocean of available global data. Which is to say, neither the qualitative nor the quantitative perspective has any monopoly on explanation as such. So, a responsible approach to quantitative analysis in sociology might begin by drawing limits around what we can explain, and to what extent.

This is a good place for us to start - how much are things in the social world related, and how certain are we? In this section we address the first question: how much? Sometimes this question of degree will express itself as a difference in means between groups - by how much does mean income differ between men and women? Does this depend on the nature of work - the sector of employment, whether full or part time, or whether public or private sector? What about differences between a wider range of groups - such as those with different levels of education? Sometimes it will express itself as covariance, correlation, or the tendency of one variable to change alongside another. We might diagnose this visually as a sort of ‘direction of travel’ in the arrangement of points on a scatterplot. Our life expectancy and income plot showed such a positive correlation. Sometimes we will need to construct a table, or a different kind of plot such as a stacked or grouped bar chart. These might let us assess questions such as whether conservative voters favour certain types of media, or whether a student’s programme of study is associated with use of certain social media platforms.

Let’s start by looking at a more (socially) familiar example - gaming. Familiar to some of you at least. If you are not familiar with gaming, take this as an opportunity to exercise your intuition, and appreciate why imagination is as important as computation. Kaggle hosts a gaming dataset with a number of useful variables. Download the data from Moodle to follow along with this example. Remember to set your working directory, and save the dataset there also. I have assigned the dataset to an object called gaming in the Global Environment. You can do the same with this code. I will then get the column names and we will take a closer look at the dataset.

library(tidyverse)  # loads readr (for read_csv) and ggplot2 (for ggplot)
gaming <- read_csv("Gaming and Mental Health.csv")

colnames(gaming)

##  [1] "record_id"                        "age"                             
##  [3] "gender"                           "daily_gaming_hours"              
##  [5] "game_genre"                       "primary_game"                    
##  [7] "gaming_platform"                  "sleep_hours"                     
##  [9] "sleep_quality"                    "sleep_disruption_frequency"      
## [11] "academic_work_performance"        "grades_gpa"                      
## [13] "work_productivity_score"          "mood_state"                      
## [15] "mood_swing_frequency"             "withdrawal_symptoms"             
## [17] "loss_of_other_interests"          "continued_despite_problems"      
## [19] "eye_strain"                       "back_neck_pain"                  
## [21] "weight_change_kg"                 "exercise_hours_weekly"           
## [23] "social_isolation_score"           "face_to_face_social_hours_weekly"
## [25] "monthly_game_spending_usd"        "years_gaming"                    
## [27] "gaming_addiction_risk_level"

You can probably get a good sense of the variables from their names alone - a reminder of the importance of spending time naming your variables appropriately. We could start by checking all pairwise correlations between all suitable variables, and this will become part of our recommended workflow in time. For now, let’s check if hours spent gaming daily is related to social isolation. Remember to consider the causal order here - would you expect more hours spent gaming to lead to greater social isolation? Or are those more socially isolated also more likely to game? I will add some jitter to the plot in the code below. This will add some random variation to the points to make the graph easier to read.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  theme_gray(base_size=12) +
  labs(x = "Hours gaming daily", y = "Social isolation score",
       title = "Gaming and Social Isolation",
       caption = "Source: Kaggle")

What do you notice? We can tell a lot from the direction of the points. The pattern in a plot like this is obvious, so please do not get a false sense of optimism. Unfortunately it won’t always be like this. Correlations in the wider world are often more subtle. This one is especially strong, and positive. Can you see how we identify this? As we ascend the level of hours, the score on social isolation appears to increase too. Pick two points on the graph along the x (hours) axis. At around 2-3 hours of average play, social isolation is around 2-2.5. Social isolation is scored from 1-10, and we can check the range by checking our summary statistics. Let’s do this for both variables.

summary(gaming[, c("daily_gaming_hours", "social_isolation_score")])

##  daily_gaming_hours social_isolation_score
##  Min.   : 0.500     Min.   : 1.000        
##  1st Qu.: 4.100     1st Qu.: 2.000        
##  Median : 6.000     Median : 4.000        
##  Mean   : 6.151     Mean   : 3.872        
##  3rd Qu.: 8.025     3rd Qu.: 5.000        
##  Max.   :15.100     Max.   :10.000

We always need to interpret our visualisations alongside descriptive statistics or tables for each of our variables. They tell us something about how to interpret the results. Median daily_gaming_hours is 6, and the largest value in the dataset is 15.1. This means that at least one person gamed for 15 hours daily on average. Even without this, it seems like the gamers in the dataset are on the higher side of play time. Median social_isolation_score comes in at 4, with scores ranging from 1 (lowest) to 10 (highest). The plot looks like pretty good evidence of correlation. For now, limited to this sample data alone, we can say that the two variables appear to be strongly correlated. Why? The clustering of the points answers this partly. The pattern seems clear. The direction of travel seems linear. For this combination at least, it is easy. Conversely, there are some variables that might be negatively correlated with social isolation. One of these is exercise, entered in the dataset as exercise_hours_weekly.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = exercise_hours_weekly, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  theme_gray(base_size=12) +
  labs(x = "Exercise hours weekly", y = "Social isolation score",
       title = "Exercise and Social Isolation",
       caption = "Source: Kaggle")

As expected, the correlation is negative. Reading the x-axis (exercise) from left to right, we see the tendency for average isolation to decrease as exercise increases. Those points (people) at the top of the social isolation scale on the y-axis, with scores close to 10, also fall lowest on exercise - around 1-2.5 hours of exercise. We can be more precise with these estimates later; for now we can just read them roughly from the graph. Those at the bottom of the social isolation scale also tend to fall within the higher bands of exercise: 7.5 hours +, and much more clearly so at 10+ hours. We are starting to piece together the puzzle of gaming’s impact (sort of).
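The direction and strength we have been reading from the scatterplots can also be summarised as a correlation coefficient. The sketch below is illustrative only: it assumes a small simulated data frame (called sim here, with invented values) that borrows the column names of the gaming data, so that it runs without the Kaggle file. The numbers it prints say nothing about the real dataset.

```r
# Simulated stand-in for the gaming data (NOT the real Kaggle file).
# Column names match the dataset; all values are invented for illustration.
set.seed(8)
n <- 200
hours <- runif(n, min = 0.5, max = 15)
sim <- data.frame(
  daily_gaming_hours     = hours,
  exercise_hours_weekly  = 10 - 0.4 * hours + rnorm(n, sd = 1.5),
  social_isolation_score = 1 + 0.5 * hours + rnorm(n, sd = 1)
)

# Pearson correlation for one pair of variables:
# the sign gives the direction, the size (0 to 1) gives the strength
r_gaming <- cor(sim$daily_gaming_hours, sim$social_isolation_score)
print(r_gaming)

# All pairwise correlations at once, rounded for readability
r_matrix <- round(cor(sim, use = "complete.obs"), 2)
print(r_matrix)
```

With the real data loaded, the same cor() call could be pointed at the numeric columns of gaming instead of sim.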

01.2 But it can’t all be down to one single factor

Let’s go back to our original plot again. If you are thinking that surely social isolation cannot all be a function of gaming, then you are correct. Social isolation will have several determinants, and we cannot tell the direction of causality (which came first) nor the mechanism (the story of how more gaming translates into greater isolation) without considering more variables. This is a process known as exerting statistical control. It involves introducing additional variables into our analysis to see if they ‘explain’ more of the phenomenon of interest. Again, we will do this formally later. For now, let’s take a look. Two other variables in the dataset are gender and game genre, entered as gender and game_genre. Both are categorical, or nominal, variables where the categories have no inherent order or hierarchy. The commands below call on the summarytools package to produce a simple frequency table for each.

library(summarytools)  # provides freq()
freq(gaming$gender)

## Frequencies  
## gaming$gender  
## Type: Character  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
##       Female    331     33.10          33.10     33.10          33.10
##         Male    647     64.70          97.80     64.70          97.80
##        Other     22      2.20         100.00      2.20         100.00
##         <NA>      0                               0.00         100.00
##        Total   1000    100.00         100.00    100.00         100.00

freq(gaming$game_genre)

## Frequencies  
## gaming$game_genre  
## Type: Character  
## 
##                       Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------- ------ --------- -------------- --------- --------------
##       Battle Royale    141     14.10          14.10     14.10          14.10
##                 FPS    134     13.40          27.50     13.40          27.50
##                 MMO    143     14.30          41.80     14.30          41.80
##                MOBA    156     15.60          57.40     15.60          57.40
##        Mobile Games    139     13.90          71.30     13.90          71.30
##                 RPG    146     14.60          85.90     14.60          85.90
##            Strategy    141     14.10         100.00     14.10         100.00
##                <NA>      0                               0.00         100.00
##               Total   1000    100.00         100.00    100.00         100.00

Reading the results above, we see that 647 (64.7%) of our sample are male, 331 (33.1%) are female, with 22 (2.2%) identifying as ‘other’. The seven genres are spread fairly evenly, each accounting for between 13.4% and 15.6% of the sample. At this point, we could approach the visualisation in several ways. We could split our output for the scatterplot by gender or genre using facet_grid and supplying either gender or game_genre as a faceting variable. This would draw separate plots for each group, and allow us to check if the correlation remains within each group. This is fine for variables with a small number of categories like gender, less so for variables with more. We are interested in whether social isolation might vary according to something other than gaming - and while the initial indications are strong, we know that other factors are likely to be involved. This is where boxplots might be useful, since we can stack them neatly side by side as we split the output by categories of a second variable. Let’s try this now.

ggplot(gaming, aes(y = social_isolation_score, 
                   x = reorder(game_genre, social_isolation_score, FUN = median),
                   fill = reorder(game_genre, social_isolation_score, FUN = median))) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5, shape = 21, color = "gray15") +
  theme_gray(base_size = 12) +
  ylab("Social isolation score") +
  xlab("Preferred game genre") +
  labs(title = "Game Genre and Social Isolation", 
       caption = "Source: Kaggle") +
  guides(fill = "none")

There is a slight tendency toward lower social isolation for FPS (First-Person Shooter) and MOBA (Multiplayer Online Battle Arena) gamers. This makes sense if we consider the community nature of these, especially as they tend to be more interaction-heavy. But this can also be the case for MMO (Massively Multiplayer Online) games. There doesn’t seem to be much going on here. The same appears to hold true if we check variation in gaming hours by genre also.

ggplot(gaming, aes(y = daily_gaming_hours, 
                   x = reorder(game_genre, daily_gaming_hours, FUN = median),
                   fill = reorder(game_genre, daily_gaming_hours, FUN = median))) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5, shape = 21, color = "gray15") +
  theme_gray(base_size = 12) +
  ylab("Gaming hours") +
  xlab("Preferred game genre") +
  labs(title = "Game Genre and Gaming Hours", 
       caption = "Source: Kaggle") +
  guides(fill = "none")
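The medians drawn as the centre lines of the boxplots can also be computed directly, which makes it easier to say how far apart the groups are. This is a minimal sketch on simulated data (the sim data frame and its values are invented; only the column names are borrowed from the gaming data), so the printed medians are not those of the real dataset.

```r
# Simulated stand-in data (NOT the real Kaggle file): three genres, invented hours
set.seed(8)
sim <- data.frame(
  game_genre = rep(c("FPS", "MMO", "Strategy"), each = 50),
  daily_gaming_hours = c(rnorm(50, mean = 5.5, sd = 2),
                         rnorm(50, mean = 6.5, sd = 2),
                         rnorm(50, mean = 6.0, sd = 2))
)

# Median hours per genre, sorted from lowest to highest
# (the same ordering that reorder(..., FUN = median) gives the x-axis)
genre_medians <- sort(tapply(sim$daily_gaming_hours, sim$game_genre, median))
print(round(genre_medians, 2))

# Gap between the highest and lowest group medians
median_gap <- max(genre_medians) - min(genre_medians)
print(round(median_gap, 2))
```

Replacing sim with gaming would give the medians behind the boxplots above.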

At this point, genre doesn’t seem to show a large disparity - although this question will come up again. How much of a difference is a ‘big’ difference? If the difference in median gaming time for FPS vs Strategy players is less than 2 hours, is this a big or small difference in ‘real world’ terms? There is no easy way to answer this - it may mean little to a social researcher, but a lot to a marketing analyst or psychologist. What about gender? Let’s consider gender both descriptively and visually.

library(vtable)  # provides sumtable()
sumtable(gaming, vars = c('daily_gaming_hours', 'social_isolation_score', 'exercise_hours_weekly'),
        digits = 3, 
        add.median = TRUE, 
        title = 'Gaming Hours and Social Isolation by Gender',
        group = 'gender')

Gaming Hours and Social Isolation by Gender

gender                  Female                   Male                     Other
Variable                N    Mean  SD    Median  N    Mean  SD    Median  N    Mean  SD    Median
daily_gaming_hours      331  5.97  2.7   5.9     647  6.23  2.94  6       22   6.63  2.9   5.9
social_isolation_score  331  3.72  1.95  4       647  3.95  2.17  4       22   3.86  1.83  4
exercise_hours_weekly   331  6.94  1.81  6.9     647  6.96  1.81  7       22   6.53  1.61  6.75

Gaming time seems to vary a little by gender, with mean gaming hours for females at 5.97, vs 6.23 for males. There are also small variations in social isolation, but little in exercise. Splitting the original scatterplot might help here, but with a difference so small, it is likely to be difficult to diagnose from the distribution of points alone.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  theme_gray(base_size=12) +
  labs(x = "Hours gaming daily", y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle") +
  facet_grid(.~gender)

Notice a few things in this plot. You can see the differences in sample size for each of the subgroups reflected in the density of the points. Remember, we have 331 females, 647 males, and 22 identifying as other. This will be important when it comes to formal analysis later - how confident can we be when inferring the characteristics of the ‘other’ group, when there are so few represented in the data? At this point, we can add some fit lines, as before, to help us draw some conclusions. Adding a fit line to each group might show some variations in slope between each group.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), 
             position = position_jitter(width = 0.5, height = 0.5)) +
  geom_smooth(mapping = aes(x = daily_gaming_hours, y = social_isolation_score), method='lm') +
  theme_gray(base_size=12) +
  labs(x = "Hours gaming daily", y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle") +
  facet_grid(.~gender)

01.3 Understanding statistical control

It is a bit difficult to tell from this plot. Combining the plots into a single plot field with overlaid fit lines could help, as we will then be able to tell if the slopes are diverging, and if there is a difference in the impact of gaming on social isolation for each gender.

ggplot(data = gaming) +
  geom_point(mapping = aes(x = daily_gaming_hours, 
                          y = social_isolation_score, 
                          color = gender), 
             position = position_jitter(width = 0.5, height = 0.5),
             alpha = 0.5) +
  geom_smooth(mapping = aes(x = daily_gaming_hours, 
                           y = social_isolation_score, 
                           color = gender,
                           fill = gender,
                           linetype = gender), 
              method = 'lm') +
  theme_gray(base_size = 12) +
  labs(x = "Hours gaming daily", 
       y = "Social isolation score",
       title = "Gaming and Social Isolation by Gender",
       caption = "Source: Kaggle",
       color = "Gender",
       fill = "Gender",
       linetype = "Gender")

The difference is subtle, but we can see a slightly lesser tendency toward social isolation for each additional unit of gaming for those of ‘other’ gender, relative to males and females. The difference is small, and we need to be mindful of the problem of small sample size here. For now, treat it as an exercise in searching visually through your data for a possible analytical strategy. From this, we can deduce a few things about the relationship between gaming, social isolation, genre, and gender.

Gaming is positively associated with social isolation - the longer the gaming time, the greater the social isolation.
Gender appears to play a small role in explaining differences in the effect of gaming time on social isolation.
In plainer English, the ‘penalty’ (increase) in social isolation at greater levels of gaming time seems to be greater for males than females, and smaller for those of ‘other’ gender.
All genders still experience the impact of gaming time on isolation, but it seems to be slightly less so for the ‘other’ group.
FPS and MOBA players play for slightly less time on average than players of other genres, but the difference in median play time is small, whilst FPS and MOBA players are less isolated. Here, gaming time and genre may interact in their effect on social isolation.
Exercise is negatively related to social isolation - the longer the average exercise, the lower the social isolation.
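The caution about the small ‘other’ group can be made a little more concrete. One common yardstick, which we have not formally introduced yet, is the standard error of a mean: the group’s standard deviation divided by the square root of its size. The sketch below takes the N and SD values for social_isolation_score from the summary table by gender earlier in this section; the object names are my own.

```r
# N and SD of social_isolation_score by gender, copied from the summary table
group_n  <- c(Female = 331, Male = 647, Other = 22)
group_sd <- c(Female = 1.95, Male = 2.17, Other = 1.83)

# Standard error of the mean: SD divided by the square root of the group size
se <- group_sd / sqrt(group_n)
print(round(se, 3))
```

The standard error for the ‘other’ group comes out at roughly 0.39, against roughly 0.09 for males: the mean for the smallest group is pinned down far less precisely, even though its standard deviation is similar.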

Where might we go from here? Readers with some familiarity with the subject will know that the visuals point toward a multivariate analysis. Enter all of the possibly relevant conditions into a statistical model and examine the resulting parameters. Check the effect sizes, and draw inferences about the slopes and between-group differences in means. This jargon will become clearer later - but you already understand it (seriously). The ‘slope’ is a more precise way of quantifying how much social isolation changes with every single-unit increase in x (either exercise, or gaming time). The ‘between-group differences’ we have already examined in our plots of gender and genre, where we can see the differences in average isolation or gaming time for each genre or gender. And while this example draws on somewhat unusual, curated teaching data, it will give us a good foundation for seeking out relationships in the wider world. We will try this now with macrodata (country-level data) from Eurostat.
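As a preview of what a ‘slope’ looks like in practice, the sketch below fits straight lines with lm(), the same method that geom_smooth(method = 'lm') uses for the fit lines in our plots. It is a rough illustration only, and it runs on simulated data: the sim data frame, the two gender groups, and the slopes of 0.4 and 0.6 are all invented for the example and are not estimates from the gaming dataset.

```r
# Simulated stand-in data (NOT the real Kaggle file); values are invented
set.seed(8)
n <- 300
sim <- data.frame(
  gender = sample(c("Female", "Male"), n, replace = TRUE),
  daily_gaming_hours = runif(n, min = 0.5, max = 15)
)
# Invented 'true' slopes: 0.4 for females, 0.6 for males
true_slope <- ifelse(sim$gender == "Male", 0.6, 0.4)
sim$social_isolation_score <- 1 + true_slope * sim$daily_gaming_hours +
  rnorm(n, sd = 1)

# One line for everyone: the coefficient on daily_gaming_hours is the slope,
# i.e. the change in isolation for each additional hour of gaming
fit_single <- lm(social_isolation_score ~ daily_gaming_hours, data = sim)
print(coef(fit_single))

# Separate lines by gender: the interaction term measures how much the
# slope for males differs from the slope for females (the reference group)
fit_by_gender <- lm(social_isolation_score ~ daily_gaming_hours * gender,
                    data = sim)
print(coef(fit_by_gender))
```

In the second model, the coefficient named daily_gaming_hours:genderMale is the difference in slopes between the two groups, which is the quantity we were trying to judge by eye from the overlaid fit lines.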

02. Visualising Relationships with Macrodata, or Telling Interesting Stories About Countries

02.1 Units of analysis and the ecological problem (fallacy)