1. PREPARE

Our brief case study for Unit 5 is adapted from an exercise on network influence included in the appendix of Data Science in Education Using R. In this case study we explore the process of network influence, which is the flip side of the network selection processes we examined in our previous case study. Both of these processes are commonly the focus of statistical analyses of networks. Selection and influence, however, do not interact independently and often affect each other reciprocally (Xu, Frank, & Penuel, 2018).

Let’s first quickly define these two processes:

  • Selection: the process of choosing relationships

  • Influence: the process of how our social relationships affect behavior

1a. Research Question

In Unit 5, we will briefly examine a hypothetical example of peer influence on attitudes towards STEM careers among a small group of friends. Specifically, we will try to answer the following question:

Are student’ attitudes towards careers in STEM influenced by the attitudes of their friends?

1b. Load Libraries

While selection and influence processes are complex, it is possible to study them using data about people’s relationships and behavior. Happily, the use of these methods has expanded along with R. In fact, long-standing R packages have become some of the best tools for studying social networks. Additionally, while there are many nuances to studying selection and influence, these are models that can be carried out with relatively simple modeling techniques like linear regression, as we’ll see later in this case study.

👉 Your Turn ⤵

To begin, load the the following packages that will be using for our analysis of network influence:

library(tidyverse)
library(tidygraph)
library(ggraph)
library(moderndive)

2. WRANGLE

To examine a hypothetical case of network selection, we’ll create three different data frames:

  • One edgelist that contains the nominator and nominee for a relationship. For example, if Stefanie says that José is her friend, then Stefanie is the nominator and José the nominee. Our edgelist will also contain variable indicating the weight, or strength, of their relation.

  • Two nodelists indicating the values of some behavior - an outcome - at two different time points. In our case, these values will represent a measure of students’ attitudes towards careers in STEM.

2a. Create an Edgelist

First let’s create an edgelist with three variables for a nominator, nominee, and strength of the relation. Run the following code to create tibble – a tidy table or data frame – that contains information about each friendship dyad in our network with a corresponding strength of “1”, which simply indicates the presence or absence of a tie:

student_edges <- tibble(nominator = c(2, 1, 3, 1, 2, 6, 3, 5, 6, 4, 3, 4), 
                        nominee = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6), 
                        relate = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) ) 

student_edges
## # A tibble: 12 x 3
##    nominator nominee relate
##        <dbl>   <dbl>  <dbl>
##  1         2       1      1
##  2         1       2      1
##  3         3       2      1
##  4         1       3      1
##  5         2       3      1
##  6         6       3      1
##  7         3       4      1
##  8         5       4      1
##  9         6       4      1
## 10         4       5      1
## 11         3       6      1
## 12         4       6      1

2a. Create Nodelists

Next, let’s create a data frame, or tibble, that contains the nominee and the values of a behavior at the first time point. For our hypothetical example, our time_1 variable represents a composite score for each students’ attitude towards STEM careers at the beginning of the semester and measured on a scale of -3 (very negative) to 3 (very positive).

nominee_data <- tibble(nominee = c(1, 2, 3, 4, 5, 6), 
                       time_1 = c(2.4, 2.6, 1.1, -0.5, -3, -1)
                       )

nominee_data
## # A tibble: 6 x 2
##   nominee time_1
##     <dbl>  <dbl>
## 1       1    2.4
## 2       2    2.6
## 3       3    1.1
## 4       4   -0.5
## 5       5   -3  
## 6       6   -1

We will use this data later in our Explore section to create a new measure called an “exposure term” that captures the extent to which an individual is “exposed” to some attribute or characteristic there alters in the network.

👉 Your Turn ⤵

Create a third data frame for data collected at the end of the semester on students’ attitudes towards STEM careers called nominator_data. Your output should look something like this:

nominator time_2
1 2.0
2 2.0
3 1.0
4 -0.5
5 -2.0
6 -0.5
# YOUR CODE HERE
nominator_data <-  tibble(nominator = c(1, 2, 3, 4, 5, 6), 
                       time_2 = c(2.0, 2.0, 1.0, -0.5, -2.0, -0.5)
                       )

nominator_data
## # A tibble: 6 x 2
##   nominator time_2
##       <dbl>  <dbl>
## 1         1    2  
## 2         2    2  
## 3         3    1  
## 4         4   -0.5
## 5         5   -2  
## 6         6   -0.5

2c. Create Network with Attributes

Recall from previous units that we introduced the {tidygraph} package for preparing and summarizing network data. Similar to previous units, we will use the tbl_graph() function to convert our student_edges and nominee_data into a network graph object, by including the following arguments and supplying the appropriate code:

  • edges = expects a data frame, in our case student_edges, containing information about the edges in the graph. The nodes of each edge must either be in a to and from column, or in the two first columns like the data frame we provided.

  • nodes = expects a data frame, in our case nominee_data, containing information about the nodes in the graph.

  • directed = specifies whether the constructed graph be directed.

# YOUR CODE HERE
student_network <- tbl_graph(edges = student_edges,
                             nodes = nominee_data,
                             directed = TRUE)


student_network
## # A tbl_graph: 6 nodes and 12 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 6 x 2 (active)
##   nominee time_1
##     <dbl>  <dbl>
## 1       1    2.4
## 2       2    2.6
## 3       3    1.1
## 4       4   -0.5
## 5       5   -3  
## 6       6   -1  
## #
## # Edge Data: 12 x 3
##    from    to relate
##   <int> <int>  <dbl>
## 1     2     1      1
## 2     1     2      1
## 3     3     2      1
## # ... with 9 more rows

Congrats! You made it to the end of data wrangling section and are ready to start analysis!


3. EXPLORE

In Section 3, we use the {tidygraph} package for retrieving network descriptives and the {ggraph} package to create a network visualization to help illustrate these metrics. Specifically, in this section we will:

  1. Examine Basic Descriptives. We quick revisit calculating basic centrality measures for our nodes and focus primarily on creating an “exposure term” and summarizing each students exposure to his friends attitudes towards STEM careers.

  2. Make a Sociogram. Finally, we wrap up the explore phases by creating a simple network and tweak key elements like the size, shape, and position of nodes and edges to better at communicating key findings.

3a. Examine Descriptives

As noted in SNA and Education (Carolan 2014), many analyses of social networks are primarily descriptive and aim to either represent the network’s underlying social structure through data-reduction techniques or to characterize network properties through network measures.

Calculate Node Degree

Recall that degree is the number of ties to and from an ego, or in the case of a “simple graph” like ours, degree is simply the number of people to whom someone is connected. In a directed network, in-degree is the number of connections or ties received, whereas out-degree is the number of connections or ties sent.

The activate() function from the {tidygraph} package allows us to treat the nodes in our network object as if they were a standard data frame to which we can then apply tidyverse functions such as select(), filter(), and mutate().

We can use the mutate() functions to create new variables for nodes such as measures of degree, in-degree, and out-degree using the centrality_degree() function in the {tidygraph} package.

👉 Your Turn ⤵

Use the following code chunk to calculate some basic degree measures for each student:

#YOUR CODE HERE
student_network <- student_network |>
  activate(nodes) |>
  mutate(in_degree = centrality_degree(mode = "in"),
         out_degree = centrality_degree(mode = "out"),
         all_degree = centrality_degree(mode = "all"))
student_network
## # A tbl_graph: 6 nodes and 12 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 6 x 5 (active)
##   nominee time_1 in_degree out_degree all_degree
##     <dbl>  <dbl>     <dbl>      <dbl>      <dbl>
## 1       1    2.4         1          2          3
## 2       2    2.6         2          2          4
## 3       3    1.1         3          3          6
## 4       4   -0.5         3          2          5
## 5       5   -3           1          1          2
## 6       6   -1           2          2          4
## #
## # Edge Data: 12 x 3
##    from    to relate
##   <int> <int>  <dbl>
## 1     2     1      1
## 2     1     2      1
## 3     3     2      1
## # ... with 9 more rows

4a. Determine “Exposure”

Next we’ll create an simple exposure term. The idea is behind this concept is relatively straightforward.

The exposure term captures how your interactions or relationships with someone over a period of time impacts an outcome.

Our simple linear regression model in the next section will test whether changes attitudes towards STEM careers were significatnly impacted by the attitudes of students they identified as friends.

First, let’s create a new data frame called student_data which combines our edgelist along with the attitudes of the nominee.

student_data <- left_join(student_edges,
                          nominee_data,
                          by = "nominee")

student_data
## # A tibble: 12 x 4
##    nominator nominee relate time_1
##        <dbl>   <dbl>  <dbl>  <dbl>
##  1         2       1      1    2.4
##  2         1       2      1    2.6
##  3         3       2      1    2.6
##  4         1       3      1    1.1
##  5         2       3      1    1.1
##  6         6       3      1    1.1
##  7         3       4      1   -0.5
##  8         5       4      1   -0.5
##  9         6       4      1   -0.5
## 10         4       5      1   -3  
## 11         3       6      1   -1  
## 12         4       6      1   -1

In the data frame above we see that student 2 was “exposed” to a friend (student 1) with a positive attitude towards STEM careers measured at 2.4.

Calculate Exposure Term

To calculate an exposure term, we need to create a new variable, which we’ll call exposure, that captures the both their friend’s attitude and the strength of the relationship.

# Calculating exposure 
student_exposure <- student_data |>
  mutate(exposure = relate * time_1) 


student_exposure
## # A tibble: 12 x 5
##    nominator nominee relate time_1 exposure
##        <dbl>   <dbl>  <dbl>  <dbl>    <dbl>
##  1         2       1      1    2.4      2.4
##  2         1       2      1    2.6      2.6
##  3         3       2      1    2.6      2.6
##  4         1       3      1    1.1      1.1
##  5         2       3      1    1.1      1.1
##  6         6       3      1    1.1      1.1
##  7         3       4      1   -0.5     -0.5
##  8         5       4      1   -0.5     -0.5
##  9         6       4      1   -0.5     -0.5
## 10         4       5      1   -3       -3  
## 11         3       6      1   -1       -1  
## 12         4       6      1   -1       -1

Since the strength of each friendship was based on simply the presence or absence of a named friend as represented by the value of “1”, the exposure term simple equals their nominee’s attitude at time point 1. However, if we had captured valued ties and asked students to indicate the strength of each friendship, such as “best friend”, our exposure terms might be larger for some friends since those friendships may carry more “weight” and have a great impact on the nominators’ attitudes.

Calculate Mean Exposure

Now that we have an exposure term for each students’ relationship, for the purposes of modeling in our next section, we need to create a single measure that attempts to quantify their overall exposure in the network. To do so, we’ll need to calculate the average exposure for each friend nominated, the output of which should look like this:

nominator exposure_mean
1 1.8500000
2 1.7500000
3 0.3666667
4 -2.0000000
5 -0.5000000
6 0.3000000

Since student 1 named two friends whose attitudes towards STEM careers prior to the semester were 2.1 and 1.1 respectively, his mean “exposure” would be (2.6+1.1)/2 = 1.85, somewhat less that their own attitude at the start, which for student one was 2.4. We might expect, there for that student 1’s attitude might decrease by the end of the semester since, on average, he’s exposed to friends with more negative views.

👉 Your Turn ⤵

Complete the following code chunk to create a mean_exposure data frame that summarizes the data from our student_exposure data frame.

#YOUR CODE HERE
mean_exposure <- student_exposure |>
  group_by(nominator) |>
  summarize(mean_exposure = mean(exposure)) 

mean_exposure
## # A tibble: 6 x 2
##   nominator mean_exposure
##       <dbl>         <dbl>
## 1         1         1.85 
## 2         2         1.75 
## 3         3         0.367
## 4         4        -2    
## 5         5        -0.5  
## 6         6         0.3

Hint: consider using the group_by() and summarize() functions to help calculate the mean exposure for each student.

3b. Visualize Network Properties

👉 Your Turn ⤵

Try creating a sociogram for our small student network by modifying the code below and/or adding new ones for layouts, nodes, and edges that highlights the following for students in our network:

  • number of name friends
  • attitudes towards STEM careers
  • corresponding label for each student

Then answer the questions that follow below.

# setting theme_graph 
set_graph_style()


student_network %>% 
  activate(nodes) %>%
  mutate(importance = centrality_pagerank())%>%
  activate(edges) %>%
  mutate(friendly = edge_is_mutual()) %>%
  ggraph() +
  geom_edge_link(aes(alpha = friendly)) +
  geom_node_point(aes(size = importance, colour = importance)) + 
  geom_node_text(aes(label = nominee)) +
  # discrete colour legend
  scale_color_gradient(guide = 'legend')

Based on the the mean exposure calculated above, and the sociogram you just created, what predictions do you have about how students attitudes might change at time 2?

  • I predict that 1,3 and 2 have an decrease in STEM attitude while 4,5,and 6 have an increae or remain the same.

Congrats! You made it to the end of the Explore section and are ready to learn a little about modeling network influence processes using simple linear regression!


4. MODEL

Since our goal is to determine if there is a relationship between a students’ attitude towards careers in STEM based on their exposure to their friends attitudes, we will need to a final data frame for our model that contains their attitudes at time_2 (our outcome and dependent variable), their mean_exposure (our predictor variable), while also controlling for their attitudes at time_1.

First, let’s join our mean_exposure data frame with our nominator_data frame using the nominator variable to match our value in both data frames. Let’s also rename our nominator variable to student to avoid confusion.

final_data <- 
  # data3 already has nominator, so no need to change 
  left_join(mean_exposure, nominator_data, by = "nominator") |>
  rename(student = nominator)

final_data
## # A tibble: 6 x 3
##   student mean_exposure time_2
##     <dbl>         <dbl>  <dbl>
## 1       1         1.85     2  
## 2       2         1.75     2  
## 3       3         0.367    1  
## 4       4        -2       -0.5
## 5       5        -0.5     -2  
## 6       6         0.3     -0.5

👉 Your Turn ⤵

Now that we have our dependent and predictor variables, let also join our nominee_data that contains our time_1 variable for each student.

#YOUR CODE HERE
final_data <- nominee_data |>
  rename(student = nominee) |>
  left_join(final_data, nominee_data, by = "student")
  

final_data
## # A tibble: 6 x 4
##   student time_1 mean_exposure time_2
##     <dbl>  <dbl>         <dbl>  <dbl>
## 1       1    2.4         1.85     2  
## 2       2    2.6         1.75     2  
## 3       3    1.1         0.367    1  
## 4       4   -0.5        -2       -0.5
## 5       5   -3          -0.5     -2  
## 6       6   -1           0.3     -0.5

Hint: Your final_data frame should look something like this since our model will examine how well students’ attitudes ar time_1 as well as their mean_exposure to their named friends’ attitudes predict their attitudes at time_2:

student time_1 mean_exposure time_2
1 2.4 1.8500000 2.0
2 2.6 1.7500000 2.0
3 1.1 0.3666667 1.0
4 -0.5 -2.0000000 -0.5
5 -3.0 -0.5000000 -2.0
6 -1.0 0.3000000 -0.5

Regression (Linear Model)

Calculating the exposure term is the most distinctive and important step in carrying out influence models. Now, we can use a linear model to find out how much relations - as captured by the exposure term - affect their attitudes at the end of the semester.

Run the following code to create a simple linear regression model that contains time_2 as our dependent variable and time_1 and mean_exposure as our “independent” variables.

model1 <- lm(time_2 ~ time_1 + mean_exposure, data = final_data) 

summary(model1)
## 
## Call:
## lm(formula = time_2 ~ time_1 + mean_exposure, data = final_data)
## 
## Residuals:
##        1        2        3        4        5        6 
##  0.02946 -0.09319  0.09429 -0.02730 -0.02548  0.02222 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.11614    0.03445   3.371   0.0434 *  
## time_1         0.67598    0.02406  28.092  9.9e-05 ***
## mean_exposure  0.12542    0.03615   3.470   0.0403 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08232 on 3 degrees of freedom
## Multiple R-squared:  0.9984, Adjusted R-squared:  0.9974 
## F-statistic: 945.3 on 2 and 3 DF,  p-value: 6.306e-05
get_regression_table(model1)
## # A tibble: 3 x 7
##   term          estimate std_error statistic p_value lower_ci upper_ci
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept        0.116     0.034      3.37   0.043    0.006    0.226
## 2 time_1           0.676     0.024     28.1    0        0.599    0.753
## 3 mean_exposure    0.125     0.036      3.47   0.04     0.01     0.24

👉 Your Turn ⤵

Take a look at Chapter 6.2.2 Regression plane from Statistical Inference via Data Science: A ModernDive into R and the Tidyverse to help you provide a basic interpretation of our findings below:

  • For every increase of 1 time point in time_1 , there is an associated increase of on average .7 in time_2. For every 1 unit increase in mean_exposure there is an associated increase on average of .1 in time_2.

It is statistically significant that the sooner students are exposed to positive attitudes and behaviors about STEM careers the higher correlation their positive behaviors towards STEM careers will continue.


5. COMMUNICATE

For your final Your Turn, your goal is to distill our analysis from above into a simple “data product” designed to illustrate key findings about changes in the collaboration network over time:

  1. Select. Select a finding from our analysis above, or a new analysis if so motivated, that you think would be interesting or relevant and that helps answer our research question.

  2. Polish. Create and polish a data visualization and/or data table to communicate your selected findings.

  3. Narrate. Write a brief narrative (2-3 paragraphs) to accompany your visualization and/or table that includes the following:

    • The question or questions guiding the analysis;

    • The conclusions you’ve reached based on our findings;

    • How your audience might use this information;

    • How you might revisit or improve upon this analysis in the future.

👉 Your Turn ⤵

Use the code chunk below create a polished table and/or visualization(s) and write a brief narrative in the space that follows.

Data Visualization or Table

library(ggplot2)
library(reshape2)
library(broom)
model1 %>%
  augment() %>%
  melt(measure.vars = c("time_1", "mean_exposure"), variable.name = c("IV")) %>%
  ggplot(., aes(value, time_2)) +
  geom_smooth(method = "lm") +
  facet_wrap(~IV, scales = "free_x")

Since, I have not used R to create regression visualization I wanted to focus on created said viz. 

When looking directly at the print out of the model it is hard to interpret. But here we see the relationship between each independent variable and the dependent variable. The geom_smooth function shows the 95% confidence intervals.

The mean_exposure has a very large confidence interval, especially at the higher and lower exposure to STEM careers. This could be because there is fewer observations at the higher and lower levels of exposure. AS you may know a smaller sample size leads to a higher standard deviation and also higher confidence intervals.

We could use this model to predict how long a student should be exposed to positive STEM career attitudes as a means of entering the STEM field.

🧶 Knit & Check ✅

Congratulations - you’ve completed the Unit 5 case study! To share your work, click the drop down arrow next to the ball of yarn that says “Knit” at the top of this markdown file, then select “Knit top HTML”. Assuming your code contains no errors, this will create a web page in your Files pane that serves as a record of your work.

Once your file has been knitted, you can publish this file online using RPubs, or share the HTML file through another means.

References

Carolan, Brian. 2014. “Social Network Analysis and Education: Theory, Methods & Applications.” https://doi.org/10.4135/9781452270104.