# To set your working directory, either navigate to the file location manually (click the three dots, then set it as the working directory)

#  OR

# Manually add your working directory in the line below (remove the # in front to uncomment the line!)

# setwd("")

## example in Windows format (Mac users beware): setwd("C:/Users/rpederse/OneDrive - Texas Tech University/Teaching/SNA/EDCI 6306 SNA/Labs/Lab 0")

0. INTRODUCTION

Background

Welcome to using EDCI 6306 Social Network Analysis as a Research Method! Each of the labs will include a “walk through” that will focus on a basic analysis using social network analysis techniques that you’ll be expected to apply during an independent analysis. This getting started task is designed to orient you to both our data analysis assignments and to R, RStudio, and/or RMarkdown, which we’ll be using to complete those assignments.

Organization

This independent practice is really a warm-up. It is a chance to become familiar with how RStudio works. In the context of doing so, we’ll focus on five things:

  1. Reading data into R (in the Prepare section)
  2. Preparing and “wrangling” data in table (think spreadsheet!) format (in the Wrangle section)
  3. Creating some plots (in the Explore section)
  4. Running a model - specifically, a regression model (in the Model section)
  5. Finally, creating a reproducible report of your work you can share with others (in the Communicate section)

You may be wondering what these bolded terms refer to; what’s so special about preparing, wrangling, exploring, and modeling data - and communicating results? We’re using these terms as a part of a framework, or workflow, that comes from the work of Krumm et al.’s Learning Analytics Goes to School.

Essentially, this document is organized around these five components of the Data Intensive Research Workflow.

Click the arrow to the right of the code chunk below to view the image (more on that process of clicking the green arrow and what it does, too, in a moment)!

knitr::include_graphics("youdidit.jpg")

How to use this document

This is an R Notebook. There are two keys to your use of it:

  1. First, be sure that you are viewing the document in the “Visual Editor” mode. You can use this mode by clicking the symbol that appears like a letter A (or the tip of a pencil!) in the top right of this window.
  2. Second, click “Preview” at the top of this screen to preview the document as you work through it. This will allow you to see your code and its output in a rendered, easy-to-read document.

Let’s get started!

1. PREPARE

By preparing, we refer to developing a question or purpose for the analysis, which you likely know from your research can be difficult! This part of the process also involves developing an understanding of the data and what you may need to analyze the data. Often this involves looking at the data and its documentation. For now, we’ll focus on just a few parts of this process, diving in much more deeply over the coming weeks.

Packages 📦

R uses “packages,” add-ons that enhance its functionality. One package that we’ll be using is the tidyverse. The {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data, all of which share an underlying design philosophy, grammar, and data structures.

Before we can begin using these packages, we will need to install them using the install.packages() function built into R.

Click the green arrow in the right corner of the block (or “chunk”) of code that follows and see if you can identify which packages have been installed in the console below.

FYI, it may say “do you want to restart R before installing?” - I will often click “no” to this!

install.packages("tidyverse")
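Installation only needs to happen once per computer, so it can help to check whether a package is already there before installing it again. The chunk below is a minimal sketch of that check using base R’s requireNamespace(); the variable name pkg is ours, chosen just for illustration.

```r
# The package we want to check for (stored as text, so it needs quotation marks).
pkg <- "tidyverse"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise.
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is already installed; load it with library(", pkg, ").")
} else {
  message(pkg, " is not installed yet; run install.packages(\"", pkg, "\").")
}
```

Either way, the chunk only prints a message; it does not install anything on its own.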

Once these packages have been installed, we will need to load them in order to use the handy functions they contain.

To load the tidyverse, click the green arrow in the right corner of the block (or “chunk”) of code that follows. Notice that we do not need to use the quotation marks again because the {tidyverse} package and packages it contains are now a part of our package library!

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Please do not worry if you saw a number of messages: those probably mean that the tidyverse loaded just fine. If you see an error, though, try to interpret or search via your search engine the contents of the error, or reach out to us for assistance.

Loading (or reading in) data

Next, we’ll load data - specifically, a CSV file, the kind that you can export from Microsoft Excel or Google Sheets - into R, using the read_csv() function in the next chunk.

Clicking the green arrow runs the code; do that next.

d <- read_csv("data/sci-online-classes.csv")
## Rows: 603 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): course_id, subject, semester, section, Gradebook_Item, Gender
## dbl (23): student_id, total_points_possible, total_points_earned, percentage...
## lgl  (1): Grade_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice that we “assigned” our data set to a new object in R named d that will now be saved in your environment pane in the upper right corner of RStudio. Go ahead and take a look to make sure it’s there.
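If you would like to see read_csv() and assignment at work on something tiny, here is a small sketch. It writes a made-up two-row CSV to a temporary file (the names tmp and toy, and the values, are invented for illustration) and reads it back in. The show_col_types = FALSE argument is the one mentioned in the message above; it quiets the column specification print-out.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("student_id,grade", "1,88.5", "2,92.0"), tmp)

# Read the file and assign the result to an object named toy.
# show_col_types = FALSE suppresses the column specification message.
toy <- read_csv(tmp, show_col_types = FALSE)

# Typing the object's name prints it.
toy
```

After running this, toy should appear in your environment pane alongside d.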

Your Turn

Why do you think we included data/ before our sci-online-classes.csv file? Why quotation marks?

Add your responses after the dashes below:

  • It is to assign it to the data folder in our working directory.

Hint: check the files pane in the lower right corner of RStudio.
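One way to check your thinking is to ask R where it is currently looking for files. The sketch below uses only base R functions; rel_path is a name we made up for illustration. Whether file.exists() returns TRUE depends on your own working directory, so no particular result is assumed here.

```r
# Where is R currently looking for files? This is your working directory.
getwd()

# A relative path starts from that folder: "data" names a subfolder inside it.
rel_path <- file.path("data", "sci-online-classes.csv")
rel_path

# file.exists() returns TRUE only if a file is found at that location.
file.exists(rel_path)
```

If file.exists() returns FALSE, the working directory is probably not the folder that contains data/.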

Viewing or inspecting data

Last, let’s check that the code worked as we intended; run the next chunk and look at the results, tabbing left or right with the arrows, or scanning through the rows by clicking the numbers at the bottom of the pane with the print-out of the data you loaded:

d

Your Turn

What do you notice about this dataset? What do you wonder? Add one or two thoughts after the dash below:

  • It is curious that these are course IDs but have different numbers of total possible points for similar sections. My first thought was maybe extra credit opportunities, but the range is too big for that.

  • Looking at page 1, FrScA-S116-02 for student 53447 has a possible points of 4655 and student 53475 only has a possible points of 1710. Interesting…

There are other ways to inspect your data; the glimpse() function provides one such way. Run the code below to take a glimpse at your data.

glimpse(d)
## Rows: 603
## Columns: 30
## $ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
## $ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
## $ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
## $ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
## $ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
## $ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
## $ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
## $ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
## $ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
## $ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
## $ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
## $ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
## $ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
## $ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
## $ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
## $ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
## $ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
## $ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
## $ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
## $ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
## $ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
## $ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
## $ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
## $ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
## $ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
## $ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
## $ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
## $ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
## $ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Rows typically represent “cases,” the units that we measure, or the units on which we collect data. What counts as a “case” (and therefore what is represented as a row) varies by (and within) fields. There may be multiple types or levels of units studied in your field; listing more than one is fine! Also, please consider what columns, which usually represent variables, represent in your area of work and/or research.
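Because the glimpse() print-out turns the table on its side (each variable is listed on its own line), it can be easy to mix up rows and columns. The small sketch below uses a made-up table named toy (the values are invented for illustration) to show how base R counts each.

```r
# A made-up table: each row is one student (a case), each column is a variable.
toy <- data.frame(
  student_id = c(101, 102, 103),
  subject    = c("BioA", "OcnA", "BioA"),
  grade      = c(88, 92, 75)
)

nrow(toy)   # number of cases (rows)
ncol(toy)   # number of variables (columns)
names(toy)  # the variable names
```

You can run the same three functions on d to confirm the counts reported when the data was read in.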

Your Turn

How many “cases” or observations are in this data set?

  • According to the console there are 603 rows, so 603 cases. I am a bit confused as it says there are 603 rows with 30 columns. From glimpse it looks like the student IDs are the columns which there would be 603 of, yet the console states there are only 30 columns. Update: After the wrangle section, the data makes much more sense.

Pick two columns (or more) and write what you think it represents:

  • Assuming that the columns are students I think it gives you a rundown of their grades for a course. Sort of like a report card.

  • I also notice there is gender information in the data set so we can now disaggregate by gender when analyzing the data.

  • I really like the time spent information. That would be super helpful as a teacher!

-

Next, we’ll use a few functions that are handy for preparing data in table form.

2. WRANGLE

By wrangle, we refer to the process of cleaning and processing data and, in some cases, merging (or joining) data from multiple sources. Often, this part of the process is surprisingly time-intensive. Wrangling your data into shape can itself be an important accomplishment! There are great tools in R to do this, especially through the use of the {dplyr} R package.

Selecting variables

Let’s select only a few variables by typing our dataset d and “passing” that using the %>% operator to the select() function from the {dplyr} package:

d %>% 
  select(student_id, total_points_possible, total_points_earned, TimeSpent)

Notice how the number of columns (variables) is now different.

Let’s include one additional variable in your select function.

First, we need to figure out what variables exist in our dataset (or be reminded of this - it’s very common in R to be continually checking and inspecting your data)!

In addition to the glimpse() function, you can use a function named View() to do this. Try it out and see what happens!

View(d)

Your Turn

In the code chunk below, add a new variable to the code, being careful to type the new variable name as it appears in the data. I’ve added some code to get you started. Consider how the names of the other variables are separated as you think about how to add an additional variable to this code.

d %>% 
  select(student_id, total_points_possible, total_points_earned, TimeSpent, semester)

Once added, the output should be different than in the code above - there should now be an additional variable included in the print-out.
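Typing every variable name can get tedious when many columns share a pattern, like the survey items q1 through q10. Here is a small sketch of two other ways to use select(), shown on a made-up table (toy, survey, and no_time are names we invented for illustration).

```r
library(dplyr)

# A made-up table with an ID, two survey items, and a time variable.
toy <- tibble(
  student_id = c(101, 102),
  q1 = c(5, 4),
  q2 = c(4, 4),
  TimeSpent = c(1555, 1383)
)

# Keep student_id plus every column whose name starts with "q".
survey <- toy %>%
  select(student_id, starts_with("q"))

# Drop a single column by putting a minus sign in front of its name.
no_time <- toy %>%
  select(-TimeSpent)

names(survey)
names(no_time)
```

In this toy table both approaches happen to leave the same three columns; on d they would give quite different results.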

Filtering rows

Next, let’s explore filtering, which keeps only the rows that meet a condition on a variable. Check out and run the next chunk of code, imagining that we wish to filter our data to view only the rows associated with students who earned a final grade (as a percentage) above 70.

d %>% 
  filter(FinalGradeCEMS > 70)
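A small detail worth noticing: > means strictly greater than, so a grade of exactly 70 is left out, while >= keeps it. Rows where the grade is missing (NA) are dropped by filter() in both cases. The sketch below shows this on a made-up table of four students (the IDs and grades are invented for illustration).

```r
library(dplyr)

# Four made-up students: one below 70, one at exactly 70, one above, one missing.
toy <- tibble(
  student_id = c(101, 102, 103, 104),
  FinalGradeCEMS = c(65, 70, 85, NA)
)

# Strictly greater than 70: only the student with 85 remains.
above_70 <- toy %>%
  filter(FinalGradeCEMS > 70)

# 70 or higher: the students with 70 and 85 remain.
at_least_70 <- toy %>%
  filter(FinalGradeCEMS >= 70)

nrow(above_70)
nrow(at_least_70)
```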

Your Turn

In the next code chunk, change the cut-off from 70% to some other value - larger or smaller (maybe much larger or smaller - feel free to play around with the code a bit!).

d %>% 
  filter(FinalGradeCEMS < 70)

What happens when you change the cut-off from 70 to something else? Add a thought (or more):

  • When I changed the cutoff from 70 to 90 the rows went from 438 to 203. This means that nearly half of the students who passed received an “A” (or at least a 90).

  • I then flipped the sign and ran < 70. This reduced the rows to 135, which indicates the number of students who failed (assuming a standard grading scale).

Arrange

The last function we’ll use for preparing tables is arrange().

We’ll combine this arrange() function with a function we used already - select(). We do this so we can view only the student ID and their final grade.

d %>% 
  select(student_id, FinalGradeCEMS) %>% 
  arrange(FinalGradeCEMS)

Note that arrange works by sorting values in ascending order (from lowest to highest); you can change this by using the desc() function with arrange, like the following:

d %>% 
  select(student_id, FinalGradeCEMS) %>% 
  arrange(desc(FinalGradeCEMS))

Your Turn

In the code chunk below, replace FinalGradeCEMS that is used with both the select() and arrange() functions with a different variable in the data set. Consider returning to the code chunk above in which you glimpsed at the names of all of the variables.

d %>% 
  select(student_id, total_points_possible) %>% 
  arrange(desc(total_points_possible))

Optional

Can you compose a series of functions that includes the select(), filter(), and arrange() functions? Recall that you can “pipe” the output from one function to the next, as when we used select() and arrange() together in the code chunk above.

This reach is not required/necessary to complete; it’s just for those who wish to do a bit more with these functions at this time.

d %>%
  select(subject, course_id, total_points_possible) %>%
  arrange(desc(total_points_possible))
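For reference, here is one possible way to chain all three functions, sketched on a small made-up table (the rows in toy are invented for illustration, and filtering on FrScA is just an example condition). Each %>% hands the result of one step to the next.

```r
library(dplyr)

# A made-up table with the same three variables used above.
toy <- tibble(
  subject = c("BioA", "FrScA", "OcnA", "FrScA"),
  course_id = c("BioA-S116-01", "FrScA-S116-02", "OcnA-S116-01", "FrScA-S216-01"),
  total_points_possible = c(2207, 4655, 3531, 1710)
)

result <- toy %>%
  select(subject, course_id, total_points_possible) %>%  # keep three columns
  filter(subject == "FrScA") %>%                         # keep only FrScA rows
  arrange(desc(total_points_possible))                   # sort highest to lowest

result
```

Note that filter() can only use columns that are still present, so the filtering variable has to be included in select() (or the filter() step has to come first).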

3. EXPLORE

Exploratory data analysis, or exploring your data, involves processes of describing your data (such as by calculating the means and standard deviations of numeric variables, or counting the frequency of categorical variables) and, often, visualizing your data prior to modeling. In this section, we’ll create a few plots to explore our data.

Histogram

The code below creates a histogram, or a distribution of the values, in this case for students’ final grades.

ggplot(d, aes(x = FinalGradeCEMS)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 30 rows containing non-finite outside the scale range
## (`stat_bin()`).
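The two messages above are nudges rather than errors: one suggests choosing a bin width yourself, and the other reports that rows with missing grades were left out. A minimal sketch of how to address both, using a made-up set of grades (toy is invented for illustration, and a bin width of 5 points is just one reasonable choice):

```r
library(ggplot2)

# Seven made-up final grades plus one missing value.
toy <- data.frame(FinalGradeCEMS = c(55, 62, 71, 78, 84, 90, 93, NA))

# binwidth = 5 makes each bar cover 5 grade points;
# na.rm = TRUE drops the missing grade without printing a warning.
p <- ggplot(toy, aes(x = FinalGradeCEMS)) +
  geom_histogram(binwidth = 5, na.rm = TRUE)

p
```

The same two arguments can be added to geom_histogram() in the chunks that use d.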

You can change the color of the histogram bars by specifying a color as follows:

ggplot(d, aes(x = FinalGradeCEMS)) +
  geom_histogram(fill = "blue")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 30 rows containing non-finite outside the scale range
## (`stat_bin()`).

Changing colors

Your Turn

In the code chunk below, change the color to one of your choosing; consider this list of valid color names here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

ggplot(d, aes(x = FinalGradeCEMS)) +
  geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 30 rows containing non-finite outside the scale range
## (`stat_bin()`).

Finally, we’ll make one more change: visualize the distribution of another variable in the data, one other than FinalGradeCEMS. You can do so by replacing FinalGradeCEMS with the name of another variable. Also, change the color to one other than blue.

ggplot(d, aes(x = total_points_possible)) +
  geom_histogram(fill = "green")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Optional

Completed the above? Nice job! Try for a “reach” by creating a scatter plot for the relationship between two variables. You will need to pass the names of two variables to the code below for what is now simply XXX (a placeholder).

ggplot(d, aes(x = total_points_earned, y = total_points_possible)) +
  # color (not fill) controls the color of the default point shape
  geom_point(color = "black")

4. MODEL

“Model” is one of those terms that has many different meanings. For our purpose, we refer to the process of simplifying and summarizing our data. Thus, models can take many forms; calculating means represents a legitimate form of modeling data, as does estimating more complex models, including linear regressions, and models and algorithms associated with machine learning tasks. For now, we’ll run a linear regression to predict students’ final grades.

Below, we predict students’ final grades, FinalGradeCEMS, which is on a 0-100 point scale, on the basis of the time they spent on the course (measured through their learning management system; the model uses TimeSpent_hours, which records that time in hours) and the subject (one of five) of their specific course.

m1 <- lm(FinalGradeCEMS ~ TimeSpent_hours + subject, data = d)
summary(m1)
## 
## Call:
## lm(formula = FinalGradeCEMS ~ TimeSpent_hours + subject, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.378  -8.836   4.816  12.855  36.047 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     57.39317    2.33822  24.546  < 2e-16 ***
## TimeSpent_hours  0.42659    0.03909  10.912  < 2e-16 ***
## subjectBioA     -1.55965    3.60531  -0.433    0.665    
## subjectFrScA    11.73065    2.21438   5.297 1.68e-07 ***
## subjectOcnA      1.09745    2.57715   0.426    0.670    
## subjectPhysA    16.03572    3.07129   5.221 2.50e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.8 on 567 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.213,  Adjusted R-squared:  0.2061 
## F-statistic: 30.69 on 5 and 567 DF,  p-value: < 2.2e-16

There is a lot to unpack in this output, but for now the most important values to look at are those in the Estimate column, which represent the intercept and slopes for your linear regression model.

Note that the estimate for TimeSpent_hours is about 0.43 and statistically significant. We can interpret this as telling us that for every additional hour students spend on the course, their estimated final grade (on a 0-100 scale) is about 0.43 points higher, holding subject constant. The intercept, which is around 57, is the estimated final grade for a student in the reference subject (the one subject that does not have its own row in the output) who spent zero hours on the course. So if a student in that subject spent, for instance, 40 hours on the course, their estimated final grade would be 57 + (0.43 * 40), or around 74.
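To make that arithmetic concrete, here is a short worked sketch that uses the unrounded estimates from the summary(m1) output above. The object names (intercept, slope_hours, and so on) are ours, chosen for illustration, and the 40 hours is just an example value.

```r
# Coefficients copied from the summary(m1) output above.
intercept   <- 57.39317
slope_hours <- 0.42659
frsca_shift <- 11.73065  # the subjectFrScA estimate

hours <- 40

# Estimated grade for a student in the reference subject
# (the subject that does not have its own row in the output).
pred_reference <- intercept + slope_hours * hours

# Same number of hours, but enrolled in FrScA: add that subject's estimate.
pred_frsca <- pred_reference + frsca_shift

round(pred_reference, 1)  # 74.5
round(pred_frsca, 1)      # 86.2
```

In other words, the subject estimates shift the whole line up or down, while the TimeSpent_hours estimate sets its slope.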

Your Turn

Notice how above the variables are separated by a + symbol. Below, add another - a third - variable to the regression model. Specifically, add a variable for students’ initial, self-reported interest in science, int - and any other variable(s) you like!

m2 <- lm(FinalGradeCEMS ~ TimeSpent + subject + int + semester, data = d)
summary(m2)
## 
## Call:
## lm(formula = FinalGradeCEMS ~ TimeSpent + subject + int + semester, 
##     data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70.65  -8.17   4.73  13.35  41.35 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.3263168  7.0025910   8.472 2.78e-16 ***
## TimeSpent     0.0072791  0.0006974  10.437  < 2e-16 ***
## subjectBioA  -1.7247690  3.9464791  -0.437   0.6623    
## subjectFrScA 14.1276046  2.4123437   5.856 8.64e-09 ***
## subjectOcnA   4.2422559  2.7608781   1.537   0.1250    
## subjectPhysA 17.6501783  3.2753432   5.389 1.10e-07 ***
## int          -0.6816621  1.5019616  -0.454   0.6501    
## semesterS216 -3.8968476  1.9034186  -2.047   0.0412 *  
## semesterT116  3.0188918  4.4532788   0.678   0.4982    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.87 on 494 degrees of freedom
##   (100 observations deleted due to missingness)
## Multiple R-squared:  0.2373, Adjusted R-squared:  0.2249 
## F-statistic: 19.21 on 8 and 494 DF,  p-value: < 2.2e-16

What do you notice about the results? We’re going to dive into this much more: if you have many questions now, you’re in the right spot!

  • I added semester as a variable and it came up with semesterS216 having a 0.05 significance code.
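One thing that may look surprising is how small the TimeSpent estimate is in this model compared with the TimeSpent_hours estimate in m1. That is mostly a matter of units: TimeSpent is recorded in minutes, so its estimate is per minute. A quick sketch of the conversion (per_minute and per_hour are names we made up for illustration):

```r
# TimeSpent estimate copied from the summary(m2) output above (per minute).
per_minute <- 0.0072791

# There are 60 minutes in an hour, so multiply to get a per-hour estimate.
per_hour <- per_minute * 60

per_hour  # 0.436746
```

That is close to, though not the same as, the 0.42659 from m1; the two models include different variables, so we would not expect the estimates to match exactly.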

5. COMMUNICATE

I would love to know how this was for you! In the space below, answer the following questions:

Your Turn

  1. On a scale of 1 (This was the worst thing I have ever done) to 5 (Not terrible, not great) to 10 (Easiest thing ever), tell me how Lab 0 went for you!

    1. This lab was actually pretty fun. I would rate it as a 7-8 as far as difficulty is concerned. I am looking forward to learning more about RStudio.
  2. What is one thing that was particularly useful during this lab?

    1. The instructions and chunks of code were super helpful. Also, I wasn’t certain how much we needed to write in the comments. If I did not comment enough please let me know and I will be sure to put more next time.

Great job! Once you’ve finished your work, click the arrow beside the button you used to “Preview” your document to see what it will look like when you share it with others.

When everything looks good, click “Knit to HTML” at the top to render a report that can be viewed using a web browser and shared online.

Upload both your .rmd and .html files to the assignments page.

Congratulations on getting started! You’re ready for our Module 1 lab!