Step #1 - Load in the Data

The first step of any statistical analysis in R is, of course, to load our data. This can be a little tricky at first, but once you get the hang of it, it becomes fairly simple.

The first thing you want to do is load the library “readxl”. There are other packages for reading other file types, such as CSV or TXT files, but since our data is an Excel file, this is the only one we need. If you do not have readxl installed, you must first run install.packages("readxl").
Second, we need to access our data from our computer. I think it is best practice to create a new folder on your desktop for each project you are doing, and save the data set to that folder.
Now, in the bottom right portion of the screen, there will be a section with different tabs. Click on “Files”. Then go to your desktop, open your new folder, and select your data set.
After you click on it, click on “Import Dataset”. In the window that pops up, there is a code preview towards the bottom right. You can copy and paste that into your code and run it to open the data.
For readability, I like to change my variable names. So, instead of the long FYE survey name, I just named it “df” (data frame). To make sure it works, you can call “head(df)”. By default, head displays the first six rows of your data, which shows you that everything loaded correctly.

*** Another way is to use the “setwd()” command to make your desktop your working directory. I did this, which is why I only had to input the file name and not the entire path. ***
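As a rough illustration of the two options above (full path versus working directory), here is a minimal sketch. It uses a made-up CSV file in a temporary folder as a stand-in so that it runs in base R without the real survey file; the folder name "FYE_Project" and the file "toy_data.csv" are hypothetical. With an Excel file, read_excel takes the path in the same way.

```r
# Hypothetical project folder, created in a temporary location for this sketch.
project_dir <- file.path(tempdir(), "FYE_Project")
dir.create(project_dir, showWarnings = FALSE)

# A stand-in data set so the example runs without a real survey file.
toy <- data.frame(id = 1:10, score = seq(10, 100, by = 10))
csv_path <- file.path(project_dir, "toy_data.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Option 1: read the file using its full path.
df_full <- read.csv(csv_path)

# Option 2: make the folder the working directory, then use only the file name.
old_wd <- setwd(project_dir)
df_short <- read.csv("toy_data.csv")
setwd(old_wd)  # restore the previous working directory

# head() returns the first six rows unless n is given.
first_rows <- head(df_short)
first_three <- head(df_short, n = 3)
```

Both options load the same data; the working directory only changes how much of the path you have to type.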

Install Necessary Packages:

Each package below has a distinct use. “readxl” is needed to import Excel files, “data.tree” is used for one visualization, “ggplot2” is used to make complex visualizations, and so on. Each package serves a purpose, so when you cannot find something in base R, search online and see if there is a package for it! I had the idea for a data tree but had never made one in R, so I looked it up and loaded the package just for that one example. This is a process that occurs frequently.

library(readxl)
library(data.tree)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(ggplot2)
library(tidyr)
library(viridis)

## Loading required package: viridisLite

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:viridis':
## 
##     viridis_pal
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(dplyr)
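The library() calls above only work for packages that are already installed. One possible way to check for missing packages first is sketched below. The helper name missing_packages is my own and not part of any package; requireNamespace and install.packages are base R functions.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1),
                      quietly = TRUE, USE.NAMES = FALSE)
  pkgs[!installed]
}

# Packages that ship with R are always found, so nothing is reported here.
to_install <- missing_packages(c("stats", "utils"))

# With the packages from this document you could then run, for example:
# to_install <- missing_packages(c("readxl", "data.tree", "tidyverse"))
# if (length(to_install) > 0) install.packages(to_install)
```

The install step is left as a comment so that running the sketch never installs anything by accident.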

Now we can import our data into R.

getwd()

## [1] "/Users/sullystefanik/Desktop"

df <- read_excel("FYE_Data.xlsx")
#head(df)

# You can use the head function to double check if the data is loaded in correctly

Step #2 - Begin to sort the data.

Abbreviation Dictionary:

nb = Non-Binary
si = Self-Identify
t = Transfers
nt = Non-Transfer
w = White
l = Latino
b = Black
na = Native American / Indigenous
a = Asian
me = Middle Eastern
nh = Native Hawaiian
pa = Pacific Islander
br = Biracial / Multiracial

In the chunk below, you can display the column names to easily see what everything is labeled as. This is typically most useful with shorter column names. In this specific instance, however, some titles are really long, which makes typing them a bit more challenging. To work with them you can use the following pattern: df$column_name, wrapping the name in backticks when it contains spaces.

In this case, “df” is what we named our data set (short for data frame). The “$” symbol selects a column by its name, exactly as it is written. This is what I used for the longer column names. I left the chunk below as a comment just to show another option, even though I did not use it throughout the project.

Side note, in case you were not aware: the “#” symbol is used to make a comment. R ignores everything after it on that line, so the text is displayed but never executed.

#Column_Names <- data.frame(
  #colnames(df)
#)
#print(Column_Names)
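To make the “$” pattern concrete, here is a small sketch on a made-up data frame. The two column names are copied from the survey, but the values are invented for illustration. Names that contain spaces or punctuation need backticks after “$”, or can be passed as a string inside double square brackets.

```r
# Toy data frame; check.names = FALSE keeps the column names exactly as written.
toy <- data.frame(
  `Response ID` = c(101, 102, 103),
  `Are you a transfer student?  ` = c(1, 2, 1),
  check.names = FALSE
)

# List the column names to see exactly how they are spelled (including spaces).
col_names <- colnames(toy)

# Two equivalent ways to pull out one column by its full name.
transfer_flag <- toy$`Are you a transfer student?  `
same_flag <- toy[["Are you a transfer student?  "]]
```

Note the two trailing spaces in the second column name: the name has to match character for character, which is why printing colnames first can save some frustration.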

Now we can move into sorting and subsetting. Here is a simple diagram of our thought process for breaking down the data. To build the tree diagram, we can again use the “$” operator, this time to add new branches to our tree.

tree <- Node$new("Student Responses")

tree$AddChild("Transfer")
tree$AddChild("Non-Transfer")

tree$Transfer$AddChild("Male")
tree$Transfer$AddChild("Female")

tree$`Non-Transfer`$AddChild("Male")
tree$`Non-Transfer`$AddChild("Female")

tree$Transfer$Male$AddChild("Demographics")
tree$Transfer$Female$AddChild("Demographics")
tree$`Non-Transfer`$Male$AddChild("Demographics")
tree$`Non-Transfer`$Female$AddChild("Demographics")

plot(tree)

Do not be intimidated by the chunk below; it may look like a lot, but there truly is not much to it. All we are doing here is a bit of data “cleaning”, as well as subsetting much of the data frame. To start, we remove the rows that do not have a response ID. These are ghost students, so to speak. You can remove them with the “drop_na” command, which comes from tidyr, part of the tidyverse.

(Notice that I did not change the name of the data frame after dropping the NA’s. You can keep the same name after you revise it and continue to call it throughout the rest of the code.)

Following that, much of it is rinse and repeat. As we discussed, we wanted to see the differences between transfer and non-transfer students, and then break it down further from there. The format is quite simple, so I’ll use our first subset as an example for all of the following.

We start by naming what we want to subset, in this case transfer students, so let’s give it a simple name: transfer. Then assign it the result of the “subset” command, followed by parentheses. Inside, you first put the data you want to use, in this case df. Then comes a comma, followed by a condition on the column you want to look at. A response of 1 means the student is a transfer student, so we use “==” to keep only the rows where that column equals 1.

The rest of the code follows the same format:

# Remove Student ID's that cannot be identified
df <- df %>% drop_na(`Response ID`)

# Subset Transfer students from non-transfer students or potential non freshman
transfer <- subset(df, `Are you a transfer student?  ` == 1)
non_transfer <- subset(df, `Are you a transfer student?  ` == 2)

# Transfers subset based on gender
male_t <- subset(transfer, `How would you describe the sex were you assigned at birth?` == 1)
female_t <- subset(transfer, `How would you describe the sex were you assigned at birth?` == 2)
nb_t <- subset(transfer, `How would you describe the sex were you assigned at birth?` == 3)
si_t <- subset(transfer, `How would you describe the sex were you assigned at birth?` == 5)

# Non-transfers based on gender 
male_nt <- subset(non_transfer, `How would you describe the sex were you assigned at birth?` == 1)
female_nt <- subset(non_transfer, `How would you describe the sex were you assigned at birth?` == 2)
nb_nt <- subset(non_transfer, `How would you describe the sex were you assigned at birth?` == 3)
si_nt <- subset(non_transfer, `How would you describe the sex were you assigned at birth?` == 5)


# Male transfers based on race 
male_t_b <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 1)
male_t_l <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 2)
male_t_w <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 3)
male_t_na <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 4)
male_t_a <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 5)
male_t_me <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 6)
male_t_nh <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 7)
male_t_pa <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 8)
male_t_br <- subset(male_t, male_t$`How would you describe your race and/or ethnicity?  ` == 9)

# Male non-transfer based on race 
male_nt_b <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 1)
male_nt_l <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 2)
male_nt_w <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 3)
male_nt_na <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 4)
male_nt_a <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 5)
male_nt_me <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 6)
male_nt_nh <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 7)
male_nt_pa <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 8)
male_nt_br <- subset(male_nt, male_nt$`How would you describe your race and/or ethnicity?  ` == 9)

# Female transfers based on race
female_t_b <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 1)
female_t_l <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 2)
female_t_w <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 3)
female_t_na <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 4)
female_t_a <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 5)
female_t_me <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 6)
female_t_nh <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 7)
female_t_pa <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 8)
female_t_br <- subset(female_t, female_t$`How would you describe your race and/or ethnicity?  ` == 9)

# Female non-transfers based on race 
female_nt_b <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 1)
female_nt_l <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 2)
female_nt_w <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 3)
female_nt_na <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 4)
female_nt_a <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 5)
female_nt_me <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 6)
female_nt_nh <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 7)
female_nt_pa <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 8)
female_nt_br <- subset(female_nt, female_nt$`How would you describe your race and/or ethnicity?  ` == 9)
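The same clean-then-subset pattern can be tried on a tiny made-up data frame. This is only a sketch: the short column names (id, transfer, sex) and the values are invented, and the missing IDs are dropped with base R instead of drop_na so that the example needs no packages.

```r
# Toy survey: one missing ID and one missing transfer answer.
toy <- data.frame(
  id       = c(1, 2, NA, 4, 5, 6),
  transfer = c(1, 2, 1, 1, 2, NA),
  sex      = c(1, 2, 1, 2, 2, 1)
)

# Base R stand-in for drop_na(id): keep only rows that have an ID.
toy <- toy[!is.na(toy$id), ]

# First level: transfer (1) versus non-transfer (2).
transfer_toy <- subset(toy, transfer == 1)
non_transfer_toy <- subset(toy, transfer == 2)

# Second level: subset an existing subset again.
male_t_toy <- subset(transfer_toy, sex == 1)
```

Inside subset you can write the column name directly, without repeating the data frame name in front of it. Rows where the condition is NA (such as the student with no transfer answer) are left out of both groups.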

Now that we have adequately sorted our data, we can begin plotting to visualize it. There are other ways to subset data, as you will see below, but I wanted to show a few examples. This is more of a “brute-force” way to do it, and further down we will see simpler approaches. Sometimes subsetting is a good idea and other times not as much; this is just one example.

Plotting can be a bit trickier than it seems at times. In R, I have found that making visuals directly from the data can be a bit tricky, so something I like to do is use counts. By using “nrow” we can see how many rows make up the transfer students, and likewise for the non-transfer students. Once we have those counts, they are what we pass to the plot. For pie charts, I personally like to add percentages, as I think they are a nice touch that makes the data easier to understand. I did this by computing each percentage and rounding it with the “round” function. Most of the visuals below follow this same format.
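As one possible example of a less repetitive approach, base R can produce all of the groups or all of the counts in a single call. The sketch below uses invented data with two short, hypothetical column names (transfer and race); it is not the method used in the rest of this document.

```r
# Toy data: transfer status (1 or 2) and a race code for six students.
toy <- data.frame(
  transfer = c(1, 1, 2, 2, 2, 1),
  race     = c(1, 3, 3, 2, 3, 3)
)

# split() returns a list with one data frame per race code.
by_race <- split(toy, toy$race)

# table() counts every transfer-by-race combination at once.
group_counts <- table(toy$transfer, toy$race)
```

Here by_race[["3"]] holds the rows for race code 3, and group_counts["1", "3"] is the number of transfer students with that code, so no separate named subset is needed for each group.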

#Plotting transfer vs non transfer students
transfer_count <- nrow(transfer)
non_transfer_count <- nrow(non_transfer)
counts <- c(transfer = transfer_count, non_transfer = non_transfer_count)
percentages <- round(100 * counts / sum(counts), 1)  # Rounded to 1 decimal place

labels <- paste(c("Transfer", "Non-Transfer"), ": ", counts, " (", percentages, "%)", sep = "")

pie(counts, 
    main = "Transfer vs Non-Transfer Students (2023)", 
    col = c("orange", "lightblue"), 
    labels = labels, 
    radius = 1)

# Plotting the gender discrepancies in each 
count_mnt <- nrow(male_nt)
count_fnt <- nrow(female_nt)
count_nbnt <- nrow(nb_nt)
count_sint <- nrow(si_nt)
counts <- c(Male = count_mnt, Female = count_fnt)
percentages <- round(100 * counts / sum(counts), 1)

labels <- paste(c("Male Students", "Female Students"), ": ", counts, " (", percentages, "%)", sep = "")

pie(counts, 
    main = "Gender of Non-Transfer Students", 
    col = c("orange", "lightblue"), 
    labels = labels,
    radius = 1)

# Only three students did not fall under male or female, left out for simplification of visualization 

count_mt <- nrow(male_t)
count_ft <- nrow(female_t)
count_nbt <- nrow(nb_t)
count_sit <- nrow(si_t)
counts <- c(Male = count_mt, Female = count_ft)
percentages <- round(100 * counts / sum(counts), 1)

labels <- paste(c("Male Students", "Female Students"), ": ", counts, " (", percentages, "%)", sep = "")

pie(counts, 
    main = "Gender of Transfer Students", 
    col = c("orange", "lightblue"), 
    labels = labels,
    radius = 1)

count_b <- nrow(male_t_b)  
count_l <- nrow(male_t_l)   
count_w <- nrow(male_t_w) 
count_na <- nrow(male_t_na)  
count_a <- nrow(male_t_a) 
count_me <- nrow(male_t_me)  
count_nh <- nrow(male_t_nh) 
count_pa <- nrow(male_t_pa) 
count_br <- nrow(male_t_br)  


race_counts_mnt <- c(Black = count_b, 
                     Latino = count_l, 
                     White = count_w, 
                     "Indigenous American or Alaskan Native" = count_na, 
                     Asian = count_a, 
                     "Middle Eastern" = count_me, 
                     "Native Hawaiian" = count_nh, 
                     "Pacific Islander" = count_pa, 
                     "Biracial/Multiracial" = count_br)

race_counts_mnt <- race_counts_mnt[race_counts_mnt > 0]

percentages <- round(100 * race_counts_mnt / sum(race_counts_mnt), 1)

labels <- paste(names(race_counts_mnt), ": ", race_counts_mnt, " (", percentages, "%)", sep = "")
dark_colors_brewer <- brewer.pal(length(race_counts_mnt), "Dark2")

pie(race_counts_mnt, 
    main = "Racial Makeup of Male Transfer Students", 
    col = dark_colors_brewer,  
    labels = labels, 
    radius = 1,  
    cex = 0.8)

count_b <- nrow(male_nt_b)  
count_l <- nrow(male_nt_l)   
count_w <- nrow(male_nt_w) 
count_na <- nrow(male_nt_na)  
count_a <- nrow(male_nt_a) 
count_me <- nrow(male_nt_me)  
count_nh <- nrow(male_nt_nh) 
count_pa <- nrow(male_nt_pa) 
count_br <- nrow(male_nt_br)  

race_counts_mnt <- c(Black = count_b, 
                     Latino = count_l, 
                     White = count_w, 
                     "Indigenous American or Alaskan Native" = count_na, 
                     Asian = count_a, 
                     "Middle Eastern" = count_me, 
                     "Native Hawaiian" = count_nh, 
                     "Pacific Islander" = count_pa, 
                     "Biracial/Multiracial" = count_br)

race_counts_mnt <- race_counts_mnt[race_counts_mnt > 0]

percentages <- round(100 * race_counts_mnt / sum(race_counts_mnt), 1)

labels <- paste(names(race_counts_mnt), ": ", race_counts_mnt, " (", percentages, "%)", sep = "")
# Dark2 has at most 8 colors, so interpolate it to get one color per category
dark_colors_brewer <- colorRampPalette(brewer.pal(8, "Dark2"))(length(race_counts_mnt))

pie(race_counts_mnt, 
    main = "Racial Makeup of Male Non-Transfer Students", 
    col = dark_colors_brewer,  
    labels = labels, 
    radius = 1, 
    las = 1,
    cex = 1)

count_b <- nrow(female_nt_b)  
count_l <- nrow(female_nt_l)   
count_w <- nrow(female_nt_w) 
count_na <- nrow(female_nt_na)  
count_a <- nrow(female_nt_a) 
count_me <- nrow(female_nt_me)  
count_nh <- nrow(female_nt_nh) 
count_pa <- nrow(female_nt_pa) 
count_br <- nrow(female_nt_br)  

race_counts_mnt <- c(Black = count_b, 
                     Latino = count_l, 
                     White = count_w, 
                     "Indigenous American or Alaskan Native" = count_na, 
                     Asian = count_a, 
                     "Middle Eastern" = count_me, 
                     "Native Hawaiian" = count_nh, 
                     "Pacific Islander" = count_pa, 
                     "Biracial/Multiracial" = count_br)

race_counts_mnt <- race_counts_mnt[race_counts_mnt > 0]

percentages <- round(100 * race_counts_mnt / sum(race_counts_mnt), 1)

labels <- paste(names(race_counts_mnt), ": ", race_counts_mnt, " (", percentages, "%)", sep = "")
dark_colors_brewer <- brewer.pal(length(race_counts_mnt), "Dark2")

pie(race_counts_mnt, 
    main = "Racial Makeup of Female Non-Transfer Students", 
    col = dark_colors_brewer,  
    labels = labels, 
    radius = 1, 
    las = 1,
    cex = 1)

count_b <- nrow(female_t_b)  
count_l <- nrow(female_t_l)   
count_w <- nrow(female_t_w) 
count_na <- nrow(female_t_na)  
count_a <- nrow(female_t_a) 
count_me <- nrow(female_t_me)  
count_nh <- nrow(female_t_nh) 
count_pa <- nrow(female_t_pa) 
count_br <- nrow(female_t_br)  

race_counts_mnt <- c(Black = count_b, 
                     Latino = count_l, 
                     White = count_w, 
                     "Indigenous American or Alaskan Native" = count_na, 
                     Asian = count_a, 
                     "Middle Eastern" = count_me, 
                     "Native Hawaiian" = count_nh, 
                     "Pacific Islander" = count_pa, 
                     "Biracial/Multiracial" = count_br)

race_counts_mnt <- race_counts_mnt[race_counts_mnt > 0]

percentages <- round(100 * race_counts_mnt / sum(race_counts_mnt), 1)

labels <- paste(names(race_counts_mnt), ": ", race_counts_mnt, " (", percentages, "%)", sep = "")
dark_colors_brewer <- brewer.pal(length(race_counts_mnt), "Dark2")

pie(race_counts_mnt, 
    main = "Racial Makeup of Female Transfer Students", 
    col = dark_colors_brewer,  
    labels = labels, 
    radius = 1, 
    las = 1,
    cex = 1)
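The count, percentage, and label steps repeated in the charts above could be wrapped in a small helper function. This is only a sketch: make_pie_labels is a hypothetical name, and the counts are made up.

```r
# Hypothetical helper: build "Name: count (percent%)" labels from named counts.
make_pie_labels <- function(counts) {
  counts <- counts[counts > 0]                          # drop empty groups
  percentages <- round(100 * counts / sum(counts), 1)   # one decimal place
  paste0(names(counts), ": ", counts, " (", percentages, "%)")
}

# Made-up counts, including one empty group that gets dropped.
toy_counts <- c(Transfer = 25, "Non-Transfer" = 75, Unknown = 0)
toy_labels <- make_pie_labels(toy_counts)

# The labels can then be passed to pie(), for example:
# pie(toy_counts[toy_counts > 0], labels = toy_labels, radius = 1)
```

With a helper like this, each chart only needs its own vector of counts and a title, which cuts down on the copy-and-paste shown above.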

Here is a great example of how generative AI can be used hand in hand with RStudio. Instead of manually typing out all of these course names into a list, I can copy and paste them into ChatGPT and say “Make this a list in R called Course_Name”. It is not doing critical thinking per se; rather, it cuts out the time spent on tedious tasks.

# Course break down 
table <- data.frame(
  Class = 1:47, 
  Course_Name = c(
    "AHRM 1104", "AHRM 1104 T", "ALS 1234", "ALS 1234 T", "APSC 1504", 
    "APSC 1504 T", "AAD 1004", "ART 1304", "BC 1014", "BCHM 1014", 
    "BIOL 1004", "BIOL 1004 T", "CHEM 1004", "CMDA 1634", "COMM 1004", 
    "COS 1004", "COS 1015", "ECON 2984", "ENGE 1215", "ENGL 1004", 
    "GEOS 2024", "HD 1984", "HD 2335", "HIST 1004", "HNFE 1114", 
    "HNFE 2984", "IS 1034/PSCI 1034", "ISC 1004", "MATH 1004", "MATH 1044", 
    "MGT 1104", "NEUR 1004", "NR 1234", "NR 2234 T", "PHS 1984", 
    "PHYS 2325", "PPE 1004", "PSYC 1024", "PSYC 2024 T", "REAL 1004", 
    "RLCL 1004", "SPES 1004", "STAT 1004", "UNIV 1824", "UNIV 2114 T", 
    "AT 0984", "ENGE 1216"
  )
)
freq_table <- table(df$`What was your FYE course?  It is helpful for us to know what FYE course you are enrolled in to guide future improvements.`)
freq_df <- as.data.frame(freq_table)
colnames(freq_df) <- c("Class", "Count")
merged_table <- merge(freq_df, table, by = "Class", all.x = TRUE)
blue_palette <- colorRampPalette(c("lightblue", "blue", "darkblue"))(nrow(merged_table))

ggplot(merged_table, aes(x = Count, y = reorder(Course_Name, Count), fill = factor(Course_Name))) + 
  geom_bar(stat = "identity") + 
  labs(
    title = "Course Counts",
    x = "Number of Student Responses",
    y = "Course Name"
  ) + 
  scale_fill_manual(values = blue_palette) +  # Assign each bar a unique blue color
  theme_minimal()
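The counting and merging steps in the chunk above can be followed on a small scale with made-up data. In this sketch the course names and responses are invented, and the conversion of the codes back to integers is an extra safety step that I am assuming is helpful rather than something the original code does.

```r
# Made-up survey responses: each value is a numeric course code.
responses <- c(2, 1, 2, 3, 2, 1)

# Hypothetical lookup table from course code to course name.
course_lookup <- data.frame(
  Class = 1:4,
  Course_Name = c("Course A", "Course B", "Course C", "Course D")
)

# Count how often each code appears, then tidy the column names.
freq_df <- as.data.frame(table(responses))
colnames(freq_df) <- c("Class", "Count")

# table() stores the codes as a factor; convert them back to integers so the
# key column has the same type in both tables.
freq_df$Class <- as.integer(as.character(freq_df$Class))

# all.x = TRUE keeps every counted code, even one with no matching name.
merged <- merge(freq_df, course_lookup, by = "Class", all.x = TRUE)
```

Course D never appears in the responses, so it is absent from merged. That is the effect of all.x = TRUE: only the courses that were actually counted are kept.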

Data Sorting and Cleaning for Visualizations

Sullivan Stefanik

2024-10-24
