We will have five datasets to build our application. We will read them using the “readr” library (install it if you do not have it). We will also be using tidyverse to manage the data.
library(readr)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v dplyr 1.0.2
## v tibble 3.0.3 v stringr 1.4.0
## v tidyr 1.1.2 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
student_records <- read_csv("data/student_records.csv")
## Parsed with column specification:
## cols(
## ANONID = col_double(),
## SEX = col_character(),
## HSGPA = col_double(),
## LAST_ACT_ENGL_SCORE = col_double(),
## LAST_ACT_MATH_SCORE = col_double(),
## LAST_ACT_READ_SCORE = col_double(),
## LAST_ACT_SCIRE_SCORE = col_double(),
## LAST_ACT_COMP_SCORE = col_double(),
## LAST_SATI_VERB_SCORE = col_double(),
## LAST_SATI_MATH_SCORE = col_double(),
## LAST_SATI_TOTAL_SCORE = col_double(),
## MAJOR1_DESCR = col_character(),
## ADMIT_TERM = col_character()
## )
student_course <- read_csv("data/student_course.csv")
## Parsed with column specification:
## cols(
## ANONID = col_double(),
## SUBJECT = col_character(),
## CATALOG_NBR = col_double(),
## GRD_PTS_PER_UNIT = col_double(),
## GPAO = col_double(),
## DIV = col_character(),
## ANON_INSTR_ID = col_double(),
## TERM = col_character()
## )
instructor_evaluation <- read_csv("data/instructor_evaluation.csv")
## Parsed with column specification:
## cols(
## anon_id = col_double(),
## overall = col_double(),
## environment = col_double(),
## feedback = col_double(),
## async = col_double(),
## opportunities = col_double(),
## sensitiviy = col_double()
## )
course_evaluation <- read_csv("data/course_evaluation.csv")
## Parsed with column specification:
## cols(
## course_id = col_double(),
## overall = col_double(),
## objectives = col_double(),
## organized = col_double(),
## stimulating = col_double(),
## engaging = col_double(),
## discussion = col_double(),
## hours = col_double()
## )
course_list <- read_csv("data/course_list.csv")
## Parsed with column specification:
## cols(
## course_id = col_double(),
## subject = col_character(),
## instructor = col_double(),
## location = col_character(),
## day_of_week = col_character(),
## time = col_character()
## )
Now we will examine each dataset. We will start with “student_record”:
head(student_records)
Student record has personal information about each student with the following columns (they are in a different order in the dataset):
Now lets see “student_course”.
head(student_course)
This dataset contains information about the courses taken by the students, their grade and their instructor.
Now let’s explore “instructor_evaluation”:
head(instructor_evaluation)
This dataset contains the information about the student evaluation of the instructors:
Then we will see the “course_evaluation” dataset:
head(course_evaluation)
This file contains the average student evaluation for each course.
The final dataset is course_list:
head(course_list)
This dataset contains the information of the courses being offered this semester:
Because data is distributed among several tables, we will now see an example on how to get data from one table given information from another. To exemplify this, we will create a visualization of the distribution of the major of the students that have taken the course “272” ACC.
First we will get the ID of all the students that have taken that course from the “student_course” dataset:
selected_students <- student_course %>%
filter(CATALOG_NBR==272) %>%
select(ANONID)
head(selected_students)
To the student_course dataset we apply a filter to select only those rows in which the course ID (CATALOG_NBR) is 272. Then we select only the student ID column (ANONID).
Now, from the student_records dataset, we select all those that are in the list of selected students:
selected_students_data<-student_records %>%
filter(ANONID %in% selected_students$ANONID)
head(selected_students_data)
We filter the students of which their ID (ANONID) is in the list of selected_students IDs.
Then we generate a barchart of the top 10 majors of the selected students:
selected_students_data %>%
drop_na(MAJOR1_DESCR) %>%
group_by(MAJOR1_DESCR) %>%
tally() %>%
arrange(desc(n)) %>%
slice(1:10) %>%
ggplot(aes(x=reorder(MAJOR1_DESCR,n), y=n))+
geom_bar(stat='identity') + coord_flip()
From the data of the selected students, we first eliminate those that have not chosen a major yet and have a NA in the MAJOR1_DESCR column (drop_na).
Then we group the remained students by their major (group_by).
Then we count the number of students in each group (tally).
Then we arrange the list from the hightest to the lowest number of counts (arrange(desc n))
Then we take only the first 10 (slice(1:10))
Then we plot the data. The x axis will be major (ordered by the count [reorder(MAJOR1_DESCR,n)]), the y axis will be the count [n]. Then we select the barchart (geom_bar) that will show the value of y (identity). Finally we flip the axes to make it more pleasent to the eye.