The goal of this project is to explore the relationship between teaching qualities and student ratings of professors, focusing on how specific teaching behaviors influence overall evaluations. Based on the dataset from RateMyProfessor.com, the following research questions guide the analysis:
How do certain teaching qualities (e.g., giving good feedback, inspirational lectures) influence professor ratings?
Are there differences in ratings between online and in-person courses?
2. Rationale for Choosing the Dataset
I am interested in this dataset because it offers a unique and extensive source of student feedback across a variety of courses at universities in the United States. The dataset includes over 20,000 student reviews, providing a robust basis for analyzing how teaching qualities such as feedback quality, lecture style, and grading practices impact students’ overall evaluations of professors. By focusing on specific teaching behaviors and comparing ratings between online and in-person courses, this analysis aims to offer actionable insights that can guide both professional development for educators and course design improvements.
This dataset is especially valuable as it provides a large and diverse sample of student perceptions, making it possible to identify generalizable trends across different institutions and disciplines.
3. Target Audience
The findings of this analysis will benefit several stakeholders in higher education:
University administrators: Insights into which teaching qualities are most valued by students can inform decisions about faculty professional development and course design. For example, administrators could prioritize training faculty in areas that lead to higher student satisfaction and better ratings.
Academic departments and faculty members: Understanding which teaching behaviors influence student evaluations will allow faculty to adjust their teaching methods to enhance student engagement and satisfaction. Professors can use the results to fine-tune their course delivery and improve interactions with students.
Students: Although students are not the primary audience for this project, they might benefit indirectly by using the findings to make more informed decisions when selecting courses or professors, based on the factors that contribute to higher ratings.
data <- read.csv("RateMyProfessor_Sample data.csv")
# Check the first few rows to confirm the data is loaded
head(data)
## professor_name school_name department_name
## 1 Leslie Looney University Of Illinois at Urbana-Champaign Astronomy department
## 2 Leslie Looney University Of Illinois at Urbana-Champaign Astronomy department
## 3 Leslie Looney University Of Illinois at Urbana-Champaign Astronomy department
## 4 Leslie Looney University Of Illinois at Urbana-Champaign Astronomy department
## 5 Leslie Looney University Of Illinois at Urbana-Champaign Astronomy department
## 6 Leslie Looney University Of Illinois at Urbana-Champaign Astronomy department
## local_name state_name year_since_first_review star_rating
## 1 Champaign\\xe2\\x80\\x93Urbana IL 11 4.7
## 2 Champaign\\xe2\\x80\\x93Urbana IL 11 4.7
## 3 Champaign\\xe2\\x80\\x93Urbana IL 11 4.7
## 4 Champaign\\xe2\\x80\\x93Urbana IL 11 4.7
## 5 Champaign\\xe2\\x80\\x93Urbana IL 11 4.7
## 6 Champaign\\xe2\\x80\\x93Urbana IL 11 4.7
## take_again diff_index tag_professor
## 1 2 Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)
## 2 2 Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)
## 3 2 Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)
## 4 2 Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)
## 5 2 Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)
## 6 2 Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)
## num_student post_date name_onlines name_not_onlines student_star student_difficult
## 1 26 06/27/2017 NAN ASTR122 5 3
## 2 26 04/16/2017 NAN ASTR122 5 2
## 3 26 12/07/2016 NAN ASTR330 4 3
## 4 26 12/08/2014 NAN ASTR1008WKS 5 3
## 5 26 05/02/2014 NAN ASTR150 5 1
## 6 26 12/14/2013 NAN ASTR150 5 2
## attence for_credits would_take_agains grades help_useful help_not_useful
## 1 Mandatory Yes Yes B 0 0
## 2 Not Mandatory Yes A+ 0 0
## 3 Yes Yes 0 0
## 4 Mandatory Yes A 0 0
## 5 Mandatory 0 0
## 6 0 0
## comments
## 1 This class is hard, but its a two-in-one gen-ed knockout, and the content is very stimulating. Unlike most classes, you have to actually participate to pass. Sections are easy and offer extra credit every week. Very funny dude. Not much more I can say.
## 2 Definitely going to choose Prof. Looney\\'s class again! Interesting class and easy A. You can bring notes to exams so you don\\'t need to remember a lot. Lots of bonus points available and the observatory sessions are awesome!
## 3 I overall enjoyed this class because the assignments were straightforward and interesting. I just didn\\'t enjoy the video project because I felt like no one in my group cared enough to help.
## 4 Yes, it\\'s possible to get an A but you\\'ll definitely have to work for it. The content is pretty interesting, but you have tog get super organized in this class. You\\'ll have multiple things due every week and a ton lectures to go over. If possible, I\\'d avoid this class as an 8 week course. You\\'ll definitely always have somethingto do in this class.
## 5 Professor Looney has great knowledge in Astronomy, while he can explain them in a super easy way in an elementary class. He taught this class with great passion and great illustration. This class is definitely fun to take. If you are interested in further knowledge that this class won\\'t cover, don\\'t hesitate to ask him. Great teacher.
## 6 Looney is a super funny guy and this class was really interesting. It wasn\\'t a complete blow off, but it wasn\\'t too bad. Just do all the homework/observation sessions and study a bit the night before exams and you should pull an A. I would definitely recommend him for a science gen ed. There was no text book and no final.
## word_comment gender race asian hispanic nh_black nh_white
## 1 44 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 2 38 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 3 32 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 4 64 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 5 57 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 6 61 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## gives_good_feedback caring respected participation_matters clear_grading_criteria
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## skip_class amazing_lectures inspirational tough_grader hilarious get_ready_to_read
## 1 0 0 0 0 1 0
## 2 0 0 0 0 1 0
## 3 0 0 0 0 1 0
## 4 0 0 0 0 1 0
## 5 0 0 0 0 1 0
## 6 0 0 0 0 1 0
## lots_of_homework accessible_outside_class lecture_heavy extra_credit
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## graded_by_few_things group_projects test_heavy so_many_papers beware_of_pop_quizzes
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 1 0 0 0
## 4 0 1 0 0 0
## 5 0 1 0 0 0
## 6 0 1 0 0 0
## IsCourseOnline
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
# Check the structure of the dataset
str(data)
## 'data.frame': 20000 obs. of 51 variables:
## $ professor_name : chr "Leslie Looney" "Leslie Looney" "Leslie Looney" "Leslie Looney" ...
## $ school_name : chr "University Of Illinois at Urbana-Champaign" "University Of Illinois at Urbana-Champaign" "University Of Illinois at Urbana-Champaign" "University Of Illinois at Urbana-Champaign" ...
## $ department_name : chr "Astronomy department" "Astronomy department" "Astronomy department" "Astronomy department" ...
## $ local_name : chr " Champaign\\xe2\\x80\\x93Urbana" " Champaign\\xe2\\x80\\x93Urbana" " Champaign\\xe2\\x80\\x93Urbana" " Champaign\\xe2\\x80\\x93Urbana" ...
## $ state_name : chr " IL" " IL" " IL" " IL" ...
## $ year_since_first_review : num 11 11 11 11 11 11 11 11 11 11 ...
## $ star_rating : num 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 ...
## $ take_again : chr "" "" "" "" ...
## $ diff_index : num 2 2 2 2 2 2 2 2 2 2 ...
## $ tag_professor : chr "Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)" "Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)" "Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)" "Hilarious (2) GROUP PROJECTS (2) Gives good feedback (1)" ...
## $ num_student : num 26 26 26 26 26 26 26 26 26 26 ...
## $ post_date : chr "06/27/2017" "04/16/2017" "12/07/2016" "12/08/2014" ...
## $ name_onlines : chr "NAN" "NAN" "NAN" "NAN" ...
## $ name_not_onlines : chr "ASTR122" "ASTR122" "ASTR330" "ASTR1008WKS" ...
## $ student_star : num 5 5 4 5 5 5 5 5 5 5 ...
## $ student_difficult : num 3 2 3 3 1 2 2 2 1 2 ...
## $ attence : chr "Mandatory" "Not Mandatory" "" "Mandatory" ...
## $ for_credits : chr "Yes" "" "Yes" "Yes" ...
## $ would_take_agains : chr "Yes" "Yes" "Yes" "" ...
## $ grades : chr "B" "A+" "" "A" ...
## $ help_useful : num 0 0 0 0 0 0 0 0 0 0 ...
## $ help_not_useful : num 0 0 0 0 0 0 1 0 0 0 ...
## $ comments : chr "This class is hard, but its a two-in-one gen-ed knockout, and the content is very stimulating. Unlike most clas"| __truncated__ "Definitely going to choose Prof. Looney\\'s class again! Interesting class and easy A. You can bring notes to e"| __truncated__ "I overall enjoyed this class because the assignments were straightforward and interesting. I just didn\\'t enjo"| __truncated__ "Yes, it\\'s possible to get an A but you\\'ll definitely have to work for it. The content is pretty interesting"| __truncated__ ...
## $ word_comment : num 44 38 32 64 57 61 55 38 26 31 ...
## $ gender : chr "unknown" "unknown" "unknown" "unknown" ...
## $ race : chr "hispanic" "hispanic" "hispanic" "hispanic" ...
## $ asian : num 0.00855 0.00855 0.00855 0.00855 0.00855 ...
## $ hispanic : num 0.731 0.731 0.731 0.731 0.731 ...
## $ nh_black : num 0.0871 0.0871 0.0871 0.0871 0.0871 ...
## $ nh_white : num 0.173 0.173 0.173 0.173 0.173 ...
## $ gives_good_feedback : int 1 1 1 1 1 1 1 1 1 1 ...
## $ caring : int 0 0 0 0 0 0 0 0 0 0 ...
## $ respected : int 0 0 0 0 0 0 0 0 0 0 ...
## $ participation_matters : int 0 0 0 0 0 0 0 0 0 0 ...
## $ clear_grading_criteria : int 0 0 0 0 0 0 0 0 0 0 ...
## $ skip_class : int 0 0 0 0 0 0 0 0 0 0 ...
## $ amazing_lectures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ inspirational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tough_grader : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hilarious : int 1 1 1 1 1 1 1 1 1 1 ...
## $ get_ready_to_read : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lots_of_homework : int 0 0 0 0 0 0 0 0 0 0 ...
## $ accessible_outside_class: int 0 0 0 0 0 0 0 0 0 0 ...
## $ lecture_heavy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ extra_credit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ graded_by_few_things : int 0 0 0 0 0 0 0 0 0 0 ...
## $ group_projects : int 1 1 1 1 1 1 1 1 1 1 ...
## $ test_heavy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ so_many_papers : int 0 0 0 0 0 0 0 0 0 0 ...
## $ beware_of_pop_quizzes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ IsCourseOnline : int 0 0 0 0 0 0 0 0 0 0 ...
# Check column names
colnames(data)
## [1] "professor_name" "school_name" "department_name"
## [4] "local_name" "state_name" "year_since_first_review"
## [7] "star_rating" "take_again" "diff_index"
## [10] "tag_professor" "num_student" "post_date"
## [13] "name_onlines" "name_not_onlines" "student_star"
## [16] "student_difficult" "attence" "for_credits"
## [19] "would_take_agains" "grades" "help_useful"
## [22] "help_not_useful" "comments" "word_comment"
## [25] "gender" "race" "asian"
## [28] "hispanic" "nh_black" "nh_white"
## [31] "gives_good_feedback" "caring" "respected"
## [34] "participation_matters" "clear_grading_criteria" "skip_class"
## [37] "amazing_lectures" "inspirational" "tough_grader"
## [40] "hilarious" "get_ready_to_read" "lots_of_homework"
## [43] "accessible_outside_class" "lecture_heavy" "extra_credit"
## [46] "graded_by_few_things" "group_projects" "test_heavy"
## [49] "so_many_papers" "beware_of_pop_quizzes" "IsCourseOnline"
install.packages("tm")
## Error in install.packages : Updating loaded packages
install.packages("tidytext")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("naniar")
## Error in install.packages : Updating loaded packages
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("topicmodels")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("reshape2")
## Error in install.packages : Updating loaded packages
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("wordcloud")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tm)
## Loading required package: NLP
library(tidytext)
##
## Attaching package: 'tidytext'
## The following object is masked _by_ '.GlobalEnv':
##
## sentiments
library(naniar)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(topicmodels)
library(reshape2)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
# Check the summary of the dataset to identify missing values
summary(data)
## professor_name school_name department_name local_name
## Length:20000 Length:20000 Length:20000 Length:20000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## state_name year_since_first_review star_rating take_again
## Length:20000 Min. : 0.000 Min. :1.000 Length:20000
## Class :character 1st Qu.: 4.000 1st Qu.:3.000 Class :character
## Mode :character Median : 8.000 Median :3.700 Mode :character
## Mean : 7.824 Mean :3.644
## 3rd Qu.:11.000 3rd Qu.:4.300
## Max. :16.000 Max. :5.000
##
## diff_index tag_professor num_student post_date
## Min. :1.000 Length:20000 Min. : 1.00 Length:20000
## 1st Qu.:2.400 Class :character 1st Qu.: 15.00 Class :character
## Median :3.000 Mode :character Median : 24.00 Mode :character
## Mean :2.956 Mean : 33.27
## 3rd Qu.:3.500 3rd Qu.: 41.00
## Max. :5.000 Max. :321.00
##
## name_onlines name_not_onlines student_star student_difficult
## Length:20000 Length:20000 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:2.500 1st Qu.:2.000
## Mode :character Mode :character Median :4.000 Median :3.000
## Mean :3.617 Mean :2.988
## 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000
## NA's :5 NA's :5
## attence for_credits would_take_agains grades
## Length:20000 Length:20000 Length:20000 Length:20000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## help_useful help_not_useful comments word_comment
## Min. :0.0000 Min. :0.0000 Length:20000 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.:0.0000 Class :character 1st Qu.: 18.00
## Median :0.0000 Median :0.0000 Mode :character Median : 38.00
## Mean :0.2939 Mean :0.1855 Mean : 36.96
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.: 57.00
## Max. :9.0000 Max. :9.0000 Max. :142.00
## NA's :7
## gender race asian hispanic
## Length:20000 Length:20000 Min. :0.0006844 Min. :0.001992
## Class :character Class :character 1st Qu.:0.0075621 1st Qu.:0.029815
## Mode :character Mode :character Median :0.0160804 Median :0.083610
## Mean :0.0225248 Mean :0.231090
## 3rd Qu.:0.0313843 3rd Qu.:0.441890
## Max. :0.3883634 Max. :0.939691
##
## nh_black nh_white gives_good_feedback caring
## Min. :0.002455 Min. :0.0142 Min. :0.000 Min. :0.0000
## 1st Qu.:0.034711 1st Qu.:0.3033 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.170501 Median :0.4881 Median :0.000 Median :0.0000
## Mean :0.247043 Mean :0.4993 Mean :0.303 Mean :0.2646
## 3rd Qu.:0.413896 3rd Qu.:0.6569 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :0.972921 Max. :0.9735 Max. :1.000 Max. :1.0000
##
## respected participation_matters clear_grading_criteria skip_class
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.2798 Mean :0.249 Mean :0.2228 Mean :0.2796
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
##
## amazing_lectures inspirational tough_grader hilarious get_ready_to_read
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2005 Mean :0.1981 Mean :0.3374 Mean :0.1953 Mean :0.2637
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## lots_of_homework accessible_outside_class lecture_heavy extra_credit
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2014 Mean :0.1343 Mean :0.2177 Mean :0.1162
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## graded_by_few_things group_projects test_heavy so_many_papers
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.1028 Mean :0.0807 Mean :0.09835 Mean :0.07675
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## beware_of_pop_quizzes IsCourseOnline
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.0679 Mean :0.0201
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
##
# Check rows with missing values in 'student_star' column
missing_student_star <- which(is.na(data$student_star))
head(missing_student_star) # First few rows with missing star ratings
## [1] 694 695 696 697 698
# Check rows with missing values in 'comments' column
missing_comments <- which(is.na(data$comments))
head(missing_comments) # First few rows with missing comments
## integer(0)
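Since naniar is loaded but not otherwise used, a per-variable overview of missingness offers a quick cross-check of the summary above; the sketch below is one way to produce it.
# Count and percentage of missing values per variable (naniar)
miss_var_summary(data) %>%
  head(10)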
Impute missing ratings with the median:
data$student_star[is.na(data$student_star)] <- median(data$student_star, na.rm = TRUE)
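As a possible refinement of the overall-median imputation above (a sketch only, not used in the results below; overall_median and data_imputed are illustrative names), missing ratings could instead be filled with each professor's own median, falling back to the overall median when a professor has no observed ratings:
overall_median <- median(data$student_star, na.rm = TRUE)
data_imputed <- data %>%
  group_by(professor_name) %>%
  mutate(student_star = ifelse(is.na(student_star),
                               median(student_star, na.rm = TRUE),
                               student_star)) %>%
  ungroup() %>%
  mutate(student_star = ifelse(is.na(student_star), overall_median, student_star))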
# Create a corpus from the review text (assuming the column is named 'comments')
corpus <- Corpus(VectorSource(data$comments))
# Convert text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)): transformation
## drops documents
# Remove punctuation, numbers, and stopwords
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops documents
corpus <- tm_map(corpus, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")): transformation
## drops documents
# Apply stemming (recommended for uniformity)
corpus <- tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops documents
# Inspect cleaned text (first 5 reviews)
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] class hard twoinon gene knockout content stimul unlik class actual particip pass section easi offer extra credit everi week funni dude much can say
## [2] definit go choos prof looney class interest class easi can bring note exam dont need rememb lot lot bonus point avail observatori session awesom
## [3] overal enjoy class assign straightforward interest just didnt enjoy video project felt like one group care enough help
## [4] yes possibl get youll definit work content pretti interest tog get super organ class youll multipl thing due everi week ton lectur go possibl id avoid class week cours youll definit alway somethingto class
## [5] professor looney great knowledg astronomi can explain super easi way elementari class taught class great passion great illustr class definit fun take interest knowledg class wont cover dont hesit ask great teacher
# Check the size of the corpus after cleaning
length(corpus)
## [1] 20000
# Check the length (in characters) of each cleaned document
doc_lengths <- sapply(corpus, function(x) nchar(as.character(x)))
summary(doc_lengths) # This will give you a summary of document lengths
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 59.0 117.0 112.6 169.0 346.0
# Remove empty documents (length 0)
corpus <- corpus[sapply(corpus, function(x) nchar(as.character(x)) > 0)]
# Recheck the size of the corpus after removing empty documents
length(corpus) # Slightly smaller than the original 20,000 reviews
## [1] 19975
# Tokenize the cleaned text into unigrams (single words)
tokens_unigrams <- data.frame(text = sapply(corpus, as.character))
tokens_unigrams <- tokens_unigrams %>%
unnest_tokens(word, text)
# Inspect the first few tokens
head(tokens_unigrams)
## word
## 1 class
## 2 hard
## 3 twoinon
## 4 gene
## 5 knockout
## 6 content
# Check the structure of tokens_unigrams
str(tokens_unigrams)
## 'data.frame': 375622 obs. of 1 variable:
## $ word: chr "class" "hard" "twoinon" "gene" ...
# Rebuild a data frame of the cleaned reviews with a document ID, then tokenize into bigrams
tokens_text <- tibble(document_id = seq_along(corpus),
                      text = sapply(corpus, as.character))
tokens_bigrams <- tokens_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# Inspect the first few bigrams
head(tokens_bigrams)
## # A tibble: 6 × 2
## document_id bigram
## <int> <chr>
## 1 1 class hard
## 2 1 hard twoinon
## 3 1 twoinon gene
## 4 1 gene knockout
## 5 1 knockout content
## 6 1 content stimul
For topic modeling, I will use LDA (Latent Dirichlet Allocation) to explore the underlying themes in student reviews.
# Create a document-term matrix
dtm <- tokens_bigrams %>%
count(document_id, bigram) %>%
cast_dtm(document_id, bigram, n)
# Fit an LDA topic model on the document-term matrix (seed set for reproducibility)
lda_model <- LDA(dtm, k = 5, control = list(seed = 1234)) # k = number of topics
# View the topics
topics <- tidy(lda_model, matrix = "beta")
head(topics)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 absolut hate 0.00000916
## 2 2 absolut hate 0.000000962
## 3 3 absolut hate 0.00000268
## 4 4 absolut hate 0.0000225
## 5 5 absolut hate 0.0000209
## 6 1 actual particip 0.000000152
# Find the top 10 terms for each topic
topics_top_terms <- topics %>%
group_by(topic) %>%
top_n(10, beta)
# Visualize the top terms, one panel per topic
ggplot(topics_top_terms,
       aes(x = reorder_within(term, beta, topic), y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ topic, scales = "free_y") +
  scale_x_reordered() +
  labs(title = "Top Terms for Each Topic", x = "Term", y = "Beta") +
  theme_minimal()
In addition to topic modeling, I also perform sentiment analysis to measure the positivity or negativity of student reviews. This helps confirm whether the themes identified by LDA align with the positive or negative sentiments expressed by students.
# Perform sentiment analysis using the Bing lexicon
# (note: unnesting the bigrams back into words counts each interior word twice,
# which scales the per-review scores but does not change their sign or ranking)
sentiments <- tokens_bigrams %>%
  unnest_tokens(word, bigram) %>%
  inner_join(get_sentiments("bing"), by = "word")
# Inspect the sentiment of the first few bigrams
head(sentiments)
## # A tibble: 6 × 3
## document_id word sentiment
## <int> <chr> <chr>
## 1 1 hard negative
## 2 1 hard negative
## 3 1 good positive
## 4 1 good positive
## 5 1 luck positive
## 6 1 luck positive
# Calculate the overall sentiment for each review
sentiment_summary <- sentiments %>%
count(document_id, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment_score = positive - negative)
# Inspect the sentiment scores
head(sentiment_summary)
## # A tibble: 6 × 4
## document_id negative positive sentiment_score
## <int> <dbl> <dbl> <dbl>
## 1 1 8 4 -4
## 2 2 2 4 2
## 3 3 2 6 4
## 4 4 4 12 8
## 5 5 6 8 2
## 6 6 0 6 6
# Visualize sentiment distribution
ggplot(sentiment_summary, aes(x = sentiment_score)) +
geom_bar() +
labs(title = "Sentiment Distribution of Reviews", x = "Sentiment Score", y = "Count")
While sentiment analysis provides valuable insights into student reviews, it is important to acknowledge the algorithmic bias inherent in sentiment lexicons.
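One way to gauge that sensitivity is to rescore the same words with a second lexicon and compare the results. The sketch below uses the AFINN lexicon; note that get_sentiments("afinn") depends on the textdata package (with a one-time download prompt), and afinn_scores / lexicon_comparison are illustrative names.
# Score each review with AFINN (numeric values) and compare to the Bing-based score
afinn_scores <- tokens_bigrams %>%
  unnest_tokens(word, bigram) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(document_id) %>%
  summarise(afinn_score = sum(value))
lexicon_comparison <- sentiment_summary %>%
  inner_join(afinn_scores, by = "document_id")
# A low correlation would signal that conclusions depend heavily on the lexicon chosen
cor(lexicon_comparison$sentiment_score, lexicon_comparison$afinn_score)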
# Remove NA values from tokens_bigrams
tokens_bigrams_clean <- tokens_bigrams %>%
filter(!is.na(bigram))
# Create a frequency table for bigrams
bigram_freq <- table(tokens_bigrams_clean$bigram)
# Generate the word cloud
wordcloud(words = names(bigram_freq),
freq = bigram_freq,
min.freq = 100,
scale = c(3, 0.5))
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## great professor could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## easi class could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## best teacher could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## good teacher could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## class interest could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## take class could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## midterm final could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## high school could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## go class could not be fit on page. It will not be plotted.
# Clean the unigram data by removing NA values
tokens_unigrams_clean <- tokens_unigrams %>%
filter(!is.na(word))
# Create a frequency table for unigrams (words)
unigram_freq <- table(tokens_unigrams_clean$word)
# Generate the word cloud
wordcloud(words = names(unigram_freq),
freq = unigram_freq,
min.freq = 100,
scale = c(3, 0.5))
# Check the ratings for online vs. in-person courses
online_in_person <- data %>%
select(student_star, IsCourseOnline)
# Perform a t-test to compare ratings for online and in-person courses
t_test_result <- t.test(student_star ~ IsCourseOnline, data = online_in_person)
t_test_result
##
## Welch Two Sample t-test
##
## data: student_star by IsCourseOnline
## t = 4.0111, df = 414.39, p-value = 7.171e-05
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.1663742 0.4861644
## sample estimates:
## mean in group 0 mean in group 1
## 3.623533 3.297264
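To put the t-test in context, it helps to look at the group sizes and spread directly; a quick sketch:
# Group sizes, means, and standard deviations for in-person (0) vs online (1) reviews
online_in_person %>%
  group_by(IsCourseOnline) %>%
  summarise(n = n(),
            mean_rating = mean(student_star),
            sd_rating = sd(student_star))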
# Visualize the comparison between online and in-person courses
# (IsCourseOnline is a 0/1 indicator, so convert it to a factor for the boxplot)
ggplot(online_in_person,
       aes(x = factor(IsCourseOnline, labels = c("In-person", "Online")),
           y = student_star)) +
  geom_boxplot() +
  labs(title = "Comparison of Ratings: Online vs In-Person Courses",
       x = "Course Modality", y = "Rating")
Topic Modeling Results:
The “Top Terms for Each Topic” graph provides insight into the most frequent terms associated with each topic. For example, terms such as “best professor,” “great professor,” and “care student” are most frequently associated with high-rating topics. These terms indicate that students tend to rate professors highly when they perceive the professor as being supportive and knowledgeable.
The sentiment distribution of reviews shows that most reviews are positive, as indicated by the sentiment score distribution being skewed towards higher positive scores. This suggests that students value qualities like “helpful,” “great teacher,” and “extra credit,” which correlate with positive ratings.
These findings suggest that positive teaching behaviors like providing helpful feedback, showing care for students, and offering additional resources (e.g., extra credit) lead to higher ratings.
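These claims can also be checked directly against the tag indicators in the dataset. The sketch below (not part of the text-mining pipeline above; tag_model is an illustrative name) regresses the per-review rating on a subset of the tag columns; positive coefficients on supportive-teaching tags and negative ones on workload or grading tags would corroborate the pattern.
# Linear model of per-review rating on selected teaching-behavior tags
tag_model <- lm(student_star ~ gives_good_feedback + caring + respected +
                  amazing_lectures + inspirational + accessible_outside_class +
                  extra_credit + tough_grader + lecture_heavy + test_heavy,
                data = data)
summary(tag_model)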
The word clouds show frequent phrases like “learn lot,” “good class,” “high recommend,” “teach class,” and “know stuff.” These terms are strongly associated with teaching behaviors such as effective teaching, clear communication, and helpfulness. The prominent terms “learn lot” and “good class” suggest that students appreciate teachers who help them learn effectively and who provide a positive class experience. These qualities could directly contribute to higher ratings.
T-test Results:
The Welch two-sample t-test indicates a statistically significant difference in ratings by course modality (t = 4.01, df ≈ 414, p < 0.001). In-person courses average about 3.62 stars versus about 3.30 for online courses, with a 95% confidence interval of roughly 0.17 to 0.49 stars for the difference.
Boxplot:
The boxplot comparing ratings for online and in-person courses shows that in-person courses generally have a higher median rating, with less variation than online courses. This suggests that students tend to rate in-person courses more favorably than online ones.
In conclusion, there is a significant difference in ratings between online and in-person courses, with in-person courses receiving higher ratings on average.
The analysis of teaching quality and course modality reveals important insights for improving student satisfaction. The sentiment analysis and LDA topic modeling results suggest that positive teaching behaviors, such as offering helpful feedback, being a great teacher, and providing extra resources (e.g., extra credit), contribute to higher ratings. The LDA topics highlighted terms such as “best professor,” “great teacher,” and “care student,” which were associated with high ratings. These topics reinforce the findings from sentiment analysis, showing that students appreciate professors who provide a supportive learning environment.
Additionally, the boxplot comparison between online and in-person courses shows that students tend to rate in-person courses more favorably, with less variability in ratings. This implies that students value the structured, face-to-face interaction that in-person classes provide. These findings suggest that educators should focus on fostering positive interactions with students, offering additional learning resources, and considering course modality when designing effective teaching strategies. Academic departments and administrators can use these insights to guide professional development programs and inform decisions about course delivery to improve student engagement and satisfaction.
Future directions could involve exploring how different aspects of online learning (e.g., synchronous vs. asynchronous) affect student satisfaction, or conducting longitudinal studies to assess how teaching quality and modality impact long-term academic outcomes. Additionally, future research could investigate other factors influencing student ratings, such as course difficulty or faculty-student communication, to gain a more comprehensive understanding of what drives student satisfaction.
Moreover, it would be useful to apply topic modeling to specific rating tiers, such as 1-star or 5-star reviews, to further uncover the themes that resonate with both highly dissatisfied and highly satisfied students.
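A sketch of that idea for the 1-star tier is below, reusing the cleaning steps from earlier (low_reviews, low_dtm, and low_lda are illustrative names; slam is available as a dependency of tm):
# Rebuild a corpus from the comments of 1-star reviews only and fit a separate LDA
low_reviews <- data %>% filter(student_star == 1)
low_corpus <- Corpus(VectorSource(low_reviews$comments)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stemDocument)
low_dtm <- DocumentTermMatrix(low_corpus)
low_dtm <- low_dtm[slam::row_sums(low_dtm) > 0, ]  # drop any empty documents
low_lda <- LDA(low_dtm, k = 3, control = list(seed = 1234))
terms(low_lda, 10)  # top 10 terms per topic; repeat with student_star == 5 for the other tier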
While the analysis provides valuable insights into the relationship between teaching qualities and student ratings, there are several limitations to consider. One key limitation is the reliance on self-reported data from RateMyProfessor.com, which may be biased due to the voluntary nature of the reviews. Students who feel strongly about their experiences—either positively or negatively—are more likely to leave reviews, potentially skewing the data towards extreme opinions. Additionally, the dataset may not fully capture all relevant teaching behaviors, as some qualities are inferred through keyword searches or topic modeling, which may not be perfectly accurate. The LDA topic modeling approach assumes that words within each topic are closely related, but misclassifications or irrelevant terms could introduce noise into the analysis. Therefore, topic interpretation may be subjective, and additional validation of the topics could enhance the robustness of the findings.
Moreover, the study focuses only on U.S. universities, limiting the generalizability of the findings to other educational systems or countries.
From an ethical perspective, there are concerns related to privacy and anonymity. Although the dataset is publicly available, it is important to ensure that any identifiable information about the professors or institutions is handled responsibly, maintaining confidentiality. Furthermore, the study relies on sentiment analysis and topic modeling, which could misinterpret the context or tone of reviews, leading to inaccurate conclusions. The automated nature of sentiment analysis and topic modeling may fail to capture subtle nuances in language, such as sarcasm or cultural differences in phrasing, potentially leading to biased or skewed interpretations. Researchers must be mindful of these limitations when drawing conclusions from automated text analysis.
Lastly, while the study aims to identify teaching behaviors that enhance student satisfaction, the data should not be used to unfairly judge individual educators or make high-stakes decisions without a more nuanced understanding of the teaching context. The LDA topics and sentiment analysis outcomes provide broad insights, but they should be seen as part of a larger picture that includes in-depth, qualitative evaluations and broader institutional considerations.