Munirah Final

Prepare

1. Research Questions

The goal of this project is to explore the relationship between teaching qualities and student ratings of professors, focusing on how specific teaching behaviors influence overall evaluations. Based on the dataset from RateMyProfessor.com, the following research questions guide the analysis:

How do certain teaching qualities (e.g., giving good feedback, inspirational lectures) influence professor ratings?
- These teaching qualities will be identified through predefined tag columns in the dataset (e.g., “gives_good_feedback” and “amazing_lectures”). For qualities that are not directly tagged, topic modeling will be used to infer them from student reviews.
Are there differences in ratings between online and in-person courses?
- The IsCourseOnline variable will be used to compare ratings for online and in-person courses and to test whether student evaluations differ significantly by course modality.
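
As a rough sketch of how the second question might be tested, the snippet below compares ratings between modalities with a Welch t-test and a rank-based alternative. The data frame `toy` and its values are invented for illustration; only the column names `student_star` and `IsCourseOnline` come from the dataset, and the choice of tests is an assumption rather than a method fixed by this project.

```r
# Hypothetical toy data; the real analysis would use the `data` object instead
set.seed(42)
toy <- data.frame(
  student_star   = sample(1:5, 200, replace = TRUE),
  IsCourseOnline = rep(c(0, 1), times = c(180, 20))
)

# Mean rating per modality (0 = in-person, 1 = online)
modality_means <- aggregate(student_star ~ IsCourseOnline, data = toy, FUN = mean)
print(modality_means)

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(student_star ~ IsCourseOnline, data = toy)

# Wilcoxon rank-sum test as a check, since star ratings are ordinal
ranksum <- wilcox.test(student_star ~ IsCourseOnline, data = toy, exact = FALSE)

print(c(welch = welch$p.value, wilcoxon = ranksum$p.value))
```

In the sample, the mean of IsCourseOnline is 0.0201, so only about 2% of reviews are for online courses. The two groups will therefore be very unbalanced, which is one reason to prefer a test that does not assume equal variances.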

2. Rationale for Choosing the Dataset

I am interested in this dataset because it offers a unique and extensive source of student feedback across a variety of courses at universities in the United States. The sample includes 20,000 student reviews, providing a robust basis for analyzing how teaching qualities such as feedback quality, lecture style, and grading practices relate to students’ overall evaluations of professors. By focusing on specific teaching behaviors and comparing ratings between online and in-person courses, this analysis aims to offer actionable insights that can guide both professional development for educators and course design improvements.

This dataset is especially valuable as it provides a large and diverse sample of student perceptions, making it possible to identify generalizable trends across different institutions and disciplines.

3. Target Audience

The findings of this analysis will benefit several stakeholders in higher education:

- University administrators: Insights into which teaching qualities are most valued by students can inform decisions about faculty professional development and course design. For example, administrators could prioritize training faculty in areas that lead to higher student satisfaction and better ratings.
- Academic departments and faculty members: Understanding which teaching behaviors influence student evaluations will allow faculty to adjust their teaching methods to enhance student engagement and satisfaction. Professors can use the results to fine-tune their course delivery and improve interactions with students.
- Students: Although students are not a primary audience for the project, they might benefit indirectly by using the findings to make more informed decisions when selecting courses or professors, based on the factors that contribute to higher ratings.

1. Load the Data

data <- read.csv("RateMyProfessor_Sample data.csv")

# Check the first few rows to confirm the data is loaded
head(data)

##   professor_name                                school_name      department_name
## 1 Leslie  Looney University Of Illinois at Urbana-Champaign Astronomy department
## 2 Leslie  Looney University Of Illinois at Urbana-Champaign Astronomy department
## 3 Leslie  Looney University Of Illinois at Urbana-Champaign Astronomy department
## 4 Leslie  Looney University Of Illinois at Urbana-Champaign Astronomy department
## 5 Leslie  Looney University Of Illinois at Urbana-Champaign Astronomy department
## 6 Leslie  Looney University Of Illinois at Urbana-Champaign Astronomy department
##                        local_name state_name year_since_first_review star_rating
## 1  Champaign\\xe2\\x80\\x93Urbana         IL                      11         4.7
## 2  Champaign\\xe2\\x80\\x93Urbana         IL                      11         4.7
## 3  Champaign\\xe2\\x80\\x93Urbana         IL                      11         4.7
## 4  Champaign\\xe2\\x80\\x93Urbana         IL                      11         4.7
## 5  Champaign\\xe2\\x80\\x93Urbana         IL                      11         4.7
## 6  Champaign\\xe2\\x80\\x93Urbana         IL                      11         4.7
##   take_again diff_index                                              tag_professor
## 1                     2 Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)
## 2                     2 Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)
## 3                     2 Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)
## 4                     2 Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)
## 5                     2 Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)
## 6                     2 Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)
##   num_student  post_date name_onlines name_not_onlines student_star student_difficult
## 1          26 06/27/2017          NAN          ASTR122            5                 3
## 2          26 04/16/2017          NAN          ASTR122            5                 2
## 3          26 12/07/2016          NAN          ASTR330            4                 3
## 4          26 12/08/2014          NAN      ASTR1008WKS            5                 3
## 5          26 05/02/2014          NAN          ASTR150            5                 1
## 6          26 12/14/2013          NAN          ASTR150            5                 2
##         attence for_credits would_take_agains grades help_useful help_not_useful
## 1     Mandatory         Yes               Yes      B           0               0
## 2 Not Mandatory                           Yes     A+           0               0
## 3                       Yes               Yes                  0               0
## 4     Mandatory         Yes                        A           0               0
## 5     Mandatory                                                0               0
## 6                                                              0               0
##                                                                                                                                                                                                                                                                                                                                                                  comments
## 1                                                                                                            This class is hard, but its a two-in-one gen-ed knockout, and the content is very stimulating. Unlike most classes, you have to actually participate to pass. Sections are easy and offer extra credit every week. Very funny dude. Not much more I can say.
## 2                                                                                                                                     Definitely going to choose Prof. Looney\\'s class again! Interesting class and easy A. You can bring notes to exams so you don\\'t need to remember a lot. Lots of bonus points available and the observatory sessions are awesome!
## 3                                                                                                                                                                         I overall enjoyed this class because the assignments were straightforward and interesting. I just didn\\'t enjoy the video project because I felt like no one in my group cared enough to help.
## 4 Yes, it\\'s possible to get an A but you\\'ll definitely have to work for it. The content is pretty interesting, but you have tog get super organized in this class. You\\'ll have multiple things due every week and a ton lectures to go over. If possible, I\\'d avoid this class as an 8 week course. You\\'ll definitely always have somethingto do in this class.
## 5                    Professor Looney has great knowledge in Astronomy, while he can explain them in a super easy way in an elementary class. He taught this class with great passion and great illustration. This class is definitely fun to take. If you are interested in further knowledge that this class won\\'t cover, don\\'t hesitate to ask him. Great teacher.
## 6                                 Looney is a super funny guy and this class was really interesting. It wasn\\'t a complete blow off, but it wasn\\'t too bad. Just do all the homework/observation sessions and study a bit the night before exams and you should pull an A. I would definitely recommend him for a science gen ed. There was no text book and no final.
##   word_comment  gender     race       asian  hispanic   nh_black nh_white
## 1           44 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 2           38 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 3           32 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 4           64 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 5           57 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
## 6           61 unknown hispanic 0.008548441 0.7314959 0.08709476 0.172861
##   gives_good_feedback caring respected participation_matters clear_grading_criteria
## 1                   1      0         0                     0                      0
## 2                   1      0         0                     0                      0
## 3                   1      0         0                     0                      0
## 4                   1      0         0                     0                      0
## 5                   1      0         0                     0                      0
## 6                   1      0         0                     0                      0
##   skip_class amazing_lectures inspirational tough_grader hilarious get_ready_to_read
## 1          0                0             0            0         1                 0
## 2          0                0             0            0         1                 0
## 3          0                0             0            0         1                 0
## 4          0                0             0            0         1                 0
## 5          0                0             0            0         1                 0
## 6          0                0             0            0         1                 0
##   lots_of_homework accessible_outside_class lecture_heavy extra_credit
## 1                0                        0             0            0
## 2                0                        0             0            0
## 3                0                        0             0            0
## 4                0                        0             0            0
## 5                0                        0             0            0
## 6                0                        0             0            0
##   graded_by_few_things group_projects test_heavy so_many_papers beware_of_pop_quizzes
## 1                    0              1          0              0                     0
## 2                    0              1          0              0                     0
## 3                    0              1          0              0                     0
## 4                    0              1          0              0                     0
## 5                    0              1          0              0                     0
## 6                    0              1          0              0                     0
##   IsCourseOnline
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0

# Check the structure of the dataset
str(data)

## 'data.frame':    20000 obs. of  51 variables:
##  $ professor_name          : chr  "Leslie  Looney" "Leslie  Looney" "Leslie  Looney" "Leslie  Looney" ...
##  $ school_name             : chr  "University Of Illinois at Urbana-Champaign" "University Of Illinois at Urbana-Champaign" "University Of Illinois at Urbana-Champaign" "University Of Illinois at Urbana-Champaign" ...
##  $ department_name         : chr  "Astronomy department" "Astronomy department" "Astronomy department" "Astronomy department" ...
##  $ local_name              : chr  " Champaign\\xe2\\x80\\x93Urbana" " Champaign\\xe2\\x80\\x93Urbana" " Champaign\\xe2\\x80\\x93Urbana" " Champaign\\xe2\\x80\\x93Urbana" ...
##  $ state_name              : chr  " IL" " IL" " IL" " IL" ...
##  $ year_since_first_review : num  11 11 11 11 11 11 11 11 11 11 ...
##  $ star_rating             : num  4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7 ...
##  $ take_again              : chr  "" "" "" "" ...
##  $ diff_index              : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ tag_professor           : chr  "Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)" "Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)" "Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)" "Hilarious (2)  GROUP PROJECTS (2)  Gives good feedback (1)" ...
##  $ num_student             : num  26 26 26 26 26 26 26 26 26 26 ...
##  $ post_date               : chr  "06/27/2017" "04/16/2017" "12/07/2016" "12/08/2014" ...
##  $ name_onlines            : chr  "NAN" "NAN" "NAN" "NAN" ...
##  $ name_not_onlines        : chr  "ASTR122" "ASTR122" "ASTR330" "ASTR1008WKS" ...
##  $ student_star            : num  5 5 4 5 5 5 5 5 5 5 ...
##  $ student_difficult       : num  3 2 3 3 1 2 2 2 1 2 ...
##  $ attence                 : chr  "Mandatory" "Not Mandatory" "" "Mandatory" ...
##  $ for_credits             : chr  "Yes" "" "Yes" "Yes" ...
##  $ would_take_agains       : chr  "Yes" "Yes" "Yes" "" ...
##  $ grades                  : chr  "B" "A+" "" "A" ...
##  $ help_useful             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ help_not_useful         : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ comments                : chr  "This class is hard, but its a two-in-one gen-ed knockout, and the content is very stimulating. Unlike most clas"| __truncated__ "Definitely going to choose Prof. Looney\\'s class again! Interesting class and easy A. You can bring notes to e"| __truncated__ "I overall enjoyed this class because the assignments were straightforward and interesting. I just didn\\'t enjo"| __truncated__ "Yes, it\\'s possible to get an A but you\\'ll definitely have to work for it. The content is pretty interesting"| __truncated__ ...
##  $ word_comment            : num  44 38 32 64 57 61 55 38 26 31 ...
##  $ gender                  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ race                    : chr  "hispanic" "hispanic" "hispanic" "hispanic" ...
##  $ asian                   : num  0.00855 0.00855 0.00855 0.00855 0.00855 ...
##  $ hispanic                : num  0.731 0.731 0.731 0.731 0.731 ...
##  $ nh_black                : num  0.0871 0.0871 0.0871 0.0871 0.0871 ...
##  $ nh_white                : num  0.173 0.173 0.173 0.173 0.173 ...
##  $ gives_good_feedback     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ caring                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ respected               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ participation_matters   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ clear_grading_criteria  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ skip_class              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ amazing_lectures        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ inspirational           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tough_grader            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hilarious               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ get_ready_to_read       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lots_of_homework        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ accessible_outside_class: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lecture_heavy           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ extra_credit            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ graded_by_few_things    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ group_projects          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ test_heavy              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ so_many_papers          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ beware_of_pop_quizzes   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ IsCourseOnline          : int  0 0 0 0 0 0 0 0 0 0 ...

# Check column names
colnames(data)

##  [1] "professor_name"           "school_name"              "department_name"         
##  [4] "local_name"               "state_name"               "year_since_first_review" 
##  [7] "star_rating"              "take_again"               "diff_index"              
## [10] "tag_professor"            "num_student"              "post_date"               
## [13] "name_onlines"             "name_not_onlines"         "student_star"            
## [16] "student_difficult"        "attence"                  "for_credits"             
## [19] "would_take_agains"        "grades"                   "help_useful"             
## [22] "help_not_useful"          "comments"                 "word_comment"            
## [25] "gender"                   "race"                     "asian"                   
## [28] "hispanic"                 "nh_black"                 "nh_white"                
## [31] "gives_good_feedback"      "caring"                   "respected"               
## [34] "participation_matters"    "clear_grading_criteria"   "skip_class"              
## [37] "amazing_lectures"         "inspirational"            "tough_grader"            
## [40] "hilarious"                "get_ready_to_read"        "lots_of_homework"        
## [43] "accessible_outside_class" "lecture_heavy"            "extra_credit"            
## [46] "graded_by_few_things"     "group_projects"           "test_heavy"              
## [49] "so_many_papers"           "beware_of_pop_quizzes"    "IsCourseOnline"

2. Install and Load Libraries

# Install only the packages that are not installed yet. Calling install.packages()
# on a package that is already loaded is what produced the
# "Updating loaded packages" error in this session.
pkgs <- c("tm", "tidytext", "naniar", "dplyr", "topicmodels",
          "reshape2", "tidyr", "ggplot2", "wordcloud")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

library(tm)

## Loading required package: NLP

library(tidytext)

## 
## Attaching package: 'tidytext'

## The following object is masked _by_ '.GlobalEnv':
## 
##     sentiments

library(naniar) 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(topicmodels)
library(reshape2)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(wordcloud)

## Loading required package: RColorBrewer

Wrangle

Data Preprocessing.

1. Handling Missing Data

# Check the summary of the dataset to identify missing values
summary(data)

##  professor_name     school_name        department_name     local_name       
##  Length:20000       Length:20000       Length:20000       Length:20000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   state_name        year_since_first_review  star_rating     take_again       
##  Length:20000       Min.   : 0.000          Min.   :1.000   Length:20000      
##  Class :character   1st Qu.: 4.000          1st Qu.:3.000   Class :character  
##  Mode  :character   Median : 8.000          Median :3.700   Mode  :character  
##                     Mean   : 7.824          Mean   :3.644                     
##                     3rd Qu.:11.000          3rd Qu.:4.300                     
##                     Max.   :16.000          Max.   :5.000                     
##                                                                               
##    diff_index    tag_professor       num_student      post_date        
##  Min.   :1.000   Length:20000       Min.   :  1.00   Length:20000      
##  1st Qu.:2.400   Class :character   1st Qu.: 15.00   Class :character  
##  Median :3.000   Mode  :character   Median : 24.00   Mode  :character  
##  Mean   :2.956                      Mean   : 33.27                     
##  3rd Qu.:3.500                      3rd Qu.: 41.00                     
##  Max.   :5.000                      Max.   :321.00                     
##                                                                        
##  name_onlines       name_not_onlines    student_star   student_difficult
##  Length:20000       Length:20000       Min.   :1.000   Min.   :1.000    
##  Class :character   Class :character   1st Qu.:2.500   1st Qu.:2.000    
##  Mode  :character   Mode  :character   Median :4.000   Median :3.000    
##                                        Mean   :3.617   Mean   :2.988    
##                                        3rd Qu.:5.000   3rd Qu.:4.000    
##                                        Max.   :5.000   Max.   :5.000    
##                                        NA's   :5       NA's   :5        
##    attence          for_credits        would_take_agains     grades         
##  Length:20000       Length:20000       Length:20000       Length:20000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   help_useful     help_not_useful    comments          word_comment   
##  Min.   :0.0000   Min.   :0.0000   Length:20000       Min.   :  1.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   Class :character   1st Qu.: 18.00  
##  Median :0.0000   Median :0.0000   Mode  :character   Median : 38.00  
##  Mean   :0.2939   Mean   :0.1855                      Mean   : 36.96  
##  3rd Qu.:0.0000   3rd Qu.:0.0000                      3rd Qu.: 57.00  
##  Max.   :9.0000   Max.   :9.0000                      Max.   :142.00  
##                                                       NA's   :7       
##     gender              race               asian              hispanic       
##  Length:20000       Length:20000       Min.   :0.0006844   Min.   :0.001992  
##  Class :character   Class :character   1st Qu.:0.0075621   1st Qu.:0.029815  
##  Mode  :character   Mode  :character   Median :0.0160804   Median :0.083610  
##                                        Mean   :0.0225248   Mean   :0.231090  
##                                        3rd Qu.:0.0313843   3rd Qu.:0.441890  
##                                        Max.   :0.3883634   Max.   :0.939691  
##                                                                              
##     nh_black           nh_white      gives_good_feedback     caring      
##  Min.   :0.002455   Min.   :0.0142   Min.   :0.000       Min.   :0.0000  
##  1st Qu.:0.034711   1st Qu.:0.3033   1st Qu.:0.000       1st Qu.:0.0000  
##  Median :0.170501   Median :0.4881   Median :0.000       Median :0.0000  
##  Mean   :0.247043   Mean   :0.4993   Mean   :0.303       Mean   :0.2646  
##  3rd Qu.:0.413896   3rd Qu.:0.6569   3rd Qu.:1.000       3rd Qu.:1.0000  
##  Max.   :0.972921   Max.   :0.9735   Max.   :1.000       Max.   :1.0000  
##                                                                          
##    respected      participation_matters clear_grading_criteria   skip_class    
##  Min.   :0.0000   Min.   :0.000         Min.   :0.0000         Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000         1st Qu.:0.0000         1st Qu.:0.0000  
##  Median :0.0000   Median :0.000         Median :0.0000         Median :0.0000  
##  Mean   :0.2798   Mean   :0.249         Mean   :0.2228         Mean   :0.2796  
##  3rd Qu.:1.0000   3rd Qu.:0.000         3rd Qu.:0.0000         3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000         Max.   :1.0000         Max.   :1.0000  
##                                                                                
##  amazing_lectures inspirational     tough_grader      hilarious      get_ready_to_read
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   
##  Mean   :0.2005   Mean   :0.1981   Mean   :0.3374   Mean   :0.1953   Mean   :0.2637   
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   
##                                                                                       
##  lots_of_homework accessible_outside_class lecture_heavy     extra_credit   
##  Min.   :0.0000   Min.   :0.0000           Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000           1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000           Median :0.0000   Median :0.0000  
##  Mean   :0.2014   Mean   :0.1343           Mean   :0.2177   Mean   :0.1162  
##  3rd Qu.:0.0000   3rd Qu.:0.0000           3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000           Max.   :1.0000   Max.   :1.0000  
##                                                                             
##  graded_by_few_things group_projects     test_heavy      so_many_papers   
##  Min.   :0.0000       Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000       1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000       Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :0.1028       Mean   :0.0807   Mean   :0.09835   Mean   :0.07675  
##  3rd Qu.:0.0000       3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000       Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##                                                                           
##  beware_of_pop_quizzes IsCourseOnline  
##  Min.   :0.0000        Min.   :0.0000  
##  1st Qu.:0.0000        1st Qu.:0.0000  
##  Median :0.0000        Median :0.0000  
##  Mean   :0.0679        Mean   :0.0201  
##  3rd Qu.:0.0000        3rd Qu.:0.0000  
##  Max.   :1.0000        Max.   :1.0000  
##

# Check rows with missing values in 'student_star' column
missing_student_star <- which(is.na(data$student_star))
head(missing_student_star)  # First few rows with missing star ratings

## [1] 694 695 696 697 698

# Check rows with NA values in 'comments' column (empty-string comments are not counted by is.na)
missing_comments <- which(is.na(data$comments))
head(missing_comments)  # First few rows with missing comments

## integer(0)
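
`is.na()` finds no rows here, but the `head(data)` output shows blank fields and the literal string "NAN" in some character columns, so missing text may be stored as empty strings rather than `NA`. The following is a minimal sketch of one way to count such values, using an invented vector and a hypothetical helper `is_blank()`; neither is part of the original analysis.

```r
# Hypothetical helper: treat NA, empty, whitespace-only, or "NAN" strings as missing
is_blank <- function(x, na_strings = c("", "NAN")) {
  is.na(x) | trimws(x) %in% na_strings
}

# Invented example values, not taken from the dataset
example_comments <- c("Great class", "", "   ", NA, "NAN", "Tough grader")

# Positions of entries that count as missing under this definition
blank_rows <- which(is_blank(example_comments))
print(blank_rows)
```

Applied to `data$comments`, a check like this would show how many reviews have no usable text before the corpus is built.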

Impute the 5 missing student_star ratings with the median:

data$student_star[is.na(data$student_star)] <- median(data$student_star, na.rm = TRUE)
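
The `summary()` output also reports NA's in `student_difficult` (5) and `word_comment` (7). If those were imputed the same way, the step could be wrapped in a small function. This is a sketch only: `impute_median()` is a hypothetical helper, the toy data frame is invented, and whether median imputation is appropriate for those columns is a judgment the analysis would still need to make.

```r
# Hypothetical helper: replace NA values in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Invented toy data using three column names from the dataset
toy <- data.frame(
  student_star      = c(5, 4, NA, 3, 5),
  student_difficult = c(3, NA, 2, 4, 1),
  word_comment      = c(44, 38, 32, NA, 57)
)

# Apply the same imputation to each listed column
cols <- c("student_star", "student_difficult", "word_comment")
toy[cols] <- lapply(toy[cols], impute_median)
print(toy)
```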

2. Clean and Preprocess the Text Data

# Create a corpus from the review text (assuming the column is named 'comments')
corpus <- Corpus(VectorSource(data$comments))

# Convert text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))

## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)): transformation
## drops documents

# Remove punctuation, numbers, and stopwords
corpus <- tm_map(corpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents

corpus <- tm_map(corpus, removeNumbers)

## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops documents

corpus <- tm_map(corpus, removeWords, stopwords("en"))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")): transformation
## drops documents

# Apply stemming (recommended for uniformity)
corpus <- tm_map(corpus, stemDocument)

## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops documents

# Inspect cleaned text (first 5 reviews)
inspect(corpus[1:5])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] class hard twoinon gene knockout content stimul unlik class actual particip pass section easi offer extra credit everi week funni dude much can say                                                                  
## [2] definit go choos prof looney class interest class easi can bring note exam dont need rememb lot lot bonus point avail observatori session awesom                                                                     
## [3] overal enjoy class assign straightforward interest just didnt enjoy video project felt like one group care enough help                                                                                               
## [4] yes possibl get youll definit work content pretti interest tog get super organ class youll multipl thing due everi week ton lectur go possibl id avoid class week cours youll definit alway somethingto class        
## [5] professor looney great knowledg astronomi can explain super easi way elementari class taught class great passion great illustr class definit fun take interest knowledg class wont cover dont hesit ask great teacher
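
To make the effect of each cleaning step easier to see, the sketch below approximates lowercasing, punctuation and number removal, and stopword removal on a single string using base R only. It is not the `tm` pipeline itself: the stopword list is a tiny hypothetical stand-in for `stopwords("en")`, `clean_text()` is an illustrative helper, and stemming is omitted.

```r
# Tiny hypothetical stopword list (the real pipeline uses stopwords("en"))
toy_stopwords <- c("this", "is", "but", "a", "and", "the", "you", "to")

# Illustrative helper: lowercase, strip punctuation and digits, drop stopwords
clean_text <- function(x, stopwords = toy_stopwords) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)   # punctuation is deleted, not replaced by a space
  x <- gsub("[[:digit:]]", "", x)
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[nzchar(words) & !(words %in% stopwords)]
  paste(words, collapse = " ")
}

cleaned <- clean_text("This class is hard, but the 2 exams are fair!")
print(cleaned)
```

Because punctuation is deleted rather than replaced with a space, hyphenated phrases collapse into one token. This is why "two-in-one" shows up as "twoinon" and "gen-ed" as "gene" in the cleaned output above.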

# Check the size of the corpus (still one document per review at this point)
length(corpus)

## [1] 20000

# Check the character length of each cleaned document to find empty ones
doc_lengths <- sapply(corpus, function(x) nchar(as.character(x)))
summary(doc_lengths)  # This will give you a summary of document lengths

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    59.0   117.0   112.6   169.0   346.0

# Remove empty documents (length 0)
corpus <- corpus[sapply(corpus, function(x) nchar(as.character(x)) > 0)]

# Recheck the size of the corpus after dropping empty documents
length(corpus)  # 25 empty documents removed (20000 -> 19975)

## [1] 19975
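
Dropping empty documents shifts every later position in the corpus, so positions 1 to 19,975 no longer line up with rows 1 to 20,000 of the original data. A minimal sketch of one way to keep that link, using a small made-up character vector (`reviews`, `keep`, and `cleaned` are illustrative names, not objects from this analysis):

```r
# Hypothetical stand-in for the cleaned review texts (not the real data)
reviews <- c("class hard", "", "great professor", "", "easi class")

# Flag non-empty documents once, then reuse the flag for both text and IDs
keep <- nchar(reviews) > 0

cleaned <- data.frame(
  original_row = which(keep),   # row number in the unfiltered data
  text = reviews[keep],
  stringsAsFactors = FALSE
)

cleaned
```

Carrying `original_row` along makes it possible to join text-based results back to per-review variables such as the star rating later on.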

3. Tokenization (Unigrams, Bigrams, or Trigrams)

# Tokenize the cleaned text into unigrams (single words)
tokens_unigrams <- data.frame(text = sapply(corpus, as.character))
tokens_unigrams <- tokens_unigrams %>%
  unnest_tokens(word, text)

# Inspect the first few tokens
head(tokens_unigrams)

##       word
## 1    class
## 2     hard
## 3  twoinon
## 4     gene
## 5 knockout
## 6  content

# Check the structure of tokens_unigrams
str(tokens_unigrams)

## 'data.frame':    375622 obs. of  1 variable:
##  $ word: chr  "class" "hard" "twoinon" "gene" ...

# Tokenize the 'text' column into bigrams
# (tokens_unigrams_text is expected to be a data frame with columns document_id and text)
tokens_bigrams <- tokens_unigrams_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Inspect the first few bigrams
head(tokens_bigrams)

## # A tibble: 6 × 2
##   document_id bigram          
##         <int> <chr>           
## 1           1 class hard      
## 2           1 hard twoinon    
## 3           1 twoinon gene    
## 4           1 gene knockout   
## 5           1 knockout content
## 6           1 content stimul
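
To illustrate how bigram tokenization behaves on short texts, here is a small sketch on made-up input (`toy_docs` is hypothetical). A document with fewer than two words typically yields a single row with an `NA` bigram, which is why the NA rows are filtered out before the word cloud later in this report:

```r
library(dplyr)
library(tidytext)

# Hypothetical toy documents; document 2 has only one word
toy_docs <- tibble(
  document_id = 1:2,
  text = c("class hard twoinon", "great")
)

toy_bigrams <- toy_docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams

# Drop the NA rows before counting or building a document-term matrix
toy_bigrams_clean <- toy_bigrams %>%
  filter(!is.na(bigram))
```

Applying the same filter before `count()` and `cast_dtm()` would keep a spurious `NA` term out of the document-term matrix.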

Analyze

1. Topic Modeling (LDA)

For topic modeling, I will use LDA (Latent Dirichlet Allocation) to explore the underlying themes in student reviews.

# Create a document-term matrix
dtm <- tokens_bigrams %>%
  count(document_id, bigram) %>%
  cast_dtm(document_id, bigram, n)


# Apply LDA model to the DTM
lda_model <- LDA(dtm, k = 5)  # k = number of topics

# View the topics
topics <- tidy(lda_model, matrix = "beta")
head(topics)

## # A tibble: 6 × 3
##   topic term                   beta
##   <int> <chr>                 <dbl>
## 1     1 absolut hate    0.00000916 
## 2     2 absolut hate    0.000000962
## 3     3 absolut hate    0.00000268 
## 4     4 absolut hate    0.0000225  
## 5     5 absolut hate    0.0000209  
## 6     1 actual particip 0.000000152

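# Refit the LDA model on the same DTM. No seed is set, so the beta values differ
# from the first fit above; control = list(seed = ...) would make the fit reproducible
lda_model <- LDA(dtm, k = 5)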

# Get the top words for each topic
topics <- tidy(lda_model, matrix = "beta")
head(topics)

## # A tibble: 6 × 3
##   topic term                   beta
##   <int> <chr>                 <dbl>
## 1     1 absolut hate    0.0000292  
## 2     2 absolut hate    0.00000644 
## 3     3 absolut hate    0.00000437 
## 4     4 absolut hate    0.00000868 
## 5     5 absolut hate    0.00000752 
## 6     1 actual particip 0.000000822

# Find the top 10 terms for each topic
topics_top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()

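# Visualize top terms for each topic, one panel per topic
# (reorder_within() and scale_x_reordered() from tidytext order terms within each panel)
ggplot(topics_top_terms, aes(x = reorder_within(term, beta, topic), y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ topic, scales = "free") +
  labs(title = "Top Terms for Each Topic", x = "Term", y = "Beta") +
  theme_minimal()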
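
The two LDA fits above return different beta values because no random seed is fixed. The following is a minimal sketch, on a made-up document-term matrix, of how a seed can be passed through the `control` argument of `LDA()`; the toy bigram counts and the seed value 1234 are arbitrary assumptions, not values from this analysis:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy bigram counts (not the real review data)
toy_counts <- tibble(
  document_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  bigram = c("great professor", "care student",
             "great professor", "learn lot",
             "midterm final", "tough grader",
             "midterm final", "lot homework"),
  n = c(2, 1, 1, 2, 2, 1, 1, 2)
)

toy_dtm <- toy_counts %>%
  cast_dtm(document_id, bigram, n)

# Fixing the seed makes repeated fits start from the same initialization
fit_a <- LDA(toy_dtm, k = 2, control = list(seed = 1234))
fit_b <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Compare the per-topic term log-probabilities of the two fits
all.equal(fit_a@beta, fit_b@beta)
```

With a fixed seed, the top terms and any plot built from them stay stable between knits, which makes the written interpretation easier to keep in sync with the output.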

2. Sentiment Analysis

In addition to topic modeling, I also performed sentiment analysis to measure the positivity or negativity of student reviews. The sentiment analysis will help confirm whether the themes identified by LDA align with positive or negative sentiments expressed by students.

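# Perform sentiment analysis using the Bing lexicon
# Note: splitting bigrams back into words repeats every word that sits in two
# overlapping bigrams, so most matches (and the counts below) appear twice
sentiments <- tokens_bigrams %>%
  unnest_tokens(word, bigram) %>%
  inner_join(get_sentiments("bing"))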

## Joining with `by = join_by(word)`

# Inspect the first few matched words and their sentiment labels
head(sentiments)

## # A tibble: 6 × 3
##   document_id word  sentiment
##         <int> <chr> <chr>    
## 1           1 hard  negative 
## 2           1 hard  negative 
## 3           1 good  positive 
## 4           1 good  positive 
## 5           1 luck  positive 
## 6           1 luck  positive

# Calculate the overall sentiment for each review
sentiment_summary <- sentiments %>%
  count(document_id, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment_score = positive - negative)

# Inspect the sentiment scores
head(sentiment_summary)

## # A tibble: 6 × 4
##   document_id negative positive sentiment_score
##         <int>    <dbl>    <dbl>           <dbl>
## 1           1        8        4              -4
## 2           2        2        4               2
## 3           3        2        6               4
## 4           4        4       12               8
## 5           5        6        8               2
## 6           6        0        6               6

# Visualize sentiment distribution
ggplot(sentiment_summary, aes(x = sentiment_score)) +
  geom_bar() +
  labs(title = "Sentiment Distribution of Reviews", x = "Sentiment Score", y = "Count")

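While sentiment analysis provides valuable insights into student reviews, it is important to acknowledge the algorithmic bias inherent in general-purpose sentiment lexicons. For example, the Bing lexicon labels “hard” as negative (as in the output above), even though a student may use the word to describe a demanding but worthwhile course. Stemmed tokens such as “easi” may also fail to match any lexicon entry, so some sentiment-bearing words go uncounted.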
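
The repeated rows in the sentiment output above (hard, hard, good, good) arise because the words are recovered from overlapping bigrams. The sketch below contrasts that route with tokenizing the text directly, using a one-sentence made-up document and a tiny made-up lexicon (`toy_text` and `toy_lexicon` are hypothetical and stand in for the real data and the Bing lexicon):

```r
library(dplyr)
library(tidytext)

# Hypothetical one-document example and a tiny made-up lexicon
toy_text <- tibble(document_id = 1, text = "class hard good teacher")
toy_lexicon <- tibble(
  word = c("hard", "good"),
  sentiment = c("negative", "positive")
)

# Route 1: split bigrams back into words (the approach used above)
via_bigrams <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  unnest_tokens(word, bigram) %>%
  inner_join(toy_lexicon, by = "word")

# Route 2: tokenize the text directly into words
direct <- toy_text %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word")

nrow(via_bigrams)  # 4: "hard" and "good" are each matched twice
nrow(direct)       # 2: each word is matched once
```

Because only the first and last word of a document escape the doubling, scores computed through bigrams are roughly twice as large as word-level scores; the sign of the net score is usually the same, but the magnitudes are not comparable.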

Word Cloud:

# Remove NA values from tokens_bigrams
tokens_bigrams_clean <- tokens_bigrams %>%
  filter(!is.na(bigram))

# Create a frequency table for bigrams
bigram_freq <- table(tokens_bigrams_clean$bigram)

# Generate the word cloud
wordcloud(words = names(bigram_freq), 
          freq = bigram_freq, 
          min.freq = 100, 
          scale = c(3, 0.5))

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## great professor could not be fit on page. It will not be plotted.

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## easi class could not be fit on page. It will not be plotted.

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## best teacher could not be fit on page. It will not be plotted.

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## good teacher could not be fit on page. It will not be plotted.

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## class interest could not be fit on page. It will not be plotted.

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## take class could not be fit on page. It will not be plotted.

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## midterm final could not be fit on page. It will not be plotted.

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## high school could not be fit on page. It will not be plotted.

## Warning in wordcloud(words = names(bigram_freq), freq = bigram_freq, min.freq = 100, :
## go class could not be fit on page. It will not be plotted.

# Clean the unigram data by removing NA values
tokens_unigrams_clean <- tokens_unigrams %>%
  filter(!is.na(word))

# Create a frequency table for unigrams (words)
unigram_freq <- table(tokens_unigrams_clean$word)

# Generate the word cloud
wordcloud(words = names(unigram_freq), 
          freq = unigram_freq, 
          min.freq = 100, 
          scale = c(3, 0.5))
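
The warnings above show that several of the most frequent bigrams (for example “great professor” and “easi class”) were left out of the word cloud because they did not fit on the page. A small sketch of how the top entries of a frequency table can be listed directly, so the most common terms are reported regardless of what the plot can display (the `toy_bigrams` vector is made up for illustration):

```r
# Hypothetical bigram vector standing in for tokens_bigrams_clean$bigram
toy_bigrams <- c("great professor", "great professor", "great professor",
                 "easi class", "easi class", "learn lot")

toy_freq <- table(toy_bigrams)

# Sort by frequency and keep the top n, independent of what fits in the plot
top_bigrams <- head(sort(toy_freq, decreasing = TRUE), 2)
top_bigrams
```

Reporting such a table alongside the word cloud avoids reading the cloud as a complete picture when its largest terms were silently dropped.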

3. Comparison of Ratings for Online vs. In-Person Courses

# Check the ratings for online vs. in-person courses
online_in_person <- data %>%
  select(student_star, IsCourseOnline)

# Perform a t-test to compare ratings for online and in-person courses
t_test_result <- t.test(student_star ~ IsCourseOnline, data = online_in_person)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  student_star by IsCourseOnline
## t = 4.0111, df = 414.39, p-value = 7.171e-05
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.1663742 0.4861644
## sample estimates:
## mean in group 0 mean in group 1 
##        3.623533        3.297264

# Visualize the comparison between online and in-person courses
# (IsCourseOnline is converted to a factor so each modality gets its own box;
# the labels assume 0 = in-person and 1 = online)
ggplot(online_in_person, aes(x = factor(IsCourseOnline, levels = c(0, 1), labels = c("In-Person", "Online")), y = student_star)) +
  geom_boxplot() +
  labs(title = "Comparison of Ratings: Online vs In-Person Courses", x = "Course Modality", y = "Rating")
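
With a sample of roughly 20,000 reviews, even a small difference in means can be statistically significant, so an effect size is a useful companion to the p-value. The sketch below computes Cohen's d with a pooled standard deviation, which is one common choice rather than part of the original analysis. The data are simulated: the group means loosely echo the t-test output, but the group sizes and standard deviations are invented, and `cohens_d` is an illustrative helper.

```r
set.seed(42)

# Simulated ratings for illustration only (not the RateMyProfessor data)
toy <- data.frame(
  student_star = c(rnorm(200, mean = 3.6, sd = 1.4),
                   rnorm(50,  mean = 3.3, sd = 1.5)),
  IsCourseOnline = rep(c(0, 1), times = c(200, 50))
)

# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

in_person <- toy$student_star[toy$IsCourseOnline == 0]
online    <- toy$student_star[toy$IsCourseOnline == 1]

d <- cohens_d(in_person, online)
d
```

Applied to the real `online_in_person` data, the same helper would indicate whether the roughly 0.33-star gap is small or large relative to the spread of ratings.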

Communicate

Research Question 1: How do certain teaching qualities (e.g., giving good feedback, inspirational lectures) influence professor ratings?

Topic Modeling Results:
- The “Top Terms for Each Topic” graph shows the terms with the highest probability (beta) in each topic. Terms such as “best professor,” “great professor,” and “care student” appear among the top terms, which suggests that supportive and knowledgeable professors are a recurring theme in the reviews. Because the topic model was fitted on the review text alone, this is indirect evidence about which qualities accompany high ratings.
Sentiment Analysis:
- Most reviews have a positive net sentiment score: the distribution of sentiment scores is skewed towards positive values. This suggests that students value qualities like “helpful,” “great teacher,” and “extra credit,” which tend to accompany positive reviews.

These findings suggest that positive teaching behaviors like providing helpful feedback, showing care for students, and offering additional resources (e.g., extra credit) are associated with higher ratings.

The word clouds show frequent phrases like “learn lot,” “good class,” “high recommend,” “teach class,” and “know stuff.” These phrases point to teaching behaviors such as effective instruction, clear communication, and helpfulness. The prominence of “learn lot” and “good class” suggests that students appreciate teachers who help them learn effectively and who provide a positive class experience, qualities that could contribute to higher ratings.
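
The topic model on its own contains no rating information, so a direct check of whether a topic goes with higher ratings needs the per-document topic probabilities joined to the star ratings. The following is a minimal sketch of that step on made-up data; it assumes that document IDs can be matched to `student_star`, which this report does not show. `toy_gamma` mimics the shape of `tidy(lda_model, matrix = "gamma")`, and `toy_ratings` is hypothetical.

```r
library(dplyr)

# Hypothetical per-document topic probabilities, shaped like the output of
# tidy(lda_model, matrix = "gamma")
toy_gamma <- tibble(
  document = rep(c("1", "2", "3", "4"), each = 2),
  topic = rep(1:2, times = 4),
  gamma = c(0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.1, 0.9)
)

# Hypothetical ratings keyed by the same document IDs
toy_ratings <- tibble(
  document = c("1", "2", "3", "4"),
  student_star = c(5, 4, 2, 1)
)

# Assign each document its most probable topic, then average ratings by topic
topic_ratings <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  inner_join(toy_ratings, by = "document") %>%
  group_by(topic) %>%
  summarise(mean_rating = mean(student_star), n_docs = n())

topic_ratings
```

A clear gap in `mean_rating` between topics would support describing a topic as “high-rating”; without such a check, the link between topics and ratings remains an interpretation of the top terms.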

Research Question 2: Are there differences in ratings between online and in-person courses?

T-test Results:
- The Welch two-sample t-test indicates a statistically significant difference in ratings between online and in-person courses (t = 4.01, df ≈ 414.4, p-value = 7.171e-05). In-person courses have a higher mean rating (3.62) than online courses (3.30), a difference of about 0.33 stars (95% confidence interval: 0.17 to 0.49).
Boxplot:
- The boxplot comparing ratings for online and in-person courses shows that in-person courses generally have a higher median rating, with less variation than online courses. This suggests that students tend to rate in-person courses more favorably than online ones.

In conclusion, there is a significant difference in ratings between online and in-person courses, with in-person courses receiving higher ratings on average.

Key Insights & Implications

The analysis of teaching quality and course modality points to several ways of improving student satisfaction. The sentiment analysis and LDA topic modeling results suggest that positive teaching behaviors, such as offering helpful feedback, being a great teacher, and providing extra resources (e.g., extra credit), are associated with higher ratings. The LDA topics highlighted terms such as “best professor,” “great teacher,” and “care student,” which students use when praising a professor. These topics reinforce the findings from sentiment analysis, showing that students appreciate professors who provide a supportive learning environment.

Additionally, the boxplot comparison between online and in-person courses shows that students tend to rate in-person courses more favorably, with less variability in ratings. One possible explanation is that students value the structured, face-to-face interaction that in-person classes provide, although the analysis does not test this directly. These findings suggest that educators should focus on fostering positive interactions with students, offering additional learning resources, and considering course modality when designing effective teaching strategies. Academic departments and administrators can use these insights to guide professional development programs and inform decisions about course delivery to improve student engagement and satisfaction.

Future directions could involve exploring how different aspects of online learning (e.g., synchronous vs. asynchronous) affect student satisfaction, or conducting longitudinal studies to assess how teaching quality and modality impact long-term academic outcomes. Additionally, future research could investigate other factors influencing student ratings, such as course difficulty or faculty-student communication, to gain a more comprehensive understanding of what drives student satisfaction.

Moreover, it would be useful to apply topic modeling to specific rating tiers, such as 1-star or 5-star reviews, to further uncover the themes that resonate with both highly dissatisfied and highly satisfied students.

Limitations and Ethical Issues

While the analysis provides valuable insights into the relationship between teaching qualities and student ratings, there are several limitations to consider. One key limitation is the reliance on self-reported data from RateMyProfessor.com, which may be biased due to the voluntary nature of the reviews. Students who feel strongly about their experiences, either positively or negatively, are more likely to leave reviews, potentially skewing the data towards extreme opinions. Additionally, the dataset may not fully capture all relevant teaching behaviors, as some qualities are inferred through keyword searches or topic modeling, which may not be perfectly accurate. LDA treats each review as a bag of words and each topic as a probability distribution over terms, so a topic can mix loosely related or irrelevant terms, and the number of topics was set to 5 in this analysis. Topic interpretation is therefore subjective, and additional validation of the topics could enhance the robustness of the findings.

Moreover, the study focuses only on U.S. universities, limiting the generalizability of the findings to other educational systems or countries.

From an ethical perspective, there are concerns related to privacy and anonymity. Although the dataset is publicly available, any identifiable information about professors or institutions should be handled responsibly and kept confidential. Furthermore, automated sentiment analysis and topic modeling can misread the context or tone of a review: they may miss sarcasm or cultural differences in phrasing, which can bias the results. Researchers must be mindful of these limitations when drawing conclusions from automated text analysis.

Lastly, while the study aims to identify teaching behaviors that enhance student satisfaction, the data should not be used to unfairly judge individual educators or make high-stakes decisions without a more nuanced understanding of the teaching context. The LDA topics and sentiment analysis outcomes provide broad insights, but they should be seen as part of a larger picture that includes in-depth, qualitative evaluations and broader institutional considerations.