Coursera Data Analysis

Objective

The goal of this project is to perform a comprehensive analysis of a dataset of Coursera courses. I explored the dataset to understand its structure and performed the necessary data cleansing. Then, I created visualizations to draw insights about online learning trends and user engagement.

Data Collection

The data is provided on Kaggle, which can be viewed here.

There are two data sets: a cleaned version and an uncleaned version. For this project, I will only use the uncleaned version, since our aim includes data cleansing. The dataset includes various columns that hold basic information about a course. I will use the course URL as the main unique index; it also needs cleaning, which is covered in the Project Workflow section.

Further data set description can be read here.

First, we will read the data set using the read.csv function, save it as orig_df, and look at an overview of the data set:

orig_df <- read.csv("CourseraDataset-Unclean.csv")
summary(orig_df)

##  Course.Title           Rating         Level             Duration        
##  Length:9595        Min.   :1.500   Length:9595        Length:9595       
##  Class :character   1st Qu.:4.600   Class :character   Class :character  
##  Mode  :character   Median :4.700   Mode  :character   Mode  :character  
##                     Mean   :4.652                                        
##                     3rd Qu.:4.800                                        
##                     Max.   :5.000                                        
##                     NA's   :1439                                         
##    Schedule            Review          What.you.will.learn  Skill.gain       
##  Length:9595        Length:9595        Length:9595         Length:9595       
##  Class :character   Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##    Modules           Instructor         Offered.By          Keyword         
##  Length:9595        Length:9595        Length:9595        Length:9595       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Course.Url       
##  Length:9595       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

colnames(orig_df)

##  [1] "Course.Title"        "Rating"              "Level"              
##  [4] "Duration"            "Schedule"            "Review"             
##  [7] "What.you.will.learn" "Skill.gain"          "Modules"            
## [10] "Instructor"          "Offered.By"          "Keyword"            
## [13] "Course.Url"

Libraries and Tools

We will load the libraries that help us clean and visualize the data in later steps:

library(tidyverse)
library(zoo)
library(textcat)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.0
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] graphics  grDevices utils     datasets  stats     methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.4  textcat_1.0-8   zoo_1.8-12      lubridate_1.9.3
##  [5] forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2    
##  [9] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1  
## [13] tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    slam_0.1-52      
##  [5] stringi_1.8.4     lattice_0.22-6    hms_1.1.3         digest_0.6.37    
##  [9] magrittr_2.0.3    evaluate_0.24.0   grid_4.4.1        timechange_0.3.0 
## [13] fastmap_1.2.0     plyr_1.8.9        jsonlite_1.8.8    fansi_1.0.6      
## [17] scales_1.3.0      jquerylib_0.1.4   cli_3.6.3         rlang_1.1.4      
## [21] munsell_0.5.1     withr_3.0.1       cachem_1.1.0      yaml_2.3.10      
## [25] tools_4.4.1       tzdb_0.4.0        colorspace_2.1-1  vctrs_0.6.5      
## [29] R6_2.5.1          lifecycle_1.0.4   pkgconfig_2.0.3   pillar_1.9.0     
## [33] bslib_0.8.0       gtable_0.3.5      Rcpp_1.0.13       glue_1.7.0       
## [37] xfun_0.47         tidyselect_1.2.1  tau_0.0-25        rstudioapi_0.16.0
## [41] knitr_1.48        htmltools_0.5.8.1 rmarkdown_2.28    compiler_4.4.1

Project Workflow

1. Set column names

Some of the column names in the dataset are awkward to use throughout the workflow, so we change them into shorter, more accessible names. We use the rename function to apply the new column names:

orig_df <- rename(
  orig_df,
  crs_title = "Course.Title",
  will_learn = "What.you.will.learn",
  offered = "Offered.By",
  URL = "Course.Url",
  skill = "Skill.gain"
)

data_types <- sapply(orig_df, typeof)
results <- paste(names(orig_df), ":", data_types, collapse = ", ")
cat(results,'\n')

## crs_title : character, Rating : double, Level : character, Duration : character, Schedule : character, Review : character, will_learn : character, skill : character, Modules : character, Instructor : character, offered : character, Keyword : character, URL : character

We use head() to see the format of the data in every column:

head(orig_df)

##                                                      crs_title Rating
## 1                                            Fashion as Design    4.8
## 2                                       Modern American Poetry    4.4
## 3                                    Pixel Art for Video Games    4.5
## 4              Distribución digital de la música independiente     NA
## 5 The Blues: Understanding and Performing an American Art Form    4.8
## 6                                 So You Think You Know Tango?    4.6
##            Level                     Duration          Schedule        Review
## 1 Beginner level     20 hours (approximately) Flexible schedule 2,813 reviews
## 2 Beginner level Approx. 34 hours to complete Flexible schedule   100 reviews
## 3 Beginner level      9 hours (approximately) Flexible schedule   227 reviews
## 4 Beginner level  Approx. 8 hours to complete Flexible schedule              
## 5 Beginner level Approx. 11 hours to complete Flexible schedule   582 reviews
## 6 Beginner level  Approx. 5 hours to complete Flexible schedule   107 reviews
##                                                                                                                                                                                                                                                                 will_learn
## 1                                                                                                                                                                                                                                                                         
## 2                                                                                                                                                                                                                                                                         
## 3                                                                                                                                                                                                                                                                         
## 4                                                                                                                                                                                                                                                                         
## 5                                                                                                       Students will be able to describe the blues as an important musical form. Students will be able to explain differences in jazz and other variations of the blues. 
## 6 Gain an appreciation for the Argentine Tango as a multidimensional art form, including music, dance, and poetry.List the components that make a tango.Eexplore the tango as a living art form today and how it can be used to cross cultural boundaries and stereotypes.
##                                              skill
## 1  ['Art History', 'Art', 'History', 'Creativity']
## 2                                               []
## 3                                               []
## 4                                               []
## 5 ['Music', 'Chord', 'Jazz', 'Jazz Improvisation']
## 6                                               []
##                                                                                                                                                                                                                               Modules
## 1                                                                                                                                         ['Introduction', 'Heroes', 'Silhouettes', 'Coutures', 'Lifecycle', 'Modesty', 'Expression']
## 2                                                                                                                                                                     ['Orientation', 'Module 1', 'Module 2', 'Module 3', 'Module 4']
## 3                                                                 ['Week 1: Introduction to Pixel Art', 'Week 2: Pixel Art Environments', 'Week 3: Pixel Art Characters', 'Week 4: Pixel Art Animation', 'Week 5: Pixel Art Project']
## 4                                                                                                                                                                                    ['Semana 1', 'Semana 2', 'Semana 3', 'Semana 4']
## 5 ['Blues Progressions – Theory and Practice ', 'Blues Scales ', 'Keyboard Realization ', '“Bird” Blues and Other Blues Progressions', 'Improvisational Tools ', 'Improvising the Blues – Part 1 ', 'Improvising the Blues – Part 2']
## 6                                                                                                                                   ['Module 1: The Many Dimensions of Tango and Tango Music', 'Module 2: Tango Words and Movements']
##                                                                             Instructor
## 1 ['Anna Burckhardt', 'Paola Antonelli', 'Michelle Millar Fisher', 'Stephanie Kramer']
## 2                                                                      ['Cary Nelson']
## 3                                               ['Andrew Dennis', 'Ricardo Guimaraes']
## 4                                                        ['Eduardo de la Vara Brown.']
## 5                                                                ['Dariusz Terefenko']
## 6                                                                 ['Kristin Wendland']
##                                          offered             Keyword
## 1                   ['The Museum of Modern Art'] Arts and Humanities
## 2 ['University of Illinois at Urbana-Champaign'] Arts and Humanities
## 3                  ['Michigan State University'] Arts and Humanities
## 4                       ['SAE Institute México'] Arts and Humanities
## 5                    ['University of Rochester'] Arts and Humanities
## 6                           ['Emory University'] Arts and Humanities
##                                                                              URL
## 1                                  https://www.coursera.org/learn/fashion-design
## 2                          https://www.coursera.org/learn/modern-american-poetry
## 3                           https://www.coursera.org/learn/pixel-art-video-games
## 4 https://www.coursera.org/learn/distribucion-digital-de-la-musica-independiente
## 5                                       https://www.coursera.org/learn/the-blues
## 6                                           https://www.coursera.org/learn/tango

2. Cleansing N/A values for Rating column

In the cleansing process, we first want to see the number of N/A values in every column:

colSums(is.na(orig_df))

##  crs_title     Rating      Level   Duration   Schedule     Review will_learn 
##          0       1439          0          0          0          0          0 
##      skill    Modules Instructor    offered    Keyword        URL 
##          0          0          0          0          0          0

There are 1,439 N/A values in the Rating column. Since the column is of type double, the simplest option is to fill N/A with 0. Keep in mind that 0 then stands for "no rating" and pulls summary statistics such as the mean downward.

In case we need to clean N/A values in other columns later (for example, if a column is added), it is better to create a custom function to handle this. The function takes a data frame, the name of the column to cleanse, and the replacement value.

We use mutate() to modify the column within the data frame. Because the colname argument is a string and mutate() expects a column symbol rather than a string, we use the sym() function to convert the name from a string to a symbol, and prefix it with !! to unquote the result.

Then, ifelse() checks whether each value is N/A:

clean_NA_func <- function(input_df, colname, replace_value) {
  input_df <- input_df %>% mutate(
    !!sym(colname) := ifelse(
      is.na(!!sym(colname)), replace_value, !!sym(colname)
    )
  )
  return(input_df)
}
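
The same pattern can also be written without sym() and !!. The sketch below is an alternative, not the function used in this project: it assumes dplyr 1.0 or later, uses the .data pronoun to look a column up by its string name, and uses a glue-style name on the left of :=. The names fill_na and toy are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for orig_df
toy <- data.frame(Rating = c(4.8, NA, 4.5))

# Alternative to clean_NA_func: .data[[colname]] reads the column by its string name,
# and coalesce() keeps existing values while replacing NA with `value`
fill_na <- function(df, colname, value) {
  df %>% mutate("{colname}" := coalesce(.data[[colname]], value))
}

filled <- fill_na(toy, "Rating", 0)
print(filled)
```

Both versions do the same job; the .data form mainly avoids repeating !!sym(colname) three times.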

We now apply this function to clean the Rating column and check again:

orig_df <- clean_NA_func(orig_df, "Rating", 0)
colSums(is.na(orig_df))

##  crs_title     Rating      Level   Duration   Schedule     Review will_learn 
##          0          0          0          0          0          0          0 
##      skill    Modules Instructor    offered    Keyword        URL 
##          0          0          0          0          0          0

3. Cleansing empty values for Level and other columns

Some entries are not N/A but are empty strings. We want to build a function that fills those empty entries with a desired value. The logic is similar to clean_NA_func, but the is.na check is replaced with == "".

colSums(orig_df == "")

##  crs_title     Rating      Level   Duration   Schedule     Review will_learn 
##          0          0       1265        262        683       1443       4611 
##      skill    Modules Instructor    offered    Keyword        URL 
##          0          0          0          0          0          0

clean_empty_func <- function(input_df, colname, replace_value) {
  # Replace empty strings in the given column with replace_value
  input_df <- input_df %>% mutate(
    !!sym(colname) := ifelse(
      !!sym(colname) == "", replace_value, !!sym(colname)
    )
  )
  return(input_df)
}

A number of columns have empty entries. Here, we fill the empty entries of the text columns with the placeholder 'No information'; the Review column is handled separately afterwards.

columns_to_clean = c('Level','Duration','Schedule','will_learn','skill','Modules','Instructor')

for (col in columns_to_clean) {
  orig_df <- clean_empty_func(orig_df, col, 'No information')
}

colSums(orig_df=="")

##  crs_title     Rating      Level   Duration   Schedule     Review will_learn 
##          0          0          0          0          0       1443          0 
##      skill    Modules Instructor    offered    Keyword        URL 
##          0          0          0          0          0          0
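
As an aside, the for loop can also be expressed in a single mutate() call. This is a minimal sketch on made-up data, assuming dplyr 1.0 or later for across(); toy and cols are illustrative names, not objects from this project.

```r
library(dplyr)

# Hypothetical toy data with empty strings in two columns
toy <- data.frame(
  Level = c("Beginner level", ""),
  Schedule = c("", "Flexible schedule"),
  stringsAsFactors = FALSE
)
cols <- c("Level", "Schedule")

# Apply the same replacement to every listed column at once
filled <- toy %>%
  mutate(across(all_of(cols), ~ if_else(.x == "", "No information", .x)))

print(colSums(filled == ""))
```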

For the Review column specifically, the entries contain the word reviews, so we remove it (using str_replace_all), fill the empty entries with 0, and then convert the column to double type for further analysis.

orig_df$Review <- str_replace_all(orig_df$Review, fixed(" reviews"), "")
orig_df <- clean_empty_func(orig_df, 'Review', '0')
orig_df$Review <- as.double(orig_df$Review)

## Warning: NAs introduced by coercion
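
The coercion warning comes at least in part from entries that still contain a thousands separator, such as 2,813 in the first row of the head() output, which as.double() cannot parse and turns into NA. Below is a minimal sketch of one way to handle this, assuming commas are the only remaining non-numeric characters. The sample values are taken from the head() output, and parse_reviews is a hypothetical helper, not part of the original workflow.

```r
# Sample values as they appear in the Review column
reviews <- c("2,813 reviews", "100 reviews", "")

parse_reviews <- function(x) {
  x <- sub(" reviews?$", "", x)  # drop the trailing word
  x <- gsub(",", "", x)          # drop thousands separators
  x[x == ""] <- "0"              # empty entries count as 0 reviews
  as.numeric(x)
}

review_counts <- parse_reviews(reviews)
print(review_counts)
```

Without the comma removal, large review counts are silently lost as NA, which would bias any later summary of the column.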

4. Cleaning duplicated URL links

As mentioned in Data Collection, we use the URL as our main index, so we need to remove duplicated links. Some rows share the same link but have different keywords, so we want to append their keywords together. The function uses a while loop that runs as long as duplicates exist. In each pass, we take one duplicated URL, find the indices of all rows with that URL, paste the keywords of the later rows onto the first (minimum-index) row, and then remove the later rows.

sum(duplicated(orig_df$URL))

## [1] 3191

clean_dup_func <- function(input_df, col_dup, col_append) {
  while (sum(duplicated(input_df[, col_dup])) >= 1) {
    
    dup_df <- input_df[which(duplicated(input_df[, col_dup])), ] # get df of duplicated rows of URL
    link_1st_row <- as.character(dup_df[1, col_dup])
    row_indices <- as.numeric(which(input_df[, col_dup] == link_1st_row)) # get rows having same links
    row_indices_without_min <- row_indices[-which.min(row_indices)]

    for (i in row_indices_without_min) {
      input_df[min(row_indices), col_append] <- paste(
        input_df[min(row_indices), col_append], input_df[i, col_append],
        sep = ", "
      )
    }
    
    input_df <- input_df[-row_indices_without_min, ]
    
  }

  return(input_df)
}

orig_df <- clean_dup_func(orig_df, "URL", "Keyword")
summary(orig_df)

##   crs_title             Rating         Level             Duration        
##  Length:6404        Min.   :0.000   Length:6404        Length:6404       
##  Class :character   1st Qu.:4.300   Class :character   Class :character  
##  Mode  :character   Median :4.600   Mode  :character   Mode  :character  
##                     Mean   :3.739                                        
##                     3rd Qu.:4.800                                        
##                     Max.   :5.000                                        
##                                                                          
##    Schedule             Review       will_learn           skill          
##  Length:6404        Min.   :  0.0   Length:6404        Length:6404       
##  Class :character   1st Qu.: 10.0   Class :character   Class :character  
##  Mode  :character   Median : 59.0   Mode  :character   Mode  :character  
##                     Mean   :168.8                                        
##                     3rd Qu.:244.0                                        
##                     Max.   :998.0                                        
##                     NA's   :1173                                         
##    Modules           Instructor          offered            Keyword         
##  Length:6404        Length:6404        Length:6404        Length:6404       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      URL           
##  Length:6404       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##
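
The loop in clean_dup_func rescans the data frame once per duplicated URL. The same result can be sketched as a grouped operation; this is an illustrative alternative on made-up data, assuming dplyr is loaded, and toy and dedup are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: URL "a" appears twice with different keywords
toy <- data.frame(
  URL = c("a", "b", "a"),
  Keyword = c("Arts", "Health", "Business"),
  stringsAsFactors = FALSE
)

dedup <- toy %>%
  group_by(URL) %>%
  mutate(Keyword = paste(Keyword, collapse = ", ")) %>%  # merge keywords per URL
  ungroup() %>%
  distinct(URL, .keep_all = TRUE)                        # keep the first row of each URL

print(dedup)
```

Using paste(unique(Keyword), collapse = ", ") in the mutate() step would additionally avoid repeating a keyword that occurs on several rows of the same URL.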

5. Cleaning Social Sciences duplication in Keyword column

After the fourth cleaning step, some rows contain a repeated Social Sciences keyword, which we don’t want. We don’t build a custom function for this step. Instead, we directly replace the repeated term with a single one in any row that has more than one Social Sciences keyword, using str_count for counting and str_replace_all for replacing.

orig_df$Keyword <- ifelse(
  str_count(orig_df$Keyword, fixed("Social Sciences")) > 1, str_replace_all(
    orig_df$Keyword, fixed("Social Sciences, Social Sciences"), "Social Sciences"
  ), orig_df$Keyword
)

unique(str_count(orig_df$Keyword, fixed("Social Sciences")))

## [1] 0 1

After the replacement, every row has either 0 or 1 Social Sciences keyword, as desired.
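
The replacement above targets one specific keyword and only collapses adjacent repeats. A more general approach, sketched here in base R under the assumption that keywords are always separated by ", ", splits each entry, drops repeated terms, and joins the rest back together. dedupe_keywords is a hypothetical helper and the sample strings are made up.

```r
dedupe_keywords <- function(x) {
  parts <- strsplit(x, ", ", fixed = TRUE)
  # Keep the first occurrence of each keyword, preserving order
  vapply(parts, function(p) paste(unique(p), collapse = ", "), character(1))
}

keywords <- c(
  "Social Sciences, Social Sciences",
  "Arts and Humanities, Social Sciences, Arts and Humanities"
)
cleaned_keywords <- dedupe_keywords(keywords)
print(cleaned_keywords)
```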

6. Cleaning specific signs [, ], '

When viewing the data set, we see that the columns skill, Modules, Instructor, and offered contain specific signs in their values. We want to remove these for consistency. Since there are different signs, we build a custom function that accepts several of them: sign2 and sign3 are optional inputs with empty defaults, since we are not sure how many signs we will have, and gsub performs the matching and replacement.

head(orig_df)

##                                                      crs_title Rating
## 1                                            Fashion as Design    4.8
## 2                                       Modern American Poetry    4.4
## 3                                    Pixel Art for Video Games    4.5
## 4              Distribución digital de la música independiente    0.0
## 5 The Blues: Understanding and Performing an American Art Form    4.8
## 6                                 So You Think You Know Tango?    4.6
##            Level                     Duration          Schedule Review
## 1 Beginner level     20 hours (approximately) Flexible schedule     NA
## 2 Beginner level Approx. 34 hours to complete Flexible schedule    100
## 3 Beginner level      9 hours (approximately) Flexible schedule    227
## 4 Beginner level  Approx. 8 hours to complete Flexible schedule      0
## 5 Beginner level Approx. 11 hours to complete Flexible schedule    582
## 6 Beginner level  Approx. 5 hours to complete Flexible schedule    107
##                                                                                                                                                                                                                                                                 will_learn
## 1                                                                                                                                                                                                                                                           No information
## 2                                                                                                                                                                                                                                                           No information
## 3                                                                                                                                                                                                                                                           No information
## 4                                                                                                                                                                                                                                                           No information
## 5                                                                                                       Students will be able to describe the blues as an important musical form. Students will be able to explain differences in jazz and other variations of the blues. 
## 6 Gain an appreciation for the Argentine Tango as a multidimensional art form, including music, dance, and poetry.List the components that make a tango.Eexplore the tango as a living art form today and how it can be used to cross cultural boundaries and stereotypes.
##                                              skill
## 1  ['Art History', 'Art', 'History', 'Creativity']
## 2                                               []
## 3                                               []
## 4                                               []
## 5 ['Music', 'Chord', 'Jazz', 'Jazz Improvisation']
## 6                                               []
##                                                                                                                                                                                                                               Modules
## 1                                                                                                                                         ['Introduction', 'Heroes', 'Silhouettes', 'Coutures', 'Lifecycle', 'Modesty', 'Expression']
## 2                                                                                                                                                                     ['Orientation', 'Module 1', 'Module 2', 'Module 3', 'Module 4']
## 3                                                                 ['Week 1: Introduction to Pixel Art', 'Week 2: Pixel Art Environments', 'Week 3: Pixel Art Characters', 'Week 4: Pixel Art Animation', 'Week 5: Pixel Art Project']
## 4                                                                                                                                                                                    ['Semana 1', 'Semana 2', 'Semana 3', 'Semana 4']
## 5 ['Blues Progressions – Theory and Practice ', 'Blues Scales ', 'Keyboard Realization ', '“Bird” Blues and Other Blues Progressions', 'Improvisational Tools ', 'Improvising the Blues – Part 1 ', 'Improvising the Blues – Part 2']
## 6                                                                                                                                   ['Module 1: The Many Dimensions of Tango and Tango Music', 'Module 2: Tango Words and Movements']
##                                                                             Instructor
## 1 ['Anna Burckhardt', 'Paola Antonelli', 'Michelle Millar Fisher', 'Stephanie Kramer']
## 2                                                                      ['Cary Nelson']
## 3                                               ['Andrew Dennis', 'Ricardo Guimaraes']
## 4                                                        ['Eduardo de la Vara Brown.']
## 5                                                                ['Dariusz Terefenko']
## 6                                                                 ['Kristin Wendland']
##                                          offered             Keyword
## 1                   ['The Museum of Modern Art'] Arts and Humanities
## 2 ['University of Illinois at Urbana-Champaign'] Arts and Humanities
## 3                  ['Michigan State University'] Arts and Humanities
## 4                       ['SAE Institute México'] Arts and Humanities
## 5                    ['University of Rochester'] Arts and Humanities
## 6                           ['Emory University'] Arts and Humanities
##                                                                              URL
## 1                                  https://www.coursera.org/learn/fashion-design
## 2                          https://www.coursera.org/learn/modern-american-poetry
## 3                           https://www.coursera.org/learn/pixel-art-video-games
## 4 https://www.coursera.org/learn/distribucion-digital-de-la-musica-independiente
## 5                                       https://www.coursera.org/learn/the-blues
## 6                                           https://www.coursera.org/learn/tango

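clean_sign_func <- function(input_df, colname, sign1, sign2 = "", sign3 = "") {
  # Each sign is treated as a regular expression, so metacharacters must be escaped by the caller
  signs <- c(sign1, sign2, sign3)
  # Join the non-empty signs into one alternation, e.g. "\\[|\\]|'"
  pattern <- paste(signs[signs != ""], collapse = "|")
  # gsub() is vectorized, so the whole column is cleaned in one call
  input_df[[colname]] <- gsub(pattern, "", input_df[[colname]])
  return(input_df)
}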

# Apply function
orig_df <- clean_sign_func(orig_df, "skill", "\\[", "\\]", "'")
orig_df <- clean_sign_func(orig_df, "Modules", "\\[", "\\]", "'")
orig_df <- clean_sign_func(orig_df, "Instructor", "\\[", "\\]", "'")
orig_df <- clean_sign_func(orig_df, "offered", "\\[", "\\]", "'")

After removing the signs, entries that were [] become empty, so we fill them using clean_empty_func.
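
The call for this step is not shown, so the following is only a sketch of what it could look like on made-up data. It assumes the same 'No information' placeholder as before, and uses a base R stand-in (fill_empty, a hypothetical name) so that the example runs on its own.

```r
# Base R stand-in for clean_empty_func, so this sketch is self-contained
fill_empty <- function(df, colname, replace_value) {
  df[[colname]][which(df[[colname]] == "")] <- replace_value
  df
}

# Hypothetical toy data in the same shape as the skill column
toy <- data.frame(skill = c("['Music', 'Jazz']", "[]"), stringsAsFactors = FALSE)

# Removing the signs turns "[]" into an empty string
toy$skill <- gsub("\\[|\\]|'", "", toy$skill)

# Fill the newly empty entries
toy <- fill_empty(toy, "skill", "No information")
print(toy)
```

In the project itself, the equivalent would presumably be a loop over skill, Modules, Instructor, and offered that calls clean_empty_func(orig_df, col, 'No information').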

7. Cleaning Duration column

Looking at the Duration column, we see that the data is structured inconsistently. Therefore, we will convert the values into a single unit, hours.

orig_df %>% head(20) %>% select(Duration)

##                        Duration
## 1      20 hours (approximately)
## 2  Approx. 34 hours to complete
## 3       9 hours (approximately)
## 4   Approx. 8 hours to complete
## 5  Approx. 11 hours to complete
## 6   Approx. 5 hours to complete
## 7  Approx. 13 hours to complete
## 8  Approx. 23 hours to complete
## 9   Approx. 5 hours to complete
## 10 Approx. 44 hours to complete
## 11     10 hours (approximately)
## 12 Approx. 26 hours to complete
## 13 Approx. 12 hours to complete
## 14     10 hours (approximately)
## 15 Approx. 21 hours to complete
## 16      6 hours (approximately)
## 17               No information
## 18 Approx. 11 hours to complete
## 19 Approx. 21 hours to complete
## 20 Approx. 17 hours to complete

Have a look at unique values for Duration:

unique(orig_df$Duration) %>% head(10)

##  [1] "20 hours (approximately)"     "Approx. 34 hours to complete"
##  [3] "9 hours (approximately)"      "Approx. 8 hours to complete" 
##  [5] "Approx. 11 hours to complete" "Approx. 5 hours to complete" 
##  [7] "Approx. 13 hours to complete" "Approx. 23 hours to complete"
##  [9] "Approx. 44 hours to complete" "10 hours (approximately)"

We can see that the values include empty entries, approx or approximately, hours, minutes, weeks, and months. One entry is a full sentence, which is the longest entry. Another entry spells out one hour in letters instead of digits. The cleaning steps include:

Replace the longest entry with a correct value, locating it with which.max
Replace the entry written in letters with a correct value, using which to find the row index
Delete the approximate wording, matching it with pprox as a regular expression (this matches both Approx. and approximately, regardless of the capital letter)

First, we copy the entries containing pprox into a new data frame:

approx_rows <- orig_df[which(str_detect(orig_df$Duration, "pprox")), c("Duration", "URL")]
approx_rows[which.max(nchar(approx_rows$Duration)), "Duration"] <- "2"
unique(approx_rows) %>% head(20)

##                        Duration
## 1      20 hours (approximately)
## 2  Approx. 34 hours to complete
## 3       9 hours (approximately)
## 4   Approx. 8 hours to complete
## 5  Approx. 11 hours to complete
## 6   Approx. 5 hours to complete
## 7  Approx. 13 hours to complete
## 8  Approx. 23 hours to complete
## 9   Approx. 5 hours to complete
## 10 Approx. 44 hours to complete
## 11     10 hours (approximately)
## 12 Approx. 26 hours to complete
## 13 Approx. 12 hours to complete
## 14     10 hours (approximately)
## 15 Approx. 21 hours to complete
## 16      6 hours (approximately)
## 18 Approx. 11 hours to complete
## 19 Approx. 21 hours to complete
## 20 Approx. 17 hours to complete
## 21 Approx. 14 hours to complete
##                                                                               URL
## 1                                   https://www.coursera.org/learn/fashion-design
## 2                           https://www.coursera.org/learn/modern-american-poetry
## 3                            https://www.coursera.org/learn/pixel-art-video-games
## 4  https://www.coursera.org/learn/distribucion-digital-de-la-musica-independiente
## 5                                        https://www.coursera.org/learn/the-blues
## 6                                            https://www.coursera.org/learn/tango
## 7                    https://www.coursera.org/learn/erasmus-philosophy-skepticism
## 8                  https://www.coursera.org/learn/the-making-of-modern-ukraine-ua
## 9              https://www.coursera.org/learn/exploring-beethoven-piano-sonatas-4
## 10                                         https://www.coursera.org/learn/bei-lun
## 11                                         https://www.coursera.org/learn/pai-she
## 12                                        https://www.coursera.org/learn/painting
## 13                           https://www.coursera.org/learn/philosophy-of-science
## 14                           https://www.coursera.org/learn/basic-elements-design
## 15                        https://www.coursera.org/learn/interactive-media-gaming
## 16                         https://www.coursera.org/learn/concept-art-video-games
## 18                  https://www.coursera.org/learn/history-israel-sovereign-state
## 19                           https://www.coursera.org/learn/shaping-urban-futures
## 20                              https://www.coursera.org/learn/chinese-philosophy
## 21                         https://www.coursera.org/learn/cultura-maya-en-yucatan

Since these entries only contain an estimate in hours, we can simply strip every non-digit character and keep the number of hours:

approx_rows[, "Duration"] <- gsub("[^0-9]", "", approx_rows[, "Duration"])
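To see what this substitution does, here is a small sketch on two values copied from the output above plus one made-up value. The made-up value shows a caveat: if an entry contained two numbers, their digits would be glued together, so the approach assumes a single hours figure per entry.

```r
samples <- c(
  "20 hours (approximately)",      # from the output above
  "Approx. 34 hours to complete",  # from the output above
  "Approx. 1 hour 30 minutes"      # made up, to show the caveat
)

# [^0-9] matches any character that is not a digit; gsub() removes every match
digits_only <- gsub("[^0-9]", "", samples)
digits_only
# "20"  "34"  "130"
```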

Some entries follow a consistent format of <number> month(s) at <number> hours a week; these are converted to a single number of hours further below. For now, we write the cleaned values back to orig_df and replace the "one hour" entry with 1:

matches <- grepl("pprox", orig_df$Duration)
orig_df[matches, "Duration"] <- approx_rows$Duration
orig_df[which(orig_df$Duration == "one hour"), "Duration"] <- "1"
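The write-back relies on the cleaned values lining up with the rows selected by matches, in the same order. A minimal sketch of the same pattern on a made-up data frame (toy and cleaned are illustrative names), with a length check before assigning:

```r
# Made-up data frame standing in for orig_df
toy <- data.frame(
  Duration = c("Approx. 5 hours to complete",
               "3 months at 10 hours a week",
               "one hour",
               "2 hours (approximately)"),
  stringsAsFactors = FALSE
)

# Logical mask of the rows that mention "pprox"
matches <- grepl("pprox", toy$Duration)

# Keep only the digits of the matched rows
cleaned <- gsub("[^0-9]", "", toy$Duration[matches])

# The assignment is only safe if both sides have the same number of rows
stopifnot(sum(matches) == length(cleaned))
toy$Duration[matches] <- cleaned

# Replace the spelled-out value with a digit
toy$Duration[toy$Duration == "one hour"] <- "1"
toy$Duration
# "5" "3 months at 10 hours a week" "1" "2"
```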

To make sure no stray whitespace remains after the cleaning so far, we trim every column:

for (col_name in colnames(orig_df)){
  orig_df[[col_name]] <- trimws(orig_df[[col_name]])
}
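Note that trimws returns character vectors, so the loop above also turns numeric columns into character columns. If that is not wanted, one possible alternative is to trim only the character columns. A sketch on a made-up data frame (toy is an illustrative name):

```r
# Made-up data frame with one character and one numeric column
toy <- data.frame(
  title  = c("  Tango ", "Blues"),
  rating = c(4.6, 4.8),
  stringsAsFactors = FALSE
)

# Find the character columns and trim only those
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], trimws)

str(toy)
```

Here rating stays numeric, so no later conversion back to a number is needed for it.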

Looking at the unique values of the Duration column again, a number of entries share the format <digits> months at <digits> hours per/a week. We will build a function that extracts the numbers of months, hours, and minutes. It takes two parameters, value and request. The request parameter selects how the result is computed: convert_hours turns a months-at-hours-per-week entry into a number of hours, while total_hours sums hours, minutes, and months for the other entries:

duration_func <- function(value, request) {
  # Already a plain whole number: return it as numeric
  if (grepl("^\\d+$", value)) {
    return(as.numeric(value))
  }
  # Missing durations count as 0
  if (value == 'No information') {
    return(0)
  }

  # Extract the number in front of each unit (NA when the unit is absent)
  hours <- as.numeric(
    str_extract(value, "(?<=\\b)\\d+(\\.\\d+)?(?=\\s+hours?\\b)")
  )
  minutes <- as.numeric(
    str_extract(value, "(?<=\\b)\\d+(\\.\\d+)?(?=\\s+min(ute)?s?\\b)")
  )
  months <- as.numeric(
    str_extract(value, "(?<=\\b)\\d+(\\.\\d+)?(?=\\s+months?\\b)")
  )

  # Treat an absent unit as 0
  hours <- ifelse(is.na(hours), 0, hours)
  minutes <- ifelse(is.na(minutes), 0, minutes)
  months <- ifelse(is.na(months), 0, months)

  if (request == 'total_hours') {
    # Sum all units in hours; a month is taken as 720 hours (30 days x 24 hours)
    total_hours <- hours + minutes/60 + months*720
    return(round(total_hours, 2))
  } else if (request == 'convert_hours') {
    # "<n> months at <h> hours a week": hours per week x months x 4 weeks
    if (months == 0) {
      return(hours)
    } else {
      return(hours*months*4)
    }
  }

  # Unknown request
  return(0)
}

orig_df <- orig_df %>% mutate(
  Duration = sapply(Duration, duration_func, request = 'convert_hours')
)

orig_df <- orig_df %>% mutate(
  Duration = sapply(Duration, duration_func, request = 'total_hours')
)

orig_df$Duration <- as.double(orig_df$Duration)
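As a worked example of the convert_hours arithmetic, the sketch below uses base R only. The helper num_before and the input string are hypothetical and only mirror the look-ahead idea used in duration_func:

```r
# Hypothetical helper: the number that precedes a unit word, or 0 if absent
num_before <- function(text, unit) {
  pattern <- paste0("\\d+(\\.\\d+)?(?=\\s+", unit, ")")
  m <- regmatches(text, regexpr(pattern, text, perl = TRUE))
  if (length(m) == 0) 0 else as.numeric(m)
}

value <- "3 months at 10 hours a week"   # made-up entry in the common format

months  <- num_before(value, "months?\\b")
hours   <- num_before(value, "hours?\\b")
minutes <- num_before(value, "min(ute)?s?\\b")

# 10 hours a week x 3 months x 4 weeks per month
converted <- hours * months * 4
converted
# 120
```

The factor 4 assumes four weeks per month, as in duration_func, so the result is an approximation.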

Here is a summary of the current data set, followed by its first 10 and last 10 rows:

summary(orig_df)

##   crs_title            Rating             Level              Duration     
##  Length:6404        Length:6404        Length:6404        Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.:  7.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 14.00  
##                                                           Mean   : 22.96  
##                                                           3rd Qu.: 25.00  
##                                                           Max.   :480.00  
##    Schedule            Review           will_learn           skill          
##  Length:6404        Length:6404        Length:6404        Length:6404       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Modules           Instructor          offered            Keyword         
##  Length:6404        Length:6404        Length:6404        Length:6404       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      URL           
##  Length:6404       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

orig_df %>% head(10)

##                                                       crs_title Rating
## 1                                             Fashion as Design    4.8
## 2                                        Modern American Poetry    4.4
## 3                                     Pixel Art for Video Games    4.5
## 4               Distribución digital de la música independiente      0
## 5  The Blues: Understanding and Performing an American Art Form    4.8
## 6                                  So You Think You Know Tango?    4.6
## 7                                    The Politics of Skepticism    4.5
## 8                                  Становлення сучасної України      0
## 9                    Exploring Beethoven's Piano Sonatas Part 4    4.9
## 10                                             悖论：思维的魔方    4.8
##                 Level Duration          Schedule Review
## 1      Beginner level       20 Flexible schedule   <NA>
## 2      Beginner level       34 Flexible schedule    100
## 3      Beginner level        9 Flexible schedule    227
## 4      Beginner level        8 Flexible schedule      0
## 5      Beginner level       11 Flexible schedule    582
## 6      Beginner level        5 Flexible schedule    107
## 7  Intermediate level       13 Flexible schedule     38
## 8      Beginner level       23 Flexible schedule      0
## 9  Intermediate level        5 Flexible schedule     63
## 10     Beginner level       44 Flexible schedule     39
##                                                                                                                                                                                                                                                                  will_learn
## 1                                                                                                                                                                                                                                                            No information
## 2                                                                                                                                                                                                                                                            No information
## 3                                                                                                                                                                                                                                                            No information
## 4                                                                                                                                                                                                                                                            No information
## 5                                                                                                         Students will be able to describe the blues as an important musical form. Students will be able to explain differences in jazz and other variations of the blues.
## 6  Gain an appreciation for the Argentine Tango as a multidimensional art form, including music, dance, and poetry.List the components that make a tango.Eexplore the tango as a living art form today and how it can be used to cross cultural boundaries and stereotypes.
## 7                                                                                                                                                                                                                                                            No information
## 8                                                                                                                                                                                                                                                            No information
## 9                                                                                                                                                                                                                                                            No information
## 10                                                                                                                                                                                                                                                           No information
##                                     skill
## 1   Art History, Art, History, Creativity
## 2                                        
## 3                                        
## 4                                        
## 5  Music, Chord, Jazz, Jazz Improvisation
## 6                                        
## 7                                        
## 8                                        
## 9                                        
## 10                                       
##                                                                                                                                                                                                                                                                                                                                                                                                        Modules
## 1                                                                                                                                                                                                                                                                                                                                  Introduction, Heroes, Silhouettes, Coutures, Lifecycle, Modesty, Expression
## 2                                                                                                                                                                                                                                                                                                                                                          Orientation, Module 1, Module 2, Module 3, Module 4
## 3                                                                                                                                                                                                                                                      Week 1: Introduction to Pixel Art, Week 2: Pixel Art Environments, Week 3: Pixel Art Characters, Week 4: Pixel Art Animation, Week 5: Pixel Art Project
## 4                                                                                                                                                                                                                                                                                                                                                                       Semana 1, Semana 2, Semana 3, Semana 4
## 5                                                                                                                                                                                          Blues Progressions – Theory and Practice , Blues Scales , Keyboard Realization , “Bird” Blues and Other Blues Progressions, Improvisational Tools , Improvising the Blues – Part 1 , Improvising the Blues – Part 2
## 6                                                                                                                                                                                                                                                                                                                  Module 1: The Many Dimensions of Tango and Tango Music, Module 2: Tango Words and Movements
## 7  Political Origins, Skepticism and Religion, Skepticism and Natural Law, Skepticism and Conservatism, “There’s a method to his madness”: Responses to Cartesian Skepticism, Fallibilism, prejudices and toleration: Lessons from Pyrrhonian Skepticism, The marketplace of ideas: An imaginative argument for freedom of expression, The benefit of the doubt: Critical creative problem solving in politics
## 8                                                                                                                                                                                                                                                                                                                                                                                 Становлення сучасної України
## 9                                                                                                                                                                                                                                                                                                                       Welcome to Class!, Op. 2, No. 2, Op. 10, No. 3, Op. 28, Op. 110, Your Thoughts Welcome
## 10                                                                                                                                             预备知识和悖论概述, 上帝悖论和连锁悖论, 芝诺悖论和无穷之谜, 逻辑-集合论悖论和语义悖论, 语义悖论、归纳悖论和认知悖论, 各种认知悖论, 认知悖论和合理行动悖论, 道德悖论和中国古代悖论, 中国古代悖论, 关于悖论的进一步思考, 拓展材料：研讨会实录及学生报告, 期末考试
##                                                                    Instructor
## 1  Anna Burckhardt, Paola Antonelli, Michelle Millar Fisher, Stephanie Kramer
## 2                                                                 Cary Nelson
## 3                                            Andrew Dennis, Ricardo Guimaraes
## 4                                                   Eduardo de la Vara Brown.
## 5                                                           Dariusz Terefenko
## 6                                                            Kristin Wendland
## 7                                                  Tim De Mey, Wiep van Bunge
## 8                                                              Timothy Snyder
## 9                                                               Jonathan Biss
## 10                                                                       陈波
##                                       offered             Keyword
## 1                    The Museum of Modern Art Arts and Humanities
## 2  University of Illinois at Urbana-Champaign Arts and Humanities
## 3                   Michigan State University Arts and Humanities
## 4                        SAE Institute México Arts and Humanities
## 5                     University of Rochester Arts and Humanities
## 6                            Emory University Arts and Humanities
## 7                Erasmus University Rotterdam Arts and Humanities
## 8                             Yale University Arts and Humanities
## 9                   Curtis Institute of Music Arts and Humanities
## 10                          Peking University Arts and Humanities
##                                                                               URL
## 1                                   https://www.coursera.org/learn/fashion-design
## 2                           https://www.coursera.org/learn/modern-american-poetry
## 3                            https://www.coursera.org/learn/pixel-art-video-games
## 4  https://www.coursera.org/learn/distribucion-digital-de-la-musica-independiente
## 5                                        https://www.coursera.org/learn/the-blues
## 6                                            https://www.coursera.org/learn/tango
## 7                    https://www.coursera.org/learn/erasmus-philosophy-skepticism
## 8                  https://www.coursera.org/learn/the-making-of-modern-ukraine-ua
## 9              https://www.coursera.org/learn/exploring-beethoven-piano-sonatas-4
## 10                                         https://www.coursera.org/learn/bei-lun

orig_df %>% tail(10)

##                                                                                      crs_title
## 8686                                                            Felicidad y Políticas Públicas
## 8687                                   Teaching Impacts of Technology: Workplace of the Future
## 8688                                             Teaching Impacts of Technology: Relationships
## 8689                Computational Thinking for K-12 Educators: Abstraction, Methods, and Lists
## 8690                          L’engagement efficace de la société civile dans le développement
## 8691                         Architecting with Google Kubernetes Engine: Production en Español
## 8692 Computational Thinking for K-12 Educators: Nested If Statements and Compound Conditionals
## 8693                                                  Cómo combinar y analizar datos complejos
## 8694             Architecting with Google Kubernetes Engine: Workloads em Português Brasileiro
## 8695                                                        Visualizing static networks with R
##      Rating              Level Duration          Schedule Review
## 8686      0     Beginner level       13 Flexible schedule      0
## 8687      0     Beginner level       12 Flexible schedule      0
## 8688      0     Beginner level       10 Flexible schedule      0
## 8689      0     Beginner level        9 Flexible schedule      0
## 8690      0 Intermediate level       13 Flexible schedule      0
## 8691    4.9 Intermediate level       14 Flexible schedule     30
## 8692      0     Beginner level       11 Flexible schedule      0
## 8693      0     No information        9 Flexible schedule      0
## 8694      0 Intermediate level       19 Flexible schedule      0
## 8695      0 Intermediate level        2    No information      0
##                                                                                                                                                                                                                                                                                                                                                                                         will_learn
## 8686 Conocer las características de los conceptos de felicidad y políticas públicas y cómo influye la felicidad a la hora de diseñar las políticas. Distinguir entre la visión tradicional y la nueva perspectiva de ciudadanía, y definir la ciudadanía social y su relevancia en derechos básicos.  Analizar los usos de "satisfacción con la democracia" y el rol de formuladores de políticas.
## 8687                                                                                                                                                                                                                                                                                                                                                                                No information
## 8688                                                                                                                                                                                                                                                                                                                                                                                No information
## 8689                                                                                                                                                                                                                                                                                                                                                                                No information
## 8690                                                                                                                                                                                                                                                                                                                                                                                No information
## 8691                                                                                                                                                                                                                                                                                                                                                                                No information
## 8692                                                                                                                                                                                                                                                                                                                                                                                No information
## 8693                                                                                                                                                                                                                                                                                                                                                                                No information
## 8694                                                                                                                                                                                                                                                                                                                                                                                No information
## 8695                                                                                                                                                                                                                           Learn to preprocess raw data to create nodes and edgesLearn to create network data using the igraph packageLearn to visualize static networks with base R functions
##                                                                        skill
## 8686                   Democracia, Ciudadanía, Políticas públicas, Felicidad
## 8687                                                                        
## 8688                                                                        
## 8689                                           Education, want, belief, aunt
## 8690                                                                        
## 8691                                                                        
## 8692                                    Education, want, Resource, Causality
## 8693                                                                        
## 8694                                                                        
## 8695 Network Analysis, igraph, R Programming, Graph Drawing, text processing
##                                                                                                                                                                                                                                                                                                                                                                Modules
## 8686                                                                                                                                                                     Explorando los conceptos de felicidad y políticas públicas, Ciudadanía y felicidad, La satisfacción con la democracia y la felicidad, La felicidad y la formulación de las políticas públicas
## 8687                                                                                                                                                                                                             Course Orientation, Getting a Job in New Ways, Physical Ties to Work, Advancing your career in the technical world, Impacts of Computing and Pedagogy
## 8688                                                                                                                                        Course Orientation, Keeping Connected in a Global Society, Making Geography-based Connections, Impacts of Computing and Supporting Interactive Learning, More lesson plans for interactivity:  Impacts and Encoding Images
## 8689                                                                                                                                                                                                                                                       Course Orientation, Abstractions Part 1, Abstractions Part 2, Lists Part 1, Lists Part 2, Equity & Pedagogy
## 8690 Module 1 : Introduction à un engagement efficace de la société civile dans les processus de développement, Module 2 : Le dialogue multipartite, Module 3 : Efficacité, transparence et redevabilité des OSC en matière de développement , Module 4 : Coopération officielle au développement avec les OSC, Module 5 : L’environnement législatif et réglementaire
## 8691              Introducción al curso, Control de acceso y seguridad en Kubernetes y Google Kubernetes Engine (GKE), Almacenamiento de registros y monitorización con Google Kubernetes Engine (GKE), Cómo usar los servicios de almacenamiento administrados de Google Cloud con Google Kubernetes Engine (GKE), Cómo usar CI/CD con Google Kubernetes Engine (GKE)
## 8692                                                                                                                                                                                                                  Course Orientation, Nested If/Else Part 1, Nested If/Else Part 2, Compound Conditionals Part 1 , Compound Conditionals Part 2, Equity & Pedagogy
## 8693                                                                                                                                                                                                                                                                                                       Estimación básica, Modelos, Vinculación de registros, Ética
## 8694                                                                                                                                                                                                       Introdução ao curso, Operações do Kubernetes, Implantações, jobs e escalonamento, Rede do Google Kubernetes Engine (GKE), Armazenamento e dados permanentes
## 8695                                                                                                                                                                                                                                                                                                                                                Learn step-by-step
##                   Instructor                              offered
## 8686          Graciela Tonon               Universidad de Palermo
## 8687              Beth Simon   University of California San Diego
## 8688              Beth Simon   University of California San Diego
## 8689              Beth Simon   University of California San Diego
## 8690           Sanne Huesken         Erasmus University Rotterdam
## 8691   Google Cloud Training                         Google Cloud
## 8692              Beth Simon   University of California San Diego
## 8693 Richard Valliant, Ph.D. University of Maryland, College Park
## 8694   Google Cloud Training                         Google Cloud
## 8695      You (Lilian) Cheng             Coursera Project Network
##              Keyword
## 8686 Social Sciences
## 8687 Social Sciences
## 8688 Social Sciences
## 8689 Social Sciences
## 8690 Social Sciences
## 8691 Social Sciences
## 8692 Social Sciences
## 8693 Social Sciences
## 8694 Social Sciences
## 8695 Social Sciences
##                                                                                                           URL
## 8686                                            https://www.coursera.org/learn/felicidad-y-politicas-publicas
## 8687                                 https://www.coursera.org/learn/teach-impacts-technology-workplace-future
## 8688                                    https://www.coursera.org/learn/teach-impacts-technology-relationships
## 8689                       https://www.coursera.org/learn/block-programming-k12-educators-abstraction-methods
## 8690            https://www.coursera.org/learn/engagement-efficace-de-la-societe-civile-dans-le-developpement
## 8691                   https://www.coursera.org/learn/deploying-secure-kubernetes-containers-in-production-es
## 8692 https://www.coursera.org/learn/block-programming-k12-educators-nested-if-statement-compound-conditionals
## 8693                                      https://www.coursera.org/learn/data-collection-analytics-project-es
## 8694                       https://www.coursera.org/learn/deploying-workloads-google-kubernetes-engine-gke-br
## 8695                                          https://www.coursera.org/projects/visualizing-static-networks-r

colSums(is.na(orig_df))

##  crs_title     Rating      Level   Duration   Schedule     Review will_learn 
##          0          0          0          0          0       1173          0 
##      skill    Modules Instructor    offered    Keyword        URL 
##          0          0          0          0          0          0

orig_df <- clean_NA_func(orig_df,'Review',0)

colSums(orig_df == '')

##  crs_title     Rating      Level   Duration   Schedule     Review will_learn 
##          0          0          0          0          0          0          0 
##      skill    Modules Instructor    offered    Keyword        URL 
##       2183         95         62          0          0          0

for (col in c('skill', 'Modules', 'Instructor')) {
  orig_df <- clean_empty_func(orig_df, col, 'No information')
}
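
The helpers clean_NA_func and clean_empty_func are the project's own functions. As a rough base R sketch of the same idea, assuming they simply overwrite missing or empty entries in one column with a fill value, the toy data frame below (made-up values, not the Coursera data) is cleaned and then re-checked with the same colSums calls:

```r
# Toy data frame with made-up values; not the Coursera data set
toy <- data.frame(
  Review = c(10, NA, 25),
  skill = c("R", "", "SQL"),
  stringsAsFactors = FALSE
)

# Replace NA in a numeric column with 0
toy$Review[is.na(toy$Review)] <- 0

# Replace empty strings in a character column with a placeholder
toy$skill[toy$skill == ""] <- "No information"

# Re-run the checks: both should now report zero for every column
na_left <- colSums(is.na(toy))
empty_left <- colSums(toy == "")
print(na_left)
print(empty_left)
```

Re-running the checks after each cleaning step is a cheap way to confirm that no column was missed.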

Visualizations

1. Number of Courses vs. Level

We count the courses offered at each level to see whether Coursera concentrates on any particular level:

df_count <- orig_df %>% count(Level)
df_count <- df_count %>% mutate(
  Level = reorder(Level, -n)
)

ggplot(df_count, aes(x = Level, y=n, fill = Level)) +
  geom_bar(stat='identity') +
  labs(y = "Course count") +
  ggtitle("Courses vs. Level")

Beginner-level courses make up the largest share of Coursera's catalog, so the platform is a good resource for learners at that level.
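
To put a number on a claim like this, the counts can be turned into percentages. The sketch below uses base R and a made-up Level column, so the figures are illustrative only and say nothing about the real data set:

```r
# Made-up levels for illustration only
toy <- data.frame(
  Level = c("Beginner", "Beginner", "Beginner", "Intermediate", "Advanced")
)

# Count courses per level, then convert the counts to percentages
level_counts <- table(toy$Level)
level_share <- round(100 * prop.table(level_counts), 1)

# Largest share first
level_share_sorted <- sort(level_share, decreasing = TRUE)
print(level_share_sorted)
```

Applied to orig_df$Level, the same three lines would give the share of each level in the actual catalog.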

2. Course Count by Offering Organization

# Keep the five organizations with the most courses
# (named top_offered to avoid masking base::sample)
top_offered <- orig_df %>% count(offered) %>% arrange(desc(n)) %>% head(5)
ggplot(top_offered, aes(x = offered, y = n, fill = offered)) +
  geom_col() + 
  ggtitle('Top 5 Sources Offering') + 
  labs(x = 'Source Offer', y = 'Count', fill = 'Source Offer')

3. Number of Courses vs. Ratings

orig_df$Rating <- as.double(orig_df$Rating)
# With right = FALSE each bin is closed on the left, e.g. [1, 2),
# so the labels start at whole numbers (ratings have one decimal place)
ranges <- c(0, 1.0, 2.0, 3.0, 4.0, 5.1)
orig_df$ranges <- cut(orig_df$Rating, breaks = ranges, labels = c("0-0.9", "1.0-1.9", "2.0-2.9", "3.0-3.9", "4.0-5.0"), right = FALSE)
orig_df[which(is.na(orig_df$Review)), "Review"] <- 0
df_rating <- as.data.frame(count(orig_df, ranges))

ggplot(df_rating, aes(x = ranges, y = n, fill = ranges)) +
  geom_col() +
  labs(x = "Rating ranges", y = "Courses", fill = "Rating ranges") +
  ggtitle("Course vs. Rating")
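
How cut() bins a rating that sits exactly on a break depends on the right argument, so it is worth checking that the bin labels match the intervals. The sketch below uses made-up ratings to compare the two conventions. The second set of breaks is only an illustrative alternative.

```r
# Made-up ratings, including values that sit exactly on a break
ratings <- c(0, 1.0, 3.9, 4.0, 5.0)

# Left-closed bins: [0,1), [1,2), [2,3), [3,4), [4,5.1)
left_closed <- cut(ratings, breaks = c(0, 1, 2, 3, 4, 5.1), right = FALSE)

# Right-closed bins: [0,1], (1,2], (2,3], (3,4], (4,5]
right_closed <- cut(ratings, breaks = c(0, 1, 2, 3, 4, 5),
                    right = TRUE, include.lowest = TRUE)

# Side-by-side comparison of the bin each rating falls into
comparison <- data.frame(ratings, left_closed, right_closed)
print(comparison)
```

With left-closed bins a rating of exactly 4.0 falls in the top bin, while with right-closed bins it falls in the bin below, so the labels should be written to match whichever convention is used.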

4. Average Time Spent for a Course by Level

average_time_df <- orig_df %>% group_by(Level) %>%
  summarize(average_time = mean(Duration))
ggplot(average_time_df, aes(x=Level, y=average_time, fill=Level)) +
  geom_col() + 
  labs(title='Average Time Spent for a Course by Level', y='Average Time')

5. Distribution of Courses by Level and Rating Ranges

df_count <- orig_df %>% count(Level, ranges)
level_ranges_wideFormat <- dcast(df_count, Level ~ ranges, value.var = 'n')
level_ranges_matrix <- as.matrix(level_ranges_wideFormat[,-1])
rownames(level_ranges_matrix) <- level_ranges_wideFormat$Level
level_ranges_matrix <- melt(level_ranges_matrix, varnames=c("Level","Ranges"), value.name = 'Count')

ggplot(level_ranges_matrix, aes(x=Level, y=Ranges, fill=Count)) +
  geom_tile() +
  geom_text(aes(label=Count), color='black') +
  scale_fill_gradient(low='lightblue', high='blue', na.value='white') +
  labs(y='Rating Ranges', title='Distribution of Courses by Level and Rating Ranges')

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_text()`).
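
The warning above most likely comes from Level and rating-range combinations that contain no courses: the wide table has NA in those cells, and geom_text() drops rows whose label is missing. One way to show 0 instead is to count with table(), which keeps empty combinations. This is a base R sketch on made-up data with placeholder range labels, not the project's own pipeline:

```r
# Made-up data: no "Advanced" course falls in the "low" range
toy <- data.frame(
  Level = c("Beginner", "Beginner", "Advanced"),
  ranges = factor(c("high", "low", "high"), levels = c("low", "high"))
)

# table() counts every Level x range combination, using 0 for empty cells
counts <- as.data.frame(
  table(Level = toy$Level, Ranges = toy$ranges),
  responseName = "Count"
)
print(counts)
```

The resulting long data frame has no NA in Count, so it can be passed to the same geom_tile() and geom_text() layers. reshape2::dcast also accepts a fill argument that could serve the same purpose.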

Improvements to make

Custom functions such as duration_func could be made more flexible in the parameters they accept. Further analysis of the relationships between features would also be possible once the Instructor column has been cleaned more thoroughly.
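
As one illustration of what a more flexible parameter could look like, here is a hypothetical helper. It is not the project's duration_func; the name parse_duration, the unit argument, and the assumption that each string holds a number of hours are all made up for this sketch.

```r
# Hypothetical helper, not the project's duration_func.
# Assumes each string contains a duration expressed in hours.
parse_duration <- function(x, unit = c("hours", "minutes")) {
  unit <- match.arg(unit)
  # Locate the first integer or decimal number in each string
  hits <- regexpr("[0-9]+(\\.[0-9]+)?", x)
  out <- rep(NA_real_, length(x))
  # Strings without a number stay NA
  out[hits > 0] <- as.numeric(regmatches(x, hits))
  if (unit == "minutes") out * 60 else out
}

# Made-up example strings
example <- c("Approx. 20 hours", "no info", "1.5 hours")
hours <- parse_duration(example)
minutes <- parse_duration(example, unit = "minutes")
print(hours)
print(minutes)
```

Using match.arg() restricts unit to the listed choices and makes "hours" the default, so existing calls would not need to change.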

Contact

For any questions or feedback, please reach out to:

Name: Eirlys Vo
Email: vopq@mail.uc.edu
GitHub: ezishr
LinkedIn: Eirlys Vo

Feel free to open an issue or submit a pull request if you have suggestions or improvements!