The goal of this project is to perform a comprehensive analysis of a data set of Coursera courses. I explored the data set to understand its structure and performed the necessary data cleansing. Then, I created visualizations to draw insights into online study trends and user engagement.
The data is provided on Kaggle, which can be viewed here.
There are two data sets: a cleaned version and an uncleaned version. For
this project, I will only use the uncleaned version since our aim includes
data cleansing. The data set includes various columns holding basic
information about a course. I will use the course URL as our main
unique index, and I will also clean it in the Project
Workflow section.
Further data set description can be read here.
First, we will read the data set using the read.csv
function, save it as orig_df, and look at an overview of the
data set:
orig_df <- read.csv("CourseraDataset-Unclean.csv")
summary(orig_df)
## Course.Title Rating Level Duration
## Length:9595 Min. :1.500 Length:9595 Length:9595
## Class :character 1st Qu.:4.600 Class :character Class :character
## Mode :character Median :4.700 Mode :character Mode :character
## Mean :4.652
## 3rd Qu.:4.800
## Max. :5.000
## NA's :1439
## Schedule Review What.you.will.learn Skill.gain
## Length:9595 Length:9595 Length:9595 Length:9595
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Modules Instructor Offered.By Keyword
## Length:9595 Length:9595 Length:9595 Length:9595
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Course.Url
## Length:9595
## Class :character
## Mode :character
##
##
##
##
colnames(orig_df)
## [1] "Course.Title" "Rating" "Level"
## [4] "Duration" "Schedule" "Review"
## [7] "What.you.will.learn" "Skill.gain" "Modules"
## [10] "Instructor" "Offered.By" "Keyword"
## [13] "Course.Url"
We will load the libraries that help us clean and visualize the data in later steps:
library(tidyverse)
library(zoo)
library(textcat)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] graphics grDevices utils datasets stats methods base
##
## other attached packages:
## [1] reshape2_1.4.4 textcat_1.0-8 zoo_1.8-12 lubridate_1.9.3
## [5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
## [13] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 slam_0.1-52
## [5] stringi_1.8.4 lattice_0.22-6 hms_1.1.3 digest_0.6.37
## [9] magrittr_2.0.3 evaluate_0.24.0 grid_4.4.1 timechange_0.3.0
## [13] fastmap_1.2.0 plyr_1.8.9 jsonlite_1.8.8 fansi_1.0.6
## [17] scales_1.3.0 jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
## [21] munsell_0.5.1 withr_3.0.1 cachem_1.1.0 yaml_2.3.10
## [25] tools_4.4.1 tzdb_0.4.0 colorspace_2.1-1 vctrs_0.6.5
## [29] R6_2.5.1 lifecycle_1.0.4 pkgconfig_2.0.3 pillar_1.9.0
## [33] bslib_0.8.0 gtable_0.3.5 Rcpp_1.0.13 glue_1.7.0
## [37] xfun_0.47 tidyselect_1.2.1 tau_0.0-25 rstudioapi_0.16.0
## [41] knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.28 compiler_4.4.1
Some column names in the data set are hard to use
throughout the workflow, so we change them into shorter,
more accessible names. We use the rename function to
apply the new column names, then print the data type of each column:
orig_df <- rename(orig_df, crs_title = "Course.Title")
orig_df <- rename(orig_df, will_learn = "What.you.will.learn")
orig_df <- rename(orig_df, offered = "Offered.By")
orig_df <- rename(orig_df, URL = "Course.Url")
orig_df <- rename(orig_df, skill = "Skill.gain")
data_types <- sapply(orig_df, typeof)
results <- paste(names(orig_df), ":", data_types, collapse = ", ")
cat(results,'\n')
## crs_title : character, Rating : double, Level : character, Duration : character, Schedule : character, Review : character, will_learn : character, skill : character, Modules : character, Instructor : character, offered : character, Keyword : character, URL : character
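As a side note, the five renames can also be expressed as one call. The sketch below is only an illustration on a hypothetical two-row data frame (toy, lookup, and toy_renamed are made-up names, not objects from this project); it assumes dplyr 1.0 or later, where rename() accepts a named character vector through all_of().

```r
library(dplyr)

# Hypothetical stand-in for the raw data, using two of the original column names
toy <- data.frame(
  Course.Title = c("Fashion as Design", "Modern American Poetry"),
  Course.Url = c(
    "https://www.coursera.org/learn/fashion-design",
    "https://www.coursera.org/learn/modern-american-poetry"
  )
)

# Named vector in the form new_name = "old_name"
lookup <- c(crs_title = "Course.Title", URL = "Course.Url")

# all_of() makes rename() treat the vector as a set of exact column names
toy_renamed <- rename(toy, all_of(lookup))
colnames(toy_renamed)
```

Keeping the mapping in a single vector makes it easy to see every old and new name side by side.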
We use the head() function to see the format of the data
in every column:
head(orig_df)
## crs_title Rating
## 1 Fashion as Design 4.8
## 2 Modern American Poetry 4.4
## 3 Pixel Art for Video Games 4.5
## 4 Distribución digital de la música independiente NA
## 5 The Blues: Understanding and Performing an American Art Form 4.8
## 6 So You Think You Know Tango? 4.6
## Level Duration Schedule Review
## 1 Beginner level 20 hours (approximately) Flexible schedule 2,813 reviews
## 2 Beginner level Approx. 34 hours to complete Flexible schedule 100 reviews
## 3 Beginner level 9 hours (approximately) Flexible schedule 227 reviews
## 4 Beginner level Approx. 8 hours to complete Flexible schedule
## 5 Beginner level Approx. 11 hours to complete Flexible schedule 582 reviews
## 6 Beginner level Approx. 5 hours to complete Flexible schedule 107 reviews
## will_learn
## 1
## 2
## 3
## 4
## 5 Students will be able to describe the blues as an important musical form. Students will be able to explain differences in jazz and other variations of the blues.
## 6 Gain an appreciation for the Argentine Tango as a multidimensional art form, including music, dance, and poetry.List the components that make a tango.Eexplore the tango as a living art form today and how it can be used to cross cultural boundaries and stereotypes.
## skill
## 1 ['Art History', 'Art', 'History', 'Creativity']
## 2 []
## 3 []
## 4 []
## 5 ['Music', 'Chord', 'Jazz', 'Jazz Improvisation']
## 6 []
## Modules
## 1 ['Introduction', 'Heroes', 'Silhouettes', 'Coutures', 'Lifecycle', 'Modesty', 'Expression']
## 2 ['Orientation', 'Module 1', 'Module 2', 'Module 3', 'Module 4']
## 3 ['Week 1: Introduction to Pixel Art', 'Week 2: Pixel Art Environments', 'Week 3: Pixel Art Characters', 'Week 4: Pixel Art Animation', 'Week 5: Pixel Art Project']
## 4 ['Semana 1', 'Semana 2', 'Semana 3', 'Semana 4']
## 5 ['Blues Progressions – Theory and Practice ', 'Blues Scales ', 'Keyboard Realization ', '“Bird” Blues and Other Blues Progressions', 'Improvisational Tools ', 'Improvising the Blues – Part 1 ', 'Improvising the Blues – Part 2']
## 6 ['Module 1: The Many Dimensions of Tango and Tango Music', 'Module 2: Tango Words and Movements']
## Instructor
## 1 ['Anna Burckhardt', 'Paola Antonelli', 'Michelle Millar Fisher', 'Stephanie Kramer']
## 2 ['Cary Nelson']
## 3 ['Andrew Dennis', 'Ricardo Guimaraes']
## 4 ['Eduardo de la Vara Brown.']
## 5 ['Dariusz Terefenko']
## 6 ['Kristin Wendland']
## offered Keyword
## 1 ['The Museum of Modern Art'] Arts and Humanities
## 2 ['University of Illinois at Urbana-Champaign'] Arts and Humanities
## 3 ['Michigan State University'] Arts and Humanities
## 4 ['SAE Institute México'] Arts and Humanities
## 5 ['University of Rochester'] Arts and Humanities
## 6 ['Emory University'] Arts and Humanities
## URL
## 1 https://www.coursera.org/learn/fashion-design
## 2 https://www.coursera.org/learn/modern-american-poetry
## 3 https://www.coursera.org/learn/pixel-art-video-games
## 4 https://www.coursera.org/learn/distribucion-digital-de-la-musica-independiente
## 5 https://www.coursera.org/learn/the-blues
## 6 https://www.coursera.org/learn/tango
In the cleansing process, we first want to see the number of N/A values in every column:
colSums(is.na(orig_df))
## crs_title Rating Level Duration Schedule Review will_learn
## 0 1439 0 0 0 0 0
## skill Modules Instructor offered Keyword URL
## 0 0 0 0 0 0
There are 1,439 N/A values in the
Rating column. The Rating column is of type
double, so for simplicity we fill the N/A values with
0.
In case we need to clean N/A values again later
(for example, if a column is added), it is better to create a custom
function to handle this. The function takes a data
frame, the name of the column to cleanse, and the replacement
value.
We use mutate() to access the data frame and its
columns. Because the colname input of the function is a
string and mutate() expects a column symbol rather than a string,
we use sym() to convert the column name from a string to a symbol.
We need !! before it to unquote the result of sym() so that
mutate() evaluates it as a column.
Then, ifelse checks whether each value is
N/A:
clean_NA_func <- function(input_df, colname, replace_value) {
  input_df <- input_df %>% mutate(
    !!sym(colname) := ifelse(
      is.na(!!sym(colname)), replace_value, !!sym(colname)
    )
  )
  return(input_df)
}
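To make the string-to-column step more concrete, here is a minimal sketch on a hypothetical toy data frame. It is an alternative spelling of the same idea, not the function used in this report: it assumes dplyr 1.0 or later and uses the .data pronoun and the "{colname}" := naming syntax instead of !!sym(). The names toy, fill_na, and toy_filled are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, not the Coursera data set
toy <- data.frame(course = c("a", "b", "c"), Rating = c(4.8, NA, 4.4))

fill_na <- function(input_df, colname, replace_value) {
  input_df %>%
    mutate(
      # "{colname}" names the output column from the string;
      # .data[[colname]] reads the input column from the same string
      "{colname}" := ifelse(is.na(.data[[colname]]), replace_value, .data[[colname]])
    )
}

toy_filled <- fill_na(toy, "Rating", 0)
toy_filled$Rating  # 4.8 0.0 4.4
```

Both spellings let a function receive the column name as an ordinary string.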
We now apply this function to clean the Rating column
and check again:
orig_df <- clean_NA_func(orig_df, "Rating", 0)
colSums(is.na(orig_df))
## crs_title Rating Level Duration Schedule Review will_learn
## 0 0 0 0 0 0 0
## skill Modules Instructor offered Keyword URL
## 0 0 0 0 0 0
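One consequence of filling with 0 is worth keeping in mind: unrated courses are then counted as zero-rated in any average. The sketch below uses a made-up vector (not values from the data set) to show the difference between ignoring missing ratings and filling them with 0.

```r
# Hypothetical ratings; NA marks a course without a rating
ratings <- c(4.8, NA, 4.4, NA)

# Mean over rated courses only
mean_ignoring_na <- mean(ratings, na.rm = TRUE)

# Mean after replacing missing ratings with 0, as clean_NA_func does
ratings_filled <- ifelse(is.na(ratings), 0, ratings)
mean_with_zero_fill <- mean(ratings_filled)

mean_ignoring_na     # 4.6
mean_with_zero_fill  # 2.3
```

When averaging Rating later, it may therefore be safer to exclude the rows where Rating equals 0.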
Some entries are not identified as N/A
but are empty strings. We want to build a function that fills those empty
entries with a desired value. The logic behind this function is similar
to clean_NA_func, but we replace the
is.na check with == "". First, we count the empty entries per column:
colSums(orig_df == "")
## crs_title Rating Level Duration Schedule Review will_learn
## 0 0 1265 262 683 1443 4611
## skill Modules Instructor offered Keyword URL
## 0 0 0 0 0 0
clean_empty_func <- function(input_df, colname, replace_value) {
  input_df <- input_df %>% mutate(
    !!sym(colname) := ifelse(
      !!sym(colname) == "", replace_value, !!sym(colname)
    )
  )
  # Return the data frame explicitly, as clean_NA_func does
  return(input_df)
}
A number of columns have empty entries. We fill the empty entries of the text columns with 'No information'; the Review column is handled separately afterwards.
columns_to_clean = c('Level','Duration','Schedule','will_learn','skill','Modules','Instructor')
for (col in columns_to_clean) {
orig_df <- clean_empty_func(orig_df, col, 'No information')
}
colSums(orig_df=="")
## crs_title Rating Level Duration Schedule Review will_learn
## 0 0 0 0 0 1443 0
## skill Modules Instructor offered Keyword URL
## 0 0 0 0 0 0
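The same before-and-after check can be tried on a small example. This is only a sketch on a hypothetical two-column data frame, written in base R with lapply() rather than with the clean_empty_func defined above.

```r
# Hypothetical toy data with empty strings in both columns
toy <- data.frame(
  Level = c("Beginner level", "", "Intermediate level"),
  Schedule = c("", "Flexible schedule", ""),
  stringsAsFactors = FALSE
)

empty_counts_before <- colSums(toy == "")

# Replace empty strings column by column; toy[] keeps the data frame structure
toy[] <- lapply(toy, function(column) {
  column[column == ""] <- "No information"
  column
})

empty_counts_after <- colSums(toy == "")
empty_counts_before  # Level 1, Schedule 2
empty_counts_after   # Level 0, Schedule 0
```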
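For the Review column specifically, the entries contain the
word reviews, so we remove this word (using
str_replace_all), fill the empty entries with 0, and then
convert the data to double type for further
analysis. Note that as.double cannot parse counts written with a
thousands separator, such as 2,813, so those entries become N/A;
this is the cause of the coercion warning.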
orig_df$Review <- str_replace_all(orig_df$Review, fixed(" reviews"), "")
orig_df <- clean_empty_func(orig_df, 'Review', '0')
orig_df$Review <- as.double(orig_df$Review)
## Warning: NAs introduced by coercion
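The coercion warning comes from review counts that contain a comma, such as "2,813 reviews" in the first row. A minimal sketch of one way to avoid it, assuming the counts only ever contain digits, commas, and the word reviews, is to keep the digits alone before converting. The vector raw_reviews below is made up for illustration.

```r
# Hypothetical sample of raw Review values, including a thousands separator and an empty entry
raw_reviews <- c("2,813 reviews", "100 reviews", "")

# Drop everything that is not a digit (the comma, the space, and the word "reviews")
digits_only <- gsub("[^0-9]", "", raw_reviews)

# Empty entries stay empty after the substitution, so fill them with "0"
digits_only[digits_only == ""] <- "0"

review_count <- as.numeric(digits_only)
review_count  # 2813 100 0
```

With this approach no value is turned into N/A, so counts above 999 are preserved.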
As mentioned in the Objective section, we use the URL as our
main index, so we need to remove duplicated links. Some
rows share the same link but have different keywords, so we want to append
those keywords together. The function uses a
while loop that runs as long as duplicates
remain. In each pass, we find the indices of all rows sharing one duplicated URL, paste
the keywords of the higher-index rows onto the lowest-index row, and then remove
the higher-index rows.
sum(duplicated(orig_df$URL))
## [1] 3191
clean_dup_func <- function(input_df, col_dup, col_append) {
  while (sum(duplicated(input_df[, col_dup])) >= 1) {
    dup_df <- input_df[which(duplicated(input_df[, col_dup])), ] # get df of duplicated rows of URL
    link_1st_row <- as.character(dup_df[1, col_dup])
    row_indices <- as.numeric(which(input_df[, col_dup] == link_1st_row)) # get rows having same links
    row_indices_without_min <- row_indices[-which.min(row_indices)]
    for (i in row_indices_without_min) {
      input_df[min(row_indices), col_append] <- paste(
        input_df[min(row_indices), col_append], input_df[i, col_append],
        sep = ", "
      )
    }
    input_df <- input_df[-row_indices_without_min, ]
  }
  return(input_df)
}
orig_df <- clean_dup_func(orig_df, "URL", "Keyword")
summary(orig_df)
## crs_title Rating Level Duration
## Length:6404 Min. :0.000 Length:6404 Length:6404
## Class :character 1st Qu.:4.300 Class :character Class :character
## Mode :character Median :4.600 Mode :character Mode :character
## Mean :3.739
## 3rd Qu.:4.800
## Max. :5.000
##
## Schedule Review will_learn skill
## Length:6404 Min. : 0.0 Length:6404 Length:6404
## Class :character 1st Qu.: 10.0 Class :character Class :character
## Mode :character Median : 59.0 Mode :character Mode :character
## Mean :168.8
## 3rd Qu.:244.0
## Max. :998.0
## NA's :1173
## Modules Instructor offered Keyword
## Length:6404 Length:6404 Length:6404 Length:6404
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## URL
## Length:6404
## Class :character
## Mode :character
##
##
##
##
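For comparison, the same merge-then-drop idea can be written without a loop. The following is a sketch on a hypothetical five-row data frame, using base R's tapply() and duplicated(); it is an alternative for illustration, not the function used above, and the toy URLs and keywords are invented.

```r
# Hypothetical toy data: u1 and u2 each appear twice with different keywords
toy <- data.frame(
  URL = c("u1", "u2", "u1", "u3", "u2"),
  Keyword = c("Arts and Humanities", "Data Science", "Business", "Health", "Computer Science"),
  stringsAsFactors = FALSE
)

# Join all keywords that belong to the same URL, in order of appearance
merged_keywords <- tapply(toy$Keyword, toy$URL, paste, collapse = ", ")

# Keep the first row of each URL, then attach the merged keywords by URL name
deduped <- toy[!duplicated(toy$URL), ]
deduped$Keyword <- as.vector(merged_keywords[deduped$URL])

deduped$Keyword
# "Arts and Humanities, Business" "Data Science, Computer Science" "Health"
```

Because the grouping is done in one pass, this version avoids rescanning the data frame for every duplicated URL.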
Looking at the data set, the columns skill, Modules,
Instructor, and offered contain list punctuation (square brackets and
single quotes) in their values. We want to remove these signs for
consistency. Since the signs differ, we build a custom function that
accepts several of them. The logic behind this custom function is similar
to the others created above, but sign2 and
sign3 are optional placeholder inputs because we do not know in advance
how many signs a column will have, and we use gsub for matching and
replacement.
head(orig_df)
## crs_title Rating
## 1 Fashion as Design 4.8
## 2 Modern American Poetry 4.4
## 3 Pixel Art for Video Games 4.5
## 4 Distribución digital de la música independiente 0.0
## 5 The Blues: Understanding and Performing an American Art Form 4.8
## 6 So You Think You Know Tango? 4.6
## Level Duration Schedule Review
## 1 Beginner level 20 hours (approximately) Flexible schedule NA
## 2 Beginner level Approx. 34 hours to complete Flexible schedule 100
## 3 Beginner level 9 hours (approximately) Flexible schedule 227
## 4 Beginner level Approx. 8 hours to complete Flexible schedule 0
## 5 Beginner level Approx. 11 hours to complete Flexible schedule 582
## 6 Beginner level Approx. 5 hours to complete Flexible schedule 107
## will_learn
## 1 No information
## 2 No information
## 3 No information
## 4 No information
## 5 Students will be able to describe the blues as an important musical form. Students will be able to explain differences in jazz and other variations of the blues.
## 6 Gain an appreciation for the Argentine Tango as a multidimensional art form, including music, dance, and poetry.List the components that make a tango.Eexplore the tango as a living art form today and how it can be used to cross cultural boundaries and stereotypes.
## skill
## 1 ['Art History', 'Art', 'History', 'Creativity']
## 2 []
## 3 []
## 4 []
## 5 ['Music', 'Chord', 'Jazz', 'Jazz Improvisation']
## 6 []
## Modules
## 1 ['Introduction', 'Heroes', 'Silhouettes', 'Coutures', 'Lifecycle', 'Modesty', 'Expression']
## 2 ['Orientation', 'Module 1', 'Module 2', 'Module 3', 'Module 4']
## 3 ['Week 1: Introduction to Pixel Art', 'Week 2: Pixel Art Environments', 'Week 3: Pixel Art Characters', 'Week 4: Pixel Art Animation', 'Week 5: Pixel Art Project']
## 4 ['Semana 1', 'Semana 2', 'Semana 3', 'Semana 4']
## 5 ['Blues Progressions – Theory and Practice ', 'Blues Scales ', 'Keyboard Realization ', '“Bird” Blues and Other Blues Progressions', 'Improvisational Tools ', 'Improvising the Blues – Part 1 ', 'Improvising the Blues – Part 2']
## 6 ['Module 1: The Many Dimensions of Tango and Tango Music', 'Module 2: Tango Words and Movements']
## Instructor
## 1 ['Anna Burckhardt', 'Paola Antonelli', 'Michelle Millar Fisher', 'Stephanie Kramer']
## 2 ['Cary Nelson']
## 3 ['Andrew Dennis', 'Ricardo Guimaraes']
## 4 ['Eduardo de la Vara Brown.']
## 5 ['Dariusz Terefenko']
## 6 ['Kristin Wendland']
## offered Keyword
## 1 ['The Museum of Modern Art'] Arts and Humanities
## 2 ['University of Illinois at Urbana-Champaign'] Arts and Humanities
## 3 ['Michigan State University'] Arts and Humanities
## 4 ['SAE Institute México'] Arts and Humanities
## 5 ['University of Rochester'] Arts and Humanities
## 6 ['Emory University'] Arts and Humanities
## URL
## 1 https://www.coursera.org/learn/fashion-design
## 2 https://www.coursera.org/learn/modern-american-poetry
## 3 https://www.coursera.org/learn/pixel-art-video-games
## 4 https://www.coursera.org/learn/distribucion-digital-de-la-musica-independiente
## 5 https://www.coursera.org/learn/the-blues
## 6 https://www.coursera.org/learn/tango
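clean_sign_func <- function(input_df, colname, sign1, sign2 = "", sign3 = "") {
  # The signs are regular expressions; unused (empty) placeholders are dropped
  signs <- c(sign1, sign2, sign3)
  pattern <- paste(signs[signs != ""], collapse = "|")
  # gsub is vectorized, so it cleans the whole column at once
  input_df[, colname] <- gsub(pattern, "", input_df[, colname])
  return(input_df)
}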
# Apply function
orig_df <- clean_sign_func(orig_df, "skill", "\\[", "\\]", "'")
orig_df <- clean_sign_func(orig_df, "Modules", "\\[", "\\]", "'")
orig_df <- clean_sign_func(orig_df, "Instructor", "\\[", "\\]", "'")
orig_df <- clean_sign_func(orig_df, "offered", "\\[", "\\]", "'")
After removing the signs, entries that held only an empty list ([]) become
empty strings, which clean_empty_func can fill in the same way as before.
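The two steps (stripping the signs, then filling what is left empty) can be sketched on a small vector. This is an illustration in base R with made-up values in the same style as the skill column, not output from the actual data set.

```r
# Hypothetical raw values: a populated list and an empty list
skill_raw <- c("['Music', 'Chord', 'Jazz']", "[]")

# Remove opening brackets, closing brackets, and single quotes
skill_clean <- gsub("\\[|\\]|'", "", skill_raw)

# An entry that was just "[]" is now an empty string, so fill it
skill_clean[skill_clean == ""] <- "No information"

skill_clean  # "Music, Chord, Jazz" "No information"
```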
Looking at the Duration column, we see that the data is
structured inconsistently. Therefore, we will convert the values
into a single unit, hours.
orig_df %>% head(20) %>% select(Duration)
## Duration
## 1 20 hours (approximately)
## 2 Approx. 34 hours to complete
## 3 9 hours (approximately)
## 4 Approx. 8 hours to complete
## 5 Approx. 11 hours to complete
## 6 Approx. 5 hours to complete
## 7 Approx. 13 hours to complete
## 8 Approx. 23 hours to complete
## 9 Approx. 5 hours to complete
## 10 Approx. 44 hours to complete
## 11 10 hours (approximately)
## 12 Approx. 26 hours to complete
## 13 Approx. 12 hours to complete
## 14 10 hours (approximately)
## 15 Approx. 21 hours to complete
## 16 6 hours (approximately)
## 17 No information
## 18 Approx. 11 hours to complete
## 19 Approx. 21 hours to complete
## 20 Approx. 17 hours to complete
Have a look at unique values for Duration:
unique(orig_df$Duration) %>% head(10)
## [1] "20 hours (approximately)" "Approx. 34 hours to complete"
## [3] "9 hours (approximately)" "Approx. 8 hours to complete"
## [5] "Approx. 11 hours to complete" "Approx. 5 hours to complete"
## [7] "Approx. 13 hours to complete" "Approx. 23 hours to complete"
## [9] "Approx. 44 hours to complete" "10 hours (approximately)"
We can see that there are empty entries as well as entries containing approx or
approximately, hours, minutes,
week, and month. There is one entry with a
full sentence, which is the longest entry, and one entry that spells out
one hour in letters instead of digits. The cleaning steps use
which.max to locate the longest entry, which to find the row index, and
pprox as a regular expression to match both Approx and approximately.
First, we collect the entries containing pprox into a new data
frame and replace the longest entry with 2:
approx_rows <- orig_df[which(str_detect(orig_df$Duration, "pprox")), c("Duration", "URL")]
approx_rows[which.max(nchar(approx_rows$Duration)), "Duration"] <- "2"
unique(approx_rows) %>% head(20)
## Duration
## 1 20 hours (approximately)
## 2 Approx. 34 hours to complete
## 3 9 hours (approximately)
## 4 Approx. 8 hours to complete
## 5 Approx. 11 hours to complete
## 6 Approx. 5 hours to complete
## 7 Approx. 13 hours to complete
## 8 Approx. 23 hours to complete
## 9 Approx. 5 hours to complete
## 10 Approx. 44 hours to complete
## 11 10 hours (approximately)
## 12 Approx. 26 hours to complete
## 13 Approx. 12 hours to complete
## 14 10 hours (approximately)
## 15 Approx. 21 hours to complete
## 16 6 hours (approximately)
## 18 Approx. 11 hours to complete
## 19 Approx. 21 hours to complete
## 20 Approx. 17 hours to complete
## 21 Approx. 14 hours to complete
## URL
## 1 https://www.coursera.org/learn/fashion-design
## 2 https://www.coursera.org/learn/modern-american-poetry
## 3 https://www.coursera.org/learn/pixel-art-video-games
## 4 https://www.coursera.org/learn/distribucion-digital-de-la-musica-independiente
## 5 https://www.coursera.org/learn/the-blues
## 6 https://www.coursera.org/learn/tango
## 7 https://www.coursera.org/learn/erasmus-philosophy-skepticism
## 8 https://www.coursera.org/learn/the-making-of-modern-ukraine-ua
## 9 https://www.coursera.org/learn/exploring-beethoven-piano-sonatas-4
## 10 https://www.coursera.org/learn/bei-lun
## 11 https://www.coursera.org/learn/pai-she
## 12 https://www.coursera.org/learn/painting
## 13 https://www.coursera.org/learn/philosophy-of-science
## 14 https://www.coursera.org/learn/basic-elements-design
## 15 https://www.coursera.org/learn/interactive-media-gaming
## 16 https://www.coursera.org/learn/concept-art-video-games
## 18 https://www.coursera.org/learn/history-israel-sovereign-state
## 19 https://www.coursera.org/learn/shaping-urban-futures
## 20 https://www.coursera.org/learn/chinese-philosophy
## 21 https://www.coursera.org/learn/cultura-maya-en-yucatan
Since these entries only contain an estimate in hours, we can simply remove
all non-digit characters, leaving just the number of hours for these
entries.
approx_rows[, "Duration"] <- gsub("[^0-9]", "", approx_rows[, "Duration"])
We then write the cleaned values back into orig_df and replace the
entry one hour with 1. Entries with the consistent format
<number> month(s) at <number> hours a week are converted in a later step.
matches <- grepl("pprox", orig_df$Duration)
orig_df[matches, "Duration"] <- approx_rows$Duration
orig_df[which(orig_df$Duration == "one hour"), "Duration"] <- "1"
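To make sure no stray whitespace remains from the cleansing
process so far, we trim every column. Note that trimws returns character
vectors, so this loop also converts the numeric Rating and Review columns
to character, as the summary further below shows: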
for (col_name in colnames(orig_df)){
orig_df[[col_name]] <- trimws(orig_df[[col_name]])
}
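If the numeric columns should keep their type, one option is to trim only the character columns. The sketch below shows this on a hypothetical two-column data frame; toy and char_cols are illustrative names and this is not the loop that produced the outputs in this report.

```r
# Hypothetical toy data: one character column with padding, one numeric column
toy <- data.frame(
  Duration = c(" 20 ", "34"),
  Rating = c(4.8, 4.4),
  stringsAsFactors = FALSE
)

# Select the names of the character columns only
char_cols <- names(toy)[sapply(toy, is.character)]

# Trim those columns and leave the others untouched
toy[char_cols] <- lapply(toy[char_cols], trimws)

toy$Duration           # "20" "34"
is.numeric(toy$Rating) # TRUE
```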
Looking at the unique values of the Duration column again,
we can see a number of entries with a format similar to
<digits> months at <digits> hours per/a week.
We build a function to extract the numbers of months, hours, and
minutes. It has two parameters: value and
request. The request parameter selects how the
number of hours is computed and accepts two
options, total_hours and
convert_hours. With convert_hours, the weekly hours are multiplied by the
number of months and by 4, which assumes about four weeks per month:
duration_func <- function(value, request) {
  if (grepl("^\\d+$", value)) {return(as.numeric(value))}
  else if(value != 'No information'){
    hours <- as.numeric((
      str_extract(value, "(?<=\\b)\\d+(\\.\\d+)?(?=\\s+hours?\\b)")
    ))
    minutes <- as.numeric((
      str_extract(value, "(?<=\\b)\\d+(\\.\\d+)?(?=\\s+min(ute)?s?\\b)")
    ))
    months <- as.numeric((
      str_extract(value, "(?<=\\b)\\d+(\\.\\d+)?(?=\\s+months?\\b)")
    ))
    hours <- ifelse(is.na(hours),0,hours)
    minutes <- ifelse(is.na(minutes),0,minutes)
    months <- ifelse(is.na(months),0,months)
    if (request == 'total_hours') {
      total_hours <- hours + minutes/60 + months*720
      return(round(total_hours,2))
    } else if (request == 'convert_hours') {
      if(months == 0){
        return(hours)
      } else {
        return (hours*months*4)
      }
    } else {
      return (0)
    }
  } else if (value == 'No information') {
    return (0)
  }
  return (value)
}
orig_df <- orig_df %>% mutate(
Duration = sapply(Duration, duration_func, request = 'convert_hours')
)
orig_df <- orig_df %>% mutate(
Duration = sapply(Duration, duration_func, request = 'total_hours')
)
orig_df$Duration <- as.double(orig_df$Duration)
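To illustrate the extraction idea on a few inputs, here is a simplified sketch in base R. It is not duration_func itself: the helper names extract_number and to_hours are hypothetical, it uses regmatches() with a lookahead instead of str_extract(), it counts minutes as fractions of an hour, and it keeps the same assumption of four weeks per month for the months-at-hours-per-week format.

```r
# Return the first number that is directly followed by the given unit, or 0 if none
extract_number <- function(text, unit) {
  pattern <- paste0("[0-9]+(\\.[0-9]+)?(?=\\s+", unit, ")")
  hit <- regmatches(text, regexpr(pattern, text, perl = TRUE))
  if (length(hit) == 0) 0 else as.numeric(hit)
}

# Convert one duration string to hours
to_hours <- function(text) {
  hours <- extract_number(text, "hour")
  minutes <- extract_number(text, "min")
  months <- extract_number(text, "month")
  base_hours <- hours + minutes / 60
  # With a month count, the hours are weekly: months * 4 weeks * hours per week
  if (months > 0) base_hours * months * 4 else base_hours
}

to_hours("Approx. 34 hours to complete")  # 34
to_hours("3 months at 10 hours a week")   # 120
to_hours("30 minutes")                    # 0.5
to_hours("No information")                # 0
```

For a whole column, the helper could be applied with vapply(durations, to_hours, numeric(1)), where durations stands for a character vector of raw values.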
Now we look at a summary of the current data set and at its first 10 and last 10 rows:
summary(orig_df)
## crs_title Rating Level Duration
## Length:6404 Length:6404 Length:6404 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 7.00
## Mode :character Mode :character Mode :character Median : 14.00
## Mean : 22.96
## 3rd Qu.: 25.00
## Max. :480.00
## Schedule Review will_learn skill
## Length:6404 Length:6404 Length:6404 Length:6404
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Modules Instructor offered Keyword
## Length:6404 Length:6404 Length:6404 Length:6404
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## URL
## Length:6404
## Class :character
## Mode :character
##
##
##
orig_df %>% head(10)
## crs_title Rating
## 1 Fashion as Design 4.8
## 2 Modern American Poetry 4.4
## 3 Pixel Art for Video Games 4.5
## 4 Distribución digital de la música independiente 0
## 5 The Blues: Understanding and Performing an American Art Form 4.8
## 6 So You Think You Know Tango? 4.6
## 7 The Politics of Skepticism 4.5
## 8 Становлення сучасної України 0
## 9 Exploring Beethoven's Piano Sonatas Part 4 4.9
## 10 悖论:思维的魔方 4.8
## Level Duration Schedule Review
## 1 Beginner level 20 Flexible schedule <NA>
## 2 Beginner level 34 Flexible schedule 100
## 3 Beginner level 9 Flexible schedule 227
## 4 Beginner level 8 Flexible schedule 0
## 5 Beginner level 11 Flexible schedule 582
## 6 Beginner level 5 Flexible schedule 107
## 7 Intermediate level 13 Flexible schedule 38
## 8 Beginner level 23 Flexible schedule 0
## 9 Intermediate level 5 Flexible schedule 63
## 10 Beginner level 44 Flexible schedule 39
## will_learn
## 1 No information
## 2 No information
## 3 No information
## 4 No information
## 5 Students will be able to describe the blues as an important musical form. Students will be able to explain differences in jazz and other variations of the blues.
## 6 Gain an appreciation for the Argentine Tango as a multidimensional art form, including music, dance, and poetry.List the components that make a tango.Eexplore the tango as a living art form today and how it can be used to cross cultural boundaries and stereotypes.
## 7 No information
## 8 No information
## 9 No information
## 10 No information
## skill
## 1 Art History, Art, History, Creativity
## 2
## 3
## 4
## 5 Music, Chord, Jazz, Jazz Improvisation
## 6
## 7
## 8
## 9
## 10
## Modules
## 1 Introduction, Heroes, Silhouettes, Coutures, Lifecycle, Modesty, Expression
## 2 Orientation, Module 1, Module 2, Module 3, Module 4
## 3 Week 1: Introduction to Pixel Art, Week 2: Pixel Art Environments, Week 3: Pixel Art Characters, Week 4: Pixel Art Animation, Week 5: Pixel Art Project
## 4 Semana 1, Semana 2, Semana 3, Semana 4
## 5 Blues Progressions – Theory and Practice , Blues Scales , Keyboard Realization , “Bird” Blues and Other Blues Progressions, Improvisational Tools , Improvising the Blues – Part 1 , Improvising the Blues – Part 2
## 6 Module 1: The Many Dimensions of Tango and Tango Music, Module 2: Tango Words and Movements
## 7 Political Origins, Skepticism and Religion, Skepticism and Natural Law, Skepticism and Conservatism, “There’s a method to his madness”: Responses to Cartesian Skepticism, Fallibilism, prejudices and toleration: Lessons from Pyrrhonian Skepticism, The marketplace of ideas: An imaginative argument for freedom of expression, The benefit of the doubt: Critical creative problem solving in politics
## 8 Становлення сучасної України
## 9 Welcome to Class!, Op. 2, No. 2, Op. 10, No. 3, Op. 28, Op. 110, Your Thoughts Welcome
## 10 预备知识和悖论概述, 上帝悖论和连锁悖论, 芝诺悖论和无穷之谜, 逻辑-集合论悖论和语义悖论, 语义悖论、归纳悖论和认知悖论, 各种认知悖论, 认知悖论和合理行动悖论, 道德悖论和中国古代悖论, 中国古代悖论, 关于悖论的进一步思考, 拓展材料:研讨会实录及学生报告, 期末考试
## Instructor
## 1 Anna Burckhardt, Paola Antonelli, Michelle Millar Fisher, Stephanie Kramer
## 2 Cary Nelson
## 3 Andrew Dennis, Ricardo Guimaraes
## 4 Eduardo de la Vara Brown.
## 5 Dariusz Terefenko
## 6 Kristin Wendland
## 7 Tim De Mey, Wiep van Bunge
## 8 Timothy Snyder
## 9 Jonathan Biss
## 10 陈波
## offered Keyword
## 1 The Museum of Modern Art Arts and Humanities
## 2 University of Illinois at Urbana-Champaign Arts and Humanities
## 3 Michigan State University Arts and Humanities
## 4 SAE Institute México Arts and Humanities
## 5 University of Rochester Arts and Humanities
## 6 Emory University Arts and Humanities
## 7 Erasmus University Rotterdam Arts and Humanities
## 8 Yale University Arts and Humanities
## 9 Curtis Institute of Music Arts and Humanities
## 10 Peking University Arts and Humanities
## URL
## 1 https://www.coursera.org/learn/fashion-design
## 2 https://www.coursera.org/learn/modern-american-poetry
## 3 https://www.coursera.org/learn/pixel-art-video-games
## 4 https://www.coursera.org/learn/distribucion-digital-de-la-musica-independiente
## 5 https://www.coursera.org/learn/the-blues
## 6 https://www.coursera.org/learn/tango
## 7 https://www.coursera.org/learn/erasmus-philosophy-skepticism
## 8 https://www.coursera.org/learn/the-making-of-modern-ukraine-ua
## 9 https://www.coursera.org/learn/exploring-beethoven-piano-sonatas-4
## 10 https://www.coursera.org/learn/bei-lun
orig_df %>% tail(10)
## crs_title
## 8686 Felicidad y Políticas Públicas
## 8687 Teaching Impacts of Technology: Workplace of the Future
## 8688 Teaching Impacts of Technology: Relationships
## 8689 Computational Thinking for K-12 Educators: Abstraction, Methods, and Lists
## 8690 L’engagement efficace de la société civile dans le développement
## 8691 Architecting with Google Kubernetes Engine: Production en Español
## 8692 Computational Thinking for K-12 Educators: Nested If Statements and Compound Conditionals
## 8693 Cómo combinar y analizar datos complejos
## 8694 Architecting with Google Kubernetes Engine: Workloads em Português Brasileiro
## 8695 Visualizing static networks with R
## Rating Level Duration Schedule Review
## 8686 0 Beginner level 13 Flexible schedule 0
## 8687 0 Beginner level 12 Flexible schedule 0
## 8688 0 Beginner level 10 Flexible schedule 0
## 8689 0 Beginner level 9 Flexible schedule 0
## 8690 0 Intermediate level 13 Flexible schedule 0
## 8691 4.9 Intermediate level 14 Flexible schedule 30
## 8692 0 Beginner level 11 Flexible schedule 0
## 8693 0 No information 9 Flexible schedule 0
## 8694 0 Intermediate level 19 Flexible schedule 0
## 8695 0 Intermediate level 2 No information 0
## will_learn
## 8686 Conocer las características de los conceptos de felicidad y políticas públicas y cómo influye la felicidad a la hora de diseñar las políticas. Distinguir entre la visión tradicional y la nueva perspectiva de ciudadanía, y definir la ciudadanía social y su relevancia en derechos básicos. Analizar los usos de "satisfacción con la democracia" y el rol de formuladores de políticas.
## 8687 No information
## 8688 No information
## 8689 No information
## 8690 No information
## 8691 No information
## 8692 No information
## 8693 No information
## 8694 No information
## 8695 Learn to preprocess raw data to create nodes and edgesLearn to create network data using the igraph packageLearn to visualize static networks with base R functions
## skill
## 8686 Democracia, Ciudadanía, Políticas públicas, Felicidad
## 8687
## 8688
## 8689 Education, want, belief, aunt
## 8690
## 8691
## 8692 Education, want, Resource, Causality
## 8693
## 8694
## 8695 Network Analysis, igraph, R Programming, Graph Drawing, text processing
## Modules
## 8686 Explorando los conceptos de felicidad y políticas públicas, Ciudadanía y felicidad, La satisfacción con la democracia y la felicidad, La felicidad y la formulación de las políticas públicas
## 8687 Course Orientation, Getting a Job in New Ways, Physical Ties to Work, Advancing your career in the technical world, Impacts of Computing and Pedagogy
## 8688 Course Orientation, Keeping Connected in a Global Society, Making Geography-based Connections, Impacts of Computing and Supporting Interactive Learning, More lesson plans for interactivity: Impacts and Encoding Images
## 8689 Course Orientation, Abstractions Part 1, Abstractions Part 2, Lists Part 1, Lists Part 2, Equity & Pedagogy
## 8690 Module 1 : Introduction à un engagement efficace de la société civile dans les processus de développement, Module 2 : Le dialogue multipartite, Module 3 : Efficacité, transparence et redevabilité des OSC en matière de développement , Module 4 : Coopération officielle au développement avec les OSC, Module 5 : L’environnement législatif et réglementaire
## 8691 Introducción al curso, Control de acceso y seguridad en Kubernetes y Google Kubernetes Engine (GKE), Almacenamiento de registros y monitorización con Google Kubernetes Engine (GKE), Cómo usar los servicios de almacenamiento administrados de Google Cloud con Google Kubernetes Engine (GKE), Cómo usar CI/CD con Google Kubernetes Engine (GKE)
## 8692 Course Orientation, Nested If/Else Part 1, Nested If/Else Part 2, Compound Conditionals Part 1 , Compound Conditionals Part 2, Equity & Pedagogy
## 8693 Estimación básica, Modelos, Vinculación de registros, Ética
## 8694 Introdução ao curso, Operações do Kubernetes, Implantações, jobs e escalonamento, Rede do Google Kubernetes Engine (GKE), Armazenamento e dados permanentes
## 8695 Learn step-by-step
## Instructor offered
## 8686 Graciela Tonon Universidad de Palermo
## 8687 Beth Simon University of California San Diego
## 8688 Beth Simon University of California San Diego
## 8689 Beth Simon University of California San Diego
## 8690 Sanne Huesken Erasmus University Rotterdam
## 8691 Google Cloud Training Google Cloud
## 8692 Beth Simon University of California San Diego
## 8693 Richard Valliant, Ph.D. University of Maryland, College Park
## 8694 Google Cloud Training Google Cloud
## 8695 You (Lilian) Cheng Coursera Project Network
## Keyword
## 8686 Social Sciences
## 8687 Social Sciences
## 8688 Social Sciences
## 8689 Social Sciences
## 8690 Social Sciences
## 8691 Social Sciences
## 8692 Social Sciences
## 8693 Social Sciences
## 8694 Social Sciences
## 8695 Social Sciences
## URL
## 8686 https://www.coursera.org/learn/felicidad-y-politicas-publicas
## 8687 https://www.coursera.org/learn/teach-impacts-technology-workplace-future
## 8688 https://www.coursera.org/learn/teach-impacts-technology-relationships
## 8689 https://www.coursera.org/learn/block-programming-k12-educators-abstraction-methods
## 8690 https://www.coursera.org/learn/engagement-efficace-de-la-societe-civile-dans-le-developpement
## 8691 https://www.coursera.org/learn/deploying-secure-kubernetes-containers-in-production-es
## 8692 https://www.coursera.org/learn/block-programming-k12-educators-nested-if-statement-compound-conditionals
## 8693 https://www.coursera.org/learn/data-collection-analytics-project-es
## 8694 https://www.coursera.org/learn/deploying-workloads-google-kubernetes-engine-gke-br
## 8695 https://www.coursera.org/projects/visualizing-static-networks-r
colSums(is.na(orig_df))
## crs_title Rating Level Duration Schedule Review will_learn
## 0 0 0 0 0 1173 0
## skill Modules Instructor offered Keyword URL
## 0 0 0 0 0 0
orig_df <- clean_NA_func(orig_df,'Review',0)
colSums(orig_df == '')
## crs_title Rating Level Duration Schedule Review will_learn
## 0 0 0 0 0 0 0
## skill Modules Instructor offered Keyword URL
## 2183 95 62 0 0 0
for (col in c('skill', 'Modules', 'Instructor')) {
  orig_df <- clean_empty_func(orig_df, col, 'No information')
}
Next, we count the number of courses at each level to see whether Coursera's catalog concentrates on any particular level:
df_count <- orig_df %>%
  count(Level) %>%
  mutate(Level = reorder(Level, -n))
ggplot(df_count, aes(x = Level, y = n, fill = Level)) +
  geom_bar(stat = 'identity') +
  labs(y = "Course count") +
  ggtitle("Courses vs. Level")
Beginner-level courses account for most of Coursera's catalog,
so the platform is a good resource for anyone at this level.
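To put a number behind this observation, the counts can be turned into shares. The sketch below uses a small made-up data frame (toy_df and its values are illustrative assumptions, not the Coursera data) with the same Level column:

```r
library(dplyr)

# Toy stand-in for orig_df; the values are made up for illustration only.
toy_df <- data.frame(
  Level = c("Beginner", "Beginner", "Beginner", "Intermediate", "Advanced")
)

# Count courses per level, then express each count as a share of all courses.
level_share <- toy_df %>%
  count(Level) %>%
  mutate(share = n / sum(n)) %>%
  arrange(desc(n))

print(level_share)
```

Applied to orig_df, the same pipeline would report what fraction of the catalog each level represents.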
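# Named top_offered rather than sample to avoid masking base::sample()
top_offered <- orig_df %>% count(offered) %>% arrange(desc(n)) %>% head(5)
# reorder() sorts the bars by course count instead of alphabetically
ggplot(top_offered, aes(x = reorder(offered, -n), y = n, fill = offered)) +
  geom_col() +
  ggtitle('Top 5 Course Providers') +
  labs(x = 'Provider', y = 'Count', fill = 'Provider')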
orig_df$Rating <- as.double(orig_df$Rating)
# Right-closed bins so that boundary ratings match the labels, e.g. 4.0 falls in "3.1-4.0"
# (assumes ratings are recorded to one decimal place)
ranges <- c(0, 1.0, 2.0, 3.0, 4.0, 5.0)
orig_df$ranges <- cut(orig_df$Rating, breaks = ranges,
  labels = c("0-1.0", "1.1-2.0", "2.1-3.0", "3.1-4.0", "4.1-5.0"),
  right = TRUE, include.lowest = TRUE)
orig_df[which(is.na(orig_df$Review)), "Review"] <- 0
df_rating <- as.data.frame(count(orig_df, ranges))
ggplot(df_rating, aes(x = ranges, y = n, fill = ranges)) +
  geom_col() +
  labs(x = "Rating ranges", y = "Courses", fill = "Rating ranges") +
  ggtitle("Course vs. Rating")
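How cut() treats values that sit exactly on a break determines which label they receive. A small sketch on made-up ratings (the values are illustrative assumptions), comparing left-closed and right-closed bins:

```r
# Made-up ratings that sit on or next to bin edges (illustrative only).
ratings <- c(1.0, 3.9, 4.0, 4.1, 5.0)
labels  <- c("0-1.0", "1.1-2.0", "2.1-3.0", "3.1-4.0", "4.1-5.0")

# Left-closed bins [a, b): a rating of exactly 4.0 falls into the last bin.
left_closed <- cut(ratings, breaks = c(0, 1, 2, 3, 4, 5.1),
                   labels = labels, right = FALSE)

# Right-closed bins (a, b]: a rating of exactly 4.0 stays in "3.1-4.0".
right_closed <- cut(ratings, breaks = c(0, 1, 2, 3, 4, 5),
                    labels = labels, right = TRUE, include.lowest = TRUE)

print(data.frame(ratings, left_closed, right_closed))
```

With labels of the form "3.1-4.0", the right-closed variant is the one whose bins agree with the label text for ratings on a whole number.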
average_time_df <- orig_df %>%
  group_by(Level) %>%
  summarize(average_time = mean(Duration))
ggplot(average_time_df, aes(x = Level, y = average_time, fill = Level)) +
  geom_col() +
  labs(title = 'Average Time Spent for a Course by Level', y = 'Average Time')
df_count <- orig_df %>% count(Level, ranges)
level_ranges_wideFormat <- dcast(df_count, Level ~ ranges, value.var = 'n')
level_ranges_matrix <- as.matrix(level_ranges_wideFormat[, -1])
rownames(level_ranges_matrix) <- level_ranges_wideFormat$Level
level_ranges_matrix <- melt(level_ranges_matrix, varnames = c("Level", "Ranges"), value.name = 'Count')
ggplot(level_ranges_matrix, aes(x = Level, y = Ranges, fill = Count)) +
  geom_tile() +
  geom_text(aes(label = Count), color = 'black') +
  scale_fill_gradient(low = 'lightblue', high = 'blue', na.value = 'white') +
  labs(y = 'Rating Ranges', title = 'Distribution of Courses by Level and Rating Ranges')
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_text()`).
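The warning most likely comes from Level and rating-range combinations that contain no courses: they are absent from the counts and become NA once reshaped, so geom_text() has no label to draw for those tiles. One way to show explicit zeros instead, sketched here on made-up data (toy_df is an illustrative assumption), is to count with base R's table(), which keeps empty combinations:

```r
# Toy stand-in for orig_df; values are made up for illustration only.
toy_df <- data.frame(
  Level  = factor(c("Beginner", "Beginner", "Advanced")),
  ranges = factor(c("3.1-4.0", "4.1-5.0", "4.1-5.0"),
                  levels = c("3.1-4.0", "4.1-5.0"))
)

# table() keeps every Level x ranges combination, using 0 for empty cells.
counts <- as.data.frame(table(Level = toy_df$Level, Ranges = toy_df$ranges),
                        responseName = "Count")

print(counts)
```

The resulting data frame has Level, Ranges, and Count columns, so it could be passed to the same kind of ggplot() heatmap call without producing NA labels.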
Custom functions such as duration_func could be made more
flexible in the parameters they accept. Further analysis of
relationships between features would also be possible once
additional cleansing is applied to the Instructor column.
For any questions or feedback, please reach out to:
Feel free to open an issue or submit a pull request if you have suggestions or improvements!