Introduction
Dataset Description
Assumption based on Dataset
Preprocessing Steps
Tokenization
Stopwords
N-grams
Word_Frequency
WordCloud
Insights
This education dataset was extracted from the Coursera website by web scraping with UiPath Studio, and the scraped data were saved as CSV files. The dataset lists free online data science courses together with the skills each course develops, its duration, and its rating.
str(course)
## 'data.frame': 239 obs. of 4 variables:
## $ Courses : chr "IBM Data Analyst" "Introduction to Data Science" "Data Processing Using Python" "HTML, CSS, and Javascript for Web Developers" ...
## $ Skills.Will.Learn : chr "Skills you'll gain: Algebra, Analysis, Apache, Big Data, Business Analysis, Computational Logic, Computer Progr"| __truncated__ "Skills you'll gain: Communication, Computer Programming, Data Analysis, Data Management, Data Mining, Database "| __truncated__ "Skills you'll gain: Statistical Programming, Computer Programming, Python Programming" "Skills you'll gain: Web Design, Html, HTML and CSS, Web Development, CSS" ...
## $ Rating.And.Reviews: chr "4.6\n\n(48.7k reviews)" "4.6\n\n(67.6k reviews)" "4.2\n\n(260 reviews)" "4.7\n\n(13.7k reviews)" ...
## $ Course.Duration : chr "Beginner · Professional Certificate · 3+ Months" "Beginner · Specialization · 3+ Months" "Beginner · Course · 1-3 Months" "Mixed · Course · 1-3 Months" ...
summary(course)
## Courses Skills.Will.Learn Rating.And.Reviews Course.Duration
## Length:239 Length:239 Length:239 Length:239
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
skim(course)
| Name | course |
| Number of rows | 239 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Courses | 0 | 1 | 13 | 77 | 0 | 239 | 0 |
| Skills.Will.Learn | 0 | 1 | 0 | 1113 | 20 | 217 | 0 |
| Rating.And.Reviews | 0 | 1 | 0 | 21 | 21 | 208 | 0 |
| Course.Duration | 0 | 1 | 26 | 48 | 0 | 18 | 0 |
names(course)
## [1] "Courses" "Skills.Will.Learn" "Rating.And.Reviews"
## [4] "Course.Duration"
newcourse<-course[c("Courses")]
The skim command gives a fuller overview of the dataset than str or summary, and the names command lists the column names present in the dataset. Subsetting is then used to select the column needed for the preprocessing steps.
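The subsetting step can be illustrated on a toy data frame. This is a minimal sketch: the two rows are copied from the str output above, and the name `titles` is hypothetical. It shows that `course[c("Courses")]` keeps a one-column data frame, whereas `course$Courses` returns a plain character vector.

```r
# Toy stand-in for `course`, using two rows from the str() output
course <- data.frame(
  Courses = c("IBM Data Analyst", "Introduction to Data Science"),
  Rating.And.Reviews = c("4.6\n\n(48.7k reviews)", "4.6\n\n(67.6k reviews)"),
  stringsAsFactors = FALSE
)

# Selecting by a character vector of names keeps the data frame structure
newcourse <- course[c("Courses")]

# `$` extracts the column itself as a character vector (hypothetical name)
titles <- course$Courses

print(class(newcourse))
print(class(titles))
```

Keeping a data frame matters for the later steps, because the tokenizing functions used in this report take a data frame and a column name rather than a bare vector.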
## Assumption based on the dataset
In this dataset I chose the Courses attribute for text preprocessing, and then found which words have the highest and the lowest frequencies. From these frequencies I inferred which topics have more free courses on Coursera. The dataset contains 239 distinct courses, all related to data science.
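The call that produced the token listing that follows is not shown in this report. A minimal sketch of that step, assuming it was tidytext's `unnest_tokens` applied to `newcourse` with the default word tokenizer and an output column named `word` (the two toy titles are the first two in the dataset):

```r
library(tidytext)

# Toy stand-in for the one-column data frame `newcourse`
newcourse <- data.frame(
  Courses = c("IBM Data Analyst", "Introduction to Data Science"),
  stringsAsFactors = FALSE
)

# Assumed call: splits each title into lowercase words, one word per row,
# and drops the original Courses column
data_tokens <- unnest_tokens(newcourse, word, Courses)
print(data_tokens)
```

On these two titles the result has seven rows, from ibm to science, which matches the start of the listing.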
## word
## 1 ibm
## 2 data
## 3 analyst
## 4 introduction
## 5 to
## 6 data
## 7 science
## 8 data
## 9 processing
## 10 using
## word
## 1306 thread
## 1307 implementation
## 1308 build
## 1309 a
## 1310 twitter
## 1311 clone
## 1312 front
## 1313 end
## 1314 with
## 1315 reactjs
data_3gram<-unnest_tokens(course,word,Courses,token = "ngrams",n=3)
head(data_3gram,20)
## Skills.Will.Learn
## 1 Skills you'll gain: Algebra, Analysis, Apache, Big Data, Business Analysis, Computational Logic, Computer Programming, Computer Programming Tools, Correlation And Dependence, Data Analysis, Data Analysis Software, Data Management, Data Mining, Data Visualization, Data Visualization Software, Data Warehousing, Database Administration, Database Application, Databases, Econometrics, Exploratory Data Analysis, Extract, Transform, Load, General Statistics, Machine Learning, Mathematical Theory & Analysis, Mathematics, Microsoft Excel, NoSQL, Operating Systems, Plot (Graphics), Probability & Statistics, Python Programming, Regression, SQL, Spreadsheet Software, Statistical Analysis, Statistical Machine Learning, Statistical Programming, Statistical Visualization, System Programming, Theoretical Computer Science
## 2 Skills you'll gain: Communication, Computer Programming, Data Analysis, Data Management, Data Mining, Database Administration, Database Application, Databases, General Statistics, Machine Learning, Marketing, Probability & Statistics, Python Programming, R Programming, Regression, SPSS, SQL, Statistical Programming
## 3 Skills you'll gain: Communication, Computer Programming, Data Analysis, Data Management, Data Mining, Database Administration, Database Application, Databases, General Statistics, Machine Learning, Marketing, Probability & Statistics, Python Programming, R Programming, Regression, SPSS, SQL, Statistical Programming
## 4 Skills you'll gain: Statistical Programming, Computer Programming, Python Programming
## 5 Skills you'll gain: Statistical Programming, Computer Programming, Python Programming
## 6 Skills you'll gain: Web Design, Html, HTML and CSS, Web Development, CSS
## 7 Skills you'll gain: Web Design, Html, HTML and CSS, Web Development, CSS
## 8 Skills you'll gain: Web Design, Html, HTML and CSS, Web Development, CSS
## 9 Skills you'll gain: Web Design, Html, HTML and CSS, Web Development, CSS
## 10 Skills you'll gain: Web Design, Html, HTML and CSS, Web Development, CSS
## 11 Skills you'll gain: Algebra, Algorithms, Analysis, Business Analysis, Cloud API, Cloud Computing, Communication, Computational Logic, Computer Programming, Computer Programming Tools, Correlation And Dependence, Data Analysis, Data Management, Data Mining, Data Structures, Data Visualization, Database Administration, Database Application, Databases, Econometrics, Exploratory Data Analysis, Extract, Transform, Load, General Statistics, Machine Learning, Machine Learning Algorithms, Marketing, Mathematical Theory & Analysis, Mathematics, Plot (Graphics), Probability & Statistics, Python Programming, R Programming, Regression, SPSS, SQL, Spreadsheet Software, Statistical Analysis, Statistical Machine Learning, Statistical Programming, Statistical Visualization, Theoretical Computer Science
## 12 Skills you'll gain: Data Management, Data Visualization, Data Analysis, Statistical Analysis, NoSQL, Data Warehousing, Big Data, Data Mining, Business Analysis, Extract, Transform, Load, Databases, Apache, Analysis, General Statistics
## 13 Skills you'll gain: Data Management, Data Visualization, Data Analysis, Statistical Analysis, NoSQL, Data Warehousing, Big Data, Data Mining, Business Analysis, Extract, Transform, Load, Databases, Apache, Analysis, General Statistics
## 14 Skills you'll gain: Bayesian Statistics, Theoretical Computer Science, Mathematical Theory & Analysis, Probability, Factorial, Computational Logic, Graph Theory, Probability Distribution, Probability & Statistics, Mathematics, General Statistics, Algebra
## 15 Skills you'll gain: Bayesian Statistics, Theoretical Computer Science, Mathematical Theory & Analysis, Probability, Factorial, Computational Logic, Graph Theory, Probability Distribution, Probability & Statistics, Mathematics, General Statistics, Algebra
## 16 Skills you'll gain: Distributed Computing Architecture, Deep Learning, Computer Architecture, Statistical Machine Learning, Computer Programming, Theoretical Computer Science, Algorithms, Machine Learning Algorithms, Applied Machine Learning, Data Analysis, Computer Vision, Differential Equations, Artificial Neural Networks, Other Programming Languages, Estimation, Calculus, General Statistics, Dimensionality Reduction, Probability Distribution, Probability & Statistics, Linear Algebra, Security Engineering, Network Security, Data Mining, Econometrics, Data Analysis Software, Mathematics, Natural Language Processing, Feature Engineering, Geostatistics, Machine Learning, Support Vector Machine, Computer Networking, Regression
## 17 Skills you'll gain: Data Analysis, Spreadsheet, Spreadsheet Software, Business Analysis, Analysis
## 18 Skills you'll gain: Data Analysis, Spreadsheet, Spreadsheet Software, Business Analysis, Analysis
## 19 Skills you'll gain: Data Analysis, Spreadsheet, Spreadsheet Software, Business Analysis, Analysis
## 20 Skills you'll gain: Data Analysis, Spreadsheet, Spreadsheet Software, Business Analysis, Analysis
## Rating.And.Reviews Course.Duration
## 1 4.6\n\n(48.7k reviews) Beginner · Professional Certificate · 3+ Months
## 2 4.6\n\n(67.6k reviews) Beginner · Specialization · 3+ Months
## 3 4.6\n\n(67.6k reviews) Beginner · Specialization · 3+ Months
## 4 4.2\n\n(260 reviews) Beginner · Course · 1-3 Months
## 5 4.2\n\n(260 reviews) Beginner · Course · 1-3 Months
## 6 4.7\n\n(13.7k reviews) Mixed · Course · 1-3 Months
## 7 4.7\n\n(13.7k reviews) Mixed · Course · 1-3 Months
## 8 4.7\n\n(13.7k reviews) Mixed · Course · 1-3 Months
## 9 4.7\n\n(13.7k reviews) Mixed · Course · 1-3 Months
## 10 4.7\n\n(13.7k reviews) Mixed · Course · 1-3 Months
## 11 4.6\n\n(91.5k reviews) Beginner · Professional Certificate · 3+ Months
## 12 4.8\n\n(6.3k reviews) Beginner · Course · 1-3 Months
## 13 4.8\n\n(6.3k reviews) Beginner · Course · 1-3 Months
## 14 4.5\n\n(10.2k reviews) Beginner · Course · 1-3 Months
## 15 4.5\n\n(10.2k reviews) Beginner · Course · 1-3 Months
## 16 4.9\n\n(170k reviews) Mixed · Course · 3+ Months
## 17 4.3\n\n(385 reviews) Beginner · Rhyme Project · Less Than 2 Hours
## 18 4.3\n\n(385 reviews) Beginner · Rhyme Project · Less Than 2 Hours
## 19 4.3\n\n(385 reviews) Beginner · Rhyme Project · Less Than 2 Hours
## 20 4.3\n\n(385 reviews) Beginner · Rhyme Project · Less Than 2 Hours
## word
## 1 ibm data analyst
## 2 introduction to data
## 3 to data science
## 4 data processing using
## 5 processing using python
## 6 html css and
## 7 css and javascript
## 8 and javascript for
## 9 javascript for web
## 10 for web developers
## 11 ibm data science
## 12 introduction to data
## 13 to data analytics
## 14 data science math
## 15 science math skills
## 16 <NA>
## 17 introduction to business
## 18 to business analysis
## 19 business analysis using
## 20 analysis using spreadsheets
tail(data_3gram,20)
## Skills.Will.Learn
## 831 Skills you'll gain: Data Management, Computer Programming, Theoretical Computer Science, Data Structures, Algorithms, Statistical Programming, Python Programming, Computer Program
## 832 Skills you'll gain: Data Management, Computer Programming, Theoretical Computer Science, Data Structures, Algorithms, Statistical Programming, Python Programming, Computer Program
## 833 Skills you'll gain: Data Management, Computer Programming, Theoretical Computer Science, Data Structures, Algorithms, Statistical Programming, Python Programming, Computer Program
## 834 Skills you'll gain: Theoretical Computer Science, Algorithms
## 835 Skills you'll gain: Theoretical Computer Science, Algorithms
## 836 Skills you'll gain: Theoretical Computer Science, Algorithms
## 837
## 838
## 839
## 840 Skills you'll gain: Data Management, Modeling, Pricing, Advertising, Marketing, Accounting, Research and Design, Theoretical Computer Science, Strategy and Operations, Communication, Finance, Flow Network, Analysis, Data Structures, Decision Tree, Cash Flow, Investment
## 841 Skills you'll gain: Data Management, Modeling, Pricing, Advertising, Marketing, Accounting, Research and Design, Theoretical Computer Science, Strategy and Operations, Communication, Finance, Flow Network, Analysis, Data Structures, Decision Tree, Cash Flow, Investment
## 842 Skills you'll gain: Data Management, Modeling, Pricing, Advertising, Marketing, Accounting, Research and Design, Theoretical Computer Science, Strategy and Operations, Communication, Finance, Flow Network, Analysis, Data Structures, Decision Tree, Cash Flow, Investment
## 843 Skills you'll gain: Data Management, Modeling, Pricing, Advertising, Marketing, Accounting, Research and Design, Theoretical Computer Science, Strategy and Operations, Communication, Finance, Flow Network, Analysis, Data Structures, Decision Tree, Cash Flow, Investment
## 844 Skills you'll gain: Simulation, Modeling, Architecture, Business Process Management, Marketing, Strategy and Operations, Theoretical Computer Science, Manufacturing Process Management, Version Control, Process Analysis, Virtual Reality, Entrepreneurship, Computer-Aided Design, Algorithms, Human Computer Interaction, Communication, Design and Product, Business Analysis, Product Design, Process, Computer Graphics, Operations Research, Internet
## 845 Skills you'll gain: Web, Cloud Computing, Computer Programming, Angular, Web Development, React (web framework), Cloud Applications, Front-End Web Development
## 846 Skills you'll gain: Web, Cloud Computing, Computer Programming, Angular, Web Development, React (web framework), Cloud Applications, Front-End Web Development
## 847 Skills you'll gain: Web, Cloud Computing, Computer Programming, Angular, Web Development, React (web framework), Cloud Applications, Front-End Web Development
## 848 Skills you'll gain: Web, Cloud Computing, Computer Programming, Angular, Web Development, React (web framework), Cloud Applications, Front-End Web Development
## 849 Skills you'll gain: Web, Cloud Computing, Computer Programming, Angular, Web Development, React (web framework), Cloud Applications, Front-End Web Development
## 850 Skills you'll gain: Web, Cloud Computing, Computer Programming, Angular, Web Development, React (web framework), Cloud Applications, Front-End Web Development
## Rating.And.Reviews Course.Duration
## 831 4.4\n\n(32 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## 832 4.4\n\n(32 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## 833 4.4\n\n(32 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## 834 Intermediate · Course · 1-3 Months
## 835 Intermediate · Course · 1-3 Months
## 836 Intermediate · Course · 1-3 Months
## 837 Intermediate · Course · 1-3 Months
## 838 Intermediate · Course · 1-3 Months
## 839 Intermediate · Course · 1-3 Months
## 840 4.6\n\n(75 reviews) Beginner · Course · 1-4 Weeks
## 841 4.6\n\n(75 reviews) Beginner · Course · 1-4 Weeks
## 842 4.6\n\n(75 reviews) Beginner · Course · 1-4 Weeks
## 843 4.6\n\n(75 reviews) Beginner · Course · 1-4 Weeks
## 844 4.7\n\n(430 reviews) Beginner · Course · 1-3 Months
## 845 4.5\n\n(25 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## 846 4.5\n\n(25 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## 847 4.5\n\n(25 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## 848 4.5\n\n(25 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## 849 4.5\n\n(25 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## 850 4.5\n\n(25 reviews) Intermediate · Rhyme Project · Less Than 2 Hours
## word
## 831 solver using recursion
## 832 using recursion in
## 833 recursion in python
## 834 data structures and
## 835 structures and algorithms
## 836 and algorithms ii
## 837 data structures and
## 838 structures and algorithms
## 839 and algorithms iii
## 840 applying investment decision
## 841 investment decision rules
## 842 decision rules for
## 843 rules for startups
## 844 digital thread implementation
## 845 build a twitter
## 846 a twitter clone
## 847 twitter clone front
## 848 clone front end
## 849 front end with
## 850 end with reactjs
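To make the 3-gram output easier to read, the sliding-window idea can be sketched in base R. This is an illustrative approximation, not tidytext's implementation; `title_trigrams` is a hypothetical helper, and the splitting rule (lowercase, split on anything that is not a letter or digit) is a simplifying assumption.

```r
# Return every run of three consecutive words in one title,
# or NA when the title has fewer than three words
title_trigrams <- function(title) {
  words <- strsplit(tolower(title), "[^a-z0-9]+")[[1]]
  words <- words[words != ""]          # drop empty pieces from leading punctuation
  if (length(words) < 3) {
    return(NA_character_)
  }
  starts <- seq_len(length(words) - 2) # one trigram per starting position
  vapply(starts,
         function(i) paste(words[i:(i + 2)], collapse = " "),
         character(1))
}

print(title_trigrams("Introduction to Data Science"))
print(title_trigrams("HTML, CSS, and Javascript for Web Developers"))
print(title_trigrams("Machine Learning"))
```

A title with fewer than three words has no trigram, which would explain the `<NA>` entry at row 16 of the head output. Note also that the n-gram call was applied to `course` rather than `newcourse`, so the other three columns are repeated for every trigram; passing the one-column data frame would presumably leave only the `word` column.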
tokens_stop <- data_tokens %>% filter(!word %in% stop_words$word)
head(tokens_stop,20)
## word
## 1 ibm
## 2 data
## 3 analyst
## 4 introduction
## 5 data
## 6 science
## 7 data
## 8 processing
## 9 python
## 10 html
## 11 css
## 12 javascript
## 13 web
## 14 developers
## 15 ibm
## 16 data
## 17 science
## 18 introduction
## 19 data
## 20 analytics
tail(tokens_stop,20)
## word
## 950 structures
## 951 algorithms
## 952 ii
## 953 data
## 954 structures
## 955 algorithms
## 956 iii
## 957 applying
## 958 investment
## 959 decision
## 960 rules
## 961 startups
## 962 digital
## 963 thread
## 964 implementation
## 965 build
## 966 twitter
## 967 clone
## 968 front
## 969 reactjs
word_freq<-tokens_stop %>% count(word,sort = TRUE)
head(word_freq,10)
## word n
## 1 data 66
## 2 python 26
## 3 science 26
## 4 introduction 23
## 5 learning 18
## 6 machine 18
## 7 analysis 15
## 8 excel 13
## 9 google 10
## 10 processing 9
tail(word_freq,10)
## word n
## 444 wix 1
## 445 word 1
## 446 workbench 1
## 447 workflow 1
## 448 workloads 1
## 449 workplace 1
## 450 workspace 1
## 451 wrangling 1
## 452 writer 1
## 453 xg 1
Tokenization, stopword removal, n-grams, word frequencies, and the word cloud were all produced with R commands. With the unnest_tokens command I split the titles in the Courses attribute into tokens, and then removed stopwords such as of, in, and an by filtering the tokens against the stop_words list.
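The filtering and counting steps can be reproduced on a handful of tokens in base R. This is a minimal sketch under stated assumptions: `tokens` is a toy sample, and `mini_stopwords` is a hypothetical five-word list standing in for tidytext's much longer `stop_words` table.

```r
# Toy tokens, as produced by splitting two course titles into words
tokens <- c("introduction", "to", "data", "science",
            "data", "processing", "using", "python")

# Hypothetical mini stopword list (the real stop_words table is far larger)
mini_stopwords <- c("to", "of", "in", "an", "using")

# Keep only tokens that are not stopwords
kept <- tokens[!tokens %in% mini_stopwords]

# Count each remaining word and sort from most to least frequent
freq <- sort(table(kept), decreasing = TRUE)
word_freq <- data.frame(word = names(freq),
                        n = as.integer(freq),
                        stringsAsFactors = FALSE)
print(word_freq)
```

The resulting table has the same two-column shape (`word`, `n`) as the `word_freq` output above, with data as the most frequent word.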
wordcloud2(data = word_freq, size = 1.2, color = 'random-light', backgroundColor = 'lightgreen')
Finally, wordcloud2 shows the repeated words as a visual representation, in which more frequent words are drawn larger.
From this dataset, I understand that some words recur with the highest frequencies: data, science, python, learning, analysis, fundamentals, processing, and programming. This suggests that Coursera offers more analytics-oriented free courses than other kinds. At the same time, some words have very low frequencies, such as probability, management, thinking, marketing, and cybersecurity. These low-frequency words help identify the topics for which Coursera offers the fewest free courses.