Presentation on Training 2018

Devansh Saxena
31-Jul-2018

Introduction to MOOC and Coursera

MOOC

A massive open online course (MOOC) is an online course designed for large-scale interactive participation and open access via the web. MOOCs aim to deliver education online through features such as videos, study materials, quizzes and online exams, and try to be more efficient than classroom teaching by removing constraints of time and location. MOOCs also provide interactive discussion forums that help build a community of students and professors.

Coursera

Coursera is an online learning platform founded by Stanford professors Andrew Ng and Daphne Koller that offers courses, specializations, and degrees. Open edX is an open-source platform for building MOOCs, with various advanced features to make online education more effective. The standard features provided by Coursera are as follows:

  • Downloadable study materials such as books, notes and cheat sheets.
  • Online tests of different types, such as video-embedded quizzes, practice sessions, midterm exams and final exams.
  • Virtual laboratory with an interactive interface for the user to view the expected simulation.
  • Calendar based schedule.
  • Multi-device and multilingual support. To keep learners of all types as engaged as possible, Coursera supports multiple devices and languages. Courses can be subtitled in 35 different languages, allowing users to learn in the language they prefer. This broadens the user base that can benefit from Coursera courses, which makes sense given that the courses originate from dozens of different countries. Coursera is also accessible from most major mobile devices through native iOS and Android applications. This gives learners much more flexibility in how they engage with a Coursera course, which can be completed on their own terms and within their own schedule.
  • Discussion forums.
  • Wiki edits for implementing collaborative learning.
  • Progress reports and other kinds of embedded analytics.
  • Different kinds of assessment systems for submitted assignments (open-response problems), including:

-> Peer Grading.

-> Self Grading.

-> Staff Grading.

  • Email and notification facilities for registered students.
  • Provision of certification.
  • Registering and deregistering from a course.
  • Contacting authors through mailing.

Data Science Specialization

Data Science is a rapidly accelerating field that combines expertise in the management, analysis and visualization of large-scale and complex data. The Johns Hopkins Bloomberg School of Public Health is expanding its open education offerings in this area with a structured and comprehensive Data Science Specialization offered through Coursera, a leading provider of Massive Open Online Courses (MOOCs).

What makes the specialization unique is that it covers the complete set of skills for data science from soup to nuts.

About the Course

Course 1 - The Data Scientist's Toolbox

This course was an introduction to the tools and ideas that we would see throughout the rest of the Data Science Specialization. The course track focused on providing us with two things:

1) An introduction to the key ideas behind working with data in a scientific way that will produce new and reproducible insight.

2) An introduction to the tools that will allow us to execute on a data analytic strategy, from raw data in a database to a completed report with interactive graphics.

This course was primarily focused on getting set up with the tools and accounts we would need for the rest of the specialization and on giving us a solid grounding in the key conceptual ideas. We installed RStudio and set up a GitHub account. We were also introduced to the course forums, including how to use them to find help.

Course Project

Course 2 - R Programming

This is the second course in the Data Science Specialization, and it focuses on the nuts and bolts of using R as a programming language. The course started with an overview and history of R, followed by data types, objects, vectors, lists, data frames, matrices, factors, handling missing values, reading tables and subsetting.

  • Week 2 covered control structures (if-else, for loops, while loops, repeat, next, break), functions and scoping rules.
  • Week 3 covered the loop functions lapply, mapply, sapply, apply and tapply, along with split.
  • Week 4 focused mainly on practical exercises in swirl and also covered the R profiler and simulation. The course also included programming assignments and swirl practical exercises, and concentrated on brushing up programming skills in R.

Course Project

Course 3 - Getting and Cleaning Data

This course prepared us to collect and clean data for downstream analysis and sharing. One of the major components of a data scientist's job is to collect and clean data. Whether at a small organization or a major enterprise, the first step in using data is getting, cleaning and understanding it. The course focused on R packages and a few outside tools that can be used to collect data from a variety of sources, from Excel files to databases like MySQL. We were also taught about a variety of formats, including JSON, XML, and flat files (.csv, .txt).

The course also covered reading from the web, APIs and various other formats, the dplyr library, and managing and merging different data sets. The emphasis of this course was on creating tidy data sets that can be used in downstream analyses.

Course Project

Course 4 - Exploratory Data Analysis

This is the fourth course in the Data Science Specialization. Exploratory data analysis (EDA) is a key element of data science because it allows you to develop a rough idea of what your data look like and what kinds of questions might be answered by them. EDA is often the “fun part” of data analysis, where you get to play around with the data and, well, explore!

The course emphasized different operations that can be performed on data, such as statistical operations, plotting, using graphics devices, clustering and dimension reduction.

Course Project 1

Course Project 2

Course 5 - Reproducible Research

In this course we learned about the ideas of reproducible research and the reporting of statistical analyses. Topics covered included literate programming tools, evidence-based data analysis, and organizing data analyses. We also learned to write a document using R Markdown, integrate live R code into a literate statistical program, compile R Markdown documents using knitr and related tools, publish reproducible documents to the web (RPubs), and organize a data analysis so that it is reproducible and accessible to others.

Course Project 1

Course Project 2

Course 6 - Statistical Inference

This course presents the fundamentals of statistical inference that you will need throughout the rest of the Data Science track.

This Specialization is focused on providing us with two things:

  • An introduction to the key ideas behind working with data in a scientific way that will produce new and reproducible insight.
  • An introduction to the tools that will allow you to execute on a data analytic strategy, from raw data in a database to a completed report with interactive graphics.

Course Project part 1

Course Project part 2

Course 7 - Regression Models

This course presents the fundamentals of regression modeling that you will need for the rest of the specialization and ultimately for your work in the field of data science. Linear models, as their name implies, relate an outcome to a set of predictors of interest using linear assumptions. Regression models, a subset of linear models, are the most important statistical analysis tool in a data scientist's toolkit. This course covered regression analysis, least squares and inference using regression models. Special cases of the regression model, ANOVA and ANCOVA, were covered as well, and the analysis of residuals and variability was investigated. The course also covered modern thinking on model selection and novel uses of regression models, including scatterplot smoothing. Regression Models is both a fundamental and a foundational component of the series, and it presents the single most practical data analysis toolset using only a bare minimum of mathematics.
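The least-squares idea mentioned above can be illustrated with a small sketch. The course itself used R; the example below is written in Python with only the standard library, and the function names (`fit_line`, `residuals`) and the data values are made up for illustration. It fits a straight line with one predictor by minimizing the sum of squared residuals.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope = sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Observed outcome minus fitted value for each data point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical data, invented for this example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
res = residuals(x, y, intercept, slope)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to zero (up to rounding), which is a quick check when analyzing residuals.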

Course Project

Course 8 - Practical Machine Learning

This course focuses on developing the tools and techniques for understanding, building, and testing prediction functions.

These tools are at the center of the Data Science revolution. Many researchers, companies, and governmental organizations would like to use the cheap and abundant data they are collecting to predict what customers will like, what services to offer, or how to improve people's lives.

One of the most common tasks performed by data scientists and data analysts is prediction and machine learning. This course covered the basic components of building and applying prediction functions, with an emphasis on practical applications. The course provided a basic grounding in concepts such as training and test sets, overfitting, and error rates.

The course also introduced a range of model-based and algorithmic machine learning methods, including regression, classification trees, Naive Bayes, and random forests. The course also covered the complete process of building prediction functions, including data collection, feature creation, algorithms, and evaluation.
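The concepts of training and test sets and error rates can be sketched as follows. This is an illustration in Python using only the standard library, not the R code used in the course; the toy data, the simple threshold classifier and all function names are assumptions made for this example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(rows, threshold):
    """Fraction of rows where the rule 'x >= threshold' disagrees with the label."""
    wrong = sum((x >= threshold) != label for x, label in rows)
    return wrong / len(rows)

def fit_threshold(train):
    """Choose the cut-off on x with the fewest errors on the training set."""
    candidates = sorted(x for x, _ in train)
    return min(candidates, key=lambda t: error_rate(train, t))

# Toy data, invented for this example: the label is True when x >= 50.
data = [(x, x >= 50) for x in range(0, 100, 5)]

train, test = train_test_split(data)
threshold = fit_threshold(train)
train_error = error_rate(train, threshold)
test_error = error_rate(test, threshold)
print(f"threshold={threshold}, train error={train_error}, test error={test_error}")
```

The model is fitted only on the training rows and then scored on rows it has not seen. A test error noticeably higher than the training error is the usual sign of overfitting.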

Course Project

Course 9 - Developing Data Products

A data product is the production output from a statistical analysis. Data products automate complex analysis tasks or use technology to expand the utility of a data-informed model, algorithm or inference. This course covered the basics of creating data products using Shiny, R packages, and interactive graphics. The course focused on the fundamentals of creating a data product that can be used to tell a story about data to a mass audience.

In this class we learned a variety of core tools for creating data products in R and RStudio.

Course Project 1

Course Project 2

Course Project 3

Capstone Data Science Project

About the project

The objective of this Capstone Project is to produce a predictive text algorithm, written in R, that suggests the next 8 most likely words based on a user's text input.

As the user types, the entered characters are compared against a word list. The predicted word is the one with the highest probability of following the previous word or multi-word phrase.
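The idea of choosing the word with the highest probability of following the previous word can be sketched with a simple bigram model. The project itself is written in R; this Python illustration uses only the standard library and is not the project's implementation. The tiny corpus and the function names (`build_bigram_counts`, `predict_next`) are hypothetical, and only the previous single word is used, not a multi-word phrase.

```python
from collections import Counter, defaultdict

def build_bigram_counts(sentences):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, text, k=8):
    """Return up to k (word, probability) pairs for the word after `text`."""
    words = text.lower().split()
    if not words or words[-1] not in counts:
        return []  # nothing typed yet, or the last word was never seen
    followers = counts[words[-1]]
    total = sum(followers.values())
    # Relative frequency estimates P(next word | previous word).
    return [(word, n / total) for word, n in followers.most_common(k)]

# Made-up corpus for illustration only.
corpus = [
    "i love data science",
    "i love r programming",
    "i love data",
    "data science is fun",
]

model = build_bigram_counts(corpus)
suggestions = predict_next(model, "I love")
print(suggestions)
```

A fuller model would also look at longer phrases (trigrams and above) and fall back to shorter ones when a phrase has not been seen in the corpus.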

At the current project stage, the dataset has been downloaded from Coursera and SwiftKey. Some initial exploratory data analysis has been performed, along with some data preparation, in order to proceed with the predictive modeling and the construction of the end-user application.

The next objective is to find the optimal sample size from the dataset required to build a corpus on which to train the prediction algorithm.

Capstone Project

Rpubs Link