Data Science Overview - I

Varun Juyal
Wed Jun 29 20:47:44 2016

Data Science -- Why all the excitement?...

  • Internet Search: Deliver the most relevant results for a search query
  • Digital Advertisements: Ads based on user's past behavior
  • Recommender Systems: Jobs and similar products on B2C sites
  • Image Recognition: Automatic tag suggestion feature on Facebook
  • Speech Recognition: Cortana, Google Voice and Siri

Data Science -- Why all the excitement?

  • Gaming: Games designed using ML algorithms that adapt as the player advances
  • Price Comparison Websites
  • Airlines: Identify areas for improvement
  • Fraud and Risk Detection: Use past behavior and other variables to estimate the probability of risk and default

Where does data come from?

  • It's All Happening Online
  • User Generated (Web & Mobile)
  • Internet of Things/M2M
  • Health/Scientific Computing

"Data is the New Oil" - WEF 2011

5 Vs of Big Data

  • Raw Data: Volume
  • Change over time: Velocity
  • Data types: Variety
  • Data Quality: Veracity
  • Information for Decision Making: Value

What exactly is Data Science?

Data Science draws on computer science, statistics, machine learning, visualization and human-computer interaction to collect, clean, integrate, analyze, visualize and interact with data in order to create data products.

Data Science - A Visual Definition

Steps involved

  • Decide on the objective
  • Getting and Cleaning Data
  • Exploratory Data Analysis
  • Data Modeling - Statistics, ML, Mathematics and Business Insights
  • Optimize and repeat

Getting and Cleaning data - Using R

Get data fit for analysis

Data are values of qualitative (country of origin, treatment) or quantitative (height, blood pressure, weight) variables, belonging to a set of items.
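
As a rough illustration of the distinction, the sketch below builds a small data frame in which each row is one item and each column is one variable. The `patients` data frame, its column names and its values are all made up for this example; storing qualitative variables as factors is one common convention, not the only one.

```r
# Hypothetical toy data: each row is one item (here, a patient).
patients <- data.frame(
  country   = factor(c("India", "Brazil", "India", "Kenya")),  # qualitative
  treatment = factor(c("A", "B", "A", "B")),                   # qualitative
  height_cm = c(172.5, 160.0, 181.2, 168.4),                   # quantitative
  weight_kg = c(70.1, 55.3, 82.0, 64.7)                        # quantitative
)

# Flag which columns are qualitative (factor) and which are quantitative (numeric).
is_qualitative  <- sapply(patients, is.factor)
is_quantitative <- sapply(patients, is.numeric)

names(patients)[is_qualitative]   # the qualitative variables
names(patients)[is_quantitative]  # the quantitative variables
levels(patients$country)          # the distinct values a qualitative variable takes
```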

Typical flow

Raw Data -> Processing -> Tidy Data -> Data Analysis -> Data Communication

Raw Versus Processed data

Raw Data

  • Original Source of the data
  • Often hard to use for data analysis
  • Data Analysis includes processing
  • May need to be processed only once

Processed Data

  • Ready for analysis
  • Processing can include merging, subsetting, transforming, etc.
  • There may be standards for processing
  • All steps should be recorded
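
The three processing operations named above can be sketched in base R as follows. The `visits` and `sites` tables, their columns and the BMI transformation are hypothetical, chosen only to show one merge, one subset and one transform. Keeping these lines in a script is also one simple way of recording every step.

```r
# Hypothetical raw tables that share a key column, patient_id.
visits <- data.frame(
  patient_id = c(1, 2, 3, 4),
  weight_kg  = c(70.1, 55.3, NA, 64.7),
  height_cm  = c(172.5, 160.0, 181.2, 168.4)
)
sites <- data.frame(
  patient_id = c(1, 2, 3, 4),
  country    = c("India", "Brazil", "India", "Kenya")
)

# Merging: combine the two tables on the shared key.
merged <- merge(visits, sites, by = "patient_id")

# Subsetting: keep only the rows with a recorded weight.
complete <- merged[!is.na(merged$weight_kg), ]

# Transforming: derive body mass index from weight (kg) and height (cm).
complete$bmi <- complete$weight_kg / (complete$height_cm / 100)^2

complete
```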

The four things you should have

  • The raw data
  • A tidy data set
  • A code book describing each variable and its values in the tidy data set
  • An explicit and exact recipe you used to go from the raw data (item 1) to the tidy data set and code book (items 2 and 3)
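
A code book can start as a small table with one row per variable. The sketch below is one possible layout, not a standard one: the `tidy` data set, the column names of `code_book` and the hand-written descriptions are all assumptions made for illustration.

```r
# Hypothetical tidy data set: one row per item, one column per variable.
tidy <- data.frame(
  country   = c("India", "Brazil", "India"),
  height_cm = c(172.5, 160.0, 181.2),
  stringsAsFactors = FALSE
)

# One row per variable: its name, type, meaning and observed values.
code_book <- data.frame(
  variable    = names(tidy),
  type        = vapply(tidy, function(col) class(col)[1], character(1)),
  description = c("Country of origin", "Height in centimetres"),  # written by hand
  values      = vapply(tidy, function(col) {
    if (is.numeric(col)) {
      paste(range(col), collapse = " to ")          # numeric: observed range
    } else {
      paste(sort(unique(col)), collapse = ", ")     # text: distinct values
    }
  }, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

code_book
```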

Example: Getting and Cleaning Data

Exploratory Data Analysis - Using R

Identify key features and gain insights into your data

  • To understand data properties
  • To find patterns in data
  • To suggest modeling strategies
  • To “debug” analyses

Tools

  • Graphs
  • R functions

Exploratory Graphs Types

  • One dimensional: Barplot, Histogram, Boxplot, Density Plot
  • Two dimensional: Multiple/overlaid 1-D plots (lattice/ggplot2), Scatterplots
  • Multi dimensional
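
The graph types listed above can be tried out with the base plotting system. This is a minimal sketch that uses `mtcars`, a data set that ships with R. The choice of data set and variables is an assumption for illustration and is not taken from the original slides.

```r
# One-dimensional plots: look at a single variable.
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, main = "Cars by cylinder count",
        xlab = "Cylinders", ylab = "Number of cars")
h <- hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(mtcars$mpg, main = "Spread of mpg", ylab = "Miles per gallon")
plot(density(mtcars$mpg), main = "Density of mpg")

# Two-dimensional plots: relate two variables.
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by cylinder count",
        xlab = "Cylinders", ylab = "Miles per gallon")   # several 1-D plots side by side
plot(mtcars$wt, mtcars$mpg, main = "mpg versus weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")  # scatterplot

# Multi-dimensional: a scatterplot matrix of three variables.
pairs(mtcars[, c("mpg", "wt", "hp")], main = "mpg, weight and horsepower")
```

When the script is run non-interactively, R writes the plots to a file instead of opening a window.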

Plotting Systems

  • Base Plotting
  • Lattice Plotting
  • ggplot2 Plotting

Example: Exploratory Graphs

Exploratory Functions

  • str
  • summary
  • head, dim
  • Mean: The sum of all the values divided by the number of values
  • Median: The “middle” value when the values are listed in numerical order
  • Mode: The value that occurs most often
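
A short sketch of these functions in use, assuming the built-in `mtcars` data set and a made-up vector `x`. Note that base R's `mode()` reports how an object is stored, not the most frequent value, so the helper `stat_mode()` below is a hypothetical function written for this example.

```r
# Structure, summary statistics, first rows and dimensions of a data frame.
str(mtcars)
summary(mtcars$mpg)
head(mtcars, 3)
dim(mtcars)          # 32 rows, 11 columns

# A made-up vector for the three measures of central tendency.
x <- c(2, 4, 4, 5, 7, 9, 4)

mean(x)              # 5, the same as sum(x) / length(x)
median(x)            # 4, the middle of 2 4 4 4 5 7 9

# Hypothetical helper: the most frequent value(s); ties return every mode.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)         # 4
```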

Examples and Applications