Programs used

- JupyterLab: used to store R and Python code
- R
- Python
- Azure Machine Learning
- Git

Materials covered

- Data exploration, visualization, and machine learning
- Predictive analytics, classification, and decision trees
- Evaluation of classification models

Definitions

- Signals: data science-based
- Features: engineering-based
- Predictors: statistically-based
- All three: a pattern, a repeatable process that we hope to capture and describe

## Data exploration, visualization, and feature engineering

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## Boxplot
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length of species",
        xlab = "Species",
        ylab = "Sepal Length")

## Pie chart
pie(table(iris$Species))

## Scatterplot
plot(Sepal.Width ~ Sepal.Length, data = iris,
     xlab = "Sepal Length",
     ylab = "Sepal Width")

### Scatterplot with smoother
library(ggplot2)  ## loaded here for this and all ggplot-based plots below
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Dispelling common myths

- There is no single ML algorithm that will take raw data and give you the best model.
- You do not need to know a lot of machine learning algorithms to build robust predictive models.
- Not spending time understanding your data is a source of many problems; a quick structural check is sketched below.
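
A minimal sketch of that last point, using base R to inspect a dataset's structure and summary statistics before any modeling:

str(iris)      ## column types and dimensions
summary(iris)  ## ranges, quartiles, and the class balance of Species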

head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
library(wesanderson)
pal <- wes_palette("Zissou1", 100, type = "continuous")  ## defined before its first use in the hex plot below

## Density plot
ggplot(diamonds) + geom_density(aes(x=carat), fill = "indianred4")

## Scatterplot
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Storing ggplot object
g <- ggplot(diamonds, aes(x = carat, y = price))
g + geom_point(aes(color = clarity)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Looking for correlation
ggplot(iris, aes(Sepal.Width, Sepal.Length)) + geom_point(aes(color=Species)) + facet_wrap(~Species)
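
To quantify the relationship the facets suggest, a minimal sketch (assumes the dplyr package is installed) computes the within-species correlation:

library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(r = cor(Sepal.Width, Sepal.Length))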

## Separating segments
g + geom_point(aes(color = color)) + facet_wrap(~color)

## Further segmentation: facet_wrap wraps the cut-by-clarity panels
g + geom_point(aes(color = color)) + facet_wrap(cut ~ clarity)

## The same segmentation laid out as a grid with facet_grid
g + geom_point(aes(color = color)) + facet_grid(cut ~ clarity)


## Continuous bivariate distribution
ggplot(diamonds, aes(x = carat, y = price)) + geom_hex() +
  scale_fill_gradientn(colours = pal) +
  labs(title = "Relationship between carat and price") +
  guides(fill = guide_legend(title = "Count"))  ## hex fill encodes the count of points per bin

### Coronavirus dataset: Johns Hopkins real-time data

covid <- read.csv("C:/Users/Monte Richardson/Desktop/Data Science Dojo/covid_19_data.csv")

covid2 <- covid %>%
  select(Country.Region, Deaths) %>%
  filter(Country.Region %in% c('Mainland China', 'US', 'Japan', 'Italy'))

### Deaths per country as of 3/9/20
ggplot(covid2, aes(Country.Region, Deaths)) +
  geom_col(fill = "blue") +  ## fill set outside aes() so it colors the bars instead of creating a dummy legend
  labs(x = "Countries",
       y = "Deaths",
       title = "Coronavirus deaths by country")

Predictive analytics overview

- Assumes that the historical data sampled will be the same as, or similar to, future data sampled (a minimal train/test sketch follows this list).
- Adding features changes the variance in the representativeness of the sample.
- More rows are good.
- If there are too many columns, there need to be even more rows to account for the variance and granularity of the data.
- We must assume that the data distribution is likely to change over time.
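
A minimal sketch of the first assumption in base R (the 70/30 split ratio is an assumption): hold out part of the historical data as a stand-in for future data.

set.seed(42)                                      ## reproducible sampling
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]                        ## "historical" sample used to fit a model
test  <- iris[-train_idx, ]                       ## stand-in for "future" data
## If the future distribution drifts away from the training sample, performance degrades.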

Types of machine learning models

- Adversarial: searches for signals that indicate aberration from the allowable features, e.g. spam detection or cyber security. A toy classifier in this spirit is sketched below.
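
A toy sketch in that spirit (hypothetical made-up data; assumes the e1071 package is installed), training a naive Bayes spam filter on two binary signals:

library(e1071)
## Hypothetical toy features: does the message contain a link, and is it written in all caps
toy <- data.frame(has_link = factor(c(1, 1, 0, 0, 1, 0)),
                  all_caps = factor(c(1, 0, 0, 0, 1, 0)),
                  label    = factor(c("spam", "spam", "ham", "ham", "spam", "ham")))
fit <- naiveBayes(label ~ has_link + all_caps, data = toy)
## Classify a new message that has a link and is all caps
predict(fit, data.frame(has_link = factor(1, levels = c(0, 1)),
                        all_caps = factor(1, levels = c(0, 1))))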

Machine learning problems

- High variability: sampling from a highly transient population. When this is the case, it is worth trying to build your own, new model.
- Low variability: recognition of distinct handwritten characters or letters; facial recognition. When this is the case, don't reinvent the wheel; use ready-made machine learning models that already exist.
- Typical variability: the natural progression of business.