- JupyterLab: used to store R and Python code
- R
- Python
- Azure Machine Learning
- Git
- Data exploration, visualization, and machine learning
- Predictive analytics, classification, and decision trees
- Evaluation of classification models
- Signals: data science-based
- Features: engineering-based
- Predictors: statistically-based
- All three: a pattern, a repeatable process that we hope to capture and describe
## Data exploration, visualization, and feature engineering
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## Boxplot
boxplot(Sepal.Length ~ Species, data = iris,
main = "Sepal length of species",
xlab = "Species",
ylab = "Sepal Length")
## Pie chart
pie(table(iris$Species))
## Scatterplot
plot(Sepal.Width ~ Sepal.Length, data = iris,
     xlab = "Sepal Length",
     ylab = "Sepal Width")
### Scatterplot with smoother
library(ggplot2)
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There is no single ML algorithm that will take raw data and give you the best model. You do not need to know many machine learning algorithms to build robust predictive models. Not spending time understanding your data, however, is a source of many problems.
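As a minimal sketch of that first look at the data, a few base R calls show a dataset's structure, summary statistics, and missing values before any modelling; iris is used here only as an example.

## Quick first look at a dataset before modelling
str(iris)              # column types and a preview of values
summary(iris)          # per-column summary statistics
colSums(is.na(iris))   # missing values per column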
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
library(wesanderson)
## Density plot
ggplot(diamonds) + geom_density(aes(x=carat), fill = "indianred4")
## Scatterplot
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
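The two messages above show that geom_smooth() picks its method from the data size: 'loess' for the 150-row iris data and a GAM for the roughly 54,000-row diamonds data. The smoother can also be set explicitly; a minimal sketch (the GAM formula requires the mgcv package, which ships with R):

## Setting the smoother explicitly instead of relying on the default
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))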
## Storing ggplot object
g <- ggplot(diamonds, aes(x = carat, y = price))
g + geom_point(aes(color = clarity)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Looking for correlation
ggplot(iris, aes(Sepal.Width, Sepal.Length)) + geom_point(aes(color=Species)) + facet_wrap(~Species)
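A quick numeric companion to the facetted plot, as a sketch in base R, is the correlation between the two measurements within each species; splitting by species matters because the pooled correlation over all 150 rows can point in a different direction than the within-species ones.

## Correlation of sepal length and width within each species
by(iris, iris$Species, function(d) cor(d$Sepal.Length, d$Sepal.Width))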
## Separating segments
g + geom_point(aes(color = color)) + facet_wrap(~color)
## Further segmentation
g + geom_point(aes(color = color)) + facet_wrap(cut ~ clarity)
## Separating out segmentations
g + geom_point(aes(color = color)) + facet_grid(cut ~ clarity)
## Continuous bivariate distribution
## geom_hex() requires the hexbin package to be installed
pal <- wes_palette("Zissou1", 100, type = "continuous")
ggplot(diamonds, aes(x = carat, y = price)) + geom_hex() +
  scale_fill_gradientn(colours = pal) +
  labs(title = "Relationship between carat and price") +
  guides(fill = guide_legend(title = "Price per carat"))
### Coronavirus dataset: Johns Hopkins real-time data
library(dplyr)
covid <- read.csv("C:/Users/Monte Richardson/Desktop/Data Science Dojo/covid_19_data.csv")
covid2 <- covid %>%
  select(Country.Region, Deaths) %>%
  filter(Country.Region %in% c('Mainland China', 'US', 'Japan', 'Italy'))
### Deaths per country as of 3/9/20
ggplot(covid2, aes(Country.Region, Deaths)) +
  geom_col(fill = "blue") +
  labs(x = "Countries",
       y = "Deaths",
       title = "Coronavirus deaths by country")
- Machine learning assumes that historical data sampled will be the same as, or similar to, future data sampled.
- Adding features changes the variance in how representative the sample is.
- More rows are good; if there are too many columns, there need to be even more rows to account for the variance and granularity of the data (see the sketch below).
- We must assume that the data distribution is likely to change over time.
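To make the rows-versus-columns point concrete, here is a small illustrative simulation: with the number of rows held fixed, every added column spreads the sample thinner, which shows up as a growing distance from each point to its nearest neighbour.

## Illustrative only: fixed rows, more columns -> sparser sample
set.seed(42)
n <- 500
sapply(c(2, 5, 10, 50), function(p) {
  x <- matrix(runif(n * p), nrow = n)   # n rows, p features
  d <- as.matrix(dist(x))
  diag(d) <- Inf
  mean(apply(d, 1, min))                # mean nearest-neighbour distance
})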
Adversarial: searches for specific signals that indicate an aberration from the allowable features. Examples: spam detectors, cyber security.
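As a toy sketch of that search for signals (the messages and keyword list below are hypothetical, chosen only for illustration), a crude filter counts how many suspicious signals each message trips:

## Toy signal-based filter, illustration only
messages   <- c("Meeting moved to 3pm",
                "WIN a FREE prize now!!!",
                "Quarterly report attached")
suspicious <- c("free", "win", "prize", "click here")
signals <- sapply(tolower(messages),
                  function(m) sum(sapply(suspicious, grepl, x = m)))
signals > 1   # flag messages that trip more than one signal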
- High variability: sampling from a highly transient population. When this is the case, it is worth trying to build your own, new model.
- Low variability: recognition of distinct handwritten characters or letters, or facial recognition. When this is the case, don't reinvent the wheel; use ready-made machine learning models that already exist.
- Typical variability: the natural progression of a business.