Part 1
Intro
Real Life Applications of ML
- Supervised Classification: Whether a first-year college student will graduate
- Supervised Regression: The average rating of a professor for a class
- Unsupervised Clustering: Which songs to queue in a music streaming app
- Unsupervised Dimension Reduction: The five most important attributes for profiling a customer
College Student Classification
Application:
Applying machine learning to predict whether a first-year college
student will graduate allows universities to:
- Identify key features that support successful students
- Support individuals with lower probabilities of success
Supervised or Unsupervised:
Since universities measure and retain information about the binary
outcome (Y/N) of whether a college student graduates, this would be a
Supervised Classification problem.
Target Variable:
The target variable is a binary (Y/N) indicator of whether the student
graduates.
Potential Feature Variables:
- High school GPA
- Household Income
- Declared or Undeclared Degree (Y/N)
- Anticipated Degree
- Geographic Location
Data Collection:
Colleges already keep extensive information about students (and
likely buy it from third party sources). Anticipated barriers lie in
accessing the data/gaining data permissions.
Ethical Concerns:
There are many, especially for public colleges and universities
(as opposed to private institutions or ordinary businesses):
- The model could be used (and likely already is) to optimize which students a college selects, reducing opportunity for low-opportunity students rather than helping them
- Universities could face lawsuits if the admission process appears to discriminate against certain segments
Professor Rating Regression
Application:
Applying machine learning to predict the average rating (out of 5) of
college professors for the classes they teach allows universities to:
- Determine which classes or trends lead to higher or lower average
ratings
- Provide strategic feedback to professors to increase their rating
among students
Supervised or Unsupervised:
Since the model would predict a numerical value that can be checked
against actual results, this falls within Supervised Regression
territory.
Target Variable:
The target variable is the average rating of a professor for a class at
the end of a semester.
Potential Feature Variables:
- Number of students in class
- Class name
- Department
- Online/In Person
- Tenure (Y/N)
- Years of experience
Data Collection:
Again, colleges already maintain information about classes and
professors, including professor ratings by semester.
Ethical Concerns:
The biggest ethical concern is that professors could use this model
to change their teaching style for better ratings (and higher pay)
without considering the value they bring to their students (i.e.,
optimizing for ratings rather than teaching).
Clustering Songs for Queue
Application:
Applying Machine Learning to ascertain clusters of songs to be placed
in the queue of a user’s music streaming app:
- Better queue recommendations/clusters lead to higher customer
satisfaction
Supervised or Unsupervised:
Since this model aims to group songs based on user information, and
since there is no true answer to measure against, this serves as an
Unsupervised Clustering model.
Target Variable:
Unsupervised models do not use target variables
Potential Feature Variables:
- Time of day
- Day of week/month
- Previous songs queued/played by user
- Liked artists
- Liked albums/music types
Data Collection:
Ideally, this would be developed by an employee at one of these
streaming companies. Streaming apps collect extensive user data, which
can be used in the model.
Ethical Concerns:
While users agree to data collection in the terms of use of
streaming apps, many still find direct use of their information
off-putting. This is less an ethical concern than a PR concern for the
brand (think of the famous Target pregnancy model).
Dimension Reduction for Customer Profiling
Application:
I work at an insurance company, where customer information includes
over one hundred variables. This model reduces the number of variables
considered for customer profiling within the business:
- Identify and strip away customer features deemed unimportant
- Simplify customer profiling for future modeling
Supervised or Unsupervised:
With no target variable and no true answer, this would be an Unsupervised
Dimension Reduction model.
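As a rough illustration of what dimension reduction could look like here, the sketch below runs principal component analysis (PCA) with base R's `prcomp`. PCA is only one possible technique and is not named in this write-up; the `customers` table, its column names, and all of its values are invented for the example.

```r
set.seed(42)  # arbitrary seed, used only for the made-up data

# Hypothetical customer table: three numeric features, two of them correlated
n <- 200
age            <- rnorm(n, mean = 45, sd = 12)
years_licensed <- age - 18 + rnorm(n, sd = 2)  # closely tied to age
num_autos      <- rpois(n, lambda = 2)
customers <- data.frame(age, years_licensed, num_autos)

# PCA on centered, scaled features
pca <- prcomp(customers, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 2)
```

Note that PCA builds new components as combinations of the original variables rather than dropping variables outright, so stripping away individual features would need a separate feature-selection step.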
Target Variable:
No target variable
Potential Feature Variables:
Literally hundreds of variables, including…
- Age
- Quoted Bodily Injury Limits
- Prior Insurance
- Number of Automobiles vs. Number of Drivers (also used for fraud
detection)
- Geographic Location
Data Collection:
This assumes working within an insurance company. Data collection
depends on the user’s ability to gain access to the data (or to find
people who can make it happen). This can take months, depending on the
data required.
Ethical Concerns:
Insurance is heavily regulated, and customer profiling often falls in
the “grey” area between legal and not legal.
Part 2
Data Setup & Loading
Setting global chunk options
knitr::opts_chunk$set(
  comment = '', fig.width = 6, fig.height = 6,
  warning = FALSE, error = FALSE, message = FALSE,
  include = TRUE, echo = TRUE, strip.white = TRUE, highlight = TRUE, results = TRUE
)
Loading Libraries
packages <- c('tidyverse','tidymodels','here','kknn')
# attach each package; invisible() hides the list that lapply() returns
invisible(lapply(packages, library, character.only = TRUE))
Lab Questions
1. Is this a supervised or unsupervised learning problem?
Why?
- Since the target variable is contained within the data, and since
the model aims to predict a numerical value, this is a Supervised
Regression model
2. There are 16 variables in this data set. Which variable
is the response variable and which variables are the predictor variables
(aka features)?
- Target Var: cmedv (median value of owner-occupied homes in
USD 1,000s)
- Predictor Vars:
- lon: longitude of census tract
- lat: latitude of census tract
- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000
sq.ft
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0
otherwise)
- nox: nitric oxides concentration (parts per 10 million), i.e.,
air pollution
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centers
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per USD 10,000
- ptratio: pupil-teacher ratio by town
- lstat: percentage of lower status of the population
3. Given the type of variable cmedv is, is this a regression
or classification problem?
- As stated before, this is a Regression problem
4. Fill in the blanks to import the Boston housing data set
(boston.csv). Are there any missing values? What are the minimum and
maximum values of cmedv? What is the average cmedv value?
path <- here('Data','boston.csv')
boston <- readr::read_csv(path)
head(boston)
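The import above only previews the first rows. One way to answer the remaining parts of the question is sketched below. To stay self-contained, the sketch uses a small made-up data frame named `toy` (hypothetical values, not the Boston data); with the real data, the same calls would presumably be applied to `boston`.

```r
# Hypothetical stand-in for `boston`; these values are invented for illustration
toy <- data.frame(
  cmedv = c(24.0, 21.6, 34.7, NA, 36.2),
  rm    = c(6.58, 6.42, 7.19, 7.00, 7.15)
)

# Count missing values in each column
missing_per_column <- colSums(is.na(toy))

# Minimum, maximum, and mean of the response, ignoring missing values
cmedv_range <- range(toy$cmedv, na.rm = TRUE)
cmedv_mean  <- mean(toy$cmedv, na.rm = TRUE)

missing_per_column
cmedv_range
cmedv_mean
```

If `colSums(is.na(boston))` returns all zeros, there are no missing values and the `na.rm = TRUE` arguments have no effect.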
5. Fill in the blanks to split the data into a training set
and test set using a 70-30% split. Be sure to include the set.seed(123)
so that your train and test sets are the same size as
mine.
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
6. How many observations are in the training set and test
set?
There are 352 observations in the “train” data set
and 154 observations in the “test” data set.
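A quick arithmetic check on these counts is sketched below: they should sum to the full data set, and the realized training share should sit near the requested 0.7. With the lab objects, the two counts would come from `nrow(train)` and `nrow(test)`; here they are typed in from the answer above.

```r
# Counts reported in the answer above
n_train <- 352
n_test  <- 154

n_total     <- n_train + n_test   # total observations in the data set
train_share <- n_train / n_total  # realized proportion in the training set

n_total
round(train_share, 3)
```

The share comes out slightly below 0.7. A plausible reason, stated here as an assumption rather than something verified, is that the split is stratified on cmedv, so the 70% is drawn within bins of cmedv rather than from the data set as a whole.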
7. Compare the distribution of cmedv between the training
set and test set. Do they appear to have the same distribution or do
they differ significantly?
train %>%
  mutate(id = 'train') %>%
  bind_rows(test %>% mutate(id = 'test')) %>%
  ggplot(aes(cmedv, color = id)) +
  geom_density()
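The density plot can be complemented with a numeric comparison. The sketch below puts a few quantiles of the two sets side by side. It uses simulated vectors (`toy_train` and `toy_test`, both hypothetical) in place of `train$cmedv` and `test$cmedv`, so the printed numbers say nothing about the actual Boston split.

```r
set.seed(1)  # arbitrary seed, used only for the made-up data

# Hypothetical stand-ins for train$cmedv and test$cmedv
toy_train <- rnorm(350, mean = 22, sd = 9)
toy_test  <- rnorm(150, mean = 22, sd = 9)

# Side-by-side quantiles: similar rows suggest similar distributions
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
quantile_table <- rbind(
  train = quantile(toy_train, probs),
  test  = quantile(toy_test, probs)
)
round(quantile_table, 1)
```

Rows that track each other closely would support the visual impression that the two sets share a distribution; large gaps at particular quantiles would point to where they differ.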

8. Fill in the blanks to fit a linear regression model using
the rm feature variable to predict cmedv and compute the RMSE on the
test data. What is the test set RMSE?
# fit model
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# compute the RMSE on the test data
prd1 <- lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
prd1
The test set RMSE is about $6,830 (roughly 6.83 in the data's units of
USD 1,000s) when using rm (average number of rooms) to estimate cmedv.
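To make the metric concrete, the sketch below computes RMSE by hand as the square root of the mean squared difference between observed and predicted values. It uses base R's `lm`, which is the default engine behind `linear_reg()`, and two tiny made-up data frames (`toy_train` and `toy_test`, both hypothetical) rather than the lab's split.

```r
# Hypothetical toy data; with the lab data these would be train and test
toy_train <- data.frame(rm = c(5, 6, 7, 8), cmedv = c(15, 20, 25, 30))
toy_test  <- data.frame(rm = c(5.5, 6.5), cmedv = c(18.5, 21.5))

# Simple linear regression of cmedv on rm
toy_fit <- lm(cmedv ~ rm, data = toy_train)

# RMSE = sqrt(mean((observed - predicted)^2))
pred      <- predict(toy_fit, newdata = toy_test)
rmse_hand <- sqrt(mean((toy_test$cmedv - pred)^2))
rmse_hand
```

Because RMSE is expressed in the same units as the response, a value computed on cmedv is read in USD 1,000s.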
