Part 1

Intro

Real Life Applications of ML

  1. Supervised Classification: Whether a 1st year college student will graduate
  2. Supervised Regression: The avg. rating of a professor for a class
  3. Unsupervised Clustering: What songs to queue in a music streaming app
  4. Unsupervised Dimension Reduction: 5 Most important attributes to profile a customer

College Student Classification

Application:

Applying machine learning to predict whether a first-year college student will graduate allows universities to:

  • Identify key features that support successful students
  • Support individuals with lower probabilities of success

Supervised or Unsupervised:

Since universities measure and retain information about the binary outcome (Y/N) of whether a college student graduates, this would be a Supervised Classification problem.

Target Variable:

The target variable is a binary Y/N indicating whether the student graduates.

Potential Feature Variables:
  • High school GPA
  • Household Income
  • Declared or Undeclared Degree (Y/N)
  • Anticipated Degree
  • Geographic Location
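
As a minimal sketch of how such a classifier might be set up, the block below fits a logistic regression (one common baseline for a binary target) to simulated data. Everything in it is an assumption made for illustration: the column names (hs_gpa, income, declared, graduated), the sample size, and the made-up relationship used to generate the outcome. None of it comes from real student records.

```r
# Simulated data only: column names and effect sizes are hypothetical.
set.seed(123)
n <- 500
students <- data.frame(
  hs_gpa   = round(runif(n, 2.0, 4.0), 2),                 # high school GPA
  income   = round(rlnorm(n, meanlog = 11, sdlog = 0.5)),  # household income (USD)
  declared = rbinom(n, 1, 0.6)                             # 1 = declared a degree
)

# Made-up relationship used to generate the binary Y/N outcome
log_odds <- -6 + 1.8 * students$hs_gpa + 0.5 * students$declared
students$graduated <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: models the probability that graduated == 1
fit <- glm(graduated ~ hs_gpa + log(income) + declared,
           data = students, family = binomial())

# Predicted graduation probability for each student
students$p_grad <- predict(fit, type = "response")

# Students with the lowest predicted probabilities come first
head(students[order(students$p_grad), ])
```

The sorted output is the kind of list a university could use to direct support toward students with lower predicted probabilities of success.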

Data Collection:

Colleges already keep extensive information about students (and likely buy more from third-party sources). The anticipated barriers lie in accessing the data and obtaining data permissions.

Ethical Concerns:

There are many, especially for public colleges and universities (as opposed to private institutions or ordinary businesses):

  • The model could be used (and likely is used) to optimize which students a college selects, reducing opportunity for already disadvantaged students rather than helping them
  • Lawsuits could follow if the admission process appears to discriminate against certain segments

Professor Rating Regression

Application:

Applying machine learning to predict the average rating (out of 5) of college professors for the classes they teach allows colleges to:

  • Identify classes or trends that lead to higher or lower avg. ratings
  • Provide strategic feedback to professors to increase their rating among students

Supervised or Unsupervised:

Since the model would predict a numerical value that can be checked against actual results, this falls within Supervised Regression territory.

Target Variable:

The target variable is the average rating of a professor at the end of a semester for a class.

Potential Feature Variables:
  • Number of students in class
  • Class name
  • Department
  • Online/In Person
  • Tenure (Y/N)
  • Years of experience

Data Collection:

Again, colleges already maintain information about classes and professors, including professor ratings by semester.

Ethical Concerns:

The biggest ethical concern lies in professors using this model to change their teaching style for better ratings (and higher pay) without considering the value they bring to their students (i.e., optimizing for ratings rather than for teaching).

Clustering Songs for Queue

Application:

Applying Machine Learning to ascertain clusters of songs to be placed in the queue of a user’s music streaming app:

  • Better queue recommendations/clusters lead to higher customer satisfaction

Supervised or Unsupervised:

Since this model aims to group songs based on user information, and since there is no true answer to measure against, this is an Unsupervised Clustering model.

Target Variable:

Unsupervised models do not use target variables.

Potential Feature Variables:
  • Time of day
  • Day of week/month
  • Previous songs queued/played by user
  • Liked artists
  • Liked albums/music types
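
A rough sketch of the clustering idea, using k-means from base R on simulated listening data. The column names (hour_of_day, is_weekend, pct_liked_artist), the choice of three clusters, and the data itself are all assumptions for illustration. A real streaming app would have far richer features and would need to choose the number of clusters deliberately.

```r
# Simulated listening events: every column and value here is hypothetical.
set.seed(123)
n <- 300
plays <- data.frame(
  hour_of_day      = sample(0:23, n, replace = TRUE),  # time of day
  is_weekend       = rbinom(n, 1, 2 / 7),              # day-of-week indicator
  pct_liked_artist = runif(n)                          # share of plays from liked artists
)

# Put features on a common scale so no single column dominates the distances
scaled <- scale(plays)

# k-means with 3 clusters (an arbitrary choice); nstart retries random starts
km <- kmeans(scaled, centers = 3, nstart = 20)
plays$cluster <- km$cluster

# Cluster sizes and per-cluster feature averages
table(plays$cluster)
aggregate(. ~ cluster, data = plays, FUN = mean)
```

There is no target column anywhere in this sketch, which is what makes it unsupervised: the clusters are judged by how useful they are, not by a known correct answer.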

Data Collection:

Ideally, this model would be developed by someone working within a streaming company. Streaming apps collect extensive user data, which the model can leverage.

Ethical Concerns:

While users agree to data collection within the terms of use of streaming apps, they still find direct use of their information off-putting. This is less an ethical concern than a PR concern for the brand (think of the famous Target pregnancy model).


Dimension Reduction for Customer Profiling

Application:

I work at an insurance company, where customer information includes over one hundred variables. This model would reduce the number of variables considered for customer profiling within the business, in order to:

  • Identify and strip away customer features deemed unimportant
  • Simplify customer profiling for future modeling

Supervised or Unsupervised:

With no target variable and no true answer, this would be an Unsupervised Dimension Reduction model.

Target Variable:

No target variable

Potential Feature Variables:

Literally hundreds of variables, including…

  • Age
  • Quoted Bodily Injury Limits
  • Prior Insurance
  • Number of Automobiles vs. # of Drivers (also used for fraud detection)
  • Geographic Location
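
One possible way to approach the reduction is principal component analysis (PCA), sketched below with base R's prcomp() on simulated data. PCA is swapped in here as an example technique, not necessarily the method that would be used in practice. The column names, the hidden "latent" factor, and the 80% variance cutoff are all assumptions for illustration.

```r
# Simulated customers: all columns and relationships are hypothetical.
set.seed(123)
n <- 400
latent <- rnorm(n)  # a hidden factor that drives several correlated columns
customers <- data.frame(
  age             = 45 + 12 * latent + rnorm(n, sd = 5),
  n_autos         = 2 + 0.5 * latent + rnorm(n, sd = 0.5),
  n_drivers       = 2 + 0.5 * latent + rnorm(n, sd = 0.5),
  bi_limit        = 100 + 40 * latent + rnorm(n, sd = 20),
  years_prior_ins = 5 + rnorm(n, sd = 2)
)

# PCA on centered, standardized columns
pca <- prcomp(customers, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 3)

# Keep the fewest components that explain at least 80% of the variance
k <- which(cumsum(var_explained) >= 0.80)[1]
reduced <- pca$x[, seq_len(k), drop = FALSE]
dim(reduced)
```

Note that PCA produces new combined variables rather than dropping original ones, so it simplifies profiling without directly answering which raw features are unimportant.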

Data Collection:

This assumes the analyst works within an insurance company. Data collection then depends on the analyst’s ability to gain access to the data (or to find people who can grant it). This can take months, depending on the data required.

Ethical Concerns:

Insurance is heavily regulated, and customer profiling often falls in the “grey” area between legal and not legal.


Part 2

Data Setup & Loading

Setting global chunk options

knitr::opts_chunk$set(
  comment = '', fig.width = 6, fig.height = 6,
  warning = FALSE, error = FALSE, message = FALSE,
  include = TRUE, echo = TRUE, strip.white = TRUE, highlight = TRUE, results = TRUE
)

Loading Libraries

# attach every package needed for the lab; invisible() hides lapply's printed list
packages <- c('tidyverse','tidymodels','here','kknn')
invisible(lapply(packages, library, character.only = TRUE))

Lab Questions

1. Is this a supervised or unsupervised learning problem? Why?
  • Since the target variable is contained within the data, and since the model aims to predict a numerical value, this is a Supervised Regression model.

2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)?
  • Target Var: cmedv (Median value of owner-occupied homes in USD 1,000’s)
  • Predictor Vars:
    • lon: longitude of census tract
    • lat: latitude of census tract
    • crim: per capita crime rate by town
    • zn: proportion of residential land zoned for lots over 25,000 sq.ft
    • indus: proportion of non-retail business acres per town
    • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    • nox: nitric oxides concentration (parts per 10 million), i.e., air pollution
    • rm: average number of rooms per dwelling
    • age: proportion of owner-occupied units built prior to 1940
    • dis: weighted distances to five Boston employment centers
    • rad: index of accessibility to radial highways
    • tax: full-value property-tax rate per USD 10,000
    • ptratio: pupil-teacher ratio by town
    • lstat: percentage of lower status of the population


3. Given the type of variable cmedv is, is this a regression or classification problem?
  • As stated before, this is a Regression problem


4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What is the minimum and maximum values of cmedv? What is the average cmedv value?
path <- here('Data','boston.csv')
boston <- readr::read_csv(path)
head(boston)
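
The questions about missing values and the range of cmedv could be checked along the following lines. This is a sketch on a tiny stand-in data frame (called toy here, with made-up values), since head() alone does not answer them. With the real data, toy would be replaced by boston.

```r
# Stand-in data: three made-up rows, NOT values from boston.csv
toy <- data.frame(
  cmedv = c(21.5, 34.0, 18.2),
  rm    = c(6.1, 7.0, NA)
)

# Missing values per column, then in total
colSums(is.na(toy))
sum(is.na(toy))

# Minimum, maximum, and mean of the response variable
c(min  = min(toy$cmedv),
  max  = max(toy$cmedv),
  mean = mean(toy$cmedv))
```

summary(toy$cmedv) would report the same minimum, maximum, and mean along with the quartiles.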



5. Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
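
The size of each set can be read off with nrow(). The sketch below repeats the split on a simulated stand-in (toy, an assumption for illustration) with the same 506 rows as the lab data, so the exact counts it prints need not match the lab's.

```r
library(rsample)  # part of tidymodels; provides initial_split()

# Simulated stand-in with 506 rows; the values are made up
set.seed(123)
toy <- data.frame(
  cmedv = rlnorm(506, meanlog = 3, sdlog = 0.4),
  rm    = rnorm(506, mean = 6.3, sd = 0.7)
)

# Same call pattern as the lab: 70/30 split, stratified on the response
split <- initial_split(toy, prop = 0.7, strata = cmedv)
train <- training(split)
test  <- testing(split)

c(train = nrow(train), test = nrow(test))  # observations in each set
nrow(train) + nrow(test) == nrow(toy)      # every row lands in exactly one set
nrow(train) / nrow(toy)                    # should be close to 0.7
```

Stratifying on cmedv splits within bins of the response, which is why the training share can land slightly off an exact 70%.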



6. How many observations are in the training set and test set?

There are 352 observations in the “train” data set and 154 observations in the “test” data set.

7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly?
train %>% 
  mutate(id = 'train') %>% 
  bind_rows(test %>% mutate(id = 'test')) %>%
  ggplot(aes(cmedv, color = id)) +
  geom_density()



8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?
# fit model
lm1 <- linear_reg() %>%
  fit( cmedv ~ rm, data = train)

# compute the RMSE on the test data
prd1 <- lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

prd1

The test set RMSE is roughly $6,830 (about 6.83 in cmedv’s units of USD 1,000) when using rm (average number of rooms) to estimate cmedv.
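
For reference, RMSE is the square root of the mean squared difference between the actual and predicted values, which is the quantity yardstick’s rmse() reports. The sketch below computes it by hand on made-up numbers. The helper name rmse_manual and the example vectors are assumptions for illustration.

```r
# RMSE: square root of the mean squared prediction error
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Made-up values in the same units as cmedv (USD 1,000s)
truth    <- c(20, 25, 30)
estimate <- c(23, 21, 30)

rmse_manual(truth, estimate)         # error in USD 1,000s
rmse_manual(truth, estimate) * 1000  # same error expressed in USD
```

Because cmedv is recorded in thousands of dollars, an RMSE on that scale has to be multiplied by 1,000 to be read as a dollar amount.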



9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?
# fit model
# every column except the response
features <- setdiff(names(boston), 'cmedv')

lm2 <- linear_reg() %>%
  fit( cmedv ~ ., data = train[,c("cmedv",features)])

# compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

The test set RMSE is $4,829 when using all features to estimate cmedv. This is better than the previous model: the typical prediction error is roughly $2,000 lower than with rm alone.

10. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?
# fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit( cmedv ~ ., data = train[,c("cmedv",features)])

# compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

Using K-Nearest-Neighbor, the test set RMSE is $3,357 (with all features included). This is the best of the three models: the kknn model reduces RMSE by about $1,500 compared to linear regression using all features, and by about $3,500 compared to linear regression using only the rm feature.
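
The comparison can be reproduced from the three reported RMSE values. In the sketch below the numbers are copied from the answers above, in USD 1,000s. The rm-only figure is taken as 6.830, which is an assumed reading of the value as printed.

```r
# Reported test-set RMSE values, in USD 1,000s
# (6.830 for the rm-only model is an assumed reading of the printed figure)
rmse_tbl <- c(lm_rm = 6.830, lm_all = 4.829, knn = 3.357)

# Reduction in RMSE relative to the rm-only baseline, in USD
round((rmse_tbl["lm_rm"] - rmse_tbl) * 1000)

# Reduction in RMSE of KNN relative to the all-features linear model, in USD
round((rmse_tbl["lm_all"] - rmse_tbl["knn"]) * 1000)
```

Under that reading, the gaps come to about $2,001 and $3,473 against the baseline and $1,472 between the last two models, consistent with the rounded figures quoted in the answers.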

