---
title: "Module 8 Lab"
subtitle: "Intro to Machine Learning"
author: "Roan Zappanti"
output: html_notebook


header-includes:
- \usepackage{titlesec}
---
<br>

## Part 1 {.tabset .tabset-pills}
### Intro

#### Real Life Applications of ML

1. **Supervised Classification**: Whether a 1st year college student will graduate <br>
2. **Supervised Regression**: The avg. rating of a professor for a class <br>
3. **Unsupervised Clustering**: What songs to queue in a music streaming app <br>
4. **Unsupervised Dimension Reduction**: 5 Most important attributes to profile a customer  

---

### College Student Classification
###### **Application**:
  Applying machine learning to predict whether a 1st year college student will
  graduate allows universities to:<br>

  - Identify key features that support successful students
  - Support individuals with lower probabilities of success<br><br>

###### **Supervised or Unsupervised**:
  Since universities measure and retain information about the binary outcome (Y/N) of
  whether a college student graduates, this would be a Supervised Classification problem.
  <br><br>
    
###### **Target Variable**: 
  The target variable is a binary (Y/N) indicator of whether the student graduates <br><br>
  
###### **Potential Feature Variables**:
  - High school GPA
  - Household Income
  - Declared or Undeclared Degree (Y/N)
  - Anticipated Degree
  - Geographic Location  
  <br>
  
###### **Data Collection**: 
  Colleges already keep extensive information about students (and likely buy it from third
  party sources). Anticipated barriers lie in accessing the data/gaining data permissions.
  <br><br>

###### **Ethical Concerns**: 
  There are many, especially for public colleges and universities (as opposed to private
  institutions or ordinary businesses).

  - Could be used (and likely is used) to optimize which students a college selects, reducing
  opportunity for low-opportunity students (rather than helping them)
  - Could invite lawsuits if the model appears to discriminate against certain segments in
  the admissions process

---


### Professor Rating Regression
###### **Application**:
  Applying Machine Learning to predict the average rating (out of 5) of college professors
  for the classes they teach:<br>

  - Determine certain classes or trends leading to higher or lower avg. ratings
  - Provide strategic feedback to professors to increase their rating among students<br><br>

###### **Supervised or Unsupervised**:
  Since the model would predict a numerical value that can be checked against actual results,
  this falls within Supervised Regression territory.
  <br><br>
    
###### **Target Variable**: 
  The target variable is a professor's average rating for a class at the end of a semester <br><br>
  
###### **Potential Feature Variables**:
  - Number of students in class
  - Class name
  - Department
  - Online/In Person
  - Tenure (Y/N)
  - Years of experience
  <br>
  
###### **Data Collection**: 
  Again, colleges already maintain information about classes and professors, including
  professor ratings by semester.
  <br><br>

###### **Ethical Concerns**: 
  The biggest ethical concern is that professors could use this model to change their
  teaching style for better ratings (and higher pay) without considering the value that
  they bring to their students (i.e., optimizing for rating rather than teaching).

---


### Clustering Songs for Queue
###### **Application**:
  Applying Machine Learning to ascertain clusters of songs to be placed in the queue
  of a user's music streaming app:<br>

  - Better queue recommendations/clusters lead to higher customer satisfaction
  <br><br>

###### **Supervised or Unsupervised**:
  Since this model aims to group songs based on user information, and since there is no
  true answer to measure against, this is an Unsupervised Clustering model.
  <br><br>
    
###### **Target Variable**: 
  Unsupervised models do not use target variables <br><br>
  
###### **Potential Feature Variables**:
  - Time of day
  - Day of week/month
  - Previous songs queued/played by user
  - Liked artists
  - Liked albums/music types  
  <br>
  
###### **Data Collection**: 
  Ideally, this would be developed as an employee within one of these companies. Streaming
  apps collect extensive user data, which can be leveraged within the model.
  <br><br>

###### **Ethical Concerns**: 
  While users agree to data collection within the terms of use of streaming apps, they may still
  find direct use of their information off-putting. This is less an ethical concern than a
  concern for the brand's PR (think of the famous Target pregnancy model).

---


### Dimension Reduction for Customer Profiling
###### **Application**:
  I work at an insurance company, where customer information includes over one hundred variables.
  This model reduces the number of variables to be considered for customer profiling within the business:<br>

  - Identify and strip away customer features deemed unimportant
  - Simplify customer profiling for future modeling<br><br>

###### **Supervised or Unsupervised**:
  With no target variable and no true answer, this would be an Unsupervised Dimension Reduction model.
  <br><br>
    
###### **Target Variable**: 
  No target variable
  <br><br>
  
###### **Potential Feature Variables**:

Literally *hundreds* of variables, including...

  - Age
  - Quoted Bodily Injury Limits
  - Prior Insurance
  - Number of Automobiles vs. # of Drivers (also used for fraud detection)
  - Geographic Location  
  <br>
  
###### **Data Collection**: 
  This assumes working within an insurance company. Data collection depends on the analyst's
  ability to gain access to the data (or find people who can make it happen). This can take months,
  depending on the data required.
  <br><br>

###### **Ethical Concerns**: 
  Insurance is heavily regulated, and customer profiling often falls in the "grey" area between
  legal and not legal.
  
  - Without going into too much detail, how to balance profitability vs morality? 
  - What's good for the total customer base vs. certain customer segments?
  - Redlining: <https://www.insurance.us/articles/blog/what-is-redlining-with-auto-insurance>

---


## Part 2 {.tabset .tabset-pills}
### Data Setup & Loading

Setting global chunk options
```{r Global Setup, setup, include=TRUE}
knitr::opts_chunk$set(
  comment = '', fig.width = 6, fig.height = 6,
  warning = FALSE, error = FALSE, message = FALSE,
  include = TRUE, echo = TRUE, strip.white = TRUE, highlight = TRUE, results = TRUE
)
```


Loading Libraries
```{r Libraries, results = 'hide'}
packages = c('tidyverse','tidymodels','here','kknn')
lapply(packages, library, character.only = TRUE)
```

### Lab Questions

###### **1. Is this a supervised or unsupervised learning problem? Why?**
  - Since the target variable is contained within the data, and since the model aims to predict
    a numerical value, this is a Supervised Regression model
    <br><br>
    
###### **2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)?**

  - *Target Var:* cmedv (Median value of owner-occupied homes in USD 1,000's)
  - *Predictor Vars:*  
    - lon: longitude of census tract
    - lat: latitude of census tract
    - crim: per capita crime rate by town
    - zn: proportion of residential land zoned for lots over 25,000 sq.ft
    - indus: proportion of non-retail business acres per town
    - chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - nox: nitric oxides concentration (parts per 10 million), aka air pollution
    - rm: average number of rooms per dwelling
    - age: proportion of owner-occupied units built prior to 1940
    - dis: weighted distances to five Boston employment centers
    - rad: index of accessibility to radial highways
    - tax: full-value property-tax rate per USD 10,000
    - ptratio: pupil-teacher ratio by town
    - lstat: percentage of lower status of the population

<br>

###### **3. Given the type of variable cmedv is, is this a regression or classification problem?**
- As stated before, this is a Regression problem

<br>

###### **4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What is the minimum and maximum values of cmedv? What is the average cmedv value?**
```{r Load Data, results = 'hide'}
path <- here('Data','boston.csv')
boston <- readr::read_csv(path)
```

```{r}
head(boston)
```
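
Question 4 also asks about missing values and the range of cmedv, which `head()` alone does not show. The chunk below is a minimal sketch of one way to check this in base R. The helper name `describe_target()` and the toy data frame are assumptions made for illustration (the toy values are made up and are not the Boston data); in this notebook the equivalent call would be `describe_target(boston, "cmedv")`.

```{r}
# Hypothetical helper (not part of the lab): count missing values per column
# and summarise one numeric column by its minimum, maximum, and mean.
describe_target <- function(df, target) {
  values <- df[[target]]
  list(
    missing_by_column = colSums(is.na(df)),   # NA count for every column
    min  = min(values, na.rm = TRUE),
    max  = max(values, na.rm = TRUE),
    mean = mean(values, na.rm = TRUE)
  )
}

# Toy data standing in for boston; the values are made up for illustration
toy <- data.frame(cmedv = c(10, 20, NA, 30), rm = c(5, 6, 7, NA))
toy_summary <- describe_target(toy, "cmedv")
toy_summary
```

Returning a list keeps the per-column missing counts next to the target summary, so a single call answers every part of the question.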

<br><br>

###### **5. Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.**
```{r}
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
```

<br><br>

###### **6. How many observations are in the training set and test set?**
There are **`r nrow(train)`** observations in the "train" data set and **`r nrow(test)`** observations 
in the "test" data set.
<br><br>

###### **7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly?**
```{r}
train %>% 
  mutate(id = 'train') %>% 
  bind_rows(test %>% mutate(id = 'test')) %>%
  ggplot(aes(cmedv, color = id)) +
  geom_density()
```
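
The density plot gives a visual comparison; putting the quantiles of the two sets side by side is one way to back it up with numbers. This is a sketch only: `compare_quantiles()` is a hypothetical helper, and the toy vectors are made up to stand in for `train$cmedv` and `test$cmedv`.

```{r}
# Hypothetical helper: tabulate the same quantiles for two numeric vectors
compare_quantiles <- function(train_values, test_values,
                              probs = c(0, 0.25, 0.5, 0.75, 1)) {
  data.frame(
    quantile = probs,
    train = unname(quantile(train_values, probs, na.rm = TRUE)),
    test  = unname(quantile(test_values, probs, na.rm = TRUE))
  )
}

# Toy vectors (made up) standing in for train$cmedv and test$cmedv
toy_comparison <- compare_quantiles(c(10, 20, 30, 40, 50), c(12, 22, 32, 42, 52))
toy_comparison
```

If the two columns track each other closely across the quantiles, the stratified split has likely preserved the shape of the cmedv distribution.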

<br><br>

###### **8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?**
```{r}
# fit model
lm1 <- linear_reg() %>%
  fit( cmedv ~ rm, data = train)

# compute the RMSE on the test data
prd1 <- lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

prd1
```
The test set RMSE is about **$6,830** (cmedv is measured in USD 1,000s) when using rm (average number of rooms) to estimate cmedv.
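
RMSE is the square root of the mean squared difference between the observed and predicted values, so it is expressed in the same units as cmedv. The chunk below sketches that calculation by hand in base R. The name `rmse_by_hand()` is a hypothetical helper (chosen so it does not mask the `rmse()` function used above), and the numbers are made up rather than taken from the lab results.

```{r}
# Hypothetical helper: root mean squared error computed directly
rmse_by_hand <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Made-up observed and predicted values, for illustration only
toy_truth <- c(20, 25, 30)
toy_pred  <- c(23, 21, 30)

# Errors are -3, 4, 0; squared errors are 9, 16, 0; their mean is 25/3
toy_rmse <- rmse_by_hand(toy_truth, toy_pred)
toy_rmse
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.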

<br><br>

###### **9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?**
```{r}
# fit model
# every column except the response is a candidate feature
features <- setdiff(names(boston), "cmedv")

lm2 <- linear_reg() %>%
  fit( cmedv ~ ., data = train[,c("cmedv",features)])

# compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
```
The test set RMSE is **$4,829** when using all features to estimate cmedv. This is better than the single-feature model: the RMSE falls by nearly *$2,000*.
<br><br>

###### **10. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?**
```{r}
# fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit( cmedv ~ ., data = train[,c("cmedv",features)])

# compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
```
Using K-nearest neighbors, the test set RMSE is **$3,357** (with all features included). The KNN model reduces the RMSE by about **$1,500** compared to linear regression using all features, and by about **$3,500** compared to linear regression using only the rm feature, making it the best performer of the three models.
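
To illustrate the idea behind a nearest-neighbor prediction, the chunk below predicts one new observation as the plain average of the target for its k closest training points. It is a simplified sketch under stated assumptions: `knn_predict_one()` is a hypothetical helper, the data are made up, and distances are unweighted Euclidean on unscaled features. The kknn engine used above may scale features and weight neighbors differently, so this is not a reimplementation of that model.

```{r}
# Hypothetical helper: average the target of the k nearest training rows,
# using unweighted Euclidean distance on the raw feature values
knn_predict_one <- function(train_x, train_y, new_x, k = 2) {
  diffs <- sweep(as.matrix(train_x), 2, new_x)   # subtract new_x from every row
  distances <- sqrt(rowSums(diffs^2))
  nearest <- order(distances)[seq_len(k)]        # indices of the k closest rows
  mean(train_y[nearest])
}

# Made-up training data with two features, for illustration only
toy_x <- data.frame(rm = c(4, 5, 6, 7), lstat = c(20, 15, 10, 5))
toy_y <- c(15, 20, 25, 35)

# The two closest rows to (rm = 6.4, lstat = 8) are rows 3 and 4
toy_knn <- knn_predict_one(toy_x, toy_y, new_x = c(6.4, 8), k = 2)
toy_knn
```

Since the prediction depends on raw distances, features on larger scales dominate unless they are standardized first, which is one reason real implementations usually rescale inputs.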





