Module 8 Lab Notebook

Part 1:

Fraud Detection:
* Supervised: based on historical transactions that were labeled either legitimate or fraudulent.
* Target Variable: Whether a transaction is flagged as “fraudulent” or not.
* Feature Variables: Transaction time, location, amount, account owner’s spending history, etc.

Social Media Interest Groups:
* Unsupervised: users are grouped into interest categories that have no predefined labels.
* Feature Variables: User activity, including likes, reshares, comments, ad interactions, and accounts followed.

Product Recommendations:
* Supervised: based on previous purchases with defined labels such as department, cost, etc.
* Target Variable: Whether or not a customer purchases an item.
* Feature Variables: Past purchases, cart items, user demographics.

Credit Scores:
* Supervised: based on historical data about which characteristics (credit history, income, etc.) make a prospective customer more likely to default on a loan.
* Target Variable: Whether or not a consumer defaults on a loan.
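The supervised/unsupervised distinction above can be sketched in a few lines of base R. This is only an illustration on the built-in `mtcars` data, which is an assumption and has nothing to do with the four business cases; the variable names are hypothetical as well.

```r
# Supervised: a target column (mpg) is supplied alongside the features.
supervised_fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(supervised_fit)

# Unsupervised: only features are supplied; kmeans() forms groups
# without ever seeing a label.
set.seed(123)
features <- scale(mtcars[, c("wt", "hp")])
clusters <- kmeans(features, centers = 3, nstart = 10)
table(clusters$cluster)
```

The first model needs a known answer for every row, while the clustering step only needs the feature columns.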

Part 2: Prerequisites

#Library
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.3.0     ✔ tune         1.2.1
## ✔ infer        1.0.7     ✔ workflows    1.1.4
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.2.1     ✔ yardstick    1.3.1
## ✔ recipes      1.1.0     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.

#Data path
library(here)

## here() starts at C:/Users/empac/Documents/Coursework/BANA4080

library(readr)

2.1: This is a supervised learning problem

2.2: Response variable: cmedv. Predictor variables: the other 15 columns.

2.3: This is a regression problem

2.4:

# Data
boston <- readr::read_csv("boston.csv")

## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check for missing values in the dataset
missing_values <- colSums(is.na(boston))
print(missing_values)

##     lon     lat   cmedv    crim      zn   indus    chas     nox      rm     age 
##       0       0       0       0       0       0       0       0       0       0 
##     dis     rad     tax ptratio       b   lstat 
##       0       0       0       0       0       0

# Summary Statistics
summary_cmedv <- summary(boston$cmedv)
print(summary_cmedv)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

2.4.a: There are no missing values in this data set.

2.4.b: The minimum is 5.00. The maximum is 50.00.

2.4.c: The average is 22.53. The median is 21.20.

2.5:

library(rsample)

set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)

2.6:

n_train <- nrow(train)
n_train

## [1] 352

n_test <- nrow(test)
n_test

## [1] 154

There are 352 entries in the training data and 154 in the test data.
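As a quick arithmetic check on the counts reported above, the sketch below recomputes the realized split proportions. The numbers are copied from the output; the small gap from an exact 70/30 split is probably due to the sampling being done within each `cmedv` stratum, which is an assumption rather than something verified here.

```r
# Counts copied from the notebook output above.
n_total <- 506
n_train <- 352
n_test  <- 154

# The two pieces should account for every row.
stopifnot(n_train + n_test == n_total)

# Realized proportions, rounded to three decimals.
split_props <- round(c(train = n_train, test = n_test) / n_total, 3)
split_props  # train 0.696, test 0.304
```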

2.7:

library(ggplot2)

#Histogram: Training Data
ggplot(train, aes(x = cmedv)) + 
  geom_histogram(binwidth = 1, fill = 'blue', alpha = 0.6) +
  ggtitle("Training Set - Distribution of cmedv") +
  xlab("cmedv") +
  ylab("Frequency")

#Histogram: Test Data
ggplot(test, aes(x = cmedv)) + 
  geom_histogram(binwidth = 1, fill = 'red', alpha = 0.6) +
  ggtitle("Test Set - Distribution of cmedv") +
  xlab("cmedv") +
  ylab("Frequency")

#Boxplot Comp
ggplot() +
  geom_boxplot(data = train, aes(y = cmedv, x = "Train"), fill = 'blue', alpha = 0.6) +
  geom_boxplot(data = test, aes(y = cmedv, x = "Test"), fill = 'red', alpha = 0.6) +
  ggtitle("Boxplot Comparison of cmedv") +
  ylab("cmedv") +
  xlab("Dataset")

2.8:

library(parsnip)
library(yardstick)
library(dplyr)

lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)

lm1 %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

The test RMSE for the single-predictor model (cmedv ~ rm) is 6.8314.
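For reference, RMSE is the square root of the mean squared prediction error. The sketch below computes it by hand in base R. It uses the built-in `mtcars` data as a stand-in because `boston.csv` is not available here, so the resulting number is not comparable to the one above; all `toy_*` names are hypothetical.

```r
# Stand-in data (assumption): mtcars instead of boston.csv.
set.seed(123)
train_idx <- sample(nrow(mtcars), size = 22)
toy_train <- mtcars[train_idx, ]
toy_test  <- mtcars[-train_idx, ]

# Fit on the training rows, predict on the held-out rows.
toy_fit  <- lm(mpg ~ wt, data = toy_train)
toy_pred <- predict(toy_fit, newdata = toy_test)

# RMSE: square the errors, average them, take the square root.
rmse_by_hand <- sqrt(mean((toy_test$mpg - toy_pred)^2))
rmse_by_hand
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.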

2.9:

lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)  

lm2 %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

The test RMSE for the model using all predictors (cmedv ~ .) is 4.8293.
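To put the two linear models side by side, the sketch below works out the absolute and relative drop in test RMSE. The two values are copied from the results reported in 2.8 and 2.9; the variable names are hypothetical.

```r
# Test RMSE values copied from the notebook results.
rmse_lm1 <- 6.8314  # cmedv ~ rm
rmse_lm2 <- 4.8293  # cmedv ~ .

abs_drop <- rmse_lm1 - rmse_lm2        # absolute improvement
pct_drop <- 100 * abs_drop / rmse_lm1  # improvement relative to lm1

round(c(abs_drop = abs_drop, pct_drop = pct_drop), 2)  # 2.00 and 29.31
```

On this split, adding the remaining predictors lowers the test RMSE by about 29 percent.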

# Install kknn only if it is missing, then load it
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")

library(kknn)

knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)  

knn %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
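As a rough picture of what a nearest-neighbor regression does, the base R sketch below predicts a value by averaging the responses of the k closest training rows. `knn_predict_one` is a hypothetical helper, not part of kknn or parsnip, and the `mtcars` data is a stand-in. The kknn engine may also weight neighbors by distance, which this plain average does not attempt.

```r
# Hypothetical helper: unweighted k-nearest-neighbor prediction for one point.
knn_predict_one <- function(x_train, y_train, x_new, k = 5) {
  # Euclidean distance from the new point to every training row.
  dists <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))
  # Indices of the k closest rows, then the mean of their responses.
  nearest <- order(dists)[seq_len(k)]
  mean(y_train[nearest])
}

# Stand-in data (assumption): two scaled features from mtcars.
x_train <- scale(mtcars[, c("wt", "hp")])
y_train <- mtcars$mpg

# With k = 1 a training point is its own nearest neighbor,
# so the prediction is that row's mpg.
pred_k1 <- knn_predict_one(x_train, y_train, x_train[1, ], k = 1)
pred_k5 <- knn_predict_one(x_train, y_train, x_train[1, ], k = 5)
c(k1 = pred_k1, k5 = pred_k5)
```

Scaling the features first keeps a wide-ranging column such as `hp` from dominating the distance calculation.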

Emily Pachuk

2024-10-20