Part 1:
Fraud Detection:
* Supervised: based on historical transactions that have been labeled as either legitimate or fraudulent.
* Target Variable: Whether or not a transaction is labeled “fraudulent”.
* Feature Variables: Transaction time, location, amount, account owner’s spending history, etc.
Social Media Interest Groups:
* Unsupervised: users are grouped into a variety of categories without predefined labels.
* Feature Variables: User activity, including likes, reshares, comments, ad interactions, and accounts followed.
Product Recommendations:
* Supervised: based on previous purchases with defined labels such as department, cost, etc.
* Target Variable: Whether or not a customer purchases an item.
* Feature Variables: Past purchases, cart items, user demographics.
Credit Scores:
* Supervised: based on historical data about which characteristics (credit history, income, etc.) make a prospective customer more likely to default on a loan.
* Target Variable: Whether or not a consumer defaults on a loan.
Part 2: Prerequisites
#Library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.3.0 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
#Data path
library(here)
## here() starts at C:/Users/empac/Documents/Coursework/BANA4080
library(readr)
2.1: This is a supervised learning problem
2.2: Response variable: cmedv. Predictor variables: the other 15 columns.
2.3: This is a regression problem
2.4:
# Data
boston <- readr::read_csv("boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check for missing values in the dataset
missing_values <- colSums(is.na(boston))
print(missing_values)
## lon lat cmedv crim zn indus chas nox rm age
## 0 0 0 0 0 0 0 0 0 0
## dis rad tax ptratio b lstat
## 0 0 0 0 0 0
# Summary Statistics
summary_cmedv <- summary(boston$cmedv)
print(summary_cmedv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
2.4.a: There are no missing values in this data set.
2.4.b: The minimum of cmedv is 5.00 and the maximum is 50.00.
2.4.c: The mean of cmedv is 22.53 and the median is 21.20.
2.5:
library(rsample)
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
2.6:
n_train <- nrow(train)
n_train
## [1] 352
n_test <- nrow(test)
n_test
## [1] 154
There are 352 entries in the training data and 154 in the test data.
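As a quick arithmetic check on the counts above, the short sketch below recomputes the share of rows in each set. The counts are copied from the output above; the variable names `n_total`, `train_share`, and `test_share` are introduced here only for illustration. The training share comes out slightly below the requested 0.7 (0.7 × 506 = 354.2 rows). A plausible reason, stated as an assumption rather than something verified here, is that `initial_split()` samples within bins of `cmedv` when `strata` is set, so the rounding happens per bin.

```r
# Counts reported in the output above
n_train <- 352
n_test  <- 154

# Total should match the 506 rows read from boston.csv
n_total <- n_train + n_test

# Share of observations in each set
train_share <- n_train / n_total
test_share  <- n_test / n_total

round(c(train = train_share, test = test_share), 3)
```

The two shares round to 0.696 and 0.304, which is close to the 70/30 split requested with `prop = 0.7`.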
2.7:
library(ggplot2)
# Histogram: training data
ggplot(train, aes(x = cmedv)) +
  geom_histogram(binwidth = 1, fill = 'blue', alpha = 0.6) +
  ggtitle("Training Set - Distribution of cmedv") +
  xlab("cmedv") +
  ylab("Frequency")
# Histogram: test data
ggplot(test, aes(x = cmedv)) +
  geom_histogram(binwidth = 1, fill = 'red', alpha = 0.6) +
  ggtitle("Test Set - Distribution of cmedv") +
  xlab("cmedv") +
  ylab("Frequency")
# Boxplot comparison of the two sets
ggplot() +
  geom_boxplot(data = train, aes(y = cmedv, x = "Train"), fill = 'blue', alpha = 0.6) +
  geom_boxplot(data = test, aes(y = cmedv, x = "Test"), fill = 'red', alpha = 0.6) +
  ggtitle("Boxplot Comparison of cmedv") +
  ylab("cmedv") +
  xlab("Dataset")
2.8:
library(parsnip)
library(yardstick)
library(dplyr)
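# Simple linear regression: cmedv explained by rm only
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# Predict on the test set and compute the test RMSE
lm1 %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)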
The test RMSE is 6.8314.
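For reference, RMSE is the square root of the mean squared difference between the observed and predicted values. The sketch below computes it by hand on a small made-up example; the vectors `actual` and `predicted` are hypothetical values chosen for illustration and are not taken from the Boston data.

```r
# Hypothetical toy values (not from the Boston data)
actual    <- c(20, 25, 30)
predicted <- c(22, 24, 27)

# Residuals: observed minus predicted
errors <- actual - predicted

# RMSE: square the errors, average them, take the square root
rmse_manual <- sqrt(mean(errors^2))
rmse_manual
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, and the result is expressed in the same units as cmedv.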
2.9:
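# Multiple linear regression: cmedv explained by all other columns
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# Predict on the test set and compute the test RMSE
lm2 %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)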
The test RMSE is 4.8293.
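To put the two results side by side, the sketch below computes how much the test RMSE drops when moving from the single-predictor model (cmedv ~ rm) to the model with all predictors (cmedv ~ .). The two RMSE values are copied from the results reported above; the names `rmse_lm1`, `rmse_lm2`, `abs_drop`, and `rel_drop` are introduced here only for illustration.

```r
# Test RMSE values reported above
rmse_lm1 <- 6.8314  # cmedv ~ rm
rmse_lm2 <- 4.8293  # cmedv ~ .

# Absolute and relative reduction in test RMSE
abs_drop <- rmse_lm1 - rmse_lm2
rel_drop <- abs_drop / rmse_lm1

round(c(absolute = abs_drop, relative = rel_drop), 4)
```

The reduction is about 2.0 RMSE units, or roughly 29% of the single-predictor model's error. Assuming cmedv is recorded in thousands of dollars, as is usual for this dataset, that corresponds to about $2,000 less typical prediction error on the test set.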
# install.packages("kknn")  # run once in the console rather than on every knit
library(kknn)
# K-nearest neighbors regression using all predictors
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# Predict on the test set and compute the test RMSE
knn %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)