my_packages <- c("readr","tidyverse","tidyr","completejourney","dplyr","stringr","ggplot2","lubridate","scales","here","naniar","outliers","gridExtra","AmesHousing","ggh4x","repurrrsive","tidymodels","modeldata","kknn")
lapply(my_packages, require, character.only = TRUE)
## Loading required package: readr
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ dplyr 1.1.0
## ✔ tibble 3.1.8 ✔ stringr 1.5.0
## ✔ tidyr 1.3.0 ✔ forcats 0.5.2
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Loading required package: completejourney
##
## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.
##
## Loading required package: lubridate
##
##
## Attaching package: 'lubridate'
##
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
##
##
## Loading required package: scales
##
##
## Attaching package: 'scales'
##
##
## The following object is masked from 'package:purrr':
##
## discard
##
##
## The following object is masked from 'package:readr':
##
## col_factor
##
##
## Loading required package: here
##
## here() starts at C:/Users/chase/OneDrive/Desktop/Data Mining/Data Mining Spring 2023
##
## Loading required package: naniar
##
## Loading required package: outliers
##
## Loading required package: gridExtra
##
##
## Attaching package: 'gridExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
## Loading required package: AmesHousing
##
## Loading required package: ggh4x
##
## Loading required package: repurrrsive
##
## Loading required package: tidymodels
##
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
##
## ✔ broom 1.0.2 ✔ rsample 1.1.1
## ✔ dials 1.1.0 ✔ tune 1.0.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.0
## ✔ parsnip 1.0.4 ✔ yardstick 1.1.0
## ✔ recipes 1.0.5
##
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
##
## Loading required package: kknn
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
##
## [[11]]
## [1] TRUE
##
## [[12]]
## [1] TRUE
##
## [[13]]
## [1] TRUE
##
## [[14]]
## [1] TRUE
##
## [[15]]
## [1] TRUE
##
## [[16]]
## [1] TRUE
##
## [[17]]
## [1] TRUE
##
## [[18]]
## [1] TRUE
##
## [[19]]
## [1] TRUE
setwd("C:/Users/chase/OneDrive/Desktop/Data Mining/Data Mining Spring 2023")
getwd()
## [1] "C:/Users/chase/OneDrive/Desktop/Data Mining/Data Mining Spring 2023"
For this exercise we’ll use the Boston housing data set, which is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. It was originally published in Harrison Jr and Rubinfeld (1978).
The purpose of this data set is to predict the median value of owner-occupied homes for census tracts in the Boston area. Each row (observation) represents one census tract, and the variable we wish to predict is cmedv (median value of owner-occupied homes in thousands of USD). The remaining variables serve as predictors of cmedv and include:
• lon: longitude of census tract
• lat: latitude of census tract
• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000 sq. ft
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• nox: nitric oxides concentration (parts per 10 million), i.e. air pollution
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted distances to five Boston employment centers
• rad: index of accessibility to radial highways
• tax: full-value property-tax rate per USD 10,000
• ptratio: pupil-teacher ratio by town
• lstat: percentage of lower status of the population
Task 1: is this a supervised or unsupervised learning problem? Why?
# This is a supervised learning problem: we are predicting a known output (cmedv) from multiple input features, so the data set is labeled.
Task 2: There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)?
# The response variable we want to predict is cmedv. The predictor variables (features) are the other 15 variables in the data set.
Task 3: Given the type of variable cmedv is, is this a regression or classification problem?
# This would be a regression problem since we are predicting a numerical value (median value of owner-occupied homes).
Task 4: Import the data. Are there any missing values? What are the minimum and maximum values of cmedv? What are the median and average cmedv values?
boston <- read_csv('Data Mining Data Folder/boston.csv')
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sum(is.na(boston)) # 0 missing values
## [1] 0
min(boston$cmedv) # min value = 5
## [1] 5
max(boston$cmedv) # max value = 50
## [1] 50
median(boston$cmedv) # median value = 21.2
## [1] 21.2
mean(boston$cmedv) # mean value = 22.53
## [1] 22.52885
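The separate calls above can also be collapsed into a single summary. The following is a minimal sketch, not the method used in this exercise: the `cmedv` vector here is a hypothetical stand-in with made-up values so that the example runs on its own; with the real data you would pass `boston$cmedv` instead.

```r
# Hypothetical stand-in for boston$cmedv (values are illustrative only)
cmedv <- c(5, 13.4, 21.2, 24.0, 50)

# summary() reports the minimum, quartiles, median, mean, and maximum in one call
summary(cmedv)

# anyNA() gives a quick TRUE/FALSE answer to "are there any missing values?"
anyNA(cmedv)
```

For a whole data frame, `colSums(is.na(df))` gives the missing-value count per column, which is easier to act on than a single total.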
Task 5: Split the data into a training set and test set using a 70-30% split. Be sure to include set.seed(123) so that your train and test sets match mine.
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
Task 6: How many observations are in the training set and test set?
#Training: 352
#Testing: 154
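The counts above can be read directly from the split objects with `nrow()`. The sketch below assumes the rsample package is installed and uses a hypothetical data frame `toy` with random values in place of the real data, so the counts it prints are not the ones reported for this exercise.

```r
library(rsample)

# Hypothetical data frame standing in for `boston` (same number of rows, random values)
set.seed(123)
toy <- data.frame(x = rnorm(506), y = rnorm(506))

# 70-30 split without stratification
toy_split <- initial_split(toy, prop = 0.7)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Number of observations in each set
nrow(toy_train)
nrow(toy_test)
```

With the real objects, `nrow(train)` and `nrow(test)` return the two counts. When `strata` is used, the split is made within bins of the stratification variable, so the training count can differ slightly from an exact 70% of the rows.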
Task 8: Fit a linear regression model using the “rm” feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?
# fit model
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# compute the RMSE on the test data
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.83
# with rm feature, RMSE is 6.83
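To make the metric concrete, RMSE can be computed by hand: square each prediction error, average the squares, and take the square root. The sketch below uses hypothetical `truth` and `pred` vectors (made-up numbers, not Boston predictions) and base R only.

```r
# Hypothetical observed and predicted values (illustrative only)
truth <- c(24.0, 21.6, 34.7, 33.4)
pred  <- c(25.0, 20.6, 32.7, 35.4)

# Errors are -1, 1, 2, -2; squared errors are 1, 1, 4, 4; their mean is 2.5
rmse_by_hand <- sqrt(mean((truth - pred)^2))
rmse_by_hand  # about 1.58
```

Because RMSE is in the same units as the response, a test RMSE of 6.83 for cmedv corresponds to a typical prediction error of roughly USD 6,830.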
Task 9: Fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?
# fit model (stored as lm2 so the single-feature model is not overwritten)
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.83
# with all features, RMSE is 4.83, which is better (lower) than the 6.83 from the rm-only model
Task 10: Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?
# fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 3.37
# RMSE is 3.37
# The KNN model performs best: its test RMSE (3.37) is lower than that of both linear models (6.83 and 4.83)
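The predict, bind, and score steps are repeated for every model in this exercise, so they could be wrapped in a small helper. This is only a sketch: `test_rmse()` is a hypothetical function name, not part of tidymodels, and the demonstration fits a model to the built-in `mtcars` data so that the block runs without the Boston file.

```r
library(tidymodels)

# Hypothetical helper: RMSE of a fitted parsnip model on new data
test_rmse <- function(model, new_data, outcome = "cmedv") {
  preds <- predict(model, new_data)
  rmse_vec(truth = new_data[[outcome]], estimate = preds$.pred)
}

# Demonstration on built-in data (not the Boston data set)
fit_demo <- linear_reg() %>%
  fit(mpg ~ wt, data = mtcars)

demo_rmse <- test_rmse(fit_demo, mtcars, outcome = "mpg")
demo_rmse
```

With the objects from this exercise, a call such as `test_rmse(knn, test)` would return the same value as the longer pipeline. Note that the demonstration scores the model on its own training data, which is acceptable for illustrating the helper but not for judging model performance.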