library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ recipes 1.0.8
## ✔ dials 1.2.0 ✔ rsample 1.2.0
## ✔ dplyr 1.1.3 ✔ tibble 3.2.1
## ✔ ggplot2 3.4.3 ✔ tidyr 1.3.0
## ✔ infer 1.0.5 ✔ tune 1.1.2
## ✔ modeldata 1.2.0 ✔ workflows 1.1.3
## ✔ parsnip 1.1.1 ✔ workflowsets 1.0.1
## ✔ purrr 1.0.2 ✔ yardstick 1.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
boston <- readr::read_csv("~/Desktop/BANA 4080 R/data_bana4080/boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Supervised learning example:
Problem: Estimate a student’s exam score from relevant features describing their study behavior.
Problem type: This is a supervised learning problem because model training requires labeled data: the dataset contains each student’s actual exam result alongside information on their study habits.
Target variable: The student’s exam score, which is a continuous variable.
Feature variables: The number of hours spent studying, the number of assignments completed, and the attendance percentage (a minimal code sketch of this setup follows the list below).
Data collection: Students’ study time, assignment completion, attendance, and exam results are tracked to build the dataset; this information can come from surveys, school records, or educational databases.
Benefits: Supervised learning lets instructors forecast student performance and personalize support strategies. Identifying underperforming students enables early intervention, and feedback can be tailored to individual study habits.
Ethical concerns: Privacy and responsible data use are essential. Student data should be kept confidential, collected with informed consent, and used only for educational purposes. Detecting and mitigating bias is also crucial to prevent unfair educational practices, and misclassifying a student’s performance can lead to unintended consequences such as misallocated resources or ineffective interventions.
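As a hedged illustration only, the sketch below shows how such a model could be specified with tidymodels. The data frame student_data and the columns exam_score, hours_studied, assignments_completed, and attendance_pct are hypothetical names used purely for the example; they are not part of any real dataset.
# Model specification for the exam-score example (tidymodels is attached above)
exam_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
# With a real data frame of labeled records (here called student_data, a
# hypothetical name), the model would be fit along these lines:
# exam_fit <- exam_spec %>%
#   fit(exam_score ~ hours_studied + assignments_completed + attendance_pct,
#       data = student_data)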
QUESTION 1
This is a supervised learning problem because we have a target variable (cmedv) that we want to predict based on a set of predictor variables.
QUESTION 2
The response variable is “cmedv,” and the predictor variables (features) include “lon,” “lat,” “crim,” “zn,” “indus,” “chas,” “nox,” “rm,” “age,” “dis,” “rad,” “tax,” “ptratio,” and “lstat.”
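If a programmatic check is helpful, the predictor names can be pulled directly from the data by dropping the response column (this assumes the boston object read in above):
# List every column except the response cmedv
setdiff(names(boston), "cmedv")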
QUESTION 3
“cmedv” is a continuous numeric variable, so this is a regression problem.
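A quick sanity check on the loaded data confirms this:
# cmedv is numeric with many distinct values, consistent with a continuous regression target
class(boston$cmedv)
dplyr::n_distinct(boston$cmedv)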
QUESTION 4
# Total number of missing values across the entire data set
missing_value_boston <- sum(is.na(boston))
# Summary statistics for the response variable cmedv
min_boston <- min(boston$cmedv)
max_boston <- max(boston$cmedv)
mean_boston <- mean(boston$cmedv)
median_boston <- median(boston$cmedv)
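The objects above are created but never displayed; printing them (or calling summary() on cmedv directly) shows the actual values:
# Display the stored results
missing_value_boston
c(min = min_boston, max = max_boston, mean = mean_boston, median = median_boston)
# Equivalent one-line check on the response
summary(boston$cmedv)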
QUESTION 5
# Set the random seed so the split is reproducible (no need to store the result)
set.seed(123)
# 70/30 train/test split, stratified on the response cmedv
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
QUESTION 6
# Number of observations in each partition
number_train <- nrow(train)
number_test <- nrow(test)
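Dividing by the total number of rows confirms the split is roughly 70/30:
# Proportion of observations in each partition
number_train / nrow(boston)
number_test / nrow(boston)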
QUESTION 7
You can compare the distributions using summary statistics, histograms, or density plots. If the two distributions differ substantially, the split may not be representative and test-set performance estimates could be misleading.
# Histograms of cmedv for the training and test sets
hist(train$cmedv, main = "Training Set - cmedv Distribution", xlab = "cmedv")
hist(test$cmedv, main = "Test Set - cmedv Distribution", xlab = "cmedv")
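Summary statistics give a complementary numeric comparison (same train and test objects as above); because the split was stratified on cmedv, the two summaries should be similar:
# Compare cmedv summary statistics across the two partitions
summary(train$cmedv)
summary(test$cmedv)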
QUESTION 8
# Simple linear regression: predict cmedv from the average number of rooms (rm)
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# Predict on the test set and compute the test RMSE
compute_the_RMSE <- lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
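Printing the objects shows the fitted coefficients and the resulting test-set RMSE (tidy() comes from broom, which is attached with tidymodels):
# Inspect the fitted coefficients and the test-set RMSE
tidy(lm1)
compute_the_RMSE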
QUESTION 9
# Multiple linear regression using all other columns as predictors
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# Predict on the test set and compute the test RMSE
compute_the_RMSE_9 <- lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
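The `.` in the formula expands to every column of train other than cmedv; tidy(lm2) lists the resulting terms if confirmation is needed:
# One coefficient per predictor, plus the intercept
tidy(lm2)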
QUESTION 10
# K-nearest neighbors regression with an unweighted (rectangular) kernel,
# fit on all predictors via the kknn engine
knn <- nearest_neighbor(weight_func = "rectangular") %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# Predict on the test set and compute the test RMSE
compute_the_RMSE_10 <- knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
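To wrap up, binding the three metric tibbles computed above gives a side-by-side view of test-set performance; the model with the lowest RMSE predicts cmedv most accurately on the held-out data:
# Stack the three test-set RMSE results for comparison
bind_rows(
  lm_rm = compute_the_RMSE,
  lm_all = compute_the_RMSE_9,
  knn = compute_the_RMSE_10,
  .id = "model"
)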