library(tidyverse)
library(openintro)
library(dplyr)

The file UniversalBank.rds contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

Data Description:

| Variable            | Description                                                                         |
|:--------------------|:------------------------------------------------------------------------------------|
| ID                  | Customer ID                                                                         |
| Age                 | Customer’s age in completed years                                                   |
| Experience          | # years of professional experience                                                  |
| Income              | Annual income of the customer ($000)                                                |
| ZIPCode             | Home Address ZIP code                                                               |
| Family              | Family size of the customer                                                         |
| CCAvg               | Avg. spending on credit cards per month ($000)                                      |
| Education_1         | Education Level = 1 if Undergrad; 0 otherwise                                       |
| Education_2         | Education Level = 1 if Graduate; 0 otherwise                                        |
| Education_3         | Education Level = 1 if Advanced/Professional; 0 otherwise                           |
| Mortgage            | Value of house mortgage if any. ($000)                                              |
| PersonalLoan        | Did this customer accept the personal loan offered in the last campaign?            |
| SecuritiesAccount   | Does the customer have a securities account with the bank?                          |
| CDAccount           | Does the customer have a certificate of deposit (CD) account with the bank?         |
| Online              | Does the customer use internet banking facilities?                                  |
| CreditCard          | Does the customer use a credit card issued by UniversalBank?                        |

Question 1

Read and preprocess the data:

  1. Read the data file and assign it to an R object.
  2. Drop ID and ZIPCode from the dataset.
  3. Convert binary categorical variables into factor variables.
  4. Handle missing values, if there are any.
  5. Normalize all numerical variables.

Answer to Question 1

# Reading the data file into an R object

bank_data <- readRDS("~/Documents/UNH/Semester-3/SupervisedMachineLearning/test/UniversalBank.rds")

# Dropping 'ID' and 'ZIPCode'
bank_data <- bank_data %>% select(-ID, -ZIPCode)

# Converting binary categorical variables to factors
binary_vars <- c("PersonalLoan", "SecuritiesAccount", "CDAccount", "Online", "CreditCard")

bank_data[binary_vars] <- lapply(bank_data[binary_vars], factor)

# Checking for missing values
sum(is.na(bank_data))
## [1] 0
# Removing rows with missing values if any exist
bank_data <- na.omit(bank_data)

# Identifying numerical variables
numerical_vars <- c("Age", "Experience", "Income", "Family", "CCAvg", "Mortgage")

# Normalizing the numerical variables to scale between 0 and 1
bank_data[numerical_vars] <- lapply(bank_data[numerical_vars], function(x) (x - min(x)) / (max(x) - min(x)))

Question 2

Partition the data into training (60%) and testing (40%) sets. Use top 60% as training sample and remaining as test data.

Answer to Question 2

# Insert the code here

# Setting the seed for reproducibility
set.seed(123)

# Drawing a random 60% of the row indices for the training set
train_indices <- sample(1:nrow(bank_data), size = 0.6 * nrow(bank_data))

# Splitting the data
train_data <- bank_data[train_indices, ]  # 60% training data
test_data <- bank_data[-train_indices, ]   # 40% testing data

# Checking the size of train and test data
nrow(train_data)  # Should be 60% of the original data
## [1] 3000
nrow(test_data)   # Should be 40% of the original data
## [1] 2000

Question 3

Perform a k-NN classification with all predictors except ID and ZIP code using:
a. k = 1.
b. k = 2.
c. k = 3.
d. k = 4.

For each k = 1, 2, 3, 4, compute the accuracy on the test dataset. Which k is the best?

Answer to Question 3

# Insert the code here
# Loading the class package, which provides knn()
library(class)

# Define the target variable (PersonalLoan) for both training and test datasets
train_labels <- train_data$PersonalLoan
test_labels <- test_data$PersonalLoan

# Exclude the target variable from training and test datasets (use only predictors)
train_predictors <- train_data %>% select(-PersonalLoan)
test_predictors <- test_data %>% select(-PersonalLoan)

# Convert to matrix for knn function
train_matrix <- as.matrix(train_predictors)
test_matrix <- as.matrix(test_predictors)

# Function to compute accuracy
compute_accuracy <- function(k) {
  # Perform k-NN classification
  knn_predictions <- knn(train_matrix, test_matrix, cl = train_labels, k = k)
  
  # Calculate accuracy
  accuracy <- sum(knn_predictions == test_labels) / length(test_labels)
  return(accuracy)
}

# Compute accuracy for k = 1, 2, 3, 4
accuracy_k1 <- compute_accuracy(1)
accuracy_k2 <- compute_accuracy(2)
accuracy_k3 <- compute_accuracy(3)
accuracy_k4 <- compute_accuracy(4)

# Print the accuracies
accuracy_k1  # Accuracy for k = 1
## [1] 0.962
accuracy_k2  # Accuracy for k = 2
## [1] 0.9565
accuracy_k3  # Accuracy for k = 3
## [1] 0.96
accuracy_k4  # Accuracy for k = 4
## [1] 0.955

Question 4

Consider the following customer:

Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, SecuritiesAccount = 0, CDAccount = 0, Online = 1, and CreditCard = 1.

Perform a k-NN classification with all predictors except ID and ZIP code using the best k. How would this customer be classified?

Answer to Question 4

# Insert your code here

# Defining the new customer data
new_customer <- data.frame(
  Age = 40,
  Experience = 10,
  Income = 84,
  Family = 2,
  CCAvg = 2,
  Education_1 = 0,
  Education_2 = 1,
  Education_3 = 0,
  Mortgage = 0,
  SecuritiesAccount = 0,
  CDAccount = 0,
  Online = 1,
  CreditCard = 1
)

# Helper that min-max scales x given a minimum and a maximum
normalize <- function(x, min_val, max_val) {
  return((x - min_val) / (max_val - min_val))
}

# Scaling the new customer's numeric values with the min and max of train_data.
# Note: train_data was already scaled to [0, 1] in Question 1, so these values
# are close to 0 and 1 rather than the raw minima and maxima.
numerical_vars <- c("Age", "Experience", "Income", "Family", "CCAvg", "Mortgage")
for (var in numerical_vars) {
  new_customer[[var]] <- normalize(new_customer[[var]], min(train_data[[var]]), max(train_data[[var]]))
}

# Converting categorical variables of the new customer to factors (to match the training data)
binary_vars <- c("SecuritiesAccount", "CDAccount", "Online", "CreditCard", "Education_1", "Education_2", "Education_3")
new_customer[binary_vars] <- lapply(new_customer[binary_vars], factor)

# Performing k-NN classification for the new customer (using the best k = 1 from previous analysis)
knn_prediction <- knn(train_matrix, as.matrix(new_customer), cl = train_labels, k = 1)

# Printing the predicted class for the new customer
knn_prediction
## [1] 1
## Levels: 0 1
---
title: "ECON 3200: Homework 1"
author: "Satya Narayana Panda"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(dplyr)
```

The file `UniversalBank.rds` contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

Data Description:

| Variable            | Description                                                                         |
|:--------------------|:------------------------------------------------------------------------------------|
| ID                  | Customer ID                                                                         |
| Age                 | Customer's age in completed years                                                   |
| Experience          | # years of professional experience                                                  |
| Income              | Annual income of the customer ($000)                                                |
| ZIPCode             | Home Address ZIP code                                                               |
| Family              | Family size of the customer                                                         |
| CCAvg               | Avg. spending on credit cards per month ($000)                                      |
| Education_1         | Education Level = 1 if Undergrad; 0 otherwise                                       |
| Education_2         | Education Level = 1 if Graduate; 0 otherwise                                        |
| Education_3         | Education Level = 1 if Advanced/Professional; 0 otherwise                           |
| Mortgage            | Value of house mortgage if any. ($000)                                              |
| PersonalLoan        | Did this customer accept the personal loan offered in the last campaign?            |
| SecuritiesAccount   | Does the customer have a securities account with the bank?                          |
| CDAccount           | Does the customer have a certificate of deposit (CD) account with the bank?         |
| Online              | Does the customer use internet banking facilities?                                  |
| CreditCard          | Does the customer use a credit card issued by UniversalBank?                        |

### Question 1

Read and preprocess the data: 

a. Read the data file and assign it to an `R` object.  
b. Drop ID and ZIPCode from the dataset.  
c. Convert binary categorical variables into factor variables.  
d. Handle missing values, if there are any.  
e. Normalize all numerical variables.  


### _Answer to Question 1_

```{r Q1}
# Reading the data file into an R object

bank_data <- readRDS("~/Documents/UNH/Semester-3/SupervisedMachineLearning/test/UniversalBank.rds")

# Dropping 'ID' and 'ZIPCode'
bank_data <- bank_data %>% select(-ID, -ZIPCode)

# Converting binary categorical variables to factors
binary_vars <- c("PersonalLoan", "SecuritiesAccount", "CDAccount", "Online", "CreditCard")

bank_data[binary_vars] <- lapply(bank_data[binary_vars], factor)

# Checking for missing values
sum(is.na(bank_data))

# Removing rows with missing values if any exist
bank_data <- na.omit(bank_data)

# Identifying numerical variables
numerical_vars <- c("Age", "Experience", "Income", "Family", "CCAvg", "Mortgage")

# Normalizing the numerical variables to scale between 0 and 1
bank_data[numerical_vars] <- lapply(bank_data[numerical_vars], function(x) (x - min(x)) / (max(x) - min(x)))

```
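
Min-max scaling maps each value x to (x - min) / (max - min). The sketch below shows the same idea on a toy data frame, assuming one wants to keep the raw minima and maxima so that the identical transformation can later be applied to new observations. The names `toy`, `toy_ranges`, and `rescale_with()` are hypothetical and the values are made up; this is an illustration, not the assignment's required method.

```r
# Toy data standing in for the numeric columns (values are made up)
toy <- data.frame(Age = c(25, 40, 65), Income = c(20, 84, 200))

# Store the raw minimum and maximum of each column before scaling
toy_ranges <- lapply(toy, range)

# Rescale a vector using a stored c(min, max) pair
rescale_with <- function(x, rng) (x - rng[1]) / (rng[2] - rng[1])

# Apply the stored ranges column by column
toy_scaled <- as.data.frame(Map(rescale_with, toy, toy_ranges))
toy_scaled
```

Keeping `toy_ranges` around is the design choice that matters here: once the data are overwritten with scaled values, the raw ranges can no longer be recovered from them.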


### Question 2

Partition the data into training (60%) and testing (40%) sets. Use top 60% as training sample and remaining as test data.

### _Answer to Question 2_

```{r Q2}
# Insert the code here

# Setting the seed for reproducibility
set.seed(123)

# Drawing a random 60% of the row indices for the training set
train_indices <- sample(1:nrow(bank_data), size = 0.6 * nrow(bank_data))

# Splitting the data
train_data <- bank_data[train_indices, ]  # 60% training data
test_data <- bank_data[-train_indices, ]   # 40% testing data

# Checking the size of train and test data
nrow(train_data)  # Should be 60% of the original data
nrow(test_data)   # Should be 40% of the original data


```
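
The question asks for the top 60% of rows as the training sample, whereas the chunk above draws a random 60%. A sequential split could be sketched as follows on a toy data frame; `toy`, `n_train`, `toy_train`, and `toy_test` are hypothetical names used only for illustration. Swapping this in for the random split would likely change the accuracies obtained in Question 3.

```r
# Toy data frame with 10 rows (values are made up)
toy <- data.frame(id = 1:10, x = seq(0.1, 1, by = 0.1))

# Number of rows in the training set: the first 60%
n_train <- round(0.6 * nrow(toy))

# First n_train rows for training, the remaining rows for testing
toy_train <- toy[seq_len(n_train), ]
toy_test  <- toy[-seq_len(n_train), ]

nrow(toy_train)
nrow(toy_test)
```

No seed is needed for this variant because nothing is sampled.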

### Question 3

Perform a k-NN classification with all predictors except ID and ZIP code using:  
a. k = 1.  
b. k = 2.  
c. k = 3.  
d. k = 4.   

For each k = 1, 2, 3, 4, compute the accuracy on the test dataset. Which k is the best?

### _Answer to Question 3_

```{r Q3}
# Insert the code here
# Loading the class package, which provides knn()
library(class)

# Define the target variable (PersonalLoan) for both training and test datasets
train_labels <- train_data$PersonalLoan
test_labels <- test_data$PersonalLoan

# Exclude the target variable from training and test datasets (use only predictors)
train_predictors <- train_data %>% select(-PersonalLoan)
test_predictors <- test_data %>% select(-PersonalLoan)

# Convert to matrix for knn function
train_matrix <- as.matrix(train_predictors)
test_matrix <- as.matrix(test_predictors)

# Function to compute accuracy
compute_accuracy <- function(k) {
  # Perform k-NN classification
  knn_predictions <- knn(train_matrix, test_matrix, cl = train_labels, k = k)
  
  # Calculate accuracy
  accuracy <- sum(knn_predictions == test_labels) / length(test_labels)
  return(accuracy)
}

# Compute accuracy for k = 1, 2, 3, 4
accuracy_k1 <- compute_accuracy(1)
accuracy_k2 <- compute_accuracy(2)
accuracy_k3 <- compute_accuracy(3)
accuracy_k4 <- compute_accuracy(4)

# Print the accuracies
accuracy_k1  # Accuracy for k = 1
accuracy_k2  # Accuracy for k = 2
accuracy_k3  # Accuracy for k = 3
accuracy_k4  # Accuracy for k = 4


```
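
With the accuracies from the knitted run (0.962, 0.9565, 0.96, 0.955), k = 1 scores highest. For context, only 9.6% of customers accepted the loan, so a classifier that always predicts 0 would already be right about 90.4% of the time on the full data set. The comparison across k can also be written as a loop that picks the best value programmatically. The sketch below uses synthetic data, and the names `x`, `y`, `train_idx`, `k_values`, `acc`, and `best_k` are hypothetical.

```r
library(class)

set.seed(1)  # knn() breaks ties at random, so fix the seed

# Synthetic two-class data (made up for illustration)
n <- 200
x <- matrix(runif(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 1, 1, 0), levels = c(0, 1))

# First 120 rows for training, the rest for testing
train_idx <- seq_len(120)
k_values <- 1:4

# Test accuracy for each candidate k
acc <- sapply(k_values, function(k) {
  pred <- knn(x[train_idx, ], x[-train_idx, ], cl = y[train_idx], k = k)
  mean(pred == y[-train_idx])
})
names(acc) <- paste0("k=", k_values)
acc

# which.max() returns the first maximum, so ties favour the smaller k
best_k <- k_values[which.max(acc)]
best_k
```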

### Question 4

Consider the following customer: 

Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, SecuritiesAccount = 0, CDAccount = 0, Online = 1, and CreditCard = 1.

Perform a k-NN classification with all predictors except ID and ZIP code using the best k. How would this customer be classified?

### _Answer to Question 4_

```{r Q4}
# Insert your code here

# Defining the new customer data
new_customer <- data.frame(
  Age = 40,
  Experience = 10,
  Income = 84,
  Family = 2,
  CCAvg = 2,
  Education_1 = 0,
  Education_2 = 1,
  Education_3 = 0,
  Mortgage = 0,
  SecuritiesAccount = 0,
  CDAccount = 0,
  Online = 1,
  CreditCard = 1
)

# Helper that min-max scales x given a minimum and a maximum
normalize <- function(x, min_val, max_val) {
  return((x - min_val) / (max_val - min_val))
}

# Scaling the new customer's numeric values with the min and max of train_data.
# Note: train_data was already scaled to [0, 1] in Question 1, so these values
# are close to 0 and 1 rather than the raw minima and maxima.
numerical_vars <- c("Age", "Experience", "Income", "Family", "CCAvg", "Mortgage")
for (var in numerical_vars) {
  new_customer[[var]] <- normalize(new_customer[[var]], min(train_data[[var]]), max(train_data[[var]]))
}

# Converting categorical variables of the new customer to factors (to match the training data)
binary_vars <- c("SecuritiesAccount", "CDAccount", "Online", "CreditCard", "Education_1", "Education_2", "Education_3")
new_customer[binary_vars] <- lapply(new_customer[binary_vars], factor)

# Performing k-NN classification for the new customer (using the best k = 1 from previous analysis)
knn_prediction <- knn(train_matrix, as.matrix(new_customer), cl = train_labels, k = 1)

# Printing the predicted class for the new customer
knn_prediction


```
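
Because `bank_data` was min-max scaled in Question 1, the minima and maxima of `train_data` are close to 0 and 1, so the loop above leaves the new customer's values near their raw scale while the training rows lie in [0, 1]. One possible remedy, sketched here on made-up data, is to compute the ranges from the raw values, reuse them for the new observation, and align its columns with the training matrix by name, since `knn()` matches columns by position. All names below (`raw_train`, `labels`, `ranges`, `rescale_with()`, `new_obs`) are hypothetical.

```r
library(class)

# Made-up raw training data: two numeric predictors and one 0/1 predictor
raw_train <- data.frame(
  Age    = c(25, 30, 45, 50, 60, 35),
  Income = c(20, 40, 150, 180, 90, 60),
  Online = c(0, 1, 1, 0, 1, 0)
)
labels <- factor(c(0, 0, 1, 1, 0, 0))
num_vars <- c("Age", "Income")

# Raw ranges, taken before any scaling
ranges <- lapply(raw_train[num_vars], range)
rescale_with <- function(x, rng) (x - rng[1]) / (rng[2] - rng[1])

# Scale the training data with the stored ranges
scaled_train <- raw_train
scaled_train[num_vars] <- Map(rescale_with, raw_train[num_vars], ranges)

# New observation, deliberately given in a different column order
new_obs <- data.frame(Online = 1, Income = 160, Age = 48)
new_obs[num_vars] <- Map(rescale_with, new_obs[num_vars], ranges)

# Align columns by name with the training data
new_obs <- new_obs[, names(scaled_train)]

pred <- knn(as.matrix(scaled_train), as.matrix(new_obs), cl = labels, k = 1)
pred
```

All columns are kept numeric in this sketch, so `as.matrix()` returns a numeric matrix rather than a character one.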

