Targil 1 Machines

library(rmarkdown)
###Question 1
##segment a
CA_data <- read.csv("/Users/yamshacham/Downloads/CAhousing.csv")
##segment b
dim(CA_data)

## [1] 20640    10

#There are 20640 rows and 10 columns in this data set
##segment c
summary(CA_data)

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
##

table(CA_data$ocean_proximity)

## 
##  <1H OCEAN     INLAND     ISLAND   NEAR BAY NEAR OCEAN 
##       9136       6551          5       2290       2658

#The ISLAND category stands out: it has only 5 observations, far fewer than any other level of ocean_proximity.
##segment d
library(correlation)
correlation(CA_data)

#These pairs of variables are highly correlated: total_bedrooms & households: 0.98, total_rooms & total_bedrooms: 0.93, total_rooms & households: 0.92, population & households: 0.91, total_rooms & population: 0.86, longitude & latitude: -0.92
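#As a cross-check that does not depend on the correlation package, pairwise Pearson correlations can be sketched with base R's cor(). The data frame below is simulated for illustration only (toy_data and its values are hypothetical, not the CA housing data); use = "pairwise.complete.obs" is one way to handle NAs such as those in total_bedrooms.

```r
# Simulated stand-in for a data frame with numeric and character columns
set.seed(1)
n <- 200
households <- rpois(n, 400) + 1
toy_data <- data.frame(
  households = households,
  total_bedrooms = households * 1.1 + rnorm(n, sd = 10),
  median_income = runif(n, 0.5, 15),
  ocean_proximity = sample(c("INLAND", "NEAR BAY"), n, replace = TRUE)
)
toy_data$total_bedrooms[1:5] <- NA  # mimic missing values

# Keep numeric columns only, then compute pairwise correlations
numeric_cols <- toy_data[sapply(toy_data, is.numeric)]
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))

# List pairs whose absolute correlation exceeds 0.85 (upper triangle only)
high <- which(abs(cor_matrix) > 0.85 & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r = cor_matrix[high]
)
print(high_pairs)
```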

###Question 2
library("ggplot2")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("MASS")

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

##segment a
ggplot(CA_data, aes(x= median_income, y= median_house_value)) + geom_point(alpha = 0.7, size = 0.5,color = "purple")

#There is a positive association between median income and median house value: districts with higher incomes tend to have more valuable houses.
##segment b
reg1 <- lm(median_house_value ~ median_income, data = CA_data)
##Segment c
summary(reg1)

## 
## Call:
## lm(formula = median_house_value ~ median_income, data = CA_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -540697  -55950  -16979   36978  434023 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    45085.6     1322.9   34.08   <2e-16 ***
## median_income  41793.8      306.8  136.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83740 on 20638 degrees of freedom
## Multiple R-squared:  0.4734, Adjusted R-squared:  0.4734 
## F-statistic: 1.856e+04 on 1 and 20638 DF,  p-value: < 2.2e-16

##segment d
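#Our interpretation is that the intercept (beta0 = 45085.6) is the predicted median house value for a district with a median income of 0, about $45,086. The slope means that each one-unit increase in median_income raises the predicted value by about $41,794. Assuming median_income is recorded in tens of thousands of dollars, as this dataset is commonly documented, one unit is $10,000 rather than $1,000. Both coefficients are significant (p < 2e-16).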
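#To make the interpretation concrete, the fitted line can be evaluated by hand from the two coefficients reported in summary(reg1). The helper predict_value is a hypothetical name introduced for illustration, and the income values are arbitrary.

```r
# Coefficients copied from the summary(reg1) output
beta0 <- 45085.6
beta1 <- 41793.8

# Hypothetical helper: predicted median house value for a given median_income
predict_value <- function(median_income) {
  beta0 + beta1 * median_income
}

# Predictions at a few arbitrary income levels (in the units of median_income)
incomes <- c(0, 1, 3)
print(data.frame(median_income = incomes, predicted = predict_value(incomes)))

# The difference between two consecutive units equals the slope
print(predict_value(4) - predict_value(3))
```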
##segment e
plot(reg1)

##segment f
pred_vector <- reg1$fitted.values
##Segment g
ggplot(CA_data, aes(x = median_income, y = median_house_value)) +  
  geom_point(alpha = 0.8, size = 0.5, color = "orange") +
  geom_smooth(method = "lm", se = TRUE, color = "purple") +  
  labs(
    title = "Effect of Median Income on Median House Value",
    x = "Median Income",
    y = "Median House Value"
  )

## `geom_smooth()` using formula = 'y ~ x'

#This regression is a fairly poor fit: R^2 is only 0.47, and many observations lie far from the line. It does capture the direction of the relationship, i.e. the two variables move together (as x increases, y increases).

###Question 3
##Segment a
#The anomaly is the horizontal line of points at the 5e+05 level, which in practice means that the median house value is capped at about $500k (the maximum is 500001).
hist(CA_data$median_house_value, breaks=100)

##segment b
model <- lm(median_house_value ~ ., data = CA_data)
summary(model)

## 
## Call:
## lm(formula = median_house_value ~ ., data = CA_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -556980  -42683  -10497   28765  779052 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.270e+06  8.801e+04 -25.791  < 2e-16 ***
## longitude                 -2.681e+04  1.020e+03 -26.296  < 2e-16 ***
## latitude                  -2.548e+04  1.005e+03 -25.363  < 2e-16 ***
## housing_median_age         1.073e+03  4.389e+01  24.439  < 2e-16 ***
## total_rooms               -6.193e+00  7.915e-01  -7.825 5.32e-15 ***
## total_bedrooms             1.006e+02  6.869e+00  14.640  < 2e-16 ***
## population                -3.797e+01  1.076e+00 -35.282  < 2e-16 ***
## households                 4.962e+01  7.451e+00   6.659 2.83e-11 ***
## median_income              3.926e+04  3.380e+02 116.151  < 2e-16 ***
## ocean_proximityINLAND     -3.928e+04  1.744e+03 -22.522  < 2e-16 ***
## ocean_proximityISLAND      1.529e+05  3.074e+04   4.974 6.62e-07 ***
## ocean_proximityNEAR BAY   -3.954e+03  1.913e+03  -2.067  0.03879 *  
## ocean_proximityNEAR OCEAN  4.278e+03  1.570e+03   2.726  0.00642 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68660 on 20420 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.6465, Adjusted R-squared:  0.6463 
## F-statistic:  3112 on 12 and 20420 DF,  p-value: < 2.2e-16

##Segment c
#The model fits reasonably well: its R^2 is about 0.65 (adjusted 0.646), up from 0.47 with median_income alone. In our opinion we cannot omit any variable, since every coefficient is significant at the 5% level.
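#As a sanity check on the fit statistics, adjusted R^2 can be recomputed from the reported R^2, the residual degrees of freedom (20420) and the number of slope coefficients (12). This is the standard formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), applied to the numbers printed by summary(model).

```r
r_squared <- 0.6465    # Multiple R-squared from summary(model)
df_resid <- 20420      # residual degrees of freedom
p <- 12                # slope coefficients (numerator df of the F-statistic)
n <- df_resid + p + 1  # observations used after dropping rows with NAs

# n - p - 1 is exactly the residual degrees of freedom
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid
print(n)
print(round(adj_r_squared, 4))
```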

###Question 4
##Segment a
library(class)
library(ISLR2)

## 
## Attaching package: 'ISLR2'

## The following object is masked from 'package:MASS':
## 
##     Boston

dat <- Default
##segment b
dim(dat)

## [1] 10000     4

#There are 10000 rows and 4 columns
summary(dat)

##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

#Variables are "centered around 0" if their mean or median is approximately zero. For the numeric variables:
#balance: Mean = 835.4, Median = 823.6, so clearly not centered around 0.
#income: Mean = 33,517, Median = 34,553, so again not centered around 0.
#Conclusion: the numeric variables are not centered around 0.
#To detect outliers we use the Q1 - 1.5*IQR and Q3 + 1.5*IQR rule.
#balance: Q1 = 481.7, Q3 = 1166.3, so IQR = 1166.3 - 481.7 = 684.6. Bounds: Lower = 481.7 - 1.5*684.6 ≈ -545.2, Upper = 1166.3 + 1.5*684.6 ≈ 2193.2. Max = 2654.3 is above the upper bound, so balance has outliers.
#income: Q1 = 21340, Q3 = 43808, so IQR = 22468 and the bounds are -12362 and 77510. Min = 772 and Max = 73554 both fall inside the bounds, so income has no outliers under this rule.
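#The 1.5*IQR rule can be written as a small function. iqr_bounds is a hypothetical helper; the quartiles are copied from summary(dat), so this only reproduces the arithmetic and does not count how many observations are flagged.

```r
# Hypothetical helper: lower and upper fences from the first and third quartiles
iqr_bounds <- function(q1, q3) {
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

balance_bounds <- iqr_bounds(481.7, 1166.3)
income_bounds <- iqr_bounds(21340, 43808)
print(balance_bounds)
print(income_bounds)

# Compare the reported minima and maxima with the fences
print(c(balance_max_outside = 2654.3 > balance_bounds[["upper"]],
        income_max_outside = 73554 > income_bounds[["upper"]],
        income_min_outside = 772 < income_bounds[["lower"]]))
```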
?Default
sd(dat$income)

## [1] 13336.64

sd(dat$balance)

## [1] 483.715

#They are not standardized, since they are not centered around 0 and do not have a standard deviation of 1.
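#A minimal sketch of what standardization does, on simulated values rather than the Default data: after scale(), the mean is 0 and the standard deviation is 1. The vector toy_income is hypothetical, drawn only to resemble the scale of income.

```r
set.seed(42)
toy_income <- rnorm(1000, mean = 33500, sd = 13300)  # simulated, not real data

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
z <- as.numeric(scale(toy_income))

print(c(mean_before = mean(toy_income), sd_before = sd(toy_income)))
print(c(mean_after = round(mean(z), 10), sd_after = sd(z)))

# The same result by hand: subtract the mean, divide by the standard deviation
z_manual <- (toy_income - mean(toy_income)) / sd(toy_income)
print(all.equal(z, z_manual))
```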
##segment c
library(dplyr)
dat <- dat |>
  dplyr::select(balance, income, default)

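#scale() returns a one-column matrix, so as.numeric() keeps the columns as plain vectors
dat$income <- as.numeric(scale(dat$income))
dat$balance <- as.numeric(scale(dat$balance))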

#We split the data 80/20 (rows 1-8000 for training, rows 8001-10000 for testing)

xtrain <- dat[1:8000, c("balance", "income")]
xtest <- dat[8001:10000, c("balance", "income")]
ytrain <-dat[1:8000, "default"]
ytest <- dat[8001:10000, "default"]

library(class)
knn_1 <- knn(train = xtrain, test = xtest, cl = ytrain, k = 1)
knn_5 <- knn(train = xtrain, test = xtest, cl = ytrain, k = 5)
knn_20 <- knn(train = xtrain, test = xtest, cl = ytrain, k = 20)
knn_70 <- knn(train = xtrain, test = xtest, cl = ytrain, k = 70)

table_ytrain <- table(ytrain)
print(table_ytrain)

## ytrain
##   No  Yes 
## 7734  266

prop_ytrain <- prop.table(table_ytrain)
print(prop_ytrain)

## ytrain
##      No     Yes 
## 0.96675 0.03325

#The share of No is 96.675% and the share of Yes is 3.325%. A simple, safe rule is to always predict No, which would be correct for 96.675% of the training observations. We would not use this rule, because it fails 100% of the time on the Yes cases, which are exactly the cases we are looking for. Instead we would judge the model with precision, recall and the F1 score, which measure how well the Yes class is predicted rather than overall accuracy.
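#The point about the always-No rule can be illustrated with the training counts printed by table(ytrain) (7734 No, 266 Yes). The function classification_metrics is a hypothetical helper that treats "Yes" as the positive class; the labels are rebuilt from the counts, and the predictions are the constant rule, not output from the knn models.

```r
# Rebuild labels from the printed training counts
actual <- factor(c(rep("No", 7734), rep("Yes", 266)), levels = c("No", "Yes"))
always_no <- factor(rep("No", length(actual)), levels = c("No", "Yes"))

# Hypothetical helper: accuracy, precision, recall and F1 for the positive class
classification_metrics <- function(actual, predicted, positive = "Yes") {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(accuracy = mean(predicted == actual),
    precision = precision, recall = recall, f1 = f1)
}

# High accuracy, but recall for "Yes" is zero and precision is undefined
baseline <- classification_metrics(actual, always_no)
print(baseline)
```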

###Question 5
##Segment a
set.seed(123)
train_cv_index <- sample(1:nrow(CA_data), 0.9*nrow(CA_data))
train_cv <- CA_data[train_cv_index, ]
test <- CA_data[-train_cv_index, ]

set.seed(321)
cv_index <- sample(1:nrow(train_cv), 0.15*nrow(CA_data)) 
cross_val <- train_cv[cv_index, ]
train <- train_cv[-cv_index, ]
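#The sizes produced by this split can be checked on row indices alone, without the data. This sketch assumes the row count reported by dim(CA_data) (20640) and mirrors the two sampling calls; the variable names ending in _idx and _pos are introduced here for illustration.

```r
n_total <- 20640  # rows reported by dim(CA_data)

set.seed(123)
train_cv_idx <- sample(1:n_total, 0.9 * n_total)  # 90% for train + validation
test_idx <- setdiff(1:n_total, train_cv_idx)      # remaining 10% for test

set.seed(321)
# Positions within train_cv; the size is 15% of the full data, as in the split code
cv_pos <- sample(seq_along(train_cv_idx), 0.15 * n_total)
cv_idx <- train_cv_idx[cv_pos]
train_idx <- train_cv_idx[-cv_pos]

sizes <- c(train = length(train_idx), validation = length(cv_idx),
           test = length(test_idx))
print(sizes)
print(round(sizes / n_total, 2))

# The three sets should not share any rows
overlap <- length(intersect(train_idx, cv_idx)) +
  length(intersect(train_idx, test_idx)) +
  length(intersect(cv_idx, test_idx))
print(overlap)
```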

##Segment b
x <- train$median_income
y <- train$median_house_value

ks_box_05 <- ksmooth(x, y, kernel = "box", bandwidth = 0.5)
ks_gauss_05 <- ksmooth(x, y, kernel = "normal", bandwidth = 0.5)
ks_gauss_15 <- ksmooth(x, y, kernel = "normal", bandwidth = 1.5)

plot(x, y, pch = 16, col = "black",
     xlab = "Median Income",
     ylab = "Median House Value",
     main = "Kernel Regression Smoothing")

lines(ks_box_05, col = "red", lwd = 2)
lines(ks_gauss_05, col = "blue", lwd = 2)
lines(ks_gauss_15, col = "green", lwd = 2)

#We observe that a higher bandwidth gives a smoother curve that follows the overall pattern better, and that the box (uniform) kernel is rougher than the normal kernel at the same bandwidth.
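#For intuition about what a kernel smoother computes, here is a minimal Nadaraya-Watson sketch with a Gaussian kernel on simulated data. nw_gaussian and its sd argument are hypothetical; ksmooth() rescales its bandwidth argument internally, so sd here is not numerically the same as bandwidth there.

```r
# Hypothetical helper: Gaussian-kernel weighted average of y at each query point
nw_gaussian <- function(x, y, x_new, sd) {
  sapply(x_new, function(x0) {
    w <- dnorm(x - x0, sd = sd)  # weights decay with distance from x0
    sum(w * y) / sum(w)
  })
}

set.seed(7)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)  # simulated, not the housing data
grid <- seq(1, 9, by = 0.5)

fit_narrow <- nw_gaussian(x, y, grid, sd = 0.2)
fit_wide <- nw_gaussian(x, y, grid, sd = 2)

# A wider kernel averages over more points, so the fitted curve varies less
print(c(range_narrow = diff(range(fit_narrow)),
        range_wide = diff(range(fit_wide))))
```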

##Segment c

mse_function <- function(true, predicted) {
  mean((true - predicted)^2)
}

kernel_types <- c("box", "normal")
bandwidth_values <- c(0.5, 1, 1.5, 2)
mse_output <- list()

for (kernel in kernel_types) {
  for (bw in bandwidth_values) {
    prediction_train <- ksmooth(train$median_income, train$median_house_value, x.points = train$median_income, kernel = kernel, bandwidth = bw)$y
    prediction_cv <- ksmooth(train$median_income, train$median_house_value, x.points = cross_val$median_income, kernel = kernel, bandwidth = bw)$y
    
    mse_train <- mse_function(train$median_house_value, prediction_train)
    mse_cv <- mse_function(cross_val$median_house_value, prediction_cv)
    
    mse_output[[paste(kernel, bw)]] <- c(Train_MSE = mse_train, CV_MSE = mse_cv)
  }
}

mse_table <- do.call(rbind, mse_output)
mse_table <- round(mse_table, 2)
print(mse_table)

##              Train_MSE      CV_MSE
## box 0.5    19945706795          NA
## box 1      19754611684 19523055211
## box 1.5    19434907778 19195683709
## box 2      19035472978 18794755446
## normal 0.5 19903473371 19684880104
## normal 1   19586014901 19349007350
## normal 1.5 19115996598 18868109867
## normal 2   18561390424 18302226446
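#One detail worth checking when computing errors like these: ksmooth() sorts x.points before evaluating, so the returned $y follows increasing x rather than the original row order. The sketch below, on simulated data, shows one way to restore the original order before computing an MSE; aligned_ksmooth is a hypothetical helper, and the numbers it prints are unrelated to the table of housing MSEs.

```r
mse <- function(true, predicted) mean((true - predicted)^2, na.rm = TRUE)

# Hypothetical helper: kernel predictions returned in the order of x_new
aligned_ksmooth <- function(x, y, x_new, kernel = "normal", bandwidth = 1.5) {
  fit <- ksmooth(x, y, kernel = kernel, bandwidth = bandwidth, x.points = x_new)
  pred <- numeric(length(x_new))
  pred[order(x_new)] <- fit$y  # fit$y follows sort(x_new); undo the sort
  pred
}

set.seed(11)
x_train <- runif(500, 0, 10)
y_train <- 3 * x_train + rnorm(500)  # simulated, not the housing data
x_val <- runif(200, 1, 9)            # unsorted validation inputs
y_val <- 3 * x_val + rnorm(200)

raw <- ksmooth(x_train, y_train, kernel = "normal", bandwidth = 1.5,
               x.points = x_val)
print(isTRUE(all.equal(raw$x, sort(x_val))))  # is the output x in sorted order?

# Compare the error with and without restoring the original order
mse_unaligned <- mse(y_val, raw$y)
mse_aligned <- mse(y_val, aligned_ksmooth(x_train, y_train, x_val))
print(c(unaligned = mse_unaligned, aligned = mse_aligned))
```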

library(ggplot2)
library(tibble)

mse_plot_data <- rownames_to_column(as.data.frame(mse_table), var = "Kernel_BW")
mse_plot_data <- mse_plot_data |>
  tidyr::separate(Kernel_BW, into = c("Kernel", "Bandwidth"), sep = " ") |>
  mutate(Bandwidth = as.numeric(Bandwidth))
mse_plot_data <- na.omit(mse_plot_data)
ggplot(mse_plot_data, aes(x = Bandwidth)) +
  geom_line(aes(y = Train_MSE, color = Kernel), linetype = "dashed", linewidth = 1) +
  geom_line(aes(y = CV_MSE, color = Kernel), linewidth = 1) +
  geom_point(aes(y = Train_MSE, color = Kernel), shape = 1, size = 2) +
  geom_point(aes(y = CV_MSE, color = Kernel), shape = 16, size = 2) +
  labs(
    title = "MSE vs Bandwidth for Each Kernel",
    x = "Bandwidth",
    y = "Mean Squared Error",
    color = "Kernel Type"
  ) +
  theme_minimal()

#In simple words: as noted in the previous segment, as the bandwidth increases the prediction becomes smoother and the model overfits less. The normal kernel gives smoother results than the box kernel and has a lower CV MSE at every bandwidth. The CV MSE for box 0.5 is NA, most likely because some validation points have no training observations inside the window. In the table the CV MSE keeps falling as the bandwidth grows, and its lowest value is for the normal kernel with bandwidth 2 (18302226446), not 1.5 (18868109867). The next segment evaluates the normal kernel with bandwidth 1.5.

##segment d
ks_gauss_15_test <- ksmooth(train$median_income, train$median_house_value, x.points = test$median_income, kernel = "normal", bandwidth = 1.5)

pred_test <- ks_gauss_15_test$y
mse_test <- mse_function(test$median_house_value, pred_test)
ks_gauss_15_cv <- ksmooth(train$median_income, train$median_house_value,
                          x.points = cross_val$median_income, kernel = "normal", bandwidth = 1.5)
pred_cv <- ks_gauss_15_cv$y
mse_cv <- mse_function(cross_val$median_house_value, pred_cv)

mse_test

## [1] 18734840838

mse_cv

## [1] 18868109867

#Since the two values are very close, the model behaves similarly on the test data and on the CV data. This suggests that the chosen bandwidth (1.5) and kernel (normal) were not overfitted to the CV set and are a reasonable choice for the model.

##segment e
#We don't use the test set to tune the bandwidth because that would make it less useful for checking the model's true performance. If we tuned the model on the test data, the results could be biased, and we wouldn't know whether the model works well on new, unseen data. Instead, we use the cross-validation set to adjust the model, so the test set remains a fair measure of how the model performs in the real world.

Yam Shacham & Eyal Torjman

2025-04-26