1) Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

  1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

ANS: This is a regression problem because CEO salary is a continuous numerical variable, and we are interested in modeling its relationship with the other variables. We are most interested in inference, because the goal is to understand which factors affect CEO salary rather than to predict it. n = 500, since the data cover 500 firms; p = 3, since profit, number of employees, and industry are recorded as predictors.

  1. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

ANS: This is a classification problem because the response variable (success/failure) is categorical. We are interested in prediction, since the goal is to determine whether the new product will succeed based on data from past launches.

n = 20 (the number of similar products); p = 13 (price charged, marketing budget, competition price, and the ten other variables).

  1. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

ANS: This is a regression problem because the target variable (% change in the USD/Euro exchange rate) is continuous, and we are interested in prediction because the goal is to forecast the exchange rate's weekly change from the weekly changes in the world stock markets.

Since we collect weekly data for all of 2012, n = 52 (weeks); p = 3 (% change in the US market, % change in the British market, and % change in the German market).

2) We now revisit the bias-variance decomposition.

  1. Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

ANS: Please see the attached picture; a code sketch of the same curves is included below.
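
As a supplement (not part of the attached picture), here is a minimal R sketch with made-up illustrative curves; the exact functional forms are arbitrary and only the qualitative shapes matter.

flexibility <- seq(1, 10, length.out = 100)
bias_sq     <- 6 * exp(-0.5 * flexibility)        # squared bias: decreases with flexibility
variance    <- 0.05 * exp(0.5 * flexibility)      # variance: increases with flexibility
irreducible <- rep(1, length(flexibility))        # Bayes/irreducible error: constant
test_error  <- bias_sq + variance + irreducible   # expected test error: U-shaped
train_error <- 5 * exp(-0.45 * flexibility)       # training error: keeps decreasing

plot(flexibility, test_error, type = "l", col = "red", ylim = c(0, 9),
     xlab = "Flexibility", ylab = "Error", main = "Bias-Variance Trade-off (illustrative)")
lines(flexibility, bias_sq,     col = "blue")
lines(flexibility, variance,    col = "orange")
lines(flexibility, train_error, col = "darkgreen")
lines(flexibility, irreducible, col = "gray", lty = 2)
legend("topright",
       legend = c("Test error", "Bias^2", "Variance", "Training error", "Irreducible error"),
       col = c("red", "blue", "orange", "darkgreen", "gray"),
       lty = c(1, 1, 1, 1, 2), cex = 0.8)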

  1. Explain why each of the five curves has the shape displayed in part (a).

ANS: Bias² (decreasing curve): bias represents the error introduced when a model is too simple to capture the real patterns in the data. Less flexible models (like linear regression) make strong assumptions and fail to capture complex relationships, resulting in high bias. As flexibility increases (e.g., with decision trees or neural networks), the model can fit the data better, so the squared bias decreases.

Variance (increasing curve): variance measures how sensitive the fitted model is to fluctuations in the training data. As flexibility increases, the model focuses more on specific details and noise in the training data, making it less generalizable. For example, a very flexible model such as a deep decision tree can essentially memorize the training data, so variance increases with flexibility.

Training error (decreasing curve): training error measures how well the model fits the training data. As flexibility increases, the model can fit the training data more and more closely; for a highly flexible model the training error can approach zero because the model can memorize all of the training points.

Test error (U-shaped curve): test error initially decreases because the model captures more of the meaningful patterns as flexibility increases. Beyond a certain point, however, the model becomes too flexible and starts overfitting, learning noise rather than real relationships, so test error rises again after the optimal flexibility. This produces the U shape.

Irreducible error (flat line): irreducible error is caused by noise or randomness in the data that no model can eliminate, including factors not captured by the predictors and inherent variability in the response. Since this error does not depend on the model, it remains constant across all levels of flexibility.
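
These shapes reflect the bias-variance decomposition of the expected test error at a test point x0:

\[ E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon) \]

The first two terms trace the variance and squared-bias curves, and the last term is the irreducible error, so the test-error curve is their sum.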

3) What are the advantages and disadvantages of a very flexible vs a less flexible approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

ANS:

Flexible method advantages:

- Decreased bias (fewer assumptions about the functional form of f)
- Can capture non-linear relationships
- Can capture complex variable interactions
- Often require fewer assumptions about the data
- Typically higher predictive accuracy

Flexible method disadvantages:

- Increased variance (easier to overfit)
- Require tuning
- Increased training times
- Require more variables and observations to work optimally
- Less interpretable

A less flexible method has a lower risk of overfitting and is more interpretable, since simpler models are easier to understand and explain. It also has lower computational cost and trains faster than a flexible model, and it works well with small datasets. However, it has higher bias and struggles when relationships are nonlinear or require detailed feature interactions.

A more flexible approach is preferred when: the primary concern is predictive power; there is sufficient computing power for variance-controlling measures (e.g., validation); there is a large number of variables; the relationship between the features and the response is complex and nonlinear; there is a large amount of data; and regularization can be applied to control overfitting.

A less flexible approach is preferred when: we are more interested in inference; we need to explain why an observation receives a particular prediction; the dataset is small; training time must be low, especially for real-time applications; or the problem is effectively captured by straightforward assumptions. A small simulated comparison is sketched below.
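
As an illustration (not part of the exercise), the following minimal simulation compares a rigid linear fit with a very flexible smoothing spline on noisy nonlinear data; the true function, sample size, and degrees of freedom are arbitrary choices for the sketch.

set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)     # true relationship is nonlinear
train <- sample(100, 50)

lin_fit  <- lm(y ~ x, subset = train)                   # less flexible: high bias, low variance
flex_fit <- smooth.spline(x[train], y[train], df = 30)  # very flexible: low bias, high variance

test_mse <- function(pred) mean((y[-train] - pred)^2)
test_mse(predict(lin_fit, data.frame(x = x[-train])))   # tends to underfit the sine curve
test_mse(predict(flex_fit, x[-train])$y)                # may overfit the training noise

On data like this the linear fit typically underfits, very large df can overfit, and an intermediate flexibility usually gives the lowest test error.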

4) The table below provides a training data set containing six observations, three predictors, and one qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

Let's input the data into a data frame.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- data.frame(Obs= c(1,2,3,4,5,6),
                   X1 = c(0, 2, 0, 0, -1, 1), 
                   X2 = c(3, 0, 1, 1, 0, 1), 
                   X3 = c(0, 0, 3, 2, 1, 1), 
                   Y = c('Red', 'Red', 'Red', 'Green', 'Green', 'Red'), 
                   stringsAsFactors = F)

colnames(data) <- c('obs','X1', 'X2', 'X3', 'Y')
data
##   obs X1 X2 X3     Y
## 1   1  0  3  0   Red
## 2   2  2  0  0   Red
## 3   3  0  1  3   Red
## 4   4  0  1  2 Green
## 5   5 -1  0  1 Green
## 6   6  1  1  1   Red
  1. Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

The Euclidean distance between an observation and the test point is

\[ d = \sqrt{(X_1 - x_1)^2 + (X_2 - x_2)^2 + (X_3 - x_3)^2} \]

Now let's calculate the Euclidean distance for each observation.

Z <- c(0, 0, 0)

data$dist_X_Z <- round(sqrt((data$`X1` - Z[1])^2 + 
                        (data$`X2` - Z[2])^2 + 
                        (data$`X3` - Z[3])^2),2)

colnames(data)[6] <- 'Distance'
data
##   obs X1 X2 X3     Y Distance
## 1   1  0  3  0   Red     3.00
## 2   2  2  0  0   Red     2.00
## 3   3  0  1  3   Red     3.16
## 4   4  0  1  2 Green     2.24
## 5   5 -1  0  1 Green     1.41
## 6   6  1  1  1   Red     1.73
  1. What is our prediction with K = 1? Why?

Since the single nearest neighbor to the test point is observation 5 (distance 1.41), the prediction with K = 1 is Green.

  1. What is our prediction with K = 3? Why?

The three nearest neighbors to the test point are observations 5, 6, and 2 (distances 1.41, 1.73, and 2.00). Among them, two are Red and one is Green, so the prediction with K = 3 is Red. A small sketch of this vote in code follows.
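
As a minimal sketch (using the Distance column computed above; the helper name knn_predict is just for illustration):

# sort observations by distance to the test point and take a majority vote
knn_predict <- function(k) {
  nearest <- data[order(data$Distance), ][1:k, ]   # k closest observations
  names(which.max(table(nearest$Y)))               # most common class among them
}
knn_predict(1)   # "Green" (observation 5)
knn_predict(3)   # "Red"   (observations 5, 6, 2)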

  1. If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?

If the Bayes decision boundary is highly non-linear, we would expect a small K to perform better. A smaller K gives a more flexible KNN boundary that can follow local structure, whereas a larger K averages over many neighbors, oversmoothing the decision surface and failing to capture the complexity of the Bayes boundary.

LAB-1

  1. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
if (!require(ISLR)) install.packages("ISLR")
## Loading required package: ISLR
library(ISLR)
data("Auto")
Auto <- na.omit(Auto)  # safeguard: the ISLR version of Auto already drops rows with missing values
  1. Which of the predictors are quantitative, and which are qualitative?

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year

Qualitative: origin and name
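
As a quick check of the stored types (output not shown): origin is stored as a number but encodes a category (1 = American, 2 = European, 3 = Japanese), so it is best treated as qualitative.

str(Auto)   # inspect the class of each column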

  1. What is the range of each quantitative predictor? You can answer this using the range() function.
range_Auto <- data.frame(sapply(Auto[1:7], range))
rownames(range_Auto) <- c("min:", "max:")
range_Auto
##       mpg cylinders displacement horsepower weight acceleration year
## min:  9.0         3           68         46   1613          8.0   70
## max: 46.6         8          455        230   5140         24.8   82

We can see that the range of each quantitative predictor has been displayed.

  1. What is the mean and standard deviation of each quantitative predictor?
sapply(Auto[1:7], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592
sapply(Auto[ ,1:7], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737
  1. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
Auto_2 <- Auto[-c(10:85), ]
range_Auto_2 <- data.frame(sapply(Auto_2[ ,1:7], range))
rownames(range_Auto_2) <- c("min:", "max:")
range_Auto_2
##       mpg cylinders displacement horsepower weight acceleration year
## min: 11.0         3           68         46   1649          8.5   70
## max: 46.6         8          455        230   4997         24.8   82
sapply(Auto_2[ ,1:7], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570
sapply(Auto_2[ ,1:7], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217

These are the range, mean, and standard deviation of each quantitative predictor after removing observations 10 through 85, i.e., using observations 1-9 and 86 onward.

  1. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
library(dplyr)
library(ggplot2)
pairs(Auto %>% select(-name, -origin), main="Scatterplot Matrix")

We can see strong negative correlations between mpg and variables like weight, horsepower, and displacement. weight and displacement show a positive relationship, which is expected because larger cars generally have both higher weight and displacement.

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  ggtitle("MPG vs Horsepower")
## `geom_smooth()` using formula = 'y ~ x'

There is a clear negative relationship between mpg and horsepower: as horsepower increases, fuel efficiency decreases. The geom_smooth line confirms the downward trend, indicating a strong negative association.

ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  ggtitle("MPG vs Weight")
## `geom_smooth()` using formula = 'y ~ x'

Similar to the plot above, there is a negative correlation between mpg and weight. Heavier cars tend to have lower mileage, which aligns with real-world expectations, and the regression line reinforces this.

The negative relationships between mpg and horsepower/weight suggest that high-performance, heavier cars consume more fuel. When predicting mpg, weight and horsepower are strong candidate predictors for regression models. Other predictors, such as displacement and acceleration, might also play a role, but they are not as strong as these.

  1. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Referring to the scatterplot matrix above, all quantitative predictors appear to have some relationship with the response. name (the name of the vehicle) is categorical and has 301 unique values, so it is not a good choice as a predictor on its own, particularly for such a small dataset.
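
One can verify this count directly (output not shown):

length(unique(Auto$name))   # number of distinct vehicle names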

However, we may be able to extract useful information by converting name into the brand of the car. Below, I take the first word of each name as the brand and collapse the variable down to 10 levels (the nine most common brands plus "Other"):

Auto$brand <- sub(" .*", "", Auto$name)  
brand_counts <- table(Auto$brand) 
top_brands <- names(sort(brand_counts, decreasing = TRUE)[1:9])  
Auto$brand <- ifelse(Auto$brand %in% top_brands, Auto$brand, "Other")
Auto$brand <- as.factor(Auto$brand)
table(Auto$brand)
## 
##       amc     buick chevrolet    datsun     dodge      ford     Other  plymouth 
##        27        17        43        23        28        48       134        31 
##   pontiac    toyota 
##        16        25

Let's visualize brand vs mpg:

ggplot(Auto, aes(x = brand, y = mpg, fill = brand)) + 
geom_boxplot() + 
  theme(legend.position = "none") + 
  labs(title = "Brand vs Mpg - Boxplot", 
       subtitle = "Engineered feature", 
       x = "Brand", 
       y = "MPG")

The boxplot shows that mpg varies noticeably across brands, so the engineered brand feature could help predict mpg. Since the raw name variable needs this kind of tweaking before it is usable, we set it aside and focus on the strong numeric predictors of mpg.

Let's check the various variables:

# MPG vs Weight
ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  ggtitle("MPG vs Weight")
## `geom_smooth()` using formula = 'y ~ x'

# MPG vs Horsepower
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  ggtitle("MPG vs Horsepower")
## `geom_smooth()` using formula = 'y ~ x'

# MPG vs Displacement
ggplot(Auto, aes(x = displacement, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "green") +
  ggtitle("MPG vs Displacement")
## `geom_smooth()` using formula = 'y ~ x'

# MPG vs Year 
ggplot(Auto, aes(x = year, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "purple") +
  ggtitle("MPG vs Year")
## `geom_smooth()` using formula = 'y ~ x'

cor_matrix <- cor(Auto[, c("mpg", "displacement", "horsepower", "weight", "acceleration", "year", "cylinders")])
print(cor_matrix)
##                     mpg displacement horsepower     weight acceleration
## mpg           1.0000000   -0.8051269 -0.7784268 -0.8322442    0.4233285
## displacement -0.8051269    1.0000000  0.8972570  0.9329944   -0.5438005
## horsepower   -0.7784268    0.8972570  1.0000000  0.8645377   -0.6891955
## weight       -0.8322442    0.9329944  0.8645377  1.0000000   -0.4168392
## acceleration  0.4233285   -0.5438005 -0.6891955 -0.4168392    1.0000000
## year          0.5805410   -0.3698552 -0.4163615 -0.3091199    0.2903161
## cylinders    -0.7776175    0.9508233  0.8429834  0.8975273   -0.5046834
##                    year  cylinders
## mpg           0.5805410 -0.7776175
## displacement -0.3698552  0.9508233
## horsepower   -0.4163615  0.8429834
## weight       -0.3091199  0.8975273
## acceleration  0.2903161 -0.5046834
## year          1.0000000 -0.3456474
## cylinders    -0.3456474  1.0000000

mpg has a strong negative correlation with weight, horsepower, cylinders, and displacement, so we can deduce the following:

1. Heavier cars tend to have lower fuel efficiency.
2. More powerful engines (high horsepower) consume more fuel.
3. Larger engine displacement is associated with lower fuel efficiency.
4. More cylinders increase engine size and fuel use.

mpg has a positive correlation with year and a weaker positive correlation with acceleration, so we can say:

1. Newer cars tend to be more fuel-efficient.
2. Cars with better acceleration tend to have somewhat higher mpg.

Strong predictors: horsepower and weight show a clear negative correlation with mpg; higher values lead to lower fuel efficiency. Cylinders and displacement also act as strong predictors. Moderate or weak: year shows a moderate positive relationship with mpg (newer models tend to have higher mpg), and acceleration has a weaker, less clear correlation with mpg.

Primary influences on mpg: vehicles with higher horsepower, heavier weight, and larger displacement tend to have lower mpg, making these key factors for prediction. The plots of mpg vs weight and mpg vs horsepower above are good examples of this; a quick model-based check is sketched below.
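
As an informal check (not part of the exercise), one could fit a simple linear regression on these candidates and confirm that the coefficient signs agree with the correlations above (output not shown):

# illustrative: mpg regressed on the strongest candidate predictors
fit <- lm(mpg ~ weight + horsepower + displacement + year, data = Auto)
summary(fit)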

  1. This exercise involves the Boston housing data set.
  1. To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. How many rows are in this data set? How many columns? What do the rows and columns represent?
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Auto
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
data(Boston)

num_rows <- nrow(Boston)
num_cols <- ncol(Boston)
cat("Number of rows:", num_rows, "\n")
## Number of rows: 506
cat("Number of columns:", num_cols, "\n")
## Number of columns: 13

Rows represent census tracts in the Boston area (a census tract is a geographic region defined for statistical purposes), and columns represent variables recorded for each tract, such as crime rate, tax rate, and median home value.
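
The column names can be listed directly, and ?Boston gives the full description of each variable (output not shown):

names(Boston)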

  1. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
pairs(Boston[, names(Boston) != "medv"],   # exclude the response; ISLR2's Boston has 13 columns
      main = "Pairwise Scatterplots of Boston Housing Data")

There is a negative correlation between rm (average rooms per dwelling) and lstat (percentage of lower-status population), suggesting that wealthier neighborhoods (low lstat) tend to have larger homes. There is a positive correlation between tax and rad (index of accessibility to radial highways), which may indicate that well-connected areas have higher property taxes.

  1. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
cor_matrix <- cor(Boston)
cor_crim <- cor_matrix["crim", ]
print(cor_crim)
##        crim          zn       indus        chas         nox          rm 
##  1.00000000 -0.20046922  0.40658341 -0.05589158  0.42097171 -0.21924670 
##         age         dis         rad         tax     ptratio       lstat 
##  0.35273425 -0.37967009  0.62550515  0.58276431  0.28994558  0.45562148 
##        medv 
## -0.38830461

rad (0.63): the strongest positive correlation; areas with higher accessibility to radial highways tend to have higher crime rates, suggesting that well-connected locations may experience more criminal activity. tax (0.58) and lstat (0.46) also show sizable positive correlations with crime. medv (-0.39): higher median home values are linked to lower crime rates, supporting the idea that wealthier neighborhoods tend to have less crime.
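
As a quick visual check (illustrative; plots not shown), one could plot crime against the variables with the largest positive and negative correlations:

par(mfrow = c(1, 2))
plot(Boston$rad, Boston$crim, xlab = "rad", ylab = "crim", main = "crim vs rad")
plot(Boston$medv, Boston$crim, xlab = "medv", ylab = "crim", main = "crim vs medv")
par(mfrow = c(1, 1))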

  1. Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

Min: 0.00632, so some areas have extremely low crime rates. Median: 0.25651, so half of the tracts have a crime rate below this value. Mean: 3.61352, which is much higher than the median, suggesting a right-skewed distribution with extreme outliers. Max: 88.9762, so some census tracts have exceptionally high crime rates.

summary(Boston$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

Min: 187.0, so some areas have relatively low property taxes. Median: 330.0, so half of the tracts have a tax rate below this value. Mean: 408.2, again higher than the median, suggesting a right-skewed distribution. Max: 711.0, so some areas have exceptionally high property-tax rates.

summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

Min: 12.6, so some areas have low pupil-teacher ratios, suggesting better-staffed schools. Median: 19.05, so most areas have moderate class sizes. Max: 22.0, so certain tracts have high pupil-teacher ratios, indicating larger class sizes and potentially lower educational quality.
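
To flag the tracts that stand out, one could count how many exceed a chosen threshold on each measure (the cutoffs below are arbitrary illustrations; output not shown):

sum(Boston$crim > 25)       # tracts with very high crime rates
sum(Boston$tax >= 666)      # tracts at the high end of the tax-rate range
sum(Boston$ptratio >= 21)   # tracts with large pupil-teacher ratios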

  1. How many of the census tracts in this data set bound the Charles river?
num_tracts_charles <- sum(Boston$chas == 1)
cat("Number of census tracts bounding Charles River:", num_tracts_charles, "\n")
## Number of census tracts bounding Charles River: 35

From the output, we see that 35 census tracts in the Boston dataset have chas = 1, indicating they border the Charles River.

  1. What is the median pupil-teacher ratio among the towns in this data set?
median_ptratio <- median(Boston$ptratio)
cat("Median pupil-teacher ratio:", median_ptratio, "\n")
## Median pupil-teacher ratio: 19.05
  1. Which census tract of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
min_medv_index <- which.min(Boston$medv)
lowest_medv_tract <- Boston[min_medv_index, ]
print(lowest_medv_tract)
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5

This tract (row 399) has very high crime (crim = 38.35, near the top of the 0.006 to 88.98 range), no residential zoning (zn = 0), a large industrial share (indus = 18.1), high pollution (nox = 0.693), the oldest possible housing stock (age = 100), a very high accessibility index (rad = 24), a tax rate of 666 and a pupil-teacher ratio of 20.2 (both near the top of their ranges), and a very high lstat (30.59). High crime, industrialization, pollution, and aging housing all correlate with low property values, and the lack of residential zoning and small homes (rm = 5.45) reduce values further. Urban planning improvements could help increase property values in such areas.
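
A compact way to make the comparison with the overall ranges explicit (a sketch; the comparison object name is arbitrary):

# stack this tract's values against the minimum and maximum of each variable
comparison <- rbind(tract_399 = unlist(lowest_medv_tract),
                    min       = sapply(Boston, min),
                    max       = sapply(Boston, max))
round(comparison, 3)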

  1. In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
num_above_7 <- sum(Boston$rm > 7)
num_above_8 <- sum(Boston$rm > 8)
cat("Census tracts with >7 rooms per dwelling:", num_above_7, "\n")
## Census tracts with >7 rooms per dwelling: 64
cat("Census tracts with >8 rooms per dwelling:", num_above_8, "\n")
## Census tracts with >8 rooms per dwelling: 13

Larger homes tend to be in more affluent neighborhoods, likely corresponding to higher median home values (medv) and lower crime rates (crim); a quick comparison is sketched below.
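
As a quick check (output not shown), one could compare the 13 tracts that average more than eight rooms with the full data set on a few key variables; in general we would expect these tracts to show lower crim, lower lstat, and higher medv.

summary(Boston[Boston$rm > 8, c("crim", "lstat", "medv")])   # tracts averaging > 8 rooms
summary(Boston[, c("crim", "lstat", "medv")])                # all tracts, for comparison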