ANS : This is a regression problem because CEO salary is a continuous numerical variable, and we want to model its relationship with other variables. We are most interested in inference, because the goal is to understand which factors affect CEO salary rather than to predict it. n = 500, since the data cover 500 firms; p = 3, since we record three predictor variables.
ANS: This is a classification scenario because the response variable (success/failure) is categorical. We are interested in prediction, since the goal is to determine whether a new product will succeed based on data from past products.
n = 20 (number of products), p = 13 (number of predictor variables).
ANS : This is a regression problem because the target variable (the % change in the USD/Euro exchange rate) is continuous, and we are interested in prediction because the goal is to forecast the exchange rate from weekly changes in the world stock markets.
Since we collect weekly data for a year, n = 52 (52 weeks) and p = 3 (% change in the US market, the British market, and the German market).
ANS: Please see the attached sketch.
ANS: Bias² (decreasing curve): bias represents the error introduced when a model is too simple to capture the real patterns in the data. Less flexible models (like linear regression) make strong assumptions and fail to capture complex relationships, resulting in high bias. As model flexibility increases (e.g., with decision trees or neural networks), the model can fit the underlying structure more closely, so the squared bias decreases.
Variance (increasing curve): variance measures how sensitive the model is to fluctuations in the training data. As flexibility increases, the model focuses more on specific details and noise in the training set, making it less generalizable. For example, a very flexible model such as a deep decision tree can essentially memorize the training data, so variance increases with flexibility.
Training error (decreasing curve): training error measures how well the model fits the training data. As flexibility increases, the model can fit the training data more and more closely; for a highly flexible model the training error can approach zero because the model can memorize every training point.
Test error (U-shaped curve): test error initially decreases as flexibility increases, because the model captures more of the meaningful patterns in the data. Beyond a certain point, however, the model becomes too flexible and starts overfitting, learning noise rather than real relationships, so test error begins to rise again. This produces the characteristic U shape.
Irreducible error (flat line): irreducible error comes from noise or randomness in the data that no model can eliminate, including factors not captured by the predictors and inherent variability in the response. Since it does not depend on the model, it remains constant across all levels of flexibility.
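These five quantities are tied together by the bias-variance decomposition of the expected test error at a point x0 (stated here for reference, following the course text's notation):
\[ E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon) \]
The first two terms move with flexibility (variance up, squared bias down), while the last term is the irreducible error.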
ANS :
Advantages of a more flexible approach: decreased bias (fewer assumptions about the functional form of f), the ability to capture non-linear relationships and complex interactions between variables, and often higher predictive accuracy.
Disadvantages of a more flexible approach: increased variance (it is easier to overfit), the need for tuning, longer training times, a need for more variables and observations to work well, and lower interpretability.
A less flexible method, by contrast, has a lower risk of overfitting and is more interpretable, since simpler models are easier to understand and explain. It also has lower computational cost and trains faster than a flexible model, and it can work well with small datasets. However, it has higher bias and struggles when relationships are non-linear or involve detailed feature interactions.
A more flexible method is preferred when: the primary concern is predictive power; there is sufficient computing power for variance-controlling measures such as validation; there is a large number of variables; the relationship between the features and the response is complex and non-linear; there is a large amount of data; and regularization can be applied to control overfitting.
A less flexible method is preferred when: we are more interested in inference; we need to explain why an observation receives a particular prediction; the dataset is small; training time must be low (especially for real-time applications); or the problem is well captured by simple assumptions.
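To make the trade-off concrete, here is a minimal simulation sketch (illustrative only, not part of the assignment; output not shown): when the true relationship is non-linear, a flexible smoothing spline achieves a lower test error than a less flexible linear model.
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)          # non-linear truth plus noise
train <- sample(200, 100)
fit_linear <- lm(y ~ x, data = data.frame(x = x[train], y = y[train]))
fit_spline <- smooth.spline(x[train], y[train])
mse <- function(pred, obs) mean((obs - pred)^2)
# Test MSE of each fit on the held-out half of the data
mse(predict(fit_linear, data.frame(x = x[-train])), y[-train])
mse(predict(fit_spline, x[-train])$y, y[-train])
If the true relationship were linear instead, the comparison would typically flip, which is exactly the trade-off described above.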
Let's input the data into a data frame.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- data.frame(Obs= c(1,2,3,4,5,6),
X1 = c(0, 2, 0, 0, -1, 1),
X2 = c(3, 0, 1, 1, 0, 1),
X3 = c(0, 0, 3, 2, 1, 1),
Y = c('Red', 'Red', 'Red', 'Green', 'Green', 'Red'),
stringsAsFactors = F)
colnames(data) <- c('obs','X1', 'X2', 'X3', 'Y')
data
## obs X1 X2 X3 Y
## 1 1 0 3 0 Red
## 2 2 2 0 0 Red
## 3 3 0 1 3 Red
## 4 4 0 1 2 Green
## 5 5 -1 0 1 Green
## 6 6 1 1 1 Red
The Euclidean distance formula is
\[ d = \sqrt{(X_1 - x_1)^2 + (X_2 - x_2)^2 + (X_3 - x_3)^2} \] Now let's calculate the Euclidean distance of each observation from the test point Z = (0, 0, 0).
Z <- c(0, 0, 0)
data$dist_X_Z <- round(sqrt((data$`X1` - Z[1])^2 +
(data$`X2` - Z[2])^2 +
(data$`X3` - Z[3])^2),2)
colnames(data)[6] <- 'Distance'
data
## obs X1 X2 X3 Y Distance
## 1 1 0 3 0 Red 3.00
## 2 2 2 0 0 Red 2.00
## 3 3 0 1 3 Red 3.16
## 4 4 0 1 2 Green 2.24
## 5 5 -1 0 1 Green 1.41
## 6 6 1 1 1 Red 1.73
With K = 1, the single nearest neighbor to the test point is observation 5, which is Green, so the prediction is Green.
With K = 3, the three nearest neighbors are observations 5, 6, and 2 (one Green and two Red), so the majority vote gives Red.
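As a quick sanity check (a sketch assuming the class package is available; output not shown), the same predictions can be reproduced with knn():
library(class)
train_X <- data[, c("X1", "X2", "X3")]
test_X <- data.frame(X1 = 0, X2 = 0, X3 = 0)
knn(train_X, test_X, cl = factor(data$Y), k = 1)   # expected: Green
knn(train_X, test_X, cl = factor(data$Y), k = 3)   # expected: Red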
If the Bayes decision boundary is highly non-linear, we expect a small K to perform better. A smaller K produces a more flexible classifier whose decision boundary can bend to follow the data, whereas a larger K oversmooths the decision surface and fails to capture the complexity of the boundary.
if (!require(ISLR)) install.packages("ISLR")
## Loading required package: ISLR
library(ISLR)
data("Auto")
Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year.
Qualitative: origin and name.
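We can double-check how the columns are stored (a quick sketch; output not shown). Note that origin is stored as a number but encodes a category, which is why it is treated as qualitative.
sapply(Auto, class)   # how each column is stored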
range_Auto <- data.frame(sapply(Auto[1:7], range))
rownames(range_Auto) <- c("min:", "max:")
range_Auto
## mpg cylinders displacement horsepower weight acceleration year
## min: 9.0 3 68 46 1613 8.0 70
## max: 46.6 8 455 230 5140 24.8 82
We can see that the range of each quantitative predictor has been displayed.
sapply(Auto[1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
sapply(Auto[ ,1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
Auto_2 <- Auto[-c(10:85), ]
range_Auto_2 <- data.frame(sapply(Auto_2[ ,1:7], range))
rownames(range_Auto_2) <- c("min:", "max:")
range_Auto_2
## mpg cylinders displacement horsepower weight acceleration year
## min: 11.0 3 68 46 1649 8.5 70
## max: 46.6 8 455 230 4997 24.8 82
sapply(Auto_2[ ,1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
sapply(Auto_2[ ,1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
These are the range, mean, and standard deviation of each quantitative predictor after removing the 10th through 85th observations (i.e., keeping rows 1-9 and 86 onward).
library(dplyr)
library(ggplot2)
pairs(Auto %>% select(-name, -origin), main="Scatterplot Matrix")
We can see strong negative correlations between mpg and variables like weight, horsepower, and displacement. weight and displacement show a positive relationship, which is expected because larger cars generally have both higher weight and displacement.
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", col = "red") +
ggtitle("MPG vs Horsepower")
## `geom_smooth()` using formula = 'y ~ x'
There is a clear negative relationship between mpg and horsepower: as horsepower increases, fuel efficiency decreases. The fitted line confirms a downward trend, indicating a strong negative linear association.
ggplot(Auto, aes(x = weight, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", col = "blue") +
ggtitle("MPG vs Weight")
## `geom_smooth()` using formula = 'y ~ x'
Similar to the plot above, there is a negative relationship between mpg and weight. Heavier cars tend to have lower mileage, which aligns with real-world expectations, and the regression line strengthens this conclusion.
The negative relationships between mpg and horsepower/weight suggest that high-performance, heavier cars consume more fuel. When predicting mpg, weight and horsepower are strong candidates as predictors in regression models. Other predictors, such as displacement and acceleration, may also play a role, but they are not as strong as these.
Referring to the scatterplot matrix above, all of the quantitative predictors appear to have some relationship with the response. name (the name of the vehicle) is categorical and has 301 unique values, so it is not a good choice as a predictor on its own, particularly with such a small dataset.
However, we may be able to extract useful information by reducing name to the brand of the car. Below, I take the first word of each name as the brand and collapse the variable down to 10 levels (the nine most common brands plus "Other"):
Auto$brand <- sub(" .*", "", Auto$name)
brand_counts <- table(Auto$brand)
top_brands <- names(sort(brand_counts, decreasing = TRUE)[1:9])
Auto$brand <- ifelse(Auto$brand %in% top_brands, Auto$brand, "Other")
Auto$brand <- as.factor(Auto$brand)
table(Auto$brand)
##
## amc buick chevrolet datsun dodge ford Other plymouth
## 27 17 43 23 28 48 134 31
## pontiac toyota
## 16 25
Let's visualize brand vs mpg:
ggplot(Auto, aes(x = brand, y = mpg, fill = brand)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(title = "Brand vs Mpg - Boxplot",
subtitle = "Engineered feature",
x = "Brand",
y = "MPG")
With this engineered feature we can compare mpg across brands; the boxplot shows clear differences between them, and a numeric summary (sketched below) makes the same point. Since the name variable needs this kind of tweaking before it is useful, we set it aside when looking for strong predictors of mpg.
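As a quick numeric check (a sketch using dplyr, which is already loaded above; output not shown), we can compute the average mpg per brand:
Auto %>%
  group_by(brand) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))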
Now let's check several other variables against mpg:
# MPG vs Weight
ggplot(Auto, aes(x = weight, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", col = "red") +
ggtitle("MPG vs Weight")
## `geom_smooth()` using formula = 'y ~ x'
# MPG vs Horsepower
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", col = "blue") +
ggtitle("MPG vs Horsepower")
## `geom_smooth()` using formula = 'y ~ x'
# MPG vs Displacement
ggplot(Auto, aes(x = displacement, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", col = "green") +
ggtitle("MPG vs Displacement")
## `geom_smooth()` using formula = 'y ~ x'
# MPG vs Year
ggplot(Auto, aes(x = year, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", col = "purple") +
ggtitle("MPG vs Year")
## `geom_smooth()` using formula = 'y ~ x'
cor_matrix <- cor(Auto[, c("mpg", "displacement", "horsepower", "weight", "acceleration", "year", "cylinders")])
print(cor_matrix)
## mpg displacement horsepower weight acceleration
## mpg 1.0000000 -0.8051269 -0.7784268 -0.8322442 0.4233285
## displacement -0.8051269 1.0000000 0.8972570 0.9329944 -0.5438005
## horsepower -0.7784268 0.8972570 1.0000000 0.8645377 -0.6891955
## weight -0.8322442 0.9329944 0.8645377 1.0000000 -0.4168392
## acceleration 0.4233285 -0.5438005 -0.6891955 -0.4168392 1.0000000
## year 0.5805410 -0.3698552 -0.4163615 -0.3091199 0.2903161
## cylinders -0.7776175 0.9508233 0.8429834 0.8975273 -0.5046834
## year cylinders
## mpg 0.5805410 -0.7776175
## displacement -0.3698552 0.9508233
## horsepower -0.4163615 0.8429834
## weight -0.3091199 0.8975273
## acceleration 0.2903161 -0.5046834
## year 1.0000000 -0.3456474
## cylinders -0.3456474 1.0000000
mpg has a strong negative correlation with weight, horsepower, cylinders, and displacement, so we can deduce the following:
1. Heavier cars tend to have lower fuel efficiency.
2. More powerful engines (higher horsepower) consume more fuel.
3. Larger engine displacement is associated with lower fuel efficiency.
4. More cylinders increase engine size and fuel use.
mpg has a positive correlation with year and acceleration, so we can say:
1. Newer cars tend to be more fuel-efficient.
2. Cars that take longer to accelerate from 0 to 60 mph (higher values of acceleration, i.e., less powerful cars) tend to have higher mpg.
Strong predictors: horsepower and weight show a clear negative correlation with mpg; higher values lead to lower fuel efficiency. Cylinders and displacement are also strong predictors. Moderate or weak: year shows a moderate positive relationship with mpg (newer models tend to have higher mpg), while acceleration has a weaker, less clear correlation with mpg.
Primary influences on mpg: vehicles with higher horsepower, heavier weight, and larger displacement tend to have lower mpg, making these the key factors for prediction. The plots of mpg vs weight and mpg vs horsepower above are good examples of this.
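As an illustration of how these predictors might be used together (a minimal sketch, not part of the assignment; output not shown), we could fit a simple linear model:
fit <- lm(mpg ~ weight + horsepower + year, data = Auto)
summary(fit)$coefficients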
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
data(Boston)
num_rows <- nrow(Boston)
num_cols <- ncol(Boston)
cat("Number of rows:", num_rows, "\n")
## Number of rows: 506
cat("Number of columns:", num_cols, "\n")
## Number of columns: 13
Rows represent census tracts (geographic regions defined for statistical purposes) in the Boston area, and columns represent variables describing the neighborhoods and the factors affecting housing values.
pairs(Boston, main = "Pairwise Scatterplots of Boston Housing Data")
There is a negative relationship between rm (average rooms per dwelling) and lstat (percentage of lower-status population), which suggests that wealthier neighborhoods (low lstat) tend to have larger homes. There is a positive relationship between tax and rad (index of accessibility to radial highways); this may indicate that well-connected areas have higher property taxes.
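We can confirm these two relationships numerically (a quick sketch; output not shown):
cor(Boston$rm, Boston$lstat)
cor(Boston$tax, Boston$rad)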
cor_matrix <- cor(Boston)
cor_crim <- cor_matrix["crim", ]
print(cor_crim)
## crim zn indus chas nox rm
## 1.00000000 -0.20046922 0.40658341 -0.05589158 0.42097171 -0.21924670
## age dis rad tax ptratio lstat
## 0.35273425 -0.37967009 0.62550515 0.58276431 0.28994558 0.45562148
## medv
## -0.38830461
rad (0.63): Strongest correlation, meaning areas with higher accessibility to radial highways tend to have higher crime rates. This suggests that locations with easier access to highways may experience more criminal activity. medv (-0.39): Higher median home values are linked to lower crime rates, supporting the idea that wealthier neighborhoods tend to have less crime.
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
Min: 0.00632, so some areas have extremely low crime rates. Median: 0.25651, so half of the neighborhoods have a crime rate below this value. Mean: 3.61352, much higher than the median, suggesting a right-skewed distribution with extreme outliers. Max: 88.9762, so some census tracts have exceptionally high crime rates.
summary(Boston$tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
Min: 187.0, so some areas have relatively low property taxes. Median: 330.0, so half of the neighborhoods have a tax rate below this value. Mean: 408.2, again higher than the median, suggesting a right-skewed distribution. Max: 711.0, so some areas have exceptionally high property tax rates.
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
Min: 12.6, so some areas have low pupil-teacher ratios, suggesting better-funded schools. Median: 19.05, so most areas have moderate class sizes. Max: 22.0, so certain neighborhoods have high pupil-teacher ratios, indicating larger class sizes and potentially lower educational quality.
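To see how many tracts stand out as particularly high on each measure, one quick check (a sketch; output not shown) is to count the tracts above, say, the 95th percentile:
sum(Boston$crim > quantile(Boston$crim, 0.95))
sum(Boston$tax > quantile(Boston$tax, 0.95))
sum(Boston$ptratio > quantile(Boston$ptratio, 0.95))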
num_tracts_charles <- sum(Boston$chas == 1)
cat("Number of census tracts bounding Charles River:", num_tracts_charles, "\n")
## Number of census tracts bounding Charles River: 35
From the output, we see that 35 census tracts in the Boston dataset have chas = 1, indicating they border the Charles River.
median_ptratio <- median(Boston$ptratio)
cat("Median pupil-teacher ratio:", median_ptratio, "\n")
## Median pupil-teacher ratio: 19.05
min_medv_index <- which.min(Boston$medv)
lowest_medv_tract <- Boston[min_medv_index, ]
print(lowest_medv_tract)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
For this tract (row 399), crim = 38.35 is far above both the median (0.26) and the third quartile (3.68) of crime rates, tax = 666 and ptratio = 20.2 sit at their third quartiles, and age = 100 means every owner-occupied unit was built before 1940. High crime, industrialization, pollution, and aging infrastructure correlate with low property values, and the lack of residential zoning and small home sizes further reduce home values. Urban planning improvements could help increase property values in such areas.
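For context, the overall quartiles of a few of these variables can be pulled out directly (a sketch; output not shown):
sapply(Boston[, c("crim", "indus", "nox", "rm", "age", "lstat")], quantile)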
num_above_7 <- sum(Boston$rm > 7)
num_above_8 <- sum(Boston$rm > 8)
cat("Census tracts with >7 rooms per dwelling:", num_above_7, "\n")
## Census tracts with >7 rooms per dwelling: 64
cat("Census tracts with >8 rooms per dwelling:", num_above_8, "\n")
## Census tracts with >8 rooms per dwelling: 13
Larger homes tend to be in more affluent neighborhoods, likely correlating with higher median home values (medv) and lower crime rates (crim).
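That claim can be checked with a quick comparison (a sketch; output not shown): compute the mean medv and crim among tracts with rm > 8 and compare them with the full dataset.
colMeans(Boston[Boston$rm > 8, c("medv", "crim")])
colMeans(Boston[, c("medv", "crim")])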