Data_Analytics_Exercise

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.3.2

## Warning: package 'lubridate' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.2

## corrplot 0.92 loaded

1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
Problem Type: Regression (the response variable, CEO salary, is quantitative).

Focus: Inference (interested in understanding which factors affect the salary).

n and p: n=500 (top 500 firms). Predictors are profit, number of employees, and industry, so p=3.

We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Problem Type: Classification (the response variable is categorical: success or failure).

Focus: Prediction (want to predict the success or failure of a new product).

n and p: n=20 (data on 20 similar products). Predictors include price charged, marketing budget, competition price, and ten other variables, p=13.

We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

Problem Type: Regression (the response variable, % change in the US dollar, is continuous).

Focus: Prediction (interested in predicting the future % change in the dollar).

n and p: Assuming weekly data for all 52 weeks of 2012, n=52. Predictors include % changes in the US, British, and German markets, p=3.

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

flexibility <- seq(1, 10, length.out = 100)
data <- data.frame(flexibility)

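# Hypothetical values for each curve (illustrative shapes only, not estimates)
data$bayes_error <- 5                          # irreducible error: constant in flexibility
data$bias_squared <- (10 - flexibility)^2      # falls as flexibility grows
data$variance <- flexibility^2                 # rises as flexibility grows
# Expected test error = squared bias + variance + irreducible error (U-shaped)
data$test_error <- data$bias_squared + data$variance + data$bayes_error
# Training error falls monotonically and can end up below the Bayes error
data$training_error <- 80 * exp(-0.35 * (flexibility - 1))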

# Create the plot
ggplot(data, aes(x = flexibility)) +
    geom_line(aes(y = bias_squared, color = "Bias Squared")) +
    geom_line(aes(y = variance, color = "Variance")) +
    geom_line(aes(y = training_error, color = "Training Error")) +
    geom_line(aes(y = test_error, color = "Test Error")) +
    geom_line(aes(y = bayes_error, color = "Bayes Error")) +
    labs(title = "Bias-Variance Decomposition", x = "Flexibility", y = "Error") +
    scale_color_manual(values = c("Bias Squared" = "blue",
                                  "Variance" = "red",
                                  "Training Error" = "green",
                                  "Test Error" = "purple",
                                  "Bayes Error" = "orange"))

(b) Explanation of Each Curve

Squared Bias: Decreases with increased model flexibility. As models become more complex, they can better capture the underlying data patterns, reducing bias.
Variance: Increases with model flexibility. More complex models are more sensitive to small fluctuations in the training data, leading to higher variance.
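Training Error: Decreases steadily as the model becomes more flexible. Complex models can fit the training data more closely, and a sufficiently flexible model can push the training error below the Bayes error by fitting the noise.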
Test Error (Generalization Error): Initially decreases as flexibility increases (reduction in bias), but then increases due to a rise in variance. It typically forms a U-shape.
Bayes Error (Irreducible Error): This error is unaffected by model complexity. It represents the noise inherent in the data and remains constant across different levels of flexibility.
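
The shapes described above can be checked with a small simulation. The following is a minimal sketch, not part of the original exercise: the true function `true_f`, the noise level `noise_sd`, the sample sizes, and the use of polynomial degree as the measure of flexibility are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical true regression function and noise level (assumptions)
true_f <- function(x) sin(2 * x)
noise_sd <- 0.5

# Draw n observations from the assumed data-generating process
simulate <- function(n) {
  x <- runif(n, 0, 3)
  data.frame(x = x, y = true_f(x) + rnorm(n, sd = noise_sd))
}

train <- simulate(50)    # small training set
test <- simulate(5000)   # large test set to approximate the expected test error

# Mean squared error of a fitted model on a data set d
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

# Fit polynomials of increasing degree (more flexibility) and record both errors
degrees <- 1:10
errors <- t(sapply(degrees, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = train)
  c(degree = deg, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
errors <- as.data.frame(errors)

print(round(errors, 3))
```

Because the polynomial models are nested, the training MSE cannot increase with the degree. The test MSE would be expected to stay near or above the noise variance (`noise_sd^2`), which plays the role of the Bayes error in this sketch.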

5) What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Overfitting vs. Underfitting:
- A very flexible approach runs a higher risk of overfitting, especially with smaller or noisier datasets, as it can capture random noise in the data as if it were significant patterns.
- Conversely, a less flexible approach may lead to underfitting, failing to capture the complexities and nuances in the data, especially if the true relationships are non-linear or involve interactions.
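Balancing the risk of overfitting with the potential for underfitting is key to selecting the right model complexity for your data and objectives. A more flexible approach tends to be preferred when the sample is large, the underlying relationship is non-linear, and prediction matters more than interpretation. A less flexible approach tends to be preferred when the sample is small, the noise is high, or the goal is inference, since simpler models are easier to interpret.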
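
The trade-off can be illustrated by refitting a rigid and a flexible model on many small training sets and comparing their predictions at one point. This is a minimal sketch under stated assumptions: the true function `true_f`, the evaluation point `x0`, the training size, the noise level, and the choice of a degree-8 polynomial as the "flexible" model are all hypothetical.

```r
set.seed(2)

# Hypothetical setup (assumptions): nonlinear truth, small noisy training sets
true_f <- function(x) sin(2 * x)
x0 <- data.frame(x = 0.75)   # point at which predictions are compared
n_train <- 30
n_reps <- 500

# Fit both models on a fresh training set and predict at x0
predict_once <- function() {
  x <- runif(n_train, 0, 3)
  train <- data.frame(x = x, y = true_f(x) + rnorm(n_train, sd = 0.5))
  c(linear = unname(predict(lm(y ~ x, data = train), newdata = x0)),
    flexible = unname(predict(lm(y ~ poly(x, 8), data = train), newdata = x0)))
}

# Rows are repetitions, columns are the two models
preds <- t(replicate(n_reps, predict_once()))

# Monte Carlo estimates of squared bias and variance at x0
summary_table <- data.frame(
  model = colnames(preds),
  squared_bias = (colMeans(preds) - true_f(x0$x))^2,
  variance = apply(preds, 2, var)
)
print(summary_table, row.names = FALSE)
```

In this setup one would expect the linear model to show the larger squared bias (underfitting) and the degree-8 polynomial the larger variance (sensitivity to the particular training sample).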

7) The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Compute the Euclidean Distance: The Euclidean distance between a point (x1, x2, x3) and the test point (0, 0, 0) is sqrt(x1^2 + x2^2 + x3^2).
1. Observation 1: 3
2. Observation 2: 2
3. Observation 3: sqrt(10)≈3.16
4. Observation 4: sqrt(5)≈2.24
5. Observation 5: sqrt(2)≈1.41
6. Observation 6: sqrt(3)≈1.73

1. Prediction with K = 1: For K = 1, we select the nearest neighbor to the test point. The closest observation is Observation 5, with a distance of approximately 1.41. The Y value for this observation is Green. Therefore, our prediction with K = 1 is Green.
2. Prediction with K = 3: For K = 3, we select the three nearest neighbors. These are Observations 5, 6, and 2, with distances approximately 1.41, 1.73, and 2, respectively. The Y values for these observations are Green, Red, and Red. Since there are two Reds and one Green, the majority vote is Red. Therefore, our prediction with K = 3 is Red.
3. Best Value for K in Nonlinear Decision Boundary: If the Bayes decision boundary is highly nonlinear, we would expect the best value for K to be small. This is because a smaller K allows the model to adapt more closely to the local structure of the data, which is beneficial in capturing the nonlinear patterns. A larger K would average over more neighbors, leading to a smoother and more linear decision boundary, potentially missing the complex nuances of a highly nonlinear boundary. Therefore, in cases of nonlinear boundaries, a smaller K is typically more effective.
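
The distance and voting steps can be reproduced in a few lines of base R. This is a minimal sketch: the training table is not shown in this write-up, so the coordinates below are assumed values chosen to reproduce the six distances listed earlier, and the labels of Observations 1, 3, and 4 are assumptions as well. `knn_vote` is a hypothetical helper, not a library function.

```r
# Training data: ASSUMED values (the table itself is not reproduced above).
# The coordinates match the six listed distances; the labels of
# Observations 1, 3 and 4 are not stated in the text and are assumptions.
train <- data.frame(
  x1 = c(0, 2, 0, 0, -1, 1),
  x2 = c(3, 0, 1, 1, 0, 1),
  x3 = c(0, 0, 3, 2, 1, 1),
  y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(0, 0, 0)

# Euclidean distance from each observation to the test point
predictors <- as.matrix(train[, c("x1", "x2", "x3")])
train$dist <- sqrt(rowSums(sweep(predictors, 2, test_point)^2))

# Majority vote among the k nearest neighbours (hypothetical helper;
# ties are broken by taking the first class in alphabetical order)
knn_vote <- function(k) {
  nearest <- order(train$dist)[seq_len(k)]
  votes <- table(train$y[nearest])
  names(votes)[which.max(votes)]
}

print(round(train$dist, 2))
print(c(k1 = knn_vote(1), k3 = knn_vote(3)))
```

With these assumed values the sketch agrees with the answers above: Green for K = 1 and Red for K = 3.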

Data_Analytics_Exercise_1

Surya

2024-01-24
