Data Analytics

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Problem: Regression
Interest: Inference
n: 500 (number of firms)
p: 3 (profit, number of employees, industry)

The response, CEO salary, is a continuous quantity, so this is a regression problem rather than a classification problem. CEO salary is the response, so it is not counted among the predictors.

Our interest is inference, because the goal is to understand which factors affect CEO salary and how, not to predict the salary of a particular CEO.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Problem: Classification
Interest: Prediction
n: 20 (similar products)
p: 13 (price, marketing budget, competition price, and ten other variables)

We need to know whether the product will be a success or a failure. The response has only two possible outcomes, so this is a classification problem. The success/failure label is the response and is not counted among the predictors.

Our interest is prediction, because we use the data on previously launched products to predict the outcome for a new product.

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Problem: Regression
Interest: Prediction
n: 52 (weeks in 2012)
p: 3 (% change in the US, British, and German markets)

The response, the % change in the USD/Euro exchange rate, is a continuous variable, so this is a regression problem. It is the response and is not counted among the predictors.

Our primary goal is prediction, because we use the existing weekly data to predict the change in the exchange rate.

(a) Sketch of typical squared bias, variance, training error, test error, and Bayes (irreducible) error curves as flexibility increases.

# Load necessary libraries
library(ggplot2)
library(reshape2)  # provides melt()

# Hypothetical curves: the constants are illustrative, not estimated from data
flexibility <- seq(1, 10, by = 1)
bayes_error <- rep(1, length(flexibility))        # irreducible error: constant
bias <- 4 * exp(-0.5 * flexibility)               # squared bias: falls as flexibility grows
variance <- 0.03 * flexibility^2                  # variance: rises as flexibility grows
training_error <- 4.5 * exp(-0.4 * flexibility)   # training error: keeps falling
# Expected test error = squared bias + variance + irreducible error (U-shaped)
test_error <- bias + variance + bayes_error

data <- data.frame(Flexibility = flexibility,
                   Squared_Bias = bias,
                   Variance = variance,
                   Training_Error = training_error,
                   Test_Error = test_error,
                   Bayes_Error = bayes_error)

# Melt the data for ggplot
data_melted <- melt(data, id.vars = "Flexibility", variable.name = "Curve", value.name = "Value")

# Plot
ggplot(data_melted, aes(x = Flexibility, y = Value, color = Curve)) +
  geom_line() +
  labs(title = "Bias-Variance Tradeoff",
       x = "Amount of Flexibility",
       y = "Values for Each Curve") +
  theme_minimal()

(b) Explain why each of the five curves has the shape displayed in part (a).

1. Squared Bias Curve: Squared bias decreases steadily as flexibility increases. More flexible models can capture complex relationships in the data, so they approximate the true function more closely. Overfitting does not raise the bias; it shows up in the variance instead.

2. Variance Curve: Variance increases as flexibility increases. More flexible models are more sensitive to the particular training sample, so the fitted model changes more from one training set to another. Highly flexible models adapt to the noise in the training data, which is what drives the variance up.

3. Training Error Curve: Training error decreases steadily as flexibility increases, because more flexible models can fit the training data more closely. It does not turn back up: a sufficiently flexible model fits the noise as well, so training error can fall below the Bayes error and approach zero.

4. Test Error Curve: The test error curve is U-shaped. Expected test error is the sum of squared bias, variance, and irreducible error. Initially, as flexibility increases, bias falls faster than variance rises, so test error decreases. Beyond a certain point the bias has little room left to fall while variance keeps growing, so overfitting sets in and test error increases.

5. Bayes (Irreducible) Error Curve: The Bayes error, representing the irreducible error or inherent noise in the data, remains constant regardless of the flexibility of the model. It serves as a lower bound on the expected test error, so the test error curve always lies above it.
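The shapes of the training and test error curves can also be inspected with a small simulation. The sketch below is only illustrative: the true function, the noise level, the sample sizes, and the use of polynomial degree as a stand-in for flexibility are all assumptions, not part of the exercise.

```r
# Simulated illustration; every constant here is an assumption.
set.seed(1)
f <- function(x) sin(2 * x)   # assumed true regression function
sigma <- 0.3                  # assumed noise standard deviation

make_data <- function(n) {
  x <- runif(n, 0, 3)
  data.frame(x = x, y = f(x) + rnorm(n, sd = sigma))
}
train <- make_data(50)
test  <- make_data(1000)

# Mean squared error of a fitted model on a data set
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

degrees <- 1:8                # polynomial degree stands in for flexibility
results <- as.data.frame(t(sapply(degrees, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = train)
  c(degree = deg, train_mse = mse(fit, train), test_mse = mse(fit, test))
})))

print(round(results, 3))
cat("Irreducible error (sigma^2):", sigma^2, "\n")
```

Because each polynomial model contains the lower-degree ones, the least-squares training MSE in this sketch cannot increase with the degree, while the test MSE is free to rise again once the fit starts following the noise.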

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Advantages: Capturing Complex Relationships: A more flexible approach can capture complex, nonlinear relationships between the predictors and the response that a rigid model would miss.

Higher Accuracy: Because a flexible approach has lower bias, it can give more accurate predictions when the true relationship is complex and there is enough data to estimate it.

Disadvantages: Overfitting: A very flexible model may fit the noise in the training data instead of the underlying relationship. It then has high variance and performs poorly on test data or new data.

Interpretability: Highly flexible models are often more challenging to interpret. Understanding the logic behind predictions becomes complex, making it difficult to gain insights into the relationships between variables.

Circumstances for a More Flexible Approach:

Complex Data Patterns: When the relationships in the data are complex and not easily captured by simple models, a more flexible approach might be preferred. For example, in image recognition where patterns are intricate, a deep neural network may be more suitable.

Large Amounts of Data: With a substantial amount of data, a more flexible model might be able to effectively leverage the information for better predictions without overfitting.

Circumstances for a Less Flexible Approach:

Advantages: Simplicity and Interpretability: Simple models are often more interpretable and easier to understand. When interpretability is crucial, especially in fields like healthcare or finance, a less flexible model might be preferred.

Reduced Risk of Overfitting: Less flexible models are less prone to overfitting, making them more robust in situations with limited data.

Disadvantages: Limited Representation: Simple models may struggle to represent complex relationships in the data, leading to lower accuracy when the relationships are not well-captured by the model’s simplicity.

Examples:

More Flexible Approach Example:
Scenario: Predicting house prices based on various features (e.g., square footage, number of bedrooms, location).
Model: A complex ensemble of decision trees (Random Forest or Gradient Boosting).
Advantage: Captures nonlinear relationships in housing market dynamics.
Disadvantage: Prone to overfitting if not properly regularized.

Less Flexible Approach Example:
Scenario: Predicting student performance based on study hours and attendance.
Model: Linear regression.
Advantage: Simple to interpret and less likely to overfit with a small dataset.
Disadvantage: May not capture complex interactions between study hours and other factors.
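The trade-off between the two approaches on a small dataset can be made concrete by refitting a rigid and a flexible model on many simulated training sets and comparing how much their predictions at one point vary. This is a hedged sketch: the true function, noise level, sample size, polynomial degrees, and the prediction point are all hypothetical choices.

```r
# Simulated illustration; every constant here is an assumption.
set.seed(2)
f <- function(x) sin(2 * x)    # assumed true regression function
sigma <- 0.3                   # assumed noise standard deviation
x0 <- data.frame(x = 2.8)      # point at which predictions are compared

# Draw one training set of size n, fit a degree-`deg` polynomial,
# and return the prediction at x0.
predict_once <- function(n, deg) {
  x <- runif(n, 0, 3)
  d <- data.frame(x = x, y = f(x) + rnorm(n, sd = sigma))
  fit <- lm(y ~ poly(x, deg), data = d)
  unname(predict(fit, newdata = x0))
}

n_small <- 20                  # deliberately small training set
reps <- 300
pred_rigid <- replicate(reps, predict_once(n_small, 1))  # linear fit
pred_flex  <- replicate(reps, predict_once(n_small, 6))  # degree-6 fit

var_rigid <- var(pred_rigid)
var_flex  <- var(pred_flex)
bias_rigid <- mean(pred_rigid) - f(x0$x)
bias_flex  <- mean(pred_flex) - f(x0$x)

cat("Linear fit:   bias =", round(bias_rigid, 3), " variance =", round(var_rigid, 4), "\n")
cat("Degree-6 fit: bias =", round(bias_flex, 3), " variance =", round(var_flex, 4), "\n")
```

Rerunning with a larger `n_small` shows how more data shrinks the variance of the flexible fit, which is the situation in which the flexible approach becomes attractive.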

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.  X1  X2  X3  Y
1      0   3   0  Red
2      2   0   0  Red
3      0   1   3  Red
4      0   1   2  Green
5     −1   0   1  Green
6      1   1   1  Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

The Euclidean distance between two points (x1, y1, z1) and (x2, y2, z2) is sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2).

The test point is X1 = X2 = X3 = 0, so:

observation1:- sqrt((0-0)^2 + (3-0)^2 + (0-0)^2) = 3
observation2:- sqrt((2-0)^2 + (0-0)^2 + (0-0)^2) = 2
observation3:- sqrt((0-0)^2 + (1-0)^2 + (3-0)^2) = sqrt(10) = 3.162
observation4:- sqrt((0-0)^2 + (1-0)^2 + (2-0)^2) = sqrt(5) = 2.236
observation5:- sqrt((-1-0)^2 + (0-0)^2 + (1-0)^2) = sqrt(2) = 1.414
observation6:- sqrt((1-0)^2 + (1-0)^2 + (1-0)^2) = sqrt(3) = 1.732
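The hand calculation can be checked in R. This is a minimal sketch in base R; the object names `train`, `test_point`, and `distances` are chosen here for illustration.

```r
# Training data from the table in the exercise
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)
test_point <- c(X1 = 0, X2 = 0, X3 = 0)

# Subtract the test point from every row, square, sum by row, take the root
X <- as.matrix(train[, c("X1", "X2", "X3")])
distances <- sqrt(rowSums(sweep(X, 2, test_point)^2))

print(round(distances, 3))
```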

b) What is our prediction with K = 1? Why?

For K = 1 we take the single observation with the shortest distance to the test point. From the distances above, observation 5 is the closest (distance 1.414) and its response is Green. Thus our prediction is Green.

c) What is our prediction with K = 3? Why?

For K = 3 we take the three observations with the shortest distances to the test point. From the distances above, these are observation 5, observation 6, and observation 2, which are Green, Red, and Red respectively. Red has the majority (two of three), so our prediction is Red.
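The majority vote can be sketched as a small base R function. `knn_predict` is a hypothetical helper written for this example, not a library function; ties are broken by taking the first class in alphabetical order, which is an arbitrary choice.

```r
# Training data from the table in the exercise
train_x <- cbind(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1)
)
train_y <- c("Red", "Red", "Red", "Green", "Green", "Red")

# Hypothetical helper: majority vote among the k nearest training points
knn_predict <- function(train_x, train_y, query, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))  # Euclidean distances
  nearest <- order(d)[seq_len(k)]                 # indices of the k closest rows
  votes <- table(train_y[nearest])                # class counts among neighbors
  names(votes)[which.max(votes)]                  # first class wins a tie
}

query <- c(0, 0, 0)
pred_k1 <- knn_predict(train_x, train_y, query, k = 1)
pred_k3 <- knn_predict(train_x, train_y, query, k = 3)
cat("K = 1:", pred_k1, "\n")
cat("K = 3:", pred_k3, "\n")
```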

d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?

If the Bayes decision boundary is highly non-linear, we would expect the best value of K to be small. A small K uses only a few nearby neighbors, which gives a flexible classifier whose decision boundary can follow a complex, non-linear pattern. A large K averages over many neighbors and produces a smoother boundary that is closer to linear, which would have high bias here.