ISL2R ANSWERS

Author

Albar Ugalde Hernández

Chapter 2

Conceptual exercises

1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a fexible statistical learning method to be better or worse than an infexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.

\colorbox{orange}{Answer:}

When the sample size n is extremely large and the number of predictors p is small, we would generally expect the performance of a flexible statistical learning method to be better than an inflexible method. This is because:

With a large sample size, flexible methods have \colorbox{yellow}{more data to learn from}, which helps in capturing complex patterns and relationships within the data.
Having a small number of predictors reduces the risk of \colorbox{yellow}{overfitting} when using flexible methods. Overfitting occurs when a \colorbox{yellow}{model learns noise in the data instead of the underlying true relationship.} Flexible methods are prone to overfitting when the number of predictors is large relative to the sample size. However, with a small number of predictors, this risk is mitigated, allowing the flexible method to effectively capture the underlying structure in the data.

Therefore, in this scenario, the flexibility of the statistical learning method is expected to lead to better performance as it can more effectively utilize the abundance of data without being overly influenced by the small number of predictors.

(b) The number of predictors p is extremely large, and the number of observations n is small.

\colorbox{orange}{Answer:}

When the number of predictors p is extremely large and the number of observations n is small, we would generally expect the performance of a \colorbox{yellow}{flexible statistical learning method to be worse } than an inflexible method. Here’s why:

Overfitting: With a large number of predictors relative to the number of observations, flexible methods are prone to overfitting. Overfitting occurs when a model captures noise in the data rather than the underlying true relationship. In this scenario, with a small number of observations, flexible methods may have difficulty generalizing well to new, unseen data.
Bias-Variance Tradeoff: Flexible methods typically have high variance and low bias. In situations with a small number of observations, the high variance of flexible methods can lead to significant instability in the model estimates, resulting in poor performance.
Computational Complexity: With a large number of predictors, flexible methods often involve complex algorithms that require substantial computational resources. In a scenario with a small number of observations, these computational demands may become prohibitive.

In summary, in situations where the number of predictors is extremely large and the number of observations is small, the flexibility of the statistical learning method is likely to lead to poorer performance compared to an inflexible method. This is primarily due to the increased risk of overfitting, the bias-variance tradeoff, and the computational challenges associated with flexible methods in such scenarios.

\colorbox{orange}{Answer}

When the relationship between the predictors and the response variable is highly non-linear, we would generally expect the performance of a \colorbox{yellow}{flexible statistical learning method to be better than an inflexible method.} Here’s why:

Capturing Non-linear Relationships: Flexible methods, such as polynomial regression, splines, or tree-based models, have the capacity to capture complex, non-linear relationships between predictors and the response variable. They can fit intricate patterns in the data more effectively than linear models, which are limited to linear relationships.
Model Complexity: Inflexible methods like linear regression assume linear relationships between predictors and the response. When the relationship is highly non-linear, inflexible methods may fail to capture important patterns in the data, resulting in poor predictive performance.
Adaptability: Flexible methods are adaptable to a wide range of data distributions and relationships. They can adjust their complexity to fit the underlying structure of the data, which is essential when dealing with highly non-linear relationships.
Risk of Underfitting: Inflexible methods may underfit the data when the relationship is highly non-linear, as they are unable to capture the complexity of the underlying patterns. Flexible methods, on the other hand, are less prone to underfitting in such scenarios.

Therefore, in situations where the relationship between predictors and the response variable is highly non-linear, the flexibility of the statistical learning method is expected to lead to better performance, as it enables the model to capture the complex non-linearities present in the data.

(d) The variance of the error terms, i.e. σ2 = Var(\epsilon), is extremely high.

\colorbox{orange}{Answer:}

When the variance of the error terms (σ2) is extremely high, we would generally expect the performance of a \colorbox{yellow}{flexible statistical learning method to be worse than an inflexible method. } Here’s why:

Sensitivity to Noise: Flexible methods tend to capture both the signal and the noise in the data. When the variance of the error terms is extremely high, there is a greater amount of noise present in the data. Flexible methods are more likely to overfit to this noise, resulting in poor generalization performance on unseen data.
Overfitting: High variance in the error terms can lead to overfitting, where the model learns the noise in the data rather than the underlying true relationship. Flexible methods are particularly susceptible to overfitting when the variance of the error terms is high, as they have more capacity to fit complex patterns, including noise.
Model Stability: Inflexible methods, such as linear regression, tend to have more stable estimates when the variance of the error terms is high. They are less influenced by individual data points or outliers compared to flexible methods, which can exhibit more variability in their predictions.
Bias-Variance Tradeoff: Flexible methods typically have low bias and high variance. In situations with high error variance, the high variance of flexible methods can exacerbate the problem, leading to models that are overly sensitive to variations in the data.

Therefore, in scenarios where the variance of the error terms is extremely high, the flexibility of the statistical learning method is expected to lead to worse performance, as flexible methods are more prone to overfitting and instability in the presence of high noise levels. In such cases, inflexible methods may provide more robust and reliable predictions.

2. Explain whether each scenario is a classifcation or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

\colorbox{pink}{Answer:}

This scenario is a regression problem because the target variable (CEO salary) is a continuous numerical variable. We are most interested in inference, which involves understanding the relationship between predictor variables (profit, number of employees, industry) and the target variable (CEO salary). In this case:

n (number of observations): 500 (data on the top 500 firms in the US)
p (number of predictors): 3 (profit, number of employees, industry)

The goal is to analyze how these predictors (profit, number of employees, industry) influence CEO salary, which aligns with the objective of inference.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

\colorbox{pink}{Answer:}

This scenario is a classification problem because the target variable (success or failure of the product launch) is categorical, with two possible outcomes: success or failure. We are most interested in prediction, as the primary goal is to predict the success or failure of the new product launch based on the available predictor variables. In this case:

n (number of observations): 20 (data on 20 similar products)
n (number of predictors): 13 (price charged for the product, marketing budget, competition price, and ten other variables)

The objective is to use the available data on similar product launches to predict whether the new product will be a success or failure, which aligns with the goal of prediction.

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

\colorbox{pink}{Answer:}

This scenario is a regression problem because the target variable (% change in the USD/Euro exchange rate) is a continuous numerical variable. We are most interested in prediction, as the primary goal is to predict the percentage change in the USD/Euro exchange rate based on the weekly changes in the world stock markets. In this case:

n (number of observations): The number of weeks in 2012
p (number of predictors): 3 (percentage change in the US market, percentage change in the British market, percentage change in the German market)

The objective is to use the weekly data on changes in world stock markets to predict the percentage change in the USD/Euro exchange rate, which aligns with the goal of prediction.

3. We now revisit the bias-variance decomposition.

Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less fexible statistical learning methods towards more fexible approaches. The x-axis should represent the amount of fexibility in the method, and the y-axis should represent the values for each curve. There should be fve curves. Make sure to label each one.

\colorbox{pink}{Answer:}

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.3.2

# Datos hipotéticos para las métricas (solo para demostración)
flexibility <- seq(1, 10, length.out = 100)
bias_squared <- flexibility^2 * 0.05
variance <- flexibility^2 * 0.03
training_error <- flexibility^2 * 0.04
test_error <- flexibility^2 * 0.06
bayes_error <- flexibility^2 * 0.02

# Crear un marco de datos para los datos
data <- data.frame(Flexibility = flexibility,
                   Bias_Squared = bias_squared,
                   Variance = variance,
                   Training_Error = training_error,
                   Test_Error = test_error,
                   Bayes_Error = bayes_error)

# Trama
ggplot(data, aes(x = Flexibility)) +
  geom_line(aes(y = Bias_Squared, color = "Bias Squared")) +
  geom_line(aes(y = Variance, color = "Variance")) +
  geom_line(aes(y = Training_Error, color = "Training Error")) +
  geom_line(aes(y = Test_Error, color = "Test Error")) +
  geom_line(aes(y = Bayes_Error, color = "Bayes Error")) +
  labs(title = "Bias-Variance Decomposition",
       x = "Flexibility",
       y = "Value",
       color = "Curve") +
  scale_color_manual(values = c("Bias Squared" = "red",
                                 "Variance" = "blue",
                                 "Training Error" = "green",
                                 "Test Error" = "purple",
                                 "Bayes Error" = "orange")) +
  theme_minimal()

Explain why each of the fve curves has the shape displayed in part (a).

\colorbox{pink}{Answer:}

Explanation of the shape of the five curves:

Squared Bias:
- When methods are less flexible, squared bias tends to be high. This is because less flexible models, such as simple linear models, may not be able to capture the complexity of the true underlying pattern in the data. Therefore, they systematically underestimate or overestimate the true model, resulting in high bias.
- As model flexibility increases, bias tends to gradually decrease. More flexible models can better fit the underlying structure of the data, thereby reducing bias.
Variance:
- In less flexible methods, variance tends to be low. This is because these models have fewer parameters and are therefore less sensitive to small fluctuations in the training data.
- As model flexibility increases, variance tends to increase. More flexible models, such as deep decision trees or complex neural network models, have more parameters and may overfit the training data, resulting in greater variability in predictions across different training data sets.
Training Error:
- Initially, when the model is less flexible, training error is high. This is because the model cannot fully capture the underlying relationship in the data, resulting in inaccurate predictions on the training set.
- As model flexibility increases, training error tends to decrease. More flexible models can better fit the training data, thereby reducing training error.
Test Error:
- Initially, when the model is less flexible, test error is high. This is because the model does not generalize well to new data, resulting in poor performance on the test set.
- As model flexibility increases, test error tends to decrease. More flexible models can better capture the underlying structure of the data, resulting in improved performance on the test set.
Bayes Error:
- Bayes error, also known as irreducible error, represents the minimum error that any model can achieve. It is inherent to the stochastic nature of the data and cannot be eliminated even with a perfect model.
- Therefore, Bayes error remains constant across all levels of model flexibility, as it cannot be improved beyond a certain limit due to the inherent randomness in the data.

4. You will now think of some real-life applications for statistical learning.

Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

\colorbox{pink}{Answer:}

Email Spam Detection:
- Response: The response variable is whether an email is classified as spam or not spam (ham).
- Predictors: Predictors may include features extracted from the email content (e.g., frequency of certain words or phrases, presence of attachments), metadata (e.g., sender’s email address, subject line), and other contextual information.
- Goal: The goal of this application is prediction. The primary objective is to accurately classify incoming emails as either spam or legitimate (ham) in order to automatically filter out unwanted or potentially harmful messages from users’ inboxes.
Medical Diagnosis:
- Response: The response variable is the presence or absence of a particular medical condition or disease.
- Predictors: Predictors may include various medical test results (e.g., blood tests, imaging scans), patient demographics (e.g., age, gender), lifestyle factors (e.g., smoking status, exercise habits), and medical history.
- Goal: The goal of this application can be both inference and prediction. Inference involves understanding the relationship between predictors and the likelihood of a medical condition, which can aid in medical research and understanding disease mechanisms. Prediction involves using the information to accurately diagnose or predict the presence of a medical condition in new patients.
Credit Risk Assessment:
- Response: The response variable is the creditworthiness of an individual or entity, typically expressed as a binary outcome (e.g., approved or denied credit, low or high credit risk).
- Predictors: Predictors may include credit history (e.g., credit score, payment history), financial status (e.g., income, debt-to-income ratio), employment status, and other demographic factors.
- Goal: The goal of this application is primarily prediction. Financial institutions use classification models to assess the credit risk of potential borrowers, enabling them to make informed decisions about extending credit, setting interest rates, and managing overall portfolio risk.

In summary, each of these real-life applications involves classification tasks where the goal may be either inference or prediction, depending on the specific context and objectives of the application.

Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

\colorbox{pink}{Answer:}

House Price Prediction:
- Response: The response variable is the price of a house.
- Predictors: Predictors may include various features of the house (e.g., size, number of bedrooms and bathrooms, location, amenities), as well as external factors such as neighborhood demographics, school quality, and economic indicators.
- Goal: The goal of this application is prediction. Real estate agents, homeowners, and property investors use regression models to estimate the selling price of a house based on its features and market conditions.
Demand Forecasting:
- Response: The response variable is the quantity of a product or service demanded.
- Predictors: Predictors may include historical sales data, advertising expenditure, economic indicators (e.g., GDP, consumer sentiment), seasonality, and competitor actions.
- Goal: The goal of this application is prediction. Businesses use regression models to forecast future demand for their products or services, helping them optimize inventory management, production planning, and marketing strategies.
Climate Change Analysis:
- Response: The response variable is a climatic measure, such as temperature, precipitation, or sea level rise.
- Predictors: Predictors may include greenhouse gas emissions, land use changes, solar radiation, oceanic currents, and atmospheric circulation patterns.
- Goal: The goal of this application can be both inference and prediction. Inference involves understanding the relationship between predictor variables (e.g., greenhouse gas emissions) and climatic measures, aiding scientists in assessing the drivers and impacts of climate change. Prediction involves using regression models to forecast future climatic conditions under different scenarios, assisting policymakers in developing mitigation and adaptation strategies.

In summary, regression is useful in various real-life applications where the goal may be either inference or prediction, depending on the specific context and objectives of the analysis.

Describe three real-life applications in which cluster analysis might be useful.

\colorbox{pink}{Answer:}

Customer Segmentation:
- Objective: Cluster analysis can be used to segment customers into distinct groups based on their similarities in terms of demographics, purchasing behavior, preferences, and other relevant characteristics.
- Application: Businesses in various industries, such as retail, banking, and telecommunications, use customer segmentation to tailor marketing strategies, product offerings, and customer service approaches to specific customer segments. For example, a retail chain might use cluster analysis to identify different customer segments (e.g., budget shoppers, luxury buyers, occasional buyers) and develop targeted marketing campaigns for each segment.
Healthcare Patient Stratification:
- Objective: Cluster analysis can be used to stratify patients into groups based on their health characteristics, medical history, treatment response, and other relevant factors.
- Application: Healthcare providers and researchers use patient stratification to identify subgroups of patients with similar clinical profiles or disease phenotypes. This information can be valuable for personalized medicine, treatment planning, clinical trial design, and resource allocation. For example, in oncology, cluster analysis may be used to classify cancer patients into distinct molecular subtypes to guide treatment decisions and predict prognosis.
Social Network Analysis:
- Objective: Cluster analysis can be used to identify communities or groups within social networks based on patterns of connections, interactions, and shared interests.
- Application: Social network platforms, marketing agencies, and sociologists use cluster analysis to detect cohesive groups of individuals within online social networks, forums, or communities. This information can be used to understand social dynamics, target advertising campaigns, detect influential users, and identify potential collaborations. For example, in social media marketing, cluster analysis may be used to identify niche communities or micro-influencers with specific interests or demographics for targeted advertising or partnership opportunities.

In summary, cluster analysis is a valuable tool in various real-life applications for identifying meaningful patterns and structures within data, which can inform decision-making, resource allocation, and strategy development in diverse domains.

5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

\colorbox{pink}{Answer:}

Advantages of a very flexible approach:

Better Fit to Complex Data: Very flexible models, such as deep neural networks or non-linear regression models, have the capacity to capture complex patterns and relationships in the data more accurately than less flexible models.
Higher Predictive Accuracy: Flexible models can potentially achieve higher predictive accuracy, especially when the underlying relationship between predictors and response is highly non-linear or complex.
Adaptability to Diverse Data Types: Flexible models can handle diverse types of data, including structured, unstructured, and high-dimensional data, making them versatile for a wide range of applications.

Disadvantages of a very flexible approach:

Increased Risk of Overfitting: Very flexible models are more prone to overfitting, where the model captures noise in the training data rather than the underlying true relationship. This can lead to poor generalization performance on unseen data.
Computational Complexity: Flexible models often involve complex algorithms and require substantial computational resources for training and inference, making them computationally expensive and time-consuming.
Difficulty in Interpretation: Very flexible models may produce complex and opaque representations of the underlying patterns in the data, making it difficult to interpret the model outputs and understand the driving factors behind the predictions.

Circumstances where a more flexible approach might be preferred:

Highly Non-linear Relationships: When the relationship between predictors and the response is highly non-linear or complex, a more flexible approach may be preferred to capture the underlying structure of the data more accurately.
Large and Diverse Data Sets: In cases where the data set is large and diverse, and there are potentially complex interactions between predictors, a more flexible approach may be able to extract valuable insights and patterns that less flexible models may miss.

Circumstances where a less flexible approach might be preferred:

Interpretability: When interpretability of the model is important for decision-making or regulatory compliance, a less flexible approach, such as linear regression or decision trees, may be preferred due to their simplicity and ease of interpretation.
Limited Data Availability: In situations where the amount of available data is limited, using a less flexible model can help mitigate the risk of overfitting and provide more reliable predictions with the available data.
Computational Constraints: When computational resources are limited, or real-time performance is required, less flexible models may be preferred due to their lower computational complexity and faster inference times.

6. Describe the diferences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

\colorbox{pink}{Answer:}

Parametric and non-parametric statistical learning approaches differ primarily in their assumptions about the functional form of the underlying relationship between predictors and response variable, as well as in their flexibility in modeling complex relationships.

Parametric Statistical Learning Approach:

Assumption: Parametric models assume a specific functional form for the relationship between predictors and the response variable. For example, linear regression assumes a linear relationship, logistic regression assumes a logistic relationship, etc.
Advantages:
1. Interpretability: Parametric models are often more interpretable because they have a clear functional form and interpretable coefficients.
2. Efficiency: Parametric models typically require fewer parameters to be estimated, leading to faster training and inference times, especially with large data sets.
3. Statistical Efficiency: When the true underlying relationship matches the assumed functional form, parametric models can be statistically more efficient, meaning they can estimate model parameters more accurately with less data.
Disadvantages:
1. Limited Flexibility: Parametric models make strong assumptions about the shape of the relationship between predictors and the response variable. If the true relationship is highly non-linear or complex, parametric models may fail to capture it adequately, resulting in poor model performance.
2. Misspecification: If the assumed functional form does not accurately reflect the true relationship in the data, parametric models may suffer from bias, leading to inaccurate predictions.
3. Sensitive to Outliers: Parametric models can be sensitive to outliers, influential observations, or violations of the underlying assumptions, which can affect the reliability of the model estimates.

Non-parametric Statistical Learning Approach:

Assumption: Non-parametric models make minimal assumptions about the functional form of the relationship between predictors and the response variable. Instead, they rely on flexible algorithms to learn the relationship directly from the data.
Advantages:
1. Flexibility: Non-parametric models can capture complex and non-linear relationships without making strong assumptions about the functional form, making them more flexible and adaptable to diverse data patterns.
2. Robustness: Non-parametric models are often more robust to violations of assumptions, such as outliers or non-normality, as they do not rely on specific distributional assumptions.
3. Accuracy: Non-parametric models can potentially provide more accurate predictions, especially when the underlying relationship is highly non-linear or complex.
Disadvantages:
1. Interpretability: Non-parametric models can be less interpretable compared to parametric models, as they do not have a simple functional form with interpretable coefficients.
2. Computational Complexity: Non-parametric models can be computationally expensive and require more computational resources, especially with large data sets or complex algorithms.
3. Data Efficiency: Non-parametric models may require larger sample sizes to estimate the underlying relationship accurately compared to parametric models, especially in regions of the predictor space with sparse data.

7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

Obs.	X1	X2	X3	Y
1	0	3	0	Red
2	2	0	0	Red
3	0	1	3	Red
4	0	1	2	Green
5	-1	0	1	Green
6	1	1	1	Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors. (a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

\colorbox{pink}{Answer:}

Para calcular la distancia euclidiana entre cada observación y el punto de prueba x1=x2=x3=0, podemos usar la fórmula de la distancia euclidiana:

\text{Distancia} = \sqrt{(X1_{\text{obs}} - X1_{\text{test}})^2 + (X2_{\text{obs}} - X2_{\text{test}})^2 + (X3_{\text{obs}} - X3_{\text{test}})^2}

Sustituyendo los valores de cada observación:

Para la observación 1:

\sqrt{(0 - 0)^2 + (3 - 0)^2 + (0 - 0)^2} = \sqrt{0 + 9 + 0} = \sqrt{9} = 3 Para la observación 2:

\sqrt{(2 - 0)^2 + (0 - 0)^2 + (0 - 0)^2} = \sqrt{4 + 0 + 0} = \sqrt{4} = 2 Para la observación 3:

\sqrt{(0 - 0)^2 + (1 - 0)^2 + (3 - 0)^2} = \sqrt{0 + 1 + 9} = \sqrt{10} Para la observación 4:

\sqrt{(0 - 0)^2 + (1 - 0)^2 + (2 - 0)^2} = \sqrt{0 + 1 + 4} = \sqrt{5} Para la observación 5:

\sqrt{(-1 - 0)^2 + (0 - 0)^2 + (1 - 0)^2} = \sqrt{1 + 0 + 1} = \sqrt{2} Para la observación 6:

\sqrt{(1 - 0)^2 + (1 - 0)^2 + (1 - 0)^2} = \sqrt{1 + 1 + 1} = \sqrt{3}

(b) What is our prediction with K = 1? Why?

Con K=1, nuestra predicción será el color de la etiqueta de clase de la observación más cercana al punto de prueba x1=x2=x3=0, que es la observación 1. Por lo tanto, la predicción será “Red”.

Con K=3, nuestra predicción será el color de la etiqueta de clase de las tres observaciones más cercanas al punto de prueba X1=X2=X3=0. Las tres observaciones más cercanas son la observación 1 (distancia 3), la observación 2 (distancia 2), y la observación 6 (distancia \sqrt 3). Dos de ellas son de color “Red” y una es de color “Green”. Como hay más observaciones “Red” cerca del punto de prueba, nuestra predicción sería “Red”.

(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

Si la frontera de decisión de Bayes en este problema es altamente no lineal, esperaríamos que el mejor valor para K fuera pequeño. Esto se debe a que un valor pequeño de K generalmente da como resultado una frontera de decisión más flexible y no paramétrica, lo que puede capturar mejor la complejidad no lineal en los datos. Por otro lado, un valor grande de K puede suavizar demasiado la frontera de decisión, lo que podría no ser adecuado para datos con una estructura no lineal complicada.

APPLIED

8. This exercise relates to the College data set, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US. The variables are

• Private : Public/private indicator

• Apps : Number of applications received

• Accept : Number of applicants accepted

• Enroll : Number of new students enrolled

• Top10perc : New students from top 10 % of high school class

• Top25perc : New students from top 25 % of high school class

• F.Undergrad : Number of full-time undergraduates

• P.Undergrad : Number of part-time undergraduates

• Outstate : Out-of-state tuition

• Room.Board : Room and board costs

• Books : Estimated book costs

• Personal : Estimated personal spending

• PhD : Percent of faculty with Ph.D.’s

• Terminal : Percent of faculty with terminal degree

• S.F.Ratio : Student/faculty ratio

• perc.alumni : Percent of alumni who donate

• Expend : Instructional expenditure per student

• Grad.Rate : Graduation rate

Before reading the data into R, it can be viewed in Excel or a text editor. #### (a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

\colorbox{cyan}{Answer:}

As there´s an existing library in the CRAN called “ISLR2” we simply load directly from the library. Just notice that in the library College is with capital C and the exercise ask with lower case c, as follow:

library("ISLR2")

Warning: package 'ISLR2' was built under R version 4.3.2

college <- read.csv("College.csv")


head(college)

                             X Private Apps Accept Enroll Top10perc Top25perc
1 Abilene Christian University     Yes 1660   1232    721        23        52
2           Adelphi University     Yes 2186   1924    512        16        29
3               Adrian College     Yes 1428   1097    336        22        50
4          Agnes Scott College     Yes  417    349    137        60        89
5    Alaska Pacific University     Yes  193    146     55        16        44
6            Albertson College     Yes  587    479    158        38        62
  F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal
1        2885         537     7440       3300   450     2200  70       78
2        2683        1227    12280       6450   750     1500  29       30
3        1036          99    11250       3750   400     1165  53       66
4         510          63    12960       5450   450      875  92       97
5         249         869     7560       4120   800     1500  76       72
6         678          41    13500       3335   500      675  67       73
  S.F.Ratio perc.alumni Expend Grad.Rate
1      18.1          12   7041        60
2      12.2          16  10527        56
3      12.9          30   8735        54
4       7.7          37  19016        59
5      11.9           2  10922        15
6       9.4          11   9727        55

View(college)

(b) Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

rownames(college) <- college[, 1]
View(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

college <- college[,-1]
View(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

(c) i. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

   Private               Apps           Accept          Enroll    
 Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
 Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
 Mode  :character   Median : 1558   Median : 1110   Median : 434  
                    Mean   : 3002   Mean   : 2019   Mean   : 780  
                    3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
                    Max.   :48094   Max.   :26330   Max.   :6392  
   Top10perc       Top25perc      F.Undergrad     P.Undergrad     
 Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
 1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
 Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
 Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
 3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
 Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
    Outstate       Room.Board       Books           Personal   
 Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
 1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
 Median : 9990   Median :4200   Median : 500.0   Median :1200  
 Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
 3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
 Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
      PhD            Terminal       S.F.Ratio      perc.alumni   
 Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
 1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
 Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
 Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
 3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
 Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
     Expend        Grad.Rate     
 Min.   : 3186   Min.   : 10.00  
 1st Qu.: 6751   1st Qu.: 53.00  
 Median : 8377   Median : 65.00  
 Mean   : 9660   Mean   : 65.46  
 3rd Qu.:10830   3rd Qu.: 78.00  
 Max.   :56233   Max.   :118.00

ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that

you can reference the first ten columns of a matrix A using A[,1:10].

#pairs(
#  ~ Private +  Apps + Accept + Enroll + Top10perc + Top25perc + F#.Undergrad + P.Undergrad + Outstate + Room.Board,
#  data = college
#)

iii. Use the plot() function to produce side-by-side boxplots of

Outstate versus Private.

#boxplot(college$Private,college$Outstate)

iv. Create a new qualitative variable, called Elite, by binning

the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

summary(college)

   Private               Apps           Accept          Enroll    
 Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
 Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
 Mode  :character   Median : 1558   Median : 1110   Median : 434  
                    Mean   : 3002   Mean   : 2019   Mean   : 780  
                    3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
                    Max.   :48094   Max.   :26330   Max.   :6392  
   Top10perc       Top25perc      F.Undergrad     P.Undergrad     
 Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
 1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
 Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
 Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
 3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
 Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
    Outstate       Room.Board       Books           Personal   
 Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
 1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
 Median : 9990   Median :4200   Median : 500.0   Median :1200  
 Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
 3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
 Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
      PhD            Terminal       S.F.Ratio      perc.alumni   
 Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
 1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
 Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
 Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
 3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
 Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
     Expend        Grad.Rate      Elite    
 Min.   : 3186   Min.   : 10.00   No :699  
 1st Qu.: 6751   1st Qu.: 53.00   Yes: 78  
 Median : 8377   Median : 65.00            
 Mean   : 9660   Mean   : 65.46            
 3rd Qu.:10830   3rd Qu.: 78.00            
 Max.   :56233   Max.   :118.00

boxplot(college$Outstate,college$Elite)

v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

# Set up the layout for the plots in a 2x2 grid
par(mfrow = c(2, 2))

# Plot histograms for selected variables with different numbers of bins
hist(college$Apps, breaks = 10, main = "Histogram of Apps", xlab = "Apps", col = "skyblue")
hist(college$Accept, breaks = 20, main = "Histogram of Accept", xlab = "Accept", col = "salmon")
hist(college$Outstate, breaks = 15, main = "Histogram of Outstate", xlab = "Outstate", col = "lightgreen")
hist(college$Expend, breaks = 25, main = "Histogram of Expend", xlab = "Expend", col = "purple")

vi. Continue exploring the data, and provide a brief summary of what you discover.

9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

(a) Which of the predictors are quantitative, and which are qualitative?

\colorbox{cyan}{Answer:}

library("ISLR2")
str(Auto)

'data.frame':   392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
  ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

unique(Auto$year)

 [1] 70 71 72 73 74 75 76 77 78 79 80 81 82

unique(Auto$cylinders)

[1] 8 4 6 3 5

As we can see with the str() function. There´s 9 columns or variables. We can appreciate that at first glance there´s 8 quant and 1 qual (name var.) nevertheless if we use the unique() function the true appearance would reveal as the cylinder var. which looks quantitative but there´s only 5 possibles option. Also with origin variable.\colorbox{yellow}{Therefore at least we can treat as 6 quant and 3 qual variables. } year which has few levels could be treat as a qualitative variable depending of the approach we want.

(b) What is the range of each quantitative predictor? You can answer this using the range() function.

\colorbox{cyan}{Answer:}

For each of those quantitative variables that we state in previous exercise:

range(Auto$mpg)

[1]  9.0 46.6

range(Auto$displacement)

[1]  68 455

range(Auto$horsepower)

[1]  46 230

range(Auto$weight)

[1] 1613 5140

range(Auto$acceleration)

[1]  8.0 24.8

range(Auto$year)

[1] 70 82

(c) What is the mean and standard deviation of each quantitative predictor?

\colorbox{cyan}{Answer:}

mean(Auto$mpg)

[1] 23.44592

sd(Auto$mpg)

[1] 7.805007

mean(Auto$displacement)

[1] 194.412

sd(Auto$displacement)

[1] 104.644

mean(Auto$horsepower)

[1] 104.4694

sd(Auto$horsepower)

[1] 38.49116

mean(Auto$weight)

[1] 2977.584

sd(Auto$weight)

[1] 849.4026

mean(Auto$acceleration)

[1] 15.54133

sd(Auto$acceleration)

[1] 2.758864

mean(Auto$year)

[1] 75.97959

sd(Auto$year)

[1] 3.683737

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

\colorbox{cyan}{Answer:}

new_auto <- Auto[-(10:85),]
dim(Auto)

[1] 392   9

dim(new_auto)

[1] 316   9

mean(new_auto$mpg)

[1] 24.40443

sd(new_auto$mpg)

[1] 7.867283

mean(new_auto$displacement)

[1] 187.2405

sd(new_auto$displacement)

[1] 99.67837

mean(new_auto$horsepower)

[1] 100.7215

sd(new_auto$horsepower)

[1] 35.70885

mean(new_auto$weight)

[1] 2935.972

sd(new_auto$weight)

[1] 811.3002

mean(new_auto$acceleration)

[1] 15.7269

sd(new_auto$acceleration)

[1] 2.693721

mean(new_auto$year)

[1] 77.14557

sd(new_auto$year)

[1] 3.106217

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

\colorbox{cyan}{Answer:}

library(ggplot2)
library(gridExtra)

Warning: package 'gridExtra' was built under R version 4.3.2

# Crear scatterplots y guardarlos en objetos ggplot
scatterplot1 <- ggplot(data = Auto, aes(x = displacement, y = mpg, color = factor(cylinders))) +
  geom_point() +
  labs(x = "Displacement", y = "MPG") +
  ggtitle("MPG vs. Displacement") +
  theme(legend.position = "right")

scatterplot2 <- ggplot(data = Auto, aes(x = horsepower, y = mpg, color = factor(cylinders))) +
  geom_point() +
  labs(x = "Horsepower", y = "MPG") +
  ggtitle("MPG vs. Horsepower") +
  theme(legend.position = "right")

scatterplot3 <- ggplot(data = Auto, aes(x = weight, y = mpg, color = factor(cylinders))) +
  geom_point() +
  labs(x = "Weight", y = "MPG") +
  ggtitle("MPG vs. Weight") +
  theme(legend.position = "right")

scatterplot4 <- ggplot(data = Auto, aes(x = acceleration, y = mpg, color = factor(cylinders))) +
  geom_point() +
  labs(x = "Acceleration", y = "MPG") +
  ggtitle("MPG vs. Acceleration") +
  theme(legend.position = "right")

scatterplot5 <- ggplot(data = Auto, aes(x = year, y = mpg, color = factor(cylinders))) +
  geom_point() +
  labs(x = "Year", y = "MPG") +
  ggtitle("MPG vs. Year") +
  theme(legend.position = "right")

# Crear grid con los scatterplots y leyenda
grid.arrange(scatterplot1 + guides(color = guide_legend(title = "Cylinders")), 
             scatterplot2 + guides(color = guide_legend(title = "Cylinders")), 
             scatterplot3 + guides(color = guide_legend(title = "Cylinders")), 
             scatterplot4 + guides(color = guide_legend(title = "Cylinders")), 
             scatterplot5 + guides(color = guide_legend(title = "Cylinders")), 
             ncol = 2)

Through this plot we can observe that we should treat as quantitative variables only 4 of them.

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

\colorbox{cyan}{Answer:}

As seen in previous exercise we clearly observe that these 4 quantitative variables would be useful to predict miles per gallon. As I plot these with the different levels of qualitative cylinder variable we could appreciate the association within.

10. This exercise involves the Boston housing data set.

To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library

Read about the data set:

#?Boston

How many rows are in this data set? How many columns? What do the rows and columns represent?

\colorbox{cyan}{Answer:}

A data frame with 506 rows and 13 variables.

A data set containing housing values in 506 suburbs of Boston.

(b) Make some pairwise scatterplots of the predictors (columns) in

this data set. Describe your fndings.

\colorbox{cyan}{Answer:}

# Cargar el paquete de visualización
library(ggplot2)

# Repetir para cada combinación de predictores
for (i in 1:(ncol(Boston) - 1)) {
  for (j in (i + 1):ncol(Boston)) {
    plot_title <- paste(names(Boston)[i], "vs.", names(Boston)[j])
    scatterplot <- ggplot(data = Boston, aes_string(x = names(Boston)[i], y = names(Boston)[j])) +
      geom_point() +
      labs(x = names(Boston)[i], y = names(Boston)[j]) +
      ggtitle(plot_title)
    
    print(scatterplot)
  }
}

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.

Findings:

By examining the pairwise scatterplots, we can observe the relationships between all possible combinations of predictors in the Boston dataset.
Patterns and trends can be identified, as well as potential linear or non-linear relationships between the variables.
For example, we can see if there is any relationship between the crime rate (crim) and the proportion of residential land (zn), or between the median age of homes (age) and the median value of homes (medv).
The presence of scattered or clustered points, as well as the direction of the relationship between the variables, can provide insights into the nature of the relationship between the predictors in the dataset.

It’s important to carefully examine the pairwise scatterplots to gain a better understanding of the structure and relationships in the data before conducting further analysis.

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

\colorbox{cyan}{Answer:}

Due the previous myriad of plots. You can appreciate a negative relationship between crim and rm ( avg # of rooms per dwelling). As less rooms crime rates could increase. Also somehow likely happen with nox var.

(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

\colorbox{cyan}{Answer:}

range(Boston$tax)

[1] 187 711

range(Boston$ptratio)

[1] 12.6 22.0

Both of these var tax & ptratio could indicate that higher they´re higher crime would be.

(e) How many of the census tracts in this data set bound the Charles river?

\colorbox{cyan}{Answer:}

table(Boston$chas)


  0   1 
471  35

Using the table() function we there´s 35 census tracts that bound the Charles river

(f) What is the median pupil-teacher ratio among the towns in this data set?

\colorbox{cyan}{Answer:}

median(Boston$ptratio)

[1] 19.05

(g) Which census tract of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your fndings.

\colorbox{cyan}{Answer:}

# Find the index of the census tract with the lowest median value of owner-occupied homes
lowest_medv_index <- which.min(Boston$medv)

# Extract the row corresponding to the census tract with the lowest median value of owner-occupied homes
lowest_medv_tract <- Boston[lowest_medv_index, ]

# Display the values of other predictors for that census tract
print(lowest_medv_tract)

       crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5

# Calculate the overall ranges for other predictors in the Boston dataset
overall_ranges <- sapply(Boston[, -which(names(Boston) == "medv")], range)

# Display the overall ranges for other predictors
print(overall_ranges)

         crim  zn indus chas   nox    rm   age     dis rad tax ptratio lstat
[1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6  1.73
[2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 37.97

5,000

Commentary on Findings:

By comparing the values of other predictors for the census tract with the lowest median value of owner-occupied homes to the overall ranges for those predictors in the Boston dataset, we can assess how unusual or typical those values are.
If the values for other predictors in the lowest median value tract are at the extremes of the overall ranges, it may indicate unique characteristics or conditions in that tract compared to the rest of the dataset.
Conversely, if the values are within the typical range observed across the dataset, it suggests that the low median value of owner-occupied homes in that tract may not be driven by extreme values of other predictors but rather by other factors not captured in the dataset.
Further analysis and comparison with external data or contextual information may be needed to fully understand the reasons behind the low median value of owner-occupied homes in that census tract.

(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling?Comment on the census tracts that average more than eight rooms per dwelling.

\colorbox{cyan}{Answer:}

table(7<Boston$rm)


FALSE  TRUE 
  442    64

table(Boston$rm>8)


FALSE  TRUE 
  493    13

# Count the number of census tracts with more than seven rooms per dwelling
more_than_seven_rooms <- sum(Boston$rm > 7)

# Count the number of census tracts with more than eight rooms per dwelling
more_than_eight_rooms <- sum(Boston$rm > 8)

# Print the results
cat("Number of census tracts with more than seven rooms per dwelling:", more_than_seven_rooms, "\n")

Number of census tracts with more than seven rooms per dwelling: 64

cat("Number of census tracts with more than eight rooms per dwelling:", more_than_eight_rooms, "\n")

Number of census tracts with more than eight rooms per dwelling: 13

Commentary on Census Tracts with More than Eight Rooms per Dwelling:

Census tracts with more than eight rooms per dwelling may represent areas with larger, more spacious homes.
These areas may be associated with higher-income neighborhoods or neighborhoods with larger families, as larger homes often accommodate more people.
Additionally, census tracts with more than eight rooms per dwelling may have higher property values due to the larger size and potentially higher quality of the homes.
Understanding the characteristics and demographics of these census tracts can provide insights into the housing market and socio-economic landscape of the region.

FINALLY THIS IS THE END (MY FRIEND HAHAHA) OF CHAPTER 2!!! hope it result helpful

-AlbaR.