Assignment 1

STA6543

Author

Stephen Garcia (wqr974)

Published

May 29, 2025

Chapter 2

Conceptual Exercises

Problem 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.


(a)
We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Response:
This is a regression problem because the response variable — CEO salary — is continuous.

We are primarily interested in inference, as the goal is to understand the relationship between CEO salary and the predictors (profit, number of employees, industry).

Note: If the focus were only on accurately forecasting CEO salary without concern for the underlying relationships, the goal would be prediction instead.

Sample Size and Number of Predictors:
- n = 500 (number of firms)
- p = 3 (profit, number of employees, industry)

(b)
We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Response:
This is a classification problem because the response variable — success or failure — is categorical.

We are primarily interested in prediction, as the goal is to accurately classify the new product as a success or failure based on the features.

Sample Size and Number of Predictors:
- n = 20 (number of products)
- p = 13 (price charged, marketing budget, competition price, and ten other variables)

(c)
We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Response:
This is a regression problem because the response variable — % change in the USD/Euro exchange rate — is continuous.

We are primarily interested in prediction, as the goal is to accurately forecast the % change in the USD/Euro exchange rate based on the changes in the world stock markets.

Sample Size and Number of Predictors:
- n = 52 (number of weeks in 2012)
- p = 3 (% change in the US market, % change in the British market, % change in the German market)

Problem 5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

Advantages of a Very Flexible Approach:
- High Predictive Accuracy: Flexible models can capture complex relationships in the data, leading to better performance on training data.
- Ability to Model Non-linear Relationships: They can fit non-linear patterns, interactions, and high-dimensional data effectively.
- Adaptability: Flexible models can adjust to various data structures and distributions, making them suitable for diverse applications.

Disadvantages of a Very Flexible Approach:
- Overfitting: They are prone to overfitting, especially with small sample sizes, where the model learns noise instead of the underlying pattern.
- Interpretability: They often lack interpretability, making it difficult to understand the relationship between predictors and the response variable.
- Computational Complexity: They can be computationally intensive, requiring more resources and time for training and prediction.
- Sensitivity to Outliers: Flexible models can be sensitive to outliers, which can disproportionately influence the model’s predictions.

When to Prefer a More Flexible Approach:
- Complex Data Structures: When the data exhibits complex relationships, interactions, or non-linear patterns that simpler models cannot capture.
- High-Dimensional Data: In cases with many predictors, flexible models can help identify relevant features and interactions.
- Prediction Focus: When the primary goal is to achieve high predictive accuracy, even at the cost of interpretability.

When to Prefer a Less Flexible Approach:
- Interpretability Required: When understanding the relationship between predictors and the response is crucial, such as in scientific research or policy-making.
- Small Sample Sizes: In cases with limited data, simpler models are less likely to overfit and can provide more reliable estimates.
- Robustness to Outliers: When the data contains outliers, less flexible models can be more robust and provide stable predictions.
- Computational Efficiency: When computational resources are limited, simpler models can be trained and evaluated more quickly.

Conclusion:
The choice between a flexible and less flexible approach depends on the specific context of the problem, the nature of the data, and the goals of the analysis. A balanced approach that considers both predictive performance and interpretability is often ideal.
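
The trade-off above can be illustrated with a small simulation. The sketch below uses simulated data (not part of the assignment data): a deliberately inflexible linear model and a deliberately flexible smoothing spline are fit to the same small training set and compared on held-out observations.

# Illustrative sketch: flexible vs. less flexible fits on simulated data.
set.seed(1)
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)        # true relationship is simple and linear
train <- sample(seq_len(n), n / 2)

fit_linear <- lm(y ~ x, data = data.frame(x, y)[train, ])   # less flexible
fit_spline <- smooth.spline(x[train], y[train], df = 20)    # very flexible

test_x <- x[-train]
test_y <- y[-train]
mse_linear <- mean((test_y - predict(fit_linear, data.frame(x = test_x)))^2)
mse_spline <- mean((test_y - predict(fit_spline, test_x)$y)^2)
c(linear = mse_linear, spline_df20 = mse_spline)

Because the true relationship here is linear and the training set is small, the very flexible spline typically spends its extra flexibility modeling noise and ends up with the higher held-out error, illustrating the overfitting risk described above.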

Problem 6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

Parametric vs. Non-Parametric Approaches:

- Parametric Approaches:
  - Assume a specific functional form for the relationship between predictors and the response variable (e.g., linear regression assumes a linear relationship).
  - Characterized by a fixed number of parameters, regardless of the size of the dataset.
- Non-Parametric Approaches:
  - Do not assume a specific functional form for the relationship between predictors and the response variable.
  - Can adapt to the data's structure, potentially using a larger number of parameters as the dataset size increases.

Advantages of Parametric Approaches:
- Simplicity: Parametric models are often simpler and easier to interpret, making them suitable for understanding relationships between variables.
- Efficiency: They typically require less data to estimate the parameters, making them more efficient in terms of computation and memory usage.
- Speed: Training and prediction are usually faster because the model structure is fixed.

Disadvantages of Parametric Approaches:
- Assumption Dependence: The performance heavily relies on the correctness of the assumed functional form. If the true relationship is complex, parametric models may underperform.
- Limited Flexibility: They may struggle to capture non-linear relationships or interactions between variables without additional transformations or features.
- Over-simplification: By assuming a specific form, they may oversimplify the problem and miss important patterns in the data.

Conclusion: The choice between parametric and non-parametric approaches depends on the nature of the data, the complexity of the relationships, and the goals of the analysis. Parametric approaches are advantageous when interpretability and efficiency are prioritized, while non-parametric approaches are preferred for capturing complex relationships without strong assumptions.
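
As a concrete, informal contrast, the sketch below fits a parametric model (lm(), which assumes a linear form and estimates just two coefficients) and a non-parametric smoother (loess(), which assumes no functional form) to the Auto data loaded in Problem 9 below; it assumes that data frame is available in the workspace.

# Sketch: parametric (lm) vs. non-parametric (loess) fits of mpg on horsepower.
# Assumes the Auto data frame from Problem 9 has been loaded.
fit_param    <- lm(mpg ~ horsepower, data = Auto)      # assumes mpg = b0 + b1 * horsepower
fit_nonparam <- loess(mpg ~ horsepower, data = Auto)   # shape determined by the data

grid <- data.frame(horsepower = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 100))
plot(Auto$horsepower, Auto$mpg, col = "grey", xlab = "Horsepower", ylab = "MPG")
lines(grid$horsepower, predict(fit_param, grid), col = "blue", lwd = 2)
lines(grid$horsepower, predict(fit_nonparam, grid), col = "red", lwd = 2)
legend("topright", legend = c("lm (parametric)", "loess (non-parametric)"),
       col = c("blue", "red"), lwd = 2)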

Applied Exercises

Problem 8

This exercise relates to the College data set, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US.

(a)
Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

# Check the current working directory
getwd()
[1] "C:/Users/sgarc/OneDrive/Documents/R_Studio_Folders/STA6543/assignment_1"
# If College.csv is not in the current working directory, set it with setwd()
# before loading; here the directory shown above already contains the file.
current_dir <- getwd()

# Load the College data
college <- read.csv("College.csv", na.strings = "?", stringsAsFactors = T)

(b)
Use the View() function to view the data.

View(college)

Try the following commands:

rownames(college) <- college[, 1]  # Set the first column as row names
View(college)  # View the updated data frame

college <- college[, -1]  # Remove the first column (now row names)
View(college)  # View the data frame without the first column

(c)
i. Use the summary() function to get a summary of the data.

summary(college)
 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad         Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00  
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[, 1:10], main = "Scatterplot Matrix of First 10 Variables")

iii. Use the plot() function to produce a scatterplot of Outstate versus Private.
library(ggplot2)
# I created a box plot with jitter to better visualize the data, since a plain scatterplot is not very informative when the x-variable is categorical.
ggplot(college, aes(x = Private, y = Outstate, color = Private)) +
  stat_boxplot(geom ="errorbar", width = 0.25) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.6) +
  labs(title = "Outstate Tuition vs Private Institution",
       x = "Private Institution (Yes/No)",
       y = "Outstate Tuition") +
  theme_minimal()
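
For reference, the base-R call the exercise asks for also works here: because Private is a factor, plot() automatically produces side-by-side boxplots rather than a scatterplot.

# Base-R version of the requested plot; plot() dispatches to boxplots for a factor x.
plot(college$Private, college$Outstate,
     xlab = "Private Institution", ylab = "Outstate Tuition",
     main = "Outstate Tuition vs Private Institution")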

iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Elite <- rep("No", nrow(college))  # Initialize Elite variable
Elite[college$Top10perc > 50] <- "Yes"  # Set to "Yes" where Top10perc > 50
Elite <- as.factor(Elite)  # Convert to factor
college <- data.frame(college, Elite)  # Add Elite to the college data frame
summary(college$Elite)  # Summary of the Elite variable
 No Yes 
699  78 

Use the plot() function to produce a scatterplot of Outstate versus Elite.

library(ggplot2)

# I created a box plot with jitter to better visualize the data, since a plain scatterplot is not very informative when the x-variable is categorical.
ggplot(college, aes(x = Elite, y = Outstate, color = Elite)) + 
  stat_boxplot(geom ="errorbar", width = 0.25) +
  geom_boxplot(outlier.shape = NA) +  # hides default boxplot outliers
  geom_jitter(width = 0.2, alpha = 0.6) +
  labs(title = "Outstate Tuition vs Elite Institution",
       x = "Elite Institution (Yes/No)",
       y = "Outstate Tuition") +
  theme_minimal()

v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
library(ggplot2)
library(patchwork)

# Create each histogram as a ggplot object
p1 <- ggplot(college, aes(x = Outstate)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  labs(title = "Histogram of Outstate Tuition", x = "Outstate Tuition")

p2 <- ggplot(college, aes(x = Top10perc)) +
  geom_histogram(bins = 10, fill = "darkorange", color = "black") +
  labs(title = "Histogram of Top 10 Percent", x = "Top 10 Percent")

p3 <- ggplot(college, aes(x = Grad.Rate)) +
  geom_histogram(bins = 15, fill = "forestgreen", color = "black") +
  labs(title = "Histogram of Graduation Rate", x = "Graduation Rate")

p4 <- ggplot(college, aes(x = Room.Board)) +
  geom_histogram(bins = 25, fill = "purple", color = "black") +
  labs(title = "Histogram of Room and Board", x = "Room and Board")

# Combine plots in a 2x2 grid
(p1 | p2) / (p3 | p4)
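
A base-R equivalent using hist() and par(mfrow = c(2, 2)), as the prompt suggests, is sketched below with the same variables and bin counts.

# Base-R version: four histograms in a 2x2 layout.
par(mfrow = c(2, 2))
hist(college$Outstate,   breaks = 20, main = "Outstate Tuition", xlab = "Outstate Tuition")
hist(college$Top10perc,  breaks = 10, main = "Top 10 Percent",   xlab = "Top 10 Percent")
hist(college$Grad.Rate,  breaks = 15, main = "Graduation Rate",  xlab = "Graduation Rate")
hist(college$Room.Board, breaks = 25, main = "Room and Board",   xlab = "Room and Board")
par(mfrow = c(1, 1))  # reset the plotting layout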

vi. Continue exploring the data, and provide a brief summary of what you discover.
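
One possible direction for further exploration (an illustrative sketch; AcceptRate is a derived variable introduced here, and results are not reproduced) is to compute an acceptance rate from Apps and Accept and examine how it relates to graduation rate and Elite status.

# Sketch: acceptance rate vs. graduation rate, colored by Elite status.
college$AcceptRate <- college$Accept / college$Apps

ggplot(college, aes(x = AcceptRate, y = Grad.Rate, color = Elite)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Graduation Rate vs Acceptance Rate",
       x = "Acceptance Rate (Accept / Apps)",
       y = "Graduation Rate") +
  theme_minimal()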

Problem 9

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

# Load the Auto data set
Auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = T)
# Remove rows with missing values
Auto <- na.omit(Auto)  # Remove rows with NA values

(a)
Which of the predictors are quantitative, and which are qualitative?

# Check the structure of the Auto data set
str(Auto)
'data.frame':   392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
  ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Response:
- Quantitative Predictors:
  - mpg
  - cylinders
  - displacement
  - horsepower
  - weight
  - acceleration
  - year
- Qualitative Predictors:
  - name
  - origin (stored as an integer code, but it represents region of origin)

(b)
What is the range of each quantitative predictor? You can answer this using the range() function.

# Step 0: Let's first identify the quantitative predictors
quant_vars <- setdiff(names(Auto)[sapply(Auto, is.numeric)], "origin")

# Step 1: Calculate the range for each quantitative predictor
min_max <- sapply(Auto[, quant_vars], range)

# Step 2: Convert to Data Frame and Calculate Range
min_max_df <- as.data.frame(t(min_max))
colnames(min_max_df) <- c("Min", "Max")  # Set column names

# Step 3: Calculate Range
min_max_df$Range <- min_max[2, ] - min_max[1, ]

print(min_max_df)  # View the min and max values for each quantitative predictor
              Min    Max  Range
mpg             9   46.6   37.6
cylinders       3    8.0    5.0
displacement   68  455.0  387.0
horsepower     46  230.0  184.0
weight       1613 5140.0 3527.0
acceleration    8   24.8   16.8
year           70   82.0   12.0

(c)
What is the mean and standard deviation of each quantitative predictor?

# Step 0: Let's first identify the quantitative predictors
quant_vars <- setdiff(names(Auto)[sapply(Auto, is.numeric)], "origin")

# Step 1: Get Mean and SD for all numeric columns
mean_sd <- sapply(Auto[, quant_vars], function(x) c(Mean = mean(x), SD = sd(x)))

# Step 2: Convert to Data Frame
mean_sd_df <- as.data.frame(t(mean_sd))

print(mean_sd_df)  # View the mean and standard deviation for each quantitative predictor
                    Mean         SD
mpg            23.445918   7.805007
cylinders       5.471939   1.705783
displacement  194.411990 104.644004
horsepower    104.469388  38.491160
weight       2977.584184 849.402560
acceleration   15.541327   2.758864
year           75.979592   3.683737

(d)
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

# Step 0: Let's first identify the quantitative predictors
quant_vars <- setdiff(names(Auto)[sapply(Auto, is.numeric)], "origin")

# Step 1: Remove the 10th through 85th observations
Auto_subset <- Auto[-(10:85), ]

# Step 2: Get Range for all numeric columns
min_max <- sapply(Auto_subset[, quant_vars], range)

# Step 3: Convert to Data Frame
min_max_df <- as.data.frame(t(min_max))
colnames(min_max_df) <- c("Min", "Max")  # Set column names

# Step 4: Calculate Range
min_max_df$Range <- min_max[2, ] - min_max[1, ]

print(min_max_df)  # View the range for each quantitative predictor
                Min    Max  Range
mpg            11.0   46.6   35.6
cylinders       3.0    8.0    5.0
displacement   68.0  455.0  387.0
horsepower     46.0  230.0  184.0
weight       1649.0 4997.0 3348.0
acceleration    8.5   24.8   16.3
year           70.0   82.0   12.0
# Step 1: Get Mean and SD for all numeric columns
mean_sd <- sapply(Auto_subset[, quant_vars], function(x) c(Mean = mean(x), SD = sd(x)))

# Step 2: Convert to Data Frame
mean_sd_df <- as.data.frame(t(mean_sd))

print(mean_sd_df)  # View the mean and standard deviation for each quantitative predictor
                    Mean         SD
mpg            24.404430   7.867283
cylinders       5.373418   1.654179
displacement  187.240506  99.678367
horsepower    100.721519  35.708853
weight       2935.971519 811.300208
acceleration   15.726899   2.693721
year           77.145570   3.106217

(e)
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

Auto$origin <- factor(Auto$origin, labels = c("USA", "Europe", "Asia"))

ggplot(Auto, aes(x = weight, y = mpg, color = origin)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "MPG vs Weight by Origin",
       x = "Weight",
       y = "Miles Per Gallon (MPG)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

(f)
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Response:
The scatterplot of mpg versus weight suggests a negative relationship, indicating that as weight increases, mpg tends to decrease. This implies that weight could be a useful predictor for mpg.

The scatterplot also shows that the relationship between mpg and weight varies by origin, suggesting that the origin of the car may also be a significant predictor. The linear trend lines for each origin group indicate that the slope of the relationship differs, which suggests that a model incorporating interaction terms between weight and origin could improve predictions.
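
Following up on the interaction point above, a quick sketch (not a formal analysis) fits a linear model with a weight-by-origin interaction; the interaction coefficients quantify how the slope of mpg on weight differs across origins.

# Sketch: linear model with a weight:origin interaction, mirroring the differing
# slopes visible in the plot above.
fit_interaction <- lm(mpg ~ weight * origin, data = Auto)
summary(fit_interaction)$coefficients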

Problem 10

This exercise involves the Boston housing data set.
(a)
To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.

# Load the ISLR2 library
library(ISLR2)
Warning: package 'ISLR2' was built under R version 4.4.3

Attaching package: 'ISLR2'
The following object is masked _by_ '.GlobalEnv':

    Auto
# ?Boston

How many rows are in this data set? How many columns? What do the rows and columns represent?

Response:
The Boston data set contains 506 rows and 13 columns. Each row represents a suburb (census tract) in the Boston area, while the columns represent attributes of those suburbs, such as the per capita crime rate (crim), the average number of rooms per dwelling (rm), and the median value of owner-occupied homes (medv).
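
The stated dimensions can be confirmed directly from the loaded data:

# Confirm the number of rows and columns in the Boston data set
dim(Boston)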

(b)
Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
selected_vars <- Boston[, c("lstat", "rm", "nox", "dis", "crim")]

ggpairs(selected_vars, title = "Pairwise Relationships Among Key Boston Housing Predictors")

library(car)
Loading required package: carData
# Fit a full model with all predictors
model <- lm(medv ~ ., data = Boston)

# View VIFs for all predictors
vif(model)
    crim       zn    indus     chas      nox       rm      age      dis 
1.767486 2.298459 3.987181 1.071168 4.369093 1.912532 3.088232 3.954037 
     rad      tax  ptratio    lstat 
7.445301 9.002158 1.797060 2.870777 
cor(Boston)[, "tax"]
       crim          zn       indus        chas         nox          rm 
 0.58276431 -0.31456332  0.72076018 -0.03558652  0.66802320 -0.29204783 
        age         dis         rad         tax     ptratio       lstat 
 0.50645559 -0.53443158  0.91022819  1.00000000  0.46085304  0.54399341 
       medv 
-0.46853593 

Response:
The scatterplot matrix suggests clear pairwise relationships among lstat, rm, nox, dis, and crim. The variance inflation factors and the correlation output above also indicate moderate collinearity among the predictors: rad and tax have the largest VIFs (about 7.4 and 9.0, respectively), and tax is very strongly correlated with rad (0.91) and fairly strongly with indus (0.72) and nox (0.67).

(c)
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

Response:
Yes, several predictors are associated with the per capita crime rate (crim). For example, the scatterplot of crim versus nox (nitric oxides concentration) shows a positive relationship, indicating that higher levels of nitric oxides are associated with higher crime rates. Similarly, the scatterplot of crim versus lstat (percentage of lower-status residents) suggests that areas with a higher percentage of lower-status individuals tend to have higher crime rates. Taken together, environmental and socioeconomic factors, such as pollution and the prevalence of lower-status residents, appear to be related to crime rates in the Boston-area tracts.
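
A compact numeric check of these associations (a sketch; output not reproduced here) is to sort the correlations of every variable with crim:

# Correlation of each variable with per capita crime rate, sorted
sort(cor(Boston)[, "crim"], decreasing = TRUE)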

(d)
Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

library(tidyr)
library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:car':

    recode
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Select and standardize the variables
standardized_df <- Boston %>%
  select(crim, tax, ptratio) %>%
  mutate(across(everything(), scale)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Plot standardized boxplots
ggplot(standardized_df, aes(x = Variable, y = Value, color = Variable)) + 
  stat_boxplot(geom ="errorbar", width = 0.25) +
  geom_boxplot(outlier.shape = NA) +  # hides default boxplot outliers
  geom_jitter(width = 0.2, alpha = 0.6) +
  labs(title = "Boxplot with Colored Fill and Whisker Caps",
       y = "Standardized Z-score", x = "") +
  theme_minimal()

Response:
The data suggest that crime is by far the most unevenly distributed of the three predictors: crim spans roughly 0.006 to 89, with a small number of tracts driving very high values. Tax rates (roughly 187 to 711) are also somewhat varied, while pupil-teacher ratios cover a comparatively narrow range (about 12.6 to 22) and remain relatively stable across neighborhoods.
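
A short numeric companion to the boxplots (sketch; output not reproduced here) lists the range of each of the three predictors discussed in the question:

# Ranges of the three predictors discussed above
sapply(Boston[, c("crim", "tax", "ptratio")], range)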

(e)
How many of the census tracts in this data set bound the Charles river?

# Count the number of census tracts that bound the Charles River
sum(Boston$chas == 1) 
[1] 35

Response:
There are 35 census tracts in the Boston data set that bound the Charles River, as indicated by the “chas” variable being equal to 1.

(f)
What is the median pupil-teacher ratio among the towns in this data set?

# Calculate the median pupil-teacher ratio  
median_ptratio <- median(Boston$ptratio, na.rm = TRUE)
print(median_ptratio)  # Print the median pupil-teacher ratio
[1] 19.05

(g)
Which census tract of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

# Find the census tract with the lowest median value of owner-occupied homes
lowest_medv_index <- which.min(Boston$medv)
lowest_medv_tract <- Boston[lowest_medv_index, ]
print(lowest_medv_tract)  # Print the details of the census tract with the lowest median value
       crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
# Calculate the range of each predictor
predictor_ranges <- sapply(Boston[, -which(names(Boston) == "medv")], range)
print(predictor_ranges)  # Print the ranges of the predictors
         crim  zn indus chas   nox    rm   age     dis rad tax ptratio lstat
[1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6  1.73
[2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 37.97
# Compare the values of the predictors for the lowest median value tract to the overall ranges
comparison <- data.frame(
  Predictor = names(lowest_medv_tract)[-which(names(lowest_medv_tract) == "medv")],
  Value = as.numeric(lowest_medv_tract[-which(names(lowest_medv_tract) == "medv")]),
  Range_Min = predictor_ranges[1, ],
  Range_Max = predictor_ranges[2, ]
)
print(comparison)  # Print the comparison of values to overall ranges
        Predictor    Value Range_Min Range_Max
crim         crim  38.3518   0.00632   88.9762
zn             zn   0.0000   0.00000  100.0000
indus       indus  18.1000   0.46000   27.7400
chas         chas   0.0000   0.00000    1.0000
nox           nox   0.6930   0.38500    0.8710
rm             rm   5.4530   3.56100    8.7800
age           age 100.0000   2.90000  100.0000
dis           dis   1.4896   1.12960   12.1265
rad           rad  24.0000   1.00000   24.0000
tax           tax 666.0000 187.00000  711.0000
ptratio   ptratio  20.2000  12.60000   22.0000
lstat       lstat  30.5900   1.73000   37.9700
names(lowest_medv_tract)[-which(names(lowest_medv_tract) == "medv")]
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "lstat"  

Response:
The Boston census tract with the lowest median home value shows signs of significant urban and socioeconomic stress. It has high crime, pollution, industrial land use, and tax rates, along with small, aging homes and a relatively high pupil-teacher ratio. The area also has a high percentage of lower-status residents, which helps explain its depressed housing value.

(h)
In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

# Count the number of census tracts with more than 7 and 8 rooms per dwelling
more_than_7_rooms <- sum(Boston$rm > 7)
more_than_8_rooms <- sum(Boston$rm > 8)
print(paste("Number of census tracts with more than 7 rooms:", more_than_7_rooms))
[1] "Number of census tracts with more than 7 rooms: 64"
print(paste("Number of census tracts with more than 8 rooms:", more_than_8_rooms))
[1] "Number of census tracts with more than 8 rooms: 13"
# Get the details of census tracts that average more than 8 rooms per dwelling
tracts_more_than_8_rooms <- Boston[Boston$rm > 8, ]
print(tracts_more_than_8_rooms)  # Print the details of census tracts with more than 8 rooms
       crim zn indus chas    nox    rm  age    dis rad tax ptratio lstat medv
98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0  4.21 38.7
164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7  3.32 50.0
205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7  2.88 50.0
225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4  4.14 44.8
226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4  4.63 50.0
227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4  3.13 37.6
233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4  2.47 41.7
234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4  3.95 48.3
254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1  3.54 42.8
258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0  5.12 50.0
263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0  5.91 48.8
268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0  7.44 50.0
365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2  5.29 21.9

Response:
Homes with over 8 rooms are generally large, high-value properties in affluent areas, often hitting the dataset’s maximum home value of $50K. They feature low lstat, moderate taxes, and good school access. Most are older but well-located. One outlier (tract 365) shows that size alone doesn’t ensure high value when crime and pollution are high.
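
The comment above can be checked with a short comparison of averages for the more-than-eight-room tracts against all tracts (sketch; output not reproduced here):

# Compare key averages for tracts with rm > 8 against the full data set
rbind(over_8_rooms = colMeans(Boston[Boston$rm > 8, c("medv", "lstat", "crim", "tax")]),
      all_tracts   = colMeans(Boston[, c("medv", "lstat", "crim", "tax")]))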