STA6543
Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
(a)
We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
Response:
This is a regression problem because the response variable — CEO salary — is continuous.
We are primarily interested in inference, as the goal is to understand the relationship between CEO salary and the predictors (profit, number of employees, industry).
Note: If the focus were only on accurately forecasting CEO salary without concern for the underlying relationships, the goal would be prediction instead.
Sample Size and Number of Predictors:
- n = 500 (number of firms)
- p = 3 (profit, number of employees, industry)
(b)
We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
Response:
This is a classification problem because the response variable — success or failure — is categorical.
We are primarily interested in prediction, as the goal is to accurately classify the new product as a success or failure based on the features.
Sample Size and Number of Predictors:
- n = 20 (number of products)
- p = 13 (price charged, marketing budget, competition price, and ten other variables)
(c)
We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
Response:
This is a regression problem because the response variable — % change in the USD/Euro exchange rate — is continuous.
We are primarily interested in prediction, as the goal is to accurately forecast the % change in the USD/Euro exchange rate based on the changes in the world stock markets.
Sample Size and Number of Predictors:
- n = 52 (number of weeks in 2012; see the quick check below)
- p = 3 (% change in the US market, % change in the British market, % change in the German market)
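A quick sanity check on that count (a tiny sketch, assuming one observation per week and indexing the weeks by the Fridays of 2012; the specific dates are only for illustration):
# Count the Fridays of 2012, one per weekly observation
length(seq(as.Date("2012-01-06"), as.Date("2012-12-31"), by = "week"))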
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
Advantages of a Very Flexible Approach:
- High Predictive Accuracy: Flexible models can capture complex relationships in the data, leading to better performance on training data.
- Ability to Model Non-linear Relationships: They can fit non-linear patterns, interactions, and high-dimensional data effectively.
- Adaptability: Flexible models can adjust to various data structures and distributions, making them suitable for diverse applications.
Disadvantages of a Very Flexible Approach:
- Overfitting: They are prone to overfitting, especially with small sample sizes, where the model learns noise instead of the underlying pattern.
- Interpretability: They often lack interpretability, making it difficult to understand the relationship between predictors and the response variable.
- Computational Complexity: They can be computationally intensive, requiring more resources and time for training and prediction.
- Sensitivity to Outliers: Flexible models can be sensitive to outliers, which can disproportionately influence the model’s predictions.
When to Prefer a More Flexible Approach:
- Complex Data Structures: When the data exhibits complex relationships, interactions, or non-linear patterns that simpler models cannot capture.
- High-Dimensional Data: In cases with many predictors, flexible models can help identify relevant features and interactions.
- Prediction Focus: When the primary goal is to achieve high predictive accuracy, even at the cost of interpretability.
When to Prefer a Less Flexible Approach:
- Interpretability Required: When understanding the relationship between predictors and the response is crucial, such as in scientific research or policy-making.
- Small Sample Sizes: In cases with limited data, simpler models are less likely to overfit and can provide more reliable estimates.
- Robustness to Outliers: When the data contains outliers, less flexible models can be more robust and provide stable predictions.
- Computational Efficiency: When computational resources are limited, simpler models can be trained and evaluated more quickly.
Conclusion:
The choice between a flexible and less flexible approach depends on the specific context of the problem, the nature of the data, and the goals of the analysis. A balanced approach that considers both predictive performance and interpretability is often ideal.
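To make the trade-off concrete, here is a small simulated-data sketch (not part of the assignment data; the sample size, noise level, and df = 30 are arbitrary illustrative choices). The true relationship is linear, so the very flexible smoother tends to track the training data more closely but generalize worse than a simple linear model.
set.seed(1)                                    # arbitrary seed, for reproducibility only
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 3)            # the true relationship is simply linear
train <- sample(100, 50)

lin_fit  <- lm(y ~ x, subset = train)                   # less flexible: straight line
flex_fit <- smooth.spline(x[train], y[train], df = 30)  # very flexible: wiggly smoother

mse <- function(pred, obs) mean((obs - pred)^2)

# Training error: the flexible fit follows the training points more closely
mse(predict(lin_fit, data.frame(x = x[train])), y[train])
mse(predict(flex_fit, x[train])$y, y[train])

# Test error: the flexible fit typically does worse here, because it has
# also learned the noise in the 50 training observations
mse(predict(lin_fit, data.frame(x = x[-train])), y[-train])
mse(predict(flex_fit, x[-train])$y, y[-train])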
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
Parametric vs. Non-Parametric Approaches:
A parametric approach first assumes a specific functional form for f (for example, a linear model) and then reduces the problem to estimating a fixed, finite set of parameters from the training data. A non-parametric approach makes no explicit assumption about the form of f; it instead seeks an estimate that follows the data as closely as possible without being too rough, which typically requires many more observations to work well.
Advantages of Parametric Approaches:
- Simplicity: Parametric models are often simpler and easier to interpret, making them suitable for understanding relationships between variables.
- Efficiency: They typically require less data to estimate the parameters, making them more efficient in terms of computation and memory usage.
- Speed: Training and prediction are usually faster because the model structure is fixed.
Disadvantages of Parametric Approaches:
- Assumption Dependence: The performance heavily relies on the correctness of the assumed functional form. If the true relationship is complex, parametric models may underperform.
- Limited Flexibility: They may struggle to capture non-linear relationships or interactions between variables without additional transformations or features.
- Over-simplification: By assuming a specific form, they may oversimplify the problem and miss important patterns in the data.
Conclusion:
The choice between parametric and non-parametric approaches depends on the nature of the data, the complexity of the relationships, and the goals of the analysis. Parametric approaches are advantageous when interpretability and efficiency are prioritized, while non-parametric approaches are preferred for capturing complex relationships without strong assumptions.
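As a small illustration of the assumption-dependence point (simulated data with arbitrary choices, not part of the assignment): when the true f is non-linear, a parametric linear model stays biased no matter how much data we collect, while a non-parametric smoother such as loess() adapts to the curvature at the cost of interpretability.
set.seed(2)                                   # arbitrary seed
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)            # the true relationship is non-linear

para_fit    <- lm(y ~ x)                      # parametric: assumes y is linear in x
nonpara_fit <- loess(y ~ x, span = 0.3)       # non-parametric: no assumed form for f

grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
plot(x, y, col = "grey60", pch = 16,
     main = "Parametric (linear) vs non-parametric (loess) fit")
lines(grid$x, predict(para_fit, grid),    col = "red",  lwd = 2)
lines(grid$x, predict(nonpara_fit, grid), col = "blue", lwd = 2)
legend("topright", legend = c("lm (parametric)", "loess (non-parametric)"),
       col = c("red", "blue"), lwd = 2)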
This exercise relates to the College data set, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US.
(a)
Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
# Check the current working directory
getwd()
[1] "C:/Users/sgarc/OneDrive/Documents/R_Studio_Folders/STA6543/assignment_1"
# Set the working directory to the location of the College.csv file
# Check the current working directory and adjust accordingly
current_dir <- getwd()
# Load the College data
college <- read.csv("College.csv", na.strings = "?", stringsAsFactors = TRUE)
(b)
Use the View() function to view the data.
View(college)
Try the following commands:
rownames(college) <- college[, 1] # Set the first column as row names
View(college) # View the updated data frame
college <- college[, -1] # Remove the first column (now row names)
View(college) # View the data frame without the first column
(c)
i. Use the summary() function to get a summary of the data.
summary(college)
 Private       Apps          Accept        Enroll      Top10perc
No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
Median : 1558 Median : 1110 Median : 434 Median :23.00
Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
Max. :48094 Max. :26330 Max. :6392 Max. :96.00
Top25perc F.Undergrad P.Undergrad Outstate
Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
Room.Board Books Personal PhD
Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
Median :4200 Median : 500.0 Median :1200 Median : 75.00
Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
Terminal S.F.Ratio perc.alumni Expend
Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
Median : 82.0 Median :13.60 Median :21.00 Median : 8377
Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
Grad.Rate
Min. : 10.00
1st Qu.: 53.00
Median : 65.00
Mean : 65.46
3rd Qu.: 78.00
Max. :118.00
pairs(college[, 1:10], main = "Scatterplot Matrix of First 10 Variables")

library(ggplot2)
# I created a box plot with jitter to better visualize the data as the scatter plot wasn't helpful.
ggplot(college, aes(x = Private, y = Outstate, color = Private)) +
stat_boxplot(geom ="errorbar", width = 0.25) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6) +
labs(title = "Outstate Tuition vs Private Institution",
x = "Private Institution (Yes/No)",
y = "Outstate Tuition") +
theme_minimal()

Elite <- rep("No", nrow(college)) # Initialize Elite variable
Elite[college$Top10perc > 50] <- "Yes" # Set to "Yes" where Top10perc > 50
Elite <- as.factor(Elite) # Convert to factor
college <- data.frame(college, Elite) # Add Elite to the college data frame
summary(college$Elite) # Summary of the Elite variable
 No Yes
699  78
Use the plot() function to produce a scatterplot of Outstate versus Elite.
library(ggplot2)
# I created a box plot with jitter to better visualize the data as the scatter plot wasn't helpful.
ggplot(college, aes(x = Elite, y = Outstate, color = Elite)) +
stat_boxplot(geom ="errorbar", width = 0.25) +
geom_boxplot(outlier.shape = NA) + # hides default boxplot outliers
geom_jitter(width = 0.2, alpha = 0.6) +
labs(title = "Outstate Tuition vs Elite Institution",
x = "Elite Institution (Yes/No)",
y = "Outstate Tuition") +
theme_minimal()

library(ggplot2)
library(patchwork)
# Create each histogram as a ggplot object
p1 <- ggplot(college, aes(x = Outstate)) +
geom_histogram(bins = 20, fill = "steelblue", color = "black") +
labs(title = "Histogram of Outstate Tuition", x = "Outstate Tuition")
p2 <- ggplot(college, aes(x = Top10perc)) +
geom_histogram(bins = 10, fill = "darkorange", color = "black") +
labs(title = "Histogram of Top 10 Percent", x = "Top 10 Percent")
p3 <- ggplot(college, aes(x = Grad.Rate)) +
geom_histogram(bins = 15, fill = "forestgreen", color = "black") +
labs(title = "Histogram of Graduation Rate", x = "Graduation Rate")
p4 <- ggplot(college, aes(x = Room.Board)) +
geom_histogram(bins = 25, fill = "purple", color = "black") +
labs(title = "Histogram of Room and Board", x = "Room and Board")
# Combine plots in a 2x2 grid
(p1 | p2) / (p3 | p4)

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
# Load the Auto data set
Auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = T)
# Remove rows with missing values
Auto <- na.omit(Auto) # Remove rows with NA values
(a)
Which of the predictors are quantitative, and which are qualitative?
# Check the structure of the Auto data set
str(Auto)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
$ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin : int 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
- attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Response:
- Quantitative Predictors:
- mpg
- cylinders
- displacement
- horsepower
- weight
- acceleration
- year
- Qualitative Predictors:
- name (a factor with 304 levels)
- origin (coded as an integer 1/2/3, but it represents the car's region of origin, so it is treated as qualitative; see the quick check below)
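A quick way to confirm this split programmatically (a small sketch; origin has to be excluded from the numeric group by hand because it is stored as an integer):
# Numeric columns minus origin are the quantitative variables
quantitative <- setdiff(names(Auto)[sapply(Auto, is.numeric)], "origin")
qualitative  <- setdiff(names(Auto), quantitative)
quantitative
qualitative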
(b)
What is the range of each quantitative predictor? You can answer this using the range() function.
# Step 0: Let's first identify the quantitative predictors
quant_vars <- setdiff(names(Auto)[sapply(Auto, is.numeric)], "origin")
# Step 1: Calculate the range for each quantitative predictor
min_max <- sapply(Auto[, quant_vars], range)
# Step 2: Convert to Data Frame and Calculate Range
min_max_df <- as.data.frame(t(min_max))
colnames(min_max_df) <- c("Min", "Max") # Set column names
# Step 3: Calculate Range
min_max_df$Range <- min_max[2, ] - min_max[1, ]
print(min_max_df) # View the min and max values for each quantitative predictor
              Min    Max  Range
mpg 9 46.6 37.6
cylinders 3 8.0 5.0
displacement 68 455.0 387.0
horsepower 46 230.0 184.0
weight 1613 5140.0 3527.0
acceleration 8 24.8 16.8
year 70 82.0 12.0
(c)
What is the mean and standard deviation of each quantitative predictor?
# Step 0: Let's first identify the quantitative predictors
quant_vars <- setdiff(names(Auto)[sapply(Auto, is.numeric)], "origin")
# Step 1: Get Mean and SD for all numeric columns
mean_sd <- sapply(Auto[, quant_vars], function(x) c(Mean = mean(x), SD = sd(x)))
# Step 2: Convert to Data Frame
mean_sd_df <- as.data.frame(t(mean_sd))
print(mean_sd_df) # View the mean and standard deviation for each quantitative predictor
                    Mean         SD
mpg 23.445918 7.805007
cylinders 5.471939 1.705783
displacement 194.411990 104.644004
horsepower 104.469388 38.491160
weight 2977.584184 849.402560
acceleration 15.541327 2.758864
year 75.979592 3.683737
(d)
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
# Step 0: Let's first identify the quantitative predictors
quant_vars <- setdiff(names(Auto)[sapply(Auto, is.numeric)], "origin")
# Step 1: Remove the 10th through 85th observations
Auto_subset <- Auto[-(10:85), ]
# Step 2: Get Range for all numeric columns
min_max <- sapply(Auto_subset[, quant_vars], range)
# Step 3: Convert to Data Frame
min_max_df <- as.data.frame(t(min_max))
colnames(min_max_df) <- c("Min", "Max") # Set column names
# Step 4: Calculate Range
min_max_df$Range <- min_max[2, ] - min_max[1, ]
print(min_max_df) # View the range for each quantitative predictor
               Min    Max  Range
mpg 11.0 46.6 35.6
cylinders 3.0 8.0 5.0
displacement 68.0 455.0 387.0
horsepower 46.0 230.0 184.0
weight 1649.0 4997.0 3348.0
acceleration 8.5 24.8 16.3
year 70.0 82.0 12.0
# Step 1: Get Mean and SD for all numeric columns
mean_sd <- sapply(Auto_subset[, quant_vars], function(x) c(Mean = mean(x), SD = sd(x)))
# Step 2: Convert to Data Frame
mean_sd_df <- as.data.frame(t(mean_sd))
print(mean_sd_df) # View the mean and standard deviation for each quantitative predictor
                    Mean         SD
mpg 24.404430 7.867283
cylinders 5.373418 1.654179
displacement 187.240506 99.678367
horsepower 100.721519 35.708853
weight 2935.971519 811.300208
acceleration 15.726899 2.693721
year 77.145570 3.106217
(e)
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
Auto$origin <- factor(Auto$origin, labels = c("USA", "Europe", "Asia"))
ggplot(Auto, aes(x = weight, y = mpg, color = origin)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "MPG vs Weight by Origin",
x = "Weight",
y = "Miles Per Gallon (MPG)") +
theme_minimal()
(f)
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
Response:
The scatterplot of mpg versus weight suggests a negative relationship, indicating that as weight increases, mpg tends to decrease. This implies that weight could be a useful predictor for mpg.
The scatterplot also shows that the relationship between mpg and weight varies by origin, suggesting that the origin of the car may also be a significant predictor. The linear trend lines for each origin group indicate that the slope of the relationship differs, which suggests that a model incorporating interaction terms between weight and origin could improve predictions.
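As a follow-up to the interaction-term point above, a minimal model-fitting sketch (using the already-loaded Auto data; this particular formula is just one reasonable choice, not a required answer):
# Allow the slope of weight to differ by origin via an interaction term
fit_main <- lm(mpg ~ weight + origin, data = Auto)   # common slope for all origins
fit_int  <- lm(mpg ~ weight * origin, data = Auto)   # origin-specific slopes

# Does the interaction significantly improve the fit?
anova(fit_main, fit_int)
summary(fit_int)$r.squared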
This exercise involves the Boston housing data set.
(a)
To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.
# Load the ISLR2 library
library(ISLR2)
# ?Boston

How many rows are in this data set? How many columns? What do the rows and columns represent?
Response:
The Boston data set contains 506 rows and 13 columns. Each row represents a different suburb of Boston, while the columns represent various attributes of these suburbs, such as crime rate, average number of rooms per dwelling, and median value of owner-occupied homes.
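A quick check of those dimensions (a small sketch; it assumes ISLR2 is already attached as above):
dim(Boston)    # 506 rows (census tracts), 13 columns (variables)
names(Boston)  # the variables recorded for each tract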
(b)
Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
library(GGally)
selected_vars <- Boston[, c("lstat", "rm", "nox", "dis", "crim")]
ggpairs(selected_vars, title = "Pairwise Relationships Among Key Boston Housing Predictors")

library(car)
# Fit a full model with all predictors
model <- lm(medv ~ ., data = Boston)
# View VIFs for all predictors
vif(model)
    crim       zn    indus     chas      nox       rm      age      dis
1.767486 2.298459 3.987181 1.071168 4.369093 1.912532 3.088232 3.954037
     rad      tax  ptratio    lstat
7.445301 9.002158 1.797060 2.870777
cor(Boston)[, "tax"]
       crim          zn       indus        chas         nox          rm
 0.58276431 -0.31456332  0.72076018 -0.03558652  0.66802320 -0.29204783
        age         dis         rad         tax     ptratio       lstat
 0.50645559 -0.53443158  0.91022819  1.00000000  0.46085304  0.54399341
       medv
-0.46853593
Response:
The pairwise plots and the diagnostics above point to substantial collinearity among the predictors. In particular, tax is very strongly correlated with rad (0.91), fairly strongly with indus (0.72), nox (0.67), and crim (0.58), and negatively with medv (-0.47); this is also reflected in the largest VIF values (tax at about 9.0 and rad at about 7.4). Several predictors therefore carry overlapping information, so their individual effects on medv should be interpreted with care.
(c)
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Response:
Yes, several predictors are associated with the per capita crime rate (crim). For example, the scatterplot of crim versus nox (nitric oxides concentration) shows a positive relationship, indicating that higher levels of nitric oxides are associated with higher crime rates. Similarly, the scatterplot of crim versus lstat (percentage of lower status of the population) suggests that areas with a higher percentage of lower-status individuals tend to have higher crime rates. This suggests that socioeconomic factors, such as pollution and income inequality, may contribute to crime rates in Boston suburbs.
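A quick numeric check of these associations (a small sketch; correlation is only a rough guide here because crim is heavily skewed):
# Correlations of every variable with the per capita crime rate, strongest first
sort(cor(Boston)[, "crim"], decreasing = TRUE)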
(d)
Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
library(tidyr)
library(dplyr)
# Select and standardize the variables
standardized_df <- Boston %>%
select(crim, tax, ptratio) %>%
mutate(across(everything(), scale)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Plot standardized boxplots
ggplot(standardized_df, aes(x = Variable, y = Value, color = Variable)) +
stat_boxplot(geom ="errorbar", width = 0.25) +
geom_boxplot(outlier.shape = NA) + # hides default boxplot outliers
geom_jitter(width = 0.2, alpha = 0.6) +
labs(title = "Standardized Distributions of Crime Rate, Tax Rate, and Pupil-Teacher Ratio",
y = "Standardized Z-score", x = "") +
theme_minimal()

Response:
The data suggest that crime is the most unevenly distributed of the three predictors, with a small number of tracts driving very high values. Tax rates are also somewhat varied, while pupil-teacher ratios remain relatively stable across neighborhoods.
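To back this comment up with the actual ranges, a short sketch using the same three variables (the cutoff of 20 is an arbitrary choice for "particularly high"):
# Range of each predictor, plus a count of extreme-crime tracts
sapply(Boston[, c("crim", "tax", "ptratio")], range)
sum(Boston$crim > 20)   # tracts with a per capita crime rate above 20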
(e)
How many of the census tracts in this data set bound the Charles river?
# Count the number of census tracts that bound the Charles River
sum(Boston$chas == 1)
[1] 35
Response:
There are 35 census tracts in the Boston data set that bound the Charles River, as indicated by the “chas” variable being equal to 1.
(f)
What is the median pupil-teacher ratio among the towns in this data set?
# Calculate the median pupil-teacher ratio
median_ptratio <- median(Boston$ptratio, na.rm = TRUE)
print(median_ptratio) # Print the median pupil-teacher ratio
[1] 19.05
(g)
Which census tract of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
# Find the census tract with the lowest median value of owner-occupied homes
lowest_medv_index <- which.min(Boston$medv)
lowest_medv_tract <- Boston[lowest_medv_index, ]
print(lowest_medv_tract) # Print the details of the census tract with the lowest median value
       crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
# Calculate the range of each predictor
predictor_ranges <- sapply(Boston[, -which(names(Boston) == "medv")], range)
print(predictor_ranges) # Print the ranges of the predictors
        crim  zn indus chas   nox    rm   age     dis rad tax ptratio lstat
[1,] 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6 1.73
[2,] 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0 37.97
# Compare the values of the predictors for the lowest median value tract to the overall ranges
comparison <- data.frame(
Predictor = names(lowest_medv_tract)[-which(names(lowest_medv_tract) == "medv")],
Value = as.numeric(lowest_medv_tract[-which(names(lowest_medv_tract) == "medv")]),
Range_Min = predictor_ranges[1, ],
Range_Max = predictor_ranges[2, ]
)
print(comparison) # Print the comparison of values to overall ranges
        Predictor    Value Range_Min Range_Max
crim crim 38.3518 0.00632 88.9762
zn zn 0.0000 0.00000 100.0000
indus indus 18.1000 0.46000 27.7400
chas chas 0.0000 0.00000 1.0000
nox nox 0.6930 0.38500 0.8710
rm rm 5.4530 3.56100 8.7800
age age 100.0000 2.90000 100.0000
dis dis 1.4896 1.12960 12.1265
rad rad 24.0000 1.00000 24.0000
tax tax 666.0000 187.00000 711.0000
ptratio ptratio 20.2000 12.60000 22.0000
lstat lstat 30.5900 1.73000 37.9700
names(lowest_medv_tract)[-which(names(lowest_medv_tract) == "medv")]
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"
 [8] "dis"     "rad"     "tax"     "ptratio" "lstat"
Response:
The Boston census tract with the lowest median home value shows signs of significant urban and socioeconomic stress. It has high crime, pollution, industrial land use, and tax rates, along with small, aging homes and poor school resources. The area also has a high percentage of lower-status residents, helping to explain its depressed housing value.
(h)
In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
# Count the number of census tracts with more than 7 and 8 rooms per dwelling
more_than_7_rooms <- sum(Boston$rm > 7)
more_than_8_rooms <- sum(Boston$rm > 8)
print(paste("Number of census tracts with more than 7 rooms:", more_than_7_rooms))[1] "Number of census tracts with more than 7 rooms: 64"
print(paste("Number of census tracts with more than 8 rooms:", more_than_8_rooms))[1] "Number of census tracts with more than 8 rooms: 13"
# Get the details of census tracts that average more than 8 rooms per dwelling
tracts_more_than_8_rooms <- Boston[Boston$rm > 8, ]
print(tracts_more_than_8_rooms) # Print the details of census tracts with more than 8 rooms
        crim zn indus chas    nox    rm  age    dis rad tax ptratio lstat medv
98 0.12083 0 2.89 0 0.4450 8.069 76.0 3.4952 2 276 18.0 4.21 38.7
164 1.51902 0 19.58 1 0.6050 8.375 93.9 2.1620 5 403 14.7 3.32 50.0
205 0.02009 95 2.68 0 0.4161 8.034 31.9 5.1180 4 224 14.7 2.88 50.0
225 0.31533 0 6.20 0 0.5040 8.266 78.3 2.8944 8 307 17.4 4.14 44.8
226 0.52693 0 6.20 0 0.5040 8.725 83.0 2.8944 8 307 17.4 4.63 50.0
227 0.38214 0 6.20 0 0.5040 8.040 86.5 3.2157 8 307 17.4 3.13 37.6
233 0.57529 0 6.20 0 0.5070 8.337 73.3 3.8384 8 307 17.4 2.47 41.7
234 0.33147 0 6.20 0 0.5070 8.247 70.4 3.6519 8 307 17.4 3.95 48.3
254 0.36894 22 5.86 0 0.4310 8.259 8.4 8.9067 7 330 19.1 3.54 42.8
258 0.61154 20 3.97 0 0.6470 8.704 86.9 1.8010 5 264 13.0 5.12 50.0
263 0.52014 20 3.97 0 0.6470 8.398 91.5 2.2885 5 264 13.0 5.91 48.8
268 0.57834 20 3.97 0 0.5750 8.297 67.0 2.4216 5 264 13.0 7.44 50.0
365 3.47428 0 18.10 1 0.7180 8.780 82.9 1.9047 24 666 20.2 5.29 21.9
Response:
Homes with over 8 rooms are generally large, high-value properties in affluent areas, often hitting the dataset’s maximum home value of $50K. They feature low lstat, moderate taxes, and good school access. Most are older but well-located. One outlier (tract 365) shows that size alone doesn’t ensure high value when crime and pollution are high.
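To support that description with numbers rather than reading the table by eye, a short comparison of medians (a sketch; the median is used because several of these variables are skewed):
# Compare the typical tract with more than eight rooms to the data set overall
round(rbind(
  over_8  = sapply(Boston[Boston$rm > 8, ], median),
  overall = sapply(Boston, median)
), 2)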