I started by loading the Auto dataset and inspecting its structure to identify the types of predictors. Here is the R code I used:
library(ISLR2)
data("Auto")
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
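# Quantitative variables (mpg is the response; the others are predictors)
quantitative_predictors <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")
# Qualitative predictors (origin is stored as an integer code but is categorical)
qualitative_predictors <- c("origin", "name")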
print("Quantitative Predictors:")
## [1] "Quantitative Predictors:"
print(quantitative_predictors)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year"
print("Qualitative Predictors:")
## [1] "Qualitative Predictors:"
print(qualitative_predictors)
## [1] "origin" "name"
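Because origin is stored as an integer, R treats it as numeric unless it is converted. The following is a minimal sketch of that conversion, assuming the coding given in the ISLR2 documentation (1 = American, 2 = European, 3 = Japanese). The vector origin_codes is a hypothetical stand-in for Auto$origin so the snippet runs on its own.

```r
# Hypothetical stand-in for Auto$origin (integer codes 1, 2, 3)
origin_codes <- c(1L, 1L, 2L, 3L, 1L, 3L)

# Convert the codes to a labelled factor so R treats them as categories
origin_factor <- factor(origin_codes,
                        levels = 1:3,
                        labels = c("American", "European", "Japanese"))

# Count observations per category
print(table(origin_factor))
```

Applied to the real data, the same call would take Auto$origin in place of origin_codes.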
#(b) Range of Each Quantitative Predictor
Next, I calculated the range (minimum and maximum values) of each quantitative predictor using the range() function:
for (predictor in quantitative_predictors) {
print(paste("Range of", predictor, ":"))
print(range(Auto[[predictor]]))
}
## [1] "Range of mpg :"
## [1] 9.0 46.6
## [1] "Range of cylinders :"
## [1] 3 8
## [1] "Range of displacement :"
## [1] 68 455
## [1] "Range of horsepower :"
## [1] 46 230
## [1] "Range of weight :"
## [1] 1613 5140
## [1] "Range of acceleration :"
## [1] 8.0 24.8
## [1] "Range of year :"
## [1] 70 82
The output gives the following ranges:
mpg: 9.0 to 46.6 miles per gallon.
cylinders: 3 to 8 cylinders.
displacement: 68 to 455 cubic inches.
horsepower: 46 to 230 horsepower.
weight: 1613 to 5140 pounds.
acceleration: 8.0 to 24.8 seconds (time to accelerate from 0 to 60 mph).
year: 70 to 82 (model years 1970 to 1982).
#(c) Mean and Standard Deviation of Each Quantitative Predictor
I then calculated the mean and standard deviation of each quantitative predictor:
for (predictor in quantitative_predictors) {
print(paste("Mean of", predictor, ":", mean(Auto[[predictor]])))
print(paste("Standard deviation of", predictor, ":", sd(Auto[[predictor]])))
}
## [1] "Mean of mpg : 23.4459183673469"
## [1] "Standard deviation of mpg : 7.8050074865718"
## [1] "Mean of cylinders : 5.4719387755102"
## [1] "Standard deviation of cylinders : 1.70578324745278"
## [1] "Mean of displacement : 194.411989795919"
## [1] "Standard deviation of displacement : 104.644003908905"
## [1] "Mean of horsepower : 104.469387755102"
## [1] "Standard deviation of horsepower : 38.4911599328285"
## [1] "Mean of weight : 2977.58418367347"
## [1] "Standard deviation of weight : 849.402560042949"
## [1] "Mean of acceleration : 15.5413265306122"
## [1] "Standard deviation of acceleration : 2.75886411918808"
## [1] "Mean of year : 75.9795918367347"
## [1] "Standard deviation of year : 3.68373654357783"
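The output above lists the mean and standard deviation of each quantitative predictor. The spreads are not directly comparable across variables because the scales differ: weight has a standard deviation of about 849 pounds, while acceleration has one of about 2.76 seconds.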
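The range, mean, and standard deviation can also be collected into a single table instead of being printed in separate loops. The sketch below assumes a hypothetical helper, summarise_columns(), and runs on the first ten rows of three Auto columns as shown in the str() output, so the numbers describe only those ten rows and not the full dataset.

```r
# First ten rows of three Auto columns, copied from the str() output
auto_head <- data.frame(
  mpg        = c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15),
  horsepower = c(130, 165, 150, 150, 140, 198, 220, 215, 225, 190),
  weight     = c(3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850)
)

# Hypothetical helper: one row per variable with min, max, mean and sd
summarise_columns <- function(df, cols) {
  stats <- sapply(df[cols], function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  })
  # sapply() returns statistics in rows, so transpose to one row per variable
  round(t(stats), 2)
}

summary_table <- summarise_columns(auto_head, c("mpg", "horsepower", "weight"))
print(summary_table)
```

With ISLR2 loaded, summarise_columns(Auto, quantitative_predictors) would produce the same layout for the full data.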
#(d) Remove 10th to 85th Observations and Calculate Range, Mean, and Standard Deviation
I created a subset of the data by removing the 10th through 85th observations and calculated the range, mean, and standard deviation for the remaining data:
Auto_subset <- Auto[-c(10:85), ]
for (predictor in quantitative_predictors) {
print(paste("Range of", predictor, "in subset:"))
print(range(Auto_subset[[predictor]]))
print(paste("Mean of", predictor, "in subset:", mean(Auto_subset[[predictor]])))
print(paste("Standard deviation of", predictor, "in subset:", sd(Auto_subset[[predictor]])))
}
## [1] "Range of mpg in subset:"
## [1] 11.0 46.6
## [1] "Mean of mpg in subset: 24.4044303797468"
## [1] "Standard deviation of mpg in subset: 7.86728282443068"
## [1] "Range of cylinders in subset:"
## [1] 3 8
## [1] "Mean of cylinders in subset: 5.37341772151899"
## [1] "Standard deviation of cylinders in subset: 1.65417865185608"
## [1] "Range of displacement in subset:"
## [1] 68 455
## [1] "Mean of displacement in subset: 187.240506329114"
## [1] "Standard deviation of displacement in subset: 99.6783672303628"
## [1] "Range of horsepower in subset:"
## [1] 46 230
## [1] "Mean of horsepower in subset: 100.721518987342"
## [1] "Standard deviation of horsepower in subset: 35.7088532738003"
## [1] "Range of weight in subset:"
## [1] 1649 4997
## [1] "Mean of weight in subset: 2935.97151898734"
## [1] "Standard deviation of weight in subset: 811.30020815829"
## [1] "Range of acceleration in subset:"
## [1] 8.5 24.8
## [1] "Mean of acceleration in subset: 15.7268987341772"
## [1] "Standard deviation of acceleration in subset: 2.69372071752036"
## [1] "Range of year in subset:"
## [1] 70 82
## [1] "Mean of year in subset: 77.1455696202532"
## [1] "Standard deviation of year in subset: 3.10621690872137"
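Removing rows 10 through 85 drops 76 observations and leaves 316. Two examples from the output: mpg: Range = 11.0 to 46.6, Mean = 24.40, Standard Deviation = 7.87. weight: Range = 1649 to 4997, Mean = 2935.97, Standard Deviation = 811.30. The subset values differ only slightly from the full dataset: the mean of mpg rises from 23.45 to 24.40 and the mean of weight falls from 2977.58 to 2935.97, while the minimum mpg increases from 9.0 to 11.0.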
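Negative indexing is easy to get wrong by one row, so it can help to confirm how many observations were actually dropped. This is a small sketch on a hypothetical 100-row data frame (toy), not on Auto itself.

```r
# Hypothetical data frame with 100 numbered rows
toy <- data.frame(id = 1:100)

# Negative indices drop rows 10 through 85 inclusive;
# drop = FALSE keeps the result a data frame even with a single column
toy_subset <- toy[-c(10:85), , drop = FALSE]

# Confirm the number of rows removed and which ids remain
rows_removed <- nrow(toy) - nrow(toy_subset)
print(rows_removed)
print(range(toy_subset$id))
```

The same check on the real data would compare nrow(Auto) with nrow(Auto_subset).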
#(e) Investigate Predictors Graphically
I created scatterplots to visualize relationships between predictors:
pairs(Auto[quantitative_predictors], main = "Scatterplot Matrix of Quantitative Predictors")
#Scatterplot of mpg vs. weight
plot(Auto$weight, Auto$mpg, xlab = "Weight", ylab = "MPG", main = "MPG vs. Weight", col = "blue", pch = 16)
#Scatterplot of mpg vs. horsepower
plot(Auto$horsepower, Auto$mpg, xlab = "Horsepower", ylab = "MPG", main = "MPG vs. Horsepower", col = "red", pch = 16)
From the plots:
I noticed a strong negative correlation between mpg and weight: heavier cars tend to have lower gas mileage. Similarly, there is a negative correlation between mpg and horsepower: cars with more powerful engines tend to be less fuel-efficient.
#(f) Predict Gas Mileage (mpg) Based on Other Variables
Finally, I analyzed the correlation matrix to identify which variables might be useful for predicting mpg:
# Checking the correlation between mpg and other quantitative predictors
correlation_matrix <- cor(Auto[, quantitative_predictors])
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(correlation_matrix)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## acceleration year
## mpg 0.4233285 0.5805410
## cylinders -0.5046834 -0.3456474
## displacement -0.5438005 -0.3698552
## horsepower -0.6891955 -0.4163615
## weight -0.4168392 -0.3091199
## acceleration 1.0000000 0.2903161
## year 0.2903161 1.0000000
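From the correlation matrix, the variables most strongly related to mpg are: Weight: strong negative correlation (≈ -0.83). Displacement: strong negative correlation (≈ -0.81). Horsepower: strong negative correlation (≈ -0.78). Cylinders: strong negative correlation (≈ -0.78). Year (≈ 0.58) and acceleration (≈ 0.42) show moderate positive correlations. Weight, displacement, horsepower, and cylinders are therefore likely to be good predictors of gas mileage. They are also highly correlated with each other (for example, displacement and cylinders at ≈ 0.95), so they carry overlapping information.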
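Reading one row of a large correlation matrix by eye is error-prone, so the predictors can be ranked by the absolute value of their correlation with the response. The sketch below assumes a hypothetical helper, rank_by_correlation(), and uses only the first ten rows shown in the str() output, so its numbers will not match the full-data matrix above.

```r
# First ten rows of selected Auto columns, copied from the str() output
auto_head <- data.frame(
  mpg          = c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15),
  displacement = c(307, 350, 318, 304, 302, 429, 454, 440, 455, 390),
  horsepower   = c(130, 165, 150, 150, 140, 198, 220, 215, 225, 190),
  weight       = c(3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850),
  acceleration = c(12, 11.5, 11, 12, 10.5, 10, 9, 8.5, 10, 8.5)
)

# Hypothetical helper: correlation of every other column with the target,
# ordered from strongest to weakest by absolute value
rank_by_correlation <- function(df, target) {
  predictors <- setdiff(names(df), target)
  cors <- sapply(df[predictors], function(x) cor(x, df[[target]]))
  cors[order(abs(cors), decreasing = TRUE)]
}

ranked <- rank_by_correlation(auto_head, "mpg")
print(round(ranked, 3))
```

On the full data, the call would be rank_by_correlation(Auto[, quantitative_predictors], "mpg").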
##Boston Housing Dataset Analysis
I started by loading the Boston dataset from the ISLR2 library and inspecting its structure:
library(ISLR2)
data("Boston")
# Checking the structure of the dataset
str(Boston)
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# To find the number of rows and columns
n_rows <- nrow(Boston)
n_cols <- ncol(Boston)
print(paste("Number of rows:", n_rows))
## [1] "Number of rows: 506"
print(paste("Number of columns:", n_cols))
## [1] "Number of columns: 13"
# Open the help page for the dataset
?Boston
The dataset has 506 rows and 13 columns. Each row represents a census tract (neighborhood) in Boston. Each column represents a variable, such as the per capita crime rate, the average number of rooms, or the median home value.
#(b) Pairwise Scatterplots of Predictors
I created pairwise scatterplots to visualize relationships between predictors:
# Create pairwise scatterplots for all variables
pairs(Boston, main = "Pairwise Scatterplots of Boston Housing Predictors")
There is a negative correlation between medv (median home value) and
lstat (percentage of lower-status population): as lstat increases, medv
tends to decrease. A positive correlation exists between rm (average
number of rooms) and medv: homes with more rooms tend to have higher
values. Some predictors, like nox (nitrogen oxide concentration) and dis
(distance to employment centers), show non-linear relationships with
other variables.
#(c) Predictors Associated with Per Capita Crime Rate
I checked for predictors associated with crim (per capita crime rate) using scatterplots and correlation:
plot(Boston$crim, Boston$medv, xlab = "Crime Rate", ylab = "Median Home Value", main = "Crime Rate vs. Median Home Value", col = "blue", pch = 16)
correlation_matrix <- cor(Boston)
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(correlation_matrix)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## lstat medv
## crim 0.4556215 -0.3883046
## zn -0.4129946 0.3604453
## indus 0.6037997 -0.4837252
## chas -0.0539293 0.1752602
## nox 0.5908789 -0.4273208
## rm -0.6138083 0.6953599
## age 0.6023385 -0.3769546
## dis -0.4969958 0.2499287
## rad 0.4886763 -0.3816262
## tax 0.5439934 -0.4685359
## ptratio 0.3740443 -0.5077867
## lstat 1.0000000 -0.7376627
## medv -0.7376627 1.0000000
crim has a negative correlation with medv (≈ -0.39): areas with higher crime rates tend to have lower home values. Its strongest correlations are positive, with rad (index of accessibility to radial highways, ≈ 0.63) and tax (property tax rate, ≈ 0.58), followed by lstat (≈ 0.46).
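To pick out the predictors most associated with crim without scanning the whole matrix, the crim row can be filtered by a cutoff. The sketch below copies the crim correlations from the printed matrix. The 0.4 threshold is an arbitrary assumption for illustration, not a rule from the exercise.

```r
# Correlations of crim with the other variables, copied from the matrix above
crim_cor <- c(zn = -0.20046922, indus = 0.40658341, chas = -0.055891582,
              nox = 0.42097171, rm = -0.21924670, age = 0.35273425,
              dis = -0.37967009, rad = 0.625505145, tax = 0.58276431,
              ptratio = 0.2899456, lstat = 0.4556215, medv = -0.3883046)

# Keep variables whose absolute correlation reaches the (assumed) cutoff
cutoff <- 0.4
strong <- crim_cor[abs(crim_cor) >= cutoff]

# Show the retained variables from strongest to weakest
strong <- strong[order(abs(strong), decreasing = TRUE)]
print(round(strong, 2))
```

With the data loaded, crim_cor could be taken directly as cor(Boston)["crim", -1] instead of being typed in.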
I examined the range of key predictors:
# Collapse each range into one string so paste() does not repeat the label
print(paste("Range of crime rate:", paste(range(Boston$crim), collapse = " to ")))
## [1] "Range of crime rate: 0.00632 to 88.9762"
print(paste("Range of tax rate:", paste(range(Boston$tax), collapse = " to ")))
## [1] "Range of tax rate: 187 to 711"
print(paste("Range of pupil-teacher ratio:", paste(range(Boston$ptratio), collapse = " to ")))
## [1] "Range of pupil-teacher ratio: 12.6 to 22"
Crime rate: Ranges from 0.006 to 88.98. Some areas have extremely high crime rates. Tax rate: Ranges from 187 to 711. Higher tax rates are observed in certain areas. Pupil-teacher ratio: Ranges from 12.6 to 22.0. Some areas have significantly higher ratios, indicating fewer teachers per student.
I counted how many census tracts bound the Charles River:
charles_river_tracts <- sum(Boston$chas == 1)
print(paste("Number of tracts bounding the Charles River:", charles_river_tracts))
## [1] "Number of tracts bounding the Charles River: 35"
I calculated the median pupil-teacher ratio:
median_ptr <- median(Boston$ptratio)
print(paste("Median pupil-teacher ratio:", median_ptr))
## [1] "Median pupil-teacher ratio: 19.05"
I identified the census tract with the lowest median home value and compared its predictor values to the overall ranges:
lowest_medv_tract <- Boston[which.min(Boston$medv), ]
print("Census tract with the lowest median home value:")
## [1] "Census tract with the lowest median home value:"
print(lowest_medv_tract)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
print("Overall ranges for predictors:")
## [1] "Overall ranges for predictors:"
print(sapply(Boston, range))
## crim zn indus chas nox rm age dis rad tax ptratio lstat
## [1,] 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6 1.73
## [2,] 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0 37.97
## medv
## [1,] 5
## [2,] 50
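The tract with the lowest medv (5.0, the bottom of the overall range of 5 to 50) is row 399. It has: High lstat (percentage of lower-status population): 30.59 (compared to the overall range of 1.73 to 37.97). High crim (crime rate): 38.35 (compared to the overall range of 0.006 to 88.98). Relatively low rm (average number of rooms): 5.45 (compared to the overall range of 3.56 to 8.78). Its age (100), rad (24), and ptratio (20.2) are also at or near the top of their ranges.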
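Two caveats about this comparison can be sketched in code. First, which.min() returns only the first position of the minimum, so any tracts tied for the lowest medv would be missed. Second, a range shows only the extremes, whereas the share of observations at or below a value shows where it sits in the distribution. The vector medv_toy and the helper percentile_of() below are hypothetical and used only for illustration.

```r
# Hypothetical medv values with a tie for the minimum
medv_toy <- c(24.0, 5.0, 21.6, 5.0, 34.7, 16.5)

# which.min() returns only the first position of the minimum
first_min <- which.min(medv_toy)

# Comparing against min() returns every tied position
all_min <- which(medv_toy == min(medv_toy))

# Hypothetical helper: share of observations at or below a given value
percentile_of <- function(x, value) mean(x <= value)
p <- percentile_of(medv_toy, 16.5)

print(first_min)
print(all_min)
print(p)
```

On the real data, which(Boston$medv == min(Boston$medv)) would list every tract at the minimum, and percentile_of(Boston$lstat, 30.59) would place the tract's lstat within the full distribution.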