Assignment: Auto Dataset Analysis

9(a) Identify Quantitative and Qualitative Predictors

I started by loading the Auto dataset and inspecting its structure with str() to classify each variable. The classification comes first, followed by the R code I used and its output.

  1. Quantitative predictors: These are numerical variables: mpg (miles per gallon), cylinders (number of cylinders), displacement (engine displacement in cubic inches), horsepower (engine power), weight (car weight in pounds), acceleration (time to accelerate from 0 to 60 mph, in seconds), and year (model year).
  2. Qualitative predictors: These are categorical variables: origin (region where the car was made, stored as an integer code: 1 = American, 2 = European, 3 = Japanese) and name (car model name).

library(ISLR2)
data("Auto")

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
# Quantitative predictors
quantitative_predictors <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")

# Qualitative predictors
qualitative_predictors <- c("origin", "name")

print("Quantitative Predictors:")
## [1] "Quantitative Predictors:"
print(quantitative_predictors)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"
print("Qualitative Predictors:")
## [1] "Qualitative Predictors:"
print(qualitative_predictors)
## [1] "origin" "name"
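
Because origin is stored as an integer, str() alone does not show that it is qualitative. The sketch below is one way to make the split explicit in code. The helper name split_by_type and its categorical argument are hypothetical (they are not part of ISLR2), and the usage part assumes the ISLR2 package is installed.

```r
# Hypothetical helper: split column names into quantitative and qualitative.
# Factor and character columns count as qualitative; integer-coded categories
# must be named explicitly through `categorical`.
split_by_type <- function(df, categorical = character()) {
  is_qual <- vapply(df, function(col) is.factor(col) || is.character(col), logical(1))
  is_qual[names(df) %in% categorical] <- TRUE
  list(
    quantitative = names(df)[!is_qual],
    qualitative  = names(df)[is_qual]
  )
}

# Usage with Auto (runs only if ISLR2 is available)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  data("Auto", package = "ISLR2")
  # origin is an integer code, so it is flagged by name
  print(split_by_type(Auto, categorical = "origin"))
}
```

Flagging origin by name keeps the original data untouched; converting it with factor() would be an alternative if the column is later used in a model.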

9(b) Range of Each Quantitative Predictor

Next, I calculated the range (minimum and maximum values) of each quantitative predictor using the range() function:

for (predictor in quantitative_predictors) {
  print(paste("Range of", predictor, ":"))
  print(range(Auto[[predictor]]))
}
## [1] "Range of mpg :"
## [1]  9.0 46.6
## [1] "Range of cylinders :"
## [1] 3 8
## [1] "Range of displacement :"
## [1]  68 455
## [1] "Range of horsepower :"
## [1]  46 230
## [1] "Range of weight :"
## [1] 1613 5140
## [1] "Range of acceleration :"
## [1]  8.0 24.8
## [1] "Range of year :"
## [1] 70 82

The output gives the following ranges:

  1. mpg: 9.0 to 46.6 miles per gallon.
  2. cylinders: 3 to 8 cylinders.
  3. displacement: 68 to 455 cubic inches.
  4. horsepower: 46 to 230 horsepower.
  5. weight: 1613 to 5140 pounds.
  6. acceleration: 8.0 to 24.8 seconds.
  7. year: 70 to 82 (model years 1970 to 1982).
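
The same ranges can be collected in one call instead of a loop. This is a minimal sketch that takes a data frame and a character vector of column names; range_table is a hypothetical helper name, and the usage part assumes ISLR2 is installed.

```r
# Hypothetical helper: minimum and maximum of each selected column,
# returned as a matrix with one column per variable.
range_table <- function(df, cols) {
  out <- sapply(df[cols], range)
  rownames(out) <- c("min", "max")
  out
}

# Usage with Auto (runs only if ISLR2 is available)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  data("Auto", package = "ISLR2")
  num_cols <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")
  print(range_table(Auto, num_cols))
}
```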

9(c) Mean and Standard Deviation of Each Quantitative Predictor

I then calculated the mean and standard deviation of each quantitative predictor:

for (predictor in quantitative_predictors) {
  print(paste("Mean of", predictor, ":", mean(Auto[[predictor]])))
  print(paste("Standard deviation of", predictor, ":", sd(Auto[[predictor]])))
}
## [1] "Mean of mpg : 23.4459183673469"
## [1] "Standard deviation of mpg : 7.8050074865718"
## [1] "Mean of cylinders : 5.4719387755102"
## [1] "Standard deviation of cylinders : 1.70578324745278"
## [1] "Mean of displacement : 194.411989795919"
## [1] "Standard deviation of displacement : 104.644003908905"
## [1] "Mean of horsepower : 104.469387755102"
## [1] "Standard deviation of horsepower : 38.4911599328285"
## [1] "Mean of weight : 2977.58418367347"
## [1] "Standard deviation of weight : 849.402560042949"
## [1] "Mean of acceleration : 15.5413265306122"
## [1] "Standard deviation of acceleration : 2.75886411918808"
## [1] "Mean of year : 75.9795918367347"
## [1] "Standard deviation of year : 3.68373654357783"

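The output above lists the mean and standard deviation of each quantitative predictor. For example, mpg has a mean of about 23.45 and a standard deviation of about 7.81, and weight has a mean of about 2977.58 pounds and a standard deviation of about 849.40 pounds.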
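
The pasted output carries many more decimals than are useful. One possible way to get a compact, rounded table is sketched below; mean_sd_table and its digits argument are hypothetical names, and the usage part assumes ISLR2 is installed.

```r
# Hypothetical helper: rounded mean and standard deviation per column,
# one row per variable.
mean_sd_table <- function(df, cols, digits = 2) {
  data.frame(
    mean = round(sapply(df[cols], mean), digits),
    sd   = round(sapply(df[cols], sd), digits),
    row.names = cols
  )
}

# Usage with Auto (runs only if ISLR2 is available)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  data("Auto", package = "ISLR2")
  num_cols <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")
  print(mean_sd_table(Auto, num_cols))
}
```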

9(d) Remove 10th to 85th Observations and Calculate Range, Mean, and Standard Deviation

I created a subset of the data by removing the 10th through 85th observations and calculated the range, mean, and standard deviation for the remaining data:

Auto_subset <- Auto[-c(10:85), ]

for (predictor in quantitative_predictors) {
  print(paste("Range of", predictor, "in subset:"))
  print(range(Auto_subset[[predictor]]))
  print(paste("Mean of", predictor, "in subset:", mean(Auto_subset[[predictor]])))
  print(paste("Standard deviation of", predictor, "in subset:", sd(Auto_subset[[predictor]])))
}
## [1] "Range of mpg in subset:"
## [1] 11.0 46.6
## [1] "Mean of mpg in subset: 24.4044303797468"
## [1] "Standard deviation of mpg in subset: 7.86728282443068"
## [1] "Range of cylinders in subset:"
## [1] 3 8
## [1] "Mean of cylinders in subset: 5.37341772151899"
## [1] "Standard deviation of cylinders in subset: 1.65417865185608"
## [1] "Range of displacement in subset:"
## [1]  68 455
## [1] "Mean of displacement in subset: 187.240506329114"
## [1] "Standard deviation of displacement in subset: 99.6783672303628"
## [1] "Range of horsepower in subset:"
## [1]  46 230
## [1] "Mean of horsepower in subset: 100.721518987342"
## [1] "Standard deviation of horsepower in subset: 35.7088532738003"
## [1] "Range of weight in subset:"
## [1] 1649 4997
## [1] "Mean of weight in subset: 2935.97151898734"
## [1] "Standard deviation of weight in subset: 811.30020815829"
## [1] "Range of acceleration in subset:"
## [1]  8.5 24.8
## [1] "Mean of acceleration in subset: 15.7268987341772"
## [1] "Standard deviation of acceleration in subset: 2.69372071752036"
## [1] "Range of year in subset:"
## [1] 70 82
## [1] "Mean of year in subset: 77.1455696202532"
## [1] "Standard deviation of year in subset: 3.10621690872137"

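Removing 76 rows leaves 316 observations. Two examples from the output:

  1. mpg: Range = 11.0 to 46.6, Mean = 24.40, Standard Deviation = 7.87.
  2. weight: Range = 1649 to 4997, Mean = 2935.97, Standard Deviation = 811.30.

Compared with the full dataset, the mpg mean rises slightly (23.45 to 24.40) and the weight mean falls (2977.58 to 2935.97). The ranges of mpg, weight, and acceleration narrow, while the ranges of cylinders, displacement, horsepower, and year are unchanged.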
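
To see which statistics actually move, the full-data and subset means can be placed side by side. This is a sketch that assumes rows are dropped by position, as in Auto[-c(10:85), ]; compare_means is a hypothetical helper name, and the usage part assumes ISLR2 is installed.

```r
# Hypothetical helper: mean of each column before and after dropping rows
# by position, plus the difference (subset minus full).
compare_means <- function(df, drop_rows, cols) {
  kept <- df[-drop_rows, , drop = FALSE]
  full_mean   <- sapply(df[cols], mean)
  subset_mean <- sapply(kept[cols], mean)
  data.frame(
    full_mean   = full_mean,
    subset_mean = subset_mean,
    change      = subset_mean - full_mean,
    row.names   = cols
  )
}

# Usage with Auto (runs only if ISLR2 is available)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  data("Auto", package = "ISLR2")
  num_cols <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")
  print(round(compare_means(Auto, 10:85, num_cols), 2))
}
```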

9(e) Investigate Predictors Graphically

I created scatterplots to visualize relationships between predictors:

pairs(Auto[quantitative_predictors], main = "Scatterplot Matrix of Quantitative Predictors")

# Scatterplot of mpg vs. weight
plot(Auto$weight, Auto$mpg, xlab = "Weight", ylab = "MPG", main = "MPG vs. Weight", col = "blue", pch = 16)

# Scatterplot of mpg vs. horsepower
plot(Auto$horsepower, Auto$mpg, xlab = "Horsepower", ylab = "MPG", main = "MPG vs. Horsepower", col = "red", pch = 16)

From the plots:

There is a strong negative association between mpg and weight: heavier cars tend to have lower gas mileage. Similarly, there is a negative association between mpg and horsepower: cars with more powerful engines tend to be less fuel-efficient. Both relationships appear curved rather than strictly linear.

9(f) Predict Gas Mileage (mpg) Based on Other Variables

Finally, I analyzed the correlation matrix to identify which variables might be useful for predicting mpg:

# Checking the correlation between mpg and other quantitative predictors
correlation_matrix <- cor(Auto[, quantitative_predictors])
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(correlation_matrix)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

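The mpg row of the correlation matrix shows:

  1. weight: the strongest negative correlation with mpg (correlation ≈ -0.83).
  2. displacement: strong negative correlation with mpg (correlation ≈ -0.81).
  3. horsepower: strong negative correlation with mpg (correlation ≈ -0.78).
  4. cylinders: strong negative correlation with mpg (correlation ≈ -0.78).

These variables are likely to be good predictors of gas mileage because they show strong relationships with mpg. They are also highly correlated with one another (pairwise correlations of 0.84 to 0.95), so they carry overlapping information. year (≈ 0.58) and acceleration (≈ 0.42) are positively but more weakly correlated with mpg.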
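
Reading one row out of a 7 x 7 matrix by eye is error-prone, so the predictors can instead be sorted by the strength of their correlation with the response. This is a minimal sketch assuming all columns passed in are numeric; rank_by_correlation is a hypothetical helper name, and the usage part assumes ISLR2 is installed.

```r
# Hypothetical helper: correlations of every other column with `response`,
# sorted from strongest to weakest by absolute value.
rank_by_correlation <- function(df, response) {
  cors <- cor(df)[, response]
  cors <- cors[names(cors) != response]  # drop the response's self-correlation
  cors[order(abs(cors), decreasing = TRUE)]
}

# Usage with Auto (runs only if ISLR2 is available)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  data("Auto", package = "ISLR2")
  num_cols <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")
  print(round(rank_by_correlation(Auto[num_cols], "mpg"), 2))
}
```

Given the matrix printed above, the resulting order would be weight, displacement, horsepower, cylinders, year, acceleration.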

Assignment: Boston Housing Dataset Analysis

10(a) Load the Dataset and Inspect Its Structure

I started by loading the Boston dataset from the ISLR2 library and inspecting its structure:

library(ISLR2)
data("Boston")

# Checking the structure of the dataset
str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# To find the number of rows and columns
n_rows <- nrow(Boston)
n_cols <- ncol(Boston)

print(paste("Number of rows:", n_rows))
## [1] "Number of rows: 506"
print(paste("Number of columns:", n_cols))
## [1] "Number of columns: 13"
# Open the help page that describes the dataset
?Boston

The dataset has 506 rows and 13 columns. Each row represents a census tract (neighborhood) in the Boston area. Each column is a variable, such as per capita crime rate (crim), average number of rooms per dwelling (rm), and median home value (medv).

10(b) Pairwise Scatterplots of Predictors

I created pairwise scatterplots to visualize relationships between predictors:

# Create pairwise scatterplots for all variables
pairs(Boston, main = "Pairwise Scatterplots of Boston Housing Predictors")

There is a negative correlation between medv (median home value) and lstat (percentage of lower-status population): as lstat increases, medv tends to decrease. A positive correlation exists between rm (average number of rooms) and medv: homes with more rooms tend to have higher values. Some predictors, like nox (nitrogen oxide concentration) and dis (distance to employment centers), show non-linear relationships with other variables.

10(c) Predictors Associated with Per Capita Crime Rate

I checked for predictors associated with crim (per capita crime rate) using scatterplots and correlation:

plot(Boston$crim, Boston$medv, xlab = "Crime Rate", ylab = "Median Home Value", main = "Crime Rate vs. Median Home Value", col = "blue", pch = 16)

correlation_matrix <- cor(Boston)
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(correlation_matrix)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##              lstat       medv
## crim     0.4556215 -0.3883046
## zn      -0.4129946  0.3604453
## indus    0.6037997 -0.4837252
## chas    -0.0539293  0.1752602
## nox      0.5908789 -0.4273208
## rm      -0.6138083  0.6953599
## age      0.6023385 -0.3769546
## dis     -0.4969958  0.2499287
## rad      0.4886763 -0.3816262
## tax      0.5439934 -0.4685359
## ptratio  0.3740443 -0.5077867
## lstat    1.0000000 -0.7376627
## medv    -0.7376627  1.0000000

crim has a negative correlation with medv (correlation ≈ -0.39): areas with higher crime rates tend to have lower home values. Its strongest correlations are positive: with rad (index of accessibility to radial highways, ≈ 0.63) and tax (property tax rate, ≈ 0.58), followed by lstat (≈ 0.46) and nox (≈ 0.42).

10(d) High Crime Rates, Tax Rates, and Pupil-Teacher Ratios

I examined the range of key predictors:

# range() returns two values, so collapse them into one string before pasting
print(paste("Range of crime rate:", paste(range(Boston$crim), collapse = " to ")))
## [1] "Range of crime rate: 0.00632 to 88.9762"
print(paste("Range of tax rate:", paste(range(Boston$tax), collapse = " to ")))
## [1] "Range of tax rate: 187 to 711"
print(paste("Range of pupil-teacher ratio:", paste(range(Boston$ptratio), collapse = " to ")))
## [1] "Range of pupil-teacher ratio: 12.6 to 22"

  1. Crime rate (crim): Ranges from 0.006 to 88.98. The maximum is far above the minimum, so some tracts have extremely high crime rates.
  2. Tax rate (tax): Ranges from 187 to 711, so the highest rate is nearly four times the lowest.
  3. Pupil-teacher ratio (ptratio): Ranges from 12.6 to 22.0. Tracts at the upper end have more pupils per teacher, that is, fewer teachers per student.
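
The range alone does not say how many tracts sit at the high end. One way to quantify this is to count the tracts above a chosen cutoff, as sketched below. The helper name count_above and the cutoff values are illustrative assumptions, not values given in the exercise, and the usage part assumes ISLR2 is installed.

```r
# Hypothetical helper: number of values strictly above a cutoff,
# ignoring missing values.
count_above <- function(x, cutoff) {
  sum(x > cutoff, na.rm = TRUE)
}

# Usage with Boston (runs only if ISLR2 is available).
# The cutoffs 20, 600, and 20 are arbitrary illustrations.
if (requireNamespace("ISLR2", quietly = TRUE)) {
  data("Boston", package = "ISLR2")
  print(count_above(Boston$crim, 20))
  print(count_above(Boston$tax, 600))
  print(count_above(Boston$ptratio, 20))
}
```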

10(e) Census Tracts Bounding the Charles River

I counted how many census tracts bound the Charles River:

charles_river_tracts <- sum(Boston$chas == 1)
print(paste("Number of tracts bounding the Charles River:", charles_river_tracts))
## [1] "Number of tracts bounding the Charles River: 35"

10(f) Median Pupil-Teacher Ratio

I calculated the median pupil-teacher ratio:

median_ptr <- median(Boston$ptratio)
print(paste("Median pupil-teacher ratio:", median_ptr))
## [1] "Median pupil-teacher ratio: 19.05"

10(g) Census Tract with the Lowest Median Home Value

I identified the census tract with the lowest median home value and compared its predictor values to the overall ranges:

lowest_medv_tract <- Boston[which.min(Boston$medv), ]

print("Census tract with the lowest median home value:")
## [1] "Census tract with the lowest median home value:"
print(lowest_medv_tract)
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
print("Overall ranges for predictors:")
## [1] "Overall ranges for predictors:"
print(sapply(Boston, range))
##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio lstat
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6  1.73
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 37.97
##      medv
## [1,]    5
## [2,]   50

The tract with the lowest medv (5.0) has:

  1. High lstat (percentage of lower-status population): 30.59 (compared to the overall range of 1.73 to 37.97).
  2. High crim (crime rate): 38.35 (compared to the overall range of 0.006 to 88.98).
  3. Low rm (average number of rooms): 5.45 (compared to the overall range of 3.56 to 8.78).
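
Two refinements may help here. First, which.min() returns only the first position of the minimum, so any tied tracts would be missed. Second, "high" and "low" can be stated as percentile ranks rather than judged against the range by eye. The sketch below covers both; all_minima and percentile_ranks are hypothetical helper names, and the usage part assumes ISLR2 is installed.

```r
# Hypothetical helper: positions of every element equal to the minimum
# (which.min() would return only the first one).
all_minima <- function(x) {
  which(x == min(x))
}

# Hypothetical helper: for one row, the share of observations in each column
# that are less than or equal to that row's value (0 to 1).
percentile_ranks <- function(df, row) {
  vapply(names(df),
         function(col) mean(df[[col]] <= df[[col]][row]),
         numeric(1))
}

# Usage with Boston (runs only if ISLR2 is available)
if (requireNamespace("ISLR2", quietly = TRUE)) {
  data("Boston", package = "ISLR2")
  print(all_minima(Boston$medv))
  print(round(percentile_ranks(Boston, which.min(Boston$medv)), 2))
}
```

A percentile rank near 1 means the tract is among the highest on that variable, and a rank near 0 means it is among the lowest.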