Lab1 02/03/2025

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(MASS) 
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# install.packages("ISLR")  # run once, before library(ISLR), only if the package is not yet installed

9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

data(Auto)
Auto <- na.omit(Auto)  # Remove missing values

(a) Which of the predictors are quantitative, and which are qualitative?

  • Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year
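  • Qualitative predictors: name, origin (origin is stored as the numeric codes 1 = American, 2 = European, 3 = Japanese, but the codes are labels, not quantities)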
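
The classification above can be checked against how R stores each column. The snippet below is a minimal sketch, assuming the ISLR package is installed; the names auto, col_classes, and origin_f are illustrative and not part of the original lab.

```r
library(ISLR)

auto <- na.omit(Auto)

# Storage class of every column: origin is reported as numeric
# even though it encodes a category.
col_classes <- sapply(auto, class)
print(col_classes)

# Recode origin as a labelled factor (labels follow the ISLR help page for Auto).
origin_f <- factor(auto$origin, levels = 1:3,
                   labels = c("American", "European", "Japanese"))
print(table(origin_f))
```

Treating origin as a factor keeps later summaries and models from interpreting the codes 1, 2, 3 as a numeric scale.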

(b) What is the range of each quantitative predictor? You can answer this using the min() and max() methods in numpy.

# Remove categorical columns
Auto_numeric <- Auto[,c("mpg","cylinders","displacement","horsepower","weight","acceleration","year")]

# Compute min and max separately
min_values <- sapply(Auto_numeric, min)
max_values <- sapply(Auto_numeric, max)

# Combine results into a dataframe
range_values <- data.frame(Min = min_values, Max = max_values)

range_values
##               Min    Max
## mpg             9   46.6
## cylinders       3    8.0
## displacement   68  455.0
## horsepower     46  230.0
## weight       1613 5140.0
## acceleration    8   24.8
## year           70   82.0
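
As an alternative to calling min() and max() separately, base R's range() returns both values at once. This is only a sketch of the same computation; auto, quant_cols, range_tbl, and the extra Spread column are illustrative additions.

```r
library(ISLR)

auto <- na.omit(Auto)
quant_cols <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")

# range() returns c(min, max); sapply() stacks them as columns, so transpose.
range_tbl <- t(sapply(auto[, quant_cols], range))
colnames(range_tbl) <- c("Min", "Max")

# Width of each range (max minus min).
range_tbl <- cbind(range_tbl, Spread = range_tbl[, "Max"] - range_tbl[, "Min"])
print(range_tbl)
```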

(c) What is the mean and standard deviation of each quantitative predictor?

mean_val <- sapply(Auto_numeric, mean)
sd_val <- sapply(Auto_numeric, sd)

res_val <- data.frame(Mean = mean_val, Standard_Dev = sd_val)

res_val
##                     Mean Standard_Dev
## mpg            23.445918     7.805007
## cylinders       5.471939     1.705783
## displacement  194.411990   104.644004
## horsepower    104.469388    38.491160
## weight       2977.584184   849.402560
## acceleration   15.541327     2.758864
## year           75.979592     3.683737

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

Auto_subset <- Auto[-c(10:85), ]
Auto_filter <- Auto_subset[, c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")]
# Compute range, mean, and standard deviation
sapply(Auto_filter, function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))
##            mpg cylinders displacement horsepower    weight acceleration
## min  11.000000  3.000000     68.00000   46.00000 1649.0000     8.500000
## max  46.600000  8.000000    455.00000  230.00000 4997.0000    24.800000
## mean 24.404430  5.373418    187.24051  100.72152 2935.9715    15.726899
## sd    7.867283  1.654179     99.67837   35.70885  811.3002     2.693721
##           year
## min  70.000000
## max  82.000000
## mean 77.145570
## sd    3.106217

(e) Using the full data set, investigate the predictors graphically, using scatter plots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

# Scatterplot matrix
pairs(Auto[,c("mpg","cylinders","displacement","horsepower","weight","acceleration","year")])

  • The pairs plot shows the pairwise relationship between every two of the quantitative variables.

  • mpg does not change monotonically with the number of cylinders. 4-cylinder cars tend to have the highest mpg, 6-cylinder cars are lower, and 8-cylinder cars have the lowest. The few 3-cylinder cars fall below the 4-cylinder cars, so fewer cylinders does not always mean better fuel efficiency.

  • Displacement, horsepower, and weight are negatively correlated with mpg: as these predictors increase, mpg decreases. The relationships look curved rather than strictly linear.

  • mpg tends to increase with model year, so later cars are more fuel-efficient than earlier ones.
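
The cylinder pattern read off the pairs plot can be checked numerically by grouping mpg by cylinder count. A minimal sketch, assuming the ISLR package is installed; auto and cyl_summary are illustrative names.

```r
library(ISLR)

auto <- na.omit(Auto)

# Number of cars and mean mpg for each cylinder count.
n_by_cyl <- table(auto$cylinders)
mean_by_cyl <- tapply(auto$mpg, auto$cylinders, mean)

cyl_summary <- data.frame(
  cylinders = names(mean_by_cyl),
  n         = as.vector(n_by_cyl),
  mean_mpg  = round(as.vector(mean_by_cyl), 2)
)
print(cyl_summary)
```

The group sizes matter here: conclusions about cylinder counts with only a handful of cars should be treated as tentative.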

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

ggplot(Auto, aes(x = weight, y = mpg)) + geom_point() + geom_smooth(method = "lm") + ggtitle("MPG vs. Weight")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Auto, aes(x = horsepower, y = mpg)) + geom_point() + geom_smooth(method = "lm") + ggtitle("MPG vs. Horsepower")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Auto, aes(x = displacement, y = mpg)) + geom_point() + geom_smooth(method = "lm") + ggtitle("MPG vs. Displacement")
## `geom_smooth()` using formula = 'y ~ x'

  • The scatter plots indicate that weight, horsepower, and displacement have strong negative correlations with mpg.
  • Cars with higher weight, horsepower, and displacement tend to have lower mpg, suggesting these predictors could be useful in predicting gas mileage.
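
The visual impression from the scatter plots can be backed by the correlation of mpg with each of the other quantitative variables. This is a sketch only; auto, quant_cols, and cor_with_mpg are illustrative names, and Pearson correlation understates relationships that are curved.

```r
library(ISLR)

auto <- na.omit(Auto)
quant_cols <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")

# Row "mpg" of the correlation matrix, dropping mpg's correlation with itself.
cor_with_mpg <- cor(auto[, quant_cols])["mpg", -1]

# Sorted from most negative to most positive.
print(round(sort(cor_with_mpg), 2))
```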

10. This exercise involves the Boston housing data set.

(a) To begin, load in the Boston data set, which is part of the ISLP library.

# ISLP is a Python package, so it is not available in R; the same data set ships with the MASS package.
library(MASS) 
# Load the Boston data set
data("Boston")

How many rows are in this data set? How many columns? What do the rows and columns represent?

dim(Boston)
## [1] 506  14
?Boston
## starting httpd help server ... done

The Boston data set has 506 rows, each representing a suburb (town) of Boston, and 14 columns, each representing a variable such as crime rate or property tax rate.

Here’s a brief description of the columns:

  1. crim: Per capita crime rate by town.

  2. zn: Proportion of residential land zoned for large lots (over 25,000 sq. ft.).

  3. indus: Proportion of non-retail business acres per town.

  4. chas: Charles River dummy variable (1 if tract bounds the river, 0 otherwise).

  5. nox: Nitrogen oxides concentration (parts per 10 million).

  6. rm: Average number of rooms per dwelling.

  7. age: Proportion of owner-occupied units built before 1940.

  8. dis: Weighted mean of distances to five Boston employment centers.

  9. rad: Index of accessibility to radial highways.

  10. tax: Property tax rate per $10,000.

  11. ptratio: Pupil-teacher ratio by town.

  12. black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town. This is a transformed index, not the proportion itself.

  13. lstat: Percentage of lower status population.

  14. medv: Median value of owner-occupied homes (in $1,000s).
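
The black column is easy to misread, because MASS defines it as 1000(Bk - 0.63)^2 rather than as the proportion Bk itself. The sketch below inverts that formula under one stated assumption: it takes the smaller root, which is the only root that is a valid proportion when black is above 136.9. The names bk_low and ambiguous are illustrative.

```r
library(MASS)

# Smaller root of black = 1000 * (Bk - 0.63)^2, clipped at 0 to absorb rounding error.
bk_low <- pmax(0, 0.63 - sqrt(Boston$black / 1000))

# The larger root, 0.63 + sqrt(black / 1000), is a valid proportion (<= 1)
# only when black <= 1000 * 0.37^2 = 136.9, so those rows are ambiguous.
ambiguous <- Boston$black <= 1000 * 0.37^2

# Implied proportion for the unambiguous rows, and the number of ambiguous rows.
print(summary(bk_low[!ambiguous]))
print(sum(ambiguous))
```

A value at the top of the scale, such as 396.9, therefore corresponds to a proportion of Black residents near zero, not to a large one.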

(b) Make some pairwise scatter plots of the predictors in this data set. Describe your findings.

# Pairwise scatterplot
pairs(Boston[,c("crim","medv","rm","dis","nox")])

  • As the crime rate and the nitrogen oxides concentration increase, the median home value tends to decrease.

  • The median home value rises clearly with the average number of rooms. Its relationship with distance to employment centers is much weaker and, if anything, slightly positive: suburbs farther from the centers tend to have somewhat higher values.
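
The direction of each relationship seen in the scatter plots can be confirmed with the correlation between medv and the other plotted variables. A minimal sketch; vars and cor_medv are illustrative names.

```r
library(MASS)

vars <- c("crim", "rm", "dis", "nox")

# Pearson correlation of each selected predictor with median home value.
cor_medv <- sapply(Boston[, vars], cor, y = Boston$medv)
print(round(cor_medv, 2))
```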

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

# Check correlation of predictors with crime rate
cor(Boston$crim, Boston[, -1])
##              zn     indus        chas       nox         rm       age        dis
## [1,] -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343 -0.3796701
##            rad       tax   ptratio      black     lstat       medv
## [1,] 0.6255051 0.5827643 0.2899456 -0.3850639 0.4556215 -0.3883046
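  • Crime rate is most strongly correlated with accessibility to radial highways (rad, 0.63), followed by the property tax rate (tax, 0.58), the lower-status share (lstat, 0.46), nitrogen oxides concentration (nox, 0.42), and the proportion of non-retail business acres (indus, 0.41).

  • Crime rate is negatively correlated with median home value (medv, -0.39), the black index (-0.39), and distance to employment centers (dis, -0.38).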
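
Sorting the correlations by absolute value makes the ranking easier to read than the raw matrix. This is a sketch of one way to do it; crim_cor and crim_cor_sorted are illustrative names.

```r
library(MASS)

# Correlation of crim with every other column, flattened to a named vector.
crim_cor <- cor(Boston$crim, Boston[, -1])[1, ]

# Strongest associations first, regardless of sign.
crim_cor_sorted <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
print(round(crim_cor_sorted, 2))
```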

(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

summary(Boston[, c("crim", "tax", "ptratio")])
##       crim               tax           ptratio     
##  Min.   : 0.00632   Min.   :187.0   Min.   :12.60  
##  1st Qu.: 0.08205   1st Qu.:279.0   1st Qu.:17.40  
##  Median : 0.25651   Median :330.0   Median :19.05  
##  Mean   : 3.61352   Mean   :408.2   Mean   :18.46  
##  3rd Qu.: 3.67708   3rd Qu.:666.0   3rd Qu.:20.20  
##  Max.   :88.97620   Max.   :711.0   Max.   :22.00
  • Crime rates: values range from 0.00632 to 88.9762. The median (0.26) is far below the mean (3.61), so most suburbs have low crime rates while a few have very high ones.

  • Tax rates: property tax rates range from 187 to 711. The third quartile (666) sits close to the maximum, so a sizeable group of suburbs shares a high rate rather than a few isolated extremes.

  • Pupil-teacher ratios: values range from 12.60 to 22.00, a much narrower spread than for crime or tax rates, but it still indicates some variation in educational resources across suburbs.
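
One way to make "particularly high" concrete is the usual boxplot rule, which flags values above Q3 + 1.5 × IQR. The rule is a convention chosen here for illustration, not something the exercise prescribes; count_upper_outliers and n_high are hypothetical names.

```r
library(MASS)

# Count values above the upper boxplot fence, Q3 + 1.5 * IQR.
count_upper_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x > q[2] + 1.5 * (q[2] - q[1]))
}

n_high <- sapply(Boston[, c("crim", "tax", "ptratio")], count_upper_outliers)
print(n_high)
```

Under this rule only crim has suburbs beyond the fence; the highest tax and ptratio values stay inside it.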

(e) How many of the suburbs in this data set bound the Charles river?

sum(Boston$chas == 1)
## [1] 35
  • 35 suburbs bound the Charles River.

(f) What is the median pupil-teacher ratio among the towns in this data set?

median(Boston$ptratio)
## [1] 19.05
  • The median pupil-teacher ratio, computed from the ptratio column, is 19.05.

(g) Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

# Find suburb with lowest median value of homes
min_medv_row <- Boston[which.min(Boston$medv), ]
min_medv_row
##        crim zn indus chas   nox    rm age    dis rad tax ptratio black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.9 30.59
##     medv
## 399    5
# Range of the predictors
smry <- data.frame(sapply(Boston, function(x) c(min = min(x), max = max(x), avg = mean(x))))
smry
##          crim        zn    indus       chas       nox       rm      age
## min  0.006320   0.00000  0.46000 0.00000000 0.3850000 3.561000   2.9000
## max 88.976200 100.00000 27.74000 1.00000000 0.8710000 8.780000 100.0000
## avg  3.613524  11.36364 11.13678 0.06916996 0.5546951 6.284634  68.5749
##           dis       rad      tax  ptratio   black    lstat     medv
## min  1.129600  1.000000 187.0000 12.60000   0.320  1.73000  5.00000
## max 12.126500 24.000000 711.0000 22.00000 396.900 37.97000 50.00000
## avg  3.795043  9.549407 408.2372 18.45553 356.674 12.65306 22.53281
  • The suburb with the lowest median home value is row 399 (medv = 5). Its predictor values are compared below with the minimum, maximum, and average of each predictor.

  • Its crime rate (38.35) is far above the average (3.61), and no land is zoned for large residential lots (zn = 0, the minimum).

  • The proportion of non-retail business acres (18.1) is above the average (11.1).

  • The suburb does not bound the Charles River, and its nitrogen oxides concentration (0.693) is well above the average (0.555). All of its owner-occupied units were built before 1940 (age = 100, the maximum).

  • It is close to the employment centers (dis = 1.49, near the minimum of 1.13) and has the highest index of accessibility to radial highways (rad = 24, the maximum). Its tax rate (666) and pupil-teacher ratio (20.2) are both above average.

  • The lower-status share is very high (lstat = 30.59 against an average of 12.65). The black index is at its maximum (396.9); because black = 1000(Bk - 0.63)^2, this corresponds to a proportion of Black residents close to zero, not to a majority.
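
The comparison with the overall ranges can be summarised as a percentile rank: the share of suburbs whose value is less than or equal to this suburb's value for each predictor. A minimal sketch; low_row and pct_rank are illustrative names.

```r
library(MASS)

# Suburb with the lowest median home value.
low_row <- Boston[which.min(Boston$medv), ]

# For each column, the fraction of suburbs at or below this suburb's value.
pct_rank <- sapply(names(Boston), function(v) mean(Boston[[v]] <= low_row[[v]]))
print(round(pct_rank, 2))
```

Ranks near 1 mark predictors where the suburb sits at the top of the range, and ranks near 0 mark predictors where it sits at the bottom.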

(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

# Suburbs with more than 7 rooms and more than 8 rooms
sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
# Column means for the suburbs averaging more than eight rooms per dwelling
mt8 <- data.frame(Avg = sapply(Boston[Boston$rm > 8, ], mean))
mt8
##                 Avg
## crim      0.7187954
## zn       13.6153846
## indus     7.0784615
## chas      0.1538462
## nox       0.5392385
## rm        8.3485385
## age      71.5384615
## dis       3.4301923
## rad       7.4615385
## tax     325.0769231
## ptratio  16.3615385
## black   385.2107692
## lstat     4.3100000
## medv     44.2000000
  • 64 suburbs average more than seven rooms per dwelling, and 13 average more than eight.

  • The suburbs averaging more than eight rooms have a much lower crime rate than the overall average (0.72 vs. 3.61) and a slightly higher share of land zoned for large lots (13.6 vs. 11.4). Only 2 of the 13 (chas mean 0.154) bound the Charles River, so most of them do not, although the share is higher than the roughly 7% among all suburbs.

  • Their nitrogen oxides levels are close to the overall average (0.54 vs. 0.55), and their housing stock is slightly older than average (age 71.5 vs. 68.6).

  • They are marginally closer to employment centers (dis 3.43 vs. 3.80) and have a somewhat lower index of accessibility to radial highways (rad 7.46 vs. 9.55).

  • Property tax rates are below the overall average (325 vs. 408), and the pupil-teacher ratio is lower (16.4 vs. 18.5).

  • The black index is high (385.2 on average). Because black = 1000(Bk - 0.63)^2, this corresponds to a proportion of Black residents close to zero, not to a majority. The lower-status share is very low (lstat 4.31 vs. 12.65).

  • The median home value (44.2) is roughly double the overall average (22.5) and close to the maximum of 50.
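
Placing the means for the large-home suburbs next to the overall means makes the comparison above easier to verify. A minimal sketch; big and comparison are illustrative names.

```r
library(MASS)

# Logical index of suburbs averaging more than eight rooms per dwelling.
big <- Boston$rm > 8

# Column means for that subset next to the means over all suburbs.
comparison <- cbind(rm_gt_8 = colMeans(Boston[big, ]),
                    overall = colMeans(Boston))
print(round(comparison, 2))
```

With only 13 suburbs in the subset, these means are sensitive to individual towns, so the differences are best read as descriptive rather than conclusive.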