library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# install.packages("ISLR")  # run once, before library(ISLR), only if the package is not installed
data(Auto)
Auto <- na.omit(Auto) # Remove missing values
(a) Which of the predictors are quantitative, and which are qualitative?
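Part (a) can be checked directly from the data. The sketch below assumes the ISLR package is installed; it prints the storage class of each column and the number of distinct values in each. Note that `origin` is stored as a number but only takes a few coded values, so it is reasonable to treat it, together with `name`, as qualitative, which matches the columns dropped in part (b). The variable names `col_class` and `distinct_counts` are illustrative.

```r
library(ISLR)  # assumes the ISLR package is installed
data(Auto)

# Storage class of every column (numeric vs. factor)
col_class <- sapply(Auto, class)
col_class

# Number of distinct values per column; a handful of values hints at a category
distinct_counts <- sapply(Auto, function(x) length(unique(x)))
distinct_counts
```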
(b) What is the range of each quantitative predictor? You can answer this using the min() and max() functions in R.
# Remove categorical columns
Auto_numeric <- Auto[,c("mpg","cylinders","displacement","horsepower","weight","acceleration","year")]
# Compute min and max separately
min_values <- sapply(Auto_numeric, min)
max_values <- sapply(Auto_numeric, max)
# Combine results into a dataframe
range_values <- data.frame(Min = min_values, Max = max_values)
range_values
## Min Max
## mpg 9 46.6
## cylinders 3 8.0
## displacement 68 455.0
## horsepower 46 230.0
## weight 1613 5140.0
## acceleration 8 24.8
## year 70 82.0
(c) What is the mean and standard deviation of each quantitative predictor?
mean_val <- sapply(Auto_numeric, mean)
sd_val <- sapply(Auto_numeric, sd)
res_val <- data.frame(Mean = mean_val, Standard_Dev = sd_val)
res_val
## Mean Standard_Dev
## mpg 23.445918 7.805007
## cylinders 5.471939 1.705783
## displacement 194.411990 104.644004
## horsepower 104.469388 38.491160
## weight 2977.584184 849.402560
## acceleration 15.541327 2.758864
## year 75.979592 3.683737
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
Auto_subset <- Auto[-c(10:85), ]
Auto_filter <- Auto_subset[, c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")]
# Compute range, mean, and standard deviation
sapply(Auto_filter, function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))
## mpg cylinders displacement horsepower weight acceleration
## min 11.000000 3.000000 68.00000 46.00000 1649.0000 8.500000
## max 46.600000 8.000000 455.00000 230.00000 4997.0000 24.800000
## mean 24.404430 5.373418 187.24051 100.72152 2935.9715 15.726899
## sd 7.867283 1.654179 99.67837 35.70885 811.3002 2.693721
## year
## min 70.000000
## max 82.000000
## mean 77.145570
## sd 3.106217
(e) Using the full data set, investigate the predictors graphically, using scatter plots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
# Scatterplot matrix
pairs(Auto[,c("mpg","cylinders","displacement","horsepower","weight","acceleration","year")])
The pairs plot shows the pairwise relationships among the quantitative variables.
The relationship between cylinders and mpg is not monotonic: 4-cylinder cars tend to have the best mileage, ahead of the 3-, 6-, and 8-cylinder cars, and the 8-cylinder cars have the lowest mileage.
Displacement, horsepower, and weight are negatively correlated with mpg: as these predictors increase, mpg decreases.
mpg improves with model year: newer cars tend to be more efficient than older ones.
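The cylinder comparison is hard to read off a scatterplot matrix, so a grouped mean is a useful check. This is a minimal sketch, assuming the ISLR package is installed; `mpg_by_cyl` is an illustrative name.

```r
library(ISLR)  # assumes the ISLR package is installed
data(Auto)
Auto <- na.omit(Auto)

# Mean mpg for each distinct cylinder count
mpg_by_cyl <- aggregate(mpg ~ cylinders, data = Auto, FUN = mean)
mpg_by_cyl
```

A boxplot such as `boxplot(mpg ~ cylinders, data = Auto)` would show the same comparison graphically.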
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
ggplot(Auto, aes(x = weight, y = mpg)) + geom_point() + geom_smooth(method = "lm") + ggtitle("MPG vs. Weight")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Auto, aes(x = horsepower, y = mpg)) + geom_point() + geom_smooth(method = "lm") + ggtitle("MPG vs. Horsepower")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Auto, aes(x = displacement, y = mpg)) + geom_point() + geom_smooth(method = "lm") + ggtitle("MPG vs. Displacement")
## `geom_smooth()` using formula = 'y ~ x'
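The three plots above can be backed up with numbers. The sketch below, which assumes the ISLR package is installed, computes the correlation of mpg with each of the other quantitative variables; a large absolute value suggests the variable may be useful for predicting mpg. The names `num_cols` and `mpg_cor` are illustrative.

```r
library(ISLR)  # assumes the ISLR package is installed
data(Auto)
Auto <- na.omit(Auto)

num_cols <- c("mpg", "cylinders", "displacement", "horsepower",
              "weight", "acceleration", "year")

# Row "mpg" of the correlation matrix, without mpg's correlation with itself
mpg_cor <- cor(Auto[, num_cols])["mpg", -1]
sort(mpg_cor)
```

Correlation only measures linear association, so it complements the scatter plots rather than replacing them.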
(a) To begin, load in the Boston data set, which is part of the ISLP library.
# ISLP is a Python package; in R the same data set is available in the MASS library.
library(MASS)
# Load the Boston data set
data("Boston")
How many rows are in this data set? How many columns? What do the rows and columns represent?
dim(Boston)
## [1] 506 14
?Boston
## starting httpd help server ... done
The Boston data set has 506 rows, one per suburb, and 14 columns, one per variable (crime rate, tax rate, etc.).
Here’s a brief description of the columns:
crim: Per capita crime rate by town.
zn: Proportion of residential land zoned for large lots (over 25,000 sq. ft.).
indus: Proportion of non-retail business acres per town.
chas: Charles River dummy variable (1 if tract bounds the river, 0 otherwise).
nox: Nitrogen oxides concentration (parts per 10 million).
rm: Average number of rooms per dwelling.
age: Proportion of owner-occupied units built before 1940.
dis: Weighted distance to employment centers.
rad: Index of accessibility to radial highways.
tax: Property tax rate per $10,000.
ptratio: Pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
lstat: Percentage of lower status population.
medv: Median value of owner-occupied homes (in $1,000s).
(b) Make some pairwise scatter plots of the predictors in this data set. Describe your findings.
# Pairwise scatterplot
pairs(Boston[,c("crim","medv","rm","dis","nox")])
As the crime rate and the nitrogen oxides concentration increase, the median home value tends to decrease.
The median home value increases clearly with the average number of rooms, and it also tends to be somewhat higher farther from the employment centers.
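The direction of each relationship in the scatterplot matrix can be confirmed with a correlation matrix over the same five variables. This is a small sketch; `vars` and `cor_mat` are illustrative names.

```r
library(MASS)
data("Boston")

vars <- c("crim", "medv", "rm", "dis", "nox")

# Pairwise correlations for the variables shown in the pairs plot
cor_mat <- cor(Boston[, vars])
round(cor_mat, 2)
```

The sign of each entry in the `medv` row gives the direction of the association with median home value.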
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
# Check correlation of predictors with crime rate
cor(Boston$crim, Boston[, -1])
## zn indus chas nox rm age dis
## [1,] -0.2004692 0.4065834 -0.05589158 0.4209717 -0.2192467 0.3527343 -0.3796701
## rad tax ptratio black lstat medv
## [1,] 0.6255051 0.5827643 0.2899456 -0.3850639 0.4556215 -0.3883046
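Crime rate is most strongly correlated with the index of accessibility to radial highways (rad, 0.63) and the property tax rate (tax, 0.58), followed by lstat (0.46), nox (0.42), and indus (0.41).
The strongest negative correlations are with medv (-0.39), black (-0.39), and dis (-0.38): crime tends to be lower in suburbs with higher home values and in suburbs farther from the employment centers.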
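Sorting the correlations makes the ranking easier to read than the wide printout. A minimal sketch, with `crim_cor` as an illustrative name:

```r
library(MASS)
data("Boston")

# Correlation of every other column with crim, as a named vector
crim_cor <- cor(Boston[, -1], Boston$crim)[, 1]

# Strongest positive association first, strongest negative last
crim_cor <- sort(crim_cor, decreasing = TRUE)
round(crim_cor, 2)
```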
(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
summary(Boston[, c("crim", "tax", "ptratio")])
## crim tax ptratio
## Min. : 0.00632 Min. :187.0 Min. :12.60
## 1st Qu.: 0.08205 1st Qu.:279.0 1st Qu.:17.40
## Median : 0.25651 Median :330.0 Median :19.05
## Mean : 3.61352 Mean :408.2 Mean :18.46
## 3rd Qu.: 3.67708 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :88.97620 Max. :711.0 Max. :22.00
Crime rates: the per capita crime rate ranges from 0.00632 to 88.9762. The median (0.26) is far below the mean (3.61), so most suburbs have low crime rates while a few have very high ones.
Tax rates: the property tax rate ranges from 187 to 711 per $10,000. The third quartile (666) is close to the maximum, so a sizable group of suburbs sits at the high end.
Pupil-teacher ratios: the ratio ranges from 12.60 to 22.00, a much narrower spread than for crime or tax rates, though it still indicates some variation in educational resources across suburbs.
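One way to make "particularly high" concrete is the usual boxplot rule, which flags values above the third quartile plus 1.5 times the interquartile range. The threshold is an assumption chosen for illustration, not something the exercise prescribes; `cols` and `n_high` are illustrative names.

```r
library(MASS)
data("Boston")

cols <- c("crim", "tax", "ptratio")

# Count suburbs above the upper boxplot fence (Q3 + 1.5 * IQR) for each variable
n_high <- sapply(Boston[, cols], function(x) {
  upper_fence <- quantile(x, 0.75) + 1.5 * IQR(x)
  sum(x > upper_fence)
})
n_high
```

Under this rule only variables with a long upper tail produce a nonzero count, which helps separate a few extreme suburbs from a variable that is merely spread out.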
(e) How many of the suburbs in this data set bound the Charles river?
sum(Boston$chas == 1)
## [1] 35
(f) What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
(g) Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
# Find suburb with lowest median value of homes
min_medv_row <- Boston[which.min(Boston$medv), ]
min_medv_row
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59
## medv
## 399 5
# Range of the predictors
smry <- data.frame(sapply(Boston, function(x) c(min = min(x), max = max(x), avg = mean(x))))
smry
## crim zn indus chas nox rm age
## min 0.006320 0.00000 0.46000 0.00000000 0.3850000 3.561000 2.9000
## max 88.976200 100.00000 27.74000 1.00000000 0.8710000 8.780000 100.0000
## avg 3.613524 11.36364 11.13678 0.06916996 0.5546951 6.284634 68.5749
## dis rad tax ptratio black lstat medv
## min 1.129600 1.000000 187.0000 12.60000 0.320 1.73000 5.00000
## max 12.126500 24.000000 711.0000 22.00000 396.900 37.97000 50.00000
## avg 3.795043 9.549407 408.2372 18.45553 356.674 12.65306 22.53281
The suburb with the lowest median home value (row 399, medv = 5) is listed above; its predictor values can be compared with the overall minimum, maximum, and mean of each predictor.
Its crime rate (38.35) is far above the average of 3.61, and none of its residential land is zoned for large lots (zn = 0).
Its proportion of non-retail business acres (indus = 18.1) is above the average of 11.14.
It does not bound the Charles River, and its nitrogen oxides concentration (nox = 0.693) is well above the average of 0.555.
It is close to the employment centers (dis = 1.49, near the minimum of 1.13) and has the highest index of accessibility to radial highways (rad = 24, the maximum).
Its percentage of lower status population (lstat = 30.59) is far above the average of 12.65. Its black value (396.9) is the maximum in the data set; because black is defined as 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents, this value corresponds to a proportion near zero rather than to a majority.
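Two details are worth checking in code. First, `which.min()` returns only the first row that attains the minimum, so any ties are silently dropped; filtering on the minimum value lists them all. Second, comparing each value with the full distribution is easier with a percentile rank than with min, max, and mean alone. The sketch below uses the illustrative names `lowest` and `pct_rank`.

```r
library(MASS)
data("Boston")

# All suburbs that share the lowest median home value, not just the first
lowest <- Boston[Boston$medv == min(Boston$medv), ]
lowest

# Share of suburbs with a value less than or equal to that of the first such suburb
pct_rank <- sapply(names(Boston), function(v) mean(Boston[[v]] <= lowest[1, v]))
round(pct_rank, 2)
```

A percentile rank of 1 means no suburb has a larger value for that variable; a rank near 0 means almost all suburbs have a larger value.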
(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
# Suburbs with more than 7 rooms and more than 8 rooms
sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
mt8 <- data.frame(Avg = sapply(Boston[Boston$rm > 8, ], mean))
mt8
## Avg
## crim 0.7187954
## zn 13.6153846
## indus 7.0784615
## chas 0.1538462
## nox 0.5392385
## rm 8.3485385
## age 71.5384615
## dis 3.4301923
## rad 7.4615385
## tax 325.0769231
## ptratio 16.3615385
## black 385.2107692
## lstat 4.3100000
## medv 44.2000000
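There are 64 suburbs that average more than seven rooms per dwelling and 13 that average more than eight.
The suburbs that average more than eight rooms have a much lower crime rate (0.72 against an overall mean of 3.61) and a slightly higher proportion of land zoned for large lots (13.6 against 11.4). About 15% of them (2 of 13) bound the Charles River, compared with about 7% of all suburbs.
Their nitrogen oxides levels are close to the overall average (0.539 against 0.555), and their housing stock is slightly older than average (age 71.5 against 68.6).
They are slightly closer to the employment centers than average (dis 3.43 against 3.80) and have a lower index of accessibility to radial highways (rad 7.46 against 9.55).
Their property tax rates are below average (325 against 408), and their pupil-teacher ratio is lower than average (16.4 against 18.5).
Their black values are above average (385.2 against 356.7); because black is defined as 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents, such high values correspond to low proportions, not to a majority. The lower status population is very small (lstat 4.31 against 12.65).
Their median home values (44.2 on average) are about double the overall mean of 22.53 and close to the maximum of 50.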
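Placing the group means next to the overall means makes the comparison above easier to verify. This is a minimal sketch; `big` and `comparison` are illustrative names.

```r
library(MASS)
data("Boston")

# Logical index of suburbs averaging more than eight rooms per dwelling
big <- Boston$rm > 8

# Column means for that group side by side with the means over all suburbs
comparison <- data.frame(
  rm_gt_8 = colMeans(Boston[big, ]),
  overall = colMeans(Boston)
)
round(comparison, 2)
```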