QUESTION 9
install.packages("ISLR2", repos = "https://cloud.r-project.org/")
## Installing package into 'C:/Users/saisr/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ISLR2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\saisr\AppData\Local\Temp\RtmpCkAdgm\downloaded_packages
library(ISLR2)
data(Auto) # Load the dataset
str(Auto) # Check the structure
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
head(Auto) # View the first few rows
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
9a:
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Quantitative variables (numeric and measured on a meaningful scale): mpg (num), cylinders (int), displacement (num), horsepower (int), weight (int), acceleration (num), and year (int).
Qualitative variables (categorical): origin, which is stored as an int but is only a code for the region of manufacture (1 = American, 2 = European, 3 = Japanese), and name, a factor with 304 levels.
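The classification above can be checked in code. The following is a minimal sketch, assuming ISLR2 is already installed; the names `auto_typed`, `quantitative`, and `qualitative` are illustrative and do not appear in the original analysis. It converts origin to a labelled factor so that R no longer treats it as a number.

```r
library(ISLR2)  # assumes the package is already installed
data(Auto)

# Storage class of every column; an integer column is not automatically quantitative
col_classes <- sapply(Auto, class)
print(col_classes)

# Work on a copy (hypothetical name) and recode origin as a labelled factor,
# using the region codes documented in ?Auto
auto_typed <- Auto
auto_typed$origin <- factor(auto_typed$origin,
                            levels = 1:3,
                            labels = c("American", "European", "Japanese"))

# After the recode, is.numeric() separates the two groups cleanly
quantitative <- names(auto_typed)[sapply(auto_typed, is.numeric)]
qualitative  <- setdiff(names(auto_typed), quantitative)
print(quantitative)
print(qualitative)
```

Recoding on a copy leaves `Auto` itself unchanged, so the later parts of the exercise still run against the original data frame.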
9b:
sapply(Auto[, sapply(Auto, is.numeric)], range)
## mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 9.0 3 68 46 1613 8.0 70 1
## [2,] 46.6 8 455 230 5140 24.8 82 3
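Here we can see the range, i.e., the minimum and maximum values, of each numeric variable. The quantitative variables are mpg through year; origin appears only because it is stored as an integer, and its range (1 to 3) is just the span of its category codes.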
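To restrict the output to the quantitative variables, the columns can be selected by name rather than by `is.numeric`. This is a sketch under the assumption that the seven variables listed in 9a are the quantitative ones; `quant_vars` and `quant_ranges` are illustrative names, not part of the original code.

```r
library(ISLR2)  # assumes the package is already installed
data(Auto)

# Quantitative variables as classified in 9a (origin and name are left out)
quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")

# range() returns c(min, max), so sapply() builds a 2 x 7 matrix
quant_ranges <- sapply(Auto[, quant_vars], range)
rownames(quant_ranges) <- c("min", "max")
print(quant_ranges)
```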
9c:
sapply(Auto[, sapply(Auto, is.numeric)], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year origin
## 75.979592 1.576531
sapply(Auto[, sapply(Auto, is.numeric)], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.8050075 1.7057832 104.6440039 38.4911599 849.4025600 2.7588641
## year origin
## 3.6837365 0.8055182
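Here we can see the mean and standard deviation of each numeric variable. The values for origin are not meaningful, because origin is a category code rather than a measurement.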
9d:
Create a subset without rows 10 through 85:
Auto_subset <- Auto[-(10:85), ]
Calculate the range, mean, and standard deviation of each numeric variable in the subset:
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], range) # Range
## mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0 3 68 46 1649 8.5 70 1
## [2,] 46.6 8 455 230 4997 24.8 82 3
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], mean) # Mean
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year origin
## 77.145570 1.601266
sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], sd) # Standard deviation
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year origin
## 3.106217 0.819910
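Comparing these figures with 9c by eye is error-prone, so the two sets of statistics can be placed side by side. The sketch below is one way to do that; the helper `summarise_quant` and the other new names are hypothetical and not part of the original code. In the printed outputs above, for example, mean mpg rises from 23.45 to 24.40 and mean weight falls from 2977.58 to 2935.97 once rows 10 through 85 are removed.

```r
library(ISLR2)  # assumes the package is already installed
data(Auto)

quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")

# Hypothetical helper: one row per variable, columns min / max / mean / sd
summarise_quant <- function(df, vars) {
  t(sapply(df[, vars], function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  }))
}

auto_subset  <- Auto[-(10:85), ]           # same subset as in 9d
full_stats   <- summarise_quant(Auto, quant_vars)
subset_stats <- summarise_quant(auto_subset, quant_vars)

# Side-by-side means and the shift caused by dropping rows 10 through 85
mean_compare <- cbind(full_mean   = full_stats[, "mean"],
                      subset_mean = subset_stats[, "mean"],
                      shift       = subset_stats[, "mean"] - full_stats[, "mean"])
print(round(mean_compare, 3))
```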
9e:
install.packages("ggplot2", repos = "https://cloud.r-project.org/")
## Installing package into 'C:/Users/saisr/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\saisr\AppData\Local\Temp\RtmpCkAdgm\downloaded_packages
library(ggplot2)
ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  ggtitle("MPG vs Weight") +
  xlab("Weight") +
  ylab("Miles Per Gallon")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", col = "purple") +
  ggtitle("MPG vs Horsepower") +
  xlab("Horsepower") +
  ylab("Miles Per Gallon")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Auto, aes(x = as.factor(cylinders), y = mpg, fill = as.factor(cylinders))) +
  geom_boxplot() +
  ggtitle("MPG by Number of Cylinders") +
  xlab("Cylinders") +
  ylab("Miles Per Gallon") +
  theme_minimal()
9f: Based on the scatter plots and the box plot of mpg against the other variables, several variables look useful for predicting mpg:
- Weight: strong negative correlation. Heavier cars require more fuel to move, which reduces fuel efficiency. The relationship is close to linear, making weight an important variable to include in any regression model for predicting mpg.
- Horsepower: negative correlation. More powerful engines are less fuel-efficient. The relationship looks roughly linear, but it may be worth exploring non-linear models to capture any more complex relationship between horsepower and mpg.
- Year: positive correlation. Newer cars tend to be more fuel-efficient, so the year of manufacture is an important variable for predicting mpg.
- Cylinders: cars with fewer cylinders are more fuel-efficient.
- Displacement: larger engines are typically less fuel-efficient.
- Origin: the region of manufacture is associated with differences in fuel efficiency.
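The plots in 9e cover only weight, horsepower, and cylinders, so the statements about the remaining variables can be backed up numerically. The following is a minimal sketch, not the original author's method: it computes the Pearson correlation of mpg with each quantitative variable and the mean mpg for each origin code. The names `mpg_cor` and `mpg_by_origin` are illustrative.

```r
library(ISLR2)  # assumes the package is already installed
data(Auto)

quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")

# Correlation of mpg with every other quantitative variable
mpg_cor <- cor(Auto[, quant_vars])["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]   # drop the trivial self-correlation
print(round(sort(mpg_cor), 2))

# origin is categorical, so compare group means instead of a correlation
mpg_by_origin <- tapply(Auto$mpg, Auto$origin, mean)
print(round(mpg_by_origin, 2))
```

A correlation only measures linear association, so a curved relationship such as the one suspected for horsepower can be understated by these numbers.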
QUESTION 10
10a:
# Load the Boston data set
data("Boston")
# Check the structure of the Boston dataset
str(Boston)
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 13
10b:
library(ISLR2)
# Load the Boston data set
data("Boston")
# Scatterplot matrix of all 13 variables
pairs(Boston, main = "Pairwise Scatterplots of Boston Housing Data")
Findings (correlation values are taken from the matrix computed in 10c):
- Crime rate vs. property tax (crim, tax): positive association (correlation about 0.58). Tracts with higher crime rates tend to have higher property-tax rates.
- Crime rate vs. number of rooms (crim, rm): weak negative association (about -0.22). Tracts with higher crime rates tend to have slightly fewer rooms per dwelling, possibly because of economic factors affecting the area.
- Average rooms per dwelling vs. property tax (rm, tax): weak negative association (about -0.29), not the positive one that might be expected if larger homes simply meant higher tax rates.
- Distance to employment centers vs. house price (dis, medv): weak positive association (about 0.25). Tracts farther from the employment centers tend to have somewhat higher median home values.
10c:
# Pairwise correlations; the crim row and column give each predictor's correlation with crime rate
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## lstat medv
## crim 0.4556215 -0.3883046
## zn -0.4129946 0.3604453
## indus 0.6037997 -0.4837252
## chas -0.0539293 0.1752602
## nox 0.5908789 -0.4273208
## rm -0.6138083 0.6953599
## age 0.6023385 -0.3769546
## dis -0.4969958 0.2499287
## rad 0.4886763 -0.3816262
## tax 0.5439934 -0.4685359
## ptratio 0.3740443 -0.5077867
## lstat 1.0000000 -0.7376627
## medv -0.7376627 1.0000000
# Scatterplot of crime rate (response) against the property-tax rate
plot(Boston$tax, Boston$crim, main = "Crime Rate vs Tax Rate", xlab = "Tax Rate", ylab = "Crime Rate")
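The full matrix is hard to scan for the answer to this part, which concerns crim only. The sketch below pulls out the crim column and orders it by absolute correlation; `crim_cor` and `crim_cor_sorted` are illustrative names. In the matrix printed above, rad (0.63), tax (0.58), and lstat (0.46) have the strongest associations with crime rate, while chas (-0.06) has almost none.

```r
library(ISLR2)  # assumes the package is already installed
data(Boston)

# Correlation of every other variable with per capita crime rate
crim_cor <- cor(Boston)[, "crim"]
crim_cor <- crim_cor[names(crim_cor) != "crim"]   # drop the self-correlation of 1

# Strongest associations first, regardless of sign
crim_cor_sorted <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
print(round(crim_cor_sorted, 3))
```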
10d:
For this part, we can summarize the range of each predictor.
# Summary statistics of each predictor
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Comment:
- Crime rate (crim): range 0.006 to 88.98. Some census tracts have very high crime rates: the maximum (88.98) is far above the mean (3.61) and the median (0.26), so the distribution is strongly right-skewed and most tracts have low crime rates.
- Tax rate (tax): range 187 to 711. The median is 330 and the mean is 408, but the third quartile is already 666, so a sizable group of tracts has very high property-tax rates close to the maximum.
- Pupil-teacher ratio (ptratio): range 12.6 to 22. The maximum of 22 is not far above the median of 19.05, so some tracts have relatively high ratios but none is extreme.
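One way to make "particularly high" concrete is to count the tracts above an upper fence for each of the three variables. The sketch below uses the common 1.5 x IQR rule, which is an assumption introduced here and not something the exercise or the original analysis prescribes; `upper_fence` and `n_above` are illustrative names. With the quartiles printed above, the fence for tax (666 + 1.5 x 387 = 1246.5) and for ptratio (20.2 + 1.5 x 2.8 = 24.4) both lie above the observed maxima, so only crim has tracts beyond its fence under this rule.

```r
library(ISLR2)  # assumes the package is already installed
data(Boston)

vars <- c("crim", "tax", "ptratio")

# Upper fence = third quartile + 1.5 * IQR (an assumed, conventional threshold)
upper_fence <- sapply(Boston[, vars], function(x) {
  quantile(x, 0.75, names = FALSE) + 1.5 * IQR(x)
})

# Number of census tracts above the fence for each variable
n_above <- sapply(vars, function(v) sum(Boston[[v]] > upper_fence[[v]]))

print(round(upper_fence, 2))
print(n_above)
```

The fence says nothing about the cluster of tracts with a tax rate of 666 or more, which is unusual because of how many tracts share it rather than because any single value is extreme.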
10e:
# Count the number of census tracts bounding the Charles River
sum(Boston$chas == 1)
## [1] 35
10f:
# Median pupil-teacher ratio
median(Boston$ptratio)
## [1] 19.05
10g:
# Find the index of the census tract with the lowest 'medv'
lowest_medv_index <- which.min(Boston$medv)
# View the details of that census tract
Boston[lowest_medv_index, ]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
The census tract with the lowest median value of owner-occupied homes (medv = 5) has the following values for the predictors:
- Crime rate (crim): 38.35
- Proportion of residential land zoned for large lots (zn): 0
- Industrial proportion (indus): 18.1
- Charles River dummy (chas): 0
- Nitrogen oxides concentration (nox): 0.693
- Average number of rooms (rm): 5.453
- Age of homes (age): 100
- Distance to employment centers (dis): 1.49
- Radial highway accessibility (rad): 24
- Property tax rate (tax): 666
- Pupil-teacher ratio (ptratio): 20.2
- Percentage of lower-income population (lstat): 30.59
- Median value of owner-occupied homes (medv): 5
Comparison to overall ranges:
- Crime rate (crim): 38.35 (minimum 0.00632, maximum 88.98). This is very high, more than ten times the third quartile of 3.68. A high crime rate suggests a less safe area, which can significantly lower property values.
- Proportion of residential land zoned for large lots (zn): 0 (maximum 100). No land is zoned for large lots, which is consistent with a more urbanized area with smaller homes and lower property values. A value of 0 is also the median, so this is common across tracts.
- Industrial proportion (indus): 18.1 (maximum 27.74). This equals the third quartile, so it is on the higher end and indicates a high industrial presence. Areas with heavy industrial activity tend to have lower residential property values because of noise, pollution, and less aesthetic appeal.
- Charles River dummy (chas): 0 (maximum 1). The tract does not bound the Charles River, like most tracts (only 35 of 506 do). Tracts that bound the river tend to have slightly higher values (correlation with medv about 0.18).
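The comparison above covers four predictors; the same check can be run for every column at once. This is a minimal sketch with illustrative names (`tract`, `pct_rank`, `comparison`): for each variable it reports the tract's value, the overall minimum and maximum, and the percentage of tracts with a value at or below the tract's value.

```r
library(ISLR2)  # assumes the package is already installed
data(Boston)

# which.min() returns only the first minimum;
# which(Boston$medv == min(Boston$medv)) would list any ties
lowest_idx <- which.min(Boston$medv)
tract <- Boston[lowest_idx, ]

# Percentage of tracts whose value is at or below this tract's value
pct_rank <- sapply(names(Boston), function(v) {
  mean(Boston[[v]] <= tract[[v]]) * 100
})

comparison <- data.frame(
  value           = unlist(tract),
  min             = sapply(Boston, min),
  max             = sapply(Boston, max),
  pct_at_or_below = round(pct_rank, 1),
  row.names       = names(Boston)
)
print(comparison)
```

A percentage near 100 means the tract sits at the top of the range for that variable, and a percentage near 0 means it sits at the bottom.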
10h:
sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
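So 64 census tracts average more than seven rooms per dwelling and 13 average more than eight. To comment on the tracts with more than eight rooms, their medians can be set against the medians for all tracts. The sketch below is one possible way to do this; `big_rooms` and `median_compare` are illustrative names, and no results are quoted here because they are not shown in the original output.

```r
library(ISLR2)  # assumes the package is already installed
data(Boston)

# Tracts averaging more than eight rooms per dwelling
big_rooms <- Boston[Boston$rm > 8, ]

# Median of every variable: large-room tracts versus all tracts
median_compare <- rbind(
  rm_over_8  = sapply(big_rooms, median),
  all_tracts = sapply(Boston, median)
)
print(round(median_compare, 3))
```

Medians are used instead of means because several variables, crim in particular, are strongly skewed.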