I am importing the libraries needed to run these notes.
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
# Loading the dataset
data(Auto)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
# Auto Dataset Exercise
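Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, and year are numerical.
Qualitative predictors: name is a categorical variable. origin is stored as a number, but it is a code for the region of origin (1 = American, 2 = European, 3 = Japanese), so it is better treated as qualitative. It is still included in the numeric summaries below for completeness.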
The range of each quantitative predictor can be calculated with the range() function. The results, which agree with the summary() output above, are:
mpg: 9 to 46.6
cylinders: 3 to 8
displacement: 68 to 455
horsepower: 46 to 230
weight: 1613 to 5140
acceleration: 8 to 24.8
year: 70 to 82
origin: 1 to 3
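As a minimal sketch, the ranges listed above can be computed for all columns in one call with sapply(). The names quant_vars and ranges are my own choices for illustration and are not part of the exercise.

```r
library(ISLR)
data(Auto)

# Numeric columns summarised in these notes (origin is a numeric code)
quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year", "origin")

# range() returns c(min, max); sapply() binds the results into a 2 x 8 matrix
ranges <- sapply(Auto[, quant_vars], range)
rownames(ranges) <- c("min", "max")
print(ranges)
```

Each column of the resulting matrix holds the minimum and maximum of one variable, so the values can be checked directly against the summary() output.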
cat("\nmpg:\n")
##
## mpg:
cat("Mean =", mean(Auto$mpg, na.rm = TRUE), "\n")
## Mean = 23.44592
cat("Standard Deviation =", sd(Auto$mpg, na.rm = TRUE), "\n")
## Standard Deviation = 7.805007
cat("\ncylinders:\n")
##
## cylinders:
cat("Mean =", mean(Auto$cylinders, na.rm = TRUE), "\n")
## Mean = 5.471939
cat("Standard Deviation =", sd(Auto$cylinders, na.rm = TRUE), "\n")
## Standard Deviation = 1.705783
cat("\ndisplacement:\n")
##
## displacement:
cat("Mean =", mean(Auto$displacement, na.rm = TRUE), "\n")
## Mean = 194.412
cat("Standard Deviation =", sd(Auto$displacement, na.rm = TRUE), "\n")
## Standard Deviation = 104.644
cat("\nhorsepower:\n")
##
## horsepower:
cat("Mean =", mean(Auto$horsepower, na.rm = TRUE), "\n")
## Mean = 104.4694
cat("Standard Deviation =", sd(Auto$horsepower, na.rm = TRUE), "\n")
## Standard Deviation = 38.49116
cat("\nweight:\n")
##
## weight:
cat("Mean =", mean(Auto$weight, na.rm = TRUE), "\n")
## Mean = 2977.584
cat("Standard Deviation =", sd(Auto$weight, na.rm = TRUE), "\n")
## Standard Deviation = 849.4026
cat("\nacceleration:\n")
##
## acceleration:
cat("Mean =", mean(Auto$acceleration, na.rm = TRUE), "\n")
## Mean = 15.54133
cat("Standard Deviation =", sd(Auto$acceleration, na.rm = TRUE), "\n")
## Standard Deviation = 2.758864
cat("\nyear:\n")
##
## year:
cat("Mean =", mean(Auto$year, na.rm = TRUE), "\n")
## Mean = 75.97959
cat("Standard Deviation =", sd(Auto$year, na.rm = TRUE), "\n")
## Standard Deviation = 3.683737
cat("\norigin:\n")
##
## origin:
cat("Mean =", mean(Auto$origin, na.rm = TRUE), "\n")
## Mean = 1.576531
cat("Standard Deviation =", sd(Auto$origin, na.rm = TRUE), "\n")
## Standard Deviation = 0.8055182
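The repeated cat() calls above could probably be condensed into one table. The following is only a sketch of that idea; quant_vars and summary_stats are hypothetical names introduced here.

```r
library(ISLR)
data(Auto)

# Same eight numeric columns as in the calls above
quant_vars <- c("mpg", "cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year", "origin")

# One row per variable, one column per statistic
summary_stats <- data.frame(
  mean = sapply(Auto[, quant_vars], mean),
  sd   = sapply(Auto[, quant_vars], sd)
)
print(round(summary_stats, 3))
```

The row names of summary_stats are the variable names, so a single value can be looked up with, for example, summary_stats["mpg", "mean"].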
# Remove the 10th through 85th observations
Auto_subset <- Auto[-(10:85), ]
print_stats <- function(data, variable_name) {
  cat("\n", variable_name, ":\n")
  cat("Range: ", range(data, na.rm = TRUE), "\n")
  cat("Mean: ", mean(data, na.rm = TRUE), "\n")
  cat("Standard Deviation: ", sd(data, na.rm = TRUE), "\n")
}
# Apply the function to each quantitative predictor
print_stats(Auto_subset$mpg, "mpg")
##
## mpg :
## Range: 11 46.6
## Mean: 24.40443
## Standard Deviation: 7.867283
print_stats(Auto_subset$cylinders, "cylinders")
##
## cylinders :
## Range: 3 8
## Mean: 5.373418
## Standard Deviation: 1.654179
print_stats(Auto_subset$displacement, "displacement")
##
## displacement :
## Range: 68 455
## Mean: 187.2405
## Standard Deviation: 99.67837
print_stats(Auto_subset$horsepower, "horsepower")
##
## horsepower :
## Range: 46 230
## Mean: 100.7215
## Standard Deviation: 35.70885
print_stats(Auto_subset$weight, "weight")
##
## weight :
## Range: 1649 4997
## Mean: 2935.972
## Standard Deviation: 811.3002
print_stats(Auto_subset$acceleration, "acceleration")
##
## acceleration :
## Range: 8.5 24.8
## Mean: 15.7269
## Standard Deviation: 2.693721
print_stats(Auto_subset$year, "year")
##
## year :
## Range: 70 82
## Mean: 77.14557
## Standard Deviation: 3.106217
print_stats(Auto_subset$origin, "origin")
##
## origin :
## Range: 1 3
## Mean: 1.601266
## Standard Deviation: 0.81991
# Scatterplot of horsepower vs mpg
plot(Auto$horsepower, Auto$mpg, main = "Horsepower vs MPG", xlab = "Horsepower", ylab = "Miles per Gallon")
# Scatterplot of weight vs mpg
plot(Auto$weight, Auto$mpg, main = "Weight vs MPG", xlab = "Weight", ylab = "Miles per Gallon")
Findings: Both plots show that gas mileage (mpg) decreases as horsepower
and weight increase. The relationship is negative but curved rather than
strictly linear: mpg drops quickly at low values and levels off at high
ones. A plausible explanation is that more powerful engines burn more
fuel, and heavier cars need more power to maintain speed and to
accelerate.
plot(Auto$displacement, Auto$mpg,
main = "Displacement vs MPG",
xlab = "Engine Displacement",
ylab = "Miles per Gallon (MPG)",
pch = 19, col = "blue")
Finding: mpg falls as engine displacement increases. A larger
displacement lets the engine draw more air and fuel into the cylinders
on each cycle, which generates more power but burns more fuel, so
mileage is lower.
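To put a number on what the three scatterplots suggest, the correlation of mpg with each predictor can be computed. This is a sketch only; the names predictors and mpg_cor are my own. Pearson correlation measures linear association, so it may understate a curved relationship.

```r
library(ISLR)
data(Auto)

# Predictors plotted against mpg in the scatterplots above
predictors <- c("horsepower", "weight", "displacement")

# Pearson correlation of mpg with each predictor
mpg_cor <- sapply(Auto[, predictors], function(x) cor(Auto$mpg, x))
print(round(mpg_cor, 3))
```

A negative coefficient for each predictor would be consistent with the downward trend seen in the plots.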
# Boston Problem Set
data(Boston)
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
Linearity: Some variables show a linear relationship with others; for
example, rm (average number of rooms per dwelling) seems to have a
positive linear relationship with medv (median value of owner-occupied
homes), suggesting that larger homes tend to be more valuable.
Crime Rate and Zoning (zn): A negative association is expected here, since suburbs with a larger share of land zoned for large residential lots tend to have lower crime rates.
Crime Rate and Industrialization (indus): A positive association is expected, with crime rates tending to be higher where more land is given over to non-retail business, possibly because of less residential presence and more anonymity.
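The expected directions for zn and indus can be checked against the data rather than read off the pairs() plot alone. A minimal sketch, assuming the Boston data from MASS as loaded above; crim_cor is a name I introduce here.

```r
library(MASS)
data(Boston)

# Row of the correlation matrix for crim, restricted to zn and indus
crim_cor <- cor(Boston)["crim", c("zn", "indus")]
print(round(crim_cor, 3))
```

The sign of each coefficient shows whether the data agree with the expectation. Because crim is heavily skewed, the magnitudes should be read with some caution.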
High Crime Rates: Some suburbs have per capita crime rates as high as approximately 89, while the median is only about 0.26, so a small number of suburbs have far more crime than the rest.
High Tax Rates: The property-tax rate reaches 711 per $10,000 in some suburbs, against a median of 330, so certain areas are taxed much more heavily.
High Pupil-Teacher Ratios: Ratios range from 12.6 to 22 with a median of 19.05, so this predictor varies far less than crime or tax. The highest values suggest that some suburbs have fewer teachers available per student.
# Number of suburbs that bound the Charles River
sum(Boston$chas == 1)
## [1] 35
# Median pupil-teacher ratio across all suburbs
median_ptratio <- median(Boston$ptratio)
print(median_ptratio)
## [1] 19.05
# Suburbs with the lowest median value of owner-occupied homes
min_medv_suburb <- Boston[Boston$medv == min(Boston$medv), ]
print(min_medv_suburb)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.90 30.59
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 384.97 22.98
## medv
## 399 5
## 406 5
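One way to compare these two suburbs with the overall ranges is to express each of their values as a percentile of the full data set. The sketch below uses ecdf() for this; the names lowest and pct are hypothetical and chosen only for illustration.

```r
library(MASS)
data(Boston)

# Suburbs with the lowest median home value
lowest <- Boston[Boston$medv == min(Boston$medv), ]

# For each variable, the share of all suburbs with a value at or below
# the value observed in each lowest-medv suburb
pct <- sapply(names(Boston), function(v) ecdf(Boston[[v]])(lowest[[v]]))
rownames(pct) <- rownames(lowest)

# Transpose so that variables are rows and the two suburbs are columns
print(round(t(pct), 2))
```

A value near 1 means the suburb sits at the top of the distribution for that variable, and a value near 0 means it sits at the bottom. For example, both suburbs have age equal to 100, which is the maximum in the summary above.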
# Number of suburbs averaging more than 7 rooms per dwelling
sum(Boston$rm > 7)
## [1] 64
# Number of suburbs averaging more than 8 rooms per dwelling
sum(Boston$rm > 8)
## [1] 13