library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#View(Auto)
df_Auto <- data.frame(Auto)
head(df_Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
dim(df_Auto)
## [1] 392 9
df_Auto <- na.omit(df_Auto)
dim(df_Auto)
## [1] 392 9
sapply(df_Auto,class)
## mpg cylinders displacement horsepower weight acceleration
## "numeric" "integer" "numeric" "integer" "integer" "numeric"
## year origin name
## "integer" "integer" "factor"
sapply(df_Auto[,1:7], range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
# Mean and standard deviation.
paste("Mean of each column")
## [1] "Mean of each column"
sapply(df_Auto[,1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
paste("Standard Deviation of each column")
## [1] "Standard Deviation of each column"
sapply(df_Auto[,1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
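The range, mean and standard deviation computed above can also be gathered into a single table. The following is a minimal sketch; `describe_numeric` is a hypothetical helper written for this illustration (it is not part of ISLR2 or base R), and it is shown on the built-in `mtcars` data so that it runs without extra packages.

```r
# Summarise every numeric column of a data frame in one table.
# `describe_numeric` is a hypothetical helper, not an ISLR2 or base R function.
describe_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]   # keep numeric columns only
  data.frame(
    min  = sapply(num, min),
    max  = sapply(num, max),
    mean = sapply(num, mean),
    sd   = sapply(num, sd)
  )
}

# Built-in mtcars is used as a stand-in data set.
stats_tbl <- describe_numeric(mtcars)
print(round(stats_tbl, 2))
```

With ISLR2 loaded, `describe_numeric(df_Auto[, 1:7])` should give the same figures as the separate `sapply()` calls, one row per variable.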
# Remove the 10th through 85th observations
df_Auto_red = df_Auto[-c(10:85),]
# Their respective range, mean and sd
paste("Range of each column : ")
## [1] "Range of each column : "
sapply(df_Auto_red[,1:7], range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0 3 68 46 1649 8.5 70
## [2,] 46.6 8 455 230 4997 24.8 82
paste("Mean of each column : ")
## [1] "Mean of each column : "
sapply(df_Auto_red[,1:7], mean)
## mpg cylinders displacement horsepower weight acceleration
## 24.404430 5.373418 187.240506 100.721519 2935.971519 15.726899
## year
## 77.145570
paste("Standard Deviation of each column :")
## [1] "Standard Deviation of each column :"
sapply(df_Auto_red[,1:7], sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.867283 1.654179 99.678367 35.708853 811.300208 2.693721
## year
## 3.106217
pairs(df_Auto[,1:7])
cor(df_Auto[, c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## acceleration year
## mpg 0.4233285 0.5805410
## cylinders -0.5046834 -0.3456474
## displacement -0.5438005 -0.3698552
## horsepower -0.6891955 -0.4163615
## weight -0.4168392 -0.3091199
## acceleration 1.0000000 0.2903161
## year 0.2903161 1.0000000
From the pair plot and the correlation matrix, we can see that linear relationships exist between several of the variables.
mpg: mpg has strong negative linear relationships with displacement, cylinders, horsepower and weight (correlations between -0.78 and -0.83). That is, we can expect the mpg of a car to decrease as its displacement, number of cylinders, horsepower and weight increase. mpg has a moderate positive correlation with year (0.58), which suggests that newer models tend to have higher mpg than older ones.
displacement: displacement has strong positive linear relationships with cylinders (0.95), weight (0.93) and horsepower (0.90), so these features tend to increase together. displacement has a strong negative relationship with mpg (-0.81), a moderate negative relationship with acceleration (-0.54) and a weak negative relationship with year (-0.37). That is, cars with larger displacement tend to have lower mpg and lower acceleration values.
cylinders: cylinders has a strong negative correlation with mpg (-0.78), a moderate negative correlation with acceleration (-0.50) and a weak negative correlation with year (-0.35), so each of them is inversely related to cylinders. The other features, displacement, horsepower and weight, are positively related to cylinders: as the number of cylinders increases, engine displacement, horsepower and vehicle weight tend to increase as well.
horsepower: horsepower has a strong negative correlation with mpg (-0.78), a moderate to strong negative correlation with acceleration (-0.69) and a weak to moderate negative correlation with year (-0.42), so each of them is inversely related to horsepower. The other features, displacement, cylinders and weight, are positively related to horsepower: as engine horsepower increases, the number of cylinders, displacement and vehicle weight tend to increase as well.
weight: weight behaves like horsepower and cylinders. It is negatively correlated with mpg, acceleration and year, and positively correlated with horsepower, displacement and cylinders.
acceleration and year: acceleration and year are positively related only to each other (0.29) and to mpg (0.42 and 0.58 respectively). This implies that vehicles with higher acceleration values also tend to be more fuel-efficient. All other features are negatively related to both acceleration and year.
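The pairwise reading of the correlation matrix above can be made more systematic by listing each pair once, ordered by the absolute correlation. This is only a sketch: `rank_correlations` is a hypothetical helper, and the 0.4 and 0.7 cut-offs for "weak", "moderate" and "strong" are assumptions chosen for illustration, not a standard.

```r
# List each pair of variables once, ordered by correlation strength.
# `rank_correlations` and the 0.4 / 0.7 cut-offs are illustrative assumptions.
rank_correlations <- function(df) {
  cm  <- cor(df)
  idx <- which(upper.tri(cm), arr.ind = TRUE)   # each pair once, no diagonal
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
  out$strength <- cut(abs(out$r), breaks = c(0, 0.4, 0.7, 1),
                      labels = c("weak", "moderate", "strong"),
                      include.lowest = TRUE)
  out[order(-abs(out$r)), ]
}

# Built-in mtcars is used as a stand-in data set.
pairs_tbl <- rank_correlations(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])
print(pairs_tbl, row.names = FALSE)
```

With ISLR2 loaded, `rank_correlations(df_Auto[, 1:7])` would produce the same kind of table for the Auto variables discussed here.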
Conclusion: Suppose one wants to know how a car's weight should be chosen for a target acceleration. Plotting acceleration against weight shows a negative relationship (correlation -0.42): heavier vehicles tend to have lower acceleration values, and vice versa. Note that acceleration in the Auto data is the time in seconds to go from 0 to 60 mph, so a lower value means a quicker car.
ggplot(Auto, aes(x = weight, y = acceleration)) +
geom_point() +
theme(legend.position = "none") +
scale_x_continuous(labels = scales::comma_format()) +
labs(x = "Weight",
y = "Acceleration",
title = "Correlation between weight and acceleration")
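A scatter plot like the one above shows the direction of a trend, and a correlation test plus a least-squares slope can put numbers on it. The sketch below assumes a simple linear relationship; `trend_summary` is a hypothetical helper, and the built-in `mtcars` data stands in for Auto.

```r
# Quantify a scatter-plot trend with a correlation test and a fitted line.
# `trend_summary` is a hypothetical helper written for this sketch.
trend_summary <- function(df, x, y) {
  test <- cor.test(df[[x]], df[[y]])                   # Pearson correlation test
  fit  <- lm(reformulate(x, response = y), data = df)  # fits y ~ x
  c(r       = unname(test$estimate),
    p_value = test$p.value,
    slope   = unname(coef(fit)[2]))
}

# Built-in mtcars is used as a stand-in data set.
res <- trend_summary(mtcars, "wt", "mpg")
print(res)
```

With ISLR2 loaded, the corresponding call for the plot above would be `trend_summary(Auto, "weight", "acceleration")`.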
Higher displacement, cylinders and weight are associated with a reduced mpg.
Cars with a later model year tend to have higher mpg.
df_Auto$origin <- factor(df_Auto$origin, labels = c("American", "European", "Japanese"))
ggplot(df_Auto, aes(x = origin, y = mpg, fill = origin)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(title = "Origin vs Mpg - Boxplot",
x = "Origin",
y = "MPG")
Findings: Japanese-origin vehicles have comparatively higher mpg (around 33 mpg) than European (around 25 mpg mean) and American ones (around 20 mpg mean).
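Values read off a boxplot are approximate, so it can help to compute the per-group mean and median directly. The following is a minimal sketch; `group_summary` is a hypothetical helper, demonstrated on the built-in `mtcars` data with `cyl` as the grouping column.

```r
# Per-group count, mean and median of a numeric column.
# `group_summary` is a hypothetical helper written for this sketch.
group_summary <- function(df, value, group) {
  parts <- split(df[[value]], df[[group]])   # one vector per group
  data.frame(
    n      = sapply(parts, length),
    mean   = sapply(parts, mean),
    median = sapply(parts, median)
  )
}

# Built-in mtcars is used as a stand-in data set.
grp <- group_summary(mtcars, "mpg", "cyl")
print(round(grp, 2))
```

With ISLR2 loaded and `origin` converted to a factor as above, `group_summary(df_Auto, "mpg", "origin")` would give the figures to compare against the boxplot.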
library(ISLR2)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
Read about the data set: ?Boston How many rows are in this data set? How many columns? What do the rows and columns represent?
str(Boston)
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#View(Boston)
dim(Boston)
## [1] 506 13
# Pair plots of all features
pairs( Boston)
Findings:
crim appears to have a negative relationship with medv and dis, i.e. the per capita crime rate tends to fall as the median home value and the weighted mean distance to five Boston employment centres increase. Crime and tax rate are positively related (correlation 0.58 in the table below), so areas with higher property-tax rates tend to have more crime. crim also has a positive relationship with nox, i.e. the crime rate tends to rise with nitrogen oxide concentration.
nox has a negative linear relationship with dis and medv, i.e. nitrogen oxide concentration decreases as the distance to employment centres and the median value of owner-occupied homes increase. Since crim falls along the same two variables, suburbs with low nox also tend to have lower crime rates.
dis has a positive linear relationship with medv and a negative relationship with age: suburbs farther from the employment centres tend to have a smaller proportion of older homes.
# Correlation coefficients between CRIM and all other variables.
cor(Boston[-1],Boston$crim)
## [,1]
## zn -0.20046922
## indus 0.40658341
## chas -0.05589158
## nox 0.42097171
## rm -0.21924670
## age 0.35273425
## dis -0.37967009
## rad 0.62550515
## tax 0.58276431
## ptratio 0.28994558
## lstat 0.45562148
## medv -0.38830461
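The coefficients above are easier to compare when sorted by absolute value. This is a small sketch; `cor_with` is a hypothetical helper, shown on the built-in `mtcars` data with `mpg` as the target column.

```r
# Correlation of one target column with every other column,
# sorted from strongest to weakest (by absolute value).
# `cor_with` is a hypothetical helper written for this sketch.
cor_with <- function(df, target) {
  others <- df[setdiff(names(df), target)]
  r <- cor(others, df[[target]])[, 1]   # named vector of coefficients
  r[order(-abs(r))]
}

# Built-in mtcars is used as a stand-in data set.
ranked <- cor_with(mtcars, "mpg")
print(round(ranked, 3))
```

With ISLR2 loaded, `cor_with(Boston, "crim")` would list the same coefficients as the output above, in order of strength.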
There are some relationships between crim and the other features, but they are not as strong as some of the relationships we observed in the Auto dataset.
crim has a negative linear relationship with medv, dis, rm, chas and zn. For instance, crim is negatively correlated with both rm (-0.22) and medv (-0.39), indicating that neighborhoods with more rooms per dwelling and higher median home values tend to have lower crime rates.
crim has a positive linear relationship with indus, nox, age, rad, tax, ptratio and lstat. For instance, rad (0.63) and tax (0.58) have the strongest positive correlations with crim, suggesting that greater values of these variables correspond to higher crime rates.
# Suburbs with crime rate higher than 95% of suburbs.
High.Crime <- Boston[Boston$crim > quantile(Boston$crim, 0.95),]
print(nrow(High.Crime))
## [1] 26
print(paste("Range",range(Boston$crim)))
## [1] "Range 0.00632" "Range 88.9762"
print(paste("Mean",mean(Boston$crim)))
## [1] "Mean 3.61352355731225"
print(paste("Standard Deviation",sd(Boston$crim)))
## [1] "Standard Deviation 8.60154510533249"
# Suburbs with tax rates higher than 95% of suburbs.
High.Tax <- Boston[Boston$tax > quantile(Boston$tax, 0.95),]
print(nrow(High.Tax))
## [1] 5
print(paste("Range",range(Boston$tax)))
## [1] "Range 187" "Range 711"
print(paste("Mean",mean(Boston$tax)))
## [1] "Mean 408.237154150198"
print(paste("Standard Deviation",sd(Boston$tax)))
## [1] "Standard Deviation 168.537116054959"
# Suburbs with ptratio higher than 95% of suburbs.
High.ptratio <- Boston[Boston$ptratio > quantile(Boston$ptratio, 0.95),]
print(nrow(High.ptratio))
## [1] 18
print(paste("Range",range(Boston$ptratio)))
## [1] "Range 12.6" "Range 22"
print(paste("Mean",mean(Boston$ptratio)))
## [1] "Mean 18.4555335968379"
print(paste("Standard Deviation",sd(Boston$ptratio)))
## [1] "Standard Deviation 2.16494552371444"
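The three blocks above repeat the same pattern: find the 95th percentile of a column and keep the rows strictly above it. A minimal sketch of that pattern as a reusable function follows; `above_quantile` is a hypothetical helper, shown on the built-in `mtcars` data.

```r
# Rows whose value in `column` is strictly above the p-th quantile.
# `above_quantile` is a hypothetical helper written for this sketch.
above_quantile <- function(df, column, p = 0.95) {
  cutoff <- quantile(df[[column]], p)
  df[df[[column]] > cutoff, , drop = FALSE]
}

# Built-in mtcars is used as a stand-in data set.
high <- sapply(c("mpg", "hp", "wt"),
               function(v) nrow(above_quantile(mtcars, v)))
print(high)
```

Because the comparison is strict, rows tied with the cut-off value are excluded, so a column with many repeated values near the top can return noticeably fewer rows than another column of the same length. With ISLR2 loaded, `above_quantile(Boston, "tax")` corresponds to `High.Tax` above.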
sum(Boston$chas==1)
## [1] 35
median(Boston$ptratio)
## [1] 19.05
which(Boston$medv == min(Boston$medv))
## [1] 399 406
# Values of other predictors for suburb 399
Boston[399,]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
# Values of other predictors for suburb 406
Boston[406,]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 22.98 5
We can see that both observations with the lowest medv take very similar values, and for many predictors those values are quite extreme.
>= 90th percentile for: crim, age, lstat
>= 75th percentile for: indus, nox, rad, tax, ptratio
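Percentile statements like the ones above can be checked with the empirical distribution function of each column. The sketch below is one way to do it; `percentile_of_rows` is a hypothetical helper, and the built-in `mtcars` data stands in for Boston.

```r
# Percentile rank (0-100) of selected rows within each numeric column.
# `percentile_of_rows` is a hypothetical helper; `rows` must be integer indices.
percentile_of_rows <- function(df, rows) {
  num <- df[sapply(df, is.numeric)]
  # ecdf(col)(x) is the share of observations less than or equal to x
  out <- sapply(num, function(col) ecdf(col)(col[rows]) * 100)
  matrix(out, nrow = length(rows),
         dimnames = list(rownames(df)[rows], names(num)))
}

# Built-in mtcars is used as a stand-in: rows with the lowest mpg.
lowest <- which(mtcars$mpg == min(mtcars$mpg))
pct <- percentile_of_rows(mtcars, lowest)
print(round(pct, 1))
```

With ISLR2 loaded, `percentile_of_rows(Boston, c(399, 406))` would show where the two lowest-medv suburbs fall for every predictor.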
# More than 7 rooms
sum(Boston$rm > 7)
## [1] 64
# More than 8 rooms
sum(Boston$rm > 8)
## [1] 13
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
summary(subset(Boston, rm > 8))
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio lstat medv
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :2.47 Min. :21.9
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:3.32 1st Qu.:41.7
## Median : 7.000 Median :307.0 Median :17.40 Median :4.14 Median :48.3
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :4.31 Mean :44.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :7.44 Max. :50.0
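Findings:
Suburbs that average more than 8 rooms per dwelling have lower crim (mean 0.72 versus 3.61, although the median is slightly higher), lower lstat and much higher medv when comparing the interquartile ranges with those of the entire dataset.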
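Comparing two long `summary()` outputs by eye is error-prone, so the medians of the subset and of the full data can be placed side by side instead. This is a minimal sketch; `compare_medians` is a hypothetical helper, shown on the built-in `mtcars` data.

```r
# Median of every numeric column for all rows and for a subset of rows.
# `compare_medians` is a hypothetical helper; `keep` is a logical vector.
compare_medians <- function(df, keep) {
  num <- df[sapply(df, is.numeric)]
  data.frame(
    all    = sapply(num, median),
    subset = sapply(num[keep, , drop = FALSE], median)
  )
}

# Built-in mtcars is used as a stand-in: cars with more than 200 horsepower.
cmp <- compare_medians(mtcars, mtcars$hp > 200)
print(round(cmp, 2))
```

With ISLR2 loaded, `compare_medians(Boston, Boston$rm > 8)` would give the comparison described in the findings above.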