This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Reading Auto data
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
data(Auto)
auto <- as.data.frame(Auto)
# Removing missing values
auto <- na.omit(auto)
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
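As a quick check on the missing-value step, the sketch below counts the NAs in each column and then drops incomplete rows. The data frame `toy` is made up for illustration and only stands in for the Auto data.

```r
# Count missing values per column, then drop incomplete rows.
# `toy` is a small made-up data frame standing in for the Auto data.
toy <- data.frame(
  mpg        = c(18, 15, NA, 16),
  horsepower = c(130, NA, 150, 150),
  weight     = c(3504, 3693, 3436, 3433)
)

na_per_column <- colSums(is.na(toy))   # missing values in each column
toy_complete  <- na.omit(toy)          # keep only rows with no NA

print(na_per_column)
print(nrow(toy) - nrow(toy_complete))  # number of rows dropped
```

On the real data, `colSums(is.na(auto))` should return all zeros once `na.omit()` has been applied.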
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
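All variables are quantitative except “name” and “origin”. Although “origin” is stored as a number, its values 1, 2, and 3 are codes for the region of manufacture, so it is qualitative.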
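One way to make the qualitative nature of origin explicit is to convert the codes to a labelled factor. The sketch below uses a made-up vector `origin_codes` in place of the real column; the labels follow the coding given in the ISLR documentation for Auto (1 = American, 2 = European, 3 = Japanese).

```r
# Toy stand-in for the `origin` column; the real column holds the same codes.
origin_codes <- c(1, 1, 3, 2, 1, 3)

origin_factor <- factor(
  origin_codes,
  levels = c(1, 2, 3),
  labels = c("American", "European", "Japanese")
)

print(table(origin_factor))
print(is.numeric(origin_factor))  # a factor is no longer treated as numeric
```

Applied to the real data, this would be `auto$origin <- factor(auto$origin, levels = 1:3, labels = ...)`, after which `sapply(auto, is.numeric)` could be used to pick out the quantitative columns.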
columns <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")
column_data <- subset(auto, select = colnames(auto) %in% columns)
## Show Range
sapply(column_data, range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] 9.0 3 68 46 1613 8.0 70
## [2,] 46.6 8 455 230 5140 24.8 82
## Show Mean and SD
print("Mean:")
## [1] "Mean:"
sapply(column_data, mean)
## mpg cylinders displacement horsepower weight acceleration
## 23.445918 5.471939 194.411990 104.469388 2977.584184 15.541327
## year
## 75.979592
print("Standard Deviation:")
## [1] "Standard Deviation:"
sapply(column_data, sd)
## mpg cylinders displacement horsepower weight acceleration
## 7.805007 1.705783 104.644004 38.491160 849.402560 2.758864
## year
## 3.683737
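The three separate `sapply()` calls above can also be collected into one table. This is a minimal sketch, assuming every column passed in is numeric; `describe_columns` is an illustrative helper name and `toy` is made-up data.

```r
# Return range, mean, and SD for every column of a numeric data frame.
describe_columns <- function(df) {
  t(sapply(df, function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  }))
}

toy <- data.frame(mpg = c(18, 15, 16, 17), cylinders = c(8, 8, 4, 4))
stats <- describe_columns(toy)
print(round(stats, 3))
```

With the objects defined earlier, `describe_columns(column_data)` would give one row per variable and one column per statistic.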
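# Remove observations 10 through 85. Columns are dropped by position here:
# column 4 is horsepower and column 9 is name, so horsepower is excluded
# from the output below while origin (column 8) is kept.
auto_filtered <- auto[-c(10:85), -c(4,9)]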
print("Range:")
## [1] "Range:"
sapply(auto_filtered, range)
## mpg cylinders displacement weight acceleration year origin
## [1,] 11.0 3 68 1649 8.5 70 1
## [2,] 46.6 8 455 4997 24.8 82 3
print("Mean:")
## [1] "Mean:"
sapply(auto_filtered, mean)
## mpg cylinders displacement weight acceleration year
## 24.404430 5.373418 187.240506 2935.971519 15.726899 77.145570
## origin
## 1.601266
print("Standard Deviation:")
## [1] "Standard Deviation:"
sapply(auto_filtered, sd)
## mpg cylinders displacement weight acceleration year
## 7.867283 1.654179 99.678367 811.300208 2.693721 3.106217
## origin
## 0.819910
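Ignore the origin column in these results: it is a categorical code rather than a quantitative variable. Horsepower does not appear because column 4 was dropped together with name.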
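Selecting columns by name rather than by position makes it harder to drop the wrong variable when subsetting. The sketch below shows the idea on a made-up six-row data frame; `drop_rows` is an illustrative helper, not a function from any package.

```r
# Drop a block of rows by position and keep columns by name.
# `toy` is a made-up 6-row data frame; `drop_rows` is an illustrative helper.
drop_rows <- function(df, rows, keep_cols) {
  df[-rows, keep_cols, drop = FALSE]
}

toy <- data.frame(
  mpg        = c(18, 15, 18, 16, 17, 15),
  horsepower = c(130, 165, 150, 150, 140, 198),
  origin     = c(1, 1, 1, 1, 1, 1),
  name       = letters[1:6]
)

kept <- drop_rows(toy, rows = 2:4, keep_cols = c("mpg", "horsepower"))
print(kept)
print(sapply(kept, range))
```

For the exercise itself, the equivalent call would be `drop_rows(auto, 10:85, columns)`, which keeps all seven quantitative variables, including horsepower.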
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
pairs(auto)
# Histograms for each variable; bins is set explicitly to the ggplot2
# default of 30, which silences the repeated stat_bin() message
p1 <- ggplot(auto, aes(x = mpg)) +
  geom_histogram(bins = 30)
p2 <- ggplot(auto, aes(x = cylinders)) +
  geom_histogram(bins = 30)
p3 <- ggplot(auto, aes(x = displacement)) +
  geom_histogram(bins = 30)
p4 <- ggplot(auto, aes(x = weight)) +
  geom_histogram(bins = 30)
p5 <- ggplot(auto, aes(x = acceleration)) +
  geom_histogram(bins = 30)
p6 <- ggplot(auto, aes(x = year)) +
  geom_histogram(bins = 30)
grid.arrange(p1, p2, p3, ncol = 3)
grid.arrange(p4, p5, p6, ncol = 3)
From the pairs plot, mpg decreases as displacement, horsepower, and
weight increase, and it increases with year: newer cars tend to have
better mpg. Weight is positively correlated with horsepower and
inversely related to acceleration. There are only a few 3- and
5-cylinder vehicles, while 4-cylinder vehicles are the most common in
the data set.
As mentioned above, the pairs plot suggests that mpg is inversely related to displacement, horsepower, and weight, and positively related to year. These variables can therefore be useful in predicting mpg.
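The visual impression from the pairs plot can be backed up with correlation coefficients. The sketch below ranks numeric predictors by the absolute value of their Pearson correlation with a response; `rank_by_correlation` is an illustrative helper and `toy` is made-up data, so the printed numbers say nothing about the real Auto data.

```r
# Rank numeric predictors by the strength of their linear correlation
# with a response column. `rank_by_correlation` is an illustrative helper.
rank_by_correlation <- function(df, response) {
  numeric_df <- df[sapply(df, is.numeric)]
  r <- cor(numeric_df)[, response]
  r <- r[names(r) != response]
  r[order(abs(r), decreasing = TRUE)]
}

# Made-up data: mpg falls with weight and rises with year.
toy <- data.frame(
  mpg    = c(30, 26, 22, 18, 14),
  weight = c(2000, 2500, 3000, 3500, 4100),
  year   = c(82, 79, 80, 74, 70)
)
print(round(rank_by_correlation(toy, "mpg"), 3))
```

On the real data the call would be `rank_by_correlation(auto, "mpg")`; negative values indicate predictors that move against mpg, positive values those that move with it.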
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.2
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
data(Boston)
#?Boston
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
dim(Boston)
## [1] 506 13
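There are 506 rows and 13 columns. Each row is a census tract (suburb) in the Boston area, and each column is a variable measured for that tract.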
pairs(Boston)
plot(Boston$age, Boston$crim)
plot(Boston$crim, Boston$dis)
plot(Boston$age, Boston$dis)
plot(Boston$crim, Boston$ptratio)
The plots above show no clear linear relationship between crim and nox, ptratio, or dis.
rm, age, and medv look more likely to be related to crim:
plot(Boston$crim, Boston$rm)
plot(Boston$crim, Boston$age)
plot(Boston$crim, Boston$medv)
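Of these, medv appears to be the most useful predictor: in the last plot above, the high crime rates are concentrated in tracts with low median home values.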
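Because crim is heavily right-skewed (median 0.26, mean 3.61, maximum 88.98), a rank-based correlation can complement the scatterplots: it measures whether two variables move together monotonically even when the relationship is far from a straight line. The sketch below compares Pearson and Spearman correlation on made-up vectors `value` and `rate`; both are assumptions for illustration, not Boston columns.

```r
# Made-up data with a monotone but strongly nonlinear relationship,
# loosely mimicking a skewed rate plotted against a value measure.
value <- c(5, 10, 15, 20, 25, 30, 40, 50)
rate  <- 100 / value^2          # decreases quickly, then flattens

pearson  <- cor(rate, value, method = "pearson")
spearman <- cor(rate, value, method = "spearman")

print(round(c(pearson = pearson, spearman = spearman), 3))
```

On the real data, `cor(Boston$crim, Boston$medv, method = "spearman")` would be the corresponding call.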
ggplot(Boston, aes(x = crim)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Beyond a crime rate of about 20, only a few outlying tracts remain.
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
which.max(Boston$crim)
## [1] 381
# Tax
summary(Boston$tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
which.max(Boston$tax)
## [1] 489
# Pupil-Teacher Ratio by town.
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
which.max(Boston$ptratio)
## [1] 355
Suburb #381 has the highest per capita crime rate, 88.976, far above the median of 0.25651; crim has a very wide, right-skewed range. Suburb #489 has the highest tax rate, 711.0, compared with a median of 330.0 and a minimum of 187.0, so the range is also wide. Suburb #355 has the highest pupil-teacher ratio, 22, which is only slightly above the median of 19.05 because ptratio has a narrow range (12.6 to 22). Note that which.max() returns only the first index when several suburbs share the maximum.
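Since `which.max()` reports only the first position of the maximum, ties can go unnoticed. A minimal sketch of how to list every position that attains the maximum, using a made-up vector `tax_toy` and an illustrative helper `which_all_max`:

```r
# Return every index that attains the maximum, not just the first.
which_all_max <- function(x) which(x == max(x, na.rm = TRUE))

tax_toy <- c(300, 711, 666, 711, 250)   # made-up values with a tied maximum

print(which.max(tax_toy))      # first maximum only
print(which_all_max(tax_toy))  # all positions sharing the maximum
```

The same helper could be applied to `Boston$tax` or `Boston$ptratio` to check whether more than one suburb shares the top value.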
sum(Boston$chas == 1)
## [1] 35
35 census tracts bound the Charles River.
median(Boston$ptratio)
## [1] 19.05
The median pupil-teacher ratio is 19.05.
summary(Boston$medv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
which.min(Boston$medv)
## [1] 399
Boston[399,]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
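Suburb #399 has the lowest median home value (medv = 5). Its crime rate (38.35) is far above the median (0.26), and it does not bound the Charles River (chas = 0). Its pupil-teacher ratio (20.2) is slightly above the median (19.05). Its housing stock is very old: all owner-occupied units were built prior to 1940 (age = 100). The average number of rooms (5.453) is below the median (6.208).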
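Comparing one suburb with the overall distribution can be made systematic by computing, for each variable, the share of rows whose value is at or below that suburb's value. The sketch below is one way to do this under the assumption that all columns are numeric; `percentile_of_row` is an illustrative helper and `toy` is made-up data.

```r
# For one row, report the share of rows with a smaller-or-equal value
# in each column. `percentile_of_row` is an illustrative helper.
percentile_of_row <- function(df, i) {
  sapply(df, function(x) mean(x <= x[i]))
}

toy <- data.frame(
  crim = c(0.1, 0.3, 38.4, 0.2, 1.5),
  rm   = c(6.5, 6.2, 5.4, 7.1, 6.0),
  medv = c(24, 21, 5, 33, 19)
)
lowest <- which.min(toy$medv)
print(round(percentile_of_row(toy, lowest), 2))
```

Values near 1 mean the suburb sits at the top of that variable's distribution, and values near 0 mean it sits at the bottom. For the exercise, the call would be `percentile_of_row(Boston, which.min(Boston$medv))`.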
summary(Boston$rm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.561 5.886 6.208 6.285 6.623 8.780
sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
64 census tracts average more than seven rooms per dwelling, and 13 average more than eight.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
summary(Boston[Boston$rm > 8,])
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio lstat medv
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :2.47 Min. :21.9
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:3.32 1st Qu.:41.7
## Median : 7.000 Median :307.0 Median :17.40 Median :4.14 Median :48.3
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :4.31 Mean :44.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :7.44 Max. :50.0
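For census tracts that average more than eight rooms per dwelling, compared with all tracts:
- The median crime rate is higher (0.52 vs. 0.26), although the mean is much lower (0.72 vs. 3.61).
- A larger share bound the Charles River (chas mean 0.154 vs. 0.069).
- They are closer to the Boston employment centers (median dis 2.89 vs. 3.21).
- The lower-status share of the population is much smaller (median lstat 4.14 vs. 11.36).
- The pupil-teacher ratio is lower (median ptratio 17.40 vs. 19.05).
- The median home value is much higher (median medv 48.3 vs. 21.2).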
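The comparison between the two summaries can be put side by side in one table. This is a minimal sketch, assuming all columns are numeric; `compare_medians` is an illustrative helper and `toy` is made-up data, not the Boston values.

```r
# Compare column medians of a selected group of rows with the overall medians.
# `compare_medians` is an illustrative helper; `keep` is a logical row filter.
compare_medians <- function(df, keep) {
  overall  <- sapply(df, median)
  selected <- sapply(df[keep, , drop = FALSE], median)
  cbind(overall = overall, selected = selected,
        difference = selected - overall)
}

toy <- data.frame(
  rm    = c(5.9, 6.2, 8.3, 6.6, 8.1, 6.0),
  lstat = c(14, 11, 4, 9, 3, 16),
  medv  = c(19, 21, 48, 25, 50, 17)
)
result <- compare_medians(toy, toy$rm > 8)
print(result)
```

For the exercise, `compare_medians(Boston, Boston$rm > 8)` would list each variable's overall median next to the median among the large-dwelling tracts.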