1. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
library(ISLR2)
library(tidyverse)
#View(Auto)

df_Auto <- data.frame(Auto)
head(df_Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
dim(df_Auto)
## [1] 392   9
df_Auto <- na.omit(df_Auto)
dim(df_Auto)
## [1] 392   9
  1. Which of the predictors are quantitative, and which are qualitative?
sapply(df_Auto,class)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    "numeric"    "integer"    "numeric"    "integer"    "integer"    "numeric" 
##         year       origin         name 
##    "integer"    "integer"     "factor"
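The class of a column does not settle whether it is quantitative: origin is stored as an integer here but codes the region of origin, and name is a factor, so both are qualitative, while the remaining columns are quantitative. A minimal sketch of making that split explicit, assuming a small made-up data frame (`toy`) and a hypothetical helper (`split_predictors`) that are not part of the exercise:

```r
# Toy data mimicking the Auto layout; the values are invented for illustration.
toy <- data.frame(
  mpg    = c(18, 15, 31.5),
  year   = c(70L, 70L, 82L),
  origin = c(1L, 1L, 3L),
  name   = factor(c("car a", "car b", "car c"))
)

# Hypothetical helper: split column names into quantitative and qualitative,
# treating any column listed in `coded` as qualitative even if it is numeric.
split_predictors <- function(df, coded = character()) {
  is_num <- vapply(df, is.numeric, logical(1))
  quantitative <- setdiff(names(df)[is_num], coded)
  list(quantitative = quantitative,
       qualitative  = setdiff(names(df), quantitative))
}

kinds <- split_predictors(toy, coded = "origin")
print(kinds)
```

Declaring the coded columns by hand is deliberate: nothing in the data itself says that an integer column is a label rather than a measurement.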
  1. What is the range of each quantitative predictor? You can answer this using the range() function.
sapply(df_Auto[,1:7], range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82
  1. What is the mean and standard deviation of each quantitative predictor?
# Mean and standard deviation.
paste("Mean of each column")
## [1] "Mean of each column"
sapply(df_Auto[,1:7], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592
paste("Standard Deviation of each column")
## [1] "Standard Deviation of each column"
sapply(df_Auto[,1:7], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737
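The range, mean and standard deviation can also be collected in a single pass, which makes the three statistics easier to compare column by column. A small sketch, assuming an invented data frame (`toy`) and a hypothetical helper name (`summarise_cols`):

```r
# Invented values; in the exercise the input would be the quantitative columns.
toy <- data.frame(
  mpg    = c(18, 15, 31.5, 24),
  weight = c(3504, 3693, 1990, 2720)
)

# Hypothetical helper: one column of output per input column,
# one row per statistic.
summarise_cols <- function(df) {
  sapply(df, function(x) {
    c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  })
}

stats <- summarise_cols(toy)
print(round(stats, 2))
```

Because each call to the inner function returns a named vector of the same length, sapply() simplifies the result to a matrix with the statistic names as row names.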
  1. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
# Drop the 10th through 85th observations (negative indices remove rows)
df_Auto_red <- df_Auto[-c(10:85),]

# Their respective range, mean and sd
paste("Range of each column : ")
## [1] "Range of each column : "
sapply(df_Auto_red[,1:7], range)
##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82
paste("Mean of each column : ")
## [1] "Mean of each column : "
sapply(df_Auto_red[,1:7], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570
paste("Standard Deviation of each column :")
## [1] "Standard Deviation of each column :"
sapply(df_Auto_red[,1:7], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217
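Removing rows 10 through 85 drops 76 observations, so 392 - 76 = 316 rows should remain. A short sketch of checking the row count and measuring how the column means shift, using an invented data frame (`toy`) rather than the Auto data:

```r
# Invented data: 100 rows, with y exactly twice x.
toy <- data.frame(x = 1:100, y = seq(2, 200, by = 2))

drop_rows <- 10:85
reduced <- toy[-drop_rows, ]

# The subset should be shorter by exactly the number of dropped rows.
stopifnot(nrow(reduced) == nrow(toy) - length(drop_rows))

# Difference between the subset means and the full-data means.
shift <- sapply(reduced, mean) - sapply(toy, mean)
print(shift)
```

Looking at the differences rather than two separate tables makes it easier to see which columns the removed rows influenced most.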
  1. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
pairs(df_Auto[,1:7])

cor(df_Auto[, c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000
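In the matrix above, cylinders, displacement, horsepower and weight are all correlated with one another at 0.84 or higher. Pairs like these can be pulled out of a correlation matrix programmatically instead of by eye. A sketch under stated assumptions: `toy` is invented data, `strong_pairs` is a hypothetical helper, and the 0.8 threshold is an arbitrary choice:

```r
# Invented data: displacement tracks weight closely, acceleration does not.
toy <- data.frame(
  weight       = c(1, 2, 3, 4, 5, 6),
  displacement = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9),
  acceleration = c(5, 1, 4, 2, 6, 3)
)

# Hypothetical helper: list each pair of columns whose absolute
# correlation is at least `threshold`, using only the upper triangle
# so that every pair appears once and the diagonal is skipped.
strong_pairs <- function(df, threshold = 0.8) {
  cm <- cor(df)
  idx <- which(upper.tri(cm) & abs(cm) >= threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
}

res <- strong_pairs(toy)
print(res)
```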
ggplot(df_Auto, aes(x = weight, y = acceleration)) + 
  geom_point() + 
  theme(legend.position = "none") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(x = "Weight", 
       y = "Acceleration", 
       title = "Correlation between weight and acceleration")

  1. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
df_Auto$origin <- factor(df_Auto$origin, labels = c("American", "European", "Japanese"))

ggplot(df_Auto, aes(x = origin, y = mpg, fill = origin)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  labs(title = "Origin vs Mpg - Boxplot", 
       x = "Origin", 
       y = "MPG")

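Findings:
- Japanese vehicles have comparatively higher mpg (a mean of roughly 30 mpg) than European vehicles (roughly 28 mpg) and American vehicles (roughly 20 mpg), so origin looks useful for predicting mpg.
- The correlation matrix above points the same way for the quantitative variables: mpg is strongly negatively correlated with weight (-0.83), displacement (-0.81), horsepower (-0.78) and cylinders (-0.78), and moderately positively correlated with year (0.58) and acceleration (0.42).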
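Group averages are easier to defend when they are computed rather than read off a boxplot. A minimal sketch with tapply(), assuming an invented data frame (`toy`) whose values are not taken from the Auto data:

```r
# Invented mpg values, two cars per region of origin.
toy <- data.frame(
  origin = factor(c(1, 1, 2, 2, 3, 3),
                  labels = c("American", "European", "Japanese")),
  mpg    = c(18, 22, 26, 28, 30, 34)
)

# Mean and median mpg within each level of origin.
group_means   <- tapply(toy$mpg, toy$origin, mean)
group_medians <- tapply(toy$mpg, toy$origin, median)

print(rbind(mean = group_means, median = group_medians))
```

Reporting the median alongside the mean is useful here because the line inside each box of a boxplot is the median, not the mean.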

2. This exercise involves the Boston housing data set.
  1. To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. library(ISLR2) Now the data set is contained in the object Boston.
library(ISLR2)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

Read about the data set: ?Boston How many rows are in this data set? How many columns? What do the rows and columns represent?

str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#View(Boston)
dim(Boston)
## [1] 506  13
  1. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
# Pair plots of all features
pairs(Boston)

Findings:
- crim appears to be negatively associated with medv and dis: the per capita crime rate tends to be lower in tracts with a higher median home value and a greater weighted mean distance to the five Boston employment centres.
- crim is positively associated with nox: the crime rate tends to rise with nitrogen oxide concentration.
- crim is also positively associated with tax (correlation of about 0.58), so tracts with a high property-tax rate tend to have more crime, not less.
- None of these relationships looks clearly linear: most tracts have a low crime rate, and the high values are concentrated in a small group of tracts.

  1. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
# Correlation coefficients between CRIM and all other variables.
cor(Boston[-1],Boston$crim)
##                [,1]
## zn      -0.20046922
## indus    0.40658341
## chas    -0.05589158
## nox      0.42097171
## rm      -0.21924670
## age      0.35273425
## dis     -0.37967009
## rad      0.62550515
## tax      0.58276431
## ptratio  0.28994558
## lstat    0.45562148
## medv    -0.38830461
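In the output above, rad (0.63) and tax (0.58) show the strongest association with crim, and chas (-0.06) the weakest. Sorting the coefficients by absolute value makes that ordering explicit. A sketch assuming an invented data frame (`toy`) that only borrows the Boston column names:

```r
# Invented values; only the column names follow the Boston data set.
toy <- data.frame(
  crim = c(0.1, 0.5, 2, 8, 20),
  rad  = c(1, 2, 5, 24, 24),
  dis  = c(6, 5, 4, 2, 1),
  chas = c(0, 1, 0, 0, 1)
)

# cor() returns a one-column matrix here; [, 1] turns it into a named vector.
r <- cor(toy[-1], toy$crim)[, 1]

# Order by strength of association, ignoring the sign.
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 3))
```

The sign is kept in the printed values, so the direction of each relationship is still visible after sorting.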
  1. Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
# Suburbs with crime rate higher than 95% of suburbs.
High.Crime <- Boston[Boston$crim > quantile(Boston$crim, 0.95),]
print(nrow(High.Crime))
## [1] 26
print(paste("Range", paste(range(Boston$crim), collapse = " to ")))
## [1] "Range 0.00632 to 88.9762"
print(paste("Mean",mean(Boston$crim)))
## [1] "Mean 3.61352355731225"
print(paste("Standard Deviation",sd(Boston$crim)))
## [1] "Standard Deviation 8.60154510533249"
# Suburbs with tax rates higher than 95% of suburbs.
High.Tax <- Boston[Boston$tax > quantile(Boston$tax, 0.95),]
print(nrow(High.Tax))
## [1] 5
print(paste("Range", paste(range(Boston$tax), collapse = " to ")))
## [1] "Range 187 to 711"
print(paste("Mean",mean(Boston$tax)))
## [1] "Mean 408.237154150198"
print(paste("Standard Deviation",sd(Boston$tax)))
## [1] "Standard Deviation 168.537116054959"
# Suburbs with ptratio higher than 95% of suburbs.
High.ptratio <- Boston[Boston$ptratio > quantile(Boston$ptratio, 0.95),]
print(nrow(High.ptratio))
## [1] 18
print(paste("Range", paste(range(Boston$ptratio), collapse = " to ")))
## [1] "Range 12.6 to 22"
print(paste("Mean",mean(Boston$ptratio)))
## [1] "Mean 18.4555335968379"
print(paste("Standard Deviation",sd(Boston$ptratio)))
## [1] "Standard Deviation 2.16494552371444"
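The same 95% cutoff selects 26, 5 and 18 tracts for the three predictors. A likely reason is ties: when many observations share the value at the 95th percentile, a strict `>` comparison excludes all of them. A sketch of the effect on an invented vector (`x`), not on the Boston data:

```r
# Invented vector of 101 values with a large tie at 60 and two values above it.
x <- c(1:50, rep(60, 49), 70, 70)

cutoff <- quantile(x, 0.95)

# Strict comparison drops every observation tied at the cutoff.
n_strict <- sum(x > cutoff)

# Inclusive comparison keeps the whole tied group.
n_inclusive <- sum(x >= cutoff)

print(c(cutoff = unname(cutoff), strict = n_strict, inclusive = n_inclusive))
```

Neither count is close to 5% of the observations, so with heavily tied variables it helps to report the cutoff value itself next to the count.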
  1. How many of the census tracts in this data set bound the Charles river?
sum(Boston$chas==1)
## [1] 35
  1. What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
  1. Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
which(Boston$medv == min(Boston$medv))
## [1] 399 406
# Values of other predictors for suburb 399
Boston[399,]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
# Values of other predictors for suburb 406
Boston[406,]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5

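Both tracts with the lowest medv (399 and 406) have very similar predictor values, and many of those values are extreme relative to the overall distributions: crim (38.35 and 67.92) is far above the median of 0.26, age (100) and rad (24) are at their maxima, lstat (30.59 and 22.98) is above the third quartile of 16.95, dis (about 1.4 to 1.5) is close to the minimum of 1.13, and rm (5.45 and 5.68) is below the first quartile of 5.886. tax (666) and ptratio (20.2) both sit at their third quartiles.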
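How extreme a tract is can be expressed as a percentile rank: the share of tracts whose value is at or below that tract's value. A minimal sketch using ecdf(), with an invented data frame (`toy`); note that which.min() returns only the first minimum, whereas the exercise above has two tied tracts:

```r
# Invented values; only the column names follow the Boston data set.
toy <- data.frame(
  crim = c(0.1, 0.3, 0.5, 2, 40),
  medv = c(30, 25, 22, 18, 5)
)

# Row with the lowest medv (first one only if there are ties).
row_id <- which.min(toy$medv)

# For each column, the fraction of rows at or below this row's value.
pct <- sapply(names(toy), function(v) ecdf(toy[[v]])(toy[row_id, v]))
print(pct)
```

A value near 1 or near 0 marks the row as extreme in that column, which turns a qualitative comment about "extreme values" into something that can be checked.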

  1. In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
# More than 7 rooms
sum(Boston$rm > 7)
## [1] 64
# More than 8 rooms
sum(Boston$rm > 8)
## [1] 13
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00
summary(subset(Boston, rm > 8))
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          lstat           medv     
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
##  Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0
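Comparing the two summaries above means scanning two large tables. Stacking the column means of the full data and of the subset gives a more direct view. A sketch assuming an invented data frame (`toy`) that borrows three Boston column names:

```r
# Invented values; only the column names follow the Boston data set.
toy <- data.frame(
  rm   = c(5.5, 6.0, 6.5, 8.1, 8.4),
  medv = c(15, 20, 24, 45, 50),
  crim = c(9, 3, 1, 0.5, 0.3)
)

# Rows averaging more than eight rooms per dwelling.
big <- subset(toy, rm > 8)

# One row of means per group, one column per variable.
comparison <- rbind(all = colMeans(toy), over_8_rooms = colMeans(big))
print(comparison)
```

Means are sensitive to skew, so for a variable such as crim it may be worth repeating the comparison with medians.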

findings: