Lab_1_Amritha

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.3.2

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#View(Auto)

df_Auto <- data.frame(Auto)
head(df_Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

dim(df_Auto)

## [1] 392   9

The data set has 392 rows and 9 columns.

df_Auto <- na.omit(df_Auto)
dim(df_Auto)

## [1] 392   9
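One way to confirm whether na.omit() actually dropped anything is to count the missing values first. The following is a minimal sketch on a hypothetical toy data frame (`toy` is not part of the lab data and its values are made up):

```r
# Toy data frame with one missing value (hypothetical, for illustration only)
toy <- data.frame(mpg = c(18, 15, NA, 16),
                  horsepower = c(130, 165, 150, 150))

# Count missing values per column before removing anything
colSums(is.na(toy))

# na.omit() keeps only the rows that have no NA in any column
toy_clean <- na.omit(toy)

# Number of rows that were dropped
rows_dropped <- nrow(toy) - nrow(toy_clean)
rows_dropped
```

For df_Auto the dimensions are 392 x 9 both before and after na.omit(), which suggests that this copy of Auto contains no rows with missing values.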

Which of the predictors are quantitative, and which are qualitative?

sapply(df_Auto,class)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    "numeric"    "integer"    "numeric"    "integer"    "integer"    "numeric" 
##         year       origin         name 
##    "integer"    "integer"     "factor"

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year.
Qualitative: name and origin (origin is stored as an integer coded 1 to 3, but the codes stand for regions, so it is qualitative).
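Because origin is stored as an integer, class() alone would count it as quantitative. A small sketch of recoding it as a factor and then picking the quantitative columns by type instead of by position is shown here; `toy` and its values are hypothetical stand-ins for df_Auto:

```r
# Toy stand-in for df_Auto (hypothetical values); origin is coded 1 to 3
toy <- data.frame(
  mpg    = c(18, 26, 31),
  weight = c(3504L, 2234L, 2130L),
  origin = c(1L, 2L, 3L),
  name   = factor(c("a", "b", "c"))
)

# Recode origin as a factor so R treats it as qualitative
toy$origin <- factor(toy$origin, levels = 1:3,
                     labels = c("American", "European", "Japanese"))

# Select quantitative columns by type; factors are not numeric
quant_cols <- names(toy)[sapply(toy, is.numeric)]
quant_cols
```

Selecting by type avoids hard-coding column positions such as 1:7, which would silently break if the column order changed.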

What is the range of each quantitative predictor? You can answer this using the range() function.

sapply(df_Auto[,1:7], range)

##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

What is the mean and standard deviation of each quantitative predictor?

# Mean and standard deviation.
paste("Mean of each column")

## [1] "Mean of each column"

sapply(df_Auto[,1:7], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592

paste("Standard Deviation of each column")

## [1] "Standard Deviation of each column"

sapply(df_Auto[,1:7], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

# Drop observations 10 through 85
df_Auto_red <- df_Auto[-c(10:85),]

# Their respective range, mean and sd
paste("Range of each column : ")

## [1] "Range of each column : "

sapply(df_Auto_red[,1:7], range)

##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82

paste("Mean of each column : ")

## [1] "Mean of each column : "

sapply(df_Auto_red[,1:7], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570

paste("Standard Deviation of each column :")

## [1] "Standard Deviation of each column :"

sapply(df_Auto_red[,1:7], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217
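The range, mean and standard deviation can also be collected in a single table, which makes the full and reduced data easier to compare. This is only a sketch: `summarise_cols` is a hypothetical helper and `toy` holds made-up values standing in for df_Auto[,1:7].

```r
# Summarise each numeric column with min, max, mean and sd in one call
summarise_cols <- function(df) {
  sapply(df, function(x) c(min = min(x), max = max(x),
                           mean = mean(x), sd = sd(x)))
}

# Hypothetical toy data standing in for df_Auto[, 1:7]
toy <- data.frame(mpg    = c(18, 15, 30, 25),
                  weight = c(3504, 3693, 2100, 2500))

full    <- summarise_cols(toy)
reduced <- summarise_cols(toy[-c(2:3), ])   # drop rows 2 through 3

# How much each statistic moved after dropping the rows
round(full - reduced, 2)
```

With the lab's data the equivalent calls would be summarise_cols(df_Auto[,1:7]) and summarise_cols(df_Auto_red[,1:7]).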

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

pairs(df_Auto[,1:7])

cor(df_Auto[, c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "year")])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

The pair plot and the correlation matrix show linear relationships between several of the variables.
mpg: mpg has strong negative linear relationships with weight, displacement, horsepower and cylinders (correlations between -0.78 and -0.83). That is, we can expect the mpg of a car to fall as these increase.
mpg has a moderate positive correlation with year (0.58), which suggests that newer models tend to have higher mpg than older ones.
displacement: displacement has strong positive linear relationships with cylinders (0.95), weight (0.93) and horsepower (0.90). As displacement increases, so do these three features.
displacement has a strong negative relationship with mpg (-0.81), a moderate one with acceleration (-0.54) and a weak one with year (-0.37).
cylinders: strong negative correlation with mpg, moderate negative correlation with acceleration and weak negative correlation with year. That is, each of them is inversely related to cylinders.
Displacement, horsepower and weight, on the other hand, are positively related to cylinders: cars with more cylinders tend to have larger engine displacement, more horsepower and more weight.
horsepower: strong negative correlation with mpg, moderate negative correlation with acceleration (-0.69) and weak negative correlation with year. That is, each of them is inversely related to horsepower.
Displacement, cylinders and weight are positively related to horsepower: cars with more horsepower tend to have more cylinders, larger displacement and more weight.
weight: the pattern matches that of horsepower and cylinders. Negative correlation with mpg, acceleration and year; positive correlation with horsepower, displacement and cylinders.
acceleration & year: each is positively correlated only with the other (0.29) and with mpg; all remaining features are negatively related to them. Note that acceleration in Auto is the time in seconds to go from 0 to 60 mph, so a larger value means a slower car. The positive correlation with mpg therefore suggests that slower-accelerating cars tend to be more fuel-efficient.
Conclusion: Say one wants to know how a car's weight relates to how quickly it accelerates. Plotting acceleration against weight shows a moderate negative relation (-0.42). Because acceleration is a 0 to 60 mph time, heavier cars in this data set tend to reach 60 mph in fewer seconds, most likely because they also have larger, more powerful engines.

ggplot(Auto, aes(x = weight, y = acceleration)) + 
  geom_point() + 
  theme(legend.position = "none") + 
  scale_x_continuous(labels = scales::comma_format()) + 
  labs(x = "Weight", 
       y = "Acceleration", 
       title = "Correlation between weight and acceleration")
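Reading strong pairs off a large correlation matrix by eye is error-prone, so it can help to list them programmatically. The sketch below uses a hypothetical helper, `strong_pairs`, with an arbitrary cutoff of 0.8, and runs on the built-in mtcars data only so that it works without ISLR2:

```r
# Return variable pairs whose absolute correlation is at least `cutoff`
strong_pairs <- function(df, cutoff = 0.8) {
  r <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(r) >= cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             r    = r[idx])
}

# Built-in mtcars is a stand-in; with the lab's data: strong_pairs(df_Auto[, 1:7])
pairs_found <- strong_pairs(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])
pairs_found
```

The cutoff is a judgment call; lowering it to 0.5 would also surface the moderate relationships discussed above.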

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

The plots and the correlation matrix suggest that several variables are useful for predicting mpg.
An increase in displacement, cylinders, horsepower or weight is associated with lower mpg.
Newer model years tend to have higher mpg.
Qualitative variables may carry useful information as well. For example, we can compare mpg by origin (region of manufacture).

df_Auto$origin <- factor(df_Auto$origin, labels = c("American", "European", "Japanese"))

ggplot(df_Auto, aes(x = origin, y = mpg, fill = origin)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  labs(title = "Origin vs Mpg - Boxplot", 
       x = "Origin", 
       y = "MPG")

Findings: judging from the boxplot, Japanese-origin vehicles have comparatively higher mpg (around 33 mpg) than European (around 25 mpg) and American ones (around 20 mpg).
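Values read off a boxplot are approximate, and the centre line of a box is the median, not the mean. Both can be computed per group to check the reading. A minimal sketch, assuming a data frame with an origin factor and an mpg column (`toy` holds made-up values):

```r
# Hypothetical toy data; with the lab's data this would be df_Auto
toy <- data.frame(
  origin = factor(c("American", "American", "European",
                    "European", "Japanese", "Japanese")),
  mpg    = c(18, 22, 26, 28, 31, 33)
)

# Mean and median mpg per origin group
group_mean   <- tapply(toy$mpg, toy$origin, mean)
group_median <- tapply(toy$mpg, toy$origin, median)

rbind(mean = group_mean, median = group_median)
```

Running the same two tapply() calls on df_Auto$mpg and df_Auto$origin would give the exact group figures behind the boxplot.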

This exercise involves the Boston housing data set.

To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library, so after running library(ISLR2) the data set is contained in the object Boston.

library(ISLR2)
head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

Read about the data set with ?Boston. How many rows are in this data set? How many columns? What do the rows and columns represent?

str(Boston)

## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

#View(Boston)
dim(Boston)

## [1] 506  13

The data set has 506 rows, one per suburb (census tract) of Boston, and 13 columns, one per variable recorded for each suburb.

Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

# Pair plots of all features
pairs( Boston)

Findings: crim appears to have a negative relationship with medv and dis, i.e. the per capita crime rate tends to be lower where the median home value is higher and where the weighted mean distance to the five Boston employment centres is larger. crim and tax are positively related: higher crime rates occur in the tracts with higher tax rates. crim is also positively related to nox, i.e. the crime rate tends to be higher where the nitrogen oxide concentration is higher.

nox has a negative relationship with dis and medv, i.e. nitrogen oxide concentration decreases as the distance to the employment centres and the median home value increase. Since crim rises with nox, these are also the areas where we would expect lower crime rates.
dis has a positive relationship with medv, while it has a negative relationship with age: tracts far from the employment centres tend to have a smaller share of old houses.

Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

# Correlation coefficients between CRIM and all other variables.
cor(Boston[-1],Boston$crim)

##                [,1]
## zn      -0.20046922
## indus    0.40658341
## chas    -0.05589158
## nox      0.42097171
## rm      -0.21924670
## age      0.35273425
## dis     -0.37967009
## rad      0.62550515
## tax      0.58276431
## ptratio  0.28994558
## lstat    0.45562148
## medv    -0.38830461

There are some correlations between crim and other features, but they are not as strong as some of the relationships we observed in the Auto dataset.
crim has a negative linear relationship with medv, dis, rm and zn, and essentially none with chas (-0.06). For instance, the negative correlations of crim with rm and medv indicate that neighborhoods with more rooms per dwelling and higher median home values tend to have lower crime rates.
crim has a positive linear relationship with rad, tax, lstat, nox, indus, age and ptratio. For instance, the positive correlations with rad (0.63) and tax (0.58), the two strongest in the table, suggest that greater values of these variables correspond to higher crime rates.
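Sorting the correlations by absolute size makes the strongest associations easier to spot than scanning the column in data-set order. The following sketch uses a hypothetical helper, `rank_cor`, and the built-in mtcars data as a stand-in so that it runs without ISLR2:

```r
# Correlation of every other column with `target`, sorted by absolute size
rank_cor <- function(df, target) {
  others <- setdiff(names(df), target)
  r <- sapply(df[others], function(x) cor(x, df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# mtcars is a built-in stand-in; with the lab's data: rank_cor(Boston, "crim")
ranked <- rank_cor(mtcars, "mpg")
head(ranked, 3)
```

Sorting on the absolute value keeps strong negative and strong positive associations together at the top of the list.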

Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

# Suburbs with crime rate higher than 95% of suburbs.
High.Crime <- Boston[Boston$crim > quantile(Boston$crim, 0.95),]
print(nrow(High.Crime))

## [1] 26

print(paste("Range",range(Boston$crim)))

## [1] "Range 0.00632" "Range 88.9762"

print(paste("Mean",mean(Boston$crim)))

## [1] "Mean 3.61352355731225"

print(paste("Standard Deviation",sd(Boston$crim)))

## [1] "Standard Deviation 8.60154510533249"

There are 26 suburbs with a crime rate above the 95th percentile.
The range is very wide: it goes from a rate near zero (0.006) to about 89, while the mean is only 3.6.

# Suburbs with tax rates higher than 95% of suburbs.
High.Tax <- Boston[Boston$tax > quantile(Boston$tax, 0.95),]
print(nrow(High.Tax))

## [1] 5

print(paste("Range",range(Boston$tax)))

## [1] "Range 187" "Range 711"

print(paste("Mean",mean(Boston$tax)))

## [1] "Mean 408.237154150198"

print(paste("Standard Deviation",sd(Boston$tax)))

## [1] "Standard Deviation 168.537116054959"

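Only 5 suburbs have a tax rate strictly above the 95th percentile, fewer than the roughly 25 that 5% of 506 would give. This is likely because many suburbs share the same high tax value, so they tie with the cutoff instead of exceeding it.
Relative to its mean, the range (187 to 711) is much narrower than that of the crime rate.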

# Suburbs with ptratio higher than 95% of suburbs.
High.ptratio <- Boston[Boston$ptratio > quantile(Boston$ptratio, 0.95),]
print(nrow(High.ptratio))

## [1] 18

print(paste("Range",range(Boston$ptratio)))

## [1] "Range 12.6" "Range 22"

print(paste("Mean",mean(Boston$ptratio)))

## [1] "Mean 18.4555335968379"

print(paste("Standard Deviation",sd(Boston$ptratio)))

## [1] "Standard Deviation 2.16494552371444"

There are 18 suburbs with a pupil-teacher ratio above the 95th percentile.
The range is quite narrow, from 12.6 to 22, which seems reasonable since class sizes are typically constrained by education policy.
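The three blocks above repeat the same steps for crim, tax and ptratio, so they could be wrapped in one function. The sketch below (`tail_summary` is a hypothetical helper, `x` is made-up data) also illustrates how ties at the cutoff shrink the count when a strict `>` comparison is used:

```r
# Count values strictly above the p-th quantile and report basic spread
tail_summary <- function(x, p = 0.95) {
  cutoff <- quantile(x, p, names = FALSE)
  c(cutoff = cutoff, n_above = sum(x > cutoff),
    min = min(x), max = max(x), mean = mean(x), sd = sd(x))
}

# Hypothetical vector of 40 values with heavy ties near the top
x <- c(rep(200, 10), rep(666, 29), 711)
res <- tail_summary(x)
res
```

Here 5% of 40 values would be 2, but only one value lies strictly above the cutoff because the cutoff itself falls on the tied value. With the lab's data the calls would be tail_summary(Boston$crim), tail_summary(Boston$tax) and tail_summary(Boston$ptratio).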

How many of the census tracts in this data set bound the Charles river?

sum(Boston$chas==1)

## [1] 35

35 suburbs/towns bound the Charles river.

What is the median pupil-teacher ratio among the towns in this data set?

median(Boston$ptratio)

## [1] 19.05

The median pupil-teacher ratio among the towns is 19.05.

Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

which(Boston$medv == min(Boston$medv))

## [1] 399 406

There are two suburbs (399 & 406) that have the lowest median property values.

# Values of other predictors for suburb 399
Boston[399,]

##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5

# Values of other predictors for suburb 406
Boston[406,]

##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5

We can see that the two observations with the lowest medv take very similar values, and for many predictors those values are quite extreme.

>= 90th percentile for: crim, age, lstat
>= 75th percentile for: indus, nox, rad, tax, ptratio
Both suburbs have chas = 0: they do not bound the Charles river
Lower percentiles for: zn, rm, dis
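The percentile positions listed above can be computed directly with the empirical cumulative distribution function. This is a sketch under assumptions: `percentile_rank` is a hypothetical helper and `toy` contains made-up values standing in for Boston.

```r
# Percentile rank of one row's values within each column
# (the share of observations less than or equal to that value)
percentile_rank <- function(df, row) {
  sapply(names(df), function(v) ecdf(df[[v]])(df[row, v]))
}

# Hypothetical toy data; with the lab's data:
#   round(percentile_rank(Boston, 399), 2)
toy <- data.frame(crim = c(0.1, 0.5, 2, 40),
                  medv = c(30, 24, 20, 5))
pr <- percentile_rank(toy, 4)
pr
```

A value of 1 means the observation is the column maximum, while a value near 0 places it at the bottom of the range.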

In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

# More than 7 rooms
sum(Boston$rm > 7)

## [1] 64

64 census tracts average more than 7 rooms per dwelling.

# More than 8 rooms
sum(Boston$rm > 8)

## [1] 13

13 census tracts average more than 8 rooms per dwelling.

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

summary(subset(Boston, rm > 8))

##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          lstat           medv     
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
##  Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0

Findings:

Comparing the summary statistics of these tracts with the entire data set highlights several points: crim and lstat are relatively low and medv is much higher, as the quartiles show.
The crime rate is low for tracts averaging more than eight rooms, with a mean of 0.72 and a maximum of 3.47 (against 3.61 and 88.98 overall), suggesting safer areas.
The proportion of residential land zoned for lots over 25,000 sq. ft. is slightly higher, with a mean zn of 13.62 compared with 11.36 overall. Similarly, the proportion of non-retail business acres per town is lower (mean indus of 7.08 against 11.14 overall). This may indicate a well-balanced mix of residential and industrial land that leans residential.
The mean age value is 71.54, close to the overall 68.57 (age is the proportion of owner-occupied units built before 1940, not the age of the houses), and the mean property tax rate is 325.1, below the overall 408.2. Together with the high medv (median 48.3 against 21.2 overall), this suggests these are desirable areas with greater property values.
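Comparing two summary() printouts by eye is tedious, so the subset and overall means can be placed side by side instead. A minimal sketch, where `compare_means` is a hypothetical helper and `toy` holds made-up values:

```r
# Side-by-side column means for a subset and for the full data
compare_means <- function(df, keep) {
  cbind(subset  = colMeans(df[keep, , drop = FALSE]),
        overall = colMeans(df))
}

# Hypothetical toy data; with the lab's data:
#   round(compare_means(Boston, Boston$rm > 8), 2)
toy <- data.frame(rm   = c(5.5, 6.2, 8.1, 8.4),
                  medv = c(15, 21, 45, 50))
cmp <- compare_means(toy, toy$rm > 8)
round(cmp, 2)
```

Means are sensitive to outliers such as the extreme crim values, so the same idea with medians may be a more robust comparison for skewed columns.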

Amritha Prakash

2024-01-21