Data_Analytics_Lab

I am importing the libraries needed to run these notes.

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.3.2

## Warning: package 'lubridate' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.2

## corrplot 0.92 loaded

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.3.2

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

# Loading the Dataset

data(Auto)
str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

summary(Auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

# Auto Dataset Exercise

(a) Identifying Quantitative and Qualitative Predictors:

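Quantitative Predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, and year are numerical.

Qualitative Predictors: name is a categorical variable. origin is stored as a number, but it is a code for the region of origin (1 = American, 2 = European, 3 = Japanese), so it is qualitative as well. The numeric summaries reported for origin below appear only because of how it is stored and are not meaningful as quantities.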

(b) Range of Each Quantitative Predictor:

The range of each quantitative predictor can be computed with the range() function. Reading the minimum and maximum from the summary() output above gives:

mpg: 9 to 46.6; cylinders: 3 to 8; displacement: 68 to 455; horsepower: 46 to 230; weight: 1613 to 5140; acceleration: 8 to 24.8; year: 70 to 82; origin: 1 to 3.
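
The same ranges can be computed directly instead of being read off the summary. The following is a minimal sketch, assuming the ISLR package is installed; the names num_cols and ranges are my own and are used only for illustration.

```r
library(ISLR)  # provides the Auto data set

data(Auto)

# Keep only the columns stored as numbers. This also picks up origin,
# which is numeric in storage even though it is a region code.
num_cols <- Auto[sapply(Auto, is.numeric)]

# range() returns c(min, max); sapply() stacks these into a 2-row matrix
# with one column per variable.
ranges <- sapply(num_cols, range)
rownames(ranges) <- c("min", "max")
print(ranges)
```

Each column of the printed matrix should agree with the Min. and Max. rows of summary(Auto).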

(c) Mean and Standard Deviation of each quantitative predictor:

cat("\nmpg:\n")

## 
## mpg:

cat("Mean =", mean(Auto$mpg, na.rm = TRUE), "\n")

## Mean = 23.44592

cat("Standard Deviation =", sd(Auto$mpg, na.rm = TRUE), "\n")

## Standard Deviation = 7.805007

cat("\ncylinders:\n")

## 
## cylinders:

cat("Mean =", mean(Auto$cylinders, na.rm = TRUE), "\n")

## Mean = 5.471939

cat("Standard Deviation =", sd(Auto$cylinders, na.rm = TRUE), "\n")

## Standard Deviation = 1.705783

cat("\ndisplacement:\n")

## 
## displacement:

cat("Mean =", mean(Auto$displacement, na.rm = TRUE), "\n")

## Mean = 194.412

cat("Standard Deviation =", sd(Auto$displacement, na.rm = TRUE), "\n")

## Standard Deviation = 104.644

cat("\nhorsepower:\n")

## 
## horsepower:

cat("Mean =", mean(Auto$horsepower, na.rm = TRUE), "\n")

## Mean = 104.4694

cat("Standard Deviation =", sd(Auto$horsepower, na.rm = TRUE), "\n")

## Standard Deviation = 38.49116

cat("\nweight:\n")

## 
## weight:

cat("Mean =", mean(Auto$weight, na.rm = TRUE), "\n")

## Mean = 2977.584

cat("Standard Deviation =", sd(Auto$weight, na.rm = TRUE), "\n")

## Standard Deviation = 849.4026

cat("\nacceleration:\n")

## 
## acceleration:

cat("Mean =", mean(Auto$acceleration, na.rm = TRUE), "\n")

## Mean = 15.54133

cat("Standard Deviation =", sd(Auto$acceleration, na.rm = TRUE), "\n")

## Standard Deviation = 2.758864

cat("\nyear:\n")

## 
## year:

cat("Mean =", mean(Auto$year, na.rm = TRUE), "\n")

## Mean = 75.97959

cat("Standard Deviation =", sd(Auto$year, na.rm = TRUE), "\n")

## Standard Deviation = 3.683737

cat("\norigin:\n")

## 
## origin:

cat("Mean =", mean(Auto$origin, na.rm = TRUE), "\n")

## Mean = 1.576531

cat("Standard Deviation =", sd(Auto$origin, na.rm = TRUE), "\n")

## Standard Deviation = 0.8055182
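
The repeated cat() calls above could be collapsed into one table. This is only a sketch of an alternative, assuming the ISLR package is installed; num_cols and mean_sd are illustrative names that do not appear elsewhere in these notes.

```r
library(ISLR)  # provides the Auto data set

data(Auto)

# Columns stored as numbers (origin is included because of its storage type).
num_cols <- Auto[sapply(Auto, is.numeric)]

# One row per variable, with the mean and standard deviation side by side.
mean_sd <- cbind(
  mean = sapply(num_cols, mean, na.rm = TRUE),
  sd   = sapply(num_cols, sd, na.rm = TRUE)
)
print(round(mean_sd, 3))
```

The values should match the individual results printed above, rounded to three decimals.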

(d) Removing the 10th through 85th observations from the Auto dataset and recomputing the range, mean, and standard deviation of each predictor.

Auto_subset <- Auto[-(10:85), ]

# Print the range, mean, and standard deviation of one numeric vector
print_stats <- function(data, variable_name) {
  cat("\n", variable_name, ":\n")
  cat("Range: ", range(data, na.rm = TRUE), "\n")
  cat("Mean: ", mean(data, na.rm = TRUE), "\n")
  cat("Standard Deviation: ", sd(data, na.rm = TRUE), "\n")
}

# Apply the function to each quantitative predictor
print_stats(Auto_subset$mpg, "mpg")

## 
##  mpg :
## Range:  11 46.6 
## Mean:  24.40443 
## Standard Deviation:  7.867283

print_stats(Auto_subset$cylinders, "cylinders")

## 
##  cylinders :
## Range:  3 8 
## Mean:  5.373418 
## Standard Deviation:  1.654179

print_stats(Auto_subset$displacement, "displacement")

## 
##  displacement :
## Range:  68 455 
## Mean:  187.2405 
## Standard Deviation:  99.67837

print_stats(Auto_subset$horsepower, "horsepower")

## 
##  horsepower :
## Range:  46 230 
## Mean:  100.7215 
## Standard Deviation:  35.70885

print_stats(Auto_subset$weight, "weight")

## 
##  weight :
## Range:  1649 4997 
## Mean:  2935.972 
## Standard Deviation:  811.3002

print_stats(Auto_subset$acceleration, "acceleration")

## 
##  acceleration :
## Range:  8.5 24.8 
## Mean:  15.7269 
## Standard Deviation:  2.693721

print_stats(Auto_subset$year, "year")

## 
##  year :
## Range:  70 82 
## Mean:  77.14557 
## Standard Deviation:  3.106217

print_stats(Auto_subset$origin, "origin")

## 
##  origin :
## Range:  1 3 
## Mean:  1.601266 
## Standard Deviation:  0.81991

(e) Creating scatterplots to highlight the relationships among the predictors.

# Scatterplot of horsepower vs mpg
plot(Auto$horsepower, Auto$mpg, main = "Horsepower vs MPG", xlab = "Horsepower", ylab = "Miles per Gallon")

# Scatterplot of weight vs mpg
plot(Auto$weight, Auto$mpg, main = "Weight vs MPG", xlab = "Weight", ylab = "Miles per Gallon")

Findings: Both plots show a negative relationship: mpg falls as horsepower and weight increase. The plots alone do not establish strict inverse proportionality, only that the association is negative. A plausible explanation is that a more powerful engine burns fuel faster, and a heavier car needs more power to accelerate and to hold its speed.
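
The direction of these relationships can also be checked numerically with a correlation coefficient. This is a rough sketch, assuming the ISLR package is installed; mpg_cor is an illustrative name, and Pearson correlation only measures linear association, so it may understate a curved relationship.

```r
library(ISLR)  # provides the Auto data set

data(Auto)

# Pearson correlation of mpg with each of the two plotted predictors.
mpg_cor <- sapply(
  Auto[, c("horsepower", "weight")],
  function(v) cor(Auto$mpg, v)
)
print(round(mpg_cor, 3))
```

Negative values for both predictors would be consistent with the downward trend seen in the scatterplots.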

(f) Another predictor that may be useful for predicting mpg:

plot(Auto$displacement, Auto$mpg, 
     main = "Displacement vs MPG", 
     xlab = "Engine Displacement", 
     ylab = "Miles per Gallon (MPG)", 
     pch = 19, col = "blue")

Finding: mpg decreases as engine displacement increases. A larger displacement lets the engine take in more air and fuel on each cycle, so it generates more power but also burns more fuel, which lowers mpg.

# Boston Problem Set

data(Boston)
dim(Boston)

## [1] 506  14

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

(b) Pairwise scatterplots of the columns of the Boston dataset.

pairs(Boston)

Linearity: Some variables show a linear relationship with others; for example, rm (average number of rooms per dwelling) seems to have a positive linear relationship with medv (median value of owner-occupied homes), suggesting that larger homes tend to be more valuable.
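
One way to check the rm and medv relationship beyond the scatterplot is a simple linear fit. This is a minimal sketch, assuming the MASS package is installed; fit is an illustrative name, and a straight line is only a first approximation here.

```r
library(MASS)  # provides the Boston data set

data(Boston)

# Regress median home value on average number of rooms per dwelling.
fit <- lm(medv ~ rm, data = Boston)

# A positive rm coefficient means medv tends to rise with rm.
print(coef(fit))
print(cor(Boston$rm, Boston$medv))
```

A positive slope and a positive correlation would support the reading of the pairs() plot given above.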

(c) Predictors associated with per capita crime rate

Crime Rate and Zoning (zn): One would expect a negative correlation, where suburbs with a larger share of residential land zoned for large lots have lower crime rates.

Crime Rate and Industrialization (indus): There may be a positive correlation, where more industrially zoned areas have higher crime rates, possibly due to less residential presence and more anonymity.
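
These expectations can be tested against the data by correlating crim with every other column. The following is a sketch, assuming the MASS package is installed; crim_cor is an illustrative name, and correlation shows association only, not cause.

```r
library(MASS)  # provides the Boston data set

data(Boston)

# Correlation of crim with each of the other 13 columns.
crim_cor <- cor(Boston)[, "crim"]
crim_cor <- crim_cor[names(crim_cor) != "crim"]

# Sort so the most positively associated predictors come first.
crim_cor <- sort(crim_cor, decreasing = TRUE)
print(round(crim_cor, 2))
```

The signs for zn and indus in this output show whether the expected negative and positive associations hold.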

(d) Suburbs with particularly high crime rates, tax rates, or pupil-teacher ratios

High Crime Rates: The per capita crime rate (crim) ranges from 0.00632 to about 89, while the median is only about 0.26, so a small number of suburbs have crime rates far above the rest.

High Tax Rates: The property-tax rate (tax) ranges from 187 to 711 per $10,000, and the third quartile is already 666, so a sizeable group of suburbs is taxed at a much higher rate than the others.

High Pupil-Teacher Ratios: The pupil-teacher ratio (ptratio) ranges from 12.6 to 22, a much narrower spread. The suburbs near 22 have the fewest teachers per student and may have overcrowded schools.

(e) Number of suburbs that bound the Charles River

sum(Boston$chas == 1)

## [1] 35

(f) Median pupil-teacher ratio among the towns in the Boston dataset

median_ptratio <- median(Boston$ptratio)
print(median_ptratio)

## [1] 19.05

(g) Which suburb of Boston has the lowest median value of owner-occupied homes?

min_medv_suburb <- Boston[Boston$medv == min(Boston$medv), ]
print(min_medv_suburb)

##        crim zn indus chas   nox    rm age    dis rad tax ptratio  black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.90 30.59
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 384.97 22.98
##     medv
## 399    5
## 406    5

Finding: Two suburbs, rows 399 and 406, tie for the lowest medv of 5. Since medv is recorded in thousands of dollars, this corresponds to a median home value of $5,000.

(h) How many suburbs average more than 7 rooms per dwelling? More than 8?

# Number of suburbs with more than 7 rooms per dwelling
sum(Boston$rm > 7)

## [1] 64

# Number of suburbs with more than 8 rooms per dwelling
sum(Boston$rm > 8)

## [1] 13
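
To say something about the suburbs that average more than 8 rooms, their medians can be compared with those of the full dataset. This is a rough sketch, assuming the MASS package is installed; the names big, vars, and comparison are illustrative, and the choice of crim, lstat, and medv as comparison columns is my own.

```r
library(MASS)  # provides the Boston data set

data(Boston)

# Suburbs averaging more than 8 rooms per dwelling.
big <- Boston[Boston$rm > 8, ]

# Columns chosen for the comparison (an assumption made for illustration).
vars <- c("crim", "lstat", "medv")

# Median of each column for the large-home suburbs and for all suburbs.
comparison <- rbind(
  rm_gt_8 = sapply(big[, vars], median),
  all     = sapply(Boston[, vars], median)
)
print(comparison)
```

Comparing the two rows shows how the large-home suburbs differ from a typical suburb on crime rate, lower-status population share, and home value.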

Data_Analytics_Lab_1

Surya

2024-01-24