Loading data

# Reading the data
auto_data <- read.csv("C:/Users/dell/Downloads/Auto.csv")

a) Which of the predictors are quantitative, and which are qualitative?

# Function to classify predictors
classify_predictors <- function(df) {
  quantitative <- character()
  qualitative <- character()

  for (col_name in names(df)) {
    if (is.numeric(df[[col_name]])) {
      quantitative <- c(quantitative, col_name)
    } else if (is.factor(df[[col_name]]) || is.character(df[[col_name]])) {
      qualitative <- c(qualitative, col_name)
    }
  }

  list(quantitative = quantitative, qualitative = qualitative)
}

# Classifying the predictors
predictor_types <- classify_predictors(auto_data)

# Print the results
print("Quantitative Predictors:")

## [1] "Quantitative Predictors:"

print(predictor_types$quantitative)

## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"

print("Qualitative Predictors:")

## [1] "Qualitative Predictors:"

print(predictor_types$qualitative)

## [1] "name"
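Note that this classification looks only at how each column is stored. In the ISLR Auto data, origin is a numeric code (1 = American, 2 = European, 3 = Japanese), so it is arguably qualitative even though it is stored as a number, and cylinders takes only a handful of distinct values and could be treated either way. A minimal sketch of the recoding, using a small made-up data frame (`toy_auto` is a hypothetical stand-in, not the real file):

```r
# Hypothetical stand-in for a few rows of auto_data (values are made up).
toy_auto <- data.frame(
  mpg    = c(18, 26, 31.5),
  origin = c(1L, 2L, 3L)
)

# Recode the numeric origin codes as a labelled factor.
toy_auto$origin <- factor(
  toy_auto$origin,
  levels = c(1, 2, 3),
  labels = c("American", "European", "Japanese")
)

# After recoding, origin is no longer picked up as numeric.
is_quantitative <- sapply(toy_auto, is.numeric)
print(names(toy_auto)[is_quantitative])
print(names(toy_auto)[!is_quantitative])
```

Applying the same `factor()` call to `auto_data$origin` before classifying would move origin into the qualitative group.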

b) What is the range of each quantitative predictor? You can answer this using the range() function.

# Calculate range for each quantitative predictor
range_mpg <- range(auto_data$mpg, na.rm = TRUE)
range_cylinders <- range(auto_data$cylinders, na.rm = TRUE)
range_displacement <- range(auto_data$displacement, na.rm = TRUE)
range_horsepower <- range(auto_data$horsepower, na.rm = TRUE)
range_weight <- range(auto_data$weight, na.rm = TRUE)
range_acceleration <- range(auto_data$acceleration, na.rm = TRUE)
range_year <- range(auto_data$year, na.rm = TRUE)

# Print the ranges
print(paste("MPG: ", range_mpg))

## [1] "MPG:  9"    "MPG:  46.6"

print(paste("Cylinders: ", range_cylinders))

## [1] "Cylinders:  3" "Cylinders:  8"

print(paste("Displacement: ", range_displacement))

## [1] "Displacement:  68"  "Displacement:  455"

print(paste("Horsepower: ", range_horsepower))

## [1] "Horsepower:  46"  "Horsepower:  230"

print(paste("Weight: ", range_weight))

## [1] "Weight:  1613" "Weight:  5140"

print(paste("Acceleration: ", range_acceleration))

## [1] "Acceleration:  8"    "Acceleration:  24.8"

print(paste("Year: ", range_year))

## [1] "Year:  70" "Year:  82"
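The separate range() calls above could also be collapsed into a single table. The sketch below uses the built-in mtcars data purely as a stand-in, since the Auto.csv path is local to one machine; with the lab's data, `df` would be `auto_data`.

```r
# mtcars ships with R and is used here only as a stand-in for auto_data.
df <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]

# Keep the numeric columns, then apply range() to each one.
numeric_cols <- df[, sapply(df, is.numeric), drop = FALSE]
ranges <- t(sapply(numeric_cols, range, na.rm = TRUE))
colnames(ranges) <- c("Min", "Max")

print(ranges)
```

This gives one row per predictor with its minimum and maximum, which avoids the repeated label that paste() produces when it is given a two-element vector.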

c) What is the mean and standard deviation of each quantitative predictor?

# Function to calculate mean and standard deviation for quantitative predictors
calculate_stats <- function(df) {
  stats <- data.frame(Predictor = character(), Mean = numeric(), SD = numeric(), stringsAsFactors = FALSE)

  for (col_name in names(df)) {
    if (is.numeric(df[[col_name]])) {
      mean_val <- mean(df[[col_name]], na.rm = TRUE)
      sd_val <- sd(df[[col_name]], na.rm = TRUE)
      stats <- rbind(stats, data.frame(Predictor = col_name, Mean = mean_val, SD = sd_val))
    }
  }

  stats
}

# Calculating stats for quantitative predictors
quantitative_stats <- calculate_stats(auto_data)

# Print the results
print(quantitative_stats)

##      Predictor        Mean          SD
## 1          mpg   23.445918   7.8050075
## 2    cylinders    5.471939   1.7057832
## 3 displacement  194.411990 104.6440039
## 4   horsepower  104.469388  38.4911599
## 5       weight 2977.584184 849.4025600
## 6 acceleration   15.541327   2.7588641
## 7         year   75.979592   3.6837365
## 8       origin    1.576531   0.8055182

d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

# Remove the 10th through 85th observations
auto_data_modified <- auto_data[-(10:85), ]

# Function to calculate range, mean, and standard deviation
calculate_stats <- function(df) {
  stats <- data.frame(
    Predictor = character(),
    Min = numeric(),
    Max = numeric(),
    Mean = numeric(),
    SD = numeric(),
    stringsAsFactors = FALSE
  )

  for (col_name in names(df)) {
    # Check if the column is numeric (integer or double)
    if (is.numeric(df[[col_name]])) {
      min_val <- min(df[[col_name]], na.rm = TRUE)
      max_val <- max(df[[col_name]], na.rm = TRUE)
      mean_val <- mean(df[[col_name]], na.rm = TRUE)
      sd_val <- sd(df[[col_name]], na.rm = TRUE)
      stats <- rbind(stats, data.frame(Predictor = col_name, Min = min_val, Max = max_val, Mean = mean_val, SD = sd_val))
    }
  }

  stats
}

# Calculating stats for the modified dataset
stats_modified <- calculate_stats(auto_data_modified)

# Print the results
print(stats_modified)

##      Predictor    Min    Max        Mean         SD
## 1          mpg   11.0   46.6   24.404430   7.867283
## 2    cylinders    3.0    8.0    5.373418   1.654179
## 3 displacement   68.0  455.0  187.240506  99.678367
## 4   horsepower   46.0  230.0  100.721519  35.708853
## 5       weight 1649.0 4997.0 2935.971519 811.300208
## 6 acceleration    8.5   24.8   15.726899   2.693721
## 7         year   70.0   82.0   77.145570   3.106217
## 8       origin    1.0    3.0    1.601266   0.819910
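To see how removing a block of rows shifts the statistics, the full and reduced summaries can be placed side by side. This is a sketch under assumptions: `summarise_numeric` is a hypothetical helper, and mtcars stands in for the lab's data so that the block runs without the CSV file.

```r
# Mean and SD of every numeric column, returned as a data frame.
summarise_numeric <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  data.frame(
    Predictor = names(num),
    Mean = sapply(num, mean, na.rm = TRUE),
    SD = sapply(num, sd, na.rm = TRUE),
    row.names = NULL
  )
}

# mtcars is only a stand-in; with the real data use auto_data instead.
full <- mtcars
reduced <- full[-(10:20), ]  # drop a block of rows, as in part (d)

# Join the two summaries on the predictor name.
comparison <- merge(
  summarise_numeric(full), summarise_numeric(reduced),
  by = "Predictor", suffixes = c("_full", "_reduced")
)
print(comparison)
```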

e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

# Load the ggplot2 package
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

# Assuming 'auto_data' is your dataframe

# Scatterplot of mpg vs weight
ggplot(auto_data, aes(x = weight, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs Weight", x = "Weight", y = "MPG")

# Scatterplot of mpg vs horsepower
ggplot(auto_data, aes(x = horsepower, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "MPG")

# Scatterplot of mpg vs displacement
ggplot(auto_data, aes(x = displacement, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs Displacement", x = "Displacement", y = "MPG")

library(GGally)

## Warning: package 'GGally' was built under R version 4.3.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

selected_predictors <- auto_data[, c("mpg", "cylinders", "displacement", "horsepower", "weight")]

# Create the pair plot
ggpairs(selected_predictors)

library(ggplot2)
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.2

## corrplot 0.92 loaded

# Select only the quantitative predictors for correlation analysis

quantitative_predictors <- auto_data[, sapply(auto_data, is.numeric)]

# Compute the correlation matrix
cor_matrix <- cor(quantitative_predictors, use = "complete.obs")

# Create the heatmap using corrplot
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

Analysis and Comments

MPG and Engine Features: mpg is negatively related to cylinders, displacement, horsepower, and weight. Cars with higher values of these variables tend to have lower fuel efficiency (mpg).

Engine Size and Weight: Displacement, horsepower, and weight are positively correlated with one another. Cars with larger, more powerful engines (higher displacement and horsepower) also tend to be heavier.

f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Yes. The scatterplots and the correlation heatmap above suggest that several of the other variables would be useful for predicting mpg:

Displacement, Horsepower, and Weight:

There is a clear downward trend visible in the scatterplot of mpg vs weight. As the weight of the vehicles increases, the mpg appears to decrease. This indicates a negative correlation between vehicle weight and fuel efficiency.

There is a visible negative correlation between horsepower and mpg. As horsepower increases, the mpg seems to decrease. This indicates that cars with more powerful engines tend to have lower fuel efficiency.

Similar to the previous variables (weight and horsepower), there is a negative correlation between displacement and mpg. As engine displacement increases, the mpg tends to decrease. This suggests that vehicles with larger engines, which have more displacement, generally have lower fuel efficiency.
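One way to check whether a variable that looks useful in a scatterplot actually carries predictive information is to fit a simple linear regression and inspect the slope and R-squared. The sketch below is illustrative only: it uses the built-in mtcars data as a stand-in, with its `wt` column playing the role of `weight`. With the lab's data the analogous call would be `lm(mpg ~ weight, data = auto_data)`.

```r
# mtcars (built into R) stands in for auto_data; 'wt' plays the role of
# 'weight'. This is an illustrative assumption, not the lab's data.
fit <- lm(mpg ~ wt, data = mtcars)

# A negative slope means heavier cars are predicted to have lower mpg.
slope <- unname(coef(fit)["wt"])

# R-squared is the share of the variance in mpg explained by weight alone.
r_squared <- summary(fit)$r.squared

print(slope)
print(r_squared)
```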

Boston Data

a) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.

library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.3.2

num_rows <- nrow(Boston)
num_columns <- ncol(Boston)

# Display the number of rows and columns
print(paste("Number of rows:", num_rows))

## [1] "Number of rows: 506"

print(paste("Number of columns:", num_columns))

## [1] "Number of columns: 13"

b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

# Select a subset of predictors of interest
selected_predictors <- Boston[, c("crim", "rm", "medv")]

# Create pairwise scatterplots
pairs(selected_predictors)

Interpretations and Insights:

Crime Rate (crim) vs. Other Variables:

There’s a noticeable trend of increasing crime rates in areas with higher industrial activity (indus) and higher tax rates (tax). The median value of homes (medv) tends to be lower in areas with higher crime rates.

Residential Land Zoned (zn) vs. Other Variables:

There’s a cluster of data points at the lower end, indicating many areas with little or no land zoned for large residential plots. Higher values of zn generally correspond to lower industrial activity (indus) and lower nitrogen oxides concentration (nox).

Industrial Activity (indus) vs. Other Variables:

Higher industrial activity is associated with higher nox levels and older buildings (age). There’s a negative relationship with the median value of homes (medv).

Nitrogen Oxides Concentration (nox) vs. Other Variables:

As expected, nox levels are higher in industrial areas (indus). Higher nox levels are also associated with older constructions (age).

Average Number of Rooms (rm) vs. Other Variables:

A clear positive correlation is observed between the average number of rooms and the median value of homes (medv).

Property-Tax Rate (tax) vs. Other Variables:

Higher tax rates are seen in areas with more industrial activity (indus).

Pupil-Teacher Ratio (ptratio) vs. Other Variables:

There’s no clear trend observed with ptratio and most of the variables, though there might be a slight negative correlation with medv.

Median Value of Homes (medv) vs. Other Variables:

Apart from the positive relationship with the average number of rooms (rm), medv is generally lower in areas with higher crime rates (crim), more industrial activity (indus), and higher levels of nox.
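The pairs() call above plots only crim, rm, and medv, while several of these observations involve zn, indus, nox, age, tax, and ptratio. A scatterplot matrix over a wider subset would be needed to see those relationships. The following is a minimal sketch, where `plot_pairs_subset` is a hypothetical helper rather than a function from any package:

```r
# Draw a scatterplot matrix for the requested columns of a data frame.
# 'plot_pairs_subset' is a hypothetical helper, not part of any package.
plot_pairs_subset <- function(df, vars) {
  # Fail early with a readable message if a column name is misspelled.
  missing_vars <- setdiff(vars, names(df))
  if (length(missing_vars) > 0) {
    stop("Columns not found: ", paste(missing_vars, collapse = ", "))
  }
  pairs(df[, vars, drop = FALSE])
  invisible(vars)
}

# Intended use with the lab's data (requires ISLR2 to be installed):
# plot_pairs_subset(Boston, c("crim", "zn", "indus", "nox", "rm",
#                             "age", "tax", "ptratio", "medv"))
```

With nine variables the individual panels become small, so it may be easier to read two or three smaller matrices grouped by theme.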

c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

To determine whether any predictors are associated with per capita crime rate in the Boston Housing dataset, we can compute pairwise correlations with crim and fit a regression of crim on the other predictors.

library(readr)
library(ggplot2)


# Calculate correlation matrix
correlation_matrix <- cor(Boston)


correlations_with_crim <- correlation_matrix["crim", ]

# regression analysis
model <- lm(crim ~ ., data = Boston)

# Summary of the model to check for significant predictors
summary(model)

## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Interpretation and Analysis

The residuals are the differences between the observed values and the values predicted by the model. Their range shows some large discrepancies between predicted and actual values: the maximum residual is 73.923, more than ten times the residual standard error of 6.46.

Pr(>|t|) is the p-value: the probability of observing a test statistic at least as extreme as the one observed if the true coefficient were zero. Small p-values indicate strong evidence against the null hypothesis, suggesting the predictor is associated with crim after adjusting for the other predictors.

In summary, the model identifies several significant predictors of per capita crime rate. At the 0.001 level, crime rate is positively associated with accessibility to radial highways (rad) and negatively associated with distance to employment centers (dis) and the median value of homes (medv); the proportion of land zoned for large lots (zn) is positive and significant at the 0.05 level. The model explains about 45% of the variability in crime rate (R-squared 0.4493), so a substantial share is not captured by these predictors.
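The code above stores the correlations with crim but never prints them. Pairwise correlations answer a slightly different question from the regression coefficients (association on its own versus association after adjusting for the other predictors), so it can help to view them sorted by strength. A sketch, where `rank_correlations` is a hypothetical helper:

```r
# Correlation of every other numeric column with a target column,
# sorted by absolute strength. 'rank_correlations' is a hypothetical helper.
rank_correlations <- function(df, target) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  cors <- cor(num, use = "complete.obs")[target, ]
  cors <- cors[names(cors) != target]  # drop the target's self-correlation
  cors[order(abs(cors), decreasing = TRUE)]
}

# Demonstrated on built-in mtcars so that it runs without ISLR2;
# for the analysis above the call would be rank_correlations(Boston, "crim").
print(rank_correlations(mtcars, "mpg"))
```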

d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

To determine if any census tracts have particularly high crime rates, tax rates, or pupil-teacher ratios, I will look at the summary statistics (such as max, mean, standard deviation) for those specific variables in the dataset.

# Summary statistics for crime rate
summary(Boston$crim)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

# Summary statistics for tax rate
summary(Boston$tax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

# Summary statistics for pupil-teacher ratio
summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

Interpretation and findings:

Crime Rate (crim):

Min: The minimum crime rate is very low at 0.00632, suggesting there are areas with almost negligible per capita crime rates. 1st Qu. (25th percentile): At 0.08204, this indicates that 25% of the census tracts have a crime rate lower than this value. Median: The median crime rate is 0.25651, which means half the tracts have a lower crime rate and the other half have a higher rate. Mean: The average crime rate is much higher at 3.61352 due to the presence of some tracts with very high crime rates, which skews the mean upward. 3rd Qu. (75th percentile): At 3.67708, 75% of the tracts have a crime rate lower than this value, and 25% have a higher rate, indicating a wide spread. Max: The maximum crime rate is extremely high at 88.97620, indicating that there are some census tracts with particularly high crime rates.

Tax Rate (tax):

Min: The minimum tax rate is 187; tax is the full-value property-tax rate per $10,000. 1st Qu.: The first quartile is 279, meaning 25% of tracts have a tax rate below this. Median: The median tax rate is 330, the midpoint of the data. Mean: The mean tax rate is noticeably higher at 408.2, reflecting a group of tracts with high tax rates. 3rd Qu.: At 666, the third quartile lies close to the maximum, so roughly a quarter of the tracts are taxed at 666 or more. Max: The maximum tax rate is 711, not far above the third quartile, so the high-tax tracts form a sizeable group rather than a few isolated extremes.

Pupil-Teacher Ratio (ptratio):

Min: The lowest pupil-teacher ratio is 12.60, which is quite favorable. 1st Qu.: At 17.40, this means that 25% of tracts have a ratio less than this. Median: The median ratio is 19.05, the value in the middle of the dataset. Mean: The average ratio is slightly lower at 18.46, indicating the data is not heavily skewed. 3rd Qu.: A ratio of 20.20 means that 75% of tracts have a pupil-teacher ratio at or below this figure. Max: The highest pupil-teacher ratio is 22.00, only modestly above the third quartile, so the values span a fairly narrow range (12.60 to 22.00).

In summary, while the pupil-teacher ratios are relatively uniform across census tracts, there is a significant variation in both crime rates and tax rates. The crime rate particularly shows a very wide range, with some tracts having extremely high crime rates that skew the average well above the median. The tax rates also vary considerably, but the highest rates do not skew the mean as dramatically as for crime rates. These variables’ wide ranges indicate diverse characteristics among the census tracts in the Boston area.
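The phrase "particularly high" can be made more concrete by counting how many tracts lie above a chosen threshold. One common convention, used here as an assumption rather than as the method of this lab, is the boxplot rule that flags values above Q3 + 1.5 × IQR. `count_high_outliers` is a hypothetical helper:

```r
# Count values above Q3 + k * IQR (k = 1.5 is the usual boxplot convention).
count_high_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  upper_fence <- q[2] + k * (q[2] - q[1])
  sum(x > upper_fence, na.rm = TRUE)
}

# Made-up values, purely to show the behaviour: only 100 exceeds the fence.
example_rates <- c(1, 2, 3, 4, 100)
print(count_high_outliers(example_rates))

# With the lab's data:
# sapply(Boston[, c("crim", "tax", "ptratio")], count_high_outliers)
```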

e) How many of the census tracts in this data set bound the Charles river?

To find out how many census tracts bound the Charles River, I sum the values of the chas column, a dummy variable equal to 1 if the tract bounds the river and 0 otherwise.

# Sum the values of the 'chas' column to count how many tracts bound the river
num_tracts_bounding_river <- sum(Boston$chas)

# Print the result
print(num_tracts_bounding_river)

## [1] 35

f) What is the median pupil-teacher ratio among the towns in this data set?

To find the median pupil-teacher ratio among the towns in the Boston Housing dataset, I take the median of the ptratio column, which records the pupil-teacher ratio by town.

# Calculate the median pupil-teacher ratio
median_ptratio <- median(Boston$ptratio)

# Print the result
print(median_ptratio)

## [1] 19.05

g) Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

To find the census tract with the lowest median value of owner-occupied homes, I identify the minimum value in the medv column of the Boston Housing dataset and then select the rows with that value to see the other predictors for those tracts.

# Find the minimum value of 'medv'
min_medv <- min(Boston$medv)

# Find the row which has the minimum 'medv' value
tract_with_min_medv <- Boston[Boston$medv == min_medv, ]

# Print the tract details
print(tract_with_min_medv)

##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5

Two tracts (rows 399 and 406) share the minimum medv of 5. By comparing their values with the overall summary statistics, we can comment on how these tracts compare to the general trends in the dataset.

# Get summary statistics for each column to compare
summary_statistics <- summary(Boston)

# Print the summary statistics
print(summary_statistics)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

Summary and Interpretation:

Both tracts show extreme values in several predictors compared to the overall dataset. Their crime rates (38.35 and 67.92) are far above the third quartile of 3.68; age (100) and rad (24) sit at the dataset maximum; dis (about 1.4 to 1.5) is close to the minimum of 1.13, meaning the tracts are near employment centers; rm (5.45 and 5.68) is below the first quartile of 5.886; and lstat (30.59 and 22.98) is well above the third quartile of 16.95. indus (18.1), tax (666), and ptratio (20.2) all equal their third-quartile values, and zn is 0.

These characteristics are consistent with the extremely low median home value in these tracts. The combination of high crime, industrial land use, and old housing stock points towards these areas being less desirable for residential purposes, which is reflected in the low property values.

The characteristics of these tracts differ markedly from the median and mean values across the Boston area, indicating that these tracts are outliers with specific challenges.
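The comparison against the overall ranges can also be expressed as percentile ranks: for each predictor, the share of tracts whose value is at or below the value in the tract of interest. A minimal sketch, where `percentile_of_row` is a hypothetical helper and the toy data are made up:

```r
# For one row, report the share of rows whose value is <= that row's value,
# column by column. 'percentile_of_row' is a hypothetical helper.
percentile_of_row <- function(df, row_index) {
  sapply(names(df), function(col) {
    mean(df[[col]] <= df[[col]][row_index], na.rm = TRUE)
  })
}

# Made-up toy data to show the behaviour.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1))
print(percentile_of_row(toy, 1))

# With the lab's data, row names "399" and "406" identify the two tracts:
# percentile_of_row(Boston, which(rownames(Boston) == "399"))
```

A value near 1 means the tract is at the top of the distribution for that predictor, and a value near 0 means it is at the bottom.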

h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

To determine the number of census tracts in the Boston Housing dataset that average more than seven and more than eight rooms per dwelling (rm), I will count the number of entries in the rm column that exceed these thresholds.

# Count the number of tracts with more than seven rooms per dwelling
num_tracts_more_than_seven <- sum(Boston$rm > 7)

# Count the number of tracts with more than eight rooms per dwelling
num_tracts_more_than_eight <- sum(Boston$rm > 8)

# Print the results
print(paste("Number of tracts with more than seven rooms per dwelling:", num_tracts_more_than_seven))

## [1] "Number of tracts with more than seven rooms per dwelling: 64"

print(paste("Number of tracts with more than eight rooms per dwelling:", num_tracts_more_than_eight))

## [1] "Number of tracts with more than eight rooms per dwelling: 13"

Comment on Census Tracts with More than Eight Rooms per Dwelling:

Census tracts with more than eight rooms per dwelling are likely to represent more affluent areas. Larger homes are typically associated with higher property values. These tracts might have lower population density, given the larger average dwelling size. The characteristics of these tracts, such as crime rate, proximity to amenities, and school quality, could differ significantly from those with smaller average dwelling sizes. It’s also possible that these areas have historical or architectural significance, which can be a factor in larger average dwelling sizes.
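These expectations can be checked against the data by comparing the tracts that average more than eight rooms with the dataset as a whole. A sketch under assumptions: `compare_medians` is a hypothetical helper, and the toy values are made up and only mimic two Boston column names.

```r
# Compare medians of a subset of rows against the whole data frame.
# 'compare_medians' is a hypothetical helper.
compare_medians <- function(df, keep) {
  data.frame(
    Predictor = names(df),
    Subset = sapply(df[keep, , drop = FALSE], median, na.rm = TRUE),
    Overall = sapply(df, median, na.rm = TRUE),
    row.names = NULL
  )
}

# Made-up toy data: 'rm' and 'medv' mimic the Boston column names only.
toy <- data.frame(rm = c(5.9, 6.2, 8.1, 8.4), medv = c(18, 21, 44, 50))
print(compare_medians(toy, toy$rm > 8))

# With the lab's data: compare_medians(Boston, Boston$rm > 8)
```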

DA Lab 1

Sai Dheeraj Kanaparthi

2024-01-22