Lab1

9.a)

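library(ISLR2)  # assumed package providing Auto and Boston (the 13-column Boston used later matches ISLR2)
Auto <- na.omit(Auto)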

Quantitative    Qualitative
mpg             cylinders
displacement    origin
horsepower      name
weight
acceleration
year
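
The split above can also be approximated programmatically. The following is a minimal sketch on a made-up toy data frame (not the real Auto data); the helper `classify_columns` and its `max_levels` threshold are hypothetical, and the threshold of 5 distinct values is an arbitrary choice for illustration.

```r
# Toy data frame standing in for Auto (values are made up for illustration)
toy <- data.frame(
  mpg       = c(18.0, 15.5, 26.0, 31.2, 22.4, 19.8),
  cylinders = c(8, 8, 4, 4, 6, 6),
  origin    = c(1, 1, 2, 3, 1, 2),
  name      = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a column is flagged qualitative when it is
# non-numeric or takes at most `max_levels` distinct values
classify_columns <- function(df, max_levels = 5) {
  vapply(df, function(x) {
    if (!is.numeric(x) || length(unique(x)) <= max_levels) {
      "qualitative"
    } else {
      "quantitative"
    }
  }, character(1))
}

col_types <- classify_columns(toy)
print(col_types)
```

A rule like this is only a starting point: a numeric column with few distinct values still needs a judgment call about whether it is a code or a count.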

9.b)

auto_df <- na.omit(Auto)

# Numeric columns of auto_df (is.numeric also selects cylinders and origin,
# which 9.a lists as qualitative)
quantitative_predictors <- names(auto_df)[sapply(auto_df, is.numeric)]

# Range data frame
range_df <- data.frame(
  Predictor = quantitative_predictors,
  Minimum = NA,
  Maximum = NA,
  Range = NA
)

# Calculate the minimum, maximum, and range of each predictor
for (i in seq_len(nrow(range_df))) {
  col <- range_df$Predictor[i]
  rng <- range(auto_df[, col])  # named rng so base::range() is not masked
  range_df[i, 2:4] <- c(rng[1], rng[2], rng[2] - rng[1])
}

print(range_df)

##      Predictor Minimum Maximum  Range
## 1          mpg       9    46.6   37.6
## 2    cylinders       3     8.0    5.0
## 3 displacement      68   455.0  387.0
## 4   horsepower      46   230.0  184.0
## 5       weight    1613  5140.0 3527.0
## 6 acceleration       8    24.8   16.8
## 7         year      70    82.0   12.0
## 8       origin       1     3.0    2.0

9.c)

# Data frame initialization
summary_df <- data.frame(
  Predictor = quantitative_predictors,
  Mean = NA,
  SD = NA
)

# Calculating mean and standard deviation
for (i in 1:nrow(summary_df)) {
  col <- summary_df$Predictor[i]
  summary_df[i, 2:3] <- c(mean(auto_df[, col]), sd(auto_df[, col]))
}

print(summary_df)

##      Predictor        Mean          SD
## 1          mpg   23.445918   7.8050075
## 2    cylinders    5.471939   1.7057832
## 3 displacement  194.411990 104.6440039
## 4   horsepower  104.469388  38.4911599
## 5       weight 2977.584184 849.4025600
## 6 acceleration   15.541327   2.7588641
## 7         year   75.979592   3.6837365
## 8       origin    1.576531   0.8055182
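
The loops in 9.b and 9.c can also be written without explicit indexing. This is a sketch of one vectorised alternative using `sapply`, shown on a made-up toy data frame rather than on `auto_df`; the helper name `describe_numeric` is hypothetical.

```r
# Toy data (made up); in the lab this would be auto_df
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(10, 20, 20, 30, 70)
)

# Hypothetical helper: one row of summary statistics per numeric column
describe_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]  # keep numeric columns only
  data.frame(
    Predictor = names(num),
    Minimum   = sapply(num, min),
    Maximum   = sapply(num, max),
    Range     = sapply(num, function(v) diff(range(v))),
    Mean      = sapply(num, mean),
    SD        = sapply(num, sd),
    row.names = NULL
  )
}

stats <- describe_numeric(toy)
print(stats)
```

Calling `describe_numeric(auto_df)` should give the same columns as the two loops combined, assuming `auto_df` is defined as above.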

9.d)

# Excluding observations 10 to 85
subset_data <- auto_df[-(10:85), ]

# Identify quantitative predictors within the subset
quantitative_predictors <- names(subset_data)[sapply(subset_data, is.numeric)]

# Initializing data frame
summary_df <- data.frame(
  Predictor = quantitative_predictors,
  Minimum = NA,
  Maximum = NA,
  Range = NA,
  Mean = NA,
  SD = NA
)

# Calculating range, mean, and standard deviation
for (i in seq_len(nrow(summary_df))) {
  col <- summary_df$Predictor[i]
  rng <- range(subset_data[, col])  # named rng so base::range() is not masked
  summary_df[i, 2:4] <- c(rng[1], rng[2], rng[2] - rng[1])
  summary_df[i, 5:6] <- c(mean(subset_data[, col]), sd(subset_data[, col]))
}

print(summary_df)

##      Predictor Minimum Maximum  Range        Mean         SD
## 1          mpg    11.0    46.6   35.6   24.404430   7.867283
## 2    cylinders     3.0     8.0    5.0    5.373418   1.654179
## 3 displacement    68.0   455.0  387.0  187.240506  99.678367
## 4   horsepower    46.0   230.0  184.0  100.721519  35.708853
## 5       weight  1649.0  4997.0 3348.0 2935.971519 811.300208
## 6 acceleration     8.5    24.8   16.3   15.726899   2.693721
## 7         year    70.0    82.0   12.0   77.145570   3.106217
## 8       origin     1.0     3.0    2.0    1.601266   0.819910
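
The row exclusion in 9.d relies on negative indexing. As a small check on a made-up 100-row data frame (not the Auto data), the sketch below confirms how many rows `-(10:85)` removes and which ones remain.

```r
# Toy data frame with 100 rows (values are made up)
toy <- data.frame(id = 1:100, value = seq(0.5, 50, by = 0.5))

# The parentheses matter: -(10:85) negates every index from 10 to 85,
# whereas -10:85 would expand to the sequence from -10 up to 85
kept <- toy[-(10:85), ]

n_dropped <- nrow(toy) - nrow(kept)
print(n_dropped)
print(kept$id)
```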

9.e)

pairs(Auto)

plot(Auto$mpg, Auto$displacement) #relationship between fuel efficiency (mpg) and engine displacement

plot(Auto$horsepower, Auto$mpg) #relationship between horsepower and fuel efficiency (mpg)

plot(Auto$cylinders, Auto$mpg) #relationship between the number of cylinders and fuel efficiency (mpg)

plot(Auto$acceleration, Auto$mpg) #relationship between acceleration and fuel efficiency (mpg)

Fuel efficiency (mpg) and engine displacement

The graph shows a negative correlation between fuel efficiency (mpg) and engine displacement: cars with larger engines tend to have lower fuel efficiency.

Horsepower and fuel efficiency

The graph shows a negative correlation between horsepower and fuel efficiency. This indicates that cars with higher horsepower tend to have lower fuel efficiency.

Number of cylinders and fuel efficiency

The graph shows a negative correlation between the number of cylinders and fuel efficiency. This means cars with more cylinders usually have lower fuel efficiency.

Acceleration and fuel efficiency

The graph shows a positive correlation between acceleration and fuel efficiency. In Auto, acceleration is the time in seconds to go from 0 to 60 mph, so larger values mean slower cars. Cars that accelerate faster (lower values) therefore tend to get worse gas mileage: accelerating a car quickly takes more energy, and that energy comes from burning fuel.

9.f)

cor(Auto$displacement, Auto$mpg)

## [1] -0.8051269

cor(Auto$horsepower, Auto$mpg)

## [1] -0.7784268

cor(Auto$year, Auto$displacement)

## [1] -0.3698552

cor(Auto$weight, Auto$mpg)

## [1] -0.8322442

cor(Auto$cylinders, Auto$mpg)

## [1] -0.7776175

cor(Auto$origin, Auto$mpg)

## [1] 0.5652088

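Based on the correlations above, weight, displacement, horsepower, and cylinders are all strongly negatively correlated with mpg and can be used as predictors of mpg; origin shows a moderate positive correlation. The year correlation computed above is with displacement rather than mpg, so it does not by itself support year as a predictor.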
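
Rather than calling `cor()` pair by pair, every numeric column can be correlated with the target at once. The sketch below does this on a made-up toy data frame (not the Auto data); the helper `cor_with_target` is hypothetical and orders the results by absolute correlation.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  mpg    = c(30, 28, 25, 20, 18, 15),
  weight = c(2000, 2200, 2600, 3200, 3500, 4000),
  accel  = c(19, 17, 18, 15, 14, 12)
)

# Hypothetical helper: correlation of each numeric column with `target`,
# strongest (by absolute value) first
cor_with_target <- function(df, target) {
  num <- df[sapply(df, is.numeric)]
  r <- cor(num)[, target]
  r <- r[names(r) != target]          # drop the target's correlation with itself
  r[order(abs(r), decreasing = TRUE)]
}

r_mpg <- cor_with_target(toy, "mpg")
print(round(r_mpg, 3))
```

Applied to the lab data, `cor_with_target(Auto, "mpg")` would also cover year and acceleration, which are not correlated with mpg in the calls above.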

10.a)

?Boston

## starting httpd help server ... done

dim(Boston) # Number of rows and Columns

## [1] 506  13

colnames(Boston) # Name of columns

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "lstat"   "medv"

There are 506 rows and 13 columns in this dataset.
The columns in the dataset are:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.

10.b)

pairs(Boston)

plot(Boston$rm, Boston$medv) # Relation between housing prices and the number of rooms

plot(Boston$ptratio, Boston$medv) # Relationship between Pupil-teacher ratio and housing prices

plot(Boston$dis, Boston$medv) # Relationship between distance to employment centres and housing prices

plot(Boston$crim, Boston$lstat) # Relationship between crime rate and lower socioeconomic status

plot(Boston$nox, Boston$tax) # Relationship between pollution and tax rates

Housing prices and the number of rooms

The graph shows a positive correlation. This means that as the number of rooms in a house increases, the price of the house also tends to increase.

Pupil-teacher ratio by town and housing prices

The graph shows a weak negative correlation, which means that towns with a higher pupil-teacher ratio tend to have lower median housing prices.

Distance to employment centres and housing prices

The graph shows a weak positive relationship between dis and medv. Because dis measures distance to the employment centres, tracts closer to jobs (smaller dis) tend to have lower median housing prices.

Pollution and tax rates

The graph shows a positive correlation, meaning that areas with higher air pollution levels tend to have higher property tax rates.

10.c)

plot(Boston$medv, Boston$crim) # Relationship between median home value and crime rate

Crime rate and median value of owner-occupied homes

The graph shows a negative relationship: towns with lower median home values tend to have higher crime rates. However, a few towns with the most expensive homes also have elevated crime rates.

10.d)

summary(Boston$crim)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

summary(Boston$tax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

library(ggplot2)  # provides ggplot(), geom_histogram(), and theme_minimal()

# Crime Rates
ggplot(Boston, aes(x = crim)) +
  geom_histogram(bins = 50, fill = "blue", color = "black") +
  theme_minimal() +
  ggtitle("Distribution of Crime Rates")

# Tax Rates
ggplot(Boston, aes(x = tax)) +
  geom_histogram(bins = 50, fill = "red", color = "black") +
  theme_minimal() +
  ggtitle("Distribution of Tax Rates")

# Pupil-Teacher Ratios
ggplot(Boston, aes(x = ptratio)) +
  geom_histogram(bins = 50, fill = "green", color = "black") +
  theme_minimal() +
  ggtitle("Distribution of Pupil-Teacher Ratios")

Crime Rates

The histogram shows a strongly right-skewed distribution of crime rates. Most tracts have low crime rates (median 0.26, third quartile 3.68), but a few reach very high values (maximum 88.98).

Tax Rates

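The histogram indicates that tax rates cluster around two levels: a main group at lower rates (median 330) and a second group at the top of the range (the third quartile is already 666, against a maximum of 711).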

Pupil-Teacher Ratios

The distribution is slightly left-skewed (the mean of 18.46 sits below the median of 19.05), with most tracts having relatively high pupil-teacher ratios.
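
Skew and unusually high values can also be checked numerically instead of only by eye. The sketch below uses made-up right-skewed values (not the Boston data) and the common 1.5 × IQR rule, which is an assumption of this example and not something the lab itself applies.

```r
# Made-up right-skewed values standing in for a column such as crim
x <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.0, 9.0, 40.0)

# Upper fence: third quartile plus 1.5 times the interquartile range
q <- quantile(x, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Values above the fence are flagged as unusually high
high <- x[x > upper_fence]

# A mean well above the median is a quick sign of right skew
mean_exceeds_median <- mean(x) > median(x)

print(high)
print(mean_exceeds_median)
```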

10.e)

number_of_tracts <- sum(Boston$chas)
print(paste("Number of tracts that bound the Charles River:", number_of_tracts))

## [1] "Number of tracts that bound the Charles River: 35"

10.f)

median_ptratio <- median(Boston$ptratio)
print(paste("Median pupil-teacher ratio:", median_ptratio))

## [1] "Median pupil-teacher ratio: 19.05"

10.g)

lowest_medv_tract <- Boston[Boston$medv == min(Boston$medv), ]
print(lowest_medv_tract)

##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5

Two census tracts share the lowest median value of owner-occupied homes (medv = 5) in the dataset: tracts 399 and 406.

overall_ranges <- summary(Boston)
print(overall_ranges)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

Tract 399 compares with the overall ranges as follows (tract 406 has the same values except crim 67.9208, rm 5.683, dis 1.4254, and lstat 22.98):
Crime rate: 38.3518, far above the mean of 3.61 and the third quartile of 3.68.
Residential land zoned: 0.0, so no land is zoned for large residential lots.
Industrial proportion: 18.1, at the third quartile, indicating a high proportion of non-retail business acres.
Charles River: 0, so the tract does not bound the Charles River.
Nitrogen oxides concentration: 0.693, above the third quartile of 0.624 and towards the upper end of its range.
Average number of rooms: 5.453, below the first quartile of 5.886, indicating smaller dwellings.
Proportion of older buildings: 100, so all owner-occupied units were built before 1940.
Distance to employment centers: 1.4896, below the first quartile of 2.100, so relatively close to employment centers.
Accessibility to highways: 24, at the maximum of its range.
Tax rate: 666, at the third quartile and towards the upper end of the range.
Pupil-teacher ratio: 20.2, at the third quartile and towards the upper end of the range.
Lower status population: 30.59, well above the third quartile of 16.95, indicating a higher proportion of lower status population.
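
Statements such as "towards the upper end of its range" can be made precise with the empirical distribution function. The sketch below, on a made-up toy data frame (not the Boston data), reports for one row the share of values in each column that are less than or equal to that row's value; the helper `percentile_of_row` is hypothetical and assumes all columns are numeric.

```r
# Toy data (made up); the last row plays the role of the lowest-medv tract
toy <- data.frame(
  crim = c(0.1, 0.2, 0.5, 1.0, 38.0),
  medv = c(30, 25, 22, 18, 5)
)

# Hypothetical helper: for each column, the fraction of values <= the
# value found in the chosen row
percentile_of_row <- function(df, row) {
  mapply(function(column, value) ecdf(column)(value), df, df[row, ])
}

p <- percentile_of_row(toy, 5)
print(p)
```

A result near 1 means the row sits at the top of that column's distribution, and a result near 0 means it sits at the bottom.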

10.h)

more_than_seven_rooms <- sum(Boston$rm > 7)

more_than_eight_rooms <- sum(Boston$rm > 8)

print(paste("Number of tracts with more than 7 rooms:", more_than_seven_rooms))

## [1] "Number of tracts with more than 7 rooms: 64"

print(paste("Number of tracts with more than 8 rooms:", more_than_eight_rooms))

## [1] "Number of tracts with more than 8 rooms: 13"

tracts_more_than_eight_rooms <- Boston[Boston$rm > 8, ]
summary(tracts_more_than_eight_rooms)

##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          lstat           medv     
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
##  Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0

Low Crime Rates: The average crime rate in these tracts is relatively low (mean ≈ 0.72, against 3.61 for the whole dataset).
Pupil-Teacher Ratios: The mean pupil-teacher ratio is 16.36, below the overall mean of 18.46, which suggests better educational resources.
Lower Status Population: The lower status share of the population (lstat) is low (mean 4.31, against 12.65 overall), indicating these are likely affluent areas.
High Median Home Values: Home values are very high (median 48.3, against 21.2 overall), at the upper end of the dataset’s range, whose maximum is 50.
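
The comparison between the large-dwelling tracts and the full dataset can be put side by side in one table. This is a sketch on a made-up toy data frame (not the Boston data) using `colMeans`; the column names of the `comparison` table are chosen for illustration only.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  rm   = c(5.5, 6.0, 6.5, 7.2, 8.1, 8.4),
  medv = c(15, 20, 22, 35, 48, 50)
)

# Rows averaging more than eight rooms per dwelling
big <- toy[toy$rm > 8, ]

# Subset mean, overall mean, and their difference for each column
comparison <- data.frame(
  Variable    = names(toy),
  SubsetMean  = colMeans(big),
  OverallMean = colMeans(toy),
  row.names   = NULL
)
comparison$Difference <- comparison$SubsetMean - comparison$OverallMean

print(comparison)
```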

2024-01-25