library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR2)
Auto <- na.omit(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
class(Auto$name)
## [1] "factor"
range(Auto$origin)
## [1] 1 3
Ranges of the quantitative predictors: mpg 9.0 to 46.6; cylinders 3 to 8; displacement 68 to 455; horsepower 46 to 230; weight 1613 to 5140; acceleration 8.0 to 24.8; year 70 to 82; origin 1 to 3.
sd(Auto$origin)
## [1] 0.8055182
Mean and standard deviation (SD) of each quantitative predictor:
mpg: mean 23.44, SD 7.8
cylinders: mean 5.4, SD 1.7
displacement: mean 194, SD 104.6
horsepower: mean 104.46, SD 38.49
weight: mean 2977.5, SD 849.4
acceleration: mean 15.5, SD 2.76
year: mean 75.9, SD 3.68
origin: mean 1.57, SD 0.8
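The ranges, means, and standard deviations listed above can be computed for all columns in one pass instead of one call per variable. This is a minimal sketch, assuming the first eight columns of `Auto` are the ones treated as quantitative here; the names `quant_vars` and `auto_summary` are illustrative and not from the original analysis.

```r
library(ISLR2)

Auto <- na.omit(Auto)

# Assumption: columns 1 to 8 (mpg through origin) are treated as quantitative;
# the ninth column, name, is a factor and is excluded.
quant_vars <- Auto[, 1:8]

# One row per variable with its minimum, maximum, mean, and standard deviation.
auto_summary <- data.frame(
  min  = sapply(quant_vars, min),
  max  = sapply(quant_vars, max),
  mean = sapply(quant_vars, mean),
  sd   = sapply(quant_vars, sd)
)
round(auto_summary, 2)
```

Rounding is applied only when printing, so `auto_summary` keeps full precision for later use.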
Auto2 <- Auto[-(10:84),]
sd(Auto2$origin)
## [1] 0.8193079
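After removing rows 10 through 84, the range, mean, and standard deviation (SD) of each predictor are:
mpg: 11.0 to 46.6, mean 24.368, SD 7.88
cylinders: 3 to 8, mean 5.38, SD 1.65
displacement: 68 to 455, mean 187.75, SD 99.9
horsepower: 46 to 230, mean 100.95, SD 35.89
weight: 1649 to 4997, mean 2939.644, SD 812.64
acceleration: 8.5 to 24.8, mean 15.7, SD 2.69
year: 70 to 82, mean 77.13, SD 3.11
origin: 1 to 3, mean 1.599, SD 0.81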
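To see how dropping rows shifts each statistic, the full data and the reduced data can be summarised with the same helper and then subtracted. This is a sketch, assuming the same row range `10:84` used above; `summarise_cols` is a hypothetical helper written for this example, not a function from any package.

```r
library(ISLR2)

Auto <- na.omit(Auto)
Auto2 <- Auto[-(10:84), ]  # same subset as in the text

# Hypothetical helper: min, max, mean, and SD for each column of a data frame.
summarise_cols <- function(df) {
  t(sapply(df, function(x) c(min = min(x), max = max(x),
                             mean = mean(x), sd = sd(x))))
}

full_stats   <- summarise_cols(Auto[, 1:8])
subset_stats <- summarise_cols(Auto2[, 1:8])

# Change in each statistic after removing the rows (subset minus full).
round(subset_stats - full_stats, 2)
```

A positive entry means the statistic is larger in the reduced data; for example, a positive change in mean `year` would indicate that the removed rows were mostly older cars.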
plot(Auto$horsepower,Auto$mpg, xlab = 'hp', ylab = 'mpg')
Cars with higher horsepower tend to have lower mpg; the plot shows a clear negative trend.
plot(Auto$acceleration,Auto$weight)
Vehicles with higher acceleration values (in this dataset, the time in seconds to go from 0 to 60 mph) tend to have lower weights.
plot(Auto$acceleration, Auto$mpg)
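The trends read off the scatterplots can be checked numerically with Pearson correlations. The following is a small sketch using base R's `cor()`; correlation only captures the linear part of each relationship, so it complements the plots and does not replace them. The variable names are illustrative.

```r
library(ISLR2)

Auto <- na.omit(Auto)

# Pearson correlation for each pair plotted above.
# A negative value means one variable tends to fall as the other rises.
cor_hp_mpg     <- cor(Auto$horsepower, Auto$mpg)
cor_acc_weight <- cor(Auto$acceleration, Auto$weight)
cor_acc_mpg    <- cor(Auto$acceleration, Auto$mpg)

round(c(hp_mpg = cor_hp_mpg,
        acc_weight = cor_acc_weight,
        acc_mpg = cor_acc_mpg), 3)
```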
?Boston #understanding data variables
b.
#creating some pairwise scatterplots to explore relationships b/w predictors
pairs(~crim + age + dis, data = Boston)
pairs(~ chas + lstat + medv, data = Boston)
pairs(~ rm + tax + medv, data = Boston)
pairs(~ nox + rad + lstat, data = Boston)
From the pairwise scatterplots, I noticed a couple of trends and relationships. First, there is a negative relationship between medv and lstat: suburbs with higher median home values have a smaller share of lower-status residents. Second, a smaller weighted mean distance to employment centers (dis) is associated with a larger proportion of homes built before 1940 (age). Suburbs with older homes also tend to have higher per capita crime rates. Homes with more rooms (rm) tend to have higher median values. Lastly, suburbs with a smaller share of lower-status residents tend to have lower nitrogen oxide concentrations (nox).
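The relationships described here can be cross-checked with a correlation matrix restricted to the variables mentioned. This is a sketch, assuming the ISLR2 version of `Boston`; the `vars` selection is an illustrative choice, and the sign of each entry indicates the direction of the linear association.

```r
library(ISLR2)

# Variables named in the discussion of the scatterplots.
vars <- c("crim", "age", "dis", "rm", "nox", "lstat", "medv")

# Pairwise Pearson correlations, rounded for readability.
boston_cor <- round(cor(Boston[, vars]), 2)
boston_cor
```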
c. From my plots, I noticed that suburbs with a larger proportion of older homes were associated with higher per capita crime rates.
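One way to look beyond the variables that happened to be plotted is to rank every other variable by its correlation with `crim`. This is a sketch under the assumption that linear correlation is an adequate first screen; `cor_mat` and `crim_cor` are illustrative names.

```r
library(ISLR2)

# Correlation of every other variable with per capita crime rate,
# sorted from most negative to most positive.
cor_mat <- cor(Boston)
crim_cor <- sort(cor_mat["crim", colnames(cor_mat) != "crim"])
round(crim_cor, 2)
```

Variables at either end of the sorted vector are the strongest candidates for a closer look with a scatterplot.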
d.
hist(Boston$crim, xlim = c(0,1), breaks = 2000)
range(Boston$crim)
## [1] 0.00632 88.97620
Even though the crime rate ranges from 0.006 to 88.98, the histogram shows the values concentrated near zero, peaking around 0.01 to 0.02. The distribution is strongly right-skewed, which tells us that a few outlier suburbs have very high crime rates.
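The skew visible in the histogram can be put into numbers with quantiles and a simple count. This is a minimal sketch; the cutoff of 20 is an assumed threshold chosen only for illustration, not a value from the original analysis.

```r
library(ISLR2)

# Quartiles of per capita crime rate: the gap between the median and
# the maximum shows how long the right tail is.
crim_quantiles <- quantile(Boston$crim)
crim_quantiles

# Number of suburbs above an illustrative cutoff of 20 (assumed threshold).
n_high_crim <- sum(Boston$crim > 20)
n_high_crim
```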
range(Boston$tax)
## [1] 187 711
hist(Boston$tax)
From the range and histogram of tax rates, we can see that the suburbs form two groups: one with high tax rates, which appears frequently in the data, and another with lower tax rates. Overall, more suburbs fall in the lower-tax group.
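The two groups can be counted directly by splitting the suburbs at a cutoff. This sketch assumes a cutoff of 600, chosen because it separates the two clusters of tax values; the cutoff and the names `tax_group` and `tax_counts` are illustrative.

```r
library(ISLR2)

# Split suburbs at an assumed cutoff of 600 and count each group.
tax_group <- ifelse(Boston$tax >= 600, "high", "low")
tax_counts <- table(tax_group)
tax_counts
```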
range(Boston$ptratio)
## [1] 12.6 22.0
hist(Boston$ptratio)
Many suburbs have high pupil-teacher ratios of 20 or more, but the ratio spans a fairly wide range, from 12.6 to 22.0. The majority of suburbs still have a pupil-teacher ratio under 20.
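The split around a ratio of 20 can be checked by computing the share of suburbs on each side. This is a short sketch using the same threshold of 20 mentioned in the text; the variable names are illustrative.

```r
library(ISLR2)

# Share of suburbs with a pupil-teacher ratio below 20 versus 20 or above.
share_below_20 <- mean(Boston$ptratio < 20)
share_20_plus  <- mean(Boston$ptratio >= 20)
round(c(below_20 = share_below_20, at_least_20 = share_20_plus), 3)
```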
e. How many suburbs bound the Charles River?
There are only 35 suburbs that bound the river.
Boston |>
  filter(chas == 1) |>
  count()
## n
## 1 35
f.
median(Boston$ptratio)
## [1] 19.05
The median pupil-teacher ratio is 19.05.
g.
lwst_medv <- min(Boston$medv)
lwst_medv
## [1] 5
Boston |>
  filter(medv == lwst_medv)  # use the stored minimum instead of a hard-coded 5
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
## 2 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 22.98 5
I found two suburbs that share the lowest median home value. Both have high per capita crime rates (38.35 and 67.92). All of their owner-occupied homes were built before 1940 (age = 100). Their tax rate is 666 per $10,000, and about 30.6 and 23.0 percent of their populations are classified as lower status.
Compared with the overall ranges, the crime rate spans 0.00632 to 88.976 and the tax rate spans 187 to 711, so both suburbs sit toward the high end of these predictors.
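Comparing against the minimum and maximum alone can be misleading when a distribution is skewed, so a percentile rank gives a sharper picture of where these two suburbs fall. This is a sketch using base R's `ecdf()`; `pct_rank` is a hypothetical helper, and the list of variables is an illustrative selection.

```r
library(ISLR2)

# The suburbs with the lowest median home value.
lowest <- Boston[Boston$medv == min(Boston$medv), ]

# Hypothetical helper: fraction of all suburbs whose value of `var`
# is at or below each value found in the lowest-medv suburbs.
pct_rank <- function(var) ecdf(Boston[[var]])(lowest[[var]])

# One column per variable, one row per lowest-medv suburb.
ranks <- sapply(c("crim", "age", "tax", "ptratio", "lstat"), pct_rank)
round(ranks, 2)
```

A value near 1 means the suburb is at or close to the top of the distribution for that variable.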
h.
Boston |>
  filter(rm > 8) |>
  pull(lstat) |>
  mean()
## [1] 4.31
There are 64 suburbs that average more than 7 rooms per dwelling. Raising the threshold to more than 8 rooms reduces that number to just 13 suburbs.
In those 13 suburbs, the mean of the median home values is 44.2. The mean lstat is just 4.31, which means that on average only about 4 percent of the population there is classified as lower status.
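The counts and the mean median home value quoted here are not produced by the code shown, so the following sketch reproduces them in base R. The variable names are illustrative; the thresholds of 7 and 8 rooms are the ones used in the text.

```r
library(ISLR2)

# Count suburbs averaging more than 7 and more than 8 rooms per dwelling.
n_over_7 <- sum(Boston$rm > 7)
n_over_8 <- sum(Boston$rm > 8)

# Mean median home value and mean lstat among the rm > 8 suburbs.
over_8 <- Boston[Boston$rm > 8, ]
mean_medv_over_8  <- mean(over_8$medv)
mean_lstat_over_8 <- mean(over_8$lstat)

c(n_over_7 = n_over_7, n_over_8 = n_over_8,
  mean_medv = mean_medv_over_8, mean_lstat = mean_lstat_over_8)
```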