Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
This is a regression problem because the variable of interest, salary, is quantitative. We are performing inference, as we are more interested in the relationship between the predictors and salary than in prediction. Here n = 500 (the number of firms) and p = 3 (the predictors recorded for each firm).
This is a classification problem, since the outcome is whether a product will be a success or a failure, and the goal is prediction, as we are more interested in the outcome itself. Here n = 20 (the number of previous products) and p = 13 (the variables recorded for each product).
This is a regression problem, since the response is the % change, and the goal is prediction, as we are interested in the outcome. Here n = 52 (one observation for each week of 2012) and p = 3 (one predictor for each of the three markets recorded).
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
A flexible approach will have less bias and can perform better than a less flexible model, but it will have more variance, meaning the fit depends more heavily on the particular sample provided, and this may lead to overfitting. More flexible approaches are also more complicated and more computationally expensive than less flexible models.
A flexible approach may be preferred when dealing with a non-linear relationship that cannot be adequately captured by less flexible methods, when the goal is to maximize performance of prediction, or when simplicity of the approach is not a limiting factor. This can also be preferred when there is a large sample size with relatively few predictors.
A less flexible approach may be preferred when dealing with more linear relationships, when inference and understanding the relationship between variables is important, or when simplicity of the approach is desired. This can also be preferred when there is a small sample size or many predictors.
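The trade-off described above can be illustrated with a small simulation. This is only a sketch on made-up data: the true function, noise level, sample sizes, and polynomial degrees are all assumptions chosen for illustration. Training error can only fall as the degree grows, while test error typically stops improving once the extra flexibility starts fitting noise.

```r
# Simulated data: a smooth non-linear truth plus noise (all values are made up).
set.seed(1)
f <- function(x) sin(2 * x)
train <- data.frame(x = runif(50, 0, 3))
train$y <- f(train$x) + rnorm(50, sd = 0.3)
test <- data.frame(x = runif(1000, 0, 3))
test$y <- f(test$x) + rnorm(1000, sd = 0.3)

# Fit polynomials of increasing degree, i.e. increasing flexibility.
degrees <- c(1, 3, 10)
mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - fitted(fit))^2),
    test  = mean((test$y - predict(fit, newdata = test))^2))
})
colnames(mse) <- paste0("degree_", degrees)
mse
```

Comparing the two rows of `mse` shows how training and test error move as flexibility increases for this particular simulated sample.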
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
A parametric approach assumes a functional form for the relationship in the data, which reduces the problem to estimating a fixed set of parameters, while a non-parametric approach makes no such assumption. Linear regression, for example, is a parametric approach that assumes a linear relationship in the data. A non-parametric approach like K-nearest neighbors does not assume any particular functional form.
A parametric approach can make problems significantly simpler to solve, which decreases computational cost and makes the result much more interpretable. The main disadvantage is that the assumptions made are often not true, so a parametric model can perform worse than a non-parametric approach when the assumed form is far from the truth.
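As a rough illustration of the contrast, the sketch below fits a linear model and a hand-written K-nearest-neighbors regression to the same simulated data. The data, the helper `knn.predict`, and the choice k = 5 are all assumptions made for this example, not part of the exercise.

```r
# Simulated non-linear data (made up for illustration).
set.seed(2)
x <- sort(runif(100, 0, 3))
y <- sin(2 * x) + rnorm(100, sd = 0.2)

# Parametric: assume f(x) = b0 + b1 * x and estimate just two coefficients.
lin.fit <- lm(y ~ x)
lin.pred <- fitted(lin.fit)

# Non-parametric: average the responses of the k nearest training points.
# knn.predict is a hypothetical helper written for this sketch.
knn.predict <- function(x0, x, y, k = 5) {
  nearest <- order(abs(x - x0))[1:k]
  mean(y[nearest])
}
knn.pred <- sapply(x, function(x0) knn.predict(x0, x, y, k = 5))

# Training error of each approach on this sample.
c(linear = mean((y - lin.pred)^2), knn = mean((y - knn.pred)^2))
```

The linear fit is summarized by two interpretable numbers, `coef(lin.fit)`, whereas the KNN fit has no comparable summary; on this curved signal the linear assumption costs accuracy.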
This exercise relates to the College data set, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US. The variables are:
• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10 % of high school class
• Top25perc : New students from top 25 % of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor.
college <- read.csv('College.csv')
rownames(college) <- college[, 1]
# View(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try:
college <- college[,-1]
# View(college)
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
numeric.college <- college
numeric.college[,1] <- as.numeric(college[,1] == "Yes") # 1 = private, 0 = public
pairs(numeric.college[,1:10])
boxplot(Outstate ~ Private, data = college,
xlab = "Private",
ylab = "Outstate")
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college , Elite)
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
boxplot(Outstate ~ Elite, data = college,
xlab = "Elite",
ylab = "Outstate")
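The prompt above also asks for a count of elite universities via summary(). A minimal sketch of that step follows, using a short made-up vector in place of college$Top10perc so that it runs on its own; with the real data the equivalent call would be summary(college$Elite).

```r
# Hypothetical Top10perc values standing in for college$Top10perc.
top10 <- c(23, 16, 60, 89, 38, 51, 50)

# Same binning rule as above: "Yes" when more than 50% of new students
# come from the top 10% of their high school class.
elite <- rep("No", length(top10))
elite[top10 > 50] <- "Yes"
elite <- as.factor(elite)

# For a factor, summary() returns the number of observations per level.
summary(elite)
```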
par(mfrow = c(2,2))
hist(college$Apps)
hist(college$Top10perc)
hist(college$Outstate)
hist(college$Grad.Rate)
par(mfrow = c(2,2))
# look at acceptance rate
Acceptance.Rate <- college$Accept/college$Apps
college <- data.frame(college , Acceptance.Rate)
summary(college$Acceptance.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1545 0.6756 0.7788 0.7469 0.8485 1.0000
hist(college$Acceptance.Rate,
main='',
xlab='Acceptance Rate')
# boxplot of elite v acceptance rate
boxplot(Acceptance.Rate ~ Elite, data = college,
xlab = "Elite",
ylab = "Acceptance Rate")
# boxplots of elite v student faculty ratio
boxplot(S.F.Ratio ~ Elite, data = college,
xlab = "Elite",
ylab = "Student Faculty Ratio")
# boxplots of elite v applications
boxplot(Apps ~ Elite, data = college,
xlab = "Elite",
ylab = "Applications")
# do some writing on this
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
auto <- read.table('auto.data', header = TRUE)
head(auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130.0 3504 12.0 70 1
## 2 15 8 350 165.0 3693 11.5 70 1
## 3 18 8 318 150.0 3436 11.0 70 1
## 4 16 8 304 150.0 3433 12.0 70 1
## 5 17 8 302 140.0 3449 10.5 70 1
## 6 15 8 429 198.0 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
sum(is.na(auto)) # no outright missing values
## [1] 0
unique(auto$horsepower) # there is a "?" which marks a missing value; the rows containing it will be removed
## [1] "130.0" "165.0" "150.0" "140.0" "198.0" "220.0" "215.0" "225.0" "190.0"
## [10] "170.0" "160.0" "95.00" "97.00" "85.00" "88.00" "46.00" "87.00" "90.00"
## [19] "113.0" "200.0" "210.0" "193.0" "?" "100.0" "105.0" "175.0" "153.0"
## [28] "180.0" "110.0" "72.00" "86.00" "70.00" "76.00" "65.00" "69.00" "60.00"
## [37] "80.00" "54.00" "208.0" "155.0" "112.0" "92.00" "145.0" "137.0" "158.0"
## [46] "167.0" "94.00" "107.0" "230.0" "49.00" "75.00" "91.00" "122.0" "67.00"
## [55] "83.00" "78.00" "52.00" "61.00" "93.00" "148.0" "129.0" "96.00" "71.00"
## [64] "98.00" "115.0" "53.00" "81.00" "79.00" "120.0" "152.0" "102.0" "108.0"
## [73] "68.00" "58.00" "149.0" "89.00" "63.00" "48.00" "66.00" "139.0" "103.0"
## [82] "125.0" "133.0" "138.0" "135.0" "142.0" "77.00" "62.00" "132.0" "84.00"
## [91] "64.00" "74.00" "116.0" "82.00"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
auto <- auto %>%
filter(!grepl("\\?", horsepower)) # drop rows where horsepower is recorded as "?"
# unique(auto$horsepower)
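An alternative to filtering on the "?" string is to tell read.table() up front that "?" encodes a missing value. The sketch below shows this on a few made-up rows passed through the text argument; the rows and the object names raw and toy are assumptions for illustration only.

```r
# A few made-up rows in the same layout as the file, including one "?" entry.
raw <- "mpg horsepower weight
18.0 130.0 3504
25.0 ? 2046
15.0 165.0 3693"

# na.strings = "?" makes read.table() store "?" as NA,
# so horsepower is read as numeric rather than character.
toy <- read.table(text = raw, header = TRUE, na.strings = "?")
sum(is.na(toy))      # the missing value is now visible to is.na()

toy <- na.omit(toy)  # drop every row that contains an NA
str(toy)
```

With this approach the later as.numeric() conversions of horsepower would not be needed, since the column is numeric from the start.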
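The quantitative predictors are mpg, displacement, horsepower, weight, and acceleration. The qualitative predictors are origin and name. Cylinders and year take only a handful of discrete values, so they are also treated as qualitative here, although they could reasonably be analyzed as quantitative.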
auto.quant <- auto[,c(1,3,4,5,6)]
auto.quant$horsepower <- as.numeric(auto.quant$horsepower)
ranges <- sapply(auto.quant,range)
ranges
## mpg displacement horsepower weight acceleration
## [1,] 9.0 68 46 1613 8.0
## [2,] 46.6 455 230 5140 24.8
# mean and standard deviation of each quantitative predictor
results <- sapply(auto.quant, function(x) {
  c('Mean' = mean(x), 'Standard Deviation' = sd(x))
})
results
## mpg displacement horsepower weight acceleration
## Mean 23.445918 194.412 104.46939 2977.5842 15.541327
## Standard Deviation 7.805007 104.644 38.49116 849.4026 2.758864
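# remove the 10th through 85th observations
auto.d <- auto.quant[-c(10:85),]
# use new names so the base functions range(), mean() and sd() are not masked
range.d <- sapply(auto.d, range)
mean.d <- sapply(auto.d, mean)
sd.d <- sapply(auto.d, sd)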
## [1] Range:
## mpg displacement horsepower weight acceleration
## [1,] 11.0 68 46 1649 8.5
## [2,] 46.6 455 230 4997 24.8
## [1] Mean:
## mpg displacement horsepower weight acceleration
## 24.40443 187.24051 100.72152 2935.97152 15.72690
## [1] Standard Deviation:
## mpg displacement horsepower weight acceleration
## 7.867283 99.678367 35.708853 811.300208 2.693721
numeric.auto <- auto
numeric.auto$horsepower <- as.numeric(auto$horsepower)
pairs(numeric.auto[,-c(7:9)])
par(mfrow = c(2,2))
# use numeric.auto so that horsepower is plotted as a number, not a character string
plot(horsepower~displacement, data=numeric.auto)
plot(horsepower~acceleration, data=numeric.auto)
plot(horsepower~weight, data=numeric.auto)
plot(displacement~acceleration, data=numeric.auto)
plot(mpg~acceleration, data=numeric.auto)
plot(mpg~displacement, data=numeric.auto)
plot(mpg~weight, data=numeric.auto)
plot(mpg~horsepower, data=numeric.auto)
From the pairwise plots we can see several relationships between predictors. Horsepower has a positive, roughly linear relationship with both weight and displacement, and a negative relationship with acceleration. As discussed in the next question, mpg has a negative relationship with horsepower and weight, and a slightly weaker negative relationship with displacement.
Looking at the pairwise plots, mpg appears to have a negative, roughly linear relationship with horsepower and weight. It also has a negative relationship with displacement, although this one does not appear to be as strong. The scatterplots above show this: the plots of mpg against each of these predictors are roughly linear.
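Visual impressions like these can be backed up with a correlation matrix. The sketch below uses mtcars, a different car data set that ships with R, purely as a stand-in so the code runs without the Auto file; its numbers say nothing about the Auto data. With the data above, the analogous call would be cor() on the numeric columns of numeric.auto.

```r
# mtcars ships with R; it is a stand-in here, not the Auto data used above.
vars <- c("mpg", "disp", "hp", "wt")
cor.mat <- round(cor(mtcars[, vars]), 2)

# Correlation of mpg with displacement, horsepower and weight.
cor.mat["mpg", ]
```

Pearson correlation only measures linear association, so a curved scatterplot can still warrant a transformation even when the correlation is strong.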
This exercise involves the Boston housing data set.
library(ISLR2)
Now the data set is contained in the object Boston.
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
How many rows are in this data set? How many columns? What do the rows and columns represent?
dim(Boston)
## [1] 506 13
There are 506 rows in this dataset, each representing a census tract (suburb) of Boston, and 13 columns, each a variable recorded for those tracts, such as the crime rate and the median home value.
pairs(Boston)
From the scatterplots, a few relationships stand out. medv appears to be negatively related to lstat and positively related to rm. There is also a positive relationship between dis and zn and a negative relationship between dis and indus; this makes intuitive sense, as a high proportion of industry in a region likely indicates a closer distance to employment centers, while the opposite is likely for a region with more residential area. There also appears to be a positive relationship between nox and indus; this also makes sense, as higher levels of industry would likely increase the amount of pollution.
boston.crim <- lm(crim~., data=Boston)
summary(boston.crim)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
At the 0.05 significance level, the linear model shows a positive association between crim and both zn and rad, and a negative association between crim and both dis and medv, holding the other predictors fixed.
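Rather than reading the significant predictors off the printed table, they can be pulled out of the coefficient matrix that summary() returns for an lm fit. The sketch below does this on simulated data; the variables x1, x2, x3 and the 0.05 cutoff are assumptions for illustration.

```r
# Simulated data: x1 and x2 affect y, x3 has no true effect.
set.seed(3)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

fit <- lm(y ~ ., data = sim)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(fit)$coefficients
coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]

# Keep only predictors whose p-value is below 0.05.
signif.coefs <- coefs[coefs[, "Pr(>|t|)"] < 0.05,
                      c("Estimate", "Pr(>|t|)"), drop = FALSE]
signif.coefs
```

The sign of the Estimate column then gives the direction of each association, conditional on the other predictors in the model.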
hist(Boston$crim, main = 'Crime')
hist(Boston$tax, main = 'Tax')
hist(Boston$ptratio, main = 'Pupil Teacher Ratio')
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
ranges <- sapply(Boston, range)
ranges
## crim zn indus chas nox rm age dis rad tax ptratio lstat
## [1,] 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6 1.73
## [2,] 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0 37.97
## medv
## [1,] 5
## [2,] 50
For crime, the distribution is strongly right-skewed: the median is 0.26 and most values are lower than 1, but a few outliers reach nearly 89, which makes the range very wide. Tax also has a relatively wide range, from 187 to 711, with what almost appears to be two separate populations: a lower-tax group that is more spread out and a higher-tax group that has much less variance. Pupil-teacher ratio also appears to have a relatively wide range, from 12.6 to 22.0, with a negative skew.
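A quick numerical check of skew is to compare the mean with the median: a long right tail pulls the mean above the median, and a long left tail pulls it below. The sketch below demonstrates this on simulated data (an assumption for illustration). The same pattern appears in the summary output above, where crim has mean 3.61 against median 0.26 and ptratio has mean 18.46 against median 19.05.

```r
# Simulated right-skewed sample (made up for illustration).
set.seed(4)
right.skewed <- rexp(500, rate = 1)

# Mirroring the sample turns the long right tail into a long left tail.
left.skewed <- -right.skewed

rbind(
  right = c(mean = mean(right.skewed), median = median(right.skewed)),
  left  = c(mean = mean(left.skewed),  median = median(left.skewed))
)
```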
table(Boston$chas)
##
## 0 1
## 471 35
35 of the census tracts in this dataset bound the Charles River.
median(Boston$ptratio)
## [1] 19.05
The median pupil-teacher ratio among towns in this dataset is 19.05.
min(Boston$medv)
## [1] 5
Boston[Boston$medv == min(Boston$medv),]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 30.59 5
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 22.98 5
# information on other values for comparison
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Two census tracts share the lowest medv of 5. Both have crime rates far above the median (38.35 and 67.92, versus a median of 0.26), have 100% of owner-occupied units built prior to 1940, and have the highest index of accessibility to radial highways (rad = 24).
Boston$above7 <- Boston$rm > 7 # the comparison already returns TRUE/FALSE
Boston$above8 <- Boston$rm > 8
print('Above 7 rooms:', quote=F)
## [1] Above 7 rooms:
table(Boston$above7)
##
## FALSE TRUE
## 442 64
print('Above 8 rooms:', quote=F)
## [1] Above 8 rooms:
table(Boston$above8)
##
## FALSE TRUE
## 493 13
64 census tracts average above 7 rooms, while 13 average above 8 rooms. Let’s take a closer look at those above 8 rooms:
# for comparison:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv above7 above8
## Min. : 5.00 Mode :logical Mode :logical
## 1st Qu.:17.02 FALSE:442 FALSE:493
## Median :21.20 TRUE :64 TRUE :13
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Boston[Boston$above8,]
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 98 0.12083 0 2.89 0 0.4450 8.069 76.0 3.4952 2 276 18.0 4.21 38.7
## 164 1.51902 0 19.58 1 0.6050 8.375 93.9 2.1620 5 403 14.7 3.32 50.0
## 205 0.02009 95 2.68 0 0.4161 8.034 31.9 5.1180 4 224 14.7 2.88 50.0
## 225 0.31533 0 6.20 0 0.5040 8.266 78.3 2.8944 8 307 17.4 4.14 44.8
## 226 0.52693 0 6.20 0 0.5040 8.725 83.0 2.8944 8 307 17.4 4.63 50.0
## 227 0.38214 0 6.20 0 0.5040 8.040 86.5 3.2157 8 307 17.4 3.13 37.6
## 233 0.57529 0 6.20 0 0.5070 8.337 73.3 3.8384 8 307 17.4 2.47 41.7
## 234 0.33147 0 6.20 0 0.5070 8.247 70.4 3.6519 8 307 17.4 3.95 48.3
## 254 0.36894 22 5.86 0 0.4310 8.259 8.4 8.9067 7 330 19.1 3.54 42.8
## 258 0.61154 20 3.97 0 0.6470 8.704 86.9 1.8010 5 264 13.0 5.12 50.0
## 263 0.52014 20 3.97 0 0.6470 8.398 91.5 2.2885 5 264 13.0 5.91 48.8
## 268 0.57834 20 3.97 0 0.5750 8.297 67.0 2.4216 5 264 13.0 7.44 50.0
## 365 3.47428 0 18.10 1 0.7180 8.780 82.9 1.9047 24 666 20.2 5.29 21.9
## above7 above8
## 98 TRUE TRUE
## 164 TRUE TRUE
## 205 TRUE TRUE
## 225 TRUE TRUE
## 226 TRUE TRUE
## 227 TRUE TRUE
## 233 TRUE TRUE
## 234 TRUE TRUE
## 254 TRUE TRUE
## 258 TRUE TRUE
## 263 TRUE TRUE
## 268 TRUE TRUE
## 365 TRUE TRUE
summary(Boston[Boston$above8,])
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio lstat medv
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :2.47 Min. :21.9
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:3.32 1st Qu.:41.7
## Median : 7.000 Median :307.0 Median :17.40 Median :4.14 Median :48.3
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :4.31 Mean :44.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :7.44 Max. :50.0
## above7 above8
## Mode:logical Mode:logical
## TRUE:13 TRUE:13
##
##
##
##
The median medv is much higher in tracts averaging more than 8 rooms: 48.3, versus 21.20 overall. The median lstat is much lower: 4.14, versus 11.36 overall. Both suggest that these tracts are wealthier overall, but interestingly the median crime rate is higher, 0.52 in tracts above 8 rooms as opposed to 0.26 overall.
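The comparison above was made by reading values off two separate summary() printouts. A more direct route is to compute group medians in one table with aggregate(). The sketch below uses a tiny made-up data frame whose column names mirror the Boston variables; the values are hypothetical.

```r
# Hypothetical values; only the column names mirror the Boston data.
toy <- data.frame(
  rm    = c(6.2, 8.1, 5.9, 8.4, 6.6, 7.2),
  medv  = c(21.0, 45.0, 18.5, 50.0, 23.1, 30.2),
  lstat = c(12.4, 4.0, 15.1, 3.2, 9.8, 6.5)
)
toy$above8 <- toy$rm > 8

# One row per group, one column of medians per variable.
res <- aggregate(cbind(medv, lstat) ~ above8, data = toy, FUN = median)
res
```

Note that this compares tracts above 8 rooms with all remaining tracts, whereas the text above compares them with the full data set, so the reference values would differ slightly.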