Conceptual Exercise

library('plotly')
library('dplyr')
library('ggplot2')

ISLR Exercise, Edition 1, Chapter 2, Page 53

James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning with
Applications in R. Springer. 7th

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

| Obs. | X1 | X2 | X3 | Y     |
|------|----|----|----|-------|
| 1    | 0  | 3  | 0  | Red   |
| 2    | 2  | 0  | 0  | Red   |
| 3    | 0  | 1  | 3  | Red   |
| 4    | 0  | 1  | 2  | Green |
| 5    | −1 | 0  | 1  | Green |
| 6    | 1  | 1  | 1  | Red   |

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
(b) What is our prediction with K = 1? Why?
(c) What is our prediction with K = 3? Why?
(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

df <- data.frame(X1=c(0,2,0,0,-1,1),
                 X2=c(3,0,1,1,0,1),
                 X3=c(0,0,3,2,1,1),
                 Y=c('Red','Red','Red','Green','Green','Red'))
# d: Euclidean distance from each observation to the test point (0,0,0)
df <- df %>% mutate(d=sqrt(X1^2 + X2^2 + X3^2)) %>% arrange(d)
df

##   X1 X2 X3     Y        d
## 1 -1  0  1 Green 1.414214
## 2  1  1  1   Red 1.732051
## 3  2  0  0   Red 2.000000
## 4  0  1  2 Green 2.236068
## 5  0  3  0   Red 3.000000
## 6  0  1  3   Red 3.162278

In the df table, d is the Euclidean distance from the test point Xo = (0,0,0) to each of the 6 training points. The rows are sorted by d in ascending order, so the first k rows are the k nearest neighbors.

If k=1, the nearest observation is (-1,0,1) with Y=Green, so we predict that the test point (0,0,0) is Green.
If k=3, the three nearest neighbors of Xo have Y values Green, Red, Red, so the majority vote predicts that Xo is Red.
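The two predictions above can be reproduced with a short majority vote in base R. This is a minimal sketch rather than part of the original solution: the helper name `knn_predict` is hypothetical, and ties between classes are not handled specially.

```r
# Training data from the exercise table.
train <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                    X2 = c(3, 0, 1, 1, 0, 1),
                    X3 = c(0, 0, 3, 2, 1, 1),
                    Y  = c('Red', 'Red', 'Red', 'Green', 'Green', 'Red'))

# knn_predict is a hypothetical helper: majority vote among the k closest rows.
knn_predict <- function(train, test_point, k) {
  X <- as.matrix(train[, c('X1', 'X2', 'X3')])
  # Euclidean distance from every training row to the test point
  d <- sqrt(rowSums(sweep(X, 2, test_point)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train$Y[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(train, c(0, 0, 0), k = 1)
knn_predict(train, c(0, 0, 0), k = 3)
```

The first call returns 'Green' and the second returns 'Red', which matches the answers to (b) and (c).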

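If the Bayes decision boundary is highly nonlinear, a small K is better. A small K gives a flexible classifier whose boundary can follow the nonlinear shape (low bias, at the cost of higher variance), whereas a large K averages over many neighbors and pushes the boundary toward a smoother, nearly linear one.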

p <- plot_ly(df, x=~X1, y=~X2, z=~X3, color=~Y, colors=c('green','red')) 
p

## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter3d

## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Applied, `College` dataset, page 54

The code below installs the ISLR package if it is missing and then loads it. The package contains the book's sample data sets.

# require() loads the package when it is installed, so no else branch is needed
if (!require('ISLR')) {
   install.packages('ISLR')
   library('ISLR')
}

## Loading required package: ISLR

Summarize the variables in the data set

summary(College)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

Produce a scatterplot matrix of the first ten variables

pairs(College[,1:10])

Boxplots of Outstate by Private. As the plots show, private universities tend to have higher out-of-state tuition.

boxplot(College$Outstate~College$Private)

ggplot(College, aes(x=Private, y = Outstate, fill=Private)) +
   geom_boxplot()

Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

College$Elite <- as.factor(ifelse(College$Top10perc > 50,'Yes','No')) 

summary(College$Elite)

##  No Yes 
## 699  78
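As an alternative to `ifelse`, the same binning can be written with base R's `cut`. The sketch below uses a few made-up Top10perc values (an assumption for illustration, not rows from `College`) so that the boundary case of exactly 50 is visible.

```r
# Hypothetical Top10perc values, chosen to include the boundary case 50.
top10 <- c(10, 50, 51, 96)

# right = TRUE (the default) makes the intervals (-Inf, 50] and (50, Inf],
# so exactly 50 falls in 'No', matching the rule Top10perc > 50.
elite <- cut(top10, breaks = c(-Inf, 50, Inf), labels = c('No', 'Yes'))
table(elite)
```

`cut` returns a factor directly, so no `as.factor` call is needed, and it extends naturally to more than two bins.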

ggplot(College, aes(x=Elite, y = Outstate, fill=Elite)) +
   geom_boxplot()

Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.

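par(mfrow=c(2,3))
# breaks suggests a different number of bins for each histogram
hist(College$Accept, breaks=10)
hist(College$Outstate, breaks=20)
hist(College$Top10perc, breaks=15)
hist(College$PhD, breaks=25)
hist(College$Grad.Rate, breaks=30)
hist(College$Enroll, breaks=40)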

Continue exploring the data, and provide a brief summary of what you discover.

College$AcceptRate <- College$Accept / College$Apps
ggplot(College, aes(x=Elite, y = AcceptRate, fill=Private)) +
   geom_boxplot()+
   ggtitle('Acceptance Rate break down by Elite and Private uni')

In the boxplot above, acceptance rates at Elite and/or Private universities are lower (admission is harder) than at non-Elite universities.

ggplot(College, aes(x=Elite, y = Grad.Rate, fill=Private)) +
   geom_boxplot()+
   ggtitle('Graduation Rate break down by Elite and Private uni')

The graph above shows graduation rates. Elite private universities have the highest graduation rate (nearly 90%), followed by Elite public universities, then non-Elite private universities. Non-Elite public universities have the lowest graduation rate, about 50%.

`Auto` data set, page 56

From the summary table:

Quantitative predictors: mpg, displacement, horsepower, weight, acceleration, year (year could also be treated as categorical). The range of each predictor runs from its Min. to its Max. value.
Qualitative predictors: name, origin (coded as integers), cylinders.
There are no missing values.

summary(Auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

summary(Auto[-c(10:85),])

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   :11.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1649  
##  1st Qu.:18.00   1st Qu.:4.000   1st Qu.:100.2   1st Qu.: 75.0   1st Qu.:2214  
##  Median :23.95   Median :4.000   Median :145.5   Median : 90.0   Median :2792  
##  Mean   :24.40   Mean   :5.373   Mean   :187.2   Mean   :100.7   Mean   :2936  
##  3rd Qu.:30.55   3rd Qu.:6.000   3rd Qu.:250.0   3rd Qu.:115.0   3rd Qu.:3508  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :4997  
##                                                                                
##   acceleration        year           origin     
##  Min.   : 8.50   Min.   :70.00   Min.   :1.000  
##  1st Qu.:14.00   1st Qu.:75.00   1st Qu.:1.000  
##  Median :15.50   Median :77.00   Median :1.000  
##  Mean   :15.73   Mean   :77.15   Mean   :1.601  
##  3rd Qu.:17.30   3rd Qu.:80.00   3rd Qu.:2.000  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                 
##                         name    
##  ford pinto               :  5  
##  toyota corolla           :  5  
##  amc matador              :  4  
##  chevrolet chevette       :  4  
##  amc hornet               :  3  
##  chevrolet caprice classic:  3  
##  (Other)                  :292
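`summary` reports the range and mean but not the standard deviation that the question asks for. One way to collect all three is sketched below. The helper name `range_mean_sd` is hypothetical, and the call on `Auto` assumes the ISLR package is installed.

```r
# range_mean_sd is a hypothetical helper: one row per numeric column.
range_mean_sd <- function(df) {
  num <- df[sapply(df, is.numeric)]
  data.frame(min  = sapply(num, min),
             max  = sapply(num, max),
             mean = sapply(num, mean),
             sd   = sapply(num, sd))
}

# Applied to the subset used above, if the ISLR package is available.
if (requireNamespace('ISLR', quietly = TRUE)) {
  print(round(range_mean_sd(ISLR::Auto[-c(10:85), ]), 2))
}
```

Non-numeric columns such as name are dropped before the statistics are computed, so the factor does not cause an error.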

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings

pairs(Auto %>% select(-c(name,year, origin, cylinders)) )

There is a positive relationship between horsepower and weight.
There is a negative relationship between mpg and weight, and similarly between mpg and both displacement and horsepower.

round(cor(Auto %>% select(-c(name))),2)

##                mpg cylinders displacement horsepower weight acceleration  year
## mpg           1.00     -0.78        -0.81      -0.78  -0.83         0.42  0.58
## cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50 -0.35
## displacement -0.81      0.95         1.00       0.90   0.93        -0.54 -0.37
## horsepower   -0.78      0.84         0.90       1.00   0.86        -0.69 -0.42
## weight       -0.83      0.90         0.93       0.86   1.00        -0.42 -0.31
## acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00  0.29
## year          0.58     -0.35        -0.37      -0.42  -0.31         0.29  1.00
## origin        0.57     -0.57        -0.61      -0.46  -0.59         0.21  0.18
##              origin
## mpg            0.57
## cylinders     -0.57
## displacement  -0.61
## horsepower    -0.46
## weight        -0.59
## acceleration   0.21
## year           0.18
## origin         1.00
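A correlation matrix of this size is easier to read when the pairs are ranked by absolute correlation. The sketch below does that for any numeric data frame. The helper name `top_correlations` and the small example data are assumptions for illustration.

```r
# top_correlations is a hypothetical helper: strongest pairs by |r|.
top_correlations <- function(df, n = 5) {
  cm <- cor(df)
  # keep each pair once by looking only above the diagonal
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(var1 = rownames(cm)[idx[, 'row']],
                    var2 = colnames(cm)[idx[, 'col']],
                    r    = cm[idx])
  out <- out[order(-abs(out$r)), ]
  head(out, n)
}

# Small made-up example: y is an exact multiple of x, z is unrelated noise.
toy <- data.frame(x = 1:10,
                  y = 2 * (1:10),
                  z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
top_correlations(toy, n = 3)
```

Called on the numeric columns of `Auto`, a helper like this would be expected to put cylinders and displacement (0.95) and displacement and weight (0.93) at the top, as in the matrix above.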

# Full model with all numeric predictors, kept under its own name for comparison
full_model <- lm(mpg~cylinders + displacement + horsepower+ weight + acceleration + year + origin, data=Auto)

model <- lm(mpg~ weight + year + origin, data=Auto)
model

## 
## Call:
## lm(formula = mpg ~ weight + year + origin, data = Auto)
## 
## Coefficients:
## (Intercept)       weight         year       origin  
##  -18.045850    -0.005994     0.757126     1.150391

summary(model)

## 
## Call:
## lm(formula = mpg ~ weight + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9440 -2.0948 -0.0389  1.7255 13.2722 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.805e+01  4.001e+00  -4.510 8.60e-06 ***
## weight      -5.994e-03  2.541e-04 -23.588  < 2e-16 ***
## year         7.571e-01  4.832e-02  15.668  < 2e-16 ***
## origin       1.150e+00  2.591e-01   4.439 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.348 on 388 degrees of freedom
## Multiple R-squared:  0.8175, Adjusted R-squared:  0.816 
## F-statistic: 579.2 on 3 and 388 DF,  p-value: < 2.2e-16

The model with weight, year, and origin has an adjusted R-squared of 0.816 (multiple R-squared 0.8175), and every coefficient has a p-value below 0.05.
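The two R-squared figures in the output are linked by the standard adjustment for the number of predictors. The short check below recomputes the adjusted value from numbers read off the `summary(model)` output. It is a sketch for verification only and does not refit the model.

```r
# Values read from the summary(model) output above.
r2 <- 0.8175            # multiple R-squared
df_resid <- 388         # residual degrees of freedom
p <- 3                  # predictors: weight, year, origin
n <- df_resid + p + 1   # number of observations

# adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 3)
```

With only three predictors and several hundred observations the adjustment is small, which is why the two figures nearly coincide.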

`Boston` data set, page 56

library('MASS')

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## The following object is masked from 'package:plotly':
## 
##     select

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

pairs(Boston)

cor(Boston)

##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000

Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

summary(Boston[,c('crim','tax','ptratio')])

##       crim               tax           ptratio     
##  Min.   : 0.00632   Min.   :187.0   Min.   :12.60  
##  1st Qu.: 0.08205   1st Qu.:279.0   1st Qu.:17.40  
##  Median : 0.25651   Median :330.0   Median :19.05  
##  Mean   : 3.61352   Mean   :408.2   Mean   :18.46  
##  3rd Qu.: 3.67708   3rd Qu.:666.0   3rd Qu.:20.20  
##  Max.   :88.97620   Max.   :711.0   Max.   :22.00

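Yes. The maximum crim (88.98) is far above its mean (3.61) and median (0.26), so a few suburbs have very high crime rates. The maximum tax (711) is also well above the median (330). ptratio has a much narrower range, from 12.6 to 22.0, so no suburb stands out as strongly.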
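To put a number on "particularly high", one common convention is to count values above the third quartile plus 1.5 times the interquartile range. The sketch below uses that rule. The rule itself and the helper name `count_high` are assumptions, not part of the exercise, and the call on `Boston` assumes MASS is installed.

```r
# count_high is a hypothetical helper: values above Q3 + 1.5 * IQR.
count_high <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  sum(x > q[2] + 1.5 * (q[2] - q[1]))
}

# Applied to the three predictors, if MASS is available.
if (requireNamespace('MASS', quietly = TRUE)) {
  print(sapply(MASS::Boston[, c('crim', 'tax', 'ptratio')], count_high))
}
```

A different threshold (for example a fixed percentile) would give different counts, so the result should be read as one reasonable summary rather than a definitive answer.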

How many of the suburbs in this data set bound the Charles river?

sum(Boston$chas==1)

## [1] 35

What is the median pupil-teacher ratio among the towns in this data set?

median(Boston$ptratio)

## [1] 19.05

Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

Boston[Boston$medv==min(Boston$medv),]

##        crim zn indus chas   nox    rm age    dis rad tax ptratio  black lstat
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.90 30.59
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 384.97 22.98
##     medv
## 399    5
## 406    5

Boston$min_medv <- ifelse(Boston$medv==min(Boston$medv),'lowest medv', 'non lowest')

ggplot(Boston, aes(x=min_medv, y=crim, color=min_medv))+
      geom_boxplot()

ggplot(Boston, aes(x=min_medv, y=indus, color=min_medv))+
      geom_boxplot()

ggplot(Boston, aes(x=min_medv, y=zn, color=min_medv))+
      geom_boxplot()

ggplot(Boston, aes(x=min_medv, y=nox, color=min_medv))+
      geom_boxplot()

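Some comments: two suburbs (rows 399 and 406) share the lowest medv of 5. Their crime rates (38.35 and 67.92) and nitrogen oxides concentration (0.693) are very high compared with the other suburbs, and both sit at the maximum of age (100) and rad (24). More plots could be added for the remaining predictors, but the high crime rate alone suggests these suburbs are not very livable.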

In this data set, how many of the suburbs average more than
seven rooms per dwelling? More than eight rooms per dwelling?
Comment on the suburbs that average more than eight rooms
per dwelling

cat('\n number of suburbs have more than 7 rooms per dwelling ', sum(Boston$rm>7))

## 
##  number of suburbs have more than 7 rooms per dwelling  64

cat('\n number of suburbs have more than 8 rooms per dwelling ', sum(Boston$rm>8))

## 
##  number of suburbs have more than 8 rooms per dwelling  13

cat('\n List of more than 8 rooms')

## 
##  List of more than 8 rooms

Boston[which(Boston$rm > 8),]

##        crim zn indus chas    nox    rm  age    dis rad tax ptratio  black lstat
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0 396.90  4.21
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7 388.45  3.32
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7 390.55  2.88
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4 385.05  4.14
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4 382.00  4.63
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4 387.38  3.13
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4 385.91  2.47
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4 378.95  3.95
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1 396.90  3.54
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0 389.70  5.12
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0 386.86  5.91
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0 384.54  7.44
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2 354.55  5.29
##     medv   min_medv
## 98  38.7 non lowest
## 164 50.0 non lowest
## 205 50.0 non lowest
## 225 44.8 non lowest
## 226 50.0 non lowest
## 227 37.6 non lowest
## 233 41.7 non lowest
## 234 48.3 non lowest
## 254 42.8 non lowest
## 258 50.0 non lowest
## 263 48.8 non lowest
## 268 50.0 non lowest
## 365 21.9 non lowest

summary(Boston[which(Boston$rm > 8),])

##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          black      
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
##  Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
##      lstat           medv        min_medv        
##  Min.   :2.47   Min.   :21.9   Length:13         
##  1st Qu.:3.32   1st Qu.:41.7   Class :character  
##  Median :4.14   Median :48.3   Mode  :character  
##  Mean   :4.31   Mean   :44.2                     
##  3rd Qu.:5.12   3rd Qu.:50.0                     
##  Max.   :7.44   Max.   :50.0

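Some comments: the suburbs averaging more than eight rooms have low crime rates (median 0.52 and maximum 3.47, against an overall maximum of 88.98), low lstat (at most 7.44), and high home values (median medv 48.3, against an overall median of 21.2).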

hale

2022-07-17
