Problem 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

  1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

This is a regression problem, since the response of interest, CEO salary, is quantitative. We are most interested in inference: we want to understand how the predictors relate to salary rather than predict it for new firms. Here n = 500 (the number of firms) and p = 3 (profit, number of employees, and industry).

  1. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

This is a classification problem, since the response (success or failure) is qualitative, and we are most interested in prediction: we care about the outcome for the new product rather than the underlying relationships. Here n = 20 (the previously launched products) and p = 13 (price, marketing budget, competition price, and the ten other variables).

  1. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

This is a regression problem, since the response (% change in the USD/Euro exchange rate) is quantitative, and we are interested in prediction of that weekly change. Here n = 52 (the weeks of 2012) and p = 3 (the % changes in the US, British, and German markets).

Problem 5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

A more flexible approach to regression will have less bias and can fit a wider range of functional forms, but it has higher variance: the fit depends more heavily on the particular sample, which can lead to overfitting. More flexible approaches are also typically harder to interpret and more computationally expensive than less flexible ones.

A flexible approach may be preferred when the relationship is non-linear and cannot be adequately captured by less flexible methods, when the goal is to maximize predictive performance, or when interpretability is not a priority. It is also favored when the sample size is large relative to the number of predictors.

A less flexible approach may be preferred when the relationship is roughly linear, when inference and understanding the relationships between variables matter, or when a simple, interpretable model is desired. It is also favored when the sample size is small or the number of predictors is large.
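As a small illustration on simulated data (not part of the exercise), the sketch below contrasts an inflexible straight-line fit with a flexible smoothing spline: the line cannot bend to the non-linear signal (high bias), while a spline with many degrees of freedom tracks it but starts to chase the noise (high variance).

set.seed(1)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)                 # non-linear truth plus noise
plot(x, y)
abline(lm(y ~ x), col = "blue")                    # inflexible: high bias
lines(smooth.spline(x, y, df = 25), col = "red")   # flexible: higher variance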

Problem 6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

A parametric approach makes an explicit assumption about the functional form of the relationship in the data, reducing the problem to estimating a fixed set of parameters, while a non-parametric approach does not. For example, linear regression is parametric: it assumes the relationship is linear. A non-parametric approach such as K-nearest neighbors makes no such assumption about the form of the relationship.

A parametric approach can make problems significantly simpler to solve, which decreases computational cost and makes the results more interpretable. The main disadvantage is that the assumed functional form is often wrong; if it is far from the truth, a parametric model can perform worse than a non-parametric approach.
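A minimal sketch of the contrast on simulated data (the data are not from the exercise, and the FNN package is assumed to be installed for knn.reg()):

library(FNN)                                          # provides knn.reg()
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)
fit.lm  <- lm(y ~ x)                                  # parametric: estimates two coefficients
fit.knn <- knn.reg(train = matrix(x), y = y, k = 9)   # non-parametric: no assumed form
plot(x, y)
abline(fit.lm, col = "blue")
lines(x, fit.knn$pred, col = "red")                   # with test = NULL, pred holds
                                                      # leave-one-out fits at the x's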

Problem 8

This exercise relates to the College data set, which can be found in the file College.csv on the book website. It contains a number of variables for 777 different universities and colleges in the US. The variables are:

• Private : Public/private indicator

• Apps : Number of applications received

• Accept : Number of applicants accepted

• Enroll : Number of new students enrolled

• Top10perc : New students from top 10 % of high school class

• Top25perc : New students from top 25 % of high school class

• F.Undergrad : Number of full-time undergraduates

• P.Undergrad : Number of part-time undergraduates

• Outstate : Out-of-state tuition

• Room.Board : Room and board costs

• Books : Estimated book costs

• Personal : Estimated personal spending

• PhD : Percent of faculty with Ph.D.’s

• Terminal : Percent of faculty with terminal degree

• S.F.Ratio : Student/faculty ratio

• perc.alumni : Percent of alumni who donate

• Expend : Instructional expenditure per student

• Grad.Rate : Graduation rate

Before reading the data into R, it can be viewed in Excel or a text editor.

  1. Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
college <- read.csv('College.csv')
  1. Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1]
# View(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try:

college <- college[,-1]
# View(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

  1. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00
  1. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
numeric.college <- college
# Recode Private as 1 (Yes) / 0 (No); as.numeric(factor(...)) would return the
# level codes 1/2 rather than the intended 1/0
numeric.college[, 1] <- ifelse(college[, 1] == "Yes", 1, 0)
pairs(numeric.college[, 1:10])

  1. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
boxplot(Outstate ~ Private, data = college, 
        xlab = "Elite", 
        ylab = "Outstate")

  1. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college , Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
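How many elite schools there are can be read directly from the factor summary:

summary(college$Elite)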

boxplot(Outstate ~ Elite, data = college, 
        xlab = "Elite", 
        ylab = "Outstate")

  1. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow = c(2,2))
hist(college$Apps)
hist(college$Top10perc)
hist(college$Outstate)
hist(college$Grad.Rate)

  1. Continue exploring the data, and provide a brief summary of what you discover.
par(mfrow = c(2,2))
# look at acceptance rate
Acceptance.Rate <- college$Accept/college$Apps
college <- data.frame(college , Acceptance.Rate)
summary(college$Acceptance.Rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1545  0.6756  0.7788  0.7469  0.8485  1.0000
hist(college$Acceptance.Rate,
     main='',
     xlab='Acceptance Rate')

# boxplot of elite v acceptance rate
boxplot(Acceptance.Rate ~ Elite, data = college, 
        xlab = "Elite", 
        ylab = "Acceptance Rate")

# boxplots of elite v student faculty ratio
boxplot(S.F.Ratio ~ Elite, data = college, 
        xlab = "Elite", 
        ylab = "Student Faculty Ratio")

# boxplots of elite v applications
boxplot(Apps ~ Elite, data = college, 
        xlab = "Elite", 
        ylab = "Applications")

From the summary statistics, most schools accept a large share of their applicants: the median acceptance rate is about 78%, with a minimum of about 15%. The boxplots suggest that elite schools tend to have lower acceptance rates and lower student/faculty ratios than non-elite schools, while drawing more applications.

Problem 9

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

auto <- read.table('auto.data', header = T)

head(auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307      130.0   3504         12.0   70      1
## 2  15         8          350      165.0   3693         11.5   70      1
## 3  18         8          318      150.0   3436         11.0   70      1
## 4  16         8          304      150.0   3433         12.0   70      1
## 5  17         8          302      140.0   3449         10.5   70      1
## 6  15         8          429      198.0   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
sum(is.na(auto)) # no outright missing values
## [1] 0
unique(auto$horsepower) # a "?" marks a missing value; these rows are removed below
##  [1] "130.0" "165.0" "150.0" "140.0" "198.0" "220.0" "215.0" "225.0" "190.0"
## [10] "170.0" "160.0" "95.00" "97.00" "85.00" "88.00" "46.00" "87.00" "90.00"
## [19] "113.0" "200.0" "210.0" "193.0" "?"     "100.0" "105.0" "175.0" "153.0"
## [28] "180.0" "110.0" "72.00" "86.00" "70.00" "76.00" "65.00" "69.00" "60.00"
## [37] "80.00" "54.00" "208.0" "155.0" "112.0" "92.00" "145.0" "137.0" "158.0"
## [46] "167.0" "94.00" "107.0" "230.0" "49.00" "75.00" "91.00" "122.0" "67.00"
## [55] "83.00" "78.00" "52.00" "61.00" "93.00" "148.0" "129.0" "96.00" "71.00"
## [64] "98.00" "115.0" "53.00" "81.00" "79.00" "120.0" "152.0" "102.0" "108.0"
## [73] "68.00" "58.00" "149.0" "89.00" "63.00" "48.00" "66.00" "139.0" "103.0"
## [82] "125.0" "133.0" "138.0" "135.0" "142.0" "77.00" "62.00" "132.0" "84.00"
## [91] "64.00" "74.00" "116.0" "82.00"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
auto <- auto %>% 
  filter(!grepl("\\?", horsepower))  # drop rows where horsepower is "?"
# unique(auto$horsepower)
  1. Which of the predictors are quantitative, and which are qualitative?

Quantitative predictors include mpg, displacement, horsepower, weight, and acceleration. Cylinders, year, origin, and name can be treated as qualitative: cylinders, year, and origin take only a small number of discrete values, and name is a label.
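A quick programmatic check of how R is storing each variable (horsepower is still stored as character at this point; it is converted to numeric below):

sapply(auto, class)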

  1. What is the range of each quantitative predictor? You can answer this using the range() function.
auto.quant <- auto[,c(1,3,4,5,6)]
auto.quant$horsepower <- as.numeric(auto.quant$horsepower)

ranges <- sapply(auto.quant,range)
ranges
##       mpg displacement horsepower weight acceleration
## [1,]  9.0           68         46   1613          8.0
## [2,] 46.6          455        230   5140         24.8
  1. What is the mean and standard deviation of each quantitative predictor?
results <- sapply(auto.quant, function(x){
  mean <- mean(x)
  sd <- sd(x)
  out <- c(mean,sd)
  names(out) <- c('Mean', 'Standard Deviation')
  out
  })
results
##                          mpg displacement horsepower    weight acceleration
## Mean               23.445918      194.412  104.46939 2977.5842    15.541327
## Standard Deviation  7.805007      104.644   38.49116  849.4026     2.758864
  1. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
auto.d <- auto.quant[-c(10:85),]

print('Range:', quote = F)
sapply(auto.d, range)
print('Mean:', quote = F)
sapply(auto.d, mean)
print('Standard Deviation:', quote = F)
sapply(auto.d, sd)
## [1] Range:
##       mpg displacement horsepower weight acceleration
## [1,] 11.0           68         46   1649          8.5
## [2,] 46.6          455        230   4997         24.8
## [1] Mean:
##          mpg displacement   horsepower       weight acceleration 
##     24.40443    187.24051    100.72152   2935.97152     15.72690
## [1] Standard Deviation:
##          mpg displacement   horsepower       weight acceleration 
##     7.867283    99.678367    35.708853   811.300208     2.693721
  1. Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
numeric.auto <- auto
numeric.auto$horsepower <- as.numeric(auto$horsepower)
pairs(numeric.auto[,-c(7:9)])

par(mfrow = c(2,2))
plot(horsepower~displacement, data=auto)
plot(horsepower~acceleration, data=auto)
plot(horsepower~weight, data=auto)
plot(displacement~acceleration, data=auto)

plot(mpg~acceleration, data=auto)
plot(mpg~displacement, data=auto)
plot(mpg~weight, data=auto)
plot(mpg~horsepower, data=auto)

From the pairwise plot we can see a few relationships between predictors: positive, roughly linear relationships between horsepower and both weight and displacement, and a negative relationship between horsepower and acceleration. As discussed in the next part, mpg shows negative linear relationships with horsepower and weight, and a weaker negative relationship with displacement.

  1. Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Looking at the pairwise plots, there appears to be a negative, roughly linear relationship between mpg and both horsepower and weight, and a somewhat weaker negative relationship between mpg and displacement. Since these relationships appear roughly linear in the scatterplots above, horsepower, weight, and displacement would all likely be useful in predicting mpg.
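These visual impressions can be quantified with a correlation matrix, using the numeric.auto frame built above:

round(cor(numeric.auto[, -c(7:9)]), 2)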

Problem 10

This exercise involves the Boston housing data set.

  1. To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.
library(ISLR2)

Now the data set is contained in the object Boston.

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

How many rows are in this data set? How many columns? What do the rows and columns represent?

dim(Boston)
## [1] 506  13

There are 506 rows in this data set, each representing a census tract (suburb) in the Boston area, and 13 columns, each representing a housing-related variable measured for that tract.

  1. Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
pairs(Boston)

From the scatterplots we can see a few relationships that stand out. medv appears negatively related to lstat and positively related to rm. We also see a positive relationship between dis and zn and a negative one between dis and indus; this makes intuitive sense, since a high proportion of industry in a tract likely means shorter distances to employment centers, while the opposite holds for a more residential tract. There also appears to be a positive relationship between nox and indus, which also makes sense, as more industry would likely mean more pollution.

  1. Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
boston.crim <- lm(crim~., data=Boston)
summary(boston.crim)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Fitting a linear model of crim on all other predictors, at the 0.05 significance level we see positive relationships between crim and both zn and rad, and negative relationships between crim and both dis and medv.
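As a complementary check to the regression, the marginal correlation of each predictor with crim can be ranked (crim itself, with correlation 1, will head the list):

sort(cor(Boston)[, "crim"], decreasing = TRUE)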

  1. Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
hist(Boston$crim, main = 'Crime')

hist(Boston$tax, main = 'Tax')

hist(Boston$ptratio, main = 'Pupil Teacher Ratio')

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00
ranges <- sapply(Boston, range)
ranges
##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio lstat
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6  1.73
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 37.97
##      medv
## [1,]    5
## [2,]   50

For crime, the distribution is strongly right-skewed: most values are below 1, but a few outliers reach nearly 89, making the range very wide. Tax also has a large range (187 to 711), and the histogram suggests two separate groups: a spread-out lower-tax group and a tightly clustered high-tax group. The pupil-teacher ratio has a fairly wide range as well (12.6 to 22.0), with a left skew.
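These impressions can be checked numerically; the 600 cutoff below is just an eyeballed threshold from the tax histogram:

mean(Boston$crim < 1)   # share of tracts with a crime rate below 1
sum(Boston$tax > 600)   # size of the apparent high-tax cluster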

  1. How many of the census tracts in this data set bound the Charles river?
table(Boston$chas)
## 
##   0   1 
## 471  35

35 of the census tracts in this dataset bound the Charles River.

  1. What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05

The median pupil-teacher ratio among towns in this dataset is 19.05.

  1. Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
min(Boston$medv)
## [1] 5
Boston[Boston$medv == min(Boston$medv),]
##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5
# information on other values for comparison
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

There are two census tracts tied for the lowest medv of 5 (rows 399 and 406). Relative to the overall distributions, both have very high crime rates (38.4 and 67.9, against a median of 0.26), 100% of owner-occupied units built prior to 1940, the maximum accessibility-to-radial-highways index (24), the high tax rate of 666 and pupil-teacher ratio of 20.2 (both at the overall 3rd quartiles), and lstat values well above the 3rd quartile.
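One compact way to place a tract within each predictor's overall distribution is the empirical CDF; a sketch for row 399, one of the two tracts above:

# percentile of tract 399's value within each predictor's distribution
sapply(names(Boston), function(v) round(ecdf(Boston[[v]])(Boston[399, v]), 2))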

  1. In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
Boston$above7 <- Boston$rm > 7
Boston$above8 <- Boston$rm > 8

print('Above 7 rooms:', quote=F)
## [1] Above 7 rooms:
table(Boston$above7)
## 
## FALSE  TRUE 
##   442    64
print('Above 8 rooms:', quote=F)
## [1] Above 8 rooms:
table(Boston$above8)
## 
## FALSE  TRUE 
##   493    13

64 census tracts average above 7 rooms, while 13 average above 8 rooms. Let’s take a closer look at those above 8 rooms:

# for comparison:
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv         above7          above8       
##  Min.   : 5.00   Mode :logical   Mode :logical  
##  1st Qu.:17.02   FALSE:442       FALSE:493      
##  Median :21.20   TRUE :64        TRUE :13       
##  Mean   :22.53                                  
##  3rd Qu.:25.00                                  
##  Max.   :50.00
Boston[Boston$above8, ]
##        crim zn indus chas    nox    rm  age    dis rad tax ptratio lstat medv
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0  4.21 38.7
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7  3.32 50.0
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7  2.88 50.0
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4  4.14 44.8
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4  4.63 50.0
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4  3.13 37.6
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4  2.47 41.7
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4  3.95 48.3
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1  3.54 42.8
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0  5.12 50.0
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0  5.91 48.8
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0  7.44 50.0
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2  5.29 21.9
##     above7 above8
## 98    TRUE   TRUE
## 164   TRUE   TRUE
## 205   TRUE   TRUE
## 225   TRUE   TRUE
## 226   TRUE   TRUE
## 227   TRUE   TRUE
## 233   TRUE   TRUE
## 234   TRUE   TRUE
## 254   TRUE   TRUE
## 258   TRUE   TRUE
## 263   TRUE   TRUE
## 268   TRUE   TRUE
## 365   TRUE   TRUE
summary(Boston[Boston$above8, ])
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          lstat           medv     
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
##  Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0  
##   above7         above8       
##  Mode:logical   Mode:logical  
##  TRUE:13        TRUE:13       
##                               
##                               
##                               
## 

We can see that the median medv is much higher among tracts averaging more than eight rooms: 48.3, versus 21.2 overall. We also see a much lower median lstat: 4.14, versus 11.36 overall. These figures suggest these are wealthier areas overall, but interestingly the median crime rate is higher as well: 0.52 in the above-8 group, as opposed to 0.26 overall.