Question 8

This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.

8A: Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

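library(ISLR)  # also provides the Auto data used in Question 9
# stringsAsFactors = TRUE keeps Private as a factor on R >= 4.0, where the default is FALSE
college <- read.csv("College.csv", stringsAsFactors = TRUE)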

8B: Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.

head(college[, 1:5])
##                              X Private Apps Accept Enroll
## 1 Abilene Christian University     Yes 1660   1232    721
## 2           Adelphi University     Yes 2186   1924    512
## 3               Adrian College     Yes 1428   1097    336
## 4          Agnes Scott College     Yes  417    349    137
## 5    Alaska Pacific University     Yes  193    146     55
## 6            Albertson College     Yes  587    479    158

The names are stored in the first column (X), so we drop that column before analysing the data.

college <- college[, -1]
head(college[, 1:5])
##   Private Apps Accept Enroll Top10perc
## 1     Yes 1660   1232    721        23
## 2     Yes 2186   1924    512        16
## 3     Yes 1428   1097    336        22
## 4     Yes  417    349    137        60
## 5     Yes  193    146     55        16
## 6     Yes  587    479    158        38

Now the first data column is Private. The numbers printed to the left of Private are not a data column; they are the names R gives to each row (here the default row numbers, because the university names were dropped rather than assigned as row names).
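One alternative, not used in the code above, is to store the university names as row names before dropping the column, so that each row stays labelled. The sketch below uses a small hand-built data frame (college_demo is a hypothetical name) holding three of the rows printed earlier.

```r
# Toy data frame with three of the rows shown in the head() output above
college_demo <- data.frame(
  X       = c("Abilene Christian University", "Adelphi University", "Adrian College"),
  Private = c("Yes", "Yes", "Yes"),
  Apps    = c(1660, 2186, 1428)
)

rownames(college_demo) <- college_demo[, 1]  # keep the names as row labels
college_demo <- college_demo[, -1]           # then drop the name column

# Rows can now be looked up by name
college_demo["Adelphi University", "Apps"]
```

With row names in place, a later lookup such as rownames() on a filtered data frame would return the institution name rather than a row number.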

8C(i): Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

8C(ii): Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

pairs(college[,1:10])

8C(iii): Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

plot(college$Private, college$Outstate, xlab = "Private University", ylab ="Out of State tuition (USD)", main = "Out of State Tuition Plot", col = c('powderblue', 'mistyrose'))
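The call above produces boxplots because the x argument is a factor: plot() with a factor on the x axis and a numeric y hands the work to boxplot(). A minimal sketch of the same behaviour, assuming the built-in mtcars data as a stand-in for College (transmission and box_stats are hypothetical names):

```r
# mtcars ships with R; am (0 = automatic, 1 = manual) stands in for Private
transmission <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# plot = FALSE returns the boxplot statistics instead of drawing them
box_stats <- boxplot(mtcars$mpg ~ transmission, plot = FALSE)

box_stats$names  # one box per factor level
box_stats$n      # number of observations in each box
box_stats$stats  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```

Dropping plot = FALSE draws the same side-by-side boxplots that plot(transmission, mtcars$mpg) would.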

8C(iv): Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.

Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"  # no leading space in the label
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

summary(college$Elite)
##  No Yes 
## 699  78
plot(college$Elite, college$Outstate, xlab = "Elite University", ylab ="Out of State tuition (USD)", main = "Out of State Tuition Plot", col = c('#b492ff', '#baffc5'))
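The binning step can also be written in a single expression. The sketch below is one possible alternative, applied to the six Top10perc values printed in the head() output earlier; top10 and elite_demo are hypothetical names.

```r
# Top10perc for the six rows shown earlier
top10 <- c(23, 16, 22, 60, 16, 38)

# ifelse() assigns the label in one step; levels fixes the order of the categories
elite_demo <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite_demo)
```

Setting the levels explicitly avoids relying on alphabetical ordering, which matters when colours are matched to boxes by position.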

8C(v): Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

par(mfrow=c(2,2))
hist(college$F.Undergrad, xlab = "Full-time Undergraduates", ylab = "Count", main = "Histogram of Full-time Undergraduates", col= "#c16e6e")
hist(college$Personal, xlab = "Personal Spending", ylab = "Count", main = "Histogram of Estimated Personal Spending", col = "#d39797")
hist(college$Grad.Rate, xlab = "Rate", ylab = "Count", main = "Histogram of Graduation Rate", col = "#9a5656")
hist(college$Room.Board, xlab = "Cost", ylab = "Count", main = "Histogram of Room and Board Costs", col = "#7d2d2d")
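The histograms above all use the default binning. To vary the number of bins, hist() takes a breaks argument; a single number is treated as a suggestion, so the bin count actually used may differ. A small sketch, assuming the built-in mtcars data as a stand-in (x, coarse and fine are hypothetical names):

```r
x <- mtcars$mpg  # built-in stand-in for a College column

# plot = FALSE returns the binning without drawing
coarse <- hist(x, breaks = 5,  plot = FALSE)
fine   <- hist(x, breaks = 20, plot = FALSE)

length(coarse$counts)  # number of bins actually used for breaks = 5
length(fine$counts)    # number of bins actually used for breaks = 20
```

The same argument can be added to any of the calls above, for example breaks = 20 in the F.Undergrad histogram.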

8C(vi): Continue exploring the data, and provide a brief summary of what you discover.

summary(college$Outstate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2340    7320    9990   10441   12925   21700

Summary: The output above shows that the highest out-of-state tuition among the 777 institutions in this data set is $21,700. Let’s see which one charges it.

outstate <- college[college$Outstate == 21700, ]
rownames(outstate)
## [1] "48"

Looking back at the original file, row 48 is Bennington College, so it has the highest out-of-state tuition in the data set.
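The lookup can also be done without hard-coding the maximum, using which.max() to get the row position and then reading the name from that row. The sketch below shows the pattern on the Apps column of the six rows printed earlier, since those are the only values with names shown in this write-up; apps_demo is a hypothetical name.

```r
# The six rows shown in the first head() output (name and Apps only)
apps_demo <- data.frame(
  X    = c("Abilene Christian University", "Adelphi University", "Adrian College",
           "Agnes Scott College", "Alaska Pacific University", "Albertson College"),
  Apps = c(1660, 2186, 1428, 417, 193, 587)
)

top_row <- which.max(apps_demo$Apps)  # position of the largest value
apps_demo$X[top_row]                  # name stored in that row
```

Note that which.max() returns only the first maximum, so ties need a comparison against max() instead.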

Question 9

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

Q9A: Which of the predictors are quantitative, and which are qualitative?

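  • Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, and year. Qualitative: name and origin. Although origin is stored as a number, it is a code for the region of origin (1 = American, 2 = European, 3 = Japanese), so it is really a category. The summaries below still include it because R treats it as numeric.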

Q9B: What is the range of each quantitative predictor? You can answer this using the range() function.

summary(Auto[, -9])
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin     
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :15.50   Median :76.00   Median :1.000  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000
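The ranges can be read from the Min. and Max. rows of the summary above. To get them directly with range(), as the question suggests, sapply() can apply it to every column. A minimal sketch, assuming the built-in mtcars data as a stand-in for Auto (quant_demo and ranges are hypothetical names):

```r
# Three numeric columns from the built-in mtcars data
quant_demo <- mtcars[, c("mpg", "hp", "wt")]

# range() returns c(min, max); sapply() collects one column of results per variable
ranges <- sapply(quant_demo, range)
rownames(ranges) <- c("min", "max")
ranges
```

For the exercise itself the equivalent call would be sapply(Auto[, -9], range).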

Q9C: What is the mean and standard deviation of each quantitative predictor?

sapply(Auto[, -9], mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year       origin 
##    75.979592     1.576531
sapply(Auto[, -9], sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    7.8050075    1.7057832  104.6440039   38.4911599  849.4025600    2.7588641 
##         year       origin 
##    3.6837365    0.8055182

Q9D: Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

auto_subset <- Auto[-(10:85), -9]  # drop observations 10-85 and the name column
sapply(auto_subset, range)
##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0         3           68         46   1649          8.5   70      1
## [2,] 46.6         8          455        230   4997         24.8   82      3
sapply(auto_subset, mean)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year       origin 
##    77.145570     1.601266
sapply(auto_subset, sd)
##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year       origin 
##     3.106217     0.819910
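Negative indices drop rows rather than select them, and the parentheses around the range matter. A short sketch of the idea, assuming the built-in mtcars data as a stand-in for Auto (kept and keep_rows are hypothetical names):

```r
# -(10:20) drops rows 10 through 20; without parentheses, -10:20 would mean (-10):20
kept <- mtcars[-(10:20), ]

nrow(mtcars)  # rows before
nrow(kept)    # rows after dropping 11 of them

# The same selection written as a logical filter
keep_rows <- !(seq_len(nrow(mtcars)) %in% 10:20)
identical(kept, mtcars[keep_rows, ])
```

The logical form is longer but makes the intent explicit, which can help when the rows to drop come from a condition rather than fixed positions.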

Q9E: Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

pairs(Auto[, 1:8])

Comments: In the scatterplot matrix above, horsepower, weight, and displacement each have an inverse relationship with mpg, and a positive, roughly linear relationship with each other. This makes sense: more powerful engines tend to be larger and heavier, so cars with more horsepower tend to weigh more, and in general the larger an engine’s displacement, the more power it can produce.

Q9F: Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

  • Yes. In the scatterplot matrix above, most of the predictors show some relationship with mpg. For example, year and mpg have a positive relationship (mpg increases with model year), while horsepower and mpg have a negative relationship (mpg decreases as horsepower increases). One plausible reason is that producing more power generally requires burning more fuel.
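The visual impressions above can be backed up with correlation coefficients. The sketch below shows the idea on the built-in mtcars data, which is a different data set used only as a stand-in for Auto; cors is a hypothetical name.

```r
# Pairwise Pearson correlations among mpg and three engine/weight variables
cors <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])

# Row for mpg: negative values indicate an inverse relationship
round(cors["mpg", ], 2)

# Correlation between displacement and weight
round(cors["disp", "wt"], 2)
```

For the exercise, cor(Auto[, 1:8]) would give the corresponding matrix for the Auto predictors.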

Question 10

This exercise involves the Boston housing data set.

Q10A: To begin, load in the Boston data set. The Boston data set is part of the MASS library in R. How many rows are in this data set? How many columns? What do the rows and columns represent?

library(MASS)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
  • There are 506 rows and 14 columns in this data set. Each row is one suburb (town) of Boston, and each column is a variable measured for that suburb.

Q10B: Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

pairs(Boston)

Comments: With 14 variables, the scatterplot matrix is hard to read in detail, but some pairs of variables are clearly related. For example, zn and dis appear to have a roughly linear, positive relationship.

Q10C: Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

plot(crim ~ age, data = Boston, log = "xy") # Crime tends to be higher where more homes were built before 1940

plot(crim ~ tax, data = Boston, log = "xy") # Crime tends to be higher where the property-tax rate is higher

plot(crim ~ indus, data = Boston, log = "xy") # Crime tends to be higher where more land is non-retail business

plot(crim ~ dis, data = Boston, log = "xy") # Crime tends to be higher closer to the employment centres

plot(crim ~ rad, data = Boston, log = "xy") # Crime tends to be higher where radial-highway accessibility is higher
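The direction of each relationship noted in the plot comments can be checked numerically. A minimal sketch, assuming the MASS package is installed and using Spearman rank correlation as one reasonable choice for skewed variables (predictors and crim_cor are hypothetical names):

```r
library(MASS)

predictors <- c("age", "tax", "indus", "dis", "rad")

# Rank correlation of crim with each predictor plotted above
crim_cor <- sapply(Boston[, predictors],
                   function(x) cor(Boston$crim, x, method = "spearman"))

round(sort(crim_cor), 2)
```

A negative value for dis and positive values for the others would agree with the comments on the plots.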

Q10D: Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

par(mfrow=c(1,3))
hist(Boston$crim[Boston$crim > 1], xlab = "Crime Rate (suburbs with crim > 1)", ylab = "Number of Suburbs", main = "Crime Rates", breaks=25)
hist(Boston$tax, xlab = "Tax Rate", ylab = "Number of Suburbs", main = "Tax Rates", breaks=25)
hist(Boston$ptratio, xlab = "Pupil-Teacher Ratio", ylab = "Number of Suburbs", main = "Pupil-Teacher Ratio", breaks=25)

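Comments: Most suburbs have low crime rates (the median is about 0.26), but a few have very high rates, up to about 89. Note that the first histogram shows only suburbs with a crime rate above 1. Tax rates range from 187 to 711, with a large gap between the low-tax suburbs and a peak at 660-680. Pupil-teacher ratios range from 12.6 to 22.0, with no comparable extremes.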

Q10E: How many of the suburbs in this data set bound the Charles river?

sum(Boston$chas == 1)
## [1] 35
  • 35 suburbs bound the Charles River

Q10F: What is the median pupil-teacher ratio among the towns in this data set?

median(Boston$ptratio)
## [1] 19.05
  • The median is 19.05 pupils per teacher

Q10G: Which suburb of Boston has the lowest median value of owner-occupied homes?

which.min(Boston$medv)
## [1] 399
t(subset(Boston, medv == min(medv)))
##              399      406
## crim     38.3518  67.9208
## zn        0.0000   0.0000
## indus    18.1000  18.1000
## chas      0.0000   0.0000
## nox       0.6930   0.6930
## rm        5.4530   5.6830
## age     100.0000 100.0000
## dis       1.4896   1.4254
## rad      24.0000  24.0000
## tax     666.0000 666.0000
## ptratio  20.2000  20.2000
## black   396.9000 384.9700
## lstat    30.5900  22.9800
## medv      5.0000   5.0000
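The two calls above give different answers because which.min() returns only the first position of the minimum, while the subset() call keeps every row that matches it. A short sketch of finding all tied rows directly, assuming the MASS package is installed (lowest is a hypothetical name):

```r
library(MASS)

which.min(Boston$medv)  # first minimum only

# Every row whose medv equals the minimum
lowest <- which(Boston$medv == min(Boston$medv))
lowest
```

Here both suburb 399 and suburb 406 share the minimum, matching the two columns in the table above.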

What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Comments:

  • Two suburbs, 399 and 406, tie for the lowest medv (5.0, in $1,000s), which is the minimum of the data set; which.min() reports only the first. The comments below refer to suburb 399, and suburb 406 is similar on most predictors.
  • crim (38.35) is far above both the mean (3.61) and the third quartile (3.68)
  • zn is 0: no residential land is zoned for lots over 25,000 sq ft, which is also the median
  • indus (18.1), the proportion of non-retail business acres, is above the mean (11.14) and equal to the third quartile
  • chas is 0: the suburb does not bound the Charles River
  • nox (0.693, parts per 10 million) is above the third quartile (0.624)
  • rm (5.453) is below the first quartile (5.886)
  • age is 100, the maximum: all owner-occupied units were built prior to 1940
  • dis (1.49) is below the first quartile (2.10), so the suburb is close to the five Boston employment centres
  • rad (24) is the maximum index of accessibility to radial highways
  • tax (666 per $10,000) is equal to the third quartile
  • ptratio (20.2) is equal to the third quartile
  • black (396.9) is the maximum value of 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • lstat (30.59) is well above the third quartile (16.95), so a high percentage of the population is lower status
  • medv (5.0, in $1,000s) is the minimum

Q10H: In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

sum(Boston$rm > 7)
## [1] 64
sum(Boston$rm > 8)
## [1] 13
summary(subset(Boston, rm > 8))
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          black      
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
##  Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
##      lstat           medv     
##  Min.   :2.47   Min.   :21.9  
##  1st Qu.:3.32   1st Qu.:41.7  
##  Median :4.14   Median :48.3  
##  Mean   :4.31   Mean   :44.2  
##  3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :7.44   Max.   :50.0

Comments:

  • Suburbs that average more than 8 rooms per dwelling tend to have:
    • Lower crime rates (crim mean 0.72 vs. 3.61 overall)
    • Higher median values of owner-occupied homes (medv mean 44.2 vs. 22.53 overall, in $1,000s)
    • A smaller percentage of lower-status population (lstat mean 4.31 vs. 12.65 overall)
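The comparison in these comments can be put side by side in one small table. A minimal sketch, assuming the MASS package is installed (big, vars and comparison are hypothetical names):

```r
library(MASS)

big  <- subset(Boston, rm > 8)        # suburbs averaging more than eight rooms
vars <- c("crim", "lstat", "medv")

# One row of means for all suburbs, one for the rm > 8 group
comparison <- rbind(
  overall   = colMeans(Boston[, vars]),
  rm_over_8 = colMeans(big[, vars])
)

round(comparison, 2)
```

With only 13 suburbs in the rm > 8 group, these means are sensitive to individual values, so the medians in the summary() output are worth checking as well.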