Assignment 1

Question 8

  1. Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

  2. Look at the data using the View() function. You should notice that the first column is just the name of each university.

We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

    1. Use the summary() function to produce a numerical summary of the variables in the data set.
##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00
  1. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.

Recall that you can reference the first ten columns of a matrix A using A[,1:10].

  1. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

  1. Create a new qualitative variable, called Elite, by binning the Top10perc variable.

We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.

Use the summary() function to see how many elite universities there are.

##  No Yes 
## 699  78

Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite

  1. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.

You may find the command par(mfrow = c(2, 2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously.

Modifying the arguments to this function will divide the screen in other ways.

A comparison of the number of applications between the different college types (all, private, public elite)

A comparison of the number of applicants accepted between the different college types (all, private, public, elite)

A comparison of the total numbers of Applicants, Acceptance, Enrollment, and Graduation

  1. Continue exploring the data, and provide a brief summary of what you discover

Which college has the highest acceptance rate?

## [1] "Emporia State University"

Which college has the lowest acceptance rate?

## [1] "Princeton University"

Which college has the highest enrollment rate?

## [1] "California Lutheran University"

Which college has the lowest enrollment rate?

## [1] "Franklin Pierce College"

The correlation of applications to expenditures

Summary: This plot demonstrated that colleges with lower instructional expenditures per student receive more applications.

The correlation of enrollment to expenditures

Summary: This plot demonstrates that colleges with lower instructional expenditures per student enroll more students than colleges with higher expenditures.

Question 9

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

## [1] 392   9

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577                     
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000

(a) Which of the predictors are quantitative, and which are qualitative?

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

The quantitative predictors are: mpg, cylinders, displacement, horsepower, weight, acceleration, and year. The qualitative predictors are: name and origin

(b) What is the range of each quantitative predictor? You can answer this using the range() function.

##       mpg cylinders displacement horsepower weight acceleration year
## [1,]  9.0         3           68         46   1613          8.0   70
## [2,] 46.6         8          455        230   5140         24.8   82

Solution: mpg = 37.6, cylinders = 5, displacement = 387, horsepower = 184, weight = 3527, acceleration = 16.8, year = 12

(c) What is the mean and standard deviation of each quantitative predictor?

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year 
##    75.979592

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.805007     1.705783   104.644004    38.491160   849.402560     2.758864 
##         year 
##     3.683737

Solution: mpg: mean = 23.45, standard deviation = 7.81 cylinders: mean = 5.47, standard deviation = 1.71 displacement: mean = 194.41, standard deviation = 104.64 horsepower: mean = 104.47, standard deviation = 38.49 weight: mean = 2977.58, standard deviation = 849.40 acceleration: mean = 15.54, standard deviation = 2.76 year: mean = 75.98, standard deviation = 3.68

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

##       mpg cylinders displacement horsepower weight acceleration year
## [1,] 11.0         3           68         46   1649          8.5   70
## [2,] 46.6         8          455        230   4997         24.8   82

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year 
##    77.145570

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year 
##     3.106217

Solution: mpg: range = 35.6, mean = 24.40, standard deviation = 7.87 cylinders: range = 5, mean = 5.37, standard deviation = 1.65 displacement: range = 387, mean = 187.24, standard deviation = 99.68 horsepower: range = 184, mean = 100.72, standard deviation = 35.71 weight: range = 3348, mean = 2935.97, standard deviation = 811.30 acceleration: range = 16.3, mean = 15.73, standard deviation = 2.69 year: mean = 77.15, standard deviation = 3.11

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice.

Create some plots highlighting the relationships among the predictors. Comment on your findings.

Question 10

#10. This exercise involves the Boston housing data set. # (a) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.

## Warning: package 'ISLR2' was built under R version 4.1.3

## 
## Attaching package: 'ISLR2'

## The following object is masked _by_ '.GlobalEnv':
## 
##     Auto

## starting httpd help server ... done

How many rows are in this data set? How many columns? What do the rows and columns represent?

Solution: There are 506 rows and 13 columns containing housing values in suburbs of Boston, such as per capita crime rate, average number of rooms per dwelling, and pupil-teacher ratio.

(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

Solution: The data suggests that more residents who are considered lower status of the population live in homes that have a lower median value.
The accessibility to radial highways decreases when the value of the home increases. The pupil-teacher ratio is higher in areas with lower valued homes. The per capita crime rate is higher in areas with lower valued homes.

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

Solution: Yes, the per capita crime rate is higher in areas with lower valued homes.

(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

Solution: The data suggests that the crime rates in Boston are especially high in areas with lower tax-rates. As noted previously, the pupil-teacher ratio is higher in areas with lower-values homes. Based on our findings, we could indicate that the pupil-teacher ratio is related to the crime rates in Boston.

(e) How many of the census tracts in this data set bound the Charles river?

## [1] 35

Solution: 35 suburbs in this data set bound to the Charles river

(f) What is the median pupil-teacher ratio among the towns in this data set?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

Solution: The median ratio of pupil-teacher among suburbs in this data set it 19.05.

(g) Which census tract of Boston has lowest median value of owner-occupied homes?

What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors?

Comment on your findings.

## [1] 5

##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5

Solution: South-Boston has the lowest median value of owner-occupied homes.

(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling?

Comment on the census tracts that average more than eight rooms per dwelling.

## [1] 64

## [1] 13

##        crim zn indus chas    nox    rm  age    dis rad tax ptratio lstat medv
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0  4.21 38.7
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7  3.32 50.0
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7  2.88 50.0
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4  4.14 44.8
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4  4.63 50.0
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4  3.13 37.6
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4  2.47 41.7
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4  3.95 48.3
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1  3.54 42.8
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0  5.12 50.0
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0  5.91 48.8
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0  7.44 50.0
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2  5.29 21.9

##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          lstat           medv     
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
##  Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0

Solution: 64 of the census tracts average more than seven rooms per dwelling. 13 of the census tracts average more than eight rooms per dwelling.We can see that the census tracts that average more than eight rooms per dwelling have a low crime rate and lower numbers of lower status residents.