Assignment 1

8A.

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

college<-read.csv('C:/Users/Steve/Documents/2021 Spring MSDA/STA-6543 Data Analytics Algorithms II/Lectures/ISLR Datasets/College.csv')
dim(college)

## [1] 777  19

8B.

Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.

fix(college)
rownames(college)=college[ ,1]
fix(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored.

college<-college[ ,-1]
dim(college)

## [1] 777  18

fix(college)
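As a side note, the copy-then-drop steps above can probably be folded into the read itself with the row.names argument of read.csv(). This is a minimal sketch on a made-up two-row file; the file contents and the name toy are stand-ins, not the real College.csv.

```r
# Write a tiny stand-in CSV to a temporary file (made-up values, not College.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Private,Apps",
             "School A,Yes,100",
             "School B,No,250"), tmp)

# row.names = 1 uses the first column as row names and leaves it out of the data
toy <- read.csv(tmp, row.names = 1)
dim(toy)        # rows, then remaining columns
rownames(toy)   # the school names
unlink(tmp)     # remove the temporary file
```

If this holds for the real file, dim() should report one column fewer than a plain read.csv() call, with no separate drop step.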

8C-i.

Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

##    Private               Apps           Accept          Enroll    
##  Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
##  Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
##  Mode  :character   Median : 1558   Median : 1110   Median : 434  
##                     Mean   : 3002   Mean   : 2019   Mean   : 780  
##                     3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
##                     Max.   :48094   Max.   :26330   Max.   :6392  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.0   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0  
##  Median :23.00   Median : 54.0   Median : 1707   Median :  353.0  
##  Mean   :27.56   Mean   : 55.8   Mean   : 3700   Mean   :  855.3  
##  3rd Qu.:35.00   3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0  
##  Max.   :96.00   Max.   :100.0   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7320   1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850  
##  Median : 9990   Median :4200   Median : 500.0   Median :1200  
##  Mean   :10441   Mean   :4358   Mean   : 549.4   Mean   :1341  
##  3rd Qu.:12925   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.50   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.00   Median : 82.0   Median :13.60   Median :21.00  
##  Mean   : 72.66   Mean   : 79.7   Mean   :14.09   Mean   :22.74  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate     
##  Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 8377   Median : 65.00  
##  Mean   : 9660   Mean   : 65.46  
##  3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :56233   Max.   :118.00

8C-ii.

Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

college$Private<-as.factor(college$Private)
pairs(college[ ,1:10])

8C-iii.

Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

plot(college$Private, college$Outstate, col='red', xlab='Private',ylab='Out-of-state Tuition ($)',
     main = 'Out-of-state Tuition by Private Status')
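plot() draws boxplots here because the x argument is a factor. The same picture can be requested explicitly with the formula interface of boxplot(). This is a sketch on made-up numbers (toy is a stand-in, not the College data), drawn to a null device so it runs without opening a window.

```r
# Made-up stand-in data (not the College values)
toy <- data.frame(
  Private  = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  Outstate = c(12000, 15000, 18000, 6000, 7000, 9000)
)

pdf(NULL)  # null graphics device: nothing is shown or written to disk
# Formula interface: one box of Outstate per level of Private
b <- boxplot(Outstate ~ Private, data = toy,
             xlab = "Private", ylab = "Out-of-state Tuition ($)")
invisible(dev.off())

b$names        # group labels, in factor-level order
b$stats[3, ]   # third row of the stats matrix holds each group's median
```

The returned list is handy when the medians are wanted as numbers rather than read off the plot.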

8C-iv.

Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.

Elite<-rep('No',nrow(college))
Elite[college$Top10perc>50]='Yes'
Elite<-as.factor(Elite)
college <- data.frame(college,Elite)
summary(college)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate      Elite    
##  Min.   : 10.00   No :699  
##  1st Qu.: 53.00   Yes: 78  
##  Median : 65.00            
##  Mean   : 65.46            
##  3rd Qu.: 78.00            
##  Max.   :118.00

# There are 78 Elite schools

plot(college$Elite, college$Outstate, col='green', xlab='Elite Status',ylab='Out-of-state Tuition ($)',
     main = 'Out-of-state Tuition by Elite Status')

# Out-of-state tuition for elite schools looks noticeably higher than for non-elite schools (judged from the boxplots; no formal test was run).
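The binning step can also be written in one line with ifelse(). This sketch uses a made-up vector (top10 is a stand-in for college$Top10perc) to check that both versions agree, including the boundary case of exactly 50.

```r
top10 <- c(12, 51, 50, 96, 30)   # made-up Top10perc values

# Two-step version, as in the text
elite_a <- rep("No", length(top10))
elite_a[top10 > 50] <- "Yes"
elite_a <- as.factor(elite_a)

# One-step version with ifelse(); exactly 50 does not exceed 50, so it stays "No"
elite_b <- factor(ifelse(top10 > 50, "Yes", "No"))

identical(elite_a, elite_b)   # do the two versions match?
table(elite_b)                # counts per group
```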

8C-v.

Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

par(mfrow=c(2,2))
# Histogram of graduation rate
hist(college$Grad.Rate, col='blue', breaks=30)
# Histogram of new students enrolled
hist(college$Enroll, col='red', breaks=30)
# Histogram of estimated book costs
hist(college$Books, col='green', breaks=50)
# Histogram of full-time undergraduates
hist(college$F.Undergrad, col='orange', breaks=25)

# Begone with you, par!
dev.off()

## null device 
##           1
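dev.off() resets the layout by closing the whole device. A gentler option, sketched below under the assumption that the device should stay open, is to save what par() returns and hand it back afterwards. The values in x are made up, and pdf(NULL) is only there so the sketch runs without a window.

```r
pdf(NULL)                               # null device: no window, no file
old_par <- par(mfrow = c(2, 2))         # par() returns the previous settings
x <- c(35, 242, 434, 780, 902, 6392)    # made-up values
hist(x, breaks = 5)
hist(x, breaks = 20)
par(old_par)                            # restore the saved layout
layout_after <- par("mfrow")            # back to a single panel
invisible(dev.off())
layout_after
```

Note that a single number passed to breaks is treated by hist() as a suggestion, so the bin count drawn may differ from the number requested.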

8C-vi.

Continue exploring the data, and provide a brief summary of what you discover.

Which university has the highest costs?

college$TotalCosts<- college$Outstate + college$Room.Board + college$Books
college[which.max(college$TotalCosts),]

# Yale, $26,980

Which university has the lowest acceptance rate?

college$AcceptRate<-college$Accept/college$Apps
college[which.min(college$AcceptRate), ]

# Princeton, at only 15.4%

Which university has the highest acceptance rate?

college[which.max(college$AcceptRate), ]

#Emporia State University, at 100%

Which university attracts the highest performers?

college[which.max(college$Top10perc), ]

#MIT, not a surprise
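One caveat on the lookups above: which.max() and which.min() return only the first position when several rows tie, so a 100% acceptance rate may be shared by schools that are not printed. A sketch on a made-up three-school table (toy and its values are stand-ins):

```r
toy <- data.frame(Apps = c(100, 200, 300), Accept = c(100, 150, 300),
                  row.names = c("School A", "School B", "School C"))
toy$AcceptRate <- toy$Accept / toy$Apps

# First maximum only
first_max <- rownames(toy)[which.max(toy$AcceptRate)]
# Every row tied for the maximum
all_max <- rownames(toy)[toy$AcceptRate == max(toy$AcceptRate)]

first_max
all_max
```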

9.

This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

auto <- read.csv("C:/Users/Steve/Documents/2021 Spring MSDA/STA-6543 Data Analytics Algorithms II/Lectures/ISLR Datasets/Auto.csv", 
                header=T, na.strings='?')
dim(auto)

## [1] 397   9

#getting rid of NA entries
auto <- na.omit(auto)
dim(auto)

## [1] 392   9
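To see why na.strings='?' matters, here is a small sketch on made-up rows (csv_text is a stand-in, not Auto.csv). Without the argument, a single "?" keeps the whole column from being read as numeric, so na.omit() has nothing to drop.

```r
csv_text <- "mpg,horsepower\n18,130\n25,?\n22,95"   # made-up rows

raw   <- read.csv(text = csv_text)                    # "?" is kept as text
clean <- read.csv(text = csv_text, na.strings = "?")  # "?" becomes NA

is.numeric(raw$horsepower)     # not numeric, because of the "?"
is.numeric(clean$horsepower)   # numeric, with one NA
nrow(na.omit(clean))           # rows left after dropping the NA row
```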

summary(auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577                     
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000

9-a.

Which of the predictors are quantitative, and which are qualitative?

Qualitative predictors: cylinders, origin, name. Quantitative predictors: mpg, displacement, horsepower, weight, acceleration, year.

9-b.

What is the range of each quantitative predictor?

auto.qualitative<-c(2,8,9) #create marker for the columns with qual data
sapply(auto[, -auto.qualitative], range) #sapply is awesome!

##       mpg displacement horsepower weight acceleration year
## [1,]  9.0           68         46   1613          8.0   70
## [2,] 46.6          455        230   5140         24.8   82

9-c.

What is the mean and standard deviation of each quantitative predictor?

sapply(auto[, -auto.qualitative], mean)

##          mpg displacement   horsepower       weight acceleration         year 
##     23.44592    194.41199    104.46939   2977.58418     15.54133     75.97959

sapply(auto[, -auto.qualitative], sd)

##          mpg displacement   horsepower       weight acceleration         year 
##     7.805007   104.644004    38.491160   849.402560     2.758864     3.683737
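The range, mean and sd calls can be collapsed into one sapply() that returns several statistics per column. This is only a sketch; mtcars (which ships with R) stands in for auto, and the four columns picked are arbitrary.

```r
# mtcars ships with R; it stands in for the auto data here
cars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# One anonymous function returning a named vector gives one row per statistic
stats <- sapply(cars, function(x) c(min = min(x), max = max(x),
                                    mean = mean(x), sd = sd(x)))
round(stats, 2)
```

With the data from this exercise, the equivalent call would presumably be sapply(auto[, -auto.qualitative], ...) with the same function.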

9-d.

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

auto.qualitative.d <- seq(10,85) #row positions to drop (10th through 85th), despite the "qualitative" in the name
sapply(auto[-auto.qualitative.d,-auto.qualitative], range)

##       mpg displacement horsepower weight acceleration year
## [1,] 11.0           68         46   1649          8.5   70
## [2,] 46.6          455        230   4997         24.8   82

sapply(auto[-auto.qualitative.d,-auto.qualitative], mean)

##          mpg displacement   horsepower       weight acceleration         year 
##     24.40443    187.24051    100.72152   2935.97152     15.72690     77.14557

sapply(auto[-auto.qualitative.d,-auto.qualitative], sd)

##          mpg displacement   horsepower       weight acceleration         year 
##     7.867283    99.678367    35.708853   811.300208     2.693721     3.106217
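A quick sanity check on the subset size: dropping the 10th through 85th observations removes 85 - 10 + 1 = 76 rows, so 392 - 76 = 316 should remain. The sketch below confirms the arithmetic on a stand-in data frame with the same row count (toy is hypothetical). Negative indices are positional, so the gaps that na.omit() leaves in the row names do not matter.

```r
n_total   <- 392              # rows left after na.omit(), per dim(auto) above
drop_rows <- 10:85            # same indices as seq(10, 85)
length(drop_rows)             # 85 - 10 + 1 rows dropped

toy  <- data.frame(id = seq_len(n_total))   # stand-in with the same row count
kept <- toy[-drop_rows, , drop = FALSE]     # negative index removes by position
nrow(kept)                                  # rows that remain
kept$id[10]                                 # first row after the gap
```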

9-e.

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

pairs(~mpg+displacement+horsepower+weight+acceleration+year, auto)

#I'm not using the categorical variables in this one

Observed positive correlations: displacement-horsepower; horsepower-weight; displacement-weight. Observed negative correlations: mpg-displacement; mpg-horsepower; mpg-weight.
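Directions read off a scatterplot matrix can be double-checked with cor(). This sketch uses mtcars as a stand-in (it ships with R and has comparable mpg, displacement, horsepower and weight columns), so the numbers are not the auto results.

```r
cars <- mtcars[, c("mpg", "disp", "hp", "wt")]   # stand-in for auto
cor_mat <- cor(cars)                              # pairwise Pearson correlations
round(cor_mat, 2)
```

For the data in this exercise, the analogous call would be cor(auto[, -auto.qualitative]).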

9-f.

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

plot(as.factor(auto$cylinders), auto$mpg)

plot(auto$displacement, auto$mpg)

plot(auto$weight, auto$mpg)

plot(auto$horsepower, auto$mpg)

plot(as.factor(auto$origin), auto$mpg)

It looks like MPG decreases as displacement, weight and HP increase. US-made vehicles (origin 1) appear to have lower typical MPG than Japanese-made vehicles (origin 3), but further examination is needed. There also looks to be a decreasing trend in MPG as the number of cylinders increases…except for that strange 3-cylinder group.
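The group comparisons read off the boxplots can be put into numbers with tapply(). Again this is a sketch on mtcars as a stand-in, where cyl plays the role of cylinders; it is not the auto output.

```r
# mtcars (ships with R) stands in for auto; cyl plays the role of cylinders
med_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)   # median mpg per group
med_by_cyl
```

The same pattern, tapply(auto$mpg, auto$origin, median), would presumably settle the origin comparison above.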

10-a.

How many rows in this data set? How many columns? What do the rows and columns represent?

library(MASS) #the Boston data set ships with the MASS package
Boston<-Boston
?Boston

## starting httpd help server ... done

dim(Boston)

## [1] 506  14

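506 rows. 14 columns. Each row is one suburb of Boston, and each column is a variable recorded for that suburb, as listed below.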

names(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

1 crim = per capita crime rate by town
2 zn = proportion of residential land zoned for lots over 25K SF
3 indus = proportion of non-retail business acres per town
4 chas = Charles River dummy variable (=1 if tract bounds river, 0 otherwise)
5 nox = nitrogen oxides conc (ppm)
6 rm = average number of rooms per dwelling
7 age = proportion of owner-occupied units built prior to 1940
8 dis = weighted mean of distances to five Boston employment centres
9 rad = index of accessibility to radial highways
10 tax = full-value property-tax rate per $10k
11 ptratio = pupil-teacher ratio by town
12 black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13 lstat = lower status of the population (%)
14 medv = median value of owner-occupied homes, in $1000s

10-b.

Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

Boston.cols.b<-c(1,7,8,9,10,12,13,14)
pairs(Boston[ , Boston.cols.b])

There appears to be a negative correlation between “age” and “dis.” Older houses are closer to the employment centers, and newer housing is built further out.

There appears to be a weak positive correlation between “age” and “lstat.” Those who are lower-status tend to live in older homes, but older houses may be bought by people from all levels of status (old-money neighborhoods, for instance).

There seems to be a relatively strong negative correlation between “lstat” and “medv.” Suburbs with a larger share of lower-status residents tend to have lower median home values.

10-c.

Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

Boston.correlations<- cor(Boston)
print(Boston.correlations[ ,1])

##        crim          zn       indus        chas         nox          rm 
##  1.00000000 -0.20046922  0.40658341 -0.05589158  0.42097171 -0.21924670 
##         age         dis         rad         tax     ptratio       black 
##  0.35273425 -0.37967009  0.62550515  0.58276431  0.28994558 -0.38506394 
##       lstat        medv 
##  0.45562148 -0.38830461

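Crime rate has the highest correlation with the index of accessibility to radial highways, “rad,” at 0.626, followed by “tax” (0.583) and “lstat” (0.456). All three are positive: suburbs with better highway access, higher property-tax rates, and a larger lower-status share tend to have higher per capita crime rates.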
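Picking the strongest correlation out of a printed vector is easy to get wrong by eye, so here is a sketch of a small helper that sorts by absolute value. rank_correlations is a hypothetical name, and mtcars is used as a stand-in so the block runs without extra packages.

```r
# Hypothetical helper: correlations of every column with `target`,
# strongest first, sign kept
rank_correlations <- function(df, target) {
  r <- cor(df)[, target]
  r <- r[names(r) != target]            # drop the variable's correlation with itself
  r[order(abs(r), decreasing = TRUE)]
}

ranked <- rank_correlations(mtcars, "mpg")   # mtcars stands in for Boston
head(ranked, 3)
```

For this exercise the call would be rank_correlations(Boston, "crim").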

10-d.

Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

Boston.10d<-c(1,10,11)

Boston[which.max(Boston$crim), ]

Boston[which.max(Boston$tax), ]

Boston[which.max(Boston$ptratio), ]

sapply(Boston[ ,Boston.10d], range)

##          crim tax ptratio
## [1,]  0.00632 187    12.6
## [2,] 88.97620 711    22.0
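The maximum alone does not say how many suburbs are "particularly high." One common rule of thumb, which is my assumption here and not something the exercise prescribes, is the boxplot rule: flag points more than 1.5 times the box length beyond the hinges. A sketch on made-up values:

```r
x <- c(1, 2, 3, 4, 100)            # made-up values with one extreme point
outliers <- boxplot.stats(x)$out   # points beyond 1.5 * IQR from the hinges
outliers
```

Applied to this exercise, length(boxplot.stats(Boston$crim)$out) would count the suburbs flagged for crime rate under that rule.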

10-e.

How many of the suburbs in this data set bound the Charles river?

sum(Boston$chas == 1)

## [1] 35

#35

10-f.

What is the median pupil-teacher ratio among the towns in this dataset?

median(Boston$ptratio)

## [1] 19.05

#19.05

10-g.

Which suburb of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

Boston[which.min(Boston$medv), ]

sapply(Boston, range)

##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio  black
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6   0.32
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
##      lstat medv
## [1,]  1.73    5
## [2,] 37.97   50

Suburb 399 has the lowest medv. Each entry below is its value / the overall maximum for that predictor.

crim - 399 is 38.4 / 89.0 per capita crime rate
zn - 399 has no land zoned for lots over 25K SF
indus - 399 is 18.1 / 27.7 proportion of non-retail business acres per town
chas - 399 is not on the Charles River
nox - 399 is 0.693 / 0.871 for NOx conc. (ppm)
rm - 399 is 5.5 / 8.8 for rooms per dwelling
age - 399 is 100 / 100: every owner-occupied unit was built prior to 1940
dis - 399 is 1.5 / 12.1 in weighted mean distance to the five employment centers
rad - 399 is 24 / 24 in accessibility to radial highways
tax - 399 is 666 / 711 in property-tax rate per $10k (they have high taxes!)
ptratio - 399 is 20.2 / 22 in pupil-teacher ratio
black - 399 is 396.9 / 396.9, the top of the range. Because black = 1000(Bk - 0.63)^2, that maximum corresponds to Bk near 0, so it does not mean the suburb is 100% black
lstat - 399 is 30.59 / 37.97 on % population lower status
medv - 399 is 5 / 50, the lowest median home value
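The black column is an index, 1000(Bk - 0.63)^2, so it cannot be read directly as a percentage. The sketch below inverts the formula for the two ends of the range shown above (396.90 and 0.32). black_to_bk is a hypothetical helper, and the small tolerance is only there to absorb floating-point error.

```r
# Hypothetical helper: solve black = 1000 * (Bk - 0.63)^2 for Bk,
# keeping only roots that are valid proportions (between 0 and 1)
black_to_bk <- function(black) {
  root <- sqrt(black / 1000)
  candidates <- c(0.63 - root, 0.63 + root)
  candidates[candidates >= -1e-9 & candidates <= 1 + 1e-9]
}

bk_max <- black_to_bk(396.90)   # the largest index value in the data
bk_min <- black_to_bk(0.32)     # the smallest index value in the data
bk_max
bk_min
```

The largest index value has a single valid root, essentially Bk = 0, while small index values have two valid roots on either side of 0.63 and so cannot be pinned down from the index alone.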

10-h.

In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

sum(Boston$rm > 7)

## [1] 64

#64
sum(Boston$rm > 8)

## [1] 13

#13

Boston.8rooms<-subset(Boston,rm > 8)
summary(Boston.8rooms)

##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          black      
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
##  Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
##      lstat           medv     
##  Min.   :2.47   Min.   :21.9  
##  1st Qu.:3.32   1st Qu.:41.7  
##  Median :4.14   Median :48.3  
##  Mean   :4.31   Mean   :44.2  
##  3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :7.44   Max.   :50.0

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

sapply(Boston.8rooms, range)

##         crim zn indus chas    nox    rm  age    dis rad tax ptratio  black
## [1,] 0.02009  0  2.68    0 0.4161 8.034  8.4 1.8010   2 224    13.0 354.55
## [2,] 3.47428 95 19.58    1 0.7180 8.780 93.9 8.9067  24 666    20.2 396.90
##      lstat medv
## [1,]  2.47 21.9
## [2,]  7.44 50.0

sapply(Boston, range)

##          crim  zn indus chas   nox    rm   age     dis rad tax ptratio  black
## [1,]  0.00632   0  0.46    0 0.385 3.561   2.9  1.1296   1 187    12.6   0.32
## [2,] 88.97620 100 27.74    1 0.871 8.780 100.0 12.1265  24 711    22.0 396.90
##      lstat medv
## [1,]  1.73    5
## [2,] 37.97   50

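Compared with all suburbs, the 8-room subset has a lower mean crime rate (0.72 vs 3.61), although its median is higher (0.52 vs 0.26). It has a lower share of industrial land (mean indus 7.08 vs 11.14), is somewhat closer to employment centers (mean dis 3.43 vs 3.80), has a much lower lstat (mean 4.31 vs 12.65), and has much higher median home values (mean medv 44.2 vs 22.5). Its black index is uniformly high (354.6 to 396.9); because black = 1000(Bk - 0.63)^2, values this high correspond to Bk close to 0, so they do not mean these suburbs are mostly black.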
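Scanning two summary() printouts side by side is error-prone. A sketch of a more compact comparison stacks the subset medians on top of the overall medians. compare_medians is a hypothetical helper, and mtcars with a weight cutoff stands in for Boston and the rm > 8 subset.

```r
# Hypothetical helper: medians of a subset stacked over medians of the full data
compare_medians <- function(sub, full) {
  rbind(subset = sapply(sub, median), overall = sapply(full, median))
}

heavy <- subset(mtcars, wt > 4)          # stand-in for subset(Boston, rm > 8)
cmp <- compare_medians(heavy, mtcars)
round(cmp[, c("mpg", "hp", "wt")], 2)
```

For this exercise the call would be compare_medians(Boston.8rooms, Boston).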