Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
college<-read.csv('C:/Users/Steve/Documents/2021 Spring MSDA/STA-6543 Data Analytics Algorithms II/Lectures/ISLR Datasets/College.csv')
dim(college)
## [1] 777 19
Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later.
fix(college)
rownames(college)=college[ ,1]
fix(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored.
college<-college[ ,-1]
dim(college)
## [1] 777 18
fix(college)
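The two steps above (copying the first column into the row names, then dropping it) can also be done at read time. The following is a minimal sketch, assuming a tiny inline CSV in place of College.csv; the school names and numbers are invented for illustration only.

```r
# Hypothetical three-row stand-in for College.csv (values are invented).
csv_text <- "Name,Private,Apps
Alpha College,Yes,1200
Beta University,No,5400
Gamma Institute,Yes,800"

# row.names = 1 tells read.csv() to use the first column as row names,
# so it is never treated as a data column.
toy <- read.csv(text = csv_text, row.names = 1)

dim(toy)        # 3 rows, 2 data columns
rownames(toy)   # the school names
```

With the real file, the same argument would be added to the existing read.csv() call, which makes the later college[ ,-1] step unnecessary.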
Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
college$Private<-as.factor(college$Private)
pairs(college[ ,1:10])
Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
plot(college$Private, college$Outstate, col='red', xlab='Private',ylab='Out-of-state Tuition ($)',
     main = 'Out-of-state Tuition by Private Status')
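Calling plot() with a factor on the x-axis draws boxplots. The same figure can be requested explicitly with boxplot() and a formula, which some readers find easier to read. This is a sketch on made-up data; the toy data frame and its values are assumptions, not rows from College.csv.

```r
# Hypothetical toy data: tuition for a few private and public schools.
toy <- data.frame(
  Private  = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  Outstate = c(15000, 18000, 12000, 7000, 9000, 6500)
)

# Formula interface: numeric response on the left, grouping factor on the right.
bp <- boxplot(Outstate ~ Private, data = toy,
              xlab = "Private", ylab = "Out-of-state Tuition ($)")

bp$names   # group labels, in factor-level order
```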
Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.
Elite<-rep('No',nrow(college))
Elite[college$Top10perc>50]='Yes'
Elite<-as.factor(Elite)
college <- data.frame(college,Elite)
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate Elite
## Min. : 10.00 No :699
## 1st Qu.: 53.00 Yes: 78
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
# There are 78 Elite schools
plot(college$Elite, college$Outstate, col='green', xlab='Elite Status',ylab='Out-of-state Tuition ($)',
     main = 'Out-of-state Tuition by Elite Status')
# Out-of-state tuition looks noticeably higher for elite schools than for non-elite schools.
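The binning step used to build Elite can also be written in one line with ifelse(). The sketch below uses a short, made-up vector in place of college$Top10perc to show how the threshold behaves at exactly 50.

```r
# Hypothetical Top10perc values for six schools.
top10 <- c(12, 51, 50, 75, 30, 96)

# Strictly greater than 50 counts as elite, matching the rule in the text,
# so a school at exactly 50 is labeled "No".
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```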
Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow=c(2,2))
# Histogram of graduation rate
hist(college$Grad.Rate, col='blue', breaks=30)
# Histogram of new students enrolled
hist(college$Enroll, col='red', breaks=30)
# Histogram of estimated book costs
hist(college$Books, col='green', breaks=50)
# Histogram of number of full-time undergraduates
hist(college$F.Undergrad, col='orange', breaks=25)
# Begone with you, par! (dev.off() closes the graphics device, which also resets the mfrow layout)
dev.off()
## null device
## 1
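dev.off() resets the layout by closing the whole graphics device. An alternative that keeps the device open is to save the old settings returned by par() and restore them afterwards. A minimal sketch on simulated data (the vector x is made up, not a College variable):

```r
set.seed(1)                       # reproducible fake data
x <- rnorm(200)

op <- par(mfrow = c(2, 2))        # par() returns the previous settings
for (b in c(5, 10, 20, 40)) {
  # breaks is a suggestion; hist() may round it to a "pretty" number of bins
  hist(x, breaks = b, main = paste("breaks =", b))
}
par(op)                           # restore the original layout

par("mfrow")                      # back to a single panel
```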
Continue exploring the data, and provide a brief summary of what you discover.
Which university has highest costs?
college$TotalCosts<- college$Outstate + college$Room.Board + college$Books
college[which.max(college$TotalCosts),]
# Yale, $26,980
Which university has the lowest acceptance rate?
college$AcceptRate<-college$Accept/college$Apps
college[which.min(college$AcceptRate), ]
# Princeton, at only 15.4%
Which university has the highest acceptance rate?
college[which.max(college$AcceptRate), ]
#Emporia State University, at 100%
Which university attracts the highest performers?
college[which.max(college$Top10perc), ]
#MIT, not a surprise
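One caveat on the lookups above: which.max() and which.min() return only the first matching row, so ties are silently hidden (several schools could share a 100% acceptance rate, for example). The sketch below uses made-up rates to show how all tied rows can be listed.

```r
# Hypothetical acceptance rates; two schools tie at 1.00.
rate <- c(A = 0.40, B = 1.00, C = 0.75, D = 1.00)

which.max(rate)                    # first maximum only
names(which(rate == max(rate)))    # every school at the maximum
```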
This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
auto <- read.csv("C:/Users/Steve/Documents/2021 Spring MSDA/STA-6543 Data Analytics Algorithms II/Lectures/ISLR Datasets/Auto.csv",
                 header=TRUE, na.strings='?')
dim(auto)
## [1] 397 9
#getting rid of NA entries
auto <- na.omit(auto)
dim(auto)
## [1] 392 9
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
Which of the predictors are quantitative, and which are qualitative?
Qualitative predictors: cylinders, origin, name.
Quantitative predictors: mpg, displacement, horsepower, weight, acceleration, year.
What is the range of each quantitative predictor?
auto.qualitative<-c(2,8,9) #create marker for the columns with qual data
sapply(auto[, -auto.qualitative], range) #sapply is awesome!
## mpg displacement horsepower weight acceleration year
## [1,] 9.0 68 46 1613 8.0 70
## [2,] 46.6 455 230 5140 24.8 82
What is the mean and standard deviation of each quantitative predictor?
sapply(auto[, -auto.qualitative], mean)
## mpg displacement horsepower weight acceleration year
## 23.44592 194.41199 104.46939 2977.58418 15.54133 75.97959
sapply(auto[, -auto.qualitative], sd)
## mpg displacement horsepower weight acceleration year
## 7.805007 104.644004 38.491160 849.402560 2.758864 3.683737
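Selecting columns by position (2, 8, 9) breaks if the column order ever changes. A sketch of the same idea using column names, on a made-up four-row stand-in for the Auto data (the rows and the helper names qual_cols and quant_cols are illustrative assumptions):

```r
# Hypothetical four-row stand-in for the Auto data.
toy <- data.frame(
  mpg       = c(18, 15, 36, 27),
  cylinders = c(8, 8, 4, 4),
  weight    = c(3504, 3693, 2130, 2790),
  origin    = c(1, 1, 3, 2),
  name      = c("car a", "car b", "car c", "car d")
)

qual_cols  <- c("cylinders", "origin", "name")   # treated as qualitative
quant_cols <- setdiff(names(toy), qual_cols)

# One row per statistic, one column per quantitative variable.
stats <- sapply(toy[quant_cols], function(x)
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))
stats
```

Returning several statistics from one function gives range, mean, and standard deviation in a single table instead of three separate sapply() calls.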
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
auto.qualitative.d <- seq(10,85) #row positions 10 through 85 to drop (rows, not columns, despite the name)
sapply(auto[-auto.qualitative.d,-auto.qualitative], range)
## mpg displacement horsepower weight acceleration year
## [1,] 11.0 68 46 1649 8.5 70
## [2,] 46.6 455 230 4997 24.8 82
sapply(auto[-auto.qualitative.d,-auto.qualitative], mean)
## mpg displacement horsepower weight acceleration year
## 24.40443 187.24051 100.72152 2935.97152 15.72690 77.14557
sapply(auto[-auto.qualitative.d,-auto.qualitative], sd)
## mpg displacement horsepower weight acceleration year
## 7.867283 99.678367 35.708853 811.300208 2.693721 3.106217
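A quick sanity check on the row removal: negative indices drop rows by position, not by row name, which matters here because na.omit() leaves gaps in the row names. The sketch below uses a stand-in data frame with the same row count as the cleaned Auto data; the id column is invented.

```r
# Stand-in with the same number of rows as the cleaned Auto data (392).
toy <- data.frame(id = 1:392)

drop_rows <- 10:85                        # positions 10 through 85 inclusive
kept <- toy[-drop_rows, , drop = FALSE]   # negative index removes those rows

length(drop_rows)   # 76 rows removed
nrow(kept)          # 316 rows remain
```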
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
pairs(~mpg+displacement+horsepower+weight+acceleration+year, auto)
#I'm not using the categorical variables in this one
Observe positive correlations among: Displacement-Horsepower; Horsepower-weight; Displacement-weight. Observe negative correlations among: mpg-displacement; mpg-horsepower; mpg-weight.
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
plot(as.factor(auto$cylinders), auto$mpg)
plot(auto$displacement, auto$mpg)
plot(auto$weight, auto$mpg)
plot(auto$horsepower, auto$mpg)
plot(as.factor(auto$origin), auto$mpg)
MPG decreases as displacement, weight, and horsepower increase, so all three look useful for predicting mpg. The typical MPG of US-made vehicles (origin 1) looks lower than that of Japanese-made vehicles (origin 3), but this needs further examination. MPG also trends downward as the number of cylinders increases, except for the small 3-cylinder group, which does not follow the trend.
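Visual impressions like these can be backed with correlation coefficients. Since Auto.csv is not bundled with R, this sketch uses the built-in mtcars data as a stand-in; it has similar variables (mpg, cyl, disp, hp, wt) but is a different data set, so the numbers will not match Auto.

```r
# mtcars ships with base R; it is a stand-in here, not the Auto data.
vars <- c("cyl", "disp", "hp", "wt")

# Correlation of mpg with each candidate predictor.
r <- sapply(mtcars[vars], function(x) cor(mtcars$mpg, x))
round(sort(r), 2)
```

On the Auto data, the same pattern would apply with auto$mpg and the columns displacement, horsepower, and weight.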
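This exercise involves the Boston housing data set, which is part of the MASS library.
How many rows in this data set? How many columns? What do the rows and columns represent?
library(MASS) #the Boston data set lives in the MASS package
Boston<-Boston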
?Boston
## starting httpd help server ... done
dim(Boston)
## [1] 506 14
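506 rows and 14 columns. Each row is one suburb of Boston; each column is a variable recorded for that suburb, described below.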
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
1 crim = per capita crime rate by town
2 zn = proportion of residential land zoned for lots over 25K SF
3 indus = proportion of non-retail business acres per town
4 chas = Charles River dummy variable (=1 if tract bounds river, 0 otherwise)
5 nox = nitrogen oxides concentration (ppm)
6 rm = average number of rooms per dwelling
7 age = proportion of owner-occupied units built prior to 1940
8 dis = weighted mean of distances to five Boston employment centres
9 rad = index of accessibility to radial highways
10 tax = full-value property-tax rate per $10,000
11 ptratio = pupil-teacher ratio by town
12 black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13 lstat = lower status of the population (%)
14 medv = median value of owner-occupied homes, in $1000s
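Because the black variable is a quadratic in Bk, a larger index does not mean a larger proportion. A small worked example of the formula as given in the variable list; the function name black_index is a hypothetical helper, not part of MASS.

```r
# Index as defined in the variable list: 1000 * (Bk - 0.63)^2,
# where Bk is a proportion between 0 and 1.
black_index <- function(bk) 1000 * (bk - 0.63)^2

black_index(0)      # 396.9, the largest value possible for Bk in [0, 1]
black_index(0.63)   # 0, the smallest possible value
black_index(1)      # 136.9
```

The maximum of the index is reached at Bk = 0, so values near 396.9 indicate a proportion close to zero.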
Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
Boston.cols.b<-c(1,7,8,9,10,12,13,14)
pairs(Boston[ , Boston.cols.b])
There appears to be a negative correlation between “age” and “dis.” Older houses tend to be closer to the employment centers, and newer housing is built farther out.
There appears to be a positive correlation between “age” and “lstat.” Lower-status residents tend to live in areas with older homes, but older houses may be bought by people from all levels of status (old-money neighborhoods, for instance).
There seems to be a relatively strong negative correlation between “lstat” and “medv.” Suburbs with a larger share of lower-status residents tend to have lower median home values.
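The strength of the relationships read off the scatterplot matrix can be checked numerically. A minimal sketch, assuming the MASS package is installed; the object name cm is arbitrary.

```r
library(MASS)   # provides the Boston data frame

vars <- c("age", "dis", "lstat", "medv")

# Pairwise Pearson correlations among the four variables discussed above.
cm <- cor(Boston[, vars])
round(cm, 2)
```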
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Boston.correlations<- cor(Boston)
print(Boston.correlations[ ,1])
## crim zn indus chas nox rm
## 1.00000000 -0.20046922 0.40658341 -0.05589158 0.42097171 -0.21924670
## age dis rad tax ptratio black
## 0.35273425 -0.37967009 0.62550515 0.58276431 0.28994558 -0.38506394
## lstat medv
## 0.45562148 -0.38830461
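Yes. Crime rate has the highest correlation with the index of accessibility to radial highways, “rad,” at 0.626, followed by “tax” (0.583), “lstat” (0.456), “nox” (0.421), and “indus” (0.407). The strongest negative correlations are with “medv” (-0.388), “black” (-0.385), and “dis” (-0.380): crime tends to be higher in suburbs with lower home values that are closer to the employment centers.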
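Ranking the correlations by absolute size makes the strongest associations with crim easier to spot than scanning the printed column. A sketch assuming MASS is installed; crim_cor is an illustrative name.

```r
library(MASS)

crim_cor <- cor(Boston)[, "crim"]
crim_cor <- crim_cor[names(crim_cor) != "crim"]   # drop the self-correlation

# Largest absolute correlation first.
round(crim_cor[order(abs(crim_cor), decreasing = TRUE)], 3)
```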
Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
Boston.10d<-c(1,10,11)
Boston[which.max(Boston$crim), ]
Boston[which.max(Boston$tax), ]
Boston[which.max(Boston$ptratio), ]
sapply(Boston[ ,Boston.10d], range)
## crim tax ptratio
## [1,] 0.00632 187 12.6
## [2,] 88.97620 711 22.0
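The range alone does not show whether the extremes are typical. Comparing the median with the minimum and maximum, and counting suburbs in the upper tail, gives a fuller picture. A sketch assuming MASS is installed; the 95th-percentile cutoff is an arbitrary choice for illustration.

```r
library(MASS)

vars <- c("crim", "tax", "ptratio")

# Minimum, median, and maximum for each of the three predictors.
spread <- sapply(Boston[vars], function(x)
  c(min = min(x), median = median(x), max = max(x)))
spread

# How many suburbs sit at or above the 95th percentile of crime rate?
sum(Boston$crim >= quantile(Boston$crim, 0.95))
```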
How many of the suburbs in this data set bound the Charles river?
sum(Boston$chas == 1)
## [1] 35
#35
What is the median pupil-teacher ratio among the towns in this dataset?
median(Boston$ptratio)
## [1] 19.05
#19.05
Which suburb of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
Boston[which.min(Boston$medv), ]
sapply(Boston[ , ], range)
## crim zn indus chas nox rm age dis rad tax ptratio black
## [1,] 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6 0.32
## [2,] 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0 396.90
## lstat medv
## [1,] 1.73 5
## [2,] 37.97 50
Suburb 399 has the lowest medv. Its values, written as value / maximum in the data set:
crim - 399 is 38.4 / 89.0 in per capita crime rate
zn - 399 has no land zoned for lots over 25K SF
indus - 399 is 18.1 / 27.7 in proportion of non-retail business acres per town
chas - 399 does not bound the Charles River
nox - 399 is 0.693 / 0.871 in NOx concentration (ppm)
rm - 399 is 5.5 / 8.8 in rooms per dwelling
age - 399 has all owner-occupied units built prior to 1940
dis - 399 is 1.5 / 12.1 in weighted mean distance to five employment centers
rad - 399 is 24 / 24 in accessibility to radial highways
tax - 399 is 666 / 711 in property-tax rate per $10k (they have high taxes!)
ptratio - 399 is 20.2 / 22 in pupil-teacher ratio
black - 399 is at the maximum of the black index. Since black = 1000(Bk - 0.63)^2, the maximum corresponds to Bk near 0 (very few Black residents), not to a 100% Black population.
lstat - 399 is 30.59 / 37.97 in % lower-status population
medv - 399 is 5 / 50, the lowest median home value.
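Comparing each value with the maximum can be complemented by a percentile rank, which shows where the suburb falls relative to all 506. A sketch assuming MASS is installed; low and pct are illustrative names.

```r
library(MASS)

low <- Boston[which.min(Boston$medv), ]   # first row with the minimum medv

# Share of suburbs whose value is at or below this suburb's value.
pct <- sapply(names(Boston), function(v) mean(Boston[[v]] <= low[[v]]))
round(100 * pct, 1)
```

A value of 100 means no suburb exceeds this one on that variable; a value near 0 means almost every suburb does.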
In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
sum(Boston$rm > 7)
## [1] 64
#64
sum(Boston$rm > 8)
## [1] 13
#13
Boston.8rooms<-subset(Boston,rm > 8)
summary(Boston.8rooms)
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio black
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :354.6
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:384.5
## Median : 7.000 Median :307.0 Median :17.40 Median :386.9
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :385.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:389.7
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :396.9
## lstat medv
## Min. :2.47 Min. :21.9
## 1st Qu.:3.32 1st Qu.:41.7
## Median :4.14 Median :48.3
## Mean :4.31 Mean :44.2
## 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :7.44 Max. :50.0
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
sapply(Boston.8rooms[ , ], range)
## crim zn indus chas nox rm age dis rad tax ptratio black
## [1,] 0.02009 0 2.68 0 0.4161 8.034 8.4 1.8010 2 224 13.0 354.55
## [2,] 3.47428 95 19.58 1 0.7180 8.780 93.9 8.9067 24 666 20.2 396.90
## lstat medv
## [1,] 2.47 21.9
## [2,] 7.44 50.0
sapply(Boston[ , ], range)
## crim zn indus chas nox rm age dis rad tax ptratio black
## [1,] 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6 0.32
## [2,] 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0 396.90
## lstat medv
## [1,] 1.73 5
## [2,] 37.97 50
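Compared with the full data set, the 13 suburbs averaging more than eight rooms have a much lower mean and maximum crime rate (mean 0.72 vs. 3.61), although their median crime rate is slightly higher (0.52 vs. 0.26). They have less industrial land (mean indus 7.08 vs. 11.14) and are somewhat closer to the employment centers (mean dis 3.43 vs. 3.80). Their black index is uniformly high (354.6 to 396.9), which under black = 1000(Bk - 0.63)^2 indicates a low proportion of Black residents, not a high one. They have a much lower share of lower-status population (maximum lstat 7.44 vs. an overall median of 11.36) and much higher median home values (median medv 48.3 vs. 21.2).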
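Reading two summary() printouts side by side is error-prone. Stacking the medians of the subset and the full data into one table makes the comparison direct. A sketch assuming MASS is installed; big and cmp are illustrative names.

```r
library(MASS)

big <- subset(Boston, rm > 8)   # suburbs averaging more than eight rooms

# One row per group, one column per variable.
cmp <- rbind(all     = sapply(Boston, median),
             rm_gt_8 = sapply(big, median))
round(cmp, 2)
```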