CAP 5703 Assignment#1 Danilo Martinez
8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are:
Private : Public/private indicator
Apps : Number of applications received
Accept : Number of applicants accepted
Enroll : Number of new students enrolled
Top10perc : New students from top 10% of high school class
Top25perc : New students from top 25% of high school class
F.Undergrad : Number of full-time undergraduates
P.Undergrad : Number of part-time undergraduates
Outstate : Out-of-state tuition
Room.Board : Room and board costs
Books : Estimated book costs
Personal : Estimated personal spending
PhD : Percent of faculty with Ph.D.'s
Terminal : Percent of faculty with terminal degree
S.F.Ratio : Student/faculty ratio
perc.alumni : Percent of alumni who donate
Expend : Instructional expenditure per student
Grad.Rate : Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor.
(a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
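# The College data is loaded from the ISLR package here instead of College.csv
library(ISLR)
college <- College
# attach() makes the columns (Apps, Outstate, ...) available by name
attach(college)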
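The exercise asks for read.csv(), while the code above takes the data from the ISLR package. A minimal sketch of the read.csv() route follows. It writes a small stand-in file with made-up rows so that it runs without College.csv; the names csv_path and college_demo are hypothetical, and the real file path would replace csv_path.

```r
# Stand-in for College.csv: two made-up rows written to a temporary file.
# Replace `csv_path` with the path to the real file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Private,Apps,Accept",
  "College A,Yes,1660,1232",
  "College B,No,2186,1924"
), csv_path)

# row.names = 1 uses the first column (the names) as row names directly,
# so the name column never becomes a data column.
college_demo <- read.csv(csv_path, row.names = 1, stringsAsFactors = TRUE)
print(college_demo)
```

With row.names = 1 the separate steps of assigning row names and dropping the first column are not needed.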
(b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don't really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
# The following command would assign the first column as row names, but this data set was already prepared
# rownames(college) = college[,1]
# The following command displays the data set in a text box
# fix(college)
# Calling the name of the data set prints it in the output
# college
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try
# This would remove the first column
# college = college[,-1]
# The following command displays the data set in a text box
# fix(college)
# Calling the name of the data set prints it in the output
# college
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
(c) i. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00
##            Median : 1558   Median : 1110   Median : 434   Median :23.00
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00
##    Top25perc      F.Undergrad     P.Undergrad         Outstate
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700
##    Room.Board       Books           Personal         PhD
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00
##     Terminal       S.F.Ratio      perc.alumni        Expend
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233
##    Grad.Rate
##  Min.   : 10.00
##  1st Qu.: 53.00
##  Median : 65.00
##  Mean   : 65.46
##  3rd Qu.: 78.00
##  Max.   :118.00
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[,1:10])
iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
plot(Private,Outstate, xlab = "Private", ylab ="Out of State tuition in USD", main = "Outstate Private Boxplots")
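When the x argument of plot() is a factor and y is numeric, R draws one box per factor level. The same grouping can be written with the formula interface of boxplot(), sketched here on made-up values (the data frame demo is hypothetical and is not the College data).

```r
# Made-up tuition values for illustration only (not the College data)
demo <- data.frame(
  Private  = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  Outstate = c(6000, 7500, 9000, 11000, 13000, 15000)
)

# Formula interface: numeric response ~ grouping factor.
# plot = FALSE returns the box statistics without drawing;
# drop that argument to draw the side-by-side boxplots.
bp <- boxplot(Outstate ~ Private, data = demo, plot = FALSE)
bp$names        # group labels
bp$stats[3, ]   # group medians (third row of the five-number summary)
```

The formula form does not rely on attach(), since the data frame is passed explicitly.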
iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Elite = rep("No", nrow(college))
Elite[Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
college = data.frame(college, Elite)
# Calling the name of the data set prints it in the output
# college
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00
##            Median : 1558   Median : 1110   Median : 434   Median :23.00
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00
##    Top25perc      F.Undergrad     P.Undergrad         Outstate
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700
##    Room.Board       Books           Personal         PhD
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00
##     Terminal       S.F.Ratio      perc.alumni        Expend
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233
##    Grad.Rate      Elite
##  Min.   : 10.00   No :699
##  1st Qu.: 53.00   Yes: 78
##  Median : 65.00
##  Mean   : 65.46
##  3rd Qu.: 78.00
##  Max.   :118.00
plot(Elite,Outstate, xlab = "Elite", ylab ="Out of State tuition in USD", main = "Outstate Elite Boxplots")
There are 78 Elite Universities.
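The binning and counting steps can also be written more compactly. The sketch below uses ifelse() and table() on made-up percentages; the vector top10 and the factor elite_demo are hypothetical names used only for illustration.

```r
# Made-up Top10perc values for illustration only
top10 <- c(12, 55, 48, 90, 50, 51)

# ifelse() bins in one step; a value exactly equal to 50 does not exceed 50,
# so it is labelled "No"
elite_demo <- factor(ifelse(top10 > 50, "Yes", "No"))

# table() counts the observations in each level
table(elite_demo)
```

Unlike summary() on the whole data frame, table() returns only the counts for the new variable.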
v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
par(mfrow = c(3,2))
hist(Apps,col = 2, xlab = "Applications", ylab = "Count")
hist(Accept,col = 3, xlab = "Accepted", ylab = "Count")
hist(Enroll,col = 4, xlab = "Enrolled", ylab = "Count")
hist(Top10perc,col = 5, xlab = "Top 10%", ylab = "Count")
hist(F.Undergrad,col = 7, xlab = "Full-time Undergrads", ylab = "Count")
hist(Outstate,col = 9, xlab = "Out of State", ylab = "Count")
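The exercise asks for differing numbers of bins, which hist() controls through its breaks argument. A minimal sketch on made-up right-skewed values (the vector x is simulated and is not the College data):

```r
set.seed(1)
x <- rexp(500, rate = 1 / 3000)  # made-up right-skewed values

# A single number for breaks is only a suggestion:
# hist() rounds to "pretty" cut points.
# plot = FALSE returns the bin counts without drawing.
h_coarse <- hist(x, breaks = 5,  plot = FALSE)
h_fine   <- hist(x, breaks = 50, plot = FALSE)

length(h_coarse$counts)  # number of bins actually used
length(h_fine$counts)
```

Dropping plot = FALSE draws the histograms, and a larger breaks value shows more detail in the long right tail.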
vi. Continue exploring the data, and provide a brief summary of what you discover.
# Rows with a graduation rate above 100%
highnumbers <- college[Grad.Rate > 100,]
nrow(highnumbers)
## [1] 1
row.names(highnumbers)
## [1] "Cazenovia College"
rm(list=ls())
From the scatterplots, one can see that some pairs of variables have a linear relationship, while others do not. The histograms show that very few of the variables have a normal distribution. PhD and Grad.Rate have maximum values greater than 100% (103% and 118%, respectively). These values were probably recorded incorrectly and should be examined further and possibly removed. Cazenovia College is the only entry with a graduation rate above 100%; the entry with PhD above 100% can be found the same way with college[PhD > 100,].
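Percentages above 100% in either column can be flagged in a single subset by combining the two conditions with the | operator. The sketch below uses a made-up data frame (demo and its four rows are hypothetical), not the College data.

```r
# Made-up rows for illustration only (not the College data)
demo <- data.frame(
  PhD       = c(75, 103, 88, 60),
  Grad.Rate = c(65, 80, 118, 90),
  row.names = c("College A", "College B", "College C", "College D")
)

# `|` keeps a row if either percentage exceeds 100
suspect <- demo[demo$PhD > 100 | demo$Grad.Rate > 100, ]
row.names(suspect)
```

Assigning two subsets to the same name one after the other keeps only the second, so a combined condition avoids losing rows.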
10. This exercise involves the Boston housing data set.
(a) To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
library (MASS)
boston<-Boston
Now the data set is contained in the object boston.
# To see the data set, call it below
# boston
attach(boston)
Read about the data set:
# Treat the Charles River dummy variable as a factor
chas <- as.factor(chas)
?Boston
How many rows are in this data set? How many columns? What do the rows and columns represent?
nrow(boston)
## [1] 506
ncol(boston)
## [1] 14
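The Boston data frame has 506 rows (observations) and 14 columns (variables). Each row represents one suburb of Boston, and each column is a measurement recorded for that suburb.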
Following are the variable names and what they represent:
crim : per capita crime rate by town.
zn : proportion of residential land zoned for lots over 25,000 sq.ft.
indus : proportion of non-retail business acres per town.
chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox : nitrogen oxides concentration (parts per 10 million).
rm : average number of rooms per dwelling.
age : proportion of owner-occupied units built prior to 1940.
dis : weighted mean of distances to five Boston employment centres.
rad : index of accessibility to radial highways.
tax : full-value property-tax rate per $10,000.
ptratio : pupil-teacher ratio by town.
black : 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat : lower status of the population (percent).
medv : median value of owner-occupied homes in $1000s.
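The dimensions and variable names can be checked in one place. A minimal sketch, assuming the MASS package is installed (it is normally shipped with R):

```r
library(MASS)  # provides the Boston data frame

dim(Boston)            # number of rows, then number of columns
names(Boston)          # variable names
sapply(Boston, class)  # storage type of each column
```

dim() returns the same two numbers as nrow() and ncol() in a single call.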
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
pairs(boston[,1:10])
It seems that zn and dis have a negative relationship with crim, while age seems to have a positive relationship.
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
par(mfrow=c(3,2))
plot(nox,crim)
plot(rm,crim)
plot(age,crim)
plot(dis,crim)
plot(zn,crim)
plot(tax,crim)
Crime rates appear to be higher in areas close to employment centers (low dis), in areas with older homes (high age), and in areas with little residential land zoned for lots over 25,000 sq.ft. (low zn). In other words, crime is higher in more urban or densely populated areas.
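The direction of each relationship read off the scatterplots can be cross-checked numerically. The sketch below computes the Pearson correlation of every other column with crim; the name crim_cor is hypothetical. Correlation only measures linear association, so it is a rough check rather than a replacement for the plots.

```r
library(MASS)

# Pearson correlation of every other column with crim.
# cor() returns a one-column matrix here; [, 1] turns it into a named vector.
crim_cor <- cor(Boston[, names(Boston) != "crim"], Boston$crim)[, 1]

# Sorted from most negative to most positive
round(sort(crim_cor), 2)
```

Negative values correspond to predictors that fall as crim rises, and positive values to predictors that rise with it.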
(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
par(mfrow=c(3,1))
hist(crim,breaks = 40)
nrow(boston[crim> 20,])
## [1] 18
hist(tax,breaks = 40)
nrow(boston[tax > 650,])
## [1] 137
hist(ptratio,breaks = 40)
nrow(boston[ptratio > 19,])
## [1] 253
From the histograms, one can see that a few observations stand out. Only 18 suburbs have a crime rate greater than 20. There are 137 suburbs with a tax rate greater than 650 and 253 suburbs with a pupil-teacher ratio greater than 19. Very high crime rates are therefore confined to a small number of suburbs, while high tax rates and high pupil-teacher ratios are much more common.
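The question also asks for a comment on the range of each predictor. A minimal sketch that collects the minimum and maximum of the three predictors in one table (the name rng is hypothetical):

```r
library(MASS)

# One column per predictor; range() returns the minimum, then the maximum
rng <- sapply(Boston[, c("crim", "tax", "ptratio")], range)
rownames(rng) <- c("min", "max")
rng
```

Comparing the maximum of each column with the bulk of its histogram shows how far the extreme suburbs sit from the rest.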
(e) How many of the suburbs in this data set bound the Charles river?
nrow(boston[chas==1,])
## [1] 35
There are 35 suburbs that bound the Charles River.
(f) What is the median pupil-teacher ratio among the towns in this data set?
median(ptratio)
## [1] 19.05
The median pupil-teacher ratio is 19.05.
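Both answers can also be computed without attach() by referring to the columns through the data frame. A minimal sketch, assuming the MASS package is installed:

```r
library(MASS)

# A logical comparison sums to the number of TRUE values
sum(Boston$chas == 1)

# Median pupil-teacher ratio across all suburbs
median(Boston$ptratio)
```

sum() on a logical vector counts matching rows directly, without building a subset first.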
(g) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
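min(medv)
## [1] 5
The lowest median value of owner-occupied homes is 5, that is, $5,000. Note that min() returns the value itself, not the row of the suburb. which.min() returns the first row that attains the minimum.
which.min(medv)
## [1] 399
range(tax)
## [1] 187 711
The tax rate ranges from 187 to 711 across all suburbs.
boston[which.min(medv),]$tax
## [1] 666
The tax rate for suburb 399 is 666, near the upper end of the range. The suburb with the lowest median home value therefore has one of the higher tax rates.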
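Locating the suburb with the lowest medv and comparing all of its predictors with the overall ranges can be sketched as follows. min() returns a value rather than a row, so the rows are found with which(); the names low_rows and overall are hypothetical.

```r
library(MASS)

# All rows that attain the minimum medv (there may be ties)
low_rows <- which(Boston$medv == min(Boston$medv))

# Overall minimum and maximum of every column, as a two-row data frame
overall <- as.data.frame(sapply(Boston, range))
rownames(overall) <- c("overall min", "overall max")

# Stack the lowest-value suburbs on top of the overall ranges
rbind(Boston[low_rows, ], overall)
```

Reading down each column shows whether the suburb sits near the bottom, the middle, or the top of that predictor's range.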
(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling?
nrow(boston[rm>7,])
## [1] 64
64 suburbs average more than 7 rooms per dwelling.
nrow(boston[rm>8,])
## [1] 13
13 suburbs average more than 8 rooms per dwelling.
Comment on the suburbs that average more than eight rooms per dwelling.
avg<-boston[rm>8,]
avg
##        crim zn indus chas    nox    rm  age    dis rad tax ptratio  black
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0 396.90
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7 388.45
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7 390.55
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4 385.05
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4 382.00
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4 387.38
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4 385.91
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4 378.95
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1 396.90
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0 389.70
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0 386.86
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0 384.54
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2 354.55
##     lstat medv
## 98   4.21 38.7
## 164  3.32 50.0
## 205  2.88 50.0
## 225  4.14 44.8
## 226  4.63 50.0
## 227  3.13 37.6
## 233  2.47 41.7
## 234  3.95 48.3
## 254  3.54 42.8
## 258  5.12 50.0
## 263  5.91 48.8
## 268  7.44 50.0
## 365  5.29 21.9
One thing that stands out immediately is that only 2 of the 13 suburbs with more than 8 rooms per dwelling bound the Charles River (suburbs 164 and 365). Furthermore, those 2 suburbs have a high proportion of non-retail business acres (indus of 19.58 and 18.10) and a high proportion of owner-occupied units built prior to 1940 (age of 93.9 and 82.9).
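The suburbs with more than eight rooms per dwelling can also be compared with the full data set through a few medians. A minimal sketch, assuming the MASS package is installed; the names big and cols and the choice of the three columns are illustrative only.

```r
library(MASS)

# Suburbs averaging more than eight rooms per dwelling
big <- Boston[Boston$rm > 8, ]

# Medians for those suburbs next to the medians for all suburbs
cols <- c("crim", "lstat", "medv")
rbind(
  more_than_8_rooms = sapply(big[, cols], median),
  all_suburbs       = sapply(Boston[, cols], median)
)
```

Medians are used here instead of means so that a single unusual suburb has less influence on the comparison.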