CAP 5703 Assignment#1 3623774 Danilo Martinez

8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are:

Private : Public/private indicator .

Apps : Number of applications received .

Accept : Number of applicants accepted .

Enroll : Number of new students enrolled .

Top10perc : New students from top 10% of high school class .

Top25perc : New students from top 25% of high school class .

F.Undergrad : Number of full-time undergraduates .

P.Undergrad : Number of part-time undergraduates .

Outstate : Out-of-state tuition .

Room.Board : Room and board costs .

Books : Estimated book costs .

Personal : Estimated personal spending .

PhD : Percent of faculty with Ph.D.'s .

Terminal : Percent of faculty with terminal degree .

S.F.Ratio : Student/faculty ratio .

perc.alumni : Percent of alumni who donate .

Expend : Instructional expenditure per student .

Grad.Rate : Graduation rate

Before reading the data into R, it can be viewed in Excel or a text editor.

(a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

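# The College data also ships with the ISLR package, so it is loaded from there rather than with read.csv()
library(ISLR)
college <- College
rm(College)
# attach() makes the columns (Apps, Outstate, ...) available by name
attach(college)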

(b) Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don't really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

#The following command would assign the first column as rownames, but this data set already was prepared
#rownames (college )=college [,1]
#Following command displays the dataset in a text box
#fix (college)
#by calling the name of the dataset, we see it in the output
#college

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

#This would remove the first column
#college =college [,-1]
#Following command displays the dataset in a text box
#fix(college)
#by calling the name of the dataset, we see it in the output
#college
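
For readers who load College.csv from disk rather than from the ISLR package, the read.csv() route described in parts (a) and (b) can be sketched as follows. This is only an illustration: the three-row file written to a temporary path is a made-up stand-in (the rows, the `csv_path` name, and the `toy` name are assumptions, not the real data). With the real file, `csv_path` would point at College.csv.

```r
# Hypothetical three-row stand-in for College.csv, written to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Private,Apps,Accept",
  "College A,Yes,1660,1232",
  "College B,Yes,2186,1924",
  "College C,No,1428,1097"
), csv_path)

# Read the file; the first column holds the school names
toy <- read.csv(csv_path, stringsAsFactors = TRUE)

# Use the names as row names, then drop the name column from the data
rownames(toy) <- toy[, 1]
toy <- toy[, -1]

head(toy)
```

Passing row.names = 1 to read.csv() should give the same result in a single call.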

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

(c) i. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

## Private        Apps           Accept          Enroll       Top10perc
## No :212   Min.   :   81   Min.   :   72   Min.   : 35   Min.   : 1.00
## Yes:565   1st Qu.: 776   1st Qu.: 604   1st Qu.: 242   1st Qu.:15.00
##            Median : 1558   Median : 1110   Median : 434   Median :23.00
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00
##    Top25perc      F.Undergrad     P.Undergrad         Outstate
## Min.   : 9.0   Min.   : 139   Min.   :    1.0   Min.   : 2340
## 1st Qu.: 41.0   1st Qu.: 992   1st Qu.:   95.0   1st Qu.: 7320
## Median : 54.0   Median : 1707   Median : 353.0   Median : 9990
## Mean   : 55.8   Mean   : 3700   Mean   : 855.3   Mean   :10441
## 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.: 967.0   3rd Qu.:12925
## Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700
##    Room.Board       Books           Personal         PhD
## Min.   :1780   Min.   : 96.0   Min.   : 250   Min.   : 8.00
## 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00
## Median :4200   Median : 500.0   Median :1200   Median : 75.00
## Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66
## 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00
## Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00
##     Terminal       S.F.Ratio      perc.alumni        Expend
## Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186
## 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751
## Median : 82.0   Median :13.60   Median :21.00   Median : 8377
## Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660
## 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830
## Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233
##    Grad.Rate
## Min.   : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean   : 65.46
## 3rd Qu.: 78.00
## Max.   :118.00

ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

pairs(college[,1:10])

iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

plot(Private,Outstate, xlab = "Private", ylab ="Out of State tuition in USD", main = "Outstate Private Boxplots")
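
plot() draws boxplots here because the x variable is a factor. As a hedged alternative that does not depend on attach(), the formula interface of boxplot() can be used. The data frame below is a small made-up stand-in for college (the `toy` name and all values are assumptions for illustration).

```r
# Made-up stand-in for the college data (values are illustrative only)
toy <- data.frame(
  Private  = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  Outstate = c(12000, 15000, 18000, 6000, 7000, 9500)
)

# One box per level of Private; data = avoids the need for attach()
bp <- boxplot(Outstate ~ Private, data = toy,
              xlab = "Private", ylab = "Out-of-state tuition in USD",
              main = "Outstate by Private")

# boxplot() also returns the plotted statistics; row 3 holds the group medians
bp$names
bp$stats[3, ]
```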

iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Elite =rep ("No",nrow(college))
Elite [Top10perc >50]="Yes"
Elite =as.factor (Elite)
college =data.frame(college,Elite)
#by calling the name of the dataset, we see it in the output
#college
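
The same binning can be written in one step with ifelse(), and table() then gives the group counts directly. This is a sketch on made-up percentages (the `top10` vector is an assumption, not the real Top10perc column).

```r
# Made-up Top10perc values (illustrative only)
top10 <- c(23, 16, 60, 51, 50, 89, 12)

# "Yes" when strictly more than 50% of new students come from the top 10%
elite <- factor(ifelse(top10 > 50, "Yes", "No"), levels = c("No", "Yes"))

# Count the schools in each group
table(elite)
```

Note that a value of exactly 50 falls in the "No" group, matching the "exceeds 50%" wording of the exercise.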

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

summary(college)

## Private        Apps           Accept          Enroll       Top10perc
## No :212   Min.   :   81   Min.   :   72   Min.   : 35   Min.   : 1.00
## Yes:565   1st Qu.: 776   1st Qu.: 604   1st Qu.: 242   1st Qu.:15.00
##            Median : 1558   Median : 1110   Median : 434   Median :23.00
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00
##    Top25perc      F.Undergrad     P.Undergrad         Outstate
## Min.   : 9.0   Min.   : 139   Min.   :    1.0   Min.   : 2340
## 1st Qu.: 41.0   1st Qu.: 992   1st Qu.:   95.0   1st Qu.: 7320
## Median : 54.0   Median : 1707   Median : 353.0   Median : 9990
## Mean   : 55.8   Mean   : 3700   Mean   : 855.3   Mean   :10441
## 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.: 967.0   3rd Qu.:12925
## Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700
##    Room.Board       Books           Personal         PhD
## Min.   :1780   Min.   : 96.0   Min.   : 250   Min.   : 8.00
## 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00
## Median :4200   Median : 500.0   Median :1200   Median : 75.00
## Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66
## 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00
## Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00
##     Terminal       S.F.Ratio      perc.alumni        Expend
## Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186
## 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751
## Median : 82.0   Median :13.60   Median :21.00   Median : 8377
## Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660
## 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830
## Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233
##    Grad.Rate      Elite
## Min.   : 10.00   No :699
## 1st Qu.: 53.00   Yes: 78
## Median : 65.00
## Mean   : 65.46
## 3rd Qu.: 78.00
## Max.   :118.00

plot(Elite,Outstate, xlab = "Elite", ylab ="Out of State tuition in USD", main = "Outstate Elite Boxplots")

There are 78 Elite Universities.

v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

par(mfrow = c(3,2))
hist(Apps,col = 2, xlab = "Applications", ylab = "Count")
hist(Accept,col = 3, xlab = "Accepted", ylab = "Count")
hist(Enroll,col = 4, xlab = "Enrolled", ylab = "Count")
hist(Top10perc,col = 5, xlab = "Top 10%", ylab = "Count")
hist(F.Undergrad,col = 7, xlab = "Full-time Undergrads", ylab = "Count")
hist(Outstate,col = 9, xlab = "Out of State", ylab = "Count")
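
The exercise asks for differing numbers of bins, which hist() controls through its breaks argument. A minimal sketch on simulated data follows (the `x` vector is simulated and only stands in for a skewed variable such as Apps).

```r
set.seed(1)
# Simulated right-skewed values standing in for a variable such as Apps
x <- rexp(200, rate = 1 / 3000)

par(mfrow = c(1, 2))
# A single number is only a suggestion; R picks "pretty" break points near it
h_few <- hist(x, breaks = 5, main = "About 5 bins", xlab = "x")

# A vector of break points is used exactly as given
edges <- seq(0, ceiling(max(x) / 1000) * 1000, length.out = 21)
h_many <- hist(x, breaks = edges, main = "Exactly 20 bins", xlab = "x")

length(h_many$counts)  # 20 bins from 21 break points
```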

vi. Continue exploring the data, and provide a brief summary of what you discover.

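# Rows with an impossible PhD percentage, stored under their own name so the next filter does not overwrite them
highphd<- college[PhD > 100,]
# Rows with an impossible graduation rate
highnumbers<- college[Grad.Rate > 100,]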
nrow(highnumbers)

## [1] 1

row.names(highnumbers)

## [1] "Cazenovia College"

rm(list=ls())

From the scatterplots, one can see that some pairs of variables have a linear relationship, while others do not. The histograms show that very few of the variables have a normal distribution. PhD and Grad.Rate have maximum values greater than 100% (103% and 118%, respectively). These values were probably recorded incorrectly and should be studied further and possibly removed. The row listed above comes from the Grad.Rate > 100 filter, so Cazenovia College accounts for the 118% graduation rate; the school with the 103% PhD value has to be looked up separately with the PhD > 100 filter.
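
Both impossible percentages can be caught in a single filter by combining the two conditions with |. The sketch below uses a made-up data frame (the school names and values are assumptions, chosen only so that one row breaks each rule).

```r
# Made-up stand-in for college; names and values are illustrative only
toy <- data.frame(
  PhD       = c(70, 103, 85, 22),
  Grad.Rate = c(60, 67, 95, 118),
  row.names = c("School A", "School B", "School C", "School D")
)

# Keep rows where either percentage exceeds 100; one filter with | means
# neither result can overwrite the other
impossible <- toy[toy$PhD > 100 | toy$Grad.Rate > 100, ]
rownames(impossible)
```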

10. This exercise involves the Boston housing data set.

(a) To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.

library (MASS)
boston<-Boston

Now the data set is contained in the object boston.

#To see the data set call it below
#boston
attach(boston)

Read about the data set:

chas <- as.factor(chas)
?Boston
rm(Boston)

How many rows are in this data set? How many columns? What do the rows and columns represent?

nrow(boston)

## [1] 506

ncol(boston)

## [1] 14

The Boston data frame has 506 rows (observations) and 14 columns (variables). Each row is a suburb (town) of Boston.
The variable names and what they represent are listed below:

crim : per capita crime rate by town.

zn : proportion of residential land zoned for lots over 25,000 sq.ft.

indus : proportion of non-retail business acres per town.

chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox : nitrogen oxides concentration (parts per 10 million).

rm : average number of rooms per dwelling.

age : proportion of owner-occupied units built prior to 1940.

dis : weighted mean of distances to five Boston employment centres.

rad : index of accessibility to radial highways.

tax : full-value property-tax rate per $10,000.

ptratio : pupil-teacher ratio by town.

black : 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat : lower status of the population (percent).

medv : median value of owner-occupied homes in $1000s.

(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

pairs(boston[,1:10])

It seems that zn and dis have a negative relationship with crim, while age seems to have a positive relationship.

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

par(mfrow=c(3,2))
plot(nox,crim)
plot(rm,crim)
plot(age,crim)
plot(dis,crim)
plot(zn,crim)
plot(tax,crim)

Crime rates appear to be higher in areas close to employment centers, with older homes, and with little land zoned for residential lots over 25,000 sq.ft., that is, in more urban or densely populated areas.
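
The visual impression from the scatterplots can be backed with a number per predictor. One option, sketched here on a made-up data frame (the `toy` values are assumptions, not the Boston data), is the Spearman rank correlation of each predictor with crim, which does not assume the relationship is linear.

```r
# Made-up stand-in for boston (values are illustrative only)
toy <- data.frame(
  crim = c(0.1, 0.3, 1.2, 4.5, 9.8, 20.1),
  dis  = c(8.0, 6.5, 4.0, 2.5, 1.8, 1.2),
  age  = c(20, 35, 60, 80, 95, 100),
  zn   = c(80, 40, 10, 0, 0, 0)
)

# Spearman rank correlation of every predictor with crim
predictors <- setdiff(names(toy), "crim")
rho <- sapply(toy[predictors], cor, y = toy$crim, method = "spearman")

# Most negative first, most positive last
sort(rho)
```

On the real data the same three lines would run with boston in place of toy, after dropping or converting any non-numeric columns.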

(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

par(mfrow=c(3,1))
hist(crim,breaks = 40)
nrow(boston[crim> 20,])

## [1] 18

hist(tax,breaks = 40)
nrow(boston[tax > 650,])

## [1] 137

hist(ptratio,breaks = 40)
nrow(boston[ptratio > 19,])

## [1] 253

From the histograms, a few observations stand out. Only 18 suburbs have a crime rate greater than 20. There are 137 suburbs with a tax rate greater than 650 and 253 suburbs with a pupil-teacher ratio greater than 19. Very high crime rates are therefore confined to a small number of suburbs, while high tax rates and high pupil-teacher ratios are common.
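
Part (d) also asks about the range of each predictor. A hedged sketch of how the ranges and the threshold counts could be computed together is given below, on a made-up data frame (the `toy` values are assumptions for illustration).

```r
# Made-up stand-in for boston (values are illustrative only)
toy <- data.frame(
  crim    = c(0.02, 0.5, 3.1, 25.0, 88.9),
  tax     = c(187, 300, 403, 666, 711),
  ptratio = c(12.6, 17.8, 19.1, 20.2, 22.0)
)

# Minimum (row 1) and maximum (row 2) of each predictor
sapply(toy, range)

# sum() of a logical vector counts the TRUE values
sum(toy$crim > 20)
sum(toy$tax > 650)
sum(toy$ptratio > 19)
```

sum(condition) gives the same count as nrow(data[condition, ]) when there are no missing values, and it reads a little more directly.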

(e) How many of the suburbs in this data set bound the Charles river?

nrow(boston[chas==1,])

## [1] 35

There are 35 suburbs that bound the Charles River.

(f) What is the median pupil-teacher ratio among the towns in this data set?

median(ptratio)

## [1] 19.05

The median pupil-teacher ratio is 19.05.

(g) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

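min(medv)

## [1] 5

The lowest median value of owner-occupied homes is 5, that is, $5,000. This is a value of medv, not a row number, so the suburb itself is located with which.min().

which.min(medv)

## [1] 399

Suburb 399 is the first observation with the lowest median value of owner-occupied homes.

range(tax)

## [1] 187 711

The overall range of tax in the data set is 187 to 711.

boston[which.min(medv),]$tax

## [1] 666

The tax rate for suburb 399 is 666, near the upper end of the range. The suburb with the lowest median home value therefore carries one of the higher tax rates.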
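
When locating a suburb, it helps to keep in mind that min() returns the smallest value while which.min() returns the row position of that value. The difference is sketched below on a made-up data frame (the `toy` values are assumptions, not the Boston data).

```r
# Made-up stand-in for boston (values are illustrative only)
toy <- data.frame(
  medv = c(24.0, 21.6, 5.0, 33.4, 13.8),
  tax  = c(296, 242, 666, 222, 403)
)

min(toy$medv)               # smallest value of medv
idx <- which.min(toy$medv)  # row position of the first minimum
idx

# Place that row's tax within the overall distribution:
# share of observations with a tax value at or below it
mean(toy$tax <= toy$tax[idx])
```

Indexing with the value, as in toy[min(toy$medv), ], would silently pick row 5 here, which is a different observation.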

(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling?

nrow(boston[rm>7,])

## [1] 64

64 suburbs average more than 7 rooms per dwelling.

nrow(boston[rm>8,])

## [1] 13

13 suburbs average more than 8 rooms per dwelling.

Comment on the suburbs that average more than eight rooms per dwelling.

avg<-boston[rm>8,]
avg

##        crim zn indus chas    nox    rm  age    dis rad tax ptratio  black
## 98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0 396.90
## 164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7 388.45
## 205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7 390.55
## 225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4 385.05
## 226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4 382.00
## 227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4 387.38
## 233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4 385.91
## 234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4 378.95
## 254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1 396.90
## 258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0 389.70
## 263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0 386.86
## 268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0 384.54
## 365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2 354.55
##     lstat medv
## 98   4.21 38.7
## 164  3.32 50.0
## 205  2.88 50.0
## 225  4.14 44.8
## 226  4.63 50.0
## 227  3.13 37.6
## 233  2.47 41.7
## 234  3.95 48.3
## 254  3.54 42.8
## 258  5.12 50.0
## 263  5.91 48.8
## 268  7.44 50.0
## 365  5.29 21.9

One thing that stands out immediately is that only 2 of the 13 suburbs averaging more than 8 rooms per dwelling lie on the Charles River (rows 164 and 365). Those 2 suburbs also have a high proportion of non-retail business acres (indus of 19.58 and 18.10) and a high proportion of owner-occupied units built prior to 1940 (age of 93.9 and 82.9).
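
A compact way to comment on such a subset is to put its medians next to the medians of the full data set. The sketch below does this on a made-up data frame (the `toy` values and the `big` and `cmp` names are assumptions for illustration).

```r
# Made-up stand-in for boston (values are illustrative only)
toy <- data.frame(
  rm   = c(6.2, 8.3, 5.9, 8.7, 6.5, 7.1),
  medv = c(22, 50, 18, 48, 24, 33),
  crim = c(1.5, 0.3, 4.0, 0.5, 2.2, 0.9)
)

# Suburbs averaging more than 8 rooms per dwelling
big <- toy[toy$rm > 8, ]

# Side-by-side medians: the subset versus all suburbs
cmp <- rbind(over_8_rooms = sapply(big, median),
             all_suburbs  = sapply(toy, median))
cmp
```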