8 (a)

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

library(ISLR)
college <- read.csv("College.csv")
head(college[, 1:5])

8 (b)

Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

rownames(college) <- college[, 1]
college <- college[, -1]
head(college[, 1:5])
#fix(college)

8 (c) i

Use the summary() function to produce a numerical summary of the variables in the data set. note: had to convert from character to factor to be able to count the data in the Private column

college$Private <- as.factor(as.character(college$Private))
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

8 (c) ii

Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].

pairs(college[,1:10])

8 (c) iii

Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

library(ggplot2)

attach(college)
boxplot(Outstate~Private)
stripchart(Outstate~Private,data=college, vertical=TRUE, method="jitter", pch=21, add=T)

p1.8c <- ggplot(college, aes(x=Private, y=Outstate, group=Private)) + geom_boxplot()
p1.8c + geom_jitter(shape=16, position=position_jitter(0.2))

8 (d) iv

Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %.

There are 699

Elite <- rep("No",nrow(college ))
Elite[college$Top10perc >50]="Yes"
Elite <- as.factor (Elite)
college <- data.frame(college ,Elite)
summary(college$Elite)
##  No Yes 
## 699  78
library(ggplot2)
p1.8d <- ggplot(college, aes(x=Elite, y=Outstate)) + geom_boxplot()
p1.8d + geom_jitter(shape=16, position=position_jitter(0.2))

8 (d) v

Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

The plot shows that the number of students that applies, are accepted and in fact enrolls in the universities decreases step by step. We can see that clearly by the number of subdivisions (bins) on the graphs. The numbers of counts are larger than 400, but the limit was chosen to emphasize the maximum values in the x axis.

par(mfrow = c(1,3))
hist(Apps, col = 2, xlab = "Apps", ylab = "Count",xaxp=c(0,50000,10), breaks = 10, ylim = c(0,400))
hist(Accept, col = 3, xlab = "Accept", ylab = "Count", xaxp=c(0,30000,15), breaks = 15, ylim = c(0,400))
hist(Enroll, col = 4, xlab = "Enroll", ylab = "Count", xaxp=c(0,7000,20), breaks = 20, ylim = c(0,400))

8 (d) vi

Continue exploring the data, and provide a brief summary of what you discover.

The Figure below shows a zoom at the x axis. Observations are:

  • one university at has applications as high as 40k
  • three universities have acceptances superior to 15k
  • the enrolls are as high as 6300 but less than 11 have more than 4k
par(mfrow = c(1,3))
hist(Apps, col = 2, xlab = "Apps", ylab = "Count",xaxp=c(0,50000,10), breaks = 10, ylim = c(0,50))
hist(Accept, col = 3, xlab = "Accept", ylab = "Count", xaxp=c(0,30000,15), breaks = 15, ylim = c(0,50))
hist(Enroll, col = 4, xlab = "Enroll", ylab = "Count", xaxp=c(0,7000,20), breaks = 20, ylim = c(0,50))

9 (a)

Which of the predictors are quantitative, and which are qualitative?

All except horsepower and name are quantitative. origin might be a qualitative variable, even so it has int as its class

#library(ISLR)
#data(Auto)
auto <- read.csv("Auto.csv")
auto$horsepower <- as.integer(as.character(auto$horsepower))
auto <- na.omit(auto)
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
#auto$horsepower <- as.numeric(as.character(auto$horsepower))
#auto <- na.omit(auto)
#summary(auto)
#head(auto[,1:9])
#str(auto)

9 (b)

What is the range of each quantitative predictor? You can answer this using the range() function.

mpg[9,46.60]

cylinders[3,8] –> I think this would be better classified as qualitative as well

displacement[68,455]

weight[1613,5140]

acceleration[8,24.80]

year[70,82]

attach(auto)
summary(auto[-c(2,4,8,9)])
##       mpg         displacement       weight      acceleration        year      
##  Min.   : 9.00   Min.   : 68.0   Min.   :1613   Min.   : 8.00   Min.   :70.00  
##  1st Qu.:17.00   1st Qu.:105.0   1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00  
##  Median :22.75   Median :151.0   Median :2804   Median :15.50   Median :76.00  
##  Mean   :23.45   Mean   :194.4   Mean   :2978   Mean   :15.54   Mean   :75.98  
##  3rd Qu.:29.00   3rd Qu.:275.8   3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00  
##  Max.   :46.60   Max.   :455.0   Max.   :5140   Max.   :24.80   Max.   :82.00
#range(mpg)
#range(cylinders)
#range(weight)
#range(acceleration)
#range(year)

9 (c)

What is the mean and standard deviation of each quantitative predictor?

auto.sd <- sapply(auto[-c(2,4,8,9)], sd)
auto.mean <- sapply(auto[-c(2,4,8,9)], mean)
df.9c <- data.frame(auto.sd,auto.mean)
df.9c

9 (d)

Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

auto2 <- auto[-c(10:85),-c(4,9)]
sapply(auto2, range)
##       mpg cylinders displacement weight acceleration year origin
## [1,] 11.0         3           68   1649          8.5   70      1
## [2,] 46.6         8          455   4997         24.8   82      3
sapply(auto2, mean)
##          mpg    cylinders displacement       weight acceleration         year 
##    24.404430     5.373418   187.240506  2935.971519    15.726899    77.145570 
##       origin 
##     1.601266
sapply(auto2, sd)
##          mpg    cylinders displacement       weight acceleration         year 
##     7.867283     1.654179    99.678367   811.300208     2.693721     3.106217 
##       origin 
##     0.819910
auto2.range <- sapply(auto2, range)
auto2.mean <- sapply(auto2, mean)
auto2.sd <- sapply(auto2, sd)
df.9d <- data.frame(auto2.mean,auto2.sd)
df.9d

9 (e)

Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

  • mpg versus quantitative parameters: the fuel consumption parameter mpg is related inversely to the quantitative parameters displacement, horsepower and weight. This means that the increase of them has a consequence of decrease the mpg, which in a generalized view means increase the fuel consumption, and increase in cost for the vehicle.

  • Others parameters: as for the relation between the others among themselves, we can see a linear increase, meaning that the horsepower increase with the displacement and weight.

pairs(auto[,-c(2,6,8,7,9)])

#auto$cylinders <- as.factor(auto$cylinders)
#auto$year <- as.factor(auto$year)
#auto$origin <- as.factor(auto$origin)
#pairs(auto[-c(9)])
#str(auto)
  • mpg versus acceleration: as one can see for the following panel, the acceleration does not say much about the mpg. It is too much dispersed at the center between 10 and 20. We could say that the values are somewhere around a mean of 15.54.
#par(mfrow = c(1, 3))
#plot(displacement,mpg)
#plot(horsepower,mpg)
#plot(weight,mpg)
plot(acceleration,mpg)

mean(acceleration)
## [1] 15.54133
  • mpg versus qualitative like parameters:

  • cylinders number as one could observe, the mpg parameters is high for 4 cylinders scenario, and decreases when the cylinders numbers are higher than 4. However, the boxplot panel (left) does not have sufficient data for 3, 5 cylinders scenario. As such, for mpg result, the data for 4 6 and 8 cylinders has more trustfulness

  • numbers of years: as the years of manufacture increased, the tendency is to increase the mpg of all the cars. True if we considerate the natural evolution of this industry.

  • Car with the highest economy parameters: the car with the following parameters would have the lowest mpg:

lower displacement, horsepower and weight

4 cylinders

had been build on the 80’s forward

par(mfrow = c(1, 2))

p1.axis <- min(auto[,"cylinders"]):max(auto[,"cylinders"])
p1.9e <- ggplot(auto, aes(x=cylinders, y=mpg, group=cylinders)) + geom_boxplot() + scale_x_continuous(labels = p1.axis, breaks = p1.axis)
p1.9e + geom_jitter(shape=16, position=position_jitter(0.2))

p2.axis <- min(auto[,"year"]):max(auto[,"year"])
p2.9e <- ggplot(auto, aes(x=year, y=mpg, group=year)) + geom_boxplot() + scale_x_continuous(labels = p2.axis, breaks = p2.axis)
p2.9e + geom_jitter(shape=16, position=position_jitter(0.2))

#boxplot(mpg ~ cylinders, lwd = 2, ylab = 'mpg')
#stripchart(mpg~cylinders, vertical = TRUE, method = "jitter", add = TRUE, pch = 20, col = 'blue')

#boxplot(mpg~year, data = auto, lwd = 2, ylab = 'mpg')
#stripchart(mpg~year, vertical = TRUE, method = "jitter", add = TRUE, pch = 20, col = 'blue')

9 (f)

Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

  • WRONG ANSWER

variables suggested to predict mpg, according to the dataset: displacement, horsepower, cylinders, year

variables to be treated as consequence of the others, but could be use in the predicting: weight

variables NOT suggested to predict mpg, according to the dataset: acceleration

  • RIGHT ANSWER

You have to look for predictors that are not correlated with each other. For this reason, weight, horsepower and displacement cannot be at the same fit. The book section 3.3.3 Potential Problems of regression says:

"An important assumption of the linear regression model is that the error terms, are uncorrelated. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.

The right set of predictions that should be used are the ones not correlated: cylinders, horsepower (which could be switch by displacement, weight), year and origin.

note: cylinders is also highly correlated with horsepower, is that problem? Possible other answer, exclude horsepower or cylinders.

auto$horsepower <- as.numeric(auto$horsepower)
auto$cylinders <- as.numeric(auto$cylinders)
auto$year <- as.numeric(auto$year)
auto$origin <- as.numeric(auto$origin)
p1.9f <- cor(auto[-c(9)])
library('corrplot') #package corrplot
corrplot(p1.9f, method = "pie",  tl.srt = 45 ) #plot matrix

#str(auto)
#cor(auto$weight, auto$horsepower)
#cor(auto$weight, auto$displacement)
#cor(auto$displacement, auto$horsepower)
#cor(auto$cylinders, auto$horsepower)
#cor(auto$cylinders, auto$horsepower)

10 (a)

This exercise involves the Boston housing data set.

To begin, load in the Boston data set. The Boston data set is part of the MASS library in R. How many rows are in this data set? How many columns? What do the rows and columns represent?

506 observations

14 variables

data from the Boston state related to house sales

library(MASS)
attach(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

10 (b)

Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

  • nox versus dis: it says that pollution is inversely proportional to the distance of the employment centers.

  • nox versus age: it shows that the pollution increases as the occupation increased.

  • lstat versus medv: shows that lower status population has the lower mean value of homes.

  • lstat versus rm: as lower the population status, lower the rooms per dwelling.

  • crim versus dis: the crimes seems to happen more nearby the employment centers

  • crim versus black: there are more crimes at lower per capita people when the black people population increases.

par(mfrow= c(3,2))
plot(dis,nox)
plot(age,nox)
plot(medv,lstat)
plot(rm,lstat)
plot(crim~dis)
plot(black,crim)

10 (c)

Are any of the predictors associated with per capita crime rate? If so, explain the relationship

  • crim versus dis: the crimes seems to happen more nearby the employment centers

  • crim versus black: there are more crimes at lower per capita people when the black people population increases.

par(mfrow= c(1,2))
plot(dis,crim)
plot(black,crim)

10 (d)

Do any of the suburbs of Boston appear to have particularly high crime rates ? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor

Observing the plots, one can see that:

  • first plot: general view of the problem, showing the relations in a zoom out. One can seen peaks among the plots

  • second plot first panel: the crimes seems to happen more in suburbs near by the commercial centers.

  • second plot second panel: the property-tax rate is particularly high in some suburbs near the commercial centers.

  • second plot third panel: the pupil-teacher ratio by town is high in the suburbs near the commercial centers, which could be a indication of a school or university

  • other plots: showing each scenario in more detailed, showing the observations concentrations.

library(ggplot2)

par(mfrow=c(1,3))
#pairs(Boston[,c(1,8,10,11)])
plot(crim,dis,type = 'p', pch=16)
plot(tax,dis,type = 'p', pch=16)
plot(ptratio,dis,type = 'p', pch=16)

#range(dis)
#range(ptratio)
#range(tax)
#range(crim)

#p3.axis <- min(Boston[,"tax"]):max(Boston[,"ptratio"])
#p3.10d <- ggplot(Boston, aes(x=tax, y=ptratio, group=ptratio)) + geom_boxplot() + scale_x_continuous(labels = p2.axis, breaks = p2.axis)
#p3.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip()  + xlim(c(180,750))

#p2.axis <- min(Boston[,"crim"]):max(Boston[,"ptratio"])
#p2.10d <- ggplot(Boston, aes(x=crim, y=ptratio, group=ptratio)) + geom_boxplot() + scale_x_continuous(labels = p2.axis, breaks = p2.axis)
#p2.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip()  + xlim(c(0,80))

p1.axis <- min(Boston[,"dis"]):max(Boston[,"ptratio"])
p1.10d <- ggplot(Boston, aes(x=dis, y=ptratio, group=ptratio)) + geom_boxplot()
p1.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip() 

#p4.axis <- min(Boston[,"dis"]):max(Boston[,"crim"])
p4.10d <- ggplot(Boston, aes(x=dis, y=crim, group=crim)) + geom_boxplot()
p4.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip() 

p5.axis <- min(Boston[,"dis"]):max(Boston[,"tax"])
p5.10d <- ggplot(Boston, aes(x=dis, y=tax, group=tax)) + geom_boxplot()
p5.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip() 

10 (e)

How many of the suburbs in this data set bound the Charles river?

suburbs set bound the Charles = 35

otherwise = 471

#p1.axis <- min(Boston[,"dis"]):max(Boston[,"chas"])
p1.10e <- ggplot(Boston, aes(x=dis, y=chas, group=chas)) + geom_boxplot() + scale_x_continuous(labels = p2.axis, breaks = p2.axis)
p1.10e + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip()

Boston$chas <- as.factor(as.numeric(Boston$chas))
summary(Boston$chas)
##   0   1 
## 471  35

10 (f)

What is the median pupil-teacher ratio among the towns in this data set?

median(ptratio)
## [1] 19.05
#mean(ptratio) # WRONG is the median not the mean

10 (g)

Which suburb of Boston has lowest median value of owner- occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings

  • The two lowest median value of owner occupied homes and theirs respective predictors:

crim zn indus chas nox rm age dis rad tax ptratio black lstat medv

38.3518 0 18.1 1 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59 5

crim zn indus chas nox rm age dis rad tax ptratio black lstat medv

67.9208 0 18.1 1 0.693 5.683 100 1.4254 24 666 20.2 384.97 22.98 5

  • The mean values of the predictors can be seen below:

crim zn indus chas nox rm age dis rad tax ptratio black lstat medv

3.61352 11.36 11.14 - 0.554 6.285 68.57 3.795 9.54 408.2 18.46 356.67 12.65 22.53

The lowest medv clearly emphasis the poor people. The main predictors that this emphasis is quantified are crim, black, lstat and medv.

Boston$chas <- as.numeric(as.character(Boston$chas))
low.mv.owner <- Boston[399,]
mean.pred <- Boston[406,]
mean.boston <- colMeans(Boston)

df.10g <- data.frame(low.mv.owner, mean.pred)
df.10g

10 (g)

In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

  • more than seven rooms: 64

  • more than eight rooms: 13

the number of suburbs that have more than 8 rooms represents only 2.56%, which is a substantial few quantity

library(dplyr)
nrow(filter(Boston, rm>=7))
## [1] 64
nrow(filter(Boston, rm>=8))
## [1] 13