Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.
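A minimal sketch of the loading step, assuming the file is named College.csv and sits in the working directory:
college <- read.csv("College.csv", header = TRUE)   # loads the data into a data frame called college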
Look at the data using the fix() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
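The commands in question (a sketch of the usual approach) move the university names into the row names and then drop that column, so R no longer treats the names as data:
rownames(college) <- college[, 1]   # use the first column (university name) as row names
college <- college[, -1]            # drop the name column from the data itself
fix(college)                        # re-inspect; the names now appear only as row labels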
Use the summary() function to produce a numerical summary of the variables in the data set. Note: the Private column had to be converted from character to factor so that summary() would count its categories.
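A sketch of that step:
college$Private <- as.factor(college$Private)   # convert from character so summary() counts the levels
summary(college)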
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
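A minimal sketch of the call, assuming the name column has already been moved into the row names as above:
pairs(college[, 1:10])   # scatterplot matrix of the first ten variables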
Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
library(ggplot2)
attach(college)
boxplot(Outstate~Private)
stripchart(Outstate~Private, data=college, vertical=TRUE, method="jitter", pch=21, add=TRUE)
p1.8c <- ggplot(college, aes(x=Private, y=Outstate, group=Private)) + geom_boxplot()
p1.8c + geom_jitter(shape=16, position=position_jitter(0.2))
Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
There are 699 universities that are not elite and 78 that are (see the summary below).
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college$Elite)
## No Yes
## 699 78
library(ggplot2)
p1.8d <- ggplot(college, aes(x=Elite, y=Outstate)) + geom_boxplot()
p1.8d + geom_jitter(shape=16, position=position_jitter(0.2))
Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
The plots show that, moving from applications to acceptances to enrollments, the number of students decreases step by step; this is clear from the shrinking x-axis ranges (and bin widths) of the three histograms. Some bin counts exceed 400, but the y-axis was capped at 400 to emphasize the behaviour towards the maximum values on the x axis.
par(mfrow = c(1,3))
hist(Apps, col = 2, xlab = "Apps", ylab = "Count",xaxp=c(0,50000,10), breaks = 10, ylim = c(0,400))
hist(Accept, col = 3, xlab = "Accept", ylab = "Count", xaxp=c(0,30000,15), breaks = 15, ylim = c(0,400))
hist(Enroll, col = 4, xlab = "Enroll", ylab = "Count", xaxp=c(0,7000,20), breaks = 20, ylim = c(0,400))
Continue exploring the data, and provide a brief summary of what you discover.
The figure below zooms in on the count axis (y limited to 50) so that the long right tails of Apps, Accept and Enroll become visible.
par(mfrow = c(1,3))
hist(Apps, col = 2, xlab = "Apps", ylab = "Count",xaxp=c(0,50000,10), breaks = 10, ylim = c(0,50))
hist(Accept, col = 3, xlab = "Accept", ylab = "Count", xaxp=c(0,30000,15), breaks = 15, ylim = c(0,50))
hist(Enroll, col = 4, xlab = "Enroll", ylab = "Count", xaxp=c(0,7000,20), breaks = 20, ylim = c(0,50))
Which of the predictors are quantitative, and which are qualitative?
All variables except name are quantitative; horsepower is read in as character (its missing values are coded as "?"), so it has to be converted before it can be used as a number. origin might be better treated as a qualitative variable, even though its class is int.
#library(ISLR)
#data(Auto)
auto <- read.csv("Auto.csv")
auto$horsepower <- as.integer(as.character(auto$horsepower))
auto <- na.omit(auto)
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
What is the range of each quantitative predictor? You can answer this using the range() function.
mpg: [9.0, 46.6]
cylinders: [3, 8] (I think this would be better classified as qualitative as well)
displacement: [68, 455]
weight: [1613, 5140]
acceleration: [8.0, 24.8]
year: [70, 82]
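These ranges could be obtained by applying range() column-wise, for example (a minimal sketch using the auto data frame loaded above; the summary output below shows the same minima and maxima):
sapply(auto[, c("mpg", "cylinders", "displacement", "weight", "acceleration", "year")], range)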
## mpg displacement weight acceleration year
## Min. : 9.00 Min. : 68.0 Min. :1613 Min. : 8.00 Min. :70.00
## 1st Qu.:17.00 1st Qu.:105.0 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00
## Median :22.75 Median :151.0 Median :2804 Median :15.50 Median :76.00
## Mean :23.45 Mean :194.4 Mean :2978 Mean :15.54 Mean :75.98
## 3rd Qu.:29.00 3rd Qu.:275.8 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00
## Max. :46.60 Max. :455.0 Max. :5140 Max. :24.80 Max. :82.00
What is the mean and standard deviation of each quantitative predictor?
auto.sd <- sapply(auto[-c(2,4,8,9)], sd)
auto.mean <- sapply(auto[-c(2,4,8,9)], mean)
df.9c <- data.frame(auto.sd,auto.mean)
df.9c
Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
## mpg cylinders displacement weight acceleration year origin
## [1,] 11.0 3 68 1649 8.5 70 1
## [2,] 46.6 8 455 4997 24.8 82 3
## mpg cylinders displacement weight acceleration year
## 24.404430 5.373418 187.240506 2935.971519 15.726899 77.145570
## origin
## 1.601266
## mpg cylinders displacement weight acceleration year
## 7.867283 1.654179 99.678367 811.300208 2.693721 3.106217
## origin
## 0.819910
auto2 <- auto[-(10:85), -c(4, 9)]   # drop observations 10 through 85, plus the horsepower and name columns, as above
auto2.range <- sapply(auto2, range)
auto2.mean <- sapply(auto2, mean)
auto2.sd <- sapply(auto2, sd)
df.9d <- data.frame(auto2.mean, auto2.sd)
df.9d
Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
mpg versus the quantitative predictors: mpg is inversely related to displacement, horsepower and weight. As these increase, mpg decreases, which in practice means higher fuel consumption and a higher running cost for the vehicle.
Relationships among the other predictors: displacement, horsepower and weight increase roughly linearly with one another, so heavier cars tend to have larger engines and more horsepower.
#auto$cylinders <- as.factor(auto$cylinders)
#auto$year <- as.factor(auto$year)
#auto$origin <- as.factor(auto$origin)
#pairs(auto[-c(9)])
#str(auto)
#par(mfrow = c(1, 3))
#plot(displacement,mpg)
#plot(horsepower,mpg)
#plot(weight,mpg)
plot(acceleration,mpg)
## [1] 15.54133
mpg versus the qualitative-like predictors:
Number of cylinders: mpg is highest in the 4-cylinder case and decreases once the number of cylinders exceeds 4. However, the boxplot panel (left) has very little data for the 3- and 5-cylinder cases, so the mpg results for 4, 6 and 8 cylinders are the most trustworthy.
Model year: as the year of manufacture increases, mpg tends to increase across all cars, which is consistent with the natural evolution of the industry.
Most economical car: a car with the following characteristics would have the highest mpg:
low displacement, horsepower and weight
4 cylinders
built in the 1980s or later
par(mfrow = c(1, 2))
p1.axis <- min(auto[,"cylinders"]):max(auto[,"cylinders"])
p1.9e <- ggplot(auto, aes(x=cylinders, y=mpg, group=cylinders)) + geom_boxplot() + scale_x_continuous(labels = p1.axis, breaks = p1.axis)
p1.9e + geom_jitter(shape=16, position=position_jitter(0.2))
p2.axis <- min(auto[,"year"]):max(auto[,"year"])
p2.9e <- ggplot(auto, aes(x=year, y=mpg, group=year)) + geom_boxplot() + scale_x_continuous(labels = p2.axis, breaks = p2.axis)
p2.9e + geom_jitter(shape=16, position=position_jitter(0.2))
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
Variables suggested for predicting mpg, according to the data set: displacement, horsepower, cylinders, year.
Variables that are largely a consequence of the others but could still be used in prediction: weight.
Variables NOT suggested for predicting mpg, according to the data set: acceleration.
One should look for predictors that are not correlated with each other; for this reason weight, horsepower and displacement should not all appear in the same fit. Section 3.3.3 of the book (Potential Problems) says:
"An important assumption of the linear regression model is that the error terms are uncorrelated. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model."
The right set of predictors to use is therefore one whose members are not correlated with each other: cylinders, horsepower (which could be swapped for displacement or weight), year and origin.
Note: cylinders is also highly correlated with horsepower; is that a problem? A possible alternative answer is to exclude either horsepower or cylinders.
auto$horsepower <- as.numeric(auto$horsepower)
auto$cylinders <- as.numeric(auto$cylinders)
auto$year <- as.numeric(auto$year)
auto$origin <- as.numeric(auto$origin)
p1.9f <- cor(auto[-c(9)])
library('corrplot') #package corrplot
corrplot(p1.9f, method = "pie", tl.srt = 45) #plot matrix
This exercise involves the Boston housing data set.
To begin, load in the Boston data set. The Boston data set is part of the MASS library in R. How many rows are in this data set? How many columns? What do the rows and columns represent?
506 rows (observations), one per suburb of Boston
14 columns (variables)
The rows represent suburbs of Boston and the columns are housing and neighborhood characteristics (crime rate, tax rate, median home value, and so on).
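A minimal sketch of loading the data and checking its dimensions (the Boston data frame ships with the MASS package):
library(MASS)   # makes the Boston data frame available
dim(Boston)     # 506 rows, 14 columns
str(Boston)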
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
nox versus dis: pollution is inversely related to the distance to the employment centers; nox falls as dis increases.
nox versus age: pollution increases as the proportion of older housing (age) increases, i.e. in more built-up areas.
lstat versus medv: suburbs with a larger lower-status population have lower median home values.
lstat versus rm: the larger the lower-status population, the fewer rooms per dwelling.
crim versus dis: crime seems to be concentrated close to the employment centers (small dis).
crim versus black: the per-capita crime rate appears to increase as the black variable increases.
attach(Boston)   # attach so that variable names such as dis and nox can be used directly below
par(mfrow= c(3,2))
plot(dis,nox)
plot(age,nox)
plot(medv,lstat)
plot(rm,lstat)
plot(crim~dis)
plot(black,crim)
Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
crim versus dis: crime seems to be concentrated close to the employment centers (small dis).
crim versus black: the per-capita crime rate appears to increase as the black variable increases.
Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
Observing the plots, one can see that:
First figure: a zoomed-out overview of the relationships, in which a few pronounced peaks stand out.
Second figure, first panel: crime seems to be concentrated in suburbs close to the employment centers.
Second figure, second panel: the property-tax rate is particularly high in some suburbs close to the employment centers.
Second figure, third panel: the pupil-teacher ratio by town is high in the suburbs near the employment centers, which could indicate the presence of a school or university.
Other plots: each relationship in more detail, showing where the observations are concentrated.
library(ggplot2)
par(mfrow=c(1,3))
#pairs(Boston[,c(1,8,10,11)])
plot(crim,dis,type = 'p', pch=16)
plot(tax,dis,type = 'p', pch=16)
plot(ptratio,dis,type = 'p', pch=16)
#range(dis)
#range(ptratio)
#range(tax)
#range(crim)
#p3.axis <- min(Boston[,"tax"]):max(Boston[,"ptratio"])
#p3.10d <- ggplot(Boston, aes(x=tax, y=ptratio, group=ptratio)) + geom_boxplot() + scale_x_continuous(labels = p2.axis, breaks = p2.axis)
#p3.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip() + xlim(c(180,750))
#p2.axis <- min(Boston[,"crim"]):max(Boston[,"ptratio"])
#p2.10d <- ggplot(Boston, aes(x=crim, y=ptratio, group=ptratio)) + geom_boxplot() + scale_x_continuous(labels = p2.axis, breaks = p2.axis)
#p2.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip() + xlim(c(0,80))
p1.axis <- min(Boston[,"dis"]):max(Boston[,"ptratio"])
p1.10d <- ggplot(Boston, aes(x=dis, y=ptratio, group=ptratio)) + geom_boxplot()
p1.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip()
#p4.axis <- min(Boston[,"dis"]):max(Boston[,"crim"])
p4.10d <- ggplot(Boston, aes(x=dis, y=crim, group=crim)) + geom_boxplot()
p4.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip()
p5.axis <- min(Boston[,"dis"]):max(Boston[,"tax"])
p5.10d <- ggplot(Boston, aes(x=dis, y=tax, group=tax)) + geom_boxplot()
p5.10d + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip()
How many of the suburbs in this data set bound the Charles river?
Suburbs bounding the Charles river: 35
All other suburbs: 471
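These counts could be obtained by tabulating the chas dummy variable, e.g. (a sketch):
table(Boston$chas)   # 1 = bounds the Charles river, 0 = does not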
p1.axis <- floor(min(Boston[,"dis"])):ceiling(max(Boston[,"dis"]))   # breaks based on dis, rather than the stale p2.axis from the Auto section
p1.10e <- ggplot(Boston, aes(x=dis, y=chas, group=chas)) + geom_boxplot() + scale_x_continuous(labels = p1.axis, breaks = p1.axis)
p1.10e + geom_jitter(shape=16, position=position_jitter(0.2)) + coord_flip()
## 0 1
## 471 35
What is the median pupil-teacher ratio among the towns in this data set?
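This is a one-liner, assuming the Boston data frame is loaded:
median(Boston$ptratio)   # median pupil-teacher ratio across the towns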
## [1] 19.05
Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
The two suburbs tied for the lowest medv (medv = 5):
crim    zn    indus chas nox   rm    age   dis    rad  tax   ptratio black  lstat medv
38.3518 0     18.1  1    0.693 5.453 100   1.4896 24   666   20.2    396.9  30.59 5
67.9208 0     18.1  1    0.693 5.683 100   1.4254 24   666   20.2    384.97 22.98 5
Overall column means for comparison:
crim    zn    indus chas nox   rm    age   dis    rad  tax   ptratio black  lstat medv
3.61352 11.36 11.14 -    0.554 6.285 68.57 3.795  9.54 408.2 18.46   356.67 12.65 22.53
The suburbs with the lowest medv clearly correspond to the poorest areas. The predictors in which this shows up most clearly, relative to the overall means, are crim, black, lstat and medv.
Boston$chas <- as.numeric(as.character(Boston$chas))
low.mv.owner <- Boston[Boston$medv == min(Boston$medv), ]   # the two suburbs tied for the lowest medv (rows 399 and 406)
mean.boston <- colMeans(Boston)                             # overall means for comparison
df.10g <- rbind(low.mv.owner, mean.boston)
df.10g
In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
more than seven rooms: 64
more than eight rooms: 13
the 13 suburbs that average more than eight rooms represent only about 2.6% of the data set, which is a very small fraction
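These counts could be obtained with something like the following (a minimal sketch):
sum(Boston$rm > 7)   # suburbs averaging more than seven rooms per dwelling
sum(Boston$rm > 8)   # suburbs averaging more than eight rooms per dwelling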
## [1] 64
## [1] 13