This work is part of my effort to become a well-versed data analyst. At this point, and for the immediate future, I will undoubtedly be a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. With more practice, the quality and depth of my work will improve (that is the whole point!).
The following content is my work on the applied problem set from Chapter 2 of the book An Introduction to Statistical Learning with Applications in R. You can click on the link to access the book online for free.
Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data:
library("ISLR")
college = na.omit(College)
head(college)
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 52 2885 537 7440
## Adelphi University 29 2683 1227 12280
## Adrian College 50 1036 99 11250
## Agnes Scott College 89 510 63 12960
## Alaska Pacific University 44 249 869 7560
## Albertson College 62 678 41 13500
## Room.Board Books Personal PhD Terminal
## Abilene Christian University 3300 450 2200 70 78
## Adelphi University 6450 750 1500 29 30
## Adrian College 3750 400 1165 53 66
## Agnes Scott College 5450 450 875 92 97
## Alaska Pacific University 4120 800 1500 76 72
## Albertson College 3335 500 675 67 73
## S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University 18.1 12 7041 60
## Adelphi University 12.2 16 10527 56
## Adrian College 12.9 30 8735 54
## Agnes Scott College 7.7 37 19016 59
## Alaska Pacific University 11.9 2 10922 15
## Albertson College 9.4 11 9727 55
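I will be skipping the next part of the exercise, because it is only about making sure that the row-names column is not added as a predictor column. The College data frame from the ISLR package already keeps the college names as row names, as the head() output above shows.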
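For readers who do load the data with read.csv(), here is a minimal sketch of how the row-names step could look. The file is a three-row stand-in written to a temporary path (an assumption for illustration, not the real College.csv), with values copied from the head() output above. The object name college_csv is also just illustrative.

```r
# Build a tiny stand-in for College.csv so the example is runnable.
# The three rows are copied from the head(college) output shown earlier;
# the temporary file path is only for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",Private,Apps,Accept",
  "Abilene Christian University,Yes,1660,1232",
  "Adelphi University,Yes,2186,1924",
  "Adrian College,Yes,1428,1097"
), csv_path)

# row.names = 1 tells read.csv() to use the first column as row names,
# so the college names are not treated as a predictor column.
college_csv <- read.csv(csv_path, row.names = 1, stringsAsFactors = TRUE)

rownames(college_csv)  # the college names
names(college_csv)     # only the real variables remain
```

An alternative with the same effect is to read the file normally, assign the first column to rownames(), and then drop that column.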
i. Use the summary() function to produce a numerical summary of the variables in the data set:
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10]:
pairs(college[,1:10])
iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private:
plot(Outstate ~ Private, data = college, col =c("green", "blue"))
iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite:
Elite=rep("No",nrow(college ))
Elite[college$Top10perc >50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college , Elite)
summary(college$Elite)
## No Yes
## 699 78
plot(Outstate ~ Elite, data = college, col =c("green","red"))
v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables:
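No code was shown for this part, so here is a minimal sketch of one way it could be done. The choice of variables (Outstate, Room.Board, Grad.Rate) and the bin counts are my own assumptions, not something fixed by the exercise.

```r
library(ISLR)  # provides the College data frame used throughout this post

college <- na.omit(College)

# 2 x 2 grid so the four histograms can be compared side by side.
old_par <- par(mfrow = c(2, 2))

# A single number passed to `breaks` is only a suggestion: hist() rounds
# the cut points to "pretty" values, so the actual bin count may differ.
hist(college$Outstate, breaks = 10,
     main = "Outstate (about 10 bins)", xlab = "Out-of-state tuition")
hist(college$Outstate, breaks = 30,
     main = "Outstate (about 30 bins)", xlab = "Out-of-state tuition")
hist(college$Room.Board, breaks = 15,
     main = "Room.Board (about 15 bins)", xlab = "Room and board cost")
hist(college$Grad.Rate, breaks = 20,
     main = "Grad.Rate (about 20 bins)", xlab = "Graduation rate")

par(old_par)  # restore the previous plotting layout
```

Comparing the two Outstate panels shows how the bin count changes the apparent shape of the same distribution.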
vi. Continue exploring the data, and provide a brief summary of what you discover:
# Acceptance rate for each college
Accept_Rate = college$Accept / college$Apps
# Rough cost measure for each college. Note: Expend is instructional
# expenditure per student (money the college spends), not a student fee.
Cost = college$Room.Board + college$Outstate + college$Books + college$Personal +
  college$Expend
# Add the two newly created variables to the data frame
college = data.frame(college, Accept_Rate, Cost)
par(mfrow = c(1, 2))
# Left: scatterplot of acceptance rate against cost, with a regression line.
plot(college$Accept_Rate ~ college$Cost, xlab = "Cost of Attendance",
     ylab = "Acceptance Rate")
lm_Cost <- lm(Accept_Rate ~ Cost, data = college)
abline(lm_Cost, col = "red")
# Right: Elite is a factor, so plot() draws side-by-side boxplots.
plot(college$Accept_Rate ~ college$Elite, xlab = "Elite Status",
     ylab = "Acceptance Rate")
lm_Elite <- lm(Accept_Rate ~ Elite, data = college)
# The boxes sit at x = 1 (No) and x = 2 (Yes) while the model codes Elite as
# 0/1, so shift the intercept to make the line join the two group means.
abline(a = coef(lm_Elite)[1] - coef(lm_Elite)[2], b = coef(lm_Elite)[2],
       col = "red")
As expected, the Acceptance Rate is negatively related to both the Cost of Attendance and the Elite status. A multiple linear regression model with these two predictors gives us the following results.
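The call that fits this model was not shown. Judging from the Call line in the summary output, it was presumably something like the sketch below. The object name lm_both is my own, and the setup lines rebuild the derived columns in a more compact way than the step-by-step code earlier in the post.

```r
library(ISLR)  # provides the College data frame

# Rebuild the derived variables so this block runs on its own.
college <- na.omit(College)
college$Elite <- factor(ifelse(college$Top10perc > 50, "Yes", "No"))
college$Accept_Rate <- college$Accept / college$Apps
college$Cost <- with(college, Room.Board + Outstate + Books + Personal + Expend)

# Multiple linear regression of acceptance rate on cost and elite status.
# "No" is the first factor level, so it becomes the baseline category.
lm_both <- lm(Accept_Rate ~ Cost + Elite, data = college)
summary(lm_both)
```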
##
## Call:
## lm(formula = Accept_Rate ~ Cost + Elite, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45336 -0.07080 0.02147 0.08710 0.37536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.449e-01 1.558e-02 54.228 < 2e-16 ***
## Cost -3.044e-06 5.984e-07 -5.086 4.59e-07 ***
## EliteYes -1.774e-01 1.809e-02 -9.808 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1285 on 774 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2373
## F-statistic: 121.7 on 2 and 774 DF, p-value: < 2.2e-16
This tells us that about 24% of the variation in the Acceptance Rate is explained by the Cost of Attendance and the Elite status of the university. The Elite status is a binary qualitative variable (Yes, No) and the model chooses ‘No’ as its baseline, so the EliteYes coefficient says that, at the same cost, an elite school's acceptance rate is about 0.18 (18 percentage points) lower. The p-values indicate strong significance, which supports (though it does not prove) our assumption about the relationship between these two predictors and the response variable.
Do Elite schools have, on average, a greater percentage of alumni who donate? My assumption is that the answer is ‘YES’ (e.g., Harvard University probably has a greater perc.alumni value than most non-elite and elite schools).
Based on side-by-side boxplots of perc.alumni against Elite, we would infer that Elite universities do have a higher percentage (on average) of alumni who donate.
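The code for that comparison did not make it into the post. A minimal sketch of how it could be reproduced follows. The axis labels and the object names alumni_mean and alumni_median are my own choices.

```r
library(ISLR)  # provides the College data frame

college <- na.omit(College)
college$Elite <- factor(ifelse(college$Top10perc > 50, "Yes", "No"))

# Side-by-side boxplots of alumni giving by elite status.
plot(perc.alumni ~ Elite, data = college, col = c("green", "red"),
     xlab = "Elite Status", ylab = "Percent of alumni who donate")

# Numerical check to go with the plot: group means and medians.
alumni_mean   <- tapply(college$perc.alumni, college$Elite, mean)
alumni_median <- tapply(college$perc.alumni, college$Elite, median)
alumni_mean
alumni_median
```

The boxplot shows medians and spread, while tapply() gives the group averages that the question literally asks about.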
I wanted to do some data sorting to find:
The school that is: lowest acceptance rate & most costly & Private & Elite
The school that is: lowest acceptance rate & least costly & Private & Elite
The school that is: lowest acceptance rate & least costly & Public & Elite
However, I ran into what looked like a problem: it seemed that only the first argument (criterion) was being used to order the rows. An example of the issue I am talking about:
attach(mtcars)
newdata <- mtcars[order(-mpg, -hp),]
head(newdata)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
newdata <- mtcars[order(-mpg, hp),]
head(newdata)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
detach(mtcars)
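At first glance the two outputs look exactly the same, but they are not: Lotus Europa and Honda Civic, which tie at 30.4 mpg, swap places in rows 3 and 4. So order() does apply the second sorting rule, but it only comes into play when rows tie on the first one, and a continuous variable such as the acceptance rate will rarely have ties. I will look into this further and find a better way to do multi-rule sorting of a data frame. Remember, I am new to this. If you read this far and happen to know how it is done, let me know via comment. I also welcome criticism.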
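For what it is worth, here is a minimal sketch of one possible approach, using mtcars so that it runs without extra packages. It filters on the yes/no criteria first and sorts second, and it uses with() instead of attach()/detach(). The object names (hp_desc, hp_asc, manual4) and the particular filter are assumptions for illustration.

```r
# mtcars ships with base R. with() keeps the column lookups local,
# so there is no need for attach() and detach().

# Sort by mpg (descending); break ties by hp, descending vs. ascending.
hp_desc <- mtcars[with(mtcars, order(-mpg, -hp)), ]
hp_asc  <- mtcars[with(mtcars, order(-mpg,  hp)), ]

# The two results differ only where mpg is tied, e.g. at 30.4 mpg.
rownames(hp_desc)[3:4]
rownames(hp_asc)[3:4]

# Filter first, then sort: 4-cylinder manual cars, best mpg first,
# lowest hp first among ties.
manual4 <- subset(mtcars, cyl == 4 & am == 1)
manual4 <- manual4[order(-manual4$mpg, manual4$hp), ]
head(manual4, 3)
```

For the college data the same pattern would presumably be subset(college, Private == "Yes" & Elite == "Yes") followed by order(Accept_Rate, -Cost) on the result, assuming the Accept_Rate and Cost columns created earlier are present.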
Ahmed TADDE