This work is part of my effort to become a well-versed data analyst. For now, and for the immediate future, I am a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. With more practice, the quality and depth of my work will improve (that is the whole point!).

The following content is my work on the applied problem set from Chapter 2 of the book An Introduction to Statistical Learning with Applications in R. The book is available online for free.

Question-a

Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data:

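# The College data also ships with the ISLR package, so I load it from there
# instead of reading a CSV file
library("ISLR")
college = na.omit(College)
head(college)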
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55

Question-b

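I am skipping this one. It only makes sure that the column of college names is stored as row names rather than treated as a predictor, and the College data frame from the ISLR package already has the names as row names (see the head() output above).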
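For readers who do load the data with read.csv(), here is a minimal sketch of what this step involves. The tiny CSV file written below is a hypothetical stand-in for the real data file (only three rows and three variables, with values copied from the head() output above), and the name college_csv is my own choice.

```r
# Hypothetical stand-in for the real CSV file: the first column holds the college names
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"","Private","Apps","Accept"',
  '"Abilene Christian University","Yes",1660,1232',
  '"Adelphi University","Yes",2186,1924',
  '"Adrian College","Yes",1428,1097'
), csv_path)

college_csv <- read.csv(csv_path)

# Move the first column (the college names) into the row names...
rownames(college_csv) <- college_csv[, 1]
# ...and drop it so that it is not treated as a predictor
college_csv <- college_csv[, -1]

head(college_csv)
```

After this, the names are still available through rownames(college_csv), but functions such as summary() and pairs() only see the actual variables.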

Question-c

i. Use the summary() function to produce a numerical summary of the variables in the data set:

summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10]:

pairs(college[,1:10])

iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private:

plot(Outstate ~ Private, data = college, col =c("green", "blue"))

iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite:

Elite=rep("No",nrow(college ))
Elite[college$Top10perc >50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college , Elite)

summary(college$Elite)
##  No Yes 
## 699  78
plot(Outstate ~ Elite, data = college, col =c("green","red"))

v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables:
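No code was given for this part, so here is one possible sketch. The choice of variables and bin counts is arbitrary and only meant as an illustration; it assumes the ISLR package is installed.

```r
library("ISLR")
college <- na.omit(College)

# Four histograms on one page, each with a different requested number of bins
par(mfrow = c(2, 2))
# breaks is only a suggestion: hist() rounds it to "pretty" cut points
hist(college$Apps,      breaks = 50, xlab = "Applications received",
     main = "Apps (about 50 bins)")
hist(college$Outstate,  breaks = 20, xlab = "Out-of-state tuition",
     main = "Outstate (about 20 bins)")
hist(college$Grad.Rate, breaks = 10, xlab = "Graduation rate",
     main = "Grad.Rate (about 10 bins)")
hist(college$S.F.Ratio, breaks = 30, xlab = "Student/faculty ratio",
     main = "S.F.Ratio (about 30 bins)")
# Restore the default single-plot layout
par(mfrow = c(1, 1))
```

Trying a few values of breaks for the same variable is a quick way to see how much the apparent shape of a distribution depends on the binning.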

vi. Continue exploring the data, and provide a brief summary of what you discover:

# Acceptance rate for each college
Accept_Rate =  college$Accept/ college$Apps
# Rough proxy for the cost of attending each college. Note that Expend is the
# instructional expenditure per student, which is not a charge paid by students.
Cost = college$Room.Board + college$Outstate +college$Books + college$Personal +
       college$Expend 
# Adding the two newly created variables to the data frame
college = data.frame(college, Accept_Rate, Cost)

par(mfrow=c(1,2))

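# Acceptance rate against Cost (scatterplot) and against Elite status (boxplots),
# each with the fitted simple regression overlaid in red.
plot(college$Accept_Rate ~ college$Cost, xlab ="Cost of Attendance", 
                                         ylab ="Acceptance Rate"    )

lm_Cost <- lm(Accept_Rate ~ Cost, data = college)
abline(lm_Cost, col ="red")


# Elite is a factor, so this plot() call draws boxplots at x = 1 and x = 2
plot(college$Accept_Rate ~ college$Elite, xlab ="Elite Status", 
                                          ylab ="Acceptance Rate")

lm_Elite <- lm(Accept_Rate ~ Elite, data = college)
# abline() would misplace the line here, because the model codes Elite as 0/1
# while the boxes sit at 1/2; draw the fitted group means at the boxes instead
elite_means <- predict(lm_Elite, data.frame(Elite = levels(college$Elite)))
lines(1:2, elite_means, col ="red", type ="b", pch = 19)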

As expected, the Acceptance Rate is negatively related to both the Cost of Attendance and the Elite status. A multiple linear regression model with these two predictors gives us the following results.

## 
## Call:
## lm(formula = Accept_Rate ~ Cost + Elite, data = college)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45336 -0.07080  0.02147  0.08710  0.37536 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.449e-01  1.558e-02  54.228  < 2e-16 ***
## Cost        -3.044e-06  5.984e-07  -5.086 4.59e-07 ***
## EliteYes    -1.774e-01  1.809e-02  -9.808  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1285 on 774 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2373 
## F-statistic: 121.7 on 2 and 774 DF,  p-value: < 2.2e-16

This tells us that about 24% of the variation in the Acceptance Rate is explained by the Cost of Attendance and the Elite status of the university. The Elite status is a binary qualitative variable (Yes, No) and the model chooses ‘No’ as its baseline, so the EliteYes coefficient says that Elite schools have an acceptance rate about 0.18 (18 percentage points) lower than non-Elite schools with the same Cost. The p-values indicate strong significance, which supports (but does not prove) our assumption about the relationship between these two predictors and the response variable.
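The code behind the regression summary was not shown, so here is a sketch reconstructed from the Call line in the output. It rebuilds the three derived variables so that it runs on its own; the name lm_both is my own choice, and the ISLR package is assumed to be installed.

```r
library("ISLR")
college <- na.omit(College)

# Rebuild the derived variables used in the model
college$Elite       <- as.factor(ifelse(college$Top10perc > 50, "Yes", "No"))
college$Accept_Rate <- college$Accept / college$Apps
college$Cost        <- with(college, Room.Board + Outstate + Books + Personal + Expend)

# Multiple linear regression with both predictors
lm_both <- lm(Accept_Rate ~ Cost + Elite, data = college)
summary(lm_both)

# The baseline of a factor is its first level; the other level gets a dummy variable
levels(college$Elite)

# Share of the variation in Accept_Rate explained by the model
summary(lm_both)$r.squared
```

If a different baseline were wanted, relevel(college$Elite, ref = "Yes") would make ‘Yes’ the reference level; the fit would be the same, only the sign and label of the Elite coefficient would change.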

Do Elite schools have, on average, a greater percentage of alumni who donate? My assumption is that the answer is ‘YES’ (e.g., Harvard University probably has a greater perc.alumni value than most other schools, elite or not).

Based on side-by-side boxplots of perc.alumni by Elite status, we would infer that Elite universities do have a higher percentage (on average) of alumni who donate.
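A sketch of how such a comparison can be produced, following the same plot() pattern used earlier for Outstate. The axis labels are my own wording, and the ISLR package is assumed to be installed.

```r
library("ISLR")
college <- na.omit(College)
college$Elite <- as.factor(ifelse(college$Top10perc > 50, "Yes", "No"))

# Side-by-side boxplots of the percentage of alumni who donate, by Elite status
plot(perc.alumni ~ Elite, data = college, col = c("green", "red"),
     xlab = "Elite Status", ylab = "Percent of alumni who donate")

# Group means and medians as a numerical check on what the plot suggests
aggregate(perc.alumni ~ Elite, data = college, FUN = mean)
tapply(college$perc.alumni, college$Elite, median)
```

The boxplots compare medians and spread, while the question is about averages, so it is worth looking at the group means as well.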

I also wanted to do some data sorting, to find:

The school that is: lowest acceptance rate & most costly & Private & Elite

The school that is: lowest acceptance rate & least costly & Private & Elite

The school that is: lowest acceptance rate & least costly & Public & Elite

However, I ran into an issue: only the first sorting criterion seemed to have any effect on the order. An example of what I mean:

attach(mtcars)
newdata <- mtcars[order(-mpg, -hp),] 
head(newdata)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
newdata <- mtcars[order(-mpg, hp),] 
head(newdata)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
detach(mtcars)

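At first glance the two outputs look the same, but they are not: Lotus Europa and Honda Civic, which are tied at mpg = 30.4, swap places. order() only consults the second key to break ties in the first, so rows with distinct mpg values keep the same order whatever the second rule says. This means a request like "lowest acceptance rate & most costly" cannot be answered by a multi-key sort alone when the first key has almost no ties, and I will look into this further. REMEMBER, I AM NEW TO THIS. If you read this far and happen to know a better way to do it, let me know via comment. Also, I welcome criticism.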
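To make the tie-breaking behaviour of order() easier to see, here is a small sketch on the same mtcars data. It avoids attach() by referring to the columns explicitly; the object names are my own, and the last part (filter first, then sort) is just one possible approach to the kind of question asked above.

```r
# Sort by mpg descending, then by hp descending within ties on mpg
by_mpg_hp_desc <- mtcars[order(-mtcars$mpg, -mtcars$hp), ]
# Same primary key, but hp ascending within ties on mpg
by_mpg_hp_asc  <- mtcars[order(-mtcars$mpg,  mtcars$hp), ]

# Rows 3 and 4 share mpg = 30.4, so they are the only ones among the
# first six rows that change places
rownames(by_mpg_hp_desc)[3:4]  # "Lotus Europa" "Honda Civic"
rownames(by_mpg_hp_asc)[3:4]   # "Honda Civic"  "Lotus Europa"

# For "best X among rows that satisfy some conditions", filter first, then sort
manual_4cyl <- mtcars[mtcars$am == 1 & mtcars$cyl == 4, ]
manual_4cyl <- manual_4cyl[order(manual_4cyl$qsec, -manual_4cyl$hp), ]
head(manual_4cyl, 1)
```

The same pattern should carry over to the college data: keep only the rows where Private and Elite are both "Yes", then order the result by Accept_Rate. Because Accept_Rate is continuous and rarely tied, a second key such as Cost will almost never change the order, so "lowest acceptance rate & most costly" probably needs a different definition (for example, a combined rank) rather than a second sorting rule.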


Ahmed TADDE