Assignments for the course focus on practical aspects of the concepts covered in the lectures. Assignments are based on the material covered in James et al. Normally you will start working on the assignments after class. Think of the assignments as the most practical, and also the best, way to learn machine learning!
Before we dive into our first exercise, let’s become a bit more familiar with the programming tools used in this course.
We will write our annotated R code using Markdown.
Markdown is a simple formatting syntax to generate HTML or PDF documents. In combination with R, it will generate a document that includes the comments, the R code, and the output of running such code.
You can embed R code in chunks like this one:
1 + 1
[1] 2
You can run each chunk of code one by one, by highlighting the code and clicking Run (or pressing Ctrl + Enter in Windows or command + enter in OS X). You can see the output of the code in the console right below, inside the RStudio window.
Alternatively, you can generate (or knit) an HTML document with all the code, comments, and output in the entire .Rmd file by clicking on Knit HTML. The Notebook contains HTML output and embedded code. You can generate it by clicking Preview after running all the code chunks.
You can also embed plots and graphics, for example:
x <- c(1, 3, 4, 5)
y <- c(2, 6, 8, 10)
plot(x, y)
If you run the chunk of code, the plot will be generated in the panel in the bottom right corner. If instead you knit the entire file, the plot will appear in the resulting HTML document.
Using R + Markdown has several advantages: it leaves an “audit trail” of your work, including documentation explaining the steps you made. This is helpful to not only keep your own progress organised, but also make your work reproducible and more transparent. You can easily correct errors (just fix them and run the script again), and after you have finished, you can generate a PDF or HTML version of your work.
We will be exploring R through R Markdown over the next weeks. For more details and documentation see http://rmarkdown.rstudio.com. R (or Python) Notebooks are also the only acceptable format for your assignments!
Follow the instructions in the class material and install R and RStudio. If you feel more comfortable using the basic R terminal, skip the step of installing RStudio and the corresponding chunk.
Now run the following code to make sure that you have the current version of R.
version$version.string
[1] "R version 3.5.2 (2018-12-20)"
This chunk should return something like R version 3.5.1 (2018-07-02).
rstudioapi::versionInfo()$version
This chunk should print 1.1.423.
If they do not, then try to get as close to the current versions as possible!
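If you prefer to check the versions programmatically rather than by eye, a minimal sketch follows. The minimum version used here is an illustrative assumption, not a course requirement, and the RStudio check is guarded so that it is skipped in the basic R terminal.

```r
# Illustrative threshold only (an assumption, not a course requirement).
min_r <- "3.5.0"

# getRversion() returns a version object, so >= compares version parts
# (major, minor, patch) rather than comparing strings alphabetically.
r_ok <- getRversion() >= min_r
message("Running R ", as.character(getRversion()),
        if (r_ok) " (recent enough)" else " (consider updating)")

# The RStudio version is only available when the code runs inside RStudio.
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  print(rstudioapi::versionInfo()$version)
}
```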
For this we will use the automobile dataset from the James et al. text. This can be found in the ISLR package in R.
Start by installing and loading this package:
install.packages("ISLR")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/ISLR_1.2.tgz'
Content type 'application/x-gzip' length 2924917 bytes (2.8 MB)
==================================================
downloaded 2.8 MB
The downloaded binary packages are in
/var/folders/s8/9bjpbpqd4pv_f0rfm_v4t_jr0000gn/T//RtmpgGuJB6/downloaded_packages
library(ISLR)
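Calling install.packages() every time the file is knitted downloads the package again. One common pattern is to install only when the package is missing. The helper below, ensure_package(), is a hypothetical name introduced for illustration and is not part of any package.

```r
# ensure_package() is a hypothetical helper written for this example.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (without an error) if pkg is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is available now.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers a download.
ensure_package("stats")

# For this course you would call ensure_package("ISLR") and then library(ISLR).
```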
Now we can regress miles-per-gallon on the weight of the vehicle and the number of cylinders.
data(Auto)
with(Auto, lm(mpg ~ weight + cylinders))
Call:
lm(formula = mpg ~ weight + cylinders)
Coefficients:
(Intercept) weight cylinders
46.292310 -0.006347 -0.721378
Why did we need the with() wrapper?
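The with() wrapper evaluates the lm() call inside the Auto dataset, so that the names mpg, weight and cylinders in the formula are looked up as columns of Auto. Without it (or an equivalent data = Auto argument), R would search the workspace for objects with those names and would usually fail to find them.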
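To see that with() is only one of several ways to tell lm() where the variables live, here is a small sketch. It uses the built-in mtcars data as a stand-in for Auto (an assumption made so the example runs without extra packages), with wt and cyl playing the roles of weight and cylinders.

```r
# mtcars is a built-in stand-in for Auto: mpg, wt (weight), cyl (cylinders).
fit_with   <- with(mtcars, lm(mpg ~ wt + cyl))          # names resolved by with()
fit_data   <- lm(mpg ~ wt + cyl, data = mtcars)         # names resolved by data =
fit_dollar <- lm(mtcars$mpg ~ mtcars$wt + mtcars$cyl)   # columns spelled out

# The first two fits are identical, including coefficient names.
all.equal(coef(fit_with), coef(fit_data))

# The third gives the same numbers, but with longer coefficient names.
all.equal(unname(coef(fit_with)), unname(coef(fit_dollar)))
```

The data = form is usually the most convenient, because the fitted object remembers the dataset and works smoothly with predict().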
This exercise relates to the College data set, which can be found in the file College.csv on the website for the main course textbook (James et al 2013) http://www-bcf.usc.edu/~gareth/ISL/data.html. It contains a number of variables for 777 different universities and colleges in the US.
The variables are:
* Private : Public/private indicator
* Apps : Number of applications received
* Accept : Number of applicants accepted
* Enroll : Number of new students enrolled
* Top10perc : New students from top 10% of high school class
* Top25perc : New students from top 25% of high school class
* F.Undergrad : Number of full-time undergraduates
* P.Undergrad : Number of part-time undergraduates
* Outstate : Out-of-state tuition
* Room.Board : Room and board costs
* Books : Estimated book costs
* Personal : Estimated personal spending
* PhD : Percent of faculty with Ph.D.’s
* Terminal : Percent of faculty with terminal degree
* S.F.Ratio : Student/faculty ratio
* perc.alumni : Percent of alumni who donate
* Expend : Instructional expenditure per student
* Grad.Rate : Graduation rate
Before reading the data into R, it can be viewed in Excel or a text editor, if you find that convenient.
Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data. You can load this in R directly from the website, using:
college <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")
Look at the data using the View() function. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:
rownames(college) <- college[, 1]
View(college)
You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try
college <- college[, -1]
View(college)
Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.
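The two steps above (copy the first column into the row names, then drop it) can also be done while reading the file, using the row.names argument of read.csv(). The sketch below uses a tiny made-up CSV passed as text, so the college names and numbers are purely illustrative.

```r
# Made-up data for illustration; not taken from College.csv.
csv_text <- "Name,Private,Apps
College A,Yes,1200
College B,No,3400
College C,Yes,560"

# row.names = 1 tells read.csv() to use the first column as row names
# instead of keeping it as a data column.
toy <- read.csv(text = csv_text, row.names = 1)

rownames(toy)  # the names are now row labels
names(toy)     # only the real data columns remain
```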
Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
X Private Apps
Abilene Christian University: 1 No :212 Min. : 81
Adelphi University : 1 Yes:565 1st Qu.: 776
Adrian College : 1 Median : 1558
Agnes Scott College : 1 Mean : 3002
Alaska Pacific University : 1 3rd Qu.: 3624
Albertson College : 1 Max. :48094
(Other) :771
Accept Enroll Top10perc Top25perc
Min. : 72 Min. : 35 Min. : 1.00 Min. : 9.0
1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00 1st Qu.: 41.0
Median : 1110 Median : 434 Median :23.00 Median : 54.0
Mean : 2019 Mean : 780 Mean :27.56 Mean : 55.8
3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00 3rd Qu.: 69.0
Max. :26330 Max. :6392 Max. :96.00 Max. :100.0
F.Undergrad P.Undergrad Outstate Room.Board
Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780
1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
Median : 1707 Median : 353.0 Median : 9990 Median :4200
Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358
3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
Max. :31643 Max. :21836.0 Max. :21700 Max. :8124
Books Personal PhD Terminal
Min. : 96.0 Min. : 250 Min. : 8.00 Min. : 24.0
1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0
Median : 500.0 Median :1200 Median : 75.00 Median : 82.0
Mean : 549.4 Mean :1341 Mean : 72.66 Mean : 79.7
3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0
Max. :2340.0 Max. :6800 Max. :103.00 Max. :100.0
S.F.Ratio perc.alumni Expend Grad.Rate
Min. : 2.50 Min. : 0.00 Min. : 3186 Min. : 10.00
1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
Median :13.60 Median :21.00 Median : 8377 Median : 65.00
Mean :14.09 Mean :22.74 Mean : 9660 Mean : 65.46
3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
Max. :39.80 Max. :64.00 Max. :56233 Max. :118.00
Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10].
pairs(college[,1:10])
Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
plot(college$Private, college$Outstate)
Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
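The same binning can be written in one step with ifelse(). The sketch below uses a short made-up vector of percentages rather than the College data, so the values are assumptions for illustration only.

```r
# Made-up Top10perc values for five hypothetical colleges.
top10perc <- c(12, 55, 50, 78, 30)

# ifelse() is vectorised: each element is tested against the 50% threshold.
# Setting levels explicitly keeps "No" as the first (reference) level.
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```

Note that a value of exactly 50 is classified as "No", because the condition is a strict inequality, as in the exercise.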
Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
summary(college$Elite)
No Yes
699 78
plot(college$Elite, college$Outstate, main = "Plot of Outstate vs. Elite", xlab = "Elite", ylab = "Outstate")
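When the first argument of plot() is a factor, R draws boxplots, which is why the calls above produce side-by-side boxes. An equivalent and more explicit form uses boxplot() with a formula. The sketch below uses the built-in mtcars data as a stand-in (an assumption), with the 0/1 variable am relabelled to play the role of a Yes/No indicator.

```r
# mtcars stands in for college; am (0/1) is relabelled as a No/Yes factor.
groups <- factor(mtcars$am, labels = c("No", "Yes"))

# The formula reads "mpg split by groups"; one box is drawn per factor level.
bp <- boxplot(mtcars$mpg ~ groups, xlab = "Group", ylab = "mpg")

# boxplot() invisibly returns the statistics it plotted.
bp$names   # labels of the boxes
bp$stats   # whisker ends, hinges and median for each box (one column per box)
```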
Continue exploring the data, and provide a brief summary of what you discover.
par(mfrow = c(2,2))
plot(college$Outstate, college$Room.Board, xlab = "Outstate", ylab = "Room and board costs")
plot(college$Outstate, college$Personal, xlab = "Outstate", ylab = "Personal spending")
plot(Elite, college$Room.Board, xlab = "Elite", ylab = "Room.Board")
plot(Elite, college$Personal, xlab = "Elite", ylab = "Personal")
Out-of-state tuition fees have a positive relationship with room and board costs. However, personal spending seems comparable across individuals regardless of whether they attend an elite college with higher tuition fees.
par(mfrow = c(2,2))
plot(Elite, college$Accept/college$Apps, xlab = "Elite", ylab = "Accept/Apps")
plot(Elite, college$Grad.Rate, xlab = "Elite", ylab = "Grad.Rate")
plot(Elite, college$PhD, xlab = "Elite", ylab = "PhD")
plot(Elite, college$S.F.Ratio, xlab = "Elite", ylab = "S.F.Ratio")
Elite colleges have lower acceptance rates and higher graduation rates than non-elite colleges. Elite colleges also have a higher percentage of faculty with PhDs and a lower student-faculty ratio compared to non-elite colleges.
par(mfrow = c(1,2))
plot(college$Top10perc, college$Grad.Rate, xlab = "Top10perc", ylab = "Grad.Rate")
hist(college$Grad.Rate, main = NULL, xlab = "Grad.Rate")
However, colleges with the highest share of students from the top 10% of their high school class do not necessarily have the highest graduation rate. In addition, a closer inspection of the graduation rate reveals a maximum above 100% (118 in the summary output), which is likely to be erroneous.
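Implausible values such as a graduation rate above 100 are easy to locate with logical indexing. The data frame below is a made-up example with the same column name as the College data; the values are assumptions for illustration.

```r
# Made-up data; only the column name Grad.Rate mirrors the College data.
toy <- data.frame(
  college   = c("A", "B", "C", "D"),
  Grad.Rate = c(65, 118, 100, 42)
)

# Keep only the rows where the rate exceeds the largest possible value.
suspect <- toy[toy$Grad.Rate > 100, ]
suspect
```

Applied to the real data, the same condition (college$Grad.Rate > 100) would show which institutions to double-check.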
This exercise involves the Auto data set available as Auto.csv from the website for the main course textbook James et al. http://www-bcf.usc.edu/~gareth/ISL/data.html. Make sure that the missing values have been removed from the data. You should load that dataset as the first step of the exercise.
Auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", stringsAsFactors = FALSE, header=TRUE, na.strings="?")
View(Auto)
any(is.na(Auto))
[1] FALSE
sum(is.na(Auto))
[1] 0
summary(Auto)
mpg cylinders displacement horsepower
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
Median :22.75 Median :4.000 Median :151.0 Median : 93.5
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
weight acceleration year origin
Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
Median :2804 Median :15.50 Median :76.00 Median :1.000
Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
name
amc matador : 5
ford pinto : 5
toyota corolla : 5
amc gremlin : 4
amc hornet : 4
chevrolet chevette: 4
(Other) :365
dim(Auto)
[1] 392 9
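The raw Auto.csv file contains 5 missing values, all in horsepower and coded as ?. The output shown here reports none and already has 392 rows, which suggests it was produced from a copy of the data in which those rows had already been dropped.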
Auto_clean = na.omit(Auto)
dim(Auto_clean)
[1] 392 9
na.omit() drops every row that contains a missing value, so the clean dataset has 392 rows and no missing values.
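To make the role of na.strings = "?" concrete, here is a small sketch on made-up CSV text (the numbers are assumptions for illustration). Without na.strings, a column containing ? would be read as text rather than as numbers.

```r
# Made-up data in which missing horsepower values are coded as "?".
csv_text <- "mpg,horsepower
18,130
25,?
30,95
21,?"

# na.strings = "?" converts every "?" into NA while reading.
raw <- read.csv(text = csv_text, na.strings = "?")

colSums(is.na(raw))   # number of missing values per column
clean <- na.omit(raw) # drop rows with at least one NA
dim(raw)
dim(clean)
```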
str(Auto_clean)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : num 3504 3693 3436 3433 3449 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : num 70 70 70 70 70 70 70 70 70 70 ...
$ origin : num 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
View(Auto_clean)
Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year
Qualitative predictors: origin, name
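R can report which columns are stored as numbers, but storage type is not the same as being quantitative: origin is stored as a number even though it is a category code. The sketch below uses a made-up data frame, and the level labels are placeholders rather than the real origin names.

```r
# Made-up data; origin holds category codes stored as numbers.
toy <- data.frame(
  mpg    = c(18, 25, 30),
  origin = c(1, 3, 2),
  name   = c("car a", "car b", "car c"),
  stringsAsFactors = FALSE
)

sapply(toy, is.numeric)  # origin is reported as numeric at this point

# Converting the codes to a factor makes R treat origin as qualitative.
# The labels are placeholders chosen for this example.
toy$origin <- factor(toy$origin, levels = 1:3,
                     labels = c("origin 1", "origin 2", "origin 3"))

sapply(toy, is.numeric)  # only mpg remains numeric
```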
Range of each quantitative predictor, computed with the range() function.
sapply(Auto_clean[,1:7], range)
mpg cylinders displacement horsepower weight acceleration year
[1,] 9.0 3 68 46 1613 8.0 70
[2,] 46.6 8 455 230 5140 24.8 82
Mean
sapply(Auto_clean[,1:7], mean)
mpg cylinders displacement horsepower weight
23.445918 5.471939 194.411990 104.469388 2977.584184
acceleration year
15.541327 75.979592
Standard deviation
sapply(Auto_clean[,1:7], sd)
mpg cylinders displacement horsepower weight
7.805007 1.705783 104.644004 38.491160 849.402560
acceleration year
2.758864 3.683737
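The three sapply() calls above can be combined into a single table by returning a named vector from the function. The sketch below uses a few numeric columns of the built-in mtcars data as a stand-in for Auto_clean (an assumption, so that the example runs on its own).

```r
# A few numeric columns of mtcars stand in for Auto_clean[, 1:7].
num_cols <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Returning a named vector makes sapply() build a matrix:
# one row per statistic, one column per variable.
stats <- sapply(num_cols, function(x) {
  c(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
})

round(stats, 2)
```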
subAuto = Auto_clean[-(10:85),]
summary(subAuto)
mpg cylinders displacement horsepower
Min. :11.00 Min. :3.000 Min. : 68.0 Min. : 46.0
1st Qu.:18.00 1st Qu.:4.000 1st Qu.:100.2 1st Qu.: 75.0
Median :23.95 Median :4.000 Median :145.5 Median : 90.0
Mean :24.40 Mean :5.373 Mean :187.2 Mean :100.7
3rd Qu.:30.55 3rd Qu.:6.000 3rd Qu.:250.0 3rd Qu.:115.0
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
weight acceleration year origin
Min. :1649 Min. : 8.50 Min. :70.00 Min. :1.000
1st Qu.:2214 1st Qu.:14.00 1st Qu.:75.00 1st Qu.:1.000
Median :2792 Median :15.50 Median :77.00 Median :1.000
Mean :2936 Mean :15.73 Mean :77.15 Mean :1.601
3rd Qu.:3508 3rd Qu.:17.30 3rd Qu.:80.00 3rd Qu.:2.000
Max. :4997 Max. :24.80 Max. :82.00 Max. :3.000
name
Length:316
Class :character
Mode :character
View(subAuto)
Range
sapply(subAuto[,1:7], range)
mpg cylinders displacement horsepower weight acceleration year
[1,] 11.0 3 68 46 1649 8.5 70
[2,] 46.6 8 455 230 4997 24.8 82
Mean
sapply(subAuto[,1:7], mean)
mpg cylinders displacement horsepower weight
24.404430 5.373418 187.240506 100.721519 2935.971519
acceleration year
15.726899 77.145570
Standard deviation
sapply(subAuto[,1:7], sd)
mpg cylinders displacement horsepower weight
7.867283 1.654179 99.678367 35.708853 811.300208
acceleration year
2.693721 3.106217
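The subset above relies on negative indexing: -(10:85) drops rows 10 through 85 and keeps everything else. A quick sketch of the arithmetic and of the indexing itself, on a made-up data frame:

```r
# Rows 10 to 85 inclusive are 76 rows, so 392 - 76 = 316 rows remain.
dropped <- 10:85
length(dropped)
392 - length(dropped)

# The same idea on a tiny made-up data frame: drop rows 3 to 5.
toy <- data.frame(id = 1:8)
kept <- toy[-(3:5), , drop = FALSE]
kept$id

# The parentheses matter: -3:5 is the sequence -3, -2, ..., 5, which mixes
# negative and positive indices and is rejected by R.
```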
hist(Auto$mpg, main = "Histogram of mpg", xlab = "mpg")
Relationship of predictors with mpg
par(mfrow = c(2,2))
plot(Auto$displacement, Auto$mpg, xlab = "displacement", ylab = "mpg")
plot(Auto$horsepower, Auto$mpg, xlab = "horsepower", ylab = "mpg")
plot(Auto$weight, Auto$mpg, xlab = "weight", ylab = "mpg")
plot(Auto$acceleration, Auto$mpg, xlab = "acceleration", ylab = "mpg")
par(mfrow = c(1,2))
boxplot(Auto$mpg ~ Auto$cylinders, xlab = "cylinders", ylab = "mpg")
boxplot(Auto$mpg ~ Auto$year, xlab = "year", ylab = "mpg")
Overall, most predictors appear to have a non-linear relationship with mpg. Higher levels of displacement, horsepower and weight are associated with lower fuel efficiency, whereas the relationship between acceleration and mpg is weaker and less clear. The effect of the number of cylinders on mpg appears to diminish as the number of cylinders increases. Older cars also tend to be less efficient.
Relationships between predictors
par(mfrow = c(2,2))
plot(Auto$cylinders, Auto$displacement)
plot(Auto$displacement, Auto$horsepower)
plot(Auto$cylinders, Auto$weight)
plot(Auto$acceleration, Auto$horsepower)
Some of the predictors appear to be correlated with each other. The number of cylinders is positively correlated with displacement, as engine displacement depends on the number of cylinders. The number of cylinders also exhibits a positive relationship with weight. Displacement is generally positively related to horsepower, while acceleration seems to be negatively related to horsepower.
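The visual impressions from the scatterplots can be backed up with a correlation matrix. The sketch below uses the built-in mtcars data as a stand-in for Auto (an assumption), where cyl, disp, hp, wt and qsec roughly correspond to cylinders, displacement, horsepower, weight and acceleration time.

```r
# mtcars columns used as rough stand-ins for the Auto predictors.
preds <- mtcars[, c("cyl", "disp", "hp", "wt", "qsec")]

# cor() returns the matrix of pairwise Pearson correlations.
cor_mat <- cor(preds)
round(cor_mat, 2)
```

Pearson correlation only measures linear association, so it is a complement to the plots rather than a replacement for them.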
Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
As most of the predictors except for name and origin exhibit some relationship with mpg, they could be used to predict gas mileage. However, we should avoid including too many of these predictors in a single model (e.g. a multiple linear regression), as several predictors are correlated with each other. The resulting multicollinearity inflates the variance of the coefficient estimates and makes individual effects hard to interpret. Other model designs should be considered to control for correlations between predictors. As the predictors seem to have a non-linear relationship with mpg, it may also help to use the logarithm of mpg as the dependent variable in our models.
The predictors name and origin do not have enough observations per name or origin. Including these predictors may result in overfitting.
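One way to quantify the multicollinearity concern raised above is the variance inflation factor (VIF), which is 1 / (1 - R²) from regressing one predictor on the others. This is a technique brought in for illustration, not part of the exercise. The sketch computes it by hand on the built-in mtcars data (a stand-in for Auto, assumed here so that no extra package is needed).

```r
# Auxiliary regression: how well do the other predictors explain wt?
aux <- lm(wt ~ cyl + disp, data = mtcars)
r2  <- summary(aux)$r.squared

# VIF for wt in a model that contains wt, cyl and disp.
# A value of 1 means no collinearity; larger values mean more inflation.
vif_wt <- 1 / (1 - r2)
vif_wt
```

A common rule of thumb treats values above 5 or 10 as a sign of problematic collinearity, although the threshold is a convention rather than a strict rule.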