This notebook was prepared by Everton Lima.
For a very large number of observations and a small set of predictors, the variance of quantities estimated from the data is small. For any predictor p the variance is defined as \(Var(p) = \frac{1}{n}\cdot\sum_{i=1}^n(p_i - \mu)^{2}\), where \(\mu\) is the mean of p; for independent observations the variance of the estimated mean is \(Var(\hat{\mu}) = Var(p)/n\), which is inversely proportional to n. Thus I would expect a more flexible model to perform better, since it can achieve a low bias without being heavily penalized by the variance of its estimates.
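A quick simulation (a sketch, not part of the original analysis) illustrates the point: the variance of the sample mean falls off roughly as 1/n as the number of observations grows.

set.seed(1)
sapply(c(10, 100, 10000), function(n) {
  # variance of the sample mean across 1000 simulated samples of size n
  var(replicate(1000, mean(rnorm(n))))
})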
If the number of predictors is extremely large and the number of observations is small, then there is a higher variance associated with any estimate of f, so an inflexible model would be preferred. Many machine learning methods are also not applicable in this setting; for example, if there are more predictors than observations, the coefficients of a linear regression model cannot all be estimated.
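A minimal illustration (with made-up data, not taken from the exercises): when there are more predictors than observations, lm() cannot estimate all of the coefficients and reports NA for the surplus ones.

set.seed(1)
x <- matrix(rnorm(5 * 10), nrow = 5)   # 5 observations, 10 predictors
y <- rnorm(5)
coef(lm(y ~ x))                        # coefficients beyond the first few are NA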
A flexible model will perform better.
An inflexible model will have better performance.
This is an inference problem: we are interested in which factors affect CEO salary. It is a regression setting with 500 observations and 3 predictors.
This is a prediction (classification) problem. There are 20 observations and 13 predictors.
Here we are interested in predicting the target variable, so this is a regression problem focused on prediction. There are 3 predictors, and the number of observations is the number of weeks in 2012 (52).
library('ggplot2')
library('reshape2')
# Schematic curves only; the x-axis represents model flexibility.
xrange <- 0:1000
bias <- 10 - xrange                     # (squared) bias decreases with flexibility
variance <- xrange                      # variance increases with flexibility
testError <- 6 + (xrange - 5)^2         # U-shaped test error
trainError <- sapply(xrange, function(x) { if (x <= 5) 5.5 + (x - 5)^2 else 11 - x })  # training error keeps decreasing
bayesError <- rep(5.5, length(xrange))  # horizontal irreducible (Bayes) error
data <- data.frame(xrange, variance, bias, trainError, testError, bayesError)
data_long <- melt(data, id.vars = "xrange")
ggplot(data_long, aes(x = xrange, y = value, colour = variable)) + ylim(2, 10) + xlim(0, 10) + geom_line(na.rm = TRUE) +
  theme(axis.line = element_blank(), axis.text.x = element_blank(),
        axis.text.y = element_blank(), axis.ticks = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.position = "none",
        panel.background = element_blank(), panel.border = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        plot.background = element_blank())
Variance increases with flexibility. The true relationship between variance and model flexibility is likely non-linear, but here it is drawn as a straight line; this is still a reasonable schematic, since it makes clear that an increase in flexibility brings an increase in variance.
Bias has an inverse relationship with flexibility: as flexibility increases, bias decreases (drawn here as a linear decrease).
The test error is represented by a 'U'-shaped curve. An increase in flexibility typically improves model performance at first; performance only worsens later, once the extra variance introduced by a more flexible model is no longer offset by the decrease in bias.
The training error curve closely follows the test error curve initially, but as flexibility increases it continues to decrease. With a high level of flexibility the training error can be driven to zero, but in such cases the model is overfitted: it provides no insight into the underlying behavior of \(f\) and instead models the noise.
The Bayes error is represented by a horizontal line. This is the irreducible error, and it marks the lower bound of the expected test error. From this it can be inferred that a model with no training error is typically modeling noise, since the expected error on new data cannot reasonably be lower than the irreducible error (unless there is no noise inherent in the data, which is unlikely).
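The same behaviour can be reproduced on simulated data (a sketch with invented data, not part of the exercise): as the degree of a polynomial fit increases, the training error keeps falling while the test error tends to follow a U shape.

set.seed(1)
x <- runif(100, -2, 2)
y <- sin(2 * x) + rnorm(100, sd = 0.3)   # noisy non-linear signal
train <- sample(100, 50)
t(sapply(1:10, function(d) {
  fit <- lm(y ~ poly(x, d), subset = train)
  c(degree = d,
    trainMSE = mean(residuals(fit)^2),
    testMSE = mean((y[-train] - predict(fit, data.frame(x = x[-train])))^2))
}))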
In classification problems the target variable is typically nominal. Classifying emails as spam or not spam is a classic application of classification; another well-known application is predicting whether a bank customer will default on their loan. However, target variables can also be ordinal, a typical application being rating or score prediction.
There is a wide range of regression applications in machine learning. Its applications long predate machine learning: regression was first used to predict the orbits of bodies around the sun. Other applications are predicting stock market percentage changes and housing prices.
Clustering is an important technique for knowledge discovery. Clustering can be used to provide user suggestions by finding similar customers in a web store or on a movie streaming site.
Flexible models make less strict (or no) assumptions about the form of f, while inflexible models have a strong bias with regard to the form of f.
Flexible models should be preferred when there is a large amount of data available, because with many observations the variance of the fitted model is smaller. Moreover, flexible models generally make better predictive models.
Inflexible models should be used when the number of observations is small, or when it is known that there is high variance associated with the observations. This type of model is also better suited to inference.
Parametric methods make explicit assumptions about the form of f, while non-parametric methods do not. Parametric methods therefore suffer from a higher bias, and thus are less flexible than non-parametric methods. This class of model is well suited to inference, since the model provides an explicit relationship between the target variable and the set of predictors.
Non-parametric models are much more flexible than parametric models, since no form of f is assumed; the goal for this class of model is to estimate f as closely as possible from the observed data. However, they suffer from several disadvantages compared to parametric methods. Typically many more observations are needed (think of K-nearest neighbors). Moreover, this class of model does not provide an explicit relationship between the target variable and the predictors, making non-parametric models harder to interpret.
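A small illustration of the distinction (simulated data assumed for this sketch): lm() commits to a linear form for f, while loess() lets the data determine the shape of the fit.

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
parametric <- lm(y ~ x)          # assumes f is linear in x
nonparametric <- loess(y ~ x)    # no explicit functional form assumed
plot(x, y, col = "grey")
ord <- order(x)
lines(x[ord], fitted(parametric)[ord], col = "red")
lines(x[ord], fitted(nonparametric)[ord], col = "blue")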
train <- data.frame(X1=c(0,2,0,0,-1,1),X2=c(3,0,1,1,0,1),X3=c(0,0,3,2,1,1),Y=c("Red","Red","Red","Green","Green","Red"))
print(train)
## X1 X2 X3 Y
## 1 0 3 0 Red
## 2 2 0 0 Red
## 3 0 1 3 Red
## 4 0 1 2 Green
## 5 -1 0 1 Green
## 6 1 1 1 Red
scores = apply(train[,-4],1,function(x){ sqrt(sum((c(0,0,0)-x)^2)) })   # Euclidean distance from the test point X1 = X2 = X3 = 0
names(scores) = 1:6
print(scores)
## 1 2 3 4 5 6
## 3.000000 2.000000 3.162278 2.236068 1.414214 1.732051
With K = 1 the prediction is Green: observation 5 has the smallest Euclidean distance from the test point, and its class is Green.
With K = 3 the prediction is Red. The three nearest points are 5, 6, and 2; since 2 and 6 are both Red, the majority class, and therefore the prediction, is Red.
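The K = 3 vote can also be read directly off the distances computed above (a short sketch using the objects already defined):

k3 <- order(scores)[1:3]   # indices of the three nearest neighbours (5, 6, 2)
table(train$Y[k3])         # the majority class is the K = 3 prediction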
If the Bayes decision boundary is highly non-linear, I would expect the value for K to be small. Small values for K provide a more flexible boundary.
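A brief sketch of this effect on simulated two-class data (using knn from the class package, which is assumed to be available): with K = 1 the fitted boundary follows individual training points, while a large K produces a much smoother, more rigid boundary.

library('class')
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(x[,1]^2 + x[,2]^2 > 1, "Red", "Green"))   # circular (non-linear) class boundary
grid <- expand.grid(X1 = seq(-3, 3, length.out = 100), X2 = seq(-3, 3, length.out = 100))
pred1 <- knn(x, grid, y, k = 1)     # flexible, jagged boundary
pred50 <- knn(x, grid, y, k = 50)   # smoother boundary
par(mfrow = c(1, 2))
plot(grid, col = ifelse(pred1 == "Red", "red", "darkgreen"), pch = ".", main = "K = 1")
plot(grid, col = ifelse(pred50 == "Red", "red", "darkgreen"), pch = ".", main = "K = 50")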
The data can also be loaded directly from the ISLR package.
library('ISLR')
?College
The fix function allows the supplied argument (either data or a function) to be edited. Here it is used simply for inspection; other functions such as head are typically more practical for this.
Also, since the data has been loaded directly from the ISLR package, the row names are already set correctly.
head(College)
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 52 2885 537 7440
## Adelphi University 29 2683 1227 12280
## Adrian College 50 1036 99 11250
## Agnes Scott College 89 510 63 12960
## Alaska Pacific University 44 249 869 7560
## Albertson College 62 678 41 13500
## Room.Board Books Personal PhD Terminal
## Abilene Christian University 3300 450 2200 70 78
## Adelphi University 6450 750 1500 29 30
## Adrian College 3750 400 1165 53 66
## Agnes Scott College 5450 450 875 92 97
## Alaska Pacific University 4120 800 1500 76 72
## Albertson College 3335 500 675 67 73
## S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University 18.1 12 7041 60
## Adelphi University 12.2 16 10527 56
## Adrian College 12.9 30 8735 54
## Agnes Scott College 7.7 37 19016 59
## Alaska Pacific University 11.9 2 10922 15
## Albertson College 9.4 11 9727 55
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(College[,1:10])
plot(x=College[,"Outstate"],y=College[,"Private"],xlab="Outsate",ylab="Private")
Elite = rep("No",nrow(College))
Elite[College$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
College = data.frame(College,Elite)
summary(Elite)
## No Yes
## 699 78
plot(College[,"Outstate"],Elite,xlab="Outstate")
par(mfrow = c(2,2))
hist(College[,"PhD"],xlab="Percent of faculty with PhD")
hist(College[,"PhD"],xlab="Percent of faculty with PhD",breaks = 100/5)
hist(College[,"Accept"],xlab="Number of applicants accepted")
hist(College[,"Room.Board"],xlab="Room and board costs")
Subtracting the proportion of applicants who are accepted (Accept divided by Apps) from one produces a quantity that measures how difficult a school is to get into. We can then use this quantity to answer questions such as: do hard-to-get-into schools spend more per student? Is there a relationship between the student-faculty ratio and this quantity?
Unaccept = 1 - College$Accept / College$Apps   # proportion of applicants who are not accepted
plot(Unaccept,College[,"Expend"],ylab="Ammount Spend per Student",xlim = c(0,1))
text(Unaccept,College[,"Expend"],labels = ifelse(Unaccept > 0.75,row.names(College),""),pos = 3)
plot(Unaccept,College[,"S.F.Ratio"],ylab="Student Faculty Ratio",xlim = c(0,1),ylim=c(0,25))
text(Unaccept,College[,"S.F.Ratio"],labels = ifelse(Unaccept > 0.78,row.names(College),""),pos = 3)
The quantitative variables are mpg, cylinders, displacement, horsepower, weight, and acceleration. The remaining ones (year, origin, and name) are treated as qualitative here.
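A quick way to check this (a sketch; it assumes the standard column layout of the ISLR Auto data) is to inspect the class of each column:

sapply(Auto, class)   # name is the only non-numeric column; the rest are stored as numeric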
dat = apply(Auto[,-c(7,8,9)],2,range)
row.names(dat) = c("Min","Max")
dat
## mpg cylinders displacement horsepower weight acceleration
## Min 9.0 3 68 46 1613 8.0
## Max 46.6 8 455 230 5140 24.8
dat = apply(Auto[,-c(7,8,9)],2,function(x){ c(mean(x),sd(x)) })
row.names(dat) = c("Mean","SD")
dat
## mpg cylinders displacement horsepower weight acceleration
## Mean 23.445918 5.471939 194.412 104.46939 2977.5842 15.541327
## SD 7.805007 1.705783 104.644 38.49116 849.4026 2.758864
dat = apply(Auto[-c(10:85),-c(7,8,9)],2,function(x){ c(range(x),mean(x),sd(x)) })
row.names(dat) = c("Min","Max","Mean","SD")
dat
## mpg cylinders displacement horsepower weight acceleration
## Min 11.000000 3.000000 68.00000 46.00000 1649.0000 8.500000
## Max 46.600000 8.000000 455.00000 230.00000 4997.0000 24.800000
## Mean 24.404430 5.373418 187.24051 100.72152 2935.9715 15.726899
## SD 7.867283 1.654179 99.67837 35.70885 811.3002 2.693721
pairs(Auto)
boxplot(mpg~origin,data=Auto,names=c("American","European","Japanese"),main="Miles per Gallon ~ Origin")
boxplot(mpg~cylinders,data=Auto,main="Miles per Gallon ~ Cylinders")
Every variable, except the number of cylinders and the name, seems to be a good predictor of mpg.
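This impression can be checked numerically with the correlation of each numeric variable with mpg (a quick sketch; origin is kept as its numeric coding here):

round(cor(Auto[, sapply(Auto, is.numeric)])[, "mpg"], 2)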
library('MASS')
nrow(Boston) # number of rows.
## [1] 506
ncol(Boston) # number of columns.
## [1] 14
Inspecting the data shows that each column is a feature, and each row is an observation (Boston suburb).
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
par(mfrow = c(2,2))
plot(Boston[,c("dis","crim")])
plot(Boston[,c("ptratio","crim")])
plot(Boston[,c("medv","crim")])
plot(Boston[,c("indus","crim")])
boxplot(crim~chas,Boston)
There seems to be an association between the median value of owner-occupied homes and per capita crime: a low median home value is associated with a higher per capita crime rate.
There also seems to be a relationship between distance to employment centers and crime: areas close to the employment centers show higher per capita crime rates.
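These visual impressions can again be checked numerically (a sketch using the full correlation matrix of the Boston data):

sort(round(cor(Boston)[, "crim"], 2))   # correlation of each variable with per capita crime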
ranges = apply(Boston,2,range)
row.names(ranges) = c("min","max")
ranges
## crim zn indus chas nox rm age dis rad tax ptratio
## min 0.00632 0 0.46 0 0.385 3.561 2.9 1.1296 1 187 12.6
## max 88.97620 100 27.74 1 0.871 8.780 100.0 12.1265 24 711 22.0
## black lstat medv
## min 0.32 1.73 5
## max 396.90 37.97 50
selection = Boston[,"chas"]
nrow(Boston[selection,])
## [1] 35
median(Boston[,"ptratio"])
## [1] 19.05
Boston[which.min(Boston[,"medv"]),]
## crim zn indus chas nox rm age dis rad tax ptratio black
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9
## lstat medv
## 399 30.59 5
This suburb has the maximum value of the black variable, all of its houses were built before 1940, and its pupil-teacher ratio is high compared to the other observations in the data.
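To see where this suburb sits relative to the rest of the data, its percentile for each variable can be computed (a small sketch using the empirical distribution of each column):

worst <- which.min(Boston$medv)
# fraction of suburbs with values at or below this suburb's, per variable
round(sapply(names(Boston), function(v) mean(Boston[[v]] <= Boston[worst, v])), 2)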
rooms = lapply(1:8,function(x){ sum(Boston[,"rm"] > x) })   # number of suburbs averaging more than x rooms per dwelling
rooms
## [[1]]
## [1] 506
##
## [[2]]
## [1] 506
##
## [[3]]
## [1] 506
##
## [[4]]
## [1] 504
##
## [[5]]
## [1] 490
##
## [[6]]
## [1] 333
##
## [[7]]
## [1] 64
##
## [[8]]
## [1] 13
Applying the summary function to observations averaging more than 8 rooms per dwelling yields the output below. Straight away one can note that these suburbs have a low per capita crime rate and a high median home value.
summary(Boston[Boston[,"rm"] > 8,])
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio black
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :354.6
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:384.5
## Median : 7.000 Median :307.0 Median :17.40 Median :386.9
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :385.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:389.7
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :396.9
## lstat medv
## Min. :2.47 Min. :21.9
## 1st Qu.:3.32 1st Qu.:41.7
## Median :4.14 Median :48.3
## Mean :4.31 Mean :44.2
## 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :7.44 Max. :50.0
apply(Boston[Boston$rm > 8,], 2, mean) < apply(Boston, 2, mean)   # which variables have a lower mean among the rm > 8 suburbs than overall