1. Review of Essential Concepts - 15 points

\(\begin{bmatrix} 1 &-1 &3&-5 \\ 2 & 1 & 5&-9\\6&-1&-2&4\end{bmatrix}\)

Let’s see how we can reduce this matrix to reduced row echelon form:

m<-matrix(c(1,-1,3,-5,2,1,5,-9,6,-1,-2,4),byrow=TRUE,nrow=3)

tmp <- m

# R2 -> R2 - 2*R1
tmp[2,] <- tmp[2,] - 2*tmp[1,]

# R3 -> R3 - 6*R1
tmp[3,] <- tmp[3,] - 6*tmp[1,]

tmp
##      [,1] [,2] [,3] [,4]
## [1,]    1   -1    3   -5
## [2,]    0    3   -1    1
## [3,]    0    5  -20   34
# R2 -> R2/3
tmp[2,] <- tmp[2,]/3
# R3 -> R3 - 5*R2
tmp[3,] <- tmp[3,] - 5*tmp[2,]

tmp
##      [,1] [,2]        [,3]       [,4]
## [1,]    1   -1   3.0000000 -5.0000000
## [2,]    0    1  -0.3333333  0.3333333
## [3,]    0    0 -18.3333333 32.3333333
# R1 -> R1 + R2
tmp[1,] <- tmp[1,] + tmp[2,]

# R3 -> R3/-18.33
tmp[3,] <- tmp[3,]/tmp[3,3]

tmp
##      [,1] [,2]       [,3]       [,4]
## [1,]    1    0  2.6666667 -4.6666667
## [2,]    0    1 -0.3333333  0.3333333
## [3,]    0    0  1.0000000 -1.7636364
# R2 -> 0.33 * R3 + R2
tmp[2,] <- -1*tmp[2,3] * tmp[3,] +  tmp[2,] 

# R1 -> R1 - 2.66 * R3 
tmp[1,] <- tmp[1,]  - tmp[1,3] * tmp[3,]

tmp
##      [,1] [,2] [,3]        [,4]
## [1,]    1    0    0  0.03636364
## [2,]    0    1    0 -0.25454545
## [3,]    0    0    1 -1.76363636

Since none of the rows reduced to a zero row (each of the three rows contains a pivot), the above matrix has a rank of 3.

The transpose of the matrix is:

t(m)
##      [,1] [,2] [,3]
## [1,]    1    2    6
## [2,]   -1    1   -1
## [3,]    3    5   -2
## [4,]   -5   -9    4

Given a set of vectors, we can say that they are orthonormal if they meet these two characteristics: every pair of distinct vectors is orthogonal (their dot product is zero), and each vector has unit length (norm 1).

An example of such a set of vectors is:

the standard basis of \(R^{3}\): \(\left \{ v_{1}=(1,0,0),\ v_{2}=(0,1,0),\ v_{3}=(0,0,1) \right \}\)
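As a quick sanity check (an illustrative snippet, not part of the original solution), the columns of a matrix V are orthonormal exactly when \(V^{T}V\) equals the identity matrix:

# the three basis vectors as columns of V
V <- cbind(c(1,0,0), c(0,1,0), c(0,0,1))

# pairwise dot products are 0 and every column has unit length,
# so t(V) %*% V should be the 3x3 identity matrix
all.equal(crossprod(V), diag(3))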

\(A = \begin{bmatrix} 5 &0 &3 \\ 0 & 1 & -2\\1&2&0\end{bmatrix}\)

Before we do anything, let’s eliminate the first two entries of the third row to bring the matrix to upper triangular form:

\(\begin{bmatrix} 5 &0 &3 \\ 0 & 1 & -2\\1&2&0\end{bmatrix} R_3 \rightarrow R_3 - \frac{1}{5} R_1 \begin{bmatrix} 5 &0 &3 \\ 0 & 1 & -2\\0&2&-\frac{3}{5}\end{bmatrix}\)

\(\begin{bmatrix} 5 &0 &3 \\ 0 & 1 & -2\\0&2&-\frac{3}{5}\end{bmatrix} R_3 \rightarrow R_3 - 2 R_2 \begin{bmatrix} 5 &0 &3 \\ 0 & 1 & -2\\0&0&-\frac{3}{5}+4\end{bmatrix}\)

which becomes:

\(\begin{bmatrix} 5 &0 &3 \\ 0 & 1 & -2\\0&0&\frac{17}{5}\end{bmatrix}\)
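Since row operations of the form \(R_{i} \rightarrow R_{i} - c R_{j}\) do not change the determinant, the determinant of A is the product of the diagonal entries of this triangular form: \(5 \cdot 1 \cdot \frac{17}{5} = 17\). A quick check in R (an illustrative snippet):

# determinant of A; should return 17
det(matrix(c(5,0,3,0,1,-2,1,2,0),byrow=TRUE,nrow=3))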

\(A = \begin{bmatrix} 5 &0 &3 \\ 0 & 1 & -2\\1&2&0\end{bmatrix}\)

The characteristic polynomial of A, denoted by \(p_A(t)\), is the polynomial defined by

\(p_A(t) = \det \left(t \boldsymbol{I} - A\right)\)

For the provided matrix, let’s compute the determinant of:

\(t I-A = \begin{pmatrix}t-5&0&-3\\0&t-1&2\\-1&-2&t\end{pmatrix}\)

\(det(t I-A) = (t-5)(t-1)(t) + (0)(2)(-1) + (-3)(0)(-2) - (-3)(t-1)(-1) - (t-5)(2)(-2) - (0)(0)(t)\)

\(det(t I-A) = (t-5)(t-1)(t) - (-3)(t-1)(-1) - (t-5)(2)(-2)\)

\(det(t I-A) = (t^2 -6t +5)(t) - (3t-3) - (-4t+20)\)

\(det(t I-A) = t^3 -6t^2 +5t - 3t + 3 + 4t - 20\)

\(det(t I-A) = (t^3 -6t^2 +5t) + t - 17\)

\(det(t I-A) = t^3 -6t^2 + 6t - 17\)

Therefore,

\(p_A(t) = t^3 -6t^2 +6t - 17\)

The eigenvalues are those values of \(\lambda\) such that:

\(\lambda^3 -6\lambda^2 +6\lambda - 17 =0\)

Solving this cubic numerically (for example with R’s polyroot function, or Wolfram Alpha), we find that it has one real solution, \(\lambda_1 \approx 5.4713\), and a pair of complex-conjugate solutions, \(\lambda_{2,3} \approx 0.2644 \pm 1.7428i\).
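A quick numerical check in R (an illustrative snippet; polyroot expects the coefficients in increasing order of degree):

# roots of -17 + 6t - 6t^2 + t^3; approximately 5.4713 and 0.2644 +/- 1.7428i,
# matching eigen(m)$values below
polyroot(c(-17, 6, -6, 1))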

Let’s calculate the eigenvector that corresponds to the real eigenvalue. An eigenvector \(v\) for \(\lambda_{1}\) is a non-trivial solution of \((A - \lambda_{1} I)v = 0\), i.e. a vector in the null space of \(A - \lambda_{1} I\). One way to find it is to fix \(v_{3} = 1\) and solve the first two equations for \(v_{1}\) and \(v_{2}\):

m<-matrix(c(5,0,3,0,1,-2,1,2,0),byrow=TRUE,nrow=3)

lambda1 <- 5.471264
B <- m - lambda1*diag(3)

# fix v3 = 1 and solve the first two rows of B %*% v = 0 for v1 and v2
v <- c(solve(B[1:2,1:2], -B[1:2,3]), 1)

# normalize to unit length; up to sign this matches the first column of
# eigen(m)$vectors below, approximately (0.9855, -0.0692, 0.1548)
v/sqrt(sum(v^2))

Let’s obtain the full set of eigenvalues and eigenvectors by using R’s eigen() function; note that the first eigenvalue and (normalized) eigenvector agree with the values above.

eigen(m)
## $values
## [1] 5.471264+0.000000i 0.264368+1.742772i 0.264368-1.742772i
## 
## $vectors
##                [,1]                  [,2]                  [,3]
## [1,]  0.98551404+0i  0.2583505-0.2761849i  0.2583505+0.2761849i
## [2,] -0.06924766+0i -0.6725516+0.0000000i -0.6725516+0.0000000i
## [3,]  0.15481227+0i -0.2473752+0.5860519i -0.2473752-0.5860519i

Given such a random matrix of URL links, we expect the resulting ranked network to show no particular structure: PageRank should assign roughly equal weight to every node, with no single page standing out as a hub.
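As an illustrative sketch (not part of the original submission; the matrix size, link probability, and damping factor are arbitrary choices), we can run PageRank by power iteration on a random link matrix and check that no node ends up with a dominant rank:

# build a random 0/1 link matrix and its column-stochastic transition matrix
set.seed(1)
n <- 20
L <- matrix(rbinom(n*n, 1, 0.5), n, n)
diag(L) <- 0
L[rowSums(L) == 0, ] <- 1          # guard against nodes with no outgoing links
M <- t(L / rowSums(L))

# PageRank by power iteration with damping factor d
d <- 0.85
r <- rep(1/n, n)
for (i in 1:100) r <- (1 - d)/n + d * (M %*% r)

# the ranks should all stay fairly close to 1/n = 0.05
range(r)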

Given a set of samples drawn from an unknown pdf (with finite mean and variance), the Central Limit Theorem tells us that the arithmetic mean (average) of such samples is approximately normally distributed, with the approximation improving as the sample size grows.
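For example (an illustrative simulation; the exponential distribution and sample size are arbitrary choices), sample means drawn from a clearly non-normal distribution already look approximately Gaussian for moderate sample sizes:

set.seed(42)

# 1000 sample means, each computed from 50 draws of a right-skewed Exp(1) distribution
sample_means <- replicate(1000, mean(rexp(50, rate = 1)))

hist(sample_means, breaks = 30, main = "Sample means of Exp(1), n = 50")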

By applying the product rule for derivatives we have:

\(\frac{d}{dx}(e^{x} cos^2(x)) = \frac{d}{dx}(e^{x}) cos^2(x) + e^{x} \frac{d}{dx} cos^2(x)\)

\(\frac{d}{dx}(e^{x} cos^2(x)) = e^{x} cos^2(x) + e^{x} \frac{d}{dx} cos^2(x)\)

Again, we can express \(cos^2(x)\) as \(cos(x) cos(x)\) and apply the product rule:

\(\frac{d}{dx}(e^{x} cos^2(x)) = e^{x} cos^2(x) + e^{x} (\frac{d}{dx} cos(x) cos(x))\)

\(\frac{d}{dx}(e^{x} cos^2(x)) = e^{x} cos^2(x) + e^{x} (\frac{d}{dx}(cos(x)) cos(x) + cos(x)\frac{d}{dx}(cos(x)))\)

\(\frac{d}{dx}(e^{x} cos^2(x)) = e^{x} cos^2(x) + e^{x} (-sin(x) cos(x) -sin(x) cos(x))\)

\(\frac{d}{dx}(e^{x} cos^2(x)) = e^{x} cos^2(x) - 2 e^{x} sin(x) cos(x)\)

\(\frac{d}{dx}(e^{x} cos^2(x)) = cos(x)e^{x}(cos(x) - 2 sin(x) )\)
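We can sanity-check this result numerically (an illustrative check; the test point x0 is arbitrary) by comparing the derived expression against a central-difference approximation:

f  <- function(x) exp(x) * cos(x)^2
df <- function(x) cos(x) * exp(x) * (cos(x) - 2*sin(x))   # the derivative derived above

x0 <- 0.7
h  <- 1e-6

# the central difference and the closed-form derivative should agree to several decimal places
(f(x0 + h) - f(x0 - h)) / (2*h)
df(x0)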

By applying the chain rule (with \(u = x^{3}\)) we have:

\(\frac{d}{dx}(e^{x^{3}}) = \frac{d}{du}(e^{u}) \frac{d}{dx}(x^{3})\)

\(\frac{d}{dx}(e^{x^{3}}) = (e^{x^{3}}) 3x^{2}\)

By the sum rule, the integral \(\int (e^{x} cos(x) + sin(x)) dx\) splits into:

\(\int (e^{x} cos(x) + sin(x)) dx = \int e^{x} cos(x) dx + \int sin(x) dx\)

\(\int (e^{x} cos(x) + sin(x)) dx = \int e^{x} cos(x) dx - cos(x) + c\)

We can solve \(\int e^{x} cos(x) dx\) by parts

Given:

\(\int fg'dx = fg - \int f'g dx\)

where:

f = \(e^{x}\), f’= \(e^{x} dx\)

g’ = cos(x) dx, g = sin(x)

we have:

\(\int e^{x}cos(x)dx = fg - \int f'g dx\)

\(\int e^{x}cos(x)dx = e^{x}sin(x) - \int f'g dx\)

\(\int e^{x}cos(x)dx = e^{x}sin(x) - \int e^{x} sin(x) dx\)

Next, for \(\int e^{x} sin(x) dx\):

where:

f = \(sin(x)\), f’= \(cos(x) dx\)

g’ = \(e^{x} dx\), g = \(e^{x}\)

we have:

\(\int e^{x}sin(x)dx = fg - \int f'g dx\)

\(\int e^{x}sin(x)dx = sin(x)e^{x} - \int e^{x}cos(x) dx\)

by parts again:

\(\int e^{x}cos(x) dx\)

where:

f = \(cos(x)\), f’= \(-sin(x) dx\)

g’ = \(e^{x} dx\), g = \(e^{x}\)

\(\int e^{x}cos(x)dx = fg - \int f'g dx\)

\(\int e^{x}cos(x)dx = cos(x)e^{x} - \int e^{x}(-sin(x)) dx\)

If we now put it all together we can simplify:

\(\int e^{x} sin(x) = sin(x)e^{x} - (cos(x)e^{x} - \int e^{x}(-sin(x)) dx)\)

\(\int e^{x} sin(x) = sin(x)e^{x} - (cos(x)e^{x} + \int e^{x}sin(x) dx)\)

\(\int e^{x} sin(x) = sin(x)e^{x} - cos(x)e^{x} - \int e^{x}sin(x) dx\)

\(2\int e^{x} sin(x) = sin(x)e^{x} - cos(x)e^{x}\)

\(\int e^{x} sin(x) = \frac{1}{2} (sin(x)e^{x} - cos(x)e^{x}) + c\)

\(\int e^{x} sin(x) = -\frac{1}{2} e^{x} (cos(x) - sin(x)) + c\)

With all the following previously obtained intermediate equations:

\(\int (e^{x} cos(x) + sin(x) ) dx = \int e^{x} cos(x) dx - cos(x) + c\)

\(\int e^{x}cos(x)dx = e^{x}sin(x) - \int e^{x} sin(x) dx\)

\(\int e^{x} sin(x) = -\frac{1}{2} e^{x} (cos(x) - sin(x)) + c\)

we have:

\(\int (e^{x} cos(x) + sin(x) ) dx = e^{x}sin(x) - (-\frac{1}{2} e^{x} (cos(x) - sin(x))) - cos(x) + c\)

\(\int (e^{x} cos(x) + sin(x) ) dx = e^{x}sin(x) + \frac{1}{2} e^{x} cos(x) - \frac{1}{2} e^{x} sin(x) - cos(x) + c\)

\(\int (e^{x} cos(x) + sin(x) ) dx = \frac{1}{2} e^{x} cos(x) + \frac{1}{2} e^{x} sin(x) - cos(x) + c\)

Therefore:

\(\int (e^{x} cos(x) + sin(x) ) dx = \frac{1}{2} e^{x} (cos(x) + sin(x))- cos(x) + c\)
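We can verify the final antiderivative numerically (an illustrative check; the integration interval [0, 1] is arbitrary) by comparing it against R’s integrate:

# F is the antiderivative derived above (with c = 0)
F <- function(x) 0.5 * exp(x) * (cos(x) + sin(x)) - cos(x)
integrand <- function(x) exp(x) * cos(x) + sin(x)

# these two values should match
integrate(integrand, 0, 1)$value
F(1) - F(0)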


2. Mini-coding assignments - 15 points


library("knitr")

f<-function (x) {
  dbinom(x,20,0.25)
}

#Let's set up a vector of x values
x <- seq(0,20,by=1)

p_x <- sapply(x,f)

#let's draw 1000 random samples from the Binomial(20, 0.25) distribution
f_x <- sample(x,1000,replace=TRUE,prob=p_x)

#plot the theoretical pmf next to a histogram of the sampled values
par(mfrow=c(1,2)) 

plot(x, p_x, type='h', xlab='x', ylab='P(X = x)', main='Theoretical pmf')
hist(f_x, breaks=x, main='Sampled values')


In order to do PCA from the SVD we need to: (1) standardize the data (center each variable and scale it to unit variance), (2) compute the covariance matrix of the standardized data, and (3) take the SVD of that covariance matrix; the singular values give the variance along each principal component and the singular vectors give the principal directions.

auto_mpg_data = read.fwf(file='C:\\Users\\malarconba001\\Google Drive\\CUNY\\IS605\\final_exam\\IS605_Finals_Fall2015\\auto-mpg.data',
                         widths = c(5,11,11,10,8))


colnames(auto_mpg_data) <- c("displacement", "horsepower", "weight", "acceleration","mpg")

# Normalize the data
A <- scale(data.matrix(auto_mpg_data[1:4]))

mpg <- data.matrix(auto_mpg_data[5])
colnames(mpg)<- "mpg"
# Let's calculate the covariance matrix
cov <- cov(A)

# Now, let's do SVD on the covariance Matrix
LSA <- svd(cov)

# and the singular values (the variances along the principal components) are
LSA$d
## [1] 3.20893718 0.65406182 0.08295314 0.05404786
# which plots as
plot(LSA$d,type='b',pch=20,xlab='Singular value',ylab='magnitude')

We can compare these results to the ones from the prcomp function:

ir.pca <- prcomp(A,center=FALSE,scale.=FALSE)
ir.pca
## Standard deviations:
## [1] 1.7913507 0.8087409 0.2880159 0.2324820
## 
## Rotation:
##                     PC1         PC2        PC3        PC4
## displacement -0.5344668  0.25097580 -0.4903039 -0.6410604
## horsepower   -0.5417662 -0.03090597  0.8189342 -0.1867643
## weight       -0.5127678  0.43914194 -0.1583275  0.7205248
## acceleration  0.3973711  0.86209647  0.2527474 -0.1870952
ir.pca$sdev^2
## [1] 3.20893718 0.65406182 0.08295314 0.05404786
plot(ir.pca, type = "l")

From the above we can see that most of the variance (roughly 97%) is explained by the first two dimensions.
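At this point LSA still holds the SVD of the covariance matrix, so we can quantify this claim with the cumulative proportion of variance (a quick illustrative check):

# cumulative proportion of the total variance explained by the leading components
cumsum(LSA$d) / sum(LSA$d)
# roughly 0.80, 0.97, 0.99, 1.00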

Let’s now project the data onto the first two dimensions:

LSA <- svd(A)

# Let's now project the data onto these two dimensions
depth <-2
us <- as.matrix(LSA$u[, 1:depth])
vs <- as.matrix(LSA$v[, 1:depth])
ds <- as.matrix(diag(LSA$d)[1:depth, 1:depth])

A_dn <- us %*% ds %*% t(vs)
colnames(A_dn) <- c("displacement", "horsepower", "weight", "acceleration")
pairs(~.,data=cbind(A_dn,mpg), 
   main=paste("Scatterplot Matrix of auto-mpg Data onto the First",depth,"Dimensions."))

which we can compare against the original dataset:

#colnames(A) <- c("displacement", "horsepower", "weight", "acceleration")
pairs(~.,data=cbind(A,mpg), 
   main="Scatterplot Matrix of auto-mpg Data (Original)")

It looks pretty close to me!


In this case (sampling with replacement from a set of n distinct values), the probability of selecting any given value on a single draw is p = 1/n.

With this, we can also say that the probability of an individual value not being selected on a single draw is \(q = \left ( 1-\frac{1}{n} \right )\). Since we’re sampling with replacement n times, the probability that a given value is never selected is \(q_{n} = \left ( 1-\frac{1}{n} \right )^{n}\).

With this, we can say that we’re looking for:

\(q_{\infty} = \lim_{n \rightarrow \infty } \left ( 1-\frac{1}{n} \right )^{n}\)

We can estimate this limit in R:

infinity<-1000
n <- c(1:infinity)
q <-(1-1/n)^n

plot(n,q)

# (1 - 1/n)^n increases with n, so the largest value is our estimate of the limit
lim <- max(q)

lim
## [1] 0.3676954

As we can see from the chart above, n = 1000 already gets us very close to the limiting value of \(e^{-1} \approx 0.3679\).

Therefore, \(p_{\infty} = 1- \lim_{n \rightarrow \infty } \left ( 1-\frac{1}{n} \right )^{n} = 1 - e^{-1} \approx 0.632\):

1-lim
## [1] 0.6323046

Computational verification:

# In this experiment we will bootstrap n elements from a sequence of values from 1:n
bootstrap_experiment <- function(n){

  # Define the sequence of values to sample from
  values <- seq_len(n)
  
  # sample with replacement n times
  samples <- sample(values, n, replace=TRUE)

  # return the fraction of distinct values that were sampled in this iteration
  return(length(unique(samples))/n)
}

# The output of a single experiment with n = 10000 is:
n <- 10000
bootstrap_experiment(n)
## [1] 0.6288
#now, let's repeat the experiment m-times and display the average

m<- 100
mean(replicate(m,{bootstrap_experiment(n)}))
## [1] 0.632006
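For comparison, the analytic limit is \(1 - e^{-1}\) (a quick check):

# the analytic value 1 - 1/e = 0.6321206..., in good agreement with the simulation above
1 - exp(-1)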

3. Mini-project - 20 points

In this mini project, you’ll perform a Multivariate Linear Regression analysis using Gradient Descent. The data set consists of two predictor variables and one response variable. The predictor variables are living area in square feet and number of bedrooms. The response variable is the price of the house. You have 47 data points in total.

Since both the number of rooms and the living area are in different units, it is hard to compare them in relative terms. One way to compensate for this is to standardize the variables. In order to standardize, you estimate the mean and standard deviation of each variable and then compute new versions of these variables. For instance, if you have a variable x, then the standardized version of x is \(x_{std} = (x - \mu)/\sigma\) where \(\mu\) and \(\sigma\) are the mean and standard deviation of x, respectively.

As we saw in the gradient descent equations, we introduce a dummy variable \(x_{0} = 1\) in order to calculate the intercept term of the linear regression. Please standardize the 2 variables, introduce the dummy variable and then write a function to perform gradient descent on this data set. You’ll repeat gradient descent for a range of \(\alpha\) values. Please use \(\alpha = (0.001; 0.01; 0.1; 1.0)\) as your choices. For each value of \(\alpha\) perform about 100 iterations and compute \(J(\theta)\) at the end of each iteration. Please plot \(J(\theta)\) versus number of iterations for each of the 4 \(\alpha\) choices.

Once you have your final gradient descent solution, compare this with regular linear regression (using the built-in function in R). Also solve using Ordinary Least Squares approach. Please document all 3 solutions in your submission. For bonus points (4 additional points), also perform stochastic gradient descent in addition to regular gradient descent and show those results as well.

Finally, using the built-in cross-validation function in R, perform a 5-fold cross-validation and estimate the test prediction error. In order to perform the cross-validation, you can also write your own procedure if you prefer to use your own instead of the built-in functions.

You can choose any one of the 3 approaches to perform cross-validation on. No need to do it for all 3 approaches.

Data Loading and Standardization

First, let’s load the dataset and combine it all into a combined dataframe:

library("knitr")
ex3x <- as.matrix(read.table('C:/Users/malarconba001/Google Drive/CUNY/IS605/final_exam/IS605_Finals_Fall2015/mini-project-data/ex3x.dat'))

colnames(ex3x)<-c("sqf","bedrooms")

ex3y <- as.matrix(read.table('C:/Users/malarconba001/Google Drive/CUNY/IS605/final_exam/IS605_Finals_Fall2015/mini-project-data/ex3y.dat'))

colnames(ex3y)<-c("price")

Now, let’s standardize the bedrooms and sqf and combine the columns into a single dataset:

#standardize the x values
ex3x <- scale(ex3x)

house_data <- data.frame(ex3x,ex3y)

pairs(~.,data=house_data, 
   main="Scatterplot Matrix of House Prices Data")

head(house_data)
##           sqf   bedrooms  price
## 1  0.13000987 -0.2236752 399900
## 2 -0.50418984 -0.2236752 329900
## 3  0.50247636 -0.2236752 369000
## 4 -0.73572306 -1.5377669 232000
## 5  1.25747602  1.0904165 539900
## 6 -0.01973173  1.0904165 299900

Modeling price = f(sqf,bedrooms)

Linear Regression

With a linear regression we have:

house_data.glm <- glm(price ~ sqf+bedrooms, data=house_data)
summary(house_data.glm) # show results
## 
## Call:
## glm(formula = price ~ sqf + bedrooms, data = house_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -130582   -43636   -10829    43698   198147  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   340413       9637  35.323  < 2e-16 ***
## sqf           110631      11758   9.409 4.22e-12 ***
## bedrooms       -6650      11758  -0.566    0.575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 4365189199)
## 
##     Null deviance: 7.1921e+11  on 46  degrees of freedom
## Residual deviance: 1.9207e+11  on 44  degrees of freedom
## AIC: 1181.5
## 
## Number of Fisher Scoring iterations: 2

Least Squares

# design matrix: the standardized predictors plus the dummy intercept column x0 = 1
A <- cbind(1,ex3x)

# response vector
b <- ex3y

# normal equations: x_hat = (A^T A)^{-1} A^T b
A_T <- t(A)

A_TA <- A_T %*% A

A_Tb <- A_T %*% b

A_TA_inv <- solve(A_TA)

x_hat <- A_TA_inv %*% A_Tb
x_hat
##               price
##          340412.660
## sqf      110631.050
## bedrooms  -6649.474
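Gradient Descent

The gradient descent fit that the Conclusion refers to is not reproduced in this section, so the following is a minimal batch gradient descent sketch (an assumed implementation rather than the original code; the \(\alpha\) values and the 100 iterations follow the assignment instructions, and the variable names are illustrative):

# design matrix with the dummy intercept column x0 = 1 (predictors already standardized)
X <- cbind(1, ex3x)
y <- ex3y
m_obs <- nrow(X)

# squared-error cost J(theta)
cost <- function(theta) sum((X %*% theta - y)^2) / (2 * m_obs)

gradient_descent <- function(alpha, iters = 100) {
  theta <- matrix(0, ncol(X), 1)
  J <- numeric(iters)
  for (i in seq_len(iters)) {
    grad <- t(X) %*% (X %*% theta - y) / m_obs   # gradient of J(theta)
    theta <- theta - alpha * grad
    J[i] <- cost(theta)
  }
  list(theta = theta, J = J)
}

# run gradient descent for each learning rate and keep J(theta) per iteration
alphas <- c(0.001, 0.01, 0.1, 1.0)
fits <- lapply(alphas, gradient_descent)

# plot J(theta) versus iteration for each alpha
par(mfrow = c(2, 2))
for (k in seq_along(alphas)) {
  plot(fits[[k]]$J, type = "l", xlab = "iteration", ylab = "J(theta)",
       main = paste("alpha =", alphas[k]))
}

# with alpha = 1 the coefficients should approach the least squares solution x_hat above
fits[[4]]$theta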

5-fold Cross Validation of the Linear Model

library(boot)

# 5-fold cross validation
(cv.err.5 <- cv.glm(house_data, house_data.glm, K = 5)$delta)
## [1] 4560119246 4505164554
# test error prediction
mu_hat <- fitted(house_data.glm)
house_data.diag <- glm.diag(house_data.glm)
(cv.err <- mean((house_data.glm$y - mu_hat)^2/(1 - house_data.diag$h)^2))
## [1] 4647800409

In order to interpret the delta for this linear model, we need to compare it against some other model. Let’s see how the linear model compares against polynomial models of various degrees (1-3):

do_auto_poly_regression <- function(data,power){
  glm.fit <- glm(price~poly(sqf+bedrooms,power), data=data)  
  return(cv.glm(data,glm.fit,K=5)$delta[1])
}


degree<-c(1:3)
cv_err <- sapply(degree,  function(x) do_auto_poly_regression(house_data, x))
degree <- c(0,degree)
cv_err <- c(cv.err.5[1],cv_err)
cv_err
## [1]  4560119246  7629033735  6912612787 10348525159
plot(degree,cv_err,type='b'
     ,main="Cross Validation of Linear and Polynomial Models of house_data"
     ,xlab="Polynomial Degree"
     ,ylab="Cross Validation Error")

As we can see, the linear model offers the least amount of cross validation error in comparison to other polynomial models.

Conclusion

In this exercise we compared three different techniques for performing a multivariate linear regression of home prices. The three modeling techniques produce essentially the same multivariate linear model, of the form \(price = 110631 \cdot sqf - 6650 \cdot bedrooms + 340413\) (with sqf and bedrooms standardized). It is worth mentioning that the results of the glm model reveal that the number of bedrooms is NOT statistically significant.

In regards to the gradient descent method, we see that it performs much better with \(\alpha=1\). At this \(\alpha\) level, we find a solution that is comparable to the Least Squares or R’s linear regression results. In addition, the squared error \(J(\theta)\) appears to be minimized after only about 40 iterations.