\[ \mathbb 1. \ Review \ of \ Essential \ Concepts \ - \ 15 \ points \\ \]

1- What is the rank of the following matrix?

\[ \begin{align} \begin{bmatrix} 1 & -1 & 3 & -5 \\ 2 & 1 & 5 & -9 \\ 6 & -1 & -2 & 4 \\ \end{bmatrix} \end{align} \]

A<-matrix(c(1,-1,3,-5,2,1,5,-9,6,-1,-2,4),byrow=TRUE,nrow=3)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1   -1    3   -5
## [2,]    2    1    5   -9
## [3,]    6   -1   -2    4
#Multiply the 1st row by 2
A[1,]<- A[1,]*2 
A
##      [,1] [,2] [,3] [,4]
## [1,]    2   -2    6  -10
## [2,]    2    1    5   -9
## [3,]    6   -1   -2    4
#Subtract the 1st row from the 2nd row and restore it
A[2,] <- A[2,] - A[1,]
A[1,] <- A[1,] / 2
A
##      [,1] [,2] [,3] [,4]
## [1,]    1   -1    3   -5
## [2,]    0    3   -1    1
## [3,]    6   -1   -2    4
#Multiply the 1st row by 6
A[1,]<- A[1,] * 6
A
##      [,1] [,2] [,3] [,4]
## [1,]    6   -6   18  -30
## [2,]    0    3   -1    1
## [3,]    6   -1   -2    4
#Subtract the 1st row from the 3rd row and restore it
A[3,]<- A[3,] - A[1,]
A[1,]<- A[1,] / 6 
A
##      [,1] [,2] [,3] [,4]
## [1,]    1   -1    3   -5
## [2,]    0    3   -1    1
## [3,]    0    5  -20   34
# Divide the 2nd row by 3
A[2,]<- A[2,] / 3
A
##      [,1] [,2]        [,3]       [,4]
## [1,]    1   -1   3.0000000 -5.0000000
## [2,]    0    1  -0.3333333  0.3333333
## [3,]    0    5 -20.0000000 34.0000000
# Multiply the 2nd row by 5
A[2,] <- A[2,] * 5

# Subtract the 2nd row from the 3rd row and restore it
A[3,]<- A[3,] - A[2,]
A[2,]<- A[2,] /5
A
##      [,1] [,2]        [,3]       [,4]
## [1,]    1   -1   3.0000000 -5.0000000
## [2,]    0    1  -0.3333333  0.3333333
## [3,]    0    0 -18.3333333 32.3333333
# Divide the third row by A[3,3]
A[3,]<- A[3,] / A[3,3]
A
##      [,1] [,2]       [,3]       [,4]
## [1,]    1   -1  3.0000000 -5.0000000
## [2,]    0    1 -0.3333333  0.3333333
## [3,]    0    0  1.0000000 -1.7636364
# Add row 2 to row 1
A[1,]<- A[1,] + A[2,]
A
##      [,1] [,2]       [,3]       [,4]
## [1,]    1    0  2.6666667 -4.6666667
## [2,]    0    1 -0.3333333  0.3333333
## [3,]    0    0  1.0000000 -1.7636364

Since the row-reduced form above has three nonzero rows (one pivot per row), we conclude that the rank of the matrix is 3.
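As a cross-check (an addition; the original matrix is rebuilt here because A was modified in place above), the rank reported by R's QR decomposition should also be 3:

A_orig <- matrix(c(1,-1,3,-5, 2,1,5,-9, 6,-1,-2,4), byrow = TRUE, nrow = 3)
qr(A_orig)$rank   # should also return 3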

\[ \\ \\ \\ \]

2- What is the transpose of the below matrix?

\[ \begin{align} \begin{bmatrix} 1 & -1 & 3 & -5 \\ 2 & 1 & 5 & -9 \\ 6 & -1 & -2 & 4 \\ \end{bmatrix} \end{align} \]

The transpose of the above matrix is: \[ \begin{align} \begin{bmatrix} 1 & 2 & 6 \\ -1 & 1 & -1 \\ 3 & 5 & -2 \\ -5 & -9 & 4 \\ \end{bmatrix} \end{align} \]
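This can be verified in R with the built-in t() function (an addition, rebuilding the matrix first):

A_orig <- matrix(c(1,-1,3,-5, 2,1,5,-9, 6,-1,-2,4), byrow = TRUE, nrow = 3)
t(A_orig)   # returns the 4 x 3 transpose shown above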

3- Define orthonormal basis vectors. Please write down at least one orthonormal basis for the 3-dimensional vector space RxRxR.

A subset {v_1,…,v_k} of a vector space V, with the inner product <,>, is called orthonormal if:

1- They are unit vectors; that is, each has length one: <v_i, v_i> = 1.

2- They are mutually orthogonal: <v_i, v_j> = 0 when i != j. That is, the vectors are mutually perpendicular.

An example of an orthonormal basis for the 3-dimensional vector space RxRxR is as follows:

v1= c(1/sqrt(2),0,-1/sqrt(2))
v2= c(1/2,sqrt(2)/2, 1/2)
v3= c(1/2,-sqrt(2)/2, 1/2)
v1
## [1]  0.7071068  0.0000000 -0.7071068
v2
## [1] 0.5000000 0.7071068 0.5000000
v3
## [1]  0.5000000 -0.7071068  0.5000000

Testing for orthogonality

round(v1 %*%v2, 10)==0
##      [,1]
## [1,] TRUE
round(v1 %*%v3, 10)==0
##      [,1]
## [1,] TRUE
round(v3 %*%v2, 10)==0
##      [,1]
## [1,] TRUE

Testing if the vectors are unit vectors (the sum of squared entries of each vector should equal 1):

sum(v1^2)
## [1] 1
sum(v2^2)
## [1] 1
sum(v3^2)
## [1] 1
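Equivalently (a compact check added here, not part of the original solution), stacking the three vectors as the columns of a matrix and computing its Gram matrix should give the 3x3 identity:

B <- cbind(v1, v2, v3)
round(crossprod(B), 10)   # t(B) %*% B; equals the identity matrix for an orthonormal set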

4- Given the following matrix, what is its characteristic polynomial?

\[ \begin{align} \begin{bmatrix} 5 & 0 & 3 \\ 0 & 1 & -2 \\ 1 & 2 & 0 \\ \end{bmatrix} \end{align} \]

A<-matrix(c(5,0,3,0,1,-2,1,2,0),byrow=TRUE,nrow=3)
A
##      [,1] [,2] [,3]
## [1,]    5    0    3
## [2,]    0    1   -2
## [3,]    1    2    0

\[ The \ characteristic \ polynomial \ of \ an \ n×n \ matrix A = (a_{ij} ) \ is \ defined \ as \:\\ K_A(\lambda) = det(\lambda I_n- A) \]

In our case we have a 3×3 matrix, hence: \[ K_A(\lambda) = det(\lambda I_n- A) = \lambda^3 -\sigma_1\lambda^2+\sigma_2\lambda-\sigma_3 \\ \] \[ \sigma_1 \ is \ the \ sum \ of \ all \ first-order \ diagonal \ minors \ of \ A \ (the \ trace): \\ \sigma_1 = 5+ 1+ 0 = 6 \\ \]

\[ \sigma_2 is \ the \ sum \ of \ all \ second-order \ diagonal \ minors \ of \ A \\ \sigma_2 = \] \[ \begin{vmatrix} 5 & 0 \\ 0 & 1 \\ \end{vmatrix} + \begin{vmatrix} 1 & -2\\ 2 & 0 \\ \end{vmatrix} + \begin{vmatrix} 5 & 3 \\ 1 & 0 \\ \end{vmatrix} \] \[ \sigma_2 = 5 + 4 - 3 = 6 \]

\[ \sigma_3 is \ the \ sum \ of \ all \ third-order \ diagonal \ minors \ of \ A \\ \sigma_3 =det(A) \]

det(A)
## [1] 17

\[ \sigma_3 =det(A) = 17 \]

\[ Hence \ the \ characteristic \ polynomial \ of \ A \ is: \\ P(x) = x^3 -\sigma_1 x^2+\sigma_2 x -\sigma_3 \\ replacing \ \sigma_1 , \sigma_2 , \ and \ \sigma_3 \ with \ their \ values \ we \ get: \\ P(x) = x^3 -6 x^2+6 x -17 \\ \]
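As a numerical cross-check (an addition), base R's polyroot(), which takes coefficients in increasing order of degree, should return roots that match the eigenvalues of A computed in the next question:

polyroot(c(-17, 6, -6, 1))   # roots of x^3 - 6*x^2 + 6*x - 17
eigen(A)$values              # should agree up to ordering and rounding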

5- What are its eigenvectors and eigenvalues? Please note that it is possible to get complex vectors as eigenvectors.

The eigenvalues of A are the roots of the characteristic polynomial below:

\[ K_A(\lambda) = \lambda ^3 -6 \lambda ^2+ 6 \lambda -17 \\ \]

The eigenvectors are the solutions to the Homogeneous system:

\[ (\lambda I_3 - A) X = \theta \]

eigen(A)
## $values
## [1] 5.471264+0.000000i 0.264368+1.742772i 0.264368-1.742772i
## 
## $vectors
##                [,1]                  [,2]                  [,3]
## [1,]  0.98551404+0i  0.2583505-0.2761849i  0.2583505+0.2761849i
## [2,] -0.06924766+0i -0.6725516+0.0000000i -0.6725516+0.0000000i
## [3,]  0.15481227+0i -0.2473752+0.5860519i -0.2473752-0.5860519i

Using the eigen() function in R, the eigenvalues are:

\[ \lambda_1 = 5.471264+0.000000i \\ \lambda_2 = 0.264368+1.742772i\\ \lambda_3 = 0.264368-1.742772i\\ \]

The corresponding eigenvectors are the columns of the $vectors matrix above: \[ v_1=(0.98551404+0i, -0.06924766+0i, 0.15481227+0i) \\ v_2=(0.2583505-0.2761849i, -0.6725516+0.0000000i, -0.2473752+0.5860519i) \\ v_3=(0.2583505+0.2761849i, -0.6725516+0.0000000i, -0.2473752-0.5860519i) \\ \]

6- What can we say about the convergence of the PageRank algorithm for the matrix constructed from this set of URLs?

Recall: A matrix is called column-stochastic if all of its entries are greater than or equal to zero (nonnegative) and the sum of the entries in each column is equal to 1. If all entries of a matrix are nonnegative, then we say that the matrix itself is nonnegative. Furthermore, a matrix is positive if all its entries are positive (greater than zero) real numbers.

The PageRank matrix M is column-stochastic by the design of the PageRank algorithm, and all of its entries are positive. We can therefore invoke the Perron-Frobenius theorem in the analysis of the convergence of the PageRank algorithm. The Perron-Frobenius theorem states that for any column-stochastic matrix with positive entries, 1 is always an eigenvalue, and all other eigenvalues have magnitude strictly less than 1. Moreover, 1 is a simple eigenvalue ("simple" means that 1 is a root, but not a repeated root, of the characteristic polynomial). Consequently, the power iteration used by PageRank converges to the unique eigenvector associated with the eigenvalue 1.
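As a small illustration (an addition; the 3x3 matrix below is a hypothetical positive column-stochastic matrix, not one built from the URLs in the question), the dominant eigenvalue equals 1 and the others have modulus strictly less than 1:

M <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.4, 0.4,
              0.1, 0.6, 0.3), nrow = 3)   # columns filled first; each column sums to 1
colSums(M)             # confirms the matrix is column-stochastic
Mod(eigen(M)$values)   # one eigenvalue equals 1; the rest have modulus < 1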

7- Assuming that we are repeatedly sampling sets of numbers (each set is of size n) from an unknown probability density function, what can we say about the average value of each set?

Recall: If X is a continuous random variable with probability density function f defined on an interval with (possibly infinite) endpoints a and b, then the mean or expected value of X is \[ E(X) = \int_a^b x f(x) dx. \] If X is discrete with probability mass function f: \[ E(X) = \sum_x x f(x). \]

The distribution of the sum (or average) of a large number of independent, identically distributed random variables will be approximately normal, regardless of the underlying distribution. In other words, suppose we take a sample containing a large number of independent observations and compute the arithmetic average of the observed values. If this procedure is performed many times, the computed averages will be distributed approximately according to the normal distribution. This is essentially the statement of the Central Limit Theorem.

Below is a quick illustration:

# Draw n values from a Uniform(0,1) distribution, then repeatedly take the mean of
# small subsamples and plot the histogram of those sample means.
func1 <- function(n){
  y <- runif(n)
  par(mfrow = c(2,1))
  out <- vector()
  for (i in 1:n){
    out_i <- mean(sample(y, size = 3))
    out <- c(out, out_i)
  }
  hist(out)
}

func1(100)
func1(10000)

8- What is the derivative of:

\[ e^x cos^2(x)\\ \] \[ \begin{align} \frac {d} {dx} [e^x cos^2(x)] = & \frac {d} {dx} [e^x] \cdot cos^2(x) + e^x \frac {d} {dx} [cos^2(x)] \\ = & e^x cos^2(x) + e^x \cdot 2 cos(x) \frac {d} {dx}[cos(x)] \\ = & e^x cos^2(x) + 2 e^x cos(x)(-sin(x)) \\ = & e^x cos^2(x) - 2 e^x cos(x) sin(x) \\ = & e^x cos(x)\left[cos(x) - 2 sin(x)\right] \\ \end{align} \]

9- What is the derivative of

\[ f(x) = e^{x^3} \] \[ \begin{align} \frac {d} {dx}\left[e^{x^3}\right] = & e^{x^3} \cdot \frac {d} {dx}[x^3]\\ = & 3x^2 e^{x^3} \\ \end{align} \]
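Both derivatives can be cross-checked symbolically with base R's D() function (an addition, not part of the original solution):

D(expression(exp(x) * cos(x)^2), "x")   # derivative for question 8
D(expression(exp(x^3)), "x")            # derivative for question 9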

10- What is

\[ \int \left( e^x cos(x) + sin(x) \right) dx = \int e^x cos(x) \, dx + \int sin(x) \, dx \\ \] Using integration by parts (applied twice) on the first integral, we get: \[ \begin{align} \int e^x cos(x) \, dx = & e^x cos(x) - \int e^x (-sin(x)) \, dx \\ = & e^x cos(x) + \int e^x sin(x) \, dx \\ = & e^x cos(x) + \left( e^x sin(x) - \int e^x cos(x) \, dx \right) \\ \end{align} \]

Moving the remaining integral of e^x cos(x) to the left-hand side, we get:

\[ \begin{align} 2 \int e^x cos(x) \, dx = & e^x cos(x) + e^x sin(x) \\ \int e^x cos(x) \, dx = & \frac {e^x} {2} \left( cos(x) + sin(x) \right) \\ \end{align} \] Therefore,

\[ \begin{align} \int \left( e^x cos(x) + sin(x) \right) dx = & \frac {e^x} {2} \left( cos(x) + sin(x) \right) + \int sin(x) \, dx \\ = & \frac {e^x} {2} \left( cos(x) + sin(x) \right) - cos(x) + C \\ \end{align} \] Hence: \[ \int \left( e^x cos(x) + sin(x) \right) dx = \frac {e^x} {2} \left( cos(x) + sin(x) \right) - cos(x) + C \\ \]
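As a quick numerical sanity check (an addition, not part of the original solution), base R's integrate() over an arbitrary interval such as [0, 1] should agree with the antiderivative evaluated at the endpoints:

# antideriv is the antiderivative derived above (constant of integration omitted)
antideriv <- function(x) exp(x)/2 * (cos(x) + sin(x)) - cos(x)

integrate(function(x) exp(x) * cos(x) + sin(x), lower = 0, upper = 1)$value
antideriv(1) - antideriv(0)   # should match the numerical result closely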

\[ \mathbb 2. \ Mini-coding \ assignments\ - \ 15 \ points \\ \]

2.1 Sampling from function.

Assume that you have a function that generates integers between 0 and 20 with the following probability distribution: \[ P(X = k) = \binom{20}{k} p^k q^{20-k} \]

where p = 0.25 and q = 1 - p = 0.75, and x takes integer values in [0, 20]. This is also known as a Binomial distribution. Write a function to sample from this distribution. After that, generate 1000 samples from this distribution and plot the histogram of the sample. Please note that the Binomial distribution is a discrete distribution and takes values only at integer values of x, with x in [0, 20]. Sampling from a discrete distribution with finitely many values is very simple, but it is not the same as sampling from a continuous distribution.

# func: evaluates the binomial probability mass function, with:
# x: vector of quantiles (number of successes)
# size: number of trials
# p: probability of success on each trial
func<- function(x,size, p) {
dbinom(x, size, p, log = FALSE)
}

# vector x of possible outcomes, 0 through 20
x <- seq(0,20,by=1)

# draw a sample of 1000 values from x, with replacement, weighted by the binomial probabilities
y<- sample(x,1000,replace=TRUE, prob=func(x,20,.25))

hist(y)
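As a quick sanity check (an addition, reusing the x and y objects defined above), the sample mean should be close to np = 20 x 0.25 = 5, and the empirical frequencies should track dbinom():

mean(y)                        # should be close to 20 * 0.25 = 5
round(table(y) / 1000, 3)      # empirical frequencies
round(func(x, 20, .25), 3)     # theoretical binomial probabilities for comparison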

2.2 Principal Components Analysis.

For the auto data set attached with the final exam, please perform a Principal Components Analysis by performing an SVD on the 4 independent variables (with mpg as the dependent variable) and select the top 2 directions. Please scatter plot the data set after it has been projected to these two dimensions. Your code should print out the two orthogonal vectors and also perform the scatter plot of the data after it has been projected to these two dimensions.

datapath <- "C:/CUNY/Courses/IS605/Assignments/final/auto-mpg.DATA"
autompg_data <- scan(datapath)

autompg_data_mtrx <- t(matrix(autompg_data, nrow = 5))
colnames(autompg_data_mtrx) <- c("displacement", "horsepower", "weight", "acceleration","mpg")
#autompg_data_mtrx

A<- autompg_data_mtrx[,1:4]
mpg <- data.matrix(autompg_data_mtrx[,5])
colnames(mpg)<- "mpg"
pairs(~.,data=cbind(A,mpg), 
   main="Scatterplot of Auto mpg Data before SVD")

# We can use the sweep function to perform arbitrary operations on the rows and columns of a matrix. 
# The second argument specifies we want to operate on the columns (1 would be used for rows), 
# and the third and fourth arguments specify that we want to subtract the column means.

cx <- sweep(A, 2, colMeans(A), "-")
sv <- svd(cx)
names(sv)
## [1] "d" "u" "v"
mpg <- data.matrix(autompg_data_mtrx[,5])
colnames(mpg)<- "mpg"
plot(sv$u[, 1], sv$u[, 2], col = mpg, main = "SVD", xlab = "U1", ylab = "U2")

# Using PCA..
pc <- prcomp(A)
names(pc)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
plot(pc$x[, 1], pc$x[, 2], col = mpg, main = "PCA", xlab = "PC1", ylab = "PC2")

##
##
# For comparison: reconstruct the data using all 4 singular directions (no dimension reduction)
top2<- 4
utop2 <- (sv$u[, 1:top2])
vtop2 <- (sv$v[, 1:top2])
dtop2 <- (diag(sv$d)[1:top2, 1:top2])

Atop2_dec <- utop2 %*% dtop2 %*% t(vtop2)
colnames(Atop2_dec) <- c("displacement", "horsepower", "weight", "acceleration")
pairs(~.,data=cbind(Atop2_dec,mpg), main=paste("Scatterplot of four dimensions."))

##

Please scatter plot the data set after it has been projected to these two dimensions. Your code should print out the two orthogonal vectors and also perform the scatter plot of the data after it has been projected to these two dimensions.

top2<- 2
utop2 <- (sv$u[, 1:top2])
vtop2 <- (sv$v[, 1:top2])
dtop2 <- (diag(sv$d)[1:top2, 1:top2])

Atop2_dec <- utop2 %*% dtop2 %*% t(vtop2)
colnames(Atop2_dec) <- c("displacement", "horsepower", "weight", "acceleration")
pairs(~.,data=cbind(Atop2_dec,mpg), main=paste("Scatterplot of the top two dimensions, after SVD"))
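The assignment also asks the code to print out the two orthogonal vectors. These are the first two right singular vectors, i.e. the first two columns of sv$v; a minimal addition:

sv$v[, 1:2]                                  # the top two orthogonal directions
round(t(sv$v[, 1:2]) %*% sv$v[, 1:2], 10)    # 2 x 2 identity confirms orthonormality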

2.3. Sampling in Bootstrapping.

As we discussed in class, in bootstrapping we start with n data points and repeatedly sample many times with replacement. Each time, we generate a candidate data set of size n from the original data set. All parameter estimations are performed on these candidate data sets. It can be easily shown that any particular data set generated by sampling n points from an original set of size n covers roughly 63.2% of the original data set. Using probability theory and limits, please prove that this is true. After that, write a program to perform this sampling and show that the empirical observation also agrees with this.

2.3 Solution

In bootstrap sampling, the sample S is constructed by randomly drawing, with replacement, n data points from the dataset of size n. This replaces the distribution D (according to which the data are generated) by an empirical distribution D̂, a discrete distribution that assigns probability 1/n to each data point. The bootstrapped dataset may not contain all of the data points from the original dataset, and some may occur many times. The probability that the jth observation is not selected in a single draw is (1 - 1/n), so the probability that the jth observation is not in the bootstrap sample at all is the product over the n independent draws: (1 - 1/n) * (1 - 1/n) * ... * (1 - 1/n) = (1 - 1/n)^n

For large sample sizes, the probability that a given point ends up in the test data (is never selected) is approximately 1/e ≈ 0.368. On average, a bootstrapped dataset therefore contains about 1 - 0.368 = 63.2% of the distinct data points.

Numerical illustration: When n=5, the probability that the jth observation is in the bootstrap sample is: \[ P(jth \ obs \ in \ bootstrap \ sample) = 1 - (1 - 1/5)^5 = 0.67232 \] When n=10, the probability that the jth observation is in the bootstrap sample is: \[ P(jth \ obs \ in \ bootstrap \ sample) = 1 - (1 - 1/10)^{10} = 0.6513215599 \] When n=100, the probability that the jth observation is in the bootstrap sample is: \[ P(jth \ obs \ in \ bootstrap \ sample) = 1 - (1 - 1/100)^{100} = 0.6339676587 \]

x <- 1:100000
plot(x, 1 - (1 - 1/x)^x)
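The question also asks for a program that performs the sampling itself and confirms the 63.2% figure empirically. A minimal simulation sketch (an addition; n = 100 and the number of trials are arbitrary choices):

set.seed(1)          # for reproducibility
n <- 100
trials <- 1000
coverage <- replicate(trials, {
  boot_idx <- sample(1:n, n, replace = TRUE)    # one bootstrap sample of indices
  length(unique(boot_idx)) / n                  # fraction of original points covered
})
mean(coverage)       # should be close to 1 - (1 - 1/n)^n, about 0.634 for n = 100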

Mathematical proof of Equation (1) below:

\[ \lim_{n\to \infty} ( 1- \frac {1} {n})^n = 1/e \qquad Equation (1) \]

Let f(x) = ln(x). Then we know that f'(x) = 1/x. Also, by the definition of the derivative, we can write: \[ \begin{align} f'(x) = & \lim_{h\to 0} \frac{f(x+h) - f(x)} {h} \\ = & \lim_{h\to 0} \frac{ln(x+h) - ln(x)} {h} \\ = & \lim_{h\to 0} \frac {1} {h} ln \frac {x+h} {x} \\ = & \lim_{h\to 0} ln \left( \frac {x+h} {x} \right)^\frac {1} {h} \\ = & \lim_{h\to 0} ln \left( 1 + \frac {h} {x} \right)^\frac {1}{h} \\ \end{align} \] Then, using the fact that ln(x) is a continuous function for all x in its domain, we can exchange the lim and ln:

\[ f'(x) = ln \lim_{h\to 0} \left( 1 + \frac {h} {x} \right)^\frac {1}{h} \\ Now \ let \ m = \frac {1} {h}. \ Then \ m-> \infty \ as \ h-> 0 \ and \\ \]

\[ f'(x) = ln \lim_{m\to \infty} \left( 1 + \frac {1} {mx} \right)^{m} \\ \]

\[ Now \ assuming \ x>0, \ define \ n=mx^2, \ and \ so \ n-> \infty \ as \ m-> \infty. \ Then \ we \ get: \\ f'(x) = ln \lim_{n\to \infty } \left[ \left( 1 + \frac {x} {n} \right)^n \right]^\frac {1} {x^2} \]

and from before, we still have f'(x) = 1/x, therefore:

\[ ln \lim_{n\to \infty } \left[ \left( 1 + \frac {x} {n} \right)^n \right]^\frac {1} {x^2} = \frac {1} {x} \] Exponentiating both sides, we get:

\[ \lim_{n\to \infty } \left[ \left( 1 + \frac {x} {n} \right)^n \right]^\frac {1} {x^2} = e^{\frac {1} {x}} \]

Finally, raising both sides to the x^2, we get the Equation (2): \[ \lim_{n\to \infty } \left( 1 + \frac {x} {n} \right)^n = e^{x} \qquad Equation (2) \]

Recall we are trying to prove Equation (1) below:

\[ \lim_{n\to \infty} ( 1- \frac {1} {n})^n = 1/e \]

From Equation (2) we replace x by -1, we get:

\[ \begin{align} \lim_{n\to \infty } \left( 1 + \frac {-1} {n} \right)^n = & e^{-1} \\ \lim_{n\to \infty } \left( 1 - \frac {1} {n} \right)^n = & e^{-1} \\ \lim_{n\to \infty } \left( 1 - \frac {1} {n} \right)^n \approx & \ 0.368 \\ \end{align} \]

Hence, for large sample sizes the probability that a given point ends up in the test data (is never selected) is approximately 1/e ≈ 0.368, and on average a bootstrapped dataset contains about 1 - 0.368 = 63.2% of the distinct data points.

\[ \\ \\ \]

\[ \mathbb 3. \ Mini \ project \ - \ 20 \ points \\ \]

In this mini project, you'll perform a Multivariate Linear Regression analysis using Gradient Descent. The data set consists of two predictor variables and one response variable. The predictor variables are living area in square feet and number of bedrooms. The response variable is the price of the house. You have 47 data points in total. Since the number of rooms and the living area are in different units, it is hard to compare them in relative terms. One way to compensate for this is to standardize the variables. In order to standardize, you estimate the mean and standard deviation of each variable and then compute new versions of these variables.

For instance, if you have a variable x, then the standardized version of x is x_std = (x - μ)/σ, where μ and σ are the mean and standard deviation of x, respectively.

As we saw in the gradient descent equations, we introduce a dummy variable x0 = 1 in order to calculate the intercept term of the linear regression. Please standardize the 2 variables, introduce the dummy variable and then write a function to perform gradient descent on this data set. You'll repeat gradient descent for a range of 'a' values. Please use a = (0.001, 0.01, 0.1, 1.0) as your choices. For each value of 'a' perform about 100 iterations and compute J(θ) at the end of each iteration. Please plot J(θ) versus the number of iterations for each of the 4 'a' choices. Once you have your final gradient descent solution, compare this with regular linear regression (using the built-in function in R). Also solve using the Ordinary Least Squares approach. Please document all 3 solutions in your submission. For bonus points (4 additional points), also perform stochastic gradient descent in addition to regular gradient descent and show those results as well.

Finally, using the built-in cross-validation function in R, perform a 5-fold cross-validation and estimate the test prediction error. In order to perform the cross-validation, you can also write your own procedure if you prefer to use your own instead of the built-in functions. You can choose any one of the 3 approaches to perform cross-validation on. No need to do it for all 3 approaches.

Solution

1- Load the data.
2- Standardize the predictor variables.
3- Combine the two data sets so that SQFT & BDR are in standardized form and Price stays in its original form.

##############################

xdatapath <- "C:/CUNY/Courses/IS605/Assignments/final/mini-project-data/ex3x.dat"
ydatapath <- "C:/CUNY/Courses/IS605/Assignments/final/mini-project-data/ex3y.dat"

predictors <- scan(xdatapath)
response <- scan(ydatapath)

predictors_df <- data.frame(t(matrix(predictors, nrow = 2)))
#predictors_df

response_df <- data.frame(t(matrix(response, nrow = 1)))
#response_df

colnames(predictors_df) <- c('SQFT', 'BDR')
colnames(response_df) <- c( 'Price')

#standardizing data by estimating the mean and standard deviation of each variable and then compute 
#new versions of these variables.

predictors_df$SQFT <- ( predictors_df$SQFT - mean(predictors_df$SQFT) ) /  sd(predictors_df$SQFT)
predictors_df$BDR <- ( predictors_df$BDR - mean(predictors_df$BDR) ) /  sd(predictors_df$BDR)  

data_std <- cbind(predictors_df,response_df)

head(data_std) 
##          SQFT        BDR  Price
## 1  0.13000987 -0.2236752 399900
## 2 -0.50418984 -0.2236752 329900
## 3  0.50247636 -0.2236752 369000
## 4 -0.73572306 -1.5377669 232000
## 5  1.25747602  1.0904165 539900
## 6 -0.01973173  1.0904165 299900
# add a column of 1's for the intercept coefficient
x0 <- rep(1, length(predictors_df$SQFT))

# create the x- matrix of explanatory variables
x <- as.matrix(cbind(x0,predictors_df))

# create the y-matrix of dependent variables
y <- as.matrix(response_df)

m <- nrow(y)


# the gradient of the cost function J(theta) = (1/(2m)) * sum((x %*% t(theta) - y)^2),
# evaluated at the current theta

cost <- function(x,y,theta){
gradient <- (1/m)* (t(x) %*% ((x%*%t(theta)) - y))
  return(t(gradient))
}

# define gradient descent with simultaneous update; returns theta after every iteration

gradient_descent <- function(x, max_iter, alpha){
    # Initialize the parameters
      theta <- matrix(c(0, 0, 0), nrow=1)
      theta_ret <- c()   # store theta after each iteration
  
  for (i in 1:max_iter) {
       theta <- theta - alpha  * cost(x, y, theta)  
       theta_ret <- rbind(theta_ret,theta)    
  }
  
return(theta_ret)
}


# Perform gradient descent for alpha values 0.001, 0.01, 0.1, 1.0,
# with 100 iterations each.

alpha_iter <- c(0.001,0.01,0.1,1.0)
par(mfrow=c(2,2))


## Plot the evolution of the three coefficients over the iterations, for each alpha:
## intercept (x0) = blue, SQFT = red, BDR = black

theta_ret<-c()

for(i in 1:length(alpha_iter)) {

 tab <- gradient_descent(x,100, alpha_iter[i])
 
 plot(tab[,1],type="b",ylim=c(min(tab),max(tab)),col="blue",lty=1,ylab="Value",lwd=0.5,
      xlab=paste("iteration, alpha =", alpha_iter[i]),xaxt="n")
      lines(tab[,2],type="b",col="red",lty=1,lwd=0.5)
      lines(tab[,3],type="b",col="black",lty=1,lwd=0.5)
   }
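The assignment also asks for J(θ) computed at the end of each iteration and plotted against the iteration number. The loop above tracks the coefficients themselves; below is a minimal sketch of the cost curves, reusing the x, y, m, alpha_iter, and gradient_descent objects defined above:

# cost function J(theta) = (1/(2m)) * sum of squared residuals
J <- function(x, y, theta) {
  (1 / (2 * m)) * sum((x %*% t(theta) - y)^2)
}

par(mfrow = c(2, 2))
for (a in alpha_iter) {
  thetas <- gradient_descent(x, 100, a)
  costs  <- apply(thetas, 1, function(th) J(x, y, matrix(th, nrow = 1)))
  plot(costs, type = "l", xlab = paste("iteration, alpha =", a), ylab = "J(theta)")
}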

##############################

Once you have your final gradient descent solution, compare this with regular linear regression (using the built-in function in R)

fit <- lm(y ~ x[, 2]+x[, 3])
summary(fit)
## 
## Call:
## lm(formula = y ~ x[, 2] + x[, 3])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -130582  -43636  -10829   43698  198147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   340413       9637  35.323  < 2e-16 ***
## x[, 2]        110631      11758   9.409 4.22e-12 ***
## x[, 3]         -6650      11758  -0.566    0.575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66070 on 44 degrees of freedom
## Multiple R-squared:  0.7329, Adjusted R-squared:  0.7208 
## F-statistic: 60.38 on 2 and 44 DF,  p-value: 2.428e-13
par(mfrow=c(2,2))
plot(fit)

Solve using the Ordinary Least Squares approach:

lst<- solve(t(x)%*%x)%*%t(x)%*%y
lst
##           Price
## x0   340412.660
## SQFT 110631.050
## BDR   -6649.474

Comparing all 3 solutions:

Gradient Descent approach:

# Note: tab holds the iterations from the last alpha in the loop above (alpha = 1.0),
# so the same converged coefficients are printed on each pass of this loop.
for(i in 1:length(alpha_iter)) {
  print (tab[100,])
  #print(paste(tab[100,]))
}
##         x0       SQFT        BDR 
## 340412.660 110631.050  -6649.474 
##         x0       SQFT        BDR 
## 340412.660 110631.050  -6649.474 
##         x0       SQFT        BDR 
## 340412.660 110631.050  -6649.474 
##         x0       SQFT        BDR 
## 340412.660 110631.050  -6649.474

Least Squares

data.frame(lst)
##           Price
## x0   340412.660
## SQFT 110631.050
## BDR   -6649.474

linear regression:

data.frame(fit$coefficients) 
##             fit.coefficients
## (Intercept)       340412.660
## x[, 2]            110631.050
## x[, 3]             -6649.474

All three solutions agree closely, in particular when we compare gradient descent with learning rate alpha = 1.0 against the least-squares and linear-regression results. For smaller learning rates, 100 iterations are not enough for gradient descent to converge, so its coefficients do not yet match the least-squares and linear-regression solutions.

Cross Validation

library(DAAG)
## Warning: package 'DAAG' was built under R version 3.2.3
## Loading required package: lattice
library(boot)
## 
## Attaching package: 'boot'
## 
## The following object is masked from 'package:lattice':
## 
##     melanoma
# Plotting the price with predicted data
d9 = data.frame(data_std)
fit <- lm(Price ~ SQFT+BDR, data= d9)
cv.k5 <- cv.lm(d9, fit, m=5)
## Analysis of Variance Table
## 
## Response: Price
##           Df   Sum Sq  Mean Sq F value  Pr(>F)    
## SQFT       1 5.26e+11 5.26e+11  120.44 3.5e-14 ***
## BDR        1 1.40e+09 1.40e+09    0.32    0.57    
## Residuals 44 1.92e+11 4.37e+09                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in cv.lm(d9, fit, m = 5): 
## 
##  As there is >1 explanatory variable, cross-validation
##  predicted values for a fold are not a linear function
##  of corresponding overall predicted values.  Lines that
##  are shown for the different folds are approximate

## 
## fold 1 
## Observations in test set: 9 
##                  1      4      8      9     18     20     23     32     47
## Predicted   356283 269244 262037 255495 235448 476593 286678 220071 230854
## cvpred      361440 290164 272862 266712 247872 463986 296020 254476 243554
## Price       399900 232000 198999 212000 199900 599000 242900 169900 239500
## CV residual  38460 -58164 -73863 -54712 -47972 135014 -53120 -84576  -4054
## 
## Sum of squares = 4.38e+10    Mean square = 4.87e+09    n = 9 
## 
## fold 2 
## Observations in test set: 10 
##                  5     13     15     17     25     34      39     40
## Predicted   472278 326492 239903 255880 604913 500088  641419 355619
## cvpred      485220 327506 234866 250217 625381 513231  666180 360410
## Price       539900 329999 259900 299900 573900 579900  549000 287000
## CV residual  54680   2493  25034  49683 -51481  66669 -117180 -73410
##                 42     46
## Predicted   374937 312464
## cvpred      379336 314239
## Price       329900 299900
## CV residual -49436 -14339
## 
## Sum of squares = 3.5e+10    Mean square = 3.5e+09    n = 10 
## 
## fold 3 
## Observations in test set: 10 
##                  2      3     11     14     16     22     28     31     33
## Predicted   286121 397489 324715 669293 374830 334952 415030 328130 338636
## cvpred      287518 395850 322524 655173 371273 337553 412913 330918 336065
## Price       329900 369000 239999 699900 449900 255000 469000 349900 314900
## CV residual  42382 -26850 -82525  44727  78627 -82553  56087  18982 -21165
##                 38
## Predicted   351443
## cvpred      348524
## Price       345000
## CV residual  -3524
## 
## Sum of squares = 2.83e+10    Mean square = 2.83e+09    n = 10 
## 
## fold 4 
## Observations in test set: 9 
##                 19     21     26     27     29      30     36     44
## Predicted   417846 309369 216516 266353 369647  430482 263430 230437
## cvpred      410731 300089 199169 253336 365604  431724 250159 214299
## Price       499998 252900 249900 464500 475000  299900 249900 299000
## CV residual  89267 -47189  50731 211164 109396 -131824   -259  84701
##                 45
## Predicted   190729
## cvpred      178402
## Price       179900
## CV residual   1498
## 
## Sum of squares = 9.39e+10    Mean square = 1.04e+10    n = 9 
## 
## fold 5 
## Observations in test set: 9 
##                  6      7     10     12     24     35     37     41
## Predicted   330979 276933 271365 341805 327777 306756 235866 303768
## cvpred      348982 281853 276535 343814 345924 325847 242629 291976
## Price       299900 314900 242500 347000 259900 285900 229900 368500
## CV residual -49082  33047 -34035   3186 -86024 -39947 -12729  76524
##                  43
## Predicted    412000
## cvpred       426367
## Price        314000
## CV residual -112367
## 
## Sum of squares = 3.23e+10    Mean square = 3.59e+09    n = 9 
## 
## Overall (Sum over all 9 folds) 
##       ms 
## 4.96e+09
# estimate the test prediction error.
fit <- glm(Price ~ SQFT+BDR, data= d9)
cv.k5 <- cv.glm(d9, fit, K=5) # 5 fold cross-validation
names(cv.k5)
## [1] "call"  "K"     "delta" "seed"
cv.k5$delta
## [1] 5.51e+09 5.35e+09
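The first component of cv.k5$delta is the raw cross-validation estimate of the test mean squared error (about 5.5e+09); its square root gives a root-mean-squared prediction error of roughly $74,000 in the original price units:

sqrt(cv.k5$delta)   # test RMSE (raw and bias-adjusted estimates)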

For bonus points (4 additional points), also perform stochastic gradient descent in addition to regular gradient descent and show those results as well.

Bonus points attempt

Recall:

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

In GD, you have to run through ALL the samples in your training set to perform a single parameter update in a particular iteration; in SGD, on the other hand, you use ONLY ONE training sample from your training set to perform the update in a particular iteration.

rm(sgd)
## Warning in rm(sgd): object 'sgd' not found
library(sgd)

xsgd <- as.matrix(predictors_df[,c(1,2)])
ysgd<- as.matrix(response_df[,c(1)])



colnames(xsgd) <- c('SQFT', 'BDR')
colnames(ysgd) <- c( 'Price')


xsgd<-data.frame(xsgd)
ysgd<-data.frame(ysgd)

xsgd$SQFT <- ( xsgd$SQFT - mean(xsgd$SQFT) ) /  sd(xsgd$SQFT)
xsgd$BDR <- ( xsgd$BDR - mean(xsgd$BDR) ) /  sd(xsgd$BDR)  

#data_std <- cbind(predictors_df,response_df)



##  using the built-in stochastic gradient descent, sgd()
sgd.D93 <- sgd(ysgd$Price ~ xsgd$SQFT+xsgd$BDR, X0=1,alpha= 1, model="lm", model.control=list(family = poisson()),
               sgd.control = list(size = 10000, lr.control = c(0,0,0)))
sgd.D93$coefficients
## (Intercept)   xsgd$SQFT    xsgd$BDR 
##      319861       85762       -8300
#

Comparing the stochastic gradient descent results above to the gradient descent results below, we see that the intercepts do not match: the SGD intercept is \(319861\) while the traditional GD intercept is \(340413\), which is about 6% off. Perhaps when using SGD on a large data set, a 6% difference may not be of concern, especially since SGD runs faster than GD.

for(i in 1:1) {
  print (tab[100,])
  #print(paste(tab[100,]))
}
##     x0   SQFT    BDR 
## 340413 110631  -6649
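For comparison with the sgd package, below is a minimal hand-rolled stochastic gradient descent sketch (an addition; it reuses the x and y matrices defined earlier, and the learning rate and number of epochs are illustrative choices that may need tuning):

sgd_manual <- function(x, y, alpha = 0.01, epochs = 100) {
  theta <- matrix(0, nrow = 1, ncol = ncol(x))        # start at zero
  for (e in 1:epochs) {
    for (i in sample(nrow(x))) {                      # shuffle the observations each epoch
      xi <- x[i, , drop = FALSE]                      # a single observation (1 x 3)
      grad_i <- t(xi) %*% (xi %*% t(theta) - y[i])    # gradient contribution of one sample
      theta <- theta - alpha * t(grad_i)              # update using only this sample
    }
  }
  theta
}

set.seed(1)
sgd_manual(x, y)   # should land near the least-squares coefficients, with some noise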