\[ \mathbb 1. \ Review \ of \ Essential \ Concepts \ - \ 15 \ points \\ \]
\[ \begin{align} \begin{bmatrix} 1 & -1 & 3 & -5 \\ 2 & 1 & 5 & -9 \\ 6 & -1 & -2 & 4 \\ \end{bmatrix} \end{align} \]
A<-matrix(c(1,-1,3,-5,2,1,5,-9,6,-1,-2,4),byrow=TRUE,nrow=3)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 -1 3 -5
## [2,] 2 1 5 -9
## [3,] 6 -1 -2 4
#Multiply the 1st row by 2
A[1,]<- A[1,]*2
A
## [,1] [,2] [,3] [,4]
## [1,] 2 -2 6 -10
## [2,] 2 1 5 -9
## [3,] 6 -1 -2 4
#Subtract the 1st row from the 2nd row and restore it
A[2,] <- A[2,] - A[1,]
A[1,] <- A[1,] / 2
A
## [,1] [,2] [,3] [,4]
## [1,] 1 -1 3 -5
## [2,] 0 3 -1 1
## [3,] 6 -1 -2 4
#Multiply the 1st row by 6
A[1,]<- A[1,] * 6
A
## [,1] [,2] [,3] [,4]
## [1,] 6 -6 18 -30
## [2,] 0 3 -1 1
## [3,] 6 -1 -2 4
#Subtract the 1st row from the 3rd row and restore it
A[3,]<- A[3,] - A[1,]
A[1,]<- A[1,] / 6
A
## [,1] [,2] [,3] [,4]
## [1,] 1 -1 3 -5
## [2,] 0 3 -1 1
## [3,] 0 5 -20 34
# Divide the 2nd row by 3
A[2,]<- A[2,] / 3
A
## [,1] [,2] [,3] [,4]
## [1,] 1 -1 3.0000000 -5.0000000
## [2,] 0 1 -0.3333333 0.3333333
## [3,] 0 5 -20.0000000 34.0000000
# Multiply the 2nd row by 5
A[2,] <- A[2,] * 5
# Subtract the 2nd row from the 3rd row and restore it
A[3,]<- A[3,] - A[2,]
A[2,]<- A[2,] /5
A
## [,1] [,2] [,3] [,4]
## [1,] 1 -1 3.0000000 -5.0000000
## [2,] 0 1 -0.3333333 0.3333333
## [3,] 0 0 -18.3333333 32.3333333
# Divide the third row by A[3,3]
A[3,]<- A[3,] / A[3,3]
A
## [,1] [,2] [,3] [,4]
## [1,] 1 -1 3.0000000 -5.0000000
## [2,] 0 1 -0.3333333 0.3333333
## [3,] 0 0 1.0000000 -1.7636364
# Add row 2 to row 1
A[1,]<- A[1,] + A[2,]
A
## [,1] [,2] [,3] [,4]
## [1,] 1 0 2.6666667 -4.6666667
## [2,] 0 1 -0.3333333 0.3333333
## [3,] 0 0 1.0000000 -1.7636364
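As a quick cross-check, if the 3×4 matrix is read as the augmented matrix [B | b] of a linear system, base R's solve() should agree with the last column of the reduced matrix above; a minimal sketch:
# coefficient matrix and right-hand side taken from the original augmented matrix
B <- matrix(c(1, -1, 3,
              2,  1, 5,
              6, -1, -2), byrow = TRUE, nrow = 3)
b <- c(-5, -9, 4)
# the last component of the solution should match A[3,4] above (about -1.7636)
solve(B, b)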
\[ \\ \\ \\ \]
\[ \begin{align} \begin{bmatrix} 1 & -1 & 3 & -5 \\ 2 & 1 & 5 & -9 \\ 6 & -1 & -2 & 4 \\ \end{bmatrix} \end{align} \]
The transpose of the above matrix is: \[ \begin{align} \begin{bmatrix} 1 & 2 & 6 \\ -1 & 1 & -1 \\ 3 & 5 & -2 \\ -5 & -9 & 4 \\ \end{bmatrix} \end{align} \]
A subset {v_1,…,v_k} of a vector space V, with the inner product <,>, is called orthonormal if:
1- the vectors are unit vectors; that is, each has length one, and
2- the vectors are orthogonal: <v_i, v_j> = 0 when i != j; that is, the vectors are mutually perpendicular.
An example of an orthonormal basis for the 3-dimensional vector space R x R x R is as follows:
v1= c(1/sqrt(2),0,-1/sqrt(2))
v2= c(1/2,sqrt(2)/2, 1/2)
v3= c(1/2,-sqrt(2)/2, 1/2)
v1
## [1] 0.7071068 0.0000000 -0.7071068
v2
## [1] 0.5000000 0.7071068 0.5000000
v3
## [1] 0.5000000 -0.7071068 0.5000000
round(v1 %*%v2, 10)==0
## [,1]
## [1,] TRUE
round(v1 %*%v3, 10)==0
## [,1]
## [1,] TRUE
round(v3 %*%v2, 10)==0
## [,1]
## [1,] TRUE
# check that each vector has unit length
sqrt(sum(v1^2))
## [1] 1
sqrt(sum(v2^2))
## [1] 1
sqrt(sum(v3^2))
## [1] 1
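An equivalent way to verify orthonormality is to bind the vectors into a matrix Q and check that t(Q) %*% Q is (numerically) the identity; a minimal sketch:
# for an orthonormal set, the Gram matrix t(Q) %*% Q is the identity
Q <- cbind(v1, v2, v3)
round(t(Q) %*% Q, 10)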
\[ \begin{align} \begin{bmatrix} 5 & 0 & 3 \\ 0 & 1 & -2 \\ 1 & 2 & 0 \\ \end{bmatrix} \end{align} \]
A<-matrix(c(5,0,3,0,1,-2,1,2,0),byrow=TRUE,nrow=3)
A
## [,1] [,2] [,3]
## [1,] 5 0 3
## [2,] 0 1 -2
## [3,] 1 2 0
\[ The \ characteristic \ polynomial \ of \ an \ n \times n \ matrix \ A = (a_{ij}) \ is \ defined \ as: \\ K_A(\lambda) = det(\lambda I_n - A) \]
In our case we have a 3×3 matrix, hence: \[ K_A(\lambda) = det(\lambda I_n- A) = \lambda^3 -\sigma_1\lambda^2+\sigma_2\lambda-\sigma_3 \\ \] \[ \sigma_1 \ is \ the \ sum \ of \ all \ first-order \ diagonal \ minors \ of \ A \\ \sigma_1 = 5 + 1 + 0 = 6 \\ \]
\[ \sigma_2 \ is \ the \ sum \ of \ all \ second-order \ diagonal \ minors \ of \ A \\ \sigma_2 = \] \[ \begin{vmatrix} 5 & 0 \\ 0 & 1 \\ \end{vmatrix} + \begin{vmatrix} 1 & -2\\ 2 & 0 \\ \end{vmatrix} + \begin{vmatrix} 5 & 3 \\ 1 & 0 \\ \end{vmatrix} \] \[ \sigma_2 = 5 + 4 - 3 = 6 \]
\[ \sigma_3 \ is \ the \ only \ third-order \ diagonal \ minor \ of \ A: \\ \sigma_3 = det(A) \]
det(A)
## [1] 17
\[ \sigma_3 =det(A) = 17 \]
\[ Hence \ the \ characteristic \ polynomial \ of \ A \ is: \\ P(x) = x^3 -\sigma_1 x^2+\sigma_2 x -\sigma_3 \\ replacing \ \sigma_1 , \sigma_2 , \ and \ \sigma_3 \ with \ their \ values \ we \ get: \\ P(x) = x^3 -6 x^2+6 x -17 \\ \]
\[ K_A(\lambda) = \lambda ^3 -6 \lambda ^2+ 6 \lambda -17 \\ \]
\[ (\lambda I_3 - A) X = \theta \]
eigen(A)
## $values
## [1] 5.471264+0.000000i 0.264368+1.742772i 0.264368-1.742772i
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.98551404+0i 0.2583505-0.2761849i 0.2583505+0.2761849i
## [2,] -0.06924766+0i -0.6725516+0.0000000i -0.6725516+0.0000000i
## [3,] 0.15481227+0i -0.2473752+0.5860519i -0.2473752-0.5860519i
\[ \lambda_1 = 5.471264+0.000000i \\ \lambda_2 = 0.264368+1.742772i\\ \lambda_3 = 0.264368-1.742772i\\ \] The eigenvectors (the columns of the $vectors matrix returned by eigen) are: \[ v_1=(0.98551404+0i, \ -0.06924766+0i, \ 0.15481227+0i) \\ v_2=(0.2583505-0.2761849i, \ -0.6725516+0.0000000i, \ -0.2473752+0.5860519i) \\ v_3=(0.2583505+0.2761849i, \ -0.6725516+0.0000000i, \ -0.2473752-0.5860519i) \\ \]
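As a consistency check, the eigenvalues above should be the roots of the characteristic polynomial K_A(lambda) = lambda^3 - 6 lambda^2 + 6 lambda - 17; polyroot() can confirm this:
# polyroot() takes the coefficients in increasing order of powers
polyroot(c(-17, 6, -6, 1))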
Recall: A matrix is called column-stochastic if all of its entries are greater than or equal to zero (nonnegative) and the entries in each column sum to 1. If all entries of a matrix are nonnegative, then we say that the matrix itself is nonnegative. Furthermore, a matrix is positive if all its entries are positive (greater than zero) real numbers.
The PageRank matrix M is column-stochastic by the design of the PageRank algorithm, and all of its entries are positive. We can therefore invoke the Perron-Frobenius theorem for matrices in the analysis of the convergence of the PageRank algorithm. The Perron-Frobenius theorem states that for any column-stochastic matrix with positive entries, 1 is always an eigenvalue, and all other eigenvalues have magnitude strictly less than 1. Moreover, 1 is a simple eigenvalue ("simple" means that 1 is a root, but not a repeated root, of the characteristic polynomial).
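A small sketch illustrating the theorem (the 3×3 matrix below is an arbitrary positive column-stochastic example, not the PageRank matrix):
# each column sums to 1 and all entries are positive; the largest eigenvalue
# modulus should be exactly 1 and the others strictly less than 1
M <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.4, 0.4,
              0.1, 0.6, 0.3), nrow = 3)
Mod(eigen(M)$values)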
Recall: If X is a continuous random variable with probability density function f defined on an interval with (possibly infinite) endpoints a and b, then the mean or expected value of X is \[ E(X) = \int_a^b x f(x) dx. \] If X is discrete with probability mass function p, then \[ E(X) = \sum_{x} x \, p(x). \]
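A one-line numerical illustration of the continuous case (using X ~ Uniform(0, 1) as an example, where E(X) = 1/2):
# E(X) = integral of x * f(x); for X ~ Uniform(0,1) this is 0.5
integrate(function(x) x * dunif(x), lower = 0, upper = 1)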
The distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. In other words, suppose we have a sample containing a large number of independent observations and we compute the arithmetic average of the observed values. If this procedure is performed many times, the computed averages will be distributed approximately according to the normal distribution. This is essentially the statement of the Central Limit Theorem.
Below is a quick illustration:
# draw n uniform values, then compute the mean of a small sample (size 3) of them
# n times; the histogram of those means is approximately normal (CLT)
func1 <- function(n){
  y <- runif(n)
  out <- replicate(n, mean(sample(y, size = 3)))
  hist(out, main = paste("n =", n))
}
par(mfrow = c(2, 1))
func1(100)
func1(10000)
\[ e^x \cos^2(x)\\ \] \[ \begin{align} \frac {d} {dx} \left[e^x \cos^2(x)\right] = & \ \frac {d} {dx} [e^x] \cdot \cos^2(x) + e^x \cdot \frac {d} {dx} \left[\cos^2(x)\right] \\ = & \ e^x \cos^2(x) + e^x \cdot 2\cos(x) \cdot \frac {d} {dx}[\cos(x)] \\ = & \ e^x \cos^2(x) - 2 e^x \cos(x)\sin(x) \\ = & \ -e^x \cos(x)\left[2\sin(x) - \cos(x)\right] \\ \end{align} \]
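A quick numerical spot-check of this derivative (the test point x = 0.7 is arbitrary), comparing a central finite difference against the closed form:
g  <- function(x) exp(x) * cos(x)^2
dg <- function(x) -exp(x) * cos(x) * (2 * sin(x) - cos(x))
h <- 1e-6
(g(0.7 + h) - g(0.7 - h)) / (2 * h)   # numerical derivative
dg(0.7)                               # closed-form derivative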
\[ f(x) = e^{x^3} \] \[ \begin{align} \frac {d} {dx}\left[e^{x^3}\right] = & \ e^{x^3} \cdot \frac {d} {dx}[x^3]\\ = & \ 3x^2 e^{x^3} \\ \end{align} \]
\[ \int \left( e^x \cos(x) + \sin(x) \right) dx = \int e^x \cos(x)\, dx + \int \sin(x)\, dx \\ \] Using integration by parts on the first integral (with u = cos(x) and dv = e^x dx) we get: \[ \begin{align} \int e^x \cos(x)\, dx = & \ e^x \cos(x) - \int e^x \left(-\sin(x)\right) dx \\ = & \ e^x \cos(x) + \int e^x \sin(x)\, dx \\ = & \ e^x \cos(x) + \left( e^x \sin(x) - \int e^x \cos(x)\, dx \right) \\ \end{align} \]
Moving the remaining integral of e^x cos(x) to the left-hand side, we get:
\[ \begin{align} 2 \int e^x \cos(x)\, dx = & \ e^x \cos(x) + e^x \sin(x) \\ \int e^x \cos(x)\, dx = & \ \frac {e^x} {2} \left( \cos(x) + \sin(x) \right) \\ \end{align} \] Therefore,
\[ \begin{align} \int \left( e^x \cos(x) + \sin(x) \right) dx = & \ \frac {e^x} {2} \left( \cos(x) + \sin(x) \right) + \int \sin(x)\, dx \\ = & \ \frac {e^x} {2} \left( \cos(x) + \sin(x) \right) - \cos(x) \\ \end{align} \] Hence: \[ \int \left( e^x \cos(x) + \sin(x) \right) dx = \frac {e^x} {2} \left( \cos(x) + \sin(x) \right) - \cos(x) + C \\ \]
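The antiderivative can be spot-checked numerically on an arbitrary interval, say [0, 1], by comparing integrate() with the antiderivative evaluated at the endpoints:
f  <- function(x) exp(x) * cos(x) + sin(x)
Fa <- function(x) exp(x) / 2 * (cos(x) + sin(x)) - cos(x)
integrate(f, lower = 0, upper = 1)$value
Fa(1) - Fa(0)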
\[ \mathbb 2. \ Mini-coding \ assignments\ - \ 15 \ points \\ \]
Assume that you have a function that generates integers between 0 and 20 with the following probability distribution: \[ P(X = k) = \binom{20}{k} p^k q^{20-k} \]
where p = 0.25, q = 1 - p = 0.75, and x in [0, 20]. This is also known as a Binomial distribution. Write a function to sample from this distribution. After that, generate 1000 samples from this distribution and plot the histogram of the sample. Please note that the Binomial distribution is a discrete distribution and takes values only at integer values of x, with x in [0, 20]. Sampling from a discrete distribution with finite values is very simple but it is not the same as sampling from a continuous distribution.
# func: binomial probability mass function P(X = k) with
#   x:    vector of values (0..20)
#   size: number of trials (20)
#   p:    probability of success on each trial
func <- function(x, size, p) {
  dbinom(x, size, p, log = FALSE)
}
# the support of the distribution
x <- seq(0, 20, by = 1)
# draw 1000 values from the support, with replacement, weighted by the binomial probabilities
y <- sample(x, 1000, replace = TRUE, prob = func(x, 20, 0.25))
hist(y)
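As a sanity check, the custom sampler can be compared against R's built-in binomial sampler rbinom(); both sample means should be close to 20 * 0.25 = 5:
y_builtin <- rbinom(1000, size = 20, prob = 0.25)
c(mean_custom = mean(y), mean_builtin = mean(y_builtin))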
For the auto data set attached with the final exam, please perform a Principal Components Analysis by performing an SVD on the 4 independent variables (with mpg as the dependent variable) and select the top 2 directions. Please scatter plot the data set after it has been projected to these two dimensions. Your code should print out the two orthogonal vectors and also perform the scatter plot of the data after it has been projected to these two dimensions.
datapath <- "C:/CUNY/Courses/IS605/Assignments/final/auto-mpg.DATA"
autompg_data <- scan(datapath)
autompg_data_mtrx <- t(matrix(autompg_data, nrow = 5))
colnames(autompg_data_mtrx) <- c("displacement", "horsepower", "weight", "acceleration","mpg")
#autompg_data_mtrx
A<- autompg_data_mtrx[,1:4]
mpg <- data.matrix(autompg_data_mtrx[,5])
colnames(mpg)<- "mpg"
pairs(~.,data=cbind(A,mpg),
main="Scatterplot of Auto mpg Data before SVD")
# We can use the sweep function to perform arbitrary operations on the rows and columns of a matrix.
# The second argument specifies we want to operate on the columns (1 would be used for rows),
# and the third and fourth arguments specify that we want to subtract the column means.
cx <- sweep(A, 2, colMeans(A), "-")
sv <- svd(cx)
names(sv)
## [1] "d" "u" "v"
mpg <- data.matrix(autompg_data_mtrx[,5])
colnames(mpg)<- "mpg"
plot(sv$u[, 1], sv$u[, 2], col = mpg, main = "SVD", xlab = "U1", ylab = "U2")
# Using PCA..
pc <- prcomp(A)
names(pc)
## [1] "sdev" "rotation" "center" "scale" "x"
plot(pc$x[, 1], pc$x[, 2], col = mpg, main = "PCA", xlab = "PC1", ylab = "PC2")
# for comparison, reconstruct the data using all 4 singular values (no dimension reduction) and plot
top2<- 4
utop2 <- (sv$u[, 1:top2])
vtop2 <- (sv$v[, 1:top2])
dtop2 <- (diag(sv$d)[1:top2, 1:top2])
Atop2_dec <- utop2 %*% dtop2 %*% t(vtop2)
colnames(Atop2_dec) <- c("displacement", "horsepower", "weight", "acceleration")
pairs(~.,data=cbind(Atop2_dec,mpg), main=paste("Scatterplot of four dimensions."))
Please scatter plot the data set after it has been projected to these two dimensions. Your code should print out the two orthogonal vectors and also perform the scatter plot of the data after it has been projected to these two dimensions.
top2<- 2
utop2 <- (sv$u[, 1:top2])
vtop2 <- (sv$v[, 1:top2])
dtop2 <- (diag(sv$d)[1:top2, 1:top2])
Atop2_dec <- utop2 %*% dtop2 %*% t(vtop2)
colnames(Atop2_dec) <- c("displacement", "horsepower", "weight", "acceleration")
pairs(~.,data=cbind(Atop2_dec,mpg), main="Scatterplot of the top two dimensions (after SVD)")
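The assignment also asks to print the two orthogonal vectors; with the SVD already computed, these are the first two right singular vectors (the top two principal directions), and their orthonormality can be verified directly:
# the top two directions: the first two columns of V
sv$v[, 1:2]
# their Gram matrix should be (numerically) the 2x2 identity
round(t(sv$v[, 1:2]) %*% sv$v[, 1:2], 10)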
As we discussed in class, in bootstrapping we start with n data points and repeatedly sample many times with replacement. Each time, we generate a candidate data set of size n from the original data set. All parameter estimations are performed on these candidate data sets. It can be easily shown that any particular data set generated by sampling n points from an original set of size n covers roughly 63.2% of the original data set. Using probability theory and limits, please prove that this is true. After that, write a program to perform this sampling and show that the empirical observation also agrees with this.
In bootstrapping, the sample S is constructed by randomly drawing, with replacement, n data points from the dataset of size n. It replaces the distribution D (according to which the data are generated) by an empirical distribution D̂, a discrete distribution that assigns probability 1/n to each data point. The bootstrapped dataset may not contain all of the data points from the original dataset, and some may occur many times. The probability that a given data point is not selected in a single draw is (1 - 1/n). The probability that the jth observation is not in the bootstrap sample at all is therefore the product of the probabilities that each of the n draws misses it: (1 - 1/n) * (1 - 1/n) * ... * (1 - 1/n) = (1 - 1/n)^n.
For large sample sizes, the probability that a given point is never selected is approximately 1/e = 0.368. On average, a bootstrapped dataset therefore contains about 1 - 0.368 = 63.2% of the distinct data points.
Numerical check: When n=5, the probability that the jth observation is in the bootstrap sample is: \[ P(jth \ obs \ in \ bootstrap \ sample) = 1 - (1 - 1/5)^5 = 0.67232 \] When n=10: \[ P(jth \ obs \ in \ bootstrap \ sample) = 1 - (1 - 1/10)^{10} = 0.6513215599 \] When n=100: \[ P(jth \ obs \ in \ bootstrap \ sample) = 1 - (1 - 1/100)^{100} = 0.6339676587 \]
x <- 1:100000
plot(x, 1 - (1 - 1/x)^x)
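The exam also asks for a program that performs the sampling itself; the small simulation below (the helper coverage() is not part of the original code) bootstraps n indices many times and records the average fraction of distinct original points that appear:
# average fraction of original points that appear in a bootstrap sample of size n
coverage <- function(n, reps = 1000) {
  mean(replicate(reps, length(unique(sample(1:n, n, replace = TRUE))) / n))
}
set.seed(1)
sapply(c(10, 100, 1000), coverage)   # values should approach 1 - 1/e ~ 0.632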
\[ \lim_{n\to \infty} ( 1- \frac {1} {n})^n = 1/e \qquad Equation (1) \]
Let f(x) = ln(x). Then we know that f'(x) = 1/x. Also, by the definition of the derivative, we can write: \[ \begin{align} f'(x) = & \lim_{h\to 0} \frac{f(x+h) - f(x)} {h} \\ = & \lim_{h\to 0} \frac{ln(x+h) - ln(x)} {h} \\ = & \lim_{h\to 0} \frac {1} {h} ln \frac {x+h} {x} \\ = & \lim_{h\to 0} ln \left( \frac {x+h} {x} \right)^\frac {1} {h} \\ = & \lim_{h\to 0} ln \left( 1 + \frac {h} {x} \right)^\frac {1}{h} \\ \end{align} \] Then, using the fact that ln(x) is a continuous function for all x in its domain, we can exchange the lim and ln:
\[ f'(x) = ln \lim_{h\to 0} \left( 1 + \frac {h} {x} \right)^\frac {1}{h} \\ Now \ let \ m = \frac {1} {h}. \ Then \ m-> \infty \ as \ h-> 0 \ and \\ \]
\[ f'(x) = ln \lim_{m\to \infty} \left( 1 + \frac {1} {mx} \right)^{m} \\ \]
\[ Now \ assuming \ x>0, \ define \ n=mx^2, \ and \ so \ n-> \infty \ as \ m-> \infty. \ Then \ we \ get: \\ f'(x) = ln \lim_{n\to \infty } \left[ \left( 1 + \frac {x} {n} \right)^n \right]^\frac {1} {x^2} \]
and from before, we still have f'(x) = 1/x, therefore:
\[ ln \lim_{n\to \infty } \left[ \left( 1 + \frac {x} {n} \right)^n \right]^\frac {1} {x^2} = \frac {1} {x} \] Exponentiating both sides, we get:
\[ \lim_{n\to \infty } \left[ \left( 1 + \frac {x} {n} \right)^n \right]^\frac {1} {x^2} = e^{\frac {1} {x}} \]
Finally, raising both sides to the x^2, we get the Equation (2): \[ \lim_{n\to \infty } \left( 1 + \frac {x} {n} \right)^n = e^{x} \qquad Equation (2) \]
\[ \lim_{n\to \infty} ( 1- \frac {1} {n})^n = 1/e \]
\[ \begin{align} \lim_{n\to \infty } \left( 1 + \frac {-1} {n} \right)^n = & \ e^{-1} \\ \lim_{n\to \infty } \left( 1 - \frac {1} {n} \right)^n = & \ e^{-1} \approx 0.368 \\ \end{align} \]
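A quick numerical check of Equation (1):
# (1 - 1/n)^n approaches exp(-1) as n grows
n <- 10^(1:6)
cbind(n, value = (1 - 1/n)^n, limit = exp(-1))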
\[ \\ \\ \]
\[ \mathbb 3. \ Mini \ project \ - \ 20 \ points \\ \]
In this mini project, you'll perform a Multivariate Linear Regression analysis using Gradient Descent. The data set consists of two predictor variables and one response variable. The predictor variables are living area in square feet and number of bedrooms. The response variable is the price of the house. You have 47 data points in total. Since the number of rooms and the living area are in different units, it is hard to compare them in relative terms. One way to compensate for this is to standardize the variables. In order to standardize, you estimate the mean and standard deviation of each variable and then compute new versions of these variables.
For instance, if you have a variable x, then the standardized version of x is x_std = (x - μ) / σ, where μ and σ are the mean and standard deviation of x, respectively.
As we saw in the gradient descent equations, we introduce a dummy variable x0 = 1 in order to calculate the intercept term of the linear regression. Please standardize the 2 variables, introduce the dummy variable, and then write a function to perform gradient descent on this data set. You'll repeat gradient descent for a range of learning-rate 'a' values. Please use a = (0.001, 0.01, 0.1, 1.0) as your choices. For each value of 'a' perform about 100 iterations
and compute J(θ) at the end of each iteration. Please plot J(θ) versus the number of iterations for each of the 4 'a' choices. Once you have your final gradient descent solution, compare this with regular linear regression (using the built-in function in R). Also solve using the Ordinary Least Squares approach. Please document all 3 solutions in your submission. For bonus points (4 additional points), also perform stochastic gradient descent in addition to regular gradient descent and show those results as well.
Finally, using the built-in cross-validation function in R, perform a 5-fold cross-validation and estimate the test prediction error. In order to perform the cross-validation, you can also write your own procedure if you prefer to use your own instead of the built-in functions. You can choose any one of the 3 approaches to perform cross-validation on. No need to do it for all 3 approaches.
1- Load the data.
2- Standardize the predictor variables.
3- Combine the two data sets so that BDR and SQFT are in standardized form and Price is in its original form.
##############################
xdatapath <- "C:/CUNY/Courses/IS605/Assignments/final/mini-project-data/ex3x.dat"
ydatapath <- "C:/CUNY/Courses/IS605/Assignments/final/mini-project-data/ex3y.dat"
predictors <- scan(xdatapath)
response <- scan(ydatapath)
predictors_df <- data.frame(t(matrix(predictors, nrow = 2)))
#predictors_df
response_df <- data.frame(t(matrix(response, nrow = 1)))
#response_df
colnames(predictors_df) <- c('SQFT', 'BDR')
colnames(response_df) <- c( 'Price')
#standardizing data by estimating the mean and standard deviation of each variable and then compute
#new versions of these variables.
predictors_df$SQFT <- ( predictors_df$SQFT - mean(predictors_df$SQFT) ) / sd(predictors_df$SQFT)
predictors_df$BDR <- ( predictors_df$BDR - mean(predictors_df$BDR) ) / sd(predictors_df$BDR)
data_std <- cbind(predictors_df,response_df)
head(data_std)
## SQFT BDR Price
## 1 0.13000987 -0.2236752 399900
## 2 -0.50418984 -0.2236752 329900
## 3 0.50247636 -0.2236752 369000
## 4 -0.73572306 -1.5377669 232000
## 5 1.25747602 1.0904165 539900
## 6 -0.01973173 1.0904165 299900
# add a column of 1's for the intercept coefficient
x0 <- rep(1, length(predictors_df$SQFT))
# create the x- matrix of explanatory variables
x <- as.matrix(cbind(x0,predictors_df))
# create the y-matrix of dependent variables
y <- as.matrix(response_df)
m <- nrow(y)
# gradient of the cost function J(theta): takes the predictors x, the response y,
# and the current theta (a 1 x 3 row vector), and returns the gradient as a row vector
cost <- function(x,y,theta){
  gradient <- (1/m) * (t(x) %*% ((x %*% t(theta)) - y))
  return(t(gradient))
}
# gradient descent with simultaneous update of all parameters;
# returns the full history of theta values (one row per iteration)
gradient_descent <- function(x, max_iter, alpha){
  # initialize the parameters and the history of iterates
  theta <- matrix(c(0, 0, 0), nrow = 1)
  theta_ret <- c()
  for (i in 1:max_iter) {
    theta <- theta - alpha * cost(x, y, theta)
    theta_ret <- rbind(theta_ret, theta)
  }
  return(theta_ret)
}
# We will perform gradient descent for alpha values 0.001, 0.01, 0.1, 1.0,
# with 100 iterations each.
alpha_iter <- c(0.001,0.01,0.1,1.0)
par(mfrow=c(2,2))
## Plotting the parameter estimates at each iteration:
## intercept (x0) = blue, SQFT = red, BDR = black
for(i in 1:length(alpha_iter)) {
  tab <- gradient_descent(x, 100, alpha_iter[i])
  plot(tab[,1], type="b", ylim=c(min(tab),max(tab)), col="blue", lty=1, ylab="Value", lwd=0.5,
       xlab=paste("iteration, alpha =", alpha_iter[i]), xaxt="n")
  lines(tab[,2], type="b", col="red", lty=1, lwd=0.5)
  lines(tab[,3], type="b", col="black", lty=1, lwd=0.5)
}
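The assignment also asks for a plot of J(θ) versus the number of iterations for each learning rate; the sketch below (the helper J() is an addition) reuses the theta history returned by gradient_descent() to compute the usual squared-error cost at every iteration:
# J(theta) = (1/(2m)) * sum((x %*% theta - y)^2)
J <- function(theta) sum((x %*% theta - y)^2) / (2 * m)
par(mfrow = c(2, 2))
for (a in alpha_iter) {
  tab_a <- gradient_descent(x, 100, a)
  cost_hist <- apply(tab_a, 1, function(th) J(matrix(th, ncol = 1)))
  plot(cost_hist, type = "l", xlab = "iteration", ylab = "J(theta)",
       main = paste("alpha =", a))
}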
##############################
Once you have your final gradient descent solution, compare this with regular linear regression (using the built-in function in R)
fit <- lm(y ~ x[, 2]+x[, 3])
summary(fit)
##
## Call:
## lm(formula = y ~ x[, 2] + x[, 3])
##
## Residuals:
## Min 1Q Median 3Q Max
## -130582 -43636 -10829 43698 198147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 340413 9637 35.323 < 2e-16 ***
## x[, 2] 110631 11758 9.409 4.22e-12 ***
## x[, 3] -6650 11758 -0.566 0.575
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66070 on 44 degrees of freedom
## Multiple R-squared: 0.7329, Adjusted R-squared: 0.7208
## F-statistic: 60.38 on 2 and 44 DF, p-value: 2.428e-13
par(mfrow=c(2,2))
plot(fit)
Solving using the Ordinary Least Squares approach:
lst<- solve(t(x)%*%x)%*%t(x)%*%y
lst
## Price
## x0 340412.660
## SQFT 110631.050
## BDR -6649.474
Gradient Descent approach (final parameter estimates after 100 iterations; tab holds the last run, alpha = 1.0):
tab[100,]
## x0 SQFT BDR
## 340412.660 110631.050 -6649.474
Ordinary Least Squares solution:
data.frame(lst)
## Price
## x0 340412.660
## SQFT 110631.050
## BDR -6649.474
Linear regression (lm) solution:
data.frame(fit$coefficients)
## fit.coefficients
## (Intercept) 340412.660
## x[, 2] 110631.050
## x[, 3] -6649.474
The three solutions agree closely: the gradient descent estimates for learning rate alpha = 1.0 match the Ordinary Least Squares and lm() coefficients. For smaller learning rates, gradient descent has not fully converged after 100 iterations, so its estimates still differ from the least squares and linear regression solutions.
Cross Validation
library(DAAG)
## Warning: package 'DAAG' was built under R version 3.2.3
## Loading required package: lattice
library(boot)
##
## Attaching package: 'boot'
##
## The following object is masked from 'package:lattice':
##
## melanoma
# Plotting the price with predicted data
d9 = data.frame(data_std)
fit <- lm(Price ~ SQFT+BDR, data= d9)
cv.k5 <- cv.lm(d9, fit, m=5)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## SQFT 1 5.26e+11 5.26e+11 120.44 3.5e-14 ***
## BDR 1 1.40e+09 1.40e+09 0.32 0.57
## Residuals 44 1.92e+11 4.37e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in cv.lm(d9, fit, m = 5):
##
## As there is >1 explanatory variable, cross-validation
## predicted values for a fold are not a linear function
## of corresponding overall predicted values. Lines that
## are shown for the different folds are approximate
##
## fold 1
## Observations in test set: 9
## 1 4 8 9 18 20 23 32 47
## Predicted 356283 269244 262037 255495 235448 476593 286678 220071 230854
## cvpred 361440 290164 272862 266712 247872 463986 296020 254476 243554
## Price 399900 232000 198999 212000 199900 599000 242900 169900 239500
## CV residual 38460 -58164 -73863 -54712 -47972 135014 -53120 -84576 -4054
##
## Sum of squares = 4.38e+10 Mean square = 4.87e+09 n = 9
##
## fold 2
## Observations in test set: 10
## 5 13 15 17 25 34 39 40
## Predicted 472278 326492 239903 255880 604913 500088 641419 355619
## cvpred 485220 327506 234866 250217 625381 513231 666180 360410
## Price 539900 329999 259900 299900 573900 579900 549000 287000
## CV residual 54680 2493 25034 49683 -51481 66669 -117180 -73410
## 42 46
## Predicted 374937 312464
## cvpred 379336 314239
## Price 329900 299900
## CV residual -49436 -14339
##
## Sum of squares = 3.5e+10 Mean square = 3.5e+09 n = 10
##
## fold 3
## Observations in test set: 10
## 2 3 11 14 16 22 28 31 33
## Predicted 286121 397489 324715 669293 374830 334952 415030 328130 338636
## cvpred 287518 395850 322524 655173 371273 337553 412913 330918 336065
## Price 329900 369000 239999 699900 449900 255000 469000 349900 314900
## CV residual 42382 -26850 -82525 44727 78627 -82553 56087 18982 -21165
## 38
## Predicted 351443
## cvpred 348524
## Price 345000
## CV residual -3524
##
## Sum of squares = 2.83e+10 Mean square = 2.83e+09 n = 10
##
## fold 4
## Observations in test set: 9
## 19 21 26 27 29 30 36 44
## Predicted 417846 309369 216516 266353 369647 430482 263430 230437
## cvpred 410731 300089 199169 253336 365604 431724 250159 214299
## Price 499998 252900 249900 464500 475000 299900 249900 299000
## CV residual 89267 -47189 50731 211164 109396 -131824 -259 84701
## 45
## Predicted 190729
## cvpred 178402
## Price 179900
## CV residual 1498
##
## Sum of squares = 9.39e+10 Mean square = 1.04e+10 n = 9
##
## fold 5
## Observations in test set: 9
## 6 7 10 12 24 35 37 41
## Predicted 330979 276933 271365 341805 327777 306756 235866 303768
## cvpred 348982 281853 276535 343814 345924 325847 242629 291976
## Price 299900 314900 242500 347000 259900 285900 229900 368500
## CV residual -49082 33047 -34035 3186 -86024 -39947 -12729 76524
## 43
## Predicted 412000
## cvpred 426367
## Price 314000
## CV residual -112367
##
## Sum of squares = 3.23e+10 Mean square = 3.59e+09 n = 9
##
## Overall (Sum over all 9 folds)
## ms
## 4.96e+09
# estimate the test prediction error.
fit <- glm(Price ~ SQFT+BDR, data= d9)
cv.k5 <- cv.glm(d9, fit, K=5) # 5 fold cross-validation
names(cv.k5)
## [1] "call" "K" "delta" "seed"
cv.k5$delta
## [1] 5.51e+09 5.35e+09
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters iteratively to minimize an error function.
In GD you run through ALL of the samples in the training set to make a single parameter update in a given iteration, whereas in SGD you use ONLY ONE training sample from the training set to make the update in each iteration.
library(sgd)
xsgd <- as.matrix(predictors_df[,c(1,2)])
ysgd<- as.matrix(response_df[,c(1)])
colnames(xsgd) <- c('SQFT', 'BDR')
colnames(ysgd) <- c( 'Price')
xsgd<-data.frame(xsgd)
ysgd<-data.frame(ysgd)
xsgd$SQFT <- ( xsgd$SQFT - mean(xsgd$SQFT) ) / sd(xsgd$SQFT)
xsgd$BDR <- ( xsgd$BDR - mean(xsgd$BDR) ) / sd(xsgd$BDR)
#data_std <- cbind(predictors_df,response_df)
## using the built-in stochastic gradient descent, sgd()
sgd.D93 <- sgd(ysgd$Price ~ xsgd$SQFT+xsgd$BDR, X0=1,alpha= 1, model="lm", model.control=list(family = poisson()),
sgd.control = list(size = 10000, lr.control = c(0,0,0)))
sgd.D93$coefficients
## (Intercept) xsgd$SQFT xsgd$BDR
## 319861 85762 -8300
# for comparison, the batch gradient descent solution (alpha = 1.0) from above
tab[100,]
## x0 SQFT BDR
## 340413 110631 -6649