1. The data set “Movies.txt” (available on the homework website) contains information o$ \(443\) semi-recent movies (from \(1990 - 2005\)). For each movie, the title, the year of release, the budget (in $), the length, the viewer rating and the type (either action, drama or comedy) are available. Use \(\mathit{ggplot2}\) to re-create the graph shown below.  In particular, your graph should contain the following features: \(•\) Scatterplot of the viewer’s rating by movie budget
\(•\) Faceted by movie type
\(•\) Points colored (discretely) by year
\(•\) Labeled axes, legend and title
\(•\) Logarithmic \(x\)-axis
\(•\) Linear smoother included (whether this makes sense for the data or not…)
movie <- read.table("http://www.math.sjsu.edu/~bremer/Teaching/Math267/Data/Movies.txt", header=T)
library(ggplot2)
p <- ggplot(movie, aes(budget/1000, rating)) # scatterplot of the viewer's rating by movie budget #
p <- p + geom_point(aes(color= factor(year)))+scale_x_log10() # Points colored (discretely) by year + logarithmic x-axis #
p <- p + stat_smooth(method = "lm") # linear smoother #
p <- p + facet_wrap(~Type) # Faceted by movie type #
p <- p + labs(x = "Movie Budget (in $k)", y = "Viewer's Rating", title = "Successful Movies")+ scale_color_discrete(name="Year") # labeled axes, legend and title #
p
\(2.\) \({\bf Gaussian \space Process \space Regression}\)
In this problem, we will generate (noisy) data from a known true function, fit a Gaussian process regression model and compare the result to the true underlying function.
\((a)\) Generate data from the model
\[y=sin(x)+ε,\space where \space ε∼N(0,σ_n^2)\]
Begin by generating \(100\) random \(x\)-values uniformly distributed over the interval \(\left[0, 2π\right]\). Then generate the corresponding y-values using intrinsic variance \(σ_n^2 = 0.25\).
n <- 100
set.seed(10)
x <- runif(n, 0, 2*pi)
e <- rnorm(n, 0, 0.25^0.5)
y <- sin(x)+e
\((b)\) Write an R-program that takes as input values for the parameters \(σ_n^2\) , \(σ_f^2\) , \(h\) and fits a Gaussian process model to the data you have created in \((a)\). Use kernel function
\[k(x, x^\prime) = \sigma^2_f exp\left( -\frac{1}{2} \frac{\left( x-x\prime \right)^2}{h^2} \right)\]
Your function should be able to produce both the value of the estimated response \(E\left[f^* \right]\) as well as the uncertainty of this estimate \(\left( Var\left( f^* \right) \right)\) at an arbitrary point \(x^* \in \left[ 0, 2\pi \right]\).
response.value <- function(x.star, h, s.f, s.n){
D <- as.matrix(dist(x, diag=T, upper=T)) # distance matrix for observations x
Sigma <- (s.f)^2*exp(-0.5*D^2/h^2)
Sigma.x <- (s.f)^2*exp(-0.5*(x-x.star)^2/h^2)
f.star <- t(Sigma.x)%*%solve(Sigma+(s.n)^2*diag(length(x)))%*%y
f.star.var <- (s.f)^2 - t(Sigma.x)%*%solve(Sigma+(s.n)^2*diag(length(x)))%*%Sigma.x
return(c(f.star, f.star.var))
}
\((c)\) For an arbitrary choice of parameters \(\left(σ_n^2 \space ,\space σ_f^2, \space h\right)\) create a graph that shows the following:
\(•\) The true function that generated the simulted data (in red) over the interval [0, 2π].
\(•\) The data from (a) (as black dots)
\(•\) The estimated response surface (as a black curve)
\(•\) The 95% confidence region of the response surface (as a grey band)
\(•\) A title that includes the parameter values you chose.
response.surface <- Vectorize(response.value, c("x.star"))
x.seq <- seq(min(x), max(x),0.01)
y.seq <- response.surface(x.seq, 2,0.5,2)[1,] # large bandwidth
y.var <- response.surface(x.seq, 2,0.5,2)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=2, Covariance=0.5, Intrinsic Variance=2")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)
\((d)\) Experiment with different values of the model parameters. Repeat \((c)\) using a better set of model parameters \(σ_n^2\),\(σ_f^2\),\(h\). Show your graph and include the parameter values you chose in the title. Explain what you are looking for in the graph to identify a “good” set of parameters.
response.surface <- Vectorize(response.value, c("x.star"))
x.seq <- seq(min(x), max(x),0.01)
y.seq <- response.surface(x.seq, 1,0.5,0.5)[1,] # large bandwidth
y.var <- response.surface(x.seq, 1,0.5,0.5)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=1, Covariance=0.5, Intrinsic Variance=0.5")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)
response.surface <- Vectorize(response.value, c("x.star"))
x.seq <- seq(min(x), max(x),0.01)
y.seq <- response.surface(x.seq, 1,1,0.5)[1,] # large bandwidth
y.var <- response.surface(x.seq, 1,1,0.5)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=1, Covariance=1, Intrinsic Variance=0.5")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)
response.surface <- Vectorize(response.value, c("x.star"))
x.seq <- seq(min(x), max(x),0.01)
y.seq <- response.surface(x.seq, 1,1,1)[1,] # large bandwidth
y.var <- response.surface(x.seq, 1,1,1)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=1, Covariance=1, Intrinsic Variance=1")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)
response.surface <- Vectorize(response.value, c("x.star"))
x.seq <- seq(min(x), max(x),0.01)
y.seq <- response.surface(x.seq,2,0.5,0.5)[1,] # large bandwidth
y.var <- response.surface(x.seq,2,0.5,0.5)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=2, Covariance=0.5, Intrinsic Variance=0.5")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)
In conclusion:
In part \((c)\), we have \(h=2\), \(\sigma^2_n=0.5\), and \(\sigma^2_f=2\). As the graph shows, the estimated response surface does not fit the true fucntion well. Indeed, partial of true function curve is not in the \(95\)% confidence interval band.
Comparing to part \((c)\), the estimated response surfaces are better in graphes shown in part \((d)\). Among the estimated response surfaces shown in part \((d)\), the one with \(h=1\), \(\sigma^2_n=0,5\), and \(\sigma^2_f=0.5\) is the best option here because \(1)\). it fits the true function line ; \(2)\). the true function in inside the \(95\)% confidence interval. Maybe the confidence interval band is narrower when \(h=2\), \(\sigma^2_n=0.5\) and \(\sigma^2_f=0.5\). However, the estimated response surface doesn’t fit the true function nearl as well as when \(h=1\), \(\sigma^2_n=0.5\) and \(\sigma^2_f=0.5\)