Math 267 - Spring 2015 Homework 5

1. The data set “Movies.txt” (available on the homework website) contains information o$ $443$ semi-recent movies (from $1990 - 2005$). For each movie, the title, the year of release, the budget (in $), the length, the viewer rating and the type (either action, drama or comedy) are available. Use $\mathit{ggplot2}$ to re-create the graph shown below. In particular, your graph should contain the following features: $•$ Scatterplot of the viewer’s rating by movie budget

$•$ Faceted by movie type

$•$ Points colored (discretely) by year

$•$ Labeled axes, legend and title

$•$ Logarithmic $x$-axis

$•$ Linear smoother included (whether this makes sense for the data or not…)

movie <- read.table("http://www.math.sjsu.edu/~bremer/Teaching/Math267/Data/Movies.txt", header=T)

library(ggplot2)

p <- ggplot(movie, aes(budget/1000, rating)) # scatterplot of the viewer's rating by movie budget #
p <- p + geom_point(aes(color= factor(year)))+scale_x_log10() # Points colored (discretely) by year + logarithmic x-axis #
p <- p + stat_smooth(method = "lm") # linear smoother #
p <- p + facet_wrap(~Type) # Faceted by movie type #
p <- p + labs(x = "Movie Budget (in $k)", y = "Viewer's Rating", title = "Successful Movies")+ scale_color_discrete(name="Year") # labeled axes, legend and title #
p

$2.$ ${\bf Gaussian \space Process \space Regression}$

In this problem, we will generate (noisy) data from a known true function, fit a Gaussian process regression model and compare the result to the true underlying function.

$(a)$ Generate data from the model

\[y=sin(x)+ε,\space where \space ε∼N(0,σ_n^2)\]

Begin by generating $100$ random $x$-values uniformly distributed over the interval $\left[0, 2π\right]$. Then generate the corresponding y-values using intrinsic variance $σ_n^2 = 0.25$.

n <- 100
set.seed(10)
x <- runif(n, 0, 2*pi)
e <- rnorm(n, 0, 0.25^0.5)
y <- sin(x)+e

$(b)$ Write an R-program that takes as input values for the parameters $σ_n^2$ , $σ_f^2$ , $h$ and fits a Gaussian process model to the data you have created in $(a)$. Use kernel function

\[k(x, x^\prime) = \sigma^2_f exp\left( -\frac{1}{2} \frac{\left( x-x\prime \right)^2}{h^2} \right)\]

Your function should be able to produce both the value of the estimated response $E\left[f^* \right]$ as well as the uncertainty of this estimate $\left( Var\left( f^* \right) \right)$ at an arbitrary point $x^* \in \left[ 0, 2\pi \right]$.

response.value <- function(x.star, h, s.f, s.n){
  D <- as.matrix(dist(x, diag=T, upper=T)) # distance matrix for observations x
  Sigma <- (s.f)^2*exp(-0.5*D^2/h^2)
  Sigma.x <- (s.f)^2*exp(-0.5*(x-x.star)^2/h^2)
  f.star <- t(Sigma.x)%*%solve(Sigma+(s.n)^2*diag(length(x)))%*%y
  f.star.var <- (s.f)^2 - t(Sigma.x)%*%solve(Sigma+(s.n)^2*diag(length(x)))%*%Sigma.x
  return(c(f.star, f.star.var))
}

$(c)$ For an arbitrary choice of parameters $\left(σ_n^2 \space ,\space σ_f^2, \space h\right)$ create a graph that shows the following:

$•$ The true function that generated the simulted data (in red) over the interval [0, 2π].

$•$ The data from (a) (as black dots)

$•$ The estimated response surface (as a black curve)

$•$ The 95% confidence region of the response surface (as a grey band)

$•$ A title that includes the parameter values you chose.

response.surface <- Vectorize(response.value, c("x.star"))

x.seq <- seq(min(x), max(x),0.01)

y.seq <- response.surface(x.seq, 2,0.5,2)[1,] # large bandwidth
y.var <- response.surface(x.seq, 2,0.5,2)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=2, Covariance=0.5, Intrinsic Variance=2")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)

$(d)$ Experiment with different values of the model parameters. Repeat $(c)$ using a better set of model parameters $σ_n^2$,$σ_f^2$,$h$. Show your graph and include the parameter values you chose in the title. Explain what you are looking for in the graph to identify a “good” set of parameters.

response.surface <- Vectorize(response.value, c("x.star"))

x.seq <- seq(min(x), max(x),0.01)

y.seq <- response.surface(x.seq, 1,0.5,0.5)[1,] # large bandwidth
y.var <- response.surface(x.seq, 1,0.5,0.5)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=1, Covariance=0.5, Intrinsic Variance=0.5")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)

response.surface <- Vectorize(response.value, c("x.star"))

x.seq <- seq(min(x), max(x),0.01)

y.seq <- response.surface(x.seq, 1,1,0.5)[1,] # large bandwidth
y.var <- response.surface(x.seq, 1,1,0.5)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=1, Covariance=1, Intrinsic Variance=0.5")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)

response.surface <- Vectorize(response.value, c("x.star"))

x.seq <- seq(min(x), max(x),0.01)

y.seq <- response.surface(x.seq, 1,1,1)[1,] # large bandwidth
y.var <- response.surface(x.seq, 1,1,1)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=1, Covariance=1, Intrinsic Variance=1")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)

response.surface <- Vectorize(response.value, c("x.star"))

x.seq <- seq(min(x), max(x),0.01)

y.seq <- response.surface(x.seq,2,0.5,0.5)[1,] # large bandwidth
y.var <- response.surface(x.seq,2,0.5,0.5)[2,]
y.upper <- y.seq + 1.96*sqrt(y.var)
y.lower <- y.seq - 1.96*sqrt(y.var)
plot(x,y, pch = 16, main = "Bandwidth=2, Covariance=0.5, Intrinsic Variance=0.5")
polygon(c(x.seq, rev(x.seq)), c(y.lower, rev(y.upper)), col = "grey85", border = NA)
points(x,y, pch = 16)
points(x.seq, sin(x.seq), type="l", col="red", lwd=3)
points(x.seq, y.seq, type = "l", col = "black", lwd = 2)

In conclusion:

In part $(c)$, we have $h=2$, $\sigma^2_n=0.5$, and $\sigma^2_f=2$. As the graph shows, the estimated response surface does not fit the true fucntion well. Indeed, partial of true function curve is not in the $95$% confidence interval band.

Comparing to part $(c)$, the estimated response surfaces are better in graphes shown in part $(d)$. Among the estimated response surfaces shown in part $(d)$, the one with $h=1$, $\sigma^2_n=0,5$, and $\sigma^2_f=0.5$ is the best option here because $1)$. it fits the true function line ; $2)$. the true function in inside the $95$% confidence interval. Maybe the confidence interval band is narrower when $h=2$, $\sigma^2_n=0.5$ and $\sigma^2_f=0.5$. However, the estimated response surface doesn’t fit the true function nearl as well as when $h=1$, $\sigma^2_n=0.5$ and $\sigma^2_f=0.5$

Math 267 - Spring 2015 Homework 5

Liqian Situ

March 5, 2015