Examining the model: \[y_i=3-4x_i+\epsilon_i\]
We will use the rnorm function to create an object that represents a data set of 1000 observations of a particular variable x
set.seed(256)
x <- rnorm(n=1000, mean=4, sd=1)
The set.seed function ensures that the set of numbers represented by the object x can be replicated. I have set the mean argument at 4. Although I have specified here that the standard deviation (sd) is 1, this is the default setting of the argument and the resulting dataset would have been the same had I left this argument blank.
Next, I will create an object that defines epsilon for the first dataset.
epsilon1 <- rnorm(n=1000, mean=0, sd=.5)
Here, the default argument for mean is zero and specifying it was unnecessary; however, since the mean of epsilon has to be zero, it is good to be explicit. Per the assignment, I have set the standard deviation at 0.5.
epsilon2 <- rnorm(n=1000, mean=0, sd=1)
epsilon3 <- rnorm(n=1000, mean=0, sd=3)
epsilon4 <- rnorm(n=1000, mean=0, sd=5)
epsilon5 <- rnorm(n=1000, mean=0, sd=10)
Above are the other four objects I have created for the various values of epsilon according to the homework instructions.
To define y I will plug my objects into the linear model.
y1 <- 3-4*x+epsilon1
This line of code creates the object y as defined by the x variable as well as the parameters of the intercept(a=3), slope (b=-4) and epsilon, in this case the epsilon with a standard deviation of 0.5
Now I can plot y1 in relation to x
plot(y1~x)
model1 is the name of the model I created with the glm function
model1 <- glm(y1~x, family="gaussian")
summary(model1)
##
## Call:
## glm(formula = y1 ~ x, family = "gaussian")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.65383 -0.33992 0.00732 0.33665 1.91104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.02493 0.06317 47.88 <2e-16 ***
## x -3.99918 0.01537 -260.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.2353299)
##
## Null deviance: 16172.17 on 999 degrees of freedom
## Residual deviance: 234.86 on 998 degrees of freedom
## AIC: 1395.1
##
## Number of Fisher Scoring iterations: 2
The summary of the model returns an intercept according to the x variable of -3.99. The formula \(1/(1+exp(3.99))\) returns a value of 0.02.
Defining the y variables and creating the GLMs
y2 <- 3-4*x+epsilon2
model2 <- glm(y2~x)
summary(model2)
##
## Call:
## glm(formula = y2 ~ x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.90677 -0.67788 0.00328 0.69842 3.04093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.88977 0.13335 21.67 <2e-16 ***
## x -3.96538 0.03244 -122.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.048697)
##
## Null deviance: 16715.6 on 999 degrees of freedom
## Residual deviance: 1046.6 on 998 degrees of freedom
## AIC: 2889.4
##
## Number of Fisher Scoring iterations: 2
y3 <- 3-4*x+epsilon3
model3 <- glm(y3~x)
summary(model3)
##
## Call:
## glm(formula = y3 ~ x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.9134 -2.1156 0.1477 2.0654 8.8401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.83287 0.40311 7.028 3.89e-12 ***
## x -3.92224 0.09806 -39.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 9.582396)
##
## Null deviance: 24893.2 on 999 degrees of freedom
## Residual deviance: 9563.2 on 998 degrees of freedom
## AIC: 5101.8
##
## Number of Fisher Scoring iterations: 2
y4 <- 3-4*x+epsilon4
model4 <- glm(y4~x)
summary(model4)
##
## Call:
## glm(formula = y4 ~ x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.7572 -3.3035 0.0438 3.1111 16.2158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5337 0.6377 2.405 0.0164 *
## x -3.5813 0.1551 -23.084 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 23.98467)
##
## Null deviance: 36717 on 999 degrees of freedom
## Residual deviance: 23937 on 998 degrees of freedom
## AIC: 6019.3
##
## Number of Fisher Scoring iterations: 2
y5 <- 3-4*x+epsilon5
model5 <- glm(y5~x)
summary(model5)
##
## Call:
## glm(formula = y5 ~ x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -37.059 -6.717 -0.142 6.753 43.695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1876 1.2756 2.499 0.0126 *
## x -4.0498 0.3103 -13.051 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 95.95287)
##
## Null deviance: 112104 on 999 degrees of freedom
## Residual deviance: 95761 on 998 degrees of freedom
## AIC: 7405.7
##
## Number of Fisher Scoring iterations: 2
This is where I lose things. used the logLik function and found that model3 had the lowest value for the intercept in the x row. I used the y3 variable to create a dataset to run through the code we covered in Lab 4 last week
dat<-data.frame(x,y3)
I took values from the dataset I created and plugged them into the chunk below. You can see that I changed the intercept to 3 and the slope to -4 as defined by the linear model in the homework assignment. We create an object called epsilon defined by the slope and intercept. We create a density and then log-transform it.
a <- 3
b <- -4
sigma <- 0.5
parameters <- c(a, b, sigma)
#now proceed with the calculation
epsilon <- y3-(parameters[1]+parameters[2]*x)#epsilon is a vector
myDensityValues <- dnorm(x=epsilon, mean = 0, sd = parameters[3])
myLogDensityValues <- dnorm(x=epsilon, mean = 0, sd = parameters[3], log=T)# we can work
#with these numbers, can't we?
myLogLValue <- sum(myLogDensityValues)#Here you go.
This code is the function we created in lab 4 using the data I produced and the parameters defined by the given linear model.
myLogLFunction <- function(parameters, data){ #this line says that the custom function
#will have too arguments named data and par
x <- dat$x
y <- dat$y
epsilon <- y-(parameters[1]+parameters[2]*x)
myLogDensityValues <- dnorm(x=epsilon, mean = 0, sd = parameters[3], log=T)
myLogLValue <- sum(myLogDensityValues)
return(myLogLValue)
}
Now we create an object that uses an optimization function optim. When you run the function, it returns an intercept of 2.8, a slope of -3.9 and a standard deviation of 3.1, which is the standard deviation in my data set from the epsilon value I used when defining y3.
result <- optim(fn=myLogLFunction, par=c(1,1,1), data=dat, control=list(fnscale=-1))
result
## $par
## [1] 2.843836 -3.924283 3.093186
##
## $value
## [1] -2547.902
##
## $counts
## function gradient
## 132 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
This chunk of code creates a a panel of two frames that plots the data y3~x with a line based on the slope and intercept of the model, as defined by our parameters.
par(mfrow=c(1,2))# one-by-two multipanel figure
plot(y3~x, las=1, bty="l", cex=.5, col="black", main="Scatter plot and model predictions")
abline(result$par[1], result$par[2], lwd=2, lty=2, col="red")##pretty good, huh?
##now let's look at the residuals
epsilon <- y3-(result$par[2]*x+result$par[1])
hist(epsilon, nclass=30, freq=F)
lines(seq(-5,5,0.01), dnorm(seq(-5,5,0.01), mean=0, sd=result$par[3]), col="red", lty=2
, lwd=2)
As I mentioned before, I used the logLik function to look at the log transformed likelihood of each of the models. Below are the plots of all five predicted models superimposed on top of the dataset.
logLik(model3)
## 'log Lik.' -2547.901 (df=3)
AIC(model3)##compare with previous result...
## [1] 5101.803
plot(y3~x, las=1, bty="l", cex=.5, col="black", main="Scatter plot and model predictions")
points(dat$x, predict(model3), cex=.1, col="red")
points(dat$x, predict(model1), cex=.1, col="blue")
points(dat$x, predict(model2), cex=.1, col="green")
points(dat$x, predict(model4), cex=.1, col="brown")
points(dat$x, predict(model5), cex=.1, col="yellow")
Here, I create an object for slope values from -10 to 10 and create a function to show the maximum likelihood for the slope of the data. The apex of the curve is around -4, which matches with our linear model.
bVals <- seq(-10,10,0.01)
logLik <- numeric(0)# initialization: numerical vector of length=0
for (i in bVals){
logLik <- c(logLik, myLogLFunction(parameters=c(result$par[1], i, result$par[3])
, data=dat))
}
plot(bVals, logLik) # the value of b that maximizes logLik is around 3. Makes sense?
This is where I get stuck. I have created the various values for epsilon, programming in the various standard deviations indicated in the assignment. However, I am not sure where to program the other models into the function. It looks like I have to go all the way back to the code that create the object myLogFunction, but I get lost in the code.