From book, Exercise 10.1: These data are the average March temperatures (Fahrenheit) in Kansas Cityfrom 1961 to 1990. Estimate the probability denisty function using the histogram and the kernel density estimate.
#Make sure to pick a good bin width for the histogram.
#Make sure to pick a good bandwith for the kernel smoother.
kcTemp <- c(43.8,40.1,49.2,41.8,34.0,49.1,47.8,48.1,37.6,42.0,43.7,47.1,47.7,46.9,36.5,45.0,48.0,37.6,42.2,38.7,45.2,42.5,43.1,36.0,47.4,48.5,47.1,43.2,43.8,45.7)
set.seed(1234)
n<-100
x = rnorm(n)
S<-sd(kcTemp)
n<-length(kcTemp)
hist(kcTemp,main="Delta=1.5",freq=FALSE)
dens<-density(kcTemp,bw=1.5)
points(dens$x,dens$y,type="l",col="red",lwd=3)
Download this data: http://bit.ly/HW6stateData (For a description of the data use: help(state.x77)). We are interested in the relationship between HS graduation rate and Income.
df.2 = read.csv("https://www.dropbox.com/s/s4cp6q7xpdli7na/HW6stateData.csv?dl=1")
lm = lm(df.2$Income~df.2$HS.Grad)
plot(df.2$HS.Grad,df.2$Income)
par(new=T)
abline(lm)
df.2$HS.Grad2 = df.2$HS.Grad^2
lm2 = lm(df.2$Income~HS.Grad + HS.Grad2, data=df.2)
x <- seq(30, 70, 0.1)
p = predict(lm2, data.frame(HS.Grad = x, HS.Grad2 = x^2))
plot(df.2$HS.Grad,df.2$Income)
par(new=T)
abline(lm, col = "red", lwd = 3, lty = 4)
lines(x, p, col = "dark green", lwd = 3, lty = 1)
legend("topleft", lwd = 3, lty = c(4, 1), col = c("red", "dark green"), legend = c("Linear", "Quadratic"))
lw<-loess(Income~HS.Grad,span=1,degree=2,data=df.2)
p2 = predict(lw, data.frame(HS.Grad = x))
plot(df.2$HS.Grad,df.2$Income)
par(new=T)
abline(lm, col = "red", lwd = 3, lty = 4)
lines(x, p, col = "dark green", lwd = 3, lty = 3)
lines(x, p2, col = "blue", lwd = 3, lty = 1)
legend("topleft", lwd = 3, lty = c(4, 3, 1), col = c("red", "dark green", "blue"), legend = c("Linear", "Quadratic", "Loess"))
##Comments on the similarities and differences between these three different fits. All three show that Income increases as Graduation Rate increases. Linear does not appear to be appropriate as the data between 40-50 has a concave down shape.
The loess curve shows this leveling out and also shows an increase in income between 60-65 which the quadratic curve misses.
Consider the data generated based on the code below. Find a range of spans (the smoothing parameter) that produces reasonable (i.e. the fit is not too jagged nor too smooth) loess fits to this data.
set.seed(1234)
x <- runif(100,0,10)
y <- sin(x)+rnorm(100,0,.3)
lw1<-loess(y~x,span=.65,degree=2)
plot(x,y)
par(new=T)
j = order(x)
points(x[j],lw1$fitted[j],col="red",lwd=3,type="l")
A span in the range of .2 to .65 provides a reasonable fit for these data.
(GRAD ONLY) Repeat the previous problem with values of σ equal to 0.1, 0.5, 1, and 1.5. Discuss the relationship between the random variablity in the data and the range of reasonable spans.
#sigma is the amount of noise added.
set.seed(1234)
sigma<-.1
x <- runif(100,0,10)
y <- sin(x)+rnorm(100,0,sigma)
lw2<-loess(y~x,span=.45,degree=2)
plot(x,y)
par(new=T)
j = order(x)
points(x[j],lw2$fitted[j],col="red",lwd=3,type="l")
A span from .15 to .45 is appropriate
set.seed(1234)
sigma<-.5
x <- runif(100,0,10)
y <- sin(x)+rnorm(100,0,sigma)
lw2<-loess(y~x,span=.65,degree=2)
plot(x,y)
par(new=T)
j = order(x)
points(x[j],lw2$fitted[j],col="red",lwd=3,type="l")
A span from .2 to .55 is appropriate
set.seed(1234)
sigma<-1
x <- runif(100,0,10)
y <- sin(x)+rnorm(100,0,sigma)
lw2<-loess(y~x,span=.7,degree=2)
plot(x,y)
par(new=T)
j = order(x)
points(x[j],lw2$fitted[j],col="red",lwd=3,type="l")
A span from .4 to .7 is appropriate
set.seed(1234)
sigma<-1.5
x <- runif(100,0,10)
y <- sin(x)+rnorm(100,0,sigma)
lw2<-loess(y~x,span=.9,degree=2)
plot(x,y)
par(new=T)
j = order(x)
points(x[j],lw2$fitted[j],col="red",lwd=3,type="l")
A span from .5 to .9 is appropriate.
It seems like the more noise introduced, the higher the span we need to smooth the curve.