This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
Here,we use the size of the data from AER package that is CPS1988 data,it may be worthwhile to model the role of this variable using more flexible tools. In R, the partially linear model eqtn.
log(wage) = beta1 + g(experience) + beta2 education + beta3 ethnicity+error term
Here, g is an unknwon function to be estimated from the data.In R, splines are available in the package splines (part of the base R distribution and automatically loaded with AER).
B splines are computationally convenient, and the function providing them is bs(). It can directly be used in lm(), and so fitting the desired model is as easy as
R> data(“CPS1988”, package=“AER”)
R> cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)
R> summary(CPS1988)
R> cps_plm <- lm(log(wage) ~ bs(experience, df = 5) + education + ethnicity, data = CPS1988)
There are a couple of possibilities for using bs() one can either specify the degree of the piecewise polynomial (defaulting to 3) and the knots by hand or supply the parameter df, which selects the remaining ones.The expression bs(experience, df = 5) thus internally generates regressors,namely piecewise cubic polynomials evaluated at the observations pertaining to experience, with the implied 5???3 = 2 interior knots evenly spaced and therefore located at the 33.33% and 66.67% quantiles of experience.The choice of df = 5 was made based on model selection in terms of the Schwarz criterion (BIC). The following code considers B splines with df ranging from 3 through 10, suggesting that df = 5 is appropriate for the data at hand:
##
## Call:
## lm(formula = log(wage) ~ experience + I(experience^2) + education +
## ethnicity, data = CPS1988)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9428 -0.3162 0.0580 0.3756 4.3830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.321e+00 1.917e-02 225.38 <2e-16 ***
## experience 7.747e-02 8.800e-04 88.03 <2e-16 ***
## I(experience^2) -1.316e-03 1.899e-05 -69.31 <2e-16 ***
## education 8.567e-02 1.272e-03 67.34 <2e-16 ***
## ethnicityafam -2.434e-01 1.292e-02 -18.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5839 on 28150 degrees of freedom
## Multiple R-squared: 0.3347, Adjusted R-squared: 0.3346
## F-statistic: 3541 on 4 and 28150 DF, p-value: < 2.2e-16
R> cps_bs <- lapply(3:10, function(i) lm(log(wage) ~bs(experience, df = i) + education + ethnicity,data = CPS1988))
R> structure(sapply(cps_bs, AIC, k = log(nrow(CPS1988))),.Names = 3:10)
## 3 4 5 6 7 8 9 10
## 49205.17 48835.97 48794.23 48794.83 48801.41 48797.30 48799.28 48802.49
First, a list cps_bs of fitted linear models is constructed via lapply(), to which extractor functions can be easily applied as in sapply(cps_bs, AIC).The call above utilizing structure() is slightly more complicated because it additionally sets log(n) as the penalty parameter (BIC instead of the default AIC) and assigns intuitive names (degrees of freedom) to the resulting vector.The cubic spline from the selected model is best compared with the classical fit from cps_lm by means of a graphical display. The following code plots log-wage as a function of experience (for Caucasian workers with average years of education) for the classical model with a quadratic term in experience and for the partially linear model.
R> cps <- data.frame(experience = -2:60, education=with(CPS1988,mean(education[ethnicity == “cauc”])),ethnicity =“cauc”)
R> cps\(yhat1 <- predict(cps_lm, newdata = cps) R> cps\)yhat2 <- predict(cps_plm, newdata = cps)
You can also embed plots, for example:
R> plot(log(wage) ~ jitter(experience, factor = 3), pch = 19,col = rgb(0.5, 0.5, 0.5, alpha = 0.02), data = CPS1988)
R> lines(yhat1 ~ experience, data = cps, lty = 2)
R> lines(yhat2 ~ experience, data = cps)
R> legend(“topleft”, c(“quadratic”, “spline”), lty = c(2,1),bty = “n”)
Figure indicates that both models are not too distinct for the 20-40 years of experience range. Overall, the spline version exhibits less curvature beyond eight years of experience. However, the most remarkable feature of the plot appears to be the more pronounced curvature below seven years of experience in the spline fit, which is largely missed by the traditional approach. An alternative approach to partially linear models is to use kernel methods,this is implemented in the package np (Hayfield and Racine 2008). Some further remarks on the plot are appropriate. The large number of observations and numerous ties in experience provide a challenge in that the standard scatterplot will result in overplotting. Here, we circumvent the problem by (1) adding some amount of “jitter” (i.e., noise) to the regressor experience and (2) setting the color to “semitransparent” gray. This results in darker shades of gray for areas with more data points, thus conveying a sense of density. This technique is also called “alpha blending” and requires that, in addition to the color itself, a value of alpha-ranging between 0 (fully transparent) and 1 (opaque)-be specified. Various color functions in R provide an argument alpha; e.g., the basic rgb() function implementing the** RGB** (red, green, blue) color model. Selecting equal intensities for all three color values in rgb() yields a shade of gray (which would be more conveniently available in gray(), but this does not allow for alpha blending). Note that alpha transparency is not available for all plotting devices in R. Among others, it is available for windows() (typically used on Microsoft Windows), quartz() (typically used on Mac OS X), and pdf(), provided that the argument version is set to version = “1.4” or greater (on all platforms).See ?rgb for further details. A somewhat simpler but less appealing solution available on all devices is to employ the default color (i.e., black) and a tiny plotting character such as pch = “.”.
this topic has been taken from Applied econometrics using R chapter 3 (Linear Regression)
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.