For homework 1 you will create your first knitr report and perform bunch of exploratory data analysis. This will be the first of many reports you will generate throughtout the year. Many questions will be open ended but you will be evaluated on the
There are couple rules that you must follow
Make sure the code you submit runs on a clean R environment. If a grader runs your code and get an error, you get 50% off for that homework after fixing the mistake.
You are welcome to help each other but write the code on your own since we will test you on your coding ability here and there. These spot tests are to get you prepared for your future interviews.
About this document In this discussion, you will learn how to use R Markdown withknitr and create a nice report.
The base structure for a markdown file contains both R code and Knitr code. Any code between the sets of three backticks is R code and any code outside is Knitr.
Chunk options are written in one line. The followings are some popular chunk options.
echo=TRUE - display the command ; error=FALSE - do not display any error messageseval=TRUE - run the code ; message =FALSE - do not display any messagesinclude=TRUE - show the output ; warning =FALSE - do not display any warning messages
The code to read in the data is already written for you. Your job is to look at the data and think about the following
# read data
leinhardt<-read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Leinhardt.txt")
# boxplot
ggplot(leinhardt) + aes(x=region, y=infant/1000, color=oil)+ geom_boxplot(width = .7)+
facet_wrap(~oil, ncol = 2)+ stat_summary(fun.y=mean, colour="darkred", geom="point", shape=15, size=2)+
ylab("infant rate") + ggtitle("Relationship between region and infant rate")
Seems like linear relationship only exists in Europe
# scatterplot
ggplot(leinhardt) + aes(x=income, y=infant, color=region) + geom_point()+ geom_smooth(method = "lm")+
facet_grid(. ~region)
Any luck if we take a log transformation for the data?
ggplot(leinhardt) + aes(x=log(income), y=log(infant), color=region)+ geom_point() +
geom_smooth(method = "lm")+ facet_grid(. ~region)
You can fit a linear model for data only in Europe
logout = lm(infant ~ income, data = leinhardt[which(leinhardt$region=="Europe"),])
kable(summary(logout)$coef, digits=2)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 34.65 | 4.37 | 7.93 | 0 |
| income | -0.01 | 0.00 | -3.88 | 0 |
How about the model for the whole data?
logout2 = lm(log(infant) ~ log(income)*region, data = leinhardt)
kable(summary(logout2)$coef, digits=c(2, 2, 3, 3))
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 4.94 | 0.64 | 7.763 | 0.000 |
| log(income) | -0.01 | 0.12 | -0.091 | 0.928 |
| regionAmericas | 1.57 | 1.19 | 1.321 | 0.190 |
| regionAsia | 1.26 | 0.86 | 1.476 | 0.143 |
| regionEurope | 2.09 | 1.84 | 1.134 | 0.260 |
| log(income):regionAmericas | -0.40 | 0.20 | -2.010 | 0.047 |
| log(income):regionAsia | -0.38 | 0.16 | -2.404 | 0.018 |
| log(income):regionEurope | -0.52 | 0.25 | -2.069 | 0.041 |
Draw a scatterplot of acres versus consumers. What kind of relationship can you see? linear or nonlinear?
# read data
sahlins<-read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Sahlins.txt")
# use geom_smooth to help see patterns
ggplot(sahlins) + aes(x=consumers, y=acres)+ geom_point()+ geom_smooth()
The relationship seems to “locally polynomial”, so we use LOESS regression which stands for locally weighted scatterplot smoothing.
regout2 = loess(acres ~ consumers, data =sahlins, span=.5)
summary(regout2)
## Call:
## loess(formula = acres ~ consumers, data = sahlins, span = 0.5)
##
## Number of Observations: 20
## Equivalent Number of Parameters: 7.23
## Residual Standard Error: 0.4682
## Trace of smoother matrix: 8 (exact)
##
## Control settings:
## span : 0.5
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
Let’s plot fitted values. A nonlinear line looks good in general.
plot(acres ~ consumers,data =sahlins)
i = order(sahlins$consumers)
lines(sahlins$consumers[i],regout2$fitted[i],col="red")