Introduction


For homework 1 you will create your first knitr report and perform bunch of exploratory data analysis. This will be the first of many reports you will generate throughtout the year. Many questions will be open ended but you will be evaluated on the

  1. Accuracy and creativity of your answer.
  2. Clenlyness of your report.
  3. How legible your code is written.

There are couple rules that you must follow

  1. Put comments on your code where ever necessary.
  2. The code should run from top to bottom.
  3. Name the variables well so that what you are trying to do is obvious.
  4. Test your script when you are done.

Make sure the code you submit runs on a clean R environment. If a grader runs your code and get an error, you get 50% off for that homework after fixing the mistake.

You are welcome to help each other but write the code on your own since we will test you on your coding ability here and there. These spot tests are to get you prepared for your future interviews.

About this document

In this discussion, you will learn how to use R Markdown with knitr and create a nice report.

The base structure for a markdown file contains both R code and Knitr code. Any code between the sets of three backticks is R code and any code outside is Knitr.

Chunk options are written in one line. The followings are some popular chunk options.


Exploratory Data Analysis


You can get the example data from this link

The code to read in the data is already written for you. Your job is to look at the data and think about the following

  1. What are some hypothesis that can be addressed from the data?
  2. Does regression seem like a good analysis strategy?
  3. Is there some transformation that you can apply so that you can use regression?


Example: United Nations Social Indicators Data


Variables:

  • region: Africa, America, Asia, Europe, Oceania
  • educationMale: Average number of years of education for men
  • lifeMale: Expectation of life at birth for men
  • lifeFemale: Expectation of life at birth for women
  • infantMortality: infant deaths per 1000 live births.

Making plots

For a quick plot, you can use qplot() which wraps up the details of ggplot. The syntax looks like this

qplot(x, y, data=, color=, shape=, size=, geom=, method=, ...)

The ggplot2 package provides a more flexible syntax for creating elegant and complex plots.

ggplot(data, mapping) + layer( stat = "", geom = "", position = "", geom_parms = list(),...)

Now let’s use the example data to do some data visualization.


Scatterplot

# read data
unitednations<-read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/UnitedNations.txt")

# scatterplot using qplot()
qplot(lifeFemale, lifeMale, data = unitednations, colour = region, main = "scatterplot using qplot()" )

Add a smoother to show a trend

qplot(lifeFemale, lifeMale, data = unitednations, colour = region, geom = c("point", "smooth"))   

Add a line specified by slope and intercept by using geom_ablin in ggplot()

# scatterplot using ggplot()
unitednations_gp <- ggplot(unitednations)

unitednations_gp + aes(x=lifeFemale, y=lifeMale, color=region)+ geom_point()+ 
                   geom_abline(intercept=0,slope=1) + xlab("Expectation of life at birth for women") + 
                   ylab("Expectation of life at birth for men") +
                   ggtitle("scatterplot using ggplot()")

Split up your data by one or more variables - lay out panels in a grid using facet_grid

e.g. facet panels by region (a categorical variable with 5 levels) and show regions on the top

unitednations_gp +aes(x=lifeFemale, y=lifeMale, color=region)+geom_point()+ 
                  stat_smooth(method=lm)+
                  facet_grid( .~region, scales="fixed", labeller= label_both)


Histogram and density plot

# histogram
unitednations_gp + aes(x=lifeFemale, fill=region)+ geom_histogram(binwidth =1)

To compare distributions of “lifeFemale” by regions, we can use a density plot

# density plot
unitednations_gp + aes(x=lifeFemale,fill = region, colour = region)+ geom_density(alpha = .1)

If you want to display density plots by region, facet_wrap can wrap with a certain number of columns or rows.

unitednations_gp + aes(x=lifeMale,fill = region, colour = region)+ geom_density(alpha = .1)+ 
                   facet_wrap(~region, nrow = 2)


Fitting a model

For example, we fit a linear regression model with two predictors (“educationMale”,“region”) and a response variable “lifeMale”.

regout = lm(lifeMale ~ factor(region) + educationMale, data =unitednations)

The kable function is sufficient to make a table for the model output, although it doesn’t have many options.

kable(summary(regout)$coef, digits=2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.13 2.39 16.81 0
factor(region)America 12.55 1.92 6.52 0
factor(region)Asia 12.30 1.78 6.93 0
factor(region)Europe 10.53 2.09 5.04 0
factor(region)Oceania 10.21 3.43 2.98 0
educationMale 1.52 0.25 6.08 0


Example: Leinhardt and Wasserman’s Data on Infant Mortality


Variables:

  • income: Per-capita income in U.S. dollars.
  • infant: Infant-mortality rate per 1000 live births.
  • region: Africa; Americas; Asia, Asia and Oceania; Europe.
  • oil: Oil-exporting country: no, yes.

Plots

Boxplot is good for measuring dispersion and telling outlier points.

# read data
leinhardt<-read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Leinhardt.txt")

# boxplot
ggplot(leinhardt) + aes(x=region, y=infant/1000, color=oil)+ geom_boxplot(width = .7)+
                    facet_wrap(~oil, ncol = 2)+ stat_summary(fun.y=mean, colour="darkred", geom="point", shape=15, size=2)+
                    ylab("infant rate") + ggtitle("Relationship between region and infant rate")

Seems like linear relationship only exists in Europe

# scatterplot
ggplot(leinhardt) + aes(x=income, y=infant, color=region) + geom_point()+ geom_smooth(method = "lm")+
                   facet_grid(. ~region)

Any luck if we take a log transformation for the data?

ggplot(leinhardt) + aes(x=log(income), y=log(infant), color=region)+ geom_point() +
                    geom_smooth(method = "lm")+ facet_grid(. ~region)

Model

You can fit a linear model for data only in Europe

logout = lm(infant ~ income, data = leinhardt[which(leinhardt$region=="Europe"),]) 
kable(summary(logout)$coef, digits=2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.65 4.37 7.93 0
income -0.01 0.00 -3.88 0

How about the model for the whole data?

logout2 = lm(log(infant) ~ log(income)*region, data = leinhardt)  
kable(summary(logout2)$coef, digits=c(2, 2, 3, 3))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.94 0.64 7.763 0.000
log(income) -0.01 0.12 -0.091 0.928
regionAmericas 1.57 1.19 1.321 0.190
regionAsia 1.26 0.86 1.476 0.143
regionEurope 2.09 1.84 1.134 0.260
log(income):regionAmericas -0.40 0.20 -2.010 0.047
log(income):regionAsia -0.38 0.16 -2.404 0.018
log(income):regionEurope -0.52 0.25 -2.069 0.041


Example: Sahlins’s Data on Agricultural Production in Mazulu Village


Variables:

  • consumers: weighted number of consumers per gardener in each household
  • acres: number of acres per gardener tended in each household
Notes: The observations are 20 households in an African village.

Plot

Draw a scatterplot of acres versus consumers. What kind of relationship can you see? linear or nonlinear?

# read data
sahlins<-read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Sahlins.txt")

# use geom_smooth to help see patterns 
ggplot(sahlins) + aes(x=consumers, y=acres)+ geom_point()+ geom_smooth()

Model

The relationship seems to “locally polynomial”, so we use LOESS regression which stands for locally weighted scatterplot smoothing.

regout2 = loess(acres ~ consumers, data =sahlins, span=.5)
summary(regout2)
## Call:
## loess(formula = acres ~ consumers, data = sahlins, span = 0.5)
## 
## Number of Observations: 20 
## Equivalent Number of Parameters: 7.23 
## Residual Standard Error: 0.4682 
## Trace of smoother matrix: 8  (exact)
## 
## Control settings:
##   span     :  0.5 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE


Let’s plot fitted values. A nonlinear line looks good in general.

plot(acres ~ consumers,data =sahlins)
i = order(sahlins$consumers)
lines(sahlins$consumers[i],regout2$fitted[i],col="red")