Introduction

For homework 1 you will create your first knitr report and perform bunch of exploratory data analysis. This will be the first of many reports you will generate throughtout the year. Many questions will be open ended but you will be evaluated on the

Accuracy and creativity of your answer.
Clenlyness of your report.
How legible your code is written.

There are couple rules that you must follow

Put comments on your code where ever necessary.
The code should run from top to bottom.
Name the variables well so that what you are trying to do is obvious.
Test your script when you are done.

Make sure the code you submit runs on a clean R environment. If a grader runs your code and get an error, you get 50% off for that homework after fixing the mistake.

You are welcome to help each other but write the code on your own since we will test you on your coding ability here and there. These spot tests are to get you prepared for your future interviews.

About this document

In this discussion, you will learn how to use R Markdown with knitr and create a nice report.

The base structure for a markdown file contains both R code and Knitr code. Any code between the sets of three backticks is R code and any code outside is Knitr.

Chunk options are written in one line. The followings are some popular chunk options.

echo=TRUE - display the command ; error=FALSE - do not display any error messages
eval=TRUE - run the code ; message =FALSE - do not display any messages
include=TRUE - show the output ; warning =FALSE - do not display any warning messages

Exploratory Data Analysis

You can get the example data from this link

The code to read in the data is already written for you. Your job is to look at the data and think about the following

What are some hypothesis that can be addressed from the data?
Does regression seem like a good analysis strategy?
Is there some transformation that you can apply so that you can use regression?

Example: United Nations Social Indicators Data

Variables:

region: Africa, America, Asia, Europe, Oceania
educationMale: Average number of years of education for men
lifeMale: Expectation of life at birth for men
lifeFemale: Expectation of life at birth for women
infantMortality: infant deaths per 1000 live births.

Making plots

For a quick plot, you can use qplot() which wraps up the details of ggplot. The syntax looks like this

qplot(x, y, data=, color=, shape=, size=, geom=, method=, ...)

The ggplot2 package provides a more flexible syntax for creating elegant and complex plots.

ggplot(data, mapping) + layer( stat = "", geom = "", position = "", geom_parms = list(),...)

Now let’s use the example data to do some data visualization.

Scatterplot

# read data
unitednations<-read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/UnitedNations.txt")

# scatterplot using qplot()
qplot(lifeFemale, lifeMale, data = unitednations, colour = region, main = "scatterplot using qplot()" )

Add a smoother to show a trend

qplot(lifeFemale, lifeMale, data = unitednations, colour = region, geom = c("point", "smooth"))

Add a line specified by slope and intercept by using geom_ablin in ggplot()

# scatterplot using ggplot()
unitednations_gp <- ggplot(unitednations)

unitednations_gp + aes(x=lifeFemale, y=lifeMale, color=region)+ geom_point()+ 
                   geom_abline(intercept=0,slope=1) + xlab("Expectation of life at birth for women") + 
                   ylab("Expectation of life at birth for men") +
                   ggtitle("scatterplot using ggplot()")

Split up your data by one or more variables - lay out panels in a grid using facet_grid

e.g. facet panels by region (a categorical variable with 5 levels) and show regions on the top

unitednations_gp +aes(x=lifeFemale, y=lifeMale, color=region)+geom_point()+ 
                  stat_smooth(method=lm)+
                  facet_grid( .~region, scales="fixed", labeller= label_both)

Histogram and density plot

# histogram
unitednations_gp + aes(x=lifeFemale, fill=region)+ geom_histogram(binwidth =1)

To compare distributions of “lifeFemale” by regions, we can use a density plot

# density plot
unitednations_gp + aes(x=lifeFemale,fill = region, colour = region)+ geom_density(alpha = .1)

If you want to display density plots by region, facet_wrap can wrap with a certain number of columns or rows.

unitednations_gp + aes(x=lifeMale,fill = region, colour = region)+ geom_density(alpha = .1)+ 
                   facet_wrap(~region, nrow = 2)

Fitting a model

For example, we fit a linear regression model with two predictors (“educationMale”,“region”) and a response variable “lifeMale”.

regout = lm(lifeMale ~ factor(region) + educationMale, data =unitednations)

The kable function is sufficient to make a table for the model output, although it doesn’t have many options.

kable(summary(regout)$coef, digits=2)

	Estimate	Std. Error	t value
(Intercept)	40.13	2.39	16.81
factor(region)America	12.55	1.92	6.52
factor(region)Asia	12.30	1.78	6.93
factor(region)Europe	10.53	2.09	5.04
factor(region)Oceania	10.21	3.43	2.98
educationMale	1.52	0.25	6.08

Example: Leinhardt and Wasserman’s Data on Infant Mortality

Variables:

income: Per-capita income in U.S. dollars.
infant: Infant-mortality rate per 1000 live births.
region: Africa; Americas; Asia, Asia and Oceania; Europe.
oil: Oil-exporting country: no, yes.

Plots

Boxplot is good for measuring dispersion and telling outlier points.

# read data
leinhardt<-read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Leinhardt.txt")

# boxplot
ggplot(leinhardt) + aes(x=region, y=infant/1000, color=oil)+ geom_boxplot(width = .7)+
                    facet_wrap(~oil, ncol = 2)+ stat_summary(fun.y=mean, colour="darkred", geom="point", shape=15, size=2)+
                    ylab("infant rate") + ggtitle("Relationship between region and infant rate")

Seems like linear relationship only exists in Europe

# scatterplot
ggplot(leinhardt) + aes(x=income, y=infant, color=region) + geom_point()+ geom_smooth(method = "lm")+
                   facet_grid(. ~region)

Any luck if we take a log transformation for the data?

ggplot(leinhardt) + aes(x=log(income), y=log(infant), color=region)+ geom_point() +
                    geom_smooth(method = "lm")+ facet_grid(. ~region)

Model

You can fit a linear model for data only in Europe

logout = lm(infant ~ income, data = leinhardt[which(leinhardt$region=="Europe"),]) 
kable(summary(logout)$coef, digits=2)

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	34.65	4.37	7.93	0
income	-0.01	0.00	-3.88	0

How about the model for the whole data?

logout2 = lm(log(infant) ~ log(income)*region, data = leinhardt)  
kable(summary(logout2)$coef, digits=c(2, 2, 3, 3))

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	4.94	0.64	7.763	0.000
log(income)	-0.01	0.12	-0.091	0.928
regionAmericas	1.57	1.19	1.321	0.190
regionAsia	1.26	0.86	1.476	0.143
regionEurope	2.09	1.84	1.134	0.260
log(income):regionAmericas	-0.40	0.20	-2.010	0.047
log(income):regionAsia	-0.38	0.16	-2.404	0.018
log(income):regionEurope	-0.52	0.25	-2.069	0.041

Example: Sahlins’s Data on Agricultural Production in Mazulu Village

Variables:

consumers: weighted number of consumers per gardener in each household
acres: number of acres per gardener tended in each household

Notes: The observations are 20 households in an African village.

Plot

Draw a scatterplot of acres versus consumers. What kind of relationship can you see? linear or nonlinear?

# read data
sahlins<-read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Sahlins.txt")

# use geom_smooth to help see patterns 
ggplot(sahlins) + aes(x=consumers, y=acres)+ geom_point()+ geom_smooth()

Model

The relationship seems to “locally polynomial”, so we use LOESS regression which stands for locally weighted scatterplot smoothing.

regout2 = loess(acres ~ consumers, data =sahlins, span=.5)
summary(regout2)

## Call:
## loess(formula = acres ~ consumers, data = sahlins, span = 0.5)
## 
## Number of Observations: 20 
## Equivalent Number of Parameters: 7.23 
## Residual Standard Error: 0.4682 
## Trace of smoother matrix: 8  (exact)
## 
## Control settings:
##   span     :  0.5 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

Let’s plot fitted values. A nonlinear line looks good in general.

plot(acres ~ consumers,data =sahlins)
i = order(sahlins$consumers)
lines(sahlins$consumers[i],regout2$fitted[i],col="red")

MA684 Discussion

Yitong

September 06, 2016

Introduction

Exploratory Data Analysis

Example: Leinhardt and Wasserman’s Data on Infant Mortality

Variables:

Plots

Model

Example: Sahlins’s Data on Agricultural Production in Mazulu Village

Variables:

Plot

Model