Sunday, October 04, 2015

First, some housekeeping

We're in alpha testing for this class, which is hard

We're all a bit uncomfortable. (I've got a bad habit of reinventing the wheel, and now we're heading into uncharted territory.)

This is good news! It means we're really learning.

Plan for the rest of the semester

We've been lax with scheduling in the interest of building up basic skills. Now it's time to practice those skills.

I've split up the 10,000 possible points that make up your final grade across the remaining 10 weeks of the semester. Each week will be a fresh start where you are given the chance to earn up to 1000 points by completing a combination of assignments.

Points to earn this week

Assignment      # of submissions   Pts/submission
Simple OLS      1-3                100
                4-5                 80
                6-10                55
                10+                 30
Concept Demo    1-3                100
                4-5                 80
                6-10                55
                10+                 30

The assignments

Simple OLS means submitting a simple linear regression plus appropriate background information. I have provided a template, with more details, for doing this work.

Concept demonstration means creating a short R Markdown document that illustrates some basic statistical/econometric concept using a verbal description, R code (consider using built-in data for simplicity), and graphs.
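
To make that concrete, here is a minimal sketch of the kind of thing a concept demo might contain. The topic (a sample mean settling down as the sample grows) and every number in it are just illustrative choices, not a required template:

# Hypothetical concept demo: the sample mean settles down as the sample grows
set.seed(1)
draws <- rnorm(1000, mean = 5, sd = 2)             # simulated data; a built-in dataset works just as well
running_mean <- cumsum(draws) / seq_along(draws)   # mean of the first 1, 2, 3, ... observations
plot(running_mean, type = "l",
     xlab = "Number of observations", ylab = "Running mean")
abline(h = 5, lty = 2)                             # the true mean the running mean drifts toward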

In the future…

Next week we will adapt the scoring to the tools we learn this week.

New concept: Linear Regression

So we've got some data, but what are we actually going to do with it?

It's too late to run an experiment (i.e. we're using found data rather than generating our own data), so…

Let's look for patterns

Ultimately we're going to ask questions like:

  • "Is X related to Y?"
  • "Does X cause Y?"
  • "How important is X in determining Y?"

So let's bring in the data

library(readr)
EFW <- read_csv("EFW-clean.csv") # Pull in data on Economic Freedom of the World
summary(EFW$EFW)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   5.820   6.620   6.479   7.340   9.170     543
sd(EFW$EFW) # returns NA because the EFW column has missing values (the 543 NA's in the summary above)
## [1] NA

Looking at the summary gives us some idea of what we're dealing with.
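
The sd() call above returns NA because the EFW column contains missing values. One fix is to tell sd() to drop them:

sd(EFW$EFW, na.rm = TRUE)  # standard deviation of the non-missing EFW values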

And a histogram

library(ggplot2)
qplot(EFW,data=EFW) # with a single continuous variable, qplot defaults to a histogram

And since I worked so hard on it, here's some colors

library(ggplot2)
qplot(EFW,fill=Continent,data=EFW)

Now another variable

One variable on its own may be interesting, but we're more interested in connections between variables.

ECI <- read_csv("http://atlas.cid.harvard.edu/rankings/country/download/")
colnames(ECI)[2] <- "ISO_Code" ; colnames(ECI)[6] <- "Year" # rename key columns so they line up with the EFW data
colnames(ECI)[4] <- "ECI"
data <- merge(ECI,EFW) # merge on the columns the two data frames share
summary(data$ECI)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.520000 -0.790600 -0.132000  0.003459  0.674300  3.048000
sd(data$ECI)
## [1] 1.054263
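
A note on the merge: by default, merge() joins the two data frames on every column name they have in common, which is why the ECI columns were renamed first. Assuming the shared key columns are ISO_Code and Year, an equivalent, more explicit call would be:

data <- merge(ECI, EFW, by = c("ISO_Code", "Year"))  # spell out the join keys (assumed to be the shared columns)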

And a histogram

qplot(ECI,data=data)

… plus some color

qplot(ECI,fill=Continent,data=data)

Both together

qplot(EFW,ECI,data=data)

What we see

It looks like higher levels of economic freedom (EFW) are associated with higher levels of economic complexity (ECI).

But how do we quantify that relationship?

The (apparent) relationship between the two variables.

plot <- ggplot(aes(EFW,ECI),data=data)
plot + geom_point() + geom_smooth(method="lm") # plot the points, then overlay a fitted straight line

Line of best fit

This looks good, but why this particular line rather than one that's slightly different?

Ordinary Least Squares

Mathematically, we're estimating the following equation:

\[ Y_i = \beta_0 + \beta_1 X_{i,1} + \varepsilon_i \]

which is another way of writing \(y = mx+b\), where \(\beta_0\) is \(b\) and \(\beta_1\) is \(m\).

A deterministic relationship plus randomness

\(\varepsilon\) is the error term. It essentially means, "give or take".

As in, "we expect that a country with EFW score of 6.82 to get an ECI score of around 0.05""

A deterministic relationship plus randomness

\[ Y_i = \beta_0 + \beta_1 X_{i,1} + \varepsilon_i \]

means we're looking for a relationship where \(Y_i\) has some baseline level (i.e. \(Y_i=\beta_0\) when \(X_{i,1}\) has a value of 0), plus some direct effect of \(X_1\), plus some random component.
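
One way to see this decomposition is to simulate data that follows it exactly. In the sketch below the baseline, slope, and noise level are all made-up values chosen only for illustration:

library(ggplot2)
# Simulate Y = beta0 + beta1*X + error with made-up values beta0 = 1 and beta1 = 0.5
set.seed(42)
x <- runif(200, min = 0, max = 10)
error <- rnorm(200, mean = 0, sd = 1)       # the random component
y <- 1 + 0.5 * x + error                    # baseline + direct effect of x + randomness
qplot(x, y) + geom_smooth(method = "lm")    # the points scatter around the underlying line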

Term by term

\[ Y_i = \beta_0 + \beta_1 X_{i,1} + \varepsilon_i \]

  • \(Y_i\) is our dependent variable.
    • Mathematically there's nothing stopping us from getting this backwards.
    • We need economic theory/common sense to keep us from asking "How does a child's height determine the amount of milk she drank?"

Term by term

\[ Y_i = \beta_0 + \beta_1 X_{i,1} + \varepsilon_i \]

  • \(\beta_0\) is the value \(Y_i\) would take if \(X_{i,1}\) took a value of 0.
    • Often this won't be interesting in itself. e.g. if Y=height, and X=weight, \(\beta_0\) is the height of someone who weighs 0 pounds.
    • This term is called the "intercept" (as in y-intercept… remember, in \(y=mx+b\), this is \(b\))
    • Sometimes it's called the "constant"
    • If we didn't have this intercept, we would be assuming a y-intercept of 0. This might make sense, but it would alter our slope coefficient (see the sketch after this list).
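
Here is a small sketch of that last point, using the merged data from above. The "- 1" in the formula is base R's way of dropping the intercept; forcing the line through the origin generally changes the estimated slope:

with_intercept    <- lm(ECI ~ EFW, data = data)
without_intercept <- lm(ECI ~ EFW - 1, data = data)  # "- 1" removes the intercept term
coef(with_intercept)     # intercept and slope
coef(without_intercept)  # slope only, and typically a different value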

Term by term

\[ Y_i = \beta_0 + \beta_1 X_{i,1} + \varepsilon_i \]

  • \(\beta_1\) is the slope coefficient (It's "m" in \(y=mx+b\))
    • This is what we're most interested in.
    • It tells us how much \(Y\) changes, on average, when \(X\) increases by one unit.
  • \(X_{i,1}\) is the value of X for observation \(i\) (e.g. person \(i\), state \(i\), etc.)

Term by term

\[ Y_i = \beta_0 + \beta_1 X_{i,1} + \varepsilon_i \]

  • \(\varepsilon_i\) is the "error term".
    • It allows for randomness in the relationship between X and Y.
    • e.g. "increased X leads to decreased Y, but Y is also affected by other variables that cancel each other out on average"
    • This term may raise problems in the future.

The R code and what it means

summary(out1 <- lm(ECI~EFW,data=data))

Code           Meaning
lm()           fit a linear model
ECI ~ EFW      ECI is a linear function of EFW
out1 <- ...    put the results in an object called out1
summary()      display the basic results of the model

Alternately:

out1 <- lm(ECI~EFW,data=data)
summary(out1)
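
Storing the results in out1 also lets us pull out individual pieces later with a few of the standard accessor functions for lm objects:

coef(out1)       # the estimated coefficients
fitted(out1)     # the fitted values (predicted ECI for each observation used in the fit)
residuals(out1)  # observed ECI minus fitted ECI for each observation
confint(out1)    # confidence intervals for the coefficients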

The actual regression

## 
## Call:
## lm(formula = ECI ~ EFW, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0429 -0.7177 -0.1226  0.7379  2.6249 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.14569    0.17062  -24.30   <2e-16 ***
## EFW          0.61571    0.02479   24.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8928 on 1495 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.292,  Adjusted R-squared:  0.2916 
## F-statistic: 616.7 on 1 and 1495 DF,  p-value: < 2.2e-16

What this shows us

  • Call shows the formula we used and the data frame that contains our data.
  • Residuals tells us about the discrepancy between what our model predicts and the data we observed.
    • e.g. one of our observed data points has EFW=3.71 and ECI=-1.897187.
    • The "fitted value" for that observation is -1.86142094.
    • Based on the data, we would expect a country with an EFW score of 3.71 to have an ECI score of around -1.86.
    • But our prediction is off by about 0.036 relative to the observed value of ECI for that country.
    • Our predicted \(Y_i\) was higher than the actual value, so we have a residual of -0.03576606 (the arithmetic is reproduced in the sketch after this list).
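
A quick sketch reproducing those numbers from the coefficients in the regression output above (small rounding differences aside):

-4.14569 + 0.61571 * 3.71                        # fitted value: roughly -1.861
-1.897187 - (-1.86142094)                        # residual = observed minus fitted: about -0.0358
predict(out1, newdata = data.frame(EFW = 3.71))  # the same fitted value via predict()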

The actual regression

## 
## Call:
## lm(formula = ECI ~ EFW, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0429 -0.7177 -0.1226  0.7379  2.6249 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.14569    0.17062  -24.30   <2e-16 ***
## EFW          0.61571    0.02479   24.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8928 on 1495 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.292,  Adjusted R-squared:  0.2916 
## F-statistic: 616.7 on 1 and 1495 DF,  p-value: < 2.2e-16

What this shows us

  • Coefficients tells us the actual estimated coefficients (\(\beta_0\) and \(\beta_1\))
    • The first column shows the estimated coefficients (estimated because we don't know the true relationship between X and Y).
    • The remaining columns give us information about how likely it would be to calculate coefficients as large as these if there were actually no relationship between X and Y (a quick arithmetic check of these columns appears after this list).
    • The smaller the p-value, the less likely we think it is that our findings are due to random chance.
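
For example, the t value column is just the estimate divided by its standard error, and the p-value is derived from that t value. A quick check against the EFW row of the output above:

0.61571 / 0.02479  # roughly 24.8, matching (up to rounding) the t value reported for EFW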

What this shows us

  • The remaining information tells us about the overall performance of our model. (We'll deal with that in the future.)

What's going on in the background?

\(\hat{Y_i}\) is the value of Y we would expect from an observation with a value of \(X_i\); using the estimated coefficients, \(\hat{Y_i} = \beta_0 + \beta_1 X_i\).

What's going on in the background?

\(e_i\) is the residual for observation \(i\): \(e_i = Y_i - \hat{Y_i}\). How far off were we in our prediction?

The way we estimate \(\beta_0\) and \(\beta_1\) will determine how big our errors are on average.

What's the relationship between a good estimator and the errors it generates?

There are a variety of ways we might find a line of best fit. A tempting one would be to minimize the sum of errors.

But positive errors cancel out negative errors, so there are many possible lines with an average error of 0.
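
A quick illustration using the merged data from above: the fitted line and a flat line drawn at the mean of ECI are very different, yet both have residuals that average to (essentially) zero:

cc <- complete.cases(data$ECI, data$EFW)   # keep only rows where both variables are observed
y <- data$ECI[cc]; x <- data$EFW[cc]

line_a <- lm(y ~ x)                  # the fitted line from before
mean(residuals(line_a))              # essentially 0

line_b <- rep(mean(y), length(y))    # a flat line at the mean of y -- clearly a worse fit
mean(y - line_b)                     # also essentially 0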

Ordinary Least Squares

OLS (Ordinary Least Squares) is the standard method for finding a line of best fit. It finds the line that minimizes the sum of squared errors.
That is, it finds the values of \(\beta_0\) and \(\beta_1\) that minimize the following sum:

\[\sum_i \varepsilon_{i}^2 = \sum_i (Y_i - \beta_0 - \beta_1 X_i)^2\]

Estimating \(\beta_0\) and \(\beta_1\) this way produces estimates with a lot of nice properties that we will discuss in the future.
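
As a sanity check, the same estimates can be recovered by minimizing that sum of squares numerically. This is only a sketch (lm() does the real work analytically), using optim() from base R and the merged data from above:

# Sum of squared residuals for a candidate (beta0, beta1)
ssr <- function(b, y, x) sum((y - b[1] - b[2] * x)^2)

cc <- complete.cases(data$ECI, data$EFW)       # drop rows with missing values, as lm() does
fit <- optim(c(0, 0), ssr, y = data$ECI[cc], x = data$EFW[cc])
fit$par                                        # should be close to coef(out1): about -4.15 and 0.62
coef(out1)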