Handout 006 In-Class Beta bias

Plotting and Analyzing Data

As promised, I’m going to give you the code to replicate the figures from class. Let’s start with the easiest plot: the scatterplot of all the data:

First, I’m going to read in the data:

library(foreign)
setwd("/Users/paulmusgrave/Dropbox/0005 Teaching/2016 03 UMASS Energy/Handouts")
data <- read.csv("Sample Data.csv")
colnames(data) <- c("Country","GDPPC","Energy","Set")

And then a quick plot to show the data:

plot(data$GDPPC~data$Energy,
     main="All Data, All Subsets",
     pch = 16,
     xlab="kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD")

If you’ve read the other handout, you’ll notice that I’m using a slightly different notation for entering the plot coordinates here: data$GDPPC ~ data$ Energy, instead of plot(data$Energy,data$GDPPC). R was built for data analysis, and it uses the ~ operator to indicate a dependent relationship between two variables; it carries that syntax over to the plot() command. So you can write, equivalently, plot(Xvariable,Yvariable) or plot(Yvariable ~ Xvariable). It’s a matter of personal preference, but the Y ~ X syntax looks more like how we’ll conduct OLS and other regressions, so it might be more intuitive.

By the way, we mentioned that this might look different if we “logged” (took the logarithm of) the data. It’s really easy to do that in a plot (R uses natural logs by default), like this code that logs the x-axis only:

plot(data$GDPPC~data$Energy,
     main="All Data, All Subsets",
     pch = 16,
     xlab="log of kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD",
     log="x") # This line logs the x-axis

You could change the parameter log to "y" to log the y-axis instead or to "yx" to log both, like:

plot(data$GDPPC~data$Energy,
     main="All Data, All Subsets",
     pch = 16,
     xlab="log of kg oil equivalent usage per capita",
     ylab="Log of GDP PC at PPP 2013 USD",
     log="yx")

(Gee, doesn’t this relationship look a little more linear? This is probably a better fit for linearly modeling these data series, but the interpretation is a little more difficult–we’ll talk about what a one-unit change in log(x) means a little later on.) Note that changing the log parameter in the plot() function doesn’t transform the data—we’ll talk about how to do that later. (It’s really simple, but I don’t want you to accidentally change something that you might need!)

The point of today’s exercise was to talk about how analyzing subsets of data (as we always do) can lead to estimates of $\beta$ that differ from the true or population value. To do that, though, we have to spend a moment learning about how R indexes its data and how to subset that data so that we can (for instance) only analyze set “A” or set “G” (or whatever) of this data. For more information, you should check out the StatMethods page on subsetting and indexing.

Subsetting and Indexing Data

The basic idea is this: you can call named columns or other elements of a dataset by name, like so:

data$GDPPC

##   [1]  7428 28460  7390  1879   990 32018 16200 32176  3493 25251 45627
##  [12] 12998  4120  3194 21376  9728  2721 38319  8348 41188  7432 22038
##  [23] 11795  2029  3596 11613  4203  1402 16710 26730 67963  9926  2061
##  [34]  7080   789  2030  1050  1735 24820  4688 11656 11650  5144  1419
##  [45]  9469 34527   304  7888 30442  2062 34100  1687 11880 22205 24227
##  [56] 47175  1219 20092  9356  1846  6810  2611 24649 16712 19059  4811
##  [67]  6012  3332 10134  7052 29161 25303  2813 29626  1375 11851  5608
##  [78] 11932 65894  1715 27083  7161  5763 11316  8590 17565  2332  4302
##  [89] 16104  5929  2088  9554  3697 10319 10284 19354 12902 11973  1052
## [100]  4122 10798  9469  1962   886  9044  3891 13273 32400 45156 13814
## [111] 25790 31519 13009  4281 10463  5246 15362  3460 36521  7950 13617
## [122] 20383  6893  4043  5366  2272 32300 45203 33989  4526 11384  6207
## [133] 23123   867  4320  2980  3264  1442 15089  2606  3971  3365 15644
## [144] 46271  5297 23262 32470  1323 34705  4245 11591  8268  1423   494
## [155]  1655  1063 36274 25300  1245

That just returns the GDP per capita from the table, where the _n_th element of the data$GDPPC vector corresponds to the value of GDPPC for the _n_th row of the data object. We could get the same thing if we used the column number for GDPPC. It is the second column from the left (the row numbers to the far left don’t count as a column), so we use

data[,2]

##   [1]  7428 28460  7390  1879   990 32018 16200 32176  3493 25251 45627
##  [12] 12998  4120  3194 21376  9728  2721 38319  8348 41188  7432 22038
##  [23] 11795  2029  3596 11613  4203  1402 16710 26730 67963  9926  2061
##  [34]  7080   789  2030  1050  1735 24820  4688 11656 11650  5144  1419
##  [45]  9469 34527   304  7888 30442  2062 34100  1687 11880 22205 24227
##  [56] 47175  1219 20092  9356  1846  6810  2611 24649 16712 19059  4811
##  [67]  6012  3332 10134  7052 29161 25303  2813 29626  1375 11851  5608
##  [78] 11932 65894  1715 27083  7161  5763 11316  8590 17565  2332  4302
##  [89] 16104  5929  2088  9554  3697 10319 10284 19354 12902 11973  1052
## [100]  4122 10798  9469  1962   886  9044  3891 13273 32400 45156 13814
## [111] 25790 31519 13009  4281 10463  5246 15362  3460 36521  7950 13617
## [122] 20383  6893  4043  5366  2272 32300 45203 33989  4526 11384  6207
## [133] 23123   867  4320  2980  3264  1442 15089  2606  3971  3365 15644
## [144] 46271  5297 23262 32470  1323 34705  4245 11591  8268  1423   494
## [155]  1655  1063 36274 25300  1245

and we get a vector that is precisely identical to data$GDPPC. Often, calling elements by name is more convenient when we’re playing with data, but when you’re programming or working with things that we don’t need to name, the column number is better.

We can also use row numbers to identify all the values for one observation:

data[4,]

##    Country GDPPC Energy Set
## 4 Cambodia  1879    352   A

(Note that R uses the [row,column] convention.) You can choose a range of rows:

data[4:10,]

##     Country GDPPC Energy Set
## 4  Cambodia  1879    352   A
## 5   Comoros   990     61   A
## 6   Denmark 32018   3323   A
## 7   Estonia 16200   3544   A
## 8   Germany 32176   3872   A
## 9  Honduras  3493    598   A
## 10   Israel 25251   2876   A

or arbitrary rows:

data[c(1,3,58,99),]

##                   Country GDPPC Energy Set
## 1                 Albania  7428    654   A
## 3  Bosnia and Herzegovina  7390   1602   A
## 58           Saudi Arabia 20092   5889   B
## 99                  Nepal  1052    339   D

or (and here we arrive at the reason for this tour) rows that match some condition:

data[data$Set=="A",]

##                          Country GDPPC Energy Set
## 1                        Albania  7428    654   A
## 2                        Bahamas 28460   2156   A
## 3         Bosnia and Herzegovina  7390   1602   A
## 4                       Cambodia  1879    352   A
## 5                        Comoros   990     61   A
## 6                        Denmark 32018   3323   A
## 7                        Estonia 16200   3544   A
## 8                        Germany 32176   3872   A
## 9                       Honduras  3493    598   A
## 10                        Israel 25251   2876   A
## 11                        Kuwait 45627  11403   A
## 12              Malaysia (1966-) 12998   2500   A
## 13                       Morocco  4120    478   A
## 14                     Nicaragua  3194    534   A
## 15                      Portugal 21376   2272   A
## 16 St Vincent and the Grenadines  9728    642   A
## 17                       Vietnam  2721    622   A
## 18                   Switzerland 38319   3484   A
## 19                       Tunisia  8348    867   A
## 20                 United States 41188   7058   A
## 21                       Algeria  7432   1166   A
## 22                       Bahrain 22038   8097   A
## 23                      Botswana 11795   1021   A
## 24                      Cameroon  2029    361   A
## 25                         Congo  3596    360   A
## 26                      Dominica 11613    629   A
## 27                          Fiji  4203    628   A
## 28                         Ghana  1402    369   A
## 29                       Hungary 16710   2481   A
## 30                         Italy 26730   2739   A
## 31                    Luxembourg 67963   7939   A
## 32                    Montenegro  9926   1039   A

which is, again, equivalent to data[data[,4]=="A",]. You can read these expressions as saying “Return all data from any rows in the data object that have an ‘A’ in the column named ‘Set.’” Equivalently, if we only wanted to see observations with high energy usage:

summary(data$Energy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10     560    1244    2217    2846   16900

which tells us that the 75th percentile of Energy is 2,846 kg oil equivalent per capita. Which countries match that criterion?

data[data$Energy>=2846,1]

##  [1] Denmark              Estonia              Germany             
##  [4] Israel               Kuwait               Switzerland         
##  [7] United States        Bahrain              Luxembourg          
## [10] Slovenia             Canada               Finland             
## [13] Iceland              Oman                 Norway              
## [16] Saudi Arabia         South Africa         Turkmenistan        
## [19] New Zealand          France (1963-)       Japan               
## [22] Qatar                Kazakhstan           Slovakia            
## [25] Belgium              Brunei               Iran                
## [28] Libya                Netherlands          Russia              
## [31] Sweden               United Arab Emirates Australia           
## [34] Czech Republic       Singapore            Trinidad and Tobago 
## [37] United Kingdom       Austria              Ireland             
## [40] Korea, South        
## 159 Levels: Albania Algeria Angola Antigua and Barbuda ... Zambia

which would be equivalent to writing data[data$Energy>=2846,]$Country. This command would be expressed in English as “Return the first column of the dataset for only those observations that have a value of 2846 or higher in the ‘Energy’ column.” Note that these names aren’t in order; you could alphabetize them using the order() command from the other handout. You might also find the list of R’s logical operators to be interesting.

Analyzing a subset of data

Subsetting and indexing objects in R is important, but we’re here to learn how to do it so that we can carry out our investigations. The other handout talked about the lm function for OLS regressions; now, we’ll use that function to analyze just a subset of our data using our new subsetting functions.

ols.A <- lm(GDPPC~Energy,data=data[data$Set=="A",])
summary(ols.A)

## 
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "A", ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22406  -4474  -1994   4010  24289 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4975.3398  2106.8428   2.362   0.0249 *  
## Energy         4.8745     0.5905   8.255 3.25e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8919 on 30 degrees of freedom
## Multiple R-squared:  0.6943, Adjusted R-squared:  0.6841 
## F-statistic: 68.14 on 1 and 30 DF,  p-value: 3.251e-09

And we can plot the data and the associated regression line easily:

plot(data[data$Set=="A",]$GDPPC~data[data$Set=="A",]$Energy,
     main="GDP PC versus Energy Use Per Capita",
     pch = 16,
     xlab="kg oil equivalent usage PC",
     ylab="GDP PC at PPP 2013 USD")
abline(ols.A,
       col="red")

Now, let’s repeat the exercise for the different subsets. First, let’s calculate each OLS:

ols.B <- lm(GDPPC~Energy,data=data[data$Set=="B",])
summary(ols.B)

## 
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "B", ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18721  -4062  -2343   3396  25798 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4815.6496  1765.0330   2.728   0.0105 *  
## Energy         2.8397     0.4236   6.704 1.99e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7946 on 30 degrees of freedom
## Multiple R-squared:  0.5997, Adjusted R-squared:  0.5863 
## F-statistic: 44.94 on 1 and 30 DF,  p-value: 1.986e-07

ols.C<- lm(GDPPC~Energy,data=data[data$Set=="C",])
summary(ols.C)

## 
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "C", ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11555  -3353  -1453   3110  10395 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4252.7491  1145.9629   3.711 0.000839 ***
## Energy         4.4780     0.3648  12.274 3.17e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5281 on 30 degrees of freedom
## Multiple R-squared:  0.8339, Adjusted R-squared:  0.8284 
## F-statistic: 150.6 on 1 and 30 DF,  p-value: 3.169e-13

ols.D <- lm(GDPPC~Energy,data=data[data$Set=="D",])
summary(ols.D)

## 
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "D", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13015.5  -2804.6   -548.7   1978.0  14828.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1824.8594  1410.5571   1.294    0.206    
## Energy         5.4415     0.4558  11.938 6.35e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5411 on 30 degrees of freedom
## Multiple R-squared:  0.8261, Adjusted R-squared:  0.8203 
## F-statistic: 142.5 on 1 and 30 DF,  p-value: 6.345e-13

ols.E <- lm(GDPPC~Energy,data=data[data$Set=="E",])
summary(ols.E)

## 
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "E", ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26591  -5495  -3373   3445  24486 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5772.8219  2260.1784   2.554   0.0162 *  
## Energy         2.9059     0.6317   4.600  7.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10200 on 29 degrees of freedom
## Multiple R-squared:  0.4218, Adjusted R-squared:  0.4019 
## F-statistic: 21.16 on 1 and 29 DF,  p-value: 7.7e-05

ols.out <- lm(GDPPC~Energy,data=data) # no subsets, this is all the data
summary(ols.out)

## 
## Call:
## lm(formula = GDPPC ~ Energy, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39714  -4261  -2000   2801  32659 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4917.0849   843.5910   5.829 3.08e-08 ***
## Energy         3.8275     0.2389  16.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8279 on 157 degrees of freedom
## Multiple R-squared:  0.6204, Adjusted R-squared:  0.618 
## F-statistic: 256.6 on 1 and 157 DF,  p-value: < 2.2e-16

With one additional line of R code, we can combine the plots for all of these subsets on one chart. For more on this technique, see the StatMethods page on combining R plots.

par(mfrow=c(3,3))

plot(data[data$Set=="A",]$GDPPC~data[data$Set=="A",]$Energy,
     main="Set A",
     pch = 16,
     xlab="kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD")
abline(ols.A,
       col="red")


plot(data[data$Set=="B",]$GDPPC~data[data$Set=="B",]$Energy,
     main="Set B",
     pch = 16,
     xlab="kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD")
abline(ols.B,
       col="red")


plot(data[data$Set=="C",]$GDPPC~data[data$Set=="C",]$Energy,
     main="Set C",
     pch = 16,
     xlab="kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD")
abline(ols.C,
       col="red")


plot(data[data$Set=="D",]$GDPPC~data[data$Set=="D",]$Energy,
     main="Set D",
     pch = 16,
     xlab="kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD")
abline(ols.D,
       col="red")


plot(data[data$Set=="E",]$GDPPC~data[data$Set=="E",]$Energy,
     main="Set E",
     pch = 16,
     xlab="kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD")
abline(ols.E,
       col="red")

plot(data$GDPPC~data$Energy,
     main="All",
     pch = 16,
     xlab="kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD")
abline(ols.out,
       col="blue")

And we can combine all of these estimates into one pane:

plot(data$GDPPC~data$Energy,
     main="All Data, All Subsets",
     pch = 16,
     xlab="kg oil equivalent usage per capita",
     ylab="GDP PC at PPP 2013 USD")
abline(ols.A)
abline(ols.B)
abline(ols.C)
abline(ols.D)
abline(ols.E)
abline(ols.out,
       col="blue",
       lwd=3) ## lwd varies the width of the line being plotted

Handout 006 In-Class Beta bias

Paul Musgrave

October 5, 2016

Plotting and Analyzing Data

Subsetting and Indexing Data

Analyzing a subset of data