As promised, I’m going to give you the code to replicate the figures from class. Let’s start with the easiest plot: the scatterplot of all the data:
First, I’m going to read in the data:
library(foreign)
setwd("/Users/paulmusgrave/Dropbox/0005 Teaching/2016 03 UMASS Energy/Handouts")
data <- read.csv("Sample Data.csv")
colnames(data) <- c("Country","GDPPC","Energy","Set")
And then a quick plot to show the data:
plot(data$GDPPC~data$Energy,
main="All Data, All Subsets",
pch = 16,
xlab="kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD")
If you’ve read the other handout, you’ll notice that I’m using a slightly different notation for entering the plot coordinates here: data$GDPPC ~ data$ Energy, instead of plot(data$Energy,data$GDPPC). R was built for data analysis, and it uses the ~ operator to indicate a dependent relationship between two variables; it carries that syntax over to the plot() command. So you can write, equivalently, plot(Xvariable,Yvariable) or plot(Yvariable ~ Xvariable). It’s a matter of personal preference, but the Y ~ X syntax looks more like how we’ll conduct OLS and other regressions, so it might be more intuitive.
By the way, we mentioned that this might look different if we “logged” (took the logarithm of) the data. It’s really easy to do that in a plot (R uses natural logs by default), like this code that logs the x-axis only:
plot(data$GDPPC~data$Energy,
main="All Data, All Subsets",
pch = 16,
xlab="log of kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD",
log="x") # This line logs the x-axis
You could change the parameter log to "y" to log the y-axis instead or to "yx" to log both, like:
plot(data$GDPPC~data$Energy,
main="All Data, All Subsets",
pch = 16,
xlab="log of kg oil equivalent usage per capita",
ylab="Log of GDP PC at PPP 2013 USD",
log="yx")
(Gee, doesn’t this relationship look a little more linear? This is probably a better fit for linearly modeling these data series, but the interpretation is a little more difficult–we’ll talk about what a one-unit change in log(x) means a little later on.) Note that changing the log parameter in the plot() function doesn’t transform the data—we’ll talk about how to do that later. (It’s really simple, but I don’t want you to accidentally change something that you might need!)
The point of today’s exercise was to talk about how analyzing subsets of data (as we always do) can lead to estimates of \(\beta\) that differ from the true or population value. To do that, though, we have to spend a moment learning about how R indexes its data and how to subset that data so that we can (for instance) only analyze set “A” or set “G” (or whatever) of this data. For more information, you should check out the StatMethods page on subsetting and indexing.
The basic idea is this: you can call named columns or other elements of a dataset by name, like so:
data$GDPPC
## [1] 7428 28460 7390 1879 990 32018 16200 32176 3493 25251 45627
## [12] 12998 4120 3194 21376 9728 2721 38319 8348 41188 7432 22038
## [23] 11795 2029 3596 11613 4203 1402 16710 26730 67963 9926 2061
## [34] 7080 789 2030 1050 1735 24820 4688 11656 11650 5144 1419
## [45] 9469 34527 304 7888 30442 2062 34100 1687 11880 22205 24227
## [56] 47175 1219 20092 9356 1846 6810 2611 24649 16712 19059 4811
## [67] 6012 3332 10134 7052 29161 25303 2813 29626 1375 11851 5608
## [78] 11932 65894 1715 27083 7161 5763 11316 8590 17565 2332 4302
## [89] 16104 5929 2088 9554 3697 10319 10284 19354 12902 11973 1052
## [100] 4122 10798 9469 1962 886 9044 3891 13273 32400 45156 13814
## [111] 25790 31519 13009 4281 10463 5246 15362 3460 36521 7950 13617
## [122] 20383 6893 4043 5366 2272 32300 45203 33989 4526 11384 6207
## [133] 23123 867 4320 2980 3264 1442 15089 2606 3971 3365 15644
## [144] 46271 5297 23262 32470 1323 34705 4245 11591 8268 1423 494
## [155] 1655 1063 36274 25300 1245
That just returns the GDP per capita from the table, where the _n_th element of the data$GDPPC vector corresponds to the value of GDPPC for the _n_th row of the data object. We could get the same thing if we used the column number for GDPPC. It is the second column from the left (the row numbers to the far left don’t count as a column), so we use
data[,2]
## [1] 7428 28460 7390 1879 990 32018 16200 32176 3493 25251 45627
## [12] 12998 4120 3194 21376 9728 2721 38319 8348 41188 7432 22038
## [23] 11795 2029 3596 11613 4203 1402 16710 26730 67963 9926 2061
## [34] 7080 789 2030 1050 1735 24820 4688 11656 11650 5144 1419
## [45] 9469 34527 304 7888 30442 2062 34100 1687 11880 22205 24227
## [56] 47175 1219 20092 9356 1846 6810 2611 24649 16712 19059 4811
## [67] 6012 3332 10134 7052 29161 25303 2813 29626 1375 11851 5608
## [78] 11932 65894 1715 27083 7161 5763 11316 8590 17565 2332 4302
## [89] 16104 5929 2088 9554 3697 10319 10284 19354 12902 11973 1052
## [100] 4122 10798 9469 1962 886 9044 3891 13273 32400 45156 13814
## [111] 25790 31519 13009 4281 10463 5246 15362 3460 36521 7950 13617
## [122] 20383 6893 4043 5366 2272 32300 45203 33989 4526 11384 6207
## [133] 23123 867 4320 2980 3264 1442 15089 2606 3971 3365 15644
## [144] 46271 5297 23262 32470 1323 34705 4245 11591 8268 1423 494
## [155] 1655 1063 36274 25300 1245
and we get a vector that is precisely identical to data$GDPPC. Often, calling elements by name is more convenient when we’re playing with data, but when you’re programming or working with things that we don’t need to name, the column number is better.
We can also use row numbers to identify all the values for one observation:
data[4,]
## Country GDPPC Energy Set
## 4 Cambodia 1879 352 A
(Note that R uses the [row,column] convention.) You can choose a range of rows:
data[4:10,]
## Country GDPPC Energy Set
## 4 Cambodia 1879 352 A
## 5 Comoros 990 61 A
## 6 Denmark 32018 3323 A
## 7 Estonia 16200 3544 A
## 8 Germany 32176 3872 A
## 9 Honduras 3493 598 A
## 10 Israel 25251 2876 A
or arbitrary rows:
data[c(1,3,58,99),]
## Country GDPPC Energy Set
## 1 Albania 7428 654 A
## 3 Bosnia and Herzegovina 7390 1602 A
## 58 Saudi Arabia 20092 5889 B
## 99 Nepal 1052 339 D
or (and here we arrive at the reason for this tour) rows that match some condition:
data[data$Set=="A",]
## Country GDPPC Energy Set
## 1 Albania 7428 654 A
## 2 Bahamas 28460 2156 A
## 3 Bosnia and Herzegovina 7390 1602 A
## 4 Cambodia 1879 352 A
## 5 Comoros 990 61 A
## 6 Denmark 32018 3323 A
## 7 Estonia 16200 3544 A
## 8 Germany 32176 3872 A
## 9 Honduras 3493 598 A
## 10 Israel 25251 2876 A
## 11 Kuwait 45627 11403 A
## 12 Malaysia (1966-) 12998 2500 A
## 13 Morocco 4120 478 A
## 14 Nicaragua 3194 534 A
## 15 Portugal 21376 2272 A
## 16 St Vincent and the Grenadines 9728 642 A
## 17 Vietnam 2721 622 A
## 18 Switzerland 38319 3484 A
## 19 Tunisia 8348 867 A
## 20 United States 41188 7058 A
## 21 Algeria 7432 1166 A
## 22 Bahrain 22038 8097 A
## 23 Botswana 11795 1021 A
## 24 Cameroon 2029 361 A
## 25 Congo 3596 360 A
## 26 Dominica 11613 629 A
## 27 Fiji 4203 628 A
## 28 Ghana 1402 369 A
## 29 Hungary 16710 2481 A
## 30 Italy 26730 2739 A
## 31 Luxembourg 67963 7939 A
## 32 Montenegro 9926 1039 A
which is, again, equivalent to data[data[,4]=="A",]. You can read these expressions as saying “Return all data from any rows in the data object that have an ‘A’ in the column named ‘Set.’” Equivalently, if we only wanted to see observations with high energy usage:
summary(data$Energy)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 560 1244 2217 2846 16900
which tells us that the 75th percentile of Energy is 2,846 kg oil equivalent per capita. Which countries match that criterion?
data[data$Energy>=2846,1]
## [1] Denmark Estonia Germany
## [4] Israel Kuwait Switzerland
## [7] United States Bahrain Luxembourg
## [10] Slovenia Canada Finland
## [13] Iceland Oman Norway
## [16] Saudi Arabia South Africa Turkmenistan
## [19] New Zealand France (1963-) Japan
## [22] Qatar Kazakhstan Slovakia
## [25] Belgium Brunei Iran
## [28] Libya Netherlands Russia
## [31] Sweden United Arab Emirates Australia
## [34] Czech Republic Singapore Trinidad and Tobago
## [37] United Kingdom Austria Ireland
## [40] Korea, South
## 159 Levels: Albania Algeria Angola Antigua and Barbuda ... Zambia
which would be equivalent to writing data[data$Energy>=2846,]$Country. This command would be expressed in English as “Return the first column of the dataset for only those observations that have a value of 2846 or higher in the ‘Energy’ column.” Note that these names aren’t in order; you could alphabetize them using the order() command from the other handout. You might also find the list of R’s logical operators to be interesting.
Subsetting and indexing objects in R is important, but we’re here to learn how to do it so that we can carry out our investigations. The other handout talked about the lm function for OLS regressions; now, we’ll use that function to analyze just a subset of our data using our new subsetting functions.
ols.A <- lm(GDPPC~Energy,data=data[data$Set=="A",])
summary(ols.A)
##
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "A", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -22406 -4474 -1994 4010 24289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4975.3398 2106.8428 2.362 0.0249 *
## Energy 4.8745 0.5905 8.255 3.25e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8919 on 30 degrees of freedom
## Multiple R-squared: 0.6943, Adjusted R-squared: 0.6841
## F-statistic: 68.14 on 1 and 30 DF, p-value: 3.251e-09
And we can plot the data and the associated regression line easily:
plot(data[data$Set=="A",]$GDPPC~data[data$Set=="A",]$Energy,
main="GDP PC versus Energy Use Per Capita",
pch = 16,
xlab="kg oil equivalent usage PC",
ylab="GDP PC at PPP 2013 USD")
abline(ols.A,
col="red")
Now, let’s repeat the exercise for the different subsets. First, let’s calculate each OLS:
ols.B <- lm(GDPPC~Energy,data=data[data$Set=="B",])
summary(ols.B)
##
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "B", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -18721 -4062 -2343 3396 25798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4815.6496 1765.0330 2.728 0.0105 *
## Energy 2.8397 0.4236 6.704 1.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7946 on 30 degrees of freedom
## Multiple R-squared: 0.5997, Adjusted R-squared: 0.5863
## F-statistic: 44.94 on 1 and 30 DF, p-value: 1.986e-07
ols.C<- lm(GDPPC~Energy,data=data[data$Set=="C",])
summary(ols.C)
##
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "C", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11555 -3353 -1453 3110 10395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4252.7491 1145.9629 3.711 0.000839 ***
## Energy 4.4780 0.3648 12.274 3.17e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5281 on 30 degrees of freedom
## Multiple R-squared: 0.8339, Adjusted R-squared: 0.8284
## F-statistic: 150.6 on 1 and 30 DF, p-value: 3.169e-13
ols.D <- lm(GDPPC~Energy,data=data[data$Set=="D",])
summary(ols.D)
##
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "D", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13015.5 -2804.6 -548.7 1978.0 14828.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1824.8594 1410.5571 1.294 0.206
## Energy 5.4415 0.4558 11.938 6.35e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5411 on 30 degrees of freedom
## Multiple R-squared: 0.8261, Adjusted R-squared: 0.8203
## F-statistic: 142.5 on 1 and 30 DF, p-value: 6.345e-13
ols.E <- lm(GDPPC~Energy,data=data[data$Set=="E",])
summary(ols.E)
##
## Call:
## lm(formula = GDPPC ~ Energy, data = data[data$Set == "E", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -26591 -5495 -3373 3445 24486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5772.8219 2260.1784 2.554 0.0162 *
## Energy 2.9059 0.6317 4.600 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10200 on 29 degrees of freedom
## Multiple R-squared: 0.4218, Adjusted R-squared: 0.4019
## F-statistic: 21.16 on 1 and 29 DF, p-value: 7.7e-05
ols.out <- lm(GDPPC~Energy,data=data) # no subsets, this is all the data
summary(ols.out)
##
## Call:
## lm(formula = GDPPC ~ Energy, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39714 -4261 -2000 2801 32659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4917.0849 843.5910 5.829 3.08e-08 ***
## Energy 3.8275 0.2389 16.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8279 on 157 degrees of freedom
## Multiple R-squared: 0.6204, Adjusted R-squared: 0.618
## F-statistic: 256.6 on 1 and 157 DF, p-value: < 2.2e-16
With one additional line of R code, we can combine the plots for all of these subsets on one chart. For more on this technique, see the StatMethods page on combining R plots.
par(mfrow=c(3,3))
plot(data[data$Set=="A",]$GDPPC~data[data$Set=="A",]$Energy,
main="Set A",
pch = 16,
xlab="kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD")
abline(ols.A,
col="red")
plot(data[data$Set=="B",]$GDPPC~data[data$Set=="B",]$Energy,
main="Set B",
pch = 16,
xlab="kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD")
abline(ols.B,
col="red")
plot(data[data$Set=="C",]$GDPPC~data[data$Set=="C",]$Energy,
main="Set C",
pch = 16,
xlab="kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD")
abline(ols.C,
col="red")
plot(data[data$Set=="D",]$GDPPC~data[data$Set=="D",]$Energy,
main="Set D",
pch = 16,
xlab="kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD")
abline(ols.D,
col="red")
plot(data[data$Set=="E",]$GDPPC~data[data$Set=="E",]$Energy,
main="Set E",
pch = 16,
xlab="kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD")
abline(ols.E,
col="red")
plot(data$GDPPC~data$Energy,
main="All",
pch = 16,
xlab="kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD")
abline(ols.out,
col="blue")
And we can combine all of these estimates into one pane:
plot(data$GDPPC~data$Energy,
main="All Data, All Subsets",
pch = 16,
xlab="kg oil equivalent usage per capita",
ylab="GDP PC at PPP 2013 USD")
abline(ols.A)
abline(ols.B)
abline(ols.C)
abline(ols.D)
abline(ols.E)
abline(ols.out,
col="blue",
lwd=3) ## lwd varies the width of the line being plotted