Creating Scatterplots: Introduction

For this exercise we analyze education data as collected and presented by the National Center for Education Statistics (NCES). We will use scatterplots to explore performance in Science, Mathematics and Reading Comprehension in the USA in 2011. To do so, we analyze the results by “Lunch Program Eligibility” (an indicator of poverty); that is, we compare, by State, the aggregate scores obtained by students “eligible” and “not eligible” for lunch programs.

Getting the Data

We have stored the data downloaded from NCES in a Google Drive folder. We consider this good practice: it makes the exercises easier to share and replicate, and we do not need to deal with the location of the file on our hard drive. We have the following links stored:

Math8G2011 = "https://docs.google.com/spreadsheet/pub?key=0AhVqDdZgThPldEVpQWxSWWItNjlEbS1sU3NEamUzZmc&output=csv"
Read8G2011 = "https://docs.google.com/spreadsheet/pub?key=0AhVqDdZgThPldFRJbm9yMWlGcjBSaW1zcFh0S1ByM0E&output=csv"
Sci8G2011 = "https://docs.google.com/spreadsheet/pub?key=0AhVqDdZgThPldGNsQ2F6WFlsOTFfWVlaTk0xM1N0LXc&output=csv"

To actually get the data into R, we will use the following code:

library(RCurl)

## Loading required package: bitops

Math8G2011 <- getURL(Math8G2011)
Read8G2011 <- getURL(Read8G2011)
Sci8G2011 <- getURL(Sci8G2011)

Math8G2011 <- read.csv(textConnection(Math8G2011), header = TRUE, row.names = 1)
Read8G2011 <- read.csv(textConnection(Read8G2011), header = TRUE, row.names = 1)
Sci8G2011 <- read.csv(textConnection(Sci8G2011), header = TRUE, row.names = 1)
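The read.csv(textConnection(...)) idiom above can be tried without a network connection. The following is a minimal sketch, assuming a made-up CSV string (csvText and its values are hypothetical, not NCES data) in place of what getURL() returns:

```r
# Toy CSV text standing in for the string returned by getURL()
# (state names and scores are made up for illustration).
csvText <- "State,Elig,NotElig\nAlpha,270,290\nBeta,265,288\n"

# textConnection() lets read.csv() read a character string as if it were a file;
# row.names = 1 turns the first column (State) into the row names.
toy <- read.csv(textConnection(csvText), header = TRUE, row.names = 1)

rownames(toy)   # the state labels
colnames(toy)   # the remaining columns
```

Because the states become row names, they are later available through rownames() for labelling points.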

As can be seen below, all the datasets have the same variables:

colnames(Math8G2011)

## [1] "All"     "Male"    "Female"  "M.F"     "Elig"    "NotElig" "E.NE"

colnames(Read8G2011)

## [1] "All"     "Male"    "Female"  "M.F"     "Elig"    "NotElig" "E.NE"

colnames(Sci8G2011)

## [1] "All"     "Male"    "Female"  "M.F"     "Elig"    "NotElig" "E.NE"
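Instead of comparing the printed names by eye, the same check can be done programmatically. This is only a sketch: mathToy, readToy and sciToy are hypothetical stand-ins with made-up values, used so the example runs on its own.

```r
# Hypothetical stand-ins for the three datasets (values are made up).
mathToy <- data.frame(Elig = c(270, 265), NotElig = c(290, 288))
readToy <- data.frame(Elig = c(255, 250), NotElig = c(275, 272))
sciToy  <- data.frame(Elig = c(140, 138), NotElig = c(160, 158))

datasets <- list(Math = mathToy, Reading = readToy, Science = sciToy)

# Compare every dataset's column names against those of the first one.
sameCols <- vapply(datasets,
                   function(d) identical(colnames(d), colnames(datasets[[1]])),
                   logical(1))
sameCols        # one TRUE/FALSE per dataset
all(sameCols)   # TRUE only if every dataset matches
```

With the real data, the list would hold Math8G2011, Read8G2011 and Sci8G2011 instead.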

As stated at the beginning, we will use the variables Elig and NotElig.

Using Scatterplots

To produce a scatterplot we will use a function provided by Prof. Daniel Carr:

plotSetup <- function(x,y, # required arguments to set scales
  fill.plot=rgb(.9,.9,.9), # default fill color
  col.grid="white",        # default grid color
  lwd.grid=2,              # default grid line width
  lty.grid=1,              # default grid line dash pattern
  ...){                    # passes other arguments on to plot()

  plot(x, y, type='n', axes=FALSE, ...)
  xy <- par()$usr
  rect(xy[1], xy[3], xy[2], xy[4], col=fill.plot)
  grid(lty=lty.grid, col=col.grid, lwd=lwd.grid)
  box()
  axis(side=1, tick=FALSE, mgp=c(2, .1, 0))
  axis(side=2, las=1, tick=FALSE, mgp=c(2, .25, 0))
}
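The function relies on par()$usr to find the corners of the plotting region before filling it. The sketch below isolates that step with made-up axis ranges; it illustrates the mechanism and is not part of Prof. Carr's code.

```r
# Set up an empty plot over made-up ranges, as plotSetup() does with type = 'n'.
plot(c(0, 10), c(0, 100), type = "n")

# par("usr") returns c(x1, x2, y1, y2): the extremes of the plotting region
# in user coordinates, in that order.
usr <- par("usr")
usr

# With the default axis style, R extends each data range by 4% at both ends,
# so the region is slightly wider than the data. Passing these corners to
# rect() therefore fills the whole plotting region.
rect(usr[1], usr[3], usr[2], usr[4], col = rgb(.9, .9, .9))
```

Drawing the filled rectangle first and the grid next is what lets the points, added later with points(), sit on top of the background.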

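Now we are ready to plot using that function. In each of the following plots we call the function and change only the arguments that differ. The red diagonal drawn by abline(a=0,b=1) marks where both groups would have the same score. (NOTE: we use the command identify() to interactively show the names of some of the dots in the scatterplots; it requires an interactive graphics device, and n=10 allows up to ten points to be clicked.)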

First, for the Math results:

textX="Lunch-Eligible Average Scale Scores" # we will use these texts for the next 3 plots
textY="Non-Lunch-Eligible Average Scale Scores"
plotSetup(Math8G2011$Elig,Math8G2011$NotElig,                   #variables to plot
    main="NAEP 8th Grade Math Performance by State, 2011",      #title
    xlab=paste(textX),                                          #label for x
    ylab=paste(textY),                                          #label for y
    sub = "Individual's Possible Scale Score Range: 0 - 500")   #subtitle
abline(a=0,b=1,col="red",lwd=2)
points(Math8G2011$Elig,Math8G2011$NotElig,pch=19,col="blue")
identify(Math8G2011$Elig,Math8G2011$NotElig,labels=rownames(Math8G2011), n=10)

Then, for the Reading results:

plotSetup(Read8G2011$Elig,Read8G2011$NotElig,                   #variables to plot
    main="NAEP 8th Grade Reading Performance by State, 2011",   #title
    xlab=paste(textX),                                          #label for x
    ylab=paste(textY),                                          #label for y
    sub = "Individual's Possible Scale Score Range: 0 - 500")   #subtitle
abline(a=0,b=1,col="red",lwd=2) 
points(Read8G2011$Elig,Read8G2011$NotElig,pch=19,col="blue")
identify(Read8G2011$Elig,Read8G2011$NotElig,labels=rownames(Read8G2011), n=10)

And finally, for the Science results:

plotSetup(Sci8G2011$Elig,Sci8G2011$NotElig,                     #variables to plot
    main="NAEP 8th Grade Science Performance by State, 2011",   #title
    xlab=paste(textX),                                          #label for x
    ylab=paste(textY),                                          #label for y
    sub = "Individual's Possible Scale Score Range: 0 - 300")   #subtitle
abline(a=0,b=1,col="red",lwd=2)
points(Sci8G2011$Elig,Sci8G2011$NotElig,pch=19,col="blue")
identify(Sci8G2011$Elig,Sci8G2011$NotElig,labels=rownames(Sci8G2011), n=10)
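Since identify() only works on an interactive device, a non-interactive alternative may be useful when the plots are rendered to a file. The following is a minimal sketch, assuming we want to label the states with the largest gap between the two groups; the data frame toy, its values, and the choice of n are all hypothetical.

```r
# Made-up scores for five hypothetical states.
toy <- data.frame(Elig    = c(270, 265, 280, 260, 275),
                  NotElig = c(290, 288, 292, 291, 289),
                  row.names = c("A", "B", "C", "D", "E"))

# Pick the n states with the largest gap between the two groups.
n <- 2
gap <- toy$NotElig - toy$Elig
toLabel <- order(gap, decreasing = TRUE)[seq_len(n)]

plot(toy$Elig, toy$NotElig, pch = 19, col = "blue")
# pos = 3 places each label just above its point.
text(toy$Elig[toLabel], toy$NotElig[toLabel],
     labels = rownames(toy)[toLabel], pos = 3)
```

The same text() call could follow points() in any of the plots above, with the dataset's own columns and rownames() in place of the toy values.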

Mean-Difference Plots

As proposed by Prof. Carr, we will produce a mean-difference (MD) plot.
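An MD plot puts the mean of the two values on the x-axis and their difference on the y-axis, so the line of equality becomes the horizontal line at 0. A small worked sketch with made-up scores (the vectors elig and notElig are hypothetical) shows the two quantities:

```r
# Made-up average scale scores for three hypothetical states.
elig    <- c(270, 265, 280)
notElig <- c(290, 288, 292)

avg <- (elig + notElig) / 2   # x-axis: mean of the two group scores
dif <- elig - notElig         # y-axis: eligible minus not eligible

data.frame(avg, dif)          # one row per state
mean(dif)                     # height of the "average difference" line
```

A negative difference means the eligible group scored lower than the not-eligible group in that state.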

First, for the Math results:

textX="Mean of State Eligible and Not Eligible Average Scale Scores"
textY="Difference of State Eligible and Not Eligible Average Scale Scores"
textSub="Blue line: Average of State Differences"
math8G2011Ave <- (Math8G2011$Elig+Math8G2011$NotElig)/2
math8G2011Dif <- Math8G2011$Elig-Math8G2011$NotElig 
plotSetup(math8G2011Ave,math8G2011Dif,                          #variables to plot
    main="NAEP 8th Grade Math Performance by State, 2011",      #title  
    xlab=paste(textX),                                          #label for x
    ylab=paste(textY),                                          #label for y
    sub=paste(textSub))                                         #label for subtitle
abline(h=0,col="red",lwd=2) 
abline(h=mean(math8G2011Dif), col="blue", lwd=2)
points(math8G2011Ave,math8G2011Dif,pch=19,col="blue")
identify(math8G2011Ave,math8G2011Dif,labels=rownames(Math8G2011), n=10)

We continue with Reading:

read8G2011Ave <- (Read8G2011$Elig+Read8G2011$NotElig)/2
read8G2011Dif <- Read8G2011$Elig-Read8G2011$NotElig 
plotSetup(read8G2011Ave,read8G2011Dif,                          #variables to plot
    main="NAEP 8th Grade Reading Performance by State, 2011",   #title  
    xlab=paste(textX),                                          #label for x
    ylab=paste(textY),                                          #label for y
    sub=paste(textSub))                                         #label for subtitle
abline(h=0,col="red",lwd=2) 
abline(h=mean(read8G2011Dif), col="blue", lwd=2)
points(read8G2011Ave,read8G2011Dif,pch=19,col="blue")
identify(read8G2011Ave,read8G2011Dif,labels=rownames(Read8G2011), n=10)

And finally, Science:

sci8G2011Ave <- (Sci8G2011$Elig+Sci8G2011$NotElig)/2
sci8G2011Dif <- Sci8G2011$Elig-Sci8G2011$NotElig 
plotSetup(sci8G2011Ave,sci8G2011Dif,                            #variables to plot
    main="NAEP 8th Grade Science Performance by State, 2011",   #title  
    xlab=paste(textX),                                          #label for x
    ylab=paste(textY),                                          #label for y
    sub=paste(textSub))                                         #label for subtitle
abline(h=0,col="red",lwd=2) 
abline(h=mean(sci8G2011Dif), col="blue", lwd=2)
points(sci8G2011Ave,sci8G2011Dif,pch=19,col="blue")
identify(sci8G2011Ave,sci8G2011Dif,labels=rownames(Sci8G2011), n=10)