Okay, finally, some data analysis. Typically, after the data is cleaned, the first step of a data analysis is inspecting the data and presenting summary statistics. That is the point of this tutorial.

For this tutorial, we will be using the same apidata_2012_so.csv dataset. If you haven’t already done so, set your working directory and read in the data.

# Sets working directory
setwd("C:/Users/Richard/Desktop/R Introduction")
# Reads in data file
api <- read.csv("apidata_2012_so.csv")

Summary statistics

Most of the various summary statistics that you might consider important for a data analysis (e.g., mean, standard deviation) have their own commands. For example, see below how we get the mean (mean()) and standard deviation (sd()) of the cst28_engl variable

# Get the mean of cst28_engl
mean(api$cst28_engl, na.rm=TRUE)
[1] 3015.405
# Get the standard deviation
sd(api$cst28_engl, na.rm=TRUE)
[1] 10762.86

Note that both of them include an argument called na.rm, which I set equal to TRUE. NA is how R codes missing data. If you look in the help files for mean() or sd, you will see that the usage section includes the argument na.rm = FALSE. This means that by default, the na.rm argument is set to FALSE. The problem here is that if there is missing data in your variable, by default, it will not remove that missing data first. (Why? I don’t know. I didn’t come up with the rules.) You can see what happens if you try to take the mean of a variable with missing data.

# Get the mean of cst28_engl
mean(api$cst28_engl)
[1] NA

It can’t take the mean if there’s missing data, so it spits back NA as the result. As a side note, you can count the number of missing values.

# Count the number of missing values
sum(is.na(api$cst28_engl))
[1] 27

This is the first example we’ve seen of what we call “nested” function calls. Above, there are two functions: is.na() and sum(). Essentially, the is.na() command takes a list of values and gives back a list of 1’s and 0’s for whether the values are missing. The sum() command takes a list of values and adds them together. Now, we could have done the above in two steps, where we saved the list of missing indicators in an object and then used that object in the next line. See below.

# Save a list if missingness indicators
miss <- is.na(api$cst28_engl)
# Add up the number of 1's
sum(miss)
[1] 27

Notice that we get the same answer. In the first setup, though, we skipped the middleman. We put the is.na() function call right in where the argument goes for sum(), so it created the list of missingnes indicators and then immediately took the sum. Cool, right?

What is the mean of the cst911_engl variable? What is the standard deviation?

Mean and SD:

The summary() command

Pulling each one of the different summary statistics you want for your data seems like a hassle. Maybe you just want to pull all the summary statistics at once. Enter, the summary() command.

The summary() command is very versatile. What it spits back out depends on the type of argument it is given. For example, let’s use it on a continuous variable like cst28_engl.

# Get the summary of cst28_engl
summary(api$cst28_engl)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0     115     707    3015    3016  303153      27 

It gives us a lot of information on the data.

It also works for categorical variables (that are already formatted as characters).

# Get the summary of districttype
summary(api$districttype)
Elementary       High    Unified 
       538         78        400 

Here, it gave us a frequency list. You could include multiple variables.

# Get the summaries of two variables at once
summary(api[c("districttype", "cst28_engl")])
     districttype   cst28_engl    
 Elementary:538   Min.   :     0  
 High      : 78   1st Qu.:   115  
 Unified   :400   Median :   707  
                  Mean   :  3015  
                  3rd Qu.:  3016  
                  Max.   :303153  
                  NA's   :27      

In fact, you could grab all of the statistics at once.

# Just go crazy
summary(api)
   districtid           year                         districtname 
 Min.   : 110017   Min.   :2012   Jefferson Elementary     :   3  
 1st Qu.:1764069   1st Qu.:2012   Pioneer Union Elementary :   3  
 Median :3238646   Median :2012   Hope Elementary          :   2  
 Mean   :3159407   Mean   :2012   Junction Elementary      :   2  
 3rd Qu.:4569959   3rd Qu.:2012   Lakeside Union Elementary:   2  
 Max.   :5872769   Max.   :2012   Liberty Elementary       :   2  
                                  (Other)                  :1002  
     districttype      api          cst28_engl      cst911_engl    
 Elementary:538   Min.   :336.0   Min.   :     0   Min.   :     0  
 High      : 78   1st Qu.:751.0   1st Qu.:   115   1st Qu.:     0  
 Unified   :400   Median :798.0   Median :   707   Median :     3  
                  Mean   :792.4   Mean   :  3015   Mean   :  1278  
                  3rd Qu.:848.0   3rd Qu.:  3016   3rd Qu.:   888  
                  Max.   :973.0   Max.   :303153   Max.   :113705  
                  NA's   :27      NA's   :27       NA's   :27      
   cst28_math      cst911_math         numstu          pctfrpl      
 Min.   :     0   Min.   :     0   Min.   :    11   Min.   :  0.00  
 1st Qu.:   114   1st Qu.:     0   1st Qu.:   220   1st Qu.: 37.00  
 Median :   706   Median :     3   Median :  1152   Median : 56.00  
 Mean   :  3012   Mean   :  1206   Mean   :  4299   Mean   : 54.82  
 3rd Qu.:  3012   3rd Qu.:   813   3rd Qu.:  4304   3rd Qu.: 76.00  
 Max.   :302887   Max.   :106209   Max.   :417643   Max.   :100.00  
 NA's   :27       NA's   :27       NA's   :27       NA's   :1       
     pctell          pctmin         region   
 Min.   : 0.00   Min.   :  0.00   North:308  
 1st Qu.: 4.00   1st Qu.: 34.00   South:708  
 Median :13.00   Median : 58.00              
 Mean   :18.06   Mean   : 58.33              
 3rd Qu.:27.00   3rd Qu.: 84.00              
 Max.   :83.00   Max.   :100.00              
 NA's   :1       NA's   :1                   

There are other ways to get these statistics that are perhaps closer to what we would want to present in a quantitative paper, but we will get to that later.

Bivariate statistics

Often, before conducting regression analysis, quantitative papers will present a series of bivariate relationships in order to establish unconditional relationships. Here are some common ones.

T-test (independent and dependent)

You can run an independent sample t-test across groups using the t.test() command. You have to reference the two groups separately.

# Conduct an independent sample t-test of the cst28_engl variable across district type
t.test(api$cst28_engl[api$districttype=="Elementary"],
       api$cst28_engl[api$districttype=="Unified"])

    Welch Two Sample t-test

data:  api$cst28_engl[api$districttype == "Elementary"] and api$cst28_engl[api$districttype == "Unified"]
t = -4.4132, df = 409.33, p-value = 1.304e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5351.465 -2053.197
sample estimates:
mean of x mean of y 
 1627.809  5330.140 

First of all, for educational purposes, notice that this is the first time we have seen a function go across two lines, which I typed this way for space reasons. In R, as you know, a function call begins with FunctionName( and ends with a ). R knows not to actually run the function until it sees that final ). So you can go on as many lines as you want, and it won’t run anything until you type the final ).

In addition, let’s explain api$cst28_engl[api$districttype=="Elementary"]. You should read this as “The values of cst28_engl in api for observations where districttype (also in api) is equal to ‘Elementary’.” The t.test() command takes two arguments at its most basic, and it conducts an independent sample t-test on those variables.

What would a dependent sample t-test comparing cst911_engl across the values of “North” and “South” in the region variable look like?

Run:

You can run also an dependent sample t-test between two variables. You have to reference the two variables separately and specify that it is paired.

# Run a dependent sample t-test on cst28_engl versus cst28_math
t.test(api$cst28_engl, api$cst28_math, paired=TRUE)

    Paired t-test

data:  api$cst28_engl and api$cst28_math
t = 9.3784, df = 988, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.664109 4.074010
sample estimates:
mean of the differences 
                3.36906 

Can you run a dependent sample t-test of pctmin versus pctfrpl?

Run:

One-way ANOVA

Though it is not used often, you can indeed do one-way analysis of variance in R. However, I will note that the results you get in terms of statistical significance are identical to those you could get from running a linear model. See below for the code to run a one-way ANOVA of the cst28_engl variable on districttype.

# Run the one-way ANOVA model
aov(cst28_engl ~ districttype, data=api)
Call:
   aov(formula = cst28_engl ~ districttype, data = api)

Terms:
                districttype    Residuals
Sum of Squares    3592084560 110857037508
Deg. of Freedom            2          986

Residual standard error: 10603.35
Estimated effects may be unbalanced
27 observations deleted due to missingness

There are three important differences between this command and the other commands we see in this tutorial that are important to note. They are especially important because we will see them pop up again in the next discussion of linear regression.

  1. Note the ~ operator setup. You should read this as “A one-way analysis of variance of the cst28_engl variable across values of the districttype variable.” Essentially, it is setting up a model where there is a dependent variable of interest, cst28_engl, and an independent variable of interest, districttype.
  2. We did not have to refer to each variable by the object name, api. Instead, we set the data that the variables are coming from in a separate argument called data. This is something unique to functions that are running models, as we will see again when doing linear regression.
  3. Notice that on its face, the result from the aov() command are not particularly informative. To get the full results of one of these models (again, similar to we will see with regression), you need to apply the summary() command, as shown below.
# Display full ANOVA results
summary(aov(cst28_engl ~ districttype, data=api))
              Df    Sum Sq   Mean Sq F value   Pr(>F)    
districttype   2 3.592e+09 1.796e+09   15.97 1.49e-07 ***
Residuals    986 1.109e+11 1.124e+08                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
27 observations deleted due to missingness

Note that this is another one of those nested command structures from before. We could have done this on two lines if we wanted to. What would that look like?

Line 1:

Line 2:

Chi-squared test

You can test for the statistical relationship between two categorical variables using the chisq.test() function. At it’s base, it takes two variables as input.

chisq.test(api$districttype, api$region, correct=FALSE)

    Pearson's Chi-squared test

data:  api$districttype and api$region
X-squared = 30.994, df = 2, p-value = 1.861e-07

Notice that we include the correct=FALSE argument in order to make sure that R does not automatically (as you can see in the help file) apply a continuity correction.

Correlation

You can calculate the correlation between two variables specifically with the cor() command.

# Get a correlation of cst28_engl and cst28_math
cor(api$cst28_engl, api$cst28_math, use="complete.obs")
[1] 0.9999999

Here, you enter the two variables separately as separate arguments. In addition, note the complete.obs setting. Like the mean() and sd() functions from before, cor() by default does not omit missing data. You need to include use="complete.obs" to make cor() omit missing data.

To make a correlation matrix of several variables, the command is the same except instead of entering the variables separately, you create a subset of the variables, and just enter them all at once.

# Get a subset of variables
correlationsubset <- api[c("cst28_engl", "cst28_math", "pctfrpl")]
# Get the correlation matrix
cor(correlationsubset, use="complete.obs")
           cst28_engl cst28_math    pctfrpl
cst28_engl 1.00000000 0.99999989 0.04835997
cst28_math 0.99999989 1.00000000 0.04836393
pctfrpl    0.04835997 0.04836393 1.00000000

Note that we could have done the above all in one line as shown below.

# Get the correlation matrix
cor(api[c("cst28_engl", "cst28_math", "pctfrpl")], use="complete.obs")
           cst28_engl cst28_math    pctfrpl
cst28_engl 1.00000000 0.99999989 0.04835997
cst28_math 0.99999989 1.00000000 0.04836393
pctfrpl    0.04835997 0.04836393 1.00000000

What line of code could produce a correlation between cst911_engl and cst911_math?

Run:

Next steps

Forward! Onto linear regression: http://rpubs.com/rslbliss/r_regression_ws