Getting started with R: Basic data management, descriptives, frequencies, plots, correlation, and simple regression

Set working directory

The first thing we want to do is set our working directory to the folder that contains our dataset. Though we won’t actually use this feature in this tutorial, it is a good habit. For example, if we ask R to save a file and don’t specify the exact directory (i.e., folder) we want it to go to, it will go to the default working directory folder.

# ask R what the current working directory is
getwd()

## [1] "C:/Users/Nicholas Bishop/OneDrive/work files/UArizona/intro grad stat/intro stat 24/tutorials"

# tell R what folder we want to use as our working directory.
setwd("C:/Users/Nicholas Bishop/OneDrive/work files/UArizona/intro grad stat/intro stat 24/tutorials")

# check that the 'setwd()' function worked
getwd()

## [1] "C:/Users/Nicholas Bishop/OneDrive/work files/UArizona/intro grad stat/intro stat 24/tutorials"

Installing and loading packages

These are the packages necessary to complete the following tutorial. You will first need to install each package, then load the package into working memory. For the purposes of this tutorial, I’ve commented out (using a hashtag #) the “if(!require() install.packages()” commands, but you will need to make sure the packages are installed on your machine. This form ““if(!require() install.packages()” of the “install.packages” function checks whether the package is installed on your machine, and installs if it is not already installed. If you want to manually install any package (perhaps to update, just use the “install.packages()” function. Note that the “install.packages()” function requires quotes around the package name inside parentheses, but the “library()” function does not include quotes around the package name. Also note that single or double quotes are interchangeable in R, but double quotes are preferred (though the package I’m using to create this tutorial turns double quotes to single quotes in R comments…weird).

After running this block of code, we will see several messages about whether the packages loaded successfully and the version of each package. We don’t worry about this much unless there is some kind of warning (more on that later).

# install necessary packages

# if(!require('tidyverse') install.packages('tidyverse')

# if(!require('vtable') install.packages('vtable')

# if(!require('haven') install.packages('haven')

# if(!require('janitor') install.packages('janitor')

# if(!require('sjPlot') install.packages('sjPlot')

# load necessary packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(vtable)

## Loading required package: kableExtra
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(haven)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(sjPlot)

Importing data

Here we are opening the dataset (2022 General Social Survey). The data is in a format not native to R (it is a SAS dataset), so we use the package “haven” to import the data as a R dataset.

# Load data (adjust the path to your data file) We create an object by pointing
# the function 'read_sas' to a new object named 'gss_data' using the arrow
# '<-'.

gss_data <- read_sas("C:/Users/Nicholas Bishop/OneDrive/work files/UArizona/intro grad stat/intro stat 24/data/2022/2022/gss2022.sas7bdat")

Cleaning data

After turning off scientific notation, we use the “clean_names” function from the “janitor” package to make sure all variable names are lower case. We then use the “select” function within the “dplyr” package (part of the “tidyverse” package) to select the variables we want to analyze. Note that by using “dplyr::select”, we are telling R that we want to use the “select” function within the “dplyr” package. This reduces confusion because sometimes when we have several packages open, they use the same function names which can cause errors.

This is also our first use of the pipe operator (%>%). This is a function in the “tidyverse” package that allows us to string together different functions. You can think of each object or operation as a bin of water, and the %>% allows the water to flow from one function to the next.

We then create a new variable inside the “mutate” function (using dplyr::mutate), then take a quick look at the data.

# turn off scientific notation, which is the default in R
options(scipen = 999)

# Make sure all variable names are lower case (case matters in R) Using the
# pipe operator '%>%', we ask that the 'gss_data' dataset is piped into the
# 'clean_names' function, then use the arrow to overwrite the original data.
# In the 'clean_names' function, we use the period to denote that the first
# argument for the function should be satisfied by the prior function.  In this
# case, the first argument of 'clean_names' is specifying what dataset should
# be used, so the period denotes that the data should be coming from 'up the
# pipe'.  The 'snake' statement is what the 'clean_names' function uses to
# denote lower case letters.
gss_data <- gss_data %>%
    clean_names(., "snake")

# Since we are only looking at two variables, let's create a clean dataset
# including only those measures.  Variable 'childs' is the reported number of
# children and variable 'educ' is years of education.  Here we are piping the
# dataset into the variable selection function of dplyr (dplyr::select), then
# creating a new dataset named 'gss_data_clean'.
gss_data_clean <- gss_data %>%
    dplyr::select(educ, childs)

# We need to create a new version of the 'childs' variable with the class
# factor that allows us to look at frequencies using the 'sumtable' function.
# In R, variables can be different types (which R calls classes).  We can ask R
# what class a specific variable is using the 'class' function.  To identify a
# specific variable in a dataset, we often use the form 'dataset$variable'.
# The dollar sign '$' is used to move down a level in an object, so here we go
# from the upper level dataset to the lower level variable.
class(gss_data_clean$childs)

## [1] "numeric"

# This tells us that the variable 'childs' is originally of class 'numeric'.

# Another way to look at the variable classes would be to use the 'vtable'
# function, which give us a list of variables and attributes including class
# for all variables in the dataset.
vtable(gss_data_clean)

gss_data_clean
Name	Class	Label	Values
educ	numeric	Highest year of school completed	Num: 0 to 20
childs	numeric	Number of children	Num: 0 to 8

# This is our first taste of recoding using the 'dplyr::mutate' function, which
# is the function we will use for most of our variable recoding.  Here we use
# the 'as.factor' function to create a new variable 'childs_f', which is a
# factor version of the original variable 'childs'.  Notice we are creating a
# new dataset named 'gss_data_clean2'.
gss_data_clean2 <- gss_data_clean %>%
    dplyr::mutate(childs_f = as.factor(childs))

# Let's use vtable to see that we did in fact create a new variable that is of
# class factor
vtable(gss_data_clean2)

gss_data_clean2
Name	Class	Label	Values
educ	numeric	Highest year of school completed	Num: 0 to 20
childs	numeric	Number of children	Num: 0 to 8
childs_f	factor	NULL	‘0’ ‘1’ ‘2’ ‘3’ ‘4’ and 4 more

# We can look at the first few rows of the dataset, though this isn't very
# helpful
head(gss_data_clean2)

## # A tibble: 6 × 3
##    educ childs childs_f
##   <dbl>  <dbl> <fct>   
## 1    16      1 1       
## 2    18      2 2       
## 3    12      1 1       
## 4    16      0 0       
## 5    14      2 2       
## 6    12      0 0

# Not shown here, but we can also open up the complete dataset using view()
# view(gss_data_clean2)

Descriptive statistics

We will now look at descriptive statistics using the “sumtable” function of the package “vtable”. There are several options within sumtable, but for now we just want to state what dataset is being used (gss_data_clean2), the variables we want to analyze (“educ”, “childs”, “childs_f”), and the number of digits we want the statistics rounded to. This is our first use of the combine function (c(“…”, “…”)), which is used to combine values or object into a list or vector. Here were are creating a list of three variables c(“educ”, “childs”, “childs_f”). Note the variables must be inside quotes.

# Descriptive statistics for 'educ' and 'childs' using vtable
sumtable(gss_data_clean2, vars = c("educ", "childs", "childs_f"), digits = 4)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
educ	4120	14.15	2.908	0	12	16	20
childs	4136	1.735	1.685	0	0	3	8
childs_f	4136
… 0	1309	31.65%
… 1	651	15.74%
… 2	1041	25.17%
… 3	603	14.58%
… 4	288	6.96%
… 5	112	2.71%
… 6	52	1.26%
… 7	30	0.73%
… 8	50	1.21%

Since we have already selected only two variables from the original dataset (“childs”, “educ”) and created one new variable (“childs_f”), we can create the same table without needing to use the list option.

# Descriptive statistics for 'educ' and 'childs' using vtable
sumtable(gss_data_clean2, digits = 4)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
educ	4120	14.15	2.908	0	12	16	20
childs	4136	1.735	1.685	0	0	3	8
childs_f	4136
… 0	1309	31.65%
… 1	651	15.74%
… 2	1041	25.17%
… 3	603	14.58%
… 4	288	6.96%
… 5	112	2.71%
… 6	52	1.26%
… 7	30	0.73%
… 8	50	1.21%

We see that the average (or mean) years of education is 14.15 and the standard deviation for years of education is 2.908. We can also see that the mean number of children is 1.735 and the standard deviation for number of children is 1.685. Finally, by recoding number of children as a factor variable, we get to see the frequency distribution for number of children. This gives us more information about the distribution of this variable. For example, we can see that around 32% of respondents (n = 1,309) had no children.

Simple histogram

We will use the “hist” function in base R to create a simple histogram. Note that when I say base R, I mean the functions that exist within R that don’t require another package to be loaded.

hist(gss_data_clean2$childs)

This is a visualization of the frequency distribution we created using the “sumtable” function.

Histogram with density curve

The graphing capabilities of base R are limited, so we will want to use the “ggplot” function within the “tidyverse” package to create more elegant visualizations.

We will talk about ggplot code more this semester, but it is helpful to think of it as an additive graphing tool, where we specify separate components of a graph, then use an addition symbol (+) to layer these components. Here we use the pipe to bring in the data, then tell ggplot that we want the aesthetic (aes) to place the variable “childs” on the x-axis. We then specify that we want a histogram that creates a density plot on the y-axis, with a specific number of bins (can think of as columns), the fill color of the bins, and the outline color of the bins. We then get fancy and apply a statistical function to draw a continuous curve using the observed distribution of the variable (based on the mean and standard deviation). Finally we add labels to the graph.

# Create a histogram for 'childs' with a normal curve
gss_data_clean2 %>%
    ggplot(aes(x = childs)) + geom_histogram(aes(y = after_stat(density)), bins = 10,
    fill = "skyblue", color = "black") + stat_function(fun = dnorm, args = list(mean = mean(gss_data_clean2$childs,
    na.rm = TRUE), sd = sd(gss_data_clean2$childs, na.rm = TRUE)), color = "red",
    linewidth = 1) + labs(title = "Number of Children Histogram with Normal Curve",
    x = "Children", y = "Density")

## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).

Boxplot

The boxplot is a visualization of the 5-number summary, including the minimum, quartile 1 (Q1), quartile 2 (Q2; median), quartile 3 (Q3), and maximum. The whiskers are the minimum and maximum, and the box represents Q1 through Q3, with a line through the box identifying the median. Note we are not using ggplot here, but the boxplot function from base R. We can create fancy boxplots in ggplot though.

# Boxplot for education variable
boxplot(gss_data_clean2$educ, main = "Boxplot of Education", ylab = "Years of Education")

Scatterplot with regression line

Using the GSS 2022 dataset, let’s test whether the respondent’s highest year of education completed (variable name “educ”) predicts the number of children they have (variable name “childs”).

First we will create a scatter plot to visualize the association between number of children and education. Using ggplot, we identify which variable to place on the x and y axes of the graph. We then use “geom_point” to specify that we want each observation to have its own dot (or point), then we use “geom_smooth” to create a line of best fit using a linear regression model (this is denoted as lm throughout R). Finally we add labels.

# Scatterplot with regression line for 'educ' and 'childs'
gss_data_clean2 %>%
    ggplot(aes(x = educ, y = childs)) + geom_point() + geom_smooth(method = "lm",
    se = FALSE, color = "blue") + labs(title = "Scatterplot of Education vs Children",
    x = "Years of Education", y = "Number of Children")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 41 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 41 rows containing missing values or values outside the scale range
## (`geom_point()`).

This shows us that there is a generally negative association between number of children and years of education. The line of best fit helps guide our interpretation of the graph.

Correlation

To quantity the association between number of children and years of education, we will first look at the correlation between these measures. We again use the $ sign to specify the dataset and variables to be used in the correlation. We then include the “use =”complete.obs”” statement to specify that we only want to include cases that have valid, non-missing information on both of the variables.

# Correlation between 'educ' and 'childs'
correlation <- cor.test(gss_data_clean2$educ, gss_data_clean2$childs, use = "complete.obs")
correlation

## 
##  Pearson's product-moment correlation
## 
## data:  gss_data_clean2$educ and gss_data_clean2$childs
## t = -9.939, df = 4106, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1829979 -0.1232708
## sample estimates:
##        cor 
## -0.1532743

# checking the number of observations used to calculate the statistic
num_obs <- sum(complete.cases(gss_data_clean2$educ, gss_data_clean2$childs))
num_obs

## [1] 4108

The correlation (found under section “sample estimates:) is -.015, which confirms a negative association between number of children and years of education. We aren’t to the point of understanding p values, but we want to remember that when p is less than .05 (p < .05) in a correlation, there is a statistically significant association between the two variables. This basically means we having compelling evidence that the association between the measures is not zero.

We also want to know exactly how many cases were used to calculate the correlation coefficient, so we use the “sum” function to identify the number of observations used in the calculation. We see that there were 4,108 observations used in the analysis. This means there were 4,108 respondents who answered both questions about number of children and years of education.

Linear regression

We now want to use regression to examine the association between number of children and years of education, using the independent variable (the variable labeled x in the regression equation) to predict the dependent variable (the variable labeled y in the regression equation).

The way R calculates regression is using the “lm” function, which stands for linear model. The basic setup is (lm(dependent_variable ~ independent_variable, data = dataset_name)). We then create a new object named “model” to store the results of the regression model. Note we can name this object anything we want.

We use the function “summary()” to give us an quick overall summary of the results, and we also want to verify the number of observations included in the model using the “nobs()” function.

Finally we use the “tab_model” function from the sjPlot package to create a nice looking html table that contains the regression estimates, sample size used to calculate the regression estimates, and the r-squared and adjusted r-squared values. Notice that the F-statistic, associated degrees of freedom, and p value for the F test statistic are not included in the “tab_model” table, so you’ll need to pull those from the “summary()” function for the write-up.

# Linear regression model
model <- lm(childs ~ educ, data = gss_data_clean2)

# summary() allows us to examine the results of the regression model
summary(model)

## 
## Call:
## lm(formula = childs ~ educ, data = gss_data_clean2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9675 -1.5622  0.0864  1.0864  6.7891 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.967526   0.127661  23.245 <0.0000000000000002 ***
## educ        -0.087831   0.008837  -9.939 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.647 on 4106 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.02349,    Adjusted R-squared:  0.02326 
## F-statistic: 98.78 on 1 and 4106 DF,  p-value: < 0.00000000000000022

# checking the number of observations used in the regression model
nobs(model)

## [1] 4108

# create nice looking html table using tab_model
tab_model(model, digits = 3)

	Number of children
Predictors	Estimates	CI	p
(Intercept)	2.968	2.717 – 3.218	<0.001
Highest year of school completed	-0.088	-0.105 – -0.071	<0.001
Observations	4108
R² / R² adjusted	0.023 / 0.023

The results on the F-statistic line produced by “summary(model)” is what SPSS labels the ANOVA table. The ANOVA table tells us whether the independent variable(s) significantly predict the dependent variable. The F-test statistic is 98.78 and the p value is less than .05 (actually p < .001), meaning that variation in number of children is significantly associated with education level. The regression has 1 degree of freedom (df) and the residual has 4,106 df. In our interpretation, we write this as F(1, 4,106).

The values under the “Coefficients” heading provide the estimated values for a (our constant or intercept) and b (our slope), as well as tells us if we can reject the null hypothesis H₀: β = 0 (see page 611 in BPoS 8th ed. for explanation of the t statistic for regression hypothesis testing). We can see that the t test for the slope of highest year of school completed is statistically significant (t = -9.939, p <.001), meaning that we can reject the null hypothesis that the slope of the regression is equal to zero (meaning we reject the null hypothesis that there is no association between number of children and years of education).

For our regression estimates, we have a constant (also called the intercept) of 2.968 (a = 2.968) and a slope of -.088 (b = -.088). The regression coefficient for highest year of school completed tells us how much change in the dependent or response variable (number of children) can be attributed to a one unit change in the independent or explanatory variable (years of school). We know that the correlation r is negative, so as the number of years of school increases, the number of expected children decreases. With our slope coefficient of -.088, we can say that for every one unit increase in highest years of education completed (or every one-year increase in highest year of school completed because the unit of measurement is years), there is a .088 unit DECREASE in the predicted number of children.

The R-square (R²) statistic tells us what proportion (or percent if we multiply by 100) of the variation in the dependent variable can be attributed to the independent variable. When possible, we always want to report the adjusted R² which accounts for sample size. With an adjusted R² of .023, we can say that 2.3% (.023*100 = 2.3) of the variation in number of children can be attributed to the respondent’s highest year of education completed.

Here is how we would write the results in APA format:

Using data from the 2022 General Social Survey, a linear regression was conducted to determine whether the respondent’s highest year of school completed predicted the respondent’s number of children. There was a significant association between the respondent’s highest year of school completed and the respondent’s number of children, F(1, 4106) = 98.78, p < .001. The regression equation for predicting the number of children was:

\[ \hat{y} = 2.968 - 0.088 \times (\text{highest year of school completed}) \]

The adjusted R² value was .023, indicating that 2.3% of the variance in the number of children was explained by the respondent’s highest year of school completed. For each additional year of education completed, the number of children was expected to decrease by 0.088. These findings suggest that individuals with more education may be more informed about birth control methods, which could result in having fewer children compared to those with less education.