Introduction to R

This example uses a subset of the 2006 Houston Area Survey, namely, the oversample of Hispanic adults.

Research Question: To what extent is living in racially diverse neighborhoods associated with having more close interracial (i.e., non- Hispanic) friends among Hispanic/Latinx adults?

Hypothesis: Hispanic adults who live in more racially diverse neighborhoods have more close interracial friendships.

Focal relationship DV: number of close interracial friendships among three closest friendships (“interfriends”)

IV: neighborhood racial diversity, operationalized as the entropy index (“entropy” and a version collapsed into quartiles, “entropy_cat”)

First, we’ll use the read_dta() and readxl() functions to import various versions of the data into R Studio.

Then, we’ll examine the entire univariate distribution for each variable used to measure our focal relationship BEFORE examining summary stats like means, medians, quartiles, standard deviations, etc.

(Why do you think this is important?)

Finally, we’ll examine the entire bivariate distribution for these variables before having R calculate some summary stats that provide measures of bivariate association.

This example makes use of several “packages” available through the Comprehensive R Archive Network and various associated “mirror” sites. Accordingly, we’ll need to make sure these packages are installed prior to running any of the code in this R Markdown document.

Toward that end, check whether the following packages are installed on your machine. If any of them are not, enter the commands below directly at the prompt in the Console window below:

install.packages(“haven”) install.packages(“readxl”) install.packages(“car”) install.packages(“janitor”) install.packages(“DescTools”)

As you will see, we’ll still need to use a series of “library()” commands to make sure that these packages are available in the project environment even if after they’ve been installed on your machine.

Overview of the Data

Excel version

Now that we’ve run the above commands, we can see that there are two different versions of a dataset with 500 observations, each of which has 26 variables.

To begin to get a sense of what’s included in these data, we’ll use the summary function:

library(car)
## Loading required package: carData
brief(has06hisp_if_xl)
## # A tibble: 500 × 26
##       id ethfrnds ethfrnd1 ethfrnd2 ethfrnd3 selfborn parsborn hisprace income8 
##    <dbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
##  1 21801 1 [yes]  NA       NA       NA       1 [yes]  2 [both]  5 [som… -1 [rf] 
##  2 21416 1 [yes]  NA       NA       NA       1 [yes]  2 [both] NA        1 [les…
##  3 21021 0 [no]    2 [bla…  1 [ang…  3 [His… 1 [yes]  1 [one]   5 [som…  3 [$25…
##  4 20977 1 [yes]  NA       NA       NA       0 [no]   0 [neit…  1 [whi…  2 [$12…
##  5 21612 0 [no]    2 [bla…  2 [bla…  3 [His… 1 [yes]  2 [both]  5 [som…  8 [mor…
##  6 21768 1 [yes]  NA       NA       NA       0 [no]   0 [neit…  5 [som… -2 [dk] 
##  7 20934 0 [no]    1 [ang…  1 [ang…  5 [oth… 1 [yes]  2 [both]  5 [som…  3 [$25…
##  8 21770 1 [yes]  NA       NA       NA       1 [yes]  2 [both]  5 [som…  4 [$37…
##  9 21754 1 [yes]  NA       NA       NA       0 [no]   0 [neit…  5 [som…  1 [les…
## 10 21482 1 [yes]  NA       NA       NA       1 [yes]  2 [both] NA        5 [$50…
## # ℹ 490 more rows
## # ℹ 17 more variables: havinsur <dbl+lbl>, educ <dbl+lbl>, educ2 <dbl+lbl>,
## #   age <dbl+lbl>, famsize <dbl+lbl>, female <dbl+lbl>, charity3 <dbl+lbl>,
## #   zipcode <dbl>, livezips <dbl+lbl>, blackb <dbl>, hispb <dbl>, anglob <dbl>,
## #   asianb <dbl>, otherb <dbl>, interfriends <dbl>, entropy <dbl>,
## #   entropy_cat <dbl>
summary(has06hisp_if_dta)
##        id           ethfrnds         ethfrnd1        ethfrnd2    
##  Min.   :20243   Min.   :-1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:20941   1st Qu.: 0.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :21297   Median : 1.000   Median :2.000   Median :2.000  
##  Mean   :21234   Mean   : 0.584   Mean   :1.974   Mean   :2.155  
##  3rd Qu.:21633   3rd Qu.: 1.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :21801   Max.   : 1.000   Max.   :5.000   Max.   :5.000  
##                  NA's   :12       NA's   :307     NA's   :307    
##     ethfrnd3         selfborn        parsborn         hisprace     
##  Min.   :-1.000   Min.   :0.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.:0.000   1st Qu.: 0.000   1st Qu.: 5.000  
##  Median : 2.000   Median :1.000   Median : 1.000   Median : 5.000  
##  Mean   : 2.218   Mean   :0.688   Mean   : 0.894   Mean   : 4.229  
##  3rd Qu.: 3.000   3rd Qu.:1.000   3rd Qu.: 2.000   3rd Qu.: 5.000  
##  Max.   : 5.000   Max.   :1.000   Max.   : 2.000   Max.   : 5.000  
##  NA's   :307                                       NA's   :24      
##     income8          havinsur           educ            educ2       
##  Min.   :-2.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.:-1.000   1st Qu.: 0.000   1st Qu.: 3.000   1st Qu.: 2.000  
##  Median : 3.000   Median : 1.000   Median : 3.000   Median : 2.000  
##  Mean   : 2.702   Mean   : 0.704   Mean   : 3.764   Mean   : 2.428  
##  3rd Qu.: 5.000   3rd Qu.: 1.000   3rd Qu.: 5.000   3rd Qu.: 3.000  
##  Max.   : 8.000   Max.   : 1.000   Max.   : 9.000   Max.   : 4.000  
##                                                                     
##       age           famsize          female         charity3     
##  Min.   :-1.00   Min.   :1.000   Min.   :0.000   Min.   :-1.000  
##  1st Qu.:26.00   1st Qu.:3.000   1st Qu.:0.000   1st Qu.: 0.000  
##  Median :39.00   Median :4.000   Median :1.000   Median : 1.000  
##  Mean   :39.98   Mean   :3.776   Mean   :0.588   Mean   : 0.554  
##  3rd Qu.:52.00   3rd Qu.:5.000   3rd Qu.:1.000   3rd Qu.: 1.000  
##  Max.   :85.00   Max.   :8.000   Max.   :1.000   Max.   : 1.000  
##                                                                  
##     zipcode         livezips         blackb            hispb       
##  Min.   :77002   Min.   :1.000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:77018   1st Qu.:1.000   1st Qu.:0.00382   1st Qu.:0.2622  
##  Median :77035   Median :2.000   Median :0.04509   Median :0.5999  
##  Mean   :77108   Mean   :1.949   Mean   :0.13683   Mean   :0.5542  
##  3rd Qu.:77084   3rd Qu.:3.000   3rd Qu.:0.16642   3rd Qu.:0.8726  
##  Max.   :77660   Max.   :3.000   Max.   :0.93147   Max.   :0.9909  
##  NA's   :2       NA's   :12      NA's   :18        NA's   :18      
##      anglob            asianb             otherb          interfriends   
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.04821   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.16505   Median :0.001741   Median :0.008429   Median :0.0000  
##  Mean   :0.27052   Mean   :0.026492   Mean   :0.011955   Mean   :0.7469  
##  3rd Qu.:0.41137   3rd Qu.:0.029560   3rd Qu.:0.018264   3rd Qu.:2.0000  
##  Max.   :0.97905   Max.   :0.445833   Max.   :0.075335   Max.   :3.0000  
##  NA's   :18        NA's   :18         NA's   :18         NA's   :18      
##     entropy         entropy_cat   
##  Min.   :0.05205   Min.   :1.000  
##  1st Qu.:0.40697   1st Qu.:1.250  
##  Median :0.71513   Median :2.000  
##  Mean   :0.71141   Mean   :2.488  
##  3rd Qu.:0.97389   3rd Qu.:3.000  
##  Max.   :1.52328   Max.   :4.000  
##  NA's   :18        NA's   :18

Now, we have some simple summary stats for each of the 26 variables included in the dataset, which importantly includes not only the minimum and maximum values and the 1st, 2nd (median), and 3rd quartiles, but also the number of “NA’s”, which is the symbol that R typically uses for missing values. Several of the variables appear to have no missing values, but the variable called “ethfrnds” has 12 missing values, the ethfrnd# (i.e., ethfrnd1 and so on) variables have 307 missing values, zipcode has 2 missing values, livezips has 12, and several of the variables have 18 missing values.

We can also confirm that the version of the data we imported from Stata format has the same basic information included

Stata version

library(car)
brief(has06hisp_if_dta)
## # A tibble: 500 × 26
##       id ethfrnds ethfrnd1 ethfrnd2 ethfrnd3 selfborn parsborn hisprace income8
##    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
##  1 21801        1       NA       NA       NA        1        2        5      -1
##  2 21416        1       NA       NA       NA        1        2       NA       1
##  3 21021        0        2        1        3        1        1        5       3
##  4 20977        1       NA       NA       NA        0        0        1       2
##  5 21612        0        2        2        3        1        2        5       8
##  6 21768        1       NA       NA       NA        0        0        5      -2
##  7 20934        0        1        1        5        1        2        5       3
##  8 21770        1       NA       NA       NA        1        2        5       4
##  9 21754        1       NA       NA       NA        0        0        5       1
## 10 21482        1       NA       NA       NA        1        2       NA       5
## # ℹ 490 more rows
## # ℹ 17 more variables: havinsur <dbl>, educ <dbl>, educ2 <dbl>, age <dbl>,
## #   famsize <dbl>, female <dbl>, charity3 <dbl>, zipcode <dbl>, livezips <dbl>,
## #   blackb <dbl>, hispb <dbl>, anglob <dbl>, asianb <dbl>, otherb <dbl>,
## #   interfriends <dbl>, entropy <dbl>, entropy_cat <dbl>
summary(has06hisp_if_dta)
##        id           ethfrnds         ethfrnd1        ethfrnd2    
##  Min.   :20243   Min.   :-1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:20941   1st Qu.: 0.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :21297   Median : 1.000   Median :2.000   Median :2.000  
##  Mean   :21234   Mean   : 0.584   Mean   :1.974   Mean   :2.155  
##  3rd Qu.:21633   3rd Qu.: 1.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :21801   Max.   : 1.000   Max.   :5.000   Max.   :5.000  
##                  NA's   :12       NA's   :307     NA's   :307    
##     ethfrnd3         selfborn        parsborn         hisprace     
##  Min.   :-1.000   Min.   :0.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.: 1.000   1st Qu.:0.000   1st Qu.: 0.000   1st Qu.: 5.000  
##  Median : 2.000   Median :1.000   Median : 1.000   Median : 5.000  
##  Mean   : 2.218   Mean   :0.688   Mean   : 0.894   Mean   : 4.229  
##  3rd Qu.: 3.000   3rd Qu.:1.000   3rd Qu.: 2.000   3rd Qu.: 5.000  
##  Max.   : 5.000   Max.   :1.000   Max.   : 2.000   Max.   : 5.000  
##  NA's   :307                                       NA's   :24      
##     income8          havinsur           educ            educ2       
##  Min.   :-2.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.:-1.000   1st Qu.: 0.000   1st Qu.: 3.000   1st Qu.: 2.000  
##  Median : 3.000   Median : 1.000   Median : 3.000   Median : 2.000  
##  Mean   : 2.702   Mean   : 0.704   Mean   : 3.764   Mean   : 2.428  
##  3rd Qu.: 5.000   3rd Qu.: 1.000   3rd Qu.: 5.000   3rd Qu.: 3.000  
##  Max.   : 8.000   Max.   : 1.000   Max.   : 9.000   Max.   : 4.000  
##                                                                     
##       age           famsize          female         charity3     
##  Min.   :-1.00   Min.   :1.000   Min.   :0.000   Min.   :-1.000  
##  1st Qu.:26.00   1st Qu.:3.000   1st Qu.:0.000   1st Qu.: 0.000  
##  Median :39.00   Median :4.000   Median :1.000   Median : 1.000  
##  Mean   :39.98   Mean   :3.776   Mean   :0.588   Mean   : 0.554  
##  3rd Qu.:52.00   3rd Qu.:5.000   3rd Qu.:1.000   3rd Qu.: 1.000  
##  Max.   :85.00   Max.   :8.000   Max.   :1.000   Max.   : 1.000  
##                                                                  
##     zipcode         livezips         blackb            hispb       
##  Min.   :77002   Min.   :1.000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:77018   1st Qu.:1.000   1st Qu.:0.00382   1st Qu.:0.2622  
##  Median :77035   Median :2.000   Median :0.04509   Median :0.5999  
##  Mean   :77108   Mean   :1.949   Mean   :0.13683   Mean   :0.5542  
##  3rd Qu.:77084   3rd Qu.:3.000   3rd Qu.:0.16642   3rd Qu.:0.8726  
##  Max.   :77660   Max.   :3.000   Max.   :0.93147   Max.   :0.9909  
##  NA's   :2       NA's   :12      NA's   :18        NA's   :18      
##      anglob            asianb             otherb          interfriends   
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.04821   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.16505   Median :0.001741   Median :0.008429   Median :0.0000  
##  Mean   :0.27052   Mean   :0.026492   Mean   :0.011955   Mean   :0.7469  
##  3rd Qu.:0.41137   3rd Qu.:0.029560   3rd Qu.:0.018264   3rd Qu.:2.0000  
##  Max.   :0.97905   Max.   :0.445833   Max.   :0.075335   Max.   :3.0000  
##  NA's   :18        NA's   :18         NA's   :18         NA's   :18      
##     entropy         entropy_cat   
##  Min.   :0.05205   Min.   :1.000  
##  1st Qu.:0.40697   1st Qu.:1.250  
##  Median :0.71513   Median :2.000  
##  Mean   :0.71141   Mean   :2.488  
##  3rd Qu.:0.97389   3rd Qu.:3.000  
##  Max.   :1.52328   Max.   :4.000  
##  NA's   :18        NA's   :18

Since both datasets are essentially the same, from here on out we’ll concentrate on the version we imported from the Excel file.

Univariate analysis

Next, let’s take a closer look at each of the focal variables, focusing initially on their univariate distribution (i.e., we’re going to examine each variable one at a time). We’ll start with the dependent variable, our measure of how many of each respondents’ 3 closest friends in Houston were Hispanic.

DV

table_dv <- table(has06hisp_if_xl$interfriends)
table_dv
## 
##   0   1   2   3 
## 290  64  88  40
cstable_dv <- cumsum(table_dv)
cstable_dv
##   0   1   2   3 
## 290 354 442 482
data_freq_dv <- data.frame(Freq = as.numeric(table_dv),
                           Perc = round(as.numeric(table_dv/482)*100, 
                                        digits = 2 ),
                           CumFreq = as.numeric(cstable_dv),
                           CumPerc = round(as.numeric(cstable_dv/482)*100, 
                                           digits = 2))
rownames(data_freq_dv) <- 0:3
print(data_freq_dv)
##   Freq  Perc CumFreq CumPerc
## 0  290 60.17     290   60.17
## 1   64 13.28     354   73.44
## 2   88 18.26     442   91.70
## 3   40  8.30     482  100.00

The “dataframe” results reveal clearly that most respondents (about 60%) had no non-Hispanic / non-Latino friends, while roughly similar percentages had 1 or 2 non-Hispanic friends (13.3% and 18.3%, respectively), and only about 8% reported that all three of their closest 3 friends in Houston were non-Hispanic.

Now, that we’ve examined the overall distribution, we can look at summary stats:

summary(has06hisp_if_xl$interfriends)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.7469  2.0000  3.0000      18
sd(has06hisp_if_xl$interfriends, na.rm=TRUE)
## [1] 1.026793

Here, we see that the number of close non-Hispanic friends reported by our respondents ranges from 0 to 3, with a median value of zero and mean value of about .75. That tells us that this is a highly skewed distribution, in which more than half of our respondents report zero non-Hispanic friends, but a relatively small number of respondents report two or even three such friends, pulling the mean well above the median (toward the upper tail of the distribution). The standard deviation is very close to one, which we can somewhat loosely think of as telling us that each respondent differs from a typical respondent (as represented by the mean number of non-Hispanic friends) by one friend on average.

main IV

Now, let’s examine the univarate distribution for the main independent variable, our measure of racial/ethnic diversity in the neighborhood (blockgroup) where the respondent lived. While the DV in our analysis is a discrete (count) variable, our measure of neighborhood racial/ethnic diversity, “entropy”, is a continuous variable. So, it doesn’t make as much sense to examine the values this variable takes on individually to get a sense of what the overall distribution looks like. So instead, we’ll examine a histogram and a similar approach called a density plot:

with(has06hisp_if_xl, {
     hist(entropy, breaks="FD", 
     main="racial-ethnic diversity in neighborhood")
})

Now, we can see that our neighborhood racial-ethnic diversity measures takes on values ranging from close to 0 to just over 1.5. While the distribution isn’t exactly normal, it does have a clear peak in the center between .7 and .8, and extreme values are relatively unusual, although there is a sort of “mini-peak” between .1 and .2, suggesting that significant minority of Hispanic/Latinx Houston residents live in neighborhoods with relatively low diversity.

With a few more lines of code, we can explicitly compare the observed sample distribution of the entropy measure to a normal distribution with the same mean and standard deviation:

x1 <- has06hisp_if_xl$entropy
m_iv <- mean(x1, na.rm=TRUE)
sd_iv <- sd(x1, na.rm=TRUE)
with(has06hisp_if_xl, {
     hist(entropy, prob=TRUE, breaks="FD", 
     main="racial-ethnic diversity in neighborhood")
  curve(dnorm(x, mean=m_iv, sd=sd_iv), lwd=2, add=TRUE)
})

So now, we can see even more clearly that the distribution is approximately but not exactly normal.

In any case, now that we’ve gotten a quick overview of the overall distribution, it makes sense to look more closely at the summary stats for this variable:

summary(has06hisp_if_xl$entropy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.05205 0.40697 0.71513 0.71141 0.97389 1.52328      18
sd(has06hisp_if_xl$entropy, na.rm=TRUE)
## [1] 0.3639529

Consistent with our examination of the overall distribution via the histogram, we can see that our measures ranges from a low of about .05 to a high of just over 1.5, with very similar mean and median values.The standard deviation is about 0.36, which we can think of as telling us that any given case differs from the average (i.e., mean) value by about .36. Note that this is approximately equal to 25% of the variable’s range of just under 1.5 (1.47 to be relatively exact). This suggests, consistent with histogram results, that the cases are fairly evenly spread across the four quartiles.

##Bivariate Distribution - Focal Relationship

Now, we’re ready to examine the bivariate distribution of the variables we’re using to measure the focal relationship. Let’s start by examining a scatterplot. A scatterplot is NOT the best way to examine a bivariate distribution when one of the variables is discrete and has a narrow range, as is the case for our DV here. As we’ve seen, it only takes on integer (i.e., whole number) values from 0 to 3, reflecting the nature of the variable (i.e., it’s a count of the number of the respondent’s three closest friends who were not Hispanic or Latinx). But let’s look at the scatterplot to see WHY this probably isn’t the best approach to examine this bivariate distribution:

with(has06hisp_if_xl, plot(entropy, interfriends))

While this scatterplot isn’t completely uninformative, we can see that it produces four “straight lines” of dots representing the observations in which respondents reported, 0, 1, 2, and 3 non-Hispanic friends among their three closest friends in Houston.

One way to begin getting a better purchase on what’s going on with the relationship between these two variables is to divide the entropy measure into categories. Doing so obscures some of the variation in the main IV, since we we only see categories of neighborhood racial-ethnic diversity rather than how the entropy measures varies both within AND across those categories. Nevertheless, cross tabulating the DV with the categorical version of the entropy measure may provide a somewhat clearer overview of the relationship.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
tabyl(has06hisp_if_xl, interfriends, entropy_cat) # raw counts
##  interfriends  1  2  3  4 NA_
##             0 82 70 74 56   8
##             1 10 16 15 21   2
##             2 18 25 23 17   5
##             3  8  8  9 12   3
##            NA  3  2  3 10   0
tabyl(has06hisp_if_xl, interfriends, entropy_cat, show_na=FALSE) %>%
  adorn_percentages("col") %>% # get row percentages
  adorn_pct_formatting(digits = 1) # format percentages (one decimal place)
##  interfriends     1     2     3     4
##             0 69.5% 58.8% 61.2% 52.8%
##             1  8.5% 13.4% 12.4% 19.8%
##             2 15.3% 21.0% 19.0% 16.0%
##             3  6.8%  6.7%  7.4% 11.3%

The first set of results above gives the raw counts of respondents with 0, 1, 2 and 3 non-Hispanic friends among their three closest friends in Houston, with the number of missing values in the far-right column. The second set of results aims to make these figures more readily interpretable by expressing them as column percentages, such that we can now more easily see that respondents who lived in more diverse neighborhoods are less likely to have zero non-Hispanic friends (e.g., those in the 4th quartile only have about a 53% chance, whereas as those in the bottom least diverse quartile have almost a 70% chance of falling into this category; conversely, those in the top, most diverse quartile are somewhat more likely to have exclusively non-Hispanic friends). Of course, in an actual research project, we would also want to consider whether this sample association is likely to reflect chance sampling variation rather than a real relationship between neighborhood racial-ethnic diversity and having non-Hispanic friends, but we’ll hold off on that for now until we’ve reviewed the basic principles of statistical inference (i.e., next week!).

In any case, now that we’ve made some effort to examine the overall bivariate distribution (which hasn’t produced any evidence of a non-linear relationship), we can more safely consider statistics that summarize the overall relationship.

library(DescTools)
## 
## Attaching package: 'DescTools'
## The following object is masked from 'package:car':
## 
##     Recode
cor(has06hisp_if_xl$interfriends, has06hisp_if_xl$entropy, 
    use="pairwise.complete.obs")
## [1] 0.09363334
GoodmanKruskalGamma(x=has06hisp_if_xl$entropy_cat,
                    y=has06hisp_if_xl$interfriends)
## [1] 0.1230905

So our rough and ready preliminary analysis appears to show that that there is modest, positive association between living in a more diverse neighborhood and having more non-Hispanic friends in our 2006 Houston Area Survey sample. While this is broadly consistent with our working hypothesis, we would need to examine potential sources of spurious and other alternative explanations before drawing any strong conclusion, particularly given the relatively weak association we observe at the bivariate level. Data permitting, we might also want to examine whether any measures of possible mechanisms can account for this association, as well.

For Homework Assignment #2, you’ll need toconduct a similar but more extensive analysis that examines the univariate distributions of some potential control variables, as well as their bivariate distributions with your DV and main IV, in addition to examining the variables used to measure your focal relationship. But first, you’ll need to find some data you can analyze! That’s the focus of our first homework assignment.