main IV
Now, let’s examine the univariate distribution for the main independent variable, our measure of racial/ethnic diversity in the neighborhood (block group) where the respondent lived. While the DV in our analysis is a discrete (count) variable, our measure of neighborhood racial/ethnic diversity, “entropy”, is a continuous variable. So, it doesn’t make as much sense to look at each individual value this variable takes on to get a sense of what the overall distribution looks like. Instead, we’ll examine a histogram, along with a closely related approach called a density plot:
with(has06hisp_if_xl, {
hist(entropy, breaks="FD",
main="racial-ethnic diversity in neighborhood")
})
Now, we can see that our neighborhood racial-ethnic diversity measure takes on values ranging from close to 0 to just over 1.5. While the distribution isn’t exactly normal, it does have a clear peak in the center between .7 and .8, and extreme values are relatively unusual. There is, however, a sort of “mini-peak” between .1 and .2, suggesting that a significant minority of Hispanic/Latinx Houston residents live in neighborhoods with relatively low diversity.
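A density plot, the smoothed counterpart to a histogram, can be sketched along the same lines. The example below is only an illustration under stated assumptions: it uses simulated values (`entropy_sim`, a hypothetical stand-in) rather than the survey data, so the shape of the curve will not match the histogram described above.

```r
# Simulated stand-in for the entropy measure (hypothetical data, NOT the survey).
set.seed(2006)
entropy_sim <- c(runif(490, min = 0.05, max = 1.5), rep(NA, 10))

# density() stops with an error on missing values unless na.rm = TRUE.
dens <- density(entropy_sim, na.rm = TRUE)

# Draw the kernel density estimate as a smooth curve.
plot(dens,
     main = "racial-ethnic diversity in neighborhood (simulated)",
     xlab = "entropy")
```

With the actual data, the same call would presumably be applied to `has06hisp_if_xl$entropy` in place of `entropy_sim`.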
With a few more lines of code, we can explicitly compare the observed sample distribution of the entropy measure to a normal distribution with the same mean and standard deviation:
x1 <- has06hisp_if_xl$entropy
m_iv <- mean(x1, na.rm=TRUE)
sd_iv <- sd(x1, na.rm=TRUE)
with(has06hisp_if_xl, {
hist(entropy, prob=TRUE, breaks="FD",
main="racial-ethnic diversity in neighborhood")
curve(dnorm(x, mean=m_iv, sd=sd_iv), lwd=2, add=TRUE)
})
So now, we can see even more clearly that the distribution is approximately but not exactly normal.
In any case, now that we’ve gotten a quick overview of the overall distribution, it makes sense to look more closely at the summary stats for this variable:
summary(has06hisp_if_xl$entropy)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.05205 0.40697 0.71513 0.71141 0.97389 1.52328 18
sd(has06hisp_if_xl$entropy, na.rm=TRUE)
## [1] 0.3639529
Consistent with our examination of the overall distribution via the histogram, we can see that our measure ranges from a low of about .05 to a high of just over 1.5, with very similar mean and median values. The standard deviation is about 0.36, which we can think of as telling us that a typical case differs from the average (i.e., mean) value by about .36. Note that this is approximately equal to 25% of the variable’s range of just under 1.5 (1.47 to be relatively exact). This suggests, consistent with the histogram results, that the cases are spread fairly widely across the variable’s range rather than clustered tightly around the mean.
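The arithmetic behind the “about 25% of the range” comparison can be checked directly. This short sketch simply re-enters the values printed in the output above; the object names are ours, chosen for illustration.

```r
# Values copied from the summary() and sd() output above.
entropy_min <- 0.05205
entropy_max <- 1.52328
entropy_sd  <- 0.3639529

entropy_range <- entropy_max - entropy_min  # width of the observed range
sd_share <- entropy_sd / entropy_range      # SD as a share of that range

round(entropy_range, 2)  # 1.47
round(sd_share, 2)       # 0.25
```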
## Bivariate Distribution - Focal Relationship
Now, we’re ready to examine the bivariate distribution of the variables we’re using to measure the focal relationship. Let’s start by examining a scatterplot. A scatterplot is NOT the best way to examine a bivariate distribution when one of the variables is discrete and has a narrow range, as is the case for our DV here. As we’ve seen, it only takes on integer (i.e., whole number) values from 0 to 3, reflecting the nature of the variable (i.e., it’s a count of the number of the respondent’s three closest friends who were not Hispanic or Latinx). But let’s look at the scatterplot to see WHY this probably isn’t the best approach to examine this bivariate distribution:
with(has06hisp_if_xl, plot(entropy, interfriends))
While this scatterplot isn’t completely uninformative, we can see that it produces four “straight lines” of dots representing the observations in which respondents reported 0, 1, 2, and 3 non-Hispanic friends among their three closest friends in Houston.
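One common way to make a plot like this easier to read is to add a small amount of random vertical “jitter” so that the points within each line no longer sit exactly on top of each other. This is not part of the analysis above, just a sketch of the idea using simulated data; `entropy_sim` and `friends_sim` are hypothetical stand-ins for the survey variables, and the probabilities are made up.

```r
# Simulated stand-ins for the two variables (hypothetical data, NOT the survey).
set.seed(2006)
n <- 300
entropy_sim <- runif(n, min = 0.05, max = 1.5)
friends_sim <- sample(0:3, size = n, replace = TRUE,
                      prob = c(0.60, 0.15, 0.18, 0.07))

# Shift each count up or down by a random amount of at most 0.15.
friends_jit <- jitter(friends_sim, amount = 0.15)

# Plot the jittered counts, but label the axis with the true integer values.
plot(entropy_sim, friends_jit,
     xlab = "entropy (simulated)",
     ylab = "non-Hispanic friends (jittered)",
     yaxt = "n")
axis(2, at = 0:3)
```

Jittering only changes how the points are drawn; the underlying counts are untouched, so any statistics should still be computed on the original variable.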
One way to begin getting a better purchase on what’s going on with the relationship between these two variables is to divide the entropy measure into categories. Doing so obscures some of the variation in the main IV, since we only see categories of neighborhood racial-ethnic diversity rather than how the entropy measure varies both within AND across those categories. Nevertheless, cross-tabulating the DV with the categorical version of the entropy measure may provide a somewhat clearer overview of the relationship.
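For reference, here is one way a quartile-based categorical version of a continuous variable could be built. This is a sketch under assumptions: the code that actually created `entropy_cat` is not shown here, and `entropy_sim` and `entropy_cat_sim` are hypothetical names applied to simulated data.

```r
# Simulated stand-in for the entropy measure (hypothetical data, NOT the survey).
set.seed(2006)
entropy_sim <- runif(400, min = 0.05, max = 1.5)

# Cut points at the minimum, the three quartiles, and the maximum.
cut_points <- quantile(entropy_sim, probs = seq(0, 1, by = 0.25), na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value in category 1
# instead of turning it into NA.
entropy_cat_sim <- cut(entropy_sim, breaks = cut_points,
                       include.lowest = TRUE, labels = 1:4)

table(entropy_cat_sim)  # roughly equal counts in each category
```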
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
tabyl(has06hisp_if_xl, interfriends, entropy_cat) # raw counts
## interfriends 1 2 3 4 NA_
## 0 82 70 74 56 8
## 1 10 16 15 21 2
## 2 18 25 23 17 5
## 3 8 8 9 12 3
## NA 3 2 3 10 0
tabyl(has06hisp_if_xl, interfriends, entropy_cat, show_na=FALSE) %>%
adorn_percentages("col") %>% # get column percentages
adorn_pct_formatting(digits = 1) # format percentages (one decimal place)
## interfriends 1 2 3 4
## 0 69.5% 58.8% 61.2% 52.8%
## 1 8.5% 13.4% 12.4% 19.8%
## 2 15.3% 21.0% 19.0% 16.0%
## 3 6.8% 6.7% 7.4% 11.3%
The first set of results above gives the raw counts of respondents with 0, 1, 2, and 3 non-Hispanic friends among their three closest friends in Houston, with cases missing on the entropy measure in the far-right column and cases missing on the DV in the bottom row. The second set of results aims to make these figures more readily interpretable by expressing them as column percentages. We can now more easily see that respondents who lived in more diverse neighborhoods are less likely to have zero non-Hispanic friends: those in the 4th (most diverse) quartile have only about a 53% chance of falling into this category, whereas those in the bottom, least diverse quartile have almost a 70% chance. Conversely, those in the top, most diverse quartile are somewhat more likely to have exclusively non-Hispanic friends (about 11%, versus about 7% in the bottom quartile). Of course, in an actual research project, we would also want to consider whether this sample association is likely to reflect chance sampling variation rather than a real relationship between neighborhood racial-ethnic diversity and having non-Hispanic friends, but we’ll hold off on that for now until we’ve reviewed the basic principles of statistical inference (i.e., next week!).
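To see where the column percentages come from, we can re-enter the non-missing counts from the first table and divide each cell by its column total. This base R sketch is just a cross-check on the `tabyl()` output; the `counts` and `col_pct` objects are ours, typed in by hand from the results above.

```r
# Non-missing counts copied from the raw-count table above.
# matrix() fills column by column, so each line below is one entropy category.
counts <- matrix(c(82, 10, 18,  8,   # entropy_cat 1
                   70, 16, 25,  8,   # entropy_cat 2
                   74, 15, 23,  9,   # entropy_cat 3
                   56, 21, 17, 12),  # entropy_cat 4
                 nrow = 4,
                 dimnames = list(interfriends = as.character(0:3),
                                 entropy_cat  = as.character(1:4)))

# margin = 2 divides each cell by its COLUMN total.
col_pct <- round(100 * prop.table(counts, margin = 2), 1)
col_pct
```

The first row of `col_pct` reproduces the 69.5%, 58.8%, 61.2%, and 52.8% figures shown above.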
In any case, now that we’ve made some effort to examine the overall bivariate distribution (which hasn’t produced any evidence of a non-linear relationship), we can more safely consider statistics that summarize the overall relationship.
library(DescTools)
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:car':
##
## Recode
cor(has06hisp_if_xl$interfriends, has06hisp_if_xl$entropy,
use="pairwise.complete.obs")
## [1] 0.09363334
GoodmanKruskalGamma(x=has06hisp_if_xl$entropy_cat,
y=has06hisp_if_xl$interfriends)
## [1] 0.1230905
So our rough-and-ready preliminary analysis appears to show that there is a modest, positive association between living in a more diverse neighborhood and having more non-Hispanic friends in our 2006 Houston Area Survey sample. While this is broadly consistent with our working hypothesis, we would need to examine potential sources of spuriousness and other alternative explanations before drawing any strong conclusion, particularly given the relatively weak association we observe at the bivariate level. Data permitting, we might also want to examine whether any measures of possible mechanisms can account for this association.
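For anyone curious about what gamma is measuring, it compares concordant pairs of cases (one case is higher than the other on both variables) with discordant pairs (higher on one variable, lower on the other): gamma = (C - D) / (C + D). The sketch below computes this by hand from the non-missing counts in the cross-tabulation above. It is an illustration of the formula, not a description of how `GoodmanKruskalGamma()` is implemented internally.

```r
# Non-missing counts copied from the raw-count table above
# (rows = interfriends 0-3, columns = entropy_cat 1-4).
counts <- matrix(c(82, 10, 18,  8,
                   70, 16, 25,  8,
                   74, 15, 23,  9,
                   56, 21, 17, 12),
                 nrow = 4)

rows <- seq_len(nrow(counts))
cols <- seq_len(ncol(counts))
concordant <- 0
discordant <- 0

for (i in rows) {
  for (j in cols) {
    # Pair each cell with cells that are higher on BOTH variables ...
    concordant <- concordant + counts[i, j] * sum(counts[rows > i, cols > j])
    # ... and with cells higher on interfriends but lower on entropy_cat.
    discordant <- discordant + counts[i, j] * sum(counts[rows > i, cols < j])
  }
}

gamma_by_hand <- (concordant - discordant) / (concordant + discordant)
gamma_by_hand
```

The result agrees with the 0.123 reported above to several decimal places, since concordant pairs only modestly outnumber discordant ones (26,136 versus 20,407).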
For Homework Assignment #2, you’ll need to conduct a similar but more extensive analysis that examines the univariate distributions of some potential control variables, as well as their bivariate distributions with your DV and main IV, in addition to examining the variables used to measure your focal relationship. But first, you’ll need to find some data you can analyze! That’s the focus of our first homework assignment.