“SPEAKING about IGNORANCE & ARROGANCE”

Author(s): Borja V. Sorli Sanz - borjavss@gmail.com - @borjavss

1.- Introduction:

Research question:

Is there a relationship between whether people classify themselves in the top group of society and the belief that, on average, negroes/blacks/African-Americans have worse jobs, income, and housing than white people because of less in-born ability to learn?

I am interested in this research question because I believe that ignorance and arrogance are positively correlated.

How does that connect with my study question?

“People classifying themselves at the top of society” could be considered a sign of ARROGANCE.
“The belief that, on average, negroes/blacks/African-Americans have worse jobs, income and housing than white people because of less in-born ability to learn” could be considered a sign of IGNORANCE.

Why do I think that we should care about that?

Because, if the answer to the study question is affirmative (as I expect), we could agree with the statement: open-minded education is needed to mitigate arrogance.

From a wider point of view, it could give us a particular perspective on this subject within human psychology. If this correlation exists, it could be a good starting point for proposing future human psychology studies on the subject named in the title.

2.- Data:

I have used the -1- General Social Survey (GSS) proposed for the project on the Coursera web page. It is a sociological survey that collects data on the demographics and attitudes of US residents.

How did they collect the data? Three methods were used: computer-assisted personal interview (CAPI), face-to-face interview and telephone interview.

They also state that the survey draws a random sample from a universe of non-institutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States.

Cases:

From that huge survey, I took the answers to two survey questions as the variables for my study:

1) Whether or not people consider themselves at the apex (top group) of society.

Link vble 1: (RANK: http://goo.gl/b5qHhP)
Type vble 1: CATEGORICAL with 2 levels (that is, LOGICAL: yes/no)

2) Whether or not people agree with the statement: Negroes/blacks/African-Americans have worse jobs, income, and housing than white people because most of them have less in-born ability to learn.

Link vble 2: (RACDIF2: http://goo.gl/UQWGQM)
Type vble 2: CATEGORICAL with 2 levels (that is, LOGICAL: yes/no)

PS: My variable 1 (vble_1) comes from grouping the variable RANK into two groups:

People who consider themselves at the apex of society (top group, group 1)
The rest of the people (who consider themselves in groups 2 to 10 of society).

So, this variable was a categorical variable with 10 levels that I decided to collapse into two. As mentioned just before, it becomes a categorical variable with 2 levels (a logical variable: yes/no).

So, we have two categorical variables (each with only 2 levels) or, what is the same, two logical variables.

The most important thing is that each observation corresponds to one person answering both questions (our two logical variables). So we have 3290 people answering the two questions of our study; they form our SAMPLE.

Study: Our parameter of interest will be the difference between two proportions of all Americans (population).

So, our parameter of interest is: \[ p_{group1}-p_{group2to10} \] where each \( p \) is the population proportion of that vble1 group answering “yes” to vble2.

And we are going to work with the corresponding sample proportions: \[ \overline{p}_{group1}-\overline{p}_{group2to10} \]

To examine this, we are going to carry out:

a hypothesis test and
a confidence interval (more details in the next sections of the study)

In the next sections we will also check whether the success-failure condition is met:

if the condition is met, we will use a theoretical method
if the condition is not met, we will use a simulation method
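
The decision rule above can be sketched in R as follows. This is only a minimal illustration: the helper names `success_failure_met` and `choose_method`, the default threshold of 10 and the example counts are my own assumptions, not part of the survey or of the course material.

```r
# Hypothetical helper: TRUE when every group has at least `threshold`
# observed successes and at least `threshold` observed failures.
success_failure_met <- function(successes, n, threshold = 10) {
  failures <- n - successes
  all(successes >= threshold, failures >= threshold)
}

# Hypothetical helper: pick the inference method from the condition.
choose_method <- function(successes, n) {
  if (success_failure_met(successes, n)) "theoretical" else "simulation"
}

# Illustrative counts only (not taken from the survey)
choose_method(successes = c(40, 55), n = c(200, 300))  # both groups large enough
choose_method(successes = c(3, 55),  n = c(20, 300))   # first group too small
```

The check is done on counts of successes and failures, which is the same as checking n times the sample proportion and n times one minus the sample proportion.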

Remember:

If subjects are selected randomly (random sample) from the population the resulting sample is likely representative of the population. Therefore, the study's results are generalizable to the population at large.
Random assignment occurs only in experimental settings where subjects are assigned to various treatments. It can only be used in experimental studies where we have two groups, the treatment group and the control group.

So, as in most observational studies, we are dealing with random sampling but not with random assignment.

Therefore, the study will be generalizable but not causal. That is, we could generalize the conclusions to the population at large BUT we cannot justify a causal effect between the two variables.

There could be a big “source of bias” in answering the vble2 question (the one related to racism) depending on the 10 groups of self-classification in social status. But we have partly reduced this bias by transforming the 10 groups into 2 to obtain a logical variable called vble1.

So, to conclude, our parameter of interest is the difference between two proportions of all Americans (population).

And, for the reasons explained, we will be able to generalize from the sample data to reach our study conclusions.

3.- Exploratory data analysis:

I want to start by showing how to turn the data from the web into handy data for our study.

We load the data and put them in a data.frame where the columns are our two variables (vble1 and vble2) and the rows are the individual respondents to the survey. So, as we have 3290 people answering the two questions, we have 3290 rows:

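# Keep only the two survey items of interest (gss is assumed to be already loaded)
data <- data.frame(gss$rank, gss$racdif2)
# Drop respondents with a missing answer to either question
data <- na.omit(data)
rownames(data) <- NULL
colnames(data) <- c("vble1", "vble2")
# prop keeps the original "Yes"/"No" labels used in the tables below
prop <- data
levels(data[, 2]) <- c("1", "0")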

As I mentioned before, in the data from the web the original vble1 is split into 10 groups (according to how a person classifies him/herself in society):

table(prop)

##      vble2
## vble1  Yes   No
##    1    26  141
##    2    14  103
##    3    31  366
##    4    43  461
##    5   123 1055
##    6    42  378
##    7    28  259
##    8    18  126
##    9     1   33
##    10   12   30

This table shows in rows the “self-classified society group (from 1 to 10)” and in columns the answers “yes” or “no” to vble2 (racism question). Merging groups 2 to 10 into a single group, I get the vble1 of my study, a categorical variable with two values (on the one hand group 1; on the other hand the rest of the groups, 2 to 10).

# Collapse groups 2 to 10 into one level: 0 = group2to10, 1 = group1
prop$vble1[prop$vble1 >= 2] <- 0

table(prop)

##      vble2
## vble1  Yes   No
##     0  312 2811
##     1   26  141
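
If readable labels are preferred over the 0/1 coding, one possible approach is to store the collapsed variable as a factor. The sketch below uses a small made-up data frame (`toy`, with invented answers) instead of the real survey data, so the names and values are assumptions for illustration only.

```r
# Toy stand-in for the survey data (hypothetical values, not the GSS sample)
toy <- data.frame(
  vble1 = c(1, 4, 10, 1, 7, 2),
  vble2 = factor(c("Yes", "No", "No", "No", "Yes", "No"),
                 levels = c("Yes", "No"))
)

# Collapse groups 2 to 10 and attach readable labels to the two levels
toy$group <- factor(
  ifelse(toy$vble1 == 1, "group1", "group2to10"),
  levels = c("group2to10", "group1")
)

# Cross-tabulate the labelled group against the answer to vble2
tab <- table(toy$group, toy$vble2, dnn = c("vble1", "vble2"))
tab
```

Labels are attached to the factor levels rather than to the row names, because a data frame needs one row name per respondent.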

Then, in our SAMPLE, we have:

\( 312/(312+2811) = 0.0999 \) -> 9.99 % of group2to10 say “YES” to the racism question (vble2 of our study).
\( 26/(26+141) = 0.1557 \) -> 15.57 % of group1 say “YES” to the racism question (vble2 of our study).

So, in our SAMPLE, there is a difference of about 5.6 percentage points between group1 and group2to10.
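
These sample proportions can be reproduced directly from the 2 x 2 table. The sketch below retypes the four counts by hand into a matrix called `counts` (a name chosen here for illustration), so it runs without the survey data being loaded.

```r
# Counts copied from the 2 x 2 table above
counts <- matrix(c(312, 2811,
                    26,  141),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(vble1 = c("group2to10", "group1"),
                                 vble2 = c("Yes", "No")))

# Row-wise proportions: share of "Yes" and "No" within each group
row_props <- prop.table(counts, margin = 1)
round(row_props, 4)

# Difference in the share of "Yes" answers (group1 minus group2to10)
diff_yes <- row_props["group1", "Yes"] - row_props["group2to10", "Yes"]
round(diff_yes, 4)
```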

Also, you can observe this difference graphically:

plot(table(prop), col = c(1, 4), cex = 1.5, main = " ", xlab = "vble1 - self_class_society_group", 
    ylab = "vble2 - racism_question")

[Figure: mosaic plot of vble2 (racism question) against vble1 (self-classified society group)]

In the graphic, vble1 (logical variable) is represented as follows: (1) represents group1 and (0) represents group2to10. (Note: the much larger width of group2to10 compared with group1 only shows that many more people consider themselves in group2to10 than in group1.)

Summarizing, a difference between the groups of vble1 is observed in our SAMPLE.

So, our research question should be investigated, because there is a possibility that this difference also holds in the entire POPULATION.

To be able to generalize, we are going to carry out a hypothesis test and build a confidence interval.

4.- Inference:

Remember that each observation holds one person's answers to both vble1 and vble2; vble1 splits the respondents into the two groups we compare (\( p_1 \) for group1 and \( p_2 \) for group2to10).

Null hypothesis Ho: \( (p_1-p_2)=0 \) in the POPULATION

Alternative hypothesis Ha: \( (p_1-p_2)\neq0 \) in the POPULATION

This will give us clear information to solve our research question.

In addition, we are going to build a CONFIDENCE INTERVAL.

We estimate the difference between two proportions:

\( (\overline{p}_1-\overline{p}_2)\pm z^* SE \) giving us a Confidence Interval of the difference between proportions.

Checking conditions -

The success-failure condition is met because, although our proportions are between 10% and 15%, we have a large number of cases (n=3290) in our sample (remember, as n increases, the standard error decreases).

So, using our dataset:

1) Independence:

Within groups:

As we have read on the web page of the data, each person answers the questions regardless of the rest of the people.

So, sampled observations are independent.

As they sampled without replacement, we should check that n < 10% of the population. Obviously, \( n=3290<0.1 \times 314\ \text{million}=31\,400\,000 \)

Between groups: the two groups, group1 and the rest of the groups (group2to10), are independent of each other, because each person belongs to only one of them.

The sample size/skew condition is satisfied because 3290 > 30, so we do not need to worry about whether the population distribution is very skewed.

2) Success-failure condition is met in our data:

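\( n_{group2to10}\,\overline{p}_{group2to10} = 3123 \times 0.0999 \approx 312 \ge 10 \) and \( n_{group2to10}(1-\overline{p}_{group2to10}) = 3123 \times (1-0.0999) \approx 2811 \ge 10 \)
\( n_{group1}\,\overline{p}_{group1} = 167 \times 0.1557 \approx 26 \ge 10 \) and \( n_{group1}(1-\overline{p}_{group1}) = 167 \times (1-0.1557) \approx 141 \ge 10 \)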

The selected methods are appropriate for inference: with a confidence interval and a hypothesis test, we are going to examine whether the difference seen in our sample can be extended to the population. This lets us answer our study question.

** - CONFIDENCE INTERVAL - ** Estimating the difference between two proportions

\( (\overline{p}_{group1} - \overline{p}_{group2to10}) \pm z^* SE \)

with a confidence level of 95% (\( \alpha =0.05 \), so \( z^* = 1.96 \))

\[ (0.1557-0.0999)\pm 1.96\sqrt{\frac{0.1557(1-0.1557)}{167}+\frac{0.0999(1-0.0999)}{3123}} = 0.0558 \pm 0.0560 \approx (-0.0002,0.1118) \]

Here each sample proportion is divided by the size of its own group (167 for group1, 3123 for group2to10).

So, the interval lies almost entirely above 0, but its lower bound just reaches 0. The confidence interval therefore gives only a borderline indication that there could be a real difference in the POPULATION, as there is in our data SAMPLE.

CONCLUSION:

We are 95% confident that the proportion of group1 answering YES to the racism question is between about 0.02 percentage points lower and 11.2 percentage points higher than the proportion of group2to10.
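
As a check on the arithmetic, the interval can be computed in R from the four counts of the 2 x 2 table. This is a minimal sketch using the unpooled standard error described above; the variable names (`yes`, `n`, `p_hat`, `ci`) are mine and the counts are typed in by hand.

```r
# Counts of "Yes" answers and group sizes, copied from the 2 x 2 table
yes <- c(group1 = 26,  group2to10 = 312)
n   <- c(group1 = 167, group2to10 = 3123)

# Sample proportions and their difference (group1 minus group2to10)
p_hat     <- yes / n
point_est <- p_hat[["group1"]] - p_hat[["group2to10"]]

# Unpooled standard error: each proportion uses its own group size
se <- sqrt(sum(p_hat * (1 - p_hat) / n))

# 95% confidence interval based on the Normal approximation
z_star <- qnorm(0.975)
ci <- point_est + c(-1, 1) * z_star * se
round(ci, 4)
```

With these counts the interval comes out at roughly (-0.0002, 0.1118), so whether 0 falls inside depends on the fourth decimal place; rounding the proportions to three decimals before computing is enough to move the lower bound across 0.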

** - HYPOTHESIS TEST - ** Hypothesis test for comparing two proportions

Null Hypothesis (Ho): \( (p_1-p_2)=0 \) of the POPULATION

Alternative hypothesis (Ha): \( (p_1-p_2)\neq0 \) of the POPULATION

So, since Ho assumes that \( p_1=p_2 \), we need to calculate the pooled proportion from the table:

table(prop)

##      vble2
## vble1  Yes   No
##     0  312 2811
##     1   26  141

and with \[ \hat{p}_{POOL} =\frac{\text{total successes}}{\text{total } n} = \frac{312+26}{3290} = 0.1027 \] so, \[ SE=\sqrt{\frac{\hat{p}_{POOL}(1-\hat{p}_{POOL})}{167}+\frac{\hat{p}_{POOL}(1-\hat{p}_{POOL})}{3123}} = 0.0241 \]

Conducting the hypothesis test

\( (\hat{p}_{group1}-\hat{p}_{group2to10}) \sim N(mean=0,SE) \) where \( SE=0.0241 \)

As the “point estimate” is \( (\hat{p}_{group1}-\hat{p}_{group2to10})=(0.1557-0.0999)=0.0558 \) and the “null value” is 0, we have: \[ Z=\frac{0.0558-0}{0.0241} \approx 2.31 \]

We calculate the p-value with the Normal distribution, obtaining: \[ p\text{-value}=P(\mid Z \mid > 2.31 ) \approx 0.021 \] Remark: we have used both tails to obtain the p-value because we are testing \( \neq \) (not > or <).

As \( p\text{-value} = P(\text{observed or more extreme outcome} \mid Ho\ \text{true}) \), if we use a significance level of \( \alpha =0.05 \), we should reject Ho because \( 0.021 < 0.05 \).

CONCLUSION

As we reject the Null Hypothesis because \( 0.021 < 0.05 \), the data provide evidence of a real difference in the population proportions.

That is, the proportion of people who classify themselves as group1 in society and answer “yes” to the racism question (vble2) is bigger than the proportion of people who classify themselves as group2to10 in society and answer “yes” to the racism question (vble2).
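
The test statistic and p-value can be recomputed in R from the four counts of the 2 x 2 table. The sketch below does the pooled calculation by hand and then cross-checks it with `prop.test()` from base R without continuity correction, which I am assuming is the closest built-in equivalent of the hand calculation; the variable names are mine.

```r
# Counts of "Yes" answers and group sizes, copied from the 2 x 2 table
yes <- c(group1 = 26,  group2to10 = 312)
n   <- c(group1 = 167, group2to10 = 3123)

# Pooled proportion under Ho (p1 = p2) and the pooled standard error
p_hat   <- yes / n
p_pool  <- sum(yes) / sum(n)
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Z statistic and two-sided p-value from the Normal distribution
z       <- (p_hat[["group1"]] - p_hat[["group2to10"]]) / se_pool
p_value <- 2 * pnorm(-abs(z))

# Cross-check: the chi-squared statistic of prop.test() equals z^2
check <- prop.test(x = yes, n = n, correct = FALSE)
c(z = z, p_value = p_value, prop_test_p = check$p.value)
```

With these counts z is about 2.31 and the two-sided p-value about 0.021, below the 0.05 significance level.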

So, the hypothesis test and the confidence interval point in the same direction: a higher proportion of “yes” answers in group1.

5.- Conclusion:

In previous points, we observed a difference in SAMPLE proportions between two groups (people classifying themselves in group1 of society versus people classifying themselves in group2to10).

So, the question was whether we could extend this difference to the entire POPULATION.

We tried with a 95% confidence interval, approximately (-0.0002, 0.1118) for group1 minus group2to10: it lies almost entirely above 0, but its lower bound just reaches 0, so by itself it is borderline.

The hypothesis test, which starts from the assumption of no difference in the POPULATION, rejects Ho at the 5% significance level.

So, inference gives evidence that a real difference exists between the two mentioned groups in the POPULATION, and not only in the SAMPLE, when answering the racism question (our vble2), although that evidence is close to the 5% threshold.

Translating the results to our research question, we have learned:

(inference) there is statistically significant evidence of a difference between the two groups, not only in the SAMPLE but also in the POPULATION.
then, people considering themselves at the apex (top group) of society tend, more often than the rest, to believe that the average negroes/blacks/African-Americans have worse jobs, income, and housing than white people because of less in-born ability to learn (our vble2).
If we admit that answering 'YES' to this 'racism question' (our vble2) is an indicator of IGNORANCE and, in addition, that classifying ourselves in the top group of society is an indicator of ARROGANCE, then we can conclude that IGNORANCE and ARROGANCE could be related.

As we have not ensured random assignment (it is used in experimental studies and not in observational studies like this one), we cannot justify a causal effect between the two variables.

To address this 'problem of causality', which is a shortcoming of this study, a possible next step would be to set up an experimental study (not only an observational study like this one).

Also, research could be carried out in other countries, to show whether a possible correlation between 'ignorance' and 'arrogance' is not only a trait of American people but a human condition.

I want to apologize for the risky step of treating the answer to a single question as an IGNORANCE/ARROGANCE indicator. This is another point that should be dealt with in possible future research along these lines.

……………………………………………………………………………………………………………… ………………………………………………………………………………………………………………

—————- References: —————–

Complete data information: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/34802/version/1
Complete data of the survey: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html

——– Appendix containing data: ——–

head(prop, n = 15)

##    vble1 vble2
## 1      0    No
## 2      0    No
## 3      0    No
## 4      0    No
## 5      0    No
## 6      0    No
## 7      1    No
## 8      0    No
## 9      0    No
## 10     0    No
## 11     0    No
## 12     0    No
## 13     0    No
## 14     0    No
## 15     0    No

Mosaic plot showing all data graphically

plot(table(prop), col = c(1, 4), cex = 1.5, main = "", xlab = "vble1 - self_class_society_group", 
    ylab = "vble2 - racism_question")

[Figure: mosaic plot of vble2 (racism question) against vble1 (self-classified society group)]

In the graphic, vble1 (logical variable) is represented by :

(1), represents group1
(0), represents group2to10

and their numerical components:

table(prop)

##      vble2
## vble1  Yes   No
##     0  312 2811
##     1   26  141

So, the proportions used are:

P(group1 answering YES): \( 26/(26+141) \) is \( \hat{p}_{group1}=0.156 \)
P(group2to10 answering YES): \( 312/(2811+312) \) is \( \hat{p}_{group2to10}=0.099 \)