Homework #3: Subsetting Variables and Basic Group Comparison

Sociology 333: Introduction to Quantitative Analysis

Duke University, Summer 2014, Instructor: David Eagle, PhD (Cand.)

Vectors

We've already learned about data frames, which are just tables of data. We index a data frame with two numbers: the row number and the column number. We also found out that data frames have row names and column names. Now we are going to look at an even simpler object, a vector.

In R, variables can be a single value, or a vector of values. What are vectors? They are just a long list of numbers. If I wrote down the ages of everyone in the class, I could store this in a vector. We index vectors with only one number. The ages of class members might look like: 22, 21, 23, 20, 20, 22. Element 1 of this vector is 22. Element 3 is 23.

We can store vectors in R in a single object. To create vectors, we use the c() command. In this code, what is R doing?

X = c(1, 2, 3, 4, 5)
# Element 1=1
X[1]

## [1] 1

# Element 5=5
X[5]

## [1] 5

Y = c(1, 2, 1, 4, 4)
# Element 3=1
Y[3]

## [1] 1
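The class ages from the earlier example can be stored and indexed the same way. This is only a small sketch; the name ages is made up for illustration.

```r
# Ages of six class members, taken from the example in the text
ages = c(22, 21, 23, 20, 20, 22)
ages[1]        # element 1 is 22
ages[3]        # element 3 is 23
length(ages)   # the vector has 6 elements
ages[c(1, 3)]  # several elements at once: 22 23
```

Notice that the index itself can be a vector, which lets us pull out several elements in one step.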

We can add/subtract/multiply/divide vectors, so long as they are either the same length as each other, or the length of one is a multiple of the length of the other. If they are the same length, the operation is done on an element-by-element basis.

X + Y

## [1] 2 4 4 8 9

Z = c(1, 1, 1, 1, 1)
X + Z

## [1] 2 3 4 5 6

X + Z + Y

## [1]  3  5  5  9 10

X - Z

## [1] 0 1 2 3 4

X * Y

## [1]  1  4  3 16 20

X/Y

## [1] 1.00 1.00 3.00 1.00 1.25

If the length of one is a multiple of the length of the other (e.g. one is length 5 and the other length 1, or one is length 10 and the other length 5), then the shorter vector is just recycled: R reuses it from the start until it has covered the longer one.

Q = 3
X + Q  #3 is added to each element of X

## [1] 4 5 6 7 8

X - Q

## [1] -2 -1  0  1  2

L = c(1, 2)
X + L  #WARNING!! Note that the length of X is NOT a multiple of the length of L

## Warning: longer object length is not a multiple of shorter object length

## [1] 2 4 4 6 6

L = seq(1, 20, 2)  #seq() makes a sequence from 1 to 20, counting by 2s.
X + L  #X gets recycled

##  [1]  2  5  8 11 14 12 15 18 21 24

X - L

##  [1]   0  -1  -2  -3  -4 -10 -11 -12 -13 -14

L - X

##  [1]  0  1  2  3  4 10 11 12 13 14
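One way to see what recycling does is to write it out by hand with rep(). This is a sketch only; the names a and b are made up so they don't disturb the vectors above.

```r
a = c(1, 2, 3, 4, 5)
b = seq(1, 20, 2)            # 1 3 5 ... 19, ten elements

# A remainder of 0 means the longer length is a multiple of the shorter one
length(b) %% length(a)

# Recycling a is the same as repeating it until it matches the length of b
recycled = a + b
by.hand = rep(a, times = 2) + b
all(recycled == by.hand)     # TRUE
```

Checking the remainder first is a quick way to know whether R will recycle silently or print the warning we saw earlier.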

Subsetting Data

There is another class of variable that we need to learn about: logical variables. Logical variables are like light switches; they turn something on or off. Logical variables have two values, TRUE and FALSE. A logical operator asks R to evaluate a statement and tell us if it is true or false. Here is a list of a few of the most important:

X == Y. This asks: is X equal to Y?
X > Y. This asks: is X greater than Y?
X < Y. This asks: is X less than Y?
X != Y. This asks: is X not equal to Y? (We can use the bang (!) to negate most expressions.)

Here they are in action:

X = 1  #notice that one equal sign assigns a value
Y = 3
X == Y  #This returns a value!!! It is like a plus sign or a multiplication sign, it does something...

## [1] FALSE

X > Y

## [1] FALSE

X < Y

## [1] TRUE

X != Y

## [1] TRUE

R gives one of two answers to a logical statement: TRUE or FALSE. These are special values that we can use to do all kinds of useful things.

There are also operators that combine logical statements. The ones of importance here are & and |. These are AND and OR. They let us string together comparisons to make longer statements.

X = 1
Y = 1
Z = 3
X == Y

## [1] TRUE

X == Y & X == Z

## [1] FALSE

X == Y & X != Z

## [1] TRUE

X == Y | X == Z

## [1] TRUE
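These operators also work on whole vectors, one element at a time, which is what makes them useful for subsetting. Here is a small sketch using the class ages from the vector section; the name ages is made up for illustration.

```r
ages = c(22, 21, 23, 20, 20, 22)

# Each element is compared separately, so we get one TRUE/FALSE per person
ages > 21              # TRUE FALSE TRUE FALSE FALSE TRUE

# & and | combine the comparisons element by element
ages > 20 & ages < 23  # TRUE TRUE FALSE FALSE FALSE TRUE
ages == 20 | ages == 23

# A logical vector inside [] keeps only the elements that are TRUE
ages[ages > 21]        # 22 23 22
```

The last line is subsetting in miniature: the logical vector acts like a row of light switches that decides which elements we keep.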

Get our data again.

load(url("http://www.soc.duke.edu/~dee4/soc333data/gssHW2.data"))

Now back to subsetting. Say we want a data frame with just the men in it. We can use the subset command to get these data.

gss.men = subset(gss, gss$sex == "male")
gss.women = subset(gss, gss$sex == "female")  #OR
gss.women = subset(gss, gss$sex != "male")

Or, maybe we just want men over 30. Now we can combine logical expressions as follows to do this:

gss.men.o30 = subset(gss, gss$sex == "male" & gss$age > 30)
# How about a dataset of men over 30 who haven't had sex in the past 12
# months
gss.men.o30.ns = subset(gss, gss$sex == "male" & gss$age > 30 & gss$sexfreq == 
    "not at all")
dim(gss.men.o30.ns)  #There are 588 of these men (588 rows, 13 columns).

## [1] 588  13

nrow(gss.men.o30.ns)  #just the number of rows.

## [1] 588
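If you want to try subset() without downloading the GSS, here is a sketch on a tiny made-up data frame. The data frame toy and its values are hypothetical and are not the GSS data.

```r
# Hypothetical toy data, NOT the GSS
toy = data.frame(sex = c("male", "female", "male", "female", "male"),
                 age = c(25, 41, 37, 29, 52))

# Men over 30, using subset() as in the text
toy.men.o30 = subset(toy, toy$sex == "male" & toy$age > 30)
nrow(toy.men.o30)

# The same rows, using a logical vector as the row index
toy.men.o30.b = toy[toy$sex == "male" & toy$age > 30, ]
nrow(toy.men.o30.b)
```

Both approaches keep the rows where the condition is TRUE. One difference to be aware of: when the condition is NA for a row (for example, a missing age), subset() drops that row, while the bracket version returns a row of NAs.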

Now, say we were studying men and their number of sex partners as part of research on sexually transmitted diseases. First of all, we want to see how many male and female sex partners men have had since they turned 18. A small number report more than 300, so we're going to drop cases where they report more than 300.

gss.men = subset(gss, gss$sex == "male" & gss$numwomen < 301 & gss$nummen < 
    301)
prop.table(table(gss.men$numwomen))

## 
##         0         1         2         3         4         5         6 
## 0.0709812 0.1511482 0.0747390 0.0776618 0.0640919 0.0713987 0.0517745 
##         7         8         9        10        11        12        13 
## 0.0267223 0.0286013 0.0096033 0.0797495 0.0035491 0.0275574 0.0041754 
##        14        15        16        17        18        19        20 
## 0.0043841 0.0342380 0.0041754 0.0022965 0.0062630 0.0010438 0.0576200 
##        21        22        23        24        25        27        28 
## 0.0027140 0.0025052 0.0014614 0.0025052 0.0223382 0.0006263 0.0012526 
##        29        30        31        32        33        34        35 
## 0.0004175 0.0237996 0.0004175 0.0014614 0.0008351 0.0004175 0.0068894 
##        36        37        39        40        42        45        48 
## 0.0006263 0.0002088 0.0002088 0.0108559 0.0002088 0.0020877 0.0002088 
##        49        50        51        52        53        54        55 
## 0.0004175 0.0217119 0.0004175 0.0004175 0.0002088 0.0002088 0.0002088 
##        56        58        59        60        62        63        65 
## 0.0002088 0.0002088 0.0002088 0.0037578 0.0002088 0.0002088 0.0008351 
##        70        73        74        75        77        80        85 
## 0.0014614 0.0002088 0.0002088 0.0022965 0.0002088 0.0014614 0.0002088 
##        90       100       101       103       120       121       137 
## 0.0006263 0.0181628 0.0006263 0.0002088 0.0006263 0.0002088 0.0002088 
##       138       147       150       167       170       175       200 
## 0.0002088 0.0002088 0.0025052 0.0002088 0.0002088 0.0004175 0.0033403 
##       201       240       250       300 
## 0.0002088 0.0002088 0.0008351 0.0018789

summary(gss.men$numwomen)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     2.0     5.0    13.8    15.0   300.0

# a picture is better
hist(gss.men$numwomen, breaks = 300, freq = T)  #still not that useful

[Figure: histogram of gss.men$numwomen, 300 breaks]

# how about just for those with fewer than 50?
hist(subset(gss.men$numwomen, gss.men$numwomen < 50), freq = T, breaks = 50)  #more useful

[Figure: histogram of gss.men$numwomen for men reporting fewer than 50 female partners, 50 breaks]

hist(gss.men$nummen, breaks = 300)  #not very useful at all

[Figure: histogram of gss.men$nummen, 300 breaks]

table(gss.men$nummen)  #very few men have > 0 male sex partners

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   15   18 
## 4481   62   36   29   22   12   13    5    2    5   20    1    6    6    4 
##   20   21   22   25   30   32   35   40   41   45   50   70   75   80   90 
##   23    1    3    6    8    1    2    3    1    2   13    1    1    1    1 
##  100  110  150  200  300 
##   10    1    1    3    4

summary(gss.men$nummen)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    1.32    0.00  300.00

To calculate the number of men who report male sex partners, we can count the TRUEs returned by the statement: is the number of male sex partners reported greater than 0?

head(gss.men$nummen > 0)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

# A bunch of TRUES and FALSES. In R, when we sum up TRUES, it treats them as
# 1 and FALSES as zeroes.
sum(gss.men$nummen > 0)

## [1] 309

# 309 TRUES
sum(!(gss.men$nummen > 0))

## [1] 4481

sum((gss.men$nummen > 0) == FALSE)

## [1] 4481

sum((gss.men$nummen > 0) == F)

## [1] 4481

# All different ways to get falses.

# As a percentage:
sum(gss.men$nummen > 0)/nrow(gss.men) * 100

## [1] 6.451
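Because TRUE counts as 1 and FALSE as 0, taking the mean of a logical vector gives the proportion of TRUEs directly. Here is a sketch on a small made-up vector; partners and has.any are hypothetical names and values, not GSS data.

```r
# Hypothetical counts for eight respondents, NOT the GSS
partners = c(0, 0, 2, 0, 1, 0, 0, 0)
has.any = partners > 0

sum(has.any)                          # number of TRUEs: 2
sum(has.any)/length(has.any) * 100    # percentage, the long way: 25
mean(has.any) * 100                   # same percentage, using the mean: 25
```

If the variable contains missing values, sum() and mean() return NA unless you add na.rm = TRUE, so it is worth checking for NAs before trusting a percentage.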

Exercise 1: Write expressions that calculate the following:

Multiply each of these numbers by three: 100,2,33,4,10

Create a sequence of numbers, 20 numbers long that starts at zero and counts up by 1

Add together these vectors: 8,10,22,44 and 9,10,54,34

Exercise 2: Create a data frame of just the people who report fewer than 300 male or female sex partners. Do the following with this group:

Create a table of both nummen and numwomen by sex
What number of women report at least 1 female sex partner? Use a logical expression and sum().
What number of men?
Calculate the mean, the range, and standard deviation of the number of same sex partners for men and for women who report at least 1 same-sex partner.
Describe what is different between men and women in terms of the number of same-sex partners they have had.

Exercise 3: Create two histograms: 1) the number of female partners for men and 2) the number of male partners for women. What patterns do you see for each group separately? What are the key differences between the groups?

Exercise 4: With the men, create a new variable that calculates the number of opposite sex partners divided by the number of years since the respondent's 18th birthday. Make a histogram that displays this statistic. Experiment with “breaks=” to make it look good. In your script file, keep the histogram with the best “breaks”.

Exercise 5: The GSS asks if the respondent has had an extramarital affair. This is the variable evstray. What proportion of married men have “strayed”, and what proportion of women? Show me the R commands to produce these proportions.

Exercise 6: Create a subset of the data restricted to only the “strayers.” If you were to predict the religious affiliation of “strayers”, what would you guess? Now, figure out the proportion of “strayers” by religious tradition (in the variable reltrad). Make a barplot of the proportions to visualize these data. What is this plot telling us? What conclusions can we reach? What might lie behind this result?