Homework #2: Introducing Data

Sociology 333: Introduction to Quantitative Analysis

Duke University, Summer 2014, Instructor: David Eagle, PhD (Cand.)

Statistics takes information and helps us learn stuff about that information. In general, we can break statistics into two, related tasks:

The first step in any stastistical analysis is to summarize the information we wish to analyze. In technical statistics-speak, this is called “Exploratory Data Analysis.”

You are going to work with a selection of data from the United States General Social Suvery (the GSS), a survey, conducted every two years, that collects information about a represenstative sample of people in the United States. The survey began in 1972 and continues to the present day. See: http://www3.norc.org/GSS+Website.

We are going to look at some “juicy” variables, just to keep you interested. Since 1998, the GSS has collected information asking individuals about their number of sex partners, frequency of intercourse, and extramarital relationships.

I've made a dataset that consists of the responses to several questions on the GSS. Along with their answers to the questions about sex, I've included variables for age, education level, marital status, income, religious tradition, whether there are children present in the home, and a couple of other random items (just to keep it interesting).

First of all, you will need to get the data. I put a copy of it online, which you can download using the load command.

# load data into an object called 'gss' I've made this file ahead of time.
# It will load the data into your computer's memory. the object gss is a
# dataframe, where each row corresponds to a person who answered the survey
# the columns are the variable names
load(url("http://www.soc.duke.edu/~dee4/soc333data/gssHW2.data"))
# Look on the 'Environment' tab. You should see gss, a dataframe that has
# 10760 observations and 13 variables These are the names of the variables:
names(gss)
##  [1] "age"      "sex"      "coninc"   "evstray"  "sexfreq"  "year"    
##  [7] "marital"  "numwomen" "nummen"   "reltrad"  "degree"   "premarsx"
## [13] "attend"
# Here is the size of the data set (remember rows are observations, columns
# are variables)
dim(gss)
## [1] 10760    13

One thing we'd like to be able to do is count the number of people who are in various categories. We can express these as raw counts (e.g. 30 people said yes to a particular question), or they can be expressed as proportions (e.g. 10% of people responded yes to a particular question).

One thing we might care about is what year the surveys were conducted in. If you look at the the output from the names command, you see there is a variable called year, which records the year the survey was conducted. We might want to count the number of surveys we have for each year. We also might wonder what proportion of the surveys were done in each year. There are two commands that do this for us, table() and prop.table().

# Table makes a table of a variable with the freqency counts
table(gss$year)
## 
## 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 
##  658  766 1224 1234 1135 1031  516  555 1221  849  965  606
# We can store this table in the computer's memory by giving it a name. I'll
# call it 'MyTable'
MyTable = table(gss$year)
# We have created our first object! It's a table called 'MyTable'. We can
# look at it by typing:
MyTable
## 
## 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 
##  658  766 1224 1234 1135 1031  516  555 1221  849  965  606
# We can count how many values MyTable has by asking R for MyTable's length
length(MyTable)
## [1] 12
dim(MyTable)
## [1] 12
# It has 12 values If I want to figure out the proportion of people in
# various categories I use the command prop.table(). prop.table() is a
# command that we have to use on a table, not on the raw data.
prop.table(MyTable)
## 
##    1991    1993    1994    1996    1998    2000    2002    2004    2006 
## 0.06115 0.07119 0.11375 0.11468 0.10548 0.09582 0.04796 0.05158 0.11348 
##    2008    2010    2012 
## 0.07890 0.08968 0.05632

Histograms do the same thing, they just use a picture to do it. Histograms can display either counts are proportions. Here's how you make a histogram. It will show up in you “plots” tab. You can click the “zoom” button above the graph to make it bigger.

To make a histogram we create a bunch of bins, but each observation in a bin, and add up the number of people in each bin.

#Histogram shows the same thing as table and prop.table
hist(gss$year,main="Number of Observations Per Year",xlim=c(1990,2015),density=8)
#If I want to get proportions I just add an option
hist(gss$year,main="Number of Observations Per Year",xlim=c(1990,2015),ylim=c(0,.15),density=8,prob=T)

To just get a glimpse of the first 10 rows of a data frame, type (this is useful if you have lots of observations):

head(gss["age"])
##       age
## 26266  61
## 26268  35
## 26271  59
## 26275  64
## 26276  72
## 26277  67
tail(gss["age"])
##       age
## 57050  50
## 57051  23
## 57052  65
## 57056  25
## 57057  61
## 57060  37
# why is this different?
head(gss$age)
## [1] 61 35 59 64 72 67

When doing data analysis there are, in general, three types of variables:

Continuous Variables

Continuous Variables are, well, continuous. They are numbers that have some direct meaning. Age is continuous. So is GPA, height, weight, how much money you made last year, etc.

Categorical Variables

These are variables that indicate an observation is part of a particular category. Categorical variables have no intrinsic ordering. Think about gender, male and female. There is no intrinsic order to male and female. Categorical variables can be numbers, we can say “1s are girls and 2s are boys,” for instance.

Ordinal Variables

Pretty much the same as categorical (or nominal) variables, there's just some, well, ordering. An ordinal variable could be education: do you have a GED, a college degree, a graduate degree? The key is there is an order: Do you earn <$5,000/year, $5,000-$50,000/year or >$50,000/year? Questions that ask for your opinion often are ordinal variables: Indicate your level of agreement with the following statement: “I like Jiffy Peanut Butter”, Strongly Agree/Agree/Neutral/Disagree/Strongly Disagree.

Here are some examples. Continuous variables:

hist(gss$age, main = "Age Spread")

plot of chunk unnamed-chunk-4

hist(gss$coninc, main = "Annual Income in Constant Dollars")

plot of chunk unnamed-chunk-4

Categorical variables:

table(gss$sex)
## 
##   male female 
##   4802   5958
table(gss$degree)
## 
## lt high school    high school junior college       bachelor       graduate 
##           1249           5704            848           1980            979

Ordinal Variables:

print("There's been a lot of discussion about the way morals and attitudes about sex are changing  in this country. If a man and woman have sex relations before marriage, do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?")
## [1] "There's been a lot of discussion about the way morals and attitudes about sex are changing  in this country. If a man and woman have sex relations before marriage, do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?"
table(gss$premarsx)
## 
##     always wrong almst always wrg  sometimes wrong not wrong at all 
##             2444              899             2183             5234
# don't worry about iap, other, dk. There are stuff from the GSS about
# missing responses. I've deleted them all.

####Getting a grip on your data There is a lot of data here. A good first step is to try and distill this information down. There are 10,000+ observations in these data. A good command to start with is summary.

summary(gss)
##       age           sex           coninc                evstray    
##  Min.   :18.0   male  :4802   Min.   :   383   yes          :1503  
##  1st Qu.:31.0   female:5958   1st Qu.: 19893   no           :6519  
##  Median :42.0                 Median : 38752   never married:2738  
##  Mean   :44.1                 Mean   : 49046                       
##  3rd Qu.:55.0                 3rd Qu.: 64546                       
##  Max.   :89.0                 Max.   :178712                       
##                                                                    
##               sexfreq          year               marital    
##  not at all       :2128   Min.   :1991   married      :5300  
##  once or twice    : 833   1st Qu.:1996   widowed      : 775  
##  once a month     :1211   Median :2000   divorced     :1591  
##  2-3 times a month:1784   Mean   :2001   separated    : 355  
##  weekly           :1884   3rd Qu.:2006   never married:2739  
##  2-3 per week     :2203   Max.   :2012                       
##  4+ per week      : 717                                      
##     numwomen         nummen                  reltrad    
##  Min.   :  0.0   Min.   :  0.0   evangelical     :2768  
##  1st Qu.:  0.0   1st Qu.:  0.0   mainline        :1929  
##  Median :  0.0   Median :  1.0   black protestant: 828  
##  Mean   :  6.9   Mean   :  3.6   catholic        :2696  
##  3rd Qu.:  5.0   3rd Qu.:  3.0   jewish          : 203  
##  Max.   :989.0   Max.   :989.0   other faith     : 660  
##                                  nonaffiliated   :1676  
##             degree                 premarsx                 attend    
##  lt high school:1249   always wrong    :2444   never           :2067  
##  high school   :5704   almst always wrg: 899   every week      :2007  
##  junior college: 848   sometimes wrong :2183   once a year     :1495  
##  bachelor      :1980   not wrong at all:5234   sevrl times a yr:1255  
##  graduate      : 979                           2-3x a month    : 993  
##                                                once a month    : 840  
##                                                (Other)         :2103

For continuous variables, we get the minimum, 1 quartile, Median, Mean, 3rd quartile, the maximum, and the number of missing values - what R calls NA's. If this is too much to take in at once, you can do summary on a single slice or a single variable.

summary(gss["sex"])
##      sex      
##  male  :4802  
##  female:5958
summary(gss$sex)
##   male female 
##   4802   5958
summary(gss["age"])
##       age      
##  Min.   :18.0  
##  1st Qu.:31.0  
##  Median :42.0  
##  Mean   :44.1  
##  3rd Qu.:55.0  
##  Max.   :89.0
summary(gss$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    31.0    42.0    44.1    55.0    89.0

Again, table just produces counts of the number of observations in each category. Again, if you want to convert this to propotions (percents/100), just add in the prop.table() command. Notice that in the second line of code, I've nested two functions - the function table is nested in prop.table.

table(gss$year)
## 
## 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 
##  658  766 1224 1234 1135 1031  516  555 1221  849  965  606
prop.table(table(gss$year))
## 
##    1991    1993    1994    1996    1998    2000    2002    2004    2006 
## 0.06115 0.07119 0.11375 0.11468 0.10548 0.09582 0.04796 0.05158 0.11348 
##    2008    2010    2012 
## 0.07890 0.08968 0.05632

You can nest like crazy. Say you want a barplot of the percentage of people in each year and don't want to use the histogram command. Here, you can use a barplot. Barplot doesn't add anything up. It just plots x vs. y.

# Make a variable of the percentage of survey responses in each year
yearTable = prop.table(table(gss$year))
# Now plot this
barplot(yearTable, ylab = "Percentage", las = 2)

plot of chunk unnamed-chunk-10

Tables can summarize multiple variables. Here, we want to look at how often people have had sexual intercourse in the past year by gender. We do this by creating a two-way table. Two way tables (or cross-tabs), summarize data by groupings.

# Here it is not by gender, just everyone
table(gss$sexfreq)
## 
##        not at all     once or twice      once a month 2-3 times a month 
##              2128               833              1211              1784 
##            weekly      2-3 per week       4+ per week 
##              1884              2203               717
prop.table(table(gss$sexfreq))
## 
##        not at all     once or twice      once a month 2-3 times a month 
##           0.19777           0.07742           0.11255           0.16580 
##            weekly      2-3 per week       4+ per week 
##           0.17509           0.20474           0.06664
table(gss$sexfreq, gss$sex)  #Hard to make sense because there aren't the same number of males and females in the dataset.
##                    
##                     male female
##   not at all         709   1419
##   once or twice      399    434
##   once a month       559    652
##   2-3 times a month  823    961
##   weekly             894    990
##   2-3 per week      1037   1166
##   4+ per week        381    336
prop.table(table(gss$sexfreq, gss$sex)) * 100  #Better, because now they are percentages
##                    
##                       male female
##   not at all         6.589 13.188
##   once or twice      3.708  4.033
##   once a month       5.195  6.059
##   2-3 times a month  7.649  8.931
##   weekly             8.309  9.201
##   2-3 per week       9.638 10.836
##   4+ per week        3.541  3.123

What might explain the big gap in men and women in the “not getting any” category?

By default this is the percentage of each category out of the total number of people. The total number of people in this dataset is 10,760 (obtained by dim(gss)).

Maybe we want to know what proportion of married people are males, or what proportion of males are married.

This is done by specifying the margin. The margin has a value of either 1 or 2. When margin=1 you are asking for the proportions by ROW. When margin=2 you are asking for the proportions by COLUMN. Here's an example:

MyTable = table(gss$sexfreq, gss$sex)
prop.table(MyTable)  #All the cells add up to 1
##                    
##                        male  female
##   not at all        0.06589 0.13188
##   once or twice     0.03708 0.04033
##   once a month      0.05195 0.06059
##   2-3 times a month 0.07649 0.08931
##   weekly            0.08309 0.09201
##   2-3 per week      0.09638 0.10836
##   4+ per week       0.03541 0.03123
sum(prop.table(MyTable))  #All the cells add up to 1
## [1] 1
prop.table(MyTable, 1)  #All the ROWS add up to 1
##                    
##                       male female
##   not at all        0.3332 0.6668
##   once or twice     0.4790 0.5210
##   once a month      0.4616 0.5384
##   2-3 times a month 0.4613 0.5387
##   weekly            0.4745 0.5255
##   2-3 per week      0.4707 0.5293
##   4+ per week       0.5314 0.4686
sum(prop.table(MyTable, 1)[1, ])
## [1] 1
sum(prop.table(MyTable, 1)[2, ])
## [1] 1
sum(prop.table(MyTable, 1)[3, ])  #ETC.
## [1] 1
prop.table(MyTable, 2)  #All the COLUMNS add up to 1
##                    
##                        male  female
##   not at all        0.14765 0.23817
##   once or twice     0.08309 0.07284
##   once a month      0.11641 0.10943
##   2-3 times a month 0.17139 0.16130
##   weekly            0.18617 0.16616
##   2-3 per week      0.21595 0.19570
##   4+ per week       0.07934 0.05639
sum(prop.table(MyTable, 2)[, 1])
## [1] 1
sum(prop.table(MyTable, 2)[, 2])
## [1] 1

In the textbook, we've begun talking about measures of central tendency and measures of spread. Taking the variable coninc lets calculate several statistics and then make some plots. coninc is household income in 2006 dollars.

mean(gss$coninc)
## [1] 49046
median(gss$coninc)
## [1] 38752
quantile(gss$coninc)
##     0%    25%    50%    75%   100% 
##    383  19893  38752  64546 178712
# To get the top 1%, you specify the 'probs argument'
quantile(gss$coninc, probs = 0.99)
##    99% 
## 178266
# The bottom 10%
quantile(gss$coninc, probs = 0.1)
##  10% 
## 9630
# The median is the 50th percentile
quantile(gss$coninc, probs = 0.5)
##   50% 
## 38752
median(gss$coninc)
## [1] 38752
# Standard Deviation
sd(gss$coninc)
## [1] 39685
# Oddly enough, R lacks a built-in function for the mode!  The mode is kinda
# useless, anyway. It's just the most common value.

And, we can visual these data with a histogram to get a picture of the distribution of the data. What does this historgram tell us about income? Which way is it skewed - positive or negative?

hist(gss$coninc)

plot of chunk unnamed-chunk-14

Exercises to Complete and Hand In

Exercise 0: Create a new R script file, save it as LASTNAME_HW2.R.

Exercise 1: Complete the following tasks:

A1: Make a table of the variable marital. How many people are separated? What proportion of people are separated?

A2: What is the median age? What is the 99th percentile of age? What is the 25th percentile of age? What is the 75th percentile of age?

B1: Make a table of sexfreq versus reltrad. How many evangelicals are having sex 4+ times per week?

B2: Out of the whole group, what religious group is getting no sex the most?

B3: Among evangelicals, what is the most common frequency of having sex?

B4: Among the unaffiliated, what is the least common frequency of having sex?

C1: Make a table of age vs. sexfreq.

C2: What proportion of 36 year olds are getting sex 4+ times a week?

C3: What proportion of 80 year olds are getting any sex?

Exercise 2: Make a histogram of the variable age. Make histograms adding the option breaks=1, breaks=2, breaks=10, and breaks=71. Explain what makes these histograms different.

Exercise 3: Answer the following:

A) What is the average age of people in the dataset?

B) What is the standard deviation of the age?

C) What is the median age? What is the variance of age?

Exercise 4:

A) What is the average number of female sex partners respondents report? (numwomen) What is the median? What is the standard deviation?

B) Make a barplot of the number of female sex partners. How is it skewed?

C) What might be misleading about this statistic?

Exercise 5:

A) What is the average number of male sex partners respondents report? (nummen) What is the median? What is the standard deviation?

B) What do the differences in the average, median and standard deviation tell us?