Statistics takes information and helps us learn stuff about that information. In general, we can break statistics into two, related tasks:
The first step in any stastistical analysis is to summarize the information we wish to analyze. In technical statistics-speak, this is called “Exploratory Data Analysis.”
You are going to work with a selection of data from the United States General Social Suvery (the GSS), a survey, conducted every two years, that collects information about a represenstative sample of people in the United States. The survey began in 1972 and continues to the present day. See: http://www3.norc.org/GSS+Website.
We are going to look at some “juicy” variables, just to keep you interested. Since 1998, the GSS has collected information asking individuals about their number of sex partners, frequency of intercourse, and extramarital relationships.
I've made a dataset that consists of the responses to several questions on the GSS. Along with their answers to the questions about sex, I've included variables for age, education level, marital status, income, religious tradition, whether there are children present in the home, and a couple of other random items (just to keep it interesting).
First of all, you will need to get the data. I put a copy of it online, which you can download using the load command.
# load data into an object called 'gss' I've made this file ahead of time.
# It will load the data into your computer's memory. the object gss is a
# dataframe, where each row corresponds to a person who answered the survey
# the columns are the variable names
load(url("http://www.soc.duke.edu/~dee4/soc333data/gssHW2.data"))
# Look on the 'Environment' tab. You should see gss, a dataframe that has
# 10760 observations and 13 variables These are the names of the variables:
names(gss)
## [1] "age" "sex" "coninc" "evstray" "sexfreq" "year"
## [7] "marital" "numwomen" "nummen" "reltrad" "degree" "premarsx"
## [13] "attend"
# Here is the size of the data set (remember rows are observations, columns
# are variables)
dim(gss)
## [1] 10760 13
One thing we'd like to be able to do is count the number of people who are in various categories. We can express these as raw counts (e.g. 30 people said yes to a particular question), or they can be expressed as proportions (e.g. 10% of people responded yes to a particular question).
One thing we might care about is what year the surveys were conducted in. If you look at the the output from the names command, you see there is a variable called year, which records the year the survey was conducted. We might want to count the number of surveys we have for each year. We also might wonder what proportion of the surveys were done in each year. There are two commands that do this for us, table() and prop.table().
# Table makes a table of a variable with the freqency counts
table(gss$year)
##
## 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012
## 658 766 1224 1234 1135 1031 516 555 1221 849 965 606
# We can store this table in the computer's memory by giving it a name. I'll
# call it 'MyTable'
MyTable = table(gss$year)
# We have created our first object! It's a table called 'MyTable'. We can
# look at it by typing:
MyTable
##
## 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012
## 658 766 1224 1234 1135 1031 516 555 1221 849 965 606
# We can count how many values MyTable has by asking R for MyTable's length
length(MyTable)
## [1] 12
dim(MyTable)
## [1] 12
# It has 12 values If I want to figure out the proportion of people in
# various categories I use the command prop.table(). prop.table() is a
# command that we have to use on a table, not on the raw data.
prop.table(MyTable)
##
## 1991 1993 1994 1996 1998 2000 2002 2004 2006
## 0.06115 0.07119 0.11375 0.11468 0.10548 0.09582 0.04796 0.05158 0.11348
## 2008 2010 2012
## 0.07890 0.08968 0.05632
Histograms do the same thing, they just use a picture to do it. Histograms can display either counts are proportions. Here's how you make a histogram. It will show up in you “plots” tab. You can click the “zoom” button above the graph to make it bigger.
To make a histogram we create a bunch of bins, but each observation in a bin, and add up the number of people in each bin.
#Histogram shows the same thing as table and prop.table
hist(gss$year,main="Number of Observations Per Year",xlim=c(1990,2015),density=8)
#If I want to get proportions I just add an option
hist(gss$year,main="Number of Observations Per Year",xlim=c(1990,2015),ylim=c(0,.15),density=8,prob=T)
To just get a glimpse of the first 10 rows of a data frame, type (this is useful if you have lots of observations):
head(gss["age"])
## age
## 26266 61
## 26268 35
## 26271 59
## 26275 64
## 26276 72
## 26277 67
tail(gss["age"])
## age
## 57050 50
## 57051 23
## 57052 65
## 57056 25
## 57057 61
## 57060 37
# why is this different?
head(gss$age)
## [1] 61 35 59 64 72 67
When doing data analysis there are, in general, three types of variables:
Continuous Variables are, well, continuous. They are numbers that have some direct meaning. Age is continuous. So is GPA, height, weight, how much money you made last year, etc.
These are variables that indicate an observation is part of a particular category. Categorical variables have no intrinsic ordering. Think about gender, male and female. There is no intrinsic order to male and female. Categorical variables can be numbers, we can say “1s are girls and 2s are boys,” for instance.
Pretty much the same as categorical (or nominal) variables, there's just some, well, ordering. An ordinal variable could be education: do you have a GED, a college degree, a graduate degree? The key is there is an order: Do you earn <$5,000/year, $5,000-$50,000/year or >$50,000/year? Questions that ask for your opinion often are ordinal variables: Indicate your level of agreement with the following statement: “I like Jiffy Peanut Butter”, Strongly Agree/Agree/Neutral/Disagree/Strongly Disagree.
Here are some examples. Continuous variables:
hist(gss$age, main = "Age Spread")
hist(gss$coninc, main = "Annual Income in Constant Dollars")
Categorical variables:
table(gss$sex)
##
## male female
## 4802 5958
table(gss$degree)
##
## lt high school high school junior college bachelor graduate
## 1249 5704 848 1980 979
Ordinal Variables:
print("There's been a lot of discussion about the way morals and attitudes about sex are changing in this country. If a man and woman have sex relations before marriage, do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?")
## [1] "There's been a lot of discussion about the way morals and attitudes about sex are changing in this country. If a man and woman have sex relations before marriage, do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?"
table(gss$premarsx)
##
## always wrong almst always wrg sometimes wrong not wrong at all
## 2444 899 2183 5234
# don't worry about iap, other, dk. There are stuff from the GSS about
# missing responses. I've deleted them all.
####Getting a grip on your data
There is a lot of data here. A good first step is to try and distill this information down. There are 10,000+ observations in these data. A good command to start with is summary.
summary(gss)
## age sex coninc evstray
## Min. :18.0 male :4802 Min. : 383 yes :1503
## 1st Qu.:31.0 female:5958 1st Qu.: 19893 no :6519
## Median :42.0 Median : 38752 never married:2738
## Mean :44.1 Mean : 49046
## 3rd Qu.:55.0 3rd Qu.: 64546
## Max. :89.0 Max. :178712
##
## sexfreq year marital
## not at all :2128 Min. :1991 married :5300
## once or twice : 833 1st Qu.:1996 widowed : 775
## once a month :1211 Median :2000 divorced :1591
## 2-3 times a month:1784 Mean :2001 separated : 355
## weekly :1884 3rd Qu.:2006 never married:2739
## 2-3 per week :2203 Max. :2012
## 4+ per week : 717
## numwomen nummen reltrad
## Min. : 0.0 Min. : 0.0 evangelical :2768
## 1st Qu.: 0.0 1st Qu.: 0.0 mainline :1929
## Median : 0.0 Median : 1.0 black protestant: 828
## Mean : 6.9 Mean : 3.6 catholic :2696
## 3rd Qu.: 5.0 3rd Qu.: 3.0 jewish : 203
## Max. :989.0 Max. :989.0 other faith : 660
## nonaffiliated :1676
## degree premarsx attend
## lt high school:1249 always wrong :2444 never :2067
## high school :5704 almst always wrg: 899 every week :2007
## junior college: 848 sometimes wrong :2183 once a year :1495
## bachelor :1980 not wrong at all:5234 sevrl times a yr:1255
## graduate : 979 2-3x a month : 993
## once a month : 840
## (Other) :2103
For continuous variables, we get the minimum, 1 quartile, Median, Mean, 3rd quartile, the maximum, and the number of missing values - what R calls NA's. If this is too much to take in at once, you can do summary on a single slice or a single variable.
summary(gss["sex"])
## sex
## male :4802
## female:5958
summary(gss$sex)
## male female
## 4802 5958
summary(gss["age"])
## age
## Min. :18.0
## 1st Qu.:31.0
## Median :42.0
## Mean :44.1
## 3rd Qu.:55.0
## Max. :89.0
summary(gss$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 31.0 42.0 44.1 55.0 89.0
Again, table just produces counts of the number of observations in each category. Again, if you want to convert this to propotions (percents/100), just add in the prop.table() command. Notice that in the second line of code, I've nested two functions - the function table is nested in prop.table.
table(gss$year)
##
## 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012
## 658 766 1224 1234 1135 1031 516 555 1221 849 965 606
prop.table(table(gss$year))
##
## 1991 1993 1994 1996 1998 2000 2002 2004 2006
## 0.06115 0.07119 0.11375 0.11468 0.10548 0.09582 0.04796 0.05158 0.11348
## 2008 2010 2012
## 0.07890 0.08968 0.05632
You can nest like crazy. Say you want a barplot of the percentage of people in each year and don't want to use the histogram command. Here, you can use a barplot. Barplot doesn't add anything up. It just plots x vs. y.
# Make a variable of the percentage of survey responses in each year
yearTable = prop.table(table(gss$year))
# Now plot this
barplot(yearTable, ylab = "Percentage", las = 2)
Tables can summarize multiple variables. Here, we want to look at how often people have had sexual intercourse in the past year by gender. We do this by creating a two-way table. Two way tables (or cross-tabs), summarize data by groupings.
# Here it is not by gender, just everyone
table(gss$sexfreq)
##
## not at all once or twice once a month 2-3 times a month
## 2128 833 1211 1784
## weekly 2-3 per week 4+ per week
## 1884 2203 717
prop.table(table(gss$sexfreq))
##
## not at all once or twice once a month 2-3 times a month
## 0.19777 0.07742 0.11255 0.16580
## weekly 2-3 per week 4+ per week
## 0.17509 0.20474 0.06664
table(gss$sexfreq, gss$sex) #Hard to make sense because there aren't the same number of males and females in the dataset.
##
## male female
## not at all 709 1419
## once or twice 399 434
## once a month 559 652
## 2-3 times a month 823 961
## weekly 894 990
## 2-3 per week 1037 1166
## 4+ per week 381 336
prop.table(table(gss$sexfreq, gss$sex)) * 100 #Better, because now they are percentages
##
## male female
## not at all 6.589 13.188
## once or twice 3.708 4.033
## once a month 5.195 6.059
## 2-3 times a month 7.649 8.931
## weekly 8.309 9.201
## 2-3 per week 9.638 10.836
## 4+ per week 3.541 3.123
What might explain the big gap in men and women in the “not getting any” category?
By default this is the percentage of each category out of the total number of people. The total number of people in this dataset is 10,760 (obtained by dim(gss)).
Maybe we want to know what proportion of married people are males, or what proportion of males are married.
This is done by specifying the margin. The margin has a value of either 1 or 2.
When margin=1 you are asking for the proportions by ROW. When margin=2 you are asking for the proportions by COLUMN. Here's an example:
MyTable = table(gss$sexfreq, gss$sex)
prop.table(MyTable) #All the cells add up to 1
##
## male female
## not at all 0.06589 0.13188
## once or twice 0.03708 0.04033
## once a month 0.05195 0.06059
## 2-3 times a month 0.07649 0.08931
## weekly 0.08309 0.09201
## 2-3 per week 0.09638 0.10836
## 4+ per week 0.03541 0.03123
sum(prop.table(MyTable)) #All the cells add up to 1
## [1] 1
prop.table(MyTable, 1) #All the ROWS add up to 1
##
## male female
## not at all 0.3332 0.6668
## once or twice 0.4790 0.5210
## once a month 0.4616 0.5384
## 2-3 times a month 0.4613 0.5387
## weekly 0.4745 0.5255
## 2-3 per week 0.4707 0.5293
## 4+ per week 0.5314 0.4686
sum(prop.table(MyTable, 1)[1, ])
## [1] 1
sum(prop.table(MyTable, 1)[2, ])
## [1] 1
sum(prop.table(MyTable, 1)[3, ]) #ETC.
## [1] 1
prop.table(MyTable, 2) #All the COLUMNS add up to 1
##
## male female
## not at all 0.14765 0.23817
## once or twice 0.08309 0.07284
## once a month 0.11641 0.10943
## 2-3 times a month 0.17139 0.16130
## weekly 0.18617 0.16616
## 2-3 per week 0.21595 0.19570
## 4+ per week 0.07934 0.05639
sum(prop.table(MyTable, 2)[, 1])
## [1] 1
sum(prop.table(MyTable, 2)[, 2])
## [1] 1
In the textbook, we've begun talking about measures of central tendency and measures of spread. Taking the variable coninc lets calculate several statistics and then make some plots. coninc is household income in 2006 dollars.
mean(gss$coninc)
## [1] 49046
median(gss$coninc)
## [1] 38752
quantile(gss$coninc)
## 0% 25% 50% 75% 100%
## 383 19893 38752 64546 178712
# To get the top 1%, you specify the 'probs argument'
quantile(gss$coninc, probs = 0.99)
## 99%
## 178266
# The bottom 10%
quantile(gss$coninc, probs = 0.1)
## 10%
## 9630
# The median is the 50th percentile
quantile(gss$coninc, probs = 0.5)
## 50%
## 38752
median(gss$coninc)
## [1] 38752
# Standard Deviation
sd(gss$coninc)
## [1] 39685
# Oddly enough, R lacks a built-in function for the mode! The mode is kinda
# useless, anyway. It's just the most common value.
And, we can visual these data with a histogram to get a picture of the distribution of the data. What does this historgram tell us about income? Which way is it skewed - positive or negative?
hist(gss$coninc)
Exercise 0: Create a new R script file, save it as LASTNAME_HW2.R.
Exercise 1: Complete the following tasks:
A1: Make a table of the variable marital. How many people are separated? What proportion of people are separated?
A2: What is the median age? What is the 99th percentile of age? What is the 25th percentile of age? What is the 75th percentile of age?
B1: Make a table of
sexfreqversusreltrad. How many evangelicals are having sex 4+ times per week?B2: Out of the whole group, what religious group is getting no sex the most?
B3: Among evangelicals, what is the most common frequency of having sex?
B4: Among the unaffiliated, what is the least common frequency of having sex?
C1: Make a table of
agevs.sexfreq.C2: What proportion of 36 year olds are getting sex 4+ times a week?
C3: What proportion of 80 year olds are getting any sex?
Exercise 2: Make a histogram of the variable
age. Make histograms adding the option breaks=1, breaks=2, breaks=10, and breaks=71. Explain what makes these histograms different.Exercise 3: Answer the following:
A) What is the average age of people in the dataset?
B) What is the standard deviation of the age?
C) What is the median age? What is the variance of age?
Exercise 4:
A) What is the average number of female sex partners respondents report? (
numwomen) What is the median? What is the standard deviation?B) Make a barplot of the number of female sex partners. How is it skewed?
C) What might be misleading about this statistic?
Exercise 5:
A) What is the average number of male sex partners respondents report? (
nummen) What is the median? What is the standard deviation?B) What do the differences in the average, median and standard deviation tell us?