Homework #9: Starting an Analysis From Scratch

#### Sociology 333: Introduction to Quantitative Analysis
#### Duke University, Summer 2014, Instructor: David Eagle, PhD (Cand.)

You are going to learn how to do a data analysis from scratch. Follow these steps to get going:

R is a language that you can extend using packages. Packages are just collections of commands that someone else has written that you can use. To use a package, you first need to download (install) it. Once it is installed, you need to attach it in each R session where you want to use it.

1. To install a package, click Tools>Install Packages. In the “Packages” box, type the name of your package. Here, we want to install the “foreign” package, so type the word foreign into the box. Click OK. This package will allow us to read data from different statistical programs like Stata and SPSS.
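If you prefer typing to clicking, the same installation can be done from the console. This is only a sketch: it installs foreign when it is missing and does nothing otherwise. It assumes your computer can reach a CRAN mirror if an installation is actually needed.

```r
# Install "foreign" only if it is not already on this computer.
# requireNamespace() checks for the package without attaching it.
if (!requireNamespace("foreign", quietly = TRUE)) {
  install.packages("foreign")
}

# Confirm that the package is now available.
foreign_ready <- requireNamespace("foreign", quietly = TRUE)
print(foreign_ready)
```

You only need to install a package once per computer, but you need to attach it with library() in every new R session.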

2. Now, you are going to head over to the NORC website - these guys do the General Social Survey - and download the 2012 General Social Survey data. First, go to: http://www3.norc.org/GSS+Website/. Now click on the Download tab and click on Stata Format. From this page click on the “2012” link underneath: “GSS 1972-2012 Release 6”.

Or you can just click here: http://publicdata.norc.org:41000/gss/documents//OTHR/2012_stata.zip

3. Download this file to your hard drive (it should be a zip file).

4. Double-click on the file. This should open it (Windows) or unzip it. Now, find the file: GSS2012.DTA.

5. Move this file to a location on your hard drive where you want to do your analyses.

6. Now, in RStudio, make a new R script and save it in the same place as you saved the GSS data file.

7. Finally, in RStudio click Session>Set Working Directory>Choose Directory. Now find the directory where your files are saved and click “Open”. Now R knows where to look for the file.
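The menu clicks for setting the working directory have a console equivalent, setwd(). Here is a minimal sketch. The folder data_dir is a placeholder (R's temporary folder, so the code runs anywhere); you would replace it with the folder where you saved GSS2012.DTA.

```r
# Placeholder folder: swap in your own path, e.g. "C:/Users/me/soc333".
data_dir <- tempdir()

# setwd() changes the working directory and returns the previous one.
old_dir <- setwd(data_dir)

# getwd() shows where R is currently looking for files.
print(getwd())

# TRUE only if the data file is really in that folder.
print(file.exists("GSS2012.DTA"))

# Go back to where we started.
setwd(old_dir)
```

If file.exists() returns FALSE, R is looking in the wrong folder or the file name is spelled differently.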

8. You are now ready to load the data. To do that, you will load the foreign package. Type:

library(foreign)

9. To import the data into a data frame called gss, type:

gss = read.dta("GSS2012.DTA")

## Warning: duplicated levels in factors are deprecated
## Warning: duplicated levels in factors are deprecated
## Warning: duplicated levels in factors are deprecated
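The warnings most likely come from variables in the Stata file whose value labels repeat, and they do not stop the data from loading. If you want to see how read.dta() behaves without the GSS file, here is a small round-trip sketch. The data frame toy and the file toy.dta are made up for illustration.

```r
library(foreign)

# A tiny made-up data set with one factor (categorical) column.
toy <- data.frame(
  id = 1:4,
  marital = factor(c("married", "widowed", "married", "never married"))
)

# Write it out in Stata format, then read it back in.
toy_path <- file.path(tempdir(), "toy.dta")
write.dta(toy, toy_path)
toy_back <- read.dta(toy_path)

# Quick checks you can also run on gss after loading it.
print(dim(toy_back))    # number of rows and columns
print(names(toy_back))  # variable names
str(toy_back)           # type of each variable
```

Running dim(gss) and head(names(gss)) right after loading is a quick way to confirm the import worked.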

10. You've done it!

11. Now you've got a dataset. To use these data, you need to know what all the variables stand for. For that, you need the codebook, which you can download from here: http://publicdata.norc.org/GSS/DOCUMENTS/BOOK/GSS_Codebook.pdf. It's a big file! Open this up and take a look. Look through the list of variables to get a sense of all the stuff that is contained in the GSS.

The codebook has every variable ever included in the GSS. For each variable it lists the question that was asked and provides the counts for each answer in the different years the survey was given.

For instance, head over to page 127 in the codebook. You will find: “4. Are you currently–married, widowed, divorced, separated, or have you never been married?” Below that is [VAR: MARITAL], which tells you that this question is coded into the variable MARITAL.

Confirm that in 2012, 900 people reported being married. Check this with the data you downloaded:

We can attach our dataset…

attach(gss)

## The following object is masked from package:MASS:
## 
##     coop
## The following object is masked from package:base:
## 
##     version

table(marital)

## marital
##       married       widowed      divorced     separated never married 
##           900           163           317            68           526 
##            na 
##             0
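attach() is convenient, but it is not the only way to reach a variable. As a sketch on a made-up data frame (toy is not the GSS), you can name the data frame directly with $ or wrap the command in with():

```r
# Made-up data for illustration only.
toy <- data.frame(
  marital = factor(c("married", "widowed", "married", "never married"))
)

# Option 1: name the data frame explicitly with $.
counts <- table(toy$marital)
print(counts)

# Option 2: with() looks up variable names inside toy for one command.
counts_with <- with(toy, table(marital))
print(counts_with)
```

With the real data this would be table(gss$marital). Both options avoid the “masked” messages that attach() prints.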

12. There is also a variable called divorce. This is on p. 128 of the codebook. Notice it has 263 Yeses and 799 Nos. It also has 1 “No answer” and 911 “Not Applicable”. The no answers are people who refused to answer this question. The not applicables are people who were not asked the question: those who are currently divorced or separated, or who were never married (317 + 68 + 526 = 911 in the marital table above). Check out this variable in R:

table(divorce)

## divorce
## iap yes  no  dk  na 
##   0 263 799   0   0

Notice that the 912 people who didn't answer this question aren't listed. That's because R codes them into missing data. R assigns missing data the value of NA. Type:

summary(divorce)

##  iap  yes   no   dk   na NA's 
##    0  263  799    0    0  912
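It is easy to confuse the GSS level called “na” with R's missing value NA. A small sketch on made-up data (divorce_toy is not the real variable) shows the difference and two more ways to count missings:

```r
# Made-up factor with the same level names as divorce.
# The level "na" is an ordinary category; NA means the value is missing.
divorce_toy <- factor(
  c("yes", "no", NA, "no", NA),
  levels = c("iap", "yes", "no", "dk", "na")
)

# table() leaves missing values out by default.
print(table(divorce_toy))

# useNA = "ifany" adds a column for missing values when there are any.
print(table(divorce_toy, useNA = "ifany"))

# is.na() marks each missing value; sum() counts them.
n_missing <- sum(is.na(divorce_toy))
print(n_missing)
```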

Now, you'll see there are 912 NA's. There are also three categories with 0 in them: “iap”, “dk”, and “na”. These are codes from the GSS for missing data that we don't need to worry about. We'd like to just get rid of all the categories with 0's in them. We use the R command:

divorce = droplevels(divorce)
table(divorce)

## divorce
## yes  no 
## 263 799

We can just drop all the empty levels in the whole gss dataset, by detaching the data and running droplevels() on the whole dataset.

detach(gss)
gss = droplevels(gss)

## Warning: duplicated levels in factors are deprecated
## Warning: duplicated levels in factors are deprecated
## Warning: duplicated levels in factors are deprecated

attach(gss)

## The following object is masked _by_ .GlobalEnv:
## 
##     divorce
## The following object is masked from package:MASS:
## 
##     coop
## The following object is masked from package:base:
## 
##     version
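To see what droplevels() does to a whole data frame, you can try it on a small made-up one first (toy below is for illustration only):

```r
# Made-up data frame whose factors carry unused levels.
toy <- data.frame(
  divorce = factor(c("yes", "no", "no"),
                   levels = c("iap", "yes", "no", "dk", "na")),
  sex = factor(c("male", "female", "female"),
               levels = c("male", "female", "other"))
)

# Drop the unused levels from every factor column at once.
toy_clean <- droplevels(toy)

print(levels(toy$divorce))        # all five levels
print(levels(toy_clean$divorce))  # only the levels that occur
print(levels(toy_clean$sex))
```

The message that divorce is masked by .GlobalEnv appears because the earlier line divorce = droplevels(divorce) made a separate copy of divorce in your workspace. Typing rm(divorce) removes that copy, so that divorce refers to the column in gss again.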

13. There are many times when you want to only use a small number of variables in a dataset. You can quickly create a new data frame that has just the variables you need by doing the following:

varnames = c("marital", "sex", "coninc", "educ", "race", "degree")
newgss = gss[varnames]
names(newgss)

## [1] "marital" "sex"     "coninc"  "educ"    "race"    "degree"
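Here is the same idea on a made-up data frame, in case you want to experiment before touching gss. The names toy and keep are just for illustration.

```r
# Made-up data for illustration only.
toy <- data.frame(
  marital = factor(c("married", "widowed", "never married")),
  sex     = factor(c("male", "female", "female")),
  educ    = c(12, 16, 14),
  age     = c(30, 52, 41)
)

# Names of the columns we want to keep.
keep <- c("marital", "educ")

# Both forms select columns by name.
toy_small  <- toy[keep]
toy_small2 <- toy[, keep]

print(names(toy_small))
print(names(toy_small2))
```

One difference to watch for: if keep holds only one name, toy[, keep] returns a plain vector rather than a data frame, while toy[keep] always returns a data frame.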

14. Finally, we need to talk about numeric vs. factor variables. A numeric variable holds numbers that you can do arithmetic on. You can find out if something is numeric by typing:

class(coninc)

## [1] "numeric"

Now, what happens when we type this for the variable marital?

class(marital)

## [1] "factor"

It says it's a factor variable. A factor variable is a categorical variable. A categorical variable isn't a continuous number; instead, it places each person into one of a fixed set of categories, and table() counts the number of people in each category. The category names are called the levels in R.

table(marital)

## marital
##       married       widowed      divorced     separated never married 
##           900           163           317            68           526

levels(marital)

## [1] "married"       "widowed"       "divorced"      "separated"    
## [5] "never married"

So, for marital it has five levels named “married”, “widowed”, etc.
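A quick sketch on made-up vectors shows the difference between the two classes. Behind the scenes, a factor stores a whole-number code for each level, which you can see with as.integer().

```r
# Made-up values for illustration only.
income  <- c(25000, 48000, 61000)
marital <- factor(c("married", "widowed", "married"))

print(class(income))   # "numeric"
print(class(marital))  # "factor"

# Arithmetic makes sense for numeric variables.
print(mean(income))

# A factor has levels (sorted alphabetically by default)...
print(levels(marital))

# ...and each observation is stored as the position of its level.
print(as.integer(marital))
```

Taking the mean of those codes would not be meaningful, which is why R treats factors differently from numbers.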

Many times you want to lump categories together (or sometimes split them apart). To do that, you need to create new factor variables. For factor variables, you need to first of all tell R that you are making a factor. By default it assumes variables are numeric.

Let's say we wanted a categorical variable that indicated people who are married, divorced/widowed/separated, and never married. We want to call this variable marital2. I usually initialize variables with all NA's as values. That way, any NA's in the variable I want to recode get automatically transferred into my new variable.

# Need to put gss$marital because we want a new variable attached to gss
# data object
gss$marital2 = factor(NA, c("married", "div/wid/sep", "never married"))
# Logical indexing: x[condition] = value assigns value only to the cases
# where condition is true; all other cases keep their current value (NA)
gss$marital2[marital == "married"] = "married"
gss$marital2[marital == "widowed"] = "div/wid/sep"
gss$marital2[marital == "divorced"] = "div/wid/sep"
gss$marital2[marital == "separated"] = "div/wid/sep"
gss$marital2[marital == "never married"] = "never married"
detach(gss)
attach(gss)  #attaches the gss dataframe with the new variable

## The following object is masked _by_ .GlobalEnv:
## 
##     divorce
## The following object is masked from package:MASS:
## 
##     coop
## The following object is masked from package:base:
## 
##     version
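The recode can be tried out on a short made-up vector before running it on gss. This sketch uses the same start-with-NA approach, with %in% to handle the three lumped categories in one line. That shortcut is my own variation and not the only way to do it.

```r
# Made-up values, one per category plus one missing.
marital <- factor(
  c("married", "widowed", "divorced", "separated", "never married", NA),
  levels = c("married", "widowed", "divorced", "separated", "never married")
)

# Start with all NA's, declaring the new levels up front.
marital2 <- factor(rep(NA, length(marital)),
                   levels = c("married", "div/wid/sep", "never married"))

# Fill in each new category where the condition is true.
marital2[marital == "married"] <- "married"
marital2[marital %in% c("widowed", "divorced", "separated")] <- "div/wid/sep"
marital2[marital == "never married"] <- "never married"

# Cross-tabulate old against new to check every case landed where intended.
print(table(marital, marital2, useNA = "ifany"))
```

Checking a recode with a cross tabulation of the old and new variables is a good habit, because a misspelled level name silently leaves cases as NA.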

We could also create a variable for age that is divided into categories with:

gss$agecat = factor(NA, c("Elder", "Middle Aged", "Young"))
gss$agecat[age > 75] <- "Elder"
gss$agecat[age > 45 & age <= 75] <- "Middle Aged"
gss$agecat[age <= 45] <- "Young"
detach(gss)
attach(gss)

## The following object is masked _by_ .GlobalEnv:
## 
##     divorce
## The following object is masked from package:MASS:
## 
##     coop
## The following object is masked from package:base:
## 
##     version

table(agecat)

## agecat
##       Elder Middle Aged       Young 
##         156         887         926
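R also has a built-in function, cut(), that splits a numeric variable into categories in one step. The sketch below applies the same cutoffs to a made-up age vector. Note that it lists the levels from youngest to oldest, which is a different order from the agecat variable above.

```r
# Made-up ages, including the boundary values and one missing.
age <- c(18, 45, 46, 75, 76, NA)

# By default each interval includes its upper limit:
# (-Inf, 45], (45, 75], (75, Inf)
agecat <- cut(
  age,
  breaks = c(-Inf, 45, 75, Inf),
  labels = c("Young", "Middle Aged", "Elder")
)

# Show each age next to its category; the missing age stays NA.
print(data.frame(age, agecat))
print(table(agecat, useNA = "ifany"))
```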

16. Renaming Variables: If you want to rename variables in R, type: fix(gss). This opens a spreadsheet with your variable names at the top. You just click on the variable name and click “Change Name”. The changes will be saved on closing the window.
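Renaming can also be done with a line of code, which has the advantage of being saved in your script. A minimal sketch on a made-up data frame follows; the new name income is just an example.

```r
# Made-up data for illustration only.
toy <- data.frame(coninc = c(25000, 48000), educ = c(12, 16))

# names(toy) is the vector of column names; replace the one that matches.
names(toy)[names(toy) == "coninc"] <- "income"

print(names(toy))
```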

Exercise 1:

Using the GSS codebook, find the variable that asks what the respondent was doing last week, working, going to school, etc. What is the variable name?
How many people said they were retired? Confirm this number by using the table() command.
Find the variable that indicates the respondent's age. How many missings are there in the variable?
Type mean(age). What does R return? Why might R be doing this? (hint type ?mean)
After reading the help file on mean, you should be able to calculate the mean of age. What is it?
What is the standard deviation of age?

Exercise 2:

Find a variable you are interested in in the 2012 GSS. Make a histogram of the variable. hist() will ignore NA's.
How many missings are there in this variable?

Exercise 3:

Find the variable padeg in the codebook. What does this variable ask?
For the variable padeg how many cases had less than a High School Diploma in 2000?

Exercise 5:

Choose five variables from the GSS. Make a new data frame that only includes these five variables.

Exercise 6:

The variable spanking asks: “Do you strongly agree, agree, disagree, or strongly disagree that it is sometimes necessary to discipline a child with a good, hard spanking?” Recode this variable into a new variable that combines “strongly agree” and “agree” together and “disagree” and “strongly disagree” together.
Make a new variable for income (coninc) that has the following levels: “<10k per year”, “10k-19k”, “20k-49k”, “50k-100k”, “>100k”.
The variable xmovie asks if the respondent has seen a pornographic movie. Create a new variable that has the following categories: “yes, seen porn, under 30 years old”, “yes, seen porn, 30+ years old”, “no”. Make a table of your variable.
How many missings are there in xmovie? Check the code book and see why they were missing. How many said “Don't know”? How many “Didn't Answer”? How many were “Not Applicable”? The last category is people who weren't asked this question.
Now make a new variable from pornlaw that has people who believe that pornography should always be illegal and those who feel otherwise.
Make a cross tabulation, table(x,y), of your variables from 3 and 5 (the new variables you made from xmovie and pornlaw). Does this table tell you anything interesting?
Take the variable natspac (are we spending too much on a national space program) and make a new variable recoded as follows: too little=1; about right=2, too much=3. Note: here, you don't need to initialize the new variable as a factor.

Exercise 7:

Type the following command: cor(age,coninc,use="complete"). This produces the correlation coefficient between age and income. Note: you need the use="complete" option (short for "complete.obs") so that R uses only the cases where neither age nor coninc is missing. How are these variables correlated?
How are coninc and size (size of town/city) correlated?
How are age and tvhours correlated? How about coninc and tvhours (hours per day of television)? What is the problem if you talk about the relationship between income and hours of tv watching without considering age?
How are age and the numeric variable you created for the space program correlated?