Datasets

Overview of Lesson

In this lesson, we'll create a fake dataset, open (load) a pre-existing dataset, and merge two datasets.

Creating Data

Social scientists are confronted with real-world data. The data may come from a variety of sources and be of different types, experimental or observational. Building your own fake dataset is useful because you know exactly how the data were generated, which makes it a safe place to practice, so this lesson begins by making one.

Individual Characteristics

Suppose we have three vectors describing the characteristics of a series of individuals. They include:

age: a continuous random variable that is normally distributed,
college: a dichotomous or “dummy” random variable that takes the value 1 if the individual completed a 4-year college degree and 0 otherwise,
income: a continuous random variable that indicates the individual's income in thousands of US dollars.

We create each of these variables in turn with the following R script:

## We take 100 draws from each of the following distributions.

set.seed(333)                 #We set the seed so that our work is replicable; any integer is fine

age<-rnorm(100,45,15)         #This creates the variable *age* which is drawn from the normal distribution
                              #with mean 45 and st. dev. 15.

college<-rbinom(100,1,0.3)    #This creates the variable *college* which is drawn from the binomial
                              #distribution, where the probability of obtaining a draw with value 1 is 0.3

income<-10*rbeta(100,2,8)     #This creates the variable *income* which is drawn from the beta distribution
                              #with shape parameters 2 and 8, such that it has a slight positive skew.
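
After simulating data it is worth checking that the draws look the way we intended. The sketch below repeats the three draws from the script above, so that it runs on its own, and compares a few sample statistics with their theoretical values. The sample values will be close to the theoretical ones but not equal to them.

```r
# Recreate the three vectors so this block runs on its own
set.seed(333)
age     <- rnorm(100, 45, 15)
college <- rbinom(100, 1, 0.3)
income  <- 10 * rbeta(100, 2, 8)

# Sample moments should be near, not equal to, the theoretical values
mean(age)       # theoretical mean: 45
sd(age)         # theoretical st. dev.: 15
mean(college)   # theoretical share of 1s: 0.3
mean(income)    # theoretical mean: 10 * 2 / (2 + 8) = 2

# Structural checks that hold for any seed
range(income)   # inside 0 to 10, because rbeta() returns values between 0 and 1
table(college)  # only the values 0 and 1 can appear
```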

The details of the functions used to create these vectors are not important for now. A dataset typically combines such vectors as its variables (columns). This makes it easy to export the data and also to run analyses on it. We do this with the as.data.frame() command as shown below; we name the dataset df by convention:

df<-as.data.frame(cbind(age, college,income))              
# the data frame df gets the vectors age, college, and income combined as columns.
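
One thing to keep in mind is that cbind() first builds a matrix, and a matrix can hold only one type of data. That is harmless here because all three vectors are numeric, but it matters as soon as a text variable is involved. The sketch below uses small made-up vectors (the names ending in _demo and the region_demo variable are hypothetical, not part of the lesson's data) to show that calling data.frame() directly keeps each column's own type.

```r
# Small stand-in vectors with made-up values, so the block runs on its own
age_demo     <- c(38.2, 51.7, 44.9, 60.3)
college_demo <- c(0, 1, 0, 0)
income_demo  <- c(1.4, 3.1, 2.2, 0.8)

# data.frame() combines the vectors directly; no intermediate matrix is built
df_alt <- data.frame(age = age_demo, college = college_demo, income = income_demo)
str(df_alt)

# Hypothetical text column, added only to show the difference
region_demo <- c("north", "south", "south", "north")

mixed_cbind <- as.data.frame(cbind(age = age_demo, region = region_demo))
mixed_df    <- data.frame(age = age_demo, region = region_demo)

is.numeric(mixed_cbind$age)  # FALSE: cbind() converted age to text
is.numeric(mixed_df$age)     # TRUE: age is still numeric
```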

Save your own data using the following command:

#write.table(dataset_name, file = "directory and file name", sep = ",", row.names = FALSE)
write.table(df,"file path",  sep= ",", row.names=FALSE)
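
A quick way to confirm that a file was saved correctly is to read it straight back in. The sketch below is only an illustration: it uses a small made-up data frame called demo and a throwaway path from tempfile() in place of your own file path.

```r
# Small stand-in data frame with made-up values, so the block runs on its own
demo <- data.frame(age = c(31.5, 48.2, 60.1), income = c(1.2, 2.7, 0.9))

# tempfile() returns a throwaway path; replace it with your own file path
path <- tempfile(fileext = ".csv")
write.table(demo, path, sep = ",", row.names = FALSE)

# Look at the raw text: the first line holds the variable names
readLines(path)

# Read the file back and compare it with the original
demo_back <- read.table(path, header = TRUE, sep = ",")
all.equal(demo, demo_back)
```

Setting row.names=FALSE matters here: without it, R writes the row numbers as an extra unnamed column.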

Opening Data

Opening data is straightforward. The command you use depends on the type of file you are opening. Below are some common file types. To open some of these files you need to install additional packages. More on that later. For now, pay attention to the packages we load with library() to open the data.

#CSV Files
df <- read.table("file path", header=TRUE, 
    sep=",")

#The key here is that header=TRUE tells R that the first row holds the variable names,
#and sep="," tells it that the entries are separated by commas.

#STATA

## Old STATA (files saved by version 12 or earlier)
library(foreign)
df <- read.dta("file path")

## New STATA (files saved by version 13 or later)
library(readstata13)
df<-read.dta13("file path")


#EXCEL
library(xlsx)
df <- read.xlsx("filepath", 1)
#1 is the sheet index: it tells R to read the first worksheet in the workbook
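
For the CSV case, base R also offers read.csv(), which is read.table() with header=TRUE and sep="," already set as defaults. The sketch below writes a tiny made-up file to a temporary path, so that it runs on its own, and reads it both ways. For a simple file like this one the two calls give the same result.

```r
# Write a tiny stand-in file with made-up values, so the block runs on its own
path <- tempfile(fileext = ".csv")
writeLines(c("id,urban", "1,0", "2,1", "3,0"), path)

via_table <- read.table(path, header = TRUE, sep = ",")
via_csv   <- read.csv(path)   # header = TRUE and sep = "," are the defaults

# Compare the two results
all.equal(via_table, via_csv)
```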

Merging

Because R is an object-based environment, one of its advantages over, say, Stata is that it can hold and manipulate several datasets in a single session. Occasionally, it will be necessary to merge two or more of them.

#Setting Up the Data

#For this to work, both datasets must contain a common identifier that is unique for each observation. In this example, I generate a counter and append it to the dataset using the techniques we have considered before.

df$id<-seq(from=1,to=100, by=1)

#Now let's create another dataset:

#Create id variable:
id<-seq(from=1,to=100, by=1)

#Create urban dummy, to describe whether the individual lives in a city.
urban<-rbinom(100,1,0.25)    

#Now let's create data set 2
df2<-as.data.frame(cbind(id, urban))

We explore merging below:

#To merge two data frames we use the merge() function. Here we give it three arguments: the two data frames, followed by the variable on which the merge is conducted (the by argument):

new_df <- merge(df,df2,by="id")

#If you had multiple variables by which to conduct the merge, use c() to provide these.
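
As a minimal sketch of a merge on more than one variable, suppose each individual is observed in several years, so that an observation is identified by id and year together. The two data frames below (earnings and residence) and all of their values are made up for illustration.

```r
# Hypothetical panel: two individuals, each observed in two years
earnings <- data.frame(id     = c(1, 1, 2, 2),
                       year   = c(2019, 2020, 2019, 2020),
                       income = c(2.1, 2.4, 1.8, 1.9))

residence <- data.frame(id    = c(1, 1, 2, 2),
                        year  = c(2019, 2020, 2019, 2020),
                        urban = c(1, 1, 0, 1))

# Rows are matched only when both id and year agree
panel <- merge(earnings, residence, by = c("id", "year"))
panel
```

Merging on id alone would not work here, because each id appears twice in both data frames and every copy on one side would be paired with every copy on the other.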

We now have a complete dataset on which to conduct an analysis!
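
In our example every id appears in both data frames, so no rows are lost. That will not always be the case: by default, merge() keeps only the rows whose identifier appears in both data frames. The sketch below uses two small made-up data frames (people and cities) to show the default behavior and the all.x argument, which keeps every row of the first data frame.

```r
# Hypothetical data: five people, but city information for only three of them
people <- data.frame(id = 1:5, age = c(30, 41, 52, 63, 74))
cities <- data.frame(id = c(1, 2, 3), urban = c(0, 1, 1))

inner <- merge(people, cities, by = "id")                # matched rows only
left  <- merge(people, cities, by = "id", all.x = TRUE)  # every row of people

nrow(inner)             # 3
nrow(left)              # 5
sum(is.na(left$urban))  # 2: ids 4 and 5 have no match, so urban is NA
```

Comparing the number of rows before and after a merge is a simple way to catch observations that were dropped without warning.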