Econ 284 R Studio Lab

Objective: The purpose of this lab is to familiarize you with working in R Studio and give you some commands that you will be using throughout the semester for transforming variables, running regressions, and creating subsets. Keep this file as a reference for the rest of the semester.

Setting a Working Directory

If you haven’t already, you should create a folder for this class that will contain all of your R files (extension .R) and data sets. Anytime you download a new data set or open a new script file, you should save it to this folder. In R, you will then be able to set the folder as your “working directory”, and R will automatically save everything to that folder moving forward. Once the folder exists, move the “caschool.csv” dataset to that folder on your laptop.

Once in R Studio, you can set your working directory by clicking the “Session” tab at the top of the screen and then going to “Set Working Directory” and “Choose Directory”. Then select your folder and hit save. R Studio will automatically save all your files to your working directory and you will be able to use the code below to import data files from your working directory directly into R Studio.
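The same step can also be done in code with the setwd() command. The following is a minimal sketch, not the required method for this class: class_folder is a hypothetical name, and a temporary folder stands in for your real class folder so the code runs on any computer.

```r
# Hypothetical folder: replace tempdir() with the path to your own class folder,
# written in quotes, for example "C:/Users/yourname/Documents/my_class_folder"
class_folder <- tempdir()

# Set the working directory (same effect as Session > Set Working Directory)
setwd(class_folder)

# Confirm that the working directory is now the folder you chose
getwd()
```

If you use setwd() at the top of a script, the path has to match the folder on the computer where the script is run.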

You can see your current working directory by using the command “getwd()”. Try it below:

#Command to see your current working directory
getwd()

## [1] "/Users/lfortmann/Google Drive/ECON 284/284 Lab Stuff"

Importing Data into R Studio

If you haven’t already, move your “caschool.csv” file to your working directory folder. Next, use the code in the gray box below to import that dataset into R Studio. Note that you should read the code from right to left: read.csv imports the caschool.csv data file, and the arrow <- assigns the result to a dataframe named “caschool”.

To run the code, click the green arrow in the corner.

 caschool <- read.csv("caschool.csv")

You should now see “caschool” appear in the Data section in the upper right window under the “Global Environment” tab. You can also click on the “caschool” tab at the top of the upper left window in R Studio to see the data in spreadsheet form.

Familiarizing Yourself with the Data

To start familiarizing yourself with the data set, you want to get some summary statistics. To find the mean value of a variable, you will use code of the form “mean(caschool$str)”. First you type the command for finding the mean, “mean”, then identify the name of the dataframe the variable is contained in, “caschool”, then use “$” and the variable name, str. The example below does this for the variable teachers.

#Find the mean of the variable "teachers"
mean(caschool$teachers)

## [1] 129.0674

To get more summary stats, use the command “summary” instead of “mean”. Now try this for the teachers variable in the box below.

# Find summary stats for variable "teachers"

summary(caschool$teachers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.85   19.66   48.56  129.07  146.35 1429.00

You will also need to know the sample size (or number of observations) in a dataset. You can do this by looking in the Data window or by using the nrow command.

To get the sample size for caschool data, you would use the code nrow(caschool). Try this code using the nrow command in the box below.

#Find the sample size of the caschool dataset
nrow(caschool)

## [1] 420

Now suppose you want to know how many schools in the sample have over 2000 students. To do this, you can count the rows in a subset of the data by wrapping the “subset” command inside the “nrow” command. In this section, I will walk you through code for one variable in the caschool dataset, then you will apply the same commands to a different variable.

#For example, to get the number of school districts that have enrollment over 2000 in the caschool data run the code below:
    nrow(subset(caschool, enrl_tot>2000))

## [1] 153

#Or to get the number of districts with exactly 100 computers:
    nrow(subset(caschool, computer==100))

## [1] 2

Now it’s your turn. How many schools in the sample have fewer than 100 teachers? Enter the code below and execute.

#schools with less than 100 teachers in sample
nrow(subset(caschool, teachers<100))

## [1] 265

How many schools in the sample have exactly 105 teachers?

#schools with exactly 105 teachers
nrow(subset(caschool, teachers==105))

## [1] 1
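The counting pattern used in this section can be tried on a small made-up data frame where the answers are easy to verify by eye. This is only a sketch: toy is a hypothetical name, its values are invented, and the last line (two conditions joined with “&”) goes one step beyond what the lab asks for.

```r
# Made-up data (not from caschool.csv) so the counts can be checked by hand
toy <- data.frame(
  teachers = c(50, 100, 150, 80, 100),
  enrl_tot = c(900, 2100, 3000, 1500, 2500)
)

# Rows with fewer than 100 teachers
n_under_100 <- nrow(subset(toy, teachers < 100))

# Rows with exactly 100 teachers (note the double equals sign)
n_exactly_100 <- nrow(subset(toy, teachers == 100))

# Two conditions at once: fewer than 100 teachers AND enrollment over 1000
n_both <- nrow(subset(toy, teachers < 100 & enrl_tot > 1000))

n_under_100
n_exactly_100
n_both
```

A single “=” inside subset() is not a comparison, which is why the exact-match condition uses “==”.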

Changing a Variable Name

To make manipulating the data and coding easier, you may first want to change the name of some of the key variables. Suppose that you don’t want to use the variable name “el_pct” and want to rename it “english”.

To rename a variable, type data$new_name=data$old_name. For example, if I want to change the name of “el_pct” in the caschool dataset to “english”, I would use this command.

#changing variable name
    caschool$english=caschool$el_pct

After you run the command, you should see the new variable appear at the end of your dataset (notice that the original variable “el_pct” is still in the dataset). To make sure you did it correctly, you can also take the mean of the new variable, which should be the same as the mean of the original variable. See below.

#checking that the new variable "english" is correct
mean(caschool$english)

## [1] 15.76816

mean(caschool$el_pct)

## [1] 15.76816

Rename the “enrl_tot” variable to “enrollment” and check to see that you created the new variable correctly by comparing means.

#rename variable to enrollment
caschool$enrollment=caschool$enrl_tot
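The exercise also asks for a check by comparing means. A minimal sketch of that check follows, using a small made-up data frame so it runs without the caschool file; toy is a hypothetical name and its values are invented. The identical() call is an extra, stricter check that is not part of the lab.

```r
# Made-up enrollment values (not from caschool.csv)
toy <- data.frame(enrl_tot = c(195, 240, 1550, 243, 1335))

# Copy the variable under a new name, as in the lab
toy$enrollment = toy$enrl_tot

# The two means should match
mean_new <- mean(toy$enrollment)
mean_old <- mean(toy$enrl_tot)
mean_new
mean_old

# Stricter check: TRUE only if every value matches, not just the mean
same_values <- identical(toy$enrollment, toy$enrl_tot)
same_values
```

With the real data, the same three lines apply after replacing toy with caschool.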

Creating a New Variable

Creating a new variable uses the same code you used to change the name of a variable. You may want to do this if you want to square a variable, take a log, or change the units (e.g. hours to weeks).

For this example, you will convert the “avginc” variable to dollars (instead of thousands of dollars). To do this, you multiply the original variable in the data by 1000 (using “*” for multiplication) and then create a new variable (with a new name). Try it below.

#Create a new variable (avginc_th) that is avginc multiplied by 1000:

caschool$avginc_th=caschool$avginc*1000

To check that you created the new variable, you should be able to view it in the data tab. It will appear as a new variable column at the end of the dataset.
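Squaring a variable and taking a log, both mentioned at the start of this section, follow the same pattern as the unit change. Here is a sketch on made-up data; toy and the new column names (avginc_dollars, avginc_sq, log_avginc) are hypothetical, chosen only for illustration.

```r
# Made-up income values in thousands of dollars (not from caschool.csv)
toy <- data.frame(avginc = c(10, 15, 20))

# Change of units: thousands of dollars to dollars
toy$avginc_dollars <- toy$avginc * 1000

# Square of the variable, using "^" for exponents
toy$avginc_sq <- toy$avginc^2

# Natural log of the variable
toy$log_avginc <- log(toy$avginc)

# Each new variable appears as a new column at the end of the data frame
toy
```

In R, log() is the natural log by default; log10() gives the base-10 log.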

Creating a Scatter Plot

Sometimes it is useful to create a scatter plot of your data to look for relationships. The general code for a scatter plot of two variables is: plot(data$x, data$y), where data refers to the name of the dataset in which the variables are located, and x and y refer to the variable names of interest.

For example, I use the following code to create a scatter plot of str and testscr with the caschool data.

#plotting data
plot(caschool$str, caschool$testscr)

Now you create a scatter plot of avginc and number of computers.

#Scatter plot of avginc (x) and computer (y)

plot(caschool$avginc, caschool$computer)

Running a Linear Regression

To run a linear (OLS) regression, you use the “lm” command, which stands for Linear Model. You must also specify a name for your model so you can call it later using the “summary” command to view the results of the regression. The general code is as follows: model_name<-lm(y ~ x, data)

For example, to run a regression of testscr on str, I would use the code:

#running OLS regression
model1<-lm(testscr~str,caschool)

#to see the results, use the summary command and your model name
summary(model1)

## 
## Call:
## lm(formula = testscr ~ str, data = caschool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
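Beyond the summary table, the estimated coefficients can be pulled out of a model and used to compute a predicted value. The sketch below uses made-up data so it runs without the caschool file; toy, toy_model, and the values are hypothetical, and coef() and predict() are standard R commands that this lab does not otherwise cover.

```r
# Made-up class sizes and test scores (not from caschool.csv)
toy <- data.frame(
  str     = c(16, 18, 20, 22, 24),
  testscr = c(670, 664, 661, 655, 650)
)

# Same lm pattern as in the lab: y ~ x, then the data
toy_model <- lm(testscr ~ str, data = toy)

# Extract the intercept and slope as a named vector
b <- coef(toy_model)
b

# Predicted test score when str = 21, computed by hand from the coefficients
pred_by_hand <- b["(Intercept)"] + b["str"] * 21

# The same prediction using predict()
pred_by_predict <- predict(toy_model, newdata = data.frame(str = 21))

pred_by_hand
pred_by_predict
```

The column name inside newdata has to match the x variable used in the regression.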

Now run a regression of testscr (y) on avginc (x) using the “lm” command. Name the results of your model “model2”.

#regress testscr on avginc 
model2<-lm(testscr~avginc,caschool)

Notice that after you run the model, you don’t see the results! You have to include a separate “summary” command. Do this below for model2.

#Get results for model2

summary(model2)

## 
## Call:
## lm(formula = testscr ~ avginc, data = caschool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.574  -8.803   0.603   9.032  32.530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 625.3836     1.5324  408.11   <2e-16 ***
## avginc        1.8785     0.0905   20.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.39 on 418 degrees of freedom
## Multiple R-squared:  0.5076, Adjusted R-squared:  0.5064 
## F-statistic: 430.8 on 1 and 418 DF,  p-value: < 2.2e-16

Interpret the coefficient for “avginc”.

#What is the relationship between testscr and avginc? Interpret your results. Be sure to keep the "#" before your text so it's a comment and not code

#positive relationship. If avginc increases by 1 unit ($1,000), test scores increase on average by about 1.9 points
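The interpretation can be checked with a little arithmetic on the estimates reported in the summary(model2) output. This sketch types the intercept and slope in by hand from that output; b0, b1, pred_15, and pred_16 are hypothetical names, and the income levels 15 and 16 are picked only for illustration (avginc is measured in thousands of dollars).

```r
# Estimates copied from the summary(model2) output
b0 <- 625.3836   # intercept
b1 <- 1.8785     # coefficient on avginc

# Predicted test score at avginc = 15 and avginc = 16
pred_15 <- b0 + b1 * 15
pred_16 <- b0 + b1 * 16

# The difference between the two predictions equals the slope
change <- pred_16 - pred_15
change
```

Whatever starting income is chosen, a one-unit increase in avginc changes the predicted test score by the slope, because the model is a straight line.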

Finally, you may want to create a subset of the data that you want to work with more extensively, for example, a separate dataset that only includes larger schools. To do this, you identify the subset of interest and assign it to a new dataset name (large_schools), as shown below.

#creating a subset of large school districts 
    large_schools<-subset(caschool, enrl_tot>2000)

You can then call on this new dataset to get summary statistics for variables or run regressions. For example, summary statistics for str among schools with enrollment > 2000 are shown below.

# summary stats with new subset

summary(large_schools$str)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.73   19.52   20.38   20.45   21.19   24.41

Now create a subset of data for schools that have district income above the mean in the sample, and then get the summary stats for computer in this new subset.

#creating a subset of high income schools
summary(caschool$avginc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.335  10.639  13.728  15.317  17.629  55.328

highinc_schools<-subset(caschool, avginc>15.3)

Now get the summary stats for “computer” in the new subset of high income schools.

#summary of computer variable for high income schools
summary(highinc_schools$computer)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    60.0   245.0   379.6   497.0  2401.0
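The cutoff in the exercise above was typed in by hand as 15.3 after reading the mean from the summary output. An alternative is to let R compute the mean inside the subset command. The sketch below shows this on made-up data; toy, income_mean, and toy_high are hypothetical names. With the real data, results could differ slightly from the output above if any district has avginc between 15.3 and the exact mean.

```r
# Made-up incomes (thousands of dollars) and computer counts, not from caschool.csv
toy <- data.frame(
  avginc   = c(8, 12, 16, 24),
  computer = c(10, 40, 120, 300)
)

# Mean income in the sample
income_mean <- mean(toy$avginc)
income_mean

# Keep only rows with income above the sample mean
toy_high <- subset(toy, avginc > mean(avginc))

# Summary stats for computer in the high-income subset
nrow(toy_high)
summary(toy_high$computer)
```

Computing the mean in code avoids rounding the cutoff and keeps working if the data change.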

Knitting file to HTML

Now that you have completed the lab, you can save all of your code and results to an HTML page for future reference by clicking the “Knit to HTML” button under the “Knit” tab at the top of the window. It will automatically open in your web browser. If you are not able to “knit” the document, it’s okay, don’t worry about it!