R

R is an open-source programming language and software environment most commonly used by data scientists and academics. R is used to:

  1. Import data from your computer, websites, or databases
  2. Clean the data by organizing it into matrices, data frames, or tables
  3. Analyze the data using statistical tests or graphs
  4. Communicate your results

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

R Workspace

When you click the Knit button (which can be found at the top of your editor) all of your code will be run and a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

x <- 100
x
## [1] 100

Anything between the opening ```{r} line and the closing ``` line will be treated as R code. You can also insert a chunk by clicking the Insert icon and choosing R.

Writing and Executing code in R

Although you can write and execute R commands using the R console, a better option is to write your commands in a script that you can save and modify as necessary. This also allows you, or anyone else, to reproduce the work you do. To start a new script, click File -> New File -> R Script. To create a new markdown document, choose R Markdown.

To execute a command in a script, you can select and run specific lines of code by highlighting the code and either clicking the Run icon or using the keyboard shortcut Ctrl+Enter (Windows) or Command+Enter (Mac). You can also run all or part of a script line by line without selecting any code: if you click the Run icon or use the keyboard shortcut without selecting anything, R will execute only the line of code where your cursor is located and move the cursor to the next line.

You can write one command or function call across multiple lines. To execute it, you need to run all of its lines. You can highlight the entire multi-line command and run it as a block, or you can run its lines one at a time, in order, until the command is complete.
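As a small illustration (the numbers are made up for this example), the call to sum() below is spread over several lines. R treats it as a single command, so the object total only exists once every line of the call has been run.

```r
# One command written across several lines
total <- sum(
  1,
  2,
  3
)

# Print the result
total
```

If you run only the first line in the console, R shows a + prompt, which means it is waiting for the rest of the command.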

Filepaths

File paths are key to using R for data analysis. A file path specifies the location of a file on your computer. For example, you might notice that when you download a file from the internet, your computer probably adds the file to your “Downloads” folder. Generally, whether you use a Mac or a PC, you have your files stored in folders (and folders within folders). You may have file icons or shortcuts on your desktop, but even those files are stored in a folder, perhaps a “Desktop” folder.

Why are file paths so important when you use R (and many other applications)? You will want to import or load data, and you will need to use a file path to specify where R should access the file. You may also want to save or export data, tables, and figures, so you will need to specify where R should put these files.

How do you determine the file path? There are multiple ways to figure out a file path. For example, you may know where a file is located; perhaps you saved it to your “Downloads” folder. In that case, your file path might look something like this: /Users/YourName/Downloads (Mac) or C:\Users\YourName\Downloads (PC).

You can see a Mac example below (a Windows example follows). Figure 1 shows a Finder window on a Mac. To access Finder, click the Finder icon (Finder is always open and its icon is always in your Dock). There are a few things to notice here. First, a file for the ERC Workshop is highlighted. You can see from the top of the window that the file is inside a folder called “R Workshop,” and at the bottom of the window, you can see that this folder is nested in a series of folders.

Figure 1: Example – Mac Finder Window

One easy way to find the file path for this document is to right-click on the file (or secondary click; on a Mac you often do this by clicking with two fingers on your trackpad). You will see a menu like the one in Figure 2. Select “Get Info” for details on the file.

Figure 2: Example – File Menu

Note that in the example in Figure 3, all of the information in the “Where:” field is highlighted; this is the file path. You can highlight this information, copy it, and paste it into an R script. Your computer will copy only the portion of the file path that you need, and it should automatically paste the file path in a readable format, i.e., the arrows in the “Get Info” window will be replaced by slashes (if you have an older Mac OS, you might have to change the arrows to slashes manually).

Figure 3: Example – Mac Get Info

Finding the file path on a Windows system is similar. Figure 4 shows a File Explorer window. You can access File Explorer from the taskbar or via the Start menu. The name of the current folder is displayed at the very top of the window. Just under the window’s pull-down menus, you will see the location of the current folder on your computer (it’s highlighted in pink in Figure 4). This information is the file path for the folder displayed in the window. If you click on the file path, you will see it change to a format with backslashes instead of arrows, and you can copy the file path from here.

Figure 4: Example – File Explorer

Note that you will need to modify the Windows file path to use it in R. Windows file paths include backslashes (\), which you will need to change. This modification is necessary because the backslash (\) is a character that has a designated meaning in R. There are two ways to resolve this minor issue:

  1. You can simply delete the backslashes and replace them with forward slashes: C:\Users\Downloads becomes C:/Users/Downloads

  2. You can add a second backslash directly before or after each existing backslash: C:\Users\Downloads becomes C:\\Users\\Downloads
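The short sketch below shows both options side by side. The folder names are placeholders rather than a real location on your computer, so substitute your own path. It also uses file.path(), a base R function that builds a path from its pieces.

```r
# Two ways to write the same (hypothetical) Windows path in R
path_forward <- "C:/Users/YourName/Downloads"
path_escaped <- "C:\\Users\\YourName\\Downloads"

# Each doubled backslash is stored as a single backslash
writeLines(path_escaped)

# Both strings therefore have the same number of characters
nchar(path_forward)
nchar(path_escaped)

# file.path() joins the pieces with forward slashes for you
built_path <- file.path("C:", "Users", "YourName", "Downloads")
built_path
```

Forward slashes work in R on both Windows and Mac, so they are usually the simpler choice.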

Another way to find the file path in Windows is to right-click on a file and select “Properties.” Figure 5 shows an example of a “Properties” window. Note the “Location” field, which is the file path for the file or folder. You can copy and paste the file path from here as well. You will still need to either replace the backslashes (\) with forward slashes (/) or add an additional backslash next to each backslash in the file path.

Figure 5: Example – Properties Window

Set up Workspace

Now that we know some of the basics of working with R, let’s start using it! First, we will clear your workspace. This is not necessary, but is helpful if you don’t want the work you do with this one document to be affected by previous work you may have done in R. It deletes any variables you may have saved in your workspace.

Don’t forget that you actually have to run your code after you type it out for it to work! You can run it by either clicking the Run icon or by using the keyboard shortcut Ctrl+Enter (Windows) or Command+Enter (Mac).

rm(list=ls(all=TRUE))

Working directory

You can use the file path each time you want to use or create a file with R, but you can also set a working directory. Setting the working directory involves designating a file path for the location on your computer where R will access and save files. You can see your working directory, and you can set your working directory. You can set a working directory at the start of each R script to use files for specific projects or assignments. Keep in mind that your file path for the working directory will be different from mine.

getwd() ##tells you what the working directory is currently 
## [1] "C:/Users/fkoli/Downloads/RWorkshop1"
setwd("/Users/fkoli/Downloads/RWorkshop1/") ##set the directory you want as your working directory

getwd() ##Check again to see the change 
## [1] "C:/Users/fkoli/Downloads/RWorkshop1"
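If you want to experiment with the working directory without disturbing your current setup, one option is to save the current location first and restore it afterwards. The sketch below uses tempdir(), R's temporary folder for the session, only so that the example runs on any computer. In practice you would use your own project folder.

```r
# Remember the current working directory
old_wd <- getwd()

# Switch to another folder (tempdir() is a stand-in for your project folder)
setwd(tempdir())
getwd()

# List the files R can now reach without a full file path
list.files()

# Restore the original working directory
setwd(old_wd)
getwd()
```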

Installing and Loading Packages

Because R is an open-source application, many users develop packages to handle specific tasks. For example, there are packages for data visualization, data manipulation, and many types of statistical analyses. You will need to install packages to handle certain tasks. You only need to install packages once, but you will need to load them any time you want to use them.

We will be using three packages:

  1. descr – useful for producing frequency tables and crosstabs
  2. dplyr – useful for friendly data cleaning and manipulation
  3. ggplot2 – useful for making plots and figures

Some other useful packages are:

  - foreign – load data formatted for other software
  - xtable – export code to produce tables in LaTeX
  - arm – applied regression and multi-level modeling

More packages: http://cran.r-project.org/web/packages/

To install packages:

## use dependencies = TRUE to install any other required packages

install.packages("descr", dependencies=TRUE)
install.packages("dplyr", dependencies=TRUE)
install.packages("ggplot2", dependencies=TRUE)

To load packages:

## Note the install packages command takes the package name with quotes, but the library command takes the package name without quotes

library(descr)
library(dplyr)
library(ggplot2)
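Because installation is only needed once, it can be handy to check which packages are already on your computer before installing anything. The sketch below is one possible way to do that with the base R function requireNamespace(). The object names pkgs, installed, and missing_pkgs are chosen just for this example, and the install line is left as a comment so that nothing is downloaded unless you choose to run it.

```r
# Packages used in this workshop
pkgs <- c("descr", "dplyr", "ggplot2")

# requireNamespace() returns TRUE if a package is installed (it does not attach it)
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- pkgs[!installed]
missing_pkgs

# Uncomment to install only what is missing (requires an internet connection)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs, dependencies = TRUE)
```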

If you are looking at the R Markdown source and not the final PDF, you will notice that the chunk header for the installation commands (the curly brackets that mark the chunk as R code) also contains other options, such as warning, error, message, and eval. The warning, error, and message options are set to FALSE for most of the document, which ensures that warnings and messages produced by the code are not printed in the PDF when the document is knit. I have specified eval only for certain chunks. This option tells R Markdown whether to actually run the chunk when the document is knit. I have set it to FALSE, which means the chunk will not be run during knitting, because running it at that point would produce an error. However, you still want the code in the document because you need to run it yourself at least once before the document is knit. The most common practice is to turn the installation lines into comments, which is covered in the next section, rather than using the eval option. I have used eval here only for the purpose of demonstration.

Getting Started

Comments

After you run the code for installing and loading packages, you should change the lines with code for installing packages into comments. This is so that it doesn’t run when you knit your document and turn it into a pdf/html/word document. Packages should always be installed before knitting the document.

Within an R script, you can use the hash sign (#) to designate text as a comment. Any text that follows # will be ignored when R executes a script. This feature enables you to annotate your code, and it can also be helpful for debugging code. You can “comment out” (or un-comment) a line of code using Ctrl+Shift+C (Windows) or Command+Shift+C (Mac) with your cursor anywhere on the line. You can highlight multiple lines and use the same shortcut to comment out a block of code.

Here is how the lines would look now:

#install.packages("dplyr", dependencies=TRUE)
#install.packages("ggplot2", dependencies=TRUE)
#install.packages("descr", dependencies=TRUE)

You can use R for simple (and not so simple) calculations.

4
## [1] 4
"yes"
## [1] "yes"
2+3
## [1] 5
46^2
## [1] 2116

Assignment Operator

You can create variables in R and assign them values. There are two operators that assign values to objects. Most R users opt to use the <- operator, but you can also use =. One reason the <- operator tends to be preferred is that, unlike the equal sign, the <- operator serves only one purpose. For the sake of clarity, we will use the <- assignment operator exclusively.

##this is how you add a comment

x <- 100
x
## [1] 100
y <- "string"
y
## [1] "string"

Data objects in R

Objects are the building blocks of R. When you use R to analyze data, you will typically direct R to perform a series of functions (often commands or calculations) on a data object, usually a data frame. Data frames are the key unit of data analysis in R, but other R objects include vectors, matrices, and lists.

Below we are creating a vector. A vector is a one-dimensional matrix or array, a group of elements in one row or column. Vectors are important to R for several reasons. A vector can be a data object, i.e., you can create a vector, store it, and perform functions on or with it.

The function c() allows you to concatenate multiple elements into a vector. These elements can be numbers or strings (characters). As you will learn, you can often use c() to pass vectors to functions. You will also use c() to create vectors of arguments or instructions to functions. You use square brackets to access elements within a vector; this is called indexing. In R, indexing starts at 1, so to get the first element of our vector x, we would write x[1], to get the second element we would write x[2], and so forth. You will get an NA value if you try to access an element in the vector that doesn’t exist.

x <- c(2,3,4)
x
## [1] 2 3 4
x[4]
## [1] NA
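The same ideas apply to vectors of strings, and you can pass a vector of positions inside the brackets to pull out several elements at once. The values and object names below (fruits, nums) are made up for illustration.

```r
# A character vector created with c()
fruits <- c("apple", "banana", "cherry")

# Indexing starts at 1
fruits[1]

# Use c() inside the brackets to select several elements
fruits[c(1, 3)]

# length() reports how many elements a vector has
length(fruits)

# Arithmetic on a numeric vector is applied to every element
nums <- c(2, 3, 4)
nums * 10
```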

Logical Operators

Logical operators test conditions. For example, you might want a subset of data that includes observations for which a specific variable exceeds some value, or you may want to find observations with missing values. When working with vectors, logical operators test the condition for every element. In the example below, we are seeing which elements equal 2. Notice that we use two equal signs to check if elements are equal.

##conditional
x == 2
## [1]  TRUE FALSE FALSE
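A logical test returns a vector of TRUE and FALSE values, and that vector can itself be placed inside square brackets to keep only the elements that meet the condition. The sketch below uses a small made-up vector called scores that includes one missing value (NA) to show how is.na() fits in.

```r
# Made-up values; NA marks a missing value
scores <- c(2, 3, 4, NA)

# Test every element: comparisons with NA return NA
scores > 2

# is.na() finds the missing values
is.na(scores)

# Combine conditions with & (and), then use the result to subset
big_scores <- scores[!is.na(scores) & scores > 2]
big_scores

# Count how many elements equal 2, ignoring missing values
sum(scores == 2, na.rm = TRUE)
```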

Help

If you need help with a specific function and know the name of the function, enter ? followed by the function name at the command line, e.g., ?getwd for information about the getwd() function or ?table for information about the table() function. You can also search more generally for a topic by entering ?? followed by a search term. Using ?? can help you find a function to perform a specific task. For example, ??variance will bring up search results for a variety of functions from different packages, allowing you to choose the option you want. The search results will open up on the bottom right area of the screen under your environment.

# function search
?table

# topic search
??variance

Working with Data

Now, we will read in a csv and look at some data that we have. Before we do that, we need to learn about data frames.

Data frames are fundamental objects for data analysis in R. When you use R to manipulate, clean, or analyze data, you will typically work with data frames. Generally, each column is a variable, and each row is an observation. You could think of the term “data frame” as R-speak for a dataset. However, data frames can be a bit more flexible. For example, you could save regression results or a table in a data frame, which can help you output results, data, or statistics in your preferred format.

A few important features of data frames:

  - you assign elements to a data frame, often an existing dataset but also vectors
  - you can have multiple data frames in memory at any given time, i.e., you can have multiple datasets open at the same time
  - you can combine data frames to add rows (append) or columns (merge)
  - you must name data frames subject to the general rules for naming objects in R:
      - names cannot include spaces
      - names can include underscores (_) and/or periods (.)
      - it is best to avoid using the names of functions as object names
      - you can overwrite an existing object by assigning different elements to it (no errors, no warnings)
  - to work with an element of a data frame (e.g., a variable in a dataset), you must reference the data frame
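To make these features concrete, here is a minimal sketch that builds two tiny data frames from vectors, appends rows with rbind(), and adds a column with merge(). The film titles and numbers are invented for the example and are not part of the workshop dataset.

```r
# Build a data frame from vectors (each vector becomes a column)
films <- data.frame(title = c("Film A", "Film B"), budget = c(10, 20))

# A second data frame with the same columns
more_films <- data.frame(title = "Film C", budget = 30)

# Append: stack the rows of one data frame under the other
all_films <- rbind(films, more_films)

# A data frame holding a different variable for the same films
ratings <- data.frame(title = c("Film A", "Film B", "Film C"),
                      rating = c(7.1, 6.4, 8.0))

# Merge: match rows on the shared title column to add the new column
merged <- merge(all_films, ratings, by = "title")
merged

# To use a variable, reference its data frame
merged$rating
```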

Now, we will read in our data from a CSV file. R can read data files in a variety of formats, so you are not limited to CSV files, but they are a common format when working with data.

Note: If the data file is stored in your working directory, you need only specify the file name. However, if the file is stored somewhere else on your computer, you will need to include the file path.

The dataset we are using today was compiled by Walt Hickey at fivethirtyeight.com and contains information on 1,794 films released from 1970 to 2013. His article, “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women” examines the budgets and revenues of films that pass the Bechdel test. The Bechdel test is a popular method of measuring how female-friendly a movie is. To pass the test: 1) there must be two named female characters, 2) the two women must talk to each other, and 3) the conversation cannot be about a man.

data <- read.csv("bechdel_test_movies.csv", header=TRUE)
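If you would like to try read.csv() without the workshop file, the sketch below writes a tiny made-up CSV to a temporary location and reads it back using its full file path. The object names tmp and demo are just for this example. It also sets stringsAsFactors = TRUE explicitly: in R 4.0.0 and later, read.csv() keeps text columns as character by default, so this argument is one way to get factor columns like those shown in the output later in this document.

```r
# Write a small example file to a temporary location
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(title = c("Film A", "Film B"),
                     result = c("PASS", "FAIL")),
          tmp, row.names = FALSE)

# Read it back, giving the full file path; header = TRUE uses the first row as names
demo <- read.csv(tmp, header = TRUE, stringsAsFactors = TRUE)
demo

# Text columns are stored as factors because of stringsAsFactors = TRUE
class(demo$result)
```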

Looking at data

The names() function will display the names of the columns (variables) in a data frame. The dataset contains the year the movie was released, its IMDB code, its title, whether it failed or passed the Bechdel test, its budget, its domestic gross, and its international gross.

## look at variable names and dimensions

names(data)
## [1] "Year"        "IMDBcode"    "movietitle"  "bechdeltest" "budget"     
## [6] "domgross"    "intgross"

The dim() function returns the dimensions of the data frame. Note that rows are the first dimension and columns are the second dimension. This convention is consistent across R functions and packages.

dim(data)
## [1] 1776    7
dim(data)[1]  ## remember square brackets are used for indexing-- here, 1st element of dimensions
## [1] 1776
dim(data)[2]  ## 2nd element of dimensions
## [1] 7

You can refer to specific rows or columns in a data frame by row or column number(s). This allows you to see a subset of your data. You could even assign the result to a new object, and you would have effectively subset your data. Note the comma inside the square brackets. This comma differentiates rows from columns. If you refer to only a row or column, you still need the comma. Numbers or code to the left of the comma refer to rows, and numbers or code to the right of the comma refer to columns.

data[1,]        # row 1 only
data[1:3,]  # rows 1 to 3 only

The next two commands are not evaluated in the notebook since their output would be very long.

data[,1]        # column 1 only
data[,2:4]  # columns 2 to 4 only
data[ c(1,2,4), 1] #rows 1, 2, and 4, column 1
## [1] 1970 1971 1971

The head() function will show you the first few rows of data for all of the columns or variables in the data frame.

head(data)

You can also view specific variables by referencing the data frame and the variable name, linking the data frame and the variable together with a dollar sign ($).

data$budget
data$intgross

domgross        # This will give us an error! why? because we didn't say what data frame to access this variable from
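Besides numbers, you can also put column names or a logical condition inside the square brackets. The sketch below uses a small invented data frame called films (not the workshop data) so that the output stays short.

```r
# A small made-up data frame
films <- data.frame(
  title  = c("Film A", "Film B", "Film C"),
  year   = c(1999, 2005, 2012),
  budget = c(10, 25, 40)
)

# Row 2, all columns
films[2, ]

# One column, selected by name instead of number
films[, "budget"]

# Rows that meet a condition, and two named columns
recent <- films[films$year > 2000, c("title", "budget")]
recent

# The dollar sign and square brackets can be combined
films$budget[1]
```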

Find out the classification or type of an object such as a data frame or a variable with the class() function.

class(data)
## [1] "data.frame"
class(data$budget)
## [1] "integer"
class(data$bechdeltest)
## [1] "factor"

Basic Data Analysis

Tables

The basic table() function in R creates a very simple table, but the table() function is also incredibly flexible. Depending on your needs and preferences, you can take a quick look at frequencies for a variable or a crosstab, but you also can build a more elaborate custom table for display or publication.

# table() function
table(data$bechdeltest, useNA="always") ##fail pass variable
## 
## FAIL PASS <NA> 
##  982  794    0

You can store a table as an object and continue to add features or make changes if you wish. You also can generate tables of proportions by using the prop.table() function.

## cross tab for year + fail pass variable
crosstab <- table(data$Year, data$bechdeltest, 
                  dnn=c("Year", "Test Result")) ## add dimension  names
crosstab
##       Test Result
## Year   FAIL PASS
##   1970    0    1
##   1971    5    0
##   1972    1    1
##   1973    4    1
##   1974    5    2
##   1975    5    0
##   1976    5    3
##   1977    5    2
##   1978    6    2
##   1979    3    2
##   1980   10    4
##   1981    8    1
##   1982   11    3
##   1983    3    2
##   1984   13    3
##   1985    4    4
##   1986    6    4
##   1987   12    2
##   1988   10    9
##   1989   10    4
##   1990    9    6
##   1991    7    6
##   1992   16    4
##   1993    8    8
##   1994   17    9
##   1995   18   18
##   1996   21   21
##   1997   23   28
##   1998   38   24
##   1999   33   23
##   2000   34   29
##   2001   31   33
##   2002   43   37
##   2003   30   34
##   2004   36   45
##   2005   46   54
##   2006   47   41
##   2007   30   40
##   2008   48   50
##   2009   81   40
##   2010   68   60
##   2011   71   51
##   2012   49   37
##   2013   52   46
prop.table(crosstab, 1) ##The 1 calculates proportions within the first dimension (rows), so each year's row sums to 1.
##       Test Result
## Year        FAIL      PASS
##   1970 0.0000000 1.0000000
##   1971 1.0000000 0.0000000
##   1972 0.5000000 0.5000000
##   1973 0.8000000 0.2000000
##   1974 0.7142857 0.2857143
##   1975 1.0000000 0.0000000
##   1976 0.6250000 0.3750000
##   1977 0.7142857 0.2857143
##   1978 0.7500000 0.2500000
##   1979 0.6000000 0.4000000
##   1980 0.7142857 0.2857143
##   1981 0.8888889 0.1111111
##   1982 0.7857143 0.2142857
##   1983 0.6000000 0.4000000
##   1984 0.8125000 0.1875000
##   1985 0.5000000 0.5000000
##   1986 0.6000000 0.4000000
##   1987 0.8571429 0.1428571
##   1988 0.5263158 0.4736842
##   1989 0.7142857 0.2857143
##   1990 0.6000000 0.4000000
##   1991 0.5384615 0.4615385
##   1992 0.8000000 0.2000000
##   1993 0.5000000 0.5000000
##   1994 0.6538462 0.3461538
##   1995 0.5000000 0.5000000
##   1996 0.5000000 0.5000000
##   1997 0.4509804 0.5490196
##   1998 0.6129032 0.3870968
##   1999 0.5892857 0.4107143
##   2000 0.5396825 0.4603175
##   2001 0.4843750 0.5156250
##   2002 0.5375000 0.4625000
##   2003 0.4687500 0.5312500
##   2004 0.4444444 0.5555556
##   2005 0.4600000 0.5400000
##   2006 0.5340909 0.4659091
##   2007 0.4285714 0.5714286
##   2008 0.4897959 0.5102041
##   2009 0.6694215 0.3305785
##   2010 0.5312500 0.4687500
##   2011 0.5819672 0.4180328
##   2012 0.5697674 0.4302326
##   2013 0.5306122 0.4693878
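The second argument of prop.table() controls which dimension the proportions are calculated within. The sketch below uses two short made-up vectors (decade and result) so that you can check the arithmetic by eye: with 1 each row sums to 1, and with 2 each column sums to 1.

```r
# Made-up data: five films with a decade and a test result
decade <- c("1990s", "1990s", "2000s", "2000s", "2000s")
result <- c("PASS", "FAIL", "FAIL", "PASS", "FAIL")

# Cross tab of counts
tab <- table(decade, result)
tab

# Row proportions: each decade's row sums to 1
row_props <- prop.table(tab, 1)
round(row_props, 2)

# Column proportions: each result's column sums to 1
col_props <- prop.table(tab, 2)
round(col_props, 2)
```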

While the table() function offers considerable flexibility to generate custom tables, the descr package includes functions that return frequency tables and proportions. The descr package can also handle weighted data.

The freq() function from the descr package returns a frequency table as well as a bar plot showing the distribution of the variable in question. To omit the plot, simply add the argument plot = FALSE to the function.

freq(data$Year)

## data$Year 
##       Frequency   Percent
## 1970          1   0.05631
## 1971          5   0.28153
## 1972          2   0.11261
## 1973          5   0.28153
## 1974          7   0.39414
## 1975          5   0.28153
## 1976          8   0.45045
## 1977          7   0.39414
## 1978          8   0.45045
## 1979          5   0.28153
## 1980         14   0.78829
## 1981          9   0.50676
## 1982         14   0.78829
## 1983          5   0.28153
## 1984         16   0.90090
## 1985          8   0.45045
## 1986         10   0.56306
## 1987         14   0.78829
## 1988         19   1.06982
## 1989         14   0.78829
## 1990         15   0.84459
## 1991         13   0.73198
## 1992         20   1.12613
## 1993         16   0.90090
## 1994         26   1.46396
## 1995         36   2.02703
## 1996         42   2.36486
## 1997         51   2.87162
## 1998         62   3.49099
## 1999         56   3.15315
## 2000         63   3.54730
## 2001         64   3.60360
## 2002         80   4.50450
## 2003         64   3.60360
## 2004         81   4.56081
## 2005        100   5.63063
## 2006         88   4.95495
## 2007         70   3.94144
## 2008         98   5.51802
## 2009        121   6.81306
## 2010        128   7.20721
## 2011        122   6.86937
## 2012         86   4.84234
## 2013         98   5.51802
## Total      1776 100.00000

Histograms, Boxplots, and Scatterplots

To quickly visualize the distribution of your data, you can generate basic histograms, boxplots, and scatter plots. You can produce very basic plots, or you can add features or options to enhance the appearance of plots. A later session will cover the use of the ggplot2 package to produce more complex figures.

Aside from the variable, the hist() function also takes the number of breaks we want. We are also providing the optional parameters main and xlab to add a title and a label for the x axis to our histogram.

hist(data$budget, breaks=50, main="Histogram of Budget", xlab="Budget")

We don’t want it to print the numbers in scientific notation, so to change that we can use the options() function. Let’s do another histogram now and let’s also change the number of breaks.

options(scipen=999)

hist(data$intgross, breaks=10, main="Histogram of International Gross", xlab="International Gross in 2013 Dollars")

Now let’s create a boxplot.

boxplot(data$budget)

And now let’s create a scatterplot. The pch argument determines which markers are used.

# plot of international gross and budget

# plot()
plot(data$budget, data$intgross, main="Scatterplot of Budget and International Gross", pch=16)

You can save a plot in PDF format. R will save the file to your working directory unless you specify a different file path. To do that, call the pdf() function before the plot command and dev.off() afterwards.

# save to disk
pdf("scatter_plot.pdf")
plot(data$budget, data$intgross, main="Scatterplot of Budget and International Gross", pch=2)
dev.off()
## png 
##   2

Summary Statistics

The summary() function returns descriptive statistics for your entire dataset or for a specific variable.

# summary statistics
summary(data)
##       Year            IMDBcode                     movietitle  
##  Min.   :1970   tt00293564:   1   Pride and Prejudice   :   3  
##  1st Qu.:1998   tt0035423 :   1   Assault on Precinct 13:   2  
##  Median :2005   tt0065466 :   1   Beautiful Creatures   :   2  
##  Mean   :2003   tt0067065 :   1   Black Christmas       :   2  
##  3rd Qu.:2009   tt0067116 :   1   Carrie                :   2  
##  Max.   :2013   tt0067741 :   1   Conan the Barbarian   :   2  
##                 (Other)   :1770   (Other)               :1763  
##  bechdeltest     budget             domgross             intgross         
##  FAIL:982    Min.   :     8632   Min.   :       899   Min.   :       899  
##  PASS:794    1st Qu.: 16234217   1st Qu.:  20546594   1st Qu.:  33735900  
##              Median : 37157440   Median :  55993640   Median :  96890722  
##              Mean   : 55889829   Mean   :  95174784   Mean   : 198568876  
##              3rd Qu.: 79081181   3rd Qu.: 121678352   3rd Qu.: 241966042  
##              Max.   :461435929   Max.   :1771682790   Max.   :3171930973  
## 
summary(data$budget) ##budget 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      8632  16234217  37157440  55889829  79081181 461435929
summary(data$bechdeltest) ##bechdel test
## FAIL PASS 
##  982  794
quantile(data$intgross)
##         0%        25%        50%        75%       100% 
##        899   33735900   96890722  241966042 3171930973

There are multiple functions to obtain specific summary statistics. Many of these are included in the summary() function, but several are not.

NOTE: Missing values will create problems with these functions, so be sure to exclude them. Generally, for functions that evaluate 1 variable (e.g., mean, variance), use the argument na.rm = TRUE. Functions that evaluate 2 variables (e.g., covariance) take the argument use = "complete.obs".

min(data$intgross)
## [1] 899
max(data$intgross)
## [1] 3171930973
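The movie data used here has no missing values in these columns, so the effect of the arguments described in the note is easy to miss. The sketch below uses two short made-up vectors (budget_demo and gross_demo) that do contain NA values to show what changes.

```r
# Made-up vectors with missing values
budget_demo <- c(10, 20, NA, 40)
gross_demo  <- c(15, NA, 30, 80)

# Without na.rm, a single NA makes the result NA
mean(budget_demo)

# na.rm = TRUE drops missing values before calculating
mean(budget_demo, na.rm = TRUE)

# For two variables, use = "complete.obs" keeps only the pairs with no NA
cov(budget_demo, gross_demo, use = "complete.obs")
```

Only the first and fourth pairs are complete here, so the covariance is based on those two observations alone.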

Now let’s use the mean function. Once we get the mean, let’s say we want to see how the mean differs between movies that failed and movies that passed the Bechdel test. We need to index our budget variable, but we don’t know all the row numbers for movies that failed and it would be tedious to write all of them out. Instead of doing that, we can use a conditional to index our data. Writing data$bechdeltest == "FAIL" within the brackets will check which movies have "FAIL" in the bechdeltest column and subset the budget variable accordingly.

## mean
mean(data$budget, na.rm=TRUE)
## [1] 55889829
mean(data$budget[data$bechdeltest == "FAIL"]) ## mean budget for movies that failed the test
## [1] 63343993
mean(data$budget[data$bechdeltest == "PASS"])
## [1] 46670698
## median
median(data$intgross, na.rm=TRUE)
## [1] 96890722
median(data$intgross[data$bechdeltest == "FAIL"])
## [1] 110526562
median(data$intgross[data$bechdeltest == "PASS"])
## [1] 81799154

The by() function offers an easy option to obtain univariate statistics for subsets of a data frame by a factor (categorical) variable. The first argument is the variable to describe, the second argument is the factor or group variable, and these are followed by the function to apply and an argument for handling missing values. Note that by() can be used with many functions including summary().

by(data$intgross, data$bechdeltest, mean, na.rm=TRUE)
## data$bechdeltest: FAIL
## [1] 223175119
## -------------------------------------------------------- 
## data$bechdeltest: PASS
## [1] 168136471
by(data$intgross, data$bechdeltest, median, na.rm=TRUE)
## data$bechdeltest: FAIL
## [1] 110526562
## -------------------------------------------------------- 
## data$bechdeltest: PASS
## [1] 81799154
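A related base R function, tapply(), takes the same first three arguments as by() but returns the results as a compact named vector, which can be easier to reuse in later calculations. This is offered as an alternative to try, not as the approach used in the rest of this document, and the values below are made up so the group means are easy to verify.

```r
# Made-up values and a grouping factor
gross_demo  <- c(100, 200, 300, 50, 150)
result_demo <- factor(c("FAIL", "FAIL", "FAIL", "PASS", "PASS"))

# by() prints one block per group
by(gross_demo, result_demo, mean, na.rm = TRUE)

# tapply() returns one named value per group
group_means <- tapply(gross_demo, result_demo, mean, na.rm = TRUE)
group_means

# Individual results can be pulled out by name
group_means["FAIL"] - group_means["PASS"]
```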

Creating Variables

Since intgross represents the sum of domestic gross and international gross, we will create a variable to store just international gross. To do that, we need to subtract domgross from intgross for each movie. Instead of doing each calculation manually, we can do it in a single step for all the rows:

data$intgross - data$domgross

At the moment, we’ve subtracted the two, but we haven’t actually stored it anywhere. To store it in the dataframe, we will just specify the dataframe and the name of the new variable and we will assign the subtracted values to it.

data$IntOnlyGross <- data$intgross - data$domgross

Now, we will create a Profit variable. Since profit is the difference between how much money a movie earned (intgross) and how much it spent (budget), we will subtract budget from intgross and store the result in a new variable. Once we have the new variable, let’s see if there’s a difference in median profit based on the Bechdel test. We will use the by() function that we used earlier. Notice that the difference in profit is smaller than the difference in budget and intgross. This suggests that movies that pass the test also generally do well when it comes to profit.

##Now create a profit variable
data$Profit <- data$intgross - data$budget

##median profit by fail pass
by(data$Profit, data$bechdeltest, median, na.rm=TRUE)
## data$bechdeltest: FAIL
## [1] 65328219
## -------------------------------------------------------- 
## data$bechdeltest: PASS
## [1] 44135686

Now let’s create a new variable that tells us whether the movie has a low, medium, or high budget. By categorizing the movies this way, we might be able to see patterns we coudn’t see as easily otherwise. The code below creates a variable called budgetcategory and assigns values to it based on conditions. What we are doing is called recoding.

Shortcut: Note that we are doing the recoding inside the within() function. THis is not necessary, but it is helpful when you need to make multiple changes in one data frame because it allows you to refer to the data frame only once rather than linking every variable to the data frame. For the within function, you pass data (our data frame) to the data argument in within. After that, you can make changes inside curly brackets without having to write data$ each time you refer to a variable. The first line creates a new variable and assigns a NULL value to each row. Then, we assign low to budgetcategory for any row based on a conditional, if the budget is less than or equal to 16 million. The next line assigns a value of medium if the budget is greater than 16 million but less than or equal to 78 million.

Make sure you are careful with your conditional tests! If we had just written less than 78 million, all of the rows labelled low would have been changed to medium.

And lastly, we use a conditional for high. Note that the breakpoints we use to define which category a movie falls in are subjective, since there is no single definition of a low/medium/high budget.

data <- within(data=data, {
   
    ## create and code a variable for budget categories
    budgetcategory <- NULL
    budgetcategory[budget <= 16000000] <- "low"
    budgetcategory[budget > 16000000 & budget <= 78000000] <- "medium"
    budgetcategory[budget > 78000000] <- "high"

})
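The warning about conditional tests is easier to see on a handful of rows. The sketch below uses a made-up toy data frame and runs the recoding twice, once with both bounds on the medium test and once without the lower bound. As a small variation on the code above, it initializes the variable with NA values of the right length instead of NULL, which is an assumption of this sketch rather than something the workshop code does.

```r
toy <- data.frame(budget = c(5e6, 16e6, 40e6, 78e6, 120e6))

# Careful version: the "medium" test has both a lower and an upper bound
careful <- within(toy, {
  budgetcategory <- rep(NA_character_, length(budget))
  budgetcategory[budget <= 16e6] <- "low"
  budgetcategory[budget > 16e6 & budget <= 78e6] <- "medium"
  budgetcategory[budget > 78e6] <- "high"
})

# Careless version: without the lower bound, every "low" row is overwritten
careless <- within(toy, {
  budgetcategory <- rep(NA_character_, length(budget))
  budgetcategory[budget <= 16e6] <- "low"
  budgetcategory[budget <= 78e6] <- "medium"
  budgetcategory[budget > 78e6] <- "high"
})

careful$budgetcategory
careless$budgetcategory
```

In the careless version the first two rows end up as medium, because the later assignment runs after the low assignment and also matches those rows.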

Let’s create a frequency table. Note that our result shows high first, then low, then medium, which is alphabetical order. We want to reorder it as low, medium, high, since that is more intuitive and will make it easier to find patterns later on.

freq(data$budgetcategory)

## data$budgetcategory 
##        Frequency Percent
## high         449   25.28
## low          430   24.21
## medium       897   50.51
## Total       1776  100.00

To be able to reorder text variables, they have to be defined as factors. Factor variables are categorical variables that can be either string or numeric and either ordered or unordered. Factors can be useful for designating categories or groups of interest in both statistical analyses and graphics. Sometimes, when you create a data frame or read in data, a numeric or integer variable may be stored as a factor variable. Even if you see a number when you examine a factor variable, R treats factors as categorical, i.e., not as numeric values.

First, let’s use the class() function to see what the current type is for budgetcategory. It should print character, which means it is considered text/string data.

#check the current type before converting to a factor
class(data$budgetcategory)
## [1] "character"

We will use the factor() function to convert our variable. The levels argument takes a vector of the values the variable can take, in the order that you want them to appear. Not all factors have an inherent ranking, so we also say ordered=TRUE to tell R explicitly that low < medium < high. If we run the freq() function after that, it will print the categories in the correct order.

data$budgetcategory<- factor(data$budgetcategory, levels = c("low", "medium", "high"), ordered=TRUE)

freq(data$budgetcategory)

## data$budgetcategory 
##        Frequency Percent Cum Percent
## low          430   24.21       24.21
## medium       897   50.51       74.72
## high         449   25.28      100.00
## Total       1776  100.00
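As a small sketch of what the levels and ordered arguments change, the example below builds the same factor twice from a made-up character vector called sizes. It only uses base R functions.

```r
sizes <- c("medium", "high", "low", "medium")

# Without levels, factor() sorts the categories alphabetically
alphabetical <- factor(sizes)
levels(alphabetical)

# With levels (and ordered = TRUE) the categories follow the order we give
ranked <- factor(sizes, levels = c("low", "medium", "high"), ordered = TRUE)
levels(ranked)

# Ordered factors support comparisons between categories
below_high <- ranked < "high"
below_high

# Tables follow the level order: low, medium, high
table(ranked)
```

The comparison in the last step only works because the factor is ordered. With an unordered factor, R would return NA with a warning instead.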

Subset Data

There are many different ways to keep or drop certain variables or rows. One of them is the filter() function in the dplyr package, which lets you select rows based on one or more conditions. Let’s say you only want to look at movies in the medium budget category, since it contains about half of the data.

datamedium <- filter(data, budgetcategory == "medium") 
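Since there are several ways to subset, here is a short sketch of two base R alternatives to filter(), using a made-up toy data frame. Logical indexing with square brackets and the subset() function are both part of base R, so no package is needed.

```r
toy <- data.frame(
  title          = c("A", "B", "C", "D"),
  budgetcategory = c("low", "medium", "medium", "high"),
  bechdeltest    = c("PASS", "FAIL", "PASS", "FAIL")
)

# Logical indexing: keep rows where the condition is TRUE, keep all columns
medium_rows <- toy[toy$budgetcategory == "medium", ]

# subset() does the same and lets us refer to columns without toy$
medium_pass <- subset(toy, budgetcategory == "medium" & bechdeltest == "PASS")

# Keep only some columns by naming them
titles_only <- toy[, c("title", "bechdeltest")]

medium_rows
medium_pass
titles_only
```

Whichever approach you use, it is worth checking the number of rows afterwards with nrow() to confirm the subset is the size you expect.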

Now that we have the budgetcategory variable, we can do some analysis with it. Here I am using functions we’ve already made use of, applied to the full data frame.

#now do cross tabs again
#crosstab of budget category and bechdel test results
crosstab <- table(data$budgetcategory, data$bechdeltest)
crosstab
##         
##          FAIL PASS
##   low     210  220
##   medium  472  425
##   high    300  149
prop.table(crosstab, 1)
##         
##               FAIL      PASS
##   low    0.4883721 0.5116279
##   medium 0.5261984 0.4738016
##   high   0.6681514 0.3318486
#plot it 
plot(crosstab)

##average profit for budget category? 
##just use the by function
by(data$Profit, data$budgetcategory, mean)
## data$budgetcategory: low
## [1] 44168378
## -------------------------------------------------------- 
## data$budgetcategory: medium
## [1] 114789835
## -------------------------------------------------------- 
## data$budgetcategory: high
## [1] 292737426
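The second argument of prop.table() in the cross tab above decides what the proportions are relative to. As a minimal sketch with made-up counts, the example below compares row proportions (1), column proportions (2), and proportions of the whole table (no second argument).

```r
# Toy counts, made up for illustration
counts <- matrix(c(20, 30,
                   10, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(budget = c("low", "high"),
                                 test   = c("FAIL", "PASS")))
counts <- as.table(counts)

row_share <- prop.table(counts, 1)   # each row sums to 1
col_share <- prop.table(counts, 2)   # each column sums to 1
all_share <- prop.table(counts)      # the whole table sums to 1

row_share
col_share
all_share
```

Row proportions answer "within each budget category, what share failed or passed?", which is the question asked of the movie data above.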

Creating plots and figures with GGPLOT

Figures and plots can help you present your data and results visually. Base R includes several data visualization options, including the plot() and hist() functions. However, the ggplot2 package is popular with R users across disciplines because of its extensive options and high-quality output. In this section, you will find a variety of ggplot2 graphs and figures.

The ggplot2 package builds plots in layers. With the ggplot() function, you first supply the data frame and the aesthetics (aes), i.e., the data the figure will present. Then, you can add multiple layers to the figure. These layers can include specifications of the figure as well as formatting details. You can save the plot to an object, and once you store a plot, you can add additional layers or make changes to the plot. You will find several examples here, but there are many more options you can discover online.

First, we will create a scatter plot. There are many arguments ggplot can take, but a template to start off with is ggplot(data, aes(x = independent variable, y=dependent variable)). Once we start with that template, we get a chart, but it has nothing on it.

#first scatter budget and revenue by fail and pass
ggplot(data, aes(x=budget, y=intgross))

That’s because the ggplot() function allows us to specify the data and the aesthetics, but it needs additional functions. These additional functions will actually add features or layers to our plot. You add additional functions with an addition sign (+). Your code can be spread over multiple lines, but when you are adding functions, be sure to end each line with + to indicate that the code continues onto the next line. We will use geom_point() to plot points. If you were creating a line chart instead, you would use geom_line().

plot1 <- ggplot(data, aes(x=budget, y=intgross))+ geom_point()
plot1

Now, let’s color the points based on whether the movie failed or passed the Bechdel test. To do that, you add a color argument to the aes() function and give it the variable that you want the points to be colored by.

plot2 <- ggplot(data, aes(x=budget, y=intgross, color=bechdeltest))+geom_point()
plot2
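Because a stored plot is just an object, you can keep adding to it with +. The sketch below uses a made-up toy data frame, and the axis titles passed to labs() are only illustrative choices.

```r
library(ggplot2)

# Toy data frame; the values are made up for illustration only
toy <- data.frame(
  budget      = c(10, 20, 30, 40),
  intgross    = c(15, 60, 45, 120),
  bechdeltest = c("FAIL", "PASS", "FAIL", "PASS")
)

# Start with data + aesthetics, then add a layer of points
base_plot <- ggplot(toy, aes(x = budget, y = intgross, color = bechdeltest)) +
  geom_point()

# Add labels to the stored plot without rebuilding it from scratch
labelled_plot <- base_plot +
  labs(x = "Budget", y = "International gross", color = "Bechdel test")

labelled_plot
```

Note that base_plot is unchanged by the second step, so you can try out different labels or themes on top of the same stored plot.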

Now that we’ve done that, let’s say we want to understand how movies did across the years by looking at the average international gross. Simple enough: we could use the by() function and calculate mean intgross by Year. But what if we also wanted to break the mean intgross down by Bechdel test results? What we want is something that tells us how Bechdel test results affect average gross over time, and let’s say we want to plot it so that we can see the effect visually. Well, we first need to create that data to be able to plot it.

We’re going to use functions from the dplyr package to do this. First, we need to split the data into groups. What would those groups look like? One such group would be movies in 1975 that failed the Bechdel test. Another would be movies in the same year, 1975, that passed the Bechdel test. And so on for every year in the dataset.

We will use the group_by() function to do this and we have to specify which variables to group data by. I’ve stored it into a new variable called summary.

## we're going to calculate average int gross for years + bechdel test result

# 1: set up data frame for by-group processing.  
summary = group_by(data, Year, bechdeltest)

After we group the data, we are going to apply some analysis to each group. To do that, we will use the summarize() function. We give it the summary variable, which is a grouped data frame, and then we specify a new variable we want to create. In our case, we will call it meanIntGross. Although we use the “<-” sign to assign variables everywhere else, inside a function call such as summarize() new columns are given as named arguments, so we have to use the equal sign here. To this variable, we will assign the mean of intgross, which just requires the mean() function. Because it receives a grouped data frame, summarize() will calculate the mean for each group separately. We will put the results of the summarize() function back into summary.

# 2: calculate the summary metric
summary = summarize(summary, 
                    meanIntGross= mean(intgross, na.rm=TRUE))

Lastly, we will just divide each value by 1 million so that now the values will be in millions of dollars. We already learned how to do this.

summary$meanIntGross <- summary$meanIntGross/1000000

head(summary)

We created a temporary data frame to store our results and then used that as input to the next function. This can sometimes clutter up your workspace. We can use what are called pipes to do this without the temporary data frames. Pipes let you take the output of one function and send it directly to the next function. Pipes in R look like %>% and become available when you load dplyr. Below is how the same steps would look with pipes. It is conventional to use pipes for these steps. I’ve also moved the step of dividing by a million into the summarize() function, but that could still be done in a separate step.

summary2 <- data %>%
  group_by(Year, bechdeltest) %>%
  summarize(meanIntGross = mean(intgross, na.rm = TRUE)/1000000)

head(summary2)
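To check that the piped version really does the same thing as the step-by-step version, here is a minimal sketch on a made-up toy data frame. It assumes dplyr 1.0 or later, since it uses the .groups = "drop" argument of summarize() to return an ungrouped result; that argument is an addition in this sketch and is not part of the workshop code above.

```r
library(dplyr)

# Toy data frame; the values are made up for illustration only
toy <- data.frame(
  Year        = c(1975, 1975, 1975, 1976, 1976),
  bechdeltest = c("FAIL", "FAIL", "PASS", "FAIL", "PASS"),
  intgross    = c(2e6, 4e6, 6e6, 8e6, 10e6)
)

# Step by step, with a temporary grouped data frame
grouped  <- group_by(toy, Year, bechdeltest)
stepwise <- summarize(grouped,
                      meanIntGross = mean(intgross, na.rm = TRUE) / 1e6,
                      .groups = "drop")

# The same steps written with pipes
piped <- toy %>%
  group_by(Year, bechdeltest) %>%
  summarize(meanIntGross = mean(intgross, na.rm = TRUE) / 1e6,
            .groups = "drop")

# TRUE when both approaches give the same result
all.equal(stepwise, piped)
piped
```

Each row of the result is one Year and bechdeltest combination, so the two FAIL movies from 1975 are averaged into a single value of 3 (million).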

Now, let’s plot the data from the new data frame we just made. This time we want lines, so we will use geom_line().

plot3 <- ggplot(summary, aes(x=Year, y=meanIntGross, color=bechdeltest)) + geom_line()
plot3

This is the end of this workshop. You now have a good foundation in data analysis in R and have the tools you will need for simple exploratory data analysis! We have only touched some of the capabilities of the packages and functions we’ve used today, but these same tools can also be used for more advanced analysis/complicated visualizations. Come visit us if you have questions about that! Have fun applying the tools you learned today on your own datasets!