###Thank you for extension
In this homework, we’ll be working on getting you set up with the tools you will need for this class. Once you are set up, we’ll do what we’re here to do: analyze data!
Here’s what we will accomplish by the end of the assignment:
We need two basic sets of tools for this class. We will need
R to analyze data. We will need RStudio to
help us interface with R and to produce documentation of our
results.
R is going to be the only programming language we will use. R is an extensible statistical programming environment that can handle all of the main tasks that we’ll need to cover this semester: getting data, analyzing data and communicating data analysis.
If you haven’t already, you need to download R here: https://cran.r-project.org/.
When we work with R, we communicate via the command line. To help automate this process, we can write scripts, which contain all of the commands to be executed. These scripts generate various kinds of output, like numbers on the screen, graphics or reports in common formats (pdf, word). Most programming languages have several I ntegrated D evelopment E nvironments (IDEs) that encompass all of these elements (scripts, command line interface, output). The primary IDE for R is RStudio.
If you haven’t already, you need to download RStudio here: https://rstudio.com/products/rstudio/download/. You need the free RStudio desktop version.
In each class, we’re going to include some code and text in one file,
and data in another file. You’ll need to download both of these files to
your computer. You need to have a particular place to put these files.
Computers are organized using named directories (sometimes called
folders). Don’t just put the files in your Downloads directory. One
common solution is to created a directory on your computer named after
the class: psc_4175. Each time you access the files, you’ll
want to place them in that directory.
We’re going to grab some data that’s part of the college scorecard and do a bit of analysis on it.
Open RStudio, then create a new .Rmd file.
To do this, click on File → New File →
R Markdown....
You will then be asked to determine a bunch of settings for this
.Rmd document. For example, you can choose whether you want
to create a “Document”, “Presentation”, “Shiny”, or “From Template” on
the left. You can set the “Title:” “Author:” and “Date:” on the
top-right. And you can choose the “Default Output Format:” to be either
“HTML”, “PDF”, or “Word”. You should not change any of these
settings. Their defaults (“Document”, “Untitled”, “[Your
name]”, “[Today’s Date]”, and “HTML”) are sufficient. Just click
“OK”.
Copy the raw code from the psc4175_hw_1.Rmd
file by clicking on the copy button as shown in the image below.
Finally, replace the default code in your R Markdown file with the copied code from the GitHub!
If viewing this as an html file, you can view this gif for more help!
Helpful tip! You can also simply download the
.Rmd file for the homeworks/problem sets and work directly
from them. Just be sure you save it to your preferred folder on your
computer system!
.Rmd files will be the only file format we work in this
class. .Rmd files contain three basic elements:
From an .Rmd file you can generate html documents, pdf documents, word documents, slides . . . lots of stuff. All class notes will be in .Rmd. All assignments will be completed on .Rmd files.
In the .Rmd file you’ll notice that there are three open
single quotes in a row, like so: ``` This indicates the
start of a “code chunk” in our file. The first code chunk that we load
will include a set of programs that we will need all semester long.
These .Rmd files are great, but the reader (you and me)
cannot easily see all of the output. You are in effect writing the code
that Microsoft Word does behind the scenes. In order to see the nice
output we need to “knit” the .Rmd file. At the top of
RStudio, you’ll see a “Knit” button with a drop down arrow. If you click
“Knit” the .Rmd will transform into an .html
file and automatically save in the same directory as your
.Rmd file. If you receive an error in the console, you’ll
need to go back and check where that error occurred (this will be
frustrating at first, but you’ll soon find mistakes quite quickly!). I
find it easier to “knit” your .Rmd file as a PDF as this is
the file you will submit for all of your assignments.
To “knit” your .Rmd as a PDF follow these steps:
Select the “Knit” drop-down icon at the top of the RStudio window, and select “Knit to PDF”. RStudio will ask you to first save the markdown file, then it will process the markdown file and render it to PDF.
If this worked, you’re good to go! You can ignore the next section here. If it didn’t work, then proceed with the next steps
Install the tinytex package by typing this code into the
command console of RStudio:
# install.packages("tinytex") # Uncomment this to install
Then, once that has installed successfully, type the following:
# tinytex::install_tinytex() # Uncomment this to install
Remember, once these are loaded, you do not need to install them every time; they are saved in your library!
Now, go back and “Knit to PDF”. If you’re still having problems, let me know in Campuswire.
I like to see results in the Console. By default Rstudio will output results from an Rmd file inline– meaning in the document itself. To change this, go to Tools–>global Options–>R Markdown, and uncheck the box for “show output inline for all Rmarkdown documents.”
When we say that R is extensible, we mean that people in the
community can write programs that everyone else can use. These are
called “packages.” In these first few lines of code, I load a set of
packages using the library command in R. The set of packages, called
tidyverse were written by Hadley Wickham and others and
play a key role in his book. To install this set of packages, simply
type in install.packages("tidyverse") at the R command
prompt. Alternatively, you can use the “Packages” pane in the lower
right hand corner of your Rstudio screen. Click on Packages, then click
on install, then type in “tidyverse.”
To run the code below in R, you can:
CMD+RETURNCTRL+RETURN## Get necessary libraries-- won't work the first time, because you need to install them!
# install.packages("tidyverse") # Uncomment this to install
library(tidyverse)
Here’s the thing about packages. There’s a difference between installing a package and calling a package. Installing means that the package is on your computer and available to use. Calling a package means that the commands in the package will be used in this session. A “session” is basically when R has been opened up on your computer. As long as R/Rstudio are open and running, the session is active.
It’s a good practice to shutdown R/Rstudio once you’re no longer working on it, and then to restart it when you begin working again. Otherwise, the working environment can get pretty crowded with data and packages.
Now we’re ready to load in data. The data frame will be our basic way
of interacting with everything in this class. The
sc_debt.Rds (found here: https://github.com/rweldzius/PSC4175_SUM2025/blob/main/Data/sc_debt.Rds)
data frame contains information from the college scorecard on different
colleges and universities.
tidyverse includes a read_rds() function
that can read data directly from the internet.
df <- read_rds('https://github.com/rweldzius/PSC4175_SUM2025/raw/main/Data/sc_debt.Rds')
You’ll notice that the code above starts with df. This
is just an arbitrary name for an object. You could name it
dat or raw or debt or whatever
you want. Then there’s an arrow <-. This is an
assignment operator. Then there’s a function, readRDS, with
parentheses, and an argument “sc_debt.Rds”. Here’s how to think about
this.
readRDS opens a type of data– rds data. This
function has one argument which is the name of the file I want to
open.So the command above says “use readRDS to open the
file”sc_debt.Rds” and assign the result to the object
df.
Let’s take a quick look at the object df
df
## # A tibble: 2,546 × 16
## unitid instnm stabbr grad_debt_mdn control region preddeg openadmp adm_rate
## <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl>
## 1 100654 Alabama… AL 33375 Public South… Bachel… 2 0.918
## 2 100663 Univers… AL 22500 Public South… Bachel… 2 0.737
## 3 100690 Amridge… AL 27334 Private South… Associ… 1 NA
## 4 100706 Univers… AL 21607 Public South… Bachel… 2 0.826
## 5 100724 Alabama… AL 32000 Public South… Bachel… 2 0.969
## 6 100751 The Uni… AL 23250 Public South… Bachel… 2 0.827
## 7 100760 Central… AL 12500 Public South… Associ… 1 NA
## 8 100812 Athens … AL 19500 Public South… Bachel… NA NA
## 9 100830 Auburn … AL 24826 Public South… Bachel… 2 0.904
## 10 100858 Auburn … AL 21281 Public South… Bachel… 2 0.807
## # ℹ 2,536 more rows
## # ℹ 7 more variables: ccbasic <int>, sat_avg <int>, md_earn_wne_p6 <int>,
## # ugds <int>, costt4_a <int>, selective <dbl>, research_u <dbl>
This is just the first part of the data frame. All data frames have
the exact same structure. Each row is a case. In this example, each row
is a college. Each column is a characteristics of the case, what we call
a variable. Let’s use the names command to see what
variables are in the dataset.
names(df)
## [1] "unitid" "instnm" "stabbr" "grad_debt_mdn"
## [5] "control" "region" "preddeg" "openadmp"
## [9] "adm_rate" "ccbasic" "sat_avg" "md_earn_wne_p6"
## [13] "ugds" "costt4_a" "selective" "research_u"
It’s hard to know what these mean without some more information. We
usually use a codebook to get more information about a dataset. Because
we use very short names for variables, it’s useful to have some more
information (fancy name: metadata) that tells us about those variables.
Below you’ll see the R name for each variable next to a
description of each variable.
| Name | Definition |
|---|---|
| unitid | Unit ID |
| instnm | Institution Name |
| stabbr | State Abbreviation |
| grad_debt_mdn | Median Debt of Graduates |
| control | Control Public or Private |
| region | Census Region |
| preddeg | Predominant Degree Offered: Associates or Bachelors |
| openadmp | Open Admissions Policy: 1= Yes, 2=No,3=No 1st time students |
| adm_rate | Admissions Rate: proportion of applications accepted |
| ccbasic | Type of institution– see here |
| selective | Institution admits fewer than 10 % of applicants, 1=Yes, 0=No |
| research_u | Institution is a research university 1=Yes, 0=No |
| sat_avg | Average Sat Scores |
| md_earn_wne_p6 | Average Earnings of Recent Graduates |
| ugds | Number of undergraduates |
| costt4a | Average cost of attendance (tuition-grants) |
We can also look at the whole dataset using View. Just delete the
# sign below to make the code work. That #
sign is a comment in R code, which indicates to the computer that
everything on that line should be ignored. To get it to run, we need to
drop the #.
View(df)
You’ll notice that this data is arranged in a rectangular format, with each row showing a different college, and each column representing a different characteristic of that college. Datasets are always structured this way— cases (or units) will form the rows, and the characteristics of those cases– or variables— will form the columns. Unlike working with spreadsheets, this structure is always assumed for datasets.
In exploring data, many times we want to look at smaller parts of the dataset. There are three commands we’ll use today that help with this.
-filter selects only those cases or rows that meet some
logical criteria.
-select selects only those variables or columns that
meet some criteria
-arrange arranges the rows of a dataset in the way we
want.
For more on these, please see this vignette.
Let’s grab just the data for Villanova, then look only at the average test scores and admit rate. We can use filter to look at all of the variables for Villanova:
df%>%
filter(instnm=="Villanova University")
## # A tibble: 1 × 16
## unitid instnm stabbr grad_debt_mdn control region preddeg openadmp adm_rate
## <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl>
## 1 216597 Villanov… PA 26000 Private North… Bachel… 2 0.282
## # ℹ 7 more variables: ccbasic <int>, sat_avg <int>, md_earn_wne_p6 <int>,
## # ugds <int>, costt4_a <int>, selective <dbl>, research_u <dbl>
What’s that weird looking %>% thing? That’s called a
pipe. This is how we chain commands together in R. Think of it as saying
“and then” to R. In the above case, we said, take the data and
then filter it to be just the data where the institution name is
Vanderbilt University.
The command above says the following:
Take the dataframe df and then filter it to
just those cases where instnm is equal to “Villanova
University.” Notice the “double equals” sign, that’s a logical operator
asking if instnm is equal to “Villanova University.”
Many times, though we don’t want to see everything, we just want to
choose a few variables. select allows us to select only the
variables we want. In this case, the institution name, its admit rate,
and the average SAT scores of entering students.
df%>%
filter(instnm=="Villanova University")%>%
select(instnm,adm_rate,sat_avg)
## # A tibble: 1 × 3
## instnm adm_rate sat_avg
## <chr> <dbl> <int>
## 1 Villanova University 0.282 1422
filter takes logical tests as its argument. The code
insntnm=="Villanova University" is a logical statement that
will be true of just one case in the dataset– when institution name is
Vanderbilt University. The == is a logical test, asking if
this is equal to that. Other common logical and relational operators for
R include
>, <: greater than, less than>=, <=: greater than or equal to,
less than or equal to! :not, as in != not equal to& AND| ORNext, we can use filter to look at colleges with low
admissions rates, say less than 10% ( or .1 in the proportion scale used
in the dataset).
df%>%
filter(adm_rate<.1)%>%
select(instnm,adm_rate,sat_avg)%>%
arrange(sat_avg,adm_rate)%>%
print(n=20)
## # A tibble: 25 × 3
## instnm adm_rate sat_avg
## <chr> <dbl> <int>
## 1 Colby College 0.0967 1456
## 2 Swarthmore College 0.0893 1469
## 3 Pomona College 0.074 1480
## 4 Dartmouth College 0.0793 1500
## 5 Stanford University 0.0434 1503
## 6 Northwestern University 0.0905 1506
## 7 Columbia University in the City of New York 0.0545 1511
## 8 Brown University 0.0707 1511
## 9 University of Pennsylvania 0.0766 1511
## 10 Vanderbilt University 0.0912 1515
## 11 Harvard University 0.0464 1517
## 12 Princeton University 0.0578 1517
## 13 Yale University 0.0608 1517
## 14 Rice University 0.0872 1520
## 15 Duke University 0.076 1522
## 16 University of Chicago 0.0617 1528
## 17 Massachusetts Institute of Technology 0.067 1547
## 18 California Institute of Technology 0.0642 1557
## 19 Saint Elizabeth College of Nursing 0 NA
## 20 Yeshivat Hechal Shemuel 0 NA
## # ℹ 5 more rows
Now let’s look at colleges with low admit rates, and order them using
arrange by SAT scores (-sat_avg gives
descending order).
df%>%
filter(adm_rate<.1)%>%
select(instnm,adm_rate,sat_avg)%>%
arrange(-sat_avg)
## # A tibble: 25 × 3
## instnm adm_rate sat_avg
## <chr> <dbl> <int>
## 1 California Institute of Technology 0.0642 1557
## 2 Massachusetts Institute of Technology 0.067 1547
## 3 University of Chicago 0.0617 1528
## 4 Duke University 0.076 1522
## 5 Rice University 0.0872 1520
## 6 Yale University 0.0608 1517
## 7 Harvard University 0.0464 1517
## 8 Princeton University 0.0578 1517
## 9 Vanderbilt University 0.0912 1515
## 10 Columbia University in the City of New York 0.0545 1511
## # ℹ 15 more rows
And one last operation: all colleges that admit between 20 and 30 percent of students, looking at their SAT scores, earnings of attendees six years letter, and what state they are in, then arranging by state, and then SAT score.
df%>%
filter(adm_rate>.2&adm_rate<.3)%>%
select(instnm,sat_avg,md_earn_wne_p6,stabbr)%>%
arrange(stabbr,-sat_avg)%>%
print(n=40)
## # A tibble: 37 × 4
## instnm sat_avg md_earn_wne_p6 stabbr
## <chr> <int> <int> <chr>
## 1 Heritage Christian University NA NA AL
## 2 University of California-Santa Barbara 1370 39000 CA
## 3 California Polytechnic State University-San Lu… 1342 52100 CA
## 4 University of California-Irvine 1306 41100 CA
## 5 California Institute of the Arts NA 26900 CA
## 6 University of Miami 1371 47500 FL
## 7 Georgia Institute of Technology-Main Campus 1418 65500 GA
## 8 Point University 986 27300 GA
## 9 Grinnell College 1457 32800 IA
## 10 St Luke's College NA 45300 IA
## 11 Purdue University Northwest 1074 NA IN
## 12 Alice Lloyd College 1040 25300 KY
## 13 Wellesley College 1452 44900 MA
## 14 Boston College 1437 57000 MA
## 15 Brandeis University 1434 41700 MA
## 16 Babson College 1362 70400 MA
## 17 Laboure College NA 52000 MA
## 18 Coppin State University 903 28500 MD
## 19 University of Michigan-Ann Arbor 1448 49800 MI
## 20 University of North Carolina at Chapel Hill 1402 41000 NC
## 21 University of North Carolina School of the Arts 1202 23800 NC
## 22 Cabarrus College of Health Sciences 1063 41600 NC
## 23 Carolina University 979 NA NC
## 24 Wake Forest University NA 51100 NC
## 25 Webb Institute 1465 NA NY
## 26 Vassar College 1452 36100 NY
## 27 Colgate University 1437 47700 NY
## 28 University of Rochester 1418 44800 NY
## 29 Case Western Reserve University 1436 59600 OH
## 30 Denison University 1328 38900 OH
## 31 Kettering College 1135 48800 OH
## 32 Art Academy of Cincinnati 958 22400 OH
## 33 Villanova University 1422 62600 PA
## 34 Rhode Island School of Design 1349 40300 RI
## 35 Trinity University 1381 45700 TX
## 36 University of Virginia-Main Campus 1436 50300 VA
## 37 University of Richmond 1395 46900 VA
Quick Exercise Choose a different college and two different things about that college. Have R print the output.
df%>%
filter(instnm=="University of Miami")%>%
select(adm_rate,costt4_a)%>%
print()
## # A tibble: 1 × 2
## adm_rate costt4_a
## <dbl> <int>
## 1 0.271 67249
## Summarizing Data
To summarize data, we use the `summarize` command. Inside that command, we tell R two things: what to call the new variable that we're creating, and what numerical summary we would like. The code below summarizes median debt for the colleges in the dataset by calculating the average of median debt for all institutions.
## Error in parse(text = input): <text>:4:4: unexpected symbol
## 3:
## 4: To summarize
## ^
df%>%
select(sat_avg)
## # A tibble: 2,546 × 1
## sat_avg
## <int>
## 1 939
## 2 1234
## 3 NA
## 4 1319
## 5 946
## 6 1261
## 7 NA
## 8 NA
## 9 1082
## 10 1300
## # ℹ 2,536 more rows
Quick Exercise Summarize the average entering SAT scores in this dataset.
# select(instnm,sat_avg,md_earn_wne_p6,stabbr)%>%
We can also combine commands, so that summaries are done on only a part of the dataset. Below, we summarize median debt for selective schools, and not very selective schools.
df%>%
filter(adm_rate<.1)%>%
summarize(mean_debt=mean(grad_debt_mdn,na.rm=TRUE))
## # A tibble: 1 × 1
## mean_debt
## <dbl>
## 1 16178.
What about for not very selective schools?
df%>%
filter(adm_rate>.3)%>%
summarize(mean_debt=mean(grad_debt_mdn,na.rm=TRUE))
## # A tibble: 1 × 1
## mean_debt
## <dbl>
## 1 23230.
Quick Exercise Calculate average earnings for schools where SAT>1200
df
## # A tibble: 2,546 × 16
## unitid instnm stabbr grad_debt_mdn control region preddeg openadmp adm_rate
## <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl>
## 1 100654 Alabama… AL 33375 Public South… Bachel… 2 0.918
## 2 100663 Univers… AL 22500 Public South… Bachel… 2 0.737
## 3 100690 Amridge… AL 27334 Private South… Associ… 1 NA
## 4 100706 Univers… AL 21607 Public South… Bachel… 2 0.826
## 5 100724 Alabama… AL 32000 Public South… Bachel… 2 0.969
## 6 100751 The Uni… AL 23250 Public South… Bachel… 2 0.827
## 7 100760 Central… AL 12500 Public South… Associ… 1 NA
## 8 100812 Athens … AL 19500 Public South… Bachel… NA NA
## 9 100830 Auburn … AL 24826 Public South… Bachel… 2 0.904
## 10 100858 Auburn … AL 21281 Public South… Bachel… 2 0.807
## # ℹ 2,536 more rows
## # ℹ 7 more variables: ccbasic <int>, sat_avg <int>, md_earn_wne_p6 <int>,
## # ugds <int>, costt4_a <int>, selective <dbl>, research_u <dbl>
# filter("sat_avg>1200)
Quick Exercise Calculate the average debt for schools that admit over 50% of the students who apply.
df%>%
filter(adm_rate >0.5)
## # A tibble: 1,303 × 16
## unitid instnm stabbr grad_debt_mdn control region preddeg openadmp adm_rate
## <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl>
## 1 100654 Alabama… AL 33375 Public South… Bachel… 2 0.918
## 2 100663 Univers… AL 22500 Public South… Bachel… 2 0.737
## 3 100706 Univers… AL 21607 Public South… Bachel… 2 0.826
## 4 100724 Alabama… AL 32000 Public South… Bachel… 2 0.969
## 5 100751 The Uni… AL 23250 Public South… Bachel… 2 0.827
## 6 100830 Auburn … AL 24826 Public South… Bachel… 2 0.904
## 7 100858 Auburn … AL 21281 Public South… Bachel… 2 0.807
## 8 100937 Birming… AL 25773 Private South… Bachel… 2 0.538
## 9 101189 Faulkne… AL 24363 Private South… Bachel… 2 0.783
## 10 101435 Hunting… AL 26399 Private South… Bachel… 2 0.593
## # ℹ 1,293 more rows
## # ℹ 7 more variables: ccbasic <int>, sat_avg <int>, md_earn_wne_p6 <int>,
## # ugds <int>, costt4_a <int>, selective <dbl>, research_u <dbl>