In this problem set, you will investigate objects and data patterns.

1 Step 1: Make changes to YAML header

See section 3.1 of R Markdown: The Definitive Guide (LINK HERE) for help answering these questions

  1. Add a table of contents to the YAML header
  2. The table of contents should have a “depth” of 2
  3. The table of contents should “float”
  4. Add section numbering to headers
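
The four changes above could look roughly like the following header. This is a minimal sketch that assumes the document is knit with the html_document output format; the title is a placeholder and not part of the assignment.

```yaml
---
title: "Module 2 problem set"   # placeholder title (assumption)
output:
  html_document:
    toc: true              # add a table of contents
    toc_depth: 2           # include headers down to level 2
    toc_float: true        # float the table of contents beside the content
    number_sections: true  # number the section headers
---
```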

2 Step 2: Load Tidyverse and Data

Load tidyverse package [code already provided]

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.4     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data frame object df_school_all; it is similar to the data frame df_school used in lecture but has more variables

  • Use names() to see all the variables contained within df_school_all [code already provided]
  • Use table() to tabulate the total number of visits each school received [code already provided]
rm(list = ls()) # remove all objects before loading new dataset
getwd()
#> [1] "C:/Users/Debbie Feehs/Desktop/HED PhD/HED 696C- Data Management and Manipulation in R/rclass/modules/module2"

load(url("https://github.com/ksalazar3/HED696C_RClass/raw/master/data/recruiting/recruit_school_allvars.RData"))
#glimpse(df_school_all)

names(df_school_all)
#>  [1] "state_code"         "school_type"        "ncessch"           
#>  [4] "name"               "address"            "city"              
#>  [7] "zip_code"           "pct_white"          "pct_black"         
#> [10] "pct_hispanic"       "pct_asian"          "pct_amerindian"    
#> [13] "pct_other"          "num_fr_lunch"       "total_students"    
#> [16] "num_took_math"      "num_prof_math"      "num_took_rla"      
#> [19] "num_prof_rla"       "avgmedian_inc_2564" "latitude"          
#> [22] "longitude"          "visits_by_196097"   "visits_by_186380"  
#> [25] "visits_by_215293"   "visits_by_201885"   "visits_by_181464"  
#> [28] "visits_by_139959"   "visits_by_218663"   "visits_by_100751"  
#> [31] "visits_by_199193"   "visits_by_110635"   "visits_by_110653"  
#> [34] "visits_by_126614"   "visits_by_155317"   "visits_by_106397"  
#> [37] "visits_by_149222"   "visits_by_166629"   "total_visits"      
#> [40] "inst_196097"        "inst_186380"        "inst_215293"       
#> [43] "inst_201885"        "inst_181464"        "inst_139959"       
#> [46] "inst_218663"        "inst_100751"        "inst_199193"       
#> [49] "inst_110635"        "inst_110653"        "inst_126614"       
#> [52] "inst_155317"        "inst_106397"        "inst_149222"       
#> [55] "inst_166629"

table(df_school_all$total_visits)
#> 
#>     0     1     2     3     4     5     6     7     8     9    10    11    12 
#> 15405  2718  1324   671   395   263   152   107    89    57    31    25    19 
#>    13    14    15    16    17    18    19    20    21    22    23    26 
#>    16     8     7     3     1     2     1     1     1     1     3     1
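
As a small illustration of how table() tabulates a variable, the sketch below uses a made-up integer vector (not values from df_school_all). The object names total_visits and visit_counts are hypothetical.

```r
# Made-up visit counts for six hypothetical schools
total_visits <- c(0L, 0L, 1L, 2L, 2L, 2L)

# table() counts how many times each distinct value occurs
visit_counts <- table(total_visits)
visit_counts

# Look up the count for one value by its name (names are character strings)
visit_counts[["2"]]

# The counts add up to the number of observations
sum(visit_counts)
```

In the output above, the first row of the table() result lists the distinct values of total_visits and the second row lists how many schools have each value.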

3 Step 3: Investigating objects

  1. Answer the following questions about the object df_school_all by running the appropriate R commands within the code chunk and writing any substantive response required next to the question. The code and substantive response for the first question are provided as an example.
    • What “type” of object is df_school_all?
      • ANSWER [FREEBIE]: the object df_school_all is a list type
    • What is the “length” of the object df_school_all? What does this specific value of length refer to?
      • YOUR ANSWER HERE: the length is 55. This is the number of elements in the list, that is, the number of variables (columns) in the data frame.
    • How many “rows” are in the object df_school_all? What does each row represent?
      • YOUR ANSWER HERE: there are 21301 rows. Each row is one observation, that is, one high school.
#type of df_school_all 
typeof(df_school_all)
#> [1] "list"

#length of df_school_all 
length(df_school_all)
#> [1] 55

#num of rows in df_school_all
nrow(df_school_all)
#> [1] 21301
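
The relationship between typeof(), length(), and nrow() can be checked on a tiny data frame. The sketch below uses made-up values and a hypothetical object named toy; it is not drawn from df_school_all.

```r
# Toy data frame with 2 variables and 3 observations (made-up values)
toy <- data.frame(
  name = c("School A", "School B", "School C"),
  total_visits = c(0L, 2L, 5L),
  stringsAsFactors = FALSE
)

typeof(toy)  # a data frame is stored as a list of columns
length(toy)  # number of elements in that list, i.e. number of variables
ncol(toy)    # same value as length(toy)
nrow(toy)    # number of rows, i.e. number of observations
```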
  1. In the below code chunk, use the str() function to describe the contents of df_school_all and then answer the following questions in text above the code chunk.
    • What does each element of the object df_school_all represent?
      • YOUR ANSWER HERE: each element is one variable (column) of the data frame, holding that variable’s value for every school.
    • Are the individual elements within df_school_all lists or vectors?
      • YOUR ANSWER HERE: the elements are vectors. The str() output shows that each element is an atomic vector of a single type (chr, num, or int) with 21301 values. The data frame as a whole is the list, which is why it can hold elements of different types.
    • Are the individual elements within df_school_all named or un-named? What do these element names refer to? (hint: use names())
      • YOUR ANSWER HERE: they are named; the element names are the names of the 55 variables.
str(df_school_all)
#> tibble [21,301 × 55] (S3: tbl_df/tbl/data.frame)
#>  $ state_code        : chr [1:21301] "AK" "AK" "AK" "AK" ...
#>  $ school_type       : chr [1:21301] "public" "public" "public" "public" ...
#>  $ ncessch           : chr [1:21301] "020000100208" "020000100211" "020000100212" "020000100213" ...
#>  $ name              : chr [1:21301] "Bethel Regional High School" "Ayagina'ar Elitnaurvik" "Kwigillingok School" "Nelson Island Area School" ...
#>  $ address           : chr [1:21301] "1006 Ron Edwards Memorial Dr" "106 Village Road" "108 Village Road" "118 Village Road" ...
#>  $ city              : chr [1:21301] "Bethel" "Kongiganak" "Kwigillingok" "Toksook Bay" ...
#>  $ zip_code          : chr [1:21301] "99559" "99559" "99622" "99637" ...
#>  $ pct_white         : num [1:21301] 11.78 0 0 0 2.52 ...
#>  $ pct_black         : num [1:21301] 0.599 0 0 0 0 ...
#>  $ pct_hispanic      : num [1:21301] 1.6 0 0 0 0 ...
#>  $ pct_asian         : num [1:21301] 0.998 0 0 0 0 ...
#>  $ pct_amerindian    : num [1:21301] 84.6 99.5 100 100 97.5 ...
#>  $ pct_other         : num [1:21301] 0.399 0.549 0 0 0 ...
#>  $ num_fr_lunch      : num [1:21301] 362 182 116 187 238 180 418 185 179 186 ...
#>  $ total_students    : num [1:21301] 501 182 120 201 238 231 428 262 179 186 ...
#>  $ num_took_math     : num [1:21301] 146 17 14 30 28 25 62 21 23 19 ...
#>  $ num_prof_math     : num [1:21301] 24.8 1.7 3.5 3 2.8 ...
#>  $ num_took_rla      : num [1:21301] 147 17 14 30 28 24 62 22 23 19 ...
#>  $ num_prof_rla      : num [1:21301] 25 1.7 3.5 3 2.8 ...
#>  $ avgmedian_inc_2564: num [1:21301] 76160 76160 NA 57657 37553 ...
#>  $ latitude          : num [1:21301] 60.8 60 59.9 60.5 62.7 ...
#>  $ longitude         : num [1:21301] -162 -163 -163 -165 -165 ...
#>  $ visits_by_196097  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_186380  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_215293  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_201885  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_181464  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_139959  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_218663  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_100751  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_199193  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_110635  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_110653  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_126614  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_155317  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_106397  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_149222  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ visits_by_166629  : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ total_visits      : int [1:21301] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ inst_196097       : chr [1:21301] "NY" "NY" "NY" "NY" ...
#>  $ inst_186380       : chr [1:21301] "NJ" "NJ" "NJ" "NJ" ...
#>  $ inst_215293       : chr [1:21301] "PA" "PA" "PA" "PA" ...
#>  $ inst_201885       : chr [1:21301] "OH" "OH" "OH" "OH" ...
#>  $ inst_181464       : chr [1:21301] "NE" "NE" "NE" "NE" ...
#>  $ inst_139959       : chr [1:21301] "GA" "GA" "GA" "GA" ...
#>  $ inst_218663       : chr [1:21301] "SC" "SC" "SC" "SC" ...
#>  $ inst_100751       : chr [1:21301] "AL" "AL" "AL" "AL" ...
#>  $ inst_199193       : chr [1:21301] "NC" "NC" "NC" "NC" ...
#>  $ inst_110635       : chr [1:21301] "CA" "CA" "CA" "CA" ...
#>  $ inst_110653       : chr [1:21301] "CA" "CA" "CA" "CA" ...
#>  $ inst_126614       : chr [1:21301] "CO" "CO" "CO" "CO" ...
#>  $ inst_155317       : chr [1:21301] "KS" "KS" "KS" "KS" ...
#>  $ inst_106397       : chr [1:21301] "AR" "AR" "AR" "AR" ...
#>  $ inst_149222       : chr [1:21301] "IL" "IL" "IL" "IL" ...
#>  $ inst_166629       : chr [1:21301] "MA" "MA" "MA" "MA" ...

typeof(df_school_all)
#> [1] "list"

names(df_school_all)
#>  [1] "state_code"         "school_type"        "ncessch"           
#>  [4] "name"               "address"            "city"              
#>  [7] "zip_code"           "pct_white"          "pct_black"         
#> [10] "pct_hispanic"       "pct_asian"          "pct_amerindian"    
#> [13] "pct_other"          "num_fr_lunch"       "total_students"    
#> [16] "num_took_math"      "num_prof_math"      "num_took_rla"      
#> [19] "num_prof_rla"       "avgmedian_inc_2564" "latitude"          
#> [22] "longitude"          "visits_by_196097"   "visits_by_186380"  
#> [25] "visits_by_215293"   "visits_by_201885"   "visits_by_181464"  
#> [28] "visits_by_139959"   "visits_by_218663"   "visits_by_100751"  
#> [31] "visits_by_199193"   "visits_by_110635"   "visits_by_110653"  
#> [34] "visits_by_126614"   "visits_by_155317"   "visits_by_106397"  
#> [37] "visits_by_149222"   "visits_by_166629"   "total_visits"      
#> [40] "inst_196097"        "inst_186380"        "inst_215293"       
#> [43] "inst_201885"        "inst_181464"        "inst_139959"       
#> [46] "inst_218663"        "inst_100751"        "inst_199193"       
#> [49] "inst_110635"        "inst_110653"        "inst_126614"       
#> [52] "inst_155317"        "inst_106397"        "inst_149222"       
#> [55] "inst_166629"
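
One way to check whether the elements of a data frame are lists or vectors is to apply typeof(), is.atomic(), and is.list() to every column. The sketch below does this on a made-up data frame named toy (a hypothetical stand-in for df_school_all).

```r
# Toy data frame with one character, one double, and one integer column
toy <- data.frame(
  state_code = c("AK", "AL", "AZ"),
  pct_white = c(11.78, 40.2, 55.1),
  total_visits = c(0L, 2L, 5L),
  stringsAsFactors = FALSE
)

# Type of each element (column); the result is named by variable
col_types <- sapply(toy, typeof)
col_types

# Every column is an atomic vector, and none of them is a list
all_atomic <- all(sapply(toy, is.atomic))
any_list <- any(sapply(toy, is.list))
all_atomic
any_list
```

The data frame itself is the list; its elements are atomic vectors of equal length, each with its own single type.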
  1. These questions refer to the variable school_type within the object df_school_all. Run the appropriate R commands in the chunk below and write substantive responses below each question.
    • What is the data “type” of school_type?
      • YOUR ANSWER HERE: character
    • What is the “length” of school_type? What does this specific value of length refer to?
      • YOUR ANSWER HERE: 21301. This is the number of observations listed within “school_type”.
typeof(df_school_all$school_type)
#> [1] "character"

length(df_school_all$school_type)
#> [1] 21301
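
typeof() and class() return the same string for a character vector, but they can differ for other vectors. The sketch below compares them on made-up vectors; the object names are hypothetical.

```r
# Character vector: typeof() and class() agree
school_type <- c("public", "private", "public")
typeof(school_type)
class(school_type)
length(school_type)  # one value per observation

# Vector of decimal numbers: storage type is "double", class is "numeric"
pct <- c(11.78, 0, 2.52)
typeof(pct)
class(pct)

# Integer vector (note the L suffix): both report "integer"
visits <- c(0L, 2L, 5L)
typeof(visits)
class(visits)
```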

4 Step 4: Filter, select, arrange questions

The data frame df_school_all has one observation for each high school (public and private).

  • The variables that begin with visits_by_... identify how many off-campus recruiting visits the high school received from a particular public university. For example, UC Berkeley has the ID 110635 so the variable visits_by_110635 identifies how many visits the high school received from UC Berkeley.
  • The variable total_visits identifies the number of visits the high school received from all (16) public research universities in this data collection sample.

For the questions below, imagine that you have been asked by a major news outlet to identify which high schools receive the largest total number of off-campus recruiting visits from public universities.

  • For all questions below, you can answer using one line of code or you can answer in several steps (e.g., first create new data frame, then print selected variables)
  • For questions that ask you to print the “top 10” observations, you can simply print the object and rely on the fact that the default option is to print 10 observations OR you can wrap the command in the head() function and explicitly tell R to print 10 observations
  • Before conducting analyses, we’ll rename the variable avgmedian_inc_2564 to give it a shorter name.
  1. Rename the variable avgmedian_inc_2564 to med_inc and assign the result back to the existing object df_school_all
df_school_all <- rename(df_school_all, med_inc = avgmedian_inc_2564)
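
The rename() syntax is new_name = old_name, and the change only persists if the result is assigned to an object. A minimal sketch on a made-up tibble named toy (hypothetical values):

```r
library(dplyr)

# Toy tibble with the long variable name (made-up values)
toy <- tibble(
  name = c("School A", "School B"),
  avgmedian_inc_2564 = c(76160, 57657)
)

# rename() returns a new tibble; assign it back to keep the new name
toy <- rename(toy, med_inc = avgmedian_inc_2564)
names(toy)
```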
  1. The news outlet is interested in comparing the in-state and out-of-state high school visits for The University of Alabama (IPEDS ID = 100751), which are recorded in the variable visits_by_100751. Compare the number of in-state public high schools to the number that received at least one visit from The University of Alabama.
    • Note: You will need to use filter and the count function. The associated variables needed to filter by: state_code, school_type, and visits_by_100751
    • Use commas to separate conditions for this question (e.g. filter(dataframe, variable == something, variable == something))
    • You can do this in one step by wrapping the count function around the filter function, or you can do this in two steps by creating a new data frame first
df_school_alabama <- filter(df_school_all, school_type == "public", state_code == "AL", visits_by_100751 >= 1)

nrow(df_school_alabama)
#> [1] 108
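
The comparison the question asks for needs two counts: all in-state public high schools, and the subset of those with at least one visit. The sketch below shows both on a made-up tibble named toy; the values are hypothetical, and only the variable names come from df_school_all.

```r
library(dplyr)

# Toy tibble: five hypothetical schools
toy <- tibble(
  state_code = c("AL", "AL", "AL", "GA", "AL"),
  school_type = c("public", "public", "private", "public", "public"),
  visits_by_100751 = c(0L, 2L, 1L, 3L, 1L)
)

# All in-state public high schools (conditions separated by commas are ANDed)
n_instate_public <- nrow(filter(toy, state_code == "AL", school_type == "public"))

# In-state public high schools with at least one visit
n_visited <- nrow(filter(toy, state_code == "AL", school_type == "public",
                         visits_by_100751 >= 1))

n_instate_public
n_visited
```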
  1. How many public in-state high schools visited by The University of Alabama enroll at least 50% Latinx students or 50% Black students?
    • hint: use the variables pct_hispanic and pct_black.
# The percent variables are on a 0-100 scale, so the threshold is 50; "or" is written with |
df_school_alabama_50his_50blk <- filter(df_school_alabama, pct_hispanic >= 50 | pct_black >= 50)

nrow(df_school_alabama_50his_50blk)
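
Inside filter(), a comma between conditions means "and", while | means "or", so the two can return different numbers of rows. The sketch below shows the difference on a made-up tibble named toy, with percentages on the same 0 to 100 scale that the str() output shows for pct_hispanic and pct_black.

```r
library(dplyr)

# Toy tibble: four hypothetical schools
toy <- tibble(
  name = c("A", "B", "C", "D"),
  pct_hispanic = c(60, 10, 55, 5),
  pct_black = c(10, 70, 50, 20)
)

# Comma = AND: both conditions must hold (only school C)
n_and <- nrow(filter(toy, pct_hispanic >= 50, pct_black >= 50))

# | = OR: at least one condition must hold (schools A, B, and C)
n_or <- nrow(filter(toy, pct_hispanic >= 50 | pct_black >= 50))

n_and
n_or
```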
  1. Now count the out-of-state public high schools that received at least one visit from The University of Alabama without using commas to separate conditions in your filter.
    • hint: & or %in%
count(filter(df_school_all, visits_by_100751 >= 1 & school_type == "public" & state_code != "AL")) 
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1  1644
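
Both hints, & and %in%, can be tried on a small example. The sketch below uses a made-up tibble named toy with hypothetical values; the %in% variant assumes the out-of-state states of interest are listed explicitly.

```r
library(dplyr)

# Toy tibble: five hypothetical schools
toy <- tibble(
  state_code = c("AL", "GA", "FL", "GA", "TX"),
  school_type = c("public", "public", "private", "public", "public"),
  visits_by_100751 = c(3L, 1L, 2L, 0L, 4L)
)

# & joins conditions without commas; != excludes the home state
out_of_state <- filter(
  toy,
  visits_by_100751 >= 1 & school_type == "public" & state_code != "AL"
)

# %in% keeps rows whose state_code matches any value in the vector
selected <- filter(
  toy,
  visits_by_100751 >= 1 & school_type == "public" & state_code %in% c("GA", "TX")
)

# count() returns a one-row tibble with the number of rows in column n
n_out <- count(out_of_state)$n
n_selected <- count(selected)$n
n_out
n_selected
```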

Once finished, knit to HTML and upload both the .Rmd and HTML files

Remember to use this naming convention “lastname_module2_ps”