1. Average Class Size

The data file EconEnrollment.csv includes enrollment data for every economics course at the University of Oregon from 2014Q1 to 2020Q3. A description of the variables in the data is in the table below.

Name          Description
Term          Quarter code
coursenumber  Course number
coursename    Descriptive name of the course
instructor    Instructor
enrollment    Number of students in the course
level         Course level (0 - Intro, 1 - Intermediate, 2 - Masters, 3 - PhD)

In this problem, we will explore a couple of R’s “tidyverse” packages while analyzing this data.

  1. Install and load the ‘readr’ and ‘dplyr’ packages.
# You can type your R code here.

library(readr)
library(dplyr)
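If ‘readr’ and ‘dplyr’ are not already installed, a one-time install (run once in the console rather than inside the knitted document) would look like the sketch below; install.packages() is base R, so nothing extra is needed for it.

# Run once; re-installing on every knit is unnecessary
install.packages(c("readr", "dplyr"))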
  2. Set your working directory to the folder on your computer where you downloaded “EconEnrollment.csv” using the “setwd()” function. Load the data using the read_csv function from the ‘readr’ package. Give the dataframe a logical name, like “enroll”.
# Erase eval=FALSE. You will need to do this for every block of code. 
# I had to put it there to show but not run the code. 
# It is an option that tells Markdown to not run the code.
# Complete the commands below
setwd("~/Downloads/School Stuff/EC423")
EconEnrollment <- read_csv("EconEnrollment.csv")
## Rows: 840 Columns: 6
## ── Column specification ─────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): coursename, instructor
## dbl (4): Term, coursenumber, enrollment, level
## 
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(EconEnrollment)
enroll <- EconEnrollment

‘dplyr’ is a popular package for cleaning data in R. It lets you apply a sequence of functions to a dataframe and returns the data with all of those functions applied in order. dplyr chains these steps together with “pipes”, written “%>%”, which pass the result of one step along as the input to the next.
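As a quick illustration (a sketch, not part of the assignment), the pipe simply passes the object on its left as the first argument of the function on its right, so the two calls below are equivalent once dplyr is loaded:

# Standard nested call
filter(enroll, level == 0)
# The same call written with a pipe
enroll %>% filter(level == 0)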

For example, if you want to filter the data to only see courses taught by me, you can use the following code:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
DavisCourses <- enroll %>% filter(instructor=="Davis, Jon")
DavisCourses
## # A tibble: 8 x 6
##     Term coursenumber coursename                     instructor enrollment level
##    <dbl>        <dbl> <chr>                          <chr>           <dbl> <dbl>
## 1 202001          423 Econometrics                   Davis, Jon         NA     2
## 2 201903          607 Experimental Econ              Davis, Jon         13     3
## 3 201901          311 Inter Micro Theory             Davis, Jon         75     1
## 4 201901          423 Econometrics                   Davis, Jon         25     2
## 5 201802          428 Behav and Exp Econ             Davis, Jon         80     2
## 6 201802          607 Applied Behavioral Economics,… Davis, Jon          4     3
## 7 201801          311 Inter Micro Theory             Davis, Jon         83     1
## 8 201801          311 Inter Micro Theory             Davis, Jon         74     1

This shows that I have taught 8 classes in the dataset. My smallest class had 4 students and my largest class had 83 students. Note that the enrollment for the current term (202001) is listed as NA.

Let’s drop the current term to make things easier.

enroll <- enroll %>% filter(Term<202001)
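An alternative sketch (not required here; the name enroll_no_na is just illustrative) is to drop every row with a missing enrollment, which removes the current term's classes as a side effect:

# Keep only rows where enrollment was actually recorded
enroll_no_na <- enroll %>% filter(!is.na(enrollment))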
  3. In the above example, “filter” is an example of a verb being applied to the data. The power of dplyr is in all of the verbs that can be used! Fill in the missing pieces of the following code to find the average class size in each term. In what terms were class sizes biggest and smallest?
byTerm <- enroll %>% group_by(Term) %>% summarize(avg = mean(enrollment)) %>% arrange(avg)

byTerm
## # A tibble: 18 x 2
##      Term   avg
##     <dbl> <dbl>
##  1 201901  59.7
##  2 201801  64.7
##  3 201902  69.1
##  4 201903  70.3
##  5 201803  71.3
##  6 201802  78.2
##  7 201701  80.9
##  8 201601  81.3
##  9 201703  82.1
## 10 201603  84.8
## 11 201402  88.1
## 12 201602  88.3
## 13 201503  91.4
## 14 201502  91.4
## 15 201702  94.6
## 16 201401  95.6
## 17 201501  97.7
## 18 201403  98.1
# Class sizes were smallest in term 201901 (about 59.7 students on average)
# and largest in term 201403 (about 98.1 students on average).
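To pull those terms out programmatically rather than reading them off the sorted table, a sketch like the following would work, assuming a version of dplyr (1.0.0 or later) that provides slice_min() and slice_max():

# Term with the smallest average class size
byTerm %>% slice_min(avg, n = 1)
# Term with the largest average class size
byTerm %>% slice_max(avg, n = 1)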
  4. What was the average class size in the economics department over this time period? Hint: You can base your code off of the code above.
total_avg <- enroll %>%
  summarize(avg = mean(enrollment))

total_avg
## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1  82.1
# The average class size over this period was about 82.1 students.
  5. What was the average class size in the economics department by level of the course over this time period? Hint: Use the group_by verb.
byLevel <- enroll %>% group_by(level) %>% summarize(avg = mean(enrollment)) %>% arrange(avg)

byLevel
## # A tibble: 4 x 2
##   level   avg
##   <dbl> <dbl>
## 1     3  12.4
## 2     2  53.8
## 3     1  75.9
## 4     0 204.
# The average class size was 204.44 students for intro courses, 75.88 for intermediate,
# 53.82 for masters courses, and 12.41 for PhD courses.
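To make output like this easier to read, one option (a sketch; the label text is an assumption based on the variable description above) is to recode the numeric level into labels before summarizing:

# Attach readable labels to the numeric level codes, then summarize as before
enroll %>%
  mutate(level_name = factor(level, levels = 0:3,
                             labels = c("Intro", "Intermediate", "Masters", "PhD"))) %>%
  group_by(level_name) %>%
  summarize(avg = mean(enrollment))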
  6. Now, calculate the average class size by level using the weighted.mean() function instead of the mean() function. Weight the mean by each class’s enrollment. Interpret what this weighted average tells us. Would a prospective student prefer knowing the weighted or unweighted average? What about a prospective faculty hire?
enroll %>% group_by(level) %>% summarize(avg2 = weighted.mean(enrollment, enrollment)) %>% arrange(avg2)
## # A tibble: 4 x 2
##   level  avg2
##   <dbl> <dbl>
## 1     3  14.3
## 2     2  65.5
## 3     1  87.0
## 4     0 251.
# The weighted average counts each class once per enrolled student, so it tells us the
# class size that the typical student actually experiences; because large classes get more
# weight, it is higher than the unweighted average at every level. A prospective student
# would rather be shown the unweighted mean, since the lower number suggests a smaller
# student-to-faculty ratio and makes UO more appealing to apply to. A prospective faculty
# member looking at the weighted mean would see a larger number of students to teach,
# making the school seem more successful and prominent.
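As a sanity check (just a sketch), weighted.mean(x, w) is sum(w * x) / sum(w), so using enrollment as both the value and the weight, the same figures can be reproduced by hand:

# Reproduce the weighted means without weighted.mean()
enroll %>%
  group_by(level) %>%
  summarize(avg2_manual = sum(enrollment * enrollment) / sum(enrollment))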

2. Predicting Criminality

This question is based on the paper “Automated Inference on Criminality using Face Images” by Xiaolin Wu and Xi Zhang. The paper is posted on Canvas.

This paper attracted a lot of attention and stirred controversy when it was first posted in 2016. See, for example, this Vice article. Most of the media coverage of the article focused on the ethics of predicting criminality. This question will help you assess how concerned you should be about the future of predicting criminality from face data alone.

2.a What is the authors’ research question?

The authors’ research question is whether AI facial-recognition methods can infer, from a face image alone, whether a person has criminal tendencies.

2.b What do the authors find? How accurate are their predictions?

The authors find that all four classifiers they test separate the criminal from the non-criminal face images with high accuracy: the CNN achieves 95.40% accuracy, the SVM 93.03%, KNN 88.38%, and logistic regression 86.66%.

2.c As a student in a graduate economics course, would you describe their methods as accessible or inaccessible?

Their methods are somewhat accessible because they are careful in describing their research, but the paper is still very difficult to dissect and understand, especially for readers who do not already know the underlying methods, which also makes the analysis difficult to replicate.

2.d How do the authors collect their data? Are the photos of the criminals and non-criminals comparable?

Yes, they are comparable, because the authors ensure that the photos are not mugshots, which would have biased the data by making the people in those photos look incriminating.

2.e Look at Figure 10. What jumps out to you about the main difference between the “average” criminal’s face compared to the “average” non-criminal’s face?

The “average” criminal faces appear much blurrier than the “average” non-criminal faces.

2.f True/False/Uncertain. You need to understand the methodology of a paper to assess whether its conclusions are plausible. Justify your answer.

I think this is true to an extent, but the level of understanding required has to be reasonable. Many people are not equipped to interpret data and methodology on their own, even in the simplest terms, so they cannot assess a paper's methods directly; instead they have to rely on peer review and replication. If a paper's findings have been checked and reproduced by multiple independent groups, you should be able to conclude that its conclusions are plausible without fully understanding the methodology yourself.