All the content in this notebook is based on DataCamp’s course “Introduction to Data in R”.

The analysis focuses on a high school survey dataset and an email dataset containing emails received by a single email account in the first 3 months of 2012. Both are available in the openintro package, so we make sure it is loaded. dplyr, ggplot2, gapminder and tidyr are loaded as well.

library(openintro)
library(dplyr)
library(ggplot2)
library(gapminder)
library(tidyr)

Language of data

To load a dataset that is included in a package, use the data() function, which makes it available in your workspace. str() shows the structure of the dataset: its dimensions, the data type of every column, and the first few values of each. dplyr has a glimpse() function that gives slightly cleaner output and shows as much data as will fit on screen.

# The high school survey data
data(hsb2)
str(hsb2)

## 'data.frame':    200 obs. of  11 variables:
##  $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
##  $ gender : chr  "male" "female" "male" "male" ...
##  $ race   : chr  "white" "white" "white" "white" ...
##  $ ses    : Factor w/ 3 levels "low","middle",..: 1 2 3 3 2 2 2 2 2 2 ...
##  $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prog   : Factor w/ 3 levels "general","academic",..: 1 3 1 3 2 2 1 2 1 2 ...
##  $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
##  $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
##  $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
##  $ science: int  47 63 58 53 53 63 53 39 58 50 ...
##  $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...

glimpse(hsb2)

## Observations: 200
## Variables: 11
## $ id      <int> 70, 121, 86, 141, 172, 113, 50, 11, 84, 48, 75, 60, 95...
## $ gender  <chr> "male", "female", "male", "male", "male", "male", "mal...
## $ race    <chr> "white", "white", "white", "white", "white", "white", ...
## $ ses     <fct> low, middle, high, high, middle, middle, middle, middl...
## $ schtyp  <fct> public, public, public, public, public, public, public...
## $ prog    <fct> general, vocational, general, vocational, academic, ac...
## $ read    <int> 57, 68, 44, 63, 47, 44, 50, 34, 63, 57, 60, 57, 73, 54...
## $ write   <int> 52, 59, 33, 44, 52, 52, 59, 46, 57, 55, 46, 65, 60, 63...
## $ math    <int> 41, 53, 54, 47, 57, 51, 42, 45, 54, 52, 51, 51, 71, 57...
## $ science <int> 47, 63, 58, 53, 53, 63, 53, 39, 58, 50, 53, 63, 61, 55...
## $ socst   <int> 57, 61, 31, 56, 61, 61, 61, 36, 51, 51, 61, 61, 71, 46...

# The email dataset
data(email50)
str(email50)

## 'data.frame':    50 obs. of  21 variables:
##  $ spam        : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ to_multiple : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 4 0 0 0 0 0 1 0 ...
##  $ sent_email  : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ time        : POSIXct, format: "2012-01-04 07:19:16" "2012-02-16 14:10:06" ...
##  $ image       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ attach      : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ dollar      : num  0 0 0 0 9 0 0 0 0 23 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ num_char    : num  21.705 7.011 0.631 2.454 41.623 ...
##  $ line_breaks : int  551 183 28 61 1088 5 17 88 242 578 ...
##  $ format      : num  1 1 0 0 1 0 0 1 1 1 ...
##  $ re_subj     : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  8 1 2 1 43 0 0 2 22 3 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...

glimpse(email50)

## Observations: 50
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time         <dttm> 2012-01-04 07:19:16, 2012-02-16 14:10:06, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, no, y...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number       <fct> small, big, none, small, small, small, small, sma...

There are 2 main types of variables:

Numerical or quantitative variables: Take on numerical values for which it is sensible to add, subtract, take averages, etc.

Continuous: Can take any value in a range (infinitely many possible values). Usually measured, e.g. height
Discrete: Can only take a limited set of numeric values, e.g. whole numbers. Usually counted

Categorical or qualitative variables: Take a limited number of distinct values. Even though they can be coded as numbers, it makes no sense to perform arithmetic operations on them

Ordinal: Levels have an inherent order, e.g. good, better, best
Nominal: No inherent order, e.g. colours
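As a rough illustration of how these four types are usually stored in R, here is a minimal sketch on a made-up data frame. The data frame `toy` and all of its values are hypothetical and are not part of the course data.

```r
# Hypothetical toy data: one column per variable type
toy <- data.frame(
  height_cm = c(172.4, 165.0, 180.2),            # continuous numeric
  siblings  = c(1L, 0L, 3L),                     # discrete numeric (a count)
  colour    = factor(c("red", "blue", "red")),   # nominal categorical
  rating    = factor(c("good", "best", "better"),
                     levels = c("good", "better", "best"),
                     ordered = TRUE)             # ordinal categorical
)

# First class of each column: numeric, integer, factor, ordered
types <- sapply(toy, function(col) class(col)[1])
print(types)

# Comparisons are only meaningful for ordered factors
below_best <- toy$rating < "best"
print(below_best)
```

Storing an ordinal variable as an ordered factor is a design choice: it keeps the level order available to comparisons and plots, which a plain character column would lose.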

A good application of the above would be to tag the hsb2 dataset.

id - categorical
gender - categorical
race - categorical
ses - ordinal categorical
schtyp - categorical
prog - categorical
read, write, math, science, socst - continuous numeric

Similarly, for the email50 dataset:

spam, to_multiple, from, sent_email - categorical
time - would depend on the context. I would think here it is continuous numeric
cc, image, attach, dollar - discrete numeric (cc is a count of people cc'ed, not an indicator)
winner - categorical
inherit, viagra, password, num_char, line_breaks - discrete numeric
format, re_subj, exclaim_subj, urgent_subj - categorical
exclaim_mess - discrete numeric
number - ordinal categorical

R often stores categorical variables as factors. Factors enter models differently from numeric variables. They are especially useful for sub-group analysis, which focuses on only part of the data (for example females, or high school students), because we can filter for specific levels.

A very common operation on categorical variables is computing a frequency distribution. One way to do this is the table() function.

# Using table function
table(hsb2$schtyp)

## 
##  public private 
##     168      32

# Then, using dplyr's filter(), we can keep only a particular subset
public_schools <- hsb2 %>% filter(schtyp=="public")

table(public_schools$schtyp)

## 
##  public private 
##     168       0

Even though the new dataset contains no private schools, the level (essentially a placeholder in R) remains, with no observations in it. This can create issues in modeling, plotting, etc. To drop unused levels we can use droplevels().

public_schools$schtyp <- droplevels(public_schools$schtyp)
table(public_schools$schtyp)

## 
## public 
##    168

Similarly, we can create a new dataset, which only has emails containing large numbers

email50_big <- email50 %>% filter(number=="big")
glimpse(email50_big)

## Observations: 7
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email   <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time         <dttm> 2012-02-16 14:10:06, 2012-02-04 17:26:09, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar       <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner       <fct> no, no, yes, no, no, no, no
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password     <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char     <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks  <int> 183, 198, 712, 692, 140, 512, 225
## $ format       <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number       <fct> big, big, big, big, big, big, big

Just like with hsb2, after filtering out the none and small categories of number in the emails, those levels still remain. We can go ahead and remove them.

table(email50_big$number)

## 
##  none small   big 
##     0     0     7

# Dropping levels
email50_big$number <- droplevels(email50_big$number)

# Levels dropped
table(email50_big$number)

## 
## big 
##   7
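The same behaviour can be reproduced without any packages. The sketch below rebuilds only the first ten values of number shown in the str() output earlier, so it is a small illustration rather than the full email50 column.

```r
# First ten values of email50$number as printed by str(); levels in the same order
number <- factor(
  c("small", "big", "none", rep("small", 7)),
  levels = c("none", "small", "big")
)

# Subsetting keeps every level, even those with zero observations
big_only <- number[number == "big"]
print(table(big_only))

# droplevels() removes the levels that are no longer used
big_clean <- droplevels(big_only)
print(table(big_clean))
```

The levels are stored as an attribute of the factor, separately from the values, which is why filtering rows alone does not change them.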

When we create categories from a numeric variable, it is often called discretizing the data. Example: using test scores to create ‘below average’ and ‘at/above average’ groups.

New variables are created using the mutate() function in the dplyr package, and conditional checking is handled by the ifelse() function.

hsb2 <- hsb2 %>% mutate(read_cat = ifelse(read<mean(read),"below average","at/above average"))

Similarly, we can categorize the emails in the dataset as having a below-median or an at/above-median number of characters.

email50 <- email50 %>% mutate(num_char_cat = ifelse(num_char < median(num_char),"below median",
                                                    "at/above median"))
table(email50$num_char_cat)

## 
## at/above median    below median 
##              25              25

Creating another column to mark the absence or presence of numbers in the emails, and then visualizing it

email50 <- email50 %>% mutate(number_yn = ifelse(number=="none","n","y"))
email50 %>% ggplot(aes(x=number_yn)) + geom_bar()

Scatterplots are common for visualizing the relationship between two numeric variables. We can see the association between math and science scores in the hsb2 dataset

hsb2 %>% ggplot(aes(x=math,y=science)) + geom_point()

A multivariate scatterplot can be created by adding a 3rd, categorical variable, shown as color, shape, etc. You may need to force the variable controlling color or shape to be read as a factor if it is not stored as one.

hsb2 %>% ggplot(aes(x=math,y=science,color=prog)) + geom_point()

Showing the relationship between the number of exclamation marks in emails and the number of characters, for spam and non-spam emails

email50 %>% ggplot(aes(x=exclaim_mess,y=num_char,color=factor(spam))) + geom_point()
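A scatterplot shows an association visually. If a single number is wanted, cor() gives the Pearson correlation coefficient. The sketch below uses only the first ten math and science scores printed in the str(hsb2) output above, so the value it returns describes those ten students, not the full dataset.

```r
# First ten math and science scores, copied from the str(hsb2) output
math    <- c(41, 53, 54, 47, 57, 51, 42, 45, 54, 52)
science <- c(47, 63, 58, 53, 53, 63, 53, 39, 58, 50)

# Pearson correlation (the default method); always lies between -1 and 1
r <- cor(math, science)
print(r)
```

With the package data loaded, cor(hsb2$math, hsb2$science) would give the figure for all 200 students.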

Types of studies

1. Observational Study: Data are collected in a way that does not interfere with how the data arise; researchers only observe. Only a correlation or association between the response and explanatory variables can be established. Causal relationships cannot, because the association could be driven by variables which were not controlled or measured. These are called confounding variables.

2. Experiment: Subjects are randomly assigned to different treatments, and hence causal relationships can be inferred.

The gapminder dataset has multiple variables like life expectancy, GDP, etc. for different countries. This kind of data is purely observational, since people were not assigned to go live in these countries.

data(gapminder)
glimpse(gapminder)

## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Random sampling happens when subjects are being selected for a study. Since selection is random, the sample is likely to be a good representation of the population, so the results can be generalized to it.

Random assignment is how subjects are assigned to different treatments in a study. It happens only in experiments.

Random Sampling & Random Assignment.

Labeling variables as response and explanatory does not by itself establish any causal relationship or association. There may be other variables which contribute to the response in conjunction with the explanatory variable. Simpson’s paradox illustrates the effect that omitting an explanatory variable can have on the apparent association between the response and the explanatory variable.

Analyzing UCB data on gender discrimination in admissions. A dplyr way of computing frequency distributions is the count() function, which returns a data frame of counts (printed here as a tibble). spread() from tidyr takes a dataset and makes it wider: the distinct values of the ‘key’ column become new column names, and the cells are filled from the ‘value’ column, so that each remaining combination (here, each Gender) gets one row.

# The original dataset is a 3D array, convert to dataframe
ucb_admits_df <- as.data.frame(UCBAdmissions)
# Expand to one row per applicant by repeating each row Freq times
ucb_admits <- ucb_admits_df[rep(row.names(ucb_admits_df),ucb_admits_df$Freq),1:3]

# Counting males and females admitted
ucb_counts <- ucb_admits %>% count(Gender,Admit)

# Spreading the data
ucb_counts %>% spread(key=Admit,value=n)

## # A tibble: 2 x 3
##   Gender Admitted Rejected
## * <fct>     <int>    <int>
## 1 Male       1198     1493
## 2 Female      557     1278

# Making percentage admitted per gender
ucb_counts %>% spread(key=Admit,value=n) %>% mutate(Perc_Admit=Admitted/(Admitted+Rejected))

## # A tibble: 2 x 4
##   Gender Admitted Rejected Perc_Admit
##   <fct>     <int>    <int>      <dbl>
## 1 Male       1198     1493      0.445
## 2 Female      557     1278      0.304
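In current tidyr releases spread() is superseded by pivot_wider(). A minimal sketch of the same table, assuming tidyr 1.0.0 or later and dplyr are installed, might look like this. The name `ucb_wide` is introduced here for illustration, and the counts are taken directly from the Freq column via count()'s wt argument instead of expanding to one row per applicant.

```r
library(dplyr)
library(tidyr)

# UCBAdmissions ships with base R; as.data.frame() gives one row per cell plus Freq
ucb_counts <- as.data.frame(UCBAdmissions) %>%
  count(Gender, Admit, wt = Freq)   # n = total applicants per Gender/Admit pair

# names_from plays the role of spread()'s key, values_from the role of value
ucb_wide <- ucb_counts %>%
  pivot_wider(names_from = Admit, values_from = n) %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))

print(ucb_wide)
```

The result should match the spread() output above: one row per gender, with Admitted, Rejected and Perc_Admit columns.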

Simpson’s paradox comes into play here. It looks like a higher percentage of males than females is admitted, but if we break the same data down by department we see a different story

ucb_admits %>% count(Dept,Admit,Gender) %>% spread(Admit,n) %>%
  mutate(Perc_Admit = Admitted/(Admitted+Rejected))

## # A tibble: 12 x 5
##    Dept  Gender Admitted Rejected Perc_Admit
##    <fct> <fct>     <int>    <int>      <dbl>
##  1 A     Male        512      313     0.621 
##  2 A     Female       89       19     0.824 
##  3 B     Male        353      207     0.630 
##  4 B     Female       17        8     0.680 
##  5 C     Male        120      205     0.369 
##  6 C     Female      202      391     0.341 
##  7 D     Male        138      279     0.331 
##  8 D     Female      131      244     0.349 
##  9 E     Male         53      138     0.277 
## 10 E     Female       94      299     0.239 
## 11 F     Male         22      351     0.0590
## 12 F     Female       24      317     0.0704

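Within most departments the admission rates of females and males are similar, and in four of the six (A, B, D and F) females have the higher rate. The overall gap arises because more females applied to the departments which reject more applicants.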
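The arithmetic behind the paradox can be checked in base R directly on the UCBAdmissions table, without reshaping. This is a minimal sketch; the object names (`overall_rate`, `dept_rate`, `dept_share`) are chosen here for illustration.

```r
# UCBAdmissions is a 3-D table with dimensions Admit x Gender x Dept
adm <- UCBAdmissions

# Overall admission rate per gender, collapsing over departments
by_gender    <- apply(adm, c(1, 2), sum)                  # Admit x Gender
overall_rate <- by_gender["Admitted", ] / colSums(by_gender)

# Admission rate per gender within each department
applicants <- apply(adm, c(2, 3), sum)                    # Gender x Dept
dept_rate  <- adm["Admitted", , ] / applicants

# Share of each gender's applications that went to each department
dept_share <- prop.table(applicants, margin = 1)

print(round(overall_rate, 3))
print(round(dept_rate, 3))
print(round(dept_share, 3))
```

Reading dept_share next to dept_rate shows where each gender's applications were concentrated and how selective those departments were, which is the omitted variable driving the aggregate gap.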

Sampling Strategies

We need to sample because:

Collecting data from a whole population is very resource intensive
It is not possible to collect data from all possible people (some may be hard to locate). If these hard-to-locate people differ from others in the variable being studied, you’ll get a biased sample
Populations constantly change

Four common sampling methods:

1. Simple Random Sampling: Randomly collect datapoints, such that each datapoint is equally likely to be collected.
2. Stratified Sampling: First divide all datapoints into homogeneous strata, and then randomly sample from within each stratum.
3. Cluster Sampling: Divide all observations into clusters, randomly sample clusters, and then take all observations from the selected clusters. Unlike strata, clusters are heterogeneous within, and clusters are similar enough to one another that we can get away with sampling only a few of them.
4. Multistage Sampling: Same as cluster sampling (3), except that we randomly sample observations from within the selected clusters instead of taking all of them.
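Cluster and multistage sampling are not demonstrated elsewhere in this notebook, so here is a minimal base R sketch of both. The population `pop`, its cluster labels, and the sample sizes are all hypothetical, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 100 units spread evenly over 5 clusters
pop <- data.frame(
  id      = 1:100,
  cluster = rep(paste0("c", 1:5), each = 20),
  stringsAsFactors = FALSE
)

# Cluster sampling: pick 2 clusters at random, keep every unit in them
chosen         <- sample(unique(pop$cluster), 2)
cluster_sample <- pop[pop$cluster %in% chosen, ]

# Multistage sampling: within each chosen cluster, keep a random 5 units
multistage_sample <- do.call(rbind, lapply(
  split(cluster_sample, cluster_sample$cluster),
  function(g) g[sample(nrow(g), 5), ]
))

print(table(cluster_sample$cluster))
print(table(multistage_sample$cluster))
```

The only difference between the two is the second stage: cluster sampling keeps all 20 units of each chosen cluster, while multistage sampling draws again inside them.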

Imagine having to run a study across US counties. Since it might not be feasible to collect data from all US counties, we will sample.

# Loading information about county
data(county)
glimpse(county)

## Observations: 3,143
## Variables: 10
## $ name          <fct> Autauga County, Baldwin County, Barbour County, ...
## $ state         <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ pop2000       <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399...
## $ pop2010       <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947...
## $ fed_spend     <dbl> 6.068095, 6.139862, 8.752158, 7.122016, 5.130910...
## $ poverty       <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0, 19.5, ...
## $ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, ...
## $ multiunit     <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7,...
## $ income        <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916,...
## $ med_income    <dbl> 53255, 50147, 33219, 41770, 45549, 31602, 30659,...

# Now remove DC since it is not a real state
county_no_dc <- county %>% filter(name!="District of Columbia") %>% droplevels()

# Now sample 150 counties
county_srs <- county_no_dc %>% sample_n(size=150)
glimpse(county_srs)

## Observations: 150
## Variables: 10
## $ name          <fct> Stokes County, Pecos County, Mitchell County, Fa...
## $ state         <fct> North Carolina, Texas, North Carolina, South Dak...
## $ pop2000       <dbl> 44711, 16809, 15687, 7453, 381751, 7857, 11211, ...
## $ pop2010       <dbl> 47401, 15507, 15579, 7094, 434972, 7266, 10419, ...
## $ fed_spend     <dbl> 5.597392, 5.399433, 10.157327, 17.982239, 9.4220...
## $ poverty       <dbl> 12.2, 19.9, 16.8, 17.4, 9.0, 8.2, 42.4, 23.0, 11...
## $ homeownership <dbl> 81.8, 70.0, 75.0, 64.8, 76.4, 82.6, 59.1, 72.8, ...
## $ multiunit     <dbl> 4.4, 11.1, 8.6, 13.8, 14.3, 5.0, 12.5, 5.9, 14.5...
## $ income        <dbl> 20852, 16717, 18804, 21574, 27196, 21419, 14190,...
## $ med_income    <dbl> 42689, 38125, 32743, 35833, 57494, 48318, 20081,...

# Checking if all states got equal representation
county_srs %>% group_by(state) %>% count()

## # A tibble: 39 x 2
## # Groups:   state [39]
##    state           n
##    <fct>       <int>
##  1 Alabama         2
##  2 Alaska          2
##  3 Arkansas        5
##  4 California      1
##  5 Colorado        4
##  6 Connecticut     1
##  7 Florida         3
##  8 Georgia        13
##  9 Idaho           1
## 10 Illinois        6
## # ... with 29 more rows

Only 39 states appear in this sample, and some of them are over-represented. Hence take a stratified sample with 3 counties from each of the 50 states

county_strs <- county_no_dc %>% group_by(state) %>% sample_n(size=3)
county_strs %>% count(state)

## # A tibble: 50 x 2
## # Groups:   state [50]
##    state           n
##    <fct>       <int>
##  1 Alabama         3
##  2 Alaska          3
##  3 Arizona         3
##  4 Arkansas        3
##  5 California      3
##  6 Colorado        3
##  7 Connecticut     3
##  8 Delaware        3
##  9 Florida         3
## 10 Georgia         3
## # ... with 40 more rows

Similarly, suppose we now want to sample 8 US states. First, using SRS on the us_regions data

# Making the US regions data, since it is not directly available
data(state)
us_regions <- as.data.frame(cbind(state.name,as.character(state.region)))
names(us_regions) <- c("state","region")

state_srs <- us_regions %>% sample_n(size=8)

state_srs %>% group_by(region) %>% count()

## # A tibble: 4 x 2
## # Groups:   region [4]
##   region            n
##   <fct>         <int>
## 1 North Central     2
## 2 Northeast         2
## 3 South             3
## 4 West              1

Some regions are over-represented. Hence take a stratified sample with 2 states from each region

state_strs <- us_regions %>% group_by(region) %>% sample_n(size=2)
state_strs %>% group_by(region) %>% count()

## # A tibble: 4 x 2
## # Groups:   region [4]
##   region            n
##   <fct>         <int>
## 1 North Central     2
## 2 Northeast         2
## 3 South             2
## 4 West              2
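Two practical notes on the sampling code above: the draws change on every run unless a seed is set, and in dplyr 1.0.0 and later sample_n() is superseded by slice_sample(). A minimal sketch of the stratified draw with both changes, assuming a recent dplyr is installed, could be (the seed value is arbitrary):

```r
library(dplyr)

set.seed(2012)  # arbitrary seed, only so the draw is reproducible

# state.name and state.region ship with base R (datasets package)
us_regions <- data.frame(state = state.name, region = state.region)

# Stratified sample: 2 states drawn at random within each region
state_strs <- us_regions %>%
  group_by(region) %>%
  slice_sample(n = 2) %>%
  ungroup()

print(count(state_strs, region))
```

Building the data frame with data.frame() keeps region as the factor it already is, so the as.character() round trip is not needed in this variant.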

There are 4 main principles of experimental design:

Control: compare treatment of interest to a control group
Randomize: randomly assign subjects to treatments
Replicate: collect a sufficiently large sample within a study, or replicate the entire study
Block: account for potential impact of confounding variables – by grouping subjects into blocks based on these variables, and then randomize within block to treatment groups

Suppose you are studying the effect of light and noise on exam performance, and you suspect that males and females may perform differently under the same light and noise conditions. Gender is then a blocking variable. You first split the subjects into a male block and a female block, and then, within each block, randomly assign subjects to the treatment groups so that each treatment gets an equal share of males and females. This way you have blocked the impact of gender.

In random sampling, stratifying is used to control for a variable. In random assignment, blocking is used to achieve the same.
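Blocked random assignment can be sketched in a few lines of base R. The subjects, the two condition labels, and the group sizes below are hypothetical, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so the assignment is reproducible

# Hypothetical subjects: 10 females and 10 males
subjects <- data.frame(
  id     = 1:20,
  gender = rep(c("female", "male"), each = 10),
  stringsAsFactors = FALSE
)

# Randomize within each block: shuffle a balanced set of labels per gender
subjects$treatment <- NA_character_
for (g in unique(subjects$gender)) {
  rows   <- which(subjects$gender == g)
  labels <- rep(c("control", "treatment"), length.out = length(rows))
  subjects$treatment[rows] <- sample(labels)
}

balance <- table(subjects$gender, subjects$treatment)
print(balance)
```

Because the labels are balanced before shuffling, every block contributes the same number of subjects to each condition, so gender cannot be confounded with the treatment.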

Case Study

The aim of this case study is to find out whether better looking professors receive better evaluations from students. The data was collected at UT-Austin. This study is observational in nature. The data was gathered by randomly sampling class evaluations.

# Downloading and loading the data and getting a glimpse
download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")

glimpse(evals)

## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity     <fct> minority, minority, minority, minority, not mino...
## $ gender        <fct> female, female, female, female, male, male, male...
## $ language      <fct> english, english, english, english, english, eng...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs     <fct> single, single, single, single, multiple, multip...
## $ cls_credits   <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color     <fct> color, color, color, color, color, color, color,...

# Getting more detailed information about variables
str(evals)

## 'data.frame':    463 obs. of  21 variables:
##  $ score        : num  4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
##  $ rank         : Factor w/ 3 levels "teaching","tenure track",..: 2 2 2 2 3 3 3 3 3 3 ...
##  $ ethnicity    : Factor w/ 2 levels "minority","not minority": 1 1 1 1 2 2 2 2 2 2 ...
##  $ gender       : Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 2 1 ...
##  $ language     : Factor w/ 2 levels "english","non-english": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age          : int  36 36 36 36 59 59 59 51 51 40 ...
##  $ cls_perc_eval: num  55.8 68.8 60.8 62.6 85 ...
##  $ cls_did_eval : int  24 86 76 77 17 35 39 55 111 40 ...
##  $ cls_students : int  43 125 125 123 20 40 44 55 195 46 ...
##  $ cls_level    : Factor w/ 2 levels "lower","upper": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cls_profs    : Factor w/ 2 levels "multiple","single": 2 2 2 2 1 1 1 2 2 2 ...
##  $ cls_credits  : Factor w/ 2 levels "multi credit",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bty_f1lower  : int  5 5 5 5 4 4 4 5 5 2 ...
##  $ bty_f1upper  : int  7 7 7 7 4 4 4 2 2 5 ...
##  $ bty_f2upper  : int  6 6 6 6 2 2 2 5 5 4 ...
##  $ bty_m1lower  : int  2 2 2 2 2 2 2 2 2 3 ...
##  $ bty_m1upper  : int  4 4 4 4 3 3 3 3 3 3 ...
##  $ bty_m2upper  : int  6 6 6 6 3 3 3 3 3 2 ...
##  $ bty_avg      : num  5 5 5 5 3 ...
##  $ pic_outfit   : Factor w/ 2 levels "formal","not formal": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pic_color    : Factor w/ 2 levels "black&white",..: 2 2 2 2 2 2 2 2 2 2 ...

# Instead of using actual number of students in class, create categories for size of class
evals <- evals %>% mutate(cls_type = ifelse(cls_students<=18,"small",
                                            ifelse(cls_students<60,"medium","large")))


# Creating a scatterplot to show association of beauty score vs score of each prof
evals %>% ggplot(aes(x=bty_avg,y=score)) + geom_point()

# Taking into account different class types
evals %>% ggplot(aes(x=bty_avg,y=score,color=cls_type)) + geom_point()
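The nested ifelse() used above to build cls_type can also be written with base R's cut(), which returns a factor with the levels already in size order. The sketch below uses the first ten cls_students values from the glimpse() output plus four hypothetical boundary values (18, 19, 59, 60), and it assumes class sizes are whole numbers.

```r
# First ten class sizes from glimpse(evals), plus hypothetical boundary cases
cls_students <- c(43, 125, 125, 123, 20, 40, 44, 55, 195, 46,
                  18, 19, 59, 60)

# The notebook's rule: <= 18 small, 19 to 59 medium, 60 or more large
nested <- ifelse(cls_students <= 18, "small",
                 ifelse(cls_students < 60, "medium", "large"))

# Same rule with cut(); intervals are (-Inf,18], (18,59], (59,Inf)
binned <- cut(cls_students,
              breaks = c(-Inf, 18, 59, Inf),
              labels = c("small", "medium", "large"))

print(table(binned))
```

An ordered set of factor levels also makes ggplot2 list the legend as small, medium, large instead of alphabetically.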