All the content in this notebook is based on DataCamp’s course “Introduction to Data in R”.
The analysis focuses on two datasets from the openintro package: a high school survey dataset, and an email dataset containing emails received by a single email account in the first 3 months of 2012. We therefore make sure openintro is loaded. In addition, dplyr and the other packages used below are loaded.
library(openintro)
library(dplyr)
library(ggplot2)
library(gapminder)
library(tidyr)
Language of data
To load a dataset that is included in a package, you can use the data() function. It makes the dataset available in your workspace. str() gives us the structure of the dataset: its dimensions, the first few values of every column, their data types, etc. dplyr has a glimpse() function, which gives slightly cleaner output and shows as much data as will fit on screen.
# The high school survey data
data(hsb2)
str(hsb2)
## 'data.frame': 200 obs. of 11 variables:
## $ id : int 70 121 86 141 172 113 50 11 84 48 ...
## $ gender : chr "male" "female" "male" "male" ...
## $ race : chr "white" "white" "white" "white" ...
## $ ses : Factor w/ 3 levels "low","middle",..: 1 2 3 3 2 2 2 2 2 2 ...
## $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
## $ prog : Factor w/ 3 levels "general","academic",..: 1 3 1 3 2 2 1 2 1 2 ...
## $ read : int 57 68 44 63 47 44 50 34 63 57 ...
## $ write : int 52 59 33 44 52 52 59 46 57 55 ...
## $ math : int 41 53 54 47 57 51 42 45 54 52 ...
## $ science: int 47 63 58 53 53 63 53 39 58 50 ...
## $ socst : int 57 61 31 56 61 61 61 36 51 51 ...
glimpse(hsb2)
## Observations: 200
## Variables: 11
## $ id <int> 70, 121, 86, 141, 172, 113, 50, 11, 84, 48, 75, 60, 95...
## $ gender <chr> "male", "female", "male", "male", "male", "male", "mal...
## $ race <chr> "white", "white", "white", "white", "white", "white", ...
## $ ses <fct> low, middle, high, high, middle, middle, middle, middl...
## $ schtyp <fct> public, public, public, public, public, public, public...
## $ prog <fct> general, vocational, general, vocational, academic, ac...
## $ read <int> 57, 68, 44, 63, 47, 44, 50, 34, 63, 57, 60, 57, 73, 54...
## $ write <int> 52, 59, 33, 44, 52, 52, 59, 46, 57, 55, 46, 65, 60, 63...
## $ math <int> 41, 53, 54, 47, 57, 51, 42, 45, 54, 52, 51, 51, 71, 57...
## $ science <int> 47, 63, 58, 53, 53, 63, 53, 39, 58, 50, 53, 63, 61, 55...
## $ socst <int> 57, 61, 31, 56, 61, 61, 61, 36, 51, 51, 61, 61, 71, 46...
# The email dataset
data(email50)
str(email50)
## 'data.frame': 50 obs. of 21 variables:
## $ spam : num 0 0 1 0 0 0 0 0 0 0 ...
## $ to_multiple : num 0 0 0 0 0 0 0 0 0 0 ...
## $ from : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cc : int 0 0 4 0 0 0 0 0 1 0 ...
## $ sent_email : num 1 0 0 0 0 0 0 1 1 0 ...
## $ time : POSIXct, format: "2012-01-04 07:19:16" "2012-02-16 14:10:06" ...
## $ image : num 0 0 0 0 0 0 0 0 0 0 ...
## $ attach : num 0 0 2 0 0 0 0 0 0 0 ...
## $ dollar : num 0 0 0 0 9 0 0 0 0 23 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ viagra : num 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num 0 0 0 0 1 0 0 0 0 0 ...
## $ num_char : num 21.705 7.011 0.631 2.454 41.623 ...
## $ line_breaks : int 551 183 28 61 1088 5 17 88 242 578 ...
## $ format : num 1 1 0 0 1 0 0 1 1 1 ...
## $ re_subj : num 1 0 0 0 0 0 0 1 1 0 ...
## $ exclaim_subj: num 0 0 0 0 0 0 0 0 1 0 ...
## $ urgent_subj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_mess: num 8 1 2 1 43 0 0 2 22 3 ...
## $ number : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time <dttm> 2012-01-04 07:19:16, 2012-02-16 14:10:06, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, y...
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number <fct> small, big, none, small, small, small, small, sma...
There are 2 main types of variables:
- Numerical or quantitative variables: They take on numerical values, and it is sensible to add, subtract, etc. with them
  - Continuous: Can take an infinite number of values within a range. Usually measured, e.g. height
  - Discrete: Can only take on a limited set of numeric values, e.g. whole numbers. Often counted
- Categorical or qualitative variables: Take a limited number of values, and even though they can be coded as numbers, it makes no sense to perform arithmetic operations on them
  - Ordinal: Have an inherent order, e.g. good, better, best
  - Nominal: Have no inherent order, e.g. colours
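As a rough illustration of how these types map onto R's storage classes, the sketch below builds a tiny made-up data frame. The column names and values are hypothetical and are not taken from hsb2 or email50.

```r
# Toy data: every value here is invented for illustration
students <- data.frame(
  height_cm = c(160.5, 172.3, 181.0),                  # continuous numerical
  siblings  = c(0L, 2L, 1L),                           # discrete numerical (counted)
  eye_color = factor(c("brown", "blue", "brown")),     # nominal categorical
  rating    = factor(c("good", "best", "better"),
                     levels = c("good", "better", "best"),
                     ordered = TRUE)                   # ordinal categorical
)

str(students)

# Arithmetic is meaningful for numerical variables ...
mean_height <- mean(students$height_cm)
print(mean_height)

# ... while ordered factors support comparisons, but not arithmetic
is_above_good <- students$rating > "good"
print(is_above_good)
```

A nominal factor such as eye_color supports neither arithmetic nor ordering; only equality checks and counts are meaningful for it.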
A good application of the above is to classify the variables in the hsb2 dataset:
- id - categorical
- gender - categorical
- race - categorical
- ses - ordinal categorical
- schtyp - categorical
- prog - categorical
- read, write, math, science, socst - continuous numeric
Similarly, for the email50 dataset:
- spam, to_multiple, from, cc, sent_email - categorical
- time - depends on the context; here I would treat it as continuous numeric
- image, attach, dollar - discrete numeric
- winner - categorical
- inherit, viagra, password, num_char, line_breaks - discrete numeric
- format, re_subj, exclaim_subj, urgent_subj - categorical
- exclaim_mess - discrete numeric
- number - ordinal categorical
R often stores categorical variables as factors. Factors enter models differently from numeric variables. They are especially useful for sub-group analysis, which focuses on only a part of the data, for example females or high school students. We can filter for specific levels.
A very common operation on categorical variables is computing a frequency distribution. One way to do this is the table() function.
# Using table function
table(hsb2$schtyp)
##
## public private
## 168 32
# Then, using dplyr's filter, we can keep only a particular subset
public_schools <- hsb2 %>% filter(schtyp=="public")
table(public_schools$schtyp)
##
## public private
## 168 0
Despite there being no private schools in the new dataset, the level, which is essentially a placeholder in R, remains, although it has no observations. This can create issues in modeling, plotting, etc. To drop unused levels we can use droplevels().
public_schools$schtyp <- droplevels(public_schools$schtyp)
table(public_schools$schtyp)
##
## public
## 168
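The behaviour described above can be reproduced on a small scale without any packages. This is a minimal sketch on an invented factor, using base R subsetting in place of dplyr's filter().

```r
# Toy factor: values invented for illustration
school <- factor(c("public", "public", "private", "public"),
                 levels = c("public", "private"))

# Subsetting keeps the full set of levels, even the now-empty one
public_only <- school[school == "public"]
print(table(public_only))

# droplevels() removes levels that have no observations
public_clean <- droplevels(public_only)
print(table(public_clean))
```

The first table still lists private with a count of 0, while the second lists only public.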
Similarly, we can create a new dataset, which only has emails containing large numbers
email50_big <- email50 %>% filter(number=="big")
glimpse(email50_big)
## Observations: 7
## Variables: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time <dttm> 2012-02-16 14:10:06, 2012-02-04 17:26:09, 2012-0...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner <fct> no, no, yes, no, no, no, no
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks <int> 183, 198, 712, 692, 140, 512, 225
## $ format <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number <fct> big, big, big, big, big, big, big
Just as with hsb2, after filtering out the none and small categories of number, the levels still remain. We can go ahead and remove them.
table(email50_big$number)
##
## none small big
## 0 0 7
# Dropping levels
email50_big$number <- droplevels(email50_big$number)
# Levels dropped
table(email50_big$number)
##
## big
## 7
Creating categories from a numeric variable is often called discretizing the data. Example: using test scores to create ‘below average’ and ‘at/above average’ groups.
New variables are created using the mutate() function in the dplyr package, and conditional checking is handled by the ifelse() function.
hsb2 <- hsb2 %>% mutate(read_cat = ifelse(read<mean(read),"below average","at/above average"))
Similarly, we can categorize the emails in the dataset by whether their number of characters is below the median, or at or above it.
email50 <- email50 %>% mutate(num_char_cat = ifelse(num_char < median(num_char),"below median",
"at/above median"))
table(email50$num_char_cat)
##
## at/above median below median
## 25 25
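To make the mechanics of ifelse() concrete, here is a minimal sketch on a handful of made-up scores. The vector and its values are hypothetical and do not come from hsb2.

```r
scores <- c(34, 57, 44, 63, 50)   # invented test scores; their mean is 49.6

# ifelse() checks the condition element by element and picks a label for each
score_cat <- ifelse(scores < mean(scores), "below average", "at/above average")

print(data.frame(scores, score_cat))
print(table(score_cat))
```

Two of the five scores (34 and 44) fall below the mean, so they are labelled "below average"; the remaining three are labelled "at/above average".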
Next, we create another column to mark the absence or presence of numbers in the emails, and then visualize it.
email50 <- email50 %>% mutate(number_yn = ifelse(number=="none","n","y"))
email50 %>% ggplot(aes(x=number_yn)) + geom_bar()
Scatterplots are common for numeric data. We can see the relationship between math and science scores in the hsb2 dataset.
hsb2 %>% ggplot(aes(x=math,y=science)) + geom_point()
A multivariate scatterplot can be created by adding a third, categorical variable, which can be shown as color, shape, etc. You may often need to force the variable controlling color, shape, etc. to be read as a factor if it is not stored as one.
hsb2 %>% ggplot(aes(x=math,y=science,color=prog)) + geom_point()
Showing the relationship between the number of exclamation marks in emails and the number of characters, for spam and non-spam emails:
email50 %>% ggplot(aes(x=exclaim_mess,y=num_char,color=factor(spam))) + geom_point()
Types of studies
1. Observational study: Data are collected in a way that does not interfere with how the data arise; researchers only observe. Such a study can establish only a correlation or association between the response and explanatory variables. Causal relationships cannot be established, because the association could be caused by variables that were not controlled or measured. These are called confounding variables.
2. Experiment: Subjects are randomly assigned to different treatments, and hence causal relationships can be inferred.
The gapminder dataset has multiple variables, such as life expectancy and GDP, for different countries. This kind of data is purely observational, since people were not assigned to go and live in these countries.
data(gapminder)
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
Random sampling happens when subjects are being selected for a study. Because subjects are chosen at random, the sample is likely to be a good representation of the population, so results can be generalized to it.
Random assignment is how subjects are assigned to different treatments in a study. It happens only in experiments.
Random Sampling & Random Assignment.
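The difference between the two ideas can be sketched with base R's sample(). Everything in this example is hypothetical: the population labels, the sample size of 20, the two group names and the seed are assumptions made only for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

population <- paste0("person_", 1:100)   # hypothetical population of 100

# Random sampling: decide WHO takes part in the study
study_sample <- sample(population, size = 20)

# Random assignment: shuffle a balanced vector of labels so that each
# sampled subject ends up in a group purely by chance
group <- sample(rep(c("treatment", "control"), each = 10))

study <- data.frame(subject = study_sample, group = group)
print(table(study$group))
```

Sampling governs how far results generalize, while assignment governs whether causal conclusions are justified; a study can use either, both or neither.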
Labeling variables as response and explanatory does not actually establish any causal relationship or association. There may be other variables which contribute to the response, in conjunction with the explanatory variable. Simpson’s paradox illustrates the effect that omitting an explanatory variable can have on the measured association between the response and the explanatory variable.
We now analyze UCB data on gender discrimination in admissions. A dplyr way of computing frequency distributions is the count() function; in the output below it returns a tibble. tidyr's spread() takes a dataset and makes it wider: the distinct values of the ‘key’ column become new column names, and the cells are filled from the ‘value’ column, so that each remaining identifier (here Gender) gets one row.
# The original dataset is a 3D array, convert to dataframe
ucb_admits_df <- as.data.frame(UCBAdmissions)
ucb_admits <- ucb_admits_df[rep(row.names(ucb_admits_df),ucb_admits_df$Freq),1:3]
# Counting males and females admitted
ucb_counts <- ucb_admits %>% count(Gender,Admit)
# Spreading the data
ucb_counts %>% spread(key=Admit,value=n)
## # A tibble: 2 x 3
## Gender Admitted Rejected
## * <fct> <int> <int>
## 1 Male 1198 1493
## 2 Female 557 1278
# Making percentage admitted per gender
ucb_counts %>% spread(key=Admit,value=n) %>% mutate(Perc_Admit=Admitted/(Admitted+Rejected))
## # A tibble: 2 x 4
## Gender Admitted Rejected Perc_Admit
## <fct> <int> <int> <dbl>
## 1 Male 1198 1493 0.445
## 2 Female 557 1278 0.304
Simpson’s paradox comes into play here. It looks like a higher percentage of males than females is admitted, but if we break the same data down by department we see a different story.
ucb_admits %>% count(Dept,Admit,Gender) %>% spread(Admit,n) %>%
  mutate(Perc_Admit = Admitted/(Admitted+Rejected))
## # A tibble: 12 x 5
## Dept Gender Admitted Rejected Perc_Admit
## <fct> <fct> <int> <int> <dbl>
## 1 A Male 512 313 0.621
## 2 A Female 89 19 0.824
## 3 B Male 353 207 0.630
## 4 B Female 17 8 0.680
## 5 C Male 120 205 0.369
## 6 C Female 202 391 0.341
## 7 D Male 138 279 0.331
## 8 D Female 131 244 0.349
## 9 E Male 53 138 0.277
## 10 E Female 94 299 0.239
## 11 F Male 22 351 0.0590
## 12 F Female 24 317 0.0704
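In four of the six departments (A, B, D and F), females are admitted at a higher rate than males. The overall gap arises because more females applied to the departments which reject more applicants.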
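The same reversal can be checked with base R alone, as a cross-check on the dplyr/tidyr pipeline above. This sketch swaps in xtabs() and prop.table() for count() and spread(); UCBAdmissions ships with R's datasets package.

```r
adm <- as.data.frame(UCBAdmissions)   # columns: Admit, Gender, Dept, Freq

# Overall admission rate by gender (departments ignored)
overall      <- xtabs(Freq ~ Gender + Admit, data = adm)
overall_rate <- prop.table(overall, margin = 1)[, "Admitted"]
print(round(overall_rate, 3))

# Admission rate by gender within each department
by_dept   <- xtabs(Freq ~ Dept + Gender + Admit, data = adm)
dept_rate <- prop.table(by_dept, margin = c(1, 2))[, , "Admitted"]
print(round(dept_rate, 3))

# Number of departments in which females have the higher admission rate
n_female_higher <- sum(dept_rate[, "Female"] > dept_rate[, "Male"])
print(n_female_higher)
```

Aggregated over departments, males have the higher admission rate, yet within most individual departments the comparison goes the other way.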
Sampling Strategies
We need to sample because:
- Collecting data from a whole population is very resource intensive
- It is not possible to collect data from all possible people (some may be hard to locate). If these hard-to-locate people differ from the others in the variable being studied, you’ll get a biased sample
- Populations constantly change
Four common sampling methods:
1. Simple Random Sampling: Randomly select datapoints, such that each datapoint is equally likely to be selected.
2. Stratified Sampling: First divide all datapoints into homogeneous strata, and then randomly sample from within each stratum.
3. Cluster Sampling: Divide all observations into clusters, randomly sample clusters, and then take all observations from the selected clusters. Unlike strata, clusters are heterogeneous, and the clusters are similar enough to one another that we can get away with sampling only a few of them.
4. Multistage Sampling: Same as 3, except that we randomly sample observations from within the selected clusters.
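The sampling methods listed here can be sketched side by side in base R. The population below is entirely hypothetical (12 schools of 10 students each, spread over 3 regions), and the sample sizes and seed are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical population: schools act as clusters, regions as strata
pop <- data.frame(
  student = 1:120,
  school  = rep(paste0("school_", 1:12), each = 10),
  region  = rep(c("north", "centre", "south"), each = 40),
  stringsAsFactors = FALSE
)

# 1. Simple random sample: every student is equally likely to be picked
srs <- pop[sample(nrow(pop), 12), ]

# 2. Stratified sample: 4 students drawn from within each region
stratified <- do.call(rbind, lapply(split(pop, pop$region), function(d) {
  d[sample(nrow(d), 4), ]
}))

# 3. Cluster sample: pick 3 schools and keep all of their students
chosen_schools <- sample(unique(pop$school), 3)
cluster <- pop[pop$school %in% chosen_schools, ]

# 4. Multistage sample: within the chosen schools, draw 4 students each
multistage <- do.call(rbind, lapply(chosen_schools, function(s) {
  d <- pop[pop$school == s, ]
  d[sample(nrow(d), 4), ]
}))

print(table(stratified$region))
print(table(cluster$school))
print(table(multistage$school))
```

Only the stratified draw guarantees equal representation of every region; the simple random sample leaves that to chance.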
Imagine having to run a study across US counties. Since it might not be feasible to collect data from every county, we will sample.
# Loading information about county
data(county)
glimpse(county)
## Observations: 3,143
## Variables: 10
## $ name <fct> Autauga County, Baldwin County, Barbour County, ...
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala...
## $ pop2000 <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399...
## $ pop2010 <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947...
## $ fed_spend <dbl> 6.068095, 6.139862, 8.752158, 7.122016, 5.130910...
## $ poverty <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0, 19.5, ...
## $ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, ...
## $ multiunit <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7,...
## $ income <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916,...
## $ med_income <dbl> 53255, 50147, 33219, 41770, 45549, 31602, 30659,...
# Now remove DC since it is not a real state
county_no_dc <- county %>% filter(name!="District of Columbia") %>% droplevels()
# Now sample 150 counties
county_srs <- county_no_dc %>% sample_n(size=150)
glimpse(county_srs)
## Observations: 150
## Variables: 10
## $ name <fct> Stokes County, Pecos County, Mitchell County, Fa...
## $ state <fct> North Carolina, Texas, North Carolina, South Dak...
## $ pop2000 <dbl> 44711, 16809, 15687, 7453, 381751, 7857, 11211, ...
## $ pop2010 <dbl> 47401, 15507, 15579, 7094, 434972, 7266, 10419, ...
## $ fed_spend <dbl> 5.597392, 5.399433, 10.157327, 17.982239, 9.4220...
## $ poverty <dbl> 12.2, 19.9, 16.8, 17.4, 9.0, 8.2, 42.4, 23.0, 11...
## $ homeownership <dbl> 81.8, 70.0, 75.0, 64.8, 76.4, 82.6, 59.1, 72.8, ...
## $ multiunit <dbl> 4.4, 11.1, 8.6, 13.8, 14.3, 5.0, 12.5, 5.9, 14.5...
## $ income <dbl> 20852, 16717, 18804, 21574, 27196, 21419, 14190,...
## $ med_income <dbl> 42689, 38125, 32743, 35833, 57494, 48318, 20081,...
# Checking if all states got equal representation
county_srs %>% group_by(state) %>% count()
## # A tibble: 39 x 2
## # Groups: state [39]
## state n
## <fct> <int>
## 1 Alabama 2
## 2 Alaska 2
## 3 Arkansas 5
## 4 California 1
## 5 Colorado 4
## 6 Connecticut 1
## 7 Florida 3
## 8 Georgia 13
## 9 Idaho 1
## 10 Illinois 6
## # ... with 29 more rows
Some states are over-represented, and only 39 states appear in the sample at all. Hence we take a stratified sample with 3 counties from each of the 50 states.
county_strs <- county_no_dc %>% group_by(state) %>% sample_n(size=3)
county_strs %>% count(state)
## # A tibble: 50 x 2
## # Groups: state [50]
## state n
## <fct> <int>
## 1 Alabama 3
## 2 Alaska 3
## 3 Arizona 3
## 4 Arkansas 3
## 5 California 3
## 6 Colorado 3
## 7 Connecticut 3
## 8 Delaware 3
## 9 Florida 3
## 10 Georgia 3
## # ... with 40 more rows
Similarly, suppose we now want to sample 8 US states. We first use simple random sampling on the us_regions data.
# Making the US regions data, since it is not directly available
data(state)
us_regions <- as.data.frame(cbind(state.name,as.character(state.region)))
names(us_regions) <- c("state","region")
state_srs <- us_regions %>% sample_n(size=8)
state_srs %>% group_by(region) %>% count()
## # A tibble: 4 x 2
## # Groups: region [4]
## region n
## <fct> <int>
## 1 North Central 2
## 2 Northeast 2
## 3 South 3
## 4 West 1
Some regions are over-represented. Hence we take a stratified sample with 2 states from each region.
state_strs <- us_regions %>% group_by(region) %>% sample_n(size=2)
state_strs %>% group_by(region) %>% count()
## # A tibble: 4 x 2
## # Groups: region [4]
## region n
## <fct> <int>
## 1 North Central 2
## 2 Northeast 2
## 3 South 2
## 4 West 2
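Because none of the sampling calls above sets a seed, the sampled states and counties change on every run. The sketch below repeats the stratified draw of states with a fixed seed so that it is repeatable. It uses base R's split() and lapply() rather than dplyr's group_by() and sample_n(), and the seed value is an arbitrary choice.

```r
set.seed(2012)  # arbitrary; any fixed value makes the draw repeatable

# state.name and state.region ship with R's datasets package
us_regions <- data.frame(state  = state.name,
                         region = as.character(state.region),
                         stringsAsFactors = FALSE)

# Stratified sample: 2 states from each of the 4 regions
by_region  <- split(us_regions, us_regions$region)
state_strs <- do.call(rbind, lapply(by_region, function(d) {
  d[sample(nrow(d), 2), ]
}))

print(state_strs)
print(table(state_strs$region))
```

Re-running the block with the same seed gives the same 8 states, which makes results easier to check and share.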
There are 4 main principles of experimental design:
- Control: compare treatment of interest to a control group
- Randomize: randomly assign subjects to treatments
- Replicate: collect a sufficiently large sample within a study, or replicate the entire study
- Block: account for the potential impact of confounding variables by grouping subjects into blocks based on these variables, and then randomizing within each block to treatment groups
Suppose you are studying the effect of light and noise on exam performance, and you suspect that males and females may perform differently under the same light and noise conditions. In this case gender is a blocking variable. You first split the subjects into males and females, and then randomly assign the subjects within each block to the treatment groups, so that both genders are equally represented in every treatment. This way you have blocked the impact of being male or female.
In random sampling, you use stratifying to control for a variable. In random assignment, blocking is used to achieve the same.
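Blocked random assignment can be sketched in a few lines of base R. The subjects, the four light/noise treatment labels and the seed below are all hypothetical, chosen only to illustrate the idea.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical subjects: 8 males and 8 females
subjects <- data.frame(
  id     = 1:16,
  gender = rep(c("male", "female"), each = 8),
  stringsAsFactors = FALSE
)

# Four hypothetical treatment combinations of light and noise
treatments <- c("bright/quiet", "bright/noisy", "dim/quiet", "dim/noisy")

# Within each gender block, shuffle a balanced set of treatment labels
subjects$treatment <- NA_character_
for (g in unique(subjects$gender)) {
  in_block <- subjects$gender == g
  subjects$treatment[in_block] <-
    sample(rep(treatments, length.out = sum(in_block)))
}

# Every treatment receives the same number of males and females
print(table(subjects$gender, subjects$treatment))
```

Randomizing separately inside each block means that gender cannot be unevenly spread across treatments by chance.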
Case Study
The aim of this case study is to find out whether better-looking professors receive better evaluations from students. The data was collected at UT-Austin. The study is observational in nature. The data was gathered by randomly sampling class evaluations.
# Downloading and loading the data and getting a glimpse
download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")
glimpse(evals)
## Observations: 463
## Variables: 21
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank <fct> tenure track, tenure track, tenure track, tenure...
## $ ethnicity <fct> minority, minority, minority, minority, not mino...
## $ gender <fct> female, female, female, female, male, male, male...
## $ language <fct> english, english, english, english, english, eng...
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper,...
## $ cls_profs <fct> single, single, single, single, multiple, multip...
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi ...
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, ...
## $ pic_color <fct> color, color, color, color, color, color, color,...
# Getting more detailed information about the variables
str(evals)
## 'data.frame': 463 obs. of 21 variables:
## $ score : num 4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
## $ rank : Factor w/ 3 levels "teaching","tenure track",..: 2 2 2 2 3 3 3 3 3 3 ...
## $ ethnicity : Factor w/ 2 levels "minority","not minority": 1 1 1 1 2 2 2 2 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 2 1 ...
## $ language : Factor w/ 2 levels "english","non-english": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 36 36 36 36 59 59 59 51 51 40 ...
## $ cls_perc_eval: num 55.8 68.8 60.8 62.6 85 ...
## $ cls_did_eval : int 24 86 76 77 17 35 39 55 111 40 ...
## $ cls_students : int 43 125 125 123 20 40 44 55 195 46 ...
## $ cls_level : Factor w/ 2 levels "lower","upper": 2 2 2 2 2 2 2 2 2 2 ...
## $ cls_profs : Factor w/ 2 levels "multiple","single": 2 2 2 2 1 1 1 2 2 2 ...
## $ cls_credits : Factor w/ 2 levels "multi credit",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bty_f1lower : int 5 5 5 5 4 4 4 5 5 2 ...
## $ bty_f1upper : int 7 7 7 7 4 4 4 2 2 5 ...
## $ bty_f2upper : int 6 6 6 6 2 2 2 5 5 4 ...
## $ bty_m1lower : int 2 2 2 2 2 2 2 2 2 3 ...
## $ bty_m1upper : int 4 4 4 4 3 3 3 3 3 3 ...
## $ bty_m2upper : int 6 6 6 6 3 3 3 3 3 2 ...
## $ bty_avg : num 5 5 5 5 3 ...
## $ pic_outfit : Factor w/ 2 levels "formal","not formal": 2 2 2 2 2 2 2 2 2 2 ...
## $ pic_color : Factor w/ 2 levels "black&white",..: 2 2 2 2 2 2 2 2 2 2 ...
# Instead of using actual number of students in class, create categories for size of class
evals <- evals %>% mutate(cls_type = ifelse(cls_students<=18,"small",
ifelse(cls_students<60,"medium","large")))
# Creating a scatterplot to show association of beauty score vs score of each prof
evals %>% ggplot(aes(x=bty_avg,y=score)) + geom_point()
# Taking into account different class types
evals %>% ggplot(aes(x=bty_avg,y=score,color=cls_type)) + geom_point()
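As a possible alternative to the nested ifelse() used to build cls_type, base R's cut() can bin a numeric variable in one call. The sketch below compares the two on made-up class sizes. It assumes class sizes are whole numbers, which is what lets a break at 59 stand in for the `< 60` condition.

```r
cls_students <- c(10, 18, 19, 59, 60, 125)   # invented class sizes

# Nested ifelse(), mirroring the thresholds used for cls_type
nested <- ifelse(cls_students <= 18, "small",
                 ifelse(cls_students < 60, "medium", "large"))

# cut() with right-closed intervals: (-Inf, 18], (18, 59], (59, Inf]
binned <- cut(cls_students,
              breaks = c(-Inf, 18, 59, Inf),
              labels = c("small", "medium", "large"))

print(data.frame(cls_students, nested, binned))
```

Unlike ifelse(), cut() returns a factor whose levels follow the order of the breaks, so small, medium and large appear in that order in tables and plot legends rather than alphabetically.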