Use the package DSLabs (Data Science Labs)

Introduction

In this project, we were assigned to create a multivariable graph using any of the datasets included in “dslabs”. I chose the “admissions” dataset because I was curious about the factors behind college admission. Specifically, I want to know whether college/university admission can be affected by the variables gender and time.

With 12 observations, this data frame contains 4 variables:

1- major

2- gender

3- admitted

4- applicants

Reading data

library(readr)
## Warning: package 'readr' was built under R version 3.6.1
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.6.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(RColorBrewer)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.1
## -- Attaching packages ------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.1
## Warning: package 'tidyr' was built under R version 3.6.1
## Warning: package 'purrr' was built under R version 3.6.1
## Warning: package 'stringr' was built under R version 3.6.1
## Warning: package 'forcats' was built under R version 3.6.1
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag()     masks stats::lag()
library(highcharter)
## Warning: package 'highcharter' was built under R version 3.6.1
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(dslabs)
## Warning: package 'dslabs' was built under R version 3.6.1
## 
## Attaching package: 'dslabs'
## The following object is masked from 'package:highcharter':
## 
##     stars
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
##  [1] "make-admissions.R"                   
##  [2] "make-brca.R"                         
##  [3] "make-brexit_polls.R"                 
##  [4] "make-death_prob.R"                   
##  [5] "make-divorce_margarine.R"            
##  [6] "make-gapminder-rdas.R"               
##  [7] "make-greenhouse_gases.R"             
##  [8] "make-historic_co2.R"                 
##  [9] "make-mnist_27.R"                     
## [10] "make-movielens.R"                    
## [11] "make-murders-rda.R"                  
## [12] "make-na_example-rda.R"               
## [13] "make-nyc_regents_scores.R"           
## [14] "make-olive.R"                        
## [15] "make-outlier_example.R"              
## [16] "make-polls_2008.R"                   
## [17] "make-polls_us_election_2016.R"       
## [18] "make-reported_heights-rda.R"         
## [19] "make-research_funding_rates.R"       
## [20] "make-stars.R"                        
## [21] "make-temp_carbon.R"                  
## [22] "make-tissue-gene-expression.R"       
## [23] "make-trump_tweets.R"                 
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"
data("admissions")

Dimensions and Structure of the Dataset

dim(admissions)
## [1] 12  4
str(admissions)
## 'data.frame':    12 obs. of  4 variables:
##  $ major     : chr  "A" "B" "C" "D" ...
##  $ gender    : chr  "men" "men" "men" "men" ...
##  $ admitted  : num  62 63 37 33 28 6 82 68 34 35 ...
##  $ applicants: num  825 560 325 417 191 373 108 25 593 375 ...
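
Because str() truncates each column, it can help to print the whole table and count the rows per group. The following is a minimal sketch using base R only; the names rows_per_gender and rows_per_major are introduced here for illustration and are not part of the original analysis.

```r
library(dslabs)
data("admissions")

# str() shows only the first values of each column, so print all 12 rows,
# ordered by major and then gender
admissions[order(admissions$major, admissions$gender), ]

# Count the rows per gender and per major to confirm the layout
rows_per_gender <- table(admissions$gender)
rows_per_major <- table(admissions$major)
rows_per_gender
rows_per_major
```

Each row is one major/gender combination, so the table has one row for men and one for women in each major.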

Removing NA Values

At this step, we remove any rows with missing values using the complete.cases() function. We will call the new data frame “admissions1”.

admissions1 <- admissions[complete.cases(admissions),]
dim(admissions1)
## [1] 12  4

There are no missing values: the dimensions are unchanged (12 rows, 4 columns).
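
The same conclusion can be checked directly by counting missing values instead of comparing dimensions. This is a small sketch with base R; na_per_column and rows_kept are illustrative names, not objects from the original report.

```r
library(dslabs)
data("admissions")

# Count the missing values in each column; all zeros means nothing to drop
na_per_column <- colSums(is.na(admissions))
na_per_column

# complete.cases() returns one TRUE/FALSE per row (TRUE = no NA in that row),
# so the sum is the number of rows that would be kept
rows_kept <- sum(complete.cases(admissions))
rows_kept == nrow(admissions)
```

If every count is zero and rows_kept equals nrow(admissions), filtering with complete.cases() leaves the data frame unchanged.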

Scatterplot

g <- ggplot(data = admissions, mapping = aes(x = applicants, y = admitted, color = gender))
class(g)
## [1] "gg"     "ggplot"
g + geom_point() +
  labs(title = "Admitted vs Applicants by Gender",
       x = "Applicants", y = "Admitted", color = "Gender")
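
With only 12 points, it can be useful to see which major each point belongs to. The sketch below is one possible variation of the plot above, not the plot used in this report: it adds a geom_text() layer that labels each point with its major. The object name p and the vjust offset are arbitrary choices.

```r
library(dslabs)
library(ggplot2)
data("admissions")

# Same mapping as the scatterplot above, plus a text label for the major
p <- ggplot(admissions, aes(x = applicants, y = admitted, color = gender)) +
  geom_point() +
  # vjust nudges the label above the point; show.legend = FALSE keeps the
  # text layer from adding a letter glyph to the color legend
  geom_text(aes(label = major), vjust = -0.8, show.legend = FALSE) +
  labs(title = "Admitted vs Applicants by Gender",
       x = "Applicants", y = "Admitted", color = "Gender")
p
```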

Use a custom theme

You can change the entire appearance of a plot by using a custom theme. The ggthemes library contains many custom themes and scales for ggplot.

theme_economist: a theme based on the plots in The Economist magazine

g <- ggplot(data = admissions, mapping = aes(x = applicants, y = admitted, color = gender)) +
  geom_point()
# Use the Economist theme and color scale
g + theme_economist() +
  scale_color_economist() +
  ggtitle("Admitted vs Applicants by Gender")

Analysis

The results show that more women than men were admitted within the first 200 applications. Around 400 applications, women and men had almost the same chance of being admitted. Most of the applicants were also selected at that stage. After 600 applications, admission was low. The time factor could therefore affect the chance of being admitted to an institution. In my opinion, the admission committee may have given women a better chance when applications were submitted early. On the other hand, women may have applied earlier than men. The admission committee may have used a rule such as “first come, first served”. However, this dataset is quite limited. I did not find the year of the study, and I was unable to tell whether the admissions were for an undergraduate or a graduate school. Therefore, we cannot really use these outcomes unless we get a more complete dataset.
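
One way to cross-check the reading of the plot is to summarize the table by gender. The sketch below assumes that admitted is a percentage of applicants, as described on the dslabs help page for this dataset (?admissions), and that applicants is the total number of applicants for each major/gender row. The column names n_admitted, total_applicants, total_admitted, and pct_admitted are introduced here for illustration.

```r
library(dslabs)
library(dplyr)
data("admissions")

by_gender <- admissions %>%
  # Assumption: `admitted` is a percent, so convert it to an approximate count
  mutate(n_admitted = admitted / 100 * applicants) %>%
  group_by(gender) %>%
  summarize(
    total_applicants = sum(applicants),
    total_admitted = round(sum(n_admitted)),
    # Overall rate, weighted by the number of applicants in each major
    pct_admitted = 100 * sum(n_admitted) / sum(applicants)
  )
by_gender
```

Weighting by applicants matters here because the majors differ widely in size, so a plain average of the per-major percentages would give small majors the same influence as large ones.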

Conclusion

In my opinion, men and women can probably increase their chances of admission if they apply early.