Week 8 Assignment Using DSLabs Datasets

Introduction

For this assignment, I will take the Admissions data set from the dslabs package and create a new multivariable graph. I will use dplyr to manipulate the data.

- THINGS TO KEEP IN MIND -

Comments describing all chunks of code
Meaningful labels for x- and y-axes
Meaningful title
A theme for the graph (you must change the generic ggplot style)
Colors for a third variable, with a legend

I MAY create a scatterplot, heatmap, or other plot appropriate for continuous variables. NO BAR GRAPH. Can incorporate elements from prior weeks’ materials, as well.

Be sure to describe in a paragraph what dataset I have used and document how I have created my graph.

Load Libraries

library(dslabs)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## 
## Attaching package: 'highcharter'
## 
## The following object is masked from 'package:dslabs':
## 
##     stars

data(package = "dslabs")
list.files(system.file("script", package = "dslabs"))

##  [1] "make-admissions.R"                   
##  [2] "make-brca.R"                         
##  [3] "make-brexit_polls.R"                 
##  [4] "make-calificaciones.R"               
##  [5] "make-death_prob.R"                   
##  [6] "make-divorce_margarine.R"            
##  [7] "make-gapminder-rdas.R"               
##  [8] "make-greenhouse_gases.R"             
##  [9] "make-historic_co2.R"                 
## [10] "make-mice_weights.R"                 
## [11] "make-mnist_127.R"                    
## [12] "make-mnist_27.R"                     
## [13] "make-movielens.R"                    
## [14] "make-murders-rda.R"                  
## [15] "make-na_example-rda.R"               
## [16] "make-nyc_regents_scores.R"           
## [17] "make-olive.R"                        
## [18] "make-outlier_example.R"              
## [19] "make-polls_2008.R"                   
## [20] "make-polls_us_election_2016.R"       
## [21] "make-pr_death_counts.R"              
## [22] "make-reported_heights-rda.R"         
## [23] "make-research_funding_rates.R"       
## [24] "make-stars.R"                        
## [25] "make-temp_carbon.R"                  
## [26] "make-tissue-gene-expression.R"       
## [27] "make-trump_tweets.R"                 
## [28] "make-weekly_us_contagious_diseases.R"
## [29] "save-gapminder-example-csv.R"

Load Data

data("admissions")

MY MISSION

I am going to use the admissions data set from the dslabs package. This data set contains information about the admission of students to the University of California, Berkeley. It is said to show gender bias, so today I will check whether that is the case and where in the data the bias may appear. The applications are broken down by gender and major.

Variables

Major: Major of student from A - F.
Gender: Men / Women
Admitted: Percentage of students admitted.
Applicants: Number of students who applied.
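
A quick way to confirm the variables described above is to inspect the data frame directly. This is only a minimal sketch; it assumes the dslabs package is installed and uses nothing beyond base R inspection functions.

```r
library(dslabs)

data("admissions")

# Column names and types of the data set
str(admissions)

# Count the rows for each major/gender combination
table(admissions$major, admissions$gender)
```

The table should show how the rows are spread across majors and genders, which is useful to know before computing anything per major.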

Data Manipulation

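I am going to use dplyr to manipulate the data. I will create a new variable called admitted_students, which calculates the number of students who were admitted, and a second variable, not_admitted_students, for the students who were not. I will also rename admitted to admitted_percentage so the name says what it holds.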

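admissions1 <- admissions |>
  # Convert the admission percentage into a head count
  mutate(admitted_students = round(admitted/100 * applicants)) |>
  # Everyone who applied but was not admitted
  mutate(not_admitted_students = applicants - admitted_students) |>
  # Make it clear that this column is a percentage
  rename(admitted_percentage = admitted)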
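
Because the head counts are rebuilt from rounded percentages, it may be worth checking that the derived columns are consistent. The sketch below repeats the same calculation on a fresh copy of the data; the names `check`, `recovered_pct`, and `pct_diff` are hypothetical and only used for illustration.

```r
library(dslabs)
library(dplyr)

data("admissions")

# Repeat the calculation from the chunk above on a separate copy
check <- admissions |>
  mutate(admitted_students = round(admitted / 100 * applicants),
         not_admitted_students = applicants - admitted_students)

# The two derived counts should add back up to applicants in every row
stopifnot(all(check$admitted_students + check$not_admitted_students ==
                check$applicants))

# Rounding to whole students means the percentage recovered from the
# counts can differ a little from the original percentage
check |>
  mutate(recovered_pct = admitted_students / applicants * 100,
         pct_diff = recovered_pct - admitted) |>
  select(major, gender, admitted, recovered_pct, pct_diff)
```

Rows with few applicants are where the rounding difference is likely to be largest, since one student is a bigger share of the total.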

I want to see the correlation between admitted_students and not_admitted_students, so I will use the cor() function to calculate it. I want to know whether there is a relationship between the number of students who were admitted and the number who were not. There most likely won’t be much of a relationship, since the number of people who applied is not the same for each major, but it doesn’t hurt to try!

cor(admissions1$admitted_students, admissions1$not_admitted_students)

## [1] 0.2728222

The correlation is about 0.273. That is not very strong, but I still want to graph the data and see if I can find anything.
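
Since the data set has only a handful of rows, a single coefficient may not say much on its own. One option, sketched here with base R's cor.test(), is to look at the confidence interval around the estimate as well. The object names `counts` and `ct` are hypothetical.

```r
library(dslabs)
library(dplyr)

data("admissions")

# Same derived counts as in the data manipulation step
counts <- admissions |>
  mutate(admitted_students = round(admitted / 100 * applicants),
         not_admitted_students = applicants - admitted_students)

# Pearson correlation test between the two counts
ct <- cor.test(counts$admitted_students, counts$not_admitted_students)

ct$estimate  # the same coefficient that cor() returns
ct$conf.int  # 95% confidence interval for the correlation
```

A wide interval would be another sign that the coefficient alone should not be over-interpreted.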

Graph

I will now make my graph. I will use the highcharter package to create a scatterplot and make it interactive! I will use not_admitted_students and admitted_students as the x- and y-axes. The major will be the color of the points. I’ll also include other variables in the tooltip, such as the admission percentage and the number of applicants, to explain exactly what we are seeing. Both axes are numeric counts, which is why I will use a scatterplot to show the data. I’ll use the Set1 color palette from RColorBrewer to color the points and the legend in a nice pattern.

# Set a color palette for the major variable (six majors, A to F)
cols <- brewer.pal(6, "Set1")

# Interactive scatterplot: denied students on x, admitted students on y,
# one colored series per major
highchart() |>
  hc_add_series(data = admissions1, 
                type = "scatter", 
                hcaes(x = not_admitted_students, 
                      y = admitted_students, 
                      group = major)) |>
  hc_colors(cols) |>
  # Axis ranges and titles (used ChatAI to set min & max values)
  hc_xAxis(min = 0, max = 400,
           title = list(text = "Denied students")) |>
  hc_yAxis(min = 0, max = 550,
           title = list(text = "Admitted students")) |>
  hc_title(text = "Admitted vs Denied Students: Investigation of Major Bias") |>
  # Legend along the bottom left
  hc_legend(align = "left", 
            verticalAlign = "bottom", 
            layout = "horizontal", 
            enabled = TRUE) |>
  # Tooltip with the extra variables (used Chat AI to figure out how to put text)
  hc_tooltip(shared = TRUE,
             borderColor = "black",
             pointFormat = "Applicants: {point.applicants},<br> Admitted %: {point.admitted_percentage}, <br> Admitted students: {point.y}, <br> Denied students: {point.x}, <br> Gender: {point.gender}")
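
The checklist at the top asks for a theme other than the default style. One possible way to do that in highcharter is hc_add_theme() with one of the built-in themes. The sketch below uses hc_theme_flat() purely as an example choice, and `themed_chart` is a hypothetical name; it rebuilds a simplified version of the scatterplot so that it can run on its own.

```r
library(dslabs)
library(dplyr)
library(highcharter)

data("admissions")

# Simplified version of the scatterplot, with a built-in theme applied
themed_chart <- admissions |>
  mutate(admitted_students = round(admitted / 100 * applicants),
         not_admitted_students = applicants - admitted_students) |>
  hchart("scatter",
         hcaes(x = not_admitted_students,
               y = admitted_students,
               group = major)) |>
  hc_add_theme(hc_theme_flat())

themed_chart
```

The same hc_add_theme() call could be added to the end of the full chart's pipe; any other built-in theme would work the same way.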

Conclusion

Here I will list my findings from the graph.

- Major A:

Women had the higher admission percentage in this major.

Far more men than women were admitted, but far more men were also denied, because many more men applied.

Admitted men (512) > women applicants (108): this is why there are more men in the major.

- Major B:

Both genders had almost the same chance of admission, with women a tiny bit higher.

Again, far more men than women were admitted, but far more men were also denied, because many more men applied.

Admitted men (353) > women applicants (25): this is why there are more men in the major.

- Major C:

Both genders had almost the same chance of admission. Men had the slightly higher admission percentage, although more women were admitted in total.

Admitted women (202) < men applicants (325)

Women applicants (593) > men applicants (325)

I noticed that more women applied and more women were denied! Fewer men applied for this major, which is why there are more women in it.

From what I see in the graph so far, the admission percentage for each gender does not exactly track the number of admitted students. Taking this major as an example, more women were admitted, but their admission percentage is lower because many more women than men applied. Men have a higher admission percentage in this major, but there were fewer applications from men, so fewer men are in this major.

- Major D:

There is not a huge difference between men and women in this major. Slightly more men than women were admitted, and slightly more men were denied.

- Major E:

More women than men were admitted. There were about 200 more women applicants than men.

I do not see any biased or unfair admissions with regard to gender.

- Major F:

Here is something I found interesting:

The two genders are almost the same in this major. I noticed that more men than women applied, and more men were denied. The number of admitted students is still almost the same for both genders, which suggests that this major favored women slightly in terms of admission.

In conclusion, from looking at this data, I can’t really say that there is a huge bias in admissions based on someone’s gender at UC Berkeley. I can see that other variables are affecting the admission of students. For example, the number of applicants is different for each major, so the number of admitted students will differ depending on how many people apply to each major.
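
The point about applicant numbers can be checked by comparing the pooled admission rate for each gender with the rates inside each major. This is a rough sketch rather than a formal test of bias; the names `counts`, `overall`, `overall_rate`, and `by_major` are hypothetical, and the pooled rate relies on the rounded head counts from earlier.

```r
library(dslabs)
library(dplyr)
library(tidyr)

data("admissions")

# Rebuild the admitted head count from the percentage
counts <- admissions |>
  mutate(admitted_students = round(admitted / 100 * applicants))

# Admission rate by gender with all six majors pooled together
overall <- counts |>
  group_by(gender) |>
  summarize(applicants = sum(applicants),
            admitted_students = sum(admitted_students),
            overall_rate = admitted_students / applicants * 100)

# Admission percentage by gender within each major, side by side
by_major <- admissions |>
  select(major, gender, admitted) |>
  pivot_wider(names_from = gender, values_from = admitted)

overall
by_major
```

If the pooled rates look further apart than the per-major rates, that gap comes from which majors each gender applied to, not from what happened inside any one major.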

Emilio Sanchez San Martin

2024-03-21
