For this assignment, I will take the admissions data set from the dslabs package and create a new multivariable graph. I will use dplyr to manipulate the data. The graph must include:
Comments describing all chunks of code
Meaningful labels for the x- and y-axes
A meaningful title
A theme for the graph (the generic ggplot style must be changed)
Colors for a third variable, with a legend
I may create a scatterplot, heatmap, or other plot appropriate for continuous variables, but not a bar graph. I can also incorporate elements from prior weeks' materials.
I must describe in a paragraph which data set I used and document how I created my graph.
library(dslabs)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'highcharter'
##
## The following object is masked from 'package:dslabs':
##
## stars
data(package = "dslabs")
list.files(system.file("script", package = "dslabs"))
## [1] "make-admissions.R"
## [2] "make-brca.R"
## [3] "make-brexit_polls.R"
## [4] "make-calificaciones.R"
## [5] "make-death_prob.R"
## [6] "make-divorce_margarine.R"
## [7] "make-gapminder-rdas.R"
## [8] "make-greenhouse_gases.R"
## [9] "make-historic_co2.R"
## [10] "make-mice_weights.R"
## [11] "make-mnist_127.R"
## [12] "make-mnist_27.R"
## [13] "make-movielens.R"
## [14] "make-murders-rda.R"
## [15] "make-na_example-rda.R"
## [16] "make-nyc_regents_scores.R"
## [17] "make-olive.R"
## [18] "make-outlier_example.R"
## [19] "make-polls_2008.R"
## [20] "make-polls_us_election_2016.R"
## [21] "make-pr_death_counts.R"
## [22] "make-reported_heights-rda.R"
## [23] "make-research_funding_rates.R"
## [24] "make-stars.R"
## [25] "make-temp_carbon.R"
## [26] "make-tissue-gene-expression.R"
## [27] "make-trump_tweets.R"
## [28] "make-weekly_us_contagious_diseases.R"
## [29] "save-gapminder-example-csv.R"
data("admissions")
I am going to use the admissions data set from the dslabs package. It contains information about student admissions to the University of California, Berkeley. The data set is presented as an example of gender bias in admissions, so I will check whether that is the case and where in the data a gender bias might show up. The applications are broken down by gender and major.
Major: Major of the student, from A to F.
Gender: Men / Women.
Admitted: Percentage of students admitted.
Applicants: Number of students who applied.
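I am going to use dplyr to manipulate the data. I will create a new variable called admitted_students that holds the number of students who were admitted, a second variable called not_admitted_students for those who were denied, and rename admitted to admitted_percentage so it is clear that it is a percentage.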
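# Derive admitted and denied counts from the admission percentage
admissions1 <- admissions |>
  # Number admitted = percentage admitted (as a fraction) times applicants
  mutate(admitted_students = round(admitted / 100 * applicants)) |>
  # Number denied = everyone who applied but was not admitted
  mutate(not_admitted_students = applicants - admitted_students) |>
  # Make it clear that the original column is a percentage
  rename(admitted_percentage = admitted)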
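As a quick check on the new columns, the sketch below rebuilds them and confirms that the admitted and denied counts add back up to the number of applicants. The name counts_ok is only an illustration and is not part of the original analysis.

```r
library(dslabs)
library(dplyr)

data("admissions")

# Rebuild the derived columns the same way as in the chunk above
admissions1 <- admissions |>
  mutate(admitted_students = round(admitted / 100 * applicants),
         not_admitted_students = applicants - admitted_students) |>
  rename(admitted_percentage = admitted)

# Sanity check: admitted + denied should equal applicants in every row
counts_ok <- all(
  admissions1$admitted_students + admissions1$not_admitted_students ==
    admissions1$applicants
)
counts_ok

# Look at a single row to follow the arithmetic by hand
admissions1 |>
  filter(major == "A", gender == "men")
```

Because the percentages in the data are already rounded, the derived counts are estimates and may differ slightly from the true counts.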
I want to see the correlation between admitted_students and not_admitted_students, so I will use the cor() function to calculate it. My goal is to see whether there is a relationship between the number of students who were admitted and the number who were not. There most likely won't be a strong one, since the number of applicants is not the same for each major, but it doesn't hurt to try!
cor(admissions1$admitted_students, admissions1$not_admitted_students)
## [1] 0.2728222
The correlation is about 0.273. That is not very strong, but I still want to graph the data and see if I can find anything.
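To make clear what cor() computes here, the sketch below writes the Pearson correlation out by hand (covariance divided by the product of the two standard deviations) and compares it with the built-in result. The names x, y, r_manual, and r_builtin are illustrative only.

```r
library(dslabs)
library(dplyr)

data("admissions")

# Same derived counts as in the analysis
admissions1 <- admissions |>
  mutate(admitted_students = round(admitted / 100 * applicants),
         not_admitted_students = applicants - admitted_students)

x <- admissions1$admitted_students
y <- admissions1$not_admitted_students

# Pearson correlation written out: covariance over the product of the SDs
r_manual <- cov(x, y) / (sd(x) * sd(y))

# cor() uses the Pearson method by default, so the two values should agree
r_builtin <- cor(x, y)

c(manual = r_manual, builtin = r_builtin)
```

With only 12 rows (6 majors for each of 2 genders), a single large major can move this value considerably, so it is best read as a rough summary.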
I will now make my graph. I will use the highcharter package to create an interactive scatterplot. The x-axis shows not_admitted_students and the y-axis shows admitted_students, and the color of each point shows the major. The tooltip also includes the admission percentage, the number of applicants, and the gender, to give a better understanding of exactly what each point represents. The student counts are numeric, so a scatterplot is a suitable way to show them. I'll use the Set1 color palette from RColorBrewer so that the majors in the legend have clearly distinct colors.
# Set a color palette for the major variable (six majors, six colors)
cols <- brewer.pal(6, "Set1")
# Build an interactive scatterplot of admitted vs. denied students, grouped by major
highchart() |>
  hc_add_series(data = admissions1,
                type = "scatter",
                hcaes(x = not_admitted_students,
                      y = admitted_students,
                      group = major)) |>
  # One color per major
  hc_colors(cols) |>
  # Axis titles and ranges (used ChatAI to set min & max values)
  hc_xAxis(title = list(text = "Denied students"), min = 0, max = 400) |>
  hc_yAxis(title = list(text = "Admitted students"), min = 0, max = 550) |>
  hc_title(text = "Admitted vs Denied Students: Investigation of Major Bias") |>
  # Place the legend at the bottom left
  hc_legend(align = "left",
            verticalAlign = "bottom",
            layout = "horizontal",
            enabled = TRUE) |>
  # Show extra variables in the tooltip (used Chat AI to figure out how to put text)
  hc_tooltip(shared = TRUE,
             borderColor = "black",
             pointFormat = paste0("Applicants: {point.applicants},<br> ",
                                  "Admitted %: {point.admitted_percentage}, <br> ",
                                  "Admitted students: {point.y}, <br> ",
                                  "Denied students: {point.x}, <br> ",
                                  "Gender: {point.gender}"))
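The assignment also asks for a theme. One possible way to add one in highcharter is hc_add_theme() with a built-in theme. The sketch below uses the hchart() shortcut and hc_theme_538(); the choice of theme and the name themed_chart are assumptions for illustration and are not part of the original chart.

```r
library(dslabs)
library(dplyr)
library(highcharter)

data("admissions")

# Same derived counts as in the analysis
admissions1 <- admissions |>
  mutate(admitted_students = round(admitted / 100 * applicants),
         not_admitted_students = applicants - admitted_students)

# Scatterplot grouped by major, built with the hchart() shortcut
themed_chart <- hchart(admissions1, "scatter",
                       hcaes(x = not_admitted_students,
                             y = admitted_students,
                             group = major)) |>
  # hc_theme_538() is one of several built-in themes; any other could be swapped in
  hc_add_theme(hc_theme_538())

themed_chart
```

A theme may set its own series colors, so if a custom palette such as Set1 is also wanted, it is worth checking the rendered chart to confirm which colors end up being used.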
Major A: Women had a better chance of being admitted to this major.
Far more men than women were admitted, but far more men were also denied, because many more men applied.
Admitted men (512) > women applicants (108), which is why there are more men in this major.
Major B: Both genders had almost the same chance of admission, with women slightly higher.
Far more men than women were admitted, but far more men were also denied, because many more men applied.
Admitted men (353) > women applicants (25), which is why there are more men in this major.
Major C: Both genders had almost the same chance of admission. More women than men were admitted.
Admitted women (202) < men applicants (325)
Women applicants (593) > men applicants (325)
More women were denied, but more women also applied. Fewer men applied to this major, which is why women make up more of the admitted students.
From what I see in these graphs, the admission percentage for each gender does not exactly match the number of admitted students. Taking this major as an example, more women were admitted, but their admission percentage is lower because many more women than men applied. Men have a higher admission percentage in this major, but fewer of them applied, so fewer men end up in it.
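The difference between admission percentage and number admitted is easier to see with both genders side by side for each major. The sketch below reshapes the original data with tidyr's pivot_wider(); the names by_major and rate_gap_women_minus_men are made up for this illustration.

```r
library(dslabs)
library(dplyr)
library(tidyr)

data("admissions")

# One row per major, with percentage and applicant columns for each gender
by_major <- admissions |>
  pivot_wider(names_from = gender,
              values_from = c(admitted, applicants)) |>
  # Positive values mean women had the higher admission percentage
  mutate(rate_gap_women_minus_men = admitted_women - admitted_men)

by_major
```

Reading across a row shows at a glance whether a gap in admitted students comes from the admission percentage or simply from how many people applied.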
Major D: There is not a big difference between men and women in this major. Slightly more men than women were both admitted and denied.
Major E: More women than men were admitted. There were about 200 more women applicants than men.
I do not see any biased or unfair admission with regard to gender.
Here is something I found interesting:
Major F: The number of admitted students is almost the same for both genders. More men than women were denied, and fewer women than men applied. That is why the numbers of admitted students are so close, and it also suggests that this major slightly favored women in terms of admission.
In conclusion, from looking at this data, I can't really say that there is a large gender bias in admissions at UC Berkeley. Other variables appear to affect how many students are admitted. For example, the number of applicants differs from major to major, so the number of admitted students also differs depending on how many people apply to each major.
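Since the number of applicants per major matters so much, it can help to compare the per-major view with the totals pooled across all majors. The sketch below sums applicants and estimated admitted students by gender; the name overall and its column names are illustrative, and the admitted counts are the same rounded estimates used earlier.

```r
library(dslabs)
library(dplyr)

data("admissions")

# Pool all majors together and compute one admission rate per gender
overall <- admissions |>
  mutate(admitted_students = round(admitted / 100 * applicants)) |>
  group_by(gender) |>
  summarize(total_applicants = sum(applicants),
            total_admitted = sum(admitted_students),
            overall_rate = 100 * total_admitted / total_applicants)

overall
```

If the pooled rates differ more than the per-major rates do, that points to men and women applying to different majors in different numbers, not necessarily to a difference in how each major treated them.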