DS Labs Assignment

Author

Gabriel Castillo Lopez

Loading tidyverse and dslabs

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs) # Loading in the two desired packages
Warning: package 'dslabs' was built under R version 4.3.3

Searching for a good dataset

list.files(system.file("script", package = "dslabs"))
 [1] "make-admissions.R"                   
 [2] "make-brca.R"                         
 [3] "make-brexit_polls.R"                 
 [4] "make-calificaciones.R"               
 [5] "make-death_prob.R"                   
 [6] "make-divorce_margarine.R"            
 [7] "make-gapminder-rdas.R"               
 [8] "make-greenhouse_gases.R"             
 [9] "make-historic_co2.R"                 
[10] "make-mice_weights.R"                 
[11] "make-mnist_127.R"                    
[12] "make-mnist_27.R"                     
[13] "make-movielens.R"                    
[14] "make-murders-rda.R"                  
[15] "make-na_example-rda.R"               
[16] "make-nyc_regents_scores.R"           
[17] "make-olive.R"                        
[18] "make-outlier_example.R"              
[19] "make-polls_2008.R"                   
[20] "make-polls_us_election_2016.R"       
[21] "make-pr_death_counts.R"              
[22] "make-reported_heights-rda.R"         
[23] "make-research_funding_rates.R"       
[24] "make-stars.R"                        
[25] "make-temp_carbon.R"                  
[26] "make-tissue-gene-expression.R"       
[27] "make-trump_tweets.R"                 
[28] "make-weekly_us_contagious_diseases.R"
[29] "save-gapminder-example-csv.R"        
data("admissions") # The dataset I will be using from dslabs
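The listing above shows the scripts that build the datasets rather than the datasets themselves. As a minimal sketch, assuming dslabs is installed, the dataset names can also be listed directly and the chosen dataset inspected before plotting. The variable name `dataset_names` is only illustrative.

```r
library(dslabs)

# List the datasets shipped with dslabs (names only)
dataset_names <- data(package = "dslabs")$results[, "Item"]
print(dataset_names)

# Load the chosen dataset and look at its structure
data("admissions")
str(admissions)   # column names and types
head(admissions)  # first few rows
```

Running `?admissions` in the console opens the help page, which describes what each column means.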

Making the graph

admission_rate <- admissions[-c(8), ] |> # remove row 8, which was distorting the data
  mutate(Acceptance_Percent = admitted / applicants) # acceptance percent for each major/gender combination
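One thing worth checking here: the dslabs help page for `admissions` describes `admitted` as the percent of students admitted, not a count of admitted students. If that reading is right, dividing by `applicants` is not needed to get a rate, and a row with few applicants would produce a ratio above 1. The sketch below is an alternative under that assumption, not the method used in this assignment. The names `admission_rate_alt`, `Acceptance_Rate`, and `Admitted_Count` are made up for illustration.

```r
library(dplyr)
library(dslabs)

data("admissions")

# Assumption: `admitted` is already a percentage (0 to 100), per ?admissions
admission_rate_alt <- admissions |>
  mutate(
    Acceptance_Rate = admitted / 100,                      # proportion admitted, 0 to 1
    Admitted_Count  = round(Acceptance_Rate * applicants)  # approximate number admitted
  )

admission_rate_alt
```

Under this assumption every row stays within 0 and 1, so no row would need to be removed.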

Adding two new libraries

library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.3.3
library(ggthemes) # loading in the library 
Warning: package 'ggthemes' was built under R version 4.3.3

My Final Graph

admission_rate |>
  ggplot(aes(x = applicants, y = Acceptance_Percent, label = gender)) + # map the x and y variables; label each point by gender
  geom_point(aes(color = major), size = 3) + # scatter plot, with each Berkeley major shown in its own color
  geom_smooth(method = lm, se = FALSE, color = "black", lty = 2, linewidth = 0.3) + # thin dashed linear trend line
  theme_solarized() + # solarized theme from ggthemes
  geom_text_repel(nudge_x = 0.005) + # ggrepel labels points as women or men without overlapping
  ylim(0, 1) + # cap the y-axis at 1
  xlab("Total of Applicants") +
  ylab("Acceptance Percentage in Decimals") +
  ggtitle("The Berkeley Acceptance Percentage of Different Majors") + # axis labels and plot title
  scale_color_discrete(name = "Different Majors by Letters") # legend title for the majors
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?
Warning: Removed 19 rows containing missing values or values outside the scale range
(`geom_smooth()`).
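The two warnings above can likely be avoided with small changes. The sketch below is a hedged variant of the same plot, not a replacement for it: the `label` aesthetic is moved into `geom_text_repel()` so that `geom_smooth()` never receives it, and `coord_cartesian()` zooms the y-axis instead of `ylim()` so that no part of the trend line is dropped. The theme and titles are left out to keep the example short, and the object name `p` is illustrative.

```r
library(tidyverse)
library(dslabs)
library(ggrepel)

data("admissions")

# Same data preparation as in the assignment
admission_rate <- admissions[-8, ] |>
  mutate(Acceptance_Percent = admitted / applicants)

p <- admission_rate |>
  ggplot(aes(x = applicants, y = Acceptance_Percent)) +
  geom_point(aes(color = major), size = 3) +
  # formula is given explicitly, which also silences the "using formula" message
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "black", lty = 2, linewidth = 0.3) +
  # label is mapped only in the layer that uses it
  geom_text_repel(aes(label = gender)) +
  # zoom to 0-1 without discarding values outside that range
  coord_cartesian(ylim = c(0, 1))

p
```

The design choice is that `ylim()` sets scale limits and removes out-of-range values before drawing, while `coord_cartesian()` only changes the visible window.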

Paragraph:

I chose the Berkeley admissions dataset because I wanted to know whether the acceptance rate in each major differed between men and women. I used the mutate function to create a new variable, Acceptance_Percent, by dividing admitted by applicants. This gives a decimal that makes the graph easier to read: for example, 0.1 corresponds to a 10% acceptance rate. Building the graph with ggplot was simple, since all it needed was applicants on the x-axis and Acceptance_Percent on the y-axis. I applied a new theme from ggthemes, which looked very cool. Because some points were packed closely together, I used ggrepel to label each one as women or men. I used geom_smooth to add a dashed trend line, since I like to see a trend in the graphs I make. I used xlab, ylab, and ggtitle to label the x- and y-axes and give the whole graph a proper title. What I learned from my graph is that the more applicants a major had, the harder it was to be admitted. The graph also shows how many women applied to each major compared with men, and vice versa. The main question was answered as well: the graph suggests some bias between men and women depending on the major at Berkeley, because some majors had a higher acceptance percentage for men and others had a higher one for women. It was a very cool dataset to work with, and I love all the new ways I can customize my graphs in the future.
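The men versus women comparison described above can also be checked with a small table. This is a minimal sketch, assuming the `gender` column holds the values "men" and "women" (as the point labels in the graph suggest) and using the `admitted` column as published in dslabs. The names `by_major` and `difference` are illustrative.

```r
library(tidyverse)
library(dslabs)

data("admissions")

# One row per major, with the admitted value for men and women side by side
by_major <- admissions |>
  select(major, gender, admitted) |>
  pivot_wider(names_from = gender, values_from = admitted) |>
  mutate(difference = women - men)  # positive means the value is higher for women

by_major
```

A positive `difference` marks a major where the admitted value is higher for women, and a negative one marks a major where it is higher for men.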