My name is Cindy Lopez and I’m aspiring to enter a career in data science/data analytics. This space is where I document what I’ve been doing to educate myself, whether it’s learning coding languages, practicing data analysis with data sets, or reading up on statistics. Here I’ll post projects I’m working on, books I’m reading, and data sets that I’m practicing with.
I’m looking for internships or programs to begin working in data analytics, data science, or just data entry jobs. Feel free to contact me at lopez.cindy345@gmail.com.
Amherst College: BA in Mathematics (3.9 GPA)
Graduated in May 2020 in the midst of the COVID-19 pandemic
Whitney M. Young Magnet High School: Graduated June 2016
Udemy Courses
I have completed the following courses on the Udemy learning site:
IT Asset Management Analyst August 2019 - June 2020
IT Asset Management Intern June 2019 - August 2019
IT Hardware Asset Intern September 2015 - August 2016
Math Fellow September 2016 - August 2019
Math Grader September 2016 - August 2019
Math and Statistics Researcher Summer 2018
Catering Assistant September 2017 - May 2018
To study and practice R, I have been reading Rafael A. Irizarry’s Introduction to Data Science: Data Analysis and Prediction Algorithms with R to educate myself on some common data analysis methods.
The data set I work with is from the palmer penguins website provided by Dr. Kristen Gorman, Dr. Allison Horst, and Dr. Alison Hill.
library(ggthemes)
library(tidyverse)
library(ggridges)
library(gridExtra)
library(palmerpenguins)
library(tidytuesdayR)
tuesdata <- tidytuesdayR::tt_load(2020, week = 31)
The data set includes 8 variables, detailing 3 species of penguins on 3 different islands. I have eliminated NAs from the data.
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge~ 39.1 18.7 181 3750 male
## 2 Adelie Torge~ 39.5 17.4 186 3800 fema~
## 3 Adelie Torge~ 40.3 18 195 3250 fema~
## 4 Adelie Torge~ 36.7 19.3 193 3450 fema~
## 5 Adelie Torge~ 39.3 20.6 190 3650 male
## 6 Adelie Torge~ 38.9 17.8 181 3625 fema~
## # ... with 1 more variable: year <int>
The table below summarizes how many penguins of each species and sex we are working with. The largest subset is the Adelie penguins, whose numbers are more than double those of the Chinstraps. The male-to-female ratio is also almost exactly 1:1 for all three species.
## # A tibble: 6 x 3
## # Groups: species [3]
## species sex total
## <fct> <fct> <int>
## 1 Adelie female 73
## 2 Adelie male 73
## 3 Chinstrap female 34
## 4 Chinstrap male 34
## 5 Gentoo female 58
## 6 Gentoo male 61
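A table like the one above comes from counting rows per species and sex. The sketch below is a minimal illustration, not the code behind the real table: it uses base R so it runs without any packages, and the `toy` data frame holds made-up rows rather than the palmerpenguins measurements. With the tidyverse loaded, `count(penguins, species, sex, name = "total")` should give an equivalent result on the real data.

```r
# Toy data: hypothetical rows, not the palmerpenguins measurements
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"),
  sex     = c("female", "male", "male", "female", "female", "male")
)

# Count rows for every species/sex combination (empty combinations get 0)
counts <- as.data.frame(table(species = toy$species, sex = toy$sex),
                        responseName = "total")

# Sort so that the two sexes of each species sit next to each other
counts <- counts[order(counts$species, counts$sex), ]
print(counts)
```

Unlike a grouped `summarise()`, `table()` also lists combinations that never occur, which can be handy for spotting gaps in the data.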
Let’s take a look at the distribution of the variables body_mass_g, bill_depth_mm, bill_length_mm and flipper_length_mm. I’ve arranged the boxplots for each variable in ascending order from left to right according to the median.
We can make the following observations from above:
bill_length_mm has quite a few noted outliers.
Apart from bill_depth_mm, Adelie penguins have the smallest measurements. (Note, though, that male Adelies have a larger body mass than both male and female Chinstraps.)
Gentoo penguins have the largest body_mass_g and flipper_length_mm, with relatively high bill_length_mm, while having the smallest bill_depth_mm.
Could there be an inverse relationship between bill_depth_mm and the other variables?
A quick regression analysis shows the following:
Within each species, there appears to be a positive correlation between bill_depth_mm and the other variables. The table below shows the strength of these relationships.
main_cor <-
  penguins %>% group_by(species) %>%
  summarise(depth_length = cor(bill_depth_mm, bill_length_mm),
            depth_mass = cor(bill_depth_mm, body_mass_g),
            depth_flipper = cor(bill_depth_mm, flipper_length_mm))
main_cor
## # A tibble: 3 x 4
## species depth_length depth_mass depth_flipper
## <fct> <dbl> <dbl> <dbl>
## 1 Adelie 0.386 0.580 0.311
## 2 Chinstrap 0.654 0.604 0.580
## 3 Gentoo 0.654 0.723 0.711
include p - value above?
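One way to attach a p-value to a correlation is base R's `cor.test()`, which reports the same Pearson estimate as `cor()` together with a test of whether the true correlation is zero. The sketch below is only an illustration: `depth` and `mass` are made-up vectors, not values from the penguins data.

```r
# Toy measurements: hypothetical values, not the palmerpenguins data
depth <- c(17.1, 18.0, 18.4, 18.9, 19.3, 19.8, 20.1, 20.6)
mass  <- c(3300, 3500, 3450, 3700, 3650, 3900, 3800, 4100)

# cor.test() returns the Pearson estimate together with a two-sided p-value
result <- cor.test(depth, mass, method = "pearson")

r_value <- unname(result$estimate)  # same number that cor(depth, mass) gives
p_value <- result$p.value           # small values speak against a true r of 0

cat("r =", round(r_value, 3), " p =", signif(p_value, 3), "\n")
```

Applied per species, this would give each entry of the table its own p-value. Note that the p-value depends on the sample size as well as on r, so a modest correlation can still be statistically significant in a large group.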
It looks like Gentoo penguins show the strongest correlations while Adelie penguins show the weakest correlations. What is causing this?
First, a better look into bill_depth_mm shows us:
penguins %>% ggplot(aes(x = bill_depth_mm, fill = species)) +
  geom_histogram(binwidth = 0.3, show.legend = FALSE) +
  ggtitle("Bill Depth Distribution") +
  xlab("Bill depth (mm)") +
  theme_clean()
Notice how Adelie penguins form the largest subset and show the weakest correlations. We’ll focus on them for now to investigate why the relationships between bill_depth_mm and the other variables are so weak.
Before we begin, notice how bill_depth_mm for the Adelie penguins changes across the years.
penguins %>% filter(species == "Adelie") %>%
  ggplot(aes(x = bill_depth_mm, fill = sex, y = as.factor(year))) +
  geom_density_ridges(show.legend = FALSE) +
  facet_grid(. ~ sex) +
  xlab("Bill depth (mm)") +
  ylab("year") +
  ggtitle("Adelie bill depth over time") +
  theme_clean()
It appears that female Adelie bill depths decrease over the years, and male Adelie bill depths do too, albeit less noticeably. Is this data tracking the same penguins? Do Adelie bill depths drop as Adelies get older? As an aside, just for comparison, observe the bill depths of Gentoo penguins:
## Picking joint bandwidth of 0.239
## Picking joint bandwidth of 0.262
## Picking joint bandwidth of 0.402
## Picking joint bandwidth of 0.402
Chinstrap penguins’ bill depths also decrease slightly across the years, but Gentoo penguins’ bill depths increase. I’m no penguin expert, but perhaps this is because of biological differences between the species.
Back to focusing on the Adelie, let’s check if any outliers in the data led to weak correlations by removing them and checking the relationship between the variables again.
Since we’re comparing bill_depth_mm to all the other variables, let’s remove its outliers from the data.
To ensure these are outliers, I’ll use Tukey’s definition of outliers to exclude any data points outside of the acceptable range and plot this range below. Below is how I calculated and plotted the graphs.
tukey_ranges <- penguins %>% filter(species == "Adelie") %>%
  group_by(sex) %>%
  summarise(mean = mean(bill_depth_mm),
            sd = sd(bill_depth_mm),
            q1 = quantile(bill_depth_mm, 0.25),
            q3 = quantile(bill_depth_mm, 0.75),
            IQR = IQR(bill_depth_mm),
            # Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
            min = q1 - 1.5*IQR, max = q3 + 1.5*IQR,
            three_sds_out = mean + 3*sd) %>%
  select(sex, min, mean, max, three_sds_out)
female_Adelie <- penguins %>% filter(species == "Adelie", sex == "female") %>%
  ggplot(aes(x = bill_depth_mm)) +
  geom_histogram(binwidth = 0.3, fill = "orange") +
  xlab("") +
  xlim(15, 23) +
  ggtitle("Female Adelie bill depth") +
  geom_vline(xintercept = tukey_ranges$min[1], color = "blue", size = 1) +
  geom_vline(xintercept = tukey_ranges$max[1], color = "blue", size = 1) +
  geom_vline(xintercept = tukey_ranges$three_sds_out[1], color = "red", size = 1) +
  geom_text(data = data.frame(c("min range", "max range", "3 sds out"),
                              x = c(15, 20, 21),
                              y = c(10, 10, 10)),
            aes(x, y, label = c("min", "max", "3 sds out"))) +
  theme_clean()
male_Adelie <- penguins %>% filter(species == "Adelie", sex == "male") %>%
  ggplot(aes(x = bill_depth_mm)) +
  geom_histogram(binwidth = 0.3, fill = "blue") +
  xlab("Bill depth (mm)") +
  xlim(15, 23) +
  ggtitle("Male Adelie bill depth") +
  geom_vline(xintercept = tukey_ranges$min[2], color = "blue", size = 1) +
  geom_vline(xintercept = tukey_ranges$max[2], color = "blue", size = 1) +
  geom_vline(xintercept = tukey_ranges$three_sds_out[2], color = "red", size = 1) +
  geom_text(data = data.frame(c("min range", "max range", "3 sds out"),
                              x = c(17, 21, 22),
                              y = c(10, 10, 10)),
            aes(x, y, label = c("min", "max", "3 sds out"))) +
  theme_clean()
tukey_ranges
## # A tibble: 2 x 5
## sex min mean max three_sds_out
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 female 15.0 17.6 20.2 20.5
## 2 male 16.8 19.1 21.3 22.1
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
Anything outside the blue min and max lines is an outlier. For good measure, I added a red line marking three standard deviations above the mean to emphasize how far out one of the female Adelie data points is. I will remove occurrences outside of the ranges and update our penguins table.
penguins_2 <- penguins %>%
  # drop Adelie rows whose bill depth is above the upper Tukey fence for their sex
  filter(!((species == "Adelie" & sex == "female" &
              bill_depth_mm > tukey_ranges$max[1]) |
           (species == "Adelie" & sex == "male" &
              bill_depth_mm > tukey_ranges$max[2])))
Now let’s see if this has affected the correlations for Adelie.
## # A tibble: 1 x 3
## depth_length depth_mass depth_flipper
## <dbl> <dbl> <dbl>
## 1 0.386 0.580 0.311
## # A tibble: 1 x 3
## depth_length depth_mass depth_flipper
## <dbl> <dbl> <dbl>
## 1 0.355 0.579 0.310
The correlations have actually fallen! Perhaps it’s because I removed the outliers individually by sex but am computing the correlation for the Adelie as a whole. I could compute the correlations by sex as well, but that would be too tedious.
Since the Adelie population as a whole has no outliers for bill_depth_mm, I’ll restore the data to what it was before removing the outliers.
Out of curiosity, I will check whether Adelies as a whole have outliers for the rest of their variables.
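A check like this can be sketched with a small helper that applies Tukey's rule to one column at a time. The helper name `tukey_outliers` and the `toy` data frame are hypothetical and only illustrate the idea; they are not the code or data used for the results on this page.

```r
# Return the values of x that fall outside Tukey's fences
# (below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR)
tukey_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75))
  spread <- q[2] - q[1]
  x[x < q[1] - 1.5 * spread | x > q[2] + 1.5 * spread]
}

# Toy data frame: hypothetical values, not the palmerpenguins data
toy <- data.frame(
  bill_length_mm    = c(36, 37, 38, 39, 40, 41, 42),
  flipper_length_mm = c(185, 187, 188, 190, 191, 192, 230)
)

# Apply the helper to every column; columns without outliers give numeric(0)
outliers <- lapply(toy, tukey_outliers)
print(outliers)
```

On the real data, the same helper could be applied to the numeric columns of the Adelie rows only.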
Only flipper_length_mm has occurrences outside of its range. I’ll remove those and compare the original correlations to the correlations we get after removing the flipper outliers.
## # A tibble: 1 x 3
## depth_length depth_mass depth_flipper
## <dbl> <dbl> <dbl>
## 1 0.386 0.580 0.311
## # A tibble: 1 x 3
## depth_length depth_mass depth_flipper
## <dbl> <dbl> <dbl>
## 1 0.396 0.587 0.342
The correlations did go up, but they’re still not strong. It makes me wonder whether the correlations would be better if I calculated them per sex as well.
penguins_3 %>%
  group_by(species, sex) %>%
  summarise(depth_length = cor(bill_depth_mm, bill_length_mm),
            depth_flipper = cor(bill_depth_mm, flipper_length_mm),
            depth_mass = cor(bill_depth_mm, body_mass_g))
## # A tibble: 6 x 5
## # Groups: species [3]
## species sex depth_length depth_flipper depth_mass
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Adelie female 0.157 0.113 0.414
## 2 Adelie male -0.0145 0.239 0.159
## 3 Chinstrap female 0.256 0.135 0.391
## 4 Chinstrap male 0.446 0.421 0.345
## 5 Gentoo female 0.430 0.308 0.372
## 6 Gentoo male 0.307 0.471 0.253
They are very much not. Based on this, I believe I can conclude that bill_depth_mm is at most weakly correlated with the other variables for the Adelie species.
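One possible explanation for why the per-sex correlations are lower than the species-level ones (an assumption, not something checked against the real data) is that males are larger than females on both measurements, so pooling the sexes adds a between-group trend on top of the within-group one. The sketch below shows the effect with made-up numbers; `toy` is a hypothetical data frame, not the penguins data.

```r
# Toy data: hypothetical values chosen so that males are larger on both axes
toy <- data.frame(
  sex   = rep(c("female", "male"), each = 5),
  depth = c(17.0, 17.4, 17.8, 18.2, 18.6, 19.0, 19.4, 19.8, 20.2, 20.6),
  mass  = c(3400, 3300, 3500, 3350, 3450, 4000, 3900, 4100, 3950, 4050)
)

# Correlation over all rows: the gap between the sexes dominates
pooled <- cor(toy$depth, toy$mass)

# Correlation inside each sex: the shift between the groups is removed
within <- sapply(split(toy, toy$sex), function(d) cor(d$depth, d$mass))

print(round(c(pooled = pooled, within), 3))
```

In this toy example the pooled correlation is much higher than either within-sex value, which is the same pattern as in the tables above.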
I started using Python in a cryptography class taught by Professor Nathan Pflueger at Amherst in the spring of 2020. The code I wrote for the course was usually short programs in Jupyter notebooks meant to demonstrate encryption languages and key signatures.
Currently, I am reading Zed A. Shaw’s “Learn Python the Hard Way 3”, following the exercises in the book and using Atom and the command line to run files. I recently created a short program, choose your path, which is meant to run on the command line; I have shared the code on GitHub.
My goal is also to get a Python certification from w3schools.com.
I’m also practicing manipulating and plotting data with Python’s matplotlib and pandas, using the palmer penguins data set described in the R section. I do this in a Jupyter notebook.
I’m working on showing a preview of that Jupyter notebook here.
I am currently reading Practical SQL: A Beginner’s Guide to Storytelling with Data written by Anthony DeBarros and working through its guided lessons on pgAdmin.
In addition, I have been using w3schools.com to teach myself and am an active participant on HackerRank.
I’ve only recently started using Tableau, particularly thanks to Stanford Computer Science Professor Widom’s free online courses for data science. Utilizing a data set for world soccer players, I created this Tableau dashboard to try my hand at using the program.