R Portfolio
Kelly Ratigan (Junior ACMS)
About Me!
Hi, my name is Kelly Ratigan, and I’m a student-athlete and aspiring Data Scientist. I recently transferred to the University of Notre Dame, where I plan to study ACMS and play for the Women’s Basketball Team. Previously, I attended Loyola University Maryland, majoring in Data Science with a minor in Mechanical Engineering. I’ve earned Dean’s List honors every semester and have gained hands-on experience in R, working under Dr. Alan Huebner on biomechanics research with DARI Motion Technology.
I’ve worked with R for several years, and it remains my preferred coding language, especially for data manipulation, statistical modeling, and visualization. My long-term goal is to build a career in Assistive Technology, ideally working with motion analysis systems or prosthetic development.
Contact Information:
Email: kratigan@nd.edu
Phone: (574) 335-9909
Summer Coding Examples!
Code Example: Assignment 1 (Part A)
Background Information: For this assignment, the data consisted of baseball players who played during the 2005 MLB season. The dataset included various player variables such as height, weight, and number of runs scored. One question in Part A asked us to examine the relationship between height and offensive performance using a scatter plot.
#reading in the dataset from computer
base_play <- read.csv("~/Desktop/baseball_players.csv")
#scatter plot to compare height and runs scored
plot(base_play$height,
     base_play$runs_scored,
     main = "Height vs Runs Scored",
     xlab = "Height (inches)",
     ylab = "Runs Scored",
     pch = 16,
     col = "blue") #color for visual
Evaluation: This code snippet was a great re-introduction to R coding. Displaying relationships between variables in a simple visual way like this, as opposed to creating a new dataset or a more formal plot, helped me understand that sometimes less is more when evaluating data. Working with this data, while simple, helped me practice customizing plots and understand how the different chart elements fit together. Knowing what we know now, and using what we’ve learned over these 6 weeks in class, I could potentially improve this plot by using ggplot2 or add depth with a regression line.
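As a rough sketch of that improvement (using the same base_play data and column names as above; this is not part of the original assignment), the plot could be rebuilt in ggplot2 with a fitted regression line:
# Sketch: ggplot2 version of the scatter plot with a simple linear regression line
library(ggplot2)
ggplot(base_play, aes(x = height, y = runs_scored)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE) +   # adds the fitted least-squares line
  labs(title = "Height vs Runs Scored",
       x = "Height (inches)",
       y = "Runs Scored")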
Code Example: Assignment 2 (Part B)
Background Information: For this assignment, the data came from the NHANESraw dataset, which includes health-related data from the National Health and Nutrition Examination Survey. One question from Part B asked us to filter the dataset to include only individuals between the ages of 18 and 69 and create a bar chart showing how many reported having tried hard drugs such as cocaine, heroin, or meth.
#loading in data / given library from R
library(NHANES)
library(ggplot2)
# Load NHANESraw dataset
data(NHANESraw)
# Filter NHANESraw data to include only individuals aged 18 to 69
NHANES_subset <- NHANESraw[NHANESraw$Age >= 18 & NHANESraw$Age <= 69, ]
# Create a bar chart of the HardDrugs variable
ggplot(data = NHANES_subset, aes(x = HardDrugs)) +
  geom_bar(fill = "purple") +
  labs(title = "Hard Drug Usage (Ages 18 to 69)",
       x = "Tried Hard Drugs",
       y = "Number of Individuals")
Context for Output: A bar chart displaying two bars: one for individuals who answered “Yes” and one for those who answered “No” to having used hard drugs. The “No” bar is significantly taller, showing that most respondents in this age group had not used these substances.
Evaluation: This code was a good way to get more comfortable filtering data and using ggplot2 to make bar charts. It wasn’t too complicated, but it helped me see how useful simple visuals can be when trying to show patterns, especially with yes/no data like this. It also made me realize how important it is to clean and narrow down data before jumping into analysis. If I did this again, I might try adding labels on top of the bars or organizing them better, but overall this was a solid step in learning how to make clear and effective plots.
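One possible way to add those labels (a sketch using the same NHANES_subset data as above; not part of the original assignment) is to layer the bar counts on top with geom_text():
# Sketch: the same bar chart with the count printed above each bar
ggplot(data = NHANES_subset, aes(x = HardDrugs)) +
  geom_bar(fill = "purple") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Hard Drug Usage (Ages 18 to 69)",
       x = "Tried Hard Drugs",
       y = "Number of Individuals")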
Code Example: Assignment 3 (Part C)
Background Information: For this assignment, the dataset included arrest records from the Berkeley Police Department between November and early December 2017. Part C asked us to calculate the actual age of individuals at the time of their arrest using their date of birth and arrest date. The goal was to compare this calculated “real age” to the age recorded by the officer and spot any inconsistencies (a possible version of that comparison is sketched after the output below).
# Load dataset
<- read.csv("~/Desktop/Berkeley_Arrest.csv")
Berkeley_Arrest
# convert character columns to Date format
Berkeley_Arrest$date_of_birth <- as.Date(Berkeley_Arrest$date_of_birth, format = "%Y-%m-%d")
Berkeley_Arrest$arrest_date <- as.Date(Berkeley_Arrest$arrest_date, format = "%Y-%m-%d")
# calculate real age at time of arrest
Berkeley_Arrest$real_age <- floor(as.numeric(Berkeley_Arrest$arrest_date - Berkeley_Arrest$date_of_birth) / 365)
# Summary of the new age variable
summary(Berkeley_Arrest$real_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 27.00 37.00 39.59 50.00 84.00
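The comparison to the officer-recorded age described in the background could look roughly like this (a sketch, not part of the assignment code shown here; the column name age for the officer-recorded age is an assumption and may be different in the real dataset):
# Sketch: compare the calculated real_age to the recorded age column ("age" is an assumed name)
table(Berkeley_Arrest$real_age == Berkeley_Arrest$age)
# Rows where the two ages disagree
subset(Berkeley_Arrest, real_age != age)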
Evaluation: This part of Assignment 3 was a cool challenge because it involved fixing character variables and turning them into something useful. Once I got the dates cleaned up, calculating the real age with floor() made it feel more realistic, matching how people would actually say their age. It also showed how something as small as converting a column type can totally change what you’re able to do with the data. If I did this again, I’d probably add a step to catch missing or weird values (sketched below), but overall I thought this was a good way to see the value of working with dates in R.
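A minimal sketch of that extra checking step (using the same Berkeley_Arrest data and real_age variable as above; not part of the original assignment) might be:
# Sketch: look for missing or implausible ages before summarizing
sum(is.na(Berkeley_Arrest$real_age))                     # ages that could not be computed
range(Berkeley_Arrest$real_age, na.rm = TRUE)            # smallest and largest calculated age
subset(Berkeley_Arrest, real_age < 10 | real_age > 100)  # rows with unlikely ages worth a closer look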
Code Example: Assignment 4 (Part A)
Background Information: For this assignment, we worked with a dataset of YouTube comments pulled from three popular music videos: “Roar” by Katy Perry, “Gangnam Style” by Psy, and “Love The Way You Lie” by Eminem. One question asked us to use text analysis to look for phrases that could be linked to spam behavior such as “subscribe.”
# Load libraries
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(stringr)
# Load the dataset (imported manually through RStudio file browser)
<- read.csv("~/Desktop/yt_comments_a.csv")
yt_comments_a
# Create a new variable that is TRUE if the comment contains the word "subscribe"
yt_comments_a <- yt_comments_a %>%
  mutate(subscribe = str_detect(str_to_lower(content), "subscribe"))
# Group by 'subscribe' and 'spam' to see how often the word appears in spam vs non-spam
yt_comments_a %>%
  group_by(subscribe, spam) %>%
  summarise(count = n())
`summarise()` has grouped output by 'subscribe'. You can override using the
`.groups` argument.
# A tibble: 4 × 3
# Groups: subscribe [2]
subscribe spam count
<lgl> <int> <int>
1 FALSE 0 475
2 FALSE 1 398
3 TRUE 0 2
4 TRUE 1 124
Evaluation: This part of the assignment was actually one of the hardest for me. I hadn’t really used mutate() before, and I kept running into issues like loading the wrong packages or applying functions in the wrong order. It got really frustrating until I slowed down and took it one line at a time. Once I saw how mutate() and str_detect() work together, it finally clicked, and being able to run the code and immediately see the output made it so much easier to learn. It helped me realize how helpful instant feedback can be when you’re trying to clean and analyze data, especially in a data set that’s messy or full of text like this one.
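As a follow-up to that output (a sketch using the same yt_comments_a data and the subscribe and spam variables; not part of the original assignment), the counts could be turned into a spam rate for each group:
# Sketch: proportion of spam among comments with and without the word "subscribe"
yt_comments_a %>%
  group_by(subscribe) %>%
  summarise(spam_rate = mean(spam == 1))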
Code Example: Assignment 5 (Part C)
Background Information: For this assignment, we worked with a dataset of homes in Wake County to explore how different factors affect total home price. We started by building a simple linear model using just square footage (SQFT) as the predictor. Eventually, we created a third model that added land size (Acres) to see if it improved the model’s accuracy. This part of the assignment helped show how including more relevant variables and understanding outliers can lead to much better predictions.
# Load the housing dataset
<- read.csv("~/Desktop/Houses.csv")
houses
# Create a model predicting Total price using SQFT only
<- lm(Total ~ SQFT, data = houses)
model_1
# Remove the outlier and create a new dataset
# don't want the 4.9 million house in it just yet
houses2 <- subset(houses, Total < 4000000)
# Create a new model using both SQFT and Acres as predictors
# includes the outlier to see that it can still be a good model once additional variables are added
model_3 <- lm(Total ~ SQFT + Acres, data = houses)
# View model summary
summary(model_3)
Call:
lm(formula = Total ~ SQFT + Acres, data = houses)
Residuals:
Min 1Q Median 3Q Max
-189677 -48727 -5686 30044 448248
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -99554.31 24126.44 -4.126 7.79e-05 ***
SQFT 151.79 12.83 11.832 < 2e-16 ***
Acres 120884.58 2360.00 51.222 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 91800 on 97 degrees of freedom
Multiple R-squared: 0.9658, Adjusted R-squared: 0.9651
F-statistic: 1368 on 2 and 97 DF, p-value: < 2.2e-16
Evaluation: This was my first time using multiple variables in a linear regression model, and it helped me understand how important it is to go beyond surface-level patterns. Earlier in the assignment, the model using just SQFT gave me a really low R-squared value, which showed me that size alone doesn’t explain home price well. But when I added Acres to the model, the R-squared jumped to 0.9658, meaning the model could now explain almost 97% of the variation in home price. That was a huge jump and made me realize how much impact land size had been hiding in the data. I had never worked with model output this deeply before, but breaking it down line by line helped me actually understand what the numbers meant instead of just guessing. This whole process showed me how valuable regression summaries are when it comes to explaining real-world relationships with data.
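A quick way to see that comparison in code (a sketch using the model_1 and model_3 objects defined above; not part of the original assignment) is to pull the R-squared value out of each model summary:
# Sketch: compare how much of the variation in Total each model explains
summary(model_1)$r.squared   # SQFT only
summary(model_3)$r.squared   # SQFT + Acres, about 0.9658 per the summary above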
Summary of Growth:
Over the course of this R Programming class, my perspective on coding shifted from simply getting the right answer to understanding why certain approaches work better than others. At first, I merely followed the directions step by step, assuming the result would come out the way I expected. But as we progressed through the assignments, particularly when working with text data, filtering large datasets, and comparing linear models, I began to think more about the structure of the data and the story it was trying to tell. Especially with data such as the YouTube file or the Houses data, I was able to start incorporating what we had learned in previous weeks into each piece of code. The patterns stayed relatively consistent, which helped me both practice and get the intended outputs.
One of the most significant shifts for me was learning how to slow down when I became stuck. There were several occasions, especially in Assignments 3 and 4, when my code would not work and I felt overwhelmed. But doing things one line at a time and reading the results changed everything. I began to focus on minutiae such as p-values, model fit, and how alternative representations may obfuscate or clarify a point. More than anything, I’ve learned to trust the process a little more.