This page covers a range of data analytics, data science, and machine learning projects I have worked on, each focusing on a different event or domain. I hope you find my findings interesting!
The code I used for the following projects can all be found in my GitHub Repository: https://github.com/andrejcc04/Portfolio
All the projects and models I’ve built are available in the tab on the left.
Welcome to my portfolio of machine learning projects. Dive into a collection that spans across diverse domains, showcasing my passion for leveraging data to uncover insights and make predictions. In the Formula 1 Lap Time Prediction Project, I delve into the world of motorsport analytics, employing advanced algorithms to forecast lap times with precision, crucial for race strategy optimization. Moving to the Formula 1 Race Winner Prediction Project, I harness historical race data to develop models that predict race outcomes, offering valuable insights into driver and team performance factors.
Shifting gears to the realm of sports analytics, the Premier League Champion Prediction Project employs statistical modeling to forecast league winners based on team performance metrics over multiple seasons. In the financial domain, the XLK Stock Price Prediction Project focuses on predicting stock prices using historical market data and machine learning techniques, aiding in investment decision-making with robust predictive models. Lastly, the American House Price Prediction Project explores the dynamics of real estate valuation, utilizing key property features to predict housing prices accurately. Explore these projects to see how data science and machine learning can illuminate trends and drive informed decision-making across diverse fields.
As I was watching the 2024 Monaco Grand Prix, I wondered how hard it would be to build a machine learning model that could (somewhat) successfully predict the winner of the races to come. So I got to work, and after a couple of days of trial and error I had a working model. I started off with just a Random Forest classification model, but it wasn’t giving me the results I wanted. I therefore built three different classification models (Support Vector Machine, Random Forest, and Gradient Boosting) and combined them into a single ensemble model to see if that got me more precise results… It did!
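The text does not say how the three classifiers are combined, so the following is only a minimal Python sketch (not the code from the repository) that assumes simple majority ("hard") voting. The three stand-in models, the feature names `grid` and `recent_wins`, and the feature values are all hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(models, features):
    """Ask every base model for a label and combine them by majority vote."""
    votes = [model(features) for model in models]
    return majority_vote(votes)

# Hypothetical stand-ins for the trained SVM, random forest and gradient
# boosting classifiers: each maps a feature dict to 1 (wins) or 0 (does not).
def svm_like(f):
    return 1 if f["grid"] == 1 else 0

def forest_like(f):
    return 1 if f["grid"] <= 2 and f["recent_wins"] >= 2 else 0

def boosting_like(f):
    return 1 if f["recent_wins"] >= 3 else 0

models = [svm_like, forest_like, boosting_like]
driver = {"grid": 2, "recent_wins": 4}  # made-up feature values

# Two of the three stand-in models vote "wins", so the ensemble does too.
print(ensemble_predict(models, driver))
```

With an odd number of binary classifiers there is never a tie, which is one practical reason to combine three models rather than two.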
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 163 8
## 1 2 7
##
## Accuracy : 0.9444
## 95% CI : (0.9002, 0.973)
## No Information Rate : 0.9167
## P-Value [Acc > NIR] : 0.1081
##
## Kappa : 0.5556
##
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.9879
## Specificity : 0.4667
## Pos Pred Value : 0.9532
## Neg Pred Value : 0.7778
## Prevalence : 0.9167
## Detection Rate : 0.9056
## Detection Prevalence : 0.9500
## Balanced Accuracy : 0.7273
##
## 'Positive' Class : 0
##
## Predicted Winner of the 2024 Austrian Grand Prix Styrian Grand Prix : Max Verstappen
Accuracy: The overall accuracy of 0.9444 is the proportion of correct predictions the model made on the entire test dataset (170 of 180).
Precision: Because “did not win” (class 0) is the positive class here, precision measures the model’s ability to avoid predicting that a driver did not win the race when they actually did (false positives). The precision (Pos Pred Value) of 0.9532 indicates that when the model predicts a driver will not win the race, it is correct about 95.3% of the time.
Sensitivity (Recall): The sensitivity quantifies the model’s ability to capture all cases where a driver didn’t win (positive cases). The sensitivity score of 0.9879 means that when a driver did not win the race, the model correctly identifies them as such about 98.8% of the time (163 of 165).
Specificity: The specificity quantifies the model’s ability to correctly identify cases where a driver won (negative cases). A specificity score of 0.4667 means the model correctly picks out the actual winner only about 47% of the time (7 of 15); in the remaining 53% of cases it incorrectly predicts that the winner didn’t win the race.
F1 Score: The F1 score of 0.9702 is the harmonic mean of precision and recall, combining both into a single balanced measure of the model’s performance. It is not part of the output above; it is computed from the precision and sensitivity values.
Balanced Accuracy: The balanced accuracy of 0.7273 accounts for class imbalance by taking the average of sensitivity and specificity. It provides a more reliable measure of model performance when dealing with imbalanced datasets like this one, where only a small share of rows are winners.
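For readers who want to check these metrics by hand, here is a short Python sketch (not the R code that produced the report) that recomputes them from the four cells of the confusion matrix printed above, treating class 0 ("did not win") as the positive class as the report does.

```python
# Counts read from the confusion matrix above, with class 0 ("did not win")
# treated as the positive class, as in the report.
tp = 163  # predicted 0, actual 0
fp = 8    # predicted 0, actual 1
fn = 2    # predicted 1, actual 0
tn = 7    # predicted 1, actual 1

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)        # "Pos Pred Value" in the report
sensitivity = tp / (tp + fn)      # also called recall
specificity = tn / (tn + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy = (sensitivity + specificity) / 2

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("f1", f1),
    ("balanced accuracy", balanced_accuracy),
]:
    print(f"{name}: {value:.4f}")
```

Note how far apart accuracy and balanced accuracy are: because roughly 92% of rows are non-winners, a model can score high accuracy while still missing about half of the actual winners.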
This project focuses on developing a linear regression model to predict a lap time given the driver, circuit, and lap. Below I display Sergio Pérez’s predicted lap 30 time at each circuit on the 2024 F1 calendar, followed by a post-race analysis featuring multiple race-descriptive plots and a review of my pre-race lap time prediction.
Note: 1 second = 1000 milliseconds
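Since everything in this section is measured in milliseconds, a small conversion helper makes the numbers easier to read. The Python sketch below is illustrative only (the function names are made up, and it is not the repository's R code); it assumes lap times are written as `m:ss.mmm` with exactly three millisecond digits, and it uses the Bahrain figures reported on this page.

```python
def to_milliseconds(laptime):
    """Convert an 'm:ss.mmm' lap time string into integer milliseconds."""
    minutes, rest = laptime.split(":")
    seconds, millis = rest.split(".")
    return (int(minutes) * 60 + int(seconds)) * 1000 + int(millis)

def format_laptime(ms):
    """Format integer milliseconds as 'm:ss.mmm', zero-padding each field."""
    minutes, remainder = divmod(ms, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"

predicted_ms = 97_785                       # Bahrain prediction, 1:37.785
actual_ms = to_milliseconds("1:36.313")     # Bahrain actual lap 30 time
error_ms = abs(predicted_ms - actual_ms)

print(format_laptime(predicted_ms), format_laptime(actual_ms), error_ms)
```

Zero-padding the millisecond field matters: without it, a lap of 77,029 ms would print with a two-digit fraction and be easy to misread.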
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Bahrain Grand Prix : 1:37:785
## Root Mean Squared Error: 1737 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 1472 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime Actual Laptime
## 1 Sergio Pérez 2024 Bahrain Grand Prix 30 1:37:785 1:36.313
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Saudi Arabian Grand Prix : 1:34:406
## Root Mean Squared Error: 1509 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 1624 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime
## 1 Sergio Pérez 2024 Saudi Arabian Grand Prix 30 1:34:406
## Actual Laptime
## 1 1:32.782
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Australian Grand Prix : 1:22:280
## Root Mean Squared Error: 1550 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 100 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime
## 1 Sergio Pérez 2024 Australian Grand Prix 30 1:22:280
## Actual Laptime
## 1 1:22.180
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Japanese Grand Prix : 1:37:392
## Root Mean Squared Error: 1683 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 295 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime
## 1 Sergio Pérez 2024 Japanese Grand Prix 30 1:37:392
## Actual Laptime
## 1 1:37.097
##
## Predicted lap time of Sergio Pérez on lap 30 in the 2024 Chinese Grand Prix: 1:41:434
## Root Mean Squared Error: 1243 milliseconds
## forename surname year name lap china_predicted_laptime
## 1 Sergio Pérez 2024 Chinese Grand Prix 30 101434
## Predicted Laptime Actual Laptime milliseconds
## 1 1:41:434 2:21.644 141644
## POST RACE ANALYSIS: Off by 40210 milliseconds
## Safety Car lap 21-31
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Miami Grand Prix : 1:32:656
## Root Mean Squared Error: 1334 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 27937 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime Actual Laptime
## 1 Sergio Pérez 2024 Miami Grand Prix 30 1:32:656 2:00.593
##
## Safety Car Laps 28-32
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Emilia Romagna Grand Prix : 1:22:158
## Root Mean Squared Error: 1449 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 1105 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime
## 1 Sergio Pérez 2024 Emilia Romagna Grand Prix 30 1:22:158
## Actual Laptime
## 1 1:23.263
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Monaco Grand Prix : 1:19:511
## Root Mean Squared Error: 2198 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by NA milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime Actual Laptime
## 1 Sergio Pérez 2024 Monaco Grand Prix 30 1:19:511 <NA>
##
## Crashed out lap 1
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Canadian Grand Prix : 1:17:692
## Root Mean Squared Error: 1120 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 11641 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime
## 1 Sergio Pérez 2024 Canadian Grand Prix 30 1:17:692
## Actual Laptime
## 1 1:29.333
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Spanish Grand Prix : 1:20:814
## Root Mean Squared Error: 4366 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 589 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime Actual Laptime
## 1 Sergio Pérez 2024 Spanish Grand Prix 30 1:20:814 1:21.403
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Austrian Grand Prix : 1:11:208
## Root Mean Squared Error: 1199 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 515 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime
## 1 Sergio Pérez 2024 Austrian Grand Prix 30 1:11:208
## Actual Laptime
## 1 1:10.693
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 British Grand Prix : 1:32:185
## Root Mean Squared Error: 1523 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 7836 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime Actual Laptime
## 1 Sergio Pérez 2024 British Grand Prix 30 1:32:185 1:40.021
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Hungarian Grand Prix : 1:24:827
## Root Mean Squared Error: 1395 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 691 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime
## 1 Sergio Pérez 2024 Hungarian Grand Prix 30 1:24:827
## Actual Laptime
## 1 1:25.518
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Belgian Grand Prix : 1:49:216
## Root Mean Squared Error: 3667 milliseconds
##
## POST RACE ANALYSIS: Laptime Prediction was off by 338 milliseconds
## [[1]]
## forename surname year name lap Predicted Laptime Actual Laptime
## 1 Sergio Pérez 2024 Belgian Grand Prix 30 1:49:216 1:48.878
##
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Dutch Grand Prix : 1:16:921
## Root Mean Squared Error: 1270 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Italian Grand Prix : 1:26:820
## Root Mean Squared Error: 870 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 European Grand Prix : 1:47:279
## Root Mean Squared Error: 1263 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Singapore Grand Prix : 1:40:180
## Root Mean Squared Error: 1190 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 United States Grand Prix : 1:42:469
## Root Mean Squared Error: 1431 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Mexican Grand Prix : 1:25:215
## Root Mean Squared Error: 2283 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Brazilian Grand Prix : 1:17:29
## Root Mean Squared Error: 1708 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Las Vegas Grand Prix : 1:38:630
## Root Mean Squared Error: 1488 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Qatar Grand Prix : 1:28:124
## Root Mean Squared Error: 1466 milliseconds
## PRE-RACE PREDICTIONS: Predicted lap time of Sergio Pérez on lap 30 in the 2024 Abu Dhabi Grand Prix : 1:30:844
## Root Mean Squared Error: 1389 milliseconds
I’ve become a big Arsenal fan, and after they finished “so close yet so far” from a Premier League title two years in a row, I started to wonder how many wins, points, or goals we would need to have a really good shot at finally winning it after 20 years. I developed a logistic regression model to predict whether a team will be crowned champions or not, using features such as the number of wins, draws, losses, points, goals for, and goals against. Below are the results:
## Actual
## Prediction 0 1
## 0 231 7
## 1 6 9
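To give a feel for what a logistic regression does with this kind of data, here is a minimal Python sketch, assuming a single feature (season points) and plain batch gradient descent. It is not the model from the repository, which uses several features, and the points and title labels below are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Made-up season totals (points) and whether that team won the title.
points = [35, 42, 50, 58, 66, 70, 75, 78, 82, 86, 89, 91, 93, 98, 100]
champion = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Standardise the feature so gradient descent behaves well.
mean = sum(points) / len(points)
std = (sum((p - mean) ** 2 for p in points) / len(points)) ** 0.5
scaled = [(p - mean) / std for p in points]

w, b = fit_logistic(scaled, champion)

def title_probability(pts):
    """Estimated probability of winning the title with `pts` points."""
    return sigmoid(w * (pts - mean) / std + b)

for pts in (60, 85, 97):
    print(pts, round(title_probability(pts), 3))
```

Predicting "champion" whenever the probability exceeds 0.5 and tabulating those predictions against the actual outcomes gives a confusion matrix of the same shape as the one above.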
It’s pretty disheartening to see that Arsenal was favorable in every aspect, but still came up short to the behemoth that is Manchester City. However, it makes me feel better that we didn’t have 97 points in a season with only one loss and still came up short! (Like 2018-2019 Liverpool)
For the plot below, I sifted through 700 lines of data to visualize each team’s progress (or fall) as the Premier League season unfolded from start to finish.
The second of four graphs… I made this plot just to get a grasp of how each team that has ever played a game in the Premier League has done. I wanted to include every team for those Wigan Athletic, Derby County, Swansea City (…) fans who are just happy to see their team in a PL graphic, which is why it looks pretty messy.
(Don’t worry Blackburn fans, I have another one coming your way)
The next two plots are restricted to specific, well-known teams so they are easier to follow. Like the last one, this plot shows the progress of each of the teams that have ever won the Premier League. From Leicester City’s Cinderella story to Manchester City’s steady rise to the top to Blackburn’s slow decline… This plot tells a lot!
The final plot is similar to the previous one, except it’s confined to the Premier League’s big 6 clubs only. I mostly made this one to remind Tottenham fans that they’ve never won a Premier League title.
Interesting fact: the big 6 clubs have all finished in the top 6 in 5 of the last 10 PL seasons. Additionally, 5 of the 6 teams have been in the top 6 in 12 of the last 15 seasons.
It’s also intriguing to see that only 2 seasons have been won by a team outside the big 6: Blackburn’s 1994-95 title and Leicester City’s Cinderella run, as mentioned in the plot before.
Below are some captivating statistics I found after manipulating the Premier League dataset:
## # A tibble: 43 × 2
## Team Average_finishing_position
## <chr> <dbl>
## 1 Manchester Utd 2.62
## 2 Arsenal 3.94
## 3 Liverpool 4.31
## 4 Chelsea 4.88
## 5 Manchester City 6.96
## 6 Tottenham 7.41
## 7 Newcastle Utd 9.76
## 8 Blackburn 10
## 9 Leeds United 10.1
## 10 Aston Villa 10.2
## # ℹ 33 more rows
## # A tibble: 43 × 2
## Team Total_Wins
## <chr> <dbl>
## 1 Manchester Utd 744
## 2 Arsenal 673
## 3 Liverpool 652
## 4 Chelsea 647
## 5 Tottenham 540
## 6 Manchester City 529
## 7 Everton 439
## 8 Newcastle Utd 419
## 9 Aston Villa 392
## 10 West Ham 360
## # ℹ 33 more rows
## # A tibble: 43 × 2
## Team Total_Points
## <chr> <dbl>
## 1 Manchester Utd 2501
## 2 Arsenal 2314
## 3 Liverpool 2258
## 4 Chelsea 2245
## 5 Tottenham 1913
## 6 Manchester City 1809
## 7 Everton 1658
## 8 Newcastle Utd 1541
## 9 Aston Villa 1487
## 10 West Ham 1350
## # ℹ 33 more rows
It’s evident that the big 6 clubs have been dominating since 1993!
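Summary tables like the ones above come from grouping per-season rows by team and aggregating. The following is a rough Python sketch of that idea, not the dplyr code behind the tibbles; the rows, team names, and column layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-season rows: (season, team, finishing position, wins).
rows = [
    ("2021-22", "Team A", 1, 29),
    ("2021-22", "Team B", 2, 28),
    ("2022-23", "Team A", 1, 28),
    ("2022-23", "Team B", 5, 19),
    ("2023-24", "Team A", 1, 28),
    ("2023-24", "Team B", 3, 24),
]

# Group by team: collect finishing positions and accumulate wins.
positions = defaultdict(list)
total_wins = defaultdict(int)
for _season, team, position, wins in rows:
    positions[team].append(position)
    total_wins[team] += wins

average_position = {team: sum(p) / len(p) for team, p in positions.items()}

# Best (lowest) average finish first, mirroring the first table above.
ranking = sorted(average_position.items(), key=lambda item: item[1])
for team, avg in ranking:
    print(f"{team}: avg finish {avg:.2f}, total wins {total_wins[team]}")
```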
I trained a machine learning model by gathering and manipulating the relevant data, splitting it into two sets (train and test), fitting a linear regression model on the training set, and evaluating it on the held-out test set. The resulting model has a Root Mean Squared Error of only 0.69, meaning my predictions are typically off by about $0.69. I also included a regression line in one plot to show that the stock price has trended upward over the past year, and a perfect-fit line in the other plot.
## Root Mean Squared Error: $ 0.69
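The workflow behind that number (split, fit, predict, score) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the repository code: it uses one feature (the trading-day index), a chronological 70/30 split, and made-up prices that are not real XLK data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up closing prices indexed by trading day (not real XLK data).
days = list(range(10))
prices = [200.0, 201.5, 202.2, 204.1, 205.0, 206.4, 207.1, 209.0, 210.2, 211.1]

# Hold out the last 30% of days for testing; earlier days train the model.
split = len(days) * 7 // 10
slope, intercept = fit_line(days[:split], prices[:split])

test_predictions = [slope * d + intercept for d in days[split:]]
print(f"slope: {slope:.3f} per day")
print(f"test RMSE: ${rmse(prices[split:], test_predictions):.2f}")
```

Because RMSE squares each error before averaging, it weights large misses more heavily than a plain average of absolute errors would, so it is best read as a typical error size rather than an exact mean.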
Recently, I found out that our neighbors were selling their house and moving away. I was curious how much the house would cost, so I went on Zillow and, to my surprise, it was worth way more than ours. After looking at the pictures available, it was easy to understand why: they had a finished basement, a breakfast sunroom, and their house was nearly 1000 sqft bigger than ours.
This got me wondering how much certain features matter in a home, especially the number of beds, the number of baths, and the size in sqft. I got to work building a linear regression model that predicts the price of a house given those features as input. All in all, it turned out pretty well.
## Predicted price of a house in Texas with 3 beds, 3 baths, and a living space of 3000 sqft: $ 576003.1
This is the Data Analytics Portion of my projects. Explore a diverse collection where I apply analytical skills to uncover insights and predictions across various domains. In the NFL 2023 Quarterback Performance Analysis, I used advanced metrics to evaluate quarterback efficiency and overall performance, developing a unique formula to assess MVP ratings throughout the season, as well as comparing the performance of different players using radar charts.
Turning to the Olympic History Data Analysis, I examined decades of Olympic data to reveal trends in medal distributions and sports participation across different countries, presenting these insights through compelling visualizations and statistical analysis. Lastly, my analysis of the 2011 Masters Golf Tournament highlighted player performances throughout the entirety of the tournament.
You often see people debating whether one player did better than another. Using data from the 2013-2023 NFL seasons, I developed a formula in Excel to grade each player’s overall performance and how MVP-worthy it was. I then transferred the file into R to visualize my findings. Below you’ll find a table displaying the highest graded players in a season since 2013, and 4 separate scatterplots (one for each position) along with the names of the top 6 highest graded players at that position.
(If you’re interested in which performance metrics I used in determining the grade and whether a player had a good MVP score, feel free to check out the dataset on my GitHub: nfldata.xlsx)
## # A tibble: 10 × 9
## id name position team season grade mvp_rating first_votes isMvp
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1051 Lamar Jackson QB BAL 2019 0.998 0.882 50 1
## 2 1267 Patrick Mahom… QB KC 2020 0.939 0.888 2 0
## 3 1267 Patrick Mahom… QB KC 2018 0.942 0.957 41 1
## 4 11 Aaron Rodgers QB GB 2020 0.995 0.963 44 1
## 5 330 Cooper Kupp WR LA 2021 0.985 0.956 1 0
## 6 1188 Michael Thomas WR NO 2019 0.948 0.908 0 0
## 7 85 Antonio Brown WR PIT 2015 0.961 0.932 0 0
## 8 295 Christian McC… RB CAR 2019 0.978 0.949 0 0
## 9 295 Christian McC… RB SF 2023 0.943 0.862 0 0
## 10 868 Jonathan Tayl… RB IND 2021 0.968 0.903 0 0
## # A tibble: 10 × 4
## name position team grade
## <chr> <chr> <chr> <dbl>
## 1 Christian McCaffrey RB SF 0.943
## 2 CeeDee Lamb WR DAL 0.897
## 3 Keenan Allen WR LAC 0.897
## 4 Justin Jefferson WR MIN 0.891
## 5 Amon-Ra St. Brown WR DET 0.887
## 6 Tyreek Hill WR MIA 0.883
## 7 Puka Nacua WR LA 0.858
## 8 Davante Adams WR LV 0.848
## 9 D.J. Moore WR CHI 0.847
## 10 Dak Prescott QB DAL 0.835
In the 2023-2024 NFL Season, Lamar Jackson won the MVP with Dak Prescott coming in 2nd. Here is how the QBs stack up against each other, along with other positional comparisons:
The plot below is a line graph I created summarizing the 2011 Masters golf tournament, showing the performance of each golfer and the overall winner of the competition (Charl Schwartzel).
After gaining access to multiple Olympics datasets containing every recorded entry in every Games, from the inaugural 1896 Olympics in Greece up until the 2016 Games in Brazil, I decided my free time would be well spent answering a couple of questions I, like many others (I think), have been wondering about:
- Does the economic stability of a country affect the number of athletes it sends to the Olympics and the number of medals it wins?
- Does hosting the Olympics correlate with winning more medals that year?
Below are the results I found for the first question, along with the code I wrote to filter and manipulate the data so I can visualize it in a more effective manner.
As we can see, there is in fact a positive correlation between a country’s GDP per capita and both the number of medals it wins and the number of athletes it sends. In other words, countries with a higher GDP per capita tend to win more medals and send more athletes to the Olympics.
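A correlation like this is usually quantified with the Pearson coefficient, which ranges from -1 to 1. The Python sketch below shows the calculation on made-up numbers; it is not the analysis code, and the GDP and medal values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for six hypothetical countries.
gdp_per_capita = [2_000, 8_000, 15_000, 30_000, 45_000, 60_000]
medals = [1, 4, 3, 20, 35, 28]

r = pearson(gdp_per_capita, medals)
print(f"correlation between GDP per capita and medals: {r:.2f}")
```

A positive coefficient only says the two quantities move together; it does not by itself show that wealth causes medals, since factors such as population size can influence both.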
For the second question, I started by joining the datasets and writing a function that filters the joined data by country and flags, for each Games, whether that country was the host. The function also plots the number of medals the country won when it hosted against the number it won when it did not. To draw a conclusion across all countries, I then built a histogram comparing the medals hosts won at their home Games with the medals they won at the Games before.
As stated earlier, I created a histogram of the difference in medals (medals won as host minus medals won at the Olympics directly prior) to draw a reasonable conclusion.
We can see a positive host effect: the overall difference is positive, so countries tend to win more medals when they host the Olympics than when they don’t.
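The host-versus-previous-Games comparison boils down to a paired difference per host. Here is a minimal Python sketch of that calculation, not the original analysis code; the countries and medal counts are hypothetical.

```python
# Hypothetical (host, medals when hosting, medals at the previous Games) rows.
host_rows = [
    ("Country A", 50, 38),
    ("Country B", 27, 30),
    ("Country C", 65, 47),
    ("Country D", 19, 12),
]

# Paired difference: medals won as host minus medals won at the Games before.
differences = {host: hosted - previous for host, hosted, previous in host_rows}

mean_difference = sum(differences.values()) / len(differences)
share_improved = sum(d > 0 for d in differences.values()) / len(differences)

for host, diff in differences.items():
    print(f"{host}: {diff:+d} medals")
print(f"mean difference: {mean_difference:+.1f}")
print(f"share of hosts that improved: {share_improved:.0%}")
```

Comparing each host against its own previous Games, rather than against other countries, keeps country size and wealth roughly constant, which is why the paired difference is a reasonable first look at a host effect.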
In my diverse portfolio of projects spanning machine learning and data analysis, I’ve delved deep into various domains, from sports analytics in Formula 1, NFL, and Premier League to financial forecasting in stock markets and real estate.
Coming up: NFL MVP Linear/Logistic Regression Model, TMDB Movie Data Analysis, Alzheimer’s Classification Model