Project 1

Author

Daniel Ekane

This data set was provided by Mart Jürisoo on Kaggle.

We will be exploring a record of matches played between various national teams. The dataset comprises crucial variables that shed light on the dynamics of these contests, allowing us to gain insights into the performance of teams, the tournaments they participated in, and the goals scored.

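Data set variables: date, home_team, away_team, home_score, away_score, tournament, city, country (where the match was played), and neutral (whether the match was played at a neutral venue).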

I will look for patterns or trends in goal scoring over time, by team, and by tournament, and try to connect them to match outcomes and recent team performance.

setwd("C:/Users/danie/OneDrive/Documents/Data 110 work")
Soccer <- read.csv("results.csv")
summary(Soccer)
     date            home_team          away_team           home_score   
 Length:44934       Length:44934       Length:44934       Min.   : 0.00  
 Class :character   Class :character   Class :character   1st Qu.: 1.00  
 Mode  :character   Mode  :character   Mode  :character   Median : 1.00  
                                                          Mean   : 1.74  
                                                          3rd Qu.: 2.00  
                                                          Max.   :31.00  
   away_score      tournament            city             country         
 Min.   : 0.000   Length:44934       Length:44934       Length:44934      
 1st Qu.: 0.000   Class :character   Class :character   Class :character  
 Median : 1.000   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 1.178                                                           
 3rd Qu.: 2.000                                                           
 Max.   :21.000                                                           
  neutral       
 Mode :logical  
 FALSE:33739    
 TRUE :11195    
                
                
                
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
head(Soccer)
        date home_team away_team home_score away_score tournament    city
1 1872-11-30  Scotland   England          0          0   Friendly Glasgow
2 1873-03-08   England  Scotland          4          2   Friendly  London
3 1874-03-07  Scotland   England          2          1   Friendly Glasgow
4 1875-03-06   England  Scotland          2          2   Friendly  London
5 1876-03-04  Scotland   England          3          0   Friendly Glasgow
6 1876-03-25  Scotland     Wales          4          0   Friendly Glasgow
   country neutral
1 Scotland   FALSE
2  England   FALSE
3 Scotland   FALSE
4  England   FALSE
5 Scotland   FALSE
6 Scotland   FALSE
Soccer <- Soccer %>%
  mutate(winner = ifelse(home_score > away_score, home_team,
                         ifelse(home_score < away_score, away_team, "Draw")))
# Calculate the total goals scored in matches played in each country
# (the country column is the match location, not the scoring team)

total_goals <- Soccer %>%
  group_by(country) %>%
  summarise(totalgoals = sum(home_score + away_score)) %>%
  arrange(desc(totalgoals))

# Select the top 10 countries
top_countries <- head(total_goals, 10)

# Order the bars by the totalgoals column (not the total_goals data frame)
ggplot(data = top_countries, aes(x = reorder(country, totalgoals), y = totalgoals)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Top 10 Countries with Most Goals Scored",
       x = "Country",
       y = "Total Goals Scored") +
  theme_minimal() +
  coord_flip()

# Calculate the total goals scored in each tournament

tournament_goals <- Soccer %>%
  group_by(tournament) %>%
  summarise(totalgoals = sum(home_score + away_score)) %>%
  arrange(desc(totalgoals))

# Select the top 10 tournaments
top_tournaments <- head(tournament_goals, 10)

# Create a bar chart
ggplot(data = top_tournaments, aes(x = reorder(tournament, totalgoals), y = totalgoals)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Top 10 Tournaments with Most Goals Scored",
       x = "Tournament",
       y = "Total Goals Scored") +
  theme_minimal() +
  coord_flip()

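# Nested ifelse() to label each match with its winner, or "Draw";
# the date column is converted from text to Date for the filter below
soccer_data <- Soccer %>%
  mutate(date = as.Date(date),
         winner = ifelse(home_score > away_score, home_team,
                         ifelse(home_score < away_score, away_team, "Draw")))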
# Defining the dates in a time frame
date1 <- as.Date("2018-07-20")
date2 <- as.Date("2022-11-19")

# filtered Date
soccer_data <- soccer_data %>%
  filter(date >= date1 & date <= date2, 
         winner %in% c("Netherlands", "Argentina", "France", "Morocco", "Spain", "Croatia", "Switzerland", "Brazil", "Cameroon", "Costa Rica")) %>%
  group_by(winner) %>%
  summarize(wins = n())

ggplot(soccer_data, aes(x = winner, y = wins, fill = winner)) +
  geom_bar(stat = "identity") +
  labs(x = "Country", y = "Number of Wins", title = "Wins by Country Between the 2018 and 2022 World Cups") +
  theme(legend.position = "top")

Essay #2

A) I didn’t really clean this data set, as it already had everything I needed and didn’t need any adjustments except for the date.
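The date adjustment mentioned above can be sketched as follows. This is a minimal example, assuming the date column arrives as text in YYYY-MM-DD form, as the summary output suggests. The names dates, match_dates, match_years, and match_decades are illustrative and not part of the original analysis.

```r
# Dates copied from the head(Soccer) output; read.csv() loads them as text
dates <- c("1872-11-30", "1873-03-08", "1874-03-07")

# Convert the text to Date so comparisons and date arithmetic behave as expected
match_dates <- as.Date(dates)

# Pull out the year as a number, which is handy for grouping by year or decade
match_years <- as.integer(format(match_dates, "%Y"))
match_decades <- (match_years %/% 10) * 10
```

With the year or decade available as its own column, goal totals could be grouped over time in the same way the charts group them by country or tournament.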

B) These visualizations are fairly self-explanatory. The first one shows the top 10 countries by total goals scored throughout history. Note that the code groups by the country column, which records where each match was played, so the totals count goals in matches hosted in each country rather than goals scored by that country's team. What surprised me about this chart’s result was that the U.S. came out on top, which is insane, especially if you watch soccer. Not saying they are bad, but when I took a look at the data set myself, I found a pattern: the U.S. tends to play against much weaker teams.
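The difference between goals scored in a country and goals scored by a team can be checked with a small sketch. It uses only the six matches shown in the head(Soccer) output, so the numbers say nothing about the full data set. The names matches, team_goals, and host_goals are illustrative and not part of the original analysis.

```r
library(dplyr)

# First six matches, copied from the head(Soccer) output
matches <- data.frame(
  home_team  = c("Scotland", "England", "Scotland", "England", "Scotland", "Scotland"),
  away_team  = c("England", "Scotland", "England", "Scotland", "England", "Wales"),
  home_score = c(0, 4, 2, 2, 3, 4),
  away_score = c(0, 2, 1, 2, 0, 0),
  country    = c("Scotland", "England", "Scotland", "England", "Scotland", "Scotland")
)

# Stack home and away rows so each row is one team's goals in one match
team_goals <- bind_rows(
  transmute(matches, team = home_team, goals = home_score),
  transmute(matches, team = away_team, goals = away_score)
) %>%
  group_by(team) %>%
  summarise(totalgoals = sum(goals)) %>%
  arrange(desc(totalgoals))

# Goals in matches hosted by each country, which is what group_by(country) counts
host_goals <- matches %>%
  group_by(country) %>%
  summarise(totalgoals = sum(home_score + away_score)) %>%
  arrange(desc(totalgoals))
```

Comparing team_goals with host_goals on the full data set would show how much of a country's total comes from its own team and how much from other teams playing there.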

Similar to the first one, the second chart represents the top 10 tournaments with the most goals. I wasn’t really surprised by this one, since it matched what I expected. The outlier here was Friendlies: it is not a tournament but a catch-all for non-competitive matches, so I expected it to be the highest. I also expected the World Cup to not be that high, since it is a very hard competition with the best national teams in the whole world.

The last chart represents how many wins 10 countries participating in the 2022 World Cup had between the 2018 World Cup and the start of the tournament. This one really surprised me because it roughly fits the predictions sports analysts made before the tournament about who would win. On this website, https://projects.fivethirtyeight.com/2022-world-cup-predictions/, you can see that Brazil was the favorite to win it all with a whopping 22%, while Spain came second with an 11% chance of winning, and France and Argentina got 9% and 8%, respectively. Looking at my chart, it makes some sense.
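Raw win counts depend on how many matches each team played, so a win rate may be a fairer comparison. The sketch below is one way to compute it, not what the chart above does. It again uses only the six matches from the head(Soccer) output, and the names matches and results are illustrative.

```r
library(dplyr)

# First six matches, copied from the head(Soccer) output
matches <- data.frame(
  home_team  = c("Scotland", "England", "Scotland", "England", "Scotland", "Scotland"),
  away_team  = c("England", "Scotland", "England", "Scotland", "England", "Wales"),
  home_score = c(0, 4, 2, 2, 3, 4),
  away_score = c(0, 2, 1, 2, 0, 0)
)

# One row per team per match, flagged TRUE when that team won
results <- bind_rows(
  transmute(matches, team = home_team, won = home_score > away_score),
  transmute(matches, team = away_team, won = away_score > home_score)
) %>%
  group_by(team) %>%
  summarise(played = n(),
            wins = sum(won),
            win_rate = wins / played)
```

Applied to the filtered 2018 to 2022 window, the win_rate column could be plotted in place of wins to account for teams that played more matches.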

C) I had lots of fun with this, but I feel like I could’ve done much more; the only thing stopping me was the lack of variables and data provided by my data set.