setwd("C:/Users/danie/OneDrive/Documents/Data 110 work")
Soccer <- read.csv("results.csv")
Project 1
This data set was provided by Mart Jürisoo on Kaggle.
We will be exploring a record of matches played between various national teams. The dataset comprises crucial variables that shed light on the dynamics of these contests, allowing us to gain insights into the performance of teams, the tournaments they participated in, and the goals scored.
Data set variables:
Date: The date when the match took place.
Home Team: The name of the national team playing on their home turf.
Away Team: The name of the opposing national team.
Home Score: The number of goals scored by the home team.
Away Score: The number of goals scored by the away team.
Tournament: The type of tournament or competition in which the match was played.
City: The city where the match was held.
Country: The country in which the match took place.
Neutral: A logical (TRUE/FALSE) flag indicating whether the match was played at a neutral venue.
I will look for patterns or trends in goal scoring over time, by team, and by tournament, and relate them to match outcomes and recent team performance.
summary(Soccer)
     date            home_team          away_team           home_score
 Length:44934       Length:44934       Length:44934       Min.   : 0.00
 Class :character   Class :character   Class :character   1st Qu.: 1.00
 Mode  :character   Mode  :character   Mode  :character   Median : 1.00
                                                          Mean   : 1.74
                                                          3rd Qu.: 2.00
                                                          Max.   :31.00
   away_score      tournament            city             country
 Min.   : 0.000   Length:44934       Length:44934       Length:44934
 1st Qu.: 0.000   Class :character   Class :character   Class :character
 Median : 1.000   Mode  :character   Mode  :character   Mode  :character
 Mean   : 1.178
 3rd Qu.: 2.000
 Max.   :21.000
  neutral
 Mode :logical
 FALSE:33739
 TRUE :11195
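The summary lists date as a character column, so it helps to convert it before any time-based work. A minimal sketch of that conversion, using a small made-up data frame (toy_matches and its values are hypothetical stand-ins, not rows taken from results.csv):

```r
# Toy stand-in for a few rows of the data (illustrative values only)
toy_matches <- data.frame(
  date = c("1872-11-30", "2018-07-20", "2022-11-19"),
  home_score = c(0, 1, 2),
  stringsAsFactors = FALSE
)

# read.csv() loads dates as character; convert them to Date explicitly
toy_matches$date <- as.Date(toy_matches$date, format = "%Y-%m-%d")

# Pull out the year so goals can later be grouped over time
toy_matches$year <- as.integer(format(toy_matches$date, "%Y"))

class(toy_matches$date)
toy_matches$year
```

The same two lines could be applied to the date column of Soccer, assuming every date in the file follows the YYYY-MM-DD pattern shown in the first rows.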
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
head(Soccer)
        date home_team away_team home_score away_score tournament    city
1 1872-11-30  Scotland   England          0          0   Friendly Glasgow
2 1873-03-08   England  Scotland          4          2   Friendly  London
3 1874-03-07  Scotland   England          2          1   Friendly Glasgow
4 1875-03-06   England  Scotland          2          2   Friendly  London
5 1876-03-04  Scotland   England          3          0   Friendly Glasgow
6 1876-03-25  Scotland     Wales          4          0   Friendly Glasgow
   country neutral
1 Scotland   FALSE
2  England   FALSE
3 Scotland   FALSE
4  England   FALSE
5 Scotland   FALSE
6 Scotland   FALSE
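# Label each match with its winner, or "Draw" when the scores are level
Soccer <- Soccer %>%
  mutate(winner = ifelse(home_score > away_score, home_team,
                  ifelse(home_score < away_score, away_team, "Draw")))
# Calculate the total goals scored in matches played in each country
# (country is where the match took place, not the team that scored)
total_goals <- Soccer %>%
  group_by(country) %>%
  summarise(totalgoals = sum(home_score + away_score)) %>%
  arrange(desc(totalgoals))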
# Select the top 10 countries
top_countries <- head(total_goals, 10)
# Order the bars by the totalgoals column (not the total_goals data frame)
ggplot(data = top_countries, aes(x = reorder(country, totalgoals), y = totalgoals)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Top 10 Countries with Most Goals Scored",
       x = "Country",
       y = "Total Goals Scored") +
  theme_minimal() +
  coord_flip()
# Calculate the total goals scored in each tournament
tournament_goals <- Soccer %>%
  group_by(tournament) %>%
  summarise(totalgoals = sum(home_score + away_score)) %>%
  arrange(desc(totalgoals))
# Select the top 10 tournaments
top_tournaments <- head(tournament_goals, 10)
# Create a bar chart
ggplot(data = top_tournaments, aes(x = reorder(tournament, totalgoals), y = totalgoals)) +
  geom_bar(stat = "identity", fill = "purple") +
  labs(title = "Top 10 Tournaments with Most Goals Scored",
       x = "Tournament",
       y = "Total Goals Scored") +
  theme_minimal() +
  coord_flip()
# Ifelse statement to get the winner (or "Draw") for each match
soccer_data <- Soccer %>%
  mutate(winner = ifelse(home_score > away_score, home_team,
                  ifelse(home_score < away_score, away_team, "Draw")))
# Defining the dates in a time frame
date1 <- as.Date("2018-07-20")
date2 <- as.Date("2022-11-19")
# Keep matches inside the time frame that were won by one of the selected teams,
# then count the wins per team (date is stored as character, so convert it first)
soccer_data <- soccer_data %>%
  filter(as.Date(date) >= date1 & as.Date(date) <= date2,
         winner %in% c("Netherlands", "Argentina", "France", "Morocco", "Spain",
                       "Croatia", "Switzerland", "Brazil", "Cameroon", "Costa Rica")) %>%
  group_by(winner) %>%
  summarize(wins = n())
ggplot(soccer_data, aes(x = winner, y = wins, fill = winner)) +
  geom_bar(stat = "identity") +
  labs(x = "Country", y = "Number of Wins", title = "Wins by Country Between the 2018 and 2022 World Cups") +
  theme(legend.position = "top")
Essay #2
A) I didn’t really clean this data set, as it already had everything I needed and didn’t need any adjustments except the date.
B) These visualizations are mostly self-explanatory. The first one shows the top 10 countries by total goals throughout history. Because the chart groups by the country column, which records where each match was played, it counts goals scored in matches hosted in each country rather than goals scored by that country's team. What surprised me about this chart was that the U.S. came out on top, which is hard to believe if you watch soccer. I'm not saying they are bad, but when I looked through the data set myself, I found a pattern: the U.S. tends to play against much weaker teams.
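Since the first chart groups by the country column (where the match was played), a total per scoring team needs the home and away goals stacked by team instead. A minimal sketch of the difference, assuming dplyr is available and using a tiny made-up data frame (toy_matches, the team names A, B, C, and all scores are hypothetical):

```r
library(dplyr)

# Toy matches: team names and scores are illustrative only
toy_matches <- tibble(
  home_team  = c("A", "B", "A"),
  away_team  = c("B", "C", "C"),
  home_score = c(2, 1, 0),
  away_score = c(1, 1, 3),
  country    = c("A", "A", "C")  # host country of each match
)

# Goals counted by host country (what group_by(country) measures)
goals_by_host <- toy_matches %>%
  group_by(country) %>%
  summarise(totalgoals = sum(home_score + away_score)) %>%
  arrange(desc(totalgoals))

# Goals counted by the team that scored them:
# stack home and away rows into one team/goals table, then sum per team
goals_by_team <- bind_rows(
  toy_matches %>% transmute(team = home_team, goals = home_score),
  toy_matches %>% transmute(team = away_team, goals = away_score)
) %>%
  group_by(team) %>%
  summarise(totalgoals = sum(goals)) %>%
  arrange(desc(totalgoals))

goals_by_host
goals_by_team
```

In this toy example the two rankings disagree: country A hosts the most goals, while team C scores the most. Running the second summary on Soccer would show whether the same thing happens in the real data.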
Similar to the first one, the second chart represents the top 10 tournaments with the most goals. I wasn’t really surprised by this one, since the result was expected. The outlier here was Friendly, which isn’t really a tournament, so I expected it to be the highest. I also expected the World Cup not to be that high, since it is a very hard competition with the best national teams in the whole world.
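Total goals depend heavily on how many matches each competition has, so a per-match average can be a fairer way to compare tournaments. A minimal sketch, assuming dplyr is available and using made-up rows (toy_matches, its tournament labels, and its scores are hypothetical, and the goals_per_match column is introduced here for illustration):

```r
library(dplyr)

# Toy matches: tournament labels and scores are illustrative only
toy_matches <- tibble(
  tournament = c("Friendly", "Friendly", "Friendly", "FIFA World Cup"),
  home_score = c(1, 0, 2, 3),
  away_score = c(1, 0, 1, 2)
)

# Count matches and goals per tournament, then divide to get goals per match
goals_per_match <- toy_matches %>%
  group_by(tournament) %>%
  summarise(
    matches = n(),
    totalgoals = sum(home_score + away_score),
    goals_per_match = totalgoals / matches
  ) %>%
  arrange(desc(goals_per_match))

goals_per_match
```

Applied to Soccer, this would show whether Friendly leads only because it has the most matches, or because those matches are also high scoring.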
The last chart represents how many wins 10 countries participating in the 2022 World Cup had between the 2018 World Cup and the start of the tournament. This one really surprised me because it roughly fits the predictions sports analysts made about who was going to win the World Cup before it started. Using this website, https://projects.fivethirtyeight.com/2022-world-cup-predictions/ you can see that Brazil was the favorite to win it all with a whopping 22% chance, while Spain was second with an 11% chance, and France and Argentina followed with 9% and 8% respectively. Looking at my chart, it makes some sense.
C) I had lots of fun with this, but I feel like I could’ve done much more. The only thing stopping me was the lack of variables and data provided by my data set.