The goal of this project is to predict who will win the 2022 FIFA World Cup. I will use country metrics, international friendlies, and World Cup data sets.

Upload Data

Data Set Without Corruption Data

World Bank Data

cereal_yield_raw <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Cereal_Yield.csv")
gdp_raw <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/GDP.csv")
gov_exp_raw <-read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Gov_Expendeture_Education.csv")
literacy_rate_raw <-read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Literacy_Rate.csv")
mortality_rate_raw <-read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/MortalityRate_Kids_Under5.csv")
population_raw <-read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Population.csv")
rural_pop_raw <-read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Rural_Population_Percentage.csv")
urban_pop_raw <-read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Urban_Population_Percentage.csv")
teen_fertility_rate <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Adolecent Fertility Rate.csv")
life_expectancy <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Life Expectancy at Birth.csv")
farm_land_percentage<- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/Percentage of Farm Land.csv")

World Cup Data

world_cup_1930 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1930.csv")
world_cup_1934 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1934.csv")
world_cup_1938 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1938.csv")
world_cup_1950 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1950.csv")
world_cup_1954 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1954.csv")
world_cup_1958 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1958.csv")
world_cup_1962 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1962.csv")
world_cup_1966 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1966.csv")
world_cup_1970 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1970.csv")
world_cup_1974 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1974.csv")
world_cup_1978 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1978.csv")
world_cup_1982 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1982.csv")
world_cup_1986 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1986.csv")
world_cup_1990 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1990.csv")
world_cup_1994 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1994.csv")
world_cup_1998 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 1998.csv")
world_cup_2002 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 2002.csv")
world_cup_2006 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 2006.csv")
world_cup_2010 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 2010.csv")
world_cup_2014 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 2014.csv")
world_cup_2018 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/FIFA - 2018.csv")

Spoken Languages

languages <- read.csv("https://raw.githubusercontent.com/JovianML/opendatasets/master/data/countries-languages-spoken/countries-languages.csv")

Perceived Corruption Index

corruption_index <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/corruption_data_csv.csv")
cpi_2016 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/cpi 2016.csv")
cpi_2017 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/cpi 2017.csv")
cpi_2018 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/cpi 2018.csv")
cpi_2019 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/cpi 2019.csv")
cpi_2020 <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/cpi 2020.csv")

International Friendlies

friendlies <- read.csv("https://raw.githubusercontent.com/tylerbaker01/Capstone-Project/main/international%20friendlies%20results.csv")

Tidy Data

I want to have a large data set that has the World Cup rank of each country, averages of the country metrics over the 3 years prior to the World Cup, and the amount of international friendlies won during the 3 years prior to the World Cup. I will also want a small data set that will allow me to predict the 2022 FIFA World Cup.

International Friendlies

For this data set I will first change the country names as I did for the World Cup data. There may be a chance that they are there. I will also change the date format to just the year. I will also exclude the World Cup from this data. I will also split the data so that I will have a data set for the 3 years before each World Cup. Then, I will create a composite score for each team over those three years. It will be

Step 1: Only keep the year. The first four characters will be the year. So I will simply just extract the first 4 characters.

friendlies$date <- substr(friendlies$date, 1, 4)

Step 2: Change the Country names.

friendlies$home_team[friendlies$home_team=="West Germany"]<-"Germany"
friendlies$away_team[friendlies$away_team=="West Germany"]<-"Germany"
friendlies$home_team[friendlies$home_team=="Soviet Union"]<-"Russia"
friendlies$away_team[friendlies$away_team=="Soviet Union"]<-"Russia"
friendlies$home_team[friendlies$home_team=="Czechoslovakia"]<-"Czech Republic"
friendlies$away_team[friendlies$away_team=="Czechoslovakia"]<- "Czech Republic"
friendlies$home_team[friendlies$home_team=="FR Yugoslavia"]<- "Yugoslavia"
friendlies$away_team[friendlies$away_team=="FR Yugoslavia"]<- "Yugoslavia"
friendlies$home_team[friendlies$home_team=="Dutch East Indies"]<- "India"
friendlies$away_team[friendlies$away_team=="Dutch East Indies"]<- "India"
friendlies$home_team[friendlies$home_team=="China PR"]<- "China"
friendlies$away_team[friendlies$away_team=="China PR"]<- "China"
friendlies$home_team[friendlies$home_team=="Turkiye"]<-"Turkey"
friendlies$away_team[friendlies$away_team=="Turkiye"]<-"Turkey"
friendlies$home_team[friendlies$home_team=="Serbia and Montenegro"]<-"Serbia"
friendlies$away_team[friendlies$away_team=="Serbia and Montenegro"]<-"Serbia"
friendlies$home_team[friendlies$home_team=="Republic of Ireland"]<-"Ireland"
friendlies$away_team[friendlies$awat_team=="Republic of Ireland"]<-"Ireland"

Step 3: Make a column for the winners.

friendlies <- mutate(friendlies, winner = ifelse(home_score > away_score, home_team,
                                                 ifelse(away_score > home_score, away_team, NA)))

Step 4: Make a column for the goal differential. I may use this, I may not.

friendlies <- mutate(friendlies, goal_diff = ifelse(home_score > away_score, home_score-away_score,
                                                    ifelse(away_score > home_score, away_score-home_score, 0)))

Step 5: I will only consider wins. A draw doesn’t really show much of anything.

friendlies <- na.omit(friendlies)

Step 6: Only consider the 3 years prior to a World Cup. The first World Cup was 1930. Step 7: I want to count the number of international friendlies won for each country during the three year spans prior to a FIFA World Cup.

friendlies_1930 <- friendlies[friendlies$date %in% c("1927", "1928", "1929"),]
pre_1930 <- as.data.frame(table(friendlies_1930$winner))

friendlies_1934 <- friendlies[friendlies$date %in% c("1931", "1932", "1933"),]
pre_1934 <- as.data.frame(table(friendlies_1934$winner))

friendlies_1938 <- friendlies[friendlies$date %in% c("1935", "1936", "1937"),]
pre_1938 <- as.data.frame(table(friendlies_1938$winner))

friendlies_1950 <- friendlies[friendlies$date %in% c("1947", "1948", "1949"),]
pre_1950 <- as.data.frame(table(friendlies_1950$winner))

friendlies_1954 <- friendlies[friendlies$date %in% c("1951", "1952", "1953"),]
pre_1954 <- as.data.frame(table(friendlies_1954$winner))

friendlies_1958 <- friendlies[friendlies$date %in% c("1955", "1956", "1957"),]
pre_1958 <- as.data.frame(table(friendlies_1958$winner))

friendlies_1962 <- friendlies[friendlies$date %in% c("1959", "1960", "1961"),]
pre_1962 <- as.data.frame(table(friendlies_1962$winner))

friendlies_1966 <- friendlies[friendlies$date %in% c("1963", "1964", "1965"),]
pre_1966 <- as.data.frame(table(friendlies_1966$winner))

friendlies_1970 <- friendlies[friendlies$date %in% c("1967", "1968", "1969"),]
pre_1970 <- as.data.frame(table(friendlies_1970$winner))

friendlies_1974 <- friendlies[friendlies$date %in% c("1971", "1972", "1973"),]
pre_1974 <- as.data.frame(table(friendlies_1974$winner))

friendlies_1978 <- friendlies[friendlies$date %in% c("1975", "1976", "1977"),]
pre_1978 <- as.data.frame(table(friendlies_1978$winner))

friendlies_1982 <- friendlies[friendlies$date %in% c("1979", "1980", "1981"),]
pre_1982 <- as.data.frame(table(friendlies_1982$winner))

friendlies_1986 <- friendlies[friendlies$date %in% c("1983", "1984", "1985"),]
pre_1986 <- as.data.frame(table(friendlies_1986$winner))

friendlies_1990 <- friendlies[friendlies$date %in% c("1987", "1988", "1989"),]
pre_1990 <- as.data.frame(table(friendlies_1990$winner))

friendlies_1994 <- friendlies[friendlies$date %in% c("1991", "1992", "1993"),]
pre_1994 <- as.data.frame(table(friendlies_1994$winner))

friendlies_1998 <- friendlies[friendlies$date %in% c("1995", "1996", "1997"),]
pre_1998 <- as.data.frame(table(friendlies_1998$winner))

friendlies_2002 <- friendlies[friendlies$date %in% c("1999", "2000", "2001"),]
pre_2002 <- as.data.frame(table(friendlies_2002$winner))

friendlies_2006 <- friendlies[friendlies$date %in% c("2003", "2004", "2005"),]
pre_2006 <- as.data.frame(table(friendlies_2006$winner))

friendlies_2010 <- friendlies[friendlies$date %in% c("2007", "2008", "2009"),]
pre_2010 <- as.data.frame(table(friendlies_2010$winner))

friendlies_2014 <- friendlies[friendlies$date %in% c("2011", "2012", "2013"),]
pre_2014 <- as.data.frame(table(friendlies_2014$winner))

friendlies_2018 <- friendlies[friendlies$date %in% c("2015", "2016", "2017"),]
pre_2018 <- as.data.frame(table(friendlies_2018$winner))

friendlies_2022 <- friendlies[friendlies$date %in% c("2019", "2020", "2021"),]
pre_2022 <- as.data.frame(table(friendlies_2022$winner))

Step 8: Add a year column

pre_1930$year <- 1930
pre_1934$year <- 1934
pre_1938$year <- 1938
pre_1950$year <- 1950
pre_1954$year <- 1954
pre_1958$year <- 1958
pre_1962$year <- 1962
pre_1966$year <- 1966
pre_1970$year <- 1970
pre_1974$year <- 1974
pre_1978$year <- 1978
pre_1982$year <- 1982
pre_1986$year <- 1986
pre_1990$year <- 1990
pre_1994$year <- 1994
pre_1998$year <- 1998
pre_2002$year <- 2002
pre_2006$year <- 2006
pre_2010$year <- 2010
pre_2014$year <- 2014
pre_2018$year <- 2018
pre_2022$year <- 2022

Step 9: Rbind

int_friendlies <- rbind(pre_1930,pre_1934,pre_1938,pre_1950,pre_1954,pre_1958,pre_1962,pre_1966,pre_1966,pre_1970,pre_1974,pre_1978,pre_1982,pre_1986,pre_1990,pre_1994,pre_1998,pre_2002,pre_2006,pre_2010,pre_2014,pre_2018,pre_2022)

Step 10: Add a home_differential and an away_differential to the friendlies dataframe.

friendlies$home_diff <- friendlies$home_score - friendlies$away_score
friendlies$away_diff <- friendlies$away_score - friendlies$home_score

Step 11: Remove and rename columns for friendlies

friendlies <- friendlies[,c("date", "home_team", "away_team", "home_score", "away_score", "home_diff", "away_diff")]
colnames(friendlies) <- c("year", "home_team", "away_team", "home_score", "away_score", "home_diff", "away_diff")

Step 12: Pivot Longer.

friendlies <- friendlies%>% pivot_longer(cols = c("home_team","away_team"),
                       names_to = "home_away",
                       values_to = "country")

Step 13: Split friendlies into a home df and an away df.

friendlies_home <- friendlies%>% filter(home_away=="home_team")
friendlies_away <- friendlies%>%filter(home_away=="away_team")

Step 14: Mutate

friendlies_home <- friendlies_home %>%
  group_by(country, year)%>%
  mutate(total_home_goals_scored= sum(home_score))%>%
  mutate(total_home_goals_conceeded = sum(away_score))%>%
  mutate(total_home_goal_diff = total_home_goals_scored - total_home_goals_conceeded)
friendlies_away <- friendlies_away %>%
  group_by(country,year)%>%
  mutate(total_away_goals_scored = sum(away_score))%>%
  mutate(total_away_goals_conceeded = sum(home_score))%>%
  mutate(total_away_goal_diff = total_away_goals_scored - total_away_goals_conceeded)

Step 15: Merge

colnames(int_friendlies) <- c("country", "pre_world_cup_wins", "year")
int_friendlies <-merge(int_friendlies, friendlies_home, by=c("country", "year"))
int_friendlies <- merge(int_friendlies, friendlies_away, by=c("country","year"))

Step 16: Find total goals scored, total goals conceeded, and goal diff.

int_friendlies$total_goals_scored <- int_friendlies$total_home_goals_scored + int_friendlies$total_away_goals_scored
int_friendlies$total_goals_conceeded <- int_friendlies$total_home_goals_conceeded + int_friendlies$total_away_goals_conceeded
int_friendlies$goal_diff <- int_friendlies$total_goals_scored - int_friendlies$total_goals_conceeded

Step 17: Remove unwanted columns, and then remove duplicate rows.

int_friendlies <- int_friendlies[, c("country", "year", "pre_world_cup_wins", "total_goals_scored", "total_goals_conceeded", "goal_diff")]
int_friendlies<- int_friendlies[!duplicated(int_friendlies), ]

World Cup Data

I need to first combine all of the world cup data into one data set. I can do this by adding a year column to each data set and then only keeping the position, the nation, and the year of the World Cup.

Step 1: Add a year column

world_cup_1930$year <- c("1930")
world_cup_1934$year <- c("1934")
world_cup_1938$year <- c("1938")
world_cup_1950$year <- c("1950")
world_cup_1954$year <- c("1954")
world_cup_1958$year <- c("1958")
world_cup_1962$year <- c("1962")
world_cup_1966$year <- c("1966")
world_cup_1970$year <- c("1970")
world_cup_1974$year <- c("1974")
world_cup_1978$year <- c("1978")
world_cup_1982$year <- c("1982")
world_cup_1986$year <- c("1986")
world_cup_1990$year <- c("1990")
world_cup_1994$year <- c("1994")
world_cup_1998$year <- c("1998")
world_cup_2002$year <- c("2002")
world_cup_2006$year <- c("2006")
world_cup_2010$year <- c("2010")
world_cup_2014$year <- c("2014")
world_cup_2018$year <- c("2018")

Step 2: RBind

wc_data <- rbind(world_cup_1930, world_cup_1934, world_cup_1938, world_cup_1950, world_cup_1954, world_cup_1958, world_cup_1962, world_cup_1966, world_cup_1970, world_cup_1974, world_cup_1978, world_cup_1982, world_cup_1986, world_cup_1990, world_cup_1994, world_cup_1998, world_cup_2002, world_cup_2006, world_cup_2010, world_cup_2014, world_cup_2018)

Step 3: Keeping and Renaming Columns

wc_data <- wc_data[c("Position","Team", "year")]
colnames(wc_data)<- c("standing","team","year")

Step 4: There are some countries that need their names changed.

wc_data$team[wc_data$team=="West Germany"]<-"Germany"
wc_data$team[wc_data$team=="Soviet Union"]<-"Russia"
wc_data$team[wc_data$team=="Czechoslovakia"]<-"Czech Republic"
wc_data$team[wc_data$team=="Dutch East Indies"]<-"India"
wc_data$team[wc_data$team=="FR Yugoslavia"]<-"Yugoslavia"
wc_data$team[wc_data$team=="China PR"]<- "China"
wc_data$team[wc_data$team=="Turkiye"]<-"Turkey"
wc_data$team[wc_data$team=="Serbia and Montenegro"]<-"Serbia"
wc_data$team[wc_data$team=="Republic of Ireland"]<-"Ireland"

Country Metrics

World Bank Data

Step 1: Combine World Bank Data. This can be done by first pivoting each dataframe longer. Once they are pivoted longer, I can row bind them all together.

cereal_yield <- cereal_yield_raw
colnames(cereal_yield)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
cereal_yield <- cereal_yield[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
cereal_yield<- cereal_yield %>%
  pivot_longer(!country, names_to = "year", values_to = "cereal_yield")
cereal_yield$country[cereal_yield$country=="West Germany"]<-"Germany"
cereal_yield$country[cereal_yield$country=="Soviet Union"]<-"Russia"
cereal_yield$country[cereal_yield$country=="Czechoslovakia"]<-"Czech Republic"
cereal_yield$country[cereal_yield$country=="Dutch East Indies"]<-"India"
cereal_yield$country[cereal_yield$country=="FR Yugoslavia"]<-"Yugoslavia"
cereal_yield$country[cereal_yield$country=="China PR"]<- "China"
cereal_yield$country[cereal_yield$country=="Turkiye"]<-"Turkey"
cereal_yield$country[cereal_yield$country=="Serbia and Montenegro"]<-"Serbia"
cereal_yield$country[cereal_yield$country=="Republic of Ireland"]<-"Ireland"
cereal_yield$country[cereal_yield$country=="United Kingdom"]<-"England"


gdp<-gdp_raw
colnames(gdp)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
gdp <- gdp[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
gdp<- gdp %>%
  pivot_longer(!country, names_to = "year", values_to = "gdp")
gdp$country[gdp$country=="West Germany"]<-"Germany"
gdp$country[gdp$country=="Soviet Union"]<-"Russia"
gdp$country[gdp$country=="Czechoslovakia"]<-"Czech Republic"
gdp$country[gdp$country=="Dutch East Indies"]<-"India"
gdp$country[gdp$country=="FR Yugoslavia"]<-"Yugoslavia"
gdp$country[gdp$country=="China PR"]<- "China"
gdp$country[gdp$country=="Turkiye"]<-"Turkey"
gdp$country[gdp$country=="Serbia and Montenegro"]<-"Serbia"
gdp$country[gdp$country=="Republic of Ireland"]<-"Ireland"
gdp$country[gdp$country=="United Kingdom"]<-"England"

gov_exp<-gov_exp_raw
colnames(gov_exp)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
gov_exp <- gov_exp[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
gov_exp<- gov_exp %>%
  pivot_longer(!country, names_to = "year", values_to = "gov_exp_on_edu")
gov_exp$country[gov_exp$country=="West Germany"]<-"Germany"
gov_exp$country[gov_exp$country=="Soviet Union"]<-"Russia"
gov_exp$country[gov_exp$country=="Czechoslovakia"]<-"Czech Republic"
gov_exp$country[gov_exp$country=="Dutch East Indies"]<-"India"
gov_exp$country[gov_exp$country=="FR Yugoslavia"]<-"Yugoslavia"
gov_exp$country[gov_exp$country=="China PR"]<- "China"
gov_exp$country[gov_exp$country=="Turkiye"]<-"Turkey"
gov_exp$country[gov_exp$country=="Serbia and Montenegro"]<-"Serbia"
gov_exp$country[gov_exp$country=="Republic of Ireland"]<-"Ireland"
gov_exp$country[gov_exp$country=="United Kingdom"]<-"England"

lit_rate<-literacy_rate_raw
colnames(lit_rate)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
lit_rate <- lit_rate[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
lit_rate<- lit_rate %>%
  pivot_longer(!country, names_to = "year", values_to = "lit_rate")
lit_rate$country[lit_rate$country=="West Germany"]<-"Germany"
lit_rate$country[lit_rate$country=="Soviet Union"]<-"Russia"
lit_rate$country[lit_rate$country=="Czechoslovakia"]<-"Czech Republic"
lit_rate$country[lit_rate$country=="Dutch East Indies"]<-"India"
lit_rate$country[lit_rate$country=="FR Yugoslavia"]<-"Yugoslavia"
lit_rate$country[lit_rate$country=="China PR"]<- "China"
lit_rate$country[lit_rate$country=="Turkiye"]<-"Turkey"
lit_rate$country[lit_rate$country=="Serbia and Montenegro"]<-"Serbia"
lit_rate$country[lit_rate$country=="Republic of Ireland"]<-"Ireland"
lit_rate$country[lit_rate$country=="United Kingdom"]<-"England"

mortality_rate<-mortality_rate_raw
colnames(mortality_rate)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
mortality_rate <- mortality_rate[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
mortality_rate<- mortality_rate %>%
  pivot_longer(!country, names_to = "year", values_to = "mortality_rate")
mortality_rate$country[mortality_rate$country=="West Germany"]<-"Germany"
mortality_rate$country[mortality_rate$country=="Soviet Union"]<-"Russia"
mortality_rate$country[mortality_rate$country=="Czechoslovakia"]<-"Czech Republic"
mortality_rate$country[mortality_rate$country=="Dutch East Indies"]<-"India"
mortality_rate$country[mortality_rate$country=="FR Yugoslavia"]<-"Yugoslavia"
mortality_rate$country[mortality_rate$country=="China PR"]<- "China"
mortality_rate$country[mortality_rate$country=="Turkiye"]<-"Turkey"
mortality_rate$country[mortality_rate$country=="Serbia and Montenegro"]<-"Serbia"
mortality_rate$country[mortality_rate$country=="Republic of Ireland"]<-"Ireland"
mortality_rate$country[mortality_rate$country=="United Kingdom"]<-"England"

pop<-population_raw
colnames(pop)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
pop <- pop[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
pop<- pop %>%
  pivot_longer(!country, names_to = "year", values_to = "population")
pop$country[pop$country=="West Germany"]<-"Germany"
pop$country[pop$country=="Soviet Union"]<-"Russia"
pop$country[pop$country=="Czechoslovakia"]<-"Czech Republic"
pop$country[pop$country=="Dutch East Indies"]<-"India"
pop$country[pop$country=="FR Yugoslavia"]<-"Yugoslavia"
pop$country[pop$country=="China PR"]<- "China"
pop$country[pop$country=="Turkiye"]<-"Turkey"
pop$country[pop$country=="Serbia and Montenegro"]<-"Serbia"
pop$country[pop$country=="Republic of Ireland"]<-"Ireland"
pop$country[pop$country=="United Kingdom"]<-"England"

rural<-rural_pop_raw
colnames(rural)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
rural <- rural[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
rural<- rural %>%
  pivot_longer(!country, names_to = "year", values_to = "rural_pop_percentage")
rural$country[rural$country=="West Germany"]<-"Germany"
rural$country[rural$country=="Soviet Union"]<-"Russia"
rural$country[rural$country=="Czechoslovakia"]<-"Czech Republic"
rural$country[rural$country=="Dutch East Indies"]<-"India"
rural$country[rural$country=="FR Yugoslavia"]<-"Yugoslavia"
rural$country[rural$country=="China PR"]<- "China"
rural$country[rural$country=="Turkiye"]<-"Turkey"
rural$country[rural$country=="Serbia and Montenegro"]<-"Serbia"
rural$country[rural$country=="Republic of Ireland"]<-"Ireland"
rural$country[rural$country=="United Kingdom"]<-"England"

urban<-urban_pop_raw
colnames(urban)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
urban <- urban[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
urban<- urban %>%
  pivot_longer(!country, names_to = "year", values_to = "urban_pop_percentage")
urban$country[urban$country=="West Germany"]<-"Germany"
urban$country[urban$country=="Soviet Union"]<-"Russia"
urban$country[urban$country=="Czechoslovakia"]<-"Czech Republic"
urban$country[urban$country=="Dutch East Indies"]<-"India"
urban$country[urban$country=="FR Yugoslavia"]<-"Yugoslavia"
urban$country[urban$country=="China PR"]<- "China"
urban$country[urban$country=="Turkiye"]<-"Turkey"
urban$country[urban$country=="Serbia and Montenegro"]<-"Serbia"
urban$country[urban$country=="Republic of Ireland"]<-"Ireland"
urban$country[urban$country=="United Kingdom"]<-"England"

colnames(farm_land_percentage)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
farm_land_percentage <- farm_land_percentage[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
farm_land_percentage<- farm_land_percentage %>%
  pivot_longer(!country, names_to = "year", values_to = "farm_land_percentage")
farm_land_percentage$country[farm_land_percentage$country=="West Germany"]<-"Germany"
farm_land_percentage$country[farm_land_percentage$country=="Soviet Union"]<-"Russia"
farm_land_percentage$country[farm_land_percentage$country=="Czechoslovakia"]<-"Czech Republic"
farm_land_percentage$country[farm_land_percentage$country=="Dutch East Indies"]<-"India"
farm_land_percentage$country[farm_land_percentage$country=="FR Yugoslavia"]<-"Yugoslavia"
farm_land_percentage$country[farm_land_percentage$country=="China PR"]<- "China"
farm_land_percentage$country[farm_land_percentage$country=="Turkiye"]<-"Turkey"
farm_land_percentage$country[farm_land_percentage$country=="Serbia and Montenegro"]<-"Serbia"
farm_land_percentage$country[farm_land_percentage$country=="Republic of Ireland"]<-"Ireland"
farm_land_percentage$country[farm_land_percentage$country=="United Kingdom"]<-"England"

colnames(life_expectancy)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
life_expectancy <- life_expectancy[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
life_expectancy<- life_expectancy %>%
  pivot_longer(!country, names_to = "year", values_to = "life_expectancy")
life_expectancy$country[life_expectancy$country=="West Germany"]<-"Germany"
life_expectancy$country[life_expectancy$country=="Soviet Union"]<-"Russia"
life_expectancy$country[life_expectancy$country=="Czechoslovakia"]<-"Czech Republic"
life_expectancy$country[life_expectancy$country=="Dutch East Indies"]<-"India"
life_expectancy$country[life_expectancy$country=="FR Yugoslavia"]<-"Yugoslavia"
life_expectancy$country[life_expectancy$country=="China PR"]<- "China"
life_expectancy$country[life_expectancy$country=="Turkiye"]<-"Turkey"
life_expectancy$country[life_expectancy$country=="Serbia and Montenegro"]<-"Serbia"
life_expectancy$country[life_expectancy$country=="Republic of Ireland"]<-"Ireland"
life_expectancy$country[life_expectancy$country=="United Kingdom"]<-"England"

colnames(teen_fertility_rate)<- c("country","country_code", "indicator","indicator_code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")
teen_fertility_rate <- teen_fertility_rate[c("country","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020")]
teen_fertility_rate <- teen_fertility_rate %>%
  pivot_longer(!country, names_to = "year", values_to = "teen_fertility_rate")
teen_fertility_rate$country[teen_fertility_rate$country=="West Germany"]<-"Germany"
teen_fertility_rate$country[teen_fertility_rate$country=="Soviet Union"]<-"Russia"
teen_fertility_rate$country[teen_fertility_rate$country=="Czechoslovakia"]<-"Czech Republic"
teen_fertility_rate$country[teen_fertility_rate$country=="Dutch East Indies"]<-"India"
teen_fertility_rate$country[teen_fertility_rate$country=="FR Yugoslavia"]<-"Yugoslavia"
teen_fertility_rate$country[teen_fertility_rate$country=="China PR"]<- "China"
teen_fertility_rate$country[teen_fertility_rate$country=="Turkiye"]<-"Turkey"
teen_fertility_rate$country[teen_fertility_rate$country=="Serbia and Montenegro"]<-"Serbia"
teen_fertility_rate$country[teen_fertility_rate$country=="Republic of Ireland"]<-"Ireland"
teen_fertility_rate$country[teen_fertility_rate$country=="United Kingdom"]<-"England"

Step 2: Merge The Data

country_metrics <- merge(cereal_yield,gdp, by=c("country","year"))
country_metrics <- merge(country_metrics,gov_exp, by=c("country","year"))
country_metrics <- merge(country_metrics,lit_rate, by=c("country","year"))
country_metrics <- merge(country_metrics,mortality_rate, by=c("country","year"))
country_metrics <- merge(country_metrics,pop, by=c("country","year"))
country_metrics <- merge(country_metrics,rural, by=c("country","year"))
country_metrics <- merge(country_metrics,urban, by=c("country","year"))
country_metrics <- merge(country_metrics, farm_land_percentage, by=c("country","year"))
country_metrics <- merge(country_metrics, teen_fertility_rate, by=c("country","year"))
country_metrics <- merge(country_metrics, life_expectancy, by=c("country","year"))

Step 3: Change United Kingdom to England

country_metrics$country[country_metrics$country=="United Kingdom"]<-"England"

Langauge Data

I want to do two things with this data. First, I want to count the amount of spoken languages for a country. Second, I want to focus on what the primary language is.

I believe that India has the most from looking through the data with 16.

Step 1: This data is very dirty. There are a lot of unneeded symbols and words. I will remove them.

# Remove Numbers
languages$Languages.Spoken <- gsub("[[:digit:]]", "", languages$Languages.Spoken)
# Remove %
languages$Languages.Spoken <- gsub("%", "", languages$Languages.Spoken)
# Remove () inside
languages$Languages.Spoken <- gsub("\\s*\\([^\\)]+\\)", "", languages$Languages.Spoken)
# Remove () outside
languages$Languages.Spoken <- gsub("\\(", "", languages$Languages.Spoken)
languages$Languages.Spoken <- gsub("\\)", "", languages$Languages.Spoken)
# Remove "and"
languages$Languages.Spoken <- gsub("\\band\\b", "", languages$Languages.Spoken)
# Remove semi-colon
languages$Languages.Spoken <-gsub(";", "", languages$Languages.Spoken)
# Remove comma
languages$Languages.Spoken <- gsub(",","", languages$Languages.Spoken)
# Remove Periods
languages$Languages.Spoken <- gsub("\\.", "", languages$Languages.Spoken)
# Remove Dashes
languages$Languages.Spoken <- gsub("-", "", languages$Languages.Spoken)

Step 2: I will now split the column at every space.

# Splitting the words at spaces
languages[c("lang_one", "lang_two", "lang_three", "lang_four", "lang_five", "lang_six", "lang_seven", "lang_eight", "lang_nine", "lang_ten", "lang_eleven", "lang_twelve", "lang_thirteen", "lang_fourteen", "lang_fifteen", "lang_sixteen", "lang_seventeen")] <- str_split_fixed(languages$Languages.Spoken, ' ', 17)

Step 3: If a word is not capitalized that means it is not a language and thus will be turned into an NA.

# lang one
languages$lang_one<- ifelse (str_detect(languages$lang_one, "[[:upper:]]"), languages$lang_one, NA) 
# lang two
languages$lang_two<- ifelse (str_detect(languages$lang_two, "[[:upper:]]"), languages$lang_two, NA)
# lang three
languages$lang_three<- ifelse (str_detect(languages$lang_three, "[[:upper:]]"), languages$lang_three, NA)
# lang four
languages$lang_four<- ifelse (str_detect(languages$lang_four, "[[:upper:]]"), languages$lang_four, NA)
# lang five
languages$lang_five<- ifelse (str_detect(languages$lang_five, "[[:upper:]]"), languages$lang_five, NA)
# lang six
languages$lang_six<- ifelse (str_detect(languages$lang_six, "[[:upper:]]"), languages$lang_six, NA)
#lang seven
languages$lang_seven<- ifelse (str_detect(languages$lang_seven, "[[:upper:]]"), languages$lang_seven, NA)
# lang eight
languages$lang_eight<- ifelse (str_detect(languages$lang_eight, "[[:upper:]]"), languages$lang_eight, NA)
# lang nine
languages$lang_nine<- ifelse (str_detect(languages$lang_nine, "[[:upper:]]"), languages$lang_nine, NA)
# lang ten
languages$lang_ten<- ifelse (str_detect(languages$lang_ten, "[[:upper:]]"), languages$lang_ten, NA)
# lang eleven
languages$lang_eleven<- ifelse (str_detect(languages$lang_eleven, "[[:upper:]]"), languages$lang_eleven, NA)
# lang twelve
languages$lang_twelve<- ifelse (str_detect(languages$lang_twelve, "[[:upper:]]"), languages$lang_twelve, NA)
# lang thirteen
languages$lang_thirteen<- ifelse (str_detect(languages$lang_thirteen, "[[:upper:]]"), languages$lang_thirteen, NA)
# lang fourteen
languages$lang_fourteen<- ifelse (str_detect(languages$lang_fourteen, "[[:upper:]]"), languages$lang_fourteen, NA)
# lang fifteen
languages$lang_fifteen<- ifelse (str_detect(languages$lang_fifteen, "[[:upper:]]"), languages$lang_fifteen, NA)
# lang sixteen 
languages$lang_sixteen<- ifelse (str_detect(languages$lang_sixteen, "[[:upper:]]"), languages$lang_sixteen, NA)
# lang seventeen
languages$lang_seventeen<- ifelse (str_detect(languages$lang_seventeen, "[[:upper:]]"), languages$lang_seventeen, NA)

Step 4: Change the column names.

colnames(languages) <- c("country", "spkn_langs", "lang_one", "lang_two", "lang_three", "lang_four", "lang_five", "lang_six", "lang_seven", "lang_eight", "lang_nine", "lang_ten", "lang_eleven", "lang_twelve", "lang_thirteen", "lang_fourteen", "lang_fifteen", "lang_sixteen", "lang_seventeen")
languages<- languages[, c("country", "lang_one", "lang_two", "lang_three", "lang_four", "lang_five", "lang_six", "lang_seven", "lang_eight", "lang_nine", "lang_ten", "lang_eleven", "lang_twelve", "lang_thirteen", "lang_fourteen", "lang_fifteen", "lang_sixteen", "lang_seventeen")]

Step 5: Change country names.

languages$country[languages$country=="West Germany"]<-"Germany"
languages$country[languages$country=="Soviet Union"]<-"Russia"
languages$country[languages$country=="Czechoslovakia"]<-"Czech Republic"
languages$country[languages$country=="Dutch East Indies"]<-"India"
languages$country[languages$country=="FR Yugoslavia"]<-"Yugoslavia"
languages$country[languages$country=="China PR"]<- "China"
languages$country[languages$country=="Turkiye"]<-"Turkey"
languages$country[languages$country=="Serbia and Montenegro"]<-"Serbia"
languages$country[languages$country=="Republic of Ireland"]<-"Ireland"
languages$country[languages$country=="United Kingdom"]<-"England"

Step 6: Find how many languages a country speaks.

languages$amount_spoken <- 17 - rowSums(is.na(languages))

Step 7: Keep only the main language and the amount spoken.

languages<- languages[, c("country", "lang_one", "amount_spoken")]

Step 8: Merge with country metrics.

country_metrics <- merge(country_metrics, languages, by=c("country"))

Combining the Tidy Data Sets

Step 1: Remove the education variables. They are full of NAs.

country_metrics <- country_metrics[, c("country", "year", "cereal_yield", "gdp", "mortality_rate", "population", "rural_pop_percentage", "urban_pop_percentage", "farm_land_percentage", "teen_fertility_rate", "life_expectancy", "lang_one", "amount_spoken")]

Step 2: Remove the years of the World Cup and the World War II years

country_metrics<- subset(country_metrics, year != c("1930", "1934", "1938", "1939", "1940", "1941","1942", "1943", "1944", "1945", "1946", "1950", "1954", "1958", "1962", "1966", "1970", "1974", "1978", "1982", "1986", "1990", "1994", "1998", "2002", "2006", "2010", "2014", "2018"))
## Warning in year != c("1930", "1934", "1938", "1939", "1940", "1941", "1942", :
## longer object length is not a multiple of shorter object length

Step 3: Turn the 3 prior years into the year of the World Cup.

country_metrics$year[country_metrics$year=="1927"]<-"1930"
country_metrics$year[country_metrics$year=="1928"]<-"1930"
country_metrics$year[country_metrics$year=="1929"]<-"1930"

country_metrics$year[country_metrics$year=="1931"]<-"1934"
country_metrics$year[country_metrics$year=="1932"]<-"1934"
country_metrics$year[country_metrics$year=="1933"]<-"1934"

country_metrics$year[country_metrics$year=="1935"]<-"1938"
country_metrics$year[country_metrics$year=="1936"]<-"1938"
country_metrics$year[country_metrics$year=="1937"]<-"1938"

country_metrics$year[country_metrics$year=="1947"]<-"1950"
country_metrics$year[country_metrics$year=="1948"]<-"1950"
country_metrics$year[country_metrics$year=="1949"]<-"1950"

country_metrics$year[country_metrics$year=="1951"]<-"1954"
country_metrics$year[country_metrics$year=="1952"]<-"1954"
country_metrics$year[country_metrics$year=="1953"]<-"1954"

country_metrics$year[country_metrics$year=="1955"]<-"1958"
country_metrics$year[country_metrics$year=="1956"]<-"1958"
country_metrics$year[country_metrics$year=="1957"]<-"1958"

country_metrics$year[country_metrics$year=="1959"]<-"1962"
country_metrics$year[country_metrics$year=="1960"]<-"1962"
country_metrics$year[country_metrics$year=="1961"]<-"1962"

country_metrics$year[country_metrics$year=="1963"]<-"1966"
country_metrics$year[country_metrics$year=="1964"]<-"1966"
country_metrics$year[country_metrics$year=="1965"]<-"1966"

country_metrics$year[country_metrics$year=="1967"]<-"1970"
country_metrics$year[country_metrics$year=="1968"]<-"1970"
country_metrics$year[country_metrics$year=="1969"]<-"1970"

country_metrics$year[country_metrics$year=="1971"]<-"1974"
country_metrics$year[country_metrics$year=="1972"]<-"1974"
country_metrics$year[country_metrics$year=="1973"]<-"1974"

country_metrics$year[country_metrics$year=="1975"]<-"1978"
country_metrics$year[country_metrics$year=="1976"]<-"1978"
country_metrics$year[country_metrics$year=="1977"]<-"1978"

country_metrics$year[country_metrics$year=="1979"]<-"1982"
country_metrics$year[country_metrics$year=="1980"]<-"1982"
country_metrics$year[country_metrics$year=="1981"]<-"1982"

country_metrics$year[country_metrics$year=="1983"]<-"1986"
country_metrics$year[country_metrics$year=="1984"]<-"1986"
country_metrics$year[country_metrics$year=="1985"]<-"1986"

country_metrics$year[country_metrics$year=="1987"]<-"1990"
country_metrics$year[country_metrics$year=="1988"]<-"1990"
country_metrics$year[country_metrics$year=="1989"]<-"1990"

country_metrics$year[country_metrics$year=="1991"]<-"1994"
country_metrics$year[country_metrics$year=="1992"]<-"1994"
country_metrics$year[country_metrics$year=="1993"]<-"1994"

country_metrics$year[country_metrics$year=="1995"]<-"1998"
country_metrics$year[country_metrics$year=="1996"]<-"1998"
country_metrics$year[country_metrics$year=="1997"]<-"1998"

country_metrics$year[country_metrics$year=="1999"]<-"2002"
country_metrics$year[country_metrics$year=="2000"]<-"2002"
country_metrics$year[country_metrics$year=="2001"]<-"2002"

country_metrics$year[country_metrics$year=="2003"]<-"2006"
country_metrics$year[country_metrics$year=="2004"]<-"2006"
country_metrics$year[country_metrics$year=="2005"]<-"2006"

country_metrics$year[country_metrics$year=="2007"]<-"2010"
country_metrics$year[country_metrics$year=="2008"]<-"2010"
country_metrics$year[country_metrics$year=="2009"]<-"2010"

country_metrics$year[country_metrics$year=="2011"]<-"2014"
country_metrics$year[country_metrics$year=="2012"]<-"2014"
country_metrics$year[country_metrics$year=="2013"]<-"2014"

country_metrics$year[country_metrics$year=="2015"]<-"2018"
country_metrics$year[country_metrics$year=="2016"]<-"2018"
country_metrics$year[country_metrics$year=="2017"]<-"2018"

country_metrics$year[country_metrics$year=="2019"]<-"2022"
country_metrics$year[country_metrics$year=="2020"]<-"2022"
country_metrics$year[country_metrics$year=="2021"]<-"2022"

Step 4: Get the averages.

country_metrics <- country_metrics %>%
  group_by(year, country) %>%
  mutate(avg_cereal = mean(cereal_yield, na.rm = TRUE))%>%
  mutate(avg_gdp = mean(gdp, na.rm = TRUE))%>%
  mutate(avg_mortality = mean(mortality_rate, na.rm = TRUE))%>%
  mutate(avg_pop = mean(population, na.rm=TRUE))%>%
  mutate(avg_rural_pop = mean(rural_pop_percentage, na.rm=TRUE))%>%
  mutate(avg_urban_pop = mean(urban_pop_percentage, na.rm=TRUE))%>%
  mutate(avg_farmland = mean(farm_land_percentage, na.rm=TRUE))%>%
  mutate(avg_teen_fert = mean(teen_fertility_rate, na.rm=TRUE))%>%
  mutate(avg_life_exp = mean(life_expectancy, na.rm=TRUE))

Step 5: Now that I have my averages, that is all that I need. I only need one observation for each country for each World Cup year. 1962 & 2022 have one repeat row. 1966 up to 2018 have 3 repeat rows. I will remove these.

country_metrics <- country_metrics[, c("country", "year", "lang_one", "amount_spoken", "avg_cereal", "avg_gdp", "avg_mortality", "avg_pop", "avg_rural_pop", "avg_urban_pop", "avg_farmland", "avg_teen_fert", "avg_life_exp")]
country_metrics <- country_metrics[!duplicated(country_metrics), ]

Step 6: Remove NAs.

country_metrics<- na.omit(country_metrics)

Step 7: Merge the International friendlies with the country metrics.

country_metrics <- merge(country_metrics, int_friendlies, by=c("country", "year"))

Step 8: Split the data. The 2022 stuff will be alone.

cm_test_2022 <- country_metrics %>%
  filter(year == "2022")

country_metrics_train <- country_metrics %>%
  filter(year != "2022")

Step 9: Merge World Cup data with country metrics

colnames(wc_data)<- c("standing","country","year")
country_metrics_train <- merge(country_metrics_train, wc_data, by=c("country", "year"))

Step 10: The World Cup ranks should be reversed so that it will be easier to read the plots.

country_metrics_train$rank <- 33 - country_metrics_train$standing 

Data Set 2 With Corruption Data

Corruption Perception Index

Step 1: Change column names.

colnames(corruption_index)<- c("country", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015")

Step 2: For the other data I need to drop the rank column and also change the column names.

colnames(cpi_2016) <- c("country", "2016", "rank")
colnames(cpi_2017) <- c("country", "2017", "rank")
colnames(cpi_2018) <- c("country", "2018", "rank")
colnames(cpi_2019) <- c("country", "2019", "rank")
colnames(cpi_2020) <- c("country", "2020", "rank")

cpi_2016 <- cpi_2016[, c("country", "2016")]
cpi_2017 <- cpi_2017[, c("country", "2017")]
cpi_2018 <- cpi_2018[, c("country", "2018")]
cpi_2019 <- cpi_2019[, c("country", "2019")]
cpi_2020 <- cpi_2020[, c("country", "2020")]

Step 3: Merge the data sets together.

corruption_index <- merge(corruption_index, cpi_2016, by=c("country"))
## Warning in merge.data.frame(corruption_index, cpi_2016, by = c("country")):
## column names 'NA', 'NA', 'NA', 'NA', 'NA' are duplicated in the result
corruption_index <- merge(corruption_index, cpi_2017, by=c("country"))
## Warning in merge.data.frame(corruption_index, cpi_2017, by = c("country")):
## column names 'NA', 'NA', 'NA', 'NA', 'NA' are duplicated in the result
corruption_index <- merge(corruption_index, cpi_2018, by=c("country"))
## Warning in merge.data.frame(corruption_index, cpi_2018, by = c("country")):
## column names 'NA', 'NA', 'NA', 'NA', 'NA' are duplicated in the result
corruption_index <- merge(corruption_index, cpi_2019, by=c("country"))
## Warning in merge.data.frame(corruption_index, cpi_2019, by = c("country")):
## column names 'NA', 'NA', 'NA', 'NA', 'NA' are duplicated in the result
corruption_index <- merge(corruption_index, cpi_2020, by=c("country"))
## Warning in merge.data.frame(corruption_index, cpi_2020, by = c("country")):
## column names 'NA', 'NA', 'NA', 'NA', 'NA' are duplicated in the result

Step 4: The Merge caused some weird columns that are only NAs. Everything else is correct so I will drop these columns.

corruption_index <- corruption_index[-c(20:25)]

Step 5: Before 2012, CPI was on a range of [0,10]. I will change all of these by multiplying by 10. I will also being changing the data type from character to numeric.

corruption_index$`1998` <- ifelse(is.na(corruption_index$`1998`),NA, as.numeric(corruption_index$`1998`)*10)
## Warning in ifelse(is.na(corruption_index$`1998`), NA,
## as.numeric(corruption_index$`1998`) * : NAs introduced by coercion
corruption_index$`1999` <- ifelse(is.na(corruption_index$`1999`),NA, as.numeric(corruption_index$`1999`)*10)
## Warning in ifelse(is.na(corruption_index$`1999`), NA,
## as.numeric(corruption_index$`1999`) * : NAs introduced by coercion
corruption_index$`2000` <- ifelse(is.na(corruption_index$`2000`),NA, as.numeric(corruption_index$`2000`)*10)
## Warning in ifelse(is.na(corruption_index$`2000`), NA,
## as.numeric(corruption_index$`2000`) * : NAs introduced by coercion
corruption_index$`2001` <- ifelse(is.na(corruption_index$`2001`),NA, as.numeric(corruption_index$`2001`)*10)
## Warning in ifelse(is.na(corruption_index$`2001`), NA,
## as.numeric(corruption_index$`2001`) * : NAs introduced by coercion
corruption_index$`2002` <- ifelse(is.na(corruption_index$`2002`),NA, as.numeric(corruption_index$`2002`)*10)
## Warning in ifelse(is.na(corruption_index$`2002`), NA,
## as.numeric(corruption_index$`2002`) * : NAs introduced by coercion
corruption_index$`2003` <- ifelse(is.na(corruption_index$`2003`),NA, as.numeric(corruption_index$`2003`)*10)
## Warning in ifelse(is.na(corruption_index$`2003`), NA,
## as.numeric(corruption_index$`2003`) * : NAs introduced by coercion
corruption_index$`2004` <- ifelse(is.na(corruption_index$`2004`),NA, as.numeric(corruption_index$`2004`)*10)
## Warning in ifelse(is.na(corruption_index$`2004`), NA,
## as.numeric(corruption_index$`2004`) * : NAs introduced by coercion
corruption_index$`2005` <- ifelse(is.na(corruption_index$`2005`),NA, as.numeric(corruption_index$`2005`)*10)
## Warning in ifelse(is.na(corruption_index$`2005`), NA,
## as.numeric(corruption_index$`2005`) * : NAs introduced by coercion
corruption_index$`2006` <- ifelse(is.na(corruption_index$`2006`),NA, as.numeric(corruption_index$`2006`)*10)
## Warning in ifelse(is.na(corruption_index$`2006`), NA,
## as.numeric(corruption_index$`2006`) * : NAs introduced by coercion
corruption_index$`2007` <- ifelse(is.na(corruption_index$`2007`),NA, as.numeric(corruption_index$`2007`)*10)
## Warning in ifelse(is.na(corruption_index$`2007`), NA,
## as.numeric(corruption_index$`2007`) * : NAs introduced by coercion
corruption_index$`2008` <- ifelse(is.na(corruption_index$`2008`),NA, as.numeric(corruption_index$`2008`)*10)
## Warning in ifelse(is.na(corruption_index$`2008`), NA,
## as.numeric(corruption_index$`2008`) * : NAs introduced by coercion
corruption_index$`2009` <- ifelse(is.na(corruption_index$`2009`),NA, as.numeric(corruption_index$`2009`)*10)
## Warning in ifelse(is.na(corruption_index$`2009`), NA,
## as.numeric(corruption_index$`2009`) * : NAs introduced by coercion
corruption_index$`2010` <- ifelse(is.na(corruption_index$`2010`),NA, as.numeric(corruption_index$`2010`)*10)
## Warning in ifelse(is.na(corruption_index$`2010`), NA,
## as.numeric(corruption_index$`2010`) * : NAs introduced by coercion
corruption_index$`2011` <- ifelse(is.na(corruption_index$`2011`),NA, as.numeric(corruption_index$`2011`)*10)
## Warning in ifelse(is.na(corruption_index$`2011`), NA,
## as.numeric(corruption_index$`2011`) * : NAs introduced by coercion
corruption_index$`2012` <- ifelse(is.na(corruption_index$`2012`),NA, as.numeric(corruption_index$`2012`))
## Warning in ifelse(is.na(corruption_index$`2012`), NA,
## as.numeric(corruption_index$`2012`)): NAs introduced by coercion
corruption_index$`2013` <- ifelse(is.na(corruption_index$`2013`),NA, as.numeric(corruption_index$`2013`))
## Warning in ifelse(is.na(corruption_index$`2013`), NA,
## as.numeric(corruption_index$`2013`)): NAs introduced by coercion
corruption_index$`2014` <- ifelse(is.na(corruption_index$`2014`),NA, as.numeric(corruption_index$`2014`))
## Warning in ifelse(is.na(corruption_index$`2014`), NA,
## as.numeric(corruption_index$`2014`)): NAs introduced by coercion
corruption_index$`2015` <- ifelse(is.na(corruption_index$`2015`),NA, as.numeric(corruption_index$`2015`))
## Warning in ifelse(is.na(corruption_index$`2015`), NA,
## as.numeric(corruption_index$`2015`)): NAs introduced by coercion

Step 6: Pivot Longer

corruption_index<- corruption_index %>%
  pivot_longer(!country, names_to = "year", values_to = "corruption_index")

Step 7: Turn the three previous years into the World Cup years.

corruption_index$year[corruption_index$year=="1999"]<-"2002"
corruption_index$year[corruption_index$year=="2000"]<-"2002"
corruption_index$year[corruption_index$year=="2001"]<-"2002"

corruption_index$year[corruption_index$year=="2003"]<-"2006"
corruption_index$year[corruption_index$year=="2004"]<-"2006"
corruption_index$year[corruption_index$year=="2005"]<-"2006"

corruption_index$year[corruption_index$year=="2007"]<-"2010"
corruption_index$year[corruption_index$year=="2008"]<-"2010"
corruption_index$year[corruption_index$year=="2009"]<-"2010"

corruption_index$year[corruption_index$year=="2011"]<-"2014"
corruption_index$year[corruption_index$year=="2012"]<-"2014"
corruption_index$year[corruption_index$year=="2013"]<-"2014"

corruption_index$year[corruption_index$year=="2015"]<-"2018"
corruption_index$year[corruption_index$year=="2016"]<-"2018"
corruption_index$year[corruption_index$year=="2017"]<-"2018"

corruption_index$year[corruption_index$year=="2019"]<-"2022"
corruption_index$year[corruption_index$year=="2020"]<-"2022"
corruption_index$year[corruption_index$year=="2021"]<-"2022"

Step 8: Find the averages.

corruption_index <- na.omit(corruption_index)
corruption_index <- corruption_index %>%
  group_by(year, country)%>%
  mutate(avg_corruption = mean(corruption_index))
corruption_index <- corruption_index[, c("country", "year", "avg_corruption")]
corruption_index <- corruption_index[!duplicated(corruption_index), ]

Step 9: Merge with Country Metrics

cm_w_corrup <- merge(country_metrics, corruption_index, by=c("country","year"))

Step 10: Split the data

cm_corrupt_2022_test <- cm_w_corrup %>%
  filter(year == "2022")
country_metrics_corrupt_train <- cm_w_corrup %>%
  filter(year != "2022")

Step 9: Merge with World Cup data

country_metrics_corrupt_train <- merge(country_metrics_corrupt_train, wc_data, by=c("country", "year"))

For Both Datasets

I need a way to effectively compare countries over years. For example the global GDP is much higher now than it was 1930. A country in 2020 with a low gdp, could possibly have one of the highest gdps if they were compared to countries during 1930. There are 2 possible solutions here. The first is that we normalize the data for each year. To explain this, we will stick with the gdp example. We will take any of the years, then we will find the max(gdp) of that year. That will now be a 1. We will take the min(gdp) from the same year, this will now be a zero. Then everything in between will be scaled to fit between 0 and 1. We can do this for every year. The only downfall to normalizing the data is that it will reduce the effect of outliers.

The second option is to standardize the data. This just means that the mean will be as close to 0 as possible, and a standard deviation will be 1. Standardization might not help as well with the comparisons over time, but it does handle the outliers better.

The best option for this data is to normalize the data. I will only normalize the data that I had previously taken the 3 year averages on. I will not normalize the amount of goals scored, goals conceded, goal difference, or amount languages spoken.

Part 1: Make the Normalization Function.

min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

Part 2: Country Metrics Test without Corruption Data

country_metrics_train <- country_metrics_train %>%
  group_by(year)%>%
  mutate(avg_cereal = min_max_norm(avg_cereal))%>%
  mutate(avg_gdp = min_max_norm(avg_gdp))%>%
  mutate(avg_mortality = min_max_norm(avg_mortality))%>%
  mutate(avg_pop = min_max_norm(avg_pop))%>%
  mutate(avg_rural_pop = min_max_norm(avg_rural_pop))%>%
  mutate(avg_urban_pop = min_max_norm(avg_urban_pop))%>%
  mutate(avg_farmland = min_max_norm(avg_farmland))%>%
  mutate(avg_teen_fert = min_max_norm(avg_teen_fert))%>%
  mutate(avg_life_exp = min_max_norm(avg_life_exp))
cm_test_2022 <- cm_test_2022 %>%
  group_by(year)%>%
  mutate(avg_cereal = min_max_norm(avg_cereal))%>%
  mutate(avg_gdp = min_max_norm(avg_gdp))%>%
  mutate(avg_mortality = min_max_norm(avg_mortality))%>%
  mutate(avg_pop = min_max_norm(avg_pop))%>%
  mutate(avg_rural_pop = min_max_norm(avg_rural_pop))%>%
  mutate(avg_urban_pop = min_max_norm(avg_urban_pop))%>%
  mutate(avg_farmland = min_max_norm(avg_farmland))%>%
  mutate(avg_teen_fert = min_max_norm(avg_teen_fert))%>%
  mutate(avg_life_exp = min_max_norm(avg_life_exp))

Part 2: With Corruption Data

country_metrics_corrupt_train$rank <- 33 - country_metrics_corrupt_train$standing
country_metrics_corrupt_train <- country_metrics_corrupt_train %>%
  group_by(year)%>%
  mutate(avg_cereal = min_max_norm(avg_cereal))%>%
  mutate(avg_gdp = min_max_norm(avg_gdp))%>%
  mutate(avg_mortality = min_max_norm(avg_mortality))%>%
  mutate(avg_pop = min_max_norm(avg_pop))%>%
  mutate(avg_rural_pop = min_max_norm(avg_rural_pop))%>%
  mutate(avg_urban_pop = min_max_norm(avg_urban_pop))%>%
  mutate(avg_farmland = min_max_norm(avg_farmland))%>%
  mutate(avg_teen_fert = min_max_norm(avg_teen_fert))%>%
  mutate(avg_life_exp = min_max_norm(avg_life_exp))
cm_corrupt_2022_test <- cm_corrupt_2022_test %>%
  group_by(year)%>%
  mutate(avg_cereal = min_max_norm(avg_cereal))%>%
  mutate(avg_gdp = min_max_norm(avg_gdp))%>%
  mutate(avg_mortality = min_max_norm(avg_mortality))%>%
  mutate(avg_pop = min_max_norm(avg_pop))%>%
  mutate(avg_rural_pop = min_max_norm(avg_rural_pop))%>%
  mutate(avg_urban_pop = min_max_norm(avg_urban_pop))%>%
  mutate(avg_farmland = min_max_norm(avg_farmland))%>%
  mutate(avg_teen_fert = min_max_norm(avg_teen_fert))%>%
  mutate(avg_life_exp = min_max_norm(avg_life_exp))

Part 3: Unite the country and the year column.

country_metrics_train <-unite(country_metrics_train, team, c(country,year))
cm_test_2022<- unite(cm_test_2022, team, c(country,year))
country_metrics_corrupt_train<- unite(country_metrics_corrupt_train, team, c(country,year))
cm_corrupt_2022_test<- unite(cm_corrupt_2022_test, team, c(country,year))

Data Explotation & Visualization

It is important to note that for most of these visualizations I will use the dataset that does not have corruption data with it. This is because the other dataset is pretty much a subset of this one.

Univariate Analysis

I will do general plots, and I will plot for the top four placements.

Lang_one

plot_data <- country_metrics_train %>%
  filter(rank !=1)%>%
  filter(rank !=2)%>%
  filter(rank !=3)%>%
  filter(rank !=4)%>%
  filter(rank !=5)%>%
  filter(rank !=6)%>%
  filter(rank !=7)%>%
  filter(rank !=8)%>%
  filter(rank !=9)%>%
  filter(rank !=10)%>%
  filter(rank !=11)%>%
  filter(rank !=12)%>%
  filter(rank !=13)%>%
  filter(rank !=14)%>%
  filter(rank !=15)%>%
  filter(rank !=16)%>%
  filter(rank !=17)%>%
  filter(rank !=18)%>%
  filter(rank !=19)%>%
  filter(rank !=20)%>%
  filter(rank !=21)%>%
  filter(rank !=22)%>%
  filter(rank !=23)%>%
  filter(rank !=24)%>%
  filter(rank !=25)%>%
  filter(rank !=26)%>%
  filter(rank !=27)%>%
  filter(rank !=28)

Part 1: General

ggplot(data = country_metrics_train, aes(x = lang_one, y=frequency(lang_one)))+
  geom_bar(stat = "identity")+
  labs(x = "\n Main Language", y = "Count \n", title = "Language Counts \n")+
  coord_flip()

Part 2: Top 4 Placements

ggplot(data = plot_data, aes(x = lang_one, y=frequency(lang_one)))+
  geom_bar(stat = "identity")+
  labs(x = "\n Main Language", y = "Count \n", title = "Language Counts \n")+
  coord_flip()

This suggests that there are 7 languages that usually make it to the final four of a World Cup. These are Spanish, Portuguese, Italian, German, French, English, and Dutch.

Amount_spoken

Part 1: General

ggplot(data = country_metrics_train, aes(x = amount_spoken))+
  geom_histogram()+
  labs(x = "\n Spoken Languages", y = "Count \n", title = "Amount of Spoken Languages \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4.

ggplot(data = plot_data, aes(x = amount_spoken))+
  geom_histogram()+
  labs(x = "\n Spoken Languages", y = "Count \n", title = "Amount of Spoken Languages \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This shows that maybe the amount of spoken languages matters, but I am not sure.

Cereal Yield

Part 1: General

ggplot(data = country_metrics_train, aes(x = avg_cereal))+
  geom_histogram()+
  labs(x = "\n Avg Cereal Yield", y = "Count \n", title = "Avg Cereal Yield \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = avg_cereal))+
  geom_histogram()+
  labs(x = "\n Avg Cereal Yield", y = "Count \n", title = "Cereal Yield For Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is interesting. This suggests that the more cereal a country yields the more likely they will be in the top of a World Cup.

GDP

Part 1: General

ggplot(data = country_metrics_train, aes(x = avg_gdp))+
  geom_histogram()+
  labs(x = "\n Avg GDP", y = "Count \n", title = "Avg GDP \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = avg_gdp))+
  geom_histogram()+
  labs(x = "\n Avg GDP", y = "Count \n", title = "Avg GDP for Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is surprising. It doesn’t look like GDP plays as big of a part as I thought it would.

Mortality Rate

Part 1: General

ggplot(data = country_metrics_train, aes(x = avg_mortality))+
  geom_histogram()+
  labs(x = "\n Avg Mortality Rate", y = "Count \n", title = "Avg Mortality Rate \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = avg_mortality))+
  geom_histogram()+
  labs(x = "\n AVg Mortality Rate", y = "Count \n", title = "Avg Mortality Rate for Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This graph shows that countries with lower mortality rates seem to be more likely to finish in the top 4. This probably is simply because it seems there are more countries with lower mortality rates.

Population

Part 1: General

ggplot(data = country_metrics_train, aes(x = avg_pop))+
  geom_histogram()+
  labs(x = "\n Avg Population", y = "Count \n", title = "Avg Population \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = avg_pop))+
  geom_histogram()+
  labs(x = "\n Avg Population", y = "Count \n", title = "Avg Population for Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It seems that population size doesn’t matter that much.

Rural Population Percentage

Part 1:

ggplot(data = country_metrics_train, aes(x = avg_rural_pop))+
  geom_histogram()+
  labs(x = "\n Avg Percentage", y = "Count \n", title = "Avg Percentage of People Living in Rural \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = avg_rural_pop))+
  geom_histogram()+
  labs(x = "\n Avg Percentage", y = "Count \n", title = "Avg Percentage of People Living in Rural for Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This seems to show that countries that have more people living in urban areas do better at the World Cup.

Urban

Part 1: General

ggplot(data = country_metrics_train, aes(x = avg_urban_pop))+
  geom_histogram()+
  labs(x = "\n Avg Percentage", y = "Count \n", title = "Avg Percentage of People Living in Urban \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is the same as the rural graph but inverted. Probably only need to keep one of these variables.

Farmland

Part 1: General

ggplot(data = country_metrics_train, aes(x = avg_farmland))+
  geom_histogram()+
  labs(x = "\n Avg Farmland", y = "Count \n", title = "Avg Farmland \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = avg_farmland))+
  geom_histogram()+
  labs(x = "\n Avg Farmland", y = "Count \n", title = "Avg Farmland for Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

maybe a relationship here.

Teen Fertility Rate

Part 1: General

ggplot(data = country_metrics_train, aes(x = avg_teen_fert))+
  geom_histogram()+
  labs(x = "\n Avg Teenage Fertitilty", y = "Count \n", title = "Avg Teenage Fertility \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = avg_teen_fert))+
  geom_histogram()+
  labs(x = "\n Avg Teenage Fertitilty", y = "Count \n", title = "Avg Teenage Fertility \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This tells me that countries with lower teenage fertility rates do better at the World Cup.

Life Expectancy

Part 1: General

ggplot(data = country_metrics_train, aes(x = avg_life_exp))+
  geom_histogram()+
  labs(x = "\n Avg Life Expectancy", y = "Count \n", title = "Avg Life Expectancy \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This tells me that countries that have a higher life expectancy are more likely to enter the World Cup.

Part 2: Top 4

ggplot(data = plot_data, aes(x = avg_life_exp))+
  geom_histogram()+
  labs(x = "\n Avg Life Expectancy", y = "Count \n", title = "Avg Life Expectancy Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This may suggest that countries that have a higher life expectancy rate perform better than their counterparts at the World Cup.

Pre World Cup Wins

Part 1: General

ggplot(data = country_metrics_train, aes(x = pre_world_cup_wins))+
  geom_histogram()+
  labs(x = "\n Pre World Cup Wins", y = "Count \n", title = "Pre World Cup Wins \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It looks like the majority of teams win between 10 and 20 matches. Some team once won 40 matches.

Part 2: Top 4

ggplot(data = plot_data, aes(x = pre_world_cup_wins))+
  geom_histogram()+
  labs(x = "\n Pre World Cup Wins", y = "Count \n", title = "Pre World Cup Wins \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It once again suggests that a team should win more than 10 international friendlies if they want a good chance of winning the World Cup.

Goals Scored Prior to World Cup

Part 1: General

ggplot(data = country_metrics_train, aes(x = total_goals_scored))+
  geom_histogram()+
  labs(x = "\n Pre World Cup Goals", y = "Count \n", title = "Pre World Cup Goals \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = total_goals_scored))+
  geom_histogram()+
  labs(x = "\n Pre World Cup Goals", y = "Count \n", title = "Pre World Cup Goals Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It seems like teams that score more than 10 goals in international friendlies have a good chance of making it into the final 4.

Goals Conceeded

Part 1: General

ggplot(data = country_metrics_train, aes(x = total_goals_conceeded))+
  geom_histogram()+
  labs(x = "\n Pre World Cup Goals Conceeded", y = "Count \n", title = "Pre World Cup Goals Conceeded \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = total_goals_conceeded))+
  geom_histogram()+
  labs(x = "\n Pre World Cup Goals Conceeded", y = "Count \n", title = "Pre World Cup Goals Conceeded  Top 4\n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It appears that a team should not conceed more than 15 goals if they expect to do well.

Goal Differential

Part 1: General

ggplot(data = country_metrics_train, aes(x = goal_diff))+
  geom_histogram()+
  labs(x = "\n Goal Diff", y = "Count \n", title = "Goal Diff \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2: Top 4

ggplot(data = plot_data, aes(x = goal_diff))+
  geom_histogram()+
  labs(x = "\n Goal Diff", y = "Count \n", title = "Goal Diff Top 4 \n")+
  coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What stands out here is that there are only 2 teams that have made the final four of a World Cup that had a negative goal difference during international friendlies prior to the World Cup. It also looks that teams that have a +5 goal differential or better are more likely to make it to the final 4.

Bi-Variable

First I will make scatterplots for each of the independent variables against the dependent variable.

ScatterPlots

average cereal yield

ggplot(country_metrics_train, aes(x=avg_cereal, y=rank)) +
  geom_point()

There seems to be maybe a small correlation here. It appears that the teams that had larger cereal yields seemed to usually do better in the World Cup.

average gdp

ggplot(country_metrics_train, aes(x=avg_gdp, y=rank)) +
  geom_point()

Again, this seems like there might be a weak correlation here.

average mortality rate

ggplot(country_metrics_train, aes(x=avg_mortality, y=rank)) +
  geom_point()

This one is hard to tell.

average population

ggplot(country_metrics_train, aes(x=avg_pop, y=rank)) +
  geom_point()

This does seem to suggest that the countries with larger populations tend to do better in the World Cup. This intuitively makes sense.

average urban population

ggplot(country_metrics_train, aes(x=avg_urban_pop, y=rank)) +
  geom_point()

This suggests that countries with higher urban living percentages tend to better.

average farmland

ggplot(country_metrics_train, aes(x=avg_farmland, y=rank)) +
  geom_point()

There seems to be maybe a weak positive correlation here.

avgerage teenage fertility

ggplot(country_metrics_train, aes(x=avg_teen_fert, y=rank)) +
  geom_point()

This plot does seem to suggest that the countries that do extremely well in the World Cup seem to have lower teenage fertility rates.

average life expectancy

ggplot(country_metrics_train, aes(x=avg_life_exp, y=rank)) +
  geom_point()

It does seem that the teams that do well have higher average life expectancies.

pre world cup wins

ggplot(country_metrics_train, aes(x=pre_world_cup_wins, y=rank)) +
  geom_point()

Surprisingly, I do not see any correlation here.

goals scored prior to World Cup

ggplot(country_metrics_train, aes(x=total_goals_scored, y=rank)) +
  geom_point()

This definetly looks like a positive correlation to me.

goals conceeded prior to World Cup

ggplot(country_metrics_train, aes(x=total_goals_conceeded, y=rank)) +
  geom_point()

There seems to be a negative correlation here.

goal difference prior to World Cup

ggplot(country_metrics_train, aes(x=goal_diff, y=rank)) +
  geom_point()

This is a strong, positive correlation.

Perceived Corruption Index

ggplot(country_metrics_corrupt_train, aes(x=avg_corruption, y=rank)) +
  geom_point()

It looks like maybe there is a positive correlation here.

Before I move on. I want to add a column based on if lang_one is one of those big seven languages or not.

country_metrics_train <- country_metrics_train %>%
  mutate(big_seven = if_else((startsWith(lang_one, "Spanish")|
                                startsWith(lang_one,"Portuguese")|
                                startsWith(lang_one, "Italian")|
                                startsWith(lang_one, "Dutch")|
                                startsWith(lang_one, "English")|
                                startsWith(lang_one, "French")|
                                startsWith(lang_one, "German")),lang_one,"not a big seven"))

country_metrics_corrupt_train <- country_metrics_corrupt_train %>%
  mutate(big_seven = if_else((startsWith(lang_one, "Spanish")|
                                startsWith(lang_one,"Portuguese")|
                                startsWith(lang_one, "Italian")|
                                startsWith(lang_one, "Dutch")|
                                startsWith(lang_one, "English")|
                                startsWith(lang_one, "French")|
                                startsWith(lang_one, "German")),lang_one,"not a big seven"))

cm_test_2022 <- cm_test_2022 %>%
  mutate(big_seven = if_else((startsWith(lang_one, "Spanish")|
                                startsWith(lang_one,"Portuguese")|
                                startsWith(lang_one, "Italian")|
                                startsWith(lang_one, "Dutch")|
                                startsWith(lang_one, "English")|
                                startsWith(lang_one, "French")|
                                startsWith(lang_one, "German")),lang_one,"not a big seven"))

cm_corrupt_2022_test <- cm_corrupt_2022_test %>%
  mutate(big_seven = if_else((startsWith(lang_one, "Spanish")|
                                startsWith(lang_one,"Portuguese")|
                                startsWith(lang_one, "Italian")|
                                startsWith(lang_one, "Dutch")|
                                startsWith(lang_one, "English")|
                                startsWith(lang_one, "French")|
                                startsWith(lang_one, "German")),lang_one,"not a big seven"))

Remove lang_one

country_metrics_train <- country_metrics_train%>%
  select(-c(lang_one))
country_metrics_corrupt_train <- country_metrics_corrupt_train %>%
  select(-c(lang_one))
cm_test_2022 <- cm_test_2022 %>%
  select(-c(lang_one))
cm_corrupt_2022_test <- cm_corrupt_2022_test %>%
  select(-c(lang_one))

Building Models

Linear Regression

Without Corruption Data

I will make 3 separate models from this dataset. A: Country Metrics alone. B: Soccer Metrics alone. C: A combination of both.

I will also make a combination model for the dataset that includes the perceived corruption index. D: A combination of country metrics with corruption data set and soccer metrics.

Country Metrics Alone.

model_a <- lm(rank ~ big_seven + amount_spoken + avg_cereal + avg_gdp + avg_mortality + avg_pop + avg_rural_pop + avg_urban_pop + avg_farmland + avg_teen_fert + avg_life_exp, data = country_metrics_train)
summary(model_a)
## 
## Call:
## lm(formula = rank ~ big_seven + amount_spoken + avg_cereal + 
##     avg_gdp + avg_mortality + avg_pop + avg_rural_pop + avg_urban_pop + 
##     avg_farmland + avg_teen_fert + avg_life_exp, data = country_metrics_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.3316  -4.3507   0.7791   4.8067  19.5794 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.03351    8.13001   0.496 0.620232    
## big_sevenEnglish         -7.60944    3.09829  -2.456 0.014714 *  
## big_sevenFrench          -4.17791    3.27335  -1.276 0.202991    
## big_sevenGerman           0.15194    3.04243   0.050 0.960209    
## big_sevenItalian         -0.06233    3.53020  -0.018 0.985926    
## big_sevennot a big seven -4.47189    2.96065  -1.510 0.132164    
## big_sevenPortuguese       6.72763    3.60732   1.865 0.063326 .  
## big_sevenSpanish         -0.73155    3.20579  -0.228 0.819675    
## amount_spoken             0.56705    0.34369   1.650 0.100190    
## avg_cereal                9.51663    2.74281   3.470 0.000611 ***
## avg_gdp                   4.60791    3.53347   1.304 0.193380    
## avg_mortality            19.00695    6.22288   3.054 0.002493 ** 
## avg_pop                  -1.80488    3.21265  -0.562 0.574740    
## avg_rural_pop            -7.08171    2.36656  -2.992 0.003038 ** 
## avg_urban_pop                  NA         NA      NA       NA    
## avg_farmland              5.81007    1.77038   3.282 0.001175 ** 
## avg_teen_fert            -0.13534    3.44561  -0.039 0.968699    
## avg_life_exp             10.58942    7.11314   1.489 0.137794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.469 on 256 degrees of freedom
## Multiple R-squared:  0.3466, Adjusted R-squared:  0.3058 
## F-statistic: 8.488 on 16 and 256 DF,  p-value: < 2.2e-16

Soccer Metrics Alone.

model_b <- lm(rank ~ pre_world_cup_wins + total_goals_scored + total_goals_conceeded + goal_diff, data = country_metrics_train)
summary(model_b)
## 
## Call:
## lm(formula = rank ~ pre_world_cup_wins + total_goals_scored + 
##     total_goals_conceeded + goal_diff, data = country_metrics_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.5405  -4.8691   0.3389   5.1685  16.1389 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           21.10419    1.57593  13.392  < 2e-16 ***
## pre_world_cup_wins    -0.22682    0.07101  -3.194  0.00157 ** 
## total_goals_scored     0.49984    0.05300   9.432  < 2e-16 ***
## total_goals_conceeded -0.54229    0.08053  -6.734 9.95e-11 ***
## goal_diff                   NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.268 on 269 degrees of freedom
## Multiple R-squared:  0.3499, Adjusted R-squared:  0.3426 
## F-statistic: 48.26 on 3 and 269 DF,  p-value: < 2.2e-16

Combination.

model_c <- lm(rank ~ big_seven + amount_spoken + avg_cereal + avg_gdp + avg_mortality + avg_pop + avg_rural_pop + avg_urban_pop + avg_farmland + avg_teen_fert + avg_life_exp + pre_world_cup_wins + total_goals_scored + total_goals_conceeded + goal_diff, data = country_metrics_train)
summary(model_c)
## 
## Call:
## lm(formula = rank ~ big_seven + amount_spoken + avg_cereal + 
##     avg_gdp + avg_mortality + avg_pop + avg_rural_pop + avg_urban_pop + 
##     avg_farmland + avg_teen_fert + avg_life_exp + pre_world_cup_wins + 
##     total_goals_scored + total_goals_conceeded + goal_diff, data = country_metrics_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7401  -3.7723   0.5445   4.4339  21.0494 
## 
## Coefficients: (2 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               8.05886    7.73338   1.042  0.29837    
## big_sevenEnglish         -3.80688    2.78933  -1.365  0.17353    
## big_sevenFrench          -1.91546    2.93048  -0.654  0.51394    
## big_sevenGerman           1.00522    2.69936   0.372  0.70991    
## big_sevenItalian          2.87220    3.20793   0.895  0.37145    
## big_sevennot a big seven -1.03258    2.66309  -0.388  0.69854    
## big_sevenPortuguese       5.34893    3.21875   1.662  0.09779 .  
## big_sevenSpanish          3.16451    2.91221   1.087  0.27823    
## amount_spoken             0.33846    0.30796   1.099  0.27280    
## avg_cereal                5.78151    2.49363   2.319  0.02122 *  
## avg_gdp                   6.52129    3.18999   2.044  0.04196 *  
## avg_mortality            15.42625    5.72042   2.697  0.00747 ** 
## avg_pop                  -3.52423    2.96915  -1.187  0.23636    
## avg_rural_pop            -5.75000    2.11174  -2.723  0.00692 ** 
## avg_urban_pop                  NA         NA      NA       NA    
## avg_farmland              3.34276    1.60934   2.077  0.03880 *  
## avg_teen_fert            -0.44350    3.05261  -0.145  0.88460    
## avg_life_exp              6.57945    6.36003   1.035  0.30189    
## pre_world_cup_wins       -0.12265    0.07362  -1.666  0.09695 .  
## total_goals_scored        0.40221    0.05670   7.094 1.31e-11 ***
## total_goals_conceeded    -0.40613    0.08101  -5.013 1.01e-06 ***
## goal_diff                      NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.616 on 253 degrees of freedom
## Multiple R-squared:  0.4933, Adjusted R-squared:  0.4553 
## F-statistic: 12.97 on 19 and 253 DF,  p-value: < 2.2e-16

With Coruption Data

Combination

model_d <- lm(rank ~ big_seven + amount_spoken + avg_cereal + avg_gdp + avg_mortality + avg_pop + avg_rural_pop + avg_urban_pop + avg_farmland + avg_teen_fert + avg_life_exp + pre_world_cup_wins + total_goals_scored + total_goals_conceeded + goal_diff + avg_corruption, data = country_metrics_corrupt_train)
summary(model_d)
## 
## Call:
## lm(formula = rank ~ big_seven + amount_spoken + avg_cereal + 
##     avg_gdp + avg_mortality + avg_pop + avg_rural_pop + avg_urban_pop + 
##     avg_farmland + avg_teen_fert + avg_life_exp + pre_world_cup_wins + 
##     total_goals_scored + total_goals_conceeded + goal_diff + 
##     avg_corruption, data = country_metrics_corrupt_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4793  -3.9581  -0.4391   3.5111  20.4202 
## 
## Coefficients: (2 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              17.26582   12.27361   1.407    0.162    
## big_sevenEnglish         -0.72903    4.46225  -0.163    0.870    
## big_sevenFrench          -1.17522    3.99247  -0.294    0.769    
## big_sevenGerman          -1.16828    3.90860  -0.299    0.765    
## big_sevenItalian          2.78170    5.01552   0.555    0.580    
## big_sevennot a big seven -1.73022    3.80005  -0.455    0.650    
## big_sevenPortuguese       1.90821    4.40131   0.434    0.665    
## big_sevenSpanish          5.14843    4.40891   1.168    0.245    
## amount_spoken             0.21734    0.42890   0.507    0.613    
## avg_cereal                3.59330    4.29440   0.837    0.404    
## avg_gdp                  -4.16460    4.68700  -0.889    0.376    
## avg_mortality            -2.30859   10.30369  -0.224    0.823    
## avg_pop                   3.75714    3.72750   1.008    0.315    
## avg_rural_pop            -3.14316    3.51603  -0.894    0.373    
## avg_urban_pop                  NA         NA      NA       NA    
## avg_farmland             -0.26749    2.79092  -0.096    0.924    
## avg_teen_fert            -4.82576    6.84194  -0.705    0.482    
## avg_life_exp             -9.83353   11.42624  -0.861    0.391    
## pre_world_cup_wins        0.02802    0.12673   0.221    0.825    
## total_goals_scored        0.49026    0.08125   6.034 1.58e-08 ***
## total_goals_conceeded    -0.51225    0.12216  -4.193 5.07e-05 ***
## goal_diff                      NA         NA      NA       NA    
## avg_corruption            0.08085    0.05410   1.494    0.137    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.944 on 129 degrees of freedom
## Multiple R-squared:  0.5292, Adjusted R-squared:  0.4562 
## F-statistic: 7.251 on 20 and 129 DF,  p-value: 3.023e-13

It seems that average corruption did not play a significant role in the regression. Therefore, I will only evaluate the other 3 models.

Model Evaluation.

I will evaluate the models based on their respective diagnostic plots, and their summaries.

Model A: Country Metrics

plot(model_a)

Residuals vs Fitted

There are some issues with this plot. The residuals do bounce around randomly at zero up until the fitted values get near 30. The residuals then begin to taper in. The residuals pretty much have an equal variance on both sides of 0, which is good. There maybe an outlier or 2, but they are barely an outliers. That being said, there also isn’t a clear non-linear pattern for the residuals, which is good.

QQ Plot

This is a good a good. This means that the residuals are normally distributed. This is to be expected because I normalized a lot of the independent variables. Nevertheless, we wanted our residuals to follow the 45 degree angle line, which it does.

Scale-Location

This diagnostic plot tests for homoscedasticity. The goal is to see a horizontal line with residuals equally spread out on both sides of it. In this case, we pretty much have a horizontal line. There is a slight curve, but nothing major. This plot could look a lot better if the sample size was larger.

Residuals vs Leverage

This plot looks for outliers. An outlier with power would be outside of the red dashed line in the upper or lower right side. There are none. This is great.

Conclusion

This regression model may not be the strongest, however, it does seem to be pretty trustworthy.

Model B: Soccer Metrics

plot(model_b)

### Residuals vs Fitted Again, we see a pretty horizontal line. This is good. The residuals look fairly evenly spread out, until it tapers in as fitted values increases.

QQ Plot

The residuals seem to follow the 45 degree angle line pretty closely. This is good. The residuals are normally distributed.

Scale-Location

The horizontal line is not that horizontal it begins to slope down as fitted values increases. This is not at all that surprising. The goal difference variable will have more variance in it because it takes negative and positive values.

Residuals vs Leverage

There are no residuals lying outside of red dashed lines. Thus, there are no outliers here.

Conclusion

Again, I would say that this is a trustworthy model.

Model C: Combination of Both.

plot(model_c)

Residuals vs Fitted

The residuals seem to behave the same as the previous plots. The line is fairly horizontal, but the residuals taper in towards 0 as fitted values increase.

QQ Plot

The residuals seem to follow the diagonal line pretty well. I would say that the residuals are normal.

Scale-Location

The line is fairly horizontal. It is not perfect, but is enough to say that there is homoscedasticity.

Residuals vs Leverage

There are no outliers to be concerned with.

Conclusion

I also believe that this is a trustworth model.

Model Selection

Since we can trust any of these models we will compare their R squared values. Model A: 0.3466 Model B: 0.3499 Model C: 0.4933

We will pick Model C.

It is important to note that the model that only used country metrics performed about the same as the model that only used soccer data.

Model Improvement

Since there I decided to make the models in reverse, I will remove any independent variable that did not really have a significance on the model.

model <- lm(rank ~ big_seven  + avg_cereal + avg_gdp + avg_mortality + avg_urban_pop + avg_farmland  + pre_world_cup_wins + total_goals_scored + total_goals_conceeded, data = country_metrics_train)
summary(model)
## 
## Call:
## lm(formula = rank ~ big_seven + avg_cereal + avg_gdp + avg_mortality + 
##     avg_urban_pop + avg_farmland + pre_world_cup_wins + total_goals_scored + 
##     total_goals_conceeded, data = country_metrics_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0133  -4.0262   0.6621   4.3891  20.2829 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               9.66900    3.88923   2.486 0.013551 *  
## big_sevenEnglish         -3.83825    2.73926  -1.401 0.162361    
## big_sevenFrench          -1.90708    2.89705  -0.658 0.510945    
## big_sevenGerman           1.01069    2.68935   0.376 0.707366    
## big_sevenItalian          3.46090    3.09978   1.116 0.265251    
## big_sevennot a big seven -1.16638    2.61753  -0.446 0.656259    
## big_sevenPortuguese       4.09846    3.03470   1.351 0.178033    
## big_sevenSpanish          2.50526    2.74986   0.911 0.363123    
## avg_cereal                5.60682    2.46093   2.278 0.023528 *  
## avg_gdp                   3.49236    2.03386   1.717 0.087163 .  
## avg_mortality             8.54896    2.16584   3.947 0.000102 ***
## avg_urban_pop             6.50271    1.99395   3.261 0.001259 ** 
## avg_farmland              3.21587    1.56601   2.054 0.041032 *  
## pre_world_cup_wins       -0.15357    0.07071  -2.172 0.030786 *  
## total_goals_scored        0.40190    0.05641   7.125 1.05e-11 ***
## total_goals_conceeded    -0.41524    0.07984  -5.201 4.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.608 on 257 degrees of freedom
## Multiple R-squared:  0.4866, Adjusted R-squared:  0.4566 
## F-statistic: 16.24 on 15 and 257 DF,  p-value: < 2.2e-16

Predict

Who will win the World Cup according to the model?

cm_test_2022$prediction <- predict(model, newdata =  cm_test_2022)
results <- cm_test_2022[, c("team", "prediction")]
print(results[order(-results$prediction),])
## # A tibble: 130 x 2
##    team                    prediction
##    <chr>                        <dbl>
##  1 Brazil_2022                   28.7
##  2 Uruguay_2022                  27.5
##  3 Argentina_2022                25.7
##  4 Guinea-Bissau_2022            23.3
##  5 Nigeria_2022                  22.9
##  6 Cuba_2022                     22.6
##  7 Portugal_2022                 22.1
##  8 Colombia_2022                 22.1
##  9 Dominican Republic_2022       21.6
## 10 Costa Rica_2022               21.6
## # ... with 120 more rows

There you have it. My model predicts that Brazil will win the 2022 FIFA World Cup.