Disclaimer

This is a project that I worked on during my practicum in Masters of Business Analytics program at University of Washington, Foster School of Business in collarboration with Microsoft Financing team. We’ve only used public data for this project and the suggestions are not binding.

Problem Statement

Microsoft Financing is a service that offers a wide range of financing options for its customers to accomodate to their needs. Microsoft Financing has expanded to 22 different countries by 2019 and four more countries in 2020 (Singapore, United Arabic Emirates, India and South Africa). Our goal is to find out which countries they should expand into next. In this part of the quantitative analysis, we narrowed down a list of countries that we should look deeper into in our qualitative analysis.
We first collect publicly available country-level data from various websites and clean the data. Then, we build three different models to evaluate which countries are more favorable to expand into. Last we summarise our finding and move on to the qualitative analysis.


Load packages

library(tidyverse)
library(readxl)
library(data.table)
library(caret)              # For Normalizing Data

Read Data

World_Bank <- read_csv("World Bank Data_Data.csv")                                      
CPI_2019 <- read_xlsx("CPI_2012_2019.xlsx", skip = 2, sheet = "CPI2019")                
Population <- read.csv("Population.csv", stringsAsFactors = FALSE)                   
Global_Economy <- read.csv("Theglobaleconomy_Data.csv", stringsAsFactors = FALSE)       
Internet_Data <- read.csv("Internet_Data.csv" , stringsAsFactors = FALSE)              
Data_Center <- read_xlsx("#of Data Center by Country.xlsx")
GCI <- read_xlsx("WEF_GCI_4.0_2019_Dataset.xlsx", sheet = "Data", skip = 3)

The data that we collected can be generallized into 5 categories:

The data that we gathered are all public data from various websites, including the World Bank, World Economic Forum and theglobaleconomy.com.

Data Sources :
World_Bank and Population is collected from the World Bank.
CPI_2019 is from Transparency International.
Global_Economy and Internet_Data are from The Global Economy.
Data_Center is from Data Center Map.
GCI is from the World Economic Forum.

Expand_Countries_List <- c("United States", "Australia", "Austria", "Belgium", "Brazil", "Canada", "Denmark", "Finland", "France", "Germany", "Ireland", "Italy", "Japan", "Mexico", "Netherlands", "New Zealand", "Norway", "Portugal", "Spain", "Sweden", "Switzerland", "United Kingdom", "South Africa", "India", "United Arab Emirates", "Singapore")
Expand_Countries_List_re <- c("United States", "Australia", "Austria", "Belgium", "Brazil", "Canada", "Denmark", "Finland", "France", "Germany", "Ireland", "Italy", "Japan", "Mexico", "Netherlands", "New Zealand", "Norway", "Portugal", "Spain", "Sweden", "Switzerland", "United Kingdom")

Expand_Countries_List is a vector of countries that contains all the countries Microsoft Financing has expanded into, including the 4 expansions in 2020 (Singapore, UAE, India, South Africa)
We also wanted to evaluate the current expansions in 2020, so we created two separted vectors to store these countries.
Expand_Countries_List_re is a vector of countries that contains all the countries Microsoft Financing has expanded into, without the 4 expansions in 2020.

Cleaning Data

Clean WB

WB <- World_Bank
head(WB)
tail(WB)
# Drop last five rows
WB <- head(WB, -5)

We can see that each row is a country with a given variable (series).
The last 5 rows are not useful so we will drop them.

# Change column names
names(WB) <- c("Country_Name", "Country_Code", "Series_Name", "Series_Code", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019")

NAs are recorded as “..” and therefore other numeric values are recoreded as characters.
We loop through the 5th to the 14th columns to make the numeric values numeric and the characters that are not numeric into NAs.

WB[, 5:14] <- sapply(WB[, 5:14], as.numeric)
# Keep only the first 217 countries
Unique_Country_Name <- unique(WB$Country_Name)
Unique_Country_Name <- Unique_Country_Name[1:217]    # The rest of the Country_Name are regional such as North America
WB <- WB[WB$Country_Name %in% Unique_Country_Name, ] # Keep only the country level data in WB

Use the country names in World Bank to join data from difference source together.
We will match all the country names from data using the country names used in the World Bank.

Clean CPI_2019

head(CPI_2019)
# The list of countries that are not using the exact same name as the names used in WB
CPI_2019$Country[!CPI_2019$Country %in% unique(WB$Country_Name)]
 [1] "Hong Kong"                        "United States of America"         "Taiwan"                          
 [4] "Bahamas"                          "Korea, South"                     "Saint Vincent and the Grenadines"
 [7] "Saint Lucia"                      "Slovakia"                         "Gambia"                          
[10] "Egypt"                            "Kyrgyzstan"                       "Laos"                            
[13] "Russia"                           "Iran"                             "Congo"                           
[16] "Democratic Republic of the Congo" "Guinea Bissau"                    "Korea, North"                    
[19] "Venezuela"                        "Yemen"                            "Syria"                           

Need to change these names to the names used in WB
Manually change the country names.

CPI_2019$Country[!CPI_2019$Country %in% unique(WB$Country_Name)] <- c("Hong Kong SAR, China", "United States", "Taiwan", "Bahamas, The", "Korea, Rep.", "St. Vincent and the Grenadines", "St. Lucia", "Slovak Republic", "Gambia, The", "Egypt, Arab Rep.", "Kyrgyz Republic", "Lao PDR", "Russian Federation", "Iran, Islamic Rep.","Congo, Rep.", "Congo, Dem. Rep.", "Guinea-Bissau", "Korea, Dem. People’s Rep.", "Venezuela, RB", "Yemen, Rep.", "Syrian Arab Republic")

Clean Global_Economy

head(Global_Economy)
# The country names that are not in WB. 
Global_Economy$Country[!Global_Economy$Country %in% unique(WB$Country_Name)] %>% unique()
 [1] "Bahamas"                          "Brunei"                           "Burma (Myanmar)"                 
 [4] "Cape Verde"                       "Democratic Republic of the Congo" "Egypt"                           
 [7] "Gambia"                           "Hong Kong"                        "Iran"                            
[10] "Ivory Coast"                      "Kyrgyzstan"                       "Laos"                            
[13] "Macao"                            "Macedonia"                        "Micronesia"                      
[16] "North Korea"                      "Palestine"                        "Republic of the Congo"           
[19] "Russia"                           "Saint Lucia"                      "Saint Vincent and the Grenadines"
[22] "Slovakia"                         "South Korea"                      "Swaziland"                       
[25] "Syria"                            "Taiwan"                           "USA"                             
[28] "Venezuela"                        "Yemen"                           

Manually check the names used in WB. Some of them are not in WB such as Taiwan and Palestine, just keep them as they are.

GE_name <- Global_Economy$Country[!Global_Economy$Country %in% unique(WB$Country_Name)] %>% unique()
WB_name <- c("Bahamas, The", "Brunei Darussalam", "Myanmar", "Cabo Verde", "Congo, Dem. Rep.", "Egypt, Arab Rep.", "Gambia, The", "Hong Kong SAR, China", "Iran, Islamic Rep.", "Cote d'Ivoire", "Kyrgyz Republic", "Lao PDR",  "Macao SAR, China", "North Macedonia", "Micronesia, Fed. Sts.",  "Korea, Dem. People’s Rep.", "Palestine", "Congo, Rep.", "Russian Federation", "St. Lucia", "St. Vincent and the Grenadines", "Slovak Republic", "Korea, Rep.", "Eswatini", "Syrian Arab Republic", "Taiwan", "United States", "Venezuela, RB", "Yemen, Rep.")
length(GE_name) == length(WB_name)  # Check if the two vectors have the same length
[1] TRUE

Because there are multiple rows for a country, so change the names using a loop

for (i in 1:nrow(Global_Economy)) {
  if(!Global_Economy$Country[i] %in% GE_name) {                                    # If the country name is the same as WB
    next                                                                           # Do not change name
  } else {                                                                         # Else (the country name is different)
    Global_Economy$Country[i] <- WB_name[Global_Economy$Country[i] == GE_name]     # Use the name in WB
  }
}

Change column names

names(Global_Economy) <- c("Country", "Code", "Year", "Rule_of_law_index", "Government_effectiveness_index", "Control_of_corruption", "Regulatory_quality_index")
names(Global_Economy)
[1] "Country"                        "Code"                           "Year"                          
[4] "Rule_of_law_index"              "Government_effectiveness_index" "Control_of_corruption"         
[7] "Regulatory_quality_index"      

Clean Data_Center

Need to change these names to the names used in WB

Data_Center$Country[!Data_Center$Country %in% WB$Country_Name]
 [1] "USA"                    "The Netherlands"        "Hong Kong"              "Russia"                
 [5] "Iran"                   "South Korea"            "Slovakia"               "Egypt"                 
 [9] "Taiwan"                 "Jersey"                 "Macedonia"              "Venezuela"             
[13] "Guernsey"               "Isle Of Man"            "Bahamas"                "Bosnia And Herzegovina"
[17] "Dr Congo"               "Macau"                  "Netherlands Antilles"   "Palestine"             
[21] "Reunion"                "Trinidad And Tobago"    "Us Virgin Islands"     
Data_Center$Country[!Data_Center$Country %in% WB$Country_Name] <- c("United States", "Netherlands", "Hong Kong SAR, China", "Russian Federation", "Iran, Islamic Rep.", "Korea, Rep.", "Slovak Republic", "Egypt, Arab Rep.", "Taiwan", "Jersey", "North Macedonia", "Venezuela, RB", "Guernsey", "Isle of Man", "Bahamas, The", "Bosnia and Herzegovina", "Congo, Dem. Rep.", "Macao SAR, China", "Netherlands Antilles", "Palestine", "Reunion", "Trinidad and Tobago", "Virgin Islands (U.S.)")
# "Taiwan", "Jersey", "Guernsey", "Netherlands Antilles", "Palestine", "Reunion" are not in WB

Clean Internet_Data

for (i in 1:nrow(Internet_Data)) {
  if(!Internet_Data$Country[i] %in% GE_name) {                                    # If the country name is the same as WB
    Internet_Data$Country[i] <- Internet_Data$Country[i]                         # Do not chang name
  } else {                                                                         # Else (the country name is different)
    Internet_Data$Country[i] <- WB_name[Internet_Data$Country[i] == GE_name]     # Use the name in WB
  }
}

Variables

Variables that we use are: Population, GDP(2018, Current US$), GDP Growth (2018, %), Inflation Rate(2018, %), Number of Data Centers (2018, #), Number of Secure Internet Servers(2018, #), Percentage of Internet Users of Population (2017, %), Rule of law index(2018, -2.5 ~ 2.5), Government effectiveness index(2018, -2.5 ~ 2.5), Control of corruption(2018, -2.5 ~ 2.5), Regulatory quality index(2018, -2.5 ~ 2.5), Ease of doing business(2019, Ranking), Global Competitiveness Index Score (2019, 0-100), the 12 pillars in GCI: Institutions, Infrastructure, ICT adoption, Macroeconomic stability, Health, Skills, Product market, Labour market, Financial system, Market size, Business dynamism, Innovation capability (2019, 0-100)

Variables Year Scale Source
GDP 2018 in US$ World Bank
GDP Growth 2018 % World Bank
FX Rate 2018 in US$ World Bank
Inflation 2018 % World Bank
# of Data Centers 2018 # datacentermap.com
# of Internet Servers 2018 # World Bank
% Internet Users 2017 % population theglobaleconomy.com
Population 2018 # World Bank
Rule of law 2018 -2.5 ~ 2.5 theglobaleconomy.com
Government effectiveness 2018 -2.5 ~ 2.5 theglobaleconomy.com
Regulatory quality 2018 -2.5 ~ 2.5 theglobaleconomy.com
Ease of doing business 2018 Ranking World Bank
Competitiveness Index 2019 0 ~ 100 World Economy Forum
Institutions 2019 0 ~ 100 World Economy Forum
Infrastructure 2019 0 ~ 100 World Economy Forum
ICT adoption 2019 0 ~ 100 World Economy Forum
Macroeconomic stability 2019 0 ~ 100 World Economy Forum
Health 2019 0 ~ 100 World Economy Forum
Skills 2019 0 ~ 100 World Economy Forum
Product market 2019 0 ~ 100 World Economy Forum
Labour market 2019 0 ~ 100 World Economy Forum
Financial system 2019 0 ~ 100 World Economy Forum
Market size 2019 0 ~ 100 World Economy Forum
Business dynamism 2019 0 ~ 100 World Economy Forum
Innovation capability 2019 0 ~ 100 World Economy Forum

Population

From wbstats R package. Already cleaned. Use 2018 population.

Population_2018 <- Population %>% 
  select(Country_Name, Population_2018)

GDP

From World Bank. Use 2018 data because we still don’t have 2019 data yet.

GDP_2018 <- WB %>%
  filter(Series_Name == "GDP (current US$)") %>%
  mutate(GDP_2018 = `2018`) %>%
  select(Country_Name, GDP_2018)

GDP Growth

From World Bank. Use 2018 data because we still don’t have 2019 data yet.

GDP_Growth_2018 <- WB %>%
  filter(Series_Name == "GDP growth (annual %)") %>%
  rename(GDP_Growth_2018 = `2018`) %>%
  select(Country_Name, GDP_Growth_2018)

Inflation Rate (No Iran and Syria)

Inflation_Rate_2018<- WB %>%
  filter(Series_Name == "Inflation, GDP deflator (annual %)") %>%
  mutate(Inflation_Rate_2018 = `2018`) %>%
  select(Country_Name, Inflation_Rate_2018)

Number of Data Centers (cleaned)

Number of Secure Internet Servers (cleaned)

Internet_servers_2018 <- WB %>%
  filter(Series_Name == "Secure Internet servers") %>%
  mutate(Internet_servers_2018 = `2018`) %>%
  select(Country_Name, Internet_servers_2018)

Percentage of Internet Users 2017

There’s more NAs in 2018 Data

Internet_Users_2017 <- Internet_Data %>%
  filter(Year == 2017) %>%
  select(Country, Internet.users.percent.of.population)

Corruption

Corruption_2019 <- CPI_2019 %>%
  select(Country, `CPI score 2019`)

Stability_2018: Rule_of_law_index, Government_effectiveness_index, Control_of_corruption, Regulatory_quality_index

Stability_2018 <- Global_Economy %>%
  filter(Year == 2018) %>%
  select(-c(Code, Year))
Stability_2018

Ease of doing business

Ease_of_doing_business_2019 <- WB %>%
  filter(Series_Name == "Ease of doing business index (1=most business-friendly regulations)") %>%
  mutate(Ease_of_doing_business_2019 = `2019`) %>%
  select(Country_Name, Ease_of_doing_business_2019)

!!! Is Ranked Data, meaning that lower the number, the better the country !!!

Ease_of_doing_business_2019 <- Ease_of_doing_business_2019 %>%
  arrange(Ease_of_doing_business_2019) %>%
  mutate(Reverese_Ranking = 191 - Ease_of_doing_business_2019) %>%
  select(Country_Name, Reverese_Ranking) %>%
  mutate(Ease_of_doing_business_2019 = Reverese_Ranking) %>%
  select(Country_Name, Ease_of_doing_business_2019)

Reverse the ranking so higher in rank has a higher score

GCI

GCI_score_2019 <- GCI %>%
  filter(Edition == 2019, `Series name` == "Global Competitiveness Index 4.0", Attribute == "SCORE") %>%
  select(-c(Index, Edition, `Series Global ID`,`Series name`, `Freeze date`, `Series units`, `Series order`, `Series code (if applicable)`, `Series type`, Attribute)) %>%
  t() %>%
  as.matrix(rownames.force = TRUE) %>%
  as.data.frame()
GCI_score_2019 <- cbind(rownames(GCI_score_2019), GCI_score_2019)
names(GCI_score_2019) <- c("Country_Name", "GCI_2019")
rownames(GCI_score_2019) <- c()
GCI_score_2019$Country_Name <- as.character(GCI_score_2019$Country_Name)
GCI_score_2019$GCI_2019 <- as.numeric(as.character(GCI_score_2019$GCI_2019))
GCI_score_2019 <- head(GCI_score_2019, -12) # get rid of the last 12 rows (regional data)
GCI_score_2019$Country_Name[!GCI_score_2019$Country_Name %in% unique(WB$Country_Name)] <- c("Cote d'Ivoire", "Congo, Dem. Rep.", "Cabo Verde", "Egypt, Arab Rep.", "Hong Kong SAR, China", "North Macedonia", "Taiwan", "Venezuela, RB", "Vietnam", "Yemen, Rep.")

12 Pillars in GCI

Names of the 12 pillars

Pillars_12 <- GCI %>% 
  filter(Edition == 2019, `Series type` == "Pillar", Attribute == "SCORE") %>%
  select(`Series name`) %>%
  .$`Series name`
GCI_pillars <- GCI %>% 
  filter(Edition == 2019, `Series type` == "Pillar", Attribute == "SCORE")  %>%
  select(-c(Index, Edition, `Series Global ID`,`Series name`, `Freeze date`, `Series units`, `Series order`, `Series code (if applicable)`, `Series type`, Attribute)) %>%
  t() %>%
  as.data.frame()
GCI_pillars <- cbind(rownames(GCI_pillars), GCI_pillars)
names(GCI_pillars) <- c("Country_Name", Pillars_12)
rownames(GCI_pillars) <- c()
GCI_pillars$Country_Name <- as.character(GCI_pillars$Country_Name)
GCI_pillars[, 2:13]<- sapply(GCI_pillars[, 2:13], as.character)
GCI_pillars[, 2:13]<- sapply(GCI_pillars[, 2:13], as.numeric)
GCI_pillars <- head(GCI_pillars, -12) # get rid of the last 12 rows (regional data)
GCI_pillars$Country_Name[!GCI_pillars$Country_Name %in% unique(WB$Country_Name)] <- c("Cote d'Ivoire", "Congo, Dem. Rep.", "Cabo Verde", "Egypt, Arab Rep.", "Hong Kong SAR, China", "North Macedonia", "Taiwan", "Venezuela, RB", "Vietnam", "Yemen, Rep.")

Joining Dataframes:

GDP_2018, Population_2018, GDP_Growth_2018, Inflation_Rate_2018, FX_Rate_2018, Data_Center, Internet_servers_2018, Internet_Users_2017, Stability_2018, Ease_of_doing_business_2019, GCI_score_2019, GCI_pillars

Clustering_df <- GDP_2018 %>% 
  left_join(Population_2018, by = c("Country_Name")) %>%
  left_join(GDP_Growth_2018, by = c("Country_Name")) %>%
  left_join(Inflation_Rate_2018, by = c("Country_Name")) %>%
  left_join(Data_Center, by = c("Country_Name" = "Country")) %>%
  left_join(Internet_servers_2018, by = c("Country_Name")) %>%
  left_join(Internet_Users_2017, by = c("Country_Name" = "Country")) %>%
  left_join(Stability_2018, by = c("Country_Name" = "Country")) %>%
  left_join(Ease_of_doing_business_2019, by = c("Country_Name")) %>%
  left_join(GCI_score_2019, by = c("Country_Name")) %>%
  left_join(GCI_pillars, by = c("Country_Name"))

Inpute missing data centers as 0s.

Clustering_df[is.na(Clustering_df$`Number of Data Center`), ]$`Number of Data Center` <- 0

Assume that countries that have missing values are countries Microsoft should not expand into.

Check the countries with missing values

# List of countries with missing values
Clustering_df[!complete.cases(Clustering_df), ]$Country_Name
 [1] "Afghanistan"                    "American Samoa"                 "Andorra"                       
 [4] "Antigua and Barbuda"            "Aruba"                          "Bahamas, The"                  
 [7] "Belarus"                        "Belize"                         "Bermuda"                       
[10] "Bhutan"                         "British Virgin Islands"         "Cayman Islands"                
[13] "Central African Republic"       "Channel Islands"                "Comoros"                       
[16] "Congo, Rep."                    "Cuba"                           "Curacao"                       
[19] "Djibouti"                       "Dominica"                       "Equatorial Guinea"             
[22] "Eritrea"                        "Faroe Islands"                  "Fiji"                          
[25] "French Polynesia"               "Gibraltar"                      "Greenland"                     
[28] "Grenada"                        "Guam"                           "Guinea-Bissau"                 
[31] "Guyana"                         "Iran, Islamic Rep."             "Iraq"                          
[34] "Isle of Man"                    "Kiribati"                       "Korea, Dem. People’s Rep."     
[37] "Kosovo"                         "Liberia"                        "Libya"                         
[40] "Liechtenstein"                  "Macao SAR, China"               "Maldives"                      
[43] "Marshall Islands"               "Micronesia, Fed. Sts."          "Monaco"                        
[46] "Myanmar"                        "Nauru"                          "New Caledonia"                 
[49] "Niger"                          "Northern Mariana Islands"       "Palau"                         
[52] "Papua New Guinea"               "Puerto Rico"                    "Samoa"                         
[55] "San Marino"                     "Sao Tome and Principe"          "Sierra Leone"                  
[58] "Sint Maarten (Dutch part)"      "Solomon Islands"                "Somalia"                       
[61] "South Sudan"                    "St. Kitts and Nevis"            "St. Lucia"                     
[64] "St. Martin (French part)"       "St. Vincent and the Grenadines" "Sudan"                         
[67] "Suriname"                       "Syrian Arab Republic"           "Timor-Leste"                   
[70] "Togo"                           "Tonga"                          "Turkmenistan"                  
[73] "Turks and Caicos Islands"       "Tuvalu"                         "Uzbekistan"                    
[76] "Vanuatu"                        "Venezuela, RB"                  "Virgin Islands (U.S.)"         
[79] "West Bank and Gaza"            
# Keep countries(rows) that have no missing values
Clustering_df <- Clustering_df[complete.cases(Clustering_df), ]
Clustering_df

Add Expand Column

Clustering_df$Expand <- ifelse(Clustering_df$Country_Name %in% Expand_Countries_List, TRUE, FALSE)
Clustering_df$Expand_re <- ifelse(Clustering_df$Country_Name %in% Expand_Countries_List_re, TRUE, FALSE)

Normalize Clustering_df

Clustering_df_preproc <- preProcess(Clustering_df)
Clustering_df_norm <-  predict(Clustering_df_preproc, Clustering_df)
head(Clustering_df_norm )     

Exploratory data analysis (EDA)

We first take a look at some of the variables in our final data.

p1 <- ggplot(Clustering_df, aes(x = GDP_2018)) +
  geom_histogram() +
  ggtitle("2018 GDP") +
  xlab("") +
  scale_x_continuous(breaks = c(0,5000000000000,10000000000000, 15000000000000, 20000000000000),
                     labels = c("0", "5 Trillion", "10 Trillion", "15 Trillion", "20 Trillion")) +
  theme(plot.title = element_text(size = 10, face = "bold"))+
  theme(axis.text.x = element_text(angle = 45, hjust=1, size = 8))
p2 <- ggplot(Clustering_df, aes(x = Population_2018)) +
  geom_histogram() +
  ggtitle("2018 Population")+
  scale_x_continuous(breaks = c(0,500000000,1000000000),
                     labels = c("0", "5 Billion", "10 Billion")) +
  xlab("")+
  theme(plot.title = element_text(size = 10, face = "bold"))
p3 <- ggplot(Clustering_df, aes(x = GDP_Growth_2018)) +
  geom_histogram() +
  ggtitle("2018 GDP Growth") +
  xlab("") +
  theme(plot.title = element_text(size = 10, face = "bold"))
p4 <- ggplot(Clustering_df, aes(x = GCI_2019)) +
  geom_histogram() +
  ggtitle("Global Competitiveness Index")+
  xlab("") +
  theme(plot.title = element_text(size = 8, face = "bold"))
p5 <- ggplot(Clustering_df, aes(x = Inflation_Rate_2018)) +
  geom_histogram()+
  ggtitle("2018 Inflation Rate")+
  xlab("") +
  theme(plot.title = element_text(size = 10, face = "bold"))
p6 <- ggplot(Clustering_df, aes(x = `Number of Data Center`)) +
  geom_histogram() +
  ggtitle("Number of Data Center")+
  xlab("") +
  theme(plot.title = element_text(size = 10, face = "bold"))
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2)

We can see that some of the variables are extremely skewed, such as population and GDP, while others seem to be normally distributed.

Method 1: Ranking Countries:

To find out which countries are more favorable to expand into, we rank all the countries that Microsoft Financing has not expanded into by different variables, and calculate how many times each country appears in the top 20% of a variable.

# "not in" function
`%notin%` <- Negate(`%in%`)
# Rank country takes a dataframe with two columns (first column is country and second column is the attribute)'
# Returns the top 20% of the country's names in that attribute
Rank_Country <- function(x, Reverse = FALSE) {
  colnames(x) <- c("Country_Name", "Attribute")
  if(Reverse == FALSE) {
      output_vector <- x %>%
      filter(Country_Name %notin% Expand_Countries_List) %>%
      arrange(desc(Attribute)) %>%
      head(nrow(Clustering_df) * 0.2) %>%
      .$Country_Name
  } else if (Reverse == TRUE) {
      output_vector <- x %>%
      filter(Country_Name %notin% Expand_Countries_List) %>%
      arrange(Attribute) %>%
      head(nrow(Clustering_df) * 0.2) %>%
      .$Country_Name
  }
  return(output_vector)
}
# Function to re-run ranking to include the 4 countries (SG, India, South Africa and UAE)
Rank_Country_re <- function(x, Reverse = FALSE) {
  colnames(x) <- c("Country_Name", "Attribute")
  if(Reverse == FALSE) {
      output_vector <- x %>%
      filter(Country_Name %notin% Expand_Countries_List_re) %>%
      arrange(desc(Attribute)) %>%
      head(nrow(Clustering_df) * 0.2) %>%
      .$Country_Name
  } else if (Reverse == TRUE) {
      output_vector <- x %>%
      filter(Country_Name %notin% Expand_Countries_List_re) %>%
      arrange(Attribute) %>%
      head(nrow(Clustering_df) * 0.2) %>%
      .$Country_Name
  }
  return(output_vector)
}

A generic function that takes in a dataframe of two columns, Country Name and a variable. Then it returns a vector of countries that ranks in the top 20% of that variable.
Use a for loop and table() to see how each countries perform.

Country_vector <- c()
for (i in c(2:26)) {
  Country_vector <- append(Country_vector, Rank_Country(Clustering_df[,c(1,i)]))
}
# Replace inflation because inflation should be ranked from the lowest to highest
Country_vector[(3*26+1):(3*26+27)] <- Rank_Country(Clustering_df[,c(1,5)], Reverse = TRUE)


Country_vector_re <- c()
for (i in c(2:26)) {
  Country_vector_re <- append(Country_vector_re, Rank_Country_re(Clustering_df[,c(1,i)]))
}
# Replace inflation because inflation should be ranked from the lowest to highest
Country_vector_re[(3*26+1):(3*26+27)] <- Rank_Country_re(Clustering_df[,c(1,5)], Reverse = TRUE)
# Without 2020 expansions
head(sort(table(Country_vector),decreasing = TRUE), 18) 

The first table shows the results of lists of countries without the 2020 expansions. These are all the countries that scored 15+ in our ranking system. The ranking system provides an intuitive and straightforward result that everyone could easily gauge. For example, with a score of 22, Hong Kong are in the top 20% in 22 out of the 25 different variables, putting it as one of the most favorable regions to expand into.

# 2020 expansions included
head(sort(table(Country_vector_re),decreasing = TRUE), 40)

The second table evaluates the performance of the current expansion (Singapore. United Arab Emirates, India and South Africa). It turns out that Singapore and United Arab Emirate scored really well, but India and South Africa scored pourly in our model.

Method 2: Clustering Model

We could also use clustering models to help find out which countries Microsoft should take a closer look at. Clustering models put different countries into different clusters, and the countries within the same cluster are said to be similar to each other. If we have a cluster that predominantly consists of countries that Microsoft Financing has expanded into, the other countries that Microsoft have not expanded into are the ones that they should consider.

Clustering Model : Hierarchical clustering

Hierachical clustering calculates the “distance” between different countries, and countries that are closer together will be put in the same cluster, depending on the number of clusters. In this model, we use the most common euclidean distance, which is the ordinary straight-line distance. For example, the euclidean distance between point A(1,2,3,4) and B(9, 8, 7, 6) would be: \[\sqrt{(9-1)^2 + (8-2)^2 + (7-3)^2 + (6-4)^2} = \sqrt{120} \]

Build Tree

Clustering_df_h <- column_to_rownames(Clustering_df, var = "Country_Name") # turn Country_Name into row names to display in the dendrogram.
distance_1 <- dist(Clustering_df_h, method = "euclidean")
ClusteringTree_1 <- hclust(distance_1, method = "ward.D")

Plot Full Tree

hcd = as.dendrogram(ClusteringTree_1)
plot(hcd)
 </br>

It is really hard to read from the dendrogram because there are too many countries, so we would display just part of the tree.

The bottom part of the tree

plot(cut(hcd, h = 600000000000)$lower[[13]], main = "Lower tree part 1")
plot(cut(hcd, h = 600000000000)$lower[[12]], main = "Lower tree part 2")

From the dendrograms shown, we can see that the Hierachical clustering model groups countries together buy their distance to other countries. In the second plot, the model shows that the countries that is the most similar to Hong Kong is Singapore, and the second closest group are Israel and South Africa. In the third tree, if we want to have two clusters, then we would put Argentina nad Thailand in a cluster and Poland, Belgium and Sweden in the other cluster. If we want three clusters, then Poland would form a cluster of its own.

4 Clusters

k = 4
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)                     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)               # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)    # Show the overall clustering result in regards to expansion
for (i in 3:4) {
  print(Clustering_df_cluster %>% filter(Cluster == i, Expand == FALSE) %>% .$Country_Name)
}

The table shows that we should look at the 3rd and 4th clusters because they have the highest percentage of current expansions. The TRUE represents the number of countries that Microsoft has already expanded into while FALSE represents the number of countries that Microsoft have not expanded into yet.
The two FALSE countries in Cluster 3 are South Korea and Russia and the FALSE country in Cluster 4 is China.

5 Clusters

k = 5
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)       # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)
print(Clustering_df_cluster %>% filter(Cluster == 3, Expand == FALSE) %>% .$Country_Name)

6 Clusters

k = 6
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)       # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)
print(Clustering_df_cluster %>% filter(Cluster == 3, Expand == FALSE) %>% .$Country_Name)

7 Clusters

k = 7
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)       # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)
print(Clustering_df_cluster %>% filter(Cluster == 3, Expand == FALSE) %>% .$Country_Name)

8 Clusters

k = 8
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)       # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)
print(Clustering_df_cluster %>% filter(Cluster == 3, Expand == FALSE) %>% .$Country_Name)

The clustering result does not change much when the number of clusters increase. This is because when we increase the number of clusters by 1, it would be easy to “kick out” a country from a cluster to form a new cluster, therefore the majority of the clusters did not change.

=====================================================================================================================

Clustering Model : K-means clustering (Normalized)

K means clustering is different clustering algorithm. The algorithm would randomly genearate a number of centroids (based on our choice of k, the number of clusters) to start. Each data point is then assigned to the closest centroid. After that, the centroids shift to their respective average point within their cluster, then each point is assigned to the clostest centroid. The process is repeated until an equilibrium is reached or the maximum number of iterations are met.

set.seed(1)
k = 4
Clustering_kmeans <-  kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
Clustering_df_norm_kmeans <- Clustering_df_norm                       
Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster    
table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
print(Clustering_df_norm_kmeans %>% filter(Cluster == 3, Expand == FALSE) %>%.$Country_Name)

From the results, we can see that we should focus on the third cluster: Chile, Czech Republic, Estonia, Hong Kong SAR, Iceland, Israel, South Korea, Lithuania, Luxembourg, Malaysia, Malta, Qatar, Slovenia.

If we change the number of clusters, we may get different results. (k =7, seed = 1)

set.seed(1)
k = 7
Clustering_kmeans <-  kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
Clustering_df_norm_kmeans <- Clustering_df_norm                       
Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster    
table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
print(Clustering_df_norm_kmeans %>% filter(Cluster == 4, Expand == FALSE) %>%.$Country_Name)

Different seed might generate different results (k =7, seed = 3)

set.seed(3)
k = 7
Clustering_kmeans <-  kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
Clustering_df_norm_kmeans <- Clustering_df_norm                       
Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster    
table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
print(Clustering_df_norm_kmeans %>% filter(Cluster == 2, Expand == FALSE) %>%.$Country_Name)

If we change the number of clusters or set a different seed, we will get different results from the K-means clustering model. The clustering results might be different while using the same number of clusters because the different starting points might generate differnt results, as we cans see from the outputs above.

To address this, we could run a simulation to see which countries are more preferable. We re run the clustering model 10,000 times and calculate the number of times each countries appear in a cluster which has more expanded countries than non-expanded countries. #### Simulation: k=4, 10,000 times

k = 4
c_list <-  c()
for (j in 1:10000){
  set.seed(j)
  Clustering_kmeans <- kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
  Clustering_df_norm_kmeans <- Clustering_df_norm                       
  Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster 
  a <- table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
  for (i in 1:k){
  if(a[i] < a[i+k]){
     c_list <- append(c_list, Clustering_df_norm_kmeans %>% filter(Cluster == i, Expand == FALSE) %>%.$Country_Name)
  }
}
}
sort(table(c_list), decreasing = TRUE)

Simulation: k=7, 10,000 times

k = 7 
c_list <-  c()
for (j in 1:10000){
  set.seed(j)
  Clustering_kmeans <- kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
  Clustering_df_norm_kmeans <- Clustering_df_norm                       
  Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster 
  a <- table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
  for (i in 1:k){
  if(a[i] < a[i+k]){
     c_list <- append(c_list, Clustering_df_norm_kmeans %>% filter(Cluster == i, Expand == FALSE) %>%.$Country_Name)
  }
}
}
sort(table(c_list), decreasing = TRUE)

From the above simulation, we can see that the recommended countries wouls be Chile, Czech, Estonia, Hong Kong, Iceland, Israel, Korea, Luxembourg, Malaysia, Qatar, Slovenia, Lituania and Malta. The results are pretty similar with the results from our ranking model: Hong Kong, Korea, Malaysia, Israel, Czech, Chile, Estonia, Iceland, Luxembourg, Slovenia, Cyprus, Latvia, Lithuania, Malta, Poland, Qatar, China, Saudi Arabia.

Should_Expand_List <- c("Chile", "Czech Republic", "Estonia", "Hong Kong SAR, China", "Iceland", "Israel", "Korea, Rep.", "Luxembourg", "Malaysia", "Qatar", "Slovenia", "Lituania", "Malta", "Poland", "Cyprus", "Latvia", "Lithuania", "China", "Saudi Arabia")

Scree Plot

A scree plot is commonly used to determine the number of clusters to use in k means clustering.The recommended number of cluster is at the “elbow” where there’s a big change in slope.

wss <- c()
for (i in 1:15) {
  Clustering_kmeans <- kmeans(Clustering_df_norm[,2:26], centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- Clustering_kmeans$tot.withinss
}
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")
  </br> 

Optimal cluster is 3 or 4.

The effect of Clustering

Status <- c()
for (i in 1:length(Clustering_df$Country_Name)){
  if (Clustering_df$Country_Name[i] %in% Expand_Countries_List) {
    Status = append(Status, "Already Expanded")
  } else if (Clustering_df$Country_Name[i] %in%Should_Expand_List) {
    Status = append(Status, "Should Expanded")
  } else {
    Status = append(Status, "Should Not Expanded")
  }
}
for (i in 3:7) {
  set.seed(1)
  k = i
  Clustering_kmeans <-  kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
  df_plot <- Clustering_df
  df_plot$Expand <- Status
  df_plot$Cluster <- Clustering_kmeans$cluster
  plot <- ggplot(data = df_plot, aes(x = Internet.users.percent.of.population, y = GCI_2019, 
                                     color = factor(Cluster), 
                                     shape = factor(Expand))) +
    geom_point() +
    # theme(legend.title = element_text("Clusters")) +
    xlab("% of Internet Users") +
    ylab("Global Competitiveness Index") + 
    ggtitle(paste0("Percentage of Internet Users against GCI with ", i, " clusters")) +
    scale_colour_discrete("Clusters") +
    scale_shape_discrete("Expand")
  print(plot)
}
 </br>

It is hard to visualize the effect of clustering using all vairables, but we can still see the effect in this two dimensional graph. Countries closer together are grouped in the same clusters, and it is not surprising that the countries on the upper right hand side (the countries with higher percentage of internet users and higher GCI score) would be more favorable to expand into. The clustering result with k=3 and k=4 are similar, expect that the United States has formed its own cluster.

Conclusion

Using three different models: Ranking, Hierarchical Clustering and K-means Clustering, we come up with a shrot list of countries that we could delve into without having to look into every country in the world.
The countries from Ranking:
Hong Kong, Korea, Malaysia, Israel, Czech, Chile, Estonia, Iceland, Luxembourg, Slovenia, Cyprus, Latvia, Lithuania, Malta, Poland, Qatar, China, Saudi Arabia
From Hierarchical Clustering:
Russia, Korea
From K-means Clustering:
Chile, Estonia, Hong Kong, Iceland ,Israel ,Korea, Luxembourg, Malaysia, Czech Republic, Qatar, Slovenia, Lithuania, Malta
With a short list of countries to focus on, we continue our qualitative analysis in these countries.

---
title: "International Expansion for Microsoft Financing"
author: "Ian Chuang"
date: "4/15/2020"
output:
  html_notebook:default
---

<center> <h1>Disclaimer</h1> </center>
This is a project that I worked on during my practicum in Masters of Business Analytics program at University of Washington, Foster School of Business in collarboration with Microsoft Financing team. We've only used public data for this project and the suggestions are not binding. 

<center> <h1>Problem Statement</h1> </center>
Microsoft Financing is a service that offers a wide range of financing options for its customers to accomodate to their needs. 
Microsoft Financing has expanded to 22 different countries by 2019 and four more countries in 2020 (Singapore, United Arabic Emirates, India and South Africa). **Our goal is to find out which countries they should expand into next.** In this part of the quantitative analysis, we narrowed down a list of countries that we should look deeper into in our qualitative analysis.    </br> 
We first collect publicly available country-level data from various websites and clean the data. Then, we build three different models to evaluate which countries are more favorable to expand into. Last we summarise our finding and move on to the qualitative analysis. 

 </br>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

#### Load packages
```{r message = FALSE, warning = FALSE}
library(tidyverse)
library(readxl)
library(data.table)
library(caret)              # For Normalizing Data
library(gridExtra)
```

#### Read Data
```{r message = FALSE, warning = FALSE}
World_Bank <- read_csv("World Bank Data_Data.csv")                                      
CPI_2019 <- read_xlsx("CPI_2012_2019.xlsx", skip = 2, sheet = "CPI2019")                
Population <- read.csv("Population.csv", stringsAsFactors = FALSE)                   
Global_Economy <- read.csv("Theglobaleconomy_Data.csv", stringsAsFactors = FALSE)       
Internet_Data <- read.csv("Internet_Data.csv" , stringsAsFactors = FALSE)              
Data_Center <- read_xlsx("#of Data Center by Country.xlsx")
GCI <- read_xlsx("WEF_GCI_4.0_2019_Dataset.xlsx", sheet = "Data", skip = 3)
```
The data that we collected can be generallized into 5 categories:

- Economic
- Demographic
- Public policy and Institutions
- Competitiveness
- Industry and Market

The data that we gathered are all public data from various websites, including the World Bank, World Economic Forum and theglobaleconomy.com.     </br>  

Data Sources :     </br> 
World_Bank and Population is collected from the [World Bank](https://www.worldbank.org/).  </br>
CPI_2019 is from [Transparency International](https://www.transparency.org/en/cpi).  </br>
Global_Economy and Internet_Data are from [The Global Economy](https://www.theglobaleconomy.com/).  </br>
Data_Center is from [Data Center Map](https://www.datacentermap.com/).  </br>
GCI is from the [World Economic Forum](https://www.weforum.org/).  </br>
 

```{r}
Expand_Countries_List <- c("United States", "Australia", "Austria", "Belgium", "Brazil", "Canada", "Denmark", "Finland", "France", "Germany", "Ireland", "Italy", "Japan", "Mexico", "Netherlands", "New Zealand", "Norway", "Portugal", "Spain", "Sweden", "Switzerland", "United Kingdom", "South Africa", "India", "United Arab Emirates", "Singapore")
Expand_Countries_List_re <- c("United States", "Australia", "Austria", "Belgium", "Brazil", "Canada", "Denmark", "Finland", "France", "Germany", "Ireland", "Italy", "Japan", "Mexico", "Netherlands", "New Zealand", "Norway", "Portugal", "Spain", "Sweden", "Switzerland", "United Kingdom")
```
Expand_Countries_List is a vector of countries that contains all the countries Microsoft Financing has expanded into, including the 4 expansions in 2020 (Singapore, UAE, India, South Africa)   </br>
We also wanted to evaluate the current expansions in 2020, so we created two separted vectors to store these countries.    </br>
Expand_Countries_List_re is a vector of countries that contains all the countries Microsoft Financing has expanded into, without the 4 expansions in 2020.    </br>

## Cleaning Data
#### Clean WB
```{r}
WB <- World_Bank
head(WB)
tail(WB)
# Drop last five rows
WB <- head(WB, -5)
```
We can see that each row is a country with a given variable (series).    </br>
The last 5 rows are not useful so we will drop them. 

```{r}
# Change column names
names(WB) <- c("Country_Name", "Country_Code", "Series_Name", "Series_Code", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019")
```

NAs are recorded as ".." and therefore other numeric values are recoreded as characters.     </br>
We loop through the 5th to the 14th columns to make the numeric values numeric and the characters that are not numeric into NAs.    </br>
```{r warning=FALSE}
WB[, 5:14] <- sapply(WB[, 5:14], as.numeric)
```

```{r}
# Keep only the first 217 countries
Unique_Country_Name <- unique(WB$Country_Name)
Unique_Country_Name <- Unique_Country_Name[1:217]    # The rest of the Country_Name are regional such as North America
WB <- WB[WB$Country_Name %in% Unique_Country_Name, ] # Keep only the country level data in WB
```

Use the country names in World Bank to join data from difference source together.     </br>
We will match all the country names from data using the country names used in the World Bank.


#### Clean CPI_2019
```{r}
head(CPI_2019)
# The list of countries that are not using the exact same name as the names used in WB
CPI_2019$Country[!CPI_2019$Country %in% unique(WB$Country_Name)]
```
Need to change these names to the names used in WB    </br>
Manually change the country names.
```{r}
CPI_2019$Country[!CPI_2019$Country %in% unique(WB$Country_Name)] <- c("Hong Kong SAR, China", "United States", "Taiwan", "Bahamas, The", "Korea, Rep.", "St. Vincent and the Grenadines", "St. Lucia", "Slovak Republic", "Gambia, The", "Egypt, Arab Rep.", "Kyrgyz Republic", "Lao PDR", "Russian Federation", "Iran, Islamic Rep.","Congo, Rep.", "Congo, Dem. Rep.", "Guinea-Bissau", "Korea, Dem. People’s Rep.", "Venezuela, RB", "Yemen, Rep.", "Syrian Arab Republic")
```


#### Clean Global_Economy 
```{r}
head(Global_Economy)
# The country names that are not in WB. 
Global_Economy$Country[!Global_Economy$Country %in% unique(WB$Country_Name)] %>% unique()
```
Manually check the names used in WB. Some of them are not in WB such as Taiwan and Palestine, just keep them as they are. 
```{r}
GE_name <- Global_Economy$Country[!Global_Economy$Country %in% unique(WB$Country_Name)] %>% unique()
WB_name <- c("Bahamas, The", "Brunei Darussalam", "Myanmar", "Cabo Verde", "Congo, Dem. Rep.", "Egypt, Arab Rep.", "Gambia, The", "Hong Kong SAR, China", "Iran, Islamic Rep.", "Cote d'Ivoire", "Kyrgyz Republic", "Lao PDR",  "Macao SAR, China", "North Macedonia", "Micronesia, Fed. Sts.",  "Korea, Dem. People’s Rep.", "Palestine", "Congo, Rep.", "Russian Federation", "St. Lucia", "St. Vincent and the Grenadines", "Slovak Republic", "Korea, Rep.", "Eswatini", "Syrian Arab Republic", "Taiwan", "United States", "Venezuela, RB", "Yemen, Rep.")
length(GE_name) == length(WB_name)  # Check if the two vectors have the same length
```
Because there are multiple rows for a country, so change the names using a loop 
```{r}
for (i in 1:nrow(Global_Economy)) {
  if(!Global_Economy$Country[i] %in% GE_name) {                                    # If the country name is the same as WB
    next                                                                           # Do not change name
  } else {                                                                         # Else (the country name is different)
    Global_Economy$Country[i] <- WB_name[Global_Economy$Country[i] == GE_name]     # Use the name in WB
  }
}
```

Change column names
```{r}
names(Global_Economy) <- c("Country", "Code", "Year", "Rule_of_law_index", "Government_effectiveness_index", "Control_of_corruption", "Regulatory_quality_index")
names(Global_Economy)
```

#### Clean Data_Center
Need to change these names to the names used in WB
```{r}
Data_Center$Country[!Data_Center$Country %in% WB$Country_Name]
```

```{r}
Data_Center$Country[!Data_Center$Country %in% WB$Country_Name] <- c("United States", "Netherlands", "Hong Kong SAR, China", "Russian Federation", "Iran, Islamic Rep.", "Korea, Rep.", "Slovak Republic", "Egypt, Arab Rep.", "Taiwan", "Jersey", "North Macedonia", "Venezuela, RB", "Guernsey", "Isle of Man", "Bahamas, The", "Bosnia and Herzegovina", "Congo, Dem. Rep.", "Macao SAR, China", "Netherlands Antilles", "Palestine", "Reunion", "Trinidad and Tobago", "Virgin Islands (U.S.)")
# "Taiwan", "Jersey", "Guernsey", "Netherlands Antilles", "Palestine", "Reunion" are not in WB
```

#### Clean Internet_Data
```{r}
for (i in 1:nrow(Internet_Data)) {
  if(!Internet_Data$Country[i] %in% GE_name) {                                    # If the country name is the same as WB
    Internet_Data$Country[i] <- Internet_Data$Country[i]                         # Do not chang name
  } else {                                                                         # Else (the country name is different)
    Internet_Data$Country[i] <- WB_name[Internet_Data$Country[i] == GE_name]     # Use the name in WB
  }
}
```

## Variables
Variables that we use are: Population, GDP(2018, Current US\$), GDP Growth (2018, %), Inflation Rate(2018, %), Number of Data Centers (2018, #), Number of Secure Internet Servers(2018, #), Percentage of Internet Users of Population (2017, %), Rule of law index(2018, -2.5 ~ 2.5), Government effectiveness index(2018, -2.5 ~ 2.5), Control of corruption(2018, -2.5 ~ 2.5), Regulatory quality index(2018, -2.5 ~ 2.5), Ease of doing business(2019, Ranking), Global Competitiveness Index Score (2019, 0-100), the 12 pillars in GCI: Institutions, Infrastructure, ICT adoption, Macroeconomic stability, Health, Skills, Product market, Labour market, Financial system, Market size, Business dynamism, Innovation capability (2019, 0-100)

+---------------------------+---------------+---------------+---------------------------+
|    Variables              |     Year      |    Scale      |        Source             |    
+===========================+===============+===============+===========================+
|    GDP                    |   2018        |   in   US$    |   World Bank              |
+---------------------------+---------------+---------------+---------------------------+
|   GDP Growth              |    2018       |   %           |   World Bank              | 
+---------------------------+---------------+---------------+---------------------------+
|   FX Rate                 |    2018       |    in US$     |   World Bank              |
+---------------------------+---------------+---------------+---------------------------+
|   Inflation               |    2018       |   %           |   World Bank              |
+---------------------------+---------------+---------------+---------------------------+
|   # of Data Centers       |    2018       |   #           |   datacentermap.com       |
+---------------------------+---------------+---------------+---------------------------+
|   # of Internet Servers   |    2018       |   #           |   World Bank              |
+---------------------------+---------------+---------------+---------------------------+
|   % Internet Users        |    2017       |  % population |   theglobaleconomy.com    |
+---------------------------+---------------+---------------+---------------------------+
|    Population             |    2018       |   #           |   World Bank              |
+---------------------------+---------------+---------------+---------------------------+
|   Rule of law             |    2018       |  -2.5 ~ 2.5   |   theglobaleconomy.com    |
+---------------------------+---------------+---------------+---------------------------+
| Government effectiveness  |    2018       |  -2.5 ~ 2.5   |   theglobaleconomy.com    |
+---------------------------+---------------+---------------+---------------------------+
|  Regulatory quality       |    2018       |  -2.5 ~ 2.5   |   theglobaleconomy.com    |
+---------------------------+---------------+---------------+---------------------------+
| Ease of doing business    |    2018       |  Ranking      |   World Bank              |
+---------------------------+---------------+---------------+---------------------------+
| Competitiveness Index     |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Institutions              |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Infrastructure            |    2019       |   0 ~ 100     |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| ICT adoption              |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Macroeconomic stability   |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Health                    |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Skills                    |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Product market            |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Labour market             |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Financial system          |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Market size               |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Business dynamism         |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+
| Innovation capability     |    2019       |  0 ~ 100      |   World Economy Forum     |
+---------------------------+---------------+---------------+---------------------------+


#### Population
From wbstats R package. Already cleaned. 
Use 2018 population. 
```{r}
Population_2018 <- Population %>% 
  select(Country_Name, Population_2018)
```


#### GDP
From World Bank. Use 2018 data because we still don't have 2019 data yet. 
```{r}
GDP_2018 <- WB %>%
  filter(Series_Name == "GDP (current US$)") %>%
  mutate(GDP_2018 = `2018`) %>%
  select(Country_Name, GDP_2018)
```

#### GDP Growth
From World Bank. Use 2018 data because we still don't have 2019 data yet. 
```{r}
GDP_Growth_2018 <- WB %>%
  filter(Series_Name == "GDP growth (annual %)") %>%
  rename(GDP_Growth_2018 = `2018`) %>%
  select(Country_Name, GDP_Growth_2018)
```

#### Inflation Rate (No Iran and Syria) 
```{r}
Inflation_Rate_2018<- WB %>%
  filter(Series_Name == "Inflation, GDP deflator (annual %)") %>%
  mutate(Inflation_Rate_2018 = `2018`) %>%
  select(Country_Name, Inflation_Rate_2018)
```


#### Number of Data Centers (cleaned)


#### Number of Secure Internet Servers (cleaned)
```{r}
Internet_servers_2018 <- WB %>%
  filter(Series_Name == "Secure Internet servers") %>%
  mutate(Internet_servers_2018 = `2018`) %>%
  select(Country_Name, Internet_servers_2018)
```

#### Percentage of Internet Users 2017
There's more NAs in 2018 Data
```{r}
Internet_Users_2017 <- Internet_Data %>%
  filter(Year == 2017) %>%
  select(Country, Internet.users.percent.of.population)
```

#### Corruption
```{r}
Corruption_2019 <- CPI_2019 %>%
  select(Country, `CPI score 2019`)
```


#### Stability_2018: Rule_of_law_index, Government_effectiveness_index, Control_of_corruption, Regulatory_quality_index
```{r}
Stability_2018 <- Global_Economy %>%
  filter(Year == 2018) %>%
  select(-c(Code, Year))
Stability_2018
```

#### Ease of doing business
```{r}
Ease_of_doing_business_2019 <- WB %>%
  filter(Series_Name == "Ease of doing business index (1=most business-friendly regulations)") %>%
  mutate(Ease_of_doing_business_2019 = `2019`) %>%
  select(Country_Name, Ease_of_doing_business_2019)
```
!!! Is Ranked Data, meaning that lower the number, the better the country !!! 

```{r}
Ease_of_doing_business_2019 <- Ease_of_doing_business_2019 %>%
  arrange(Ease_of_doing_business_2019) %>%
  mutate(Reverese_Ranking = 191 - Ease_of_doing_business_2019) %>%
  select(Country_Name, Reverese_Ranking) %>%
  mutate(Ease_of_doing_business_2019 = Reverese_Ranking) %>%
  select(Country_Name, Ease_of_doing_business_2019)
```
Reverse the ranking so higher in rank has a higher score


#### GCI
```{r}
GCI_score_2019 <- GCI %>%
  filter(Edition == 2019, `Series name` == "Global Competitiveness Index 4.0", Attribute == "SCORE") %>%
  select(-c(Index, Edition, `Series Global ID`,`Series name`, `Freeze date`, `Series units`, `Series order`, `Series code (if applicable)`, `Series type`, Attribute)) %>%
  t() %>%
  as.matrix(rownames.force = TRUE) %>%
  as.data.frame()
GCI_score_2019 <- cbind(rownames(GCI_score_2019), GCI_score_2019)
names(GCI_score_2019) <- c("Country_Name", "GCI_2019")
rownames(GCI_score_2019) <- c()
GCI_score_2019$Country_Name <- as.character(GCI_score_2019$Country_Name)
GCI_score_2019$GCI_2019 <- as.numeric(as.character(GCI_score_2019$GCI_2019))
GCI_score_2019 <- head(GCI_score_2019, -12) # get rid of the last 12 rows (regional data)
GCI_score_2019$Country_Name[!GCI_score_2019$Country_Name %in% unique(WB$Country_Name)] <- c("Cote d'Ivoire", "Congo, Dem. Rep.", "Cabo Verde", "Egypt, Arab Rep.", "Hong Kong SAR, China", "North Macedonia", "Taiwan", "Venezuela, RB", "Vietnam", "Yemen, Rep.")
```

12 Pillars in GCI

Names of the 12 pillars
```{r}
Pillars_12 <- GCI %>% 
  filter(Edition == 2019, `Series type` == "Pillar", Attribute == "SCORE") %>%
  select(`Series name`) %>%
  .$`Series name`
```

```{r}
GCI_pillars <- GCI %>% 
  filter(Edition == 2019, `Series type` == "Pillar", Attribute == "SCORE")  %>%
  select(-c(Index, Edition, `Series Global ID`,`Series name`, `Freeze date`, `Series units`, `Series order`, `Series code (if applicable)`, `Series type`, Attribute)) %>%
  t() %>%
  as.data.frame()
GCI_pillars <- cbind(rownames(GCI_pillars), GCI_pillars)
names(GCI_pillars) <- c("Country_Name", Pillars_12)
rownames(GCI_pillars) <- c()
GCI_pillars$Country_Name <- as.character(GCI_pillars$Country_Name)
GCI_pillars[, 2:13]<- sapply(GCI_pillars[, 2:13], as.character)
GCI_pillars[, 2:13]<- sapply(GCI_pillars[, 2:13], as.numeric)
GCI_pillars <- head(GCI_pillars, -12) # get rid of the last 12 rows (regional data)
GCI_pillars$Country_Name[!GCI_pillars$Country_Name %in% unique(WB$Country_Name)] <- c("Cote d'Ivoire", "Congo, Dem. Rep.", "Cabo Verde", "Egypt, Arab Rep.", "Hong Kong SAR, China", "North Macedonia", "Taiwan", "Venezuela, RB", "Vietnam", "Yemen, Rep.")
```

## Joining Dataframes: 
GDP_2018, Population_2018, GDP_Growth_2018, Inflation_Rate_2018, FX_Rate_2018, Data_Center, Internet_servers_2018, Internet_Users_2017, Stability_2018, Ease_of_doing_business_2019, GCI_score_2019, GCI_pillars
```{r}
Clustering_df <- GDP_2018 %>% 
  left_join(Population_2018, by = c("Country_Name")) %>%
  left_join(GDP_Growth_2018, by = c("Country_Name")) %>%
  left_join(Inflation_Rate_2018, by = c("Country_Name")) %>%
  left_join(Data_Center, by = c("Country_Name" = "Country")) %>%
  left_join(Internet_servers_2018, by = c("Country_Name")) %>%
  left_join(Internet_Users_2017, by = c("Country_Name" = "Country")) %>%
  left_join(Stability_2018, by = c("Country_Name" = "Country")) %>%
  left_join(Ease_of_doing_business_2019, by = c("Country_Name")) %>%
  left_join(GCI_score_2019, by = c("Country_Name")) %>%
  left_join(GCI_pillars, by = c("Country_Name"))
```

Inpute missing data centers as 0s. 
```{r}
Clustering_df[is.na(Clustering_df$`Number of Data Center`), ]$`Number of Data Center` <- 0
```

#### Assume that countries that have missing values are countries Microsoft should not expand into. 
Check the countries with missing values
```{r}
# List of countries with missing values
Clustering_df[!complete.cases(Clustering_df), ]$Country_Name
```

```{r}
# Keep countries(rows) that have no missing values
Clustering_df <- Clustering_df[complete.cases(Clustering_df), ]
Clustering_df
```
Add Expand Column
```{r}
Clustering_df$Expand <- ifelse(Clustering_df$Country_Name %in% Expand_Countries_List, TRUE, FALSE)
Clustering_df$Expand_re <- ifelse(Clustering_df$Country_Name %in% Expand_Countries_List_re, TRUE, FALSE)
```

Normalize Clustering_df
```{r}
Clustering_df_preproc <- preProcess(Clustering_df)
Clustering_df_norm <-  predict(Clustering_df_preproc, Clustering_df)
head(Clustering_df_norm )     
```
## Exploratory data analysis (EDA)
We first take a look at some of the variables in our final data. 
```{r}
p1 <- ggplot(Clustering_df, aes(x = GDP_2018)) +
  geom_histogram() +
  ggtitle("2018 GDP") +
  xlab("") +
  scale_x_continuous(breaks = c(0,5000000000000,10000000000000, 15000000000000, 20000000000000),
                     labels = c("0", "5 Trillion", "10 Trillion", "15 Trillion", "20 Trillion")) +
  theme(plot.title = element_text(size = 10, face = "bold"))+
  theme(axis.text.x = element_text(angle = 45, hjust=1, size = 8))
```

```{r}
p2 <- ggplot(Clustering_df, aes(x = Population_2018)) +
  geom_histogram() +
  ggtitle("2018 Population")+
  scale_x_continuous(breaks = c(0,500000000,1000000000),
                     labels = c("0", "5 Billion", "10 Billion")) +
  xlab("")+
  theme(plot.title = element_text(size = 10, face = "bold"))
```

```{r}
p3 <- ggplot(Clustering_df, aes(x = GDP_Growth_2018)) +
  geom_histogram() +
  ggtitle("2018 GDP Growth") +
  xlab("") +
  theme(plot.title = element_text(size = 10, face = "bold"))
```

```{r}
p4 <- ggplot(Clustering_df, aes(x = GCI_2019)) +
  geom_histogram() +
  ggtitle("Global Competitiveness Index")+
  xlab("") +
  theme(plot.title = element_text(size = 8, face = "bold"))
```


```{r}
p5 <- ggplot(Clustering_df, aes(x = Inflation_Rate_2018)) +
  geom_histogram()+
  ggtitle("2018 Inflation Rate")+
  xlab("") +
  theme(plot.title = element_text(size = 10, face = "bold"))
```


```{r}
p6 <- ggplot(Clustering_df, aes(x = `Number of Data Center`)) +
  geom_histogram() +
  ggtitle("Number of Data Center")+
  xlab("") +
  theme(plot.title = element_text(size = 10, face = "bold"))
```

```{r}
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2)
```
We can see that some of the variables are extremely skewed, such as population and GDP, while others seem to be normally distributed. 


## Method 1: Ranking Countries:
To find out which countries are more favorable to expand into, we rank all the countries that Microsoft Financing has not expanded into by different variables, and calculate how many times each country appears in the top 20% of a variable. 
```{r Rank Country}
# "not in" function
`%notin%` <- Negate(`%in%`)
# Rank country takes a dataframe with two columns (first column is country and second column is the attribute)'
# Returns the top 20% of the country's names in that attribute
Rank_Country <- function(x, Reverse = FALSE) {
  colnames(x) <- c("Country_Name", "Attribute")
  if(Reverse == FALSE) {
      output_vector <- x %>%
      filter(Country_Name %notin% Expand_Countries_List) %>%
      arrange(desc(Attribute)) %>%
      head(nrow(Clustering_df) * 0.2) %>%
      .$Country_Name
  } else if (Reverse == TRUE) {
      output_vector <- x %>%
      filter(Country_Name %notin% Expand_Countries_List) %>%
      arrange(Attribute) %>%
      head(nrow(Clustering_df) * 0.2) %>%
      .$Country_Name
  }
  return(output_vector)
}
# Function to re-run ranking to include the 4 countries (SG, India, South Africa and UAE)
Rank_Country_re <- function(x, Reverse = FALSE) {
  colnames(x) <- c("Country_Name", "Attribute")
  if(Reverse == FALSE) {
      output_vector <- x %>%
      filter(Country_Name %notin% Expand_Countries_List_re) %>%
      arrange(desc(Attribute)) %>%
      head(nrow(Clustering_df) * 0.2) %>%
      .$Country_Name
  } else if (Reverse == TRUE) {
      output_vector <- x %>%
      filter(Country_Name %notin% Expand_Countries_List_re) %>%
      arrange(Attribute) %>%
      head(nrow(Clustering_df) * 0.2) %>%
      .$Country_Name
  }
  return(output_vector)
}
```
A generic function that takes in a dataframe of two columns, Country Name and a variable. Then it returns a vector of countries that ranks in the top 20% of that variable.    </br>
Use a for loop and table() to see how each countries perform. 
```{r}
Country_vector <- c()
for (i in c(2:26)) {
  Country_vector <- append(Country_vector, Rank_Country(Clustering_df[,c(1,i)]))
}
# Replace inflation because inflation should be ranked from the lowest to highest
Country_vector[(3*26+1):(3*26+27)] <- Rank_Country(Clustering_df[,c(1,5)], Reverse = TRUE)


Country_vector_re <- c()
for (i in c(2:26)) {
  Country_vector_re <- append(Country_vector_re, Rank_Country_re(Clustering_df[,c(1,i)]))
}
# Replace inflation because inflation should be ranked from the lowest to highest
Country_vector_re[(3*26+1):(3*26+27)] <- Rank_Country_re(Clustering_df[,c(1,5)], Reverse = TRUE)
```


```{r Rank Country Output} 
# Without 2020 expansions
head(sort(table(Country_vector),decreasing = TRUE), 18) 
```

The first table shows the results of lists of countries without the 2020 expansions. These are all the countries that scored 15+ in our ranking system. The ranking system provides an intuitive and straightforward result that everyone could easily gauge. For example, with a score of 22, Hong Kong are in the top 20% in 22 out of the 25 different variables, putting it as one of the most favorable regions to expand into. 

```{r}
# 2020 expansions included
head(sort(table(Country_vector_re),decreasing = TRUE), 40)
```
The second table evaluates the performance of the current expansion (Singapore. United Arab Emirates, India and South Africa). It turns out that Singapore and United Arab Emirate scored really well, but India and South Africa scored pourly in our model.  

## Method 2: Clustering Model
We could also use clustering models to help find out which countries Microsoft should take a closer look at. 
Clustering models put different countries into different clusters, and the countries within the same cluster are said to be similar to each other. If we have a cluster that predominantly consists of countries that Microsoft Financing has expanded into, the other countries that Microsoft have not expanded into are the ones that they should consider.

### Clustering Model : Hierarchical clustering
Hierachical clustering calculates the "distance" between different countries, and countries that are closer together will be put in the same cluster, depending on the number of clusters. In this model, we use the most common euclidean distance, which is the ordinary straight-line distance. For example, the euclidean distance between point A(1,2,3,4) and B(9, 8, 7, 6) would be: $$\sqrt{(9-1)^2 + (8-2)^2 + (7-3)^2 + (6-4)^2} = \sqrt{120} $$

#### Build Tree
```{r}
Clustering_df_h <- column_to_rownames(Clustering_df, var = "Country_Name") # turn Country_Name into row names to display in the dendrogram.
distance_1 <- dist(Clustering_df_h, method = "euclidean")
ClusteringTree_1 <- hclust(distance_1, method = "ward.D")
```

#### Plot Full Tree
```{r}
hcd = as.dendrogram(ClusteringTree_1)
plot(hcd)
```
     </br>

It is really hard to read from the dendrogram because there are too many countries, so we would display just part of the tree. 

#### The bottom part of the tree
```{r}
plot(cut(hcd, h = 600000000000)$lower[[13]], main = "Lower tree part 1")
```

```{r}
plot(cut(hcd, h = 600000000000)$lower[[12]], main = "Lower tree part 2")
```

From the dendrograms shown, we can see that the Hierachical clustering model groups countries together buy their distance to other countries. In the second plot, the model shows that the countries that is the most similar to Hong Kong is Singapore, and the second closest group are Israel and South Africa. 
In the third tree, if we want to have two clusters, then we would put Argentina nad Thailand in a cluster and Poland, Belgium and Sweden in the other cluster. If we want three clusters, then Poland would form a cluster of its own. 

#### 4 Clusters 
```{r}
k = 4
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)                     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)               # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)    # Show the overall clustering result in regards to expansion
for (i in 3:4) {
  print(Clustering_df_cluster %>% filter(Cluster == i, Expand == FALSE) %>% .$Country_Name)
}
```
The table shows that we should look at the 3rd and 4th clusters because they have the highest percentage of current expansions. The TRUE represents the number of countries that Microsoft has already expanded into while FALSE represents the number of countries that Microsoft have not expanded into yet.   </br>
The two FALSE countries in Cluster 3 are South Korea and Russia and the FALSE country in Cluster 4 is China. 

#### 5 Clusters 
```{r}
k = 5
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)       # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)
print(Clustering_df_cluster %>% filter(Cluster == 3, Expand == FALSE) %>% .$Country_Name)
```

#### 6 Clusters 
```{r}
k = 6
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)       # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)
print(Clustering_df_cluster %>% filter(Cluster == 3, Expand == FALSE) %>% .$Country_Name)
```


#### 7 Clusters 
```{r}
k = 7
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)       # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)
print(Clustering_df_cluster %>% filter(Cluster == 3, Expand == FALSE) %>% .$Country_Name)
```

#### 8 Clusters 
```{r}
k = 8
ClusterGroup_1 <- cutree(ClusteringTree_1, k = k)     # Get the clustering result
Clustering_df_cluster <- Clustering_df
Clustering_df_cluster$Cluster <- factor(ClusterGroup_1)       # Put it into anther dataframe
table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand)
print(Clustering_df_cluster %>% filter(Cluster == 3, Expand == FALSE) %>% .$Country_Name)
```

The clustering result does not change much when the number of clusters increase. This is because when we increase the number of clusters by 1, it would be easy to "kick out" a country from a cluster to form a new cluster, therefore the majority of the clusters did not change. 



```{r, include=FALSE}
#### Generic function: H_clustering()
# Takes in 2 input: k and Country
# k is the number of clusters used in Hierarchical clustering
# Country is the country name that you want to focus on: could be a country or a list of countries

GDP_t <- function(input_vector){
  output_vector = vector()
  for (i in 1:length(input_vector)) {
    if(round(input_vector[i]/1000000000, 3) >= 1000) {
      output_vector = append(output_vector, paste0("$", round(input_vector[i]/1000000000000, 3), " Trillon"))
    } else {
      output_vector = append(output_vector, paste0("$", round(input_vector[i]/1000000000, 3), " Billon"))
    }
  }
  return(output_vector)
}
```


```{r H_clustering, include=FALSE}
H_clustering <- function(k, Country = NULL) {
  distance <- dist(Clustering_df[, 2:26], method = "euclidean")
  ClusteringTree <- hclust(distance, method = "ward.D")
  ClusterGroup <- cutree(ClusteringTree, k = k)                 # Get the clustering result
  Clustering_df_cluster <- Clustering_df
  Clustering_df_cluster$Cluster <- factor(ClusterGroup)         # Put it into anther dataframe
  Clustering_result_full <- as.data.frame.matrix(table(Clustering_df_cluster$Cluster, Clustering_df_cluster$Expand))
  Clustering_df_cluster$GDP_2018 <- GDP_t(Clustering_df_cluster$GDP_2018)
  Clustering_df_cluster[, c(4,5,14:26)] <- sapply(Clustering_df_cluster[, c(4,5,14:26)], round, 2)
  Clustering_df_cluster$GDP_Growth_2018 <- paste0( Clustering_df_cluster$GDP_Growth_2018, "%")
  Clustering_df_cluster$Inflation_Rate_2018 <- paste0( Clustering_df_cluster$Inflation_Rate_2018, "%")
  Clustering_df_cluster$Population_2018 <- format(Clustering_df_cluster$Population_2018, big.mark=",", scientific=FALSE)
  if (is.null(Country)) {                                       # When there is no country input
      Output <- list()
      for(i in 1:k){
      Output[[i]] <- Clustering_df_cluster %>% filter(Cluster == i) %>% arrange(Expand_re)
      }
      Output[[k+1]] <- Clustering_result_full
  }

  if (!is.null(Country)) {                                       # When there is country input
    if (length(Country) > 1) {                                   # When country input is a vector (more than 1)
      in_k <- c()
      Output <- list()
      for (i in 1:length(Country)) {
        in_k[i] <- Clustering_df_cluster[Clustering_df_cluster$Country_Name == Country[i], ]$Cluster
      }
      for (j in 1:length(unique(in_k))) {
        Output[[j]] <- Clustering_df_cluster %>% filter(Cluster == unique(in_k)[j]) %>% arrange(Expand_re)
      }  
      Output[[j+1]] <- Clustering_result_full
    } else {                                                    # When country input is only one country
      in_k <- Clustering_df_cluster[Clustering_df_cluster$Country_Name == Country, ]$Cluster
      Output <- Clustering_df_cluster %>% filter(Cluster == in_k) %>% arrange(Expand_re)
    }
    
  }
  return(Output)
  }
```


```{r H_clustering example_1, include=FALSE}
# Shows the result of k=4 in Hierarchical clsutering
H_clustering(4)
```




```{r H_clustering example_2, include=FALSE}
# Shows the cluster that Chile is in under k=12 in hierarchical clustering
H_clustering(12, "Chile")
```

```{r H_clustering example_3, include=FALSE}
# Shows the clusters that Singapore, India, South Africa and UAE are in under k=6 in hierarchical clustering 
H_clustering(6, c("Singapore", "India", "South Africa", "United Arab Emirates"))
```





=====================================================================================================================






### Clustering Model : K-means clustering (Normalized)
K means clustering is different clustering algorithm. The algorithm would randomly genearate a number of centroids (based on our choice of k, the number of clusters) to start. Each data point is then assigned to the closest centroid. After that, the centroids shift to their respective average point within their cluster, then each point is assigned to the clostest centroid. The process is repeated until an equilibrium is reached or the maximum number of iterations are met. 

#### 
```{r}
set.seed(1)
k = 4
Clustering_kmeans <-  kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
Clustering_df_norm_kmeans <- Clustering_df_norm                       
Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster    
table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
print(Clustering_df_norm_kmeans %>% filter(Cluster == 3, Expand == FALSE) %>%.$Country_Name)
```
From the results, we can see that we should focus on the third cluster: Chile, Czech Republic, Estonia, Hong Kong SAR, Iceland, Israel, South Korea, Lithuania, Luxembourg, Malaysia, Malta, Qatar, Slovenia.

#### If we change the number of clusters, we may get different results. (k =7, seed = 1)
```{r}
set.seed(1)
k = 7
Clustering_kmeans <-  kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
Clustering_df_norm_kmeans <- Clustering_df_norm                       
Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster    
table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
print(Clustering_df_norm_kmeans %>% filter(Cluster == 4, Expand == FALSE) %>%.$Country_Name)
```
#### Different seed might generate different results (k =7, seed = 3)
```{r}
set.seed(3)
k = 7
Clustering_kmeans <-  kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
Clustering_df_norm_kmeans <- Clustering_df_norm                       
Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster    
table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
print(Clustering_df_norm_kmeans %>% filter(Cluster == 2, Expand == FALSE) %>%.$Country_Name)
```
If we change the number of clusters or set a different seed, we will get different results from the K-means clustering model. The clustering results might be different while using the same number of clusters because the different starting points might generate differnt results, as we cans see from the outputs above. 

To address this, we could run a simulation to see which countries are more preferable. We re run the clustering model 10,000 times and calculate the number of times each countries appear in a cluster which has more expanded countries than non-expanded countries. 
#### Simulation: k=4, 10,000 times
```{r}
k = 4
c_list <-  c()
for (j in 1:10000){
  set.seed(j)
  Clustering_kmeans <- kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
  Clustering_df_norm_kmeans <- Clustering_df_norm                       
  Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster 
  a <- table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
  for (i in 1:k){
  if(a[i] < a[i+k]){
     c_list <- append(c_list, Clustering_df_norm_kmeans %>% filter(Cluster == i, Expand == FALSE) %>%.$Country_Name)
  }
}
}
sort(table(c_list), decreasing = TRUE)
```
#### Simulation: k=7, 10,000 times
```{r}
k = 7 
c_list <-  c()
for (j in 1:10000){
  set.seed(j)
  Clustering_kmeans <- kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
  Clustering_df_norm_kmeans <- Clustering_df_norm                       
  Clustering_df_norm_kmeans$Cluster <- Clustering_kmeans$cluster 
  a <- table(Clustering_df_norm_kmeans$Cluster, Clustering_df_norm_kmeans$Expand)
  for (i in 1:k){
  if(a[i] < a[i+k]){
     c_list <- append(c_list, Clustering_df_norm_kmeans %>% filter(Cluster == i, Expand == FALSE) %>%.$Country_Name)
  }
}
}
sort(table(c_list), decreasing = TRUE)
```

From the above simulation, we can see that the recommended countries wouls be Chile, Czech, Estonia, Hong Kong, Iceland, Israel, Korea, Luxembourg, Malaysia, Qatar, Slovenia, Lituania and Malta. The results are pretty similar with the results from our ranking model: Hong Kong, Korea, Malaysia, Israel, Czech, Chile, Estonia, Iceland, Luxembourg, Slovenia, Cyprus, Latvia, Lithuania, Malta, Poland, Qatar, China, Saudi Arabia. 

```{r}
Should_Expand_List <- c("Chile", "Czech Republic", "Estonia", "Hong Kong SAR, China", "Iceland", "Israel", "Korea, Rep.", "Luxembourg", "Malaysia", "Qatar", "Slovenia", "Lituania", "Malta", "Poland", "Cyprus", "Latvia", "Lithuania", "China", "Saudi Arabia")
```

#### Scree Plot
A scree plot is commonly used to determine the number of clusters to use in k means clustering.The recommended number of cluster is at the "elbow" where there's a big change in slope. 
```{r}
wss <- c()
for (i in 1:15) {
  Clustering_kmeans <- kmeans(Clustering_df_norm[,2:26], centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- Clustering_kmeans$tot.withinss
}
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")
```
      </br> 
Optimal cluster is 3 or 4. 


#### The effect of Clustering
```{r}
Status <- c()
for (i in 1:length(Clustering_df$Country_Name)){
  if (Clustering_df$Country_Name[i] %in% Expand_Countries_List) {
    Status = append(Status, "Already Expanded")
  } else if (Clustering_df$Country_Name[i] %in%Should_Expand_List) {
    Status = append(Status, "Should Expanded")
  } else {
    Status = append(Status, "Should Not Expanded")
  }
}
```

```{r}
for (i in 3:7) {
  set.seed(1)
  k = i
  Clustering_kmeans <-  kmeans(Clustering_df_norm[,2:26], centers = k, nstart = 20)  # Use n_start to "stabalize" the clustering results. 
  df_plot <- Clustering_df
  df_plot$Expand <- Status
  df_plot$Cluster <- Clustering_kmeans$cluster
  plot <- ggplot(data = df_plot, aes(x = Internet.users.percent.of.population, y = GCI_2019, 
                                     color = factor(Cluster), 
                                     shape = factor(Expand))) +
    geom_point() +
    # theme(legend.title = element_text("Clusters")) +
    xlab("% of Internet Users") +
    ylab("Global Competitiveness Index") + 
    ggtitle(paste0("Percentage of Internet Users against GCI with ", i, " clusters")) +
    scale_colour_discrete("Clusters") +
    scale_shape_discrete("Expand")
  print(plot)
}

```
     </br>
It is hard to visualize the effect of clustering using all vairables, but we can still see the effect in this two dimensional graph. Countries closer together are grouped in the same clusters, and it is not surprising that the countries on the upper right hand side (the countries with higher percentage of internet users and higher GCI score) would be more favorable to expand into. The clustering result with k=3 and k=4 are similar, expect that the United States has formed its own cluster. 


## Conclusion
Using three different models: Ranking, Hierarchical Clustering and K-means Clustering, we come up with a shrot list of countries that we could delve into without having to look into every country in the world.   </br> 
The countries from Ranking:    </br> 
Hong Kong, Korea, Malaysia, Israel, Czech, Chile, Estonia, Iceland, Luxembourg, Slovenia, Cyprus, Latvia, Lithuania, Malta, Poland, Qatar, China, Saudi Arabia    </br> 
From Hierarchical Clustering:    </br> 
Russia, Korea   </br> 
From K-means Clustering:    </br> 
Chile, Estonia, Hong Kong, Iceland ,Israel ,Korea, Luxembourg, Malaysia, Czech Republic, Qatar, Slovenia, Lithuania, Malta    </br> 
With a short list of countries to focus on, we continue our qualitative analysis in these countries. 