📈 Emerging EdTech Innovators


| Introduction and Objective


The primary objective of this project was to explore emerging companies and trends in the EdTech industry, as featured on HolonIQ’s list. This initiative aimed to gain a comprehensive understanding of the current market landscape, categorize the startups based on their focus areas, and pinpoint potential clients, providers, partners, competitors, or emerging markets. The analysis was structured to examine these entities by geographic regions and industry verticals, ultimately providing strategic insights to understand edunext’s position within the industry and enhance its market positioning and growth.

🚨 Why the HolonIQ EdTech Ranking was Selected

The HolonIQ EdTech ranking was selected due to its reputation as a comprehensive and authoritative source of information on the most innovative and high-potential EdTech startups globally. The ranking provides valuable insights into the leading companies in the EdTech sector, making it an ideal resource for identifying key players and emerging trends in the industry. By leveraging this ranking, edunext can ensure its analysis is based on credible and up-to-date information, enabling informed decision-making and strategic planning.

💫 Company Overview

Edunext is an infrastructure-as-a-service provider and software company dedicated to the Open edX platform. It empowers organizations worldwide by delivering robust, scalable, and customizable online learning solutions. Edunext’s mission is to enhance the quality of education through technology, supporting successful online learning initiatives across various sectors.


| Project Overview


The project began by generating a list of the 1,000 EdTech startups featured in the HolonIQ ranking, including links to their LinkedIn profiles. A scraper was developed to extract information from LinkedIn, and additional data was gathered from the startups’ websites and saved in text files. The collected information was then fed to the ChatGPT API with specific prompts designed to build a detailed profile of each company and to answer questions aimed at classifying the startups and identifying potential business opportunities for edunext. This methodology provided a detailed understanding of each startup and supported strategic insights for edunext’s market positioning and growth.


| HolonIQ EdTech Ranking Overview


📘 Introduction to HolonIQ

HolonIQ is a global market intelligence firm specializing in the education, climate, and health sectors. Each year, it publishes the Global EdTech 1000 list, which highlights the most promising startups in the educational technology (EdTech) field at both global and regional levels. Sublists include the “Top 200 EdTech in North America,” the “Top 50 EdTech in Australia and New Zealand,” and many more for different regions worldwide.

🔍 How Does the HolonIQ Ranking Work?

Evaluation and Selection:

Evaluation Criteria:

Selection Process:

📊 Importance of the Ranking

These rankings not only highlight the most promising startups but also help connect different regions of the world and share innovations that can improve educational outcomes globally. Promising startups are those that show exceptional potential in terms of innovation, market impact, and growth trajectory, making them key players in driving the future of education. Additionally, they provide investors and other stakeholders with a clear view of emerging trends and the companies leading the change in education.

For more details on HolonIQ rankings and methodologies, you can visit their official website: HolonIQ.


| Methodology


📋 Generate the list with the 1000 companies

The project started from a curated list of 1,000 EdTech startups compiled by HolonIQ. The list was provided in image format and segmented the startups by geographic region (Africa, Nordic-Baltic, South Asia, etc.). The first step was to extract the key information from each entry: the company name, LinkedIn profile URL, and company website address. This extraction was a collaborative effort that required manual work from multiple team members, supported by tools such as Gemini and GPT-3 to improve efficiency.

This is an example of the source format of the lists mentioned above.


The Global EdTech 1000 list for this case includes:

We used the 2023 HolonIQ EdTech lists for our analysis.

Once this process was done, the list looked like this:


📋 LinkedIn Scraper

Initially, there were 1,046 companies in the list. Because the list was compiled manually by several team members, some entries were duplicated, and a review found that only 937 were unique. Of these 937 unique companies, 117 had no LinkedIn page. The LinkedIn scraping process was therefore run on 820 companies. It retrieved 707 profiles, 703 of which included a company website; the remaining 113 could not be retrieved and need review.

The results are summarized in the following table:

| Category | Number of companies |
|---|---|
| Total Companies in List | 1046 |
| Total Unique Companies | 937 |
| Companies without LinkedIn in list | 117 |
| Total companies with LinkedIn | 820 |
| LinkedIn Retrieved | 707 |
| LinkedIn Not Retrieved (Needs Review) | 113 |
| LinkedIn Retrieved Webpage | 703 |
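
The counts in the table follow from simple arithmetic, and the deduplication step can be sketched in a few lines. This is only an illustration: the matching rule (trimming and case-folding names) and the sample names are assumptions, not the procedure the team actually used. The totals are the ones reported in the table.

```python
def unique_companies(names):
    """Return names with duplicates removed, comparing case-insensitively.

    Assumption: two entries refer to the same company when their names match
    after trimming whitespace and case-folding. The original review may have
    used a different rule.
    """
    seen = set()
    unique = []
    for name in names:
        key = name.strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(name.strip())
    return unique


# Hypothetical sample: manual entry by several people can create near-duplicates.
sample = ["byteXL", "Adda247", "bytexl ", "EduGorilla", "adda247"]
print(unique_companies(sample))

# Funnel as reported in the table.
total_in_list = 1046
unique_total = 937
without_linkedin = 117
retrieved = 707
not_retrieved = 113

duplicates_removed = total_in_list - unique_total
with_linkedin = unique_total - without_linkedin
print(duplicates_removed, with_linkedin, retrieved + not_retrieved)
```

The retrieved and not-retrieved counts add up to the number of companies with a LinkedIn page, which is a quick consistency check on the table.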

The links that could not be retrieved are often in LinkedIn’s school format (a /school/ URL) rather than the company format (a /company/ URL). Here is an example of the links from which information could not be collected:

Failed Links:

The links in company format, by contrast, worked correctly. Here are some examples:

Successful Links:

| Company Name | URL |
|---|---|
| byteXL | https://in.linkedin.com/company/bytexl |
| Toodle | https://in.linkedin.com/company/toodlerungta |
| Eupheus Learning | https://in.linkedin.com/company/eupheus-learning |
| 10 Minute School | https://bd.linkedin.com/company/10ms |
| Adda247 | https://in.linkedin.com/company/adda247 |
| Apars Classroom | https://bd.linkedin.com/company/aparsclassroom |
| EduGorilla | https://in.linkedin.com/company/edugorilla-pvt-ltd |
| Infinity Learn | https://in.linkedin.com/company/infinity-learn-by-sri-chaitanya |
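
One way to flag links that may need review before scraping is to look at the first path segment of each URL. The sketch below assumes that school-format links live under a /school/ path; the helper name and the school URL are made up for illustration and are not part of the original scripts.

```python
from urllib.parse import urlparse


def linkedin_page_type(url: str) -> str:
    """Return the first path segment of a LinkedIn URL, e.g. 'company' or 'school'."""
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    return segments[0] if segments else "unknown"


urls = [
    "https://in.linkedin.com/company/bytexl",          # taken from the table above
    "https://www.linkedin.com/school/example-school",  # hypothetical school-format link
]
for url in urls:
    print(url, "->", linkedin_page_type(url))
```

Links classified as anything other than "company" could be set aside for manual review instead of being sent to the scraper.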

Document Technical Information

This section explains the technical details of how we collected and processed the data. Here is an easy-to-understand breakdown:

  1. Guest Mode: The code browses LinkedIn as a guest: it does not log in and works only with publicly visible pages.
  2. Fetching Data: It fetches web pages from LinkedIn and other company websites to get the HTML content. This content includes all the visible information about a company.
  3. Parsing Data: After getting the HTML content, the code extracts the important details like company name, address, number of employees, and description. This is done with HTML parsing libraries (parsel and BeautifulSoup), plus jmespath to pick fields out of the structured JSON data embedded in the page.
  4. Storing Data: The extracted information is then organized into tables (dataframes) and saved as Excel files. This makes it easy to review and analyze the data later.
  5. Error Handling: If a request fails, the code retries with an increasing delay (up to three attempts in total) before giving up. This ensures that as much data as possible is collected reliably.

This approach allowed us to gather comprehensive information about each company, which was then analyzed to identify potential business opportunities for edunext.
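
The error-handling step can be sketched on its own as a retry loop with exponential backoff. This is a simplified stand-in, not the project's function: `fetch_with_backoff`, the injected `fetch` and `sleep` callables, and the use of `ConnectionError` are assumptions chosen so the example runs without network access.

```python
import asyncio


async def fetch_with_backoff(fetch, url, retries=3, backoff_factor=0.5, sleep=asyncio.sleep):
    """Await fetch(url) up to `retries` times.

    After a failed attempt, wait backoff_factor * 2**attempt seconds
    (0.5, 1.0, 2.0, ...) before trying again.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return await fetch(url)
        except ConnectionError as error:  # stand-in for a transport-level error
            last_error = error
            if attempt < retries - 1:
                await sleep(backoff_factor * (2 ** attempt))
    raise RuntimeError(f"All {retries} attempts failed for {url}") from last_error


async def demo():
    calls = {"count": 0}

    async def flaky_fetch(url):
        # Hypothetical fetcher: fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return f"<html>content of {url}</html>"

    waits = []

    async def record_sleep(seconds):
        waits.append(seconds)  # record the delay instead of actually waiting

    html = await fetch_with_backoff(flaky_fetch, "https://example.com", sleep=record_sleep)
    return html, calls["count"], waits


html, attempts, waits = asyncio.run(demo())
print(attempts, waits)
```

Passing the sleep function as a parameter keeps the delays non-blocking in real use (`asyncio.sleep`) while letting a test record them without waiting.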


import jmespath
import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response, RemoteProtocolError
from parsel import Selector
from loguru import logger as log
import pandas as pd
import re
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
import os
import time
import requests

# Initialize an async httpx client
client = AsyncClient(
    http2=True,
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
    },
    follow_redirects=True
)

def strip_text(text):
    """Remove extra spaces while handling None values."""
    return text.strip() if text is not None else text

def get_actual_url(link):
    """Return the target of a redirect link (its 'url' query parameter), or the link itself."""
    parsed_url = urlparse(link)
    query_params = parse_qs(parsed_url.query)
    return query_params['url'][0] if 'url' in query_params else link

def parse_company(response_text: str) -> Dict:
    """Parse company main overview page."""
    selector = Selector(response_text)
    script_data = selector.xpath("//script[@type='application/ld+json']/text()").get()
    if script_data:
        script_data = json.loads(script_data)
    else:
        script_data = {}
    script_data = jmespath.search(
        """{
        name: name,
        url: url,
        mainAddress: address,
        description: description,
        numberOfEmployees: numberOfEmployees.value,
        logo: logo
        }""",
        script_data
    ) or {}
    # "About us" section: pair each <dt> label with its <dd> value
    data = {}
    for element in selector.xpath("//div[contains(@data-test-id, 'about-us')]"):
        name = strip_text(element.xpath(".//dt/text()").get())
        value = strip_text(element.xpath(".//dd/text()").get())
        if name:  # skip rows without a label instead of failing on None
            data[name] = value
    # Office addresses other than the first block (id 'address-0')
    addresses = []
    for element in selector.xpath("//div[contains(@id, 'address') and @id != 'address-0']"):
        address_lines = element.xpath(".//p/text()").getall()
        address = ", ".join(line.replace("\n", "").strip() for line in address_lines)
        addresses.append(address)
    # The key "linkeinUrl" is kept as spelled because run() reads it under that name
    affiliated_pages = []
    for element in selector.xpath("//section[@data-test-id='affiliated-pages']/div/div/ul/li"):
        affiliated_pages.append({
            "name": strip_text(element.xpath(".//a/div/h3/text()").get()),
            "industry": strip_text(element.xpath(".//a/div/p[1]/text()").get()),
            "address": strip_text(element.xpath(".//a/div/p[2]/text()").get()),
            "linkeinUrl": (element.xpath(".//a/@href").get() or "").split("?")[0]
        })
    similar_pages = []
    for element in selector.xpath("//section[@data-test-id='similar-pages']/div/div/ul/li"):
        similar_pages.append({
            "name": strip_text(element.xpath(".//a/div/h3/text()").get()),
            "industry": strip_text(element.xpath(".//a/div/p[1]/text()").get()),
            "address": strip_text(element.xpath(".//a/div/p[2]/text()").get()),
            "linkeinUrl": (element.xpath(".//a/@href").get() or "").split("?")[0]
        })

    # Additional fields from the second script
    soup = BeautifulSoup(response_text, 'html.parser')
    title_tag = soup.find('title')
    designation_tag = soup.find('h2')
    followers_tag = soup.find('meta', {"property": "og:description"})
    description_tag = soup.find('p', class_='break-words')
    website_tag = soup.find('a', attrs={'data-tracking-control-name': 'about_website'})
    website = get_actual_url(website_tag['href']) if website_tag else "Website not found"
    description_span = soup.find('h4', class_='top-card-layout__second-subline')
    description = description_span.get_text(strip=True) if description_span else "Description not found"
    
    # Crunchbase funding information
    funding_section = soup.find('section', attrs={'data-test-id': 'funding'})
    if funding_section:
        all_rounds_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_all-rounds'})
        if all_rounds_tag:
            all_rounds_match = re.search(r'(\d+ total rounds)', all_rounds_tag.get_text(strip=True))
            all_rounds_info = all_rounds_match.group(1) if all_rounds_match else "All rounds info not found"
        else:
            all_rounds_info = "All rounds info not found"

        last_round_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_last-round'})
        if last_round_tag:
            last_round_info = last_round_tag.find('time').get_text(strip=True)
            last_round_amount_tag = funding_section.find('p', class_='text-display-lg')
            last_round_amount = last_round_amount_tag.get_text(strip=True) if last_round_amount_tag else "Last round amount not found"
            last_round_link = last_round_tag['href']
            last_round_formatted_date = last_round_tag.find('time')['datetime']
        else:
            last_round_info = "Last round info not found"
            last_round_amount = "Last round amount not found"
            last_round_link = "Last round link not found"
            last_round_formatted_date = "Last round date not found"

        investors_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_investors'})
        investors_info = investors_tag.get_text(strip=True) if investors_tag else "Investors info not found"
    else:
        all_rounds_info = "Crunchbase funding info not found"
        last_round_info = "Last round info not found"
        last_round_amount = "Last round amount not found"
        investors_info = "Investors info not found"
        last_round_link = "Last round link not found"
        last_round_formatted_date = "Last round date not found"

    # Check if the tags are found before calling get_text()
    name = title_tag.get_text(strip=True).split("|")[0].strip() if title_tag else "Profile Name not found"
    designation = designation_tag.get_text(strip=True) if designation_tag else "Designation not found"
    followers_match = re.search(r'\b(\d[\d,.]*)\s+followers\b', followers_tag["content"]) if followers_tag else None
    followers_count = followers_match.group(1) if followers_match else "Followers count not found"
    description_profile = description_tag.get_text(strip=True) if description_tag else "Profile Description not found"

    additional_data = {
        "profileName": name,
        "designation": designation,
        "followersCount": followers_count,
        "profileDescription": description_profile,
        "website": website,
        "crunchbaseAllRoundsInfo": all_rounds_info,
        "crunchbaseLastRoundInfo": last_round_info,
        "crunchbaseLastRoundAmount": last_round_amount,
        "crunchbaseInvestorsInfo": investors_info,
        "lastRoundFormattedDate": last_round_formatted_date,
        "crunchbaseLink": last_round_link,
    }

    data = {**script_data, **data, **additional_data}
    data["addresses"] = addresses    
    data["affiliatedPages"] = affiliated_pages
    data["similarPages"] = similar_pages
    return data

def read_links_from_file(file_path: str) -> List[str]:
    """Read URLs from a text or Excel file."""
    if file_path.endswith('.txt'):
        with open(file_path, 'r') as file:
            urls = file.read().splitlines()
    elif file_path.endswith('.xlsx'):
        df = pd.read_excel(file_path)
        urls = df['Links'].tolist()
    else:
        raise ValueError("Unsupported file format. Please use a .txt or .xlsx file.")
    return urls

# Initialize dataframes globally
df_company_info = pd.DataFrame()
df_company_addresses = pd.DataFrame()
df_affiliated_pages = pd.DataFrame()
df_similar_pages = pd.DataFrame()

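async def fetch_with_retry(url, retries=3, backoff_factor=0.5):
    """Fetch a URL with the async client, retrying on protocol errors with exponential backoff."""
    for attempt in range(retries):
        try:
            response = await client.get(url)
            return response
        except RemoteProtocolError as e:
            log.error(f"Attempt {attempt + 1} for {url} failed: {e}")
            # asyncio.sleep yields to the event loop; time.sleep would block it
            await asyncio.sleep(backoff_factor * (2 ** attempt))
    raise Exception(f"All {retries} attempts failed for {url}")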

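def fetch_with_requests(url, retries=3, backoff_factor=0.5):
    """Blocking fallback using requests; return the first response with status 200."""
    headers = {
        "User-Agent": "Guest",
    }
    for attempt in range(retries):
        try:
            # The 30-second timeout is an added safeguard so a stalled request cannot hang the run
            response = requests.get(url, headers=headers, timeout=30)
            if response.status_code == 200:
                return response
            log.warning(f"Attempt {attempt + 1} for {url} returned status {response.status_code}")
        except requests.RequestException as e:
            log.error(f"Attempt {attempt + 1} for {url} failed: {e}")
        # Back off before the next attempt, whether it raised or returned a non-200 status
        time.sleep(backoff_factor * (2 ** attempt))
    raise Exception(f"All {retries} attempts failed for {url}")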

async def scrape_company(urls: List[str]) -> tuple[List[Dict], List[str]]:
    """Scrape public LinkedIn company pages; return (parsed companies, failed URLs)."""
    data = []
    failed_links = []
    for url in urls:
        try:
            response = await fetch_with_retry(url)
            if response.status_code == 200:
                data.append(parse_company(response.text))
                log.success(f"Successfully scraped {url}")
            elif response.status_code == 999:  # Use requests as fallback
                log.warning(f"Status code 999 for {url}, switching to requests")
                response = fetch_with_requests(url)
                if response.status_code == 200:
                    data.append(parse_company(response.text))
                    log.success(f"Successfully scraped {url} with requests fallback")
                else:
                    failed_links.append(url)
                    log.error(f"Failed to scrape {url} with status code {response.status_code}")
            else:
                failed_links.append(url)
                log.error(f"Failed to scrape {url} with status code {response.status_code}")
            # Delay between requests to avoid rate limiting (non-blocking inside a coroutine)
            await asyncio.sleep(1)
        except Exception as e:
            failed_links.append(url)
            log.error(f"Error scraping {url}: {e}")
    return data, failed_links

async def run():
    urls = read_links_from_file('profiles.txt')  # Using the 'profiles.txt' file
    profile_data, failed_links = await scrape_company(urls)
    
    global df_company_info, df_company_addresses, df_affiliated_pages, df_similar_pages

    for company in profile_data:
        main_address = company.get('mainAddress', {})
        company_info = {
            "name": company.get("name"),
            "url": company.get("url"),
            "streetAddress": main_address.get("streetAddress") if main_address else None,
            "addressLocality": main_address.get("addressLocality") if main_address else None,
            "addressRegion": main_address.get("addressRegion") if main_address else None,
            "postalCode": main_address.get("postalCode") if main_address else None,
            "addressCountry": main_address.get("addressCountry") if main_address else None,
            "description": company.get("description"),
            "numberOfEmployees": company.get("numberOfEmployees"),
            "Industry": company.get("Industry"),
            "Company size": company.get("Company size"),
            "Headquarters": company.get("Headquarters"),
            "Type": company.get("Type"),
            "Specialties": company.get("Specialties"),
            "profileName": company.get("profileName"),
            "designation": company.get("designation"),
            "followersCount": company.get("followersCount"),
            "profileDescription": company.get("profileDescription"),
            "website": company.get("website"),
            "crunchbaseAllRoundsInfo": company.get("crunchbaseAllRoundsInfo"),
            "crunchbaseLastRoundInfo": company.get("crunchbaseLastRoundInfo"),
            "crunchbaseLastRoundAmount": company.get("crunchbaseLastRoundAmount"),
            "crunchbaseInvestorsInfo": company.get("crunchbaseInvestorsInfo"),
            "lastRoundFormattedDate": company.get("lastRoundFormattedDate"),
            "crunchbaseLink": company.get("crunchbaseLink"),
        }
        df_company_info = pd.concat([df_company_info, pd.DataFrame([company_info])])

        for address in company.get("addresses", []):
            parts = address.split(", ")
            country = parts[-1] if parts else ""
            company_address = {
                "name": company.get("name"),
                "url": company.get("url"),
                "addresses": address,
                "country offices": country
            }
            df_company_addresses = pd.concat([df_company_addresses, pd.DataFrame([company_address])])

        for affiliated in company.get("affiliatedPages", []):
            affiliated_page = {
                "name": company.get("name"),
                "url": company.get("url"),
                "affiliated_name": affiliated["name"],
                "industry": affiliated["industry"],
                "address": affiliated["address"],
                "linkeinUrl": affiliated["linkeinUrl"]
            }
            df_affiliated_pages = pd.concat([df_affiliated_pages, pd.DataFrame([affiliated_page])])

        for similar in company.get("similarPages", []):
            similar_page = {
                "name": company.get("name"),
                "url": company.get("url"),
                "similar_name": similar["name"],
                "industry": similar["industry"],
                "address": similar["address"],
                "linkeinUrl": similar["linkeinUrl"]
            }
            df_similar_pages = pd.concat([df_similar_pages, pd.DataFrame([similar_page])])

        # Save to Excel after each company
        df_company_info.to_excel("company_information.xlsx", index=False)
        df_company_addresses.to_excel("company_addresses.xlsx", index=False)
        df_affiliated_pages.to_excel("affiliated_pages.xlsx", index=False)
        df_similar_pages.to_excel("similar_pages.xlsx", index=False)

    if failed_links:
        df_failed_links = pd.DataFrame({"failed_links": failed_links})
        df_failed_links.to_excel("failed_links.xlsx", index=False)

if __name__ == "__main__":
    try:
        asyncio.run(run())
    except Exception as e:
        log.error(f"Script terminated due to an error: {e}")

        # Save what has been scraped so far
        if not df_company_info.empty:
            df_company_info.to_excel("company_information.xlsx", index=False)
        if not df_company_addresses.empty:
            df_company_addresses.to_excel("company_addresses.xlsx", index=False)
        if not df_affiliated_pages.empty:
            df_affiliated_pages.to_excel("affiliated_pages.xlsx", index=False)
        if not df_similar_pages.empty:
            df_similar_pages.to_excel("similar_pages.xlsx", index=False)
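
The first part of the parsing step, reading the structured JSON-LD data embedded in the page, can be illustrated with the standard library alone. The sketch below is a simplified stand-in for that part of `parse_company`, which uses parsel and jmespath; the class and function names and the sample HTML are hypothetical, and real LinkedIn markup may differ.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def company_summary(html: str) -> dict:
    """Return name, description and employee count from the first JSON-LD block, if any."""
    parser = JsonLdExtractor()
    parser.feed(html)
    if not parser.blocks:
        return {}
    try:
        payload = json.loads(parser.blocks[0])
    except json.JSONDecodeError:
        return {}
    if not isinstance(payload, dict):
        return {}
    employees = payload.get("numberOfEmployees") or {}
    return {
        "name": payload.get("name"),
        "description": payload.get("description"),
        "numberOfEmployees": employees.get("value"),
    }


# Hypothetical page fragment used only for this example.
sample_html = """
<html><head>
<script type="application/ld+json">
{"name": "Example Co", "description": "Demo", "numberOfEmployees": {"value": 42}}
</script>
</head><body></body></html>
"""
print(company_summary(sample_html))
```

Returning an empty dictionary when the block is missing or malformed mirrors the script's behavior of falling back to `{}` so that later `.get()` calls do not fail.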


The data collection and analysis process generates four main outputs.

1. LinkedIn Info

The first is called “LinkedIn Info” and contains the following columns, shown here with the values for one example company:

| Column | Example value |
|---|---|
| Company Name | Eupheus Learning |
| URL | https://in.linkedin.com/company/eupheus-learning |
| Street Address | A-12, Mohan Co-operative Industrial Estate |
| Locality | New Delhi |
| Region | New Delhi |
| Postal Code | 110044 |
| Country | IN |
| Description | Eupheus in Greek means - “Active seeking of knowledge” Our Vision is to offer pedagogically differentiated technology driven solutions that lead to critical thinking and achievement of higher learning outcomes by seamlessly integrating in-class and at home learning in the private school segment of the Pre-K to 12 market. Our aim is to bridge the gap between what is taught in-class using institutional textbook driven solutions and retail at-home learning providers by seamlessly integrating both. |
| Number of Employees | 275 |
| Industry | E-Learning Providers |
| Company Size | 51-200 employees |
| Headquarters | New Delhi, New Delhi |
| Company Type | Privately Held |
| Specialties | Education, K-12, Curricular, E Learning, Pre Primary, Middle School Solutions, Senior School Solutions, Digital Reference Resources, Language Learning, Primary School Solutions, Teacher Support, Age Appropriate Resource, Digital, Learning, Print, Live Books, Fiction e Books, Coding, Kinesthetic Learning, CBSE Aligned Text Book, ICSE Aligned Text Book, Digital Library, Reading Program, Atal Tinkering Lab, and TOEFL |
| Profile Name | Eupheus Learning |
| Designation | E-Learning Providers |
| Followers Count | 9,981 |
| Profile Description | Eupheus in Greek means - “Active seeking of knowledge” Our Vision is to offer pedagogically differentiated technology driven solutions that lead to critical thinking and achievement of higher learning outcomes by seamlessly integrating in-class and at home learning in the private school segment of the Pre-K to 12 market. Our aim is to bridge the gap between what is taught in-class using institutional textbook driven solutions and retail at-home learning providers by seamlessly integrating both. |
| Website | https://www.eupheus.in |
| Crunchbase All Rounds Info | 4 total rounds |
| Crunchbase Last Round Info | Oct 14, 2021 |
| Crunchbase Last Round Amount | US$ 10.0M |
| Crunchbase Investors Info | Lightrock |
| Last Round Formatted Date | 14/10/2021 |
| Crunchbase Link | Crunchbase Link |
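
The followers count is pulled from the page's og:description text with a regular expression. The sketch below reuses the pattern from the scraper; the helper name and the sample sentence are hypothetical, and the real wording of the LinkedIn description may differ.

```python
import re
from typing import Optional

# Same pattern as in the scraper: a number (digits, commas, dots) followed by "followers".
FOLLOWERS_PATTERN = re.compile(r"\b(\d[\d,.]*)\s+followers\b")


def followers_from_description(text: str) -> Optional[str]:
    """Return the follower count exactly as written (for example '9,981'), or None."""
    match = FOLLOWERS_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical og:description value used only for illustration.
sample = "Eupheus Learning | 9,981 followers on LinkedIn."
print(followers_from_description(sample))
print(followers_from_description("No count here"))
```

The value is kept as a string because the thousands separator depends on the page's locale, so converting it to a number would need an extra, locale-aware step.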

For more detailed information, you can refer to the complete dataset here.

2. Affiliated Pages LinkedIn

The second output is called “Affiliated Pages LinkedIn” and contains information about affiliated pages (i.e., other pages related to the same company). Below is an example of the data collected:

| Column | Example value |
|---|---|
| Name | upGrad |
| URL | https://in.linkedin.com/company/ueducation |
| Affiliated Name | upGrad Placements |
| Industry | Human Resources Services |
| Address | |
| LinkedIn URL | https://in.linkedin.com/company/upgrad-placements- |

For more detailed information, you can refer to the complete dataset here.

3. Similar Pages LinkedIn

The third output is called “Similar Pages LinkedIn” and contains the similar pages recommended by the LinkedIn algorithm. These are pages that LinkedIn suggests as being similar based on various factors such as industry, company size, and other attributes. Below is an example of the data collected:

| Column | Example value |
|---|---|
| Name | byteXL |
| URL | https://in.linkedin.com/company/bytexl |
| Similar Company | CODINGCLUB |
| Industry | Education |
| Address | Vadodara, GUJARAT |
| LinkedIn URL | https://in.linkedin.com/company/codingclub36 |

For more detailed information, you can refer to the complete dataset here.

4. Country Offices Addresses LinkedIn

The fourth output is called “Country Offices Addresses LinkedIn” and contains information about the different office locations and countries where the company has a presence. Below is an example of the data collected:

| Column | Example value |
|---|---|
| Name | byteXL |
| URL | https://in.linkedin.com/company/bytexl |
| Addresses | Plano, TX, US |
| Country Offices | US |
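
The “Country Offices” value is derived from the address string by taking its last comma-separated part, as the scraper's `run()` function does. A minimal sketch follows, assuming every address ends with the country, which may not hold for all LinkedIn entries; the helper name is hypothetical.

```python
def office_country(address: str) -> str:
    """Return the last comma-separated part of an address, assumed to be the country."""
    parts = [part.strip() for part in address.split(",") if part.strip()]
    return parts[-1] if parts else ""


print(office_country("Plano, TX, US"))
print(repr(office_country("")))
```

An address that lacks a country (for example only a city) would yield the city here, so this column is best treated as a heuristic.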

For more detailed information, you can refer to the complete dataset here.

🪟 Website Scraper

After the list of 1,000 EdTech startups from HolonIQ had been compiled with its LinkedIn links and the LinkedIn information had been scraped, the team built a second scraper for the companies’ own websites. To ensure greater accuracy, it used the website URLs retrieved by the LinkedIn scraper rather than those added manually during the initial list generation. The text of each site was cleaned and written to a txt file to capture more comprehensive details about each company.
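
Website links shown on a LinkedIn page can be wrapped in a redirect whose real target sits in a `url` query parameter, which is why the LinkedIn scraper unwraps them in `get_actual_url` before storing the website. The sketch below shows the same idea in isolation; the function name and the redirect host and path are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def unwrap_redirect(link: str) -> str:
    """Return the 'url' query parameter if present, otherwise the link unchanged."""
    query = parse_qs(urlparse(link).query)
    return query["url"][0] if "url" in query else link


# Hypothetical redirect link built for illustration only.
wrapped = "https://www.linkedin.com/redir/redirect?" + urlencode({"url": "https://www.eupheus.in"})
print(unwrap_redirect(wrapped))
print(unwrap_redirect("https://www.eupheus.in"))
```

`parse_qs` also decodes the percent-encoding, so the stored website is a plain URL that the website scraper can request directly.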

Document Technical Information

This section explains the technical details of how we collected and processed the data. Here is an easy-to-understand breakdown:

  1. Cleaning Text: The clean_text function collapses runs of whitespace (including non-breaking spaces) into single spaces and trims the ends, making the text more readable.
  2. Language Detection: The detect_language_from_html function reads the language from the lang attribute of the page’s html tag; if that attribute is missing, detect_language infers the language from the text itself.
  3. Fetching HTML Content: The get_html_with_selenium function uses Selenium to load web pages and ensure all content is captured, especially for pages that load content dynamically.
  4. Saving Web Page Content: The save_formatted_webpage_content function retrieves the page, extracts and cleans its paragraph text, detects the language, and saves the result to a text file. If the first attempt yields fewer than 300 words (the script’s “token count”), it switches to Selenium for a more thorough scrape.
  5. Handling Requests and Errors: The same function catches request errors and other exceptions and reports them instead of stopping the run.
  6. Main Processing Loop: The main function reads input data from an Excel file, processes each company’s website, and saves the results and any errors to separate Excel files.

This approach allowed us to gather comprehensive information about each company, which was then analyzed to identify potential business opportunities for edunext.
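
The fallback rule in step 4 can be sketched independently of any HTTP or browser library. In the sketch below the two fetchers are passed in as plain functions so the decision logic can be run on its own; the function names and the fake fetchers are hypothetical, and only the 300-word threshold comes from the script.

```python
from typing import Callable, Tuple

MIN_WORDS = 300  # threshold used by the website scraper


def word_count(text: str) -> int:
    """Count whitespace-separated words (what the script calls the token count)."""
    return len(text.split())


def fetch_text_with_fallback(
    url: str,
    fast_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
    min_words: int = MIN_WORDS,
) -> Tuple[str, str]:
    """Return (text, method), using the browser only when the fast fetch is too short."""
    text = fast_fetch(url)
    if word_count(text) >= min_words:
        return text, "requests"
    return browser_fetch(url), "selenium"


# Hypothetical stand-ins for the real fetchers.
def fake_fast_fetch(url: str) -> str:
    return "Loading..."  # a JavaScript-heavy page may expose almost no static text


def fake_browser_fetch(url: str) -> str:
    return "word " * 350


text, method = fetch_text_with_fallback("https://example.com", fake_fast_fetch, fake_browser_fetch)
print(method, word_count(text))
```

Trying the lightweight request first keeps most pages fast, and the slower headless browser is reserved for sites whose content is rendered client-side.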


import os
import re
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
from langdetect import detect, LangDetectException
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

from webdriver_manager.chrome import ChromeDriverManager

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.replace('\xa0', ' ')
    return text.strip()

def detect_language_from_html(soup):
    html_tag = soup.find('html')
    if html_tag and html_tag.get('lang'):
        return html_tag['lang']
    else:
        return None

def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        return 'unknown'

def get_html_with_selenium(url):
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-features=SameSiteByDefaultCookies")
    options.add_argument("--disable-features=CookiesWithoutSameSiteMustBeSecure")
    options.add_argument("--log-level=3")  # Reduce logging output
    
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    try:
        driver.get(url)

        try:
            # Wait for the body content to be loaded
            WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.TAG_NAME, 'body')))
            # Scroll to the bottom to ensure all content is loaded
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(5)  # Wait for content to load
            # Additionally, wait for at least one paragraph, taken as a sign that the content has rendered
            WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//p")))
        except Exception as e:
            print(f"Error waiting for the page to load: {e}")

        html_content = driver.page_source
    finally:
        # Always close the browser, even if loading the page raised an error
        driver.quit()

    return html_content

def save_formatted_webpage_content(url, company_name):
    try:
        print(f"Retrieving content from {url} for {company_name}...")

        response = requests.get(url, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        html_content = response.text

        soup = BeautifulSoup(html_content, 'html.parser')
        paragraphs = soup.find_all('p')

        paragraph_texts = '\n\n'.join([clean_text(p.get_text()) for p in paragraphs])
        token_count = len(paragraph_texts.split())

        # If the token count is less than 300, use Selenium
        if token_count < 300:
            print(f"Insufficient token count retrieved from {url} using requests. Switching to Selenium...")
            html_content = get_html_with_selenium(url)
            soup = BeautifulSoup(html_content, 'html.parser')
            paragraphs = soup.find_all('p')
            paragraph_texts = '\n\n'.join([clean_text(p.get_text()) for p in paragraphs])
            token_count = len(paragraph_texts.split())

        # Check if HTML content is retrieved
        if not soup.body:
            print(f"No body content found for {company_name} at {url}.")
            return None, None, f"No body content found at {url}"
        
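        # soup.title.string is None for an empty or nested <title>, so check it before cleaning
        title = clean_text(soup.title.string) if soup.title and soup.title.string else 'No Title Found'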
        formatted_content = f"Title: {title}\n\n{paragraph_texts}"

        # Detect language
        language = detect_language_from_html(soup)
        if not language:
            language = detect_language(paragraph_texts)
        
        filepath = os.path.join('webpages', f"{company_name}.txt")
        with open(filepath, 'w', encoding='utf-8') as file:
            file.write(formatted_content)
        
        print(f"Content retrieved and saved for {company_name}. Token count: {token_count}, Language: {language}.")
        return token_count, language, None
    except requests.exceptions.RequestException as req_err:
        print(f"Request error for {company_name}: {req_err}")
        return None, None, f"Request error: {req_err}"
    except Exception as err:
        print(f"Error for {company_name}: {err}")
        return None, None, f"Error: {err}"

def main():
    input_filename = 'cleaned_company_information.xlsx'
    output_data_filename = 'company_token_counts.xlsx'
    output_errors_filename = 'website_errors.xlsx'
    
    # Read the input file
    df = pd.read_excel(input_filename)
    
    # Ensure the 'webpages' directory exists
    os.makedirs('webpages', exist_ok=True)
    
    # Lists to store results and errors
    results = []
    errors = []

    # Iterate over the rows in the DataFrame
    for index, row in df.iterrows():
        print(f"Processing {index + 1}/{len(df)}: {row['Company Name']} ({row['Website']})")
        company_name = row['Company Name']
        website = row['Website']
        token_count, language, error = save_formatted_webpage_content(website, company_name)
        
        if token_count is not None:
            results.append({'Company Name': company_name, 'Website': website, 'Token Count': token_count, 'Language': language})
        if error is not None:
            errors.append({'Company Name': company_name, 'Website': website, 'Error': error})
    
    # Save the results to an Excel file
    results_df = pd.DataFrame(results)
    results_df.to_excel(output_data_filename, index=False)
    
    # Save the errors to an Excel file
    errors_df = pd.DataFrame(errors)
    errors_df.to_excel(output_errors_filename, index=False)

    print(f"Process completed. Data saved to {output_data_filename} and errors saved to {output_errors_filename}.")

if __name__ == '__main__':
    main()
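
The script above uses the company name directly as the file name, so a name containing characters such as / or : could fail to save or land in an unexpected path. The following is a minimal sketch of one way to guard against that. The helper safe_filename is hypothetical and is not part of the original script.

```python
import re


def safe_filename(company_name: str, max_length: int = 100) -> str:
    """Turn a company name into a string that is safe to use as a file name."""
    # Replace characters that are invalid on common file systems with "_"
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", company_name)
    # Collapse runs of whitespace, then trim spaces and dots from both ends
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Fall back to a placeholder if nothing usable is left
    return cleaned[:max_length] or "unnamed"
```

If a helper like this were adopted, the later analysis script would need to apply the same function when it reads the text files back, so that both steps agree on the file name.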

  

The web scraping step produces a folder 📁 (webpages) of text files, one per company. Each file holds the page title and the paragraph text extracted from that company’s website, organized so it can be passed to the model in the next step.

As shown in the following example:

🔧 Integration with ChatGPT

The next step used the OpenAI API with the GPT-3.5 model. For each company, the LinkedIn information and the text file with its website content were passed to the model together with a set of questions designed to build a profile of the company. The questions covered the company’s connection to the Open edX platform, its main product, competitors, customers, B2B or B2C focus, number of customers, country, potential role for Edunext, potential vertical, and five main capabilities.

The code leverages additional data from several Excel tables to enrich the analysis:

📄 Role Definitions
Provides detailed definitions for roles like client, partner, competitor, and provider, as well as information about Edunext.

| Role | Detailed Definitions |
| --- | --- |
| Edunext | Edunext is the company driving this analysis. It is a software and services company dedicated to the Open edX platform. |
| CLIENT | An entity that is likely to require hosting, maintenance, or professional services for the Open edX platform. |
| PARTNER | An entity that provides instructional design services or a technology aggregator that may subcontract hosting or professional services for the Open edX platform to Edunext, as it is not part of its core business. It can also be an entity that provides a tool for online learning that can be integrated into the Open edX LMS or uses standard interoperability protocols such as LTI. |
| COMPETITOR | An entity that has Open edX hosting, maintenance, or custom development as part of their core business offering and expertise. |
| PROVIDER | An entity that supplies services that may be relevant and useful to Edunext to enhance its value proposition or raise its productivity. |
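
A minimal sketch of how a definitions table like this one could be turned into a block of prompt context. The column names Role and Detailed Definitions are assumed from the table header, and build_role_context is a hypothetical helper, not a function from the project’s script.

```python
import pandas as pd


def build_role_context(role_definitions: pd.DataFrame) -> str:
    """Format one 'ROLE: definition' line per row, under a short heading."""
    lines = [
        f"{row['Role']}: {row['Detailed Definitions']}"
        for _, row in role_definitions.iterrows()
    ]
    return "Role definitions:\n" + "\n".join(lines)


# Illustrative two-row table; the real definitions would be read from the Excel file
example_roles = pd.DataFrame({
    "Role": ["CLIENT", "COMPETITOR"],
    "Detailed Definitions": [
        "An entity that is likely to require hosting, maintenance, or professional services for the Open edX platform.",
        "An entity that has Open edX hosting, maintenance, or custom development as part of their core business offering and expertise.",
    ],
})
role_context = build_role_context(example_roles)
```

The resulting string could be prepended to the role question so that the model classifies each company against the written definitions rather than its own reading of the role names.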

📄 Verticals

These additional sources give the GPT-3.5 model explicit definitions to work from, which is intended to make its responses more precise and contextually relevant.

| Vertical Name | Vertical Definition |
| --- | --- |
| K-12 Education | Technologies and platforms aimed at primary and secondary education. |
| Higher Education | Solutions tailored for colleges, universities, and other tertiary education institutions. |
| Professional Development & Corporate Training | Tools and programs designed for employee training, upskilling, and professional certifications. |
| Language Learning | Apps, platforms, and services focused on teaching new languages. |
| STEM Education | Resources and tools specific to Science, Technology, Engineering, and Mathematics education. |
| Learning Management Systems (LMS) | Platforms that provide a comprehensive management system for learning processes, often used by institutions and corporations. |
| Tutoring and Mentoring | Services and platforms that connect students with tutors and mentors. |
| Online Courses and MOOCs (Massive Open Online Courses) | Platforms offering a variety of courses across different subjects, typically available to a large audience. |
| Content Creation and Publishing | Tools and platforms for creating, sharing, and publishing educational content. |
| Edutainment | Educational tools and resources that incorporate entertainment, such as educational games and interactive learning tools. |
| Special Education | Technologies designed to support learners with special needs. |
| Assessment and Testing | Solutions that focus on student evaluation, testing, and examination. |
| Early Childhood Education | Platforms and tools aimed at pre-K education. |
| Virtual and Augmented Reality (VR/AR) in Education | Immersive technologies used to enhance the learning experience. |
| Coding and Programming | Platforms and tools that teach coding and programming skills. |
| Collaboration and Communication Tools | Solutions that facilitate communication and collaboration among students, teachers, and parents. |
| School Administration and Management | Systems designed to manage school operations, such as attendance, grades, and resource planning. |
| Educational Hardware | Devices and physical technologies used in educational settings, like tablets, interactive whiteboards, and robotics kits. |
| Adaptive Learning | Technologies that use data and analytics to personalize the learning experience. |
| EdTech Infrastructure | Backend technologies and services that support educational platforms and tools, like cloud services, data management, and cybersecurity solutions. |

📄 Capabilities

The following capabilities were used to assess the potential and competencies of each company. They are grouped into main categories and sub-categories and are based on the HolonIQ Digital Capability Framework. HolonIQ presents the framework as an image; the table below restates it as text so that the capabilities could be passed to the ChatGPT API and interpreted more reliably.

| Main Category | Sub-Category | Capability |
| --- | --- | --- |
| DEMAND AND DISCOVERY | PRODUCT STRATEGY | MARKET INSIGHTS & TRENDS |
| DEMAND AND DISCOVERY | PRODUCT STRATEGY | UNDERSTAND CUSTOMER NEEDS |
| DEMAND AND DISCOVERY | PRODUCT STRATEGY | COMPETITORS & ALTERNATES |
| DEMAND AND DISCOVERY | PRODUCT STRATEGY | NEW BUSINESS MODELS |
| DEMAND AND DISCOVERY | PRODUCT STRATEGY | B2B RECRUITMENT & PARTNERSHIPS |
| DEMAND AND DISCOVERY | MARKETING PROCESSES | STUDENT RELATIONSHIP MANAGEMENT (CRM) |
| DEMAND AND DISCOVERY | MARKETING PROCESSES | COMMS & CAMPAIGN MANAGEMENT |
| DEMAND AND DISCOVERY | MARKETING PROCESSES | MARKETING AUTOMATION |
| DEMAND AND DISCOVERY | MARKETING PROCESSES | SOCIAL MEDIA & COMMUNITY MANAGEMENT |
| DEMAND AND DISCOVERY | STUDENT RECRUITMENT | RECRUITMENT EVENTS |
| DEMAND AND DISCOVERY | STUDENT RECRUITMENT | CHANNEL PARTNERSHIPS |
| DEMAND AND DISCOVERY | STUDENT RECRUITMENT | SCHOOLS & COMMUNITY OUTREACH |
| DEMAND AND DISCOVERY | STUDENT RECRUITMENT | SCHOLARSHIP PROGRAM |
| DEMAND AND DISCOVERY | ENROLLMENT MANAGEMENT | COURSE SELECTION & GUIDANCE |
| DEMAND AND DISCOVERY | ENROLLMENT MANAGEMENT | APPLICATION & ADMISSIONS |
| DEMAND AND DISCOVERY | ENROLLMENT MANAGEMENT | RECOGNIZING PRIOR LEARNING |
| DEMAND AND DISCOVERY | ENROLLMENT MANAGEMENT | TUITION FINANCING |
| LEARNING DESIGN | CURRICULUM DESIGN | DIGITAL DESIGN PRINCIPLES |
| LEARNING DESIGN | CURRICULUM DESIGN | PROGRAM ARCHITECTURE |
| LEARNING DESIGN | CURRICULUM DESIGN | LEARNING ENVIRONMENTS & PLATFORMS |
| LEARNING DESIGN | CURRICULUM DESIGN | LEARNING DELIVERY MODELS |
| LEARNING DESIGN | CURRICULUM DESIGN | ACCREDITATION |
| LEARNING DESIGN | CURRICULUM DESIGN | CURRICULUM QUALITY MANAGEMENT |
| LEARNING DESIGN | DIGITAL CONTENT & COURSEWARE | DIGITAL CONTENT CREATION |
| LEARNING DESIGN | DIGITAL CONTENT & COURSEWARE | IMMERSION, SIMULATION & LAB |
| LEARNING DESIGN | DIGITAL CONTENT & COURSEWARE | OER & CONTENT LICENSING |
| LEARNING DESIGN | DIGITAL CONTENT & COURSEWARE | MANAGING INTEGRATED CONTENT |
| LEARNING DESIGN | SUBJECT MATTER EXPERTISE | DESIGNING FOR DIGITAL LEARNING |
| LEARNING DESIGN | SUBJECT MATTER EXPERTISE | FACULTY EXPERTISE & SPECIALISMS |
| LEARNING DESIGN | SUBJECT MATTER EXPERTISE | SOURCING & MANAGING EXPERTISE |
| LEARNING DESIGN | SUBJECT MATTER EXPERTISE | SPECIALIST INDUSTRY PARTNERS |
| LEARNING DESIGN | TEACHING STRATEGIES | LEARNER NEEDS & ANALYTICS |
| LEARNING DESIGN | TEACHING STRATEGIES | DESIGNING ASSESSMENT |
| LEARNING DESIGN | TEACHING STRATEGIES | EXPERIENTIAL LEARNING APPROACHES |
| LEARNING DESIGN | TEACHING STRATEGIES | DESIGNING GROUP WORK |
| LEARNING DESIGN | TEACHING STRATEGIES | PERSONALIZED & ADAPTIVE LEARNING |
| LEARNER EXPERIENCE | ACADEMIC ADMINISTRATION | FACULTY PROFESSIONAL DEVELOPMENT |
| LEARNER EXPERIENCE | ACADEMIC ADMINISTRATION | FACULTY MANAGEMENT & SUPPORT |
| LEARNER EXPERIENCE | ACADEMIC ADMINISTRATION | TIMETABLING & SCHEDULE MANAGEMENT |
| LEARNER EXPERIENCE | ACADEMIC ADMINISTRATION | RETENTION & LEARNING SUPPORT |
| LEARNER EXPERIENCE | ACADEMIC ADMINISTRATION | REPORTING & REGULATORY COMPLIANCE |
| LEARNER EXPERIENCE | ACADEMIC ADMINISTRATION | LIBRARY SERVICES |
| LEARNER EXPERIENCE | LEARNING & ACADEMIC EXPERIENCE | STUDENT PORTAL & LMS |
| LEARNER EXPERIENCE | LEARNING & ACADEMIC EXPERIENCE | SYNCHRONOUS LEARNING EXPERIENCES |
| LEARNER EXPERIENCE | LEARNING & ACADEMIC EXPERIENCE | ASYNCHRONOUS LEARNING EXPERIENCES |
| LEARNER EXPERIENCE | LEARNING & ACADEMIC EXPERIENCE | VOICE, CHAT & INTERACTIVE LEARNING |
| LEARNER EXPERIENCE | LEARNING & ACADEMIC EXPERIENCE | INDEPENDENT LEARNING RESOURCES |
| LEARNER EXPERIENCE | LEARNING & ACADEMIC EXPERIENCE | EXCHANGE PROGRAMS |
| LEARNER EXPERIENCE | STUDENT LIFE | ONBOARDING & ORIENTATION |
| LEARNER EXPERIENCE | STUDENT LIFE | WELLBEING & MENTAL HEALTH |
| LEARNER EXPERIENCE | STUDENT LIFE | STUDENT COMMUNITIES, CLUBS & SOCIETIES |
| LEARNER EXPERIENCE | STUDENT LIFE | VOLUNTEERING & STUDENT LEADERSHIP |
| LEARNER EXPERIENCE | STUDENT LIFE | STUDENT VOICE & SURVEYS |
| LEARNER EXPERIENCE | STUDENT LIFE | GRADUATION & SUCCESS |
| LEARNER EXPERIENCE | ASSESSMENT & VERIFICATION | TESTS & EXAMS |
| LEARNER EXPERIENCE | ASSESSMENT & VERIFICATION | PORTFOLIOS & PRACTICAL |
| LEARNER EXPERIENCE | ASSESSMENT & VERIFICATION | ASSESSMENT FEEDBACK |
| LEARNER EXPERIENCE | ASSESSMENT & VERIFICATION | PEER & GROUP ASSESSMENT |
| LEARNER EXPERIENCE | ASSESSMENT & VERIFICATION | BADGING & CREDENTIALING |
| WORK AND LIFELONG LEARNING | WORK INTEGRATED LEARNING | EMPLOYABILITY SKILLS BUILDING |
| WORK AND LIFELONG LEARNING | WORK INTEGRATED LEARNING | WORKPLACE SIMULATION & PROJECTS |
| WORK AND LIFELONG LEARNING | WORK INTEGRATED LEARNING | INTERNSHIPS & PLACEMENTS |
| WORK AND LIFELONG LEARNING | WORK INTEGRATED LEARNING | STUDENT WORK |
| WORK AND LIFELONG LEARNING | WORK INTEGRATED LEARNING | ENTREPRENEURSHIP & STARTUPS |
| WORK AND LIFELONG LEARNING | CAREER PLANNING & PLACEMENT | COMPETENCIES & SKILLS EVALUATION |
| WORK AND LIFELONG LEARNING | CAREER PLANNING & PLACEMENT | CAREER PLANNING SERVICES |
| WORK AND LIFELONG LEARNING | CAREER PLANNING & PLACEMENT | CAREER & RECRUITMENT EVENTS |
| WORK AND LIFELONG LEARNING | CAREER PLANNING & PLACEMENT | JOB APPLICATION SUPPORT |
| WORK AND LIFELONG LEARNING | CAREER PLANNING & PLACEMENT | JOB FINDING & GRADUATE PLACEMENT |
| WORK AND LIFELONG LEARNING | INDUSTRY & BUSINESS ENGAGEMENT | INDUSTRY COLLABS & PARTNERSHIPS |
| WORK AND LIFELONG LEARNING | INDUSTRY & BUSINESS ENGAGEMENT | PROFESSIONAL & INDUSTRY ASSOCIATIONS |
| WORK AND LIFELONG LEARNING | INDUSTRY & BUSINESS ENGAGEMENT | CUSTOMIZED PROGRAMS (B2B) |
| WORK AND LIFELONG LEARNING | INDUSTRY & BUSINESS ENGAGEMENT | EDUCATION AS EMPLOYEE BENEFIT |
| WORK AND LIFELONG LEARNING | ALUMNI & CONTINUING EDUCATION | CONTINUING EDUCATION |
| WORK AND LIFELONG LEARNING | ALUMNI & CONTINUING EDUCATION | INDUSTRY MENTORING & NETWORKS |
| WORK AND LIFELONG LEARNING | ALUMNI & CONTINUING EDUCATION | ALUMNI ENGAGEMENT |
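
Some capability names in this list are contained in others (for example, RECRUITMENT EVENTS is part of CAREER & RECRUITMENT EVENTS), which matters when a free-text model response is matched against the list. The following is a minimal sketch of one way to handle that by checking longer names first. The function match_capabilities is hypothetical and is not the matching logic used in the project’s script.

```python
from typing import List, Set


def match_capabilities(response: str, valid_capabilities: Set[str]) -> List[str]:
    """Return the capability names found in `response`, longest names first."""
    remaining = response.upper()
    found = []
    # Check longer names first so "CAREER & RECRUITMENT EVENTS" is not
    # mistaken for its substring "RECRUITMENT EVENTS"
    for capability in sorted(valid_capabilities, key=len, reverse=True):
        if capability in remaining:
            found.append(capability)
            # Blank out the match so shorter, overlapping names are not counted again
            remaining = remaining.replace(capability, " ")
    return found
```

Note that the result is ordered by name length, not by the order in which the capabilities appear in the response.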


Document Technical Information

This section explains the technical details of how we collected and processed the data. Here is an easy-to-understand breakdown:

  1. OpenAI API Interaction: The query_openai_api function handles the interaction with the OpenAI API, including retry logic to manage errors.
  2. Role Categorization: The categorize_potential_role function determines the potential role of the company (client, partner, competitor, or provider) based on the API response.
  3. Vertical Categorization: The categorize_potential_vertical function identifies the appropriate vertical for the company from a predefined list.
  4. Capability Extraction and Validation: The extract_and_validate_capabilities function matches the API responses against the predefined capability list and returns exactly five unique entries, padding the list from the predefined capabilities when fewer than five are matched.
  5. Company Processing: The process_company_column function processes each company’s data, constructs prompts, queries the API, and compiles the results.
  6. Progress Saving: The save_progress function periodically saves the progress of the analysis to an Excel file.
  7. Main Processing Loop: The script processes each company in the dataset, reads webpage content, handles errors, and saves the final results to an Excel file.

This approach allowed us to gather a detailed profile of each company, which was then analyzed to identify potential business opportunities for Edunext.
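
The retry logic mentioned in step 1 follows the general pattern of exponential backoff with jitter. The following is a minimal, library-independent sketch of that pattern. The helper call_with_retries and its parameters are hypothetical, and unlike the script it raises an error after the last attempt instead of returning a placeholder string.

```python
import time
from random import random


def call_with_retries(func, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, waiting longer after each failure."""
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as error:  # in real use, catch only the API's error types
            last_error = error
            # Wait base_delay * 1, 2, 4, ... plus random jitter, but not after the final attempt
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt) + random())
    raise RuntimeError(f"All {retries} attempts failed") from last_error
```

Raising on final failure is a design choice: it keeps a failed request from being stored and later parsed as if it were a model answer.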


import openai
import pandas as pd
import logging
import time
from random import random
import os

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

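# Read the OpenAI API key from the OPENAI_API_KEY environment variable
# instead of hard-coding it in the script
openai.api_key = os.environ['OPENAI_API_KEY']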

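# Load the LinkedIn data
# Note: skiprows=range(1, 660) skips the first 659 data rows (presumably to resume
# a partially completed run); remove it to process the full list
linkedIn_data = pd.read_excel('cleaned_company_information.xlsx', skiprows=range(1, 660))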

# Load the capabilities data
capabilities_data = pd.read_excel('Input/capabilities.xlsx')
valid_capabilities = set(capabilities_data['Capability'].str.upper().tolist())

# Load the role definitions (loaded for reference; not used in the prompts below)
role_definitions = pd.read_excel('Input/company_profile.xlsx')

# Load the vertical data
vertical_data = pd.read_excel('Input/vertical.xlsx')

# Define valid roles and verticals
valid_roles = ['CLIENT', 'PARTNER', 'COMPETITOR', 'PROVIDER']
valid_verticals = vertical_data['Vertical'].tolist()

# Prepare the output DataFrame
output_columns = [
    'Company Name', 'URL (LinkedIn profile link)', 'Website', 'OpenedX Connection',
    'Main Product', 'Competitors', 'Customers', 'B2B or B2C',
    'Number of Customers', 'Country', 'Potential Role', 'Potential Vertical',
    'Capability 1', 'Capability 2', 'Capability 3', 'Capability 4', 'Capability 5'
]

output_data = pd.DataFrame(columns=output_columns)
backup_interval = 10  # Save progress every 10 companies

# Function to query OpenAI API with retry logic
# (uses the legacy pre-1.0 openai package interface: openai.ChatCompletion and openai.error)
def query_openai_api(prompt, model="gpt-3.5-turbo", max_tokens=512, retries=5):
    for i in range(retries):
        try:
            response = openai.ChatCompletion.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are an assistant that helps analyze company data."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=max_tokens,
                temperature=0.7
            )
            return response.choices[0].message['content'].strip()
        except openai.error.OpenAIError as e:
            logging.error(f"APIError encountered: {e}. Retrying {i + 1}/{retries}...")
            time.sleep((2 ** i) + random())
    return "API request failed."

# Function to categorize potential role
def categorize_potential_role(response):
    response = response.lower()
    if "client" in response:
        return "CLIENT"
    if "partner" in response:
        return "PARTNER"
    if "competitor" in response:
        return "COMPETITOR"
    if "provider" in response:
        return "PROVIDER"
    return ""

# Function to categorize potential vertical
def categorize_potential_vertical(response):
    verticals = vertical_data['Vertical'].tolist()
    for vertical in verticals:
        if vertical.lower() in response.lower():
            return vertical
    return ""

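# Function to ensure the capabilities list has exactly 5 unique entries.
# Note: when fewer than 5 capabilities were matched, the list is padded with arbitrary
# entries from valid_capabilities, so padded values do not come from the model's response.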
def ensure_unique_capabilities(capabilities):
    unique_capabilities = list(dict.fromkeys(capabilities))  # Remove duplicates while preserving order
    while len(unique_capabilities) < 5:
        remaining_capabilities = list(valid_capabilities - set(unique_capabilities))
        if remaining_capabilities:
            unique_capabilities.append(remaining_capabilities[0])  # Add a remaining valid capability
        else:
            unique_capabilities.append('')  # If no valid capabilities are left, append an empty string
    return unique_capabilities[:5]

# Function to extract and validate capabilities
def extract_and_validate_capabilities(response, company_name):
    capabilities = []
    logging.debug(f"Raw API capabilities response for {company_name}: {response}")
    for line in response.split('\n'):
        for capability in valid_capabilities:
            if capability.lower() in line.lower() and capability not in capabilities:
                capabilities.append(capability)
                break
    validated_capabilities = ensure_unique_capabilities(capabilities)
    logging.debug(f"Validated capabilities for {company_name}: {validated_capabilities}")
    return validated_capabilities

# Function to process each company and column
def process_company_column(company_info, webpage_content):
    base_prompt = f"Company LinkedIn Info:\n{company_info}\n\nCompany Webpage Content:\n{webpage_content}\n\n"
    capabilities_list = ', '.join(valid_capabilities)

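    # Constructing more detailed prompts with additional context.
    # Note: each prompt is sent as a separate request (see query_openai_api), so the model does not
    # see its earlier answers when asked about "the capabilities already mentioned".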
    prompts = {
        'OpenedX Connection': base_prompt + "Taking into account the information provided and your knowledge about the company, does the company have any connection or use the Open edX platform?",
        'Main Product': base_prompt + "Taking into account the information provided and your knowledge about the company, what is the main product or service the company offers?",
        'Competitors': base_prompt + "Taking into account the information provided and your knowledge about the company, list three main competitors of the company including their names and websites.",
        'Customers': base_prompt + "Taking into account the information provided and your knowledge about the company, list three main customers of the company.",
        'B2B or B2C': base_prompt + "Taking into account the information provided and your knowledge about the company, is the company primarily focused on B2B (business-to-business) or B2C (business-to-consumer) operations?",
        'Number of Customers': base_prompt + "Taking into account the information provided and your knowledge about the company, estimate the number of customers the company has.",
        'Country': base_prompt + "Taking into account the information provided and your knowledge about the company, in which country does the company primarily operate?",
        'Potential Role': base_prompt + "Taking into account the information provided and your knowledge about the company, determine if the company is a potential client, partner, competitor, or provider for Edunext.",
        'Potential Vertical': base_prompt + "Taking into account the information provided and your knowledge about the company, determine the potential vertical for the company from the following list:\n" +
            '\n'.join([f"{row['Vertical']} - {row['Definition']}" for _, row in vertical_data.iterrows()]) + "\nWhat is the potential vertical of this company?",
        'Capability 1': base_prompt + f"Based on the provided information and any additional research, what is the primary capability of the company? Choose from the following: {capabilities_list}",
        'Capability 2': base_prompt + f"Considering the capabilities already mentioned, what is the second main capability of the company? Choose from the following: {capabilities_list}",
        'Capability 3': base_prompt + f"Considering the capabilities already mentioned, what is the third main capability of the company? Choose from the following: {capabilities_list}",
        'Capability 4': base_prompt + f"Considering the capabilities already mentioned, what is the fourth main capability of the company? Choose from the following: {capabilities_list}",
        'Capability 5': base_prompt + f"Considering the capabilities already mentioned, what is the fifth main capability of the company? Choose from the following: {capabilities_list}"
    }
    
    responses = {key: query_openai_api(prompt) for key, prompt in prompts.items()}

    # Log raw responses for debugging
    logging.debug(f"Raw API response for company {company_info['Company Name']}: {responses}")

    # Extract and validate capabilities
    capabilities = extract_and_validate_capabilities("\n".join([responses.get(f'Capability {i}', '') for i in range(1, 6)]), company_info['Company Name'])

    output_row = {
        'Company Name': company_info['Company Name'],
        'URL (LinkedIn profile link)': company_info['URL'],
        'Website': company_info['Website'],
        'OpenedX Connection': responses['OpenedX Connection'],
        'Main Product': responses['Main Product'],
        'Competitors': responses['Competitors'],
        'Customers': responses['Customers'],
        'B2B or B2C': "B2B" if "B2B" in responses['B2B or B2C'] else "B2C",
        'Number of Customers': responses['Number of Customers'],
        'Country': responses['Country'].strip(),
        'Potential Role': categorize_potential_role(responses['Potential Role']),
        'Potential Vertical': categorize_potential_vertical(responses['Potential Vertical']),
        'Capability 1': capabilities[0],
        'Capability 2': capabilities[1],
        'Capability 3': capabilities[2],
        'Capability 4': capabilities[3],
        'Capability 5': capabilities[4]
    }

    return output_row

# Function to save progress periodically
def save_progress(data, filename='company_analysis_backup.xlsx'):
    df = pd.DataFrame(data, columns=output_columns)
    df.to_excel(filename, index=False)
    logging.info(f"Progress saved to {filename}")

# Process companies sequentially with logging
results = []
for idx, row in linkedIn_data.iterrows():
    company_info = row.to_dict()
    company_name = company_info['Company Name']
    logging.info(f"Processing company: {company_name}")

    try:
        # Try to read the webpage content for the company, handle missing or invalid files
        webpage_content = ""
        filepath = os.path.join('webpages', f"{company_name}.txt")
        try:
            with open(filepath, 'r', encoding='utf-8') as file:
                webpage_content = file.read()
        except (FileNotFoundError, OSError) as e:
            logging.warning(f"Could not read file for {company_name}: {e}")
        
        output_row = process_company_column(company_info, webpage_content)
        results.append(output_row)
        logging.info(f"Completed processing for {company_name} ({idx + 1}/{len(linkedIn_data)})")
        
        # Save progress periodically
        if (idx + 1) % backup_interval == 0:
            save_progress(results)

    except Exception as e:
        logging.error(f"Error processing company {company_name}: {e}")

# Save final results
output_data = pd.DataFrame(results, columns=output_columns)
output_data.to_excel('company_analysis.xlsx', index=False)
logging.info("All companies processed and saved to company_analysis.xlsx")
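
The categorize_potential_role function above returns the first role keyword it finds, so a response such as "not a client, more likely a competitor" is labeled CLIENT. The following is a minimal sketch of a stricter alternative that accepts a response only when it names exactly one role. The function parse_role is hypothetical and is not part of the script.

```python
import re

VALID_ROLES = ("CLIENT", "PARTNER", "COMPETITOR", "PROVIDER")


def parse_role(response: str) -> str:
    """Return the single role named in the response, or '' if none or several are named."""
    text = response.upper()
    # Match whole words only (optionally plural), so "PARTNERSHIP" does not count as PARTNER
    mentioned = [role for role in VALID_ROLES if re.search(rf"\b{role}S?\b", text)]
    return mentioned[0] if len(mentioned) == 1 else ""
```

Responses that come back empty could then be set aside for manual review rather than being assigned a role automatically.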
  

Company Information Output

The fourth output is called “Company Information Output” and contains detailed information about various analyzed companies. Below is an example of the data collected:

Company Name: byteXL
URL (LinkedIn profile link): https://in.linkedin.com/company/bytexl
Website: https://bytexl.com
OpenedX Connection: Based on the information provided, there is no direct mention of the company byteXL using or having a connection with the Open edX platform. The company’s focus seems to be on transforming engineering colleges in India through their own integrated college transformation model and proprietary platform, rather than utilizing external platforms like Open edX.
Main Product: Based on the information provided from the company’s LinkedIn profile and webpage content, the main product or service offered by byteXL is an experiential learning online platform for IT programming aspirants. This platform integrates curriculum, content, and practical learning to enhance students’ skills and awareness on employability. The company partners with colleges to transform their teaching methodology and learning pedagogy to increase the employability quotient of their students. The platform includes features such as academic & skilling content, online editor, student reports, dashboards on individual college performance, and coding challenges.
Competitors: Based on the information provided about byteXL, three main competitors of the company in the E-learning industry could be:

1. Company Name: UpGrad
Website: https://www.upgrad.com/

2. Company Name: Simplilearn
Website: https://www.simplilearn.com/

3. Company Name: Coursera
Website: https://www.coursera.org/
Customers: Based on the information provided, three main customers of byteXL are:

1. Tejashri Student from Malineni Lakshmaiah Women’s Engineering College

2. A. Sanyasirao, Head of Electronics & Communication Engineering Department at Christu Jyothi Institute of Technology & Science

3. Kalyani, a student of Malineni Lakshmaiah Women’s Engineering College

These customers have shared their positive experiences with byteXL’s teaching methodology and its impact on their academic journey and career prospects.
B2B or B2C: B2B
Number of Customers: Based on the content provided, it seems like byteXL has several customers who are students and colleges across India. The webpage content mentions testimonials and success stories from students and faculty at various engineering colleges who have benefited from byteXL’s training and mentorship programs.

While the exact number of customers is not explicitly mentioned in the data provided, we can infer that the company has a significant customer base across multiple colleges and individual students. The testimonials and success stories highlight the impact byteXL has had on students’ academic journeys and career paths, indicating a wide reach and positive reputation among customers.

Therefore, based on the information available, we can estimate that byteXL likely has hundreds or even thousands of customers consisting of both colleges and individual students who have engaged with their educational programs and services.
Country: The company, byteXL, primarily operates in India. This is indicated by the company’s headquarters being in Hyderabad, India, the focus on transforming engineering colleges in India, partnerships with Indian colleges such as Malineni Lakshmaiah Women’s Engineering College and Christu Jyothi Institute of Technology & Science, and the testimonials and success stories from Indian students and educational institutions. The company’s efforts and impact are centered around the Indian education system and industry, supporting the employability and skills development of Indian engineering students.
Country Only: India
Potential Role: CLIENT
Potential Vertical: Higher Education
Capability 1: PERSONALIZED & ADAPTIVE LEARNING
Capability 2: WORKPLACE SIMULATION & PROJECTS
Capability 3: IMMERSION, SIMULATION & LAB
Capability 4: VOLUNTEERING & STUDENT LEADERSHIP
Capability 5: DIGITAL DESIGN PRINCIPLES

For more detailed information, you can refer to the complete dataset here.


| Analysis and Results

The analysis begins with geographical insights: first the number of startups by country, then the amount of investment. In these charts, the geographical location is the one LinkedIn reports for where the company is based, and the investment data combines LinkedIn information with Crunchbase. Not all companies have this information available, and not all investments were made in the current or previous year. The investment date for each company can be checked at this link: Investment Date Verification

Number of startups

The first map provides a visual representation of the number of startups by country based on data sourced from LinkedIn. The map uses a gradient color scale to indicate the density of startups in each country, with lighter shades of blue representing fewer startups and darker shades of blue indicating a higher number of startups.


The second map illustrates the number of startups by country, categorized by their estimated size based on employee count from LinkedIn data. Startups are classified into Small, Medium, Large, and Very Large using employee quartiles. The map features interactive tooltips with country names, estimated sizes, and startup counts. Bubble sizes indicate the number of startups, while colors (Yellow for Small, Orange for Medium, Red for Large, Dark Red for Very Large, and Grey for Unknown) represent size categories. This visualization helps identify the distribution of different-sized startups globally, emphasizing regions with diverse entrepreneurial activities.

| Estimated Size | Definition |
| --- | --- |
| Small | Employees ≤ 1st quartile |
| Medium | 1st quartile < Employees ≤ 2nd quartile |
| Large | 2nd quartile < Employees ≤ 3rd quartile |
| Very Large | Employees > 3rd quartile |
| Unknown | Missing employee data |

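
The size definitions above can be sketched in a few lines of pandas. This is an illustration under assumptions: the function name classify_by_quartile and the sample employee counts are made up, and the original analysis may have computed the categories differently.

```python
import pandas as pd


def classify_by_quartile(values: pd.Series) -> pd.Series:
    """Label each value Small / Medium / Large / Very Large by quartile, or Unknown if missing."""
    # Quartile cut points are computed on the non-missing values
    q1, q2, q3 = values.quantile([0.25, 0.5, 0.75]).tolist()

    def label(value):
        if pd.isna(value):
            return "Unknown"
        if value <= q1:
            return "Small"
        if value <= q2:
            return "Medium"
        if value <= q3:
            return "Large"
        return "Very Large"

    return values.apply(label)


# Made-up employee counts, including one company with no data
employees = pd.Series([5, 20, 50, 200, None])
sizes = classify_by_quartile(employees)
```

With these definitions, if two quartile cut points coincide (for instance, when many companies share the same employee count), the category between them ends up empty.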
Size based on number of employees
Country Small Medium Large Very Large Unknown Total
United States 60 35 0 50 36 181
India 10 4 0 43 3 60
United Kingdom 13 18 0 10 15 56
Unknown 5 8 1 4 9 27
Brazil 5 7 0 4 9 25
Germany 7 7 0 4 3 21
France 5 6 0 3 7 21
Singapore 6 4 0 4 7 21
Canada 4 4 0 3 7 18
Spain 1 1 0 4 9 15
Mexico 4 4 0 5 1 14
Egypt 5 3 0 2 3 13
Nigeria 3 2 0 1 7 13
South Africa 3 3 0 1 6 13
Sweden 1 3 0 2 6 12
Finland 0 4 1 1 5 11
Italy 4 4 0 1 0 9
Indonesia 0 1 0 6 1 8
South Korea 2 3 1 1 1 8
Norway 0 3 1 0 4 8
Vietnam 3 0 0 4 1 8
United Arab Emirates 3 2 1 1 0 7
Colombia 2 3 0 1 1 7
Switzerland 2 3 0 0 1 6
Kenya 1 1 0 0 4 6
Netherlands 3 2 0 0 1 6
Saudi Arabia 1 1 0 3 1 6
Bangladesh 0 2 0 3 0 5
Ireland 0 1 0 2 2 5
Israel 0 2 0 0 3 5
Argentina 0 2 0 2 0 4
Chile 0 1 0 0 3 4
China 0 2 0 1 1 4
Hungary 0 1 0 1 2 4
Japan 1 1 0 0 2 4
Pakistan 4 0 0 0 0 4
Poland 1 2 0 1 0 4
Austria 0 1 0 1 1 3
Denmark 0 0 0 2 1 3
Iceland 0 0 0 0 3 3
Jordan 1 0 0 2 0 3
Kazakhstan 0 2 0 0 1 3
Portugal 1 1 0 0 1 3
Belgium 1 0 0 0 1 2
Cameroon 0 1 0 0 1 2
Estonia 0 1 0 0 1 2
Kuwait 1 0 0 0 1 2
Lithuania 0 0 0 0 2 2
Malaysia 2 0 0 0 0 2
Peru 0 1 0 0 1 2
Romania 0 2 0 0 0 2
Tunisia 0 0 0 0 2 2
Taiwan 0 2 0 0 0 2
Tanzania 0 1 0 0 1 2
Ukraine 0 0 0 1 1 2
Venezuela 0 1 0 0 1 2
Australia 1 0 0 0 0 1
Bahrain 0 1 0 0 0 1
Congo - Kinshasa 1 0 0 0 0 1
Costa Rica 0 0 0 0 1 1
Czechia 0 0 0 0 1 1
Dominican Republic 0 0 0 0 1 1
Ecuador 0 1 0 0 0 1
Ghana 0 0 0 0 1 1
Greece 0 1 0 0 0 1
Lebanon 0 1 0 0 0 1
Morocco 0 1 0 0 0 1
Madagascar 1 0 0 0 0 1
Panama 0 0 0 0 1 1
Philippines 0 1 0 0 0 1
Thailand 1 0 0 0 0 1
Uzbekistan 1 0 0 0 0 1

Amount of investments

This map is built from LinkedIn data: the country where each company’s office is based and the amount of its last investment round. It shows the sum of the last-round amounts per country, giving a visual picture of how investment is distributed globally. Each country is colored according to its total investment amount, and interactive tooltips give the details of the investments.


This interactive map visualizes the median investment in startups by country based on data from LinkedIn. The map uses the median to avoid the impact of outliers. China is highlighted in violet because it is an outlier, with only one company providing information on the last investment round, which was exceptionally large. The gradient color scale from red to yellow to green indicates low to high median investments in millions of dollars, allowing for a clear comparison of investment levels across countries.
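
To see why the median is less sensitive than the mean to a single very large round, consider the following sketch. The amounts are invented for illustration and are not taken from the report’s data.

```python
from statistics import mean, median

# Hypothetical last-round amounts, in millions of dollars, for one country.
# The final value stands in for a single exceptionally large round.
rounds_musd = [2.0, 3.5, 5.0, 4.0, 250.0]

average = mean(rounds_musd)    # pulled far upward by the one large round
typical = median(rounds_musd)  # stays close to the other rounds

print(f"mean:   {average:.1f}")
print(f"median: {typical:.1f}")
```

With only one data point per country, as in the China case described above, the median and the mean coincide, which is why such a country still has to be flagged separately.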


This map illustrates the number of startups by country, categorized by their estimated investment size based on data from LinkedIn. Investment sizes are classified into Small, Medium, Large, and Very Large using quartiles of the last round investment amounts. The map features interactive tooltips with country names and estimated investment sizes. Bubble sizes indicate the number of startups, while colors (Light Green for Small, Green for Medium, Dark Green for Large, Forest Green for Very Large, and Grey for Unknown) represent investment size categories. This visualization helps identify the distribution of startups with varying investment sizes globally, highlighting regions with diverse levels of startup funding.

| Estimated Investment Size | Definition |
|---|---|
| Small | Investment ≤ 1st quartile |
| Medium | 1st quartile < Investment ≤ 2nd quartile |
| Large | 2nd quartile < Investment ≤ 3rd quartile |
| Very Large | Investment > 3rd quartile |
| Unknown | Missing investment data |

Size based on last round amount
Country Small Medium Large Very Large Unknown Total
United States 23 14 32 0 50 119
India 11 9 8 0 10 38
United Kingdom 9 8 13 0 6 36
Germany 5 2 3 0 4 14
France 4 3 5 0 1 13
Canada 4 1 1 0 6 12
Unknown 4 1 3 0 2 10
Spain 2 4 0 0 3 9
Brazil 4 1 1 1 0 7
Singapore 2 1 1 0 3 7
South Korea 2 0 4 0 0 6
Mexico 1 1 2 0 2 6
Egypt 0 4 1 0 0 5
Israel 1 2 2 0 0 5
Italy 2 1 2 0 0 5
Sweden 1 1 2 0 1 5
Colombia 1 2 1 0 0 4
Nigeria 1 3 0 0 0 4
Netherlands 2 2 0 0 0 4
Vietnam 0 0 4 0 0 4
Denmark 0 1 0 0 2 3
Finland 1 1 0 0 1 3
Hungary 0 3 0 0 0 3
Indonesia 1 0 1 0 1 3
Japan 0 1 1 0 1 3
Norway 1 1 1 0 0 3
Portugal 2 1 0 0 0 3
South Africa 0 3 0 0 0 3
United Arab Emirates 1 1 0 0 0 2
Bangladesh 0 1 1 0 0 2
Belgium 0 0 2 0 0 2
Chile 0 2 0 0 0 2
Estonia 0 1 1 0 0 2
Ireland 0 2 0 0 0 2
Kenya 0 2 0 0 0 2
Malaysia 1 1 0 0 0 2
Pakistan 2 0 0 0 0 2
Poland 2 0 0 0 0 2
Romania 0 2 0 0 0 2
Saudi Arabia 1 1 0 0 0 2
Tunisia 0 2 0 0 0 2
Congo - Kinshasa 0 1 0 0 0 1
Switzerland 0 0 1 0 0 1
China 0 0 0 0 1 1
Costa Rica 0 1 0 0 0 1
Czechia 0 1 0 0 0 1
Ghana 0 1 0 0 0 1
Iceland 0 1 0 0 0 1
Jordan 0 0 0 0 1 1
Kuwait 1 0 0 0 0 1
Kazakhstan 1 0 0 0 0 1
Madagascar 0 1 0 0 0 1
Peru 0 1 0 0 0 1
Thailand 1 0 0 0 0 1
Taiwan 0 0 1 0 0 1
Uzbekistan 0 1 0 0 0 1
Venezuela 0 1 0 0 0 1


Time analysis

📄 By country

Number of startups

This stacked bar chart illustrates the number of companies receiving investments each year, broken down by country. Each bar represents a year, and the different colored segments within each bar denote the number of companies from various countries that received investments in that particular year. The legend on the right-hand side identifies the countries corresponding to each color. This visualization helps track investment trends over time and highlights which countries have seen increasing or decreasing investment activity in the EdTech sector.

Amount of investments

This stacked bar chart illustrates the total amount of investment received by companies each year, broken down by country. Each bar represents a year, and the different colored segments within each bar denote the total investment received by companies from various countries in that particular year. The legend on the right-hand side identifies the countries corresponding to each color. This visualization helps track investment trends over time and highlights which countries have received the most significant financial investments in the EdTech sector.
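
The quantities behind the two stacked bar charts above are a count and a sum per year and country. A minimal sketch of that aggregation follows; the records and their field layout are invented for illustration and are not taken from the report’s data.

```python
from collections import defaultdict

# Hypothetical records: (company, country, year of last round, amount in USD millions).
rounds = [
    ("A", "India", 2021, 12.0),
    ("B", "India", 2022, 30.0),
    ("C", "Brazil", 2022, 5.0),
    ("D", "India", 2022, 8.0),
]

count_by_year_country = defaultdict(int)     # segment heights, "number of startups" chart
amount_by_year_country = defaultdict(float)  # segment heights, "amount of investments" chart

for _company, country, year, amount in rounds:
    count_by_year_country[(year, country)] += 1
    amount_by_year_country[(year, country)] += amount

for key in sorted(count_by_year_country):
    print(key, count_by_year_country[key], amount_by_year_country[key])
```

Each `(year, country)` key corresponds to one colored segment of one bar.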


📄 By Vertical

Number of startups

This stacked bar chart displays the number of companies by year, categorized by verticals. Each bar represents a year, and the different colored segments within each bar denote the number of companies within specific verticals for that particular year. The legend on the right-hand side identifies the verticals corresponding to each color. This visualization helps track trends over time and highlights the growth or decline of companies within each vertical in the EdTech sector.

Note: The verticals were defined and selected by the EduNext team, and the detailed definitions of these verticals are provided earlier in the report.

Amount of investment

This stacked bar chart displays the total amount of investment by year, categorized by verticals. Each bar represents a year, and the different colored segments within each bar denote the total investment received by companies within specific verticals for that particular year. The legend on the right-hand side identifies the verticals corresponding to each color. This visualization helps track investment trends over time and highlights the growth or decline of investment within each vertical in the EdTech sector.

Note: The verticals were defined and selected by the EduNext team, and the detailed definitions of these verticals are provided earlier in the report.


📄 By B2B or B2C

Number of startups

This stacked bar chart illustrates the number of companies receiving investments each year, categorized by their business model: B2B (Business to Business) and B2C (Business to Consumer). Each bar represents a year, and the different colored segments within each bar denote the number of companies in each category for that particular year. The legend on the right-hand side identifies the categories corresponding to each color. This visualization helps track investment activity trends over time, highlighting the dynamics between B2B and B2C companies in the EdTech sector.

Amount of investment

This stacked bar chart illustrates the total amount of investment by year, categorized by business model: B2B (Business to Business) and B2C (Business to Consumer). Each bar represents a year, and the different colored segments within each bar denote the total investment received by companies in each category for that particular year. The legend on the right-hand side identifies the categories corresponding to each color. This visualization helps track investment trends over time, highlighting the dynamics between B2B and B2C companies in the EdTech sector.

📄 By Industry

Number of startups

This treemap chart displays the distribution of companies by industry, based on the information available on their LinkedIn profiles. Each rectangle represents an industry, with the size of the rectangle proportional to the number of companies in that industry. The color and size of the rectangles provide a quick visual reference for the prevalence of different industries among the analyzed companies. This visualization helps to identify which industries are most common in the EdTech sector and how companies are distributed across various fields.

Note: EduNext’s LinkedIn profile lists its industry as E-learning.

Amount of investment

This treemap chart displays the distribution of investment by industry, based on the information available on the companies’ LinkedIn profiles. Each rectangle represents an industry, with the size of the rectangle proportional to the total investment received by companies in that industry. The color and size of the rectangles provide a quick visual reference for the allocation of investments across different industries among the analyzed companies. This visualization helps to identify which industries attract the most investment in the EdTech sector.

Note: EduNext’s LinkedIn profile lists its industry as E-learning.

📄 By Role

Role

This donut chart visualizes the distribution of potential roles identified for companies, based on their relevance to EduNext. The chart segments represent the percentage of companies classified as partners, clients, competitors, and providers. The legend indicates the role corresponding to each color. This visualization helps to understand the predominant roles among the analyzed companies.

Note: The EduNext team chose the roles and definitions for each role.

Potential Clients

This table lists the companies identified as potential clients for EduNext. The columns include company name, size, country, type, number of employees, industry, business model (B2B or B2C), potential vertical, and the amount of the last funding round as reported by Crunchbase. This table provides detailed information on potential clients, aiding strategic decision-making.

Note: The EduNext team chose the roles and definitions for each role.

Potential partners

This table lists the companies identified as potential partners for EduNext. The columns include company name, size, country, type, number of employees, industry, business model (B2B or B2C), potential vertical, and the amount of the last funding round as reported by Crunchbase. This table provides detailed information on potential partners, aiding strategic collaborations.

Note: The EduNext team chose the roles and definitions for each role.

Potential competitors

This table lists the companies identified as potential competitors for EduNext. The columns include company name, size, country, type, number of employees, industry, business model (B2B or B2C), potential vertical, and the amount of the last funding round as reported by Crunchbase. This table provides detailed information on potential competitors, aiding competitive analysis.

Note: The EduNext team chose the roles and definitions for each role.

Potential providers

This table lists the companies identified as potential providers for EduNext. The columns include company name, size, country, type, number of employees, industry, business model (B2B or B2C), potential vertical, and the amount of the last funding round as reported by Crunchbase. This table provides detailed information on potential providers, aiding strategic sourcing decisions.

Ranking of startups

This horizontal bar chart ranks companies by the amount of their most recent investment, as reported by Crunchbase. The length of each bar represents the size of the investment, and the company names are listed along the vertical axis. This visualization helps to quickly identify the most heavily funded companies within the EdTech sector.

Note: The EduNext team chose the roles and definitions for each role.


📄 By Capabilities

Number of companies

This table shows how often each capability was identified among the analyzed companies. Each row is a capability. The columns Capability 1 through Capability 5 give the number of companies for which that capability was listed in the first through fifth position, and the “Total” column sums these counts. The table gives an overview of how capabilities are distributed across companies and helps identify the strengths and focus areas within the EdTech sector.

Note: The capabilities were taken from the HolonIQ source and are explained in more detail earlier in the report.

Number of Appearances
Capability Capability 1 Capability 2 Capability 3 Capability 4 Capability 5 Total
PERSONALIZED & ADAPTIVE LEARNING 375 178 88 46 14 701
IMMERSION, SIMULATION & LAB 5 198 225 149 67 644
VOLUNTEERING & STUDENT LEADERSHIP 1 2 205 223 147 578
DIGITAL DESIGN PRINCIPLES 8 37 17 209 213 484
ACCREDITATION 4 13 7 3 214 241
DIGITAL CONTENT CREATION 49 27 10 0 2 88
CUSTOMIZED PROGRAMS (B2B) 11 22 17 5 0 55
ASSESSMENT FEEDBACK 6 37 7 3 0 53
JOB APPLICATION SUPPORT 10 24 11 4 0 49
BADGING & CREDENTIALING 3 7 14 12 8 44
DESIGNING ASSESSMENT 17 15 7 2 0 41
PROFESSIONAL & INDUSTRY ASSOCIATIONS 3 13 8 4 8 36
FACULTY MANAGEMENT & SUPPORT 0 0 8 14 12 34
REPORTING & REGULATORY COMPLIANCE 3 17 8 4 0 32
JOB FINDING & GRADUATE PLACEMENT 11 15 3 1 1 31
STUDENT COMMUNITIES, CLUBS & SOCIETIES 8 8 13 1 0 30
TESTS & EXAMS 2 2 2 7 14 27
TUITION FINANCING 26 0 1 0 0 27
CAREER PLANNING SERVICES 9 9 4 2 2 26
WELLBEING & MENTAL HEALTH 21 5 0 0 0 26
COURSE SELECTION & GUIDANCE 7 8 3 4 0 22
APPLICATION & ADMISSIONS 17 1 0 1 0 19
STUDENT PORTAL & LMS 4 6 3 3 0 16
UNDERSTAND CUSTOMER NEEDS 3 6 6 0 0 15
WORKPLACE SIMULATION & PROJECTS 10 4 1 0 0 15
EXPERIENTIAL LEARNING APPROACHES 5 4 2 1 0 12
STUDENT RELATIONSHIP MANAGEMENT (CRM) 5 4 1 1 0 11
RETENTION & LEARNING SUPPORT 3 4 3 0 0 10
COMPETITORS & ALTERNATES 1 3 3 2 0 9
PEER & GROUP ASSESSMENT 3 5 1 0 0 9
PROGRAM ARCHITECTURE 2 1 6 0 0 9
ASYNCH. LEARNING EXPERIENCES 1 3 2 0 1 7
EDUCATION AS EMPLOYEE BENEFIT 6 1 0 0 0 7
ENTREPRENEURSHIP & STARTUPS 6 1 0 0 0 7
INDUSTRY MENTORING & NETWORKS 6 1 0 0 0 7
LEARNING DELIVERY MODELS 3 4 0 0 0 7
ONBOARDING & ORIENTATION 7 0 0 0 0 7
VOICE, CHAT & INTERACTIVE LEARNING 6 1 0 0 0 7
COMPETENCIES & SKILLS EVALUATION 6 0 0 0 0 6
EMPLOYABILITY SKILLS BUILDING 3 0 3 0 0 6
INDUSTRY COLLABS & PARTNERSHIPS 2 3 0 0 0 5
LEARNER NEEDS & ANALYTICS 0 4 1 0 0 5
TIMETABLING & SCHEDULE MANAGEMENT 3 1 1 0 0 5
B2B RECRUITMENT & PARTNERSHIPS 4 0 0 0 0 4
SCHOOLS & COMMUNITY OUTREACH 2 1 1 0 0 4
SOURCING & MANAGING EXPERTISE 2 1 1 0 0 4
STUDENT VOICE & SURVEYS 0 1 3 0 0 4
DESIGNING GROUP WORK 0 1 1 1 0 3
GRADUATION & SUCCESS 1 0 1 1 0 3
INTERNSHIPS & PLACEMENTS 1 1 1 0 0 3
OER & CONTENT LICENSING 3 0 0 0 0 3
SYNCHRONOUS LEARNING EXPERIENCES 2 1 0 0 0 3
EXCHANGE PROGRAMS 0 0 2 0 0 2
LEARNING ENVIRONMENTS & PLATFORMS 2 0 0 0 0 2
LIBRARY SERVICES 1 1 0 0 0 2
RECRUITMENT EVENTS 1 1 0 0 0 2
SCHOLARSHIP PROGRAM 1 0 1 0 0 2
DESIGNING FOR DIGITAL LEARNING 0 1 0 0 0 1
FACULTY EXPERTISE & SPECIALISMS 0 0 1 0 0 1
MANAGING INTEGRATED CONTENT 1 0 0 0 0 1
MARKETING AUTOMATION 1 0 0 0 0 1

Score

Score by Capabilities

This table assigns each capability a weighted score based on the position in which it was listed, so that the most important capabilities can be ranked. The count in each position is multiplied by a weight:

  • The Capability 1 count is multiplied by 5
  • The Capability 2 count is multiplied by 4
  • The Capability 3 count is multiplied by 3
  • The Capability 4 count is multiplied by 2
  • The Capability 5 count is multiplied by 1

The total score for each capability is the sum of these weighted counts, so capabilities listed earlier contribute more. The table shows the count for each position and, in the “Score” column, the weighted total for each row.
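
The weighting can be checked with a short sketch. The weights and the counts are taken from this section; the function name `capability_score` and the dictionary layout are illustrative assumptions, not the code used to produce the report.

```python
WEIGHTS = (5, 4, 3, 2, 1)  # weights for positions 1 through 5, as listed above


def capability_score(counts):
    """Weighted sum of how often a capability appears in each position."""
    if len(counts) != len(WEIGHTS):
        raise ValueError("expected one count per position")
    return sum(weight * count for weight, count in zip(WEIGHTS, counts))


# Counts per position, copied from the tables in this section.
appearances = {
    "PERSONALIZED & ADAPTIVE LEARNING": (375, 178, 88, 46, 14),
    "DIGITAL CONTENT CREATION": (49, 27, 10, 0, 2),
    "ACCREDITATION": (4, 13, 7, 3, 214),
}

scores = {name: capability_score(counts) for name, counts in appearances.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]}")
```

Here ACCREDITATION appears more often in total than DIGITAL CONTENT CREATION (241 versus 88 appearances) but scores lower (313 versus 385), because most of its appearances are in the fifth position.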

Note: The capabilities were taken from the HolonIQ source and are explained in more detail earlier in the report.

Score
Capability Capability 1 Capability 2 Capability 3 Capability 4 Capability 5 Score
PERSONALIZED & ADAPTIVE LEARNING 375 178 88 46 14 2957
IMMERSION, SIMULATION & LAB 5 198 225 149 67 1857
VOLUNTEERING & STUDENT LEADERSHIP 1 2 205 223 147 1221
DIGITAL DESIGN PRINCIPLES 8 37 17 209 213 870
DIGITAL CONTENT CREATION 49 27 10 0 2 385
ACCREDITATION 4 13 7 3 214 313
ASSESSMENT FEEDBACK 6 37 7 3 0 205
CUSTOMIZED PROGRAMS (B2B) 11 22 17 5 0 204
JOB APPLICATION SUPPORT 10 24 11 4 0 187
DESIGNING ASSESSMENT 17 15 7 2 0 170
TUITION FINANCING 26 0 1 0 0 133
JOB FINDING & GRADUATE PLACEMENT 11 15 3 1 1 127
WELLBEING & MENTAL HEALTH 21 5 0 0 0 125
BADGING & CREDENTIALING 3 7 14 12 8 117
REPORTING & REGULATORY COMPLIANCE 3 17 8 4 0 115
STUDENT COMMUNITIES, CLUBS & SOCIETIES 8 8 13 1 0 113
PROFESSIONAL & INDUSTRY ASSOCIATIONS 3 13 8 4 8 107
CAREER PLANNING SERVICES 9 9 4 2 2 99
APPLICATION & ADMISSIONS 17 1 0 1 0 91
COURSE SELECTION & GUIDANCE 7 8 3 4 0 84
WORKPLACE SIMULATION & PROJECTS 10 4 1 0 0 69
FACULTY MANAGEMENT & SUPPORT 0 0 8 14 12 64
STUDENT PORTAL & LMS 4 6 3 3 0 59
UNDERSTAND CUSTOMER NEEDS 3 6 6 0 0 57
TESTS & EXAMS 2 2 2 7 14 52
EXPERIENTIAL LEARNING APPROACHES 5 4 2 1 0 49
STUDENT RELATIONSHIP MANAGEMENT (CRM) 5 4 1 1 0 46
RETENTION & LEARNING SUPPORT 3 4 3 0 0 40
PEER & GROUP ASSESSMENT 3 5 1 0 0 38
ONBOARDING & ORIENTATION 7 0 0 0 0 35
EDUCATION AS EMPLOYEE BENEFIT 6 1 0 0 0 34
ENTREPRENEURSHIP & STARTUPS 6 1 0 0 0 34
INDUSTRY MENTORING & NETWORKS 6 1 0 0 0 34
VOICE, CHAT & INTERACTIVE LEARNING 6 1 0 0 0 34
PROGRAM ARCHITECTURE 2 1 6 0 0 32
LEARNING DELIVERY MODELS 3 4 0 0 0 31
COMPETENCIES & SKILLS EVALUATION 6 0 0 0 0 30
COMPETITORS & ALTERNATES 1 3 3 2 0 30
ASYNCH. LEARNING EXPERIENCES 1 3 2 0 1 24
EMPLOYABILITY SKILLS BUILDING 3 0 3 0 0 24
INDUSTRY COLLABS & PARTNERSHIPS 2 3 0 0 0 22
TIMETABLING & SCHEDULE MANAGEMENT 3 1 1 0 0 22
B2B RECRUITMENT & PARTNERSHIPS 4 0 0 0 0 20
LEARNER NEEDS & ANALYTICS 0 4 1 0 0 19
SCHOOLS & COMMUNITY OUTREACH 2 1 1 0 0 17
SOURCING & MANAGING EXPERTISE 2 1 1 0 0 17
OER & CONTENT LICENSING 3 0 0 0 0 15
SYNCHRONOUS LEARNING EXPERIENCES 2 1 0 0 0 14
STUDENT VOICE & SURVEYS 0 1 3 0 0 13
INTERNSHIPS & PLACEMENTS 1 1 1 0 0 12
GRADUATION & SUCCESS 1 0 1 1 0 10
LEARNING ENVIRONMENTS & PLATFORMS 2 0 0 0 0 10
DESIGNING GROUP WORK 0 1 1 1 0 9
LIBRARY SERVICES 1 1 0 0 0 9
RECRUITMENT EVENTS 1 1 0 0 0 9
SCHOLARSHIP PROGRAM 1 0 1 0 0 8
EXCHANGE PROGRAMS 0 0 2 0 0 6
MANAGING INTEGRATED CONTENT 1 0 0 0 0 5
MARKETING AUTOMATION 1 0 0 0 0 5
DESIGNING FOR DIGITAL LEARNING 0 1 0 0 0 4
FACULTY EXPERTISE & SPECIALISMS 0 0 1 0 0 3

| Opportunities for Improvement

The analysis of EdTech companies has revealed several areas for improvement:

  • Data Refinement: The data on each company’s relationship with the Open edX platform needs to be improved. Clear and accurate information on this point will lead to more precise insights.
  • Definition of eduNEXT: The definition and concept of “eduNEXT” need to be clearly stated and incorporated into the model training, so that the company’s role in the EdTech ecosystem is understood and identified correctly.
  • Client vs. Partner Roles: The ambiguity between the ‘client’ and ‘partner’ roles needs to be resolved. More advanced models such as GPT-4o may provide deeper contextual analysis and better differentiation.
  • Role Classification: When asking the ChatGPT API for roles, including a “none of the above” category would have been beneficial, because it avoids forcing the assignment of a relationship that does not exist.
  • Investment Data Source: The investment information used in the analysis came solely from Crunchbase data retrieved via LinkedIn. Future analyses would be more comprehensive with additional sources of investment data, especially for companies in Asia, where LinkedIn usage is less prevalent.
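
The “none of the above” idea from the role-classification point can be sketched as a prompt builder plus a small validation step. This is an illustration under assumptions, not the project’s code: the prompt wording, the function names `build_role_prompt` and `normalize_role`, and the label strings are all hypothetical, and no API call is shown.

```python
ALLOWED_ROLES = ("client", "partner", "competitor", "provider")
FALLBACK = "none of the above"


def build_role_prompt(company_profile: str) -> str:
    """Build a hypothetical classification prompt that offers an explicit opt-out."""
    options = ", ".join(ALLOWED_ROLES + (FALLBACK,))
    return (
        "Classify this company's potential relationship to our organization. "
        f"Answer with exactly one of: {options}.\n\n"
        f"Company profile:\n{company_profile}"
    )


def normalize_role(reply: str) -> str:
    """Map a free-text reply to an allowed role, or to the fallback label."""
    text = reply.strip().lower().rstrip(".")
    return text if text in ALLOWED_ROLES else FALLBACK


print(build_role_prompt("Example profile text"))
print(normalize_role(" Partner. "))
print(normalize_role("Investor"))
```

Treating any unexpected reply as the fallback keeps a relationship from being recorded when the model’s answer does not match one of the defined roles.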

The project shows significant potential for further exploration and development. As it is still in progress, any feedback and recommendations are highly appreciated to refine and improve the results.



---
author: "Daniela Rios"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output:
  html_document:
    code_download: true
    mathjax: true
    keep_md: true
    highlight: zenburn
    theme:  spacelab
  pdf_document:
always_allow_html: true
---


```{r setup, echo=FALSE}

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, comment = NA)

```

![](Banner.jpg)

***

<span><h1 style = "font-family: verdana; font-size: 26px; font-style: normal; letter-spacing: 3px; background-color: #e7f3fe; color :#000000; border-radius: 100px 100px; text-align:center">📈 Emerging EdTech Innovators </h1></span>


<div id="table-of-contents" style="
    border: 1px solid #ddd;
    padding: 20px;
    margin-top: 10px;
    margin-bottom: 20px;
    box-shadow: 0 2px 4px rgba(0,0,0,0.05);
    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;">

<h2 style="
    text-align: center;
    font-size: 18px;
    margin-top: 0;
    margin-bottom: 10px;
    padding-bottom: 5px;
    border-bottom: 2px solid #2a7ae2;
    font-weight: bold;
    color: #2a7ae2;">
    Table of Contents
</h2>

<ol style="font-size: 14px; color: #555; padding-left: 20px; text-align: left;">
    <li style="margin-bottom: 8px;"><a href="#introduction-objective" style="text-decoration: none; color: #2a7ae2;">Introduction and Objective</a></li>
    <li style="margin-bottom: 8px;"><a href="#overview" style="text-decoration: none; color: #2a7ae2;">Project Overview</a></li>
    <li style="margin-bottom: 8px;"><a href="#holoniq-overview" style="text-decoration: none; color: #2a7ae2;">HolonIQ EdTech Ranking Overview</a></li>
    <li style="margin-bottom: 8px;"><a href="#methodology" style="text-decoration: none; color: #2a7ae2;">Methodology</a></li>
    <li style="margin-bottom: 8px;"><a href="#analysis-results" style="text-decoration: none; color: #2a7ae2;">Analysis and Results</a></li>
    <li style="margin-bottom: 8px;"><a href="#discussion" style="text-decoration: none; color: #2a7ae2;">Opportunities for Improvement</a></li>
</ol>

</div>



<span id="top"></span>


<a id="introduction-objective"></a>

***
<p style="font-family: Arial, sans-serif; font-size: 22px; color: #333333; line-height: 1.5; text-align: center; margin-bottom: 20px;">
  <b><span style="color: #9381ff; font-size: 26px;"> |</span> Introduction and Objective</b>
</p>

***

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The primary objective of this project was to explore emerging companies and trends in the EdTech industry, as featured on HolonIQ’s list. This initiative aimed to gain a comprehensive understanding of the current market landscape, categorize the startups based on their focus areas, and pinpoint potential clients, providers, partners, competitors, or emerging markets. The analysis was structured to examine these entities by geographic regions and industry verticals, ultimately providing strategic insights to understand edunext’s position within the industry and enhance its market positioning and growth.
</p>

<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">🚨 <b>Why the HolonIQ EdTech Ranking was Selected</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The HolonIQ EdTech ranking was selected due to its reputation as a comprehensive and authoritative source of information on the most innovative and high-potential EdTech startups globally. The ranking provides valuable insights into the leading companies in the EdTech sector, making it an ideal resource for identifying key players and emerging trends in the industry. By leveraging this ranking, eduNext can ensure its analysis is based on credible and up-to-date information, enabling informed decision-making and strategic planning.
</p>

<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">💫 <b>Company Overview</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Edunext is an infrastructure-as-a-service provider and software company dedicated to the Open edX platform. It empowers organizations worldwide by delivering robust, scalable, and customizable online learning solutions. Edunext's mission is to enhance the quality of education through technology, supporting successful online learning initiatives across various sectors.
</p>

<a id="overview"></a>

***
<p style="font-family: Arial, sans-serif; font-size: 22px; color: #333333; line-height: 1.5; text-align: center; margin-bottom: 20px;">
  <b><span style="color: #9381ff; font-size: 26px;"> |</span> Project Overview</b>
</p>

***

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The project involved generating a list of the 1000 EdTech startups featured in the HolonIQ ranking. This list included links to their LinkedIn profiles. A scraper was developed to extract information from LinkedIn, and additional data was gathered from the startups' websites, which were saved in text files. This collected information was then used to feed the ChatGPT API with specific prompts. The purpose of these prompts was to build detailed profiles of each company and obtain better answers to questions oriented toward classifying and identifying potential business opportunities for eduNext. This methodology provided a comprehensive and detailed understanding of each startup, facilitating strategic insights for edunext's market positioning and growth.
</p>


***
<a id="holoniq-overview"></a>

<p style="font-family: Arial, sans-serif; font-size: 22px; color: #333333; line-height: 1.5; text-align: center; margin-bottom: 20px;">
  <b><span style="color: #9381ff; font-size: 26px;"> |</span> HolonIQ EdTech Ranking Overview</b>
</p>

***

<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📘 <b>Introduction to HolonIQ</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
HolonIQ is a global market intelligence firm specializing in the education, climate, and health sectors. Each year, it publishes the Global EdTech 1000 list, which highlights the most promising startups in the educational technology (EdTech) field at both global and regional levels. Sublists include the "Top 200 EdTech in North America," the "Top 50 EdTech in Australia and New Zealand," and many more for different regions worldwide.
</p>

<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">🔍 <b>How Does the HolonIQ Ranking Work?</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">**Evaluation and Selection:**</span>
</p>
<ul style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <li><b>Database:</b> HolonIQ starts with a global database containing thousands of EdTech organizations. It uses its global intelligence platform to gather detailed data on these companies.</li>
  <li><b>Expert Panel:</b> A panel of regional experts evaluates each organization based on a set of predetermined criteria.</li>
</ul>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">**Evaluation Criteria:**</span>
</p>
<ul style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <li><b>Market:</b> The quality and relative attractiveness of the market in which the company competes, including both the geographical market and the specific sector (e.g., language learning, tutoring).</li>
  <li><b>Product:</b> The quality and uniqueness of the company's product, as well as its impact on the educational sector.</li>
  <li><b>Team:</b> The experience and diversity of the founding and leadership team.</li>
  <li><b>Capital:</b> The financial health of the company, including its ability to generate or secure sufficient funding through revenue or external investments.</li>
  <li><b>Momentum:</b> Positive changes in the company's size and growth rate over time, including the number of employees and revenue.</li>
</ul>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">**Selection Process:**</span>
</p>
<ul style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <li><b>Data Analysis:</b> Utilizes both internal data and information provided through an application process.</li>
  <li><b>Expert Feedback:</b> Local experts in each market review and provide feedback to ensure accurate evaluation.</li>
  <li><b>Diversity of Innovations:</b> Aims to include companies of various sizes and sectors within EdTech to reflect global trends and innovation.</li>
</ul>

<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📊 <b>Importance of the Ranking</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
These rankings not only highlight the most promising startups but also help connect different regions of the world and share innovations that can improve educational outcomes globally. Promising startups are those that show exceptional potential in terms of innovation, market impact, and growth trajectory, making them key players in driving the future of education. Additionally, they provide investors and other stakeholders with a clear view of emerging trends and the companies leading the change in education.
</p>



<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
For more details on HolonIQ rankings and methodologies, you can visit their official website: <a href="https://www.holoniq.com" style="color: #2a7ae2; text-decoration: none;">HolonIQ</a>.
</p>

***
<a id="methodology"></a>

<p style="font-family: Arial, sans-serif; font-size: 22px; color: #333333; line-height: 1.5; text-align: center; margin-bottom: 20px;">
  <b><span style="color: #9381ff; font-size: 26px;"> |</span> Methodology</b>
</p>

***

<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📋 <b>Generate the list with the 1000 companies</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The project started from the list of 1,000 EdTech startups curated by HolonIQ. The list was provided as images, with the startups segmented by geographic region (Africa, Nordic-Baltic, South Asia, etc.). The first step was to extract the key information for each entry: company name, LinkedIn profile URL, and company website address. The extraction was largely manual and shared among several team members, with support tools such as Gemini and GPT-3 used to speed up the work.
</p>


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This is an example of the source format of the lists mentioned above.
</p>

![](edtech_200_NA.png)

****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#fdf8e1; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;"><b>The Global EdTech 1000 list for this case includes:</b></span>
</p>

<p>We used the 2023 HolonIQ EdTech lists for our analysis.</p>

<ul style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <li>Africa EdTech 50</li>
  <li>Nordic-Baltic EdTech 50</li>
  <li>South Asia EdTech 100</li>
  <li>Middle East & North Africa EdTech 50</li>
  <li>East Asia EdTech 150</li>
  <li>Europe EdTech 200</li>
  <li>Southeast Asia EdTech 50</li>
  <li>Australia & New Zealand EdTech 50</li>
  <li>North America EdTech 200</li>
  <li>Latin America EdTech 100</li>
</ul>


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Once this process was done, the list looked like this:
</p>


![](lis_1.png)

****

<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📋 <b>LinkedIn Scraper</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The compiled list initially contained 1,046 entries. Because it was assembled manually by several team members, some entries were duplicated, and a review found only 937 unique companies. Of these, 117 had no LinkedIn page, so the LinkedIn scraping process was run on the remaining 820 companies. It retrieved 707 profiles, 703 of which also yielded a company webpage; the other 113 could not be retrieved and need review.
</p>
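
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The report does not state how duplicates were identified, so the following is only a minimal sketch of one possible approach: entries are compared on a normalized company name, and the unique entries are then split by whether a LinkedIn URL is present. The field names <code>name</code> and <code>linkedin</code>, the helper functions, and the sample rows are all hypothetical.
</p>

```python
def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace so near-identical entries compare equal."""
    return " ".join(name.lower().split())


def deduplicate(entries):
    """Keep the first entry seen for each normalized company name."""
    seen = set()
    unique = []
    for entry in entries:
        key = normalize_name(entry["name"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


def split_by_linkedin(entries):
    """Separate entries that have a LinkedIn URL from those that do not."""
    with_linkedin = [e for e in entries if e.get("linkedin")]
    without_linkedin = [e for e in entries if not e.get("linkedin")]
    return with_linkedin, without_linkedin


# Hypothetical sample rows, not taken from the real list.
sample = [
    {"name": "Acme Learning", "linkedin": "https://www.linkedin.com/company/acme-learning"},
    {"name": "acme  learning", "linkedin": "https://www.linkedin.com/company/acme-learning"},
    {"name": "Example School", "linkedin": ""},
]

unique = deduplicate(sample)
with_linkedin, without_linkedin = split_by_linkedin(unique)
print(len(sample), len(unique), len(with_linkedin), len(without_linkedin))
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Matching on a normalized name is deliberately simple; entries that differ in spelling or legal suffix would still need a manual review.
</p>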

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The results are summarized in the following table:
</p>

<table style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; border-collapse: collapse; width: 100%; max-width: 600px;">
  <thead>
    <tr style="background-color:#f2ebfb;">
      <th style="border: 1px solid #dddddd; padding: 8px; text-align: left;">Category</th>
      <th style="border: 1px solid #dddddd; padding: 8px; text-align: left;">Number of companies</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Total Companies in List</td>
      <td style="border: 1px solid #dddddd; padding: 8px;">1046</td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Total Unique Companies</td>
      <td style="border: 1px solid #dddddd; padding: 8px;">937</td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Companies without LinkedIn in list</td>
      <td style="border: 1px solid #dddddd; padding: 8px;">117</td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Total companies with LinkedIn</td>
      <td style="border: 1px solid #dddddd; padding: 8px;">820</td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">LinkedIn Retrieved</td>
      <td style="border: 1px solid #dddddd; padding: 8px;">707</td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">LinkedIn Not Retrieved (Needs Review)</td>
      <td style="border: 1px solid #dddddd; padding: 8px;">113</td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">LinkedIn Retrieved Webpage</td>
      <td style="border: 1px solid #dddddd; padding: 8px; background-color: #d4edda; color: #155724; font-weight: bold;">703</td>
    </tr>
  </tbody>
</table>
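
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The counts in the table can be cross-checked with a few lines of arithmetic. The sketch below only restates the figures from the table; the variable names are illustrative, and the retrieval rate is a derived value, not a figure reported by the project.
</p>

```python
# Figures copied from the summary table above.
total_listed = 1046
unique_companies = 937
without_linkedin = 117
retrieved = 707
not_retrieved = 113
retrieved_with_webpage = 703

# Derived values (not stated in the table).
duplicates_removed = total_listed - unique_companies
with_linkedin = unique_companies - without_linkedin
retrieval_rate = retrieved / with_linkedin

# The retrieved and not-retrieved rows should add up to the scraped total.
assert retrieved + not_retrieved == with_linkedin

print(f"Duplicates removed: {duplicates_removed}")
print(f"Companies with LinkedIn: {with_linkedin}")
print(f"Retrieval rate: {retrieval_rate:.1%}")
```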

****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The links that could not be retrieved are often in <span style="font-weight: bold; color: #ff0000;">school</span> format rather than company format. Here are some examples of links from which information could not be collected:
</p>

<h2 style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; text-align: left; font-weight: bold;">Failed Links:</h2>

<ul style="font-family: Arial, sans-serif; font-size: 14px; color: #ff0000; line-height: 1.5; text-align: left;">
  <li><a href="https://in.linkedin.com/school/kalvium/" target="_blank">https://in.linkedin.com/school/kalvium/</a></li>
  <li><a href="https://in.linkedin.com/school/mastersunion/" target="_blank">https://in.linkedin.com/school/mastersunion/</a></li>
  <li><a href="https://in.linkedin.com/school/virohan/" target="_blank">https://in.linkedin.com/school/virohan/</a></li>
  <li><a href="https://in.linkedin.com/school/geeksterin/" target="_blank">https://in.linkedin.com/school/geeksterin/</a></li>
  <li><a href="https://www.linkedin.com/school/15150251/" target="_blank">https://www.linkedin.com/school/15150251/</a></li>
  <li><a href="https://in.linkedin.com/school/nxtwavetech/" target="_blank">https://in.linkedin.com/school/nxtwavetech/</a></li>
  <li><a href="https://in.linkedin.com/school/prepinsta/" target="_blank">https://in.linkedin.com/school/prepinsta/</a></li>
  <li><a href="https://www.linkedin.com/products/skill-lync/" target="_blank">https://www.linkedin.com/products/skill-lync/</a></li>
</ul>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
In contrast, links in <span style="font-weight: bold; color: #28a745;">company</span> format worked correctly. Here are some examples:
</p>

<h2 style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; text-align: left; font-weight: bold;">Successful Links:</h2>

<table style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; border-collapse: collapse; width: 100%; max-width: 600px;">
  <thead>
    <tr style="background-color:#d4edda;">
      <th style="border: 1px solid #dddddd; padding: 8px; text-align: left;">Company Name</th>
      <th style="border: 1px solid #dddddd; padding: 8px; text-align: left;">URL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">byteXL</td>
      <td style="border: 1px solid #dddddd; padding: 8px;"><a href="https://in.linkedin.com/company/bytexl" target="_blank">https://in.linkedin.com/company/bytexl</a></td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Toodle</td>
      <td style="border: 1px solid #dddddd; padding: 8px;"><a href="https://in.linkedin.com/company/toodlerungta" target="_blank">https://in.linkedin.com/company/toodlerungta</a></td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Eupheus Learning</td>
      <td style="border: 1px solid #dddddd; padding: 8px;"><a href="https://in.linkedin.com/company/eupheus-learning" target="_blank">https://in.linkedin.com/company/eupheus-learning</a></td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">10 Minute School</td>
      <td style="border: 1px solid #dddddd; padding: 8px;"><a href="https://bd.linkedin.com/company/10ms" target="_blank">https://bd.linkedin.com/company/10ms</a></td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Adda247</td>
      <td style="border: 1px solid #dddddd; padding: 8px;"><a href="https://in.linkedin.com/company/adda247" target="_blank">https://in.linkedin.com/company/adda247</a></td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Apars Classroom</td>
      <td style="border: 1px solid #dddddd; padding: 8px;"><a href="https://bd.linkedin.com/company/aparsclassroom" target="_blank">https://bd.linkedin.com/company/aparsclassroom</a></td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">EduGorilla</td>
      <td style="border: 1px solid #dddddd; padding: 8px;"><a href="https://in.linkedin.com/company/edugorilla-pvt-ltd" target="_blank">https://in.linkedin.com/company/edugorilla-pvt-ltd</a></td>
    </tr>
    <tr>
      <td style="border: 1px solid #dddddd; padding: 8px;">Infinity Learn</td>
      <td style="border: 1px solid #dddddd; padding: 8px;"><a href="https://in.linkedin.com/company/infinity-learn-by-sri-chaitanya" target="_blank">https://in.linkedin.com/company/infinity-learn-by-sri-chaitanya</a></td>
    </tr>
  </tbody>
</table>
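
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Since the page type appears as the first path segment of each URL, links can be sorted by format before scraping. The following is a minimal sketch using only the standard library; <code>linkedin_page_type</code> is a hypothetical helper and is not part of the project's scraper.
</p>

```python
from urllib.parse import urlparse


def linkedin_page_type(url: str) -> str:
    """Return the first path segment of a LinkedIn URL, e.g. 'company' or 'school'."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    return segments[0] if segments else "unknown"


# URLs taken from the failed and successful examples listed above.
urls = [
    "https://in.linkedin.com/school/kalvium/",
    "https://www.linkedin.com/products/skill-lync/",
    "https://in.linkedin.com/company/bytexl",
]

for url in urls:
    print(linkedin_page_type(url), url)
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Flagging every link whose type is not <code>company</code> up front would separate the entries that need manual review from those the scraper is likely to handle.
</p>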


<div style="background-color:#f9f9f9; border-left:5px solid #ccc; padding:10px; margin:10px 0;">
  <p><strong>Note:</strong> If you want to see the technical information of the process, please click the button below.</p>
</div>


<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
  <!-- Include the Prism.js CSS for the "Tomorrow" theme -->
  <link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/themes/prism-tomorrow.min.css" rel="stylesheet" />
</head>
<body>


<summary style="background-color:#f7ebfd; color:black; border-radius:18px; padding: 0.2em 0.5em; display: inline-block; cursor: pointer; text-align: left;"><b>Technical Information</b></summary>
<div style="margin-top: 10px; text-align: left;">
  <p>This section explains the technical details of how we collected and processed the data. Here is an easy-to-understand breakdown:</p>
  <ol style="padding-left: 30px;">
    <li><strong>Guest Mode:</strong> The code acts like a guest browsing LinkedIn, meaning it doesn't log in but still retrieves the necessary information.</li>
    <li><strong>Fetching Data:</strong> It fetches web pages from LinkedIn and other company websites to get the HTML content. This content includes all the visible information about a company.</li>
    <li><strong>Parsing Data:</strong> After getting the HTML content, the code extracts the important details like company name, address, number of employees, and description. This is done using special tools that can read and understand HTML code.</li>
    <li><strong>Storing Data:</strong> The extracted information is then neatly organized into tables (dataframes) and saved as Excel files. This makes it easy to review and analyze the data later.</li>
    <li><strong>Error Handling:</strong> If a request fails, the code makes up to three attempts in total, waiting longer after each failure, before giving up. Links that still fail are recorded in a separate file for review. This ensures that as much data as possible is collected reliably.</li>
  </ol>
  <p>This approach allowed us to gather comprehensive information about each company, which was then analyzed to identify potential business opportunities for EduNext.</p>
  <pre><code class="language-python">
import jmespath
import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response, RemoteProtocolError
from parsel import Selector
from loguru import logger as log
import pandas as pd
import re
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
import os
import time
import requests

# Initialize an async httpx client
client = AsyncClient(
    http2=True,
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
    },
    follow_redirects=True
)

def strip_text(text):
    """Remove extra spaces while handling None values."""
    return text.strip() if text is not None else text

def get_actual_url(link):
    parsed_url = urlparse(link)
    query_params = parse_qs(parsed_url.query)
    return query_params['url'][0] if 'url' in query_params else link

def parse_company(response_text: str) -> Dict:
    """Parse company main overview page."""
    selector = Selector(response_text)
    script_data = selector.xpath("//script[@type='application/ld+json']/text()").get()
    if script_data:
        script_data = json.loads(script_data)
    else:
        script_data = {}
    script_data = jmespath.search(
        """{
        name: name,
        url: url,
        mainAddress: address,
        description: description,
        numberOfEmployees: numberOfEmployees.value,
        logo: logo
        }""",
        script_data
    ) or {}
    data = {}
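    for element in selector.xpath("//div[contains(@data-test-id, 'about-us')]"):
        # strip_text tolerates missing nodes; a bare .strip() would raise on None
        name = strip_text(element.xpath(".//dt/text()").get())
        value = strip_text(element.xpath(".//dd/text()").get())
        if name:  # skip entries that have no label
            data[name] = value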
    addresses = []
    for element in selector.xpath("//div[contains(@id, 'address') and @id != 'address-0']"):
        address_lines = element.xpath(".//p/text()").getall()
        address = ", ".join(line.replace("\n", "").strip() for line in address_lines)
        addresses.append(address)
    affiliated_pages = []
    for element in selector.xpath("//section[@data-test-id='affiliated-pages']/div/div/ul/li"):
        affiliated_pages.append({
            # Missing nodes yield None or "" instead of raising AttributeError
            "name": strip_text(element.xpath(".//a/div/h3/text()").get()),
            "industry": strip_text(element.xpath(".//a/div/p[1]/text()").get()),
            "address": strip_text(element.xpath(".//a/div/p[2]/text()").get()),
            "linkeinUrl": (element.xpath(".//a/@href").get() or "").split("?")[0]
        })
    similar_pages = []
    for element in selector.xpath("//section[@data-test-id='similar-pages']/div/div/ul/li"):
        similar_pages.append({
            "name": strip_text(element.xpath(".//a/div/h3/text()").get()),
            "industry": strip_text(element.xpath(".//a/div/p[1]/text()").get()),
            "address": strip_text(element.xpath(".//a/div/p[2]/text()").get()),
            "linkeinUrl": (element.xpath(".//a/@href").get() or "").split("?")[0]
        })

    # Additional fields from the second script
    soup = BeautifulSoup(response_text, 'html.parser')
    title_tag = soup.find('title')
    designation_tag = soup.find('h2')
    followers_tag = soup.find('meta', {"property": "og:description"})
    description_tag = soup.find('p', class_='break-words')
    website_tag = soup.find('a', attrs={'data-tracking-control-name': 'about_website'})
    website = get_actual_url(website_tag['href']) if website_tag else "Website not found"
    description_span = soup.find('h4', class_='top-card-layout__second-subline')
    description = description_span.get_text(strip=True) if description_span else "Description not found"
    
    # Crunchbase funding information
    funding_section = soup.find('section', attrs={'data-test-id': 'funding'})
    if funding_section:
        all_rounds_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_all-rounds'})
        if all_rounds_tag:
            all_rounds_match = re.search(r'(\d+ total rounds)', all_rounds_tag.get_text(strip=True))
            all_rounds_info = all_rounds_match.group(1) if all_rounds_match else "All rounds info not found"
        else:
            all_rounds_info = "All rounds info not found"

        last_round_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_last-round'})
        if last_round_tag:
            # The <time> element may be absent, so check it before reading from it
            time_tag = last_round_tag.find('time')
            last_round_info = time_tag.get_text(strip=True) if time_tag else "Last round info not found"
            last_round_amount_tag = funding_section.find('p', class_='text-display-lg')
            last_round_amount = last_round_amount_tag.get_text(strip=True) if last_round_amount_tag else "Last round amount not found"
            last_round_link = last_round_tag.get('href', "Last round link not found")
            last_round_formatted_date = time_tag.get('datetime', "Last round date not found") if time_tag else "Last round date not found"
        else:
            last_round_info = "Last round info not found"
            last_round_amount = "Last round amount not found"
            last_round_link = "Last round link not found"
            last_round_formatted_date = "Last round date not found"

        investors_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_investors'})
        investors_info = investors_tag.get_text(strip=True) if investors_tag else "Investors info not found"
    else:
        all_rounds_info = "Crunchbase funding info not found"
        last_round_info = "Last round info not found"
        last_round_amount = "Last round amount not found"
        investors_info = "Investors info not found"
        last_round_link = "Last round link not found"
        last_round_formatted_date = "Last round date not found"

    # Check if the tags are found before calling get_text()
    name = title_tag.get_text(strip=True).split("|")[0].strip() if title_tag else "Profile Name not found"
    designation = designation_tag.get_text(strip=True) if designation_tag else "Designation not found"
    followers_match = re.search(r'\b(\d[\d,.]*)\s+followers\b', followers_tag["content"]) if followers_tag else None
    followers_count = followers_match.group(1) if followers_match else "Followers count not found"
    description_profile = description_tag.get_text(strip=True) if description_tag else "Profile Description not found"

    additional_data = {
        "profileName": name,
        "designation": designation,
        "followersCount": followers_count,
        "profileDescription": description_profile,
        "website": website,
        "crunchbaseAllRoundsInfo": all_rounds_info,
        "crunchbaseLastRoundInfo": last_round_info,
        "crunchbaseLastRoundAmount": last_round_amount,
        "crunchbaseInvestorsInfo": investors_info,
        "lastRoundFormattedDate": last_round_formatted_date,
        "crunchbaseLink": last_round_link,
    }

    data = {**script_data, **data, **additional_data}
    data["addresses"] = addresses    
    data["affiliatedPages"] = affiliated_pages
    data["similarPages"] = similar_pages
    return data

def read_links_from_file(file_path: str) -> List[str]:
    """Read URLs from a text or Excel file."""
    if file_path.endswith('.txt'):
        with open(file_path, 'r') as file:
            urls = file.read().splitlines()
    elif file_path.endswith('.xlsx'):
        df = pd.read_excel(file_path)
        urls = df['Links'].tolist()
    else:
        raise ValueError("Unsupported file format. Please use a .txt or .xlsx file.")
    return urls

# Initialize dataframes globally
df_company_info = pd.DataFrame()
df_company_addresses = pd.DataFrame()
df_affiliated_pages = pd.DataFrame()
df_similar_pages = pd.DataFrame()

async def fetch_with_retry(url, retries=3, backoff_factor=0.5):
    """GET a URL with the shared async client, retrying on protocol errors."""
    for attempt in range(retries):
        try:
            response = await client.get(url)
            return response
        except RemoteProtocolError as e:
            log.error(f"Attempt {attempt + 1} for {url} failed: {e}")
            # Non-blocking exponential backoff; time.sleep would stall the event loop
            await asyncio.sleep(backoff_factor * (2 ** attempt))
    raise Exception(f"All {retries} attempts failed for {url}")

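def fetch_with_requests(url, retries=3, backoff_factor=0.5):
    """Synchronous fallback used when LinkedIn answers the async client with status 999."""
    headers = {
        "User-Agent": "Guest",
    }
    for attempt in range(retries):
        try:
            # A timeout keeps a stalled connection from hanging the whole run
            response = requests.get(url, headers=headers, timeout=30)
            if response.status_code == 200:
                return response
            log.error(f"Attempt {attempt + 1} for {url} returned status {response.status_code}")
        except requests.RequestException as e:
            log.error(f"Attempt {attempt + 1} for {url} failed: {e}")
        # Back off after any failed attempt, not only after exceptions
        time.sleep(backoff_factor * (2 ** attempt))
    raise Exception(f"All {retries} attempts failed for {url}")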

async def scrape_company(urls: List[str]) -> tuple[List[Dict], List[str]]:
    """Scrape public LinkedIn company pages; return (parsed companies, failed URLs)."""
    data = []
    failed_links = []
    for url in urls:
        try:
            response = await fetch_with_retry(url)
            if response.status_code == 200:
                data.append(parse_company(response.text))
                log.success(f"Successfully scraped {url}")
            elif response.status_code == 999:  # Use requests as fallback
                log.warning(f"Status code 999 for {url}, switching to requests")
                response = fetch_with_requests(url)
                if response.status_code == 200:
                    data.append(parse_company(response.text))
                    log.success(f"Successfully scraped {url} with requests fallback")
                else:
                    failed_links.append(url)
                    log.error(f"Failed to scrape {url} with status code {response.status_code}")
            else:
                failed_links.append(url)
                log.error(f"Failed to scrape {url} with status code {response.status_code}")
            # Delay between requests to avoid rate limiting (non-blocking)
            await asyncio.sleep(1)
        except Exception as e:
            failed_links.append(url)
            log.error(f"Error scraping {url}: {e}")
    return data, failed_links

async def run():
    urls = read_links_from_file('profiles.txt')  # Using the 'profiles.txt' file
    profile_data, failed_links = await scrape_company(urls)
    
    global df_company_info, df_company_addresses, df_affiliated_pages, df_similar_pages

    for company in profile_data:
        main_address = company.get('mainAddress', {})
        company_info = {
            "name": company.get("name"),
            "url": company.get("url"),
            "streetAddress": main_address.get("streetAddress") if main_address else None,
            "addressLocality": main_address.get("addressLocality") if main_address else None,
            "addressRegion": main_address.get("addressRegion") if main_address else None,
            "postalCode": main_address.get("postalCode") if main_address else None,
            "addressCountry": main_address.get("addressCountry") if main_address else None,
            "description": company.get("description"),
            "numberOfEmployees": company.get("numberOfEmployees"),
            "Industry": company.get("Industry"),
            "Company size": company.get("Company size"),
            "Headquarters": company.get("Headquarters"),
            "Type": company.get("Type"),
            "Specialties": company.get("Specialties"),
            "profileName": company.get("profileName"),
            "designation": company.get("designation"),
            "followersCount": company.get("followersCount"),
            "profileDescription": company.get("profileDescription"),
            "website": company.get("website"),
            "crunchbaseAllRoundsInfo": company.get("crunchbaseAllRoundsInfo"),
            "crunchbaseLastRoundInfo": company.get("crunchbaseLastRoundInfo"),
            "crunchbaseLastRoundAmount": company.get("crunchbaseLastRoundAmount"),
            "crunchbaseInvestorsInfo": company.get("crunchbaseInvestorsInfo"),
            "lastRoundFormattedDate": company.get("lastRoundFormattedDate"),
            "crunchbaseLink": company.get("crunchbaseLink"),
        }
        df_company_info = pd.concat([df_company_info, pd.DataFrame([company_info])])

        for address in company.get("addresses", []):
            parts = address.split(", ")
            country = parts[-1] if parts else ""
            company_address = {
                "name": company.get("name"),
                "url": company.get("url"),
                "addresses": address,
                "country offices": country
            }
            df_company_addresses = pd.concat([df_company_addresses, pd.DataFrame([company_address])])

        for affiliated in company.get("affiliatedPages", []):
            affiliated_page = {
                "name": company.get("name"),
                "url": company.get("url"),
                "affiliated_name": affiliated["name"],
                "industry": affiliated["industry"],
                "address": affiliated["address"],
                "linkeinUrl": affiliated["linkeinUrl"]
            }
            df_affiliated_pages = pd.concat([df_affiliated_pages, pd.DataFrame([affiliated_page])])

        for similar in company.get("similarPages", []):
            similar_page = {
                "name": company.get("name"),
                "url": company.get("url"),
                "similar_name": similar["name"],
                "industry": similar["industry"],
                "address": similar["address"],
                "linkeinUrl": similar["linkeinUrl"]
            }
            df_similar_pages = pd.concat([df_similar_pages, pd.DataFrame([similar_page])])

        # Save to Excel after each company
        df_company_info.to_excel("company_information.xlsx", index=False)
        df_company_addresses.to_excel("company_addresses.xlsx", index=False)
        df_affiliated_pages.to_excel("affiliated_pages.xlsx", index=False)
        df_similar_pages.to_excel("similar_pages.xlsx", index=False)

    if failed_links:
        df_failed_links = pd.DataFrame({"failed_links": failed_links})
        df_failed_links.to_excel("failed_links.xlsx", index=False)

if __name__ == "__main__":
    try:
        asyncio.run(run())
    except Exception as e:
        log.error(f"Script terminated due to an error: {e}")

        # Save what has been scraped so far
        if not df_company_info.empty:
            df_company_info.to_excel("company_information.xlsx", index=False)
        if not df_company_addresses.empty:
            df_company_addresses.to_excel("company_addresses.xlsx", index=False)
        if not df_affiliated_pages.empty:
            df_affiliated_pages.to_excel("affiliated_pages.xlsx", index=False)
        if not df_similar_pages.empty:
            df_similar_pages.to_excel("similar_pages.xlsx", index=False)

</code></pre>
</div>
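
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The parsing step relies on the JSON-LD block that a public page embeds in a <code>script</code> tag. The following is a minimal, standalone sketch of that idea. It swaps in the standard library's <code>html.parser</code> for the <code>parsel</code> and <code>jmespath</code> calls used in the scraper, and the sample HTML, the class <code>JsonLdExtractor</code>, and the function <code>extract_company_fields</code> are hypothetical illustrations, not real LinkedIn markup or project code.
</p>

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append("".join(self._chunks))


def extract_company_fields(html: str) -> dict:
    """Return a few company fields from the first JSON-LD block, or {} if none parse."""
    parser = JsonLdExtractor()
    parser.feed(html)
    if not parser.blocks:
        return {}
    try:
        payload = json.loads(parser.blocks[0])
    except json.JSONDecodeError:
        return {}
    if not isinstance(payload, dict):
        return {}
    employees = payload.get("numberOfEmployees")
    return {
        "name": payload.get("name"),
        "url": payload.get("url"),
        "description": payload.get("description"),
        "numberOfEmployees": employees.get("value") if isinstance(employees, dict) else None,
    }


# Hypothetical page content used only to exercise the sketch.
sample_html = """
<html><head>
<script type="application/ld+json">
{"name": "Example Co", "url": "https://example.com",
 "numberOfEmployees": {"value": 42}}
</script>
</head><body></body></html>
"""

print(extract_company_fields(sample_html))
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Returning an empty dictionary when the block is missing or malformed lets the caller fall back to other page elements instead of failing on that company.
</p>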


<!-- Include Prism.js -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/components/prism-python.min.js"></script>

</body>
</html>
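
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The retry behaviour described in the error-handling step can be illustrated in isolation. The sketch below is a simplified, assumed version that uses only the standard library: <code>retry_async</code> and <code>flaky_fetch</code> are hypothetical names, <code>ConnectionError</code> stands in for the network errors a real HTTP client would raise, and the waits use <code>asyncio.sleep</code> so that other coroutines are not blocked.
</p>

```python
import asyncio


async def retry_async(operation, retries=3, backoff_factor=0.5):
    """Await operation(), retrying on ConnectionError with exponential backoff."""
    last_error = None
    for attempt in range(retries):
        try:
            return await operation()
        except ConnectionError as error:
            last_error = error
            if attempt < retries - 1:
                # Wait 0.5s, 1s, 2s, ... (scaled by backoff_factor) without blocking the loop
                await asyncio.sleep(backoff_factor * (2 ** attempt))
    raise RuntimeError(f"All {retries} attempts failed") from last_error


# Simulated fetch that fails twice and then succeeds.
calls = {"count": 0}


async def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = asyncio.run(retry_async(flaky_fetch, backoff_factor=0.01))
print(result, calls["count"])
```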

****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The data collection and analysis process generates four main outputs.
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;"><b>1. LinkedIn Info</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The first is called "LinkedIn Info". It contains the following columns, shown here with example values for one company (Eupheus Learning):
</p>

<table style="width:100%; border: 1px solid #ddd; border-collapse: collapse; text-align: left;">
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Company Name</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Eupheus Learning</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">URL</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://in.linkedin.com/company/eupheus-learning" target="_blank">https://in.linkedin.com/company/eupheus-learning</a></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Street Address</td>
    <td style="border: 1px solid #ddd; padding: 8px;">A-12, Mohan Co-operative Industrial Estate</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Locality</td>
    <td style="border: 1px solid #ddd; padding: 8px;">New Delhi</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Region</td>
    <td style="border: 1px solid #ddd; padding: 8px;">New Delhi</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Postal Code</td>
    <td style="border: 1px solid #ddd; padding: 8px;">110044</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Country</td>
    <td style="border: 1px solid #ddd; padding: 8px;">IN</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Description</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Eupheus in Greek means - "Active seeking of knowledge" Our Vision is to offer pedagogically differentiated technology driven solutions that lead to critical thinking and achievement of higher learning outcomes by seamlessly integrating in-class and at home learning in the private school segment of the Pre-K to 12 market. Our aim is to bridge the gap between what is taught in-class using institutional textbook driven solutions and retail at-home learning providers by seamlessly integrating both.</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Number of Employees</td>
    <td style="border: 1px solid #ddd; padding: 8px;">275</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Industry</td>
    <td style="border: 1px solid #ddd; padding: 8px;">E-Learning Providers</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Company Size</td>
    <td style="border: 1px solid #ddd; padding: 8px;">51-200 employees</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Headquarters</td>
    <td style="border: 1px solid #ddd; padding: 8px;">New Delhi, New Delhi</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Company Type</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Privately Held</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Specialties</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Education, K-12, Curricular, E Learning, Pre Primary, Middle School Solutions, Senior School Solutions, Digital Reference Resources, Language Learning, Primary School Solutions, Teacher Support, Age Appropriate Resource, Digital, Learning, Print, Live Books, Fiction e Books, Coding, Kinesthetic Learning, CBSE Aligned Text Book, ICSE Aligned Text Book, Digital Library, Reading Program, Atal Tinkering Lab, and TOEFL</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Profile Name</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Eupheus Learning</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Designation</td>
    <td style="border: 1px solid #ddd; padding: 8px;">E-Learning Providers</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Followers Count</td>
    <td style="border: 1px solid #ddd; padding: 8px;">9,981</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Profile Description</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Eupheus in Greek means - "Active seeking of knowledge" Our Vision is to offer pedagogically differentiated technology driven solutions that lead to critical thinking and achievement of higher learning outcomes by seamlessly integrating in-class and at home learning in the private school segment of the Pre-K to 12 market. Our aim is to bridge the gap between what is taught in-class using institutional textbook driven solutions and retail at-home learning providers by seamlessly integrating both.</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Website</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://www.eupheus.in" target="_blank">https://www.eupheus.in</a></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Crunchbase All Rounds Info</td>
    <td style="border: 1px solid #ddd; padding: 8px;">4 total rounds</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Crunchbase Last Round Info</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Oct 14, 2021</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Crunchbase Last Round Amount</td>
    <td style="border: 1px solid #ddd; padding: 8px;">US$ 10.0M</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Crunchbase Investors Info</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Lightrock</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Last Round Formatted Date</td>
    <td style="border: 1px solid #ddd; padding: 8px;">14/10/2021</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Crunchbase Link</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://www.crunchbase.com/funding_round/eupheus-learning-series-c--65802481?utm_source=linkedin&utm_medium=referral&utm_campaign=linkedin_companies&utm_content=last_funding_anon&trk=funding_last-round" target="_blank">Crunchbase Link</a></td>
  </tr>
</table>
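
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The "Last Round Formatted Date" field appears to be a day-first rendering of the date in "Crunchbase Last Round Info". The snippet below is a minimal sketch of that conversion, assuming the scraped value always follows the <code>Oct 14, 2021</code> pattern shown in the table. The function name <code>format_round_date</code> is hypothetical and is not taken from the project's code.
</p>

```python
from datetime import datetime

def format_round_date(raw):
    """Convert a date like 'Oct 14, 2021' to 'DD/MM/YYYY'; return None if it does not parse."""
    try:
        # %b expects an abbreviated English month name (assumes an English/C locale)
        return datetime.strptime(raw.strip(), "%b %d, %Y").strftime("%d/%m/%Y")
    except (ValueError, AttributeError):
        # ValueError: unexpected format; AttributeError: value is not a string (e.g. None)
        return None

print(format_round_date("Oct 14, 2021"))
```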

*****
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
For more detailed information, you can refer to the complete dataset [here](https://docs.google.com/spreadsheets/d/1UtK3tMqmukREBoPvLxICo9zt1rizbtb41Q_rehTh4Uo/edit?gid=1045195882#gid=1045195882).
</p>


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;"><b>2. Affiliated Pages LinkedIn</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The second output is called "Affiliated Pages LinkedIn" and contains information about affiliated pages (i.e., other pages related to the same company). Below is an example of the data collected:
</p>

<table style="width:100%; border: 1px solid #ddd; border-collapse: collapse; text-align: left;">
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Name</td>
    <td style="border: 1px solid #ddd; padding: 8px;">upGrad</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">URL</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://in.linkedin.com/company/ueducation" target="_blank">https://in.linkedin.com/company/ueducation</a></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Affiliated Name</td>
    <td style="border: 1px solid #ddd; padding: 8px;">upGrad Placements</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Industry</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Human Resources Services</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Address</td>
    <td style="border: 1px solid #ddd; padding: 8px;"></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">LinkedIn URL</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://in.linkedin.com/company/upgrad-placements-" target="_blank">https://in.linkedin.com/company/upgrad-placements-</a></td>
  </tr>
</table>

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
For more detailed information, you can refer to the complete dataset [here](https://docs.google.com/spreadsheets/d/1UtK3tMqmukREBoPvLxICo9zt1rizbtb41Q_rehTh4Uo/edit?gid=1081800839#gid=1081800839).
</p>


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;"><b>3. Similar Pages LinkedIn</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The third output is called "Similar Pages LinkedIn" and contains the similar pages recommended by the LinkedIn algorithm. These are pages that LinkedIn suggests as being similar based on various factors such as industry, company size, and other attributes. Below is an example of the data collected:
</p>

<table style="width:100%; border: 1px solid #ddd; border-collapse: collapse; text-align: left;">
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Name</td>
    <td style="border: 1px solid #ddd; padding: 8px;">byteXL</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">URL</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://in.linkedin.com/company/bytexl" target="_blank">https://in.linkedin.com/company/bytexl</a></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Similar Company</td>
    <td style="border: 1px solid #ddd; padding: 8px;">CODINGCLUB</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Industry</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Education</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Address</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Vadodara, GUJARAT</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">LinkedIn URL</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://in.linkedin.com/company/codingclub36" target="_blank">https://in.linkedin.com/company/codingclub36</a></td>
  </tr>
</table>

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
For more detailed information, you can refer to the complete dataset [here](https://docs.google.com/spreadsheets/d/1UtK3tMqmukREBoPvLxICo9zt1rizbtb41Q_rehTh4Uo/edit?gid=925315281#gid=925315281).
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;"><b>4. Country Offices Addresses LinkedIn</b></span>
</p>


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The fourth output is called "Country Offices Addresses LinkedIn" and contains information about the different office locations and countries where the company has a presence. Below is an example of the data collected:
</p>


<table style="width:100%; border: 1px solid #ddd; border-collapse: collapse; text-align: left;">
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Name</td>
    <td style="border: 1px solid #ddd; padding: 8px;">byteXL</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">URL</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://in.linkedin.com/company/bytexl" target="_blank">https://in.linkedin.com/company/bytexl</a></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Addresses</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Plano, TX, US</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Country Offices</td>
    <td style="border: 1px solid #ddd; padding: 8px;">US</td>
  </tr>
</table>
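
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
In the example above, the "Country Offices" value matches the last comma-separated part of the address. The snippet below is a minimal sketch of that derivation. It assumes that every address ends with a country code and that a company's addresses arrive as a list of strings; both are assumptions, and <code>country_offices</code> is a hypothetical helper rather than the project's actual code.
</p>

```python
def country_offices(addresses):
    """Return the sorted, de-duplicated country codes found at the end of each address."""
    codes = set()
    for address in addresses:
        # Split on commas and ignore empty fragments such as trailing commas
        parts = [part.strip() for part in address.split(",") if part.strip()]
        if parts:
            codes.add(parts[-1].upper())
    return sorted(codes)

print(country_offices(["Plano, TX, US"]))
```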

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
For more detailed information, you can refer to the complete dataset [here](https://docs.google.com/spreadsheets/d/1UtK3tMqmukREBoPvLxICo9zt1rizbtb41Q_rehTh4Uo/edit?gid=1888662846#gid=1888662846).
</p>


<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">🪟 <b>Website Scraper</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
After generating the list of 1,000 EdTech startups from the HolonIQ ranking, together with their LinkedIn links, the team scraped the information available on LinkedIn. They then developed a scraper for the companies' own web pages. To ensure greater accuracy, this scraper used the website URLs retrieved by the LinkedIn scraper rather than those added manually during the initial list generation. The text of each website was cleaned and written to a txt file to capture more comprehensive details about each company.
</p>


<div style="background-color:#f9f9f9; border-left:5px solid #ccc; padding:10px; margin:10px 0;">
  <p><strong>Note:</strong> To see the technical details of the process, click the button below.</p>
</div>

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
  <!-- Include Prism.js CSS for the "Tomorrow" theme -->
  <link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/themes/prism-tomorrow.min.css" rel="stylesheet" />
</head>
<body>


<summary style="background-color:#f7ebfd; color:black; border-radius:18px; padding: 0.2em 0.5em; display: inline-block; cursor: pointer; text-align: left;">**Technical Information**</summary>
<div style="margin-top: 10px; text-align: left;">
  <p>This section explains the technical details of how we collected and processed the data. Here is an easy-to-understand breakdown:</p>
  <ol style="padding-left: 30px;">
    <li><strong>Cleaning Text:</strong> The `clean_text` function removes excess whitespace and special characters from the text to make it more readable.</li>
    <li><strong>Language Detection:</strong> The `detect_language_from_html` and `detect_language` functions identify the language of the text, either from the HTML tag or the text content itself.</li>
    <li><strong>Fetching HTML Content:</strong> The `get_html_with_selenium` function uses Selenium to load web pages and ensure all content is captured, especially for pages that load content dynamically.</li>
    <li><strong>Saving Web Page Content:</strong> The `save_formatted_webpage_content` function retrieves web page content with `requests`, cleans it, detects the language, and saves it to a text file. If the first attempt yields fewer than 300 words (the script's "token count" is the number of whitespace-separated words in the paragraph text), it switches to Selenium for a more thorough scrape.</li>
    <li><strong>Handling Requests and Errors:</strong> The same function catches request errors and any other exceptions and returns the error message, so that one failing website does not stop the run.</li>
    <li><strong>Main Processing Loop:</strong> The `main` function reads input data from an Excel file, processes each company's website, and saves the results and any errors to separate Excel files.</li>
  </ol>
  <p>This approach allowed us to gather comprehensive information about each company, which was then analyzed to identify potential business opportunities for Edunext.</p>
  <pre><code class="language-python">
import os
import re
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
from langdetect import detect, LangDetectException
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

from webdriver_manager.chrome import ChromeDriverManager

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.replace('\xa0', ' ')
    return text.strip()

def detect_language_from_html(soup):
    html_tag = soup.find('html')
    if html_tag and html_tag.get('lang'):
        return html_tag['lang']
    else:
        return None

def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        return 'unknown'

def get_html_with_selenium(url):
    """Load the page in headless Chrome and return the rendered HTML."""
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    # Pass both features in a single switch; repeated --disable-features switches can override each other
    options.add_argument("--disable-features=SameSiteByDefaultCookies,CookiesWithoutSameSiteMustBeSecure")
    options.add_argument("--log-level=3")  # Reduce logging output

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    try:
        driver.get(url)
        try:
            # Wait for the body content to be loaded
            WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.TAG_NAME, 'body')))
            # Scroll to the bottom to trigger content that loads on scroll
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(5)  # Wait for content to load
            # Additionally, wait until at least one paragraph element is present
            WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//p")))
        except Exception as e:
            print(f"Error waiting for the page to load: {e}")
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()

def save_formatted_webpage_content(url, company_name):
    try:
        print(f"Retrieving content from {url} for {company_name}...")

        response = requests.get(url, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        html_content = response.text

        soup = BeautifulSoup(html_content, 'html.parser')
        paragraphs = soup.find_all('p')

        paragraph_texts = '\n\n'.join([clean_text(p.get_text()) for p in paragraphs])
        token_count = len(paragraph_texts.split())

        # "Token count" here is a whitespace-separated word count; below 300 words, retry with Selenium
        if token_count < 300:
            print(f"Insufficient token count retrieved from {url} using requests. Switching to Selenium...")
            html_content = get_html_with_selenium(url)
            soup = BeautifulSoup(html_content, 'html.parser')
            paragraphs = soup.find_all('p')
            paragraph_texts = '\n\n'.join([clean_text(p.get_text()) for p in paragraphs])
            token_count = len(paragraph_texts.split())

        # Check if HTML content is retrieved
        if not soup.body:
            print(f"No body content found for {company_name} at {url}.")
            return None, None, f"No body content found at {url}"
        
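        # soup.title.string is None for an empty or nested <title>, so check it before cleaning
        title = clean_text(soup.title.string) if soup.title and soup.title.string else 'No Title Found'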
        formatted_content = f"Title: {title}\n\n{paragraph_texts}"

        # Detect language
        language = detect_language_from_html(soup)
        if not language:
            language = detect_language(paragraph_texts)
        
        filepath = os.path.join('webpages', f"{company_name}.txt")
        with open(filepath, 'w', encoding='utf-8') as file:
            file.write(formatted_content)
        
        print(f"Content retrieved and saved for {company_name}. Token count: {token_count}, Language: {language}.")
        return token_count, language, None
    except requests.exceptions.RequestException as req_err:
        print(f"Request error for {company_name}: {req_err}")
        return None, None, f"Request error: {req_err}"
    except Exception as err:
        print(f"Error for {company_name}: {err}")
        return None, None, f"Error: {err}"

def main():
    input_filename = 'cleaned_company_information.xlsx'
    output_data_filename = 'company_token_counts.xlsx'
    output_errors_filename = 'website_errors.xlsx'
    
    # Read the input file
    df = pd.read_excel(input_filename)
    
    # Ensure the 'webpages' directory exists
    os.makedirs('webpages', exist_ok=True)
    
    # Lists to store results and errors
    results = []
    errors = []

    # Iterate over the rows in the DataFrame
    for index, row in df.iterrows():
        print(f"Processing {index + 1}/{len(df)}: {row['Company Name']} ({row['Website']})")
        company_name = row['Company Name']
        website = row['Website']
        token_count, language, error = save_formatted_webpage_content(website, company_name)
        
        if token_count is not None:
            results.append({'Company Name': company_name, 'Website': website, 'Token Count': token_count, 'Language': language})
        if error is not None:
            errors.append({'Company Name': company_name, 'Website': website, 'Error': error})
    
    # Save the results to an Excel file
    results_df = pd.DataFrame(results)
    results_df.to_excel(output_data_filename, index=False)
    
    # Save the errors to an Excel file
    errors_df = pd.DataFrame(errors)
    errors_df.to_excel(output_errors_filename, index=False)

    print(f"Process completed. Data saved to {output_data_filename} and errors saved to {output_errors_filename}.")

if __name__ == '__main__':
    main()

  </code></pre>
</div>
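
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
One thing to watch in the script above: <code>company_name</code> is used directly as the file name, so a name containing characters such as <code>/</code> or <code>:</code> can make <code>open</code> fail or point outside the <code>webpages</code> folder. The snippet below is a minimal sketch of one way to guard against that. The <code>safe_filename</code> helper and its <code>max_length</code> parameter are hypothetical and are not part of the original script.
</p>

```python
import re

def safe_filename(company_name, max_length=100):
    """Replace characters that are unsafe in file names and trim the result."""
    # Keep letters, digits, underscores, spaces, dots and hyphens; replace everything else
    cleaned = re.sub(r"[^\w .-]", "_", company_name).strip(" .")
    # Fall back to a placeholder if nothing usable is left
    return (cleaned or "unnamed")[:max_length]

print(safe_filename("A/B Testing: Co."))
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
If a helper like this were adopted, it would need to be applied wherever the txt files are read back as well, so that the names still match.
</p>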

<!-- Include Prism.js -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/components/prism-python.min.js"></script>

</body>
</html>

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The output of the web scraping step is a folder 📁 of txt files, one per company. Each file contains the page title and the paragraph text extracted from the company's website, so the data is organized systematically for further analysis.
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The following image shows an example:
</p>


![](webpage_scraper_example.png)

<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#f2ebfb; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">🔧 <b>Integration with ChatGPT</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The next step was to send the LinkedIn information and the txt files with website content to OpenAI's GPT-3.5 model. These data sources were provided as context so that the model could answer specific questions designed to build a comprehensive profile of each company. The model was asked the following questions:
</p>


<ul style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.8; text-align: left;">
    <li>Does the company have any connection or use the Open edX platform?</li>
    <li>What is the main product or service the company offers?</li>
    <li>List three main competitors of the company, including their names and websites.</li>
    <li>List three main customers of the company.</li>
    <li>Is the company primarily focused on B2B (business-to-business) or B2C (business-to-consumer) operations?</li>
    <li>Estimate the number of customers the company has.</li>
    <li>In which country does the company primarily operate?</li>
    <li>Determine if the company is a potential client, partner, competitor, or provider for Edunext.</li>
    <li>Determine the potential vertical for the company from the given list of verticals.</li>
    <li>Based on the provided information and any additional research, what are the primary capabilities of the company? (This was repeated to identify up to five capabilities)</li>
</ul>
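
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The snippet below is a minimal sketch of how one such request could be assembled, assuming the chat-style message format (a list of <code>role</code>/<code>content</code> dictionaries) used by OpenAI's chat models. The function name, the wording of the system message, and the <code>max_chars</code> truncation limit are illustrative assumptions, not the project's actual prompt. The call to the API itself is left out.
</p>

```python
def build_messages(linkedin_info, website_text, question, max_chars=6000):
    """Assemble chat messages that pair the company context with one question."""
    # Flatten the LinkedIn fields into 'Field: value' lines, skipping empty values
    profile = "\n".join(f"{key}: {value}" for key, value in linkedin_info.items() if value)
    # Truncate the website text so the request stays within a size budget (assumed limit)
    context = f"LinkedIn profile:\n{profile}\n\nWebsite text:\n{website_text[:max_chars]}"
    return [
        {"role": "system", "content": "You are analyzing an EdTech company for a market study."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ]

messages = build_messages(
    {"Profile Name": "Eupheus Learning", "Designation": "E-Learning Providers", "Address": ""},
    "Title: Example\n\nSample paragraph text.",
    "What is the main product or service the company offers?",
)
print(messages[1]["content"])
```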


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The code leverages additional data from several Excel tables to enrich the analysis:
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>Role Definitions</b></span>
  <br>Provides detailed definitions for roles like client, partner, competitor, and provider, as well as information about Edunext.
</p>


<style>
  table {
    width: 100%;
    border-collapse: collapse;
    font-family: Arial, sans-serif;
    font-size: 14px;
    color: #333333;
  }
  th, td {
    border: 1px solid #dddddd;
    text-align: left;
    padding: 8px;
  }
  th {
    background-color: #f2f2f2;
  }
</style>

<table>
  <thead>
    <tr>
      <th>Role</th>
      <th>Detailed Definitions</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Edunext</strong></td>
      <td>Edunext is the company driving this analysis. It is a software and services company dedicated to the Open edX platform.</td>
    </tr>
    <tr>
      <td><strong>CLIENT</strong></td>
      <td>An entity that is likely to require hosting, maintenance, or professional services for the Open edX platform.</td>
    </tr>
    <tr>
      <td><strong>PARTNER</strong></td>
      <td>An entity that provides instructional design services or a technology aggregator that may subcontract hosting or professional services for the Open edX platform to Edunext, as it is not part of its core business. It can also be an entity that provides a tool for online learning that can be integrated into the Open edX LMS or uses standard interoperability protocols such as LTI.</td>
    </tr>
    <tr>
      <td><strong>COMPETITOR</strong></td>
      <td>An entity that has Open edX hosting, maintenance, or custom development as part of their core business offering and expertise.</td>
    </tr>
    <tr>
      <td><strong>PROVIDER</strong></td>
      <td>An entity that supplies services that may be relevant and useful to Edunext to enhance its value proposition or raise its productivity.</td>
    </tr>
  </tbody>
</table>
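
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Because the model replies in free text, its answer to the role question has to be mapped back to one of the four labels above. The snippet below is a minimal sketch of such a check, assuming the model is asked to reply with the label name. <code>parse_role</code> is a hypothetical helper; how the project actually handled this step is not documented here.
</p>

```python
import re

ROLES = ("CLIENT", "PARTNER", "COMPETITOR", "PROVIDER")

def parse_role(answer):
    """Return the single role label mentioned in the answer, or None if absent or ambiguous."""
    upper = answer.upper()
    # Match whole words only, so that e.g. "PROVIDERS" is not counted as a label
    found = {role for role in ROLES if re.search(rf"\b{role}\b", upper)}
    return found.pop() if len(found) == 1 else None

print(parse_role("The company is most likely a PARTNER for Edunext."))
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Returning <code>None</code> for ambiguous answers keeps uncertain classifications visible for manual review instead of silently picking one label.
</p>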

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>Verticals</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
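Lists the verticals, with their definitions, from which the model selects the best match for each company. Together with the role definitions and the capabilities list, these additional sources give the GPT-3.5 model the context it needs to generate more precise and contextually relevant responses.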
</p>


<table>
  <thead>
    <tr>
      <th>Vertical Name</th>
      <th>Vertical Definition</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>K-12 Education</strong></td>
      <td>Technologies and platforms aimed at primary and secondary education.</td>
    </tr>
    <tr>
      <td><strong>Higher Education</strong></td>
      <td>Solutions tailored for colleges, universities, and other tertiary education institutions.</td>
    </tr>
    <tr>
      <td><strong>Professional Development & Corporate Training</strong></td>
      <td>Tools and programs designed for employee training, upskilling, and professional certifications.</td>
    </tr>
    <tr>
      <td><strong>Language Learning</strong></td>
      <td>Apps, platforms, and services focused on teaching new languages.</td>
    </tr>
    <tr>
      <td><strong>STEM Education</strong></td>
      <td>Resources and tools specific to Science, Technology, Engineering, and Mathematics education.</td>
    </tr>
    <tr>
      <td><strong>Learning Management Systems (LMS)</strong></td>
      <td>Platforms that provide a comprehensive management system for learning processes, often used by institutions and corporations.</td>
    </tr>
    <tr>
      <td><strong>Tutoring and Mentoring</strong></td>
      <td>Services and platforms that connect students with tutors and mentors.</td>
    </tr>
    <tr>
      <td><strong>Online Courses and MOOCs (Massive Open Online Courses)</strong></td>
      <td>Platforms offering a variety of courses across different subjects, typically available to a large audience.</td>
    </tr>
    <tr>
      <td><strong>Content Creation and Publishing</strong></td>
      <td>Tools and platforms for creating, sharing, and publishing educational content.</td>
    </tr>
    <tr>
      <td><strong>Edutainment</strong></td>
      <td>Educational tools and resources that incorporate entertainment, such as educational games and interactive learning tools.</td>
    </tr>
    <tr>
      <td><strong>Special Education</strong></td>
      <td>Technologies designed to support learners with special needs.</td>
    </tr>
    <tr>
      <td><strong>Assessment and Testing</strong></td>
      <td>Solutions that focus on student evaluation, testing, and examination.</td>
    </tr>
    <tr>
      <td><strong>Early Childhood Education</strong></td>
      <td>Platforms and tools aimed at pre-K education.</td>
    </tr>
    <tr>
      <td><strong>Virtual and Augmented Reality (VR/AR) in Education</strong></td>
      <td>Immersive technologies used to enhance the learning experience.</td>
    </tr>
    <tr>
      <td><strong>Coding and Programming</strong></td>
      <td>Platforms and tools that teach coding and programming skills.</td>
    </tr>
    <tr>
      <td><strong>Collaboration and Communication Tools</strong></td>
      <td>Solutions that facilitate communication and collaboration among students, teachers, and parents.</td>
    </tr>
    <tr>
      <td><strong>School Administration and Management</strong></td>
      <td>Systems designed to manage school operations, such as attendance, grades, and resource planning.</td>
    </tr>
    <tr>
      <td><strong>Educational Hardware</strong></td>
      <td>Devices and physical technologies used in educational settings, like tablets, interactive whiteboards, and robotics kits.</td>
    </tr>
    <tr>
      <td><strong>Adaptive Learning</strong></td>
      <td>Technologies that use data and analytics to personalize the learning experience.</td>
    </tr>
    <tr>
      <td><strong>EdTech Infrastructure</strong></td>
      <td>Backend technologies and services that support educational platforms and tools, like cloud services, data management, and cybersecurity solutions.</td>
    </tr>
  </tbody>
</table>
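
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The model's answer to the vertical question can be checked against the names in this table. The snippet below is a minimal sketch that uses the standard library's <code>difflib.get_close_matches</code> to tolerate small wording differences. The shortened <code>VERTICALS</code> list, the <code>match_vertical</code> name, and the 0.8 cutoff are illustrative assumptions.
</p>

```python
from difflib import get_close_matches

# Shortened for illustration; the full list of names is in the table above
VERTICALS = [
    "K-12 Education",
    "Higher Education",
    "Language Learning",
    "STEM Education",
    "Coding and Programming",
]

def match_vertical(answer, cutoff=0.8):
    """Map a model answer to the closest known vertical name, or None if nothing is close."""
    # Compare in lowercase, then return the name with its original capitalization
    lookup = {name.lower(): name for name in VERTICALS}
    hits = get_close_matches(answer.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

print(match_vertical("k-12 education"))
```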

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>Capabilities</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The following list of capabilities was used to analyze the potential and competencies of each company. The capabilities are grouped into main categories and sub-categories to support a comprehensive evaluation. The list is based on the HolonIQ framework, which can be found at <a href="https://www.digitalcapability.org/#wp-section" style="color: #1a73e8;">Digital Capability Framework</a>. The image below, sourced from HolonIQ, illustrates these capabilities.
</p>


![](holon_capabilities.png)

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The table below restates the capabilities from the image in text form, organized by main category and sub-category, so that they are easier to read and can be passed to the ChatGPT API.
</p>



<table>
  <thead>
    <tr>
      <th>Main Category</th>
      <th>Sub-Category</th>
      <th>Capability</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>PRODUCT STRATEGY</td>
      <td>MARKET INSIGHTS & TRENDS</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>PRODUCT STRATEGY</td>
      <td>UNDERSTAND CUSTOMER NEEDS</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>PRODUCT STRATEGY</td>
      <td>COMPETITORS & ALTERNATES</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>PRODUCT STRATEGY</td>
      <td>NEW BUSINESS MODELS</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>PRODUCT STRATEGY</td>
      <td>B2B RECRUITMENT & PARTNERSHIPS</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>MARKETING PROCESSES</td>
      <td>STUDENT RELATIONSHIP MANAGEMENT (CRM)</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>MARKETING PROCESSES</td>
      <td>COMMS & CAMPAIGN MANAGEMENT</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>MARKETING PROCESSES</td>
      <td>MARKETING AUTOMATION</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>MARKETING PROCESSES</td>
      <td>SOCIAL MEDIA & COMMUNITY MANAGEMENT</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>STUDENT RECRUITMENT</td>
      <td>RECRUITMENT EVENTS</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>STUDENT RECRUITMENT</td>
      <td>CHANNEL PARTNERSHIPS</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>STUDENT RECRUITMENT</td>
      <td>SCHOOLS & COMMUNITY OUTREACH</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>STUDENT RECRUITMENT</td>
      <td>SCHOLARSHIP PROGRAM</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>ENROLLMENT MANAGEMENT</td>
      <td>COURSE SELECTION & GUIDANCE</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>ENROLLMENT MANAGEMENT</td>
      <td>APPLICATION & ADMISSIONS</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>ENROLLMENT MANAGEMENT</td>
      <td>RECOGNIZING PRIOR LEARNING</td>
    </tr>
    <tr>
      <td><strong>DEMAND AND DISCOVERY</strong></td>
      <td>ENROLLMENT MANAGEMENT</td>
      <td>TUITION FINANCING</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>CURRICULUM DESIGN</td>
      <td>DIGITAL DESIGN PRINCIPLES</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>CURRICULUM DESIGN</td>
      <td>PROGRAM ARCHITECTURE</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>CURRICULUM DESIGN</td>
      <td>LEARNING ENVIRONMENTS & PLATFORMS</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>CURRICULUM DESIGN</td>
      <td>LEARNING DELIVERY MODELS</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>CURRICULUM DESIGN</td>
      <td>ACCREDITATION</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>CURRICULUM DESIGN</td>
      <td>CURRICULUM QUALITY MANAGEMENT</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>DIGITAL CONTENT & COURSEWARE</td>
      <td>DIGITAL CONTENT CREATION</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>DIGITAL CONTENT & COURSEWARE</td>
      <td>IMMERSION, SIMULATION & LAB</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>DIGITAL CONTENT & COURSEWARE</td>
      <td>OER & CONTENT LICENSING</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>DIGITAL CONTENT & COURSEWARE</td>
      <td>MANAGING INTEGRATED CONTENT</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>SUBJECT MATTER EXPERTISE</td>
      <td>DESIGNING FOR DIGITAL LEARNING</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>SUBJECT MATTER EXPERTISE</td>
      <td>FACULTY EXPERTISE & SPECIALISMS</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>SUBJECT MATTER EXPERTISE</td>
      <td>SOURCING & MANAGING EXPERTISE</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>SUBJECT MATTER EXPERTISE</td>
      <td>SPECIALIST INDUSTRY PARTNERS</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>TEACHING STRATEGIES</td>
      <td>LEARNER NEEDS & ANALYTICS</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>TEACHING STRATEGIES</td>
      <td>DESIGNING ASSESSMENT</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>TEACHING STRATEGIES</td>
      <td>EXPERIENTIAL LEARNING APPROACHES</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>TEACHING STRATEGIES</td>
      <td>DESIGNING GROUP WORK</td>
    </tr>
    <tr>
      <td><strong>LEARNING DESIGN</strong></td>
      <td>TEACHING STRATEGIES</td>
      <td>PERSONALIZED & ADAPTIVE LEARNING</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ACADEMIC ADMINISTRATION</td>
      <td>FACULTY PROFESSIONAL DEVELOPMENT</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ACADEMIC ADMINISTRATION</td>
      <td>FACULTY MANAGEMENT & SUPPORT</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ACADEMIC ADMINISTRATION</td>
      <td>TIMETABLING & SCHEDULE MANAGEMENT</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ACADEMIC ADMINISTRATION</td>
      <td>RETENTION & LEARNING SUPPORT</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ACADEMIC ADMINISTRATION</td>
      <td>REPORTING & REGULATORY COMPLIANCE</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ACADEMIC ADMINISTRATION</td>
      <td>LIBRARY SERVICES</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>LEARNING & ACADEMIC EXPERIENCE</td>
      <td>STUDENT PORTAL & LMS</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>LEARNING & ACADEMIC EXPERIENCE</td>
      <td>SYNCHRONOUS LEARNING EXPERIENCES</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>LEARNING & ACADEMIC EXPERIENCE</td>
      <td>ASYNCHRONOUS LEARNING EXPERIENCES</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>LEARNING & ACADEMIC EXPERIENCE</td>
      <td>VOICE, CHAT & INTERACTIVE LEARNING</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>LEARNING & ACADEMIC EXPERIENCE</td>
      <td>INDEPENDENT LEARNING RESOURCES</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>LEARNING & ACADEMIC EXPERIENCE</td>
      <td>EXCHANGE PROGRAMS</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>STUDENT LIFE</td>
      <td>ONBOARDING & ORIENTATION</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>STUDENT LIFE</td>
      <td>WELLBEING & MENTAL HEALTH</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>STUDENT LIFE</td>
      <td>STUDENT COMMUNITIES, CLUBS & SOCIETIES</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>STUDENT LIFE</td>
      <td>VOLUNTEERING & STUDENT LEADERSHIP</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>STUDENT LIFE</td>
      <td>STUDENT VOICE & SURVEYS</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>STUDENT LIFE</td>
      <td>GRADUATION & SUCCESS</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ASSESSMENT & VERIFICATION</td>
      <td>TESTS & EXAMS</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ASSESSMENT & VERIFICATION</td>
      <td>PORTFOLIOS & PRACTICAL</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ASSESSMENT & VERIFICATION</td>
      <td>ASSESSMENT FEEDBACK</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ASSESSMENT & VERIFICATION</td>
      <td>PEER & GROUP ASSESSMENT</td>
    </tr>
    <tr>
      <td><strong>LEARNER EXPERIENCE</strong></td>
      <td>ASSESSMENT & VERIFICATION</td>
      <td>BADGING & CREDENTIALING</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>WORK INTEGRATED LEARNING</td>
      <td>EMPLOYABILITY SKILLS BUILDING</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>WORK INTEGRATED LEARNING</td>
      <td>WORKPLACE SIMULATION & PROJECTS</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>WORK INTEGRATED LEARNING</td>
      <td>INTERNSHIPS & PLACEMENTS</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>WORK INTEGRATED LEARNING</td>
      <td>STUDENT WORK</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>WORK INTEGRATED LEARNING</td>
      <td>ENTREPRENEURSHIP & STARTUPS</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>CAREER PLANNING & PLACEMENT</td>
      <td>COMPETENCIES & SKILLS EVALUATION</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>CAREER PLANNING & PLACEMENT</td>
      <td>CAREER PLANNING SERVICES</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>CAREER PLANNING & PLACEMENT</td>
      <td>CAREER & RECRUITMENT EVENTS</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>CAREER PLANNING & PLACEMENT</td>
      <td>JOB APPLICATION SUPPORT</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>CAREER PLANNING & PLACEMENT</td>
      <td>JOB FINDING & GRADUATE PLACEMENT</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>INDUSTRY & BUSINESS ENGAGEMENT</td>
      <td>INDUSTRY COLLABS & PARTNERSHIPS</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>INDUSTRY & BUSINESS ENGAGEMENT</td>
      <td>PROFESSIONAL & INDUSTRY ASSOCIATIONS</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>INDUSTRY & BUSINESS ENGAGEMENT</td>
      <td>CUSTOMIZED PROGRAMS (B2B)</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>INDUSTRY & BUSINESS ENGAGEMENT</td>
      <td>EDUCATION AS EMPLOYEE BENEFIT</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>ALUMNI & CONTINUING EDUCATION</td>
      <td>CONTINUING EDUCATION</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>ALUMNI & CONTINUING EDUCATION</td>
      <td>INDUSTRY MENTORING & NETWORKS</td>
    </tr>
    <tr>
      <td><strong>WORK AND LIFELONG LEARNING</strong></td>
      <td>ALUMNI & CONTINUING EDUCATION</td>
      <td>ALUMNI ENGAGEMENT</td>
    </tr>
  </tbody>
</table>
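
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
As a rough illustration of how a table like this can be turned into prompt text, the sketch below flattens (main category, sub-category, capability) rows into a comma-separated list of capability names. The <code>TAXONOMY</code> sample (three rows copied from the table) and the <code>capability_prompt_list</code> helper are hypothetical names used only for this example.
</p>

```python
# Hypothetical sample: three rows copied from the capability table.
TAXONOMY = [
    ("DEMAND AND DISCOVERY", "PRODUCT STRATEGY", "MARKET INSIGHTS & TRENDS"),
    ("LEARNING DESIGN", "TEACHING STRATEGIES", "PERSONALIZED & ADAPTIVE LEARNING"),
    ("LEARNER EXPERIENCE", "ASSESSMENT & VERIFICATION", "BADGING & CREDENTIALING"),
]


def capability_prompt_list(rows):
    """Return the unique capability names, upper-cased, in table order, joined by commas."""
    seen = []
    for _main, _sub, capability in rows:
        name = capability.strip().upper()
        if name not in seen:
            seen.append(name)
    return ", ".join(seen)


print(capability_prompt_list(TAXONOMY))
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Keeping the names in table order, rather than in a set, makes the generated prompt text reproducible between runs.
</p>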



<div style="background-color:#f9f9f9; border-left:5px solid #ccc; padding:10px; margin:10px 0;">
  <p><strong>Note:</strong> To see the technical details of the process, click the button below.</p>
</div>


<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
  <!-- Include Prism.js CSS for the "Tomorrow" theme -->
  <link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/themes/prism-tomorrow.min.css" rel="stylesheet" />
</head>
<body>

<summary style="background-color:#f7ebfd; color:black; border-radius:18px; padding: 0.2em 0.5em; display: inline-block; cursor: pointer; text-align: left;">**Technical Information**</summary>
<div style="margin-top: 10px; text-align: left;">
  <p>This section explains the technical details of how we collected and processed the data. Here is an easy-to-understand breakdown:</p>
  <ol style="padding-left: 30px;">
    <li><strong>OpenAI API Interaction:</strong> The `query_openai_api` function handles the interaction with the OpenAI API, including retry logic to manage errors.</li>
    <li><strong>Role Categorization:</strong> The `categorize_potential_role` function determines the potential role of the company (client, partner, competitor, or provider) based on the API response.</li>
    <li><strong>Vertical Categorization:</strong> The `categorize_potential_vertical` function identifies the appropriate vertical for the company from a predefined list.</li>
    <li><strong>Capability Extraction and Validation:</strong> The `extract_and_validate_capabilities` function extracts and validates the capabilities of the company, ensuring exactly five unique capabilities.</li>
    <li><strong>Company Processing:</strong> The `process_company_column` function processes each company's data, constructs prompts, queries the API, and compiles the results.</li>
    <li><strong>Progress Saving:</strong> The `save_progress` function periodically saves the progress of the analysis to an Excel file.</li>
    <li><strong>Main Processing Loop:</strong> The script processes each company in the dataset, reads webpage content, handles errors, and saves the final results to an Excel file.</li>
  </ol>
  <p>This approach allowed us to gather comprehensive information about each company, which was then analyzed to identify potential business opportunities for EduNext.</p>
  <pre><code class="language-python">
import openai
import pandas as pd
import logging
import time
from random import random
import os

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

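# Read the OpenAI API key from the environment instead of hard-coding it
# (raises KeyError if OPENAI_API_KEY is not set)
openai.api_key = os.environ['OPENAI_API_KEY']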

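# Load the LinkedIn data.
# skiprows=range(1, 660) skips spreadsheet rows 1-659 (the first 659 companies);
# row 0 is kept as the header.
linkedIn_data = pd.read_excel('cleaned_company_information.xlsx', skiprows=range(1, 660))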

# Load the capabilities data
capabilities_data = pd.read_excel('Input/capabilities.xlsx')
valid_capabilities = set(capabilities_data['Capability'].str.upper().tolist())

# Load the role definitions
role_definitions = pd.read_excel('Input/company_profile.xlsx')

# Load the vertical data
vertical_data = pd.read_excel('Input/vertical.xlsx')

# Define valid roles and verticals
valid_roles = ['CLIENT', 'PARTNER', 'COMPETITOR', 'PROVIDER']
valid_verticals = vertical_data['Vertical'].tolist()

# Prepare the output DataFrame
output_columns = [
    'Company Name', 'URL (LinkedIn profile link)', 'Website', 'OpenedX Connection',
    'Main Product', 'Competitors', 'Customers', 'B2B or B2C',
    'Number of Customers', 'Country', 'Potential Role', 'Potential Vertical',
    'Capability 1', 'Capability 2', 'Capability 3', 'Capability 4', 'Capability 5'
]

output_data = pd.DataFrame(columns=output_columns)
backup_interval = 10  # Save progress every 10 companies

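# Function to query OpenAI API with retry logic (exponential backoff plus random jitter).
# Note: openai.ChatCompletion and openai.error belong to the pre-1.0 `openai` package interface.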
def query_openai_api(prompt, model="gpt-3.5-turbo", max_tokens=512, retries=5):
    for i in range(retries):
        try:
            response = openai.ChatCompletion.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are an assistant that helps analyze company data."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=max_tokens,
                temperature=0.7
            )
            return response.choices[0].message['content'].strip()
        except openai.error.OpenAIError as e:
            logging.error(f"APIError encountered: {e}. Retrying {i + 1}/{retries}...")
            time.sleep((2 ** i) + random())
    return "API request failed."

# Function to categorize potential role
def categorize_potential_role(response):
    response = response.lower()
    if "client" in response:
        return "CLIENT"
    if "partner" in response:
        return "PARTNER"
    if "competitor" in response:
        return "COMPETITOR"
    if "provider" in response:
        return "PROVIDER"
    return ""

# Function to categorize potential vertical
def categorize_potential_vertical(response):
    verticals = vertical_data['Vertical'].tolist()
    for vertical in verticals:
        if vertical.lower() in response.lower():
            return vertical
    return ""

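# Function to ensure the capabilities list has exactly 5 unique entries.
# Note: missing slots are filled from the unused valid capabilities; because
# valid_capabilities is a set, which one is picked is arbitrary and not model-derived.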
def ensure_unique_capabilities(capabilities):
    unique_capabilities = list(dict.fromkeys(capabilities))  # Remove duplicates while preserving order
    while len(unique_capabilities) < 5:
        remaining_capabilities = list(valid_capabilities - set(unique_capabilities))
        if remaining_capabilities:
            unique_capabilities.append(remaining_capabilities[0])  # Add a remaining valid capability
        else:
            unique_capabilities.append('')  # If no valid capabilities are left, append an empty string
    return unique_capabilities[:5]

# Function to extract and validate capabilities
def extract_and_validate_capabilities(response, company_name):
    capabilities = []
    logging.debug(f"Raw API capabilities response for {company_name}: {response}")
    # Check longer names first so that "CAREER & RECRUITMENT EVENTS" is not
    # mistaken for its substring "RECRUITMENT EVENTS"
    ordered_capabilities = sorted(valid_capabilities, key=len, reverse=True)
    for line in response.split('\n'):
        for capability in ordered_capabilities:
            if capability.lower() in line.lower():
                # Keep at most one capability per line, skipping repeats
                if capability not in capabilities:
                    capabilities.append(capability)
                break
    validated_capabilities = ensure_unique_capabilities(capabilities)
    logging.debug(f"Validated capabilities for {company_name}: {validated_capabilities}")
    return validated_capabilities

# Function to process each company and column
def process_company_column(company_info, webpage_content):
    base_prompt = f"Company LinkedIn Info:\n{company_info}\n\nCompany Webpage Content:\n{webpage_content}\n\n"
    capabilities_list = ', '.join(valid_capabilities)

    # Constructing more detailed prompts with additional context
    prompts = {
        'OpenedX Connection': base_prompt + "Taking into account the information provided and your knowledge about the company, does the company have any connection or use the Open edX platform?",
        'Main Product': base_prompt + "Taking into account the information provided and your knowledge about the company, what is the main product or service the company offers?",
        'Competitors': base_prompt + "Taking into account the information provided and your knowledge about the company, list three main competitors of the company including their names and websites.",
        'Customers': base_prompt + "Taking into account the information provided and your knowledge about the company, list three main customers of the company.",
        'B2B or B2C': base_prompt + "Taking into account the information provided and your knowledge about the company, is the company primarily focused on B2B (business-to-business) or B2C (business-to-consumer) operations?",
        'Number of Customers': base_prompt + "Taking into account the information provided and your knowledge about the company, estimate the number of customers the company has.",
        'Country': base_prompt + "Taking into account the information provided and your knowledge about the company, in which country does the company primarily operate?",
        'Potential Role': base_prompt + "Taking into account the information provided and your knowledge about the company, determine if the company is a potential client, partner, competitor, or provider for Edunext.",
        'Potential Vertical': base_prompt + "Taking into account the information provided and your knowledge about the company, determine the potential vertical for the company from the following list:\n" +
            '\n'.join([f"{row['Vertical']} - {row['Definition']}" for _, row in vertical_data.iterrows()]) + "\nWhat is the potential vertical of this company?",
        'Capability 1': base_prompt + f"Based on the provided information and any additional research, what is the primary capability of the company? Choose from the following: {capabilities_list}",
        'Capability 2': base_prompt + f"Considering the capabilities already mentioned, what is the second main capability of the company? Choose from the following: {capabilities_list}",
        'Capability 3': base_prompt + f"Considering the capabilities already mentioned, what is the third main capability of the company? Choose from the following: {capabilities_list}",
        'Capability 4': base_prompt + f"Considering the capabilities already mentioned, what is the fourth main capability of the company? Choose from the following: {capabilities_list}",
        'Capability 5': base_prompt + f"Considering the capabilities already mentioned, what is the fifth main capability of the company? Choose from the following: {capabilities_list}"
    }
    
    responses = {key: query_openai_api(prompt) for key, prompt in prompts.items()}

    # Log raw responses for debugging
    logging.debug(f"Raw API response for company {company_info['Company Name']}: {responses}")

    # Extract and validate capabilities
    capabilities = extract_and_validate_capabilities("\n".join([responses.get(f'Capability {i}', '') for i in range(1, 6)]), company_info['Company Name'])

    output_row = {
        'Company Name': company_info['Company Name'],
        'URL (LinkedIn profile link)': company_info['URL'],
        'Website': company_info['Website'],
        'OpenedX Connection': responses['OpenedX Connection'],
        'Main Product': responses['Main Product'],
        'Competitors': responses['Competitors'],
        'Customers': responses['Customers'],
        'B2B or B2C': "B2B" if "B2B" in responses['B2B or B2C'] else "B2C",
        'Number of Customers': responses['Number of Customers'],
        'Country': responses['Country'].strip(),
        'Potential Role': categorize_potential_role(responses['Potential Role']),
        'Potential Vertical': categorize_potential_vertical(responses['Potential Vertical']),
        'Capability 1': capabilities[0],
        'Capability 2': capabilities[1],
        'Capability 3': capabilities[2],
        'Capability 4': capabilities[3],
        'Capability 5': capabilities[4]
    }

    return output_row

# Function to save progress periodically
def save_progress(data, filename='company_analysis_backup.xlsx'):
    df = pd.DataFrame(data, columns=output_columns)
    df.to_excel(filename, index=False)
    logging.info(f"Progress saved to {filename}")

# Process companies sequentially with logging
results = []
for idx, row in linkedIn_data.iterrows():
    company_info = row.to_dict()
    company_name = company_info['Company Name']
    logging.info(f"Processing company: {company_name}")

    try:
        # Try to read the webpage content for the company, handle missing or invalid files
        webpage_content = ""
        filepath = os.path.join('webpages', f"{company_name}.txt")
        try:
            with open(filepath, 'r', encoding='utf-8') as file:
                webpage_content = file.read()
        except (FileNotFoundError, OSError) as e:
            logging.warning(f"Could not read file for {company_name}: {e}")
        
        output_row = process_company_column(company_info, webpage_content)
        results.append(output_row)
        logging.info(f"Completed processing for {company_name} ({idx + 1}/{len(linkedIn_data)})")
        
        # Save progress periodically
        if (idx + 1) % backup_interval == 0:
            save_progress(results)

    except Exception as e:
        logging.error(f"Error processing company {company_name}: {e}")

# Save final results
output_data = pd.DataFrame(results, columns=output_columns)
output_data.to_excel('company_analysis.xlsx', index=False)
logging.info("All companies processed and saved to company_analysis.xlsx")
  </code></pre>
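
  <p>The keyword check in <code>categorize_potential_role</code> tests for "client" first, so a free-text reply that mentions several roles is always mapped to the first keyword in that fixed order. One possible alternative, shown here only as a sketch, is to ask the model for a one-word answer and accept the reply only when it is exactly one valid label. The names <code>ROLE_INSTRUCTION</code> and <code>parse_role</code> are hypothetical and are not part of the script above.</p>

```python
import re

VALID_ROLES = ("CLIENT", "PARTNER", "COMPETITOR", "PROVIDER")

# Hypothetical instruction that could be appended to the 'Potential Role' prompt.
ROLE_INSTRUCTION = "Answer with exactly one word: " + ", ".join(VALID_ROLES) + "."


def parse_role(reply):
    """Return the role only if the reply is a single valid label, else ''."""
    words = re.findall(r"[A-Za-z]+", reply)
    if len(words) == 1 and words[0].upper() in VALID_ROLES:
        return words[0].upper()
    return ""


print(repr(parse_role("Competitor.")))
print(repr(parse_role("Not a client, rather a competitor.")))
```

  <p>Ambiguous replies then produce an empty value that can be reviewed manually, instead of a label chosen by keyword order.</p>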

</div>
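
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The "Capability 2" to "Capability 5" prompts refer to "the capabilities already mentioned", but each prompt is sent as an independent request, so the model receives no record of its earlier answers. A minimal sketch of one way to supply that context is shown below: earlier picks are written into each later prompt and removed from the options. The <code>ask</code> parameter stands in for the API call, and <code>pick_capabilities</code> and <code>fake_ask</code> are hypothetical names introduced for this example.
</p>

```python
def pick_capabilities(ask, options, count=5):
    """Ask for `count` capabilities one at a time, excluding earlier picks."""
    chosen = []
    for _ in range(count):
        remaining = [o for o in options if o not in chosen]
        if not remaining:
            break
        prompt = (
            "Already selected: " + (", ".join(chosen) or "none") + ". "
            "Choose the next most relevant capability from: " + ", ".join(remaining)
        )
        answer = ask(prompt).strip().upper()
        if answer in remaining:  # accept only an exact, unused option
            chosen.append(answer)
    return chosen


# Stand-in for the API call: always returns the last option offered in the prompt.
def fake_ask(prompt):
    return prompt.split("from: ", 1)[1].rsplit(", ", 1)[-1]


options = ["TESTS & EXAMS", "LIBRARY SERVICES", "ALUMNI ENGAGEMENT"]
print(pick_capabilities(fake_ask, options, count=2))
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
With this approach a slot is left unfilled when the reply is not an exact, unused option, so no capability is added that the model did not select.
</p>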

<!-- Include Prism.js -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.25.0/components/prism-python.min.js"></script>
</body>
</html>

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; margin-bottom: 20px;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;"><b> Company Information Output</b></span>
</p>

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The fourth output, "Company Information Output", contains detailed information about each analyzed company. Below is an example record:
</p>

<table style="width:100%; border: 1px solid #ddd; border-collapse: collapse; text-align: left;">
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Company Name</td>
    <td style="border: 1px solid #ddd; padding: 8px;">byteXL</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">URL (LinkedIn profile link)</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://in.linkedin.com/company/bytexl" target="_blank">https://in.linkedin.com/company/bytexl</a></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Website</td>
    <td style="border: 1px solid #ddd; padding: 8px;"><a href="https://bytexl.com" target="_blank">https://bytexl.com</a></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">OpenedX Connection</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Based on the information provided, there is no direct mention of the company byteXL using or having a connection with the Open edX platform. The company's focus seems to be on transforming engineering colleges in India through their own integrated college transformation model and proprietary platform, rather than utilizing external platforms like Open edX.</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Main Product</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Based on the information provided from the company's LinkedIn profile and webpage content, the main product or service offered by byteXL is an experiential learning online platform for IT programming aspirants. This platform integrates curriculum, content, and practical learning to enhance students' skills and awareness on employability. The company partners with colleges to transform their teaching methodology and learning pedagogy to increase the employability quotient of their students. The platform includes features such as academic & skilling content, online editor, student reports, dashboards on individual college performance, and coding challenges.</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Competitors</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Based on the information provided about byteXL, three main competitors of the company in the E-learning industry could be:<br><br>1. Company Name: UpGrad<br>Website: <a href="https://www.upgrad.com/" target="_blank">https://www.upgrad.com/</a><br><br>2. Company Name: Simplilearn<br>Website: <a href="https://www.simplilearn.com/" target="_blank">https://www.simplilearn.com/</a><br><br>3. Company Name: Coursera<br>Website: <a href="https://www.coursera.org/" target="_blank">https://www.coursera.org/</a></td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Customers</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Based on the information provided, three main customers of byteXL are:<br><br>1. Tejashri Student from Malineni Lakshmaiah Women’s Engineering College<br><br>2. A. Sanyasirao, Head of Electronics & Communication Engineering Department at Christu Jyothi Institute of Technology & Science<br><br>3. Kalyani, a student of Malineni Lakshmaiah Women's Engineering College<br><br>These customers have shared their positive experiences with byteXL's teaching methodology and its impact on their academic journey and career prospects.</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">B2B or B2C</td>
    <td style="border: 1px solid #ddd; padding: 8px;">B2B</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Number of Customers</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Based on the content provided, it seems like byteXL has several customers who are students and colleges across India. The webpage content mentions testimonials and success stories from students and faculty at various engineering colleges who have benefited from byteXL's training and mentorship programs.<br><br>While the exact number of customers is not explicitly mentioned in the data provided, we can infer that the company has a significant customer base across multiple colleges and individual students. The testimonials and success stories highlight the impact byteXL has had on students' academic journeys and career paths, indicating a wide reach and positive reputation among customers.<br><br>Therefore, based on the information available, we can estimate that byteXL likely has hundreds or even thousands of customers consisting of both colleges and individual students who have engaged with their educational programs and services.</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Country</td>
    <td style="border: 1px solid #ddd; padding: 8px;">The company, byteXL, primarily operates in India. This is indicated by the company's headquarters being in Hyderabad, India, the focus on transforming engineering colleges in India, partnerships with Indian colleges such as Malineni Lakshmaiah Women’s Engineering College and Christu Jyothi Institute of Technology & Science, and the testimonials and success stories from Indian students and educational institutions. The company's efforts and impact are centered around the Indian education system and industry, supporting the employability and skills development of Indian engineering students.</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Country Only</td>
    <td style="border: 1px solid #ddd; padding: 8px;">India</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Potential Role</td>
    <td style="border: 1px solid #ddd; padding: 8px;">CLIENT</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Potential Vertical</td>
    <td style="border: 1px solid #ddd; padding: 8px;">Higher Education</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Capability 1</td>
    <td style="border: 1px solid #ddd; padding: 8px;">PERSONALIZED & ADAPTIVE LEARNING</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Capability 2</td>
    <td style="border: 1px solid #ddd; padding: 8px;">WORKPLACE SIMULATION & PROJECTS</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Capability 3</td>
    <td style="border: 1px solid #ddd; padding: 8px;">IMMERSION, SIMULATION & LAB</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Capability 4</td>
    <td style="border: 1px solid #ddd; padding: 8px;">VOLUNTEERING & STUDENT LEADERSHIP</td>
  </tr>
  <tr>
    <td style="border: 1px solid #ddd; padding: 8px; font-weight: bold; background-color: #f0f0f0;">Capability 5</td>
    <td style="border: 1px solid #ddd; padding: 8px;">DIGITAL DESIGN PRINCIPLES</td>
  </tr>
</table>

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
For more detailed information, you can refer to the complete dataset <a href="https://docs.google.com/spreadsheets/d/1UtK3tMqmukREBoPvLxICo9zt1rizbtb41Q_rehTh4Uo/edit?gid=1591638440#gid=1591638440" target="_blank">here</a>.
</p>


<a id="analysis-results"></a>

***
<p style="font-family: Arial, sans-serif; font-size: 22px; color: #333333; line-height: 1.5; text-align: center; margin-bottom: 20px;">
  <b><span style="color: #9381ff; font-size: 26px;"> |</span> Analysis and Results</b>
</p>


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The analysis begins with geographic insights drawn from the collected data. The first part covers the number of startups by country, and the second covers the amount of investment. Each company's location is the country listed on its LinkedIn profile, and the investment figures combine LinkedIn information with Crunchbase. Not every company has this information available, and not every investment was made in the current or previous year. The date of each company's investment can be checked here:
<a href="https://docs.google.com/spreadsheets/d/1UtK3tMqmukREBoPvLxICo9zt1rizbtb41Q_rehTh4Uo/edit?gid=1045195882#gid=1045195882" target="_blank">Investment Date Verification</a>.
</p>


####  {.tabset}


##### Number of startups


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The first map shows the number of startups per country, based on the location listed on LinkedIn. A gradient color scale encodes the count: lighter shades of blue indicate fewer startups and darker shades indicate more. Countries with no startups in the dataset appear in grey.
</p>


```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(tidyr)
library(DT)
library(countrycode)

# Load the data
data <- read_excel("linkedin_info.xlsx")

# Count the number of startups by country
startup_counts <- data %>%
  group_by(Country) %>%
  summarise(Count = n())

# Convert ISO country codes to country names
startup_counts$Country <- countrycode(startup_counts$Country, "iso2c", "country.name")
startup_counts$Country[is.na(startup_counts$Country)] <- "Unknown"

# Load geographical data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Join startup data with geographical data
map_data <- left_join(world, startup_counts, by = c("name_long" = "Country"))

# Create labels for tooltips
map_data$tooltip <- with(map_data, paste(name_long, "<br>", "Number of Startups: ", Count))

# Create the map with ggplot2
p <- ggplot(data = map_data) +
  geom_sf(aes(fill = Count, text = tooltip), color = "white") +
  scale_fill_gradientn(
    colors = c("#deebf7", "#9ecae1", "#3182bd", "#08519c"),
    na.value = "grey50",
    name = "Number of Startups"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 16, hjust = 0.5),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  ) +
  labs(
    title = "Number of Startups by Country",
    subtitle = "Source: LinkedIn",
    fill = "Number of Startups"
  )

# Make the map interactive with plotly.
# Width and height are set here because passing them to layout() is deprecated.
p_interactive <- ggplotly(p, tooltip = "text", width = 1000, height = 500)

# Display the interactive map
p_interactive

```
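
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The map above joins the startup counts to the map layer by country name, so a country whose converted name is spelled differently in the map data drops out of the join and is drawn in grey. A minimal base R sketch of a check for such mismatches follows. The two name vectors are made-up illustrations, not values taken from <code>countrycode</code> or <code>rnaturalearth</code>.
</p>

```r
# Hypothetical country names produced from the startup data
startup_countries <- c("India", "United States", "Russia", "South Korea")

# Hypothetical country names available in the map layer
map_countries <- c("India", "United States", "Russian Federation", "Republic of Korea")

# Names present in the startup data but absent from the map layer
unmatched <- setdiff(startup_countries, map_countries)
print(unmatched)
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
In the report's pipeline the same comparison would be made between <code>startup_counts$Country</code> and <code>world$name_long</code>. An empty result means every country was matched.
</p>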

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The second map shows the number of startups by country, grouped by estimated company size. Size is derived from the employee count on LinkedIn: startups are classified as Small, Medium, Large, or Very Large according to the employee quartiles defined in the table below, and as Unknown when the employee count is missing. Each bubble marks one size category within a country: its size reflects the number of startups in that category, and its color identifies the category (yellow for Small, orange for Medium, red for Large, dark red for Very Large, and grey for Unknown). Interactive tooltips show the country name, estimated size, and startup count. The map shows where startups of each size are concentrated.
</p>

<table style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; border-collapse: collapse; width: 100%;">
  <tr style="border-bottom: 1px solid #ddd;">
    <th style="padding: 8px; text-align: left;">Estimated Size</th>
    <th style="padding: 8px; text-align: left;">Definition</th>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Small</td>
    <td style="padding: 8px;">Employees ≤ 1st quartile</td>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Medium</td>
    <td style="padding: 8px;">1st quartile &lt; Employees ≤ 2nd quartile</td>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Large</td>
    <td style="padding: 8px;">2nd quartile &lt; Employees ≤ 3rd quartile</td>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Very Large</td>
    <td style="padding: 8px;">Employees > 3rd quartile</td>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Unknown</td>
    <td style="padding: 8px;">Missing employee data</td>
  </tr>
</table>
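
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The quartile rules in the table can be sketched in a few lines of base R. This is an illustrative alternative built on <code>cut()</code>, not the code that produces the map. The function name <code>classify_by_quartile</code> and the employee counts are made up for the example.
</p>

```r
# Classify a numeric vector into quartile-based size categories.
# Assumes the three quartiles are distinct; cut() stops with an error
# if two break points coincide.
classify_by_quartile <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
  size <- cut(
    x,
    breaks = c(-Inf, q, Inf),          # intervals are closed on the right: (a, b]
    labels = c("Small", "Medium", "Large", "Very Large")
  )
  size <- as.character(size)
  size[is.na(size)] <- "Unknown"       # missing employee data
  size
}

# Made-up employee counts, including one missing value
employees <- c(5, 12, 40, 150, NA, 900, 60, 8)
sizes <- classify_by_quartile(employees)
print(data.frame(employees, sizes))
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Because the intervals are closed on the right, a value exactly equal to a quartile falls into the lower category, which matches the "≤" rules in the table.
</p>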


```{r , echo=FALSE}

# Load the data
data <- read_excel("linkedin_info.xlsx")

# Calculate quartiles for the number of employees
quartiles <- quantile(data$`Number of Employees`, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Function to estimate size based on quartiles
estimate_size <- function(num_employees) {
  if (is.na(num_employees)) {
    return("Unknown")
  } else if (num_employees <= quartiles[1]) {
    return("Small")
  } else if (num_employees <= quartiles[2]) {
    return("Medium")
  } else if (num_employees <= quartiles[3]) {
    return("Large")
  } else {
    return("Very Large")
  }
}

# Apply the function to estimate size
data$Estimated_Size <- sapply(data$`Number of Employees`, estimate_size)

# Aggregate data by country and estimated size
startup_counts <- data %>%
  group_by(Country, Estimated_Size) %>%
  summarise(Count = n(), .groups = 'drop')

# Convert ISO country codes to country names
startup_counts$Country <- countrycode(startup_counts$Country, "iso2c", "country.name")
startup_counts$Country[is.na(startup_counts$Country)] <- "Unknown"

# Load geographical data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Extract centroid coordinates for bubble positions
world_centroids <- st_centroid(world)
bubble_data <- left_join(world_centroids, startup_counts, by = c("name_long" = "Country"))

# Filter out rows with NA values
bubble_data <- bubble_data[!is.na(bubble_data$Count), ]

# Convert centroids to data frame for ggplot2
bubble_data_df <- as.data.frame(bubble_data)
bubble_data_df <- cbind(bubble_data_df, st_coordinates(bubble_data))

# Create the map with ggplot2
p <- ggplot(data = world) +
  geom_sf(fill = "grey90", color = "white") +
  geom_point(data = bubble_data_df, aes(x = X, y = Y, size = Count, color = Estimated_Size, text = paste(name_long, "<br>", "Estimated Size: ", Estimated_Size, "<br>", "Count: ", Count)), inherit.aes = FALSE) +
  scale_size_continuous(name = "Number of Startups") +
  scale_color_manual(
    values = c("Small" = "yellow", "Medium" = "orange", "Large" = "red", "Very Large" = "darkred", "Unknown" = "grey"),
    name = "Estimated Size"
  ) +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 16, hjust = 0.5),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  ) +
  labs(
    title = "Number of Startups by Country and Estimated Size",
    subtitle = "Source: LinkedIn",
    fill = "Number of Startups"
  )

# Make the map interactive with plotly.
# Width and height are set here because passing them to layout() is deprecated.
p_interactive <- ggplotly(p, tooltip = "text", width = 1000, height = 500)

# Add a centered title and vertical margins
p_interactive <- p_interactive %>%
  layout(
    title = list(text = 'Number of Startups by Country and Estimated Size', x = 0.5, y = 0.95, font = list(size = 20)),
    margin = list(t = 50, b = 50)
  )

# Display the interactive map
p_interactive


```


```{r , echo=FALSE}

library(readxl)
library(dplyr)
library(tidyr)
library(countrycode)
library(knitr)
library(kableExtra)
library(scales)

# Load the data
data <- read_excel("linkedin_info.xlsx")

# Calculate quartiles for the number of employees
quartiles <- quantile(data$`Number of Employees`, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Function to estimate size based on quartiles
estimate_size <- function(num_employees) {
  if (is.na(num_employees)) {
    return("Unknown")
  } else if (num_employees <= quartiles[1]) {
    return("Small")
  } else if (num_employees <= quartiles[2]) {
    return("Medium")
  } else if (num_employees <= quartiles[3]) {
    return("Large")
  } else {
    return("Very Large")
  }
}

# Apply the function to estimate size
data$Estimated_Size <- sapply(data$`Number of Employees`, estimate_size)

# Aggregate data by country and estimated size
startup_counts <- data %>%
  group_by(Country, Estimated_Size) %>%
  summarise(Count = n(), .groups = 'drop')

# Convert ISO country codes to country names
startup_counts$Country <- countrycode(startup_counts$Country, "iso2c", "country.name")
startup_counts$Country[is.na(startup_counts$Country)] <- "Unknown"

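# Spread the data to have estimated sizes as columns.
# The size columns are fixed to the order used in the table header below.
# names_expand = TRUE (tidyr >= 1.2.0) also creates a zero-filled column for
# any size category that does not occur in the data.
size_levels <- c("Small", "Medium", "Large", "Very Large", "Unknown")
table_data <- startup_counts %>%
  # Merge rows that collapsed into the same "Unknown" country label
  group_by(Country, Estimated_Size) %>%
  summarise(Count = sum(Count), .groups = 'drop') %>%
  mutate(Estimated_Size = factor(Estimated_Size, levels = size_levels)) %>%
  pivot_wider(names_from = Estimated_Size, values_from = Count, values_fill = 0, names_expand = TRUE) %>%
  mutate(Total = rowSums(across(all_of(size_levels)))) %>%
  arrange(desc(Total)) # Sort by Total in descending order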

# Custom color scale function for the Total column
color_scale_total <- function(x) {
  col <- scales::col_numeric(palette = c("#deebf7", "#3182bd"), domain = range(x, na.rm = TRUE))(x)
  return(col)
}

# Apply the color scale to the Total column
table_data$Total_color <- color_scale_total(table_data$Total)

# Create the table with kable
kable(table_data %>% select(-Total_color), col.names = c("Country", "Small", "Medium", "Large", "Very Large", "Unknown", "Total"), align = 'c') %>%
  add_header_above(c("Size based on number of employees" = 7)) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(7, background = table_data$Total_color, bold = TRUE) %>%
  scroll_box(height = "500px", width = "100%")

```



##### Amount of investments

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This map sums the last-round investment amounts of the startups based in each country, using the country listed on LinkedIn. Only each startup's most recent round is included, so the totals do not represent lifetime funding. Each country is colored by its total, and interactive tooltips show the exact amount.
</p>
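
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The last-round amounts are stored as text and have to be converted to numbers before they can be summed. The sketch below shows one way to do that in base R, assuming values shaped like <code>US$ 1.5M</code> or <code>US$ 250K</code>. The helper name <code>parse_round_amount</code>, the sample strings, and the handling of a <code>B</code> (billions) suffix are assumptions made for illustration.
</p>

```r
# Convert amount strings such as "US$ 1.5M" to numeric dollars.
# Anything that cannot be parsed becomes NA.
parse_round_amount <- function(x) {
  cleaned <- gsub("US\\$|,|\\s", "", x)   # drop currency prefix, commas, spaces
  number  <- suppressWarnings(as.numeric(sub("[KMB]$", "", cleaned)))
  suffix  <- ifelse(grepl("[KMB]$", cleaned),
                    sub("^.*([KMB])$", "\\1", cleaned),
                    "")
  multiplier <- ifelse(suffix == "K", 1e3,
                ifelse(suffix == "M", 1e6,
                ifelse(suffix == "B", 1e9, 1)))
  number * multiplier
}

# Made-up sample values, including an unparseable label and a missing value
raw_amounts <- c("US$ 1.5M", "US$ 250K", "US$ 2B", "1200",
                 "Last round amount not found", NA)
amounts <- parse_round_amount(raw_amounts)
print(amounts)
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Returning <code>NA</code> for unparseable values keeps them out of the totals when the sum is computed with <code>na.rm = TRUE</code>.
</p>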

```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(tidyr)
library(countrycode)

# Load the data
data <- read_excel("linkedin_info.xlsx")

# Preprocess the Crunchbase Last Round Amount column
data <- data %>%
  filter(!grepl("Last round amount not found", `Crunchbase Last Round Amount`)) %>%
  mutate(
    `Crunchbase Last Round Amount` = gsub("US\\$ ", "", `Crunchbase Last Round Amount`),
    `Crunchbase Last Round Amount` = ifelse(grepl("M", `Crunchbase Last Round Amount`), 
                                            as.numeric(sub("M", "", `Crunchbase Last Round Amount`)) * 1e6, 
                                            ifelse(grepl("K", `Crunchbase Last Round Amount`), 
                                                   as.numeric(sub("K", "", `Crunchbase Last Round Amount`)) * 1e3, 
                                                   as.numeric(`Crunchbase Last Round Amount`)))
  )

# Summarize investment amounts by country
investment_by_country <- data %>%
  group_by(Country) %>%
  summarise(Total_Investment = sum(`Crunchbase Last Round Amount`, na.rm = TRUE))

# Convert ISO country codes to country names
investment_by_country$Country <- countrycode(investment_by_country$Country, "iso2c", "country.name")
investment_by_country$Country[is.na(investment_by_country$Country)] <- "Unknown"

# Load geographical data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Join investment data with geographical data
map_data <- left_join(world, investment_by_country, by = c("name_long" = "Country"))

# Create labels for tooltips
map_data$tooltip <- with(map_data, paste(name_long, "<br>", "Total Investment: $", format(Total_Investment, big.mark = ",")))

# Create the map with ggplot2
p <- ggplot(data = map_data) +
  geom_sf(aes(fill = Total_Investment, text = tooltip), color = "white") +
  scale_fill_gradientn(
    colors = c("#d73027", "#fee08b", "#1a9850"),
    na.value = "grey50",
    name = "Total Investment",
    labels = scales::dollar_format(scale = 1e-9, suffix = "B")
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 16, hjust = 0.5),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  ) +
  labs(
    title = "Total Investment in Startups by Country",
    subtitle = "Source: LinkedIn",
    fill = "Total Investment"
  )

# Make the map interactive with plotly.
# Width and height are set here because passing them to layout() is deprecated.
p_interactive <- ggplotly(p, tooltip = "text", width = 1000, height = 500)

# Display the interactive map
p_interactive

```

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This interactive map shows the median last-round investment per country, based on data from LinkedIn. The median is used instead of the mean because it is less sensitive to outliers. China is drawn in purple and kept off the color scale because it is an outlier: only one company there reported its last investment round, and that round was exceptionally large. For the remaining countries, the red-yellow-green gradient runs from low to high median investment, expressed in millions of dollars.
</p>
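
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The effect of an outlier on the mean, and the weakness of a median that rests on a single company, can be seen in a small base R sketch. All amounts below are made up for illustration and are not taken from the dataset.
</p>

```r
# Made-up last-round amounts (USD) for three hypothetical countries
rounds <- data.frame(
  Country = c("IN", "IN", "IN", "US", "US", "CN"),
  Amount  = c(2e6, 3e6, 40e6, 5e6, 7e6, 900e6)
)

# The mean is pulled up by the single 40M round; the median is not
mean_in   <- mean(rounds$Amount[rounds$Country == "IN"])
median_in <- median(rounds$Amount[rounds$Country == "IN"])

# Median and number of observations per country
median_by_country <- tapply(rounds$Amount, rounds$Country, median)
n_by_country      <- tapply(rounds$Amount, rounds$Country, length)

# Countries whose median rests on a single company
single_obs <- names(n_by_country)[n_by_country == 1]

print(c(mean_IN = mean_in, median_IN = median_in))
print(single_obs)
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Reporting the number of observations next to each median is one way to flag countries where the figure reflects a single round.
</p>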


```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(tidyr)
library(countrycode)
library(scales)

# Load the data
data <- read_excel("linkedin_info.xlsx")

# Preprocess the Crunchbase Last Round Amount column
data <- data %>%
  filter(!grepl("Last round amount not found", `Crunchbase Last Round Amount`)) %>%
  mutate(
    `Crunchbase Last Round Amount` = gsub("US\\$ ", "", `Crunchbase Last Round Amount`),
    `Crunchbase Last Round Amount` = ifelse(grepl("M", `Crunchbase Last Round Amount`), 
                                            as.numeric(sub("M", "", `Crunchbase Last Round Amount`)) * 1e6, 
                                            ifelse(grepl("K", `Crunchbase Last Round Amount`), 
                                                   as.numeric(sub("K", "", `Crunchbase Last Round Amount`)) * 1e3, 
                                                   as.numeric(`Crunchbase Last Round Amount`)))
  )

# Summarize investment amounts by country using the median.
# China needs no special handling here: a per-country median is unaffected by
# other countries. It is only kept off the color scale when the map is drawn.
investment_by_country <- data %>%
  group_by(Country) %>%
  summarise(Median_Investment = median(`Crunchbase Last Round Amount`, na.rm = TRUE))

# Convert ISO country codes to country names
investment_by_country$Country <- countrycode(investment_by_country$Country, "iso2c", "country.name")
investment_by_country$Country[is.na(investment_by_country$Country)] <- "Unknown"

# Load geographical data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Join investment data with geographical data
map_data <- left_join(world, investment_by_country, by = c("name_long" = "Country"))

# Separate China from the rest of the countries
map_data$color <- ifelse(map_data$name_long == "China", "purple", "default")

# Create labels for tooltips
map_data$tooltip <- with(map_data, paste(name_long, "<br>", "Median Investment: $", number(Median_Investment, accuracy = 1, big.mark = ",")))

# Create the map with ggplot2
p <- ggplot(data = map_data) +
  geom_sf(aes(fill = ifelse(color == "default", Median_Investment, NA), text = tooltip), color = "white") +
  geom_sf(data = subset(map_data, name_long == "China"), fill = "purple", color = "white", aes(text = tooltip)) +
  scale_fill_gradientn(
    colors = c("#d73027", "#fee08b", "#1a9850"),
    na.value = "grey50",
    name = "Median Investment",
    labels = scales::dollar_format(scale = 1e-6, suffix = "M")
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 16, hjust = 0.5),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  ) +
  labs(
    title = "Median Investment in Startups by Country",
    subtitle = "Source: LinkedIn",
    fill = "Median Investment"
  )

# Make the map interactive with plotly.
# Width and height are set here because passing them to layout() is deprecated.
p_interactive <- ggplotly(p, tooltip = "text", width = 1000, height = 500)

# Display the interactive map
p_interactive


```

*****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This map shows the number of startups by country, grouped by the size of their last investment round, based on data from LinkedIn. Rounds are classified as Small, Medium, Large, or Very Large according to the quartiles of the last-round amounts, as defined in the table below. Each bubble marks one investment-size category within a country: its size reflects the number of startups in that category, and its color identifies the category (light green for Small, green for Medium, dark green for Large, forest green for Very Large, and grey for Unknown). Interactive tooltips show the country name, investment size, and number of startups. The map shows how startups at each funding level are distributed across countries.
</p>

<table style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; border-collapse: collapse; width: 100%;">
  <tr style="border-bottom: 1px solid #ddd;">
    <th style="padding: 8px; text-align: left;">Estimated Investment Size</th>
    <th style="padding: 8px; text-align: left;">Definition</th>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Small</td>
    <td style="padding: 8px;">Investment ≤ 1st quartile</td>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Medium</td>
    <td style="padding: 8px;">1st quartile &lt; Investment ≤ 2nd quartile</td>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Large</td>
    <td style="padding: 8px;">2nd quartile &lt; Investment ≤ 3rd quartile</td>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Very Large</td>
    <td style="padding: 8px;">Investment > 3rd quartile</td>
  </tr>
  <tr style="border-bottom: 1px solid #ddd;">
    <td style="padding: 8px;">Unknown</td>
    <td style="padding: 8px;">Missing investment data</td>
  </tr>
</table>


```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(tidyr)
library(countrycode)

# Load the data
data <- read_excel("linkedin_info.xlsx")

# Preprocess the Crunchbase Last Round Amount column
data <- data %>%
  filter(!grepl("Last round amount not found", `Crunchbase Last Round Amount`)) %>%
  mutate(
    `Crunchbase Last Round Amount` = gsub("US\\$ ", "", `Crunchbase Last Round Amount`),
    `Crunchbase Last Round Amount` = ifelse(grepl("M", `Crunchbase Last Round Amount`), 
                                            as.numeric(sub("M", "", `Crunchbase Last Round Amount`)) * 1e6, 
                                            ifelse(grepl("K", `Crunchbase Last Round Amount`), 
                                                   as.numeric(sub("K", "", `Crunchbase Last Round Amount`)) * 1e3, 
                                                   as.numeric(`Crunchbase Last Round Amount`)))
  )

# Calculate quartiles for the Crunchbase Last Round Amount
quartiles <- quantile(data$`Crunchbase Last Round Amount`, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Function to estimate size based on investment quartiles
estimate_investment_size <- function(investment) {
  if (is.na(investment)) {
    return("Unknown")
  } else if (investment <= quartiles[1]) {
    return("Small")
  } else if (investment <= quartiles[2]) {
    return("Medium")
  } else if (investment <= quartiles[3]) {
    return("Large")
  } else {
    return("Very Large")
  }
}

# Apply the function to estimate size based on investment
data$Estimated_Investment_Size <- sapply(data$`Crunchbase Last Round Amount`, estimate_investment_size)

# Aggregate data by country and estimated investment size
investment_counts <- data %>%
  group_by(Country, Estimated_Investment_Size) %>%
  summarise(Count = n(), .groups = 'drop')

# Convert ISO country codes to country names
investment_counts$Country <- countrycode(investment_counts$Country, "iso2c", "country.name")
investment_counts$Country[is.na(investment_counts$Country)] <- "Unknown"

# Load geographical data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Extract centroid coordinates for bubble positions
world_centroids <- st_centroid(world)
bubble_data <- left_join(world_centroids, investment_counts, by = c("name_long" = "Country"))

# Filter out rows with NA values
bubble_data <- bubble_data[!is.na(bubble_data$Count), ]

# Convert centroids to data frame for ggplot2
bubble_data_df <- as.data.frame(bubble_data)
bubble_data_df <- cbind(bubble_data_df, st_coordinates(bubble_data))

# Create the map with ggplot2
p <- ggplot(data = world) +
  geom_sf(fill = "grey90", color = "white") +
  geom_point(data = bubble_data_df, aes(x = X, y = Y, size = Count, color = Estimated_Investment_Size, text = paste(name_long, "<br>Investment Size:", Estimated_Investment_Size, "<br>Number of Startups:", Count)), inherit.aes = FALSE) +
  scale_size_continuous(name = "Number of Startups") +
  scale_color_manual(
    values = c("Small" = "lightgreen", "Medium" = "green", "Large" = "darkgreen", "Very Large" = "forestgreen", "Unknown" = "grey"),
    name = "Investment Size"
  ) +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 16, hjust = 0.5),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  ) +
  labs(
    title = "Number of Startups by Country and Investment Size",
    subtitle = "Source: LinkedIn",
    fill = "Number of Startups"
  )

# Make the map interactive with plotly.
# Width and height are set here because passing them to layout() is deprecated.
p_interactive <- ggplotly(p, tooltip = "text", width = 1000, height = 500)

# Add a centered title and vertical margins
p_interactive <- p_interactive %>%
  layout(
    title = list(text = 'Number of Startups by Country and Investment Size', x = 0.5, y = 0.95, font = list(size = 20)),
    margin = list(t = 50, b = 50)
  )

# Display the interactive map
p_interactive


```

*****


```{r, echo=FALSE}

library(readxl)
library(dplyr)
library(tidyr)
library(countrycode)
library(knitr)
library(kableExtra)
library(scales)

# Load the data
data <- read_excel("linkedin_info.xlsx")

# Preprocess the Crunchbase Last Round Amount column
data <- data %>%
  filter(!grepl("Last round amount not found", `Crunchbase Last Round Amount`)) %>%
  mutate(
    `Crunchbase Last Round Amount` = gsub("US\\$ ", "", `Crunchbase Last Round Amount`),
    `Crunchbase Last Round Amount` = ifelse(grepl("M", `Crunchbase Last Round Amount`), 
                                            as.numeric(sub("M", "", `Crunchbase Last Round Amount`)) * 1e6, 
                                            ifelse(grepl("K", `Crunchbase Last Round Amount`), 
                                                   as.numeric(sub("K", "", `Crunchbase Last Round Amount`)) * 1e3, 
                                                   as.numeric(`Crunchbase Last Round Amount`)))
  )

# Calculate quartiles for the Crunchbase Last Round Amount
quartiles <- quantile(data$`Crunchbase Last Round Amount`, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Function to estimate size based on investment quartiles
estimate_investment_size <- function(investment) {
  if (is.na(investment)) {
    return("Unknown")
  } else if (investment <= quartiles[1]) {
    return("Small")
  } else if (investment <= quartiles[2]) {
    return("Medium")
  } else if (investment <= quartiles[3]) {
    return("Large")
  } else {
    return("Very Large")
  }
}

# Apply the function to estimate size based on investment
data$Estimated_Investment_Size <- sapply(data$`Crunchbase Last Round Amount`, estimate_investment_size)

# Aggregate data by country and estimated investment size
investment_counts <- data %>%
  group_by(Country, Estimated_Investment_Size) %>%
  summarise(Count = n(), .groups = 'drop')

# Convert ISO country codes to country names
investment_counts$Country <- countrycode(investment_counts$Country, "iso2c", "country.name")
investment_counts$Country[is.na(investment_counts$Country)] <- "Unknown"

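# Spread the data to have investment sizes as columns.
# The size columns are fixed to the order used in the table header below.
# names_expand = TRUE (tidyr >= 1.2.0) also creates a zero-filled column for
# any size category that does not occur in the data.
size_levels <- c("Small", "Medium", "Large", "Very Large", "Unknown")
table_data <- investment_counts %>%
  # Merge rows that collapsed into the same "Unknown" country label
  group_by(Country, Estimated_Investment_Size) %>%
  summarise(Count = sum(Count), .groups = 'drop') %>%
  mutate(Estimated_Investment_Size = factor(Estimated_Investment_Size, levels = size_levels)) %>%
  pivot_wider(names_from = Estimated_Investment_Size, values_from = Count, values_fill = 0, names_expand = TRUE) %>%
  mutate(Total = rowSums(across(all_of(size_levels)))) %>%
  arrange(desc(Total)) # Sort by Total in descending order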

# Custom color scale function for the Total column (green palette)
color_scale_total <- function(x) {
  col <- scales::col_numeric(palette = c("#e5f5e0", "#a1d99b", "#31a354"), domain = range(x, na.rm = TRUE))(x)
  return(col)
}

# Apply the color scale to the Total column
table_data$Total_color <- color_scale_total(table_data$Total)

# Create the table with kable
kable(table_data %>% select(-Total_color), col.names = c("Country", "Small", "Medium", "Large", "Very Large", "Unknown", "Total"), align = 'c') %>%
  add_header_above(c("Size based on last round amount" = 7)) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(7, background = table_data$Total_color, bold = TRUE) %>%
  scroll_box(height = "500px", width = "100%")

```

###

****
<h6 class="section-subtitle" id="desarrollo" style="text-align: center; font-weight: bold;">
  <span style='color:#2a9134; font-size: 16px;'>Time analysis</span>
</h6>
****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>By country</b></span>
</p>

####  {.tabset}


##### Number of startups

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This stacked bar chart shows the number of companies that received investment in each year, broken down by country. Each bar represents a year, and each colored segment shows how many companies from one country had their last recorded round in that year. The legend on the right maps colors to countries, and the dropdown menu filters the chart to a single country. The chart makes it possible to follow investment activity in the EdTech sector over time and to see in which countries it has grown or declined.</p>
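
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The aggregation behind a chart of this kind can be sketched in base R: extract the year from each round date, then count companies per country and year. The data frame below is made up for illustration, and its column names are assumptions rather than the dataset's own.
</p>

```r
# Made-up last-round dates for companies in two hypothetical countries
rounds <- data.frame(
  Country = c("IN", "IN", "US", "US", "US"),
  Date    = c("2021-03-15", "2022-07-01", "2021-11-30", "2021-01-10", NA)
)

# Drop companies without a round date, then extract the year
rounds <- rounds[!is.na(rounds$Date), ]
rounds$Year <- as.integer(format(as.Date(rounds$Date), "%Y"))

# Count companies per country and year, keeping only non-empty combinations
counts <- as.data.frame(
  table(Country = rounds$Country, Year = rounds$Year),
  stringsAsFactors = FALSE
)
counts <- counts[counts$Freq > 0, ]
print(counts)
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Each row of the result corresponds to one colored segment of one bar.
</p>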


```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(plotly)
library(countrycode)

# Load the data
data <- read_excel("linkedin_info.xlsx")

# Preprocess the data
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = year(`Last Round Formatted Date`),
    # Label unmapped country codes "Unknown" instead of leaving them as NA
    Country_Name = coalesce(countrycode(Country, "iso2c", "country.name"), "Unknown")
  )

# Summarize data by Year and Country
yearly_data <- data %>%
  group_by(Country_Name, Year) %>%
  summarise(Count = n(), .groups = 'drop')

# Get the list of unique country names in the chart
unique_countries <- unique(yearly_data$Country_Name)

# Create an interactive plot using plotly
plot_year <- plot_ly()

# Add each country's data as a separate trace
for (country in unique_countries) {
  country_data <- yearly_data %>% filter(Country_Name == country)
  plot_year <- add_trace(
    plot_year,
    data = country_data,
    x = ~Year,
    y = ~Count,
    name = country,
    type = 'bar',
    text = ~paste("Country:", Country_Name, "<br>Year:", Year, "<br>Count:", Count),  # Define hover text
    hoverinfo = 'text'  # Show hover information
  )
}

# Build the dropdown buttons: one that shows all countries, then one per country

country_buttons <- list(
  list(
    method = "restyle",
    args = list("visible", rep(TRUE, length(unique_countries))),
    label = "All Countries"
  )
)

country_buttons <- c(country_buttons, lapply(seq_along(unique_countries), function(i) {
  visibility <- rep(FALSE, length(unique_countries))
  visibility[i] <- TRUE
  list(
    method = "restyle",
    args = list("visible", visibility),
    label = unique_countries[i]
  )
}))

plot_year <- plot_year %>%
  layout(
    title = "Companies Receiving Investment by Year",
    xaxis = list(title = "Year", dtick = 1),
    yaxis = list(title = "Number of Companies"),
    barmode = 'stack',
    hovermode = 'closest',
    width = 1000,  # Set the width of the plot
    updatemenus = list(  # Comment out this section to remove the dropdown menu
      list(
        type = "dropdown",
        active = 0,
        buttons = country_buttons,
        x = 0.85,  # Adjust position
        xanchor = 'left',
        y = 1.15,
        yanchor = 'top'
      )
    )
  )

# Display the interactive plot
plot_year

```
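
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
As a rough illustration of the aggregation behind this chart, the sketch below counts companies per country and year from a small invented data frame. The company names, dates, and column names are hypothetical, and base R is used instead of the <code>dplyr</code> pipeline so that the snippet runs without extra packages.</p>

```r
# Toy data: hypothetical companies, not taken from the report's dataset
rounds <- data.frame(
  company = c("A", "B", "C", "D"),
  country = c("US", "US", "IN", "US"),
  last_round_date = as.Date(c("2021-03-01", "2021-11-15", "2021-06-30", "2023-01-10"))
)

# Extract the year of the most recent round
rounds$year <- as.integer(format(rounds$last_round_date, "%Y"))

# Count companies for every country/year combination that occurs
counts <- aggregate(company ~ country + year, data = rounds, FUN = length)
names(counts)[names(counts) == "company"] <- "count"
counts
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Because only the date of the most recent round is used, a company that raised money in several years contributes to a single bar.</p>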

##### Amount of investment

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This stacked bar chart shows the total investment received by companies in each year, broken down by country. Amounts come from each company's most recent funding round as reported by Crunchbase, and rounds without a parseable amount are left out. Each bar represents a year, and the colored segments within it give the total received by companies from each country. The legend on the right-hand side maps colors to countries. The chart helps track investment over time and shows which countries have attracted the most funding in the EdTech sector.</p>


```{r, echo=FALSE}
# Load the necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(plotly)
library(countrycode)

# Load the data (Assuming your data is in a spreadsheet named "linkedin_info.xlsx")
data <- read_excel("linkedin_info.xlsx")

# Preprocess the data
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = year(`Last Round Formatted Date`),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Convert K/M-suffixed strings to absolute amounts
  )

# Drop rows whose amount could not be parsed as a number
data <- data %>% filter(!is.na(`Crunchbase Last Round Amount`))

# Summarize data by Year and Country with total investment amount
yearly_data <- data %>%
  group_by(Country_Name, Year) %>%
  summarise(Total_Investment = sum(`Crunchbase Last Round Amount`, na.rm = TRUE), .groups = 'drop')

# Get the list of unique country names in the chart
unique_countries <- unique(yearly_data$Country_Name)

# Create an interactive plot using plotly
plot_year <- plot_ly()

# Add each country's data as a separate trace
for (country in unique_countries) {
  country_data <- yearly_data %>% filter(Country_Name == country)
  plot_year <- add_trace(
    plot_year,
    data = country_data,
    x = ~Year,
    y = ~Total_Investment,
    name = country,
    type = 'bar',
    text = ~paste0("Country: ", Country_Name, "<br>Year: ", Year, "<br>Total Investment: $", format(Total_Investment, big.mark = ",", scientific = FALSE, trim = TRUE)),  # Define hover text
    hoverinfo = 'text'  # Show hover information
  )
}

# Add a country selector
country_buttons <- list(
  list(
    method = "restyle",
    args = list("visible", rep(TRUE, length(unique_countries))),
    label = "All Countries"
  )
)

country_buttons <- c(country_buttons, lapply(seq_along(unique_countries), function(i) {
  visibility <- rep(FALSE, length(unique_countries))
  visibility[i] <- TRUE
  list(
    method = "restyle",
    args = list("visible", visibility),
    label = unique_countries[i]
  )
}))

plot_year <- plot_year %>%
  layout(
    title = "Total Investment Received by Year",
    xaxis = list(title = "Year", dtick = 1),
    yaxis = list(title = "Total Investment (USD)"),
    barmode = 'stack',
    hovermode = 'closest',
    width = 1000,  # Set the width of the plot
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = country_buttons,
        x = 0.85,  # Adjust position
        xanchor = 'left',
        y = 1.15,
        yanchor = 'top'
      )
    )
  )

# Display the interactive plot
plot_year


```
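
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The amounts in this chart start out as text such as "$1.5M" or "500K". The sketch below shows one way such strings could be turned into numbers. The helper name <code>parse_round_amount</code> is hypothetical, the sample strings are invented, and the handling of a "B" (billion) suffix is an assumption that goes beyond what the report's own code does.</p>

```r
# Hypothetical helper: convert strings like "$1.5M" or "500K" to absolute amounts
parse_round_amount <- function(x) {
  x <- toupper(as.character(x))
  # Keep digits and the decimal point; strings without digits become NA
  number <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", x)))
  # Pick up a K/M/B suffix only when it directly follows the number
  suffix <- sub(".*[0-9]\\s*([KMB])\\b.*", "\\1", x, perl = TRUE)
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)[suffix]
  multiplier[is.na(multiplier)] <- 1  # no recognised suffix: value is already absolute
  unname(number * multiplier)
}

parse_round_amount(c("$1.5M", "500K", "$2,300", "MXN 750K", "Undisclosed"))
# 1500000  500000    2300  750000      NA
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Tying the suffix to the number avoids reading a letter elsewhere in the string, such as the "M" in a currency code, as a multiplier.</p>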

###

****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>By Vertical</b></span>
</p>

####  {.tabset}


##### Number of startups

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This stacked bar chart shows the number of companies per year, broken down by vertical. Companies are placed in the year of their most recent funding round. Each bar represents a year, and the colored segments within it give the number of companies in each vertical. The legend is hidden to save space: hover over a segment to see its vertical, or use the dropdown menu above the chart to isolate one. The chart helps track how each vertical has grown or declined over time in the EdTech sector.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The verticals were defined and selected by the EduNext team, and the detailed definitions of these verticals are provided earlier in the report.</p>

```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(plotly)
library(countrycode)
library(DT)

# Load the data (assuming it is in a spreadsheet named "linkedin_info.xlsx")
data <- read_excel("linkedin_info.xlsx")

# Preprocess the data
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`)),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Convert K/M-suffixed strings to absolute amounts
  )

# Load the additional data with the potential-vertical information
additional_data <- read_excel("final_output.xlsx")

# Merge the datasets on company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Summarize by year and potential vertical, counting companies
yearly_vertical_data <- merged_data %>%
  group_by(`Potential Vertical`, Year) %>%
  summarise(Company_Count = n(), .groups = 'drop')

# Get the unique potential verticals shown in the chart
unique_verticals <- unique(yearly_vertical_data$`Potential Vertical`)

# Create the buttons for selecting verticals
vertical_buttons <- list(
  list(
    method = "restyle",
    args = list("visible", rep(TRUE, length(unique_verticals))),
    label = "All Verticals"
  )
)

vertical_buttons <- c(vertical_buttons, lapply(seq_along(unique_verticals), function(i) {
  visibility <- rep(FALSE, length(unique_verticals))
  visibility[i] <- TRUE
  list(
    method = "restyle",
    args = list("visible", visibility),
    label = unique_verticals[i]
  )
}))

# Create an interactive plot using plotly
plot_year <- plot_ly()

# Add each vertical's data as a separate trace
for (vertical in unique_verticals) {
  vertical_data <- yearly_vertical_data %>% filter(`Potential Vertical` == vertical)
  plot_year <- add_trace(
    plot_year,
    data = vertical_data,
    x = ~Year,
    y = ~Company_Count,
    name = vertical,
    type = 'bar',
    text = ~paste("Vertical:", `Potential Vertical`, "<br>Year:", Year, "<br>Company Count:", Company_Count),  # Define hover text
    hoverinfo = 'text'  # Show hover information
  )
}

# Define the plot layout
plot_year <- plot_year %>%
  layout(
    title = list(
      text = "Number of Companies by Year and Vertical",
      x = 0.5,
      xanchor = 'center',
      y = 0.85,
      yanchor = 'top'
    ),
    xaxis = list(title = "Year", dtick = 1),
    yaxis = list(title = "Number of Companies"),
    barmode = 'stack',
    hovermode = 'closest',
    showlegend = FALSE,  # Hide the legend
    width = 1000,  # Set the width of the plot
    margin = list(t = 200),  # Increase the top margin to make room for the dropdown menu
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = vertical_buttons,
        x = 0.5,  # Centered position
        xanchor = 'center',
        y = 1.3,
        yanchor = 'top'
      )
    )
  )

# Display the interactive plot
plot_year


```
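
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The dropdown in the chart above works by switching traces on and off. The sketch below builds the same kind of button list from a vector of trace names: one "show all" entry plus one entry per trace. The function name <code>make_visibility_buttons</code> and the vertical names are hypothetical placeholders.</p>

```r
# Build plotly dropdown buttons: one "show all" entry plus one per trace.
make_visibility_buttons <- function(trace_names, all_label = "All") {
  n <- length(trace_names)
  show_all <- list(
    method = "restyle",
    args = list("visible", rep(TRUE, n)),
    label = all_label
  )
  show_one <- lapply(seq_len(n), function(i) {
    mask <- rep(FALSE, n)
    mask[i] <- TRUE  # only the i-th trace stays visible
    list(method = "restyle", args = list("visible", mask), label = trace_names[i])
  })
  c(list(show_all), show_one)
}

# Hypothetical vertical names
buttons <- make_visibility_buttons(c("Vertical A", "Vertical B", "Vertical C"))
length(buttons)         # 4
buttons[[3]]$args[[2]]  # FALSE TRUE FALSE
```

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
Each mask is positional, so the order of the names has to match the order in which the traces were added to the plot.</p>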



##### Amount of investment

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This stacked bar chart shows the total investment per year, broken down by vertical. Amounts come from each company's most recent funding round as reported by Crunchbase. Each bar represents a year, and the colored segments within it give the total received by companies in each vertical. The legend is hidden to save space: hover over a segment to see its vertical, or use the dropdown menu above the chart to isolate one. The chart helps track how investment in each vertical has grown or declined over time in the EdTech sector.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The verticals were defined and selected by the EduNext team, and the detailed definitions of these verticals are provided earlier in the report.</p>

```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(plotly)
library(countrycode)
library(DT)

# Load the data (assuming it is in a spreadsheet named "linkedin_info.xlsx")
data <- read_excel("linkedin_info.xlsx")

# Preprocess the data
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`)),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Convert K/M-suffixed strings to absolute amounts
  )

# Load the additional data with the potential-vertical information
additional_data <- read_excel("final_output.xlsx")

# Merge the datasets on company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Summarize by year and potential vertical, summing the investment amount
yearly_vertical_data <- merged_data %>%
  group_by(`Potential Vertical`, Year) %>%
  summarise(Total_Investment = sum(`Crunchbase Last Round Amount`, na.rm = TRUE), .groups = 'drop')

# Get the unique potential verticals shown in the chart
unique_verticals <- unique(yearly_vertical_data$`Potential Vertical`)

# Create the buttons for selecting verticals
vertical_buttons <- list(
  list(
    method = "restyle",
    args = list("visible", rep(TRUE, length(unique_verticals))),
    label = "All Verticals"
  )
)

vertical_buttons <- c(vertical_buttons, lapply(seq_along(unique_verticals), function(i) {
  visibility <- rep(FALSE, length(unique_verticals))
  visibility[i] <- TRUE
  list(
    method = "restyle",
    args = list("visible", visibility),
    label = unique_verticals[i]
  )
}))

# Create an interactive plot using plotly
plot_year <- plot_ly()

# Add each vertical's data as a separate trace
for (vertical in unique_verticals) {
  vertical_data <- yearly_vertical_data %>% filter(`Potential Vertical` == vertical)
  plot_year <- add_trace(
    plot_year,
    data = vertical_data,
    x = ~Year,
    y = ~Total_Investment,
    name = vertical,
    type = 'bar',
    text = ~paste0("Vertical: ", `Potential Vertical`, "<br>Year: ", Year, "<br>Total Investment: $", format(Total_Investment, big.mark = ",", scientific = FALSE, trim = TRUE)),  # Define hover text
    hoverinfo = 'text'  # Show hover information
  )
}

# Define the plot layout
plot_year <- plot_year %>%
  layout(
    title = list(
      text = "Total Investment by Year and Vertical",
      x = 0.5,
      xanchor = 'center',
      y = 0.85,
      yanchor = 'top'
    ),
    xaxis = list(title = "Year", dtick = 1),
    yaxis = list(title = "Total Investment (USD)"),
    barmode = 'stack',
    hovermode = 'closest',
    showlegend = FALSE,  # Hide the legend
    width = 1000,  # Set the width of the plot
    margin = list(t = 200),  # Increase the top margin to make room for the dropdown menu
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = vertical_buttons,
        x = 0.5,  # Centered position
        xanchor = 'center',
        y = 1.3,
        yanchor = 'top'
      )
    )
  )

# Display the interactive plot
plot_year


```

###

****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>By B2B or B2C </b></span>
</p>


####  {.tabset}


##### Number of startups

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This stacked bar chart shows how many companies received investment in each year, split by business model: B2B (business-to-business) and B2C (business-to-consumer). Companies are placed in the year of their most recent funding round. Each bar represents a year, and the colored segments within it give the number of companies in each category. The legend on the right-hand side maps colors to categories. The chart helps track funding activity over time and compare B2B and B2C companies in the EdTech sector.</p>

```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(plotly)

# Load the data (Assuming your data is in a spreadsheet named "linkedin_info.xlsx")
data <- read_excel("linkedin_info.xlsx")

# Preprocess the data
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`))
  )

# Load the additional data with B2B or B2C information
additional_data <- read_excel("final_output.xlsx")

# Merge the datasets based on Company Name
merged_data <- merge(data, additional_data, by = "Company Name")

# Summarize data by Year and B2B or B2C
yearly_b2b_b2c_data <- merged_data %>%
  group_by(`B2B or B2C`, Year) %>%
  summarise(Count = n(), .groups = 'drop')

# Get the list of unique B2B or B2C categories in the chart
unique_categories <- unique(yearly_b2b_b2c_data$`B2B or B2C`)

# Create an interactive plot using plotly
plot_year <- plot_ly()

# Add each category's data as a separate trace
for (category in unique_categories) {
  category_data <- yearly_b2b_b2c_data %>% filter(`B2B or B2C` == category)
  plot_year <- add_trace(
    plot_year,
    data = category_data,
    x = ~Year,
    y = ~Count,
    name = category,
    type = 'bar',
    text = ~paste("Category:", `B2B or B2C`, "<br>Year:", Year, "<br>Count:", Count),  # Define hover text
    hoverinfo = 'text'  # Show hover information
  )
}

# Add a category selector
category_buttons <- list(
  list(
    method = "restyle",
    args = list("visible", rep(TRUE, length(unique_categories))),
    label = "All Categories"
  )
)

category_buttons <- c(category_buttons, lapply(seq_along(unique_categories), function(i) {
  visibility <- rep(FALSE, length(unique_categories))
  visibility[i] <- TRUE
  list(
    method = "restyle",
    args = list("visible", visibility),
    label = unique_categories[i]
  )
}))

plot_year <- plot_year %>%
  layout(
    title = "Number of Companies Receiving Investment by Year",
    xaxis = list(title = "Year", dtick = 1),
    yaxis = list(title = "Number of Companies"),
    barmode = 'stack',
    hovermode = 'closest',
    width = 1000,  # Set the width of the plot
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = category_buttons,
        x = 0.85,  # Adjust position
        xanchor = 'left',
        y = 1.15,
        yanchor = 'top'
      )
    )
  )

# Display the interactive plot
plot_year


```
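
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The charts that combine funding data with the team's classifications join two spreadsheets on the company name. The sketch below uses invented data to show how such a join keeps only exact name matches, so a stray space is enough to drop a row. Whether the real spreadsheets contain mismatches like this is not known; the example only illustrates a check that could be run.</p>

```r
# Invented data: the third name carries a trailing space in one table
funding <- data.frame(
  `Company Name` = c("Alpha", "Beta", "Gamma "),
  Year = c(2021L, 2022L, 2022L),
  check.names = FALSE
)
labels <- data.frame(
  `Company Name` = c("Alpha", "Beta", "Gamma"),
  `B2B or B2C` = c("B2B", "B2C", "B2B"),
  check.names = FALSE
)

# merge() performs an inner join by default: unmatched rows disappear silently
merged <- merge(funding, labels, by = "Company Name")
nrow(merged)  # 2

# List the names that found no partner
unmatched <- setdiff(funding$`Company Name`, labels$`Company Name`)

# Trimming whitespace before the join recovers the missing row
funding$`Company Name` <- trimws(funding$`Company Name`)
merged_clean <- merge(funding, labels, by = "Company Name")
nrow(merged_clean)  # 3
```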

##### Amount of investment

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This stacked bar chart illustrates the total amount of investment by year, categorized by business model: B2B (Business to Business) and B2C (Business to Consumer). Each bar represents a year, and the different colored segments within each bar denote the total investment received by companies in each category for that particular year. The legend on the right-hand side identifies the categories corresponding to each color. This visualization helps track investment trends over time, highlighting the dynamics between B2B and B2C companies in the EdTech sector.</p>

```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(plotly)
library(countrycode)

# Load the data (Assuming your data is in a spreadsheet named "linkedin_info.xlsx")
data <- read_excel("linkedin_info.xlsx")

# Preprocess the data
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`)),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Convert K/M-suffixed strings to absolute amounts
  )

# Load the additional data with B2B or B2C information
additional_data <- read_excel("final_output.xlsx")

# Merge the datasets based on Company Name
merged_data <- merge(data, additional_data, by = "Company Name")

# Summarize data by Year and B2B or B2C
yearly_b2b_b2c_data <- merged_data %>%
  group_by(`B2B or B2C`, Year) %>%
  summarise(Total_Investment = sum(`Crunchbase Last Round Amount`, na.rm = TRUE), .groups = 'drop')

# Get the list of unique B2B or B2C categories in the chart
unique_categories <- unique(yearly_b2b_b2c_data$`B2B or B2C`)

# Create an interactive plot using plotly
plot_year <- plot_ly()

# Add each category's data as a separate trace
for (category in unique_categories) {
  category_data <- yearly_b2b_b2c_data %>% filter(`B2B or B2C` == category)
  plot_year <- add_trace(
    plot_year,
    data = category_data,
    x = ~Year,
    y = ~Total_Investment,
    name = category,
    type = 'bar',
    text = ~paste0("Category: ", `B2B or B2C`, "<br>Year: ", Year, "<br>Total Investment: $", format(Total_Investment, big.mark = ",", scientific = FALSE, trim = TRUE)),  # Define hover text
    hoverinfo = 'text'  # Show hover information
  )
}

# Add a category selector
category_buttons <- list(
  list(
    method = "restyle",
    args = list("visible", rep(TRUE, length(unique_categories))),
    label = "All Categories"
  )
)

category_buttons <- c(category_buttons, lapply(seq_along(unique_categories), function(i) {
  visibility <- rep(FALSE, length(unique_categories))
  visibility[i] <- TRUE
  list(
    method = "restyle",
    args = list("visible", visibility),
    label = unique_categories[i]
  )
}))

plot_year <- plot_year %>%
  layout(
    title = "Total Investment by Year and Category",
    xaxis = list(title = "Year", dtick = 1),
    yaxis = list(title = "Total Investment (USD)"),
    barmode = 'stack',
    hovermode = 'closest',
    width = 1000,  # Set the width of the plot
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = category_buttons,
        x = 0.85,  # Adjust position
        xanchor = 'left',
        y = 1.15,
        yanchor = 'top'
      )
    )
  )

# Display the interactive plot
plot_year

```

###

****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>By Industry</b></span>
</p>

####  {.tabset}


##### Number of startups

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This treemap chart displays the distribution of companies by industry, based on the information available on their LinkedIn profiles. Each rectangle represents an industry, with the size of the rectangle proportional to the number of companies in that industry. The color and size of the rectangles provide a quick visual reference for the prevalence of different industries among the analyzed companies. This visualization helps to identify which industries are most common in the EdTech sector and how companies are distributed across various fields.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> EduNext's own LinkedIn profile lists its industry as E-learning.</p>


```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(plotly)

# Load the data (make sure the files are in the working directory)
data <- read_excel("linkedin_info.xlsx")
additional_data <- read_excel("final_output.xlsx")

# Merge the datasets on company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Summarize the data to get the count for each industry
industry_data <- merged_data %>%
  group_by(Industry) %>%
  summarise(Count = n()) %>%
  ungroup()

# Calculate the percentage
industry_data <- industry_data %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Define custom colors for the segments
colors <- c("#FF5733", "#33FF57", "#3357FF", "#FF33A1", "#FFDD33", "#33FFD7", "#D733FF", "#FF5733")

# Create the treemap chart using plotly
treemap_chart <- plot_ly(
  industry_data,
  labels = ~paste(Industry, "<br>Count: ", Count, "<br>Percentage: ", round(Percentage, 2), "%"),
  parents = "",
  values = ~Count,
  type = 'treemap',
  textinfo = 'label',
  hoverinfo = 'label+value+percent root',  # treemap traces accept 'percent root', not the pie-style 'percent'
  marker = list(colors = colors)
) %>%
  layout(
    title = list(
      text = "<b>Distribution of Companies by Industry</b>",
      x = 0.5,
      y = 0.95,
      xanchor = 'center',
      yanchor = 'top',
      font = list(size = 24, family = "Arial", color = "#333333")
    ),
    margin = list(t = 100, b = 100, l = 50, r = 50),  # Adjust margins for better spacing
    height = 500,  # Adjust the height for better spacing
    width = 700   # Adjust the width to center the chart
  )

# Save the HTML file
htmltools::save_html(htmltools::tagList(treemap_chart), "industry_distribution_treemap_chart.html")

# Display the interactive chart in the RStudio viewer or in the browser
treemap_chart



```
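
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The percentage shown in each rectangle is the industry's count divided by the total number of companies. The short sketch below works through that arithmetic on five invented industry labels. The labels are placeholders, and base R is used so that the snippet runs on its own.</p>

```r
# Invented industry labels for five hypothetical companies
industry <- c("E-learning", "E-learning", "Software", "E-learning", "Higher Education")

counts <- table(industry)                      # companies per industry
shares <- round(100 * prop.table(counts), 2)   # share of the total, in percent

counts[["E-learning"]]  # 3
shares[["E-learning"]]  # 60
```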

##### Amount of investment

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This treemap chart displays the distribution of investment by industry, based on the information available on the companies' LinkedIn profiles. Each rectangle represents an industry, with the size of the rectangle proportional to the total investment received by companies in that industry. The color and size of the rectangles provide a quick visual reference for the allocation of investments across different industries among the analyzed companies. This visualization helps to identify which industries attract the most investment in the EdTech sector.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> EduNext's own LinkedIn profile lists its industry as E-learning.</p>

```{r, echo=FALSE}
# Load the necessary libraries
library(readxl)
library(dplyr)
library(lubridate)    # ymd(), year()
library(countrycode)  # countrycode()
library(plotly)

# Load the data (make sure the files are in the working directory)
data <- read_excel("linkedin_info.xlsx")
additional_data <- read_excel("final_output.xlsx")

# Preprocess the data
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`)),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Convert K/M-suffixed strings to absolute amounts
  )

# Merge the datasets on company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Summarize the data to get the total investment for each industry
industry_data <- merged_data %>%
  group_by(Industry) %>%
  summarise(Total_Investment = sum(`Crunchbase Last Round Amount`, na.rm = TRUE)) %>%
  ungroup()

# Calculate the share of investment
industry_data <- industry_data %>%
  mutate(Percentage = Total_Investment / sum(Total_Investment) * 100)

# Define custom colors for the segments
colors <- c("#FF5733", "#33FF57", "#3357FF", "#FF33A1", "#FFDD33", "#33FFD7", "#D733FF", "#FF5733")

# Create the treemap chart using plotly
treemap_chart <- plot_ly(
  industry_data,
  labels = ~paste(Industry, "<br>Investment: $", format(round(Total_Investment, 2), big.mark = ",", scientific = FALSE), "<br>Percentage: ", round(Percentage, 2), "%"),
  parents = "",
  values = ~Total_Investment,
  type = 'treemap',
  textinfo = 'label',
  hoverinfo = 'label+value+percent root',  # treemap traces accept 'percent root', not the pie-style 'percent'
  marker = list(colors = colors)
) %>%
  layout(
    title = list(
      text = "<b>Distribution of Investment by Industry</b>",
      x = 0.5,
      y = 0.95,
      xanchor = 'center',
      yanchor = 'top',
      font = list(size = 24, family = "Arial", color = "#333333")
    ),
    margin = list(t = 100, b = 100, l = 50, r = 50),  # Adjust margins for better spacing
    height = 500,  # Adjust the height for better spacing
    width = 700   # Adjust the width to center the chart
  )

# Save the HTML file
htmltools::save_html(htmltools::tagList(treemap_chart), "investment_distribution_treemap_chart.html")

# Display the interactive chart in the RStudio viewer or in the browser
treemap_chart



```


###

****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>By Role</b></span>
</p>

####  {.tabset}


##### Role

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This donut chart visualizes the distribution of potential roles identified for companies, based on their relevance to EduNext. The chart segments represent the percentage of companies classified as partners, clients, competitors, and providers. The legend indicates the role corresponding to each color. This visualization helps to understand the predominant roles among the analyzed companies.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The roles and their definitions were chosen by the EduNext team.</p>

```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(plotly)

# Load the data (assuming it is in a spreadsheet named "linkedin_info.xlsx")
data <- read_excel("linkedin_info.xlsx")

# Load the additional data with the potential-role information
additional_data <- read_excel("final_output.xlsx")

# Merge the datasets on company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Summarize the data to get the count for each potential role
role_data <- merged_data %>%
  group_by(`Potential Role`) %>%
  summarise(Count = n()) %>%
  ungroup()

# Define custom colors for the segments
colors <- c("#FF5733", "#33FF57", "#3357FF", "#FF33A1", "#FFDD33", "#33FFD7", "#D733FF", "#FF5733")

# Create the donut chart using plotly
donut_chart <- plot_ly(
  role_data,
  labels = ~`Potential Role`,
  values = ~Count,
  type = 'pie',
  hole = 0.5,  # Create the donut hole
  textinfo = 'label+percent',
  insidetextorientation = 'radial',
  marker = list(colors = colors)
) %>%
  layout(
    title = list(
      text = "<b>Proportion of Potential Roles</b>",
      x = 0.5,
      y = 0.95,
      xanchor = 'center',
      yanchor = 'top',
      font = list(size = 24, family = "Arial", color = "#333333")
    ),
    showlegend = TRUE,
    legend = list(orientation = 'h', x = 0.5, y = -0.3, xanchor = 'center'),
    margin = list(t = 100, b = 100, l = 50, r = 50),  # Adjust margins for better spacing
    height = 500,  # Adjust the height for better spacing
    width = 700,   # Adjust the width to center the donut
    annotations = list(
      list(
        x = 0.5,
        y = 0.5,
        text = "",
        showarrow = FALSE
      )
    )
  )

# Save the chart as an HTML file
htmltools::save_html(htmltools::tagList(donut_chart), "role_proportion_donut_chart.html")

# Display the interactive chart in the RStudio viewer or the browser
donut_chart


```


##### Potential clients

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This table lists the companies identified as potential clients for EduNext. The columns include company name, size, country, type, number of employees, industry, business model (B2B or B2C), potential vertical, and the amount and date of the last funding round as reported by Crunchbase. Only companies with a recorded last-round date are listed, sorted from the largest round to the smallest. This table provides detailed information on potential clients, aiding strategic decision-making.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The EduNext team chose the roles and definitions for each role.</p>


```{r, echo=FALSE}

# Load the required libraries
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(countrycode)
library(scales)

# Load the data (the files must be in the working directory)
data <- read_excel("linkedin_info.xlsx")
additional_data <- read_excel("final_output.xlsx")

# Preprocess the data: drop rows without a last-round date and parse dates and amounts
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`)),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Keep the amount in original units
  )

# Merge the two datasets on the company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Convert the ISO country codes to English country names
merged_data$Country <- countrycode(merged_data$`Country.x`, origin = 'iso2c', destination = 'country.name')

# Keep only the companies classified as "CLIENT" and select the available columns
clients <- merged_data %>%
  filter(`Potential Role` == "CLIENT") %>%
  select(`Company Name`, `Company Size`, Country, `Company Type`, 
         `Number of Employees`, `Industry`, `B2B or B2C`, 
         `Potential Vertical`, `Crunchbase Last Round Amount`, `Last Round Formatted Date`) %>%
  distinct() %>%
  arrange(desc(`Crunchbase Last Round Amount`)) %>%
  mutate(`Crunchbase Last Round Amount` = dollar(`Crunchbase Last Round Amount`, prefix = "$", big.mark = ".", decimal.mark = ","))  # Format the investment amount

# Build an interactive table of the companies classified as "CLIENT"
clients_table <- datatable(clients, 
          options = list(pageLength = 10, 
                         autoWidth = TRUE, 
                         dom = 't<"bottom"lp>',  # Place the pagination controls at the bottom
                         scrollX = TRUE),
          rownames = FALSE, 
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: center; font-size: 150%; font-weight: bold;',
            'List of Companies that Apply as Clients'
          ))

# Save the table as an HTML file
htmltools::save_html(htmltools::tagList(clients_table), "clients_list.html")

# Display the interactive table in the RStudio viewer or the browser
clients_table



```

##### Potential partners


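<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This table lists the companies identified as potential partners for EduNext. The columns include company name, size, country, type, number of employees, industry, business model (B2B or B2C), potential vertical, and the amount and date of the last funding round as reported by Crunchbase. Only companies with a recorded last-round date are listed, sorted from the largest round to the smallest. This table provides detailed information on potential partners, aiding strategic collaborations.</p>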

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The EduNext team chose the roles and definitions for each role.</p>

```{r, echo=FALSE}
# Load the required libraries
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(countrycode)
library(scales)

# Load the data (the files must be in the working directory)
data <- read_excel("linkedin_info.xlsx")
additional_data <- read_excel("final_output.xlsx")

# Preprocess the data: drop rows without a last-round date and parse dates and amounts
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`)),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Keep the amount in original units
  )

# Merge the two datasets on the company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Convert the ISO country codes to English country names
merged_data$Country <- countrycode(merged_data$`Country.x`, origin = 'iso2c', destination = 'country.name')

# Keep only the companies classified as "PARTNER" and select the available columns
partners <- merged_data %>%
  filter(`Potential Role` == "PARTNER") %>%
  select(`Company Name`, `Company Size`, Country, `Company Type`, 
         `Number of Employees`, `Industry`, `B2B or B2C`, 
         `Potential Vertical`, `Crunchbase Last Round Amount`, `Last Round Formatted Date`) %>%
  distinct() %>%
  arrange(desc(`Crunchbase Last Round Amount`)) %>%
  mutate(`Crunchbase Last Round Amount` = dollar(`Crunchbase Last Round Amount`, prefix = "$", big.mark = ".", decimal.mark = ","))  # Format the investment amount

# Build an interactive table of the companies classified as "PARTNER"
partners_table <- datatable(partners, 
          options = list(pageLength = 10, 
                         autoWidth = TRUE, 
                         dom = 't<"bottom"lp>',  # Place the pagination controls at the bottom
                         scrollX = TRUE),
          rownames = FALSE, 
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: center; font-size: 150%; font-weight: bold;',
            'List of Companies that Apply as Partners'
          ))

# Save the table as an HTML file
htmltools::save_html(htmltools::tagList(partners_table), "partners_list.html")

# Display the interactive table in the RStudio viewer or the browser
partners_table


```

##### Potential competitors

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This table lists the companies identified as potential competitors for EduNext. The columns include company name, size, country, type, number of employees, industry, business model (B2B or B2C), potential vertical, and the amount and date of the last funding round as reported by Crunchbase. Only companies with a recorded last-round date are listed, sorted from the largest round to the smallest. This table provides detailed information on potential competitors, aiding competitive analysis.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The EduNext team chose the roles and definitions for each role.</p>

```{r, echo=FALSE}
# Load the required libraries
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(countrycode)
library(scales)

# Load the data (the files must be in the working directory)
data <- read_excel("linkedin_info.xlsx")
additional_data <- read_excel("final_output.xlsx")

# Preprocess the data: drop rows without a last-round date and parse dates and amounts
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`)),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Keep the amount in original units
  )

# Merge the two datasets on the company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Convert the ISO country codes to English country names
merged_data$Country <- countrycode(merged_data$`Country.x`, origin = 'iso2c', destination = 'country.name')

# Keep only the companies classified as "COMPETITOR" and select the available columns
competitors <- merged_data %>%
  filter(`Potential Role` == "COMPETITOR") %>%
  select(`Company Name`, `Company Size`, Country, `Company Type`, 
         `Number of Employees`, `Industry`, `B2B or B2C`, 
         `Potential Vertical`, `Crunchbase Last Round Amount`, `Last Round Formatted Date`) %>%
  distinct() %>%
  arrange(desc(`Crunchbase Last Round Amount`)) %>%
  mutate(`Crunchbase Last Round Amount` = dollar(`Crunchbase Last Round Amount`, prefix = "$", big.mark = ".", decimal.mark = ","))  # Format the investment amount

# Build an interactive table of the companies classified as "COMPETITOR"
competitors_table <- datatable(competitors, 
          options = list(pageLength = 10, 
                         autoWidth = TRUE, 
                         dom = 't<"bottom"lp>',  # Place the pagination controls at the bottom
                         scrollX = TRUE),
          rownames = FALSE, 
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: center; font-size: 150%; font-weight: bold;',
            'List of Companies that Apply as Competitors'
          ))

# Save the table as an HTML file
htmltools::save_html(htmltools::tagList(competitors_table), "competitors_list.html")

# Display the interactive table in the RStudio viewer or the browser
competitors_table

```

##### Potential providers

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This table lists the companies identified as potential providers for EduNext. The columns include company name, size, country, type, number of employees, industry, business model (B2B or B2C), potential vertical, and the amount and date of the last funding round as reported by Crunchbase. Only companies with a recorded last-round date are listed, sorted from the largest round to the smallest. This table provides detailed information on potential providers, aiding strategic sourcing decisions.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The EduNext team chose the roles and definitions for each role.</p>

```{r, echo=FALSE}
# Load the required libraries
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(countrycode)
library(scales)

# Load the data (the files must be in the working directory)
data <- read_excel("linkedin_info.xlsx")
additional_data <- read_excel("final_output.xlsx")

# Preprocess the data: drop rows without a last-round date and parse dates and amounts
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    Year = as.integer(year(`Last Round Formatted Date`)),
    Country_Name = countrycode(Country, "iso2c", "country.name"),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Keep the amount in original units
  )

# Merge the two datasets on the company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Convert the ISO country codes to English country names
merged_data$Country <- countrycode(merged_data$`Country.x`, origin = 'iso2c', destination = 'country.name')

# Keep only the companies classified as "PROVIDER" and select the available columns
providers <- merged_data %>%
  filter(`Potential Role` == "PROVIDER") %>%
  select(`Company Name`, `Company Size`, Country, `Company Type`, 
         `Number of Employees`, `Industry`, `B2B or B2C`, 
         `Potential Vertical`, `Crunchbase Last Round Amount`, `Last Round Formatted Date`) %>%
  distinct() %>%
  arrange(desc(`Crunchbase Last Round Amount`)) %>%
  mutate(`Crunchbase Last Round Amount` = dollar(`Crunchbase Last Round Amount`, prefix = "$", big.mark = ".", decimal.mark = ","))  # Format the investment amount

# Build an interactive table of the companies classified as "PROVIDER"
providers_table <- datatable(providers, 
          options = list(pageLength = 10, 
                         autoWidth = TRUE, 
                         dom = 't<"bottom"lp>',  # Place the pagination controls at the bottom
                         scrollX = TRUE),
          rownames = FALSE, 
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: center; font-size: 150%; font-weight: bold;',
            'List of Companies that Apply as Providers'
          ))

# Save the table as an HTML file
htmltools::save_html(htmltools::tagList(providers_table), "providers_list.html")

# Display the interactive table in the RStudio viewer or the browser
providers_table


```



##### Ranking of startups

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This horizontal bar chart ranks the 35 companies with the largest most recent investment, as reported by Crunchbase. The length of each bar represents the size of the investment, and the company names are listed along the vertical axis. Hovering over a bar shows the company's potential roles and the investment amount. This visualization helps to quickly identify the most heavily funded companies within the EdTech sector.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The EduNext team chose the roles and definitions for each role.</p>

```{r, echo=FALSE, fig.height=10, fig.width=12}

# Load the required libraries
library(readxl)
library(dplyr)
library(lubridate)
library(plotly)
library(htmltools)

# Load the data (assumes the workbook "linkedin_info.xlsx" is in the working directory)
data <- read_excel("linkedin_info.xlsx")

# Preprocess the data: drop rows without a last-round date and parse dates and amounts
data <- data %>%
  filter(!grepl("Last round date not found", `Last Round Formatted Date`)) %>%
  filter(!is.na(`Last Round Formatted Date`)) %>%
  mutate(
    `Last Round Formatted Date` = ymd(`Last Round Formatted Date`),
    `Crunchbase Last Round Amount` = case_when(
      grepl("M", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e6,
      grepl("K", `Crunchbase Last Round Amount`) ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`)) * 1e3,
      TRUE ~ as.numeric(gsub("[^0-9.]", "", `Crunchbase Last Round Amount`))
    )  # Keep the amount in original units
  )

# Load the additional data containing the potential roles
additional_data <- read_excel("final_output.xlsx")

# Merge the two datasets on the company name
merged_data <- merge(data, additional_data, by = "Company Name")

# Concatenate the roles for each company and compute its most recent investment
merged_data <- merged_data %>%
  group_by(`Company Name`) %>%
  summarise(
    Roles = paste(unique(`Potential Role`), collapse = ", "),
    Most_Recent_Investment = max(`Crunchbase Last Round Amount`, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(desc(Most_Recent_Investment)) %>%
  head(35)  # Keep only the top 35 companies

# Build the horizontal bar chart with plotly
plot <- plot_ly(
  merged_data,
  x = ~Most_Recent_Investment,
  y = ~reorder(`Company Name`, Most_Recent_Investment),
  type = 'bar',
  orientation = 'h',
  text = ~paste('Roles:', Roles, '<br>Investment: $', format(Most_Recent_Investment, big.mark = ",")),
  hoverinfo = 'text',
  marker = list(color = 'skyblue')
) %>%
  layout(
    title = list(
      text = "Ranking of Companies by Most Recent Investment",
      x = 0.5,
      y = 0.95,
      xanchor = 'center',
      yanchor = 'top',
      font = list(size = 18)
    ),
    xaxis = list(title = "Most Recent Investment (USD)"),
    yaxis = list(title = "Company Name", automargin = TRUE),
    margin = list(l = 250, r = 50, b = 50, t = 100, pad = 10),  # Adjust margins to fit the content
    height = 800,  # Set the height for 35 companies
    width = 1000,  # Adjust the width to avoid horizontal scrolling
    bargap = 0.1  # Reduce the gap between bars to make them larger
  )

# Save the chart as an HTML file
htmltools::save_html(htmltools::tagList(plot), "plot_no_scroll.html")

# Display the interactive chart in the R Markdown document
plot

```


###

####  {.tabset}

****

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <span style="background-color:#e9f5db; color:black; border-radius:5px; padding: 0.2em 0.5em; display: inline-block;">📄 <b>By Capabilities </b></span>
</p>


##### Number of companies

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This table shows how often each capability appears among the analyzed companies. Each row is a capability; the columns "Capability 1" through "Capability 5" give the number of companies that list it in that position, and the "Total" column sums those counts. Rows are sorted by total, giving an overview of how capabilities are distributed across companies and of the strengths and focus areas within the EdTech sector.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The capabilities were taken from the HolonIQ source and are explained in more detail earlier in the report.</p>
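<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
As a minimal sketch of how such a count table can be built, the example below uses three made-up companies and two capability positions. The data frame <code>companies</code> and the capability names are hypothetical and are not taken from the HolonIQ data.</p>

```r
# Hypothetical companies, each listing capabilities in order of position
companies <- data.frame(
  `Capability 1` = c("Tutoring", "Assessment", "Tutoring"),
  `Capability 2` = c("Assessment", "Tutoring", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Reshape to long format: one row per (capability, position) pair
long <- data.frame(
  Capability = unlist(companies, use.names = FALSE),
  Position = rep(names(companies), each = nrow(companies)),
  stringsAsFactors = FALSE
)

# Count appearances per capability and position (table() ignores the NA cell)
counts <- table(long$Capability, long$Position)

# Row totals correspond to the "Total" column of the table
totals <- rowSums(counts)
totals
```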

```{r, echo=FALSE}

# Load the necessary libraries
library(readxl)
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)

# Load the data
data <- read_excel("final_output.xlsx")

# Combine all capabilities into a single long-format dataframe
capability_data <- data %>%
  select(`Capability 1`, `Capability 2`, `Capability 3`, `Capability 4`, `Capability 5`) %>%
  gather(key = "Capability_Type", value = "Capability") %>%
  filter(!is.na(Capability))

# Count the occurrences of each capability in each column
capability_counts <- capability_data %>%
  count(Capability_Type, Capability) %>%
  spread(key = Capability_Type, value = n, fill = 0) %>%
  mutate(Total = `Capability 1` + `Capability 2` + `Capability 3` + `Capability 4` + `Capability 5`) %>%
  arrange(desc(Total))

# Create the table using kable
kable_table <- capability_counts %>%
  kable("html", col.names = c("Capability", "Capability 1", "Capability 2", "Capability 3", "Capability 4", "Capability 5", "Total"), align = 'c') %>%
  kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  column_spec(1, bold = TRUE, color = "purple") %>%
  column_spec(7, bold = TRUE, color = "purple") %>%
  add_header_above(c(" " = 1, "Number of Appearances" = 6)) %>%
  scroll_box(height = "500px", width = "100%")

# Save the table as an HTML file
save_kable(kable_table, "capability_ranking_table.html", self_contained = TRUE)

# Display the table
kable_table


```

##### Score

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Score by Capabilities</strong></p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
This table weights each capability by the position in which companies list it, so that the capabilities can be ranked by importance. The weights are assigned as follows:</p>
<ul style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
  <li>Capability 1 is multiplied by 5</li>
  <li>Capability 2 is multiplied by 4</li>
  <li>Capability 3 is multiplied by 3</li>
  <li>Capability 4 is multiplied by 2</li>
  <li>Capability 5 is multiplied by 1</li>
</ul>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The score for each capability is the sum of these weighted counts, so capabilities listed in earlier positions contribute more. The table shows, for each capability, the number of appearances in each position and the resulting score, sorted from highest to lowest.</p>
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
<strong>Note:</strong> The capabilities were taken from the HolonIQ source and are explained in more detail earlier in the report.</p>
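<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The weighting can be illustrated with a short worked example. The counts below are hypothetical values for a single capability, chosen only to show the arithmetic.</p>

```r
# Hypothetical counts for one capability: number of companies listing it in each position
counts <- c(`Capability 1` = 10, `Capability 2` = 4, `Capability 3` = 0,
            `Capability 4` = 2, `Capability 5` = 7)

# Position weights: the first position counts five times, the fifth counts once
weights <- c(5, 4, 3, 2, 1)

# Weighted score: 10*5 + 4*4 + 0*3 + 2*2 + 7*1 = 77
score <- sum(counts * weights)

# Unweighted total for comparison: 10 + 4 + 0 + 2 + 7 = 23
total <- sum(counts)
```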


```{r, echo=FALSE}
# Load the required libraries
library(readxl)
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)

# Load the data
data <- read_excel("final_output.xlsx")

# Combine all capabilities into a single long-format data frame
capability_data <- data %>%
  select(`Capability 1`, `Capability 2`, `Capability 3`, `Capability 4`, `Capability 5`) %>%
  gather(key = "Capability_Type", value = "Capability") %>%
  filter(!is.na(Capability))

# Count the occurrences of each capability in each position and compute the weighted score
capability_counts <- capability_data %>%
  count(Capability_Type, Capability) %>%
  spread(key = Capability_Type, value = n, fill = 0) %>%
  mutate(
    Score = (`Capability 1` * 5) + (`Capability 2` * 4) + (`Capability 3` * 3) + (`Capability 4` * 2) + (`Capability 5` * 1)
  ) %>%
  arrange(desc(Score))

# Create the table using kable
kable_table <- capability_counts %>%
  kable("html", col.names = c("Capability", "Capability 1", "Capability 2", "Capability 3", "Capability 4", "Capability 5", "Score"), align = 'c') %>%
  kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  column_spec(1, bold = TRUE, color = "purple") %>%
  column_spec(7, bold = TRUE, color = "purple") %>%
  add_header_above(c(" " = 1, "Number of Appearances" = 5, " " = 1)) %>%
  scroll_box(height = "500px", width = "100%")

# Save the table as an HTML file
save_kable(kable_table, "capability_ranking_table_with_scores.html", self_contained = TRUE)

# Display the table
kable_table

```



###

<a id="discussion"></a>

<p style="font-family: Arial, sans-serif; font-size: 22px; color: #333333; line-height: 1.5; text-align: center; margin-bottom: 20px;">
  <b><span style="color: #9381ff; font-size: 26px;"> |</span> Opportunities for Improvement</b>
</p>


<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The analysis of EdTech companies has revealed several areas for improvement:
</p>

<ul style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left; padding-left: 20px;">
    <li><strong>Data Refinement:</strong> The data on each company's relationship with the Open edX platform needs to be improved. Clear and accurate information on this point will lead to more precise insights.</li>
    <li><strong>Definition of eduNEXT:</strong> The concept of "eduNEXT" needs to be clearly defined and incorporated into the prompts given to the model. This will ensure a better understanding and identification of the company's role in the EdTech ecosystem.</li>
    <li><strong>Client vs. Partner Roles:</strong> The ambiguity between the 'client' and 'partner' roles needs to be addressed. More advanced models such as GPT-4o may provide deeper contextual analysis and better differentiation.</li>
    <li><strong>Role Classification:</strong> When asking the ChatGPT API for roles, including a "none of the above" category would have been beneficial. This would avoid forcing the assignment of a nonexistent relationship.</li>
    <li><strong>Investment Data Source:</strong> The investment information used in the analysis came solely from Crunchbase data retrieved via LinkedIn. Future analyses would be more comprehensive with additional sources of investment data, especially for companies in Asia, where LinkedIn usage is less prevalent.</li>
</ul>
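
<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
One possible way to apply the "none of the above" idea when processing model answers is sketched below. The function <code>normalize_role</code> and the label <code>"NONE"</code> are hypothetical and were not part of the original analysis; the four other labels are the roles used in this report.</p>

```r
# Allowed role labels; "NONE" is a hypothetical "none of the above" category
allowed_roles <- c("CLIENT", "PARTNER", "COMPETITOR", "PROVIDER", "NONE")

# Map raw model answers to an allowed label, falling back to "NONE"
# for missing or unrecognised answers instead of forcing a role
normalize_role <- function(answer) {
  label <- toupper(trimws(answer))
  ifelse(label %in% allowed_roles, label, "NONE")
}

roles <- normalize_role(c("client", " Partner ", "unrelated", NA))
roles
```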

<p style="font-family: Arial, sans-serif; font-size: 14px; color: #333333; line-height: 1.5; text-align: left;">
The project shows significant potential for further exploration and development. As it is still in progress, any feedback and recommendations are highly appreciated to refine and improve the results.
</p>


****



<a href="#top" class="go-top">Go to Top</a>

<style>
.go-top {
  position: fixed;
  bottom: 20px;
  right: 20px;
  text-decoration: none;
  background-color: #f2f2f2;
  color: black;
  padding: 10px 15px;
  border-radius: 5px;
  font-size: 12px;
  z-index: 1000;
}
</style>

<script>
document.querySelector('.go-top').addEventListener('click', function(e) {
  e.preventDefault();
  window.scrollTo({top: 0, behavior: 'smooth'});
});
</script>