| Introduction and Objective
The primary objective of this project was to explore emerging companies and trends in the EdTech industry, as featured on HolonIQ’s Global EdTech 1000 list. The initiative aimed to build a comprehensive understanding of the current market landscape, categorize the startups by focus area, and pinpoint potential clients, providers, partners, competitors, and emerging markets. The analysis was structured by geographic region and industry vertical, ultimately providing strategic insights into eduNext’s position within the industry and how to strengthen its market positioning and growth.
🚨 Why the HolonIQ EdTech Ranking was Selected
The HolonIQ EdTech ranking was selected due to its reputation as a comprehensive and authoritative source of information on the most innovative and high-potential EdTech startups globally. The ranking provides valuable insights into the leading companies in the EdTech sector, making it an ideal resource for identifying key players and emerging trends in the industry. By leveraging this ranking, eduNext can ensure its analysis is based on credible and up-to-date information, enabling informed decision-making and strategic planning.
💫 Company Overview
eduNext is an infrastructure-as-a-service provider and software company dedicated to the Open edX platform. It empowers organizations worldwide by delivering robust, scalable, and customizable online learning solutions. eduNext’s mission is to enhance the quality of education through technology, supporting successful online learning initiatives across a wide range of sectors.
| Project Overview
The project began by generating a list of the 1,000 EdTech startups featured in the HolonIQ ranking, including links to their LinkedIn profiles. A scraper was developed to extract information from LinkedIn, and additional data was gathered from the startups’ websites and saved in text files. The collected information was then fed to the ChatGPT API with specific prompts designed to build a detailed profile of each company and to answer questions aimed at classifying the startups and identifying potential business opportunities for eduNext. This methodology provided a comprehensive, detailed understanding of each startup and supported strategic insights for eduNext’s market positioning and growth.
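As an illustration of this last step, the sketch below shows one way the collected LinkedIn and website text could be sent to the ChatGPT API for profiling. The model name, prompt wording, and the `build_profile` helper are illustrative assumptions, not the exact prompts used in the project.

```python
# Minimal sketch of the profiling step (assumed prompt and model; not the exact ones used).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_profile(company_name: str, linkedin_text: str, website_text: str) -> str:
    """Ask the model to summarize a company and classify it for eduNext."""
    prompt = (
        f"Company: {company_name}\n\n"
        f"LinkedIn data:\n{linkedin_text}\n\n"
        f"Website data:\n{website_text}\n\n"
        "Build a short profile of this company (focus area, products, target market, region) "
        "and classify it as a potential client, provider, partner, or competitor for eduNext, "
        "an Open edX infrastructure-as-a-service provider."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content
```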
| HolonIQ EdTech Ranking Overview
📘 Introduction to HolonIQ
HolonIQ is a global market intelligence firm specializing in the education, climate, and health sectors. Each year, it publishes the Global EdTech 1000 list, which highlights the most promising startups in the educational technology (EdTech) field at both global and regional levels. Sublists include the “Top 200 EdTech in North America,” the “Top 50 EdTech in Australia and New Zealand,” and many more for different regions worldwide.
🔍 How Does the HolonIQ Ranking Work?
Evaluation and Selection:
Evaluation Criteria:
Selection Process:
📊 Importance of the Ranking
These rankings not only highlight the most promising startups but also help connect different regions of the world and share innovations that can improve educational outcomes globally. Promising startups are those that show exceptional potential in terms of innovation, market impact, and growth trajectory, making them key players in driving the future of education. Additionally, they provide investors and other stakeholders with a clear view of emerging trends and the companies leading the change in education.
For more details on HolonIQ rankings and methodologies, you can visit their official website: HolonIQ.
| Methodology
📋 Generating the list of 1,000 companies
The project leveraged a curated list of 1,000 EdTech startups compiled by HolonIQ. This list, provided in image format, segments the startups by geographic region (Africa, Nordic-Baltic, South Asia, etc.). The initial step was therefore to extract the key information from each entry: the company name, the LinkedIn profile URL, and the company website address. This extraction was a collaborative, largely manual effort involving several team members, with support tools such as Gemini and GPT-3 used to speed up the work.
This is an example of the source format of the lists mentioned above.
For this analysis, we used the 2023 edition of the HolonIQ Global EdTech 1000 lists.
Once this process was done, the list looked like this:
📋 LinkedIn Scraper
The compiled list initially contained 1,046 entries. After review, only 937 turned out to be unique companies: because the list was assembled manually by several team members, some entries were duplicated. Of these 937 unique companies, 117 had no LinkedIn page, so the LinkedIn scraping process was run against the remaining 820 companies.
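A minimal sketch of this deduplication and coverage check, assuming the manually compiled entries were merged into a single spreadsheet with `Name` and `LinkedIn` columns (the file and column names are illustrative):

```python
# Illustrative deduplication/coverage check (file and column names are assumptions).
import pandas as pd

df = pd.read_excel("holoniq_edtech_1000.xlsx")       # merged manual lists (hypothetical file name)
print("Total entries:", len(df))                     # e.g. 1046

unique = df.drop_duplicates(subset="Name")           # drop entries added twice by different teammates
print("Unique companies:", len(unique))              # e.g. 937

no_linkedin = unique["LinkedIn"].isna().sum()        # companies with no LinkedIn page in the list
print("Without LinkedIn:", no_linkedin)              # e.g. 117

to_scrape = unique.dropna(subset=["LinkedIn"])       # the companies fed to the scraper
to_scrape["LinkedIn"].to_csv("profiles.txt", index=False, header=False)
```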
The results are summarized in the following table:
| Category | Number of companies |
|---|---|
| Total Companies in List | 1046 |
| Total Unique Companies | 937 |
| Companies without LinkedIn in list | 117 |
| Total companies with LinkedIn | 820 |
| LinkedIn Retrieved | 707 |
| LinkedIn Not Retrieved (Needs Review) | 113 |
| LinkedIn Retrieved Webpage | 703 |
The links that could not be retrieved were often in LinkedIn’s school format (`linkedin.com/school/...`) rather than the company format (`linkedin.com/company/...`) that the scraper targets. Links in the company format worked correctly; here are some examples, and a quick way to flag school-format URLs up front is sketched after the table:
| Company Name | URL |
|---|---|
| byteXL | https://in.linkedin.com/company/bytexl |
| Toodle | https://in.linkedin.com/company/toodlerungta |
| Eupheus Learning | https://in.linkedin.com/company/eupheus-learning |
| 10 Minute School | https://bd.linkedin.com/company/10ms |
| Adda247 | https://in.linkedin.com/company/adda247 |
| Apars Classroom | https://bd.linkedin.com/company/aparsclassroom |
| EduGorilla | https://in.linkedin.com/company/edugorilla-pvt-ltd |
| Infinity Learn | https://in.linkedin.com/company/infinity-learn-by-sri-chaitanya |
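As a minimal sketch (not part of the original pipeline), one way to separate company-format from school-format LinkedIn URLs before scraping, so that school pages can be set aside for manual review:

```python
# Quick pre-filter for LinkedIn URL formats (illustrative helper, not part of the original scraper).
from urllib.parse import urlparse

def is_company_url(url: str) -> bool:
    """Return True for linkedin.com/company/... URLs, False for /school/ and other formats."""
    parts = urlparse(url).path.strip("/").split("/")
    return len(parts) >= 2 and parts[0] == "company"

urls = [
    "https://in.linkedin.com/company/bytexl",
    "https://www.linkedin.com/school/some-university/",  # hypothetical school-format link
]
company_urls = [u for u in urls if is_company_url(u)]
review_urls = [u for u in urls if not is_company_url(u)]
```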
Note: The technical details of the data collection process are included below.
This section explains how we collected and processed the data. In short, the scraper:
- reads the list of LinkedIn company URLs from a file (`profiles.txt`);
- requests each public company page with an asynchronous HTTP client, retrying with exponential backoff and falling back to a plain `requests` call when LinkedIn answers with HTTP status 999;
- parses each page’s JSON-LD metadata and HTML sections (about, addresses, affiliated and similar pages, and Crunchbase funding information);
- saves the results to Excel workbooks (`company_information.xlsx`, `company_addresses.xlsx`, `affiliated_pages.xlsx`, `similar_pages.xlsx`), plus a `failed_links.xlsx` file for URLs that could not be scraped.
This approach allowed us to gather comprehensive information about each company, which was then analyzed to identify potential business opportunities for eduNext. The full scraper script is shown below.
```python
import jmespath
import asyncio
import json
from typing import List, Dict, Tuple
from httpx import AsyncClient, Response, RemoteProtocolError
from parsel import Selector
from loguru import logger as log
import pandas as pd
import re
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
import os
import time
import requests
# Initialize an async httpx client
client = AsyncClient(
http2=True,
headers={
"Accept-Language": "en-US,en;q=0.9",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
},
follow_redirects=True
)
def strip_text(text):
"""Remove extra spaces while handling None values."""
return text.strip() if text is not None else text
def get_actual_url(link):
parsed_url = urlparse(link)
query_params = parse_qs(parsed_url.query)
return query_params['url'][0] if 'url' in query_params else link
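# parse_company() below combines two parsing passes over the same page: parsel/XPath for the
# JSON-LD block, the about-us fields, addresses, and affiliated/similar pages, and BeautifulSoup
# for the profile header, follower count, website link, and the Crunchbase funding section.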
def parse_company(response_text: str) -> Dict:
"""Parse company main overview page."""
selector = Selector(response_text)
script_data = selector.xpath("//script[@type='application/ld+json']/text()").get()
if script_data:
script_data = json.loads(script_data)
else:
script_data = {}
script_data = jmespath.search(
"""{
name: name,
url: url,
mainAddress: address,
description: description,
numberOfEmployees: numberOfEmployees.value,
logo: logo
}""",
script_data
) or {}
data = {}
for element in selector.xpath("//div[contains(@data-test-id, 'about-us')]"):
name = element.xpath(".//dt/text()").get().strip()
value = element.xpath(".//dd/text()").get().strip()
data[name] = value
addresses = []
for element in selector.xpath("//div[contains(@id, 'address') and @id != 'address-0']"):
address_lines = element.xpath(".//p/text()").getall()
address = ", ".join(line.replace("\n", "").strip() for line in address_lines)
addresses.append(address)
affiliated_pages = []
for element in selector.xpath("//section[@data-test-id='affiliated-pages']/div/div/ul/li"):
affiliated_pages.append({
"name": element.xpath(".//a/div/h3/text()").get().strip(),
"industry": strip_text(element.xpath(".//a/div/p[1]/text()").get()),
"address": strip_text(element.xpath(".//a/div/p[2]/text()").get()),
"linkeinUrl": element.xpath(".//a/@href").get().split("?")[0]
})
similar_pages = []
for element in selector.xpath("//section[@data-test-id='similar-pages']/div/div/ul/li"):
similar_pages.append({
"name": element.xpath(".//a/div/h3/text()").get().strip(),
"industry": strip_text(element.xpath(".//a/div/p[1]/text()").get()),
"address": strip_text(element.xpath(".//a/div/p[2]/text()").get()),
"linkeinUrl": element.xpath(".//a/@href").get().split("?")[0]
})
# Additional fields from the second script
soup = BeautifulSoup(response_text, 'html.parser')
title_tag = soup.find('title')
designation_tag = soup.find('h2')
followers_tag = soup.find('meta', {"property": "og:description"})
description_tag = soup.find('p', class_='break-words')
website_tag = soup.find('a', attrs={'data-tracking-control-name': 'about_website'})
website = get_actual_url(website_tag['href']) if website_tag else "Website not found"
description_span = soup.find('h4', class_='top-card-layout__second-subline')
description = description_span.get_text(strip=True) if description_span else "Description not found"
# Crunchbase funding information
funding_section = soup.find('section', attrs={'data-test-id': 'funding'})
if funding_section:
all_rounds_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_all-rounds'})
if all_rounds_tag:
all_rounds_match = re.search(r'(\d+ total rounds)', all_rounds_tag.get_text(strip=True))
all_rounds_info = all_rounds_match.group(1) if all_rounds_match else "All rounds info not found"
else:
all_rounds_info = "All rounds info not found"
last_round_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_last-round'})
if last_round_tag:
last_round_info = last_round_tag.find('time').get_text(strip=True)
last_round_amount_tag = funding_section.find('p', class_='text-display-lg')
last_round_amount = last_round_amount_tag.get_text(strip=True) if last_round_amount_tag else "Last round amount not found"
last_round_link = last_round_tag['href']
last_round_formatted_date = last_round_tag.find('time')['datetime']
else:
last_round_info = "Last round info not found"
last_round_amount = "Last round amount not found"
last_round_link = "Last round link not found"
last_round_formatted_date = "Last round date not found"
investors_tag = funding_section.find('a', attrs={'data-tracking-control-name': 'funding_investors'})
investors_info = investors_tag.get_text(strip=True) if investors_tag else "Investors info not found"
else:
all_rounds_info = "Crunchbase funding info not found"
last_round_info = "Last round info not found"
last_round_amount = "Last round amount not found"
investors_info = "Investors info not found"
last_round_link = "Last round link not found"
last_round_formatted_date = "Last round date not found"
# Check if the tags are found before calling get_text()
name = title_tag.get_text(strip=True).split("|")[0].strip() if title_tag else "Profile Name not found"
designation = designation_tag.get_text(strip=True) if designation_tag else "Designation not found"
followers_match = re.search(r'\b(\d[\d,.]*)\s+followers\b', followers_tag["content"]) if followers_tag else None
followers_count = followers_match.group(1) if followers_match else "Followers count not found"
description_profile = description_tag.get_text(strip=True) if description_tag else "Profile Description not found"
additional_data = {
"profileName": name,
"designation": designation,
"followersCount": followers_count,
"profileDescription": description_profile,
"website": website,
"crunchbaseAllRoundsInfo": all_rounds_info,
"crunchbaseLastRoundInfo": last_round_info,
"crunchbaseLastRoundAmount": last_round_amount,
"crunchbaseInvestorsInfo": investors_info,
"lastRoundFormattedDate": last_round_formatted_date,
"crunchbaseLink": last_round_link,
}
data = {**script_data, **data, **additional_data}
data["addresses"] = addresses
data["affiliatedPages"] = affiliated_pages
data["similarPages"] = similar_pages
return data
def read_links_from_file(file_path: str) -> List[str]:
"""Read URLs from a text or Excel file."""
if file_path.endswith('.txt'):
with open(file_path, 'r') as file:
urls = file.read().splitlines()
elif file_path.endswith('.xlsx'):
df = pd.read_excel(file_path)
urls = df['Links'].tolist()
else:
raise ValueError("Unsupported file format. Please use a .txt or .xlsx file.")
return urls
# Initialize dataframes globally
df_company_info = pd.DataFrame()
df_company_addresses = pd.DataFrame()
df_affiliated_pages = pd.DataFrame()
df_similar_pages = pd.DataFrame()
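# Two fetch helpers: an async httpx call with exponential backoff for transient protocol errors,
# and a plain requests fallback used when LinkedIn responds with HTTP status 999 (request denied).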
async def fetch_with_retry(url, retries=3, backoff_factor=0.5):
    """Fetch a URL with the async client, retrying with exponential backoff."""
    for attempt in range(retries):
        try:
            response = await client.get(url)
            return response
        except RemoteProtocolError as e:
            log.error(f"Attempt {attempt + 1} for {url} failed: {e}")
            await asyncio.sleep(backoff_factor * (2 ** attempt))  # non-blocking backoff
    raise Exception(f"All {retries} attempts failed for {url}")
def fetch_with_requests(url, retries=3, backoff_factor=0.5):
headers = {
"User-Agent": "Guest",
}
for attempt in range(retries):
try:
response = requests.get(url, headers=headers)
if response.status_code == 200:
return response
except requests.RequestException as e:
log.error(f"Attempt {attempt + 1} for {url} failed: {e}")
time.sleep(backoff_factor * (2 ** attempt))
raise Exception(f"All {retries} attempts failed for {url}")
async def scrape_company(urls: List[str]) -> Tuple[List[Dict], List[str]]:
    """Scrape public LinkedIn company pages, returning parsed data and the links that failed."""
    data = []
    failed_links = []
    for url in urls:
        try:
            response = await fetch_with_retry(url)
            if response.status_code == 200:
                data.append(parse_company(response.text))
                log.success(f"Successfully scraped {url}")
            elif response.status_code == 999:  # LinkedIn denied the request; use requests as fallback
                log.warning(f"Status code 999 for {url}, switching to requests")
                response = fetch_with_requests(url)
                if response.status_code == 200:
                    data.append(parse_company(response.text))
                    log.success(f"Successfully scraped {url} with requests fallback")
                else:
                    failed_links.append(url)
                    log.error(f"Failed to scrape {url} with status code {response.status_code}")
            else:
                failed_links.append(url)
                log.error(f"Failed to scrape {url} with status code {response.status_code}")
            # Delay between requests to avoid rate limiting
            await asyncio.sleep(1)
        except Exception as e:
            failed_links.append(url)
            log.error(f"Error scraping {url}: {e}")
    return data, failed_links
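# run() drives the whole pipeline: it reads the URLs, scrapes each company, flattens the results
# into four DataFrames (company info, addresses, affiliated pages, similar pages), and writes them
# to Excel after every company so partial progress survives interruptions.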
async def run():
urls = read_links_from_file('profiles.txt') # Using the 'profiles.txt' file
profile_data, failed_links = await scrape_company(urls)
global df_company_info, df_company_addresses, df_affiliated_pages, df_similar_pages
for company in profile_data:
main_address = company.get('mainAddress', {})
company_info = {
"name": company.get("name"),
"url": company.get("url"),
"streetAddress": main_address.get("streetAddress") if main_address else None,
"addressLocality": main_address.get("addressLocality") if main_address else None,
"addressRegion": main_address.get("addressRegion") if main_address else None,
"postalCode": main_address.get("postalCode") if main_address else None,
"addressCountry": main_address.get("addressCountry") if main_address else None,
"description": company.get("description"),
"numberOfEmployees": company.get("numberOfEmployees"),
"Industry": company.get("Industry"),
"Company size": company.get("Company size"),
"Headquarters": company.get("Headquarters"),
"Type": company.get("Type"),
"Specialties": company.get("Specialties"),
"profileName": company.get("profileName"),
"designation": company.get("designation"),
"followersCount": company.get("followersCount"),
"profileDescription": company.get("profileDescription"),
"website": company.get("website"),
"crunchbaseAllRoundsInfo": company.get("crunchbaseAllRoundsInfo"),
"crunchbaseLastRoundInfo": company.get("crunchbaseLastRoundInfo"),
"crunchbaseLastRoundAmount": company.get("crunchbaseLastRoundAmount"),
"crunchbaseInvestorsInfo": company.get("crunchbaseInvestorsInfo"),
"lastRoundFormattedDate": company.get("lastRoundFormattedDate"),
"crunchbaseLink": company.get("crunchbaseLink"),
}
df_company_info = pd.concat([df_company_info, pd.DataFrame([company_info])])
for address in company.get("addresses", []):
parts = address.split(", ")
country = parts[-1] if parts else ""
company_address = {
"name": company.get("name"),
"url": company.get("url"),
"addresses": address,
"country offices": country
}
df_company_addresses = pd.concat([df_company_addresses, pd.DataFrame([company_address])])
for affiliated in company.get("affiliatedPages", []):
affiliated_page = {
"name": company.get("name"),
"url": company.get("url"),
"affiliated_name": affiliated["name"],
"industry": affiliated["industry"],
"address": affiliated["address"],
"linkeinUrl": affiliated["linkeinUrl"]
}
df_affiliated_pages = pd.concat([df_affiliated_pages, pd.DataFrame([affiliated_page])])
for similar in company.get("similarPages", []):
similar_page = {
"name": company.get("name"),
"url": company.get("url"),
"similar_name": similar["name"],
"industry": similar["industry"],
"address": similar["address"],
"linkeinUrl": similar["linkeinUrl"]
}
df_similar_pages = pd.concat([df_similar_pages, pd.DataFrame([similar_page])])
# Save to Excel after each company
df_company_info.to_excel("company_information.xlsx", index=False)
df_company_addresses.to_excel("company_addresses.xlsx", index=False)
df_affiliated_pages.to_excel("affiliated_pages.xlsx", index=False)
df_similar_pages.to_excel("similar_pages.xlsx", index=False)
if failed_links:
df_failed_links = pd.DataFrame({"failed_links": failed_links})
df_failed_links.to_excel("failed_links.xlsx", index=False)
if __name__ == "__main__":
    try:
        asyncio.run(run())
    except Exception as e:
        log.error(f"Script terminated due to an error: {e}")
        # Save what has been scraped so far
        if not df_company_info.empty:
            df_company_info.to_excel("company_information.xlsx", index=False)
        if not df_company_addresses.empty:
            df_company_addresses.to_excel("company_addresses.xlsx", index=False)
        if not df_affiliated_pages.empty:
            df_affiliated_pages.to_excel("affiliated_pages.xlsx", index=False)
        if not df_similar_pages.empty:
            df_similar_pages.to_excel("similar_pages.xlsx", index=False)
```