The Power of GPT-4 & LLMs in Venture Capital: A 2023 Guide to AI-Based Startup Sourcing
April 8, 2023
Are you still manually searching for new startups? Have you ever struggled to find specific startups or scale-ups that work on edge case technologies?
At Startup Scout, we've been there too. We understand how hard it can be to find, evaluate, and contact interesting startups before they're approached by others or become irrelevant.
Since 2016, we've completed 750 scouting projects in 42 countries, helping accelerators, corporates and VCs identify and approach the most promising startups. In the past, we had 35 student analysts who helped us clean data and research startup founders.
Since then we've evolved into a technology company. Recently, we revealed all our secrets in a guest post for Andre Retterath’s Data-driven VC newsletter.
This blog post is an extended version of that article. It also covers other parts of the process: sourcing, outreach, and (for accelerators) screening.
In the upcoming chapters, we will be revealing for the first time our well-refined process and the technologies we have built.
By reading this blog post, you will learn:
- What data sources we use to scout startups and how to scrape them
- How you can get data from LinkedIn (without getting your account blocked)
- How to utilize ML to distinguish between a startup and an ordinary company
- The role GPT plays in our stack, along with an example of a prompt
- What's new with OpenAI GPT-4?
- How we ensure that our database is up-to-date and enriched with necessary information
- How we use blacklists and deconflicting rules to avoid reaching out to the wrong contacts and overlaps between programs
- What you need to do to run a large-scale outreach campaign
- Our infrastructure and the recommended sending limits for best possible deliverability rate
- Why many startups drop out during the application process and how we address it
- How we streamline the evaluation procedure and speed up the process from days to hours
Challenges of AI-based Startup Scouting:
Startup scouting can be a challenging and time-intensive process, and there are several pain points that our customers commonly experience:
- Identifying the right data sources: With so many options to choose from, picking the right sources for scouting can be quite a challenge. While traditional Crunchbase-like platforms such as Dealroom, Pitchbook, and Tracxn can be great starting points, most founders, especially those outside the US, don’t create a Crunchbase profile as the first thing when starting their company. Public sources like LinkedIn, Facebook groups, and other online communities can be a much better resource for discovering new, promising firms.
- Quality and relevance: These are especially important in VC, as screening capacity is always limited and hard to scale. What defines an interesting company also varies from fund to fund. For example, an impact or food-oriented VC fund might have a different scope than a B2B SaaS or FinTech fund. So it can feel like searching for a needle in a haystack when you're trying to find ideas that fit specific niches or technologies.
- Right people with contact details: When researching relevant startups, it's essential to also identify the key people in charge, typically the founders, and their contact details.
- Deduplication and data merging: To avoid duplications and ensure that you are not rediscovering companies and people that you have already found, it's critical to keep your database clean and organized through deduplication and data merging.
- Timing is critical when scouting startups since there is a limited window when a startup is most relevant and can quickly become irrelevant.
- When conducting large-scale outreach the key is to ensure high deliverability, open rate, and reply rate by using the right tools and techniques.
- Blacklists and Deconfliction Rules: Operating across different branches in multiple countries requires implementing blacklists and deconfliction rules to ensure that you don't reach out to people you shouldn't. This includes not contacting people who have already been contacted by your colleagues or another branch.
- Ensuring scalability and consistency is critical for staying competitive. The key is to implement the right tools and systems to manage and organize data, automate processes as much as possible, and maintain the same level of quality and attention to detail across all programs or branches.
- Time-intensive evaluation requires prioritizing relevant startups and automating certain tasks.
- The rising costs: Improving the effectiveness and automating or streamlining processes is cheaper and more effective than hiring more people.
Utilizing Machine Learning to Enhance Human Intelligence
At Startup Scout, we've developed a unique process for startup scouting that combines the latest technology with human expertise to deliver efficiency and accuracy that was not achievable a year ago.
Our process has been refined over the years to enable us to complete 750 scouting projects in 42 countries. In the next chapters, we'll describe our unique process step-by-step, which is illustrated in the picture below, and share the technology we use at each stage, along with its benefits.
Our Process in Three Parts:
- Startup Sourcing and Data Enrichment
- Campaign Management
- Application Screening & Evaluation
Each part plays a critical role in identifying and connecting with the most promising startups worldwide.
Exploring Data Sources: Where to Find Input Datasets for Next-Generation AI Models
In this section, we'll guide you through our process step-by-step and explain where our data comes from and how we find relevant startups and up-to-date contacts on founders.
Where are our data from?
At Startup Scout, we use over 1000 data sources to find companies from around the world that might be startups. To ensure a high quantity of data, we have our own team of part-time junior developers who focus solely on data scraping. One of the most important data sources is LinkedIn, which we use to find anyone who has recently set their job title to “founder”, “co-founder”, “CEO”, “CTO”, or another relevant position. Here is the list of all the types of sources that we crawl:
Regularly scraped data sources:
- LinkedIn profiles of anyone who has recently set their job title to founder, co-founder, CEO, COO, or CTO, or who is working in stealth mode at a new company
- Major global startup databases: We use Crunchbase-like databases, as well as directories such as Product Hunt and crowdfunding platforms, to discover new and emerging startups. If you don’t know where to start, this is a good place to begin.
- Minor (local) startup databases: We also use smaller, more localized databases that focus on specific regions or industries to find startups that might be overlooked by larger databases.
- Startup tech media: We also monitor startup-focused media outlets and blogs to stay up-to-date on the latest trends, technologies, and companies in the startup world.
- Relevant Facebook groups and other communities: We leverage Facebook groups and other online communities to identify founders and other key personnel. Example: https://www.facebook.com/groups/AustrianStartupPinwall/
Ad Hoc data sources:
- Startup Conference networking apps: We use apps that are designed to facilitate networking at startup and tech conferences, which can help us to identify new companies and connect with founders and investors. After the conference, we create a list of attendees (usually consisting of name, company name, and company type) from the app. Then we use our technology to enrich this data with the right URLs, LinkedIn profiles, contact details, and other relevant information.
- Portfolio of startups on the websites of incubators, competitions, accelerators, and VCs: We explore the websites of these organizations to identify startups that they are working with or have invested in, which can provide valuable insights into new and promising companies.
Techniques and Tools for Scraping Public and Private Data: Read on for Tips
While it doesn't require rocket science-level knowledge, web scraping and crawling do require some technical expertise. We recommend using Python packages such as Scrapy or Selenium, whose documentation includes basic code snippets. You can find a useful post on this topic here. Moreover, forums like Stack Overflow contain threads that supply complete code for scraping specific data sources.
There are also numerous ready-to-use scraping tools and services like Browse.ai, Apify, or Phantombuster that may suffice for smaller use-cases.
What Software to Use to Scrape LinkedIn?
Extracting information from LinkedIn manually can be a daunting task, but automated tools can help. By scheduling data collection and notifying you of any changes, you can save time and effort. For smaller projects or infrequent updates, tools like DuxSoup and Phantombuster can efficiently scrape LinkedIn data.
IMPORTANT: Automating your LinkedIn actions can lead to banning or losing your accounts, but when this happens is unpredictable. LinkedIn is not very consistent as to when they block your account.
Anyway here are some recommended max limits for LinkedIn Profile page extractions:
How to extract data from LinkedIn on a larger scale?
None of the above-mentioned LinkedIn automation tools is scalable or cost-efficient for larger-scale use cases. But do you know who the biggest scrapers out there are?
Search engines! We can leverage this fact to scrape public LinkedIn profiles without needing a LinkedIn profile or even being signed in.
While building our new SaaS tool Pipebooster.io, we found that we can use various “hacking” methods to scrape public LinkedIn profiles that have been indexed by search engines (Google, Bing, DuckDuckGo, Yandex, Seznam). By using a cascade of search queries, we can narrow down the results page to show only what we need.
Example of a simple search query:
site:linkedin.com/in/ intitle:"CEO" OR "Founder" AND Company XYZ
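Queries like this are easy to generate in bulk. Below is a minimal Python sketch, not our production code: the function name `build_linkedin_query` is hypothetical, and grouping the title terms in parentheses and quoting the company name are our assumptions about how to make the boolean logic unambiguous. Operator support differs between search engines, so test the output on the engine you target.

```python
def build_linkedin_query(company, titles=("CEO", "Founder")):
    """Build a search-engine query for public LinkedIn profiles at one company."""
    # One intitle: clause per job title, joined with OR and grouped in
    # parentheses so the OR does not swallow the company name.
    title_clause = " OR ".join(f'intitle:"{title}"' for title in titles)
    # Quoting the company name asks the engine to match it as a phrase.
    return f'site:linkedin.com/in/ ({title_clause}) "{company}"'


query = build_linkedin_query("Company XYZ")
print(query)
```

Running the same function over a list of company names gives one query per company, which can then be sent to whichever search engine or search API you use.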
Our data scraping process lets us add a new data source within a few hours of receiving a ticket. This is especially useful when we need to search for startups in a region or industry that was previously not well covered, such as Africa or Southeast Asia, allowing us to quickly expand our reach and identify new and emerging companies.
Benefits: Comprehensive and always-growing database of startups, giving us a vast quantity of companies to analyze and evaluate, as well as the flexibility to quickly and easily expand our reach by adding new sources as needed
Challenges: Too many companies. With so many companies to consider, it can feel like looking for a needle in a haystack.
With millions of new companies every year, it would be impossible to manually determine which ones are startups. That's where our advanced technology and processes come in.
Classification and Industry Categorization Powered by AI
So now we have a lot of data from various sources, far too much for human analysts to handle alone. The next step is to identify which of the companies we found are startups and which most probably are not (marketing agencies, advisory firms, kebab kiosks, ice coffee shops...), and to assign each one to the most fitting industries.
In the next chapter, I will describe how we trained our own (economical) machine-learning classifier, whose results are then fine-tuned with (costly) OpenAI GPT models.
As mentioned before, different types of funds may have varying criteria for identifying relevant deal-flow.
For example, a food tech or impact accelerator may have different criteria for what qualifies as a startup than a B2B SaaS VC that only invests in high-growth companies. Making things even more challenging, many of our clients are less concerned with whether a company is a startup or not and are more interested in the specific technology.
Optimizing Efficiency and Reducing Costs: Train Your Own ML Classifier to Filter Out Irrelevant Companies
Using only a GPT model to analyze thousands of websites can become very expensive, especially if you have a lot of data sources to process and the majority of the researched companies are irrelevant.
That’s why we have trained our own ML classifier:
To classify whether a company is a startup or not, we rely on an advanced machine-learning model called XLNet. XLNet is a state-of-the-art machine learning model that is widely used in natural language processing tasks. It has been shown to outperform other popular language models like BERT in various tasks, including question answering, sentiment analysis, and document ranking.
We trained the XLNet model on a dataset of 8 million companies that were tagged by our team of 35 student analysts over the past 7 years. To tag such a large number of companies, we developed our own data tagging console along with a Chrome extension that automatically loads and opens websites and LinkedIn profiles for each company. This frees up our analysts to focus solely on evaluating and tagging the data.
At Startup Scout, we not only classify whether a company is a startup or not but also categorize it into one of 32 general industries, with the possibility for a company to belong to multiple industries.
Interesting facts: On average, each analyst using our data tagging console and chrome extension was able to evaluate around 200-300 websites per hour. To tag 8 million companies, we invested a total of 32,000 man-hours into data tagging. To put it in perspective, this is equivalent to 35 people working full-time for 5 months.
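The arithmetic behind these figures can be checked in a few lines. This is only a back-of-the-envelope sketch; the 250 websites per hour is the midpoint of the 200-300 range quoted above, not a separately measured value.

```python
def tagging_hours(companies, sites_per_hour):
    """Person-hours needed to tag `companies` at a given hourly rate."""
    return companies / sites_per_hour


COMPANIES = 8_000_000

# Midpoint of the quoted 200-300 websites/hour range.
hours_at_250 = tagging_hours(COMPANIES, 250)

# The slowest and fastest quoted rates bracket the estimate.
hours_at_200 = tagging_hours(COMPANIES, 200)
hours_at_300 = tagging_hours(COMPANIES, 300)

# Spread 32,000 hours over 35 people and 5 months.
hours_per_person_month = 32_000 / (35 * 5)

print(hours_at_250, hours_at_200, round(hours_at_300), round(hours_per_person_month, 1))
```

At 250 websites per hour the total comes to 32,000 hours, and spreading that over 35 people for 5 months gives roughly 183 hours per person per month, which is in the range of a full-time workload.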
Challenge: It is very difficult for people to maintain focus for hours, and as a result, the generated dataset is very generic and may not be 100% accurate. However, this process allows us to cost-efficiently filter out most of the irrelevant companies before they get into our database.
OpenAI's GPT Releases: Transformative Tools for Categorizing Companies
A few months back, training new XLNet-based ML classifiers to filter out companies not aligned with each fund's specific focus (industry, business model, or technology) was a costly and challenging task. However, we discovered that, unlike GPT-2, the latest OpenAI GPT models (3 and 3.5) are effective tools for accurately categorizing and classifying companies based on summarized content from their website and LinkedIn descriptions.
What's new with OpenAI GPT-4?
- increased word limit (from 3,000 to 25,000 words)
- knows more languages than its predecessor
- improved "intelligence": it scores around the top 10% of test takers on a simulated bar exam (according to OpenAI)
- it understands image input
- accuracy: GPT-3.5 would sometimes invent answers. The new model should be a step forward, as OpenAI reports it is 40% more likely to produce factual responses than GPT-3.5 (and 82% less likely to respond to requests for disallowed content)
- speed (however, I was not able to confirm this, as it seems to depend heavily on the complexity of the answers)
We recently tested GPT-4 alongside our already extensive experience with GPT-3.5, and we'd like to give you some feedback. While GPT-3.5 was pretty good for analyzing most industries, we found it to be unpredictable sometimes, especially when processing large volumes of websites.
Our brief tests with GPT-4 API, which was easy to switch to for developers already using OpenAI's models, yielded better results for more complex industries, such as social impact businesses. We also observed that GPT-4 is more consistent and accurate than its predecessor.
In conclusion, the new model further enhances our process as it can address gaps and edge cases that GPT-3 struggled with. Overall, we're impressed with GPT's performance since GPT-3 and look forward to incorporating it further into our work.
Fine-tuning with OpenAI’s GPT API (3.5 & 4)
Thanks to the API of pre-trained GPT models, there is no longer a need to train a new machine-learning classifier for each fund. Instead, you can create prompts specific to each fund and feed the GPT models via API with the website and LinkedIn profile of each company. This allows us to determine whether they fit even the most specific use cases.
To obtain this data, we have developed our own web scraping robot that visits each company's website and looks for sections such as "about". It then scrapes and summarizes the relevant information for further analysis.
Classifying millions of websites directly with OpenAI GPT-3.5 would be a very costly and inefficient approach since the price of each request is calculated in tokens based on the length of the prompt + the length of the answer. As a result, analyzing companies that are most likely irrelevant would still use up a considerable amount of tokens.
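To make the cost argument concrete, here is a rough estimator. Every number in it is an illustrative assumption (company counts, token counts, and especially the price per 1,000 tokens, which changes over time and differs per model), so plug in current pricing before relying on the result.

```python
def estimate_cost(num_companies, prompt_tokens, answer_tokens, usd_per_1k_tokens):
    """Estimate API spend when every company costs prompt + answer tokens."""
    total_tokens = num_companies * (prompt_tokens + answer_tokens)
    return total_tokens / 1000 * usd_per_1k_tokens


# Placeholder assumptions, not measured values:
PROMPT_TOKENS = 600   # instructions + summarized website + LinkedIn text
ANSWER_TOKENS = 5     # a short "YES RELEVANT" / "NOT RELEVANT" style reply
PRICE = 0.002         # USD per 1,000 tokens, placeholder only

# Sending every scraped company to the API...
full_cost = estimate_cost(2_000_000, PROMPT_TOKENS, ANSWER_TOKENS, PRICE)
# ...versus only a pre-filtered shortlist.
shortlist_cost = estimate_cost(3_000, PROMPT_TOKENS, ANSWER_TOKENS, PRICE)

print(f"all companies: ${full_cost:,.2f}, shortlist only: ${shortlist_cost:,.2f}")
```

The absolute figures matter less than the ratio: the spend scales linearly with the number of companies sent to the API, and it is repeated for every fund-specific question you ask.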
However, with the help of appropriate prompts, we have been able to fine-tune the results of our XLNet-based models through the GPT-3 API in a cost-efficient manner to customize each search according to the specific requirements.
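The two-stage idea can be sketched as follows. This is a simplified illustration rather than our actual pipeline: `stub_classifier` and `stub_llm` are hypothetical stand-ins for the XLNet-based classifier and the GPT call, and the sample records are made up.

```python
from typing import Callable, Dict, List, Tuple

Company = Dict[str, str]


def two_stage_filter(
    companies: List[Company],
    cheap_filter: Callable[[Company], bool],
    llm_check: Callable[[Company], bool],
) -> Tuple[List[Company], int]:
    """Run the cheap classifier first; call the costly LLM only on survivors."""
    relevant = []
    llm_calls = 0
    for company in companies:
        if not cheap_filter(company):
            continue  # rejected cheaply, no tokens spent
        llm_calls += 1
        if llm_check(company):
            relevant.append(company)
    return relevant, llm_calls


# Hypothetical stand-ins for the real models.
def stub_classifier(company: Company) -> bool:
    return company["label"] == "startup"


def stub_llm(company: Company) -> bool:
    return "cultivated meat" in company["summary"].lower()


sample = [
    {"name": "A", "label": "startup", "summary": "Cultivated meat grown from cells"},
    {"name": "B", "label": "agency", "summary": "Marketing services"},
    {"name": "C", "label": "startup", "summary": "Payroll software"},
]

relevant, llm_calls = two_stage_filter(sample, stub_classifier, stub_llm)
print([c["name"] for c in relevant], llm_calls)
```

In this toy run only two of the three companies reach the expensive step, and the counter makes it easy to track how many paid requests the pre-filter saves.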
Example of a typical process - looking for a specific BioTechnology:
Find biotech startups that are working on eliminating the need for animals in the food production process.
How we proceed:
- We first filter out non-startup companies and select only those within the biotech industry using our pre-trained ML algorithms.
- This narrows down the list from millions to just a few thousand companies.
- Then, we use the GPT-3 API to feed in summarized website content and LinkedIn descriptions for each shortlisted biotech company directly from our internal dataportal.
- With a specific prompt tailored to each customer's unique use case, we ask GPT-3 to determine whether each company is working on eliminating the need for animals in the food production process.
- In the last step, we check who the founders are and get their contact information.
Example prompt:
“Review the company website and short description provided and answer the following question: Is the company a biotech company that is working on eliminating the need for animals in the food production process?
If YES, respond with "YES RELEVANT".
If NOT, respond with "NOT RELEVANT".”
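A prompt like this can be templated so that only the question changes per client. The sketch below shows one way to build the prompt and interpret the reply. The helper names (`build_prompt`, `parse_reply`) and the layout of the appended company data are our assumptions, and the actual API request is deliberately left out so the example does not depend on a particular client library version.

```python
PROMPT_TEMPLATE = (
    "Review the company website and short description provided and answer "
    "the following question: {question}\n"
    'If YES, respond with "YES RELEVANT".\n'
    'If NOT, respond with "NOT RELEVANT".\n\n'
    "Website summary: {website}\n"
    "LinkedIn description: {linkedin}"
)


def build_prompt(question, website, linkedin):
    """Fill the fixed instructions with one client question and one company."""
    return PROMPT_TEMPLATE.format(question=question, website=website, linkedin=linkedin)


def parse_reply(reply):
    """Map the model's reply to True / False, or None if it went off-script."""
    normalized = reply.strip().strip("\"'").upper()
    if normalized.startswith("YES RELEVANT"):
        return True
    if normalized.startswith("NOT RELEVANT"):
        return False
    return None  # unexpected answer: queue for human review


prompt = build_prompt(
    "Is the company a biotech company that is working on eliminating the "
    "need for animals in the food production process?",
    "We grow cultivated meat directly from animal cells.",
    "Food biotech startup.",
)
print(prompt)
print(parse_reply('"YES RELEVANT"'), parse_reply("I am not sure."))
```

Forcing a fixed reply vocabulary keeps parsing trivial, and returning `None` for anything else gives analysts a clear queue of cases to double-check.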
Interesting fact: This process not only accelerates our startup scouting efforts, but it also provides results that are comparable to, and often more accurate than, what we were able to obtain from our analysts. The difference lies in focus. Humans excel at understanding a client’s context and at figuring out and fine-tuning the right prompts for the AI. Once the right prompt is written, AI can keep repeating the process indefinitely, while humans tend to lose focus and become disengaged after only a few hours of such work.
Enhancing Company and Team Data:
Now that we have a database of highly specific companies, the next step is to identify their founders and key personnel who run the company. We also need to obtain their contact details so that we can reach out to them for potential collaborations or partnerships.
Below is a description of our data enrichment process:
- Enriching company data with contact persons, finding verified emails of founders, their personal LinkedIn profiles, and other data points
- All data are stored in one central location where they are merged, deduplicated, and missing data is enriched
- Non-operational startups are excluded from the database
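As an illustration of the merge-and-deduplicate step, the sketch below groups records by normalized website domain and fills empty fields from later duplicates. Using the domain as the matching key is an assumption made for this example. A production system would typically match on more signals (company name, LinkedIn URL, and so on).

```python
from urllib.parse import urlparse


def domain_key(url):
    """Normalize a website URL to a bare lowercase domain."""
    if "://" not in url:
        url = "http://" + url  # urlparse only fills netloc when a scheme is present
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def merge_records(records):
    """Merge records sharing a domain; the first non-empty value per field wins."""
    merged = {}
    for record in records:
        target = merged.setdefault(domain_key(record["website"]), {})
        for field, value in record.items():
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())


records = [
    {"website": "https://www.example.com/about", "name": "Example", "email": ""},
    {"website": "example.com", "name": "", "email": "ceo@example.com"},
]
companies = merge_records(records)
print(companies)
```

Here two partial records from different sources collapse into one entry that carries both the company name and the contact email.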
It is crucial to ensure that you are only reaching out to the active founders of a startup team. Databases are not always reliable as they often contain very outdated information, including individuals who have already left. As a solution, our system conducts real-time searches for founders and other key personnel using specific search query strings like:
"site:linkedin.com/in/ intitle: 'CEO' OR 'Founder' OR 'Owner' OR 'Co-founder' AND <Company Name>".
To efficiently find and verify the most accurate email addresses of researched founders, we utilize a cascade of our own proprietary email hunting tool, coupled with several 3rd party providers and Zerobounce verification tool. Our email hunting tool uses 32 common patterns to guess email addresses, and each step is verified. If our tool is unable to find or verify the correct contact, we automatically search in 3rd party databases, starting with cheaper providers and escalating to more expensive ones like ZoomInfo, with each step again verified by Zerobounce. All automatically.
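The pattern-guessing step can be sketched like this. The five patterns shown are common examples chosen for illustration, not our actual list of 32, and `verify` is a hypothetical callable standing in for whichever verification service you use.

```python
def candidate_emails(first, last, domain):
    """Generate likely addresses for a person from common naming patterns."""
    first, last = first.lower(), last.lower()
    local_parts = [
        first,                  # jane@
        f"{first}.{last}",      # jane.doe@
        f"{first[0]}{last}",    # jdoe@
        f"{first}{last}",       # janedoe@
        last,                   # doe@
    ]
    return [f"{local}@{domain}" for local in local_parts]


def find_email(first, last, domain, verify):
    """Return the first candidate the verifier accepts, else None."""
    for address in candidate_emails(first, last, domain):
        if verify(address):
            return address
    return None  # caller would fall back to third-party databases here


# Stub verifier for the example; a real one would call a verification API.
known_good = {"jdoe@example.com"}
found = find_email("Jane", "Doe", "example.com", lambda a: a in known_good)
print(found)
```

Ordering the patterns from most to least common keeps the number of verification calls low, and a `None` result is the natural trigger for escalating to paid providers.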
Interesting facts: We also had to include a process for database cleaning (checking whether a startup is non-operational). It’s an automatic script that regularly checks whether the websites in our database are still functioning. If a website is down, or if it redirects to another company's website, we flag the company as inactive or merge it with another company in our database (in cases of mergers, exits, or name changes).
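A minimal version of such a check could look like the sketch below, using only the Python standard library. The status labels and the rule "a different domain after redirects means redirected" are simplifying assumptions. Real-world checks also need retries, since a single failed request does not prove a company is gone.

```python
import urllib.request
from urllib.parse import urlparse


def _host(url):
    """Lowercase host without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_site(original_url, final_url, status):
    """Decide a company's status from the outcome of one HTTP request."""
    if status is None or status >= 400:
        return "inactive"
    if _host(original_url) != _host(final_url):
        return "redirected"  # candidate for a merger, exit, or rename review
    return "active"


def check_site(url, timeout=10):
    """Fetch the URL (redirects are followed automatically) and classify it."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_site(url, response.geturl(), response.status)
    except (OSError, ValueError):
        # Covers DNS failures, timeouts, HTTP error codes, and malformed URLs.
        return "inactive"


print(classify_site("https://a.com", "https://www.a.com/home", 200))
print(classify_site("https://a.com", "https://b.com", 200))
```

Keeping the decision logic in `classify_site` separate from the network call makes it easy to test the rules without touching the internet.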
2 - Campaign Management
In this section, we'll talk about managing sourcing campaigns. Each year, we handle hundreds of these campaigns, which present significant challenges that we must overcome. These challenges include conducting outreach on a large scale, coordinating with other marketing activities, and implementing blacklists and deconfliction rules to ensure scalability and consistency. Let's take a closer look at how we address these challenges.
Our dataportal is the core of our startup scouting operations, where all data is centralized, deduplicated, and enriched to ensure the accuracy and relevance of information. With multiple modules available, our team can set up startup sourcing projects by selecting the industry and utilizing pre-defined or custom GPT prompts to feed into our AI.
Within a few hours, we can obtain the final results which are then double-checked by our analysts before being added to a new outreach or other marketing campaign. Contact details are verified, and we use LinkedIn to confirm the founders' current status within the company.
Outreach and High Deliverability Infrastructure
The success of your outreach campaign depends on 3 main components:
1. quality of the list - how well the people on your list fit your persona
2. content you send - timing and framing of the value
3. actual infrastructure responsible for sending the email
At Startup Scout, we always emphasize the importance of focusing on high quality and relevance rather than quantity when it comes to outreach campaigns. However, we understand that sometimes our clients are working with tight deadlines and need to approach a large number of startups quickly. In these situations, having a powerful, high-deliverability infrastructure for sending emails is key.
That's why we've developed our own proprietary outreach tool that allows us to manage and control all of our activities and campaigns efficiently. Our tool offers personalization options, automatic sequence steps, and the ability to blacklist replies and forward messages to program managers. We also prioritize deliverability by ensuring that our tool adheres to daily sending limits and sends emails through multiple warmed-up domains, with random pauses between each message sent. With this approach, we achieve the best possible results for our outreach campaigns.
Always remember that every message you send in a sequence should be highly relevant and personalized for each startup you're reaching out to.
SMTP, SPF, and DKIM settings
To ensure the best possible deliverability for our email outreach campaigns, we implement a range of measures, including creating additional domains, optimizing domain settings, and using SPF validation and DKIM signing to reduce spam scores. Our blacklists are automatically applied to outreach lists and updated from an inbound mailbox to prevent communication with people who have already been contacted by other programs. Our software also makes the process easier with features like automatic rejection of applicants and reminders. By using the Gmail API internally, our software sends emails through Google IPs, which makes it easier to achieve high deliverability rates than using your own domain, SMTP provider, and configuration.
Our infrastructure consists of a network of self-hosted mail servers, IP rotation, and no limits on scaling up, so our clients can focus on the quality of their value proposition and message content while we handle the technical details.
Gradually scaling up your outreach and sending engaging emails is key, as is splitting domains to isolate the effects of different types of emails. We recommend setting up a new domain with a two-week warm-up period to build up your reputation, then steadily increasing your volume without big spikes that could disrupt your progress. Our own list screening and monitoring tool catches problems early to avoid email deliverability issues.
Recommendations for limits:
- 1 domain, 2 mailboxes, max 500 emails/day
- 2 domains, 4 mailboxes, max 1000 emails/day
- 4 domains, 10 mailboxes, max 5000 emails/day
Note: The limits increase gradually, with an initial limit of 50 emails/day/mailbox that can be increased by 10-15% per day.
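The ramp-up described in the note can be turned into a day-by-day schedule. The sketch below assumes a per-mailbox target of 250 emails/day (500 emails/day split across 2 mailboxes, from the first recommendation above) and rounds each day's limit down. Both choices are ours for illustration.

```python
def warmup_schedule(start=50, daily_growth=0.10, target=250):
    """Daily per-mailbox limits, growing by a fixed rate until the target."""
    schedule = []
    limit = float(start)
    while limit < target:
        schedule.append(int(limit))  # round down to stay under the limit
        limit *= 1 + daily_growth
    schedule.append(target)  # cap the final day at the target itself
    return schedule


slow = warmup_schedule(daily_growth=0.10)
fast = warmup_schedule(daily_growth=0.15)
print(len(slow), len(fast))
print(slow[:3], slow[-1])
```

Under these assumptions a mailbox reaches 250 emails/day after 18 days at 10% daily growth and after 13 days at 15%, which is in line with the roughly two-week warm-up period recommended above.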
Blacklists and Deconfliction Rules:
Establishing control over who has been contacted and who shouldn't be contacted is particularly crucial when running multiple programs in different countries or involving numerous team members and external partners in the scouting process.
At Startup Scout, we've developed blacklisting and deconfliction rules to avoid cannibalizing client programs that overlap and focus on similar countries, industries, or stages.
The blacklisting system centralizes a list of friends, partners, alumni, and people you are already in touch with, and automatically adds all individuals who apply to an acceleration program or who are in the application process to the blacklist. Furthermore, if someone replies to any of our email campaigns, our outreach tool immediately stops contacting them to avoid duplicate communication.
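In code, the core of such a blacklist is a filter applied to every outreach list before sending, plus a hook that adds anyone who replies. The sketch below is a simplified illustration. The function names and the choice to block by both exact email and whole domain are assumptions for this example.

```python
def apply_blacklist(contacts, blocked_emails, blocked_domains):
    """Return only the contacts that are safe to email."""
    blocked_emails = {email.lower() for email in blocked_emails}
    blocked_domains = {domain.lower() for domain in blocked_domains}
    allowed = []
    for contact in contacts:
        email = contact["email"].lower()
        domain = email.rsplit("@", 1)[-1]
        if email in blocked_emails or domain in blocked_domains:
            continue  # already a partner, applicant, alumnus, or replier
        allowed.append(contact)
    return allowed


def register_reply(sender_email, blocked_emails):
    """Stop all future sequence steps for someone who has replied."""
    blocked_emails.add(sender_email.lower())


blocked = {"alumni@startup-a.com"}
contacts = [
    {"email": "alumni@startup-a.com"},
    {"email": "ceo@partner.org"},
    {"email": "founder@startup-b.io"},
    {"email": "cto@startup-c.dev"},
]

register_reply("CTO@startup-c.dev", blocked)  # reply arrives mid-campaign
to_send = apply_blacklist(contacts, blocked, {"partner.org"})
print([c["email"] for c in to_send])
```

Lowercasing on both sides avoids missing a match because of capitalization differences between the inbound mailbox and the outreach list.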
To avoid any overlap or duplication of outreach efforts, Startup Scout employs a system of deconflicting rules that rely on temporary blacklists to split our outreach lists between programs scouting for similar startups in the same industry or geography.
As an example, suppose two accelerators are scouting for similar startups - one in France and another in New York. In this case, we reserve all European startups for a few weeks before the French accelerator's deadline exclusively for the French outreach campaign. If the European startups do not apply or reply during that period, we transfer them to the outreach campaign for New York accelerator, and vice versa. This deconfliction rule ensures that all programs have a fair chance of discovering and engaging with promising startups primarily in their respective regions.
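The reservation logic in this example can be expressed as a small rule. The sketch below is hypothetical: the `Program` fields, the 21-day window (standing in for "a few weeks"), and the dates are illustrative. It covers only the time-based reservation, since startups that applied or replied are handled separately by the blacklist.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Program:
    name: str
    home_region: str
    deadline: date
    reserve_days: int = 21  # assumed length of the exclusive window


def is_reserved_for(program, startup_region, today):
    """True while a home-region startup is held exclusively for this program."""
    window_start = program.deadline - timedelta(days=program.reserve_days)
    return (
        startup_region == program.home_region
        and window_start <= today <= program.deadline
    )


def can_contact(program, startup_region, today, all_programs):
    """A program may reach out unless another program currently holds the startup."""
    return not any(
        other is not program and is_reserved_for(other, startup_region, today)
        for other in all_programs
    )


france = Program("France accelerator", "Europe", date(2023, 6, 30))
new_york = Program("New York accelerator", "North America", date(2023, 7, 31))
programs = [france, new_york]

# During the French window, European startups are off-limits for New York.
print(can_contact(new_york, "Europe", date(2023, 6, 15), programs))
# After the French deadline they are released to the New York campaign.
print(can_contact(new_york, "Europe", date(2023, 7, 5), programs))
```

The same check works in the other direction: North American startups would be held for the New York program during its own window before being released to the French campaign.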
This approach can also be used based on startup maturity, stage, or accelerator deadlines to ensure personalized and relevant outreach to each startup.
3 - Application Screening & Evaluation
In the last section, we will discuss a common challenge faced by our clients when evaluating startups and how we have overcome it. By analyzing countless application forms, we have developed our own solution that effectively addresses these challenges.
One of the main challenges our clients face is the loss of potential startup applicants during the application process. Often, startups give up on the application due to complicated forms, having to create an account before starting the application, or a lack of automatic reminders to follow up. Meanwhile, program managers may ask questions in a non-quantifiable way or include too many questions that aren't essential, making it difficult to filter startups based on specific criteria during the evaluation process.
We've analyzed countless application forms to create one that solves this problem permanently.
- No need to register - unlike other forms, ours doesn't require applicants to create an account, making it easy to use
- Leverage pre-made templates: We offer a wide selection of pre-made templates that can be used and adapted to meet your specific needs, saving you time and effort.
- Embed the Branded Form Right on Your Website: Our application form can be embedded on your website or shared via a link, allowing startups to easily fill out the form without being redirected to a separate page.
- Email Reminders & Auto-Saving: Our form has an auto-saving feature and sends automatic reminders to ensure that the application is completed on time. It is also fully GDPR compliant.
- Quickly Evaluate Applicants: Our Evaluation Tool allows you to invite team members to evaluate applications quickly and efficiently, streamlining the evaluation process.
- Sync With Your CRM: Our form can be integrated with your CRM or our database of startups, making it easy to find the most relevant startups and lead them directly to your application form.
In the past, many of our clients used simple application forms on their websites and exported data into spreadsheets to evaluate startups. This was a labor-intensive process that lacked a professional look. Evaluators often had to juggle multiple presentations stored across various folders, making the process even more cumbersome.
With our evaluation platform, evaluating startups is now easier and more professional than ever before. Here are the benefits of using our platform:
- All necessary information, including applicant data, pitch decks, videos, and more, is accessible from one centralized interface, making it easy for evaluators to review each application.
- Multiple evaluators can be invited to collaborate on the platform, which streamlines the team evaluation process.
- Specific evaluation flows can be created based on industry or other tags, ensuring that each evaluator has access to only the relevant startups.
- The platform includes automated rejection and acceptance emails, saving evaluators time and effort.
- The platform is user-friendly and professional-looking, making the evaluation process more efficient and less labor-intensive.
Overall, our evaluation platform allows startups to be evaluated in a streamlined and professional manner, which saves time and increases the likelihood of finding the best candidates.
In conclusion, by reading this article, you have learned how we replaced the need for 35 analysts in startup scouting by training our own language-model classifier and fine-tuning its results with GPT. Our process has achieved efficiency and accuracy that weren't possible before, and we hope that you have gained valuable insights into the benefits and challenges of using AI in your own startup scouting efforts.
We believe that this is just the beginning. With the current rapid development of generative AI, we can expect more and more innovation in the other stages of the investment process as well, such as screening and due diligence. We see the venture capital industry today as being where hedge funds were years ago, which means there's ample opportunity for automation and enhancements.
Demo and free data sample:
If you're interested in seeing our technology in action, please reach out to us for a data sample of startups specific to your needs. Additionally, we're open to licensing our technology to companies who would like to use it themselves.
Book a meeting with the founder of Startup Scout: https://calendly.com/vlastimil/30
And don't forget to check out our new SaaS tool, Pipebooster.io, which can:
- enrich your startup conference data with contact details and LinkedIn profiles of attendees
- automatically track all new founders in your target area
- fill your database with new startup leads
- notify you when someone in your network starts a new startup.
CEO of Startup Scout by Leadspicker
+420 775 68 64 70
About the author
Vlastimil Vodicka is a startup founder with a Venture Capital background. In recent years, with his co-founder, he has built a technology startup that Deloitte has recognized as the 16th fastest-growing technology company in the Central European Deloitte Fast 50 2019 program. After having bootstrapped and earned the first million dollars themselves, Leadspicker landed $2 million in seed funding from Reflex Capital and J&T Ventures.
Whether you’re an accelerator, innovation lab, or VC/PE fund, we cherry-pick all the startup leads you'll ever need, so you can focus on what matters. Let us find you the next unicorn!