Blog - ScraperAPI

Build an Image Search Engine for Amazon with the ScraperAPI-LangChain Agent

Egop Gogo-Job — Fri, 21 Nov 2025 05:55:01 +0000

Image search has become an intuitive way to browse the internet. Tools like Google Lens can find identical items across different websites based on an uploaded photo, which is useful but generic.

If you live in the UK or Canada and just want search results of product listings from your local Amazon marketplace or some other local online retailer, the breadth of results Google Lens returns can be overwhelming, time-wasting, and mostly useless. Oftentimes, it will return similar items, just not readily accessible items.

Given Amazon’s scale and inventory depth, a focused search that goes straight to the right marketplace is the most efficient approach.

Our project addresses this by enabling image search, particularly for Amazon Marketplaces in any region of the world, using two separate large language models (LLMs) to analyze uploaded images and generate shopping queries.

These queries are passed to a reasoning model that uses the ScraperAPI LangChain agent to search Amazon and return structured results. To build a user interface and host our app for free, we use Streamlit.

Let’s get started!

Understanding the Search Engine’s Workflow

There are three core components of our Image Search engine that work in sequence. Claude 3.5 Sonnet reads the uploaded photo and writes a short shopping caption that captures distinct attributes of the item.

GPT-4o Mini takes that caption and forms a neat query for the Amazon marketplace you select. The ScraperAPI LangChain agent then runs the query against Amazon and returns structured results containing the title, ASIN, price, URL, and image, which the app displays right away.

Let’s take a closer look at how each of these components functions:

LangChain and ScraperAPI

LangChain agents connect a reasoning model to external tools, so the model can act, not just chat. Integrating ScraperAPI as an external tool enables the agent to crawl and fetch real-time data from the web without getting blocked.

The langchain-scraperapi package gives whatever reasoning model (an LLM) you pair with the agent access to three distinct ScraperAPI tools: ScraperAPIAmazonSearchTool, ScraperAPIGoogleSearchTool, and ScraperAPITool.

With just a prompt and your ScraperAPI key, the agent issues a tool call and ScraperAPI handles anti-bot bypassing and data extraction, returning clean, formatted data. For Amazon, the data usually comes back as structured JSON containing the title, ASIN, price, image, and URL.

Claude 3.5 Sonnet and GPT 4o Mini

In this project, Claude 3.5 Sonnet, a multimodal LLM, converts each uploaded photo into a short descriptive caption that captures the key attributes of that item.

The caption becomes the query, and GPT-4o Mini, the reasoning model paired with our agent, then interprets the caption and calls the ScraperAPI LangChain tool to run the search on the Amazon marketplace you selected.

The tool returns structured results that the app can display directly. Splitting the work this way keeps each model focused on what it does best.

Claude 3.5 Sonnet extracts the right details from the image. GPT-4o Mini handles reasoning and tool use. ScraperAPI provides stable access and structured data.

Obtaining Claude 3.5 Sonnet and GPT-4o Mini from OpenRouter

Our setup uses two separate large language models arranged in a multi-flow design. You can access LLMs from platforms like Hugging Face, Google AI Studio, AWS Bedrock, or locally via Ollama.

However, I used OpenRouter because it’s simpler to set up and supports many models through a single API, which is ideal for multi-flow LLM setups.

Here’s a guide on how to access Claude 3.5 Sonnet from OpenRouter:

After signing up on OpenRouter and verifying your email, log in and search for Claude models (or any other LLM of your choice) in the search bar:

Select Claude 3.5 Sonnet and click on the “Copy” icon just below the model’s name:

Click on “API” to create a personal API access key for your model.

Select “Create API Key” and then copy and save your newly created API key.

You do not have to repeat the entire process to access GPT 4o Mini. Simply copy and paste the model link highlighted below into the code, and your single API key will be able to access both LLMs.

Do not share your API key publicly!

Getting Started with ScraperAPI

If you don’t have a ScraperAPI account, go to scraperapi.com and click “Start Trial” to create one, or “Login” to access an existing account:

After creating your account, you’ll have access to a dashboard providing you with an API key, access to 5000 API credits (7-day limited trial period), and information on how to get started scraping.

To access more credits and advanced features, scroll down and click “Upgrade to Larger Plan.”

ScraperAPI provides documentation for various programming languages and frameworks, such as PHP, Java, and Node.js, that interact with its endpoints. You can find these resources by scrolling down on the dashboard page and clicking “View All Docs”:

Now that we’re all set, let’s start building our tool.

Building the Image Search Engine for Amazon

Step 1: Setting Up the Project

Create a new project folder, a virtual environment, and install the necessary dependencies.

mkdir amzn_image_search  # Creates the project folder
cd amzn_image_search # Moves you inside the project folder

Set up a virtual environment:

python -m venv venv

Activate the environment:

Windows:

venv\Scripts\activate

macOS/Linux:

source venv/bin/activate

Now, install the dependencies we’ll need:

pip install streamlit Pillow requests aiohttp openai langchain-openai langchain langchain-scraperapi python-dotenv

The key dependencies and their functions are:

streamlit: The core library for building and running the app’s UI.
openai: To interact with OpenRouter’s API, which is compatible with the OpenAI library’s structure.
langchain-openai: Provides the LangChain integration for using OpenAI-compatible models (like those on OpenRouter) as the “brain” for our agent.
langchain-scraperapi: Provides the pre-built ScraperAPIAmazonSearchTool that our LangChain agent will use to perform searches on Amazon.
langchain: The framework that allows us to chain together our language model (the brain) and tools (the search functionality) into an autonomous agent.
Pillow: A library for opening, manipulating, and saving many different image file formats. We use it to handle uploaded images.
requests & aiohttp: Underlying HTTP libraries used by the other packages to make API calls.
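Before moving on, it can help to confirm that the install actually succeeded. The sketch below is not part of the original app: missing_modules is a hypothetical helper that maps each pip package from the install command to the module name it is imported under, then reports any module Python cannot locate.

```python
from importlib.util import find_spec

# pip package name -> top-level module name used in import statements
PACKAGE_MODULES = {
    "streamlit": "streamlit",
    "Pillow": "PIL",
    "requests": "requests",
    "aiohttp": "aiohttp",
    "openai": "openai",
    "langchain-openai": "langchain_openai",
    "langchain": "langchain",
    "langchain-scraperapi": "langchain_scraperapi",
    "python-dotenv": "dotenv",
}


def missing_modules(package_modules: dict[str, str]) -> list[str]:
    """Return the pip package names whose modules cannot be located."""
    return [pkg for pkg, mod in package_modules.items() if find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_modules(PACKAGE_MODULES)
    print("All dependencies found." if not missing else f"Missing: {', '.join(missing)}")
```

If anything is reported as missing, re-run the pip install command with your virtual environment activated.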

Step 2: Keys, Environment, and Model Selection

Let’s set up the necessary API keys and define which AI models will be used for different tasks.

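In a file named .env, add both keys (main.py reads OPENROUTER_API_KEY as well):

SCRAPERAPI_API_KEY="Your_SCRAPERAPI_API_Key"
OPENROUTER_API_KEY="Your_OpenRouter_API_Key"

In a file named main.py, add the following code: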

import os, io, base64, json
import streamlit as st
from PIL import Image
from openai import OpenAI
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import StructuredTool
from langchain_scraperapi.tools import ScraperAPIAmazonSearchTool
from dotenv import load_dotenv
load_dotenv()
# secure api keys from .env using os
SCRAPERAPI_KEY = os.environ.get("SCRAPERAPI_API_KEY")
OPENROUTER_API_KEY_DEFAULT = os.environ.get("OPENROUTER_API_KEY")
if SCRAPERAPI_KEY:
    os.environ["SCRAPERAPI_API_KEY"] = SCRAPERAPI_KEY
else:
    print("Warning: SCRAPERAPI_API_KEY environment variable not set.")
# allocating models as per their tasks 
CAPTION_MODEL = "anthropic/claude-3.5-sonnet"  # vision model for captioning
AGENT_MODEL = "openai/gpt-4o-mini"  # reasoning model (cheaper alternative to Claude)

Here’s a breakdown of what the code above does:

Imports: All the necessary libraries for the application are imported at the top, including StructuredTool which we’ll use to create a custom, reliable search tool.
API Keys: The script handles API key management by using load_dotenv() to retrieve keys from a .env file and assigns them to variables: SCRAPERAPI_KEY and OPENROUTER_API_KEY_DEFAULT.
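Environment Setup: LangChain tools often look for API keys in environment variables. load_dotenv() already loads SCRAPERAPI_API_KEY into os.environ, so the os.environ["SCRAPERAPI_API_KEY"] = SCRAPERAPI_KEY line acts as an explicit safeguard that keeps the key available to the ScraperAPIAmazonSearchTool, while the else branch prints a warning when the key is missing.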
Model Selection: Since we’re using two different models for two distinct tasks, the CAPTION_MODEL will be Claude 3.5 Sonnet due to its multimodal capabilities. The AGENT_MODEL is GPT-4o mini because it’s cheaper and very efficient at understanding instructions and using tools, which is exactly what the agent needs to do.
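If you would rather have the app stop with a clear message than continue after a printed warning, a small fail-fast helper is one option. This is a minimal sketch, not code from the original script; require_env is a hypothetical name introduced here for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Add it to your .env file.")
    return value
```

Called after load_dotenv(), for example as SCRAPERAPI_KEY = require_env("SCRAPERAPI_API_KEY"), it surfaces a missing key at startup instead of at the first search.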

Step 3: App Configuration and UI Basics

Here we’ll configure the Streamlit page and set up some basic data structures and titles. Add this to your file:

st.set_page_config(page_title=" Amazon Visual Match", layout="wide")
st.title("Amazon Visual Product Search Engine")
AMZ_BASES = {
   "US (.com)": {"tld": "com", "country": "us"},
   "UK (.co.uk)": {"tld": "co.uk", "country": "gb"},
   "DE (.de)": {"tld": "de", "country": "de"},
   "FR (.fr)": {"tld": "fr", "country": "fr"},
   "IT (.it)": {"tld": "it", "country": "it"},
   "ES (.es)": {"tld": "es", "country": "es"},
   "CA (.ca)": {"tld": "ca", "country": "ca"},
}

Here’s what this code achieves:

st.set_page_config(…): Sets the browser tab title and uses a “wide” layout for the app.
st.title(…): Displays the main title on the web page.
AMZ_BASES: This dictionary is essential. It maps a marketplace name (for example, “ES (.es)”) to the two codes ScraperAPI needs: the tld (top-level domain, like es) and the country code for that domain. Providing both is critical to ensuring we search the correct local marketplace.
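To make the lookup concrete, here is a small sketch of how a label resolves to its codes. It uses a shortened copy of the dictionary, and marketplace_settings is a hypothetical helper that the app itself does not define.

```python
# Shortened copy of AMZ_BASES for illustration
AMZ_BASES = {
    "US (.com)": {"tld": "com", "country": "us"},
    "UK (.co.uk)": {"tld": "co.uk", "country": "gb"},
    "CA (.ca)": {"tld": "ca", "country": "ca"},
}


def marketplace_settings(label: str) -> tuple[str, str, str]:
    """Return (tld, country_code, base_url) for a marketplace label."""
    try:
        market = AMZ_BASES[label]
    except KeyError:
        raise ValueError(f"Unknown marketplace: {label!r}") from None
    tld, country = market["tld"], market["country"]
    return tld, country, f"https://www.amazon.{tld}"


print(marketplace_settings("UK (.co.uk)"))
# ('co.uk', 'gb', 'https://www.amazon.co.uk')
```

Note that the UK pairs the co.uk domain with the gb country code, which is why the two values are stored separately rather than derived from each other.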

Step 4: Creating the Image Captioning Function

This is the first major functional part of the app. It defines the logic for sending an image to the vision LLM (Claude 3.5 Sonnet) to get a descriptive caption. Continue in your file by adding this:

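# captioning stage
def _image_to_b64(pil_img: Image.Image) -> str:
   # Encode the image as PNG in memory and return it as a Base64 string
   buf = io.BytesIO()
   pil_img.save(buf, format="PNG")
   return base64.b64encode(buf.getvalue()).decode("utf-8")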
def caption_with_openrouter_claude(
       pil_img: Image.Image,
       api_key: str,
       model: str = CAPTION_MODEL,
       max_tokens: int = 96,
) -> str:
   if not api_key:
       raise RuntimeError("Missing OpenRouter API key.")
   client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)
   b64 = _image_to_b64(pil_img)
   prompt = (
       "Describe this product in ONE concise shopping-style sentence suitable for an Amazon search. "
       "Include brand/model if readable, color, material, and 3-6 search keywords. "
       "No commentary, just the search-style description."
   )
   resp = client.chat.completions.create(
       model=model,
       temperature=0.2,
       max_tokens=max_tokens,
       messages=[{
           "role": "user",
           "content": [
               {"type": "text", "text": prompt},
               {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
           ],
       }],
   )
   return resp.choices[0].message.content.strip()

Let’s break this down:

_image_to_b64: A helper function that takes an image opened by the Pillow library and converts it into a Base64 string. This is the standard format for embedding image data directly into an API request.
caption_with_openrouter_claude: Initializes the OpenAI client, pointing it to OpenRouter’s API endpoint and instructs the vision model on exactly how to describe the image: as a single, concise sentence suitable for a product search.
Finally, it sends the request and returns the clean text response from the AI model.
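The Base64 embedding step can be illustrated on its own, without Pillow or an API call. The sketch below is an illustration only: to_data_url and from_data_url are hypothetical helpers, and the eight-byte PNG signature stands in for a real image.

```python
import base64


def to_data_url(raw: bytes, mime: str = "image/png") -> str:
    """Wrap raw image bytes in a Base64 data URL."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def from_data_url(data_url: str) -> bytes:
    """Recover the original bytes from a Base64 data URL."""
    _, encoded = data_url.split(",", 1)
    return base64.b64decode(encoded)


# The first eight bytes of every PNG file (the PNG signature)
png_signature = b"\x89PNG\r\n\x1a\n"
url = to_data_url(png_signature)
print(url)
print(from_data_url(url) == png_signature)
```

The round trip shows that nothing is lost in the encoding: the model receives exactly the bytes you uploaded, just wrapped in text so they fit inside a JSON request body.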

Step 5: Initializing the LangChain Agent

This function builds the agent that will perform the Amazon search. To make our agent robust, we won’t give it the ScraperAPIAmazonSearchTool directly. Instead, we’ll wrap it in a custom StructuredTool to “lock” the marketplace settings. This prevents the agent from getting confused and defaulting to the US marketplace, amazon.com.

First, we define a function to create this “locale-locked” tool.

def make_amazon_search_tool(tld: str, country_code: str) -> StructuredTool:
   base_tool = ScraperAPIAmazonSearchTool()
   def _search_amazon(query: str) -> str:
       return base_tool.invoke({
           "query": query,
           "tld": tld,
           "country_code": country_code,
           "output_format": "json",
       })
   return StructuredTool.from_function(
       name="scraperapi_amazon_search",
       func=_search_amazon,
       description=(
           f"Search products on https://www.amazon.{tld} "
           f"(locale country_code={country_code}). "
           "Input: a plain natural-language product search query."
       ),
   )

Now, we create the agent initializer, which uses the helper function above.

# langchain agent setup
def initialize_amazon_agent(openrouter_key: str, tld: str, country_code: str) -> AgentExecutor:
   llm = ChatOpenAI(
       openai_api_key=openrouter_key,
       base_url="https://openrouter.ai/api/v1",
       model=AGENT_MODEL,
       temperature=0,
   )
   amazon_tool = make_amazon_search_tool(tld=tld, country_code=country_code)
   tools = [amazon_tool]
   prompt = ChatPromptTemplate.from_messages([
       (
           "system",
           "You are an Amazon product search assistant. "
           "You MUST use the `scraperapi_amazon_search` tool for every search. "
           "Return ONLY the JSON from the tool. Do not invent or change tld/country."
       ),
       ("human", "{input}"),
       MessagesPlaceholder(variable_name="agent_scratchpad"),
   ])
   agent = create_tool_calling_agent(llm, tools, prompt)
   return AgentExecutor(agent=agent, tools=tools, verbose=True)

The code achieves the following:

make_amazon_search_tool: This wrapper function takes the tld and country_code from the dropdown selection box and creates a new, simple tool for the agent. When the agent uses this tool, it only provides the search query. The tld and country_code are hard-coded into the tool’s _search_amazon function, guaranteeing it searches the correct marketplace.
LLM Initialization: It sets up the ChatOpenAI object, configuring it to use the AGENT_MODEL (GPT-4o mini) via OpenRouter. The temperature=0 makes the model’s responses highly predictable.
Agent Creation: It assembles the final agent using our special amazon_tool and a system prompt that explicitly tells the agent to only return the JSON from the tool. This, combined with the wrapper tool, makes parsing the results reliable.
The AgentExecutor is the runtime that executes the agent’s tasks. verbose=True is helpful for debugging, as it prints the agent’s thought process to the console.
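The “locking” idea is an ordinary closure, and it can be tried without LangChain or network access. In this sketch, fake_amazon_search is a hypothetical stand-in that only echoes its arguments, and lock_locale is an illustrative name, not part of any library.

```python
from typing import Callable


def fake_amazon_search(query: str, tld: str, country_code: str) -> dict:
    """Hypothetical stand-in for the real search tool; it only echoes its inputs."""
    return {"query": query, "tld": tld, "country_code": country_code}


def lock_locale(search: Callable[..., dict], tld: str, country_code: str) -> Callable[[str], dict]:
    """Return a search function that accepts only a query; the locale is fixed."""
    def locked(query: str) -> dict:
        return search(query=query, tld=tld, country_code=country_code)
    return locked


search_uk = lock_locale(fake_amazon_search, tld="co.uk", country_code="gb")
print(search_uk("red ceramic teapot"))
```

Because the returned function has a single query parameter, a caller (or a model) has no way to pass a different tld or country code, which is the property the wrapper tool relies on.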

Step 6: Building the User Input Interface

Now let’s build the interactive sidebar and main input column within our Streamlit app.

with st.sidebar:
   st.subheader("LLM Configuration")
   openrouter_key = st.text_input(
       "OPENROUTER_API_KEY (Unified Key)",
       type="password",
       value=OPENROUTER_API_KEY_DEFAULT,
       help="Used for both caption + agent models.",
   )
   st.markdown(f"**Vision Caption Model:** `{CAPTION_MODEL}`")
   st.markdown(f"**Agent Reasoning Model:** `{AGENT_MODEL}`")
col_l, col_r = st.columns([1, 1.25])
with col_l:
   region_label = st.selectbox("Marketplace", list(AMZ_BASES.keys()), index=0)
   selected_market = AMZ_BASES[region_label]
   marketplace_tld = selected_market["tld"]
   country_code = selected_market["country"]
   uploaded = st.file_uploader("Upload a product photo", type=["png", "jpg", "jpeg"])
   manual_boost = st.text_input(
       "Optional extra keywords",
       help="e.g. brand/model/color to append to the caption",
   )
   run_btn = st.button("Search Amazon")
with col_r:
   st.info(
       f"Flow: (1) Caption image with **{CAPTION_MODEL}** "
       f"(2) Agent with **{AGENT_MODEL}** calls ScraperAPI Amazon Search locked to "
       f"**amazon.{marketplace_tld}** (3) Display JSON results."
   )

Here’s what the code does:

Sidebar: A sidebar is created to hold the configuration. It includes a password input for the OpenRouter API key and displays the names of the two models being used.
Main Columns: The main area is split into a left column (col_l) and a right column (col_r).
col_l contains all the user inputs: the marketplace dropdown, file uploader, optional keyword box, and the search button.
Most importantly, when a marketplace is selected, we pull both marketplace_tld and country_code from the AMZ_BASES dictionary.
col_r contains an st.info box that clearly explains the app’s workflow to the user, dynamically showing which marketplace (amazon.{marketplace_tld}) is being searched.

Step 7: The Main Application Logic and Search Execution

Now to the heart of the application, where everything is tied together. This block of code runs when a user clicks the “Search Amazon” button.

if run_btn:
   if not uploaded:
       st.warning("Please upload a photo first.")
       st.stop()
   if not openrouter_key:
       st.error("Please paste your OPENROUTER_API_KEY.")
       st.stop()
   img = Image.open(io.BytesIO(uploaded.read())).convert("RGB")
   st.image(img, caption="Uploaded photo", use_container_width=True)
   with st.spinner(f"Describing your image via {CAPTION_MODEL}..."):
       try:
           caption = caption_with_openrouter_claude(img, openrouter_key)
       except Exception as e:
           st.error(f"Captioning failed: {e}")
           st.stop()
   query = f"{caption} {manual_boost}".strip()
   st.success(f"Caption: _{caption}_")
   st.write("**Agent Query:**", query)
   agent_executor = initialize_amazon_agent(
       openrouter_key,
       tld=marketplace_tld,
       country_code=country_code,
   )
   with st.spinner(
           f"Searching amazon.{marketplace_tld}"
   ):
       try:
           result = agent_executor.invoke({"input": f"Search for: {query}"})
       except Exception as e:
           st.error(f"LangChain Agent execution failed: {e}")
           st.stop()
   agent_output_str = result.get("output", "").strip()
   if not agent_output_str:
       st.error("Agent returned empty output.")
       st.stop()
   json_start_brace = agent_output_str.find('{')
   json_start_bracket = agent_output_str.find('[')
   if json_start_brace == -1 and json_start_bracket == -1:
       st.error("Agent output did not contain any valid JSON.")
       with st.expander("Debug: Raw agent output"):
           st.code(agent_output_str)
       st.stop()
   if json_start_brace == -1:
       json_start_index = json_start_bracket
   elif json_start_bracket == -1:
       json_start_index = json_start_brace
   else:
       json_start_index = min(json_start_brace, json_start_bracket)
   cleaned_json_str = agent_output_str[json_start_index:]
   try:
       decoder = json.JSONDecoder()
       raw_data, _ = decoder.raw_decode(cleaned_json_str)
   except json.JSONDecodeError as e:
       st.error(f"Failed to parse JSON from agent output: {e}")
       with st.expander("Debug: Raw agent output (before clean)"):
           st.code(agent_output_str)
       with st.expander("Debug: Sliced/Cleaned string that failed"):
           st.code(cleaned_json_str)
       st.stop()
   items = []
   if isinstance(raw_data, dict) and isinstance(raw_data.get("results"), list):
       items = raw_data["results"]
   elif isinstance(raw_data, list):
       items = raw_data
   else:
       st.warning("Unexpected JSON shape from tool. See raw output below.")
       with st.expander("Debug: Raw JSON"):
           st.json(raw_data)
       st.stop()

Let’s break it down below:

Input Validation: It first checks if an image has been uploaded and if an API key is present.
Image Processing: It opens the uploaded image file, displays it, and prepares it for captioning.
Caption Generation: It calls the caption_with_openrouter_claude function inside an st.spinner.
Query Construction: It creates the final search query by combining the AI-generated caption with any optional keywords.
Agent Execution: It initializes the agent by passing both the marketplace_tld and country_code to our initialize_amazon_agent function, so the search stays locked to the selected marketplace.
Robust JSON Parsing: This is the second critical part. The agent’s raw output can sometimes be messy (for example, invisible characters before the JSON or extra text after it ends).
1. We first find the start of the JSON ({ or [) to trim any leading junk.
2. We then use json.JSONDecoder().raw_decode(), which parses the first complete JSON value and ignores any “extra data” that comes after it, avoiding the parsing errors json.loads() would raise.
3. We then safely extract the list of products from the “results” key, or use the parsed value directly if it is already a list.
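The parsing steps can be condensed into one standalone function and tried on a deliberately messy string. This is a minimal sketch: extract_first_json is a hypothetical helper name, and the ASIN in the sample is made up.

```python
import json
from typing import Any


def extract_first_json(text: str) -> Any:
    """Parse the first JSON object or array embedded in text."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        raise ValueError("No JSON object or array found.")
    # raw_decode parses one JSON value and reports where it ended,
    # so anything after that position is simply ignored.
    value, _end = json.JSONDecoder().raw_decode(text[min(starts):])
    return value


messy = 'Here are the results: {"results": [{"asin": "B000TEST01"}]} Hope that helps!'
print(extract_first_json(messy))
# {'results': [{'asin': 'B000TEST01'}]}
```

Calling json.loads() on the same string would fail on the leading sentence, and even after trimming it would fail on the trailing one, which is why the slice and raw_decode() are used together.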

Step 8: Displaying the Search Results

The final step is to take the list of product items extracted in the previous step and render it in a user-friendly format. Add:

   if not items:
       st.warning(f"No items found on amazon.{marketplace_tld} for that query.")
       with st.expander("Debug: Raw JSON"):
           st.json(raw_data)
       st.stop()
   st.subheader(f"Results ({len(items)}) from amazon.{marketplace_tld}")
   for it in items[:24]:
       with st.container(border=True):
           c1, c2 = st.columns([1, 2])
           with c1:
               if it.get("image"):
                   st.image(it["image"], use_container_width=True)
           with c2:
               st.markdown(f"**{it.get('name', 'No Title')}**")
               asin = it.get("asin")
               if asin:
                   st.write(f"ASIN: `{asin}`")
               price = it.get("price_string")
               if price:
                   st.write(f"Price: {price}")
               url = it.get("url")
               if url:
                   st.link_button("View on Amazon", url)

The code does the following:

No Results Check: It first checks if the items list is empty and informs the user.
Results Header: It displays a subheader announcing how many results were found and from which marketplace (amazon.{marketplace_tld}).
Loop and Display: It loops through the first 24 items (items[:24]) and displays each product in a structured, two-column layout with its image, title, ASIN, price, and a direct link to the product page.
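The field handling in the loop can also be separated from the Streamlit rendering, which makes it easier to test. The sketch below is one possible refactor, not the original code: summarize_item and top_items are hypothetical helpers, and the field names (name, asin, price_string, url, image) are the ones the display code reads.

```python
def summarize_item(item: dict) -> dict:
    """Pick the display fields from one search result, with safe defaults."""
    return {
        "title": item.get("name") or "No Title",
        "asin": item.get("asin"),
        "price": item.get("price_string"),
        "url": item.get("url"),
        "image": item.get("image"),
    }


def top_items(items: list[dict], limit: int = 24) -> list[dict]:
    """Return display-ready summaries for at most `limit` results."""
    return [summarize_item(it) for it in items[:limit]]
```

Using .get() throughout means a result that lacks a price or image still renders with the fields it does have instead of raising a KeyError.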

Step 9: Running Your Application

With the entire script in place, you can now run the application from your terminal. Make sure your virtual environment is still active.

streamlit run main.py

Your web browser should open automatically and load the application. Here, main.py is the name of the file that contains your script; if you named yours differently, substitute that name.

Here’s a snippet of what the tool’s UI looks like:

Deploying the Image Search Engine App Using Streamlit

Follow the steps below to deploy your Image Search Engine on Streamlit for free:

Step 1: Set Up a GitHub Repository

Streamlit requires your project to be hosted on GitHub.

1. Create a New Repository on GitHub

Create a new repository on GitHub and set it as public.

2. Push Your Code to GitHub

Before doing anything else, create a .gitignore file to avoid accidentally uploading sensitive files such as your .env file. Add the following to it:

.env
__pycache__/
*.pyc
*.pyo
*.pyd
.env.*
.streamlit/secrets.toml

If you haven’t already set up Git and linked your repository, use the following commands in your terminal from within your project folder:

git init
git add .
git commit -m "Initial commit"
git branch -M main
# Add the remote with EITHER HTTPS or SSH, not both
# With HTTPS
git remote add origin https://github.com/YOUR_USERNAME/your-repo.git
# With SSH
git remote add origin git@github.com:YOUR_USERNAME/your-repo.git
git push -u origin main

If it’s your first time using GitHub from this machine, you might need to set up an SSH connection. Here is how.

Step 2: Define Dependencies and Protect Your Secrets!

Streamlit needs to know what dependencies your app requires.

1. In your project folder, automatically create a requirements file by running:

pip freeze > requirements.txt

2. Commit it to GitHub:

git add requirements.txt
git commit -m "Added dependencies"
git push origin main

Step 3: Deploy on Streamlit Cloud

1. Go to Streamlit Community Cloud.

2. Click “Sign in with GitHub” and authorize Streamlit.

3. Click “Create App.”

4. Select “Deploy a public app from GitHub repo.”

5. In the repository settings, enter:

Repository: YOUR_USERNAME/Amazon-Image-Search-Engine
Branch: main
Main file path: main.py (or whatever your Streamlit script is named)

6. Click “Deploy” and wait for Streamlit to build the app.

7. Go to your deployed app dashboard, find your app, and find “Secrets” in “Settings”. Add your environment variables (your API keys) just as you have them locally in your .env file.

Step 4: Get Your Streamlit App URL

After deployment, Streamlit will generate a public URL (e.g., https://your-app-name.streamlit.app). You can now share this link to allow others to access your app!

Here’s a short YouTube video demonstrating the Image Search Engine in action.

Conclusion

Congratulations. You just built an Image Search engine for Amazon. Your tool converts uploaded photos into search queries that yield targeted results based on visual similarities.

We achieved this using the ScraperAPI-LangChain agent for real-time web scraping, Claude 3.5 Sonnet for image captioning, GPT-4o Mini as a reasoning model for our agent, and Streamlit for building the UI and free cloud hosting.

The result is a fast, intuitive, and relevant tool that helps consumers find Amazon products instantly, even when they are unable to provide written search queries, thereby reducing the time to purchase and improving customer satisfaction.

The Ultimate Guide to Bypassing Anti-Bot Detection

Ize Majebi — Wed, 15 Oct 2025 14:12:33 +0000

You set up your scraper, press run, and the first few requests succeed. The data comes back exactly as you hoped, and for a moment, it feels like everything is working. Then the next request fails: a 403 Forbidden appears. Soon after, you are staring at a wall of CAPTCHAs. In some cases, there is not even an error message, and your IP is silently throttled until every request times out.

If you’ve ever tried scraping at scale, you’ve probably run into this. It’s frustrating, but it isn’t random. The web has become a tug of war between site owners and developers. On one side are businesses trying to protect their content and infrastructure. On the other are researchers, engineers, and companies that need access to that content. Anti-bot systems are designed for this fight, and they have grown into complex defenses that use IP reputation, browser fingerprinting, behavioral analysis, and challenge tests to block automation.

In this guide, you will learn what those defenses look like, why scrapers get blocked, and the strategies that actually make a difference. The goal is not to hand out short-term fixes, but to give you a clear understanding of the systems you are up against and how to build scrapers that last longer in production.

Ready? Let’s get started!

Chapter 1: Know Your Enemy: The Anatomy of a Modern Bot Blocker

If you want to bypass anti-bot systems, you first need to understand them. Bot blockers are built to detect patterns that real users rarely produce. They don’t rely on a single check but layer multiple defenses together. The more signals they collect, the more confident they become that the traffic is automated.

The easiest way to make sense of these systems is to break them down into four core pillars: IP reputation, browser fingerprinting, behavioral analysis, and active challenges. Each pillar covers a different angle of detection, and together they form the backbone of modern anti-bot defenses.

The Four Pillars of Detection

IP Reputation and Analysis

The first thing any website learns about you is your IP address. A server always sees a source IP: you can route requests through a proxy or relay, but you cannot make a request without exposing some address. That makes IP reputation the very first filter most anti-bot systems apply. If your IP does not look trustworthy, you will be blocked before the site even checks your browser fingerprint, your behavior, or whether you can solve a CAPTCHA.

Why IP Type Matters

Websites classify IP addresses by their origin, and this classification has a direct impact on your chances of being blocked.

Datacenter IPs are those owned by cloud providers such as Amazon Web Services, Google Cloud, or DigitalOcean. They are attractive because they are cheap, fast, and easy to acquire, but they are also the most heavily scrutinized. Their ranges are publicly known, and many sites blacklist them pre-emptively. Even a brand-new IP from a datacenter can be flagged without ever being used for abuse.
Residential IPs come from consumer internet providers and are assigned to everyday households. Because they blend into the regular traffic of millions of users, they are much harder to detect and block. This is why residential proxy services are valuable, although they are also costly. However, once a proxy provider is identified, its pool of residential IPs can still be marked as suspicious.
Mobile IPs belong to carrier networks. They are the hardest to blacklist consistently, because thousands of users often share the same public address through carrier-grade NAT (Network Address Translation). These IPs also change frequently as devices move across cell towers. That churn makes them appear fresh and unpredictable, and blocking one risks cutting off many legitimate users. Even so, extreme abuse from one user can still trigger blocks for everyone else on the same address.

The type of IP you use shapes your reputation before anything else is considered. A datacenter IP may be treated as suspicious even before it makes its first request. At the same time, a residential or mobile IP may earn more trust simply by belonging to a consumer or carrier network.

How Reputation Scores Are Built

Identifying your IP type is only the starting point. Websites and security providers maintain live databases of IP reputation that go far deeper. These systems assign a score to each address based on both historical evidence and real-time traffic.

Some of the most essential signals include:

Network ownership: An Autonomous System Number (ASN) identifies which organization owns a block of IPs. If the ASN belongs to a hosting provider, that alone can raise suspicion.
Anonymity markers: IPs known to be used by VPNs, Tor, or open proxy services are treated as risky.
Abuse history: If an IP has been linked to spam, scraping, or fraud in the past, that history follows it.
Request velocity: A human cannot make hundreds of requests in a second. High-volume activity is one of the clearest signs of automation.
Geographic consistency: A user’s IP location should align with their browser settings and session history. If someone appears in Canada one minute and Singapore the next, something is wrong.

The resulting score dictates how a website responds. Low-risk IPs may be allowed through without friction. Medium-risk IPs may see throttling or occasional CAPTCHAs. High-risk IPs are blocked outright with errors like 403 Forbidden or 429 Too Many Requests.
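To make the idea of a combined score concrete, here is a toy model of how the signals listed above might add up. It is purely illustrative: the weights, thresholds, and names (IpSignals, risk_score, decide) are invented for this sketch and do not describe any real vendor’s algorithm.

```python
from dataclasses import dataclass


@dataclass
class IpSignals:
    is_hosting_asn: bool        # ASN belongs to a hosting or cloud provider
    is_known_anonymizer: bool   # VPN, Tor exit node, or open proxy
    past_abuse_reports: int     # earlier spam, scraping, or fraud reports
    requests_per_second: float  # observed request velocity
    geo_mismatch: bool          # IP location disagrees with browser or session


def risk_score(s: IpSignals) -> int:
    """Combine signals into a 0-100 score using made-up illustrative weights."""
    score = 0
    score += 25 if s.is_hosting_asn else 0
    score += 25 if s.is_known_anonymizer else 0
    score += min(s.past_abuse_reports, 5) * 5  # capped at 25 points
    score += 15 if s.requests_per_second > 10 else 0
    score += 10 if s.geo_mismatch else 0
    return score


def decide(score: int) -> str:
    """Map a score to one of three responses: allow, challenge, or block."""
    if score < 30:
        return "allow"
    if score < 60:
        return "throttle_or_captcha"
    return "block"


datacenter_burst = IpSignals(True, False, 0, 50.0, False)
print(risk_score(datacenter_burst), decide(risk_score(datacenter_burst)))
```

Even in this simplified form, the takeaway holds: no single signal has to be damning, because several mild ones together are enough to push an address over a threshold.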

When a website detects suspicious traffic, it rarely stops at blocking just your IP. Most anti-bot systems are designed to think in groups, not individuals, which means the actions of one scraper can end up tainting an entire neighborhood of addresses.

At the smaller scale, this happens with subnets. A subnet is simply a slice of a larger network, carved out so that routers can manage traffic more efficiently. You’ll often see subnets written in a format like 192.0.2.0/24. This notation tells you that all the addresses from 192.0.2.0 through 192.0.2.255 are part of the same group. If a handful of those addresses start showing abusive behavior, it is much easier for a website to restrict the entire /24 block than to chase individual offenders.
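Python’s standard ipaddress module can confirm what the /24 notation covers. The snippet below uses the same documentation range as the example above; nothing here is specific to any anti-bot product.

```python
import ipaddress

subnet = ipaddress.ip_network("192.0.2.0/24")

print(subnet.num_addresses)                           # 256
print(subnet[0], subnet[-1])                          # 192.0.2.0 192.0.2.255
print(ipaddress.ip_address("192.0.2.77") in subnet)   # True
print(ipaddress.ip_address("192.0.3.77") in subnet)   # False
```

A single membership test like this is all a site needs to apply one rule to every address in the block, which is why subnet-level restrictions are so cheap to enforce.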

At a larger scale, blocking can happen at the level of an entire autonomous system (AS). The internet is made up of tens of thousands of these systems, which are large networks run by internet service providers, mobile carriers, cloud companies, universities, or government agencies. Each one manages its own pool of IP addresses, known as its “address space.” To keep things organized, every AS is assigned a unique identifier called an autonomous system number (ASN). For example, Cloudflare operates under ASN 13335, while Amazon Web Services uses several different ASNs.

Why does this matter? Because if one AS is consistently associated with scraping or fraud, websites can enforce rules across every IP inside it. That could mean millions of addresses flagged with a single policy update. This is especially common with cloud providers, since entire data center networks are publicly known and widely used by scrapers.

Browser Fingerprinting

Once websites confirm your IP looks safe, the next step is to examine your browser. This process, known as browser fingerprinting, involves collecting numerous small details about your browser to create a unique profile. Unlike cookies, which you can delete or block, fingerprinting does not rely on stored data. Instead, it takes advantage of the information your browser naturally exposes every time it loads a page.

What a Fingerprint Contains

A browser fingerprint is a collection of attributes that describe how your system looks and behaves. No single attribute is unique on its own, but when combined, they can create a profile that is very unlikely to match anyone else’s. Common components include:

User-Agent and headers: The User-Agent is a string that tells websites which browser and operating system you are using (for example, Chrome on Windows or Safari on iOS). Other headers can reveal your preferred language, supported file formats, or device type.
Screen and system settings: Your screen resolution, color depth, time zone, and whether your device supports touch input are all easy to read and can help distinguish you from others.
Graphics rendering: Websites use APIs such as Canvas and WebGL to draw hidden images in your browser. Because the result depends on your graphics card, drivers, and fonts, the output is slightly different for each machine.
Audio processing: Through the AudioContext API, sites can generate sounds that your hardware processes in unique ways. These differences become another signal in your fingerprint.
Fonts and layout: The fonts you have installed, and how your system renders text, vary across devices.
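Plugins and media devices: Browsers can reveal which plugins are exposed and whether a camera, microphone, or other media device is available. Installed extensions are not listed directly, but some can be detected through the changes they make to a page.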

When all of these signals are combined, the result is usually distinctive enough to identify one device out of millions.

How Fingerprints Are Collected

Some of these values, like the User-Agent, are shared automatically every time your browser makes a request. Others are gathered using JavaScript that runs quietly in the background. For instance, a script may tell your browser to draw a hidden image on a canvas, then read back the pixel data to see how your system rendered it. Because hardware and software vary, the results form part of a unique signature.

These details are then combined into a hash, a short code that represents the overall configuration. If the same hash appears across visits, the system knows it is dealing with the same client, even if the IP has changed or cookies have been cleared.
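As a rough illustration of that hashing step, the sketch below serializes a set of attributes in a stable order and hashes the result. The attribute names and values are hypothetical, and real fingerprinting scripts use their own formats and often fuzzier matching than an exact hash.

```python
import hashlib
import json


def fingerprint_hash(attributes: dict) -> str:
    """Hash a dict of browser attributes into a stable hex digest."""
    # Sorting keys makes the result independent of collection order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values a script might have collected.
visit_one = {
    "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "screen": "1920x1080x24",
    "timezone": "Europe/Berlin",
    "canvas": "a91f3c",   # stand-in for a digest of the rendered pixels
}
# Same device on a later visit: same values, collected in another order.
visit_two = dict(reversed(list(visit_one.items())))

print(fingerprint_hash(visit_one) == fingerprint_hash(visit_two))   # True
```

Changing the IP or clearing cookies does not touch any of these inputs, which is why the hash survives both.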

Why Automation Tools Struggle

This is also the stage where automation platforms are exposed. Tools such as Puppeteer, Playwright, and Selenium drive real browsers, often in headless mode, where pages are loaded and interacted with without a visible window. Although they are helpful for scraping, they often fail fingerprinting checks because they leak signs of automation.

A property called navigator.webdriver is usually set to true, which immediately signals automation.
Rendering in headless environments is often handled by software libraries like SwiftShader instead of a GPU, which produces outputs that differ from typical human-operated devices and can be fingerprinted.
Many browser APIs return incomplete or default values instead of realistic ones.
HTTP headers may be sent in an unusual order that does not match the patterns of real browsers.

Together, these inconsistencies make the fingerprint look unnatural. Even if your IP is clean, the browser itself gives you away.

Stability and the Growing Scope of Fingerprinting

Fingerprinting is not only about how unique a setup looks but also about how consistent it appears over time. Real users typically keep the same configuration for weeks or months, only changing after a software update or hardware replacement. Scrapers, on the other hand, often shift profiles from one session to the next. A client that looks like Chrome on Windows in one request and Safari on macOS in the next is unlikely to be genuine. Even minor mismatches, such as a User-Agent string reporting one browser version while WebGL capabilities match another, can be enough to raise suspicion.

To make detection harder to evade, websites continue expanding the range of signals they collect. In the past, some sites used the Battery Status API to collect signals like charge level and charging state, but browser vendors have since restricted or disabled this feature due to privacy concerns. Others use the MediaDevices API to identify how many microphones, speakers, or cameras are connected. WebAssembly can be used to run timing tests that expose subtle CPU characteristics, although modern browsers now limit timer precision to prevent microsecond-level leaks.

Even tools designed to protect privacy can make things worse. Anti-fingerprinting extensions often create patterns that stand out precisely because they look unusual. Instead of blending in, they can make a browser seem more suspicious.

This is why fingerprinting remains such a powerful defense. It does not depend on stored data and cannot be reset as easily as an IP address. It relies on the information your browser naturally reveals, which is very difficult to disguise. Even with a clean IP, an unstable or unrealistic fingerprint can expose a scraper before it ever reaches the target data. Managing fingerprints so that they appear natural and consistent is as essential as proxy rotation. Without it, no other bypass technique will succeed.

Behavioral Analysis (The “Turing Test”)

Even if your IP looks safe and your browser fingerprint appears realistic, websites can still catch you by looking at how you behave. This approach is known as behavioral analysis, and it is designed to spot the difference between natural human activity and automated scripts. Think of it as a digital version of the Turing Test: the site is silently asking, “Does this visitor actually move, click, and type like a person?”

People rarely interact with websites in predictable, machine-like ways. A human visitor might move the mouse in uneven arcs, scroll back and forth while reading, pause unexpectedly, or type in bursts with pauses between words. These slight irregularities form a behavioral signature.

Bots often fail at this. Many scripts execute actions with mechanical precision: clicks happen instantly, scrolling is smooth and perfectly uniform, and typing may occur at an inhumanly consistent speed. Some bots even skip interaction entirely, jumping directly to the data source they want.

Behavioral analysis systems compare these patterns to baselines collected from regular users. If your activity deviates significantly from typical patterns, the site may flag you as a bot, even if your IP and fingerprint appear legitimate.

Key Behavioral Signals

Websites collect a wide range of behavioral signals. The most common include:

Mouse movements and clicks: Human mouse paths contain tiny hesitations, jitters, and corrections. Bots either skip this step or simulate perfectly straight, robotic lines.
Scrolling behavior: Real users scroll unevenly, sometimes stopping midway, changing direction, or adjusting speed. Scripts often scroll in a linear, predictable way or avoid scrolling entirely.
Typing rhythm: Known as keystroke dynamics, this measures the timing of each keystroke. Humans type in bursts with natural pauses, while bots often fill fields instantly or type at an impossibly steady rhythm.
Navigation flow: A genuine visitor usually enters through the homepage or a category page, spends time browsing, and then reaches the data-heavy endpoint. Bots often go straight to the target URL within seconds.
Session activity: Humans vary in how long they stay on pages. Bots typically request content instantly and leave without hesitation. This makes session length a valuable signal.
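To show how one of these signals could be measured, here is a small sketch of a keystroke rhythm check. It computes the coefficient of variation (standard deviation divided by mean) of the gaps between key presses. The 0.15 threshold and the sample timings are assumptions chosen for illustration, not values taken from any real detection product.

```python
from statistics import mean, pstdev


def interval_variation(timestamps_ms: list[float]) -> float:
    """Coefficient of variation of the gaps between consecutive key events."""
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three key events")
    avg = mean(intervals)
    if avg == 0:
        return 0.0   # everything arrived at once, e.g. a pasted value
    return pstdev(intervals) / avg


def looks_scripted(timestamps_ms: list[float], min_variation: float = 0.15) -> bool:
    """Flag typing whose rhythm is too uniform (threshold is hypothetical)."""
    return interval_variation(timestamps_ms) < min_variation


bot_typing = [0, 50, 100, 150, 200]            # perfectly even gaps
human_typing = [0, 120, 210, 480, 560, 900]    # bursts and pauses

print(looks_scripted(bot_typing))     # True
print(looks_scripted(human_typing))   # False
```

Real systems combine many such measurements, so passing one check in isolation says little about the overall verdict.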

TLS and JA3 Fingerprinting

Behavioral analysis is not limited to on-page actions. It also examines how your connection behaves.

Every HTTPS connection begins with a TLS handshake (Transport Layer Security handshake). This is the negotiation where your browser and the server agree on encryption methods before any content is exchanged. Each browser, operating system, and networking library has a slightly different way of performing this handshake.

JA3 fingerprinting is a technique that takes the details of this handshake, including the protocol version, supported ciphers, extensions, and elliptic curve parameters, and generates an MD5 hash that identifies the client’s TLS stack. The hash does not single out one user; it identifies the software making the connection. If your scraper presents itself as Chrome but uses a handshake typical of Python’s requests library, the mismatch is easy to detect.

This means that even before a single page loads, your connection can betray whether you are really using the browser you claim.
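The JA3 format itself is simple enough to sketch. It joins five fields from the ClientHello (TLS version, cipher suites, extensions, elliptic curves, and point formats) with commas, uses dashes inside each field, and takes the MD5 of the resulting string. The numeric values below are illustrative and were not captured from a real browser; real implementations also drop GREASE values before hashing, which this sketch omits.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Build the comma-separated JA3 string from ClientHello fields."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in g) for g in groups]
    return ",".join(fields)


def ja3_hash(ja3: str) -> str:
    """JA3 fingerprints are the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Illustrative values only.
sample = ja3_string(
    version=771,
    ciphers=[4865, 4866, 4867],
    extensions=[0, 23, 65281],
    curves=[29, 23, 24],
    point_formats=[0],
)
print(sample)            # 771,4865-4866-4867,0-23-65281,29-23-24,0
print(ja3_hash(sample))  # 32-character hex digest
```

Because the inputs depend on the TLS library rather than on any header you set, changing the User-Agent leaves this value untouched.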

Why Behavioral Analysis Is Effective

Behavioral analysis is harder to evade than other defenses because it measures live activity rather than static attributes. You can rent residential proxies or spoof browser fingerprints, but replicating the subtle quirks of human movement, scrolling, and typing takes much more effort.

Even advanced bots that try to simulate user actions can be exposed when their patterns are compared across multiple signals. For example, mouse movement may look natural, but the navigation flow might still be too direct. Or the keystroke dynamics might be convincing, but the TLS handshake does not match the claimed browser.

This multi-layered approach is what makes behavioral analysis one of the most resilient forms of bot detection.

Behavioral analysis acts as the final checkpoint. It catches bots that slip through IP and fingerprint filters, but still fail to behave like real users. For scrapers, bypassing anti-bot systems requires more than just technical camouflage. To succeed, your traffic must not only appear legitimate on the surface but also behave in a manner that closely mirrors human browsing patterns. Without that, even the most advanced proxy rotation or fingerprint spoofing will not be enough.

Challenges & Interrogation

Even if your IP looks clean and your browser fingerprint appears consistent, websites often add one final test: an active challenge. These are designed to confirm that there is a real user on the other end before granting access.

From CAPTCHA to Risk Scoring

The earliest challenges were simple CAPTCHAs. Sites showed distorted text or numbers that humans could solve, but automated scripts could not. Over time, this expanded to image grids, such as “select all squares with traffic lights.”

Today, many sites use more subtle methods, like Google’s reCAPTCHA v2, which introduced the “I’m not a robot” checkbox and occasional image puzzles. reCAPTCHA v3 shifted further, assigning an invisible risk score in the background so most users never see a prompt. hCaptcha followed a similar model, with a stronger emphasis on privacy and flexibility for site owners.

Invisible and Scripted Tests

Modern challenges increasingly happen behind the scenes. Cloudflare’s Turnstile runs lightweight checks in the browser, only interrupting the user if something looks suspicious. Cloudflare’s Managed Challenges adapt in real time, deciding whether to show a visible test or resolve quietly based on signals like IP reputation and session history.

Websites also use JavaScript challenges, which run small scripts inside the browser. These might:

Draw hidden graphics with Canvas or WebGL to confirm rendering quirks
Measure how code executes to verify real hardware is present
Check for storage, cookies, and header consistency

Passing such tests generates a short-lived token that the server validates before letting requests continue.
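One common way to build a short-lived token is to sign an expiry time with a server-side secret. The sketch below shows that general pattern with HMAC. It is not how any particular vendor implements its tokens, and the secret, token layout, and five-minute lifetime are assumptions.

```python
import hashlib
import hmac

SECRET = b"example-secret"   # hypothetical server-side key


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(client_id: str, now: int, ttl: int = 300) -> str:
    """Issue a token that expires ttl seconds after now (Unix time)."""
    payload = f"{client_id}:{now + ttl}"
    return f"{payload}:{_sign(payload)}"


def validate_token(token: str, now: int) -> bool:
    """Accept the token only if the signature matches and it has not expired."""
    try:
        client_id, expires, signature = token.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return False   # malformed token
    expected = _sign(f"{client_id}:{expires}")
    signature_ok = hmac.compare_digest(signature.encode(), expected.encode())
    return signature_ok and now < expires_at


token = issue_token("session-1", now=1000)
print(validate_token(token, now=1100))   # True: still inside the window
print(validate_token(token, now=1400))   # False: expired at 1300
```

The short lifetime is what forces an automated client to keep passing the challenge instead of solving it once and replaying the result.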

The Push Toward Privacy

The newest trend moves away from puzzles entirely. Private Access Tokens, based on the Privacy Pass standard, allow trusted devices to prove they are legitimate without exposing identity. Instead of clicking boxes or solving images, the browser presents a cryptographic token issued by a trusted provider. Apple and Cloudflare are leading this move, aiming to remove CAPTCHA altogether for supported platforms.

Challenges and interrogation catch automated clients that may have passed IP and fingerprint checks, but still cannot prove they are genuine. The direction is clear: fewer frustrating puzzles, more invisible checks, and an emphasis on privacy-preserving tokens. For scrapers, this is often the hardest barrier to overcome, because failing a challenge does not just block access. It also signals to the site that automation is in play.

Chapter 2: The Rogues’ Gallery: A Deep Dive into Major Bot Blockers

Anti-bot vendors use the same four pillars of detection, but each adds its own methods and scale. Knowing how the big players operate helps explain why some scrapers fail instantly while others last longer.

Cloudflare

Cloudflare is the most widely deployed bot management solution, acting as a reverse proxy for millions of websites. A reverse proxy sits between a user and the website’s server, meaning Cloudflare can filter, inspect, or block traffic before the target site ever receives it.

Cloudflare uses multiple layers of defense:

I’m Under Attack Mode (IUAM): This feature activates when a site is experiencing unusual traffic. Visitors are shown a temporary interstitial page for about five seconds. During that pause, Cloudflare runs JavaScript code that collects information about the browser and verifies whether it looks legitimate. A standard browser passes automatically, while bots that cannot execute JavaScript are stopped immediately.
Turnstile: Unlike traditional puzzles, Turnstile performs background checks (for example, analyzing browser behavior and TLS handshakes) to verify real users invisibly. Only high-risk traffic sees explicit challenges, which reduces friction for humans while raising the bar for bots.
Shared IP Reputation: Cloudflare leverages its enormous footprint across the internet. If an IP is flagged for suspicious activity on one site, that information can be used to block it on others. This network effect makes Cloudflare particularly powerful at tracking abusers across domains.
Browser and TLS Fingerprinting: Beyond JavaScript challenges, Cloudflare inspects the TLS handshake (the initial negotiation that establishes an encrypted HTTPS connection). If your client claims to be Chrome but its TLS handshake matches known automation fingerprints (like those from Python libraries), it is easily exposed.

For scrapers, the greatest difficulty with Cloudflare lies in its scale and speed. Even if you rotate IPs or patch fingerprints, once a signal is flagged on one site, it can follow you everywhere Cloudflare operates.

Akamai

Akamai is one of the oldest and largest Content Delivery Networks (CDNs), and its bot management is among the most advanced. Unlike simple IP filtering, Akamai emphasizes behavioral data collection, sometimes referred to as sensor data.

What makes Akamai stand out:

Browser Sensors: JavaScript embedded in protected sites records subtle human signals: mouse movements, keystroke timing, scroll depth, and tab focus. These are compared against large datasets of genuine user activity. Bots typically generate movements that are too perfect, too fast, or missing altogether.
Session Flow Tracking: Instead of looking at single requests, Akamai evaluates the entire browsing journey. Humans usually navigate step by step: homepage, category page, product page, while bots often jump directly to data endpoints. This difference in flow is a strong detection signal.
Edge-Level Integration: Because Akamai runs at the CDN edge, it can correlate behavioral insights with network-level data:
- ASN ownership: Is the traffic coming from a consumer ISP or a known hosting provider?
- Velocity: Are requests being made faster than a human could reasonably click?
- Geolocation: Does the user’s IP location align with their browser settings and session history?

Akamai is difficult to evade because it does not rely on just one layer of detection. To succeed, a scraper must mimic both the technical footprint and the organic, sometimes messy, flow of human browsing.

PerimeterX (HUMAN Security)

PerimeterX, now rebranded under HUMAN Security, is known for its client-side detection model. Instead of relying entirely on server-side logs, PerimeterX embeds sensors that run directly in the user’s browser session.

These sensors collect thousands of attributes in real time:

Deep Fingerprinting: WebGL rendering results, Canvas image outputs, installed fonts, available plugins, and even motion data from mobile devices all contribute to a unique profile. Unlike a simple User-Agent string, these combined values are difficult to spoof convincingly.
Automation Framework Detection: Popular scraping tools often leave behind subtle flags. For example, Selenium sets navigator.webdriver = true in most configurations, which is a dead giveaway. Puppeteer in headless mode often uses SwiftShader for rendering, which can differ from physical GPU outputs. Even the order in which HTTP headers are sent can expose a headless browser.
Ongoing Validation: Many systems check once per session, but PerimeterX continues to validate throughout. If your scraper passes the first test but shows suspicious behavior five minutes later, it can still be flagged.

Because PerimeterX looks so deeply into browser environments, it is particularly good at catching advanced bots that use headless browsers. Evading it requires not just patched fingerprints but also realistic rendering outputs and consistent session behavior over time.

DataDome

DataDome emphasizes AI-driven detection across websites, mobile apps, and APIs. Unlike older providers that focus mainly on web traffic, DataDome has built systems to secure modern app ecosystems where bots target APIs and mobile endpoints.

Its system relies on:

AI and Machine Learning Models: Every request is scored against patterns learned from billions of data points. This scoring happens in under two milliseconds, fast enough to avoid slowing down user experience.
Cross-Platform Protection: Bots are not limited to browsers. Many now use mobile emulators or modified SDKs to attack APIs directly. DataDome covers all these channels, analyzing whether the client environment matches expected behavior.
Adaptive Learning: Models are updated continuously to reflect new bot behaviors, ensuring the system evolves rather than relying on static rules.
Multi-Layered Analysis: Attributes like IP reputation, HTTP headers, TLS fingerprints, and on-page behavior are combined into a holistic risk score.

For scrapers, the key challenge is the breadth of coverage. Even if you disguise your browser, an API request from the same session may expose automation. And because detection happens in real time, there is little room for trial and error before blocks are enforced.

AWS WAF

Amazon Web Services provides a Web Application Firewall (WAF) that customers can configure to block unwanted traffic. Unlike Cloudflare or Akamai, AWS WAF is not a dedicated anti-bot product but a toolkit that site owners adapt to their own needs. Its strength lies in flexibility, which means scrapers can face very different levels of difficulty depending on how it is deployed.

Typical anti-bot rules in AWS WAF include:

Managed Rule Groups: AWS and partners provide prebuilt rules that block common malicious traffic, including known scrapers and impersonators of Googlebot.
Datacenter IP Blocking: Site owners often deny requests from IP ranges associated with cloud providers. Since many scrapers rely on these datacenter IPs, this is a simple but effective filter.
Rate Limiting: Rules can cap the number of requests a single client can send in a given timeframe. Humans rarely send more than a handful of requests per second, so exceeding those limits is suspicious.
Custom Filters: Organizations can create their own detection logic, such as flagging mismatched geolocations, odd header values, or repeated patterns of failed requests.

Because AWS WAF is configurable, its effectiveness varies. Some sites may implement only the most basic rules, which are easy to bypass with proxies, while others, especially large enterprises, may deploy complex rule sets that combine multiple signals, creating protection comparable to dedicated bot management platforms.
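The rate limiting idea is easy to illustrate with a sliding window counter. In AWS WAF itself, rate-based rules are configured rather than coded, so treat the class below as a conceptual sketch with hypothetical names and limits, not as AWS’s implementation.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)   # client -> timestamps of allowed requests

    def allow(self, client: str, now: float) -> bool:
        recent = self.hits[client]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False   # over the limit: throttle or block
        recent.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
results = [limiter.allow("203.0.113.9", t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]
print(results)   # [True, True, True, False, True]
```

The fourth request is rejected because three already landed inside the same second, and the fifth passes once the oldest one has aged out.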

Each provider applies the same pillars of detection in different ways:

Cloudflare leverages scale and global IP reputation.
Akamai focuses on behavioral signals and session flow.
PerimeterX (HUMAN Security) digs deeply into client-side fingerprints and automation leaks.
DataDome uses real-time AI analysis across browsers, apps, and APIs.
AWS WAF relies on site-specific configurations that range from simple to highly sophisticated.

For scrapers, this means there is no single bypass strategy. You need to understand each system on its own terms, and a resilient scraper requires a layered approach that addresses IP, fingerprints, behavior, and challenges simultaneously.

Chapter 3: The Scraper’s Toolkit: Core Techniques for Bypassing Detection

Anti-bot systems combine multiple signals to tell humans and automation apart. That means no single trick is enough to bypass them. You need a toolkit, a set of layered techniques that work together. Each one addresses a different pillar of detection: proxies manage your IP reputation, fingerprints protect your browser identity, CAPTCHA solutions handle active challenges, and human-like behavior makes your traffic believable. The goal is not to apply these techniques halfway but to apply them consistently, because detection systems compare multiple signals at once. A clean IP with a broken fingerprint will still be blocked. A perfect fingerprint with robotic timing will also fail. The techniques below are the foundation of any resilient scraping operation.

Technique 1: Proxy Management Mastery

Proxies are the foundation of every serious scraping project. Each request you send is tied to an IP address, and websites judge those addresses long before they examine your browser fingerprint or behavior. Without proxies, you are limited to a single identity that will almost always get flagged. With them, you can multiply your presence across thousands of identities, but only if you use them correctly.

Choosing the Right Proxy

Datacenter proxies

Datacenter IPs come from cloud providers and hosting companies. They are designed for scale, which makes them cheap and extremely fast. When you need to collect data from sites that have weak or no anti-bot defenses, datacenter proxies can get the job done at a fraction of the cost of other options.

The problem is reputation. Because datacenter ranges are publicly known, websites can block entire chunks of them in advance. A site that wants to protect itself from automated scraping can blacklist entire subnets or even autonomous systems belonging to providers like AWS or DigitalOcean. That means even a “fresh” datacenter IP may already be treated with suspicion before it makes its first request. If your target is sensitive, such as e-commerce, ticketing, or finance, datacenter traffic will often be blocked at the door.

Residential proxies

Residential IPs are issued by consumer internet service providers, the same ones that connect ordinary households. From a website’s perspective, traffic from these IPs looks just like regular user activity. That natural cover gives residential proxies a much higher trust level. They are particularly effective when scraping guarded pages, logged-in content, or platforms that rely heavily on IP reputation.

The trade-off is speed and cost. Residential IPs tend to respond more slowly than datacenter IPs, and most providers charge by bandwidth rather than per IP, so costs add up quickly on large projects. They can also be targeted if abuse is concentrated. If too many suspicious requests originate from the same provider or subnet, websites can extend blocks across that range, reducing the reliability of the pool.

Mobile proxies

Mobile IPs are routed through carrier networks. Here, thousands of users share the same public IP address, and devices constantly switch towers as they move. That constant churn makes mobile IPs nearly impossible to blacklist consistently. If a site blocked one, it could accidentally cut off thousands of legitimate mobile users at once.

This makes mobile proxies one of the most potent tools for scraping heavily protected content. However, they are also the most expensive and the least predictable. Because you are sharing the address with many strangers, your session can suddenly inherit the consequences of someone else’s abusive activity. Frequent IP changes mid-session can also disrupt multi-step flows like checkouts or form submissions.

In practice, few scrapers rely on a single category. Datacenter proxies deliver speed and scale where defenses are weak, residential proxies strike a balance of cost and reliability for most guarded content, and mobile proxies are reserved for the hardest restrictions where stealth is non-negotiable.

Rotation that Feels Human

Choosing the right proxy type is only the first step. The next challenge is using those proxies in ways that resemble real browsing. Websites do not just look at which IP you use; they observe how long you use it, how often it appears, and whether its behavior aligns with a human pattern.

Rotation strategies help you manage this.

Sticky sessions: Instead of switching IPs on every request, keep the same one for a cluster of related actions. A real user browsing a shop will log in, click around, and add something to their cart without changing IP midway. Holding onto the same proxy for these flows makes your traffic believable.
Rotating sessions: For bulk crawls, such as collecting thousands of product listings, swap IPs every few requests or pages. This spreads out the workload and prevents any single IP from carrying too much risk.
Geographic alignment: If your proxy is in Germany, for example, your headers, cookies, and time zone should tell the same story. Sudden jumps from one country to another in the middle of a session are easy for defenses to spot.
Request budgets: Every IP has a lifespan. If you push it too hard with hundreds of rapid requests, it will get flagged. Assign a realistic budget of requests per IP, retire it once that limit is reached, and reintroduce it later.

The trick is balance. People do not change IPs every second, but they also do not hammer a website with thousands of requests from the same address. Rotation that feels human is about pacing and continuity, not random churn.

Keeping the Pool Healthy

Even the best proxy rotation plan will fail if the pool itself is weak. Some IPs will perform flawlessly, while others will either slow down or burn out quickly. Managing a proxy pool means constantly monitoring, pruning, and replenishing.

Metrics worth tracking include:

Block signals such as 403 Forbidden, 429 Too Many Requests, and CAPTCHA challenges
Connection health, like timeouts, TLS handshake failures, and dropped sessions
Latency and response times, which can reveal throttling or overloaded providers

When you spot problems, isolate them. Quarantine flagged IPs or entire subnets to avoid poisoning the rest of your traffic. Replace weak providers with stronger ones, and always spread your pool across multiple vendors so that one outage does not bring everything down.
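A simple way to act on these metrics is to track the block rate per proxy and quarantine any that cross a threshold. The sketch below does only that. The 20 percent threshold and ten-request minimum are assumptions, and it leaves out latency and connection errors for brevity.

```python
from collections import defaultdict

BLOCK_STATUSES = {403, 429}


class PoolHealth:
    """Track block signals per proxy and report which ones to quarantine."""

    def __init__(self, max_block_rate: float = 0.2, min_samples: int = 10):
        self.max_block_rate = max_block_rate
        self.min_samples = min_samples
        self.stats = defaultdict(lambda: {"total": 0, "blocked": 0})

    def record(self, proxy: str, status: int, captcha: bool = False) -> None:
        entry = self.stats[proxy]
        entry["total"] += 1
        if status in BLOCK_STATUSES or captcha:
            entry["blocked"] += 1

    def block_rate(self, proxy: str) -> float:
        entry = self.stats.get(proxy)
        if not entry or entry["total"] == 0:
            return 0.0
        return entry["blocked"] / entry["total"]

    def quarantined(self) -> list[str]:
        # Ignore proxies with too few samples to judge fairly.
        return sorted(
            p for p, e in self.stats.items()
            if e["total"] >= self.min_samples
            and e["blocked"] / e["total"] > self.max_block_rate
        )


health = PoolHealth()
for status in [200] * 7 + [403, 429, 403]:
    health.record("proxy-a", status)
for status in [200] * 10:
    health.record("proxy-b", status)

print(health.block_rate("proxy-a"))   # 0.3
print(health.quarantined())           # ['proxy-a']
```

The same counters can be kept per subnet or per provider to catch problems that affect a whole range rather than a single address.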

A healthy pool is a constantly moving target that requires maintenance. Skipping this step is the fastest way to turn a strong setup into a fragile one.

Putting it All Together

Mastering proxy management is about combining all three layers: choosing the right proxy type, rotating them in ways that mimic human behavior, and keeping the pool clean. Datacenter, residential, and mobile proxies each have their place, and their strengths complement one another when used strategically. Rotation rules make those IPs look natural, and pool maintenance ensures you always have healthy addresses ready.

Without this foundation, none of the other bypass techniques, like fingerprint spoofing, behavior simulation, or CAPTCHA solving, will matter. If your proxies fail, everything else falls apart.

Technique 2: Perfecting Your Digital Identity (Fingerprint & Headers)

Proxies may give you a new address on the internet, but they do not tell the whole story. Once a request reaches a website, the browser itself comes under scrutiny. This is where many scrapers fail. They might be using a clean IP, but the headers, rendering outputs, or session data they present do not resemble a real person. Fingerprinting closes that gap. To pass this test, you need to create an identity that not only looks consistent but also behaves as if it belongs to a real browser in a real location.

Choosing A Realistic Baseline

The first decision is what identity to copy. Defenders have massive datasets of how common browsers look and behave, so straying too far from the norm is risky.

A good approach is to anchor your setup in a widely used combination: for example, a current stable release of Chrome on Windows 10, or Safari on iOS. These represent large segments of real users. If you instead show up as a rare Linux build with an unusual screen resolution, or as a browser version that is years out of date, you instantly stand out. This choice becomes your baseline. Everything else, such as headers, rendering results, fonts, and media devices, must align with it.

Making Fingerprints And Networks Agree

An IP address already reveals a lot about where traffic is coming from. If your fingerprint tells a different story, detection is almost guaranteed.

Time zone, locale, and Accept-Language should reflect the region of your proxy.
A German IP, for instance, should not be paired with a US English-only browser and a Pacific time zone.
Currency, local domains, and even keyboard layouts can reinforce or break this alignment.

Think of this as storytelling. The IP and the fingerprint are two characters. If they contradict each other, the plot falls apart.
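A quick self-check before sending traffic can catch the most obvious contradictions. The sketch below compares a profile’s time zone and Accept-Language against what a proxy’s country would suggest. The lookup table is a tiny hand-written sample, not a complete or authoritative mapping.

```python
# Hypothetical expectations per proxy country (deliberately incomplete).
EXPECTED = {
    "DE": {"timezones": {"Europe/Berlin"}, "languages": {"de"}},
    "US": {
        "timezones": {"America/New_York", "America/Chicago",
                      "America/Denver", "America/Los_Angeles"},
        "languages": {"en"},
    },
}


def mismatches(proxy_country: str, timezone: str, accept_language: str) -> list[str]:
    """List the ways a browser profile contradicts the proxy's country."""
    expected = EXPECTED[proxy_country]
    problems = []
    if timezone not in expected["timezones"]:
        problems.append(f"time zone {timezone} is unusual for {proxy_country}")
    # The first entry of Accept-Language is the user's preferred language.
    primary = accept_language.split(",")[0].split("-")[0].strip().lower()
    if primary not in expected["languages"]:
        problems.append(f"primary language {primary} is unusual for {proxy_country}")
    return problems


print(mismatches("DE", "Europe/Berlin", "de-DE,de;q=0.9,en;q=0.8"))   # []
print(len(mismatches("DE", "America/Los_Angeles", "en-US,en;q=0.9")))  # 2
```

A mismatch is not proof of automation, since travelers and expats exist, but it is one more signal added to the pile.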

Building Headers That Match Real Traffic

Headers are often overlooked, yet they are one of the most powerful indicators of authenticity. Websites check not only the values but also whether the set of headers and their order match what real browsers send.

A User-Agent string must match the exact browser and version you claim.
Accept, Accept-Language, Accept-Encoding, and the newer Sec-CH-UA headers should all be present and correct.
The order matters. Real browsers send them in consistent sequences that defenders log and compare against.

Rotating only the User-Agent is a common beginner mistake. Without updating the entire header set to match, the disguise falls apart instantly.
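The User-Agent mistake is easy to test for in your own code. The sketch below assumes a Chrome-based profile (other browsers do not send Sec-CH-UA), checks that the expected headers are present, and verifies that the User-Agent and Sec-CH-UA agree on the major version. The header values are examples, and the check does not attempt to validate header order.

```python
import re

REQUIRED = ("User-Agent", "Accept", "Accept-Language", "Accept-Encoding", "Sec-CH-UA")


def header_problems(headers: dict) -> list[str]:
    """Report missing headers and Chrome version disagreements."""
    lowered = {k.lower(): v for k, v in headers.items()}   # names are case-insensitive
    problems = [f"missing {name}" for name in REQUIRED if name.lower() not in lowered]

    ua_major = re.search(r"Chrome/(\d+)", lowered.get("user-agent", ""))
    ch_major = re.search(r'"Google Chrome";v="(\d+)"', lowered.get("sec-ch-ua", ""))
    if ua_major and ch_major and ua_major.group(1) != ch_major.group(1):
        problems.append("User-Agent and Sec-CH-UA disagree on the Chrome version")
    return problems


profile = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-CH-UA": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
}
print(header_problems(profile))   # []

# Rotating only the User-Agent breaks the agreement.
rotated = dict(profile)
rotated["User-Agent"] = profile["User-Agent"].replace("Chrome/124", "Chrome/120")
print(header_problems(rotated))
```

Running a check like this on every generated profile is cheaper than discovering the inconsistency through a wave of blocks.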

Closing The Gaps In Headless Browsers

Automation tools like Puppeteer, Playwright, and Selenium are designed for control, not invisibility. Out of the box, they leak signs of automation.

navigator.webdriver is automatically set to true, which flags the browser as automated.
Properties like navigator.plugins or navigator.languages often return empty or default values, unlike real browsers.
Graphics rendered with SwiftShader in headless mode can be different from outputs produced by a physical GPU.
Headers may be sent in unnatural orders or with missing fields.

To avoid instant detection, you need to patch or disguise these gaps. Stealth plugins and libraries exist for this, but they still require careful testing and validation.
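Part of that testing can be automated. The sketch below assumes you have already collected a few browser properties into a dictionary (for example, by evaluating JavaScript in the page with your automation tool); the dictionary keys and the `automation_leaks` helper are hypothetical names chosen for illustration.

```python
def automation_leaks(props: dict) -> list[str]:
    """Flag the common headless giveaways listed above.

    `props` is assumed to hold values read from the browser, e.g.
    {"webdriver": ..., "plugins": [...], "languages": [...], "webgl_renderer": "..."}.
    """
    leaks = []
    if props.get("webdriver") is True:
        leaks.append("navigator.webdriver is true")
    if not props.get("plugins"):
        leaks.append("navigator.plugins is empty")
    if not props.get("languages"):
        leaks.append("navigator.languages is empty")
    renderer = str(props.get("webgl_renderer", ""))
    if "swiftshader" in renderer.lower():
        leaks.append("WebGL renderer reports SwiftShader (software rendering)")
    return leaks
```

An empty result does not prove the browser is undetectable; it only means these particular checks passed.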

Making Rendering Outputs Believable

Fingerprinting relies heavily on how your system draws graphics and processes audio.

Canvas and WebGL outputs should align with the GPU and operating system you claim. A Windows laptop should not render like a mobile device.
Fonts must match the declared platform. A Windows profile with macOS-only fonts raises alarms.
AudioContext results must remain stable across a session, since real hardware does not change its sound processing randomly.

These details are subtle, but together they form a signature that is hard to fake and easy to check. Defenders know what standard systems look like; if yours has capabilities that are too empty or too crowded, suspicion rises.

A laptop typically reports a single microphone and webcam, so having none or a dozen looks strange. Browser features should match the version you present. For example, an older version of Chrome should not claim to support APIs that were only introduced later. Even installed extensions can betray you. A completely empty profile is just as suspicious as one with twenty security tools.

Maintaining Stability Over Time

One of the strongest signals websites check is stability. Real users do not constantly switch between different devices or browser versions. They use the same setup until they update or replace their hardware.

Maintain the same fingerprint within a sticky session, particularly for high-volume flows such as logins or carts.
Change versions only when it makes sense, such as after a scheduled browser update.
Avoid rapid platform switches, such as transitioning from Windows to macOS between requests.

Stability tells defenders that you are a steady, consistent user, not a bot cycling through different disguises.
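A simple way to get this stability is to pick the fingerprint profile deterministically from the session identifier, so the same session always maps to the same profile. This is a sketch under assumptions: the profile names are placeholders and `profile_for_session` is a hypothetical helper.

```python
import hashlib

# Placeholder profile names; each would map to a full, internally consistent fingerprint.
PROFILES = ["chrome-115-windows-10", "safari-ios", "chrome-115-macos"]


def profile_for_session(session_id: str, profiles: list[str] = PROFILES) -> str:
    """Map a session ID to one profile, identically on every call."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(profiles)
    return profiles[index]
```

Because the choice depends only on the session ID, a sticky session keeps one identity across requests and across process restarts, without any shared state.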

Cookies, localStorage, and sessionStorage are not just technical details; they are part of what makes a session feel real. A genuine browser carries state forward across visits.

Let cookies accumulate naturally, including authentication tokens and consent banners.
Reuse them for related requests rather than wiping them clean each time.
Preserve session history so that the browsing pattern looks continuous.

Without state, every request looks like a first-time visitor, which is rarely how real users behave.

Measuring And Adjusting

Finally, you cannot perfect a fingerprint once and forget it. Websites change what they check, and even minor mismatches can appear over time.

Track how often you face CAPTCHA, blocks, or unusual error codes.
Log the outputs of your own Canvas, WebGL, and AudioContext to catch instability.
Compare your profile to real browser captures using tools like CreepJS or FingerprintJS.

This feedback loop helps you correct mistakes before they burn your entire setup.
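The tracking step can start very small. The sketch below counts outcomes per response and reports rates; the status codes treated as blocks and the body check for the word "captcha" are simplifying assumptions, and `BlockMonitor` is a hypothetical class.

```python
from collections import Counter

# Assumption: these status codes usually indicate blocking or rate limiting.
BLOCK_STATUS_CODES = {403, 429}


class BlockMonitor:
    """Tally response outcomes so rising block rates are visible early."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, status_code: int, body: str = "") -> str:
        if "captcha" in body.lower():
            outcome = "captcha"
        elif status_code in BLOCK_STATUS_CODES:
            outcome = "blocked"
        elif 200 <= status_code < 300:
            outcome = "ok"
        else:
            outcome = "error"
        self.counts[outcome] += 1
        return outcome

    def rate(self, outcome: str) -> float:
        total = sum(self.counts.values())
        return self.counts[outcome] / total if total else 0.0
```

Alerting when `rate("captcha")` or `rate("blocked")` crosses a threshold you choose gives you the feedback loop described above.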

Fingerprint management is about coherence. Your IP, headers, rendering, devices, and behavior all need to tell the same story. A clean IP without a matching fingerprint will still be blocked. A patched fingerprint without stability will still look wrong. Only when all parts are aligned do you create an identity that can survive in production.

Technique 3: Solving the CAPTCHA Conundrum

Even if you have clean IPs and fingerprints that look human, websites often add one more obstacle before granting access: a challenge-response test known as CAPTCHA. The acronym stands for Completely Automated Public Turing test to tell Computers and Humans Apart. Put simply, it is a puzzle designed to be easy for people but difficult for bots.

CAPTCHAs are not new, but they have evolved into one of the toughest barriers scrapers face. To deal with them effectively, you need to understand what you are up against and choose a strategy that balances cost, speed, and reliability.

Understanding the Different Forms of CAPTCHA

Not all CAPTCHAs look the same. Over the years, defenders have introduced new formats to stay ahead of automation tools.

Text-based CAPTCHAs: These were the earliest form, where users had to type distorted letters or numbers. They are now largely phased out because machine learning models can solve them with high accuracy.
Image selection challenges: These ask the user to click on all images containing an object, such as traffic lights or crosswalks. They rely on human visual recognition, which is still harder to automate consistently.
reCAPTCHA v2: Google’s version that often shows up as the “I’m not a robot” checkbox. If the system is suspicious, it escalates to an image challenge.
reCAPTCHA v3: A behind-the-scenes version that scores visitors silently based on their behavior, only serving challenges if the score is too low.
hCaptcha and Cloudflare Turnstile: Alternatives that serve similar roles, often preferred by sites that want to avoid sending user data to Google. Turnstile is especially tricky because it can run invisible checks without showing the user anything.

Each type has its own level of difficulty. The simpler ones can be solved automatically, but the more advanced forms often require external help.

The CAPTCHA Solving Ecosystem

Because scrapers cannot always solve CAPTCHA on their own, an entire ecosystem of third-party services exists to handle them. These services usually fall into two categories:

Human-powered solvers: Companies employ workers who receive CAPTCHA images and solve them in real time. You send the challenge through an API, they solve it within seconds, and you get back a token to submit with your request.
Machine-learning solvers: Some services attempt to solve CAPTCHA with automated models. They can be faster and cheaper but are less reliable against newer and more complex challenges.

Popular providers include 2Captcha, Anti-Captcha, and DeathByCaptcha. They integrate easily into scraping scripts by exposing simple APIs where you post a challenge, wait for the solution, and then continue your request.

CAPTCHA solving introduces trade-offs that you have to plan for:

Cost: Each solve costs money, often fractions of a cent, but this adds up at scale. For scrapers making millions of requests, solving CAPTCHA manually can become the most significant expense.
Latency: Human solvers take time. Even the fastest services usually add a delay of 5–20 seconds. This may be acceptable for occasional requests, but it slows down large crawls.
Reliability: Solvers are not perfect. Sometimes they return incorrect answers or time out. Building in error handling and retries is essential.

This is why many teams mix strategies: using solvers only when necessary, while trying to minimize how often challenges are triggered in the first place.
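Because solvers can time out or return nothing, the call to a solver is usually wrapped in retry logic. The sketch below is provider-agnostic: `solve` stands for whatever function calls your chosen service and returns a token (or `None`), and `solve_with_retries` is a hypothetical helper, not any provider's API.

```python
import time
from typing import Callable, Optional


def solve_with_retries(
    solve: Callable[[], Optional[str]],
    max_attempts: int = 3,
    backoff_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call a CAPTCHA solver until it returns a token or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            token = solve()
        except TimeoutError:
            token = None  # treat a timeout like a failed solve
        if token:
            return token
        if attempt < max_attempts:
            # Wait a little longer after each failure before trying again.
            sleep(backoff_seconds * attempt)
    return None
```

Every retry is another paid solve and another delay, so capping `max_attempts` keeps both cost and latency bounded.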

Reducing CAPTCHA Frequency

The best way to handle CAPTCHAs is not to see them often. Careful planning can keep challenges rare:

Maintain good IP hygiene: Residential or mobile proxies with low abuse history face fewer CAPTCHAs.
Keep fingerprints consistent: Browsers that look real and stable raise fewer red flags.
Pace your requests: Sudden bursts of traffic are more likely to trigger challenges.
Reuse cookies and sessions: A returning user with a history of normal browsing behavior is less likely to be tested.

By reducing how suspicious your traffic looks, you can push CAPTCHAs from being constant roadblocks to occasional speed bumps.

When a CAPTCHA does appear, you have three main options:

Bypass entirely by preventing triggers with a good proxy, fingerprint, and behavior management.
Outsource solving to a third-party service, accepting the cost and delay.
Combine approaches, using solvers only when absolutely necessary while optimizing your setup to minimize their frequency.

Managing CAPTCHAs is less about brute force and more about strategy. If you rely on solving them at scale, your scraper will be slow and expensive. If you invest in preventing them, solvers become a rare fallback instead of a dependency.

Technique 4: Mimicking Human Behavior

At this point, you have clean IPs, fingerprints that look real, and a strategy for dealing with CAPTCHAs. But if your scraper still moves through a website like a robot, detection systems will notice. This is where behavioral mimicry comes in. The goal is not only to send requests that succeed, but to make your traffic look like it belongs to a person sitting at a screen.

Websites have spent years fine-tuning their ability to distinguish humans from bots. They know that people pause, scroll unevenly, misclick, and browse in messy and unpredictable ways. A scraper that always requests the next page instantly, scrolls in perfect increments, or never makes mistakes stands out. Mimicking human behavior makes your automation blend in with the natural noise of real users.

Building Human-Like Timing

One of the easiest giveaways of a bot is timing. Real users never click or type with machine precision.

Delays between actions: Instead of firing requests back-to-back, add short pauses that vary randomly. For example, wait 2.4 seconds after one click, then 3.1 seconds after the next.
Typing simulation: When filling forms, stagger keypresses to mimic natural rhythm. People often type in bursts, with slight pauses between words.
Warm-up navigation: Before going straight to the target data page, let your scraper visit the homepage or a category page. Real users rarely jump to deep links without a path.

These adjustments slow down your scraper slightly but dramatically reduce how robotic it looks.
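As a rough illustration, randomized pauses and keystroke timing can be generated with the standard library alone. The ranges below are arbitrary assumptions, not measured human behavior, and both function names are hypothetical.

```python
import random


def human_delay(rng: random.Random, low: float = 1.5, high: float = 4.0) -> float:
    """Pick a pause (in seconds) that clusters around the middle of the range."""
    return rng.triangular(low, high, (low + high) / 2)


def keystroke_delays(text: str, rng: random.Random) -> list[float]:
    """Return one delay per character, with longer pauses after spaces."""
    delays = []
    for char in text:
        delay = rng.uniform(0.05, 0.25)      # assumed per-key rhythm
        if char == " ":
            delay += rng.uniform(0.3, 0.8)   # assumed pause between words
        delays.append(delay)
    return delays
```

Your automation code would then sleep for `human_delay(rng)` between actions and for each value from `keystroke_delays(...)` between key presses.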

Making Navigation Believable

Beyond timing, websites watch where you go and how you get there.

Session flow: Humans often wander. They may open a menu, check an unrelated page, or click back before moving on. Adding a few detours creates a more realistic flow.
Scrolling behavior: People scroll unevenly, sometimes stopping mid-page, then continuing. Scripts can replicate this by scrolling in variable increments and pausing at random points.
Mouse movement: While many scrapers skip this entirely, some detection systems check for mouse events. Simulating small, imperfect arcs and jitter makes interaction data look genuine.
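Uneven scrolling, for instance, can be planned ahead as a list of steps and pauses that your browser automation then plays back. This is a minimal sketch; the step and pause ranges are assumptions, and `scroll_plan` is a hypothetical helper.

```python
import random


def scroll_plan(
    page_height: int,
    rng: random.Random,
    step_range: tuple[int, int] = (200, 700),
    pause_range: tuple[float, float] = (0.3, 2.0),
) -> list[tuple[int, float]]:
    """Return (pixels_to_scroll, seconds_to_pause) pairs covering the page."""
    steps = []
    position = 0
    while position < page_height:
        # Never scroll past the bottom of the page.
        step = min(rng.randint(*step_range), page_height - position)
        position += step
        steps.append((step, round(rng.uniform(*pause_range), 2)))
    return steps
```

Each pair can be fed to whatever scroll call your tool provides, followed by a sleep for the paired pause.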

Managing Cookies and Sessions

Humans carry baggage from one visit to the next in the form of cookies and session history. A scraper that always starts fresh looks suspicious.

Persist cookies: Store and reuse cookies so your scraper appears as the same user returning.
Maintain sessions: Use sticky proxies to hold an IP across several requests, keeping the identity consistent.
Align browser state: Headers like “Accept-Language” and time zone settings should match the location of the IP you are using.

This continuity creates the impression of a long-term visitor rather than disposable traffic.
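Persisting cookies can be as simple as writing them to disk between runs. The sketch below stores name/value pairs as JSON; it ignores cookie attributes such as domain and expiry for brevity, and `CookieStore` is a hypothetical class rather than part of any scraping library.

```python
import json
from pathlib import Path


class CookieStore:
    """Keep one identity's cookies in a JSON file so they survive restarts."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def load(self) -> dict[str, str]:
        if self.path.exists():
            return json.loads(self.path.read_text(encoding="utf-8"))
        return {}

    def save(self, cookies: dict[str, str]) -> None:
        self.path.write_text(json.dumps(cookies, indent=2), encoding="utf-8")

    def update(self, new_cookies: dict[str, str]) -> dict[str, str]:
        """Merge newly received cookies into the stored set."""
        cookies = self.load()
        cookies.update(new_cookies)
        self.save(cookies)
        return cookies
```

Using one file per identity (paired with the same sticky proxy and fingerprint) keeps the "returning visitor" story consistent.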

Balancing Scale and Stealth

The challenge is that human-like behavior is slower by design. If you are scraping millions of pages, adding pauses and navigation steps can cut throughput. The solution is to parallelize: run more scrapers in parallel, each moving at a believable pace, instead of trying to push one scraper at unnatural speed.
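One way to sketch this pattern is with asyncio: several workers run concurrently, and each one pauses between its own requests. Here `fetch` is a placeholder for your actual request coroutine, and the worker count and delay range are assumptions to tune.

```python
import asyncio
import random


async def paced_worker(name, urls, fetch, delay_range, results):
    """Fetch URLs one at a time, pausing between requests like a person would."""
    for url in urls:
        results.append((name, await fetch(url)))
        await asyncio.sleep(random.uniform(*delay_range))


async def crawl(urls, fetch, workers=4, delay_range=(2.0, 5.0)):
    """Split the URLs across workers; each worker keeps its own slow pace."""
    results = []
    chunks = [urls[i::workers] for i in range(workers)]
    await asyncio.gather(
        *(
            paced_worker(f"worker-{i}", chunk, fetch, delay_range, results)
            for i, chunk in enumerate(chunks)
            if chunk
        )
    )
    return results
```

Throughput scales with the number of workers, while each individual identity still moves at a believable speed. In practice, each worker would also get its own proxy, cookies, and fingerprint.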

Mimicking human behavior is about creating noise and imperfection. A successful scraper does not just move from point A to point B as fast as possible. It hesitates, scrolls, and carries history just like a person would. Combined with strong IP management and consistent fingerprints, this makes your automation much harder to distinguish from a real visitor.

Chapter 4: The Strategic Decision: When to Build vs. When to Buy

Every technique we have covered so far—proxy management, fingerprint alignment, behavioral simulation, and solving challenges—can be built and maintained by a dedicated team. Many developers start this way because it offers maximum control and transparency. Over time, however, the reality of maintaining an unblocking system at scale forces a bigger decision: should you continue to invest in building internally, or should you adopt a managed solution that handles these defenses for you?

The True Cost of an In-House Solution

On paper, building in-house simply means combining the right tools: a proxy provider, a CAPTCHA solver, and some logic to manage requests. In practice, it evolves into a complex system that must adapt to every change in how websites block automation.

Maintaining such a system requires constant investment in four areas:

Engineering capacity: Developers spend a significant amount of time patching scripts when sites update their defenses, rewriting fingerprint logic, and building monitoring tools to catch failures.
Proxy infrastructure: Residential and mobile proxies are indispensable for challenging targets, but they come with high recurring costs. Pools degrade as IPs are flagged, requiring continuous replacement and vendor management.
Challenge solving: CAPTCHA and some client-side JavaScript puzzles add direct costs per request. Even with solvers, failure rates introduce retries that inflate both costs and delays.
Monitoring and updates: Sites rarely stay static. What works one month may fail the next, and every update to defenses requires a response. The system becomes a moving target.

Introducing the Managed Solution: Scraping APIs

A managed scraping API abstracts these same components into a single request. Instead of provisioning proxies, patching fingerprints, or integrating solver services yourself, the API handles those tasks automatically and delivers the page content.
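To make the "single request" idea concrete, here is a sketch that builds a request URL for ScraperAPI, used purely as an example of a managed API. The endpoint and the `api_key`, `url`, `render`, and `country_code` parameters reflect my understanding of its public documentation and should be checked against the current docs; `build_request_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint; verify against the provider's documentation.
API_ENDPOINT = "https://api.scraperapi.com/"


def build_request_url(
    api_key: str,
    target_url: str,
    render: bool = False,
    country_code: Optional[str] = None,
) -> str:
    """Compose one GET URL that asks the service to fetch target_url for you."""
    params = {"api_key": api_key, "url": target_url}
    if render:
        params["render"] = "true"              # ask for JavaScript rendering
    if country_code:
        params["country_code"] = country_code  # ask for a proxy in this country
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

Fetching the returned URL with any HTTP client is then the whole integration; proxy rotation, fingerprints, and challenge handling happen on the provider's side.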

The core benefit is focus. Firefighting bot detection updates no longer consume development time. Teams can focus on extracting insights from the data instead of maintaining the pipeline. Costs are generally easier to predict because many managed APIs bundle infrastructure, rotation logic, and solver fees, although high volumes or specialized targets can still increase expenses.

This does not make managed services universally superior. For small-scale projects with limited targets, a custom in-house setup can be cheaper and more flexible. However, for projects that require consistent, large-scale access, the stability of a managed API often outweighs the control of building everything yourself.

The Trade-Off

The choice is not between right and wrong, but between two different ways of investing resources:

Build if you have strong technical expertise, modest scale, and the need for complete control over how every request is managed.
Buy if your goal is long-term stability, predictable costs, and freeing engineers from the ongoing work of keeping up with anti-bot systems.

At its core, this is not a technical question but a strategic one. The defenses used by websites will continue to evolve. The real decision is whether your team wants to be in the business of keeping pace with those defenses, or whether you would rather rely on a service that does it for you.

Conclusion: The End of the Arms Race?

Bypassing modern anti-bot systems is not about finding a single trick or loophole. It requires a layered strategy that addresses every stage of detection. At the network level, your IP reputation must be managed with care. At the browser level, your fingerprint must look both realistic and consistent. At the interaction level, your behavior has to resemble the irregular patterns of human browsing. And when those checks are not enough, you must be prepared to solve active challenges like CAPTCHA or JavaScript puzzles.

Taken together, these defenses form a system designed to catch automation from multiple angles. To succeed, your scrapers need to look convincing in all of them at once. That is why the most resilient strategies focus on combining proxies, fingerprints, behavioral design, and rotation into one coherent approach rather than relying on isolated fixes.

There are two ways to get there. One approach is to build and maintain an in-house stack, thereby absorbing the costs and complexities associated with staying ahead of detection updates. The other option is to adopt a managed service that handles the unblocking for you, enabling your team to focus on extracting and utilizing the data. The right choice depends on scale, resources, and priorities.

What will not change is the direction of this contest. Websites will continue to develop more advanced defenses, and scrapers will continue to adapt. The arms race may never truly end, but access to web data will remain essential for research, business intelligence, and innovation. The organizations that thrive will be those that treat anti-bot systems not as an impenetrable wall, but as a challenge that can be met with the right mix of strategy, tools, and discipline.

The post The Ultimate Guide to Bypassing Anti-Bot Detection appeared first on ScraperAPI.

Playwright vs Puppeteer in 2025: Which Browser Automation Tool Is Right for You?

John Fáwọlé — Fri, 11 Jul 2025 01:16:58 +0000

If you are working with headless browsers, you’ll likely face a key decision: Playwright or Puppeteer?

Both are great tools for scraping dynamic websites or automating browser tasks, and each comes with a solid reputation and a strong following.

They have, of course, their differences, too, both from a technical standpoint and in terms of ecosystem, support, and overall flexibility. In this short blog, we’ll compare these two popular libraries.

By the end, you’ll have a better understanding of Playwright and Puppeteer, their tradeoffs, and all the information you need to pick the best fit for your project.

What Are Playwright and Puppeteer? Key Features and Differences

Before we delve into the key differences between Playwright and Puppeteer, it is important to understand each one well.

What is Playwright?

Back in 2020, the Microsoft team began to see the need for a single robust API to cross-test browsers. This led to the creation of Playwright.

Unlike many existing libraries, Playwright acts as a unified tool bridging multiple platforms, browsers, and languages. For instance, Playwright supports Firefox, WebKit, and Chromium, the open-source engine behind Google Chrome.

It works on virtually any machine, and supports both headless and headful modes. Mobile-first developers have a soft spot for Playwright because it can emulate Android Chrome and Mobile Safari directly on your desktop. App developers can simulate and test how their applications perform across different mobile environments without needing physical devices.

When it comes to web scraping, Playwright comes with a number of useful features, such as AutoWait (auto-waiting), which is popular because it helps you scrape web pages without setting off bot-detection systems as easily. Playwright also shines at managing multiple tasks at the same time. For example, it can handle several tabs and user scenarios in parallel without much effort.

What is Puppeteer?

Google created Puppeteer in 2017 as a JavaScript library for web testing and automation within its browser ecosystem. It was designed to meet the demand of developers building with Google products.

By default, Puppeteer runs headless, meaning it drives the browser without showing a window. However, users can configure it to launch a visible browser.

Since its beginnings, Puppeteer has been popular among developers for testing Chrome extensions. Today, with most websites built using JavaScript (often with Next.js on the frontend and Node.js on the backend), many developers still prefer to test their applications using a JavaScript-based library like Puppeteer.

For end-to-end testers, Puppeteer gives you the flexibility to check everything from the user interface to keyboard inputs. This means you can:

Make sure your web app performs well
Test the overall user experience
Catch anything that might be broken
Spot security vulnerabilities

When it comes to scraping, this library is popular for its ability to crawl pages, extract data, and capture the results as screenshots or PDFs.

Playwright vs Puppeteer Comparison

Feature	Playwright	Puppeteer
Browser Support	Chromium, Firefox, and WebKit	Chrome and Firefox
Cross-browser Support	Available	Unavailable
Language Support	JavaScript, Python, Java, TypeScript, .NET	JavaScript
Mobile Simulation Support	Available	Unavailable
Browser UI	Available	Unavailable
Creator	Microsoft	Google
Timeline trace debugging	Available	Available
Machine Support	Mac, Windows, Linux	Mac, Windows, Linux
Performance	Fast	Fast
Community Vibrance	Better	Good
Documentation	Good	Better

When to Choose Playwright vs. Puppeteer?

Now that we have taken a closer look at both Playwright and Puppeteer, let’s see when it’s best to use each, depending on your project and specific needs.

Playwright

Here are some reasons you might want to stick with Playwright.

Multi-language Support

Playwright supports many languages, including JavaScript, TypeScript, Python, Java, and .NET.

Unlike Puppeteer, which supports only JavaScript, Playwright gives you many options. You are free to build with the language you are most comfortable with.

Cross-browser Support

Playwright is the right choice if you want to test your application across multiple browsers. It supports many browsers, such as Firefox, Chrome, and WebKit.

Mobile Simulation

You may be trying to scrape, test, or build a mobile app. Playwright helps you simulate a realistic mobile environment directly from your desktop. Its precise rendering capabilities give you an accurate view of how your application will appear and behave on mobile devices, letting you do more informed development and testing without the need for physical hardware.
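As a small illustration using Playwright's Python bindings, the sketch below opens a page with a built-in device descriptor and saves a screenshot. It assumes Playwright and its browsers are installed (`pip install playwright`, then `playwright install`); the function name and the choice of device are illustrative.

```python
def mobile_screenshot(url: str, path: str, device_name: str = "iPhone 13") -> None:
    """Load a page as an emulated mobile device and save a screenshot."""
    # Imported here so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Device descriptors bundle viewport, user agent, touch support, and more.
        device = p.devices[device_name]
        browser = p.webkit.launch()
        try:
            context = browser.new_context(**device)
            page = context.new_page()
            page.goto(url)
            page.screenshot(path=path)
        finally:
            browser.close()
```

Calling `mobile_screenshot("https://example.com", "mobile.png")` would show roughly how the page lays out on that device, although emulation is not a perfect substitute for real hardware.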

Puppeteer

Here are some use cases when Puppeteer might be your best pick:

Testing Chrome Extensions

Puppeteer was built by the Chrome DevTools team at Google, so the tech stack similarities make it a great tool for Chrome extension testing. You are going to have an even better time if you are extensively using JavaScript.

JavaScript is Enough

Puppeteer only supports JavaScript, so it fits best when your project is already written in JavaScript. Projects relying on other languages might be slowed down.

Cross-browser Support is Not Important

If you are testing or scraping with only Chrome in mind, Puppeteer is a good option. Supporting Chrome is no issue at all for Puppeteer, but it might struggle with other browsers.

Playwright vs Puppeteer for Web Scraping: Which One Wins?

Primarily, these libraries are used for web automation and testing. However, many engineers might be more interested in the web scraping capabilities of Playwright and Puppeteer.

Here is what to keep in mind when choosing between the two for web scraping.

Bot Detection

Most detectors are trained to recognize bots by identifying agents that access a web page very quickly and carry out actions even while it is still loading.

Thanks to the AutoWait feature, Playwright ensures that elements fully load before any action is executed, making it easier to proceed undetected.

While Puppeteer doesn’t offer an equivalent of AutoWait, it has similar features that support graceful loading. For example, you can get creative with page.setDefaultTimeout(), page.waitForSelector(), and the waitUntil option of page.goto(), which allow you to control what to wait for and how long to wait before timing out.

Dynamic Content Handling

Puppeteer was built to handle crawling and data extraction from Single Page Applications.

However, it has a couple of downsides:

An acute focus on Chrome
No built-in support for handling dynamic content

If you want to scrape with Puppeteer, you’d have to digest the docs well so you can manually configure it to successfully scrape dynamic content.

For example, you’ll need to call wait methods such as page.waitForSelector() explicitly, among other things. Playwright, on the other hand, comes with automatic waiting and built-in retries, which help minimize errors and may reduce the chance of tripping bot detection.

Apart from that, Playwright is better suited for scraping modern websites, especially when it comes to handling iframes. It can reliably access and extract content loaded within them.
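To show what automatic waiting looks like in practice, here is a minimal sketch using Playwright's Python bindings. It assumes Playwright and a Chromium build are installed; the function name, selector, and timeout are illustrative choices.

```python
def scrape_text(url: str, selector: str, timeout_ms: int = 10_000) -> str:
    """Return the text of the first element matching selector."""
    # Imported here so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            # No explicit wait call: the locator keeps retrying until a matching
            # element appears or the timeout expires.
            return page.locator(selector).first.inner_text(timeout=timeout_ms)
        finally:
            browser.close()
```

For content inside an iframe, `page.frame_locator(...)` can be chained before the locator in the same style.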

Scraping Pre-rendered Content in HTML

There are times you might need to pre-render a web page you want to scrape, probably to avoid API detection or to improve your scraping efficiency.

If you do this often, you’ll need to check which library better supports your workflow.

Puppeteer has native support for fetching pre-rendered content, usually without requiring heavy configuration.

Playwright, on the other hand, doesn’t natively support pre-rendered content, so you’d need to write your own script to handle that.

Conclusion

Playwright and Puppeteer are two good libraries you can use for your web testing or scraping. In this guide, we’ve examined the technical merits and downsides of each one.

It’s important to emphasize that if your goal is web scraping, these libraries alone might not be enough. Modern websites use advanced bot detection and blocking techniques that go beyond what headless browsers can easily bypass.
That’s where tools like ScraperAPI come in. It can help you successfully scrape the web without the usual headaches. Sign up for the basic plan here and see how it works for yourself!

FAQs

Is Puppeteer better than Playwright?

Often, yes, at least for web scraping: the popular stealth plugin (a community add-on for puppeteer-extra rather than a built-in Puppeteer feature) gives Puppeteer an edge. That said, Playwright stands out with its cross-browser compatibility and support for multiple programming languages.

Is Puppeteer the same as Playwright?

No, they are two different headless browser automation libraries, each with its own features and strengths.

Is Playwright a fork of Puppeteer?

No, Playwright is not a fork of Puppeteer. It was created by the same team that originally worked on Puppeteer at Google, but they moved to Microsoft and built Playwright from scratch.

Is Playwright good for scraping?

Absolutely. It’s a powerful headless browser library that works well for web automation and scraping. For improved performance and fewer blocks, it’s even more effective when used alongside a tool like ScraperAPI.

The post Playwright vs Puppeteer in 2025: Which Browser Automation Tool Is Right for You? appeared first on ScraperAPI.

Build a TikTok Brand-Influencer Scouting Tool Using ScraperAPI-LangChain Agent, Qwen3, and Streamlit

Egop Gogo-Job — Fri, 11 Jul 2025 01:10:24 +0000

Build a custom TikTok influencer scouting tool that lets you filter creators by country, follower count, and more while supporting follow-up queries, using the ScraperAPI LangChain agent for data extraction, Qwen3’s LLM for contextually relevant insights, and Streamlit for free app hosting.

Influencer marketing is rapidly eclipsing traditional ads, especially among younger audiences who value authenticity and relatability. Micro-influencers, with their tight-knit and highly engaged communities, inspire far more trust and loyalty than a generic banner ever could.

Yet, follower counts and engagement metrics alone won’t guarantee success. What truly moves the needle is partnering with creators whose unique voice, values, and vision mirror your brand, who spark real action rather than empty clicks.

We are building a solution that can increase lead generation and targeted marketing thanks to its highly customizable approach, and help you find the right influencer to really boost your brand.

You will learn how to build a TikTok influencer scouting tool that utilizes the LangChain-ScraperAPI agent for scraping raw data and finding niche creators through natural-language queries.

We will use Qwen3 as our large language model to deepen the tool’s contextual understanding and Streamlit to deploy and host the finished app for free in the cloud.

Let’s get started.

Understanding AI Agents in LangChain

Fundamentally, an AI agent is a program that combines a large language model (LLM) with tools and memory to perform tasks autonomously. Rather than responding to one-off prompts, an agent can:

Interpret and execute user intent by breaking down high-level queries into actionable steps.
Call external tools like APIs, web scrapers, databases, etc., to gather or process information.
Continue reasoning and iterating on results until it meets the user’s requirements.

What sets agents apart from standard LLM applications is the capacity to make informed decisions about what actions to take next based on intermediate results. Agents are not only reactive—they are active participants in solving a task.

For example, instead of answering “What’s the weather like in Paris?”, a LangChain agent can respond to a complex, multi-part query:

“Plan a weekend getaway in Paris. I need weather forecasts, hotel prices under $200 per night, and suggestions for indoor activities if it rains.”

The agent breaks this down, uses tools like the ScraperAPI Google Search Tool and a general-purpose web scraper to gather each piece of information (weather data, hotel listings, and local attractions), and then combines everything into a complete response.

LangChain provides a flexible framework to assemble these components. You define a set of functions, APIs, or scrapers, wrap them with simple adapters, and then wire them into an agent that uses the LLM to decide when and how to call each resource.
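The decide-act-observe cycle behind an agent can be shown without any framework. The toy loop below is a conceptual sketch, not LangChain's implementation: `decide` stands in for the LLM, `tools` is a plain dictionary of functions, and all names are hypothetical.

```python
from typing import Callable


def run_agent(
    goal: str,
    tools: dict[str, Callable[[str], str]],
    decide: Callable[[str, list], tuple[str, str]],
    max_steps: int = 5,
) -> str:
    """Repeatedly ask `decide` what to do next until it says to finish."""
    observations = []
    for _ in range(max_steps):
        action, payload = decide(goal, observations)
        if action == "finish":
            return payload
        # Run the chosen tool and remember what it returned.
        observations.append((action, tools[action](payload)))
    return "Stopped: step limit reached"


def scripted_decide(goal: str, observations: list) -> tuple[str, str]:
    """Stand-in for the LLM: search once, then answer with the result."""
    if not observations:
        return ("search", goal)
    return ("finish", f"Found: {observations[-1][1]}")
```

In a real agent, the LLM plays the role of `scripted_decide`, choosing tools and inputs based on the intermediate results it has seen so far.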

How Does Autonomous Scraping with the LangChain–ScraperAPI Integration Work?

The LangChain-ScraperAPI integration is a Python package that allows AI agents to scrape the web using ScraperAPI. The package contains three different components, each corresponding to an official ScraperAPI endpoint:

ScraperAPITool: Allows the AI agent to scrape any website and retrieve data.
ScraperAPIGoogleSearchTool: Specifically enables the agent to scrape Google Search results and rankings.
ScraperAPIAmazonSearchTool: Exclusively scrapes Amazon search results and rankings.

All you need to do to use this package in Python is to install it with pip, then import the components:

pip install -U langchain-scraperapi

from langchain_scraperapi.tools import (
   ScraperAPITool,
   ScraperAPIGoogleSearchTool,
   ScraperAPIAmazonSearchTool
)

If you don’t have it already, create a ScraperAPI account and get your API key, then set it as an environment variable. In your terminal, run:

export SCRAPERAPI_API_KEY="your API key"

Once the tools are installed, you can create an instance of any of them and provide parameters such as the URL to scrape, the output format you want, and any additional options you need. Here’s an example:

from langchain_scraperapi.tools import ScraperAPITool
tool = ScraperAPITool()
print(tool.invoke(input={"url": "walmart.com", "output_format": "markdown"}))

The code above initializes one of the package’s components, ScraperAPITool, assigns it to a variable, and then uses the invoke method to scrape “walmart.com”, requesting the output in markdown format. The scraped content is then printed.

The great thing about agents is that you can instruct them in natural language to do complex tasks. For instance, we can give the ScraperAPI-LangChain agent a query to search and return results and even images of teddy bears for sale on Amazon, and it will do just that. Below is a sample of the code:

from langchain_scraperapi.tools import ScraperAPIAmazonSearchTool
tool = ScraperAPIAmazonSearchTool()
print(tool.invoke(input={"query": "show me pink teddy bears for sale on Amazon"}))

Using the regular ScraperAPI Amazon Endpoint will also return the same results, but you’d have to find and input an actual Amazon URL with pink teddy bears on display and then attempt to scrape the web page. Using the ScraperAPI-LangChain agent makes it easier to retrieve complex data quickly with minimal coding and resources.

How to Obtain Qwen3 from OpenRouter

As we’re making use of a large language model from OpenRouter, we’ll need to set up an account and get our API key before we can start making requests.

What sets Qwen models apart is their efficiency and scalability, particularly when it comes to those built on the Mixture of Experts (MoE) architecture. Unlike traditional large language models where all parameters are activated for every query, MoE models contain multiple ‘expert’ sub-networks.

This means that, as they process information, MoE models activate only a small subset of specialized sub-networks (“experts”) based on learned routing decisions, allowing them to interpret, understand, and respond to a query without engaging the full model. This selective activation enables MoE models to maintain high performance while significantly reducing computational overhead and costs.

As a result, Qwen3 consistently delivers responses that are highly contextual, informative, and relevant.
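The selective activation described above can be illustrated with a toy router. This is a minimal sketch in plain Python, not Qwen3’s actual implementation: the experts are simple functions, the router scores are hard-coded (in a real MoE layer they are computed from the input by a learned router), and the names moe_forward and top_k are assumptions for illustration.

```python
import math


def softmax(scores):
    """Convert raw router scores into gate weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_scores, experts, top_k=2):
    """Run only the top_k experts and mix their outputs by renormalized gate weight."""
    gates = softmax(router_scores)
    # Pick the indices of the top_k highest-weighted experts (ties keep original order).
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in chosen)
    output = sum(gates[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


# Four toy "experts"; only two of them are evaluated for this input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
output, used = moe_forward(3.0, router_scores=[0.1, 2.0, -1.0, 2.0], experts=experts)
print(used, output)
```

Only the selected experts are ever called, which is where the compute savings come from: the remaining sub-networks cost nothing for this query.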

Here’s a guide on how to access a model from OpenRouter:

After verifying your email, log in and search for Qwen3 models (or any other LLM of your choice) in the search bar:

Go to the Qwen3 model of your choice:

Click on “API” to create a personal API access key for your model.

Select “Create API Key” and then copy and save your newly created API key.

Do not share your API key publicly.

Getting Started with ScraperAPI

To begin, go to ScraperAPI’s dashboard. If you don’t have an account yet, click on “Start Trial” to create one:

After creating your account, you’ll have access to a dashboard providing you with an API key, access to 5000 API credits (7-day limited trial period), and information on how to get started scraping.

To access more credits and advanced features, scroll down and click “Upgrade to Larger Plan.”

ScraperAPI provides documentation for various programming languages and frameworks—such as PHP, Java, and Node.js—that interact with its endpoints. You can find these resources by scrolling down on the dashboard page and clicking “View All Docs”:

Now that we’re all set, let’s start building our tool.

Building the TikTok Brand-Influencer Scouting Tool

Step 1: Setting Up the Project

Create a new project folder, a virtual environment, and install the necessary dependencies.

mkdir tiktok_influencer_project  # Creates the project folder
cd tiktok_influencer_project # Moves you inside the project folder

python -m venv your-env-name  # Creates a new environment

Activate the environment:

Windows:

your-env-name\Scripts\activate

macOS/Linux:

source your-env-name/bin/activate

And now you can install the dependencies we’ll need:

pip install streamlit tiktoken langchain-openai langchain-scraperapi

The key dependencies and their functions are:

streamlit: We need this to build the app’s user interface, so users can directly input their niche and other filters, while seeing results in real-time.
tiktoken: This library is from OpenAI and is used for tokenizing text and estimating token counts. In our project, we use it to estimate the length of the queries sent to the language model, so we don’t exceed API limits.
langchain-openai: This is a separate package that provides the integration with OpenAI-compatible Large Language Models (LLMs). Therefore, we use it to connect Qwen via OpenRouter, for our application to send prompts and receive AI-generated responses.
langchain-scraperapi: This is the package that integrates ScraperAPI and LangChain’s abilities in the form of an agent that can perform web scraping and Google searches autonomously.

Step 2: Integrating the Langchain-Scraperapi Package

Remember at the beginning when we set our ScraperAPI key as an environment variable and installed the dependencies we needed? If you are using the same environment, you’re good to go. However, if you are working in a new one, you won’t have the packages you need yet. Install Langchain-ScraperAPI:

pip install -U langchain-scraperapi

Previously, we exported our ScraperAPI key as an environment variable. However, this time around, we will need our OpenRouter API key as well. We could export both, but exporting keys in the terminal is a temporary solution (the variables only last for the current shell session). To keep both keys stored locally and ready to load at any moment, we’re going to use python-dotenv.

pip install python-dotenv

Create a new .env file and add your API keys:

SCRAPERAPI_API_KEY="your-scraperapi-key"
OPENROUTER_API_KEY="your-openrouter-key"

Step 3: Importing Libraries and Setting Up API Keys

The next step is importing all the necessary libraries and securely loading the API keys required to interact with external services like ScraperAPI and OpenRouter (for the LLM).

import os
import streamlit as st
import tiktoken

from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.agents.agent_types import AgentType
from langchain_scraperapi.tools import (
    ScraperAPITool,
    ScraperAPIGoogleSearchTool,
    ScraperAPIAmazonSearchTool
)
from dotenv import load_dotenv
load_dotenv()

# Loading API Keys

scraperapi_key = os.environ.get("SCRAPERAPI_API_KEY")
openrouter_api_key = os.environ.get("OPENROUTER_API_KEY")

# Let's include API key checks as a safety net and for easier debugging.
# Each check covers both a missing key and a key still set to the .env placeholder.
if not scraperapi_key or scraperapi_key == "your-scraperapi-key":
    st.error("SCRAPERAPI_API_KEY not found or is still the placeholder. Please set it in your .env file.")
    st.stop()

if not openrouter_api_key or openrouter_api_key == "your-openrouter-key":
    st.error("OPENROUTER_API_KEY not found or is still the placeholder. Please set it in your .env file.")
    st.stop()

The code above achieves the following:

Imports:

os: Used to interact with the operating system, specifically for setting and getting environment variables.
streamlit as st: The core library for building the web app’s user interface.
tiktoken: For estimating the number of tokens within prompts sent to the LLM.
langchain_openai.ChatOpenAI: Imports the class to interact with OpenAI-compatible chat models (like the Qwen model via OpenRouter in this case).
langchain.agents.initialize_agent, Tool: Key components from LangChain to create and manage the AI agent and the tools it can use.
langchain.agents.agent_types.AgentType: Specifies different types of LangChain agents.
langchain_scraperapi.tools: Imports specific tools designed to work with ScraperAPI for web scraping and search.

API Keys Setup:

load_dotenv(): Loads the keys from .env
scraperapi_key = os.environ.get("SCRAPERAPI_API_KEY"): Retrieves the value of the SCRAPERAPI_API_KEY environment variable.
openrouter_api_key = os.environ.get("OPENROUTER_API_KEY"): Retrieves the value of the OPENROUTER_API_KEY environment variable.

API Key Checks:

The if not scraperapi_key: and if not openrouter_api_key: blocks provide basic validation. They check if the API keys have been set or give a warning if they are missing or still contain the placeholder values. If the keys are not set, the Streamlit app will stop execution (st.stop()) to prevent errors further down the line.

Step 4: Building the Streamlit UI Layout

Here we will set up the basic layout and texts for the Streamlit web UI.

# Streamlit UI Setup 
st.set_page_config(page_title="TikTok Influencer Finder", layout="centered")
st.title("TikTok Influencer Finder 🧑🏼‍🤝‍🧑🏿🌐")
st.markdown("""
Welcome! This bot uses ScraperAPI's Langchain AI Agent for web scraping and a **Qwen LLM (via OpenRouter)**
to help you discover TikTok influencers who might be a great fit to promote your brand.
Please provide your brand's niche (e.g., 'sustainable running shoes', 'female luxury bags', 'men's watches').
""")

Here’s what the code above achieves:

st.set_page_config(...): Configures the Streamlit page, setting the browser tab title to “TikTok Influencer Finder” and the layout to “centered.”
st.title(...): Displays the main title of the application on the web page.
st.markdown(...): Renders a block of Markdown text, serving as a welcome message and a brief explanation of the tool’s purpose and how it works.

Step 5: Initializing LangChain Tools

Now we’ll prepare the tools that the LangChain agent will use to interact with the external web (specifically, to perform web searches and scrape content) using ScraperAPI.

# Initializing Tools
try:
    scraper_tool = ScraperAPITool(scraperapi_api_key=scraperapi_key)
    google_search_tool = ScraperAPIGoogleSearchTool(scraperapi_api_key=scraperapi_key)
except Exception as e:
    st.error(f"Error initializing ScraperAPI tools: {e}.")
    st.stop()

tools = [
    Tool(
        name="Google Search",
        func=google_search_tool.run,
        description="Useful for finding general information online, including articles, blogs, and lists of TikTok influencers."
    ),
    Tool(
        name="General Web Scraper",
        func=scraper_tool.run,
        description="Useful for scraping content from specific URLs after search."
    )
]

Below is a further breakdown of what the code above does:

Tool Initialization:

scraper_tool = ScraperAPITool(...): Creates an instance of a general web scraping tool provided by langchain-scraperapi, authenticated with your scraperapi_key. This tool can scrape content from any given URL.
google_search_tool = ScraperAPIGoogleSearchTool(...): Creates an instance of a Google search tool, also powered by ScraperAPI. This tool allows the agent to perform Google searches.
The try-except block handles potential errors during the initialization of these tools, displaying an error message in Streamlit and stopping the app if something goes wrong.

Tools List for LangChain Agent:

tools = [...]: Defines a list of Tool objects. Each Tool is a wrapper that makes an external function available to the LangChain agent.
“Google Search” Tool: Named “Google Search,” its function (func) is set to google_search_tool.run, meaning when the agent “uses” this tool, it will execute a Google search. The description tells the LLM what this tool is useful for.
“General Web Scraper” Tool: Named “General Web Scraper,” its function is scraper_tool.run. Its description indicates it’s for scraping specific URLs, typically after a search.

Step 6: Initializing the Large Language Model (LLM)

It is now time to initialize the Large Language Model (LLM) that will serve as the “brain” of the agent, enabling it to understand prompts and decide on actions.

QWEN_MODEL_NAME = "qwen/qwen3-30b-a3b:free"

llm = None
try:
    llm = ChatOpenAI(
        model_name=QWEN_MODEL_NAME,
        temperature=0.1,
        openai_api_key=openrouter_api_key,
        base_url="https://openrouter.ai/api/v1"
    )
    st.success(f"Successfully initialized Qwen model: {QWEN_MODEL_NAME}")
except Exception as e:
    st.error(f"Error initializing Qwen LLM: {e}")
    st.stop()

# Initialize agent here!
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    max_iterations=3
)

Here is what we can understand from the code above:

QWEN_MODEL_NAME: Defines the specific Qwen model we are using from OpenRouter.
llm = ChatOpenAI(...): Initializes the ChatOpenAI object.
model_name=QWEN_MODEL_NAME: Specifies which LLM to use.
temperature=0.1: Controls the creativity of the LLM’s responses. A lower value (like 0.1) makes the output more deterministic and focused.
openai_api_key=openrouter_api_key: Provides the API key for authentication with OpenRouter.
base_url="https://openrouter.ai/api/v1": Specifies the API endpoint for OpenRouter, as OpenRouter provides an OpenAI-compatible API.
The try-except block catches any errors during the LLM initialization, displays them in Streamlit, and stops the application if the LLM cannot be set up.
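agent = initialize_agent(...): Creates a first agent instance so that the button callback can call agent.run(query). Note that Step 7 creates the agent again with error handling and a higher max_iterations value, and that later assignment is the one the app ends up using.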

Step 7: Initializing the LangChain Agent

This crucial step brings together the LLM and the tools to create an intelligent agent capable of reasoning and taking action based on user requests.

# Initializing Agent
agent = None
if llm is not None:
    try:
        agent = initialize_agent(
            tools=tools,
            llm=llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True,
            handle_parsing_errors=True,
            max_iterations=15
        )
    except Exception as e:
        st.error(f"Error initializing LangChain agent: {e}")
        st.stop()
else:
    st.error("LLM not initialized. Agent setup failed.")
    st.stop()

The code achieves the following:

if llm is not None:: Ensures that the LLM initialized successfully before attempting to create the agent.
agent = initialize_agent(...): This is the core LangChain function to set up an agent.
tools=tools: Provides the list of Tool objects (Google Search and General Web Scraper) that the agent can utilize.
llm=llm: Connects the initialized LLM to the agent, giving it its reasoning capabilities.
agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION: Specifies the type of agent. This agent type follows the ReAct pattern: at each step, the LLM reasons about the task and decides which tool to use and how to use it, based only on the descriptions of the tools and the current task, without worked examples (hence “zero-shot”).
verbose=True: When True, the agent’s internal thought process and tool usage will be printed to the console, which is very helpful for debugging.
handle_parsing_errors=True: Allows the agent to attempt to recover from parsing errors in its internal reasoning.
max_iterations=15: Sets a limit on how many steps (tool uses, thoughts) the agent can take before giving up, preventing infinite loops.
The try-except block handles errors during agent initialization, displaying them and stopping the app if the agent cannot be set up.
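To make the role of max_iterations more concrete, the loop an agent runs can be sketched in plain Python. This is a toy illustration under stated assumptions, not LangChain’s implementation: the function run_toy_agent, the scripted_policy stand-in for the LLM, and the fake tools are all hypothetical.

```python
def run_toy_agent(question, decide, tools, max_iterations=15):
    """Ask the policy for a step, run the chosen tool, and feed the observation back."""
    scratchpad = []  # (action, input, observation) tuples the policy can inspect
    for _ in range(max_iterations):
        step = decide(question, scratchpad)
        if step["action"] == "finish":
            return step["input"]
        observation = tools[step["action"]](step["input"])
        scratchpad.append((step["action"], step["input"], observation))
    # The loop ran out of steps without the policy choosing to finish.
    return "Agent stopped due to iteration limit."


def scripted_policy(question, scratchpad):
    """Stand-in for the LLM: search first, then scrape the result, then answer."""
    if not scratchpad:
        return {"action": "Google Search", "input": question}
    if len(scratchpad) == 1:
        return {"action": "General Web Scraper", "input": scratchpad[-1][2]}
    return {"action": "finish", "input": f"Summary of {scratchpad[-1][2]}"}


# Fake tools that return canned strings instead of touching the network.
toy_tools = {
    "Google Search": lambda query: "https://example.com/list",
    "General Web Scraper": lambda url: f"page text from {url}",
}

answer = run_toy_agent("fitness creators", scripted_policy, toy_tools)
print(answer)
```

With max_iterations=1, the same call stops after the search step and returns the limit message, which is the kind of early exit a very low limit can cause in a real agent.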

Step 8: Building the User Input Interface

Here we will define the interactive elements in the app’s UI where the user can provide details for their search.

# Inputting UI elements
user_niche = st.text_input(
    "Enter your brand's niche:",
    key="brand_niche_input",
    placeholder="Type niche here..."
)

# --- Additional Filters ---
st.subheader("Optional Filters")
country_filter = st.text_input(
    "Filter by Country (optional):",
    key="country_filter",
    placeholder="e.g., United States, UK, China"
)

min_followers = st.number_input(
    "Minimum Follower Count (e.g., 500000 for 500K)",
    min_value=0,
    value=0,
    step=10000,
    key="min_followers"
)

Below is a summary of what the code above achieves:

st.text_input(...): Creates a text input field for the user to enter their brand’s niche.
st.subheader("Optional Filters"): Displays a smaller heading for the optional filters section.
country_filter = st.text_input(...): Creates another text input for an optional country filter.
min_followers = st.number_input(...): Creates a numerical input field for the minimum follower count.

Step 9: Token Estimation Function

The function below helps to manage the length of prompts sent to the LLM, which usually have token limits.

# Token Estimation function
def estimate_tokens(text):
    try:
        encoding = tiktoken.encoding_for_model("gpt-4")
    except KeyError:  # Raised when tiktoken does not recognize the model name
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

Here’s what the code achieves:

Function Definition: Defines estimate_tokens(text), which takes a string text as input.
Tokenization: It attempts to get the token encoder for the “gpt-4” model. If that fails, it falls back to a common base encoding (cl100k_base). encoding.encode(text) converts the input text into a list of token integers, while len(...) returns the count of these tokens. Because Qwen models use their own tokenizer, the result is an approximation rather than an exact count.
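The app in this tutorial rejects prompts that are too long. An alternative is to trim the text until it fits, which could be sketched as follows. The helper fit_to_budget is hypothetical and assumes that the token count never decreases as more words are added; count_tokens can be any counting function, such as estimate_tokens.

```python
def fit_to_budget(text, max_tokens, count_tokens):
    """Return the longest word-prefix of `text` whose token count fits max_tokens.

    Uses binary search over the number of words kept, so count_tokens is only
    called about log2(number of words) times. Whitespace is normalized to
    single spaces as a side effect of splitting and re-joining.
    """
    words = text.split()
    low, high = 0, len(words)
    while low < high:
        mid = (low + high + 1) // 2
        if count_tokens(" ".join(words[:mid])) <= max_tokens:
            low = mid       # This prefix fits; try a longer one.
        else:
            high = mid - 1  # Too long; try a shorter one.
    return " ".join(words[:low])


# Demonstration with a crude word-based counter in place of a real tokenizer.
word_count = lambda s: len(s.split())
print(fit_to_budget("alpha beta gamma delta epsilon", 3, word_count))
```

In the app, a call such as fit_to_budget(context, 20000, estimate_tokens) would shorten the context instead of showing an error, at the cost of dropping the tail of the data.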

Step 10: Main Search Logic (Finding Influencers)

This is the core functional part of the application. It triggers when the user clicks the “Find TikTok Influencers” button. The code builds the prompt, runs the agent, and displays the results.

# Main Search Logic
if st.button("Find TikTok Influencers ✨"):
    if not user_niche:
        st.warning("Please enter your niche.")
    elif agent is None:
        st.error("Agent not initialized.")
    else:
        query = f"""
        Find a list of at least 5 TikTok influencers highly relevant to the niche: '{user_niche}'.
        Apply these filters:
        - Country: {country_filter or 'Any'}
        - Minimum Follower Count: {min_followers}
        For each influencer, provide:
        1. TikTok Username
        2. Full Name (if known)
        3. Approximate Follower Count
        4. Niche
        5. TikTok profile or verified link
        Format as Markdown list.
        """

        token_count = estimate_tokens(query)
        if token_count > 20000:
            st.error(f"Query too long ({token_count} tokens). Try reducing text.")
            st.stop()

        st.info("🚀 Searching influencers...")
        with st.spinner("Running agent..."):
            try:
                response = agent.run(query)
                st.session_state["last_influencer_data"] = response

                st.subheader("💡 Influencers Found:")
                st.markdown(response)
            except Exception as e:
                st.error(f"Agent failed: {e}")

Here’s further information on precisely how the code works:

if st.button("Find TikTok Influencers ✨"):: This block executes when a user clicks the button.

Input Validation:

if not user_niche:: Checks if the niche input is empty and displays a warning.
elif agent is None:: Checks if the LangChain agent was successfully initialized earlier and displays an error if not.

Query Construction:

To receive our results in a tidy and presentable format, we write the query template by hand and fill in the user’s inputs. This query instructs the agent on what to find (TikTok influencers), what criteria to use (niche, country, minimum followers), what information to extract for each, and the desired output format (Markdown list).
{country_filter or 'Any'}: Is a neat Python trick that uses country_filter if it has a value, otherwise defaults to the string ‘Any’.

Token Count Check:

token_count = estimate_tokens(query): Calls the previously defined function to get an estimate of the query’s token length.
if token_count > 20000:: Prevents sending overly long queries to the LLM, which could exceed API limits.

Running the Agent:

st.info("🚀 Searching influencers..."): Displays an informational message to the user.
with st.spinner("Running agent..."):: Shows a spinning animation in the UI, indicating that the application is running.
response = agent.run(query): This is where the magic happens. The LangChain agent takes the query, uses its LLM to reason about the task, and decides which of its tools (Google Search, Web Scraper) to use, potentially in multiple steps, to fulfill the request. The final answer from the agent is stored in response.
st.session_state["last_influencer_data"] = response: Stores the agent’s response in Streamlit’s session state, making the data persistently available across reruns of the script within the same user session, which is crucial for the follow-up Q&A.

Displaying Results:

st.subheader("💡 Influencers Found:"): Displays a subheader.
st.markdown(response): Renders the agent’s response (which is formatted as Markdown) directly into the Streamlit UI.

Error Handling: The try-except block catches any exceptions that occur during the agent’s execution and displays an error message.
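Building the prompt in a small function of its own makes it easier to test without launching Streamlit or calling the agent. The following is a sketch of one possible refactor; the function name build_influencer_query and its count parameter are assumptions, while the prompt wording follows the query shown above.

```python
def build_influencer_query(niche, country="", min_followers=0, count=5):
    """Assemble the agent prompt from the user's inputs."""
    niche = niche.strip()
    if not niche:
        raise ValueError("niche must not be empty")
    lines = [
        f"Find a list of at least {count} TikTok influencers highly relevant to the niche: '{niche}'.",
        "Apply these filters:",
        f"- Country: {country.strip() or 'Any'}",  # An empty filter falls back to 'Any'
        f"- Minimum Follower Count: {int(min_followers)}",
        "For each influencer, provide:",
        "1. TikTok Username",
        "2. Full Name (if known)",
        "3. Approximate Follower Count",
        "4. Niche",
        "5. TikTok profile or verified link",
        "Format as Markdown list.",
    ]
    return "\n".join(lines)


print(build_influencer_query("sustainable running shoes", country="UK", min_followers=50000))
```

The button handler could then call agent.run(build_influencer_query(user_niche, country_filter, min_followers)) and keep the UI code focused on display.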

Step 11: Follow-up Q&A Logic

To let users ask further questions about the influencers they find, we will add follow-up logic that passes the previously obtained data directly to the LLM as context.

# Follow-up Q&A code
st.markdown("---")
st.subheader("Ask a follow-up question about the influencers ✍️")
follow_up_question = st.text_input("Your question:", key="followup_question")

if follow_up_question and "last_influencer_data" in st.session_state:
    context = st.session_state["last_influencer_data"]
    qna_prompt = f"""
    Based on the following influencer data:
    {context}

    Answer the following question:
    {follow_up_question}
    """
    token_count = estimate_tokens(qna_prompt)
    if token_count > 20000:
        st.error(f"Follow-up too long ({token_count} tokens). Try shortening your question or data.")
    else:
        try:
            st.info("🧠 Thinking...")
            follow_up_response = llm.invoke(qna_prompt)
            st.markdown(follow_up_response.content)
        except Exception as e:
            st.error(f"LLM follow-up failed: {e}")

The code achieves the following:

st.markdown("---"): Adds a horizontal rule for visual separation.
st.subheader(...) and st.text_input(...): Create a section for the user to input a follow-up question.
context = st.session_state["last_influencer_data"]: Retrieves the previously found influencer data to provide context for the LLM.
qna_prompt = f"""...""": Constructs a new prompt for the LLM. This prompt includes the context (the influencer data) and the follow_up_question, instructing the LLM to answer based on that information.
Token Count Check: Similar to the main search, it checks the token length of the follow-up prompt to prevent errors.
Invoking the LLM Directly: follow_up_response = llm.invoke(qna_prompt). Unlike agent.run(), llm.invoke() sends the prompt directly to the LLM without involving the agent’s tool-use reasoning. The LLM then processes the prompt (context + question) and generates an answer. follow_up_response.content extracts the actual text of the response, while st.markdown(follow_up_response.content) displays the LLM’s answer in Markdown format.
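Since a chat model’s invoke() returns a message object rather than a plain string, a small helper can make the display code tolerant of both shapes. This is a minimal sketch; message_text and the FakeMessage class are hypothetical names used only for illustration.

```python
def message_text(response):
    """Return the text of a response, whether it is a message object or a string."""
    content = getattr(response, "content", response)
    return content if isinstance(content, str) else str(content)


class FakeMessage:
    """Stand-in for a chat message object that exposes a .content attribute."""

    def __init__(self, content):
        self.content = content


print(message_text(FakeMessage("Three of the five creators are UK-based.")))
print(message_text("already a plain string"))
```

In the follow-up block, st.markdown(message_text(follow_up_response)) would then work regardless of which kind of object comes back.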

Step 12: Footer

Why not add a simple footer to give credit to the technologies we used? This is good practice, especially if you’re building this project for your personal portfolio: recruiters can see at a glance which tools you used to develop your app.

# --- Footer ---
st.markdown("---")
st.markdown("Powered by ScraperAPI, Langchain and OpenRouter (Qwen)", unsafe_allow_html=True)

Here’s the explanation for the code above:

st.markdown("---"): Adds another horizontal rule.
st.markdown("... ", unsafe_allow_html=True): Displays a small, centered, grey-colored text at the bottom of the page, crediting the technologies used.
unsafe_allow_html=True is necessary because you’re embedding raw HTML within the Markdown.

Here’s a snippet of what the tool’s UI looks like:

Step 13: Run your script

Now that all the steps are in place, you can run your code with Streamlit by doing:

streamlit run your_script_name.py

Deploying the TikTok Brand-Influencer Scouting Tool Using Streamlit

Here’s how to deploy our TikTok Brand-Influencer Scouting app on Streamlit for free in just a few steps:

Step 1: Set Up a GitHub Repository

Streamlit requires your project to be hosted on GitHub.

1. Create a New Repository on GitHub

Create a new repository on GitHub and set it as public.

2. Push Your Code to GitHub

Before doing anything else, create a .gitignore file to avoid accidentally uploading sensitive files like your .env file. Add the following to it:

.env
__pycache__/
*.pyc
*.pyo
*.pyd
.env.*
.secrets.toml

If you haven’t already set up Git and linked your repository, use the following commands in your terminal from within your project folder:

git init
git add .
git commit -m "Initial commit"
git branch -M main
# With HTTPS (run either this command or the SSH one below, not both)
git remote add origin https://github.com/YOUR_USERNAME/your-repo.git
# With SSH
git remote add origin git@github.com:YOUR_USERNAME/your-repo.git

git push -u origin main

If it’s your first time using GitHub from this machine, you might need to set up an SSH connection. Here is how.

Step 2: Define Dependencies and Protect Your Secrets!

Streamlit needs to know what dependencies your app requires.

1. In your project folder, automatically create a requirements file by running:

pip freeze > requirements.txt

2. Commit it to GitHub:

git add requirements.txt
git commit -m "Added dependencies"
git push origin main

3. Do the same for your app file containing all your code:

git add your-script.py 
git commit -m "Added app script" 
git push origin main

Step 3: Deploy on Streamlit Cloud

1. Go to Streamlit Community Cloud.

2. Click “Sign in with GitHub” and authorize Streamlit.

3. Click “Create App.”

4. Select “Deploy a public app from GitHub repo.”

5. In the repository settings, enter:

Repository: YOUR_USERNAME/TikTok-Influencer-Finder
Branch: main
Main file path: app.py (or whatever your Streamlit script is named)

6. Click “Deploy” and wait for Streamlit to build the app.

7. Go to your deployed app dashboard, find your app, and find “Secrets” in “Settings”. Add your environment variables (your API keys) just as you have them locally in your .env file.

Step 4: Get Your Streamlit App URL

After deployment, Streamlit will generate a public URL (e.g., https://your-app-name.streamlit.app). You can now share this link to allow others access to your tool!

Conclusion

In this tutorial, you have learned how to build a TikTok Brand-Influencer Scouting Tool that uses the ScraperAPI-LangChain agent for smart, autonomous data extraction, Qwen3 for contextual insights and follow-up queries, and Streamlit for building a user-friendly interface and hosting the app in the cloud for free.

This tool aids influencer marketing, enabling brands to identify creators whose niche, voice and vision align perfectly with their own. By including filtering options and follow-up questions, it moves beyond just basic metrics to find influencers who can truly spark authentic engagement and drive targeted marketing efforts.

Ready to build your own?

Start using ScraperAPI today and transform your influencer scouting process into a streamlined, highly effective strategy!

The post Build a TikTok Brand-Influencer Scouting Tool Using ScraperAPI-LangChain Agent, Qwen3, and Streamlit appeared first on ScraperAPI.

How to Scrape Geo-Restricted Data Without Getting Banned

Justas Palekas — Sat, 28 Jun 2025 09:33:57 +0000

While the internet is often considered free and open to all, some geographical restrictions are still placed on some websites.

Sometimes, the changes are subtle, such as automatically switching languages. In other cases, entirely different content is served (e.g., Netflix) for people from different countries or regions.

At the extremes, some websites are entirely inaccessible unless your IP address is from a specified country. While all of these restrictions do serve a proper purpose, they also make web scraping significantly more difficult.

There are several ways to bypass geographical restrictions while scraping, such as using proxies within your own scraper or using pre-built solutions that take care of the hassle.

Using Proxies to Bypass Geo-restrictions

Residential proxies are, in fact, the way most scrapers bypass geo-restrictions, since you get an IP address from a device that’s physically located in a country of your choice. When a proxy relays requests from your machine to a website, the website will think that the request originates from within that country.

While there are numerous other proxy types, residential proxies are generally regarded as your best bet for most web scraping tasks, especially those that involve geographical restrictions. Datacenter proxies, while fast and cheap, have a limited range of locations and are more easily detected.

ISP proxies would work perfectly fine as they are as legitimate as residential proxies and as fast as datacenter proxies, but they’re one of the most expensive options available. Additionally, the pool of IPs is usually quite limited.

While purchasing proxies directly from a provider and integrating them into your scraper is definitely efficient, it has one caveat: you still need to build the scraper itself. Extensive programming knowledge is required for any scraping project that has a decent scope.

Constant updates to the scraping solution will also be required as minor changes in layouts, website code, or anything in between will cause it to either break completely or return improper results.

Then there’s the headache of data parsing and storage, both of which are complicated topics on their own.

So, while buying proxies from a provider can be a good solution for some, it’s usually reserved for those who can build a scraper on their own.
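For readers who do go the do-it-yourself route, routing a request through a proxy with the requests library mostly comes down to passing a proxies dictionary. The sketch below only builds that dictionary; the host, port, and credentials are placeholders, and build_proxies is a hypothetical helper name, since each provider documents its own gateway address and authentication format.

```python
from urllib.parse import quote


def build_proxies(host, port, username, password):
    """Build the proxies mapping that requests expects for an authenticated HTTP proxy."""
    # Percent-encode credentials so characters like '@' or ':' do not break the URL.
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Placeholder values; substitute the details from your proxy provider.
proxies = build_proxies("proxy.example.com", 8000, "user", "p@ss")
print(proxies["https"])

# Hypothetical usage with the requests library:
# resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
```

Everything after that (retries, rotation, parsing, storage) is still yours to build, which is the trade-off described above.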

Using ScraperAPI

ScraperAPI manages the entire scraping pipeline, from proxies to data delivery, for its users. There’s no need to build something from the ground up: you can start scraping as soon as you get a plan and write some basic code.

We’ll be using Python to send requests to the ScraperAPI endpoint to retrieve data from websites.

Preparation

First, you’ll need an IDE such as PyCharm or Visual Studio Code to run your code. Then you should register for an account with ScraperAPI.

Note: ScraperAPI’s free trial is enough to test out geotargeting as it provides access to all of the premium features. Once that expires, however, you’ll need one of the paid plans unless US/EU geotargeting is enough for your use case.

Once you have everything set up, we’ll be using the requests library to send HTTP requests to the ScraperAPI endpoint. Since requests is a third-party library, we’ll need to install it first:

pip install requests

That’s the only library you’ll need since ScraperAPI does all the heavy lifting for you. All you need to do is begin writing code and getting the URLs you want to scrape.

Sending a simple request

It’s often best to start simple and increase code complexity as you go. We’ll start by sending a GET request to a website that restricts EU users to verify what happens if we do not use residential proxies or ScraperAPI:

import requests
resp = requests.get('https://www.chicagotribune.com/')
print(resp.text)

We’ll be using the requests library throughout, so we have to import it first. Sending a GET request is simple: call requests.get() and pass in the URL as a string (single or double quotes both work).

Then we print resp.text, the attribute of the response object that holds the body of the response as a string.

You’ll receive an error message, as attempting to get any response from The Chicago Tribune while using an EU IP address sends the same error message every time:

If you were to use a US IP address with an EU-locked website, you’d get a similar response. They all differ slightly; however, the end result is the same.

import requests
resp = requests.get('https://www.rte.ie/player/')
print(resp.text)

RTE restricts users to EU only, so with a US IP address, you get:

So, using either ScraperAPI or residential proxies will be necessary to access some websites. Let’s start by sending a request through ScraperAPI:

import requests
payload = {'api_key': 'YOUR-API-KEY-HERE', 'url': 'https://httpbin.org/ip'}
resp = requests.get('https://api.scraperapi.com', params=payload)
print(resp.text)

As always, start by importing the necessary library (requests in our case). Then, define a dictionary object that has two key:value pairs – the API key (required for authentication) and the URL, which is the website you want to scrape.

We then create a response object that will store the answer retrieved from the website. You’ll need two arguments, the first of which is always the ScraperAPI endpoint, the second of which is the payload dictionary.

For now, we simply print the response. Running the code should retrieve the origin IP address and print it to standard output.
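To see what the params argument does, it can help to build the final request URL by hand. The sketch below uses only the standard library and makes no network call; build_request_url is a hypothetical helper, and the resulting string is roughly what requests sends on your behalf.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.scraperapi.com"


def build_request_url(api_key, target_url, **extra_params):
    """Encode the payload as a query string appended to the ScraperAPI endpoint."""
    params = {"api_key": api_key, "url": target_url, **extra_params}
    # urlencode percent-encodes the target URL so it survives as a single parameter.
    return f"{API_ENDPOINT}/?{urlencode(params)}"


print(build_request_url("YOUR-API-KEY-HERE", "https://httpbin.org/ip"))
```

Any extra keyword arguments end up as additional query parameters, which is how further options are passed alongside the API key and target URL.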

Selecting a geographical location

We’ll now switch to scraping websites that show data based on location, such as displaying different prices, currencies, or content in general.

Let’s start by implementing a country code in our ScraperAPI code to visit The Chicago Tribune and see if we get a response.

All you need to do is add an additional key:value pair to your payload dictionary, with country_code as the key and the country code in the 2-letter ISO 3166-1 format as the value.

import requests
payload = {'api_key': 'YOUR-API-KEY-HERE', 'url': 'https://www.chicagotribune.com/', 'country_code': 'us'}
resp = requests.get('https://api.scraperapi.com', params=payload)
print(resp.text)

You should get a large HTML response showcasing lots of data. Our screenshot is truncated for demonstration purposes:

Parsing data with BeautifulSoup

We’ll start by installing BeautifulSoup:

pip install beautifulsoup4

Now we’ll need to make some modifications to our code:

We’ll put the response text (the full HTML file) into a BeautifulSoup object that will be used for parsing.
Then, a list will be created to store all the article titles.
For the output, we’ll run another loop that prints each title on a new line.

import requests
from bs4 import BeautifulSoup
payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]
unique_titles = sorted(set(titles))
for t in unique_titles:
    print(t)

Note that we also create unique_titles by converting the list to a set and then sorting it. Sets in Python do not store duplicate values, so this is an easy way to remove duplicate titles from our original list, and sorted() returns the result as an alphabetically ordered list.
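As a small illustration of that deduplication step, the sketch below compares sorted(set(...)) with an order-preserving alternative. The headlines are made-up sample data, not real scraped output.

```python
titles = ["Cubs win", "Weather update", "Cubs win", "City budget"]  # made-up sample

# sorted(set(...)) removes duplicates and returns an alphabetically sorted list.
unique_sorted = sorted(set(titles))

# dict.fromkeys(...) also removes duplicates but keeps first-seen order,
# which matters if the order of headlines on the page is meaningful.
unique_in_page_order = list(dict.fromkeys(titles))

print(unique_sorted)
print(unique_in_page_order)
```

Which one to use depends on whether you care more about alphabetical output or about the original page order.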

You should get a response that’s similar to:

Finally, some websites display the same page with different data for many geographical locations. Most ecommerce businesses do that to make prices more transparent for users.

Storing data with Pandas

You’ll likely want to do more than just print out data. Otherwise, you’ll lose everything after closing your IDE or any other program.
Usually, the pandas library is more than enough for basic scraping projects. Start by installing:

pip install pandas

We’ll also import the default datetime library to add time stamps to our CSV file, which is highly useful if you need to return to it later.

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]
unique_titles = sorted(set(titles))
df = pd.DataFrame({"Headline": unique_titles})
today = datetime.now().strftime("%Y-%m-%d")
outfile = f"chicago_tribune_headlines_{today}.csv"
df.to_csv(outfile, index=False, encoding="utf-16")
print(f"✔ Saved {len(df)} headlines → {outfile}")

Running the code will now create a CSV file and print a success message. The underlying code is quite simple: a dataframe is created with a single column named “Headline”, and each row holds one of the titles.

To add a timestamp to the file name, we call datetime.now() and turn the result into a string using strftime, passing the desired format as the argument.

Finally, the dataframe is outputted into a CSV file.

Note: We use “utf-16” encoding because some spreadsheet programs do not display every character correctly when they open a plain “utf-8” CSV file.
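UTF-8 can represent every Unicode character, so when a UTF-8 CSV looks wrong the usual cause is that the program opening it guessed a different encoding. As a hedged alternative to UTF-16, Python’s "utf-8-sig" codec writes a byte order mark (BOM) that many spreadsheet programs use to detect UTF-8. The sketch below only compares the encodings; the sample headline is made up.

```python
headline = "Café owners weigh in on Chicago’s new budget"  # made-up sample text

as_utf8 = headline.encode("utf-8")
as_utf8_sig = headline.encode("utf-8-sig")   # UTF-8 with a leading BOM
as_utf16 = headline.encode("utf-16")         # what the script above uses

# All three encodings round-trip the text without loss.
assert as_utf8.decode("utf-8") == headline
assert as_utf8_sig.decode("utf-8-sig") == headline
assert as_utf16.decode("utf-16") == headline

# The only difference between the first two is the 3-byte marker.
print(as_utf8_sig[:3])
print(len(as_utf8), len(as_utf8_sig), len(as_utf16))
```

If you want to try this in the script, to_csv accepts encoding="utf-8-sig" in the same place where "utf-16" is passed.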

Your CSV file should look a little like this:

Further considerations

Scraping a single website with ScraperAPI is ultimately a little too simple for any real-world project, although it serves as a great starting point. You can improve your scraping code in two primary ways.

One is that you can use ScraperAPI to scrape the homepage of a website, collect all the URLs, and continue in such a fashion, building your list of URLs.

Alternatively, you can create a list object manually and input all the URLs you want to scrape, then run a loop iterating over each element as the URL.
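The second approach can be sketched as a simple loop that builds one payload per URL. To keep the example runnable without network access, the function takes the fetching step as a parameter; scrape_many, fake_fetch, and the second URL are all hypothetical names chosen for illustration.

```python
def scrape_many(urls, fetch, api_key="YOUR-API-KEY-HERE", country_code="us"):
    """Call fetch(payload) once per URL and collect the results by URL."""
    results = {}
    for url in urls:
        payload = {"api_key": api_key, "url": url, "country_code": country_code}
        results[url] = fetch(payload)
    return results


def fake_fetch(payload):
    # Placeholder that echoes the target instead of calling the network.
    return f"<html>{payload['url']}</html>"


urls_to_scrape = [
    "https://www.chicagotribune.com/",
    "https://www.chicagotribune.com/business/",  # hypothetical second page
]
pages = scrape_many(urls_to_scrape, fake_fetch)
print(len(pages))
```

In a real run, fetch could be something like lambda payload: requests.get("https://api.scraperapi.com", params=payload).text, so the rest of the loop stays unchanged.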

Here’s the full code block that you can work upon:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]
unique_titles = sorted(set(titles))
df = pd.DataFrame({"Headline": unique_titles})
today = datetime.now().strftime("%Y-%m-%d")
outfile = f"chicago_tribune_headlines_{today}.csv"
df.to_csv(outfile, index=False, encoding="utf-16")
print(f"✔ Saved {len(df)} headlines → {outfile}")

The post How to Scrape Geo-Restricted Data Without Getting Banned appeared first on ScraperAPI.

Speed Up Web Scraping with ScraperAPI’s Concurrent Threads

Srujana Maddula — Sat, 28 Jun 2025 07:11:23 +0000

If you’ve ever built a web scraper, you know the pain. You build a scraper, and it works great on 1,000 pages, but the moment you scale up to 10,000 or more, things become slow. Here’s the good news: there’s a fix!

In this article, you’ll learn everything about:

What concurrent threads are
How to set up ScraperAPI’s concurrent threads
How to use them to scrape web pages faster and more efficiently

So, what are concurrent threads?

If you’ve used ScraperAPI before, you already know the basics—you hit the API to fetch the pages you need. With concurrent threads, you can send multiple requests at the same time. Instead of scraping one page, waiting, and then scraping the next, you can run several requests in parallel and get results way faster.

Let’s say you’re using 5 concurrent threads. That means you’re making 5 requests to ScraperAPI at once, all running in parallel. So, the more threads you use, the more requests you can send at once, and the faster your scraper runs.
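To get a feel for the effect before touching the API, here is a small self-contained sketch. It replaces the network call with a short time.sleep() (an assumption made purely so the example runs offline) and compares ten "requests" sent one at a time with the same ten sent through five threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(page_id):
    """Stand-in for one HTTP request: wait, then return a result."""
    time.sleep(0.05)
    return page_id


page_ids = list(range(10))

start = time.perf_counter()
sequential_results = [fake_request(p) for p in page_ids]
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    parallel_results = list(executor.map(fake_request, page_ids))
parallel_seconds = time.perf_counter() - start

print(f"one at a time: {sequential_seconds:.2f}s, 5 threads: {parallel_seconds:.2f}s")
```

The results come back in the same order either way, because executor.map preserves input order; only the waiting overlaps.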

Each ScraperAPI plan comes with its own thread limit. For example:

The Business plan gives you up to 100 concurrent threads
The Scaling plan bumps that up to 200 threads

However, if your scraping needs go beyond that, we’ve got you covered with our Enterprise plan. With Enterprise, there’s no fixed cap. We work with you to tailor a custom thread limit based on your exact use case so you get the best speed and performance.

How to increase your scraping speed?

Now that we know what concurrent threads are, it’s time to see them in action.

We’ll run a simple experiment to test how performance scales with different thread limits and show just how much speed you can unlock.

First, we’ll create a list of 1000+ URL samples. To do that, we’ll crawl https://edition.cnn.com/business/tech and extract URLs using open-source tools like Scrapy. This step is just to get the sample URLs we want to scrape. In your case, these URLs would be the actual pages that you need to scrape.

Once we have the list of URLs, we’ll hit the ScraperAPI endpoint twice:

First, using 100 concurrent threads.
Then again, with 500 concurrent threads.

Finally, we’ll measure how long each run takes.

Stage 1: Create a list of sample URLs to scrape

Follow these steps to create a list of URLs from https://edition.cnn.com/business/tech:
Step 1: Open the command prompt or terminal, go to your project folder, and install Scrapy and BeautifulSoup (which we will need later).

pip install scrapy bs4

Step 2: Start a new Scrapy project.

scrapy startproject cnn_scraper
cd cnn_scraper

Step 3: Go inside the /spiders folder and create a Python file.

cd spiders
touch cnn_spider.py

Step 4: In your IDE, go to cnn_scraper/spiders/cnn_spider.py and paste the following code:

import scrapy

from urllib.parse import urljoin, urlparse

class CnnSpider(scrapy.Spider):

   name = "cnn"  
   allowed_domains = ["edition.cnn.com"]
   start_urls = ["https://edition.cnn.com/business/tech"]
   seen_urls = set()

   custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }

   def parse(self, response):
       links = response.css("a::attr(href)").getall()

       for link in links:
           if link.startswith("/"):
               full_url = urljoin("https://edition.cnn.com", link)
           elif link.startswith("http") and "edition.cnn.com" in link:
               full_url = link
           else:
               continue
           
           if full_url not in self.seen_urls:
               self.seen_urls.add(full_url)
               yield {"url": full_url}
               yield response.follow(full_url, callback=self.parse)

       if len(self.seen_urls) >= 1000:
           self.crawler.engine.close_spider(self, "URL limit reached")

In the above code, custom_settings sets the User-Agent header Scrapy sends with each request, making the spider look like a real browser. The parse() function calls getall() on the CSS selector to collect every link on the current page, then turns relative links into full URLs. The condition if full_url not in self.seen_urls ensures that only links you haven’t seen before are processed. Once 1,000 unique URLs have been collected, the spider shuts itself down.
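The link-handling logic can also be tried outside Scrapy. The sketch below is a variation on the spider’s approach rather than a copy of it: it compares the parsed host name instead of searching for "edition.cnn.com" anywhere in the link. normalize_link and the sample links are hypothetical.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://edition.cnn.com"
ALLOWED_HOST = "edition.cnn.com"


def normalize_link(link):
    """Return an absolute URL on the allowed host, or None to skip the link."""
    full_url = urljoin(BASE_URL, link)  # leaves absolute URLs unchanged
    parsed = urlparse(full_url)
    if parsed.scheme in ("http", "https") and parsed.netloc == ALLOWED_HOST:
        return full_url
    return None


sample_links = [
    "/business/tech",
    "https://edition.cnn.com/world",
    "/business/tech",                             # duplicate
    "https://example.com/?ref=edition.cnn.com",   # other host, skipped
    "mailto:someone@example.com",                 # not a web link, skipped
]

seen_urls = set()
for href in sample_links:
    url = normalize_link(href)
    if url and url not in seen_urls:
        seen_urls.add(url)

print(sorted(seen_urls))
```

Checking the host this way avoids following an off-site link that merely mentions the domain in its query string.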

Step 5: To run the above code and save the URLs into a JSON file, execute the following command from the cnn_scraper/spiders folder:

scrapy crawl cnn -o urls.json

Stage 2: Let’s scrape the saved URLs using ScraperAPI

Step 1: Create a Python file–I named mine scraper_api.py, but you can pick whatever name works for you–and paste the following code in it:

import requests
import json
import csv
import time
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

API_KEY = 'ScraperAPI API_key'
NUM_RETRIES = 3
NUM_THREADS = 100
with open("path/to/URLs_json_file", "r") as file:
    raw_data = json.load(file)
    list_of_urls = [item["url"] for item in raw_data if "url" in item]

def scrape_url(url):
   params = {
       'api_key': API_KEY,
       'url': url
   }

   for _ in range(NUM_RETRIES):
       try:
           response = requests.get('http://api.scraperapi.com/', params=params)
           if response.status_code in [200, 404]:
               break
       except requests.exceptions.ConnectionError:
           continue
   else:
       return {
           'url': url,
           'h1': 'Failed after retries',
           'title': '',
           'meta_description': '',
           'status_code': 'Error'
       }

   if response.status_code == 200:
       soup = BeautifulSoup(response.text, "html.parser")
       h1 = soup.find("h1")
       title = soup.title.string.strip() if soup.title and soup.title.string else "No Title Found"
       meta_tag = soup.find("meta", attrs={"name": "description"})
       meta_description = meta_tag["content"].strip() if meta_tag and meta_tag.has_attr("content") else "No Meta Description"
       return {
           'url': url,
           'h1': h1.get_text(strip=True) if h1 else 'No H1 found',
           'title': title,
           'meta_description': meta_description,
           'status_code': response.status_code
       }
   else:
       return {
           'url': url,
           'h1': 'No H1 - Status {}'.format(response.status_code),
           'title': '',
           'meta_description': '',
           'status_code': response.status_code
       }

start_time = time.time()

#concurrent threads 
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
   scraped_data = list(executor.map(scrape_url, list_of_urls))


elapsed_time = time.time() - start_time
print(f"Using {NUM_THREADS} concurrent threads, scraping completed in {elapsed_time:.2f} seconds.")

# Save to CSV
with open("cnn_h1_1000_1_results.csv", "w", newline='', encoding="utf-8") as f:
   writer = csv.DictWriter(f, fieldnames=["url", "h1", "title", "meta_description", "status_code"])
   writer.writeheader()
   writer.writerows(scraped_data)

The function scrape_url(url) sends a request to ScraperAPI using the given URL. If the response status code is not 200 or 404, it retries up to NUM_RETRIES times. If it gets a 200 OK, it uses BeautifulSoup to parse the H1, title, and meta description.
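The retry logic relies on Python’s for ... else construct, which can be easy to misread. Here is a stripped-down sketch of the same pattern with fake fetchers instead of real HTTP calls. The names fetch_with_retries, flaky_fetch, and always_500 are hypothetical, and the built-in ConnectionError stands in for requests.exceptions.ConnectionError.

```python
def fetch_with_retries(fetch, url, num_retries=3):
    """Return the first acceptable status code, or None if every try failed."""
    for _ in range(num_retries):
        try:
            status = fetch(url)
            if status in (200, 404):
                break  # skips the else branch below
        except ConnectionError:
            continue
    else:
        # Runs only when the loop finished without hitting break.
        return None
    return status


attempts = []


def flaky_fetch(url):
    # Fake fetcher: fails twice, then succeeds.
    attempts.append(url)
    if len(attempts) < 3:
        raise ConnectionError("simulated network error")
    return 200


def always_500(url):
    # Fake fetcher: the server keeps answering with an error status.
    return 500


ok_status = fetch_with_retries(flaky_fetch, "https://example.com")
failed_status = fetch_with_retries(always_500, "https://example.com")
print(ok_status, failed_status)
```

The else branch belongs to the for loop, not to the if: it covers both "every attempt raised" and "every attempt returned an unwanted status".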

The part ThreadPoolExecutor(max_workers=NUM_THREADS) sends the concurrent requests to ScraperAPI. In the end, the code saves the scraped data to a CSV file.
When NUM_THREADS == 100, it took 100.68 seconds to scrape the titles.

In the same code, we only changed the number of concurrent threads to 500; now it took just 23.56 seconds.

Just like that, I slashed the scraping time from around 100 seconds down to just 23 seconds. That’s more than 4 times faster with 500 threads compared to 100!

To optimize your performance with custom concurrent threads, upgrade to our custom enterprise plan today.

The post Speed Up Web Scraping with ScraperAPI’s Concurrent Threads appeared first on ScraperAPI.

Integrating ScraperAPI with Data Cleaning Pipelines

Ize Majebi — Mon, 26 May 2025 10:45:17 +0000

Collecting clean, usable data is the foundation of any successful web scraping project. However, web data is often filled with inconsistencies, duplicates, and irrelevant content, making it hard to work with straight out of the source.

That’s where combining ScraperAPI with data cleaning pipelines comes in. ScraperAPI helps you reliably extract data from websites—even those with complex anti-scraping protections—while Python’s data tools make it easy to clean, structure, and prepare that data for use.

In this guide, you’ll learn how to:

Set up ScraperAPI for web scraping
Use ETL (Extract, Transform, Load) techniques to clean and organize your data
Integrate these tools into a workflow that’s fast, flexible, and scalable

Ready? Let’s get started!

What is ETL?

ETL stands for Extract, Transform, Load. It is a data processing framework used to move data from one or more sources, clean it, and store it in a system where it can be analyzed. This process is essential for handling large volumes of data from various sources, preparing it for reporting and informed decision-making.

The Three Stages of ETL

Extract: In this initial phase, raw data is gathered from its source. In our case, that means scraping websites. This can be tricky, as websites often implement anti-scraping measures like IP bans, CAPTCHAs, and dynamic content loading through JavaScript. To manage these challenges and streamline the extraction process, we’ll use ScraperAPI, a tool designed to simplify and automate data collection at scale.
Transform: Once data is extracted, it’s often messy or inconsistent. The real cleanup happens in the transformation stage: the data is validated, standardized, and restructured into a usable format. This is a crucial step for ensuring data quality and consistency.
Load: Finally, the cleaned and transformed data is loaded into a storage system. Depending on the project, this could be a CSV file, a relational database (like PostgreSQL or MySQL), a NoSQL database (like MongoDB), a data warehouse (like BigQuery, Redshift, or Snowflake), or even a data lake. We’ll keep this tutorial simple and load the data into a CSV file.

ScraperAPI and Python for ETL

Web-scraped data is often messy, inconsistent, and unstructured—not yet ready for analysis or decision-making. That’s where ETL becomes essential. It brings structure, cleanliness, and reliability to chaotic web data, making it more valuable.

Let’s break down how this works in the context of scraping real estate listings:

Extract: Use ScraperAPI to pull raw HTML from multiple real estate website pages. ScraperAPI handles the toughest parts of web scraping—IP rotation, user-agent spoofing, CAPTCHA solving, and even JavaScript rendering—so you can focus on getting the data instead of fighting anti-bot defenses.
Transform: With libraries like BeautifulSoup and Pandas, you can clean and standardize your data for analysis using Python:
- Parse price fields, stripping currency symbols and converting values to a numeric format.
- Standardize inconsistent text (e.g., “3 bdr”, “three beds”) into a single numeric format (e.g., 3).
- Normalize square footage to a consistent unit and data type.
- Handle missing values for features such as balconies or garages.
- Identify and remove duplicate listings that may appear due to frequent site updates.
Load: Once the data is cleaned and transformed, use Pandas to export it into a structured format like a CSV for reporting or analysis, or load it directly into a database for long-term storage and querying.
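As one illustration of the Transform stage, here is a minimal sketch of standardizing bedroom text, including spelled-out counts. The function name and the small word-to-number table are assumptions made for this example; a real dataset would likely need a larger vocabulary.

```python
import re

# Assumed vocabulary for spelled-out counts; extend it for your own data.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def normalize_bedrooms(text):
    """Return the bedroom count as an int, or None if none can be found."""
    lowered = text.lower()
    digits = re.search(r"\d+", lowered)
    if digits:
        return int(digits.group())
    for word in re.findall(r"[a-z]+", lowered):
        if word in WORD_NUMBERS:
            return WORD_NUMBERS[word]
    return None


samples = ["3 bdr", "three beds", "Studio"]
normalized = [normalize_bedrooms(s) for s in samples]
print(normalized)
```

Returning None for text with no recognizable count makes it easy to treat those rows as missing values later on.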

With Python and ScraperAPI together, you have a powerful ETL toolkit:

ScraperAPI simplifies and hardens the Extract phase.
With its rich data handling capabilities, Python covers Transform and Load with flexibility and precision.

This ETL pipeline helps ensure that your scraped data is accurate, consistent, and ready for use, whether you are analyzing market trends or building a real estate dashboard.

Project Requirements

Before diving into the integration, make sure you have the following:

1. A ScraperAPI Account: Sign up on the ScraperAPI website to get your API key. ScraperAPI will handle proxy rotation, CAPTCHA solving, and JavaScript rendering, making the extraction phase a breeze. You’ll receive 5,000 free API credits when you sign up for a seven-day trial, starting whenever you’re ready.

2. A Python Environment: Ensure Python (version 3.7+ recommended) is installed on your system. You’ll also need to install key libraries:

requests: For making HTTP requests to ScraperAPI.
beautifulsoup4: For parsing HTML and XML content.
pandas: For data manipulation and cleaning.
python-dotenv: to load your credentials from your .env file and manage your API key securely.
lxml (optional but recommended): A fast and efficient XML and HTML parser that BeautifulSoup can use.

You can install them using pip with this command:

pip install requests beautifulsoup4 pandas lxml python-dotenv

3. Basic Web Scraping Knowledge: A foundational understanding of HTML structure, CSS selectors, and how web scraping works will be beneficial.

4. An IDE or Code Editor: Such as VS Code, PyCharm, or Jupyter Notebook for writing and running your Python scripts.

TL;DR;

For those in a hurry, here’s the full scraper we are going to be building:

import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# === Load environment variables from .env file ===
load_dotenv()
SCRAPER_API_KEY = os.getenv('SCRAPER_API_KEY')
IDEALISTA_URL = os.getenv('IDEALISTA_URL')
SCRAPER_API_URL = f"http://api.scraperapi.com/?api_key={SCRAPER_API_KEY}&url={IDEALISTA_URL}"


# === Extract ===
def extract_data(url):
    response = requests.get(url)
    extracted_data = []

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        listings = soup.find_all('article', class_='item')

        for listing in listings:
            title = listing.find('a', class_='item-link').get('title')
            price = listing.find('span', class_='item-price').text.strip()

            item_details = listing.find_all('span', class_='item-detail')
            bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
            area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"

            description = listing.find('div', class_='item-description')
            description = description.text.strip() if description else "N/A"

            tags = listing.find('span', class_='listing-tags')
            tags = tags.text.strip() if tags else "N/A"

            images = [img.get("src") for img in listing.find_all('img')] if listing.find_all('img') else []

            extracted_data.append({
                "Title": title,
                "Price": price,
                "Bedrooms": bedrooms,
                "Area": area,
                "Description": description,
                "Tags": tags,
                "Images": images
            })
    else:
        print(f"Failed to extract data. Status code: {response.status_code}")

    return extracted_data


# === Transform ===
def transform_data(data):
    df = pd.DataFrame(data)

    df['Price'] = (
        df['Price']
        .str.replace('€', '', regex=False)
        .str.replace(',', '', regex=False)
        .str.strip()
        .astype(float)
    )

    def extract_bedrooms(text):
        match = re.search(r'\d+', text)
        return int(match.group()) if match else None

    df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)

    df['Area'] = (
        df['Area']
        .str.replace('m²', '', regex=False)
        .str.replace(',', '.', regex=False)
        .str.strip()
        .astype(float)
    )

    df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)
    df = df[df['Bedrooms'] == 3]

    return df


# === Load ===
def load_data(df, filename='three_bedroom_houses.csv'):
    df.to_csv(filename, index=False)
    print(f"Saved {len(df)} listings to {filename}")


# === Main pipeline ===
def main():
    print("Starting ETL pipeline for Idealista listings...")

    raw_data = extract_data(SCRAPER_API_URL)
    if not raw_data:
        print("No data extracted. Check your API key or target URL.")
        return

    print(f"Extracted {len(raw_data)} listings.")

    cleaned_data = transform_data(raw_data)
    print(f"{len(cleaned_data)} listings after cleaning and filtering.")

    load_data(cleaned_data)


if __name__ == "__main__":
    main()

Want to see how we built it? Keep reading!

Building a Real Estate ETL Pipeline with ScraperAPI and Python

In this section, we’ll build a working ETL pipeline that scrapes real estate listings from Idealista using ScraperAPI, cleans the data with Python, and saves it in a structured CSV file. We’ll walk through each part of the process—extracting the data, transforming it into a usable format, and loading it for analysis—so you’ll have a complete and reusable workflow by the end.

Step 1: Extracting: Using ScraperAPI

Most real estate websites are known for blocking scrapers, which makes collecting data at any meaningful scale challenging. For that reason, we’ll send our get() requests through ScraperAPI, effectively bypassing Idealista’s anti-scraping mechanisms without complicated workarounds.
For this guide, we’ll update an existing ScraperAPI real estate project to demonstrate the integration. You can find the complete guide on scraping Idealista with Python here.

import requests
from bs4 import BeautifulSoup

scraper_api_key = 'YOUR_SCRAPERAPI_KEY'  # Replace with your ScraperAPI key
idealista_query = "https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/"
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={idealista_query}'

response = requests.get(scraper_api_url)

# List that will hold one dictionary per listing
extracted_data = []

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract each house listing post
    house_listings = soup.find_all('article', class_='item')

    # Loop through each house listing and extract information
    for listing in house_listings:
        # Extracting relevant information
        title = listing.find('a', class_='item-link').get('title')
        price = listing.find('span', class_='item-price').text.strip()

        # Find all span elements with class 'item-detail'
        item_details = listing.find_all('span', class_='item-detail')

        # Extracting bedrooms and area from the item_details
        bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
        area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"
        description = listing.find('div', class_='item-description').text.strip() if listing.find('div', class_='item-description') else "N/A"
        tags = listing.find('span', class_='listing-tags').text.strip() if listing.find('span', class_='listing-tags') else "N/A"

        # Extracting images
        image_elements = listing.find_all('img')
        images = [img.get("src") for img in image_elements] if image_elements else []

        # Store extracted information in a dictionary
        listing_data = {
            "Title": title,
            "Price": price,
            "Bedrooms": bedrooms,
            "Area": area,
            "Description": description,
            "Tags": tags,
            "Images": images
        }
        # Append the dictionary to the list
        extracted_data.append(listing_data)

The code above scrapes and parses real estate listings from Idealista using ScraperAPI and BeautifulSoup. It begins by configuring ScraperAPI with your ScraperAPI key and the target URL, then sends an HTTP GET request to the URL. If the request is successful, the HTML is parsed with BeautifulSoup, and the script locates all article elements with the class "item" (which represent property listings). It then loops through each listing to extract key details: title, price, number of bedrooms, area, description, tags, and image URLs.

Step 2: Transforming the Data (Data Cleaning)

After extracting raw data from Idealista, the next step is to clean and prepare it. To make this data more useful, we’ll use pandas, a powerful Python library for data analysis. If you’ve never used pandas before, think of it like Excel—only it’s in Python and is more flexible.
In Step 1, we stored each listing in a dictionary and added those dictionaries to a list called extracted_data. Here’s what that list might look like:

[
    {
        "Title": "Spacious apartment in central Barcelona",
        "Price": "€350,000",
        "Bedrooms": "3 bdr",
        "Area": "120 m²",
        "Description": "...",
        "Tags": "Luxury",
        "Images": [...]
    },
    ...
]

Now we’ll use pandas to convert that list into a structured DataFrame (a table-like object), then clean each column step by step.

import pandas as pd

# Convert raw listing data to a DataFrame
df = pd.DataFrame(extracted_data)

# View the raw data
print(df.head())

pd.DataFrame(...) creates a DataFrame from a list of dictionaries. Each dictionary becomes a row; each key becomes a column.
.head() shows the first five rows — useful for checking structure and data types.

The price values are strings like "€350,000". We’ll remove symbols and formatting to convert them to numeric values.

df['Price'] = (
    df['Price']
    .str.replace('€', '', regex=False)   # Remove the euro symbol
    .str.replace(',', '', regex=False)   # Remove comma separators
    .str.strip()                         # Remove leading/trailing whitespace
    .astype(float)                       # Convert strings to float
)


print(df['Price'].head()) # Display the first few prices to verify conversion

.str.replace(old, new) modifies string values in a column.
.str.strip() removes unnecessary spaces from both ends.
.astype(float) changes the column type from string to float so we can perform numerical operations later.

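Listings may include text like "3 bdr" or "two beds". We’ll extract the number of bedrooms as an integer using a regex function with .apply(). The pattern only matches digits, so spelled-out counts such as "two beds" return None and are removed later when we drop rows with missing values.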

import re

def extract_bedrooms(text):
    match = re.search(r'\d+', text)  # Find the first sequence of digits
    return int(match.group()) if match else None

df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)


print(df['Bedrooms'].head()) # Display the first few bedroom counts to verify conversion

.apply() runs a function on each element in the column.
re.search(r'\d+', text) looks for the first group of digits.

This cleans and standardizes the bedroom count into integers.

Area values include units like "120 m²". We’ll remove those and convert to float.

df['Area'] = (
    df['Area']
    .str.replace('m²', '', regex=False)  # Remove unit
    .str.replace(',', '.', regex=False)  # Convert comma to dot for decimal values
    .str.strip()                         # Clean up whitespace
    .astype(float)                       # Convert to float
)


print(df['Area'].head())  # Display the first few areas to verify conversion

This ensures all values in the “Area” column are consistent numerical types so that we can sort, filter, or calculate metrics like price per square meter.
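One caveat: the extraction step stores "N/A" when a detail is missing, and .astype(float) raises a ValueError on strings like that. Below is a plain-Python sketch of a more tolerant conversion that could be passed to .apply() instead. The function name and its parameters are hypothetical, not part of the original pipeline.

```python
def to_float_or_none(text, symbols=("€", "m²"), decimal_comma=False):
    """Strip known symbols and convert to float; return None if that fails."""
    cleaned = text
    for symbol in symbols:
        cleaned = cleaned.replace(symbol, "")
    # Prices use "," as a thousands separator; areas may use it as a decimal mark.
    cleaned = cleaned.replace(",", "." if decimal_comma else "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


price = to_float_or_none("€350,000")
area = to_float_or_none("120 m²")
decimal_area = to_float_or_none("82,5 m²", decimal_comma=True)
missing = to_float_or_none("N/A")
print(price, area, decimal_area, missing)
```

Rows where the function returns None show up as missing values in the DataFrame, so the dropna() call in the next step removes them.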

Some listings may be missing essential values. We’ll drop rows with missing data in key columns. You can choose which columns are crucial and should not have any missing values.

df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)

.dropna() removes rows with NaN (missing) values.
The subset argument limits this check to specific columns; you can add other columns here if needed.
inplace=True modifies the DataFrame directly without needing to reassign it.

To work with only listings that have exactly 3 bedrooms (optional):

df = df[df['Bedrooms'] == 3]

df[condition] filters rows based on a condition.
Here, we’re checking where the “Bedrooms” column equals 3, and updating df to only include those rows.

At this point, your data is structured similarly to this (shown before the optional three-bedroom filter, which would remove the second row):

Title	Price	Bedrooms	Area	…
“Modern flat in Eixample”	310000.00	3	95.0	…
“Loft with terrace in Gracia”	275000.00	2	82.0	…

This cleaned DataFrame is now ready for analysis or export. In the next step, we’ll load it into a CSV file.

Step 3: Loading Cleaned Data into CSV (Storing)

With your data now cleaned and structured in a pandas DataFrame, the final step is to persist it, meaning you save it somewhere so it can be reused, shared, or analyzed later.

The CSV file is the most common and beginner-friendly format for storing tabular data. It’s a simple text file where each row is a line and commas separate each column. Most tools—Excel, Google Sheets, data visualization tools, and programming languages—can open and process CSV files efficiently.

You can save your DataFrame to a CSV with just one line of code:

# Save the cleaned DataFrame to a CSV file
df.to_csv('three_bedroom_houses.csv', index=False)

df.to_csv(...) is a pandas method that writes your DataFrame to a CSV file.
'three_bedroom_houses.csv' is the file name that will be created (or overwritten).
index=False tells pandas not to write the DataFrame index (row numbers) to the file, which keeps it clean unless you explicitly need it.

Once this is done, you’ll see a new file in your working directory (where your script is running). Here’s what a few lines of that file might look like:

Title,Price,Bedrooms,Area,Description,Tags,Images
"Flat / apartment in calle de Bailèn, La Dreta de l'Eixample, Barcelona",675000.0,3,106.0,"Magnificent and quiet brand new refurbished flat in Eixample.
This ready-to-live-in flat enjoys a fantastic location very close to the popular Paseo Sant Joan and the pedestrian street Consell de Cent. It is a very pleasant urban environment in which to live in the neighbourhood, with numerous services, shops, restau",N/A,"['https://img4.idealista.com/blur/480_360_mq/0/id.pro.es.image.master/dd/d0/85/1326281103.jpg', 'https://st3.idealista.com/b1/b8/d4/bcn-advisors.gif']"

You can open it in:

Excel: Just double-click the file.
Google Sheets: Upload the file and import it as a spreadsheet.
Another Python script: Using pd.read_csv()
Visualization tools: Like Power BI, Tableau, or even Jupyter notebooks.
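When reading the file back in another script, keep in mind that every CSV field is text, including the Images column, which is stored as the string form of a Python list (as in the sample above). A minimal standard-library sketch of restoring it, using a made-up two-line stand-in for the real file:

```python
import ast
import csv
import io

# Two-line stand-in for three_bedroom_houses.csv with made-up values.
csv_text = (
    "Title,Price,Bedrooms,Area,Images\n"
    "\"Flat in Eixample\",675000.0,3,106.0,"
    "\"['https://example.com/a.jpg', 'https://example.com/b.gif']\"\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
first = rows[0]

# Every CSV field comes back as a string, including numbers and the list.
price = float(first["Price"])
images = ast.literal_eval(first["Images"])  # safely parses the list literal

print(type(first["Images"]).__name__, type(images).__name__, len(images))
```

ast.literal_eval only evaluates Python literals, which makes it a safer choice than eval for this kind of round trip.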

If you’re working with a larger dataset later or need better performance, consider saving to a database. But for now, CSV is ideal.

Step 4: Finalizing the ETL Pipeline

Now that your scraper works and your data is clean, it’s time to turn your code into a proper ETL pipeline. This makes it easier to maintain, reuse, schedule, or extend. We’ll do two things here:

1. Modularize the script into extract, transform, and load functions

2. Move sensitive info like your ScraperAPI key and target URL to environment variables using the python-dotenv package

This final version is production-friendly, secure, and easy to build on.

First, install python-dotenv if you don’t already have it:

pip install python-dotenv

Next, create a .env file in your project directory and add any sensitive information:

SCRAPER_API_KEY=your_scraperapi_key_here
IDEALISTA_URL=https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/
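
One optional safeguard: os.getenv() returns None when a variable is missing, and that None would otherwise be silently formatted into the request URL. The sketch below fails early with a clear message instead. The helper name require_env is hypothetical, and the variables are set in-process only so the example runs on its own.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demo only: set the variables in-process; in the pipeline, load_dotenv() reads them from .env
os.environ["SCRAPER_API_KEY"] = "your_scraperapi_key_here"
os.environ["IDEALISTA_URL"] = "https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/"

api_key = require_env("SCRAPER_API_KEY")
target_url = require_env("IDEALISTA_URL")
print("Configuration loaded for", target_url)
```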

Here’s your final pipeline script, with the code restructured into separate extract, transform, and load functions:

import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# === Load environment variables from .env file ===
load_dotenv()
SCRAPER_API_KEY = os.getenv('SCRAPER_API_KEY')
IDEALISTA_URL = os.getenv('IDEALISTA_URL')
SCRAPER_API_URL = f"http://api.scraperapi.com/?api_key={SCRAPER_API_KEY}&url={IDEALISTA_URL}"


# === Extract ===
def extract_data(url):
    # A timeout stops the pipeline from hanging indefinitely on a stalled request
    response = requests.get(url, timeout=70)
    extracted_data = []

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        listings = soup.find_all('article', class_='item')

        for listing in listings:
            title = listing.find('a', class_='item-link').get('title')
            price = listing.find('span', class_='item-price').text.strip()

            item_details = listing.find_all('span', class_='item-detail')
            bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
            area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"

            description = listing.find('div', class_='item-description')
            description = description.text.strip() if description else "N/A"

            tags = listing.find('span', class_='listing-tags')
            tags = tags.text.strip() if tags else "N/A"

            images = [img.get("src") for img in listing.find_all('img')]

            extracted_data.append({
                "Title": title,
                "Price": price,
                "Bedrooms": bedrooms,
                "Area": area,
                "Description": description,
                "Tags": tags,
                "Images": images
            })
    else:
        print(f"Failed to extract data. Status code: {response.status_code}")

    return extracted_data


# === Transform ===
def transform_data(data):
    df = pd.DataFrame(data)

    df['Price'] = (
        df['Price']
        .str.replace('€', '', regex=False)
        .str.replace(',', '', regex=False)
        .str.strip()
        .astype(float)
    )

    def extract_bedrooms(text):
        match = re.search(r'\d+', text)
        return int(match.group()) if match else None

    df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)

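    # pd.to_numeric with errors='coerce' turns unparseable values (such as the
    # "N/A" placeholder) into NaN, so dropna() below removes them instead of crashing
    df['Area'] = pd.to_numeric(
        df['Area']
        .str.replace('m²', '', regex=False)
        .str.replace(',', '.', regex=False)
        .str.strip(),
        errors='coerce'
    )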

    df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)
    df = df[df['Bedrooms'] == 3]

    return df


# === Load ===
def load_data(df, filename='three_bedroom_houses.csv'):
    df.to_csv(filename, index=False)
    print(f"Saved {len(df)} listings to {filename}")


# === Main pipeline ===
def main():
    print("Starting ETL pipeline for Idealista listings...")

    raw_data = extract_data(SCRAPER_API_URL)
    if not raw_data:
        print("No data extracted. Check your API key or target URL.")
        return

    print(f"Extracted {len(raw_data)} listings.")

    cleaned_data = transform_data(raw_data)
    print(f"{len(cleaned_data)} listings after cleaning and filtering.")

    load_data(cleaned_data)


if __name__ == "__main__":
    main()

With this final step, your scraper is now:

Modular and easy to update
Secure, with API keys safely stored in environment variables
Ready to scale, automate, or plug into larger data workflows

You now have a reusable, scalable workflow for scraping and analyzing real estate listings!
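
One refinement to consider as you extend the pipeline: the script places the target URL directly into the ScraperAPI request string. That works for the simple URL used here, but a target URL with its own query string (containing & or =) could be misread as extra API parameters. Here is a minimal sketch, using only the standard library, of percent-encoding the parameters first. The helper name build_scraperapi_url and the example target URL are assumptions for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_scraperapi_url(api_key, target_url):
    """Build the request URL with the target URL percent-encoded."""
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"http://api.scraperapi.com/?{query}"


# Hypothetical target URL that has its own query parameters
target = "https://www.example.com/search?city=barcelona&rooms=3"
request_url = build_scraperapi_url("your_scraperapi_key_here", target)
print(request_url)

# Round-trip check: the target URL survives intact as a single parameter
parsed = parse_qs(urlparse(request_url).query)
print(parsed["url"][0] == target)
```

Passing the same dictionary as params= to requests.get("http://api.scraperapi.com/", params=...) produces the same encoding.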

Use Cases for ScraperAPI and Python’s Data Cleaning Integration

Now that you’ve seen how ScraperAPI and Python work together to extract and clean real estate data, let’s explore how this powerful combination can be used across industries. The ETL workflow—Extract, Transform, Load—is flexible and scalable, making it useful for many data-driven projects.

Here are several practical applications where this integration excels:

1. Sentiment Analysis: You can examine how language affects buyer interest by scraping property descriptions or user reviews. After cleaning the text with Python, sentiment analysis tools like TextBlob or VADER can score the tone as positive, neutral, or negative. This makes it possible to see whether listings that use appealing terms like “spacious” or “modern” tend to sell faster or command higher prices.

2. Trend Monitoring: Running your scraper regularly helps build a dataset that captures how property prices and features change over time. It’s easier to visualize trends and track how specific market segments are evolving by structuring the data around key attributes like location, number of bedrooms, or property type.

3. Competitor Research: Scraping listings from multiple real estate platforms gives you a direct view of how competitors price and position similar properties. With standardized data, you can compare pricing strategies, listing frequency, and included features to identify market gaps or specific areas where your offering could stand out.

4. Community Insights: Collecting data from forums, review sites, or social media conversations can reveal what buyers and renters care about. After cleaning and processing the text, analysis can uncover common priorities, such as proximity to schools, demand for green space, or concerns about noise, which can inform development and marketing decisions.
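
As a rough illustration of the trend-monitoring idea, the sketch below tags each scrape with its date and tracks the median price per square metre over time. The scraped_on column and all figures are made up; a real dataset would come from running the pipeline on a schedule and appending each run's output.

```python
import pandas as pd

# Hypothetical snapshots: the same pipeline output collected on two dates
snapshots = pd.DataFrame({
    "scraped_on": ["2025-01-01", "2025-01-01", "2025-02-01", "2025-02-01"],
    "Price": [400000.0, 600000.0, 441000.0, 672000.0],
    "Area": [80.0, 120.0, 84.0, 128.0],
})
snapshots["scraped_on"] = pd.to_datetime(snapshots["scraped_on"])

# Price per square metre makes listings of different sizes comparable
snapshots["price_per_m2"] = snapshots["Price"] / snapshots["Area"]

# Median per scrape date shows how the segment moves over time
trend = snapshots.groupby("scraped_on")["price_per_m2"].median()
print(trend)
```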

Wrapping Up

Integrating ScraperAPI with data-cleaning pipelines creates a powerful setup for working with web data. ScraperAPI takes care of the tricky parts of scraping—like CAPTCHAs, IP blocks, and JavaScript rendering—so you can reliably extract data at scale. On the other side, Python helps you clean and organize that data, making sure it’s accurate, consistent, and ready for analysis. This combination saves time and makes it easier to get real insights from messy, real-world data.

In this tutorial, we walked through the process of:

Extracting real estate listings from Idealista using ScraperAPI
Transforming the data by standardizing data types, removing unwanted characters and empty values, and filtering for three-bedroom listings
Loading the cleaned data into a structured CSV file for easy sharing and analysis

If you’d like to try it for yourself, you can sign up for a free ScraperAPI account and get 5,000 API credits to start scraping right away. It’s a great way to test the waters and see how it fits into your data workflows.

Until next time, happy scraping!

FAQs

Why should I integrate ScraperAPI into my ETL pipeline?

Integrating ScraperAPI into your ETL pipeline simplifies data extraction by handling anti-scraping mechanisms like IP bans, CAPTCHAs, and JavaScript rendering. This ensures uninterrupted data collection, even from complex or heavily protected websites. ScraperAPI also reduces the need for manual workarounds, allowing you to focus on transforming and analyzing the data.

How do I ensure that the scraped data is accurate and of high quality?

To ensure data accuracy in your ETL pipeline, start by validating the extracted data using Python tools like pandas to check for missing values, duplicates, or inconsistent formats. Clean the data by standardizing date formats, currency symbols, and numeric values. Regularly test your scraping logic to ensure it adapts to website structure changes. Always review a sample of the scraped output to manually confirm that the data matches expectations before scaling your pipeline.
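
A minimal sketch of those validation checks with pandas, assuming a small made-up dataset that contains one duplicate row and one missing price:

```python
import pandas as pd

# Hypothetical scraped rows with typical problems: a duplicate and a missing price
df = pd.DataFrame({
    "Title": ["Flat A", "Flat A", "Flat B", "Flat C"],
    "Price": [300000.0, 300000.0, None, 450000.0],
    "Bedrooms": [3, 3, 2, 3],
})

missing_per_column = df.isna().sum()          # count of missing values per column
duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)

# Drop exact duplicates and rows without a price before analysis
clean = df.drop_duplicates().dropna(subset=["Price"])
print(len(clean), "rows remain")
```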

What kind of data can I scrape with ScraperAPI for my ETL pipeline?

ScraperAPI can extract a wide variety of data types for ETL pipelines, including plain text such as product descriptions, blog content, or property listings; numerical data like prices, ratings, and financial figures; media files including images and videos; structured data such as HTML tables, lists, and JSON or XML feeds; and dynamic content loaded via JavaScript or AJAX. This flexibility suits everything from basic web scraping to complex data aggregation projects.

The post Integrating ScraperAPI with Data Cleaning Pipelines appeared first on ScraperAPI.

Build a Walmart Reviews Analysis Tool Using ScraperAPI, VADER, Gemini, and Streamlit

Egop Gogo-Job — Thu, 15 May 2025 11:41:27 +0000

Customer reviews are more than just feedback. They are a rich, often untapped source of business intelligence. Paying close attention to what your customers say about their experience with your products can uncover real pain points, reveal trends in complaints, and surface opportunities that might otherwise stay invisible.

Scraping dynamic, high-traffic websites like Walmart is challenging. Even locating the page elements that hold the data you want can be confusing and feel impossible. Luckily for us, ScraperAPI provides a dedicated endpoint specifically for scraping Walmart reviews.

This article will guide you through building a unique tool that analyzes Walmart customer feedback. By using ScraperAPI’s structured Walmart reviews async endpoint, we will scrape reviews for multiple products and utilize VADER to pinpoint the emotional tone of each review.

Furthermore, we will utilize Gemini to transform this raw data into a clear, actionable report that includes recommendations, all displayed in a free, cloud-hosted web interface built with Streamlit.

Understanding VADER for Sentiment Analysis

Sentiment analysis is a method for identifying the emotions expressed in a piece of text. Since VADER (Valence Aware Dictionary and Sentiment Reasoner) is the sentiment analysis tool we’re using in this project, it’s best to understand how it works and its benefits before diving deeper.

VADER uses a predefined dictionary (lexicon) where each word is allotted a sentiment score. These scores reflect how positive, negative, or neutral a term is. For each review we analyze, VADER reports how much of the text reads as positive, neutral, and negative, along with an overall polarity score, which is the metric we focus on here.

Polarity represents the overall sentiment of a review, ranging from negative to positive. A score closer to +1 indicates a more positive review, while a score closer to -1 means a more negative review. A score near 0 signifies a neutral review. VADER calculates each score by assessing the sentiment intensity of individual words in the review, referencing its built-in dictionary.

Here’s more information on VADER that includes key advantages and features:

1. Handles Informal Language Well

VADER is excellent at analyzing the kind of casual language people use on social media. It can easily understand and interpret slang, irregularly capitalized words, and even emotional cues through punctuation, such as multiple exclamation points and emojis. With most sentiment analysis tools, it’s challenging to achieve this, making VADER particularly well-suited for our task.

2. Provides Context-Aware Sentiment Adjustment

Instead of treating words in isolation, VADER utilizes smart rules to interpret context. When a sentence includes words like “not,” it flips the meaning, such that, while “good” is positive, “not good” becomes negative.

It also notices if certain words are in all caps or if there are many exclamation points, which usually means the emotion is stronger. And it accounts for degree modifiers like “very” or “slightly,” especially when they appear before an adjective, to gauge exactly how strong the emotion is.

3. Gives an Overall Mood Score

VADER wraps up all its analysis in a single number called the compound score, which ranges from -1 to +1. This score tells you at a glance whether the overall review feels positive (closer to +1), negative (closer to -1), or neutral (around 0). It’s like a summary mood indicator that combines all the word scores and context tweaks into one easy-to-understand value.
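
To make the compound score concrete, here is a minimal sketch of turning it into a label. The ±0.05 cut-offs follow the thresholds commonly suggested in VADER's own documentation. The helper name label_sentiment is hypothetical, and the scores passed in are made-up values rather than real VADER output.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER-style compound score (-1 to +1) to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores for illustration
for score in (0.82, -0.41, 0.01):
    print(score, label_sentiment(score))
```

In a real run, the compound value would come from the "compound" key of the dictionary returned by the analyzer's polarity_scores() method.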

ScraperAPI’s Walmart Reviews API (Async Endpoint)

Web scraping is difficult for several reasons. Modern websites are built with dynamic JavaScript frameworks, which means that most of the content isn’t available in the static HTML. In practice, you’d need to understand JavaScript and know your way around web development tools to locate and extract the data you need.

When scraping a website, the tool you use must first bypass the anti-scraping defenses that many sites employ these days. Once it’s through, it immediately runs into a mountain of code. The image below shows a real-life example of the code behind Walmart’s website (to see the same view yourself, right-click on a Walmart page and select “Inspect”):

The code in the Elements section of a webpage is often buried under multiple layers of HTML, making it tricky to find exactly where the data is coming from. To navigate this, you typically need a good understanding of HTML, CSS, and JavaScript.

But what if you’re not a front-end developer? If you’re a data analyst, scientist, or engineer, your primary language probably isn’t JavaScript.

In most cases, you’ll need to use your browser’s developer tools to inspect the page and locate the specific elements, like reviews, ratings, or dates, that contain the data you want to scrape.

Tools like Selenium and Puppeteer can help simulate user behavior, but they add layers of complexity. If we wanted to scrape this Walmart page, here’s the typical process we’d have to go through just to locate and extract that data:

First, you have to locate the parent container within the website’s HTML code that contains the div class where you can find the reviews data:

Within the div class, search for “