Blog - ScraperAPI (https://www.scraperapi.com/blog/): Scale Data Collection with a Simple API

Build an Image Search Engine for Amazon with the ScraperAPI-LangChain Agent
https://www.scraperapi.com/blog/build-image-search-engine-for-amazon/
Fri, 21 Nov 2025 05:55:01 +0000


Image search has become an intuitive way to browse the internet. Tools like Google Lens can find identical items across different websites based on an uploaded photo, which is useful but generic. 

If you live in the UK or Canada and just want search results of product listings from your local Amazon marketplace or some other local online retailer, the breadth of results Google Lens returns can be overwhelming, time-wasting, and mostly useless. Oftentimes, it will return similar items, just not readily accessible items. 

Given Amazon’s scale and inventory depth, a focused search that goes straight to the right marketplace is the most efficient approach.

Our project addresses this by enabling image search, particularly for Amazon Marketplaces in any region of the world, using two separate large language models (LLMs) to analyze uploaded images and generate shopping queries. 

These queries are passed to a reasoning model that uses the ScraperAPI LangChain agent to search Amazon and return structured results. To build a user interface and host our app for free, we use Streamlit.

Let’s get started!

Understanding the Search Engine’s Workflow

There are three core components of our Image Search engine that work in sequence. Claude 3.5 Sonnet reads the uploaded photo and writes a short shopping caption that captures distinct attributes of the item. 

GPT 4o Mini takes that caption, chooses the right Amazon marketplace, and forms a neat query. The ScraperAPI LangChain agent then runs the query against Amazon and returns structured results containing title, ASIN, price, URL link, and image, which the app shows instantly. 

Let’s take a closer look at how each of these components functions:

LangChain and ScraperAPI

LangChain agents connect a reasoning model to external tools, so the model can act, not just chat. Integrating ScraperAPI as an external tool enables the agent to crawl and fetch real-time data from the web without getting blocked. 

The langchain-scraperapi package exposes three ScraperAPI tools to whatever reasoning model (an LLM) you pair with the agent: ScraperAPIAmazonSearchTool, ScraperAPIGoogleSearchTool, and ScraperAPITool.

With just a prompt and your ScraperAPI key, the agent issues a tool call, and ScraperAPI handles anti-bot bypassing and data extraction, returning clean, formatted data. For Amazon, the data usually comes back as structured JSON containing the title, ASIN, price, image, and URL of each product.

Claude 3.5 Sonnet and GPT 4o Mini

In this project, Claude 3.5 Sonnet, a multimodal LLM, converts each uploaded photo into a short descriptive caption that captures the key attributes of that item. 

The caption becomes the query, and GPT 4o Mini, the reasoning model paired to our agent, then interprets the caption, selects the correct Amazon marketplace, and calls the ScraperAPI LangChain tool to run the search. 

The tool returns structured results that the app can display directly. Splitting the work this way keeps each model focused on what it does best. 

Claude Vision extracts the right details from the image. GPT 4o Mini handles reasoning and tool use. ScraperAPI provides stable access and structured data.

Image Search Engine Workflow

Obtaining Claude 3.5 Sonnet and GPT 4o Mini from OpenRouter

Our setup uses two separate large language models arranged in a multi-flow design. You can access LLMs from platforms like Hugging Face, Google AI Studio, AWS Bedrock, or locally via Ollama.

However, I used OpenRouter because it’s simpler to set up and supports many models through a single API, which is ideal for multi-flow LLM setups. 

Here’s a guide on how to access Claude 3.5 Sonnet from OpenRouter:

  1. Go to OpenRouter and sign up for a free account.
  2. After verifying your email, log in and search for Claude models (or any other LLM of your choice) in the search bar.
  3. Select Claude 3.5 Sonnet and click on the “Copy” icon just below the model’s name.
  4. Click on “API” to create a personal API access key for your model.
  5. Select “Create API Key”, then copy and save your newly created API key.
  6. You do not have to repeat the entire process to access GPT 4o Mini. Simply copy the model name shown on its OpenRouter page (openai/gpt-4o-mini in our code) and paste it into the code; your single API key can access both LLMs.

Do not share your API key publicly!

Getting Started with ScraperAPI

  1. If you don’t have a ScraperAPI account, go to scraperapi.com and click “Start Trial” to create one, or “Login” to access an existing account.
  2. After creating your account, you’ll see a dashboard with your API key, 5,000 API credits (valid for a 7-day trial), and information on how to get started scraping.
  3. To access more credits and advanced features, scroll down and click “Upgrade to Larger Plan.”
  4. ScraperAPI provides documentation for various programming languages and frameworks, such as PHP, Java, and Node.js, that interact with its endpoints. You can find these resources by scrolling down on the dashboard page and clicking “View All Docs”.

Now we’re all set, let’s start building our tool.

Building the Image Search Engine for Amazon

Step 1: Setting Up the Project

Create a new project folder, a virtual environment, and install the necessary dependencies.

mkdir amzn_image_search  # Creates the project folder
cd amzn_image_search # Moves you inside the project folder

Set up a virtual environment

python -m venv venv

Activate the environment:

  • Windows:
venv\Scripts\activate
  • macOS/Linux:
source venv/bin/activate

Now, install the dependencies we’ll need:

pip install streamlit Pillow requests aiohttp openai langchain-openai langchain langchain-scraperapi python-dotenv

The key dependencies and their functions are:

  • streamlit: The core library for building and running the app’s UI.
  • openai: To interact with OpenRouter’s API, which is compatible with the OpenAI library’s structure.
  • langchain-openai: Provides the LangChain integration for using OpenAI-compatible models (like those on OpenRouter) as the “brain” for our agent.
  • langchain-scraperapi: Provides the pre-built ScraperAPIAmazonSearchTool that our LangChain agent will use to perform searches on Amazon.
  • langchain: The framework that allows us to chain together our language model (the brain) and tools (the search functionality) into an autonomous agent.
  • Pillow: A library for opening, manipulating, and saving many different image file formats. We use it to handle uploaded images.
  • requests & aiohttp: Underlying HTTP libraries used by the other packages to make API calls.

Step 2: Keys, Environment, and Model Selection

Let’s set up the necessary API keys and define which AI models will be used for different tasks.

In a file .env, add:

SCRAPERAPI_API_KEY="Your_SCRAPERAPI_API_Key"
OPENROUTER_API_KEY="Your_OpenRouter_API_Key"

In a file main.py, add the following code:

import os, io, base64, json
import streamlit as st
from PIL import Image
from openai import OpenAI
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import StructuredTool
from langchain_scraperapi.tools import ScraperAPIAmazonSearchTool
from dotenv import load_dotenv
load_dotenv()
# secure api keys from .env using os
SCRAPERAPI_KEY = os.environ.get("SCRAPERAPI_API_KEY")
OPENROUTER_API_KEY_DEFAULT = os.environ.get("OPENROUTER_API_KEY")
if SCRAPERAPI_KEY:
    os.environ["SCRAPERAPI_API_KEY"] = SCRAPERAPI_KEY
else:
    print("Warning: SCRAPERAPI_API_KEY environment variable not set.")
# allocating models as per their tasks 
CAPTION_MODEL = "anthropic/claude-3.5-sonnet"  # vision model for captioning
AGENT_MODEL = "openai/gpt-4o-mini"  # reasoning model (cheaper alternative to Claude)

Here’s a breakdown of what the code above does:

  • Imports: All the necessary libraries for the application are imported at the top, including StructuredTool which we’ll use to create a custom, reliable search tool.
  • API Keys: The script handles API key management by using load_dotenv() to retrieve keys from a .env file and assigns them to variables: SCRAPERAPI_KEY and OPENROUTER_API_KEY_DEFAULT.
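  • Environment Setup: LangChain tools often look for API keys in environment variables, and the ScraperAPIAmazonSearchTool is expected to find SCRAPERAPI_API_KEY there. load_dotenv() has already loaded the key into the environment, so os.environ["SCRAPERAPI_API_KEY"] = SCRAPERAPI_KEY acts as a safeguard that makes this dependency explicit; the else branch prints a warning when the key is missing.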
  • Model Selection: Since we’re using two different models for two distinct tasks, the CAPTION_MODEL will be Claude 3.5 Sonnet due to its multimodal capabilities. The AGENT_MODEL is GPT-4o mini because it’s cheaper and very efficient at understanding instructions and using tools, which is exactly what the agent needs to do.
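As a quick sanity check before launching the app, the key handling described above can be sketched as a small helper. This is a minimal sketch, not part of the original script: the helper name missing_keys is hypothetical, and it treats OPENROUTER_API_KEY as required even though the app also lets you paste that key into the sidebar.

```python
import os

# Variable names taken from the .env / main.py setup in this step
REQUIRED_KEYS = ("SCRAPERAPI_API_KEY", "OPENROUTER_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not (env.get(name) or "").strip()]


# Report anything missing from the current process environment
missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Because the helper takes any mapping, you can also call it with a plain dictionary to test your configuration logic without touching the real environment.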

Step 3: App Configuration and UI Basics

Here we’ll configure the Streamlit page and set up some basic data structures and titles. Add this to your file:

st.set_page_config(page_title=" Amazon Visual Match", layout="wide")
st.title("Amazon Visual Product Search Engine")
AMZ_BASES = {
   "US (.com)": {"tld": "com", "country": "us"},
   "UK (.co.uk)": {"tld": "co.uk", "country": "gb"},
   "DE (.de)": {"tld": "de", "country": "de"},
   "FR (.fr)": {"tld": "fr", "country": "fr"},
   "IT (.it)": {"tld": "it", "country": "it"},
   "ES (.es)": {"tld": "es", "country": "es"},
   "CA (.ca)": {"tld": "ca", "country": "ca"},
}

Here’s what this code achieves:

  • st.set_page_config(…): Sets the browser tab title and uses a “wide” layout for the app.
  • st.title(…): Displays the main title on the web page.
  • AMZ_BASES: This dictionary is essential. It maps a marketplace name (e.g., “ES (.es)”) to the two codes ScraperAPI needs: the tld (top-level domain, like es) and the country code for that domain. Providing both is critical to ensuring we search the correct local marketplace.
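To make the lookup concrete, here is a minimal sketch of how a dropdown label resolves to the values used later in the app. The resolve_marketplace helper is hypothetical (the app itself reads the dictionary inline), and only three of the marketplaces are repeated here to keep the example short.

```python
# A subset of the AMZ_BASES dictionary defined above
AMZ_BASES = {
    "US (.com)": {"tld": "com", "country": "us"},
    "UK (.co.uk)": {"tld": "co.uk", "country": "gb"},
    "ES (.es)": {"tld": "es", "country": "es"},
}


def resolve_marketplace(label, bases=AMZ_BASES):
    """Return (tld, country_code, base_url) for a dropdown label."""
    entry = bases[label]  # raises KeyError for an unknown label
    base_url = f"https://www.amazon.{entry['tld']}"
    return entry["tld"], entry["country"], base_url


print(resolve_marketplace("UK (.co.uk)"))
# ('co.uk', 'gb', 'https://www.amazon.co.uk')
```

Note that the UK entry shows why both values are needed: the domain suffix is co.uk, while the country code is gb.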

Step 4: Creating the Image Captioning Function

This is the first major functional part of the app. It defines the logic for sending an image to the vision LLM (Claude 3.5 Sonnet) to get a descriptive caption. Continue in your file by adding this:

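# captioning stage
def _image_to_b64(pil_img: Image.Image) -> str:
   # Serialize the image as PNG in memory, then Base64-encode the bytes
   buf = io.BytesIO()
   pil_img.save(buf, format="PNG")
   return base64.b64encode(buf.getvalue()).decode("utf-8")
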
def caption_with_openrouter_claude(
       pil_img: Image.Image,
       api_key: str,
       model: str = CAPTION_MODEL,
       max_tokens: int = 96,
) -> str:
   if not api_key:
       raise RuntimeError("Missing OpenRouter API key.")
   client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)
   b64 = _image_to_b64(pil_img)
   prompt = (
       "Describe this product in ONE concise shopping-style sentence suitable for an Amazon search. "
       "Include brand/model if readable, color, material, and 3-6 search keywords. "
       "No commentary, just the search-style description."
   )
   resp = client.chat.completions.create(
       model=model,
       temperature=0.2,
       max_tokens=max_tokens,
       messages=[{
           "role": "user",
           "content": [
               {"type": "text", "text": prompt},
               {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
           ],
       }],
   )
   return resp.choices[0].message.content.strip()

Let’s break this down:

  • _image_to_b64: A helper function that takes an image opened by the Pillow library and converts it into a Base64 string. This is the standard format for embedding image data directly into an API request.
  • caption_with_openrouter_claude: Initializes the OpenAI client, pointing it to OpenRouter’s API endpoint and instructs the vision model on exactly how to describe the image: as a single, concise sentence suitable for a product search.
  • Finally, it sends the request and returns the clean text response from the AI model.
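If the Base64 step feels opaque, the sketch below shows the same idea with the standard library only: raw bytes become a data URL, which is then placed in an OpenAI-style multimodal message like the one in our function. The helper names to_data_url and build_vision_message are hypothetical, and the 8-byte PNG signature stands in for a real image so the example runs without Pillow or a network call.

```python
import base64


def to_data_url(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw bytes as a data URL that can be embedded in a JSON request."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vision_message(prompt: str, data_url: str) -> dict:
    """Build a user message carrying both a text prompt and an inline image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }


PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file
url = to_data_url(PNG_SIGNATURE)
message = build_vision_message("Describe this product.", url)

print(url)  # data:image/png;base64,iVBORw0KGgo=
# Decoding the payload gives back the original bytes
print(base64.b64decode(url.split(",", 1)[1]) == PNG_SIGNATURE)  # True
```

The MIME type in the data URL should match the format the image was serialized in, which is why our function pairs PNG encoding with data:image/png.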

Step 5: Initializing the LangChain Agent

This function builds the agent that will perform the Amazon search. To make our agent robust, we won’t give it the ScraperAPIAmazonSearchTool directly. Instead, we’ll wrap it in a custom StructuredTool to “lock” the marketplace settings. This prevents the agent from getting confused and defaulting to the US marketplace: amazon.com 

First, we define a function to create this “locale-locked” tool.

def make_amazon_search_tool(tld: str, country_code: str) -> StructuredTool:
   base_tool = ScraperAPIAmazonSearchTool()
   def _search_amazon(query: str) -> str:
       return base_tool.invoke({
           "query": query,
           "tld": tld,
           "country_code": country_code,
           "output_format": "json",
       })
   return StructuredTool.from_function(
       name="scraperapi_amazon_search",
       func=_search_amazon,
       description=(
           f"Search products on https://www.amazon.{tld} "
           f"(locale country_code={country_code}). "
           "Input: a plain natural-language product search query."
       ),
   )

Now, we create the agent initializer, which uses the helper function above.

# langchain agent setup
def initialize_amazon_agent(openrouter_key: str, tld: str, country_code: str) -> AgentExecutor:
   llm = ChatOpenAI(
       openai_api_key=openrouter_key,
       base_url="https://openrouter.ai/api/v1",
       model=AGENT_MODEL,
       temperature=0,
   )
   amazon_tool = make_amazon_search_tool(tld=tld, country_code=country_code)
   tools = [amazon_tool]
   prompt = ChatPromptTemplate.from_messages([
       (
           "system",
           "You are an Amazon product search assistant. "
           "You MUST use the `scraperapi_amazon_search` tool for every search. "
           "Return ONLY the JSON from the tool. Do not invent or change tld/country."
       ),
       ("human", "{input}"),
       MessagesPlaceholder(variable_name="agent_scratchpad"),
   ])
   agent = create_tool_calling_agent(llm, tools, prompt)
   return AgentExecutor(agent=agent, tools=tools, verbose=True)

The code achieves the following:

  • make_amazon_search_tool: This wrapper function takes the tld and country_code from the dropdown selection box and creates a new, simple tool for the agent. When the agent uses this tool, it only provides the search query. The tld and country_code are captured by the tool’s inner _search_amazon function (a closure), so the agent cannot override them and the search always targets the selected marketplace.
  • LLM Initialization: It sets up the ChatOpenAI object, configuring it to use the AGENT_MODEL (GPT-4o mini) via OpenRouter. The temperature=0 makes the model’s responses highly predictable.
  • Agent Creation: It assembles the final agent using our special amazon_tool and a system prompt that explicitly tells the agent to only return the JSON from the tool. This, combined with the wrapper tool, makes parsing the results reliable.
  • The AgentExecutor is the runtime that executes the agent’s tasks. verbose=True is helpful for debugging, as it prints the agent’s thought process to the console.
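The “locking” idea is plain Python and can be tried without LangChain or an API key. In this minimal sketch, make_locked_search and fake_search are hypothetical names: fake_search is a stand-in that echoes its parameters instead of calling ScraperAPI, so you can see exactly what the wrapped tool would receive.

```python
def make_locked_search(search_fn, tld, country_code):
    """Return a function that accepts only a query; locale settings are fixed."""
    def locked_search(query):
        # tld and country_code come from the enclosing scope (a closure),
        # so the caller has no way to change them
        return search_fn({
            "query": query,
            "tld": tld,
            "country_code": country_code,
            "output_format": "json",
        })
    return locked_search


def fake_search(params):
    """Hypothetical stand-in for the real tool: echo the parameters back."""
    return params


search_uk = make_locked_search(fake_search, tld="co.uk", country_code="gb")
print(search_uk("red ceramic mug"))
```

Whatever query the agent sends, the tld and country_code in the printed dictionary stay at co.uk and gb, which is the behavior we rely on in make_amazon_search_tool.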

Step 6: Building the User Input Interface

Now let’s build the interactive sidebar and main input column within our Streamlit app.

with st.sidebar:
   st.subheader("LLM Configuration")
   openrouter_key = st.text_input(
       "OPENROUTER_API_KEY (Unified Key)",
       type="password",
       value=OPENROUTER_API_KEY_DEFAULT,
       help="Used for both caption + agent models.",
   )
   st.markdown(f"**Vision Caption Model:** `{CAPTION_MODEL}`")
   st.markdown(f"**Agent Reasoning Model:** `{AGENT_MODEL}`")
col_l, col_r = st.columns([1, 1.25])
with col_l:
   region_label = st.selectbox("Marketplace", list(AMZ_BASES.keys()), index=0)
   selected_market = AMZ_BASES[region_label]
   marketplace_tld = selected_market["tld"]
   country_code = selected_market["country"]
   uploaded = st.file_uploader("Upload a product photo", type=["png", "jpg", "jpeg"])
   manual_boost = st.text_input(
       "Optional extra keywords",
       help="e.g. brand/model/color to append to the caption",
   )
   run_btn = st.button("Search Amazon")
with col_r:
   st.info(
       f"Flow: (1) Caption image with **{CAPTION_MODEL}** "
       f"(2) Agent with **{AGENT_MODEL}** calls ScraperAPI Amazon Search locked to "
       f"**amazon.{marketplace_tld}** (3) Display JSON results."
   )

Here’s what the code does:

  • Sidebar: A sidebar is created to hold the configuration. It includes a password input for the OpenRouter API key and displays the names of the two models being used.
  • Main Columns: The main area is split into a left column (col_l) and a right column (col_r).
  • col_l contains all the user inputs: the marketplace dropdown, file uploader, optional keyword box, and the search button.
  • Most importantly, when a marketplace is selected, we now pull both marketplace_tld and country_code from the AMZ_BASES dictionary.
  • col_r contains an st.info box that clearly explains the app’s workflow to the user, dynamically showing which marketplace (amazon.{marketplace_tld}) is being searched.

Step 7: The Main Application Logic and Search Execution

Now to the heart of the application, where everything is tied together. This block of code runs when a user clicks the “Search Amazon” button.

if run_btn:
   if not uploaded:
       st.warning("Please upload a photo first.")
       st.stop()
   if not openrouter_key:
       st.error("Please paste your OPENROUTER_API_KEY.")
       st.stop()
   img = Image.open(io.BytesIO(uploaded.read())).convert("RGB")
   st.image(img, caption="Uploaded photo", use_container_width=True)
   with st.spinner(f"Describing your image via {CAPTION_MODEL}..."):
       try:
           caption = caption_with_openrouter_claude(img, openrouter_key)
       except Exception as e:
           st.error(f"Captioning failed: {e}")
           st.stop()
   query = f"{caption} {manual_boost}".strip()
   st.success(f"Caption: _{caption}_")
   st.write("**Agent Query:**", query)
   agent_executor = initialize_amazon_agent(
       openrouter_key,
       tld=marketplace_tld,
       country_code=country_code,
   )
   with st.spinner(
           f"Searching amazon.{marketplace_tld}"
   ):
       try:
           result = agent_executor.invoke({"input": f"Search for: {query}"})
       except Exception as e:
           st.error(f"LangChain Agent execution failed: {e}")
           st.stop()
   agent_output_str = result.get("output", "").strip()
   if not agent_output_str:
       st.error("Agent returned empty output.")
       st.stop()
   json_start_brace = agent_output_str.find('{')
   json_start_bracket = agent_output_str.find('[')
   if json_start_brace == -1 and json_start_bracket == -1:
       st.error("Agent output did not contain any valid JSON.")
       with st.expander("Debug: Raw agent output"):
           st.code(agent_output_str)
       st.stop()
   if json_start_brace == -1:
       json_start_index = json_start_bracket
   elif json_start_bracket == -1:
       json_start_index = json_start_brace
   else:
       json_start_index = min(json_start_brace, json_start_bracket)
   cleaned_json_str = agent_output_str[json_start_index:]
   try:
       decoder = json.JSONDecoder()
       raw_data, _ = decoder.raw_decode(cleaned_json_str)
   except json.JSONDecodeError as e:
       st.error(f"Failed to parse JSON from agent output: {e}")
       with st.expander("Debug: Raw agent output (before clean)"):
           st.code(agent_output_str)
       with st.expander("Debug: Sliced/Cleaned string that failed"):
           st.code(cleaned_json_str)
       st.stop()
   items = []
   if isinstance(raw_data, dict) and isinstance(raw_data.get("results"), list):
       items = raw_data["results"]
   elif isinstance(raw_data, list):
       items = raw_data
   else:
       st.warning("Unexpected JSON shape from tool. See raw output below.")
       with st.expander("Debug: Raw JSON"):
           st.json(raw_data)
       st.stop()

Let’s break it down below:

  • Input Validation: It first checks if an image has been uploaded and if an API key is present.
  • Image Processing: It opens the uploaded image file, displays it, and prepares it for captioning.
  • Caption Generation: It calls the caption_with_openrouter_claude function inside an st.spinner.
  • Query Construction: It creates the final search query by combining the AI-generated caption with any optional keywords.
  • Agent Execution: It initializes the agent by passing both the marketplace_tld and country_code to our initialize_amazon_agent function, then invokes it with the query.
  • Robust JSON Parsing: The agent’s raw output can sometimes be messy (invisible characters or extra text before or after the JSON).
    1. We first find the start of the JSON ({ or [) to trim any leading junk.
    2. We then use json.JSONDecoder().raw_decode(), which parses the first complete JSON value and ignores any “extra data” that comes after it, avoiding the parsing errors json.loads() would raise.
    3. Finally, we extract the list of products, either from the “results” key of an object or from a bare list.
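The parsing steps above can be isolated into two small functions and tested on their own. This is a minimal sketch using only the standard library; the function names extract_first_json and extract_items are hypothetical, and the sample string imitates an agent reply with text wrapped around the JSON.

```python
import json


def extract_first_json(text):
    """Return the first JSON object or array embedded in text, or None."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        return None
    try:
        # raw_decode stops at the end of the first complete JSON value,
        # so trailing text does not cause an "Extra data" error
        value, _end = json.JSONDecoder().raw_decode(text[min(starts):])
    except json.JSONDecodeError:
        return None
    return value


def extract_items(data):
    """Accept either {"results": [...]} or a bare list; otherwise return []."""
    if isinstance(data, dict) and isinstance(data.get("results"), list):
        return data["results"]
    if isinstance(data, list):
        return data
    return []


noisy = 'Here you go: {"results": [{"asin": "B000TEST01"}]} Hope that helps!'
print(extract_items(extract_first_json(noisy)))
# [{'asin': 'B000TEST01'}]
```

Keeping this logic in plain functions also makes it easy to unit-test the parser separately from the Streamlit UI.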

Step 8: Displaying the Search Results

The final step is to take the list of product items extracted in the previous step and render it in a user-friendly format. Add:

   if not items:
       st.warning(f"No items found on amazon.{marketplace_tld} for that query.")
       with st.expander("Debug: Raw JSON"):
           st.json(raw_data)
       st.stop()
   st.subheader(f"Results ({len(items)}) from amazon.{marketplace_tld}")
   for it in items[:24]:
       with st.container(border=True):
           c1, c2 = st.columns([1, 2])
           with c1:
               if it.get("image"):
                   st.image(it["image"], use_container_width=True)
           with c2:
               st.markdown(f"**{it.get('name', 'No Title')}**")
               asin = it.get("asin")
               if asin:
                   st.write(f"ASIN: `{asin}`")
               price = it.get("price_string")
               if price:
                   st.write(f"Price: {price}")
               url = it.get("url")
               if url:
                   st.link_button("View on Amazon", url)

The code does the following:

  • No Results Check: It first checks if the items list is empty and informs the user.
  • Results Header: It displays a subheader announcing how many results were found and from which marketplace (amazon.{marketplace_tld}).
  • Loop and Display: It loops through the first 24 items (items[:24]) and displays each product in a structured, two-column layout with its image, title, ASIN, price, and a direct link to the product page.
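The field mapping used in the display loop can be sketched as a pure function, which is handy for checking the fallbacks without rendering anything. The helper name to_display_rows is hypothetical; the field names (name, asin, price_string, url, image) are the ones the display code above reads, and the sample items are made up for illustration.

```python
def to_display_rows(items, limit=24):
    """Map raw result items to the fields the UI shows, capped at `limit`."""
    rows = []
    for it in items[:limit]:
        rows.append({
            "title": it.get("name", "No Title"),  # same fallback as the UI
            "asin": it.get("asin"),
            "price": it.get("price_string"),
            "url": it.get("url"),
            "image": it.get("image"),
        })
    return rows


# Illustrative sample data, not real Amazon results
sample = [
    {"name": "Ceramic Mug", "asin": "B000TEST01", "price_string": "£9.99"},
    {"asin": "B000TEST02"},
]
for row in to_display_rows(sample):
    print(row["title"], row["asin"], row["price"])
```

Missing fields come back as None, which mirrors how the Streamlit code skips the ASIN, price, or link when a value is absent.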

Step 9: Running Your Application

With the entire script in place, you can now run the application from your terminal. Make sure your virtual environment is still active.

streamlit run main.py

Your web browser should open automatically and load the application. Here, main.py is the name of the file containing your script; if you saved the code under a different name, substitute it accordingly.

Here’s a snippet of what the tool’s UI looks like:

Amazon Visual Product Search Engine

Deploying the Image Search Engine App Using Streamlit 

Follow the steps below to deploy your Image Search Engine on Streamlit for free:

Step 1: Set Up a GitHub Repository

Streamlit requires your project to be hosted on GitHub.

1. Create a New Repository on GitHub

Create a new repository on GitHub and set it as public.

2. Push Your Code to GitHub

Before doing anything else, create a .gitignore file to avoid accidentally uploading sensitive files like .env. Add the following to it:

.env
__pycache__/
*.pyc
*.pyo
*.pyd
.env.*
.streamlit/secrets.toml

If you haven’t already set up Git and linked your repository, use the following commands in your terminal from within your project folder:

git init
git add .
git commit -m "Initial commit"
git branch -M main
# Add ONE of the following remotes, not both
# Option A: HTTPS
git remote add origin https://github.com/YOUR_USERNAME/your-repo.git
# Option B: SSH
git remote add origin git@github.com:YOUR_USERNAME/your-repo.git
git push -u origin main

If it’s your first time using GitHub from this machine, you might need to set up an SSH connection. Here is how.

Step 2: Define Dependencies and Protect Your Secrets!

Streamlit needs to know what dependencies your app requires. 

1. In your project folder, automatically create a requirements file by running:

pip freeze > requirements.txt

2. Commit it to GitHub:

git add requirements.txt
git commit -m "Added dependencies"
git push origin main

Step 3: Deploy on Streamlit Cloud

1. Go to Streamlit Community Cloud.

2. Click “Sign in with GitHub” and authorize Streamlit.

3. Click “Create App.” 

4. Select “Deploy a public app from GitHub repo.”

5. In the repository settings, enter:

  • Repository: YOUR_USERNAME/Amazon-Image-Search-Engine
  • Branch: main
  • Main file path: main.py (or whatever your Streamlit script is named)

6. Click “Deploy” and wait for Streamlit to build the app.

7. Go to your deployed app dashboard, find your app, and open “Secrets” under “Settings”. Add your environment variables (your API keys) just as you have them locally in your .env file.
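Our script reads keys with os.environ.get(), which should keep working on Streamlit Community Cloud as long as the secrets are entered as top-level KEY = "value" lines. If you prefer an explicit lookup order, here is a minimal sketch of a helper that checks several mapping-like sources in turn. The name first_configured is hypothetical, and the two dictionaries are placeholders: in a deployed app you would pass st.secrets (assumed here to behave like a read-only mapping) and os.environ instead.

```python
def first_configured(name, *sources):
    """Return the first non-empty value for `name` across mapping-like sources."""
    for source in sources:
        value = source.get(name)
        if value:
            return value
    return None


# Placeholder mappings standing in for st.secrets and os.environ
cloud_secrets = {"OPENROUTER_API_KEY": "cloud-example-key"}
local_env = {"SCRAPERAPI_API_KEY": "local-example-key", "OPENROUTER_API_KEY": ""}

print(first_configured("OPENROUTER_API_KEY", local_env, cloud_secrets))
print(first_configured("SCRAPERAPI_API_KEY", local_env, cloud_secrets))
print(first_configured("NOT_SET", local_env, cloud_secrets))
```

An empty string counts as “not configured”, so a blank entry in one source falls through to the next instead of silently winning.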

Step 4: Get Your Streamlit App URL

After deployment, Streamlit will generate a public URL (e.g., https://your-app-name.streamlit.app). You can now share this link to allow others to access your app!

Here’s a short YouTube video demonstrating the Image Search Engine in action. 

Conclusion

Congratulations. You just built an Image Search engine for Amazon. Your tool converts uploaded photos into search queries that yield targeted results based on visual similarities. 

We achieved this using the ScraperAPI-Langchain agent for real-time web scraping, Claude 3.5 Sonnet for image captioning, GPT-4o Mini as a reasoning model for our agent, and Streamlit for building the UI and free cloud hosting.

The result is a fast, intuitive, and relevant tool that helps consumers find Amazon products instantly, even when they are unable to provide written search queries, thereby reducing the time to purchase and improving customer satisfaction.

The Ultimate Guide to Bypassing Anti-Bot Detection
https://www.scraperapi.com/blog/bypassing-anti-bot-detection/
Wed, 15 Oct 2025 14:12:33 +0000


You set up your scraper, press run, and the first few requests succeed. The data comes back exactly as you hoped, and for a moment, it feels like everything is working. Then the next request fails: a 403 Forbidden appears. Soon after, you are staring at a wall of CAPTCHAs. In some cases, there is not even an error message, and your IP is silently throttled until every request times out.

If you’ve ever tried scraping at scale, you’ve probably run into this. It’s frustrating, but it isn’t random. The web has become a tug of war between site owners and developers. On one side are businesses trying to protect their content and infrastructure. On the other are researchers, engineers, and companies that need access to that content. Anti-bot systems are designed for this fight, and they have grown into complex defenses that use IP reputation, browser fingerprinting, behavioral analysis, and challenge tests to block automation.

In this guide, you will learn what those defenses look like, why scrapers get blocked, and the strategies that actually make a difference. The goal is not to hand out short-term fixes, but to give you a clear understanding of the systems you are up against and how to build scrapers that last longer in production.

Ready? Let’s get started!

Chapter 1: Know Your Enemy: The Anatomy of a Modern Bot Blocker

If you want to bypass anti-bot systems, you first need to understand them. Bot blockers are built to detect patterns that real users rarely produce. They don’t rely on a single check but layer multiple defenses together. The more signals they collect, the more confident they become that the traffic is automated.

The easiest way to make sense of these systems is to break them down into four core pillars: IP reputation, browser fingerprinting, behavioral analysis, and active challenges. Each pillar covers a different angle of detection, and together they form the backbone of modern anti-bot defenses.

The Four Pillars of Detection

IP Reputation and Analysis

The first thing any website learns about you is your IP address. Every request exposes a source IP to the server. You can route traffic through a proxy or relay, but the server then sees the address of that intermediary instead. This makes IP reputation the very first filter that anti-bot systems apply. If your IP does not look trustworthy, you will be blocked before the site even checks your browser fingerprint, your behavior, or whether you can solve a CAPTCHA.

Why IP Type Matters

Websites classify IP addresses by their origin, and this classification has a direct impact on your chances of being blocked.

  • Datacenter IPs are those owned by cloud providers such as Amazon Web Services, Google Cloud, or DigitalOcean. They are attractive because they are cheap, fast, and easy to acquire, but they are also the most heavily scrutinized. Their ranges are publicly known, and many sites blacklist them pre-emptively. Even a brand-new IP from a datacenter can be flagged without ever being used for abuse.
  • Residential IPs come from consumer internet providers and are assigned to everyday households. Because they blend into the regular traffic of millions of users, they are much harder to detect and block. This is why residential proxy services are valuable, although they are also costly. However, once a proxy provider is identified, its pool of residential IPs can still be marked as suspicious.
  • Mobile IPs belong to carrier networks. They are the hardest to blacklist consistently, because thousands of users often share the same public address through carrier-grade NAT (Network Address Translation). These IPs also change frequently as devices move across cell towers. That churn makes them appear fresh and unpredictable. The sharing cuts both ways, though: extreme abuse from one user can still trigger blocks that affect everyone else on the same address.

The type of IP you use shapes your reputation before anything else is considered. A datacenter IP may be treated as suspicious even before it makes its first request. At the same time, a residential or mobile IP may earn more trust simply by belonging to a consumer or carrier network.

How Reputation Scores Are Built

Identifying your IP type is only the starting point. Websites and security providers maintain live databases of IP reputation that go far deeper. These systems assign a score to each address based on both historical evidence and real-time traffic.

Some of the most essential signals include:

  • Network ownership: An Autonomous System Number (ASN) identifies which organization owns a block of IPs. If the ASN belongs to a hosting provider, that alone can raise suspicion.
  • Anonymity markers: IPs known to be used by VPNs, Tor, or open proxy services are treated as risky.
  • Abuse history: If an IP has been linked to spam, scraping, or fraud in the past, that history follows it.
  • Request velocity: A human cannot make hundreds of requests in a second. High-volume activity is one of the clearest signs of automation.
  • Geographic consistency: A user’s IP location should align with their browser settings and session history. If someone appears in Canada one minute and Singapore the next, something is wrong.

The resulting score dictates how a website responds. Low-risk IPs may be allowed through without friction. Medium-risk IPs may see throttling or occasional CAPTCHAs. High-risk IPs are blocked outright with errors like 403 Forbidden or 429 Too Many Requests.
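To make the scoring idea concrete, here is a minimal sketch of how the signals above could be folded into a single score and mapped to a response. The weights, thresholds, and names (`IpSignals`, `risk_score`, `decide`) are hypothetical and chosen only for illustration; real reputation systems are proprietary and far more nuanced.

```python
from dataclasses import dataclass


@dataclass
class IpSignals:
    hosting_asn: bool            # ASN belongs to a hosting or cloud provider
    anonymizer: bool             # known VPN, Tor exit, or open proxy
    abuse_reports: int           # past spam, scraping, or fraud reports
    requests_per_second: float   # observed request velocity
    geo_mismatch: bool           # IP location disagrees with browser or session


def risk_score(s: IpSignals) -> int:
    """Return a 0-100 risk score using hand-picked, hypothetical weights."""
    score = 0
    if s.hosting_asn:
        score += 30
    if s.anonymizer:
        score += 25
    score += min(s.abuse_reports * 5, 20)   # cap the weight of abuse history
    if s.requests_per_second > 10:
        score += 15
    if s.geo_mismatch:
        score += 10
    return min(score, 100)


def decide(score: int) -> str:
    """Map a score to the three responses described above."""
    if score < 30:
        return "allow"
    if score < 60:
        return "challenge"   # throttling or a CAPTCHA
    return "block"           # e.g. 403 or 429


household = IpSignals(False, False, 0, 0.5, False)
cloud_vm = IpSignals(True, False, 0, 1.0, False)
print(decide(risk_score(household)), decide(risk_score(cloud_vm)))
```

Note how the cloud address lands in the "challenge" band without having made a single abusive request, which mirrors the point about datacenter IPs being distrusted by default.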

When a website detects suspicious traffic, it rarely stops at blocking just your IP. Most anti-bot systems are designed to think in groups, not individuals, which means the actions of one scraper can end up tainting an entire neighborhood of addresses.

At the smaller scale, this happens with subnets. A subnet is simply a slice of a larger network, carved out so that routers can manage traffic more efficiently. You’ll often see subnets written in a format like 192.0.2.0/24. This notation tells you that all the addresses from 192.0.2.0 through 192.0.2.255 are part of the same group. If a handful of those addresses start showing abusive behavior, it is much easier for a website to restrict the entire /24 block than to chase individual offenders.
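The /24 arithmetic is easy to check with Python’s standard `ipaddress` module. The sketch below also shows one way a defender might group flagged addresses by subnet; the helper names and the threshold of three offenders are assumptions made for the example, not a documented policy of any vendor.

```python
import ipaddress
from collections import Counter


def subnet_of(ip: str, prefix: int = 24):
    """Return the network that contains `ip` at the given prefix length."""
    # strict=False lets us pass a host address instead of a network address
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


def subnets_to_block(flagged_ips, threshold: int = 3):
    """Return subnets that contain at least `threshold` flagged addresses."""
    counts = Counter(subnet_of(ip) for ip in flagged_ips)
    return [net for net, n in counts.items() if n >= threshold]


block = ipaddress.ip_network("192.0.2.0/24")
print(block[0], block[-1], block.num_addresses)

flagged = ["192.0.2.5", "192.0.2.9", "192.0.2.200", "198.51.100.7"]
print(subnets_to_block(flagged))
```

Three offenders in 192.0.2.0/24 are enough to put the whole block on the list here, while the single flagged address in 198.51.100.0/24 is not.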

At a larger scale, blocking can happen at the level of an entire autonomous system (AS). The internet is made up of tens of thousands of these systems, which are large networks run by internet service providers, mobile carriers, cloud companies, universities, or government agencies. Each one manages its own pool of IP addresses, known as its “address space.” To keep things organized, every AS is assigned a unique identifier called an autonomous system number (ASN). For example, Cloudflare operates under ASN 13335, while Amazon Web Services announces its address space under several different ASNs.

Why does this matter? Because if one AS is consistently associated with scraping or fraud, websites can enforce rules across every IP inside it. That could mean millions of addresses flagged with a single policy update. This is especially common with cloud providers, since entire data center networks are publicly known and widely targeted by scrapers.

Browser Fingerprinting

Once websites confirm your IP looks safe, the next step is to examine your browser. This process, known as browser fingerprinting, involves collecting numerous small details about your browser to create a unique profile. Unlike cookies, which you can delete or block, fingerprinting does not rely on stored data. Instead, it takes advantage of the information your browser naturally exposes every time it loads a page.

What a Fingerprint Contains

A browser fingerprint is a collection of attributes that describe how your system looks and behaves. No single attribute is unique on its own, but when combined, they can create a profile that is very unlikely to match anyone else’s. Common components include:

  • User-Agent and headers: The User-Agent is a string that tells websites which browser and operating system you are using (for example, Chrome on Windows or Safari on iOS). Other headers can reveal your preferred language, supported file formats, or device type.
  • Screen and system settings: Your screen resolution, color depth, time zone, and whether your device supports touch input are all easy to read and can help distinguish you from others.
  • Graphics rendering: Websites use APIs such as Canvas and WebGL to draw hidden images in your browser. Because the result depends on your graphics card, drivers, and fonts, the output is slightly different for each machine.
  • Audio processing: Through the AudioContext API, sites can generate sounds that your hardware processes in unique ways. These differences become another signal in your fingerprint.
  • Fonts and layout: The fonts you have installed, and how your system renders text, vary across devices.
  • Plugins and media devices: Browsers can reveal what extensions are installed, and whether a camera, microphone, or other media device is available.

When all of these signals are combined, the result is usually distinctive enough to identify one device out of millions.

How Fingerprints Are Collected

Some of these values, like the User-Agent, are shared automatically every time your browser makes a request. Others are gathered using JavaScript that runs quietly in the background. For instance, a script may tell your browser to draw a hidden image on a canvas, then read back the pixel data to see how your system rendered it. Because hardware and software vary, the results form part of a unique signature.

These details are then combined into a hash, a short code that represents the overall configuration. If the same hash appears across visits, the system knows it is dealing with the same client, even if the IP has changed or cookies have been cleared.
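As a rough illustration of that last step, the sketch below serializes a set of attributes in a stable order and hashes them. The attribute names and the choice of SHA-256 are assumptions made for the example; commercial systems use their own attribute sets and often tolerate small changes rather than requiring an exact match.

```python
import hashlib
import json


def fingerprint_hash(attributes: dict) -> str:
    """Hash browser attributes into a short, stable identifier."""
    # Sorting keys makes the hash independent of collection order
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


visit_1 = {
    "user_agent": "Chrome on Windows",      # placeholder values
    "screen": "1920x1080x24",
    "timezone": "Europe/Berlin",
    "canvas": "a91f03",                     # digest of rendered pixel data
}
# Same device on a later visit: new IP, cleared cookies, same attributes
visit_2 = dict(reversed(list(visit_1.items())))

print(fingerprint_hash(visit_1) == fingerprint_hash(visit_2))
```

Nothing in the input depends on the IP address or on cookies, which is why changing either leaves the identifier untouched.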

Why Automation Tools Struggle

This is also the stage where automation platforms are exposed. Frameworks such as Puppeteer, Playwright, and Selenium drive real browsers, often in headless mode, which loads and interacts with web pages without a visible window. Although they are helpful for scraping, they often fail fingerprinting checks because they leak signs of automation.

  • A property called navigator.webdriver is usually set to true, which immediately signals automation.
  • Rendering in headless environments is often handled by software libraries like SwiftShader instead of a GPU, which produces outputs that differ from typical human-operated devices and can be fingerprinted.
  • Many browser APIs return incomplete or default values instead of realistic ones.
  • HTTP headers may be sent in an unusual order that does not match the patterns of real browsers.

Together, these inconsistencies make the fingerprint look unnatural. Even if your IP is clean, the browser itself gives you away.

Stability and the Growing Scope of Fingerprinting

Fingerprinting is not only about how unique a setup looks but also about how consistent it appears over time. Real users typically keep the same configuration for weeks or months, only changing after a software update or hardware replacement. Scrapers, on the other hand, often shift profiles from one session to the next. A client that looks like Chrome on Windows in one request and Safari on macOS in the next is unlikely to be genuine. Even minor mismatches, such as a User-Agent string reporting one browser version while WebGL capabilities match another, can be enough to raise suspicion.

To make detection harder to evade, websites continue expanding the range of signals they collect. In the past, some sites used the Battery Status API to collect signals like charge level and charging state, but browser vendors have since restricted or disabled this feature due to privacy concerns. Others use the MediaDevices API to identify how many microphones, speakers, or cameras are connected. WebAssembly can be used to run timing tests that expose subtle CPU characteristics, although modern browsers now limit timer precision to prevent microsecond-level leaks.

Even tools designed to protect privacy can make things worse. Anti-fingerprinting extensions often create patterns that stand out precisely because they look unusual. Instead of blending in, they can make a browser seem more suspicious.

This is why fingerprinting remains such a powerful defense. It does not depend on stored data and cannot be reset as easily as an IP address. It relies on the information your browser naturally reveals, which is very difficult to disguise. Even with a clean IP, an unstable or unrealistic fingerprint can expose a scraper before it ever reaches the target data. Managing fingerprints so that they appear natural and consistent is as essential as proxy rotation. Without it, no other bypass technique will succeed.

Behavioral Analysis (The “Turing Test”)

Even if your IP looks safe and your browser fingerprint appears realistic, websites can still catch you by looking at how you behave. This approach is known as behavioral analysis, and it is designed to spot the difference between natural human activity and automated scripts. Think of it as a digital version of the Turing Test: the site is silently asking, “Does this visitor actually move, click, and type like a person?”

People rarely interact with websites in predictable, machine-like ways. A human visitor might move the mouse in uneven arcs, scroll back and forth while reading, pause unexpectedly, or type in bursts with pauses between words. These slight irregularities form a behavioral signature.

Bots often fail at this. Many scripts execute actions with mechanical precision: clicks happen instantly, scrolling is smooth and perfectly uniform, and typing may occur at an inhumanly consistent speed. Some bots even skip interaction entirely, jumping directly to the data source they want.

Behavioral analysis systems compare these patterns to baselines collected from regular users. If your activity deviates significantly from typical patterns, the site may flag you as a bot, even if your IP and fingerprint appear legitimate.
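One simple way to picture a baseline comparison is to measure how much the gaps between keystrokes vary. The sketch below uses the coefficient of variation (standard deviation divided by mean); the function names and the 0.15 cutoff are hypothetical, and a production system would combine many such features rather than rely on one.

```python
from statistics import mean, pstdev


def timing_variation(timestamps_ms):
    """Coefficient of variation of the gaps between consecutive events."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return 0.0
    return pstdev(gaps) / mean(gaps)


def looks_scripted(timestamps_ms, min_variation=0.15):
    """Flag input whose rhythm is steadier than the assumed human baseline."""
    return timing_variation(timestamps_ms) < min_variation


bot_keys = [0, 50, 100, 150, 200]            # perfectly even 50 ms gaps
human_keys = [0, 120, 210, 640, 700, 980]    # bursts and pauses

print(looks_scripted(bot_keys), looks_scripted(human_keys))
```

The same measure could be applied to scroll events or clicks; the point is that a rhythm with almost no variation is itself a signal.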

Key Behavioral Signals

Websites collect a wide range of behavioral signals. The most common include:

  • Mouse movements and clicks: Human mouse paths contain tiny hesitations, jitters, and corrections. Bots either skip this step or simulate perfectly straight, robotic lines.
  • Scrolling behavior: Real users scroll unevenly, sometimes stopping midway, changing direction, or adjusting speed. Scripts often scroll in a linear, predictable way or avoid scrolling entirely.
  • Typing rhythm: Known as keystroke dynamics, this measures the timing of each keystroke. Humans type in bursts with natural pauses, while bots often fill fields instantly or type at an impossibly steady rhythm.
  • Navigation flow: A genuine visitor usually enters through the homepage or a category page, spends time browsing, and then reaches the data-heavy endpoint. Bots often go straight to the target URL within seconds.
  • Session activity: Humans vary in how long they stay on pages. Bots typically request content instantly and leave without hesitation. This makes session length a valuable signal.

TLS and JA3 Fingerprinting

Behavioral analysis is not limited to on-page actions. It also examines how your connection behaves.

Every HTTPS connection begins with a TLS handshake (Transport Layer Security handshake). This is the negotiation where your browser and the server agree on encryption methods before any content is exchanged. Each browser, operating system, and networking library has a slightly different way of performing this handshake.

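JA3 fingerprinting is a technique that takes the details the client sends in this handshake, including the TLS version, supported cipher suites, extensions, and elliptic curves, and generates a hash that identifies the client’s TLS stack. The hash does not single out one user, since every copy of the same browser build produces the same value, but it does reveal which software is really talking. If your scraper presents itself as Chrome but uses a handshake typical of Python’s requests library, the mismatch is easy to detect.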

This means that even before a single page loads, your connection can betray whether you are really using the browser you claim.
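The JA3 recipe itself is short: five fields from the ClientHello are written as decimal numbers, values within a field are joined with dashes, fields are joined with commas, and the resulting string is hashed with MD5. The sketch below follows that recipe; the numeric values are illustrative and not captured from any particular browser, and it skips details such as filtering GREASE values.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Build the comma-separated JA3 string from ClientHello fields."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(ja3: str) -> str:
    """MD5 digest of the JA3 string, as used for lookups."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Two clients offering the same ciphers in a different order
client_a = ja3_string(771, [4865, 4866, 4867], [0, 10, 11], [29, 23, 24], [0])
client_b = ja3_string(771, [4867, 4866, 4865], [0, 10, 11], [29, 23, 24], [0])

print(client_a)
print(ja3_hash(client_a) == ja3_hash(client_b))
```

Because order is part of the string, two libraries that support the same ciphers but list them differently still produce different hashes, and a User-Agent header has no influence on the result.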

Why Behavioral Analysis Is Effective

Behavioral analysis is harder to evade than other defenses because it measures live activity rather than static attributes. You can rent residential proxies or spoof browser fingerprints, but replicating the subtle quirks of human movement, scrolling, and typing takes much more effort.

Even advanced bots that try to simulate user actions can be exposed when their patterns are compared across multiple signals. For example, mouse movement may look natural, but the navigation flow might still be too direct. Or the keystroke dynamics might be convincing, but the TLS handshake does not match the claimed browser.

This multi-layered approach is what makes behavioral analysis one of the most resilient forms of bot detection.

Behavioral analysis acts as the final checkpoint. It catches bots that slip through IP and fingerprint filters, but still fail to behave like real users. For scrapers, bypassing anti-bot systems requires more than just technical camouflage. To succeed, your traffic must not only appear legitimate on the surface but also behave in a manner that closely mirrors human browsing patterns. Without that, even the most advanced proxy rotation or fingerprint spoofing will not be enough.

Challenges & Interrogation

Even if your IP looks clean and your browser fingerprint appears consistent, websites often add one final test: an active challenge. These are designed to confirm that there is a real user on the other end before granting access.

From CAPTCHA to Risk Scoring

The earliest challenges were simple CAPTCHAs. Sites showed distorted text or numbers that humans could solve, but automated scripts could not. Over time, this expanded to image grids, such as “select all squares with traffic lights.”

Today, many sites use more subtle methods, like Google’s reCAPTCHA v2, which introduced the “I’m not a robot” checkbox and occasional image puzzles. reCAPTCHA v3 shifted further, assigning an invisible risk score in the background so most users never see a prompt. hCaptcha followed a similar model, with a stronger emphasis on privacy and flexibility for site owners.

Invisible and Scripted Tests

Modern challenges increasingly happen behind the scenes. Cloudflare’s Turnstile runs lightweight checks in the browser, only interrupting the user if something looks suspicious. Its Managed Challenges adapt in real time, deciding whether to show a visible test or resolve quietly based on signals like IP reputation and session history.

Websites also use JavaScript challenges, which run small scripts inside the browser. These might:

  • Draw hidden graphics with Canvas or WebGL to confirm rendering quirks
  • Measure how code executes to verify real hardware is present
  • Check for storage, cookies, and header consistency

Passing such tests generates a short-lived token that the server validates before letting requests continue.
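The token step can be pictured as a signed, expiring value. The sketch below uses an HMAC over a client identifier and an expiry time; the format, the `SECRET` key, and the function names are hypothetical and do not represent how any specific vendor builds its clearance tokens.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"demo-secret-change-me"   # hypothetical server-side key


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(client_id: str, ttl_seconds: int = 300,
                now: Optional[float] = None) -> str:
    """Issue a token after the client has passed the challenge."""
    issued_at = time.time() if now is None else now
    payload = f"{client_id}:{int(issued_at + ttl_seconds)}"
    return f"{payload}:{_sign(payload)}"


def validate_token(token: str, now: Optional[float] = None) -> bool:
    """Accept the token only if the signature matches and it has not expired."""
    try:
        client_id, expires, signature = token.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return False
    expected = _sign(f"{client_id}:{expires}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return current < expires_at


token = issue_token("session-123", ttl_seconds=60)
print(validate_token(token))
```

Because the expiry is part of the signed payload, a client cannot extend a token’s life by editing it, and a scraper that replays an old token simply gets challenged again.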

The Push Toward Privacy

The newest trend moves away from puzzles entirely. Private Access Tokens, based on the Privacy Pass standard, allow trusted devices to prove they are legitimate without exposing identity. Instead of clicking boxes or solving images, the browser presents a cryptographic token issued by a trusted provider. Apple and Cloudflare are leading this move, aiming to remove CAPTCHA altogether for supported platforms.

Challenges and interrogation catch automated clients that may have passed IP and fingerprint checks, but still cannot prove they are genuine. The direction is clear: fewer frustrating puzzles, more invisible checks, and an emphasis on privacy-preserving tokens. For scrapers, this is often the hardest barrier to overcome, because failing a challenge does not just block access. It also signals to the site that automation is in play.

Major Bot Blockers

Chapter 2: The Rogues’ Gallery: A Deep Dive into Major Bot Blockers

Anti-bot vendors use the same four pillars of detection, but each adds its own methods and scale. Knowing how the big players operate helps explain why some scrapers fail instantly while others last longer.

Cloudflare

Cloudflare is the most widely deployed bot management solution, acting as a reverse proxy for millions of websites. A reverse proxy sits between a user and the website’s server, meaning Cloudflare can filter, inspect, or block traffic before the target site ever receives it.

Cloudflare uses multiple layers of defense:

  • I’m Under Attack Mode (IUAM): This feature activates when a site is experiencing unusual traffic. Visitors are shown a temporary interstitial page for about five seconds. During that pause, Cloudflare runs JavaScript code that collects information about the browser and verifies whether it looks legitimate. A standard browser passes automatically, while bots that cannot execute JavaScript are stopped immediately.
  • Turnstile: Unlike traditional puzzles, Turnstile performs background checks (for example, analyzing browser behavior and TLS handshakes) to verify real users invisibly. Only high-risk traffic sees explicit challenges, which reduces friction for humans while raising the bar for bots.
  • Shared IP Reputation: Cloudflare leverages its enormous footprint across the internet. If an IP is flagged for suspicious activity on one site, that information can be used to block it on others. This network effect makes Cloudflare particularly powerful at tracking abusers across domains.
  • Browser and TLS Fingerprinting: Beyond JavaScript challenges, Cloudflare inspects the TLS handshake (the initial negotiation that establishes an encrypted HTTPS connection). If your client claims to be Chrome but its TLS handshake matches known automation fingerprints (like those from Python libraries), it is easily exposed.

For scrapers, the greatest difficulty with Cloudflare lies in its scale and speed. Even if you rotate IPs or patch fingerprints, once a signal is flagged on one site, it can follow you everywhere Cloudflare operates.

Akamai

Akamai is one of the oldest and largest Content Delivery Networks (CDNs), and its bot management is among the most advanced. Unlike simple IP filtering, Akamai emphasizes behavioral data collection, sometimes referred to as sensor data.

What makes Akamai stand out:

  • Browser Sensors: JavaScript embedded in protected sites records subtle human signals: mouse movements, keystroke timing, scroll depth, and tab focus. These are compared against large datasets of genuine user activity. Bots typically generate movements that are too perfect, too fast, or missing altogether.
  • Session Flow Tracking: Instead of looking at single requests, Akamai evaluates the entire browsing journey. Humans usually navigate step by step: homepage, category page, product page, while bots often jump directly to data endpoints. This difference in flow is a strong detection signal.
  • Edge-Level Integration: Because Akamai runs at the CDN edge, it can correlate behavioral insights with network-level data:
    • ASN ownership: Is the traffic coming from a consumer ISP or a known hosting provider?
    • Velocity: Are requests being made faster than a human could reasonably click?
    • Geolocation: Does the user’s IP location align with their browser settings and session history?

Akamai is difficult to evade because it does not rely on just one layer of detection. To succeed, a scraper must mimic both the technical footprint and the organic, sometimes messy, flow of human browsing.

PerimeterX (HUMAN Security)

PerimeterX, now rebranded under HUMAN Security, is known for its client-side detection model. Instead of relying entirely on server-side logs, PerimeterX embeds sensors that run directly in the user’s browser session.

These sensors collect thousands of attributes in real time:

  • Deep Fingerprinting: WebGL rendering results, Canvas image outputs, installed fonts, available plugins, and even motion data from mobile devices all contribute to a unique profile. Unlike a simple User-Agent string, these combined values are difficult to spoof convincingly.
  • Automation Framework Detection: Popular scraping tools often leave behind subtle flags. For example, Selenium sets navigator.webdriver = true in most configurations, which is a dead giveaway. Puppeteer in headless mode often uses SwiftShader for rendering, which can differ from physical GPU outputs. Even the order in which HTTP headers are sent can expose a headless browser.
  • Ongoing Validation: Many systems check once per session, but PerimeterX continues to validate throughout. If your scraper passes the first test but shows suspicious behavior five minutes later, it can still be flagged.

Because PerimeterX looks so deeply into browser environments, it is particularly good at catching advanced bots that use headless browsers. Evading it requires not just patched fingerprints but also realistic rendering outputs and consistent session behavior over time.

DataDome

DataDome emphasizes AI-driven detection across websites, mobile apps, and APIs. Unlike older providers that focus mainly on web traffic, DataDome has built systems to secure modern app ecosystems where bots target APIs and mobile endpoints.

Its system relies on:

  • AI and Machine Learning Models: Every request is scored against patterns learned from billions of data points. This scoring happens in under two milliseconds, fast enough to avoid slowing down user experience.
  • Cross-Platform Protection: Bots are not limited to browsers. Many now use mobile emulators or modified SDKs to attack APIs directly. DataDome covers all these channels, analyzing whether the client environment matches expected behavior.
  • Adaptive Learning: Models are updated continuously to reflect new bot behaviors, ensuring the system evolves rather than relying on static rules.
  • Multi-Layered Analysis: Attributes like IP reputation, HTTP headers, TLS fingerprints, and on-page behavior are combined into a holistic risk score.

For scrapers, the key challenge is the breadth of coverage. Even if you disguise your browser, an API request from the same session may expose automation. And because detection happens in real time, there is little room for trial and error before blocks are enforced.

AWS WAF

Amazon Web Services provides a Web Application Firewall (WAF) that customers can configure to block unwanted traffic. Unlike Cloudflare or Akamai, AWS WAF is not a dedicated anti-bot product but a toolkit that site owners adapt to their own needs. Its strength lies in flexibility, which means scrapers can face very different levels of difficulty depending on how it is deployed.

Typical anti-bot rules in AWS WAF include:

  • Managed Rule Groups: AWS and partners provide prebuilt rules that block common malicious traffic, including known scrapers and impersonators of Googlebot.
  • Datacenter IP Blocking: Site owners often deny requests from IP ranges associated with cloud providers. Since many scrapers rely on these datacenter IPs, this is a simple but effective filter.
  • Rate Limiting: Rules can cap the number of requests a single client can send in a given timeframe. Humans rarely send more than a handful of requests per second, so exceeding those limits is suspicious.
  • Custom Filters: Organizations can create their own detection logic, such as flagging mismatched geolocations, odd header values, or repeated patterns of failed requests.
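To show what a rate-limiting rule does in principle, here is a small sliding-window limiter. It is a generic sketch, not AWS WAF’s implementation or configuration syntax, and the class name and limits are assumptions made for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)   # client -> timestamps of allowed requests

    def allow(self, client: str, now: float) -> bool:
        hits = self._hits[client]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False                  # over the limit: block or challenge
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
print([limiter.allow("203.0.113.9", t) for t in (0.0, 0.1, 0.2, 0.3)])
```

The fourth request inside the same second is refused, while a client that spaces its requests out never notices the rule exists.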

Because AWS WAF is configurable, its effectiveness varies. Some sites may implement only the most basic rules, which are easy to bypass with proxies, while others, especially large enterprises, may deploy complex rule sets that combine multiple signals, creating protection comparable to dedicated bot management platforms.

Each provider applies the same pillars of detection in different ways:

  • Cloudflare leverages scale and global IP reputation.
  • Akamai focuses on behavioral signals and session flow.
  • PerimeterX (HUMAN Security) digs deeply into client-side fingerprints and automation leaks.
  • DataDome uses real-time AI analysis across browsers, apps, and APIs.
  • AWS WAF relies on site-specific configurations that range from simple to highly sophisticated.

For scrapers, this means there is no single bypass strategy. You need to understand each system on its own terms, and your scraper’s resilience depends on a layered approach that addresses IP, fingerprints, behavior, and challenges simultaneously.

Techniques for Bypassing Detection

Chapter 3: The Scraper’s Toolkit: Core Techniques for Bypassing Detection

Anti-bot systems combine multiple signals to tell humans and automation apart. That means no single trick is enough to bypass them. You need a toolkit, a set of layered techniques that work together. Each one addresses a different pillar of detection: proxies manage your IP reputation, fingerprints protect your browser identity, CAPTCHA solutions handle active challenges, and human-like behavior makes your traffic believable. The goal is not to apply these techniques halfway but to apply them consistently, because detection systems compare multiple signals at once. A clean IP with a broken fingerprint will still be blocked. A perfect fingerprint with robotic timing will also fail. The techniques below are the foundation of any resilient scraping operation.

Technique 1: Proxy Management Mastery

Proxies are the foundation of every serious scraping project. Each request you send is tied to an IP address, and websites judge those addresses long before they examine your browser fingerprint or behavior. Without proxies, you are limited to a single identity that will almost always get flagged. With them, you can multiply your presence across thousands of identities, but only if you use them correctly.

Choosing the Right Proxy

Datacenter proxies

Datacenter IPs come from cloud providers and hosting companies. They are designed for scale, which makes them cheap and extremely fast. When you need to collect data from sites that have weak or no anti-bot defenses, datacenter proxies can get the job done at a fraction of the cost of other options.

The problem is reputation. Because datacenter ranges are publicly known, websites can block entire chunks of them in advance. A site that wants to protect itself from automated scraping can blacklist entire subnets or even autonomous systems belonging to providers like AWS or DigitalOcean. That means even a “fresh” datacenter IP may already be treated with suspicion before it makes its first request. If your target is sensitive, such as e-commerce, ticketing, or finance, datacenter traffic will often be blocked at the door.

Residential proxies

Residential IPs are issued by consumer internet service providers, the same ones that serve ordinary households. From a website’s perspective, traffic from these IPs looks just like regular user activity. That natural cover gives residential proxies a much higher trust level. They are particularly effective when scraping guarded pages, logged-in content, or platforms that rely heavily on IP reputation.

The trade-off is speed and cost. Residential IPs tend to respond more slowly than datacenter IPs, and most providers charge by bandwidth rather than per IP, so costs add up quickly on large projects. They can also be targeted if abuse is concentrated. If too many suspicious requests originate from the same provider or subnet, websites can extend blocks across that range, reducing the reliability of the pool.

Mobile proxies

Mobile IPs are routed through carrier networks. Here, thousands of users share the same public IP address, and devices constantly switch towers as they move. That constant churn makes mobile IPs nearly impossible to blacklist consistently. If a site blocked one, it could accidentally cut off thousands of legitimate mobile users at once.

This makes mobile proxies one of the most potent tools for scraping heavily protected content. However, they are also the most expensive and the least predictable. Because you are sharing the address with many strangers, your session can suddenly inherit the consequences of someone else’s abusive activity. Frequent IP changes mid-session can also disrupt multi-step flows like checkouts or form submissions.

In practice, few scrapers rely on a single category. Datacenter proxies deliver speed and scale where defenses are weak, residential proxies strike a balance of cost and reliability for most guarded content, and mobile proxies are reserved for the hardest restrictions where stealth is non-negotiable.

Rotation that Feels Human

Choosing the right proxy type is only the first step. The next challenge is using those proxies in ways that resemble real browsing. Websites do not just look at which IP you use; they observe how long you use it, how often it appears, and whether its behavior aligns with a human pattern.

Rotation strategies help you manage this.

  • Sticky sessions: Instead of switching IPs on every request, keep the same one for a cluster of related actions. A real user browsing a shop will log in, click around, and add something to their cart without changing IP midway. Holding onto the same proxy for these flows makes your traffic believable.
  • Rotating sessions: For bulk crawls, such as collecting thousands of product listings, swap IPs every few requests or pages. This spreads out the workload and prevents any single IP from carrying too much risk.
  • Geographic alignment: If your proxy is in Germany, for example, your headers, cookies, and time zone should tell the same story. Sudden jumps from one country to another in the middle of a session are easy for defenses to spot.
  • Request budgets: Every IP has a lifespan. If you push it too hard with hundreds of rapid requests, it will get flagged. Assign a realistic budget of requests per IP, retire it once that limit is reached, and reintroduce it later.
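Sticky sessions and request budgets can be combined in a small rotator. The sketch below is a minimal illustration under assumed names (`ProxyRotator`, `get`) and placeholder proxy URLs; it does not reintroduce retired proxies after a cooldown, which a real pool would need.

```python
import itertools


class ProxyRotator:
    """Round-robin rotation with sticky sessions and a per-proxy budget."""

    def __init__(self, proxies, budget_per_proxy: int):
        self.budget = budget_per_proxy
        self.used = {p: 0 for p in proxies}
        self._cycle = itertools.cycle(proxies)
        self._sticky = {}   # session id -> proxy currently pinned to it

    def _next_with_budget(self):
        for _ in range(len(self.used)):
            proxy = next(self._cycle)
            if self.used[proxy] < self.budget:
                return proxy
        raise RuntimeError("all proxies have exhausted their budget")

    def get(self, session_id=None):
        """Return a proxy; pass a session_id to keep related requests together."""
        proxy = self._sticky.get(session_id) if session_id else None
        if proxy is None or self.used[proxy] >= self.budget:
            # New session, or the pinned proxy is spent: pick the next one
            proxy = self._next_with_budget()
            if session_id:
                self._sticky[session_id] = proxy
        self.used[proxy] += 1
        return proxy


rotator = ProxyRotator(
    ["http://proxy-a.example:8000", "http://proxy-b.example:8000"],
    budget_per_proxy=50,
)
checkout = [rotator.get("checkout-flow") for _ in range(3)]   # same proxy each time
bulk = [rotator.get() for _ in range(4)]                      # rotates per request
print(len(set(checkout)), len(set(bulk)))
```

Requests that belong to one user journey share an address, while the bulk crawl spreads across the pool, and neither can push a proxy past its budget.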

The trick is balance. People do not change IPs every second, but they also do not hammer a website with thousands of requests from the same address. Rotation that feels human is about pacing and continuity, not random churn.

Keeping the Pool Healthy

Even the best proxy rotation plan will fail if the pool itself is weak. Some IPs will perform flawlessly, while others will either slow down or burn out quickly. Managing a proxy pool means constantly monitoring, pruning, and replenishing.

Metrics worth tracking include:

  • Block signals such as 403 Forbidden, 429 Too Many Requests, and CAPTCHA challenges
  • Connection health, like timeouts, TLS handshake failures, and dropped sessions
  • Latency and response times, which can reveal throttling or overloaded providers

When you spot problems, isolate them. Quarantine flagged IPs or entire subnets to avoid poisoning the rest of your traffic. Replace weak providers with stronger ones, and always spread your pool across multiple vendors so that one outage does not bring everything down.
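A minimal sketch of that monitoring loop might look like the following. The class name, the 20% block-rate threshold, and the minimum sample size are assumptions chosen for the example; tune them against your own traffic.

```python
from collections import defaultdict

BLOCK_STATUSES = {403, 429}


class PoolHealth:
    """Track per-proxy outcomes and list proxies that should be quarantined."""

    def __init__(self, max_block_rate: float = 0.2, min_samples: int = 10):
        self.max_block_rate = max_block_rate
        self.min_samples = min_samples
        self.total = defaultdict(int)
        self.blocked = defaultdict(int)

    def record(self, proxy: str, status_code=None, captcha: bool = False):
        """Record one request. status_code=None means a timeout or dropped connection."""
        self.total[proxy] += 1
        if status_code is None or status_code in BLOCK_STATUSES or captcha:
            self.blocked[proxy] += 1

    def block_rate(self, proxy: str) -> float:
        seen = self.total[proxy]
        return self.blocked[proxy] / seen if seen else 0.0

    def quarantined(self):
        """Proxies with enough samples and a block rate above the threshold."""
        return sorted(
            p for p in list(self.total)
            if self.total[p] >= self.min_samples
            and self.block_rate(p) > self.max_block_rate
        )


health = PoolHealth(max_block_rate=0.2, min_samples=5)
for code in (200, 200, 200, 200, 200):
    health.record("proxy-a", code)
for code in (200, 403, 429, 200, None):
    health.record("proxy-b", code)
print(health.quarantined())
```

Requiring a minimum number of samples keeps a single unlucky response from retiring an otherwise healthy address.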

A healthy pool is a constantly moving target that requires maintenance. Skipping this step is the fastest way to turn a strong setup into a fragile one.

Putting it All Together

Mastering proxy management is about combining all three layers: choosing the right proxy type, rotating them in ways that mimic human behavior, and keeping the pool clean. Datacenter, residential, and mobile proxies each have their place, and their strengths complement one another when used strategically. Rotation rules make those IPs look natural, and pool maintenance ensures you always have healthy addresses ready.

Without this foundation, none of the other bypass techniques, like fingerprint spoofing, behavior simulation, or CAPTCHA solving, will matter. If your proxies fail, everything else falls apart.

Technique 2: Perfecting Your Digital Identity (Fingerprint & Headers)

Proxies may give you a new address on the internet, but they do not tell the whole story. Once a request reaches a website, the browser itself comes under scrutiny. This is where many scrapers fail. They might be using a clean IP, but the headers, rendering outputs, or session data they present do not resemble a real person. Fingerprinting closes that gap. To pass this test, you need to create an identity that not only looks consistent but also behaves as if it belongs to a real browser in a real location.

Choosing A Realistic Baseline

The first decision is what identity to copy. Defenders have massive datasets of how common browsers look and behave, so straying too far from the norm is risky.

A good approach is to anchor your setup in a widely used combination: for example, Chrome 115 on Windows 10, or Safari on iOS. These represent large segments of real users. If you instead show up as a rare Linux build with an unusual screen resolution, you instantly stand out. This choice becomes your baseline. Everything else, such as headers, rendering results, fonts, and media devices, must align with it.

Making Fingerprints And Networks Agree

An IP address already reveals a lot about where traffic is coming from. If your fingerprint tells a different story, detection is almost guaranteed.

  • Time zone, locale, and Accept-Language should reflect the region of your proxy.
  • A German IP, for instance, should not be paired with a US English-only browser and a Pacific time zone.
  • Currency, local domains, and even keyboard layouts can reinforce or break this alignment.

Think of this as storytelling. The IP and the fingerprint are two characters. If they contradict each other, the plot falls apart.
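One way to catch contradictions before they reach a target site is a small consistency check that runs whenever a proxy and a browser profile are paired. The sketch below is a minimal version of that idea. The `REGION_PROFILES` table is a hypothetical stand-in that covers only two countries, and a real one would need an entry for every region you route through.

```python
# Hypothetical expectations per proxy country; extend this for the regions you actually use.
REGION_PROFILES = {
    "DE": {"languages": ("de",), "timezones": ("Europe/Berlin",)},
    "US": {
        "languages": ("en",),
        "timezones": ("America/New_York", "America/Chicago",
                      "America/Denver", "America/Los_Angeles"),
    },
}


def alignment_issues(proxy_country, accept_language, timezone):
    """Return a list of mismatches between the proxy's country and the browser profile."""
    expected = REGION_PROFILES.get(proxy_country)
    if expected is None:
        return [f"no profile defined for {proxy_country}"]
    issues = []
    # The primary language is the first tag, e.g. "de-DE,de;q=0.9,en;q=0.8" -> "de".
    primary = accept_language.split(",")[0].split(";")[0].split("-")[0].strip().lower()
    if primary not in expected["languages"]:
        issues.append(f"Accept-Language starts with '{primary}', which does not fit {proxy_country}")
    if timezone not in expected["timezones"]:
        issues.append(f"time zone {timezone} does not fit {proxy_country}")
    return issues
```

An empty list means the pairing tells one story. The German IP with a US English browser on Pacific time from the example above would come back with two issues.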

Building Headers That Match Real Traffic

Headers are often overlooked, yet they are one of the most powerful indicators of authenticity. Websites check not only the values but also whether the set of headers and their order match what real browsers send.

  • A User-Agent string must match the exact browser and version you claim.
  • Accept, Accept-Language, Accept-Encoding, and the newer Sec-CH-UA headers should all be present and correct.
  • The order matters. Real browsers send them in consistent sequences that defenders log and compare against.

Rotating only the User-Agent is a common beginner mistake. Without updating the entire header set to match, the disguise falls apart instantly.
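To make the "whole header set" idea concrete, here is a minimal sketch that stores one profile as an ordered list and checks that the User-Agent and the client hints claim the same browser version. The header values are illustrative for a Chrome 115 on Windows profile. In practice you would capture the exact set and order from a real browser session rather than rely on this list.

```python
import re

# Illustrative values only; capture the real set and order from an actual browser.
CHROME_115_WINDOWS = [
    ("sec-ch-ua", '"Not/A)Brand";v="99", "Google Chrome";v="115", "Chromium";v="115"'),
    ("sec-ch-ua-mobile", "?0"),
    ("sec-ch-ua-platform", '"Windows"'),
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Encoding", "gzip, deflate, br"),
    ("Accept-Language", "en-US,en;q=0.9"),
]


def build_headers(profile, accept_language=None):
    """Return an ordered header dict for one profile, optionally localizing Accept-Language."""
    headers = dict(profile)  # dicts preserve insertion order in Python 3.7+
    if accept_language:
        headers["Accept-Language"] = accept_language  # updating a key keeps its position
    return headers


def versions_agree(headers):
    """Check that User-Agent and sec-ch-ua advertise the same Chrome major version."""
    ua = re.search(r"Chrome/(\d+)", headers["User-Agent"])
    hint = re.search(r'"Google Chrome";v="(\d+)"', headers["sec-ch-ua"])
    return bool(ua and hint and ua.group(1) == hint.group(1))
```

Keep in mind that some HTTP clients add or reorder headers on their own, so it is worth inspecting what actually goes over the wire before trusting the order defined here.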

Closing The Gaps In Headless Browsers

Automation tools like Puppeteer, Playwright, and Selenium are designed for control, not invisibility. Out of the box, they leak signs of automation.

  • navigator.webdriver is automatically set to true, which flags the browser as automated.
  • Properties like navigator.plugins or navigator.languages often return empty or default values, unlike real browsers.
  • Graphics rendered with SwiftShader in headless mode can be different from outputs produced by a physical GPU.
  • Headers may be sent in unnatural orders or with missing fields.

To avoid instant detection, you need to patch or disguise these gaps. Stealth plugins and libraries exist for this, but they still require careful testing and validation.
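Part of that testing can be automated. The sketch below audits a snapshot of browser properties for the tells listed above. It assumes you have already collected the values from the page (for example through your automation tool's script-evaluation call) into a plain dictionary, and the key names are hypothetical.

```python
def automation_leaks(snapshot):
    """Return the automation tells found in a snapshot of browser properties."""
    leaks = []
    if snapshot.get("webdriver"):
        leaks.append("navigator.webdriver is true")
    if not snapshot.get("plugins"):
        leaks.append("navigator.plugins is empty")
    if not snapshot.get("languages"):
        leaks.append("navigator.languages is empty")
    if "swiftshader" in snapshot.get("webgl_renderer", "").lower():
        leaks.append("WebGL renderer is SwiftShader (software rendering)")
    return leaks
```

Running this check after every stealth-plugin or browser upgrade gives an early warning when a patch silently stops working.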

Making Rendering Outputs Believable

Fingerprinting relies heavily on how your system draws graphics and processes audio.

  • Canvas and WebGL outputs should align with the GPU and operating system you claim. A Windows laptop should not render like a mobile device.
  • Fonts must match the declared platform. A Windows profile with macOS-only fonts raises alarms.
  • AudioContext results must remain stable across a session, since real hardware does not change its sound processing randomly.

These details are subtle, but together they form a signature that is hard to fake and easy to check. Defenders know what standard systems look like; if yours has capabilities that are too empty or too crowded, suspicion rises.

A laptop typically reports a single microphone and webcam, so having none or a dozen looks strange. Browser features should match the version you present. For example, an older version of Chrome should not claim to support APIs that were only introduced later. Even installed extensions can betray you. A completely empty profile is just as suspicious as one with twenty security tools.

Maintaining Stability Over Time

One of the strongest signals websites check is stability. Real users do not constantly switch between different devices or browser versions. They use the same setup until they update or replace their hardware.

  • Maintain the same fingerprint within a sticky session, particularly for high-value flows such as logins or carts.
  • Change versions only when it makes sense, such as after a scheduled browser update.
  • Avoid rapid platform switches, such as transitioning from Windows to macOS between requests.

Stability tells defenders that you are a steady, consistent user, not a bot cycling through different disguises.
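A simple way to enforce this is to derive the profile from the session ID instead of picking one at random per request. The sketch below does that with a hash. The profile names are hypothetical placeholders for full header and fingerprint bundles.

```python
import hashlib

# Hypothetical profile names; each would map to a full header and fingerprint bundle.
PROFILES = ["chrome-windows", "chrome-macos", "safari-ios"]


def profile_for_session(session_id, profiles=PROFILES):
    """Deterministically map a session ID to one profile so it never changes mid-session."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(profiles)
    return profiles[index]
```

Because the mapping depends only on the session ID, every worker that handles the same session presents the same identity without needing shared state.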

Cookies, localStorage, and sessionStorage are not just technical details; they are part of what makes a session feel real. A genuine browser carries state forward across visits.

  • Let cookies accumulate naturally, including authentication tokens and consent banners.
  • Reuse them for related requests rather than wiping them clean each time.
  • Preserve session history so that the browsing pattern looks continuous.

Without state, every request looks like a first-time visitor, which is rarely how real users behave.

Measuring And Adjusting

Finally, you cannot perfect a fingerprint once and forget it. Websites change what they check, and even minor mismatches can appear over time.

  • Track how often you face CAPTCHA, blocks, or unusual error codes.
  • Log the outputs of your own Canvas, WebGL, and AudioContext to catch instability.
  • Compare your profile to real browser captures using tools like CreepJS or FingerprintJS.

This feedback loop helps you correct mistakes before they burn your entire setup.
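For the logging step, one lightweight option is to hash each rendering output and compare it with the first value seen in the same session. The sketch below assumes you can export the raw Canvas, WebGL, or AudioContext output as bytes. The class and signal names are made up for the example.

```python
import hashlib


class DriftMonitor:
    """Remembers the first hash of each rendering output per session and flags changes."""

    def __init__(self):
        self.baseline = {}

    def check(self, session_id, signal, raw_output):
        """Return True if this output matches the session's first recorded value."""
        digest = hashlib.sha256(raw_output).hexdigest()
        first_seen = self.baseline.setdefault((session_id, signal), digest)
        return first_seen == digest
```

A `False` result means the fingerprint changed mid-session, which is exactly the kind of instability worth investigating before a site notices it.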

Fingerprint management is about coherence. Your IP, headers, rendering, devices, and behavior all need to tell the same story. A clean IP without a matching fingerprint will still be blocked. A patched fingerprint without stability will still look wrong. Only when all parts are aligned do you create an identity that can survive in production.

Technique 3: Solving the CAPTCHA Conundrum

Even if you have clean IPs and fingerprints that look human, websites often add one more obstacle before granting access: a challenge-response test known as CAPTCHA. The acronym stands for Completely Automated Public Turing test to tell Computers and Humans Apart. Put simply, it is a puzzle designed to be easy for people but difficult for bots.

CAPTCHAs are not new, but they have evolved into one of the toughest barriers scrapers face. To deal with them effectively, you need to understand what you are up against and choose a strategy that balances cost, speed, and reliability.

Understanding the Different Forms of CAPTCHA

Not all CAPTCHAs look the same. Over the years, defenders have introduced new formats to stay ahead of automation tools.

  • Text-based CAPTCHAs: These were the earliest form, where users had to type distorted letters or numbers. They are now largely phased out because machine learning models can solve them with high accuracy.
  • Image selection challenges: These ask the user to click on all images containing an object, such as traffic lights or crosswalks. They rely on human visual recognition, which is still harder to automate consistently.
  • reCAPTCHA v2: Google’s version that often shows up as the “I’m not a robot” checkbox. If the system is suspicious, it escalates to an image challenge.
  • reCAPTCHA v3: A behind-the-scenes version that scores visitors silently based on their behavior, only serving challenges if the score is too low.
  • hCaptcha and Cloudflare Turnstile: Alternatives that serve similar roles, often preferred by sites that want to avoid sending user data to Google. Turnstile is especially tricky because it can run invisible checks without showing the user anything.

Each type has its own level of difficulty. The simpler ones can be solved automatically, but the more advanced forms often require external help.

The CAPTCHA Solving Ecosystem

Because scrapers cannot always solve CAPTCHAs on their own, an entire ecosystem of third-party services exists to handle them. These services usually fall into two categories:

  • Human-powered solvers: Companies employ workers who receive CAPTCHA images and solve them in real time. You send the challenge through an API, they solve it within seconds, and you get back a token to submit with your request.
  • Machine-learning solvers: Some services attempt to solve CAPTCHA with automated models. They can be faster and cheaper but are less reliable against newer and more complex challenges.

Popular providers include 2Captcha, Anti-Captcha, and DeathByCaptcha. They integrate easily into scraping scripts by exposing simple APIs where you post a challenge, wait for the solution, and then continue your request.
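The post, wait, continue flow can be sketched without tying it to any one provider. In the example below, `submit` and `poll` are hypothetical callables that you would implement against your solver's documented API. The retry counts and polling interval are assumptions, not provider recommendations.

```python
import time


class SolveFailed(Exception):
    """Raised when no usable token came back within the allowed attempts."""


def solve_with_retries(submit, poll, challenge, attempts=3,
                       poll_interval=5.0, max_polls=6, sleep=time.sleep):
    """Submit a challenge, poll for its token, and retry on timeout or network error.

    `submit(challenge)` returns a task ID; `poll(task_id)` returns a token,
    or None while the solve is still pending. Both are placeholders.
    """
    for _attempt in range(attempts):
        try:
            task_id = submit(challenge)
            for _poll in range(max_polls):
                token = poll(task_id)
                if token:
                    return token
                sleep(poll_interval)
        except OSError:
            # Network-level failure talking to the solver; move on to the next attempt.
            continue
    raise SolveFailed(f"no token after {attempts} attempts")
```

Passing `sleep` in as a parameter keeps the function easy to test and lets you swap in a different waiting strategy later.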

CAPTCHA solving introduces trade-offs that you have to plan for:

  • Cost: Each solve costs money, often fractions of a cent, but this adds up at scale. For scrapers making millions of requests, solving CAPTCHA manually can become the most significant expense.
  • Latency: Human solvers take time. Even the fastest services usually add a delay of 5–20 seconds. This may be acceptable for occasional requests, but it slows down large crawls.
  • Reliability: Solvers are not perfect. Sometimes they return incorrect answers or time out. Building in error handling and retries is essential.

This is why many teams mix strategies: using solvers only when necessary, while trying to minimize how often challenges are triggered in the first place.
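A quick back-of-the-envelope calculation shows why challenge frequency matters more than solver price. The function below is a sketch, and the price of 2.00 per 1,000 solves and the 12-second average solve time used in the note are made-up inputs for illustration, not quotes from any provider.

```python
def captcha_budget(requests, challenge_rate, price_per_1000, avg_solve_seconds, retry_rate=0.0):
    """Estimate solver spend (in the price's currency) and cumulative solver wait in hours."""
    solves = requests * challenge_rate * (1 + retry_rate)
    cost = solves / 1000 * price_per_1000
    wait_hours = solves * avg_solve_seconds / 3600
    return round(cost, 2), round(wait_hours, 1)
```

With those assumed inputs, one million requests at a 5% challenge rate come to about 100 in solver fees and roughly 167 hours of cumulative waiting. Cutting the challenge rate to 0.5% drops both figures by a factor of ten, which is the argument for prevention over solving.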

Reducing CAPTCHA Frequency

The best way to handle CAPTCHAs is not to see them often. Careful planning can keep challenges rare:

  • Maintain good IP hygiene: Residential or mobile proxies with low abuse history face fewer CAPTCHAs.
  • Keep fingerprints consistent: Browsers that look real and stable raise fewer red flags.
  • Pace your requests: Sudden bursts of traffic are more likely to trigger challenges.
  • Reuse cookies and sessions: A returning user with a history of normal browsing behavior is less likely to be tested.

By reducing how suspicious your traffic looks, you can push CAPTCHAs from being constant roadblocks to occasional speed bumps.

Overall, you have three main options for dealing with CAPTCHAs:

  1. Bypass them entirely by preventing triggers with good proxy, fingerprint, and behavior management.
  2. Outsource solving to a third-party service, accepting the cost and delay.
  3. Combine approaches, using solvers only when absolutely necessary while optimizing your setup to minimize their frequency.

Managing CAPTCHAs is less about brute force and more about strategy. If you rely on solving them at scale, your scraper will be slow and expensive. If you invest in preventing them, solvers become a rare fallback instead of a dependency.

Technique 4: Mimicking Human Behavior

At this point, you have clean IPs, fingerprints that look real, and a strategy for dealing with CAPTCHAs. But if your scraper still moves through a website like a robot, detection systems will notice. This is where behavioral mimicry comes in. The goal is not only to send requests that succeed, but to make your traffic look like it belongs to a person sitting at a screen.

Websites have spent years fine-tuning their ability to distinguish humans from bots. They know that people pause, scroll unevenly, misclick, and browse in messy and unpredictable ways. A scraper that always requests the next page instantly, scrolls in perfect increments, or never makes mistakes stands out. Mimicking human behavior makes your automation blend in with the natural noise of real users.

Building Human-Like Timing

One of the easiest giveaways of a bot is timing. Real users never click or type with machine precision.

  • Delays between actions: Instead of firing requests back-to-back, add short pauses that vary randomly. For example, wait 2.4 seconds after one click, then 3.1 seconds after the next.
  • Typing simulation: When filling forms, stagger keypresses to mimic natural rhythm. People often type in bursts, with slight pauses between words.
  • Warm-up navigation: Before going straight to the target data page, let your scraper visit the homepage or a category page. Real users rarely jump to deep links without a path.

These adjustments slow down your scraper slightly but dramatically reduce how robotic it looks.
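The timing ideas above can be sketched with the standard library alone. The ranges below are assumptions picked to look plausible, and they would need tuning against the behavior of real users on your target site.

```python
import random


def action_delay(rng, low=1.5, high=4.0):
    """Pick a pause in seconds between actions, uniformly within a plausible range."""
    return rng.uniform(low, high)


def typing_delays(text, rng, key_range=(0.05, 0.25), word_pause=(0.3, 0.9)):
    """Return one delay per character, with a longer pause after each space."""
    delays = []
    for char in text:
        delay = rng.uniform(*key_range)
        if char == " ":
            delay += rng.uniform(*word_pause)  # people tend to pause between words
        delays.append(delay)
    return delays
```

Passing a `random.Random` instance instead of using the module-level functions makes runs reproducible while debugging. In production you would sleep for each returned delay before the next click or keypress.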

Making Navigation Believable

Beyond timing, websites watch where you go and how you get there.

  • Session flow: Humans often wander. They may open a menu, check an unrelated page, or click back before moving on. Adding a few detours creates a more realistic flow.
  • Scrolling behavior: People scroll unevenly, sometimes stopping mid-page, then continuing. Scripts can replicate this by scrolling in variable increments and pausing at random points.
  • Mouse movement: While many scrapers skip this entirely, some detection systems check for mouse events. Simulating small, imperfect arcs and jitter makes interaction data look genuine.
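Both scrolling and mouse movement reduce to generating slightly irregular sequences. The sketch below splits a scroll distance into uneven steps and traces a curved mouse path with a quadratic Bezier curve plus jitter. The step sizes, bend, and jitter amounts are arbitrary assumptions.

```python
import random


def scroll_steps(total_pixels, rng, min_step=120, max_step=600):
    """Split a scroll distance into uneven steps that add up to the total."""
    steps, remaining = [], total_pixels
    while remaining > 0:
        step = min(remaining, rng.randint(min_step, max_step))
        steps.append(step)
        remaining -= step
    return steps


def mouse_path(start, end, rng, points=20, bend=80, jitter=2.0):
    """Trace a curved path from start to end using a quadratic Bezier curve plus jitter."""
    (x0, y0), (x1, y1) = start, end
    # A random control point near the midpoint gives the arc its bend.
    cx = (x0 + x1) / 2 + rng.uniform(-bend, bend)
    cy = (y0 + y1) / 2 + rng.uniform(-bend, bend)
    path = []
    for i in range(points + 1):
        t = i / points
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        if 0 < i < points:  # keep the endpoints exact
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        path.append((x, y))
    return path
```

Each scroll step or path point would then be fed to your automation tool's scroll or mouse-move call, with a short random pause in between.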

Managing Cookies and Sessions

Humans carry baggage from one visit to the next in the form of cookies and session history. A scraper that always starts fresh looks suspicious.

  • Persist cookies: Store and reuse cookies so your scraper appears as the same user returning.
  • Maintain sessions: Use sticky proxies to hold an IP across several requests, keeping the identity consistent.
  • Align browser state: Headers like “Accept-Language” and time zone settings should match the location of the IP you are using.

This continuity creates the impression of a long-term visitor rather than disposable traffic.
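Persisting cookies can be as simple as writing them to disk per identity and loading them on the next run. This is a minimal sketch that treats cookies as a flat name-to-value mapping. Real cookies also carry domain, path, and expiry attributes that a production store would need to keep.

```python
import json
from pathlib import Path


def save_cookies(path, cookies):
    """Write a {name: value} cookie mapping for one identity to disk."""
    Path(path).write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(path):
    """Load the stored cookies, or start empty for a first-time visitor."""
    file = Path(path)
    if not file.exists():
        return {}
    return json.loads(file.read_text(encoding="utf-8"))
```

Keying the file name by identity (for example, one file per sticky-proxy session) keeps each returning "user" tied to the same IP and the same cookie history.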

Balancing Scale and Stealth

The challenge is that human-like behavior is slower by design. If you are scraping millions of pages, adding pauses and navigation steps can cut throughput. The solution is to parallelize: run more scrapers in parallel, each moving at a believable pace, instead of trying to push one scraper at unnatural speed.
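The parallel approach might look like the following sketch, in which several workers share a queue and each one pauses a random interval after every page. The `fetch` coroutine is a placeholder you would supply, and the worker count and delay range are assumptions.

```python
import asyncio
import random


async def paced_worker(name, queue, fetch, results, delay_range):
    """Pull URLs from the queue one at a time, pausing a random interval after each."""
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        results.append((name, await fetch(url)))
        await asyncio.sleep(random.uniform(*delay_range))


async def crawl(urls, fetch, workers=5, delay_range=(2.0, 5.0)):
    """Run several slow, human-paced workers side by side."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    tasks = [paced_worker(f"w{i}", queue, fetch, results, delay_range) for i in range(workers)]
    await asyncio.gather(*tasks)
    return results
```

Throughput then scales with the number of workers rather than with the speed of any single one. Each worker should still have its own proxy, cookies, and fingerprint so that the parallel sessions do not look like one user in many places at once.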

Mimicking human behavior is about creating noise and imperfection. A successful scraper does not just move from point A to point B as fast as possible. It hesitates, scrolls, and carries history just like a person would. Combined with strong IP management and consistent fingerprints, this makes your automation much harder to distinguish from a real visitor.

Chapter 4: The Strategic Decision: When to Build vs. When to Buy

Every technique we have covered so far—proxy management, fingerprint alignment, behavioral simulation, and solving challenges—can be built and maintained by a dedicated team. Many developers start this way because it offers maximum control and transparency. Over time, however, the reality of maintaining an unblocking system at scale forces a bigger decision: should you continue to invest in building internally, or should you adopt a managed solution that handles these defenses for you?

The True Cost of an In-House Solution

On paper, building in-house combines the right tools: a proxy provider, a CAPTCHA solver, and some logic to manage requests. In practice, it evolves into a complex system that must adapt to every change in how websites block automation.

Maintaining such a system requires constant investment in four areas:

  • Engineering capacity: Developers spend a significant amount of time patching scripts when sites update their defenses, rewriting fingerprint logic, and building monitoring tools to catch failures.
  • Proxy infrastructure: Residential and mobile proxies are indispensable for challenging targets, but they come with high recurring costs. Pools degrade as IPs are flagged, requiring continuous replacement and vendor management.
  • Challenge solving: CAPTCHA and some client-side JavaScript puzzles add direct costs per request. Even with solvers, failure rates introduce retries that inflate both costs and delays.
  • Monitoring and updates: Sites rarely stay static. What works one month may fail the next, and every update to defenses requires a response. The system becomes a moving target.

Introducing the Managed Solution: Scraping APIs 

A managed scraping API abstracts these same components into a single request. Instead of provisioning proxies, patching fingerprints, or integrating solver services yourself, the API handles those tasks automatically and delivers the page content.

The core benefit is focus. Firefighting bot detection updates no longer consume development time. Teams can focus on extracting insights from the data instead of maintaining the pipeline. Costs are generally easier to predict because many managed APIs bundle infrastructure, rotation logic, and solver fees, although high volumes or specialized targets can still increase expenses.

This does not make managed services universally superior. For small-scale projects with limited targets, a custom in-house setup can be cheaper and more flexible. However, for projects that require consistent, large-scale access, the stability of a managed API often outweighs the control of building everything yourself.

The Trade-Off

The choice is not between right and wrong, but between two different ways of investing resources:

  • Build if you have strong technical expertise, modest scale, and the need for complete control over how every request is managed.
  • Buy if your goal is long-term stability, predictable costs, and freeing engineers from the ongoing work of keeping up with anti-bot systems.

At its core, this is not a technical question but a strategic one. The defenses used by websites will continue to evolve. The real decision is whether your team wants to be in the business of keeping pace with those defenses, or whether you would rather rely on a service that does it for you.

Conclusion: The End of the Arms Race?

Bypassing modern anti-bot systems is not about finding a single trick or loophole. It requires a layered strategy that addresses every stage of detection. At the network level, your IP reputation must be managed with care. At the browser level, your fingerprint must look both realistic and consistent. At the interaction level, your behavior has to resemble the irregular patterns of human browsing. And when those checks are not enough, you must be prepared to solve active challenges like CAPTCHA or JavaScript puzzles.

Taken together, these defenses form a system designed to catch automation from multiple angles. To succeed, your scrapers need to look convincing in all of them at once. That is why the most resilient strategies focus on combining proxies, fingerprints, behavioral design, and rotation into one coherent approach rather than relying on isolated fixes.

There are two ways to get there. One approach is to build and maintain an in-house stack, thereby absorbing the costs and complexities associated with staying ahead of detection updates. The other option is to adopt a managed service that handles the unblocking for you, enabling your team to focus on extracting and utilizing the data. The right choice depends on scale, resources, and priorities.

What will not change is the direction of this contest. Websites will continue to develop more advanced defenses, and scrapers will continue to adapt. The arms race may never truly end, but access to web data will remain essential for research, business intelligence, and innovation. The organizations that thrive will be those that treat anti-bot systems not as an impenetrable wall, but as a challenge that can be met with the right mix of strategy, tools, and discipline.

The post The Ultimate Guide to Bypassing Anti-Bot Detection appeared first on ScraperAPI.

Playwright vs Puppeteer in 2025: Which Browser Automation Tool Is Right for You?
https://www.scraperapi.com/blog/playwright-vs-puppeteer/
Fri, 11 Jul 2025 01:16:58 +0000

The post Playwright vs Puppeteer in 2025: Which Browser Automation Tool Is Right for You? appeared first on ScraperAPI.

If you are working with headless browsers, you’ll likely face a key decision: Playwright or Puppeteer?

Both are great tools for scraping dynamic websites or automating browser tasks, and each comes with a solid reputation and a strong following.

They have, of course, their differences, too, both from a technical standpoint and in terms of ecosystem, support, and overall flexibility. In this short blog, we’ll compare these two popular libraries.

By the end, you’ll have a better understanding of Playwright and Puppeteer, their tradeoffs, and all the information you need to pick the best fit for your project.

What Are Playwright and Puppeteer? Key Features and Differences

Before we delve into the key differences between Playwright and Puppeteer, it is important to understand each one well.  

What is Playwright?

Back in 2020, the Microsoft team began to see the need for a single robust API to cross-test browsers. This led to the creation of Playwright.  

Unlike many existing libraries, Playwright acts as a unified tool bridging multiple platforms, browsers, and languages. For instance, Playwright supports Firefox, WebKit, and Chromium, the open-source engine behind Google Chrome.

It works on virtually any machine, and supports both headless and headful modes. Mobile-first developers have a soft spot for Playwright because it can emulate Android Chrome and Mobile Safari directly on your desktop. App developers can simulate and test how their applications perform across different mobile environments without needing physical devices.

When it comes to web scraping, Playwright comes with a number of built-in features, such as AutoWait, which is popular because it waits for elements to be ready before acting on them and helps you scrape web pages without setting off bot-detection systems. Playwright also shines at managing multiple tasks at the same time. For example, it can test a number of tabs and user scenarios in parallel with little effort.

What is Puppeteer?

Google created Puppeteer in 2017 as a JavaScript library for web testing and automation within its browser ecosystem. It was designed to meet the demand of developers building with Google products. 

Puppeteer does not have a native frontend, which means it runs completely headless. However, users can configure it to launch a visible browser.

Since its beginnings, Puppeteer has been popular among developers for testing Chrome extensions. Today, with most websites built using JavaScript (often with Next.js on the frontend and Node.js on the backend), many developers still prefer to test their applications using a JavaScript-based library like Puppeteer.

For end-to-end testers, Puppeteer gives you the flexibility to check everything from the user interface to keyboard inputs. This means you can: 

  • Make sure your web app performs well 
  • Test the overall user experience
  • Catch anything that might be broken
  • Spot security vulnerabilities

When it comes to scraping, this library is popular for the ability to crawl pages, extract data, and capture the results as screenshots or PDFs. 

Playwright vs Puppeteer Comparison

Feature | Playwright | Puppeteer
Browser Support | Chromium, Firefox, and WebKit | Chrome and Firefox
Cross-browser Support | Available | Unavailable
Language Support | JavaScript, Python, Java, TypeScript, .NET | JavaScript
Mobile Simulation Support | Available | Unavailable
Browser UI | Available | Unavailable
Creator | Microsoft | Google
Timeline trace debugging | Available | Available
Machine Support | Mac, Windows, Linux | Mac, Windows, Linux
Performance | Fast | Fast
Community Vibrance | Better | Good
Documentation | Good | Better

When to Choose Playwright vs. Puppeteer?

Now that we have taken a closer look at both Playwright and Puppeteer, let’s see when it’s best to use each, depending on your project and specific needs.

Playwright 

Here are some reasons you might want to stick with Playwright. 

Multi-language Support 

Playwright supports many languages, including JavaScript, TypeScript, Python, Java, and .NET. 

Puppeteer supports only JavaScript, whereas Playwright gives you the freedom to pick and build with the language you are most comfortable with.

Cross-browser Support

Playwright is the right choice if you want to test your application across multiple browsers. It supports many browsers, such as Firefox, Chrome, and WebKit. 

Mobile Simulation 

You may be trying to scrape, test, or build a mobile app. Playwright helps you simulate a realistic mobile environment directly from your desktop. Its precise rendering capabilities give you an accurate view of how your application will appear and behave on mobile devices, letting you do more informed development and testing without the need for physical hardware. 
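As a small illustration in Python, the sketch below opens a page in an emulated mobile context using one of Playwright's bundled device descriptors and saves a screenshot. It assumes Playwright and its browsers are installed (`pip install playwright`, then `playwright install`), and the function name and default device are choices made for this example.

```python
def mobile_screenshot(url, out_path, device_name="iPhone 13"):
    """Open a URL in an emulated mobile context and save a screenshot."""
    # Imported here so the module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        device = p.devices[device_name]  # viewport, user agent, touch support, and so on
        browser = p.webkit.launch()
        context = browser.new_context(**device)
        page = context.new_page()
        page.goto(url)
        page.screenshot(path=out_path)
        browser.close()
```

Swapping `device_name` for another descriptor, such as an Android phone, and `p.webkit` for `p.chromium` lets you compare how the same page renders across mobile environments.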

Puppeteer 

Here are some use cases when Puppeteer might be your best pick:

Testing Chrome Extensions

Puppeteer was built by the Chrome DevTools team at Google, so the tech stack similarities make it a great tool for Chrome extension testing. You are going to have an even better time if you are extensively using JavaScript.

JavaScript is Enough

Puppeteer supports only JavaScript. If your project already lives in the JavaScript ecosystem, that is no limitation, but projects relying on other languages might be slowed down.

Cross-browser Support is Not Important

If you are testing or scraping with only Chrome in mind, Puppeteer is a good option. Supporting Chrome is no issue at all for Puppeteer, but it might struggle with other browsers.

Playwright vs Puppeteer for Web Scraping: Which One Wins?

Primarily, these libraries are used for web automation and testing. However, many engineers might be more interested in the web scraping capabilities of Playwright and Puppeteer.

Here is what to keep in mind when choosing between the two for web scraping.

Bot Detection

Most detectors are trained to recognize bots by identifying agents that speedily access a web page and carry out actions even while it is still loading. 

Thanks to the AutoWait feature, Playwright ensures that elements fully load before any action is executed, making it easier to proceed undetected. 

While Puppeteer doesn’t offer an equivalent of AutoWait, it has similar features that support graceful loading. For example, you can get creative with page.setDefaultTimeout() and the waitUntil option of page.goto(), which let you control how long to wait for elements or navigation before timing out.
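Here is what relying on built-in waiting can look like with Playwright's Python API. This is a minimal sketch that assumes Playwright is installed, and the function name and 10-second timeout are choices made for the example.

```python
def first_heading(url, timeout_ms=10_000):
    """Return the text of the first <h1>, relying on Playwright's built-in waiting."""
    # Lazy import so the module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.set_default_timeout(timeout_ms)
        page.goto(url, wait_until="domcontentloaded")
        # No explicit wait call: the locator waits for the element, up to the timeout.
        text = page.locator("h1").first.inner_text()
        browser.close()
        return text
```

Note that there is no separate "wait for selector" step before reading the heading. In Puppeteer, the equivalent script would typically add that wait by hand.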

Dynamic Content Handling

Puppeteer was built to handle crawling and data extraction from Single Page Applications. 

However, it has a couple of downsides:

  • An acute focus on Chrome
  • No built-in support for handling dynamic content

If you want to scrape with Puppeteer, you’d have to digest the docs well so you can manually configure it to successfully scrape dynamic content. 

For example, you’ll need to write explicit waits such as page.waitForSelector(), among other things. Playwright, on the other hand, comes with automatic waiting and built-in retries, which help reduce bot detection and minimize errors.

Apart from that, Playwright is better suited for scraping modern websites, especially when it comes to handling iframes. It can reliably access and extract content loaded within them.

Scraping Pre-rendered Content in HTML

There are times you might need to pre-render a web page you want to scrape, probably to avoid API detection or to improve your scraping efficiency. 

If you do this often, you’ll need to check which library better supports your workflow.

Puppeteer has native support for fetching pre-rendered content, usually without requiring heavy configuration.

Playwright, on the other hand, doesn’t natively support pre-rendered content, so you’d need to write your own script to handle that.

Conclusion 

Playwright and Puppeteer are two good libraries you can use for your web testing or scraping. In this guide, we’ve examined the technical merits and downsides of each one. 

It’s important to emphasize that if your goal is web scraping, these libraries alone might not be enough. Modern websites use advanced bot detection and blocking techniques that go beyond what headless browsers can easily bypass.
That’s where tools like ScraperAPI come in. It can help you successfully scrape the web without the usual headaches. Sign up for the basic plan here and see how it works for yourself!

FAQs

Yes, thanks to the community-maintained stealth plugin (puppeteer-extra-plugin-stealth), Puppeteer is often better than Playwright for web scraping. That said, Playwright stands out with its cross-browser compatibility and support for multiple programming languages.

No, they are two different headless browser automation libraries, each with its own features and strengths.

No, Playwright is not a fork of Puppeteer. It was created by the same team that originally worked on Puppeteer at Google, but they moved to Microsoft and built Playwright from scratch.

Absolutely. It’s a powerful headless browser library that works well for web automation and scraping. For improved performance and fewer blocks, it’s even more effective when used alongside a tool like ScraperAPI.

Build a TikTok Brand-Influencer Scouting Tool Using ScraperAPI-LangChain Agent, Qwen3, and Streamlit
https://www.scraperapi.com/blog/build-a-tiktok-brand-influencer-scouting-tool/
Fri, 11 Jul 2025 01:10:24 +0000

The post Build a TikTok Brand-Influencer Scouting Tool Using ScraperAPI-LangChain Agent, Qwen3, and Streamlit appeared first on ScraperAPI.

Build a custom TikTok influencer scouting tool that lets you filter creators by country, follower count, and more while supporting follow-up queries, using the ScraperAPI LangChain agent for data extraction, Qwen3’s LLM for contextually relevant insights, and Streamlit for free app hosting.

Influencer marketing is rapidly eclipsing traditional ads, especially among younger audiences who value authenticity and relatability. Micro-influencers, with their tight-knit and highly engaged communities, inspire far more trust and loyalty than a generic banner ever could. 

Yet, follower counts and engagement metrics alone won’t guarantee success. What truly moves the needle is partnering with creators whose unique voice, values, and vision mirror your brand, who spark real action rather than empty clicks.

We are building a solution that can increase lead generation and targeted marketing thanks to its highly customizable approach, and help you find the right influencer to really boost your brand.

You will learn how to build a TikTok influencer scouting tool that utilizes the LangChain-ScraperAPI agent for scraping raw data and finding niche creators through natural-language queries. 

We will use Qwen3 as our large language model to deepen the tool’s contextual understanding and Streamlit to deploy and host the finished app for free in the cloud. 

Let’s get started. 

Understanding AI Agents in LangChain

Fundamentally, an AI agent is a program that combines a large language model (LLM) with tools and memory to perform tasks autonomously. Rather than responding to one-off prompts, an agent can:

  • Interpret and execute user intent by breaking down high-level queries into actionable steps.
  • Call external tools like APIs, web scrapers, databases, etc., to gather or process information.
  • Continue reasoning and iterating on results until it meets the user’s requirements.

What sets agents apart from standard LLM applications is the capacity to make informed decisions about what actions to take next based on intermediate results. Agents are not only reactive—they are active participants in solving a task.

For example, instead of answering “What’s the weather like in Paris?”, a LangChain agent can respond to a complex, multi-part query: 

“Plan a weekend getaway in Paris. I need weather forecasts, hotel prices under $200 per night, and suggestions for indoor activities if it rains.”

The agent breaks this down, uses tools like the ScraperAPI Google Search Tool and a general-purpose web scraper to gather each piece of information (weather data, hotel listings, and local attractions), and then combines everything into a complete response.

LangChain provides a flexible framework to assemble these components. You define a set of functions, APIs, or scrapers, wrap them with simple adapters, and then wire them into an agent that uses the LLM to decide when and how to call each resource. 
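To make that decide-act-observe loop concrete, here is a minimal sketch in plain Python with no LangChain involved. Everything in it is hypothetical: `scripted_policy` stands in for the LLM, and the two tools are stubs that return canned strings instead of calling a real API. It only illustrates the control flow an agent follows, including an iteration cap.

```python
from typing import Callable, Dict, List, Tuple

# Stub tools (hypothetical): a real agent would call a search API or a scraper here.
def fake_search(query: str) -> str:
    return f"search results for '{query}'"

def fake_scrape(url: str) -> str:
    return f"page content of {url}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": fake_search,
    "scrape": fake_scrape,
}

def scripted_policy(task: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: choose the next (action, argument) from what was seen so far."""
    if not observations:
        return "search", task
    if len(observations) == 1:
        return "scrape", "https://example.com/top-result"
    return "finish", " | ".join(observations)

def run_agent(task: str, max_iterations: int = 5) -> str:
    """Loop: ask the policy for an action, run the tool, record the observation."""
    observations: List[str] = []
    for _ in range(max_iterations):
        action, argument = scripted_policy(task, observations)
        if action == "finish":
            return argument
        observations.append(TOOLS[action](argument))
    # The cap prevents the loop from running forever if the policy never finishes.
    return "stopped: iteration limit reached"

print(run_agent("weather in Paris"))
```

In a real LangChain agent, the policy is the LLM reading the tool descriptions, and the tools are the ScraperAPI wrappers introduced in the next section.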

image of a robot explaining the uses of agents

How Does Autonomous Scraping with the LangChain–ScraperAPI Integration Work?

The LangChain-ScraperAPI integration is a Python package that allows AI agents to scrape the web using ScraperAPI. The package contains three different components, each corresponding to an official ScraperAPI endpoint:

  1. ScraperAPITool: Allows the AI agent to scrape any website and retrieve data.
  2. ScraperAPIGoogleSearchTool: Enables the agent to scrape Google Search results and rankings.
  3. ScraperAPIAmazonSearchTool: Enables the agent to scrape Amazon search results and rankings.

All you need to do to use this package in Python is to install it with pip, then import the components:

pip install -U langchain-scraperapi
from langchain_scraperapi.tools import (
   ScraperAPITool,
   ScraperAPIGoogleSearchTool,
   ScraperAPIAmazonSearchTool
)

If you don’t have it already, create a ScraperAPI account and get your API key, then set it as an environment variable. In your terminal, run:

export SCRAPERAPI_API_KEY="your API key"

Once the tools are installed, you can create an instance of any of them and provide parameters such as the URL to scrape, the output format you want, and any additional options you need. Here’s an example:

from langchain_scraperapi.tools import ScraperAPITool
tool = ScraperAPITool()
print(tool.invoke(input={"url": "walmart.com", "output_format": "markdown"}))

The code above initializes one of the package’s components, ScraperAPITool, assigns it to a variable, and then uses the invoke method to scrape “walmart.com”, requesting the output in markdown format. The scraped content is then printed.

The great thing about agents is that you can instruct them in natural language to do complex tasks. For instance, we can give the ScraperAPI-LangChain agent a query to search and return results and even images of teddy bears for sale on Amazon, and it will do just that. Below is a sample of the code:

from langchain_scraperapi.tools import ScraperAPIAmazonSearchTool
tool = ScraperAPIAmazonSearchTool()
print(tool.invoke(input={"query": "show me pink teddy bears for sale on Amazon"}))

Using the regular ScraperAPI Amazon endpoint will return the same results, but you’d have to find and input an actual Amazon URL with pink teddy bears on display and then scrape that page. The ScraperAPI-LangChain agent makes it easier to retrieve complex data with minimal coding and resources.
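For comparison, the manual route means building the search URL yourself before you can scrape it. The sketch below shows that step using only the standard library. The `amazon_search_url` helper is hypothetical, and the `/s?k=` pattern reflects how Amazon search URLs commonly look, so treat it as an assumption rather than a documented contract.

```python
from urllib.parse import urlencode

def amazon_search_url(keywords: str, domain: str = "www.amazon.com") -> str:
    """Build a search-results URL for the given keywords (URL pattern is an assumption)."""
    return f"https://{domain}/s?{urlencode({'k': keywords})}"

print(amazon_search_url("pink teddy bears"))
print(amazon_search_url("pink teddy bears", domain="www.amazon.co.uk"))
```

You would then pass the resulting URL to a scraper, whereas the Amazon search tool accepts the query text directly.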

How to Obtain Qwen3 from OpenRouter

As we’re making use of a large language model from OpenRouter, we’ll need to set up an account and get our API key before we can start making requests.

What sets Qwen models apart is their efficiency and scalability, particularly when it comes to those built on the Mixture of Experts (MoE) architecture. Unlike traditional large language models where all parameters are activated for every query, MoE models contain multiple ‘expert’ sub-networks. 

This means that, as they process information, MoE models activate only a small subset of specialized sub-networks (“experts”) based on learned routing decisions, allowing them to interpret, understand, and respond to a query without engaging the full model. This selective activation enables MoE models to maintain high performance while significantly reducing computational overhead and costs.
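The routing idea can be illustrated with a toy example. The sketch below is not Qwen3's implementation: the router scores are hand-written numbers (in a real model a learned gating network produces them for every token), the "experts" are simple functions on a single number rather than neural sub-networks, and all names are hypothetical. It only shows the mechanism: score the experts, keep the top k, and run just those.

```python
import math
from typing import Callable, Dict, List

def softmax(scores: List[float]) -> List[float]:
    """Turn raw router scores into probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores: List[float], k: int = 2) -> Dict[int, float]:
    """Select the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

def moe_layer(x: float, router_scores: List[float],
              experts: List[Callable[[float], float]], k: int = 2) -> float:
    """Weighted sum over the selected experts only; unselected experts never run."""
    weights = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Four toy experts; only two of them are evaluated for this input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2, lambda x: -x]
scores = [0.1, 2.0, 1.5, -1.0]  # hand-written stand-in for a learned router

print(route_top_k(scores))
print(moe_layer(3.0, scores, experts))
```

With k=2 and four experts, half of the expert computation is skipped for this input, which is the source of the savings described above.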

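As a result, a Qwen3 MoE model can deliver contextual, relevant responses at a fraction of the compute its total parameter count would suggest. The model we use later, Qwen3-30B-A3B, has roughly 30 billion parameters in total but activates only about 3 billion of them per token.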

Here’s a guide on how to access a model from OpenRouter:

  1. Go to OpenRouter and sign up for a free account:
Screenshot of openrouter
  2. After verifying your email, log in and search for Qwen3 models (or any other LLM of your choice) in the search bar:
Screenshot of openrouter
  3. Go to the Qwen3 model of your choice:
Screenshot of openrouter
  4. Click on “API” to create a personal API access key for your model.
Screenshot of openrouter
  5. Select “Create API Key” and then copy and save your newly created API key.
Screenshot of openrouter
  6. Do not share your API key publicly.

Getting Started with ScraperAPI

  1. To begin, go to ScraperAPI’s dashboard. If you don’t have an account yet, click on “Start Trial” to create one:
Screenshot of scraperapi
  2. After creating your account, you’ll have access to a dashboard providing you with an API key, access to 5000 API credits (7-day limited trial period), and information on how to get started scraping.
Screenshot of scraperapi
  3. To access more credits and advanced features, scroll down and click “Upgrade to Larger Plan.”
Screenshot of scraperapi
  4. ScraperAPI provides documentation for various programming languages and frameworks (such as PHP, Java, and Node.js) that interact with its endpoints. You can find these resources by scrolling down on the dashboard page and clicking “View All Docs”:
Screenshot of scraperapi

Now we’re all set, let’s start building our tool.

Building the TikTok Brand-Influencer Scouting Tool

Step 1: Setting Up the Project

Create a new project folder, a virtual environment, and install the necessary dependencies.

mkdir tiktok_influencer_project  # Creates the project folder
cd tiktok_influencer_project # Moves you inside the project folder
python -m venv your-env-name  # Creates a new environment

Activate the environment:

  • Windows:
your-env-name\Scripts\activate
  • macOS/Linux:
source your-env-name/bin/activate

And now you can install the dependencies we’ll need:

pip install streamlit tiktoken langchain langchain-openai langchain-scraperapi

The key dependencies and their functions are:

  • streamlit: We need this to build the app’s user interface, so users can directly input their niche and other filters, while seeing results in real-time.
  • tiktoken: This library is from OpenAI and is used for tokenizing text and estimating token counts. In our project, we use it to estimate the length of the queries sent to the language model, so we don’t exceed API limits.
  • langchain: Provides the agent machinery we import later (initialize_agent, Tool, and AgentType). These are legacy agent APIs, so if a newer release no longer exposes them, you may need to pin an older version.
  • langchain-openai: This is a separate package that provides the integration with OpenAI-compatible Large Language Models (LLMs). We use it to connect to Qwen via OpenRouter, so our application can send prompts and receive AI-generated responses.
  • langchain-scraperapi: This is the package that integrates ScraperAPI with LangChain in the form of tools an agent can use to perform web scraping and Google searches autonomously.

Step 2: Integrating the LangChain-ScraperAPI Package

Remember at the beginning when we set our ScraperAPI key as an environment variable and installed the dependencies we needed? If you are using the same environment, you’re good to go. However, if you are working in a new one, you won’t have the packages you need yet. Install LangChain-ScraperAPI:

pip install -U langchain-scraperapi

Previously, we exported our ScraperAPI key as an environment variable. This time, we also need our OpenRouter API key. We could export both, but an exported variable only lasts for the current terminal session, so it has to be set again every time you open a new shell. To keep both keys stored in one place and ready to load at any moment, we’ll put them in a .env file and read it with python-dotenv.

pip install python-dotenv

Create a new .env file and add your API keys:

SCRAPERAPI_API_KEY="your-scraperapi-key"
OPENROUTER_API_KEY="your-openrouter-key"

Step 3: Importing Libraries and Setting Up API Keys

The next step is importing all the necessary libraries and securely loading the API keys required to interact with external services like ScraperAPI and OpenRouter (for the LLM).

import os
import streamlit as st
import tiktoken

from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.agents.agent_types import AgentType
from langchain_scraperapi.tools import (
    ScraperAPITool,
    ScraperAPIGoogleSearchTool,
    ScraperAPIAmazonSearchTool
)
from dotenv import load_dotenv
load_dotenv()

# Loading API Keys

scraperapi_key = os.environ.get("SCRAPERAPI_API_KEY")
openrouter_api_key = os.environ.get("OPENROUTER_API_KEY")

# API key checks as a safety net and for easier debugging
if not scraperapi_key:
    st.error("SCRAPERAPI_API_KEY not found. Please set it in your .env file.")
    st.stop()

if not openrouter_api_key:
    st.error("OPENROUTER_API_KEY not found. Please set it in your .env file.")
    st.stop()

The code above achieves the following:

Imports:

  • os: Used to interact with the operating system, specifically for reading environment variables.
  • streamlit as st: The core library for building the web app’s user interface.
  • tiktoken: For estimating the number of tokens within prompts sent to the LLM.
  • langchain_openai.ChatOpenAI: Imports the class to interact with OpenAI-compatible chat models (like the Qwen model via OpenRouter in this case).
  • langchain.agents.initialize_agent, Tool: Key components from LangChain to create and manage the AI agent and the tools it can use.
  • langchain.agents.agent_types.AgentType: Specifies different types of LangChain agents.
  • langchain_scraperapi.tools: Imports specific tools designed to work with ScraperAPI for web scraping and search.

API Keys Setup:

  • load_dotenv(): Loads the keys from the .env file into the process environment.
  • scraperapi_key = os.environ.get("SCRAPERAPI_API_KEY"): Retrieves the value of the SCRAPERAPI_API_KEY environment variable.
  • openrouter_api_key = os.environ.get("OPENROUTER_API_KEY"): Retrieves the value of the OPENROUTER_API_KEY environment variable.

API Key Checks:

  • The if not scraperapi_key: and if not openrouter_api_key: blocks provide basic validation. If either key is missing or empty, the app displays an error and stops execution (st.stop()) to prevent failures further down the line.
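If you also want to catch keys that were never replaced after copying a template, the validation can be factored into a small pure function. This is a sketch under assumptions: `find_bad_keys` and the `PLACEHOLDER_PREFIXES` tuple are hypothetical names, and the prefixes are just guesses at what an unedited placeholder might start with (for example the "your-scraperapi-key" value shown in the .env example).

```python
from typing import List, Mapping

# Assumed prefixes that suggest a key was never filled in (adjust to your own template).
PLACEHOLDER_PREFIXES = ("your", "placeholder", "changeme")

def find_bad_keys(env: Mapping[str, str], required: List[str]) -> List[str]:
    """Return the names of required keys that are missing, empty, or look like placeholders."""
    bad = []
    for name in required:
        value = (env.get(name) or "").strip()
        if not value or value.lower().startswith(PLACEHOLDER_PREFIXES):
            bad.append(name)
    return bad

# Example with a plain dict standing in for os.environ.
sample_env = {"SCRAPERAPI_API_KEY": "abc123", "OPENROUTER_API_KEY": "your-openrouter-key"}
print(find_bad_keys(sample_env, ["SCRAPERAPI_API_KEY", "OPENROUTER_API_KEY"]))
```

In the app, you could call it with os.environ and the two key names, then show the returned names with st.error and call st.stop() if the list is not empty.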

Step 4: Building the Streamlit UI Layout

Here we will set up the basic layout and texts for the Streamlit web UI.

# Streamlit UI Setup 
st.set_page_config(page_title="TikTok Influencer Finder", layout="centered")
st.title("TikTok Influencer Finder 🧑🏼‍🤝‍🧑🏿🌐")
st.markdown("""
Welcome! This bot uses ScraperAPI's Langchain AI Agent for web scraping and a **Qwen LLM (via OpenRouter)**
to help you discover TikTok influencers who might be a great fit to promote your brand.
Please provide your brand's niche (e.g., 'sustainable running shoes', 'female luxury bags', 'men's watches').
""")

Here’s what the code above achieves:

  • st.set_page_config(...): Configures the Streamlit page, setting the browser tab title to “TikTok Influencer Finder” and the layout to “centered.”
  • st.title(...): Displays the main title of the application on the web page.
  • st.markdown(...): Renders a block of Markdown text, serving as a welcome message and a brief explanation of the tool’s purpose and how it works.

Step 5: Initializing LangChain Tools 

Now we’ll prepare the tools that the LangChain agent will use to interact with the external web (specifically, to perform web searches and scrape content) using ScraperAPI.

# Initializing Tools
try:
    scraper_tool = ScraperAPITool(scraperapi_api_key=scraperapi_key)
    google_search_tool = ScraperAPIGoogleSearchTool(scraperapi_api_key=scraperapi_key)
except Exception as e:
    st.error(f"Error initializing ScraperAPI tools: {e}.")
    st.stop()

tools = [
    Tool(
        name="Google Search",
        func=google_search_tool.run,
        description="Useful for finding general information online, including articles, blogs, and lists of TikTok influencers."
    ),
    Tool(
        name="General Web Scraper",
        func=scraper_tool.run,
        description="Useful for scraping content from specific URLs after search."
    )
]

Below is a further breakdown of what the code above does:

Tool Initialization:

  • scraper_tool = ScraperAPITool(...): Creates an instance of a general web scraping tool provided by langchain-scraperapi, authenticated with your scraperapi_key. This tool can scrape content from any given URL.
  • google_search_tool = ScraperAPIGoogleSearchTool(...): Creates an instance of a Google search tool, also powered by ScraperAPI. This tool allows the agent to perform Google searches.
  • The try-except block handles potential errors during the initialization of these tools, displaying an error message in Streamlit and stopping the app if something goes wrong.

Tools List for LangChain Agent:

  • tools = [...]: Defines a list of Tool objects. Each Tool is a wrapper that makes an external function available to the LangChain agent.
  • “Google Search” Tool: Named “Google Search,” its function (func) is set to google_search_tool.run, meaning when the agent “uses” this tool, it will execute a Google search. The description tells the LLM what this tool is useful for.
  • “General Web Scraper” Tool: Named “General Web Scraper,” its function is scraper_tool.run. Its description indicates it’s for scraping specific URLs, typically after a search.

Step 6: Initializing the Large Language Model (LLM)

It is now time to initialize the Large Language Model (LLM) that will serve as the “brain” of the agent, enabling it to understand prompts and decide on actions.

QWEN_MODEL_NAME = "qwen/qwen3-30b-a3b:free"

llm = None
try:
    llm = ChatOpenAI(
        model_name=QWEN_MODEL_NAME,
        temperature=0.1,
        openai_api_key=openrouter_api_key,
        base_url="https://openrouter.ai/api/v1"
    )
    st.success(f"Successfully initialized Qwen model: {QWEN_MODEL_NAME}")
except Exception as e:
    st.error(f"Error initializing Qwen LLM: {e}")
    st.stop()

Here is what we can understand from the code above:

  • QWEN_MODEL_NAME: Defines the specific Qwen model we are using from OpenRouter.
  • llm = ChatOpenAI(...): Initializes the ChatOpenAI object.
  • model_name=QWEN_MODEL_NAME: Specifies which LLM to use.
  • temperature=0.1: Controls the creativity of the LLM’s responses. A lower value (like 0.1) makes the output more deterministic and focused.
  • openai_api_key=openrouter_api_key: Provides the API key for authentication with OpenRouter.
  • base_url="https://openrouter.ai/api/v1": Specifies the API endpoint for OpenRouter, as OpenRouter provides an OpenAI-compatible API.
  • The try-except block catches any errors during the LLM initialization, displays them in Streamlit, and stops the application if the LLM cannot be set up.

Step 7: Initializing the LangChain Agent

This crucial step brings together the LLM and the tools to create an intelligent agent capable of reasoning and taking action based on user requests.

# Initializing Agent
agent = None
if llm is not None:
    try:
        agent = initialize_agent(
            tools=tools,
            llm=llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True,
            handle_parsing_errors=True,
            max_iterations=15
        )
    except Exception as e:
        st.error(f"Error initializing LangChain agent: {e}")
        st.stop()
else:
    st.error("LLM not initialized. Agent setup failed.")
    st.stop()

The code achieves the following:

  • if llm is not None: Ensures that the LLM initialized successfully before attempting to create the agent.
  • agent = initialize_agent(...): This is the core LangChain function to set up an agent.
  • tools=tools: Provides the list of Tool objects (Google Search and General Web Scraper) that the agent can utilize.
  • llm=llm: Connects the initialized LLM to the agent, giving it its reasoning capabilities.
  • agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION: Specifies the type of agent. This agent type follows the ReAct pattern: at each step the LLM reasons about the task, picks a tool based only on the tool descriptions (with no worked examples, hence “zero-shot”), observes the result, and repeats until it has an answer.
  • verbose=True: When True, the agent’s internal thought process and tool usage will be printed to the console, which is very helpful for debugging.
  • handle_parsing_errors=True: Allows the agent to attempt to recover from parsing errors in its internal reasoning.
  • max_iterations=15: Sets a limit on how many steps (tool uses, thoughts) the agent can take before giving up, preventing infinite loops.
  • The try-except block handles errors during agent initialization, displaying them and stopping the app if the agent cannot be set up.

Step 8: Building the User Input Interface

Here we will define the interactive elements in the app’s UI where the user can provide details for their search.

# Inputting UI elements
user_niche = st.text_input(
    "Enter your brand's niche:",
    key="brand_niche_input",
    placeholder="Type niche here..."
)

# --- Additional Filters ---
st.subheader("Optional Filters")
country_filter = st.text_input(
    "Filter by Country (optional):",
    key="country_filter",
    placeholder="e.g., United States, UK, China"
)

min_followers = st.number_input(
    "Minimum Follower Count (e.g., 500000 for 500K)",
    min_value=0,
    value=0,
    step=10000,
    key="min_followers"
)

Below is a summary of what the code above achieves:

  • st.text_input(...): Creates a text input field for the user to enter their brand’s niche.
  • st.subheader("Optional Filters"): Displays a smaller heading for the optional filters section.
  • country_filter = st.text_input(...): Creates another text input for an optional country filter.
  • min_followers = st.number_input(...): Creates a numerical input field for the minimum follower count.

Step 9: Token Estimation Function

The function below helps to manage the length of prompts sent to the LLM, which usually have token limits.

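# Token Estimation function
def estimate_tokens(text):
    # The GPT-4 tokenizer only approximates Qwen's, so treat the result as an estimate.
    try:
        encoding = tiktoken.encoding_for_model("gpt-4")
    except KeyError:
        # Raised when tiktoken does not recognize the model name.
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))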

Here’s what the code achieves:

  • Function Definition: Defines estimate_tokens(text), which takes a string text as input.
  • Tokenization: It attempts to get the token encoder for the “gpt-4” model. If that lookup fails, it falls back to a common base encoding (cl100k_base). encoding.encode(text) converts the input text into a list of token integers, while len(...) returns the count of these tokens. Because Qwen uses its own tokenizer, the result is an estimate rather than an exact count.
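Since the count is only an estimate, it can help to leave some headroom and trim long text before sending it. The sketch below is a dependency-free illustration of that idea, not part of the original app. It assumes roughly four characters per token, a common rule of thumb for English text that can be far off for other languages or for code, and both function names are hypothetical.

```python
def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token count based on character length (assumed ratio, not a real tokenizer)."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))

def trim_to_budget(text: str, max_tokens: int, chars_per_token: float = 4.0) -> str:
    """Cut text from the end so its rough estimate fits within max_tokens."""
    if rough_token_estimate(text, chars_per_token) <= max_tokens:
        return text
    keep_chars = int(max_tokens * chars_per_token)
    return text[:keep_chars]

long_context = "influencer data " * 50
trimmed = trim_to_budget(long_context, max_tokens=100)
print(rough_token_estimate(long_context), rough_token_estimate(trimmed))
```

A helper like this could be applied to the stored influencer data before building a follow-up prompt, instead of rejecting the request outright when it is too long.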

Step 10: Main Search Logic (Finding Influencers)

This is the core of the application. It runs when the user clicks the “Find TikTok Influencers” button: the code builds the prompt, runs the agent, and displays the results.

# Main Search Logic
if st.button("Find TikTok Influencers ✨"):
    if not user_niche:
        st.warning("Please enter your niche.")
    elif agent is None:
        st.error("Agent not initialized.")
    else:
        query = f"""
        Find a list of at least 5 TikTok influencers highly relevant to the niche: '{user_niche}'.
        Apply these filters:
        - Country: {country_filter or 'Any'}
        - Minimum Follower Count: {min_followers}
        For each influencer, provide:
        1. TikTok Username
        2. Full Name (if known)
        3. Approximate Follower Count
        4. Niche
        5. TikTok profile or verified link
        Format as Markdown list.
        """

        token_count = estimate_tokens(query)
        if token_count > 20000:
            st.error(f"Query too long ({token_count} tokens). Try reducing text.")
            st.stop()

        st.info("🚀 Searching influencers...")
        with st.spinner("Running agent..."):
            try:
                response = agent.run(query)
                st.session_state["last_influencer_data"] = response

                st.subheader("💡 Influencers Found:")
                st.markdown(response)
            except Exception as e:
                st.error(f"Agent failed: {e}")

Here’s further information on precisely how the code works:

if st.button("Find TikTok Influencers ✨"):: This block executes when a user clicks the button.

Input Validation:

  • if not user_niche:: Checks if the niche input is empty and displays a warning.
  • elif agent is None:: Checks if the LangChain agent was successfully initialized earlier and displays an error if not.

Query Construction:

  • To receive our results in a tidy and presentable format, we write the query template by hand. This query tells the agent what to find (TikTok influencers), what criteria to use (niche, country, minimum followers), what information to extract for each, and the desired output format (Markdown list).
  • {country_filter or 'Any'}: Is a neat Python trick that uses country_filter if it has a value, otherwise defaults to the string ‘Any’.

Token Count Check:

  • token_count = estimate_tokens(query): Calls the previously defined function to get an estimate of the query’s token length.
  • if token_count > 20000:: Prevents sending overly long queries to the LLM, which could exceed API limits.

Running the Agent:

  • st.info("🚀 Searching influencers..."): Displays an informational message to the user.
  • with st.spinner("Running agent..."):: Shows a spinning animation in the UI, indicating that the application is running.
  • response = agent.run(query): This is where the magic happens. The LangChain agent takes the query, uses its LLM to reason about the task, and decides which of its tools (Google Search, Web Scraper) to use, potentially in multiple steps, to fulfill the request. The final answer from the agent is stored in response.
  • st.session_state["last_influencer_data"] = response: Stores the agent’s response in Streamlit’s session state, making the data persistently available across reruns of the script within the same user session, which is crucial for the follow-up Q&A.

Displaying Results:

  • st.subheader("💡 Influencers Found:"): Displays a subheader.
  • st.markdown(response): Renders the agent’s response (which is formatted as Markdown) directly into the Streamlit UI.

Error Handling: The try-except block catches any exceptions that occur during the agent’s execution and displays an error message.
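The query construction described above can also be pulled out of the button handler into a plain function, which makes it easy to test without starting Streamlit. This is a sketch, not the original app's code: `build_influencer_query` is a hypothetical name, and formatting the follower count with thousands separators is an added choice.

```python
def build_influencer_query(niche: str, country: str = "", min_followers: int = 0) -> str:
    """Assemble the agent prompt from the user's niche and optional filters."""
    niche = niche.strip()
    if not niche:
        raise ValueError("niche must not be empty")
    lines = [
        f"Find a list of at least 5 TikTok influencers highly relevant to the niche: '{niche}'.",
        "Apply these filters:",
        f"- Country: {country.strip() or 'Any'}",  # empty country falls back to 'Any'
        f"- Minimum Follower Count: {min_followers:,}",
        "For each influencer, provide:",
        "1. TikTok Username",
        "2. Full Name (if known)",
        "3. Approximate Follower Count",
        "4. Niche",
        "5. TikTok profile or verified link",
        "Format as Markdown list.",
    ]
    return "\n".join(lines)

print(build_influencer_query("sustainable running shoes", country="UK", min_followers=500000))
```

Inside the app, the handler would then call build_influencer_query(user_niche, country_filter, min_followers) and pass the result to the agent.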

Step 11: Follow-up Q&A Logic

To let users ask further questions about the influencers they find, we will add follow-up logic that passes the previously obtained data to the LLM as context.

# Follow-up Q&A code
st.markdown("---")
st.subheader("Ask a follow-up question about the influencers ✍️")
follow_up_question = st.text_input("Your question:", key="followup_question")

if follow_up_question and "last_influencer_data" in st.session_state:
    context = st.session_state["last_influencer_data"]
    qna_prompt = f"""
    Based on the following influencer data:
    {context}

    Answer the following question:
    {follow_up_question}
    """
    token_count = estimate_tokens(qna_prompt)
    if token_count > 20000:
        st.error(f"Follow-up too long ({token_count} tokens). Try shortening your question or data.")
    else:
        try:
            st.info("🧠 Thinking...")
            follow_up_response = llm.invoke(qna_prompt)
            st.markdown(follow_up_response.content)
        except Exception as e:
            st.error(f"LLM follow-up failed: {e}")

The code achieves the following:

  • st.markdown("---"): Adds a horizontal rule for visual separation.
  • st.subheader(...) and st.text_input(...): Create a section for the user to input a follow-up question.
  • context = st.session_state["last_influencer_data"]: Retrieves the previously found influencer data to provide context for the LLM.
  • qna_prompt = f"""...""": Constructs a new prompt for the LLM. This prompt includes the context (the influencer data) and the follow_up_question, instructing the LLM to answer based on that information.
  • Token Count Check: Similar to the main search, it checks the token length of the follow-up prompt to prevent errors.
  • Invoking the LLM Directly: follow_up_response = llm.invoke(qna_prompt) sends the prompt straight to the LLM. Unlike agent.run(), llm.invoke() does not involve the agent’s tool-use reasoning. The LLM processes the prompt (context + question) and generates an answer. follow_up_response.content extracts the actual text of the response, and st.markdown(follow_up_response.content) displays it in Markdown format.
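The difference in return types is easy to trip over: the agent call yields a plain string, while a direct chat-model call yields a message object whose text lives in its content attribute. A small helper can normalize both before display. The sketch below uses a hypothetical `response_text` function and a `FakeMessage` class that merely imitates a message object, so it runs without any LLM.

```python
from typing import Any

def response_text(response: Any) -> str:
    """Return displayable text from either a plain string or a message-like object."""
    if isinstance(response, str):
        return response
    content = getattr(response, "content", None)
    if isinstance(content, str):
        return content
    # Last resort: fall back to the object's string representation.
    return str(response)

class FakeMessage:
    """Stand-in for a chat message object; only the 'content' attribute matters here."""
    def __init__(self, content: str) -> None:
        self.content = content

print(response_text("plain agent output"))
print(response_text(FakeMessage("answer from the LLM")))
```

With a helper like this, the same display code can handle both the agent's result and the follow-up answer.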

Step 12: Footer

Why not add a simple footer to give credit to the technologies we used? This is good practice, especially if you’re building this project for your personal portfolio: recruiters can see at a glance which tools you used to develop the app.

# --- Footer ---
st.markdown("---")
st.markdown("<p style='text-align: center; color: grey;'>Powered by ScraperAPI, Langchain and OpenRouter (Qwen)</p>", unsafe_allow_html=True)

Here’s the explanation for the code above:

  • st.markdown("---"): Adds another horizontal rule.
  • st.markdown("<p style='text-align: center; color: grey;'>...</p>", unsafe_allow_html=True): Displays a small, centered, grey-colored text at the bottom of the page, crediting the technologies used. 
  • unsafe_allow_html=True is necessary because you’re embedding raw HTML (<p>) within the Markdown.

Here’s a snippet of what the tool’s UI looks like:

snippet of what the tool’s UI looks like

Step 13: Run your script

Now that all the steps are in place, you can run your code with Streamlit by doing:

streamlit run your_script_name.py

Deploying the TikTok Brand-Influencer Scouting Tool Using Streamlit 

Here’s how to deploy our TikTok Brand-Influencer Scouting app on Streamlit for free in just a few steps:

Step 1: Set Up a GitHub Repository

Streamlit requires your project to be hosted on GitHub.

1. Create a New Repository on GitHub

Create a new repository on GitHub and set it as public.

2. Push Your Code to GitHub

Before doing anything else, create a .gitignore file to avoid accidentally uploading sensitive files like .env. Add the following to it:

.env
__pycache__/
*.pyc
*.pyo
*.pyd
.env.*
.streamlit/secrets.toml

If you haven’t already set up Git and linked your repository, use the following commands in your terminal from within your project folder:

git init
git add .
git commit -m "Initial commit"
git branch -M main
# Add the remote using ONE of the two options below
# With HTTPS
git remote add origin https://github.com/YOUR_USERNAME/your-repo.git
# Or with SSH
# git remote add origin git@github.com:YOUR_USERNAME/your-repo.git

git push -u origin main

If it’s your first time using GitHub from this machine, you might need to set up an SSH connection. Here is how.

Step 2: Define Dependencies and Protect Your Secrets!

Streamlit needs to know what dependencies your app requires. 

1. In your project folder, automatically create a requirements file by running:

pip freeze > requirements.txt

2. Commit it to GitHub:

git add requirements.txt
git commit -m "Added dependencies"
git push origin main

3. Do the same for your app file containing all your code:

git add your-script.py 
git commit -m "Added app script" 
git push origin main

Step 3: Deploy on Streamlit Cloud

1. Go to Streamlit Community Cloud.

2. Click “Sign in with GitHub” and authorize Streamlit.

3. Click “Create App.” 

4. Select “Deploy a public app from GitHub repo.”

5. In the repository settings, enter:

  • Repository: YOUR_USERNAME/TikTok-Influencer-Finder
  • Branch: main
  • Main file path: app.py (or whatever your Streamlit script is named)

6. Click “Deploy” and wait for Streamlit to build the app.

7. Go to your deployed app dashboard, find your app, and open “Secrets” under “Settings”. Add your environment variables (your API keys) just as you have them locally in your .env file.

Step 4: Get Your Streamlit App URL

After deployment, Streamlit will generate a public URL (e.g., https://your-app-name.streamlit.app). You can now share this link to allow others access to your tool!

Conclusion

In this tutorial, you have learned how to build a TikTok Brand-Influencer Scouting Tool that utilizes the ScraperAPI-LangChain agent for smart, autonomous data extraction, Qwen3 for contextual insights and follow-up queries, and Streamlit for building a user-friendly interface and hosting the app in the cloud for free.

This tool aids influencer marketing, enabling brands to identify creators whose niche, voice and vision align perfectly with their own. By including filtering options and follow-up questions, it moves beyond just basic metrics to find influencers who can truly spark authentic engagement and drive targeted marketing efforts.

Ready to build your own? 

Start using ScraperAPI today and transform your influencer scouting process into a streamlined, highly effective strategy!

]]>
How to Scrape Geo-Restricted Data Without Getting Banned https://www.scraperapi.com/blog/how-to-scrape-geo-restricted-data/ Sat, 28 Jun 2025 09:33:57 +0000 https://www.scraperapi.com/?p=7966 While the internet is often considered free and open to all, some geographical restrictions are still placed on some websites. Sometimes, the changes are subtle, such as automatically switching languages. In other cases, entirely different content is served (e.g., Netflix) for people from different countries or regions. At the extremes, some websites are entirely inaccessible […]

The post How to Scrape Geo-Restricted Data Without Getting Banned appeared first on ScraperAPI.

]]>

While the internet is often considered free and open to all, some geographical restrictions are still placed on some websites.

Sometimes, the changes are subtle, such as automatically switching languages. In other cases, entirely different content is served (e.g., Netflix) for people from different countries or regions.

At the extremes, some websites are entirely inaccessible unless your IP address is from a specified country. While all of these restrictions do serve a proper purpose, they also make web scraping significantly more difficult.

There are several ways to bypass geographical restrictions while scraping, such as using proxies within your own scraper or using pre-built solutions that take care of the hassle.

Using Proxies to Bypass Geo-restrictions

Residential proxies are the way most scrapers bypass geo-restrictions, since you get an IP address from a device that’s physically located in a country of your choice. When a proxy relays requests from your machine to a website, the website will think that the true source of the request is within that country.

While there are numerous other proxy types, residential proxies are generally regarded as your best bet for most web scraping tasks, especially those that involve geographical restrictions. Datacenter proxies, while fast and cheap, have a limited range of locations and are more easily detected.

ISP proxies would work perfectly fine as they are as legitimate as residential proxies and as fast as datacenter proxies, but they’re one of the most expensive options available. Additionally, the pool of IPs is usually quite limited.

While purchasing proxies directly from a provider and integrating them into your scraper is definitely efficient, it has one caveat: you still need to build the scraper itself. Extensive programming knowledge is required for any scraping project that has a decent scope.

Constant updates to the scraping solution will also be required as minor changes in layouts, website code, or anything in between will cause it to either break completely or return improper results.

Then there’s the headache of data parsing and storage, both of which are complicated topics on their own.

So, while buying proxies from a provider can be a good solution for some, it’s usually reserved for those who can build a scraper on their own.
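For readers who do take the do-it-yourself route, here is a minimal sketch of how a purchased proxy is usually wired into the requests library. The gateway host, port, and credentials are placeholders rather than any real provider's values; the only library feature assumed is the `proxies` argument that requests accepts.

```python
from urllib.parse import quote


def build_proxies(host, port, username, password):
    """Return a proxies mapping in the shape the requests library expects."""
    # Percent-encode credentials so characters like '@' or ':' don't break the URL.
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    # The same proxy endpoint is used for both plain and TLS traffic.
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical gateway and credentials, for illustration only.
proxies = build_proxies("proxy.example.com", 8080, "user", "p@ss")
print(proxies["https"])

# With requests installed, the mapping is passed straight to the call:
#   requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
```

Everything else (retries, parsing, storage) still has to be written around that call, which is the caveat described above.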

Using ScraperAPI

ScraperAPI manages the entire scraping pipeline, from proxies to data delivery, for its users. There’s no need to build something from the ground up: you can start scraping as soon as you get a plan and write some basic code.

We’ll be using Python to send requests to the ScraperAPI endpoint to retrieve data from websites.

Preparation

First, you’ll need an IDE such as PyCharm or Visual Studio Code to run your code. Then you should register for an account with ScraperAPI.

Note: ScraperAPI’s free trial is enough to test out geotargeting as it provides access to all of the premium features. Once that expires, however, you’ll need one of the paid plans unless US/EU geotargeting is enough for your use case.

Once you have everything set up, we’ll be using the requests library to send HTTP requests to the ScraperAPI endpoint. Since requests is a third-party library, we’ll need to install it first:

pip install requests

That’s the only library you’ll need since ScraperAPI does all the heavy lifting for you. All you need to do is begin writing code and getting the URLs you want to scrape.

Sending a simple request

It’s often best to start simple and increase code complexity as you go. We’ll start by sending a GET request to a website that blocks EU users, to see what happens when we use neither residential proxies nor ScraperAPI:

import requests
resp = requests.get('https://www.chicagotribune.com/')
print(resp.text)

We’ll be using the requests library throughout, so we have to import it. Sending a GET request is simple: call requests.get() and pass in the URL as a string (in single or double quotes).

Then we print the text attribute of the resp object, which holds the response body as a string.

You’ll receive an error message, because The Chicago Tribune returns the same error to every request that comes from an EU IP address:

Screenshot of an error message

If you were to use a US IP address with an EU-locked website, you’d get a similar response. They all differ slightly; however, the end result is the same.

import requests
resp = requests.get('https://www.rte.ie/player/')
print(resp.text)

RTE restricts users to EU only, so with a US IP address, you get:

Screenshot of RTE restricts users to EU only

So, using either ScraperAPI or residential proxies will be necessary to access some websites. Let’s start by sending a request through ScraperAPI:

import requests
payload = {'api_key': 'YOUR-API-KEY-HERE', 'url': 'https://httpbin.org/ip'}
resp = requests.get('https://api.scraperapi.com', params=payload)
print(resp.text)

As always, start by importing the necessary library (requests in our case). Then, define a dictionary object that has two key:value pairs – the API key (required for authentication) and the URL, which is the website you want to scrape.

We then create a response object that will store the answer retrieved from the website. You’ll need two arguments, the first of which is always the ScraperAPI endpoint, the second of which is the payload dictionary.

For now, we simply print the response. Running the code should just retrieve the origin IP address and print it in the standard output screen.

Selecting a geographical location

We’ll now switch to scraping websites that show data based on location, such as displaying different prices, currencies, or content in general.

Let’s start by adding a country code to our ScraperAPI request to visit The Chicago Tribune and see if we get a response.

All you need to do is add one more key:value pair to your payload dictionary: country_code as the key and a 2-letter country code (ISO 3166-1 format) as the value.

import requests
payload = {'api_key': 'YOUR-API-KEY-HERE', 'url': 'https://www.chicagotribune.com/', 'country_code': 'us'}
resp = requests.get('https://api.scraperapi.com', params=payload)
print(resp.text)

You should get a large HTML response showcasing lots of data. Our screenshot is truncated for demonstration purposes:

Screenshot of large HTML response

Parsing data with BeautifulSoup

We’ll start by installing BeautifulSoup:

pip install beautifulsoup4

Now we’ll need to make some modifications to our code:

  • We’ll put the response text (the full HTML file) into a BeautifulSoup object that will be used for parsing.
  • Then, a list will be created to store all the article titles.
  • For the output, we’ll run a loop that prints each title on a new line.

import requests
from bs4 import BeautifulSoup
payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]
unique_titles = sorted(set(titles))
for t in unique_titles:
    print(t)

Note that we also create a unique_titles list by converting titles to a set and sorting the result. Sets in Python do not store duplicate values, so this is an easy way to remove duplicate titles from our original list.
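To make the deduplication step concrete, here is a small offline sketch. The headlines are made up for illustration. It compares the set-based approach with an order-preserving alternative, which can be handy if you want titles in the order they appear on the page rather than alphabetically.

```python
# Made-up headlines, for illustration only.
titles = ["Weather alert", "Bears win", "City budget passes", "Bears win"]

# Set-based approach: the set drops duplicates, sorted() returns an ordered list.
alphabetical = sorted(set(titles))

# Alternative: dict keys are unique and keep insertion order (Python 3.7+),
# so this preserves the order in which titles first appeared.
page_order = list(dict.fromkeys(titles))

print(alphabetical)
print(page_order)
```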

You should get a response that’s similar to:

Screenshot of response

Finally, some websites display the same page with different data for many geographical locations. Most ecommerce businesses do that to make prices more transparent for users.
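If you want to compare what the same page shows in several countries, one possible approach is to build one payload per country code and send them in turn. The sketch below only builds the payloads; the product URL and the list of countries are assumptions chosen for illustration.

```python
def payloads_for_countries(api_key, url, country_codes):
    """Build one ScraperAPI payload per country for the same target URL."""
    return [
        {"api_key": api_key, "url": url, "country_code": code}
        for code in country_codes
    ]


# Hypothetical product page and country list, for illustration only.
payloads = payloads_for_countries(
    "YOUR-API-KEY-HERE", "https://example.com/product/123", ["us", "de", "fr"]
)
for payload in payloads:
    # Each payload would be sent with:
    #   requests.get("https://api.scraperapi.com", params=payload)
    print(payload["country_code"], "->", payload["url"])
```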

Storing data with Pandas

You’ll likely want to do more than just print out data. Otherwise, you’ll lose everything after closing your IDE or any other program.

Usually, the pandas library is more than enough for basic scraping projects. Start by installing it:

pip install pandas

We’ll also import the built-in datetime module to add a date stamp to the CSV file name, which is useful if you need to return to the file later.

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]
unique_titles = sorted(set(titles))
df = pd.DataFrame({"Headline": unique_titles})
today = datetime.now().strftime("%Y-%m-%d")
outfile = f"chicago_tribune_headlines_{today}.csv"
df.to_csv(outfile, index=False, encoding="utf-16")
print(f"✔ Saved {len(df)} headlines → {outfile}")

Running the code will now create a CSV file and print a success message. The underlying code is quite simple: a DataFrame is created with a single column named “Headline”, and each row holds one of the titles.

To add a timestamp to the file, we use datetime.now() and turn it to a string using strftime and provide the format in the argument.
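The file-naming step can be tried on its own. This is a small sketch with a hypothetical helper name, `dated_filename`, that takes a fixed date so the result is easy to check.

```python
from datetime import datetime


def dated_filename(prefix, when=None):
    """Return '<prefix>_<YYYY-MM-DD>.csv' for the given (or current) date."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y-%m-%d')}.csv"


# A fixed date makes the result easy to check by eye.
print(dated_filename("chicago_tribune_headlines", datetime(2025, 6, 28)))
```

If you run the scraper several times a day, adding `%H%M%S` to the format string would keep later runs from overwriting earlier files.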

Finally, the DataFrame is written to a CSV file.

Note: We use “utf-16” encoding because some spreadsheet programs, Excel in particular, may misread non-ASCII characters in a “utf-8” CSV that has no byte order mark.

Your CSV file should look a little like this:

Screenshot of CSV file

Further considerations

Scraping a single website with ScraperAPI is ultimately a little too simple for any real-world project, although it serves as a great starting point. You can improve your scraping code in two primary ways.

One is that you can use ScraperAPI to scrape the homepage of a website, collect all the URLs, and continue in such a fashion, building your list of URLs.

Alternatively, you can create a list object manually and input all the URLs you want to scrape, then run a loop iterating over each element as the URL.
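The second option, looping over a hand-written list, might look like the sketch below. The `fetch` argument is a stand-in so the example runs offline; in a real script it would wrap the requests.get() call to the ScraperAPI endpoint. The URLs and helper names are hypothetical.

```python
def scrape_many(urls, fetch):
    """Call fetch(url) for each URL and collect results, recording failures."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # keep going if one page fails
            results[url] = f"ERROR: {exc}"
    return results


def fake_fetch(url):
    """Stand-in for a real request, so the sketch runs offline."""
    if "broken" in url:
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"


# Hypothetical URL list, for illustration only.
urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]
pages = scrape_many(urls, fake_fetch)
for url, body in pages.items():
    print(url, "->", body[:40])
```

Catching errors per URL means a single bad page does not stop the rest of the run.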

Here’s the full code block that you can build on:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]
unique_titles = sorted(set(titles))
df = pd.DataFrame({"Headline": unique_titles})
today = datetime.now().strftime("%Y-%m-%d")
outfile = f"chicago_tribune_headlines_{today}.csv"
df.to_csv(outfile, index=False, encoding="utf-16")
print(f"✔ Saved {len(df)} headlines → {outfile}")

 

The post How to Scrape Geo-Restricted Data Without Getting Banned appeared first on ScraperAPI.

]]>
Speed Up Web Scraping with ScraperAPI’s Concurrent Threads https://www.scraperapi.com/blog/speed-up-web-scraping/ Sat, 28 Jun 2025 07:11:23 +0000 https://www.scraperapi.com/?p=7994 If you’ve ever built a web scraper, you know the pain. You build a scraper, and it works great on 1,000 pages, but the moment you scale up to 10,000 or more, things become slow. Here’s the good news: there’s a fix!  In this article, you’ll learn everything about: So, what are concurrent threads? If […]

The post Speed Up Web Scraping with ScraperAPI’s Concurrent Threads appeared first on ScraperAPI.

]]>

If you’ve ever built a web scraper, you know the pain. You build a scraper, and it works great on 1,000 pages, but the moment you scale up to 10,000 or more, things become slow. Here’s the good news: there’s a fix! 

In this article, you’ll learn everything about:

  • What concurrent threads are
  • How to set up ScraperAPI’s concurrent threads
  • How to use them to scrape web pages faster and more efficiently

So, what are concurrent threads?

If you’ve used ScraperAPI before, you already know the basics—you hit the API to fetch the pages you need. With concurrent threads, you can send multiple requests at the same time. Instead of scraping one page, waiting, and then scraping the next, you can run several requests in parallel and get results way faster.

Let’s say you’re using 5 concurrent threads. That means you’re making 5 requests to ScraperAPI at once, all running in parallel. So, the more threads you use, the more requests you can send at once, and the faster your scraper runs.
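The effect is easy to see without touching the network. In this sketch, `fake_request` is a made-up stand-in that just sleeps for 0.2 seconds to imitate waiting on a response; the timings you see will vary slightly by machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(page_id):
    """Simulate a network call that spends most of its time waiting."""
    time.sleep(0.2)
    return f"page-{page_id}"


page_ids = list(range(5))

# One request at a time: total time is roughly 5 x 0.2 s.
start = time.perf_counter()
sequential = [fake_request(p) for p in page_ids]
sequential_time = time.perf_counter() - start

# Five worker threads: the waits overlap, so total time is roughly 0.2 s.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    parallel = list(executor.map(fake_request, page_ids))
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, 5 threads: {parallel_time:.2f}s")
```

Both runs return the same results in the same order, because `executor.map` yields results in input order.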

Each ScraperAPI plan comes with its own thread limit. For example:

  • The Business plan gives you up to 100 concurrent threads
  • The Scaling plan bumps that up to 200 threads

However, if your scraping needs go beyond that, we’ve got you covered with our Enterprise plan. With Enterprise, there’s no fixed cap. We work with you to tailor a custom thread limit based on your exact use case so you get the best speed and performance. 

How to increase your scraping speed?

Now that we know what concurrent threads are, it’s time to see them in action.

We’ll run a simple experiment to test how performance scales with different thread limits and show just how much speed you can unlock.

First, we’ll create a list of 1000+ URL samples. To do that, we’ll crawl https://edition.cnn.com/business/tech and extract URLs using open-source tools like Scrapy. This step is just to get the sample URLs we want to scrape. In your case, these URLs would be the actual pages that you need to scrape.

Once we have the list of URLs, we’ll hit the ScraperAPI endpoint twice: 

  • First, using 100 concurrent threads. 
  • Then again, with 500 concurrent threads.

Finally, we’ll measure how long each run takes. 

Stage 1: Create a list of sample URLs to scrape

Follow these steps to create a list of URLs from https://edition.cnn.com/business/tech:

Step 1: Open the command prompt or terminal, go to your project folder, and install Scrapy and BeautifulSoup (which we will need later).

pip install scrapy bs4

Step 2: Start a new Scrapy project.

scrapy startproject cnn_scraper
cd cnn_scraper

Step 3: Go inside the /spiders folder and create a Python file.

cd spiders
touch cnn_spider.py

Step 4: In your IDE, go to cnn_scraper/spiders/cnn_spider.py and paste the following code:

import scrapy
from urllib.parse import urljoin


class CnnSpider(scrapy.Spider):
    name = "cnn"
    allowed_domains = ["edition.cnn.com"]
    start_urls = ["https://edition.cnn.com/business/tech"]
    seen_urls = set()  # every URL collected so far

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        # Collect the href of every link on the current page
        links = response.css("a::attr(href)").getall()

        for link in links:
            # Turn relative links into full URLs and skip off-site links
            if link.startswith("/"):
                full_url = urljoin("https://edition.cnn.com", link)
            elif link.startswith("http") and "edition.cnn.com" in link:
                full_url = link
            else:
                continue

            # Only process URLs we haven't seen before
            if full_url not in self.seen_urls:
                self.seen_urls.add(full_url)
                yield {"url": full_url}
                yield response.follow(full_url, callback=self.parse)

        # Stop the crawl once enough sample URLs have been collected
        if len(self.seen_urls) >= 1000:
            self.crawler.engine.close_spider(self, "URL limit reached")

In the above code, custom_settings sets the User-Agent header Scrapy sends with each request, making the spider look like a real browser. The parse() function calls getall() on the CSS selector to collect every link on the current page, then turns each one into a full URL. The check if full_url not in self.seen_urls ensures each link is processed only once. Once 1,000 unique URLs have been collected, the spider closes itself.
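The link-normalization step can also be tested in isolation. The sketch below is a slightly stricter variant, not the spider's exact logic: it compares the parsed host name instead of doing a substring check, so an off-site URL that merely mentions edition.cnn.com in its query string is rejected. The function name `normalize_link` is hypothetical.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://edition.cnn.com"


def normalize_link(href):
    """Return an absolute edition.cnn.com URL, or None for off-site links."""
    # urljoin leaves absolute URLs untouched and resolves relative ones against BASE_URL.
    full_url = urljoin(BASE_URL, href)
    parsed = urlparse(full_url)
    if parsed.scheme in ("http", "https") and parsed.netloc == "edition.cnn.com":
        return full_url
    return None


print(normalize_link("/business/tech"))
print(normalize_link("https://edition.cnn.com/world"))
print(normalize_link("https://example.com/?ref=edition.cnn.com"))
```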

Step 5: To run the above code and save the URLs into a JSON file, execute the following command from the cnn_scraper/spiders folder:

scrapy crawl cnn -o urls.json

Stage 2: Let’s scrape the saved URLs using ScraperAPI 

Step 1: Create a Python file–I named mine scraper_api.py, but you can pick whatever name works for you–and paste the following code in it:

import requests
import json
import csv
import time
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

API_KEY = 'ScraperAPI API_key'
NUM_RETRIES = 3
NUM_THREADS = 100
with open("path/to/URLs_json_file", "r") as file:
    raw_data = json.load(file)
    list_of_urls = [item["url"] for item in raw_data if "url" in item]

def scrape_url(url):
   params = {
       'api_key': API_KEY,
       'url': url
   }

   for _ in range(NUM_RETRIES):
       try:
           response = requests.get('http://api.scraperapi.com/', params=params)
           if response.status_code in [200, 404]:
               break
       except requests.exceptions.ConnectionError:
           continue
   else:
       return {
           'url': url,
           'h1': 'Failed after retries',
           'title': '',
           'meta_description': '',
           'status_code': 'Error'
       }

   if response.status_code == 200:
       soup = BeautifulSoup(response.text, "html.parser")
       h1 = soup.find("h1")
       title = soup.title.string.strip() if soup.title and soup.title.string else "No Title Found"
       meta_tag = soup.find("meta", attrs={"name": "description"})
       meta_description = meta_tag["content"].strip() if meta_tag and meta_tag.has_attr("content") else "No Meta Description"
       return {
           'url': url,
           'h1': h1.get_text(strip=True) if h1 else 'No H1 found',
           'title': title,
           'meta_description': meta_description,
           'status_code': response.status_code
       }
   else:
       return {
           'url': url,
           'h1': 'No H1 - Status {}'.format(response.status_code),
           'title': '',
           'meta_description': '',
           'status_code': response.status_code
       }

start_time = time.time()

#concurrent threads 
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
   scraped_data = list(executor.map(scrape_url, list_of_urls))


elapsed_time = time.time() - start_time
print(f"Using {NUM_THREADS} concurrent threads, scraping completed in {elapsed_time:.2f} seconds.")

# Save to CSV
with open("cnn_h1_1000_1_results.csv", "w", newline='', encoding="utf-8") as f:
   writer = csv.DictWriter(f, fieldnames=["url", "h1", "title", "meta_description", "status_code"])
   writer.writeheader()
   writer.writerows(scraped_data)

The function scrape_url(url) sends a request to ScraperAPI using the given URL. If the response status code is not 200 or 404, it retries up to NUM_RETRIES times. If it gets a 200 OK, it uses BeautifulSoup to parse the H1, title, and meta description. 
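The retry logic relies on Python's for...else construct, which is easy to misread: the else branch runs only when the loop finishes without hitting break. Here is a stripped-down sketch of the same pattern, with a made-up flaky function in place of the real request.

```python
def fetch_with_retries(fetch, retries=3):
    """Try fetch() up to `retries` times; return None if every attempt fails."""
    for _ in range(retries):
        try:
            result = fetch()
            break  # success: skip the else branch below
        except ConnectionError:
            continue
    else:
        # Runs only when the loop finished without hitting `break`.
        return None
    return result


def make_flaky(failures):
    """Build a stand-in fetch that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("simulated network error")
        return f"ok after {state['calls']} calls"

    return fetch


print(fetch_with_retries(make_flaky(2)))  # third attempt succeeds
print(fetch_with_retries(make_flaky(5)))  # all three attempts fail
```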

The ThreadPoolExecutor(max_workers=NUM_THREADS) block sends the requests to ScraperAPI concurrently. In the end, the code saves the scraped data to a CSV file.

When NUM_THREADS == 100, it took 100.68 seconds to scrape the titles.

example scraper code

In the same code, we only changed the number of concurrent threads to 500; now it took just 23.56 seconds. 

example scraper code

Just like that, I slashed the scraping time from around 100 seconds down to just under 24 seconds. That’s more than 4 times faster with 500 threads compared to 100!

To optimize your performance with custom concurrent threads, upgrade to our custom enterprise plan today.

The post Speed Up Web Scraping with ScraperAPI’s Concurrent Threads appeared first on ScraperAPI.

]]>
Integrating ScraperAPI with Data Cleaning Pipelines https://www.scraperapi.com/blog/integrating-scraperapi-with-data-cleaning-pipelines/ Mon, 26 May 2025 10:45:17 +0000 https://www.scraperapi.com/?p=7818 Collecting clean, usable data is the foundation of any successful web scraping project. However, web data is often filled with inconsistencies, duplicates, and irrelevant content, making it hard to work with straight out of the source. That’s where combining ScraperAPI with data cleaning pipelines comes in. ScraperAPI helps you reliably extract data from websites—even those […]

The post Integrating ScraperAPI with Data Cleaning Pipelines appeared first on ScraperAPI.

]]>

Collecting clean, usable data is the foundation of any successful web scraping project. However, web data is often filled with inconsistencies, duplicates, and irrelevant content, making it hard to work with straight out of the source.

That’s where combining ScraperAPI with data cleaning pipelines comes in. ScraperAPI helps you reliably extract data from websites—even those with complex anti-scraping protections—while Python’s data tools make it easy to clean, structure, and prepare that data for use.

In this guide, you’ll learn how to:

  • Set up ScraperAPI for web scraping
  • Use ETL (Extract, Transform, Load) techniques to clean and organize your data
  • Integrate these tools into a workflow that’s fast, flexible, and scalable

Ready? Let’s get started!

What is ETL?

ETL stands for Extract, Transform, Load. It is a data processing framework used to move data from one or more sources, clean it, and store it in a system where it can be analyzed. This process is essential for handling large volumes of data from various sources, preparing it for reporting and informed decision-making.

The Three Stages of ETL

  • Extract: In this initial phase, raw data is gathered from its source. In our case, that means scraping websites. This can be tricky, as websites often implement anti-scraping measures like IP bans, CAPTCHAs, and dynamic content loading through JavaScript. To manage these challenges and streamline the extraction process, we’ll use ScraperAPI, a tool designed to simplify and automate data collection at scale.
  • Transform: Once data is extracted, it’s often messy or inconsistent. The real cleanup happens in the transformation stage: the data is validated, standardized, and restructured into a usable format. This is a crucial step for ensuring data quality and consistency.
  • Load: Finally, the cleaned and transformed data is loaded into a storage system. Depending on the project, this could be a CSV file, a relational database (like PostgreSQL or MySQL), a NoSQL database (like MongoDB), a data warehouse (like BigQuery, Redshift, or Snowflake), or even a data lake. We’ll keep this tutorial simple and load the data into a CSV file.
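The three stages above can be sketched end to end with nothing but the standard library. The records here are invented, and extract() simply returns them instead of scraping, so treat this as a shape to follow rather than a working scraper.

```python
import csv
import io


def extract():
    """Stand-in for scraping: return raw records as they might come off a page."""
    return [
        {"name": " Widget ", "price": "$10.50"},
        {"name": "Gadget", "price": "$3"},
    ]


def transform(rows):
    """Trim whitespace and convert price strings to floats."""
    return [
        {"name": row["name"].strip(), "price": float(row["price"].lstrip("$"))}
        for row in rows
    ]


def load(rows, target):
    """Write cleaned rows as CSV to any file-like object."""
    writer = csv.DictWriter(target, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # in-memory file so the sketch leaves nothing on disk
load(transform(extract()), buffer)
print(buffer.getvalue())
```

Keeping each stage in its own function makes it possible to swap the source or the destination without touching the cleaning logic.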

ScraperAPI and Python for ETL

Web-scraped data is often messy, inconsistent, and unstructured—not yet ready for analysis or decision-making. That’s where ETL becomes essential. It brings structure, cleanliness, and reliability to chaotic web data, making it more valuable.

Let’s break down how this works in the context of scraping real estate listings:

  • Extract: Use ScraperAPI to pull raw HTML from multiple real estate website pages. ScraperAPI handles the toughest parts of web scraping—IP rotation, user-agent spoofing, CAPTCHA solving, and even JavaScript rendering—so you can focus on getting the data instead of fighting anti-bot defenses.
  • Transform: With libraries like BeautifulSoup and Pandas, you can clean and standardize your data for analysis using Python:
    • Parse price fields, stripping currency symbols and converting values to a numeric format.
    • Standardize inconsistent text (e.g., “3 bdr”, “three beds”) into a single numeric format (e.g., 3).
    • Normalize square footage to a consistent unit and data type.
    • Handle missing values for features such as balconies or garages.
    • Identify and remove duplicate listings that may appear due to frequent site updates.
  • Load: Once the data is cleaned and transformed, use Pandas to export it into a structured format like a CSV for reporting or analysis, or load it directly into a database for long-term storage and querying.
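Two of the Transform tasks listed above, filling in missing features and dropping duplicate listings, can be sketched in plain Python. The field names and sample records are assumptions for illustration, and treating "same title and same price" as a duplicate is just one possible rule.

```python
def clean_listings(listings):
    """Fill missing optional fields and drop repeated listings."""
    seen = set()
    cleaned = []
    for listing in listings:
        # Treat an absent balcony/garage field as False instead of leaving a gap.
        record = {
            "title": listing["title"],
            "price": listing["price"],
            "balcony": listing.get("balcony", False),
            "garage": listing.get("garage", False),
        }
        # Two records with the same title and price are considered duplicates here.
        key = (record["title"], record["price"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical scraped records, for illustration only.
raw = [
    {"title": "Flat in Gracia", "price": 350000, "balcony": True},
    {"title": "Flat in Gracia", "price": 350000, "balcony": True},
    {"title": "House in Sants", "price": 420000, "garage": True},
]
for row in clean_listings(raw):
    print(row)
```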

With Python and ScraperAPI together, you have a powerful ETL toolkit:

  • ScraperAPI simplifies and hardens the Extract phase.
  • With its rich data handling capabilities, Python covers Transform and Load with flexibility and precision.

This ETL pipeline helps ensure that the data you have scraped is accurate, consistent, and ready for use, whether you are analyzing market trends or creating a real estate dashboard.

Project Requirements

Before diving into the integration, make sure you have the following:

1. A ScraperAPI Account: Sign up on the ScraperAPI website to get your API key. ScraperAPI will handle proxy rotation, CAPTCHA solving, and JavaScript rendering, making the extraction phase a breeze. You’ll receive 5,000 free API credits when you sign up for a seven-day trial, starting whenever you’re ready.  

2. A Python Environment: Ensure Python (version 3.7+ recommended) is installed on your system. You’ll also need to install key libraries:

  • requests: For making HTTP requests to ScraperAPI.
  • beautifulsoup4: For parsing HTML and XML content.
  • pandas: For data manipulation and cleaning.
  • python-dotenv: For loading your API key and other credentials from a .env file so they stay out of your code.
  • lxml (optional but recommended): A fast and efficient XML and HTML parser that BeautifulSoup can use.

You can install them using pip with this command:

pip install requests beautifulsoup4 pandas lxml python-dotenv

3. Basic Web Scraping Knowledge: A foundational understanding of HTML structure, CSS selectors, and how web scraping works will be beneficial.

4. An IDE or Code Editor: Such as VS Code, PyCharm, or Jupyter Notebook for writing and running your Python scripts.
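Since the API key is read from the environment, it helps to fail early with a clear message when it is missing. The helper below, `require_env`, is a hypothetical name and not part of python-dotenv; the sketch sets the variable by hand so it runs without a .env file.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# In a real project, python-dotenv's load_dotenv() would populate os.environ
# from the .env file first. Here the value is set by hand so the sketch runs as-is.
os.environ["SCRAPER_API_KEY"] = "demo-key"
print(require_env("SCRAPER_API_KEY"))
```

The .env file itself is a plain text file with one NAME=value pair per line, and it should be kept out of version control.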

 

TL;DR

For those in a hurry, here’s the full scraper we are going to be building:

import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# === Load environment variables from .env file ===
load_dotenv()
SCRAPER_API_KEY = os.getenv('SCRAPER_API_KEY')
IDEALISTA_URL = os.getenv('IDEALISTA_URL')
SCRAPER_API_URL = f"http://api.scraperapi.com/?api_key={SCRAPER_API_KEY}&url={IDEALISTA_URL}"


# === Extract ===
def extract_data(url):
    response = requests.get(url)
    extracted_data = []

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        listings = soup.find_all('article', class_='item')

        for listing in listings:
            title = listing.find('a', class_='item-link').get('title')
            price = listing.find('span', class_='item-price').text.strip()

            item_details = listing.find_all('span', class_='item-detail')
            bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
            area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"

            description = listing.find('div', class_='item-description')
            description = description.text.strip() if description else "N/A"

            tags = listing.find('span', class_='listing-tags')
            tags = tags.text.strip() if tags else "N/A"

            images = [img.get("src") for img in listing.find_all('img')] if listing.find_all('img') else []

            extracted_data.append({
                "Title": title,
                "Price": price,
                "Bedrooms": bedrooms,
                "Area": area,
                "Description": description,
                "Tags": tags,
                "Images": images
            })
    else:
        print(f"Failed to extract data. Status code: {response.status_code}")

    return extracted_data


# === Transform ===
def transform_data(data):
    df = pd.DataFrame(data)

    df['Price'] = (
        df['Price']
        .str.replace('€', '', regex=False)
        .str.replace(',', '', regex=False)
        .str.strip()
        .astype(float)
    )

    def extract_bedrooms(text):
        match = re.search(r'\d+', text)
        return int(match.group()) if match else None

    df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)

    df['Area'] = (
        df['Area']
        .str.replace('m²', '', regex=False)
        .str.replace(',', '.', regex=False)
        .str.strip()
        .astype(float)
    )

    df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)
    df = df[df['Bedrooms'] == 3]

    return df


# === Load ===
def load_data(df, filename='three_bedroom_houses.csv'):
    df.to_csv(filename, index=False)
    print(f"Saved {len(df)} listings to {filename}")


# === Main pipeline ===
def main():
    print("Starting ETL pipeline for Idealista listings...")

    raw_data = extract_data(SCRAPER_API_URL)
    if not raw_data:
        print("No data extracted. Check your API key or target URL.")
        return

    print(f"Extracted {len(raw_data)} listings.")

    cleaned_data = transform_data(raw_data)
    print(f"{len(cleaned_data)} listings after cleaning and filtering.")

    load_data(cleaned_data)


if __name__ == "__main__":
    main()

Want to see how we built it? Keep reading!

Building a Real Estate ETL Pipeline with ScraperAPI and Python

In this section, we’ll build a working ETL pipeline that scrapes real estate listings from Idealista using ScraperAPI, cleans the data with Python, and saves it in a structured CSV file. We’ll walk through each part of the process—extracting the data, transforming it into a usable format, and loading it for analysis—so you’ll have a complete and reusable workflow by the end.

Step 1: Extracting: Using ScraperAPI 

Most real estate websites are known for blocking scrapers, which makes collecting data at any meaningful scale challenging. For that reason, we send our get() requests through ScraperAPI, which handles Idealista’s anti-scraping mechanisms without complicated workarounds.

For this guide, we’ll update an existing ScraperAPI real estate project to demonstrate the integration. You can find the complete guide on scraping Idealista with Python here.

import requests
from bs4 import BeautifulSoup

scraper_api_key = 'YOUR_SCRAPERAPI_KEY'  # Replace with your ScraperAPI key
idealista_query = "https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/"
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={idealista_query}'

response = requests.get(scraper_api_url)

# List that will hold one dictionary per listing
extracted_data = []

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract each house listing post
    house_listings = soup.find_all('article', class_='item')

    # Loop through each house listing and extract information
    for listing in house_listings:
        title = listing.find('a', class_='item-link').get('title')
        price = listing.find('span', class_='item-price').text.strip()

        # Find all span elements with class 'item-detail'
        item_details = listing.find_all('span', class_='item-detail')

        # Bedrooms and area are the first two item-detail entries
        bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
        area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"
        description = listing.find('div', class_='item-description').text.strip() if listing.find('div', class_='item-description') else "N/A"
        tags = listing.find('span', class_='listing-tags').text.strip() if listing.find('span', class_='listing-tags') else "N/A"

        # Extracting images
        image_elements = listing.find_all('img')
        images = [img.get("src") for img in image_elements] if image_elements else []

        # Store extracted information in a dictionary
        listing_data = {
            "Title": title,
            "Price": price,
            "Bedrooms": bedrooms,
            "Area": area,
            "Description": description,
            "Tags": tags,
            "Images": images
        }
        # Append the dictionary to the list
        extracted_data.append(listing_data)

The code above scrapes and parses real estate listings from Idealista using ScraperAPI and BeautifulSoup. It begins by configuring ScraperAPI with your ScraperAPI key and the target URL, then sends an HTTP GET request to the URL. If the request is successful, the HTML is parsed with BeautifulSoup, and the script locates all <article> elements with the class "item" (which represent property listings). It then loops through each listing to extract key details—title, price, number of bedrooms, area, description, tags, and image URLs.
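One thing to watch for when building the request URL with an f-string: if the target URL carries its own query string, its "&" and "=" characters can be mistaken for ScraperAPI parameters. A possible safeguard, sketched here with a made-up target URL and a hypothetical helper name, is to let urlencode build the query string.

```python
from urllib.parse import urlencode


def build_scraperapi_url(api_key, target_url):
    """Return a ScraperAPI request URL with the target URL safely encoded."""
    # urlencode percent-encodes '&', '?', and '=' inside target_url so they
    # are not mistaken for parameters of the ScraperAPI request itself.
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"http://api.scraperapi.com/?{query}"


# Hypothetical target URL with its own query string, for illustration only.
target = "https://example.com/search?city=barcelona&page=2"
print(build_scraperapi_url("YOUR_SCRAPERAPI_KEY", target))
```

Passing a params dictionary to requests.get() achieves the same encoding without building the URL by hand.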

Step 2: Transforming the Data (Data Cleaning)

After extracting raw data from Idealista, the next step is to clean and prepare it. To make this data more useful, we’ll use pandas, a powerful Python library for data analysis. If you’ve never used pandas before, think of it like Excel, only it’s in Python and is more flexible.

In Step 1, we stored each listing in a dictionary and added those dictionaries to a list called extracted_data. Here’s what that list might look like:

[
    {
        "Title": "Spacious apartment in central Barcelona",
        "Price": "€350,000",
        "Bedrooms": "3 bdr",
        "Area": "120 m²",
        "Description": "...",
        "Tags": "Luxury",
        "Images": [...]
    },
    ...
]

Now we’ll use pandas to convert that list into a structured DataFrame (a table-like object), then clean each column step by step.

import pandas as pd

# Convert raw listing data to a DataFrame
df = pd.DataFrame(extracted_data)

# View the raw data
print(df.head())
  • pd.DataFrame(...) creates a DataFrame from a list of dictionaries. Each dictionary becomes a row; each key becomes a column.
  • .head() shows the first five rows — useful for checking structure and data types.

The price values are strings like "€350,000". We’ll remove symbols and formatting to convert them to numeric values.

df['Price'] = (
    df['Price']
    .str.replace('€', '', regex=False)   # Remove the euro symbol
    .str.replace(',', '', regex=False)   # Remove comma separators
    .str.strip()                         # Remove leading/trailing whitespace
    .astype(float)                       # Convert strings to float
)


print(df['Price'].head()) # Display the first few prices to verify conversion
  • .str.replace(old, new) modifies string values in a column.
  • .str.strip() removes unnecessary spaces from both ends.
  • .astype(float) changes the column type from string to float so we can perform numerical operations later.

Listings may include text like "3 bdr" or "3 beds". We’ll extract just the number of bedrooms as an integer by applying a regex-based function with .apply(). The regex only matches digits, so a spelled-out value like "two beds" produces None.

import re

def extract_bedrooms(text):
    match = re.search(r'\d+', text)  # Find the first sequence of digits
    return int(match.group()) if match else None

df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)


print(df['Bedrooms'].head()) # Display the first few bedroom counts to verify conversion
  • .apply() runs a function on each element in the column.
  • re.search(r'\d+', text) looks for the first group of digits.
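To see how the helper behaves on different inputs, here is a small standalone check. The sample strings are illustrative and not taken from a real scrape.

```python
import re

def extract_bedrooms(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

# Illustrative inputs (assumed, not real Idealista values)
samples = ["3 bdr", "2 bed.", "two beds", "N/A"]
results = {s: extract_bedrooms(s) for s in samples}
print(results)
```

Rows where the result is None are removed later, when rows with missing values are dropped.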

This cleans and standardizes the bedroom count into integers.

Area values include units like "120 m²". We’ll remove those and convert to float.

df['Area'] = (
    df['Area']
    .str.replace('m²', '', regex=False)  # Remove unit
    .str.replace(',', '.', regex=False)  # Convert comma to dot for decimal values
    .str.strip()                         # Clean up whitespace
    .astype(float)                       # Convert to float
)


print(df['Area'].head())  # Display the first few areas to verify conversion

This ensures all values in the “Area” column are consistent numerical types so that we can sort, filter, or calculate metrics like price per square meter.
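One caveat: the scraper in Step 1 falls back to "N/A" when a detail is missing, and .astype(float) raises a ValueError on such a string. A more forgiving variant, sketched below on made-up sample values, uses pd.to_numeric with errors='coerce' so that anything non-numeric becomes NaN and can be dropped in the next step.

```python
import pandas as pd

# Made-up sample values, including the "N/A" fallback from the scraper
area = pd.Series(["120 m²", "95,5 m²", "N/A"])

cleaned = pd.to_numeric(
    area.str.replace('m²', '', regex=False)   # Remove unit
        .str.replace(',', '.', regex=False)   # Comma to dot for decimals
        .str.strip(),                         # Clean up whitespace
    errors='coerce',                          # Non-numeric strings become NaN
)
print(cleaned)
```

The same pattern would work for the Price column if some listings lack a price.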

Some listings may be missing essential values. We’ll drop rows with missing data in key columns. You can choose which columns are crucial and should not have any missing values.

df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)
  • .dropna() removes rows with NaN (missing) values.
  • The subset argument limits this check to specific columns; you can add other columns here if needed.
  • inplace=True modifies the DataFrame directly without needing to reassign it.

To work with only listings that have exactly 3 bedrooms (optional):

df = df[df['Bedrooms'] == 3]
  • df[condition] filters rows based on a condition.
  • Here, we’re checking where the “Bedrooms” column equals 3, and updating df to only include those rows.

If you skipped the optional filter, your data is structured similarly to this:

Title                            Price      Bedrooms  Area
“Modern flat in Eixample”        310000.00  3         95.0
“Loft with terrace in Gracia”    275000.00  2         82.0

This cleaned DataFrame is now ready for analysis or export. In the next step, we’ll load it into a CSV file.

Step 3: Loading Cleaned Data into CSV (Storing)

With your data now cleaned and structured in a pandas DataFrame, the final step is to persist it, meaning you save it somewhere so it can be reused, shared, or analyzed later.

The CSV file is the most common and beginner-friendly format for storing tabular data. It’s a simple text file where each row is a line and commas separate each column. Most tools—Excel, Google Sheets, data visualization tools, and programming languages—can open and process CSV files efficiently.

You can save your DataFrame to a CSV with just one line of code:

# Save the cleaned DataFrame to a CSV file
df.to_csv('three_bedroom_houses.csv', index=False)
  • df.to_csv(...) is a pandas method that writes your DataFrame to a CSV file.
  • 'three_bedroom_houses.csv' is the file name that will be created (or overwritten).
  • index=False tells pandas not to write the DataFrame index (row numbers) to the file, which keeps it clean unless you explicitly need it.

Once this is done, you’ll see a new file in your working directory (where your script is running). Here’s what a few lines of that file might look like:

Title,Price,Bedrooms,Area,Description,Tags,Images
"Flat / apartment in calle de Bailèn, La Dreta de l'Eixample, Barcelona",675000.0,3,106.0,"Magnificent and quiet brand new refurbished flat in Eixample.
This ready-to-live-in flat enjoys a fantastic location very close to the popular Paseo Sant Joan and the pedestrian street Consell de Cent. It is a very pleasant urban environment in which to live in the neighbourhood, with numerous services, shops, restau",N/A,"['https://img4.idealista.com/blur/480_360_mq/0/id.pro.es.image.master/dd/d0/85/1326281103.jpg', 'https://st3.idealista.com/b1/b8/d4/bcn-advisors.gif']"

You can open it in:

  • Excel: Just double-click the file.
  • Google Sheets: Upload the file and import it as a spreadsheet.
  • Another Python script: Using pd.read_csv()
  • Visualization tools: Like Power BI, Tableau, or even Jupyter notebooks.
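
When reading the file back in Python, keep in mind that the Images column was written as the text representation of a Python list (as in the sample row above). A minimal sketch, using an in-memory CSV with made-up values in place of the real file, shows one way to turn it back into a list with ast.literal_eval:

```python
import ast
import io
import pandas as pd

# In-memory stand-in for three_bedroom_houses.csv (values are made up)
csv_text = '''Title,Price,Bedrooms,Area,Images
"Flat in Eixample",675000.0,3,106.0,"['https://example.com/a.jpg', 'https://example.com/b.jpg']"
'''

df = pd.read_csv(io.StringIO(csv_text))

# The Images column is read as a plain string; parse it back into a list
df['Images'] = df['Images'].apply(ast.literal_eval)

print(df.loc[0, 'Images'][0])
```

With the real file, you would pass the file name to pd.read_csv() instead of the StringIO object.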

If you’re working with a larger dataset later or need better performance, consider saving to a database. But for now, CSV is ideal.

Step 4: Finalizing the ETL Pipeline

Now that your scraper works and your data is clean, it’s time to turn your code into a proper ETL pipeline. This makes it easier to maintain, reuse, schedule, or extend. We’ll do two things here:

1. Modularize the script into extract, transform, and load functions

2. Move sensitive info like your ScraperAPI key and target URL to environment variables using the python-dotenv package

This final version is production-friendly, secure, and easy to build on.

First, install python-dotenv if you don’t already have it:

pip install python-dotenv

Next, create a .env file in your project directory and add any sensitive information:

SCRAPER_API_KEY=your_scraperapi_key_here
IDEALISTA_URL=https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/

Here’s your final pipeline script, with the code restructured and organized into separate functions:

import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# === Load environment variables from .env file ===
load_dotenv()
SCRAPER_API_KEY = os.getenv('SCRAPER_API_KEY')
IDEALISTA_URL = os.getenv('IDEALISTA_URL')
SCRAPER_API_URL = f"http://api.scraperapi.com/?api_key={SCRAPER_API_KEY}&url={IDEALISTA_URL}"


# === Extract ===
def extract_data(url):
    response = requests.get(url)
    extracted_data = []

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        listings = soup.find_all('article', class_='item')

        for listing in listings:
            title = listing.find('a', class_='item-link').get('title')
            price = listing.find('span', class_='item-price').text.strip()

            item_details = listing.find_all('span', class_='item-detail')
            bedrooms = item_details[0].text.strip() if item_details and item_details[0] else "N/A"
            area = item_details[1].text.strip() if len(item_details) > 1 and item_details[1] else "N/A"

            description = listing.find('div', class_='item-description')
            description = description.text.strip() if description else "N/A"

            tags = listing.find('span', class_='listing-tags')
            tags = tags.text.strip() if tags else "N/A"

            images = [img.get("src") for img in listing.find_all('img')] if listing.find_all('img') else []

            extracted_data.append({
                "Title": title,
                "Price": price,
                "Bedrooms": bedrooms,
                "Area": area,
                "Description": description,
                "Tags": tags,
                "Images": images
            })
    else:
        print(f"Failed to extract data. Status code: {response.status_code}")

    return extracted_data


# === Transform ===
def transform_data(data):
    df = pd.DataFrame(data)

    df['Price'] = (
        df['Price']
        .str.replace('€', '', regex=False)
        .str.replace(',', '', regex=False)
        .str.strip()
        .astype(float)
    )

    def extract_bedrooms(text):
        match = re.search(r'\d+', text)
        return int(match.group()) if match else None

    df['Bedrooms'] = df['Bedrooms'].apply(extract_bedrooms)

    df['Area'] = (
        df['Area']
        .str.replace('m²', '', regex=False)
        .str.replace(',', '.', regex=False)
        .str.strip()
        .astype(float)
    )

    df.dropna(subset=['Price', 'Bedrooms', 'Area'], inplace=True)
    df = df[df['Bedrooms'] == 3]

    return df


# === Load ===
def load_data(df, filename='three_bedroom_houses.csv'):
    df.to_csv(filename, index=False)
    print(f"Saved {len(df)} listings to {filename}")


# === Main pipeline ===
def main():
    print("Starting ETL pipeline for Idealista listings...")

    raw_data = extract_data(SCRAPER_API_URL)
    if not raw_data:
        print("No data extracted. Check your API key or target URL.")
        return

    print(f"Extracted {len(raw_data)} listings.")

    cleaned_data = transform_data(raw_data)
    print(f"{len(cleaned_data)} listings after cleaning and filtering.")

    load_data(cleaned_data)


if __name__ == "__main__":
    main()

With this final step, your scraper is now:

  • Modular and easy to update
  • Secure, with API keys safely stored in environment variables
  • Ready to scale, automate, or plug into larger data workflows

You now have a reusable, scalable workflow for scraping and analyzing real estate listings!
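
One optional refinement, shown here as a sketch rather than as part of the pipeline above: the script builds the request URL with an f-string, so a missing environment variable ends up in the URL as the literal text "None", and a target URL that carries its own query string is not escaped. The hypothetical helper build_scraperapi_url below checks both values and URL-encodes them using only the standard library.

```python
from urllib.parse import urlencode

def build_scraperapi_url(api_key, target_url):
    """Hypothetical helper: validate inputs and build an encoded request URL."""
    if not api_key or not target_url:
        raise ValueError("SCRAPER_API_KEY and IDEALISTA_URL must both be set")
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"http://api.scraperapi.com/?{query}"

# Example with placeholder values
example_url = build_scraperapi_url(
    "your_scraperapi_key_here",
    "https://www.idealista.com/en/venta-viviendas/barcelona-barcelona/",
)
print(example_url)
```

In the script, a call like this could replace the f-string that defines SCRAPER_API_URL.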

Use Cases for ScraperAPI and Python’s Data Cleaning Integration

Now that you’ve seen how ScraperAPI and Python work together to extract and clean real estate data, let’s explore how this powerful combination can be used across industries. The ETL workflow—Extract, Transform, Load—is flexible and scalable, making it useful for many data-driven projects.

Here are several practical applications where this integration excels:

1. Sentiment Analysis: You can look at how language affects buyer interest by scraping property descriptions or user reviews. After cleaning the text with Python, sentiment analysis tools like TextBlob or VADER can score the tone as positive, neutral, or negative. This makes it possible to see whether listings that use appealing terms like “spacious” or “modern” tend to sell faster or command higher prices.

2. Trend Monitoring: Running your scraper regularly helps build a dataset that captures how property prices and features change over time. It’s easier to visualize trends and track how specific market segments are evolving by structuring the data around key attributes like location, number of bedrooms, or property type.

3. Competitor Research: Scraping listings from multiple real estate platforms gives you a direct view of how competitors price and position similar properties. With standardized data, you can compare pricing strategies, listing frequency, and included features to identify market gaps or specific areas where your offering could stand out.

4. Community Insights: Collecting data from forums, review sites, or social media conversations can reveal what buyers and renters care about. After cleaning and processing the text, analysis can uncover common priorities, such as proximity to schools, demand for green space, or concerns about noise, which can inform development and marketing decisions.

Wrapping Up

Integrating ScraperAPI with data-cleaning pipelines creates a powerful setup for working with web data. ScraperAPI takes care of the tricky parts of scraping—like CAPTCHAs, IP blocks, and JavaScript rendering—so you can reliably extract data at scale. On the other side, Python helps you clean and organize that data, making sure it’s accurate, consistent, and ready for analysis. This combination saves time and makes it easier to get real insights from messy, real-world data.

In this tutorial, we walked through the process of:

  • Extracting real estate listings from Idealista using ScraperAPI
  • Transforming the data by standardizing data types, removing unwanted characters and empty values, and filtering for three-bedroom listings.
  • Loading the cleaned data into a structured CSV file for easy sharing and analysis

If you’d like to try it for yourself, you can sign up for a free ScraperAPI account and get 5,000 API credits to start scraping right away. It’s a great way to test the waters and see how it fits into your data workflows.

Until next time, happy scraping!

FAQs

Integrating ScraperAPI into your ETL pipeline simplifies data extraction by handling anti-scraping mechanisms like IP bans, CAPTCHAs, and JavaScript rendering. This ensures uninterrupted data collection, even from complex or heavily protected websites. ScraperAPI also reduces the need for manual workarounds, allowing you to focus on transforming and analyzing the data.

To ensure data accuracy in your ETL pipeline, start by validating the extracted data using Python tools like pandas to check for missing values, duplicates, or inconsistent formats. Clean the data by standardizing date formats, currency symbols, and numeric values. Regularly test your scraping logic to ensure it adapts to website structure changes. Always review a sample of the scraped output to manually confirm that the data matches expectations before scaling your pipeline.

ScraperAPI can extract a wide variety of data types for ETL pipelines, including plain text such as product descriptions, blog content, or property listings; numerical data like prices, ratings, and financial figures; media files including images and videos; structured data such as HTML tables, lists, and JSON or XML feeds; and dynamic content loaded via JavaScript or AJAX. This flexibility suits everything from basic web scraping to complex data aggregation projects.

The post Integrating ScraperAPI with Data Cleaning Pipelines appeared first on ScraperAPI.

]]>
Build a Walmart Reviews Analysis Tool Using ScraperAPI, VADER, Gemini, and Streamlit https://www.scraperapi.com/blog/walmart-reviews-analysis-tool/ Thu, 15 May 2025 11:41:27 +0000 https://www.scraperapi.com/?p=7733

The post Build a Walmart Reviews Analysis Tool Using ScraperAPI, VADER, Gemini, and Streamlit appeared first on ScraperAPI.

]]>

Customer reviews are more than just feedback. They are a rich, often untapped source of business intelligence. Paying close attention and analyzing what your customers say about their experience with your products can uncover real pain points, spot trends in complaints, and even discover areas for opportunities that might be invisible otherwise.

Scraping dynamic, high-traffic websites like Walmart can be challenging. Even locating the script tags that hold the data you want can be confusing. Luckily for us, ScraperAPI provides a dedicated endpoint specifically for scraping Walmart reviews.

This article will guide you through building a unique tool that analyzes Walmart customer feedback. By using ScraperAPI’s structured Walmart reviews async endpoint, we will scrape reviews for multiple products and utilize VADER to pinpoint the emotional tone of each review. 

Furthermore, we will utilize Gemini to transform this raw data into a clear, actionable report that includes recommendations, all displayed in a free, cloud-hosted web interface built with Streamlit. 

Understanding VADER for Sentiment Analysis

Sentiment analysis is a method for identifying the emotions expressed in a piece of text. Since VADER (Valence Aware Dictionary and Sentiment Reasoner) is the sentiment analysis tool we’re using in this project, it’s best to understand how it works and its benefits before diving deeper.

VADER uses a predefined dictionary (lexicon) where each word is assigned a sentiment score. These scores reflect how positive, negative, or neutral a term is. In this project, we derive two key metrics from VADER’s output for each review we analyze: polarity, which is VADER’s compound score, and a subjectivity estimate, which we calculate from VADER’s neutral score.

Polarity represents the overall sentiment of a review, ranging from negative to positive. A score closer to +1 indicates a more positive review, while a score closer to -1 means a more negative review. A score near 0 signifies a neutral review. VADER calculates each score by assessing the sentiment intensity of individual words in the review, referencing its built-in dictionary.

Diagram about using VADER for Sentiment Scoring Process

Here’s more information on VADER that includes key advantages and features: 

1. Handles Informal Language Well

VADER is excellent at analyzing the kind of casual language people use on social media. It can easily understand and interpret slang, irregularly capitalized words, and even emotional cues through punctuation, such as multiple exclamation points and emojis. With most sentiment analysis tools, it’s challenging to achieve this, making VADER particularly well-suited for our task. 

2. Provides Context-Aware Sentiment Adjustment

Instead of treating words in isolation, VADER utilizes smart rules to interpret context. When a sentence includes words like “not,” it flips the meaning, such that, while “good” is positive, “not good” becomes negative. 

It also notices if certain words are in all caps or if there are many exclamation points, which usually means the emotion is stronger. And it accounts for degree modifiers like “very” or “slightly,” especially when they appear before an adjective, to gauge exactly how strong the emotion is.

3. Gives an Overall Mood Score

VADER wraps up all its analysis in a single number called the compound score, which ranges from -1 to +1. This score tells you at a glance whether the overall review feels positive (closer to +1), negative (closer to -1), or neutral (around 0). It’s like a summary mood indicator that combines all the word scores and context tweaks into one easy-to-understand value.
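
To turn the compound score into a label, a common convention (the thresholds suggested in VADER’s own documentation) treats scores of 0.05 and above as positive, -0.05 and below as negative, and everything in between as neutral. The helper name label_sentiment is ours, and the threshold can be adjusted for your data:

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score (-1 to +1) to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores for illustration
for score in (0.72, -0.41, 0.01):
    print(score, label_sentiment(score))
```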

ScraperAPI’s Walmart Reviews API (Async Endpoint)

Web scraping is difficult for several reasons. Modern websites are built with dynamic JavaScript frameworks, which means that most of the content isn’t available in the static HTML. In practice, you’d need to understand JavaScript and know your way around web development tools to locate and extract the data you need. 

When scraping a website, the tool you use must first bypass the anti-scraping defenses that many sites employ these days. Once it’s through, it immediately runs into a mountain of code. The image below shows a real-life example of the code behind Walmart’s website (to see the same view yourself, right-click on a Walmart page and select “Inspect”):

Using web development tools to locate and extract the Walmart's data

The code in the Elements section of a webpage is often buried under multiple layers of HTML, making it tricky to find exactly where the data is coming from. To navigate this, you typically need a good understanding of HTML, CSS, and JavaScript.

But what if you’re not a front-end developer? If you’re a data analyst, scientist, or engineer, your primary language probably isn’t JavaScript.

In most cases, you’ll need to use your browser’s developer tools to inspect the page and locate the specific elements, like reviews, ratings, or dates, that contain the data you want to scrape. 

Tools like Selenium and Puppeteer can help simulate user behavior, but they add layers of complexity. If we wanted to scrape this Walmart page manually, here’s the typical process we’d have to go through just to locate and extract that data:

  • First, you have to locate the parent container within the website’s HTML code that contains the div where you can find the reviews data:
Using web development tools to locate the div that contains the reviews data
  • Within that div, search for the tag <script id="__NEXT_DATA__" type="application/json">:
Using web development tools to search for <script id="__NEXT_DATA__" type="application/json">
  • Within that script tag, you will find the review data:
Using web development tools to find the review data
Using web development tools to find reviews from customers
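
For illustration, the manual steps above could be scripted roughly as follows. This is a sketch on a made-up HTML snippet: the JSON keys inside it are invented, and the real structure of Walmart’s embedded data is different and can change at any time.

```python
import json
import re

# Made-up page source; the real payload is far larger and shaped differently
html = '''
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"reviews": [{"rating": 5, "text": "Great value"}]}}
</script>
</body></html>
'''

def extract_next_data(page_source):
    """Return the parsed JSON from the __NEXT_DATA__ script tag, or None."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        page_source,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else None

data = extract_next_data(html)
print(data["props"]["reviews"][0]["text"])
```

Even with a helper like this, you would still need to fetch the page past the site’s defenses and keep up with layout changes.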

Modern websites often use dynamic rendering, which means the data loads asynchronously through JavaScript. As a result, you can’t access it with a simple HTML request—the important information is hidden behind scripts that run after the page loads.

That’s where ScraperAPI’s Walmart Reviews API (Async Endpoint) helps. It’s built to handle these challenges by bypassing Walmart’s anti-scraping defenses and delivering fully rendered page data directly to you.

Even better, the async endpoint lets you target customer reviews specifically—no need to dig through Walmart’s website yourself. While the API runs in the background, your app can continue to run as usual. Once the data is ready, you’ll get a status update.

More About ScraperAPI’s Asynchronous Feature

ScraperAPI’s async feature overcomes the challenges of large-scale web scraping, especially on websites with stringent anti-scraping measures. Instead of waiting for immediate responses that can result in timeouts and low success rates, you submit one or more scraping jobs and retrieve the results later while utilizing other functionalities within your app. 

How it works is that you send a POST request with your API key and URL, using the /jobs endpoint for single URLs or the /batchjobs endpoint for multiple URLs. The service immediately assigns a unique job ID and status URL. You then poll that status URL to monitor progress until you receive the scraped content in the JSON “body” field.

This asynchronous process gives ScraperAPI more time to navigate complex websites, manage timeouts, adjust HTTP headers, and render JavaScript-heavy content. You can also set up webhook callbacks so that the service automatically delivers the data once your job is complete.

ScraperAPI’s async approach handles the heavy lifting, so you can focus on other aspects of your application instead of waiting for data to scrape fully. As a result, you receive clean, structured data that’s ready for further processing.
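
The submit-then-poll flow described above can be sketched as a small loop. The field names and status values ("status", "finished", "failed", and the "response" wrapper around "body") and the helper name poll_job are assumptions for illustration, so check them against ScraperAPI’s async documentation. The status lookup is passed in as a function so the loop can be tried without network access.

```python
import time

def poll_job(status_url, fetch_status, interval=2.0, max_attempts=30):
    """Poll a job's status URL until it finishes, fails, or times out.

    fetch_status is any callable that takes the status URL and returns the
    job's JSON as a dict, for example: lambda u: requests.get(u).json()
    """
    for _ in range(max_attempts):
        job = fetch_status(status_url)
        status = job.get("status")          # Assumed field name
        if status == "finished":            # Assumed status value
            return job
        if status == "failed":              # Assumed status value
            raise RuntimeError(f"Job failed: {job}")
        time.sleep(interval)
    raise TimeoutError(f"Job not finished after {max_attempts} attempts")

# Offline demonstration with a fake status sequence
fake_responses = iter([
    {"status": "running"},
    {"status": "finished", "response": {"body": "<scraped content>"}},
])
result = poll_job(
    "https://example.invalid/jobs/123",
    lambda url: next(fake_responses),
    interval=0,
)
print(result["response"]["body"])
```

Capping the number of attempts keeps a stuck job from blocking the app indefinitely.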

Getting Started with ScraperAPI

  1. To begin, go to ScraperAPI’s website.
  2. You can either log in if you have an account already or click on “Start Trial” to create one:
Getting Started with ScraperAPI
  3. After creating your account, you’ll see a dashboard providing you with an API key, access to 5,000 API credits (7-day limited trial period), and information on how to get started scraping.
ScraperAPI dashboard with API key, access to 5000 API credits
  4. To access more credits and advanced features, scroll down and click “Upgrade to Larger Plan.”
Upgrade to Larger Plan button in ScraperAPI dashboard
  5. ScraperAPI provides documentation for various programming languages and frameworks that connect to its endpoints, including PHP, Java, Node.js, and more. You’ll find these resources when you scroll down on the dashboard page and select “View All Docs”:
View All Docs button to find resources in ScraperAPI dashboard
  6. Locate the search bar in the top right corner:
Search bar in ScraperAPI All Docs
  7. Search for “Walmart reviews endpoint,” and click on the “Async Structured Data Collection Method” pop-up:
ScraperAPI resources about Walmart reviews endpoint
  8. You’ll be directed to ScraperAPI’s detailed and clear documentation on using the Async Structured Data Collection Method.
ScraperAPI resource: Async Structured Data Collection Method
  9. On the documentation page, scroll down until you find the ‘Walmart Endpoint’ section. Then click on “Walmart Reviews API (Async)”
ScraperAPI resource: Walmart Reviews API (Async)
  10. It will take you to the Walmart Reviews API (Async) documentation, where you’ll find clear instructions and practical examples for using this feature in your application.
ScraperAPI Resource: instructions for using Walmart Reviews API (Async)

Building the Walmart Reviews Analysis Tool

Step 1: Setting Up the Project

Create a new project folder, set up a virtual environment inside it, and install the necessary dependencies.

1. Create the project folder:

mkdir walmart_rev_project
cd walmart_rev_project

2. Set up a virtual environment:

python -m venv venv

Activate the environment:

  • Windows:
venv\Scripts\activate
  • macOS/Linux:
source venv/bin/activate

3. Install Dependencies:

pip install streamlit requests google-generativeai

To access VADER:

pip install nltk
python -c "import nltk; nltk.download('vader_lexicon')"

The key dependencies and their functions are:

  • streamlit: To build an interactive web UI for the tool.
  • requests: Makes HTTP requests to external services (like ScraperAPI) to send and receive data.
  • google-generativeai: Interfaces with Google’s Gemini Large Language Model (LLM) to generate reports from the data we scrape.
  • nltk: This is a library that provides natural language processing tools, including sentiment analysis via VADER and text tokenization for processing customer reviews.
  • json (Standard Library): Handles JSON encoding and decoding for API responses.
  • concurrent.futures (Standard Library): Allows the application to run tasks concurrently using thread-based parallelism.
  • datetime (Standard Library): Manages date and time functions, such as timestamping reports and job submissions.

4. Define the Project Structure:

walmart_rev_project/
│── walmart_scraperapi.py

Step 2: Enabling Google’s Gemini LLM

We’ll be using Gemini 1.5 Flash as the large language model (LLM) for this tutorial. To get the same results, follow along and use the same model. Here’s how you can set it up:

  1. Go to the Google Developer API website.
  2. Create a Google account if you don’t already have one.
  3. Click on “Get a Gemini API Key”:
Get a Gemini API Key in Google Developer API website
  4. You’ll be redirected to Google AI Studio. Select “Create an API Key,” copy your API key, and store it as an environment variable:
Create an API Key in Google AI Studio

Step 3: Initializing Libraries, VADER and API Keys

Now, let’s build the codebase and create a suitable prompt to guide the LLM in its task.

1. Importing Libraries and Setting Up VADER

First, the tool imports necessary libraries from installed dependencies and configures both Gemini and NLTK’s sentiment analyzer.

import streamlit as st
import requests
import google.generativeai as genai
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import json
import concurrent.futures
import datetime
# Download VADER lexicon if not already present
nltk.download('vader_lexicon')
# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

The code above achieves the following:

Imports:

  • streamlit: Builds the web app’s user interface.
  • requests: Enables HTTP requests to interact with external services, such as ScraperAPI.
  • google.generativeai as genai: Integrates Google’s Gemini LLM for language generation capabilities (used later in the script).
  • nltk: Provides tools for working with human language data, specifically for sentiment analysis.
  • nltk.sentiment.vader.SentimentIntensityAnalyzer: A specific NLTK class used to analyze the sentiment of text.
  • json: Enables handling data in JSON format, common for web service responses.
  • concurrent.futures: Allows running tasks concurrently, potentially improving performance.
  • datetime: Provides functionality for working with dates and times, likely for report generation.

VADER Setup:

  • nltk.download('vader_lexicon'): Downloads the VADER lexicon, a list of words and their sentiment scores, if it’s not already present.
  • analyzer = SentimentIntensityAnalyzer(): Creates an instance of the VADER sentiment analyzer, ready for use.

2. Setting Up the API Keys and Configuring Gemini:

Further down, it sets up the API keys needed for Google Gemini and ScraperAPI and configures the Gemini API.

# Replace with your actual API keys
GOOGLE_API_KEY = "Axxxxxx"  # Replace with your Gemini API key
SCRAPERAPI_KEY = "9xxxxxx"  # Replace with your ScraperAPI key
# Configure Gemini API
genai.configure(api_key=GOOGLE_API_KEY)

Here’s what the code above achieves:

  • Set up API Keys: It defines variables to hold the API keys for Gemini and ScraperAPI, which are GOOGLE_API_KEY and SCRAPERAPI_KEY. Remember to replace the placeholder values with your actual API keys so the application can use these services.
  • Configuring Gemini: genai.configure(api_key=GOOGLE_API_KEY) configures the Google Generative AI library (genai) to use the provided API key, allowing the application to authenticate and interact with the Gemini language model.
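
Since Step 2 suggests storing the Gemini key as an environment variable, an alternative to hardcoding is to read both keys from the environment. A minimal sketch, assuming the variables are named GOOGLE_API_KEY and SCRAPERAPI_KEY (the helper require_env is hypothetical):

```python
import os

def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Placeholder values so the sketch runs on its own; set real ones in your shell
os.environ.setdefault("GOOGLE_API_KEY", "Axxxxxx")
os.environ.setdefault("SCRAPERAPI_KEY", "9xxxxxx")

GOOGLE_API_KEY = require_env("GOOGLE_API_KEY")
SCRAPERAPI_KEY = require_env("SCRAPERAPI_KEY")
```

Failing early with a named variable is easier to debug than an authentication error from the API later on.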

Step 4: Building the Sentiment Analysis Function

The function below, analyze_sentiment, takes text as input and uses the VADER analyzer initialized earlier to determine its polarity and subjectivity.

def analyze_sentiment(text):
    scores = analyzer.polarity_scores(text)
    polarity = scores['compound']
    subjectivity = 1 - scores['neu']
    return polarity, subjectivity

Below is a further breakdown of what the code above does:

  • Function Definition: It defines a function named analyze_sentiment that accepts a single argument, text, which will contain the input string we need to perform sentiment analysis on.
  • Sentiment Scoring: scores = analyzer.polarity_scores(text) calls the polarity_scores() method of the initialized VADER analyzer object on the input text. This action returns a dictionary containing various sentiment scores, including positive, negative, neutral, and a compound score.
  • Extracting Polarity: polarity = scores['compound'] extracts the compound score from the scores dictionary and assigns it to the polarity variable. Remember, we discussed that the compound score is a normalized, weighted composite score that summarizes the overall sentiment of the text.
  • Calculating Subjectivity: subjectivity = 1 - scores['neu'] estimates subjectivity. VADER provides a neu score representing the proportion of the text that is neutral. By subtracting this from 1, the function estimates the degree to which the text expresses opinions or subjective content; a higher value indicates greater subjectivity. Note that this is a heuristic built on VADER’s output rather than a score VADER reports directly.
  • Returning Results: return polarity, subjectivity, returns two values: the calculated polarity (compound sentiment score) and the derived subjectivity score.
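
To make the arithmetic concrete, here is the same calculation applied to a hand-written scores dictionary in the shape polarity_scores() returns. The numbers are made up for illustration, and the function name polarity_and_subjectivity is ours.

```python
def polarity_and_subjectivity(scores):
    """Derive (polarity, subjectivity) from a VADER-style scores dict."""
    polarity = scores['compound']
    subjectivity = 1 - scores['neu']   # Share of the text that is not neutral
    return polarity, subjectivity

# Made-up scores in the format returned by polarity_scores()
example_scores = {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.8}
print(polarity_and_subjectivity(example_scores))
```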

Step 5: Building the Report Generation Function

Here, the function, generate_gemini_report, takes the “model” instance and a text prompt as input, then generates content based on that prompt.

def generate_gemini_report(model, prompt):
    try:
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        return f"Error generating report: {e}"

Here is what we can understand from the code above:

  • Defining Function: It defines the function generate_gemini_report that accepts two arguments, which are model (an instance of Gemini configured earlier) and prompt (a string containing the instructions or questions for Gemini to respond to).
  • Content Generation: response = model.generate_content(prompt) calls the generate_content() method of the provided model object, passing the prompt as an argument. This sends the prompt to the Gemini model, which generates a textual response.
  • Returning the Response: return response.text ensures that if the content generation is successful, it extracts the generated text from the response object and returns it.
  • Error Handling: except Exception as e: handles any potential errors that might occur during the content generation process (e.g., network issues, API errors). In that case, return f"Error generating report: {e}" returns an error message that includes a description of the exception. This section is mainly for debugging purposes.
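The two return paths can be exercised without calling Gemini at all. The sketch below is only an illustration: FakeResponse, WorkingModel, and FailingModel are hypothetical stand-ins that mimic the one method and one attribute the function relies on, and they are not part of the google-generativeai library.

```python
class FakeResponse:
    """Stand-in for a model response; only the .text attribute is used."""
    def __init__(self, text):
        self.text = text


class WorkingModel:
    """Stand-in model whose generate_content call succeeds."""
    def generate_content(self, prompt):
        return FakeResponse(f"Report for: {prompt}")


class FailingModel:
    """Stand-in model whose generate_content call raises."""
    def generate_content(self, prompt):
        raise RuntimeError("quota exceeded")


def generate_gemini_report(model, prompt):
    try:
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        return f"Error generating report: {e}"


print(generate_gemini_report(WorkingModel(), "two reviews"))
print(generate_gemini_report(FailingModel(), "two reviews"))
```

Because both paths return a string, the caller cannot tell a report from an error by type alone. That is acceptable for a debugging aid, but an app that needs to react to failures may prefer to let the exception propagate.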

Step 6: Building the Async Scraper Function

Next, we define fetch_async_reviews, a function that submits a request for product reviews to ScraperAPI’s asynchronous Walmart reviews endpoint.

def fetch_async_reviews(api_key, product_id, tld, sort, page):
    url = "https://async.scraperapi.com/structured/walmart/review"
    headers = {"Content-Type": "application/json"}
    data = {
        "apiKey": api_key,
        "productId": product_id,
        "tld": tld,
        "page": str(page),
        "sort": sort,
    }
    # Log the submission without echoing the payload, which contains the API key
    st.info(f"Submitting job for Product ID '{product_id}' on page {page}")
    response = requests.post(url, json=data, headers=headers)
    try:
        response.raise_for_status()
        return response.json()
    except requests.exceptions.HTTPError as err:
        st.error(f"HTTP error during async request: {err}")
        st.error(f"Response content: {response.text}")
        raise
    except json.JSONDecodeError as e:
        st.error(f"Error decoding async response: {e}")
        raise
    except Exception as e:
        st.error(f"Unexpected error during async request: {e}")
        raise

The code achieves the following:

1. Defining the Function:

First, it defines fetch_async_reviews, which accepts the following arguments:

  • api_key: Your ScraperAPI API key for authentication.
  • product_id: The unique identifier of the Walmart product whose reviews are to be fetched.
  • tld: The top-level domain for Walmart (e.g., “com”, “ca”).
  • sort: The criteria by which the reviews should be sorted (e.g., “relevancy”, “helpful”).
  • page: The specific page number of reviews to retrieve.

2. Request Details:

  • url = "https://async.scraperapi.com/structured/walmart/review": Defines the specific ScraperAPI endpoint for fetching structured Walmart reviews asynchronously.
  • headers = {"Content-Type": "application/json"}: Sets the HTTP headers to indicate that the request body will be in JSON format.
  • data = {...}: Creates a Python dictionary containing the parameters to be sent in the request body as JSON (the API key, product ID, TLD, page number, and sort order).

3. Submitting the Asynchronous Job:

  • st.info(...) uses Streamlit’s info function to display a message in the web app indicating that a scraping job is being submitted with the provided details.
  • response = requests.post(url, json=data, headers=headers) sends an HTTP POST request to the ScraperAPI endpoint with the specified URL, JSON data, and headers.

4. Handling the Response:

  • try...except block: This block handles potential errors during the API request and response processing.
  • response.raise_for_status(): Checks if the HTTP request was successful (status code 2xx). If not, it raises an HTTPError exception.
  • return response.json(): If the request is successful, it parses the JSON response from ScraperAPI and returns it. This response typically contains information about the submitted job, such as its ID and status URL.
  • except requests.exceptions.HTTPError as err:: Catches HTTP-related errors (e.g., 4xx or 5xx status codes from ScraperAPI), displays an error message in the Streamlit app including the HTTP error and the response content, and re-raises the exception.
  • except json.JSONDecodeError as e:: Catches errors that occur if the response from ScraperAPI is not valid JSON, displays an error message, and re-raises the exception.
  • except Exception as e:: Catches any other unexpected errors during the process, displays an error message, and re-raises the exception.
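Since the request body carries the API key, it is worth masking that field before showing the payload anywhere a user could see it. The following is a minimal sketch of one way to do that; build_review_payload and redact_payload are hypothetical helper names, not functions from the app or from ScraperAPI.

```python
def build_review_payload(api_key, product_id, tld, sort, page):
    """Assemble the JSON body described above."""
    return {
        "apiKey": api_key,
        "productId": product_id,
        "tld": tld,
        "page": str(page),
        "sort": sort,
    }


def redact_payload(payload, secret_keys=("apiKey",)):
    """Return a copy of the payload that is safe to display or log."""
    return {k: ("***" if k in secret_keys else v) for k, v in payload.items()}


payload = build_review_payload("secret-key", "123456", "com", "relevancy", 1)
print(redact_payload(payload))
# {'apiKey': '***', 'productId': '123456', 'tld': 'com', 'page': '1', 'sort': 'relevancy'}
```

The original payload is left untouched, so it can still be sent with requests.post while only the redacted copy is displayed.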

Step 7: Checking Job Status

The function below takes the status URL (provided by ScraperAPI after submitting an asynchronous job) and retrieves the current status of that job.

def check_job_status(status_url):
    try:
        response = requests.get(status_url)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        st.error(f"Error checking job status: {e}")
        raise
    except json.JSONDecodeError as e:
        st.error(f"Error decoding job status response: {e}")
        raise

Here’s further information on precisely how the code works:

  • Function Definition: It defines the check_job_status function, which accepts a single argument, status_url. This is the URL that ScraperAPI returns when a job is submitted, and it is used to query the status of that asynchronous scraping job.
  • Requesting Job Status: response = requests.get(status_url) sends an HTTP GET request to the provided status_url to retrieve the current status of the scraping job.
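The app checks status when the user clicks a button, but the same function could also be polled in a loop. Below is a rough sketch of that idea, written so it can run without a network connection: poll_until_done is a hypothetical helper, the status fetcher is passed in as an argument, and the "running" value is only an assumed placeholder for any status other than "finished" or "failed".

```python
import time


def poll_until_done(fetch_status, status_url, max_attempts=10, delay_seconds=2.0):
    """Call fetch_status(status_url) until the job finishes, fails, or attempts run out."""
    for attempt in range(max_attempts):
        status_data = fetch_status(status_url)
        if status_data.get("status") in ("finished", "failed"):
            return status_data
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # wait before asking again
    return None  # gave up without reaching a final status


# Simulated responses: the first check is still in progress, the second is done.
responses = iter([{"status": "running"}, {"status": "finished"}])
result = poll_until_done(lambda url: next(responses), "https://example.invalid/status", delay_seconds=0)
print(result)  # {'status': 'finished'}
```

In the real app, check_job_status would be passed as fetch_status. A blocking loop like this freezes a Streamlit script while it waits, which is one reason a manual "check" button can be the simpler choice.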

Step 8: Retrieving the Scraped Data from the Job Result

Another function, get_job_result, takes the data representing the status of the async job and extracts the actual scraped content if the job status is successful.

def get_job_result(job_data):
    if job_data and job_data.get("status") == "finished":
        response_data = job_data.get("response")
        if response_data and isinstance(response_data, dict) and "body" in response_data:
            try:
                return json.loads(response_data["body"])
            except json.JSONDecodeError as e:
                st.error(f"Error decoding 'body' JSON: {e}")
                return None
        else:
            st.error("Could not find 'body' in the job result response.")
            return None
    return None

Below is a summary of what the code above achieves:

  • Defines the Function: First, it defines get_job_result, which accepts a single argument, job_data, a dictionary containing information about the status and result of the asynchronous job.
  • Checking Job Completion: if job_data and job_data.get("status") == "finished": first checks if job_data is not “None” and then looks explicitly for the “status” key within it. If the value associated with “status” is “finished”, it proceeds to extract the results.
  • Accessing Response Data: response_data = job_data.get("response"): retrieves the value of the “response” key from the job_data. This “response” contains details about the HTTP response from Walmart’s website.
  • Verifying Response Body: if response_data and isinstance(response_data, dict) and "body" in response_data: checks if response_data exists, is a dictionary, and contains a key named “body”. The “body” key holds the scraped content as a string; because the function passes it to json.loads, it is expected to be JSON-encoded review data rather than raw HTML.
  • Extracting and Decoding Body: The try...except json.JSONDecodeError as e: block parses the content of the “body” as JSON. If parsing fails, it displays an error message in the Streamlit app and returns None.
  • Handling Missing Body: else: st.error("Could not find 'body' in the job result response.") return None: if the “body” key is not found within response_data, this branch displays an error message in the Streamlit app and returns None.
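To see the shape of data this function expects, here is a small Streamlit-free sketch of the same parsing logic run against a hand-built job dictionary. The sample_job structure mirrors only the keys the code above reads ("status", "response", "body"); the review content is invented, and extract_body is a hypothetical name.

```python
import json


def extract_body(job_data):
    """Return the parsed 'body' of a finished job, or None if it is unavailable."""
    if not job_data or job_data.get("status") != "finished":
        return None
    response_data = job_data.get("response")
    if not isinstance(response_data, dict) or "body" not in response_data:
        return None
    try:
        return json.loads(response_data["body"])
    except (json.JSONDecodeError, TypeError):
        # Body was not valid JSON text (or not a string at all).
        return None


# Invented example: the body is a JSON string nested inside the job status.
sample_job = {
    "status": "finished",
    "response": {"body": json.dumps({"reviews": [{"text": "Great value."}]})},
}
print(extract_body(sample_job))                 # {'reviews': [{'text': 'Great value.'}]}
print(extract_body({"status": "running"}))      # None
```

The key point is the double decoding: the status response is parsed once by response.json(), and the nested body string is parsed a second time by json.loads.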

Step 9: Processing Reviews 

Here, the process_reviews_for_display function takes the product ID and raw review data (from ScraperAPI) as input, then processes each review to extract relevant information, such as text, sentiment, and potential pain points.

def process_reviews_for_display(product_id, results):
    review_data_list = []
    for review in results.get("reviews", []):
        if "text" in review:
            review_text = review["text"]
            polarity, subjectivity = analyze_sentiment(review_text)
            sentences = nltk.sent_tokenize(review_text)
            pain_points = [s.strip() for s in sentences if analyzer.polarity_scores(s)['compound'] < -0.05 and len(s.strip()) > 5]
            review_data_list.append({
                "text": review_text,
                "pain_points": pain_points,
                "sentiment": {
                    "polarity": polarity,
                    "subjectivity": subjectivity
                }
            })
    return review_data_list

The code above achieves the following:

1. Function Definition: 

It defines the process_reviews_for_display function, which accepts two arguments:

  • product_id: The ID of the product the reviews belong to.
  • results: A dictionary containing the raw review data fetched from ScraperAPI 

2. Initializing Review Data List: 

review_data_list = [] creates an empty list to store the processed information for each review.

3. Iterating Through Reviews: 

for review in results.get("reviews", []): iterates through the list of reviews, accessed using the .get("reviews", []) method on the results dictionary. This safely handles cases where the “reviews” key might be missing.

4. Processing Each Review:

  • if "text" in review:: It checks if the current review dictionary contains a “text” key, which is expected to hold the actual review text.
  • review_text = review["text"]: If the “text” key exists, its value (the review text) is assigned to the review_text variable.
  • polarity, subjectivity = analyze_sentiment(review_text): The analyze_sentiment function (defined earlier) is called on the review_text to get its sentiment polarity and subjectivity scores.
  • sentences = nltk.sent_tokenize(review_text): The review text is split into individual sentences using NLTK’s sent_tokenize function.
  • pain_points = [s.strip() for s in sentences if analyzer.polarity_scores(s)['compound'] < -0.05 and len(s.strip()) > 5]: This line identifies potential “pain points” within the review. It iterates through each sentence and calculates its compound sentiment score using VADER. If the score is below -0.05 (indicating negative sentiment) and the sentence is longer than 5 characters, the sentence is kept as a potential pain point (after removing leading/trailing whitespace).
  • review_data_list.append({...}): A dictionary containing the extracted and analyzed information for the current review is created and appended to the review_data_list.

5. Returning Processed Review Data: 

The function returns the review_data_list, which now contains a structured representation of each processed review, including its text, identified pain points, and sentiment analysis results.
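The pain-point filter is easier to follow in isolation. The sketch below reproduces the same threshold logic with two loudly labeled substitutions so that it runs without NLTK or VADER: split_sentences is a naive regex splitter standing in for nltk.sent_tokenize, and toy_score is an invented keyword scorer standing in for VADER’s compound score.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (stand-in for nltk.sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_pain_points(text, score_sentence, threshold=-0.05, min_length=5):
    """Keep sentences whose score is below the threshold and that are not trivially short."""
    return [
        s.strip()
        for s in split_sentences(text)
        if score_sentence(s) < threshold and len(s.strip()) > min_length
    ]


def toy_score(sentence):
    """Invented scorer: negative if a complaint keyword appears, positive otherwise."""
    negative_words = ("broke", "late", "refund")
    return -0.5 if any(w in sentence.lower() for w in negative_words) else 0.5


review = "The blender looks great. It broke after two days! Shipping was fast."
print(find_pain_points(review, toy_score))  # ['It broke after two days!']
```

Scoring sentence by sentence is what lets a mostly positive review still surface a single complaint, which a whole-review score would average away.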

Step 10: Building a Prompt Function to Generate the Reports

Now, we need to write and call a function that takes a list of product IDs and processed review data as input, then constructs a text prompt to send to Gemini for generating a report.

def generate_report_prompt(product_ids, processed_reviews):
    # Build one combined prompt covering all products
    prompt = f"Here are the customer reviews and their associated pain points for product IDs: {', '.join(product_ids)}:\n\n"
    for review_info in processed_reviews:
        prompt += f"Review: {review_info['text']}\n"
        if review_info['pain_points']:
            prompt += f"Pain Points: {', '.join(review_info['pain_points'])}\n"
        else:
            prompt += "Pain Points: None\n"
        prompt += f"Sentiment: Polarity={review_info['sentiment']['polarity']:.2f}, Subjectivity={review_info['sentiment']['subjectivity']:.2f}\n\n"
    prompt += "Based on these reviews, identify the key pain points for each product. Explain what sentiment polarity and subjectivity mean in the context of these reviews. Provide an overall sentiment summary for each product."
    return prompt

Here’s how the code works in detail:

1. Initializing the Prompt: 

After defining the function, prompt = f"Here are the customer reviews and their associated pain points for product IDs: {', '.join(product_ids)}:\n\n" starts building the prompt string with a header that lists the product IDs the report will cover.

2. Iterating Through Processed Reviews:

  • for review_info in processed_reviews:: Iterates through each processed review in the processed_reviews list.
  • prompt += f"Review: {review_info['text']}\n": Adds the original review text to the prompt.
  • if review_info['pain_points']:: Checks if any pain points were identified for the current review. If so, they are added as a comma-separated list; otherwise, the line "Pain Points: None" is added.
  • prompt += f"Sentiment: Polarity={review_info['sentiment']['polarity']:.2f}, Subjectivity={review_info['sentiment']['subjectivity']:.2f}\n\n": Adds the sentiment polarity and subjectivity scores for the current review to the prompt, formatted to two decimal places.

3. Adding Instructions for Gemini: 

prompt += "Based on these reviews, identify the key pain points for each product. Explain what sentiment polarity and subjectivity mean in the context of these reviews. Provide an overall sentiment summary for each product." appends the closing instructions. To help users better understand VADER’s results, these instructions tell Gemini what kind of information and analysis is expected in the generated report.

4. Returning the Prompt: 

return prompt returns the complete prompt string, which is now ready for sending to Gemini.
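As a quick illustration of what Gemini actually receives, the sketch below renders a single review entry in the same three-line format. format_review_block is a hypothetical helper that isolates the per-review part of the prompt, and the sample review and scores are invented.

```python
def format_review_block(review_info):
    """Render one processed review in the prompt's per-review format."""
    pain_points = ", ".join(review_info["pain_points"]) or "None"
    sentiment = review_info["sentiment"]
    return (
        f"Review: {review_info['text']}\n"
        f"Pain Points: {pain_points}\n"
        f"Sentiment: Polarity={sentiment['polarity']:.2f}, "
        f"Subjectivity={sentiment['subjectivity']:.2f}\n\n"
    )


# Invented sample data in the shape produced by process_reviews_for_display.
sample = {
    "text": "It broke after two days!",
    "pain_points": ["It broke after two days!"],
    "sentiment": {"polarity": -0.5, "subjectivity": 0.25},
}
print(format_review_block(sample))
```

The blank line at the end of each block separates reviews, which keeps the combined prompt readable for both the model and anyone debugging it.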

Step 11: Building the Main Application Function

This main function orchestrates the entire Streamlit application, handling user input, fetching reviews asynchronously, checking job status, processing reviews, generating reports with Gemini, and displaying the results.

def main():
    st.title("ScraperAPI Walmart Customer Reviews Analysis Tool")
    st.markdown("Enter Walmart product review details below:")
    # accept multiple product IDs and page numbers as comma-separated lists
    product_ids_input = st.text_input("Walmart Product IDs (comma separated)", "")
    tld = st.selectbox("Top Level Domain (TLD)", ["com", "ca"], index=0)
    sort_options = ["relevancy", "helpful", "submission-desc", "submission-asc", "rating-desc", "rating-asc"]
    sort = st.selectbox("Sort By", sort_options, index=0)
    pages_input = st.text_input("Page Numbers (comma separated for each product, e.g., 1,2 for first product)", "1")
    # Initialize session state variables if not already set
    if 'gemini_report' not in st.session_state:
        st.session_state.gemini_report = None
    if 'model' not in st.session_state:
        st.session_state.model = genai.GenerativeModel('gemini-1.5-flash')
    if 'review_data_prompt' not in st.session_state:
        st.session_state.review_data_prompt = None
    if 'jobs' not in st.session_state:
        st.session_state.jobs = []  # this will store dicts with product_id, page, job_id, status_url
    if 'async_results' not in st.session_state:
        st.session_state.async_results = {}  # keyed by (product_id, page)
    if 'processed_reviews' not in st.session_state:
        st.session_state.processed_reviews = {} # Keyed by product_id
    if 'reports' not in st.session_state:
        st.session_state.reports = {}  # keyed by report name
    st.sidebar.header("Previous Reports")
    if st.session_state.reports:
        selected_report_name = st.sidebar.selectbox("Select a Report", list(st.session_state.reports.keys()))
        st.sidebar.markdown("---")
        st.subheader("View Previous Report")
        st.markdown(st.session_state.reports[selected_report_name])
    else:
        st.sidebar.info("No reports generated yet.")
        st.sidebar.markdown("---")
    if st.button("Fetch Reviews (Async)"):
        if product_ids_input.strip():
            product_ids = [pid.strip() for pid in product_ids_input.split(",") if pid.strip()]
            pages_list = [p.strip() for p in pages_input.split(",")]
            st.session_state.jobs = []  # Reset jobs list for new submission
            st.session_state.async_results = {} # Reset results
            st.session_state.processed_reviews = {} # Reset processed reviews
            if len(pages_list) == 1:
                # Apply the same page number to all products
                pages_per_product = [int(pages_list[0])] * len(product_ids)
            elif len(pages_list) == len(product_ids):
                # Use the provided page numbers for each product
                pages_per_product = [int(p) for p in pages_list if p.isdigit()]
                if len(pages_per_product) != len(product_ids):
                    st.error("Number of page numbers must match the number of product IDs or be a single value.")
                    return
            else:
                st.error("Number of page numbers must match the number of product IDs or be a single value.")
                return
            for i, pid in enumerate(product_ids):
                page = pages_per_product[i]
                try:
                    async_response = fetch_async_reviews(SCRAPERAPI_KEY, pid, tld, sort, page)
                    job_id = async_response.get("id")
                    status_url = async_response.get("statusUrl")
                    st.session_state.jobs.append({
                        "product_id": pid,
                        "page": page,
                        "job_id": job_id,
                        "status_url": status_url
                    })
                    st.info(f"Submitted job for Product {pid} Page {page}: Job ID {job_id}")
                except Exception as e:
                    st.error(f"Error submitting async request for Product {pid} Page {page}: {e}")
            st.session_state.gemini_report = None  # Reset report on new fetch
        else:
            st.warning("Please enter at least one Walmart Product ID.")
    st.markdown("---")
    st.subheader("Async Job Status")
    if st.session_state.jobs:
        if st.button("Check Job Status"):
            jobs = st.session_state.jobs
            results_dict = {}  # Key: (product_id, page), Value: job result (JSON)
            with concurrent.futures.ThreadPoolExecutor() as executor:
                future_to_job = {executor.submit(check_job_status, job["status_url"]): job for job in jobs}
                for future in concurrent.futures.as_completed(future_to_job):
                    job = future_to_job[future]
                    try:
                        status_data = future.result()
                        st.write(f"Job for Product {job['product_id']} Page {job['page']} status: {status_data.get('status')}")
                        if status_data.get('status') == 'finished':
                            job_result = get_job_result(status_data)
                            if job_result and isinstance(job_result, dict) and "reviews" in job_result:
                                results_dict[(job["product_id"], job["page"])] = job_result
                                st.success(f"Job for Product {job['product_id']} Page {job['page']} finished.")
                            else:
                                st.error(f"Unexpected async results structure for Product {job['product_id']} Page {job['page']}.")
                        elif status_data.get('status') == 'failed':
                            st.error(f"Job for Product {job['product_id']} Page {job['page']} failed: {status_data.get('error')}")
                    except Exception as e:
                        st.error(f"Error checking job for Product {job['product_id']} Page {job['page']}: {e}")
            st.session_state.async_results = results_dict
            # Process all finished jobs and aggregate reviews per product
            processed_reviews_per_product = {}
            for (pid, p), job_result in results_dict.items():
                processed_reviews = process_reviews_for_display(pid, job_result)
                if pid not in processed_reviews_per_product:
                    processed_reviews_per_product[pid] = []
                processed_reviews_per_product[pid].extend(processed_reviews)
            st.session_state.processed_reviews = processed_reviews_per_product
            if st.session_state.processed_reviews and st.session_state.gemini_report is None:
                prompt = generate_report_prompt(list(st.session_state.processed_reviews.keys()),
                                                [review for reviews in st.session_state.processed_reviews.values() for review in reviews])
                st.session_state.review_data_prompt = prompt
                with st.spinner("Generating combined report with Gemini..."):
                    report = generate_gemini_report(st.session_state.model, prompt)
                    st.session_state.gemini_report = report
                    report_name = f"Combined Report for Products {', '.join(st.session_state.processed_reviews.keys())} - {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
                    st.session_state.reports[report_name] = report
    else:
        st.info("No async job submitted yet.")
    # Display the processed review data per product
    if st.session_state.processed_reviews:
        st.markdown("---")
        st.subheader("Processed Review Data:")
        for pid, reviews in st.session_state.processed_reviews.items():
            st.markdown(f"**Product ID: {pid}**")
            for i, review in enumerate(reviews):
                st.markdown(f"- **Review {i+1}:**")
                st.markdown(f"  - **Text:** {review['text']}")
                if review['pain_points']:
                    st.markdown(f"  - **Pain Points:** {', '.join(review['pain_points'])}")
                else:
                    st.markdown("  - **Pain Points:** None")
                st.markdown(f"  - **Sentiment:** Polarity={review['sentiment']['polarity']:.2f}, Subjectivity={review['sentiment']['subjectivity']:.2f}")
            st.markdown("---")
    if st.session_state.gemini_report:
        st.markdown("---")
        st.subheader("Generated Walmart Customer Reviews Analysis Report")
        st.markdown(st.session_state.gemini_report)
        st.markdown("### Ask Additional Questions")
        user_query = st.text_input("Enter your question here", key="user_query")
        if st.button("Submit Question"):
            if st.session_state.review_data_prompt is not None:
                question_prompt = (
                        "You are an expert in customer review analysis and product evaluation. "
                        "Based on the following review data, provide a detailed, critically evaluated answer to the question below.\n\n"
                        "Review Data:\n" + st.session_state.review_data_prompt + "\n\n"
                        "Question: " + user_query
                )
                with st.spinner("Generating answer from Gemini..."):
                    answer = generate_gemini_report(st.session_state.model, question_prompt)
                st.markdown("### Answer to Your Question")
                st.markdown(answer)
            else:
                st.markdown("Please generate a report first.")
if __name__ == "__main__":
    main()

The code above defines the main function that controls the flow and user interactions of the Streamlit application. Here’s a detailed breakdown of what the code does:

1. Sets up the User Interface:

  • st.title(...) and st.markdown(...): Display the title and introductory text of the application.
  • st.text_input(...), st.selectbox(...): Create input fields for users to enter their Walmart product IDs, select the top-level domain, choose the sorting method for reviews, and enter page numbers.

2. Managing Application State:

  • It initializes several st.session_state variables. Session state allows the application to remember information across user interactions, such as generated reports, fetched jobs, and processed reviews. This way, we prevent data from being reset on every rerun.
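The repeated "if key not in st.session_state" checks follow one pattern, which could be condensed into a small helper. This is only a sketch: init_state and DEFAULTS are hypothetical names, and a plain dictionary stands in for st.session_state here, on the assumption that the session state object supports the same membership test and item assignment.

```python
def init_state(state, defaults):
    """Set each missing key to a fresh default value, leaving existing keys alone."""
    for key, factory in defaults.items():
        if key not in state:
            state[key] = factory()  # call the factory so mutable defaults are not shared
    return state


# Factories (not values) so every session gets its own list or dict.
DEFAULTS = {
    "gemini_report": lambda: None,
    "review_data_prompt": lambda: None,
    "jobs": list,
    "async_results": dict,
    "processed_reviews": dict,
    "reports": dict,
}

# Simulate a rerun where "jobs" already holds data from an earlier interaction.
state = {"jobs": [{"product_id": "123"}]}
init_state(state, DEFAULTS)
print(state["jobs"])     # [{'product_id': '123'}]
print(state["reports"])  # {}
```

Existing values survive the call, which is the behavior that lets reports and jobs persist across reruns.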

3. Displaying Previous Reports:

  • Further down, it checks if any reports have been generated previously (stored in st.session_state.reports). If so, it displays a sidebar with a dropdown to select and view these previous reports.

4. Handling Review Fetching: When the user clicks the “Fetch Reviews (Async)” button:

  • It parses the entered product IDs and page numbers.
  • It resets the jobs, async_results, and processed_reviews session state variables for a new request.
  • It iterates through the provided product IDs and pages, calling the fetch_async_reviews function to submit a scraping job to ScraperAPI for each.
  • It stores information about each submitted job (product ID, page, job ID, status URL) in the st.session_state.jobs list.
  • It displays messages to the user indicating the submission of each job.
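The page-number rules (one page for all products, or one page per product) can be pulled out into a pure function, which makes them easy to test. The sketch below is a slightly stricter variant of the logic in main, not a copy of it: resolve_pages is a hypothetical name, and unlike the original it rejects non-numeric input in the single-page case too.

```python
def resolve_pages(product_ids, pages_input):
    """Map a comma-separated page string to one page number per product ID."""
    pages = [p.strip() for p in pages_input.split(",") if p.strip()]
    if not pages or not all(p.isdigit() for p in pages):
        raise ValueError("Page numbers must be whole numbers.")
    if len(pages) == 1:
        # Apply the same page number to every product
        return [int(pages[0])] * len(product_ids)
    if len(pages) == len(product_ids):
        # One page number per product, in order
        return [int(p) for p in pages]
    raise ValueError(
        "Number of page numbers must match the number of product IDs or be a single value."
    )


print(resolve_pages(["111", "222"], "3"))     # [3, 3]
print(resolve_pages(["111", "222"], "1, 2"))  # [1, 2]
```

Inside the app, the ValueError message could be passed to st.error before returning early, matching the existing behavior.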

5. Checking Asynchronous Job Status: 

Here, when the user clicks the “Check Job Status” button:

  • It iterates through the jobs stored in st.session_state.jobs.
  • It uses a concurrent.futures.ThreadPoolExecutor to concurrently call the check_job_status function for each job’s status URL, improving efficiency.
  • For each job, it displays the current status.
  • If a job is finished, it calls get_job_result to retrieve the scraped data.
  • If the data is successfully retrieved, it stores it in the st.session_state.async_results dictionary.
  • It then calls process_reviews_for_display to analyze the sentiment and extract pain points from the retrieved reviews, storing the processed data in st.session_state.processed_reviews, organized by product ID.
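The concurrent status check can be tried out on its own with a stubbed status function. In this sketch, check_all and fake_check_status are hypothetical names, the URLs are placeholders, and the "error" status is an invented marker for a check that raised; only the executor pattern itself mirrors the code in main.

```python
import concurrent.futures


def check_all(jobs, check_status):
    """Run check_status for every job concurrently and collect results by (product_id, page)."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Map each future back to the job it belongs to
        future_to_job = {
            executor.submit(check_status, job["status_url"]): job for job in jobs
        }
        for future in concurrent.futures.as_completed(future_to_job):
            job = future_to_job[future]
            key = (job["product_id"], job["page"])
            try:
                results[key] = future.result()
            except Exception as exc:
                # One failed check should not stop the others
                results[key] = {"status": "error", "error": str(exc)}
    return results


def fake_check_status(url):
    """Stub: succeeds for one placeholder URL and raises for any other."""
    if url == "https://example.invalid/ok":
        return {"status": "finished"}
    raise ValueError("boom")


jobs = [
    {"product_id": "A", "page": 1, "status_url": "https://example.invalid/ok"},
    {"product_id": "B", "page": 2, "status_url": "https://example.invalid/bad"},
]
results = check_all(jobs, fake_check_status)
print(results[("A", 1)])  # {'status': 'finished'}
print(results[("B", 2)])  # {'status': 'error', 'error': 'boom'}
```

Because as_completed yields futures in the order they finish, results are keyed by product and page rather than relying on submission order.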

6. Generating the Gemini Report: 

After the jobs are checked and reviews are processed, if there are processed reviews and a report hasn’t been generated yet:

  • It calls generate_report_prompt to create a prompt for the Gemini model based on the processed reviews.
  • It uses st.spinner to display a loading message while calling generate_gemini_report to get the report from Gemini.
  • It stores the generated report in st.session_state.gemini_report and saves it as a named report in st.session_state.reports.

7. Displaying Results:

  • It displays the processed review data (text, pain points, sentiment) for each product.
  • If a Gemini report has been generated, it displays the report.
  • It provides an input field for the user to ask additional questions about the reviews, which are then sent to Gemini for an answer.

8. Running the Application:

  • if __name__ == "__main__": main(): This standard Python construct ensures that the main() function is executed when the script is run directly.

In essence, the main function ties together all the other functions in the script to create a functional web application that allows a user to fetch, process, and analyze Walmart customer reviews using ScraperAPI and Gemini. It manages the user interface, handles user interactions, orchestrates the data fetching and processing pipelines, and displays the results in a suitable, user-friendly format.

Here’s a snippet of what the tool’s UI looks like:

Run the application to fetch, process, and analyze Walmart customer reviews using ScraperAPI and Gemini

Deploying the Walmart Reviews Analysis App Using Streamlit 

Here’s how to deploy our Walmart analysis app on Streamlit for free cloud hosting in just a few steps:

Step 1: Set Up a GitHub Repository

Streamlit requires your project to be hosted on GitHub.

1. Create a New Repository on GitHub

Create a new repository on GitHub and set it as public.

2. Push Your Code to GitHub

If you haven’t already set up Git and linked your repository, use the following commands in your terminal:

git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin https://github.com/YOUR_USERNAME/walmart_reviews_tool.git
git push -u origin main

Step 2: Store Your Gemini Token as an Environment Variable

Before deploying your app, you have to securely store your Gemini token within your system as an environment variable to protect it from misuse by others.

1. Set Your Token As an Environment Variable (Locally):

  • macOS/Linux
export GOOGLE_API_TOKEN="your_token"
  • Windows (PowerShell)
$env:GOOGLE_API_TOKEN="your_token"
  • Use os.environ to retrieve the token within your script:
import os
GOOGLE_API_TOKEN = os.environ.get("GOOGLE_API_TOKEN")
if GOOGLE_API_TOKEN is None:
    print("Error: Google API token not found in environment variables.")
    # Handle errors
else:
    # Use GOOGLE_API_TOKEN in your Google Developer API calls
    print("Google API token loaded successfully")
  • Restart your code editor.
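If the app depends on more than one secret, a small fail-fast loader can check them all at startup. The sketch below is one possible approach: load_required_env is a hypothetical helper, GOOGLE_API_TOKEN is the variable set above, and SCRAPERAPI_KEY is only an assumed name for an environment variable holding the ScraperAPI key.

```python
import os


def load_required_env(names, environ=None):
    """Return the requested variables, or raise if any are missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# A plain dict is passed here so the example does not depend on the real environment.
fake_env = {"GOOGLE_API_TOKEN": "your_token", "SCRAPERAPI_KEY": "your_key"}
secrets = load_required_env(["GOOGLE_API_TOKEN", "SCRAPERAPI_KEY"], environ=fake_env)
print(sorted(secrets))  # ['GOOGLE_API_TOKEN', 'SCRAPERAPI_KEY']
```

Calling it without the environ argument reads from os.environ, so a missing token stops the app with a clear message instead of failing later inside an API call.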

Step 3: Create a requirements.txt file

Streamlit needs to know what dependencies your app requires. 

1. In your project folder, create a file named requirements.txt.

2. Add the following dependencies:

streamlit
requests
google-generativeai
nltk

3. Save the file and commit it to GitHub:

git add requirements.txt
git commit -m "Added dependencies"
git push origin main

4. Do the same for the app.py file containing all your code:

git add app.py 
git commit -m "Added app script" 
git push origin main

Step 4: Deploy on Streamlit Cloud

1. Go to Streamlit Community Cloud.

2. Click “Sign in with GitHub” and authorize Streamlit.

3. Click “Create App.” 

4. Select “Deploy a public app from GitHub repo.”

5. In the repository settings, enter:

  • Repository: YOUR_USERNAME/walmart_reviews_tool
  • Branch: main
  • Main file path: app.py (or whatever your Streamlit script is named)

6. Click “Deploy” and wait for Streamlit to build the app.

Step 5: Get Your Streamlit App URL

After deployment, Streamlit will generate a public URL (e.g., https://your-app-name.streamlit.app). You can now share this link to allow others access to your tool!

Conclusion

In this tutorial, we’ve built a powerful tool that combines ScraperAPI’s Walmart Reviews Async Endpoint with VADER for sentiment analysis and Gemini for generating insightful reports, all presented through a clean and interactive Streamlit interface.

This application can support market research by identifying customer sentiment trends, highlighting potential pain points early, and enabling competitor analysis through examining how shoppers react to similar products.

Ready to build your own? Start using ScraperAPI today and turn raw Walmart customer reviews into valuable business insights!

The post Build a Walmart Reviews Analysis Tool Using ScraperAPI, VADER, Gemini, and Streamlit appeared first on ScraperAPI.

]]>
The 12 Best Apify Alternatives for Web Scraping in 2025 https://www.scraperapi.com/blog/apify-alternatives/ Thu, 27 Feb 2025 16:53:50 +0000 https://www.scraperapi.com/?p=7303 Confused by Apify’s complexities and looking for some Apify alternatives? Look no further. This guide helps you navigate the best Apify alternatives for web scraping in 2025, with detailed comparisons to help you make the right choice for your needs. Why Are People Searching for Apify Alternatives? Apify is a powerful tool, but it has […]

The post The 12 Best Apify Alternatives for Web Scraping in 2025 appeared first on ScraperAPI.

]]>

Confused by Apify’s complexities and looking for some Apify alternatives? Look no further. This guide helps you navigate the best Apify alternatives for web scraping in 2025, with detailed comparisons to help you make the right choice for your needs.

Why Are People Searching for Apify Alternatives?

Apify is a powerful tool, but it has its challenges:

  • Complexity: Apify can be hard to navigate for beginners.
  • Pricing: The pricing structure might not suit smaller businesses.
  • Features: You might need more specialized features or simpler workflows.

If you’re facing these challenges, exploring alternatives will help you find a tool that better matches your requirements.

ScraperAPI – The Best Apify Alternative in 2025

ScraperAPI website homepage

ScraperAPI is an all-in-one solution for web scraping, offering unmatched features and ease of use compared to Apify. Designed to simplify the web scraping process while providing scalability and cost efficiency, ScraperAPI excels in areas where Apify falls short.

Why ScraperAPI is Better

Advanced Proxy Handling:

  • ScraperAPI uses AI-driven machine learning and statistical analysis to optimize proxy usage, ensuring requests are routed intelligently. This reduces overhead costs and enhances success rates.
  • Unlike Apify, which requires manual setup for proxy handling, ScraperAPI automates the entire process, saving you time and effort.

Scalability for High-Volume Scraping:

  • ScraperAPI’s Async Scraper feature is built for large-scale data extraction, handling millions of requests concurrently without performance bottlenecks.
  • Apify offers scalability but lacks the dedicated asynchronous capabilities that ScraperAPI provides for seamless high-volume scraping.

Automated Data Pipelines:

  • With DataPipeline endpoints, ScraperAPI enables users to schedule recurring scraping jobs, eliminating the need for constant manual intervention.
  • Apify workflows require more setup time, and their automation options aren’t as intuitive or robust as ScraperAPI’s.

Ease of Integration:

  • ScraperAPI’s straightforward API design allows developers to integrate it with Python, JavaScript, or other languages in minutes.
  • While Apify supports similar integrations, its learning curve is steep for beginners, making ScraperAPI the more user-friendly option.
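As a rough illustration of what such an integration can look like in Python, the sketch below only builds the request URL. It assumes the `https://api.scraperapi.com/` endpoint and the `api_key`, `url`, `render`, and `country_code` query parameters from ScraperAPI's public documentation; `build_request_url` is a hypothetical helper name, and the key and target URL are placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the current ScraperAPI documentation.
API_ENDPOINT = "https://api.scraperapi.com/"


def build_request_url(api_key, target_url, render=False, country_code=None):
    """Build the GET URL that asks the API to fetch target_url on our behalf."""
    params = {"api_key": api_key, "url": target_url}
    if render:
        params["render"] = "true"  # request JavaScript rendering
    if country_code:
        params["country_code"] = country_code  # geotargeting, e.g. "us"
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder key and URL, for illustration only.
    print(build_request_url("YOUR_API_KEY", "https://example.com/", render=True))
```

The resulting URL can be passed to any HTTP client; the helper itself makes no network call.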

Pricing Advantage:

  • ScraperAPI’s pricing starts at $49/month, making it accessible to small businesses and individual developers.
  • Apify’s pricing starts at $49/month, but costs escalate quickly depending on the usage tier.

Built-in Anti-Bot Measures:

  • ScraperAPI handles captchas, JavaScript rendering, and IP bans seamlessly, ensuring uninterrupted scraping.
  • Apify users often rely on third-party tools or custom configurations for similar functionalities.

Pros and Cons of ScraperAPI

Pros Cons
AI-optimized proxy handling Limited UI for no-code users
High scalability for large data volumes  
Affordable pricing structure  
Seamless integration with popular tools  
Automated scraping pipelines  

Pricing

ScraperAPI and Apify have similar entry-level prices. The notable differences are that Apify’s price climbs steeply on higher plans and that Apify does not clearly describe what each plan includes, whereas ScraperAPI’s pricing is transparent and can be read at a glance.

Apify is priced per compute unit (CU), calculated by multiplying the amount of memory allocated (in GB) by the length of time a job runs: ‘1 compute unit = 1 GB memory x 1 hour’. In contrast, ScraperAPI is priced per API credit, and the lowest-priced plan gives you 100,000 API credits, equivalent to about 20,000 e-commerce pages of data. There is no straightforward way to calculate the number of pages a CU can get you.
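To make the compute-unit formula concrete, here is a minimal sketch of the arithmetic. The function name and the sample run (4 GB for 30 minutes) are illustrative assumptions, not figures from Apify.

```python
def compute_units(memory_gb, hours):
    """Apply the quoted formula: 1 compute unit = 1 GB of memory x 1 hour."""
    return memory_gb * hours


# Hypothetical run: 4 GB of memory for 30 minutes.
print(compute_units(4, 0.5))  # 2.0
```

The formula tells you how much compute a run consumed, not how many pages it scraped, which is why CUs are hard to translate into a page count.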

Plan Price
Hobby $49
Startup $149
Business $299

10 Other Notable Apify Alternatives for 2025

1. Bright Data

Bright Data - ticketi proxies

Bright Data offers proxy solutions and a web scraping platform with extensive customization options. It’s popular among enterprises for its vast proxy network and advanced tools. With its cutting-edge technology, customizable features, and focus on transparency, Bright Data allows businesses to access and utilize public web data efficiently while adhering to ethical standards.

Features:

  • Large proxy pool for various use cases
  • Supports market research, ad verification, and more
  • Compatible with headless browsers
  • Advanced IP rotation and targeting options

Pros and Cons:

Pros Cons
Extensive proxy network Expensive for small-scale projects
Reliable customer support Steep learning curve
Flexible plans for enterprise users  

Pricing:

Bright Data is priced per GB, with plans starting at $499/month, which is higher than Apify and makes it less viable for smaller businesses or individuals. Apify is priced per compute unit, and the cost varies depending on your subscription. While Bright Data focuses more on proxies and large-scale scraping, Apify offers a broader range of automation tools.

Plans Price
Pay as you go $8.4 / GB
Small $499 monthly
Large business $999 monthly
Enterprise $1999 monthly

Trustpilot and G2 reviews

  • Trustpilot review – 4.5 out of 5
  • G2 review – 4.7 out of 5

Read more: Explore some of the best Bright Data alternatives.

2. Octoparse

Octoparse is a no-code web scraping tool that simplifies data extraction for non-technical users. Its pre-built templates and visual interface make it an ideal choice for beginners and professionals looking for a user-friendly solution to collect and manage data efficiently.

Features:

  • Drag-and-drop interface
  • Predefined templates for common use cases
  • Exports data in CSV, Excel, and JSON formats
  • Cloud-based operation for seamless access

Pros and Cons:

Pros Cons
User-friendly for beginners Limited scalability for large projects
Affordable pricing for small tasks Slow data extraction on free plans
Excellent documentation  

Pricing:

Octoparse is a no-code web scraping tool; its price starts at $99 per month, which is more than Apify’s entry price. Unlike Apify, Octoparse is designed for non-developers who want a simple, user-friendly interface, positioning itself as a mid-tier solution. Octoparse has a stronger focus on simplicity and quick setup for basic scraping tasks, making it a good fit for beginners.

Plan Price
Standard Plan $99
Professional Plan $249
Enterprise Plan Custom

Trustpilot and G2 reviews

  • Trustpilot review – 2.9 out of 5
  • G2 review – 4.3 out of 5

3. ParseHub

Parsehub scraping infrastructure dashboard page

ParseHub provides a visual data extraction interface suitable for complex scraping workflows. Its versatility with dynamic websites makes it a popular choice among technical users. With its intuitive visual interface, ParseHub allows you to create scraping workflows without requiring coding skills. It supports automation, scheduling, and multi-platform compatibility, making it ideal for technical and non-technical users.

Features:

  • Works on dynamic websites
  • Supports scheduling and automation
  • Free plan available with limited features
  • Multi-platform support for flexibility

Pros and Cons:

Pros Cons
Handles dynamic websites well Limited support for concurrent tasks
Offers a free plan  
Supports advanced workflows  

Pricing:

ParseHub offers a free plan with limited features, but its paid plans can go up to $600 monthly. It provides powerful scraping capabilities for dynamic websites but is more expensive for smaller users. Compared to Apify, ParseHub is better for non-technical users due to its intuitive interface and stronger focus on handling JavaScript-heavy websites with minimal configuration. If price is your primary concern, Apify might be a better choice.

Plan Price
Standard $189
Professional $599
ParseHub Plus Custom

Trustpilot and G2 reviews

  • Trustpilot review – 0.0 out of 5
  • G2 review – 4.3 out of 5

Read more: Explore some of the best Parsehub alternatives.

4. PhantomBuster

Phantombuster dashboard

PhantomBuster automates web scraping and social media data collection using APIs and workflows. With pre-built APIs and integrations for LinkedIn, Twitter, and other networks, PhantomBuster enables users to collect data, automate repetitive tasks, and export results in various formats. Its user-friendly interface and Zapier integration make it popular for marketers and growth hackers looking to streamline social media activities and data collection.

Features:

  • Automates LinkedIn, Twitter, and other platforms
  • Cloud-based with no installation needed
  • Exports data in various formats
  • Integrates with third-party automation tools like Zapier

Pros and Cons:

Pros Cons
Great for social media scraping Limited features for broader scraping
Cloud-based and easy-to-use Expensive for advanced workflows
Supports Zapier integration  

Pricing:

PhantomBuster specializes in social media scraping and automation, starting at $69 monthly. It is one of the more budget-friendly tools, particularly for users focused on social media-related tasks. PhantomBuster is a niche-focused scraping tool that automates workflows for platforms like LinkedIn, Instagram, and Twitter. Unlike Apify, it is specifically designed for lead generation, social media scraping, and automation tasks.

Plan Price
Starter $69
Pro $159
Team $439

Trustpilot and G2 reviews

  • Trustpilot review – 2.8 out of 5
  • G2 review – 4.3 out of 5

5. Zyte (formerly Scrapinghub)

Zyte API AI-Driven Proxy Rotation dashboard page

Zyte provides tools like Smart Proxy Manager and web scraping APIs for enterprise-grade data extraction. It simplifies the process of collecting structured data from the web, making it an excellent choice for companies with large-scale data needs, such as market research, price monitoring, and competitive analysis.

Features:

  • Rotating proxies with anti-blocking capabilities
  • Data-as-a-service offerings
  • Supports Python-based Scrapy framework
  • Customizable tools for specific needs

Pros and Cons:

Pros Cons
Excellent proxy management Expensive for small-scale users
Wide range of tools for developers Requires technical expertise

Pricing:

Zyte offers many packages, so it is best to contact Zyte’s team for a quote for your project. The pricing listed on its website begins at $450 per month, which is expensive for small and medium businesses, whereas Apify is affordable for both small and large businesses. At that price point, Zyte offers a managed scraping service, automatic extraction, and proxy management. It is designed for enterprise-level scraping, focusing on large-scale, high-complexity projects that require little user effort. So, if your project is large, Zyte might be a good option.

Plan Price (starting at)
Custom $450/month
Standard $450/month

Trustpilot and G2 reviews

  • Trustpilot review – 2.0 out of 5
  • G2 review – 4.4 out of 5

Read more: Explore some of the best Zyte alternatives.

6. Scrapy

Scrapy infrastructure dashboard page

Scrapy is an open-source web scraping framework for developers familiar with Python. Its flexibility and extensive libraries make it ideal for large-scale and complex projects. Its modular design and extensive documentation suit everything from small projects to large-scale scraping workflows.

Features:

  • Highly customizable for complex projects
  • Supports integrations with middleware and pipelines
  • Free to use but requires setup
  • Strong community support and regular updates

Pros and Cons:

Pros Cons
Completely free Requires significant technical skills
Great for large-scale projects No customer support
Active developer community  

Pricing:

Free

Trustpilot or G2 reviews

  • Trustpilot review – N/A
  • G2 review – N/A

7. Diffbot

Diffbot specializes in AI-powered data extraction and knowledge graph generation. It’s ideal for unstructured data extraction and machine learning applications. Diffbot uses natural language processing and computer vision to extract insights from websites, offering fully automated, customizable web scraping solutions without coding. Its API suite supports tasks like article extraction, product data scraping, and sentiment analysis, making it a go-to tool for enterprises looking to gather actionable data at scale.

Features:

  • Extracts structured data from unstructured web content
  • API integration for automated workflows
  • Powerful for machine learning applications
  • Knowledge graph generation for enhanced insights

Pros and Cons:

Pros Cons
AI-powered data extraction Very expensive
Great for unstructured data Limited no-code features
Integrates with ML workflows  

Pricing:

Diffbot is an AI-powered data extraction tool, starting at $299 per month, which makes it significantly more expensive than Apify and impractical for individuals with limited budgets. This premium pricing caters to users seeking advanced features like unstructured data extraction and knowledge graph creation for large-scale projects.

For context:

  • Diffbot Startup plan ($299): 250k API credits, equivalent to 250,000 pages (of any type) scraped. A more detailed explanation of credit deduction is available from Diffbot. By contrast, there is no straightforward way to calculate the number of pages an Apify compute unit (CU) can get you.

Plan Price
STARTUP  $299/Month
PLUS  $899/Month
ENTERPRISE Custom

Trustpilot or G2 reviews

  • Trustpilot review – N/A
  • G2 review – 4.9

Read more: Explore some of the best Diffbot alternatives.

8. Mozenda

Mozenda homepage

Mozenda is a cloud-based web scraping tool tailored for business users. It provides a user-friendly interface for building and managing scraping workflows without requiring programming knowledge. Known for its business-oriented approach, Mozenda is ideal for extracting actionable insights from large-scale datasets, empowering users with automation and scalability for their data collection needs.

Features:

  • Intuitive point-and-click interface
  • Supports data export in multiple formats
  • Cloud-based with scheduling capabilities
  • Strong customer support for troubleshooting

Pros and Cons:

Pros Cons
Simple interface for business users Lacks advanced proxy features
Supports various export formats  

Pricing:

Mozenda does not disclose its pricing publicly. Its plans are customized based on specific user requirements, making it a less transparent option than Apify, which is affordable, has transparent pricing, and suits both small and large-scale projects. Mozenda is still an excellent choice for non-technical users, emphasizing simplicity and ease of use and offering enterprise-level support and managed services.

Trustpilot or G2 reviews

  • Trustpilot review – N/A
  • G2 review – 4.1

9. ScrapingBee

ScrapingBee scraping infrastructure dashboard

ScrapingBee is a developer-focused web scraping API designed to make data extraction seamless, especially from JavaScript-heavy websites. Its lightweight API enables users to focus on extracting clean, structured data without worrying about the complexities of rendering dynamic web pages. ScrapingBee focuses on providing a hassle-free scraping experience with integrated proxy management.

Features:

  • Headless browser support
  • Built-in proxy rotation
  • Optimized for rendering JavaScript-heavy websites

Pros and Cons:

Pros Cons
Great for JavaScript-heavy websites No free plan
Easy setup and integration No clear concurrent thread limits per plan
Reliable proxy management  

Pricing:

ScrapingBee starts at $49 monthly, offering integrated proxy management and JavaScript rendering. It is a solid mid-range option for small to medium-sized scraping projects. ScrapingBee is better for simple API-based scraping, offering a straightforward solution for quick scraping tasks. It handles web scraping challenges such as CAPTCHAs automatically, making it ideal for developers who want a tool with a gentle learning curve. In contrast, Apify is more versatile and offers greater functionality, but requires more setup.

Plan Price/month
Freelance $49
Startup $99
Business $249
Business + $599

Trustpilot or G2 reviews

  • Trustpilot review – 0.0/5
  • G2 review – 0.0/5

Read more: Explore some of the best ScrapingBee alternatives.

10. Lobstr.io

Lobstr.io offers a no-code web scraping platform and API built for speed, simplicity, and affordability. Its user-friendly tools make it popular among startups and small to medium businesses.

With powerful automation features, integrated email enrichment, and a developer-friendly API, Lobstr.io helps businesses collect and use public web data without the technical overhead.

Features:

  • 20+ no-code web scraping tools
  • Lead scraping tools with email validation and enrichment
  • Schedule feature for continuous/automated data collection
  • Direct export to Google Sheets, S3, SFTP, and other platforms
  • Developer-friendly API

Pros and Cons:

Pros Cons
Scheduling feature Fewer scrapers than Apify
Suitable for SMBs No pay-as-you-go (yet)
Affordable pricing plans  

Pricing:

Lobstr.io generally costs around €0.025 per 1,000 rows. It offers monthly subscription plans starting at €50/month.

Unlike Apify, Lobstr.io uses a credit-based system — where credits are consumed per row, per email, or per extra data point.

You only pay for the data you get.

There are no rental or compute unit costs, which makes it more budget-friendly than Apify.

Plan Price/month
Free forever €0
Premium €50
Business $250
Enterprise $500

Trustpilot and Capterra reviews

  • Trustpilot review – 4.1 out of 5
  • Capterra review – 5 out of 5

Apify Alternatives Compared

Products | Starting Price | Free Trial | Key Value | Why ScraperAPI is Better
Apify | $49/month | Yes | Feature-rich but complex | Simpler, more scalable APIs
ScraperAPI | $49/month | Yes | All-in-one scraping solution with AI-optimized proxies | More affordable and efficient
Bright Data | $499/month | Yes | Extensive proxy network | More cost-effective for most use cases
Octoparse | $99/month | Yes | No-code interface | Scales better for large data needs
ParseHub | $189/month | Yes | Visual workflows | Better proxy handling
PhantomBuster | $69/month | Yes | Social media automation | Broader scraping capabilities
Zyte | $450/month | No | Enterprise tools | Affordable for small-to-medium projects
Scrapy | Free | Free | Open source | Plug-and-play, easier to use
Diffbot | $299/month | Yes | AI-powered extraction | More affordable
Mozenda | Custom pricing | Yes | Business-oriented interface | Better proxy management
ScrapingBee | $49/month | No | JavaScript rendering | Broader features and scraping flexibility
Lobstr.io | €50/month | Yes | No-code scrapers with task scheduling and lead data collection | Supports higher scale and a broader feature set

Factors to Consider When Choosing Apify Alternatives

  • Ease of Use: Look for a tool that matches your technical skills
  • Pricing: Ensure the tool fits your budget
  • Scalability: Consider how well the tool handles large projects
  • Features: Identify features critical for your use case
  • Support: Reliable customer support can save you time and headaches

Wrapping Up: Choosing an Apify Alternative

The best Apify alternative depends on your specific needs, but when evaluating alternatives to Apify, ScraperAPI stands out as the most effective solution for web scraping in 2025. Its affordability, ease of use, scalability, and advanced features set it apart from others.

Key Advantages of ScraperAPI

  1. Unmatched Pricing and Transparency
    While many competitors, such as Bright Data and ParseHub, charge based on bandwidth or use complex pricing models, ScraperAPI’s straightforward credit-based system is transparent and predictable. For $49/month, you get 100,000 API credits, enough to scrape approximately 20,000 eCommerce pages, making it far more cost-effective than Apify or other tools like Zyte and Bright Data.
  2. Scalability at Any Level
    ScraperAPI is purpose-built to handle high-volume scraping tasks. Features like the Async Scraper and DataPipeline allow you to handle millions of requests simultaneously, schedule recurring jobs, and manage large-scale projects with minimal manual intervention. This scalability ensures it can grow with your needs, whether you’re a solo developer or managing enterprise-level projects.
  3. AI-Optimized Proxy Handling
    One of ScraperAPI’s standout features is its AI-driven proxy rotation system, which uses machine learning to optimize requests. This ensures higher success rates, reduces costs by only rotating proxies when necessary, and handles anti-bot measures like captchas and IP bans. Unlike tools that require manual configuration, ScraperAPI automates these tasks, saving time and effort.
  4. Structured Data Endpoints (SDEs) for Cleaner Data
    ScraperAPI’s SDEs simplify data cleaning and automation by delivering well-structured, ready-to-analyze data. This reduces the need for post-processing, enabling faster decision-making and smoother workflow integration. For businesses focused on efficiency, this is a game-changer.
  5. Ease of Integration and Automation
    With a developer-friendly API that integrates seamlessly with Python, JavaScript, and other programming languages, ScraperAPI eliminates the steep learning curve often associated with tools like Apify. Its built-in scheduling and automation tools make it easy to streamline scraping workflows without extensive technical expertise.
  6. Reliability and Customer Support
    Unlike some alternatives, ScraperAPI is supported by a dedicated team that proactively handles site changes and optimizes performance. This ensures consistently high success rates, giving users peace of mind even when scraping dynamic or highly-protected websites.
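The credit arithmetic behind point 1 can be sketched in a few lines. The figure of 5 credits per eCommerce page is inferred from the numbers above (100,000 credits for roughly 20,000 pages); the 1-credit case is an assumption for a standard page, and actual credit costs depend on the target site.

```python
HOBBY_CREDITS = 100_000  # credits included at $49/month, as quoted above


def pages_for_credits(credits, credits_per_page):
    """Return how many whole pages a credit balance covers."""
    return credits // credits_per_page


# 5 credits per page is inferred from 100,000 credits ~ 20,000 eCommerce pages.
print(pages_for_credits(HOBBY_CREDITS, 5))  # 20000
# Assumed cost of 1 credit for a standard page.
print(pages_for_credits(HOBBY_CREDITS, 1))  # 100000
```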

With its powerful features, transparent pricing, and focus on scalability and automation, ScraperAPI is the ideal alternative to Apify for 2025. Whether you’re extracting data for business intelligence, eCommerce, or research, ScraperAPI delivers the efficiency, flexibility, and affordability you need. It’s not just a tool; it’s a complete web scraping solution designed to help you quickly achieve your goals.

Based on this, ScraperAPI is your best choice for web scraping in 2025.

Happy Scraping!

The post The 12 Best Apify Alternatives for Web Scraping in 2025 appeared first on ScraperAPI.

]]>
Top 11 Zenrows Alternatives for Web Scraping in 2025 https://www.scraperapi.com/blog/zenrows-alternatives/ Mon, 24 Feb 2025 16:48:56 +0000 https://www.scraperapi.com/?p=7291 Looking for the best ZenRows alternatives? You’re in the right place. ZenRows is a powerful web scraping API that helps developers extract data from websites efficiently. It offers features like automatic proxy rotation, JavaScript rendering, and anti-detection mechanisms. But it’s not the only option available. Its pricing structure and limitations in handling large-scale projects have […]

The post Top 11 Zenrows Alternatives for Web Scraping in 2025 appeared first on ScraperAPI.

]]>

Looking for the best ZenRows alternatives? You’re in the right place. ZenRows is a powerful web scraping API that helps developers extract data from websites efficiently. It offers features like automatic proxy rotation, JavaScript rendering, and anti-detection mechanisms. But it’s not the only option available. Its pricing structure and limitations in handling large-scale projects have prompted businesses to consider other solutions.

Many scraping APIs and platforms offer comparable features and pricing, catering to different needs and user preferences. In this article, I’ve compiled a list of the top ten alternatives to ZenRows in 2025 to help you find the right fit for your scraping projects.

Enterprise Scraping Without the Price Tag

ScraperAPI lets you scrape 5x more pages than ZenRows without losing efficiency.

1. ScraperAPI – The Best ZenRows Alternative in 2025

ScraperAPI offers the best blend of affordability, feature depth, and ease of use, making it the best alternative to ZenRows in 2025. Like ZenRows, ScraperAPI is designed to collect real-time data from websites at scale but goes further in terms of automation, pricing, and proxy management.

ScraperAPI website homepage

One major advantage of ScraperAPI is that you don’t have to build or maintain any infrastructure. Behind the scenes, ScraperAPI handles proxy management, headless browsers, and CAPTCHAs, allowing you to extract structured data from any website without getting blocked. 

In addition, you can bypass geolocation barriers, render JavaScript-heavy pages, and handle enterprise-level volumes with just a single API call—minus the hefty price tag.

Let’s take a closer look at some of ScraperAPI’s key features:

DataPipeline [for Automated Workflows]

DataPipeline is ScraperAPI’s low-code web scraping solution, capable of collecting large amounts of data in just a few clicks. It also provides ready-to-use templates that allow you to collect structured JSON data from high-demand domains like Amazon, Google, and eBay.

Main page for ScraperAPI DataPipeline

ScraperAPI’s DataPipeline extends beyond basic scheduling: it lets you collect data from up to 10,000 URLs per project without writing code. DataPipeline endpoints can deliver your data as JSON or CSV, or send it to a webhook, which makes it easier to integrate with your existing architecture.
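If your URL list is longer than the per-project limit, you would need to split it before creating projects. The sketch below shows one way to do that; `split_into_projects` is a hypothetical helper, not part of ScraperAPI, and the 10,000 figure is taken from the paragraph above.

```python
def split_into_projects(urls, max_per_project=10_000):
    """Split a URL list into consecutive chunks of at most max_per_project."""
    return [
        urls[start:start + max_per_project]
        for start in range(0, len(urls), max_per_project)
    ]


if __name__ == "__main__":
    # Placeholder URLs, for illustration only.
    urls = [f"https://example.com/item/{i}" for i in range(25_000)]
    print([len(batch) for batch in split_into_projects(urls)])  # [10000, 10000, 5000]
```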

Note: You can also schedule your scrapers to run periodically using a visual scheduler or CRON.

To get started, all you need to do is add a list of URLs (or ASINs or keywords, for certain templates), set your geotargeting preferences (the default is the US), and run your project. You can also choose dynamic input and use a webhook as the input option.

Adding the input for a DataPipeline project

In just a few minutes, your data will be ready to download, or delivered through a webhook if you configured one.

Choosing the output for our DataPipeline project

Another advantage of DataPipeline is its intuitive dashboard. You can monitor progress, cancel jobs mid-run if necessary, review configurations, and download the final dataset. For even greater flexibility, you can manage the entire process through DataPipeline Endpoints, enabling you to track every step of your scraping operation programmatically.

Project dashboard inside DataPipeline

Compared to ScraperAPI’s DataPipeline endpoints, ZenRows’ Scraper APIs (currently in beta) have room to grow. Right now, they only support scraping one URL at a time, which can feel limiting, and the lack of a comprehensive scheduling system makes large-scale or automated workflows more challenging.

Structured Data Endpoints (SDEs)

ScraperAPI’s Structured Data Endpoints simplify your web scraping projects by automatically parsing the data in JSON format. This means you don’t have to build custom scraping tools and will get readily usable data regardless of the scale of your project.

Structured data endpoint SDEs

Different web scraping providers offer different structured data endpoints. ScraperAPI, for example, offers an Amazon Price Scraper, Walmart Category Scraper, Google Shopping Scraper API, and more. ScraperAPI’s SDEs can be used with both the standard API and the Async API, so with a simple post() request you can submit millions of requests asynchronously without hurting their success rates, which speeds up large-scale projects.
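As a hedged sketch of that POST request, the code below prepares and submits a single asynchronous job using only the standard library. The `https://async.scraperapi.com/jobs` endpoint and the `apiKey`/`url` body fields are assumptions based on ScraperAPI's public Async API documentation and should be checked there; the helper names are hypothetical.

```python
import json
from urllib import request

# Assumed endpoint; confirm against the current Async API documentation.
ASYNC_JOBS_URL = "https://async.scraperapi.com/jobs"


def build_job_payload(api_key, target_url):
    """Build the JSON body for one asynchronous scraping job."""
    return {"apiKey": api_key, "url": target_url}


def submit_job(api_key, target_url):
    """POST the job and return the decoded JSON response (makes a network call)."""
    body = json.dumps(build_job_payload(api_key, target_url)).encode("utf-8")
    req = request.Request(
        ASYNC_JOBS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the payload; call submit_job() with a real key to send it.
    print(build_job_payload("YOUR_API_KEY", "https://example.com/"))
```

Because the job runs asynchronously, the response is expected to describe the queued job rather than contain the scraped page itself.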

AI-Driven Proxy Management

ScraperAPI has found a clever way to integrate AI into its request flow. Unlike ZenRows, which relies on expensive residential proxies to unlock sites, ScraperAPI doesn’t always use expensive residential or mobile proxies. 

Instead, it automatically rotates your IP and headers using AI and statistical analysis. Therefore, it avoids overusing premium proxies, slashing costs by up to 40% while maintaining 99.9% success rates. This ensures that premium proxies are only used when necessary, reducing wasted requests and keeping overall scraping costs lower than ZenRows.
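The principle of using premium proxies only when necessary can be illustrated with a simple escalation loop. This is not ScraperAPI's actual algorithm, which the article describes as driven by AI and statistical analysis; the tier names and the `fetch` callable are hypothetical.

```python
def fetch_with_escalation(url, fetch, tiers=("datacenter", "residential", "mobile")):
    """Try the cheapest proxy tier first and escalate only when a tier fails.

    `fetch(url, tier)` is a caller-supplied function that returns True on success.
    Returns the tier that succeeded, or None if every tier failed.
    """
    for tier in tiers:
        if fetch(url, tier):
            return tier
    return None


if __name__ == "__main__":
    # Simulated target that blocks datacenter IPs but accepts residential ones.
    def simulated_fetch(url, tier):
        return tier != "datacenter"

    print(fetch_with_escalation("https://example.com/", simulated_fetch))  # residential
```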

Pricing

You don’t need to break the bank to get started with ScraperAPI. Simply sign up for 5,000 free API credits to test the service. Paid plans start at a competitive $49 per month, offering higher request limits, faster speeds, and dedicated support.

ScraperAPI also includes a cost calculator in its dashboard so you can track your overall credit usage and remaining quota. Here’s a more comprehensive breakdown:

ScraperAPI Plan | Price | API Credits | Equivalent
Free | $0 | 5,000 | Great for testing
Hobby | $49/mo | 100,000 | 100,000 pages (1 credit each)
Startup | $149/mo | 1,000,000 | Additional features for bigger projects
Business | $299/mo | 3,000,000 | Ideal for heavy workloads; about $0.08 per 1K requests with a yearly subscription
Enterprise | Custom | 3,000,000+ | Custom pricing applies for access to over 10,000,000 API credits

Comparing to ZenRows

ZenRows starts at $69/mo, but enabling JavaScript rendering or premium proxies can drastically reduce your total request cap. For instance, a ZenRows Business plan might cost $0.10 per 1K requests at baseline; once you add JS rendering, it jumps to $0.45, and premium proxies can raise it to $0.90 or more.

On the other hand, ScraperAPI delivers roughly 20% more successful requests at the same price point. The cost per 1,000 requests on the ScraperAPI Business plan (with a yearly subscription) comes to about $0.08, considerably lower than ZenRows’ $0.10 base rate.
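The per-1K figures can be checked with a short calculation. This sketch uses only numbers quoted in this article; the $100 budget is an arbitrary example. Note that the monthly list price of the Business plan works out to roughly $0.10 per 1K credits, so the $0.08 figure depends on the yearly-subscription discount, whose exact price is not stated here.

```python
def cost_per_1k(monthly_price, monthly_requests):
    """Cost in dollars per 1,000 requests for a flat monthly plan."""
    return monthly_price / monthly_requests * 1000


def requests_for_budget(budget, rate_per_1k):
    """Number of requests a budget buys at a given per-1K rate."""
    return round(budget / rate_per_1k * 1000)


# ScraperAPI Business at the monthly list price quoted above.
print(round(cost_per_1k(299, 3_000_000), 4))  # 0.0997

# ZenRows rates quoted above, applied to an example $100 budget.
for label, rate in [("base", 0.10), ("JS rendering", 0.45), ("premium proxies", 0.90)]:
    print(label, requests_for_budget(100, rate))
```

At the quoted rates, enabling JS rendering cuts the number of requests a fixed budget buys to under a quarter of the baseline.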

Pros

  • Easy to use and beginner-friendly
  • AI-powered proxies
  • Excellent scheduling capacity
  • No-code support
  • Excellent documentation and support
  • High uptime and concurrency
  • Geotargeting

Cons

  • No PAYG pricing
  • Limited geotargeting in lower tiers

Other Notable ZenRows Alternatives

Below are nine more solutions that stand out as strong ZenRows alternatives, each well-suited for specific use cases. We’ve grouped them into categories for quick reference:

The Best All-in-One Alternatives to ZenRows for Web Scraping

2. Bright Data

Bright Data is ZenRows’ biggest competitor in the premium segment. It has excellent performance and advanced geotargeting and offers even more features, such as powerful and extensive proxy management tools.

You can buy anything from proxy networks to proxy-based web scraping APIs and complete datasets. Its unique pre-built datasets allow users to purchase pre-scraped data, removing the need for manual collection. 

Furthermore, Bright Data focuses on high-volume enterprise scraping use cases. Their pricing also reflects this, often landing above the market average, which may be overkill for smaller or more budget-conscious projects. The platform is also notoriously technically challenging for beginners.

Pricing

Bright Data’s entry plan is $499/month, though their pay-as-you-go plan starts at $1.5 per 1k records. Business and premium plans are significantly higher ($999 and $1999/month, respectively). 

Pros

  • Extensive proxy network
  • Pre-built datasets 

Cons

  • Expensive
  • Complex pricing structure
  • Technically complex

Related: Explore some of the best Bright Data alternatives.

3. Oxylabs

Oxylabs is another close competitor in the premium segment. It offers a comprehensive suite of tools, including multiple proxy types, scraping APIs, and pre-collected datasets, all managed through a user-friendly dashboard. 

Oxylabs works great for enterprise-level data scraping or big e-commerce price-tracking projects. They also offer a massive pool of proxies from over 195 countries, which provides exceptional geo-targeting capabilities and ensures your scraping requests appear more human-like. However, its advanced features come with a steep learning curve and pricing that may deter smaller teams.

Pricing

Oxylabs pricing starts from $49/month, and they also offer a free tier valid for one week (capable of scraping up to 5,000 records).

Pros

  • Bulk scraping support, with up to 1,000 URLs per batch
  • Global proxy network
  • Dedicated support 
  • Multiple delivery options to receive data via API or on your cloud storage bucket

Cons

  • Complex for small needs
  • Setup difficulty for beginners
  • Higher pricing for advanced features

Read more: Explore some of the best Oxylabs alternatives.

4. ScrapingBee

ScrapingBee’s web scraping API is great for general web scraping tasks, including price monitoring, real estate data collection, and extracting reviews. It also has a dedicated API for web scraping using Google search and includes well-crafted documentation to guide you through. 

With the JS scenario feature, you can effortlessly perform actions like clicking, scrolling, waiting, or executing custom JavaScript on target websites.
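A JS scenario is passed to the API as a JSON list of instructions. The sketch below builds such a request URL without sending it. It assumes ScrapingBee's `https://app.scrapingbee.com/api/v1/` endpoint and the `js_scenario` parameter with `click`, `wait`, and `scroll_y` instructions as described in its public documentation; the `#load-more` selector and the helper name are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; confirm against the current ScrapingBee documentation.
SCRAPINGBEE_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def build_scenario_url(api_key, target_url, instructions):
    """Build a request URL that runs the given browser instructions on the page."""
    scenario = json.dumps({"instructions": instructions})
    params = {"api_key": api_key, "url": target_url, "js_scenario": scenario}
    return SCRAPINGBEE_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    steps = [
        {"click": "#load-more"},  # hypothetical selector on the target page
        {"wait": 1000},           # pause, in milliseconds
        {"scroll_y": 1000},       # scroll down, in pixels
    ]
    print(build_scenario_url("YOUR_API_KEY", "https://example.com/", steps))
```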

ScrapingBee offers a free tier of 1,000 credits per month. However, this is barely practical due to the platform’s relatively high credit consumption. It’s also worth noting that they have introduced a new AI extraction feature (currently in beta), which carries an additional cost of 5 credits per use on top of the regular API cost.

Pricing

ScrapingBee offers a free plan with 1,000 monthly API calls. If you need more, the Freelance plan at $49/month provides increased request limits and additional features for more demanding projects.

Pros

  • Easy Integration
  • Good documentation
  • JavaScript rendering

Cons

  • Higher-tier plans are not cost-effective at scale
  • Limited free trial for extensive testing.

Read more: Explore some of the best ScrapingBee alternatives.

Top Alternatives to ZenRows for Beating Anti-Bot Systems

5. Smartproxy

Smartproxy ranks among the top ZenRows alternatives for bypassing anti-bot systems thanks to its massive proxy network and specialized scraping APIs. Though rotating proxies remain the backbone of its service, Smartproxy now offers four different web scraping APIs that make collecting data easier without manually handling proxy rotation or dealing with anti-bot measures. Smartproxy also has built target-specific endpoints optimized for major platforms and website elements, like Google ads, Amazon product pages, or TikTok hashtags. 

While Smartproxy’s offerings scale well, it’s worth noting that this service remains primarily focused on proxies. For highly advanced data extraction scenarios that require in-depth parsing or post-processing, you may find Smartproxy’s solutions somewhat limited compared to all-in-one platforms.

Pricing 

The general-purpose and social media API starts at $50 for 25k requests, while SERP and e-commerce start at $30 for 15k requests. Their free trial is 1k requests over 7 days.
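To compare these entry tiers on equal footing, it helps to convert them to a cost per 1,000 requests. The short sketch below does that arithmetic using only the prices quoted above; the function and plan labels are illustrative names, not anything from Smartproxy's API.

```python
def cost_per_thousand(price_usd, requests):
    """Return the effective price in USD for 1,000 requests."""
    return price_usd * 1000 / requests


# Entry-level prices as quoted in this article: (price in USD, included requests).
plans = {
    "General-purpose / social media API": (50, 25_000),
    "SERP / e-commerce API": (30, 15_000),
}

rates = {name: cost_per_thousand(price, count) for name, (price, count) in plans.items()}

for name, rate in rates.items():
    print(f"{name}: ${rate:.2f} per 1,000 requests")
```

Both entry tiers work out to the same effective rate of $2.00 per 1,000 requests, so the choice between them comes down to the target sites rather than the unit price.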

Pros

  • Comprehensive location coverage
  • Specialized APIs for major platforms
  • Advanced geo-targeting options
  • Reliable proxy infrastructure
  • Great Documentation

Cons

  • Limited advanced data extraction features
  • Primarily focused on proxies.
  • Parsing Limitations

Read more: Explore some of the best Smartproxy alternatives.

6. Apify

Apify is a full-stack cloud-based platform where developers build, deploy, and monitor web scraping, data extraction & browser automation tools at scale. 

Apify also offers an extensive library of over 3,000 pre-built scrapers (Actors), making it easy to create and run customized tasks for everything from social media monitoring to e-commerce price tracking. 

Pricing

Starts at $49/month. Though Apify follows a freemium model, costs can quickly rise once you scale up usage or require advanced enterprise features.

Pros

  • Integrated Proxy Management
  • Flexible automation and data export options
  • Active developer community
  • Extensive marketplace of ready-to-use scrapers
  • Freemium + Open Source approach

Cons

  • Steep learning curve for beginners
  • Complex pricing structure
  • Resource-intensive for large projects

Read more: Explore some of the best Apify alternatives.

The Best No-Code ZenRows Alternatives for Beginners

7. ParseHub

ParseHub is a free, user-friendly web scraper ideal for beginners and non-technical users. It offers a visual point-and-click interface, a no-code workflow, and support for dynamic web pages.

ParseHub also offers a desktop application and cloud-based services, allowing for flexible data collection and integration into various workflows. You can easily schedule scraping tasks and export data in multiple formats, including JSON and CSV.

However, some users may find limitations when dealing with super-complex websites or robust anti-scraping measures. Additionally, while ParseHub has a free plan, it is very limited, and the costs can rise quickly when scraping at scale.

Pricing

ParseHub offers a free plan with basic features, but advanced capabilities require a paid subscription, which starts at $189/month. Pricing can be relatively high once you exceed the free limits and begin to scale up your projects.

Pros

  • No coding required
  • Handles many dynamic websites
  • IP rotation
  • Scheduled scraping

Cons

  • Limited in handling super-complex websites or large-scale scraping projects
  • The free version comes with limited features.
  • Higher-tier plans can be expensive for small businesses or individual users.

Read more: Explore some of the best ParseHub alternatives.

8. Octoparse

Octoparse is a no-code, AI-assisted web scraping tool built around a point-and-click interface. It supports interactions with page elements such as infinite scrolling, dropdowns, and hovering.

It works by navigating to the page you want data from and clicking on the elements you want to scrape. Once your logic is created, you can run your scraper or create a workflow to schedule recurring scraping jobs. Octoparse also offers 469 free built-in template scrapers, perfect for individuals or small businesses stepping into web scraping.

Pricing

Octoparse starts with a free plan that allows up to 10 tasks on local devices and 50K data exports per month to test out the platform. For more advanced needs, the Standard Plan at $119/month unlocks cloud-based scraping (up to 6 concurrent processes), IP rotation, residential proxies, automatic CAPTCHA solving, and unlimited data exports.

Pros

  • User-friendly platform
  • Online data scraping templates
  • Hybrid model (cloud or local storage)

Cons

  • The free tier has limited resources
  • Processing large data sets can be slow.
  • Not cost-effective.

New and Unique ZenRows Alternatives to Watch

9. Scrapingdog

Scrapingdog is another API suite that can serve as a ZenRows alternative, and it performs well on benchmarks.

It has dedicated endpoints for many well-known data sources, including Google Search, Google AI Mode, and LinkedIn. The API scales to high volumes, and the more credits you consume, the lower the per-credit price goes.

Scrapingdog also has a new AI web scraping API that lets you extract structured data from any page with just a prompt. Its output can be used as training data for LLMs.

Pricing

Scrapingdog offers 1,000 free credits for testing. After that, you can upgrade to a paid plan. The basic plan starts at $40 and includes 200,000 credits.

Pros

  • Easy to integrate
  • Easy-to-understand documentation
  • Fast API response times

Cons

  • Limited free trial for extensive testing
  • No built-in no-code tools, though it integrates easily with third-party tools

Read more: Explore some of the best Scrapingdog alternatives.

10. Import.io

Import.io is a no-code SaaS platform that transforms unstructured web data into structured datasets with a point-and-click interface. It also allows you to scrape behind logins, making it useful for sites that require authentication. 

Import.io is ideal for enterprises as it allows customers to process up to 1000 URLs concurrently or on a schedule and gain access to millions of rows of data that they use for hundreds of different use cases. Customers can use this data for price monitoring, market research, data journalism, machine learning, and more. 

Pricing

Import.io offers a 14-day free trial of their Standard plan. However, upgrading to the Advanced plan will cost you $1099 monthly.

Pros

  • No coding required
  • Handles complex websites
  • Scheduled Crawls
  • Massive Concurrency

Cons

  • High cost for small businesses.
  • Limited customization options
  • Limited support
  • Requires a credit card to sign up

11. Crawlbase 

Crawlbase is an all-in-one data crawling and scraping platform for business developers. It prioritizes anonymity while you navigate the web, which makes it a good fit for data mining or SEO projects. It also manages proxies automatically and works around restrictions, blocks, and CAPTCHAs to keep data delivery steady, so you don’t have to manage global proxies yourself.

Pricing

Crawlbase introduced a new pricing structure for its crawling API effective December 1, 2024. It groups websites into three categories (Standard, Moderate, and Complex) based on scraping difficulty. Scraping a moderate website costs $0.009 per request, which adds up to $115 for 100k requests per month.

Pros

  • Stealth crawling
  • Automated Proxy Management
  • Available free trial

Cons

  • Limited free tier
  • Complex pricing structure
  • May lack specialized scrapers for very specific targets

ZenRows Alternatives Compared

Below is a quick comparison table to help you evaluate the key differences between ZenRows and the 10 notable alternatives we’ve covered.

Tool | Pricing | Key Features | Ideal Use Case | Rating (Trustpilot/G2)
ScraperAPI | From $49/month | JavaScript rendering & proxy rotation; DataPipeline endpoints; Async Scraper; SDEs | General-purpose scraping and advanced data extraction | 4.7
ZenRows | From $69/month | JavaScript rendering, proxy rotation, CAPTCHAs, scheduling | For mid-sized businesses | 4.8
Bright Data | From $499/month or $1.5 per 1k records | Extensive proxy network; Pre-built datasets; Advanced geotargeting | Enterprise-scale projects requiring geo-specific data | 4.6
Oxylabs | Starts at $49/mo | Pre-collected datasets; Bulk scraping; Cloud integration | Large e-commerce price tracking & SERP monitoring | 4.2
ScrapingBee | From $49/month | JavaScript rendering; Dedicated Google Search API; AI extraction (beta) | Mid-sized teams needing quick integration | 4.9
Smartproxy | Starts at $50/month | Large proxy network; Specialized scraping APIs; Advanced geo-targeting | Bypassing aggressive anti-scraping measures | 4.5
Apify | Starts at $49/month | Cloud-based platform; 3,000+ pre-built Actors; Integrated proxy management | Developers building custom scrapers | 4.8
ParseHub | Starts at $189/mo | No-code interface; Dynamic site support | Beginners & non-technical users | 4.5
Octoparse | From $119/mo | Drag-and-drop interface; Cloud-based scraping | Small businesses needing no-code workflows | 4.6
Import.io | From $1099/mo | No-code SaaS; 1k+ concurrent URLs | Large enterprises with complex data pipelines | 3.6
Crawlbase | $0.009/request ($115/100k) | Stealth crawling; Automated proxy management | Data mining & SEO audits requiring anonymity | 4.8

Note: Trustpilot/G2 ratings may vary over time.

Why ScraperAPI Might Be Your Go-To ZenRows Alternative

When we say free, we mean it

ScraperAPI offers a free forever tier with 1,000 credits each month (after an initial 5,000 free credits for the first week) that gives you unrestricted access to the platform. In contrast, ZenRows’ free trial lasts 14 days, limits you to 1,000 API requests, and requires a paid subscription after the trial ends.

Render Instruction Set

ScraperAPI’s Render Instruction Set allows you to send instructions to a headless browser via an API call, guiding it on what actions to perform during page rendering. These instructions give you the same control over the page rendering as any headless browser without the tedious setup typically required. 

With the Render Instruction Set, you can execute complex operations such as clicking buttons, completing a search form, or infinite scrolling. It is perfect for sites with heavy dynamic content, and you can scrape interactive pages just as you would with a local browser.
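A minimal sketch of such a request is shown below: it fills in a search form, clicks submit, and waits for the results to load. The header names (`x-sapi-api_key`, `x-sapi-render`, `x-sapi-instruction_set`) and the instruction schema are assumptions based on ScraperAPI's public documentation and should be confirmed there; the selectors and search term are purely illustrative. The code only prepares the headers and parameters, so it runs without an API key.

```python
import json

# Assumed ScraperAPI endpoint; verify against the current documentation.
SCRAPERAPI_ENDPOINT = "https://api.scraperapi.com/"


def css(selector):
    """Wrap a CSS selector in the selector object the instruction set is assumed to expect."""
    return {"type": "css", "value": selector}


def build_render_request(api_key, target_url, instructions):
    """Return (headers, params) for a rendered request that runs the instructions."""
    headers = {
        "x-sapi-api_key": api_key,
        "x-sapi-render": "true",  # ask for headless-browser rendering
        "x-sapi-instruction_set": json.dumps(instructions),
    }
    params = {"url": target_url}
    return headers, params


# Hypothetical flow: type a query, submit the form, wait for the results container.
instructions = [
    {"type": "input", "selector": css("#searchInput"), "value": "cowboy boots"},
    {"type": "click", "selector": css("#search-form button[type='submit']")},
    {"type": "wait_for_selector", "selector": css("#content")},
]

headers, params = build_render_request(
    "YOUR_API_KEY", "https://www.wikipedia.org", instructions
)
print(headers["x-sapi-instruction_set"])
```

With the `requests` library installed, the call itself would be `requests.get(SCRAPERAPI_ENDPOINT, params=params, headers=headers)`. Keeping the instruction list as plain Python data makes it easy to reorder or extend steps before serializing them.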

Get more for less on a paid plan

For $69/month, ZenRows provides access to a multi-purpose API. For only $49/month, ScraperAPI offers a well-rounded scraping platform with access to cutting-edge features like DataPipeline, the Render Instruction Set, and SDEs.

No-code infrastructure and scheduling features

ZenRows lacks native cloud capabilities or scheduling features, requiring additional services for those functions. With ScraperAPI, you can host enterprise-grade scrapers in the cloud, schedule jobs automatically, store data, and export it in multiple formats.

Anti-blocking features

ZenRows offers proxies to help you avoid blocking. ScraperAPI takes anti-blocking to another level with AI-powered proxy rotation and advanced fingerprinting techniques, ensuring your scrapers remain performant and cost-effective.

Wrapping Up: ScraperAPI and ZenRows Compared

Feature | ScraperAPI | ZenRows
Free Trial | 5,000 free API credits upon sign-up ✅ | 1,000 URLs free for 14 days ✅
Ready-to-Use Tools | DataPipeline (low-code scraping), Structured Data Endpoints for Amazon, Google, eBay, etc. ✅ | Scraper APIs (still in beta) ⚠
Cloud-Based Hosting | Manage scraping jobs in the cloud with DataPipeline ✅ | User must handle storage, scheduling, and data retrieval independently ❌
Data Export Formats | JSON, CSV, HTML, Webhooks ✅ | JSON, HTML, Markdown ✅
Dynamic Websites | Built-in JavaScript rendering ✅ | JavaScript rendering available ✅ (extra cost)
Enterprise Scaling | Yes ✅ | Yes ✅
Proxy Rotation & CAPTCHAs | Yes ✅ | Yes ✅
Scheduling | CRON-based scheduling (DataPipeline) can handle 10,000+ URLs/project ✅ | Basic scheduling only ❌

Ready to see for yourself? Sign up for ScraperAPI and join over 10,000 data-focused companies in maximizing your data collection!

The post Top 11 Zenrows Alternatives for Web Scraping in 2025 appeared first on ScraperAPI.

]]>