What is Web Scraping? The Complete Guide for 2025 https://www.scraperapi.com/web-scraping/

7 Anti-Scraping Techniques and How to Bypass These Mechanisms https://www.scraperapi.com/web-scraping/how-to-bypass-anti-scraping-techniques/ Wed, 23 Jul 2025 08:28:23 +0000

Companies fiercely protect their data because competitors scraping it can steal their advantage. To stay ahead, they invest in sophisticated defenses set to block scraping attempts and keep their information locked down.

If you need that data for research, analysis, or competitive insights, and you’ve hit those access restrictions, this blog is for you.

We’ll discuss the most common anti-scraping techniques and how to bypass each. 

What is Anti-Scraping?

Think of anti-scraping systems as a website's security guards.

Scraping is when external bots automatically extract content from a website. Anti-scraping, on the other hand, is the defense system specifically made to detect and block these automated extraction attempts. 

To protect themselves, websites deploy a mix of clever techniques: monitoring user behavior like mouse movements and clicks, flagging suspicious IP addresses (especially if one is making an enormous number of requests), and tracking data access spikes over a short time.

In simple terms, anti-scraping is any tool, technique, or approach that blocks bots or scrapers from extracting a website’s content.

How to Bypass Anti-Scraping Mechanisms with a Simple API?

Just like websites have anti-scraping tactics to block bots, there are ways to bypass them and access the content. 

Some common methods include rotating IP addresses, which simply means changing IP addresses using VPNs or proxy servers, so websites think the traffic is coming from multiple devices. Then there’s solving CAPTCHAs, adding random delays between mouse clicks and scrolls, and changing user agents in HTTP requests.

While these are common methods, some advanced techniques use AI to detect and bypass anti-scraping tools. For example, AI-based image or audio recognition can detect and bypass CAPTCHAs.  

Sounds complex, right? That’s where tools like ScraperAPI come in. You send your request to their API, and everything (IP rotation, JavaScript rendering, CAPTCHA bypassing) happens behind the scenes. 

So, with a simple API call, you’ll get the content without manually fighting the anti-scraping techniques.

Top 7 Anti-Scraping Techniques in 2025 and How to Bypass Them

1. Login Wall or Auth Wall 

A login wall, as the name suggests, is a front gate to website access. To get through, users have to sign in with valid login details. 

Bots, which typically don’t have valid credentials, often fail to scrape the data behind login walls. This is most common on LinkedIn, Facebook, and other social media sites.

How to Bypass Login Walls?

After logging in to a website, some scraping scripts save the session cookies and reuse them in future scraping requests. This makes the website think it’s still a logged-in session, so it doesn’t ask you to log in again.
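
As a sketch, here's how that cookie reuse might look with Python's `requests` library. The login URL and form field names below are hypothetical placeholders; replace them with the target site's actual values:

```python
import pickle
import requests

# Hypothetical endpoint and field names for illustration only.
LOGIN_URL = "https://example.com/login"
COOKIE_FILE = "session_cookies.pkl"

def login_and_save(username, password):
    """Log in once, then persist the session cookies to disk."""
    session = requests.Session()
    session.post(LOGIN_URL, data={"username": username, "password": password})
    with open(COOKIE_FILE, "wb") as f:
        pickle.dump(session.cookies, f)
    return session

def restore_session():
    """Build a fresh session that reuses the saved cookies, so the site
    treats follow-up requests as the same logged-in session."""
    session = requests.Session()
    with open(COOKIE_FILE, "rb") as f:
        session.cookies.update(pickle.load(f))
    return session
```

Cookies expire, so a robust scraper re-runs the login step whenever requests start returning the login page again.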

Is bypassing login walls this way legal? The legality depends on the situation, how you extract the data, and how you use it. Check out this guide to learn how to scrape data safely.

2. IP Address Blocking

Anti-scraping tools often use IP addresses and their metadata to decide whether to allow website access.

An IP address looks something like this: 192.168.1.100. It follows the IPv4 format of four numbers separated by dots. Which part of the address refers to the network and which to the host is determined by the subnet mask, not by the position of specific numbers. An IP address can reveal a general geographic location based on how ISPs assign addresses, and it can group users on the same network or ISP segment. The last part may distinguish devices within a network, but many users can still share the same public IP due to NAT (Network Address Translation).

Anti-scraping systems use this information to detect unusual patterns. If a single IP or a series of similar IPs starts sending too many requests, it raises a red flag. 

There’s also rate limiting. Websites set a cap on how many requests a single IP can make in a short timeframe. If you go over that limit, your access will be cut off.
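
To stay under such caps, a scraper can throttle itself client-side. A minimal sketch (the interval is an assumption; tune it to the limits you observe on the target site):

```python
import random
import time

class Throttle:
    """Keep at least `min_interval` seconds between consecutive requests,
    plus a little jitter so the timing doesn't look machine-perfect."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_request = 0.0

    def wait(self):
        # Sleep just long enough to respect the minimum spacing.
        elapsed = time.monotonic() - self.last_request
        delay = self.min_interval - elapsed + random.uniform(0, 0.5)
        if delay > 0:
            time.sleep(delay)
        self.last_request = time.monotonic()

# Usage:
# throttle = Throttle(min_interval=2.0)
# for url in urls:
#     throttle.wait()
#     ...fetch url...
```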

To make things tighter, websites keep a blacklist of suspicious IPs, IP ranges, and high-risk geographic locations. If your IP shows up in that database, expect restrictions.

How to Bypass IP Blocking?

Using proxy servers is one of the most common techniques to bypass IP-based restrictions. 

A proxy server acts as an intermediary between your computer and the website. When you send a request, it goes to the proxy first, which then forwards the request to the website using its own IP address. The website’s response is received by the proxy and then relayed back to your computer.

There are different types of proxy servers. Residential proxies, which use IP addresses assigned to real home devices, appear as genuine users and can bypass many advanced anti-scraping systems.

Tools like ScraperAPI make it simple. ScraperAPI provides access to over 4 million residential proxies, plus data center proxies, mobile proxies, and e-commerce-specific ones. It automatically rotates these IP addresses behind the scenes, so you can keep scraping without hitting restrictions.

3. User-Agent and HTTP Headers

User agents are a part of HTTP headers. They tell the website what browser you’re using, its version, and your operating system.

A legitimate User-Agent looks something like this: `user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47`

But when you send a request with Python's `requests` library, the default User-Agent looks like `user-agent: python-requests/2.22.0`, which is easy to detect and block.

Some websites inspect the User-Agent string and block requests that don’t match those of major browsers like Chrome, Safari, or Firefox. Because these browsers frequently update their user-agent strings, using an outdated or uncommon one can sometimes raise suspicion.

What about skipping the User-Agent entirely? That’s also a red flag and can lead to being blocked.

How to Bypass User-Agent Checks?

Customizing user agents with recent browser versions helps blend in with regular traffic. Rotating user-agent strings is also important when sending multiple requests.
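
A sketch of both practices together (the User-Agent strings below are examples modeled on real browsers; refresh them periodically, since sites may flag outdated versions):

```python
import random

# Example User-Agent strings modeled on current browsers; keep this
# pool up to date as browser versions change.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def build_headers():
    """Pick a random User-Agent and pair it with headers a real browser sends."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

# Usage with requests:
# response = requests.get("https://example.com", headers=build_headers())
```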

But manually managing and rotating headers at enterprise scale can be challenging. That’s where ScraperAPI steps in. It rotates user agents and IPs for each request, ensuring smooth data access. With ScraperAPI, everything is automated through a simple API call, delivering a 99.99% success rate.

4. Honeypots

Websites set traps called honeypots to catch bots. When scrapers trigger these traps, the site logs their IPs and may blacklist them.

For example, websites place hidden links in areas that are invisible to humans. Bots don’t see the page like humans, but they parse the raw HTML and interact with all elements, including hidden ones. This allows websites to spot the difference, collect bot information, and block their access.

Another common example is hidden fields in web forms. These fields are invisible to users, so when a real person fills out the form, the field stays empty. But scrapers often fill every form field they find in the HTML, including hidden ones, because they don’t distinguish visible from hidden fields. That difference in behavior helps websites spot and stop bots.

How to Bypass Honeypots?

Ensure your scraper skips hidden elements, such as those styled with `display: none` or `visibility: hidden`.
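
Here's one way to apply that filter when parsing with BeautifulSoup. This sketch only catches inline styles; traps hidden via external CSS require a rendering browser to detect:

```python
from bs4 import BeautifulSoup

def visible_links(html):
    """Collect link targets while skipping elements hidden via inline CSS,
    a common honeypot pattern."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            continue  # likely a honeypot trap; don't follow it
        links.append(a["href"])
    return links
```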

When using proxies, if a honeypot detects suspicious activity, it typically results in blacklisting the proxy IP rather than your real IP, assuming all requests go through the proxy. Proper proxy rotation helps reduce the risk of bans. Still, if you want to avoid the trap altogether, use ScraperAPI. It’s designed to avoid honeypot traps by automatically managing proxy rotation, browser emulation, and other anti-bot techniques through a simple API call.

5. JavaScript Challenges

Sometimes, when you open a website, there’s a slight delay before the page fully loads. That’s a JavaScript challenge running in the background.

(Screenshot: a Cloudflare JavaScript challenge page.)

Here’s what’s happening: the site serves a JavaScript file that your browser must execute. Real browsers handle it easily. They run the script and store a token in a cookie, and once the server validates that token, access is granted.

Scrapers without a JavaScript engine cannot run that script, so they get blocked.

How to Bypass JavaScript Challenges?

Use a headless browser with JavaScript enabled. Headless Chromium (via Puppeteer, Playwright, Selenium) executes JavaScript just like a real browser. These tools also support plugins like `puppeteer-extra-plugin-stealth` or options like `contextOptions` to fake screen sizes and device profiles. This makes the site believe a real Chrome browser is running the JavaScript challenge.
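
A minimal Playwright sketch of this approach (it assumes Playwright and its Chromium build are installed via `pip install playwright` and `playwright install chromium`):

```python
def fetch_rendered(url):
    """Load a JavaScript-protected page in headless Chromium and return
    the HTML after scripts have run and network activity has settled."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let the challenge script finish
        html = page.content()
        browser.close()
        return html
```

Stealth plugins go a step further by patching the telltale properties (like `navigator.webdriver`) that headless browsers would otherwise expose.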

For more reliable scraping of JavaScript-protected sites, combine a headless browser with a scraping API, or just use ScraperAPI on its own. It comes with a built-in headless browser to render JavaScript and includes other bypass techniques like proxy servers, IP rotation, CAPTCHA handling, and more.

6. CAPTCHAs 

CAPTCHAs are challenges—like checking a box, typing some text, or clicking on the right images—that humans can solve easily before accessing a protected website. But for bots and scrapers, they’re a nightmare.

Most secure websites use CAPTCHAs as a first line of defense. And even if you get past the initial page load, background monitoring continues. If the site picks up on anything unusual, it’ll throw a CAPTCHA your way to make sure you’re human before letting you continue.

How to Bypass CAPTCHAs?

Instead of solving CAPTCHAs, it’s easier and cheaper to avoid triggering them in the first place.

Websites often present CAPTCHAs when they detect suspicious behavior. So, rotating IPs, rotating user agents, and clearing cookies between requests can help you prevent them. Avoiding honeypots and adding random delays between requests also lowers your chances of running into a CAPTCHA.

When you use ScraperAPI, it handles this automatically. The platform monitors for CAPTCHA-triggered failures, and if one is detected, it retries the request using a different IP to bypass it.

If you do end up facing a CAPTCHA, the last resort is solving it. There are third-party CAPTCHA-solving services you can plug into your scraping setup. But these can get expensive, so it’s best to avoid CAPTCHAs altogether using the prevention steps above.

7. User Behavior Analysis

Humans have a natural way of moving the mouse, clicking, and scrolling. But bots typically perform these actions with precision and fixed timing. For example, a bot might move the mouse from point A to B with linear, inhuman precision, click immediately after landing on a button, or scroll a page with no pauses.

Websites collect this interaction data and run it through behavioral analysis or machine learning models, which can spot patterns and distinguish bots from real users. 

How to Bypass User Behavior Analysis?

The key difference here is behavior—how users interact versus how bots operate. So to bypass user behavior analysis (UBA), your bot needs to act as human-like as possible.

That means typing at a natural speed with randomized delays, clicking at varied intervals (not perfectly timed), and mimicking real scrolling patterns. 

For this, tools like Selenium or Puppeteer come in handy. They can simulate real human interactions with the website. 
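
For example, here is a hedged Selenium sketch of a more human click; the offsets and pauses are arbitrary illustration values, not tuned constants:

```python
def human_click(driver, element):
    """Approach an element in small, irregular steps with brief pauses,
    instead of teleporting the cursor and clicking instantly."""
    import random
    from selenium.webdriver.common.action_chains import ActionChains

    actions = ActionChains(driver)
    for _ in range(random.randint(2, 4)):
        # Wander a little before settling on the target.
        actions.move_by_offset(random.randint(1, 15), random.randint(1, 10))
        actions.pause(random.uniform(0.05, 0.2))
    actions.move_to_element(element)
    actions.pause(random.uniform(0.2, 0.6))  # hover briefly, like a person
    actions.click()
    actions.perform()
```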

Another reliable method is replaying real user behavior. This involves recording an actual session of a human using the site—every click, scroll, pause, and typo—and replaying it with an automated script at the same timing and flow. It works best when combined with consistent browser and device fingerprints.

Conclusion 

Now you know all the essential anti-scraping techniques and how to bypass each of them. But handling all of them manually is complex, time-consuming, and often not enough. Some websites use advanced defense systems that can still block access, even when every precaution is in place.

ScraperAPI simplifies the entire process by automatically handling anti-scraping measures and delivering consistent data access through a single API call. Want to see it in action? Book a free custom trial and test ScraperAPI on your exact use case.

What is a Bot Blocker and How Does It Work? https://www.scraperapi.com/web-scraping/what-is-a-bot-blocker/ Wed, 23 Jul 2025 08:12:03 +0000

Bot blockers are everywhere online, but what exactly are they, and how do they work? If you’ve ever run into a CAPTCHA, had your scraper blocked, or noticed strange traffic on your site, you’re already familiar with how bot blockers affect your work.

This article explains how bot blockers detect and prevent automated traffic, and why they’ve become such an ever-present part of modern websites. Here’s what you’ll learn:

  • What bot blockers are and how they differ from CAPTCHAs, firewalls, and rate limiting
  • How bots are detected using techniques like behavioral analysis and device fingerprinting
  • The most common methods used to prevent bots, including IP blocking and JavaScript challenges
  • Whether bot blockers can be bypassed, and how tools like ScraperAPI manage to do it

What is a Bot Blocker?

A bot blocker is a system that identifies and stops automated traffic from reaching a website. It’s designed to filter out unwanted bots, such as scrapers, credential stuffers, or fake accounts, only letting real users through.

While the term “bot blocker” is sometimes used loosely, it’s not the same as tools like CAPTCHAs, firewalls, or rate limiting:

  • CAPTCHAs are challenges that test whether a user is human, often by asking them to click images or solve puzzles. Bot blockers may use CAPTCHAs as part of a larger system, but they go beyond that.
  • Firewalls are broad security tools that block traffic based on IP, ports, or other network-level rules. A bot blocker operates at the application layer, examining behavior, patterns, and browser characteristics to identify bots.
  • Rate limiting restricts how often someone can make requests, typically based on IP address. It’s a basic tactic that bot blockers usually include, but on its own, it can be easy for bots to work around.

Bot blockers typically combine multiple signals and techniques to decide if a request is suspicious. Instead of relying on a single method, they analyze the context, including how the request behaves, what it looks like, and whether it matches known bot patterns.

For example, if a user lands on a page and immediately makes 20 requests per second without moving the mouse or scrolling, the bot blocker might flag this as suspicious. It may also detect a screen resolution of 0x0 or a lack of standard plugins—both signs of a headless browser. These small signals add up, and the system may respond by blocking the session, issuing a CAPTCHA, or silently dropping the requests.

How Does a Bot Blocker Work?

Once a request reaches a website, a bot blocker has to “make a decision” quickly. Is this visitor a human or an automated script? To determine this, it runs a series of checks in the background. These typically fall into two categories: detection and prevention.

Detection Techniques

Detection comes first. It’s about gathering as much information as possible without slowing down the process. The goal is to create a profile for each request based on user behavior, browser data, and traffic patterns. If anything looks off, the system can escalate to prevention: blocking, challenging, or slowing down the request.

  • Behavioral analysis: Real users move their mouse unpredictably, scroll at different speeds, and take time to click around. Bot blockers watch for these kinds of natural signals. Bots, especially basic ones, tend to skip user interactions entirely or simulate them in patterns that are too fast or too uniform to look human.
  • Device fingerprinting: Even two users on the same browser rarely have identical setups. A fingerprint is built from small details like screen resolution, OS, time zone, and installed plugins, as well as more advanced signals like WebGL data, canvas rendering, and audio context. Headless browsers and automated tools often return generic or incomplete values, which helps flag them as bots.
  • Rate limiting and request pattern analysis: Bot blockers track how often requests are made, where they come from, and what they’re doing. A flood of traffic from a single IP address, or a group of IP addresses accessing the same resources in a concentrated pattern, is a common indication of scraping or brute-force attempts. Repeated requests with no variation can also be a red flag.
  • JavaScript challenges and browser integrity checks: Some systems insert lightweight JavaScript that runs as soon as the page loads. These scripts verify whether the browser can execute code correctly and return the expected values. Bots that disable JavaScript or use stripped-down environments often fail these checks, exposing them as non-human.

Prevention Methods

Once a request is identified as suspicious, the bot blocker applies one or more countermeasures. These are designed to stop the bot outright, challenge it, or slow it down enough to make the attack inefficient.

  • Blocking IPs or ASN ranges: Every internet-connected device has an IP address. When too many suspicious requests originate from the same IP address, the system may block that address entirely. For broader attacks, the system might block by Autonomous System Number (ASN), which refers to a group of IP addresses controlled by the same network provider, often used by VPNs, cloud services, or proxy networks. Blocking at this level can cut off thousands of abusive sources at once.
  • Requiring CAPTCHAs or JavaScript execution: If a request seems automated but not definitively malicious, the system might respond with a challenge. This could be a CAPTCHA or a requirement to run a short JavaScript task in the browser. Bots that can’t solve CAPTCHAs or don’t support JavaScript execution typically fail at this step and are filtered out.
  • Cookie-based validation and token systems: To track whether a session behaves consistently, some bot blockers issue a cookie or token to the browser, allowing them to verify the user’s identity. This is a small piece of data stored on the client and sent back with future requests. If the token is missing, reused incorrectly, or manipulated, it suggests the session isn’t following normal behavior, and the system can block or challenge it accordingly.
  • Redirects, honeypots, and tarpit delays: These are low-level traps designed to confuse or slow down bots. Redirects can send bots to fake or dead-end pages, keeping them away from real content. Honeypots are invisible form fields or links that regular users never see, but bots, which often fill in everything automatically, will interact with them and reveal themselves. Tarpits deliberately delay server responses, forcing bots to waste time and resources without affecting actual users.
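
As an illustration of the token idea above, here is roughly how a server might sign and validate a session token. This is a simplified sketch using an HMAC over a session id and timestamp; real systems combine many more signals:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # illustrative value; a real key stays private

def issue_token(session_id):
    """Sign a session id together with a timestamp so tampering or
    expiry can be detected later."""
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, f"{session_id}:{ts}".encode(), hashlib.sha256).hexdigest()
    return f"{session_id}:{ts}:{sig}"

def verify_token(token, max_age=3600):
    """Reject tokens that are missing pieces, manipulated, or expired."""
    try:
        session_id, ts, sig = token.split(":")
    except ValueError:
        return False  # malformed token
    expected = hmac.new(SECRET, f"{session_id}:{ts}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # signature doesn't match: token was tampered with
    return time.time() - int(ts) <= max_age
```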

Together, these detection and prevention layers help websites filter out malicious automation while allowing real users to browse without friction. But no system is foolproof. As bot blockers become more advanced, so do the tools designed to get around them. In the next section, we’ll look at whether bot blockers can be bypassed and how some services are designed to address this challenge.

Can Bot Blockers Be Bypassed? 

Modern bot blockers are designed to detect not only high request volumes or suspicious IP addresses, but also more sophisticated bot activity. They’re designed to detect automation that mimics user behavior, fakes browser details, or skips steps like JavaScript execution and token validation. For anyone working with web scraping or automation, the question isn’t just if bot blockers can be bypassed, but how to do it consistently without getting flagged.

As we explored earlier in the article, most detection systems look for a pattern of clues rather than relying on one telltale sign. They examine behavior, analyze fingerprint data, and track how users interact with a site. To bypass these protections, you have to think like the system, and then design your toolset to stay just outside of what it considers suspicious.

Key Requirements for Beating Bot Detection

At the core, a bypass strategy needs to replicate what a real user does when visiting a site. That includes everything from how the page is loaded to how often requests are made. For example:

  • JavaScript Rendering: Many sites don’t expose key content or tokens until after JavaScript runs. Bots that skip JS execution often miss out on the actual page content or get caught when token checks fail.
  • Fingerprint Consistency: Sites can detect if something feels “off” about your browser. Are the fonts missing? Is the screen size set to 0x0 (which usually results from poorly set up headless scraping tools)? Are the expected plugins absent? These details form a fingerprint, and if it doesn’t match what a real browser would produce, it raises flags.
  • Session Management: A user might keep a session alive across multiple clicks or pages. Some sites use session-based tokens or cookies that change over time. If your scraper starts fresh on every request, it could look abnormal.
  • Request Timing and Flow: Real users don’t send 50 requests in one second. They pause, scroll, and explore. Mimicking these delays—or randomizing them—can help avoid rate-based triggers.
  • Clean IP Infrastructure: Requests from cloud servers or public proxies often get blocked immediately. Residential or mobile IPs blend in more easily because they reflect real-world usage patterns.

By combining all of these elements, you can closely approximate human-like browsing behavior, at least from the server’s perspective.

Best Practices for Bypassing Bot Blockers

The most reliable bypass strategies are the ones that evolve with the environment. Web protection changes fast, and scraping tools that work today might break tomorrow. A few principles tend to hold up over time:

  • Focus on realism: The more your request resembles real browsing, complete with headers, cookies, user behavior, and rendering, the better your chances of success.
  • Rotate carefully: IPs and user-agents should be rotated thoughtfully, not randomly. Too much variation too quickly can be just as suspicious as no variation at all.
  • Track site behavior: If a site starts setting new tokens, introducing CAPTCHA, or changing resource paths, adapt accordingly. Detection is dynamic.
  • Preserve sessions: In many cases, reusing sessions across requests allows bots to behave more like real users, particularly on sites that expect multi-step interactions.
  • Retry gracefully: Bots that crash on the first error tend to reveal themselves. Retry logic with backoff, fallback IPs, or alternate routes can improve reliability and avoid full lockouts.
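
The backoff part of that retry logic can be very small. A sketch (plug the delay into whatever HTTP client and error handling you use):

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff with jitter: 1s, 2s, 4s, ... capped at `cap`,
    plus up to a second of randomness so parallel retries don't synchronize."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

# Usage sketch:
# for attempt in range(max_retries):
#     try:
#         return fetch(url)
#     except TransientError:
#         time.sleep(backoff_delay(attempt))
```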

If you want to explore specific techniques in more detail, take a look at our How to Bypass Bot Detection guide. It walks through real examples, request flows, and ways to adjust your approach based on the type of protection you’re dealing with.

How ScraperAPI Helps You Bypass Bot Protection

If you’ve tried putting these best practices into action, you already know how much work it takes to keep things running smoothly. Getting past a bot blocker isn’t just about solving one problem; it’s managing a stack of constantly moving parts: proxy rotation, fingerprinting, CAPTCHAs, session handling, and rendering.

ScraperAPI is built to handle all of that for you.

Instead of stitching together proxies, headless browsers, and CAPTCHA solvers on your own, you can make a single API request. ScraperAPI takes care of the backend, allowing you to focus on the data. It automatically:

  • Routes traffic through a global pool of residential and mobile IPs
  • Handles JavaScript rendering when needed
  • Manages cookies, headers, and tokens in the background 
  • Uses realistic browser fingerprints to avoid detection
  • Bypasses protections like Cloudflare, Datadome, and PerimeterX

If you’re scraping at scale or working with data from high-friction sites like Amazon or Google, you can also use our Structured Data Endpoints. These return clean, usable JSON for things like product listings, search results, job ads, and more, so you don’t have to parse HTML or maintain custom scrapers.

And if you need even more control, ScraperAPI also supports asynchronous scraping, allowing you to send millions of requests in parallel without exceeding rate limits or exhausting IP addresses.

The bottom line? You can build your setup, and many developers do. But if you’d rather skip the infrastructure work and avoid spending hours debugging IP blocks or CAPTCHA triggers, ScraperAPI gives you a faster, more reliable path forward.

Conclusion

Bot blockers have become a standard part of the modern web, designed to filter out everything from basic scrapers to advanced automation tools. Understanding how they work and how to work around them can make a big difference, whether you’re collecting market data, monitoring pricing, or building a search tool.

By thinking like a detection system and focusing on realism, you can significantly improve your chances of staying under the radar. And if you’re looking for a simpler way to handle the challenging aspects, such as proxy rotation, CAPTCHA solving, and JavaScript rendering, ScraperAPI can help.

If you’re working on a project that needs reliable data at scale, you can sign up for a free trial and get 5,000 API credits to start. Need something bigger? Contact us to request a custom trial tailored to your specific use case.

How to Bypass Bot Detection in 2025: 7 Proven Methods https://www.scraperapi.com/web-scraping/how-to-bypass-bot-detection/ Fri, 18 Jul 2025 12:16:06 +0000

Scraping data from the web has never been more challenging. Websites now use layered protection systems that can detect even subtle signs of automation, blocking requests before they ever reach the page.

But as blockers have evolved, so have scrapers. With the proper setup, you can collect data at scale without constantly getting blocked.

In this guide, you’ll learn:

  • What bot detection is and how it works
  • The most effective tool for bypassing modern defenses
  • 7 tested techniques that help you avoid getting flagged
  • How to make your requests look and act more like a real user

If you’re encountering CAPTCHAs, rate limits, or blocked IPs, this guide will show you how to bypass them without wasting time or compromising your scrapers.

1. Best Solution: Use a Web Scraping Tool to Bypass Bot Detection

If you’re scraping at any kind of scale, getting blocked isn’t a matter of if, but when. Sites are constantly updating their detection systems, and keeping up with those changes requires time, effort, and a significant amount of trial and error.

That’s why many developers turn to tools that handle these issues automatically.

ScraperAPI is one of the most straightforward ways to bypass bot detection without managing a stack of proxies, headless browsers, or CAPTCHA solvers. It acts as a middle layer between your scraper and the target website, handling the things that typically trigger blocks, like browser fingerprinting or missing tokens.

Here’s what ScraperAPI does for you behind the scenes:

  • Built-in proxy rotation with access to millions of residential and mobile IPs, spread across 200+ countries
  • Automatic CAPTCHA solving, so you don’t have to pause or reroute traffic when challenges appear
  • JavaScript rendering using real browser environments to access dynamic content that doesn’t load on static requests
  • Session and header management that mimics real user traffic with consistent cookies, user-agents, and timing
  • 99.99%+ success rate on high-friction sites protected by Cloudflare, Datadome, PerimeterX, and others

Instead of assembling five different services to keep your scrapers running, you can send a single request and receive a clean, usable response.

Here’s an example of using ScraperAPI to scrape and save the contents of a blog article in Markdown format:

import requests

payload = {
    'api_key': 'YOUR_API_KEY',   # your ScraperAPI key
    'url': 'https://blog.hubspot.com/sales/ultimate-guide-creating-sales-plan',
    'country': 'us',             # geotarget the request to the US
    'output_format': 'markdown'  # return the page as Markdown instead of HTML
}

# ScraperAPI fetches the target URL on your behalf, handling proxies,
# rendering, and retries behind the scenes.
response = requests.get('https://api.scraperapi.com/', params=payload)
product_data = response.text

# Save the rendered article as a Markdown file.
with open('hubspot-product.md', 'w', encoding='utf-8') as f:
    f.write(product_data)

This request automatically handles proxy routing, JavaScript rendering, and any hidden token issues, without requiring you to run a browser or manually solve CAPTCHAs.

For more specific use cases, check out our other tutorials on how to scrape sites with some of the toughest protections.

If you’re just starting or tired of scripts breaking every time something changes, ScraperAPI gives you a stable foundation to build on, so you can focus on the data, not the defenses.

Simple Methods to Bypass Bot Detection

While using a scraping tool like ScraperAPI can handle most of the heavy lifting, it’s still helpful to understand the core techniques that detection systems look out for—and how to get around them manually if needed.

These methods form the backbone of most bypass strategies. Whether you’re writing your scraper from scratch or fine-tuning an existing setup, these approaches can help your traffic look more like a real user and less like a bot.

Here are the first few techniques to focus on:

2. Proxy Rotation Strategies to Avoid Blocks

One of the most common reasons a scraper gets blocked is that it sends too many requests from the same IP address. To avoid this, proxy rotation is essential.

Proxy rotation is the process of switching IP addresses between requests, making it appear as though the traffic is coming from different users in different locations. This helps you avoid rate limits, IP bans, and geo-based restrictions.

There are three main types of proxies you can use:

  • Datacenter proxies: These are fast and inexpensive, but they are often flagged more easily. Many sites can recognize traffic from cloud providers or data centers and will block or throttle it.
  • Residential proxies: These route your traffic through real devices connected to home networks. Because they look like everyday users, they’re much harder for detection systems to block, but they’re more expensive.
  • Mobile proxies: These use real mobile devices and networks. They’re the most difficult to detect and block, making them especially useful for high-security targets, but they tend to be the most costly.

Here’s a simple example of rotating proxies using requests:

import requests
import random

proxies = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000"
]

url = "https://example.com"

chosen_proxy = random.choice(proxies)
proxy = {"http": chosen_proxy, "https": chosen_proxy}

response = requests.get(url, proxies=proxy)
print(response.status_code)

A smart rotation strategy doesn’t just swap IPs randomly; it matches IP type to the target site’s sensitivity, manages request frequency per IP, and avoids patterns that look scripted. For better success, combine this with user-agent rotation and header spoofing.
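To make that concrete, here is a minimal sketch of a rotator that always picks the proxy that has rested the longest and enforces a per-IP cooldown. The proxy URLs and the 2-second cooldown are illustrative assumptions, following the format from the snippet above:

```python
import time

class ProxyRotator:
    """Pick the proxy that has rested the longest, with a per-IP cooldown."""

    def __init__(self, proxies, cooldown=2.0):
        self.proxies = list(proxies)
        self.cooldown = cooldown
        self.last_used = {p: 0.0 for p in self.proxies}

    def next_proxy(self):
        # Least-recently-used proxy goes first
        proxy = min(self.proxies, key=lambda p: self.last_used[p])
        # Wait out any remaining cooldown so this IP isn't hit too often
        remaining = self.cooldown - (time.monotonic() - self.last_used[proxy])
        if remaining > 0:
            time.sleep(remaining)
        self.last_used[proxy] = time.monotonic()
        return proxy

rotator = ProxyRotator([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
])
```

Each call to `rotator.next_proxy()` can then be dropped into the `proxies` dictionary from the earlier snippet, replacing the purely random choice.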

3. User-Agent Strings to Mimic Real Users

Every time your browser connects to a website, it sends a User-Agent string—basically a label that tells the server what kind of browser and device you’re using. Detection systems often use this to verify whether a request is coming from a real browser.

If your scraper sends a default or outdated User-Agent, it can quickly be flagged as a bot. Updating your User-Agent string to mimic real browsers is a simple but effective way to blend in.

Here are a few examples of User-Agent strings:

  • Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.4 Safari/605.1.15
  • Mozilla/5.0 (iPhone; CPU iPhone OS 16_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Mobile/15E148 Safari/604.1

Here’s a quick example of rotating User-Agent Strings:

import requests
import random

headers = {
    "User-Agent": random.choice([
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_4) AppleWebKit/605.1.15 Safari/605.1.15",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 16_3_1 like Mac OS X) AppleWebKit/605.1.15 Mobile/15E148 Safari/604.1"
    ])
}

response = requests.get("https://example.com", headers=headers)
print(response.text)

It’s also important to rotate User-Agent strings periodically and match them to your other request headers, such as platform and screen size, for better consistency.
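One way to keep that consistency is to rotate whole profiles rather than lone User-Agent strings, so the platform hints always agree with the UA they ship with. The profiles below are illustrative examples, not headers captured from real sessions:

```python
import random

# Each profile bundles a User-Agent with a platform client hint that matches it;
# both are Chrome builds, so the Sec-CH-UA-Platform header stays plausible
PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-CH-UA-Platform": '"Windows"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-CH-UA-Platform": '"macOS"',
    },
]

def pick_profile():
    """Return a copy of a random profile so callers can add per-request headers."""
    return dict(random.choice(PROFILES))

headers = pick_profile()
# headers can now be passed straight to requests.get(url, headers=headers)
```

Because each profile is rotated as a unit, you never end up sending a Windows User-Agent alongside a macOS platform hint.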

4. Header Randomization to Appear as a Real Browser

Web servers don’t just rely on your IP or User-Agent to detect bots. They also analyze other HTTP headers, the metadata sent along with each request. If your headers are missing, out of order, or don’t match typical browser behavior, that’s often a red flag.

Common headers that detection systems look at include:

  • Accept
  • Accept-Language
  • Accept-Encoding
  • Referer
  • Connection
  • Upgrade-Insecure-Requests
  • DNT (Do Not Track)

Real browsers send these headers in a specific structure, and that structure often varies by browser and device. By randomizing or rotating headers and making sure they match your User-Agent, you reduce the chances of being flagged.

Some advanced tools (like ScraperAPI) do this automatically, but if you’re building your scraper, it might be worth collecting real browser headers and rotating them based on context.

Here’s how to spoof realistic headers manually:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Referer": "https://google.com/",
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1"
}

response = requests.get("https://example.com", headers=headers)
print(response.text)

To go a step further, you can rotate through multiple header sets based on the type of user agent you’re mimicking. Browser DevTools, Puppeteer in headful mode, and tools like Selenium with logging enabled can help you capture real-world headers from actual sessions.

Even with perfect headers, traditional headless browsers like Selenium and Puppeteer can still get flagged. That’s because they expose subtle clues, like the presence of webdriver=true, missing browser features, or unusual JavaScript behavior.

Tools like Undetected ChromeDriver (UC) help patch those gaps by automatically adjusting or removing detectable automation flags. When paired with proper headers, it becomes significantly harder to distinguish your browser from a real one.

Example using Undetected ChromeDriver:

# pip install undetected-chromedriver selenium setuptools

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import random

# Set up stealth options
options = uc.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("--disable-blink-features=AutomationControlled")
options.headless = False 

# Launch browser
driver = uc.Chrome(options=options)
driver.set_window_size(1280, 800)

# Visit a protected site
url = "https://www.scraperapi.com/"
driver.get(url)

# Wait to simulate reading time
time.sleep(random.uniform(3, 6))

# Scroll like a user
for i in range(0, 1000, 100):
    driver.execute_script(f"window.scrollTo(0, {i});")
    time.sleep(random.uniform(0.3, 0.8))

# Optional: Click a visible button or link (simulating intent)
try:
    cta_button = driver.find_element(By.CLASS_NAME, "elementor-button-text")
    cta_button.click()
    time.sleep(random.uniform(2, 4))
except Exception as e:
    print("CTA not found or not clickable:", e)

# Print a success message
print("Page loaded and user-like interaction complete.")

# Clean up
driver.quit()

UC automatically sets headers, adjusts browser fingerprints, and removes known detection flags, all of which would otherwise require manual patching.

If you’re scraping sites that use tools like Cloudflare, Datadome, or PerimeterX, combining header spoofing with a stealthy browser setup is often the difference between success and instant blocks.

Advanced Methods to Bypass Bot Detection: Human-Like Interaction

Once you’ve covered the basics, such as rotating proxies and headers, the next challenge is behavioral detection. Many advanced bot protection systems now analyze how a visitor behaves on the page, how they move their mouse, how fast they scroll, and whether their clicks and typing feel “real.”

That means it’s no longer enough just to send valid requests. You need to mimic how a human interacts with the site, even if you’re doing it programmatically.

5. Randomized Mouse Movements

One of the simplest ways to trigger a bot detection system is to move through a page without ever using the mouse. Real users scroll, hover, and move the cursor in erratic patterns, even if they don’t click on anything. Bots, on the other hand, often move directly to their target elements without any additional motion.

That’s why simulating human-like mouse movements is a valuable part of a bypass strategy. Detection systems often monitor cursor behavior to determine whether a session looks genuine. A complete absence of mouse activity, or movements that are too linear or precise, can raise suspicion.

Using tools like Selenium, you can program your scraper to mimic these natural patterns. Here’s a simple example of how to simulate randomized cursor jitter before clicking on an element:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time, random

driver = webdriver.Chrome()
driver.get("https://books.toscrape.com")
time.sleep(2)

# Find the target element
target = driver.find_element(By.LINK_TEXT, "Home")

# Initialize ActionChains
actions = ActionChains(driver)

# Start by moving to the element directly
actions.move_to_element(target).pause(0.3).perform()

for _ in range(5):
    offset_x = random.randint(-5, 5)
    offset_y = random.randint(-5, 5)
    actions.move_by_offset(offset_x, offset_y).pause(random.uniform(0.05, 0.15)).perform()
    # Reset to the element to avoid pointer drifting
    actions.move_to_element(target).pause(0.1).perform()

# Final click
actions.move_to_element(target).click().perform()

time.sleep(2)
driver.quit()

This type of interaction helps reduce the chances of being flagged by systems that expect users to hover over elements or generate natural input noise before clicking.
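To go beyond small jitter, you can also make the cursor travel along a curved, human-looking route before it reaches the element. This plain-Python sketch (no Selenium required; the step count and jitter ranges are arbitrary choices) samples points along a quadratic Bézier curve between a start and end position:

```python
import random

def human_path(start, end, steps=20):
    """Generate jittered points along a quadratic Bézier curve from start to end."""
    (x0, y0), (x1, y1) = start, end
    # A random control point bends the path so it isn't a straight line
    cx = (x0 + x1) / 2 + random.uniform(-80, 80)
    cy = (y0 + y1) / 2 + random.uniform(-80, 80)
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        # Keep the endpoints exact; wobble everything in between
        jx = jy = 0.0
        if 0 < i < steps:
            jx, jy = random.uniform(-2, 2), random.uniform(-2, 2)
        points.append((x + jx, y + jy))
    return points

path = human_path((0, 0), (300, 120))
```

Replaying the path is then a matter of calling move_by_offset with the difference between consecutive points, pausing briefly between steps.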

6. Typing and Scrolling Delays

Real users don’t fill out forms instantly or scroll through a page in a perfectly linear way. They pause, make minor corrections, and take time between actions. Bots, on the other hand, tend to type and scroll with machine-like precision and speed—something most detection systems are trained to spot.

Adding realistic delays to your interactions helps your scraper blend in. For example, when entering text into a search field, simulate keystrokes instead of injecting the full value in one command. Likewise, scroll in steps rather than jumping from top to bottom instantly.

Here’s how you can simulate human-like typing and scroll behavior using Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import random

driver = webdriver.Chrome()
driver.get("https://www.wikipedia.org/")
time.sleep(2)

search_box = driver.find_element(By.ID, "searchInput")

# Type each character with a slight delay
search_term = "web scraping"
for char in search_term:
    search_box.send_keys(char)
    time.sleep(random.uniform(0.1, 0.3))  # Simulate typing speed variation

# Simulate scroll delay
for _ in range(5):
    driver.execute_script("window.scrollBy(0, 200);")
    time.sleep(random.uniform(0.5, 1.2))  # Mimic inconsistent scroll speed

time.sleep(2)
driver.quit()

This subtle randomness gives your scraper a more human-like rhythm, especially on pages with behavioral detection, such as login forms, search bars, or infinite scroll. If your bot is too fast or too predictable, it’ll stand out.

If you’re using ScraperAPI, you can simulate some of these actions without running a full browser locally by using the Render Instruction Set. For example, scrolling to load content and waiting for it to appear:

headers = {
    'x-sapi-api_key': '<YOUR_API_KEY>',
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': '[{"type": "scroll", "direction": "y", "value": "bottom"}, {"type": "wait", "value": 4}]'
}

This is ideal for infinite scroll pages or dynamic content that only loads after user interaction.
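For completeness, here is a sketch of a full request using those headers. It assumes the standard api.scraperapi.com endpoint with a url parameter, as in the earlier examples, and uses a placeholder API key:

```python
import json
import requests

instruction_set = [
    {"type": "scroll", "direction": "y", "value": "bottom"},
    {"type": "wait", "value": 4},
]

headers = {
    "x-sapi-api_key": "<YOUR_API_KEY>",
    "x-sapi-render": "true",
    # Serialize the instructions so the header carries valid JSON
    "x-sapi-instruction_set": json.dumps(instruction_set),
}

def fetch_rendered(page_url):
    """Fetch a page with rendering and scroll instructions applied."""
    response = requests.get(
        "https://api.scraperapi.com/",
        params={"url": page_url},
        headers=headers,
    )
    response.raise_for_status()
    return response.text
```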

7. Idle Time Simulation

Even the best scrapers fail if they behave too efficiently. Bots that instantly perform actions one after the other look suspicious. In contrast, users tend to pause: reading, thinking, switching tabs.

Idle simulation adds a “natural” amount of dead time between your actions. This reduces detection from behavioral pattern recognition tools that track session timing and rhythm.

A few tips:

  • Wait between navigation and interaction
  • Randomize the idle time to break consistent patterns
  • Simulate tab-switching delays (waits of 5–15 seconds work well)
  • Combine idle time with minor mouse movements or focus shifts

Here’s an idle simulation function in Python:

import time
import random

def idle():
    pause = random.uniform(5, 12)
    print(f"Simulating user pause for {pause:.2f} seconds...")
    time.sleep(pause)

Use this between actions to let the page settle or to mimic a user hesitating before making a choice.

Conclusion

Bypassing bot detection in 2025 takes more than just changing your IP or tweaking a few headers. Modern websites employ sophisticated, multi-layered defenses, analyzing everything from IP addresses and browser fingerprints to user movement, scrolling, and interaction with the page.

This guide walked through seven proven strategies to help your scrapers blend in, from proxy rotation and realistic headers to simulating mouse movements, typing delays, and idle time. These methods will give you the control to build scrapers that go undetected on even the most protected sites.

If you’d rather skip the trial-and-error and focus on getting reliable data, ScraperAPI can handle the complex parts for you. You can sign up for a free trial with 5,000 credits, or reach out to request a custom trial tailored to your specific needs.

The post How to Bypass Bot Detection in 2025: 7 Proven Methods appeared first on ScraperAPI.

]]>
How to Scrape Redfin Property Data with Python https://www.scraperapi.com/web-scraping/redfin/ Tue, 01 Jul 2025 06:12:27 +0000 https://www.scraperapi.com/?post_type=web_scraping&p=8069 Real estate professionals, investors, and researchers need vast amounts of structured property data to make informed decisions, but manually collecting this information would be painfully slow and inefficient.   In this tutorial, we will walk through how to automate Redfin data extraction using Python and ScraperAPI’s Structured Data Endpoints (SDEs). You’ll learn to programmatically collect: By […]

The post How to Scrape Redfin Property Data with Python appeared first on ScraperAPI.

]]>
Real estate professionals, investors, and researchers need vast amounts of structured property data to make informed decisions, but manually collecting this information would be painfully slow and inefficient.  

In this tutorial, we will walk through how to automate Redfin data extraction using Python and ScraperAPI’s Structured Data Endpoints (SDEs). You’ll learn to programmatically collect:

  • Property listings (both for sale and rent)
  • Detailed home information (price history, features, taxes)
  • Agent profiles (performance metrics, specialties, contact details)

By the end of this article, you’ll be able to build your own dataset for market analysis, investment research, or competitive benchmarking, avoiding manual data entry. 

Let’s dive in.

TL;DR: Scrape Redfin Data without Interruptions

Redfin property data is quite valuable, so the site uses advanced anti-scraping mechanisms to block scrapers from accessing it:

  • Complex CAPTCHA and JavaScript challenges
  • Behavior analysis and browser fingerprinting 
  • Rate limiting and IP blocks

ScraperAPI automatically bypasses all of these challenges without any extra setup from your end, letting you collect Redfin data in clean JSON or CSV format.

See it for yourself; create a free ScraperAPI account and add your api_key before running the code below:

import requests
import json

payload = {
  'api_key': 'YOUR_API_KEY',
  'url': 'https://www.redfin.com/MD/Baltimore/Domain-Brewers-Hill/apartment/45425151',
  'country': 'us'
}

response = requests.get('https://api.scraperapi.com/structured/redfin/forrent', params=payload)
data = response.json()

with open('redfin-rent-page.json', 'w') as f:
  json.dump(data, f)

Want to test our API at scale? Contact our sales team to get a custom trial, which includes:

  • Personalized onboarding
  • Custom scraping credits and concurrent thread limits
  • Dedicated support Slack channel

And a dedicated account manager to ensure successful integration with your infrastructure.

Why Scrape Redfin.com?

Redfin provides comprehensive property data that’s valuable for:

  • Real estate market analysis and trend identification
  • Investment opportunity evaluation
  • Comparative market analysis for pricing strategies
  • Neighborhood and school district research
  • Rental yield calculations
  • Historical sales and price change tracking

Redfin Data Fields to Scrape

  • Basic property details (address, price, beds, baths, square footage)
  • Property features and amenities
  • Historical price changes and days on market
  • Neighborhood information and school ratings
  • Agent and brokerage details
  • Property tax history
  • Sale and rental history
  • Comparable properties in the area

Project Requirements

Before diving into the integration, ensure you have the following:

  1. A ScraperAPI Account: Sign up on the ScraperAPI website to get your API key. ScraperAPI simplifies the extraction process by taking care of JavaScript rendering, CAPTCHA solving, and proxy rotation. Joining a seven-day trial will grant you 5,000 free API credits, which you can use whenever you’re ready.   
  2. A Python Environment: Make sure you have Python (version 3.7+ recommended) installed on your system. You’ll also need the following libraries:
    • requests: for making HTTP requests to ScraperAPI.
    • python-dotenv: to load your credentials from your .env file and manage your API key securely.
    • matplotlib: to visualize your data.

You can install them with this pip command:

pip install requests python-dotenv matplotlib
  3. An IDE or Code Editor: For creating and executing your Python scripts, like Visual Studio Code, PyCharm, or Jupyter Notebook.

How to Scrape Redfin Property Pages with ScraperAPI

While search results provide basic information, individual property pages contain much more detailed data. Redfin offers two different types of property pages, one for properties for sale and another for rental properties. These pages have different structures and layouts, so ScraperAPI provides separate endpoints for each type.

Sale vs. Rental Property Pages

Sale property pages have the following information:

  • Payment Calculator
  • Property features
  • Agent Details

Rental property pages have the following information:

  • Monthly rent
  • Pet policies
  • Amenities

Let’s look at how to scrape both types of pages.

Setting up the environment

First of all, create a .env file in the root directory of your project. Here, you will store your credentials (in this case, your API key), ensuring that they are not made public by mistake. Make sure to add this file to your .gitignore file to avoid pushing sensitive data to version control.

You can create the .env  file by running:

touch .env

In it, define your API key:

SCRAPERAPI_KEY= "YOUR_API_KEY"  # Replace with your actual ScraperAPI key

Now we can import them into our Python file:

import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
SCRAPERAPI_KEY = os.getenv('SCRAPERAPI_KEY')

We will use this setup across all of our examples.

Scraping Redfin sale property details

To scrape property-for-sale information from Redfin, we are going to use the ScraperAPI Redfin Sales SDE. This lets us pull structured data from Redfin without dealing with JavaScript rendering, anti-bot blocks, or endless HTML parsing.

Instead of scraping raw pages, you’ll get back ready-to-use sale property details, pricing, address, beds, baths, square footage, and more in JSON.

It’s faster, more reliable, and it works really well for building real estate tools, dashboards, or market analysis scripts.

import requests
import json
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()
SCRAPERAPI_KEY = os.getenv('SCRAPERAPI_KEY')

SCRAPERAPI_SALE_ENDPOINT = "https://api.scraperapi.com/structured/redfin/forsale"
TARGET_URL = "https://www.redfin.com/NY/Howard-Beach/15140-88th-St-11414/unit-1G/home/57020499"

def fetch_sale_properties():
   """Fetch property data using ScraperAPI SDE."""
   try:
       response = requests.get(SCRAPERAPI_SALE_ENDPOINT, params={
           'api_key': SCRAPERAPI_KEY,
           'url': TARGET_URL,
           'country_code': 'US'
       })
       response.raise_for_status()
       return response.json()
   except requests.exceptions.RequestException as e:
       print(f"Failed to fetch properties: {e}")
       return None

def scrape_sale_properties():
   """Scrape specific property fields and save them to a JSON file."""
   print(f"Fetching properties from {TARGET_URL} using ScraperAPI SDE...")
   data = fetch_sale_properties()
   if not data:
       print("No property data found.")
       return

   filtered_data = {
       "type": data.get("type", "N/A"),
       "price": data.get("price", "N/A"),
       "sq_ft": data.get("sq_ft", "N/A"),
       "beds": data.get("beds", "N/A"),
       "baths": data.get("baths", "N/A"),
       "description": data.get("description", "N/A"),
       "address": data.get("address", "N/A"),
       "active": data.get("active", "N/A"),
       "agent": data.get("agents", [{}])[0].get("name", "N/A")
   }

   with open("filtered_redfin_sales_page.json", "w", encoding="utf-8") as f:
       json.dump(filtered_data, f, indent=4)

   print("Scraping completed. Filtered data saved to filtered_redfin_sales_page.json.")

if __name__ == "__main__":
   scrape_sale_properties()

The code uses ScraperAPI’s structured data endpoint to collect key details from a Redfin property listing. It filters the response to include only useful data points such as:

  • Property type
  • Price
  • Square footage
  • Number of bedrooms and bathrooms
  • A short description
  • Full address
  • Listing status (active or not)
  • The agent’s name

The `fetch_sale_properties()` function handles the request to ScraperAPI and retrieves the JSON response. The `scrape_sale_properties()` function then extracts just the needed fields and saves them to a file named `filtered_redfin_sales_page.json`.

Your data should look like this:


Scrape Redfin for Rent Properties

Just like we did for Redfin property-for-sale, you can easily extract property-for-rent details using the ScraperAPI-Redfin-Rent SDE. With this simple API call, you have access to all property information, such as price, pet policies, schools in the area, etc.

Using the ScraperAPI SDE makes the task simpler and faster, and gives you accurate, reliable data:

import requests
import json
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()
SCRAPERAPI_KEY = os.getenv('SCRAPERAPI_KEY')
SCRAPERAPI_RENT_ENDPOINT = "https://api.scraperapi.com/structured/redfin/forrent"
TARGET_URL = "https://www.redfin.com/NY/Astoria/2718-Ditmars-Blvd-11105/home/20946748"

def fetch_rent_property():
   """Fetch property data using ScraperAPI."""
   try:
       response = requests.get(SCRAPERAPI_RENT_ENDPOINT, params={
           'api_key': SCRAPERAPI_KEY,
           'url': TARGET_URL,
           'country_code': 'US'
       })
       response.raise_for_status()
       return response.json()
   except requests.exceptions.RequestException as e:
       print(f"Error: {e}")
       return None

def scrape_rent_property():
   """Extract and save specific property fields."""
   data = fetch_rent_property()
   if not data:
       print("No data returned.")
       return

   filtered_data = {
       "name": data.get('name', 'N/A'),
       "type": data.get('type', 'N/A'),
       "map_url": data.get('map_url', 'N/A'),
       "bed_max": data.get('bed_max', 'N/A'),
       "bath_max": data.get('bath_max', 'N/A'),
       "price_max": data.get('price_max', 'N/A'),
       "description": data.get('description', 'N/A'),
       "address": f"{data.get('address', {}).get('street_line', '')}, "
                  f"{data.get('address', {}).get('city', '')}, "
                  f"{data.get('address', {}).get('state', '')} {data.get('address', {}).get('zip', '')}"
   }

   with open("Filtered_redfin_rent_page.json", "w", encoding="utf-8") as f:
       json.dump(filtered_data, f, indent=4)
   print("Filtered data saved to Filtered_redfin_rent_page.json.")

if __name__ == "__main__":
   scrape_rent_property()

The code above uses ScraperAPI’s Redfin rental SDE to extract property details from a specific Redfin rental listing. It sends a `GET` request with the API key, Redfin listing URL, and country code. With the help of ScraperAPI’s SDE, it returns structured JSON data. No need for extra parsing libraries like BeautifulSoup.

The `fetch_rent_property()` function handles the API request and returns the response if successful. The `scrape_rent_property()` function then filters key data points such as the property’s name, type, number of beds and baths, price, map URL, description, and full address. Finally, it saves the cleaned data to a file named `Filtered_redfin_rent_page.json`.

Your data should look like this:


How to Scrape Redfin.com Search Results

If your goal is to retrieve all listings from a Redfin search, ScraperAPI can simplify the process with its Redfin Search SDE:

import requests
import json
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
SCRAPERAPI_KEY = os.getenv('SCRAPERAPI_KEY')

payload = {
    'api_key': SCRAPERAPI_KEY,
    'url': 'https://www.redfin.com/city/30749/NY/New-York/apartments-for-rent',
    'country_code': 'US'
}

r = requests.get('https://api.scraperapi.com/structured/redfin/search', params=payload)
data = r.json()

with open('redfin-search-rent-page.json', 'w') as f:
    json.dump(data, f)

The code uses ScraperAPI’s SDE to search and scrape all rental listings from a Redfin search page.

  • It sends a `GET` request to the ScraperAPI Redfin search endpoint, which is designed to return structured property listings.
  • The payload includes an API key, the target Redfin URL for apartments for rent in New York City, and an optional country_code parameter set to 'US'.
  • The response (`r`) is parsed as JSON and saved to a local file named 'redfin-search-rent-page.json' using the `json.dump()` method.

These listings can be seen below. 

Note: As of the time this article was written, the search SDE cannot yet extract the URLs of each individual listing, but this feature is coming soon.
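Even so, the saved response is immediately usable. As a sketch, the helper below pulls each entry’s address and first listed price; the `listing`, `address`, and `price`/`cost` field names are taken from the search response used in the analysis section later in this article:

```python
def summarize_listings(data, limit=5):
    """Return (address, cost) pairs for the first `limit` search results."""
    rows = []
    for item in data.get("listing", [])[:limit]:
        prices = item.get("price", [])
        # Some listings carry no price entries; record None for those
        cost = prices[0].get("cost") if prices else None
        rows.append((item.get("address", "N/A"), cost))
    return rows
```

Calling summarize_listings(data) on the parsed JSON gives a quick overview of the first five rentals without touching the raw HTML.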

Collect Agent Profiles from Redfin

While sale and rental listings are the core of real estate data, agent information is just as important for analyzing market performance.

To get Redfin agent information effortlessly, ScraperAPI provides a Redfin Agent SDE.

In this section, I’ll show you how to extract agent information from Redfin using Python and ScraperAPI’s specialized Agent Details API.

import requests
import json
import time
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()
SCRAPERAPI_KEY = os.getenv('SCRAPERAPI_KEY')

SCRAPERAPI_AGENT_ENDPOINT = "https://api.scraperapi.com/structured/redfin/agent"
TARGET_URL = "https://www.redfin.com/real-estate-agents/ian-rubinstein"

REQUEST_DELAY = 2

def fetch_agents_data():
   """Fetch agent data using ScraperAPI SDE and save specific fields."""
   print(f"Fetching agent data from {TARGET_URL} using ScraperAPI SDE...")

   payload = {
       'api_key': SCRAPERAPI_KEY,
       'url': TARGET_URL
   }

   try:
       response = requests.get(SCRAPERAPI_AGENT_ENDPOINT, params=payload)
       response.raise_for_status()
       data = response.json()

       filtered_data = {
           "name": data.get("name", ""),
           "type": data.get("type", ""),
           "license_number": data.get("license_number", ""),
           "contact": data.get("contact", ""),
           "about": data.get("about", ""),
           "agent_areas": data.get("agent_areas", [])
       }

       with open('filtered_redfin_agent.json', 'w') as f:
           json.dump(filtered_data, f, indent=4)

       print("Filtered agent data saved to 'filtered_redfin_agent.json'")
       return filtered_data

   except requests.exceptions.RequestException as e:
       print(f"Failed to fetch agent data: {e}")
       return None

   finally:
       time.sleep(REQUEST_DELAY)

if __name__ == "__main__":
   agent_data = fetch_agents_data()
   if agent_data:
       print("Agent data fetched and filtered successfully.")

Here’s a breakdown of the script:

  • It sends a `GET` request with the API key and target URL, then processes the JSON response.
  • Next, it extracts key data points such as the agent’s name, type, license number, contact info, biography, and areas they serve, filtering out unnecessary details.
  • It then saves this filtered data to a JSON file named `filtered_redfin_agent.json` for easy access.

The `fetch_agents_data()` function handles the data fetching, filtering, and saving, with error handling for request failures. A delay ensures the script respects rate limits. When run, the script prints status updates to keep you informed.
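The script targets a single agent, but the same endpoint scales to several. Here is a small sketch that loops over a list of profiles while keeping a delay between requests; the agent URLs are hypothetical placeholders, and fetch_fn stands in for any function that takes a payload dict like the one built above:

```python
import time

AGENT_URLS = [
    "https://www.redfin.com/real-estate-agents/agent-one",  # hypothetical
    "https://www.redfin.com/real-estate-agents/agent-two",  # hypothetical
]

def fetch_many_agents(fetch_fn, api_key, delay=2):
    """Call fetch_fn once per agent URL, pausing between requests."""
    results = []
    for url in AGENT_URLS:
        results.append(fetch_fn({"api_key": api_key, "url": url}))
        time.sleep(delay)  # respect rate limits between calls
    return results
```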

Your data should look like this:

Wrapping Up: Analyzing Redfin Property Data

Now that you have collected data from search results, property pages, and agent profiles, you can combine them for a joint analysis or analyze each dataset individually.

Here’s a simple example that visualizes property prices by neighborhood:

import requests
import matplotlib.pyplot as plt
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
SCRAPERAPI_KEY = os.getenv('SCRAPERAPI_KEY')

# Step 1: Fetch the data

payload = {
   'api_key': SCRAPERAPI_KEY,
   'url': 'https://www.redfin.com/city/30749/NY/New-York/apartments-for-rent',
   'country_code': 'US'
}
endpoint = 'https://api.scraperapi.com/structured/redfin/search'
response = requests.get(endpoint, params=payload)
data = response.json()

# Step 2: Extract the first 10 listings
listings = data.get("listing", [])[:10]

# Step 3: Process prices and locations
locations = []
prices = []

for listing in listings:
   full_address = listing.get("address", "")
   if "|" in full_address:
       name, address = [part.strip() for part in full_address.split("|", 1)]
   else:
       name = full_address.strip()
  
   price_list = listing.get("price", [])
   for price_info in price_list:
       cost = price_info.get("cost")
       if cost:
           try:
               cost = float(cost)
               locations.append(name)
               prices.append(cost)
           except ValueError:
               continue  # skip invalid price formats

# Step 4: Visualize - Price vs. Location
plt.figure(figsize=(12, 6))
plt.bar(locations, prices, color='teal')
plt.xlabel('Location (Building Name)', fontsize=12)
plt.ylabel('Monthly Rent ($)', fontsize=12)
plt.title('Rental Prices by Location (Top 10 Listings)', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

  • The code above fetches apartment listings for New York City from Redfin using the ScraperAPI structured endpoint.
  • It sends a request with your ScraperAPI key, the Redfin page for NYC rentals, and the country code.
  • The response comes in JSON format, and the code focuses on the first 10 listings.
  • It extracts the building name from each address and gathers the corresponding prices, ignoring any missing or invalid data.
  • Finally, it creates a bar chart using matplotlib, showing rental prices by building for the top 10 listings in New York City.

Here is the final result:

ScraperAPI’s Structured Data Endpoints (SDEs) take away much of the heavy lifting usually involved in web scraping. Instead of wrestling with raw HTML, messy selectors, or fragile parsing rules, SDEs deliver ready-to-use structured JSON data.

If you followed every step, here’s what you should have accomplished with it:

  • Extracted structured property listings from both sales and rental pages
  • Collected agent details for a deeper view of the market landscape
  • Analyzed the data for market trends, price patterns, and investment insights

By automating the entire data pipeline, you transformed what would normally be hours of manual data gathering into a fast, programmatic process. And thanks to the structured format, the scraped data was immediately ready for deeper analysis, visualizations, or integration into other tools.

If you’re serious about real estate research, investment analysis, or building market dashboards, ScraperAPI can save you massive amounts of time and effort. It lets you focus on insights and strategy, not on debugging scrapers or dodging anti-bot systems.

Get started with ScraperAPI today to supercharge your data collection and make your real estate projects smarter, faster, and more scalable.

The post How to Scrape Redfin Property Data with Python appeared first on ScraperAPI.

Best Headless Browsers for Web Scraping in 2025 https://www.scraperapi.com/web-scraping/best-headless-browsers/ Wed, 25 Jun 2025 05:58:34 +0000 https://www.scraperapi.com/?post_type=web_scraping&p=7981 Headless browsers are powerful tools in data automation and scraping, but do you really understand how they work behind the scenes or the technical choices that shape them? This short guide breaks down everything you need to know, including: Plus, we’ll highlight some of the top headless browsers available today. Let’s dive in! TL;DR: Scrape […]

The post Best Headless Browsers for Web Scraping in 2025 appeared first on ScraperAPI.

Headless browsers are powerful tools in data automation and scraping, but do you really understand how they work behind the scenes or the technical choices that shape them?

This short guide breaks down everything you need to know, including:

  •     What headless browsers are
  •     What each one does best
  •     How to avoid detection while using them
  •     Tips on choosing the right one for your needs

Plus, we’ll highlight some of the top headless browsers available today. Let’s dive in!

TL;DR: Scrape Dynamic Sites More Efficiently

Headless browsers are mostly used in web scraping when the target site requires you to:

  • Render the page before collecting data, like in the case of SPAs
  • Interact with elements, like forms and next page buttons
  • Scroll to load more data, like infinite scrolling on news and ecommerce sites

However, these technologies are also more resource-intensive for your machines and demand a more complex infrastructure.

To overcome the challenges listed above without increasing complexity on your side, ScraperAPI provides two solutions:

1. ScraperAPI JS rendering

Using ScraperAPI’s render parameter allows you to extract data from single-page applications (SPAs) and dynamic sites:

import requests

payload = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://www.washingtonpost.com/',
    'render': 'true'
}

response = requests.get('https://api.scraperapi.com', params=payload)
print(response.status_code)

2. ScraperAPI’s Rendering Instructions

Some sites require some kind of interaction to load content (clicking a button, waiting for rendering, scrolling to the bottom, etc.). To overcome this challenge, you can send a set of rendering instructions to ScraperAPI:

import requests

url = 'https://api.scraperapi.com/'

# Define headers with your API key and rendering settings
headers = {
    'x-sapi-api_key': '<YOUR_API_KEY>',
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': '[{"type": "input", "selector": {"type": "css", "value": "#searchInput"}, "value": "cowboy boots"}, {"type": "click", "selector": {"type": "css", "value": "#search-form button[type=\\"submit\\"]"}}, {"type": "wait_for_selector", "selector": {"type": "css", "value": "#content"}}]'
}

payload = {
    'url': 'https://www.wikipedia.org'
}

response = requests.get(url, params=payload, headers=headers)
print(response.text)

Note: To test these scripts, create a free ScraperAPI account and add your api_key before running the code.

What is a Headless Browser, and Why Use One?

A headless browser is a type of browser that renders web pages without a graphical user interface, meaning there’s no visible window for user interaction.

A helpful way to picture a headless browser is to imagine reading this blog without any colors, layout, or fonts—just raw HTML and JavaScript behind the scenes.

Headless browsers are ideal for web scraping because they allow you to load browser pages faster, simulate user interaction, and inspect web content. For the same reasons, they can also be a good choice for browser automation tasks and benchmarking.

Headless vs Traditional Browsers

A traditional browser is the regular browser you are already familiar with. For instance, you are probably reading this guide in Chrome, Firefox, Edge, or another popular choice.

Simply put, a headless browser doesn’t have a user interface, while a traditional one does. How do the two compare in more detail?

Performance

Headless browsers’ lack of a user interface makes for better performance, simply because they have far fewer components to load. A traditional browser, by contrast, spends significant resources rendering the interactive page you see every time you open a URL.

As headless browsers are less resource-intensive, they tend to be a more economical choice for users who want to build lean scrapers without much of the overhead of frontend-heavy applications.

Interface

The most obvious difference between a traditional and a headless browser is the graphic user interface. The former has it, while the latter doesn’t. 

Interaction

The two types of browsers are typically used by different kinds of users. Traditional browsers are primarily built for people.

Headless browsers, on the other hand, are a better fit for programs that interact with the web through a command line or an API.

Why Headless Browsers Are Ideal for Scraping

Many data scientists and software engineers love to scrape websites with headless browsers. Why do they do that? Here are some of the reasons:

Bot Detection Bypass with Configuration

Many websites actively block requests from headless browsers. Simply using a headless browser can often cause your scraping attempt to fail.

That’s why proper configuration and simulating user actions are essential. When set up correctly, a headless browser can successfully scrape sites by mimicking behaviors like clicks and scrolling.

Scraping Scalability and Speed

Imagine you want to load 1000 web pages and extract data from them. Doing so manually would require extensive time and effort.

But what if the data could be scraped without even loading the UI? This is where headless browsers come in handy. Because they enable lean, fast scraping, they also make it easier to scale your operations.

User-Behavior Automation

Most modern websites are built with Next.js or other JavaScript frameworks, which make content rendering dynamic. With headless browsers, you can automate user behavior so that it mimics real interactions with the dynamic webpage you are looking to scrape. 

Dynamic Content Handling

Modern websites often use frameworks like Next.js or React, which load content dynamically using JavaScript. When you try to scrape these sites with a basic HTTP request (like using requests.get()), you only receive the initial HTML.

This is where headless browsers shine. They work just like real browsers: they load the page, execute JavaScript, and wait for all elements to render. This means you can scrape the actual, fully loaded content, not just the bare HTML shell. 
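To see why plain HTTP requests fall short, note that a SPA's initial HTML is often just an empty mount point waiting for JavaScript. Here's a rough heuristic check, using made-up sample HTML rather than any real site:

```python
import re

def looks_like_spa_shell(html: str) -> bool:
    """Heuristic: the body is a bare mount point (e.g. <div id="root">) with
    little visible text, suggesting content is rendered later by JavaScript."""
    body = re.search(r"<body.*?>(.*)</body>", html, re.S)
    if not body:
        return False
    inner = body.group(1)
    text = re.sub(r"<[^>]+>", "", inner).strip()  # strip tags, keep visible text
    has_mount = bool(re.search(r'<div[^>]+id=["\'](root|app)["\']', inner))
    return has_mount and len(text) < 50

# Illustrative HTML: a SPA shell vs. a server-rendered page
shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
full = '<html><body><article>' + 'Lots of server-rendered text. ' * 10 + '</article></body></html>'
print(looks_like_spa_shell(shell))  # True
print(looks_like_spa_shell(full))   # False
```

If a plain `requests.get()` response looks like the first case, you need a headless browser (or a rendering API) to get the real content.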

Top Headless Browsers Compared

Are you looking for the best headless browsers you can use for your scraping work? Here are some options to consider: 

Puppeteer 

Puppeteer is one of the best headless browser libraries out there. Built with JavaScript, it’s a lightweight tool for anyone to scrape or automate the web using a headless browser. 

Key Use Cases

Advanced User-input Automation

Puppeteer offers advanced automation, capable of simulating many kinds of user input. Simulating these inputs makes your scraping bots

  • look more natural 
  • interact well with JavaScript-heavy websites 

Some of these user activities include keyboard input, scrolling, and form submission. With Puppeteer, you can scrape interactive JavaScript websites headlessly at scale. 

Pre-rendered Content Generation

As we saw earlier, dynamic content doesn’t fully load at the initial request. This can trick a naive scraper into parsing an empty HTML shell. 

Puppeteer proves to be a good choice for this, as it can execute the full loading of the web page so the scraping program can extract the actual displayed content.

Native Screenshot and PDF Generation

Puppeteer supports native screenshots, which is useful for capturing images of web pages at scale. This feature can save you a lot of time, as a single program can automatically take screenshots of multiple pages.

In addition to screenshots, Puppeteer also allows you to extract information and save it directly as PDFs, making document generation easier.

Testing Chrome Extensions

Puppeteer was created by the same team that develops Google Chrome DevTools.

Because Puppeteer and Chrome extensions are both built around Chrome’s architecture, they “speak the same language” under the hood. This makes Puppeteer especially good for testing Chrome extensions. You can automate running and interacting with extensions in a way that fits naturally with how Chrome works.

Pros and Cons

Pros:

  • Screenshot and PDF generation
  • Perfect for testing Chrome extensions
  • Advanced user-input dexterity
  • Crawls Single-Page Applications easily
  • Generation of pre-rendered content
  • Optimized for the Google experience

Cons:

  • Only works on Node.js
  • Limited to the Google ecosystem
  • Often has release incompatibility issues with Chromium
  • Lack of multi-browser support

Languages and browsers supported

Puppeteer was built for Node.js, a backend JavaScript runtime, and targets Chromium-based browsers such as Chrome and Microsoft Edge. 

Stealth compatibility or add-ons

Puppeteer has ecosystem plugins that help it evade basic bot detection, including:

  • puppeteer-extra-plugin-stealth: A plugin that applies multiple stealth techniques to help Puppeteer scripts avoid detection by anti-bot systems. It mimics more human-like browser behavior.
  • puppeteer-extra-plugin-anonymize-ua: A plugin that anonymizes the user-agent string to reduce the risk of detection. It can spoof or randomize the user-agent to avoid patterns typically associated with headless browsers.

Here is the official link to Puppeteer. 

Playwright

With over 72k stars on GitHub, Playwright is top of the list of best headless browsers. Its diverse and flexible nature makes it stand out. 

Key Use Cases

Multi-browser Support

Some headless browsers support only one or two browsers, but with Playwright, you have more options. It supports Firefox, Chromium, and WebKit for headless scraping.

Friendly to Mobile App Scraping

Do you want to scrape Uber or other mobile-first applications? Then you might want to use Playwright. 

It offers built-in mobile emulation for Android in Google Chrome. This can save you time, as you get an emulated mobile screen to work with right on your desktop. 

Native Retry and Auto-wait

Playwright was built with the dynamic web in mind. It automatically retries assertions until they pass, saving you from retrying requests manually. 

In addition, it automatically waits for elements to load before trying to scrape them. This built-in approach frees you from having to set timeouts manually. 
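Playwright handles this internally, but the concept behind auto-waiting can be sketched in plain Python: poll a condition until it becomes truthy or a timeout expires. This is a simplified illustration of the idea, not Playwright's actual API:

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` elapses,
    mirroring what auto-waiting frameworks do before touching an element."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Simulate an element that "appears" after a short delay
start = time.monotonic()
selector = wait_for(lambda: time.monotonic() - start > 0.2 and "button#submit")
print(selector)  # button#submit
```

Frameworks with native auto-wait run this kind of loop for you on every action, which is why your scripts need far less explicit sleeping.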

Native tools

Playwright has some internal dev tools that can be helpful to both web testers and scrapers. Chief among them is the Playwright Inspector. You can use it to look more intently at pages, get selectors, and audit the logs. 

Pros and Cons

Pros:

  • Offers auto-wait and auto-retry
  • Offers native mobile simulation
  • Supports cross-browser activities
  • Supports several languages
  • Native tools

Cons:

  • Only supports Android mobile simulation
  • Doesn’t have native stealth plugins

Languages and browsers supported

Playwright supports TypeScript, JavaScript, Python, Java, and .NET. 

Stealth compatibility or add-ons

Playwright does not offer native stealth plugins. However, because it was created by much of the same team behind Puppeteer and exposes a similar API, stealth plugins originally developed for Puppeteer have been adapted to work with it.

Here is the official GitHub page of Playwright.

Selenium

Selenium is an open-source library built primarily for web testing, but it has also proven very useful for scraping JavaScript-heavy webpages. 

If Puppeteer and Playwright don’t fit your technical spec, Selenium is worth checking out. 

Key Use Cases

Automatic Waiting Methods

Sometimes, the scraping program runs before the webpage fully loads, which causes a race condition. That’s why automatic (or implicit) waits are important: they pause execution until the page is ready.

Additionally, Selenium offers an explicit wait method that you can customize to fit your specific requirements.

Multi-browser Support

The library supports several popular browsers, such as Google Chrome, Edge, Firefox, Internet Explorer, and Safari. Each browser also has its own custom features and capabilities. 

Bidirectional Model

In many web testing scenarios, communication is mostly request and response. But what happens if the server that sends responses also needs to make a request?

This is where bidirectional communication comes into play. It’s an edge case in web testing and can be complex to handle, but Selenium makes it straightforward: you just need to enable the BiDi option in your driver configuration.

More Realistic User Simulation

Modern websites can be complex, with JavaScript dialogs like:

  • Prompt — which expects a text input
  • Confirm — which can be accepted or canceled
  • Alert — which displays a custom message 

Selenium is designed to handle all these interactions headlessly, making it easier for you to scrape websites that require unique inputs.

Multi-machine Automation

You can execute Selenium WebDriver scripts on many remote machines. 

Even though web testers use this feature to test across multiple machines, scrapers can also use it to run many requests across many remote machines at once. 

Of course, this might not be practically necessary with providers like ScraperAPI, which allow you to run up to thousands of requests concurrently. 

Pros and Cons

Pros:

  • Supports many browsers
  • Supports many languages
  • Supports bidirectional communication
  • Has an automatic wait period
  • Supports comprehensive testing
  • Can be integrated with other testing tools

Cons:

  • No mobile simulation
  • Relies too much on third-party tools


Languages and browsers supported

It supports Python, Kotlin, JavaScript, Java, C#, and Ruby. 

Stealth compatibility or add-ons

It supports stealth workarounds such as undetected-chromedriver and overriding navigator.webdriver to false.

You can find the official GitHub page of Selenium here.

When to Use Each Headless Browser

Here is how to choose the perfect headless browser for your project.

On Anti-bot Protection

Many websites block scraping bots automatically. This is why it’s better to choose a headless browser that can go unnoticed by most detectors. 

As Akamai and other bot detection systems often spot Selenium at work, this might not be the best option for you.

Puppeteer and Playwright might be a better fit, as they cover footprints and conceal their headless nature more effectively.

On Cross-browser Support

If you need strong cross-browser support, you should pick either Selenium or Playwright. Puppeteer, as you might guess, is not ideal here because it’s limited to Chromium-based browsers.

On Languages 

If you want support for multiple programming languages, Playwright and Selenium are good options. In particular, Selenium offers broader language coverage.

Project Scope

Some headless browsers don’t have the capacity to handle large-scale projects. For instance, Playwright and Puppeteer are not ideal for enterprise-grade scraping tasks, as they can be a bit slow. If enterprise-grade scraping is what you want, you should probably choose a library like Selenium. 

Need for Mobile Simulation

Among these options, only Playwright supports mobile simulation. It’s ideal if your scraping projects involve the mobile version of a website. 

Key Factors to Consider When Choosing a Headless Browser

Whether you are choosing Puppeteer, Selenium, Playwright, or any other headless browser library, there are some key factors to consider before picking the one best suited for your job. 

Ease of Use

Before even starting a project or integrating a library, one of the difficulties that you may face is a set-up that keeps failing. 

This can be worse if there is no documentation to help you out. You’d be left to figure things out on your own, which can be time-consuming and frustrating. 

Also, ideally, your choice should have a gentle learning curve, and it should be simple for anyone to grasp. 

So, ease of use should be one of your core criteria when picking a headless browser. The easier, the better. 

Language Support

You can build faster in the language you are more comfortable with. If you’re working in a team of engineers, you might want to consider the languages that your teammates are also fluent in. 

This becomes even more important if the legacy codebase might be migrated in the future. Some headless browsers only support two or three languages; Selenium, for instance, supports six popular ones. 

Pick the headless browser that supports the language you already know. 

Performance & Memory Usage

An ideal headless browser for your scraping programs should be fast enough to execute requests at scale. 

When picking the right headless browser, optimize for the one with speed, especially if you’re sure your number of concurrent requests will keep increasing with time. 

That said, memory usage is another key factor in adopting an efficient library. For each of your candidates, how much memory does it consume to run? The less memory it uses, the higher your throughput and the less your machine will lag. 
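To compare candidates on your own machine, the standard library's tracemalloc can measure the peak memory of any workload you run. The toy workload below is just a stand-in for an actual browser session:

```python
import tracemalloc

def measure_peak_memory(fn, *args, **kwargs):
    """Run fn and return (result, peak_bytes) measured with tracemalloc."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, peak

# A throwaway workload standing in for "render a page and extract data"
result, peak = measure_peak_memory(lambda: sum(i * i for i in range(100_000)))
print(f"Result: {result}, peak memory: {peak / 1024:.1f} KiB")
```

Wrap your real scraping routine in `measure_peak_memory` once per library to get an apples-to-apples comparison of their footprints.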

Stealth & Anti-Bot Detection

Modern websites are more sophisticated, and not all are friendly with scraping requests. Understandably so, as scraping requests can crash servers if done at scale in bad faith, among other reasons. 

Therefore, websites deploy anti-bot measures that bounce scraping requests, even from ethical web scrapers. As a result, any headless browser you choose should support stealth plugins and anti-bot evasion methods. 

Some libraries don’t have native support for stealth plugins, so you’d need some add-ons. Choose the ones with native stealth plugins and anti-bot detection methods. 

Evading Detection with Headless Browsers

There are now advanced bot detectors, which make it harder to successfully scrape protected websites. While there are services to unblock your program, it’s not a bad idea to use a headless browser that is capable enough to jump over some of these detectors. 

Along with your headless browsers, implement these methods to avoid being blocked. 

ScraperAPI

Puppeteer and other headless browser libraries have some stealth plugins for you to successfully avoid bot detection. ScraperAPI is not “one of those plugins.”

It is a unique suite of web scraping products for you to successfully access and extract data from any website, no matter the level of protection. 

Many web scraping developers pay separately for proxies, plugins, and other tools. ScraperAPI, however, is an all-in-one solution designed to provide a seamless data extraction experience. It handles everything from proxy rotation to website unblocking. 

Fun fact: you can integrate your favorite headless browsers and ScraperAPI perfectly in your scraping program. You can check out this sought-after Selenium-ScraperAPI integration guide. 

Playwright Stealth

Many detectors easily spot and block programs that use Playwright. Among other reasons, Playwright wasn’t originally built to handle seasoned detectors. 

More tellingly, Playwright runs with headless set to true by default, which signals to the site that the request is likely from a bot rather than an organic user. 

Even though Playwright has no native stealth plugin, puppeteer-extra-plugin-stealth has been adapted for Playwright. The plugin helps it access websites unnoticed. 

Puppeteer Stealth

The puppeteer-extra-plugin-stealth is relatively new and was built around various public detection indicators. Its main strength lies in hiding Puppeteer’s headless nature.

Detectors easily spot headless browsers through the telltale fingerprints their missing UI leaves behind. What this plugin does is make Puppeteer appear as natural as possible. 

Undetected_chromedriver

The undetected_chromedriver is a carefully crafted stealth plugin built to withstand DataDome, Cloudflare, Akamai, and many other sophisticated bot detection systems. 

Many web scraping engineers utilize this plugin while extracting data with their headless browsers. 

But there’s bad news: undetected_chromedriver doesn’t hide your IP address, so if you’re not rotating IPs, you might eventually get blocked. In addition, some detectors can identify it through browser fingerprints. 

NoDriver

NoDriver was created as a remedy to the limitations of undetected_chromedriver. It claims to be more sophisticated in concealing fingerprints and making programs go undetected. 

Use of proxies + browser fingerprint rotation

Many websites spot and block scraping bots based on location. For instance, some detectors have already blocked certain locations from accessing the websites they protect. 

Thus, you need a proxy to scrape such websites. 

Here is another important fact: once some detectors notice unusual bot interaction from an IP, they block it. How can you outwit that? By having multiple proxies and alternating them when you’re using headless browsers. 

This way, your digital footprint cannot be clearly defined, and this will outsmart bot detectors. 
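The rotation itself can be sketched in a few lines of Python: cycle through a proxy pool so consecutive requests leave from different IPs. The proxy URLs below are placeholders for whatever your provider gives you:

```python
import itertools

# Hypothetical proxy pool; in practice these come from your proxy provider
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def next_proxy_config():
    """Return a requests-style proxies dict, rotating through the pool."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# Each request would use the next proxy in the pool, e.g.:
# requests.get(url, proxies=next_proxy_config())
for _ in range(4):
    print(next_proxy_config()["http"])
```

A round-robin cycle is the simplest policy; real setups often add health checks and drop proxies that start returning blocks.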

Conclusion

In this blog, you have learned about headless browsers and how to identify the best one for you. 

It’s good to re-emphasize that the most suitable headless browser for you depends on the stage your SaaS is at and what you currently need. 

Even though headless browsers can be helpful in successfully scraping JavaScript-heavy websites, some of these websites now automatically block requests from headless browsers. 

Modern problems, they say, require modern solutions. This is where ScraperAPI comes in as a tool to help you bypass the hurdles of bot detection and scrape successfully without any headaches. 

Here is an interesting fact: ScraperAPI works with any headless browser, meaning you can integrate it with Selenium, Puppeteer, Playwright, and others. 

Well, don’t take our word for it. Try out a free plan of ScraperAPI with your headless browser today!

FAQs

Which browser is best for scraping?

Generally, Selenium appears to be the best headless browser library for scraping.

What is the best headless browser automation tool?

ScraperAPI is the best headless browser automation tool. Integrate it with your favorite headless browser, then automate your scraping with its DataPipeline feature.

Do I always need a headless browser for web scraping?

Not necessarily, especially if you’re trying to scrape an unprotected website.

Can headless browsers be detected?

Yes, if their headless nature is exposed. The cleverer approach is to conceal it with stealth plugins.


The 8 Best Data Extraction Tools in 2025: A Complete Guide https://www.scraperapi.com/web-scraping/best-data-extraction-tools/ Wed, 25 Jun 2025 05:45:18 +0000 https://www.scraperapi.com/?post_type=web_scraping&p=7978 Every thriving business must be able to leverage customer sentiments and market trends to its advantage. To this end, you need data to understand what customers would love to see from you and the current cycle of your industry.  For product owners, GTM marketers, CTOs, and other business leaders, you need efficient data extraction tools. […]

The post The 8 Best Data Extraction Tools in 2025: A Complete Guide appeared first on ScraperAPI.

Every thriving business must be able to leverage customer sentiments and market trends to its advantage. To this end, you need data to understand what customers would love to see from you and the current cycle of your industry. 

For product owners, GTM marketers, CTOs, and other business leaders, you need efficient data extraction tools.

In this short blog, we’ll discuss some carefully selected extraction tools. We’ll also take a look at their strengths, weaknesses, and ideal users.

TL;DR: Use an Enterprise Data Extraction Tool for Large-Scale Scraping

Websites employ various techniques to detect and block scrapers, and bypassing them is becoming increasingly difficult and expensive. Bot-blockers, such as Akamai and Cloudflare, make the process even more complex, requiring specialized solutions and expertise to access the data.

To save time and money, it’s better to use a specialized tool like ScraperAPI to:

  • Rotate your IP through a pool of over 150 million proxies
  • Handle CAPTCHA and JavaScript challenges
  • Bypass advanced bot-blockers, including Akamai and Cloudflare
  • Get dedicated support with fast response times
  • Speed up scraping with custom concurrency limits

Want to give it a try? Create a free ScraperAPI account and test our service using the code below. Add your api_key and the url of the site that’s blocking you.

import requests

payload = {
    'api_key': 'YOUR_API_KEY',
    'url': 'YOUR_TARGET_URL',
    'country_code': 'us'
}

response = requests.get('https://api.scraperapi.com', params=payload)
print(response.status_code)

If you want to test ScraperAPI for your particular use case, contact sales to get a custom trial, including personalized onboarding, custom concurrency and scraping credit limits, and a dedicated account manager to ensure high success rates and speed.

The 8 Best Data Extraction Tools in 2025

#1. ScraperAPI [Best for enterprise companies and dev teams that need to scrape millions of pages without getting blocked]

ScraperAPI website

ScraperAPI is the most suitable data extraction tool for both small teams and enterprises.

It offers many structured endpoints that allow users to easily spin up scraping jobs without reinventing the wheel. Current endpoints include Amazon, Google, Redfin, and Walmart.

The API is built to automatically bypass bot protection systems. This means you don’t need to worry about obstacles like Cloudflare—ScraperAPI can consistently access sites even when such blockers are in place.

Additionally, the DataPipeline feature allows users to carry out large-scale scraping by automating the entire process. 

Best For

SaaS teams of all sizes that want to extract data easily and at scale. 

Key Features

  • DataPipeline for scraping automation
  • AsyncScraper for simultaneous scraping jobs
  • Dedicated Data Endpoints for easier data extraction 
  • API to access and extract data from any website

Pros:

  • Can handle scraping work at scale
  • Scraping jobs can be automated
  • Ready-made endpoints for faster extraction
  • Efficient API to extract data from any website
  • Friendly pricing

Cons:

  • Cannot parse documents

Pricing

  • Hobby – $49
  • Startup – $149
  • Business – $299
  • Scaling – $475

Customers who set up yearly billing can have 10% off.

#2. Apify [Best suited for developers, AI engineers, or research engineers]

Apify website

With over 4k global customers and 4 billion pages crawled monthly, Apify has carved a name for itself as one of the most efficient YC-backed data extraction tools in 2025.

Its primary product is the Apify proxy and scraper API, which has been proven to help developers bypass any blocked website and scrape data successfully. 

Over time, Apify has advanced from its anti-blocking product to building its full-fledged platform that consists of Apify Store and Apify Actors. 

At the moment, it has over 4,500 pre-built scrapers, including those for TikTok, Instagram, and Amazon. It also gives developers the capacity to build scrapers, list them on the platform, and earn from their usage. 

Best For

Apify is best suited for developers, AI engineers, or research engineers who are neck-deep into web data extraction. 

Key Features

  • Special storage solution for enterprises
  • The Crawlee Python library
  • Integration with other applications
  • It has Apify Actors, which are simply ready-made scrapers

Pros:

  • Users can simply use existing scraping programs for their needs
  • Comprehensive and well-documented SDK for developers
  • AI-driven design
  • Easy setup
  • Can handle scraping jobs at scale

Cons:

  • Crowded interface
  • Hard to grasp for non-technical users
  • Some Actors are outdated
  • Relatively high pricing

Pricing

Here is a summary of the pricing plan:

  • Starter Plan – $39
  • Scale Plan – $199
  • Business Plan – $999

It has a 10% discount for customers who would like to get billed annually. In addition, it offers a pay-as-you-go model. 

#3. DocParser [Ideal for professionals in the corporate sector who might need to extract data from documents]

DocParser website

If you’ve been looking for the best data extraction tool for business documents, DocParser might be what you need. 

It’s one of the most popular tools for document parsing. All you have to do is upload your doc, provide instructions on the data you need, and download the output.

Yes, it’s that simple!

DocParser provides rule templates specifically designed for finance and accounting tasks, which you can use right out of the box. Alternatively, you can create your own custom rules. Its AI-powered features also allow for automated, customizable data extraction, so you can streamline your workflows with minimal manual input.

Best For

DocParser is ideal for professionals in the corporate sector who might need to extract data from documents.

Key Features

  • Has an HTTP API
  • Stores past document copies in case they are needed later
  • Data can be downloaded in multiple file formats
  • Parsing rules can be customized
  • Pre-built rules for document parsing

Pros:

  • Friendly to non-technical professionals
  • Easy to get started
  • Automation with AI
  • There are rule templates
  • Supports OCR for scanned docs

Cons:

  • Can only scrape documents
  • Cannot be integrated with other applications
  • Cannot scrape websites

Pricing

Here is the monthly pricing plan of DocParser:

  • Starter Plan – $32
  • Professional – $61
  • Business – $133

Note that there is no free tier. 

#4. OctoParse [Suitable tool for non-technical professionals]

Octoparse website

OctoParse has been in the data extraction industry since 2016 as one of the most prominent no-code tools. Primarily a no-code scraping tool, you can think of it as DocParser for websites. 

With native AI integration, it is easier to supercharge your scraping job. You can turn on auto-detect and receive real-time AI guidance at every step of the journey. 

Octoparse makes data scraping even faster with its library of pre-built templates, offering hundreds of ready-to-use setups for popular websites.

Best For 

Most suitable tool for non-technical professionals working in news curation, e-commerce, and lead generation. 

Key Features 

  • AI assistant for web scraping
  • Cloud-based scraping automation
  • In-built captcha bypass
  • Hundreds of preset templates
Pros:

  • Has scraping templates
  • Offers a free trial
  • Has an API reference

Cons:

  • Cannot parse documents
  • Lack of solid documentation
  • Can’t bypass Cloudflare
  • Complicated UI
  • Finds it difficult to scrape some complex websites

Pricing 

  • Standard Plan – $99
  • Professional Plan – $249

There’s a 16% discount for customers who choose yearly billing. 

#5. Airbyte [Great fit for enterprise teams that manage end-to-end data workflows]

Airbyte website

In many cases, data needs to be passed from one application to another in order to extract its full value.

This is where Airbyte stands out—it’s a cloud-based infrastructure that can “Extract, Load, and Transform” your data across multiple platforms. 

Airbyte has over 600 connectors, which pull and push data from one application to another. These include DuckDB, BigQuery, Brex, n8n, and more. You can also spin up a custom connector with the CDK to tailor data extraction to your needs. 

Best For

Airbyte is a great fit for enterprise teams that manage end-to-end data workflows, from extraction to loading into target systems.

Pros:

  • Many connectors to choose from
  • Native integration with many other applications
  • LLM integration for data analysis
  • Low latency
  • Solid developer documentation

Cons:

  • Lack of clear pricing information
  • Not suitable for small businesses or solo developers
  • Not ideal for non-technical teams

Key Features

  • Prebuilt connectors
  • CDK for building custom connectors
  • LLM integration & data insights
  • Users can use any extraction method

Pricing

  • Open Source 
  • Cloud
  • Team 
  • Enterprise

Airbyte doesn’t publish fixed prices for these plans. 

#6. Fivetran [Suited for enterprise teams on a budget]

Fivetran website

Fivetran differs from the previously mentioned tools by focusing primarily on data movement. Like Airbyte, it offers a wide range of connectors for data integration.

Users have access to over 700 connectors to move data across applications and automate workflows. They can also build custom connectors with support from the documentation.

Best For

Best suited for enterprise teams on a budget and placing more emphasis on operational quality.  

Key Features

  • REST API 
  • Database replication 
  • Custom connectors
  • Over 700 connectors
  • File replication
  • Data warehouse 
Pros:

  • Documentation and video guides
  • SDK for custom connectors
  • Strong capabilities in moving data across platforms
  • Easy setup

Cons:

  • Cannot parse documents
  • May struggle to extract data from complex or dynamically rendered websites
  • Relatively costly option in the market
  • Inefficient support

Pricing 

  • Free 
  • Standard
  • Enterprise
  • Business Critical

There are no fixed prices for each plan; customers are billed based on their usage.

#7. Zapier [Teams seeking to connect data across platforms]

Zapier website

Zapier Formatter is Zapier’s built-in tool for basic data extraction. Users can fetch simple information such as names, emails, and other straightforward details from various platforms.

However, it’s not designed for complex or website-level data extraction. Its real strength lies in enabling plug-and-play automation to move and transform data across different systems.

Best For

Enterprise teams seeking to connect data across platforms while leveraging AI-powered automation.

Key Features

  • AI integration 
  • Formatter for data extraction
  • Zap for automation workflows 
  • Page customization 
Pros:

  • Sleek user interface
  • Support for many popular applications
  • Little or no need for programming knowledge
  • Has AI integration for swift automation

Cons:

  • Usage limits can be restrictive
  • Expensive for small teams
  • No support for custom integrations
  • Not mobile-friendly

Pricing

  • Free Plan – $0
  • Pro Plan – $13
  • Advanced Plan – $66

#8. ParseHub [Ideal for non-technical teams that want to extract data quickly]

Parsehub website

ParseHub is a cloud-based data extraction software known specifically for its ease of use. Rather than running in the browser, it ships as a desktop app for Windows, Mac, and Linux. 

ParseHub lets users simply open a website, select the data they want to scrape, and download the results directly to their machine. It’s a no-code tool, so no programming experience is required to use it.

Best For

Non-technical teams that want to extract data quickly and efficiently. It’s ideal for sales leads, growth marketers, and business developers. 

Key Features

  • Automatically rotates IP
  • DropBox integration
  • Capable of fetching millions of data points within minutes
  • Cloud-based
  • All interactions are through the UI
Pros:

  • Made for non-technical teams
  • Users can access data via the API
  • Great customer support
  • Free plan to scrape pages
  • Data can be exported as CSV, Google Sheets, or Tableau

Cons:

  • Lack of robust documentation
  • Takes up to 15 minutes to set up
  • Not compatible with MCP and the latest AI tech stack
  • Limited developer freedom
  • Relatively high pricing

Pricing

ParseHub has two main pricing plans: Standard and Professional. The former is $189 per month, while the latter is $599. It also offers free licenses for schools and a free tier worth about $99. 

#Bonus Tool: Astera

Astera website

Astera is an end-to-end, AI-driven data management platform simplifying and accelerating data extraction, preparation, integration, and warehousing.

Astera’s data extraction tool, ReportMiner, enables organizations to connect to, and extract, structured, semi-structured, and unstructured data from 100+ sources using a visual, point-and-click UI powered by AI.

The built-in LLM Generate object empowers users to extract data from any file type. Once the data is extracted, users can easily cleanse, transform, and prepare it to meet their requirements using Astera’s Data Pipeline Builder.

Best For

Astera ReportMiner is fit for businesses of all sizes. Built-in OCR, Auto Generate Layout, and LLM-Powered extraction keep the platform simple and straightforward to use.

Key Features

  • Offers template-based data extraction using Auto Generate Layout and template-less data extraction via LLM
  • Diverse range of structured, semi-structured, and unstructured data sources supported with extended data processing and cleansing options
  • Extract, transform, and load data to both on-premises and cloud-based systems
  • Built-in workflow orchestration and job scheduler with real-time job monitoring 
Pros:

  • Supports both template-based and template-less data extraction from diverse documents in bulk
  • Powered by an intuitive, easy-to-use UI
  • Built-in transformations keep data preparation tasks simple
  • Ability to manage the entire document data extraction lifecycle within a single platform
  • Complemented by Astera’s award-winning customer support

Cons:

  • Pricing is available on request
  • The breadth of features may be overkill for businesses with basic data extraction use cases

Pricing

Astera ReportMiner is available as:

  • ReportMiner Express
  • ReportMiner Enterprise

Prospective buyers need to get in touch with the company to get a quote.

How to Choose the Best Data Extraction Tool for Your Needs

Choosing the data extraction tool you’ll use for your work is a crucial decision that can make or break your project.

Here is what to look out for when choosing your data extraction tool:

Easy Setup

Simplicity is important in maintaining an efficient workflow across your organization. When reviewing your options, ask yourself:

  • Can the most junior person on the team conveniently navigate this tool without extensive training?
  • What knowledge gaps need to be addressed before they can effectively use it?

If setup proves complex, it’s probably not the best fit, especially if it requires extensive training for your team. Ideally, you want a data extraction tool with an intuitive interface and a gentle learning curve.

Stable and Efficient Performance

A data extraction tool that crashes frequently is not ideal for keeping things running smoothly. When making your choice, be sure to keep stability and efficiency in mind. 

Ask yourself: Can this tool concurrently scrape up to 100+ requests without running into errors or crashing? Most companies using data extraction tend to have large volume needs when it comes to scraping, so make sure the tool you pick can handle the heat.

Data Security 

In this world that revolves around data, information is gold, and implementing the right security measures to protect your users’ data is paramount. 

A good way to verify the security of your prospective tool is to check how many international data compliance certifications it holds. This helps you gauge the strength of their data protection measures.

Automation

Ideally, you want your prospective tool to automate repetitive tasks. This helps your team focus on what really matters. 

Make sure you choose a platform that makes using it a breeze and takes care of the bulk of manual work, such as data formatting, error handling, and report generation.

AI and Agentic Alignment

Agentic AI is the future of SaaS. Many teams now integrate AI-powered agents to handle cross-application workflows. Consider how well your prospective tool supports large language models (LLMs) and intelligent agents, which can be a key factor in selecting a leading solution. 

Clear Documentation and Video Guides

Tools are always different, and that is why documentation is important for quickly grasping how things work. 

An ideal data extraction tool should not leave you in the dark when something breaks or doesn’t work as intended. Instead, it should document and explain every segment of the workflow in extensive, well-structured documentation or guides. 

Scalability for Enterprises

Companies grow, project requirements change, and the right tool for the job needs to be able to seamlessly adapt. Make sure the tool you choose is designed to grow with you by offering scalable features, flexible integrations, and regular updates that keep pace. 

Conclusion

No matter how you use data within your organization or for your personal project, a great data extraction tool that suits your needs can support you throughout your journey.

In this short blog, we’ve gone over 8 data extraction tools you might want to consider in 2025 and also pointed out non-negotiable qualities you should look out for. 

In all of this, ScraperAPI stands as an ideal overall choice if you want to take your data extraction to the next level. You can check out the docs for yourself here!

The post The 8 Best Data Extraction Tools in 2025: A Complete Guide appeared first on ScraperAPI.

Scrapy vs BeautifulSoup – Which Web Scraping Tool Should You Use in 2025? https://www.scraperapi.com/web-scraping/scrapy-vs-beautifulsoup/ Wed, 25 Jun 2025 05:22:41 +0000 https://www.scraperapi.com/?post_type=web_scraping&p=7976

Choosing between Scrapy vs Beautifulsoup isn’t just about picking a Python library or framework, but rather about choosing the right foundation for your entire web scraping strategy. As websites advance and more anti-scraping measures are implemented, the wrong tool for the job can lead to hours of frustration. Many developers jump straight into coding and overlook the critical importance of matching the right tools for their scraping requirements.

In this guide, you’ll learn:

  • What makes Scrapy and BeautifulSoup fundamentally different
  • The key strengths and limitations of each approach
  • When to choose a complete framework versus a parsing library
  • Real-world scenarios where one clearly outperforms the other
  • How to make the right choice for your 2025 scraping projects

Ready to stop second-guessing your tool selection and start scraping with confidence? Let’s dive in!

What Is BeautifulSoup?

BeautifulSoup is a Python library designed for parsing HTML and XML documents, making it one of the most accessible entry points into web scraping. Think of it as a powerful magnifying glass that helps you navigate through the messy structure of web pages and extract the exact data you need.

What makes BeautifulSoup appealing for beginners is its smooth learning curve. Even developers new to web scraping can easily grasp its syntax and start extracting data within minutes. The library transforms complex HTML documents into a navigable tree structure, allowing you to search for elements using familiar methods like finding tags, classes, or IDs.

Core Features:

  • Tag-based navigation: Easily traverse HTML elements using simple, readable syntax
  • Multiple parser support: Works with built-in html.parser, lxml, and html5lib for different performance and accuracy needs
  • Robust error handling: Gracefully manages malformed HTML without crashing your script
  • Flexibility with poor HTML: Handles real-world websites with broken or inconsistent markup
  • CSS selector support: Find elements using CSS selectors familiar to web developers

Typical Use Cases: BeautifulSoup shines in scenarios like extracting product information from e-commerce sites, scraping news articles, gathering data from static websites, parsing HTML emails, or cleaning up messy HTML documents. It’s particularly effective for one-off scraping tasks, data analysis projects, and situations where you need to quickly extract specific information from a handful of pages.

Key Limitations: BeautifulSoup is purely a parsing library, which means it doesn’t handle HTTP requests, session management, or JavaScript rendering. You’ll need to pair it with libraries like requests for a complete scraping solution, and it’s not designed for large-scale or high-performance scraping operations.
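To make that pairing concrete, here’s a minimal sketch of BeautifulSoup in action. The HTML (an invented product listing) is hardcoded so the example is self-contained; in a real script you would fetch it first with something like `requests.get(url).text`:

```python
# Minimal BeautifulSoup sketch. The HTML is hardcoded for
# self-containment; in practice you'd fetch it with requests first,
# e.g. html = requests.get(url).text
from bs4 import BeautifulSoup

html = """
<html><body>
  <ul>
    <li class="product"><a href="/p/1">Widget</a> <span class="price">$9.99</span></li>
    <li class="product"><a href="/p/2">Gadget</a> <span class="price">$19.99</span></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; lxml is faster

# Tag/class navigation: one dict per product row
products = [
    {
        "name": li.a.get_text(),
        "url": li.a["href"],
        "price": li.select_one("span.price").get_text(),
    }
    for li in soup.find_all("li", class_="product")
]

print(products)
```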

What Is Scrapy?

Scrapy is a Python-based web scraping framework that goes far beyond simple HTML parsing to provide a complete solution for large-scale data extraction projects. Unlike parsing libraries, Scrapy is built from the ground up as an end-to-end scraping platform, handling everything from making HTTP requests and following links to processing data and exporting results in various formats.

Think of Scrapy as a full-service scraping factory rather than just a parsing tool. It’s designed around the concept of “spiders”, self-contained scraping programs that can crawl websites, extract data, and process information through customizable pipelines.

Key Strengths:

  • Asynchronous support: Built on the Twisted framework for high-performance concurrent scraping
  • Middleware system: Extensible architecture for handling cookies, user agents, proxies, and custom processing
  • Spider design: Organized structure for creating reusable, maintainable scraping logic
  • Built-in exporters: Native support for JSON, CSV, XML, and custom data formats
  • AutoThrottle: Intelligent request throttling to avoid overwhelming target servers
  • Scalability: Designed for distributed scraping across multiple machines
  • Request/Response handling: Complete HTTP client with cookie support, redirects, and error handling
  • Item pipelines: Data processing workflows for cleaning, validation, and storage

Typical Use Cases: Scrapy excels in enterprise-level projects like crawling entire e-commerce catalogs, monitoring competitor pricing across thousands of products, building comprehensive datasets from multiple related websites, creating web crawlers that follow complex link structures, and any scenario requiring systematic and large-scale data extraction with reliability and performance.

Key Limitations: Scrapy’s power comes with complexity; it has a steeper learning curve and may be overkill for simple, one-off scraping tasks. It also doesn’t handle JavaScript-heavy sites natively (though it can integrate with tools like Splash), and its framework-based approach requires more setup compared to the simplicity of just importing a parsing library.

Scrapy vs BeautifulSoup: Side-by-Side Feature Comparison

The table below provides you with all the information you need to choose the right tool:

Feature | BeautifulSoup | Scrapy
Setup complexity | Simple pip install, ready to use | Requires project structure, settings configuration
Learning curve | Gentle, intuitive for beginners | Steeper, framework concepts to master
Speed & performance | Sequential processing, slower | Asynchronous, high-performance concurrent requests
Scalability | Limited to single-threaded operations | Built for large-scale, distributed scraping
HTTP request support | Requires additional library (requests) | Built-in comprehensive HTTP client
JavaScript rendering | None (static HTML only) | Limited (requires Splash/Selenium integration)
Proxy/captcha handling | Manual implementation needed | Built-in middleware support
Ecosystem/middleware | Minimal, relies on external packages | Rich ecosystem with extensive middleware
Development speed for simple tasks | Very fast for quick scripts | Slower initial setup, faster for complex projects
Community & documentation | Large community, extensive tutorials | Strong community, comprehensive official docs

Key Takeaways:

When you compare these two tools, you’ll see they take completely different approaches to web scraping. 

BeautifulSoup focuses on being simple and easy to use: you can start scraping in just three lines of code, making it perfect for quick data extraction and testing ideas. 

Scrapy, on the other hand, is the heavy-duty solution built for big projects, and its powerful features become necessary when you’re scraping thousands of pages and need strong error handling and automatic features. 

At the end of the day, it really comes down to project size versus how quickly you want to start. BeautifulSoup gets you scraping right away but slows down on bigger jobs, while Scrapy takes time to learn but can handle massive projects. Neither tool is better than the other; they’re built for different types of scraping tasks.

Which One Should You Choose?

Choosing between Scrapy and BeautifulSoup depends on your project requirements and circumstances. Focus on which one aligns with your actual needs. Here’s a practical cheat sheet to guide your choice:

Choose BeautifulSoup if:

  • You’re scraping fewer than 100 pages total
  • You need to extract data from static HTML pages
  • You’re new to web scraping or Python
  • You want to get results quickly without setup overhead
  • Your project is a one-time data extraction or analysis
  • You’re prototyping or exploring what data is available
  • You already have URLs and just need to parse HTML content
  • You’re working on data analysis, where scraping is a small component

Choose Scrapy if:

  • You’re planning to scrape hundreds or thousands of pages
  • You need to crawl websites by following links systematically
  • Speed and performance are critical requirements
  • You want built-in handling for proxies, user agents, and rate limiting
  • Your project requires data pipelines for cleaning or storage
  • You’re building a long-term, maintainable scraping solution
  • You need robust error handling and retry mechanisms
  • You’re comfortable with framework-based development

Project Size Guidelines:

Small Projects (1-50 pages): BeautifulSoup is almost always the right choice. The setup time for Scrapy isn’t justified, and BeautifulSoup’s simplicity will get you results faster.

Medium Projects (50-500 pages): This is the gray area where either tool could work. Choose BeautifulSoup if simplicity matters more than speed, or Scrapy if you anticipate growth or need built-in features like throttling.

Large Projects (500+ pages): Scrapy becomes increasingly necessary. BeautifulSoup will struggle with performance, and you’ll end up rebuilding Scrapy’s features manually.

Technical Comfort Level: If you’re new to programming or web scraping, start with BeautifulSoup regardless of project size. The learning experience will be smoother, and you can always migrate to Scrapy later as your needs and skills grow.

Quick Decision Checklist: Ask yourself these three questions:

  1. Do I need to follow links and crawl through website structures? (If yes → Scrapy)
  2. Am I scraping more than a few dozen pages? (If yes → probably Scrapy)
  3. Is this a quick, one-time task? (If yes → BeautifulSoup)

Remember: you can always start with BeautifulSoup and switch to Scrapy later as your project grows. Many successful scraping projects begin as simple BeautifulSoup scripts that evolve into full Scrapy frameworks over time.

How Scrapy and BeautifulSoup Handle Common Web Scraping Challenges

Real-world web scraping involves more than just parsing HTML. Modern websites present numerous challenges that can make or break your scraping project. Here’s how both tools handle the most common obstacles you’ll encounter in the wild.

  1. Error Handling

BeautifulSoup takes a minimalist approach to error handling. It parses malformed HTML without crashing, but HTTP errors, network timeouts, and connection issues must be handled manually through your requests implementation. You’ll need to wrap your code in try-except blocks and implement your own retry logic.
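In practice, that manual work looks something like the sketch below; `fetch_with_retries` is an illustrative helper name, not part of requests or BeautifulSoup.

```python
# Hand-rolled retry logic of the kind BeautifulSoup users must write
# themselves; fetch_with_retries is an illustrative name, not a
# library function.
import time


def fetch_with_retries(fetch, attempts=3, base_delay=1.0):
    """Call fetch(); on error, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)


# With requests, usage would look like:
#   html = fetch_with_retries(lambda: requests.get(url, timeout=10).text)
```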

Scrapy, however, comes with built-in error handling. It automatically handles HTTP errors, network failures, and parsing exceptions with configurable retry mechanisms. Failed requests are automatically retried with exponential backoff, and you can customize error-handling behavior through middleware without cluttering your main scraping logic.

  2. Rate Limiting and Retries

This is where the tools diverge significantly. With BeautifulSoup, you’re responsible for implementing delays between requests, typically using time.sleep() or more sophisticated throttling mechanisms. While this gives you complete control, it also means writing and maintaining additional code for something that should be standard.

Scrapy includes AutoThrottle, an intelligent rate-limiting system that automatically adjusts request delays based on server response times and latency. It learns optimal crawling speeds for each domain and includes built-in retry logic with customizable delays. You can also set global delays, randomize timing, and configure concurrent request limits without writing a single line of throttling code.
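The setting names below are standard Scrapy options you’d place in a project’s settings.py; the values shown are example choices, not recommendations from this article.

```python
# Example settings.py fragment enabling AutoThrottle and basic
# rate limiting; the values shown are illustrative.
AUTOTHROTTLE_ENABLED = True            # adapt delays to server latency
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # cap when a server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per server

DOWNLOAD_DELAY = 0.5                   # baseline delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True        # jitter the delay (0.5x-1.5x)
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # hard per-domain cap
RETRY_TIMES = 3                        # automatic retries on failure
```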

  3. Extracting Structured Data

Both tools excel at data extraction, but with different approaches. BeautifulSoup’s tag-based navigation feels natural and intuitive. You can easily chain selectors and navigate parent-child relationships. It’s important to note that organizing extracted data into structured formats requires additional manual work.

Scrapy introduces the concept of Items and ItemLoaders, providing a structured way to define, validate, and process extracted data. This systematic approach reduces bugs and makes data extraction more maintainable, especially when dealing with complex, multi-field extractions across hundreds of pages.

  4. Proxy and CAPTCHA Support

BeautifulSoup relies entirely on external libraries for proxy support. You’ll typically use the requests library with proxy configurations, rotating proxies manually through your code. CAPTCHA handling requires integrating third-party services and managing the complexity yourself.

Scrapy’s middleware system makes proxy rotation straightforward through built-in proxy middleware or third-party packages like scrapy-rotating-proxies. While CAPTCHA handling still requires external services, Scrapy’s architecture makes integration cleaner through custom middleware that can intercept and handle CAPTCHA challenges without disrupting your main scraping logic.

The Reality Check: Modern Bot Protection

Here’s the uncomfortable truth: both tools struggle against today’s sophisticated website defenses. Modern sites use advanced detection systems like Cloudflare, PerimeterX, and custom JavaScript fingerprinting that can easily spot and block scrapers built with either Scrapy or BeautifulSoup. These protection systems have become so smart that they can identify bot behavior patterns regardless of which parsing tool you’re using.

JavaScript-heavy websites create another major obstacle that neither tool handles well on its own. BeautifulSoup can only see the basic HTML that loads first, missing all the content that JavaScript creates later. Scrapy faces the same limitation and needs additional tools like Splash, Selenium, or Playwright to render dynamic content, which makes everything more complicated and slower. When websites deploy strict anti-bot measures, you need specialized services for residential proxies, CAPTCHA solving, and JavaScript rendering, making your choice of parsing tool less important than your overall anti-detection strategy.

Key Takeaway: Both Scrapy and BeautifulSoup excel at what they’re designed for, but modern web scraping success often depends more on your ability to handle advanced bot protection than on which parsing tool you choose. Today’s scraping projects typically require combining your preferred parsing library with proxy services, JavaScript renderers, and anti-detection techniques.

Conclusion

Choosing between Scrapy and BeautifulSoup depends on your project size and experience. BeautifulSoup is perfect for beginners and small projects, while Scrapy excels at large-scale operations and complex data pipelines.

Both tools face modern scraping challenges like bot detection, JavaScript rendering, and CAPTCHA protection. That’s where ScraperAPI comes in. It handles these obstacles automatically while you continue using your preferred parsing tool.

In this tutorial, we explored:

  • Comparing Scrapy vs BeautifulSoup for different use cases
  • Understanding when to use each tool effectively
  • Overcoming common scraping challenges with ScraperAPI integration

Ready to supercharge your scraping projects? Sign up for a free ScraperAPI account and get 5,000 API credits to test it with your existing scripts. Whether you’re team BeautifulSoup or team Scrapy, ScraperAPI provides the infrastructure you need to scrape successfully.

Until next time, happy scraping!

FAQs about Scrapy vs BeautifulSoup

Is BeautifulSoup better than Scrapy?

It depends on your needs. BeautifulSoup is better for beginners, small projects, and quick data extraction from static pages. Scrapy is better for large-scale scraping, complex crawling operations, and production environments. Choose based on project size and your technical experience.

Is Scrapy good for web scraping?

Yes, Scrapy is excellent for web scraping, especially for large-scale projects. It’s a full-featured framework with built-in support for handling requests, following links, managing cookies, and processing data pipelines. It’s designed specifically for production-grade scraping operations.

Can I use Scrapy with BeautifulSoup?

Absolutely! You can use BeautifulSoup as a parser within Scrapy for more complex HTML parsing tasks. Simply import BeautifulSoup in your Scrapy spider and use it to parse the response content when you need its specific parsing capabilities.

What are the main differences between Scrapy and BeautifulSoup for web scraping?

Scrapy is a complete framework for large-scale scraping with built-in request handling, while BeautifulSoup is a parsing library that requires additional tools for making requests. Scrapy handles crawling, data pipelines, and concurrent requests automatically, whereas BeautifulSoup focuses solely on parsing HTML/XML content.

The post Scrapy vs BeautifulSoup – Which Web Scraping Tool Should You Use in 2025? appeared first on ScraperAPI.

Best User Agent List for Web Scraping in 2025 (with Examples & Tips) https://www.scraperapi.com/web-scraping/best-user-agent-list-for-web-scraping/ Mon, 23 Jun 2025 08:57:10 +0000 https://www.scraperapi.com/?post_type=web_scraping&p=7925

As websites deploy smarter detection systems and more aggressive blocking, scraping is becoming increasingly difficult. While developers chase complex solutions like residential proxies and CAPTCHA solvers, they often overlook one of the most powerful tools: the `user agent` string.

In this guide, you’ll learn 

  • What user agents are 
  • How sites use them to detect bots 
  • How to build and rotate a clean user agent list 
  • How tools like ScraperAPI make the entire process effortless 

Ready to stop getting blocked and start scraping successfully? Let’s dive in!

What Is a User Agent String?

Think of a user agent string as your browser’s business card. Every time you visit a website, your browser introduces itself with a line of text that says “Hi, I’m Chrome running on Windows” or “I’m Safari on an iPhone.” This introduction happens behind the scenes in every single web request.

What it is

A User Agent (UA) string is a line of text included in HTTP headers that identifies the software making the request. It tells websites what browser you’re using, what version it is, what operating system you’re running, and sometimes even what device you’re on. Here’s what a typical Chrome user agent looks like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

Breaking this down:

  • Mozilla/5.0: Legacy identifier (all modern browsers include this)
  • Windows NT 10.0; Win64; x64: Operating system and architecture
  • AppleWebKit/537.36: Browser engine version
  • Chrome/120.0.0.0: Browser name and version
  • Safari/537.36: Additional engine compatibility info

Where it’s found 

The user agent string lives in the HTTP headers of every request you make, specifically under the `User-Agent` header. Here’s what a basic HTTP request looks like:

GET / HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: text/html, application/xhtml+xml

Would you like to view your user agent string? Check it out at whatismybrowser.com or test HTTP requests at httpbin.org/user-agent.

What it’s used for 

Websites use user agent strings for several purposes:

  • Content optimization: When you visit YouTube on your phone, it automatically serves you the mobile version because it reads your user agent and sees you’re on a mobile device. Desktop users get the full desktop layout.
  • Analytics and tracking: Sites track which browsers and devices their visitors use. This helps them decide whether to support older browser versions or focus on mobile optimization.
  • Bot detection and security: This is where web scraping gets interesting. Websites analyze user agent patterns to spot automated traffic. A request from ‘python-requests/2.28.1’ screams “I’m a bot!”, while a Chrome user agent blends in with normal traffic.
  • Feature compatibility: Sites might serve different JavaScript or CSS based on browser capabilities. Internet Explorer gets simpler code, while modern browsers get advanced features.

Why User Agents Matter for Web Scraping

When you’re scraping, your user agent string is often the first thing that gives you away. Servers use these strings as a primary method to distinguish between real human visitors and automated bots, and getting this wrong can shut down your scraping operation before it even starts.

Here is why user agent matters for web scraping:

The Bot Detection Problem

Most scraping libraries use obvious user agents by default. Python’s requests library, for example, sends ‘python-requests/2.28.1’ with every request. To a server, this is like walking into a store wearing a sign that says: “I’m a robot.” This results in instant blocks, empty responses, or redirects to CAPTCHA pages.

Here’s what commonly happens when servers detect suspicious user agents:

  • Immediate blocking: Your IP gets banned after just a few requests.
  • Empty responses: The server returns blank pages or error messages.
  • Fake content: You receive dummy data instead of real information.
  • CAPTCHA challenges: Every request gets redirected to human verification.
  • Rate limiting: Your requests get severely throttled or queued.

Detection Systems Are Getting Smarter

Modern anti-bot systems don’t just look at your user agent, they analyze it alongside other request patterns. They check for:

  • Consistency: Does your user agent match your Accept headers and other browser fingerprints?
  • Frequency: Are you making requests too fast for a human?
  • Behavior: Do you visit robots.txt, load images, or follow redirects like a real browser?

The Client Hints Evolution

Here’s where things get more complex. Modern browsers are transitioning from a single User-Agent string to a set of headers called Client Hints. These provide more detailed information about the client while improving privacy by reducing fingerprinting opportunities.

Instead of cramming everything into one user agent string, browsers now send separate headers:

Sec-CH-UA: "Chromium";v="120", "Google Chrome";v="120", "Not A Brand";v="24"
Sec-CH-UA-Mobile: ?0
Sec-CH-UA-Platform: "Windows"

Why This Matters for Scrapers

Detection systems now validate whether your complete browser profile makes sense. A major red flag is a mismatch between your main User-Agent string and these Client Hints headers. For example, claiming to be Chrome 120 on Windows in your user agent while sending Client Hints that say you’re on mobile Safari will get you blocked instantly.

This evolution means successful scraping in 2025 requires not just a good user agent string, but a complete, consistent browser fingerprint that includes all the modern headers real browsers send.
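As an illustration of that consistency requirement, the sketch below builds a header set where the traditional User-Agent and the `Sec-CH-UA-*` Client Hints tell the same Chrome-on-Windows story. The exact values are illustrative, not a guaranteed match for any specific Chrome release:

```python
def chrome_windows_headers(version: str = "120") -> dict:
    """Build a matched User-Agent + Client Hints header set.

    Sketch only: every header mirrors a desktop Chrome-on-Windows
    profile so the UA string and the Sec-CH-UA-* hints agree.
    """
    return {
        "User-Agent": (
            f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            f"AppleWebKit/537.36 (KHTML, like Gecko) "
            f"Chrome/{version}.0.0.0 Safari/537.36"
        ),
        "Sec-CH-UA": (
            f'"Chromium";v="{version}", "Google Chrome";v="{version}", '
            '"Not A Brand";v="24"'
        ),
        "Sec-CH-UA-Mobile": "?0",        # desktop profile, so not mobile
        "Sec-CH-UA-Platform": '"Windows"',
    }

headers = chrome_windows_headers("120")
```

Pass this dict as the `headers` argument of any HTTP client so the whole fingerprint travels together.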

The Best User Agent List for Web Scraping (Updated for 2025)

Here’s a curated selection of proven user agent strings that work effectively for web scraping in 2025.

1. Chrome User Agents

Chrome dominates the browser market with over 65% market share, making its user agents your safest bet for blending in with normal traffic. They’re updated frequently and widely accepted across all websites.

Chrome on Windows:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

Chrome on macOS:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

Chrome on Linux:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36

2. Firefox User Agents

Firefox represents about 8-10% of browser traffic and offers a good alternative when you want to diversify your user agent rotation. These strings are particularly useful for sites that might be flagging too many Chrome requests.

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0
Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0

3. Safari User Agents

Safari user agents are essential for scraping sites that cater heavily to Mac and iOS users. They’re particularly effective for e-commerce and design-focused websites with many Apple users.

Safari on macOS:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15
Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15

Safari on iOS:

Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1
Mozilla/5.0 (iPad; CPU OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1

4. Edge (Chromium-based) User Agents

Modern Edge uses the same engine as Chrome but has a smaller user base, making these user agents perfect when you need something that looks legitimate but isn’t as common as Chrome.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0

5. Mobile User Agents

With mobile traffic now exceeding desktop, these user agents are crucial for accessing mobile-optimized content and APIs. Many sites serve different data to mobile users, which makes these strings invaluable for comprehensive scraping.

Android Chrome:

Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36
Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36
Mozilla/5.0 (Linux; Android 14; SM-S918B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Mobile Safari/537.36

iPhone:

Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1
Mozilla/5.0 (iPhone; CPU iPhone OS 17_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Mobile/15E148 Safari/604.1

User Agents to Avoid

These user agent strings will immediately flag you as a bot:

1. Default Scraping Library User Agents:

python-requests/2.28.1
curl/7.68.0
urllib/3.9
Scrapy/2.5.1

2. Headless Browser Identifiers:

HeadlessChrome/120.0.0.0
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36
PhantomJS/2.1.1

3. Malformed or Generic User Agents:

Mozilla/5.0
Browser 1.0
*
(empty user agent)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

For a ready-to-use collection of tested user agent strings, you can find regularly updated lists at user-agents.net or useragentstring.com. For maximum effectiveness, always verify that your chosen user agents match current browser versions and include the appropriate Client Hints headers.
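Since stale or bot-flavored strings are red flags, it can help to sanity-check a list before using it. A minimal sketch — the bad-marker list and version threshold are arbitrary examples, not an authoritative rule set:

```python
import re

# Substrings that immediately identify automated clients
BAD_MARKERS = ("python-requests", "curl/", "HeadlessChrome", "PhantomJS")

def looks_usable(user_agent: str, min_chrome_major: int = 110) -> bool:
    """Reject known bot markers and obviously outdated Chrome versions."""
    if not user_agent or any(m in user_agent for m in BAD_MARKERS):
        return False
    match = re.search(r"Chrome/(\d+)", user_agent)
    if match and int(match.group(1)) < min_chrome_major:
        return False  # too old to blend in with current traffic
    return True

candidates = [
    "python-requests/2.28.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
]
usable = [ua for ua in candidates if looks_usable(ua)]
```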

How to Set a Custom User Agent in Python

Setting a custom user agent in Python is straightforward, but it’s important to verify that your target server receives the user agent you’re sending. Here are practical examples for the most popular Python libraries, complete with validation.

1. Using Requests Library

Prerequisites:

  • Python
  • Requests

Requests is Python’s most popular HTTP library. It is simple, reliable, and perfect for basic web scraping. This example shows how to set a custom user agent and verify it’s working correctly.

import requests

# Define a realistic user agent
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

# Set user agent in headers
headers = {
   'User-Agent': user_agent
}

try:
   # Test with httpbin to see what headers the server receives
   response = requests.get('https://httpbin.org/headers', headers=headers)
  
   if response.status_code == 200:
       data = response.json()
       received_ua = data['headers'].get('User-Agent', 'Not found')
       print(f"Sent User-Agent: {user_agent}")
       print(f"Received User-Agent: {received_ua}")
       print(f"Match: {user_agent == received_ua}")
   else:
       print(f"Request failed with status code: {response.status_code}")
      
except requests.RequestException as e:
   print(f"Request error: {e}")

If the user agent was set correctly, the sent and received values will match:

Sent User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Received User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Match: True

2. Using Selenium WebDriver

Prerequisites:

  • Python
  • selenium

Selenium lets you control a browser programmatically while scraping. With Selenium, the user agent is set through the browser’s launch options:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException

# Define user agent
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument(f"--user-agent={user_agent}")

try:
   # Initialize driver with custom user agent
   driver = webdriver.Chrome(options=chrome_options)
  
   # Navigate to test endpoint
   driver.get('https://httpbin.org/headers')
  
   # Get the page source and check if our user agent appears
   page_source = driver.page_source
  
   # You can also use JavaScript to get the actual user agent
   actual_ua = driver.execute_script("return navigator.userAgent;")
  
   print(f"Set User-Agent: {user_agent}")
   print(f"Browser User-Agent: {actual_ua}")
   print(f"Match: {user_agent == actual_ua}")
  
except WebDriverException as e:
   print(f"WebDriver error: {e}")
finally:
   if 'driver' in locals():
       driver.quit()

3. Using HTTPX (Modern Alternative)

Prerequisites:

  • Python
  • httpx

HTTPX is a modern HTTP client that’s becoming increasingly popular:

import httpx
import asyncio

# Define user agent
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1"

# Synchronous version
def test_sync_ua():
   headers = {'User-Agent': user_agent}
  
   try:
       with httpx.Client() as client:
           response = client.get('https://httpbin.org/headers', headers=headers)
          
           if response.status_code == 200:
               data = response.json()
               received_ua = data['headers'].get('User-Agent', 'Not found')
               print(f"Sync - Sent: {user_agent}")
               print(f"Sync - Received: {received_ua}")
               print(f"Sync - Match: {user_agent == received_ua}")
           else:
               print(f"Sync request failed: {response.status_code}")
              
   except httpx.RequestError as e:
       print(f"Sync request error: {e}")


# Asynchronous version
async def test_async_ua():
   headers = {'User-Agent': user_agent}
  
   try:
       async with httpx.AsyncClient() as client:
           response = await client.get('https://httpbin.org/headers', headers=headers)
          
           if response.status_code == 200:
               data = response.json()
               received_ua = data['headers'].get('User-Agent', 'Not found')
               print(f"Async - Sent: {user_agent}")
               print(f"Async - Received: {received_ua}")
               print(f"Async - Match: {user_agent == received_ua}")
           else:
               print(f"Async request failed: {response.status_code}")
              
   except httpx.RequestError as e:
       print(f"Async request error: {e}")

# Run both tests
if __name__ == "__main__":
   test_sync_ua()
   asyncio.run(test_async_ua())

How to Rotate User Agents at Scale

When scraping at scale, using the same user agent for thousands of requests is like wearing the same disguise to rob every bank in town: you’ll get caught. User agent rotation distributes your requests across different “browser profiles,” making your traffic appear more natural and harder to detect.

The Simple Approach: Random Selection

The basic method involves maintaining a list of user agents and randomly selecting one for each request:

import random
import requests

user_agents = [
   "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
   "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
   "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"
]

def scrape_with_rotation(url):
   headers = {'User-Agent': random.choice(user_agents)}
   return requests.get(url, headers=headers)

if __name__ == "__main__":
   scrape_with_rotation("https://your-page-to-scrape.com")  # requests needs a full URL with scheme

While this approach works for basic scraping, it becomes inadequate for larger operations or protected sites.

The Challenge of Manual Rotation

Maintaining user agent rotation manually presents several complex challenges:

  • Keeping lists current: Browser versions change monthly, and outdated user agents become red flags
  • Matching complementary headers: Each user agent should pair with realistic Accept, Accept-Language, and Accept-Encoding headers
  • Avoiding detectable patterns: Over time, random selection can create suspicious patterns that websites can recognize
  • Scale management: Large operations need thousands of unique, validated user agent combinations
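One way to tackle the “matching complementary headers” problem above is to rotate whole profiles rather than lone strings, so each user agent always travels with plausible companion headers. A sketch with two illustrative profiles:

```python
import random

# Each profile bundles a user agent with headers that plausibly match it
PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
                      "Gecko/20100101 Firefox/121.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
]

def pick_profile() -> dict:
    """Return a complete, internally consistent header set."""
    return dict(random.choice(PROFILES))

headers = pick_profile()
```

Pass the whole dict as your request headers so the user agent never appears without its matching companions.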

The Browser Fingerprinting Problem

In addition to checking your user agent, modern anti-bot systems also do thorough browser fingerprinting to make sure that every feature of your “browser” matches what the user agent says. This establishes several levels of detection:

  • Headers layer: Your headers must be internally consistent. Does your Accept-Language header match the location implied by your IP address? For example, if your user agent claims to be Chrome on Windows but your Accept-Language is “zh-CN” while your IP is from New York, that’s suspicious.

  • Client Hints layer: Modern browsers send Client Hints headers alongside the traditional user agent. Does your Sec-CH-UA-Mobile header match the OS claimed in your user agent? Claiming to be desktop Chrome while sending mobile client hints will get you blocked instantly.

User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X)...
Sec-CH-UA-Mobile: ?0  ← This mismatch will get you caught
  • JavaScript Detection layer: Sites can compare the `navigator.userAgent` value in JavaScript with the User-Agent header sent in your HTTP request. If you’re using Selenium or another browser automation tool, these values must match perfectly.
  • Behavioral Analysis layer: Even with perfect headers, sites analyze behavior patterns. Does this “browser” load pages in 50 milliseconds? Does it skip loading images and CSS? Does it never scroll or move the mouse? These unnatural behaviors flag automated traffic regardless of user agent authenticity.
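A consistency check like the Client Hints mismatch above takes only a few lines, and the same logic is worth running against your own headers before sending them. The header names below are the real ones; the rule itself is a simplified illustration of one check detection systems perform:

```python
def hints_consistent(headers: dict) -> bool:
    """Flag the classic mismatch: a mobile UA paired with desktop hints.

    Simplified sketch of one consistency check anti-bot systems run.
    """
    ua = headers.get("User-Agent", "")
    mobile_hint = headers.get("Sec-CH-UA-Mobile", "")
    claims_mobile = "iPhone" in ua or "Mobile" in ua
    hints_mobile = mobile_hint == "?1"
    # If Client Hints are present, they must agree with the UA string
    if mobile_hint:
        return claims_mobile == hints_mobile
    return True

bad = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X)",
    "Sec-CH-UA-Mobile": "?0",  # desktop hint contradicts the iPhone UA
}
```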

The ScraperAPI Solution

This complexity is exactly why tools like ScraperAPI exist. Instead of managing user agent rotation manually, ScraperAPI handles the entire browser fingerprinting challenge automatically:

  • Automatic user agent rotation: Thousands of verified user agents rotated intelligently
  • Complete header consistency: All headers match and make sense together
  • Premium proxy rotation: Residential and datacenter IPs that match geographic headers
  • JavaScript rendering: Real browser rendering when needed
  • Smart retry logic: Automatic blocking detection and retry with different fingerprints

We’ll look at a Python example of how a simple ScraperAPI application can scrape a heavily protected site (you’ll need requests for this):

import requests

# ScraperAPI endpoint with your API key
api_key = "your_scraperapi_key" # Only for testing purposes
target_url = "https://protected-site-example.com"

# ScraperAPI handles all the complexity automatically
scraperapi_url = f"http://api.scraperapi.com?api_key={api_key}&url={target_url}&render=true"

try:
   response = requests.get(scraperapi_url)
  
   if response.status_code == 200:
       print("Successfully scraped protected site!")
       print(f"Content length: {len(response.text)}")
       # ScraperAPI automatically handled:
       # - User agent rotation
       # - Proxy rotation
       # - Header consistency
       # - JavaScript rendering
       # - CAPTCHA solving (if needed)
   else:
       print(f"Request failed: {response.status_code}")
      
except requests.RequestException as e:
   print(f"Error: {e}")

On success, you should see output like:

Successfully scraped protected site!
Content length: 10968

If you’re dealing with large-scale scraping or tough-to-crack sites, ScraperAPI takes care of user agent management for you and offers rock-solid reliability. That way, you can spend less time worrying about setup and more time digging into the data that actually matters.

Note: As this demonstration is only for testing purposes, feel free to paste your ScraperAPI key directly into your code. However, never hardcode it if you are pushing your code to a public repository! Save it in a .env file and import it instead.

FAQ

What is a user agent in web scraping?

A user agent is a string that your browser sends to websites to identify what type of browser, operating system, and device you’re using. In web scraping, user agents are crucial because many websites block requests that don’t have proper user agent headers or have suspicious ones that look like bots. When scraping, you should always include a realistic user agent header (like “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36”) to make your requests appear as if they’re coming from a real browser, which helps avoid getting blocked and ensures you receive the same content that regular users see. 

How to use a fake user agent for web scraping?

To use a fake user agent in web scraping, simply add a User-Agent header to your HTTP requests with a realistic browser string. In Python with requests, use

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

and pass it to your request like `requests.get(url, headers=headers)`. For better results, rotate between multiple user agents by creating a list of different browser strings and randomly selecting one for each request using `random.choice(user_agents_list)`. You can also use libraries like `fake-useragent`, which automatically provides random, up-to-date user agent strings with `UserAgent().random`.

Is web scraping for personal use illegal?

Web scraping for personal use is generally legal when you’re accessing publicly available information, respecting robots.txt files, and not overwhelming servers with excessive requests. However, legality depends on factors like the website’s terms of service, whether you’re bypassing authentication, the jurisdiction you’re in, and what you do with the scraped data. It becomes problematic when you ignore robots.txt, violate terms of service, scrape copyrighted content for commercial use, or cause harm to the website’s performance. Always check the website’s robots.txt file, read their terms of service, use reasonable delays between requests, and when in doubt, consult with a legal professional.

Where can I find a list of user agents for web scraping?

You can find user agent lists from several sources: use the Python library `fake-useragent`, which provides current user agents with `UserAgent().random`, visit websites like WhatIsMyBrowser.com or UserAgentString.com for comprehensive databases, or check your own browser’s user agent by opening developer tools and looking at the Network tab headers.

Conclusion: Putting It All Together – Your User-Agent Strategy in 2025

In this guide, we explored the critical role that user agents play in successful web scraping, from understanding what they are to implementing them effectively in your scrapers. We covered the best practices for selecting realistic user agents, rotating them properly, and avoiding common mistakes that can get your scraper blocked. 

While using realistic user agents will help your scraper blend in, sophisticated websites in 2025 employ multiple detection methods that require a holistic approach to overcome.

Effective scrapers in 2025 will implement all of the following techniques: 

  • Rotate recent, realistic user agents: Use up-to-date browser strings from popular browsers
  • Match headers to the claimed user agent: Include Client Hints and other headers that correspond to your chosen browser
  • Rotate IP addresses using proxies: Distribute requests across multiple IP addresses to avoid rate limiting 
  • Delay requests to avoid rate limits: Implement delays between requests to mimic human browsing patterns 
  • Handle sessions and cookies: Maintain proper session state and cookie management
  • Render JavaScript on modern web apps: Use headless browsers for sites that rely heavily on JavaScript 
  • Respect header consistency and browser fingerprinting logic: Ensure all request headers work together to create a believable browser fingerprint 

Managing all these components manually is certainly possible, but it’s time-consuming and requires constant maintenance as websites update their detection methods. You’ll need to monitor for blocked requests, update user agent lists, manage proxy pools, handle CAPTCHAs, and debug fingerprinting issues. 

This is where ScraperAPI simplifies the entire process. Instead of building and maintaining complex anti-detection infrastructure, ScraperAPI handles user agent rotation, proxy management, header optimization, JavaScript rendering, and CAPTCHA solving with a single API call. Your scraper can focus on extracting data while ScraperAPI manages the technical complexity of staying undetected. 

Ready to scale your web scraping with a production-ready solution? Try ScraperAPI today and experience hassle-free scraping with built-in anti-detection technology. Get started with a free trial and see how easy professional web scraping can be.

Happy Scraping!

The post Best User Agent List for Web Scraping in 2025 (with Examples & Tips) appeared first on ScraperAPI.

]]>
How to Scrape Product Reviews from Target.com Using Python https://www.scraperapi.com/web-scraping/scrape-target-product-reviews-with-python/ Mon, 05 May 2025 17:04:33 +0000 https://www.scraperapi.com/?post_type=web_scraping&p=7660

The post How to Scrape Product Reviews from Target.com Using Python appeared first on ScraperAPI.

]]>
Scraping review data from Target.com can be tricky. Like many modern websites, Target uses tools to block bots, such as rate limiting, IP blocking, and dynamic content that loads with JavaScript. This means that basic scraping methods often won’t work well or may get blocked quickly.

In this tutorial, I’ll walk you through scraping product reviews from Target.com using Python, Selenium, and ScraperAPI in proxy mode. ScraperAPI helps us avoid getting blocked by rotating IP addresses, making collecting the data we need easier.

By the end of this guide, you’ll have a solid scraping setup that can reliably pull product reviews from Target.com—great for research, analysis, or building tools that rely on honest customer opinions.

Ready? Let’s get started!

Bypass Target Botblockers
ScraperAPI lets you scrape Target.com consistently and at scale with a near 100% success rate.

TL;DR: Scraping Target Product Reviews [Full Code]

If you’re in a hurry, here’s the full Target Product Scraper:

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import seleniumwire.undetected_chromedriver as uc  # selenium-wire build; supports seleniumwire_options
from bs4 import BeautifulSoup
import json
from time import sleep
from dotenv import load_dotenv
import os

load_dotenv()  # Loads the .env file

API_KEY = os.getenv("SCRAPERAPI_KEY")
target_url = "https://www.target.com/p/enclosed-cat-litter-box-xl-up-up/-/A-90310047?preselect=87059440#lnk=sametab"

# Setup proxy and driver
proxy_url = f"http://scraperapi.render=true.country_code=us:{API_KEY}@proxy-server.scraperapi.com:8001"

options = Options()
#options.add_argument("--headless")  # Optional: enable for headless mode

seleniumwire_options = {
    'proxy': {
        'http': proxy_url,
        'https': proxy_url,
    },
    'verify_ssl': False,
}

driver = uc.Chrome(options=options, seleniumwire_options=seleniumwire_options)

# Load page
print(f"Opening URL: {target_url}")
driver.get(target_url)
sleep(10)

# Scroll and click "Show more" if it appears
print("Scrolling and checking for 'Show more' buttons...")
for i in range(3):
    try:
        driver.execute_script("window.scrollTo(0, 3200);")
        show_more = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, "/html/body/div[1]/div[2]/main/div/div[3]/div/div[17]/div/div[5]/button"))
        )
        show_more.click()
        print(f"'Show more' clicked ({i + 1})")
        sleep(3)

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(3)
    except:
        print(f"No more 'Show more' button after {i} click(s).")
        break

# Wait for reviews to appear
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-test='reviews-list']"))
    )
except:
    print("Review section may not be fully loaded.")
    driver.quit()
    exit()

sleep(3)

# Parse reviews
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

data = []

review_divs = soup.find_all("div", attrs={"data-test": "review-card--text"})
rating_spans = soup.find_all("span", class_=["styles_ndsScreenReaderOnly__mcNC_", "styles_notFocusable__XkHOR"])

for review in review_divs:
    review_text = review.get_text(strip=True)
    data.append({"review": review_text})

for i, rating in enumerate(rating_spans):
    if i < len(data):
        data[i]["rating"] = rating.get_text(strip=True)

# Save to JSON
if data:
    with open("target_reviews.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    print(f"Saved {len(data)} reviews to target_reviews.json")
else:
    print("No reviews found to save.")

print("Scraping complete.")

Want to dive deeper? Keep reading!

Scraping Target.com Reviews with Python and ScraperAPI (Proxy Mode)

In this section, we’ll build a Python script using Selenium, BeautifulSoup, and ScraperAPI in proxy mode to extract customer reviews from a product page. We’ll walk through loading the page, handling dynamic content like the “Show more” button, and saving the extracted reviews and ratings into a JSON file.

Prerequisites

Before we jump into the code, make sure you have the following installed:

  • Python 3.7 or above. Due to compatibility issues between Python 3.12+ and undetected-chromedriver, we recommend sticking to Python 3.7–3.11. As a quick workaround on Python 3.12, you can `pip install setuptools`, which usually resolves the error.
  • undetected-chromedriver – helps bypass bot detection
  • selenium – to control the browser
  • selenium-wire – for setting up proxies with Selenium
  • beautifulsoup4 – to parse HTML content
  • python-dotenv – to load your credentials from your .env file
  • Google Chrome browser
  • A ScraperAPI account and API key: Sign up for a free API key here if you don’t have one.

You can install the required Python packages using pip:

pip install undetected-chromedriver selenium selenium-wire beautifulsoup4 lxml python-dotenv

Step 1: Set Up Your Imports and API Key

Start by creating a new Python file, such as target_scraper.py. This will hold your scraping script.

Then, import the necessary libraries and set your ScraperAPI key and target URL:

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import seleniumwire.undetected_chromedriver as uc  # selenium-wire build; supports seleniumwire_options
from bs4 import BeautifulSoup
import json
from time import sleep

In the root directory of your project, create a .env file. This file will hold your credentials (your API key, in this case) and keep them out of your code, so you can’t publish them by mistake. You should also add this file to your .gitignore to avoid pushing sensitive data to version control. Create it by running:

touch .env

In it, define your API key:

SCRAPERAPI_KEY="YOUR_API_KEY"  # Replace with your actual ScraperAPI key

Once your credentials are safely stored, we can import them in our Python file.

Also add the URL of the product page you want to scrape:

from dotenv import load_dotenv
import os

load_dotenv()  # Loads the .env file

API_KEY = os.getenv("SCRAPERAPI_KEY")



target_url = "https://www.target.com/p/enclosed-cat-litter-box-xl-up-up/-/A-90310047?preselect=87059440#lnk=sametab"

Step 2: Configure ScraperAPI as a Proxy

To avoid getting blocked, we’ll use ScraperAPI in proxy mode. Here’s how to set that up and pass it into the Chrome driver.

proxy_url = f"http://scraperapi.render=true.country_code=us:{API_KEY}@proxy-server.scraperapi.com:8001"

options = Options()
options.add_argument("--headless")  # Runs Chrome in headless mode (no browser window)

seleniumwire_options = {
    'proxy': {
        'http': proxy_url,
        'https': proxy_url,
    },
    'verify_ssl': False,
}

driver = uc.Chrome(options=options, seleniumwire_options=seleniumwire_options)

This configures the driver to run headlessly (in the background) and use ScraperAPI’s rotating proxy for every request.

Step 3: Open the Target Product Page

Now that our browser is ready and using a proxy, we navigate to the Target product page:

print(f"Opening URL: {target_url}")
driver.get(target_url)
sleep(10)

We’re adding a sleep(10) here to give the page enough time to load all the dynamic content.

Step 4: Scroll and Click the “Show More” Button

Target hides some of the reviews behind a “Show more” button. If that button is available, we’ll scroll down and try clicking it up to three times.

print("Scrolling and checking for 'Show more' buttons...")
for i in range(3):
    try:
        driver.execute_script("window.scrollTo(0, 3200);")
        show_more = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, "/html/body/div[1]/div[2]/main/div/div[3]/div/div[17]/div/div[5]/button"))
        )
        show_more.click()
        print(f"'Show more' clicked ({i + 1})")
        sleep(3)

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(3)
    except:
        print(f"No more 'Show more' button after {i} click(s).")
        break

This loop mimics a real user by scrolling and clicking the button up to 3 times. It uses WebDriverWait and expected_conditions to wait for the button to become clickable.

Step 5: Wait for the Reviews to Load Fully

Before we parse the HTML, we must ensure the review list has fully rendered:

try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-test='reviews-list']"))
    )
except:
    print("Review section may not be fully loaded.")
    driver.quit()
    exit()

sleep(3)

We use presence_of_element_located to confirm that the main reviews container exists in the DOM. If it’s missing, we exit early.

Step 6: Parse the Reviews with BeautifulSoup

With the page fully loaded, we can parse it using BeautifulSoup and extract reviews and ratings.

soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()  # Close the browser

data = []

review_divs = soup.find_all("div", attrs={"data-test": "review-card--text"})
rating_spans = soup.find_all("span", class_=["styles_ndsScreenReaderOnly__mcNC_", "styles_notFocusable__XkHOR"])

for review in review_divs:
    review_text = review.get_text(strip=True)
    data.append({"review": review_text})

for i, rating in enumerate(rating_spans):
    if i < len(data):
        data[i]["rating"] = rating.get_text(strip=True)

We’re extracting review texts from div tags with the data-test="review-card--text" attribute and star ratings from span tags containing specific class names. We then pair reviews and ratings by their order on the page to build a structured list of dictionaries.
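Note that this order-based pairing is fragile: when a review has no visible rating (as the empty "rating" fields in the sample output below show), the lists drift out of alignment, and reviews past the end of the ratings list never get a "rating" key at all. A slightly more defensive sketch of the pairing step, using plain lists in place of the parsed tags, pads the shorter list so every record has both keys:

```python
from itertools import zip_longest

def pair_reviews_with_ratings(reviews, ratings):
    """Pair review texts with ratings by position, padding missing ratings."""
    return [
        {"review": review, "rating": rating}
        for review, rating in zip_longest(reviews, ratings, fillvalue="")
    ]

# Every dict gets a "rating" key, even when a rating is missing
print(pair_reviews_with_ratings(
    ["Large and sturdy.", "Perfect size."],
    ["5 out of 5 stars"],
))
```

The truly robust fix would be to locate the rating span *inside* each review card rather than pairing two independent lists, so the pairs can never drift; the selector for that depends on Target's current markup.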

Step 7: Save the Reviews to a JSON File

Finally, save the extracted data into a .json file.

if data:
    with open("target_reviews.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    print(f"Saved {len(data)} reviews to target_reviews.json")
else:
    print("No reviews found to save.")

print("Scraping complete.")

The result is a JSON file like this:

[
    {
        "review": "I like everything about this property box. I have 2 cats and this box can handle boths",
        "rating": ""
    },
    {
        "review": "Large and sturdy. Easy to put together.",
        "rating": ""
    },
    {
        "review": "Perfect size. Even our 20+lbs chunky bum fits in it.",
        "rating": "4.7 out of 5 stars with 972 reviews"
    },
    {
        "review": "I love the size of this litter box. Plenty of space for my cat to stand up and move around inside it. I removed the flap and the opening is really easy for my cat to go in and out of. The lid locks on securely and I've had no issues with it sliding or moving as my cat goes in and out.",
        "rating": "2 out of 5 stars with 1 ratings"
    },
    {
        "review": "Sturdy litter box, tall enough to keep everything in the box (even for our tall male cat). The price is great",
        "rating": "5 out of 5 stars with 1 ratings"
    },
    {
        "review": "I am very pleased with this items. My cat loves it 🤩",
        "rating": "5 out of 5 stars with 11 ratings"
    },
    {
        "review": "Quality is good but disappointed in scratches the swing door has. Seems like they give you what's been tossed around at the store for pick up orders",
        "rating": "4.4 out of 5 stars with 1437 ratings"
    }
]

Wrapping Up

And that’s it! You now have a working scraper that pulls real customer reviews from Target.com using Python, Selenium, and ScraperAPI’s proxy mode.

This data can be helpful for sentiment analysis, market research, or just keeping tabs on how customers receive products. With the proper setup, you can expand this script to scrape multiple products, track changes over time, or even build your review analysis tool.
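As a small taste of the sentiment-analysis direction, here is a minimal sketch that works on the exact structure saved to the JSON file above and counts the most frequent words across reviews (the stop-word list is deliberately tiny and illustrative):

```python
import json
from collections import Counter

def top_words(reviews, n=5, stopwords=frozenset({"the", "and", "a", "to", "my", "is", "it", "for"})):
    """Count the most frequent non-stopword tokens across review texts."""
    counter = Counter()
    for entry in reviews:
        for word in entry["review"].lower().split():
            token = word.strip(".,!?")
            if token and token not in stopwords:
                counter[token] += 1
    return counter.most_common(n)

# Works directly on the structure saved to target_reviews.json
reviews = json.loads('[{"review": "Great box. My cat loves it."}, {"review": "Great size for my cat."}]')
print(top_words(reviews, n=2))  # → [('great', 2), ('cat', 2)]
```

For real sentiment work you would swap the word counter for a proper tokenizer and a sentiment model, but the data-loading step stays the same.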

If you haven’t already, sign up for ScraperAPI and grab your free API key to start scraping without worrying about getting blocked. It’s an excellent tool for handling the behind-the-scenes complexity of proxies and IP rotation so you can focus on building.

Happy Scraping!

The post How to Scrape Product Reviews from Target.com Using Python appeared first on ScraperAPI.

]]>
How to Scrape Google Search Results with Python in 2025 https://www.scraperapi.com/web-scraping/scrape-google-search-results/ Mon, 28 Apr 2025 12:25:51 +0000 https://www.scraperapi.com/?post_type=web_scraping&p=7641 Want to learn how to scrape Google search results without getting blocked? You’re in the right place. Google’s search engine results pages (SERPs) are packed with valuable data for SEO tracking, market research, and competitive analysis. Whether you’re monitoring keyword rankings, analyzing search trends, or gathering insights for content optimization, scraping Google search data will […]

The post How to Scrape Google Search Results with Python in 2025 appeared first on ScraperAPI.

]]>
Want to learn how to scrape Google search results without getting blocked? You’re in the right place. Google’s search engine results pages (SERPs) are packed with valuable data for SEO tracking, market research, and competitive analysis. Whether you’re monitoring keyword rankings, analyzing search trends, or gathering insights for content optimization, scraping Google search data will help you get ahead.

However, scraping Google isn’t as simple as extracting data from a typical website. Google has strict anti-scraping measures in place, including CAPTCHAs, IP rate limits, and bot-detection systems designed to block automated requests. Traditional web scrapers often struggle to overcome these obstacles, resulting in incomplete data or outright bans.

In this article, you will learn how to scrape Google search results using Python, covering:

  • The key challenges of Google search scraping and how to bypass them
  • A step-by-step guide to using ScraperAPI for fast, scalable, and reliable scraping
  • Alternative methods, including Selenium for browser-based scraping
  • Practical ways to use Google SERP data for SEO, market research, and analytics

Ready? Let’s get started!

What Is a Google SERP and How Does It Work?

Whenever you search for something on Google, the results appear on a Search Engine Results Page (SERP). Google SERPs aren’t just a list of blue links; they contain a mix of organic results, paid ads, featured snippets, and other search elements that Google dynamically adjusts based on relevance, location, and user intent.

Key Elements of a Google SERP

Google SERPs include multiple sections, each designed to provide different types of information:

  • Organic search results: The organic results are the traditional blue-link listings ranked by Google’s search ranking algorithms, which search engine optimization (SEO) practices aim to influence. These rankings depend on factors such as page content relevance (including keywords), backlinks, and page authority. If you’re scraping Google results, extracting title tags, meta descriptions, and URLs from organic results can help track keyword rankings and optimize your SEO strategy.
Python coding search results on Google
  • Featured snippets: Featured snippets appear above organic results and provide a direct answer to a search query. They often include text, lists, or tables extracted from high-ranking web pages. Extracting these snippets from Google search results allows you to parse direct answers for a given search query.
Definition of featured snippet in Google Search
  • People Also Ask (PAA) boxes: The People Also Ask (PAA) section contains expandable questions related to the topic. Clicking on one reveals an answer along with additional related queries. Extracting PAA data can help identify common search queries, optimize content strategy, and discover new SEO opportunities. It’s important to note that PAA boxes are dynamic, meaning they can change depending on the specific user’s intent or query.
Related Python coding questions on Google Search
  • Image and Video Results: Google SERPs often include image and video carousels, particularly for visual queries. Extracting these results can help analyze which media content ranks for a particular keyword. This is useful for optimizing YouTube SEO and image-based content.
YouTube video results explaining what Python is

How to Scrape Google Search Results Using Python

Scraping Google search results has always been challenging due to Google’s strict anti-bot measures. Traditional web scraping tools struggle to maintain stable access without frequent interruptions. This is where ScraperAPI provides a seamless solution. Instead of juggling user agents, proxies, and Google’s evolving page structure, ScraperAPI’s Google Search endpoint delivers structured JSON data with minimal effort.

In this tutorial, we’ll scrape data from Google search results for the query “how to code in python 2025” using ScraperAPI’s structured search endpoint. We’ll retrieve organic search results, related questions from Google’s “People Also Ask” section, and YouTube video results, then save the extracted data into CSV files for further analysis.

Step 1: Install Required Packages

For this tutorial, you will need:

  • Python
  • Your integrated development environment (IDE) of choice to run your scripts

Before running the script, ensure you have installed the necessary dependencies. These include:

  • requests: To send HTTP requests to ScraperAPI
  • pandas: To store and manipulate the extracted data
  • python-dotenv: For safely handling credentials

Run the following command in your terminal or command prompt:

pip install requests pandas python-dotenv

Step 2: Set Up the ScraperAPI Request

First and foremost, we need to define your ScraperAPI key and other search parameters. If you don’t have a ScraperAPI key yet, sign up here and get one for free. After you have created your account, you can find your personal API key on your dashboard.

Since we will be using this key in two files and it contains sensitive information, we’ll store it in a dedicated credentials file and load it as an environment variable. This step is important, so don’t skip it! You don’t want your personal API key to be mistakenly made public.

In the root directory of your project, create a .env file to store your environment variables. You should also add this file to your .gitignore file to prevent sensitive data from being pushed to version control. Create it by running:

touch .env

In it, define your API key:

SCRAPERAPI_KEY="YOUR_API_KEY"  # Replace with your actual ScraperAPI key

Create a file in your project folder and call it config.py. In this file, load the environment variable. You can also set your parameters in your actual scraper, but having them in your configuration will make it easier to use them across files.

import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Access the API key from the environment variable
API_KEY = os.getenv("SCRAPERAPI_KEY")

# Search parameters 
search_query = "how to code in python 2025"
search_country = "us"
pages_to_scrape = 5  # Number of pages to scrape

  • API_KEY: Your unique ScraperAPI key, loaded from your environment
  • search_query: The Google search term (e.g., “how to code in python 2025”)
  • search_country: The country for localized search results (e.g., “us” for the United States)
  • pages_to_scrape: Number of pages to scrape (each page contains multiple results)

After doing that, in a new Python file (let’s call it scraper.py), import the necessary Python libraries and your configurations:

import requests
import json
import pandas as pd
import time
from config import API_KEY, search_query, search_country, pages_to_scrape

Step 3: Define the Scraping Function

Now, let’s create a function to scrape Google search results using ScraperAPI.

Below you can see it in its entirety, but we will break it down in later sections for better handling and understanding:

def scrape_google_search(query, country="us", num_pages=3):
    """Scrapes Google search results using ScraperAPI's structured data endpoint."""
    search_results = []
    related_questions = []
    video_results = []

    for page in range(num_pages):
        payload = {
            "api_key": API_KEY,
            "country": country,
            "query": query,
            "page": page + 1
        }

        try:
            response = requests.get("https://api.scraperapi.com/structured/google/search", params=payload)

            if response.status_code == 200:
                serp = response.json()
                for result in serp.get("organic_results", []):
                    search_results.append({
                        "Title": result.get("title", "N/A"),
                        "URL": result.get("link", "N/A"),
                        "Snippet": result.get("snippet", "N/A"),
                        "Rank": result.get("position", "N/A"),
                        "Domain": result.get("displayed_link", "N/A"),
                    })
                for question in serp.get("related_questions", []):
                    related_questions.append({
                        "Question": question.get("question", "N/A"),
                        "Position": question.get("position", "N/A"),
                    })
                for video in serp.get("videos", []):
                    video_results.append({
                        "Video Title": video.get("title", "N/A"),
                        "Video URL": video.get("link", "N/A"),
                        "Channel": video.get("channel", "N/A"),
                        "Duration": video.get("duration", "N/A"),
                    })
                print(f"Scraped page {page + 1} successfully.")
            else:
                print(f"Error {response.status_code}: {response.text}")
                break

        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            break

        time.sleep(2)  # Pause to prevent rate limiting

    # Save organic search results
    if search_results:
        df = pd.DataFrame(search_results)
        df.to_csv("google_search_results.csv", index=False)
        print("Data saved to google_search_results.csv")

    # Save related questions
    if related_questions:
        df_questions = pd.DataFrame(related_questions)
        df_questions.to_csv("google_related_questions.csv", index=False)
        print("Related questions saved to google_related_questions.csv")

    # Save YouTube video results
    if video_results:
        df_videos = pd.DataFrame(video_results)
        df_videos.to_csv("google_videos.csv", index=False)
        print("Video results saved to google_videos.csv")

    print("Scraping complete.")


# Call the function with the variables imported from our config file
scrape_google_search(search_query, country=search_country, num_pages=pages_to_scrape)

Let’s now break down each part of our function to highlight each step of the process. First off, let’s create the core of the function, which defines variables to store results and launches the request:

def scrape_google_search(query, country="us", num_pages=3):
    """Scrapes Google search results using ScraperAPI's structured data endpoint."""
    search_results = []
    related_questions = []
    video_results = []

    for page in range(num_pages):
        payload = {
            "api_key": API_KEY,
            "country": country,
            "query": query,
            "page": page + 1
        }

        try:
            response = requests.get("https://api.scraperapi.com/structured/google/search", params=payload)

            if response.status_code == 200:
                serp = response.json()

How This Function Works:

  1. Creates empty lists (search_results, related_questions, and video_results) to store extracted data.
  2. Loops through num_pages, making an API request for each page of search results.
  3. Sends a request to ScraperAPI with:
    • api_key → Your API key for authentication
    • country → The targeted country
    • query → The search term
    • page → The current page number
  4. Checks if the response status is 200 (OK) before processing the data.
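Under the hood, `requests` serializes the payload into the query string of the API call. A quick standard-library sketch of roughly what the final request URL looks like (with a placeholder key, not a real credential):

```python
from urllib.parse import urlencode

# Placeholder key, not a real credential
payload = {
    "api_key": "YOUR_API_KEY",
    "country": "us",
    "query": "how to code in python 2025",
    "page": 1,
}
url = "https://api.scraperapi.com/structured/google/search?" + urlencode(payload)
print(url)
```

Printing the assembled URL like this is also a handy way to debug a request before sending it.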

Step 4: Extract Search Results

If the API request is successful, we extract three types of data:

  1. Organic Search Results:
for result in serp.get("organic_results", []):
                    search_results.append({
                        "Title": result.get("title", "N/A"),
                        "URL": result.get("link", "N/A"),
                        "Snippet": result.get("snippet", "N/A"),
                        "Rank": result.get("position", "N/A"),
                        "Domain": result.get("displayed_link", "N/A"),
                    })

Code Breakdown:

  • Extracts the top organic search results from serp.get("organic_results", []).
  • For each result, it retrieves:
    • title: The headline of the search result
    • link: The URL of the webpage
    • snippet: A short description from Google’s result page
    • position: The ranking position of the result
    • displayed_link: The domain of the website
  1. Related Questions (“People Also Ask”):
for question in serp.get("related_questions", []):
                    related_questions.append({
                        "Question": question.get("question", "N/A"),
                        "Position": question.get("position", "N/A"),
                    })

Code Breakdown:

  • Extracts questions from the “People Also Ask” section of Google search results.
  • Retrieves:
    • question: The related question text
    • position: The question’s ranking position on the page
  1. YouTube Video Results:
for video in serp.get("videos", []):
                    video_results.append({
                        "Video Title": video.get("title", "N/A"),
                        "Video URL": video.get("link", "N/A"),
                        "Channel": video.get("channel", "N/A"),
                        "Duration": video.get("duration", "N/A"),
                    })

Code Breakdown:

  • Extracts YouTube video results appearing in Google search.
  • Retrieves:
    • title: The video’s title
    • link: The direct URL to the video
    • channel: The YouTube channel name
    • duration: The length of the video

Step 5: Handle Errors and API Rate Limiting

To ensure our script runs smoothly without unexpected failures, we need to handle common errors and manage our request rates effectively:

            if response.status_code == 200:
                # ... (data extraction from Step 4)
                print(f"Scraped page {page + 1} successfully.")
            else:
                print(f"Error {response.status_code}: {response.text}")
                break

        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            break

        time.sleep(2)  # Pause to prevent rate limiting

Code Breakdown:

  • Handles errors by checking response.status_code. If there’s an error, it prints the message and stops execution.
  • Introduces a time.sleep(2) delay between requests to avoid triggering ScraperAPI’s rate limits.
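A fixed delay keeps you under the rate limit, but if a request does fail transiently, retrying with exponential backoff is gentler than breaking out of the loop. A generic sketch (the function name and delays are illustrative, not part of ScraperAPI's client):

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))

# Demo with a stub that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = fetch_with_backoff(flaky, base_delay=0.01)
print(result)  # → ok
```

In the scraper above, you would wrap the `requests.get(...)` call in such a helper instead of breaking on the first `RequestException`.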

Step 6: Save Data to CSV Files

Once the data is collected, we save it as CSV files for analysis:

# Save organic search results
if search_results:
    df = pd.DataFrame(search_results)
    df.to_csv("google_search_results.csv", index=False)
    print("Data saved to google_search_results.csv")

# Save related questions
if related_questions:
    df_questions = pd.DataFrame(related_questions)
    df_questions.to_csv("google_related_questions.csv", index=False)
    print("Related questions saved to google_related_questions.csv")

# Save YouTube video results
if video_results:
    df_videos = pd.DataFrame(video_results)
    df_videos.to_csv("google_videos.csv", index=False)
    print("Video results saved to google_videos.csv")

print("Scraping complete.")

Code Breakdown:

  • Checks if data exists before saving to prevent empty files.
  • Uses pandas.DataFrame() to store extracted data.
  • Exports results to CSV files for further analysis.

Call the function you have just created at the end of your file, and outside of the function (mind indentation!):

# Call your function within your file so that it runs
scrape_google_search(search_query, country=search_country, num_pages=pages_to_scrape)

You can run the script by going to your root folder in your terminal and doing:

python scraper.py

Now you have a functioning Google search scraper, and your data is saved in CSV files and ready for analysis in Google Sheets or Excel. Feel free to explore and refine your search parameters for even better results; you can modify the query, country, or number of pages to tailor the data to your needs!  

Automating Google Search Scraping

Now that you’ve successfully scraped Google search results manually, let’s take it a step further by automating the process. With ScraperAPI’s Datapipeline, you can schedule your scraping tasks to run at regular intervals, whether daily, hourly, or weekly. This allows you to collect data continuously without needing to run the script manually.

To automate Google search scraping, create a new file (we’ll call it automatic_scraper.py). Here is the code in its entirety:

import requests
import json
from config import API_KEY, DATAPIPELINE_URL

data = {
    "name": "Automated Google Search Scraping",
    "projectInput": {
        "type": "list",
        "list": ["how to code in python 2025"]
    },
    "projectType": "google_search",
    "schedulingEnabled": True,
    "scheduledAt": "now",
    "scrapingInterval": "daily",
    "notificationConfig": {
        "notifyOnSuccess": "with_every_run",
        "notifyOnFailure": "with_every_run"
    },
    "webhookOutput": {
        "url": "http://127.0.0.1:5000/webhook",  # Replace if using a webhook
        "webhookEncoding": "application/json"
    }
}

headers = {'content-type': 'application/json'}
response = requests.post(DATAPIPELINE_URL, headers=headers, data=json.dumps(data))

# Print the response
print(response.json())

Let’s break it down. Start by importing the necessary libraries:

import requests
import json

Next, set up authentication by adding your Datapipeline API URL to your config file, then import it into your automatic_scraper.py file together with your API key. In config, below your API key, add:

DATAPIPELINE_URL = f"https://datapipeline.scraperapi.com/api/projects?api_key={API_KEY}"

And add it to your imports in your automatic_scraper.py file:

from config import API_KEY, DATAPIPELINE_URL

Now, configure the scraping job by specifying:

  • The search queries to track
  • How often the scraping should run
  • Where the results should be stored

Instead of assigning these to variables like we did in our first script, we’ll define them using an in-file JSON object. By default, results are stored in the ScraperAPI dashboard, but they can also be sent to a webhook if specified.

data = {
    "name": "Automated Google Search Scraping",
    "projectInput": {
        "type": "list",
        "list": ["how to code in python 2025"]
    },
    "projectType": "google_search",
    "schedulingEnabled": True,
    "scheduledAt": "now",
    "scrapingInterval": "daily",
    "notificationConfig": {
        "notifyOnSuccess": "with_every_run",
        "notifyOnFailure": "with_every_run"
    },
    "webhookOutput": {
        "url": "http://127.0.0.1:5000/webhook",  # Replace if using a webhook
        "webhookEncoding": "application/json"
    }
}

Breaking Down the JSON Configuration

Each key in this JSON object defines an important part of the automated scraping job:

  • "name": The name of the scraping project (for organization).
  • "projectInput": The search queries to scrape:
    • "type": "list" indicates that multiple queries can be provided.
    • "list": A list of search terms (e.g., "how to code in python 2025").
  • "projectType": Defines the structured data endpoint we are using. Since we are using the Google search endpoint, the value is "google_search".
  • "schedulingEnabled": True enables automatic scheduling.
  • "scheduledAt": "now" starts the scraping immediately (can be set to a specific time).
  • "scrapingInterval": Controls how often the scraping runs:
    • "hourly", "daily", or "weekly" are valid options.
  • "notificationConfig": Defines when notifications are sent:
    • "notifyOnSuccess": "with_every_run" means a notification is sent every time the scraping succeeds.
    • "notifyOnFailure": "with_every_run" means an alert is sent every time the scraping fails.
  • "webhookOutput": (Optional) Sends the results to a webhook instead of just storing them in ScraperAPI’s dashboard:
    • "url" → The webhook URL where results will be sent.
    • "webhookEncoding" → Defines the format of the data, which is "application/json" in this case.
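The config above points the webhook at `http://127.0.0.1:5000/webhook`. If you want to receive results locally without installing a web framework, a minimal receiver can be sketched with Python's standard library (port 5000 matches the example URL; any free port works):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_payload(body: bytes) -> dict:
    """Decode the JSON body that the webhook receives."""
    return json.loads(body.decode("utf-8"))

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly Content-Length bytes, parse the JSON, and acknowledge
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_payload(self.rfile.read(length))
        print(f"Received payload with keys: {sorted(payload)}")
        self.send_response(200)
        self.end_headers()

# Start the receiver (blocks until interrupted):
# HTTPServer(("127.0.0.1", 5000), WebhookHandler).serve_forever()
```

In production you would run a receiver on a publicly reachable URL and persist the payload instead of printing it; the exact payload shape is whatever Datapipeline posts, so inspect the first delivery before building on it.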

For more details on how to automate web scraping efficiently, check out this guide: Automated Web Scraping with ScraperAPI.

Now that everything is set up, we can build our request at the bottom of the file and send it to ScraperAPI to schedule the scraping job:

headers = {'content-type': 'application/json'}
response = requests.post(DATAPIPELINE_URL, headers=headers, data=json.dumps(data))

# Print the response
print(response.json())

Run your script by doing:

python automatic_scraper.py

By automating Google search scraping with ScraperAPI’s Datapipeline, you can collect search data continuously without manually running scripts. Whether you need fresh results daily, hourly, or weekly, this solution saves time while handling large search volumes effortlessly!

Challenges of Scraping Google Search Results

Google’s anti-scraping measures are always evolving, and without the right techniques, your scraper can still get blocked. Here’s a quick recap of the biggest challenges and how to handle them:

1. CAPTCHAs and Bot Detection

Google can detect scrapers and redirect them to a CAPTCHA challenge, making further data extraction impossible. ScraperAPI handles this for you: the structured endpoint used in this tutorial manages CAPTCHA blocks behind the scenes, and its Render Instruction Set can bypass most CAPTCHAs on JavaScript-heavy pages.

2. IP Bans and Rate Limits

Scraping too many results too quickly from the same IP address will get your requests blocked. The best way to prevent this is by rotating IPs and spreading out your requests. ScraperAPI handles this for you, switching IPs automatically to keep your scraper undetected.

3. Constantly Changing SERP Layouts

Google frequently updates its SERP structure, which can break scrapers relying on fixed HTML selectors. Instead of manually updating your scraper every time Google changes its layout, you can use ScraperAPI’s rendering capabilities to return fully loaded HTML or DOM snapshots to simplify the process and extract dynamic content using more flexible parsing methods.

4. JavaScript-Rendered Content

Some search elements—like People Also Ask boxes or dynamic snippets—only appear after JavaScript runs. Enabling JavaScript rendering with ScraperAPI’s Render Instruction Set ensures all Google search data is fully loaded before extraction.

5. Geographic Restrictions and Personalized Results

Google tailors search results based on location, meaning the same query can produce different results depending on where the request is coming from. To scrape region-specific data, you need geotargeted proxies. ScraperAPI lets you specify country-based IPs, so you can get results from any location.

The Best Way to Scrape Google Search Results

Google’s defenses against web scraping are strong, but ScraperAPI makes it easier than ever to scrape search results efficiently. Instead of managing proxies, request headers, and CAPTCHA-solving manually, ScraperAPI automates everything for you, allowing you to focus on extracting real-time search data without interruptions.

If you need even more control, Selenium and Playwright provide a browser-based alternative, but they come with higher complexity and slower execution times. For most users, ScraperAPI remains the fastest and most scalable option for Google search scraping.

Now that you know how to handle Google’s anti-scraping measures, let’s look at what you can do with the data you’ve collected.

Wrapping Up: What You Can Do with Scraped Google Search Data

Now that you’ve successfully scraped Google search results, it’s time to put that data to use. Having direct access to search data gives you insights that would otherwise take hours to gather manually.

1. Track Keyword Rankings More Effectively

Knowing where your website ranks is essential for SEO. Scraped search data helps you:

  • Monitor rankings across different keywords and locations.
  • Track changes over time to measure the impact of SEO strategies.
  • Identify search terms where you can improve and outrank competitors.
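With the CSV files from the tutorial in hand, checking where a given domain ranks takes only a few lines. A standard-library sketch (the column names match the tutorial's `google_search_results.csv`; the sample rows are made up for illustration):

```python
import csv
import io

def ranks_for_domain(csv_text, domain):
    """Return the Rank values of rows whose Domain contains the given string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [int(row["Rank"]) for row in reader if domain in row["Domain"]]

# Sample data in the same shape as google_search_results.csv
sample = (
    "Title,URL,Snippet,Rank,Domain\n"
    "Learn Python,https://example.com/py,Intro,1,example.com\n"
    "Python 2025,https://other.org/p,Guide,2,other.org\n"
)
print(ranks_for_domain(sample, "example.com"))  # → [1]
```

Run this daily against the Datapipeline output and you have a basic rank-tracking log; to read the real file, replace `io.StringIO(csv_text)` with `open("google_search_results.csv")`.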

2. Analyze Competitor Strategies in Detail

Your competitors are constantly working to gain visibility. Scraping Google search results helps you:

  • Identify which web pages and articles rank for your target keywords.
  • Analyze featured snippets, People Also Ask boxes, and paid ads to see where competitors are appearing.
  • Find gaps in their content strategy and create optimized pages to outperform them.

3. Get Insights Into Google Ads and Paid Search

Running paid search campaigns without competitive data can be costly. Extracting Google Ads data allows you to:

  • See which businesses are bidding on your target keywords.
  • Analyze ad copy, pricing strategies, and landing pages to refine your own approach.
  • Compare paid search results with organic rankings to find cost-effective opportunities.

4. Spot Market Trends and Shifting Consumer Interest

Google search data reveals what people are searching for in real time. Scraping search results allows you to:

  • Identify trending topics and growing demand in your industry.
  • Track seasonal patterns in search volume to adjust your strategy.
  • Extract related searches to understand how users think about your topic.

5. Automate Competitor and Industry Monitoring

Manually checking Google every day is inefficient. Scraping automates the process so you always have fresh insights. You can:

  • Keep track of new content published by competitors.
  • Monitor brand mentions in search results.
  • Stay updated on shifts in rankings for key industry terms.

Having direct access to Google’s search data gives you more control, deeper insights, and a competitive edge. Automating data collection saves time and helps you make smarter, data-driven decisions. Try out ScraperAPI to streamline and automate your Google search scraping!

FAQ

Is it legal to scrape Google search results?

Yes, scraping publicly available search results is generally legal, but how the data is used matters. Some websites’ terms of service prohibit scraping, so it’s important to review Google’s policies and use the data responsibly.

Does Google ban scraping?

Yes, Google actively blocks scrapers by using CAPTCHAs, IP bans, and request rate limits. However, ScraperAPI rotates IPs, bypasses CAPTCHAs, and renders JavaScript, making it possible to scrape without getting blocked.

What tools can I use to scrape Google search results?

ScraperAPI is the best choice for fast, scalable, and automated scraping with built-in IP rotation and CAPTCHA solving. Selenium and Playwright are useful for browser-based scraping but are slower and more complex. Google’s SERP API provides structured search data but has access restrictions and rate limits.

The post How to Scrape Google Search Results with Python in 2025 appeared first on ScraperAPI.

]]>