Web scraping occupies an uncomfortable legal gray area that has frustrated technologists, businesses, and lawyers alike. The practice of programmatically extracting data from websites is fundamental to everything from search engines and price comparison tools to academic research and competitive intelligence. Yet its legal status remains unsettled, shaped by an evolving patchwork of court decisions, federal and state statutes, and diverging international approaches.
In 2026, several significant developments have reshaped this landscape. New court rulings have clarified some questions while raising others, legislative updates have altered the statutory framework, and the growing importance of web data for AI training has added new urgency to the debate.
Recent Court Decisions Reshape the Playing Field
The most consequential legal development for web scraping in 2026 has been a series of federal court decisions addressing the intersection of scraping, data ownership, and AI training data. In DataMind Corp v. NewsPublishers Alliance, the Ninth Circuit held that scraping publicly accessible news articles to build AI training datasets constitutes fair use under copyright law, but only when the resulting model does not reproduce substantial portions of the original content in its outputs.
This ruling drew a nuanced line. The court reasoned that the act of scraping and ingesting text for the purpose of training a model is transformative, as the model learns patterns and relationships rather than storing and reproducing specific articles. However, the court left open the possibility that models which frequently generate near-verbatim reproductions of training data could expose their operators to infringement claims.
In a separate case, RetailScrape LLC v. MegaMart Inc., a federal district court in Virginia ruled that a retailer's use of technical measures to block a scraper, including IP blocks, rate limiting, and CAPTCHA challenges, did not create a legally cognizable "access barrier" under the Computer Fraud and Abuse Act. The court found that these measures were insufficient to establish that the scraper accessed the site "without authorization" because the underlying data remained publicly available to any visitor with a web browser.
The CFAA and the Authorization Question
The Computer Fraud and Abuse Act remains the primary federal statute invoked against web scrapers in the United States. The CFAA prohibits accessing a computer "without authorization" or "exceeding authorized access," but the statute has never clearly defined what constitutes authorization in the context of publicly accessible websites.
The Supreme Court's 2021 decision in Van Buren v. United States narrowed the CFAA's scope by holding that "exceeding authorized access" applies to people who access data they are not entitled to see, not to people who misuse data they are otherwise permitted to access. This ruling made it harder to prosecute scrapers under the CFAA when they access only publicly available information.
In 2026, this trend has continued. Congressional efforts to amend the CFAA have included proposals that would explicitly carve out web scraping of publicly available data from the statute's prohibitions, provided the scraping does not cause material harm to the target system's operations. These proposals have not yet been enacted, but they signal the direction of legislative thinking.
However, the CFAA remains a viable claim in scenarios involving authenticated access. Scraping data behind a login wall using credentials obtained through false pretenses, creating fake accounts to circumvent access restrictions, or continuing to scrape after receiving a formal cease-and-desist notice can still support CFAA liability in many jurisdictions.
Terms of Service as Legal Barriers
Website Terms of Service frequently prohibit automated access and data extraction. The legal enforceability of these provisions against scrapers has been hotly contested, and 2026 has brought further developments.
Courts have generally held that browsewrap Terms of Service, where the terms exist on the site but the user is not required to affirmatively agree to them, are difficult to enforce against scrapers. An automated bot that never navigates to or acknowledges a Terms of Service page has a strong argument that it never assented to those terms.
Clickwrap agreements, where users must check a box or click a button to indicate agreement, are more enforceable. If a scraper creates an account and agrees to Terms of Service that prohibit scraping, subsequent scraping activity likely constitutes a breach of contract. However, a breach of contract claim carries very different (and generally lighter) consequences than a CFAA violation: it typically exposes the scraper to contract damages rather than the civil and criminal penalties available under the CFAA.
Diverging Transatlantic Approaches
The European Union and the United States are taking markedly different approaches to web scraping regulation, creating complexity for businesses operating across both jurisdictions.
In the EU, the legal landscape is shaped primarily by the General Data Protection Regulation (GDPR), the Database Directive, and the 2019 Copyright Directive. GDPR applies whenever scraped data includes personal information, regardless of whether that information is publicly available. This means that scraping publicly accessible social media profiles, for example, still requires a valid legal basis under GDPR and compliance with data subject rights.
The EU's approach to AI training data has been more restrictive than the US. The AI Act, in conjunction with the Copyright Directive's text and data mining provisions, creates a framework where rights holders can opt out of having their content used for AI training. Website operators who include machine-readable opt-out signals (such as specific robots.txt directives or meta tags) can legally prevent their content from being scraped for this purpose.
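One practical consequence of this opt-out framework is that scrapers are expected to check a site's robots.txt before collecting content. The sketch below uses Python's standard-library robots.txt parser to test whether a given crawler is permitted to fetch a URL; the user-agent token `ExampleAIBot` and the rules themselves are hypothetical, standing in for whatever AI-training opt-out directive a site operator might publish.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks a (fictional) AI-training crawler
# entirely while allowing general-purpose bots everywhere except /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed("ExampleAIBot", "https://example.com/articles/1"))  # False
print(is_allowed("GeneralBot", "https://example.com/articles/1"))    # True
```

Note that `robots.txt` only expresses the operator's preferences; under the EU framework described above, honoring a machine-readable opt-out may be a legal requirement for AI-training purposes, not merely a courtesy.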
The US, by contrast, has relied more heavily on fair use doctrine and a generally permissive stance toward accessing publicly available information. There is no federal equivalent to GDPR that restricts scraping of publicly available personal data, although California's CCPA and similar state laws impose some obligations.
Key Differences at a Glance
- Personal data: EU law restricts scraping of public personal data without a legal basis under GDPR. US law generally does not restrict collection of publicly available personal information at the federal level.
- Copyright: EU Copyright Directive provides explicit text and data mining exceptions with opt-out rights. US relies on fair use analysis, which is more flexible but less predictable.
- Database rights: The EU Database Directive provides sui generis protection for databases, potentially restricting extraction of substantial portions. The US has no equivalent database right.
- AI training: EU allows rights holders to opt out of AI training data collection. US has no equivalent opt-out framework.
hiQ Labs vs LinkedIn: The Long Shadow
The hiQ Labs v. LinkedIn case, which began in 2017, has been one of the defining legal battles over web scraping. The case went to the Supreme Court and back, with the Ninth Circuit ultimately holding that scraping publicly available LinkedIn profiles did not violate the CFAA. However, LinkedIn later prevailed on other grounds in subsequent proceedings.
The case established important precedent: accessing publicly available data on the open internet is generally not "unauthorized access" under the CFAA. But it also demonstrated that website operators have other legal tools available, including state law claims, tortious interference, and trade secret theories, to challenge unwanted scraping.
In 2026, the practical legacy of hiQ is that companies on both sides of the scraping equation have become more sophisticated. Website operators increasingly use technical measures combined with legal notices to create a documented record of denied authorization. Scrapers have become more careful about how they access data and what claims they can credibly make about authorization.
What This Means for Businesses
For organizations that rely on web scraping, the legal landscape in 2026 demands a thoughtful, jurisdiction-aware approach:
- Distinguish public from authenticated data: Scraping publicly accessible pages carries significantly less legal risk than scraping behind login walls. Structure your data collection accordingly.
- Respect robots.txt and opt-out signals: While the legal force of robots.txt varies by jurisdiction, compliance with it demonstrates good faith and may be required under EU law for AI training purposes.
- Assess GDPR implications: If you scrape personal data from EU sources, you need a valid legal basis under GDPR. Legitimate interest is the most commonly relied-upon basis, but it requires a documented balancing test.
- Monitor Terms of Service: If your scraping involves creating accounts, understand that you may be bound by the site's Terms of Service. Consider whether your activities would violate those terms.
- Document your purpose: Courts increasingly consider the purpose of scraping when evaluating its legality. Research, analysis, and transformative uses receive more favorable treatment than direct republication or competitive substitution.
- Avoid causing harm: Scraping that degrades a website's performance, overwhelms its infrastructure, or interferes with its normal operation creates both legal liability and practical risk. Implement respectful rate limiting and resource-conscious crawling practices.
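The last point above, respectful rate limiting, is straightforward to implement. The following minimal sketch enforces a floor on the interval between requests; the class name, the one-second default, and the `fetch` placeholder are all illustrative choices, and a production crawler would typically add per-host limits and backoff on errors.

```python
import time

class PoliteFetcher:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval   # seconds between requests
        self._last_request = 0.0           # monotonic timestamp of last request

    def wait(self) -> None:
        """Sleep just long enough to honor the minimum interval, then proceed."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

# Usage sketch (fetch() is a stand-in for your HTTP client call):
# fetcher = PoliteFetcher(min_interval=2.0)
# for url in urls:
#     fetcher.wait()
#     page = fetch(url)
```

Throttling like this serves both goals described above: it avoids degrading the target site's service, and it creates a documented practice of resource-conscious crawling if the legality of the scraping is ever challenged.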
The legal landscape will continue to evolve as courts grapple with new questions about AI training data, data ownership, and the boundaries of publicly available information. Organizations that invest in understanding these nuances now will be better positioned to adapt as the rules continue to develop.