HOW TO prevent screen scraping

It is not difficult to screen scrape web pages and extract specific portions of a page using regular expressions. This custom C# GetImages method fetches all images from a specified web page URL. [BTW, it has a nice trick for making relative hyperlinks and image paths work: it uses the BASE tag to set the document's base URL, so relative paths resolve against the original site instead of breaking on the scraped copy. The BASE element must be placed inside the HEAD element.]
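To show how little code this takes, here is a minimal sketch of the same idea in Python. It is not the C# GetImages method mentioned above; the names get_images and add_base_tag, and the regular expressions, are my own assumptions for illustration, and the regex only handles quoted src attributes.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: captures the quoted value of src in an <img> tag.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
# Matches the opening <head> tag (but not <header>).
HEAD_OPEN = re.compile(r'<head\b[^>]*>', re.IGNORECASE)


def get_images(html, page_url):
    """Return every image URL in html, resolved against page_url."""
    return [urljoin(page_url, src) for src in IMG_SRC.findall(html)]


def add_base_tag(html, page_url):
    """Insert a BASE element right after the opening HEAD tag so that
    relative links in the scraped copy resolve against the original page."""
    base = '<base href="%s">' % page_url
    return HEAD_OPEN.sub(lambda m: m.group(0) + base, html, count=1)


if __name__ == "__main__":
    page = ('<html><head><title>t</title></head><body>'
            '<img src="/a.png"><img src="pics/b.jpg"></body></html>')
    for url in get_images(page, "http://example.com/dir/page.html"):
        print(url)
```

get_images resolves the paths on the scraper's side, while add_base_tag leaves the markup alone and lets the browser do the resolving, which is the BASE trick described above.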

This whitepaper [pdf] [Google's cached HTML version] discusses several strategies to prevent scraping. Proactive measures include publishing a website policy that forbids scraping, limiting results, DRM, and rendering. Reactive measures include policing by monitoring physical and Internet (IP address) identities.
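As one illustration of monitoring by IP address, here is a rough Python sketch of a per-IP request counter over a sliding time window. It is not taken from the whitepaper; the class name IpRateMonitor, its parameters, and the thresholds are all assumptions chosen for the example.

```python
import time
from collections import defaultdict, deque


class IpRateMonitor:
    """Allow at most max_requests per IP within window_seconds (hypothetical helper)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now=None):
        """Return True if this request is within the limit, False otherwise."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    monitor = IpRateMonitor(max_requests=2, window_seconds=10.0)
    for _ in range(3):
        print(monitor.allow("203.0.113.5"))
```

A site could respond to a False result by throttling, showing a challenge page, or just logging the address for review. Scrapers that rotate IP addresses will slip past a counter like this, so it is one signal among several rather than a complete defence.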

While the whitepaper alludes to rendering, one other possibility is to block access based on the referer page and render content conditionally. A page with valuable information could be guarded by making it reachable only through a landing page: the content is rendered (through server controls in ASP.NET) only after checking that the domain or path of the referer page (fetched through Request.ServerVariables["HTTP_REFERER"]) belongs to the legitimate owner.
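The referer check itself is only a few lines. Below is a language-neutral sketch in Python rather than ASP.NET; the function name referer_allowed and the example host and path are assumptions, and in a real page the referer string would come from the request's Referer header (HTTP_REFERER).

```python
from urllib.parse import urlsplit


def referer_allowed(referer, allowed_host, allowed_path_prefix="/"):
    """Return True only if referer points at the owner's host and path prefix."""
    if not referer:
        return False  # no referer sent: treat as not coming from the landing page
    parts = urlsplit(referer)
    if parts.scheme not in ("http", "https"):
        return False
    # Compare the whole host name, not a substring, so that
    # "www.example.com.evil.net" does not pass as "www.example.com".
    host = (parts.hostname or "").lower()
    if host != allowed_host.lower():
        return False
    return (parts.path or "/").startswith(allowed_path_prefix)


if __name__ == "__main__":
    ok = referer_allowed("http://www.example.com/landing.aspx",
                         "www.example.com", "/landing")
    print("render content" if ok else "redirect to landing page")
```

Keep in mind that the Referer header is supplied by the client, so a determined scraper can forge it. A check like this raises the bar for casual scraping but is best combined with the other measures.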
