HOW TO prevent screen scraping

It is not difficult to screen-scrape web pages and extract specific portions of a page using regular expressions. This custom C# GetImages method fetches all images from a specified web page URL. [BTW, it has a nice trick for showing complete hyperlinks and image file paths where the scraped page uses relative paths, which would otherwise never link back to the original content: it uses the BASE tag to set the document's base URL. The BASE element must be placed inside the HEAD element.]
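To make both ideas concrete, here is a minimal Python sketch. It is not the C# GetImages method itself; the function names get_images and add_base_tag and the two regular expressions are assumptions made for this illustration. It pulls image paths out of the markup with a regular expression, resolves them against the page URL, and shows how a BASE tag could be injected into HEAD.

```python
import html
import re
from urllib.parse import urljoin

# Deliberately simple patterns (assumed for this sketch); a real HTML parser
# copes better with unquoted attributes and other irregular markup.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
HEAD_OPEN = re.compile(r'<head\b[^>]*>', re.IGNORECASE)


def get_images(markup: str, page_url: str) -> list:
    """Return an absolute URL for every quoted <img src=...> in the markup."""
    return [urljoin(page_url, src) for src in IMG_SRC.findall(markup)]


def add_base_tag(markup: str, page_url: str) -> str:
    """Insert <base href=...> right after the opening <head> tag, so that
    relative links in the copied markup resolve against the original page."""
    base = '<base href="%s">' % html.escape(page_url, quote=True)
    return HEAD_OPEN.sub(lambda m: m.group(0) + base, markup, count=1)
```

For example, with page_url set to http://example.com/dir/page.html, a relative path such as pics/a.png resolves to http://example.com/dir/pics/a.png.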

This whitepaper [pdf] [Google's cached HTML version] discusses several strategies to prevent scraping. The proactive measures include posting a website policy that forbids scraping, limiting results, DRM, and rendering. The reactive measures include policing by monitoring physical and Internet (IP address) identities.

While the whitepaper alludes to rendering, one other possibility is to block access based on the referring page and render content conditionally. A page with valuable information could be guarded by making it reachable only through a landing page, and then rendering it (through server controls in ASP.NET) only after checking that the domain or path of the referring page (fetched through Request.ServerVariables["HTTP_REFERER"]) is that of the legitimate owner.
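The referer check can be sketched in a few lines. The example below is in Python rather than ASP.NET, and the host name, the landing path, and the function name referer_is_allowed are all hypothetical values chosen for illustration; in ASP.NET the same string would come from the HTTP_REFERER server variable.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical values for illustration; substitute the site's own host and path.
ALLOWED_HOST = "www.example.com"
LANDING_PATH = "/landing"


def referer_is_allowed(referer: Optional[str]) -> bool:
    """True only if the Referer value names the allowed host and landing path."""
    if not referer:
        return False  # no Referer: the request did not visibly arrive via the landing page
    parts = urlsplit(referer)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    on_landing = parts.path == LANDING_PATH or parts.path.startswith(LANDING_PATH + "/")
    return host == ALLOWED_HOST and on_landing
```

The protected page would render its content only when referer_is_allowed returns True. Keep in mind that the Referer header is supplied by the client, so a determined scraper can forge it; a check like this mainly deters casual scraping.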

Sites can use techniques like “rate-throttling” to prevent crawlers from downloading too many web pages at once. Sites can also use technology like CAPTCHA to test whether a human or a web crawler is requesting the page.
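One common way to implement rate throttling is a sliding-window counter per client. The Python sketch below is only an illustration under that assumption: the class name RateThrottle, its parameters, and the choice of keying on the client's IP address are not taken from any particular product.

```python
import time
from collections import defaultdict, deque


class RateThrottle:
    """Allow at most `limit` requests per client within a sliding `window` of seconds."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable, so the logic can be exercised without sleeping
        self.hits = defaultdict(deque)  # client id (e.g. IP address) -> recent request times

    def allow(self, client):
        now = self.clock()
        recent = self.hits[client]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # over the limit: reject, delay, or fall back to a CAPTCHA
        recent.append(now)
        return True
```

A site might create one instance, for example RateThrottle(limit=30, window=60.0), and call allow(ip) at the start of each request. When it returns False, the site can refuse the request or present a CAPTCHA instead of the page.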
