<body>

Tech Tips, Tricks & Trivia

by 'Anil' Radhakrishna
An architect's notes, experiments, discoveries and annotated bookmarks.



HOW TO prevent screen scraping

It is not difficult to screen-scrape web pages and pull out specific portions of a page with regular expressions. This custom C# GetImages method fetches all images from a specified web page URL. [BTW, it has a nice trick for keeping hyperlinks and image file paths working where the scraped page uses relative paths, which would otherwise never resolve to the original content: it uses the BASE tag to set the document's base URL. The BASE element must appear inside the HEAD element.]
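To make the idea concrete, here is a minimal sketch in Python (not the C# GetImages method mentioned above, whose source is not reproduced here). The function names `get_image_urls` and `add_base_tag` and the sample URLs are made up for illustration. It shows both halves of the trick: pulling image paths out with a regular expression, and either resolving them against the page URL or inserting a BASE tag so the browser resolves them.

```python
import re
from html import escape
from urllib.parse import urljoin

# Matches the src attribute of <img> tags. A regex is a rough tool for HTML
# and will miss unusual markup (unquoted attribute values, for example).
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def get_image_urls(markup, page_url):
    """Return an absolute URL for every <img src=...> found in markup."""
    return [urljoin(page_url, src) for src in IMG_SRC.findall(markup)]


def add_base_tag(markup, page_url):
    """Insert <base href=...> right after the first opening <head> tag."""
    base = '<base href="%s">' % escape(page_url, quote=True)
    # A function replacement avoids backslash handling in the inserted text.
    return re.sub(r'(<head\b[^>]*>)', lambda m: m.group(1) + base, markup,
                  count=1, flags=re.IGNORECASE)


if __name__ == "__main__":
    sample = '<html><head></head><body><img src="img/a.png"></body></html>'
    print(get_image_urls(sample, "http://example.com/blog/post.html"))
    print(add_base_tag(sample, "http://example.com/blog/post.html"))
```

Resolving each path yourself is handy when you only want the list of images; the BASE tag is the lighter option when you redisplay the scraped markup as a whole.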

Some of the strategies that this whitepaper [pdf] [Google's cached HTML version] discusses to prevent scraping include proactive measures, such as publishing a website policy that forbids scraping, limiting results, DRM and rendering, and reactive measures, such as policing by monitoring physical and Internet (IP address) identities.
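As a rough illustration of the reactive side, monitoring by IP address, the sketch below counts recent requests per address and refuses to serve once a threshold is passed. This is my own guess at what such policing could look like in Python, not something taken from the whitepaper; the class name `RequestMonitor` and the limits of 60 requests per 60 seconds are arbitrary assumptions.

```python
from collections import defaultdict, deque


class RequestMonitor:
    """Counts recent requests per IP address within a sliding time window."""

    def __init__(self, max_requests=60, window_seconds=60.0):
        self.max_requests = max_requests      # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self._hits = defaultdict(deque)       # ip -> timestamps of served requests

    def allow(self, ip, now):
        """Record a request at time `now` (in seconds); return False to refuse it."""
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a real site `now` would come from something like `time.monotonic()`; it is passed in here so the behaviour is easy to check by hand.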

While the whitepaper alludes to rendering, one other possibility is to block access based on the referer page and render content conditionally. A page with valuable information could be guarded by making it reachable only through a landing page and letting it render (through server controls in ASP.NET) only after checking that the domain or path of the referer page (fetched through Request.ServerVariables["HTTP_REFERER"]) is that of the legitimate owner.
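The check itself is simple enough to sketch independently of ASP.NET. The Python function below is illustrative only: the name `is_trusted_referer` and its parameters are assumptions, and in an ASP.NET page the `referer` argument would be the value read from Request.ServerVariables["HTTP_REFERER"]. Keep in mind that the Referer header is sent by the client, so it can be omitted or forged; this raises the bar for casual scrapers rather than stopping a determined one.

```python
from urllib.parse import urlsplit


def is_trusted_referer(referer, allowed_host, allowed_path_prefix="/"):
    """Return True when the Referer URL points at the owner's host and path."""
    if not referer:
        return False  # header missing: the page was not reached via the landing page
    try:
        parts = urlsplit(referer)
        host = (parts.hostname or "").lower()
    except ValueError:
        return False  # malformed URL
    # Compare the whole host name, so "example.com.evil.net" does not pass.
    return host == allowed_host.lower() and parts.path.startswith(allowed_path_prefix)


if __name__ == "__main__":
    # Render the guarded content only when the check passes.
    print(is_trusted_referer("http://example.com/landing.aspx", "example.com", "/landing"))
    print(is_trusted_referer("http://example.com.evil.net/landing.aspx", "example.com", "/landing"))
```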
