Computers are supposed to make things easier for us. One simple example is deleting the lines in a text file that don't contain a specific keyword. The task is a no-brainer, but doing it by hand is tedious and very time consuming. Recently I spent some time compiling a list of websites that have copied articles from this blog and published them on their own sites. Although Google does a pretty good job of determining the original publisher, it is still a robot based on a bunch of constantly changing algorithms that can make, and has made, mistakes. Searching by hand for websites that have copied posts from here is very time consuming, so I used Copyscape Premium to run an automated batch scan on all 2000 articles on this website and track down plagiarized copies of the content.
Copyscape Premium finished scanning all 2000 posts in just 10 hours, and I was able to export the results to a CSV file for further investigation. There are over 20,000 URLs in the list, and I want to categorize the websites by domain name. Not every website on the list is a copycat, but most of those hosted on free platforms such as Blogspot/Blogger and WordPress are either scrapers or copy-pasters. Once the URLs are categorized, I can concentrate on filing DMCA complaints with Blogger first and then with WordPress, instead of jumping back and forth.
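To give a rough idea of what this kind of sorting looks like when scripted, here is a minimal Python sketch. It is only an illustration, not the tool covered in this post: the function names `keep_lines_with` and `group_by_platform`, the category labels, and the host-matching rules are all assumptions, and it expects a plain list of URLs rather than the exact layout of the Copyscape export.

```python
from collections import defaultdict
from urllib.parse import urlparse


def keep_lines_with(lines, keyword):
    """Return only the lines that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


def group_by_platform(urls):
    """Group URLs under 'blogger', 'wordpress', or 'other' by host name.

    The matching rules below are assumptions made for this sketch.
    """
    groups = defaultdict(list)
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        # urlparse only finds the host when a scheme is present
        target = url if "://" in url else "http://" + url
        host = (urlparse(target).hostname or "").lower()
        if host == "blogspot.com" or ".blogspot." in host:
            groups["blogger"].append(url)
        elif host.endswith(".wordpress.com"):
            groups["wordpress"].append(url)
        else:
            groups["other"].append(url)
    return dict(groups)
```

Calling `keep_lines_with(lines, "blogspot")` keeps only the lines that mention Blogspot and drops the rest, while `group_by_platform(lines)` returns one list of URLs per platform, so each batch of complaints can be prepared in one go.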





