Can anyone help me with this?
I want to do this project for my academic.
Can someone give me an idea of how to do this?
The rapid growth of the World Wide Web in recent times has given web crawling remarkable significance. The voluminous number of web documents on the web poses huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near-duplicate web documents creates additional overhead for search engines, critically affecting their performance and quality. The detection of duplicate and near-duplicate web pages has long been recognized as a problem in the web crawling research community. It is an important requirement for search engines to provide users with relevant results for their queries on the first page, without duplicate and redundant results. In this paper, we present a novel and efficient approach for the detection of near-duplicate web pages in web crawling. Detection of near-duplicate web pages is carried out before the crawled web pages are stored in repositories. First, keywords are extracted from the crawled pages, and the similarity score between two pages is calculated based on the extracted keywords. Documents with similarity scores greater than a threshold value are considered near duplicates. The detection results in reduced memory for repositories and improved search engine quality.
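To make the abstract's pipeline concrete, here is a minimal Python sketch of keyword extraction, similarity scoring, and thresholding. The abstract does not say how keywords are chosen or which similarity measure is used, so everything specific here is an assumption: the stop-word list, the "most frequent words" keyword rule, the Jaccard measure, the 0.8 threshold, and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}


def extract_keywords(text, top_n=20):
    """Return the top_n most frequent non-stop-words of a page as a set.

    Assumption: "keywords" simply means the most frequent terms.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return {word for word, _ in counts.most_common(top_n)}


def similarity(keywords_a, keywords_b):
    """Jaccard similarity: shared keywords divided by all distinct keywords.

    Assumption: the paper's actual similarity score may be defined differently.
    """
    if not keywords_a and not keywords_b:
        return 1.0  # two pages with no keywords are treated as identical
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Flag two pages as near duplicates when the score exceeds the threshold."""
    score = similarity(extract_keywords(text_a), extract_keywords(text_b))
    return score > threshold
```

A crawler would call `is_near_duplicate` on each newly fetched page against the pages already kept, and skip storing it on a match. Comparing against every stored page gets slow as the repository grows, so this is only a starting point for an academic project.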
Updated 7-Mar-11 9:12am
Sergey Alexandrovich Kryukov 7-Mar-11 13:13pm    
Who is the author of this text? What have you done so far?
Sandeep Mewara 7-Mar-11 13:42pm    
What kind of help are you expecting here?

Posting the problem statement without clarifying or showing your code simply means you want code. Is that so? If not, elaborate a little on what you are looking for; it would help others to answer you.
Kishore Jangid 7-Mar-11 15:38pm    
I am asking is this project worth doing
Smithers-Jones 7-Mar-11 16:08pm    
First of all:
"I want to do this project for my academic." != "I am asking is this project worth doing."

What do you expect people to answer? Short answer: no. Long answer: yes. How would anybody else know whether it's worth doing? You have to decide for yourself, depending on your skills, interests...

1 solution

If you are waiting for permission, feel free to get started. I don't mind, even though I think it is a bit dull, and about as useful in the real world as a chocolate fire-guard.

If you are waiting for volunteers to write the code for you, then that would be cheating. And you wouldn't do that, would you?
Kishore Jangid 7-Mar-11 15:38pm    
I am asking: is this title worth doing?
Gonzoox 7-Mar-11 16:35pm    
Of course it is worth doing, if you have the time. You will have to create a very powerful algorithm capable of detecting the differences between pages, and then match all the information you have (which you'll need to keep in a huge database) against the user's query, based on the relevance of the search and the information presented in the page.
Google, Bing, Yahoo and others have something like this, and their algorithms are far more advanced. For a school project, a web crawler with a simple match can get you the grades you need, but it will still require a lot of time. For something more advanced, you will need time and resources if you want to compete against those monsters called Google or Bing.
Kishore Jangid 7-Mar-11 18:34pm    
How about using the code below, or version 3?
They have a good algorithm to use, but I am confused by the various versions and files they have for a single technique.
Initially they used SearcharooCrawler, then SearcharooSpider_alpha, and then SearcharooSpider.aspx.
But I didn't understand whether they are really eliminating duplicate pages or not, and it didn't work for certain websites.
They used the browser's cache, but I want to use SQL Server 2005 instead.
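For the database side, one way to be sure duplicates are eliminated is to check each page before it is written to the repository. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for SQL Server 2005, the table layout and the names `open_repository`, `content_hash`, and `store_if_new` are hypothetical, and it is not how Searcharoo works. It catches only exact duplicates (after lowercasing and collapsing whitespace); near duplicates would need a similarity comparison on top.

```python
import hashlib
import sqlite3


def content_hash(text):
    """Hash the page body after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_repository():
    """Create an in-memory repository (stand-in for a real database server)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (hash TEXT PRIMARY KEY, url TEXT, body TEXT)")
    return conn


def store_if_new(conn, url, body):
    """Store the page unless an identical body is already in the repository.

    Returns True if the page was stored, False if it was a duplicate.
    """
    digest = content_hash(body)
    row = conn.execute("SELECT 1 FROM pages WHERE hash = ?", (digest,)).fetchone()
    if row is not None:
        return False  # same content already stored under some URL
    conn.execute(
        "INSERT INTO pages (hash, url, body) VALUES (?, ?, ?)",
        (digest, url, body),
    )
    conn.commit()
    return True
```

Keying the table on the content hash rather than the URL means the same page reached through two different addresses is stored once.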

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
