Breadth-First Based Web Crawling Application
May Phyu Htun
Computer University (Mandalay)
[email protected]
Abstract
The large size and the dynamic nature of the Web highlight the need for continuous support and updating of Web-based information retrieval systems. Crawlers facilitate this process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. Traversing the web graph in breadth-first search order is a good crawling strategy. This system is intended to study crawling infrastructure and the basic concepts of Web crawling. A web crawler application is then implemented using the breadth-first search technique. Breadth-first crawling checks each link on a page before proceeding to the next page. Thus, it crawls each link on the first page, then each link on the first page's first link, and so on, until each level of links has been exhausted. While crawling the links of a URL address, the local HTML web pages are saved in a folder in MHTML format (Single File Web Page).
Introduction
The Web is a very large collection of pages, and search engines serve as the primary discovery mechanism for its content. To provide the search functionality, search engines use crawlers that automatically follow links to web pages and extract their content. Web crawlers are programs that exploit the graph structure of the Web to move from page to page. In their infancy such programs were also called wanderers, robots, spiders, fish, and worms, words that are quite evocative of Web imagery. Crawling can be viewed as a graph search problem: the Web is seen as a large graph with pages as its nodes and hyperlinks as its edges. A web crawler moves from node to node by means of the hyperlinks that each node contains and that define the edges of the web graph. Therefore, transformed versions of many graph-search algorithms can frequently be observed in web crawling. Traversing the web graph in breadth-first search order is a good crawling strategy, as it tends to discover high-quality pages early in the crawl. In its simplest form, a crawler starts from a seed page and then uses the external links within it to attend to other pages. The process repeats with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher-level objective is reached. There is a continual need for crawlers to help applications stay current as new pages are added and old ones are deleted or moved.

When a web crawler is given a set of starting URLs, it downloads the corresponding documents. If a document is saved as a Web page, a folder needs to be created that contains an .htm file and all supporting files, such as images, sound files, cascading style sheets, scripts, and more. A presentation is saved as a Web page when it is to be edited with FrontPage or another HTML editor, and then posted to an...
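As a concrete illustration of the crawling process described above, the following sketch shows a breadth-first crawl using only the Python standard library. A FIFO queue holds discovered URLs, so every link found on one page is fetched before links found at deeper levels, and each downloaded page is stored locally. The seed URL, page limit, and plain-.html output folder are illustrative assumptions; the system described in this paper saves pages in MHTML format instead.

```python
# A minimal breadth-first crawler sketch (standard library only).
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from html.parser import HTMLParser
import os

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_bfs(seed_url, max_pages=50, out_dir="pages"):
    os.makedirs(out_dir, exist_ok=True)
    queue = deque([seed_url])          # FIFO queue gives breadth-first order
    visited = {seed_url}
    count = 0
    while queue and count < max_pages:
        url = queue.popleft()          # take the oldest discovered URL first
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                   # skip pages that cannot be fetched
        # Save the raw HTML locally (the described system stores MHTML instead).
        filename = os.path.join(out_dir, f"page_{count}.html")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(html)
        count += 1
        # Enqueue unseen links so the current level is fully visited
        # before any link discovered at a deeper level.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).scheme in ("http", "https") and absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)

if __name__ == "__main__":
    crawl_bfs("https://example.com", max_pages=10)
```

Replacing the FIFO queue with a stack would turn this into a depth-first crawl; the breadth-first ordering is what lets the crawler exhaust each level of links before moving deeper, as described above.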