Web Mining Essay

In this paper, we present an overview of research issues in web mining. The World Wide Web has turned out to be one of the largest information sources. It is a heterogeneous, explosive, dynamic and mostly unstructured data repository. Some companies use the Web to find out more about their competition, while users want efficient search tools to find significant information easily. All of them expect tools or techniques that help them satisfy their demands and/or solve the problems they encounter on the web. Therefore, web intelligence is required to help organizations in decision making and to help users find relevant information. Web mining with respect to web data is referred to here as web data mining. In particular, our focus is on web data mining research in the context of our web warehousing project called WHOWEDA (Warehouse of Web Data). We have categorized web data mining into three areas: web content mining, web structure mining and web usage mining. We highlight and discuss various research issues involved in each of these web data mining categories. We believe that web data mining will be a topic of exploratory research in the near future.

1 Introduction

The advent of the World Wide Web has caused a dramatic increase in the usage of the Internet. The World Wide Web is a broadcast medium where a wide range of information can be obtained at a low cost. Information on the WWW is important not only to individual users but also to business organizations, especially where critical decision making is concerned. Most users obtain WWW information using a combination of search engines and browsers; however, these two types of retrieval mechanisms do not necessarily address all of a user’s information needs. This is particularly true of business organizations, which currently lack suitable tools to systematically harness strategic information from the web and analyze these data to discover useful knowledge that supports decision making. A recent study provides a comprehensive and comparative evaluation of the most popular search engines [1]. A more recent survey of web query processing has appeared in [23].

The resulting growth in on-line information, combined with the almost unstructured web data, necessitates the development of powerful yet computationally efficient web data mining tools. Web data mining can be defined as the discovery and analysis of useful information from WWW data. The web involves three types of data: data on the WWW, web log data regarding the users who browsed the web pages, and web structure data. Thus, WWW data mining should focus on three issues: web structure mining, web content mining [8] and web usage mining [2,10,13]. Web structure mining involves mining the structures and links of web documents. In [24], some insight is given on mining structural information on the web. Our initial study [5] has shown that web structure mining is very useful in generating information such as visible web documents, luminous web documents and luminous paths (a luminous path is a path common to most of the results returned). In this paper, we discuss some applications in web data mining and E-commerce where these types of knowledge can be used. Web content mining describes the automatic search of information resources available on-line. Web usage mining covers data from server access logs, user registrations or profiles, user sessions or transactions, etc. A survey of some of the emerging tools and techniques for web usage mining is given in [2]. In our discussion here, we focus on the research issues in web data mining with respect to the web warehousing project called WHOWEDA (Warehouse of Web Data).

The key objective of WHOWEDA, at the Centre for Advanced Information Systems in Nanyang Technological University, Singapore, is to design and implement a web warehouse that materializes and manages useful information from the web to support strategic decision making. We are building a web warehouse [7] using a database approach: the warehouse contains strategic information coupled from the web and may also inter-operate with conventional data warehouses. One of the important areas of our work involves the development of techniques for mining useful information from the web. We plan to integrate WHOWEDA with intelligent tools for information retrieval and to extend data mining techniques to provide a higher level of data organization for the unstructured data available on the web.

With respect to our web data mining approach, we argue that extracting information from a very small subset of all HTML web pages is also an instance of web data mining. In WHOWEDA, we focus on mining a subset of web pages stored in one or more web tables because we believe that, due to the complexity and vastness of the web, mining information from a subset of the web stored in web tables is a more feasible option. Our web warehousing approach allows us to do this effectively, as we materialize only the results returned in response to a user's query graph.

2 WHOWEDA

In WHOWEDA, we introduced our web data model. It consists of a hierarchy of web objects. The fundamental objects are Nodes and Links, where nodes correspond to HTML text documents and links correspond to hyper-links interconnecting the documents in the WWW. These objects consist of a set of attributes as follows: Node = [url, title, format, size, date, text] and Link = [source-url, target-url, label, link-type]. In our web warehouse, the Web Information Coupling System (WICS) [9] is a database system for managing and manipulating coupled information extracted from the Web. We have defined a set of coupling operators to manipulate the web tables and correlate additional useful and related information [9].
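
The two fundamental objects can be pictured as plain records. The following is only a minimal sketch of the attribute lists given above: the Python field types, the sample values and the link-type value "local" are assumptions made for illustration and are not taken from the WHOWEDA implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """An HTML document; fields mirror Node = [url, title, format, size, date, text]."""
    url: str
    title: str
    format: str
    size: int   # assumed here to be a size in bytes
    date: str   # assumed here to be a date string
    text: str


@dataclass(frozen=True)
class Link:
    """A hyper-link; fields mirror Link = [source-url, target-url, label, link-type]."""
    source_url: str
    target_url: str
    label: str
    link_type: str


# Hypothetical sample data, for illustration only.
home = Node(url="http://example.org/", title="Home", format="html",
            size=1024, date="1998-01-01", text="Welcome to the department")
pubs = Node(url="http://example.org/pubs.html", title="Publications", format="html",
            size=2048, date="1998-01-02", text="Papers on web mining")
link = Link(source_url=home.url, target_url=pubs.url,
            label="publications", link_type="local")

print(link)
```

A link refers to its end points by URL rather than by object reference, which matches the attribute list above and keeps each record self-describing.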

We materialize web data as web tuples stored in web tables. Web tuples, representing directed connected graphs, are composed of web objects (Nodes and Links). We associate with each web table a web schema, which contains the meta-data that binds a set of web tuples to the web table in the form of connectivities and predicates defined on node and link variables. Connectivities represent structural properties of web tuples by describing possible paths between node variables. Predicates, on the other hand, specify the additional conditions that each tuple must satisfy to be included in the web table. In WICS, a user expresses a web query in the form of a query graph consisting of nodes and links that represent web documents and the hyperlinks in those documents, respectively. Each of these nodes and links can have keywords imposed on it, so that it represents those web documents and/or hyperlinks that contain the given keywords. When the query graph is posted over the WWW, a set of web tuples, each satisfying the query graph, is harnessed from the WWW and returned as the result. Thus, the web schema of a table resembles the query graph used to derive the web tuples stored in that web table. Note that some nodes and links in the query graph may not have keywords imposed on them. They are called unbound nodes and links, respectively.
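
The relationship between a web schema and the web tuples it admits can be sketched in a few lines. This is an illustrative sketch, not the WICS implementation: the names WebSchema and satisfies are hypothetical, Node and Link are reduced to the fields needed here, and each connectivity is assumed to be a single hyperlink between two node variables, whereas the text allows connectivities to describe paths more generally.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    url: str
    text: str = ""


@dataclass
class Link:
    source_url: str
    target_url: str
    label: str = ""


@dataclass
class WebSchema:
    node_vars: List[str]
    link_vars: List[str]
    # Each connectivity is (source node variable, link variable, target node variable).
    connectivities: List[Tuple[str, str, str]]
    # Variable name -> predicate; a variable with no entry is unbound.
    predicates: Dict[str, Callable[[object], bool]] = field(default_factory=dict)


def satisfies(schema: WebSchema, nodes: Dict[str, Node], links: Dict[str, Link]) -> bool:
    """Check whether a candidate web tuple (variable bindings) satisfies the schema."""
    # Structural check: every connectivity must be backed by a matching hyperlink.
    for src, lnk, dst in schema.connectivities:
        link = links[lnk]
        if link.source_url != nodes[src].url or link.target_url != nodes[dst].url:
            return False
    # Predicate check: every bound variable must pass its predicate.
    bound = {**nodes, **links}
    return all(pred(bound[var]) for var, pred in schema.predicates.items())


# Hypothetical query graph: node a links to node b; only b has a keyword imposed.
schema = WebSchema(
    node_vars=["a", "b"],
    link_vars=["f"],
    connectivities=[("a", "f", "b")],
    predicates={"b": lambda n: "mining" in n.text.lower()},
)
a = Node("http://example.org/", "Department home page")
b = Node("http://example.org/pubs.html", "Papers on web mining")
f = Link(a.url, b.url, "publications")

print(satisfies(schema, {"a": a, "b": b}, {"f": f}))
```

In this sketch, node a and link f are unbound in the sense used above: they take part in the connectivity but carry no predicate, so any document and any hyperlink with the right end points can fill them.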

Consider a query to find all data mining related publications by the computer science faculty at Stanford University, starting from a specified web page. The query may be expressed as follows:

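[Figure: query graph over the node variables x, y and z, connected by an unbound link and a link e, carrying the keywords "AI or database", "publications" and "data mining".]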
The above query graph is assigned as the schema of the web table generated in response to the above query. The schema corresponding to the query graph can be formally expressed as a 4-tuple (Xn, Xl, C, P), where Xn is the set of node variables (x, y and z in the example above); Xl is the set of link variables (the unbound link, written -, and e in the example); C is the set of connectivities k1 and k2, where k1 = xy and k2 = yz; and P is the set of predicates p1, p2, p3 and p4, such that p1(x) = [x.url EQUALS http://www.cs...
