
A Tree Partitioning Method for Memory Management in Association Rule Mining

Shakil Ahmed, Frans Coenen, and Paul Leng
Department of Computer Science, The University of Liverpool, Liverpool L69 3BX, UK
{shakil,frans,phl}@csc.liv.ac.uk

Abstract. All methods of association rule mining require that the frequent sets of items, those that occur together sufficiently often to be the basis of potentially interesting rules, be computed first. The cost of this computation increases in proportion to the size of the database, and also with its density. Densely populated databases can give rise to very large numbers of candidates that must be counted. Both factors cause performance problems, especially when the data structures involved become too large for primary memory. In this paper we describe a method of partitioning that organises the data into tree structures that can be processed independently. We present experimental results showing that the method scales well for increasing dimensions of data, and performs significantly better than alternatives, especially when dealing with dense data and low support thresholds.

1 Introduction

The most computationally demanding aspect of Association Rule Mining is identifying the frequent sets of attribute-values, or items, whose support (occurrence in the data) exceeds some threshold. The problem arises because the number of possible sets is exponential in the number of items. For this reason, almost all methods attempt to count the support only of candidate itemsets that are identified as possible frequent sets. It is, of course, not possible to determine the candidate itemsets completely in advance, so many itemsets that are not in fact frequent must also be considered. Most algorithms involve several passes of the source data, in each of which the support for some set of candidate itemsets is counted.

The performance of these methods clearly depends both on the size of the original database and on the number of candidates being considered. The number of possible candidates increases with increasing density of data (a greater number of items present in a record) and with decreasing support thresholds. In applications such as medical epidemiology, where we may be searching for rules that associate rather rare items within quite densely populated data, the low support thresholds required may lead to very large candidate sets. These factors motivate a continuing search for efficient algorithms. Performance will be affected especially if the magnitudes involved make it impossible for the algorithm to proceed entirely within primary memory. In these cases, some strategy for partitioning the data may be required to enable algorithmic stages to be carried out on primary-memory-resident data.

In this paper we examine methods of partitioning to limit the total primary memory requirement, including that required both for the source data and for the candidate sets. We describe a new method of partitioning that exploits tree structures we have previously developed for Association Rule Mining. Experimental results are presented that show this method offers significant performance advantages.
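To make the counting task concrete, the following minimal Python sketch (ours, not the paper's method) shows a single support-counting pass: given the current candidates, one scan of the data counts how many records contain each candidate. The function name and toy data are illustrative.

def count_support(transactions, candidates):
    # One pass over the data: count how many records contain each
    # candidate itemset (a candidate is "contained" if every one of
    # its items appears in the record).
    counts = {c: 0 for c in candidates}
    for record in transactions:
        record = set(record)
        for c in candidates:
            if set(c) <= record:
                counts[c] += 1
    return counts

transactions = [("a", "b", "c"), ("a", "c"), ("b", "c", "d"), ("a", "b", "c", "d")]
candidates = [frozenset("ab"), frozenset("ac"), frozenset("cd")]
print(count_support(transactions, candidates))
# with a support threshold of 3, only {a, c} is frequent here

With n items there are 2^n - 1 possible non-empty itemsets, which is why restricting each scan to a candidate set, rather than enumerating all itemsets, is essential.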

2 Background

Most methods for finding frequent sets are based, to a greater or lesser extent, on the "Apriori" algorithm [2]. Apriori performs repeated passes of the database, successively computing support-counts for sets of single items, pairs, triplets, and so on. At the end of each pass, sets that fail to reach the required support threshold are eliminated, and candidates for the next pass are constructed as supersets of the remaining (frequent) sets. Since no set can be frequent if any of its subsets is infrequent, this procedure guarantees that all frequent sets will be found (a sketch of the level-wise loop is given at the end of this section).

A problem of Apriori is that it requires the source data to be scanned repeatedly, which is especially costly if the data cannot be contained in primary memory. Attempts to reduce this cost include the "Partition" algorithm [12], which partitions the data into a number of equal-sized segments of manageable size, and the strategy introduced by Toivonen [13], which first processes a small random sample of the data to identify candidates for counting across the full database. The drawback of both these approaches points to the second weakness of Apriori: the number of candidates whose support must be counted may become very large. Both methods increase the size of the candidate set, and both require all candidates to be retained in primary memory (for efficient processing) during the final database pass.

Other methods [3] [4] [1] [14] aim to identify maximal frequent sets without first examining all their subsets. These algorithms may cope better with densely populated databases and long patterns than those described above, but again usually involve multiple database passes. The DepthProject algorithm [1] bypasses the problem by explicitly targeting memory-resident data. The method of Zaki et al. [14] partitions candidate sets into clusters that can be processed independently. The problem with this method is that, especially when dealing with dense data and low support thresholds, expensive pre-processing is required before effective clustering can be identified. The partitioning by equivalence class, however, is relevant to the methods we will describe. Our methods begin by p...
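As a concrete rendering of the level-wise procedure described above, the following Python sketch (an illustration under our own naming, not the paper's tree-based method) counts candidates with one full scan per pass and prunes by the Apriori property.

from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise search: pass k counts candidate k-itemsets, keeps those
    # meeting min_support (an absolute count), and joins survivors into
    # (k+1)-candidates, pruning any with an infrequent k-subset.
    transactions = [frozenset(t) for t in transactions]
    candidates = list({frozenset([i]) for t in transactions for i in t})
    frequent, k = {}, 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in level})
        candidates = [u for u in {a | b for a in level for b in level
                                  if len(a | b) == k + 1}
                      if all(frozenset(s) in level for s in combinations(u, k))]
        k += 1
    return frequent

Each iteration of the while loop is one database pass; it is this repeated scanning, together with the growth of the candidate set on dense data, that the strategies discussed above attempt to contain.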
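The two-phase idea behind the Partition algorithm [12] can be sketched in the same vocabulary (again illustrative, reusing the count_support and apriori sketches above; the proportional local threshold is our simplification):

def partition_mine(transactions, min_support, n_segments):
    # Phase 1: mine each memory-sized segment locally. Any globally
    # frequent set must be locally frequent in at least one segment,
    # so the union of the local results is a superset of the true answer.
    size = -(-len(transactions) // n_segments)   # ceiling division
    candidates = set()
    for start in range(0, len(transactions), size):
        segment = transactions[start:start + size]
        local_min = max(1, min_support * len(segment) // len(transactions))
        candidates |= set(apriori(segment, local_min))
    # Phase 2: one further pass over the whole database verifies the
    # candidates; this is where all of them must be held in memory.
    counts = count_support(transactions, candidates)
    return {c: n for c, n in counts.items() if n >= min_support}

The sketch makes the weakness noted above visible: the union formed in phase 1 can be far larger than the final answer, and all of it must be retained in primary memory for the phase 2 pass.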

reflect relat relev remain reorder repeat replic report repres represent requir research resid respect restructur result retain rich rise rochest root rule run sampl santiago savaser scale scan scatter scienc search sec second secondari seem seen segment separ septemb sequenc server servic set set-enumer sever shakil show shown sigmod signific similar simpl simpler sinc singl size slight slower slowli small smaller sourc spars springer springer-verlag srikant stage standard start step still storag store strategi structur subsequ subset substanti subtre succeed success suffici suitabl summat superset support support-count support-threshold support-tot suppos swizerland symposium synthet system t-tree t10.i5 t10.i5.n500 t14.i7 t18.i9 t22.i11 take taken target technic tfp third three threshold throughout time togeth toivonen total tp transact travers tree tree-construct tree-partit trend triplet turn twice two uk unaccept union univers unpartit upward us use usual v valu vari various verlag vldb w way weak well wherea whole whose within without work workshop would written x xviii y yin zaian zaki zurich