Vertical Data Management and Mining
INTRODUCTION
Scalable Data Mining
The explosion of machine-collected data technologies, such as barcode and RFID tag scanners in commercial domains, sensors in scientific and industrial domains, and telescopes and Earth Observing Systems in the aerospace domain, is adding tremendous volumes to the already huge amounts of data available in digital form. In the near future, sensor networks in battlefields, agricultural fields, manufacturing plants and meteorological domains will only exacerbate this data overload. This explosive growth in data and databases generates the need for new techniques and tools that can intelligently and automatically transform the data into useful information and knowledge. Data mining is one such technique.
Data mining or knowledge discovery in databases (KDD), aims at the discovery of useful patterns from large data volumes. Data mining is becoming much more important as the number of databases and database size keeps growing. Researchers and developers in many different fields have shown great interest in data mining.
Data mining has two kinds of scalability issues: row (or database size) scalability and column (or dimension) scalability [HK01]. The row-scalability problem is sometimes referred to as "the curse of cardinality" and the column-scalability problem as "the curse of dimensionality". A data mining system is considered (linearly) row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the same data mining queries. A data mining system is considered (linearly) column scalable if the mining query execution time increases linearly with the number of columns (or attributes or dimensions).
The use of traditional horizontal database structures (files of horizontally structured records) and traditional scan-based data processing approaches (scanning files of horizontal records) is known to be inadequate for knowledge discovery in very large data repositories [HK01, HPY00, SAM96]. This presentation addresses the scalability and efficiency issues in data mining by considering the alternative: vertical database technology.
In vertical databases, the data in each table, file or relation is vertically partitioned (projected) into a collection of separate files, one for each column or even one for each bit position of each (numeric) column. Such vertical partitioning requires that the original match-up of values be retained in some way, so that the "horizontal" record information is not lost. In our approach, the horizontal match-up information is retained by maintaining a consistent ordering or tree positioning of the values relative to one another. If we consider a list to be a 0-dimensional tree, then we can speak in terms of tree-positioning only.
We partition all data tables into individual vertical attribute files, and then, for numeric attribute domains, further into individual bit-position files. For non-numeric attribute domains, such as categorical attribute domains, we either code them numerically or construct separate, individual, vertical bitmaps for each category. If the categorical domain is hierarchical, we simply use composite bitmaps to accommodate the higher levels in that concept hierarchy.
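The bit-position decomposition described above can be sketched as follows. This is a minimal illustration, not the actual implementation; the function names and the most-significant-bit-first ordering are our assumptions:

```python
def vertical_bit_slices(column, n_bits):
    """Vertically partition a numeric column into one bit vector per
    bit position (most significant bit first). Row order is preserved,
    so the original "horizontal" record can always be reassembled."""
    return [[(value >> i) & 1 for value in column]
            for i in reversed(range(n_bits))]

def reassemble(slices):
    """Recover the original column from its bit-position files."""
    n_bits = len(slices)
    return [sum(bit << (n_bits - 1 - k) for k, bit in enumerate(bits))
            for bits in zip(*slices)]
```

Because each slice keeps the rows in their original order, the partitioning is lossless: `reassemble(vertical_bit_slices(col, n))` returns `col`.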
The first issue we will deal with is that data mining almost always expects just one table of data. Although Inductive Logic Programming researchers have attempted to deal with multi-table or multi-relational data directly, we argue that this approach has inherent shortcomings. Our approach is to combine the multiple tables or relations into one first and then data-mine the resulting "universal" table. However, any such approach would only exacerbate the curse of cardinality (and, to some extent, the curse of dimensionality) if applied directly, that is, by first joining the multiple tables into one massively large table and then vertically partitioning it.
Our approach is to convert the set of compressed, lossless, vertical tree structures (P-trees) representing the original multiple tables directly into a set of compressed, lossless, vertical tree structures representing the universal relation, without ever actually joining the tables. Since the resulting P-trees are compressed, this ameliorates the curse of cardinality to a great extent.
As to the curse of dimensionality: aside from domain-knowledge-based and analytical (e.g., Principal Component Analysis) dimension-reduction methods, there is no way to relieve it. In some real sense it is not a curse but a fact, if the information is genuinely spread across all dimensions.
A General Framework for Data Mining
First, an introduction: what is data mining, roughly?
Data mining, in its most restricted form, can be broken down into three general methodologies for extracting information and knowledge from data: Rule Mining, Classification and Clustering. To have a unified context in which to discuss these three methodologies, let us assume that the "data" is in one relation, R(A1,…,An) (a universal relation, unnormalized), which can be thought of as a subset of the product of the attribute domains, D1 × … × Dn.
Rule Mining is a matter of discovering strong antecedent-consequent relationships among subsets of the columns, based on the evidence in a given database state (called the training table here).
Classification is a matter of discovering signatures for the values in an attribute, called the class label attribute (which may be a composite attribute), from the values of the other attributes (some or all of them) in the training table.
Clustering is a matter of using some notion of tuple similarity (or dissimilarity, or distance) to group training table rows so that within a group (a cluster) there is high similarity and across groups there is low similarity.
The above definitions will be expanded upon greatly in this document.
In general, the training table will be considered to be the universal relation of all relevant attributes for the analysis at hand. Alternatively, the training table can be decomposed into many tables to achieve any of the various normal forms, or it can be decomposed into a star, snowflake or constellation model. These are all relational models (Universal Relational (UR), Normal Form Relational (NFR), Multi-Dimensional Relational (MDR)). We will start with the MDR model.
Given a training table, R, one can distinguish those attributes which are entity keys, K1,…,Kk, i.e., each is a (possibly composite) attribute which minimally and uniquely identifies instances of the entity for all time (not just for the current state). In addition to the key attributes, there are feature attributes for each of the keyed entities. The feature attributes for entity key Ki will be denoted Ai,1,…,Ai,ni.
We assume there is one central fact to be analyzed. A star model can be developed to model the data (alternatively, a constellation model if there are multiple facts to be analyzed) as:
The k-dimensional hypercube of key vectors, together with any features of those key vectors (called measurements), forms the center of the star.
Each entity and its features (features that refer only to that entity) can be factored out into a separate dimension table as a point of the star.
Any remaining features, which refer to (describe) some combination of the entity keys but not the entire set of them (more than one but fewer than k), are placed with the cuboid composed of that set of keys (key subvector space). With the addition of these subvector or subfact spaces, the picture no longer resembles a star but is a lattice of cuboids with feature attributes for each cuboid. In the example below, the central fact file (star center) is the 3-dimensional base cuboid for the fact, sale, with keys for each of the three dimensions of a sale: the product sold, the date of the sale, and the country in which the sale took place. The lowest (3-D) level and the second (1-D) level of this lattice could be made to look like a star (distributing the red links evenly at angles of 120 degrees); however, there are three 2-dimensional subfact cuboids, product-date, product-country, and date-country, which may have feature attributes themselves. For example, date-country sale instances may have a "holiday season" attribute whose value depends upon the country and the date but not the product.
Moving up this lattice of cuboids one level is called "Rolling Up", and it eliminates one or more dimensions. Rolling up an entire key dimension to the top of its semantic hierarchy leaves it with one value, the entire domain, and therefore eliminates it along with its features. This is a schema-level rollup in the sense that it can be defined entirely at the schema (intensional) level and need not involve the instance (extensional) level.
The final issue to be considered (to further complicate the picture but capture the last remaining concept) is attribute value granularity (domain semantic hierarchy level). The domain of each attribute in a cuboid, whether structural (key) or descriptive (feature), has associated with it a semantic hierarchy (ontology), which can be thought of as a sub-POSet of the PowerSet POSet of the domain itself (the set of all subsets, with set containment as the order relation). To model these semantic hierarchies, one can view the above lattice of cuboids (with all feature attributes included) as the bottom of an ontological hierarchy (OH). As noted previously, moving up the standard (pictured) lattice of cuboids is "rolling up" an entire key dimension to the top of its semantic hierarchy, where it has one value, the entire domain, and therefore is eliminated along with its features; this is a schema-level rollup. However, one can partially roll up (or down) any or all key attributes (e.g., date can be rolled up from month to quarter, or down from month to day, along the dashed red link). This is an extension-level rollup on keys. Finally, one can think of projecting off a feature attribute as a complete (schema-level) rollup of that attribute, eliminating the information it holds (and therefore the need for it) completely. Before we leave this preliminary discussion of the OLAP operator, Roll-Up, we note that rollup can be done in all the contexts above in many ways (using many different aggregates or "rollup actions"). Projection is an aggregation-function-free rollup (the attribute is eliminated completely). One can think of an OLAP slice as another example of an aggregation-function-free rollup.
Rollups can involve central tendency operators (e.g., mean, median, mode, midrange), true aggregations (e.g., sum, avg, min, max, count), and measures of dispersion (e.g., quartile operators, measures of outlierness, variance). Each feature attribute can be extensionlevel rolled up or down within its semantic hierarchy (ontology). We will assume a level of granularity has been fixed for each attribute and that the universal relation has been formed for all entities and all attributes involved in the analysis in question. However, the structure we have described makes standard OnLine Analytic Processing (OLAP) much more complex.
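As a concrete sketch of an aggregating rollup, the following illustrative function rolls a base cuboid up to a key subvector cuboid; the dict-based fact rows, the measure name `qty`, and the default `sum` aggregate are our assumptions, not part of the original model:

```python
from collections import defaultdict

def roll_up(fact_rows, keep_keys, measure="qty", agg=sum):
    """Roll up a fact table (a list of dicts) by eliminating every key
    dimension not listed in keep_keys, combining the measure values of
    the collapsed rows with the given aggregate operator."""
    groups = defaultdict(list)
    for row in fact_rows:
        groups[tuple(row[k] for k in keep_keys)].append(row[measure])
    return {key: agg(values) for key, values in groups.items()}
```

For example, rolling the (product, date, country) sale cuboid up to (product, country) sums the quantities over all dates; passing a different `agg` (min, max, a mean function, etc.) gives the other rollup actions mentioned above.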
Next we note that a functional dependency is a schema-level (intensional) association rule, whereas an association rule is a (fuzzy) extension-level rule. One can view an association rule as a fuzzy functional relationship (within some level of confidence, e.g., 80% confidence) relating a value in an antecedent (possibly composite) attribute to a value in a consequent attribute.
If all attributes are binary, this definition corresponds to the definition given in so-called Market Basket Research (MBR). If attributes are categorical, they are usually bitmapped so that each resulting category is a binary attribute. If, however, attributes are numerical (a vector space setting), then one has to consider an association rule as associating a subset of the antecedent attribute with a subset of the consequent attribute, up to a level of confidence. For a composite attribute, these subsets are products of subsets of the individual attributes. How does this definition fit with the statement just made, "One can view an association rule as a fuzzy functional relationship relating a value in an antecedent attribute to a value in a consequent attribute at a level of confidence"? By moving up the antecedent and consequent concept hierarchies to any level at which both sets are points, the definition holds. Of course, if one then rolls down from there, the set definition is necessary again (where an item is an attribute-domain-subset pair).
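The confidence of such a rule can be sketched as the fraction of antecedent-satisfying rows that also satisfy the consequent. The predicate-based interface below is our illustrative choice; it covers the binary, categorical and subset-based cases uniformly:

```python
def confidence(rows, antecedent, consequent):
    """Confidence of the rule antecedent => consequent over a training
    table: support(antecedent and consequent) / support(antecedent).
    antecedent and consequent are Boolean predicates over a row, so a
    "value" or a "domain subset" condition is expressed the same way."""
    matches = [row for row in rows if antecedent(row)]
    if not matches:
        return 0.0
    return sum(1 for row in matches if consequent(row)) / len(matches)
```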
So now we have related association rules and functional dependencies (they are the same except that ARs are extensional and FDs are intensional). Also, clearly, clustering and classification are both partitionings of the entire training set, where classification is partitioning with respect to the values in some particular attribute column (called the class label attribute) and clustering is more general. In clustering we have a similarity function (or just a similarity notion) and we partition the set into clusters that are internally very similar but dissimilar across clusters. If one defines a binary similarity function in which two tuples are similar iff they have the same class and dissimilar if they do not, then clustering becomes classification with respect to that similarity measure.
One can note that one should always place a feature attribute with the highest cuboid to which it applies. Starting with the universal training table, for a given feature, one determines the highest cuboid for which that feature is invariant, that is, for which it is a slice in the sense that each of its values is fully determined by one cuboid key subvector value. Then, clearly, the redundant replication of this one value is unnecessary and the feature can be rolled up (by simply projecting the single value) to that cuboid. Later we will discuss more complex rollup operators that use aggregation operators; here, since the feature value is the same all along the rollup, no aggregate operator is necessary.
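The invariance test just described amounts to checking a functional dependency from the key subvector to the feature. A minimal sketch, assuming the same hypothetical dict-based row representation as above:

```python
def is_invariant(rows, key_subvector, feature):
    """True iff the feature's value is fully determined by the key
    subvector, i.e., no two rows agree on the subvector but disagree
    on the feature. If so, the feature can be factored up to the
    cuboid of that key subvector without any loss of information."""
    seen = {}
    for row in rows:
        key = tuple(row[k] for k in key_subvector)
        if seen.setdefault(key, row[feature]) != row[feature]:
            return False
    return True
```

For the earlier example, a "holiday season" feature would pass this test for the (date, country) subvector but fail it for date or country alone.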
In the full lattice of cuboids, the full central fact cube of all the keys is called the base cuboid (at the bottom of the lattice), and the degenerate fact consisting of none of the keys (at the top of the lattice, just above the 1-key dimension tables) is called the apex cuboid. An example with three entities, Genes, Organisms and Experiments (the GEO star schema and cuboid lattice), is given below.
Finally, we point out that there is a semantic hierarchy for each entity and that the rollup can be to any level in that hierarchy, not necessarily all the way to the top (at which the entity disappears entirely). E.g., rather than rolling up the full GEO cube to the GE cube along the ORGANISM dimension (eliminating organism entirely), one could partially roll up GEO along the Organism concept hierarchy to, say, Phylum.
Thus there is a massive lattice of cuboids hidden here.
[Figure: a rule's support set in (A1, A2, C1)-space, shown in gray with a portion of the consequent cut away to expose the support set and its antecedent set.]
Note that the consequent set is of little consequence other than to identify a particular subregion of the antecedent where, for example in the search for confident rules, the antecedent points are very populous (bunch up). That is to say, when doing confident rule mining one is searching for boxes in which the member points "bunch up" with respect to the other dimension(s), as shown in the figure. For fixed-consequent rule mining, one can see that we are looking for frequent classes (really classification with respect to the consequent as the class label attribute).
1.3. The role of partitions
The concept of a partition links many of the common data mining techniques, and many standard techniques can be described in the same framework of generalized database operations. Relational database systems are ubiquitous today. The notion of a unary equivalence relation is central to understanding data patterns through similarity partitioning, and the notion of a comparison relation (order relation or hierarchy) is central for distinguishing similarity patterns. The former glues objects together and the latter distinguishes them. The former is reflexive, symmetric and transitive; the latter is irreflexive and transitive.
We can view a relation, R(A1,…,An) with Dom(Ai) = Xi, as the f^(-1)(1) component of the preimage partition generated by a function

f: X1 × … × Xn → {0,1}

which assigns 1 if the tuple "exists in the relation" and 0 if it "does not exist in the relation" (preimages of functions, partitions and equivalence relations are pairwise dual concepts). That is, we partition the full Cartesian product of the attribute domains into two components whenever we define a relation. Data mining and database querying are a matter of describing the non-randomness of that partition boundary (if it is non-random). Clearly, if f is identically 1, the relation is the entire Cartesian product and there is no boundary. This is one extreme.
At the other extreme, f is the characteristic function of a singleton set and there is a clear boundary and clear nonrandomness. Data mining in the latter case degenerates to data querying. So "searching for patterns" can be viewed as searching for and describing the nonrandomness of that boundary.
A partition on a relation over attribute domains X1,…,Xn is the preimage partition generated by a surjection,

F: X1 × … × Xn → {0,1,…,N}.
The range provides a labeling for the partition. We don't need to define a relation separately from a partition, since this partition function, F, when composed with the characteristic function g: {0,1,…,N} → {0,1} given by g(n) = 1 iff n ≠ 0, yields the function, f, that defines the underlying relation being partitioned. Composition with this characteristic function is used in Market Basket Research to focus on the existence of a data item in a market basket (independent of count issues) in much the same way.
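These two constructions can be sketched directly on a finite domain; the example domain and labeling function below are hypothetical:

```python
def preimage_partition(domain, F):
    """Partition a finite domain into components F^(-1)(label),
    one component per label in the range of F."""
    parts = {}
    for x in domain:
        parts.setdefault(F(x), []).append(x)
    return parts

def underlying_relation(domain, F):
    """Compose F with the characteristic function g(n) = 1 iff n != 0:
    the underlying relation is the preimage of the nonzero labels."""
    return [x for x in domain if F(x) != 0]
```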
Another very central construct we will use to unify data querying and data mining of a relational database is the partition. Both the "partition - equivalence relation" duality and the "partition - label function" duality will be exploited in this treatment: every partition generates an equivalence relation and vice versa, and every labeled partition generates a function from the partitioned set to the label set and vice versa. Partitions have subobjects.
A sub-partition is simply a finer partition (every partition component is a subset of a component of the super-partition). The class of partitions forms a partially ordered set under the "sub" operator. Within the context of this partially ordered set of partitions (or the lattice of partitions), querying, indexing, clustering, classification, association rule mining, data warehousing operations, and even concurrency control can be defined and related.
Using this extended model, it may be possible to bring database and data mining research together. It may also be possible to eliminate the current need for two separate systems, an operational database management system and a data warehouse. If this should happen, the original goals of database management, namely: centralized control of enterprise data resources, reduced data redundancy, standardization of schemas, database correctness (i.e., serializability), maximal information resource utilization, etc.; may be achieved. The purpose of this paper is to attempt to make a contribution in this direction.
We will use the notions of partitions and hierarchies (partial orderings) of partitions as a unifying theme. Most data mining operations involve partitioning – based on distance functions, classifiers, equivalence relations (e.g., binning algorithms) and chaining techniques (e.g., densitybased clustering). For example, clusters generated from the kmeans clustering method are partitions produced from distance functions. Partitions are often but not always represented by indexes. Data warehouses use bitmap indexes for data mining queries.
Many data mining algorithms use tree-structured indexes to represent hierarchical partitions. Examples of such indexes are B+-trees, R-trees [2], Quadtrees [3], and P-trees [4,5,6,7,8]. A close relationship exists between the bitmap indexes used in data warehouses and P-tree indexes.
The "distance function - similarity measure", "distance function - norm" and "distance function - scalar product" dualities will also be exploited in this paper. We will discuss distance between data points (i.e., database tuples) in a general framework that includes commonly used distance metrics such as the Euclidean and Manhattan distances, as well as other Lp-distances and their variations, the Max distance, and a new distance called the HOBBit distance [5]. Each of these generates a similarity measure and therefore a whole class of clusterings (depending on the clustering algorithm employed). Each also generates a norm and scalar product and therefore provides the notions of orthonormal basis and coincident angle.
Support Vector Machines (SVM), Wavelet Analysis, Principal Component Analysis (PCA) and other approaches to clustering and classification make use of these notions. When considering a database state in the context of a linear space, it is necessary to keep in mind that a database state is always finite and discrete, and therefore is a subset, not a subspace. We refer the reader to [12] regarding functional and linear space details. We will show how one of the standard distances, namely the Max distance, can provide huge benefits when dealing with categorical data. We encode each categorical attribute, based on its labels, as an integer and break the integer up into bit planes.
The bit planes are then treated as Boolean variables, the distance between which is given by the Max distance. We will show that this results in a distance of 1 whenever the attribute values differ. By this scheme we can encode a categorical attribute that has a domain of 2^n values in n bits without losing any of the distance properties of the standard encoding (which uses one Boolean attribute for each of the 2^n domain values). This shows how a systematic treatment of distance metrics can lead to dramatic performance improvements.
It is important to note that the standard encoding of categorical attributes that uses one Boolean attribute for each domain value can easily be regained by a bitwise "AND" operation on a combination of Boolean attributes and their complements. This allows existing algorithms to be used unmodified.
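A minimal sketch of this encoding (the function names and API are ours, not from the text): the Max distance over bit planes is 1 exactly when the codes differ, and the standard one-Boolean-per-category bitmap is regained by ANDing bit planes and their complements:

```python
def bit_planes(code, n_bits):
    """Bit planes of an integer category code, most significant first."""
    return [(code >> i) & 1 for i in reversed(range(n_bits))]

def max_distance(planes_a, planes_b):
    """Max (L-infinity) distance between two bit-plane vectors:
    0 iff the underlying codes are equal, otherwise 1."""
    return max(abs(a - b) for a, b in zip(planes_a, planes_b))

def category_bitmap(codes, target, n_bits):
    """Regain the standard one-category bitmap by ANDing, across all n
    bit positions, each bit plane where the target's bit is 1 and the
    plane's complement where the target's bit is 0."""
    target_bits = bit_planes(target, n_bits)
    return [int(all(b == t for b, t in zip(bit_planes(c, n_bits),
                                           target_bits)))
            for c in codes]
```

Since any differing bit contributes 1 to the maximum, the Max distance between two distinct codes is always exactly 1, matching the one-hot encoding's behavior.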
Based on attribute values and distances, we will identify partitions that can be efficiently searched through indexes. It is important for our discussion that partitions can be defined at many levels. In the data mining context this can be identified with a concept hierarchy, or in our model a "partition hierarchy". Concept hierarchies are commonly defined as a tree of mappings from a set of low-level concepts to more general concepts, such as "city" < "province_or_state" < "country" [1].
More general mappings are described through directed graphs and are called concept lattices. In practice, concept hierarchies are often converted into what we will term concept slices by realizing that a lower-level attribute only has to specify the incremental information with respect to the higher-level attribute. In the presence of the higher-level attribute "year", the month is uniquely defined through its name or number (without specifying the year), and the day through the attribute "day_of_month". Specifying a concept hierarchy for these attributes requires combining attributes: ("year","month","day_of_month") < ("year","month") < "year". We will refer to "year", "month" and "day_of_month" as concept slices. Concept slices can only be defined for concept hierarchies, i.e., trees, not for concept lattices, i.e., graphs.
