BinRank: Scaling Dynamic Authority-Based Search Using Materialized Subgraphs
According to one embodiment of the present invention, a method for processing a query is provided. The method includes generating a set of pre-computed materialized sub-graphs from a dataset and receiving a search query having one or more search query terms. A particular one of the pre-computed materialized sub-graphs is accessed and a dynamic authority-based keyword search is executed on the particular one of the pre-computed materialized sub-graphs.
Nodes in the dataset are then retrieved based on the execution, and a response to the search query is provided which includes the retrieved nodes. A variety of algorithms are in use for keyword searches in databases and on the Internet. Dynamic authority-based search algorithms leverage semantic link information to provide high-quality, high-recall search results.
For example, the PageRank algorithm utilizes the Web graph link structure to assign global importance to Web pages. The PageRank score is independent of a keyword query. Recently, dynamic versions of the PageRank algorithm have been developed. They are characterized by a query-specific choice of the random walk starting points. Personalized PageRank (PPR) is a modification of PageRank that personalizes search on a preference set containing web pages that a user likes.
For a given preference set, PPR performs an expensive fixpoint iterative computation over the entire Web graph while generating personalized search results. ObjectRank extends Personalized PageRank to perform keyword search in databases.
ObjectRank uses a query term posting list as a set of random walk starting points, and conducts the walk on the instance graph of the database. ObjectRank has successfully been applied to databases that have social networking components, such as bibliographic data and collaborative product design. According to one embodiment of the present invention, a method comprises: According to another embodiment of the present invention, a method comprises: According to a further embodiment of the present invention, a system comprises: According to another embodiment of the present invention, a computer program product for processing a query comprises: Embodiments of the invention provide a practical solution for scalable dynamic authority-based ranking.
The above-discussed Personalized Page Rank and ObjectRank algorithms both suffer from scalability issues. Personalized Page Rank performs an expensive fixpoint iterative computation over the entire Web graph.
ObjectRank requires multiple iterations over all nodes and links of the entire database graph. The original ObjectRank system has two modes: on-line and off-line. The on-line mode runs the ranking algorithm once the query is received, which takes too long on large graphs.
For example, on a graph of articles of English Wikipedia 1 with 3. In the off-line mode, ObjectRank precomputes top-k results for a query workload in advance.
This precomputation is very expensive and requires a lot of storage space for precomputed results. Moreover, this approach is not feasible for all terms outside the query workload that a user may search for, i. For example, on the same Wikipedia dataset, the full dictionary precomputation would take about a CPU-year. Embodiments of the present invention employ a hybrid approach, referred to as BinRank, in which query time can be traded off for pre-processing time and storage.
BinRank closely approximates ObjectRank scores by running the same ObjectRank algorithm on a small sub-graph, instead of the full data graph. The sub-graphs are precomputed offline.
The precomputation can be parallelized with linear scalability. For example, on the full Wikipedia dataset, BinRank can answer any query in less than one second, by precomputing about a thousand sub-graphs, which takes only about 12 hours on a single CPU.
Query execution in accordance with the invention easily scales to large clusters by distributing the sub-graphs between the nodes of the cluster. This way, more sub-graphs can be kept in RAM, thus decreasing the average query execution time.
Since the distribution of the query terms in a dictionary is usually very uneven, the throughput of the system is greatly improved by keeping duplicates of popular sub-graphs on multiple nodes of the cluster. The query term is routed to the least busy node that has the corresponding sub-graph. There are two dimensions to the sub-graph precomputation problem: The embodiments of the invention use an approach based on the idea that a sub-graph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms.
For (2), we execute ObjectRank for each bin using the terms in the bin as random walk starting points and keep only those nodes that receive non-negligible scores. In particular, the invention approximates ObjectRank by using Materialized Sub-Graphs (MSGs), which can be precomputed off-line to support on-line querying for a specific query workload, or the entire dictionary. In addition, embodiments of the invention use a greedy algorithm that minimizes the number of bins by clustering terms with similar posting lists.
The scalability of ObjectRank is improved with embodiments of the invention, while still maintaining the high quality of top-K result lists. ObjectRank performs top-K relevance search over a database modeled as a labeled directed graph. The data graph G(V, E) models objects in a database as nodes, and the semantic relationships between them as edges. In ObjectRank, the role of edges between objects is the same as that of hyperlinks between web pages in PageRank.
However, notice that edges of different edge types may transfer different amounts of authority. Let w(t) denote the weight of edge type t. ObjectRank assumes that weights of edge types are provided by domain experts. Details regarding ObjectRank may be found at Balmin et al. Also, it is noted that there are three important properties of ObjectRank vectors that are directly relevant to the result quality and the performance of ObjectRank.
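As an illustration of how edge-type weights shape authority flow, the following Python sketch runs a power iteration in the style of ObjectRank. It is a minimal sketch under stated assumptions, not the system's implementation: the function name object_rank, the damping factor of 0.85, the stopping tolerance epsilon, and the rule that an edge of type t leaving node u carries w(t) divided by the number of type-t edges leaving u are all assumptions introduced here for illustration.

```python
from collections import defaultdict


def object_rank(nodes, edges, type_weight, baseset,
                damping=0.85, epsilon=1e-6, max_iter=200):
    """Approximate authority scores for one baseset (hypothetical helper).

    nodes:       iterable of node ids
    edges:       list of (source, target, edge_type) triples
    type_weight: dict edge_type -> w(t), assumed to come from domain experts
    baseset:     nodes where the random walk starts (e.g. a posting list)
    """
    nodes = list(nodes)
    # Number of outgoing edges of each type per source node.
    out_deg = defaultdict(int)
    for u, _, t in edges:
        out_deg[(u, t)] += 1

    base = set(baseset)
    # Random jumps land uniformly on baseset nodes only.
    jump = {v: (1.0 / len(base) if v in base else 0.0) for v in nodes}
    scores = dict(jump)

    for _ in range(max_iter):
        nxt = {v: (1.0 - damping) * jump[v] for v in nodes}
        for u, v, t in edges:
            # Assumed transfer rate: w(t) split over u's type-t out-edges.
            nxt[v] += damping * scores[u] * type_weight[t] / out_deg[(u, t)]
        delta = max(abs(nxt[v] - scores[v]) for v in nodes)
        scores = nxt
        if delta < epsilon:
            break
    return scores
```

Nodes that no walk from the baseset can reach keep a score of zero, which is consistent with the observation that only a small portion of the graph is relevant to a given keyword.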
First, for many of the keywords in the corpus, the number of objects with non-negligible ObjectRank values is much less than |V|. This means that just a small portion of G is relevant to a specific keyword. An ObjectRank value of v, r(v), is non-negligible if r(v) is above the convergence threshold. The intuition for applying the threshold is that differences between scores that are within the threshold of each other are noise after ObjectRank execution.
Thus, scores below the threshold are effectively indistinguishable from zero, and objects that have such scores are not at all relevant to the query term. Second, we observed that top-K results of any keyword term t generated on sub-graphs of G composed of nodes with non-negligible ObjectRank values, with respect to the same t, are very close to those generated on G. Thus, a sub-graph of G composed of nodes with non-negligible ObjectRank values, with respect to a union of basesets of a set of terms, can potentially be used to answer any one of these terms.
Based on the above observations, we speed up the ObjectRank computation for query term q, by identifying a sub-graph of the full data graph that contains all the nodes and edges that contribute to the accurate ranking of the objects with respect to q.
Ideally, every object that receives a non-zero score during the ObjectRank computation over the full graph should be present in the sub-graph and should receive the same score.
In reality, however, ObjectRank is a search system that is typically used to obtain only the top-K result list. Thus, the sub-graph only needs to have enough information to produce the same top-K list. The top-K result list of the ObjectRank of keyword term t on data graph G(V, E), denoted OR(t, G, k), is a list of k objects from V sorted in descending order of their ObjectRank scores with respect to a baseset that is the set of all objects in V that contain keyword term t.
It is hard to find an exact RSG for a given term, and it is not feasible to precompute one for every term in a large workload. However, the present invention introduces a method to closely approximate RSGs. Furthermore, it can be observed that a single sub-graph can serve as an approximate RSG for a number of terms, and also that it is quite feasible to construct a relatively small number of such sub-graphs that collectively cover, i.
This measure is commonly used to describe the quality of approximation of top-K lists of exact ranking R_E and approximate ranking R_A that may contain ties (nodes with equal ranks). A pair of nodes that is strictly ordered in both lists is called concordant if both rankings agree on the ordering, and discordant otherwise.
A pair is e-tie, if R E does not order the nodes of the pair, and a-tie, if R A does not order them. Let C, D, E, and A denote the number of concordant, discordant, e-tie, and a-tie pairs respectively.
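The pair classification above can be sketched in Python as follows. The counting of C, D, E, and A follows the definitions given in the text. How the counts are combined into a single score is not reproduced in this text, so the second function assumes the common tau-b normalization, (C - D) / sqrt((M - E)(M - A)) with M the total number of pairs. The function names and the dictionary-based input format are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(rank_e, rank_a):
    """Count concordant, discordant, e-tie and a-tie pairs.

    rank_e, rank_a: dicts node -> rank over the same nodes;
    equal rank values mean the ranking does not order the pair.
    """
    c = d = e = a = 0
    for u, v in combinations(sorted(rank_e), 2):
        tie_e = rank_e[u] == rank_e[v]
        tie_a = rank_a[u] == rank_a[v]
        if tie_e:
            e += 1
        if tie_a:
            a += 1
        if not tie_e and not tie_a:
            # Strictly ordered in both lists: do the orderings agree?
            if (rank_e[u] < rank_e[v]) == (rank_a[u] < rank_a[v]):
                c += 1
            else:
                d += 1
    return c, d, e, a


def kendall_tau_b(rank_e, rank_a):
    """Assumed tau-b style combination of the four counts."""
    c, d, e, a = pair_counts(rank_e, rank_a)
    n = len(rank_e)
    m = n * (n - 1) // 2  # total number of node pairs
    denom = sqrt((m - e) * (m - a))
    return (c - d) / denom if denom else 0.0
```

A value of 1 means that the two rankings order every strictly ordered pair identically and have no ties, while lower values indicate more disagreement between the exact and approximate lists.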
An ARSG may be constructed for term t by executing ObjectRank with some set of objects B as the baseset and restricting the graph to include only nodes with non-negligible ObjectRank scores, i.
The main challenge of this approach is identifying a baseset B that will provide a good RSG approximation for term t. Embodiments of the invention focus on sets B that are supersets of the baseset of t. This relationship gives us the following important result. According to this theorem, for a given term t, if the term baseset BS(t) is a subset of B, all the important nodes relevant to t are always subsumed within MSG(B). That is, all the non-negligible end points of random walks originated from starting nodes containing t are present in the sub-graph generated using B.
However, it may be observed that even though two nodes v1 and v2 are guaranteed to be found both in G and in MSG(B), the ordering of their ObjectRank scores might not be preserved on MSG(B), as we do not include intermediate nodes if their ObjectRank scores are below the convergence threshold.
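The restriction step described above can be sketched in Python as follows. This is a minimal sketch that assumes the ObjectRank scores for baseset B have already been computed and are passed in as a dictionary. The function name materialize_subgraph and the edge-triple format are hypothetical.

```python
def materialize_subgraph(edges, scores, threshold):
    """Keep nodes with non-negligible scores and the edges among them.

    edges:     list of (source, target, edge_type) triples of the full graph
    scores:    dict node -> score obtained with baseset B
    threshold: convergence threshold; scores at or below it are negligible
    """
    keep = {v for v, s in scores.items() if s > threshold}
    # Only edges whose both end points survive stay in the sub-graph.
    sub_edges = [(u, v, t) for u, v, t in edges if u in keep and v in keep]
    return keep, sub_edges


# Toy example: node "c" scores below the threshold and is pruned.
toy_edges = [("a", "b", "cites"), ("b", "c", "cites"),
             ("c", "d", "cites"), ("a", "d", "cites")]
toy_scores = {"a": 0.3, "b": 0.05, "c": 1e-7, "d": 0.2}
toy_nodes, toy_sub_edges = materialize_subgraph(toy_edges, toy_scores, 1e-4)
```

In the toy example, pruning node c also removes the path from b to d that passed through it, so d can receive slightly less authority on the sub-graph than on the full graph. This is the effect on score ordering noted above.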
However, it is unlikely that many walks terminating on relevant nodes will pass through irrelevant nodes. Experimental evaluations performed by the inventors support this intuition. The quality of search results should improve if objects in B are semantically related to t. In fact, the inventors have discovered that terms with strong semantic connections can generate good RSGs for each other. However, there is definitely a strong semantic connection between these terms, since XML is a data format famous for its flexible schema.
Papers about XML tend to cite papers that are about schemas and vice versa. It can be hard to automatically identify terms with such strong semantic connections for every query term.
A baseset B is created for every bin by taking the union of the posting lists of the terms in the bin, and MSG(B) is constructed for every bin. The mapping of terms to bins is remembered, and at query time, the corresponding bin for each term can be uniquely identified, and the term can be executed on the MSG of this bin.
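The bookkeeping described above can be sketched in Python as follows. This is a minimal sketch in which posting lists are assumed to be in-memory collections of object ids and bins are assumed to be given as lists of terms. The names build_bin_index and plan_query are hypothetical, and the choice to start the query-time walk from the term's own posting list is an assumption based on how ObjectRank selects its starting points.

```python
def build_bin_index(bins, posting_lists):
    """Return (term -> bin id, bin id -> baseset B) for the given bins."""
    term_to_bin = {}
    basesets = {}
    for bin_id, terms in enumerate(bins):
        # Baseset B of a bin: union of the posting lists of its terms.
        basesets[bin_id] = set().union(*(posting_lists[t] for t in terms))
        for t in terms:
            term_to_bin[t] = bin_id
    return term_to_bin, basesets


def plan_query(term, term_to_bin, posting_lists):
    """Pick the MSG to use and the starting points for one query term."""
    # The ranking itself would then run on the MSG of the returned bin.
    return term_to_bin[term], set(posting_lists[term])
```

Because each term belongs to exactly one bin, the lookup at query time is a single dictionary access, and only the MSG of that bin has to be loaded.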
Empirical results support this. The most frequent among them appeared in 8 documents. As previously discussed, a set of MSGs is constructed for terms of a dictionary or a workload by partitioning the terms into a set of term bins based on their co-occurrence. An MSG is generated for every bin based on the intuition that a sub-graph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms.
There are two main goals in constructing term bins. The first goal is controlling the size of each bin to ensure that the resulting sub-graph is small enough for ObjectRank to execute in a reasonable amount of time. The second goal is minimizing the number of bins to save pre-processing time.
We know that pre-computing ObjectRank for all terms in our corpus is not feasible. To achieve the first goal, a maxBinSize parameter is introduced that limits the size of the union of the posting lists of the terms in the bin, called bin size. As discussed above, ObjectRank uses the convergence threshold that is inversely proportional to the size of the baseset, i. Thus, there is a strong correlation between the bin size and the size of the materialized sub-graph.
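One way to pursue both goals at once is a greedy packing of terms into bins, sketched in Python below. This is a simplified illustration under stated assumptions, not the procedure of the invention: the seed choice (largest remaining posting list), the similarity measure (size of the overlap with the current bin), the tie-breaking by term name, and the function name pack_terms_into_bins are all assumptions. Only the maxBinSize limit on the union of posting lists is taken from the text.

```python
def pack_terms_into_bins(posting_lists, max_bin_size):
    """Greedily group terms so that each bin's union stays within the limit.

    posting_lists: dict term -> iterable of object ids
    max_bin_size:  upper bound on the size of the union of posting lists
    Returns a list of (terms, union_of_posting_lists) pairs.
    """
    unassigned = {t: set(p) for t, p in posting_lists.items()}
    bins = []
    while unassigned:
        # Assumed seed rule: start from the largest remaining posting list.
        seed = max(unassigned, key=lambda t: (len(unassigned[t]), t))
        union = unassigned.pop(seed)
        members = [seed]
        while True:
            best, best_key = None, None
            for t, docs in unassigned.items():
                if len(union | docs) > max_bin_size:
                    continue  # adding this term would exceed the bin size
                key = (len(union & docs), t)  # prefer the largest overlap
                if best_key is None or key > best_key:
                    best, best_key = t, key
            if best is None:
                break
            union |= unassigned.pop(best)
            members.append(best)
        bins.append((members, union))
    return bins
```

Terms whose posting lists overlap heavily add few new objects to the union, so grouping them keeps the bin size small while reducing the number of bins. In this sketch a term whose own posting list already exceeds max_bin_size ends up alone in an oversized bin, and the quadratic scan over unassigned terms would need an index to be practical for a large dictionary.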