Skip to end of metadata
Go to start of metadata

Choke Point Classification

CP1-AggregationPerformance. Performance of aggregate calculations.

    • CP1.1 QEXE: Ordered Aggregation. 
    • CP1.2 QOPT: Interesting Orders
      • Apart from clustered indexes providing key order, other operators also preserve or even induce tuple orderings. Sort-based operators create new orderings, typically the probe-side of a hash join conserves its order, etc.
    • CP1.3 QOPT: Array-based Group-by keys. 
    • CP1.4 QEXE: Dependent Group-By Keys (removal of ). 
    • CP1.5 (QEXE) Low cardinality hash aggregation speed
    • CP1.6 (QEXE) High cardinality group by performance, e.g. partitioning on grouping keys in order to avoid having to add up per thread partial group by's )
      • If an aggregation produces significant numbers of groups with intra query parallelization each thread may make its own partial aggregation. To produce the result, these have to be re-aggregated. In order to avoid this, the tuples entering the aggregation operator may be partitioned by a hash of the grouping key and directed to the appropriate partition. The partition may have its own thread so that only this thread writes the aggregation, hence avoiding costly critical sections. 
        A high cardinality distinct modifier in a query is a special case of this choke point. It is amenable to the same solution with intra query parallelization and partitioning as the group by. 
        We further note that scale-out systems have an extra incentive for partitioning since this will distribute the CPU and memory pressure over multiple machines, yielding better platform utilization and scalability.

    • CP1.7 Complex aggregate performance, e.g. concatenation
      • Many databases offer user defined aggregates and more complex aggregation operations than the basic count, sum, man and min. For example SPARQL has a standard string concatenation aggregation operator. These types of aggregates are expected to benefit from the same optimizations than the basic built-in ones, for example partitioning.

CP2-JoinPerformance. Voluminous joins, with or without selections.

    • CP2.1 QEXE: Large Joins (out-of-core).
    • CP2.2 QEXE: Sparse Foreign Key Joins (bloom filters).
    • CP2.3 QOPT: Rich Join Order Optimization.
      • The execution times of different join orders differ by orders of magnitude. Therefore, finding an efficient join order is important, and, in general, requires enumeration of all join orders, e.g., using dynamic programming. The enumeration is complicated by operators that are not freely reorder-able like semi-, anti-, and outer- joins. Because of this difficulty most join enumeration algorithms do not enumerate all possible plans, and therefore can miss the optimal join order.
    • CP2.4 QOPT: Late Projection (column stores).
      • In column stores, queries where certain columns are only used late in the plan, can typically do better by omitting them from the original table scans, to fetch them later by row-id with a separate scan operator which is joined to the intermediate query result. Late projection does have a trade-off involving locality, since late in the plan the tuples may be in a different order, and scattered I/O in terms of tuples/second is much more expensive than sequential I/O. Late projection specifically makes sense in queries where the late use of these columns happens at a moment where the amount of tuples involves has been considerably reduced; for example after an aggregation with only few unique group-by keys, or a top-N operator.
    • CP2.5 (QOPT) Index choice - Decide scan order taking into account order of results for downstream (parent) operators
    • CP2.6 (QEXE) Unpredictable, widely scattered indexed access pattern (graph walk)
      • The efficiency of index lookup is very different depending on the locality of keys coming to the indexed access. Techniques like vectoring non-local index access by simply missing the cache in parallel on multiple lookups vectored on the same thread may have high impact. Also detecting absence of locality should turn off any locality dependent optimizations if these have overhead when there is no locality. A graph neighborhood traversal is an example of an operation with random access without predictable locality.
    • CP2.7 (QOPT) Join type - correct choice of hash vs. index based on cardinality on either side
      • Specially with stores, where one usually has an index on everything, deciding to use a hash join requires a good estimation of cardinalities on both the probe and build sides. In TPC-H, the use of hash join is almost a foregone conclusion in many cases, since an implementation will usually not even define an index on foreign key columns. There is a break even point between index and hash based plans, depending on the cardinality on the probe and build sides.sides. This choke point tests whether this break-even point is correctly modeled and whether the cost model produces accurate estimates as input to this choice.

CP3-DataAccessLocality. Non-full-scan access to (correlated) table data.

    • CP3.1 STORAGE: Columnar Locality (favors column storage).
    • CP3.2 STORAGE: Physical Locality by Key (clustered index, partitioning).
    • CP3.3 QOPT: Detecting Correlation (ZoneMap,MinMax,multi-attribute histograms).
      • If a schema rewards creating clustered indexes, the question then is which of the three date columns to use as key. In fact it should not matter which column is used, as range- propagation between correlated attributes of the same table is relatively easy. One way is through creation of multi-attribute histograms after detection of attribute correlation. With MinMax indexes, range-predicates on any column can be translated into qualifying tuple position ranges. If an attribute value is correlated with tuple position, this reduces the area to scan roughly equally to predicate selectivity.
    • CP3.4 (QEXE) Use of zone maps for accelerating scans
    • CP3.5 (Storage) Dimensional clustering - assigning pk/fk/uri/vertex ids in function of attributes of the entity concerned
      • A data model where each entity has a unique synthetic identifier, e.g. RDF or graph models, has some choice in assigning a value to this identifier. The properties of the entity being identified may affect this, e.g. type (label), other dependent properties, e.g. geographic location, date, position in a hierarchy etc, depending on the application. Such identifier choice may create locality which in turn improves efficiency of compression or index access.

CP4-ExpressionCalculation. Efficiency in evaluating (complex) expressions. 

    • CP4.1 Raw Expression Arithmetic.
      • CP4.1a QEXE: SIMD Arithmetic.
      • CP4.1b QEXE: Overflow Prevention (arithmetic).
      • CP4.1c QEXE: Compressed Execution.
      • CP4.1d QEXE: Vectorized Execution & JIT (compilation for CPU, GPU or FPGA).
    • CP4.2 Complex Boolean Expressions in Joins and Selections.
      • CP4.2a QOPT: Common Subexpression Elimination (CSE).
        •  A basic technique helpful in multiple queries is common subexpression elimination (CSE). CSE should recognize also that average aggregates can be derived afterwards by dividing a SUM by the COUNT when those have been computed.
      • CP4.2b QOPT: Join-Dependent Expression Filter Pushdown.
      • CP4.2c QOPT: Invisible Join for IN clauses.
      • CP4.2d QEXE: Run-time Reordering of Conjunctions.
    • CP4.3 String Matching Performance.
      • CP4.3a QOPT: Rewrite LIKE(X%) into a Range Query.
      • CP4.3b QEXE: SIMD String Matching (e.g. SSE4.2).
      • CP4.3c QEXE: Regular Expression Compilation (JIT/FSA generation).

CP5-CorrelatedSubqueries. Efficiently handling dependent subqueries.

    • CP5.1 QOPT: Flattening Subqueries (into join plans).
      • Many queries have correlated subqueries and their query plans can be flattened, such that the correlated subquery is handled using an equi-join, outer-join or anti-join. To execute queries well, systems need to flatten both subqueries, the first into an equi-join plan, the second into an anti-join plan. Therefore, the execution layer of the database system will benefit from implementing these extended join variants. The ill effects of repetitive tuple- at-a-time subquery execution can also be mitigated in execution systems use vectorized, or block-wise query execution, allowing to run sub-queries with thousands of input parameters instead of one. The ability to look up many keys in an index in one API call, creates the opportunity to benefit from physical locality, if lookup keys exhibit some clustering.
    • CP5.2 QOPT: Moving Predicates into a Subquery.
    • CP5.3 QEXE: Overlap between Outer-and Subquery.
      • [In some queries, the correlated subquery and the outer query have the same joins and selections. In this case, a non-tree, rather DAG-shaped query plan would allow to execute the common parts just once, providing the intermediate result stream to both the outer query and correlated subquery, which higher up in the query plan are joined together (using normal query decorrelation rewrites). As such, the benchmark rewards systems where the optimizer can detect this and whose the execution engine sports an operator that can buffer intermediate results and provide them to multiple parent operators.

CP6-Parallelism and Concurrency. Making use of parallel computing resources. 

    • CP6.1 QOPT: Query Plan Parallelization.
    • CP6.2 QEXE: Workload Management.
    • CP6.3 QEXE: Result Re-use.
      • Sometimes with a high number of streams a significant amount of identical queries emerge in the resulting workload. The reason is that certain parameters, as generated by the workload generator, have only a limited amount of parameters bindings. This weakness opens up the possibility of using a query result cache, to eliminate the repetitive part of the workload. A further opportunity that detects even more overlap is the work on recycling, which does not only cache final query results, but also intermediate query results of a "high worth". Here, worth is a combination of partial-query result size, partial-query evaluation cost, and observed (or estimated) frequency of the partial-query in the workload.

CP7 RDF and Graph Specifics

    • CP7.1 (QOPT) Translation of internal id's into external ones. Translate at point of minimum cardinality, e.g. after top k order by
      • RDF and possibly graph models often use a synthetic integer identifier for entities, e.g. URI’s . For presentation to the client applications, these identifiers must be translated to their original form, e.g. the URI string that was used when loading the data. This should be done as late as possible, or at the point of minimal cardinality in the plan.
    • CP7.2 (QOPT) Cardinality estimation of transitive paths, order of traversal (1:n, nm:1 n:n)
      • A transitive path may occur in a ’fact table’ or a ’dimension table’ position. A transitive path may cover a tree or a graph, e.g. descendants in a geographical hierarchy vs. graph neighborhood or transitive closure in a many-to-many connected social network. In order to decide proper join order and type, the cardinality of the expansion of the transitive path needs to be correctly estimated. This could for example take the form of executing on a sample of the data in the cost model or of gathering special statistics, e.g. the depth and fan-out of a tree. In the case of hierarchical dimensions, e.g. geographic locations or other hierarchical classifications, detecting the cardinality of the transitive path will allow one to go to a star schema plan with scan of a fact table with a selective hash join. Such a plan will be on the other hand very bad for example if the hash table is much larger than the ’fact table’ being scanned.
    • CP7.3 (QEXE) Execution of a transitive step
      • Graph workloads may have transitive operations, for example finding a shortest path between vertices. This involves repeated execution of a short lookup, often on many values at the same time, while usually having an end condition, e.g. the target vertex being reached or having reached the border of a search going in the opposite direction. For best efficiency, these operations can be merged or tightly coupled to the index operations themselves. Also parallelization may be possible but may need to deal with a global state, e.g. set of visited vertices. There are many possible tradeoffs between generality and performance.


  • No labels