LDBC SNB Interactive Chokepoints (http://ldbcouncil.org/sites/default/files/LDBC_D3.3.34.pdf)

TPC-H Chokepoints

Paper:

Slides:

 

Chokepoint coverage per query (reconstructed from the original matrix, columns CP1.1-CP7.5; every row except Q25 additionally carries an "R.I.P" entry in the CP7.1 column, preserved as in the original):

Q1:  CP1.2, CP3.2, CP4.1
Q2:  CP1.1, CP1.2, CP1.4, CP2.1, CP2.3, CP3.1, CP3.2
Q3:  CP2.4, CP3.1, CP3.2, CP4.1, CP4.3, CP5.3, CP6.1
Q4:  CP1.1, CP1.2, CP1.4, CP2.1, CP2.2, CP2.4, CP3.3
Q5:  CP1.2, CP1.4, CP1.5, CP2.1, CP2.2, CP2.3, CP2.4, CP3.3, CP5.3, CP6.1
Q6:  CP1.2, CP2.3
Q7:  CP1.2, CP2.3, CP3.2, CP3.3, CP6.1
Q8:  CP1.6, CP3.3, CP5.2
Q9:  CP1.2, CP1.4, CP2.1, CP2.3, CP2.4
Q10: CP1.2, CP2.1, CP2.3, CP3.2
Q11: CP1.1, CP2.1, CP2.2, CP2.3, CP3.1, CP3.2, CP6.1
Q12: CP1.2, CP2.2, CP3.1, CP6.1
Q13: CP1.2, CP2.2, CP2.3, CP3.2, CP6.1
Q14: CP1.2, CP2.2, CP2.3, CP3.2, CP7.2, CP7.3, CP7.4
Q15: CP1.2, CP2.3, CP3.2, CP3.3, CP5.3, CP6.1
Q16: CP1.2, CP1.4, CP2.3, CP2.4, CP3.3, CP7.2, CP7.3, CP7.5
Q17: CP1.1, CP2.3
Q18: CP1.1, CP1.2, CP1.6, CP3.2, CP4.2, CP4.3
Q19: CP1.1, CP1.4, CP2.1, CP2.3, CP2.4, CP3.3, CP5.1, CP7.3, CP7.4
Q20: CP1.6, CP2.1, CP6.1
Q21: CP1.2, CP2.1, CP2.3, CP2.4, CP3.2, CP3.3, CP5.1, CP5.3
Q22: CP1.4, CP1.6, CP2.1, CP3.1, CP3.3, CP5.1, CP5.2, CP5.3
Q23: CP1.6, CP2.3, CP2.4, CP3.3, CP4.3
Q24: CP1.6, CP2.1, CP2.3, CP2.4, CP3.2, CP4.3
Q25: CP1.2, CP2.1, CP2.2, CP2.4, CP3.3, CP5.1, CP5.3, CP7.2, CP7.3

 

Query 1 (Alex)

NOTE1: TPC-H CHOKEPOINTS NOT EVALUATED YET

Query 2 (Arnau)

SNBI-1.1 The query asks for a group-by whose key is partially defined by the country of the person. Thus, if the first joins ensure output sorted by country, the group-by becomes cheaper. The same can happen with the month of the messages: if the messages are sorted by date, access to the structure used for the aggregation is more local. For the tags, an interesting order may also be available if the message x message-tag join is clustered by tag.
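
A minimal sketch of that ordered-aggregation idea, assuming toy pre-joined rows already sorted by country (the column layout and data are invented): once the input is clustered by the grouping key, each group can be emitted as soon as the key changes, with no hash table.

    # Ordered aggregation sketch: the rows are assumed to arrive sorted by the
    # grouping key (country), so each group is finished when the key changes.
    from itertools import groupby

    # hypothetical pre-joined rows (country, message_month, tag), sorted by country
    rows = [
        ("France", 3, "rock"),
        ("France", 3, "rock"),
        ("France", 4, "jazz"),
        ("Germany", 3, "rock"),
    ]

    for country, group in groupby(rows, key=lambda r: r[0]):
        print(country, sum(1 for _ in group))   # message count per country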

SNBI-1.2 The number of distinct group-by keys for this query can exceed 4M in the worst case, hurting locality if the aggregation structure does not fit in the cache, or even in memory for a very large dataset.

[Alex] "SNBI-1.3: QEXE Complex aggregate performance"  <-- "age group of person" grouping key?

[Alex] "SNBI-1.4: QOPT Top-k push down"  <-- with all the grouping keys and sorting criteria it's hard to tell, but there seems to be potential for this, although it may be difficult because message_count is the first sorting criterion

SNBI-2.1 The messages table needs to be joined with the message/tags table. The optimizer should do this after joining the messages with the persons of the country, since doing it the other way around would consume more memory and produce a rather large intermediate table.

SNBI-2.3 Since the persons of the selected country form a rather small table compared to the messages table, a hash-join operator would be preferable: build the hash table over the person table and then keep only the messages whose creator is in the hash table.
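
A sketch of that strategy on toy data (the person ids and message tuples are invented): the small build side becomes a hash set and the large message side is filtered by probing it.

    # Hash-join sketch: build side = persons of the selected country (small),
    # probe side = messages (large); only matching creators survive.
    persons_in_country = {101, 102, 205}            # person ids (build side)
    messages = [                                    # (message_id, creator_id)
        (1, 101), (2, 999), (3, 205), (4, 300),
    ]

    build = set(persons_in_country)                 # the "hash table"
    joined = [(mid, pid) for mid, pid in messages if pid in build]
    print(joined)                                   # [(1, 101), (3, 205)]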

SNBI-3.1 The message's creation date is correlated with the message's position in the messages table. Unconfirmed.

SNBI-3.2 For systems that materialize adjacencies, giving nodes with similar attributes similar identifiers reduces the amount of data touched when accessing the indexes. This chokepoint will probably be hit by many or all of the queries.

Query 3 (Marcus)

(I took the choke point numbers from http://oai.cwi.nl/oai/asset/21424/21424B.pdf)

SNBI-1.1 Ordered Aggregation: can help if the domain of the group-by key is large -> group-by on tag name (cardinality is around 16K)

--> no chokepoint

CP1.2 Interesting orders: probably not helpful here (if join already provides order, this could be used in the aggregation)

--> no chokepoint

CP1.3 Small Group-by keys: there is only one group-by key (this one could however be represented by 14 bits instead of 64)

--> no chokepoint

CP2.2. Sparse Foreign Key Joins: not sure if this is applicable here

CP2.4. Only the tagname/creationdate (+join attributes) are of real interest, so columnar storage might help

--> chokepoint

CP3.1 Columnar Locality: Scan on creationdate, count aggregation on tags

--> chokepoint

CP3.2 Physical Locality by Key: Range-partitioning the table by creationdate is likely a good idea

--> chokepoint

CP3.3 Scattered index accesses - Undecided

 

CP4.1 Expression arithmetics: The query states that the second date has to be computed; doing this in a smart way will likely help here

--> chokepoint

SNBI-5.3

--> chokepoint

CP6.1 Query Plan Parallelization: Both subqueries (for the first and the second month) can be executed independently of each other (see the sketch below)

--> chokepoint
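
A rough illustration of the CP6.1 point above, using Python threads as a stand-in for plan-level parallelism; the messages list and the month_count helper are invented for the example.

    # The two monthly sub-aggregations are independent, so they can run in
    # parallel and be combined afterwards.
    from concurrent.futures import ThreadPoolExecutor
    from collections import Counter

    messages = [("rock", 3), ("rock", 3), ("jazz", 4), ("rock", 4)]  # (tag, month)

    def month_count(month):
        # hypothetical per-month sub-query: tag -> message count
        return Counter(tag for tag, m in messages if m == month)

    with ThreadPoolExecutor(max_workers=2) as pool:
        first, second = pool.map(month_count, [3, 4])

    diff = {tag: first[tag] - second[tag] for tag in set(first) | set(second)}
    print(diff)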

TPCH-6.1

--> It applies but should we include it in chokepoints?


Query 4 (Moritz)

 

- SNBI-1.1: Ordered Aggregation:

     - When joining post_tag with tags, having the posts ordered by forum would allow for a very cheap count aggregation

 - SNBI-1.2: QEXE High Cardinality group-by performance.

     - This query requires a high-cardinality group-by, e.g. when counting the number of posts per forum with a given tag

- SNB-1.4

 - SNBI-2.1: QOPT Rich join order optimization

     - When finding forums by country it is important to first connect forums to persons and then to country, rather than persons to country and then to forums, which would be much more expensive

     - When counting posts, the runtime should decide whether to filter by forum or by tag first.

- SNBI-2.2

- SNBI-2.4

 - SNBI-3.3 The runtime has to make the right decisions concerning these chokepoints, but it should be trivial to decide correctly. Undecided

[Marcus] SNBI-1.5: there seems to be at least a functional dependency between the forum id and the corresponding forum title (so grouping by both will probably result in grouping only by forum id)

[Marcus] SNBI-2.4: The selection on country name and tc name could result in sparse joins ( ? ) 

 

Query 5 (Alex)
NOTE1: ALL THINGS IN BOLD RED I AM UNSURE ABOUT, PLEASE DOUBLE CHECK THOSE ESPECIALLY
NOTE2: TPC-H CHOKEPOINTS NOT EVALUATED YET

Query 6 (Arnau)

SNBI-2.1 The optimizer should decide between (1) selecting the messages per person, filtering them by tag and performing the counting, versus (2) starting from the tag and, for each of its messages, counting and aggregating per person. In the first case we would have very good locality in the map used to keep the per-person counts. In the second case the filtering of messages by tag would be faster. The first case could be improved by first building a hash table with the messages that have the tag, so unnecessary messages are discarded fast; as a counterpart, we would still need to scan the whole messages table, which is huge. Version 2 therefore seems better (see the sketch below). To further improve locality in the counting map, the temporary table containing the qualifying messages and their persons could be sorted first. NOT a CHOKEPOINT
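
A toy sketch of version (2), with made-up dictionaries standing in for the tag->message index and the message table (tag_messages and message_creator are hypothetical names): start from the given tag, look up its messages, then count per creator.

    # "Version 2" sketch: start from the tag, fetch its messages via the
    # tag -> message index, then count messages per creator.
    from collections import Counter

    tag_messages = {"rock": [1, 3, 4]}              # tag -> message ids (index)
    message_creator = {1: 101, 2: 102, 3: 101, 4: 205}

    counts = Counter(message_creator[m] for m in tag_messages.get("rock", []))
    print(counts)                                   # messages per person for the tag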

SNBI-1.2 One group per person, high cardinality

SNBI-2.3 An index join between the tag-message table and the message table.

SNBI-3.2 Having message identifiers clustered per author would improve locality when accessing the like and reply indexes for counting. NOT a CHOKEPOINT

 TPCH-6.1 When counting and aggregating the number of likes per person's message and replies per person's message, the work can be parallelized so that one thread does the like count while another does the comment count. Placing such threads on different cores could maximize the caching of the indexes over likes per message and comments per message. The counting could also be partitioned and then reduced using multiple threads. NOT a CHOKEPOINT

 TPCH-6.3 There are not that many tags compared to the number of messages, and the distribution of tags over messages is skewed. Also, some tags might be popular at a given moment (flashmob). In a sufficiently long run without updates, we could cache the results of this query for reuse. Even with updates, the counts could be easily maintained. NOT a CHOKEPOINT

Query 7 (Marcus)

Aggregations:
=============

SNBI-1.1:
keeping the derived person relations sorted will be beneficial for the joins and grouping on the person Ids

SNBI-1.2: High Cardinality Group-By
- computing the popularity of each person requires a join between posts and likes and a subsequent grouping by personId

SNBI-1.4:
- Since the score function is a derived aggregate, I'm not sure whether any early-out during processing is possible (maybe by sorting on the number of likes a person received)
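
A sketch of such an early-out, assuming persons can be scanned in descending order of a cheap upper bound (here the like count; the score formula, the bound, and the data are all invented): once the bound cannot beat the current k-th best score, the scan stops.

    # Top-k early-out sketch: persons arrive ordered by a cheap upper bound on
    # their score; the loop breaks as soon as no later person can enter the top-k.
    import heapq

    def expensive_score(person):
        # stand-in for the real derived aggregate
        return person["likes"] * 2 + person["replies"]

    persons = [  # assumed sorted by the bound (likes), descending
        {"id": 1, "likes": 90, "replies": 5},
        {"id": 2, "likes": 50, "replies": 40},
        {"id": 3, "likes": 10, "replies": 1},
    ]

    k, top = 2, []                     # min-heap of (score, id)
    for p in persons:
        bound = p["likes"] * 2 + 100   # toy upper bound (assumes replies <= 100)
        if len(top) == k and bound <= top[0][0]:
            break                      # early-out: the rest cannot qualify
        heapq.heappush(top, (expensive_score(p), p["id"]))
        if len(top) > k:
            heapq.heappop(top)

    print(sorted(top, reverse=True))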

Joins:
===========

SNBI-2.1:
The join order between the posts, the post_tags, and the tags likely benefits from rich join order optimization, by first evaluating the join on the (filtered) tag relation

TPCH-CP2.1 Large Joins: there are some large joins in the query (between posts and likes, between posts and posttags)
--> chokepoint

SNB-2.2:
Late projection will help here since mostly key columns are used for the joins; the only attribute column accessed is the tag name

SNBI-2.4:
the selective filter on the tag relation will probably result in many tuples finding no join partner when joined with post_tag

SNBI-3.2:
Clustering posts by tags

-----

SNBI-6.1:
The popularity of persons can be computed offline and reused between multiple queries

 

Chokepoints: 1.2, 2.3, 3.2 and 3.3, 6.1, 7.1


Query 8 (Moritz)

Chokepoints: 1.6, 3.3, 5.2

Query 9 (Alex)

Query 10 (Arnau)

1.2

2.1

2.3

3.2

Query 11 (Marcus)

1.1 <- Chokepoint

2.1 <- Chokepoint

2.2 <- Chokepoint (Tag.name)

2.3 <- Chokepoint

3.1 <- Chokepoint (Change selection of parameters accordingly)

3.2 <- Chokepoint (cluster the Message-Like join table by message, to make counting likes cheaper)

6.1 <- Chokepoint (incrementally computing the size of the intersection between the tags of a reply and the tags of the message it replies to, or maintaining an index from persons to the messages that qualify in terms of the tags they share with the message they reply to; see the sketch below)
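
A toy sketch of the incremental variant (all names, including the add_reply_tag helper, are invented): the overlap size per reply is maintained as tags are added, so the query never recomputes full intersections.

    # Incremental maintenance sketch: overlap[reply_id] holds the size of the
    # intersection between the reply's tags and the replied message's tags.
    message_tags = {10: {"rock", "jazz"}}
    reply_of = {20: 10}                    # reply id -> replied message id
    reply_tags = {20: set()}
    overlap = {20: 0}

    def add_reply_tag(reply_id, tag):
        if tag not in reply_tags[reply_id]:
            reply_tags[reply_id].add(tag)
            if tag in message_tags[reply_of[reply_id]]:
                overlap[reply_id] += 1     # cheap incremental update

    add_reply_tag(20, "rock")
    add_reply_tag(20, "pop")
    print(overlap[20])                     # 1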

Query 12 (Moritz)

1.2 <- Chokepoint

2.2 <- Chokepoint

3.1 <- Chokepoint

6.1 <- Chokepoint

Query 13 (Alex)

1.2 <- Chokepoint

2.2 <- Chokepoint

2.3 <- Chokepoint

3.2 <- Chokepoint

6.1 <- Chokepoint

Query 14 (Arnau)

1.2 <- Chokepoint

2.2 <- Chokepoint

2.3 <- Chokepoint

3.2 <- Chokepoint

7.2 <- Chokepoint

7.3 <- Chokepoint

7.4 <- Chokepoint

Query 15 (Marcus)

 1.2 <- Chokepoint

CP 2.2 Late projection <- NOT A CHOKEPOINT
- only join attributes are required (for persons only their ids and their places are needed)

  2.3 <- Chokepoint

CP2.4 Sparse Foreign Key Joins <- NOT A CHOKEPOINT
- Persons could be partitioned by their country of residence
- Only persons within a country are considered / selection on the :1 side of the join

 

CP3.1 Detecting Correlation <- NOT A CHOKEPOINT
- friends tend to have more friends within the same country than in others (?)

3.2 <- Chokepoint

3.3 <- Chokepoint

CP 5.3 Intra-query result re-use <- Chokepoint
- The friendships of all persons coming from the same country are actually needed twice, once for computing the average number of friends and once for computing the number of friends of each person

 CP 6.1 Inter-query result re-use <- Chokepoint
- the average number of friends per country could be easily cached and reused later on
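
A small sketch of that inter-query reuse, with an invented friend_counts table and avg_friends helper: the per-country average is computed once and then served from a cache.

    # Inter-query result reuse sketch: the per-country average is computed once
    # and reused until the underlying friendships change.
    friend_counts = {"France": [3, 5, 10], "Germany": [2, 2]}  # friends per person
    _cache = {}

    def avg_friends(country):
        if country not in _cache:
            counts = friend_counts[country]
            _cache[country] = sum(counts) / len(counts)
        return _cache[country]

    print(avg_friends("France"))   # computed
    print(avg_friends("France"))   # served from the cache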


Query 16 (Moritz)


CP1.2 <- Chokepoint

CP1.4 <- Chokepoint <- we can skip persons early based on their message count, and even if they pass that early test, we can skip the message-tag join early based on the already observed tags

CP2.3 <- Chokepoint

CP2.4 is a Chokepoint

CP3.3 is a Chokepoint

CP7.2 is a Chokepoint

CP7.3 is a Chokepoint

CP7.5 is a Chokepoint

Query 17 (Alex)


CP1.1 is a chokepoint - Having the adjacency list ordered by id and keeping this order between operators makes finding distinct triangles cheaper, as a partial triangle only needs to be joined with neighbors that have a larger id.
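
A minimal sketch of that ordering argument on an invented toy graph: each partial triangle (a, b) with a < b is only extended with common neighbors c > b, so every triangle is produced exactly once.

    # Distinct-triangle sketch: adjacency lists are ordered by id, and partial
    # triangles are only extended with larger-id common neighbors.
    adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}

    triangles = 0
    for a, neighbors in adj.items():
        for b in neighbors:
            if b <= a:
                continue
            common = set(adj[a]) & set(adj[b])
            triangles += sum(1 for c in common if c > b)
    print(triangles)   # 2: (1,2,3) and (1,3,4)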

CP2.3 is a chokepoint

Query 18 (Arnau)

CP1.1 is  a chokepoint

CP1.2 is a chokepoint

CP1.6 is a chokepoint

CP3.2 is a chokepoint

Query 19 (Marcus)

 

CP1.2 is a chokepoint: the exists clause is effectively a group-by, plus the final grouping by person

CP1.4 is a chokepoint: if one starts from the persons filtered by birthday, top-k pushdown can be applied

CP2.1 is a choke point: there are multiple possible ways to start this query, either starting from the tag classes or from the persons born after a given birthday

CP2.3 is a choke point: choosing between BFS and DFS makes a difference in evaluation

CP2.4 is a choke point: there are many joins involved in this query, some of them could potentially become sparse, i.e., between comments and persons

CP3.3 is a Chokepoint

CP5.1 IS PENDING

CP6.1 is a chokepoint.

CP7.3 This query does not have a real transitive part, but for the traversals it can make use of many of the optimizations that are used in transitive queries.

CP7.4 TODO: Adjust 7.4 in spirit of the remark for 7.3 "This query does not have a real transitive part but for the traversals it can make use of many of the optimization that are used in transitive queries."

 


Query 20 (Moritz)

 CP1.6: There are just 72 TagClasses

 CP2.1: Proper join order impacts the performance.

CP6.1: We can materialize partial counts for each TagClass, which can be updated every time a message is added. When a query arrives, the counts are summed to compute the final counts (see the sketch below).
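
A toy sketch of that materialization, with invented helpers (on_message_added, query): partial counts are bumped on insert and simply read at query time instead of recounting all messages.

    # Materialized partial counts per TagClass, maintained on insert.
    from collections import defaultdict

    partial_counts = defaultdict(int)        # tagclass -> message count

    def on_message_added(tagclasses):
        for tc in tagclasses:                # tag classes of the new message's tags
            partial_counts[tc] += 1

    on_message_added(["Person", "Place"])
    on_message_added(["Place"])

    def query():
        return dict(partial_counts)          # cheap: no scan over messages

    print(query())                           # {'Person': 1, 'Place': 2}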

Query 21 (Alex)


CP 1.2 is a chokepoint, as the size of the group-by grows linearly with the scale factor, and at some point it will be large enough to matter.

CP2.1 is a chokepoint: there are many joins whose order can be played with, depending on the cardinalities

CP2.3 is a chokepoint: in some places (especially message->likes) one has to choose properly between an index-based join and a hash join

CP2.4 is a chokepoint: The number of messages to join with likes is small compared to the total number

CP3.2 is a chokepoint: clustering by message would impact the performance

CP3.3 is a chokepoint: If an index is used for message->likes, access will be sparse

CP5.1 is a chokepoint: the zombie like count and the total like count are correlated sub-queries that can be flattened

CP5.3 is a chokepoint: similar to 5.1. Also, the persons that qualify for likes are similar to those that qualify as zombies (apart from the country filter)

Query 22 (Arnau)

CP1.4 is a chokepoint: there are top-k push-down opportunities; once some interactions are met, we do not need to look at others

CP1.6 is a chokepoint: there are as many groups as cities

CP2.1 is a chokepoint: there are many possible join orders

CP3.1 is a chokepoint: it is more likely that persons reply to other persons who have already replied to their messages. Detecting such correlation is needed for proper cardinality estimation

CP3.3 is a chokepoint: index use is prominent in the whole query, and as the pruning of the considered pairs progresses, index accesses become more scattered

CP5.1 is a chokepoint:

CP5.2 is a chokepoint:

CP5.3 is a chokepoint: in general, there are many overlaps and opportunities for result reuse across the sub-queries

Query 23 (Marcus)

CP1.6 is a chokepoint: the number of groups is 12*110, so pretty low.

CP2.3 is a chokepoint: index vs hash-join depending on the country and the scale factor

CP2.4 is a chokepoint: the ratio of persons to messages is very small

CP3.3 is a chokepoint: if an index is used to access messages, it is likely that these accesses are very sparse

CP4.3 is a chokepoint: the extraction of the month from dates
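
A tiny sketch of the date-expression cost (the dates are invented): extracting (year, month) once per row and encoding it as a small integer keeps the grouping key cheap.

    # Date-expression sketch: extract the month once per message and use a
    # small integer (year*12 + month-1) as the grouping key.
    from collections import Counter
    from datetime import date

    creation_dates = [date(2012, 3, 1), date(2012, 3, 15), date(2012, 4, 2)]

    counts = Counter(d.year * 12 + (d.month - 1) for d in creation_dates)
    print({(k // 12, k % 12 + 1): v for k, v in counts.items()})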

Query 24 (Moritz)

CP1.6 is a chokepoint: the number of groups is 36*5

CP2.1 is a chokepoint: depending on the tagclass it is better to join continents and messages first and then tags, or the other way around

CP2.3 is a chokepoint: depending on the tagclass and thus the tags, an index join or a hash-join should be used

CP3.2 is a chokepoint: having messages clustered by date will improve locality

CP4.3 is a chokepoint: extracting year and months from dates

 

Query 25 (all)

CP1.2 is a chokepoint (many persons)

CP2.1 is a chokepoint (tons of joins)

CP2.2 is a chokepoint (tons of joins with different cardinalities)

CP2.4 is a chokepoint (very sparse joins between a reduced set of messages, with few results as the outcome of the join)

CP3.3 is a chokepoint (scattered index access patterns if index joins are used)

CP5.1 is a chokepoint

CP5.3

CP7.2

CP7.3



 
