RDF and Graph Databases (GDB) have a different approach for the data representation and queries. RDF is based in triples, that are usually represented and implemented as relations or tables with 3 or 4 char columns. Queries in RDF are solved using SPARQL, which is a triple pattern matching version of SQL and, in general, a sequence of relational operators (joins, etc...) over a relational schema. GDBs and Graph Frameworks (such as BSP-based) are generally based in nodes as entitites and collections of adjacencies of edges and/or neighbors. Queries in GDBs are usually traversal patterns with an intensive access to the adjacencies. Adjacencies are implemented and optionally indexed in multiple ways such as adjacency lists or any other similar data structure that represents collections of graph objects associated to another graph object.
While the semantics of the GDBs operations can be easily mapped to RDF/relational expressions with RDF/relational operators, most probably they are not always the same in terms of choke points. For example, a join is not necessarily a stress situation when collections of adjacencies are part of the model: counting the edges (adjacencies) of a hub is different as counting the number of (edge) triples joined to a (node) triple.
Some of the potential graph query choke points (and their variants) could be:
- DEGREE: counting the outgoing or ingoing degree of a node is to count the length of an adjacency list. Counting the degree of a particular edge label implies a filter over an adjacency list. Some examples are to find the friends of my friends, or to find the the most commented post (largest hub). Variations:
- Counting all the outgoing and ingoing degree means that the storage can not be optimized for a particular direction of directed edges.
- AGGREGATE ON EDGE ATTRIBUTES: rank or compute arithmetics over an attribute of all the edges (or a subset) of a node is a usual operation in ranking or viral propagation. This computation in large hubs are usually expensive. Some system directly support edge weights (or edge attributes) embedded into the adjacency lists. Examples are PageRank, or recommendation systems based on user ratings.
- REACHABILITY: from a starting node, obtain recursively the neighbor nodes until no more appear (connected component), and return the collection of neighbors. Variations:
- until a max number of steps (k-hops).
- SHORTEST PATH: similar to reachability but until it reaches a destination node. It returns the sequence of nodes (or edges) to be traversed. Variations:
- retrieve all the shortest paths between two nodes
- obtain a weighted shortest path between two points (e.g. for route estimation)
- SUBGRAPH MATCHING:find repetitive patterns inside the graph, in general small but massive, such as cliques (e.g. triangles) or cycles. Variations:
- restrict the search to a subgraph, for example the largest connected component.