Skip to end of metadata
Go to start of metadata

The first LDBC graph database benchmark proposal is getting to a stage where schema and queries are being proposed, but we (partially because we come from different database backgrounds - relational, RDF, graph) clearly have difficulties agreeing on certain points:

  • Which aspects (e.g. graph schema) of a benchmark should be strictly enforced and which can be left to vendors to decide?
  • How do we design a benchmark that both tests the system (rather than the engineers) and is portable across database models (relational, RDF, graph)?

The definitions and arguments in this document are only intended to apply within the context of the LDBC Social Network graph database benchmark, we’re not proposing industry-wide definitions.

Definitions

Graph Database Model

  • Data Model = Labeled, Attributed, Directed Graph
  • Synonyms
    • Node == Vertex
    • Relationship == Edge
    • Property == Attribute
  • Functional
    • Collection of labeled, attributed, nodes and edges
    • Labels
      • Express the “type” of a node or edge
      • If a system does not support labels, labels may be stored as attributes
      • A label can apply to a set of nodes or set edges, not a mix of both
    • Nodes
      • Have a unique identifier
        • For this benchmark, there must be a way (for S3G2) to specify the ID - the unique ID must be settable.
      • Have a label
      • Have zero or more attributes
      • Have zero or more edges
    • Edges
      • Have a unique identifier (still under discussion)
      • Have a label
      • Have zero or more attributes
      • Connect two nodes
      • Can be directed or undirected
        • If system does not support directed/undirected edges, a workaround using multiple edges or attributes on the edges is acceptable
    • Attributes
      • Attributes are key-value pairs, where values are typed. For example:
        • String
        • Integer
        • Double precision float
        • Boolean
        • Time stamp
      • Attributes are associated with nodes and edges. In other words, a node/edge has attributes
      • Unlike nodes and edges, attributes are not part of the graph topology
  • Non-functional
    • Indices

Graph Schema

  • Node types
    • Specified by label
    • Compulsory attributes
    • Optional attributes
  • Edge types
    • Specified by label
    • Compulsory attributes
    • Optional attributes
    • Optional direction
    • Start/source node
    • End/destination node
  • Connectivity
    • Which node types are connected to each other, and which edge types are used to connect them
    • What is the cardinality of relationships between node types
  • Example
    • Text: 
      "Forum nodes have an 'owner_name' attribute to identify the User that created the Forum.
      A Forum may contain many comments, which each have a 'content' attribute that stores their text.
      Comment nodes are connected to their containing Forum via directed Contains edges, which have a 'date' attribute to specify when the Comment was added to the Forum"

    • Formal:
         Node : Forum {
            id: String (UNIQUE),
            owner: String
         }
         Node : Comment {
            
      id: String,
            
      created: Timestamp(OPTIONAL),
            
      content: String
         }
         Edge : Contains {
            
      id: String (UNIQUE),
            
      date: TimeStamp
         }
         Forum [0,*] - Contains -> [1,1] Comment

Graph Queries (do we need these definitions in the context of this document?)

  • Query types generally fall into two groups 
    1. For graphs consisting of a large number of small graphs (bioinformatics, cheminformatics, repositories of business process models) there are two query types:
      • Subgraph query which searches for all graphs in the database such that a given query graph is a subgraph of them.
      • Supergraph query that searches for all graphs in the database that are subgraphs of the given query graph
    2. For graphs consisting of one or a small number of large graphs (social, bibliographical, knowledge bases) there are three query types:
      • Pattern match query that searches for a pattern graph (e.g. path, star, subgraph) in the large graph.
      • Reachability query that verifies if there exists a path between any two vertices in the large graph.
      • Shortest path query, a variant of the reachability query, returns the shortest path between any two vertices in the large graph.
  • So far we have only considered Group 2 for this benchmark, which seems logical given the domain

Analytic Graph Queries (do we need these definitions in the context of this document?)

  • The following attempts to capture all query types not covered by the above classification, analytical tasks in particular:
    • Communities
    • Density
    • Clustering Coefficient
    • Connected Components
    • Centrality Measures
    • PageRank

Process & Goals

Benchmark Goals

  • Which aspects of the benchmark definition do we want to constrain, and what impact does constraining, or not constraining, a variable have on the meaning and meaningfulness of a benchmark?
    • Constrained: items constrained by the benchmark definition. For example, in TPC-H the SQL statements to execute are constrained (right?), they are provided in the benchmark definition and can not be altered.
    • Unconstrained: items not constrained by the benchmark definition. For example, internal data structures used by the database vendor.
  • What is it we want to compare with the benchmarks?
  • If we constrain only the application domain (database vendors can choose how to model that domain), what impact does it have on the benchmark?
  • If the graph topology (a product and instance of the graph schema) is different in the benchmark implementation of each vendor, how do we design vendor agnostic queries?
    • Pattern matching queries search for patterns in a graph, graphs of differing topology will, by definition, contain different patterns. For each vendor the resultset of a given query will differ.
    • Similarly, the result of a shortest path query will differ when executed on graphs with different topologies.
  • Data Generation: besides the fact that we can’t modify S3G2 until we have the schema defined, there are questions related to data generation.
    • In what format should 3SG2 create the “graph”?
    • What do we lose/gain by having S3G2 create the dataset with a given schema. For example, how do we design fair bulk load benchmarks if some vendors have to translate the the S3G2 graph (which has a given schema) into their own schema before/while loading it?

Graph Schema & Query Design

  • Regardless of whether they’re constrained or not, what should the schema/query design process look like?
  • Few (none?) of us have developed a real social network application, even fewer (if that’s possible) have done so using a graph database.
  • Under "Outcomes", on the LDBC website (http://www.ldbc.eu/project), we write: "target real-world usage scenarios.". To make the benchmark schema more representative of real application, maybe we should involve more industry partners (graph database users that develop social network applications)?
    • Neo4j users
    • DEX users
    • OrientDB users
    • InfiniteGraph users
  • At what level/stage should we involve these partners?
  • Examples of data structures that we need to design meaningful schemas for:
    • Feeds
    • Conversations
    • Geographical data
  • No labels