
This page tracks progress on data generation based on the SIB benchmark and the S3G2 data generator. It serves as a meeting point for the task force to advance the topics raised in the first action point of the last call.

Document of work

Graph Database Benchmark Design Process

A discussion related to how we go about developing this benchmark can be found here

SNA Graph Schema

A proposal for a SNA graph schema based on the SIB benchmark can be found here

Query Sets for SNA Benchmarking

A discussion related to the different query sets of the SNA benchmark can be found here

Dataset generator

  • Current version: 080113
  • GDB output
    • Standard formats: GraphML or GEXF. Both formats are widely used by graph analysis tools. They are better suited to specifying nodes and edges as atomic units, with all their attributes, and to defining a proper sequence of insertions
    • Tabular CSV (comma separated values). This format allows a more efficient bulk load because all entities of the same type are grouped in a single file
  • Configuration file
    • size
      • # users
      • time scale
      • # posts
      • etc.
    • distributions and probabilities
    • dictionaries
  • [Norbert] The generator defines edges before their adjacent nodes have been created (e.g. sib:like). We need to check whether this is valid in GraphML or GEXF.
  • [Norbert] It seems that the generated dataset does not correspond exactly to the logical schema specified in the document (validation in progress)
  • [Norbert] Random generator:
    • Is it necessary to guarantee that datasets generated on different platforms or at different times have the same content?
    • Is it possible to do so with the current Hadoop-based generator?
  • Known problems:
    • The SIB document defines Users and Accounts, but the generator generates Persons and Users
    • Texts from DBPEDIA contain Unicode characters (UTF-8?). RDF readers such as JENA cannot process them in the current format.
    • All tokens extracted from a DBPEDIA entry are used as hashtags. It seems that some nodes have exactly the same hashtags as others, without any variability
    • Java VM memory size for large datasets
    • Hadoop timeout
  • Experimental validation
    • Convert to property graph and export in tabular format (1 computer, 2 QuadCores, 128GB mem)

      Filename | #Users | #Years | T sib | Size RDF | #Triples | Size tabular | #Nodes | #Edges | #Values | #Total
    • Sample files can be downloaded from <filename>.tar.gz

    • File data is compliant with the current property graph schema
    • Each row contains one node or one edge with comma-separated values
    • String values are not quoted, except for the text of posts and comments
    • The first value is always the node or the edge label
    • For nodes, after the label comes each attribute value in the same order as in the schema
    • For edges, after the label comes the source label and id, the destination label and id, and the attribute values
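
The row layout described above can be read with a few lines of Python. This is only a minimal sketch, not the task force's loader: the labels `user`, `post` and `like`, the sample rows, and the idea of telling edges from nodes with a fixed set of edge labels are all assumptions made for illustration. The real labels and attribute order come from the property graph schema.

```python
import csv
import io

# Hypothetical edge labels; the real set is defined by the property graph schema.
EDGE_LABELS = {"like", "knows"}


def parse_row(fields, edge_labels=EDGE_LABELS):
    """Turn one tabular row into a node or an edge record."""
    label = fields[0]  # the first value is always the node or edge label
    if label in edge_labels:
        # Edge layout: label, source label, source id,
        # destination label, destination id, attribute values...
        return {
            "kind": "edge",
            "label": label,
            "source": (fields[1], fields[2]),
            "target": (fields[3], fields[4]),
            "attributes": fields[5:],
        }
    # Node layout: label, then attribute values in schema order.
    return {"kind": "node", "label": label, "attributes": fields[1:]}


def parse_tabular(text, edge_labels=EDGE_LABELS):
    """Parse tabular text; csv.reader handles the quoted post/comment strings."""
    reader = csv.reader(io.StringIO(text))
    return [parse_row(row, edge_labels) for row in reader if row]


# Made-up sample rows, for illustration only.
sample = (
    "user,1,Alice\n"
    "user,2,Bob\n"
    'post,10,"Hello, world"\n'
    "like,user,2,post,10,2013-01-08\n"
)
records = parse_tabular(sample)
for record in records:
    print(record)
```

Quoting only the post and comment text is enough for a standard CSV reader, as long as those are the only values that can contain commas.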

Benchmark Execution

  • Performance metrics proposed in SIB
    • Queries per second
    • Query mixes per hour
    • Total execution time
  • [Norbert] Scaling (scale factor)
  • We are running some configurations of the generator to analyze the dataset sizes in GB and number of triples
  • [Norbert] reference data set and expected query results?
  • [Norbert, Miquel] frequencies of queries in query mix?
  • [Norbert, Miquel] dynamic arguments for queries?
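
As a rough illustration of how the three proposed metrics relate to each other, the sketch below computes them from recorded per-query run times. It assumes a single client running the query mixes one after another, so total execution time is the sum of the individual run times. The function name, the input layout and the timings are hypothetical and not taken from SIB.

```python
def throughput_metrics(mix_timings):
    """Compute the proposed metrics from per-query run times.

    mix_timings: list of query mixes, each a list of query run times in seconds.
    Assumes sequential, single-client execution (an assumption of this sketch).
    """
    total_seconds = sum(sum(mix) for mix in mix_timings)
    total_queries = sum(len(mix) for mix in mix_timings)
    if total_seconds <= 0:
        raise ValueError("total execution time must be positive")
    return {
        "total_execution_time_s": total_seconds,
        "queries_per_second": total_queries / total_seconds,
        "query_mixes_per_hour": len(mix_timings) * 3600 / total_seconds,
    }


# Made-up timings: two query mixes of three queries each.
metrics = throughput_metrics([[0.5, 1.5, 2.0], [1.0, 1.0, 2.0]])
print(metrics)
```

With concurrent clients, wall-clock time would have to replace the summed run times, and the open questions above (query frequencies in the mix, dynamic arguments) would determine what goes into each mix.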

