This page tracks progress on data generation based on the SIB benchmark and the S3G2 data generator. It serves as a meeting point for the task force to advance the topics raised in the first action point of the last call.
Document of work
Graph Database Benchmark Design Process
A discussion related to how we go about developing this benchmark can be found here
SNA Graph Schema
A proposal for a SNA graph schema based on the SIB benchmark can be found here
Query Sets for SNA Benchmarking
A discussion related to the different query sets of the SNA benchmark can be found here
- Current version: 080113
- GDB output
- Standard formats: GraphML (http://graphml.graphdrawing.org/) or GEXF (http://gexf.net/format/). These are the two formats most widely supported by graph analysis tools. They are better suited to specifying nodes and edges as atomic units, with all their attributes, and to defining a proper sequence of insertions
- Tabular CSV (comma-separated values). This format allows more efficient bulk loading because all entities of the same type are grouped in a single file
- Configuration file
- # users
- time scale
- # posts
- distributions and probabilities
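As a sketch, the configuration parameters listed above could be expressed in an INI-style file and read with Python's `configparser`. All key names below are illustrative assumptions, not the generator's actual keys.

```python
from configparser import ConfigParser

# Illustrative configuration covering the parameters listed above;
# every key name here is an assumption, not the generator's real key.
SAMPLE = """
[generator]
numUsers = 10000
numYears = 3
maxPostsPerUser = 1000
friendshipProbability = 0.05
postsPerMonthDistribution = powerlaw
"""

cfg = ConfigParser()
cfg.read_string(SAMPLE)
print(cfg.getint("generator", "numUsers"),
      cfg.getfloat("generator", "friendshipProbability"))
```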
- [Norbert] The generator defines edges before the adjacent nodes have been created (e.g. sib:like). We need to check whether this is valid in GraphML or GEXF.
- [Norbert] It seems that the generated dataset does not correspond exactly to the logical schema specified in the document (validation in progress)
- [Norbert] Random generator:
- Is it necessary to guarantee that datasets generated on different platforms, or at different times, have identical content?
- Is it possible to do it with the current hadoop-based generator?
- Known problems:
- The SIB document defines Users and Accounts, but the generator generates Persons and Users
- Texts from DBpedia contain Unicode characters (UTF-8?). RDF readers such as Jena cannot process them in the current format.
- All tokens extracted from a DBpedia entry are used as hashtags. It seems that some nodes have exactly the same hashtags as others, with no variability
- Java VM memory size for large datasets
- Hadoop timeout
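On the reproducibility question raised above, one common approach is to derive every random decision from a hash of the global seed and the entity's identifier, so that the output is independent of which Hadoop task, machine, or run produces it. A minimal sketch (function and parameter names are hypothetical, not part of the generator):

```python
import hashlib
import random

def entity_rng(global_seed: int, entity_id: str) -> random.Random:
    """Derive a per-entity RNG whose stream depends only on the global
    seed and the entity id, not on scheduling or platform."""
    digest = hashlib.sha256(f"{global_seed}:{entity_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# Two runs (or two machines) with the same seed produce identical values.
a = entity_rng(42, "user-17").randint(0, 10**6)
b = entity_rng(42, "user-17").randint(0, 10**6)
assert a == b
```

Because the per-entity seed is a pure function of its inputs, reducers can generate entities in any order and still produce the same dataset.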
- Experimental validation
Conversion to a property graph and export in tabular format (one machine, 2 quad-core CPUs, 128 GB RAM)
Filename   #Users   #Years   T sib   Size RDF   #Triples   Size tabular   #Nodes    #Edges    #Values   #Total
graph400   400      1        3m      0.6GB      16.9M      363MB          1.53M     6.85M     7.45M     15.83M
           10000    1        6m      4.7GB      126.5M     2.8GB          11.15M    51.86M    54.66M    117.67M
           20000    3        24m     32GB       839.7M     19GB           75.57M    304.78M   369.76M   750.11M
           100000   1        38m     48GB       1259.8M    29GB           111.18M   518.26M   544.93M   1.15G
           100000   2        74m     103GB      2.73B      63GB           244.81M   1.13G     1.20G     2.58G
           200000   2        151m    207GB      5.46B      126GB          489.11M   2.22G     2.39G     5.10G
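A quick arithmetic check on the figures above suggests the triple count grows roughly linearly with users × years, with the smallest (400-user) run as the outlier:

```python
# Rows taken from the table above: (#Users, #Years, #Triples).
rows = [
    (400, 1, 16.9e6),
    (10_000, 1, 126.5e6),
    (20_000, 3, 839.7e6),
    (100_000, 1, 1259.8e6),
    (100_000, 2, 2.73e9),
    (200_000, 2, 5.46e9),
]
for users, years, triples in rows:
    # Triples per user-year; roughly constant beyond the smallest run.
    print(users, years, round(triples / (users * years)))
```

Beyond the 400-user configuration, the ratio stays in the range of roughly 12.6k to 14k triples per user-year.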
Sample files can be downloaded from http://fedra.pc.ac.upc.edu/LDBC/<filename>.tar.gz
- File data is compliant with the current property graph schema
- Each row contains one node or one edge with comma-separated values
- String values are not quoted, except for posts and comments
- The first value is always the node or the edge label
- For nodes, after the label comes each attribute value in the same order as in the schema
- For edges, after the label comes the source label and id, the destination label and id, and the attribute values
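The row layout described above can be parsed in a few lines of Python. The labels, ids, and attribute values in this sample are invented for illustration; the real ones come from the property graph schema.

```python
import csv
from io import StringIO

# Hypothetical rows following the layout described above: the first
# value is the label; nodes then list attribute values in schema order,
# edges list source label+id, destination label+id, then attributes.
SAMPLE = """person,1,Alice,1985
person,2,Bob,1987
knows,person,1,person,2,2010-06-01
"""

NODE_LABELS = {"person"}  # assumption: node labels known from the schema

nodes, edges = [], []
for row in csv.reader(StringIO(SAMPLE)):
    label, rest = row[0], row[1:]
    if label in NODE_LABELS:
        nodes.append((label, rest))
    else:
        src, dst = (rest[0], rest[1]), (rest[2], rest[3])
        edges.append((label, src, dst, rest[4:]))

print(len(nodes), len(edges))
```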
- Performance metrics proposed in SIB
- Queries per second
- Query mixes per hour
- Total execution time
- [Norbert] Scaling (scale factor)
- We are running several configurations of the generator to analyze dataset sizes (in GB) and numbers of triples
- [Norbert] reference data set and expected query results?
- [Norbert, Miquel] frequencies of queries in query mix?
- [Norbert, Miquel] dynamic arguments for queries?
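For reference, the three SIB metrics listed above can be computed from per-query execution times as in the sketch below; the timings are made up for illustration.

```python
# Per-query execution times for one query mix, in seconds (invented values).
timings = [0.12, 0.30, 0.05, 0.18, 0.25]

total_time = sum(timings)        # total execution time of the mix (s)
qps = len(timings) / total_time  # queries per second
qmph = 3600 / total_time         # query mixes per hour

print(round(total_time, 2), round(qps, 2), round(qmph))
```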