These are the notes for the Skype conference call held on Wednesday 19 December 2012 on the progress and activities of the Social Network Benchmark task force (10am-12am).
- Larri, Miquel (UPC); Orri (OGL); Andrey (TUM); Peter (VUA) + Duc (CWI) in the last part
The meeting started with a discussion of the data generator. Peter and Orri suggested that the S3G2 data generator should be the starting point. Larri and Miquel suggested that further work on data generation might be needed. A specific question concerned the generation of timestamp information. Peter answered that S3G2 generates timestamp values based on frequency distributions, while enforcing certain constraints that guide consistency (making a friend has a date after joining the network, discussions are populated by people who are friends at the time of the discussion). Peter noted that while there is an idea behind generating dates, the current implementation in S3G2 may be incomplete and/or buggy.
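To make the two consistency constraints concrete, here is a minimal sketch in Python. It is not the S3G2 implementation: the function names, the date range, and the use of a uniform distribution (instead of the frequency distributions Peter mentioned) are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta


def random_date(rng, start, end):
    """Pick a datetime in [start, end]; uniform sampling is an assumption."""
    span = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randint(0, span))


def generate(num_users=5, num_friendships=6, seed=42):
    """Hypothetical generator that enforces the two constraints from the notes."""
    rng = random.Random(seed)
    network_start = datetime(2010, 1, 1)   # assumed date range
    network_end = datetime(2012, 12, 31)

    # Every user joins the network at some point before the end date.
    joined = {
        user: random_date(rng, network_start, network_end - timedelta(days=30))
        for user in range(num_users)
    }

    # Constraint 1: a friendship is dated after both users have joined.
    friendships = {}
    while len(friendships) < num_friendships:
        a, b = sorted(rng.sample(range(num_users), 2))
        if (a, b) in friendships:
            continue
        earliest = max(joined[a], joined[b])
        friendships[(a, b)] = random_date(rng, earliest, network_end)

    # Constraint 2: a discussion between two people is dated after they
    # became friends, so they are friends at the time of the discussion.
    discussions = [
        (a, b, random_date(rng, since, network_end))
        for (a, b), since in friendships.items()
    ]
    return joined, friendships, discussions


joined, friendships, discussions = generate()
```

Drawing each dependent date from the interval that starts at the date it depends on keeps the constraints true by construction, so no generated value has to be rejected afterwards.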
Miquel also mentioned the possibility of obtaining real datasets from Facebook and Twitter. Peter asked that, if possible, such datasets be shared with the other consortium members once available. Further, it would be interesting to analyze them by computing graph metrics (centrality, betweenness, diameter, etc.).
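As an illustration of the kind of metrics meant here, the following sketch computes degree centrality and diameter on a tiny made-up friendship graph using only the Python standard library. It is not the DEX code referred to in these notes; the graph and the function names are assumptions, and betweenness is left out for brevity.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter(adj):
    """Longest shortest path; assumes the graph is connected."""
    return max(max(bfs_distances(adj, node).values()) for node in adj)


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


# Made-up undirected friendship graph, stored as adjacency sets.
friends = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b"},
    "d": {"b", "e"},
    "e": {"d"},
}

print("diameter:", diameter(friends))
print("degree centrality:", degree_centrality(friends))
```

Running the same functions on a generated dataset and on a real one would give a first, rough comparison of their structure.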
A related discussion followed on using the timestamp information to generate a stream of graph pieces, that is, a stream of data inserts. This has been done in primitive form by Duc for the stream query experiments in the ISWC paper (Linked Stream Data Processing Engines: Facts and Figures http://link.springer.com/content/pdf/10.1007%2F978-3-642-35173-0_20). The approach is thus to generate a full dataset first, and then "play it out" as a stream. UPC raised the point that one would normally not want to start at zero, but at an initial dataset size, and then play out further inserts. Peter concluded that extra work is needed to do this.
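The idea of starting from an initial dataset and then playing out the rest can be sketched as a split on a cutoff timestamp. This is only an illustration of the idea, not Duc's stream generator: the event tuples, the cutoff, and the function names are hypothetical.

```python
from datetime import datetime


def split_snapshot_and_stream(events, cutoff):
    """Split (timestamp, payload) events into an initial snapshot and a stream.

    Events dated up to and including the cutoff form the snapshot; later
    events are kept in timestamp order so they can be replayed as inserts.
    """
    ordered = sorted(events, key=lambda event: event[0])
    snapshot = [event for event in ordered if event[0] <= cutoff]
    stream = [event for event in ordered if event[0] > cutoff]
    return snapshot, stream


def play_out(stream):
    """Yield the remaining events one by one, oldest first."""
    for timestamp, payload in stream:
        yield timestamp, payload


# Made-up events from a fully generated dataset, in arbitrary order.
events = [
    (datetime(2012, 3, 1), "person 2 joins"),
    (datetime(2011, 5, 9), "person 1 joins"),
    (datetime(2012, 6, 15), "person 1 befriends person 2"),
    (datetime(2012, 9, 30), "person 2 posts in a discussion"),
]

snapshot, stream = split_snapshot_and_stream(events, cutoff=datetime(2012, 4, 1))
for timestamp, payload in play_out(stream):
    print(timestamp.date(), "INSERT", payload)
```

Choosing the cutoff by date is one option; choosing it so that the snapshot reaches a target size would be another.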
The discussion then moved to the query workload. Miquel proposed that we first focus on a single workload. Orri and Peter were of the opinion that data generation cannot be seen separately from the query workloads, since the query workloads are what introduce certain requirements for the data generation. Orri further made the case for extending the query workloads to three kinds: (i) transactional, (ii) business intelligence queries, and (iii) graph algorithms. Andrey then described his recent efforts in analyzing the query workloads using Virtuoso v6, mostly the transactional workload as on the SIB W3C page. See also the link to the SIB query choke point coverage matrix. His conclusions were that (i) not all choke points are covered, (ii) data correlations do not yet affect these queries, and (iii) some queries are very similar, so the query set could be reduced. Peter suggested that the TUC members of the task force take a look at the queries of both the transactional and analytical SIB sets to see whether they are representative, and that the queries be improved so that they cover OPTIONAL clauses and are affected by correlations. Finally, a call to suggest more choke points was issued.
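The first and third of Andrey's conclusions can be read directly off a coverage matrix. The sketch below shows one way to do that; the query and choke point labels are made up and do not reflect the actual SIB matrix, and treating "very similar" as "covers exactly the same choke points" is a simplifying assumption.

```python
def uncovered_choke_points(coverage, all_choke_points):
    """Return the choke points that no query in the workload exercises."""
    covered = set().union(*coverage.values())
    return sorted(set(all_choke_points) - covered)


def redundant_queries(coverage):
    """Group queries that cover exactly the same set of choke points."""
    groups = {}
    for query, choke_points in coverage.items():
        groups.setdefault(frozenset(choke_points), []).append(query)
    return [sorted(queries) for queries in groups.values() if len(queries) > 1]


# Hypothetical coverage matrix: query -> choke points it exercises.
coverage = {
    "Q1": {"CP1", "CP2"},
    "Q2": {"CP2"},
    "Q3": {"CP1", "CP2"},
}
all_choke_points = ["CP1", "CP2", "CP3"]

print("not covered:", uncovered_choke_points(coverage, all_choke_points))
print("candidates for merging:", redundant_queries(coverage))
```

Queries grouped this way are only candidates for removal; whether two queries really stress the system in the same way still needs a manual look.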
- UPC to start working with the S3G2 data generator and adapt it (document of work here)
  - ask Duc (email@example.com) for help where needed
  - make a tabular data generator option
  - import the data into DEX and start playing with it
  - run existing DEX code to compute graph metrics
- UPC to try to obtain real datasets (Facebook, Twitter, etc.)
  - run existing DEX code to compute graph metrics and compare with S3G2
- timestamps in S3G2
  - Duc to explain the current situation (what timestamps are generated and with what constraints and correlations)
  - Duc to explain and share his stream data generator
  - UPC to enhance the stream generator to generate an update query workload
  - UPC to design a mechanism to separate an S3G2 dataset into a snapshot and a subsequent stream of updates
- query choke points
  - Andrey to modify the transactional queries so they include optionals and are affected by correlations (add more parameters)
    - include work to devise a mechanism to "learn" similar parameter bindings of correlated parameters with the same selectivities
  - Andrey to encode the analytical SIB queries in SPARQL
  - providing more choke point ideas
    - Orri offered to provide more, input from others also welcome (Peter?, Andrey/Thomas?, ...)