These are the notes for the Skype conference call held on Wednesday, February 20, 2013, on the progress and activities of the Social Network Benchmark task force (11:30-12:45).
- Attendees: Larri, Norbert (UPC); Orri (OGL); Peter, Duc, Renzo (VUA); Alex Averbuch (NEO)
Peter asks Norbert to explain the results of the analysis of the SIB-generated synthetic graph that UPC has loaded in DEX. Norbert reports that the graph has 244M vertices and 1.1B edges, and highlights the following issues identified in the analysis of the User-Knows relationship: the average clustering coefficient of 0.08 is too low, as it should be at least 0.3; the diameter of 5 is too short, as it should be around 9 to 11; the hop-plot grows too fast between 3 and 4 hops; and the average path length of 3.84 is too close to the diameter. Norbert also explains that none of the edge degree distributions clearly shows a power law (an almost straight line in a log-log plot), and that several of the time-evolution series end with anomalous behavior in their last two or three months that does not follow the expected distribution. He points out that this may be caused by the S3G2 generator, which tries to adjust the time distributions at the end of the process.
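As an aside (not part of the minutes), the average clustering coefficient Norbert refers to is simple to compute over an adjacency-set view of the User-Knows graph; the toy graph and function below are illustrative only, not UPC's actual analysis code.

```python
# Sketch: average local clustering coefficient of an undirected graph,
# the metric that came out as 0.08 in UPC's analysis (target: > 0.3).
def avg_clustering(adj):
    """adj maps each vertex to the set of its neighbours."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # local clustering is 0 for degree < 2
        # count edges among the neighbours of v (each pair once)
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Toy example: a triangle (vertices 1, 2, 3) plus a pendant vertex 4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(avg_clustering(adj))  # ~0.583: (1 + 1 + 1/3 + 0) / 4
```

On a friendship graph of SNB scale this would of course need a streaming or sampled implementation rather than a full pass, but the definition is the same.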
Peter remarks that the average clustering coefficient is perhaps the most worrying issue, and he would also like to know which distributions are expected to follow a power law, and with what shape. Duc explains that it may be possible to adjust some of the distributions by changing the settings of the generator, but there are strong interactions between these parameters (up to 11 for User-Knows) and in several cases it is not possible to force a power-law distribution. He also points out that the User-Knows degree distribution may be affected by the size of the graph (the number of users), and Larri then asks whether the generator produces truly scale-free distributions. Regarding the size of the graph, Norbert says that with the current schema and settings, a graph with 100K users and two years of activity generates more than 1.1B edges. The experiments UPC has run suggest a direct relationship between the final size and both the number of users and the time period: doubling the users doubles the graph. This means that even a simple SN graph with a few million users will most probably have hundreds of billions of edges and several TB of data.
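A back-of-the-envelope sketch of the scaling relationship described above, calibrated on the single data point from the call (100K users, two years, ~1.1B edges) and assuming the linear scaling UPC's experiments suggest; the function name and constant are illustrative, not part of any agreed tool.

```python
# Calibration point from the call: 100K users over 2 years -> ~1.1B edges.
EDGES_PER_USER_YEAR = 1.1e9 / (100_000 * 2)  # ~5500 edges per user-year

def estimated_edges(users, years):
    """Predicted edge count, assuming size grows linearly with both
    the number of users and the simulated time period."""
    return users * years * EDGES_PER_USER_YEAR

# Doubling the users doubles the graph, as remarked in the call:
print(estimated_edges(200_000, 2) / estimated_edges(100_000, 2))  # 2.0
```

A formula like this is essentially what a true "scale factor" (see the action items below) would invert: pick a target size, solve for the user count and time period.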
Moreover, Alex points out that the generator does not emit the triples sorted by time or by dependencies, as Peter and Duc confirm, and he remarks that if the incremental updates are generated in the same way, parallel loads will be impossible. Norbert adds that this could also be a problem for the batch load of the initial graph. Alex also asks for a log with the timestamps of the generated entities, to allow a post-process to sort or organize the data before loading it into the graph, but Norbert points out that the cost of this reorganization could be unaffordable for large graphs, and that timestamp ordering should be guaranteed by the generation process itself. Orri remarks that this is not a problem for RDF, because there are no structural dependencies between triples. Larri suggests postponing this discussion until the final logical graph schema and the basic batch generator are available. The final decision is that Norbert will provide Duc with a list of the proposed changes to the generator, sorted by priority, and the two of them will adjust the parameters to try to find better distributions and graph statistics. Peter suggests that Duc could come to Barcelona for a week to join efforts with UPC, and Norbert says that this would be a good moment to modify the generator to export directly in a graph-like format, although the export format has not yet been agreed with Alex.
Finally, Renzo, Alex and Norbert will continue working on the logical schema. Norbert states that some issues are still open, such as merging User and Person (almost agreed by everyone), merging Photo and Post, and adding new Twitter-like relationships (followers and retweets). Duc explains that the generator uses different processes and distributions for Photo and Post and that this might be difficult to modify, so Norbert suggests leaving the generator unchanged and only merging both collections into a single Post entity. Also, Twitter-like edges are still being analyzed by the UPC SocialMedia team, and it is not yet clear whether they can be simulated in the current schema or need to be added to the generator. Finally, Norbert reminds everyone that all the proposed changes should be agreed with the RDF teams, and Orri points out that initially he does not see any major issue in the current list of modifications to the schema.
- approach some social network users for the TUC meeting in Munich
- UPC to continue working with the S3G2 data generator and adapt it (document of work here)
- together with Duc
- create a true 'scale factor' that allows to predictably generate a dataset of a certain size
- modify the generator settings to try to fit better to the expected distributions and graph statistics
- modify the generator to export data in graph-like format
- together with Alex and Renzo
- logical schema of the SN graph
- propose the standard graph format to export graph data
- UPC to approach Accesso and Havas Media (partly done - postponed now)
- show the social network schema and query sets and ask for feedback (done)
- try to obtain real datasets (Facebook, Twitter, etc)
- run existing DEX code to compute graph metrics and compare with S3G2
- timestamps in S3G2 [ postponed ]
- Duc to explain the current situation (what timestamps are generated, and with what constraints and correlations)
- Duc to explain and share his stream data generator
- UPC to enhance the stream generator to generate an update query workload
- UPC to design a mechanism to separate an S3G2 dataset into a snapshot and a subsequent stream of updates
- query choke points
- Andrey to modify the transactional queries so they include optionals and are affected by correlations (add more parameters)
- include work to devise a mechanism to "learn" similar parameter bindings for correlated parameters that yield the same selectivities
- Andrey to encode the analytical SIB queries in SPARQL
- providing more choke point ideas
- Orri offered to provide more, input from others also welcome (Peter?, Andrey/Thomas?, ...)