These are the notes for the Skype conference call held on Wednesday, May 29, 2013 (11:00-12:30), on the progress and activities of the Social Network Benchmark task force.
- Josep Lluís Larriba, Norbert Martínez (UPC); Peter Boncz, Renzo Anglés (VU); Pham Duc (CWI); Alex Averbuch (NEO)
First, Norbert explains the advances in the dataset generator (the complete list can be found at the end of this post) and the new Confluence page (DBGEN v1.0), which contains the current schema, descriptions, tables of data correlations and distributions, and RDF output examples. Renzo then explains the different tables in more detail: one with all the characteristics of each entity (e.g. attributes, correlations); a second with the characteristics of each relationship; a third enumerating the different features that the dataset generator uses for relationship and attribute generation; another listing the dictionaries; and, finally, the list of CSV files.
Renzo then raises some issues about RDF vocabularies (DBpedia namespace) and reification for attributed relationships with one-to-many cardinality. This discussion is postponed because Orri and Andrey are not present on this call.
Norbert mentions a problem with implementing multiple study_at relationships: study_at is being used as a sort key to establish friendships between people. Duc says that it is possible to create a combined key to sort inside the comparison windows, and Peter suggests using multiple Hadoop steps, one for each university. The discussion is postponed because Peter suggests that the priority is to guarantee a good Post-Tag distribution for the GRADES workshop (and the future GraphLAB). Norbert and Renzo agree to make this a priority.
Norbert also reports that this preliminary version on GitHub will not have a scale factor, but the configuration settings will be kept to a minimum. He also mentions that JIRA is being used to document and track all the issues. Peter then says that everyone involved needs an account, even if it costs money.
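To make the reification question concrete, here is a minimal Python sketch of how one attributed relationship could be written out using the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). It is only an illustration, not the generator's actual output: the `http://example.org/snb/` namespace, the `reify` and `to_ntriples` helpers, and the `classYear` attribute are all hypothetical.

```python
# Hypothetical namespace used only for this illustration.
EX = "http://example.org/snb/"
# Standard RDF namespace, which defines the reification vocabulary.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(stmt_id, subject, predicate, obj, attributes):
    """Return triples describing one relationship as a reified statement.

    The statement node carries the relationship's own attributes, so a
    person can hold several study_at edges, each with distinct values.
    """
    node = f"<{EX}{stmt_id}>"
    triples = [
        (node, f"<{RDF}type>", f"<{RDF}Statement>"),
        (node, f"<{RDF}subject>", f"<{EX}{subject}>"),
        (node, f"<{RDF}predicate>", f"<{EX}{predicate}>"),
        (node, f"<{RDF}object>", f"<{EX}{obj}>"),
    ]
    # Attach each relationship attribute to the statement node as a literal.
    for name, value in attributes.items():
        triples.append((node, f"<{EX}{name}>", f'"{value}"'))
    return triples


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(reify("stmt1", "person42", "study_at", "univ7",
                            {"classYear": 2008})))
```

Each additional study_at edge for the same person would get its own statement node, which is what makes the one-to-many case expressible.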
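The combined-key idea can be sketched as follows. This is a rough Python illustration under stated assumptions, not the generator's code: the record layout (`id`, `study_at`), the function names, and the choice to order universities as a tuple are all hypothetical. Persons are sorted by a key built from all their universities, and friendship candidates are drawn only from neighbours inside a sliding window over that order.

```python
def combined_key(person):
    """Sort key built from all universities of a person.

    Assumption: study_at is an ordered list, primary university first.
    The person id breaks ties so the order is deterministic.
    """
    return (tuple(person["study_at"]), person["id"])


def window_candidates(persons, window_size):
    """Return candidate friend pairs among neighbours in sorted order.

    Each person is paired with at most window_size persons that follow
    it after sorting by the combined key.
    """
    ordered = sorted(persons, key=combined_key)
    pairs = []
    for i, p in enumerate(ordered):
        for q in ordered[i + 1 : i + 1 + window_size]:
            pairs.append((p["id"], q["id"]))
    return pairs


if __name__ == "__main__":
    people = [
        {"id": 1, "study_at": ["UPC"]},
        {"id": 2, "study_at": ["VU"]},
        {"id": 3, "study_at": ["UPC", "VU"]},
        {"id": 4, "study_at": ["CWI"]},
    ]
    print(window_candidates(people, window_size=1))
```

With a tuple key, persons sharing the same first university end up adjacent, while the secondary universities only refine the order within that group. Peter's alternative of one Hadoop step per university would instead repeat the sort once per study_at value.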
- Define the final RDF output
- Post-Tag distribution
- Add all issues to JIRA
- Multiple study_at: evaluate the different options
Appendix: generator improvements
- modify dictionaries to relate ID with NAME
- use new tags instead of HAS_TAG for POST
- manually create the Country-Region dictionary from Wikipedia
- Person is now based near a City, not a Country
- generate and export multiple emails
- multiple organizations
- add "Wall" title
- extract the Universities with full names from DBpedia (not limited to 20 characters as now)
- cleanup Locations
- cleanup Names and Surnames
- cleanup Languages
- remove extra URL data
- write in UTF-8 format
- unify common code in RDF serializers (TURTLE and N3)
- modify column separator in CSV
- export LANGUAGE
- export LOCATION hierarchy
- export BASED_NEAR of ORGANIZATION
- export IPADDRESS as entity
- add GPLv3 license to source code
- cleanup of unused files
- rename of packages and executables
- create MAVEN configuration file