
Summary of Activities

Participants

Minh-Duc Pham (CWI), Renzo Anglés (VUA), Xavi Sánchez, Norbert Martínez (UPC)

Activities (in time sequence)

  • Explanation in detail of the components and modules of the SIB generator
  • Draft document of the generator phases, algorithms and settings
  • Several datasets were generated in order to understand the library. Analysis of the power-law distribution obtained with the SSJ Java library (10K users)
  • Solved problem: invalid parameter passed to the SSJ power-law distribution
  • Analysis of the friendship relationship for 10K users with the new power-law distribution: the average clustering coefficient and degree distribution improved, but there is an incorrect accumulated peak at 75% of the tail
  • Experiments varying the window size and friendship probability interval: some potential friendships are lost because no more relationships can be established
  • Conference calls with Orri to discuss the design and SQL algorithms for the TAG and TWEET entities and the HASHTAG and FOLLOWS relationships.
  • Solved problem: relaxed the probability interval to establish more friendships inside the window of candidates
  • Analysis of the friendship relationship for 100K users: the degree distribution resembles a power law. The plots still show one peculiarity (the curve is dispersed at the beginning) that we expect to analyze in more depth in the future.
  • Fixed a bug where popular places (e.g., cities) were not generated for photo locations.
  • New logical schema proposal based on an exhaustive analysis of all the interactive and BI query proposals, choke points and data correlations. Graph representation published in SNA Graph Schema Proposal
  • Agreement to restrict the exporters in v1.0 to TURTLE, N3 and CSV data formats
  • Entities PERSON and USER merged
  • Entities POST and PHOTO merged
  • Extracted a new LANGUAGE dictionary from the CIA Factbook
  • Extracted celebrities from DBpedia, with their number of references and country. Built dictionaries for celebrities with the location correlation, calculating the cumulative distribution of the celebrities in each location based on the number of references.
  • Built dictionaries of celebrities, topics and their co-occurrence counts, restricted to 5000 celebrities with approx. 1E6 topics.
  • Generated the new TAG entity and re-computed the INTEREST dimension based on the users' tags (which are correlated celebrities and topics)
  • New dictionary with the location hierarchy: CITY, COUNTRY, CONTINENT, AREA
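
To illustrate the per-location cumulative distribution used for the celebrity dictionaries, here is a minimal Python sketch. It is not the generator's code: the function names, the example celebrities and their reference counts are all hypothetical, and Python is used only for brevity. It assumes that a celebrity is picked with probability proportional to its number of references within one location.

```python
import bisect
import itertools
import random


def build_cumulative(ref_counts):
    """Return (names, cumulative): names sorted by descending reference count,
    and cumulative[i] = share of all references covered by names[0..i]."""
    names = sorted(ref_counts, key=ref_counts.get, reverse=True)
    total = sum(ref_counts.values())
    running = itertools.accumulate(ref_counts[n] for n in names)
    cumulative = [r / total for r in running]
    return names, cumulative


def pick_celebrity(names, cumulative, u):
    """Map a uniform value u in [0, 1) to a celebrity through the cumulative
    distribution (inverse transform sampling)."""
    idx = bisect.bisect_right(cumulative, u)
    # Guard against rounding in the last cumulative value.
    return names[min(idx, len(names) - 1)]


# Hypothetical reference counts for a single location (illustrative only).
EXAMPLE_REFS = {"celebrity_a": 60, "celebrity_b": 30, "celebrity_c": 10}

if __name__ == "__main__":
    names, cumulative = build_cumulative(EXAMPLE_REFS)
    rng = random.Random(42)  # fixed seed so runs are repeatable
    print([pick_celebrity(names, cumulative, rng.random()) for _ in range(5)])
```

Storing the cumulative values once per location keeps each draw to a single binary search, which matters when the dictionary holds thousands of celebrities.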

Expected Tasks

  • each person may have multiple
    • emails
    • browsers
    • universities
    • works
    • languages
      • Correlated with the country (based on data taken from the CIA Factbook). A certain percentage of the people in a country should speak the main language, plus there should be some minorities. Maybe add English as an international language to many profiles regardless of their countries.
  • universities and organizations have Locations
  • locations form geographical hierarchy (Geonames)
    • CITY | REGION | COUNTRY | CONTINENT | AREA
  • new Tags
  • ip_address mapped to locations
    • triples of the form <27.99.128.8> locatedIn <Jakarta>
  • Posts have a language
  • Persons like Posts on some date
  • Comments reply to a Post or to another Comment
  • data correlations:
    • name and country
    • country of origin and university
    • country of the user and country of his friends
    • hashtags of the user with hashtags of his friends (we mostly use the variant of this: hashtags of posts correlate for friends)
    • IP address of a post and a photo correlates with the user's country of origin. There should be a long tail though, and we will exploit it: users travel and post from different locations
  • tweets
  • followers (authorities)
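
The language and country correlation listed above could be sketched as follows. This is only an illustration under stated assumptions: the dictionary contents, the two probabilities and the function name are hypothetical and are not taken from the CIA Factbook data or from the generator.

```python
import random

# Hypothetical dictionary: country -> (main language, minority languages).
COUNTRY_LANGUAGES = {
    "Spain": ("Spanish", ["Catalan", "Galician", "Basque"]),
    "Netherlands": ("Dutch", ["Frisian"]),
}

MAIN_LANGUAGE_PROB = 0.9  # assumed share of people speaking the main language
ENGLISH_PROB = 0.5        # assumed share of profiles that also list English


def assign_languages(country, rng):
    """Return the languages of one person, correlated with the country."""
    main, minorities = COUNTRY_LANGUAGES[country]
    languages = []
    # Most people get the main language; the rest get one minority language.
    if not minorities or rng.random() < MAIN_LANGUAGE_PROB:
        languages.append(main)
    else:
        languages.append(rng.choice(minorities))
    # English is added as an international language regardless of country.
    if "English" not in languages and rng.random() < ENGLISH_PROB:
        languages.append("English")
    return languages


if __name__ == "__main__":
    rng = random.Random(2012)  # fixed seed so runs are repeatable
    for country in COUNTRY_LANGUAGES:
        print(country, assign_languages(country, rng))
```

In a real dictionary the minority languages would presumably carry their own percentages instead of being chosen uniformly; the uniform choice here is a simplification.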

Next steps

  • Prepare the first prototype v0.1 of the ldbc_socialnet_dbgen restricted to the PERSON, TAG and LOCATION entities, and the BASED_NEAR, INTEREST, FRIENDSHIP and KNOWS relationships
  • Add the license text to all the source code files
  • Upload v0.1 to GitHub
  • Document the data generation process, data correlations, dictionaries and generator settings of v0.1
  • v0.2: improve PERSON with relationships to EMAIL_ADDRESS, ORGANIZATION, CONNECTION and IP_ADDRESS
  • v0.3: LOCATION hierarchy, improve PERSON with relationships to LANGUAGE
  • v0.4: GROUP, FORUM, MEMBERSHIP
  • v0.5: POST, PHOTO_ALBUM, COMMENT
  • v0.6: TWEET and HASHTAG
  • v0.9: FOLLOWS

Comments

  • DBGEN code will be led by UPC, documentation by VUA. Design and coordination are shared between the two teams (Renzo and Norbert). Duc will no longer work on the SOCIALNET generator: he will only help by mail to clarify details of the SIB data generator.
  • Future changes to the schema and the data generator should be justified by a query choke point.
  • The current schema contains perhaps too many entities and relationships, in particular for graph databases. Some of them could be simplified once the choke points and the required data correlations have been analyzed.
  • v0.1 is the first public version. It has enough content to be used by the GRAPHLAB community, in particular the USER-KNOWS relationship
  • each new v0.X version guarantees that the data is correlated as expected by the query choke points
  • each new v0.X version includes the updated documentation of the generator, the dictionaries, data correlations, etc.
  • bugs or improvements to existing versions will have priority over new versions with more functionality
  • scale factors will be decided after v0.9
  • v1.0 will be published only when the five involved teams (VUA, UPC, TUM, NEO and OGL) agree on and validate the final model and data correlations. This will most probably happen after some iterations over the interactive and BI query workloads.

 
