1. The schema now is super-relational.
a) A Person should be able to have several different firstNames and lastNames (especially with the language tags in RDF, so a person could for instance have the attributes <firstName> "Andrey"@eng and <firstName> @ru, and likewise for 10 other languages). Even more so for emails and browsers (people could use different ones). Naturally, the number of these attributes should be random, between 1 (or even 0) and say 20 (maybe fewer for browsers).
[Norbert] For emails and browsers I don't see any problem, but for names it seems a little strange in a social network. I think that multiple different names would be the only reason to keep Person and User as separate entities. Would multiple emails and browsers alone be enough?
[Andrey] OK, let's keep firstNames unique. However, people do change lastNames (when they marry, mostly).
[Andrey] a') In addition, can we add Languages to the Person's profile (the languages that he/she speaks)? This would be at least one and at most (say) 10 languages, which almost always include the language of the country and are correlated with the languages of the region.
b) A Person may work at more than one organization and study at more than one university (or at none at all). Again, we can select a random number for every user (with a higher probability for 1 university and 1 organization, of course).
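To make the randomized multi-valued attributes concrete, here is a minimal Python sketch of how a generator could draw them. Everything named below (the language tables, the pools, the probabilities) is invented for illustration and is not part of the actual S3G2 generator.

```python
import random

# Hypothetical tables: a country's main language plus regionally
# correlated extra languages (illustrative values only).
COUNTRY_LANGUAGE = {"Spain": "es", "Germany": "de", "Russia": "ru"}
REGION_LANGUAGES = {"Spain": ["ca", "en"], "Germany": ["en", "fr"],
                    "Russia": ["en", "de"]}

def draw_multivalued(pool, lo, hi):
    """Pick a random number (between lo and hi) of distinct values."""
    k = random.randint(lo, hi)
    return random.sample(pool, min(k, len(pool)))

def person_attributes(country):
    # between 1 and 20 emails, fewer browsers, as proposed above
    emails = draw_multivalued([f"user{i}@example.org" for i in range(30)], 1, 20)
    browsers = draw_multivalued(["Firefox", "Chrome", "Safari", "Opera"], 1, 3)
    # languages: always include the country language, then add regionally
    # correlated ones with some probability, capped at 10
    languages = [COUNTRY_LANGUAGE[country]]
    for lang in REGION_LANGUAGES[country]:
        if len(languages) < 10 and random.random() < 0.3:
            languages.append(lang)
    return {"emails": emails, "browsers": browsers, "languages": languages}

attrs = person_attributes("Spain")
```

The same pattern would cover organizations and universities, just with a distribution skewed heavily toward a count of 0 or 1.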
2. Why do we need separate Person and User?
[Norbert] We agreed to merge them into a single entity.
3. Let's add a geographical hierarchy for Locations from Geonames or YAGO, so that a specific location has a hierarchy of places that contain it (Barcelona -> Catalunya -> Spain -> Europe -> Earth).
[Norbert] OK, it is a clear dimension for multidimensional analysis
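As a sketch of what this dimension looks like, a part-of chain can be stored as a simple parent map and walked upward. The place names here are just the example from the discussion; a real generator would load them from Geonames or YAGO.

```python
# Toy part-of map (a real one would come from Geonames/YAGO).
PART_OF = {
    "Barcelona": "Catalunya",
    "Catalunya": "Spain",
    "Spain": "Europe",
    "Europe": "Earth",
}

def hierarchy(place):
    """Walk up the part-of chain, returning the full containment path."""
    chain = [place]
    while chain[-1] in PART_OF:
        chain.append(PART_OF[chain[-1]])
    return chain

print(hierarchy("Barcelona"))
# -> ['Barcelona', 'Catalunya', 'Spain', 'Europe', 'Earth']
```

For multidimensional analysis, each level of the chain becomes a roll-up level of the Location dimension.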
4. Add the tags suggested by Orri. This will add lots of background information for people and (on the choke-point side) allow us to construct tough queries.
Peter, in response to Andrey and Thomas:
Thomas Neumann: Isn't it a fundamental problem of the schema that it is too regular? For example, everybody has one email address, at most one organization, etc. If you look at the data instances, they look nearly relational! Of course there is a friendship graph etc. in the background, but the data itself has a very nice, clean, and regular structure.
Peter Boncz: Not sure why you consider the friendship graph as something in the background, as I would say it is the central concept in the dataset. Remarkably, the RDF deliverable mentions path expressions, but when it comes down to choke points and test queries, they are not there. Maybe, now inspired by the Kleene-expression GRADES submission, this can be elaborated on? And not only raw power is to be tested here, because the start and end points of the expressions could have selection predicates that correlate or anti-correlate in S3G2. Given that the friendship graph is one large connected component, evaluating friend* is problematic (although with filters the query result could still be small, and recognizing this could in fact be a choke point), but the discussion posts provide other opportunities for path expressions; and because it is the friends of a person doing the discussing (and friends are correlated), there should be possibilities.
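To illustrate why friend* with endpoint predicates can still have a small result, here is a small Python sketch: a bounded-depth Kleene-star traversal where both endpoints are filtered. The graph, cities, and predicates are all invented for illustration.

```python
from collections import deque

# Invented toy friendship graph and a city attribute per person.
FRIENDS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": ["dave"],
}
CITY = {"alice": "Barcelona", "bob": "Munich", "carol": "Barcelona",
        "dave": "Munich", "erin": "Barcelona"}

def reachable(start_pred, end_pred, max_hops=3):
    """All friend{1,max_hops} pairs whose endpoints satisfy the predicates."""
    results = set()
    for s in (p for p in FRIENDS if start_pred(p)):
        seen, frontier = {s}, deque([(s, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for f in FRIENDS.get(node, []):
                if f not in seen:
                    seen.add(f)
                    if end_pred(f):
                        results.add((s, f))
                    frontier.append((f, depth + 1))
    return results

# Correlated endpoint filters keep the answer small even though the
# graph is one connected component.
pairs = reachable(lambda p: CITY[p] == "Barcelona",
                  lambda p: CITY[p] == "Munich")
```

An optimizer that recognizes the selectivity of the endpoint predicates can avoid materializing the full transitive closure; failing to do so is exactly the choke point.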
But, indeed, the node types in S3G2 are regular. Let us not forget that this is not designed to be an RDF benchmark per se; it is intended to model a social network and targets graph databases as well as relational systems. Note that graph databases such as DEX have well-defined (= regular) node types, and the same holds for RDF.
Is it an innate feature of social networks that their data is irregular? Well, ask Facebook. This is not a web-crawl benchmark. If one insists on using RDF technology, the benchmark is of course open to that, and then willfully ignoring the structure indeed adds some extra challenges that should also be tested. But coming from the social network angle per se, these RDF challenges will not be around a ragged data shape.
Thomas Neumann: I think we should add some more variety there, for example potentially multiple email addresses, potentially works for multiple organizations, potentially multiple first names, etc. Just to make the data more "RDF"-style.
Peter Boncz: Well, OK, I am not per se against that either. Formulating it without RDF slant, one wants support for multi-valued attributes, and some instances of those.
Assuming we adopt the ER or UML class-diagram way of denoting our schema, this would be perfectly possible in our social network model, as multi-valued attributes are a supported feature of both ER and UML. And yes, on average people have multiple email addresses. Note that relational systems are capable of dealing with multi-valued attributes as well; it is just one more join. As such, I fail to see why the inclusion of an "extra table" which holds the multi-valued things, and thus the presence of one extra join, really changes anything of substance in the discussion, certainly with regard to this having to be something RDF-specific.
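The "one extra table, one extra join" point can be made concrete with a minimal sqlite sketch; the table names and rows are invented for illustration.

```python
import sqlite3

# Multi-valued emails live in their own table and are reached by a
# single join -- nothing RDF-specific about it.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person_email(person_id INTEGER REFERENCES person(id),
                              email TEXT);
    INSERT INTO person VALUES (1, 'Andrey'), (2, 'Norbert');
    INSERT INTO person_email VALUES (1, 'a@x.org'), (1, 'a@y.org'),
                                    (2, 'n@x.org');
""")
rows = con.execute("""
    SELECT p.name, e.email
    FROM person p JOIN person_email e ON e.person_id = p.id
    ORDER BY p.name, e.email
""").fetchall()
```

The join simply fans out each person to however many emails they have, which is exactly the multi-valued-attribute semantics.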
Andrey asserted that 30% of the choke points rely purely on multi-valued attributes. So, to make it concrete, which of these are they:
1. join ordering for star queries, recognizing structure
2. join ordering for star queries, large search space
3. bushy vs linear trees
4. queries with the optional clause (outerjoin semantics)
5. cardinality estimation
6. correlated data
7. cardinality misestimation in star queries
8. cardinality misestimation in complex queries
9. cardinality misestimation and hash joins
In any case, 70% of the choke points apparently would already be expressible on the S3G2 data, so that should be arranged. Then, if indeed 3 of them require something else, such as multi-valued attributes or hierarchies, I would like to know. And in that case, fixes could be proposed by adding data to the schema. Geonames as a proposal is accepted, for sure, and one or more multi-valued attributes would not hurt either. So, the discussion at the top of the page is very good.
With a slight change of topic, one more remark that I should have made as a deliverable reviewer. Sorry for coming late with this, but I am still in time for our discussion of BI query design and the choke points behind it. The set of core choke points is rather RDF-focused and misses relational choke points such as those in TPC-H and SSB. Here is my own interpretation of what is cool in those benchmarks; please add things that I forgot:
TPCH-a. exploitation of functional dependencies (declared, or not!) in aggregate groupby keys
TPCH-b. analysis of complex join expressions, pushing independent parts below them (Q7 nation rewrite of &(|(x,y),|(a,b)) into |(x,a) and |(y,b) -- plus Q19)
TPCH-c. highly selective joins (bloom filters)
TPCH-d. complex string match optimizations, both testing raw power of these and rewrite into startswith
TPCH-e. large IN lists (or for that matter, long lists of disjunctions)
TPCH-f. correlated (date) columns, and exploiting these in access patterns (data might be generated in X_date order, or one might allow an index on X_date – but then also test access to Y_date, which is strongly correlated)
TPCH-g. test for presence of efficient anti-joins (NOT EXISTS)
TPCH-h. huge joins without any optimization possibility (raw power test, availability of spilling algorithms)
TPCH-i. huge scan without any indexing possibility (raw power test)
TPCH-k. queries with significantly overlapping subqueries (a complex join subquery calculating an aggregate per group, then the same complex join subquery repeated and joined with the aggregate result on the group, selecting on the aggregate value)
SSB-l. exploiting hierarchical dimensions (functional dependency analysis for the benefit of reducing groupby keys)
SSB-m. exploiting hierarchical dimensions (partial re-aggregation in multiple aggregates after each other)
SSB-n. exploiting hierarchical dimensions (in indexing)
SSB-o. exploiting dimension key functional dependencies (invisible join opportunity)
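As an illustration of the TPCH-b rewrite on the Q7 nation predicate, a brute-force Python check shows that the disjunction implies independent per-side predicates that can be pushed below the join. The nation list is a toy one, invented for illustration.

```python
from itertools import product

NATIONS = ["FRANCE", "GERMANY", "SPAIN", "ITALY"]

def q7(n1, n2):
    # Original Q7-style predicate: a disjunction of nation pairs.
    return (n1 == "FRANCE" and n2 == "GERMANY") or \
           (n1 == "GERMANY" and n2 == "FRANCE")

def pushed(n):
    # Independent part implied for each side, pushable below the join.
    return n in ("FRANCE", "GERMANY")

# Soundness: every pair passing q7 also passes both pushed predicates,
# so filtering each side early never loses a result.
assert all(pushed(a) and pushed(b)
           for a, b in product(NATIONS, repeat=2) if q7(a, b))

# Benefit: the pushed predicates shrink the candidate pairs before the
# join; the full q7 predicate is then re-applied on the survivors.
candidates = [(a, b) for a, b in product(NATIONS, repeat=2)
              if pushed(a) and pushed(b)]
```

The optimizer challenge is to derive `pushed` automatically from the boolean structure of the original predicate.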
My point is that I would like to have each and every one of these (in some form) also lurking in our queries. This will have the effect of sending the graph and RDF systems on a path of more maturity.
From my Vectorwise experience, I think the BI use case should also get a bulk update load, which should be better than the TPC-H one at exhibiting certain (time-related) correlations.
Andrey, in response to Peter:
First, I do not agree that Facebook data is regular. A user can add multiple emails, multiple universities and companies, languages, etc. On the other hand, a user does not need to have any of them specified. This is exactly the type of "schema-less" data that we need (maybe not all of these multi-valued attributes are really equally needed, but at least a couple).
Second, in most of the queries in our deliverable we do not use value correlations at all, but we do use structural correlations. The value correlation is what you have in SIB (if a person is from Germany, his name would most likely be German, not Chinese). This type of correlation is common in both RDF and relational databases, and the benchmark queries should of course cover it. A structural correlation is, for example, "if an entity has a <hasLatitude> property, it is very likely to also have <hasLongitude>". We also used anti-correlations like "it is not very common to have both <created> and <hasLatitude> attributes on the same entity".
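A structural correlation can be measured directly over the triples as a conditional co-occurrence of properties. The triple set below is a tiny invented example, just to show the computation.

```python
# Invented toy triple set: some entities carry coordinates, others a
# creation date, mirroring the hasLatitude/hasLongitude example.
triples = [
    ("e1", "hasLatitude", "41.4"), ("e1", "hasLongitude", "2.2"),
    ("e2", "hasLatitude", "48.1"), ("e2", "hasLongitude", "11.6"),
    ("e3", "created", "2011-05-01"), ("e4", "created", "2011-06-02"),
    ("e5", "hasLatitude", "52.5"), ("e5", "hasLongitude", "13.4"),
]

# Collect the property set of every entity.
props = {}
for s, p, _ in triples:
    props.setdefault(s, set()).add(p)

def cooccurrence(p1, p2):
    """P(entity has p2 | entity has p1)."""
    with_p1 = [ps for ps in props.values() if p1 in ps]
    return sum(p2 in ps for ps in with_p1) / len(with_p1)

lat_lon = cooccurrence("hasLatitude", "hasLongitude")  # strong correlation
lat_created = cooccurrence("hasLatitude", "created")   # anti-correlation
```

A cardinality estimator that assumes property independence will misestimate star patterns over such data, which is exactly what these choke points target.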
Of course, the choke points themselves are quite general and could be illustrated in several ways, including with TPC-H queries. However, I considered these structural-correlation query examples very interesting, since they pose unique challenges for RDF systems and should send them "on a path of more maturity", as you say.
The choke-point examples that depend on these structural correlations are:
cardinality misestimation in star queries,
cardinality misestimation in complex queries, and
join ordering for star queries.
(The deliverable explains in detail what kinds of correlations are used.)
This point about structural correlations is also supported by the Apples-and-Oranges paper at SIGMOD 2011: indeed, they show that all the schema-based RDF benchmarks are far away from the datasets that people really use. I suspect that with the current schema SIB will easily end up somewhere next to LUBM in their classification.
Now, I do not say that we have to use these structural correlations all over the place, and I realize that the SNA benchmark is not a purely RDF benchmark, but since RDF systems are also in the focus of the project (or do you want to benchmark only relational systems now?), we should definitely add such challenging queries. In other words, we need less schema in the graph nodes.
Peter, in response to Andrey:
The structure of Facebook data is relational and quite regular, since Facebook has always relied purely on relational technology (MySQL) – a good extract of that is now published as the LinkBench schema. Further, Facebook is presumably interested in data quality and needs that data to populate an attractive user interface, which would steer one very much clear of a schema-less approach. Calling the feature of multi-valued attributes "schema-less" further misses the point that ER and UML support multi-valued attributes while at the same time being the global standard languages for expressing a... "schema". But this is a discussion about terminology only, not really relevant for deciding on the needed changes to the S3G2 schema and data contents.
You suggest that S3G2 data – in other words, relational data – does not exhibit structural correlations when expressed in RDF. This is not true; in fact, structural correlations are strongest when people store purely relational data in RDF. By the way, S3G2 has a number of attributes in each node type that may be absent (as a die-hard relationalist I actually call these "nullable" attributes), so non-100% structural correlations can be found there as well. Hence, by adding a few multi-valued attributes on top of the already existing nullable attributes, we should cover everything you want and seem to define as schema-less.
Note that what you then get is, in my opinion, by no means schema-less. Schema-less for me means the absence of a fixed schema, and in the RDF reality that I know it means the presence of noisy, garbage data, appearing in the property and value distributions as a long tail of misspelled and misused identifiers and literal types. If one is truly interested in helping people manage that kind of data (apart from investing in data-cleaning techniques :-), it could of course also be targeted in a benchmark, but I do not think that the SNA benchmark should go there. After all, the delights of core RDF use (?) cases are not the central mission, which rather lies in representing social graphs and analyzing them, in a rather technology-agnostic way.
A final remark: even if a dataset were noisy, schema-less, and very diverse, the fact that a query workload over that data only contains a finite set of structural constraints and properties imposes a reduced "schema of interest". Hence, in my opinion, representing all the unused schematic constructs, and certainly the long distribution tails containing the (presumably totally or mostly) unused elements, is in fact not very important for the query performance of even core SPARQL use cases. As such, a benchmark is by definition schema-oriented. It could have been otherwise if SPARQL in its design had in fact attempted to cater for schema-fuzzy query formulation (instead of opting for being the sort of SQL-with-URIs that it really is). But again, the SNA benchmark should not focus on RDF and SPARQL per se (better to attempt that in the publishing benchmark – though there the data appears to be thoroughly regular as well).
Thomas, in response to Peter:
I think this is also a kind of philosophical question. As database people, we hate schema-less data. The more schema, the better, both for data quality and for query processing. And also for application logic, etc. So, more schema is good. But clearly, there is data out there that is only loosely structured, and some people claim that this is actually what makes RDF attractive, in particular for non-computer scientists. Just take a look at the Billion Triples Challenge to see some quite noisy data, and I have seen crawls from social networks that were even worse. I therefore think this aspect should be covered by the benchmark. Not necessarily in a prominent position, so I am fine with 99% of the entries having a quite regular structure. But there should be some noise in the data, just to keep the database systems honest. The "noise" could represent real schema flexibility, or just set-valued attributes, or just application errors. Even if we don't like this kind of data, we must somehow check whether the database system is able to handle it, as you could cheat in various ways if you knew beforehand that your data is 100% regular and nice.
Peter, in response to Thomas: