Skip to end of metadata
Go to start of metadata

We are defining three different workloads:

  • the interactive workload: short running queries, but highly concurrent 
  • the analysis workload: long running queries touching a lot of data; only moderately concurrency
  • the algorithms workload: typically not expressible in a query language, using exclusive dataset access (or on a copy)

Not all graph data management techhonolgies are expected to run all workloads ("one size does not fit all"). 

Interactive Workload

Basic Graphs Operations

NAMEDESCRIPTION
ADD_NODEAdds a new node of label L
ADD_EDGEAdds a new edge of label L between nodes X and Y
UPDATE_ATTRUpdates one or more attributes of a node/edge X. An attribute value can be removed or set to NULL (both are equivalent)
DELETE_NODERemoves node X and, optionally, all its edges in cascade
DELETE_EDGERemoves an edge X
SELECT_LABELReturn all nodes/edges of label L
FILTER_NODESReturn all nodes that match one or more selections over its attributes
FILTER_EDGESReturn all edges that match one or more selections over its attributes
FILTER_NULLReturn all nodes/edges that do not have value for a certain attribute A
FILTER_SIMILARReturn all nodes/edges that have a value for a text attribute that contains a string pattern
FILTER_RANGEReturn all nodes/edges that have a value for an attribute into a restricted range of values
FIND_EDGEReturn the edges of label L between two nodes X and Y
1_HOPReturn all the neighbors of node X following the edge label L outgoing/ingoing.
K_HOPReturn all the neighbors of node X following the edge label L one or times outgoing/ingoing.
NO_HOPReturn all nodes that are not neighbors of node X following the edge label L outgoing/ingoing.
BIDIR_HOPFor all the neighbors of node X following the edge label L outgoing/ingoing, return all their neighbors (except X) following the edge label L in reverse direction
DEGREEReturn the count of edges outgoing/ingoing from node X following the edge label L
FILTER_DEGREEReturn all nodes that have a degree of at least N edges following the edge label L outgoing/ingoing
AGGR_EDGEReturn an aggregate function of the values of an attribute of all the edges of a node X following the edge label L outgoing/ingoing
RANK_TOP_KRank the result (by attribute or degree) ascending/descending and return the top most K

 

Coverage of Graph Operations at the SIB Interactive Query Set

 Q1Q2Q3Q4Q5Q6Q7Q8Q9Q10Q11Q12Q13Q14Q15Q16Q17Q18Q19Q20total
ADD_NODE                    0
ADD_EDGE                    0
UPDATE_ATTR                    0
DELETE_NODE                    0
DELETE_EDGE                    0
SELECT_LABEL                    0
FILTER_NODESXXXXXXXXXXXXXXXXXXXX20
FILTER_EDGES X                  1
FILTER_NULL                    0
FILTER_SIMILARX  X X     X        4
FILTER_RANGE            X       1
FIND_EDGE                    0
1_HOPXXXXXXXXXXXXXXXXXXXX20
K_HOP  X X               2
NO_HOP            X       1
BIDIR_HOP                    0
DEGREE               XXXXX5
FILTER_DEGREE      XX  X  X      4
AGGR_EDGE                    0
RANK_TOP_K  X   X XXX X  XXXXX11
total33433343334353244444 

 

Analysis of the Query Set

 

      • Suggested queries
        • [Alex] There are no Shortest Path queries
        • [Alex] There are no queries that actually return Paths
        • [Alex] As an extension of number [16],"your top-10 close friends' top-10 close friends, that are not your friends yet", to make it a complex personalized recommendation.
      • [1] Find all users whose first names contain a particular string, e.g., "?ijk?" (regular expression text search)

        • [Alex] Ok, but not so interesting (not a graph operation). I guess we need a few simple lookups like this too.
      • [2] Return the names of people studied in the same school/organization at the same time (multiple patterns matching). These people are likely the classmates/colleagues of the user.

        • [Alex] Yes. Simple pattern match.
      • [3] Find people studied from the same school that connect with you by a path of friend relationship (Use the "Property Path Expression" in SPARQL 1.1 with arbitrary length path)

        • [Alex] Yes. Simple pattern match + Simple path expression
      • [4] Find singers who won American Music Awards and are liked by one of your friends. (exploit DBpedia knowledge bases for this award and singers who won it)

        • [Alex] No. Not supported by current schema (DBpedia is missing, and we have no concept of Singer, Awards, etc.) - "Thing" might be too limited.
      • [5] Find all people living in a specific location, e.g., Amsterdam, that can be reached from a user by at most 3 steps friend relationship. (path expression of specific length)

        • [Alex] Yes. Simple pattern match + Simple, bounded path expression
      • [6] Show all the friends of yours who are living in Europe. This requires using the information from DBpedia, for example, Amsterdam is a city in Europe and London is a city in Europe. (regional correlation)

        • [Alex] No. Interesting, but not supported by current schema (DBpedia is missing).
      • [7] Find top-10 suggested friends for a user: Those people that are currently not your friend but are friends of many of your friends. (Get all friends of your friends, order them by the number of people in your friends list connecting to them)

        • [Alex] Yes. Very nice. Pattern matching + Filters + OrderBy
      • [8] Return all users that haven’t joined a specific group but 5+ of their friends joined the group

        • [Alex] Yes. Very nice. Pattern matching + Filters
      • [9] Show 10 latest posts/tweets from your friends or the friends of them (Order by the posting time)


        • [Alex] Yes/No, not with current schema - it will scale poorly as User has increasing number of Posts/Comments. As suggested above, a better approach may be to form a list structure in the graph, where the head/tail element of that list is one Users most recent Post/Comment. List could be traversed in time proportional to number of Posts/Comments desired. When retrieving from multiple friends (of friends, etc.) the complexity would be no different, as they would all be used as “input queues” with priority based on the time-stamp of their “head” element. I can discuss this in more detail later. See http://docs.neo4j.org/chunked/stable/cypher-cookbook-newsfeed.html
      • [10] Show active posts/tweets - the 10 latest commented posts/tweets from your friends. (Order by the timestamp of the last comments on the posts)


        • [Alex] Yes/No. Again, schema should be made to match, to be realistic.

      • [11] Return top-10 most interesting posts from your friends - First order by the number of "like" (or in Twitter, the number of "re-tweet" posts) on the posts from your friends, then order by the number of comments.

        • [Alex] Yes/No. Again, i think it’s important to design a good schema for this problem. Question: are these second OrderBy's designed to work as a tie-breakers for the first OrderBy? 
      • [12] Return all posts about an event, an (e.g., Unrest in Tunisia) in 10 recent days. Based on the hash tags if they are available. In case no tag appears in the post, check whether the content of the post contains the terms in the searching event. (free text search & tags checking)

        • [Alex] Yes/No. Depending on schema, this could be performed in a “graphy” way too I think, perhaps by treating each Event as a User, and putting Posts about it in its Timeline, but this may have to be done at write time. By default (with no graph-specific modeling to make a good schema), I’m not sure how "graphy" this query is. Generally, the schema needs to be improved to handle such operations in a more graphy manner.
      • [13] Find people having same gender, in a range of age, living in the same areas who are not friends of a user, and order them by the number of shared interests with the user.

        • [Alex] Yes. Like it, quite interested in how to best model the graph to do this efficiently too.
      • [14] Find number of current inactive user: all users activated more than 60 days and do not have any post during last 30 days.

        • [Alex] Yes/No. Schema should be made to match, to be realistic. Rough idea: again, perhaps user Activations could be stored in a list-like “feed” structure, then traversed until first Activation is encountered that’s >60 days old. Each Activation links to a User, then each User has a feed (as discussed above) that can be accessed similarly, if head of that list is older than 30 days they are omitted.
      • [15] Show all photos posted by my friends that I was tagged in.

        • [Alex] Yes. Traversal/pattern match, filter results to be distinct.
      • [16] Show the list of a user's top-10 close friends. Tips: If two people are tagged in the same photo, it is likely that they are close friends or colleague. Thus, sort your friendship according to the number of time that a user and her friend are both tagged in a photo, then according to number of your tags for each friends (user may not tag you in your photo when you upload).

        • [Alex] Yes. Nice personal recommendation-type query. Suggestion: Could extend this to return "your top-10 close friends' top-10 close friends, that are not your friends yet", to make it a true recommendation. That would be a much more complicated/costly operation.
      • [17] Find top-10 friends or all friends of friends of you that have common interest (Based on the similarity between the tags in your posts and tags in their posts)

        • [Alex] Yes, like it. Schema needs to support this though, i.e., I'm not sure how tags in Posts are encoded in the graph/dataset.
      • [18] What are the current hottest events/problems? (Get the hash tags from posts and order by the number of their appearances in 10 recent days)

        • [Alex] Yes/No. I think this is an interesting query, but we need to agree on how to model the graph. Dependant on the way “feeds” are stored - should be as realistic (i.e., efficient) as possible. Also, are unique tags stored as Nodes or Properties, i.e., are posts tagged by adding a link to a tag Node or by adding a Property to them (and indexing the property)?
      • [19] Which area is the most active area? (Order by the total number of posts in each location in 5 recent days)

        • [Alex] Yes/No. I like it but we need a modified schema. In current schema Comments/Posts have no Location, but Photos do. I think we should give Comments/Posts a Location too.
      • [20] Return the top-10 locations that have the fastest growth in the number of users. (Count the number of people joined during the 10 recent days).

        • [Alex] Yes/No. As usual, I like it, but we need a fitting schema. For example, user Activations stored in a list-like "feed" structure, and Activations link to Users, which in turn link to Locations.
    • Update (U1..U16)
      • 16 queries, only 6 in SPARQL
    • Analysis (A1..A11)
      • 11 queries, only defined
  • [Norbert] probably there are too many queries, it is necessary to analyze groups of queries depending on the functionality, and get one representative of each group
  • [Norbert] queries I4 and I6 have external dependencies with DBPEDIA, they can not be solved by a standalone GDBMS
  • [Miquel] Query A7: How the zealot is generated? (if generated)

  • [Miquel] Is there a rdf query for U3-5,7-8 ?
  • [Norbert] Query specification for GDB:
    • only functional (free text), each vendor implements using its own API
    • BluePrints
    • a new BluePrints-like interface only for LDBC
    • a declarative language (such as Cypher, G-SPARQL, SQL with graph extensions, etc...)
  • [Norbert] What is the output expected for graph queries? Collections of nodes and edges, or projections? Is the operation DISTINCT important?

 

Analysis Workload

OLAP style graph queries that detect patterns and quantify these (grouping, counting). Typically still expressible in e.g. SPARQL1.1

Orri, Alex, Norbert, Andrey  to provide input.

 

 

--Q post frequency distribution

sparql select ?cnt count (*) where {{ select ?u (count (*) as ?cnt) where { ?u sioc:creator_of ?post } group by ?u}} group by ?cnt order by desc 2 limit 20;
sparql ?cnt count (*) where ({ ?u (count (*) as ?cnt) where { ?u sioc:creator_of ?post . ?post a sib:Photo } group by ?u}} group by ?cnt order by desc 2 limit 20;
sparql select ?l count (*) where { ?person sib:like ?l } group by ?l order by desc 2 limit 40;

 


--Q Tags per country

sparql select ?tag, ?country, count (*) 
where { ?acct sioc:account_of ?person . ?acct sioc:creator_of ?post . ?post sib:hashtag ?tag .
?person sib:ip_country ?country .
} group by ?country ?tag order by desc 3 limit 100;

 

-- Who posts a lot about what Cher fans like?

 

sparql select ?author2 count (*) 
where
{ {select ?tag (count (*) as ?cnt) where { ?author sib:like <http://dbpedia.org/resource/Cher> . ?author sioc:creator_of ?post . ?post sib:hashtag ?tag }group by ?tag order by desc 2 limit 100} .
?post2 sib:hashtag ?tag .
?author2 sioc:creator_of ?post2 .
} group by ?author2 order by desc 2 limit 20;

 

-- longest threads

sparql select ?post count (*)
where { ?post a sioc:Post . ?r sioc:reply_of * ?post } group by ?post order by desc 2 limit 10;

-- Who got the most replies during 1st month of participation?
sparql select ?author count (*)
where {
?author sioc:creator_of ?post .
?reply sioc:reply_of ?post .
?post dc:created ?postdate .
?author dc:date ?joindate .
filter (?postdate < bif:dateadd ("day", 30, ?joindate))
} group by ?author order by desc 2 limit 100
;

 

--Q where are American rock singers popular?

sparql select ?country count (*) where {
?person sib:ip_country ?country .
?user sioc:account_of ?person .
?user sioc:creator_of ?post .
?post sib:hashtag ?tag .
?tag a <http://dbpedia.org/class/yago/AmericanRockSingers> .
} group by ?country order by desc 2 limit 20;

 

-- Person posting about x and y

sparql select ?user ?tag1 ?tag2 count (*) ?likes
where { ?user sioc:creator_of ?post1 .
?user sioc:creator_of ?post2 .
?post1 sib:hashtag ?tag1 .
?post2 sib:hashtag ?tag2 .
?tag1 a <http://dbpedia.org/ontology/MusicalWork> .
?tag2 a <http://dbpedia.org/class/yago/Island109316454> .
} group by ?user ?tag1 ?tag2 order by desc 4 limit 20;

 

-- Person posting about x and y with x and y likes scores

sparql select ?user ?tag1 ?tag2 count (*) sum (?likes1) sum (?likes2) 
where { ?user sioc:creator_of ?post1 .
?user sioc:creator_of ?post2 .
?post1 sib:hashtag ?tag1 .
?post2 sib:hashtag ?tag2 .
?tag1 a <http://dbpedia.org/ontology/MusicalWork> .
?tag2 a <http://dbpedia.org/class/yago/Island109316454> .
{ select ?post1 (count (*) as ?likes1) where { ?post1 sib:liked_by ?liker } group by ?post1} .
{ select ?post2 (count (*) as ?likes2) where { ?post2 sib:liked_by ?liker2 } group by ?post2} .
} group by ?user ?tag1 ?tag2 order by desc 4 limit 20;

 

-- How much more likely are friends of people who like Madonna to like Madonna over the average?

 

sparql select ?mcount 
(select count (*) where {?friend sioc:creator_of ?fpost2})(select count (*) where {?friend a sioc:User . filter (exists {?friend sioc:creator_of ?fpost . ?fpost sioc:hashtag <http://dbpedia.org/resource/Madonna> })}))
where {
{ select (?usercount / ?xsd:float (?mcount)) as ?mshare
where { { select (count (*)as ?usercount) where { ?u a sioc:user }} .
{select (count (*) as ?mcount) where { ?mu a sioc:User . filter (exists {?mu sioc:creator_of ?post . ?post sib:hashtag <http://dbpedia.org/resource/Madonna> })}}}
} .
?user a sioc:User . 
filter (exists { ?user sioc:creater_of ?post . ?post sib:hashtag <http://dbpedia.org/rtesource/Madonna> }) .
?user foaf:knows ?friend .
}

 

 


-- timeline

 

sparql select ?year ?month count (*) where {{ select (bif:year (?d) as ?year) (bif:month (?d) as ?month) where { ?post dc:created ?d}}} group bY ?year ?month order by ?year ?month;

 

-- regional timeline for American rock

 

select ?r ?year ?month count (*) 
where {
{select (bif:year (?d) as ?year) ( bif:month (?d) as ?month) ?region 
where {
?user sioc:creator_of ?post . ?post sib:hashtag ?tag .
?tag a <http://dbpedia.org/class/yago/AmericanRockSingers>
?tag dbpo:genre ?genre .
filter (?genre in (<soul>, <funk>)) .
?post sioc:created ?d .
?user sioc:account_of ?person .
?person sib:ip_country ?country .
?country dbpo:region ?r .
}
}
} group by ?year ?month ?r
order by ?r ?year ?month;


-- What part of posts from abroad?

sparql select count (*) sum (if (?pctry = ?hctry, 1, 0))
where { ?acct sioc:account_of ?person . ?person sib:ip_country ?hctry . ?acct sioc:creator_of ?post . ?post sib:ip_country ?pctry };

 

Algorithm workload

Graph analysis algrithms, such as computing the clustering degree, computing PageRank, all-pairs shortest paths, etc.

Orri, Alex, Norbert, Andrey  to provide input.

 

  • No labels