Choke points:

a) star-shaped queries

b) large query plans

c) REGEX

d) bushy vs linear

e) OPTIONAL

f) cardinality estimation for join ordering

g) cardinality estimation for hash join

h) cardinality estimation for property paths

i) anti-join

 

Short summary of the data:

We assume the data described in the SNA Graph Schema Proposal document and in the SIB document, plus the following elements:

a) Locations form a geographical hierarchy (GeoNames)

a') Locations are cities (and other small places), not countries. Countries will be derived based on the locations' hierarchy.

b) Universities and organizations have Locations, i.e. it is possible to ask for the city and (as a consequence of requirement a) the country of a university

c) people may have multiple universities/organizations/emails/browsers. 

d) IP addresses are mapped to Locations (i.e., in RDF terms there are triples of the form <27.99.128.8> locatedIn <Jakarta>)

e) Users, Posts and Groups have hashtags (drawn from DBpedia)

f) User profiles have a (multivalued) Language property, correlated with the country and based on data taken from the CIA Factbook: a certain percentage of people in the country should speak the main language, plus there should be some minorities. (Maybe add English as an international language to many profiles regardless of their countries.) Posts are also annotated with the language they are written in. The languages of a user's posts correlate with his/her profile languages.

 

The data has the following correlations:

    c1) name and country

    c2) country of origin and university

    c3) country of the user and country of his friends

    c4) (when augmented with Orri's tags) hashtags of the user with hashtags of his friends (we mostly use the variant of this: hashtags of posts correlate for friends)

We need to add the following correlations:

    c5) The IP address of a post or a photo correlates with the user's country of origin. There should be a long tail though, and we will exploit it: users travel and post from different locations

    

 

Queries:

Q1. Extract all the properties of a person with a given name, plus the universities she attended and companies she worked at

SELECT * WHERE {

        ?person lastName %NAME%.

        ?person firstName ?fn.

        ?person gender ?g.

        ?person birthdate ?date.

        ?person email ?email.

        ?person browser ?browser.

        ?person studyAt ?uni.

        ?person worksAt ?company.

        ?uni locatedIn* ?country1.

        ?company locatedIn* ?country2.

}

Notes: this is a simple lookup query, interesting mostly for throughput.

Chokepoints: star queries

 

Q2. Find all people that studied abroad together.

Chokepoints: bushy vs linear plans, cardinality estimation for join ordering and hash join

 

Q3. Find Friends and Friends of Friends of the user X that have been to the countries A and B in the last year.

Chokepoints: large query; cardinality estimation for join ordering, hash join and path queries. The plan depends on A and B (compare (A,B) = (US, Canada) with (A,B) = (Thailand, Zimbabwe)), on the country of origin of user X, and on the number of his friends and friends-of-friends.

Note: derive the visit from the IP address of the post.
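
A rough sketch of how Q3 could be phrased, under several assumptions: the sn: namespace and the predicates knows, creator, creationDate and ipAddress are hypothetical names (only locatedIn appears in the data description above), knows is assumed to be stored in both directions, and the %...% placeholders follow the parameter convention of Q1 and must be substituted before the query is run. The visit is derived from the IP address of a post by walking the location hierarchy with locatedIn+.

```sparql
PREFIX sn: <http://example.org/sn#>   # hypothetical namespace

SELECT DISTINCT ?friend
WHERE {
  # Friends (one hop) and friends of friends (two hops) of user X.
  %USER% sn:knows/sn:knows? ?friend .
  FILTER (?friend != %USER%)

  # A post by ?friend sent from an IP address located in country A ...
  ?postA sn:creator ?friend ;
         sn:creationDate ?dateA ;
         sn:ipAddress/sn:locatedIn+ %COUNTRY_A% .

  # ... and another post sent from an IP address located in country B.
  ?postB sn:creator ?friend ;
         sn:creationDate ?dateB ;
         sn:ipAddress/sn:locatedIn+ %COUNTRY_B% .

  # "Last year" expressed as a lower bound on the creation dates.
  FILTER (?dateA >= %START% && ?dateB >= %START%)
}
```

The two post patterns are symmetric, so an optimizer is free to start from whichever country is expected to be more selective, which is where the dependence on A and B comes from.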

 

Q4. [Emerging trends] Find the top 10 most popular topics-hashtags (by the number of comments and posts) that your friends have been talking about in the last 24 hours (parameter), but not before that.

Chokepoints: 'not exists' clause (anti-join)
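
One possible shape for the anti-join in Q4 is sketched here. It assumes a hypothetical sn: namespace and hypothetical predicates knows, creator, hashtag and creationDate, and it treats posts and comments uniformly as messages with a creator. The %...% placeholders follow the parameter convention of Q1 and must be substituted before the query is run.

```sparql
PREFIX sn: <http://example.org/sn#>   # hypothetical namespace

SELECT ?tag (COUNT(DISTINCT ?msg) AS ?mentions)
WHERE {
  # Messages written by friends inside the time window.
  %USER% sn:knows ?friend .
  ?msg sn:creator ?friend ;
       sn:hashtag ?tag ;
       sn:creationDate ?date .
  FILTER (?date >= %WINDOW_START%)

  # Anti-join: discard tags that any friend already used before the window.
  FILTER NOT EXISTS {
    %USER% sn:knows ?oldFriend .
    ?oldMsg sn:creator ?oldFriend ;
            sn:hashtag ?tag ;
            sn:creationDate ?oldDate .
    FILTER (?oldDate < %WINDOW_START%)
  }
}
GROUP BY ?tag
ORDER BY DESC(?mentions)
LIMIT 10
```

The inner pattern is correlated with the outer one only through ?tag, so an engine can evaluate it either per candidate tag or as a single anti-join against the set of all previously used tags.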

 

Q5. [New global groups] What are the groups that your connections (friendship up to the second hop) from countries A and B have joined in the last week? Order them by the number of posts and comments they made there.

Chokepoints: cardinality estimations for complex queries

 

Q6. [People who discuss X also discuss...] Find the 10 most popular interests (hashtags) of people that are connected to you via a friendship path and talk about topic 'X'

Chokepoints: property paths of arbitrary length, cardinality estimations for them
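
A minimal sketch of Q6 with an unbounded property path, assuming a hypothetical sn: namespace and hypothetical predicates knows, creator and hashtag. A person's interests are read from the hashtags attached to the user, and "talks about X" is interpreted as having written at least one message tagged with X. Both readings are assumptions. The %...% placeholders follow the parameter convention of Q1.

```sparql
PREFIX sn: <http://example.org/sn#>   # hypothetical namespace

SELECT ?tag (COUNT(DISTINCT ?person) AS ?people)
WHERE {
  # Everyone reachable from the user over one or more knows edges.
  %USER% sn:knows+ ?person .
  FILTER (?person != %USER%)

  # Keep only people who wrote at least one message about topic X.
  FILTER EXISTS {
    ?msg sn:creator ?person ;
         sn:hashtag %TOPIC% .
  }

  # Their other interests.
  ?person sn:hashtag ?tag .
  FILTER (?tag != %TOPIC%)
}
GROUP BY ?tag
ORDER BY DESC(?people)
LIMIT 10
```

The size of the knows+ result depends on the connected component the user belongs to, which is what makes the cardinality of this path hard to estimate.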

 

Q7. Find the top 10 most popular topics of the last X hours in your country that none of your friends knows, blogs or comments about.

Chokepoints: large intermediate results (assuming topics of friends correlate with yours and most of your friends are in the region), cardinality estimations

 

Q8. Find all pairs of users such that one is called John (parameter1) and is from NYC (parameter2, REGEX), the other is called Mary (parameter3) and is from Hong Kong (parameter4), and they met in a discussion group on music (parameter5). "Met" means that they both posted to the same group and commented to each other.

Chokepoints: optimal plan varies greatly (both in join ordering and physical plan selection) depending on parameters, REGEX

 

Q9. Find the top 10 foreign-speaking bloggers from your country

(or more general: people that speak language L and blog in your country)

Note: Based on language tags of people.

Chokepoints: correlated data, cardinality estimations; path queries.

 

 
