Skip to end of metadata
Go to start of metadata

Queries

Note: only queries 1-9 have been tested.

FIXME

= Queries 10-22 are very rough and the sparql has not been checked for correctness. Queries 1-9 are in github already.

= there are no path queries!!! We should add queries with path expressions, as well as shortest path queries (where the path length is bounded to e.g. 4 so it can be simulated).

= more queries should get date restrictions.

= choke points should be checked (do all TPC-H choke-points appear? do correlations play a role? Do graph tasks play a role? Do the queries take advantage e.g. of multi-valued attribute handling?).

 


  1. Post Stats. Test expression evaluation and optional properties 

    This is analogous to TPCH Q1.  Should not use an index for the unselective date condition. The cardinalities of the patterns are the same, so the right plan selects on the date first, the rest of the order does not matter.
    Must recognize that the scalar subquery is independent of the rest of the query.
    One may recognize that this is in principle subsumed in the group by count (*). This is not so however if the properties are not all single valued, which one does not a priori know.

    select ?year ?is_reply ?size count (*) avg (?len) sum (?len)
          (xsd:float (count (*)) / ((select count (*) where {?p a snvoc:Post . ?p snvoc:creationDate ?dt . filter (?dt < "%Date0%"^^xsd:date)})))
    where {
          ?post a snvoc:Post .
          ?post snvoc:creationDate ?date .
          ?post snvoc:contentLength ?len .
          optional { ?post snvoc:replyOf ?org} .
          filter (?date < "%Date0%"^^xsd:date)
    }
    group by ((bound (?org)) as ?is_reply)
             ((bif:year (?date)) as ?year)
             (if (?len < 40, "short", if (?len < 80, "one liner", if (?len < 160, "tweet", "long"))) as ?size)
    order by ?year ?size

    Date0=CreationPostDate_90_100
  2. Volume in forums on a subject in a country. 

    This is a snowflake pattern that accesses many columns of posts filtered on relatively unselective foreign key joins.
    We expect a hash to be built from the tags and the forums. The scan will then go over the columns in order.

    If the criteria are specially selective, an index based plan starting with the forum or the tag is possible.

    select ?title ?year ?f count (*) sum (?len) sum (if (bound (?repl), 1, 0))
    {
        ?f snvoc:containerOf ?post .
        ?f snvoc:hasModerator ?mod .
        ?mod snvoc:isLocatedIn %Country% .
        ?f snvoc:title ?title .
        ?post snvoc:hasTag ?tag .
        ?tag a <%TagType%> .
        ?post snvoc:creationDate ?dt .
        ?post snvoc:contentLength ?len
        optional {?post snvoc:replyOf ?repl} .
    }
    group by ?title ?f (bif:year (?dt) as ?year)
    order by desc 4
    limit 20

    Country=BigCountryURI
    TagType=TagTypeURI
     
  3. The Locally Going Thing.
    Given a date and a country, get tags of posts such that the post is before the date and a reply of the post is after the date. This is like TPC H Q3 with the twisted join of orders and lineitem. If there is correlation between date and placement of posts, then zone maps can be used for avoiding even looking at actual dates for many posts. 

    select ?tag count (*)
    where {
        ?post snvoc:creationDate ?date .
        ?reply snvoc:replyOf ?post .
        ?reply snvoc:creationDate ?rdate .
        filter (?date < "%Date0%"^^xsd:date && ?rdate > "%Date0%"^^xsd:date ) .
        ?post snvoc:hasCreator ?author .
        ?author snvoc:isLocatedIn %Country% .
        ?post snvoc:hasTag ?tag .
    }
    group by ?tag
    order by desc 2
    limit 20
    Country=CountryURI
    Date0=CreationPostDate_80_100
  4.  Thread Length Distribution.

    Take all posts that are not replies. Take all their replies transitively. Group by the post and count the thread. Group by the count, getting counts of threads for each distinct length.

    There are many transitive paths of moderate length all processable at the same time. Batching them with vectored execution of the transitive operator improves join locality. Recognizing that each post serves only once as a starting point means that orderec aggregation is possible. The transitive operation does not necessarily produce results in order but once a batch of posts has been processed to completion the state of the aggregation can be dropped, the same grouping keys will not come back.

    select ?cnt count (*)
    where {
        {
        select ?post (count (*) as ?cnt)
        where {
            ?post snvoc:hasCreator ?u . minus {?post snvoc:replyOf ?xx} .
            ?reply snvoc:replyOf * ?post .
        }
        group by ?post
        }
    }
    group by ?cnt
    order by ?cnt


  5. Best publicist.

    Find top authors in terms of likes of their posts where the post is tagged with a tag in a category, e.g. products of a brand. Scope this to authors from a given country posting in places moderated by a person of the same or other country.

    This is essentially a snowflake around post. The count of likes depends on the post alone but should be evaluated only after all the restricting joins, e.g. country of author and country of moderator where the post occurs. In cases where the countries of poster and moderator differ there can be a strong anti-correlation and this may be reflected in join order. If the tag is unselective and the countries are small the join order will likewise be affected. Zonemaps will not be effective unless the data is reorganized around these joins. One could imagine a two level dimensional clustering with date and country of author affecting the placement in storage. Zone maps could then be effective. Also some possibilities could be ruled out by the id of the post, e.g. the id being in a certain range guarantees the author is not from India.

    select ?author count (*)
           ( sum ((select count (*) where {?xx snvoc:likes ?lk . ?lk snvoc:post ?post})) as ?likes)
    where {
        ?post snvoc:hasCreator ?author .
        ?post snvoc:hasTag ?tag.
        ?tag a <%TagType%> .
        ?author snvoc:isLocatedIn %Country% .
        ?f snvoc:containerOf ?post .
        ?f snvoc:hasModerator ?mod .
        ?mod snvoc:isLocatedIn %Country% .
    }
    group by ?author
    order by desc 3
    limit 30

     

    Country=CountryURI
    TagType=TagTypeURI

     

  6. Branding Hour.

    Choose timing of online advertizing to match the online times of a target audience. For top forumns touching the brand, find the busiest window per country. 

    select
    where {
        ?f snvoc:containerOf ?post .
        ?post snvoc:creationDate ?dt .
        ?post snvoc:hasTag ?tag .
        ?tag a <%TagType%> .
    }
    group by ((bif:hour (?dt)))

     TagType=TagTypeURI


  7. Market Share.

    For forums with over x posts about the general subject matter, what is the market share of a particular tag? E.g. Bob Dylan's share of myusical artists in forums with many mentions of musical artists?

    select ?title ?volume (100 * count (*) / xsd:float (?volume))
    where {
        {
        select ?f (count (*) as ?volume)
        where {
            ?f snvoc:containerOf ?post .
            ?post snvoc:hasTag ?tag .
            ?tag a %TagType%
        }
        group by ?f
        having count (*) > 500
        }
        .
        ?f snvoc:containerOf ?post .
        ?post snvoc:hasTag %Name% .
        ?f snvoc:title ?title .
    }
    group by ?f ?title ?volume
    order by desc 3
    limit 20

    TagType=TagAndName_tag

    Name=TagAndName_name

     

  8. Cross-Border Conversation.

    Count conversation acrosss two countries in a time range over a subject matter. Like TPCH Q7.

    Note that here the condition on the tag is for the first post and the condition on the time is on the reply.  Note that the tag is specified as Dbpedia resources that reference the tag of interest, i.e. many of these will not occur as tags. The OR on the countries must be evaluated on the build side of the hash join between post or reply or country. There is a choice of building the hash on the countries or on the authors in the countries. A purely index based plan will usually not be best, except if one of the countries is very small or the tag very selective. 

    select ?year ?country1 ?country2 count (*)
    where {
        ?post1 snvoc:hasCreator ?author1 .
        ?post2 snvoc:hasCreator ?author2 .
        ?post2 snvoc:replyOf ?post1 .
        ?author1 snvoc:isLocatedIn ?country1 .
        ?author2 snvoc:isLocatedIn ?country2 .
        filter (?country1 != ?country2) .
        filter (?country1 = %Country1% || ?country1 = %Country2%) .
        filter (?country2 = %Country1% || ?country2 = %Country2%) .
        ?post2 snvoc:creationDate ?dt .
        filter (?dt > "%Date0%"^^xsd:date) .
        ?post1 snvoc:hasTag ?tag .
        ?tag dbpedia-owl:wikiPageWikiLink %Person% .
    }
    group by ?country1 ?country2 (bif:year (?dt) as ?year)
    order by 3 1 2
        
    Country1=CountryPair_1
    Country2=CountryPair_2
    Person=FamousPerson
    Date0=CreationPostDate_50_98
  9. Most posts in a foreign language.

    -- This query should be changed in this way:
    -- Find me a country (with qualification of the size of the country),
    -- where a large percentage of user generated content is in the language
    -- that is different from the most common language of the country
    select ?person count(*)
    where {
      { select ?lng count(*)
        where {
          ?person a snvoc:Person.
          ?person snvoc:isLocatedIn ?location. ?location dbpedia-owl:isPartOf %Country%.
          ?person snvoc:speaks ?lng.
        }
        group by ?lng
        order by desc 2
        limit 1
      }
      ?post snvoc:hasCreator ?person .
      ?person a snvoc:Person.
      ?person snvoc:isLocatedIn ?location. ?location dbpedia-owl:isPartOf %Country%.
      ?post snvoc:language ?flng.
      filter (?flng != ?lng)
    }
    group by ?person
    order by desc 2
    limit 10

    Country=CountryURI

  10. People who studied at the same foreign university. NOT FINISHED

    FIXME:
    =this query creates a too large join (all with all people)
    =the use case for one big count also is unclear
    =Europe should be parametrized (a random continent should be used)
    select count(*)
    where {
      ?person1 a snvoc:Person.
      ?person1 snvoc:studyAt ?o1. ?o1 snvoc:hasOrganisation ?org. ?org dbpedia-owl:City ?city.
      ?city dbpedia-owl:isPartOf ?unicountry. ?unicountry a dbpedia-owl:Country.
      ?person1 snvoc:isLocatedIn ?location1. ?location1 dbpedia-owl:isPartOf ?country1. ?country1 a dbpedia-owl:Country.
      filter(?unicountry != ?country1).
      filter (!exists { ?unicountry dbpedia-owl:isPartOf dbpedia:Europe } ).
      ?person2 a snvoc:Person.
      ?person2 snvoc:studyAt ?o2. ?o2 snvoc:hasOrganisation ?org.
      ?person2 snvoc:isLocatedIn ?location2. ?location2 dbpedia-owl:isPartOf ?country2. ?country2 a dbpedia-owl:Country.
      filter(?unicountry != ?country2).
      filter (?person1 != ?person2).
    }
    limit 1000

  11. Teenagers talking to strangers

    sparql Select ?person ?stranger ?month (count (*) as ?cnt)
    {
    ?reply snvoc:hasCreator ?person .
    ?reply snvoc:replyOf ?post .
    ?reply snvoc:creationDate ?dt .
    ?post snvoc:hasCreator ?stranger .
    Minus { ?pperson snvoc:knows ?stranger}
    ?person snvoc:birthday ?birth .
    Filter (?birth > "1990-1-1"^^xsd:date)
    } group by ?person ?stranger (month (?dt) as ?month)
    Order by ?cnt limit 100;

     

  12. Liker Clique. Given a topic, find persons who like a small group of authors. If these like only a few authors and do nothing else these are suspect of being zombies controlled by the author(s). Find, per author the likers. For each liker, calculate a score. The score is the count of likes of the author’s posts minus count of likes of other’s posts minus count of own posts. Return authors with many highly scored likers.

    sparql Select ?author ((?likesofauthor - ?likesofother) as ?likerscore) ((?likesofauthor / (1 + xsd:float (?likesofother))) as ?ratio)
    where {
    {select ?author ?liker (sum (if (?liked = ?author, 1, 0)) as ?likesofauthor)
    (sum (if (?liked = ?author, 0, 1)) as ?likesofother)
    where {
    ?post snvoc:hasCreator ?author . 
    ?like snvoc:hasPost ?post .
    ?liker snvoc:likes ?like . 
    ?liker snvoc:likes ?like2 . filter (?like != ?like2) . 
    ?like2 snvoc:hasPost ?post2 .
    ?post2 snvoc:hasCreator ?liked .
    } group by ?author ?liker }
    } order by desc 2 limit 100;


  13. Data Quality. Take all properties that occur in a post. For each take the class of all objects. Return the classes of objects where the count is under 1% of all triples with the p.

     

    sparql Select ?p ?c count (*) as ?cnt
    Where { 
    {select ?p ?refdc count (*) where {
    {{select distinct ?p where { ?s ?p ?o . ?s a snvoc:Post }}
    ?s1 ?p ?o1 . ?o1 a ?refdc .
    } group by ?p ?refdc}
    Having ?cnt < ((select 0.01 * ?cnt where {?s2 ?p ?o3})))
    };


  14. Find unusual values. Take all properties with a meaning of name. Find pairs of different properties with many values in common.

    Select ?pn1 ?pn2 countt (*) sum (if (?o1 = ?o2, 1, 0) as ?shared where 
    {
    ?pn1 rdfs:subPropertyOf <name> .
    ?pn2 rdfs:subPropertyOf <name> .
    Filter (?pn1 != ?pn2)
    ?s1 ?pn1 ?o1 . ?s2 ?pn2 ?o2 . 
    }
    Group by ?pn1 ?pn2 order by desc ?shared

     


  15. Development of mindshare. Given a class of tags, e..g.move star,, count scores for the tag in two consecutive time windows and comparre them. The count is the ccount of distinct posts and plus 0.5 points for each like of a post with the tag.


    Select ?tg ?cnt ?likes ?cnt2 ?likes2 ((?cnt2 + ?likes2 *0.5) / (?cnt + ?likes * 0.5) as ?ratio)


    Where 
    { {select ?tg count (distinct ?post) as ?cnt1 count (?like)as ?like2 where {
    ?post snvoc:hasCreator ?author . 
    ?post snvoc:hasTag ?tg . 
    Tg a ?tagclass . . ?tagclass rdfs:subClassOf * %tgclass% .
    ?postt snvoc:creationDate ?dt . filter (?dt > %dt1% && ?dt < %dt2%) .
    optional { ?like snvoc:hasPost ?post}} group by ?author}
    }
    --- repeated for next year
    {select ?tg count (distinct ?post) as ?cnt2 count (?like) as ?likes2 where {
    ?post snvoc:hasCreator ?author . 
    ?post snvoc:hasTag ?tg . 
    Tg a ?tagclass . . ?tagclass rdfs:subClassOf * %tgclass% .
    ?postt snvoc:creationDate ?dt . filter (?dt > %dt2% && ?dt < %dt3%) .
    optional { ?like snvoc:hasPost ?post}} group by ?author}

    order by ?ratio;

    }


  16. Biggest posters on a tag. Get the highest number of references to a tag by any author. Get the authors that refer too the tag at least 0.9x as much as the biggest.

    {Select max (?cnt) as ?maxcnt where {
    {select ?author count (*) as ?cnt where { ?post snvoc:hasCreator ?author . ?post snvoc:hasTag %Tag% } group by ?author}}
    {select ?author2 count (*) as ?cnt2 where 
    {?postt snvoc:hasCreator ?author2 . ?postt snvoc:hasTag %Tag% . } 
    group by ?author2
    Having ?cnt > 0.9 * ?maxcnt
    }
    Order by ?cnt limit 1000;

  17. Country Bias. Tags that are much more popular than in any other country.

    Select ?tg avg (?cnt) as ?avgcnt where {
    Select ?author ?tg count (*) as ?cnt
    ?author snvoc:location ?loc . ?loc snvoc:isPartOf %Country%.
    ?post snvoc:hasCreator ?author . ?postt snvoc:hasTag ?tg . 
    } group by ?author 

    {Select ?tg max(?cnt)as maxcnt {Select ?tg ?author2 count (*) as ?cnt2 where 
    { ?author snvoc:location ?loc2 . ?loc2 snvoc:isPartOf ?ctry2 . filter (?ctr2 = %Country%) .
    ?post2 snvoc:hasCreator ?author2 . ?post snvoc:hasTag ?tg} group by ?tg, ?ctr2} } group by ?tg having ?avgcnt > 2 * maxcnt
    }

    Order by ?cnt limit 1000;


  18. Posting activity 

    sparql select ?cnt count (*) where {
    { select ?post (count (*) as ?cnt) where {
    ?post sv:creator ?u . minus {?post sv:reply_of ?xx} .
    ?reply sv:reply_of * ?post .
    }group by ?post
    }} group by ?cnt order by ?cnt;

     

  19. Hub Countries. Take the most popular people in terms of the direct connections, the repliers and likers of their posts.

    There is a union of many joins and the leading joins in the union terms are overlapping, hence reuse is possible.
    There is a hierarchical dimension at both ends.

    Select ?conn_ctry count (*) where 
    {
    ?author snvoc:located * ?loc . ?loc snvoc:isPartOf %Country%

    {
    ?post snvoc:hasCreator ?author ?rep snvoc:replyOf ?post . ?rep snvoc:hasCreator ?connection .
    } union {
    ?author snvoc:knows ?connection .
    } union {
    ?post2 snvoc:hasCreator ?author . ?like snvooc:hasPost ?post . ?like snvoc:hasPerson ?connection }
    }
    ?connection snvoc:located * ?connloc. ?connloc snvoc:isPaartOf ?conn_ctry . ?connctry a snvoc:Country .
    } group by ?conn_ctry;

     

  20. Moving predicates. A tag is popular in a group. Find the time when these tags have more than 10% of the conversation. Return tag, day, count of all posts, count of posts with the tag.

    The subqueries can be done by hash and the hash build may import most conditions from the enclosing query. There are also reuse opportunities between the subqueries.


    Select ?day ?tag ?tagcnt ?allcnt where 
    { ?tag a [tagclass] .
    ?group snvoc:hasModerator ?mod . ?mod snvoc:ipCountry [ctry] .
    Filter ?day >= %Day1% && ?day <= %Day2% )


    {select ?day ?group count (*) as ?allcnt where {
    ?post snvoc:hasContainer ?group .
    ?post snvoc:creationDate ?dt .
    Group by ))bif:dayofyear (?dt) as ?day) ?group }
    }
    {Select ?day ?tag ?group count (*) as ?tagcnt where {
    ?post snvoc:hasContainer ?group .
    ?post snvoc:hasTag ?tag .
    ?post snvoc:creationDate ?dt .
    Group by ))bif:dayofyear )?dt)) as ?day) ?group ?tag}
    Filter (?tagcnt > 0.1 * ?allcnt)
    } order by ?tagcnt ?group ?tag ;


  21. Mole Hunt. A possible mole is one that has access to classified information and has previous contacts to representatives of the ‘ axis of evil.’ Sudden cessation of these contacts signifies the recruitment of the mole. NOT FINISHED

    FIXME:

    = this query is not finished, it seems to consist of three parts
    = the interaction)?handler,?mole,?time,?score) syntax is unclear (a SPARQL "view"??)
    = what is the final result?
     

    Insert into <suspect? { _:mw snvoc:hasMole ?mmole . _:mw snvoc:hasWeek ?week . _:mw snvoc:hasHandler ?handler . _:mw snvoc:hasScore ?score} snvoc:hasMole ?mole . }
    Select ?handler ?mole ?week sum (?score)
    Where {
    ?mole snvoc:worksAt ?agency . 
    {{?agency a snvoc:DefennseContracttor}
    union {?agency a snvoc:Agency}}
    ?handler a snvoc:Recruiter
    interaction (?handler, ?mole, ?time, ?score)
    }group by (datediff (‘week’, [start], ?dt) as ?week)

    } group by ?handler ?mole;

     

    Interaction is defined as having a time and a weight. Some interactions are long term, e.g. working in the same company or knowing somebody. These are given a low score with many small interaction events.


    Interaction (?p1, ?p2, ?time, ?score)
    {Select ?p1 ?p2, (if start1 > start2, ?start1, ?start2), (100 as ?score) 
    ?p1 snvoc:worksAt ?pw1 .
    ?p2 snvoc:worksAt ?pw2 .
    ?pw2 snvoc:hasOrganization ?company .
    ?pw1 snvoc:hasOrganization ?company .
    ?pw1 snvoc worksFrom ?start1
    ?pw2 snvoc:worksFrom ?start2 .
    ?dt a snvoc:Day .
    ?day snvoc:dayNumber ?dayno.
    Filter (?dayno if (?start1 > start2, ?start1, ?start2) && ?dayno < if (start1 + 5 > start2 + 5, start2 + 5, start1 + 5))
    Filter (overlap (?start1, start2,, start1 + 5, start2 + 5))

    Union
    { select ?p1 ?p2 (if (?dt1 > ?dt2, ?dt1, ?dt2), (300 as ?score) where
    { ?p1 snvoc:knows ?p2 .
    ?p1 snvoc:creationDate ?dt1
    ?p2 snvoc:creationDate ?dt2 }
    Union 
    {select ?p1 ?p2 ?dt (400 as ?score) where
    { ?p1 snvoc:likes ?lk .
    ?lk snvoc:hasPost ?post .
    ?post snvoc:hasCreator ?p2 .
    ?lk snvoc:creationDate ?dt 
    }
    Union 
    { select ?p1 ?p2 ?dt (500 as ?score)
    Where {
    ?post snvoc:hasCreator ?p1 .
    ?post2 snvoc:hasCreator ?p2 .
    ?post2 snvoc:replyOf ?post1 .
    }

    Recruiter is a flag that is sset for persons who engage in conversation on multiple sites associated with the suspect topic.

    Select ?person count (distinct ?post) count (?tag) min (?dt) max (?dt)
    Where {
    ?forum snvoc:hasTag ?tg 
    ?tg a snvoc:EvilTag .
    ?person snvoc:memberOf ?forum .
    ?post snvoc:hasCreator ?person .
    ?post snvoc:hasContainer ?forum
    ?post ssnvoc:creationDate ?dt .
    Optional (?post snvoc:hasTag .
    ?tag a snvoc:EvilTag }
    } group by ?person 
    Having (datediff (: day”, ?first, ?last) / ?cnt > 0.2


  22. Concept Promoter. A person that is prominent in conversation on nnew topics.

    Tags are given a weight per day.

    Insert into <stats} {?tg snvoc:hasDayCount _:dc . _:dc snvoc:hasCount ?cnt . _:dc snvoc:hasDay ?day}
    Select ?day sum (1 + 0.5 * coalesce (?replies, 0)) 
    Where {?post snvoc:hasTag ?tg . 
    Optional {select ?post count (*) as ?repeats where {?reply snvoc:replyOf * ?post} group by ?post}
    } group by datediff (“ day: , [startdate], ?dt) as ?day’;

    Some tagdays are hot. The heat of a tagday is the tag’scirculation on the tagday divided by the total circulation of the tag.

    Insert { ?dc snvoc:heat ?heat]
    Select ?dc (daycnt / ?total) as ?heat
    Where {
    ?tg snvoc:hasDayCount ?dc . ?dc ?dc snvoc:day Count ?cnt .
    {select ?tg sum (?dcnt) as ?total where {?tg snvoc:hasDayCount ?dc . ?dc snvoc:dayCount ?dcnt} group by ?tg}
    }


    Select ?person ?tg ?dayno ?daycnt ?pcnt 
    )xsd:float (?pcnt) / ?daycnt as ?ratio)


    ?post snvoc:hasCreator ?person .
    ?post snvoc:hasTag ?tag .
    ?tag a snvoc:HotTag .
    ?day snvoc:dayNumber ?dayno .
    Exists { ?dc snvoc:heat . ?dc snvoc:hasTag ?tg . 
    ?dc snvoc:hasDay ?dayno 
    Filter (heat > 0.1) 
    }
    ?post snvoc:creationDate ?dt .
    Filter (?dayno = sql:dt_dayno (?dt)) .
    {select ?dayno (count (*) as ?daycnt) where { ?post snvoc:hasDay ?day }
    {select ?person ?dayno?tg count (*) as pcnt where {?post snvoc:hasTag ?tg post snvoc:hasCreator ?person. 
    ?postt snvoc:hasDay ?dayno
    Filter (?pcnt > 2)
    } order by ?ratio desc limit 1000
    }

  • No labels