Message-ID: <20080808064532.17003.qmail@plover.com> Subject: Hibernate SQL puzzle I cannot figure out (Warning: LONG) Organization: Plover Systems Date: Fri, 08 Aug 2008 02:45:32 -0400 From: Mark Jason Dominus The query I'm trying to do for tree topology search is like this. The user specifies three strings, and wants a tree shaped like this: +---- A | +---+ | | ----+ +---- B | +-------- C The method expands the three three sets of taxon variants, which we can call ASet, BSet, and CSet. Then the method runs a query which looks for four tree nodes, A, B, C, and AB, all in the same tree, such that all of the following are true: A's taxon label refers to one of the taxon variants in ASet B's taxon label refers to one of the taxon variants in BSet C's taxon label refers to one of the taxon variants in CSet AB is an ancestor of A AB is an ancestor of B AB is NOT an ancestor of C A and B are treated symmetrically; C is different. The query is "Find trees where A and B are more closely related to each other than either is to C." The example I picked is three species of holly, related like this: +---- Ilex amelanchier | +---+ | | ----+ +---- Ilex mucronata | +-------- Ilex glabra If C is I.glabra, tree 4522 ought to match. If C is one of the other two, there should be no matches. But Hibernate returns tree 4522 both ways. When I wrote a SQL query manually to do the same thing, I got the result I expected: select a.phylotreenode_id, b.phylotreenode_id, c.phylotreenode_id, ab.phylotreenode_id from phylotreenode a, phylotreenode b, phylotreenode c, phylotreenode ab, // labels of a, b, and c taxonlabel atl, taxonlabel btl, taxonlabel ctl // All four nodes in the same tree where a.phylotree_id = b.phylotree_id and a.phylotree_id = c.phylotree_id and a.phylotree_id = ab.phylotree_id // join taxonlabels and a.taxonlabel_id = atl.taxonlabel_id and b.taxonlabel_id = btl.taxonlabel_id and c.taxonlabel_id = ctl.taxonlabel_id // constrain taxonvariants to match search // a = I.glabra and atl.taxonvariant_id in ( 340, 341, 342, 343 ) // b = I.amelanchier and btl.taxonvariant_id in ( 357, 358, 359, 360 ) // c = I.mucronata and ctl.taxonvariant_id in ( 366, 367 ) // constrain topology: // ab must be an ancestor of a and ab.leftnode < a.leftnode and ab.rightnode > a.rightnode // ab must be an ancestor of b and ab.leftnode < b.leftnode and ab.rightnode > b.rightnode // ab must NOT be an ancestor of c and ( ab.leftnode >= c.leftnode or ab.rightnode <= c.rightnode ) When I run this query, I get an empty result, at I should, because it requires that I.glabra be in position C. If I switch the taxonvariant contstraints on a and c, I get the four correct tree nodes from tree 4522. So my handwritten SQL does just what I think it should. The SQL that comes out of Hibernate looks very similar. For technical reasons, I have Hibernate selecting the root node of the tree that contains a, b, c, and ab. In the following SQL, "phylotreen5_" is that root node. The SQL is exactly as was logged by Hibernate, except that I have added my own comments: select distinct phylotreen5_.PHYLOTREENODE_ID as PHYLOTRE1_57_, phylotreen5_.VERSION as VERSION57_, phylotreen5_.BranchLength as BranchLe3_57_, phylotreen5_.CHILD_ID as CHILD8_57_, phylotreen5_.LeftNode as LeftNode57_, phylotreen5_.Name as Name57_, phylotreen5_.NODEATTRIBUTE_ID as NODEATTR9_57_, phylotreen5_.nodeDepth as nodeDepth57_, phylotreen5_.PARENT_ID as PARENT10_57_, phylotreen5_.RightNode as RightNode57_, phylotreen5_.SIBLING_ID as SIBLING11_57_, phylotreen5_.TAXONLABEL_ID as TAXONLABEL12_57_, phylotreen5_.PHYLOTREE_ID as PHYLOTREE13_57_ from PHYLOTREENODE phylotreen0_, // phylotreen0_ is tree node a PHYLOTREE phylotree4_ // phylotree4_ is the tree containing a inner join PHYLOTREENODE phylotreen5_ // the root node of phylotree4_ on phylotree4_.ROOTNODE_ID=phylotreen5_.PHYLOTREENODE_ID, PHYLOTREENODE phylotreen1_, // node b PHYLOTREENODE phylotreen2_, // node c PHYLOTREENODE phylotreen3_, // node ab TAXONLABEL taxonlabel6_, // taxonlabel of node a TAXONLABEL taxonlabel7_, // taxonlabel of node b TAXONLABEL taxonlabel8_ // taxonlabel of node c where // phylotree4_ is the tree containing node a phylotreen0_.PHYLOTREE_ID=phylotree4_.PHYLOTREE_ID // join taxonlabels and phylotreen0_.TAXONLABEL_ID=taxonlabel6_.TAXONLABEL_ID and phylotreen1_.TAXONLABEL_ID=taxonlabel7_.TAXONLABEL_ID and phylotreen2_.TAXONLABEL_ID=taxonlabel8_.TAXONLABEL_ID // All four nodes (a, b, c, ab) in the same tree and phylotreen0_.PHYLOTREE_ID=phylotreen1_.PHYLOTREE_ID and phylotreen0_.PHYLOTREE_ID=phylotreen2_.PHYLOTREE_ID and phylotreen0_.PHYLOTREE_ID=phylotreen3_.PHYLOTREE_ID // constrain taxonvariants to match search // a = I.glabra and ( taxonlabel6_.TAXONVARIANT_ID in ( ? , ? , ? , ? ) ) // b = I.amelanchier and ( taxonlabel7_.TAXONVARIANT_ID in ( ? , ? , ? , ? ) ) // c = I.mucronata and ( taxonlabel8_.TAXONVARIANT_ID in ( ? , ? ) ) // constrain topology: // ab must be an ancestor of a and phylotreen3_.LeftNodephylotreen0_.RightNode // ab must be an ancestor of b and phylotreen3_.LeftNodephylotreen1_.RightNode // ab must NOT be an ancestor of c and ( phylotreen3_.LeftNode>=phylotreen2_.LeftNode * or phylotreen3_.LeftNode<=phylotreen2_.RightNode ) So it looks the same. (I.mucronata is the only one of the three taxonvariants that has two equivalent variants instead of four, so we can know that I.mucronata is in position C in this code, and I.glabra is not. Also, diagnostics in the program confirm that the values of the '?' placeholders are what I expect.) But this query, which should return an empty result, returns the root node of tree 4522. Why?