Wednesday 9 January 2008

computer science - Reconstructing an ordering of a multiset from its consecutive submultisets

I have some unpublished notes that I think are relevant to this problem, and I am cherry-picking from them. (See also here for a related question.) Because the notes are unpublished I hope it will not be considered poor form to provide a rather lengthy response.



To save myself trouble while picking the relevant parts from them, I will write $q$ in place of $r$ and $k$ in place of $j$. Also, let $B = q^k$ and let $N$ be the length of a generic cyclic word (or necklace) over a $q$-ary alphabet $mathcal{A}_q$.



As Gerhard suggests, the problem can be recast in terms of necklaces and their subwords of length $k$. We introduce the appellation dimgraph for “directed multigraph”. Let $[w]_k$ denote the generalized order $(k – 1)$ de Bruijn dimgraph of the $q$-ary word $w$: i.e., vertices correspond to the $(k – 1)$-tuples, and directed edges correspond to the $k$-tuples in $w$ in the usual way. If $[w′]_k = [w]_k$, we say that $w$ and $w′$ belong to the same order $k$ de Bruijn homology class. It is easy to see that this really is an equivalence relation, which we denote $sim_k$ (suppressing subscripts when there is no risk of confusion). If the size of an equivalence class is greater than unity, it will be necessary to take into account the additional data associated to $p$, but I will not touch on that (or on the formula to determine the multiplicity of a nontrivial equivalence class, though if anyone wants this please leave a comment and I will try to reply with a PDF) here.



For convenience, assume first that $q = 2$ and define the frequency tuple $alpha = alpha(w)$ as the $B$-tuple of natural numbers indicating the frequency of occurrence in $w$ of the (lexicographically ordered) $k$-tuples over $mathcal{A}_q$. For later convenience we take the frequency tuple to be an integral column vector. For example,



$w = 00011 in mathcal{A}_2^5, k = 3 Rightarrow alpha(w) = (alpha_{000},dots,alpha_{111})^T equiv (alpha_0,dots,alpha_7)^T = (1,1,0,1,1,0,1,0)$.



Notice that we start the index at zero: this will be our default convention for all indices relating to $k$-tuples, and context should generally suffice to determine the indexing scheme for a particular object.



The Eulerian property of generalized de Bruijn graphs means that $alpha$ is completely specified by the $B/2 + 1$ components $alpha_0,alpha_1,dots,alpha_{B/2-1},alpha_{B-1}$: conversely, any such assignment completely specifies the corresponding generalized de Bruijn dimgraph—though notice here that we make no implicit requirement that such a dimgraph must as a matter of course be strongly connected (i.e., correspond to a word). We could make other choices for the “free” components: the essential thing is that the two “loop” components $alpha_0$ and $alpha_{B-1}$ must be free, and any other $B/2 – 1$ components will suffice to complete a specification. The reason for this is basically that the Eulerian property (besides its requirement of strong connectivity) provides $B/2$ independent equations in $B$ unknowns.



More explicitly, for k > 1, write the frequency tuple as a column vector:



$alpha = (alpha_0,alpha_{[0]}^{(k)},alpha_{[1]}^{(k)},alpha_{B-1})^T; quad alpha_{[0]}^{(k)} = (alpha_1,dots,alpha_{B/2-1})^T, quad alpha_{[1]}^{(k)} = (alpha_{B/2},dots,alpha_{B-2})^T$.



Now there is an integral matrix—actually there are many integral matrices—satisfying



$alpha_{[1]}^{(k)} = M_{[0 rightarrow 1]}^{(k)} alpha_{[1]}^{(k)}$.



There is an algorithm for computing these matrices for arbitrary $k$: it is just an exercise in index-juggling, however, so we omit the details. The point is that we can dispense with the unnecessary components of frequency tuples for the binary case, although some care must be taken. Also, it turns out to simplify matters considerably if we omit the words of all zeros and all ones from consideration: we will do so without comment when convenient. More generally, it is convenient to require that none of the components of the frequency tuple are zero (which also implies that none of them equal $N$), but we shall be explicit when making such an assumption.



We can treat some of the components of a frequency tuple in the manner just outlined as functions into the natural numbers. In turn, the matrix-tree and BEST theorems can be jointly recast as a recipe for another function—the BEST function—from this space into the natural numbers that gives the cardinality of the de Bruijn equivalence class.



...



With the binary case in hand, we seek to generalize to $k$-tuples over $mathcal{A}_q$. The approach will be somewhat different here. Define ancestor and descendant matrices by



$M_leftarrow := 1_{1 times q} otimes I_{B/q}, quad M_rightarrow := I_{B/q} otimes 1_{1 times q}$.



Now the indegree/outdegree (I/O) or weak Eulerian property reduces to



$M_leftarrow alpha = M_rightarrow alpha$.



Nothing is lost here by forming the augmented ancestor and descendant matrices



$tilde M_leftarrow := delta_1 otimes M_leftarrow + I_B, quad tilde M_rightarrow := delta_1 otimes M_rightarrow + I_B$



and characterizing the I/O property instead by



$tilde M_leftarrow alpha = tilde M_rightarrow alpha$.



The augmented matrices are invertible, and so we can rewrite the I/O property once more as



$M_{(I/O)}alpha equiv left( tilde M_rightarrow^{-1} tilde M_leftarrow -I right)alpha = 0$.



In other words, a proper frequency tuple must belong to the kernel of the consistency matrix on the LHS above. The dimension $d$ of this in/out (I/O) kernel $ker_{I/O}(q,k)$ is just the number of independent variables required to determine a frequency tuple.



This number is $d = B – B/q + 1$. A sketch of a proof is as follows: of the $B$ edges (including any with multiplicity zero) in an order $(k – 1)$ generalized de Bruijn dimgraph, $q$ are loops: the corresponding frequency tuple components obviously need to be treated as free variables. For the remaining edges, we have B/q independent linear equations in $B – q$ variables. This gives a net of $B – B/q$ independent variables for a minimal integral solution to the linear equations. The +1 is accounted for by scaling.



Accordingly, if the word length is fixed to $N$ then evidently there are $B – B/q$ free components.



For small $q, k$ an exact/integral basis for the in/out kernel can be formed easily (e.g., by using the null command in MATLAB with the r option), and since these are the only cases we can realistically address, we will not bother with an algorithm.



We have two comments at this point. The first is that the reader should be convinced (or take it as an exercise to prove) that the I/O kernel contains an integer lattice whose primitive cells are relatively small (a precise formulation and attendant proof are probably rather technical, and certainly uninteresting). The second is that the lattice can be generated by a positive integral basis: the few negative entries above can clearly be made to go away with a few elementary linear operations.



That is, we consider the I/O lattice



$Lambda_{I/O}(q,k) :- ker_{I/O}(q,k) cap mathbb{Z}^B$



and the associated lattice cone



$Lambda_{I/O}^{0+}(q,k) := ker_{I/O}(q,k) cap mathbb{N}^B$.



Except for some degenerate cases corresponding to disconnected generalized de Bruijn dimgraphs (to be addressed in the sequel), points in this cone correspond to valid frequency tuples. For computational purposes we want to be able to efficiently generate the frequency tuple simplices



$Lambda_{I/O}^{(N)}(q,k) := Lambda_{I/O}^{0+}(q,k) cap N Delta_B$



where a dilation of the standard simplex is indicated on the RHS. (It will be an implicit corollary of our construction of these sets that the I/O lattice is unimodular, but we do not use this.)



We first define a bucketspace (“$s$ balls in $r$ buckets”) $X_{r,s} := {alpha in mathbb{N}^r : sum_{j=1}^r alpha_j = s}$. Now it is clear that there exists a $B times d$ integer matrix $M^{(Lambda)}$ s.t.



$M^{(Lambda)}X_{B,N} = Lambda_{I/O}^{(N)}(q,k)$



(where the LHS is interpreted in the obvious way). In fact there may be several such integer matrices—each reflecting a particular choice of basis—but all we need is one. We will call it a bucketspace-I/O lattice matrix.



The problem of its construction appears subtle (but perhaps that is because we have focused on explicit examples). A useful heuristic is to invoke the LLL lattice basis reduction algorithm on an integer basis for the I/O kernel, then use the structure of the de Bruijn graph—in particular, a table of its cycles—to improve on LLL. Note that the cycle enumeration for de Bruijn graphs has only been effected for small values of $k$.



In practice there are some subtle considerations, which we sketch in the context of the two specific cases $q = 2$ and $k = 3, 4$. Since the I/O kernel is the span of the dimgraph cycle space (see the sequel), it makes sense to look for a set of cycles corresponding to a “nice” basis. A natural candidate is a reduced basis of positive lattice vectors with minimal length.



Once the simple cycles of the de Bruijn graph are enumerated, it is easy to produce such a positive basis with a greedy algorithm. Start with the cycle representatives as a proto-basis. Now pick one of the remaining simple cycles of minimal weight uniformly at random: if it is linearly independent of the existing proto-basis, add it. Keep going until a basis is formed.



For $q = 2, k = 3$ the 6 simple cycles and corresponding simple cycle vectors are



$(0) leftrightarrow (1,0,0,0,0,0,0,0)^T$



$(001) leftrightarrow (0,1,1,0,1,0,0,0)^T$



$(0011) leftrightarrow (0,1,0,1,1,0,1,0)^T$



$(01) leftrightarrow (0,0,1,0,0,1,0,0)^T$



$(011) leftrightarrow (0,0,0,1,0,1,1,0)^T$



$(1) leftrightarrow (0,0,0,0,0,0,0,1)^T$



and the greedy algorithm produces the unique minimal positive basis



${(0),(001),(01),(011),(1)}$.



Similarly for $q=2, k=4$ we have the 19 simple cycles



$(0),(0001),(0001011),(00010111),(00011),(0001101),(000111),(00011101),(001)$,
$(001011),(0010111),(0011),(001101),(00111),(0011101),(01),(011),(0111),(1)$



and the greedy algorithm produces either of the two minimal positive bases



${(0),(0001),(001),(001011),(0011),(01),(011),(0111),(1)}$



${(0),(0001),(001),(0011),(001101),(01),(011),(0111),(1)}$



Note that the simple vectors of length 5 are rejected by the greedy algorithm, as is the simple vector $(000111)$ of weight 6.



(Notice that Gordan’s lemma implies that the monoid [commutative semigroup with identity and cancellation] of integral points in a lattice cone is finitely generated. This in turn implies that we can use one of the minimal positive bases to produce frequency tuple simplices if we are willing to pay a large overhead factor over a more efficient [if less well-understood] method outlined below.)



By performing elementary column operations it is possible to find a workable bucketspace-I/O lattice matrix by trial and error (of course an algorithm would be preferable, but the cycle decomposition of de Bruijn graphs is known only for small values of $k$, suggesting that such a goal is too lofty). We find by direct experiment with $N = 16$ that the following matrices work:



alt text



The columns of the $k = 4$ matrix can be indicated graphically:



alt text



Finally, notice that the method outlined essentially boils down to finding the vertices of a frequency tuple simplex—but this appears to be more easily said than done.



...



Now that the I/O condition has been suitably developed, we move on to the question of (strong) connectivity. Let $B$ (with an abusive resue of notation) denote the incidence matrix of a dimgraph $G$. Without bothering to restate the common definition, we simple provide a relevant example: for $q = 2$ and $k = 4$ we have



alt text



Notice by way of passing that we can recast the I/O equations as



$B1 = 0$.



It can be shown that $ker B$ is the cycle space of $G$.



In any event it is a standard result for dimgraphs that



$mbox{rank}(B(G)) = |V(G)| - c(G)$



where on the RHS we indicate the number of vertices minus the number of components. In short, the rank of the incidence matrix allows us to judge whether or not the corresponding dimgraph is (strongly) connected.



Our intent here is that a generalized de Bruijn dimgraph has only a single component by definition: although we have not really produced a proper mathematical definition, this intent should suffice for the remainder. Now the requirement on a frequency tuple is that



$|V(alpha)| - mbox{rank}(B(alpha)) = 1$.



The number of vertices is just the number of $(k – 1)$-tuples making an appearance; the rank can be computed by Gaussian elimination. For $q = 2$, it can quickly be seen that the rank is just the number of nonzero components in $(alpha_1,dots,alpha_{B/2-1})$.



With that in hand, define the tuple canonical ensemble (TCE) by



$hat Lambda_{I/O}^{(N)}(q,k) := Lambda_{I/O}^{(N)}(q,k) cap {alpha: |V(alpha)|-mbox{rank}(B(alpha)) = 1}$.



The TCE gives all the frequency tuples for a given length, and no garbage points, although some care is required in its construction.



In conclusion: check to see if a given point is in the TCE. If it is, then use the matrix-tree and BEST theorems (again, ask me how and I'll provide a PDF if I see the comment) to determine its cardinality. Providing an enumeration and selecting $p$ minimal completes the solution to the problem, although I'm not sure about the value of $p$, complexity, etc.

No comments:

Post a Comment