Case Study: ≥1 series (sequence) 

This is a case study of
Part (i) was not too difficult using class TimeSeries and type TimeSeriesType from [200309] provided that context2model :: tsm ds > [ds] > model ds was added to the class. The DPA was general and quite simple; it was able to remain "unaware" of the details of the given timeseries model, except for whatever length of recent history "made a difference" in the local incremental choice of an optimal alignment. But it was soon realised that stateful timeseries models would require a slightly different version of the algorithm because they cannot be efficiently "paused and restarted" on an arbitrary context::[ds]. This was a nuisance and there would be trouble using contextbased and statebased models together. A new problem appeared over the approximaterepeats model[4]; this requires a submodel over Jump destinations which almost always depends on the current position within the string being generated. At first it was thought that the length of the context would be useful. This takes O(n)time to calculate on demand, but could be precalculated by the class machinery. There again, some future model would likely need some other property of the context, so where would it end? In any case, the length was not sufficient; the present model needed the number of characters generated or scanned which is ≤ the context length. It seems that the approximaterepeats timeseries model requires stateful models  for efficient implementation. After further experiment
it appears that while class TimeSeries
(with predictors but without context2model)
is a superclass of class TimeSeriesS
The 2D DPA was "easily" converted to use TimeSeriesS models. The algorithm became simpler and supporting code to construct contexts for the component submodels became redundant  good. The approximaterepeats model was easily defined in terms of TimeSeriesS models, and further experimentation allowed it to be simplified by using timeseries models of alignments. Alignment Models, (OrBoth a b)The data type [OrBoth a b] can be used to represent an alignment of two data series (sequences) [a] and [b]. Types a and b might or might not be the same depending on the application, for example it makes sense to align a sequence of DNA Codons, [(DNA,DNA,DNA)], with the AminoAcid sequence of a protein, [AminoAcid].
(JustL and JustR are named after
Just, from the standard type Forming a TimeSeries of (OrBoth a b)A TimeSeries of (OrBoth a b) can be formed in many ways but
one natural way is from
a
(This required adding method
context2model ::
(tsm ds) > Finding a good [OrBoth a b]Given a TimeSeries of (OrBoth a b) we can compare any
two alignments, i.e. two The 2D DPAThe 2D
dynamic programming algorithm
(DPA) for finding the "cost" (or "score") of an optimal alignment of
two sequences, sa and sb,
under a variety of cost (score) measures, is well known and
some versions have even been programmed in
a lazy functional programming language[1].
In imperative languages it is usually described
in terms of a 2D matrix, d[0..sa,0..sb],
where d[i,j] contains, possibly amongst other things,
the cost (score) of sa[0..i1] and sb[0..j1],
where sequence positions are numbered from zero.
The basic algorithm runs in
quadratic time provided that the calculation of
The 2D DPA takes a timeseries model
of
On OptimalityNote that in general taking a model of the population of sequences
into account can, and should, change
These two alignments (others are possible) would have
equal costs or scores under many algorithms.
Here matching the rare Cs,
compared to matching the common As,
saves ContextsOne way to recover an optimal alignment is for the DPA
to also store the best choice,
i.e. northwest, north or west, in d[i,j].
This chain (context) of options, starting from d[sa,sb] and
leading back to d[0,0],
gives an optimal alignment in reverse order.
For many natural timeseries models, the incremental cost of
a single step, from the n.w., n. or w., depends on the context.
There are exponentially many paths (alignments) from
d[0,0] to d[i,j] so it is not feasible to
store information about all of them in d[i,j].
For an orderk Markov model
the incremental cost depends only on the previous k≥0 steps,
that is on the first k elements of the appropriate context;
anything beyond this horizon has no effect.
It is sufficient to store a best representative context in d[i,j]
for each such prefix.
There are upto 3^{k} possible contextprefixes of Shared length, k = 1, context prefixes, appropriate for linear (affine) gapcosts. There are upto three contextprefixes of length one, fewer at boundaries. Shared length, k = 2, context prefixes. There are upto nine contextprefixes of length two. By storing only such representatives in d[i,j],
provided that the Even for a timeseries model with no bound on a "horizon", it may still be a good heuristic to store only such a collection of representative contexts for some k imposed by the programmer, the DPA then yielding good, but not necessarily optimal, alignments. StatesSome timeseries models are based on
functions with state (and
it seems that the 2D DPA can only achieve both
good efficiency and generality for such models).
Such a model does not look at the context
but rather at its current state and at the next data element.
Therefore a state must be stored with each representative context
(the context is still necessary to recover a final alignment).
The state could be recreated on demand by retracing an alignment,
To use statebased timeseries models
the DPA is required to store, for each "open" alternative
at the current position,
Approximate Repeats Models,

data ApproxRepeats a = Gen a  Jump Int  Rpt (OrBoth a a) modelAppRpts mO mA mN mOB = let nlp (Gen a) = nlPr mO 0 + nlPr mA a nlp (Jump n) = nlPr mO 1 + nlPr mN n nlp (Rpt ob) = nlPr mO 2 + nlPr mOB ob  ob::OrBoth... in MnlPr ... nlp ... 
The are various "models" of editing ([b]>[a]), e.g.
data Edit3 a = E3ins a  E3del  E3use a  NB x(E3use y)>y data Edit4 a = E4ins a  E4del  E4chng a  E4copy  application determines if x(E4chng x)>x is legal 
For no very strong reason,
[b] is the "source" and
[a] is the "target".
Edit3 or Edit4 can be used to give nonredundant representations of approximaterepeats.
There are some "cute" correspondences
([JustLJustRBoth],[a],[b]) <> [OrBoth a b] <> ([b],[Edit3 a]) <> ([b],[Edit4 a])  if applicable 
And there must be related correspondences on relevant statistical models.
Given a suitable data structure,
the approximaterepeats timeseries model follows a similar pattern
to the timeseries model of
tsmApproxRepeats tsmO tsmA0 tsmO2 = let combine2 mA mB = let cm = combine mA mB p (a,b) = pr cm (a,b) / pr mB b  NB b is given in MPr ... p ... t (n, tsmO, tsmA, tsmOB) (Gen a) =  t, state trans'n (n+1, step tsmO 0, step tsmA a, tsmOB) t(n, tsmO, tsmA, _) (Jump _) = (n, step tsmO 1, tsmA, tsmOrBoth combine2 tsmO2 tsmA tsmA0) ! ... p (n, tsmO, tsmA, tsmOB) =  state > prediction modelAppRpts (predict tsmO)  "ops" (predict tsmA)  a (uniform 0 (n1))  say, Jumps (predict tsmOB)  OrBoth a a' in TSMs ... (0, tsmO, tsmA0, tsmOrBoth combine2 tsmO2 tsmA0 tsmA0) t p ... 
14/10/, 24/10/2004 
tsmO models the GenJumpRpt operations and
tsmO2 models the JustLJustRBoth operations.
tsmA0 is the "base" model for elements of the series that are
output by Gen operations.
Note the use, twice, of tsmOrBoth...,
The class structure, from super to sub, could in principle get as complex as:
class  instances  

super  TimeSeries predictors  TimeSeriesType 
sub  TimeSeriesS step predict  TimeSeriesTypeS 
sub^{2}  TimeSeries' context2model  TimeSeriesType' 
where TimeSeriesType' is ~ TimeSeriesType
augmented with an option to store a "state",
window on the wide world:

