Inductive Programming: Data

Data


A type system probably cannot(?) do all of the following:

Some variables have an origin, some do not, e.g.,
position has an origin.
length has a scale but does not have an origin.
A length may be multiplied but a position may not(?) be multiplied.
Two positions may not be added, but
the difference between two positions is a length.
Units, e.g.,
centigrade v. Fahrenheit,
feet and inches v. m., cm, and mm.
and dimensions, e.g.,
length, mass, time, electric-current, temperature, amount-of-substance (mole), luminous-intensity (candela)
-- 7 base SI dimensions for physics.
acceleration = length.time^-2
mass x acceleration = mass.length.time^-2 = force = momentum/time
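Part of the wish-list above can be approximated in Haskell. The following is a minimal sketch only: the names Position, Length, diff, offset, scale, addL, Dim, mulDim and divDim are all hypothetical, units are ignored, and dimensions are checked at run time as values (exponents of length, mass and time only) rather than by the type checker.

```haskell
-- Hypothetical names throughout; a sketch, not a units library.
newtype Position = Position Double deriving (Eq, Ord, Show)  -- has an origin
newtype Length   = Length   Double deriving (Eq, Ord, Show)  -- scale, no origin

-- The difference between two positions is a length.
diff :: Position -> Position -> Length
diff (Position a) (Position b) = Length (a - b)

-- A position may be displaced by a length, giving a position.
offset :: Position -> Length -> Position
offset (Position p) (Length l) = Position (p + l)

-- A length may be multiplied by a plain number, and lengths may be added.
scale :: Double -> Length -> Length
scale k (Length l) = Length (k * l)

addL :: Length -> Length -> Length
addL (Length a) (Length b) = Length (a + b)

-- Dimensions as values: exponents of (length, mass, time) only.
data Dim = Dim Int Int Int deriving (Eq, Show)

mulDim, divDim :: Dim -> Dim -> Dim
mulDim (Dim l1 m1 t1) (Dim l2 m2 t2) = Dim (l1 + l2) (m1 + m2) (t1 + t2)
divDim (Dim l1 m1 t1) (Dim l2 m2 t2) = Dim (l1 - l2) (m1 - m2) (t1 - t2)

lengthD, massD, timeD, accelerationD, forceD, momentumD :: Dim
lengthD       = Dim 1 0 0
massD         = Dim 0 1 0
timeD         = Dim 0 0 1
accelerationD = lengthD `divDim` (timeD `mulDim` timeD)  -- length.time^-2
forceD        = massD `mulDim` accelerationD             -- mass.length.time^-2
momentumD     = massD `mulDim` (lengthD `divDim` timeD)  -- mass.length.time^-1
```

No addition or multiplication is defined on Position, so adding two positions is rejected by the compiler, while forceD == momentumD `divDim` timeD can only be confirmed when the program runs.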

Some types, and classes, of data.

atomic, discrete, categorical: data T = C1 | C2 | ... | Cn deriving (Bounded, Enum)
e.g. data Boolean = True | False
e.g. data Gender = Male | Female
e.g. data DNA = A | C | G | T
e.g. data Party = Liberal | Labor | Democrat | Green | Indep
NB. Something changes qualitatively for a "large" number of categories, maybe even for 7+ or 10+.
NB. Bounded and Enum do not imply any semantic (non-arbitrary) order on the values; see ordered below.
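As a small illustration of what Bounded and Enum provide for a categorical type, the sketch below enumerates the values and counts the categories. The helper names allValues and arity are hypothetical, and the order returned is just the declaration order.

```haskell
data DNA = A | C | G | T deriving (Eq, Show, Bounded, Enum)

-- All values of a Bounded, Enum type, in declaration order.
allValues :: (Bounded a, Enum a) => [a]
allValues = [minBound .. maxBound]

-- Number of categories of the type of the argument,
-- e.g. the arity of a multistate distribution over it.
arity :: (Bounded a, Enum a) => a -> Int
arity x = length (allValues `asTypeOf` [x])
```

For example, allValues :: [DNA] is [A,C,G,T] and arity A is 4.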
ordered: as above plus deriving (..., Ord)
e.g. data Quality = Bad | Poor | Avg | Fair | Good
e.g. data Topography = Mountains | Foothills | Plain | Coastal
See [missing persons].
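Deriving Ord makes the declaration order meaningful, so thresholds and medians become available for ordered data. A minimal sketch, with hypothetical helpers atLeast and medianQ:

```haskell
import Data.List (sort)

data Quality = Bad | Poor | Avg | Fair | Good
  deriving (Eq, Ord, Show, Bounded, Enum)

-- Keep only the observations at or above a threshold.
atLeast :: Quality -> [Quality] -> [Quality]
atLeast q = filter (>= q)

-- Median of a non-empty sample (lower middle element for even sizes).
-- Meaningful for ordered data, not for merely categorical data.
medianQ :: [Quality] -> Quality
medianQ xs = sort xs !! ((length xs - 1) `div` 2)
```

medianQ is partial: it fails on an empty list.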
hierarchic,
partially ordered
e.g. reptile | mammal( rodent | primate( chimp | gorilla))
One method...
data Animal = Reptile | Mammal (Maybe M)
data M = Rodent | Primate (Maybe P)
data P = Chimp | Gorilla
Is a primate ~ Mammal (Just (Primate Nothing)), => is a mammal ~ Mammal Nothing :: Animal.
 
Model by a suitable collection of multistate models.
Or set-based, Primate = {Chimp, Gorilla} etc., c.f. DNA.
Also see measurement accuracy, discrete.
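The Maybe-based method above can be queried as follows. This is a sketch under the assumption that Nothing means "no more specific than this"; the functions isMammal, isPrimate and subsumes are hypothetical.

```haskell
data Animal = Reptile | Mammal (Maybe M)  deriving (Eq, Show)
data M      = Rodent  | Primate (Maybe P) deriving (Eq, Show)
data P      = Chimp   | Gorilla           deriving (Eq, Show)

isMammal :: Animal -> Bool
isMammal (Mammal _) = True
isMammal _          = False

isPrimate :: Animal -> Bool
isPrimate (Mammal (Just (Primate _))) = True
isPrimate _                           = False

-- x `subsumes` y: x equals y, or x is a less specific description of y.
-- This is the partial order on the hierarchy.
subsumes :: Animal -> Animal -> Bool
subsumes (Mammal Nothing) (Mammal _) = True
subsumes (Mammal (Just (Primate Nothing))) (Mammal (Just (Primate _))) = True
subsumes x y = x == y
```

For example, Mammal Nothing subsumes Mammal (Just Rodent), but not the other way round, and neither Reptile nor Mammal (Just Rodent) subsumes the other.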
integer  
posInt >0
similarly non-neg. >=0
periodic e.g. day of the week, month.
continuous Real Float, Double
e.g. voltage, position (1D), velocity (1D)
(Complex) ?structured? (rl,im) or (r,θ)
see vector
positive e.g. mass, length, speed
periodic e.g. angle
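Periodic data, discrete or continuous, wraps around, so successor, distance and normalisation differ from the non-periodic case. A minimal sketch; the type Day and the functions next, dayDist and normAngle are hypothetical, and angles are assumed to be in radians.

```haskell
data Day = Mon | Tue | Wed | Thu | Fri | Sat | Sun
  deriving (Eq, Show, Bounded, Enum)

-- Successor that wraps from the last value back to the first.
next :: (Eq a, Bounded a, Enum a) => a -> a
next x | x == maxBound = minBound
       | otherwise     = succ x

-- Circular distance between two days of the week; never more than 3.
dayDist :: Day -> Day -> Int
dayDist a b = min d (7 - d)
  where d = abs (fromEnum a - fromEnum b)

-- Normalise an angle to the range [0, 2*pi).
normAngle :: Double -> Double
normAngle x = x - 2 * pi * fromIntegral (floor (x / (2 * pi)) :: Integer)
```

Under this view Sun and Mon are at distance 1, not 6, which is what a model of periodic data would need to respect.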
composite multivariate tuple: (T1, T2, ..., Tm)
or constructor: data Person = Person String Int
or: data Person = Person{name::String, age ::Int}
or array (homogeneous)
or list, [t], (homogeneous)
vector array (homogeneous)
e.g. m-Dim. position, velocity, force, etc.
sequence list: [t] --list of t
e.g. DNA seq., annual weather data, daily stock exchange data, visits to doctor, etc.
Element type can be multivariate.
set
list of members, e.g. [set of mutations],
or vector (bit map),
(equivalent in principle but not necessarily in practice, especially for sparse sets).
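The two set representations can be converted into each other, which is the sense in which they are equivalent in principle. The sketch below assumes, purely for illustration, a universe of the 64 integers 0..63 so that a set fits in one Word64; toBits, fromBits, unionB, interB and sizeB are hypothetical names.

```haskell
import Data.Bits (setBit, testBit, popCount, (.|.), (.&.))
import Data.List (foldl')
import Data.Word (Word64)

-- List of members (possibly with repeats) to bit map.
-- Assumes every member lies in 0..63.
toBits :: [Int] -> Word64
toBits = foldl' setBit 0

-- Bit map back to a sorted, duplicate-free list of members.
fromBits :: Word64 -> [Int]
fromBits w = [i | i <- [0 .. 63], testBit w i]

-- Set operations are single machine operations on the bit map.
unionB, interB :: Word64 -> Word64 -> Word64
unionB = (.|.)
interB = (.&.)

sizeB :: Word64 -> Int
sizeB = popCount
```

The bit map costs 64 bits whatever the set size, whereas the list costs space in proportion to the number of members, which is why a sparse set over a large universe usually favours the list.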
structured the sky is the limit, new data-types
inapplicable usually structured data, e.g.
data Person = Male | Female Int --#pregnancies!
optional: Maybe t, but with different semantics from missing data (below)!
Also Either t1 t2 -- standard H98.
Really a special kind of structured data.
Model as discrete plus a suitable model for t.
Whether an optional t was in fact present or not could be missing, and if it was known to be present then the value itself could be missing or not!
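The four situations in the last remark can be made explicit. A sketch, assuming one possible encoding: the type OptObs and the function fromNested are hypothetical, and the nested Maybe reading (outer = was presence recorded, middle = was it present, inner = was the value recorded) is only one way to arrange the layers.

```haskell
data OptObs t
  = PresenceUnknown   -- not recorded whether there was a t at all
  | Absent            -- known that there was no t
  | PresentUnknown    -- known to be present, but the value is missing
  | Present t         -- known to be present, and the value is recorded
  deriving (Eq, Show)

-- The same information as three nested Maybes, flattened.
fromNested :: Maybe (Maybe (Maybe t)) -> OptObs t
fromNested Nothing                = PresenceUnknown
fromNested (Just Nothing)         = Absent
fromNested (Just (Just Nothing))  = PresentUnknown
fromNested (Just (Just (Just x))) = Present x
```

A single Maybe t cannot distinguish Absent from the two "unknown" cases, which is why optional and missing need to be kept apart.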
properties data measurement accuracy continuous
(a) fixed, ±δ, or
(b) relative, ±x%, or
(c) arbitrary, range (lo,hi), per datum.
NB. Omitting to deal with accuracy in a data transformation can affect inferences; it is safer to inverse-transform the model.
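The three kinds of accuracy for continuous data, (a) to (c), can be carried with each datum. A minimal sketch; the type Accuracy and the functions interval and mapInterval are hypothetical names.

```haskell
data Accuracy
  = Fixed Double          -- (a) fixed, +/- delta
  | Relative Double       -- (b) relative, +/- x percent
  | Range Double Double   -- (c) arbitrary range (lo, hi), per datum
  deriving (Eq, Show)

-- The interval within which the true value is taken to lie.
interval :: Accuracy -> Double -> (Double, Double)
interval (Fixed d)     x = (x - d, x + d)
interval (Relative p)  x = (x - e, x + e)  where e = abs x * p / 100
interval (Range lo hi) _ = (lo, hi)

-- Transform the data together with its accuracy.
-- Assumes f is monotonically increasing on the interval.
mapInterval :: (Double -> Double) -> (Double, Double) -> (Double, Double)
mapInterval f (lo, hi) = (f lo, f hi)
```

For instance, data measured to a fixed ±0.5 no longer has fixed-width intervals after mapInterval log, which is one way a transformation that ignores accuracy could distort inferences.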
discrete (sometimes)
e.g. DNA
H={A,C,T}, ..., R={A,G}, Y={C,T}, K={G,T}, N={A,C,G,T} ~missing?!
(A 4-bit "set" rep. works nicely for many purposes.)
Also see hierarchic, partially ordered, above.
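The 4-bit "set" representation might look like the sketch below. The assignment of bits to bases is an arbitrary assumption, only the ambiguity codes listed above are covered, and the names iupac, compatible and nBases are hypothetical.

```haskell
import Data.Bits ((.|.), (.&.), popCount)
import Data.Word (Word8)

-- One bit per base; which bit goes with which base is arbitrary.
baseA, baseC, baseG, baseT :: Word8
baseA = 1
baseC = 2
baseG = 4
baseT = 8

-- A code denotes the set of bases it might stand for.
-- Codes not mentioned in the text above give Nothing.
iupac :: Char -> Maybe Word8
iupac 'A' = Just baseA
iupac 'C' = Just baseC
iupac 'G' = Just baseG
iupac 'T' = Just baseT
iupac 'R' = Just (baseA .|. baseG)
iupac 'Y' = Just (baseC .|. baseT)
iupac 'K' = Just (baseG .|. baseT)
iupac 'H' = Just (baseA .|. baseC .|. baseT)
iupac 'N' = Just (baseA .|. baseC .|. baseG .|. baseT)
iupac _   = Nothing

-- Two codes are compatible if they could stand for the same base.
compatible :: Word8 -> Word8 -> Bool
compatible x y = x .&. y /= 0

-- How many bases a code might stand for: 1 = exact, 4 = no information.
nBases :: Word8 -> Int
nBases = popCount
```

With this encoding R and K are compatible (both allow G) while R and Y are not.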
missing data: data Maybe t = Nothing | Just t  -- H98 standard type
There was a value but it was either not measured or not recorded.
(a) Missingness is common knowledge; need not be coded at all.
(b) Missingness is of known probability; can code using a fixed, given probability.
(c) Missingness is to be estimated once, globally, for use in all sub-models.
(d) Missingness is to be estimated per sub-model, and so may influence global model structure.
See [modelMaybe].
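Cases (a) and (b) can be sketched in terms of message lengths in bits. This is an illustration only and is not the [modelMaybe] code referred to above; the names msgLenCommon and msgLenKnown, and the idea of passing in a function valueLen that gives the message length of a present value, are assumptions.

```haskell
-- (a) Missingness is common knowledge: nothing is spent on stating it.
msgLenCommon :: (t -> Double) -> Maybe t -> Double
msgLenCommon _        Nothing  = 0
msgLenCommon valueLen (Just x) = valueLen x

-- (b) Missingness has a known, fixed probability pMiss, with 0 < pMiss < 1.
-- First state missing or present, then the value if it is present.
msgLenKnown :: Double -> (t -> Double) -> Maybe t -> Double
msgLenKnown pMiss _        Nothing  = negate (logBase 2 pMiss)
msgLenKnown pMiss valueLen (Just x) = negate (logBase 2 (1 - pMiss)) + valueLen x
```

Cases (c) and (d) would replace the fixed pMiss by an estimate, whose own statement would then also have to be paid for.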
censored data either:
data Cnsrd t = Cnsrd | Normal t
or transform the model. Related to missing, and optional, but with different semantics.
E.g. A "sticky" voltmeter measures [0.0 .. 1.0]v as 0.0v.
A reasonable, although not perfect, way to model censored data is similar to what can be done for missing data, cases (c) or (d), above. (As in ecological segmentation '05.)
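The sticky voltmeter example can be written out as follows. A sketch only: meter, readMeter, observe and censoredFraction are hypothetical names, and the estimate of the censoring probability is the plain observed fraction, which is just one simple choice.

```haskell
data Cnsrd t = Cnsrd | Normal t deriving (Eq, Show)

-- What the sticky meter displays for a true voltage.
meter :: Double -> Double
meter v | v >= 0.0 && v <= 1.0 = 0.0
        | otherwise            = v

-- Interpret a displayed reading: 0.0 only says "somewhere in [0.0 .. 1.0]".
readMeter :: Double -> Cnsrd Double
readMeter r | r == 0.0  = Cnsrd
            | otherwise = Normal r

observe :: Double -> Cnsrd Double
observe = readMeter . meter

-- Fraction of censored readings in a non-empty data set, usable as an
-- estimate of the probability of censoring, by analogy with missing data.
censoredFraction :: [Cnsrd t] -> Double
censoredFraction xs =
  fromIntegral (length [() | Cnsrd <- xs]) / fromIntegral (length xs)
```

Unlike a missing value, a censored reading still carries information, here that the voltage lay in [0.0 .. 1.0], which is why the analogy with missing data is reasonable but not perfect.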
weighted data
(i) integral: compacting repetitive values,
(ii) fractional: part membership of a class in a [mixture model].
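Both uses of weights fit one representation, a value paired with a weight. A minimal sketch; Weighted, compact and weightedMean are hypothetical names.

```haskell
import Data.List (group, sort)

type Weighted t = (Double, t)

-- (i) Integral weights: compact repeated values into (count, value) pairs.
compact :: Ord t => [t] -> [Weighted t]
compact xs = [(fromIntegral (length g), x) | g@(x : _) <- group (sort xs)]

-- Mean of weighted data. The weights may equally be
-- (ii) fractional memberships of a class in a mixture.
-- Assumes the total weight is non-zero.
weightedMean :: [Weighted Double] -> Double
weightedMean wxs = sum [w * x | (w, x) <- wxs] / sum (map fst wxs)
```

weightedMean (compact xs) agrees with the ordinary mean of xs, and the same function gives the mean of one class of a mixture when the weights are that class's memberships.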
25/5/2006, LA.
© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Wednesday, 03-Sep-2014 02:42:38 EST.