\chapter{Computational Learning Techniques} \label{MachineLearningTechniques} \section{Introduction} The aim of this chapter is to outline and discuss the major approaches to computational learning. Based on this discussion, and the discussion in Chapter~\ref{IntelligentAuctionAgents}, we should be able to determine which learning techniques may be useful for application to our traders problem. We first, however, discuss the broad categories of computational learning techniques. According to \citeN{ThorntonLearning}, there are three main areas of computational learning: machine learning, connectionism and genetic algorithms. \emph{Machine learning} techniques focus on symbolic algorithms, where connections are made between various objects and additional information can be inferred based on these connections. \emph{Connectionism} focuses on computer based modelling inspired by the operation of the human brain. \emph{Genetic algorithms} use analogies from Darwinian evolution in an attempt to evolve a solution to a problem. However, Thornton's classification of learning techniques is incomplete, ignoring many types of learning algorithms, including important techniques such as \emph{evolutionary programming} and \emph{Q-Learning}. For our purposes, the learning techniques we will discuss can be classified into two broad fields: \emph{evolutionary computation} and \emph{reinforcement learning}. Briefly, reinforcement learning techniques decide which of a finite set of actions to output to the environment, given a predefined goal \cite{ArthurEconomicAgents}. Evolutionary computation techniques apply principles of evolution theory in order to achieve learning \cite{NissenEA}. Even though evolutionary computation can be considered broadly as a type of reinforcement learning, for clarity it is kept seperate from the other reinforcement learning techniques - as we will see, the operation of these two categories is quite different. \section{Reinforcement Learning} \label{ReinforcementTechniques} \subsubsection{Overview} \label{ReinforcementOverview} \emph{Reinforcement learning} techniques involve a decision making agent choosing repeatedly between a number of possible actions, each action having a random payoff \cite{ArthurEconomicAgents}. Each action has an associated strength, which indicates the probability of that action being chosen as the next action. If no prior iterations have occured, generally the relative strength associated with each action is distributed evenly. A negative returned from the environment results in a reduction in the strength associated with that action, whilst a positive return results in an increase in that action's strength. For example, consider an artificial agent which is capable of performing four actions: A, B, C and D. Initially, the agent is not aware of the likely impact of executing each action. As such, each action will most likely have a strength of .25 to begin. Based on these strengths, the agent can now choose the action to perform. Based on the feedback returned from the environment, the agent is able to modify the strengths of each of the actions. For example, if action C returns a negative response from the environment, the probability of executing action C at the next iteration is reduced. Likewise, if action C returns a positive response, the probability of executing action C at the next iteration is increased. As we will see, an important advantage of reinforcement learning techniques appears to be \emph{adaptability} and \emph{flexibility}. Consider our example agent outlined above. An alternative to applying a learning technique may be to embed the strengths associated with the execution of each action into the trader, with these strengths remaining fixed. The problem with this is that \emph{the environment may change over time}, thus potentially rendering the fixed set of strengths obsolete. For example, the previous best action may become worst when the environment changes. By using a reinforcement learning technique, the strength values are 'leant', thus reducing the amount of embedded, or \emph{a priori}, knowledge used by the learning agent. Whilst this is a clear advantage, there are also some problems with certain types of reinforcement learning techniques, which will be discussed later. We now turn our focus to investigating the specific types of reinforcement learning techniques which are available. We will look at the \emph{policy only} approach, the \emph{reinforcement comparison} approach, the \emph{adaptive heuristic critic}, \emph{Q-Learning}, \emph{dyna-architectures} and finally the \emph{learning automaton}. \subsection{Policy-Only} \label{PolicyOnly} In \emph{policy-only architectures}, the environment is assumed to be fixed, with only the strengths associated with each action varying from iteration to iteration \cite{SuttonAnimats}. A \emph{policy} is simply a mapping of an input from the environment to an output action. The major problem associated with the policy-only architecture is that, in many cases, the success associated with each action may vary depending on the current state of the environment. For example, the optimal action in one state may turn out to be the most disastrous in another \cite{SuttonAnimats}. This fact is not considered by the policy-only only architecture. As a result, it appears that the policy-only architecture, whilst straightforward to implement, is too simplistic to be useful for agents in most environments. An agent incorporating a policy-only architecture in a constantly changing environment would most likely fail miserably. Critically, due to the agent's inability to distinguish between environment states, very little true learning may actually take place. \begin{figure}[ht] \epsfxsize=250pt \centerline{\epsfbox{envex.eps}} \caption{\label{PolicyOnlyExample}Example environment with changing state.} \end{figure} Consider the example outlined in Figure~\ref{PolicyOnlyExample}. We have an environment with two states (A and B) and two actions (0 and 1). In state A, action 0 is the optimal action, whilst in state B the best action is 1. Any action triggers a transition to the other state. In this environment, an policy-only agent would never be able to learn anything substantial, since based on its limited perception, the environment has one state in which action 0 has a higher payoff than action 1 half the time, whilst half the time action 1 produces a higher payoff than action 0. Thus the probabilites associated with actions 0 and 1 remain at around 0.5 each. \subsection{Reinforcement Comparison} \label{ReinforcementComparison} \emph{Reinforcement comparison} architectures build on the policy-only approach, by associating a strength with each action \emph {and each state} \cite{SuttonAnimats}. This remedies the major problem of policy-only architectures, since strengths for each action are also dependant on the current state of the environment. Consider the example outlined in Figure~\ref{PolicyOnlyExample} once again. Reinforcement comparison architectures would be able to perform much better than policy-only architectures in this environment, since they would be able to identify the different levels of reward for the same actions in each of the environment's states. However, reinforcement comparison architectures ignore the change of environment state which may take place as a result of an agent carrying out an action \cite{SuttonAnimats}. For instance, it is possible that a particular action may provide a high payoff, but also change the environment to a state where low payoffs are likely. Stated simply, reinforcement comparison architectures fail to look past the immediate future - they are \emph{myopic}, looking for the best short term return, without looking at any negative long term consequences. It seems that critical to the success of reinforcement comparison techniques is an agents knowledge of the states of an environment and the actions which trigger transitions between environment states. However, we should also consider the impact of the actual structure of the environment changing. In other words, the number of states and the actions which cause state changes may be modified. If this was to occur, based on the above discussion it would appear that an agent utilising the reinforcement comparison technique would lack the autonomy to adapt on it's own, since it is unable to modify its \emph{embedded} knowledge of the structure of the environment. \subsection{Adaptive Heuristic Critic} \label{AdaptiveHeuristicCritic} The \emph{adaptive heuristic critic} architecture takes into account the long term consequences of performing a certain action \cite{SuttonAnimats}. Rather than looking at immediate reward, the adaptive heuristic critic looks at a longer term measurement of reward, or \emph{return}, to determine the next action. The basic formula for calculating return is given in figure~\ref{Sutton}, where \emph{x} is the current state of the environment. The return associated with each state is equal to the sum of all future rewards, with the strength of more immediate rewards being greater than those predicted in the more distant future. \begin{figure}[ht] \epsfxsize=250pt \centerline{\epsfbox{sutton.eps}} \caption{\label{Sutton}Calculating the return using the adpative heuristic critic (Sutton 1991).} \end{figure} \subsection{Q Learning} \label{QLearning} According to \citeN{AsadaAgents}, \emph{Q-Learning} is an 'elegant' technique for learning the nature of the envoronment and forming the policy concurrently (recall that a \emph{policy} is simply a mapping of an input from the environment to an output action). In Q-Learning, the predicted return from the environment is calculated based on state \emph{and} \cite{SuttonAnimats}. For each state, the algorithm determines the expected return for the best action \emph{and} the actual action selected. This approach appears to be more useful than the previous reinforcement learning approaches, since the nature of the environment is learnt. Recall that a problem with approaches such as the reinforcement comparison is that the knowledge of the environment is embedded and not learnt. By learning the nature of the environment, the flexibility and adaptiveness of an agent is increased. \subsection{Dyna Architectures} \label{DynaArchitectures} A limitation of the reinforcement learning techniques outlined above is their trial and error nature \cite{SuttonAnimats}. The only way an agent can find what happens when a particular action is performed is to actually trigger that action. This can cause problems. Consider, for example, the risk associated with executing a previously untried action. It would be useful if we could experiment with actions without actually needing to trigger the action in the actual environment. \emph{Dyna architectures} attempt to allow this, by implementing a simple form of planning. This is achieved via the development of a world model \cite{Whitehead}. Refer to Figure~\ref{DAOverview} for an overview of the dyna algorithm. As the figure indicates, the world model is formulated based on the agents interactions with the actual world. Note the way in which actions are executed hypothetically on the world model. This allows the agent to assess the impact of the action without the risk associated with executing the action in the 'real' environment. An additional advantage of dyna-architectures is that an agent becomes more adaptable. This is because \emph{goal changes can be acommodated more easily} \cite{SuttonAnimats}. This is due to the fact that if the goal changes, the world model remains fixed. Using a pure trial and error learning technique, problems can arise when the goals of an agent change, \emph{all prior policy information becomes useless}. \citeN{SuttonAnimats} presents a specific implementation of dyna-architectures utilising Q-Learning. He concluded, based on a series of simple experiments, that performance of dyna-architectures was significantly improved over that of the other reinforcement learning techniques, concluding this improvement was as a result of the use of an internal world model. A problem with Sutton's experiments, however, is that the agents knew the states and transitions of the world to begin. In the vast majority of cases the states and transitions of the world are not known with certainty - the agent must make an estimation of these parameters based on previous experience \cite{SuttonAnimats}. Thus while the performance of the dyna algorithm is excellent for agents dealing in reasonably small, well understood environments, the usefulness of the algorithm in less understood environments is still unknown. \begin{figure} \begin{small} \begin{verbatim} begin repeat observe current state and choose an action; send the action to the world and observe the resultant next state and reward; apply reinforcement learning technique to experience; update world model based on experience; repeat select a hypothetical state and action; send current state and a hypothetical action to world model obtain predictions of next state and reward; apply reinforcement learning technique to hypothetical experience; until (repeated k times) until (termination condition) end \end{verbatim} \end{small} \caption{\label{DAOverview}Overview of dyna-algorithm (Sutton 1991).} \end{figure} \subsection{The Learning Automaton} \label{TheLearningAutomaton} Broadly speaking, the learning automaton approach to learning is consistent with that of other reinforcement learning techniques. Given a finite number of allowable actions, a learning automaton attempts to choose the most appropriate \cite{NarendraLearning}. A learning automaton has a transition function which maps the current state and input into the next state. Additionally, an automaton has an output function which is responsible for determining the output to return to the environment. There are three broad categories of learning automata: \emph{deterministic}, \emph{fixed-structure stochastic} and \emph{variable-structure stochastic}. We now outline each of these approaches, from simplest to most complex. \subsubsection{Fixed Structure Deterministic Automata} \label{DeterministicAutomata} With a \emph{deterministic}, or \emph{fixed probability deterministic} automaton, the probabilities associated with the transition of states remain fixed. Additionally, these probabilities are either 0 or 1 - there are no random factors involved in a state transition \cite{NarendraLearning}. If the automaton is in state A and receives input B, we can always conclude the next state with certainty. No updating of the transition or action probabilities take place, so clearly this approach is limited, particularly if an automaton is going to exist within a dynamic environment. \subsubsection{Fixed-Structure Stochastic Automata} \label{FixedStructureStochasticAutomata} \emph{Fixed structure stochastic automata}, or \emph{fixed probability stochastic automata}, function in a similar way to deterministic automata. The fundamental difference is that fixed structure stochastic automata have a random element associated with their state transitions \cite{NarendraLearning}. To rephrase, the probabilities associated with moving from one environment state to another, can be anywhere between 0 and 1. However, these probabilities remain fixed. Whilst \citeN{NarendraLearning} classify this, and the similar \emph{fixed-structure deterministic automata}, as learning techniques, in reality it appears that no learning actually takes place. No updating of the probabilities associated with each state transition takes place, which clashes with our definition of reinforcement learning, where probabilities are updated based on past experiences. Given this, the benefits of this technique appear limited. \subsubsection{Variable-Structure Stochastic Automata} \label{VariableStructureStochasticAutomata} \emph{Variable structure stochastic automata}, or {variable probability stochastic automata}, resolve the major limitation of fixed-structure automata, since the probabilities associated with each action are able to be updated based on previous experience \cite{NarendraLearning}. Of the three techniques we have discussed in this section, the variable-structure stochasitc is the only true 'learning' technique, since only this category changes it's probabilities (or structure) based on its interactions with the environment. Given this, variable-structure stochastic automata are clearly the most useful of the automata discussed here, since they are best able to adapt to changes in the environment. \section{Evolutionary Computation} \label{EvolutionaryComputation} \subsection{Overview} \label{EvolutionaryComputationOverview} The term \emph{evolutionary computation} is used to describe the broad classification of learning techniques which borrow principles from evolution theory \cite{NissenEA}. They are a useful category of search and optimisation techniques, being able to find approximate solutions to problems with a huge number of potential solutions. \citeN{NissenEA} outlines the key steps in the evolutionary computation process as follows. Firstly, a number of potential solutions to the particular problem are generated, forming a new population. Each member of this population is then evaluated, using a \emph{fitness function}. A fitness function is simply a procedure which quantitavely measures the current worth of a particular individual in relation to the achievement of it's goals. Additionally, some type of selection procedure is employed to determine which individuals live on and which die off. Once this has been determined, a new generation of solutions is created, by applying some operations on members of the current population. Two of the more important of these operations are \emph{mutation} and \emph{crossover}, although the extent to which they are used are dependant on the specific evolutionary computation technique being used This entire process is repeated continually until some termination criteria is met. For example, the alogorithm may iterate until the population evolves to an adequate level of fitness. Alternatively, the process may iterate until the main loop of the algorithm has been executed a particular number of times. We will consider four variants on the evolutionary computation theme: \emph{evolutionary programming}, \emph{evolution strategies}, \emph{genetic algorithms} and \emph{genetic programming}. We also briefly look at \emph{neural networks}, although a detailed discussion of this approach is outside the scope of this thesis. We first, however, intoduce the notion of learning as an optimisation process. Since evolutionary computation techniques are all optimisation procedures, it is important to firstly introduce the notion of optimisation. \subsection{Learning as an Optimisation Process} \subsubsection{Overview} In Section~\ref{ReinforcementTechniques}, we have discussed learning as a process of making adjustments based on previous experiences. Learning can be viewed as a search for an optimal configuration of parameters, or optimal sequence of events. In other words, we can view learning as an \emph{optimisation} problem. After more clearly defining the term \emph{optimisation}, we briefly introduce the various categories of optimisation problems and the potential difficulties which arise in attempting to solve these problems. We limit the discussion to only those categories which are relevant to our artificial traders problem. \subsubsection{Optimisation} \citeN{SchwefelES} defines the process of optimisation as, '...selecting a better or best alternative from a number of states of affairs.' Where we have the possibility of choosing between two or more alternatives, we should choose the best. In this respect, the reinforcement learning techniques which have been outlined in Section~\ref{ReinforcementTechniques} can also be viewed as optimisation procedures. Critical to an optimisation problem are the \emph{parameters} and \emph{objective function}. The parameters are those controllable aspects of the problem, or as stated by \citeN{SchwefelES}, the 'independant features that distinguish the results from one another.' An optimisation algorithm is able to modify these parameters to create new potential solutions to the problem. The objective function is simply a fitness function, which returns the fitness of a particular solution. It is the objective of an optimisation procedure to find the combination of parameter settings which return the best fitness. \citeN{HollandGA} describes optimisation as a search through an \emph{n} dimensional landscape, where the peaks in the landscape (or mountains) represent parameter settings of high fitness, whilst the valleys represent parameter settings of comparitively low fitness. The aim of an optimisation procedure is to reach the top of the highest mountain in the landscape. As we will see, optimisation procedures often have difficulty in identifying whether the mountain it has climbed is really the largest. \subsubsection{Categories of Optimisation Problems} \label{OptimisationCategories} \paragraph{Experimental Optimisation Problems} Experimental optimisation problems are those where the relationship between the parameters to be optimised and the fitness function is not known \cite{SchwefelES}. In other words, we cannot describe this relationship mathematically. According to \citeN{SchwefelES}, a common problem with experimental optimisation problems is that often 'noise' from the environment can interfere with the fitness measurement function. For example, the fitness of an artificial trader may be measured by the current endowment held by the trader. However, this measurement is affected by factors beyond the control of the trader, such as market volatility, market price of the asset and so on. \paragraph{Mathematical Optimisation Problems} Mathematical optimisation problems, in contrast to experimental optimisation problems, are those in which the relationship between the parameters to be optimised and the fitness function is \emph{directly} known \cite{SchwefelES} and can be described mathematically. For example, the \emph{sphere} model objective function is a mathematical optimisation problem, since we can describe the relationship between it's parameters (the \emph{x} array) and the eventual fitness returned in pseudocode (see Figure~\ref{SphereObjectiveFunction1}). Note that the term \emph{dimension of problem} mentioned in this figure simply refers to the number of parameters which are to be optimised. \begin{figure} \begin{small} \begin{verbatim} result = 0; i=0; loop dimension of problem times result = result + x[i]*x[i]; i=i+1; end loop; return result; where: x = array of current parameter values, of size dimension of problem result = fitness produced by current parameter configuration \end{verbatim} \end{small} \caption{\label{SphereObjectiveFunction1}Pseudocode for the sphere model objective function. Adapted from Schwefel (1993).} \end{figure} \paragraph{Constrained Optimisation Problems} Constrained optimisation problems are simply those which have limits on the values the optimisation parameters are able to take. For example, in our artificial traders problem, the three values which are to be optimised must at all times be greater than 0 (see Section~\ref{SimulationDesign}). \subsection{Evolutionary Programming} \label{EvolutionaryProgramming} The poineering work of \citeN{FogelAI} suggested that evolutionary programming could be used to achieve a form of artificial intelligence, by being able to predict the environment, and determining a suitable output to return to the environment based on the predicted input. This notion gave birth to the evolutionary programming field. \emph{Evolutionary programming} techniques focus on evolving populations of elements, gradually improving the quality of each population over time with respect to the achievement of a particular goal. In this respect evolutionary programming is similar to the genetic algorithm outlined in Section~\ref{GeneticAlgorithms}. Unlike genetic algorithms, evolutionary programming techniques stress the behavioural link between parent and offspring, rather than the transmission of genetic material \cite{FogelEPIntro}. The broad steps in the evolutionary programming algorithm are outlined in figure~\ref{EPOverview}. \begin{figure} \begin{small} \begin{verbatim} begin generate initial population of m individuals at random; evaluate initial population members using fitness function; current population = initial population; repeat intermediate population = current population (parents); for i=1 to m do begin generate ith offspring by duplicating ith parent; mutate ith offspring using normally distributed random numbers; evaluate ith offspring using fitness function; insert ith offspring into intermediate population; end; for j=1 to 2m do hold tournament between jth individual and a number of other individuals from the population (fitness values of each individual determines who wins the tournament); rank individuals based on number of wins in tournament; select highest ranking individuals to form new population; until (termination condition) end \end{verbatim} \end{small} \caption{\label{EPOverview}Overview of the evolutionary programming algorithm (Nissen 1993).} \end{figure} There are many flavours of evolutionary programming which can be used depending on the needs of the specific situation. Additionally, evolutionary programming techniques are quite adaptable. For example, a different level of reward can be returned for predicting different input symbols from the environment \cite{FogelEPIntro}. For example, we can allocate a higher reward for predicting an input symbol which represents possible danger to the agent. In this case, it is not of prime importance that the algorithm may occasionally predict a particular danger symbol when in fact a different symbol is actually returned from the environment. What is more important is that the algorithm never fails to predict an actual danger symbol from the environment. \subsubsection{Extensions} \label{EPExtensions} Several extensions have been suggested to the basic evolutionary programming technique outlined above. One such extension was proposed by \citeN{FogelMetaEP}, who suggested that the performance of evolutionary programming algorithms could be further enhanced by optimising at a higher level the \emph{a priori} parameters required by the evolutionary programming algorithm, such as mutation rate. This process operates concurrently with the 'normal' evolutionary programming algorithm and is known as \emph{meta-evolutionary programming}. The benefits of meta-evolutionary programming seem clear. By 'learning' rather than embedding the evolutinary programming parameters, theoretically we should approach a closer to optimal result. Additionally, the algorithm is less reliant on the initial values specified by the programmer in order to suceed. However, in reality the meta-evolutionary programming algorithm actually underperformed in comparison to the standard evolutionary programming algorithm \cite{FogelMetaEP}. \subsection{Evolution Strategies} \label{EvolutionStrategies} \subsubsection{Overview} \label{EvolutionStrategiesOverview} \emph{Evolution strategies} are an evolutionary computation technique closely related to the genetic algorithm. Under the evolution strategy approach, a population of potential problem solutions produce offspring, which are manipulated via the operators of \emph{mutation}, \emph{selection} and, in some implementations, \emph{recombination} \cite{SchwefelES}. The dominant operator within the evolution strategy is mutation. Like the other evolutionary computation techniques, an objective function is used to quantitavely measure the fitness of a particular individual. The broad steps in the evolution strategy algorithm are presented in figure~\ref{ESOverview}. Note that this algorithm is a fairly general description of a \emph{multimembered} evolution strategy. One of the significant differences between the evolution strategy and the genetic algorithm is that, rather than representing populations of potential solutions as bit-strings, evolution strategies represent solutions as real-valued vectors \cite{NissenEA}. Each element of this vector corresponds to a particular parameter value which we are trying to optimise. Further details of the representation of individuals in an evolution strategy population is given in section~\ref{IndividualRepresentation}. Mutation in the evolution strategy takes the form of adding normally distributed random variables to each of the elements of this vector \cite{NissenEA}. As will be shown later, mutation takes place not only on the parameter values, but also the step sizes, which are responsible for controlling the rate of mutation. \begin{figure} \begin{small} \begin{verbatim} begin generate initial population of n real-valued individuals at random; evaluate individual population members using fitness function; repeat intermediate population = {}; for i=1 to y do begin select two individuals for mating with equal probability; randomly copy component values from parents to offspring; average parent values to determine strategy parameter values of offspring; mutate offspring using normally distributed random numbers; insert ith offspring into intermediate population; end; evaluate new individuals using fitness function; select y best individuals (may be offspring or parents) to form new population; until (termination condition) end \end{verbatim} \end{small} \caption{\label{ESOverview}Overview of the evolution strategy algorithm (Nissen 1993).} \end{figure} Evolution strategies were initially developed by two students at the Technical University of Berlin, who were aiming to solve a technical optimisation problem \cite{SchwefelES}. One of the students decided to change the function parameters of this problem randomly to see if this could bring about a faster rate of convergance to a near optimal solution. As \citeN{HerdyES} outlines, evolution strategies have been shown to be useful for solving optimisation problems and are flexible enough to adapt should the optimum change over time. Additionally, little prerequisite assumptions of the problem domain are required, which means that the evolution strategy should be able to adapt to a broad range of situations. Although evolution strategies are normally used to solve continuous problems, it has been shown that they can also be used for solving discrete optimisation problems and even pattern matching problems \cite{HerdyES}. However, for the evolution strategy to be succesful the optimisation problem should have the property of \emph{strong causality} \cite{HerdyES}. By this it is meant that a small change to parameter values will result in a small change to the fitness of the solution, whilst a large change to parameter values will result in a large change to the fitness of the solution. There are many variants on the basic evolution strategy theme. To describe the specific type of an evolution strategy, the following format is normally used, outlined by \cite{SchwefelES}: \begin{center} \emph{($\mu$,$\lambda$)} or \emph{($\mu$+$\lambda$)} \end{center} In these examples, $\mu$ represents the number of parents per generation and $\lambda$ the number of offspring produced per generation. For example, an evolution strategy of type (1+1) represents the \emph{two membered evolution strategy}, outlined in section~\ref{TwoMemberedES}. The (1+1) evolution strategy population consists of one parent, which produces a single offspring per generation. The differences between the plus and comma strategies are explained in section~\ref{CommaPlus}. \subsubsection{Reperesentation of Individuals} \label{IndividualRepresentation} As outlined earlier, each individual in the evolution strategy population describes a potential solution to the specified problem. Each solution specifies a certain parameter configuration, or a particular point in the search space. Each individual consists of a vector of real valued parameters, described by \citeN{LohmannES} as \emph{object variables}. Additionally, each individual contains another vector of real valued paremeters, described as \emph{strategy variables}. The basic structure of an evolution strategy 'individual', or potential solution, is given in Figure~\ref{ESIndividualStructure} The step size variables determine the degree to which the object variables are to be mutated. In the multimembered evolution strategy, the strategy variables themselves are also mutated as well as the object variables, so in fact adaption is occuring at two levels (see section~\ref{MultimemberedES}). The significance of the strategy variables relates to their ability to control mutation step size, which is critical to the success of the evolution strategy \cite{SchwefelES}. \begin{table}[!hbt] \begin{center} \begin{tabular}{|l|l|l|l|} \hline \hline & 0 & 1 & 2 \\ \hline \hline x Vector & Param. 0 Value & Param. 1 Value & Param. N Value \\ \hline Step Size Vector & Param. 0 Step Size & Param. 1 Step Size & Param. N Step Size \\ \hline \end{tabular} \end{center} \caption{\label{ESIndividualStructure}Basic structure of an evolution strategy individual.} \end{table} \subsubsection{Comma versus Plus Strategies} \label{CommaPlus} As outlined by \citeN{SchwefelES}, the selection mechanism of the evolution strategy is responsible for choosing which members of the existing population are to survive and become parents of the following generation. There are two broad selection strategies, the \emph{plus} strategy, reperesented by ($\mu$+$\lambda$), and the \emph{comma} strategy, represented by ($\mu$,$\lambda$) Under the plus strategy, parents for the next generation are chosen from both the existing offspring \emph{and} parents. Note that under this approach parents may live an infinitely long period, in contrast to the behaviour of real world entities \cite{SchwefelES}. Alternatively, the new set of parents may be selected from only the current population of offspring, with all parents dying off. This approach is characterised as the \emph{comma} strategy. This approach complies more closely with natural evolution and according to \citeN{SchwefelES} may often lead to a better solution than the plus strategy. At first glance it may appear that the plus strategy is preferable, since this strategy does not allow degradation of fitness from one population to another. However, we must remember the importance of correct mutation sizes (or step lengths) in finding an optimal solution. As noted by \citeN{SchwefelES}, a particularly large mutation variance may result in a large jump closer to the optimum. But after this jump has been taken, we are left in the situation where offspring which inherit these large mutation variances are unlikely to perform well. As stated by \citeN{SchwefelES}: \begin{quote} While the variance allocated to [the individual] was eminently suitable for the preceding situation, it is not suited to the new one, being in general much too big... This increases the probability of a succesful mutation still having a poorly adapted step length. \end{quote} Thus it can be concluded that in the majority of cases, surprisingly, the comma strategy may be preferable to the plus strategy. \subsubsection{Two Membered Evolution Strategies} \label{TwoMemberedES} This approach is the simplest of the evolution strategies. It can also be described as the (1+1) strategy. As outlined by \citeN{SchwefelES}, the population of a two membered evolution strategy consists of one parent and one offspring. The offspring is generated by copying the parent and mutating it by adding randomly generated numbers. Following this, the fitness of each of these individuals is measured, with the fitter of the parent or offspring surviving. Under this approach, mutation is the sole means of modifying the object and strategy variables. Recombination is obviously not possible, given that there is only ever one parent in the population during a single iteration. Furthermore, only the plus strategy is applicable with two membered evolution strategies. To use a comma strategy in this situation is impractical, given that the one offspring would always become the parent of the next generation irrespective of its fitness. \subsubsection{Multimembered Evolution Strategies} \label{MultimemberedES} As the name suggests, the multimembered evolution strategy involves a population of many (more than two) individuals, in contrast to the two membered evolution strategy. The multimembered evolution strategy operates along similar lines to the two membered evolution strategy, with a few noticeable differences. Firstly, apart from the obvious difference in population size, a parent is not necessarily limited to generating one offspring per iteration. Instead, the parent produces $\lambda$/$\mu$ offspring on average \cite{SchwefelES}. Additionally, since the size of the population must remain constant, only the $\mu$ best individuals (from the pool of both parents and offspring, or offspring only, depending on the type of strategy) become parents in the following generation \cite{SchwefelES}. This is also another area of difference with the genetic algorithm, in which the selection mechanism allows potentially any member of the population (even the weakest) to become a parent. The selection mechanism of the multimembered evolution strategy is harsher, only allowing the best to become parents. Finally, the ratio of parents to offspring becomes, along with step size control, an important factor in the success or otherwise of the evolution strategy \cite{SchwefelES}. \subsubsection{Step Length Control} \label{StepLengthControl} As stated by \citeN{SchwefelES}, for the evolution strategy to be succesful, the mutation rate (or step lengths) must be continually modified themselves for the optimisation process to progress at a close to ideal rate. Too low a mutation rate results in slow progress towards an optimal solution, whilst too high a rate results in only a very coarse search which may produce a solution which is far from optimal. Therefore the method to control step length is of critical importance to the success of the evolution strategy. This section describes how the evolution strategy mutates the strategy variables in order to maintain an efficient mutation rate. The techniques for controlling step size varies depending on the whether the two membered or multimembered strategy is being employed. We consider each approach seperately below, basing the discussion on \citeN{SchwefelES}. \paragraph{Two membered strategy} To control the rate of mutation using the two membered evolution strategy, the \emph{1/5 success rule} is employed. \citeN{SchwefelES} describes this rule as follows: \begin{quote} From time to time during the optimum search obtain the frequency of successes, i.e., the ratio of the number of successes to the total number of trials (mutations). If the ratio is greater than 1/5, increase this variance, if it is less than 1/5, decrease the variance. \end{quote} A succesful mutation is simply one which results in a fitter offspring than parent. The 1/5 success rule arose through the investigations of \citeN{RechenbergES} into the two membered evolution strategy. After trialling the rule on two model objective functions (the \emph{sphere} model and \emph{corridor} model), he found that the rule is effective in achieving close to the best rate of progress towards the optimum. However, the rule in the format outlined above leaves some unanswered questions. In particular, by what amount do we increase or decrease the mutation variance? Additionally, how often do we need to measure the mutation success rate? To answer these questions, the following more descriptive rule was proposed, based on further theory from \citeN{RechenbergES}: \begin{quote} After every \emph{n} mutations, check how many successes have occured over the preceding 10 \emph{n} mutations. If this number is less than 2 \emph{n}, multiply the step lengths by the factor 0.85; divide them by 0.85 if more than 2 \emph{n} successes occured \cite{SchwefelES}. \end{quote} \paragraph{Multimembered strategy} One of the criticisms of the two membered evolution strategy is that the 1/5 success rule is not based on any biological process \cite{SchwefelES}. The multimembered evolution strategy attempts to address this criticism by assigning a further set of variables to each evolution strategy individuals. These variables describe the degree of variance of the random changes of the step lengths. \subsubsection{Recombination} \label{ESRecombination} Recombination is the combining of genes to produce new offspring. In the context of evolution strategies, recombination refers to the formation of a new offspring by copying (or in some cases averaging out) object and strategy variables from two or more parents. \citeN{SchwefelES} states that the evolution strategy permits crossover to occur at both the object \emph{and} strategy variable level. Further flavour is added to recombination under the evolution strategy, since recombination is able to occur either by recombining material from pairs of parents, or \emph{all} parents. Additionally, both discrete and intermediate recombination is permitted. \citeN{SchwefelES} outlines four possible types of recombination: discrete recombination of pairs of parents; intermediary recombination of pairs of parents; discrete recombination of all parents; and intermediary recombination of all parents in pairs. These recombination possibilities are another area of difference in comparision to the genetic algorithm, where only discrete recombination between two parents is possible (see section 1). We outline the distinction between the \emph{discrete} and {intermediate} recombination approaches below. \paragraph{Discrete Recombination} \emph{Discrete recombination} is the genetic algorithm style of crossover, where components from parents are \emph{directly copied} to the offspring \cite{SchwefelES}. \paragraph{Intermediary Recombination} \emph{Intermediate recombination} occurs when offspring do not take on a direct value from a particular parent, but instead take on the average of the particular value held by both parents. \citeN{SchwefelES} suggests that this is a useful approach for recombining strategy variables. \subsubsection{Extensions of the Evolution Strategy} \label{EvolutionStrategyExtensions} Researchers are investigating the ways in which evolution strategies can be further extended and improved. One area of research is that of \emph{distributed evolution strategies}. \citeN{RudolphES} outlined a coarsely distributed evolution strategy, where the population is split into a number of sub-populations, each contained on a seperate processor. Under this approach, individuals are able to \emph{migrate} from one sub-population to another, which allows the exchange of information between distributed evolution strategy processes. A further area of evolution strategy research is that of \emph{multiple criteria optimisation}. Traditional approaches to optimisation have focused on single criteria optimisation, where only one objective function is used to measure performance. The reasoning behind multiple criteria optimisation, according to \citeN{KursaweES}, is that optimisation using only one objective criteria is unrealistic, given the complexity and variety of aspects to consider in real world problems. In the same paper, Kursawe also expresses interest in seeing whether or not more ideas from nature, such as aging and fertility rates, may be eventually incorporated into the evolution strategy and if so, in what form. \subsection{Genetic Algorithms} \label{GeneticAlgorithms} Genetic algorithms, again based on the same underlying evolutionary computation algorithm, make heavy use of \emph{crossover}, whilst utilising \emph{mutation} to a lesser extent to encourage investigation of previously unexplored search space \cite{NissenEA}. \emph{Crossover}, or recombination, is the process of combining genetic material from two parents in order to create offspring. In the context of genetic algorithms, crossover works as follows. As outlined by \citeN{HollandGA}, it involves the exchange of genetic material between two members of the population. In terms of the categories of recombination described in Section~\ref{ESRecombination}, genetic algorithms use discrete recombination of pairs of parents. For example, consider individuals A and B, which the genetic algorithm has decided to mate. The bit string representation of these individuals is outlined in Figure~\ref{GA1}. \begin{figure} \begin{small} \begin{center} \begin{verbatim} Individual A: 1010110010101011100100010 Individual B: 1011001010011010100101010 \end{verbatim} \end{center} \end{small} \caption{\label{GA1}Representation of individuals A and B.} \end{figure} Genetic material is exchanged based on the \emph{crossover point}, as indicated in figure~\ref{GA2}, in order to form new individuals. \begin{figure} \begin{small} \begin{center} \begin{verbatim} Individual C: 1010110010101 | 010100101010 Individual D: 1011001010011 | 011100100010 \end{verbatim} \end{center} \end{small} \caption{\label{GA2}Representation of individuals C and D.} \end{figure} \emph{Mutation}, in the context of genetic algorithms, can be viewed as flipping a randomly chosen bit in a solution. According to \citeN{HollandGA}, this allows exploration of previously uncharted search space. Additionally, it attempts to ensure that populations do not all uniformly search the same space. The selection mechanism used by genetic algorithms allows potentially \emph{any} member of the population, even the weakest, to reproduce \cite{HollandGA}. An alternative selection mechanism has been suggested by \citeN{GATournament}. Under this approach, pairs of individuals 'compete' with one another, with the fitter of the two moving on to the next stage. The fundamental difference between evolutionary programming and genetic algorithms is that genetic algorithms make heavy use of crossover, as opposed to mutation. Indeed, some have described the use of crossover as the most distinctive feature of genetic algorithms \cite{FogelMetaEP}. Contrastingly, use of crossover in evolutionary programming techniques is considered undesirable, and even potentially destructive. \citeN{FogelMetaEP} for example state that crossover is not required for succesful adaptation. As we have seen, the genetic algorithm represents the population of solutions as a bit strings. Whilst this structure is well suited to operations such as crossover and mutation, it has also been the subject of much criticism, by \citeN{FogelNew} among others. Recent developments such as \emph{genetic programming}, outlined in section~\ref{GeneticProgramming}, attempt to address this criticism. \begin{figure} \begin{small} \begin{verbatim} begin generate initial population of binary coded individuals (solutions); repeat evaluate individual population members using fitness function; new population = {}; while new population not full select two individuals for mating according to fitness; recombine individuals according to crossover probability to form two offspring; mutate offspring according to mutation probability; add offspring to population; endwhile; old population = new population; until (termination condition) end \end{verbatim} \end{small} \caption{\label{GeneticOverview}Overview of the genetic algorithm (Nissen, 1993).} \end{figure} \subsubsection{Cultural Algorithms} \label{CulturalAlgorithms} Cultural algorithms attempt to extend and improve the basic genetic algorithm by adding a belief space to the traditional framework \cite{ReynoldsCA}. Like genetic algorithms, a population of entities is created and refined over a period of time, until the termination criteria is met. Recall from Section~\ref{GeneticAlgorithms} that under the genetic algorithm approach traits are passed down from generation to generation via use of crossover. Cultural algorithms extend this by also making use of a belief space - that is, beliefs are formed based on each entity's individual experiences. The reasoning behind this, as outlined by \citeN{ReynoldsCA}, is that cultural evolution allows populations to learn and adapt at a rate faster than pure biological evolution alone. Additionally, each entity within a cultural evolution framework has an associated mappa, which describes the individuals past experience as well as forecasts \cite{ReynoldsCA}. Importantly, the learning which takes place individually by each entity is passed on to the remainder of the group, allowing learning to take place at a much faster rate. To describe the process briefly, each individual in the population is described in terms of their current traits \cite{ReynoldsCA}. As with the basic genetic algorithm, members of the population are evaluated in order to measure their fitness at performing a particular task. Additionally, each entity creates a mappa describing their experiences. This mappa can be merged with group mappa, in order to 'blend' various individual experiences together. The performance of this mappa is then evaluated. If it's performance is below a certain level, the mappa is discarded. Refer to Figure~\ref{CulturalOverview} for a high level description of the cultural algorithm. \begin{figure} \begin{small} \begin{verbatim} begin t=0; Initialise Population POP(0); Initialise Belief Network BLF(0); Initialise Communication Channel CHL(0); Evaluate(POP(0)); t=1; repeat Communicate(POP(0), BLF(t)); Adjust(BLF(t)); Communicate(BLF(t), POP(t)); Modulate Fitness (BLF(t), POP(t)); t=t+1; Select POP(t) from POP(t-1); Evolve(POP(t)); Evaluate(POP(t)); until (termination condition) end \end{verbatim} \end{small} \caption{\label{CulturalOverview}Overview of the cultural evolution algorithm (Reynolds 1994).} \end{figure} In theory, the benefits offered by cultural algorithms over genetic algorithms seem clear. Because learning takes place at an individual \emph{and} group level, learning is able to take place at a faster rate. However, the literature currently available on this topic is fairly limited. More tangible, quantitative proof of the performance improvements in comparison to raw genetic algorithms is probably required before cultural algorithms can be considered appropriate for use. \subsection{Genetic Programming} \label{GeneticProgramming} Genetic programming is a relatively recent modification of the genetic algorithm. The major difference introduced by genetic programming is that solutions to the problem are represented as an actual computer program, as opposed to a bit string \cite{NissenEA}. Genetic programming appears to be an attempt to address criticisms directed at genetic algorithms by \cite{FogelMetaEP} among others. In particular, genetic programming addresses the criticism that a bit string problem representation creates too many difficulties. According to \citeN{KozaGP}, a computer program is a much more 'natural' way of representing a problem solution, given the flexibility offered by computer programs in terms of both size and shape. The broad steps in the genetic programming algorithm are outlined in figure~\ref{GPOverview}. \begin{figure} \begin{small} \begin{verbatim} begin generate an initial population of random computer programs; repeat execute each program in population, assigning a fitness value based on it's ability to solve the problem; create a new population of computer programs by either: - copying existing programs to the new population; - genetically recombining randomly chosen parts of two existing programs; until (termination condition) let the best program in current population be the solution; end \end{verbatim} \end{small} \caption{\label{GPOverview}Overview of the genetic programming algorithm (Nissen 1993).} \end{figure} \subsection{Neural Networks} \label{NeuralNetworks} A complete discussion of neural networks is outside the scope of this report. However, for completeness the major features of neural networks are outlined below. Neural networks are inspired by the way in which the human mind operates. They are a simple model of the brain's thought processes \cite{BlumNN}. Some areas of application include forecasting and image recognition. Briefly, a neural network is made up of a number of interconnected neurons. Each connection has an synapse, which describes the weight of the connection. The knowledge acquired by the network is stored in these synapses. For a broader introduction, refer to \citeN{BealeNN} or \citeN{BlumNN}. \section{Evaluation of Learning Techniques} \label{EvaluationofLearningTechniques} We begin this discussion with an introduction to some of the broad problems associated with reinforcement learning techniques in general and the solutions to these problems, if any, offered by the various techniques. Our aim is eventually to conclude which of the computational learning techniques we have introduced may be suitable candidates for application to our traders problem. One of the most important problems associated with both reinforcement learning techniques and evolutionary computation techniques is premature convergance. There is always the possibility that a learning agent will center in too quickly on a succesful, but suboptimal action \citeN{ArthurEconomicAgents}. Ideally, there should always be scope for exploration of untried actions. A 'wild stab in the dark' may produce unexpectedly high results. Likewise, actions which previously had been returning low payoffs may unexpectedly begin producing higher payoffs, particularly if the state of the environment changes. Arthur's learning mechanism, which we will introduce in Section~\ref{IntelligentAuctionAgents}, accounts for the premature convergance problem by providing parameters for the \emph{rate} and \emph{deceleration} of learning. However, note that these parameters are set by the programmer, \emph{not by the learning mechanism itself}, thus reducing the autonomy of an agent using this technique. Genetic algorithms address the problem by randomly mutating bits, allowing exploration of previously unexplored search space \emph{HollandGA}. Similarly, the evolution strategy is able to control the rate of mutation via the step size (or strategy) parameters. If premature convergance is occuring, we can simply decrease this rate. If we are using the two membered evolution strategy, we have the additional option of using the 1/10 success rule, rather than the 1/5 success rule \cite{SchwefelES}. If we broadly compare evolutionary algorithms against other reinforcement learning techniques, such as Narendra's learning automata, Q-Learning and so forth, it appears that evolutionary algorithms may have the upper hand. In order to be effective, techniques such as reinforcement comparison and adaptive heuristic critic require knowledge of an environment, in terms of states and transitions - they are not capable of learning an environment's states and transitions in the manner of evolutionary programming. As outlined by \citeN{Kaelbling}, in the vast majority of cases learning the optimal action is not difficult to achieve in itself, but in fact \emph{learning the dynamics of the world is often more complex}. Thus while in theory the reinforcement learning techniques are capable of learning the most appropriate actions over time, \emph{in reality agents will have only limited, approximate knowledge of the environment, and thus the algorithms are likely to lose accuracy unless they can somehow learn the nature of the environment themselves.} In contrast, evolutionary computation techniques are able to evolve an approximate solution to a problem, \emph{even with a very limited amount of knowledge of the problem domain}. Whilst some prior knowledge of the problem domain would clearly be useful, since it is possible to incorporate this knowledge into the initial population to increase the speed at which an appropriate solution can be produced, \emph{this is by no means essential}. From this, therefore, we can conclude that in most cases the use of an evolutionary computation technique would be more useful than use of a general reinforcement learning technique. Given this, we now turn our focus towards a comparison between the various evolutionary computation techniques. Despite the very close similarities between the various types of evolutionary computation techniques, various sections of the evolutionary computation community have targeted criticisms at some of the other approaches. We consider the major criticisms below. \citeN{FogelMetaEP} have identified two major problems with the genetic algorithm: \emph{premature convergance}, which we have already discussed, and \emph{problem representation}. Genetic algorithms have been criticised for the way in which solutions are represented as bit strings. This is due to the difficulties invovlved in representing a solutions a bit string - it is not a 'natural' problem representation. This is in contrast, for example, to the evolution strategy, where individuals are represented simply as a vector of real values. In terms of performance, \citeN{FogelMetaEP} conclude that genetic algorithms do not perform as well as evolutionary programming techniques for optimisation problems. Their experiments indicate that genetic algorithms tend to converge slowly towards the optimal solution, but then become 'stuck' at a clearly non-optimal solution. However, it seems important to consider that Fogel is at the heart of the evolutionary programming community himself and could be viewed as competing against the genetic algorithm community. As such, he may be biased towards evolutionary programming. \section{Conclusion} Based on the above discussion it appears that in broad terms evolutionary computation techniques are more flexible than the other reinforcement learning techniques discussed, due to the low degree of \emph{a priori} knowledge required. Of the evolutionary compuation techniques, the evolution strategy appears to be the most useful, due to its natural problem representation and success in many different categories of optimisation problems (see Section~\ref{EvolutionStrategies}). Further comparison of these learning techniques is provided in Section~\ref{Conclusion}, where we look specifically at the usefulness of these learning techniques in light of our traders problem.