## Sunday, December 19, 2010

### Foundations of Higher-Order Probability

I think the foundations of higher-order probability theory may be in a bit of a mess.

Whereas first-order probabilities express uncertainty about the world, a higher-order probability expresses uncertainty about what value a first-order probability has. Or, if you ask a Frequentist instead of a Bayesian: first-order probabilities give limiting frequencies of results from experiments, whereas higher-order probabilities give limiting frequencies of getting certain limiting frequencies.

Uncertainty about uncertainty could be argued to defeat the whole point of having a plausible degree of belief... why can't it be collapsed into a straight degree of belief?

This paper tries to base higher-order probability on the idea that there is some "ideal expert" who knows the "true probability" which we are uncertain about. However, the paper puts it up to the agent to decide what constitutes a "fully informed ideal expert." If the ideal expert knows which events actually happen or not, then the first-order probabilities will all be 0 or 1, so the higher-order probabilities take on the role that 1st-order probabilities usually take on. In the opposite extreme, the "ideal expert" is just the agent itself, so that the higher-order probabilities are all 1 or 0 and the first-order probabilities are known. (All this is specifically laid out by the author.) This seems to put higher-order probability on a very shaky footing; at least to my ear.

Consider the following situation concerning an event A:

P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2

In English, we do not know if the probability of A is 1/3 or 2/3-- we assign 50% probability to each.

Now, two plausible principles of higher-order reasoning are as follows.

• Expectation principle: We can find our first-order belief in event A by a weighted averaging over the possibilities; ie, P(A)=E(P(A)), where E is the expectation operator (which takes the average value, weighted by probability). This is necessary to convert higher-order distributions into numbers we can use to place bets, etc.
• Lifting principal: If we know X, we know P(X)=1. This paper argues for a related rule: P(X)=p if and only if P(P(X)=p)=1. Either rule is sufficient for the argument that follows; specifically, I only need that P(X)=p implies P(P(X)=p)=1. (This is a special case of my "lifting" by instantiation of X to P(X)=p, and a weakening of their coherence-based principle since it takes just one direction of the "if and only if".) [It is interesting to note the similarity of these rules to proposed rules for the truth predicate, and in fact that is what started my investigation of this; however, I won't give significant space to that now.]
Granting both of these rules,  it is not hard to see that any non-trivial higher-order distribution will be contradictory. Returning to the example at hand, we apply the two rules in succession:

P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
->
P(A)=1/2 via expectation,
->
P(P(A)=1/2)=1 via lifting.

But this contradicts the original distribution; we have

P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
P(P(A)=1/2)=1,

which sums to 3/2, violating the laws of probability.

For higher-order probability to be non-trivial, then, it seems we  must reject one or both of my given principles. If this is granted, I think the best view is that the expectation principle should be abandoned; after all, that paper I cited gives a pretty good argument for a form of lifting principle. One way of looking at the expectation principle is that it flattens the higher-order distribution to a first-order one, trivialising it.

However, I want to preserve both principles. Both seem totally reasonable to me, and I've found a solution for keeping a form of each without any problem. I will argue that the problem here is one of equivocation; the notation being used is not specific enough. (The argument I use comes from my brother.)

For concreteness, suppose the event 'A' I mentioned is flipping a coin and getting heads. One scenario which fits with the probabilities I gave is as follows. We have two unfair coins in a bag; one gives heads 2/3 of the time, and the other gives heads 1/3 of the time. If we want a fair chance of heads, we can just draw at random and flip; half the time we will get the coin bias towards heads, but half the time we will get the one bias against. Since the bias is equal in each direction, we will get heads 50% of the time. This illustrates the expectation principle in action, yet it does not trivialise the higher-order characterisation of the situation: if we draw one coin out of the bag and keep it out, we can try and determine whether it is the 1/3 coin or the 2/3 coin through experimentation. This involves a Bayesian update on the higher-order distribution I gave.

The intuition, then, is that the "flattened" probability (1/2) and the "non-flat" probabilities (1/3 and 2/3) are the limiting frequencies of very different experiments. There is no contradiction in the two distributions because they are talking about different things. 1/2 is the limiting frequency of drawing, flipping, and re-drawing; 1/3 or 2/3 is what we get if we draw a coin and don't re-draw as we continue to flip (and we will find that we get each of those 1/2 of the time).

So, how do we formalise that? One way, which I particularly like, is to let the probability operator bind variables.

I'll use the notation P[x](A)=e to represent P binding variable x (where A is some statement with x as a free variable). Intuitively, the meaning is that if we choose x randomly, the statement A has probability e of being true. In addition, we can talk about choosing multiple variables simultaneously; this will be given notation like P[x,y,z](A)=e.

In this notation, the two principles become the following:

• Expectation principle: If we have a higher-order distribution represented by a collection of statements P[x](P[y](A)=a)=b, P[x](P[y](A)=c)=d, et cetera, then we can form the first-order distribution P[x,y](A)=E[x](P[y](A)), where E[x](P[y](A)) is the expected value of P[y](A), choosing x randomly.
• Lifting principle: If we know some statement A (possibly with free variables), we can conclude P[x](A)=1, for any variable. (Together with the expectation principle, this implies the related if-and-only-if statement from the paper I cited.)
With these rules, the argument from earlier has a different conclusion:

P[x](P[y](A)=1/3)=1/2
P[x](P[y](A)=2/3)=1/2
->
P[x,y](A)=1/2 via expectation,
->
P[z](P[x,y](A)=1/2)=1 via lifting.

This makes my earlier point about equivocation clear: the concluding probability is not about P[y](A) at all, but rather, is about P[x,y](A).

There are many details to fill out here; we'd like convenient notation for probability densities,  a proper analysis of what subjective probabilities are like in this system, and several other things. However, I feel these are better left to a paper rather than a blog post. :)

PS-- if anyone has references on similar systems, that'd be great!
PPS-- This paper talks about similar issues. :D

## Sunday, December 5, 2010

### Three Dreams

(1) I wish I had time to build self-education software. This is a wish carried over from my brother, who came up with the idea; a system which hands you exactly the practice problem you need at the moment, to challenge you, keep your interest, but not give you something you can't do. Ideally, the problem sets would be created and shared in a wiki-lije online community (which is the main difference from current high-end education software!) The broadness enabled by this could revolutionise education, job training, and certification. (Since the program estimates your skill level to determine what problem to give you, sitting potential employees down for a session on the machine would give valuable information about their abilities.)

(2) I would also like to have the time to build a powerful "open work" software in the spirit of mechanical turk, love machine, rentacoder, and others. The hope would be to emphasise love machine's social aspect, but mechanical turk's "all are welcome" aspect. Ideally, the environment would also feel like stack overflow-- you can build reputation by answering questions for free, so that you are more likely to get paid in the future. There are a lot of issues to be dealt with here! The nature of contracts, the details of any "reputation" system... it's complicated. However, it would be great for the efficiency of the intellectual labour market!

(3) I want to build a Bayesian reasoning system capable of acting as mankind's store of knowledge, so that all arguments for and against certain points can be kept track of and weighed properly against the evidence. This is the least realistic of the three dreams, but if it could replace/augment Wikipedia, it would help to end debates like global warming (or, rather, force the debate to proceed entirely from empirical evidence and in a logically correct manner). There are lots of big problems with this one, including the problem of verifying sources of empirical data.

## Monday, November 15, 2010

### Bayesian Statistics

I'm in a statistics class this semester. On the first day, the professor gave an argument for Bayesianism: frequentist probabilities are only defined when events can be viewed as members of a series of experiments with a limiting frequency, but we also sometimes want to talk about the probability of events which can't be framed in that way. For example, Obama being re-elected is a singular event, so we would have difficulty framing it as one of a sequence of experiments. Bayesianism extends the notion of probability to such cases.

Since then, of course, I've been bringing up a few Bayesian points in class when relevant. On his end, the prof goes so far as to point out that Bayes Law is not strictly necessary whenever he uses it, and work through the problem a second time avoiding Bayes Law.

This makes me want to write about a few things.

• Higher Moments
• The square and the second moment
• Least-square fitting
• p-testing and the sort of refutation which Bayesianism is capable of
• covariance vs mutual information, variance vs entropy
• Information theory (& coding theory) as a foundation for subjective probability
However, I have too little time to write about these things. :p If I take the time later, I will turn these bullet points into links.

## Wednesday, September 1, 2010

### Holes in Weak Inferentialism

If something like my weak inferentialism is to be taken seriously, it should be able to account for common structures of mathematical reasoning. In particular, let's take two examples: mathematical induction (used for reasoning about the natural numbers and other discrete structures) and the continuum (used for reasoning about real numbers and other continuous structures).

My favorite way of thinking about mathematical induction is as a "nothing more" operator-- the principle of mathematical induction essentially asserts that there aren't any numbers except the ones whose existence is assured by the other axioms.

Viewed as a second-order axiom, this is the only axiom of number theory whose logical consequences are not computably enumerable. If we view it instead as a first order infinite axiom schema, the consequences are computably enumerable, but are an incomplete characterization of the natural numbers. (This is as a result of applying the classical semantics to 1st-order and 2nd-order logics.)

In a weak inferentialist system like the one I described, it's quite possible to list the consequences of the first-order axiom schema or other stronger versions which approach the 2nd-order axiom.  However, the full 2nd-order axiom causes some trouble. The result depends on how paradoxes are resolved. In my description, I said that Kripke's theory of truth would provide a way of avoiding paradoxes. Kripke's theory, however, leaves open some questions.

Basically, Kripke says that some sentences must have definite truth values, and some cannot, but leaves some sentences between the two which we are allowed to assign values or not based on preference. The least fixed-point is the theory resulting from leaving all these middle sentences undefined as well; it's generally the preferred theory. However, there are others, such as those based on supervaluation. (Note: This is not an entirely complete way of spelling out what's going on here. There are actually two different things which Kripke leaves open: the fixed-point and the valuation scheme. When I say "least fixed point" I'll actually mean the least fixed point with a Kleene evaluation scheme. However, I won't go into those details here.)

The full inferentialist 2nd-order induction axiom would say that for any predicate, we can conclude that if it is true of 0 and is true of n+1 whenever it is true of n, it is true of all numbers.

According to the least-fixed-point, this assertion should always be undefined. This is because the quantification over all predicates necessarily includes undefined predicates, for which the whole statement will come out undefined.

I believe the supervaluation version will instead allow the assertion to be well-defined. As I understand it, supervaluation says that a statement is true if it is true for any consistent assignment of truth values to the undefined sentences. This means that, conceptually, for the instances of induction which are handed ill-defined predicates, we ask "what would be true of well-defined extensions of this predicate?"

This is hopeful, but at the same time it seems unfortunate that the properties of the system would depend so much on which version of Kripke's theory is used. The minimal fixed-point seems (at least to some) like the nicest one, so relying on a different version needs some explanation. Can the choice be motivated in a way more strongly tied to the thesis of weak inferentialism? I'll leave that question for later.

Now, for the continuum. By continuum, I mean to refer to any entity with the cardinality of the powerset of the natural numbers. That powerset (ie, the set of all sets of natural numbers) is one such entity; the real numbers are another. These objects have the essential feature that not every element can be described-- there are far more elements than there can be formulas in any symbolic language. It is a larger sort of infinity.

The main question here is, does this sort of system allow for a classical continuum, or is it more like a constructive continuum (in which only the describable elements exist)?

The answer is, again, dependent on the version of Kripke's truth that we choose. Supervaluation seems to do just what classical mathematics wants: "all possible valuations" will include an uncountable number of possibilities if the setup is right. This will cause universal generalizations about sets if natural numbers (such as the induction axiom!) to take the correct truth values. I could be wrong here, though-- I am not sufficiently familiar with supervaluation.

Least fixed point will again seem to fail us, refusing to make many generalizations which are taken to be sensible in classical mathematics.

## Wednesday, August 4, 2010

### Logical Systems

This is a rehash of some ideas from my older blog. Basically, I'm trying to state the problem anew, in light of reading Jon Cogburn's paper Are Turing Machines Platonists? (Unfortunately not available online, but do email him if you want a copy, he is a nice fellow.)

Inferentialism is, roughly speaking, the view that the only thing that is important to our understanding of a statement is the way that statements interacts with the surrounding web of statements in our belief system. This is made precise by saying that we understand a statement precisely when we could recognize a proof or disproof of it.

Computationalism is the view that a mind can be represented as a computer program, that is, there is no fundamentally non-computable stuff going on up there: if a computer was fast enough, it could compute the proper outputs to the nerves based on the microsecond-to-microsecond inputs received.

Together, these two views entail that we cannot fully understand math in its present form. Goedel's semantic incompleteness theorem shows that for any computer program, there will exist statements in basic number theory which that computer program can recognize no proof or disproof of. Perhaps one might respond that this seems acceptable, as it becomes very difficult to understand complicated mathematical statements, and it doesn't seem implausible that we have some limit corresponding to Goedel-style incompleteness. However, in general our ability to understand the meaning of mathematical statements does not seem to correspond that well to our ability to prove or disprove them (or understand proofs provided by others). The continuum hypothesis, for example, seems understandable; yet it is known not to have a proof or disproof in any widely accepted set of axioms. It seems implausible (to me, at least) that it's understandability comes from its being provable or disprovable in our mental logic. The halting problem provides numerous other examples which I would claim were understandable yet not amenable to proof or disproof.

This means we've got to either give up inferentialism, computationalism, or classical mathematics. Very roughly speaking: people who give up classical mathematics are some variety of intuitionist or constructivist; people who give up computationalism are some variety of dualist or hypercomputationalist (like Roger Penrose).Now, I don't disagree in principle with restructuring math from the bottom up, but it seems desirable for a foundational program to capture as much as possible of the way mathematicians intuitively reason, so I'm hesitant to be a constructivist (though I may yet be converted). Similarly, I don't have anything against the possibility that some processes yet-unknown to physicists are endowing minds with special non-computational behaviors (and since I'm not a constructivist, I even believe that such behaviors can be well-defined and deterministic!). However, I don't know that this is the case, and neuroscience seems to suggest that much of neural processing can be accounted for in a computational way. Furthermore, as an artificial intelligence researcher, I hope that the essence of "mind" can be captured computationally. I fall in the third category, wanting to give up inferentialism.

Giving up inferentialism is not to be done lightly. It's a highly plausible assertion, especially for the computationalist: what else should matter about a statement then the computational interactions it has with other statements?

The solution I find plausible I'll call weak inferentialism: we understand a statement if we can compute a defining set of inferences. The statement's meaning is precisely that defining set; it is merited when all inferences in that set are merited, and (optionally?) false when one of them is false. (Should falseness be taken as a basic concept?) This does not mean that we can compute all of the statement's consequences, though. For example, the defining set of a universal statement P: "For all x, S is true of x" will be all the statements "S is true of A", "S is true of B", ... It's possible that another statement, Q, has a list of consequences which is some subset of P's list. In this case, Q would be a consequence not in the list for P. In some sense, however, Q does not add anything to the list: it just summarizes a portion of it. The weak inferentialist argues that this allows us to understand P without necessarily knowing that Q follows from it.

(There may be some interesting connections between these two types of inferentialism and the "Principle of Harmony" from proof theory, which states that the introduction rules and elimination rules for symbols should precisely mirror each other. This basically corresponds to a connection between the inferences from which we can conclude a statement and the inferences we can make from that statement. This principle may have to be spelled out differently for the two types of inferentialism. I don't know enough about the principle of harmony to make a well-considered connection, though.)

Now, the question: what foundational logics do the two different inferentialist pictures recommend? In particular, if we're also computationalist?

Strong inferentialism will only care about notions of logical consequence which have complete proof theories, like first-order logic. An inferentialist will only care about what structures of reasoning can be implemented in the logic. In particular, it seems natural to consider a logic as a programming language. Think of it like this: we have some basic domain of discourse we wish to talk about (such as the actual world), and we have the logic which allows us to make assertions which will cause some statements about the domain of discourse to entail other such statements. The logic is nothing more than a means for expressing these entailment relationships between the domain-level facts.

Ignoring computational efficiency and one or two other practical matters, it seems that little about the logic matters once we've determined that it is Turing complete. Classical first-order logic, intuitionistic first-order logic, and a host of others will all do equally well.

Interestingly, though, more powerful logics can be motivated even in this minimalistic worldview (if we bring practical matters like speed back into the picture). Goedel showed that mathematically more powerful logics (in a specific sense) will always have the following property: there will be some inferences which can be made in an astronomical number of inference steps in the less-powerful logic, but which the more-powerful logic proves in just a few steps. This theorem only holds for arbitrarily large domains of discourse, though, so it is an empirical question whether the phenomenon occurs in practical situations. The paper "A curious inference" by George Boolos and "Some More Curious Inferences" by Jeffrey Ketland discuss the issue (taking the affirmative).

Happily, the notion of "more powerful" here coincides at least to an extent with the more typical notions, which seems to mean that we can justify a good amount of normal math via this sort of reasoning, despite the fact that strong inferentialism will deny that math its standard meaning. However, I don't know the precise connection here, and I won't try to explore (in this blog post) precisely what of mathematics could be justified in that way.

In any case: what sort of view of logic does weak inferentialism suggest? Well, based on the idea of the defining set of consequences, we could say that a (non-basic) statement is a computer program for listing its own consequences. The "most expressive" logic will be one which uses a Turing-complete notation to do this. The key difference between this system and the previous is that we may not be able to infer a statement even if we can infer all of its defining consequences: we cannot implement the truth conditions computationally. Hence, we still have a (highly abstract, irrelevant of speed issues) concept of a more powerful logic: a more powerful logic will know more about which statements follow from which others. This is equivalent to knowing more about the halting problem (since we could check for implication A->B by making a program that halts when B implies something A does not, but keeps looping otherwise).

Fortunately, the extra information can always be expressed in the same logic! We never need to add more expressiveness, only more knowledge. The work done by any additional symbols can evidently be done without them, because the notation is Turing-complete.

The weak inferentialist logic includes the Liar sentence, ie, the sentence whose defining consequence is just its own negation. This can be dealt with via Kripke's fixed-point valuation: we enforce the constraint that a statement is considered true exactly when its defining inferences are merited, but we don't require that a sentence is either true or false. The inference "The Liar sentence is false" is neither right nor wrong; it remains undefined, since there is nothing for it to take its truth or falsehood from. The Liar sentence is ungrounded.

Exactly how this works will depend on whether we want falsehood as a basic concept, which I left open at the beginning. If we don't take it as basic, then falsehood might be defined as the property of implying everything.  The Liar paradox then becomes something very reminiscent of the Curry paradox: "This sentence implies everything." What the fixed-point construction tells us is that we can't in general use hypothetical reasoning to see if inferences are indeed justified: if we want to know whether sentence X, which asserts just X|-Y, is true, then it appears we should assume X and see if we can derive Y. (Read X|-Y as "from X  we can infer Y".) If we assume X, then we know X|-Y, but combining those two, we know Y. This hypothetical reasoning proves X|-Y, so we know X (un-hypothetically). But this entails Y, which might be "everything"! Logicians have provided weaker forms of hypothetical reasoning which conform to the fixed-point construction in order to avoid this problem. (Specifically, we can only make hypothetical assumptions which we already know are grounded.)

It's interesting that this means the sentence which just claims that inferring Y is justified is radically different from the sentence which claims that inferring Y from itself is justified, despite the fact that once we believe either, they justify the same inferences. The two even have the same conditions for being true: if we know Y, we can conclude both (since A|-B is true when B is inferable, that is, |-B implies A|-B, regardless of A). However, when Y is false, then the statement "infer Y" is false, but "From this statement, infer Y" is ungrounded and thus considered undefined.

The final thing to note is that, although no further expressiveness appears to be justified by weak inferentialism, the system described cannot fully express the concept of "groundedness" I've been using. (It can only mark it true, never false; but I've noted that statements are ungrounded more than once in this discussion.) Hence, we have an immediate example of a concept which appears to be mathematically well-defined, but which weak inferentialism does not seem to be able to account for. Yet, what is lost? After all, these supposed statements can't even be given a computable list of defining inferences they justify! Is it useful to state that something is ungrounded? (I think the more basic notion called into question here is negation itself: is it always meaningful to know that something is not the case? Negation has no defining set of inferences!)

## Tuesday, July 6, 2010

### Concerning Monetary Systems

It all started with a math-geek desire to come up with an "imaginary" dollar-- a unit of currency  equal to the square root of debt. Unfortunately, googling "imaginary dollar" just gets a load of people saying that money is already imaginary. :) So, I eventually googled "hypercomplex dollar" (the hypercomplex numbers go beyond imaginary numbers, introducing things like hyperbolic multiplication, et cetera). I got this:

http://www.halfbakery.com/lr/idea/Hypercomplex_20money

Sadly, this and the other proposals it links to are not really attempts to construct imaginary money: they just use the fact that the imaginary numbers (and hypercomplex numbers) are orthogonal to real money; they don't attempt to give an answer to what the square root of a monetary amount might be.

Anyway, these proposals referenced the LETS money system, which is interesting...

http://www.gmlets.u-net.com/home.html

The basic idea: everyone starts out with 0 money, but is free to spend anyway. This puts them in the negative, but negative is referred to as "commitment" rather than debt, and is not supposed to be a bad thing. There is always as much negative as positive, so being in the negative is normal and does not cause a problem (since you're still free to spend even more). The justification: scarcity of money causes a lot of problems. Scarcity of dollars stops local economies, even if there is really enough goods and labor to go around. This is why more US dollars are printed then go out of service each year: to fight the scarcity of money. Unfortunately, that leads to inflation, which has its own problems... LETS tries to fix this by letting anyone issue money whenever they like, creating "commitment" in doing so. This means there doesn't have to be money around in order for people to exchange commitment for goods and labor, so that money scarcity can never be a problem.

This basically relies on individuals to honor their commitment. I think this is a big part of why LETS is supposed to remain a local currency: it relies on the goodness of people, so if it got too big it would run into troublemakers.

My take on the future of money is that we are headed for a credit-based economy (in which credit is more important than raw money). We're already almost there! However, there are some serious flaws with the current credit economy... credit cards are a Bad Plan. Personally, I think things like kickstarter are a better sign of what's coming... people investing in people. The problem with banks and credit cards is that all the investing is being done by a few monolithic entities, whereas the "little guy" just gets the debt. The little guy needs to be able to give credit as readily as take it.

The idea of "debt" is owing money to one person. The LETS idea of "commitment" is to instead owe to a community. The problem is that this is too easy to take advantage of it becomes too mainstream (ie, if people don't feel a personal commitment to the whole community). The currency would just inflate.

To address this, I propose an idea of financial backers. Individuals back each other rather than being backed by a big bank. If a group of people back each other with no reservation, then they should automatically  form something like a LETS network: they can spend as many credits with each other as they want, making "commitment" with the group as they do so. When they get money, either within the group or from outside, the commitment gets payed off.

More commonly, people could offer each other limited lines of credit. The amount I offer to you would depend on a combination of how much I trust you, how much I can afford personally, and how much credit I have from other sources.

Obviously more details need to be worked out here. One principle from LETS is that no interest should be charged on commitments. It's not obvious whether that's preferable in the system I'm thinking about. It promotes a positive and friendly outlook (as does the option of offering an unlimited credit line to people you trust, rather than worrying about placing a \$ cap). However, it also seems to make sense to pay people back for their hospitality... perhaps mutual credit agreements are better than interest for that purpose, though.

One conspicuous advantage of this scheme (in contrast to the currently mainstream big-bank approach!) is that the distributed nature of the system prevents large failures. An uncareful collective of individuals might fall together, but the impact would be limited.

EDIT: This proposal has already been thought of and is being implemented! If you extend me a little credit in ripplePay I'll extend you a little back. :) I may write a post soon on the differences between what I was imagining and what these people are doing.

## Friday, June 11, 2010

### Goal Systems

Continues this.

It seems fairly clear that backwards-chaining should be seen as a result of goal-oriented behaviour, whereas forward-chaining should be more a result of the reactive reasoning layer. This thought arises from several considerations (including efficiency), but mainly from the idea that backtracking is simply an attempt to achieve a goal by trying different possible (sets of) subgoals in turn.

A simple production-rules-vs-logic-programming view would see the issue as two different ways of interpreting the conditional from propositional logic; A->B either means "if you have A, conclude B" (if you're a production rules person) or "if you want B, try A" (if you're a logic programming person). I'm denying that that is what's going on. A->B in the knowledge base means just what it means in the declarative logic, and doesn't by itself trigger any inferences. For inferences to be triggered, procedural knowledge must be present. Procedural knowledge can always be put in a series of if-then type statements which are executed by forward-chaining; both "if you have A, conclude B" and "if you want B, try A" are just examples of this.

Now, for probabilistic-relational systems, goal-oriented reasoning is set within the framework of bayesian utility theory. The mathematical theory is simple and elegant (although there have been refinements over time, such as causal decision theory, timeless decision theory, updateless decision theory); but like pure probabilistic reasoning, it isn't immediately obviously how to efficiently implement/approximate. Much of the relevant research for this comes from the domain of reinforcement learning; also, an algorithm that's generating a lot of research right now is monte carlo tree search. It seems reasonable to follow these threads of research for a goal-system implementation.

For me, it seems natural to replace the on/off subgoals & backtracking which appear in crisp systems with a more continuous goal structure. This is based on the analogy to the way the truth values become continuous in these systems, which my be a bit hasty. Still, here is my rough outline:

I've been considering for some time a system which allocates time to different reasoning processes based on a currency of "effort". Effort could take two forms: a longer-term effort would represent percentage of processing time, while a shorter-term effort would represent amounts of time. There would be long-ish cycles in which every goal that has long-term effort attached to it gets a corresponding amount of short-term effort assigned to it. A goal can then cause subgoals to be spawned which it assigns some of its effort to.

This may or may not be a good idea. :) The shorter-term effort seems like a natural concept; it's been invented independently by two different people I've talked with (Ben Goertzel's group, and Russel Wallace). It seems useful for complex combinations of approximation algorithms that can be given more or less time to give better or worse answers, especially if they may employ each other or themselves in a loopy manner. Programs based on this sort of effort would be guaranteed to terminate in whatever amount of time they are given (though, perhaps not with a very good answer.)

The longer-term type of effort is not so obviously useful; hard to say.

## Wednesday, June 9, 2010

### Procedural Logic

I've decided to start using the term "procedural logic" for what I called action logic in the previous post on this topic. This better reflects the desire to merge the procedural and declarative; also, it steers clear of logics already named "action logic."

To pick up roughly where I left off...

Tactical theorem provers such as HOL Light and Coq are the modern-day epitome of procedural logic, in some ways. They work by the simple lack-of-mechanism which I gave as the minimal procedural logic in the previous post: they are simply programmable theorem provers. "Tactics" (aka "strategies") are created, which are simply programs for guiding inference; tactics can use each other, so that in a weak sense they are like goals and subgoals. What's really nice about them is that they cannot be made to return an incorrect result, no matter how wild the tactic. What they lack compared to SOAR and PLANNER is:

1. Reflexive inferences: SOAR automatically forward-chains for all rules in its KB. In a similar way, Prolog automatically backwards-chains for all programmed rules. PLANNER was designed to do both, with a declaration of which way a rule should be used upon its creation. (SOAR does something similar with a little work.) ACL2 probably has the most advanced facilities, from my cursory examination; there are many categories of possible reflexive inferences for different sorts of theorems, and the system has intelligent defaults which the user can override. However, the reflexive inferences are still clearly too simplistic sometimes; for example, the user can create an infinite loop just by proving both B = A and A = B... ACL2 will happily swap A for B indefinitely the next time either one appears in a statement.
2. Truth Maintenance: Some facts are of the sort that can change. Under strict logic, we should index them by time to account for this, but it seems necessary for efficiency to ignore this and just change the facts directly. If that's the strategy, then we have a truth maintenance problem: we may need to update our entire KB to reflect the change, withdrawing any statements we proved based on the now-false information. In SOAR, this is reflected in the distinction between o-support and i-support: o-support represents that the rule is intended as making a change to the KB in firing, which cannot be undone except by another o-rule, whereas i-support represents that the rule's consequences should be undone immediately when the premises become false (thanks to an o-rule changing something). This is probably the most advanced form of this sort of distinction. PLANNER also addressed the issue, however. (That's why I called SOAR a "descendant of planner" in my previous post on this topic; since then, though, I've asked Paul Rosenbloom about it. He indicated that PLANNER did not actually influence SOAR in any significant way, so it looks like SOAR came up with the idea quite independently.)
So, the question becomes: what should a probabilistic relational system do about all this?

Paul Rosenbloom's probabilistic-relational rethink of SOAR (see his publication page) takes the position that #2 can emerge in a simple way from the sum-product belief propagation algorithm; basically, this amounts to re-estimating all the weights in working memory whenever a change is made by a rule in long-term memory. The system also deals with time in additional ways. As for #1, there is some divergence from SOAR's system, but nothing as sophisticated as ACL2. Paul Rosenbloom's proposal does not have a reason to focus on proofs via mathematical induction in the way ACL2 does, so it would not benefit from a translated version of ACL2's guidance system... perhaps it's reasonable to say that much of the sophisticated stuff in ACL2 is fairly special-purpose for its domain of application.

Having sum-product as a basic dynamic seems like a good idea, but at the same time, I'm thinking of a system which can reason about inference guidance even for that. For example, it seems reasonable to (A) weigh the product-sum algorithm's attention to specific areas of the belief network based on multiple factors, including which variables are currently undergoing the most change, and which variables are most critical for the goals; (B) augment the product-sum algorithm with more accurate algorithms in areas of the belief network where accurate estimation is critical. In the Rosenbloom system, (A) would have to be an addition to the basic level, but he has discussed (B) as something which might be implemented on top of the basic level using rules. The second situation is preferable, but at the same time, any "basic level" of reactive inference/decision will have some built-in assumptions. There is a way around this, however: dynamic replacement of inference strategies in the manner of Hutter's Algorithm. Hutter's algorithm is tremendously inefficient, so it would be nice to find a small-scale, incremental way of employing the same trick. (Also, Hutter's algorithm would need to add probabilistic reasoning about algorithm speed and correctness, to satisfy the desire for a fully probabilistic-relational system.)
To summarise that disorganised paragraph: I like the idea of using sum-product as the reactive-level reasoner, but it may not be what I'm after for procedural logic. As a disclaimer: worrying about abstract asymptotic universality stuff is fun, but it is not necessarily relevant to practical implementations. Sometimes I am more of a mathematician than a computer scientist. However, there are some recent indications that the theory of universal intelligence might actually guide practical implementations: a working, practically applicable, asymptotically universal approximation of AIXI, and a Levin-search-based automatic programmer that works on some toy problems.

In any case! The main point is, in probabilistic relational systems, truth maintenance gets replaced by some form of belief revision plus some way of dealing with time. (In OpenCog, I understand that part of the strategy will be to have the truth values of time-sensitive statements decay gradually as they become more probably out-of-date.)

The other issue to address is how a goal system for a probabilistic-relational version should work, but, I guess that's too big a topic for the moment... though honestly I started writing this post to talk about that :) Such is blogging.

## Tuesday, May 18, 2010

### SOAR workshop so far

Daniel and I have just come back from the pre-workshop dinner. Our ride (another attendee) had a delayed flight, and we had to call a taxi; we ended up being about 45 minutes late for dinner. Daniel and I ended up getting seats at a separate table, so we didn't talk to anyone right away; eventually, though, we got some good discussions.

My primary thesis has always been the importance of expressive power in AI. This has been a reaction to the overwhelming majority of AI research being done with propositional methods.  However, this method tends to fall on deaf ears for several reasons.

1. If I don't phrase myself carefully, it sounds like I'm arguing the anti-AI position that many people abuse GÃ¶del's theorem to argue. This is thanks, for example, to my use of words like "uncomputability" (and if I invoke GÃ¶del's theorem, of course, that just makes it worse). If I say "system X can't represent uncomputable mathematical concepts, but humans can," it really makes people think that I mean "computers can't handle uncomputable mathematical concepts, but humans can." I don't mean that. Rather, I want to soundly answer the anti-AI position via a formal theory of which systems are capable of climbing the Tarski hierarchy on their own (and how they do it).
2. John Laird and the other fellow in the front seat when he was giving me a ride back to the hotel (sorry, don't remember the name! EDIT: Paul Rosenbloom) pointed out that worrying about expressive completeness is far less common than trying to find less-expressive subsets of existing systems which are more efficiently manipulated. Daniel put it another way: one shouldn't just worry about what a system can do in principle but about what it can do in reasonable amounts of time. Systems which are "more expressive" might be able to do far less in practice. My response to this is just to say that I prefer systems which use heuristics to stay in the space of efficient reasoning, rather than reduced expressiveness... this is a weak-sounding point, though I do think it is the right response...
3. If I'm already talking to someone who believes that expressive completeness is an important issue, then the remainder of my pet theory consists of ways of generalizing from a propositional system to more expressive models. This typically becomes a situation of singing to the quire, because the person will already be past propositional models and into at least first-order. All I have left to argue is the seemingly reversed position that most reasoning should actually take place at the sub-first-order level (say, 1/2 n-tuple markov models, 1/4 context-free, 1/8 context-sensitive, 1/16 first-order, 1/32 2nd-order...). This is related to the response for the 2nd point.
4. My project is really quite an abstract one, because at present my best guess is that so long as a system is doing probabilistic relational learning (ie, at least 1st order), and so long as it approximates Solomonoff induction in its ability to discover hidden relations in data, then it satisfies the requirements to climb the Tarski hierarchy. Because of this, I can't argue that there is a pitfall of expressive incompleteness waiting to trap anyone who isn't careful. So, it seems I should only be arguing for those basic requirements rather than arguing that a general theory telling us how to climb the Tarski hierarchy must be found. If I can argue for those points on separate grounds, a general theory saying that those are sufficient is more a nice factoid for those worried about the foundations of mathematics than a guiding principle for AI. (Though, there's still the possibility that the required structure for a system is significantly more specific than Solomonoff induction, in which case the theory may have definite value.)
So, should I give up what has been my primary thesis for 4 years? The banner "Expressive completeness is the central problem!" may not be the best way of organizing and presenting my ideas at present... in any case, it's clear that I need to think about my arguments more carefully.

## Friday, May 7, 2010

### Action Logic

Continuing a thought in this post.

A normal axiomatic logic pins down which inferences are permissible, but makes no comment on how these inferences are to be made. For real implementations, we've come up with a plethora of inference strategies-- however, the axiomatic system is always treated as separate from the inference strategies. My idea of action logic is to merge the two, hopefully in order to do nifty things like letting the system reason about its own inference strategies in a smooth way.

In other words, what we want is to nicely integrate logic and programming.

Several attempts to do this already exist. One example is Prolog. Prolog does not, however, satisfy the desire to allow deductive reasoning to determine flow of inference; Prolog just performs a backtracking resolution algorithm on horn clauses, paying some heed to meta-logical control flow statements that can be inserted.

However, I think we can take some inspiration from a predecessor of Prolog, known as Planner. Planner worked through a backtracking mechanism similar to that of Prolog, but more expressive. It used a separate notion of "goal" and "fact" to drive reasoning: if a goal is present, then the system starts reasoning about how to achieve that goal. This can involve the creation of subgoals which drive more reasoning, the execution of programs, and performance of inference steps.

One could argue that Prolog makes this distinction, but it is much less explicit: since programs and statements are the same thing, the creation of subgoals always just corresponds to the conditions in an if-then. (In PLANNER one can specify arbitrary advice for what subgoal to create at a given step; it is not so strongly tied to the form of the logical assertions in the knowledge base.)

PLANNER also has the feature that facts can be removed from the knowledge base when they become false, ie, the system does not need to index facts to time in order to model facts changing; this is important for efficiency, though it is less "logically pure" than the alternative.

This is a good overview of Planner for the interested. More recent systems such as SOAR, SCREAMER, ACL2, et cetera have many of the same features; I am really picking out PLANNER mainly for its historical significance.

In any case. I said that I was trying to merge declarative and procedural, yet I am praising PLANNER's explicit goal system over Prolog's implicit way of dealing with goals. Why? Because it is more powerful: we can provide more inference guidance in PLANNER than in Prolog.

In any case, I am not saying that Planner is a sufficient action logic, either! In fact, I'd say, let's question the basic mechanism which it shares with Prolog: if the system has goal G, and knows a statement of the form "S implies G," then it will automatically make a subgoal S. (For the curious: Both PLANNER and Prolog will select out of several statements "X implies G", try that X, give up if that subgoal turns out infeasible, and try the next. This technique is applied recursively, creating sub-sub goals, et cetera. In Prolog this is the only mechanism at play; in PLANNER there are more conditions to be met for this to happen, there are other mechanisms at work, and the programmer can add arbitrary mechanisms.)

What happens when we vary this mechanism? Well, there are a lot of options. We could create a reinforcement-type agent which attempted to learn what inferences to draw and what subgoals to create and when, getting rewarded whenever it meets a goal specified by the programmer. We could create a brute-force system which keeps proving theorems until it happens upon one of the form "P implies G" where P is a fully specified sequence of actions, rather than just a subgoal-- the system then does P. We could produce a system like Hutter's asymptotically optimal algorithm by creating a mechanism which searches for plans of action, replacing the current plan when the system finds a proof that switching to a different plan will result in a better outcome.

The weakest choice is to just require that if the goal "do X" is asserted, and X is a sequence of actions, the system does them immediately. This just means that the system is programmable. (X can be an arbitrary program because it can include branching goto type statements.) Any other choice of system can be modeled in this one by telling it to behave in that way.

Stronger choices will provide more built-in useful behavior, but will also have more built-in assumptions.

My desire is to create an abstract framework, not something that pins things down to a specific implementation. So, it seems like what I'm looking for is an abstract "should-ness" concerning what inferences to draw when, and what subgoals to make when.

Much of the logic discussed under the title of action logic in the Stanford Encyclopedia is somewhat relevant here, though the logics there are definitely not action logics in the sense I intend: they are just normal axiomatic systems, which must rely on outside inference procedures.

In general, my approach to AI is very much in the line of probabilistic relational methods: I advocate a move to probabilistic reasoning where just boolean reasoning is present, and to relational methods where just propositional methods are present. In this case, that means that what I see as the logical next step is to research probabilistic generalizations of PLANNER... it seems this would force many of the simplistic methods employed in PLANNER to be "broken open," so to speak, getting out the chaff and determining what can be salvaged and improved for the probabilistic setting.

## Wednesday, April 14, 2010

### Postmodernism

I've been thinking for a while about why postmodernism might be a tempting intellectual position. There are certain logically sound logical arguments that support particular assertions associated with postmodernism, as well as certain practical issues which might make it tempting. I will try to illustrate my position against postmodernism by giving the best arguments I can muster for it-- the intention is to show why we can get this far, but no further.

I won't be too surprised if someone reads this and says "but postmodernism doesn't try to go any further than that with the argument,"or further, "that's not what postmodernism is at all;" these arguments are based on my impression of postmodernist thought, not based on a formal education in the subject.

Since this thing would be tl/dr (or worse, tl/dw) if I did it as one post, I'll just post an outline for now; the points below will become links as I write individual posts. (I may also come back and add/delete/edit points, of course.)

Practical Reasons

-making room for others to doubt your beliefs
-the intellectual proliferation of hypotheticals
-Flexibility of language and definitions; anti-prescriptivism
-disagreements are often based on matters of language (differing definitions) rather than matters of fact
-such things as tables, chairs, etc don't exist (strictly speaking)
-Scientific theories are approximations

Logical Reasons

-loeb's theorem
-algorithmic complexity is relative
-We can always "interpret" talk in any logical system or system of axioms as just hypothetical first-order talk (by throwing out the naturalness constraint on interpretations)

## Monday, April 12, 2010

### More on Self-Modifying Logic

This is a follow-up to this post, and will simply attempt to fill in some details.

In order to properly refer to a system as "self modifying logic," one would want the logic to modify itself, not merely to be a system of logic which gets modified by an overseer-type system... of course. I did not make it clear in the previous post how this would be accomplished.

It seems to me that a prerequisite would be what one might call an action logic: a logic which isn't just a way of representing and manipulating declarative knowledge, but which is also capable of making and executing decisions, including decisions about its own reasoning procedures.

Lacking a full-fledged action logic, I can only offer a first approximation (loosely inspired by Markus Hutter's algorithm for all well-defined problems): We have a set of axioms describing what kind of outcomes are better and worse, and how our current action policy affect them. We have some standing  we have a reasoning system which uses some declarative logic and is proving things in an exhaustive manner (Hutter uses Levin search). We keep track of the proofs of action-policy values, and at all times use the best policies currently known of.

And, of course, the theorem proving method (initially an exhaustive search) should ideally be a part of the policy so that it can be modified. (This becomes more like Godel machines, which I'm sure I link to often enough.)

Now, to make this usable, we've got to be selecting policies based on estimated value rather than provable values; thus the logic should include probabilistic reasoning about the policies based on limited reasoning and limited testing of the methods. (ie, the probabilistic reasoning should be broad enough to admit such spot-tests as evidence, something that requires a solution to the problem of logical uncertainty.)

So, obviously, many details need filled in here.

In any case, one of the available actions will be to change the logic itself, substituting a more useful one.

The utility of such changes should in principle be derived from the basic utility function (that is, our preference ordering on possible outcomes); however, for the discussion in the previous post, for analysis of possible behaviors of such systems, and possibly for practical implementations, a notion of goodness that applies directly to the logic itself is helpful.

So, questions:

--What might a full-featured "action logic" look like?
--Under what conditions will the system tend to add new mathematical truths to the set of axioms? Under what conditions will the system tend to add truth predicates, or in any case, metalanguages?
--Under what conditions will the system satisfy the backwards-compatibility requirement that I mentioned last time?

## Thursday, April 1, 2010

### More on Curiosity

I took a look at this paper on curiosity, which includes a good review on more recent work than what I had read about when I wrote the previous post. One nice insight that has been made is that it is useful to split up the value function based on actual reward from the value function based on "exploration bonus". These can then added together to make the final value. One can still think of the exploration bonus in terms of optimism, but another way to think of it is that the system is really just trying to calculate the benefit of exploring a particular option (that is, the learning benefit), and adding that to the direct benefit of choosing the route.

In this account, the confidence-interval method mentioned in the last post is seen as a method of estimating the learning benifit of a state as the distance between the most probable average utility and the top of the X%-confidence range for the average utility.

A related estimate might be the expected information gain...

It's not yet clear to me how to make an estimate that approximates the true benefit in the limmit.

## Wednesday, March 31, 2010

### Formalized Curiosity

I've been reading up on reinforcement learning lately. An interesting question, perhaps the first question, is how to balance exploration with exploitation.

The basic idea is this: if we are faced with a number of options which have unknown probabilistic payoffs, then we can either choose the option that has had the best average payoff so far (ie, exploit our current knowledge), or choose a different option in order to gain more evidence about how good that option is (ie, explore the avaliable options).

In principle, we can compute the optimal choice at a given point in a straightforward way, but this requires looking at the entire future (up to some cutoff we choose) and is highly inefficient. It seems to be critical to actual implementation that we approximate the utility of exploration in a more local way, making exploration an essential element of our decision procedure rather than a derived property that comes from planning far enough ahead.

Unfortunately, the simplistic strategy of choosing to explore totally at random some fraction of the time, and exploit the rest of the time, seems all-too-common. This fails to take into accound factors such as the current risk of exploration, weighing exploration to favor more promising options, et cetera.

An interesting class of algorithms for improving upon this is optimistic greedy methods: act as if you're exploiting, but use optimistic estimates. It looks like there are several ways of fleshing this out; perhaps the simplest is to come up with an interval within which the utility of an option falls with probability X (say, 95% confidence). Our "actual estimate" for the utility might be in the middle of the range, but we can act as if we think the top value is the right one in order to encourage exploration: if an option has not been explored very well, it will have a broad range, so that the top of the range may be higher than the actual current-best-bet even if the middle is pretty far below.

This can lead to too little exploration: a string of bad luck associated with an option which is in actuality good can push even the high estimate of the payoff to below the actual payoff of some other option, so that it will never be explored again. This becomes less and less likely the more optimistic we are (ie, the wider our con interval), but it's still a disadvantage.

One might ask, what intervals should we pick? Well, it seems like it depends how long we are going to be choosing between the particular set of options. If we are only choosing one more time, our intervals should be of width 0-- we should exploit rather than explore. Similarly, if we only have a few more times to choose, we should choose with relatively narrow intervals, doing relatively little exploration; if we have a great many left, we should have very broad intervals, exploring a great deal. (Perhaps I should try to work out the optimal intervals here.)

So, if we have a known finite number of moves to make, we should move our interval range down to 0 via some formula.

The interval width can be though of as the "curiosity" of the system at a particular point in time. If we're using a 95% confidence interval, then we are saying that for each option we don't try, we want to be at least 95% certain that it is worse than the option we do try: any less confident and we'd prefer to experiment.

What if we have an infinite number of moves, and want a strategy which will be guaranteed to converge to the optimal strategy at infinity? (Of course, what we really care about is how quickly we converge, but let's first worry about converging at all.) I said earlier that, with a fixed confidence, this would be impossible: we can always converge to the wrong thing. However, the higher the confidence we require, the less probable this is. So, we can gradually increase our confidence requirement over time to guarantee convergence. This can be thought of as increasing our epistemic standardsover time: if we've only been doing something for a few minutes, we're ok with only being 75% certain that it's better than the alternatives; but if we've been following a strategy for millions of years, we want to be damn sure by that point that it's the strategy we shoul dactually have been following.

There are several paradoxical things going on here:

• While for finite-time problems we want to decrease our curiosity over time, for infinite-time cases we want to increase it; the infinite-time case does not appear to be the limit of longer and longer finite-time cases.
• Higher epistemic standards (ie, requiring greater confidence) correspond to more optimism, not less, even though one might think of optimism as a sort of epistemic dishonesty (pretending the expected value is higher than it is). It's a consequence of differing definitions, not a real contradiction, but I think it's curious.
source

Edit-- Questions:
• What are the formulas for optimal curiosity levels  in the finite and infinite versions?
• Can we make sure that a system using these approximate exploration strategies will still approximate the true optimal strategy as it is given more computation time with which to plan ahead?

## Tuesday, March 30, 2010

### Self-Modifying Logic

This post is inspired by the GOLEM architecture that Ben Goertzel recently wrote a post about. In discussing the architecture on the Singularity list, we came up with the idea of allowing an artificial intelligence to alter its own utility function computation in a controlled way: it evaluates potential new utility calculations by comparing the outputs to the outputs of the original utility function, looking for computations that are essentially the same but faster.

Now there are some fun (and potentially important) questions about how to deal with uncertainty in the original utility function, but I won't go into these here.

1. Start with a set of axioms and rules of inference to start with, and if you like, also a set of statements about the world (perhaps sensory data).
2. Look for new logics which can derive what the old can derive, but possibly more.
3. Judge these based on some criteria; in particular, shortness of derivations and simplicity of the new logics both seem sensible.
This potentially solves my problem of coming up with a system that iteratively learns higher and higher levels of the Tarski hierarchy. If the system keeps augmenting itself with a new truth predicate, then it will keep increasing the efficiency with which it can derive the truths from the initial system (by a result of Goedel for the type-theoretic hierarchy which if I'm not mistaken will hold similarly for the Tarski hierarchy; see his On the Length of Proofs). This does not show that the Tarski hierarchy is the best way of increasing the power of the system, but I am perfectly OK with that.... what I'd like, however, would be some guarantee that (some canonical axiomatization of) each level of the Tarski hierarchy can at least eventually be interpreted (ie, as we keep adding further extensions, we interpret more of the Tarski hierarchy, without bound). I do not know how to show this, if it's true.

## Monday, March 22, 2010

### A Few More Thoughts on Guidance

So: if we want to make good inferences, is it enough to look for a programming language which describes good inferences succinctly, and then proceed to carry out actions which have succinct descriptions in that language?

Specifically, can this strategy be arranged such that it is (in some sense) equivalent with the strategy of searching for the inference algorithm that's got the lowest expected time (or, lowest average time on randomly generated test problems typical of what's been seen so far)?

If we've got a bunch of trial runs of different inference algorithms, then the absolute best language in the sense that I'm looking for would be one that could state just the algorithm with the best average time. Less-good languages would be judged by how good the inference algorithms they could state would be. This should make clear part of the divergence from the simple concept of compression: we're not trying to compress all the data, just the "best" data. We've got a range of goodness for our data samples; we want to have very short descriptions for the best cases, but very long ones for the worst cases.

Another difference is that we don't care in the same way about the simplicity of the result. It might be a good idea to favour simpler languages when we only know how good they are on some data, since shorter ones will probably generalise their behaviour more cleanly to more data. Yet, if we could calculate the actual average runtime on truly average data, we would no longer consider shortness to be a concern; not unless we ran a risk of exhausting the memory. With compression, this is not the case.

One might still base an effective algorithm on the loose metaphor: perhaps implement the Levin-like search for good inference strategies, and then some compression-like meta-search for strategy programming languages that tend to result in good strategies. However, the basis in compression doesn't appear to be totally rigorous.

### Compressive Inference Guidance

I just did a post on inference guidance, but I've got a lot more thoughts (which may or may not be worth anything).

The first one is a curiosity I've had for a while: a good way of predicting is to compress sensory information. Can compressing good inference methods lead to a similarly good method of inference?

The previous post indicated at least two ways in which compression can be relevant: first, the Levin-like search gives more priority to shorter inference guidance methods, so although it's not actually trying to compress anything, the result will tend to be compressive of the solution (and of its proof). One might expect that the more compressive a solution, the better it will generalize. However, note that what we're worried about is speed, not correctness--- an inference method cannot harm the correctness of a result, only the speed with which it is inferred. (It can delay the inference perpetually, though.) So we want the speed to generalize. A program that is more compressive of the same data tends to take longer to execute.

Still, I'd expect some connection between compressiveness and the generalization of speed. Wouldn't you?

The second definite connection is that it's useful to compress the problems the system has solved so far, in order to anticipate the coming problems and fine-tune to them. One way of thinking about this is that we look for a good language to describe the problems so far. Then, to generate new problems, we generate random descriptions in that language.

It would be nice, to apply this to the search for strategies-- rather than looking for a good strategy, look for a good programming language, in which randomly-generated strategies tend to be good! Remember the best strategies, and use them to continue revising the language and improving it. The question is, does this scheme have normative value? Will it systematically approximate the optimal behavior?

### Guidance of Inference

When I first switched to this new blog, I thought that the point was going to be to take it more seriously, writing in a more organised fashion and being more careful to only say things I know to be true.

The truth is, though, that just makes me not post. A more informal blog would do much better. One advantage of the change is that the blog's title reflects a much broader range of possible topics, such as the previous post that was just some playing with trigonometry.

So, here are some disorganised thoughts about inference guidance for AI.

What I want is a normative theory-- something that says "This is how it should be done, in principle." IE, a normative answer. Practical inference systems are of interest, but a guiding principle that tells us what sorts of practical implementations will tend to be good is very important.

This is somewhat related to my wondering about (normative) systems for handling mathematical uncertainty; the best way to guide inference may also involve, or be involved in, the best way to handle uncertainty about the potential result of an inference.

The main example of inference I'm thinking of is optimisation or constraint satisfaction.

Levin search is asymptotically optimal, but does not use the problem description in any way (like genetic programming). Jurgen's "Optimal Ordered Problem Solver" ("Oops," an unfortunate acronym) is optimal in a stronger sense, but still does not use problem descriptions. Hutter's asymptotically optimal algorithm for well-defined problems uses problem descriptions. So, several areas for improvement present themselves. Something that is optimal in Jurgen's sense ("bias-optimal") but which uses problem descriptions rather than searching blindly would be nice. Also, bias-optimality is a strong sort of optimality, but does not capture everything Oops is meant to do: Oops is meant to learn from its experience (hence "ordered problem" solver). Jurgen asserts that it is optimal in a certain sense, but I think there is room for improvement.

One way of using problem descriptions would be to just hand them to Oops as if it could understand them, and judge its output on that assumption. It would be forced to learn to use the problem descriptions. However, this would be quite inefficient.

A more interesting way would be to use Levin search at the inference-guidance level. Execute all possible inference-guidance programs in parallel, with the execution time they are given weighted by their simplicity. Solutions would no longer have to be checked, since results would automatically be correct; a proof of correctness would automatically come with the solution.

Oops could be modified in the same way. (Oops can be thought of as just an efficient implementation of Levin search with added features for learning from previous success.)

Now, Oops does several things to learn from experience. (I'll use the language of inference guidance, making the assumption that Oops has been modified to work via inference guidance rather than direct solution generation.)

1. It attempts to apply the so-far-sucessful inference guidance program to the new problem, searching extensions of the program if it's not yet a complete program (ie, if the new situation causes the execution to reach the end of the code where before it didn't); half the attention is directed to this, while the other half searches for a fresh solution (to all problems, not just to the new one).
2. Inference guidance programs are also allowed to provide search orderings for continuations, so that it's possible that a partial inference guidance program represents a good search heuristic for inference guidance programs; this is particularly useful in the situation mentioned above, when a program turns out to be incomplete for a new example.
3. New programs are allowed to copy and modify old ones.
To me this appears haphazard and hacky, though it has its good points. What can we do for a more normatively forceful strategy of learning?

Oops takes execution time into account just by the fact that strategies which are faster will tend to be be found more quickly than slower strategies. This is because if it optimised for execution time directly, then it would quickly overmatch: once it found a solution at all, the fastest-executing program would simply be the one that spit out that answer when given that input. The situation may improve with the amount of experience of the system (since after a large number of instances, a lookup table may no longer be the fastest way of computing the answers). Still, this seems wrong; we want the system to use extra cycles to look for inference strategies that are optimized to be quick on probable questions, not just on previously-experienced questions.

It seems reasonable, then, to search for a probability distribution which would generate the questions observed so far. Using the current-best estimate, the system should look for the quickest solution not just to the current known problems, but to the expected future problems as well.

This could be done in several ways. Perhaps potential future instances are generated at random from the distribution, and inference methods are tested against them. Perhaps a strategy more like Hutter's asymptotically optimal one is used: the system might try to prove which strategies will be better. In any case, the objective will be clear: optimize speed for expected problems.

## Monday, February 15, 2010

### Defining Sine

I don't know if this is a sign of things to come, but the first post is not entirely on-topic. :)
I suppose I should link back to my older blog before I begin:

http://dragonlogic-ai.blogspot.com/

The move roughly indicates me realizing that I now know too much logic & math to be able to write to a general audience without a fair amount of exposition, and also enough that I want to correct much of what my previous self said.
Anyway.
It's bothered me for some time that the functions sine and cosine are given, in a sense, without definition (in textbooks, that is). We didn't even get a method for calculating these, as with the algebraic functions we'd learned.  I know, of course, that they are defined in terms of angles and the unit circle. Angles, however, are not given an algebraic definition either... fortunately, the unit circle is defined:

x2 + y2 = 1

It struck me recently that the derivative of sine-1, which is 1/√(1-x2), looks like it is somehow gotten from the above; specifically, from the equivalent:

y = ±√(1-x2)

Rather than look for a proof of this derivative, I decided to see if I could figure it out for myself.
Now, if I had been given the task of finding the derivative of sine-1, I would have taken the implicit derivative, which gives you the derivative of any inverse function in terms of the derivative of the original function:

f(f-1(x)) = x
f(f-1(x))' = x'
f'(f-1(x))f-1'(x) = 1
f-1'(x) = 1/ f'(f-1(x))

So, for any f, you can find the inverse's derivative if you know the inverse and the derivative.
In particular,

sine-1'(x) = 1/cos(sin-1(x))

Of course, this is not nearly as useful as the derivative given in my textbook (for finding integrals of algebraic functions, that is).
So, the question I ask myself is: How can we define sine-1 algebraically?
Intuition: The "angle" is a measurement of the length along the unit circle that must be traveled, starting at (1,0) and going counter-clockwise,  to get to a particular x-location.
From the 2nd form for the equation for the unit circle, it's fairly simple to get a parametric equation system:

y = ±√t
x = ±√(1-t)

In order to keep the distance we're moving steady as we increase t, we want to stick the derivatives of x and y in t into the distance formula:

√(¼t-1 + ¼(1-t)-1)

Now, the integral of this from t=0 to t=t1 for t1 between 0 and 1 is what we're interested in. This is a measure of the distance we've traveled on the unit circle as t has changed. Denote this integral as a function, D(t1).
The inverse, D-1(), gives the t1 that we arrive at if we travel a given distance. From this, we can calculate the x-location from the original parametric equations: x = ±√(1-t1)! So, here we have it:

sin-1(x) = ±√(1-D-1(x))

We can then proceed to define sin as the inverse of sin-1.

If anyone can find any mistakes in this, thanks in advance. Rightly speaking, to get an actual purely algebraic definition, I need to go and find that integral. Perhaps another day. :)

Edit-- Fixed some errors thanks to Daniel Demski.