Wednesday, January 21, 2015

Bug found in standard deviation calculation.

I just corrected a bug in musigma.sh: the actual Z-scores for the euro predictor are actually around +0.7, not +1.5 as I had thought. It may still be significant. Especially since the histogram for the Z-scores of the 31 neural networks shows a bell curve with a mean close to +0.7. That is: not just some ugly distribution.

Here's the corrected version:

https://www.dropbox.com/s/itckgrnq77tj7fo/elm.tar.gz?dl=0

Saturday, January 17, 2015

Some updates in the code

Still very messy. But with some upgrades.
https://www.dropbox.com/s/azs4nwwdasvyyye/elm17.tar.gz?dl=0

Monday, January 12, 2015

Let's play a game

Let's say I have a series of values. The series consists of 4000 values. The first few values are:

1) 1.1789
2) 1.179
3) 1.1743
4) 1.1632
5) 1.1659
6) 1.1569
7) 1.152
...

Now, let's say I show you the first 99 values in the series, and ask you to tell me if the 100th value will be greater than, equal to or less than the 99th value. You can only take a look at the 100th value once you've guessed, and you earn one point every time you are correct in your prediction.

1) 1.1789
2) 1.179
3) 1.1743
...
98) 1.0634
99) 1.0639
100) ???
...

To make things simpler, you will be given only two choices:

a) The 100th value is greater than the 99th value
b) The 100th value is equal to, or less than, the 99th value

Let's say you guessed b) this time around. Ok, now let's reveal the 100th value to see if you were correct:

...
99) 1.0639
100) 1.0572
 ...

You were right! The 100th value is 1.0572, which is less than 1.0639 (the 99th value). Therefor, b) is the correct answer.

We can keep playing this game using the values from 2 to 100, and trying to guess a) or b) for the 101th value, then taking values 3 to 101, and guessing for 102, and so on until we reach the end of the series.

To sum things up: the game consists of guessing whether the value will go up, down, or stay the same at any point in the series. The rule is that, to make your guess, you are only allowed to look at the 99 values previous to the point you choose in the series.

How can you use the information from the 99 previous values to guess the direction of the next one?

Well, here's an idea: first let's take the first few values from before.

1) 1.1789
2) 1.179
3) 1.1743
4) 1.1632
5) 1.1659
6) 1.1569
7) 1.152
...

Now, let's pay attention to the direction of the value, from one value to the next:

From 1) to 2), the value went up.
From 2) ro 3), the value went down.
From 3) to 4), the value went down.
From 4) to 5), the value went up.
...

And so on. We can create a new series using the letters a) and b) from our game. We can use the letter a) when the value goes up, and the letter b) when the value stays the same, or goes down. Our new series will look something like this:

1 to 2) a
2 to 3) b
3 to 4) b
4 to 5) a
...

Etcetera. Our new series is composed only of the letters a) and b). Notice that, if we take 99 values to produce a series like this one, our new series of a)'s and b)'s will contain 98 letters.

Let's suppose that we've counted how many a)'s and how many b)'s there are in our new series of 98 letters. Let's also suppose that, in our new series, there are more a)'s than b)'s. By looking at the previous 99 values, and how there are more increments (a) than decrements (b), we could assume that there is an "upwards trend". That is: up to this point in the series, the value seems to increase more than it tends to decrease.

This may lead us to think the 99th value is more likely to increase, than to decrease or stay the same. That is: the 100th value is more likely to be greater than the 99th value. We may be led to think that a) is the correct answer. But we may be wrong in our guess this time around. And in fact, in this instance, we would be.

But what if we keep playing the game using this method?

Let's say we play the game again, using the same guessing method, but this time, using values from 2 to 100 to guess a) or b) for value 101. Then, let's take 3 to 101, and guess for 102, then 4 to 102 and guess for 103... Let's say we keep going like this, until we've played the game 3000 times.

What would our overall score be with our guessing method? Maybe we would guess correctly 1500 times out of 3000, maybe just 1000/3000, or, if we're lucky, 2000/3000!

How can we know if our method is any better than, for example, just flipping a coin to choose from a) or b) each time? Could a sequence of 3000 coin flips produce a lucky score of 2000/3000 or better? If we repeat the same experiment over and over, how often would 3000 coin flips produce such a score? Would counting the a)'s and b)'s for previous values really be any better?

We can use statistics and probability to find out. Let's start by making a score history for the 3000 times we played the game. This history may look like this:

From 99 to 100) The answer was a), and you guessed b). You earn 0 points.
From 100 to 101) The answer was b), and you guessed b). You earn 1 point.
From 101 to 102) The answer was b), and you guessed a). You earn 0 points.
...
From 3098 to 3099) The answer was a), and you guessed a). You earn 1 point.

Notice that there are three lists here: the list of "correct answers", the list of "guesses", and the list of "points earned". Each of these lists has 3000 elements, one for each time we played the game. We can make a table with these three lists:

Correct answers Guesses Points earned
a b 0
b b 1
b a 0
...

a a 1

As you can see, you only earn one point if your guess matches the correct answer. Let's say we know how many a)'s and b)'s there are in the "correct answers" list. We can estimate the probability of any single element of the list being a) by simply dividing the amount of a)'s by 3000. The same may be applied to the probability of b) in that list.

Let's use the following names for the probabilities:

We will use Pc(a), for the probability of any single element of the "correct answers" list being a)
And Pc(b), for the probability of any single element of the "correct answers" list being b)

Now, what about the list of "guesses"? If we're guessing by coin flips, then the probability of guessing a) would be 1/2, and thus, the probability of guessig b) would also be 1/2. That is: the probability of any single element of the "guesses" list being a) would be 1/2; as would be the probability of it being b)

Let's also use names here:

We will use Pg(a), for the probability of any single element of the "guesses" list being a)
And Pg(b), for the probability of any single element of the "guesses" list being b)

Given these two pairs of probabilities, what is the probability of any given element of the "earned scores" list being 1? And what is the probability of it being 0?  We can use the addition and multiplication rules of probability to determine this. Remember that you only get one point if your guess matches the correct answer. So the probability of earning 1 point would be:

Pp(1) = ( Pc(a) * Pg(a) ) + ( Pc(b) * Pg(b) )

Similarly, the probability of earning 0 points would be

Pp(0) = ( Pc(a) * Pg(b) ) + ( Pc(b) * Pg(a) )

Using these calculations, we can know how likely it would be for us to earn one point by flipping a coin in any single game out of the 3000, without even having to flip the coin 3000 times!

Let's make an example by using actual probability values. Let's say that, in our "correct answers" list, there are 1575 a)'s, and 1425 b)'s. This leaves us with:

Pc(a) = 1575 / 3000 = 0.525
Pc(b) = 1425 / 3000 = 0.475

If we use the coin-flipping method to make our guesses, the probability of succeeding in any single game out of the 3000 would be:

Pp(1) = ( 0.525 * (1/2) ) + ( 0.475 * (1/2) ) = 0.5

We have a 50% chance of earning 1 point for each one of the 3000 games. This means that, in the end, we will probably only win half of the times we play. This gives us a likely final score of 3000 * 0.5 = 1500 points, for the coin-flipping method.

But if we were to actually carry out the experiment in real life, we would notice that getting exactly 1500 points each time is actually not very likely! Why is this so? Well, in real life, probability is only an estimation. Even if you have a perfect coin, with exactly 1/2 of probability for a) and b), there is still a possibility that, in the 3000 tosses, the coin will guess a) more often than b), or vice versa.

Likewise, there is a possibility that you will get all 1's in the "earned points" list, by just flipping coins! How likely is this possibility? Not very. Let's see why...

Let's ask ourselves the following question: how many different patterns of 0's and 1's are possible for a list of 3000 0's and 1's? If you can count in binary, you will notice that we can consider the "eraned points" list as just a binary number with 3000 bits. If we count in binary from 000...0 to 111...1, for a 3000-bit long number, we would get 23000 different binary patterns. Those are a lot of different patterns. Those are 23000 different ways in which the "earned points" list may result in the end.

Now, let's ask ourselves this: how many of these 23000 possible binary patterns have all 1's? Only one! The last one. The rest of them have at lease one zero.

This is how we can know exactly how small the probability of getting all 1's really is. Note that every single possible pattern for the "earned points" list is just as likely to appear as any other (because Pp(0) = Pp(1)). Therefor, a perfect score has a probability of only

Ps(3000) = (1 / 23000)

of ever appearing by coin-flipping.

Now, a more complicated question: how likely is it to get a score of 2/3000 by coin-flipping? We would have to count the amount of patterns, in the set of 23000 possible patterns, which have 2 ones and 2998 zeros. Fortunately for us, there exists a formula which tells us exactly how many there are. It's called the formula for "permutations with repetition of indistinguishable objects". For simplicity's sake, I will not go into the details of using this formula here. For now, let's just assume it works. If you wish, you may google it later.

The way to calculate the probability for any particular final score X would be to:

1. Obtain the amount of different binary patterns with X ones and (3000 - X) zeros using the formula for "permutations with repetition of indistinguishable objects"

2. Divide that number by 23000

The point of all this is that we can use these formulas to know the probability of getting any single final score by coin-flipping. If we were to calculate the complete list of these probabilities, for scores 0 through 3000, we would end up with a "binomial probability distribution".

In our example, the highest probabilities would be distributed around a final score of 1500. Higher or lower scores would be ever less probable. This would account for the way in which we get different scores every time we play the game 3000 times with the coin-flipping method. Most of these times, our final score will be close to 1500. Altough a fluke of, for example 2923, is always possible, it is also very, very, very unlikely.

How then, does this help us to determine whether our a)/b)-counting method is any better than coin-flipping? We can tell exactly how better (or worse) counting is to coin-flipping by looking at how far apart the counting score is from 1500.

Remember that the highest probabilities are distributed around a final score of 1500 for coin-flipping. If we get a score of 1582 by using the counting method, then we should ask ourselves: how likely would it have been to get that same score, by just coin-flipping?

If the difference between the likely coin-flipping score (1500) and the counting score (1582) is big enough, then we can tell that our counting method is superior to just coin-flipping. The question now is: how big is big enough?

It turns out there is a standard way of measuring this difference. The standard measurement of this difference is called the "Z-score", and it's unit is the "sigma".

It turns out that, for the binomial distribution obtained from the coin-flipping method, most of the probability will be concentrated around +/-3 sigmas from the mean. The mean being 1500 for this case.

Any counting score which is around +3 sigmas, or more, away from the mean, can be considered vastly superior to coin-flipping. This is because the binomial probability distribution of mere coin-flipping would make it almost impossible to reach the same score with coin-flipping.

But, how can we know whether our specific score of 1582 is +3 sigmas away from the mean of 1500? All we have to do is standardize. For this, we will use the formulas for the mean and variance used in binomially distributed probability.

For our particular case:

n = number of experiments (3000)
p = probability of success (Pp(1) = 0.5)

The mean is:

m = p * n = 0.5 * 3000 = 1500

And the variance is

v = p * n * (1 - p) = 0.5 * 3000 * 0.5 = 750

The Z-score for a final score x = 1582 would then be:

z = (x - m) / sqrt(v) = (1582 - 1500) / 27.38 = 2.92

A score of 1582 is +2.92 sigmas away from the mean of 1500! This would be large enough to be coinsidered "statistically significant". That is: far, far better than just guessing by coin-flipping.

Now for the good part...

Let's say the 4000 values from the original list are actually daily currency exchange rates for the EUR_USD pair. The previous method could help us in finding out if a neural-network-based price direction predictor is any better than a merely random price direction predictor.

The exact same comparison would apply. But instead of using the a)/b)-counting method, a neural network would be used.

In fact, I've already used a type of Simple Recurrent Network called the Elman Network to play the exact same game. By training an Elman network with a simple greedy algorithm using the 99 previous values mentioned in the game, I have been able to obtain Z-scores of +1.5 in average. I should mention that feeding random series to the network results in much lower Z-scores. That is: the network can tell the difference between artificial, random series created by me, and series which come from real FOREX market history.

A very messy, preliminary version, of the source code I've used for this can be downloaded from here. I must warn you that the code is extremely messy, and provided as-is, without any instructions. The results I mentioned are perfectly obtainable using this software, but you must know how to compile and run the programs. You may be able to do this on your own, but I encourage you to e-mail me with any questions.

 LEGAL NOTICE: As I am currently unemployed and attempting to make my way to an Erasmus Mundus scholarship (or any other professional opportunity available to me), I will very much appreciate it if you ask me for permission before using my work in any way or form. If you are a scientist, academician or otherwise professional, interested in my work, please e-mail me at: vomv1988@gmail.com

Cheers!