
Bible Codes and some probability

T'ai Chi

I was tinkering around this past weekend and sat down and helped write some Perl code to investigate something that interests me and combines probability, statistics, and skepticism.

Since some people see random bundles of words from the Bible (selected through an equidistant letter skip, i.e. a systematic sample) as meaningful, I thought I'd investigate whether it is easy (or not) to get meaningful words from random letters in general.

I created n 'words' of length k, where k varied from 2 to 25 (the minimum and maximum word lengths in my dictionary), and each letter in the 'word' was selected uniformly at random from the letters a through z. Then I checked whether these 'words' were in the dictionary.

I ran trials of n = 100, n = 1000, n = 10000, n = 100000, n = 1000000, and n = 10000000 random words of random lengths k. As it turned out, on average about 0.12*n of the n randomly created words were real dictionary words.
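The experiment can be sketched roughly like this (the original code was Perl; this is a Python sketch, and the toy word list stands in for whatever real dictionary file you load):

```python
import random
import string

def hit_rate(dictionary, n, min_len=2, max_len=25, seed=None):
    """Generate n 'words' of random length in [min_len, max_len],
    each letter drawn uniformly from a-z, and return the fraction
    that turn out to be real dictionary words."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        k = rng.randint(min_len, max_len)
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(k))
        if word in dictionary:
            hits += 1
    return hits / n

# Toy example; in practice you would load a real word list, e.g.
# words = set(open("words.txt").read().split())  (hypothetical file).
words = {"at", "cat", "the", "ox"}
print(hit_rate(words, 100000, min_len=2, max_len=3))
```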

That seems pretty significant to me. If meaningful words can be created from obvious randomness without employing a systematic sample, then how much easier must it be to find 'meaning' in non-random text by employing a systematic sample?
 
You might get an even better hit rate if you randomly chose letters in proportions that mirror the language (e.g. 'e' far more common than 'z').

Nice example, though.
 
Matabiri said:
You might get an even better hit rate if you randomly chose letters in proportions that mirror the language (e.g. 'e' far more common than 'z').

Nice example, though.

Ah, it would be interesting to change the proportions and see what happens. I definitely think you are correct that the rate would go from 12% to something much higher. I'm not sure of the specific percentage, though. Any guesses? :)

I wrote some code in the past that did some vector calculations with weighted averages that I could adapt for this exploration. Something like specifying how often on average random letters are chosen:

proportion a = p1
proportion b = p2
proportion c = p3
.
.
.
proportion z = p26

where there are certain constraints on the p's. That should do the trick.

I'll work on implementing it this weekend. Then I'll write up something and either post it or submit it to SkepticReport.

-T
 
T'ai Chi said:

where there are certain constraints on the p's. That should do the trick.

You could consider assigning the probabilities for digrams or even trigrams to get more accurate values. I suppose that someone has already computed their distributions but I don't have a reference available right now.

However, in case you are interested in one-letter probabilities, I include here the list from Handbook of Applied Cryptography, page 247:

A: 8.04%
B: 1.54%
C: 3.06%
D: 3.99%
E: 12.51%
F: 2.30%
G: 1.96%
H: 5.49%
I: 7.26%
J: 0.16%
K: 0.67% <- now, I'm surprised
L: 4.14%
M: 2.53%
N: 7.09%
O: 7.60%
P: 2.00%
Q: 0.11%
R: 6.12%
S: 6.54%
T: 9.25%
U: 2.71%
V: 0.99%
W: 1.92%
X: 0.19%
Y: 1.73%
Z: 0.09%

That handbook lists only the 15 most common digrams and no trigrams at all.
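Those one-letter frequencies plug straight into a weighted sampler. A minimal Python sketch (the thread's code was Perl; `random.choices` does the weighted draw):

```python
import random

# One-letter English frequencies (%), from the Handbook of Applied
# Cryptography, p. 247, as listed above.
FREQ = {
    'a': 8.04, 'b': 1.54, 'c': 3.06, 'd': 3.99, 'e': 12.51, 'f': 2.30,
    'g': 1.96, 'h': 5.49, 'i': 7.26, 'j': 0.16, 'k': 0.67, 'l': 4.14,
    'm': 2.53, 'n': 7.09, 'o': 7.60, 'p': 2.00, 'q': 0.11, 'r': 6.12,
    's': 6.54, 't': 9.25, 'u': 2.71, 'v': 0.99, 'w': 1.92, 'x': 0.19,
    'y': 1.73, 'z': 0.09,
}

LETTERS = list(FREQ)
WEIGHTS = list(FREQ.values())

def weighted_word(k, rng=random):
    """Draw a k-letter string whose letters follow English proportions
    (letters still independent of each other, i.e. no digram structure)."""
    return "".join(rng.choices(LETTERS, weights=WEIGHTS, k=k))
```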
 
I suggest that you Google topics then apply your code to the resulting hits. No need to create your own sample of letters.

Take the results and interpret creatively.

I see an interesting article as well as a book for you, my friend. Just use care in picking topics and make sure there are lots of hits.

The idea is that the internet represents the assembled thought-power of the entire human race, so, naturally, any problem will become tractable.

To get you started:

Alien visits: 190,000 hits
Alien Probing: 71,000 hits
Islamic conspiracy: 155,00
Islamic probing: 17,500
Remote viewing god: 92,900
USA Invade: 183,000
Denmark Invade: 29,600
John Edward Gay: 596,000
Superbowl boob: 1,170,000
John Edward Boob: 7,720,000

See, you can get the true paranormal word on these and other burning issues of the day!!!!!!

You will have the 2004 equivalent of a magic 8-ball. It is all true and can be reproduced (REPLICATION); you have a theory, you have references, you have an existing audience. This is the big one. You must get there first before some other charlata... uh entrepene... uh... Gifted Psychic gets there first. Wear robes at your press conference, or go black on black: crew neck cashmere sweater, black Armani suit, black Magli loafers, shades, only a silver ring, watch, and simple necklace with an indecipherable pendant--small though and understated.

Edit: Also invest in one colored contact. Green or red both have merit. I'd go green. Get one with a dilated pupil a la Bowie. Actually, thinking about it, the "Gray Duke" persona might be a good one. Also, lose that T'ai moniker, not mainstream and too damn foreign; I suggest "Sebastian Stewart". Strong, white, faintly androgynous. The monogram (SS) --used on all of your writings-- will get you face time explaining how there is no Nazi influence on your work.

You can have a cult and tons of stupid chicks.

Career killing warning: Do not mention Kool-Aid!!!!!!!!

Really, try it:D
 
I spent some time examining these kinds of probabilities.

Here's the description of my experiments:

I generated random strings of letters of lengths ranging from 2 to 9 letters. For each generated string I checked whether it was in my spellchecker's dictionary (ispell version something or other). I used three different probability distributions and generated 10000 strings for each distribution and string length. Then I repeated the experiment 100 times.

The three distributions were:

(1) Uniform: each letter has a 1/26 chance of occurring in any position.
(2) Letter frequency: the letter probabilities followed the letter frequencies of the English language, but each letter was generated independently of the others.
(3) Trigram frequency: the initial letter was generated according to distribution (2), the second according to the digram frequency distribution, and the remaining letters used trigram frequencies.

What digram frequency means in practice is that in two-letter pairs the second letter is not independent of the first one. For example, the digram "an" occurs in English text over 200 times more often than "aq". Similarly, the trigram "and" occurs over 6000 times more often than "anh". So, in distribution (3) each letter after the second was generated depending on the previous two letters.

I generated the digram and trigram frequencies myself by analyzing almost 780000 lines of English text, including among others the NIV Bible, War and Peace, and Linux HOWTO documentation. [The last component brought some pretty exotic digrams (like "wq" - vi-users can guess where that comes from) to the data but since they were so rare I didn't make any effort to weed them out.]
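The trigram scheme described above can be sketched in Python. The actual corpus and program aren't shown in the thread, so this is an illustrative reconstruction: counts are built from whatever text you feed in, and each generated letter is drawn conditioned on the previous two.

```python
import random
from collections import defaultdict

def build_trigram_model(text):
    """For each two-letter context, count how often each next letter follows."""
    counts = defaultdict(lambda: defaultdict(int))
    letters = [c for c in text.lower() if c.isalpha()]
    for a, b, c in zip(letters, letters[1:], letters[2:]):
        counts[a + b][c] += 1
    return counts

def generate(model, first_two, k, rng=random):
    """Extend a two-letter seed to k letters, each drawn from the trigram
    distribution conditioned on the previous two letters.  (LW's actual
    program seeds the first two letters from the letter and digram
    distributions; here the seed is just passed in.)"""
    out = list(first_two)
    while len(out) < k:
        ctx = "".join(out[-2:])
        nxt = model.get(ctx)
        if not nxt:
            break  # context never seen in the corpus
        letters, weights = zip(*nxt.items())
        out.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(out)
```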

Here are the results (hoping that the table comes through intact):
Code:
            Avg.   Min.  Max.  Stddev.   % of words
Length 2
Uniform :  885.32   826  1000   30.84       8.85
Letters : 1847.62  1759  1941   33.67      18.48
Trigrams: 3400.45  3229  3515   47.22      34.00
Length 3
Uniform :  310.79   270   352   17.31       3.11
Letters :  814.32   758   883   28.53       8.14
Trigrams: 2619.25  2505  2763   47.18      26.19
Length 4
Uniform :   46.18    36    59    5.70       0.46
Letters :  200.03   174   246   13.50       2.00
Trigrams: 1399.86  1331  1484   32.71      14.00
Length 5
Uniform :    3.32     0    10    1.91       0.03
Letters :   24.86    12    41    5.24       0.25
Trigrams:  452.44   376   515   21.61       4.52
Length 6
Uniform :    0.28     0     2    0.51       0.00
Letters :    2.78     0     7    1.53       0.03
Trigrams:  145.79   110   178   12.41       1.46
Length 7
Uniform :    0.00     0     0    0.00       0.00
Letters :    0.20     0     2    0.43       0.00
Trigrams:   41.04    25    59    6.41       0.41
Length 8
Uniform :    0.00     0     0    0.00       0.00
Letters :    0.02     0     1    0.14       0.00
Trigrams:    8.64     2    18    3.18       0.09
Length 9
Uniform :    0.00     0     0    0.00       0.00
Letters :    0.00     0     0    0.00       0.00
Trigrams:    2.41     0     7    1.53       0.02
Here "Avg" is the average number of words among the 10000 generated strings, "Min" and "Max" the minimum and maximum number of words in individual runs, "Stddev" the standard deviation of the number of words, and "% of words" the proportion of words among the random strings.
 
I'll work on implementing it this weekend. Then I'll write up something and either post it or submit it to SkepticReport.

I didn't get around to it this weekend. :( I'll probably work on it this upcoming one.
 
LW said:

Here are the results (hoping that the table comes through intact):

In case someone is interested, here's the corresponding result for Finnish. Though, the corpus from which to draw the frequencies was a lot smaller than in the English case, being only the Kalevala and the Bible, and the Bible contains a lot of rather infrequent digrams:
Code:
            Avg.   Min.  Max.  Stddev.   % of words
Length 3
Uniform :   80.85    64   103    8.60       0.81
Letters :  476.17   433   529   18.95       4.76
Trigrams: 1109.24  1022  1176   30.64      11.09
Length 4
Uniform :   13.52     7    23    3.29       0.14
Letters :  171.94   129   205   12.97       1.72
Trigrams:  813.97   763   886   26.61       8.14
Length 5
Uniform :    1.50     0     5    1.12       0.01
Letters :   45.99    30    62    6.35       0.46
Trigrams:  423.38   373   476   20.18       4.23
Length 6
Uniform :    0.14     0     2    0.38       0.00
Letters :    7.08     2    18    3.02       0.07
Trigrams:  136.48   107   164   11.30       1.36
Length 7
Uniform :    0.02     0     1    0.14       0.00
Letters :    0.68     0     4    0.86       0.01
Trigrams:   31.36    19    47    5.85       0.31
Length 8
Uniform :    0.00     0     0    0.00       0.00
Letters :    0.07     0     1    0.26       0.00
Trigrams:    9.08     1    18    2.88       0.09
Length 9
Uniform :    0.00     0     0    0.00       0.00
Letters :    0.00     0     0    0.00       0.00
Trigrams:    2.18     0     7    1.63       0.02

You may note that there is a marked difference from the English case for short words, especially using the uniform distribution. I believe this comes from two main sources:

(1) Finnish words are longer than their English counterparts, so there are fewer chances for a match; and

(2) the Finnish alphabet is bigger, containing 29 letters.
 
Paging through my copy of the bible the other day, I found the following words interspersed through the text.

fundamentalists
superstitious
ignorant
backwards
wrongheaded
Bible Code
Crap
irrational
Bush
one-term

Obviously this was put there by a higher power to send us a message. It could not have appeared simply by chance.
 
davefoc said:

Where did you get your dictionary? Is there a computerized list of words in the English language floating around out there?

Try to find one here.
 
perform a compression binary encoding of the bible and write a vertical/diagonal/horizontal binary reconstruction program to backtrack through a two d array representing the bible and you get much higher ratio of real words. Reform it into a 3d array and bam, instant history of the past present and future in possible valid world combinations(in english)
 
NullPointerException said:
perform a compression binary encoding of the bible and write a vertical/diagonal/horizontal binary reconstruction program to backtrack through a two d array representing the bible and you get much higher ratio of real words. Reform it into a 3d array and bam, instant history of the past present and future in possible valid world combinations(in english)

Sorry, but I don't understand what you mean by this. Could you elaborate a little on what you mean by:

(1) "compression binary encoding", I can't parse this.

(2) 2d array of the Bible. How exactly am I supposed to form this array? Do I form a large square of the Bible text or do I choose some fixed-length rows, and if the latter, how long should they be?

If you explain in a little more detail what you mean, I certainly can write such a program and check what happens to the ratio of English words.
 
I made a test using KJV Genesis text. Test procedure:

(1) Remove all punctuation from the text, change all uppercase letters to lowercase.

(2) Form a 390 * 390 square of the letters of Genesis, with the text starting at the first row, then continuing from the beginning of the second row, and so on. The last row is not full since the number of letters in KJV Genesis is not a perfect square.

(3) Go through all positions of the square finding n-letter lowercase English words upwards, downwards, and diagonally up, down, left, and right. Not horizontally, since the text itself runs that way.
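The steps above can be sketched like this (the original program isn't shown, so this is a hypothetical Python reconstruction; `dictionary` is whatever word set you check against):

```python
def grid_words(text, size, dictionary, min_len=3, max_len=9):
    """Lay the letters of text out in rows of length `size` and collect
    dictionary words read vertically (up and down) and along the four
    diagonals -- not horizontally, since the text itself runs that way."""
    letters = [c for c in text.lower() if c.isalpha()]
    n_rows = (len(letters) + size - 1) // size  # last row may be short
    grid = [letters[r * size:(r + 1) * size] for r in range(n_rows)]
    directions = [(-1, 0), (1, 0), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    found = []
    for r in range(len(grid)):
        for c in range(len(grid[r])):
            for dr, dc in directions:
                word, rr, cc = "", r, c
                while (0 <= rr < len(grid) and 0 <= cc < len(grid[rr])
                       and len(word) < max_len):
                    word += grid[rr][cc]
                    if len(word) >= min_len and word in dictionary:
                        found.append(word)
                    rr, cc = rr + dr, cc + dc
    return found
```

Dividing the per-length counts by the number of starting positions times directions then gives the proportions reported below.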

The resulting proportions of English words for different word lengths were:
Code:
Square size: 390
3: 10.147%
4:  2.321%
5:  0.277%
6:  0.034%
7:  0.002%
8:  0.000%
9:  0.000%
Note that these figures are slightly higher than those from the letter distribution that I posted before. Here's a comparison:
Code:
    Bible   Random
-----------------------
3:  10.15%   8.14%
4:   2.32%   2.00% 
5:   0.28%   0.25% 
6:   0.03%   0.03% 
7:   0.00%   0.00% 
8:   0.00%   0.00% 
9:   0.00%   0.00%

[Edited to add: I just started a test run with the whole KJV Bible. Though, this may take some serious time to run through.]
 
I just saw a Bible Code "documentary" on the History Channel yesterday. They barely gave lip service to the Moby Dick predictions. The argument boiled down to "we've done so many of these it has to be beyond chance."
 
LW said:

[Edited to add: I just started a test run with the whole KJV Bible. Though, this may take some serious time to run through.]

Here are figures for full KJV Bible and Moby Dick:
Code:
     KJV     Moby Dick    
---------------------------
3:  9.974%    9.941%
4:  2.281%    2.304%
5:  0.268%    0.277%
6:  0.031%    0.031%
7:  0.002%    0.003%
8:  0.000%    0.000%
9:  0.000%    0.000%

KJV has a slight edge at 3-letter words, then the advantage goes to Moby Dick, though the differences are very, very small and comparable to the random distribution.

So, I guess this wasn't the method meant by NullPointerException.
 
I believe Null was suggesting that if you placed the letters in a three-dimensional matrix instead of a two-dimensional one, the number of words you'd find would increase exponentially.
 
