My thoughts on the 340

Level 5
Posts: 560
Joined: Fri Jan 15, 2010 8:51 pm

Re: My thoughts on the 340

Postby doranchak » Wed Nov 09, 2011 10:29 am

ALisowsky wrote:Dave, I think the reason you are getting so many false positives is that, with this method, brute forcing such a common letter leaves too many degrees of freedom for assuming that a pattern = homophone set... I'll try to explain.

For the first homophone "set" (which is actually not a set), you are picking up a true homophone set of 3 (Z, p, W). These are frequent enough because they represent the 'E' symbols. They occur very commonly in a sequential pattern because they are homophones. When you add another frequently occurring letter (N, for instance, represented by ^) and, instead of restricting it to occur AFTER the sequence and in its place each time, you allow it to "roam", you pick up odd symbols purely by chance (and this chance increases with the frequency of the plaintext letter the symbol represents).

So while this approach WILL identify some homophone sets that are not regularly substituted, it will always also generate some sets which contain letters that are NOT homophones.


That is a very good point. I should probably include some further measure of regularity in the symbol occurrences to filter out the random effects of high-frequency symbols. I was hoping that the way I was computing the odds would diminish those effects by including all the ways the extra non-repeated symbols could ruin the nice pattern of repeats. But maybe I need to think about a better computation, or pair it with Jurgen's environment analysis to look for other common behaviors shared by the set of symbols under consideration.

Level 2
Posts: 51
Joined: Tue Aug 09, 2011 9:20 pm

Re: My thoughts on the 340

Postby ALisowsky » Wed Nov 09, 2011 10:32 am

I think that there is a benefit to this in hindsight with the 408, but I don't think it's going to help much with the 340.

If you analyze the strings you found:

ZpW^
ZpWN
ZpNE
DkZp
DZpW
ZpDE
pW6N
p6NE
pW+N
p+NE
/pW+

And sort each letter by how many strings contain it:

p=1
Z=2
W=3
N=4
+=5
D=6
6=7
k=8
^=9
/=10

And substitute in the plaintext
1=E
2=E
3=E
4=E
5=E
6=N
7=E
8=I
9=N
10=K

You will see that the most frequently occurring symbols found within those strings ARE homophone sets. Theoretically, if we use the method you and JK are exploring here, and you were to generate 5 strings containing 8 different potential homophone strings through brute search, then even assuming errors/false positives, we would still expect the more frequently occurring characters to be homophones.

Adam
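
The tally described in this post can be reproduced with a short script. This is only a sketch: the function name `symbol_support` is made up for illustration, and it counts each symbol once per string, which is one reading of "how many strings contain it".

```python
from collections import Counter

# The eleven candidate strings listed in the post above.
candidates = ["ZpW^", "ZpWN", "ZpNE", "DkZp", "DZpW", "ZpDE",
              "pW6N", "p6NE", "pW+N", "p+NE", "/pW+"]

def symbol_support(strings):
    """Count, for each symbol, how many candidate strings contain it."""
    counts = Counter()
    for s in strings:
        counts.update(set(s))  # set() so a symbol counts once per string
    return counts

support = symbol_support(candidates)
for symbol, n in support.most_common():
    print(symbol, n)
```

On these strings, p appears in all eleven and Z, W and N in six each, so the order among those three is a tie-break. The count also picks up the cipher symbol E (four strings), which the ranked list in the post leaves out.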

Level 5
Posts: 560
Joined: Fri Jan 15, 2010 8:51 pm

Re: My thoughts on the 340

Postby doranchak » Wed Nov 09, 2011 10:40 am

Here is an example of environment analysis (generated by CryptoScope) of each symbol that maps to E in the 408:

https://www.evernote.com/shard/s1/sh/2e9d74a8-3b52-445a-80f9-ccf2690579ab/35d094ddecc46fb260388be5d016b09d

Jurgen, let's say you didn't know that those symbols all mapped to E, and you were just looking at the environment analysis. What do you look at in the environment analysis to rank the possibility that the symbols map to a high-frequency plaintext letter? Or are you just using the analysis to establish that each occurrence of the symbol probably maps to only one plaintext letter?

Level 3
Posts: 60
Joined: Sat Oct 29, 2011 7:46 am

Re: My thoughts on the 340

Postby JayKay » Wed Nov 09, 2011 11:49 am

If so, I wonder if you can explicitly compute those similarities using a technique similar to the cosine distances described in the Copiale paper by extending it beyond the "immediately before" and "immediately after" positions around a given cipher symbol.


At the moment it looks like we just do not have enough text to "do so". Nevertheless, I will take a look at the whole "complex" (+/-5).
Example:
[Image attachment: Aufzeichnen.JPG]


doranchak wrote:Jurgen, let's say you didn't know that those symbols all mapped to E, and you were just looking at the environment analysis. What do you look at in the environment analysis to rank the possibility that the symbols map to a high-frequency plaintext letter? Or are you just using the analysis to establish that each occurrence of the symbol probably maps to only one plaintext letter?


If we take into account that even the symbols used twice in the 408 (that is, symbols standing for two plaintext letters) stand for only one plaintext letter most of the time within the same linearity context, we may conclude, even if it's a weak conclusion, that frequently occurring symbols within a linearity complex stand, with a certain probability, for the same plaintext letter.

There is the question of what this means for the fixed symbols (xxxxx+xxxxx). If the fixed symbol is a rarely used one, and another rarely used one has the same symbols around it with the same number of positions in between, does this suggest, with a certain probability, that these symbols stand for the same plaintext letter?

I think we should take a look at a normal English text and do this analysis on the plaintext. My first intuition was that environment analysis could be something like the Kasiski test or frequency analysis.
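
As a rough illustration of what a +/-5 environment tally might look like, here is a minimal sketch. It is not CryptoScope's implementation: the function name `environment`, the per-offset layout, and the toy string are all assumptions made for the example.

```python
from collections import Counter

def environment(cipher, symbol, radius=5):
    """Tally the symbols seen at each offset -radius..+radius
    around every occurrence of `symbol` in `cipher`."""
    env = {off: Counter() for off in range(-radius, radius + 1) if off != 0}
    for i, c in enumerate(cipher):
        if c != symbol:
            continue
        for off in env:
            j = i + off
            if 0 <= j < len(cipher):  # ignore positions past either end
                env[off][cipher[j]] += 1
    return env

# Toy text, not a real cipher: X always sits between "AB" and "CD".
toy = "ABXCDABYCDABXCD"
env = environment(toy, "X", radius=2)
for off in sorted(env):
    print(off, dict(env[off]))
```

Two symbols whose tallies look alike at the same offsets would be candidates for standing for the same plaintext letter. The same function can be run on plain English text to see what the environments of real letters look like.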

Level 5
Posts: 560
Joined: Fri Jan 15, 2010 8:51 pm

Re: My thoughts on the 340

Postby doranchak » Wed Nov 09, 2011 4:35 pm

ALisowsky wrote:You will see that the most frequently occurring symbols found within those strings ARE homophone sets. Theoretically, if we use the method you and JK are exploring here, and you were to generate 5 strings containing 8 different potential homophone strings through brute search, then even assuming errors/false positives, we would still expect the more frequently occurring characters to be homophones.
Adam


That's worth exploring. Something that troubles me is that the combinatorial odds computations are so much lower for the 340 than for the 408.

For instance, the top 100 matches in the 5-symbol brute force search on the 408 have odds that range from 113,016,731,547:1 to 30,570,920:1. But the top 100 matches in the 5-symbol brute force search on the 340 have odds that range from 3,187,041:1 to 369,038:1. In fact, almost all of the top 300 matches in the 408 brute force search exceed the odds of the best match found in the 340. This makes me think the 340 is much more prone to false positives since its homophone cycles are so weak.

Level 5
Posts: 560
Joined: Fri Jan 15, 2010 8:51 pm

Re: My thoughts on the 340

Postby doranchak » Thu Nov 10, 2011 4:39 am

JayKay wrote:I think we should take a look at a normal English text and do this analysis on the plaintext. My first intuition was that environment analysis could be something like the Kasiski test or frequency analysis.


Can you describe the information we need to collect about the English text? I did some basic counts (using Moby Dick) when we last talked about this. Here is an example of the top 10 bigrams of various forms:

Bigrams (AB), listed as pattern,count,relative frequency:

TH,32929,0.03459005
HE,26625,0.027968055
IN,19858,0.020859703
ER,15838,0.016636921
AN,15277,0.016047623
ES,13322,0.013994006
HA,12692,0.013332227
ST,12547,0.013179912
RE,12014,0.012620026
ND,11299,0.011868959

Bigrams (A?B):

T?E,25033,0.026295748
E?E,13965,0.014669442
E?T,12847,0.013495046
A?E,10916,0.011466639
O?T,10853,0.011400461
E?A,10853,0.011400461
O?E,10382,0.010905703
E?O,10134,0.010645193
A?D,9965,0.010467668
H?S,9003,0.009457141

Bigrams (A??B):

E??E,13467,0.014146335
E??A,11328,0.011899435
H??E,11241,0.011808046
E??O,9727,0.010217673
E??I,9398,0.009872077
T??E,9336,0.00980695
A??E,9312,0.009781739
O??H,8903,0.009352107
E??N,8779,0.009221852
T??T,8739,0.009179834

Bigrams (A???B):

E???E,14238,0.014956243
T???E,12072,0.0126809785
E???T,11864,0.012462486
O???E,10167,0.01067988
E???A,9407,0.009881541
E???N,8962,0.009414093
A???E,8882,0.009330058
T???A,8174,0.0085863415
S???E,7776,0.008168264
I???E,7760,0.008151458



What other information would you like to collect about the plain text? I can try to do it.
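
For anyone who wants to reproduce counts like these on another text, here is a minimal sketch. It assumes that spaces and punctuation are dropped before counting and that the third column is the count divided by the total number of pairs; the post does not spell out either detail. The function name `gapped_bigrams` is made up for the example.

```python
from collections import Counter

def gapped_bigrams(text, gap):
    """Count letter pairs separated by `gap` wildcard positions.

    gap=0 gives ordinary bigrams (AB), gap=1 gives A?B, and so on.
    Returns (pattern, count, relative frequency) tuples, most common first.
    """
    # Assumption: only A-Z letters are kept; spaces and punctuation are dropped.
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts = Counter()
    for i in range(len(letters) - gap - 1):
        counts[letters[i] + "?" * gap + letters[i + gap + 1]] += 1
    total = sum(counts.values())
    return [(p, n, n / total) for p, n in counts.most_common()]

sample = "THE THE"
for gap in range(3):
    print(gapped_bigrams(sample, gap)[:3])
```

Feeding it the full text of a book, and slicing the first ten entries per gap, would give lists in the same shape as those above.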

Level 5
Posts: 560
Joined: Fri Jan 15, 2010 8:51 pm

Re: My thoughts on the 340

Postby doranchak » Thu Nov 10, 2011 5:22 am

ALisowsky wrote:You will see that the most frequently occurring symbols found within those strings ARE homophone sets. Theoretically, if we use the method you and JK are exploring here, and you were to generate 5 strings containing 8 different potential homophone strings through brute search, then even assuming errors/false positives, we would still expect the more frequently occurring characters to be homophones.
Adam


Here is a way we might be able to enhance your idea.

In each homophone cycle candidate, produce all combinations of pairs of symbols, and compute their cosine distances. The cosine distance (strictly speaking a cosine similarity, since higher means closer) measures how closely the distribution of a symbol X resembles that of another symbol Y, based on the symbols that occur immediately before or after X and Y. It is a value from 0 to 1, where 1 is a perfect match. A "true positive" homophone cycle candidate should show higher cosine distances among its pairs, while a "false positive" should show lower cosine distances. Here is a sample calculation:

ZpW^ (decodes to EEEN). 

Zp: 0.2721655269759087
ZW: 0.20833333333333334
Z^: 0
pW: 0.40824829046386296
p^: 0.07715167498104597
W^: 0

sum: 0.965899
mean: 0.160983

ZpWN (decodes to EEEE)

Zp: 0.2721655269759087
ZW: 0.20833333333333334
ZN: 0.20412414523193154
pW: 0.40824829046386296
pN: 0.16666666666666669
WN: 0.15309310892394862

sum: 1.41263
mean: 0.235439

DkZp (decodes to NIEE)

Dk: 0.05976143046671968
DZ: 0
Dp: 0
kZ: 0.10540925533894598
kp: 0.19364916731037085
Zp: 0.2721655269759087

sum: 0.630985
mean: 0.105164


Those three cycles have the same odds computation, but you can see here that the cosine distances are higher overall for ZpWN (a true homophone cycle of the 408) than for the others. So, maybe we can combine this measurement to further rank the candidates.
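
A minimal sketch of the pairwise calculation follows. How the context vectors were built for the numbers above is not stated, so this version makes an assumption: each symbol gets a count vector over (position, neighbor) features, with "before" and "after" kept as separate dimensions. All function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def context_vector(cipher, symbol):
    """Count the symbols immediately before and after each occurrence of `symbol`."""
    vec = Counter()
    for i, c in enumerate(cipher):
        if c != symbol:
            continue
        if i > 0:
            vec[("before", cipher[i - 1])] += 1
        if i + 1 < len(cipher):
            vec[("after", cipher[i + 1])] += 1
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def mean_pairwise_cosine(cipher, candidate):
    """Average cosine over all symbol pairs in a candidate (needs 2+ symbols)."""
    vectors = {s: context_vector(cipher, s) for s in candidate}
    scores = [cosine(vectors[a], vectors[b]) for a, b in combinations(candidate, 2)]
    return sum(scores) / len(scores)

# Toy text: X and Y share the context A_B, Z does not.
toy = "AXBAYBAXBCZD"
print(mean_pairwise_cosine(toy, "XY"), mean_pairwise_cosine(toy, "XYZ"))
```

Candidates with the same odds could then be sorted by this mean as a secondary key, which is one simple way to combine the two measurements.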

Level 2
Posts: 51
Joined: Tue Aug 09, 2011 9:20 pm

Re: My thoughts on the 340

Postby ALisowsky » Thu Nov 10, 2011 8:21 am

doranchak wrote:Those three cycles have the same odds computation, but you can see here that the cosine distances are higher overall for ZpWN (a true homophone cycle of the 408) than for the others. So, maybe we can combine this measurement to further rank the candidates.


:clap:

That's amazing, Dave, great job!

Level 3
Posts: 60
Joined: Sat Oct 29, 2011 7:46 am

Re: My thoughts on the 340

Postby JayKay » Thu Nov 10, 2011 2:44 pm

doranchak wrote:Can you describe the information we need to collect about the English text? I did some basic counts (using Moby Dick) when we last talked about this.


Some thoughts:

The idea is this: if there is a basic inner logic within every language, so that we can conclude that the most often used letter is, for example, E followed by T, why shouldn't there be such a logical foundation for words and sentences, for the combinations of the 1000 or 5000 most often used words in standard English, etc.?

We can actually build trees for letter combinations:
Only an example: A most often followed by B most often followed by C most often followed by D

Can't we do this with the most often used words? Even the space between two words could be filled in by the probabilities of the following letter.

I|like|killing|people (We take into account that, statistically, a word in English is made up of about four letters, I think.)
IL ...% LI,IK,KE %
EK...% ....
GP... %...

Wouldn't this be something independent of the encryption? Maybe I am wrong.

Let's say we have the words "house,mouse,the,I,in,like"

I like the house ..%
I like the mouse ..%
The mouse I like ..%
.... ..%

Can we combine the output with the ranking of the words? The ranking of the sentences?
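
The letter-pair percentages sketched in this post (IL, LI, IK, KE, ...) can be filled in mechanically once the words are run together. This is a minimal sketch under that assumption; the function name `transition_table` is invented for the example, and a four-word phrase is far too small for the resulting percentages to mean anything.

```python
from collections import Counter, defaultdict

def transition_table(words):
    """Estimate P(next letter | current letter) over the words
    run together without spaces, so pairs can span word boundaries."""
    stream = "".join(words).upper()
    counts = defaultdict(Counter)
    for a, b in zip(stream, stream[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

table = transition_table(["I", "like", "killing", "people"])
for letter in "IEG":
    print(letter, table[letter])
```

In this toy table, I is followed by L half of the time, and the cross-word pairs EK and GP show up alongside the within-word ones. Built from a large word list or corpus, the same table would give the "A most often followed by B" chains described above.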

Level 3
Posts: 60
Joined: Sat Oct 29, 2011 7:46 am

Re: My thoughts on the 340

Postby JayKay » Thu Nov 10, 2011 3:31 pm

Environment analysis

Output 2p+z, distance +/- 5
[Image attachment: 2p+z distance5.jpeg]


Interesting:
p+, p/diagonally+,
p
+

the same for +z, +2, 2z and so on.

What the hell does that mean? :doh:
