Double Fine Action Forums
buckysrevenge

New stuff on Tumblr!

Possibly the first 104...

BECAUSE.AS.YOU.CAN.SEE if the dots are always spaces in the text, then the number of dots should be the number of words - 1. But yes, I think it's unlikely that anything beyond the first 104 letters is a final letter.

I don't have a whole lot of time to stare at this today, but I will try to when I can.

It's *possible*, just unlikely. If I truncate my noddy example:

This is a simple sample


SSAEE

II.LL

H.PPT

MM.II

SS

5 words, 3 dots, because the two longest words are the same length.

On the balance of probability I think it's almost certainly 104 words, but I also think 103, 105 and 106 may be possible through some similar quirk of the enciphering method.

I still don't see how you're figuring 5 words 3 dots from that. However it's encoded, it has to be encoded such that all spaces are accounted for by dots. Otherwise there'd be no way to decipher it, because the ciphertext wouldn't contain the information needed.

Share this post


Link to post
Share on other sites

There are only three dots in this one because two words were the same length. If it were "This is a simple sentence" there would be four dots.

Yeah, looks like I just misread the number of spaces, but I agree that number of words - 1 makes the most sense, though that could just be me not getting why two words of the same length would alleviate the need for spacing. There's also no real reason to assume that the $ would replace a . whether it's a variable or a price, so given all that, 104 sounds the most likely to me. If that's the case, the fact that there are two As and two Is in that particular block means the number of second-to-last letters should be at least 100.

What was interesting is the frequency analysis earlier that had a large group of the most common first letters. Even if the words were similar lengths, you'd expect them to be more spaced out than the third letters, if only because the algorithm should run out of third letters before it runs out of first letters. Although, if the search were to be narrowed, say to smaller spikes, it may be possible to discern the direction in which they're declining, or at least areas which are more likely to be Xth rows. Unfortunately, the sample size for any given group might not be big enough to do that, given the amount of noise we're working with, and it might not be just going letter by letter of each word. Like it could be alternating last letter-first letter or something.

Yeah, two words being the same length wouldn't change the number of dots that appear - it would just change where the dots appear in the text (if the enciphering method is something like the ones we tried).

In the method above, multiple words at the maximum length alleviate it (unless you want to put a . or .. at the end), because if the decrypter knows the technique they can easily notice that they're getting nonsense when they decrypt as 4 words, so they can shift the number of words up until they get something sensible. It's almost certainly not the case here because there's something more complex going on (which actually makes me think 103 words might be a possibility, since it could be that all words need their length denoted to make it decryptable). Eh, this is just me muddying the waters though; I still agree that 104 is where we should focus our efforts.

Maybe more interesting than needing a block of 100 - the first A/I doesn't turn up until the 49th character, so we presumably need a block of 49+ characters with no spaces at all. There's only one of those - a block of 56:

HHLWHKHNMLHRNKSVRLRHWVDMLBDVRTHFBTBEERHERVDHLDHVDTGSYMN

Though obviously that's based on the premise of no reshuffling, which may not be borne out given all the clustering we're seeing elsewhere. On the other hand, is it just me, or does that seem like a big cluster of letters that are typically fairly infrequent? I was hoping it might match the typical distribution of first letters in a word, but it really doesn't (though there are a lot of Hs).

Let's do some analysis on those 49 letters. Watch this space.

Analysis:

Most common letters

H 10

R 6

L 5

V 5

D 5

Not a strong match, but the sample is quite small. Still, H is by far the most common letter, which is consistent with these being second letters of words: the most common second letter is actually H, presumably because of common words like when, then, the, there, their, which, why, who, she, should and so on.
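
For anyone who wants to reproduce the tally above, here is a minimal Python sketch. It assumes the block is exactly the string quoted earlier in the thread; the names `BLOCK` and `top_letters` are made up for illustration.

```python
from collections import Counter

# The unbroken run of letters quoted earlier in the thread (assumed copied correctly).
BLOCK = "HHLWHKHNMLHRNKSVRLRHWVDMLBDVRTHFBTBEERHERVDHLDHVDTGSYMN"


def top_letters(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent letters in text, ignoring '.' and '$'."""
    letters = [ch for ch in text if ch.isalpha()]
    return Counter(letters).most_common(n)


if __name__ == "__main__":
    for letter, count in top_letters(BLOCK):
        print(letter, count)
```

The same function can be pointed at any other chunk of the ciphertext to compare its profile against known first-letter or last-letter frequencies.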

Might be worth taking different blocks of text and comparing their letter frequencies with what we know about letter frequencies at each position in a word. We already know the first 103 or so correlate very strongly with the last letter of a word, but what about other sections?

The end of the text

EOOOORUKAAEY

ESIILAINWIIUSNN.L..EOOAU

..SOUIAASASUEIUNSI..$...

.O.YE.IAAA.....X...OFTF.

GGSBAJOOBPPAAOOOAAE....L

........OEMLMBAH..ANZZUU

is made largely of AOIESU which is quite close to the general letter frequencies of ETAOIN but with a conspicuous lack of Ts and a lower frequency of Es.

From this we can conclude that wherever the first letters of the words are coming up, they're not appearing at the end of the text. T is by far the most common starting letter for a word, and yet there's only 1 here. I would say it's probable that at the end we're seeing a jumble of letters from various parts of words, an artifact of the enciphering method used.

What about this block from the middle...

.NNNNONNEAAAAOII.TW...TT

TTTTWWTTTTTTP..GG.FFLW..

EHVHNTTT.H.DD..RW.IALLUA

ZZB.E.OIPB..BEAANRI...RR

UO.OE..OEEOOUAOOOO.OIIOI

IIIAOA.OAAAATTSDGSTTTL..

TOANEW are the most common, in that order, which is at least a reasonable match for first letters in words, which go TOAWBC (B and C being more common there). I'm not sure how to account for the lack of Cs in the above, or the two Zs in a row.

All in all I'm getting the impression that the algorithm starts in a very ordered way but gets more and more disorderly towards the end, which is consistent with the test algorithms we've been trying, so it's probably something similar, but different.

You guys have found some real telling clues I think! I'm coming at this from another angle right now: the font. It's Share Tech, not in and of itself weird or unusual, but what is unusual (or at least odd when you look at the page source) is the way it's included in the CSS. It's in there in about 3 different formats (embedded OpenType, TrueType and SVG) and these are just links to normal files on Tumblr's server (all very normal).

However, there's a fourth version, WOFF (web open font format). For some reason the data for this font is embedded right in the CSS. A huge chunk of utf-8, base-64 encoded data.

The question on my mind is why? Why not just link to an external file like the others? It's also inside of the URL field of this particular CSS font-face but it's clearly not a URL. I'm no CSS guru so maybe this is not unusual. But it strikes me as odd. I scanned a few other tumblr pages to see any evidence that this is normal for tumblr but nothing similar is there.

I tried decoding the data and it at least starts with the right magic number for a WOFF font ('wOFF') but I didn't get much further than that.
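
A rough Python sketch of that magic-number check, for anyone following along. It assumes the CSS value is a data URI whose payload follows the text "base64,", and it reads only the first three big-endian 32-bit fields of a WOFF 1.0 header (signature, flavor, total length). The function names are invented for illustration and are not from any font library.

```python
import base64
import struct


def woff_header(data_uri: str) -> tuple[bytes, int, int]:
    """Decode a base64 data URI and return (signature, flavor, length)."""
    # Everything after "base64," is assumed to be the encoded font payload.
    payload = data_uri.split("base64,", 1)[-1]
    raw = base64.b64decode(payload)
    # A WOFF 1.0 header begins with three big-endian 32-bit fields.
    signature, flavor, length = struct.unpack(">4sII", raw[:12])
    return signature, flavor, length


def looks_like_woff(data_uri: str) -> bool:
    """True when the decoded payload starts with the 'wOFF' signature."""
    return woff_header(data_uri)[0] == b"wOFF"
```

Passing the magic-number check only says the blob claims to be a font; it says nothing about whether anything extra is hidden inside it.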

http://pastebin.com/AfNYZUsJ

It may be a red herring but it seems like things could benefit from a fresh perspective here anyways!

UPDATE: Also worth noting, this is something they have control over. Check out the 'how to upload' section of this page http://codehelp.tumblr.com/fonts

UPDATE: I decoded the data with 'recode' and then converted the resulting file to a ttf using woff2sfnt (http://people.mozilla.org/~jkew/woff/), then opened up the font table with TTFEdit, and nothing looks weird. Just a normal spline font. Boo. That would be a fun way to provide a substitution cipher (a font file with the glyph table scrambled). But substitution ciphers are too easy to decrypt anyways I suppose.

Seth B.

Another recap time. So far we have supposed that the text is enciphered something like this:

EIOGTTRYFTEES

R.GNUEEROAHDI

E .IOYHO.HTO.

H  T..TE W.C

.  S  OH . .

   E  NT

   T  A.

   .  .

That is the phrase HERE I GO TESTING OUT YET ANOTHER THEORY OF WHAT THE CODE IS converted into vertical columns, with the end of each word at the top and a . at the bottom in place of the space. Read off row by row, the result is a block of text something like this:

EIOGTTRYFTEESR.GNUEEROA

HDIE.IOYHO.HTO.HT..TEW.

C.SOH..ENTTA...
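
The column scheme recapped above can be sketched in a few lines of Python. This is only the test cipher under discussion, not the real enciphering method, and `column_encode` is a made-up name. It assumes every word, including the last, gets a trailing dot, as in the grid shown.

```python
from itertools import zip_longest


def column_encode(plaintext: str) -> str:
    """Write each word reversed in its own column, then read the grid row by row."""
    # Reverse each word and append '.' to stand in for the space that followed it.
    columns = [word[::-1] + "." for word in plaintext.split()]
    # zip_longest pads short columns with None, which is skipped when reading rows.
    rows = zip_longest(*columns)
    return "".join(ch for row in rows for ch in row if ch is not None)


PHRASE = "HERE I GO TESTING OUT YET ANOTHER THEORY OF WHAT THE CODE IS"
print(column_encode(PHRASE))
```

Run on the recap phrase, this reproduces the three-line block above once the line breaks are ignored.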

It's become quite clear that while there are some things about this we like (it puts the last letters in the right place, and produces a distribution of spaces SORT of like what we see in the ciphertext), there are several problems.

1) The ciphertext features triplets of dots early on, which would imply that there are several one letter words in a row in the plaintext - not particularly plausible.

2) This method so far doesn't reproduce the clusters of letters we are hoping for, although clearly it does produce patterns (e.g. the .IOYHO.HTO.HT..TEW.C.SOH is clearly different in character to the EIOGTTRYFTEESR.GNUEEROA at the start)

From which we can speculate that the actual method of enciphering is something like the above, but with an added step that makes the thing tick. If we can think of what added step would solve problems 1 and 2, we might have our answer.

My best guess: It may be to do with the ordering of the columns in the first bit. For example, what if our original

EIOGTTRYFTEES

R.GNUEEROAHDI

E .IOYHO.HTO.

H  T..TE W.C

.  S  OH . .

   E  NT

   T  A.

   .  .

were rearranged into different columns via some consistent rule, say longest word to shortest? That would tend to push letters together, because it would put words of similar length (including identical words) next to each other. So if all instances of the word THAT were together, that would explain why there are suddenly a bunch of Ts in a row. That's probably not exactly it (in fact it almost certainly isn't, because it introduces new problems), but it serves as an example. It may be something of this sort.

I'm a visual person so here it is transposed (so you don't have to read it top to bottom … er … bottom to top). Two things strike me. The '$' is in the exact same place in this arrangement. There's lots of terminating periods at the end of the first few lines (the first 8):

S    L    A    E    R    R    .    .    D    H    .    T    E    Z    U    I    .    L    T    E    .    .    G    .

G    E    F    D    S    W    A    .    M    V    N    T    H    Z    O    I    L    N    T    S    .    O    G    .

G    S    F    N    Y    .    .    H    L    D    N    T    V    B    .    I    N    .    T    I    S    .    S    .

C    M    E    E    E    R    M    H    B    T    N    T    H    .    O    A    F    Y    G    I    O    Y    B    .

S    D    S    E    G    F    .    L    D    G    N    W    N    E    E    O    S    H    U    L    U    E    A    .

N    E    A    S    U    H    .    W    V    S    O    W    T    .    .    A    S    .    O    A    I    .    J    .

T    T    E    Y    E    W    .    H    R    Y    N    T    T    O    .    .    I    M    U    I    A    I    O    .

S    N    E    E    R    W    I    K    T    M    N    T    T    I    O    O    I    C    U    N    A    A    O    .

S    O    E    Y    R    E    I    H    H    N    E    T    .    P    E    A    .    C    E    W    S    A    B    O

S    E    N    T    E    .    E    N    F    .    A    T    H    B    E    A    P    R    E    I    A    A    P    E

F    D    T    G    .    H    .    M    B    O    A    T    .    .    O    A    W    A    .    I    S    .    P    M

E    R    E    E    O    H    S    L    T    O    A    T    D    .    O    A    .    U    .    U    U    .    A    L

E    T    T    D    M    M    S    H    B    O    A    P    D    B    U    T    .    .    E    S    E    .    A    M

S    S    E    N    M    R    N    R    E    .    O    .    .    E    A    T    C    .    O    N    I    .    O    B

R    O    O    R    M    M    E    N    E    .    I    .    .    A    O    S    L    .    O    N    U    .    O    A

W    O    N    L    .    C    E    K    R    .    I    G    R    A    O    D    .    .    O    .    N    X    O    H

O    O    E    R    M    H    E    S    H    N    .    G    W    N    O    G    I    E    O    L    S    .    A    .

T    T    Y    I    C    H    E    V    E    .    T    .    .    R    O    S    F    E    R    .    I    .    A    .

O    S    R    I    .    H    I    R    R    .    W    F    I    I    .    T    E    E    U    .    .    .    E    A

S    T    T    E    .    F    A    L    V    .    .    F    A    .    O    T    V    L    K    E    .    O    .    N

O    G    G    T    .    W    R    R    D    S    .    L    L    .    I    T    F    E    A    O    $    F    .    Z

Y    E    D    E    E    W    N    H    H    N    .    W    L    .    I    L    F    A    A    O    .    T    .    Z

G    R    T    S    G    .    N    W    L    S    T    .    U    R    O    .    F    E    E    A    .    F    .    U

L    O    Y    T    R    .    .    V    D    R    T    .    A    R    I    .    P    O    Y    U    .    .    L    U

Cool, but I'm not sure it's significant - it's not surprising that the $ stays in the same place, because it's exactly 4 away from both edges, so when the grid is flipped around it stays put, as would any other letter along that line.

And I suppose it looks interesting that the . appears regularly at the end of the line, but in the end it's no more significant than them appearing at the bottom in the original ciphertext, it's just flipped around a bit.
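
To illustrate the point about the $ not moving, here is a toy Python sketch. It assumes the rearrangement is a plain transpose (rows become columns) of a square grid, which is what the listing appears to be; the 4x4 grid and the helper names are invented for the example.

```python
def transpose(rows: list[str]) -> list[str]:
    """Swap rows and columns of a rectangular grid of characters."""
    return ["".join(column) for column in zip(*rows)]


def position(rows: list[str], target: str) -> tuple[int, int]:
    """Return (row, column) of the first occurrence of target."""
    for r, row in enumerate(rows):
        c = row.find(target)
        if c != -1:
            return r, c
    raise ValueError(f"{target!r} not found")


grid = ["ABCD", "EFGH", "IJ$L", "MNOP"]  # toy grid with '$' on the diagonal
flipped = transpose(grid)
print(position(grid, "$"), position(flipped, "$"))
```

Any character whose row index equals its column index keeps its position under a transpose, so the $ landing in the same spot carries no extra information.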

It's a shame it's so long or we could potentially solve this by anagramming.

It's a shame it's so long or we could potentially solve this by anagramming.

It's fun to put pieces of it into anagram-solver.net (like here)! There's so many letters in there that when you allow partial matches you can get just about anything. My personal favorite so far:

A list of anything and everything in any video game that might sort of kind of be related to ghosts but not really

:-O

It's a shame it's so long or we could potentially solve this by anagramming.

Yeah that's the crux of it. I've been scripting up various daft transformations and anagramming chunks of the text, but the chance of hitting on a real bit of the solution is so small that it's basically just a way to keep my brain ticking over on the subject.

A list of anything and everything in any video game that might sort of kind of be related to ghosts but not really

Hee! I think you're on to something there ;)

I'm duplicating a lot of work here but I'm mostly just playing catchup (only got into this today). Here are the frequencies charted. I think you're right that it's not substituted, just scrambled.

HnS-Frequency.png


Wow, you have all done such a great job analyzing the important features of the text - features I didn't even consider but are totally important in determining what the text is. I put another hint on the site for you all. Of course, it is, in turn, a little cryptic...

Here in the source:

<!--

Y'all have worked out so much already! The properties you've discovered in the text might even be useful in fields other than cryptography.

Here's a new hint that might push things along!

QlpoORdyRThQkAAAAAA=

> CBD_

-->

Right. Anyone?

The first thing I noticed is that the text looks like text in a hex editor. I mainly say that because of the periods. Though there are too many columns for the typical hex editor, so it's likely a red herring.

_HEXEDIT.GIF

Okay, so QlpoORdyRThQkAAAAAA= is the string that is output when you try to encode NOTHING to bz2, and then encode that to base64.

So... what's significant about bz2?
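
The claim about the hint string is easy to check with Python's standard library. A small sketch, using only the `base64` and `bz2` modules and the string exactly as it appears in the page source:

```python
import base64
import bz2

HINT = "QlpoORdyRThQkAAAAAA="

compressed = base64.b64decode(HINT)
print(compressed[:3])              # bzip2 streams begin with the bytes b"BZh"
print(bz2.decompress(compressed))  # the payload itself, which is empty
print(base64.b64encode(bz2.compress(b"")).decode())  # round trip from nothing
```

So the hint carries no hidden message of its own; the only information in it is the choice of bzip2.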

I think this is what is significant about it

http://en.wikipedia.org/wiki/Burrows–Wheeler_transform

So, the Burrows-Wheeler transform is an algorithm used in compression, specifically in bzip2, that rearranges the characters so that identical characters clump together, which is why we see all those repeated letters. But the thing about the BWT is that it's reversible, which means if we do a reverse BWT on the ciphertext we should get the plaintext. Working on that now.
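
For reference, a naive reverse BWT fits in a dozen lines of Python. This is a sketch of the textbook table method, not necessarily what anyone in the thread actually ran. It assumes the text contains exactly one end-of-text marker ($ here) and that the original sort used plain character-code order, where $ sorts before . and both sort before letters.

```python
def inverse_bwt(last_column: str, eof: str = "$") -> str:
    """Naive inverse Burrows-Wheeler transform; fine for a few hundred characters."""
    if last_column.count(eof) != 1:
        raise ValueError("expected exactly one end-of-text marker")
    n = len(last_column)
    table = [""] * n
    # Repeatedly prepend the last column and re-sort; after n passes the table
    # holds every rotation of the original text in sorted order.
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original text is the rotation that ends with the marker.
    original = next(row for row in table if row.endswith(eof))
    return original[:-1]


print(inverse_bwt("annb$aa"))
```

The line breaks in the ciphertext would need stripping first, since they are layout rather than part of the transformed string.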

bwt.PNG

Thanks to my friend Kieran for some help in sorting out some of the more technical biz of reversing that.

I've always wanted to delve further into compression techniques. What a cool little transform. Had not heard of it. The stuff of pure computer science! I'll have to see if I can make a project out of that sometime for my students! (with credit due to the creator of course)

Here's my own little program to reverse it, which I was busy writing when you totally spoiled it by posting the result :-P

http://pastebin.com/yRwLGkaf

Seth B.

Awesome! It makes a lot of sense that this is an easily compressible form of the text, but I'm not sure I'd have made that leap without the hint.

I figured Tumblr might have made it difficult to embed any steganography in the images, but I look forward to whatever puzzle is eventually figured out.

As a little coda to this, since I haven't had time to actually study the ins and outs of the transform, it appears that based on the example given on wikipedia:

Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES

Output TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT

a side effect seems to be that it does in fact seem to (kinda) sort the end letters of words to the start of the text, though not in the same order. :)

Yep I think a lot of your observations were dead accurate! And your sample text was good - encode it and it looks like:

SSENSSTEHNLEEADIEETTONEEG

ENEEOSTTRISDRSSOETTNETTGT

ETLRTSYTKENFSSEESESSOFGGE

GTETERFRDGKTYTSHEGOTTLTEE

NNF...MT..C.CZ..HHHHHHHHH

TH.WMY...I.AA..NUE.OO...E

RNEO.N.RHWVHHHVBPRWRVRHHH

BLDKWKNDSEDBTVEV.THHHHOMN

GBRMM.TIOOO..WNNNNNNNGUUI

IA..CGWTTTTTWTT.TTTTTTTP.

TPTTTTT.GG..S.MRLLTW.A..G

LKHHTHL.TTCC...HH......RG

WCCIIOOLUIB...LIEBNA...NI

.OOO.EAAEOOEIIAIIIIIIIIOI

UO.EOAIUETSDTNLLCG...OOCS

SIIL.CLLHF.FWPLN..BBOII.O

UOOEAOEEAC.WEEIIKIAIIIEIE

IDI.S...OEE..EAHAAIXAOHAI

USSUAAAANN.NR............

...E.E$.NINA..SSOF..OOBAO

OIE....A......EALAI

I'm sure I'll find other reasons for having put together a BWT utility... somehow...

It was more fun to implement than the TLS stuff from the press release anyway; it's such a simple and elegant little algorithm. I'm kinda fascinated by it now.

Oh and re the last letters appearing at the beginning, yes that's exactly what's happening - the alphabetical sort puts all the rotations that start with . at the top, which means the final column of those rows holds the last letters of the words. But their position within that set depends on the first letter of the *next* word, so they get very scrambled :). One consistent thing, though: with the $ EOF indicator, the first character of the output is always the last character of the actual text.
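
That behaviour shows up even on a tiny input. Below is a naive forward BWT in Python, offered as a sketch rather than the utility mentioned above. It assumes $ as the end-of-text marker, . for spaces, and plain character-code sorting, so $ sorts first and . sorts before any letter.

```python
def bwt(text: str, eof: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if eof in text:
        raise ValueError("text must not already contain the end-of-text marker")
    s = text + eof
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


sample = "THIS.IS.A.SIMPLE.SAMPLE"
encoded = bwt(sample)
dots = sample.count(".")
print(encoded[0])           # last character of the text
print(encoded[1:1 + dots])  # last letters of the words that precede a '.'
```

The rows beginning with . are ordered by whatever follows the dot, which is why the final letters come out grouped at the front but shuffled relative to the plaintext.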

I've read up on it now, and yes, it's remarkably simple, isn't it? I'm surprised that we didn't get to compression techniques from where we were:

*I/We knew that repeated patterns were useful in compression (because OOOOO can be shortened to O5, etc)

*We knew that we were dealing with an algorithm that generated text with lots of repeated patterns.

*We also knew that it was probably doing so in a way that was reversible without any extra info (because otherwise the puzzle wouldn't be solvable, except through brute force or via a key we didn't have)

Kicking myself that it didn't occur to us that such an algorithm would be HIGHLY useful in compression, and that we should look there.
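
The "OOOOO becomes O5" idea from the list above is plain run-length encoding. A toy Python sketch using that letter-then-count notation follows; real compressors such as bzip2 use more involved stages, so treat this purely as an illustration of why clumped text is easier to shrink.

```python
from itertools import groupby


def run_length_encode(text: str) -> str:
    """Collapse each run of identical characters into the character plus its count."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))


print(run_length_encode("OOOOO"))   # O5
print(run_length_encode("HHHHTT"))  # H4T2
```

Text with no repeated neighbours gets longer under this scheme, which is exactly why a transform that groups identical letters first is so useful.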
