how can a program recognize encoded text without decoding it?

using machine learning to identify text encodings without decoding them

note: the date shown is the publication date of this post. some portions were written earlier during development of the project, and the content was revised and expanded multiple times before and after publication.

second note: you can find all the code for this project and use it for yourself here

around October, I was looking through GitHub when I found a repository that attempted to decode a string without knowing which transformation it had undergone by just… spamming every possible decoding function until one produced plain text. it worked, but it felt unsatisfying, since it relied on brute force and would quickly become inefficient.

so I wanted to create something myself that would recognize these encodings without having to decode them. I found a similar project that did just this using regex, which was effective, but also boring to me, since regex rules felt fragile and easy to break with small variations.

from what I remembered, I thought random forests might be one of the better options for this. a random forest is basically a collection of decision trees trained on different random subsets of the data and “features”; their predictions are then combined (by averaging or voting) to improve accuracy and reduce overfitting.

however, before I started writing code, I researched a bit more and found out about gradient boosted trees, which build decision trees sequentially: each new tree is trained to correct the errors made by the trees before it, gradually improving overall prediction performance.

so I decided to settle on that for a bit and moved to the next challenge: where would I source the data from? for this, I decided to use the contents of the English Wikipedia and save a limited number of lines. my idea was to then apply a number of transformations to each line and save the results as the dataset, so I did just that.

I wrote some code that simply went through every line, appended the original text as the plain text sample, and then applied a number of predefined encoders, storing their results along with the name of the transformation.

samples = []

for text in corpus:
    # keep the untouched line as a "plain" sample
    samples.append((text, "plain"))

    # apply every predefined encoder and store the result with the transformation's name
    for name, encoder in ENCODERS.items():
        try:
            encoded = encoder(text)
            samples.append((encoded, name))
        except Exception:
            continue

as a base, I started with just a few simple transformations, namely: base16 (hex encoding), base32, base64, base85, URL encoding, rot13, and gzip64. I would later add more.
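to make this concrete, here is a sketch of what such an encoder table could look like using Python's standard library (the actual names and implementations in the project may differ):

import base64
import codecs
import gzip
import urllib.parse

# each encoder takes a plain string and returns its transformed form
ENCODERS = {
    "base16": lambda t: base64.b16encode(t.encode()).decode(),
    "base32": lambda t: base64.b32encode(t.encode()).decode(),
    "base64": lambda t: base64.b64encode(t.encode()).decode(),
    "base85": lambda t: base64.b85encode(t.encode()).decode(),
    "url": lambda t: urllib.parse.quote(t),
    "rot13": lambda t: codecs.encode(t, "rot_13"),
    "gzip64": lambda t: base64.b64encode(gzip.compress(t.encode())).decode(),
}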

but wait, how do our decision trees understand the differences between the samples? for that, we use the “features” I briefly mentioned: every string given to the decision trees is first converted into an array of numeric features.

as a start, I used some very basic features that I could think of: the length of the string; the length of the string modulo 4; the ratio of alphabetical characters to the total number of characters; the ratio of digits to the total number of characters; the ratio of printable characters to the total number of characters; the ratio of the number of ’=’ characters appearing in the text to the total number of characters (padding ratio); and how compressible the text is, which reflects how much structure or redundancy it contains. additionally, I added a feature for Shannon entropy.
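a sketch of roughly how that feature extraction could be implemented (the exact feature order and implementation details in the project may differ):

import math
import string
import zlib


def extract_features(text: str):
    n = len(text) or 1

    # compressed size relative to the original size: redundant text compresses better
    compressibility = len(zlib.compress(text.encode())) / n

    # Shannon entropy of the character distribution
    counts = [text.count(c) for c in set(text)]
    entropy = -sum((c / n) * math.log2(c / n) for c in counts)

    return [
        len(text),
        len(text) % 4,
        sum(ch.isalpha() for ch in text) / n,
        sum(ch.isdigit() for ch in text) / n,
        sum(ch in string.printable for ch in text) / n,
        text.count("=") / n,
        compressibility,
        entropy,
    ]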

so, for example, if we wanted the model to predict the transformation for the text “the quick brown fox jumped over the lazy dog,” we would extract the features above and feed them to it. the result would look something like this: [44, 0, 1.0, 0.8181818181818182, 0.0, 0.0, 1.1590909090909092, 4.323067982273661]

then it was time to actually teach the model. for that, I used a library called scikit-learn, which provides simple and efficient tools for predictive data analysis. the code ended up looking something like this:

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

samples = generate_samples(corpus)

# turn every sample into its feature vector and keep the transformation name as the label
X = np.array([extract_features(s) for s, _ in samples])
y = np.array([label for _, label in samples])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

clf = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=3,
    learning_rate=0.1,
    random_state=RANDOM_STATE,
    verbose=1,
)

clf.fit(X_train, y_train)

# evaluate on the held-out split
print("accuracy:", clf.score(X_test, y_test))

joblib.dump(clf, MODEL_PATH)

running the code took a few minutes on a CPU, and it reported back 80% accuracy. wow, that’s nice! but how can we make it better?

the answer was in the features. after adding more linguistic features, for example the ratio of English stopwords and how similar the letter distribution of the string is to typical English text, and rerunning the training code, it reported back 99% accuracy.
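as a rough idea of what those two features could look like (a simplified sketch; the stopword list and frequency table here are shortened, and the real project’s implementation may differ):

import string

# a small sample of common English stopwords
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "over"}

# approximate relative letter frequencies in typical English text
ENGLISH_FREQ = {
    "e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070, "n": 0.067,
    "s": 0.063, "h": 0.061, "r": 0.060, "d": 0.043, "l": 0.040, "u": 0.028,
}


def stopword_ratio(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(string.punctuation) in ENGLISH_STOPWORDS for w in words) / len(words)


def letter_distribution_similarity(text: str) -> float:
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    # overlap between the observed letter distribution and typical English frequencies
    return sum(min(letters.count(ch) / len(letters), freq) for ch, freq in ENGLISH_FREQ.items())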

I later learned that the earlier accuracy ceiling was caused by the aforementioned rot13 encoding. using a confusion matrix, I could see the model confused plain text and rot13 too frequently, which lowered the overall accuracy, and I can’t really blame it:

plain text                                      rot13
hello world                                     uryyb jbeyq
the quick brown fox jumped over the lazy dog    gur dhvpx oebja sbk whzcrq bire gur ynml qbt

to us, this is clearly not typical English text… but to the model there is no real change in the features. For example, as we may recall, the result of extracting features from “the quick brown fox jumped over the lazy dog” looked something like this: [44, 0, 1.0, 0.8181818181818182, 0.0, 0.0, 1.1590909090909092, 4.323067982273661]. Meanwhile, “gur dhvpx oebja sbk whzcrq bire gur ynml qbt” looks something like this: [44, 0, 1.0, 0.8181818181818182, 0.0, 0.0, 1.1590909090909092, 4.323067982273661].

wait, isn’t that the same? yep, meaning the model cannot accurately tell the difference. however, after adding those linguistic features, the results looked quite different.

text:     the quick brown fox jumped over the lazy dog
features: [44, 0, 1.0, 0.8181818181818182, 0.0, 0.0, 1.1590909090909092, 4.323067982273661, 0.5921386687003095, 0.3333333333333333]

text:     gur dhvpx oebja sbk whzcrq bire gur ynml qbt
features: [44, 0, 1.0, 0.8181818181818182, 0.0, 0.0, 1.1590909090909092, 4.323067982273661, 0.06922206918672859, 0.0]

as you can see, the two new features are very different from the original plain text form, which increased the model’s ability to detect the differences and make more accurate predictions.

at this point, I was already quite happy and didn’t really introduce much more, other than adding a CLI so people could interface with the model from their terminal, adding more transformations to the dataset (namely, morse, rot47, md5, sha1, sha224, sha256, sha384, and sha512), and making the code neater.

but then I realized that because I had specifically used the English Wikipedia as the data source, the model could only accurately predict English text. this was fine at the time, but I thought there was no reason not to add more languages.

I decided to add data from the Russian, Hebrew, and Arabic Wikipedia, specifically because they all use distinct scripts with different characters. for example, it wouldn’t have made sense to add French, since it uses the same letters as English, as they both use the Latin script.

I then retrained the model and, after a few minutes, uh oh… accuracy dropped from 99% to less than 60%. that’s not good.

the reason for this was that the linguistic features were biased toward English: they relied on English stopwords and measured letter distribution against typical English text. I replaced these with less biased features, swapping the English stopword ratio for a non-ASCII character ratio and the letter distribution similarity for word density, and I also added another feature based on the Shannon entropy of bigrams, or two-character sequences. with these changes, accuracy recovered to somewhere around 80%.
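roughly how those replacement features could be computed (again, my own approximation rather than the project’s exact code):

import math


def non_ascii_ratio(text: str) -> float:
    # fraction of characters outside the ASCII range, which catches non-Latin scripts
    if not text:
        return 0.0
    return sum(ord(c) > 127 for c in text) / len(text)


def word_density(text: str) -> float:
    # number of whitespace-separated words relative to the text's length
    if not text:
        return 0.0
    return len(text.split()) / len(text)


def bigram_entropy(text: str) -> float:
    # Shannon entropy over two-character sequences instead of single characters
    bigrams = [text[i : i + 2] for i in range(len(text) - 1)]
    if not bigrams:
        return 0.0
    n = len(bigrams)
    counts = [bigrams.count(bg) for bg in set(bigrams)]
    return -sum((c / n) * math.log2(c / n) for c in counts)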

I also did some testing and realized that the printable ratio and the length modulo 4 features had negligible contributions, so I removed them. additionally, I decided to remove the rot13 and rot47 transformations, since with the changes to the features and scripts they were lowering accuracy too much. at this point, accuracy recovered to around 90%.

at this point, I realized that decision trees were no longer suitable for this project, so I began to pivot and, after doing some research, settled on 1D convolutional neural networks. convolutional neural networks use sliding filters, small learned weight matrices that move across the input, to detect local patterns at each position and combine them into higher-level features; they are best known for images, but the 1D variant applies the same idea to sequences such as text.

this also meant that I would no longer have to manually perform feature engineering to extract specific, predefined features, since the model would learn them on its own.

I began the rewrite, and it was time to pick a tokenizer. tokenizers split raw text into smaller units, or tokens, such as words or subwords, that models can then process numerically and that essentially form the model’s “vocabulary.”

I decided to pick a bigram tokenizer, since many transformations, such as base64, hex, and hashes, have strong local n-gram signatures. it looked something like this:

from collections import Counter

import torch


def extract_bigrams(text: str):
    # split the text into overlapping two-character sequences
    if len(text) < 2:
        return [text]

    return [text[i : i + 2] for i in range(len(text) - 1)]

def build_vocab(samples, min_freq=-1):
    counter = Counter()

    for s, _ in samples:
        counter.update(extract_bigrams(s))

    vocab = [bg for bg, c in counter.items() if c >= min_freq]

    stoi = {bg: i + 1 for i, bg in enumerate(sorted(vocab))}
    itos = {i: bg for bg, i in stoi.items()}

    return stoi, itos

def encode_bigrams(text: str, stoi: dict):
    # map each bigram to its vocabulary index; unseen bigrams fall back to 0
    bigrams = extract_bigrams(text)
    return torch.tensor([stoi.get(bg, 0) for bg in bigrams], dtype=torch.long)
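for example, building a vocabulary from a couple of made-up samples and encoding a string would look something like this:

samples = [("hello world", "plain"), ("aGVsbG8gd29ybGQ=", "base64")]
stoi, itos = build_vocab(samples)

print(encode_bigrams("hello", stoi))  # tensor of bigram ids; unseen bigrams map to 0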

the rewrite also meant that we would be dropping scikit-learn and instead using torch. after a bit of refactoring, I finally implemented the training loop, which looked something like this:

import torch
from torch import nn

model = model.to(DEVICE)

criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    model.train()

    total_loss = 0.0

    for X, y in dataloader:
        X, y = X.to(DEVICE), y.to(DEVICE)
        optimizer.zero_grad()
        logits = model(X)            # forward pass
        loss = criterion(logits, y)  # cross-entropy against the true labels
        loss.backward()              # backpropagate
        optimizer.step()
        total_loss += loss.item()

    print(f"epoch {epoch + 1}/{EPOCHS} - loss: {total_loss / len(dataloader):.4f}")

an epoch is one full pass in which the model sees every training example once. I started with 5 and later settled on 10.
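the model in the loop above is the 1D CNN itself. its exact architecture isn’t shown here, but a minimal version over the bigram token ids could look something like this (the embedding size, channel count, and kernel width are my own guesses rather than the project’s real values):

import torch
from torch import nn


class EncodingCNN(nn.Module):
    def __init__(self, vocab_size: int, num_classes: int, embed_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.embed(x)             # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)         # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))  # learn local bigram patterns
        x = self.pool(x).squeeze(-1)  # max over the sequence -> (batch, 128)
        return self.fc(x)             # one logit per transformation class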

during my testing, I also realized that there was some corpus bias, since Wikipedia text is much cleaner than real-world inputs. my solution was to add stopwords for every supported script to the dataset, which, based on my testing, reduced the bias quite a bit.

after this, no major changes were introduced, aside from adding the Greek script and a few optimization-related improvements, such as pregenerating the dataset and making the CLI lighter dependency-wise.

I also decided to give the model one predetermined feature, namely the true length of the text, since all of the text had to be padded to a fixed length before training:

def __getitem__(self, idx):
    text, label = self.samples[idx]

    tokens = encode_bigrams(text[:MAX_LEN], self.stoi)
    true_len = min(len(tokens), MAX_LEN) # addition

    # encode_bigrams already returns a long tensor, so it just needs truncating and padding
    x = tokens[:MAX_LEN]
    x = F.pad(x, (0, MAX_LEN - x.size(0)))
    y = torch.tensor(self.label2idx[label], dtype=torch.long)

    return x, torch.tensor(true_len, dtype=torch.float32), y # addition

I did this after realizing that the model wasn’t able to tell the different hashes apart, since they are all very similar hex digests whose only real difference is their length, which the padding hid. the model still has a hard time telling them apart, but this definitely improved its performance.
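to see why length is the main signal, the hex digests of these hash functions differ mostly in how long they are:

import hashlib

# hex digest lengths: md5=32, sha1=40, sha224=56, sha256=64, sha384=96, sha512=128
for name in ["md5", "sha1", "sha224", "sha256", "sha384", "sha512"]:
    digest = hashlib.new(name, b"hello").hexdigest()
    print(f"{name}: {len(digest)} characters")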

you can find the code here and even install and use it on your system using pipx with pipx install whatenc.

the CLI looks something like this:

[+] input: ZW5jb2RlIHRvIGJhc2U2NCBmb3JtYXQ=
   [~] top guess   = base64
      [=] base64   = 1.000
      [=] base85   = 0.000
      [=] plain    = 0.000

[+] input: hello
   [~] top guess   = plain
      [=] plain    = 1.000
      [=] md5      = 0.000
      [=] base64   = 0.000

[*] loading model
[+] input: האקדמיה ללשון העברית
   [~] top guess   = plain
      [=] plain    = 1.000
      [=] base64   = 0.000
      [=] base85   = 0.000

[*] loading model
[+] input: bfa99df33b137bc8fb5f5407d7e58da8
   [~] top guess   = md5
      [=] md5      = 0.999
      [=] sha1     = 0.001
      [=] sha224   = 0.000

I am currently quite happy with the state of this project. The last change I made, as of the time of publication in late December, was on November 27, with the first change being on October 25.

could it be better? definitely. I might revisit it soon. It was very fun working on this project, and I think I learned quite a bit about ML. I definitely want to work on more projects involving ML, and I have a fun idea related to malware and ML that I may explore soon.