Calculating Probabilities for Substitution Ciphers using Frequency Analysis












3












$begingroup$


I have been trying to put together a tool that can take in cipher text encrypted via a simple substitution cipher and calculate the most likely "key" (that is, how the plain text letters were mapped to the cipher text letters).



I think I've had some success with an approach I've been using, but I suspect it could be a lot better.



What I've been doing to this point:



For a given mystery letter (that is, that which appeared in the cipher text), I calculated that it actually decrypts to 'a', given the frequency that this mystery letter appears in the cipher text. I did that by using Bayes formula to render it as the likelihood that 'a' appears in the cipher text the number of times that the mystery letter does (knowing usual frequency of 'a' in English). In other words, if we denote $X$ as the mystery letter and $F_X$ denotes the frequency that $X$ appears, then:



$$P(X = a mid F_{X} ) = frac{ P(F_X mid X = a) P(X = a) }{ P(F_X) }$$



$P(F_X mid X = a)$ can be easily calculated by using the binomial distribution mass function knowing the general frequency that 'a' appears in the English language. Without any other information, I assume that $P(X = a)$ equals $1 over 26$. I neglected to divide by the bottom for reasons which I'll explain in a moment.



Then I did the same for every other letter; that is $P(X = b mid F_X), ldots, P(X = z mid F_X)$. I summed up all of what I had calculated for this mystery letter, and then divided each one by the sum, to complete the calculation for each possibility.



Then I did the same for every other mystery letter. So, what I ended up with was a 26-by-26 matrix of probabilities that a mystery letter corresponded to a plaintext letter. It then becomes an assignment problem of selecting 26 elements, exactly one from each column and one from each row and trying to maximize probability for a particular key.



What I don't like about this (beyond that I'm generally inexperienced and uneasy on this topic), is that I feel that I could definitely do better. Each probability calculated for a mystery letter only takes into account the frequency of that particular mystery letter, and not the frequency of other mystery letters. I've thought about trying to do this as a Bayesian Network, but I feel unsure if that is valid, and I'm having a hard time getting my mind around it.



Thanks for any help and insight!










share|cite|improve this question











$endgroup$

















    3












    $begingroup$


    I have been trying to put together a tool that can take in cipher text encrypted via a simple substitution cipher and calculate the most likely "key" (that is, how the plain text letters were mapped to the cipher text letters).



    I think I've had some success with an approach I've been using, but I suspect it could be a lot better.



    What I've been doing to this point:



    For a given mystery letter (that is, that which appeared in the cipher text), I calculated that it actually decrypts to 'a', given the frequency that this mystery letter appears in the cipher text. I did that by using Bayes formula to render it as the likelihood that 'a' appears in the cipher text the number of times that the mystery letter does (knowing usual frequency of 'a' in English). In other words, if we denote $X$ as the mystery letter and $F_X$ denotes the frequency that $X$ appears, then:



    $$P(X = a mid F_{X} ) = frac{ P(F_X mid X = a) P(X = a) }{ P(F_X) }$$



    $P(F_X mid X = a)$ can be easily calculated by using the binomial distribution mass function knowing the general frequency that 'a' appears in the English language. Without any other information, I assume that $P(X = a)$ equals $1 over 26$. I neglected to divide by the bottom for reasons which I'll explain in a moment.



    Then I did the same for every other letter; that is $P(X = b mid F_X), ldots, P(X = z mid F_X)$. I summed up all of what I had calculated for this mystery letter, and then divided each one by the sum, to complete the calculation for each possibility.



    Then I did the same for every other mystery letter. So, what I ended up with was a 26-by-26 matrix of probabilities that a mystery letter corresponded to a plaintext letter. It then becomes an assignment problem of selecting 26 elements, exactly one from each column and one from each row and trying to maximize probability for a particular key.



    What I don't like about this (beyond that I'm generally inexperienced and uneasy on this topic), is that I feel that I could definitely do better. Each probability calculated for a mystery letter only takes into account the frequency of that particular mystery letter, and not the frequency of other mystery letters. I've thought about trying to do this as a Bayesian Network, but I feel unsure if that is valid, and I'm having a hard time getting my mind around it.



    Thanks for any help and insight!










    share|cite|improve this question











    $endgroup$















      3












      3








      3





      $begingroup$


      I have been trying to put together a tool that can take in cipher text encrypted via a simple substitution cipher and calculate the most likely "key" (that is, how the plain text letters were mapped to the cipher text letters).



      I think I've had some success with an approach I've been using, but I suspect it could be a lot better.



      What I've been doing to this point:



      For a given mystery letter (that is, that which appeared in the cipher text), I calculated that it actually decrypts to 'a', given the frequency that this mystery letter appears in the cipher text. I did that by using Bayes formula to render it as the likelihood that 'a' appears in the cipher text the number of times that the mystery letter does (knowing usual frequency of 'a' in English). In other words, if we denote $X$ as the mystery letter and $F_X$ denotes the frequency that $X$ appears, then:



      $$P(X = a mid F_{X} ) = frac{ P(F_X mid X = a) P(X = a) }{ P(F_X) }$$



      $P(F_X mid X = a)$ can be easily calculated by using the binomial distribution mass function knowing the general frequency that 'a' appears in the English language. Without any other information, I assume that $P(X = a)$ equals $1 over 26$. I neglected to divide by the bottom for reasons which I'll explain in a moment.



      Then I did the same for every other letter; that is $P(X = b mid F_X), ldots, P(X = z mid F_X)$. I summed up all of what I had calculated for this mystery letter, and then divided each one by the sum, to complete the calculation for each possibility.



      Then I did the same for every other mystery letter. So, what I ended up with was a 26-by-26 matrix of probabilities that a mystery letter corresponded to a plaintext letter. It then becomes an assignment problem of selecting 26 elements, exactly one from each column and one from each row and trying to maximize probability for a particular key.



      What I don't like about this (beyond that I'm generally inexperienced and uneasy on this topic), is that I feel that I could definitely do better. Each probability calculated for a mystery letter only takes into account the frequency of that particular mystery letter, and not the frequency of other mystery letters. I've thought about trying to do this as a Bayesian Network, but I feel unsure if that is valid, and I'm having a hard time getting my mind around it.



      Thanks for any help and insight!










      share|cite|improve this question











      $endgroup$




      I have been trying to put together a tool that can take in cipher text encrypted via a simple substitution cipher and calculate the most likely "key" (that is, how the plain text letters were mapped to the cipher text letters).



      I think I've had some success with an approach I've been using, but I suspect it could be a lot better.



      What I've been doing to this point:



      For a given mystery letter (that is, that which appeared in the cipher text), I calculated that it actually decrypts to 'a', given the frequency that this mystery letter appears in the cipher text. I did that by using Bayes formula to render it as the likelihood that 'a' appears in the cipher text the number of times that the mystery letter does (knowing usual frequency of 'a' in English). In other words, if we denote $X$ as the mystery letter and $F_X$ denotes the frequency that $X$ appears, then:



      $$P(X = a mid F_{X} ) = frac{ P(F_X mid X = a) P(X = a) }{ P(F_X) }$$



      $P(F_X mid X = a)$ can be easily calculated by using the binomial distribution mass function knowing the general frequency that 'a' appears in the English language. Without any other information, I assume that $P(X = a)$ equals $1 over 26$. I neglected to divide by the bottom for reasons which I'll explain in a moment.



      Then I did the same for every other letter; that is $P(X = b mid F_X), ldots, P(X = z mid F_X)$. I summed up all of what I had calculated for this mystery letter, and then divided each one by the sum, to complete the calculation for each possibility.



      Then I did the same for every other mystery letter. So, what I ended up with was a 26-by-26 matrix of probabilities that a mystery letter corresponded to a plaintext letter. It then becomes an assignment problem of selecting 26 elements, exactly one from each column and one from each row and trying to maximize probability for a particular key.



      What I don't like about this (beyond that I'm generally inexperienced and uneasy on this topic), is that I feel that I could definitely do better. Each probability calculated for a mystery letter only takes into account the frequency of that particular mystery letter, and not the frequency of other mystery letters. I've thought about trying to do this as a Bayesian Network, but I feel unsure if that is valid, and I'm having a hard time getting my mind around it.



      Thanks for any help and insight!







      probability cryptography bayesian






      share|cite|improve this question















      share|cite|improve this question













      share|cite|improve this question




      share|cite|improve this question








      edited May 5 '13 at 16:56







      James Sydow

















      asked May 5 '13 at 6:44









      James SydowJames Sydow

      304




      304






















          1 Answer
          1






          active

          oldest

          votes


















          0












          $begingroup$

          You might be interested in the paper "The Markov Chain Monte Carlo Revolution" by Diaconis, which you can find online. In this paper, Diaconis shows how Stanford students were able to break a simple substitution cipher by using the Metropolis algorithm, using digraph frequencies. Diaconis says an attempt based on the frequencies of single letters failed.






          share|cite|improve this answer











          $endgroup$













            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "69"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            noCode: true, onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f381889%2fcalculating-probabilities-for-substitution-ciphers-using-frequency-analysis%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            0












            $begingroup$

            You might be interested in the paper "The Markov Chain Monte Carlo Revolution" by Diaconis, which you can find online. In this paper, Diaconis shows how Stanford students were able to break a simple substitution cipher by using the Metropolis algorithm, using digraph frequencies. Diaconis says an attempt based on the frequencies of single letters failed.






            share|cite|improve this answer











            $endgroup$


















              0












              $begingroup$

              You might be interested in the paper "The Markov Chain Monte Carlo Revolution" by Diaconis, which you can find online. In this paper, Diaconis shows how Stanford students were able to break a simple substitution cipher by using the Metropolis algorithm, using digraph frequencies. Diaconis says an attempt based on the frequencies of single letters failed.






              share|cite|improve this answer











              $endgroup$
















                0












                0








                0





                $begingroup$

                You might be interested in the paper "The Markov Chain Monte Carlo Revolution" by Diaconis, which you can find online. In this paper, Diaconis shows how Stanford students were able to break a simple substitution cipher by using the Metropolis algorithm, using digraph frequencies. Diaconis says an attempt based on the frequencies of single letters failed.






                share|cite|improve this answer











                $endgroup$



                You might be interested in the paper "The Markov Chain Monte Carlo Revolution" by Diaconis, which you can find online. In this paper, Diaconis shows how Stanford students were able to break a simple substitution cipher by using the Metropolis algorithm, using digraph frequencies. Diaconis says an attempt based on the frequencies of single letters failed.







                share|cite|improve this answer














                share|cite|improve this answer



                share|cite|improve this answer








                edited Nov 2 '16 at 12:24









                Andreas Caranti

                56.4k34395




                56.4k34395










                answered May 5 '13 at 14:02







                user75108





































                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Mathematics Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f381889%2fcalculating-probabilities-for-substitution-ciphers-using-frequency-analysis%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Quarter-circle Tiles

                    build a pushdown automaton that recognizes the reverse language of a given pushdown automaton?

                    Mont Emei