Estimating Parameter - What is the qualitative difference between MLE fitting and Least Squares CDF fitting?












Given a parametric pdf $f(x;\lambda)$ and a set of data $\{ x_k \}_{k=1}^n$, here are two ways of formulating the problem of selecting an optimal parameter vector $\lambda^*$ to fit to the data. The first is maximum likelihood estimation (MLE):



$$\lambda^* = \arg\max_\lambda \prod_{k=1}^n f(x_k;\lambda)$$



where this product is called the likelihood function.



The second is least squares CDF fitting:



$$\lambda^* = \arg\min_\lambda \| E(x)-F(x;\lambda) \|_{L^2(dx)}$$



where $F(x;\lambda)$ is the CDF corresponding to $f(x;\lambda)$ and $E(x)$ is the empirical CDF: $E(x)=\frac{1}{n} \sum_{k=1}^n 1_{x_k \leq x}$. (One could also consider more general $L^p$ CDF fitting, but let's not go there for now.)



In the experiments I have done, these two methods give similar but still significantly different results. For example, in a bimodal normal mixture fit, one gave one of the standard deviations as about $12.6$ while the other gave it as about $11.6$. This isn't a huge difference, but it is large enough to see easily in a graph.



What is the intuition for the difference in these two "goodness of fit" metrics? An example answer would be something along the lines of "MLE cares more about data points in the tail of the distribution than least squares CDF fit" (I make no claims on the validity of this statement). An answer discussing other metrics of fitting parametric distributions to data would also be of some use.
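
For concreteness, a minimal sketch of the two formulations might look like the following. This is purely illustrative and not the mixture fit described above: it uses a one-parameter exponential model, `scipy` optimizers, and a grid approximation of the $L^2(dx)$ norm, all of which are incidental choices.

```python
# Minimal sketch: MLE vs. least-squares CDF fitting for an exponential
# model f(x; lambda) = lambda * exp(-lambda * x). Purely illustrative.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)  # true lambda = 1 / scale = 0.5

# MLE: maximize the likelihood, i.e. minimize the negative log-likelihood.
def neg_log_likelihood(lam):
    lam = lam[0]
    if lam <= 0:
        return np.inf
    return -np.sum(stats.expon.logpdf(x, scale=1.0 / lam))

lam_mle = optimize.minimize(neg_log_likelihood, x0=[1.0],
                            method="Nelder-Mead").x[0]

# Least-squares CDF fit: minimize || E(x) - F(x; lambda) ||_{L^2(dx)},
# with the integral approximated by a Riemann sum on a fine grid.
grid = np.linspace(0.0, x.max(), 2000)
dx = grid[1] - grid[0]
ecdf = np.searchsorted(np.sort(x), grid, side="right") / len(x)

def cdf_l2_error(lam):
    lam = lam[0]
    if lam <= 0:
        return np.inf
    return np.sum((ecdf - stats.expon.cdf(grid, scale=1.0 / lam)) ** 2) * dx

lam_ls = optimize.minimize(cdf_l2_error, x0=[1.0],
                           method="Nelder-Mead").x[0]

print(f"MLE estimate:     lambda = {lam_mle:.4f}")
print(f"LS-CDF estimate:  lambda = {lam_ls:.4f}")
```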










statistics numerical-methods least-squares parameter-estimation maximum-likelihood






edited Aug 28 '17 at 14:01 by Royi
asked Oct 2 '16 at 12:47 by Ian












  • I have had this thought for a long time as well. What I can tell you is that ML maximizes the Fisher information, which guarantees some properties that I don't think the other method can.
    – Royi, Aug 28 '17 at 13:59


















1 Answer

In my eyes, the intuitive explanation is that ML estimates the conditional mode (the maximum of the distribution), while least squares estimates the conditional mean. In the case where the errors are perfectly Gaussian distributed, these two estimates are equal.

[Figure: distribution of errors]
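
As a small numerical check of the last claim (my own sketch, not part of the original answer, and using "least squares" in the ordinary residual sense rather than the CDF sense of the question): for i.i.d. Gaussian data with known variance, the maximum-likelihood estimate of the location parameter coincides with the least-squares estimate, i.e. the sample mean.

```python
# Sketch: for i.i.d. Gaussian data with known variance, the MLE of the
# location parameter equals the least-squares (sample mean) estimate.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=1.5, size=200)

# MLE: minimize the negative Gaussian log-likelihood over the location mu.
neg_ll = lambda mu: -np.sum(stats.norm.logpdf(y, loc=mu[0], scale=1.5))
mu_mle = optimize.minimize(neg_ll, x0=[0.0], method="Nelder-Mead").x[0]

# Least squares: minimize the sum of squared residuals; the closed-form
# minimizer is the sample mean.
mu_ls = y.mean()

print(f"MLE:           {mu_mle:.6f}")
print(f"Least squares: {mu_ls:.6f}")  # agrees up to optimizer tolerance
```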






answered Dec 15 '18 at 12:36 by Rafael





























