Values in Softmax Derivative
I am trying to correctly understand the derivative of the softmax function so that I can implement it correctly. I already know that the derived formula looks like this:

$\frac{\partial p_i}{\partial a_j} = p_i (1 - p_j)$ if $i = j$,

and $-p_j p_i$ otherwise.

What I don't get is: what exactly are $i$ and $j$, and how do I get them from my input vector?

Could anyone please explain this?
linear-algebra derivatives neural-networks
asked Jan 1 at 15:26 by Yama994
1 Answer
The Softmax function maps an $n$-dimensional ($n \ge 2$) vector of reals, $\mathbf{z}$,
$$z_i \in \mathbb{R}, \quad i = 1 \dots n,$$
to another $n$-dimensional real vector $\mathbf{p}$ with all components between $0$ and $1$,
$$0 \le p_i \le 1, \quad i = 1 \dots n,$$
and the sum of the components equal to $1$,
$$\sum_{k=1}^n p_k = 1.$$
The Softmax function itself is defined component-wise,
$$p_i = \frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}}, \quad i = 1 \dots n.$$
Its derivative turns out to be simple. The partial derivative of the $i$'th component of $\mathbf{p}$ with respect to the $j$'th component of $\mathbf{z}$ is
$$\frac{\partial p_i}{\partial z_j} = \begin{cases}
p_i (1 - p_i), & i = j \\
-p_i p_j, & i \ne j
\end{cases}$$

In other words, $i$ and $j$ index the components of the vectors. In a numerical implementation they are simply indices into the arrays holding $\mathbf{p}$ and $\mathbf{z}$.
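For a concrete example (my own numbers, rounded to three decimals): with the input $\mathbf{z} = (1, 2, 3)$, Softmax gives $\mathbf{p} \approx (0.090, 0.245, 0.665)$, and
$$\frac{\partial p_2}{\partial z_2} = p_2 (1 - p_2) \approx 0.185, \qquad \frac{\partial p_2}{\partial z_3} = -p_2\, p_3 \approx -0.163.$$
Here $i = 2$ picks the output component being differentiated, and $j$ picks the input component it is differentiated with respect to.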



It may be useful to look at the Jacobian matrix of the Softmax function at $\mathbf{z}$. Each row corresponds to a component of $\mathbf{p}$, and each column corresponds to the component of $\mathbf{z}$ with respect to which the partial derivative is taken. In other words,
$$\mathbf{J}_{i j} = \frac{\partial p_i}{\partial z_j}, \quad i, j = 1 \dots n,$$
$$\mathbf{J} = \left[ \begin{matrix}
p_1 (1 - p_1) & -p_1 p_2 & \dots & -p_1 p_{n-1} & -p_1 p_n \\
-p_2 p_1 & p_2 (1 - p_2) & \dots & -p_2 p_{n-1} & -p_2 p_n \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
-p_{n-1} p_1 & -p_{n-1} p_2 & \dots & p_{n-1} (1 - p_{n-1}) & -p_{n-1} p_n \\
-p_n p_1 & -p_n p_2 & \dots & -p_n p_{n-1} & p_n (1 - p_n) \\
\end{matrix} \right]$$

Because the $p_i$ are real numbers, $-p_i p_j = -p_j p_i$, so $\mathbf{J}$ is symmetric,
$$\mathbf{J}_{i j} = \mathbf{J}_{j i}, \quad i, j = 1 \dots n.$$





Programmers are often more comfortable with pseudocode examples:



Function Softmax(z, n):
    Let p be an array of n reals
    Let d = 0.0

    # Calculate exponents; unscaled components
    For i = 1 to n:
        p[i] = Exp(z[i])
        d = d + p[i]
    End For

    # Normalize components
    For i = 1 to n:
        p[i] = p[i] / d
    End For

    Return p
End Function
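If you happen to implement this in Python with NumPy (my assumption here; the question does not name a language), the same function is only a few lines. A minimal sketch:

import numpy as np

def softmax(z):
    # Exponentiate each component, then normalize so the result sums to 1.
    # Subtracting max(z) first is a standard guard against overflow in exp();
    # the shift cancels in the ratio, so the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))   # approx. [0.090, 0.245, 0.665]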


We can calculate the entire Jacobian matrix of the Softmax function at $\mathbf{z}$ using



Function Jacobian_matrix_of_Softmax(z, n):
    Let p = Softmax(z, n)
    Let J = n by n matrix of reals
    For i = 1 to n:
        J[i][i] = p[i] * (1.0 - p[i])
        For k = 1 to i-1:
            J[k][i] = -p[i]*p[k]
            J[i][k] = J[k][i]
        End For
    End For
    Discard p
    Return J
End Function


or, using $\mathbf{p} = \text{Softmax}(\mathbf{z})$ if that is already calculated,



Function Jacobian_matrix_of_Softmaxed(p, n):
    Let J = n by n matrix of reals
    For i = 1 to n:
        J[i][i] = p[i] * (1.0 - p[i])
        For k = 1 to i-1:
            J[k][i] = -p[i]*p[k]
            J[i][k] = J[k][i]
        End For
    End For
    Return J
End Function


Individual partial derivatives are calculated using



Function Partial_of_Softmaxed(p, i, j):
    If i == j:
        Return p[i] * (1.0 - p[i])
    Else:
        Return -p[i] * p[j]
    End If
End Function


but note that the Jacobian functions above exploit the matrix symmetry: they compute each diagonal value once in the outer loop, and each off-diagonal value (to the left of, and directly above, each diagonal value) once in the inner loop, filling in its mirror entry directly. That avoids computing every off-diagonal product twice, so it is noticeably more efficient than calling Partial_of_Softmaxed() for every pair $(i, j)$.
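As a further sketch in Python with NumPy (again an assumption, not a requirement), the whole Jacobian can also be built in one vectorized expression, because the formula above is exactly $\mathbf{J} = \operatorname{diag}(\mathbf{p}) - \mathbf{p}\,\mathbf{p}^T$:

import numpy as np

def softmax_jacobian(p):
    # p is the already-computed Softmax output, a 1-D array that sums to 1.
    # np.diag(p) puts p_i on the diagonal; np.outer(p, p) has entries p_i * p_j,
    # so the difference is p_i*(1 - p_i) on the diagonal and -p_i*p_j elsewhere.
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)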



Whether calculating the full Jacobian matrix is useful or not depends on what you need the partial derivatives for. If you only need the diagonal elements $\partial p_i / \partial z_i$ (often loosely called the gradient of $\mathbf{p}$, $\nabla \mathbf{p}$), use



Function Gradient_of_Softmaxed(p, n):
    Let g be an array of n reals
    For i = 1 to n:
        g[i] = p[i] * (1.0 - p[i])
    End For
    Return g
End Function

Function Gradient_of_Softmax(z, n):
    Let g = Softmax(z, n)
    For i = 1 to n:
        g[i] = g[i] * (1.0 - g[i])
    End For
    Return g
End Function


instead of Jacobian_matrix_of_Softmaxed(). Note how the latter version, if you don't need $\text{Softmax}(\mathbf{z})$ itself, can reuse its storage for the gradient.
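In the NumPy sketches above (still assuming Python), this diagonal needs no loop at all:

g = p * (1.0 - p)   # element-wise, g[i] = p[i] * (1 - p[i])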
answered Jan 1 at 18:24 by Nominal Animal