Vectorizing the product of an array of complex numbers











up vote
10
down vote

favorite
1












I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:



#include <complex.h>
complex float f(complex float x, int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}


However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:



typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x, int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}


Can this code be improved for AVX and is it as fast as it can get?










share|improve this question




















  • 2




    have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
    – ratchet freak
    Jan 16 '17 at 11:56






  • 1




    @ratchetfreak Like this godbolt.org/g/ndqZyN ?
    – Lembik
    Jan 16 '17 at 11:57






  • 1




    As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
    – Jacob Manaker
    Jan 16 '17 at 19:52















up vote
10
down vote

favorite
1












I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:



#include <complex.h>
complex float f(complex float x, int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}


However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:



typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x, int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}


Can this code be improved for AVX and is it as fast as it can get?










share|improve this question




















  • 2




    have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
    – ratchet freak
    Jan 16 '17 at 11:56






  • 1




    @ratchetfreak Like this godbolt.org/g/ndqZyN ?
    – Lembik
    Jan 16 '17 at 11:57






  • 1




    As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
    – Jacob Manaker
    Jan 16 '17 at 19:52













up vote
10
down vote

favorite
1









up vote
10
down vote

favorite
1






1





I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:



#include <complex.h>
complex float f(complex float x, int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}


However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:



typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x, int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}


Can this code be improved for AVX and is it as fast as it can get?










share|improve this question















I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:



#include <complex.h>
complex float f(complex float x, int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}


However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:



typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x, int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}


Can this code be improved for AVX and is it as fast as it can get?







performance c vectorization gcc complex-numbers






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited yesterday









200_success

127k15149412




127k15149412










asked Jan 16 '17 at 11:39









Lembik

1515




1515








  • 2




    have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
    – ratchet freak
    Jan 16 '17 at 11:56






  • 1




    @ratchetfreak Like this godbolt.org/g/ndqZyN ?
    – Lembik
    Jan 16 '17 at 11:57






  • 1




    As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
    – Jacob Manaker
    Jan 16 '17 at 19:52














  • 2




    have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
    – ratchet freak
    Jan 16 '17 at 11:56






  • 1




    @ratchetfreak Like this godbolt.org/g/ndqZyN ?
    – Lembik
    Jan 16 '17 at 11:57






  • 1




    As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
    – Jacob Manaker
    Jan 16 '17 at 19:52








2




2




have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56




have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56




1




1




@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57




@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57




1




1




As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52




As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52










1 Answer
1






active

oldest

votes

















up vote
2
down vote













I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).



How about this?



typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;

static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}

complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}


Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.



I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.






share|improve this answer



















  • 1




    Do you know what the assembly looks like from this?
    – Lembik
    Dec 22 '17 at 9:36










  • Thanks for reminding me, I had forgotten to check. Looks good.
    – Snowbody
    Dec 22 '17 at 15:18











Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f152759%2fvectorizing-the-product-of-an-array-of-complex-numbers%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
2
down vote













I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).



How about this?



typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;

static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}

complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}


Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.



I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.






share|improve this answer



















  • 1




    Do you know what the assembly looks like from this?
    – Lembik
    Dec 22 '17 at 9:36










  • Thanks for reminding me, I had forgotten to check. Looks good.
    – Snowbody
    Dec 22 '17 at 15:18















up vote
2
down vote













I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).



How about this?



typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;

static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}

complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}


Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.



I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.






share|improve this answer



















  • 1




    Do you know what the assembly looks like from this?
    – Lembik
    Dec 22 '17 at 9:36










  • Thanks for reminding me, I had forgotten to check. Looks good.
    – Snowbody
    Dec 22 '17 at 15:18













up vote
2
down vote










up vote
2
down vote









I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).



How about this?



typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;

static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}

complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}


Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.



I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.






share|improve this answer














I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).



How about this?



typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;

static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}

complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}


Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.



I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.







share|improve this answer














share|improve this answer



share|improve this answer








edited Dec 22 '17 at 15:18

























answered Dec 21 '17 at 23:19









Snowbody

7,7671344




7,7671344








  • 1




    Do you know what the assembly looks like from this?
    – Lembik
    Dec 22 '17 at 9:36










  • Thanks for reminding me, I had forgotten to check. Looks good.
    – Snowbody
    Dec 22 '17 at 15:18














  • 1




    Do you know what the assembly looks like from this?
    – Lembik
    Dec 22 '17 at 9:36










  • Thanks for reminding me, I had forgotten to check. Looks good.
    – Snowbody
    Dec 22 '17 at 15:18








1




1




Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36




Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36












Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18




Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18


















draft saved

draft discarded




















































Thanks for contributing an answer to Code Review Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f152759%2fvectorizing-the-product-of-an-array-of-complex-numbers%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Mont Emei

Province de Neuquén

Journaliste