Vectorizing the product of an array of complex numbers

up vote
10
down vote

favorite

I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:

#include <complex.h>

complex float f(complex float x, int n ) {

  complex float p = 1.0;

  for (int i = 0; i < n; i++)

    p *= x[i];

  return p;

}

However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:

typedef float v4sf __attribute__ ((vector_size (16)));

typedef union {

  v4sf v;

  float e[4];

} float4;

typedef struct {

  float4 x;

  float4 y;

} complex4;

static complex4 complex4_mul(complex4 a, complex4 b) {

  return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};

}

complex4 f4(complex4 x, int n) {

  v4sf one = {1,1,1,1};

  complex4 p = {one,one};

  for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);

  return p;

}

Can this code be improved for AVX and is it as fast as it can get?

edited yesterday

200_success

127k15149412

asked Jan 16 '17 at 11:39

Lembik

1515

2

have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56

1

@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57

1

As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52

add a comment |

up vote
10
down vote

favorite

I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:

#include <complex.h>

complex float f(complex float x, int n ) {

  complex float p = 1.0;

  for (int i = 0; i < n; i++)

    p *= x[i];

  return p;

}

However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:

typedef float v4sf __attribute__ ((vector_size (16)));

typedef union {

  v4sf v;

  float e[4];

} float4;

typedef struct {

  float4 x;

  float4 y;

} complex4;

static complex4 complex4_mul(complex4 a, complex4 b) {

  return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};

}

complex4 f4(complex4 x, int n) {

  v4sf one = {1,1,1,1};

  complex4 p = {one,one};

  for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);

  return p;

}

Can this code be improved for AVX and is it as fast as it can get?

edited yesterday

200_success

127k15149412

asked Jan 16 '17 at 11:39

Lembik

1515

2

have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56

1

@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57

1

As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52

add a comment |

up vote
10
down vote

favorite

I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:

#include <complex.h>

complex float f(complex float x, int n ) {

  complex float p = 1.0;

  for (int i = 0; i < n; i++)

    p *= x[i];

  return p;

}

However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:

typedef float v4sf __attribute__ ((vector_size (16)));

typedef union {

  v4sf v;

  float e[4];

} float4;

typedef struct {

  float4 x;

  float4 y;

} complex4;

static complex4 complex4_mul(complex4 a, complex4 b) {

  return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};

}

complex4 f4(complex4 x, int n) {

  v4sf one = {1,1,1,1};

  complex4 p = {one,one};

  for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);

  return p;

}

Can this code be improved for AVX and is it as fast as it can get?

edited yesterday

200_success

127k15149412

asked Jan 16 '17 at 11:39

Lembik

1515

I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:

#include <complex.h>

complex float f(complex float x, int n ) {

  complex float p = 1.0;

  for (int i = 0; i < n; i++)

    p *= x[i];

  return p;

}

However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:

typedef float v4sf __attribute__ ((vector_size (16)));

typedef union {

  v4sf v;

  float e[4];

} float4;

typedef struct {

  float4 x;

  float4 y;

} complex4;

static complex4 complex4_mul(complex4 a, complex4 b) {

  return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};

}

complex4 f4(complex4 x, int n) {

  v4sf one = {1,1,1,1};

  complex4 p = {one,one};

  for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);

  return p;

}

Can this code be improved for AVX and is it as fast as it can get?

performance c vectorization gcc complex-numbers

edited yesterday

200_success

127k15149412

asked Jan 16 '17 at 11:39

Lembik

1515

edited yesterday

200_success

127k15149412

asked Jan 16 '17 at 11:39

Lembik

1515

edited yesterday

200_success

127k15149412

edited yesterday

200_success

127k15149412

edited yesterday

200_success

127k15149412

asked Jan 16 '17 at 11:39

Lembik

1515

asked Jan 16 '17 at 11:39

Lembik

1515

asked Jan 16 '17 at 11:39

Lembik

1515

2

have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56

1

@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57

1

As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52

add a comment |

2

have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56

1

@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57

1

As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52

have you tried using 4 ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56

@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57

As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52

add a comment |

1 Answer
1

active

oldest

votes

up vote
2
down vote

I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).

How about this?

typedef float v4sf __attribute__ ((vector_size (16)));

typedef struct {

  v4sf x;

  v4sf y;

} complex4;



static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_r -a_i*b_i;

}

static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_i + a_i*b_r;

}



complex4 f4(v4sf x_r, v4sf x_i, int n) {

  v4sf one = {1,1,1,1};

  v4sf p_r = one;

  v4sf p_i = one;

  v4sf p_r_temp;

  for (int i = 0; i < n; i++)

  {

     p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);

     p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);

     p_r = p_r_temp;

  }

  return (complex4){p_r, p_i};

}

Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.

I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.

edited Dec 22 '17 at 15:18

answered Dec 21 '17 at 23:19

Snowbody

7,7671344

1

Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36

Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f152759%2fvectorizing-the-product-of-an-array-of-complex-numbers%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
2
down vote

I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).

How about this?

typedef float v4sf __attribute__ ((vector_size (16)));

typedef struct {

  v4sf x;

  v4sf y;

} complex4;



static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_r -a_i*b_i;

}

static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_i + a_i*b_r;

}



complex4 f4(v4sf x_r, v4sf x_i, int n) {

  v4sf one = {1,1,1,1};

  v4sf p_r = one;

  v4sf p_i = one;

  v4sf p_r_temp;

  for (int i = 0; i < n; i++)

  {

     p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);

     p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);

     p_r = p_r_temp;

  }

  return (complex4){p_r, p_i};

}

Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.

I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.

edited Dec 22 '17 at 15:18

answered Dec 21 '17 at 23:19

Snowbody

7,7671344

1

Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36

Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18

add a comment |

up vote
2
down vote

I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).

How about this?

typedef float v4sf __attribute__ ((vector_size (16)));

typedef struct {

  v4sf x;

  v4sf y;

} complex4;



static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_r -a_i*b_i;

}

static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_i + a_i*b_r;

}



complex4 f4(v4sf x_r, v4sf x_i, int n) {

  v4sf one = {1,1,1,1};

  v4sf p_r = one;

  v4sf p_i = one;

  v4sf p_r_temp;

  for (int i = 0; i < n; i++)

  {

     p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);

     p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);

     p_r = p_r_temp;

  }

  return (complex4){p_r, p_i};

}

Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.

I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.

edited Dec 22 '17 at 15:18

answered Dec 21 '17 at 23:19

Snowbody

7,7671344

1

Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36

Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18

add a comment |

up vote
2
down vote

I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).

How about this?

typedef float v4sf __attribute__ ((vector_size (16)));

typedef struct {

  v4sf x;

  v4sf y;

} complex4;



static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_r -a_i*b_i;

}

static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_i + a_i*b_r;

}



complex4 f4(v4sf x_r, v4sf x_i, int n) {

  v4sf one = {1,1,1,1};

  v4sf p_r = one;

  v4sf p_i = one;

  v4sf p_r_temp;

  for (int i = 0; i < n; i++)

  {

     p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);

     p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);

     p_r = p_r_temp;

  }

  return (complex4){p_r, p_i};

}

Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.

I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.

edited Dec 22 '17 at 15:18

answered Dec 21 '17 at 23:19

Snowbody

7,7671344

I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).

How about this?

typedef float v4sf __attribute__ ((vector_size (16)));

typedef struct {

  v4sf x;

  v4sf y;

} complex4;



static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_r -a_i*b_i;

}

static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {

  return a_r*b_i + a_i*b_r;

}



complex4 f4(v4sf x_r, v4sf x_i, int n) {

  v4sf one = {1,1,1,1};

  v4sf p_r = one;

  v4sf p_i = one;

  v4sf p_r_temp;

  for (int i = 0; i < n; i++)

  {

     p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);

     p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);

     p_r = p_r_temp;

  }

  return (complex4){p_r, p_i};

}

Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.

I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.

edited Dec 22 '17 at 15:18

answered Dec 21 '17 at 23:19

Snowbody

7,7671344

edited Dec 22 '17 at 15:18

answered Dec 21 '17 at 23:19

Snowbody

7,7671344

answered Dec 21 '17 at 23:19

Snowbody

7,7671344

answered Dec 21 '17 at 23:19

Snowbody

7,7671344

1

Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36

Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18

add a comment |

1

Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36

Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18

Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36

Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Krdytkyu