Vectorizing the product of an array of complex numbers
up vote
10
down vote
favorite
I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:
#include <complex.h>
complex float f(complex float x, int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}
However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x, int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}
Can this code be improved for AVX and is it as fast as it can get?
performance c vectorization gcc complex-numbers
add a comment |
up vote
10
down vote
favorite
I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:
#include <complex.h>
complex float f(complex float x, int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}
However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x, int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}
Can this code be improved for AVX and is it as fast as it can get?
performance c vectorization gcc complex-numbers
2
have you tried using 4ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of theps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56
1
@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57
1
As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52
add a comment |
up vote
10
down vote
favorite
up vote
10
down vote
favorite
I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:
#include <complex.h>
complex float f(complex float x, int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}
However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x, int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}
Can this code be improved for AVX and is it as fast as it can get?
performance c vectorization gcc complex-numbers
I am trying to write fast/optimal code to vectorize the product of an array of complex numbers. In simple C this would be:
#include <complex.h>
complex float f(complex float x, int n ) {
complex float p = 1.0;
for (int i = 0; i < n; i++)
p *= x[i];
return p;
}
However, gcc can't vectorize this and my target CPU is the AMD FX-8350 which supports AVX. To speed it up I have tried:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef union {
v4sf v;
float e[4];
} float4;
typedef struct {
float4 x;
float4 y;
} complex4;
static complex4 complex4_mul(complex4 a, complex4 b) {
return (complex4){a.x.v*b.x.v -a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}
complex4 f4(complex4 x, int n) {
v4sf one = {1,1,1,1};
complex4 p = {one,one};
for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
return p;
}
Can this code be improved for AVX and is it as fast as it can get?
performance c vectorization gcc complex-numbers
performance c vectorization gcc complex-numbers
edited yesterday
200_success
127k15149412
127k15149412
asked Jan 16 '17 at 11:39
Lembik
1515
1515
2
have you tried using 4ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of theps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56
1
@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57
1
As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52
add a comment |
2
have you tried using 4ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of theps? and seeing if gcc can vectorize that?
– ratchet freak
Jan 16 '17 at 11:56
1
@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57
1
As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52
2
2
have you tried using 4
ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?– ratchet freak
Jan 16 '17 at 11:56
have you tried using 4
ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of the ps? and seeing if gcc can vectorize that?– ratchet freak
Jan 16 '17 at 11:56
1
1
@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57
@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57
1
1
As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52
As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52
add a comment |
1 Answer
1
active
oldest
votes
up vote
2
down vote
I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).
How about this?
typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;
static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}
complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}
Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.
I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.
1
Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36
Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).
How about this?
typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;
static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}
complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}
Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.
I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.
1
Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36
Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18
add a comment |
up vote
2
down vote
I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).
How about this?
typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;
static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}
complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}
Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.
I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.
1
Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36
Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18
add a comment |
up vote
2
down vote
up vote
2
down vote
I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).
How about this?
typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;
static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}
complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}
Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.
I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.
I was wondering if the compiler would more easily be able to vectorize it without the union-within-struct and the potential type punning of the e member (which doesn't seem to be used).
How about this?
typedef float v4sf __attribute__ ((vector_size (16)));
typedef struct {
v4sf x;
v4sf y;
} complex4;
static inline v4sf complex4_mul_r(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_r -a_i*b_i;
}
static inline v4sf complex4_mul_i(v4sf a_r, v4sf a_i, v4sf b_r, v4sf b_i) {
return a_r*b_i + a_i*b_r;
}
complex4 f4(v4sf x_r, v4sf x_i, int n) {
v4sf one = {1,1,1,1};
v4sf p_r = one;
v4sf p_i = one;
v4sf p_r_temp;
for (int i = 0; i < n; i++)
{
p_r_temp = complex4_mul_r(p_r, p_i, x_r[i], x_i[i]);
p_i = complex4_mul_i(p_r, p_i, x_r[i], x_i[i]);
p_r = p_r_temp;
}
return (complex4){p_r, p_i};
}
Looking at the assembly on https://godbolt.org/ it seems it gets fully vectorized. I can't get the godbolt sharing links to work.
I'm thinking about whether it might be possible to stick both the real and imaginary parts in the same vector, and use __builtin_shuffle() to permute them as necessary. Can't quite work it out.
edited Dec 22 '17 at 15:18
answered Dec 21 '17 at 23:19
Snowbody
7,7671344
7,7671344
1
Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36
Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18
add a comment |
1
Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36
Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18
1
1
Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36
Do you know what the assembly looks like from this?
– Lembik
Dec 22 '17 at 9:36
Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18
Thanks for reminding me, I had forgotten to check. Looks good.
– Snowbody
Dec 22 '17 at 15:18
add a comment |
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f152759%2fvectorizing-the-product-of-an-array-of-complex-numbers%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
have you tried using 4
ps (aka manual partial unroll and splitting of accumulators) and doing a post multiply of theps? and seeing if gcc can vectorize that?– ratchet freak
Jan 16 '17 at 11:56
1
@ratchetfreak Like this godbolt.org/g/ndqZyN ?
– Lembik
Jan 16 '17 at 11:57
1
As a guess, it seems like part of the issue is the complexity of complex multiplication. I don't know where you are getting your complex vectors from, but if you can store them in polar form without losing too much performance on that end, the multiplication will be much simpler here.
– Jacob Manaker
Jan 16 '17 at 19:52