Why is 2 * x * x faster than 2 * ( x * x ) in Python 3.x, for integers?

up vote
34
down vote

favorite

The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:

import time

start_time = time.time()

num = 0

for x in range(0, 10000000):

    # num += 2 * (x * x)

    num += 2 * x * x

print("--- %s seconds ---" % (time.time() - start_time))

if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?

On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?

I ran each version of the program 10 times, here are the results.

   2 * x * x        |   2 * (x * x)

---------------------------------------

1.7717654705047607  | 2.0789272785186768

1.735931396484375   | 2.1166207790374756

1.7093875408172607  | 2.024367570877075

1.7004504203796387  | 2.047525405883789

1.6676218509674072  | 2.254328966140747

1.699510097503662   | 2.0949244499206543

1.6889283657073975  | 2.0841963291168213

1.7243537902832031  | 2.1290600299835205

1.712965488433838   | 2.1942825317382812

1.7622807025909424  | 2.1200053691864014

edited Dec 2 at 1:11

smci

14.6k672104

asked Dec 1 at 12:40

Waqas Gondal

216313

14

Little hint: Use the timeit module for better statistics
– user8408080
Dec 1 at 12:43

For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
– smci
Dec 2 at 1:16

Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
– smci
Dec 2 at 1:31

add a comment |

up vote
34
down vote

favorite

The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:

import time

start_time = time.time()

num = 0

for x in range(0, 10000000):

    # num += 2 * (x * x)

    num += 2 * x * x

print("--- %s seconds ---" % (time.time() - start_time))

if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?

On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?

I ran each version of the program 10 times, here are the results.

   2 * x * x        |   2 * (x * x)

---------------------------------------

1.7717654705047607  | 2.0789272785186768

1.735931396484375   | 2.1166207790374756

1.7093875408172607  | 2.024367570877075

1.7004504203796387  | 2.047525405883789

1.6676218509674072  | 2.254328966140747

1.699510097503662   | 2.0949244499206543

1.6889283657073975  | 2.0841963291168213

1.7243537902832031  | 2.1290600299835205

1.712965488433838   | 2.1942825317382812

1.7622807025909424  | 2.1200053691864014

edited Dec 2 at 1:11

smci

14.6k672104

asked Dec 1 at 12:40

Waqas Gondal

216313

14

Little hint: Use the timeit module for better statistics
– user8408080
Dec 1 at 12:43

For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
– smci
Dec 2 at 1:16

Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
– smci
Dec 2 at 1:31

add a comment |

up vote
34
down vote

favorite

The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:

import time

start_time = time.time()

num = 0

for x in range(0, 10000000):

    # num += 2 * (x * x)

    num += 2 * x * x

print("--- %s seconds ---" % (time.time() - start_time))

if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?

On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?

I ran each version of the program 10 times, here are the results.

   2 * x * x        |   2 * (x * x)

---------------------------------------

1.7717654705047607  | 2.0789272785186768

1.735931396484375   | 2.1166207790374756

1.7093875408172607  | 2.024367570877075

1.7004504203796387  | 2.047525405883789

1.6676218509674072  | 2.254328966140747

1.699510097503662   | 2.0949244499206543

1.6889283657073975  | 2.0841963291168213

1.7243537902832031  | 2.1290600299835205

1.712965488433838   | 2.1942825317382812

1.7622807025909424  | 2.1200053691864014

edited Dec 2 at 1:11

smci

14.6k672104

asked Dec 1 at 12:40

Waqas Gondal

216313

The following Python 3.x integer multiplication takes on average between 1.66s and 1.77s:

import time

start_time = time.time()

num = 0

for x in range(0, 10000000):

    # num += 2 * (x * x)

    num += 2 * x * x

print("--- %s seconds ---" % (time.time() - start_time))

if I replace 2 * x * x with 2 *(x * x), it takes between 2.04 and 2.25. How come?

On the other hand it is the opposite in Java: 2 * (x * x) is faster in Java. Java test link: Why is 2 * (i * i) faster than 2 * i * i in Java?

I ran each version of the program 10 times, here are the results.

   2 * x * x        |   2 * (x * x)

---------------------------------------

1.7717654705047607  | 2.0789272785186768

1.735931396484375   | 2.1166207790374756

1.7093875408172607  | 2.024367570877075

1.7004504203796387  | 2.047525405883789

1.6676218509674072  | 2.254328966140747

1.699510097503662   | 2.0949244499206543

1.6889283657073975  | 2.0841963291168213

1.7243537902832031  | 2.1290600299835205

1.712965488433838   | 2.1942825317382812

1.7622807025909424  | 2.1200053691864014

python python-3.x performance benchmarking integer-arithmetic

edited Dec 2 at 1:11

smci

14.6k672104

asked Dec 1 at 12:40

Waqas Gondal

216313

edited Dec 2 at 1:11

smci

14.6k672104

asked Dec 1 at 12:40

Waqas Gondal

216313

edited Dec 2 at 1:11

smci

14.6k672104

edited Dec 2 at 1:11

smci

14.6k672104

edited Dec 2 at 1:11

smci

14.6k672104

asked Dec 1 at 12:40

Waqas Gondal

216313

asked Dec 1 at 12:40

Waqas Gondal

216313

asked Dec 1 at 12:40

Waqas Gondal

216313

14

Little hint: Use the timeit module for better statistics
– user8408080
Dec 1 at 12:43

For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
– smci
Dec 2 at 1:16

Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
– smci
Dec 2 at 1:31

add a comment |

14

Little hint: Use the timeit module for better statistics
– user8408080
Dec 1 at 12:43

For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
– smci
Dec 2 at 1:16

Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
– smci
Dec 2 at 1:31

Little hint: Use the timeit module for better statistics
– user8408080
Dec 1 at 12:43

For good measure you might as well throw in 2 * pow(x,2) and 2 * x**2 as well. Also, please redo your timings using timeit, it's much more accurate than time.time() for short processes.
– smci
Dec 2 at 1:16

Related near-duplicate, although not very clearly-stated: Times-two faster than bit-shift, for Python 3.x integers?
– smci
Dec 2 at 1:31

add a comment |

4 Answers
4

active

oldest

votes

up vote
26
down vote

accepted

First of all, note that we don't see the same thing in Python 2.x:

>>> timeit("for i in range(1000): 2*i*i")

51.00784397125244

>>> timeit("for i in range(1000): 2*(i*i)")

50.48330092430115

So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.

For small enough integers (including the ones we're considering here), CPython actually just uses the O(MN) grade-school digit by digit multiplication algorithm (for larger integers it switches to the Karatsuba algorithm). You can see this yourself in the source.

The number of digits in x*x is roughly twice that of 2*x or x (since log(x²) = 2 log(x)). Note that a "digit" in this context is not a base-10 digit, but a 30-bit value (which are treated as single digits in CPython's implementation). Hence, 2 is a single-digit value, and x and 2*x are single-digit values for all iterations of the loop, but x*x is two-digit for x >= 2**15. Hence, for x >= 2**15, 2*x*x only requires single-by-single-digit multiplications whereas 2*(x*x) requires a single-by-single and a single-by-double-digit multiplication (since x*x has 2 30-bit digits).

Here's a direct way to see this (Python 3):

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

5.796971936999967

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

4.3559221399999615

Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

3.0912468433380127

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

3.1120400428771973

(One interesting note: If you look at the source, you'll see that the algorithm actually has a special case for squaring numbers (which we're doing here), but even still this is not enough to overcome the fact that 2*(x*x) just requires processing more digits.)

edited Dec 6 at 19:59

user2357112

149k12156244

answered Dec 1 at 13:58

arshajii

100k18180250

Is Karatsuba algorithm slower than digit multiplication algorithm?
– Banghua Zhao
Dec 1 at 14:11

1

@BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
– arshajii
Dec 1 at 14:13

2

source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
– B. M.
Dec 1 at 17:35

2

In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
– Peter Cordes
Dec 1 at 18:15

2

The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
– user2357112
Dec 1 at 19:30

|
show 4 more comments

up vote
6
down vote

Python intern representation of integers is special, it uses slots of 30 bits :

In [6]: sys.getsizeof(2**30-1)

Out[6]: 28 # one slot + heading



In [7]: sys.getsizeof(2**30)

Out[7]: 32 # two slots

So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.

For a human who want to calculate 2*4*4, two ways :

(2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.

Python have the same problem. If x is a number such than 2x < B < x² , let x² = aB+b , with a,b <B. x² is stored in 2 slots, which I note (a|b). Computations leads to (without managing carries here):

   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)

   (2*x)*x =>  (2x)*x =>(2a|2b)

in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.

edited Dec 1 at 17:54

answered Dec 1 at 13:17

B. M.

12.7k11934

3

Where do a, b and X come from, what do they represent and how do they relate to the value of x?
– Bergi
Dec 1 at 15:55

@Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
– user2357112
Dec 1 at 19:36

add a comment |

up vote
3
down vote

If your benchmark is right (didn't check), it may come from the fact that Python integers may be two different things : native integers when they are small (with a quick computation), and big integers when they increase in size (slower computation). The first syntax keeps the size smaller after the first operation while the second syntax may lead to two operations involving big integers.

edited Dec 1 at 13:01

answered Dec 1 at 12:58

Thomas Baruchel

4,56421636

11

Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
– John Coleman
Dec 1 at 13:00

python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
– Patrick Artner
Dec 1 at 13:12

4

I still measure a 5% difference using 2 as a value for x.
– Vincent
Dec 1 at 13:13

2

The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
– Ely
Dec 1 at 13:47

add a comment |

up vote
2
down vote

From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:

Relevant part of 2 * x * x:

7          28 LOAD_FAST                1 (num)

           30 LOAD_CONST               3 (2)

           32 LOAD_FAST                2 (x)

           34 BINARY_MULTIPLY

           36 LOAD_FAST                2 (x)

           38 BINARY_MULTIPLY

           40 INPLACE_ADD

           42 STORE_FAST               1 (num)

           44 JUMP_ABSOLUTE           24

Relevant part of 2 * (x * x):

  7          28 LOAD_FAST                1 (num)

             30 LOAD_CONST               3 (2)

             32 LOAD_FAST                2 (x)

             34 LOAD_FAST                2 (x)

             36 BINARY_MULTIPLY                 <=== 1st multiply x*x in a temp value

             38 BINARY_MULTIPLY                 <=== then multiply result with 2

             40 INPLACE_ADD

             42 STORE_FAST               1 (num)

             44 JUMP_ABSOLUTE           24

answered Dec 1 at 14:10

Eric Fortin

6,61721728

If this was the case we'd see the same effect in Python 2, but we don't.
– arshajii
Dec 1 at 14:22

1

it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
– Sam Mason
Dec 1 at 14:42

@arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
– Eric Fortin
Dec 1 at 14:43

@SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
– Eric Fortin
Dec 1 at 14:52

CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
– Sam Mason
Dec 1 at 14:59

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53570864%2fwhy-is-2-x-x-faster-than-2-x-x-in-python-3-x-for-integers%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

up vote
26
down vote

accepted

First of all, note that we don't see the same thing in Python 2.x:

>>> timeit("for i in range(1000): 2*i*i")

51.00784397125244

>>> timeit("for i in range(1000): 2*(i*i)")

50.48330092430115

So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.

Here's a direct way to see this (Python 3):

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

5.796971936999967

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

4.3559221399999615

Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

3.0912468433380127

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

3.1120400428771973

edited Dec 6 at 19:59

user2357112

149k12156244

answered Dec 1 at 13:58

arshajii

100k18180250

Is Karatsuba algorithm slower than digit multiplication algorithm?
– Banghua Zhao
Dec 1 at 14:11

1

@BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
– arshajii
Dec 1 at 14:13

2

source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
– B. M.
Dec 1 at 17:35

2

In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
– Peter Cordes
Dec 1 at 18:15

2

The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
– user2357112
Dec 1 at 19:30

|
show 4 more comments

up vote
26
down vote

accepted

First of all, note that we don't see the same thing in Python 2.x:

>>> timeit("for i in range(1000): 2*i*i")

51.00784397125244

>>> timeit("for i in range(1000): 2*(i*i)")

50.48330092430115

So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.

Here's a direct way to see this (Python 3):

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

5.796971936999967

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

4.3559221399999615

Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

3.0912468433380127

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

3.1120400428771973

edited Dec 6 at 19:59

user2357112

149k12156244

answered Dec 1 at 13:58

arshajii

100k18180250

Is Karatsuba algorithm slower than digit multiplication algorithm?
– Banghua Zhao
Dec 1 at 14:11

1

@BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
– arshajii
Dec 1 at 14:13

2

source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
– B. M.
Dec 1 at 17:35

2

In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
– Peter Cordes
Dec 1 at 18:15

2

The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
– user2357112
Dec 1 at 19:30

|
show 4 more comments

up vote
26
down vote

accepted

First of all, note that we don't see the same thing in Python 2.x:

>>> timeit("for i in range(1000): 2*i*i")

51.00784397125244

>>> timeit("for i in range(1000): 2*(i*i)")

50.48330092430115

So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.

Here's a direct way to see this (Python 3):

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

5.796971936999967

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

4.3559221399999615

Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

3.0912468433380127

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

3.1120400428771973

edited Dec 6 at 19:59

user2357112

149k12156244

answered Dec 1 at 13:58

arshajii

100k18180250

First of all, note that we don't see the same thing in Python 2.x:

>>> timeit("for i in range(1000): 2*i*i")

51.00784397125244

>>> timeit("for i in range(1000): 2*(i*i)")

50.48330092430115

So this leads us to believe that this is due to how integers changed in Python 3: specifically, Python 3 uses long (arbitrarily large integers) everywhere.

Here's a direct way to see this (Python 3):

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

5.796971936999967

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

4.3559221399999615

Again, compare this to Python 2, which doesn't use arbitrary-length integers everywhere:

>>> timeit("a*b", "a,b = 2, 123456**2", number=100000000)

3.0912468433380127

>>> timeit("a*b", "a,b = 2*123456, 123456", number=100000000)

3.1120400428771973

edited Dec 6 at 19:59

user2357112

149k12156244

answered Dec 1 at 13:58

arshajii

100k18180250

edited Dec 6 at 19:59

user2357112

149k12156244

edited Dec 6 at 19:59

user2357112

149k12156244

edited Dec 6 at 19:59

user2357112

149k12156244

answered Dec 1 at 13:58

arshajii

100k18180250

answered Dec 1 at 13:58

arshajii

100k18180250

answered Dec 1 at 13:58

arshajii

100k18180250

Is Karatsuba algorithm slower than digit multiplication algorithm?
– Banghua Zhao
Dec 1 at 14:11

1

@BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
– arshajii
Dec 1 at 14:13

2

source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
– B. M.
Dec 1 at 17:35

2

In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
– Peter Cordes
Dec 1 at 18:15

2

The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
– user2357112
Dec 1 at 19:30

|
show 4 more comments

Is Karatsuba algorithm slower than digit multiplication algorithm?
– Banghua Zhao
Dec 1 at 14:11

1

@BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
– arshajii
Dec 1 at 14:13

2

source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
– B. M.
Dec 1 at 17:35

2

In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
– Peter Cordes
Dec 1 at 18:15

2

The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
– user2357112
Dec 1 at 19:30

Is Karatsuba algorithm slower than digit multiplication algorithm?
– Banghua Zhao
Dec 1 at 14:11

@BanghuaZhao It has a better runtime complexity (wrt to number of digits), but the number of digits has to be large enough for it to actually be worth it.
– arshajii
Dec 1 at 14:13

source : #define KARATSUBA_CUTOFF 70 . So the Karatsuba algorithm is only used if ints have about 600 decimal digits. it is not the problem here.
– B. M.
Dec 1 at 17:35

In Python, a "digit" is a 30 or 60-bit chunk, right? So base 2^30, not a decimal digit. Also critically important: Java int wraps at 2^32, while Python does arbitrary precision. Plus Java JIT-compiles to (inefficient) native code using 32-bit registers, while CPython interprets all the way. So those are massive differences.
– Peter Cordes
Dec 1 at 18:15

The numbers involved in the question are small enough that Karatsuba is not involved, and beyond that, they are small enough that asymptotic considerations are completely irrelevant. All operands and results involved fit in two 30-bit internal "digits".
– user2357112
Dec 1 at 19:30

|
show 4 more comments

up vote
6
down vote

Python intern representation of integers is special, it uses slots of 30 bits :

In [6]: sys.getsizeof(2**30-1)

Out[6]: 28 # one slot + heading



In [7]: sys.getsizeof(2**30)

Out[7]: 32 # two slots

So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.

For a human who want to calculate 2*4*4, two ways :

(2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.

   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)

   (2*x)*x =>  (2x)*x =>(2a|2b)

in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.

edited Dec 1 at 17:54

answered Dec 1 at 13:17

B. M.

12.7k11934

3

Where do a, b and X come from, what do they represent and how do they relate to the value of x?
– Bergi
Dec 1 at 15:55

@Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
– user2357112
Dec 1 at 19:36

add a comment |

up vote
6
down vote

Python intern representation of integers is special, it uses slots of 30 bits :

In [6]: sys.getsizeof(2**30-1)

Out[6]: 28 # one slot + heading



In [7]: sys.getsizeof(2**30)

Out[7]: 32 # two slots

So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.

For a human who want to calculate 2*4*4, two ways :

(2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.

   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)

   (2*x)*x =>  (2x)*x =>(2a|2b)

in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.

edited Dec 1 at 17:54

answered Dec 1 at 13:17

B. M.

12.7k11934

3

Where do a, b and X come from, what do they represent and how do they relate to the value of x?
– Bergi
Dec 1 at 15:55

@Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
– user2357112
Dec 1 at 19:36

add a comment |

up vote
6
down vote

Python intern representation of integers is special, it uses slots of 30 bits :

In [6]: sys.getsizeof(2**30-1)

Out[6]: 28 # one slot + heading



In [7]: sys.getsizeof(2**30)

Out[7]: 32 # two slots

So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.

For a human who want to calculate 2*4*4, two ways :

(2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.

   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)

   (2*x)*x =>  (2x)*x =>(2a|2b)

in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.

edited Dec 1 at 17:54

answered Dec 1 at 13:17

B. M.

12.7k11934

Python intern representation of integers is special, it uses slots of 30 bits :

In [6]: sys.getsizeof(2**30-1)

Out[6]: 28 # one slot + heading



In [7]: sys.getsizeof(2**30)

Out[7]: 32 # two slots

So everything happens as if Python counts in base B = 2**30 = 1 073 741 824 ~1 billion.

For a human who want to calculate 2*4*4, two ways :

(2*4)*4 = 8*4 =32 = 30 + 2 is immediate if you knows your add tables.

2*(4*4) = 2*16 = 2*10 + 2*6 = (2*10+10) + 2 = 30 + 2 since we have to put the operation down.

   (x*x)*2 =>  (a|b)*2 => (2*a|2*b)

   (2*x)*x =>  (2x)*x =>(2a|2b)

in the first case the 2* operation is done two times, against only one in the first case. That explains the difference.

edited Dec 1 at 17:54

answered Dec 1 at 13:17

B. M.

12.7k11934

edited Dec 1 at 17:54

answered Dec 1 at 13:17

B. M.

12.7k11934

answered Dec 1 at 13:17

B. M.

12.7k11934

answered Dec 1 at 13:17

B. M.

12.7k11934

3

Where do a, b and X come from, what do they represent and how do they relate to the value of x?
– Bergi
Dec 1 at 15:55

@Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
– user2357112
Dec 1 at 19:36

add a comment |

3

Where do a, b and X come from, what do they represent and how do they relate to the value of x?
– Bergi
Dec 1 at 15:55

@Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
– user2357112
Dec 1 at 19:36

Where do a, b and X come from, what do they represent and how do they relate to the value of x?
– Bergi
Dec 1 at 15:55

@Bergi: a and b are the high and low 30-bit chunks of x*x. (Capital X no longer appears in the current revision of the answer.)
– user2357112
Dec 1 at 19:36

add a comment |

up vote
3
down vote

edited Dec 1 at 13:01

answered Dec 1 at 12:58

Thomas Baruchel

4,56421636

11

Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
– John Coleman
Dec 1 at 13:00

python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
– Patrick Artner
Dec 1 at 13:12

4

I still measure a 5% difference using 2 as a value for x.
– Vincent
Dec 1 at 13:13

2

The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
– Ely
Dec 1 at 13:47

add a comment |

up vote
3
down vote

edited Dec 1 at 13:01

answered Dec 1 at 12:58

Thomas Baruchel

4,56421636

11

Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
– John Coleman
Dec 1 at 13:00

python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
– Patrick Artner
Dec 1 at 13:12

4

I still measure a 5% difference using 2 as a value for x.
– Vincent
Dec 1 at 13:13

2

The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
– Ely
Dec 1 at 13:47

add a comment |

up vote
3
down vote

edited Dec 1 at 13:01

answered Dec 1 at 12:58

Thomas Baruchel

4,56421636

edited Dec 1 at 13:01

answered Dec 1 at 12:58

Thomas Baruchel

4,56421636

edited Dec 1 at 13:01

answered Dec 1 at 12:58

Thomas Baruchel

4,56421636

answered Dec 1 at 12:58

Thomas Baruchel

4,56421636

answered Dec 1 at 12:58

Thomas Baruchel

4,56421636

11

Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
– John Coleman
Dec 1 at 13:00

python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
– Patrick Artner
Dec 1 at 13:12

4

I still measure a 5% difference using 2 as a value for x.
– Vincent
Dec 1 at 13:13

2

The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
– Ely
Dec 1 at 13:47

add a comment |

11

Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
– John Coleman
Dec 1 at 13:00

python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
– Patrick Artner
Dec 1 at 13:12

4

I still measure a 5% difference using 2 as a value for x.
– Vincent
Dec 1 at 13:13

2

The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
– Ely
Dec 1 at 13:47

Seems more like a guess (albeit a good guess) than an answer. Using timeit with different size x might give the needed evidence to push it from a guess to an answer.
– John Coleman
Dec 1 at 13:00

python has "cached" integers for -5 to 256 Dok - this would be easily veryfied if both formulas are closer to the same times if only small integers are in play? The max value goes to 2e+14 which is small enough to fit a 64-bit signed int so processor int calculation limitations is probably out of the game - it is too big for 32-bit unsigned int, so on 32-bit this might factor in?
– Patrick Artner
Dec 1 at 13:12

I still measure a 5% difference using 2 as a value for x.
– Vincent
Dec 1 at 13:13

The answer might seem more like a guess, though in my eyes it is helpful because I just learned something more about Python.
– Ely
Dec 1 at 13:47

add a comment |

up vote
2
down vote

From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:

Relevant part of 2 * x * x:

7          28 LOAD_FAST                1 (num)

           30 LOAD_CONST               3 (2)

           32 LOAD_FAST                2 (x)

           34 BINARY_MULTIPLY

           36 LOAD_FAST                2 (x)

           38 BINARY_MULTIPLY

           40 INPLACE_ADD

           42 STORE_FAST               1 (num)

           44 JUMP_ABSOLUTE           24

Relevant part of 2 * (x * x):

  7          28 LOAD_FAST                1 (num)

             30 LOAD_CONST               3 (2)

             32 LOAD_FAST                2 (x)

             34 LOAD_FAST                2 (x)

             36 BINARY_MULTIPLY                 <=== 1st multiply x*x in a temp value

             38 BINARY_MULTIPLY                 <=== then multiply result with 2

             40 INPLACE_ADD

             42 STORE_FAST               1 (num)

             44 JUMP_ABSOLUTE           24

answered Dec 1 at 14:10

Eric Fortin

6,61721728

If this was the case we'd see the same effect in Python 2, but we don't.
– arshajii
Dec 1 at 14:22

1

it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
– Sam Mason
Dec 1 at 14:42

@arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
– Eric Fortin
Dec 1 at 14:43

@SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
– Eric Fortin
Dec 1 at 14:52

CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
– Sam Mason
Dec 1 at 14:59

add a comment |

up vote
2
down vote

From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:

Relevant part of 2 * x * x:

7          28 LOAD_FAST                1 (num)

           30 LOAD_CONST               3 (2)

           32 LOAD_FAST                2 (x)

           34 BINARY_MULTIPLY

           36 LOAD_FAST                2 (x)

           38 BINARY_MULTIPLY

           40 INPLACE_ADD

           42 STORE_FAST               1 (num)

           44 JUMP_ABSOLUTE           24

Relevant part of 2 * (x * x):

  7          28 LOAD_FAST                1 (num)

             30 LOAD_CONST               3 (2)

             32 LOAD_FAST                2 (x)

             34 LOAD_FAST                2 (x)

             36 BINARY_MULTIPLY                 <=== 1st multiply x*x in a temp value

             38 BINARY_MULTIPLY                 <=== then multiply result with 2

             40 INPLACE_ADD

             42 STORE_FAST               1 (num)

             44 JUMP_ABSOLUTE           24

answered Dec 1 at 14:10

Eric Fortin

6,61721728

If this was the case we'd see the same effect in Python 2, but we don't.
– arshajii
Dec 1 at 14:22

1

it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
– Sam Mason
Dec 1 at 14:42

@arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
– Eric Fortin
Dec 1 at 14:43

@SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
– Eric Fortin
Dec 1 at 14:52

CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
– Sam Mason
Dec 1 at 14:59

add a comment |

up vote
2
down vote

From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:

Relevant part of 2 * x * x:

7          28 LOAD_FAST                1 (num)

           30 LOAD_CONST               3 (2)

           32 LOAD_FAST                2 (x)

           34 BINARY_MULTIPLY

           36 LOAD_FAST                2 (x)

           38 BINARY_MULTIPLY

           40 INPLACE_ADD

           42 STORE_FAST               1 (num)

           44 JUMP_ABSOLUTE           24

Relevant part of 2 * (x * x):

  7          28 LOAD_FAST                1 (num)

             30 LOAD_CONST               3 (2)

             32 LOAD_FAST                2 (x)

             34 LOAD_FAST                2 (x)

             36 BINARY_MULTIPLY                 <=== 1st multiply x*x in a temp value

             38 BINARY_MULTIPLY                 <=== then multiply result with 2

             40 INPLACE_ADD

             42 STORE_FAST               1 (num)

             44 JUMP_ABSOLUTE           24

answered Dec 1 at 14:10

Eric Fortin

6,61721728

From what I can tell, it comes down to a little bit more memory access in the version using 2 * (x * x). I printed the disassembled bytecode and it seems to prove that:

Relevant part of 2 * x * x:

7          28 LOAD_FAST                1 (num)

           30 LOAD_CONST               3 (2)

           32 LOAD_FAST                2 (x)

           34 BINARY_MULTIPLY

           36 LOAD_FAST                2 (x)

           38 BINARY_MULTIPLY

           40 INPLACE_ADD

           42 STORE_FAST               1 (num)

           44 JUMP_ABSOLUTE           24

Relevant part of 2 * (x * x):

  7          28 LOAD_FAST                1 (num)

             30 LOAD_CONST               3 (2)

             32 LOAD_FAST                2 (x)

             34 LOAD_FAST                2 (x)

             36 BINARY_MULTIPLY                 <=== 1st multiply x*x in a temp value

             38 BINARY_MULTIPLY                 <=== then multiply result with 2

             40 INPLACE_ADD

             42 STORE_FAST               1 (num)

             44 JUMP_ABSOLUTE           24

answered Dec 1 at 14:10

Eric Fortin

6,61721728

answered Dec 1 at 14:10

Eric Fortin

6,61721728

answered Dec 1 at 14:10

Eric Fortin

6,61721728

answered Dec 1 at 14:10

Eric Fortin

6,61721728

If this was the case we'd see the same effect in Python 2, but we don't.
– arshajii
Dec 1 at 14:22

1

it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
– Sam Mason
Dec 1 at 14:42

@arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
– Eric Fortin
Dec 1 at 14:43

@SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
– Eric Fortin
Dec 1 at 14:52

CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
– Sam Mason
Dec 1 at 14:59

add a comment |

If this was the case we'd see the same effect in Python 2, but we don't.
– arshajii
Dec 1 at 14:22

1

it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
– Sam Mason
Dec 1 at 14:42

@arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
– Eric Fortin
Dec 1 at 14:43

@SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
– Eric Fortin
Dec 1 at 14:52

CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
– Sam Mason
Dec 1 at 14:59

If this was the case we'd see the same effect in Python 2, but we don't.
– arshajii
Dec 1 at 14:22

it's just moved a LOAD_FAST one instruction later. how is this "more memory access"?
– Sam Mason
Dec 1 at 14:42

@arshajii You are right, I checked the disassembled code for Python 2 and it shows the same thing. Although I still believe it might have an impact but is negligible compared to the one you noted.
– Eric Fortin
Dec 1 at 14:43

@SamMason In the second case, it needs to write back the result of x*x in a temp value because it is needed for the next multiply whereas in the first case, it always writes in the same register. So the second case is a read after write. It is however a really small penalty but in a hot loop, it can have an impact.
– Eric Fortin
Dec 1 at 14:52

CPython is a stack machine, it always leaves/writes results of computations to the top of the stack… there might be some cache related impact, but it's going to be incredibly small for code this close
– Sam Mason
Dec 1 at 14:59

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

SNn7B,ozeLsfM,4CX

搜尋此網誌

Krdytkyu