Schönhage–Strassen algorithm


The Schönhage–Strassen algorithm is an asymptotically fast multiplication algorithm for large integers. It was developed by Arnold Schönhage and Volker Strassen in 1971. The run-time bit complexity is, in Big O notation, for two n-digit numbers. The algorithm uses recursive Fast Fourier transforms in rings with 2n+1 elements, a specific type of number theoretic transform.
The Schönhage–Strassen algorithm was the asymptotically fastest multiplication method known from 1971 until 2007, when a new method, Fürer's algorithm, was announced with lower asymptotic complexity; however, Fürer's algorithm currently only achieves an advantage for astronomically large values and is not used in practice.
In practice the Schönhage–Strassen algorithm starts to outperform older methods such as Karatsuba and Toom–Cook multiplication for numbers beyond 2215 to 2217. The GNU Multi-Precision Library uses it for values of at least 1728 to 7808 64-bit words, depending on architecture. There is a Java implementation of Schönhage–Strassen which uses it above 74,000 decimal digits.
Applications of the Schönhage–Strassen algorithm include mathematical empiricism, such as the Great Internet Mersenne Prime Search and computing approximations of π, as well as practical applications such as Kronecker substitution, in which multiplication of polynomials with integer coefficients can be efficiently reduced to large integer multiplication; this is used in practice by GMP-ECM for Lenstra elliptic curve factorization.

Details

This section explains in detail how Schönhage–Strassen is implemented. It is based primarily on an overview of the method by Crandall and Pomerance in their Prime Numbers: A Computational Perspective. This variant differs somewhat from Schönhage's original method in that it exploits the discrete weighted transform to perform negacyclic convolutions more efficiently. Another source for detailed information is Knuth's The Art of Computer Programming. The section begins by discussing the building blocks and concludes with a step-by-step description of the algorithm itself.

Convolutions

Suppose we are multiplying two numbers like 123 and 456 using long multiplication with base B digits, but without performing any carrying. The result might look something like this:
This sequence is called the acyclic or linear convolution of the two original sequences and. Once we have the acyclic convolution of two sequences, computing the product of the original numbers is easy: we just perform the carrying. In the example this yields the correct product 56088.
There are two other types of convolutions that will be useful. Suppose the input sequences have n elements. Then the acyclic convolution has n+n−1 elements; if we take the rightmost n elements and add the leftmost n−1 elements, this produces the cyclic convolution:
If we perform carrying on the cyclic convolution, the result is equivalent to the product of the inputs mod Bn − 1. In the example, 103 − 1 = 999, performing carrying on yields 3141, and 3141 ≡ 56088.
Conversely, if we take the rightmost n elements and subtract the leftmost n−1 elements, this produces the negacyclic convolution:
If we perform carrying on the negacyclic convolution, the result is equivalent to the product of the inputs mod Bn + 1. In the example, 103 + 1 = 1001, performing carrying on yields 3035, and 3035 ≡ 56088. The negacyclic convolution can contain negative numbers, which can be eliminated during carrying using borrowing, as is done in long subtraction.

Convolution theorem

Like other multiplication methods based on the Fast Fourier transform, Schönhage–Strassen depends fundamentally on the convolution theorem, which provides an efficient way to compute the cyclic convolution of two sequences. It states that:
Or in symbols:
If we compute the DFT and IDFT using a fast Fourier transform algorithm, and invoke our multiplication algorithm recursively to multiply the entries of the transformed vectors DFT and DFT, this yields an efficient algorithm for computing the cyclic convolution.
In this algorithm, it will be more useful to compute the negacyclic convolution; as it turns out, a slightly modified version of the convolution theorem can enable this as well. Suppose the vectors X and Y have length n, and a is a primitive root of unity of order 2n. Then we can define a third vector A, called the weight vector, as:
Now, we can state:
In other words, it's the same as before except that the inputs are first multiplied by A, and the result is multiplied by A−1.

Choice of ring

The discrete Fourier transform is an abstract operation that can be performed in any algebraic ring; typically it's performed in the complex numbers, but actually performing complex arithmetic to sufficient precision to ensure accurate results for multiplication is slow and error-prone. Instead, we will use the approach of the number theoretic transform, which is to perform the transform in the integers mod N for some integer N.
Just like there are primitive roots of unity of every order in the complex plane, given any order n we can choose a suitable N such that b is a primitive root of unity of order n in the integers mod N.
The algorithm will spend most of its time performing recursive multiplications of smaller numbers; with a naive algorithm, these occur in a number of places:
  1. Inside the fast Fourier transform algorithm, where the primitive root of unity b is repeatedly powered, squared, and multiplied by other values.
  2. When taking powers of the primitive root of unity a to form the weight vector A and when multiplying A or A−1 by other vectors.
  3. When performing element-by-element multiplication of the transformed vectors.
The key insight to Schönhage–Strassen is to choose N, the modulus, to be equal to 2n + 1 for some integer n that is a multiple of the number of pieces P=2p being transformed. This has a number of benefits in standard systems that represent large integers in binary form:
To make the recursive multiplications convenient, we will frame Schönhage–Strassen as being a specialized multiplication algorithm for computing not just the product of two numbers, but the product of two numbers mod 2n + 1 for some given n. This is not a loss of generality, since one can always choose n large enough so that the product mod 2n + 1 is simply the product.

[|Shift optimizations]

In the course of the algorithm, there are many cases in which multiplication or division by a power of two can be profitably replaced by a small number of shifts and adds. This makes use of the observation that:
Note that a k-digit number in base 2n written in positional notation can be expressed as. It represents the number. Also note that for each di we have 0≤di < 2n.
This makes it simple to reduce a number represented in binary mod 2n + 1: take the rightmost n bits, subtract the next n bits, add the next n bits, and so on until the bits are exhausted. If the resulting value is still not between 0 and 2n, normalize it by adding or subtracting a multiple of the modulus 2n + 1. For example, if n=3 and the number being reduced is 656, we have:
Moreover, it's possible to effect very large shifts without ever constructing the shifted result. Suppose we have a number A between 0 and 2n, and wish to multiply it by 2k. Dividing k by n we find k = qn + r with r < n. It follows that:
Since A is ≤ 2n and r < n, A shift-left r has at most 2n−1 bits, and so only one shift and subtraction is needed.
Finally, to divide by 2k, observe that squaring the first equivalence above yields:
Hence,

Algorithm

The algorithm follows a split, evaluate, pointwise multiply, interpolate, and combine phases similar to Karatsuba and Toom-Cook methods.
Given input numbers x and y, and an integer N, the following algorithm computes the product xy mod 2N + 1. Provided N is sufficiently large this is simply the product.
  1. Split each input number into vectors X and Y of 2k parts each, where 2k divides N..
  2. In order to make progress, it's necessary to use a smaller N for recursive multiplications. For this purpose choose n as the smallest integer at least 2N/2k + k and divisible by 2k.
  3. Compute the product of X and Y mod 2n + 1 using the negacyclic convolution:
  4. # Multiply X and Y each by the weight vector A using shifts.
  5. # Compute the DFT of X and Y using the number-theoretic FFT.
  6. # Recursively apply this algorithm to multiply corresponding elements of the transformed X and Y.
  7. # Compute the IDFT of the resulting vector to get the result vector C. This corresponds to interpolation phase.
  8. # Multiply the result vector C by A−1 using shifts.
  9. # Adjust signs: some elements of the result may be negative. We compute the largest possible positive value for the jth element of C,, and if it exceeds this we subtract the modulus 2n + 1.
  10. Finally, perform carrying mod 2N + 1 to get the final result.
The optimal number of pieces to divide the input into is proportional to, where N is the number of input bits, and this setting achieves the running time of O, so the parameter k should be set accordingly. In practice, it is set empirically based on the input sizes and the architecture, typically to a value between 4 and 16.
In step 2, the observation is used that:
This section explains a number of important practical optimizations that have been considered when implementing Schönhage–Strassen in real systems. It is based primarily on a 2007 work by Gaudry, Kruppa, and Zimmermann describing enhancements to the GNU Multi-Precision Library.
Below a certain cutoff point, it's more efficient to perform the recursive multiplications using other algorithms, such as Toom–Cook multiplication. The results must be reduced mod 2n + 1, which can be done efficiently as explained above in Shift optimizations with shifts and adds/subtracts.
Computing the IDFT involves dividing each entry by the primitive root of unity 22n/2k, an operation that is frequently combined with multiplying the vector by A−1 afterwards, since both involve division by a power of two.
In a system where a large number is represented as an array of 2w-bit words, it's useful to ensure that the vector size 2k is also a multiple of the bits per word by choosing kw ; this allows the inputs to be broken up into pieces without bit shifts, and provides a uniform representation for values mod 2n + 1 where the high word can only be zero or one.
Normalization involves adding or subtracting the modulus 2n + 1; this value has only two bits set, which means this can be done in constant time on average with a specialized operation.
Iterative FFT algorithms such as the Cooley–Tukey FFT algorithm, although frequently used for FFTs on vectors of complex numbers, tend to exhibit very poor cache locality with the large vector entries used in Schönhage–Strassen. The straightforward recursive, not in-place implementation of FFT is more successful, with all operations fitting in the cache beyond a certain point in the call depth, but still makes suboptimal use of the cache in higher call depths. Gaudry, Kruppa, and Zimmerman used a technique combining Bailey's 4-step algorithm with higher radix transforms that combine multiple recursive steps. They also mix phases, going as far into the algorithm as possible on each element of the vector before moving on to the next one.
The "square root of 2 trick", first described by Schönhage, is to note that, provided k ≥ 2, 23n/4−2n/4 is a square root of 2 mod 2n+1, and so a 4n-th root of unity. This allows the transform length to be extended from 2k to 2k + 1.
Finally, the authors are careful to choose the right value of k for different ranges of input numbers, noting that the optimal value of k may go back and forth between the same values several times as the input size increases.