
Euler Conjecture and CDC 6600
November 18, 2025, 10:59am 1

In 1966, Lander and Parkin published a paper containing exactly two sentences. They reported that they had used a direct-search program on a CDC 6600 to obtain a counterexample to Euler's sum of powers conjecture. The result:
27^5 + 84^5 + 110^5 + 133^5 = 144^5
A small Fortran program using OpenMP (8 threads) reproduces this result (and others of a similar nature) in two minutes on a current PC. I am curious whether anyone can estimate how long the CDC 6600 (which cost about $10 million in 1966, equivalent to about $70 million this year) would have run before it produced the reported result.
I am grateful to John Nichols for bringing the paper to my attention in the Intel Fortran Forum a few years ago.
Some details on the search algorithm used in the 1966 paper may be seen in another paper, https://www.ams.org/journals/mcom/1967-21-097/S0025-5718-1967-0220669-3/S0025-5718-1967-0220669-3.pdf.
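The OP's two-minute OpenMP program is not shown in the thread. Purely as an illustration, a minimal brute-force OpenMP search might look like the sketch below (my own code, not the OP's; the search bound nmax = 250 is an assumption, and it can be compiled with gfortran -O3 -fopenmp):

! euler_omp.f90 -- a hypothetical sketch, not the OP's program
program euler_omp
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer, parameter :: nmax = 250
   integer(int64) :: n5(nmax)
   integer :: i, j, k, l, m
   do i = 1, nmax
      n5(i) = int(i, int64)**5   ! precompute fifth powers once
   end do
   ! each thread takes candidate bases i; output lines may interleave
   !$omp parallel do schedule(dynamic) private(j, k, l, m)
   do i = 2, nmax
      do j = 1, i
         do k = 1, j
            do l = 1, k
               do m = 1, l
                  if (n5(j) + n5(k) + n5(l) + n5(m) == n5(i)) &
                     print '(*(g0,1x))', j, k, l, m, i
               end do
            end do
         end do
      end do
   end do
   !$omp end parallel do
end program euler_omp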
Two minutes seems like a lot of time.
I just tested this with:

! sum_conjecture.f90
!
! 27^5 + 84^5 + 110^5 + 133^5 = 144^5
!
use, intrinsic :: iso_fortran_env, only: int64
implicit none
integer(int64) :: i, j, k, l, n

outer: do i = 1_int64, 10000_int64
   do j = 1_int64, i
      do k = 1_int64, i
         do l = 1_int64, i
            do n = 1_int64, i
               if (j**5 + k**5 + l**5 + n**5 == i**5) then
                  print *, "i^5 = ", i**5
                  print *, "j^5 + k^5 + l^5 + n^5 = ", j**5 + k**5 + l**5 + n**5
                  print *, j, k, l, n, i
                  exit outer
               end if
            end do
         end do
      end do
   end do
end do outer
end
On an Apple M2 Pro I get:

$ time ./a.out
 i^5 =  61917364224
 j^5 + k^5 + l^5 + n^5 =  61917364224
 27 84 110 133 144

real    0m5.956s
user    0m5.873s
sys     0m0.060s
themos November 18, 2025, 2:24pm 3

0.4 seconds, but I made the loops hyper-triangular.

themos November 18, 2025, 3:01pm 4

0.2 seconds if you precompute all integer fifth powers.

wspector November 18, 2025, 5:47pm 5

It is an interesting example, since it is dominated by floating-point multiplication, with a relatively small number of integer adds and minimal memory references.
A floating-point multiply functional unit on the 6600 took 10 cycles to complete, and it wasn't pipelined. There were actually two FP multiply units, so with a clock speed of 10 MHz it could theoretically do ~2 million multiplies/sec. The 6600 allowed instructions in other functional units to operate in parallel via its "scoreboard" logic, so the compiler could schedule integer adds and memory loads/stores into the gaps while the multiplier units were busy.

In the later CDC 7600, Mr. Cray ditched the scoreboard and went to fully pipelined functional units (except floating divide), so a new floating-point multiply could be issued every clock period. The clock speed was faster (36 MHz) as well. But I'd guess the bottleneck would have moved to the eight 60-bit X registers.

Both the 6600 and 7600 did integer multiplies via the floating-point multiply units. Integer adds were done via a separate integer add functional unit.
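Taking those numbers, a back-of-envelope answer to the OP's question (my own arithmetic, and it assumes the naive brute-force search posted above rather than Lander and Parkin's actual algorithm): testing all j, k, l, n from 1 to i, for i up to 144, executes roughly

    sum(i**4, i = 1..144) ~ 144**5 / 5 ~ 1.2e10

inner iterations. At an effective rate of ~1e6 iterations/sec on the 6600 (a few 10-cycle, non-pipelined multiplies per iteration at 10 MHz, ignoring scoreboard overlap), that is ~1.2e4 seconds, or about 3.5 hours. The hyper-triangular ordering (j >= k >= l >= n) divides the count by 4! = 24, i.e. roughly 5e8 iterations, on the order of ten minutes of 6600 time.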
Do you know how many clock cycles an integer multiply and a floating-point multiply took on these machines? The integer instruction supposedly could skip the exponent operations.

BTW, the inner loop in that code really only does one n**5 operation, one addition, and one comparison. The compiler should move all the other work outside of the inner loop.
wspector November 18, 2025, 7:12pm 7
Timing is identical for integer vs FP multiply. It is really a FP instruction either way - just unrounded vs rounded variants, and of course, zeros in the upper 12 bits for sign and exponent. There is also a ‘double precision’ version of the multiply - which gives the upper 48 bits of a 96 bit result.
wspector November 18, 2025, 7:48pm 8
It's been a long time since I've thought about these details... The DP product instruction returns the lower 48 bits.

Perusing my copy of the classic "Assembly Language Programming" by Ralph Grishman, he reminds us that the earliest versions of the 6600 did not have the ability to directly multiply integers via the FP multiply. So one had to use the Pack instruction to convert the integers to FP, then the DP product instruction, then convert back to integer via the Unpack instruction.

CDC fairly quickly realized they could make a small change to the DP product instruction to recognize that "if an integer smaller than 2^48 is used as an operand to a floating point instruction, it is treated as a floating point number with a biased exponent of 0 and hence a true exponent of -1777 (base 8)." Before the change was made, "if both operands were integers less than 2^48 the machine computed a true exponent of -3776 (base 8). Since a floating point number that small can not be represented in 6600 floating point format, the instruction would return a zero result." The hardware change was made to most of the earliest machines.
certik November 18, 2025, 10:23pm 9
I was going to say, 5s is way too long. 0.4s seems about right for this kind of problem.
rwmsu November 19, 2025, 2:24am 10

On my AMD Ryzen 5 5600X running Linux Mint 21.3, I get the following times for Ivan's code using AMD AOCC 5.0 flang, nvfortran 25.5, and ifx 2025.1.1, all compiled with -O3 and measured with the Linux time command:

flang     - 8.122 secs
nvfortran - 8.09 secs
ifx       - 42.68 secs

Since both the AMD and Nvidia compilers are based on Classic Flang, I'm not surprised they give the same times. ifx, though, is a puzzle.
Good idea. That gets the inner loop down to an integer addition and a couple of array lookups.

program sum_conjecture
   ! sum_conjecture.f90
   !
   ! 27^5 + 84^5 + 110^5 + 133^5 = 144^5
   !
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer(int64), parameter :: nmax = 10000_int64
   integer(int64) :: i, j, k, l, n, sk, sl, sn
   integer(int64) :: n5(nmax)
   character(*), parameter :: cfmt = '(*(g0,1x))'

   outer: do i = 1_int64, nmax
      n5(i) = i**5
      do j = 1_int64, i
         do k = 1_int64, j
            sk = n5(j) + n5(k)
            do l = 1_int64, k
               sl = sk + n5(l)
               do n = 1_int64, l
                  sn = sl + n5(n)
                  if ( sn == n5(i) ) then
                     print cfmt, "i^5 = ", n5(i)
                     print cfmt, j, k, l, n, i
                     exit outer
                  end if
               end do
            end do
         end do
      end do
   end do outer
end program sum_conjecture
$ gfortran -O3 sum_conjecture.f90
$ time a.out
i^5 =  61917364224
133 110 84 27 144

real    0m0.188s
user    0m0.174s
sys     0m0.007s
The timings are with an Apple M1 under macOS.
MarDie November 19, 2025, 6:15am 12

It's also possible to compute the integer fifth powers at compile time:

integer(int64), parameter :: n5(nmax) = int([(i,i=1_int64,nmax)],int64)**5_int64

It's not faster for me, but it gives overflow warnings upfront.
I don't think this warning applies here because the array is small, but the danger of this approach in general is that the executable file must contain the compile-time computed data, which means that it takes time to load that memory from disk (SSD, etc.) into the process memory on startup. In contrast, if that memory is allocated at run time directly by the executable and then filled by CPU instructions, it can be some 10^3 to 10^5 times faster. You can also see the difference by looking at the size of the executable image (ls -l a.out). With most modern compilers, you do not see the size of a declared array reflected in the executable size unless it is a parameter array, an array initialized at compile time, or an array in a common block.

Also, if done at compile time, the whole array would need to be computed. That is 10^4 elements in this case (and integer overflows would be generated in doing so). The above run-time code only computes 144 elements of the array before finding a solution and stopping.

A further question is where that extra effort gets charged. The extra effort is appropriate if the user is paying for that time (e.g. with money, or with elapsed time, or with total throughput through the machine), but not if that overhead is somehow not charged as user time (e.g. in a timeshare or batch environment with many other users). This is why the programmer sometimes writes code to minimize wall time and sometimes to minimize CPU time; those two goals are not always the same. In the original code, the POSIX time command was used for the timings, and that command returns three different time values for exactly this reason. If you alone own the machine you are running on, and you want to maximize throughput through that machine every 24-hour period, then it is the elapsed time that is critical. If you are one of many users sharing the machine, then it is the user time that you want to minimize; the system time is not charged to you, because while your job is stalled waiting for the disk to respond, someone else's job is swapped in and is being charged for its user time.

Here is the timing result of that last version of the code using gfortran -O3 with an Apple M2 CPU on macOS. It computes the i**5 values at run time, so the system time is minimal.

i^5 =  61917364224
133 110 84 27 144

real    0m0.141s
user    0m0.139s
sys     0m0.002s
One other comment about timings is that they are almost never exactly consistent from run to run. For small segments of code like this, one can do

$ time a.out; time a.out; time a.out

The first result is usually longer than the others, which are then fairly consistent, if not identical, at the millisecond level. If timings were measured at the microsecond level, they would show variations too.
I think int64 just doesn’t have enough range.
$ nagfor test_int64.f90
NAG Fortran Compiler Release 7.2(Shin-Urayasu) Build 7203
Warning: test_int64.f90, line 8: Unused PARAMETER N5
Error: test_int64.f90, line 8: Integer overflow for exponentiation 6209**5
Errors in declarations, no further processing for TEST
[NAG Fortran Compiler error termination, 1 error, 1 warning]
test_int64.f90:

program test
use, intrinsic :: iso_fortran_env, only: int64
implicit none

integer(int64) :: i

integer(int64), parameter :: nmax = 10000_int64
integer(int64), parameter :: n5(nmax) = int([(i,i=1_int64,nmax)],int64)**5_int64

end program
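As an aside, the 6209**5 threshold in that error message can be verified at run time. Here is a minimal sketch (my own, not from the thread) that uses integer division so that nothing overflows:

program fifth_power_limit
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer(int64) :: i
   i = 1_int64
   ! (i+1)**5 <= huge(int64) is tested via repeated division,
   ! so no intermediate result can overflow
   do while (huge(1_int64)/(i+1)/(i+1)/(i+1)/(i+1) >= i+1)
      i = i + 1
   end do
   print *, 'largest i with i**5 representable in int64:', i   ! prints 6208
end program fifth_power_limit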
Yes, i=6208 is the largest integer for which i**5 can be computed without overflow, so nmax=10000 is overkill. But imagine a situation with an array, say, 10^8 or 10^9 in size, where the choice is to compute it at compile time or at run time. Then you can see how the differences in executable size and in compile-time vs. run-time efficiency become relevant.
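To illustrate the run-time alternative for a table of that size, here is a minimal sketch (my own; the table size of 10^8 is an assumption, and it stores squares rather than fifth powers, since i**5 overflows int64 long before i reaches 10^8). The executable stays small because the table is allocated and filled at startup instead of being stored in the image:

program runtime_table
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer(int64), parameter :: n = 100000000_int64   ! 10^8 entries, ~0.8 GB
   integer(int64), allocatable :: tab(:)
   integer(int64) :: i
   allocate (tab(n))       ! allocated at run time: no executable bloat
   do i = 1_int64, n
      tab(i) = i*i         ! filled by CPU instructions, not loaded from disk
   end do
   print *, tab(n)         ! 10^16, still within int64
end program runtime_table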
mecej4 November 20, 2025, 6:42am 16

@RonShepard, here is a shortened version of your code, guided by the motto "calculate just in time". This version is short enough, fast enough, and the problem rich enough in historical interest, that it should be made part of the TIOBE benchmark suite. Does anyone know the process for submission?
!
! 27^5 + 84^5 + 110^5 + 133^5 = 144^5
!
program euler
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer, parameter :: nmax = 150
   integer(int64) :: i, j, k, l, m
   integer(int64) :: n5(nmax)
   character(*), parameter :: cfmt = '(*(g0,1x))'
   outer: do i = 1, nmax
      n5(i) = i**5
      do j = 1, i
         do k = 1, j
            do l = 1, k
               do m = 1, l
                  if ( n5(j)+n5(k)+n5(l)+n5(m) == n5(i) ) then
                     print cfmt, "i^5 = ", n5(i)
                     print cfmt, j, k, l, m, i
                     exit outer
                  end if
               end do
            end do
         end do
      end do
   end do outer
end program euler
When compiled with the Intel Ifort compiler and run on a Windows PC with Ryzen 7-6800U, this program ran in 0.3 s.
mecej4 November 20, 2025, 1:33pm 17

A further speed-up may be achieved by noting that the fifth power of an integer, taken mod 11, has to be in the set {0, 1, 10}.

Is there a reason why m is an integer and not an int64? On Windows with an Intel i7-528K, I got 0.375 s (gfortran) and 0.30 s (Python + Numba).

mecej4 November 20, 2025, 1:47pm 19

That 0.3 s for Python is impressive. Please consider posting your code!
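To make mecej4's mod-11 observation from post 17 concrete: by Fermat's little theorem, x**10 == 1 (mod 11) for any x not divisible by 11, so x**5 mod 11 is always 0, 1, or 10, and a candidate remainder with any other residue can be rejected before entering the innermost loop. A minimal sketch of that filter (my own illustration, not mecej4's actual code):

program euler_mod11
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer, parameter :: nmax = 150
   integer(int64) :: n5(nmax), sk, sl, r
   integer :: i, j, k, l, m
   logical :: ok(0:10)
   ok = .false.
   ok([0, 1, 10]) = .true.      ! the only residues of x**5 mod 11 (Fermat)
   do i = 1, nmax
      n5(i) = int(i, int64)**5
   end do
   outer: do i = 2, nmax
      do j = 1, i
         do k = 1, j
            sk = n5(j) + n5(k)
            if (sk > n5(i)) exit
            do l = 1, k
               sl = sk + n5(l)
               r = n5(i) - sl                  ! value n5(m) would have to equal
               if (r <= 0) exit
               if (.not. ok(int(mod(r, 11_int64)))) cycle  ! no fifth power fits
               do m = 1, l
                  if (n5(m) == r) then
                     print '(*(g0,1x))', j, k, l, m, i
                     exit outer
                  end if
               end do
            end do
         end do
      end do
   end do outer
end program euler_mod11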
One would hope that the compiler would automatically move the additions outside of the inner loops. However, one additional reason for manually moving them outside is that you can exit some of the loops early:

do k = 1, j
   sk = n5(j) + n5(k)
   if ( sk > n5(i) ) exit
   do l = 1, k
      sl = sk + n5(l)
      if ( sl > n5(i) ) exit
      do m = 1, l
         sm = sl + n5(m)
         if ( sm > n5(i) ) exit
I doubt that a compiler optimizer would do those early loop exits.

I notice that if you have a solution to the equation j^5+k^5+l^5+m^5=i^5, and you multiply the integers by some common integer factor x, then (x*j)^5+(x*k)^5+(x*l)^5+(x*m)^5=(x*i)^5 is also a solution. So when one violation of the conjecture is found, that automatically means that an infinite number have been found. I was curious whether there are any other violations of the conjecture besides the multiples of the (133,110,84,27,144) sequence, so I removed the exit outer statement from the program and let it run up to i=1728, and those multiples are the only ones found. I expect that mathematicians who are familiar with this problem already know whether there are other violations, but it was a simple thing to do, so I did it. That code ran several hours overnight.

I wonder if there are algorithms to do this search in a more efficient, maybe less brute-force, way. For example, can you remove the outer i loop, do the other four loops, and then efficiently search the n5(:) array for matches to sm? If you could do that, it would reduce the effort from O(N^5) to O(N^4). @mecej4 mentioned using modular arithmetic; maybe that could be used to limit some of the possibilities early in the outer loops. Of course, all these things will make the code more complicated.
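To sketch that last idea: drop the outer i loop and binary-search the sorted n5(:) table for each four-term sum, which is O(N^4 log N) rather than O(N^5). A minimal, unoptimized illustration (my own code, not tested against the timings above):

program euler_n4
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer, parameter :: nmax = 150
   integer(int64) :: n5(nmax), s
   integer :: j, k, l, m, lo, hi, mid
   do j = 1, nmax
      n5(j) = int(j, int64)**5     ! already sorted, since i**5 is increasing
   end do
   do j = 1, nmax
      do k = 1, j
         do l = 1, k
            do m = 1, l
               s = n5(j) + n5(k) + n5(l) + n5(m)
               if (s > n5(nmax)) exit         ! s only grows with m
               lo = 1
               hi = nmax
               do while (lo <= hi)            ! binary search: is s a fifth power?
                  mid = (lo + hi)/2
                  if (n5(mid) < s) then
                     lo = mid + 1
                  else if (n5(mid) > s) then
                     hi = mid - 1
                  else
                     print '(*(g0,1x))', j, k, l, m, mid
                     exit
                  end if
               end do
            end do
         end do
      end do
   end do
end program euler_n4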