I recently learned how Doom was ported to the SNES. It's quite impressive. The SNES hardware was nowhere near fast enough to do all the trig calculations needed for the game, but cartridge-based games had a trick up their sleeve: they could include actual hardware inside the cart that the game code could make use of. It was more expensive, but if you expected to sell a boatload of copies, it could be worth it. However, even using extra hardware wasn't enough in this case. So they pre-calculated lookup tables for sine, cosine, tangent, etc. for every angle at the necessary precision. They were helped by the fact that the game resolution was fairly low.
If you're interested, you can peruse the C code that was used to generate the tables. Here's the file for sine/cosine: https://github.com/RandalLinden/DOOM-FX/blob/master/source/m...
Yup, I remember watching a video about how the RAM bus is the bottleneck when running Super Mario 64 on the N64. The original implementation used trig lookup tables, but the person optimized it by instead using Taylor series (I think) and some negation / shifting.
Also enjoy his discovery that, while the game got flak online for years because the release build shipped with files that were not optimized, it turns out most of the optimizations that had been done actually made things worse due to the small instruction cache/RAM (I forget the exact details). Things like unrolling loops just increased the code size and required more slow code paging.
Abramowitz and Stegun's Handbook of Mathematical Functions[0] - I still use it to test whenever I need to implement any kind of fundamental math function that's not built into whatever environment I'm using. Haven't used it in some time but it comes in handy.
It turns out that SNES DOOM missed out on a big optimization that people figured out later on. If you use the SNES's Mosaic feature combined with scrolling tricks, you can nearly double the fill rate. Rather than outputting two pixels, you let the SNES's mosaic hardware do the pixel doubling.
The approach is older than that. I remember my grandfather's engineering books from the 1950s – nearly every one of them had a large addendum with said pre-calculated sine, cosine, tangent and logarithm lookup tables. And there was at least one book that only had such tables and no other information.
That is how engineers used to calculate before the advent of computers.
The classic of this field of books is Abramowitz and Stegun's "Handbook of Mathematical Functions" - although the two listed names are merely those of the compilation editors, as the calculations of the numerous tables of values (and sheets of mathematical identities) required hundreds of human computers operating for years.
Ironically, on publication in 1964 it was just in time to see the dawn of the electronic computer age that would supplant it.
Once I tested lookup tables for a path tracer (ray tracer).
It is interesting that you can get very decent results even with low quality tables. Of course there will be artifacts but due to the randomness of a path tracer this is not always very noticeable.
I always wonder when hearing about these old optimizations why they aren't used in contemporary code. Wouldn't you want to squeeze every bit of performance even on modern hardware?
The "processor-memory performance gap" is a big reason why lookup tables aren't as clear of a win on modern hardware as they were on the SNES.
If it takes two CPU cycles to read from RAM, a lookup table will basically always be faster than doing the math at runtime. If it takes fifty cycles (because, while your RAM may be faster, your CPU is a lot faster), and your processor has more advanced hardware that can do more math per cycle, maybe just doing the math is faster.
Other optimizations are now applied automatically by compilers. For example, all modern compilers optimize integer division by compile-time constants, here’s an example: https://godbolt.org/z/1b8r5c5MG
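To see roughly what the compiler emits, here's a hand-written sketch of that strength reduction (the exact magic constant and shift the compiler chooses can differ):

```c
#include <stdint.h>

/* Unsigned division by the constant 3: the compiler replaces the divide
   with a multiply by a precomputed "magic" reciprocal plus a shift. */
uint32_t div3_reference(uint32_t x) {
    return x / 3;                 /* what you write */
}

uint32_t div3_magic(uint32_t x) {
    /* 0xAAAAAAAB == ceil(2^33 / 3); floor(x * 0xAAAAAAAB / 2^33) == x / 3
       for every 32-bit x. This is roughly what the compiler generates. */
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```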
Squeezing performance out of modern hardware requires doing very different things.
Here’s an example about numerical computations. On paper, each core of my CPU can do 64 single-precision FLOPs each cycle. In reality, to achieve that performance a program needs to spam _mm256_fmadd_ps instructions while only loading at most 1 AVX vector per FMA, and only storing at most 1 AVX vector per two FMAs.
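As a rough, untuned sketch of what such code looks like (a plain dot product; note it still issues two loads per FMA, so by the ratio above it stays load-bound, which is exactly why peak-FLOP kernels such as blocked matrix multiplies reuse registers instead):

```c
#include <immintrin.h>
#include <stddef.h>

/* FMA-heavy inner loop with four independent accumulators so several FMA
   chains are in flight at once. Assumes n is a multiple of 32; compile
   with -mavx2 -mfma. */
float dot_fma(const float *a, const float *b, size_t n) {
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                        _mm256_add_ps(acc2, acc3)));
    return tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}
```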
Artifacts are ugly. So why force it on modern hardware when GPUs are extremely fast?
For reference: I was doing a path tracer in PHP :) so yeah, that renders like ancient hardware.
(The browser requested different buckets of an image. A PHP script then rendered and returned that bucket. So it was a kind of multi-threading but still very slow.)
> However, even using extra hardware wasn't enough in this case. So they pre-calculated lookup tables for sine, cosine, tangent etc. for every angle at the necessary precision.
Is this really the order of events? I imagine the pre-calculated route is what you'd try first, and only go for extra hardware if that failed somehow.
This made me realize that trigonometric functions are not deterministic across different CPU architectures, OS, and programming languages (floating point precision aside).
Rounding transcendentals correctly has unknown time and space complexity. [1] Sort of brushes up against the halting problem. With limited precision, the upper limit becomes calculable but it's rather large - packages that offer correct rounding on 64-bit floating point use potentially hundreds of bytes to deal with a single floating point value. Dedicated circuitry to implement it fast would be big and complicated even by today's standards.
True in general, almost false for binary64. Important univariate functions have been thoroughly mapped for known hard-to-round cases, and it turns out that at most a triple-double format is needed to do the correct rounding. (Bivariate functions like `pow` are much harder, and not yet fully mapped as of this writing.) As a result we now have a mathematical library that almost ensures correct rounding [1], and a further optimization is currently in progress.
True in general, but for the basic datatypes sent through hardware registers your processor architecture has fixed precision, so the time and space complexity is O(1).
Yeah, but you'd be surprised at how frequently they appear to be the same. I once worked on an HTML5 game that relied on deterministic simulation for networking, and it wasn't until pretty late in development that a build of Chrome shipped on some platform that finally triggered a desync in our extensive cross-platform test suite.
We implemented a deterministic approximation, and moved on. But I learned something important about trig functions that day.
Well, what's fun is that (AFAIK) trigonometric functions tend not to be implemented in the newer floating point instructions, such as AVX or SSE.
So while what you say is true about the x87 implementation of those functions, for anything targeting a machine built in the last 20 years it's likely the code will run consistently regardless of the architecture (barring architecture floating point bugs, which aren't terribly uncommon in the less significant bits and when overclocking comes into play).
x86 compilers won't use x87 instructions when SSE2 and later are available. x87 is just a really weird and funky instruction set that's best left in the gutter of history.
Sadly even SSE vs. AVX is enough to often give different results, as SSE doesn't have support for fused multiply-add instructions which allow calculation of a*b + c with guaranteed correct rounding. Even though this should allow CPUs from 2013 and later to all use FMA, gcc/clang don't enable AVX by default for the x86-64 targets. And even if they did, results are only guaranteed identical if implementations have chosen the exact same polynomial approximation method and no compiler optimizations alter the instruction sequence.
Unfortunately, floating point results will probably continue to differ across platforms for the foreseeable future.
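A tiny self-contained demonstration of the FMA rounding difference (compile with -ffp-contract=off so the compiler doesn't contract the plain expression into an FMA on its own, and link with -lm):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* a*b is exactly 1 - 2^-54, which rounds to 1.0 as a double, so the
       unfused expression loses the tiny term; fma() rounds only once. */
    double a = 1.0 + 0x1p-27;
    double b = 1.0 - 0x1p-27;
    double c = -1.0;
    printf("a*b + c      = %g\n", a * b + c);     /* 0 */
    printf("fma(a, b, c) = %g\n", fma(a, b, c));  /* -5.55112e-17, i.e. -2^-54 */
    return 0;
}
```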
Barring someone doing a "check if AVX is available" check inside their code, binaries are generally compiled targeting either SSE or AVX and not both. You can reasonably expect that the same binary thrown against multiple architectures will have the same output.
This, of course, doesn't apply if we are talking about a JIT. All bets are off if you are talking about javascript or the JVM.
That is to say, you can expect that a C++ binary blob from the Ubuntu repo is going to produce the same numbers regardless of the machine, since they generally target fairly old architectures.
> -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them
> The default is -ffp-contract=off for C in a standards compliant mode (-std=c11 or similar), -ffp-contract=fast otherwise.
I would have expected that to be a bug in the documentation. Why would they turn FMA off for standards-compliant C mode, but not for standards-compliant C++ mode?
it defaults to off for standard-compliant mode. Which in my mind was the default mode as that's what we use everywhere I have worked in the last 15 years. But of course that's not the case.
In any case, according to the sibling comment, the default is 'fast' even in std-compliant mode in C++, which I find very surprising. I'm not very familiar with that corner of the standard, but it must be looser than the equivalent wording in the C standard.
> x87 is just a really weird and funky instruction set that's best left in the gutter of history
hmmm, can you use the long doubles in sse or avx? They are glorious, and as far as I see from playing with godbolt, they still require dirtying your hands with the x87 stack.
The 80-bit float? Not as far as I'm aware. However, it's fairly trivial to get roughly twice the precision (about 106 significand bits, so-called double-double arithmetic) out of a pair of 64-bit floats. And with the nature of AVX/SSE, you don't take much of a performance hit for doing that, as you are often operating on both parts of the pair with the same instruction.
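For what it's worth, the core building block is Knuth's error-free addition; here's a minimal sketch of the double-double idea (this needs round-to-nearest and no -ffast-math, or the compiler will "simplify" the error term away):

```c
/* TwoSum: a + b == s + e exactly, where s is the rounded sum and e is the
   rounding error that s lost. Chaining these gives a (hi, lo) pair with
   roughly 106 bits of significand. */
typedef struct { double hi, lo; } dd_t;

static dd_t two_sum(double a, double b) {
    double s  = a + b;
    double bb = s - a;
    double e  = (a - (s - bb)) + (b - bb);
    return (dd_t){ s, e };
}

/* Add two double-double values (a simple variant, not the most accurate). */
static dd_t dd_add(dd_t x, dd_t y) {
    dd_t s = two_sum(x.hi, y.hi);
    return two_sum(s.hi, s.lo + x.lo + y.lo);
}
```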
Floats follow a clear specification which determines precisely how basic arithmetic should work. They should work the same on all popular modern platforms. (Whether specific software libraries are the same is a separate question.)
If you implement sine in software using the same sequence of basic arithmetic instructions, the result should be the same across platforms. If you make two different implementations using different arithmetic, then of course you can't rely on them being the same.
Point being that IEEE 754 defines two sets of operations: the required operations (section 5), which must produce results correctly rounded to the last digit, and the recommended operations (section 9), with relaxed requirements. Sine belongs to the latter section, so IEEE 754 does not mandate reproducible results for sine.
My understanding is that most software always uses some software implementation of sine, rather than calling a hardware instruction. Which is definitely what you should do if you care about getting the exact same results across platforms.
Software implementations can and do differ (even dynamically) based on the hardware, though. For example, glibc's sin(x), which is what C code ends up using (along with other languages relying on the C stdlib), uses FMA instructions on my CPU. So the exact same binary, on the exact same OS, with the exact same glibc, should behave differently on a very old CPU without FMA, where glibc falls back to a different implementation (code written around FMA generally cannot be ported exactly to hardware without it, short of a gigantic, unacceptable drop in performance).
Yeah, the lack of FMA in some contexts is a serious bummer. It would be great if every popular CPU platform would figure out a way to get FMA implemented, and if programming languages would figure out better ways to help programmers use it explicitly without making their code too ugly.
At this point Intel, AMD, Arm, and RISC-V all do, and have for a while. The only one that is at all relevant that doesn't is Apple M-series under Rosetta.
This is not always a safe assumption (in certain scenarios floating point results being nondeterministic has the possibility to introduce bugs and security issues) and is also a kind of sad way to look at the world. The response to "I don't understand how this works" should not be to adopt an incorrect viewpoint, but to know the limitations of your understanding.
It’s not that I don’t understand, it’s that I do. Floats are inherently lossy representations. Yes, this means the more operations you perform on a float input, the fuzzier the value is. You ignore that harsh reality at your peril. If you find engineering rigor sad, I don’t know what to tell you.
"Floats are not deterministic" is not engineering rigor, it's just wrong. They are specified precisely by IEEE-754 in how they must behave and which operations are allowed to produce which results.
IEEE 754 conforming floats conform to IEEE-754. If they actually conform. Low end devices with shitty software implementations often get the hard edge cases wrong.
> the more operations you perform on a float input, the fuzzier the value is
No, any float always precisely represents a specific number. The issue is that only a finite number of numbers are representable.
Some algorithms are poorly conditioned and when implemented using floating point arithmetic will lead to a result that is different than what you would get in idealized real number arithmetic. That doesn't make any floating point value "fuzzy".
> No, any float always precisely represents a specific number. The issue is that only a finite number of numbers are representable.
A float always precisely represents a specific number, but that number is not always exactly equal to the algebraic result of the operations performed (even when ignoring transcendental and irrational functions). This should be obvious, since there are infinitely many rational numbers but only finitely many floating point numbers.
If you design your algorithms very carefully, you can keep the ratio of your algorithm's output to the exact algebraic result close to unity over a wide domain of inputs.
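A two-line demonstration of the distinction: every value below is an exact, well-defined double, yet the algorithm loses digits because the rounded result of 1.0 + x cannot hold all of x, and the subtraction then exposes that:

```c
#include <stdio.h>

int main(void) {
    double x = 1e-8;
    double y = (1.0 + x) - 1.0;   /* each operation is correctly rounded */
    printf("x = %.17g\n", x);
    printf("y = %.17g\n", y);     /* about 9.99999994e-09: half of x's digits are gone */
    return 0;
}
```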
It depends. If you're constrained to one chip and one platform, you can characterize or estimate the properties of floating point that matter in your application. In some applications, like embedded, that's actually totally fine, and modern embedded chips can often do floating point as fast as or faster than they can emulate the fixed point used to work around floating point's drawbacks.

On one project I worked on, they originally wrote everything in fixed point out of fear that floating point would introduce some deleterious effect. But in the end they rewrote parts of the project using floating point, to no ill effect and with a great performance improvement. And there were features of the product they had to strike, because the rewrite needed to support them couldn't touch certain sensitive areas of code that had been tested extensively over the 2 or 3 years of development. It would have been much better to evaluate the assumption that floats are bad early in the project and make the decision based on real information. The heuristic they applied ended up costing part of the product that was strategically important.
Floats are well defined, and it is perfectly possible to reason about how algorithms based on them should behave. Few languages specify the accuracy of things like trig functions, so relying on them can be tricky, and JavaScript is particularly bad in that respect.
They're always deterministic in some sense (and as long as your OS respects the rounding mode after a context switch properly). This might sound pedantic but it determines how we think about floats — the behaviour is specified quite exactly.
Curiously, early Intel 386 processors had a bug where 32-bit multiplies were genuinely nondeterministic: some answers would depend on the voltage, frequency, temperature, and manufacturing conditions. The problem was essentially analog, a layout issue, where the signal didn't always have enough margin. (This is unrelated to the famous Pentium FDIV bug.) Until Intel got the problem fixed, they would stamp bad chips with "16 BIT S/W ONLY", while good chips were labeled "ΣΣ".
What I mean is that the same code running on different hardware/os may not always give the same answer. It’ll be close, but you can’t always expect bit for bit identical.
But why? We all know 0.1+0.2 won’t give 0.3 with floats but at least we should expect deterministic result for same numbers and same operations and same order, no?
I don't think that's safe at all. Catastrophic cancellation would be quite a lot less catastrophic if rounding errors were random but accurate on average.
Somewhat annoyingly, the C standard only specifies that various math functions return an approximation, but does not set any bounds on that approximation. So for many functions you could just return NaN and still be compliant.
After reducing the range, you don't want to use the Taylor series, as it gives an approximation that's really good at 0 but degrades as you move away from it.
It's better to use an interpolating polynomial (Chebychev comes to mind) over the whole target interval.
I've seen the third-order Taylor series used, but with the coefficients calculated at various offsets along a quarter wave. So you look up where you are in the quarter wave, then look up the 3 or 4 coefficients. This keeps the error somewhat bounded, since x stays small and the series doesn't diverge much.
Anyone experienced with the Remez algorithm mentioned at the end of the article?
The degree-9 polynomial, said to be a thousand times better than the original Taylor approximation in maximum error, also appears to be very close to the Taylor series in the first place.
Rounding the Taylor coefficients to 6 digits after the decimal:
1/3! = 0.166667
1/5! = 0.008333
1/7! = 0.000198
1/9! = 0.0000027(56)
The first 2 are exact, the third is 5 digits only (so 0.000190), and the fourth is more different starting from the 6th digit (0.000026019).
The delta in the 9-th order is expected if you were to truncate the Taylor series starting from the 11th order to infinity (+ x^11 / 11! - x^13/13! ...).
I don't think the polynomial given in the article was calculated via Remez. Perhaps the author merely meant it as an illustration. Here is Wolfram Alpha plotting the error of equation from the article:
You'll note that the error is small near zero and large near pi/4, which is characteristic of a Taylor series around zero, not the level equioscillation you get from Remez approximations[1]. Note also that the polynomial only includes odd terms, which is not something I would expect from Remez unless it was run on the symmetric interval [-pi/4, pi/4].
I ran Remez for the problem and after 4 iterations obtained a degree 8 polynomial with error less than 1e-9, but it didn't look anything like the polynomial given in the article.
Although of course the first few digits of the low-order matching terms will be very similar - any polynomial approximation method will agree there because they are fitting the same function, after all. But by the time we reach x^5 or x^7 the agreement is very loose, really only the first couple digits are the same.
> Perhaps the author merely meant it as an illustration.
Yes I too was expecting oscillations (since I saw Chebyshev polynomials mentioned on the wiki page). I'm beginning to think that the word "might" in their statement "One way to calculate is Remez’s algorithm. The results might look like this:" is doing a lot of heavy lifting.
A couple of notes about your results:
- The interval only needs to be [0, pi/16] and not [0, pi/4]; adjusting that in your wolframalpha query yields a similar bound to what the author lists: 3.5 e-9. (Although same observation about the change of behavior at the bounds as you made).
- The polynomial you obtained actually looks pretty close to the Taylor one: the even coefficients are small, and rounding to 5 or 6 digits might yield a close one, maybe if we look for a degree 9 polynomial instead of degree 8. (I don't have an implementation of Remez handy right now to try it myself).
> Although of course the first few digits of the low-order matching terms will be very similar
It’s a very simple iterative algorithm, essentially the dumbest thing that could possibly work (like most good algorithms). It fails to converge for functions that have poles nearby unless you have a very good initial guess (the Chebyshev or Carathéodory-Fejér approximants are ~always good starting points and easily computed). In practice you want to optimize a weighted L-inf norm rather than absolute, because floating-point errors are measured in a relative norm.
I've been doing FoC motor control for a long time and I've settled on a neat little fixed point approximation for sin/cos. I haven't been able to find the blog I got the idea from. It's accurate to 9 bits, but is very symmetric and hits 1,0,-1 exactly. It's also smooth which usually makes it better than lookup tables.
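I couldn't find that blog either, but for anyone curious, here's a well-known approximation in the same spirit (shown in floating point for readability; it maps directly onto fixed point). It's exact at 0 and ±π/2, smooth, and good to roughly 10 bits after the correction step. The parent's exact formula may well differ:

```c
#include <math.h>

/* Parabolic sine approximation, valid for x in [-pi, pi]: a parabola through
   sin's zeroes and peak, then one correction step that pulls the curve
   toward the true shape. Max error is on the order of 1e-3. */
float sin_parabolic(float x) {
    const float PI = 3.14159265f;
    const float B  =  4.0f / PI;          /*  4/pi   */
    const float C  = -4.0f / (PI * PI);   /* -4/pi^2 */
    const float P  = 0.225f;              /* empirically chosen correction weight */

    float y = B * x + C * x * fabsf(x);   /* first parabola */
    return P * (y * fabsf(y) - y) + y;    /* correction: blend y with y*|y| */
}
```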
Just using the 32 entry LUT and the small-angle approximations (sin(x) = x, cos(x) = 1-x^2/2) lets you calculate sin(x) within +/- 0.00015, which is rather impressive for something that can be done quickly by hand.
If you use cos(x) = 1 then you are still accurate to within 1%
[edit]
I think I also found an error in TFA? It seems like picking the "best" N would allow r to be in the range -pi/32 < r < pi/32, which makes the 3rd order Taylor series have an error of 4e-12, significantly better than the error range for 0 < r < pi/16
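Here's a rough reconstruction of that LUT + small-angle scheme (my own sketch, not TFA's code), using the nearest-N reduction so |r| <= pi/32; the error works out to roughly ±1.6e-4, dominated by the sin(r) ≈ r term:

```c
#include <math.h>

#define PI_C  3.14159265358979323846
#define STEP  (PI_C / 16.0)     /* 32 table entries cover one full turn */

static double sin_tab[32], cos_tab[32];

/* Call once; in a real implementation these would be precomputed constants. */
static void build_tables(void) {
    for (int i = 0; i < 32; i++) {
        sin_tab[i] = sin(i * STEP);
        cos_tab[i] = cos(i * STEP);
    }
}

/* sin(x) via sin(a + r) = sin(a)cos(r) + cos(a)sin(r), where a = k*pi/16
   comes from the table and r uses the small-angle approximations. */
static double sin_lut(double x) {
    long   k   = lround(x / STEP);        /* nearest multiple of pi/16 */
    double r   = x - k * STEP;            /* |r| <= pi/32 */
    int    idx = (int)(((k % 32) + 32) % 32);
    double sr  = r;                       /* sin(r) ~ r         */
    double cr  = 1.0 - 0.5 * r * r;       /* cos(r) ~ 1 - r^2/2 */
    return sin_tab[idx] * cr + cos_tab[idx] * sr;
}
```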
CORDIC is pretty obsolete, AFAIK. Its advantage is that its hardware requirements are absolutely tiny: two (?) accumulator registers, plus hardware adders and shifters—I think that's all. No multiplication needed, in particular. Very convenient if you're building things from discrete transistors, like some of those earlier scientific calculators!
(Also has a nice property, apparently, that CORDIC-like routines exist for a bunch of special functions and they're very similar to each other. Does anyone have a good resource for learning the details of those algorithms? They sound elegant).
CORDIC still is used in tiny microcontrollers (smaller than Cortex-M0) and in FPGAs when you are very resource-constrained. Restricting the domain and using Chebyshev/Remez is the way to go pretty much everywhere.
I recently came across CORDIC in a fun read I just finished, 'Empire of the Sum' by Keith Houston. It's a history of calculators, and there's a chapter on the HP-35, which used CORDIC. The author goes into some detail about how it was used by that constrained device.
There are a lot of references given in the book including more details on CORDIC.
CORDIC doesn't make sense if multiplies have a similar cost to adds. If you have cheap multiplications then a successive approximation is faster. If multiplications are expensive relative to adds then CORDIC can still generally win. Basically this is only the case nowadays if you are doing ASICs or FPGAs.
I have used CORDIC to compute fixed-point log/exp with 64 bits of precision, mostly because I wanted to avoid having to implement (portable) 64x64->128 bit integer multiplication in C. It is probably still slower than doing that, but the code was really simple, and very accurate.
Multiplication is pretty much needed in CORDIC! And it is far from obsolete! It works perfectly fine, and doesn't have any of the problems mentioned in the article.
CORDIC doesn't use multipliers. That's the whole appeal for low performance hardware since it's all shifts and adds. It can still be useful on more capable platforms when you want sin and cos in one operation since there is no extra cost.
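For readers who haven't seen it, here's a minimal fixed-point CORDIC sketch in rotation mode; the loop body is nothing but table lookups, adds, and shifts, and one pass yields sin and cos together. (The angle table and gain are filled via libm here for brevity; in hardware they would be baked-in constants. Assumes arithmetic right shift of negative values, which every mainstream compiler provides.)

```c
#include <stdint.h>
#include <math.h>

#define ITERS 16
#define Q     16                           /* Q16.16 fixed point */

static int32_t atan_tab[ITERS];            /* atan(2^-i) in Q16.16 radians */
static int32_t inv_gain;                   /* 1/K = prod cos(atan(2^-i)) */

static void cordic_init(void) {
    double k = 1.0;
    for (int i = 0; i < ITERS; i++) {
        atan_tab[i] = (int32_t)lround(atan(ldexp(1.0, -i)) * (1 << Q));
        k *= cos(atan(ldexp(1.0, -i)));
    }
    inv_gain = (int32_t)lround(k * (1 << Q));
}

/* angle in Q16.16 radians, |angle| <= pi/2; writes sin and cos in Q16.16. */
static void cordic_sincos(int32_t angle, int32_t *s, int32_t *c) {
    int32_t x = inv_gain;                  /* start at (1/K, 0) so the gain cancels */
    int32_t y = 0;
    int32_t z = angle;
    for (int i = 0; i < ITERS; i++) {
        int32_t dx = x >> i, dy = y >> i;
        if (z >= 0) { x -= dy; y += dx; z -= atan_tab[i]; }
        else        { x += dy; y -= dx; z += atan_tab[i]; }
    }
    *c = x;
    *s = y;
}
```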
Actually, pretty much everyone implements double-precision sin/cos using the same (IIRC) pair of 6th-order polynomials. The same SunPro code exists unchanged in essentially every C library everywhere. It's just a fitted curve, no fancy series definition beyond what appears in the output coefficients. One polynomial for the "mostly linear" segment where the line crosses the origin and another for the "mostly parabolic" peak of the curve.
yeah, but 52 adds can be a lot cheaper than a few multiplies, if you're making them out of shift registers and logic gates (or LUT). in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation.
> in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation
Moving data is indeed expensive, but there’s another reason to not care. Modern CPUs take same time to add or multiply floats.
For example, the computer I’m using, with AMD Zen3 CPU cores, takes 3 cycles to add or multiply numbers, which applies to both 32- and 64-bit flavors of floats. See addps, mulps, addpd, mulpd SSE instructions in that table: https://www.uops.info/table.html
It's great for hardware implementations, because it's simple and you get good/excellent accuracy. I wouldn't be surprised if that's still how modern x86-64 CPUs compute sin, cos, etc.
That said, last time I had to do that in software, I used Taylor series. Might not have been an optimal solution.
> EDIT: That's a paper for a software library, not the CPU's internal implementation.
Unless you're seeing something I'm not, it's talking about x87, which hasn't been anything other than 'internal' since they stopped selling the 80486sx.
I was truly amazed when my high school computer science teacher expanded the sine function into a Taylor series for us to implement in class.
Couldn't wrap my head around the concept until the topic was revisited in college, but the idea was there, and it helped me understand the tradeoffs brought by the "fast math" libraries I was using during high school.
That would be quite a big lookup table... Half of all float values are in the -1…+1 range. You don't need the negative part, so that's a quarter of all floats, over one billion numbers. And then some more to get from 1 to π/2.
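You can count them directly, since non-negative IEEE-754 floats order the same way as their bit patterns:

```c
#include <stdio.h>
#include <string.h>

/* The number of single-precision values in [0, 1) is simply the bit pattern
   of 1.0f, because non-negative floats and their bit patterns sort identically. */
int main(void) {
    float one = 1.0f;
    unsigned bits;
    memcpy(&bits, &one, sizeof bits);
    printf("%u floats in [0, 1)\n", bits);   /* 1065353216: just over a billion */
    return 0;
}
```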
I’m brushing up my math basics (I graduated CS while dodging the math requirements) and it frustrates me that in trig I need to remember values at all. The values such as sqrt(2)/2 make sense but how hard is it to calculate sin(5 degrees) by hand?
Use a Taylor series with a four function calculator.
0 is a decent approximation of sin near 0.
x is a better one.
x - x^3/6 is an even better one.
x - x^3/6 + x^5/120 ...
Note that x here is in radians rather than degrees so convert (degrees * pi/180) first. Repeat until you're satisfied with how many stable digits you get
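The same hand procedure as a tiny program, for checking your work (sin(5°) ≈ 0.08716):

```c
#include <stdio.h>

/* Convert degrees to radians, then keep adding Taylor terms; each new term
   is the previous one times -x^2 / ((2n)(2n+1)). */
double sin_taylor_deg(double degrees) {
    double x = degrees * 3.14159265358979323846 / 180.0;
    double term = x, sum = x;
    for (int n = 1; n < 10; n++) {
        term *= -x * x / ((2.0 * n) * (2.0 * n + 1.0));
        sum += term;
    }
    return sum;
}

int main(void) {
    printf("sin(5 deg) ~ %.9f\n", sin_taylor_deg(5.0));   /* 0.087155743 */
    return 0;
}
```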
I remember seeing the source code for a version of SpaceWar! running on an Adage Graphics Terminal (a one's complement machine) around 1970 that used a precomputed sine table.
I wonder what the first program was that ever used a precomputed trig table.
I thought modern CPUs since the late 1980's used a lookup table for trig/transcendental functions. Is the LUT just an expansion of the polynomial? I never really understood how FPUs worked...
That would take way too much room. A full lookup table would have 2^64 entries of 64 bits each, at 2^70 bits of ROM.
For comparison:
- Apple’s M2 Ultra has about 134 billion transistors. That’s about 2^38.
- Avogadro’s number is about 2^79.
Reducing the argument to a small range around zero decreases that a lot, but not nearly enough. There are 2^52 doubles in [0.5, 1.0), 2^52 more in [0.25, 0.5), 2^52 more in [0.125, 0.25), etc., so you'd still easily need 2^52 entries, or 2^58 bits (and likely way, way more).
“It is implemented using a programmable logic array with 2,048 cells, of which 1,066 cells should have been populated with one of five values: −2, −1, 0, +1, +2.”
Not sure what you’re trying to demonstrate, they wouldn’t store every single float!! I hope you don’t program. ;)
Older CPUs generally have used CORDIC (which does use LUT but that's only a part of the algorithm) due to its simplicity and compactness, while later CPUs with extensive microcode support would do the same thing as software implementations.
Given the ridiculous number of transistors and so on we can use in CPUs and GPUs nowadays how feasible is a relatively huge trig lookup table burned into rom?
A lookup table (for any function) that covered all values between 0 and 1 in single precision would be ~4 GB; there are roughly a billion single-precision values between 0 and 1, and each result is 4 bytes.
Such a table for double-precision would be much, much larger.
It would also be extremely inaccurate. The x^n numerators grow very quickly and digits get lost because unlimited precision isn't available. Likewise, the n! denominators also grow rapidly. Then the series is alternating which means cancellation is happening for every added term.
That still makes it easier than computing functions whose series are not globally convergent, like the inverse trig functions. Obviously, you would have to break the range apart to speed convergence.