I recently learned how Doom was ported to the SNES. It's quite impressive. The SNES hardware was nowhere near fast enough to do all the trig calculations needed for the game, but cartridge-based games had a trick up their sleeve: they could include actual hardware inside the cart that the game code could make use of. It was more expensive, but if you expected to sell a boatload of copies, it could be worth it. However, even using extra hardware wasn't enough in this case. So they pre-calculated lookup tables for sine, cosine, tangent, etc. for every angle at the necessary precision. They were helped by the fact that the game resolution was fairly low.
If you're interested, you can peruse the C code that was used to generate the tables. Here's the file for sine/cosine: https://github.com/RandalLinden/DOOM-FX/blob/master/source/m...
Yup, I remember watching a video about how the RAM bus is the bottleneck when running Super Mario 64 on the N64. The original implementation used trig lookup tables, but the person optimized it by instead using Taylor series (I think) and some negation / shifting.
Also enjoy his discovery that, while the game got flak online for years because the release build shipped with files that were not optimized, it turns out most of the optimizations that had been done actually made things worse due to the small instruction cache/RAM (I forget the exact details). Things like unrolling loops just increased the code size and required more slow code paging.
Abramowitz and Stegun's Handbook of Mathematical Functions[0] - I still use it to test whenever I need to implement any kind of fundamental math function that's not built into whatever environment I'm using. Haven't used it in some time but it comes in handy.
It turns out that SNES DOOM missed out on a big optimization that people figured out later on. If you use the SNES's Mosaic feature combined with scrolling tricks, you can nearly double the fill rate. Rather than outputting two pixels, you let the SNES's mosaic hardware do the pixel doubling.
The approach is older than that. I remember my grandfather's engineering books from the 1950s – nearly every one of them had a large addendum with said pre-calculated sine, cosine, tangent and logarithm lookup tables. And there was at least one book that only had such tables and no other information.
That is how engineers used to calculate before the advent of computers.
The classic of this field of books is Abramowitz and Stegun's "Handbook of Mathematical Functions" - although the two listed names are merely those of the compilation editors, as the calculations of the numerous tables of values (and sheets of mathematical identities) required hundreds of human computers operating for years.
Ironically, on publication in 1964 it was just in time to see the dawn of the electronic computer age that would supplant it.
Once I tested lookup tables for a path tracer (ray tracer).
It is interesting that you can get very decent results even with low quality tables. Of course there will be artifacts but due to the randomness of a path tracer this is not always very noticeable.
I always wonder when hearing about these old optimizations why they aren't used in contemporary code. Wouldn't you want to squeeze every bit of performance even on modern hardware?
The "processor-memory performance gap" is a big reason why lookup tables aren't as clear of a win on modern hardware as they were on the SNES.
If it takes two CPU cycles to read from RAM, a lookup table will basically always be faster than doing the math at runtime. If it takes fifty cycles (because, while your RAM may be faster, your CPU is a lot faster), and your processor has more advanced hardware that can do more math per cycle, maybe just doing the math is faster.
Other optimizations are now applied automatically by compilers. For example, all modern compilers optimize integer division by compile-time constants, here’s an example: https://godbolt.org/z/1b8r5c5MG
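To see roughly what the compiler emits, here's a hand-written sketch of that strength reduction (the exact magic constant and shift the compiler chooses can differ):

```c
#include <stdint.h>

/* Unsigned division by the constant 3: the compiler replaces the divide
   with a multiply by a precomputed "magic" reciprocal plus a shift. */
uint32_t div3_reference(uint32_t x) {
    return x / 3;                 /* what you write */
}

uint32_t div3_magic(uint32_t x) {
    /* 0xAAAAAAAB == ceil(2^33 / 3); floor(x * 0xAAAAAAAB / 2^33) == x / 3
       for every 32-bit x. This is roughly what the compiler generates. */
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```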
Squeezing performance out of modern hardware requires doing very different things.
Here’s an example about numerical computations. On paper, each core of my CPU can do 64 single-precision FLOPs each cycle. In reality, to achieve that performance a program needs to spam _mm256_fmadd_ps instructions while only loading at most 1 AVX vector per FMA, and only storing at most 1 AVX vector per two FMAs.
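As a rough, untuned sketch of what such code looks like (a plain dot product; note it still issues two loads per FMA, so by the ratio above it stays load-bound, which is exactly why peak-FLOP kernels such as blocked matrix multiplies reuse registers instead):

```c
#include <immintrin.h>
#include <stddef.h>

/* FMA-heavy inner loop with four independent accumulators so several FMA
   chains are in flight at once. Assumes n is a multiple of 32; compile
   with -mavx2 -mfma. */
float dot_fma(const float *a, const float *b, size_t n) {
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                        _mm256_add_ps(acc2, acc3)));
    return tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}
```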
Artifacts are ugly. So why force it on modern hardware when GPUs are extremely fast?
For reference: I was doing a path tracer in PHP :) so yeah, that renders like ancient hardware.
(The browser requested different buckets of an image. A PHP script then rendered and returned that bucket. So it was a kind of multi-threading but still very slow.)
> However, even using extra hardware wasn't enough in this case. So they pre-calculated lookup tables for sine, cosine, tangent etc. for every angle at the necessary precision.
Is this really the order of events? I imagine the pre-calculated route is what you'd try first, and only go for extra hardware if that failed somehow.
This made me realize that trigonometric functions are not deterministic across different CPU architectures, OS, and programming languages (floating point precision aside).
Rounding transcendentals correctly has unknown time and space complexity. [1] Sort of brushes up against the halting problem. With limited precision, the upper limit becomes calculable but it's rather large - packages that offer correct rounding on 64-bit floating point use potentially hundreds of bytes to deal with a single floating point value. Dedicated circuitry to implement it fast would be big and complicated even by today's standards.
True in general, almost false for binary64. Important univariate functions have been thoroughly mapped for known hard-to-round cases, and it turns out that at most a triple-double format is needed to do the correct rounding. (Bivariate functions like `pow` are much harder, and not yet fully mapped as of this writing.) As a result we now have a mathematical library that almost ensures correct rounding [1], and a further optimization is currently in progress.
True in general, but for the basic datatypes sent through hardware registers your processor architecture has fixed precision, so the time and space complexity is O(1).
Yeah, but you'd be surprised at how frequently they appear to be the same. I once worked on an HTML5 game that relied on deterministic simulation for networking, and it wasn't until pretty late in development that a build of Chrome shipped on some platform that finally triggered a desync in our extensive cross-platform test suite.
We implemented a deterministic approximation, and moved on. But I learned something important about trig functions that day.
Well, what's fun is that (AFAIK) trigonometric functions tend not to be implemented in the newer floating point instructions, such as AVX or SSE.
So while what you say is true about the x87 implementation of those functions, for anything targeting a machine built in the last 20 years it's likely the code will run consistently regardless of the architecture (barring architecture floating point bugs, which aren't terribly uncommon in the less significant bits and when overclocking comes into play).
x86 compilers won't use x87 instructions when SSE2 and later are available. x87 is just a really weird and funky instruction set that's best left in the gutter of history.
Sadly even SSE vs. AVX is enough to often give different results, as SSE doesn't have support for fused multiply-add instructions which allow calculation of a*b + c with guaranteed correct rounding. Even though this should allow CPUs from 2013 and later to all use FMA, gcc/clang don't enable AVX by default for the x86-64 targets. And even if they did, results are only guaranteed identical if implementations have chosen the exact same polynomial approximation method and no compiler optimizations alter the instruction sequence.
Unfortunately, floating point results will probably continue to differ across platforms for the foreseeable future.
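A tiny self-contained demonstration of the FMA rounding difference (compile with -ffp-contract=off so the compiler doesn't contract the plain expression into an FMA on its own, and link with -lm):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* a*b is exactly 1 - 2^-54, which rounds to 1.0 as a double, so the
       unfused expression loses the tiny term; fma() rounds only once. */
    double a = 1.0 + 0x1p-27;
    double b = 1.0 - 0x1p-27;
    double c = -1.0;
    printf("a*b + c      = %g\n", a * b + c);     /* 0 */
    printf("fma(a, b, c) = %g\n", fma(a, b, c));  /* -5.55112e-17, i.e. -2^-54 */
    return 0;
}
```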
Barring someone doing a "check if AVX is available" check inside their code, binaries are generally compiled targeting either SSE or AVX and not both. You can reasonably expect that the same binary thrown against multiple architectures will have the same output.
This, of course, doesn't apply if we are talking about a JIT. All bets are off if you are talking about javascript or the JVM.
That is to say, you can expect that a C++ binary blob from the Ubuntu repo is going to produce the same numbers regardless of the machine, since they generally target fairly old architectures.
> -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them
> The default is -ffp-contract=off for C in a standards compliant mode (-std=c11 or similar), -ffp-contract=fast otherwise.
I would have expected that to be a bug in the documentation. Why would they turn FMA off for standards-compliant C mode, but not for standards-compliant C++ mode?
it defaults to off for standard-compliant mode. Which in my mind was the default mode as that's what we use everywhere I have worked in the last 15 years. But of course that's not the case.
In any case, according to the sibling comment, the default is 'fast' even in std-compliant mode in C++, which I find very surprising. I'm not very familiar with that corner of the standard, but it must be looser than the equivalent wording in the C standard.
> x87 is just a really weird and funky instruction set that's best left in the gutter of history
hmmm, can you use the long doubles in sse or avx? They are glorious, and as far as I see from playing with godbolt, they still require dirtying your hands with the x87 stack.
The 80-bit float? Not as far as I'm aware. However, it's fairly trivial to get roughly twice the precision (about 106 significand bits, so-called double-double arithmetic) out of a pair of 64-bit floats. And with the nature of AVX/SSE, you don't take much of a performance hit for doing that, as you are often operating on both parts of the pair with the same instruction.
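For what it's worth, the core building block is Knuth's error-free addition; here's a minimal sketch of the double-double idea (this needs round-to-nearest and no -ffast-math, or the compiler will "simplify" the error term away):

```c
/* TwoSum: a + b == s + e exactly, where s is the rounded sum and e is the
   rounding error that s lost. Chaining these gives a (hi, lo) pair with
   roughly 106 bits of significand. */
typedef struct { double hi, lo; } dd_t;

static dd_t two_sum(double a, double b) {
    double s  = a + b;
    double bb = s - a;
    double e  = (a - (s - bb)) + (b - bb);
    return (dd_t){ s, e };
}

/* Add two double-double values (a simple variant, not the most accurate). */
static dd_t dd_add(dd_t x, dd_t y) {
    dd_t s = two_sum(x.hi, y.hi);
    return two_sum(s.hi, s.lo + x.lo + y.lo);
}
```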
Floats follow a clear specification which determines precisely how basic arithmetic should work. They should work the same on all popular modern platforms. (Whether specific software libraries are the same is a separate question.)
If you implement sine in software using the same sequence of basic arithmetic instructions, the result should be the same across platforms. If you make two different implementations using different arithmetic, then of course you can't rely on them being the same.
Point being that IEEE 754 defines two sets of operations: the required operations (section 5), which must produce results correctly rounded to the last digit, and the recommended operations (section 9), with relaxed requirements. Sine belongs to the latter section, so IEEE 754 does not mandate reproducible results for sine.
My understanding is that most software always uses some software implementation of sine, rather than calling a hardware instruction. Which is definitely what you should do if you care about getting the exact same results across platforms.
Software implementations can and do differ (even dynamically) based on the hardware, though. For example, glibc's sin(x), which is what C code ends up using (along with other languages relying on the C stdlib), uses FMA instructions on my CPU. So the exact same binary, on the exact same OS, with the exact same glibc, should behave differently on a very old CPU without FMA, where glibc falls back to a different implementation (code written around FMA generally cannot be ported exactly to hardware without it, short of a gigantic, unacceptable drop in performance).
Yeah, the lack of FMA in some contexts is a serious bummer. It would be great if every popular CPU platform would figure out a way to get FMA implemented, and if programming languages would figure out better ways to help programmers use it explicitly without making their code too ugly.
At this point Intel, AMD, Arm, and RISC-V all do, and have for a while. The only one that is at all relevant that doesn't is Apple M-series under Rosetta.
This is not always a safe assumption (in certain scenarios floating point results being nondeterministic has the possibility to introduce bugs and security issues) and is also a kind of sad way to look at the world. The response to "I don't understand how this works" should not be to adopt an incorrect viewpoint, but to know the limitations of your understanding.
It’s not that I don’t understand, it’s that I do. Floats are inherently lossy representations. Yes, this means the more operations you perform on a float input, the fuzzier the value is. You ignore that harsh reality at your peril. If you find engineering rigor sad, I don’t know what to tell you.
"Floats are not deterministic" is not engineering rigor, it's just wrong. They are specified precisely by IEEE-754 in how they must behave and which operations are allowed to produce which results.
IEEE 754 conforming floats conform to IEEE-754. If they actually conform. Low end devices with shitty software implementations often get the hard edge cases wrong.
> the more operations you perform on a float input, the fuzzier the value is
No, any float always precisely represents a specific number. The issue is that only a finite number of numbers are representable.
Some algorithms are poorly conditioned and when implemented using floating point arithmetic will lead to a result that is different than what you would get in idealized real number arithmetic. That doesn't make any floating point value "fuzzy".
> No, any float always precisely represents a specific number. The issue is that only a finite number of numbers are representable.
A float always precisely represents a specific number, but that number is not always exactly equal to the algebraic result of the operations performed (even when ignoring transcendental and irrational functions). This should be obvious, since there are infinitely many rational numbers but only finitely many floating point numbers.
If you design your algorithms very carefully, you can keep the ratio of your algorithm's output to the exact algebraic result close to unity over a wide domain of inputs.
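A two-line demonstration of the distinction: every value below is an exact, well-defined double, yet the algorithm loses digits because the rounded result of 1.0 + x cannot hold all of x, and the subtraction then exposes that:

```c
#include <stdio.h>

int main(void) {
    double x = 1e-8;
    double y = (1.0 + x) - 1.0;   /* each operation is correctly rounded */
    printf("x = %.17g\n", x);
    printf("y = %.17g\n", y);     /* about 9.99999994e-09: half of x's digits are gone */
    return 0;
}
```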
It depends. If you're constrained to one chip and one platform, you can characterize or estimate the properties of floating point that matter in your application. In some applications, like embedded, that's actually totally fine, and modern embedded chips can often do floating point as fast as or faster than they can emulate the fixed point used to work around floating point's drawbacks.

On one project I worked on, they originally wrote everything in fixed point out of fear that floating point would introduce some deleterious effect. But in the end they rewrote parts of the project using floating point, to no ill effect and with a great performance improvement. And there were features of the product they had to strike, because the rewrite needed to support them couldn't touch certain sensitive areas of code that had been tested extensively over the 2 or 3 years of development. It would have been much better to evaluate the assumption that floats are bad early in the project and make the decision based on real information. The heuristic they applied ended up costing part of the product that was strategically important.
Floats are well defined, and it is perfectly possible to reason about how algorithms based on them should behave. Few languages specify the accuracy of things like trig functions, so relying on them can be tricky, and JavaScript is particularly bad in that respect.
They're always deterministic in some sense (and as long as your OS respects the rounding mode after a context switch properly). This might sound pedantic but it determines how we think about floats — the behaviour is specified quite exactly.
Curiously, early Intel 386 processors had a bug where 32-bit multiplies were genuinely nondeterministic: some answers would depend on the voltage, frequency, temperature, and manufacturing conditions. The problem was essentially analog, a layout issue, where the signal didn't always have enough margin. (This is unrelated to the famous Pentium FDIV bug.) Until Intel got the problem fixed, they would stamp bad chips with "16 BIT S/W ONLY", while good chips were labeled "ΣΣ".
What I mean is that the same code running on different hardware/os may not always give the same answer. It’ll be close, but you can’t always expect bit for bit identical.
But why? We all know 0.1+0.2 won’t give 0.3 with floats but at least we should expect deterministic result for same numbers and same operations and same order, no?
I don't think that's safe at all. Catastrophic cancellation would be quite a lot less catastrophic if rounding errors were random but accurate on average.
Somewhat annoyingly, the C standard only specifies that various math functions return an approximation, but does not set any bounds on that approximation. So for many functions you could just return NaN and still be compliant.
After reducing the range, you don't want to use the Taylor series, as it gives an approximation that's really good at 0 but degrades as you move away from it.
It's better to use an interpolating polynomial (Chebychev comes to mind) over the whole target interval.
I've seen the third-order Taylor series used, but with the coefficients calculated at various offsets along a quarter wave. So you look up where you are in the quarter wave, then look up the 3 or 4 coefficients. This keeps the error somewhat bounded, since x stays small and the series doesn't diverge much.
Anyone experienced with the Remez algorithm mentioned at the end of the article?
The degree-9 polynomial, said to be a thousand times better than the original Taylor approximation in maximum error, also appears to be very close to the Taylor series in the first place.
Rounding the Taylor coefficients to 6 digits after the decimal:
1/3! = 0.166667
1/5! = 0.008333
1/7! = 0.000198
1/9! = 0.0000027(56)
The first 2 are exact, the third is 5 digits only (so 0.000190), and the fourth is more different starting from the 6th digit (0.000026019).
The delta in the 9-th order is expected if you were to truncate the Taylor series starting from the 11th order to infinity (+ x^11 / 11! - x^13/13! ...).
I don't think the polynomial given in the article was calculated via Remez. Perhaps the author merely meant it as an illustration. Here is Wolfram Alpha plotting the error of equation from the article:
You'll note that the error is small near zero and large near pi/4, which is characteristic of a Taylor series around zero, not the level equioscillation you get from Remez approximations[1]. Note also that the polynomial only includes odd terms, which is not something I would expect from Remez unless it was run on the symmetric interval [-pi/4, pi/4].
I ran Remez for the problem and after 4 iterations obtained a degree 8 polynomial with error less than 1e-9, but it didn't look anything like the polynomial given in the article.
Although of course the first few digits of the low-order matching terms will be very similar - any polynomial approximation method will agree there because they are fitting the same function, after all. But by the time we reach x^5 or x^7 the agreement is very loose, really only the first couple digits are the same.
> Perhaps the author merely meant it as an illustration.
Yes I too was expecting oscillations (since I saw Chebyshev polynomials mentioned on the wiki page). I'm beginning to think that the word "might" in their statement "One way to calculate is Remez’s algorithm. The results might look like this:" is doing a lot of heavy lifting.
A couple of notes about your results:
- The interval only needs to be [0, pi/16] and not [0, pi/4]; adjusting that in your wolframalpha query yields a similar bound to what the author lists: 3.5 e-9. (Although same observation about the change of behavior at the bounds as you made).
- The polynomial you obtained actually looks pretty close to the Taylor one: the even coefficients are small, and rounding to 5 or 6 digits might yield a close one, maybe if we look for a degree 9 polynomial instead of degree 8. (I don't have an implementation of Remez handy right now to try it myself).
> Although of course the first few digits of the low-order matching terms will be very similar
It’s a very simple iterative algorithm, essentially the dumbest thing that could possibly work (like most good algorithms). It fails to converge for functions that have poles nearby unless you have a very good initial guess (the Chebyshev or Carathéodory-Fejér approximants are ~always good starting points and easily computed). In practice you want to optimize a weighted L-inf norm rather than absolute, because floating-point errors are measured in a relative norm.
I've been doing FoC motor control for a long time and I've settled on a neat little fixed point approximation for sin/cos. I haven't been able to find the blog I got the idea from. It's accurate to 9 bits, but is very symmetric and hits 1,0,-1 exactly. It's also smooth which usually makes it better than lookup tables.
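I couldn't find that blog either, but for anyone curious, here's a well-known approximation in the same spirit (shown in floating point for readability; it maps directly onto fixed point). It's exact at 0 and ±π/2, smooth, and good to roughly 10 bits after the correction step. The parent's exact formula may well differ:

```c
#include <math.h>

/* Parabolic sine approximation, valid for x in [-pi, pi]: a parabola through
   sin's zeroes and peak, then one correction step that pulls the curve
   toward the true shape. Max error is on the order of 1e-3. */
float sin_parabolic(float x) {
    const float PI = 3.14159265f;
    const float B  =  4.0f / PI;          /*  4/pi   */
    const float C  = -4.0f / (PI * PI);   /* -4/pi^2 */
    const float P  = 0.225f;              /* empirically chosen correction weight */

    float y = B * x + C * x * fabsf(x);   /* first parabola */
    return P * (y * fabsf(y) - y) + y;    /* correction: blend y with y*|y| */
}
```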
Just using the 32 entry LUT and the small-angle approximations (sin(x) = x, cos(x) = 1-x^2/2) lets you calculate sin(x) within +/- 0.00015, which is rather impressive for something that can be done quickly by hand.
If you use cos(x) = 1 then you are still accurate to within 1%
[edit]
I think I also found an error in TFA? It seems like picking the "best" N would allow r to be in the range -pi/32 < r < pi/32, which makes the 3rd order Taylor series have an error of 4e-12, significantly better than the error range for 0 < r < pi/16
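Here's a rough reconstruction of that LUT + small-angle scheme (my own sketch, not TFA's code), using the nearest-N reduction so |r| <= pi/32; the error works out to roughly ±1.6e-4, dominated by the sin(r) ≈ r term:

```c
#include <math.h>

#define PI_C  3.14159265358979323846
#define STEP  (PI_C / 16.0)     /* 32 table entries cover one full turn */

static double sin_tab[32], cos_tab[32];

/* Call once; in a real implementation these would be precomputed constants. */
static void build_tables(void) {
    for (int i = 0; i < 32; i++) {
        sin_tab[i] = sin(i * STEP);
        cos_tab[i] = cos(i * STEP);
    }
}

/* sin(x) via sin(a + r) = sin(a)cos(r) + cos(a)sin(r), where a = k*pi/16
   comes from the table and r uses the small-angle approximations. */
static double sin_lut(double x) {
    long   k   = lround(x / STEP);        /* nearest multiple of pi/16 */
    double r   = x - k * STEP;            /* |r| <= pi/32 */
    int    idx = (int)(((k % 32) + 32) % 32);
    double sr  = r;                       /* sin(r) ~ r         */
    double cr  = 1.0 - 0.5 * r * r;       /* cos(r) ~ 1 - r^2/2 */
    return sin_tab[idx] * cr + cos_tab[idx] * sr;
}
```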
CORDIC is pretty obsolete, AFAIK. Its advantage is that its hardware requirements are absolutely tiny: two (?) accumulator registers, plus hardware adders and shifters—I think that's all. No multiplication needed, in particular. Very convenient if you're building things from discrete transistors, like some of those earlier scientific calculators!
(Also has a nice property, apparently, that CORDIC-like routines exist for a bunch of special functions and they're very similar to each other. Does anyone have a good resource for learning the details of those algorithms? They sound elegant).
CORDIC still is used in tiny microcontrollers (smaller than Cortex-M0) and in FPGAs when you are very resource-constrained. Restricting the domain and using Chebyshev/Remez is the way to go pretty much everywhere.
I recently came across CORDIC in a fun read I just finished, 'Empire of the Sum' by Keith Houston. It's a history of calculators, and there's a chapter on the HP-35, which used CORDIC. The author goes into some detail about how it was used by that constrained device.
There are a lot of references given in the book including more details on CORDIC.
CORDIC doesn't make sense if multiplies have a similar cost to adds. If you have cheap multiplications then a successive approximation is faster. If multiplications are expensive relative to adds then CORDIC can still generally win. Basically this is only the case nowadays if you are doing ASICs or FPGAs.
I have used CORDIC to compute fixed-point log/exp with 64 bits of precision, mostly because I wanted to avoid having to implement (portable) 64x64->128 bit integer multiplication in C. It is probably still slower than doing that, but the code was really simple, and very accurate.
Multiplication is pretty much needed in CORDIC! And it is far from obsolete! It works perfectly fine, and doesn't have any of the problems mentioned in the article.
CORDIC doesn't use multipliers. That's the whole appeal for low performance hardware since it's all shifts and adds. It can still be useful on more capable platforms when you want sin and cos in one operation since there is no extra cost.
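For readers who haven't seen it, here's a minimal fixed-point CORDIC sketch in rotation mode; the loop body is nothing but table lookups, adds, and shifts, and one pass yields sin and cos together. (The angle table and gain are filled via libm here for brevity; in hardware they would be baked-in constants. Assumes arithmetic right shift of negative values, which every mainstream compiler provides.)

```c
#include <stdint.h>
#include <math.h>

#define ITERS 16
#define Q     16                           /* Q16.16 fixed point */

static int32_t atan_tab[ITERS];            /* atan(2^-i) in Q16.16 radians */
static int32_t inv_gain;                   /* 1/K = prod cos(atan(2^-i)) */

static void cordic_init(void) {
    double k = 1.0;
    for (int i = 0; i < ITERS; i++) {
        atan_tab[i] = (int32_t)lround(atan(ldexp(1.0, -i)) * (1 << Q));
        k *= cos(atan(ldexp(1.0, -i)));
    }
    inv_gain = (int32_t)lround(k * (1 << Q));
}

/* angle in Q16.16 radians, |angle| <= pi/2; writes sin and cos in Q16.16. */
static void cordic_sincos(int32_t angle, int32_t *s, int32_t *c) {
    int32_t x = inv_gain;                  /* start at (1/K, 0) so the gain cancels */
    int32_t y = 0;
    int32_t z = angle;
    for (int i = 0; i < ITERS; i++) {
        int32_t dx = x >> i, dy = y >> i;
        if (z >= 0) { x -= dy; y += dx; z -= atan_tab[i]; }
        else        { x += dy; y -= dx; z += atan_tab[i]; }
    }
    *c = x;
    *s = y;
}
```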
Actually, pretty much everyone implements double-precision sin/cos using the same (IIRC) pair of 6th-order polynomials. The same SunPro code exists unchanged in essentially every C library everywhere. It's just a fitted curve, no fancy series definition beyond what appears in the output coefficients. One polynomial for the "mostly linear" segment where the line crosses the origin and another for the "mostly parabolic" peak of the curve.
yeah, but 52 adds can be a lot cheaper than a few multiplies, if you're making them out of shift registers and logic gates (or LUT). in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation.
> in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation
Moving data is indeed expensive, but there’s another reason to not care. Modern CPUs take same time to add or multiply floats.
For example, the computer I’m using, with AMD Zen3 CPU cores, takes 3 cycles to add or multiply numbers, which applies to both 32- and 64-bit flavors of floats. See addps, mulps, addpd, mulpd SSE instructions in that table: https://www.uops.info/table.html
It's great for hardware implementations, because it's simple and you get good/excellent accuracy. I wouldn't be surprised if that's still how modern x86-64 CPUs compute sin, cos, etc.
That said, last time I had to do that in software, I used Taylor series. Might not have been an optimal solution.
> EDIT: That's a paper for a software library, not the CPU's internal implementation.
Unless you're seeing something I'm not, it's talking about x87, which hasn't been anything other than 'internal' since they stopped selling the 80486sx.
I was truly amazed when my high school computer science teacher expanded the sine function into a Taylor series for us to implement in class.
Couldn't wrap my head around the concept until the topic was revisited in college, but the idea was there, and it helped me understand the tradeoffs brought by the "fast math" libraries I was using during high school.
That would be quite a big lookup table... Half of all float values are in the -1…+1 range. You don't need the negative part, so that's a quarter of all floats, over one billion numbers. And then some more to get from 1 to π/2.
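You can count them directly, since non-negative IEEE-754 floats order the same way as their bit patterns:

```c
#include <stdio.h>
#include <string.h>

/* The number of single-precision values in [0, 1) is simply the bit pattern
   of 1.0f, because non-negative floats and their bit patterns sort identically. */
int main(void) {
    float one = 1.0f;
    unsigned bits;
    memcpy(&bits, &one, sizeof bits);
    printf("%u floats in [0, 1)\n", bits);   /* 1065353216: just over a billion */
    return 0;
}
```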
I’m brushing up my math basics (I graduated CS while dodging the math requirements) and it frustrates me that in trig I need to remember values at all. The values such as sqrt(2)/2 make sense but how hard is it to calculate sin(5 degrees) by hand?
Use a Taylor series with a four function calculator.
0 is a decent approximation of sin near 0.
x is a better one.
x - x^3/6 is an even better one.
x - x^3/6 + x^5/120 ...
Note that x here is in radians rather than degrees so convert (degrees * pi/180) first. Repeat until you're satisfied with how many stable digits you get
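The same hand procedure as a tiny program, for checking your work (sin(5°) ≈ 0.08716):

```c
#include <stdio.h>

/* Convert degrees to radians, then keep adding Taylor terms; each new term
   is the previous one times -x^2 / ((2n)(2n+1)). */
double sin_taylor_deg(double degrees) {
    double x = degrees * 3.14159265358979323846 / 180.0;
    double term = x, sum = x;
    for (int n = 1; n < 10; n++) {
        term *= -x * x / ((2.0 * n) * (2.0 * n + 1.0));
        sum += term;
    }
    return sum;
}

int main(void) {
    printf("sin(5 deg) ~ %.9f\n", sin_taylor_deg(5.0));   /* 0.087155743 */
    return 0;
}
```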
I remember seeing the source code for a version of SpaceWar! running on an Adage Graphics Terminal (a one's complement machine) around 1970 that used a precomputed sine table.
I wonder what the first program was that ever used a precomputed trig table.
I thought modern CPUs since the late 1980's used a lookup table for trig/transcendental functions. Is the LUT just an expansion of the polynomial? I never really understood how FPUs worked...
That would take way too much room. A full lookup table would have 2^64 entries of 64 bits each, at 2^70 bits of ROM.
For comparison:
- Apple’s M2 Ultra has about 134 billion transistors. That’s about 2^38.
- Avogadro’s number is about 2^79.
Reducing the argument to a small range around zero decreases that a lot, but not nearly enough. There are 2^52 doubles in [0.5, 1.0), 2^52 more in [0.25, 0.5), 2^52 more in [0.125, 0.25), etc., so you'd still easily need 2^52 entries, or 2^58 bits (and likely way, way more).
“It is implemented using a programmable logic array with 2,048 cells, of which 1,066 cells should have been populated with one of five values: −2, −1, 0, +1, +2.”
Not sure what you’re trying to demonstrate, they wouldn’t store every single float!! I hope you don’t program. ;)
Older CPUs generally have used CORDIC (which does use LUT but that's only a part of the algorithm) due to its simplicity and compactness, while later CPUs with extensive microcode support would do the same thing as software implementations.
Given the ridiculous number of transistors and so on we can use in CPUs and GPUs nowadays how feasible is a relatively huge trig lookup table burned into rom?
A lookup table (for any function) that covered all values between 0 and 1 in single precision would be ~4 GB; there are roughly a billion single-precision values between 0 and 1, and each result is 4 bytes.
Such a table for double-precision would be much, much larger.
It would also be extremely inaccurate. The x^n numerators grow very quickly and digits get lost because unlimited precision isn't available. Likewise, the n! denominators also grow rapidly. Then the series is alternating which means cancellation is happening for every added term.
That still makes it easier than computing functions whose series are not globally convergent, like the inverse trig functions. Obviously, you would have to break the range apart to speed convergence.