Small Strings in Rust: smolstr vs. smartstring (2020)

unshavedyak · on Aug 24, 2024

On the note of small strings, Compact String[1] was i believed released after this article and has a nifty trick. Where Smol and Smart can fit 22 and 23 bytes, CompactStr can fit 24! Which is kinda nutty imo, that's the full size of the normal String on the stack.. but packed with actual string data.

There's a nice explanation on their readme[2]. Love tricks like this.

[1]: https://github.com/ParkMyCar/compact_str

[2]: https://github.com/ParkMyCar/compact_str?tab=readme-ov-file#...

parkmycar · on Aug 24, 2024

Hey I'm the author of compact_str, thanks for the kind words!

Fun fact, it was this fasterthanlime post that originally inspired me to play around with small strings and led to the creation of compact_str.

nextaccountic · on Aug 25, 2024

Do you think it is possible to integrate this with an existing string interner in the Rust ecosystem? Or one would need to roll their own?

The goal would be to compare two strings using only the 24-byte values, without touching the allocated part (if it points into a string larger than 24 bytes)

The first thing that makes me think this is impossible is that you only define an owned variant for your string type. But your can cleverly wrap a &'static str rather than allocate on heap, so if the interner can give a &'static str I think this could work, with some boilerplate. Namely: for strings smaller than 24 bytes don't intern and build a CompactString inline, for larger strings, intern, get a &'static str, then build a CompactString out of it. Then, two check if two strings are equal, you just compare its 24 bytes, and don't need to touch the interner for this at all.

However this only works if the interner actually leaks memory. If it returns you a &'a str that isn't 'static, then you can't store it on CompactString, unless your lib also provided a borrowed variant.

Also, to think about it, since interned strings are immutable (they are not a "string builder" like String is), you don't really need the capacity value for them, just the length. So it suggests a 16 bytes size, not 24. (but one could imagine an application where it's optimal to store 24 bytes inline rather than 16 anyway)

I think that this could be achieved with an 16 bytes &str wrapper, maybe something like &CompactStr, that works just like your CompactString, but wraps a &str instead (and offers no owned variant). Maybe such a thing could conceivably be included in the compact_string crate?

(Maybe make it 24 bytes even if it "wastes" some bytes, just to make it bitwise compatible with CompactString, so that borrowing CompactString into &CompactStr is zero cost - and then just zero out the remaining 8 bytes when a &CompactStr is stored on the heap)

[0] I was reading this post https://dev.to/cad97/string-interners-in-rust-797 that was written in response to this fasterthanlime post, but it contrasts interner with small string optimization, when you actually could have both!

eslaught · on Aug 25, 2024

I think the byteyarn library mentioned elsewhere in this thread meets your requirements:

https://mcyoung.xyz/2023/08/09/yarns/

nextaccountic · on Aug 25, 2024

This is perfect, thanks!

... well except it stores up to 15 bytes inline rather than using the UTF-8 trick to cram an extra byte. (But maybe it can't; or at least, ByteYarn really can only store 15 bytes, even though Yarn could store 16 bytes maybe, but this would pose problems when converting between those types because I suppose this conversion is zero cost)

nicoburns · on Aug 25, 2024

You don't get thw full 24 bytes inline, but the string_cache crate does this.

tomcam · on Aug 24, 2024

I've been wondering something for ages. Where did you get the 24 byte number, and how does it compare in Unicode terms? That is, did you analyze a large corpus and determine that 24 bytes was right for the largest number of strings? And does it come out to, say, 10 Unicode characters? Whenever I think about designing a new language, this very issue pops up.

eslaught · on Aug 25, 2024

To add some more detail to sibling's answer:

The optimal size will depend on the application. It's certainly reasonable that in some applications, many/most strings would be under 24 bytes and thus the small string optimization in many of these implementations would be beneficial. Perhaps in some other application strings are closer to 32 bytes (or something else) and then a larger stack size would be warranted. And in yet other applications, strings are large and no small string optimizations will make any difference, and if anything will slow the application down with unnecessary bookkeeping.

I do find it surprising that none of the implementations in the various comments linked in this thread seem to provide user-tunable sizes; or at least I haven't seen it. Because I can certainly imagine cases where the optimal size is > 24.

aw1621107 · on Aug 24, 2024

This style of small string optimization tries to take up the same amount of space on the stack as a "normal" heap-allocated string. On 64-bit platforms that is 24 bytes: 8 bytes for the pointer to the heap allocation, 8 bytes for the number of characters/bytes in the string, and 8 bytes for the allocation capacity.

It's quite possible to make the small string buffer larger, but that comes at the cost of the large string representation taking up more space than necessary on the stack. IIRC libstdc++ does this, which makes its std::string take up 32 bytes on the stack.

Dylan16807 · on Aug 25, 2024

> It's quite possible to make the small string buffer larger, but that comes at the cost of the large string representation taking up more space than necessary on the stack. IIRC libstdc++ does this, which makes its std::string take up 32 bytes on the stack.

Though to follow through on that, 24 bytes is also more than necessary. You don't have to be very clever to shrink your string size to 16 bytes (6 for the pointer, 6 for the size, 4 to store either capacity or spare capacity as floating point).

aw1621107 · on Aug 25, 2024

True, though as always that comes with tradeoffs of its own.

tomcam · on Aug 25, 2024

Oh! So the string itself is still on the heap? I assumed it was all on the stack.

super_flanker · on Aug 25, 2024

No, let me try to explain differently. If `compact_str` was not used, then your normal `String` would take 24 bytes of stack space (regardless of the string size) + heap space. What `compact_str` is trying to do is not use heap when string content is less than 24.

tomcam · on Aug 25, 2024

Thank you. That aligns with what I originally thought.

tomcam · on Aug 25, 2024

Not sure why the downvotes? This is a sincere question.

tialaramex · on Aug 25, 2024

For a lot of people it will seem obvious that String (Rust's standard library growable string type) is 24 bytes. CompactString is 24 bytes so that it is exactly the same size as String, that's the main idea. That's why they may not have believed you were sincere in asking.

Why is Rust's String 24 bytes? Well, this is a growable string type. So we need to store 1) where on the heap our growable buffer is and 2) how big that heap space is ("Capacity"), and 3) how much of it we're using already to store the string ("Size" or "Length"). On modern 64-bit computers it's reasonable to use a 64-bit (8 byte) value for all three of these facts, 3 times 8 = 24.

In fact Rust (unlike C++) insists on doing this the simplest and fastest way possible so the String type is more or less literally Vec<u8> - a growable array of bytes, plus a guarantee that those bytes are UTF-8 encoded text. Vec<u8> is likewise 24-bytes.

The rationale is that being simple is a good fundamental design, and (as CompactString illustrates) people can build more sophisticated types if they want to unlock specific optimisations which may suit their application.

tomcam · on Aug 25, 2024

Fantastic, thanks. Keeping this answer for reference.

hvenev · on Aug 25, 2024

Is 26 characters next? I think the number of bytestrings of length <= 26 that are also valid UTF-8 is only 0.021 * 256^24.

returningfory2 · on Aug 25, 2024

I think the problem you encounter with this is that you can no longer reference the small string as a &str.

conaclos · on Aug 24, 2024

compact_str is a fantastic crate, used by many projects. Do you know the byteyarn crate [0]? This could be nice to add this to the `Similar Crates` section if it makes sense.

[0] https://docs.rs/byteyarn/latest/byteyarn/

parkmycar · on Aug 24, 2024

I do, mcyoung wrote a great blogpost[1] about it! Good idea, I’m AFK at the moment but will add it to the ‘Similar Crates’ section once I’m back

[1] https://mcyoung.xyz/2023/08/09/yarns/

fasterthanlime · on Aug 25, 2024

That’s amazing! I’m so glad you did :)

dataflow · on Aug 24, 2024

Note for C++ developers: Their trick is only possible because the strings are UTF-8 and not null-terminated. It wouldn't work as a drop-in for standard strings in C++.

neonsunset · on Aug 24, 2024

Null-terminated strings really are a mistake. Make vectorized algorithms problematic by forcing them to account for page size and paged memory in general as well as always scan for NUL, cannot be easily sliced without re-allocation, are opposite to how languages with better string primitives define them and in general don't save much by passing a single pointer over ptr + length.

andersa · on Aug 25, 2024

It's not really possible to get rid of them in C++ however, given a staggering amount of legacy APIs that require them. Constantly converting every time you have to call a system API with your string is even worse.

anp · on Aug 25, 2024

FWIW Rust also has null-terminated strings available under its std::ffi module. I’m not sure it would be feasible to migrate C++ to the approach of multiple string types now, and I’m not sure it would have been the right approach for C++11 given C++’s approach to interop, but it’s definitely possible to support interoperability with those legacy APIs without constraining a default string type to null-termination.

kloop · on Aug 24, 2024

std::string isn't null terminated (or at least it isn't guaranteed to be, I don't think it's forbidden for an implementation to do that).

That's why the c_str method exists, so you can get a pointer to a null terminated character array

dataflow · on Aug 24, 2024

> std::string isn't null terminated

It is as of C++11. The constness of c_str() threw a wrench into that as soon as C++ got a threading model.

unclad5968 · on Aug 24, 2024

std::string is guaranteed to be null terminated since c++11.

std::string::c_str returns the same address as &string[0]

dataflow · on Aug 24, 2024

> std::string::c_str returns the same address as &string[0]

Note that this by itself doesn't imply null-termination, though as you say strings are indeed null-terminated now.

Edit: This is not particularly obvious, but the reason this has nothing to do with allocation or the return value is that the implementation could still leave space for the null terminator, but avoid actually setting it to zero until c_str() is invoked. That would neither affect the returned pointer nor the constant-time guarantee.

dietr1ch · on Aug 25, 2024

Shouldn't c_str() be null terminated? Then if it must point to the actual backing store, then that must also be a valid null terminated C string as there's no other way to use them.

dataflow · on Aug 25, 2024

> Shouldn't c_str() be null terminated?

Yes, by definition.

> Then if it must point to the actual backing store, then that must also be a valid null terminated C string as there's no other way to use them.

No, this doesn't follow. The buffer could leave room for the terminator but not actually write the terminator until c_str() is called. That's why the C++11 standard had to explicitly require that the string always be null-terminated regardless of whether c_str() is called.

dietr1ch · on Aug 25, 2024

Wow, I didn't expect `c_str()` to write at all. I knew that C++ was wild before C++57, but before C++ is insane.

Thanks for the explanation!

jcelerier · on Aug 25, 2024

I don't think any implementation ever did this. But it was "technically possible" as the spec didn't prevent it so you have to account for it.

masspro · on Aug 24, 2024

Why downvotes? Ignoring C++11 changes for a minute, c_str is absolutely able to cause a reallocation if it needs to, such that after it returns, its return value and the value of data() are the same. Ergo no, the GP statement alone doesn’t imply anything.

dpassens · on Aug 24, 2024

But the return value guarantee is, as far as I can see, a C++11 change, so you can't ignore those. And as of C++11, c_str is noexcept, so it cannot allocate anything.

dataflow · on Aug 24, 2024

Please see my edit.

searealist · on Aug 24, 2024

I think std::string is actually required to be null terminated now in the latest standards. But even before it was basically required as that was the only way to make c_str() constant time.

dataflow · on Aug 24, 2024

> But even before it was basically required as that was the only way to make c_str() constant time.

Note that c_str() could've inserted the terminator in constant time, as long as the string kept space reserved for that. So this wouldn't have violated constant-time-ness, and it's not an insane thing to do considering it would save some instructions elsewhere. But yeah, the flexibility wasn't all that useful even in the beginning, and became even more useless as soon as C++ incorporated threading, due to the constness of the function.

mananaysiempre · on Aug 24, 2024

More of a knock against C than C++, seeing as today’s C++ tends to prefer `string_view`s or at worst iterator pairs, neither of which use null-termination. (Saying that as someone who prefers C to C++ and avoids Rust exactly because of “better C++” vibes—I don’t want a better version of C++, if anything I want a much much simpler one that’s worse at some things.)

That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload (so 23 useful ones, the best you could hope for with null termination) into the same structure the same way by storing the compact version with null termination and ensuring the last byte is also always zero. For extra style points, define the last byte to be 24 minus payload length instead so you don’t need to recompute the length in the inline case.

To be clear: none of this makes null termination not dumb.

aw1621107 · on Aug 24, 2024

> That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload (so 23 useful ones, the best you could hope for with null termination) into the same structure the same way by storing the compact version with null termination and ensuring the last byte is also always zero.

I want to say libc++ and maybe MSVC do something along those lines in their std::string implementations.

> For extra style points, define the last byte to be 24 minus payload length instead so you don’t need to recompute the length in the inline case.

IIRC Facebook's FBString from Folly does (did?) that?

tialaramex · on Aug 24, 2024

> I want to say libc++ and maybe MSVC do something along those lines in their std::string implementations.

Here's Raymond Chen on this topic earlier this same year. As a bonus since you're looking at this in August it's more or less correct now, whereas when it was published it had numerous serious errors. Whether the standard library implementation of such an important type should be so complicated that an expert makes numerous errors is another question...

https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=10...

So, libc++ gets closest, 1 flag byte + 22 bytes of text + 1 byte of ASCII NUL = 24 bytes

The others are much worse, larger (32 bytes on modern computers) yet with lower SSO capacity (15 bytes of text).

aw1621107 · on Aug 24, 2024

Huh, I thought MSVC had a libc++-style SSO implementation. Now I have to wonder where I got that mistaken impression :(

dataflow · on Aug 24, 2024

> That said, I don’t see why it wouldn’t be possible to cram in 24 bytes of null-terminated payload

Note that I didn't say this is impossible, just that the given trick wouldn't work.

However, this is impossible for general strings. The only way could possibly make this work is if you constrain the inline string somehow (e.g., to UTF-8), so that some shorter strings failing that constraint are forced to go on the heap too. Otherwise you have 1 fixed zero byte at the end, and 23 fully flexible bytes, leaving you no way to represent an out-of-line string.

(Well, you could do it if you use the address as a key into some static map or such where you shove the real data, but that's cheating and beside the point here.)

Dylan16807 · on Aug 25, 2024

Think that through again. An inline string needs a fixed 0 byte at the end. A heap string does not. Therefore if the last byte is anything other than 0 you have a heap string.

Inline strings only use 0.4% of your possible values.

dataflow · on Aug 25, 2024

Oof, you're right, thank you. In my mind the last byte was obviously zero for a heap string too, since the pointer or sizes would've had a zero upper byte. Somehow I never accounted for the fact that on 64-bit there's no need to represent it that way. Fantastic point!

silisili · on Aug 24, 2024

This is interesting, thanks.

> 0b11111110 - All 1s with a trailing 0, indicates heap allocated

> 0b11XXXXXX - Two leading 1s, indicates inline, with the trailing 6 bits used to store the length

I stared at this for too long, as it allows collision. Then I realized you'd never set the third bit, it should probably have been written 0b110XXXXX and recorded that 5 bits are used for length. Right or did I understand it wrong?

tialaramex · on Aug 24, 2024

Probably this isn't helpful anyway - what's actually going on is more complicated and is explained later at a high level or I'll try now:

Rust has "niches" - bit patterns which are never used by that type and thus can be occupied by something else in a sum type (Rust's enum) which adds to that type. But stable Rust doesn't allow third parties to promise arbitrary niches exist for a type they made.

However, if you make a simple enumeration of N possibilities that automatically has a niche of all the M-N bit patterns which weren't needed by your enumeration in the M value machine integer that was chosen to store this enumerated type (M will typically be 256 or 65536 depending on how many things you enumerated)

So, CompactString has a custom enum type LastUtf8Char which it uses for the last byte in its data structure - this has values V0 through V191 corresponding to the 192 possible last bytes of a UTF-8 string. That leaves 64 bit patterns unused. Then L0 through L23 represent lengths - inline strings of length 0 to 23 inclusive which didn't need this last byte (if it was 24 then that's V0 through V191). Now we've got 40 bit patterns left.

Then one bit pattern (the pattern equivalent to the unsigned integer 216) signifies that this string data lives on the heap, the rest should be interpreted accordingly, and another (217) signifies that it's a weird static allocation (I do not know why you'd do this)

That leaves 38 bit patterns unused when the type is finished using any it wanted so there's still a niche for Option<CompactString> or MyCustomType<CompactString>

parkmycar · on Aug 24, 2024

Author of compact_str here, you hit the nail on the head, great explanation!

> ... and another (217) signifies that it's a weird static allocation (I do not know why you'd do this)

In addition to String Rust also has str[1], which is an entirely different type. It's common to represent string literals known at compile time as `&'static str`, but they can't be used in all of the same places that a String can. For example, you can't put a &'static str into a Vec<String> unless you first heap allocate and create a String. We added the additional variant of 217 so users of CompactString could abstract over both string literals and strings created at runtime to solve cases like the example.

[1]: https://doc.rust-lang.org/std/primitive.str.html

tialaramex · on Aug 24, 2024

Thanks! And the explanation of 217 makes sense too.

Since I have you here, wouldn't it be better to name that type "LastByte" or something? It's not a (Rust) char, and it's not necessarily UTF-8 whereas it is definitely the last byte.

parkmycar · on Aug 24, 2024

Ha, naming is hard! You’re totally right, it used to be just the values of the last byte of a UTF-8 char (and before that it was NonMaxU8) but now represents more. I’ll update it once I’m back at my computer, thanks!

tialaramex · on Aug 24, 2024

Ah, while you're in there, IIRC you mention NonMaxU8 in the code still and probably that's worth updating at the same time.

shikon7 · on Aug 24, 2024

Maybe you need it if pointers are 16 bytes?

mastax · on Aug 24, 2024

It’s nice that rustc’s niche optimization lets smolstr be implemented with a simple enum, rather than having to do some unsafe union bit packing[0]. The only concession that had to be made to the compiler is using an enum for the InlineSize value to show that the last 3 bits of that aren’t used.

[0]: https://github.com/rust-analyzer/smol_str/blob/fde86a5c0cb8f...

conaclos · on Aug 24, 2024

Readers who like this article may also like a more recent one [0]. It designs a compact string with extra capabilities. The crate was released under the name byteyarn [1]

[0] https://mcyoung.xyz/2023/08/09/yarns/

[1] https://docs.rs/byteyarn/latest/byteyarn/

masklinn · on Aug 24, 2024

[2020]. I wouldn't be surprised if the field had changed a fair bit since.

brigadier132 · on Aug 24, 2024

Ok, I dived into the rabbit hole a bit and found this:

https://github.com/rosetta-rs/string-rosetta-rs?tab=readme-o...

Click the charts for actually readable data.

From a quick scan, it seems like hipstr is the current fastest string implementation and is actively maintained.

epage · on Aug 24, 2024

Also, smol_str was removed from the comparison because matklad, the author of smol_str, suggests ecow

> I’d rather say the opposite: users of those crates should switch to ecow. It is exactly what smol_str would have been, if it were a proper crate with a stable API, rather than an implementation detail of rust-analyzer. > > It’s a drop-in replacement for String, with O(1) clone and SSO, and I believe this is all you need. Other crates either have needlessly restricted API (no mutation), questionable implementation choices, or a bunch of ad hoc traits in the API.

https://www.reddit.com/r/rust/comments/117ksvr/ecow_compact_...

skavi · on Aug 24, 2024

Those benchmarks aren’t great.

A real world-ish workload made generic over all these types would provide much more interesting results. Especially if heap usage was also tracked.

epage · on Aug 24, 2024

Real world benchmarks also wouldn't be great because they would be showing how well it works in someone else's program, rather than yours.

This at least gives you an idea of what the relative cost of different operations are so you can consider what are frequent operations in your program and then benchmark a couple from there, rather than everything.

dathinab · on Aug 24, 2024

the author has some recommendations, to quote formatting):

> Suggestions:

>

> Generally, String

> If you deal mostly with string literals but want some flexibility (like clap), generally you'll want Cow<'static, str>

> If a profiler says your strings are a problem:

> Try different crates and settings for that crate out with a profiler

> O(1) clones are important when doing a lot of clones.

> For one-off allocations, they are slower.

> For short-lived programs, look into string interning

i.e. author thinks you should profile with your own use case

EDIT: Not sure why HN removes indention between a leading > and the start of the text, makes quoting unnecessary hard

masklinn · on Aug 25, 2024

> Not sure why HN removes indention between a leading > and the start of the text, makes quoting unnecessary hard

Because HN’s markup is the most dreary, half-assed, and useless one in the history of lightweight markups.

HN will remove a bunch of characters it doesn’t like because they might be used for fun, parse two+ newlines as a paragraph break, 4-indent as a literal block, and mis-parse * as emphasis.

That’s it. So you get all the drawbacks of HTML with almost none of the tools provided by even just the original markdown. No quoting, no line breaks, no lists, no inline code, …

wizzwizz4 · on Aug 25, 2024

2-indent is a literal block. 4-indent also is, but only because it's a 2-indent.

HN markup also supports \-escapes (so you can write \*hello\* to get *hello*), but they grow linearly, not exponentially, with repeated escapes (\\\* becomes \\*). It also truncates long URLs in display, and it handles a single pair of parentheses in them (so you can write https://example.com/print(%22Hello,%20world%21%22)). But I think that's all.

samatman · on Aug 25, 2024

HTML doesn't do indentation. I typed four spaces between every word in these sentences.

dathinab · on Aug 25, 2024

but HN doesn't provide me with a HTML input field but a text input field and HTML has stuff like <blockquote> absent from HN input format and properly converting user input is a bread and butter task of anything displaying user content

so HTML collapsing whitespace is really meaningless for the argument IMHO

IshKebab · on Aug 24, 2024

And has the best name!

abhorrence · on Aug 24, 2024

Sadly it seems like some of the images have broken since it was originally posted. :(

01HNNWZ0MV43FF · on Aug 24, 2024

This capture looks a little better. Even the CSS looks off on the live site. Maybe someone should ping him? I think he's on Mastodon https://web.archive.org/web/20201002084042/https://fastertha...

cute_boi · on Aug 24, 2024

Actually previous design was better to my eyes.

pronoiac · on Aug 24, 2024

I emailed him.

fasterthanlime · on Aug 24, 2024

Thanks for the heads up!

I fixed the SVG graphs and made a couple layout updates based on the feedback here and some earlier feedback from my subreddit, see the before-after here.[1]

I've been mostly focused on function lately, the redesign is, let's say, a work in progress.

(Oh also, I use they/them pronouns these days [2])

[1] https://imgur.com/a/before-after-2024-08-24-PAmeHFX [2] https://fasterthanli.me/articles/state-of-the-fasterthanlime...

crowdstriker · on Aug 28, 2024

You're a twat for correcting people about your silly pronouns.

weinzierl · on Aug 24, 2024

Here [1] is a nice talk that discusses various options and trade-offs for small string and small vector optimization in Rust.

[1] https://m.youtube.com/watch?time_continue=2658&v=tLX_nvWD738...

dathinab · on Aug 24, 2024

on interesting thing to realize is that some small string types go beyond just the basic small len storage optimization

- compact_str e.g. depends on the string being valid utf-8, and in turn has larger short strings

- smol_str e.g. is a a enum over `[u8; CAP] | &'static str | Arc<str>` this means it avoids any allocations for static strings and has very fast clones, leading to similar perf. characteristics as string internalization in some use cases (like the use-cases it was designed for). But at the cost of it being immutable only and the heap allocation being slightly larger for the Rc.

Other interesting differences can be the handling of shrinking mutable Strings, do you re-inline it or not? What is better here is highly use-case dependent.

In the end there are many design decisions where there is no clear winner but it's a question of trade off with use-case specific preferences.

codedokode · on Aug 25, 2024

I wonder, is there a "string-like interface" in Rust or one has to rewrite all the code when changing string implementation? Also if you want to change implementation in half of the code, is there automatic convertion between implementations?

duped · on Aug 25, 2024

This is what the `AsRef` trait is great for

    impl AsRef<str> for MyCoolString {
        fn as_ref(&self) -> &str {
            ...
        }
    }

To be totally defensive, other library authors can write their APIs like

    fn my_cool_function(s: impl AsRef<str>) {
        let s: &str = s.as_ref();
    }

and now `s` is treated like any other &str. But even if library authors don't program defensively and use `s: &str` in the argument type, callers can always do this:

    my_cool_function(my_cool_string.as_ref());

dubi_steinkek · on Aug 25, 2024

It's possible but I would prefer to just pass &str at that point, leading to less monomorphizations and a simpler API

nextaccountic · on Aug 25, 2024

For read-only strings you pass around &str usually if you can borrow at all, and all UTF-8 string types should be compatible with it. AsRef is just a trait for "all types that can be converted into &str as needed" so if you use AsRef you need to have immutable strings too

But the thing about String is that it allows you to push data at the end (like Java's StringBuilder). I think there is no commonly used trait for that, no.

Note that not all owned string types can be appended at the end. Box<str> for example is just like String but without a .push_str() method. (This saves some bytes because it doesn't need a spare capacity)

If your string is not UTF-8 then usually you convert it at API boundaries, to deal with UTF-8 stuff internally (or at least WTF-8)

jtrueb · on Aug 24, 2024

I made humanize-bytes[1] for that formatting reason (1000 vs 1024). Coincidentally, it uses smartstring to avoid allocations.

[1] https://crates.io/crates/humanize-bytes

avinassh · on Aug 24, 2024

slightly related: earlier I was experimenting with Rust string allocations, even though my code did almost same as standard library, the heap allocations were taking 10x of time. Relevant code:

  fn alloc_string(s: &str) -> NonNull<u8> {
      let boxed_slice = s.as_bytes().to_owned().into_boxed_slice();
      NonNull::new(Box::into_raw(boxed_slice) as *mut u8).unwrap()
  }

Approximately, if stdlib was taking 1µs, but my code was about 14-15µs for large strings. Profiling also did not help. Anyone have any guesses? Here is the full code: https://github.com/avinassh/string-alloc

hummingly · on Aug 24, 2024

When you call to_owned, it creates a Vec<u8>. This must then be truncated due to the call to into_boxed_slice. The documentation says this could potentially mean a re-allocation depending on the allocator. When this happens the old allocation needs to be dropped too. You could try using a different allocator but before this I would recommend to replace those two calls with Box::from(s.as_bytes()).

skitter · on Aug 25, 2024

The vec already gets allocated with capacity == length

https://doc.rust-lang.org/nightly/src/alloc/slice.rs.html#16...

nextaccountic · on Aug 25, 2024

(2020)