Make C string literals const?

gustedt.wordpress.com

33 points by ingve a day ago

Modifying string literals has never worked on any platform I've run code on in the past 20 years. They're always in .rodata. I can't imagine doing this by default would be a problem except for really old code.

_kst_ 21 hours ago
The C standard, since 1989, has said that attempting to modify the array object corresponding to a string literal has undefined behavior. Whether it "works" or not is not the issue.
The problem is that it's currently legal to pass a string literal to a function expecting a (non-const) pointer-to-char argument. As long as the function doesn't try to write through the pointer, there's no undefined behavior. (If the function does try to write through the pointer, the behavior is undefined, but no compile-time diagnostic is required.) If a future version of C made string literals const, such a program would become invalid (a constraint violation requiring a diagnostic). Such code was common in pre-ANSI C, before const was introduced to the language.
The following is currently valid C. The corresponding C++ code would be invalid. The proposal would make it invalid in C, with the cost of breaking some existing code, and the advantage of catching certain errors at compile time.
```
    #include <stdio.h>

    void print_message(char *message) {
        puts(message);
        // *message = '\0'; // would have undefined behavior
    }

    int main(void) {
        print_message("hello");
    }
```
- jcalvinowens 20 hours ago
  
  > Whether it "works" or not is not the issue.
  Of course it is. It doesn't work on anything modern, and thus it is impossible for portable code which actually runs in the real world and has to work to have relied on it for a long time.
  Your example is not code any competent C programmer would ever write, IMHO. Every proficient C programmer I've ever worked with used "const char *" for string literals, and called out anybody who didn't in review.
  Old code already needs special flags to build with modern compilers: I think the benefit of doing this outweighs the cost of editing some makefiles.
  - _kst_ 16 hours ago
    
    A conforming implementation could make string literals modifiable, and (obviously non-portable) code could rely on that. I don't know whether any current compilers do so. I suspect not.
    Apart from that, it's not about actually modifying string literals. It's about currently valid (but admittedly sloppy) code that uses a non-const pointer to point to a string literal. It's easy to write such code in a way that a modern conforming C compiler will not warn about.
    That kind of code is the reason that this proposed change is not just an obvious no-brainer, and the author is doing research to find out how much of an issue it really is.
    As it happens, I think that the next C standard should make string literals const. Any code that depends on the current behavior can still be compiled with C23 or earlier compilers, or with a non-conforming option, or by ignoring non-fatal warnings. And of course any such code can be fixed, but that's not necessarily trivial; making the source code changes can be a very small part of the process.
    Any change that can break existing valid code should be approached with caution to determine whether it's worth the cost. And if the answer is yes, that's great.
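    To make the "valid but sloppy" case concrete, here is a minimal sketch (the names `count_chars`, `greeting` and `copy` are made up for illustration). As far as I know, gcc and clang accept this without a diagnostic at default settings; gcc's -Wwrite-strings flags the initialization of `greeting` and the call that passes it on.
    ```c
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical legacy-style helper: takes char * but only reads. */
    static size_t count_chars(char *s) {
        return strlen(s);
    }

    int main(void) {
        char *greeting = "hello";   /* valid C today: non-const pointer to a literal */
        char copy[] = "hello";      /* a writable array initialized from the literal */

        copy[0] = 'j';              /* fine: modifies the array, not the literal */
        /* greeting[0] = 'j'; */    /* would compile, but is undefined behavior */

        printf("%zu %s %s\n", count_chars(greeting), greeting, copy);
        return 0;
    }
    ```
    Nothing here ever writes to a literal, so there is no undefined behavior. It is only the types that would stop matching if literals became const.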
    
    jcalvinowens 10 hours ago
    
    > That kind of code is the reason that this proposed change is not just an obvious no-brainer
    I don't understand your point here: I disagree this is "obvious", and I don't think I've said anything to imply that?
    > And of course any such code can be fixed, but that's not necessarily trivial; making the source code changes can be a very small part of the process
    In many cases, it's so trivial you can write code to patch the code. Often, the resulting stripped binary will be identical, so you can prove it's not even necessary to test the result! If decision makers can be made to understand that, you can do an end run around most of the corporate process that makes this sort of thing hard.
    I've spent a lot of time fixing horrible old proprietary code to use const because I think it's important: most of the time, it's very easy. I don't deny there are rats' nests that require a lot of refactoring to unwind, but that is the exception rather than the rule, in my personal experience.
    It will be vanishingly rare that code will need to be modified in a way that actually changes its runtime behavior to tolerate the proposed change.
  - ncruces 20 hours ago
    
    The most current SQLite amalgamation (3.49.1) is showing ~70 warnings when compiled with -Wwrite-strings.
    But maybe 70 warnings in 250k LoC is OK for your standards of proficiency.
    
    jcalvinowens 20 hours ago
    
    Surely you agree that is a problem that ought to be fixed in that code?
    70 warnings really doesn't sound that bad to fix. Most are probably trivial. I'm sure a few aren't.
    If nobody is around to fix it, that's what legacy flags are for.
zabzonk a day ago

Yes, but C (or C++, for that matter) has no concept of .rodata. This is something that needs to be enforced by the compiler, as it is in C++, and why C programmers should probably simply use a C++ compiler, with its much stronger type checking.
- jcalvinowens a day ago
  
  You missed the point: I'm saying it has been impossible to modify string literals forever, so enforcing const is probably a non-issue except in very old C.
  - zabzonk a day ago
    
    It is completely possible to write C code which does attempt to write to string literals.
    
    moefh a day ago
    
    The same is trivially true for C++: https://godbolt.org/z/h5znfchf8
    
    jcalvinowens 21 hours ago
    
    I was just editing my comment to add this point :)
    
    jcalvinowens a day ago
    
    No, it isn't. It will crash.
    
    moefh 21 hours ago
    
    It might crash, or it might work as naively expected, or it might do something else.
    For example, clang started simply omitting writes to data it knows to be read-only (which is allowed because these writes are undefined behavior, so anything goes). See this example[1]: `writable()` will return "*ello", but `readonly()` will just return "hello" and not crash (note its assembly doesn't include a write).
    [1] https://godbolt.org/z/MboK3hTPx
    
    jcalvinowens 20 hours ago
    
    That happens because the string is static. If you rewrite that so s is an argument to writable(), it will segfault.
    Although, I am curious if that optimization could happen across compilation units via LTO...
    
    moefh 20 hours ago
    
    I'm not sure what you mean. In `writable()`, there's no read-only data; `s` is a non-const char array (it has to be static because the function returns a pointer to it). The string literal is only there to tell the compiler how to initialize the array, `s` is not actually the string literal.
    If you change `writable()` to receive a `const char *` (and then cast it to `char *` to write), then clang will be forced to compile it with a store (even though it sees you storing to a `const char *`) because it doesn't know if the function will be called with a pointer to actual read-only data or just a pointer to writable data that was gratuitously converted to `const`.
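    A sketch of the distinction, reconstructed from the description above rather than copied from the godbolt link. `writable()` and `s` are the names used above; `literal_pointer()` is a made-up name for illustration.
    ```c
    #include <stdio.h>
    #include <string.h>

    /* `s` is an ordinary writable array; the literal only supplies its
       initial contents, so storing into it is well defined. */
    char *writable(void) {
        static char s[] = "hello";
        s[0] = '*';
        return s;
    }

    /* Here the pointer refers to the string literal itself. Reading through
       it is fine; a store such as p[0] = '*' would be undefined behavior. */
    const char *literal_pointer(void) {
        const char *p = "hello";
        return p;
    }

    int main(void) {
        printf("%s %s\n", writable(), literal_pointer());
        return 0;
    }
    ```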
    
    jcalvinowens 20 hours ago
    
    > because it doesn't know if the function will be called with a pointer to actual read-only data or just a pointer to writable data that was gratuitously
    That's exactly my point yeah, the optimization you described is only possible because you gave the compiler extra knowledge about the argument to that function (because it was static in the same compilation unit). It's artificial, typically that won't be the case.
    
    moefh 20 hours ago
    
    Ah, I understand you now, you're right.
    I remember there was a lot of confusion when llvm started removing stores to read-only memory[1], some people got angry because it broke some kernel code (that only worked because being in a kernel the memory page wasn't actually marked as read-only) and thought it would break any code that cast away a `const`, which is very common and valid as long as it was gratuitously `const`, as you say.
    [1] https://releases.llvm.org/9.0.0/docs/ReleaseNotes.html#notew...
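    For anyone following along, a minimal sketch of "gratuitously `const`" (the helper name `upcase_first` is hypothetical). Writing through a pointer after casting away `const` is defined as long as the object it points to was not itself defined `const` and is not a string literal.
    ```c
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: takes const char * but writes through it after a
       cast. Only defined when the caller's object is actually writable. */
    static void upcase_first(const char *p) {
        char *w = (char *)p;
        w[0] = (char)toupper((unsigned char)w[0]);
    }

    int main(void) {
        char buf[] = "hello";       /* writable object */
        upcase_first(buf);          /* OK: const was only on the pointer type */
        /* upcase_first("hello");      undefined: the target is a literal */
        puts(buf);
        return 0;
    }
    ```
    Because the compiler cannot see from inside `upcase_first()` which kind of pointer it was handed, it has to emit the store.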
    
    mmastrac 21 hours ago
    
    It is possible. Some platforms have no concept of rodata. You can mremap a segment. Lots of valid ways to do it.
    
    jcalvinowens 21 hours ago
    
    Well, at that point you deserve it :)
    
    acchow 21 hours ago
    
    A good reason to define it as an invalid program then and fail at compile time?
    
    jcalvinowens 20 hours ago
    
    Yes. I am literally arguing for doing that in this entire thread.
  - the_svd_doctor 21 hours ago
    
    Right but some code will stop compiling, no?
    
    jcalvinowens 21 hours ago
    
    Yes. But such code can be fixed without functional changes.
    I'm not denying that there are codebases where trying this would result in an Armageddon of refactoring, but I would venture that's the exception rather than the rule.
    Most C programmers use "const char*" for string literals, and have for a long time.
hun3 a day ago

The affected platforms lack an OS (e.g., bootloaders) and/or an MMU/MPU (e.g., microcontrollers like AVR)
- jcalvinowens 21 hours ago
  
  I don't care about platform specific stuff. I'm talking about C which is actually intended to be portable. Nothing written with portability in mind in the past ~decade is going to be doing this.
  - hun3 8 hours ago
    
    I think we're going a bit past each other.
    On AVR or other MPU-less architectures, you can literally modify the string literal memory without triggering a crash.
    Why? Because there is no memory protection ("rodata") at all.
    And such microcontrollers are still in use today, so it's a bit too far-fetched to say "really old code."
    It's UB, sure, but how many embedded programmers actually care? The OP's proposal is trying to change the type system so that this UB becomes much less likely to trigger in practice.
  - dyhi55 21 hours ago
    
    C is not node.js. C has existed for 50 years and is expected to have a stable API. In scientific circles it's not unusual to compile C and F77 libraries built in the '70s and '80s.
    BLAS, gemv, GEMM, SGEMM libraries are from 1979, 1984, 1989. You may have seen these words scroll by when compiling modern 2025 CUDA :)
    
    jcalvinowens 21 hours ago
    
    I was writing C long before node.js existed :)
    C has no backwards compatibility guarantee, and it never has. Try compiling K&R C with gcc's defaults, and see what happens.
    You can build your legacy code with legacy compiler flags. Why do you care about the ability to build under the modern standards?

Dwedit a day ago

Wait, C string literals are not already const? On many platforms, they live in a read-only data section, which is write-protected memory.

HeliumHydride a day ago

They're not const because of backwards compatibility. Const correctness in C is a lot weaker than the way C++ enforces it, letting you implicitly convert it away in a lot of cases.
- jcalvinowens a day ago
  
  On all modern platforms I'm familiar with, if you try to modify a string literal, you'll segfault. So while it's not const at the language level, it is very much const at the machine level.
  - zabzonk a day ago
    
    At runtime, yes. But I want to know about errors like this at compile time.
    
    jcalvinowens 21 hours ago
    
    That's not the point though. The point is that it's very unlikely any C written in the past 20 years relies on the ability to modify string literals.
    
    ryandrake 20 hours ago
    
    Not only that, but no valid C program ever written relies on the ability to modify string literals. Doing so is undefined behavior, so the program is not valid. It may happen to work on some random platform, but it's still undefined.
  - dyhi55 a day ago
    
    You're young. On all the legacy platforms I'm familiar with, you can modify string literals. That's original c.
    
    jcalvinowens 21 hours ago
    
    I guess you missed the word "modern"? Or are you saying you actually know of one?
    
    kevin_thibedeau 21 hours ago
    
    Microcontrollers running code loaded in RAM will have rodata linked into that RAM. Just takes an accidental cast to start writing them.
    
    jcalvinowens 20 hours ago
    
    True. All the more reason to make it an error, IMHO.
    
    dyhi55 21 hours ago
    
    Sure, choose any platform before 1990. The modern ANSI / ISO C didn't exist before 1990. The C language is from the 1970s. So code from any old tarball will assume C literals are writable, and will crash if not. It's a common complaint when compiling old code, google it. The C standard library is full of functions that assume strings are writable: mktemp() sscanf() strtok() etc.
    Quote from the gcc manual, explaining why you need to compile old code with the -fwritable-strings option: "you cannot call mktemp with a string constant argument. The function mktemp always alters the string its argument points to.
    Another consequence is that sscanf does not work on some systems when passed a string constant as its format control string or input. This is because sscanf incorrectly tries to write into the string constant. Likewise fscanf and scanf."
    
    jcalvinowens 21 hours ago
    
    I define "modern" as ANSI/ISO C. That's pretty conservative IMO, I know people who call pre-C99 "legacy C"...
dyhi55 a day ago

Strings including string literals are supposed to be writable for strtok() to work. Const char * is a modern c construct. You gotta deprecate parts of the standard c library, which will break backward compatibility...
- kevin_thibedeau 21 hours ago
  
  I have a strtok() clone for this purpose that returns a pointer range for each token, leaving the string untouched.
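  Not my actual code, but a minimal sketch of the idea, assuming a made-up `struct token` and `next_token()` built on the standard `strspn()` and `strcspn()`:
  ```c
  #include <stdio.h>
  #include <string.h>

  /* Hypothetical non-destructive tokenizer: reports each token as a
     (start, length) pair instead of writing '\0' into the input. */
  struct token {
      const char *start;
      size_t len;
  };

  /* Finds the next token at or after *cursor; returns 0 when none remain. */
  static int next_token(const char **cursor, const char *delims,
                        struct token *out) {
      const char *p = *cursor + strspn(*cursor, delims);  /* skip delimiters */
      if (*p == '\0')
          return 0;
      out->start = p;
      out->len = strcspn(p, delims);                      /* token length */
      *cursor = p + out->len;
      return 1;
  }

  int main(void) {
      const char *cursor = "alpha, beta,gamma";  /* a literal is fine here */
      struct token t;
      while (next_token(&cursor, ", ", &t))
          printf("%.*s\n", (int)t.len, t.start);
      return 0;
  }
  ```
  Since the input is only ever read, it can be a `const char *`, including a string literal. The caller keeps the cursor, so there is no hidden static state either.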
bodyfour a day ago
The issue is that "const" didn't exist in the earliest forms of C... and even when it became available not everybody started using it.
So you might have a function that doesn't have proper "const" qualifications in its prototype like:
```
  void my_log(char *message);
```
and then call-sites like:
```
  my_log("Hello, World!");
```
...and that needed to stay compiling.
iknowstuff a day ago

Doesn’t the first paragraph address this?
