< and > are valid URL characters and should be matched by the URL detector
Do you have any examples of where/how they are used?
I've never seen them in actual URLs myself, and I'm trying to decide which, if any, heuristics to apply for when not to allow them. For example, it would be nice if
<http://foo/bar> didn't result in
The standard for URLs is well-defined and well-behaved applications should conform to it. I agree that it would be annoying to have
<https://example.org> be interpreted wrongly, though.
From my reading of this (specifically steps 2.1 and 2.3), that URL results in a validation error and would typically cause a browser to "correct" it by treating the angle brackets as if they had been percent-encoded.
The standard for URLs is well-defined and well-behaved applications should conform to it
Apps like e.g. browsers can be easily and unambiguously made to tolerate these kinds of errors, because all URL input in a browser is clearly delineated - either by being typed into an address bar, as bookmarks, as
<a href> elements, etc. That's not the case for a terminal feature like this, which has to scan through free-form text and detect URLs without any clear delineation.
If the URL above was encoded as https://lists.sr.ht/~sircmpwn/sr.ht-dev/%3CCD4OAKMUP5AS.0DDCXELK6Y9H%40taiga%3E, then foot (and codeberg/gitea) would have no problem detecting it.
FWIW, the document at that URL also fails validation for similar reasons.
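As a rough illustration of the percent-encoded form mentioned above, here is a minimal Python sketch. The helper name `encode_path_segment` is made up for this example, and the input is simply the last path segment of the URL above, decoded by hand. It uses the standard library's `urllib.parse.quote`, and is not what foot or gitea actually do.

```python
from urllib.parse import quote


def encode_path_segment(segment: str) -> str:
    """Percent-encode one path segment (hypothetical helper for illustration).

    safe="" means nothing beyond the unreserved set (letters, digits,
    "-", ".", "_", "~") is left unencoded, so "<", ">" and "@" are all escaped.
    """
    return quote(segment, safe="")


# Last path segment of the URL discussed above, in its raw form.
raw_segment = "<CD4OAKMUP5AS.0DDCXELK6Y9H@taiga>"
encoded = encode_path_segment(raw_segment)
print("https://lists.sr.ht/~sircmpwn/sr.ht-dev/" + encoded)
```

Note that "@" is allowed unencoded in a path per the RFC grammar. It is escaped here only because `safe=""` escapes everything outside the unreserved set, which happens to match the encoded URL quoted above.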
I agree that it would be annoying to have
<https://example.org> be interpreted wrongly, though.
whatwg is not really the important source here - the RFCs are. Foot is not a web browser. And I don't think that Appendix C makes a particularly strong case for not parsing a valid URL (and it is valid), but instead just acknowledges concessions to contextual formatting of URLs (e.g. by wrapping it in <>, which is a distinguishable situation from an unwrapped URL which contains <>).
whatwg is not really the important source here - the RFCs are
The whatwg URL spec is much the same as the RFCs, but with the previously under-specified stuff fixed and made explicit. That's why I quoted it.
The relevant parts of the RFC 3986 grammar are:
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment       = *pchar
segment-nz    = 1*pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
So according to this:
The URI syntax defines a grammar that is a superset of all valid URIs ...
Using angle brackets in the path is not "valid" according to the terminology of the RFC.
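To make the grammar above concrete, here is a small Python sketch that checks a single path segment against the `pchar` rule. The names `PCHAR_LITERALS` and `is_valid_segment` are made up for this example. It only covers the `segment = *pchar` production quoted above, not the full URI grammar.

```python
import re
import string

# Character classes transcribed from the RFC 3986 rules quoted above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | set(":@")

# pct-encoded = "%" HEXDIG HEXDIG
_PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def is_valid_segment(segment: str) -> bool:
    """Return True if `segment` matches `*pchar` (hypothetical helper)."""
    i = 0
    while i < len(segment):
        if segment[i] == "%":
            # A "%" is only acceptable as the start of a pct-encoded triplet.
            if not _PCT_ENCODED.match(segment, i):
                return False
            i += 3
        elif segment[i] in PCHAR_LITERALS:
            i += 1
        else:
            return False
    return True


print(is_valid_segment("<CD4OAKMUP5AS.0DDCXELK6Y9H@taiga>"))
print(is_valid_segment("%3CCD4OAKMUP5AS.0DDCXELK6Y9H%40taiga%3E"))
```

The raw form with angle brackets is rejected and the percent-encoded form is accepted. A segment containing a space is rejected in the same way.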
What is or isn't "valid" is not really relevant though, because parsing a URL with a well-defined beginning and end (which is what the spec concerns itself with) and detecting a URL inside an arbitrary buffer of text are 2 different things. The latter is ambiguous and there's no completely reliable (or well specified) way to do it. The best you can do is use heuristics.
As far as the parsing specs are concerned, angle brackets in the path have exactly the same status as e.g. space characters. They're explicitly tolerated (but considered "validation errors") by the whatwg spec and they're simply not a part of the grammar in the RFCs.
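To show what "use heuristics" can look like in practice, here is a minimal Python sketch of detecting URLs in free-form text. Everything in it is an assumption made for illustration: the `detect_urls` name, the `allow_angle_brackets` flag, and the rule "a URL preceded by `<` ends at the first `>`". It is not foot's actual detection logic.

```python
import re

# Candidate: a scheme followed by a run of non-whitespace characters.
# Deliberately over-greedy; the heuristics below trim the match afterwards.
_CANDIDATE = re.compile(r"https?://\S+")


def detect_urls(text: str, allow_angle_brackets: bool = False) -> list[str]:
    """Find URLs in free-form text (hypothetical heuristic, for illustration)."""
    urls = []
    for match in _CANDIDATE.finditer(text):
        url = match.group(0)
        start = match.start()
        wrapped = start > 0 and text[start - 1] == "<"
        if wrapped:
            # "<URL>" style: treat the first ">" as the closing delimiter.
            url = url.split(">", 1)[0]
        elif not allow_angle_brackets:
            # Unwrapped, brackets not allowed: stop at the first bracket.
            url = re.split(r"[<>]", url, maxsplit=1)[0]
        urls.append(url)
    return urls


print(detect_urls("see <https://example.org> for details"))
print(detect_urls("see https://example.org/<id> here", allow_angle_brackets=True))
```

The wrapped case yields `https://example.org` without the brackets, while the unwrapped case keeps them only when the caller opts in. The remaining ambiguity, such as a wrapped URL that itself contains `>`, is exactly why this can only ever be a heuristic.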
As mentioned in #655, I'm currently leaning towards making the character set configurable.
<> would, most likely, not be in the default set. But, it would still be included in the heuristics we use to handle