< and > are valid URL characters and should be matched by the URL detector #654

Closed
opened 3 months ago by ddevault · 9 comments
There is no content yet.
Owner

Do you have any examples of where/how they are used?

I've never seen them in actual URLs myself, and I'm trying to decide which heuristics, if any, to apply for when **not** to allow them. For example, it would be nice if `<http://foo/bar>` didn't result in `http://foo/bar>`.

Poster

https://lists.sr.ht/~sircmpwn/sr.ht-dev/<CD4OAKMUP5AS.0DDCXELK6Y9H%40taiga>

The standard for URLs is well-defined, and well-behaved applications should conform to it. I agree that it would be annoying to have `<https://example.org>` be interpreted wrongly, though.

Collaborator

> https://lists.sr.ht/~sircmpwn/sr.ht-dev/<CD4OAKMUP5AS.0DDCXELK6Y9H%40taiga>

From my reading of [this](https://url.spec.whatwg.org/#path-state) (specifically steps 2.1 and 2.3), that URL results in a [validation error](https://url.spec.whatwg.org/#validation-error) and would typically cause a browser to "correct" it by treating the angle brackets as if they had been percent-encoded.

> The standard for URLs is well-defined and well-behaved applications should conform to it

Apps such as browsers can easily and unambiguously be made to tolerate these kinds of errors, because all URL input in a browser is clearly delineated: typed into an address bar, stored as a bookmark, given in an `<a href>` element, and so on. That's not the case for a terminal feature like this, which has to scan through free-form text and detect URLs without any clear delineation.

If the URL above was encoded as https://lists.sr.ht/~sircmpwn/sr.ht-dev/%3CCD4OAKMUP5AS.0DDCXELK6Y9H%40taiga%3E, then foot (and codeberg/gitea) would have no problem detecting it.
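
To illustrate the encoding step, here is a minimal Python sketch (not anything foot itself does). `encode_path` is a hypothetical helper name, and it assumes every `%` already in the path starts a valid escape:

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters RFC 3986 allows unescaped in a path segment (besides letters
# and digits), plus "/" as the separator and "%" so that existing escapes
# are left alone (assumption: every "%" in the input starts a valid escape).
PATH_SAFE = "/%-._~!$&'()*+,;=:@"

def encode_path(url):
    """Percent-encode path characters that are outside the pchar set."""
    parts = urlsplit(url)
    # "<" and ">" are not in PATH_SAFE, so they become %3C and %3E.
    return urlunsplit(parts._replace(path=quote(parts.path, safe=PATH_SAFE)))

raw = "https://lists.sr.ht/~sircmpwn/sr.ht-dev/<CD4OAKMUP5AS.0DDCXELK6Y9H%40taiga>"
print(encode_path(raw))
```

With the angle brackets escaped, the result contains only characters that a conservative detector would already accept.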

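FWIW, the document at that URL also fails [validation](https://validator.nu/?doc=https%3A%2F%2Flists.sr.ht%2F%7Esircmpwn%2Fsr.ht-dev%2F%3CCD4OAKMUP5AS.0DDCXELK6Y9H%2540taiga%3E&charset=&schema=&preset=&parser=&nsfilter=) for similar reasons.

> I agree that it would be annoying to have `<https://example.org>` be interpreted wrongly, though.

This example is mentioned specifically in [Appendix C](https://datatracker.ietf.org/doc/html/rfc3986#appendix-C) of RFC 3986 and is one of the reasons why angle brackets still aren't considered ["URL code points"](https://url.spec.whatwg.org/#url-code-points), even in the latest standards.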

Poster

whatwg is not really the important source here - the RFCs are. Foot is not a web browser. And I don't think that Appendix C makes a particularly strong case for not parsing a valid URL (and it *is* valid), but instead just acknowledges concessions to contextual formatting of URLs (e.g. wrapping a URL in `<>`, which is distinguishable from an unwrapped URL that contains `<>`).

Collaborator

> whatwg is not really the important source here - the RFCs are

The whatwg URL spec is much the same as the RFCs, but with the previously under-specified stuff [fixed](https://url.spec.whatwg.org/#goals) and made explicit. That's why I quoted it.

The relevant parts of the RFC 3986 [grammar](https://datatracker.ietf.org/doc/html/rfc3986#page-49) are:

path-absolute = "/" [ segment-nz *( "/" segment ) ]

segment       = *pchar
segment-nz    = 1*pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

So according to this:

> The URI syntax defines a grammar that is a superset of all **valid** URIs ...

Neither `<` nor `>` is a `pchar`, so using angle brackets in the path is not "valid" according to the terminology of the RFC.
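
As a rough check of that reading, the `segment` rule quoted above can be transcribed into a regular expression. This is only a sketch; `is_valid_segment` is a hypothetical name, not part of any library:

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"

# segment = *pchar (zero or more pchar, so the empty segment is allowed)
SEGMENT = re.compile(PCHAR + "*")

def is_valid_segment(segment):
    """Return True if the whole string matches the RFC 3986 segment rule."""
    return SEGMENT.fullmatch(segment) is not None

print(is_valid_segment("<CD4OAKMUP5AS.0DDCXELK6Y9H%40taiga>"))      # False
print(is_valid_segment("%3CCD4OAKMUP5AS.0DDCXELK6Y9H%40taiga%3E"))  # True
```

A segment containing a space is rejected by the same rule.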

What is or isn't "valid" is not really relevant though, because *parsing* a URL with a well-defined beginning and end (which is what the spec concerns itself with) and *detecting* a URL inside an arbitrary buffer of text are two different things. The latter is [ambiguous](https://blog.codinghorror.com/the-problem-with-urls/) and there's no completely reliable (or well-specified) way to do it. The best you can do is use heuristics.
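
A small Python sketch of the difference. The `naive_detect` scanner is a deliberately simplistic stand-in made up for illustration, not how foot detects URLs:

```python
import re
from urllib.parse import urlsplit

# Parsing: the caller hands over a string with a known start and end.
print(urlsplit("http://foo/bar").path)  # /bar

# Detecting: the scanner has to guess where the URL ends. This one simply
# takes everything up to the next whitespace character.
NAIVE_URL = re.compile(r"https?://\S+")

def naive_detect(text):
    """Return every run of non-space characters that follows a scheme."""
    return NAIVE_URL.findall(text)

print(naive_detect("see <http://foo/bar> and http://foo/baz."))
# ['http://foo/bar>', 'http://foo/baz.']
```

Both results pick up a trailing character that the writer almost certainly meant as punctuation, which is where heuristics come in.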

As far as the *parsing* specs are concerned, angle brackets in the path have exactly the same status as e.g. space characters. They're explicitly tolerated (but considered "validation errors") by the whatwg spec, and they're simply not part of the grammar in the RFCs.

Poster

whatwg defines the behavior of web browsers. Foot is not a web browser.

Collaborator

> whatwg defines the behavior of web browsers. Foot is not a web browser.

OK, thanks for the info. 😆

Owner

As mentioned in [#655](https://codeberg.org/dnkl/foot/pulls/655), I'm currently leaning towards making the character set configurable. `<>` would, most likely, *not* be in the default set. But it **would** still be included in the heuristics we use to handle `<http://example.org/foobar>` vs. `http://example.org/<foobar>`.
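
One way such a scheme could look, sketched in Python. This is purely illustrative and not foot's implementation: the default character set, the `extra_chars` parameter, and the wrapper rule are all assumptions.

```python
import re
import string

# Assumed default set: RFC 3986 unreserved and reserved characters plus "%".
DEFAULT_URL_CHARS = string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
SCHEME = re.compile(r"https?://")

def detect_urls(text, extra_chars=""):
    """Find URLs in free-form text; extra_chars extends the allowed set."""
    allowed = set(DEFAULT_URL_CHARS) | set(extra_chars)
    urls = []
    pos = 0
    while True:
        match = SCHEME.search(text, pos)
        if match is None:
            break
        start, end = match.start(), match.end()
        # Extend the match for as long as characters are in the allowed set.
        while end < len(text) and text[end] in allowed:
            end += 1
        url = text[start:end]
        # Wrapper heuristic: "<" right before the URL and ">" as its last
        # character, with no "<" inside, means the brackets are delimiters.
        if start > 0 and text[start - 1] == "<" and url.endswith(">") and "<" not in url:
            url = url[:-1]
        urls.append(url)
        pos = end
    return urls

print(detect_urls("<http://example.org/foobar>", extra_chars="<>"))
print(detect_urls("http://example.org/<foobar>", extra_chars="<>"))
```

With `<>` added to the set, the first call drops the closing bracket while the second keeps the brackets as part of the path. With the default set, detection of the second URL stops at the `<`.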

dnkl referenced this issue from a commit 3 months ago
dnkl closed this issue 3 months ago
dnkl referenced this issue from a commit 3 months ago
Poster

Thanks!
