View Issue Details

ID: 0003587
Project: Composr
Category: core
View Status: public
Last Update: 2018-11-10 21:08
Reporter: Chris Graham
Assigned To:
Severity: feature
Status: non-assigned
Resolution: open
Product Version:
Fixed in Version:
Summary: 0003587: Internationalised e-mail addresses and URLs
Description: This is a complex topic.

Domain names may use almost any Unicode character via Punycode (the ASCII encoding behind IDN, basically). Domain names do not carry UTF-8 directly because by convention they map to hostnames, which are restricted to ASCII and are never going to support that.
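
As a rough illustration of the Punycode mapping, here is a minimal Python sketch using the standard library's "idna" codec (which implements the older IDNA 2003 rules). The hostname is just an example and nothing here is Composr code.

```python
# Round-trip a Unicode hostname through its ASCII (Punycode) form.
unicode_host = "münchen.de"

# Unicode -> ASCII-compatible form, as it would appear in DNS.
ascii_host = unicode_host.encode("idna").decode("ascii")

# ASCII-compatible form -> Unicode, as a person would want to read it.
round_trip = ascii_host.encode("ascii").decode("idna")

print(ascii_host)   # xn--mnchen-3ya.de
print(round_trip)   # münchen.de
```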

E-mail addresses may use any Unicode character via Internationalized Email (it just sends the address as UTF-8 rather than using a special encoding; it's more of a consensus to do things in a proper modern way).

URLs may involve a combination of ASCII, URL encoding, UTF-8, and Punycode. Technically raw UTF-8 is not allowed in a URL, but it happens when people don't do encoding fully, and it can be interpreted unambiguously, so accepting it is reasonable.

So what do we need to do?

1) Our HarmlessURLCoder should convert Punycode to UTF-8.

2) Our HarmlessURLCoder should be used when URLs are pasted in and we need link text but can't get a <title> from the page at that URL (i.e. we already use the URL as the link text in that case, but without passing it through HarmlessURLCoder).

3) E-mail address sanitisation server-side and client-side should be significantly loosened, IF a config option is enabled (maybe enabled by default?).
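
Points 1 and 2 could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `readable_url` is a hypothetical name, this is not how HarmlessURLCoder is implemented, and the result is meant for display as link text only (decoding escapes such as %2F makes it unsuitable as a real URL).

```python
from urllib.parse import urlsplit, unquote

def readable_url(url):
    """Return a display-only form of url (hypothetical helper)."""
    parts = urlsplit(url)

    # Punycode host -> Unicode. A host that is already Unicode fails the
    # ASCII step and is left untouched, so nothing is decoded twice.
    host = parts.hostname or ""
    try:
        host = host.encode("ascii").decode("idna")
    except UnicodeError:
        pass

    # Percent-encoded UTF-8 path -> Unicode; keep the original if invalid.
    try:
        path = unquote(parts.path, errors="strict")
    except UnicodeDecodeError:
        path = parts.path

    port = f":{parts.port}" if parts.port else ""
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{host}{port}{path}{query}"

print(readable_url("http://xn--mnchen-3ya.de/caf%C3%A9"))
# http://münchen.de/café
```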
Additional Information: There are a lot of concerns...

a) Punycode is intentionally crippled by browsers because it can lead to attacks. See https://wiki.mozilla.org/IDN_Display_Algorithm

b) I have concerns about non-ASCII e-mail addresses, because allowing all kinds of symbolic characters and Unicode is likely to significantly increase the chance of typos that can't be detected.

c) I don't think Punycode or Internationalized Email is in very common use. E.g. the Chinese service "Weibo" uses a transliterated domain, weibo.com. I think people are used to this. I don't have data though. Realistically it is easier for the world if we all use Latin (ASCII) identifiers for things, as they are easier to share and type. This may well just remain the predominant de facto standard regardless of the actual standards.
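
To make concern (a) concrete, here is a deliberately crude Python sketch of one kind of check a display policy might make: flagging a domain label that mixes scripts. The function names are hypothetical, and taking the first word of a character's Unicode name is only an approximation of its script, far simpler than the Mozilla algorithm linked above.

```python
import unicodedata

def scripts_in_label(label):
    """Approximate the set of scripts used by the letters in label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_suspicious(label):
    """True when a single label mixes more than one (approximate) script."""
    return len(scripts_in_label(label)) > 1

# "\u0430" is CYRILLIC SMALL LETTER A, visually close to Latin "a".
print(looks_suspicious("apple"))        # False
print(looks_suspicious("\u0430pple"))   # True
```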



The best thing for now may be to do nothing until a practical concern comes up from someone actually affected.
Tags: Type: Internationalisation
Time estimation (hours): 16
Sponsorship: open

Activities

Chris Graham

2018-04-20 20:19

administrator   ~0005669

Last edited: 2018-04-20 20:23

We allow URL monikers (optional *) and codenames to have Unicode characters (so long as reserved characters are not used). These will still be URL encoded in a nasty way in the real URLs Composr uses, because URLs have to be safe in ASCII. For URLs encoded using HarmlessURLCoder (optional), we will show Unicode characters directly because we are showing them only in a text context that we control.

* The Composr webmaster controls whether monikers are made using Unicode, or transliteration.

THAT SAID, it may be the case that our URL-encoded URLs to downloads overflow our available database field space. In such a case we bend the rules and allow non-ASCII URLs to be saved into our database instead. That is the best compromise in such a case and has no practical bugs relating to it.
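
The idea of storing a shorter, non-ASCII form can be sketched in Python as below. This is an assumption-laden illustration, not Composr's code: `shorten_for_storage` and `max_len` are hypothetical, length is counted in characters, and only escapes for bytes 0x80 and above are decoded so that ASCII escapes such as %20 stay intact.

```python
import re

# Runs of percent-escapes whose byte values are >= 0x80 (non-ASCII UTF-8).
_NON_ASCII_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def shorten_for_storage(url, max_len):
    """If url is too long, replace escaped UTF-8 with the raw characters."""
    if len(url) <= max_len:
        return url  # already fits, keep the fully valid URL

    def decode_run(match):
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not valid UTF-8, leave it escaped

    return _NON_ASCII_ESCAPES.sub(decode_run, url)

url = "http://example.com/%E4%BD%A0%E5%A5%BD%20x.zip"
print(shorten_for_storage(url, 255))  # unchanged, it fits
print(shorten_for_storage(url, 20))   # http://example.com/你好%20x.zip
```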

Additionally we have the capability for transliteration. On old PHP versions on Windows we have to transliterate filenames (and hence URLs to those files) because PHP has no Unicode filesystem support there.
We always transliterate directory names due to poor PHP support.

Chris Graham

2018-04-20 20:26

administrator   ~0005670

Last edited: 2018-04-20 20:31

It's also worth explaining the difference between urlencode, rawurlencode, cms_urlencode, cms_rawurlrecode, and HarmlessURLCoder.

rawurlencode - PHP function for standardised URL encoding.

urlencode - PHP function for URL encoding specifically for GET parameters. It's essentially the same as rawurlencode except spaces become "+".

cms_urlencode - A layer around urlencode that provides Composr-specific encoding that stops Apache's mod_rewrite from corrupting certain special characters during its "smart" processing.

cms_rawurlrecode - Shortens URLs that are too long for the database by intelligently cheating in our encoding. The URLs are not technically valid but will work.

HarmlessURLCoder - Simplifies/desimplifies URLs, trading standards-compliance for human-readability. Similar to what browsers do in their address bars. It is a non-destructive operation that doesn't allow for double encoding or double decoding. Non-Latin characters in URLs encoded with HarmlessURLCoder are much easier to read.
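
To make the rawurlencode/urlencode difference concrete, here is a small Python sketch. Python's urllib.parse.quote and quote_plus are used as rough stand-ins for the two PHP functions (an assumption for illustration; edge cases may differ), and nothing here reflects the Composr wrappers.

```python
from urllib.parse import quote, quote_plus

text = "a b&c/d"

# Analogue of rawurlencode: every reserved character is percent-encoded.
raw_style = quote(text, safe="")

# Analogue of urlencode: the same, except spaces become "+".
form_style = quote_plus(text)

print(raw_style)    # a%20b%26c%2Fd
print(form_style)   # a+b%26c%2Fd
```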

Chris Graham

2018-06-01 01:17

administrator   ~0005720

I've implemented Punycode support.

I'm leaving e-mail alone for now as e-mail validation is a mess:
http://emailregex.com/email-validation-summary/
And I'm happy to reinforce the consensus of simple addresses for now.

Issue History

Date Modified Username Field Change
2018-04-20 20:10 Chris Graham New Issue
2018-04-20 20:10 Chris Graham Sponsorship open 0 =>
2018-04-20 20:10 Chris Graham Summary internationalised e-mail addresses and URLs => Internationalised e-mail addresses and URLs
2018-04-20 20:19 Chris Graham Note Added: 0005669
2018-04-20 20:23 Chris Graham Note Edited: 0005669 View Revisions
2018-04-20 20:26 Chris Graham Note Added: 0005670
2018-04-20 20:31 Chris Graham Note Edited: 0005670 View Revisions
2018-06-01 01:17 Chris Graham Note Added: 0005720
2018-11-10 21:08 Chris Graham Tag Attached: Type: Internationalisation