A recent study conducted by Governance Primer on behalf of the Universal Acceptance Steering Group (UASG) identified trends in the acceptance of all domain names in software hosted on GitHub, the world’s largest open source repository. This research builds on previous efforts to identify the underlying problems that arise when applications have to deal with Internationalized Domain Names (IDNs) and new gTLDs, particularly with regard to email addresses.
The goal was to get real data on the use of software libraries, which are pre-written pieces of code that act a bit like building blocks, providing specific functionality so that developers do not have to reinvent the wheel every time a feature needs to be implemented. For example, the Pillow library for Python provides image processing functionality, so an application that edits digital images does not need to have pixel manipulation coded from scratch, nor common features like transparency, blur, and sharpness.
In the case of domain names, libraries of particular concern are those which somehow deal with validation, allowing the entry of some characters and structures, while prohibiting others. To give a practical example, our previous research tested the validation of email addresses in the “form” field (often in the contact sections) of the world’s top 1,000 websites according to Alexa rankings. Those results were then updated in 2020 by another team, and this is what the acceptance landscape looked like:
Test case | 2017 | 2019 | 2020 |
---|---|---|---|
ascii@ascii.newshort | 91% | 97% | 98% |
ascii@ascii.newlong | 78% | 84% | 84% |
ascii@idn.ascii | 45% | 50% | 47% |
Unicode@ascii.ascii | 14% | 13% | 18% |
Unicode@idn.idn | 8% | 8% | 11% |
Right-to-left (RTL) | 8% | 7% | 11% |
What these results tell us is that code deployed on the web handles new gTLDs of four characters or fewer well, but already starts to struggle with longer ones, and acceptance drops dramatically once IDNs are introduced. The question that followed these discoveries was: what does the wider software landscape look like? While many validation processes are performed on the web, many others occur in non-web applications.
To perform this analysis, the programming languages most used in open source software, Java and Python, were targeted, and a crawler was created to aggregate all valid software projects (as indicated by GitHub), extracting the “dependency” file of each one. This file tells anyone who wants to work with a given application which libraries it relies on, so that they can be included in the final software and it can perform its tasks properly.
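The aggregation step described above can be sketched as follows for the Python side. This is a minimal illustration, not the crawler used in the study: it assumes the dependency file is a pip-style `requirements.txt`, and the function names `parse_requirements` and `count_occurrences` are hypothetical.

```python
import re
from collections import Counter

# A requirement line starts with the distribution name, followed by
# optional extras, version specifiers, or environment markers.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def parse_requirements(text):
    """Return the set of library names declared in one requirements file."""
    names = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        # Skip blanks, pip options such as "-r base.txt", and direct URLs.
        if not line or line.startswith("-") or "://" in line:
            continue
        match = _NAME.match(line)
        if match:
            # Normalise so "email_validator" and "email-validator" count as one.
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def count_occurrences(projects):
    """Count in how many projects each library appears.

    `projects` maps a project identifier to the text of its dependency file.
    """
    counts = Counter()
    for text in projects.values():
        counts.update(parse_requirements(text))
    return counts


if __name__ == "__main__":
    sample = {
        "project-a": "idna==2.10\nrequests>=2.0  # HTTP client\n",
        "project-b": "idna\nemail_validator>=1.1\n-r base.txt\n",
    }
    for name, occurrences in count_occurrences(sample).most_common():
        print(name, occurrences)
```

Counting each library once per project, rather than once per line, matches the “occurrence in projects” figures reported in the tables further down.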
Although some lists of the most used libraries exist, their methodology is not based on direct sampling of projects, and they do not provide enough metadata for correlations to be made between projects and the libraries they use. This means that it would be difficult to determine which projects are using a non-compliant library and to engage with them to spur changes to their codebase, replacing it with a library that is compliant with Universal Acceptance. Additionally, project metadata was collected to generate a ranking of the most relevant applications (based on an algorithm that took into account data points such as the number of forks), which is a feature not provided by GitHub.
Through the Universal Acceptance Compliance of Some Programming Language Libraries and Frameworks study, the compliance status of some libraries was already known, and the team assessed the status of others that were deemed relevant. Essentially, a library that uses the newer IDNA2008 standard is “UA-Ready”, while a library that uses the older IDNA2003 standard is “Not UA-Ready”. It is also possible that a library follows neither, which leads to the reasonable assumption that it is “Not UA-Ready”.
Incorporating a UA-Ready library does not automatically make an application capable of accepting all domain names, because unfortunately other factors are involved, especially whether the library is correctly implemented by the developers. However, knowing the status of each library makes decisions about the allocation of resources for engagement and remediation much more rational, as priorities can be better established.
The results are presented below.
Java
“RegEx via annotations” seems to be a popular method of performing validation in Java, which is unfavorable to the interests of the UASG, as it is not a uniform way to validate strings, and any arbitrary expression can be used to perform the check. This means we can’t be sure what kind of processing is being done under the hood, but it probably isn’t helping the application become UA-Ready. The most relevant libraries using this method are validation-api, ranking 55th; its derivative hibernate-validator, ranking even higher at 21st; and springfox-bean-validators, also ranking quite high at 79th.
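To illustrate why an arbitrary expression is a concern, the minimal sketch below applies a hypothetical ASCII-only pattern of the kind that could be supplied to a validation annotation. It is written in Python for brevity. The pattern, the function name `is_accepted`, and the sample addresses are assumptions made for illustration and are not taken from any of the libraries listed here.

```python
import re

# Hypothetical pattern: ASCII letters only, and a TLD of 2 to 4 letters.
ASCII_ONLY_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")


def is_accepted(address):
    """Return True if the whole address matches the ASCII-only pattern."""
    return ASCII_ONLY_EMAIL.fullmatch(address) is not None


if __name__ == "__main__":
    samples = [
        "user@example.info",         # ASCII, short TLD
        "user@example.photography",  # ASCII, long new gTLD
        "user@bücher.example",       # IDN label in the domain
        "用户@example.com",            # Unicode local part
    ]
    for address in samples:
        print(address, is_accepted(address))
```

With this particular pattern only the first address is accepted, because the expression hard-codes both the character set and the TLD length. A different expression would fail in different ways, which is why validation by regular expression cannot be assessed uniformly.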
Library | Occurrence (projects) | Status |
---|---|---|
hibernate-validator | 62963 | Not UA-Ready. RegEx via annotations; Hibernate implementation of validation-api. |
validation-api | 25190 | Not UA-Ready. RegEx via annotations. |
springfox-bean-validators | 12501 | Not UA-Ready. RegEx via annotations; SpringFox implementation of validation-api. |
commons-validator | 4906 | Not UA-Ready. Based on a static list of TLDs from 2017. |
icu4j | 886 | UA-Ready. IDNA2008. |
libidn | 29 | Not UA-Ready. IDNA2003, obsolete and ported to the Java language as “java.net.IDN”. |
Python
Across the Python dataset as a whole, the idna module ranks 6th in terms of usage, which is a positive result for the interests of the UASG. It can also be a key argument in engaging with the developers of the Python language to bring this module into the core of the language, replacing the default IDNA2003 implementation. This would be a significant gain for a programming language that is increasingly in demand.
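The practical difference between the two implementations can be seen in a minimal sketch. It assumes Python 3 and treats the third-party `idna` package as optional, returning `None` when it is not installed. The helper names `to_ascii_idna2003` and `to_ascii_idna2008` and the sample domain are hypothetical.

```python
def to_ascii_idna2003(domain):
    """Encode with the codec built into Python, which implements IDNA2003."""
    return domain.encode("idna").decode("ascii")


def to_ascii_idna2008(domain):
    """Encode with the third-party idna package (IDNA2008), if available."""
    try:
        import idna  # optional dependency: pip install idna
    except ImportError:
        return None
    return idna.encode(domain).decode("ascii")


if __name__ == "__main__":
    domain = "faß.de"
    # IDNA2003 maps "ß" to "ss" before encoding, so the name changes.
    print("IDNA2003:", to_ascii_idna2003(domain))
    # IDNA2008 treats "ß" as a valid character and keeps it in the A-label.
    print("IDNA2008:", to_ascii_idna2008(domain))
```

Under the built-in codec the name becomes `fass.de`, a different domain from the one the user typed, whereas the `idna` package preserves the character. This is the kind of divergence that makes an IDNA2003-based library “Not UA-Ready”.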
Library | Occurrence in projects | Status |
---|---|---|
idna | 70789 | UA-Ready. IDNA2008. |
validators | 1660 | Not UA-Ready. Email validation based on the Django validator; RegEx based URL validation. |
email_validator | 1178 | UA-Ready. IDNA2008. |
pyicu | 243 | UA-Ready. IDNA2008. |
idna_ssl | 10 | UA-Ready. IDNA2008. |
The full study is available on this link.
Many thanks to the contributors to the project: Sávyo Vinícius de Morais, Edson Celio Ferreira Araujo, and Jonas Mendes Fiorini.