This is the working area for the technical and legal teams. The focus of this effort is for recording ideas and considerations about improving machine detection and recognition of licenses and headers in code.
- TODO: insert link to current guidelines
- TODO: insert link to bug 1302
Things that would improve working with the data programmatically:
- Get away from spreadsheets as the index for the licenses. Use JSON or XML (leaning towards XML to keep the toolset the same as markup parsing). A tool to compile the spreadsheet (for humans to work with) down to an XML file would probably work.
- Put all information about a license in the same place for each license (headers, addons/exceptions/whatever). Likely within markup within the license files themselves.
- Use XML or some other widespread format for the markup. This ensures there are tools available to work with the data in any language without having to write custom parsing tools or use tools tied to a specific platform because they're the only ones that exist.
- As, and if, the data moves towards a more markup-heavy format, it might be wise to simply include the full text verbatim in its own section rather than "recreate" it from marked up data.
An example of what a more normalized (to xml) data set might look like:
<licenses> <license identifier="Apache-2.0" file="Apache-2.0.xml">Apache License 2.0</license> ... </licenses>
<license> <original>...full original text content...</original> <template> <header>...Licensed under the Apache License, version 2.0, etc....</header> <body>Some text blah blah <match regex="foo|bar"/> more text</body> </template> </license>
I don't actually like XML, but it seems to be a good fit for this purpose. It helps keep the intrusiveness of the markup into the license text focused and to a minimum, lets us wrap entire sections as optional, keep the canonical original text if desired, and specify alternate license formats such as the short-form license header and any desired metadata all in one place. And there are tools to read XML in every language.
I recommend that the template content be *without* the parts that the matching guidelines say to ignore, and preferably it should have normalized or marked up bullets, etc. Best of all would be, simply, normalized text content in a full regular expression that can be used to match directly, rather than a bunch of markup that must be turned into a regular expression -- but I think this would make maintenance much harder. Again, a compilation step could help. There's a lot of work to be done here that can be qualified as preprocessing, and there's no reason we can't make sure the preprocessing is "done right" and publish the results, instead of requiring programmers to reimplement their own version of the preprocessing with uncertain results.
A particular note about normalizing text: Bullet items are one of the problematic points for parsing because it can be troublesome to distinguish a bullet point like "ii." from the end of a sentence that got wrapped to the start of the next line. (I actually encountered this in my testing; a saving grace is that the bullets tend to be things that are hard to confuse for words, but it's really easy to write a regular expression that isn't that smart.)