THE SPDX WIKI IS NO LONGER ACTIVE. ALL CONTENT HAS BEEN MOVED TO https://github.com/spdx

Talk:Legal Team/License List/License Matching Guidelines

From SPDX Wiki
Revision as of 15:37, 10 April 2013 by MartinMichlmayr (Talk | contribs)

outstanding issues

2.1.1 Standard Headers guideline - currently, standard headers appear only in the spreadsheet and not on the HTML license pages. This field needs to be added. Suggestion: if we switch to an HTML "master" file for the license text, add a Standard Header field there so the same formatting can be used, and then change the spreadsheet field to a simple yes or no.

4.1.1 states to treat all upper and lower case as lower case. Does it really matter? Can't we simply say, treat all upper and lower case as the same? -- Submitted by jlovejoy on Wed, 2012-06-27 19:36.

Trade, etc. marks

We should probably have equivalency rules for ™, ®, ℠ to their ascii representations. -- Submitted by pezra on Thu, 2012-06-21 17:03.

not added to guidelines at this time, as it doesn't seem to show up that often as far as I know. Using "TM" as an equivalent to the superscript version could cause a problem if those two letters appear together otherwise (not that I can think of an example...) -- Submitted by jlovejoy on Thu, 2012-06-21 19:44.
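
For illustration only, here is a minimal Python sketch of what such an equivalency rule might look like. The mapping table and the function name `normalize_marks` are hypothetical; the guidelines define neither. The parenthesised ASCII forms are an assumption chosen to lower the chance of colliding with an ordinary "TM" letter pair, which is the concern raised above.

```python
# Hypothetical equivalence table; not part of the SPDX guidelines.
MARK_EQUIVALENTS = {
    "\u2122": "(TM)",  # trade mark sign
    "\u00ae": "(R)",   # registered sign
    "\u2120": "(SM)",  # service mark sign
}


def normalize_marks(text: str) -> str:
    """Replace each mark symbol with its assumed ASCII form."""
    for symbol, ascii_form in MARK_EQUIVALENTS.items():
        text = text.replace(symbol, ascii_form)
    return text
```

A matcher using this sketch would apply `normalize_marks` to both the candidate text and the reference license text before comparing them.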

Ignore Copyright owner, why not ignore name of license?

Does the name of the license need to match? I don't think so. So for Apache v1.1 it is not necessary to match either the name of the license or the copyright owner. -- Submitted by dmg on Thu, 2012-06-14 18:17.

incorporated this guideline - see #11 -- Submitted by jlovejoy on Wed, 2012-09-05 18:03.

Detecting the end of copyright notices

How should tool makers determine where a copyright notice ends? -- Submitted by pezra on Wed, 2012-06-13 15:30.

Handling of Non-ASCII characters

Do we have a guideline for how to process characters outside the 7-bit ASCII range? E.g. the German ü character appears in various different encodings depending on the locale used.

Suggested rule 1: We assume all texts are UTF8, whenever there are no UTF8 syntax errors. Otherwise offending characters are stripped and replaced by a single replacement character (To be defined) so that correct UTF8 results.

Suggested rule 2: All UTF8 characters having a corresponding ASCII character are considered to be equal to that ASCII character. (e.g. variations of quotes, hyphens, asterisks, bullets) -- Submitted by jnweiger on Fri, 2012-03-02 15:15.
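
The two suggested rules could be sketched in Python roughly as follows. This is an illustration under assumptions, not a proposed implementation: the replacement character is still "to be defined" in the suggestion, so the sketch simply keeps U+FFFD, which is what Python's decoder emits for invalid bytes, and the equivalence table is a small hypothetical sample rather than a complete list.

```python
# Hypothetical sample of "has a corresponding ASCII character" (rule 2).
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2010": "-", "\u2013": "-", "\u2014": "-",  # hyphen, en dash, em dash
    "\u2022": "*",                  # bullet
})


def decode_license_bytes(raw: bytes) -> str:
    """Rule 1: treat input as UTF-8; invalid bytes become U+FFFD (assumed)."""
    return raw.decode("utf-8", errors="replace")


def fold_to_ascii(text: str) -> str:
    """Rule 2: map listed characters to their assumed ASCII equivalents."""
    return text.translate(ASCII_EQUIVALENTS)
```

For example, a Latin-1 encoded "Müller" is not valid UTF-8, so its ü byte comes out as the replacement character, while the UTF-8 encoded form decodes unchanged.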

These rules are being developed in terms of characters, rather than byte sequence. I'd prefer to stay away from implementation details such as character encoding for the matching rules.
I think we should add a rule stating that any sequences of characters whose glyphs appear the same when rendered should be considered equivalent for the purposes of matching.
This will handle the various ways combining characters such as umlauts can be represented in Unicode by allowing comparison of the normalized form KC of the strings. It is also general enough that people who dislike Unicode (who do exist) can apply the rule to whatever encoding they prefer. -- Submitted by pezra on Mon, 2012-03-05 19:38.
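
As a rough sketch of the normalization form KC comparison, using Python's standard `unicodedata` module (the function name `nfkc_equal` is made up for this example):

```python
import unicodedata


def nfkc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to form KC (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

With this, "u" followed by a combining diaeresis compares equal to the precomposed "ü", and the "ﬁ" ligature compares equal to "fi". Note that NFKC only approximates "glyphs appear the same": look-alike characters from different scripts, such as Latin "a" and Cyrillic "а", remain distinct.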

Combined Punctuation and whitespace

suggested rule: any punctuation that is not directly followed by whitespace is treated as if it were followed by whitespace.

Idea is to consider 'foo,bar or baz' equals 'foo, bar or baz'.

Question: What is the exact definition to be used for punctuation? I'd default to using ispunct() from the C locale, which matches any printable character that is not a space or an alphanumeric character. -- Submitted by jnweiger on Fri, 2012-03-02 14:57.

Suggest that after every piece of punctuation, we normalize to assume a single white space. -- Submitted by Anonymous on Thu, 2012-06-21 15:56.
On the question of what is punctuation, perhaps we should use the General Punctuation list compiled by the Unicode spec. -- Submitted by pezra on Mon, 2012-03-05 19:46.
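
A minimal Python sketch of the "single white space after every piece of punctuation" idea follows. It assumes the ASCII definition of punctuation (`string.punctuation`, the same set that ispunct() accepts in the C locale) rather than a Unicode-based list, and the function name is hypothetical.

```python
import re
import string

# ASCII punctuation only; a Unicode-based definition would need a different class.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def space_after_punctuation(text: str) -> str:
    """Put a space after every punctuation mark, then collapse whitespace runs."""
    spaced = _PUNCT.sub(lambda m: m.group(0) + " ", text)
    return " ".join(spaced.split())
```

Under this sketch, 'foo,bar or baz' and 'foo, bar or baz' both normalize to 'foo, bar or baz'. The normalized form is meant only for comparison, since it also splits tokens such as "v1.1" into "v1. 1" on both sides of the match.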

Re: Bullets and Numbers

This rule could be extended to cover more cases:

a) The definition of 'a line starts with' should clearly say, that any whitespace at the start of the line (e.g. indentation) is ignored.

b) typical comment characters at the start of a line (e.g. /*, //, #, REM, ...) should also be ignored.

c) (ad hoc idea:) any punctuation at the start of a line is ignored. -- Submitted by jnweiger on Fri, 2012-03-02 14:53.

a) I think this is already covered by the whitespace guideline
b) true. new guideline for this has been added
c) might require more thought. going with narrower suggestion in b for now... -- Submitted by jlovejoy on Thu, 2012-06-21 19:51.
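
Points a) and b) might be sketched in Python as below. The marker list is limited to the examples named in this thread (/*, //, #, REM) and is an assumption, not the list used in the guideline; the function name is made up for illustration.

```python
import re

# Leading whitespace, then at most one comment marker, then more whitespace.
# "REM\b" requires a word boundary so that words like "Remark" are left alone.
_LINE_PREFIX = re.compile(r"^\s*(?:/\*|//|#|REM\b)?\s*")


def strip_line_prefix(line: str) -> str:
    """Drop indentation and one leading comment marker from a line."""
    return _LINE_PREFIX.sub("", line, count=1)
```

The bullet and number rule would then be applied to whatever `strip_line_prefix` returns, so an indented or commented-out list item is treated like one at the start of a line.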

Preambles/epilogue

How should adding a preamble/epilogue to a license text affect matching? -- Submitted by pezra on Wed, 2012-02-08 16:38.

I'd say for the most part preambles are included, but epilogues (where there are instructions on how to apply the license) can probably be considered "optional" text. This is just off the top of my head, based on what I've seen included and omitted for various licenses in the wild (e.g. I've seen the instructions for applying the GPL at the end of the license omitted, but have never seen the preamble omitted).
I think this will ultimately need to be dealt with on a license-by-license determination, using the {{ }} indicators as to what can be ignored for matching purposes as discussed on the 6/21 call -- Submitted by jlovejoy on Thu, 2012-06-21 20:24.

hyphenation

We should probably point out that any word that is hyphenated to span lines should be considered equivalent to the unhyphenated word. -- Submitted by pezra on Wed, 2012-02-08 16:16.

Suggested Rule 1: all hyphen, en dash, em dash, etc. characters are considered the same.
Suggested Rule 2: if a line ends in a hyphen, and there is a word directly in front of the hyphen and a word directly at the beginning of the next line, then the hyphen and whitespace are removed, so that both words are joined into one.
There is a slight risk that a compound word such as 'built-in' would lose its hyphen if its hyphen happens to fall at a line break. I'd say we can accept that risk. -- Submitted by jnweiger on Fri, 2012-03-02 14:45.
added guideline re: equating hyphens and various forms of dashes. but not sure how to handle hyphenated words at the end of the line, since the whitespace is already accounted for in another guideline... -- Submitted by jlovejoy on Thu, 2012-06-21 20:20.
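
One possible way to handle the end-of-line case is to join hyphenated words before the whitespace guideline is applied, while line breaks are still visible. The Python sketch below illustrates that ordering under assumptions: the list of dash characters is a hypothetical sample, and the function names are made up for this example.

```python
import re

# Sample of dash-like characters treated as a plain hyphen (Rule 1, assumed list).
_DASHES = str.maketrans({c: "-" for c in "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"})

# A hyphen directly after a word character, at the end of a line, with a word
# character starting the next line (Rule 2). Indentation on the next line is allowed.
_LINE_BREAK_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)")


def unify_dashes(text: str) -> str:
    """Map the listed dash characters to an ASCII hyphen."""
    return text.translate(_DASHES)


def join_hyphenated(text: str) -> str:
    """Unify dashes, then remove line-break hyphens so the word halves are joined."""
    return _LINE_BREAK_HYPHEN.sub("", unify_dashes(text))
```

As noted above, this also turns 'built-' followed by 'in' on the next line into 'builtin', which is the accepted risk. A hyphen that has whitespace in front of it is left alone.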