THE SPDX WIKI IS NO LONGER ACTIVE. ALL CONTENT HAS BEEN MOVED TO https://github.com/spdx
Technical Team/SPDX Specification Versions
For anyone wanting to add comments, questions, etc. directly in the document so that they are tracked without extensive version cross-referencing, please put each comment on a new line and use the following syntax:
(yyyymmdd initials comments)
for example:
(20100407 KS does this make sense?)
SPDX Specification Version: DRAFT 20100407
(20100310 JM General - Can we add a date or version to the document? We should probably add a revision table when it goes live but in my view not necessary to have right now) (20100407 KS done)
(20100211 KS The intent from the discussions so far is that this document is licensed under the CC-BY (allow derivatives) - where should we add license text? )
(20100210 MVW Package Specific vs. Project Specific - One aspect that may need some consideration is whether some part of the data is project specific (i.e. common to many packages of the same project). However, at first glance the current standard seems to have only package specific data.)
(20100310 JM General – I would add a standard disclaimer (i.e. we tried to get this right but there may be mistakes) to the Package facts that travel with it. Perhaps in the Identification Information section. See my next comment on License since it may solve that.)
(20100310 JM General – Is the Package Facts file licensed (the use of a Formal Copyright Holder in 3.2.7 seems to imply that)? If so, do we want to say it should be under the same license as the specification? I like the idea of a permissive license or possibly even public domain. However, we could allow people to license the file or not license it according to their project or tastes. My only concern with this is the inevitable “are these licenses compatible?” mess if I try to take 10 or 20 of these files and roll them up into (or even take info from them) a nice neat re-distributable document. I would suggest, at a minimum, that if someone can license this content we have a block to capture it.)
(20100310 JM General – Should the final version have an examples section that shows a completed Package Facts document, or examples per section, or both? Perhaps we can use some of the use cases that we are working on for this.)
(20100310 JM General – Do we want a way for people to extend the format of this file and, if so, in a controlled way? Do we want people to add new fields in any section they wish? What if someone takes a package, then modifies it and re-distributes it? Would they add to or remove from the package facts? Would it be worthwhile to capture that delta in its own section? I noticed that we want the document to be signed so maybe we don’t envision it being modified in this way?) (20100407 KS first entry in file is version field for specification itself - 2.2.1 is meant to handle this, is something else needed? )
(20100310 JM General – Are all these fields required or are some optional?)
(20100310 JM General – How do we represent different Packages in a single distribution of something? I can see where some projects are very singular and would have just an application or a library. What happens, as an example, if that project offers an application, kernel patch, library, etc in one download? Would we use one package facts to cover them all, one per piece, etc?)
Signed off:
(Approved for use by active participants in this specification effort, as indicated by name and email id)
(20100310 JM General – I would like to see the definition of Signed Off mean “approved for release” vs. “approved for use” since I think that’s what we mean and use could possibly be construed to mean something else. )
1. Rationale
1.1. Charter
Create a set of data exchange standards to enable companies and organizations to share license and component information (metadata) for software packages and related content with the aim of facilitating license and other policy compliance.
1.2. Why is a common format for data exchange needed?
Companies and organizations (collectively “Organizations”) widely use and reuse open source and other software packages. Compliance with the associated licenses requires a set of due diligence activities that each Organization performs independently: a manual and/or automated scan of the software and identification of the associated licenses, followed by manual verification. Software development teams across the globe use the same open source packages, but they have not yet set up a way to collaborate on license discovery, so many groups perform the same work, leading to duplicated effort. This working group seeks to create a data exchange format so that information about software packages and related content may be collected and shared in a common format, with the goal of saving time and improving data accuracy.
(20100210 MVW 1.2 Should probably reflect the idea behind package specific and use case specific data) (20100407 KS - use case specific data was removed - so not sure if this still needs to be addressed?)
1.3. What does this specification cover?
1.3.1. Identification Information: Metadata to associate analysis results with a specific package. This includes a unique identifier to permit correlation of a specific instance of this data with a specific package.
1.3.2. Overview Information: Facts that are common properties for the entire package.
1.3.3. File Specific Information: Facts that are specific to each file (copyrights, licenses) that are included in the package.
1.3.4. Common Licenses: standardized way of referring to the common licenses likely to be encountered.
1.3.5. ?
1.4. What is not covered?
1.4.1. Information that cannot be derived from a visual inspection of the package to be analyzed.
1.4.2. How the data stored in this file format is used. After we agree on what should be specified; discussions on how it can be used, who will generate it, how it will be published, audited, etc., will happen outside the scope of this document.
1.4.3. ? Any identification of any patent(s) which may or may not read on the package.
1.5. Format Requirements:
1.5.1. Needs to be in a syntax that humans can read and write.
1.5.2. Needs to be a syntax that tools can read and write.
1.5.3. Needs to be suitable to be checked for syntactic correctness independent of how it was generated (human or tool).
1.5.4. ? Character set to be used to support international naming. (follow Debian precedent?)
1.5.5. ? Actual specification of fields – below is illustrative rather than agreed on.
1.5.6. ? Discussion: XML vs. simple text to represent fields. Extent human understandable without tool still needs to be discussed.
2. Identification Information
2.1. One instance per package
2.2. Fields:
2.2.1. Version Number for the instance of the SPDX specification.
2.2.1.1. Purpose: version of SPDX specification to use to parse the rest of the file. This will permit future changes to the specification, and retain backwards compatibility.
2.2.1.2. Format: Version: N.N
2.2.1.3. Example: 1.0
2.2.1.4. Intent: Here, parties exchanging Identification Information in accordance with SPDX need to state unambiguously which version of the SPDX specification that Identification Information conforms to.
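As a rough illustration of how a tool might read this field, the following Python sketch parses a line in the `Version: N.N` form given in 2.2.1.2. The function name `parse_spec_version` and the decision to accept more than one digit for each `N` are assumptions made for the example; the draft defines neither.

```python
import re

# Assumption: each "N" in "Version: N.N" is one or more decimal digits.
VERSION_LINE = re.compile(r"^Version:\s*(\d+)\.(\d+)$")


def parse_spec_version(line):
    """Return (major, minor) from a 'Version: N.N' line, or raise ValueError."""
    match = VERSION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid version line: {line!r}")
    return int(match.group(1)), int(match.group(2))


# A reader could compare the tuple against the versions it understands
# before attempting to parse the rest of the file.
major, minor = parse_spec_version("Version: 1.0")
print(major, minor)
```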
2.2.2. Unique Identifier
2.2.2.1. Purpose: An agreed-upon, independently reproducible mechanism that permits unique identification of the specific package this data describes. It must make it possible to determine whether any file in the original package has been changed. Options under consideration: SHA256, ?
(20100210 MVW 2.2.2 MD5 or PGP are also quite widely used for security, allowing e.g. to check that the downloaded package corresponds to the one distributed by the project. There could be a flag indicating whether the checksum has also been checked against the file source. If the file source did not provide a checksum, it can be generated. Probably needs to allow variance for different checksum types (or more background knowledge, if a certain method is to be promoted).)
(20100310 JM 2.2.2.1 – Should we add MD5? That seems to be very common as a signature as well. If we allow multiple signature types would there be a preferred one? We should also have a URL to where the keys are posted so we can check against them. I would add a field here for that. That said, problems sometimes don’t appear right away and some considerable amount of time may have elapsed. In that case, whoever posted the checksums to validate against may have taken them down for whatever reason (i.e. it’s an older version, project folded, etc). Do we want the keys to still be around in this situation? If so, we need to comprehend that.)
2.2.2.2. Format: UniqueID: ?
2.2.2.3. Example: ?
2.2.2.4. Intent: Here, providing a unique identifier for each package should eliminate confusion over which version or modification of a specific package the Identification Information refers to.
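Since SHA256 is one of the options under consideration, here is a minimal Python sketch of computing such a digest over a package archive. It assumes the identifier is simply the SHA-256 of the archive's bytes, which the draft has not decided; the function name `package_sha256` and the chunk size are hypothetical.

```python
import hashlib
import io


def package_sha256(stream, chunk_size=65536):
    """Return the hex SHA-256 digest of a binary file-like object."""
    digest = hashlib.sha256()
    # Read in chunks so that large archives need not fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# In-memory stand-in for a package archive; a real caller would pass
# something like open("foo.tar", "rb") instead.
example_id = package_sha256(io.BytesIO(b"example package contents"))
print("UniqueID:", example_id)
```

Because the digest covers every byte of the archive, changing any file inside the package yields a different value, which is the property 2.2.2.1 asks for. A digest alone does not say who produced the package; that is a separate question raised in the comments above.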
2.2.3. Generation Method
2.2.3.1. Purpose: identify how this information was generated. If manual – who, if tool – identifier and version.
2.2.3.2. Format: Manual: "person name" | Tool: "tool id - version"
2.2.3.3. Examples: ?
2.2.3.4. Intent: Here, the generation method helps the reader of the Identification Information judge its general reliability and accuracy.
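The two alternatives in 2.2.3.2 could be read by a tool along the following lines. This is only a sketch: it assumes straight ASCII double quotes around the value and a literal " - " between tool id and version, and the names `parse_generation_method`, "Jane Doe" and "example-scanner" are invented for illustration.

```python
import re

# Assumption: the value is wrapped in straight ASCII double quotes.
GENERATION_LINE = re.compile(r'^(Manual|Tool):\s*"([^"]+)"$')


def parse_generation_method(line):
    """Parse a 'Manual: "..."' or 'Tool: "id - version"' line into a dict."""
    match = GENERATION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid generation method line: {line!r}")
    kind, value = match.groups()
    if kind == "Manual":
        return {"method": "manual", "person": value.strip()}
    # Split on the last " - " so that tool ids may contain hyphens.
    tool_id, separator, version = value.rpartition(" - ")
    if not separator:
        raise ValueError(f"tool entry lacks ' - version': {line!r}")
    return {"method": "tool", "tool_id": tool_id.strip(), "version": version.strip()}


print(parse_generation_method('Manual: "Jane Doe"'))
print(parse_generation_method('Tool: "example-scanner - 0.1"'))
```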
2.2.4. Creation Time Stamp
2.2.4.1. Purpose: Identify when the analysis was done.
2.2.4.2. Format: Created: YYYYMMDD-HH:MM:SS
(20100310 JM 2.2.4 – Should we add a time zone or say it’s based on GMT? Alternatively we could adopt a date/time format from an RFC but I’m okay with this one.)
2.2.4.3. Example: Created: 20100129-18:30:22
2.2.4.4. Intent: Here, the Time Stamp can help verify whether the analysis needs to be updated. For example, changes in the software industry, such as a court holding, may require a different reading of a particular license identification after a certain date.
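The `Created: YYYYMMDD-HH:MM:SS` format maps directly onto Python's `datetime.strptime`. The sketch below parses and re-emits the example from 2.2.4.3. The helper names are assumptions, and because the time zone question raised in the comment above is still open, the result is a naive `datetime` with no zone attached.

```python
from datetime import datetime

CREATED_PREFIX = "Created:"
# strptime/strftime pattern for YYYYMMDD-HH:MM:SS
CREATED_FORMAT = "%Y%m%d-%H:%M:%S"


def parse_created(line):
    """Parse a 'Created: YYYYMMDD-HH:MM:SS' line into a naive datetime."""
    line = line.strip()
    if not line.startswith(CREATED_PREFIX):
        raise ValueError(f"missing {CREATED_PREFIX!r} prefix: {line!r}")
    stamp = line[len(CREATED_PREFIX):].strip()
    # strptime raises ValueError itself if the stamp is malformed.
    return datetime.strptime(stamp, CREATED_FORMAT)


def format_created(moment):
    """Render a datetime back into the 'Created: ...' line format."""
    return f"{CREATED_PREFIX} {moment.strftime(CREATED_FORMAT)}"


created = parse_created("Created: 20100129-18:30:22")
print(created.isoformat())
print(format_created(created))
```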
2.2.5. Independent Review/Audit
2.2.5.1. Purpose: identify who reviewed the tool results, or any other reviewer of the original analysis; equivalent to “signed off” or “reviewed by”.
(20100310 JM 2.2.5 – This one makes me a little nervous. If someone puts something there what does it mean? Have they verified all the information is factual? Independent Audit implies to me that someone other than the Package creator or even the project (?) has looked at this and said the information is <?>. )
2.2.5.2. Format: Reviewed by: “person name”
2.2.5.3. Example: ?
2.2.5.4. Intent: Here, as time progresses, certain reviewers will begin to gain credibility as reliable. This field is intended to make such information transparent.
2.2.6. ??
(20100310 JM 2.0 – Would it be useful to have a marker or token at the top of the facts file that shows it uses the Package Facts format? This may make life easier for tools which are likely going to have to process different file formats. It would be the first thing that must appear in the file.)
3. Common Overview Information
(20100210 MVW License Applied by Project - At Validos, we also record the license applied by the project from the project’s website (storing that information as a PDF printout) and then compare it with the package information. Sometimes the package doesn’t have any information. (For conflicts, we use a set of “approved conclusions”.) Consider adding a section that holds the URL of the page with the project’s statement on its license, and perhaps even adding a separate PDF printout to the metadata info?)
3.1. One instance per package
3.2. Fields:
3.2.1. Formal Name
3.2.1.1. Purpose: Full name given by originator with version information. ? Permit international extended characters in character string or restrict ?
3.2.1.2. Format: ?
3.2.1.3. Example: ?
3.2.1.4. Intent: Here, the formal name of each package is an important conventional technical identifier to be maintained for each package.
3.2.2. Specific Package Identifier
3.2.2.1. Purpose: Machine name of package.
3.2.2.2. Format: identifier.suffix
3.2.2.3. Examples: foo.tar, foo.rpm, ?
3.2.2.4. Intent: Here, the extension is an important conventional technical element to be carried with each package, particularly given the occasional loss of extension.
3.2.3. Official Source Location
3.2.3.1. Purpose: identify where the original version of this package resides (at time of analysis).
3.2.3.2. Format: download URL
3.2.3.3. Example: ?
(20100310 JM 3.2.3 – We should allow usage of git, svn, etc values)
3.2.3.4. Intent: Here, where to download the exact package being referenced is a critical verification and tracking datum.
3.2.4. Declared License for Package
3.2.4.1. Purpose: use a standard way of referring to the license and its version. See Section 5 for standardized license short forms. If more than one license is in effect, list the license the package defaults to and indicate that an alternate license is present.
3.2.4.2. Format: identifier | other
3.2.4.3. Example: ? (something like GPL2.0)
(20100210 MVW 3.2.4: At Validos, we call this the “main license”, i.e. the license the project itself is applying. (The term is not the best, since most of the code can be under some other license.) However, the license the project is using is a sort of top-level license (compliance and license compatibility review can often be needed between sub-packages too, so this is also more a terminology question). Perhaps this item should state the “License Applied by Project” and then indicate if other licenses are present too ) (20100407 KS MVW's comments were against original term of "Formal", which has been under discussion. I think they've been addressed, but replicating here for completeness. Currently suggesting use of "Declared" terminology so updated field to have this. )
3.2.4.4. Intent: This is simply the license identified in text in the actual package source code files (typically in the header of each package file.) This field may have multiple declared licenses, if multiple licenses are recited in the source code files of the package.
3.2.5. License(s) Present
3.2.5.1. Purpose: list all licenses found by scanning the files in the package.
3.2.5.2. Format: identifier (see section 5) | other; one per line.
3.2.5.3. Example: GPLv3.0
3.2.5.4. Intent: Here, we intend to capture additional licenses under which the package is licensed. The license(s) for this field are licenses which are not visibly identifiable in the actual source code, but rather identified by other means, e.g., scanning tools, by the reviewer.
3.2.6. –removed-
3.2.7. Declared Copyright Holder of Package
3.2.7.1. Purpose: identify the author and licensor of package itself. ? Permit international extended characters in character string or restrict ?
3.2.7.2. Format: ?
3.2.7.3. Example: ?
3.2.7.4. Intent: Here, by identifying the actual author(s), some ambiguities, e.g., under which license the author(s) were intending to license the package, may be resolvable by knowing who to contact for clarity.
3.2.8. Formal Copyright Date of Package
3.2.8.1. Purpose: Identify the date this package was created. Individual files inside package may have different copyright dates.
3.2.8.2. Format: YYYY
3.2.8.3. Example: 2010
3.2.8.4. Intent: Here, we can begin to track, for example, when copyright protection expires and the package falls into the public domain.
3.2.9. ???
4. File Specific Information
(20100210 MVW File Level Data - One instance per file seems overwhelming if these instances are separate files. On the other hand, if they are within one file, it should be ok. - The practical result of one instance should be as short as possible. - There will be repetition (same copyright holder, same years, same license for many files); how about a standardized method of combining this info: e.g. only a list of path+file for all files falling under the same license and then separations under the license for copyright holders and then lastly for differences in years. This will avoid repetition. - As an option, the standard could standardize the license headers (or part of them) in the files themselves. This has the benefit of not creating another database (the database would be the source files), and can easily use the same version control systems as for the rest of the source code. Projects would be more likely to accept this: a standard for adding the license information in the beginning of the file could help them in practice and not just create another work step. Separate package meta-data would then be required only for files containing no license headers. Of course, this is not an option for existing packages that need to be used. However, once the file package is analysed and there is separate meta-data, that information could be then dropped (if decided by the project) into the source files themselves. Removes all repetition and can be machine read. )
(20100310 JM 4.0- Would files that do not have licensing information be present in this block? I would think so and the relevant fields would be blank. We may want to have an explicit statement versus leaving blank. Interested in others thoughts.)
(20100310 JM 4.0 – Do we need an exceptions field to capture exceptions that are written in to a license? Likely difficult to farm from existing code per my comment in 4.2.2 but seems useful. )
4.1. One instance for every file in package
4.2. Fields:
4.2.1. File Name
4.2.1.1. Purpose: identify the path to the file that corresponds to this summary information.
4.2.1.2. Format: [directory/]filename.suffix
4.2.1.3. Example: bar/foo.c
4.2.1.4. Intent: Here, any confusion over where a file needs to hierarchically be placed for proper functionality is mitigated.
4.2.2. File Type
4.2.2.1. Purpose: Identify common types of files where there may be different treatment of copyright and license information: source, binary, machine generated, ??
4.2.2.2. Format: ?
4.2.2.3. Example: ?
4.2.2.4. Intent: Here, this field is basically the "best available" format field, from a developer perspective.
(20100310 JM 4.2.2 – I like the field but I’m struggling with whether it will be difficult to automate the generation of this information if that’s there and whether to be concerned about that. Specifically I am wondering about auto generated files that come from tools. Here is my thought process. I can see where a project could farm everything in 4.0 from existing source except for possibly this field. If so that means they have to either answer this manually for every file (think of the Linux kernel) or try and adopt (as an example) a keyword approach and add it to files.)
4.2.3. License(s)
4.2.3.1. Purpose: License governing file if known. This will either be explicit in file, or be expected to default to package license. Use a standard way of referring to license and its version. See Section 5.0 for standardized license references. If more than one in effect, list all licenses.
4.2.3.2. Format: ? [identifier,]* [identifier | "string"]
4.2.3.3. Example: GPL2.0,BSD,"xyz license type"
4.2.3.4. Intent: Here, the intent is to have a uniform method to refer to each license with specificity to eliminate any license confusion. For example, the 3-clause BSD license would have a different license identifier than the 4-clause BSD license.
(20100210 MVW 4.2.3 A package may contain sub-packages, which may have their own “main license”. As an “approved conclusion” we default a file with no license information to the closest package level license (not necessarily the license of the package under inspection, but the license of a sub-package), unless there is contrary information. The distinction of package and sub-packages is relevant here.)
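To make the license list format in 4.2.3.2 concrete, the Python sketch below splits a value into standard identifiers and free-text strings. It assumes straight ASCII double quotes mark a free-text license name and that quoted names contain no embedded quotes; the function name and the "identifier"/"other" labels are illustrative only.

```python
import re

# One list item: either a quoted free-text name or a bare identifier,
# followed by a comma or the end of the value.
LICENSE_TOKEN = re.compile(r'\s*(?:"([^"]*)"|([^",\s][^",]*?))\s*(?:,|$)')


def parse_license_list(value):
    """Split a 4.2.3-style value into (kind, text) pairs."""
    licenses = []
    pos = 0
    while pos < len(value):
        match = LICENSE_TOKEN.match(value, pos)
        if match is None:
            raise ValueError(f"cannot parse license list at offset {pos}: {value!r}")
        quoted, bare = match.group(1), match.group(2)
        if quoted is not None:
            licenses.append(("other", quoted))
        else:
            licenses.append(("identifier", bare))
        pos = match.end()
    return licenses


print(parse_license_list('GPL2.0,BSD,"xyz license type"'))
```

Keeping the two kinds apart lets a tool look up identifiers in the Section 5 table while passing free-text names through untouched.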
4.2.4. Copyright(s)
4.2.4.1. Purpose: identify the copyright holders and associated dates of their copyright that are in this specific file if known. Note: Copyright holder identifier may have developer names, companies, email addresses, so we’ll probably need a generic string mechanism (including international characters). Since there may be multiple per file, need a way of having separators between them.
4.2.4.2. Format: [ "copyright holder":"date(s)" ]*
4.2.4.3. Example: "Linus Torvalds":"1996-2010"
4.2.4.4. Intent: Here, similar to identifying the actual author(s) (above), identifying the copyright holder(s) means they may be contacted if licensing issues exist with the package, or to request distribution under another license more compatible with a given implementation, for example.
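A possible reading of the copyright format in 4.2.4.2 is sketched below in Python. The draft leaves the separator between multiple entries open, so this example assumes entries are separated by optional whitespace and that both holder and date(s) are wrapped in straight ASCII double quotes. The function name and the second holder in the usage line are made up.

```python
import re

# One "copyright holder":"date(s)" pair, with optional surrounding whitespace.
COPYRIGHT_ENTRY = re.compile(r'\s*"([^"]*)"\s*:\s*"([^"]*)"\s*')


def parse_copyrights(value):
    """Return a list of (holder, dates) pairs from a 4.2.4-style value."""
    entries = []
    pos = 0
    while pos < len(value):
        match = COPYRIGHT_ENTRY.match(value, pos)
        if match is None:
            raise ValueError(f"cannot parse copyright entry at offset {pos}: {value!r}")
        entries.append((match.group(1), match.group(2)))
        pos = match.end()
    return entries


print(parse_copyrights('"Linus Torvalds":"1996-2010"'))
print(parse_copyrights('"Linus Torvalds":"1996-2010" "Example Holder":"2009"'))
```

The holder is kept as an opaque string, which leaves room for names, companies, email addresses and international characters as noted in 4.2.4.1.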
4.2.5. ?
5. Standard License Identifiers
5.1. Rationale for choosing which licenses receive standardized identifiers. Focus on standardizing the most commonly used licenses rather than all of them. Align with any other standardization efforts underway that will meet the need.
5.2. Table of standard licenses and their identifiers
Identifier | Full name | Official Source Text
GPL2.0 | GNU General Public License (GPL) Ver. 2, June 1991 |
GPL3.0 | ... |
6. Definitions
6.1. Package: ...
6.2. Date range: [YYYY,]*[YYYY-]YYYY. Syntax for multiple ranges is still needed.
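The date range syntax above can be checked with a short regular expression. The Python sketch below validates a value against `[YYYY,]*[YYYY-]YYYY` exactly as written and turns it into (start, end) pairs; the function name and the tuple representation are assumptions. Because the grammar allows a range only as the final item, a value with two ranges is rejected, which illustrates why the note says a syntax for multiple ranges is still needed.

```python
import re

# Literal transcription of [YYYY,]*[YYYY-]YYYY
DATE_RANGE = re.compile(r"(?:\d{4},)*(?:\d{4}-)?\d{4}")


def parse_date_range(value):
    """Return a list of (start_year, end_year) pairs for a valid value."""
    if DATE_RANGE.fullmatch(value) is None:
        raise ValueError(f"not a valid date range: {value!r}")
    spans = []
    for part in value.split(","):
        start, _, end = part.partition("-")
        # A single year is represented as a span of length one.
        spans.append((int(start), int(end or start)))
    return spans


print(parse_date_range("1996-2010"))
print(parse_date_range("1999,2001,2005-2010"))
```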