Discussion: How to handle different deserialization scenarios #860

Open
goneall opened this issue Aug 18, 2024 · 4 comments
Labels
serialization Something about the representation of data in bytes
Comments

@goneall
Member

goneall commented Aug 18, 2024

The serialization documentation has fairly detailed descriptions of how to serialize, but not as much on deserialization approaches and scenarios.

Specifically, it would be good to (decide and) document how the resultant model would be represented in the following scenarios:

  1. Deserializing a JSON-LD file with a single Element
  2. Deserializing a JSON-LD file with multiple Elements and no SpdxDocuments
  3. Deserializing a JSON-LD file with multiple Elements and a single SpdxDocument
  4. Deserializing a JSON-LD file with multiple Elements and multiple SpdxDocuments

How do we handle creating the in-memory SPDX documents in each of these scenarios?
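To make the four scenarios concrete, here is a minimal sketch that classifies a JSON-LD serialization by counting its SpdxDocuments. It assumes the objects sit in a top-level `@graph` array (or a single object at the root when there is no `@graph`) and that each object carries a `type` key. The function names `load_elements` and `classify` are made up for illustration and do not come from any SPDX library.

```python
import json


def load_elements(text):
    """Return the top-level objects found in a JSON-LD serialization."""
    data = json.loads(text)
    if "@graph" in data:
        return list(data["@graph"])
    # Assumption: a file without @graph holds a single object at the root.
    return [{k: v for k, v in data.items() if k != "@context"}]


def classify(text):
    """Map a serialization onto scenario 1-4 from the list above."""
    elements = load_elements(text)
    if not elements:
        raise ValueError("serialization contains no objects")
    docs = [e for e in elements if e.get("type") == "SpdxDocument"]
    if not docs:
        # Scenario 1: a single Element; scenario 2: several, no SpdxDocument.
        return 1 if len(elements) == 1 else 2
    # Scenario 3: exactly one SpdxDocument; scenario 4: more than one.
    return 3 if len(docs) == 1 else 4
```

A file holding only a single SpdxDocument is reported as scenario 3 here, which is one possible reading; the list above does not say how that case should be treated.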

@goneall
Member Author

goneall commented Aug 18, 2024

For 1. and 2. above, suggest creating an SpdxDocument in memory "on the fly" with all of the Element(s) represented as root elements.

For 3., should we assume the single SpdxDocument represents the serialization information? Is there any validation we could do to confirm this? If we assume it represents the serialization information, then we can augment the serialized SpdxDocument with the information from the file itself to complete the in-memory representation.

Scenario 4. is the most challenging. It's quite likely one of the SpdxDocuments represents the serialization itself - but which one? We would need some way of determining which SpdxDocument describes the serialization - or we treat it the same as not having any SpdxDocument.
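The handling suggested in this comment could be sketched roughly as follows. This is an illustration under stated assumptions, not how any existing library behaves: objects are plain dicts with `type` and `spdxId` keys, a serialized SpdxDocument lists its roots under `rootElement`, and `InMemoryDocument`, `build_document`, and the fallback ID are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryDocument:
    """Hypothetical in-memory stand-in for an SpdxDocument."""
    spdx_id: str
    root_elements: list = field(default_factory=list)
    synthesized: bool = False  # True when created "on the fly"


def build_document(elements, fallback_id="urn:example:generated-document"):
    """Pick or synthesize the document that represents the serialization."""
    docs = [e for e in elements if e.get("type") == "SpdxDocument"]
    if len(docs) == 1:
        # Scenario 3: assume the single SpdxDocument describes the file.
        doc = docs[0]
        return InMemoryDocument(doc["spdxId"], list(doc.get("rootElement", [])))
    # Scenarios 1 and 2 (no SpdxDocument) and scenario 4 (ambiguous):
    # synthesize a document with every identified object as a root.
    roots = [e["spdxId"] for e in elements if "spdxId" in e]
    return InMemoryDocument(fallback_id, roots, synthesized=True)
```

The `synthesized` flag lets a caller tell a generated document apart from one that was actually present in the file, which matters if the document is later written back out.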

@JPEWdev
Contributor

JPEWdev commented Aug 20, 2024

I can tell you how the shacl2code bindings deal with this. First of all, since they are not SPDX specific, there is no requirement that an SpdxDocument is present. The bindings have a separate concept of a SHACLObjectSet, which is the container that represents a set of objects to be serialized (or the destination for deserialization). It also does some indexing bookkeeping (e.g. so you can look up an object by its ID quickly), and performs "linking", where an object property that references another object by a string IRI is replaced with a reference to the actual object with that IRI, if it exists in the SHACLObjectSet. In this case, SpdxDocument is actually just a slightly special element handled at higher layers (e.g. the Yocto SPDX code tracks the SpdxDocument separately, makes sure there is only one per SHACLObjectSet, etc.).
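The indexing and linking idea can be illustrated with a small stand-alone sketch. This is not the shacl2code API: the `ObjectSet` class, its methods, and the use of plain dicts keyed by `spdxId` are assumptions made only to show the mechanism.

```python
class ObjectSet:
    """Hypothetical container: indexes objects by ID and links references."""

    def __init__(self):
        self._by_id = {}

    def add(self, obj):
        # Index the object so it can be looked up by its ID.
        self._by_id[obj["spdxId"]] = obj

    def find(self, iri):
        return self._by_id.get(iri)

    def _resolve(self, value):
        # Swap a string IRI for the object it names, if that object is present.
        if isinstance(value, str):
            return self._by_id.get(value, value)
        return value

    def link(self, reference_properties):
        """Replace IRI strings in the named properties with object references."""
        for obj in self._by_id.values():
            for prop in reference_properties:
                if prop not in obj:
                    continue
                value = obj[prop]
                if isinstance(value, list):
                    obj[prop] = [self._resolve(v) for v in value]
                else:
                    obj[prop] = self._resolve(value)


objects = ObjectSet()
package = {"type": "software_Package", "spdxId": "urn:example:pkg"}
relationship = {
    "type": "Relationship",
    "spdxId": "urn:example:rel",
    "from": "urn:example:pkg",
    "to": ["urn:example:missing"],
}
objects.add(package)
objects.add(relationship)
objects.link(["from", "to"])
```

After `link`, `relationship["from"]` is the package dict itself, while the unresolved IRI in `to` stays a string. Nothing in the container needs to know whether an SpdxDocument is present.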

I really believe that this approach is the right way to go. Don't encumber users with the semantics of SpdxDocuments if they don't want them. It's frustrating for users if they need to (de)serialize 1 or 2 in your examples, but can't because the bindings have intertwined the concept of an SpdxDocument with "a set of things to (de)serialize". Code at a higher level can make it easier to deal with SpdxDocument, since that is the common case, but it's a "layer" on top, not the core functionality. The core bindings should avoid enforcing "policy" on users about how they do things and focus on the "mechanism" that enables them to do what they need. The "policy" is the responsibility of a higher level of abstraction that makes life easier for the common cases. If you force policy on the core bindings, your bindings are not going to be very flexible and you can end up with a lot of weird edge cases needing to be encoded because you made choices for the users they didn't like :)

IOW, with the shacl2code Python bindings (and the C++ bindings I'm working on), none of these 4 are a problem at all, since SpdxDocument is not special.

@goneall
Member Author

goneall commented Aug 21, 2024

@JPEWdev - I think your approach for the lower level language bindings is fine. The libraries I'm writing have to deal with the higher level semantics, hence the need to solve the issue.

The SpdxDocument represents metadata about the serialization itself, and in some scenarios it can be quite important. One example is verifying references to SPDX elements in external files. The information to verify is stored in the SpdxDocument. If we don't know which SpdxDocument contains the metadata, we can't verify the external document.

I'm starting to form the opinion that we need to fix this in the serialization schema - either add an optional property at the root level, or require that only one SpdxDocument can be present in the @graph such that the SpdxDocument data is unambiguous. The former would be a non-breaking change. Code which doesn't need the metadata can just ignore it.
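A reader could combine the two options roughly as in the sketch below. No such root-level property exists in the schema today: the key name `serializationDocument` is invented purely for illustration, as is the function name, and the fallback to a unique SpdxDocument is only one possible policy.

```python
def find_serialization_document(data, root_key="serializationDocument"):
    """Return the SpdxDocument describing the serialization, or None.

    `root_key` is a hypothetical root-level property holding the spdxId
    of the SpdxDocument that represents the serialization.
    """
    graph = data.get("@graph", [])
    docs = {
        e["spdxId"]: e
        for e in graph
        if e.get("type") == "SpdxDocument" and "spdxId" in e
    }
    declared = data.get(root_key)
    if declared is not None:
        # Option 1: an explicit root-level pointer removes the ambiguity.
        if declared not in docs:
            raise ValueError(f"{root_key} names an unknown SpdxDocument: {declared}")
        return docs[declared]
    if len(docs) == 1:
        # Option 2: a single SpdxDocument in @graph is unambiguous.
        return next(iter(docs.values()))
    # Zero or several SpdxDocuments and no pointer: cannot decide.
    return None
```

Code that does not need the metadata never calls this function, which matches the point that an optional property can simply be ignored.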

@bact added the serialization label Aug 28, 2024
@goneall
Member Author

goneall commented Nov 14, 2024

Moving this to 3.1 (assuming we choose a non-breaking change approach)


3 participants