Data model and vocabulary

nchauvat · January 19, 2021, 10:44am

Do you know this project that has similar goals ?

dachary · January 19, 2021, 4:37pm

A post was merged into an existing topic: Do you support the fedeproxy project?

dachary · January 19, 2021, 11:33am

Yes, and tried to get in touch but the author is currently not responsive. It is complementary and it would make sense to provide real world feedback collected by fedeproxy users (even if they are just a handful ) to help develop the vocabulary and data model.

I did not look into the history of the forgefed project enough to figure out if they were inspired by https://en.wikipedia.org/wiki/DOAP and the ontologies doap-* found at https://ontologi.es/ (doap-bugs + doap-changesets + doap-tests + doap-deps) that you mentioned.

nchauvat · January 19, 2021, 1:24pm

If I recall correctly, they were trying to reinvent that wheel.

arthurlogilab · January 19, 2021, 2:05pm

Great news to see these topics going forward

Small typo here : I think you’re talking about Nicolas Chauvat (@nchauvat here)

We exchanged some links with forgefed creators back in octobre 2019, but it didn’t go any further : https://talk.feneas.org/t/any-use-for-the-semanticweb-ontologies-for-forgefed-doap-for-example/170

Maybe further contact can be established using the forum : https://talk.feneas.org/c/forgefed/10 before the two projects grow further apart ? Oh and since this is related to ActivityPub too, you might have some luck contacting them on the fediverse : https://floss.social/@forgefed & https://todon.nl/@fr33domlover & https://floss.social/@bill_auger (although there doesn’t seem to be recent activity).

dachary · January 19, 2021, 4:38pm

Thanks for the links and the context, I completely missed that! And I concur, the main issue here is that there is no activity in the past three months or so.

dachary · January 20, 2021, 6:55am

I would like to use http://ontologi.es/doap-changeset# to represent a patch / commit and looking for:

a python package that I could use to wrap this into an ActivityPub payload, similar to RDF::DOAP
a description explaining what it means

I suppose this is where the patch / commit should be included but…

dcs:Change
	a owl:Class ;
	vs:term_status "stable" ;
	rdfs:isDefinedBy dcs: ;
	rdfs:label "Change"@en ;
	rdfs:comment "A change to something. Use rdfs:label to briefly describe the change. Use rdfs:comment for additional information."@en .

nchauvat · January 20, 2021, 5:07pm

If I read the ontology correctly, there is nothing in there to actually include the diff.

You could try to extend it as in:

<https://heptapod.logilab.fr/changesets/abcd123> a dcs:Changeset ;
    fedeproxy_namespace:has_text_diff "<diff in textual form>" ;
    rdfs:label "ci: fixing the typo in .gitlab-ci.yml" ;
    dc:creator <https://heptapod.logilab.fr/users/nico> ;
    dc:date "2021-01-20:18:06"^^xsd:datetime.

I tried to answer quickly, but this probably needs more thought.

nchauvat · January 20, 2021, 5:09pm

To manipulate rdf data in Python, I recommend rdflib.readthedocs.io/

Do you have an example in python or in pseudo-code of what you would want to write ?

dachary · January 20, 2021, 8:25pm

This is actually very helpful. I’m not sure yet what I’d like to write because I’m unsure how to use ActivityPub at the moment. Something like:

outbox->addActivity(buildPayloadFromDiff(patch))

If that makes any sense

nchauvat · January 20, 2021, 9:25pm

I have very superficial knowledge of ActivityPub.

From what I understand, the outbox you mention would be the outbox of a Repository or a Project Actor and the Activity would be the changeset added to the Repository. Is that right ?

I also assume that parameter to addActivity() would be a json-ld structure returned by buildPayloadFromDiff().

Hence in buildPayloadFromDiff you could write something along the lines of:

import rdflib # make sure the rdflib jsonld plugin is installed https://github.com/RDFLib/rdflib-jsonld
from rdflib import RDF, RDFS, DC, URIRef, Literal
DCS = rdflib.Namespace("http://ontologi.es/doap-changeset#")
FEDE = rdflib.Namespace("https://fedeproxy.eu/onto/")

def buildPayloadFromDiff(changeset):
    g = rdflib.Graph()
    uid = URIRef(changeset.uid)
    g.add( (uid, RDF.type, DCS.Changeset) )
    g.add( (uid, FEDE.has_text_diff, Literal(changeset.patchdiff))
    g.add( (uid, RDFS.label, Literal(changeset.message.lines[0]))
    g.add( (uid, RDFS.comment, Literal(changeset.message))
    g.add( (uid, DC.creator, URIRef(changeset.author) )
    g.add( (uid, DC.date, Literal(changeset.date) )
    return g.serialize(format='json-ld')

Of course it is only a sketch and there is a lot to do, but hopefully you get the general idea.

dachary · January 21, 2021, 7:10am

My understanding of ActivityPub is not good either but yes, that’s what I have in mind. A Project as in “software project including repository and all other things, issues, merge/pull requests etc.”.

Although an Activity is JSON-LD, the “payload” probably does not need to be JSON-LD. I’m sorry for not using the right vocabulary. Maybe it’s content? Anyway, you understood what I meant despite me being very vague, I’m impressed

The example you provide is crystal clear, thank you. Assuming all aspects of a software project is represented in a DVCS (not just the code but also issues etc.), the changeset may be the only kind of content that needs to be federated.

The data model to represent an issue or a merge/pull request needs to be decided but it may not be a requirement to implement ActivityPub in fedeproxy. Here is a tentative example to illustrate what I mean:

On a given project P on GitLab
I comment on an issue for P on GitLab
Fedeproxy for GitLab uses the GitLab API to GET all my issues
Every issue is exported by fedeproxy in a separate file in the repository of the P project on GitLab in the format used by GitLab import/export
My comment is represented by the diff between the updated issue files and the previous issue files
Fedeproxy for GitHub receives the activity “apply patch/changeset” from Fedeproxy for GitLab and applies the patch on the issues that are on the P project on GitHub
For every issue modified by the patch, Fedeproxy for GitHub reads the issues and uses the GitHub API to apply the differences, e.g. for each of my comments verifies it is up to date (creates a new one if it does not already exists, update the message if it existed but does not contain the same thing, removes if it is absent)

Here I assume, for the sake of a draft implementation, that the data model used to represent issues etc. is the GitLab import/export format but it really is a way to not address the issue of a well defined generic format independent of forges rather than advocating the GitLab import/export format is a good format.

What I find appealing in this approach is that the definition of the data model to represent a software project could be 100% decoupled from the definition of the protocol and data model used for federated forges to communicate.

Does that make any sense ? I may have overlooked something that makes this completely impractical, please be blunt Or maybe this is just gibberish and does not make sense at all. Again, please be blunt.

nchauvat · January 21, 2021, 9:28am

I think of it differently. In my mind, every object could be an actor syndicated using ActivityPub: projects, tickets, dvcs, etc.

For example I find interesting to make a gitlab issue appear as a toot in mastodon and when users reply to it, the comments are added to the issue in gitlab.

Or to have a project replicated from github in our heptapod, and when we add an issue to it, its get added to github on the other side.

Now that I think of it, I am not so sure FEDE.has_text_diff is such a good idea, because the underlying DVCS already has a synchronisation mechanism, why redoing it ?

Maybe what is needed is just to exchange with ActivityPub the metadata describing the different events/activities of the actors (project, issue, repository, pipelines, merge requests, etc), but not the code itself ?

As I said, I know little about ActivityPub and one thing I have not understood is why sending the content when doing outbox.addActivity() instead of just stating “the object at this url changed” and let the client GET the new version at this url in case it wants the new version.

We are dealing with data that is exposed on the web and identified by a URL, why should we encapsulate that data into a new protocol instead of letting the client GET the data using HTTP ?

dachary · January 21, 2021, 10:27am

I agree that this is the better option: sending the URL where the object is located makes more sense than sending the object itself.

The rest of your message gives me a lot to think about and it will take me a little longer to reply

dachary · January 21, 2021, 5:52pm

This is precisely the use case I’m most interested in. Could not agree more

Very good point and I was going in the wrong direction, I stand corrected. Looking at the PeerTube documentation and trying to transpose in the context of fedeproxy:

A PeerTube server is the equivalent of a software project (e.g. https://github.com/ceph/ceph)
A PeerTube video (represented by this ActivityPub extension) is the equivalent of a commit in the software project
The protocol to fetch a PeerTube video (i.e. http, webtorrent, etc.) is the equivalent of the repository protocol (i.e. git protocol, mercurial protocol)
The format of the content of a PeerTube video (i.e. mp4, webm) is the equivalent of the format of the data contained in the repository (i.e. anything really)

To continue with the comparison with PeerTube, there are multiple protocols and formats, including a custom REST API and ActivityPub with extensions. In the case of fedeproxy there are git/mercurial protocols and we’re discussing here about what should be communicated via the ActivityPub protocol. And I also think interactions with software development issues via ActivityPub messages are in scope.

I spent time looking at

https://www.w3.org/TR/activitypub/
https://www.w3.org/TR/activitystreams-core/
https://www.w3.org/TR/activitystreams-vocabulary/
https://www.w3.org/TR/json-ld/ and now better understand @nchauvat advice regarding https://ontologi.es/ (doap-bugs + doap-changesets + doap-tests + doap-deps) because it really is designed for reusing RDF with JSON

in addition to the PeerTube documentation and the associated code. It begins to make sense but I’m still unsure about how to put it all together. The above use case can be rewritten as:

On a given project P on GitLab
User U comments on an issue for P on GitLab
Fedeproxy for GitLab:
- uses the GitLab API to get the issue
- saves it in a file using the format used by GitLab import/export
- commits the file in the repository of the project (in a branch dedicated to saving issues)
- published the permalink of the commit including my comment as an ActivityPub activity in the outbox of user U on GitLab
Fedeproxy for GitHub polls the fedeproxy for GitLab and upon the reception of the permalink
- fetches the commit from the GItLab repository
- reads the content of the file describing the issue and uses the GitHub API to apply the differences, e.g. for each of my comments verifies it is up to date (creates a new one if it does not already exists, update the message if it existed but does not contain the same thing, removes if it is absent)

I realized that although it would be nice to extend ActivityPub to represent issues and other aspects of a software project in the way forgefed does, it is only one of several ongoing efforts to have a format representing a software project (other than the code itself):

https://forgefed.peers.community/vocabulary.html
https://ontologi.es/ (doap-bugs + doap-changesets + doap-tests + doap-deps)
https://docs.gitlab.com/ce/user/project/settings/import_export.html
etc.

The above use case proposes to use https://docs.gitlab.com/ce/user/project/settings/import_export.html because it is easier in the context of GitLab. But that may be a mistake because it will never evolve into a standard. However I’m conflicted about the other two because I’m not entirely sure they will evolve into a standard either and they would be significantly more complicated to use to represent all aspects of issues and pull/merge requests that are in the scope of fedeproxy.

This is where I am today

dachary · January 21, 2021, 10:05pm

I reached a point where I need to write code to better understand how all this could work. And since I’m very new to all this the chances are very high that all of it will be thrown away once I have a better understanding. I’m leaning towards:

Representing issues using the GItLab import/export format
Storing them in a dedicated branch of the code repository of the software project
Announcing issues in the same way PeerTube does with videos using the permalink of a commit as the object instead of the permalink of a video
Copying pasting ActivityPub code from activitystreams2

nchauvat · January 21, 2021, 11:39pm

Thank you for this writeup.

I had overlooked the fact that you wanted to setup a proxy that would use the API to get the data, then expose that data using ActivityPub. For some reason I was thinking about modifying GitLab, but of course this can not be done with GitHub. Ok.

That also explains why you want to export everything, store it onto the disk and diff it the next time you query the project data.

Could we separate the two ? There is one communication channel between the forge and the proxy and another one between the proxy and the fediverse using ActivityPub.

My comments were targeted at the second communication channel. This is the place where I think using RDF written as JSON-LD and the doap* ontologies is making the most sense, because the doap data model is already in use.

For example https://www.cubicweb.org/project/cubicweb?vid=doap is a description of the cubicweb project using DOAP. It is serialized as RDF/XML instead of RDF/JSON-LD, but that is a detail. What is important is that any application that can load and transform DOAP data can make sense of it. I am convinced this is something to leverage and build on.

nchauvat · January 21, 2021, 11:41pm

Announcing issues in the same way PeerTube does with videos using the permalink of a commit as the object instead of the permalink of a video

I look forward to this

A minimal viable product could be “fedeproxy is able to publish issues using ActivityPub and other tools of the fediverse can comment on these issues and fedeproxy will write the comments back to the forge”.

What do you think ?

dachary · January 22, 2021, 6:54am

Exactly. Hence the “proxy” part of the project, to be able to do something when modifying the server code is not an option, which is true for GitHub.com but also for GitLab.com. However I’m convinced using the API and a proxy must be designed as a temporary measure that will disappear once the forges have federation natively implemented. Which is really tricky because it’s so easy to forget the “temporary” aspect when working on a project. One way to achieve that is to translate every aspect of the proxy into merge requests for GitLab/Gitea and try really hard to get them merged. The discussion and resistance, the need to split such merge requests into tiny baby steps, etc. should keep fedeproxy from drifting into something that is too far ahead and (most likely) disconnected from what is ultimately desirable.

That sounds sane, is it something like:

GitHub <- GitHub API -> fedeproxy
fedeproxy <- ActivityPub -> fediverse

Understood, thanks for clarifying. I’d like to say it’s crystal clear but I’m afraid not. Thanks to your explanations it now makes a lot more sense. 48h ago I was still quite confused about many aspects. But I still think I’m missing a few very important parts of the puzzle.

I should take a look at bots designed to be ActivityPub clients (https://github.com/tootsuite/mastodon-bridge, https://github.com/yogthos/mastodon-bot) and others with a similar purposes (https://github.com/zedeus/nitter)

In an earlier comment I kind of dismissed this idea saying it’s in scope and implying that it may not be something that needs to be implemented in a first iteration. I’m still unconvinced but I feel that I’m missing something. You see something that I don’t and that gives me pause.

During breakfast this morning I realized fedeproxy should be able to leverage the UI of Mastodon to (for instance) express: “https://github.com/ceph/ceph” follows “https://mygitlab.com/myuser/ceph”. If, behind the scene, fedeproxy makes it so both are seen as ActivityPub conformant servers, maybe that would work. Or maybe it’s twisted. In any case, once a software project talks ActivityPub, there can be bots interpreting what it says in useful ways and humans using existing UIs to perform all ActivityPub conformant operations such as following without reinventing the wheel.

As you can see, it’s still very very fuzzy and confused but it’s making progress. I think

arthurlogilab · January 22, 2021, 9:34am

Another implementation (javascript, not python), you can take a look at is the one used by https://github.com/dariusk/rss-to-activitypub (which reminds me of the proxy side of this discussion, RSS being the API calls, and activitypub on the other side) : Express ActivityPub Server https://github.com/dariusk/express-activitypub

The UX is really simple : enter a RSS URL, choose an activitypub username, which can then be followed from the fediverse (read-only, no interaction).