`piggyback` vs the alternatives

There are many alternatives to `piggyback`, and after considerable experience I haven’t found any that ticked all the boxes for me:
Git LFS provides the closest user experience to what I was going for. It stands out above all other alternatives by providing both the best authentication experience (relying directly on any of the standard `git` authentication mechanisms such as HTTPS, SSH keys, or app integration) and the most legitimate version control of the data. However, there are many show-stoppers to using Git LFS for me.
GitHub pricing & resulting problems for GitHub’s fork / PR
model. Described
eloquently here. Basically, despite generous rates and free data
options everywhere else, GitHub’s LFS storage and bandwidth not only
cost a lot, but also make it impossible to have public forks and pull
request for your repository. Technically this is a problem only for
GitHub’s LFS (since it stems from the pricing rules); and can be avoided
by using LFS on GitLab or other platform, as Jim Hester has
described. Still, this proved unsuccessful for
me, and still faces the other big issue with
git-lfs
:
Overwrites `git` itself. Git LFS is just too integrated into `git`: it replaces your authentic `git` engine with `git-lfs`, such that the identical `git` command can have different behaviors on a machine with `git-lfs` installed versus one with just plain `git`. Maybe that is fine for a professional team that is “all in” on `git-lfs`, but it is a constant source of pitfalls when working with students and moving between machines that have only authentic `git` installed. The difficulties with supporting pull requests etc. are also related to this: in some sense, once you have a `git-lfs` repository, you’re really using an entirely new version control system that isn’t going to be 100% compatible with the nearly-ubiquitous authentic `git`.
Amazon S3 is perhaps the most universal and most obvious go-to place for online-available public and private data storage. The 5 GB/mo free tier is nice, the pricing is very reasonable, and it scales only incrementally after that. It is easily the most industry-standard solution, and still probably the best way to go in many cases. It is probably the most scalable solution for very large data, and the only one of these options with built-in support/integration for larger query services such as Apache Spark / `sparklyr`. It falls short of my own use case, though, in the authentication area. I require students to create a GitHub account for my courses and my lab group. I don’t like requiring such third-party accounts, but this one is fundamental to our daily use in the classroom and in research, and most students will continue using the service afterwards. I particularly don’t like having people create complex accounts that they might not even use much in the class or afterwards, just to deal with the pesky minor issue of a data file that is just a little too big for GitHub.
Amazon’s authentication is also much more complex than GitHub’s passwords or tokens, as is the process of uploading and downloading data from S3 (though the `aws.s3` R package is a rather nice remedy here, it doesn’t conform to the same user API as the `aws-cli` (Python) tool, leaving some odd quirks and patterns that don’t match standard Linux commands). Together, these make it significantly more difficult to deploy as a quick solution for moving private data around with private repositories.
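For comparison, here is a minimal sketch of that S3 workflow using the `aws.s3` package. The bucket name, file names, and credential values are hypothetical placeholders; credentials are assumed to come from the standard AWS environment variables rather than from any GitHub authentication you already have.

```r
# Minimal sketch (hypothetical bucket, files, and credentials).
library(aws.s3)

Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "<access-key>",   # placeholder
  "AWS_SECRET_ACCESS_KEY" = "<secret-key>",   # placeholder
  "AWS_DEFAULT_REGION"    = "us-east-1"
)

# Upload a local file to a (hypothetical) private bucket
put_object(file = "mydata.tsv.gz", object = "mydata.tsv.gz", bucket = "my-lab-data")

# Later, download it back to disk
save_object(object = "mydata.tsv.gz", bucket = "my-lab-data", file = "mydata.tsv.gz")
```

Workable, but it means every student needs a separate AWS identity and key pair on every machine they use, which is exactly the extra-account overhead described above.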
For scientific research purposes, dedicated scientific data repositories would be my ideal solution. Encouraging researchers to submit data to a repository at the time of publication is always a challenge, since doing so inevitably involves time and effort and the immediate benefit to the researcher is relatively minimal. If uploading the data to a repository served an immediate practical purpose of facilitating collaboration, backing up, and possibly versioning data during the research process itself, rather than after all is said and done, it would be much more compelling. Several repositories permit sharing of private data, at least up to some threshold, including DataONE and figshare. Unfortunately, at this time, I have found the interfaces and R tooling for these too limited or cumbersome for everyday use.
The `piggyback` approach is partly inspired by the strategy used in the `datastorr` package, which also uploads data to GitHub releases. `datastorr` envisions a rather different workflow around this storage strategy, based on the concept of an R “data package” rather than on Git LFS. I am not a fan of the “data package” approach in general: I think data should be stored in a platform-agnostic way, not as `.Rdata` files, and I often want to first download my data to disk and read it with dedicated functions, not load it “auto-magically” as a package. This latter issue is particularly important when the data files are larger than what can conveniently fit into working memory and are better accessed as a database (e.g. SQLite for tabular data, PostGIS for spatial data, etc.).
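As a rough illustration of that download-to-disk workflow, the sketch below pulls files from a GitHub release with `piggyback` and then reads them with ordinary dedicated readers. The repository, tag, and file names are placeholders, and a `GITHUB_TOKEN` is assumed to be set in the environment for private repositories.

```r
library(piggyback)

# Repo, tag, and file names are hypothetical placeholders.
pb_download("mydata.tsv.gz", repo = "owner/repo", tag = "v0.0.1")

# Read with a dedicated, platform-agnostic reader rather than load()ing .Rdata
mydata <- readr::read_tsv("mydata.tsv.gz")

# For data too large to fit in memory, download a database file and query it
pb_download("mydata.sqlite", repo = "owner/repo", tag = "v0.0.1")
con <- DBI::dbConnect(RSQLite::SQLite(), "mydata.sqlite")
```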
In terms of practical implementation, `datastorr` also creates a new release every time the data file is updated, rather than letting you overwrite files. In principle `piggyback` will let you version data this way as well: simply create a new release first using `pb_new_release(tag = "v2")` or whatever tag you like. I have not opted for this workflow since, in reality, versioning data with releases this way is technically equivalent to creating a new folder for each new version of the data and storing it there; unlike true git commits, release assets such as those `datastorr` creates can be easily deleted or overwritten. I still believe permanent versioned archives like Zenodo should be used for long-term versioned distribution.

Meanwhile, for day-to-day use I often want to overwrite data files with their most recent versions, as sketched below. (In my case these ‘data’ files are most often created from upstream data and/or other possibly long-running code, and are tracked for convenience. As such, they often change as a result of continued work on the upstream processing code. Perhaps this is not the case for many users, and more attention should be paid to versioning.)
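A minimal sketch of both workflows, assuming a release already exists and using placeholder repository, tag, and file names:

```r
library(piggyback)

# Day-to-day workflow: re-upload the regenerated file to the same release,
# replacing the previous copy (repo and file names are placeholders).
pb_upload("mydata.tsv.gz", repo = "owner/repo", tag = "v0.0.1")

# Release-per-version workflow (closer to datastorr): cut a new release,
# then upload the data against the new tag instead of overwriting.
pb_new_release(repo = "owner/repo", tag = "v2")
pb_upload("mydata.tsv.gz", repo = "owner/repo", tag = "v2")
```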