

<categories> [r]

CRAN R package binaries for Linux

Custom R repositories that offer R package binaries for a variety of Linux distributions (including Alpine Linux) and architectures.

Patrick Schratz

R community - it is time for something new: CRAN package binaries for Linux arm64 AND Alpine Linux!

tl;dr:

install.packages("vctrs", repos = "https://cran.devxy.io/arm64/noble/latest")
install.packages("stringi", repos = "https://cran.devxy.io/amd64/alpine321/latest")
 
# and more...
# find all identifiers at <https://www.devxy.io/r-package-binaries/>

# container examples
docker run --rm -it devxygmbh/r-ubuntu:4-noble R -q -e 'install.packages("vctrs"); library(vctrs)'
docker run --rm -it devxygmbh/r-alpine:4-3.21 R -q -e 'install.packages("stringi"); library(stringi)'

Historic background

As an R user for over ten years, I’ve always missed having binaries available for Linux. In summer 2020, Posit began building amd64 binaries, providing an important asset to the R community. This allowed people to install large numbers of packages in a fraction of the time, and CI/CD runs finally completed within a few minutes.

With the rise of centralized data science environments running in the cloud on Linux in the early 2020s, the need for binaries became pressing. Previously, since R was mainly used on desktops, most users were satisfied with binaries for Windows and Mac only. However, to succeed with its server-based offerings, Workbench and Connect, Posit (formerly RStudio) had to provide a solution for Linux environments. Without binaries, these applications would have been practically unusable due to the long waiting times for R package installations. Therefore, it made a lot of sense for Posit to tackle this field, paving the way for a new era of using R on (remote) Linux systems.

CRAN and the community

CRAN has long provided binaries for Windows and Mac—well before I started using R—and in 2020, it also delivered binaries for the arm64-based Apple Silicon architecture. However, no effort was made to build Linux binaries at the time. Given Posit’s initiatives in this area, there was little need for CRAN to take on the significant workload themselves.

Admittedly, building Linux binaries is a complex task. The diversity of Linux distributions, each with its own compilers and toolchains, means binaries must be tailored to these environments. Major upgrades to these distributions periodically require a complete rebuild of all binaries. Adding to the challenge, many system libraries required by R packages are not universally available across distributions. All of this translates into substantial demands for both storage and computing resources.

While Posit’s R package binaries were warmly welcomed by the community, the community itself was far from passive. In fact, efforts to build Linux binaries predated Posit’s involvement, as highlighted in the CRAN documentation for Ubuntu binaries (https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html). Additionally, the r2u package has provided Debian-specific binaries for a subset of R packages for many years.

However, these earlier approaches have, or had, limitations. The scope of available binaries was often narrow, and methods that relied on system package binaries posed challenges in shared environments where most users lack administrative privileges. Relying on a single administrator to install or update R packages is neither practical nor scalable for most environments.

Next, the bspm project was created, following a similar approach to those previously mentioned by leveraging the system’s native package manager and building a bridge to the Posit Package Manager. However, it suffers from the same limitation outlined earlier.

The most recent effort, R-universe, aims to build binaries for Windows, Mac, and Linux (amd64) using GitHub Actions. This approach is the closest to a fully community-driven initiative conducted openly and is led by Jeroen Ooms, a well-known and highly skilled member of the R community.

However, all of the above miss two points: support for the arm64 architecture and support for Alpine Linux.

Why the arm64 architecture and Alpine Linux are important

Yes, amd64 binaries from sources like Posit, R-universe, or others are incredibly useful. However, data science environments and CI/CD builds are increasingly shifting towards arm64-based servers. These servers offer comparable performance with greater energy efficiency and are significantly more cost-effective (up to 30% cheaper) when rented in the cloud compared to amd64 instances.

Alpine Linux is a lightweight Linux distribution optimized for size, primarily used in containerized CI/CD tasks. Its images are typically 2-5 times smaller than those built on other popular distributions like Ubuntu or Red Hat. However, Alpine differs significantly from other distributions in one key aspect: it uses the MUSL C library instead of the more common GLIBC.

This distinction introduces complications for C and C++ code, requiring specific adaptations for compatibility with MUSL. Currently, CRAN does not enforce checks for MUSL-based distributions, and the lack of prebuilt binaries means most package authors don’t address these compatibility issues. As a result, a significant portion of R packages with C bindings cannot be installed on Alpine.

Nevertheless, encouraging package authors to add MUSL support is a necessary first step. Once a substantial number of packages are compatible, Alpine could become a viable option for R-based CI/CD workflows. This shift might, in turn, encourage broader adoption, motivating users to prioritize MUSL compatibility in their packages and further expanding Alpine’s utility in the R ecosystem.

After building all CRAN packages for Alpine for the first time, the results were surprisingly positive: out of 21,7xx packages, 21,502 (99.1%) successfully built for amd64, and 20,405 (94.1%) for arm64. Detailed and up-to-date build statistics are available on the “R package binaries dashboard”. While not all build failures are directly attributable to MUSL/GLIBC compatibility issues, many likely are. It now falls to the authors and the community to investigate these failures, report the issues, and work towards fixing them.

Why also binaries for amd64?

Good question. Initially, only arm64 binaries were in scope. However, I quickly realized that in a multi-arch R environment I would need to switch the repository URL depending on the architecture: to Posit’s Package Manager for amd64 and to cran.devxy.io for arm64. This becomes annoying quickly.

I realized that users should be able to use the same repository URL for any architecture, with only the arch identifier in the URL changing to match the machine at hand. Ideally, the arch identifier should be settable dynamically, e.g. via $(arch). Unfortunately, there are multiple ways to refer to a specific architecture: amd64 & x86_64 and arm64 & aarch64. I went for the first option in each pair, i.e. amd64 and arm64.
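Since tools such as `arch` or `uname -m` typically print x86_64 or aarch64 on Linux, a small mapping step is needed before the value can be dropped into the URL. The following is a minimal sketch of that idea, not part of the service itself: the function names `normalize_arch` and `repo_url` are hypothetical, and the URL layout is assumed from the tl;dr examples above.

```shell
#!/bin/sh
# Hypothetical helper: map a raw machine name (as printed by `uname -m`)
# to the identifier used in the repository URL (amd64 or arm64).
normalize_arch() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the repository URL from a machine name and a
# distribution identifier. The <arch>/<distro>/latest layout is assumed
# from the tl;dr examples (e.g. "noble", "alpine321").
repo_url() {
  repo_arch=$(normalize_arch "$1") || return 1
  echo "https://cran.devxy.io/${repo_arch}/$2/latest"
}

# Example usage on the current machine (left as a comment so that sourcing
# this file has no side effects):
#   R -q -e "install.packages('vctrs', repos = '$(repo_url "$(uname -m)" noble)')"
```

With this in place, the same command line works on both architectures, and only the distribution identifier has to be chosen per image.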

Sustainability and costs

While many parts are automated and publicly available in the r-package-binaries subgroup, keeping everything up-to-date so that users can rely on the service is an ambitious undertaking. Yet I am keen to take it on and believe that it is possible if the community contributes to the project by investigating build failures of missing packages, contributing to the costs of the service (hosting, rebuilds, etc.) and improving the build chain. I’ve always wanted such a project to exist rather than having to rely on a closed build system nobody can interact with.

That being said: let’s talk about costs. Building binaries, storing them and distributing them incurs costs. So far, I am paying for all of these with my personal money. Me? Yes, me. While the service is hosted under the umbrella of devXY, a limited company, it is a one-person company at the time of writing and effectively runs on my personal money.

As said earlier, building this has been a personal wish of mine, and I am fully OK with paying for it and giving back to the community. While the build costs will stay mostly static, the storage and bandwidth costs will steadily increase. The latter in particular could grow very large, depending on how many people use the service. Therefore, I have to set a limit on the provided bandwidth to avoid abuse and keep costs under control. This limit has been set to 50 TB per month for the start. I am happy to increase it once costs are covered by the community. While this might look restrictive at first, especially compared to Posit providing binaries without any limits, devXY simply cannot compete with a company of Posit’s size.

At the same time, I would like to ask medium and large companies planning to use the service to reach out to me and ask for a private URL (for a monthly fee). The main URL cran.devxy.io is aimed at the community: developers and other individuals. With reasonable usage by individuals, 50 TB can be a lot. However, it can be a low volume if a large company jumps on it and uses it for hundreds of (daily) CI/CD jobs.

Signoff

This is an introductory post, outlining the current state of and motivation for the overall project. A separate post will provide more technical details and explain how all of this could be built by a single person with a limited budget.
