
Data Lab Infra - Part 5: Retrospective & MLOps - Model Deployment


Summary
#

In part 5 of this series, you’ll learn how to use multi-project CI/CD on the GitLab Free tier to provision resources like PostgreSQL databases and credentials. You’ll also see how easy it is to deploy your ML model with Docker, once the correct infrastructure is in place. And, finally, we’ll look back at the pros and cons of the implemented architecture, doing a full retrospective, and proposing a redesigned architecture to fix the pitfalls of our current design.

Changes to CI/CD
#

Attempted Refactoring
#

At first, we attempted to refactor the CI/CD pipeline as follows:

.ci/
├── scripts/
│   ├── kafka/
│   ├── ollama/
│   └── postgres/
└── templates/
    ├── deploy.yml
    ├── kafka.yml
    ├── ollama.yml
    └── postgres.yml

However, if we run a script instead of defining it inline, we won’t be able to run these jobs from another repository. This is a concern for our current workflow, since we want to be able to provision databases, credentials, or other resources by calling jobs within the datalab repository as effortlessly as possible. We ended up going back to the original approach, with templates at the root of .ci/.

We also fixed an error in the changes globbing, where infra/services/docker/** should have been infra/services/docker/**/* so that files were matched rather than directories.

Custom Ubuntu Image
#

Since we were continuously in need of a few common command line tools, like curl or jq, we decided to build a custom Ubuntu image that we pushed to the container registry for the datalab project. This lets us reuse the image directly from our container registry, without having to add a before_script block to set up the Ubuntu instance each time a new runner is launched for jobs requiring additional commands beyond the base image.

Building and pushing this image was done through the Terraform project under infra/services/gitlab, so the workflow remains unchanged, assuming you set up your CI/CD variables from your .env via Terraform.
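For reference, here’s a rough manual equivalent of what that Terraform project automates. The base image and package list are assumptions; the registry path matches the image referenced by our CI/CD jobs later on:

# Sketch: build and push the custom Ubuntu runner image by hand.
# Base image and packages are assumptions; the registry path matches
# gitlab:5050/datalabtechtv/datalab/ubuntu:custom used by our CI/CD jobs.
cat > Dockerfile <<'EOF'
FROM ubuntu:24.04
RUN apt update && apt install -y curl jq && rm -rf /var/lib/apt/lists/*
EOF

docker login gitlab:5050
docker build -t gitlab:5050/datalabtechtv/datalab/ubuntu:custom .
docker push gitlab:5050/datalabtechtv/datalab/ubuntu:custom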

The first attempt to push a large image into our container registry failed. This ended up being fixed by adding checksum_disabled to /etc/gitlab/gitlab.rb as follows:

registry['storage'] = {
  's3_v2' => {
    ...
    'checksum_disabled' => true,
  }
}

And then SSHing into the GitLab VM and running:

sudo gitlab-ctl reconfigure

This configuration was also added to the L2 (Platform) Terraform project, to the cloud-config for GitLab, so if you’re deploying this now, you don’t have to worry about this.

We also had to change the project visibility to “Public” for datalab, otherwise CI/CD jobs were unable to pull from the container registry. Of course, our GitLab instance is not exposed to the outside, otherwise we would need a different strategy to handle access.

Improving Postgres Workflow
#

We also improved the postgres template to store credentials as JSON, using a CI/CD variable, so that, if run multiple times, it returns the original credentials instead of just producing a new password and losing access to the originals. This is a requirement that we missed during the initial design of this workflow, as we’ll need these credentials in our external application projects during deployment. This also makes the workflow idempotent.

Here’s a summary of changes:

  • Credentials no longer printed in logs.
  • Credentials stored as JSON, using a CI/CD variable.
  • External projects will always need to call this workflow to load credentials into the env during application container deployment.
  • Database and credentials will be created as required, if they don’t exist.

Notice that any project can access any database and credentials, as this wasn’t a concern here, but the JSON format and workflow can be expanded to handle this if required. A better way, as we’ll see next, is to simply stop resisting and use an extra service to handle secrets.
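To make the idea concrete, here’s a hedged sketch of how the template might maintain that JSON credentials variable through the GitLab API. The variable name POSTGRES_CREDENTIALS is hypothetical, and the sketch assumes the variable was created once beforehand (with a POST instead of a PUT):

# Sketch: read the JSON credentials CI/CD variable, add an entry for $DB_NAME
# if missing, and write it back. POSTGRES_CREDENTIALS is a hypothetical variable
# name that must already exist (create it once with a POST otherwise).
CREDS=$(curl -s -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "$CI_API_V4_URL/projects/$CI_PROJECT_ID/variables/POSTGRES_CREDENTIALS" \
  | jq -r '.value // "{}"')

if [[ $(echo "$CREDS" | jq --arg db "$DB_NAME" 'has($db)') == "false" ]]; then
  DB_PASS=$(openssl rand -hex 16)
  CREDS=$(echo "$CREDS" | jq -c --arg db "$DB_NAME" --arg user "$DB_USER" --arg pass "$DB_PASS" \
    '. + {($db): {user: $user, pass: $pass}}')
  curl -s -X PUT -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
    --form "value=$CREDS" \
    "$CI_API_V4_URL/projects/$CI_PROJECT_ID/variables/POSTGRES_CREDENTIALS" > /dev/null
fi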

Multi-Project CI/CD on GitLab Free Tier
#

Templates for Provisioning
#

In the original datalab project, we provide CI/CD templates that can be included as required, in external projects as well as internally, to provision resources, be it within PostgreSQL, Kafka, or Ollama.

Since we are using the GitLab Free tier, we lack the syntactic sugar to trigger jobs on an external project, particularly when we need artifacts from that project. So, the best way to handle this is via the REST API.

We provide the following CI/CD templates:

.ci/provision/
├── kafka.yml
├── ollama.yml
└── postgres.yml

These are set to call the CI/CD pipeline with the appropriate input values to provision resources. The strategy is similar for all three templates. Here’s how we do it for .ci/provision/ollama.yml:

spec:
  inputs:
    pull:
      description: "Pull Ollama model"
      type: string

---

stages:
  - provision

variables:
  PROVISIONER_ID: datalabtechtv%2Fdatalab

provision_ollama_model:
  stage: provision
  image: gitlab:5050/datalabtechtv/datalab/ubuntu:custom
  variables:
    PULL: $[[ inputs.pull ]]
  script:
    - echo "Triggering downstream pipeline"
    - |
      PIPELINE_ID=$(curl -s -X POST \
        --form token=$GITLAB_TRIGGER_TOKEN \
        --form ref=infra \
        --form inputs[ollama_pull]=$PULL \
        $CI_API_V4_URL/projects/$PROVISIONER_ID/trigger/pipeline | jq -r '.id')      
    - echo "Triggered pipeline $PIPELINE_ID"
    - |
      while true; do
        STATUS=$(curl -s -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
          $CI_API_V4_URL/projects/$PROVISIONER_ID/pipelines/$PIPELINE_ID \
          | jq -r '.status')
        echo "Pipeline status: $STATUS"
        [[ "$STATUS" == "success" || "$STATUS" == "failed" ]] && break
        sleep 10
      done      

The spec.inputs will be set to whatever information you need from an external project to provision the required resource:

spec:
  inputs:
    pull:
      description: "Pull Ollama model"
      type: string

---

Then, we set stages to provision, which is the only stage required for these templates:

stages:
  - provision

External projects that include this template must make sure that this stage is defined for them as well.

We set a global variable, PROVISIONER_ID, with the project ID for datalab in its URL-encoded path format (%2F is the encoded forward slash):

variables:
  PROVISIONER_ID: datalabtechtv%2Fdatalab

And, finally, we define the provisioning job, which always triggers the datalab pipeline regardless of where it’s called from. Let’s take a look at the script for the provision_ollama_model job.

First, we trigger the pipeline, obtaining its run ID:

echo "Triggering downstream pipeline"

PIPELINE_ID=$(curl -s -X POST \
  --form token=$GITLAB_TRIGGER_TOKEN \
  --form ref=infra \
  --form inputs[ollama_pull]=$PULL \
  $CI_API_V4_URL/projects/$PROVISIONER_ID/trigger/pipeline | jq -r '.id')

echo "Triggered pipeline $PIPELINE_ID"

Notice that we need the GITLAB_TRIGGER_TOKEN we previously set. As you can see, we send the user configs via inputs[...] form data fields.

We then poll the API every 10 seconds to check if the pipeline has finished:

while true; do
  STATUS=$(curl -s -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
    $CI_API_V4_URL/projects/$PROVISIONER_ID/pipelines/$PIPELINE_ID \
    | jq -r '.status')

  echo "Pipeline status: $STATUS"

  [[ "$STATUS" == "success" || "$STATUS" == "failed" ]] && break

  sleep 10
done

Once the downstream pipeline finishes, control returns to the external project’s CI/CD pipeline.

External Test Project
#

We created an external test project called datalab-infra-test to demo how this works.

First, we needed a trigger token from the datalab project, which was created by navigating to Settings → CI/CD → Pipeline trigger tokens → Add new token, under datalab. We then stored the token in the GITLAB_TRIGGER_TOKEN CI/CD variable under datalab-infra-test.
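If you prefer the API over the UI, the same variable can be created on datalab-infra-test with a single call, with the masked flag keeping the token out of job logs. In this sketch, $TEST_PROJECT_ID and $TRIGGER_TOKEN are placeholders for your own values:

# Sketch: store the trigger token as a masked CI/CD variable on the test project.
# $TEST_PROJECT_ID and $TRIGGER_TOKEN are placeholders for your own values.
curl -s -X POST \
  -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  --form "key=GITLAB_TRIGGER_TOKEN" \
  --form "value=$TRIGGER_TOKEN" \
  --form "masked=true" \
  "$CI_API_V4_URL/projects/$TEST_PROJECT_ID/variables"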

Additionally, we had to reconfigure /etc/gitlab-runner/config.toml to increase job concurrency, otherwise we wouldn’t be able to trigger a job, and wait for it from another job—with the default concurrency of 1, the pipeline would just freeze completely.

We SSHed into the GitLab VM, set concurrent = 4, and restarted the runner with:

sudo sed -i 's/^concurrent = 1/concurrent = 4/' /etc/gitlab-runner/config.toml
sudo gitlab-runner restart

This configuration was also added to the L2 (Platform) Terraform project, within the cloud-config for GitLab, so if you’re deploying this now you won’t have to worry about it.

For example, if you need a Postgres database and credentials, you can configure your CI/CD jobs as follows:

stages:
  - provision
  - test

variables:
  POSTGRES_DB_USER: ci_cd_user
  POSTGRES_DB_NAME: ci_cd_db

include:
  - project: datalabtechtv/datalab
    ref: infra
    file: '.ci/provision/postgres.yml'
    inputs:
      db_user: $POSTGRES_DB_USER
      db_name: $POSTGRES_DB_NAME

test_db_connection:
  stage: test
  image: postgres:18.0-alpine
  needs:
    - fetch_db_credentials
  script:
    - 'echo Connecting to database: $DB_NAME'
    - 'echo Connecting with user: $DB_USER'
    - PGPASSWORD=$DB_PASS psql -h docker-shared -U $DB_USER -d $DB_NAME -c '\q'

Here, test_db_connection would usually be replaced by something like a docker compose up for your own application. The point is that this workflow ensures the database you need is created, and it handles the secrets for you, making them available as env vars.
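For instance, a real deploy job could reuse the same credentials to bring up an application stack. Here’s a minimal sketch, assuming a compose.yml in the external project that reads the DB_* variables from its environment:

# Sketch: deploy the external project's own stack using the credentials loaded
# by the provisioning workflow (compose.yml and project name are assumptions).
export DB_NAME DB_USER DB_PASS
docker compose -p my-app -f compose.yml up -d --build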

For Kafka and Ollama, we only run a provisioning job, since we don’t need any credentials back from the job. For Postgres, however, the pipeline will also fetch the job ID for psql_create_db, which holds the credentials.env artifact (this expires after 15m), loading those credentials as environment variables. The pipeline for the test project looks like this:

GitLab - Provisioning Pipeline
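Under the hood, fetch_db_credentials has to locate the psql_create_db job in the downstream pipeline and download its credentials.env artifact. Here’s a rough sketch of that lookup; the exact implementation in the template may differ:

# Sketch: find the psql_create_db job in the downstream pipeline and download
# its credentials.env artifact (the actual template code may differ).
JOB_ID=$(curl -s -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "$CI_API_V4_URL/projects/$PROVISIONER_ID/pipelines/$PIPELINE_ID/jobs" \
  | jq -r '.[] | select(.name == "psql_create_db") | .id')

curl -s -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  -o credentials.env \
  "$CI_API_V4_URL/projects/$PROVISIONER_ID/jobs/$JOB_ID/artifacts/credentials.env"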

And now you know a CI/CD strategy, running on top of the GitLab Free tier, for provisioning resources in your data lab infrastructure! It might not be the best, but it works. Of course, we’ll keep improving on it, and we’ll share everything with you as we go!

Model Deployment
#

Starting Point
#

Last time, in our ML End-to-End Workflow, we produced a REST endpoint, using FastAPI, that provided a way to run inference over one or multiple models (A/B/n testing) previously logged to MLflow. Optionally, we could log the inference to DuckLake, which was running on top of a local SQLite catalog and remote MinIO storage. Logged inferences were streamed to a Kafka topic, and then consumed and buffered until they were inserted into the appropriate DuckLake table.

What we want to do now is prepare this REST API to be deployed on the Docker instance running on the docker-apps VM, while using available services running on docker-shared. This includes MLflow and Kafka, but also PostgreSQL and MinIO (L1) for DuckLake. Today, we’ll only be concerned with ensuring MLflow and Kafka are integrated, as we’ll have a blog post (and video) focusing on migrating your catalog from SQLite to PostgreSQL, at which time we’ll configure DuckLake adequately to run on top of docker-shared services.

Asking CI/CD for Kafka Topics
#

Since our goal is essentially to expose ml.server, which is part of the datalab project, we’ll set up the CI/CD within this project. This time, we use two trigger jobs, one for each of our Kafka topics: one for logging the inference results (provision_mlserver_results_topic), and the other to handle inference feedback sent by our users (provision_mlserver_feedback_topic).

Both jobs will be similar, so let’s take a look at provision_mlserver_results_topic:

provision_mlserver_results_topic:
  stage: deploy
  trigger:
    include:
      - local: .ci/provision/kafka.yml
        inputs:
          topic: ml_inference_results
          group: lakehouse-inference-result-consumer
    strategy: depend
  rules:
    - if: $CI_PIPELINE_SOURCE == "push"
      changes:
        - .ci/deploy.yml
        - infra/apps/docker/**/*
    - if: '"$[[ inputs.force_apps_deploy ]]" == "true"'

Similarly to what we did for the datalab-infra-test project, we include the provision template, but this time it’s a local include. We ask it to create the topic ml_inference_results, initializing a consumer for it with the group lakehouse-inference-result-consumer.

The job triggering rules match the ones that we use for our apps_deploy job.
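For context, the downstream provisioning job on the datalab side essentially boils down to creating the topic and initializing the consumer group. Roughly, using the standard Kafka CLI tools (the actual script in the repo may differ):

# Sketch: create the topic if missing and initialize the consumer group with a
# short, throwaway consume (standard Kafka CLI tools; actual script may differ).
kafka-topics.sh --bootstrap-server docker-shared:9092 \
  --create --if-not-exists --topic ml_inference_results

kafka-console-consumer.sh --bootstrap-server docker-shared:9092 \
  --topic ml_inference_results \
  --group lakehouse-inference-result-consumer \
  --timeout-ms 5000 || true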

Deploying Applications
#

The apps_deploy job is defined under .ci/deploy.yml, for the datalab project, as follows:

apps_deploy:
  stage: deploy
  image: docker:28.4.0-cli
  needs:
    - provision_mlserver_results_topic
    - provision_mlserver_feedback_topic
  variables:
    DOCKER_HOST: tcp://docker-apps:2375
    DOCKER_BUILDKIT: 1
    INFERENCE_RESULTS_TOPIC: ml_inference_results
    INFERENCE_FEEDBACK_TOPIC: ml_inference_feedback
    INFERENCE_RESULTS_GROUP: lakehouse-inference-result-consumer
    INFERENCE_FEEDBACK_GROUP: lakehouse-inference-feedback-consumer
  script:
    - docker compose -p datalab -f infra/apps/docker/compose.yml up -d --build
    - docker ps
  rules:
    - if: $CI_PIPELINE_SOURCE == "push"
      changes:
        - .ci/deploy.yml
        - infra/apps/docker/**/*
    - if: '"$[[ inputs.force_apps_deploy ]]" == "true"'

Regarding the rules, notice that we provide a new boolean input that we can set when triggering the pipeline to force a redeploy of the Docker Compose project for our applications. This is useful when we just want to update its env vars.

As you can see, we set the two topic provisioning jobs as a dependency, and we then configure the environment variables required by our ml.server REST API. Also notice that we set DOCKER_BUILDKIT, which will reduce the overhead of redeploying with the --build flag, as image layers will be cached between deployments.
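Since the Docker daemon on docker-apps is exposed over TCP, the same deployment can also be reproduced by hand from a workstation on the network, which is handy for debugging. This sketch assumes the required env vars (e.g., the INFERENCE_* values) are already exported locally:

# Sketch: run the same deployment against the remote docker-apps daemon from a
# workstation (assumes the required env vars are already exported locally).
export DOCKER_HOST=tcp://docker-apps:2375
export DOCKER_BUILDKIT=1
docker compose -p datalab -f infra/apps/docker/compose.yml up -d --build
docker ps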

Docker Compose
#

Let’s take a look at infra/apps/docker/compose.yml:

services:
  mlserver:
    build:
      context: ../../../
      dockerfile: infra/apps/docker/mlserver/Dockerfile
    ports:
      - "8000:8000"
    environment:
      MLFLOW_TRACKING_URI: ${MLFLOW_TRACKING_URI}
      KAFKA_BROKER_ENDPOINT: ${KAFKA_BROKER_ENDPOINT}
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      retries: 5
    restart: unless-stopped

Notice that we set our build context to the root of the datalab project, so that it will pick up our .env when building locally. Otherwise, environment variables will either be set via the apps_deploy CI/CD job, or as a CI/CD variable, as is the case for MLFLOW_TRACKING_URI and KAFKA_BROKER_ENDPOINT, which are set to:

MLFLOW_TRACKING_URI="http://docker-shared:5000"
KAFKA_BROKER_ENDPOINT="docker-shared:9092"
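As a quick sanity check that the container can actually reach both services on docker-shared, something like this works from inside the running mlserver container (the MLflow tracking server exposes a /health endpoint; the Kafka check is just a TCP probe):

# Sketch: verify connectivity to MLflow and Kafka from inside the mlserver container.
curl -sf http://docker-shared:5000/health && echo "MLflow OK"
timeout 2 bash -c '</dev/tcp/docker-shared/9092' && echo "Kafka broker reachable"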

Dockerfile
#

The Dockerfile for ml.server is quite minimal, and based on the official uv image bundled with Python 3.13 and running on Debian Trixie:

FROM astral/uv:python3.13-trixie-slim

RUN apt update && apt install -y git curl

WORKDIR /datalab
COPY pyproject.toml pyproject.toml
COPY uv.lock uv.lock

RUN uv sync --frozen

COPY . .

RUN uv sync --frozen

ENTRYPOINT ["./infra/apps/docker/mlserver/docker-entrypoint.sh"]

Order matters, if you want to optimize caching.

First we install system dependencies—our pyproject.toml is using a dependency straight from a git repo, so we’ll need the git command:

RUN apt update && apt install -y git curl

Then, we switch to /datalab and copy only the required files to install uv dependencies:

WORKDIR /datalab
COPY pyproject.toml pyproject.toml
COPY uv.lock uv.lock

RUN uv sync --frozen

Installing dependencies before copying the source code for datalab ensures that, unless dependencies change, we can modify the code in datalab and redeploy without reinstalling all dependencies, which takes quite a while.

Then, we finally copy our complete datalab repo and install our source code as the last missing dependency:

COPY . .

RUN uv sync --frozen

docker-entrypoint.sh
#

We set the entry point for our container as a shell script that loads the Python virtual environment and then calls the CLI command to start the REST API server:

#!/usr/bin/env bash

set -e

# shellcheck source=/dev/null
. .venv/bin/activate

dlctl ml server "$@"

Using set -e will ensure that, if any command fails, the script will terminate there.

Retrospective No. 1
#

We’ll now do a retrospective on the architecture that we designed for our data lab infrastructure, identifying the good and the bad, and proposing a redesigned architecture that takes all of this into account.

These are my fairly unedited notes. For a more digestible version, please watch the video on this topic, where I restructure this into smaller topics fit for a slide deck.

What Went Well
#

  • The architecture was deployable, and everything works!
  • Having a custom Ubuntu image for GitLab runners was useful to avoid running apt update and installing packages, which takes time, every time a job with these requirements was run.
  • The container registry was already useful for the custom Ubuntu image.

To Improve
#

  • Splitting Docker into multiple VMs was a bad move—a single beefier instance would have been better. Not only is it easier to pool resources, it also lowers the overhead of communicating among services within the same Docker instance.
  • Using GitLab for secret management along with .env and terraform.tfvars is a pain—we might have been better off deploying HashiCorp Vault into Layer 1 and just using that for everything, with a dev version on docker compose for a local deployment. We might use a Vault Agent to load secrets as env vars as well.
  • We have a container registry, but we haven’t used it for application images yet—we might need a workflow to manage images separately while tracking their versions.
  • It might have been better to go with Gitea, which still has a container registry as well as CI/CD runners, rather than GitLab, given resource constraints.
    • GitLab is also quite bloated for a small home lab, running a few redundant services by default, like its own PostgreSQL instance, which we don’t need, Prometheus, which we don’t care about, or Praefect for HA, which we don’t use.
    • GitLab’s CI/CD must be defined in a single pipeline, as there are no separate workflows, like with GitHub Actions, or Gitea CI/CD for that matter.
    • Documentation is hard to browse, mainly due to the project’s complexity and dimension.
    • Some components can be quite slow likely due to the interpreted nature of the Ruby language (e.g., starting gitlab-rails console takes nearly 30 seconds🥶).
    • Monetization is dependent on feature tiers, which makes it harder to get into (e.g., multi-project pipelines that require needs:project only work in Premium or Ultimate tiers).
  • Given the single workflow/pipeline approach of GitLab CI/CD, and assuming we would continue to use GitLab, a better way to activate different workflows would have been to use a boolean input per workflow, determining whether to activate the corresponding jobs—this is cleaner and more general than relying on non-empty values.
  • We used a few init services in our Docker Compose project, but we could have implemented this via CI/CD and stripped it completely from Compose.
  • Maybe we could have produced a single Terraform project for all layers of the infrastructure, although it’s unclear whether setting layer dependencies would be needlessly complex to manage.
  • Not using Ansible was a bad move—cloud-init is great for provisioning standard VMs, but not for handling configs, especially when we might need to change them.
  • Having a proxy cache would be useful to avoid too many requests to package repositories (e.g., apt), especially during the initial setup stage. And if we’re continuously installing packages within runners for a few workflows, it also makes sense to avoid constantly connecting to the upstream servers, both to ease their load and to improve speed locally.
  • MLflow is running on a SQLite backend, but we do have a PostgreSQL instance that we should switch to.

Redesigning the Architecture
#

Data Lab Infra - Architecture Redesigned

Here are the overall changes for each layer:

  • L1: Foundation
    • Add nginx to serve as a proxy cache for apt, apk, or others.
    • Add HashiCorp Vault, since it integrates with the shell environment, via the vault agent, with Terraform, via its official provider, and with CI/CD, either via the vault agent, or through the Terraform provider, depending on whether we prefer a more or less real-time configuration update.
    • Keep Terraform for deployment, but replace mutable configuration management with Ansible.
  • L2: Platform
    • Combine the three Docker VMs into a single VM.
    • Keep Terraform for deployment, but replace mutable configuration management with Ansible.
  • L3: Services
    • Nothing changes here, except we extracted DuckLake into its own “L3: Edge” layer, since it doesn’t really run on the infrastructure, at least not directly, but on client machines, like a desktop or laptop, connecting to the PostgreSQL and MinIO instances.
  • L4: Applications
    • Added NodeJS as an example, since we might want to deploy our own web apps (e.g., dynamic visualizations for our data science projects).
    • Made it clear that all apps are deployed as containers in this layer.

Notice that it might also be the case that Gitea cannot adequately replace GitLab for our needs, and there is nothing free capable of doing it in a satisfactory way. We’ll need to test and compare Gitea with GitLab first. We might end up keeping GitLab in the stack—it’s hard to predict at this time.