MCh NOMAD Oasis Administration

FIXME NOMAD Oasis is running on the oasis.mch.rwth-aachen.de machine, which is managed by our IT (currently: Sergej Laiko). Contact IT (currently Sergej) for login and password. The system itself runs in a virtual machine (VM) using an Ubuntu derivative optimized for VMs. The machine currently has 8 cores and 32GB of RAM. The disc space of the VM itself is quite limited at 32GB; the Oasis data and docker images reside on a separate /dev/xvdb1 drive (currently 1TB).

MCh NOMAD Oasis

The upstream documentation for running an Oasis is here https://nomad-lab.eu/prod/rae/docs/oasis.html. The system works through docker https://www.docker.com/ containers; you can think of these as another lightweight level of virtualization, so that each of the services needed for a working Oasis has its own container. The complete Oasis settings reside in the oasis folder. The individual docker containers are configured in the docker-compose.yaml file, the main Oasis config file is nomad.yaml, the webserver settings are in nginx.conf, and the ssl keys are there as well (in order to have a working https).

Docker settings

All the NOMAD Oasis data reside within the docker directory tree. In fact, the main drive /dev/xvdb1 is mounted as /var/snap/docker/common/var-lib-docker, so most of the docker settings and images should be on it as well.
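
A quick way to check how full the data drive is (standard commands, using the mount point mentioned above):

df -h /var/snap/docker/common/var-lib-docker
docker system df   # space used by docker images, containers and volumes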

Oasis is started automatically on system startup. If not, you can start it manually by running

cd /home/admo/oasis_1
docker-compose up -d

from the oasis_1 directory. You can stop everything by

docker-compose down

For further docker-compose commands see docker-compose docs https://docs.docker.com/compose/.
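
To check that all containers came up and are healthy, the standard docker-compose status and log commands can be used (again from the oasis_1 directory):

docker-compose ps            # list the containers and their state/health
docker-compose logs -f app   # follow the logs of a single service, e.g. the app container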

Building our custom oasis

While upstream NOMAD provides ready-made docker images, we want to apply some customization, so we have to build our own images. This is done directly from the https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/ tree, which is cloned to the nomad-git directory on the VM.

Updates of the git tree work in the usual way, i.e.

cd nomad-git
git fetch
git rebase origin/<desired branch>

origin/master is the latest stable branch, so that is the safe bet, unless some extra features from development branches are needed. The tricky part is that the parsers and other components are incorporated as git submodules. So after rebasing the nomad-FAIR tree, running

git submodule update --init --recursive

is needed to keep the submodules in sync with the nomad tree. If one needs to update a parser or other submodule to a newer version than the one included in the current tree, one must go to the specific directory; for example, to update the LOBSTER parser to the latest upstream do

cd dependencies/parsers/lobster/
git fetch && git rebase origin/master
cd ../../..
git add dependencies/parsers/lobster/
git commit
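
To verify which commit each submodule is actually checked out at (for example before committing the update above), the standard git command can be used:

git submodule status dependencies/parsers/lobster/   # prints the checked-out commit of the submodule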

When you are happy with the update and changes, the docker images can be built with

docker build .

After the build finishes, check the list of all built images with

docker images

and write down the image ID of the most recent one. Now tag it with

docker tag <image_id> <some_new_random_tag>

and replace the currently used images for the worker and app containers in the docker-compose.yaml with the newly built one:

worker:
  ...
  image: <some_new_random_tag>:latest
  ...
app:
  ...
  image: <some_new_random_tag>:latest
  ...

Finally, restart the containers. Before doing so, it is a good idea to check that no one is uploading to or downloading from the Oasis; a simple check is to run top and verify that there are no processes with high CPU usage.
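
A per-container view of the load (a standard docker command, not specific to NOMAD) can be obtained with docker stats:

docker stats --no-stream   # one-shot snapshot of CPU and memory usage per container

Once the Oasis is idle, restart it: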

docker-compose down
docker-compose up -d

Everything should be running the new version now.
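
As a small shortcut, the build and tag steps can also be combined; the tag name mch-oasis used below is just an example, any name works:

docker build -t mch-oasis:latest .

The mch-oasis:latest tag can then be referenced directly in the worker and app sections of docker-compose.yaml as shown above.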

Current patches

We have a set of custom patches on top of the vanilla NOMAD Oasis, so there may be conflicts during the rebase. These need to be resolved in the usual git way; however, sometimes Python or web-frontend programming skills might be needed if the conflicts are more complex. Most of the patches are quite simple though.

The current set of patches is

The MCh-specific tweaking is likely to stay forever, as are the system type (MatID based) classification patches. The rest should hopefully not be needed once the OpenMX and LOBSTER parsers are properly upstreamed. The fix for the VASP parser is questionable, so it might also never make it into the upstream branch.

Current oasis settings

nomad.yaml

This is the main Oasis config file. It was mostly copy-pasted from the example at https://nomad-lab.eu/prod/rae/docs/oasis.html and adapted to our specific URL and settings. See the inline comments for explanations of some of the more specific settings.

services:
  api_host: 'oasis.mch.rwth-aachen.de'
  https: true
  https_upload: true
  api_port: 443
  api_base_path: '/nomad-oasis'
  
mongo:
  db_name: nomad_v1

elastic:
  entries_index: nomad_oasis_entries_v1
  materials_index: nomad_oasis_materials_v1

# This was needed to combat some hard-to-debug parser bugs in the past.
# It might not be needed anymore and probably comes with some performance penalty;
# however, it is likely the safe option. See https://matsci.org/t/36150 for
# the original problems.
# process_reuse_parser: false

# Whether new mainfile matching should be done on reprocessing (for example if a parser
# for a new DFT code was added). See https://matsci.org/t/36286 for more details.
reprocess_match: true

# Up to which size (in atoms) systems should be characterized with the system type normalizer.
# Please note that the main limit seems to be the memory.
normalize:
  system_classification_with_clusters_threshold: 350

oasis:
  is_oasis: true
  uses_central_user_management: true
  # Add access for new users by adding their email here.
  allowed_users:
    - nomad-oasis@mch.rwth-aachen.de
    - email1@mch.rwth-aachen.de
    - ...
 
meta:
  deployment: 'oasis'
  deployment_id: 'oasis.mch.rwth-aachen.de'
  maintainer_email: 'nomad-oasis@mch.rwth-aachen.de'
  deployment_url: 'https://oasis.mch.rwth-aachen.de/api'
  
celery:
  timeout: 10000
  max_memory: 16000000

docker-compose.yaml

Mostly copy-pasted from https://nomad-lab.eu/prod/rae/docs/oasis.html. The only changes are the --concurrency=5 switch for the worker container and the OMP_NUM_THREADS: 1 environment variable for the worker to prevent overload (see https://github.com/nomad-coe/nomad/issues/10 for details). The concurrency switch controls the number of cores used for parsing; 5 is a compromise between parallelism and possible slowdowns due to swapping when parsing large cases (the memory-heavy parts are at the moment mostly the vasprun.xml parser https://github.com/nomad-coe/nomad-parser-vasp/issues/12 and the system normalizer for systems with a lot of atoms).

version: '3'

x-common-variables: &nomad_backend_env
    NOMAD_RABBITMQ_HOST: rabbitmq
    NOMAD_ELASTIC_HOST: elastic
    NOMAD_MONGO_HOST: mongo

services:
    # broker for celery
    rabbitmq:
        restart: always
        image: rabbitmq:3.11.5
        container_name: nomad_oasis_rabbitmq_v1
        environment:
            - RABBITMQ_ERLANG_COOKIE=SWQOKODSQALRPCLNMEQG
            - RABBITMQ_DEFAULT_USER=rabbitmq
            - RABBITMQ_DEFAULT_PASS=rabbitmq
            - RABBITMQ_DEFAULT_VHOST=/
        volumes:
            - nomad_oasis_rabbitmq:/var/lib/rabbitmq
        healthcheck:
          test: ["CMD", "rabbitmq-diagnostics", "--silent", "--quiet", "ping"]
          interval: 10s
          timeout: 10s
          retries: 30
          start_period: 10s
          
    # the search engine
    elastic:
      ulimits:
          nofile:
              soft: 1048576
              hard: 1048576
      restart: unless-stopped
      image: docker.elastic.co/elasticsearch/elasticsearch:7.17.1
      container_name: nomad_oasis_elastic_v1
      environment:
          - ES_JAVA_OPTS=-Xms512m -Xmx512m
          - discovery.type=single-node
      volumes:
          - elastic:/usr/share/elasticsearch/data
      healthcheck:
          test:
              - "CMD"
              - "curl"
              - "--fail"
              - "--silent"
              - "http://elastic:9200/_cat/health"
          interval: 10s
          timeout: 10s
          retries: 30
          start_period: 60s

    # the user data db
    mongo:
      ulimits:
          nofile:
              soft: 1048576
              hard: 1048576
      restart: unless-stopped
      image: mongo:5.0.6
      container_name: nomad_oasis_mongo_v1
      environment:
          - MONGO_DATA_DIR=/data/db
          - MONGO_LOG_DIR=/dev/null
      volumes:
          - mongo:/data/db
          - ./.volumes/mongo:/backup
      command: mongod --logpath=/dev/null # --quiet
      healthcheck:
          test:
              - "CMD"
              - "mongo"
              - "mongo:27017/test"
              - "--quiet"
              - "--eval"
              - "'db.runCommand({ping:1}).ok'"
          interval: 10s
          timeout: 10s
          retries: 30
          start_period: 10s
           
    # nomad worker (processing)
    worker:
      ulimits:
          nofile:
              soft: 1048576
              hard: 1048576
      restart: unless-stopped
      image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest
      container_name: nomad_oasis_worker_v1
      environment:
          <<: *nomad_backend_env
          NOMAD_SERVICE: nomad_oasis_worker
          OMP_NUM_THREADS: 1
      depends_on:
          rabbitmq:
              condition: service_healthy
          elastic:
              condition: service_healthy
          mongo:
              condition: service_healthy
      volumes:
          - ./configs/nomad.yaml:/app/nomad.yaml
          - /var/snap/docker/common/var-lib-docker/volumes/oasis_nomad_oasis_files/_data:/app/.volumes/fs
      command: python -m celery -l info -A nomad.processing worker -Q celery --concurrency=5
      
    # nomad app (api + gui)
    app:
      ulimits:
          nofile:
              soft: 1048576
              hard: 1048576
      restart: unless-stopped
      image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest
      container_name: nomad_oasis_app_v1
      environment:
          <<: *nomad_backend_env
          NOMAD_SERVICE: nomad_oasis_app
          NOMAD_SERVICES_API_PORT: 80
          NOMAD_FS_EXTERNAL_WORKING_DIRECTORY: "$PWD"
      depends_on:
          rabbitmq:
              condition: service_healthy
          elastic:
              condition: service_healthy
          mongo:
              condition: service_healthy
      volumes:
          - ./configs/nomad.yaml:/app/nomad.yaml
          - /var/snap/docker/common/var-lib-docker/volumes/oasis_nomad_oasis_files/_data:/app/.volumes/fs
      command: ./run.sh
      healthcheck:
          test:
              - "CMD"
              - "curl"
              - "--fail"
              - "--silent"
              - "http://localhost:8000/-/health"
          interval: 10s
          timeout: 10s
          retries: 30
          start_period: 10s
          
    # nomad gui (a reverse proxy for nomad)
    proxy:
      ulimits:
          nofile:
              soft: 1048576
              hard: 1048576
      restart: unless-stopped
      image: nginx:1.13.9-alpine
      container_name: nomad_oasis_proxy_v1
      command: nginx -g 'daemon off;'
      volumes:
          - ./configs/nginx.conf:/etc/nginx/conf.d/default.conf
          - ./configs/cert/cert-oasis-witchchain.pem:/ssl/cert-oasis-witchchain.pem
          - ./configs/cert/server-oasis-mch-key.pem:/ssl/server-oasis-mch-key.pem
      depends_on:
          app:
              condition: service_healthy
          worker:
              condition: service_started # TODO: service_healthy
      ports:
          - 443:443
 
volumes:
    mongo:
        name: "nomad_oasis_mongo"
    elastic:
        name: "nomad_oasis_elastic"
    rabbitmq:
        name: "nomad_oasis_rabbitmq"
    keycloak:
        name: "nomad_oasis_keycloak"
    nomad_oasis_files:

networks:
    default:
        name: nomad_oasis_network

nginx.conf

Config file for the nginx server, also based on the default one from the upstream docs. The custom settings are mostly the ssl configuration. See the nginx docs http://nginx.org/en/docs/ for more info.

map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

server {
  listen           443 ssl;
  server_name      oasis.mch.rwth-aachen.de;
  proxy_set_header Host $host;
  
  ssl on;
  ssl_certificate      /ssl/cert-oasis-witchchain.pem;
  ssl_certificate_key  /ssl/server-oasis-mch-key.pem;
  ssl_protocols        TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;
  ssl_ciphers          HIGH:!aNULL:!MD5;
  
  location / {
      proxy_pass http://app:8000;
  }
  
  location ~ /nomad-oasis\/?(gui)?$ {
      rewrite ^ /nomad-oasis/gui/ permanent;
  }
  
  location /nomad-oasis/gui/ {
      proxy_intercept_errors on;
      error_page 404 = @redirect_to_index;
      proxy_pass http://app:8000;
  }
  
  location @redirect_to_index {
      rewrite ^ /nomad-oasis/gui/index.html break;
      proxy_pass http://app:8000;
  }
  
  location ~ \/gui\/(service-worker\.js|meta\.json)$ {
      add_header Last-Modified $date_gmt;
      add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0';
      if_modified_since off;
      expires off;
      etag off;
      proxy_pass http://app:8000;
  }
  
  location ~ /api/v1/uploads(/?$|.*/raw|.*/bundle?$)  {
      client_max_body_size 35g;
      proxy_request_buffering off;
      proxy_pass http://app:8000;
  }
  
  location ~ /api/v1/.*/download {
      proxy_buffering off;
      proxy_pass http://app:8000;
  }
}
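
After editing nginx.conf, the syntax can be checked inside the running proxy container before restarting it (the container name comes from the docker-compose.yaml above; nginx -t only tests the configuration, it does not apply it):

docker exec nomad_oasis_proxy_v1 nginx -t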

Nomad admin tools

The GUI has almost no administration options. Hence, if some hand-editing of the database and uploads is needed, one has to use the command-line nomad admin tools.

First, ssh to the oasis machine and connect to the app container with

docker exec -ti nomad_oasis_app_v1 /bin/bash

Now the nomad admin command provides a lot of management options. See https://nomad-lab.eu/prod/rae/docs/client/cli_ref.html#admin-cli-commands for documentation, or run it with the --help switch. Apply extreme caution, as one can easily delete or destroy everything with the nomad admin tools!

For example, if you want to delete a selected upload, do

nomad admin uploads rm -- <upload id>

If you updated a parser and want to regenerate the metadata or possibly detect new entries in old uploads (if the update includes a completely new parser), that can be done with

nomad admin uploads re-process -- <upload id>

for a selected upload id, or with

nomad admin uploads re-process

for all uploads (when no upload id is specified, the command is applied to all of them). Please note that the re-processing is quite intensive and can run for hours (in the future, once we have more entries in the Oasis, days).
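
To look up the <upload id> values used above, the uploads can be listed first (assuming the ls subcommand, which is part of the admin CLI in recent NOMAD versions):

nomad admin uploads ls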

Backups

A daily backup is performed on the RWTH Commvault System and a monthly backup on the DC1.mch.rwth-aachen.de NAS station.

Getting help

The main channel to the developers is the https://matsci.org/c/nomad/32 forum. Developers usually respond within hours. If a bug is found, it should be reported in the github repo https://github.com/nomad-coe/nomad/issues or in the specific parser subprojects https://github.com/nomad-coe/nomad-parser-*/issues. pavel.ondracka@gmail.com might be willing to help as well.

Current TODO list

  • Get the LOBSTER and OpenMX parsers fully upstream so we get rid of the maintenance burden. If there are changes in the upstream metainfo scheme or some significant refactoring, the parsers will likely break and will need to be fixed. If we can get them into upstream NOMAD, the developers introducing the changes will have to fix the parsers as well. The parsers currently reside at https://github.com/ondracka/. LOBSTER is already added to the list of parsers in upstream NOMAD, with some minor things missing: https://github.com/nomad-coe/nomad/issues/20; however, the ultimate goal would be to move the repo to https://github.com/nomad-coe as well. OpenMX parser upstreaming is currently on hold due to a rewrite in upstream NOMAD https://matsci.org/t/openmx-parser/37525.
  • Make sure there is enough disc space. We now have just 1TB (at the time of writing ~80% full).
  • Get more RAM for the VM (this would allow higher concurrency for the workers and hence faster parsing).
  • The current MCh NOMAD Oasis manual/guide is located in the wiki. Recent experience, however, shows that not many people have wiki access and even fewer people actually use it. Since we have a webpage running anyway, it might be a good idea to move the info to https://oasis.mch.rwth-aachen.de/nomad-oasis/gui/.