====== MCh NOMAD Oasis Administration ======

FIXME

NOMAD Oasis runs on the oasis.mch.rwth-aachen.de machine, which is managed by our IT (currently: Sergej Laiko). Contact IT (currently Sergej) for login and password. The system itself runs in a virtual machine (VM) using an Ubuntu clone optimized for VMs. The machine currently has 8 cores and 32GB of RAM. The disc space of the VM itself is quite small at 32GB; the Oasis data and docker images reside on a separate /dev/xvdb1 drive (currently 1TB).

===== MCh NOMAD Oasis =====

The upstream documentation for running an Oasis is here: [[https://nomad-lab.eu/prod/rae/docs/oasis.html]]. The system works through docker [[https://www.docker.com/]] containers; you can think of docker as another lightweight level of virtualization, so that each of the services needed for a working Oasis has its own container. The complete Oasis settings reside in the ''oasis'' folder. The separate docker containers are configured in the ''docker-compose.yaml'' file, the main oasis config file is ''nomad.yaml'', webserver settings are in ''nginx.conf'', and the ssl keys are there as well (in order to have working https).

==== Docker settings ====

All the NOMAD Oasis data reside within the docker directory tree. In fact the main drive ''/dev/xvdb1'' is mounted as ''/var/snap/docker/common/var-lib-docker'', so most of the docker settings and images should be on it as well.

Oasis is started automatically on system startup. If not, you can start it manually by running

<code>
cd /home/admo/oasis_1
docker-compose up -d
</code>

from the ''oasis_1'' directory. You can stop everything with

<code>
docker-compose down
</code>

For further ''docker-compose'' commands see the docker-compose docs [[https://docs.docker.com/compose/]].

==== Building our custom oasis ====

While upstream NOMAD provides ready-made docker images, we want to apply some customization, so we have to build our own images.
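Before starting or stopping anything, it can pay off to check what state the containers are in. A minimal sketch (the ''container_status'' helper is our own illustration, not docker or NOMAD tooling; the container names are the ones from the ''docker-compose.yaml'' shown further down this page):

```shell
#!/bin/sh
# Quick status overview of the Oasis containers.
# NOTE: container_status is our own helper; the container names come from
# the docker-compose.yaml on this page.

# Report the docker state of one container, or "missing" if the container
# (or docker itself) is not available.
container_status() {
    docker inspect -f '{{.State.Status}}' "$1" 2>/dev/null || echo missing
}

for c in nomad_oasis_rabbitmq_v1 nomad_oasis_elastic_v1 nomad_oasis_mongo_v1 \
         nomad_oasis_worker_v1 nomad_oasis_app_v1 nomad_oasis_proxy_v1; do
    echo "$c: $(container_status "$c")"
done
```

If everything is fine, each line should report ''running''; anything else is a hint to look at ''docker-compose logs'' before restarting.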
This is done directly from the [[https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/]] tree, which is cloned to the ''nomad-git'' directory on the VM. Updating the git tree works in the usual way, i.e.

<code>
cd nomad-git
git fetch
git rebase origin/<branch>
</code>

''origin/master'' is the latest stable branch, so that is the safe bet, unless some extra features from development branches are needed. The tricky part is that the parsers and other components are incorporated through submodules. So after the rebase of the nomad-FAIR tree,

<code>
git submodule update --init --recursive
</code>

is also needed to keep the submodules in sync with the nomad tree. If one needs to update a parser or submodule to a newer version than what is included in the current tree, one must go to the specific directory; for example, to update the LOBSTER parser to the latest upstream do

<code>
cd dependencies/parsers/lobster/
git fetch && git rebase origin/master
cd ../../..
git add dependencies/parsers/lobster/
git commit
</code>

When you are happy with the update and changes, the docker images can be built with

<code>
docker build .
</code>

After the build finishes, check the list of all built images with

<code>
docker images
</code>

and write down the image ID of the most recent one. Now tag it with ''docker tag'' and replace the currently used images for the ''worker'' and ''app'' containers in ''docker-compose.yaml'' with the newly built one:

<code>
worker:
    ...
    image: <image name>:latest
    ...
app:
    ...
    image: <image name>:latest
    ...
</code>

Now just restart the docker containers. A good idea is to check beforehand that no one is uploading/downloading something to the oasis; this can be checked with ''top'' (there should be no processes with high CPU usage running).

<code>
docker-compose down
docker-compose up -d
</code>

Everything should be running the new version now.

=== Current patches ===

We have a set of custom patches on top of the vanilla NOMAD Oasis, so it is possible that there will be some conflicts during the rebase.
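The update-and-rebuild steps above can be sketched as one script. The ''newest_image_id'' helper and the ''mch-oasis'' tag name are our own illustrations (nothing here is official NOMAD tooling), and the whole docker part is guarded so it only runs on the Oasis VM:

```shell
#!/bin/sh
# Sketch of the rebuild-and-retag flow described above.
# NOTE: newest_image_id and the tag "mch-oasis" are our own choices.

# Pick the IMAGE ID out of the first (most recent) data row of
# `docker images` output fed on stdin.
newest_image_id() {
    awk 'NR==2 { print $3 }'
}

# The actual rebuild sequence; only meaningful on the Oasis VM itself.
if [ -d /home/admo/nomad-git ]; then
    cd /home/admo/nomad-git
    git fetch
    git rebase origin/master
    git submodule update --init --recursive
    docker build .
    id=$(docker images | newest_image_id)
    docker tag "$id" mch-oasis:latest
fi
```

After tagging, point the ''worker'' and ''app'' entries in ''docker-compose.yaml'' at the new tag and restart as described above.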
They need to be resolved in the usual ''git'' way; however, sometimes Python or web frontend programming skills might be needed if the conflicts are more complex. Most of the patches are quite simple though. The current set of patches is:

  * MCh specific tweaking (this patch updates the webpage to be actually relevant for us, as by default it is just a copy of the upstream NOMAD)
  * Try harder to detect surfaces (small patch for the system type classification: see https://github.com/nomad-coe/nomad/issues/8)
  * VASP parser update (we have one custom VASP parser patch from https://github.com/nomad-coe/nomad-parser-vasp/issues/7)
  * LOBSTER parser update (the LOBSTER parser is now part of the upstream but some minor parts of the integration are still missing, see https://github.com/nomad-coe/nomad/issues/20)
  * OpenMX parser
  * Mark systems that are too large for the MatID classification

The MCh specific tweaking is likely going to stay forever, as well as the system type (MatID based) classification patches. The rest should hopefully not be needed once the OpenMX and LOBSTER parsers are properly upstreamed. The fix for the VASP parser is questionable, so it might also never make it to the upstream branch.

==== Current oasis settings ====

=== nomad.yaml ===

This is the main Oasis config file. It was mostly copy-pasted from the example at [[https://nomad-lab.eu/prod/rae/docs/oasis.html]] and configured for our specific url and settings. See the inline comments for explanation of some of the more specific settings.

<code yaml>
services:
  api_host: 'oasis.mch.rwth-aachen.de'
  https: true
  https_upload: true
  api_port: 443
  api_base_path: '/nomad-oasis'

mongo:
  db_name: nomad_v1

elastic:
  entries_index: nomad_oasis_entries_v1
  materials_index: nomad_oasis_materials_v1

# This was needed to combat some hard to debug parser bugs in the past,
# might not be needed anymore and probably comes with some performance penalty,
# however it is likely the safe option. See https://matsci.org/t/36150 for
# the original problems.
# process_reuse_parser: false

# If new mainfile matching should be done on reprocessing (for example if a
# parser for a new DFT code was added). See https://matsci.org/t/36286 for
# more details.
reprocess_match: true

# How large systems (in atoms) should be characterized with the system type
# normalizer. Please note that the main limit seems to be the memory.
normalize:
  system_classification_with_clusters_threshold: 350

oasis:
  is_oasis: true
  uses_central_user_management: true
  # Add access for new users by adding their email here.
  allowed_users:
    - nomad-oasis@mch.rwth-aachen.de
    - email1@mch.rwth-aachen.de
    - ...

meta:
  deployment: 'oasis'
  deployment_id: 'oasis.mch.rwth-aachen.de'
  maintainer_email: 'nomad-oasis@mch.rwth-aachen.de'
  deployment_url: 'https://oasis.mch.rwth-aachen.de/api'

celery:
  timeout: 10000
  max_memory: 16000000
</code>

=== docker-compose.yaml ===

Mostly copy-pasted from [[https://nomad-lab.eu/prod/rae/docs/oasis.html]]. The only changes are the ''--concurrency=5'' switch for the ''worker'' container (that influences the number of cores used for parsing; 5 is a compromise between parallelism and possible slowdowns due to swapping when parsing large cases. The memory-heavy stuff is at the moment mostly the vasprun.xml parser [[https://github.com/nomad-coe/nomad-parser-vasp/issues/12]] and the system normalizer for systems with a lot of atoms) and the ''OMP_NUM_THREADS: 1'' environment variable for the worker to prevent overload (see [[https://github.com/nomad-coe/nomad/issues/10]] for details).
<code yaml>
version: '3'

x-common-variables: &nomad_backend_env
  NOMAD_RABBITMQ_HOST: rabbitmq
  NOMAD_ELASTIC_HOST: elastic
  NOMAD_MONGO_HOST: mongo

services:
  # broker for celery
  rabbitmq:
    restart: always
    image: rabbitmq:3.11.5
    container_name: nomad_oasis_rabbitmq_v1
    environment:
      - RABBITMQ_ERLANG_COOKIE=SWQOKODSQALRPCLNMEQG
      - RABBITMQ_DEFAULT_USER=rabbitmq
      - RABBITMQ_DEFAULT_PASS=rabbitmq
      - RABBITMQ_DEFAULT_VHOST=/
    volumes:
      - nomad_oasis_rabbitmq:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "--silent", "--quiet", "ping"]
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 10s

  # the search engine
  elastic:
    ulimits:
      nofile:
        soft: 1048576
        hard: 1048576
    restart: unless-stopped
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.1
    container_name: nomad_oasis_elastic_v1
    environment:
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
      - discovery.type=single-node
    volumes:
      - elastic:/usr/share/elasticsearch/data
    healthcheck:
      test:
        - "CMD"
        - "curl"
        - "--fail"
        - "--silent"
        - "http://elastic:9200/_cat/health"
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 60s

  # the user data db
  mongo:
    ulimits:
      nofile:
        soft: 1048576
        hard: 1048576
    restart: unless-stopped
    image: mongo:5.0.6
    container_name: nomad_oasis_mongo_v1
    environment:
      - MONGO_DATA_DIR=/data/db
      - MONGO_LOG_DIR=/dev/null
    volumes:
      - mongo:/data/db
      - ./.volumes/mongo:/backup
    command: mongod --logpath=/dev/null # --quiet
    healthcheck:
      test:
        - "CMD"
        - "mongo"
        - "mongo:27017/test"
        - "--quiet"
        - "--eval"
        - "'db.runCommand({ping:1}).ok'"
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 10s

  # nomad worker (processing)
  worker:
    ulimits:
      nofile:
        soft: 1048576
        hard: 1048576
    restart: unless-stopped
    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest
    container_name: nomad_oasis_worker_v1
    environment:
      <<: *nomad_backend_env
      NOMAD_SERVICE: nomad_oasis_worker
      OMP_NUM_THREADS: 1
    depends_on:
      rabbitmq:
        condition: service_healthy
      elastic:
        condition: service_healthy
      mongo:
        condition: service_healthy
    volumes:
      - ./configs/nomad.yaml:/app/nomad.yaml
      - /var/snap/docker/common/var-lib-docker/volumes/oasis_nomad_oasis_files/_data:/app/.volumes/fs
    command: python -m celery -l info -A nomad.processing worker -Q celery --concurrency=5

  # nomad app (api + gui)
  app:
    ulimits:
      nofile:
        soft: 1048576
        hard: 1048576
    restart: unless-stopped
    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest
    container_name: nomad_oasis_app_v1
    environment:
      <<: *nomad_backend_env
      NOMAD_SERVICE: nomad_oasis_app
      NOMAD_SERVICES_API_PORT: 80
      NOMAD_FS_EXTERNAL_WORKING_DIRECTORY: "$PWD"
    depends_on:
      rabbitmq:
        condition: service_healthy
      elastic:
        condition: service_healthy
      mongo:
        condition: service_healthy
    volumes:
      - ./configs/nomad.yaml:/app/nomad.yaml
      - /var/snap/docker/common/var-lib-docker/volumes/oasis_nomad_oasis_files/_data:/app/.volumes/fs
    command: ./run.sh
    healthcheck:
      test:
        - "CMD"
        - "curl"
        - "--fail"
        - "--silent"
        - "http://localhost:8000/-/health"
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 10s

  # nomad gui (a reverse proxy for nomad)
  proxy:
    ulimits:
      nofile:
        soft: 1048576
        hard: 1048576
    restart: unless-stopped
    image: nginx:1.13.9-alpine
    container_name: nomad_oasis_proxy_v1
    command: nginx -g 'daemon off;'
    volumes:
      - ./configs/nginx.conf:/etc/nginx/conf.d/default.conf
      - ./configs/cert/cert-oasis-witchchain.pem:/ssl/cert-oasis-witchchain.pem
      - ./configs/cert/server-oasis-mch-key.pem:/ssl/server-oasis-mch-key.pem
    depends_on:
      app:
        condition: service_healthy
      worker:
        condition: service_started # TODO: service_healthy
    ports:
      - 443:443

volumes:
  mongo:
    name: "nomad_oasis_mongo"
  elastic:
    name: "nomad_oasis_elastic"
  rabbitmq:
    name: "nomad_oasis_rabbitmq"
  keycloak:
    name: "nomad_oasis_keycloak"
  nomad_oasis_files:

networks:
  default:
    name: nomad_oasis_network
</code>

=== nginx.conf ===

Config file for the nginx server, based on the default one as well. The custom settings are mostly the ssl config. See the nginx docs [[http://nginx.org/en/docs/]] for more info.
<code nginx>
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

server {
    listen      443 ssl;
    server_name oasis.mch.rwth-aachen.de;
    proxy_set_header Host $host;

    ssl on;
    ssl_certificate     /ssl/cert-oasis-witchchain.pem;
    ssl_certificate_key /ssl/server-oasis-mch-key.pem;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    location / {
        proxy_pass http://app:8000;
    }

    location ~ /nomad-oasis\/?(gui)?$ {
        rewrite ^ /nomad-oasis/gui/ permanent;
    }

    location /nomad-oasis/gui/ {
        proxy_intercept_errors on;
        error_page 404 = @redirect_to_index;
        proxy_pass http://app:8000;
    }

    location @redirect_to_index {
        rewrite ^ /nomad-oasis/gui/index.html break;
        proxy_pass http://app:8000;
    }

    location ~ \/gui\/(service-worker\.js|meta\.json)$ {
        add_header Last-Modified $date_gmt;
        add_header Cache-Control 'no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0';
        if_modified_since off;
        expires off;
        etag off;
        proxy_pass http://app:8000;
    }

    location ~ /api/v1/uploads(/?$|.*/raw|.*/bundle?$) {
        client_max_body_size 35g;
        proxy_request_buffering off;
        proxy_pass http://app:8000;
    }

    location ~ /api/v1/.*/download {
        proxy_buffering off;
        proxy_pass http://app:8000;
    }
}
</code>

==== Nomad admin tools ====

The GUI has almost no administration options. Hence, if some hand-editing of the database and uploads is needed, one has to use the command line ''nomad admin'' tools. First, ssh to the oasis machine and connect to the app container with

<code>
docker exec -ti nomad_oasis_app /bin/bash
</code>

Now the ''nomad admin'' command provides a lot of management options. See [[https://nomad-lab.eu/prod/rae/docs/client/cli_ref.html#admin-cli-commands]] for documentation, or run it with the ''-''''-help'' switch.
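Before hand-editing anything it is worth confirming the app is actually healthy. A small sketch: ''check_app_health'' is our own wrapper (not a NOMAD tool), and the URL is the one the app container's healthcheck in ''docker-compose.yaml'' uses:

```shell
#!/bin/sh
# NOTE: check_app_health is our own wrapper; the health URL comes from the
# app container's healthcheck in docker-compose.yaml.

check_app_health() {
    docker exec nomad_oasis_app_v1 \
        curl --fail --silent http://localhost:8000/-/health 2>/dev/null \
        || echo unreachable
}

check_app_health
```

If this prints ''unreachable'', fix the containers first; admin commands against a half-broken app can make things worse.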
**Apply extreme caution, as one can easily delete or destroy everything with the nomad admin tools!**

For example, if you want to delete a selected upload, do

<code>
nomad admin uploads rm -- <upload id>
</code>

If you updated a parser and want to regenerate the metadata, or possibly detect new entries in old uploads (if the update brings a completely new parser), that can be done with

<code>
nomad admin uploads re-process -- <upload id>
</code>

for a selected ''upload id'', or with

<code>
nomad admin uploads re-process
</code>

for all uploads (when no upload id is specified, the command is applied to all uploads). Please note that the re-processing is quite intensive and can run for hours (or days in the future, when we have more entries in the oasis).

==== Backups ====

A daily backup is performed on the RWTH Commvault System and a monthly backup on the DC1.mch.rwth-aachen.de NAS station.

===== Getting help =====

The main channel to the developers is the [[https://matsci.org/c/nomad/32]] forum. Developers usually respond within hours. If a bug is found, it should be reported at the github repo [[https://github.com/nomad-coe/nomad/issues]] or to the specific parser subprojects ''https://github.com/nomad-coe/nomad-parser-*/issues''. ''pavel.ondracka@gmail.com'' might be willing to help as well.

===== Current TODO list =====

  * Get the LOBSTER and OpenMX parsers fully upstream so we get rid of the maintenance burden. If there are changes in the upstream metainfo scheme or some significant refactors, the parsers will likely break and will need to be fixed. If we can get them into upstream NOMAD, the developers introducing the changes will have to fix the parsers as well. The parsers currently reside at [[https://github.com/ondracka/]]. LOBSTER is already added to the list of parsers in upstream nomad, with some minor things missing: [[https://github.com/nomad-coe/nomad/issues/20]]; however, the ultimate goal would be to move the repo to [[https://github.com/nomad-coe]] as well.
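Given how destructive these commands can be, a tiny shell guard helps when scripting them. ''require_upload_id'' is our own safety wrapper, not part of the nomad CLI; it refuses to run when the id variable is empty, so a typo cannot silently turn "act on one upload" into "act on all uploads":

```shell
#!/bin/sh
# NOTE: require_upload_id is our own safety wrapper, not a nomad CLI command.

require_upload_id() {
    if [ -z "$1" ]; then
        echo "refusing to run without an explicit upload id" >&2
        return 1
    fi
}

# Example usage (the upload id is deliberately left as a placeholder):
#   id='<upload id>'
#   require_upload_id "$id" && nomad admin uploads rm -- "$id"
```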
  * OpenMX parser upstreaming is currently on hold due to a rewrite in upstream nomad [[https://matsci.org/t/openmx-parser/37525]].
  * Make sure there is enough disc space. We now have just 1TB (~80% full at the time of writing).
  * Get more RAM for the VM (this would allow higher concurrency for the workers and hence faster parsing).
  * The current MCh NOMAD Oasis manual/guide is located at [[mchadmin:mchnomadoasis|the wiki]]. Recent experience however shows that not many people have wiki access and even fewer people actually use it. Since we have a webpage running anyway, it might be a good idea to move the info to [[https://oasis.mch.rwth-aachen.de/nomad-oasis/gui/]].
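For the disc-space item above, a one-liner that could run from cron might help. This is just a sketch of our own: the mount point is the data drive from the "Docker settings" section, and the 90% threshold is an arbitrary choice:

```shell
#!/bin/sh
# NOTE: our own sketch; the mount point is the Oasis data drive described
# in the "Docker settings" section, the 90% threshold is our choice.

DATA_MOUNT=/var/snap/docker/common/var-lib-docker

# Print a warning when the data drive crosses the threshold; print nothing
# (and exit 0) otherwise, or when the mount is not present at all.
df -P "$DATA_MOUNT" 2>/dev/null | awk 'NR==2 {
    sub("%", "", $5)
    if ($5 + 0 >= 90) print "Oasis data drive at " $5 "% - clean up or grow it"
}'
```

Run daily from root's crontab, the output (only produced when the threshold is crossed) would be mailed to the administrator by cron's default behaviour.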