Databricks has an experimental feature that lets you customize your runtime using your own Docker container. In this article, we show how to quickly build a Docker container for data science purposes. In particular, we will create a container for use with Databricks that includes several Natural Language Processing (NLP) packages.
To start off, we will build our Docker image on top of the databricksruntime/standard image, which uses Miniconda to manage environments and dependencies. The Miniconda environment it ships with is called dcs-minimal, and we will add all of our required NLP dependencies to this environment. Below is the environment.yml file that declares which packages will be installed.
```yaml
name: dcs-minimal
channels:
  - default
  - anaconda
dependencies:
  - pip:
    - pyarrow==0.13.0
    - azure==3.0.0
    - scipy
    - scikit-learn
    - spacy
    - nltk
    - gensim
    - textblob
    - allennlp
    - seaborn
    - flashtext
  - pip
  - python=3.7.3
  - six=1.12.0
  - nomkl=3
  - ipython=7.4.0
  - numpy=1.16.2
  - pandas=0.24.2
```
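It can be useful to know which of these dependencies are version-pinned (reproducible builds) and which will float to the latest release. Below is a small, hypothetical sketch, not part of the build itself, that uses only the Python standard library and a naive line parser; it assumes the simple, flat layout of the environment.yml shown above.

```python
def split_pins(env_text):
    """Return (pinned, unpinned) package lists from a flat environment.yml.

    Naive parser: only handles the simple layout used in this article,
    not arbitrary YAML. Channel names and the bare 'pip' dependency
    are skipped.
    """
    pinned, unpinned = [], []
    for raw in env_text.splitlines():
        line = raw.strip()
        # skip top-level keys ("dependencies:") and section headers ("- pip:")
        if not line.startswith("- ") or line.endswith(":"):
            continue
        dep = line[2:]
        if dep in ("default", "anaconda", "pip"):
            continue  # channels and the bare pip package
        # conda pins use "=", pip pins use "=="; both contain "="
        (pinned if "=" in dep else unpinned).append(dep)
    return pinned, unpinned

env = """\
dependencies:
  - pip:
    - pyarrow==0.13.0
    - spacy
  - python=3.7.3
"""
pinned, unpinned = split_pins(env)
print(pinned)    # ['pyarrow==0.13.0', 'python=3.7.3']
print(unpinned)  # ['spacy']
```

Note that most packages above (spacy, nltk, gensim, and so on) are unpinned, so rebuilding the image at a later date may pull newer versions; pinning everything is the safer choice for reproducibility.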
The next thing to do is to define our Dockerfile, which looks like the following.
```dockerfile
FROM databricksruntime/standard:latest
LABEL maintainer="One-Off Coder <email@example.com>"

# update and install Ubuntu packages
RUN apt-get update \
  && apt-get install -y \
    build-essential \
    python3-dev \
  && apt-get clean

# update conda itself
RUN /databricks/conda/bin/conda update -n base -c defaults conda

# copy over the environment.yml file
COPY environment.yml /tmp/environment.yml

# update the environment, dcs-minimal
RUN /databricks/conda/bin/conda env update --file /tmp/environment.yml

# download spaCy's English language model
# and all NLTK packages
RUN /databricks/conda/envs/dcs-minimal/bin/python -m spacy download en_core_web_lg \
  && /databricks/conda/envs/dcs-minimal/bin/python -m nltk.downloader all

# clean up
RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
```

Note that the `LABEL` instruction requires a key-value pair; here we use the conventional `maintainer` key.
That’s it! Now you may build the container as follows.
```shell
docker build --no-cache -t databricks-nlp:local .
```
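Before pushing the image anywhere, it is worth smoke-testing it locally. A minimal sketch, assuming the image tag `databricks-nlp:local` from the build command above and the environment paths used in the Dockerfile:

```shell
# run a throwaway container and confirm the NLP stack is importable
docker run --rm databricks-nlp:local \
  /databricks/conda/envs/dcs-minimal/bin/python - <<'PY'
import spacy, nltk, gensim, textblob
print("nlp stack importable")
PY
```

If any package fails to import, the command exits non-zero and prints the traceback, which is much cheaper to discover here than after attaching the image to a cluster.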
To use this image in Databricks, you have to ask for the customized containers feature to be enabled on your workspace. Whether you are on AWS or Azure, you will then be able to use your customized data science container. The full source code is available on GitHub, and the container is published on Docker Hub.