NVIDIA-powered ML station with Fedora 29 + Docker

tl;dr

  • Use Fedora 29 (soft recommendation — worked for me when CentOS 7 broke)
  • Use Docker (strong recommendation — this is the way to go)
  • Script to configure NVIDIA drivers on Fedora here
  • Script to install NVIDIA docker, followed by a GPU-integrated tensorflow container here

A bit of context…

Last year I wrote about my long march toward finally getting a home deep learning station with a CUDA-integrated NVIDIA 1080 GPU up and running on CentOS 7. This worked great for a while, until one day I rebooted the machine and the console threw some errors related to the nouveau drivers. After multiple restarts, and after doing everything I could to edit the nouveau settings from GRUB, I was still unable to log in. I wasn’t worried: I hadn’t put a ton of time into configuring the software on the machine, and planned to just reinstall based on my guide.

While I was able to complete the CentOS 7 reinstallation, I never saw the login screen again. Something about how the GPU interacted with the OS changed, and I didn’t have the courage to open the box back up, take out the GPU, and debug by running the computer off of the motherboard’s integrated graphics.¹

I was stuck again. Out of a combination of frustration and busyness at work, I just gave up for a while.

Enter Fedora

It wasn’t until I was preparing to leave Credijusto and start a new project with an (awesome 🤘🏼) NYC-based team of friends from college that getting the home work station running again became a priority. On one of my last days in the office, as I explained in exasperation my never-ending struggle to get an NVIDIA workstation running, Ivan Californias casually suggested I try Fedora and handed me a bootable USB.

I didn’t expect this to work. Most of my team favored removing the GPU, running off of the motherboard only, and fixing whatever was wrong with the nouveau drivers. I still had one of the big takeaways from last year’s struggle top-of-mind, however: software is cheap, hardware is expensive. I gave the Fedora install a try, and it went off flawlessly. Eight months after the CentOS 7 setup broke down, I had a desktop again. I decided to get as far as I could with the NVIDIA/CUDA setup on Fedora 29.

Enter Docker

Docker is where it’s at for applications that have elaborate system configurations, like a CUDA-integrated deep learning environment. Without belaboring the point, a lot of what I’ve been struggling with (for a couple of years now!) while doing this work is that system-level builds of these environments cause all kinds of headaches. Updates to Linux distros, CUDA, and the leading toolkits aren’t synced, so you can save your setup code, but it probably won’t work in 6 months. When you’re building at the system level you’re trying to hit a bull’s eye in a shifting 3-dimensional target space. Sometimes this is super easy. Oftentimes it’s really complicated. My moment of truth came when, after two previous installation attempts, I couldn’t compile CUDA’s test suite because Fedora’s gcc version was ahead of the version required by the most recent CUDA release at the time.

The beauty of docker for this specific use case is that once you’ve configured your NVIDIA drivers you never have to touch your system configuration again. Using NVIDIA docker you can access your system’s GPU without having to do a system install of CUDA. Any other hacky system updates or dependency fine-tuning that you need to do can go into a customized docker container as well. Building a homemade ML environment using docker is something I’ll leave to an upcoming post, but I will show you how to set up Fedora with your NVIDIA drivers, install nvidia-docker, and build a GPU-integrated tensorflow image.

Getting up and running

I’ve written the script for configuring the NVIDIA drivers on Fedora here, and the script to install NVIDIA docker, followed by a GPU-integrated tensorflow container, here. All I’ve done is scrape together info from other people’s posts² — big thanks to the open source community and all the enthusiasts who work on these challenges.

Get a bootable Fedora USB here. Lots of good guides out there for how to put the hardware together if that’s the step you’re on. My last post about building a DL station has some decent pointers (I think).

Setting up NVIDIA

A bit of prep:

$ sudo dnf install nano
$ sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

A couple of posts also recommended setting SELINUX=disabled in /etc/sysconfig/selinux to disable SELinux enforcement and keep it from interfering with the installs.
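If you want to make that change from the command line instead of an editor, here’s a sketch (the sed pattern assumes the file still has the stock SELINUX=enforcing line):

```shell
# Show the current SELinux mode (Enforcing, Permissive, or Disabled)
getenforce

# Flip the config from enforcing to disabled; takes effect on the next reboot
sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/sysconfig/selinux
```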

Change to the root user and update the system kernel:

$ sudo su
$ dnf update
$ reboot

Download the NVIDIA drivers and make the file executable. Note that I’m using the latest version as of this writing; yours will probably be different.

$ sudo su
$ chmod +x NVIDIA-Linux-x86_64-418.43.run

Install some dependencies:

$ dnf install kernel-devel kernel-headers gcc make dkms acpid libglvnd-glx libglvnd-opengl libglvnd-devel pkgconfig

Blacklist nouveau with echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf. Then update GRUB with nano /etc/sysconfig/grub, setting GRUB_CMDLINE_LINUX="rd.lvm.lv=fedora/swap rd.lvm.lv=fedora/root rhgb quiet rd.driver.blacklist=nouveau".

Update the grub2 configuration:

$ grub2-mkconfig -o /boot/grub2/grub.cfg
$ grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

Remove xorg drivers and create a new initramfs:

$ dnf remove xorg-x11-drv-nouveau
$ mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img # (backup old initramfs)
$ dracut /boot/initramfs-$(uname -r).img $(uname -r)

At this point we’ve basically unhooked the built-in graphics drivers, and we now need to install and use the NVIDIA drivers. To do this we’ll reboot the computer, start a new session in run level 3 (the terminal interface), and install the drivers with the .run file from above. IMPORTANT: keep a copy of this guide open on your phone or another computer so that you can still access it while in the terminal.

Set computer to boot in run level 3 and reboot:

$ systemctl set-default multi-user.target 
$ reboot

Install NVIDIA drivers with .run file:

$ ./NVIDIA-Linux-x86_64-418.43.run
# "Yes" to register the kernel module sources
# "Yes" to 32-bit compatibility
# "Yes" to NVIDIA xconfig post-install

Reboot back into your desktop:

$ systemctl set-default graphical.target
$ reboot

Verify that drivers are up and running by typing nvidia-smi. This should dump a status panel for the GPU into your terminal.

NVIDIA docker + tensorflow with GPU integration

Docker works best out of the box with Ubuntu or CentOS. You need to massage the install a bit for Fedora, but it’s not too bad.

Install dnf-plugins-core and add docker repository:

$ sudo dnf -y install dnf-plugins-core
$ sudo dnf config-manager \
--add-repo \
https://download.docker.com/linux/fedora/docker-ce.repo

Install latest version of docker and start the daemon:

$ sudo dnf install docker-ce
$ sudo systemctl start docker

Install docker-compose (not strictly necessary, but good to have in the toolbox):

$ sudo curl -L https://github.com/docker/compose/releases/download/1.24.0-rc1/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
$ docker-compose --version

Install and verify nvidia-docker installation:

$ sudo curl -s -L https://nvidia.github.io/nvidia-docker/centos7/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo
$ sudo dnf install nvidia-docker2
$ sudo pkill -SIGHUP dockerd
$ sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

The nvidia-smi call inside your nvidia-docker container should give the same result as the NVIDIA drivers check above. If you got this far, congrats: nvidia-docker is a base that you can run multiple different CUDA-enabled containers on. I’ll do a proof of concept here with tensorflow-gpu.

First, pull and run the latest tensorflow-gpu image:

$ sudo docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu-py3 \
    python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

Enter the image:

$ sudo docker run --runtime=nvidia -it tensorflow/tensorflow:latest-gpu-py3 bash

Install jupyter and fire up a notebook:

pip install jupyter
jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root

We now have a jupyter notebook running within our tensorflow docker container that we can use as a sandbox to play around in. We’ve exposed port 8888, so all we need to know to connect is the Docker container’s IP address. On my machine this is always 172.17.0.2, but you should confirm. In a separate window run sudo docker ps to look up the container name (assigned at random by default), then sudo docker inspect [docker container name] | grep "IPAddress" to find the address.
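If you’d rather skip the grep, docker inspect also accepts a Go-template format string, so something like this should print just the address (substitute your container name from docker ps):

```shell
# Print only the container's bridge-network IP address
sudo docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' [docker container name]
```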

Now look back at the terminal window where you started the notebook. After you started the notebook you should have seen some output like this:

root@671af84bd839:/# jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root
[I 13:36:20.593 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 13:36:21.392 NotebookApp] Serving notebooks from local directory: /
[I 13:36:21.392 NotebookApp] The Jupyter Notebook is running at:
[I 13:36:21.392 NotebookApp] http://(671af84bd839 or 127.0.0.1):8888/?token=1555c981ce1da0c5430ee7ec876e7396e5e04e30f4d9704b
[I 13:36:21.392 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 13:36:21.395 NotebookApp] No web browser found: could not locate runnable browser.

That long sequence after token= is your jupyter notebook token. You can now put together a url for accessing the notebook in your browser as http://[DOCKER_IP_ADDRESS]:8888/?token=[TOKEN]. When I ran this example the url was http://172.17.0.2:8888/?token=1555c981ce1da0c5430ee7ec876e7396e5e04e30f4d9704b.

By now you’re probably ready to prove to yourself that this has been worth all of the effort. In one of your terminal windows activate a status monitor of the NVIDIA card with nvidia-smi -l 1. In your jupyter notebook, run this MNIST demo script:

import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20)
model.evaluate(x_test, y_test)

On my 1080Ti this results in ~25% GPU utilization.

Wrap-up

I’m pretty agnostic on CentOS 7 versus Fedora 29, or any other Linux distro that you can get NVIDIA graphics drivers installed on (I am a bit suspicious of Ubuntu, actually). Fedora helped me fix this weird error on my GPU, and for that reason alone I’m a fan of it. From a usability standpoint I’d actually lean just a bit more toward CentOS 7.

What I do feel strongly about is using Docker to minimize the degree to which you need to play around with your system’s core software and dependencies. I’m convinced that this path will save a lot of you a lot of pain. Some things that we could very easily have done here include:

  • Set a browser url, e.g. http://workbench.info as our jupyter access point.
  • Mount a local drive for sharing files with the docker container.
  • Install jupyter out of the box.
  • Install additional ML/DL toolkits.
  • Clone our git repos.
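As a taste, mounting a local drive (and publishing the notebook port so you don’t need to hunt for the container IP) is just a couple of extra flags on the docker run call; ~/workbench is a placeholder path for illustration:

```shell
# Publish jupyter's port to localhost and mount a host directory for file sharing
sudo docker run --runtime=nvidia -it --rm \
    -p 8888:8888 \
    -v ~/workbench:/workbench \
    tensorflow/tensorflow:latest-gpu-py3 bash
```

With -p 8888:8888 in place you could reach the notebook at http://localhost:8888 instead of using the container’s IP address.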

We would have accomplished this by downloading the tensorflow-gpu container code from DockerHub and modifying it according to our specs.

Dockerizing your CUDA-integrated workflows also simplifies collaboration to a significant degree. Once again, once you have nvidia-docker running, there’s no more local environment hell to navigate. As I ramp up with the new gig, this is absolutely the direction I’ll be taking my workflows involving CUDA integration. For pure python I still get by alright with conda and virtual envs, although my last unpleasant experience getting xgboost running on my system made this ML image with xgboost look pretty good…

Happy hunting.

[1] If you have any ideas about what happened here I would love to hear your thoughts and am happy to unpack the details a bit more.

NVIDIA setup sources:

NVIDIA Docker/pytorch GPU source:

Aaron Polhamus

Working with Team Vest to transform how retail investing is done throughout the Americas 🌎