I ran into this problem when I went back to an Ubuntu-based GPU server for some AI work after a long break.

(base) root@gpu-server:~$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.161
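This error generally means the NVIDIA kernel module that is currently loaded reports a different version from the user-space NVML library (535.161 in the message above), which often happens when packages are upgraded while the old module stays loaded. The following is a minimal sketch for reading the loaded module's version, assuming the usual /proc/driver/nvidia/version file exists on the machine. The helper name kernel_module_version is hypothetical, not part of any NVIDIA tool.

```shell
# Hypothetical helper: read the text of /proc/driver/nvidia/version on stdin
# and print the first version-like token found on the "NVRM version:" line.
kernel_module_version() {
    grep -m1 '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

if [ -r /proc/driver/nvidia/version ]; then
    # Compare this value with the "NVML library version" printed by nvidia-smi.
    echo "loaded kernel module: $(kernel_module_version < /proc/driver/nvidia/version)"
else
    echo "no NVIDIA kernel module is loaded"
fi
```

If the printed version differs from the NVML library version, the two halves of the driver are out of sync.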

 

I fixed it by reinstalling the NVIDIA driver.

 

The post linked below was a great help.

(Reference: https://dfso2222.tistory.com/69)


1. Completely removing the NVIDIA driver

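# Purge every package whose name starts with "nvidia-"
sudo apt-get remove --purge '^nvidia-.*'
# Reinstall the desktop metapackage in case the purge removed it
sudo apt-get install ubuntu-desktop
# Remove the X configuration, which may still reference the NVIDIA driver
sudo rm /etc/X11/xorg.conf
# Load the open-source nouveau driver at boot again
echo 'nouveau' | sudo tee -a /etc/modules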

Running the commands above in order removes the driver cleanly.
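To confirm the purge worked before reinstalling, you can list what is still installed. The sketch below checks the same "nvidia-" name prefix that the purge pattern matches, so it does not look at other packages such as the libnvidia-* ones. The helper name remaining_nvidia_packages is hypothetical.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print the names of
# packages that are still installed (status "ii") and start with "nvidia-".
remaining_nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /^nvidia-/ { print $2 }'
}

if command -v dpkg >/dev/null 2>&1; then
    left=$(dpkg -l | remaining_nvidia_packages)
    if [ -n "$left" ]; then
        echo "still installed:"
        echo "$left"
    else
        echo "no nvidia-* packages remain"
    fi
fi
```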

 

2. Installing the NVIDIA driver

(base) admin@gpu-server:~$ sudo apt install nvidia-driver-550
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  ca-certificates-java fonts-dejavu-extra g++-8 java-common javascript-common
  libaccinj64-10.1 libatk-wrapper-java libatk-wrapper-java-jni libclang-cpp10
  libcublas10 libcublaslt10 libcudart10.1 libcufft10 libcufftw10
  libcuinj64-10.1 libcupti-dev libcupti-doc libcupti10.1 libcurand10
  libcusolver10 libcusolvermg10 libcusparse10 libjs-jquery libncurses5
  libnppc10 libnppial10 libnppicc10 libnppicom10 libnppidei10 libnppif10
  libnppig10 libnppim10 libnppist10 libnppisu10 libnppitc10 libnpps10
  libnvblas10 libnvgraph10 libnvidia-container-tools libnvidia-container1
  libnvidia-ml-dev libnvjpeg10 libnvrtc10.1 libnvtoolsext1 libnvvm3
  libstdc++-8-dev libthrust-dev libtinfo5 libvdpau-dev libz3-4 libz3-dev
  llvm-10-tools node-html5shiv openjdk-8-jre openjdk-8-jre-headless
  python3-pygments
... (output truncated) ...

 

3. Verifying the installation

You can check it with the 'nvidia-smi' command.

(base) admin@gpu-server:~$ nvidia-smi
Mon Apr 22 12:04:24 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:3B:00.0 Off |                    0 |
|  0%   39C    P8             13W /  300W |      13MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     Off |   00000000:AF:00.0 Off |                    0 |
|  0%   33C    P8             13W /  300W |      13MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A      1755      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
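The same check can be scripted instead of reading the table by eye. This is a minimal sketch that uses nvidia-smi's CSV query output (one driver version per GPU) and assumes the expected major version is 550, as installed above. The helper name all_gpus_on_driver is hypothetical.

```shell
# Hypothetical helper: succeed only if stdin is non-empty and every line
# (one driver version per GPU) starts with "<major>.".
all_gpus_on_driver() {
    awk -v want="$1" '
        { n++; if (index($0, want ".") != 1) bad = 1 }
        END { if (n == 0 || bad) exit 1 }
    '
}

if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi --query-gpu=driver_version --format=csv,noheader | all_gpus_on_driver 550; then
        echo "all GPUs report driver 550.x"
    else
        echo "driver check failed"
    fi
fi
```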

 

Thank you.
