Accelerating TensorFlow with CUDA

Published at Jul 28, 2020

Last updated Jan 11, 2021

1. Pitfalls
2. Rant

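Gradient descent spends most of its time on vector and matrix operations, and these operations are naturally parallelizable. Running them on a GPU is therefore usually much faster than running them on a CPU. TensorFlow, one of the most widely used frameworks, supports GPU computation through CUDA.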
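To make the "mostly matrix operations" point concrete, here is a minimal pure-Python sketch of gradient descent for least squares. The helper names (`matvec`, `gradient_step`, `fit`) and the toy data are made up for illustration; a real framework would run the same products as GPU kernels instead of Python loops.

```python
def matvec(A, v):
    # Each output entry is an independent dot product,
    # so all rows could in principle be computed in parallel.
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def gradient_step(X, y, w, lr):
    # One step on the loss (1 / 2n) * ||X w - y||^2.
    n = len(X)
    residual = [p - t for p, t in zip(matvec(X, w), y)]      # X w - y
    grad = [g / n for g in matvec(transpose(X), residual)]   # X^T (X w - y) / n
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def fit(X, y, lr=0.5, steps=200):
    w = [0.0] * len(X[0])
    for _ in range(steps):
        w = gradient_step(X, y, w, lr)
    return w


if __name__ == "__main__":
    # Toy data generated from y = 2 * x1 + 3 * x2.
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [2.0, 3.0, 5.0]
    print(fit(X, y))
```

Every step is two matrix-vector products plus an element-wise update, which is exactly the kind of work a GPU spreads across thousands of threads.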

According to the official page, the hardware and software requirements are as follows:

The following GPU-enabled devices are supported:

NVIDIA® GPU card with CUDA® architectures 3.5 or higher. See the list of CUDA®-enabled GPU cards.

For GPUs with unsupported CUDA® architectures, or to avoid JIT compilation from PTX, or to use different versions of the NVIDIA® libraries, see the Linux build from source guide.
- On systems with NVIDIA® Ampere GPUs (CUDA architecture 8.0) or newer, kernels are JIT-compiled from PTX and TensorFlow can take over 30 minutes to start up. This overhead can be limited to the first start up by increasing the default JIT cache size with: ‘export CUDA_CACHE_MAXSIZE=2147483648’ (see JIT Caching for details).
- Packages do not contain PTX code except for the latest supported CUDA® architecture; therefore, TensorFlow fails to load on older GPUs when CUDA_FORCE_PTX_JIT=1 is set. (See Application Compatibility for details.)

The following NVIDIA® software must be installed on your system:

NVIDIA® GPU drivers —CUDA® 10.1 requires 418.x or higher.

CUDA® Toolkit —TensorFlow supports CUDA® 10.1 (TensorFlow >= 2.1.0)

CUPTI ships with the CUDA® Toolkit.

cuDNN SDK 7.6

In other words, pretty much any NVIDIA GPU that isn't ancient (compute capability 3.5 or higher) will do.
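The two numeric thresholds quoted above (compute capability 3.5, driver 418.x for CUDA 10.1) can be turned into a tiny sanity check. This is only a sketch: the function names are hypothetical, and you have to look up your card's compute capability and driver version yourself and pass them in as strings.

```python
MIN_COMPUTE_CAPABILITY = (3, 5)   # "CUDA architectures 3.5 or higher"
MIN_DRIVER_VERSION = (418,)       # "CUDA 10.1 requires 418.x or higher"


def parse_version(text):
    # "5.0" -> (5, 0); "451.67" -> (451, 67)
    return tuple(int(part) for part in text.split("."))


def gpu_is_supported(compute_capability, driver_version):
    # Tuples compare element by element, so (5, 0) >= (3, 5) is True.
    return (
        parse_version(compute_capability) >= MIN_COMPUTE_CAPABILITY
        and parse_version(driver_version) >= MIN_DRIVER_VERSION
    )


if __name__ == "__main__":
    # Compute capability 5.0 with an example driver version.
    print(gpu_is_supported("5.0", "451.67"))
```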

Pitfalls

TensorFlow

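After installing tensorflow I also installed tensorflow-gpu. Don't be misled by conflicting information online: tensorflow-gpu is a standalone GPU-enabled build of TensorFlow, not an add-on module, and since TensorFlow 2.1 the plain tensorflow package includes GPU support as well, so either one should be enough. For the record, this is what I ran:

pip install tensorflow
pip install tensorflow-gpu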

CUDA

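Be sure to install CUDA 10.1; the prebuilt TensorFlow packages support neither newer nor older versions. You will have to dig it out of the CUDA Toolkit Archive.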

cuDNN

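Be sure to install cuDNN 7.6; neither newer nor older versions are supported.
To get it, first register for an NVIDIA Developer account, then look for 7.6 in the cuDNN Archive.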
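If you are unsure which CUDA and cuDNN versions your installed TensorFlow expects, you can ask TensorFlow itself. The sketch below assumes `tf.sysconfig.get_build_info()` is available (as far as I know it only exists in newer releases, so the code falls back gracefully when it is missing). The function name `describe_tf_build` is my own.

```python
def describe_tf_build():
    """Return version info for the installed TensorFlow, or None if it cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    info = {
        "tensorflow": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
    }
    # Assumption: get_build_info() is only present in newer TensorFlow releases.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_build_info is not None:
        build = get_build_info()
        info["cuda"] = build.get("cuda_version")
        info["cudnn"] = build.get("cudnn_version")
    return info


if __name__ == "__main__":
    print(describe_tf_build())
```

If the reported versions differ from what you installed, that mismatch is the first thing to fix.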

Coding

Once everything is installed, run a hello world. TensorFlow may appear to hang at

I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0

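Now go make a cup of coffee, sit back and relax; it should be done in a few hours or so. A delay this long only happens on the first run. The cause seems to be JIT compilation on the GPU side 🙃

Wait until this line appears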

I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1376 MB memory) -> physical GPU (device: 0, name: GeForce 840M, pci bus id: 0000:01:00.0, compute capability: 5.0)

and you know it worked.

My lowly 840M ran training about an order of magnitude faster than the CPU ;-)
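Instead of watching the log, you can also ask TensorFlow which GPUs it sees. This sketch additionally sets `CUDA_CACHE_MAXSIZE` to the 2 GiB value from the official note quoted earlier, on the assumption that a larger JIT cache keeps the long compile to the first run. The variable has to be set before TensorFlow initializes CUDA, and `visible_gpus` is a name I made up.

```python
import os

# 2 GiB JIT cache, the value from the official note; only set if not already defined.
os.environ.setdefault("CUDA_CACHE_MAXSIZE", str(2 * 1024 ** 3))


def visible_gpus():
    """Return the names of the GPUs TensorFlow can see (empty if none or no TensorFlow)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    # Older 2.x releases only expose this under tf.config.experimental.
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is None:
        list_devices = tf.config.experimental.list_physical_devices
    return [device.name for device in list_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(gpus if gpus else "No GPU visible to TensorFlow")
```

An empty list usually means a CUDA or cuDNN version mismatch rather than a hardware problem.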

Rant

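It is already 2020 and CUDA 11.0 has reached RC, yet TensorFlow still only supports 10.1, which was released in February 2019.
The same goes for cuDNN: it is also stuck on a very old version.

CUDA is NVIDIA's proprietary API, and it has become the de facto standard, which is not a good thing. AMD's counterpart, ROCm, has unfortunately not gained much traction yet. The ROCm version of TensorFlow is a fork of the official one, and the binaries are compiled by the community, so you can imagine how many pitfalls there are.

I hope an open standard replaces proprietary CUDA soon, so that AMD GPUs can run scientific computing painlessly too.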
