
TensorFlow 2.0 GPU error | When GPU memory runs out

by pagehit 2019. 10. 24.

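While working through the tutorials on the TensorFlow website to learn TensorFlow 2.0, I ran into the error message below. Because I had copied the code straight from the tutorial, I assumed the code itself was fine and read the error message closely. The full error output is at the bottom of this post.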

tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

The message says that cuDNN probably failed to initialize and suggests looking for a warning log message printed above. Following that advice, I scrolled up through the log, and the line below stood out.

2019-10-24 00:24:58.020468: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

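It says the cuDNN handle could not be created, which made me suspect the GPU. I also seemed to remember seeing a similar error message in the TensorFlow KR Facebook group, which I browse whenever I have time. In any case, I googled the error message and, sure enough, found an answer on Stack Overflow, under a question about could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR. With this I was fairly confident that the problem was the GPU running out of memory.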

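Searching for the first error message also turned up a Stack Overflow question and answers about Failed to get convolution algorithm. This is probably because cuDNN failed to initialize. The answers there suggest that the cause may be a cache or may be memory. Some of them include code for limiting GPU memory, but all of it is written for TensorFlow 1.x. So let's check for ourselves whether memory really is the problem, and then look for the fix in the official TensorFlow documentation.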

First, let's watch how the graphics card's memory changes while the code runs. Enter the following command in a terminal to monitor GPU memory usage.

$ watch -n 0.1 nvidia-smi
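The same numbers can also be read from a script, which is handy for logging memory usage next to a training run. The following is only a sketch: it assumes nvidia-smi is on the PATH and uses its CSV query mode, and the helper names parse_memory_csv and query_gpu_memory are made up for this example.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' lines (in MiB) as printed by nvidia-smi in CSV mode."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return one (used_mib, total_mib) tuple per GPU reported by nvidia-smi."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,  # decode the output as text
    )
    return parse_memory_csv(output)
```

Calling query_gpu_memory() in a loop with a short sleep gives roughly the same view as the watch command above, one tuple per GPU.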

When the TensorFlow model starts running, GPU memory usage climbs rapidly, and the code stops the moment GPU memory fills up.

The fix can be found in the official documentation, which says the following.

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.

In other words, by default TensorFlow maps nearly all of the memory on every GPU visible to the process, because doing so reduces memory fragmentation.

 

Allocating memory at runtime only as needed

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
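The official guide also mentions an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, as another way to turn on the same behavior. Below is a minimal sketch, assuming the variable is set before TensorFlow initializes the GPUs (setting it before the import is the safest place). The helper name enable_memory_growth_via_env is made up for illustration.

```python
import os


def enable_memory_growth_via_env():
    """Hypothetical helper: request on-demand GPU memory allocation via the environment."""
    # TensorFlow reads this variable when it sets up the GPU devices,
    # so it has to be in place before that happens.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


enable_memory_growth_via_env()
# import tensorflow as tf  # import TensorFlow only after the variable is set
```

This avoids touching the model code, which can be convenient when the script comes straight from a tutorial. The variable can equally be exported in the shell before launching Python.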
  

 

 

Limiting the total amount of memory allocated on the GPU

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
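Later TensorFlow 2.x releases expose the same setting under non-experimental names. The sketch below assumes a TensorFlow version that provides tf.config.set_logical_device_configuration and tf.config.LogicalDeviceConfiguration. The helper limit_first_gpu and the guarded import are illustrative only and are not part of the tutorial code.

```python
try:
    import tensorflow as tf
except ImportError:  # lets the sketch load on a machine without TensorFlow
    tf = None


def limit_first_gpu(memory_limit_mb=1024):
    """Hypothetical helper: cap TensorFlow's allocation on the first GPU.

    Returns True if the limit was applied, False otherwise.
    """
    if tf is None:
        return False
    try:
        gpus = tf.config.list_physical_devices('GPU')
        if not gpus:
            return False
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit_mb)])
        return True
    except (RuntimeError, AttributeError) as e:
        # RuntimeError: the GPUs were already initialized.
        # AttributeError: this release only has the experimental names.
        print(e)
        return False
```

As with the experimental version, the call has to happen before any operation touches the GPU, so it belongs at the very top of the script.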
  

 


 

The full error output is shown below.

/home/sihyeon-kim/venv-tf2-gpu/bin/python /home/sihyeon-kim/PycharmProjects/tensorflow-2-gpu/mnist-experts.py
2.0.0
2019-10-24 00:24:55.893372: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-10-24 00:24:55.928315: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:55.934706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.725
pciBusID: 0000:01:00.0
2019-10-24 00:24:55.934849: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-10-24 00:24:55.935517: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-24 00:24:55.936119: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-10-24 00:24:55.936281: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-10-24 00:24:55.937158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-10-24 00:24:55.937845: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-10-24 00:24:55.940206: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-24 00:24:55.940277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:55.941002: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:55.941614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-10-24 00:24:55.941838: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-24 00:24:55.965031: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-10-24 00:24:55.965240: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3dac9e0 executing computations on platform Host. Devices:
2019-10-24 00:24:55.965254: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-10-24 00:24:56.032732: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:56.033353: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3daefa0 executing computations on platform CUDA. Devices:
2019-10-24 00:24:56.033367: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-10-24 00:24:56.033462: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:56.033988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.725
pciBusID: 0000:01:00.0
2019-10-24 00:24:56.034010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-10-24 00:24:56.034019: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-24 00:24:56.034026: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-10-24 00:24:56.034033: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-10-24 00:24:56.034040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-10-24 00:24:56.034047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-10-24 00:24:56.034054: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-24 00:24:56.034083: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:56.034615: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:56.035125: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-10-24 00:24:56.035143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-10-24 00:24:56.035887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-24 00:24:56.035896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-10-24 00:24:56.035901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-10-24 00:24:56.035981: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:56.036521: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-24 00:24:56.037102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6971 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
WARNING:tensorflow:Layer my_model is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

2019-10-24 00:24:57.213007: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-24 00:24:57.397040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-10-24 00:24:58.020468: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-24 00:24:58.024766: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-24 00:24:58.024813: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node my_model/conv2d/Conv2D}}]]
Traceback (most recent call last):
  File "/home/sihyeon-kim/PycharmProjects/tensorflow-2-gpu/mnist-experts.py", line 86, in <module>
    train_step(images, labels)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node my_model/conv2d/Conv2D (defined at /venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_train_step_638]

Function call stack:
train_step

Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
  File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
    from apport.report import Report
  File "/usr/lib/python3/dist-packages/apport/report.py", line 30, in <module>
    import apport.fileutils
  File "/usr/lib/python3/dist-packages/apport/fileutils.py", line 23, in <module>
    from apport.packaging_impl import impl as packaging
  File "/usr/lib/python3/dist-packages/apport/packaging_impl.py", line 23, in <module>
    import apt
  File "/usr/lib/python3/dist-packages/apt/__init__.py", line 23, in <module>
    import apt_pkg
ModuleNotFoundError: No module named 'apt_pkg'

Original exception was:
Traceback (most recent call last):
  File "/home/sihyeon-kim/PycharmProjects/tensorflow-2-gpu/mnist-experts.py", line 86, in <module>
    train_step(images, labels)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/home/sihyeon-kim/venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node my_model/conv2d/Conv2D (defined at /venv-tf2-gpu/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_train_step_638]

Function call stack:
train_step
