Notes on compiling PyTorch in the Sophgo RISC-V general-purpose cloud development space @openKylin

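I finally get to try out RISC-V! The operating system is openKylin, running in Sophgo's cloud space.

The goal is to compile and install PyTorch.

First, install git: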

apt install git

Then download PyTorch and Sophgo's CPU library (cpuinfo):

git clone https://github.com/sophgo/cpuinfo.git

git clone https://github.com/pytorch/pytorch

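Note: the submodules have to be fetched as well:

cd pytorch
# Make sure the submodules' remote URLs match the configuration in the parent repository
git submodule sync
# Fetch and update all submodules, initializing any that are not yet initialized and recursing into nested submodules
git submodule update --init --recursive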

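Delete cpuinfo from the pytorch/third_party directory and replace it with Sophgo's cpuinfo library:

cd third_party

rm -rf cpuinfo

# Sophgo's cpuinfo was cloned next to pytorch, i.e. two levels above third_party
cp -rf ../../cpuinfo .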

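Install the required libraries

apt install libopenblas-dev fails because the package is not available; this step can be skipped.

apt install libblas-dev m4 cmake cython3 ccache

Build and install OpenBLAS by hand: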

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make -j8
make PREFIX=/usr/local/OpenBLAS install

The build prints a pile of warnings.

Add this as the last line of /etc/profile:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/OpenBLAS/lib/

Then run: source /etc/profile
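Editing /etc/profile by hand makes it easy to add the same line twice. The following is a minimal sketch, not part of the original procedure, of appending the export line only when it is missing. The helper name append_line_once is made up for illustration, and the demo writes to a temporary file rather than the real /etc/profile.

```python
import os
import tempfile

# The exact line from the step above.
EXPORT_LINE = "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/OpenBLAS/lib/"


def append_line_once(profile_path, line):
    """Append `line` to the file unless an identical line is already there.

    Returns True if the file was modified, False otherwise.
    (Hypothetical helper, for illustration only.)
    """
    content = ""
    if os.path.exists(profile_path):
        with open(profile_path, "r", encoding="utf-8") as f:
            content = f.read()
    if line in content.splitlines():
        return False
    with open(profile_path, "a", encoding="utf-8") as f:
        # Start on a fresh line if the file does not end with a newline.
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True


if __name__ == "__main__":
    # Demo on a scratch file; point it at /etc/profile only deliberately.
    demo = os.path.join(tempfile.mkdtemp(), "profile")
    print(append_line_once(demo, EXPORT_LINE))  # first call modifies the file
    print(append_line_once(demo, EXPORT_LINE))  # second call is a no-op
```

The change still only takes effect in shells that source the profile afterwards.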

Modify the source code

In the pytorch directory, run: vi aten/src/ATen/CMakeLists.txt

Replace the statement: if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)
with: if(FALSE)

vi caffe2/CMakeLists.txt

Replace the statement: target_link_libraries(${test_name}_${CPU_CAPABILITY} c10 sleef gtest_main)
with: target_link_libraries(${test_name}_${CPU_CAPABILITY} c10 gtest_main)

vi test/cpp/api/CMakeLists.txt

Below the statement: add_executable(test_api ${TORCH_API_TEST_SOURCES})
add: target_compile_options(test_api PUBLIC -Wno-nonnull)
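The three manual edits can also be scripted, which helps when the tree is re-cloned. This is a rough sketch under the assumption that each statement appears in the file exactly as quoted above; the names EDITS, apply_edit and patch_tree are invented for illustration. It replaces only the first occurrence and refuses to proceed if the statement is not found.

```python
import os

# (relative path, text to find, replacement), copied from the steps above.
EDITS = [
    ("aten/src/ATen/CMakeLists.txt",
     "if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)",
     "if(FALSE)"),
    ("caffe2/CMakeLists.txt",
     "target_link_libraries(${test_name}_${CPU_CAPABILITY} c10 sleef gtest_main)",
     "target_link_libraries(${test_name}_${CPU_CAPABILITY} c10 gtest_main)"),
    ("test/cpp/api/CMakeLists.txt",
     "add_executable(test_api ${TORCH_API_TEST_SOURCES})",
     "add_executable(test_api ${TORCH_API_TEST_SOURCES})\n"
     "target_compile_options(test_api PUBLIC -Wno-nonnull)"),
]


def apply_edit(text, old, new):
    """Replace the first occurrence of `old` with `new`; fail if `old` is absent."""
    if old not in text:
        raise ValueError("statement not found: " + old)
    return text.replace(old, new, 1)


def patch_tree(root):
    """Apply every edit in EDITS to the files under `root` (the pytorch directory)."""
    for rel_path, old, new in EDITS:
        path = os.path.join(root, rel_path)
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
        with open(path, "w", encoding="utf-8") as f:
            f.write(apply_edit(text, old, new))


if __name__ == "__main__":
    sample = "foo()\n" + EDITS[1][1] + "\nbar()\n"
    print(apply_edit(sample, EDITS[1][1], EDITS[1][2]))
```

Running patch_tree twice would add the -Wno-nonnull line a second time, so it is meant for a fresh checkout.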

Environment variables

# Type these straight into the terminal; they have to be entered again after a restart
export USE_CUDA=0
export USE_DISTRIBUTED=0
export USE_MKLDNN=0
export MAX_JOBS=16
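Because these exports are lost when the shell restarts, one option is to pass them to the build process explicitly. The sketch below only builds the environment dictionary; BUILD_FLAGS and build_env are illustrative names, and the values are the ones listed above.

```python
import os

# The four settings from the export lines above.
BUILD_FLAGS = {
    "USE_CUDA": "0",
    "USE_DISTRIBUTED": "0",
    "USE_MKLDNN": "0",
    "MAX_JOBS": "16",
}


def build_env(base=None):
    """Return a copy of `base` (default: the current environment) with the build flags set."""
    env = dict(os.environ if base is None else base)
    env.update(BUILD_FLAGS)
    return env


if __name__ == "__main__":
    env = build_env()
    for key in sorted(BUILD_FLAGS):
        print(key + "=" + env[key])
    # To launch the build with this environment one could use, for example:
    #   subprocess.run(["python3", "setup.py", "develop", "--cmake"],
    #                  cwd="/root/pytorch", env=env, check=True)
```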

Source of this configuration: https://blog.csdn.net/m0_49267873/article/details/135670989

Build and install

Run:

python3 setup.py develop --cmake

or: python3.10 setup.py install

Reportedly gcc 13 or later is required; the gcc that ships with the system is:

gcc version 9.3.0 (Openkylin 9.3.0-ok12)
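The version comparison can be made mechanical. The sketch below parses a "gcc version X.Y.Z" line such as the one above and compares it with the reported minimum of 13. The function names are illustrative, and the threshold is only the hearsay figure from the text, not a verified requirement.

```python
import re


def parse_gcc_version(text):
    """Extract (major, minor, patch) from a 'gcc version X.Y.Z ...' line, or None."""
    match = re.search(r"gcc version (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=(13, 0, 0)):
    """True if `version` is at least `minimum` (tuples compare element by element)."""
    return version >= minimum


if __name__ == "__main__":
    line = "gcc version 9.3.0 (Openkylin 9.3.0-ok12)"
    version = parse_gcc_version(line)
    print(version, meets_minimum(version))
```

For the system gcc shown above the check is negative, which is consistent with the libatomic workaround being needed.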

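A patch is needed once the build stops with an undefined reference to `__atomic_exchange_1` (at around 78%, by which point libtorch_cpu.so has been built):

# If the patchelf command is not found, run the following
apt install patchelf

# /path is the directory that contains libtorch_cpu.so
patchelf --add-needed libatomic.so.1 /path/libtorch_cpu.so

On the Sophgo cloud system the command is: patchelf --add-needed libatomic.so.1 /root/pytorch/build/lib/libtorch_cpu.so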

Preparation before building

These two packages also have to be installed before building:

pip3 install pyyaml typing_extensions

setuptools needs to be upgraded as well:

pip3 install setuptools -U
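A quick pre-flight check can catch missing Python dependencies before a multi-hour build. This sketch uses the standard importlib.util.find_spec; the REQUIRED mapping and the function name are illustrative. Note that the pyyaml package is imported under the module name yaml.

```python
import importlib.util

# module name to import -> package name to pass to pip3
REQUIRED = {
    "yaml": "pyyaml",
    "typing_extensions": "typing_extensions",
}


def missing_packages(required):
    """Return the pip names of all top-level modules that cannot be found."""
    return sorted(
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("pip3 install " + " ".join(missing))
    else:
        print("all build dependencies found")
```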

Build completed

In the pytorch directory, run:

python3 setup.py develop --cmake

The whole build takes roughly 3-4 hours.

The build finishes with:

Installed /usr/lib/python3.8/site-packages/mpmath-1.3.0-py3.8.egg
Searching for typing-extensions==4.9.0
Best match: typing-extensions 4.9.0
Adding typing-extensions 4.9.0 to easy-install.pth file
detected new path './mpmath-1.3.0-py3.8.egg'

Using /usr/local/lib/python3.8/dist-packages
Finished processing dependencies for torch==2.3.0a0+git5c5b71b

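Testing

Start python3 and run import pytorch: it fails because there is no module named pytorch. Then run import torch.

There was no error, so I assumed the test had passed. Then I suspected it only worked because I was in the pytorch directory, which contains a torch subdirectory, and that the pass was a false one.

I was too hasty: with develop mode, this is exactly how it is meant to be used.

In other words, you have to be in the pytorch directory for Python to pick up the develop-mode torch. In ~/pytorch, start python3 and paste the following code into the interactive prompt; the test passes: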

import torch
import torch.nn as nn
import torch.optim as optim
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
N, D_in, H, D_out = 64, 1000, 100, 10  # N: batch size, D_in: input size, H: hidden size, D_out: output size
x = torch.randn(N,D_in) # x = np.random.randn(N,D_in)
y = torch.randn(N,D_out) # y = np.random.randn(N,D_out)
w1 = torch.randn(D_in,H) # w1 = np.random.randn(D_in,H)
w2 = torch.randn(H,D_out) # w2 = np.random.randn(H,D_out)
learning_rate = 1e-6
for it in range(200):
    # forward pass
    h = x.mm(w1)  # N * H      h = x.dot(w1)
    h_relu = h.clamp(min=0)  # N * H     np.maximum(h,0)
    y_pred = h_relu.mm(w2)  # N * D_out     h_relu.dot(w2)
    # compute loss
    loss = (y_pred - y).pow(2).sum()  # np.square(y_pred-y).sum()
    print(it, loss.item())  # print(it,loss)
    # BP - compute the gradient
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)  # h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())  # grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.clone()  # grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)  # x.T.dot(grad_h)
    # update weights of w1 and w2
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 29870438.0
1 26166322.0
2 25949932.0
3 25343224.0
4 22287072.0
5 16840522.0
6 11024538.0
7 6543464.5
8 3774165.25
9 2248810.5
10 1440020.25
11 1001724.5
12 749632.625
13 592216.6875
14 485451.34375
15 407586.65625
16 347618.4375
17 299686.625
18 260381.9375
19 227590.734375

How can torch be made usable from anywhere in the environment?

It looks like an environment-variable issue; stay tuned.
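While that question is open, it can help to see where Python actually resolves a module from in a given working directory. The sketch below is only a diagnostic, not a fix, and module_origin is an illustrative name. It relies on the standard importlib.util.find_spec, which reports the file a module would be loaded from without fully importing it.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Run once inside ~/pytorch and once elsewhere, then compare the results.
    print("torch resolves to:", module_origin("torch"))
    print("search path:")
    for entry in sys.path:
        print("  " + repr(entry))
```

If the result is None outside ~/pytorch, the source tree is simply not on sys.path there.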

Debugging

Installing libopenblas-dev fails

root@863c89a419ec:~/pytorch/third_party# apt install libopenblas-dev
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package libopenblas-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

It turns out someone has already been through this pitfall: it can be skipped, and OpenBLAS built and installed from source instead.

Error when building PyTorch

python3 setup.py develop --cmake

Building wheel torch-2.3.0a0+git5c5b71b
-- Building version 2.3.0a0+git5c5b71b
Could not find any of CMakeLists.txt, Makefile, setup.py, LICENSE, LICENSE.md, LICENSE.txt in /root/pytorch/third_party/pybind11
Did you run 'git submodule update --init --recursive'?

Go into the third_party directory and run the following to fix it:

rm -rf pthreadpool
# Go back to the pytorch directory before running the next command
git submodule update --init --recursive

It still fails afterwards:

root@863c89a419ec:~/pytorch# python3 setup.py develop --cmake
Building wheel torch-2.3.0a0+git5c5b71b
-- Building version 2.3.0a0+git5c5b71b
Could not find any of CMakeLists.txt, Makefile, setup.py, LICENSE, LICENSE.md, LICENSE.txt in /root/pytorch/third_party/QNNPACK
Did you run 'git submodule update --init --recursive'?

Running git submodule update --init --recursive again changes nothing.

After deleting the QNNPACK directory and running git submodule update --init --recursive once more, it passes.
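Rather than discovering broken submodule checkouts one failed build at a time, the directories can be scanned up front. This sketch looks for the same marker files named in the error message; the function name is illustrative. It is an approximation, since it checks every directory under third_party, including any that are not submodules.

```python
import os
import tempfile

# Marker files listed in the setup.py error message above.
MARKERS = ("CMakeLists.txt", "Makefile", "setup.py",
           "LICENSE", "LICENSE.md", "LICENSE.txt")


def incomplete_checkouts(third_party, markers=MARKERS):
    """Return the names of subdirectories that contain none of the marker files."""
    broken = []
    for name in sorted(os.listdir(third_party)):
        path = os.path.join(third_party, name)
        if not os.path.isdir(path):
            continue
        if not any(os.path.exists(os.path.join(path, m)) for m in markers):
            broken.append(name)
    return broken


if __name__ == "__main__":
    # Self-contained demo: one empty directory and one with a CMakeLists.txt.
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "QNNPACK"))
    os.mkdir(os.path.join(root, "pybind11"))
    with open(os.path.join(root, "pybind11", "CMakeLists.txt"), "w") as f:
        f.write("")
    print(incomplete_checkouts(root))
```

Each directory it reports is a candidate for the delete-and-update treatment described above.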

Error: RuntimeError: Missing build dependency: Unable to `import yaml`.

pip3 install pyyaml

Error: ModuleNotFoundError: No module named 'typing_extensions'

pip3 install typing_extensions fixes it.

Build fails at 78%

/usr/bin/ld: /root/pytorch/build/lib/libtorch_cpu.so: undefined reference to `__atomic_exchange_1'
collect2: error: ld returned 1 exit status
make[2]: *** [caffe2/CMakeFiles/NamedTensor_test.dir/build.make:101: bin/NamedTensor_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:3288: caffe2/CMakeFiles/NamedTensor_test.dir/all] Error 2
/usr/bin/ld: /root/pytorch/build/lib/libtorch_cpu.so: undefined reference to `__atomic_exchange_1'
collect2: error: ld returned 1 exit status
make[2]: *** [caffe2/CMakeFiles/cpu_profiling_allocator_test.dir/build.make:101: bin/cpu_profiling_allocator_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:3505: caffe2/CMakeFiles/cpu_profiling_allocator_test.dir/all] Error 2
[ 78%] Linking CXX executable ../bin/cpu_rng_test
/usr/bin/ld: /root/pytorch/build/lib/libtorch_cpu.so: undefined reference to `__atomic_exchange_1'
collect2: error: ld returned 1 exit status
make[2]: *** [caffe2/CMakeFiles/cpu_rng_test.dir/build.make:101: bin/cpu_rng_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:3536: caffe2/CMakeFiles/cpu_rng_test.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

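My first suspicion was the CPU library. I checked it and it was fine.

Then I tried this approach:

Problem analysis: undefined reference to __atomic_exchange_1

Fix: use patchelf to add the required shared library

# If the patchelf command is not found, run the following
apt install patchelf

# /path is the directory that contains libtorch_cpu.so
patchelf --add-needed libatomic.so.1 /path/libtorch_cpu.so

Location of libtorch_cpu.so: /root/pytorch/build/lib/libtorch_cpu.so

So the command is: patchelf --add-needed libatomic.so.1 /root/pytorch/build/lib/libtorch_cpu.so

Sure enough, after running this command the build could continue.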
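The same diagnosis can be automated for a saved build log. The sketch below scans linker output for undefined `__atomic_*` references and assembles the patchelf command used above. Both function names are illustrative, and the library path is the one from this environment.

```python
import re

# Matches ld messages such as: undefined reference to `__atomic_exchange_1'
ATOMIC_REF = re.compile(r"undefined reference to `(__atomic_\w+)'")


def missing_atomic_symbols(log_text):
    """Return the distinct __atomic_* symbols the linker reported as undefined."""
    return sorted(set(ATOMIC_REF.findall(log_text)))


def patchelf_command(library_path):
    """Build the patchelf invocation that adds libatomic as a needed library."""
    return ["patchelf", "--add-needed", "libatomic.so.1", library_path]


if __name__ == "__main__":
    log = ("/usr/bin/ld: /root/pytorch/build/lib/libtorch_cpu.so: "
           "undefined reference to `__atomic_exchange_1'\n"
           "collect2: error: ld returned 1 exit status\n")
    if missing_atomic_symbols(log):
        print(" ".join(patchelf_command("/root/pytorch/build/lib/libtorch_cpu.so")))
```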

Build fails at 100%

running develop
/usr/lib/python3/dist-packages/setuptools/command/easy_install.py:146: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
Traceback (most recent call last):
File "setup.py", line 1401, in <module>
    main()
File "setup.py", line 1346, in main
    setup(
File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 87, in setup
    return distutils.core.setup(**attrs)
File "/usr/lib/python3/dist-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
File "/usr/lib/python3/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
File "/usr/lib/python3/dist-packages/setuptools/_distutils/dist.py", line 973, in run_commands
    self.run_command(cmd)
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
File "/usr/lib/python3/dist-packages/setuptools/_distutils/dist.py", line 991, in run_command
    cmd_obj.ensure_finalized()
File "/usr/lib/python3/dist-packages/setuptools/_distutils/cmd.py", line 109, in ensure_finalized
    self.finalize_options()
File "/usr/lib/python3/dist-packages/setuptools/command/develop.py", line 52, in finalize_options
    easy_install.finalize_options(self)
File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 231, in finalize_options
    self.config_vars = dict(sysconfig.get_config_vars())
UnboundLocalError: local variable 'sysconfig' referenced before assignment

Try upgrading setuptools:

root@863c89a419ec:~# pip3 install setuptools -U
Collecting setuptools
Using cached setuptools-69.1.0-py3-none-any.whl (819 kB)
Installing collected packages: setuptools
Attempting uninstall: setuptools
    Found existing installation: setuptools 65.3.0
    Not uninstalling setuptools at /usr/lib/python3/dist-packages, outside environment /usr
    Can't uninstall 'setuptools'. No files were found to uninstall.
Successfully installed setuptools-69.1.0
Then build again, and it passes!