Compile numpy without openblas to reduce package size

Summary

Compiling numpy yourself on Windows lets you drop the openblas dependency and save about 36 MiB. Building numpy from source on Windows is not as difficult as you might imagine.

Preface

I recently needed to package a tool that does some numpy computation. A first look at the pyinstaller output showed a huge openblas dependency:

 36.4 MiB [##########]  libopenblas64__v0.3.23-246-g3d31191b-gcc_10_3_0.dll
 10.6 MiB [##        ]  main.exe
  5.5 MiB [#         ]  python311.dll
  4.9 MiB [#         ] /numpy
  4.8 MiB [#         ] /pydantic_core
...

I admire the powerful tools of the past that could do a lot in just a few megabytes. Besides, this program does not demand much numpy performance, so I tried to remove the dependency. There are many discussions online about packaging and reducing package size, but they mostly amount to replacing mkl with openblas, and openblas is still quite large.

After some searching, I found a site that provides builds without openblas and mkl, but its last update was in 2022 and the version is still 1.22.4, which is a bit old. Since third-party builds already exist, I guessed this would not be too difficult.

Preparation

Compiler

On Windows, just download Microsoft C++ Build Tools, run the installer, and select "Desktop development with C++". Wait for it to finish and you are good to go. There is no need to set environment variables; the numpy build handles that later.

Source Code

git clone --recurse-submodules https://github.com/numpy/numpy.git
cd numpy
git checkout maintenance/1.24.x # or 1.25.x, doesn't matter

Disable openblas

In numpy/distutils/, add a file site.cfg with the following content:

[openblas]
libraries =
library_dirs =
include_dirs =

See site.cfg.example in the repository root for a complete description of this file.

According to the official docs, you can also disable openblas by setting environment variables.
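
As a rough illustration of the environment-variable route, here is a minimal Python sketch. It assumes the BLAS=None, LAPACK=None, ATLAS=None form described in numpy's build documentation for numpy.distutils; I have not verified it against the exact build in this post, and the helper name blas_free_env is made up for the example.

```python
import os


def blas_free_env(base=None):
    """Return a copy of the environment with accelerated BLAS/LAPACK lookup disabled.

    Assumption: numpy.distutils treats the literal string "None" in these
    variables as "do not use this library".
    """
    env = dict(os.environ if base is None else base)
    env.update({"BLAS": "None", "LAPACK": "None", "ATLAS": "None"})
    return env


# Demo on a tiny stand-in environment; nothing is built here.
demo_env = blas_free_env({"PATH": "C:\\Windows"})
print(demo_env["BLAS"], demo_env["LAPACK"], demo_env["ATLAS"])
```

To use it for a real build, you could pass the result as the env argument of subprocess.run when invoking python setup.py build from the numpy checkout.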

Virtual environment and dependencies

python -m venv .#env
./.#env/Scripts/activate
pip install -r build_requirements.txt

From here on, all commands are assumed to run inside the .#env virtual environment.

Build

For versions <= 1.24.x, you can build directly with MSVC like this:

python setup.py build -j 16

However, for 1.25+, this fails silently on Windows because it generates a single compilation command longer than 32768 characters.
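
To get a feel for how a build step can hit this limit, the sketch below measures the length of the command line Windows would receive for a given argument list. It uses subprocess.list2cmdline from the standard library; the link.exe invocation and the object file names are hypothetical, and the 32768 figure is simply the one quoted above.

```python
import subprocess

WINDOWS_CMD_LIMIT = 32768  # figure quoted in the text, in characters


def command_length(args):
    """Length of the single command-line string built from an argument list."""
    return len(subprocess.list2cmdline(args))


def exceeds_limit(args, limit=WINDOWS_CMD_LIMIT):
    """True if the joined command line is at or beyond the limit."""
    return command_length(args) >= limit


# Hypothetical link step that lists a few thousand object files explicitly.
objects = [f"build\\temp\\obj_{i:05d}.obj" for i in range(2000)]
cmd = ["link.exe", "/DLL", *objects]
print(command_length(cmd), exceeds_limit(cmd))
```

A command this long is usually avoided by passing the file list through a response file, which is one reason a different build driver can succeed where the plain setup.py route does not.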

A workable alternative is to build with cibuildwheel.

Modify pyproject.toml: find [tool.cibuildwheel] and comment out before-build, before-test, and test-command by adding # in front, or just delete them. The result should look like:

[tool.cibuildwheel]
skip = "cp36-* cp37-* pp37-* *-manylinux_i686 *_ppc64le *_s390x *-musllinux_aarch64"
build-verbosity = "3"
# before-build = "bash {project}/tools/wheels/cibw_before_build.sh {project}"
# before-test = "pip install -r {project}/test_requirements.txt"
# test-command = "bash {project}/tools/wheels/cibw_test_command.sh {project}"

The before-build script downloads openblas, so it is not needed here. The two test steps did not work for me.

Then build in PowerShell:

$env:CIBW_BUILD="cp311-win_amd64" # cp311 means CPython 3.11; change it for other versions
$env:CIBW_ENVIRONMENT="NPY_USE_BLAS_ILP64=0" # This flow does not seem to read site.cfg (I don't know why), so repeat the setting via an environment variable
cibuildwheel --platform windows

If everything goes well, you should see a whl file generated under wheelhouse:

-rwxrwxrwx 1 root root 6.1M Aug  7 23:37 wheelhouse/numpy-1.25.2-cp310-cp310-win_amd64.whl
-rwxrwxrwx 1 root root 6.1M Aug 7 23:44 wheelhouse/numpy-1.25.2-cp311-cp311-win_amd64.whl
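
Before installing the wheel, you can check that no BLAS library was bundled into it. A wheel is a zip archive, so the standard zipfile module is enough. The sketch below is only an illustration: the helper name bundled_blas_files is made up, matching on the substrings "openblas" and "mkl" is a heuristic, and the demo runs on a fake archive with invented file names rather than a real numpy wheel.

```python
import tempfile
import zipfile
from pathlib import Path


def bundled_blas_files(wheel_path):
    """Return archive members whose names suggest a bundled BLAS library."""
    markers = ("openblas", "mkl")  # heuristic substrings, not an official list
    with zipfile.ZipFile(wheel_path) as wheel:
        return [name for name in wheel.namelist()
                if any(marker in name.lower() for marker in markers)]


# Demo on a fake wheel; the member names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    fake_wheel = Path(tmp) / "demo-0.0-py3-none-any.whl"
    with zipfile.ZipFile(fake_wheel, "w") as archive:
        archive.writestr("numpy/__init__.py", "")
        archive.writestr("numpy.libs/libopenblas_demo.dll", b"")
    found = bundled_blas_files(fake_wheel)
    print(found)
```

For a wheel built as described above, the expectation would be an empty list when the function is pointed at the file under wheelhouse.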

Test

After installing the wheel, check the build configuration:

> python -c "import numpy as np; np.show_config(); print(np.__version__)"
blas_armpl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blas_ssl2_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
accelerate_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
blas_info:
  NOT AVAILABLE
blas_src_info:
  NOT AVAILABLE
blas_opt_info:
  NOT AVAILABLE
lapack_armpl_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
lapack_ssl2_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
  NOT AVAILABLE
lapack_src_info:
  NOT AVAILABLE
lapack_opt_info:
  NOT AVAILABLE
numpy_linalg_lapack_lite:
    language = c
    define_macros = [('HAVE_BLAS_ILP64', None), ('BLAS_SYMBOL_SUFFIX', '64_')]
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
1.25.2

Good: no openblas at all. Now run the test suite:

> python runtests.py -v --no-build 

...
=============== 35028 passed, 1100 skipped, 1308 deselected, 29 xfailed, 2 xpassed in 345.30s (0:05:45) ===============

For comparison, a regular binary install reports:

> python -c "import numpy as np; np.show_config(); print(np.__version__)"
openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['openblas\\lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['openblas\\lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['openblas\\lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['openblas\\lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['openblas\\lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['openblas\\lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['openblas\\lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['openblas\\lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
1.25.2
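
The two outputs above can also be told apart by a script instead of by eye. The following is a small sketch, not part of numpy: it assumes the section layout printed by show_config() in these 1.2x builds (a header line ending in a colon, followed either by NOT AVAILABLE or by key = value lines), and the function names are made up. Newer numpy releases may print a different format.

```python
def available_sections(show_config_text):
    """Map each section header to True unless it is marked NOT AVAILABLE."""
    sections = {}
    current = None
    for raw in show_config_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A header such as "openblas_info:" starts a new section.
            current = line[:-1]
            sections[current] = False
        elif current is not None and line != "NOT AVAILABLE":
            # Any key = value line means the section has real content.
            sections[current] = True
    return sections


def uses_openblas(show_config_text):
    """True if any available section name mentions openblas."""
    return any(found and "openblas" in name
               for name, found in available_sections(show_config_text).items())


SAMPLE = """\
openblas_info:
  NOT AVAILABLE
numpy_linalg_lapack_lite:
    language = c
"""
print(available_sections(SAMPLE))
print(uses_openblas(SAMPLE))
```

To feed it real data, the text printed by np.show_config() can be captured with contextlib.redirect_stdout into an io.StringIO buffer and passed to uses_openblas.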

Packaging

# test.py

import numpy as np

print(np.random.rand(5, 5))
np.show_config()
print(np.__version__)

With the openblas-free build installed, pyinstaller test.py gives:

   6.9 MiB [##########] /numpy
   5.5 MiB [#######   ]  python311.dll
   3.9 MiB [#####     ]  test.exe
   3.3 MiB [####      ]  libcrypto-1_1.dll
   1.7 MiB [##        ]  base_library.zip
   1.1 MiB [#         ]  unicodedata.pyd
 996.0 KiB [#         ]  ucrtbase.dll
...
Total disk usage: 26.1 MiB  Apparent size: 25.9 MiB  Items: 79

With the default install, the result is:

  36.4 MiB [##########]  libopenblas64__v0.3.23-246-g3d31191b-gcc_10_3_0.dll
   5.5 MiB [#         ]  python311.dll
   4.9 MiB [#         ] /numpy
   3.9 MiB [#         ]  test.exe
   3.3 MiB [          ]  libcrypto-1_1.dll
   1.7 MiB [          ]  base_library.zip
   1.1 MiB [          ]  unicodedata.pyd
 996.0 KiB [          ]  ucrtbase.dll
...
Total disk usage: 60.6 MiB  Apparent size: 60.5 MiB  Items: 81

The difference is about 35 MiB (60.6 MiB versus 26.1 MiB), which is quite significant. With openblas removed, my tool fits in a single 15 MiB file. That is still some way from my goal of a few megabytes, but this is Python after all, so I’m quite satisfied.

Misc

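  • The commit I built 1.25.x on is ea677928332c37e8052b4d599bf6ee52cf363cf9. If your checkout behaves differently, you can git reset to that commit.
  • My Windows version is 22H2 19045.3271.
  • You may need a reliable proxy or VPN, because cibuildwheel downloads a fresh copy of Python from the official Python website.
  • The compilation alone takes around 2 minutes on my old second-hand E5-2678.
  • The sizes above are from ncdu output.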
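
If ncdu is not at hand, a similar per-entry size listing can be produced with a few lines of Python. This is only a sketch with made-up helper names; the demo builds a small temporary directory with invented files, and in practice you would point largest_entries at the pyinstaller output folder instead (by default a folder under dist, though that depends on your pyinstaller options).

```python
import tempfile
from pathlib import Path


def largest_entries(root, top=5):
    """Return (size_in_bytes, name) for the top-level entries of root, largest first."""
    entries = []
    for entry in Path(root).iterdir():
        if entry.is_dir():
            # A directory counts as the sum of all files below it.
            size = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
        else:
            size = entry.stat().st_size
        entries.append((size, entry.name))
    return sorted(entries, reverse=True)[:top]


def format_mib(size):
    """Format a byte count in MiB with one decimal place."""
    return f"{size / 2**20:.1f} MiB"


# Demo on a throwaway directory; the file names and sizes are invented.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "numpy").mkdir()
    (root / "numpy" / "core.pyd").write_bytes(b"\0" * 3000)
    (root / "python311.dll").write_bytes(b"\0" * 2000)
    (root / "test.exe").write_bytes(b"\0" * 1000)
    demo = largest_entries(root)
    for size, name in demo:
        print(f"{format_mib(size):>10}  {name}")
```

Note that this reports apparent file sizes, so the totals can differ slightly from ncdu's disk usage figure.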
