Fast Median Filter over arbitrary datatypes

Japanese Translation:

記事は、中央値フィルタのステップバイステップ最適化を追跡し、V1–V4 の各バージョンが大きな画像で実行時間を秒からミリ秒へと削減する方法を示しています。

ベースライン（V1）。 ピクセルごとのソートは (O(r^2+r^2\log r^2)) です。AMD EPYC 4564P 16コア CPU 上で、1000×1000 の uint8/float/double 画像に対し、5×5 カーネルの実行時間は約 77 ms（uint8）、95 ms（float）、94 ms（double）です。

V2。 ソートを quick‑select (

std::nth_element

) に置き換え、約 4.2 倍の高速化を達成します。例として、30×30 カーネルは約 5 s から約 1.5 s（uint8）へと短縮されます。

V3。 OpenMP 並列処理を追加し、ブロックが並行に処理されるため、同一ハードウェアで約 16 倍の壁時計時間改善が得られます—30×30 uint8 は ≈13.6 ms です。

V4。 順序変換ビットセットを導入し、ピクセル値のランクを 64‑bit ワードに格納します。中央値はウィンドウシフト後に popcount と _pdep_u64/ctzll イントリニックで求められ、V1 に比べ約 420 倍の高速化（5×5 は ≈6.9 ms uint8、10×30 は ≈15.7 ms float）を実現します。

アルゴリズムはジグザグトラバーサルを採用し、ステップごとに (O(h_x+h_y)) のピクセルのみを更新し、初期ブロックソートを主なデータ型依存コストとして保持します。以降の更新はビットセット操作です。

データ型汎用性（例：uint8 に限定）を排除すると、Huang の 256‑bit ヒストグラムや Perreault–Hebert 列ヒストグラム、Moruto のウェーブレット中央値などのアルゴリズムでさらに高速化が可能です。

これらの成果により、中央値フィルタはリアルタイムビデオと組み込みビジョンシステムで実用化され、カメラ・医療画像処理や自律車両感知におけるノイズ除去速度が向上します。

Median Filter – A Concise Survey of Fast Algorithms

1 – V1: Baseline Median Filter

#include <algorithm>
#include <vector>

template <typename T>
void median_filter_v1(const T* input, T* output,
                     int ny, int nx, int hy, int hx)
{
    T *pixels = (T *)malloc((2 * hy + 1) * (2 * hx + 1) * sizeof(T));

    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {

            int len = 0;
            for (int i = std::max(y - hy, 0);
                 i < std::min(y + hy + 1, ny); ++i)
                for (int j = std::max(x - hx, 0);
                     j < std::min(x + hx + 1, nx); ++j) {
                    pixels[len++] = input[nx * i + j];
                }

            const int mid = len / 2;
            std::sort(pixels, pixels + len);

            if (len % 2)                 // odd
                output[nx * y + x] = pixels[mid];
            else                         // even – floor of the mean
                output[nx * y + x] =
                    (pixels[mid] + pixels[mid - 1]) / T(2);
        }
    }

    free(pixels);
}

Time complexity: O(r² + r² log r²) per pixel.
Benchmark (1000 × 1000, 5–30 × 30 kernels)

kernel	uint8	float	double
5 × 5	77.10 ± 0.07 ms	95.31 ± 0.17 ms	94.38 ± 0.41 ms
10 × 10	549.81 ± 1.42 ms	693.87 ± 1.03 ms	690.97 ± 0.51 ms
15 × 15	1142.64 ± 2.15 ms	1449.58 ± 0.97 ms	1442.46 ± 0.59 ms
20 × 20	2495.58 ± 4.78 ms	3160.45 ± 4.05 ms	3163.54 ± 0.88 ms
25 × 25	3699.39 ± 3.45 ms	4701.75 ± 2.10 ms	4713.27 ± 1.47 ms
30 × 30	5946.51 ± 7.35 ms	7668.03 ± 5.78 ms	7659.16 ± 10.32 ms

2 – V2: Linear‑Time Median Finding

Instead of sorting, use quick‑select (

std::nth_element

) to obtain the median in linear expected time.

// Move the element in the middle as if the array was sorted
std::nth_element(pixels, pixels + mid, pixels + len);

if (len & 1) {
    output[nx * y + x] = pixels[mid];
} else {
    T hi = pixels[mid];
    T lo = pixels[mid - 1];
    for (T *p = pixels; p < pixels + mid - 1; ++p)
        lo = std::max(lo, *p);
    output[nx * y + x] = (lo + hi) / T(2);
}

Speedup: ~ 4.2× over V1 for large kernels.

3 – V3: Making it Multithreaded

The median‑filtering operation is embarrassingly parallel; split the image into blocks and use OpenMP to parallelise the outer loops.

int Sx = nx / Nx + 1;
int Sy = ny / Ny + 1;

#pragma omp parallel for collapse(2) schedule(dynamic)
for (int yg = 0; yg < ny; yg += Sy) {
    for (int xg = 0; xg < nx; xg += Sx) {
        /* median filter over input[yg:yg+Sy, xg:xg+Sx] */
    }
}

Hardware: 16‑core CPU with hyper‑threading → 32 logical cores.
Benchmark

kernel	uint8	float	double
5 × 5	15.68 ± 0.17 ms	18.55 ± 0.87 ms	18.19 ± 0.07 ms
10 × 10	60.67 ± 0.93 ms	72.07 ± 1.36 ms	71.45 ± 0.20 ms
15 × 15	106.52 ± 1.80 ms	128.59 ± 5.41 ms	126.34 ± 1.02 ms
20 × 20	199.56 ± 3.94 ms	234.77 ± 0.44 ms	238.22 ± 3.84 ms
25 × 25	275.47 ± 1.99 ms	330.32 ± 3.13 ms	329.48 ± 1.35 ms
30 × 30	417.48 ± 3.98 ms	500.04 ± 4.30 ms	500.43 ± 6.84 ms

4 – V4: Median over Ordinal Transform

4.1 Ordinal transform

Sort a block once, assign each pixel a rank (0 … B‑1).
Represent the current window as a bitset of ranks; the median is the k‑th set bit.

sorted.emplace_back(value, id);          // id = row * W + col
...
ranks[sorted[i].second] = i;

4.2 Window update

inline void add_rank(int ix, int jy) {
    if (ix < 0 || ix >= bx || jy < 0 || jy >= by) return;
    int rank = ranks[jy * bx + ix];
    int idx  = rank >> 6;                // word index
    buff[idx] ^= (uint64_t(1) << rank & 63);
    psum[idx >= p] += 1;
}

inline void remove_rank(int ix, int jy) {
    if (ix < 0 || ix >= bx || jy < 0 || jy >= by) return;
    int rank = ranks[jy * bx + ix];
    int idx  = rank >> 6;
    buff[idx] ^= (uint64_t(1) << rank & 63);
    psum[idx >= p] -= 1;
}

4.3 Median search

inline int pop(int idx) const { return __builtin_popcountll(buff[idx]); }

inline int search(int target) {
    while (psum[0] > target) { --p; psum[0] -= pop(p); psum[1] += pop(p); }
    while (psum[0] + pop(p) <= target) { psum[0] += pop(p); psum[1] -= pop(p); ++p; }
    int n = target - psum[0];
    uint64_t x = _pdep_u64(uint64_t(1) << n, buff[p]);
    return (p << 6) + __builtin_ctzll(x);
}

4.4 Sliding‑window

inline void compute_median(T* out) {
    for (int ix = x0 - hx; ix < x0 + hx; ++ix)
        for (int jy = y0 - hy; jy < y0 + hy; ++jy) add_rank(ix, jy);

    int x = x0 - 1;
    while (x <= x1) {
        /* remove left column, add right */
        for (int jy = y0 - hy; jy < y0 + hy; ++jy)
            remove_rank(x - hx, jy);
        ++x; if (x > x1) break;
        for (int jy = y0 - hy; jy < y0 + hy; ++jy)
            add_rank(x + hx, jy);

        int y = y0;
        while (y < y1) {
            out[(x + x0b) + nx * (y + y0b)] = get_median();

            /* upper / lower boundaries */
            if (y - hy >= 0)
                for (int ix = -hx; ix <= hx; ++ix)
                    remove_rank(x + ix, y - hy);
            ++y;
            if (y + hy < ny)
                for (int ix = -hx; ix <= hx; ++ix)
                    add_rank(x + ix, y + hy);
        }
        out[(x + x0b) + nx * (y + y0b)] = get_median();

        /* left column again */
        for (int jy = y - hy; jy < y + hy; ++jy)
            remove_rank(x - hx, jy);
        ++x; if (x > x1) break;
        for (int ix = -hx; ix <= hx; ++ix)
            add_rank(x + hx, iy);   // corrected
    }
}

Speedup: ~ 420× over V1.

5 – Limitations

Technique	Applicability	Notes
General‑datatype median	✔	Works for any comparable type.
Huang’s 256‑bit histogram	✘ (8‑bit only)	Constant‑time search.
Perreault & Hebert (column‑histogram)	✘ (8‑bit only)	Constant‑time, 8‑bit images.
Moruto’s wavelet method	✔	Requires 8‑bit images.

Take‑aways

Baseline – O(r² log r²).
Quick‑select – ≈ 4× speed‑up (V2).
Multithreading – ~16× on a 32‑logical‑core CPU (V3).
Ordinal‑transform – ≈ 420× over V1 (V4).

The next step is to drop the general‑datatype restriction and employ specialised 8‑bit tricks for even faster results.

Fast Median Filter over arbitrary datatypes

Japanese Translation:

1 – V1: Baseline Median Filter

2 – V2: Linear‑Time Median Finding

3 – V3: Making it Multithreaded

4 – V4: Median over Ordinal Transform