[PATCH 2/2] gdi32: Generate and use a lookup cache when looking up RGB values for a color table.

Gabriel Ivăncescu gabrielopcode at gmail.com
Wed Mar 31 07:35:58 CDT 2021


This vastly improves the performance. The cache generation is relatively
constant in terms of algorithm complexity, around O(n) with some extra
overhead for the entire cache (depending on the color table's colors, because
they get "blocked" soon enough), which is fixed at 32768 entries. It scales
well with large amount of colors in the color table. The lookup after that
should be fast.

In contrast, the current method is O(N * M) where N is the amount of pixels
and M is the number of colors in the table, which is very slow for larger M
(especially 256 colors).

More detailed information in the bug report.

Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=50898
Signed-off-by: Gabriel Ivăncescu <gabrielopcode at gmail.com>
---
 dlls/gdi32/dibdrv/primitives.c | 272 +++++++++++++++++++++++++++++++--
 1 file changed, 256 insertions(+), 16 deletions(-)

diff --git a/dlls/gdi32/dibdrv/primitives.c b/dlls/gdi32/dibdrv/primitives.c
index 569baae..cc86ed9 100644
--- a/dlls/gdi32/dibdrv/primitives.c
+++ b/dlls/gdi32/dibdrv/primitives.c
@@ -3497,22 +3497,255 @@ static void convert_to_16(dib_info *dst, const dib_info *src, const RECT *src_re
     }
 }
 
-static inline BOOL color_tables_match(const dib_info *d1, const dib_info *d2)
+/*
+ * To lookup RGB values into nearest color in the color table, Windows uses 5-bits of the RGB
+ * at the "center" of the RGB cube, presumably to do a similar lookup cache. The lowest 3 bits
+ * of the color are thus set to halfway (0x04) and then it's used in the distance calculation
+ * to the exact color in the color table. We exploit this as well to create a lookup cache.
+ *
+ * Generating the table is done by going "outwards" from the center of each color in the table.
+ * First, we find out the center RGB cube spot into the map for each color in the color table.
+ * In case of conflict, we calculate the *true* distance, since color table is not quantized.
+ * Each color in the color table should then have an associated center in the lookup cache map.
+ *
+ * Next, we go "outwards" from each center by increasing an offset for the main axis. This is
+ * always expressed as an offset from the center, and starts from 1 and goes up. Each offset
+ * has multiple generations. A generation defines the distance from the center, and is always
+ * increasing. Thus when we're at a generation, we go through *all* colors in the color table
+ * first, fill the entire generation for all of them, then move to the next generation. Since
+ * the next generation will always have a larger distance than the current one, we know that
+ * once we filled all the previous generations, we're done with them and don't need to check.
+ *
+ * Note that an offset spans multiple generations, due to how distances work in 3 dimensions.
+ * For example, an offset of 1 is simple and has 3 generations. The examples below are all
+ * offsets from the center cube, e.g. (0,-1,1) means +1 for first axis, -1 for the second.
+ *
+ * First generation (offset = 1):
+ *   ( 0, 0, 1) ( 0, 0,-1) ( 0, 1, 0) ( 0,-1, 0) ( 1, 0, 0) (-1, 0, 0)
+ *
+ * In the first generation, the distance is always 1. For each color in the color table, we
+ * go through all of the above displacements from the center, and fill the lookup cache map.
+ *
+ * Second generation (offset = 1 still, but distance is larger now; some are dups, that's OK):
+ *   ( 0, 1, 1) ( 0,-1, 1) ( 1, 0, 1) (-1, 0, 1) ( 0, 1,-1) ( 0,-1,-1) ( 1, 0,-1) (-1, 0,-1)
+ *   ( 0, 1, 1) ( 0, 1,-1) ( 1, 1, 0) (-1, 1, 0) ( 0,-1, 1) ( 0,-1,-1) ( 1,-1, 0) (-1,-1, 0)
+ *   ( 1, 0, 1) ( 1, 0,-1) ( 1, 1, 0) ( 1,-1, 0) (-1, 0, 1) (-1, 0,-1) (-1, 1, 0) (-1,-1, 0)
+ *
+ * On the 2nd line, we swapped the main axis from (x,x,1) to (x,1,x), and on 3rd to (1,x,x).
+ * Note that the distance is always the same (sqrt(2)), and we just permute the axis values.
+ *
+ * Third generation (offset = 1 still, but distance is even larger: sqrt(3)):
+ *   ( 1, 1, 1) ( 1,-1, 1) (-1, 1, 1) (-1,-1, 1) ( 1, 1,-1) ( 1,-1,-1) (-1, 1,-1) (-1,-1,-1)
+ *   ...
+ *
+ * Only after all of the coordinate absolute values are equal to the offset do we increase it.
+ * When the offset is increased, the main axis' absolute value is increased, and the distance.
+ * For other offsets, such as offset = 3, generations start with zeros for the other axis, for
+ * example (0,0,3). The next generation will increase one axis by 1, e.g. (0,1,3) and (1,0,3)
+ * which are part of the same generation (same distance). The next generation will increase the
+ * third axis, e.g. (1,1,3). Then the second axis is raised again, and the process repeats for
+ * the third axis until it's equal to the second axis, e.g. (0,2,3)->(1,2,3)->(2,2,3).
+ *
+ * As each generation is increased, the lowest value in the axis is increased, and the other
+ * starts from zero again. Then they are swapped and permuted. This is due to how the distance
+ * is calculated in 3 dimensions. For each generation we fill *each* color in the color table's
+ * displacement in all directions, since they have the same distance.
+ *
+ * Lastly, we keep track of each of the six directions (in 3 dimensions) and if one direction
+ * was not filled at all in the current offset, we mark it as "blocked" and we no longer check
+ * for it the next offset. When all six directions are blocked, the entry is no longer checked.
+ *
+ * The lookup cache map is completely filled when *all* color table entries are blocked.
+*/
+struct rgb_lookup_colortable_ctx
 {
-    if (!d1->color_table || !d2->color_table) return (!d1->color_table && !d2->color_table);
-    return !memcmp(d1->color_table, d2->color_table, (1 << d1->bit_count) * sizeof(d1->color_table[0]));
+    /* lookup table indexed by 5-bit-per-color RGB into a color table */
+    BYTE map[32 * 32 * 32];
+};
+
+static inline BOOL rgb_lookup_colortable_set_pos(struct rgb_lookup_colortable_ctx *ctx, unsigned pos,
+                                                 const RGBQUAD *color_table, unsigned index, BYTE *map_set_bits)
+{
+    int dr1, dg1, db1, dr2, dg2, db2, pos_r, pos_g, pos_b;
+
+    if (pos >= ARRAY_SIZE(ctx->map)) return FALSE;
+
+    if (map_set_bits[pos / 8] & (1 << pos % 8))
+    {
+        /* because we loop from low idx to high, a color table index larger than ours means
+           it's part of a former generation, so it can't possibly have a larger distance */
+        if (index <= ctx->map[pos]) return FALSE;
+
+        /* check the *true* distance since the color table is not quantized to 5 bits */
+        pos_r = ((pos & 0x001f) << 3) | 4;
+        pos_g = ((pos & 0x03e0) >> 2) | 4;
+        pos_b = ((pos & 0x7c00) >> 7) | 4;
+        dr1 = pos_r - color_table[index].rgbRed;
+        dg1 = pos_g - color_table[index].rgbGreen;
+        db1 = pos_b - color_table[index].rgbBlue;
+        dr2 = pos_r - color_table[ctx->map[pos]].rgbRed;
+        dg2 = pos_g - color_table[ctx->map[pos]].rgbGreen;
+        db2 = pos_b - color_table[ctx->map[pos]].rgbBlue;
+
+        if (dr1*dr1 + dg1*dg1 + db1*db1 >= dr2*dr2 + dg2*dg2 + db2*db2)
+            return FALSE;
+    }
+    else
+        map_set_bits[pos / 8] |= 1 << pos % 8;
+
+    ctx->map[pos] = index;
+    return TRUE;
 }
 
-static inline DWORD rgb_lookup_colortable(const dib_info *dst, BYTE r, BYTE g, BYTE b)
+static void rgb_lookup_colortable_init(const dib_info *dib, struct rgb_lookup_colortable_ctx *ctx)
 {
-    /* Windows reduces precision to 5 bits, probably in order to build some sort of lookup cache */
-    return rgb_to_pixel_colortable( dst, (r & ~7) + 4, (g & ~7) + 4, (b & ~7) + 4 );
+    BYTE indices[256], available_directions[256], tmp_directions[256], map_set_bits[ARRAY_SIZE(ctx->map) / 8];
+    unsigned color_table_size = dib->color_table ? dib->color_table_size : 1 << dib->bit_count;
+    const RGBQUAD *color_table = get_dib_color_table(dib);
+    unsigned idx, offset, num_entries;
+    int i, j;
+
+    /* Testing shows that for low amount of colors, the overhead is larger than the
+       O(N*M) algorithm, presumably due to branch prediction and cache locality, but
+       it gets quickly out of hand (up to 5x slower) for larger amount of colors... */
+    if (color_table_size <= 40)
+    {
+        unsigned r, g, b;
+
+        for (b = 4; b < 256; b += 1 << 3)
+            for (g = 4; g < 256; g += 1 << 3)
+                for (r = 4; r < 256; r += 1 << 3)
+                    ctx->map[r >> 3 | (g & ~7) << 2 | (b & ~7) << 7] = rgb_to_pixel_colortable(dib, r, g, b);
+        return;
+    }
+
+    memset(map_set_bits, 0, sizeof(map_set_bits));
+    memset(available_directions, 0xff, color_table_size);
+    memset(tmp_directions, 0, color_table_size);
+
+    /* indirect list of valid color table indices, as we remove those fully surrounded */
+    for (idx = 0; idx < color_table_size; idx++)
+        indices[idx] = idx;
+
+    /* first, fill the centers (offset = 0) of each quantized table color in the map */
+    for (idx = 0; idx < color_table_size; idx++)
+    {
+        int dr1, dg1, db1, dr2, dg2, db2, pos_r, pos_g, pos_b;
+        unsigned pos = (color_table[idx].rgbRed >> 3) |
+                       (color_table[idx].rgbGreen & ~7) << 2 |
+                       (color_table[idx].rgbBlue & ~7) << 7;
+
+        if (map_set_bits[pos / 8] & (1 << pos % 8))
+        {
+            pos_r = (color_table[idx].rgbRed & ~7) | 4;
+            pos_g = (color_table[idx].rgbGreen & ~7) | 4;
+            pos_b = (color_table[idx].rgbBlue & ~7) | 4;
+            dr1 = pos_r - color_table[idx].rgbRed;
+            dg1 = pos_g - color_table[idx].rgbGreen;
+            db1 = pos_b - color_table[idx].rgbBlue;
+            dr2 = pos_r - color_table[ctx->map[pos]].rgbRed;
+            dg2 = pos_g - color_table[ctx->map[pos]].rgbGreen;
+            db2 = pos_b - color_table[ctx->map[pos]].rgbBlue;
+
+            if (dr1*dr1 + dg1*dg1 + db1*db1 >= dr2*dr2 + dg2*dg2 + db2*db2)
+                continue;
+        }
+        else
+            map_set_bits[pos / 8] |= 1 << pos % 8;
+
+        ctx->map[pos] = idx;
+    }
+
+    /* now do the rest */
+    for (offset = 1, num_entries = color_table_size; num_entries != 0; offset++)
+    {
+        for (i = 0; i <= offset; i++)
+        {
+            for (j = 0; j <= i; j++)
+            {
+                /* we're at one generation now, go through each color in the color table */
+                for (idx = 0; idx < num_entries; idx++)
+                {
+                    unsigned direction_mask = 1, color_table_index = indices[idx];
+                    unsigned shift2 = 5, shift3 = 10;
+                    int center, main_axis = offset;
+
+                    center = (color_table[color_table_index].rgbRed >> 3) |
+                             (color_table[color_table_index].rgbGreen & ~7) << 2 |
+                             (color_table[color_table_index].rgbBlue & ~7) << 7;
+                    do
+                    {
+                        do
+                        {
+                            if (available_directions[idx] & direction_mask)
+                            {
+                                BOOL direction_blocked = TRUE;
+                                do
+                                {
+                                    unsigned shift_tmp;
+                                    do
+                                    {
+                                        do
+                                        {
+                                            unsigned pos = center + main_axis + (i << shift2) + (j << shift3);
+
+                                            if (rgb_lookup_colortable_set_pos(ctx, pos, color_table, color_table_index, map_set_bits))
+                                                direction_blocked = FALSE;
+                                            j = -j;
+                                        } while (j < 0);
+                                        i = -i;
+                                    } while (i < 0);
+
+                                    /* repeat once and swap the two non-main axis */
+                                    shift_tmp = shift2; shift2 = shift3; shift3 = shift_tmp;
+                                } while (shift2 > shift3 && i != j);
+
+                                if (!direction_blocked)
+                                    tmp_directions[idx] |= direction_mask;
+                            }
+                            direction_mask <<= 1;
+                            main_axis = -main_axis;
+                        } while (main_axis < 0);
+
+                        /* change main to next axis */
+                        shift2 = 0;
+                        shift3 = (main_axis < 32) ? 10 : 5;
+                        main_axis <<= 5;
+                    } while (main_axis < 32768);
+                }
+            }
+        }
+
+        for (i = 0, idx = 0; idx < num_entries; idx++)
+        {
+            /* discard this color table index from further searches if it's fully surrounded */
+            if (!tmp_directions[idx]) continue;
+
+            available_directions[i] = tmp_directions[idx];
+            tmp_directions[i] = 0;
+
+            indices[i++] = indices[idx];
+        }
+        num_entries = i;
+    }
+}
+
+static inline BYTE rgb_lookup_colortable(const struct rgb_lookup_colortable_ctx *ctx, BYTE r, BYTE g, BYTE b)
+{
+    return ctx->map[(r >> 3) | (g & ~7) << 2 | (b & ~7) << 7];
+}
+
+static inline BOOL color_tables_match(const dib_info *d1, const dib_info *d2)
+{
+    if (!d1->color_table || !d2->color_table) return (!d1->color_table && !d2->color_table);
+    return !memcmp(d1->color_table, d2->color_table, (1 << d1->bit_count) * sizeof(d1->color_table[0]));
 }
 
 static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rect, BOOL dither)
 {
     BYTE *dst_start = get_pixel_ptr_8(dst, 0, 0), *dst_pixel;
     INT x, y, pad_size = ((dst->width + 3) & ~3) - (src_rect->right - src_rect->left);
+    struct rgb_lookup_colortable_ctx lookup_ctx;
     DWORD src_val;
 
     switch(src->bit_count)
@@ -3521,6 +3754,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
     {
         DWORD *src_start = get_pixel_ptr_32(src, src_rect->left, src_rect->top), *src_pixel;
 
+        rgb_lookup_colortable_init(dst, &lookup_ctx);
         if(src->funcs == &funcs_8888)
         {
             for(y = src_rect->top; y < src_rect->bottom; y++)
@@ -3530,7 +3764,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
                 for(x = src_rect->left; x < src_rect->right; x++)
                 {
                     src_val = *src_pixel++;
-                    *dst_pixel++ = rgb_lookup_colortable(dst, src_val >> 16, src_val >> 8, src_val );
+                    *dst_pixel++ = rgb_lookup_colortable(&lookup_ctx, src_val >> 16, src_val >> 8, src_val );
                 }
                 if(pad_size) memset(dst_pixel, 0, pad_size);
                 dst_start += dst->stride;
@@ -3546,7 +3780,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
                 for(x = src_rect->left; x < src_rect->right; x++)
                 {
                     src_val = *src_pixel++;
-                    *dst_pixel++ = rgb_lookup_colortable(dst,
+                    *dst_pixel++ = rgb_lookup_colortable(&lookup_ctx,
                                                          src_val >> src->red_shift,
                                                          src_val >> src->green_shift,
                                                          src_val >> src->blue_shift );
@@ -3565,7 +3799,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
                 for(x = src_rect->left; x < src_rect->right; x++)
                 {
                     src_val = *src_pixel++;
-                    *dst_pixel++ = rgb_lookup_colortable(dst,
+                    *dst_pixel++ = rgb_lookup_colortable(&lookup_ctx,
                                                          get_field(src_val, src->red_shift, src->red_len),
                                                          get_field(src_val, src->green_shift, src->green_len),
                                                          get_field(src_val, src->blue_shift, src->blue_len));
@@ -3582,13 +3816,14 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
     {
         BYTE *src_start = get_pixel_ptr_24(src, src_rect->left, src_rect->top), *src_pixel;
 
+        rgb_lookup_colortable_init(dst, &lookup_ctx);
         for(y = src_rect->top; y < src_rect->bottom; y++)
         {
             dst_pixel = dst_start;
             src_pixel = src_start;
             for(x = src_rect->left; x < src_rect->right; x++, src_pixel += 3)
             {
-                *dst_pixel++ = rgb_lookup_colortable(dst, src_pixel[2], src_pixel[1], src_pixel[0] );
+                *dst_pixel++ = rgb_lookup_colortable(&lookup_ctx, src_pixel[2], src_pixel[1], src_pixel[0] );
             }
             if(pad_size) memset(dst_pixel, 0, pad_size);
             dst_start += dst->stride;
@@ -3600,6 +3835,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
     case 16:
     {
         WORD *src_start = get_pixel_ptr_16(src, src_rect->left, src_rect->top), *src_pixel;
+        rgb_lookup_colortable_init(dst, &lookup_ctx);
         if(src->funcs == &funcs_555)
         {
             for(y = src_rect->top; y < src_rect->bottom; y++)
@@ -3609,7 +3845,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
                 for(x = src_rect->left; x < src_rect->right; x++)
                 {
                     src_val = *src_pixel++;
-                    *dst_pixel++ = rgb_lookup_colortable(dst,
+                    *dst_pixel++ = rgb_lookup_colortable(&lookup_ctx,
                                                          ((src_val >> 7) & 0xf8) | ((src_val >> 12) & 0x07),
                                                          ((src_val >> 2) & 0xf8) | ((src_val >> 7) & 0x07),
                                                          ((src_val << 3) & 0xf8) | ((src_val >> 2) & 0x07) );
@@ -3628,7 +3864,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
                 for(x = src_rect->left; x < src_rect->right; x++)
                 {
                     src_val = *src_pixel++;
-                    *dst_pixel++ = rgb_lookup_colortable(dst,
+                    *dst_pixel++ = rgb_lookup_colortable(&lookup_ctx,
                                                          (((src_val >> src->red_shift)   << 3) & 0xf8) |
                                                          (((src_val >> src->red_shift)   >> 2) & 0x07),
                                                          (((src_val >> src->green_shift) << 3) & 0xf8) |
@@ -3650,7 +3886,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
                 for(x = src_rect->left; x < src_rect->right; x++)
                 {
                     src_val = *src_pixel++;
-                    *dst_pixel++ = rgb_lookup_colortable(dst,
+                    *dst_pixel++ = rgb_lookup_colortable(&lookup_ctx,
                                                          (((src_val >> src->red_shift)   << 3) & 0xf8) |
                                                          (((src_val >> src->red_shift)   >> 2) & 0x07),
                                                          (((src_val >> src->green_shift) << 2) & 0xfc) |
@@ -3672,7 +3908,7 @@ static void convert_to_8(dib_info *dst, const dib_info *src, const RECT *src_rec
                 for(x = src_rect->left; x < src_rect->right; x++)
                 {
                     src_val = *src_pixel++;
-                    *dst_pixel++ = rgb_lookup_colortable(dst,
+                    *dst_pixel++ = rgb_lookup_colortable(&lookup_ctx,
                                                          get_field(src_val, src->red_shift, src->red_len),
                                                          get_field(src_val, src->green_shift, src->green_len),
                                                          get_field(src_val, src->blue_shift, src->blue_len));
@@ -4807,8 +5043,10 @@ static void blend_rect_8(const dib_info *dst, const dib_info *src, LONG diff_x,
                          const struct clipped_rects *clipped_rects, BLENDFUNCTION blend)
 {
     const RGBQUAD *color_table = get_dib_color_table( dst );
+    struct rgb_lookup_colortable_ctx lookup_ctx;
     int i, x, y;
 
+    rgb_lookup_colortable_init( dst, &lookup_ctx );
     for (i = 0; i < clipped_rects->count; i++)
     {
         const RECT *rc = &clipped_rects->rects[i];
@@ -4821,7 +5059,7 @@ static void blend_rect_8(const dib_info *dst, const dib_info *src, LONG diff_x,
             {
                 RGBQUAD rgb = color_table[dst_ptr[x]];
                 DWORD val = blend_rgb( rgb.rgbRed, rgb.rgbGreen, rgb.rgbBlue, src_ptr[x], blend );
-                dst_ptr[x] = rgb_lookup_colortable( dst, val >> 16, val >> 8, val );
+                dst_ptr[x] = rgb_lookup_colortable( &lookup_ctx, val >> 16, val >> 8, val );
             }
         }
     }
@@ -4831,8 +5069,10 @@ static void blend_rect_4(const dib_info *dst, const dib_info *src, LONG diff_x,
                          const struct clipped_rects *clipped_rects, BLENDFUNCTION blend)
 {
     const RGBQUAD *color_table = get_dib_color_table( dst );
+    struct rgb_lookup_colortable_ctx lookup_ctx;
     int i, j, x, y;
 
+    rgb_lookup_colortable_init( dst, &lookup_ctx );
     for (i = 0; i < clipped_rects->count; i++)
     {
         const RECT *rc = &clipped_rects->rects[i];
@@ -4846,7 +5086,7 @@ static void blend_rect_4(const dib_info *dst, const dib_info *src, LONG diff_x,
                 DWORD val = ((x & 1) ? dst_ptr[x / 2] : (dst_ptr[x / 2] >> 4)) & 0x0f;
                 RGBQUAD rgb = color_table[val];
                 val = blend_rgb( rgb.rgbRed, rgb.rgbGreen, rgb.rgbBlue, src_ptr[j], blend );
-                val = rgb_lookup_colortable( dst, val >> 16, val >> 8, val );
+                val = rgb_lookup_colortable( &lookup_ctx, val >> 16, val >> 8, val );
                 if (x & 1)
                     dst_ptr[x / 2] = val | (dst_ptr[x / 2] & 0xf0);
                 else
-- 
2.30.0




More information about the wine-devel mailing list