[PATCH 2/2] gdi32: Generate and use a lookup cache when looking up RGB values for a color table.

Huw Davies huw at codeweavers.com
Thu Apr 1 02:52:05 CDT 2021


On Wed, Mar 31, 2021 at 03:35:58PM +0300, Gabriel Ivăncescu wrote:
> This vastly improves the performance. The cache generation is relatively
> constant in terms of algorithm complexity, around O(n) with some extra
> overhead for the entire cache (depending on the color table's colors, because
> they get "blocked" soon enough), which is fixed at 32768 entries. It scales
> well with large amount of colors in the color table. The lookup after that
> should be fast.
> 
> In contrast, the current method is O(N * M) where N is the amount of pixels
> and M is the number of colors in the table, which is very slow for larger M
> (especially 256 colors).
> 
> More detailed information in the bug report.
> 
> Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=50898
> Signed-off-by: Gabriel Ivăncescu <gabrielopcode at gmail.com>
> ---
>  dlls/gdi32/dibdrv/primitives.c | 272 +++++++++++++++++++++++++++++++--
>  1 file changed, 256 insertions(+), 16 deletions(-)
> 
> diff --git a/dlls/gdi32/dibdrv/primitives.c b/dlls/gdi32/dibdrv/primitives.c
> index 569baae..cc86ed9 100644
> --- a/dlls/gdi32/dibdrv/primitives.c
> +++ b/dlls/gdi32/dibdrv/primitives.c
> @@ -3497,22 +3497,255 @@ static void convert_to_16(dib_info *dst, const dib_info *src, const RECT *src_re
>      }
>  }
>  
> -static inline BOOL color_tables_match(const dib_info *d1, const dib_info *d2)
> +/*
> + * To lookup RGB values into nearest color in the color table, Windows uses 5-bits of the RGB
> + * at the "center" of the RGB cube, presumably to do a similar lookup cache. The lowest 3 bits
> + * of the color are thus set to halfway (0x04) and then it's used in the distance calculation
> + * to the exact color in the color table. We exploit this as well to create a lookup cache.
> + *
> + * Generating the table is done by going "outwards" from the center of each color in the table.
> + * First, we find out the center RGB cube spot into the map for each color in the color table.
> + * In case of conflict, we calculate the *true* distance, since color table is not quantized.
> + * Each color in the color table should then have an associated center in the lookup cache map.
> + *
> + * Next, we go "outwards" from each center by increasing an offset for the main axis. This is
> + * always expressed as an offset from the center, and starts from 1 and goes up. Each offset
> + * has multiple generations. A generation defines the distance from the center, and is always
> + * increasing. Thus when we're at a generation, we go through *all* colors in the color table
> + * first, fill the entire generation for all of them, then move to the next generation. Since
> + * the next generation will always have a larger distance than the current one, we know that
> + * once we filled all the previous generations, we're done with them and don't need to check.
> + *
> + * Note that an offset spans multiple generations, due to how distances work in 3 dimensions.
> + * For example, an offset of 1 is simple and has 3 generations. The examples below are all
> + * offsets from the center cube, e.g. (0,-1,1) means +1 for first axis, -1 for the second.
> + *
> + * First generation (offset = 1):
> + *   ( 0, 0, 1) ( 0, 0,-1) ( 0, 1, 0) ( 0,-1, 0) ( 1, 0, 0) (-1, 0, 0)
> + *
> + * In the first generation, the distance is always 1. For each color in the color table, we
> + * go through all of the above displacements from the center, and fill the lookup cache map.
> + *
> + * Second generation (offset = 1 still, but distance is larger now; some are dups, that's OK):
> + *   ( 0, 1, 1) ( 0,-1, 1) ( 1, 0, 1) (-1, 0, 1) ( 0, 1,-1) ( 0,-1,-1) ( 1, 0,-1) (-1, 0,-1)
> + *   ( 0, 1, 1) ( 0, 1,-1) ( 1, 1, 0) (-1, 1, 0) ( 0,-1, 1) ( 0,-1,-1) ( 1,-1, 0) (-1,-1, 0)
> + *   ( 1, 0, 1) ( 1, 0,-1) ( 1, 1, 0) ( 1,-1, 0) (-1, 0, 1) (-1, 0,-1) (-1, 1, 0) (-1,-1, 0)
> + *
> + * On the 2nd line, we swapped the main axis from (x,x,1) to (x,1,x), and on 3rd to (1,x,x).
> + * Note that the distance is always the same (sqrt(2)), and we just permute the axis values.
> + *
> + * Third generation (offset = 1 still, but distance is even larger: sqrt(3)):
> + *   ( 1, 1, 1) ( 1,-1, 1) (-1, 1, 1) (-1,-1, 1) ( 1, 1,-1) ( 1,-1,-1) (-1, 1,-1) (-1,-1,-1)
> + *   ...
> + *
> + * Only after all of the coordinate absolute values are equal to the offset do we increase it.
> + * When the offset is increased, the main axis' absolute value is increased, and the distance.
> + * For other offsets, such as offset = 3, generations start with zeros for the other axis, for
> + * example (0,0,3). The next generation will increase one axis by 1, e.g. (0,1,3) and (1,0,3)
> + * which are part of the same generation (same distance). The next generation will increase the
> + * third axis, e.g. (1,1,3). Then the second axis is raised again, and the process repeats for
> + * the third axis until it's equal to the second axis, e.g. (0,2,3)->(1,2,3)->(2,2,3).
> + *
> + * As each generation is increased, the lowest value in the axis is increased, and the other
> + * starts from zero again. Then they are swapped and permuted. This is due to how the distance
> + * is calculated in 3 dimensions. For each generation we fill *each* color in the color table's
> + * displacement in all directions, since they have the same distance.
> + *
> + * Lastly, we keep track of each of the six directions (in 3 dimensions) and if one direction
> + * was not filled at all in the current offset, we mark it as "blocked" and we no longer check
> + * for it the next offset. When all six directions are blocked, the entry is no longer checked.
> + *
> + * The lookup cache map is completely filled when *all* color table entries are blocked.
> +*/

Perhaps I'm misunderstanding your algorithm, but doesn't this end up with
(2,2,2) coming before (3,0,0)?

> +static void rgb_lookup_colortable_init(const dib_info *dib, struct rgb_lookup_colortable_ctx *ctx)
>  {
> -    /* Windows reduces precision to 5 bits, probably in order to build some sort of lookup cache */
> -    return rgb_to_pixel_colortable( dst, (r & ~7) + 4, (g & ~7) + 4, (b & ~7) + 4 );
> +    BYTE indices[256], available_directions[256], tmp_directions[256], map_set_bits[ARRAY_SIZE(ctx->map) / 8];
> +    unsigned color_table_size = dib->color_table ? dib->color_table_size : 1 << dib->bit_count;
> +    const RGBQUAD *color_table = get_dib_color_table(dib);
> +    unsigned idx, offset, num_entries;
> +    int i, j;
> +
> +    /* Testing shows that for low amount of colors, the overhead is larger than the
> +       O(N*M) algorithm, presumably due to branch prediction and cache locality, but
> +       it gets quickly out of hand (up to 5x slower) for larger amount of colors... */
> +    if (color_table_size <= 40)
> +    {

As a side note, there are going to be a vanishingly small number of
colour tables with sizes [17,255], so if you need to special case,
special case for sizes <= 16.

> +        unsigned r, g, b;
> +
> +        for (b = 4; b < 256; b += 1 << 3)
> +            for (g = 4; g < 256; g += 1 << 3)
> +                for (r = 4; r < 256; r += 1 << 3)
> +                    ctx->map[r >> 3 | (g & ~7) << 2 | (b & ~7) << 7] = rgb_to_pixel_colortable(dib, r, g, b);
> +        return;
> +    }
> +
> +    memset(map_set_bits, 0, sizeof(map_set_bits));
> +    memset(available_directions, 0xff, color_table_size);
> +    memset(tmp_directions, 0, color_table_size);
> +
> +    /* indirect list of valid color table indices, as we remove those fully surrounded */
> +    for (idx = 0; idx < color_table_size; idx++)
> +        indices[idx] = idx;
> +
> +    /* first, fill the centers (offset = 0) of each quantized table color in the map */
> +    for (idx = 0; idx < color_table_size; idx++)
> +    {
> +        int dr1, dg1, db1, dr2, dg2, db2, pos_r, pos_g, pos_b;
> +        unsigned pos = (color_table[idx].rgbRed >> 3) |
> +                       (color_table[idx].rgbGreen & ~7) << 2 |
> +                       (color_table[idx].rgbBlue & ~7) << 7;
> +
> +        if (map_set_bits[pos / 8] & (1 << pos % 8))
> +        {
> +            pos_r = (color_table[idx].rgbRed & ~7) | 4;
> +            pos_g = (color_table[idx].rgbGreen & ~7) | 4;
> +            pos_b = (color_table[idx].rgbBlue & ~7) | 4;
> +            dr1 = pos_r - color_table[idx].rgbRed;
> +            dg1 = pos_g - color_table[idx].rgbGreen;
> +            db1 = pos_b - color_table[idx].rgbBlue;
> +            dr2 = pos_r - color_table[ctx->map[pos]].rgbRed;
> +            dg2 = pos_g - color_table[ctx->map[pos]].rgbGreen;
> +            db2 = pos_b - color_table[ctx->map[pos]].rgbBlue;
> +
> +            if (dr1*dr1 + dg1*dg1 + db1*db1 >= dr2*dr2 + dg2*dg2 + db2*db2)
> +                continue;
> +        }
> +        else
> +            map_set_bits[pos / 8] |= 1 << pos % 8;
> +
> +        ctx->map[pos] = idx;
> +    }
> +
> +    /* now do the rest */
> +    for (offset = 1, num_entries = color_table_size; num_entries != 0; offset++)
> +    {
> +        for (i = 0; i <= offset; i++)
> +        {
> +            for (j = 0; j <= i; j++)
> +            {
> +                /* we're at one generation now, go through each color in the color table */
> +                for (idx = 0; idx < num_entries; idx++)
> +                {
> +                    unsigned direction_mask = 1, color_table_index = indices[idx];
> +                    unsigned shift2 = 5, shift3 = 10;
> +                    int center, main_axis = offset;
> +
> +                    center = (color_table[color_table_index].rgbRed >> 3) |
> +                             (color_table[color_table_index].rgbGreen & ~7) << 2 |
> +                             (color_table[color_table_index].rgbBlue & ~7) << 7;
> +                    do
> +                    {
> +                        do
> +                        {
> +                            if (available_directions[idx] & direction_mask)
> +                            {
> +                                BOOL direction_blocked = TRUE;
> +                                do
> +                                {
> +                                    unsigned shift_tmp;
> +                                    do
> +                                    {
> +                                        do
> +                                        {
> +                                            unsigned pos = center + main_axis + (i << shift2) + (j << shift3);
> +
> +                                            if (rgb_lookup_colortable_set_pos(ctx, pos, color_table, color_table_index, map_set_bits))
> +                                                direction_blocked = FALSE;
> +                                            j = -j;
> +                                        } while (j < 0);
> +                                        i = -i;
> +                                    } while (i < 0);
> +
> +                                    /* repeat once and swap the two non-main axis */
> +                                    shift_tmp = shift2; shift2 = shift3; shift3 = shift_tmp;
> +                                } while (shift2 > shift3 && i != j);
> +
> +                                if (!direction_blocked)
> +                                    tmp_directions[idx] |= direction_mask;
> +                            }
> +                            direction_mask <<= 1;
> +                            main_axis = -main_axis;
> +                        } while (main_axis < 0);
> +
> +                        /* change main to next axis */
> +                        shift2 = 0;
> +                        shift3 = (main_axis < 32) ? 10 : 5;
> +                        main_axis <<= 5;
> +                    } while (main_axis < 32768);
> +                }
> +            }
> +        }

I suspect things might become clearer if one were to unwind some of these loops.

Did you consider using a lazy initialization of the lookup table instead?

Huw.



More information about the wine-devel mailing list