[PATCH 2/2] gdi32: Generate and use a lookup cache when looking up RGB values for a color table.

Gabriel Ivăncescu gabrielopcode at gmail.com
Thu Apr 1 08:07:00 CDT 2021


Hi Huw,

Thanks for the review!

On 01/04/2021 10:52, Huw Davies wrote:
> On Wed, Mar 31, 2021 at 03:35:58PM +0300, Gabriel Ivăncescu wrote:
>> This vastly improves the performance. The cache generation is relatively
>> constant in terms of algorithm complexity, around O(n) with some extra
>> overhead for the entire cache (depending on the color table's colors, because
>> they get "blocked" soon enough), which is fixed at 32768 entries. It scales
>> well with large amount of colors in the color table. The lookup after that
>> should be fast.
>>
>> In contrast, the current method is O(N * M) where N is the amount of pixels
>> and M is the number of colors in the table, which is very slow for larger M
>> (especially 256 colors).
>>
>> More detailed information in the bug report.
>>
>> Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=50898
>> Signed-off-by: Gabriel Ivăncescu <gabrielopcode at gmail.com>
>> ---
>>   dlls/gdi32/dibdrv/primitives.c | 272 +++++++++++++++++++++++++++++++--
>>   1 file changed, 256 insertions(+), 16 deletions(-)
>>
>> diff --git a/dlls/gdi32/dibdrv/primitives.c b/dlls/gdi32/dibdrv/primitives.c
>> index 569baae..cc86ed9 100644
>> --- a/dlls/gdi32/dibdrv/primitives.c
>> +++ b/dlls/gdi32/dibdrv/primitives.c
>> @@ -3497,22 +3497,255 @@ static void convert_to_16(dib_info *dst, const dib_info *src, const RECT *src_re
>>       }
>>   }
>>   
>> -static inline BOOL color_tables_match(const dib_info *d1, const dib_info *d2)
>> +/*
>> + * To lookup RGB values into nearest color in the color table, Windows uses 5-bits of the RGB
>> + * at the "center" of the RGB cube, presumably to do a similar lookup cache. The lowest 3 bits
>> + * of the color are thus set to halfway (0x04) and then it's used in the distance calculation
>> + * to the exact color in the color table. We exploit this as well to create a lookup cache.
>> + *
>> + * Generating the table is done by going "outwards" from the center of each color in the table.
>> + * First, we find out the center RGB cube spot into the map for each color in the color table.
>> + * In case of conflict, we calculate the *true* distance, since color table is not quantized.
>> + * Each color in the color table should then have an associated center in the lookup cache map.
>> + *
>> + * Next, we go "outwards" from each center by increasing an offset for the main axis. This is
>> + * always expressed as an offset from the center, and starts from 1 and goes up. Each offset
>> + * has multiple generations. A generation defines the distance from the center, and is always
>> + * increasing. Thus when we're at a generation, we go through *all* colors in the color table
>> + * first, fill the entire generation for all of them, then move to the next generation. Since
>> + * the next generation will always have a larger distance than the current one, we know that
>> + * once we filled all the previous generations, we're done with them and don't need to check.
>> + *
>> + * Note that an offset spans multiple generations, due to how distances work in 3 dimensions.
>> + * For example, an offset of 1 is simple and has 3 generations. The examples below are all
>> + * offsets from the center cube, e.g. (0,-1,1) means +1 for first axis, -1 for the second.
>> + *
>> + * First generation (offset = 1):
>> + *   ( 0, 0, 1) ( 0, 0,-1) ( 0, 1, 0) ( 0,-1, 0) ( 1, 0, 0) (-1, 0, 0)
>> + *
>> + * In the first generation, the distance is always 1. For each color in the color table, we
>> + * go through all of the above displacements from the center, and fill the lookup cache map.
>> + *
>> + * Second generation (offset = 1 still, but distance is larger now; some are dups, that's OK):
>> + *   ( 0, 1, 1) ( 0,-1, 1) ( 1, 0, 1) (-1, 0, 1) ( 0, 1,-1) ( 0,-1,-1) ( 1, 0,-1) (-1, 0,-1)
>> + *   ( 0, 1, 1) ( 0, 1,-1) ( 1, 1, 0) (-1, 1, 0) ( 0,-1, 1) ( 0,-1,-1) ( 1,-1, 0) (-1,-1, 0)
>> + *   ( 1, 0, 1) ( 1, 0,-1) ( 1, 1, 0) ( 1,-1, 0) (-1, 0, 1) (-1, 0,-1) (-1, 1, 0) (-1,-1, 0)
>> + *
>> + * On the 2nd line, we swapped the main axis from (x,x,1) to (x,1,x), and on 3rd to (1,x,x).
>> + * Note that the distance is always the same (sqrt(2)), and we just permute the axis values.
>> + *
>> + * Third generation (offset = 1 still, but distance is even larger: sqrt(3)):
>> + *   ( 1, 1, 1) ( 1,-1, 1) (-1, 1, 1) (-1,-1, 1) ( 1, 1,-1) ( 1,-1,-1) (-1, 1,-1) (-1,-1,-1)
>> + *   ...
>> + *
>> + * Only after all of the coordinate absolute values are equal to the offset do we increase it.
>> + * When the offset is increased, the main axis' absolute value is increased, and the distance.
>> + * For other offsets, such as offset = 3, generations start with zeros for the other axis, for
>> + * example (0,0,3). The next generation will increase one axis by 1, e.g. (0,1,3) and (1,0,3)
>> + * which are part of the same generation (same distance). The next generation will increase the
>> + * third axis, e.g. (1,1,3). Then the second axis is raised again, and the process repeats for
>> + * the third axis until it's equal to the second axis, e.g. (0,2,3)->(1,2,3)->(2,2,3).
>> + *
>> + * As each generation is increased, the lowest value in the axis is increased, and the other
>> + * starts from zero again. Then they are swapped and permuted. This is due to how the distance
>> + * is calculated in 3 dimensions. For each generation we fill *each* color in the color table's
>> + * displacement in all directions, since they have the same distance.
>> + *
>> + * Lastly, we keep track of each of the six directions (in 3 dimensions) and if one direction
>> + * was not filled at all in the current offset, we mark it as "blocked" and we no longer check
>> + * for it the next offset. When all six directions are blocked, the entry is no longer checked.
>> + *
>> + * The lookup cache map is completely filled when *all* color table entries are blocked.
>> +*/
> 
> Perhaps I'm misunderstanding your algorithm, but doesn't this end up with
> (2,2,2) coming before (3,0,0)?
> 

Actually yeah, you're right, I messed up somewhere when testing with the 
quantized cubes. I'll need to revamp this at some point, but for now 
I'll leave it aside. I'm thinking of going with something else easier 
(see below), if it's acceptable.

>> +static void rgb_lookup_colortable_init(const dib_info *dib, struct rgb_lookup_colortable_ctx *ctx)
>>   {
>> -    /* Windows reduces precision to 5 bits, probably in order to build some sort of lookup cache */
>> -    return rgb_to_pixel_colortable( dst, (r & ~7) + 4, (g & ~7) + 4, (b & ~7) + 4 );
>> +    BYTE indices[256], available_directions[256], tmp_directions[256], map_set_bits[ARRAY_SIZE(ctx->map) / 8];
>> +    unsigned color_table_size = dib->color_table ? dib->color_table_size : 1 << dib->bit_count;
>> +    const RGBQUAD *color_table = get_dib_color_table(dib);
>> +    unsigned idx, offset, num_entries;
>> +    int i, j;
>> +
>> +    /* Testing shows that for low amount of colors, the overhead is larger than the
>> +       O(N*M) algorithm, presumably due to branch prediction and cache locality, but
>> +       it gets quickly out of hand (up to 5x slower) for larger amount of colors... */
>> +    if (color_table_size <= 40)
>> +    {
> 
> As a side note, there are going to be a vanishingly small number of
> colour tables with sizes [17,255], so if you need to special case,
> special case for sizes <= 16.
> 
>> +        unsigned r, g, b;
>> +
>> +        for (b = 4; b < 256; b += 1 << 3)
>> +            for (g = 4; g < 256; g += 1 << 3)
>> +                for (r = 4; r < 256; r += 1 << 3)
>> +                    ctx->map[r >> 3 | (g & ~7) << 2 | (b & ~7) << 7] = rgb_to_pixel_colortable(dib, r, g, b);
>> +        return;
>> +    }
>> +
>> +    memset(map_set_bits, 0, sizeof(map_set_bits));
>> +    memset(available_directions, 0xff, color_table_size);
>> +    memset(tmp_directions, 0, color_table_size);
>> +
>> +    /* indirect list of valid color table indices, as we remove those fully surrounded */
>> +    for (idx = 0; idx < color_table_size; idx++)
>> +        indices[idx] = idx;
>> +
>> +    /* first, fill the centers (offset = 0) of each quantized table color in the map */
>> +    for (idx = 0; idx < color_table_size; idx++)
>> +    {
>> +        int dr1, dg1, db1, dr2, dg2, db2, pos_r, pos_g, pos_b;
>> +        unsigned pos = (color_table[idx].rgbRed >> 3) |
>> +                       (color_table[idx].rgbGreen & ~7) << 2 |
>> +                       (color_table[idx].rgbBlue & ~7) << 7;
>> +
>> +        if (map_set_bits[pos / 8] & (1 << pos % 8))
>> +        {
>> +            pos_r = (color_table[idx].rgbRed & ~7) | 4;
>> +            pos_g = (color_table[idx].rgbGreen & ~7) | 4;
>> +            pos_b = (color_table[idx].rgbBlue & ~7) | 4;
>> +            dr1 = pos_r - color_table[idx].rgbRed;
>> +            dg1 = pos_g - color_table[idx].rgbGreen;
>> +            db1 = pos_b - color_table[idx].rgbBlue;
>> +            dr2 = pos_r - color_table[ctx->map[pos]].rgbRed;
>> +            dg2 = pos_g - color_table[ctx->map[pos]].rgbGreen;
>> +            db2 = pos_b - color_table[ctx->map[pos]].rgbBlue;
>> +
>> +            if (dr1*dr1 + dg1*dg1 + db1*db1 >= dr2*dr2 + dg2*dg2 + db2*db2)
>> +                continue;
>> +        }
>> +        else
>> +            map_set_bits[pos / 8] |= 1 << pos % 8;
>> +
>> +        ctx->map[pos] = idx;
>> +    }
>> +
>> +    /* now do the rest */
>> +    for (offset = 1, num_entries = color_table_size; num_entries != 0; offset++)
>> +    {
>> +        for (i = 0; i <= offset; i++)
>> +        {
>> +            for (j = 0; j <= i; j++)
>> +            {
>> +                /* we're at one generation now, go through each color in the color table */
>> +                for (idx = 0; idx < num_entries; idx++)
>> +                {
>> +                    unsigned direction_mask = 1, color_table_index = indices[idx];
>> +                    unsigned shift2 = 5, shift3 = 10;
>> +                    int center, main_axis = offset;
>> +
>> +                    center = (color_table[color_table_index].rgbRed >> 3) |
>> +                             (color_table[color_table_index].rgbGreen & ~7) << 2 |
>> +                             (color_table[color_table_index].rgbBlue & ~7) << 7;
>> +                    do
>> +                    {
>> +                        do
>> +                        {
>> +                            if (available_directions[idx] & direction_mask)
>> +                            {
>> +                                BOOL direction_blocked = TRUE;
>> +                                do
>> +                                {
>> +                                    unsigned shift_tmp;
>> +                                    do
>> +                                    {
>> +                                        do
>> +                                        {
>> +                                            unsigned pos = center + main_axis + (i << shift2) + (j << shift3);
>> +
>> +                                            if (rgb_lookup_colortable_set_pos(ctx, pos, color_table, color_table_index, map_set_bits))
>> +                                                direction_blocked = FALSE;
>> +                                            j = -j;
>> +                                        } while (j < 0);
>> +                                        i = -i;
>> +                                    } while (i < 0);
>> +
>> +                                    /* repeat once and swap the two non-main axis */
>> +                                    shift_tmp = shift2; shift2 = shift3; shift3 = shift_tmp;
>> +                                } while (shift2 > shift3 && i != j);
>> +
>> +                                if (!direction_blocked)
>> +                                    tmp_directions[idx] |= direction_mask;
>> +                            }
>> +                            direction_mask <<= 1;
>> +                            main_axis = -main_axis;
>> +                        } while (main_axis < 0);
>> +
>> +                        /* change main to next axis */
>> +                        shift2 = 0;
>> +                        shift3 = (main_axis < 32) ? 10 : 5;
>> +                        main_axis <<= 5;
>> +                    } while (main_axis < 32768);
>> +                }
>> +            }
>> +        }
> 
> I suspect things might become clearer if one were to unwind some of these loops.
> 
> Did you consider using a lazy initialization of the lookup table instead?
> 
> Huw.
> 

Lazy initialization meaning build it as we scan the image? That would 
work too. I also thought of caching the color tables globally.

Now I don't know if that's acceptable, but here's my idea since it's 
what is good in "practice":

Hold a small fixed size global cache (4-5 entries) of color tables, with 
synchronization in case of multi-threading (just for correctness, I 
doubt any app will actually multi-thread this). If we go the lazy 
initialization approach, we'll also keep track which of the entries were 
not filled yet in the cache.

It would be a global table for all color tables, but since I expect all 
the performance problems to be using the same color table over and over 
again in a loop, it would work "in practice".

When we have to discard/replace one, we choose the oldest one that 
wasn't accessed (we keep some timestamp on each cache). Does that sound 
feasible?

Maybe if I end up using this global cache, I can discard the lazy init 
since it complicates things, do you think that's better?

Thanks,
Gabriel



More information about the wine-devel mailing list