[Bug 12071] New: MSI SQL joins on tables with many rows are extremely slow

wine-bugs at winehq.org wine-bugs at winehq.org
Mon Mar 17 00:30:51 CDT 2008


http://bugs.winehq.org/show_bug.cgi?id=12071

           Summary: MSI SQL joins on tables with many rows are extremely
                    slow
           Product: Wine
           Version: 0.9.57.
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: msi
        AssignedTo: truiken at gmail.com
        ReportedBy: truiken at gmail.com
                CC: ead1234 at hotmail.com


With the current implementation of MSI SQL joins, we perform a Cartesian
product on the tables joined together.  In the case where there is no WHERE
clause, this is the correct output, but if there is a WHERE clause, we still
perform the Cartesian product and then filter on the WHERE clause.  Though it's
uncommon, there are a few installers (Visual Studio, Nero Express 7) that join
tables with thousands of rows.  For example, say we have tables A, B, and C
s.t.

colA   colB   colC
-----   -----   -----
1       1       1
2       2       2
...     ...     ...
1000   1000    1000


and we have the query

SELECT * FROM A, B, C WHERE `colA`=1 AND `colB`=1 AND `colC`=1

then the current implementation will create a new table with those columns
containing 1000*1000*1000=1 billion rows.  Then we check each of the 1 billion
rows for any matches.  There are several algorithms for optimizing joins:

http://en.wikipedia.org/wiki/Join_(SQL)#Join_algorithms

The solution I'll be working on is the merge join.  This solution parses the
WHERE clause,  starting with the two tables that the columns of the clause
belong to (colA -> A, colB -> B, etc) and joins those together, making sure to
eliminate rows based on the WHERE clause.  Then the next table in the clause is
merged into the previously created table, eliminating rows.  This continues
until there are no tables left.  One optimization of this solution is to start
with parts of the WHERE clause that compare against literals.  For example, the
query:

SELECT * FROM A, B, C WHERE `colA` = `colB` AND `colB` = 1 AND `colC` = 1

We'd start with "`colB` = 1" since the comparison is against a literal.  We
perform 1000 comparisons on the rows of table B (count = 1000).  The resulting
table is:

colB
----
1

then we do the same for table C (count = 1000), and the merged table is:

colB   colC
----   ----
1      1

next we merge this table with table A:

colA   colB   colC
----   ----   ----
1      1      1
2      1      1
...    ...    ...
1000   1      1

and we search through this table for the condition `colA` = `colB` (count =
1000).  So we've reduced billions of comparisons to 3000.

Unfortunately, the way our SQL engine is implemented, this will not be an easy
task.


-- 
Configure bugmail: http://bugs.winehq.org/userprefs.cgi?tab=email
Do not reply to this email, post in Bugzilla using the
above URL to reply.
------- You are receiving this mail because: -------
You are watching all bug changes.



More information about the wine-bugs mailing list