CHOOSE on INNER JOIN take 9 hours (and count) to complete

advertisements

I'm using a sqlite database I have from the output of another script. I have a query that is taking a huge amount of time to complete. The samples table and multiclass table both have the same ~4,000,000 name's. The multiclass table has one row for each name (4 million rows), and the sample table could have one or many rows for each name (>100 million rows). I am joining on the names and summing the count grouped by the tax_id, day, and sample that names belong to. This query should return ~25000 rows

Here is a toy version of the schema and query I'm using:

SQL Fiddle

SQLite (SQL.js) Schema Setup:

CREATE TABLE samples
    (
     name varchar(20),
     day integer,
     sample integer,
     count integer
    );

CREATE TABLE multiclass
    (
     name varchar(20),
     tax_id varchar(20),
     details varchar(30)
    );

INSERT INTO samples
(name, day, sample, count)
VALUES
('seq1', 204, 37, 50),
('seq2', 205, 37, 50),
('seq2', 206, 37, 50),
('seq3', 204, 37, 50),
('seq4', 205, 37, 50),
('seq4', 206, 37, 50);

INSERT INTO multiclass
(name, tax_id, details)
VALUES
('seq1', 'Vibrio', 'unimportant'),
('seq2', 'Shewenella', 'still_unimportant'),
('seq3', 'Vibrio', 'also_unimportant'),
('seq4', 'Shewenella', 'doesntmatter');

Query 1:

SELECT tax_id, day, sample, SUM(count)
FROM samples INNER JOIN multiclass USING(name)
GROUP BY tax_id, day, sample
ORDER BY day, sample;

Results:

|     tax_id | day | sample | SUM(count) |
|------------|-----|--------|------------|
|     Vibrio | 204 |     37 |        100 |
| Shewenella | 205 |     37 |        100 |
| Shewenella | 206 |     37 |        100 |

I am very new to SQL and am not sure how to proceed. This is a query I would only need to execute once. so I'm not sure adding indexes to the table is appropriate.

Is there a different way to construct the query to make it run faster? Would adding indexes make sense or take too long? If it is taking 9 hours, is it likely to still be hung up on the SQL, or is something else going wrong?

Edit: updated question to include database schema and intended results. I am currently building indexes on the samples.name column, it's been running for over 4 hours (using a node on a cluster environment with 60 Gigs of ram and many cpus).


This query:

SELECT tax_id, day, sample, SUM(count)
FROM samples INNER JOIN
     multiclass
     ON samples.name = multiclass.name
GROUP BY tax_id, day, sample
ORDER BY day, sample;

is pretty simple. An index on either samples(name) or multiclass(name) would normally be recommended.

However, there is a hint in your question that both tables contain 4 million rows, but you are only expecting 25,000. I suspect that you have duplicate names in each table. To determine the number of intermediate rows generated by the join, run this query:

select sum(s.cnt * m.cnt), max(s.cnt * m.cnt)
from (select name, count(*) as cnt from samples group by name
     ) s join
     (select name, count(*) as cnt from multiclass group by name
     ) m
     on s.name = m.name;

I am guessing that you will get a really large number, explaining why the query is taking so long.

Unfortunately, at this point, I don't have real answer on how to solve the problem, because your question doesn't specify what you actually want the query to produce. However, aggregating the tables before joining them is likely to be one possible solution.