How to Design Indexes, Really

By: Zend


Uploaded on 11/06/2012

Speaker: Bill Karwin, Karwin Software Solutions

MySQL users commonly ask: Here's my table, what indexes do I need? Why aren't my indexes helping me? Don't indexes cause overhead? This talk gives you some practical answers, with a step-by-step method for finding the queries you need to optimize and choosing the best indexes for them.

Comments (13):

By anonymous    2017-09-20

You're right, the index on (year, studioName) is the best. It allows a fast lookup when you have conditions on both columns.

If you have a second query that only searches for year, but not for studioName, then the same index would also help.

But if you only had an index on (studioName, year), and you search only for a specific year, the index would not help.

Think of a telephone book. You can search for people by last name because that's how the book is sorted. You can also search for people by last and first name, and the book helps even more.

But if you search only for people with a given first name, the sort order of the book isn't much help.
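A minimal sketch of that index and the queries it serves (the table name Movies is an assumption; substitute your own):

CREATE INDEX idx_year_studio ON Movies (year, studioName);

-- uses both columns of the index
SELECT * FROM Movies WHERE year = 1994 AND studioName = 'Disney';

-- still uses the leading year column
SELECT * FROM Movies WHERE year = 1994;

-- an index on (studioName, year) would not help a search on year alone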

You might like my presentation, How to Design Indexes, Really, or the video of me presenting it: https://www.youtube.com/watch?v=ELR7-RdU9XU

The presentation is tailored for MySQL, but many of the ideas apply to any database that implements indexes using B-tree structures.

Original Thread

By anonymous    2017-09-20

Change the primary key to (currency, unix_date). Since you have an equality condition for currency and a range condition for unix_date, you should put the column with the equality condition first. Then both the range condition on unix_date and the ORDER BY can use the primary key order.

Apply the condition to unix_date, not create_date, to get it to use the primary key index.
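One way to make that change (a sketch; it assumes no other table's foreign key references the current primary key):

ALTER TABLE price_data
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (currency, unix_date);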

You have to use a derived table subquery, but you don't have to use two nested levels of subqueries.

SELECT row_num, unix_date, price
FROM (
    SELECT @row := @row + 1 AS row_num, unix_date, price
    FROM (SELECT @row := 0) AS _init
    CROSS JOIN price_data
    WHERE currency = 'USD' 
     AND unix_date BETWEEN UNIX_TIMESTAMP('2017-03-26 00:00:00') 
                       AND UNIX_TIMESTAMP('2017-06-26 23:59:59')
    ORDER BY unix_date DESC
) AS t
WHERE MOD(row_num, 288) = 1

You should learn to use EXPLAIN to help you analyze index usage.

You might also like my presentation How to Design Indexes, Really, and the video: https://www.youtube.com/watch?v=ELR7-RdU9XU

MySQL 8.0 should have windowing functions, so look for that next year sometime.
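Once 8.0 is available, the same row numbering could be done with ROW_NUMBER() instead of user variables; a sketch based on the query above:

SELECT unix_date, price
FROM (
    SELECT unix_date, price,
           ROW_NUMBER() OVER (ORDER BY unix_date DESC) AS row_num
    FROM price_data
    WHERE currency = 'USD'
      AND unix_date BETWEEN UNIX_TIMESTAMP('2017-03-26 00:00:00')
                        AND UNIX_TIMESTAMP('2017-06-26 23:59:59')
) AS t
WHERE MOD(row_num, 288) = 1;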

Original Thread

By anonymous    2017-09-20

MySQL 5.6 introduced some optimizer improvements specifically for long lists of ids in an IN() predicate. Read https://dev.mysql.com/doc/refman/5.6/en/range-optimization.html#equality-range-optimization

That optimization is relevant if you are relying on an index for the IN() search. If you narrow down the search first as in your query 2a, the index will be used for the inequality expression, but not for the IN() predicate.

In general, when you have multiple search terms, only one range predicate will be optimized with an index. A range predicate is anything other than =.

So in your 2a example, the BETWEEN predicate will use the index (I believe the pair of inequality expressions you used will be optimized as if it is one BETWEEN predicate), and then the IN() predicate will just search linearly through the result of the first index scan.

In your 2b example, trying to narrow down the search with MONTH(created) cannot use an index at all. Think of looking up people in a phone book—if I ask you to find everyone whose middle initial is "J." then the fact that the phone book is sorted by last name doesn't help you.
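To make that concrete, here's a sketch with a made-up table and index; the names are illustrations, not from your schema:

ALTER TABLE orders ADD INDEX (created, status);

-- the BETWEEN range on created can use the index;
-- the IN() list on status is then filtered from the rows that range scan finds
SELECT * FROM orders
WHERE created BETWEEN '2017-01-01' AND '2017-01-31'
  AND status IN ('new', 'shipped');

-- wrapping the column in MONTH() prevents any index on created from being used
SELECT * FROM orders WHERE MONTH(created) = 1;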

You might like to view my presentation How to Design Indexes, Really, or the video: https://www.youtube.com/watch?v=ELR7-RdU9XU

Original Thread

By anonymous    2017-09-23

+1 to the answer from @JuanCarlosOropeza, but you can go a little further with the index.

ALTER TABLE transactions ADD INDEX (
  transaction_type,
  transaction_status,
  created_at,
  transaction_currency,
  transaction_total
);

As @RickJames mentioned in comments, the order of columns is important.

  • First, columns in equality comparisons
  • Next, you can index one column that is used for a range comparison (which is anything besides equality), or GROUP BY or ORDER BY. You have both range comparison and GROUP BY, but you can only get the index to help with one of these.
  • Last, other columns needed for the query, if you think you can get a covering index.

I describe more detail about index design in my presentation How to Design Indexes, Really (video: https://www.youtube.com/watch?v=ELR7-RdU9XU).

You're probably stuck with the "using temporary" since you have a range condition and also a GROUP BY referencing different columns. But you can at least eliminate the "using filesort" by this trick:

...
GROUP BY 
    transaction_currency
ORDER BY NULL

That assumes it's not important to you what order the query results are returned in.

Original Thread

By anonymous    2017-10-15

You might like my presentation, [How to Design Indexes, Really](https://www.slideshare.net/billkarwin/how-to-design-indexes-really), which describes the benefit of additional columns in an index. Also the video of me presenting it: https://www.youtube.com/watch?v=ELR7-RdU9XU

Original Thread

By anonymous    2017-10-22

First of all, you have no indexes defined for your transaction table. So any searches are forced to do a table-scan. It's important to use indexes to speed up searches.

You might like my presentation How to Design Indexes, Really. Or the video: https://www.youtube.com/watch?v=ELR7-RdU9XU

This is probably the most important index for you to create:

ALTER TABLE tbl_transaction ADD INDEX (date_create);

But even after you have an index, when you try to search on an expression like the following, it can't use the index.

date(t.date_create) BETWEEN ? AND ?

You have to rewrite it so the indexed column is alone:

t.date_create BETWEEN ? AND ?

But this means you want the first parameter to be "floored" to time 00:00:00 and the second parameter to be "ceilinged" to time 23:59:59, to make sure the date_create with any time component matches the date range you want it to match.

if( strlen($search['date_from']) > 0  && strlen($search['date_to']) > 0  ){
    $from = $search['date_from'] . ' 00:00:00';
    $to   = $search['date_to']   . ' 23:59:59';

    $sQuery .= ' AND t.date_create BETWEEN ? AND ?';
    array_push($fill,$from);
    array_push($fill,$to);
}

Likewise, even when searching for one specific date, you still have to use a range:

else if( strlen($search['date_from']) > 0 ){
    $from = $search['date_from'] . ' 00:00:00';
    $to   = $search['date_from'] . ' 23:59:59';

    $sQuery .= ' AND t.date_create BETWEEN ? AND ?';
    array_push($fill,$from);
    array_push($fill,$to);
}

If you read my presentation on designing indexes, you will have noticed that a range condition like the one on your date_create means any subsequent columns in the index won't help the search. You can use multi-column indexes, but all columns involved in equality conditions must be to the left in the index.

Also, you have multiple conditions that may or may not be present, based on the user's search criteria. This is also hard to optimize, because an indexed search only uses the columns of the index left-to-right. If you skip a column because the search doesn't need it, the subsequent columns won't help.
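A sketch of that rule (the status column here is hypothetical; use whichever column your search always filters with an equality condition):

ALTER TABLE tbl_transaction ADD INDEX (status, date_create);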

Another type of condition in your search is text pattern matching with LIKE '%...%' wildcards, instead of equality conditions. These LIKE searches can't use a conventional index, but they can use a special fulltext search index. I have a presentation for that too: Full Text Search Throwdown. Video: https://www.youtube.com/watch?v=V8yA8C3CZOc
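A minimal sketch of that, assuming a hypothetical text column named notes (InnoDB supports fulltext indexes as of MySQL 5.6):

ALTER TABLE tbl_transaction ADD FULLTEXT INDEX ft_notes (notes);

SELECT * FROM tbl_transaction
WHERE MATCH(notes) AGAINST ('refund' IN NATURAL LANGUAGE MODE);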

Original Thread

By anonymous    2018-02-18

Based on your EXPLAIN report, I see it says "type: ALL" which means it's scanning all the rows (the whole table) for every query.

You need an index to help it scan fewer rows.

Your first condition for parent_id=X is an obvious choice. You should create an index starting with parent_id.

The other condition on timestamp >= ... is probably the best second choice. Your index should include timestamp as the second column.

You can create this index this way:

ALTER TABLE host_monitoring_data ADD INDEX (parent_id, timestamp);
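With that index in place, a query shaped like the conditions you describe can use it for both columns (the literal values below are just placeholders):

SELECT *
FROM host_monitoring_data
WHERE parent_id = 42
  AND timestamp >= '2018-02-17 00:00:00';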

You might like my presentation How to Design Indexes, Really and a video of me presenting it: https://www.youtube.com/watch?v=ELR7-RdU9XU

P.S.: Please when you ask questions about query optimization, run SHOW CREATE TABLE <tablename> and include its output in your question. This shows us your columns, data types, current indexes, and constraints. Don't make us guess! Help us help you!

Original Thread

By anonymous    2018-02-26

The IN() predicate counts as a range condition, which means that no subsequent columns in the index will help with filtering or sorting.

In other words, if your foo_index is on columns (a, b, c, local_time), and your conditions are:

a = '332719'                           equality condition
b IN ('2', '-3')                       range condition
c = '1'                                equality condition
local_time > '2017-02-20 11:20:30.943' range condition
e & 1 = '1'                            not indexable

The columns a and b in the index will be useful, but the columns c and local_time can't help.

The index order (a, c, b) would be better, because the equality conditions on a and c would be on the left-most columns in the index.
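A sketch of that change, assuming the table is named foo (substitute your real table name):

ALTER TABLE foo
  DROP INDEX foo_index,
  ADD INDEX foo_index (a, c, b);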

You might like my presentation How to Design Indexes, Really, or a video of me giving that talk: https://www.youtube.com/watch?v=ELR7-RdU9XU

Original Thread

By anonymous    2018-03-05

As with all things, the answer is it depends. :-)

Most web sites you visit query something from a database on every page load. If the queries are crafted well, they look up just the data they need and avoid scanning through a big database. You might like my presentation How to Design Indexes, Really (video) to help with this.

Another strategy is to use a fast cache in RAM for data that is needed frequently. RAM is thousands of times faster than disk drives. You might like to familiarize yourself with the Numbers Everyone Should Know. Those numbers are just examples, but the intention is to get programmers thinking about the fact that moving data around has a very different cost depending on whether it goes through RAM, disk, network, or CPU.

P.S.: Please don't buy into the myth that you're not good at computers because you're a woman. Everyone starts out as a novice, no matter what their gender or background. Only through practice and study do any of us learn this stuff. I recommend seeing Hidden Figures, the story of the women who did pioneering math and programming for NASA.

Another notable woman is Margaret Hamilton, who practically invented the profession of "software engineering."

Original Thread

By anonymous    2018-04-02

You show the CREATE TABLE in your post, which is good, but you don't mention any other query analysis. When you're investigating query optimization, you should also look at the EXPLAIN output.

I tried testing EXPLAIN at least for your subquery. By the way, column pop is mentioned in your index but does not appear in your table, so you haven't posted the real CREATE TABLE.

I got this:

mysql> EXPLAIN SELECT m.time, sum(m.rtt*m.reqs)/sum(m.reqs) AS weighted_rtt, 
sum(m.util*m.reqs)/sum(m.reqs) AS weighted_util FROM metrics AS m 
WHERE m.asn = '7018' and m.cty = 'us' GROUP BY m.time\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: m
         type: ref
possible_keys: PRIMARY,asn,bk1
          key: asn
      key_len: 6
          ref: const,const
         rows: 1
        Extra: Using index condition; Using where; Using temporary; Using filesort

Notice that only the first two columns of your asn index are used, as indicated by const,const. Also the Using temporary; Using filesort often indicates a costly overhead for a query.

I got a better result when I added an index:

mysql> alter table metrics add index bk1 (asn,cty,time);

I had to use an index hint to persuade the MySQL optimizer to use my index. This may be necessary only because I had no rows of data in my table, so the optimizer wasn't able to analyze which index would be better.

mysql> EXPLAIN SELECT m.time, sum(m.rtt*m.reqs)/sum(m.reqs) AS weighted_rtt, 
sum(m.util*m.reqs)/sum(m.reqs) AS weighted_util FROM metrics AS m use index(bk1) 
WHERE m.asn = '7018' and m.cty = 'us' GROUP BY m.time\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: m
         type: ref
possible_keys: PRIMARY,asn,bk1
          key: bk1
      key_len: 6
          ref: const,const
         rows: 1
        Extra: Using index condition; Using where

The temp table / filesort is gone. This is because the GROUP BY can execute in index order once I put the time column following the two columns used for filtering.

Finally I tried to create an index that included all the columns referenced in the subquery:

mysql> alter table metrics add index bk2 (asn,cty,time,rtt,reqs,util);

mysql> EXPLAIN SELECT m.time, sum(m.rtt*m.reqs)/sum(m.reqs) AS weighted_rtt, 
sum(m.util*m.reqs)/sum(m.reqs) AS weighted_util FROM metrics AS m use index(bk2) 
WHERE m.asn = '7018' and m.cty = 'us' GROUP BY m.time\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: m
         type: ref
possible_keys: PRIMARY,asn,bk1,bk2
          key: bk2
      key_len: 6
          ref: const,const
         rows: 1
        Extra: Using where; Using index

The Using index is a good sign. This is called a "covering index" which means the query was able to get all the columns it needed just by reading the index, without having to read the table at all. This is a useful technique.

You might like my presentation How to Design Indexes, Really, or the YouTube video.

You mention that you can't change the MySQL configuration options, but you don't say what the options are. One of the important options is the InnoDB buffer pool size. Without a sufficiently sized buffer pool, your query will force a lot of I/O as it swaps pages of indexes into RAM and back out again.
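If you can at least view the server configuration, this shows the current setting (in bytes):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';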

I have no experience with the MariaDB column store, so I can't comment on its features, or how to monitor or tune it. You might want to engage with MariaDB services.

I agree with James Scheller's answer that pre-aggregating partial results and storing them is important, and may be the only way to address this problem. Some column stores I've read about do this automatically, pre-calculating various aggregate results for each partition. I have no idea what the MariaDB column store does.

Original Thread
