Everything MySQL

A great place to be!

There is thorough documentation about the Partitioning feature in MySQL 5.1. There are also nice articles like this one by Robin. However, I thought it would be useful to have a quick "how-to" guide to partitioning by dates. I'll use the world schema to keep it easy.

Partitioning a table by a range of dates is quite popular. Unfortunately, the PARTITION BY RANGE only accepts an integer (or a function that evaluates to an integer) as the partition expression. That's fine if you want to partition by numeric types like this:

ALTER TABLE City PARTITION BY RANGE(id) (
PARTITION p0 VALUES LESS THAN (1000),
PARTITION p1 VALUES LESS THAN (2000),
PARTITION p2 VALUES LESS THAN (3000),
PARTITION p3 VALUES LESS THAN (4000),
PARTITION p4 VALUES LESS THAN (MAXVALUE));


Side node: when partitioning using range, the partitions must be defined using "VALUES LESS THAN" and the ranges must be listed in increasing order.

If you'd rather partition by a date or datetime, it's not as easy. You cannot use a date directly (e.g., this syntax fails: RANGE(date_column)..PARTITION p0 VALUES LESS THAN ('2000-01-01')). There are plenty of workarounds, but my favorite is the to_days function.

First I need a date column, so let's add one to the City table:
ALTER TABLE City ADD citydate DATE;

# Now populate it with random days in the last ~3 years
UPDATE City SET citydate = current_date - INTERVAL truncate(rand()*1000,0) DAY;

#Remove the old partitioning
ALTER TABLE City REMOVE PARTITIONING;

#Remove the PRIMARY KEY constraint and replace it with an index (I'll explain below)
ALTER TABLE City DROP PRIMARY KEY, ADD INDEX(id);

#Partition by the date col:
ALTER TABLE City PARTITION BY RANGE (to_days(citydate)) (
PARTITION p0 VALUES LESS THAN (to_days('2007-01-01')),
PARTITION p1 VALUES LESS THAN (to_days('2008-01-01')),
PARTITION p2 VALUES LESS THAN (to_days('2009-01-01')),
PARTITION p3 VALUES LESS THAN (to_days('2009-03-01')),
PARTITION p4 VALUES LESS THAN (to_days('2009-06-01')),
PARTITION p5 VALUES LESS THAN MAXVALUE);


Notice the partitions do not have to be even. This is very handy; you may want to put older records that aren't accessed often in a larger partition and keep the recent data in small, fast partitions.

Let's see partition pruning in action. If I run a query that only needs rows from a few partitions, the optimizer will only read from those partitions:
EXPLAIN PARTITIONS SELECT count(*) FROM City WHERE citydate BETWEEN '2009-01-01' AND '2009-08-01';

+----+-------------+-------+------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | City | p3,p4,p5 | ALL | NULL | NULL | NULL | NULL | 4079 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+-------------+

Notice the optimizer realizes it only needs to use certain partitions to find the results, shown by the partitions column.

Now let me explain why I removed the primary key. There is a rule about using unique constraints with partitioning. The short answer is: you can't have a unique constraint on something that you didn't partition by. The reason is that when you insert a record and the uniqueness needs to be checked, we don't want to search through every partition to verify uniqueness. Indexes are local to the partition (global indexes may be implemented in the future). So you can only have a unique constraint if all the columns in the constraint are used in the partitioning expression.

Views: 52769

Comment

You need to be a member of Everything MySQL to add comments!

Join Everything MySQL

Comment by Sarah Sproehnle on November 9, 2009 at 11:22am
FYI, there was a bug (fixed in the upcoming release of 5.1.41) that caused bad execution plans because the index statistics of only the first partition were used:
http://bugs.mysql.com/bug.php?id=44059
Comment by jshoulet on November 5, 2009 at 6:20pm
THANK YOU!
Comment by Sarah Sproehnle on November 5, 2009 at 3:51pm
Hi jshoulet,

The row count should be the same in EXPLAIN SELECT and EXPLAIN PARTITIONS SELECT. However, I think what you are noticing is a bug of sorts (though a documented one). In EXPLAIN, the number of rows is an estimate of the number or rows that would be read. This estimate is based on the optimizer's statistics. If the explain plan is type "ALL", this number should be approximately the total number of rows in a table. In a partitioned table, it should be approximately the number of rows in the partitions that are to be scanned.. however the explain is returning the total number or rows in the table instead.

By the way, if there was an index on the column (e.g., ALTER TABLE City ADD INDEX (citydate) in my above example), then the number or rows would go down drastically because the optimizer has statistics about indexes.
Comment by jshoulet on November 5, 2009 at 3:00pm
Can you tell me what the row count is when you use EXPLAIN SELECT vs. EXPLAIN PARTITIONS SELECT? I know that total table rows is used in EXPLAIN PARTITIONS SELECT since v5.1.28 but what about using EXPLAIN SELECT? I've set up test cases in 3 different environments and in all cases the EXPLAIN SELECT row count shows total table rows instead of the number of partitioned rows scanned. Is this row count behavior expected when using partitioning?
Comment by Chris on September 8, 2009 at 7:57pm
They'll be up next week!
Comment by Sarah Sproehnle on September 8, 2009 at 2:40pm
Chris, good food for thought! I'd be curious to see the benchmarks if you have a chance. One small point is readability. Neither way gives a very readable SHOW CREATE TABLE output. You could set up a view to do SELECT partition_name, from_days(partition_description) FROM information_schema.partitions WHERE table_name='City'; (thanks Giuseppe for this idea). Moreover, the queries against these tables would be slightly different. In the to_days approach, a query would refer to dates as dates (e.g., WHERE citydate= '2009-09-08'). In the second approach, you'd write WHERE modified = unix_timestamp('2009-09-08'). Small price to pay if the unix timestamp was truly faster.. we await the benchmarks!
Comment by Chris on September 8, 2009 at 1:11pm
Thx for the information above. Here is what I currently testing.

I have installed 5.1.35-community on newdbadmin1 and am starting to run tests with partitioning.

-- What I want to find out
1. cost of alter, pruning and efficiency
2. is this better than just specifying N number of tables?
3. is using UNIX_TIMESTAMPS better than to_days?

If anyone has these answers at the moment that would be great. If not I will post a BLOG about it in a week or so.

Testing Details:

-- The Table
Create Table: CREATE TABLE `t2` (
`fname` varchar(50) NOT NULL,
`lname` varchar(50) NOT NULL,
`region_code` tinyint(3) unsigned NOT NULL,
`modified` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE ( modified)
(PARTITION m0 VALUES LESS THAN (1233392461) ENGINE = InnoDB,
PARTITION m1 VALUES LESS THAN (1235808001) ENGINE = InnoDB,
PARTITION m2 VALUES LESS THAN (1238482801) ENGINE = InnoDB,
PARTITION m3 VALUES LESS THAN (1241074801) ENGINE = InnoDB,
PARTITION m4 VALUES LESS THAN (1243753201) ENGINE = InnoDB,
PARTITION m5 VALUES LESS THAN (1246345201) ENGINE = InnoDB,
PARTITION m7 VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

-- Refference to how I determined the partiton numbers
mysql> select unix_timestamp('2009-01-31 01:01:01');
+---------------------------------------+
| unix_timestamp('2009-01-31 01:01:01') |
+---------------------------------------+
| 1233392461 |
+---------------------------------------+
1 row in set (0.00 sec)

mysql> select unix_timestamp('2009-02-28 01:01:01');
+---------------------------------------+
| unix_timestamp('2009-02-28 01:01:01') |
+---------------------------------------+
| 1235811661 |
+---------------------------------------+
1 row in set (0.00 sec)

etc...


-- Script to load data

#!/bin/bash
# Insert records for every month,day,hour,min and second was accounted for in the finished data set from January to June
# NOTE: this takes a LONG time so you can make a multi-threaded script or another variation to load test data

date2stamp () {
date --utc --date "$1" +%s
}

# convert a date into a UNIX timestamp
for g in 2009
do
for x in $( seq -w 06 );
do
for y in $( seq -w 01 30 );
do
for z in $( seq -w 00 23 );
do
for a in $( seq -w 00 59 );
do
for b in $( seq -w 00 59 );
do
thedate="${g}-${x}-${y} ${z}:${a}:${b}"
stamp=$(date2stamp "$thedate")
mysql -uUSERNAME -pPASSWORD -e "INSERT INTO test.t2 (fname,lname,region_code,modified) VALUES ('test${a}','testing${b}','${y}','$
{stamp}');"
done
done
done
done
done
done
Comment by Giuseppe Maxia on September 8, 2009 at 12:41pm
Some tricks that make partitioning by date a bit less painful:

http://datacharmer.blogspot.com/2008/12/partition-helper-improving-usability.html
http://datacharmer.blogspot.com/2008/11/quick-usability-hack-with-partitioning.html

Giuseppe

RSS

© 2024   Created by Chris.   Powered by

Badges  |  Report an Issue  |  Terms of Service