Showing posts with label handful. Show all posts
Showing posts with label handful. Show all posts

Wednesday, March 7, 2012

Fact Tables and Clustered Indexes

A design question:
Say I have a medium-sized fact table w/ a handful of dimension key
columns (including a date) and a handful of measures.
I necessarily need a primary key on the composite dimension key
columns, and I don't know ahead of time which of my dimension key
column(s) will be the best candidate for a clustered index. I do plan
on putting non-clustered indexes on all my dimension key columns, and
the related dimension tables' key columns.
For the sake of argument, let's say we're not partitioning the fact
table.
Assume that new facts occur in time, the fact table grows with time,
and (nearly) all changes to the fact table occur as INSERTs.
Now, all things being equal, is there a benefit of adding a clustered
index to the fact table? Two options:
- Add an IDENTITY column, make it the primary key, and add the
clustered index to it.
- Add the clustered index on the date column, since it has a natural
order.
Basically, I'm after two answers in this scenario:
- Is there a benefit to having a clustered index on a table when the
application doesn't 'really' call for one?
- If so, is it better to add an IDENTITY column (adding size to the
table) or to pick an naturally ordered dimension key? A random key?
The fact's composite key?
Thanks much.
Steven D. Clark
stevec@.clarkdev.com
"Steven Clark" <stevec@.clarkdev.com> wrote in message
news:d7740507.0407100710.2b1644b5@.posting.google.c om...
> Basically, I'm after two answers in this scenario:
> - Is there a benefit to having a clustered index on a table when the
> application doesn't 'really' call for one?
In my opinion, there is never a reason NOT to have one; it's a freebie,
basically, unlike clustered indexes. It doesn't take up any extra disc
space, and you might as well order the data on the disc somehow, rather than
letting the server take care of it... So I always make sure that every table
has one.

> - If so, is it better to add an IDENTITY column (adding size to the
> table) or to pick an naturally ordered dimension key? A random key?
> The fact's composite key?
Some tips for clustered indexes:
- They assist with range queries and grouping. So try to use them for
columns that will be used for those kinds of operations (>, <, BETWEEN, etc,
or make it composite in the same order that you'll be grouping. If you do
composite, order the columns by selectivity, least selective first. This
will create a wider tree, which will result in somewhat quicker grouping.)
- Clustering on a random key is a very bad idea, because it will cause a
lot of page splits, leading to fragmentation. This will slow your data
loads. It will also give you no query benefits at all. So you'll actually
lose on this option.
- Clustering on an IDENTITY key or a DATETIME column that's
automatically set to the date the row is inserted will actually speed up
inserts as it will create a hotspot at the end of the table. So you'll
never have page splits when inserting new data. This can definitely help
speed your data load! Clustering on an IDENTITY will usually not help too
much with queries as, in my experience, most grouping and range operations
don't consider surrogates. Depending on your app, clustering on a DATETIME
as I described can help a lot, as a lot of queries will request data between
two dates, greater than a date, etc.
- Finally, clustering on the composite of your dimensions may be helpful
if you're grouping on them or requesting ranges. However, in the latter
case, remember that a composite index will only be used for searching if the
first column is part of the search criteria, so try to choose one of your
dimensions that will always be searched on (if that exists in your
warehouse).
I hope that answered your questions? Post back if you need some
clarification or further assistance.
|||> It doesn't take up any extra disc space,
That's not true. The clustered index uses the datapages as the leaves, so
that is disk space you use already anyway, but the nodes of the index still
take up extra space on disk. That doesn't mean btw that it is not a good
idea to have a clustered index on every table. In almost all cases the
performance improvement that a clustered index provides more than offsets
the extra disk space used.
Jacco Schalkwijk
SQL Server MVP
"Adam Machanic" <amachanic@.hotmail._removetoemail_.com> wrote in message
news:%23rDxElvZEHA.2388@.TK2MSFTNGP11.phx.gbl...
> "Steven Clark" <stevec@.clarkdev.com> wrote in message
> news:d7740507.0407100710.2b1644b5@.posting.google.c om...
> In my opinion, there is never a reason NOT to have one; it's a
freebie,
> basically, unlike clustered indexes. It doesn't take up any extra disc
> space, and you might as well order the data on the disc somehow, rather
than
> letting the server take care of it... So I always make sure that every
table
> has one.
>
> Some tips for clustered indexes:
> - They assist with range queries and grouping. So try to use them for
> columns that will be used for those kinds of operations (>, <, BETWEEN,
etc,
> or make it composite in the same order that you'll be grouping. If you do
> composite, order the columns by selectivity, least selective first. This
> will create a wider tree, which will result in somewhat quicker grouping.)
> - Clustering on a random key is a very bad idea, because it will cause
a
> lot of page splits, leading to fragmentation. This will slow your data
> loads. It will also give you no query benefits at all. So you'll
actually
> lose on this option.
> - Clustering on an IDENTITY key or a DATETIME column that's
> automatically set to the date the row is inserted will actually speed up
> inserts as it will create a hotspot at the end of the table. So you'll
> never have page splits when inserting new data. This can definitely help
> speed your data load! Clustering on an IDENTITY will usually not help too
> much with queries as, in my experience, most grouping and range operations
> don't consider surrogates. Depending on your app, clustering on a
DATETIME
> as I described can help a lot, as a lot of queries will request data
between
> two dates, greater than a date, etc.
> - Finally, clustering on the composite of your dimensions may be
helpful
> if you're grouping on them or requesting ranges. However, in the latter
> case, remember that a composite index will only be used for searching if
the
> first column is part of the search criteria, so try to choose one of your
> dimensions that will always be searched on (if that exists in your
> warehouse).
> I hope that answered your questions? Post back if you need some
> clarification or further assistance.
>
>
|||"Jacco Schalkwijk" <jacco.please.reply@.to.newsgroups.mvps.org.invalid > wrote
in message news:%23cIz3GDaEHA.2544@.TK2MSFTNGP10.phx.gbl...
> That's not true. The clustered index uses the datapages as the leaves, so
> that is disk space you use already anyway, but the nodes of the index
still
> take up extra space on disk. That doesn't mean btw that it is not a good
> idea to have a clustered index on every table. In almost all cases the
> performance improvement that a clustered index provides more than offsets
> the extra disk space used.
Thanks for the clarification on that... I also didn't think about fill
factor, which could also create the impression of more disc space being
used.

Fact Tables and Clustered Indexes

A design question:
Say I have a medium-sized fact table w/ a handful of dimension key
columns (including a date) and a handful of measures.
I necessarily need a primary key on the composite dimension key
columns, and I don't know ahead of time which of my dimension key
column(s) will be the best candidate for a clustered index. I do plan
on putting non-clustered indexes on all my dimension key columns, and
the related dimension tables' key columns.
For the sake of argument, let's say we're not partitioning the fact
table.
Assume that new facts occur in time, the fact table grows with time,
and (nearly) all changes to the fact table occur as INSERTs.
Now, all things being equal, is there a benefit of adding a clustered
index to the fact table? Two options:
- Add an IDENTITY column, make it the primary key, and add the
clustered index to it.
- Add the clustered index on the date column, since it has a natural
order.
Basically, I'm after two answers in this scenario:
- Is there a benefit to having a clustered index on a table when the
application doesn't 'really' call for one?
- If so, is it better to add an IDENTITY column (adding size to the
table) or to pick an naturally ordered dimension key? A random key?
The fact's composite key?
Thanks much.
Steven D. Clark
stevec@.clarkdev.com"Steven Clark" <stevec@.clarkdev.com> wrote in message
news:d7740507.0407100710.2b1644b5@.posting.google.com...
> Basically, I'm after two answers in this scenario:
> - Is there a benefit to having a clustered index on a table when the
> application doesn't 'really' call for one?
In my opinion, there is never a reason NOT to have one; it's a freebie,
basically, unlike clustered indexes. It doesn't take up any extra disc
space, and you might as well order the data on the disc somehow, rather than
letting the server take care of it... So I always make sure that every table
has one.

> - If so, is it better to add an IDENTITY column (adding size to the
> table) or to pick an naturally ordered dimension key? A random key?
> The fact's composite key?
Some tips for clustered indexes:
- They assist with range queries and grouping. So try to use them for
columns that will be used for those kinds of operations (>, <, BETWEEN, etc,
or make it composite in the same order that you'll be grouping. If you do
composite, order the columns by selectivity, least selective first. This
will create a wider tree, which will result in somewhat quicker grouping.)
- Clustering on a random key is a very bad idea, because it will cause a
lot of page splits, leading to fragmentation. This will slow your data
loads. It will also give you no query benefits at all. So you'll actually
lose on this option.
- Clustering on an IDENTITY key or a DATETIME column that's
automatically set to the date the row is inserted will actually speed up
inserts as it will create a hotspot at the end of the table. So you'll
never have page splits when inserting new data. This can definitely help
speed your data load! Clustering on an IDENTITY will usually not help too
much with queries as, in my experience, most grouping and range operations
don't consider surrogates. Depending on your app, clustering on a DATETIME
as I described can help a lot, as a lot of queries will request data between
two dates, greater than a date, etc.
- Finally, clustering on the composite of your dimensions may be helpful
if you're grouping on them or requesting ranges. However, in the latter
case, remember that a composite index will only be used for searching if the
first column is part of the search criteria, so try to choose one of your
dimensions that will always be searched on (if that exists in your
warehouse).
I hope that answered your questions? Post back if you need some
clarification or further assistance.|||> It doesn't take up any extra disc space,
That's not true. The clustered index uses the datapages as the leaves, so
that is disk space you use already anyway, but the nodes of the index still
take up extra space on disk. That doesn't mean btw that it is not a good
idea to have a clustered index on every table. In almost all cases the
performance improvement that a clustered index provides more than offsets
the extra disk space used.
Jacco Schalkwijk
SQL Server MVP
"Adam Machanic" <amachanic@.hotmail._removetoemail_.com> wrote in message
news:%23rDxElvZEHA.2388@.TK2MSFTNGP11.phx.gbl...
> "Steven Clark" <stevec@.clarkdev.com> wrote in message
> news:d7740507.0407100710.2b1644b5@.posting.google.com...
> In my opinion, there is never a reason NOT to have one; it's a
freebie,
> basically, unlike clustered indexes. It doesn't take up any extra disc
> space, and you might as well order the data on the disc somehow, rather
than
> letting the server take care of it... So I always make sure that every
table
> has one.
>
> Some tips for clustered indexes:
> - They assist with range queries and grouping. So try to use them for
> columns that will be used for those kinds of operations (>, <, BETWEEN,
etc,
> or make it composite in the same order that you'll be grouping. If you do
> composite, order the columns by selectivity, least selective first. This
> will create a wider tree, which will result in somewhat quicker grouping.)
> - Clustering on a random key is a very bad idea, because it will cause
a
> lot of page splits, leading to fragmentation. This will slow your data
> loads. It will also give you no query benefits at all. So you'll
actually
> lose on this option.
> - Clustering on an IDENTITY key or a DATETIME column that's
> automatically set to the date the row is inserted will actually speed up
> inserts as it will create a hotspot at the end of the table. So you'll
> never have page splits when inserting new data. This can definitely help
> speed your data load! Clustering on an IDENTITY will usually not help too
> much with queries as, in my experience, most grouping and range operations
> don't consider surrogates. Depending on your app, clustering on a
DATETIME
> as I described can help a lot, as a lot of queries will request data
between
> two dates, greater than a date, etc.
> - Finally, clustering on the composite of your dimensions may be
helpful
> if you're grouping on them or requesting ranges. However, in the latter
> case, remember that a composite index will only be used for searching if
the
> first column is part of the search criteria, so try to choose one of your
> dimensions that will always be searched on (if that exists in your
> warehouse).
> I hope that answered your questions? Post back if you need some
> clarification or further assistance.
>
>|||"Jacco Schalkwijk" <jacco.please.reply@.to.newsgroups.mvps.org.invalid> wrote
in message news:%23cIz3GDaEHA.2544@.TK2MSFTNGP10.phx.gbl...
> That's not true. The clustered index uses the datapages as the leaves, so
> that is disk space you use already anyway, but the nodes of the index
still
> take up extra space on disk. That doesn't mean btw that it is not a good
> idea to have a clustered index on every table. In almost all cases the
> performance improvement that a clustered index provides more than offsets
> the extra disk space used.
Thanks for the clarification on that... I also didn't think about fill
factor, which could also create the impression of more disc space being
used.