Extracting Sql: indexes

Wednesday, March 7, 2012

Fact Tables and Clustered Indexes

A design question:
Say I have a medium-sized fact table w/ a handful of dimension key
columns (including a date) and a handful of measures.
I necessarily need a primary key on the composite dimension key
columns, and I don't know ahead of time which of my dimension key
column(s) will be the best candidate for a clustered index. I do plan
on putting non-clustered indexes on all my dimension key columns, and
the related dimension tables' key columns.
For the sake of argument, let's say we're not partitioning the fact
table.
Assume that new facts occur in time, the fact table grows with time,
and (nearly) all changes to the fact table occur as INSERTs.
Now, all things being equal, is there a benefit of adding a clustered
index to the fact table? Two options:
- Add an IDENTITY column, make it the primary key, and add the
clustered index to it.
- Add the clustered index on the date column, since it has a natural
order.
Basically, I'm after two answers in this scenario:
- Is there a benefit to having a clustered index on a table when the
application doesn't 'really' call for one?
- If so, is it better to add an IDENTITY column (adding size to the
table) or to pick a naturally ordered dimension key? A random key?
The fact's composite key?
Thanks much.
Steven D. Clark
stevec@.clarkdev.com

"Steven Clark" <stevec@.clarkdev.com> wrote in message
news:d7740507.0407100710.2b1644b5@.posting.google.com...
> Basically, I'm after two answers in this scenario:
> - Is there a benefit to having a clustered index on a table when the
> application doesn't 'really' call for one?
In my opinion, there is never a reason NOT to have one; it's a freebie,
basically, unlike non-clustered indexes. It doesn't take up any extra disc
space, and you might as well order the data on the disc somehow, rather than
letting the server take care of it... So I always make sure that every table
has one.

> - If so, is it better to add an IDENTITY column (adding size to the
> table) or to pick a naturally ordered dimension key? A random key?
> The fact's composite key?
Some tips for clustered indexes:
- They assist with range queries and grouping. So try to use them for
columns that will be used for those kinds of operations (>, <, BETWEEN, etc.),
or make it composite in the same order that you'll be grouping. If you do
composite, order the columns by selectivity, least selective first. This
will create a wider tree, which will result in somewhat quicker grouping.
- Clustering on a random key is a very bad idea, because it will cause a
lot of page splits, leading to fragmentation. This will slow your data
loads. It will also give you no query benefits at all. So you'll actually
lose on this option.
- Clustering on an IDENTITY key or a DATETIME column that's
automatically set to the date the row is inserted will actually speed up
inserts as it will create a hotspot at the end of the table. So you'll
never have page splits when inserting new data. This can definitely help
speed your data load! Clustering on an IDENTITY will usually not help too
much with queries as, in my experience, most grouping and range operations
don't consider surrogates. Depending on your app, clustering on a DATETIME
as I described can help a lot, as a lot of queries will request data between
two dates, greater than a date, etc.
- Finally, clustering on the composite of your dimensions may be helpful
if you're grouping on them or requesting ranges. However, in the latter
case, remember that a composite index will only be used for searching if the
first column is part of the search criteria, so try to choose one of your
dimensions that will always be searched on (if that exists in your
warehouse).
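
The date-clustered option from the tips above could look roughly like the following. This is only a sketch: the table, column, and index names (dbo.FactSales, DateKey, and so on) are hypothetical and do not come from the original question, and the right choice still depends on the queries actually run against the table.

```sql
-- Hypothetical fact table; all names are illustrative, not from the thread.
CREATE TABLE dbo.FactSales (
    DateKey      DATETIME NOT NULL,
    ProductKey   INT      NOT NULL,
    CustomerKey  INT      NOT NULL,
    StoreKey     INT      NOT NULL,
    SalesAmount  MONEY    NOT NULL,
    Quantity     INT      NOT NULL,
    -- Composite primary key on the dimension keys, kept nonclustered
    CONSTRAINT PK_FactSales
        PRIMARY KEY NONCLUSTERED (DateKey, ProductKey, CustomerKey, StoreKey)
);

-- Cluster on the ever-increasing date column so new rows land at the end
CREATE CLUSTERED INDEX CIX_FactSales_DateKey
    ON dbo.FactSales (DateKey);

-- A date-range query of the kind the clustered index is meant to serve
SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
WHERE DateKey >= '20040101' AND DateKey < '20040201'
GROUP BY ProductKey;
```

Because DateKey leads both the clustered index and the composite key, searches that filter on the date can use either one; a query that filters only on StoreKey would need its own non-clustered index.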
I hope that answered your questions? Post back if you need some
clarification or further assistance.

"Adam Machanic" <amachanic@.hotmail._removetoemail_.com> wrote in message
news:%23rDxElvZEHA.2388@.TK2MSFTNGP11.phx.gbl...
> It doesn't take up any extra disc space,
That's not true. The clustered index uses the data pages as its leaves, so
that is disk space you are using anyway, but the non-leaf nodes of the index
still take up extra space on disk. That doesn't mean, btw, that it is not a
good idea to have a clustered index on every table. In almost all cases the
performance improvement that a clustered index provides more than offsets
the extra disk space used.
Jacco Schalkwijk
SQL Server MVP

"Jacco Schalkwijk" <jacco.please.reply@.to.newsgroups.mvps.org.invalid> wrote
in message news:%23cIz3GDaEHA.2544@.TK2MSFTNGP10.phx.gbl...
> That's not true. The clustered index uses the datapages as the leaves, so
> that is disk space you use already anyway, but the nodes of the index still
> take up extra space on disk. That doesn't mean btw that it is not a good
> idea to have a clustered index on every table. In almost all cases the
> performance improvement that a clustered index provides more than offsets
> the extra disk space used.
Thanks for the clarification on that... I also didn't think about fill
factor, which could also create the impression of more disc space being
used.
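
One way to check the space points raised above is to compare the figures reported before and after rebuilding the index with an explicit fill factor. This is a rough sketch only: dbo.FactSales and CIX_FactSales_DateKey are hypothetical names, the clustered index is assumed to exist already, and the parenthesised WITH options assume SQL Server 2005 or later.

```sql
-- Hypothetical object names; dbo.FactSales is illustrative only.

-- Report reserved, data, and index space for the table
EXEC sp_spaceused 'dbo.FactSales';

-- Rebuild the existing clustered index, filling leaf pages to about 90%.
-- A lower fill factor spreads the same rows over more pages.
CREATE CLUSTERED INDEX CIX_FactSales_DateKey
    ON dbo.FactSales (DateKey)
    WITH (FILLFACTOR = 90, DROP_EXISTING = ON);

-- Compare the figures after the rebuild
EXEC sp_spaceused 'dbo.FactSales';
```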

Sunday, February 19, 2012

Extracting duplicate records

Friday, February 17, 2012

Extracting duplicate records

I am working on an inherited database and trying to add indexes on existing
tables. A few of the tables have duplicate entries that cause the
application of the indices to fail. I know that on some of the tables I can
simply copy the table, delete the records, apply the index, and import from
the old table, and only unique records will make it into the table. (I am
correct here, aren't I?) However...
On a couple of tables I need to actually "see" the duplicate records so that
I can choose which records to keep and do a manual deletion of the duplicate
records.
There are no PK's on the tables in question and the indices I need to put in
place are, for example, on 6 of 15 columns. I need to find the dupes and
view the information in non-indexed columns to make my decision on which to
keep and which to delete.
Does this make any sense? Can anyone help me out with how I would go about
identifying the duplicate records?
Thanks in advance.
daniel

You can group by the columns to be indexed and use a having clause to
return the dupes. Not sure of your table size, so the query may be CPU
and time consuming:
For example:
Select a, b, c, d
From MyTable
Group By a, b, c, d
Having COUNT(*) > 1
To get rid of the dupes, you could simply create another table and add a
clustered index with the ignore dup keys flag and insert the rows. But
you wouldn't have much control over which dupes were removed.
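
To see the non-indexed columns alongside each duplicate, one option is to join the grouped result back to the table. The following is a minimal sketch that reuses the placeholder names MyTable and a, b, c, d from the example above; it is an illustration, not something taken from the original poster's schema.

```sql
-- MyTable and columns a..d are the placeholder names used above.
-- Returns every column of each row whose (a, b, c, d) combination repeats,
-- so the copies can be compared side by side before deleting any of them.
SELECT t.*
FROM MyTable AS t
JOIN (
    SELECT a, b, c, d
    FROM MyTable
    GROUP BY a, b, c, d
    HAVING COUNT(*) > 1
) AS dup
  ON  t.a = dup.a
  AND t.b = dup.b
  AND t.c = dup.c
  AND t.d = dup.d
ORDER BY t.a, t.b, t.c, t.d;
```

Note that the equality join does not match NULLs, so duplicate groups with a NULL in any of the key columns will not show up in this result and need to be checked separately.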
David Gugick
Imceda Software
www.imceda.com

Thanks David but I guess maybe I've misunderstood the error message from
SQL.(?) I queried one of the tables on which I receive the indexing error,
with the query you provided, and it returned 0 records. Perhaps providing
the error message will allow someone more knowledgeable than I to tell me
I'm interpreting it wrong. The message is as follows:
"Server: Msg 1505, Level 16, State 1
CREATE UNIQUE INDEX terminated because a duplicate key was found for index
id 9. Most significant primary key is '115040'"
Does this not mean there are duplicate row entries based on the attempted
index columns? If so why do I get a return of 0 records with the query
you provided?
Any ideas anyone?

You'll need to provide the query, the table ddl, and the index statement
in order to tell what's going on.
David Gugick
Imceda Software
www.imceda.com

You might find this interesting.
http://www.15seconds.com/issue/011009.htm
Ben Miller

This works for me:
create table #a (col1 int, col2 int, col3 int)

insert into #a values (1, 2, 3)
insert into #a values (1, 2, 3)
insert into #a values (2, 2, 3)
insert into #a values (3, 2, 1)
insert into #a values (3, 2, 1)
insert into #a values (3, 2, 1)

-- Count rows per key combination; keep only combinations seen more than once
Select count(*) as "# Dupes", col1, col2, col3
From #a
Group By col1, col2, col3
Having count(*) > 1

# Dupes  col1  col2  col3
-------  ----  ----  ----
2        1     2     3
3        3     2     1
David Gugick
Imceda Software
www.imceda.com

Thanks again David. I finally got the query to work. Knowing that there were
duplicates I did some intensive eye straining work to figure out the
problem. It appears that some of the "duplicate" records also contained
unexpected data (empty string vs. NULL value), and this was throwing off the
query. I updated the fields as necessary and everything worked out fine.
Thanks again for your assistance.
daniel
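
The empty string vs. NULL mismatch described above can be illustrated with a small sketch. The temp table #b and its values are hypothetical and only show how GROUP BY treats the two values; they are not a reconstruction of the actual tables in this thread.

```sql
-- Hypothetical table and values, for illustration only
CREATE TABLE #b (col1 INT, col2 VARCHAR(10));

INSERT INTO #b VALUES (1, '');
INSERT INTO #b VALUES (1, NULL);

-- '' and NULL fall into separate groups, so this returns no rows
SELECT col1, col2, COUNT(*) AS dupes
FROM #b
GROUP BY col1, col2
HAVING COUNT(*) > 1;

-- Treating '' as NULL puts both rows into one group, which is then reported
SELECT col1, NULLIF(col2, '') AS col2, COUNT(*) AS dupes
FROM #b
GROUP BY col1, NULLIF(col2, '')
HAVING COUNT(*) > 1;

DROP TABLE #b;
```

Grouping on the normalised expression is a way to find such rows before deciding whether to update the stored values.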
