Friday, February 24, 2012

Extremely high % Disk Time... how is this possible?

Hi Group
SYSTEM:
Clustered SQL Server 2000 on Windows 2000 (2 servers), with RAID 1+0, 6
drives (24 x 20 GB hard drives total, with mirroring/striping). 4 GB RAM
per server. All fiber back end, with a Fibre Channel Storage Hub 7 (at 100
MB/sec). Our SQL servers have fiber cards at 400 MB/sec connected to the hub.
About 50 databases, ranging in size from about 50-100 MB up to 6 GB. They are
shared by 15 clients (fairly balanced over a 24-hour period: when one
client is busy, another isn't). The total size of the databases is ~65 GB, log
files ~35 GB. 100-250 connected users (of our application), with 300-500
SQL connections. Dual Xeon 2.8 GHz processors per server.
PROBLEM:
We are experiencing disk problems but are having a very difficult time
diagnosing exactly what the cause and solution are. On the data drives,
through PerfMon, the % Disk Times average between 250 and 1500, with
peaks of over 30,000 (how do I get 30,000% disk use?). Average disk
queues sit at about 100-250 during the problem times, but we
occasionally see them spike well over 1000. Transactions/sec hover
between 50-100. The processor hovers between 10 and 40%, spiking to 70-80%.
Disk bytes/sec for the data drives averages between 1 and 6 MB/sec.
I know there are a lot of variables that can be at play here, but where
should I start? Is the system overall too small? Overworked? We did have
memory issues, but we increased the RAM and there no longer appear to be buffer
cache problems, and we have not yet exceeded the available RAM. We have
removed most of our temp tables (in favor of table variables) and removed a
lot of the cursors that were in the system. We moved our reporting server
to its own box (completely separate from the prod servers). We reduced the log
file backup intervals and staggered our database backup jobs. We converted
portions of our reporting to Analysis Services. All with short-term
marginal improvements. We have to speed our system up before we start
losing clients. Does it sound/appear as though the next option would be to
scale out? Scale up? Go to a SAN environment? Any info and advice would be
greatly appreciated!
If you got this far, thanks for taking the time to read this long post! I am
anxious to read any responses. Thank you.
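A quick note on the 30,000% figure: PerfMon's % Disk Time is derived from Avg. Disk Queue Length multiplied by 100, not measured as a true 0-100% busy gauge, so any sustained queue of outstanding I/Os pushes it past 100%. A minimal sketch of that arithmetic (the direct proportionality is the assumption here):

```python
# Sketch: how PerfMon effectively derives % Disk Time.
# Assumption: % Disk Time = Avg. Disk Queue Length x 100, which is why
# the counter is not capped at 100% on a busy array.

def pct_disk_time(avg_disk_queue_length):
    """A sustained queue of 3 outstanding requests reads as 300%."""
    return avg_disk_queue_length * 100.0

# The spikes reported above:
print(pct_disk_time(300))   # 30000.0 -- the "30,000%" peak maps to ~300 queued I/Os
print(pct_disk_time(2.5))   # 250.0   -- a 250% average is only ~2.5 queued I/Os
```

In other words, the huge percentages are just the disk queue in disguise, which is why the replies below focus on queue length rather than % Disk Time.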
It sure sounds like your storage array is not able to handle the load.
Having RAID 1+0 is great, but only having 6 drives is on the low end.
Since you are not on a SAN, you only have the cache on the controllers
themselves to help deal with all the writes. Make sure the cache on the
controllers is set to 100% write-back and not read. It also sounds like you
only have a single drive array and everything is on it (logs, data and
tempdb). That also is not good. You should have at least a RAID 1 or
RAID 1+0 array for just the log files. The spikes can be caused by lots of things,
but you are probably seeing some of this due to checkpoints. You can use
the Checkpoint pages/sec PerfMon counter to correlate this to the
disk queue spikes. You might also try using connection pooling instead of
that many individual connections. You will need multiple pools to handle
the different dbs, but it should cut down the actual number of connections
quite a bit. Have you run any traces to see what kind of queries are being
run and who the worst offenders are? You may have a few really poorly tuned
queries that are hurting everything else.
Andrew J. Kelly SQL MVP
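To put "6 drives is on the low end" in numbers, here is a back-of-envelope sketch of what a small RAID 1+0 set can sustain. The per-spindle IOPS figure and the read/write mix are illustrative assumptions, not measurements from this system:

```python
# Back-of-envelope sketch (hypothetical numbers): rough host IOPS a small
# RAID 1+0 set can absorb. Per-spindle IOPS and the read/write mix below
# are assumptions for illustration only.

def raid10_capacity_iops(spindles, per_spindle_iops, read_fraction):
    """Effective host IOPS for RAID 1+0: reads can be served by either
    mirror half, but each write costs two physical I/Os (one per mirror)."""
    write_fraction = 1.0 - read_fraction
    raw = spindles * per_spindle_iops  # total physical I/O budget
    # host_iops * (read_fraction + 2 * write_fraction) = raw
    return raw / (read_fraction + 2.0 * write_fraction)

# 6 spindles at ~125 IOPS each, with an assumed 60/40 read/write mix:
print(round(raid10_capacity_iops(6, 125, 0.60)))  # ~536 host IOPS
```

A few hundred host IOPS is easy to exceed with 300-500 SQL connections sharing data, logs, and tempdb on one array, which is consistent with the queue lengths reported above.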
"Tim" <tim@.pds.notanaddress.ca> wrote in message
news:%23v7AhlsDFHA.2804@.TK2MSFTNGP14.phx.gbl...
> Hi Group
> SYSTEM:
> Clustered SQL Server 2000 on Windows 2000 (2 servers), with RAID 1+0, 6
> drives (Total of 24 20GB hard drives with mirroring/striping). 4 GB RAM /
> Server. All fiber back end, with Fibre Channel Storage Hub 7 ( @. 100
> MB/sec ). Our SQL servers have fiber cards @. 400 MB/sec connected to hub.
> About 50 DBases, ranging in size from about 50-100 MB up to 6 GB. They
> are
> shared by 15 clients (they are fairly balanced over a 24 hour period, when
> 1
> client is busy, the other isn't). The total size of dbases is ~ 65 GB,
> Log
> File ~ 35 GB. 100-250 connected users (to our application), with 300-500
> SQL connections. Dual XEON 2.8Ghz Processors / Server.
> PROBLEM:
> We are experiencing disk problems but are having a very difficult time
> diagnosing 'exactly' what is the cause and solution. On the data drives,
> thru PerfMon, the % Disk Times are averaging between 250 and 1500, with
> peaks of over 30,000 (How do I get 30,000 % disk use?). Average disk
> queues are setting at about 100-250 during the problem times, but
> occasionally see it spike well over 1000. The transactions/sec hover
> between 50-100. Processor hovers between 10 and 40%, spiking to 70-80%.
> The disk bytes/sec for the data drives average betwwen 1 and 6 MBytes/sec.
> I know there are a lot of variables that can be at play here, but where
> should I start? Is the system overall too small? Over worked? We did
> have
> memory issues but increased the RAM and there no longer appears to be
> buffer
> cache problems and we have not yet exceeded the available RAM. We have
> removed most of our temp tables (in favor of table variables) and removed
> a
> lot of the cursors that were in the system. We moved our reporting server
> to it's own box (completely seperate from prod servers). We reduced the
> log
> file backup intervals and staggered our database backup jobs. We
> converted
> portions of our reporting to Analysis services. All with short term
> marginal improvements. We have to speed our system up before we start
> losing clients. Does it sound/appear as though the next option would be
> to
> scale out? scale up? Goto a SAN environment? And info and advise would
> be
> greatly appreciated!
> If you got this far, thanks for the time in reading this long post!!! And
> am
> anxious to read any responses. Thank you.
>
|||I do have 5 logical drives: 1 for logs and 1 for databases for each server, and
a quorum drive. I have been evaluating the benefits of adding a dedicated array
for tempdb and indexes. Roughly 30-40% of transactions occur in tempdb
(we still have a lot of temp tables, as that seems to be the most efficient way to
maintain/handle the passing of tables between stored procedures). In many
cases there is no easy way to eliminate the passing of the tables, as a
couple of stored procedures (the big offenders) require a large amount of
processing.
"Andrew J. Kelly" <sqlmvpnooospam@.shadhawk.com> wrote in message
news:e65DCOtDFHA.2632@.TK2MSFTNGP12.phx.gbl...
> It sure sounds like your storage array is not able to handle the load.
> Having a Raid 1+0 is great but only having 6 drives is on the low end.
> Since you are not on a SAN you only have the cache on the controllers
> themselves to help deal with all the writes. Make sure the cache on the
> controllers is set to 100% write back and not read. It also sounds like
you
> only have a single drive array and everything is on it (logs , data and
> tempdb). That also is not good. You should have at least a Raid1 or
> Raid1+0 for just the log files. The spikes can be caused by lots of
things
> but you are probably seeing some of this due to checkpoints. You can use
> the Checkpoint pages per second perfmon counter to correlate this to the
> disk queue spikes. You might also try using connection pooling instead of
> that many individual connections. You will need multiple pools to handle
> the different dbs but it should cut down the actual number of connections
> quite a bit.Have you ran any traces to see what kind of queries are being
> run and who the worst offenders are? You may have a few really poorly
tuned[vbcol=seagreen]
> queries that are hurting everything else.
>
> --
> Andrew J. Kelly SQL MVP
>
> "Tim" <tim@.pds.notanaddress.ca> wrote in message
> news:%23v7AhlsDFHA.2804@.TK2MSFTNGP14.phx.gbl...
/[vbcol=seagreen]
hub.[vbcol=seagreen]
when[vbcol=seagreen]
300-500[vbcol=seagreen]
drives,[vbcol=seagreen]
MBytes/sec.[vbcol=seagreen]
removed[vbcol=seagreen]
server[vbcol=seagreen]
would[vbcol=seagreen]
And
>
|||As Andrew pointed out, it is the physical drives that appear to be limited.
Slicing them into logical drives may look better, but it does not increase IO
capability. Separating data and logs onto different physical disks with the
correct RAID layouts will increase performance. Andrew also suggested
profiling for performance to find out whether there are just a few really ugly
queries that are killing your system. I agree completely.
As for the original question on % Disk Time, that counter has been
largely irrelevant since SCSI overlapped IO was introduced way back when.
Caching controllers really killed it. I prefer to look at disk queue length
and read bytes/write bytes per second as true measures of disk performance.
I believe you will be running with the write cache off in a non-SAN cluster
to prevent data loss during a cluster failover. This is a big performance
killer and one major reason I do not recommend a cluster without a SAN. All
indicators are that you need to modify your IO system. You may find other
performance bottlenecks as you remove the IO limit, but for right now, that
is the worst offender.
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com
I support the Professional Association for SQL Server
www.sqlpass.org
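A rough way to apply the queue-length advice above (the ~2-outstanding-I/Os-per-spindle threshold is a common rule of thumb, assumed here rather than taken from this thread):

```python
# Sketch of a common rule of thumb (an assumption, not a hard limit):
# a sustained Avg. Disk Queue Length above ~2 outstanding I/Os per
# physical spindle usually indicates the array is saturated.

def queue_verdict(avg_queue_length, spindles, per_spindle_ok=2.0):
    per_spindle = avg_queue_length / spindles
    return "saturated" if per_spindle > per_spindle_ok else "ok"

# The numbers from the original post, against a 6-spindle array:
print(queue_verdict(100, 6))   # saturated (~17 queued I/Os per spindle)
print(queue_verdict(1000, 6))  # saturated
print(queue_verdict(8, 6))     # ok (~1.3 per spindle)
```

By this yardstick, sustained queues of 100-250 on 6 spindles are far beyond saturation, which supports the conclusion that the IO subsystem itself needs to change.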
"Tim" <tim@.pds.notanaddress.ca> wrote in message
news:%23kwePe3DFHA.2608@.TK2MSFTNGP10.phx.gbl...
> I do have 5 logical drives, 1 for logs and 1 for dbases for each server,
and
> a quorum drive. Been evaluating the benefits of adding a dedicated array
> for tempDB and indexes. Roughly 30-40 % of ransactions occur in tempDB
> (still have a lot of tempDbases as it seems to be the most efficient way
to[vbcol=seagreen]
> maintain/handle the passing of tables between stored procedures. In many
> cases, there is no easy way to eliminate the passing of the tables as a
> couple of stored procedures (the big offenders) require a large amount of
> processing.
>
> "Andrew J. Kelly" <sqlmvpnooospam@.shadhawk.com> wrote in message
> news:e65DCOtDFHA.2632@.TK2MSFTNGP12.phx.gbl...
> you
> things
use[vbcol=seagreen]
of[vbcol=seagreen]
handle[vbcol=seagreen]
connections[vbcol=seagreen]
being[vbcol=seagreen]
> tuned
6[vbcol=seagreen]
RAM[vbcol=seagreen]
> /
> hub.
They[vbcol=seagreen]
> when
GB,[vbcol=seagreen]
> 300-500
> drives,
with[vbcol=seagreen]
70-80%.[vbcol=seagreen]
> MBytes/sec.
where[vbcol=seagreen]
did[vbcol=seagreen]
have[vbcol=seagreen]
> removed
> server
the[vbcol=seagreen]
be
> would
> And
>
|||Completely agree with Geoff.
Andrew J. Kelly SQL MVP
"Geoff N. Hiten" <SRDBA@.Careerbuilder.com> wrote in message
news:%237$Gb13DFHA.2824@.tk2msftngp13.phx.gbl...
> As Andrew pointed out, it is the physical drives that appear to be
> limited.
> Slicing them into logical drives may look better but it does not increase
> IO
> capability. Separating data and logs onto different physical disks with
> the
> correct RAID layouts will increase performance. Andrew also suggested
> profiling for performance to find if there are just a few really ugly
> queries that are killing your system. I agree completely.
> As for the original question on Percent IO Time, that counter has been
> largely irrelevent since SCSI Overlapped IO was introduced way back when.
> Caching controllers really killed it. I prefer to look at disk queue
> length
> and read bytes/write bytes per second as true measures of disk
> performance.
> I believe that you will be running with write cache off in a non-SAN
> cluster
> to prevent data loss during a cluster failover. This is a big performance
> killer and one major reason I do not recommend a cluster without a SAN.
> All
> indicators are that you need to modify your IO system. You may find other
> performance bottlenecks as you remove the IO limit, but for right now,
> that
> is the worst offender.
> --
> Geoff N. Hiten
> Microsoft SQL Server MVP
> Senior Database Administrator
> Careerbuilder.com
> I support the Professional Association for SQL Server
> www.sqlpass.org
> "Tim" <tim@.pds.notanaddress.ca> wrote in message
> news:%23kwePe3DFHA.2608@.TK2MSFTNGP10.phx.gbl...
> and
> to
> use
> of
> handle
> connections
> being
> 6
> RAM
> They
> GB,
> with
> 70-80%.
> where
> did
> have
> the
> be
>
