Sampling for update statistics in ASE has been introduced a long time ago, in ASE version 12.5.0.3 which was released in 2003. For some reason, this very useful feature has not been advertised nor documented by Sybase very well. Actually, the whitepaper by Eric Miner still remains the most valuable and informative source about the sampling for update statistics, even though it was written more than 10 years ago. In my experience, many customers either don't use sampling for update statistics at all or do it in a less than optimal way. I would recommend you to read the whitepaper to better understand how sampling for update statistics works in general.

Recently, I implemented sampling for two different OEM customers that have just upgraded to ASE 15.7. The goal was to decrease the time for update statistics that used to take hours for big history tables and consume a lot of CPU and I/O. During my tests, I learned some new facts about sampling for update statistics and I'd like to share my findings here, I hope it will be useful.

There is a number of different variations of "update statistics" command available. Let's see what options are available for us in ASE 15.7:

**update statistics table_name [index_name]** - updates statistics on leading column only for the specified index, or on leading columns of all indexes on the table if the index name is not specified. Histograms on non-leading columns don't get updated, which is the main limitation of this syntax of the command. In my experience, absence or staleness of statistics on non-leading index columns may cause serious performance problems for queries, this is why I tend to not use this syntax of update statistics. The sampling cannot be used in this case, not even for leading index columns.

**update index statistics table_name [index_name] **- updates statistics on all columns of the index, or on all indexed columns of the table if the index name is not specified. In the latter case, if a column appears in more than one index, the statistics on such a column may be updated a number of times. The sampling can be used on non-leading columns only. Usually, I recommend this syntax as a default option because it is safe, simple and updates all the required statistics. However, in some cases, it may just take too much time and resources. In such cases, my approach that I'm going to describe later would be useful.

It is interesting that in some ASE releases optdiag may misleadingly report using of sampling on leading index columns, see CR 725185. The CR is still open and I saw this problem on different customer sites and on ASE versions 15.7 ESD#2 and upper. In such cases, optdiag reports the same sampling rate for leading index column(s) as for non-leading columns, but in fact the sampling on leading columns is not performed.

**update statistics table_name (colA, colB, ...)** - affects statistics in a way similar to what "create index" on the same columns set would do. The command updates histograms on the leading column only and gathers multicolumn statistics on column combinations like (colA, colB), (colA, colB, colC) etc. By multicolumn statistics I mean information about densities and unique values for groups of columns in a composite index, you will be able to find it in optdiag output easily.

**update statistics table_name (colA),(colB),...** - not to be confused with the previous syntax, each column is quoted by brackets. Updates histograms on all columns in the list and doesn't gather multicolumn statistics. It is like a shortcut for running multiple update statistics commands, each on a single column. Sampling can be used on all columns in the list.

Note that the list above doesn't contain hash-based update statistics, a new feature that comes with ASE 15.7 ESD#2. This feature is still functionally limited and given the number of CRs about it that I saw in cover letters of recent EBFs/SPs, I can conclude that this feature is not very mature yet. If you have a positive experience with hash-based update statistics in production systems - please let me know, it would be very interesting.

After some tests, I came to following guidelines:

1. If **update index statistics** without sampling is fast enough for you - just use it on all your tables.

2. You may discover that on bigger tables **update index statistics** without sampling is just too slow and consume too many resources. Then, you may try to use **update index statistics** with sampling. This will be useful mostly for composite indexes.

3. If the previous step is not effective enough or not relevant, try update statistics on column sets as they appear in indexes on the table. This should allow you to fully use the power of sampling. You can do it in two steps:

Suppose we have a table with a composite index on (colA, colB, colC) and we have decided that sampling of 3% is good enough to our needs. Then, we can update the histogram on the leading column and update the multi-column statistics:

**update statistics table_name (colA, colB, colC) with sampling = 3 percent**

At this stage, histograms on colB and colC will not be updated, so we can update it as follows:

**update statistics table_name (colB),(colC) with sampling = 3 percent**

That's all, now we have statistics on all three columns, including muticolumn statistics, up-to-date and we have used sampling for all columns in the index. You are welcome to try it on your tables and see the difference.

As to the sampling rate - I have found that if we are dealing with a relatively big table (at least millions of rows) then we can benefit greatly from very low sampling rates, as low as 1%, in most of the cases. I would recommend comparing optdiag outputs after applying of various sampling rates, the difference in terms of performance may be quite significant. For one of my customers, I discovered that decreasing of the sampling from 4% to 1% decreases the time required for update statistics by more than 3 times, and this without compromising the accuracy of column histograms.

I have prepared a test case based on the dataset that I used in my previous blog post, about materialized views in ASE, so fell free to ask me for test case details if you feel that it may be useful for you. However, in my opinion, it would be much better if you test my approach on your own data, because in most of the cases, your data distribution will be quite different from one in my test case.

## Comments

You can follow this conversation by subscribing to the comment feed for this post.