sql statistics

Dale Burnett has a very nice series of 8 posts on SQL statistics.

In the first one, he creates a database of 10,000 entries. All but one of them are “a”, one is “b”. the table has a clustered index on the identity column, a NCI on the character column (the A’s and the b) and a third, varchar(10) column with junk in it.

If you create the NCI, statistics are generated (assuming that create_auto_statistics is turned on, which is the default, or you run the create statistics command). If you “select * from table where column1 = ‘a'”, you get a index scan (on the clustered index), but if you select “b”, you get a index seek on the NCI. The Query analyser ‘knows’ that performing 9999 seeks to return the ‘a’s’ is going to be more expensive than a scan, so it chooses the scan. It ‘knows’ that to return the single ‘b’ it is cheaper (in this case given the depth of the b-tree) to make three physical reads than scan the entire table (41 reads). Of course, it has to ‘know’ the number of rows to return, and it finds that from the statistics.

The next post is about density (uniqueness or selectivity). A new table, sample of address1, address2, city and zip from “adventurerWorks” with an index on all the four columns. If you query the number of distinct address1 values, the density is the reciprocal of the unique count. In his case, 1/2815 = 0.0003552398. If all values were the same, the density would be one, decreasing as the number of unique values increases to a minimum of 1/4040008 where 4040008 is the number of rows. By default, the best guess of the number of rows to be returned, if you pick one of the 2815 possible address1 values in a query, would be the density * the total rows or ((1/2815) *4040008) = 1435.1715, which is the estimated number of rows calculated in the estimated execution plan.

However a density of 1/2 in 10 rows can be achieved from 1+9, 5+5 or 8+2. Density is not distribution, and the actual number of rows is not the same as the estimated number of rows. The estimate is close if the distribution in the column of values is unique (which means that the 1435 number is most accurate). In case one, with 1/9999, a density of .5 would probably suggest a scan, whereas the reality of one value of ‘b’ would indicate that that a query for b would be best served with a lookup.

The distribution information is displayed in the histogram table, which is as many as 200 buckets, each containing the number of rows of data greater than the previous bucket but less than or equal to the ‘range_hi_key’, which is the index or the first field in a compound index.  Depending on the sample size, the histogram can be a census or a survey.