# sql statistics

Dale Burnett has a very nice series of 8 posts on SQL statistics.

In the first one, he creates a table of 10,000 rows. All but one of them are “a”; one is “b”. The table has a clustered index on the identity column, a nonclustered index (NCI) on the character column (the a’s and the single b), and a third, varchar(10) column with junk in it.

When you create the NCI, statistics on its key column are generated as part of index creation; for non-indexed columns, you get them if AUTO_CREATE_STATISTICS is turned on (which is the default) or if you run the CREATE STATISTICS command. If you run “select * from table where column1 = ‘a’”, you get an index scan (on the clustered index), but if you select ‘b’, you get an index seek on the NCI. The query optimizer ‘knows’ that performing 9,999 seeks plus lookups to return the a’s is going to be more expensive than a scan, so it chooses the scan. It ‘knows’ that to return the single ‘b’ it is cheaper (in this case, given the depth of the b-tree) to make three reads than to scan the entire table (41 reads). Of course, it has to ‘know’ how many rows will be returned, and it finds that in the statistics.
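A minimal sketch of that setup, so you can see the scan-vs-seek difference yourself — the table and index names here are my own invention, not Dale’s:

```
-- hypothetical reconstruction of the test table; names are invented
create table dbo.StatsDemo (
    id      int identity(1,1) primary key clustered,
    column1 char(1)     not null,
    junk    varchar(10) null
)

-- 9,999 rows of 'a', one row of 'b'
insert into dbo.StatsDemo (column1, junk)
select top (9999) 'a', 'filler'
from sys.all_columns a cross join sys.all_columns b

insert into dbo.StatsDemo (column1, junk) values ('b', 'filler')

create nonclustered index IX_StatsDemo_column1 on dbo.StatsDemo (column1)

-- compare the actual execution plans: scan for 'a', seek for 'b'
select * from dbo.StatsDemo where column1 = 'a'
select * from dbo.StatsDemo where column1 = 'b'
```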

The next post is about density (uniqueness or selectivity). A new table, a sample of address1, address2, city and zip from AdventureWorks, with an index on all four columns. If you count the distinct address1 values, the density is the reciprocal of that count. In his case, 1/2815 = 0.0003552398. If all values were the same, the density would be one; it decreases as the number of unique values increases, to a minimum of 1/4040008, where 4040008 is the number of rows. By default, the best guess of the number of rows to be returned, if you pick one of the 2815 possible address1 values in a query, is the density times the total rows, or (1/2815) * 4040008 = 1435.1715, which is exactly the estimated number of rows shown in the estimated execution plan.

However, a density of 1/2 in 10 rows can come from a 1+9, 5+5 or 8+2 split. Density is not distribution, and the actual number of rows is not necessarily the estimated number of rows. The estimate is only close if the values are evenly distributed across the column (in which case the 1435 figure is accurate). In the first post’s example, two distinct values in 10,000 rows gives a density of 0.5, which by itself would suggest a scan, whereas the reality of a single ‘b’ means a query for ‘b’ is best served with a seek and lookup.

The distribution information is displayed in the histogram, which has up to 200 steps (buckets). Each step counts the rows with values greater than the previous step’s upper bound and less than or equal to its own RANGE_HI_KEY, where the key is the indexed column, or the first column of a compound index.  Depending on the sample size, the histogram can be built from a full scan (a census) or from a sample (a survey).
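You can inspect the density vector and the histogram yourself with DBCC SHOW_STATISTICS — the table and statistics names below are placeholders from my sketch above, so substitute your own:

```
-- all three result sets: header, density vector, and histogram
dbcc show_statistics ('dbo.StatsDemo', 'IX_StatsDemo_column1')

-- or one piece at a time
dbcc show_statistics ('dbo.StatsDemo', 'IX_StatsDemo_column1') with density_vector
dbcc show_statistics ('dbo.StatsDemo', 'IX_StatsDemo_column1') with histogram
```

The density_vector result shows the “all density” values (the reciprocals discussed above), and the histogram result shows RANGE_HI_KEY, RANGE_ROWS and EQ_ROWS for each step.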

# execute execute

I had to fuss with this for a while until I got it.  I needed dynamic SQL in order to count rows in different tables.  The tables depended on the dataset, and even the databases, so I had to be able to pass in a list of table names when the stored procedure ran.

But I wanted to get a value back out of the dynamic SQL, and of course a stored procedure’s return value is usually just 0 or 1 for success/failure. By using sp_executesql, we can configure an output parameter and make sure our data, in this case the count, is returned through it.  Then I put it into another local variable and use that for SQLMail.

```
declare @SQL nvarchar(max), @mycount int

set @SQL = N'select @mycount = count(*)
from database.dbo.' + @destTable
exec sp_executesql @SQL
    , N'@mycount int OUTPUT'
    , @mycount = @mycount OUTPUT
set @bodyText += CHAR(13) + CHAR(10) +
    ' There are ' + cast(@mycount as varchar(10)) +
    ' rows in ' + @sourceTable + CHAR(13) + CHAR(10)
```

In the real case, there are pairs of table lookups, and the CHAR(13)+CHAR(10) are needed to force a line break in the email message passed to @body.  Of course, you need to declare @bodyText first.  AND FYI, you need to initialize it with something, like ‘report’.  If you try to += into @bodyText before there is anything in it, the statement succeeds but the result is NULL, because NULL concatenated with anything is NULL.  The obvious answer is to use set = for the first assignment and set += after that, but initializing works too.  I did not test with ‘ ‘ (a space); ‘.’ would also have worked, and I could remove it later, but this works for me.
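A quick illustration of the NULL-concatenation trap, assuming the default CONCAT_NULL_YIELDS_NULL behavior:

```
declare @bodyText nvarchar(max)      -- NULL until assigned

set @bodyText += 'first line'        -- NULL + 'first line' = NULL
select @bodyText                     -- returns NULL

set @bodyText = 'report'             -- initialize first...
set @bodyText += ' first line'       -- ...then += appends as expected
select @bodyText                     -- returns 'report first line'
```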

# number the table

I have a temporary table with data in it.  I want to look at each row separately, like a cursor does.  But not with a cursor, since cursors are fiddly, and much maligned.

If the table had an id column in it, and the ids formed an unbroken series, I could write a loop with a counter: process the row matching the counter, increment the counter, repeat until done.

When I create the table, I can add an identity column that starts at one and increases with each row.  But then, you say, I have to “insert” into it.  Nay, I say, you can still ‘select into’ to make the table – remember you don’t have to select * into #temp; you don’t have to use all the columns in the source table, AND you can add columns too.  Consider

```
select identity(int, 1, 1) as rowid
    , rtrim(source) as source
    , rtrim(dest) as dest
    , rtrim(copy) as copy
    , rtrim(partition) as partition
into #copy_tables
from dbo.load_tables
```

We have five columns, four of them from load_tables and one made on the fly.  You can always do this even if the data is already in an existing table – just select into another temp table and you are GBD (golden brown and delicious).

In my case, I wasn’t so smart.  I got a distinct list of items and wanted to foreach over them.  Of course, in T-SQL, you can’t.  I could loop over the table if I had that identity column, so I created one.

```
set @cmd = 'select M.metric, ROW_NUMBER()
    OVER (order by M.metric) as RowNum
into dbo.myTable
from (select distinct metric
      from server.database.dbo.inputTable) as M'
exec (@cmd)
select * from dbo.myTable
```

As you can see, this is a bit of dynamic SQL – I get the distinct metrics in a subquery, and then number each one with the ROW_NUMBER() OVER function.

Then I could loop like this:

```
set @mycounter = @@rowcount
set @x = 1
while @x <= @mycounter
begin
    set @met = (select metric from dbo.myTable where RowNum = @x)
    set @cmd = 'some SQL command here;'
    exec (@cmd)
    set @x = @x + 1
end
```

@@rowcount is the number of rows affected by the most recent statement – here, the number of rows returned by the select from myTable – so it only equals the table’s row count if this code immediately follows the previous select.  If there are other operations in between, @@rowcount gets reset.
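A sturdier alternative, if you don’t want to depend on statement ordering, is to compute the loop bound explicitly:

```
-- count the rows directly instead of relying on @@rowcount,
-- so statements in between can no longer break the loop bound
declare @mycounter int
select @mycounter = count(*) from dbo.myTable
```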