nested replacements

I guess i should check whether you could use regular expressions here, but: we have a column whose values have spaces in them.  We want them out.  In fact, we want them replaced with an underscore.  Easy:

select replace(bad_column_value, ' ','_') into ...

The logic is obvious – take the bad column, get its value, and IF there is a space in the value, replace it with the underscore.  “My Name” becomes “My_Name”.  If there is no problem with the value, no replacement is done and the select continues with the value in the row.

We have another case where the value is like “My Name / My Last Name”.  We can handle the replacement, but what to do if value one has the spaces and value two has the “/”?  One replacement won’t work.  Want to try some smart-ass case statement?

In powershell or c# you can just string replacements one after the other on a variable to first replace A and then replace B.  T-sql doesn’t make it that easy.

What you can do is nest the replacements like this:

select replace(replace(bad_column_value, ' ','_'), '/','') into ...

But wait – what is our expected output?  We want My_Name/My_Last_Name, and this won’t do that.  What we will get is My_Name__My_Last_Name.  What we want to find is “ / ” (space, slash, space) and replace it with “/”.  Right?

But wait – if we run the “replace spaces with _” first, THEN the “find and replace space/space”, we will not find “space/space” since it has become “_/_”.  So now we have to replace “_/_” with ???

The issue is that the final cleanup – replacing spaces – is easy, and ‘destructive’: we can’t simply reverse the procedure and get back to our starting point.  Our difficult selection, in this case, is finding ‘ / ‘ and replacing that.  THEN it is easy to replace the remaining spaces.  So we need to reverse the order of the replacements and do the ‘/’ on the inside (first), like so

select replace(replace(bad_column_value, ' / ','/'),' ','_') into ...

This will work.  It should be obvious that we can continue to string these things deeper than two.  But much more and we ought to write a function to do this.  Of course, that would probably be slower.
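For instance, three replacements deep – read the nested replaces inside-out, since the innermost one runs first (the table name and the ‘?’ cleanup are hypothetical):

```sql
-- innermost first: ' / ' -> '/', then remaining spaces -> '_', then strip '?'
select replace(
         replace(
           replace(bad_column_value, ' / ', '/'),
         ' ', '_'),
       '?', '')
into #cleaned
from dbo.SourceTable;  -- hypothetical source table
```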

One neat approach is to create a temp table or table variable listing all the bad characters, and use that table to determine which characters to remove:

declare @badStrings table (item varchar(50))
INSERT INTO @badStrings(item)
SELECT '>' UNION ALL
SELECT '<' UNION ALL
SELECT '(' UNION ALL
SELECT ')' UNION ALL
SELECT '!' UNION ALL
SELECT '?' UNION ALL
SELECT '@'
declare @testString varchar(100), @newString varchar(100)
set @testString = 'Juliet ro>'
set @newString = @testString
SELECT @newString = Replace(@newString, item, '') FROM @badStrings

Don’t know that i would have reassigned @newString and @testString, but anyway this works fine to REPLACE the bad strings with blanks – but not to replace blanks with, for example, underscores.  That can be done by adding another column to our list of bad strings to hold the replacement values:

create TABLE #ReplaceStrings (symb VARCHAR(5), replace_char VARCHAR(5))
INSERT INTO #ReplaceStrings (symb, replace_char) VALUES ('/','|'), (',','|')
DECLARE @OMG VARCHAR(200)
SET @OMG = 'ABC,DE/F'
SELECT @OMG = Replace(@OMG, symb, replace_char)
FROM   #ReplaceStrings
select @OMG

Next, the author goes on to … write a function.
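A sketch of what such a function might look like.  Note the assumption on my part that the pairs live in a permanent table (dbo.ReplaceStrings here), since a function can’t see a #temp table created outside it:

```sql
-- hypothetical sketch; dbo.ReplaceStrings is a permanent copy of #ReplaceStrings
create function dbo.CleanString (@in varchar(200))
returns varchar(200)
as
begin
    declare @out varchar(200) = @in;
    -- apply every (symb, replace_char) pair in turn
    select @out = replace(@out, symb, replace_char)
    from dbo.ReplaceStrings;
    return @out;
end
```

Then `select dbo.CleanString(bad_column_value) from ...` – and since a scalar UDF runs once per row, that is exactly why it would probably be slower.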

apply

i have grown in my t-sql stature:  i used cross-apply twice in the last week.

There are a number of relevant articles about apply\cross apply\outer apply.  I particularly like what Rob Farley says about apply: “APPLY, in both its inner (CROSS APPLY) and outer (OUTER APPLY) forms, allows a correlated sub-query or table-valued function to be part of the FROM clause.”  This means that you can have a subquery that returns values keyed to the same value in the outer query.  It is commonly said that you run the apply function for each row of the outer function.

I have used it twice:  once when i had to pull the first (max) value from a selection to add to a table.  Patient A, diagnosis 1, diagnosis 2, diagnosis 3, diagnosis 4 (actually both the icd9 codes and the name or description).  Create the outer table with a select into statement with the columns for patient and all the other details.  Include select columns for diagnosis1.code and diagnosis1.name, and so on four times.  Obviously, we haven’t defined what diagnosis1 is yet, so the query will fail – or rather, work if we comment out the four\eight columns.

For each of the diagnoses, create an outer-apply code block immediately after the joins (and/or groups) but before any final where predicates.  In my case, it looks like:

outer apply
(select sd.diagnosis_code_id as code, dc.description as name
 FROM Service_Diagnoses sd
 join Diagnosis_codes dc
   on dc.Diagnosis_Code_ID = sd.Diagnosis_Code_ID
 where sd.Service_ID = ser.Service_ID
 order by sd.service_diagnosis_id
 offset 0 rows fetch first 1 rows only
) as diagnosis1

Notice what is happening.  For each row of the main query, i get the diagnosis codes and names for that service – ser.Service_ID is passed in from the main table.  Next – they are ordered, and only the first one is returned to the outside table as diagnosis1.code and diagnosis1.name.

Next, the same code for diagnosis2 – except that i return the SECOND value in the list with

offset 1 rows fetch first 1 rows only

and similarly with three and four.
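Put together, the shape of the whole thing is something like this – the apply blocks are from my example, but the outer select is a hypothetical reconstruction:

```sql
select ser.Service_ID,
       diagnosis1.code as dx1_code, diagnosis1.name as dx1_name,
       diagnosis2.code as dx2_code, diagnosis2.name as dx2_name
       -- ...same pattern for diagnosis3 and diagnosis4
into #service_dx
from Services ser
outer apply
    (select sd.diagnosis_code_id as code, dc.description as name
     from Service_Diagnoses sd
     join Diagnosis_codes dc on dc.Diagnosis_Code_ID = sd.Diagnosis_Code_ID
     where sd.Service_ID = ser.Service_ID
     order by sd.service_diagnosis_id
     offset 0 rows fetch first 1 rows only) as diagnosis1
outer apply
    (select sd.diagnosis_code_id as code, dc.description as name
     from Service_Diagnoses sd
     join Diagnosis_codes dc on dc.Diagnosis_Code_ID = sd.Diagnosis_Code_ID
     where sd.Service_ID = ser.Service_ID
     order by sd.service_diagnosis_id
     offset 1 rows fetch first 1 rows only) as diagnosis2;
```

OUTER APPLY (rather than CROSS APPLY) keeps the services that have fewer than four diagnoses – the missing ones just come back null.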

The second case was somewhat similar – vouchers are paid by a carrier (or two), and we wanted to know the number of services on each voucher.  Shouldn’t be an issue, but as i built out the voucher and carrier component, each time i tried to add count(services) into the join, the count of records exploded.  And the counts were all 1 instead of the counts for each voucher, which could be numerous.  I truly think i was doing something wrong with the group-bys that i had to do to put the count function into the big table, but i never got it to work.

What DID work was to write the table as before and include a column in the initial section that was “a.acount”.  The cross apply select was created “as a”, and i selected the count of services for each voucher as acount.  It didn’t work at first – i kept getting 5 million instead of 12 – until i realized that in the cross apply, i had joined vouchers to services as a JOIN instead of just correlating on the join condition.  This is

cross apply 
(select count(ser.service_id) as acount 
from pm.services ser
join  pm.vouchers vou on 
vou.voucher_id = ser.voucher_id) as a

when it SHOULD BE

cross apply 
(select count(ser.service_id) as acount 
from pm.services ser
where vou.voucher_id = ser.voucher_id) as a

Voucher (vou) isn’t even defined in the cross-apply – it gets passed in, and a.acount gets passed out when the query runs.
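In context, the corrected block sits in the FROM clause like any other table source – a minimal sketch, with pm.vouchers assumed as the outer table:

```sql
select vou.voucher_id, a.acount as service_count
from pm.vouchers vou
cross apply
    (select count(ser.service_id) as acount
     from pm.services ser
     where vou.voucher_id = ser.voucher_id) as a;
```

Because the inner select is a bare aggregate it always returns exactly one row, so even a voucher with zero services survives the CROSS APPLY, with acount = 0.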

Pivot function

You have rows and rows and wish you had more columns.  You have 10 clients, 5 years, and a list of activity/orders/contacts.  Lots of rows.  You would rather have 5 rows and a count column for each of the clients.  That is what the pivot function does.  You could probably write code to do this, which is the same as reinventing the wheel – but why work so hard?

The following is annotated syntax for PIVOT from Microsoft

SELECT <non-pivoted column>,
[first pivoted column] AS <column name>,
[second pivoted column] AS <column name>,
...
[last pivoted column] AS <column name>
FROM
(<SELECT query that produces the data>)
AS source_alias
PIVOT
(
<aggregation function>(<column being aggregated>)
FOR
[<column that contains the values that will become column headers>]
IN ( [first pivoted column], [second pivoted column],
… [last pivoted column])
) AS PivotTable_alias
<optional ORDER BY clause>;

Let’s try to break this down.  What do you want in each row of the first column?  In our example, that would be years; years is the <non-pivoted column>.  Next column in the first row would be the appropriate number for client 1, as [client 1], then client 2 as [client 2] and so on.  So all you need are years, the clients and the count aggregate.  If you have other data, at least one recommendation is to create a CTE or derived table to simplify the data and the syntax.  Otherwise, you could be grouping on years, countries, states, et al as the non-pivoted columns.  You would then have 5 (years) X number of countries x number of states as the number of rows.  Client aggregates remain the pivoted columns.  If you don’t want the extra columns and wanted everything rolled up into years, don’t select the extra columns in the select FROM select query.

Next, the FROM query; give this an alias.

FROM (Select year, client_id, activity_id from activities) as source.

Now the magic – immediately following the source alias, continue with (next line is OK…)

PIVOT (aggregation FOR … IN … ) as PT.  Run that and you get your desired output.  One catch: the FOR column can’t also be the column you aggregate, which is why activity_id rides along in the source query just to be counted.

PIVOT (count(activity_id) FOR client_id IN ([client_1], [client_2], [client_3], …[client_10])) as PT

Let’s try that again – diagram your desired output:  years are the rows, the columns are for each client, and the number to display is the count for each client – for each year.  The fiddly bit is defining the column headers and putting them back into the IN clause.
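Here is a minimal end-to-end sketch with made-up names (dbo.activities, three clients instead of ten).  The members of the IN list are the literal client_id values, bracketed:

```sql
select [year],
       [1] as client_1, [2] as client_2, [3] as client_3
from (select [year], client_id, activity_id
      from dbo.activities) as source            -- hypothetical table
pivot (count(activity_id)                       -- can't aggregate the FOR column itself
       for client_id in ([1], [2], [3])) as PT  -- headers are client_id values
order by [year];
```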

update table with Join

it’s easy to update an existing sql table. In T-sql:

update table set column1 = 'something'

This updates all column1 values to ‘something’

More typically, you want to update only SOME rows, so you would use a where predicate

update states set ST = 'NY' where ST = 'New York'

or

update sales set discount = discount + .1 where client_type = 'super saver'

What i need to do is to update table 1 based on a matching value in table 2. The syntax goes all to hell.

update t1 set t1.column1 = t2.column1
FROM theFirstTable as t1
JOIN theSecondTable as t2 on t1.ID = t2.ID
where ...

I have this big old table that takes nearly an hour to build. I needed to add three columns to the schema – which actually involved three separate joins on the same second table, since i needed three different join conditions based on the id. With the additional joins, we got timeout and resource errors. Might not have been the complexity, but having activity supply disorder, i wanted to get it done asap.

I created the three new columns as “,null ColumnName”. After running the creation query, i had my original table with three new columns, each with the proper Name and each filled with nulls. Next, rather than the three joins to the Name table, i ran an update on the original table three times. Update the big table Name1 column with the matching value from the second table. Then update the big table Name2 column with the matching value from the second table. Rather than dying from resource errors after ten minutes on several separate runs, i got my big table with nulls quickly, and each of the three updates took 5 minutes for 5 million rows.

My actual code, in dynamic SQL, looks like this

set @cmd = 'update pay
set pay.Referring_dr_id = pm.pm_provider_id 
from ' + @dbname + '.dbo.qt_service_billing_payments3 pay
join ' + @dbname + '.dbo.pm_providers pm 
on pm.abbreviation = pay.Referring_Dr_Abbr
where pm.abbreviation = pay.billing_dr_abbr'
-- exec(@cmd)

In hindsight, you could do it in one pass with case logic.
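A sketch of that one-pass idea – here with three aliased left joins rather than CASE, since each lookup hits pm_providers on a different abbreviation column; the Billing/Service column names are my guesses, not the real schema:

```sql
-- hypothetical one-pass version: one update, three aliased joins
update pay
set pay.Referring_dr_id = r.pm_provider_id,
    pay.Billing_dr_id   = b.pm_provider_id,
    pay.Service_dr_id   = s.pm_provider_id
from dbo.qt_service_billing_payments3 pay
left join dbo.pm_providers r on r.abbreviation = pay.Referring_Dr_Abbr
left join dbo.pm_providers b on b.abbreviation = pay.Billing_Dr_Abbr
left join dbo.pm_providers s on s.abbreviation = pay.Service_Dr_Abbr;
```

The left joins mean a row with no match in one lookup still gets the other two columns filled, with null for the miss.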

load testing

David Klee has a couple of articles about load-testing.  That is – running software utilities to simulate various kinds of load on SQL Server and the horse it rode in on – the server.  Since we can have SQL in VMWare\HyperV, and VMWare Player can run on a linux box, the server as hardware or as the OS is not strictly defined.  Whatever.

David Klee has an article on his blog about using SQLIO Batch, HammerDB, DiskSpd and automating DiskSpd with powershell.

Now, as you run these things (and collect the typical output) you can also start monitoring extended events to look at what MS SQL is doing as I/O or stress levels change.

Mike Fal has a blog post on putting “lines in the sand” (my capitalization, not his) working with HammerDB.

Brent Ozar champions Adam Machanic’s SQLQueryStress application, which runs multiple sessions of a random selection of stored procedures.

version control for this dummy

More often than i’d like to admit, i have started to make changes in a stored procedure (or any other random act of programming) and only eventually realized that 1) it was a bigger deal than i thought, and 2) it no longer works.  And i REALLY HATE to walk away from code that isn’t working.

The obvious (ha!) solution is to clone my USP before making changes (modify, change alter to create, change the name, execute the code) , but sometimes that is too simple to work for me.  (“harder, harder, must make it harder…”).

The general solution is to think about source control or version control.  Unfortunately, since the document of interest, the USP, is stored inside of the MS SQL database, it is difficult to use traditional version-control software like Subversion to back it up.  There are a couple of expensive solutions for this, and i GUESS i could go to my boss and say that all of us should use something and that he should buy it and reward me for being the first person to notice there was a problem.  Guess whose problem THAT would be?

Other workarounds would be to copy\paste from the console into a text document on my local machine and use traditional source control from there, or script\save the stored procedures as a .sql file on my local machine.  There is a script wizard that would do that, but i haven’t learned how to script or automate it, and t-sql output seems limited to 4000 odd characters for display.  And i got the oddest stuff if i output the results of querying the description property (the text) to a file.

What i am leaning towards is a low-cost hybrid solution.  Anup Saund has a project on gitHub called .gitSQL.  The free version saves only 4 items of each object type; the $40/year version is unlimited.  What it does is interrogate your sql server, list the databases, let you pick one, and show you the stored procedures, tables, views and functions.  You can select all or some of them, and export them to your local machine.  I had some issues with the initial import.  Silly me checked the “import new” button, and that was a mistake.  It also took a long time to initially read the database, but now it comes back up fast.

What you get on your local machine are .sql files that will, if imported back to the server, recreate the object.  This script is familiar, since you could generate it with “script to > create” in SSMS.  But this file also contains the entire code that you would be editing locally, so you can just cut/paste into SSMS if the current code is not running.

So right now i have four files that correspond to four USPs from Data_2.  Anytime i am going to work on a stored procedure, i can put an updated local copy on my machine, or backup ALL the USP scripts.  However – there is something else – the local script contains not only the code for the USP, it also contains any code to create stored procedures that your stored procedure references, any output tables (at least the schema) and other assorted code.  Since i am going to want to just copy\paste a working copy of the code, this is not a real problem for me.

Secondly, i have also installed GIT on my machine, and the .gitSQL destination is a ‘git init-ed’ directory.  So i open .gitSQL and a GIT client (GIT BASH is going to be fine), and each time i get nervous, i can GIT ADD and GIT COMMIT to source control.

This means 30 seconds or so of work at various stages of my editing.  i don’t see this as a major issue – every 10-20 minutes or so i SHOULD take a 30 second break anyway.

I can also back up SQL tables – create schema and import data in the .sql file or just the data into a flat file.  Not good for backups, but fine for utility tables, such as tables2process.

This just might work for me — and maybe i can claim it as a tax deduction.

NUMA nodes

What is NUMA?  National Underwater and Marine Agency, (Clive Eric Cussler) and there actually is such a place, although they probably don’t have fleets of ships and a former President who is POTUS.

Non-Uniform Memory Access (NUMA) is a condition where different regions of RAM have different access characteristics.  By that i mean that because RAM is large and there are lots of really fast CPUs, addressing one region of memory will take longer than a “closer” region for a particular chip.  In this case, it makes sense for a CPU to use memory in its neighborhood.  Or more correctly, the data that a CPU is storing and working with is best kept in one ‘node’ that is highly accessible to that CPU.

Jonathan Kehayias of SQLskills indicates that this query

select * from sys.dm_os_memory_nodes

will return data like this

memory_node_id virtual_address_space_reserved_kb ... more columns
0              99838476 
64             0

on a non-numa configured machine.  If there are configured numa nodes, there will be more than just two memory_node_id entries.  The second node (64) is the DAC.  You can get a similar view from this

select * from sys.dm_os_nodes
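A few columns are enough to see the node layout – this sketch filters out the DAC node, whose node_state_desc includes ‘DAC’:

```sql
select node_id, node_state_desc, memory_node_id, online_scheduler_count
from sys.dm_os_nodes
where node_state_desc not like '%DAC%';
```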

Glenn Berry has a script that gives information about cores and CPUs, et al

-- Hardware information from SQL Server 2012
-- (Cannot distinguish between HT and multi-core)
 SELECT cpu_count AS [Logical CPU Count]
, hyperthread_ratio AS [Hyperthread Ratio]
, cpu_count/hyperthread_ratio AS [Physical CPU Count]
, physical_memory_kb/1024 AS [Physical Memory (MB)]
, committed_target_kb/1024 AS [Committed Target Memory (MB)]
, max_workers_count AS [Max Workers Count]
, affinity_type_desc AS [Affinity Type]
, sqlserver_start_time AS [SQL Server Start Time]
, virtual_machine_type_desc AS [Virtual Machine Type]
FROM sys.dm_os_sys_info WITH (NOLOCK) OPTION (RECOMPILE);