MAX Memory in an active cluster.

In a single-node instance, you want to reserve some server memory for the OS and other activities (SSMS on the box, RDP into the box, backups, antivirus, et al.).  If you don't do this, MS SQL keeps its default max memory setting of a huge value, which means "take everything you can get".  That in turn means that if/WHEN the OS needs memory to do some OS task, it will have to reclaim it from SQLOS, which probably means dumping cached data pages, index pages, or compiled plans, or disk thrashing.
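As a rough sketch, capping the instance by hand looks like the following (the 114688 MB value is only an illustration for a 128 GB box with a 16 GB OS reserve, not a recommendation):

EXEC sys.sp_configure N'show advanced options', 1
RECONFIGURE
GO
-- Leave roughly 16 GB for the OS on a 128 GB box (example numbers only)
EXEC sys.sp_configure N'max server memory (MB)', N'114688'
RECONFIGURE
GO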

OK, got it.  Now create a cluster with two nodes.  Set up the max memory, fail over to the other node with the same max memory – no sweat.  NOW, run a second instance on the other node.  We are now Active-Active (is this the correct term?), and when a failover occurs (as it eventually will, even if we are only doing maintenance), we have two instances both wanting most of the RAM on the same physical server:  the total RAM configured for SQL is greater than … the total RAM.

An article on SQLServerCentral discusses how to deal with this.  The obvious solution is to waste half the configured RAM on both machines, so that when the failover happens the incoming instance immediately sees what it expects: nearly half the total physical RAM.

A better approach is to reconfigure the max memory for each instance depending on whether it has a whole server to itself or is sharing.

The first step is to give each instance a data drive on the SAN, with different drive letters.  For each instance, put a blank .txt file on that drive: instance one has E:\instanceOne.txt and instance two has F:\InstanceTwo.txt.
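A job step can test for those flag files with the (undocumented but long-standing) xp_fileexist procedure; a minimal probe, using the example paths above, might look like this:

DECLARE @instanceOneHere INT, @instanceTwoHere INT
-- xp_fileexist sets the output parameter to 1 if the file exists, 0 if not
EXEC master.dbo.xp_fileexist 'E:\instanceOne.txt', @instanceOneHere OUTPUT
EXEC master.dbo.xp_fileexist 'F:\InstanceTwo.txt', @instanceTwoHere OUTPUT
SELECT @instanceOneHere AS InstanceOneHere, @instanceTwoHere AS InstanceTwoHere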

The second step is to write three SQL Agent jobs per instance to reconfigure the max memory settings (six jobs in total for two instances; with three instances the count per instance and per cluster would change).  JOB one gives the instance all available memory (minus the system reserve).  JOB two is what we want for each instance if we are sharing.  This COULD be half of the non-reserved memory, but if the two instances don't really need the same max memory, it could be different.  JOB three is what we want for the OTHER (active?) instance into whose machine we are failing.
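As a sketch, the first two job bodies are just the same call with different numbers (73728 and 36864 MB are placeholders, not recommendations); JOB three needs the linked server piece covered further down:

-- JOB one: this instance owns the whole machine (all memory minus the OS reserve)
EXEC sys.sp_configure N'max server memory (MB)', N'73728'
RECONFIGURE
GO
-- JOB two: this instance's share while both instances are on the same machine
EXEC sys.sp_configure N'max server memory (MB)', N'36864'
RECONFIGURE
GO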

Finally, we need one more SQL Agent job that runs on restart.  It is the same on both instances: on restart, check and run the appropriate local SQL job (i.e., reconfigure MAX or reconfigure SHARED).
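One way to get that run-on-restart behaviour (assuming the job is called, say, CheckMemoryOnStartup; the name is made up here) is an Agent schedule of frequency type 64, which fires whenever the SQL Server Agent service starts:

USE msdb
GO
-- @freq_type = 64 means "start automatically when SQL Server Agent starts"
EXEC dbo.sp_add_jobschedule
    @job_name = N'CheckMemoryOnStartup',
    @name = N'Run when Agent starts',
    @freq_type = 64,
    @enabled = 1
GO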

If the SQL job starts and sees only E: or F:, the correct memory is MAX.  If the SQL job sees both E: and F: (i.e., we have two instances on the same machine), configure your own instance to its shared max and the OTHER instance to its shared max.  When things are fixed and the machines restarted, we check, see only our own drive, and reconfigure for MAX again.
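Putting that together, a hedged sketch of the startup job's only step, reusing the xp_fileexist probe from above and starting whichever local job applies (Reconfigure MAX / Reconfigure SHARED match the names above; Reconfigure OTHER is a made-up name for JOB three):

DECLARE @instanceOneHere INT, @instanceTwoHere INT
EXEC master.dbo.xp_fileexist 'E:\instanceOne.txt', @instanceOneHere OUTPUT
EXEC master.dbo.xp_fileexist 'F:\InstanceTwo.txt', @instanceTwoHere OUTPUT

IF @instanceOneHere = 1 AND @instanceTwoHere = 1
BEGIN
    -- Both flag files visible: two instances are sharing this machine
    EXEC msdb.dbo.sp_start_job @job_name = N'Reconfigure SHARED'   -- JOB two, local
    EXEC msdb.dbo.sp_start_job @job_name = N'Reconfigure OTHER'    -- JOB three, remote via linked server
END
ELSE
BEGIN
    -- Only our own drive is visible: we have the machine to ourselves
    EXEC msdb.dbo.sp_start_job @job_name = N'Reconfigure MAX'      -- JOB one
END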

That all makes sense, BUT what about the machine that stays up?  Why will that SQL instance think it is in "restart" mode?  The trick is that these are linked servers, so the SQL Agent job on the instance that restarts can invoke sp_configure on the machine that stays up:

EXEC sys.sp_configure N'max server memory (MB)', N'73728'
RECONFIGURE
GO
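Assuming the linked server (called [OtherInstance] here purely for illustration) is defined with RPC Out enabled and show advanced options is already on over there, JOB three can push the shared setting, plus the RECONFIGURE it needs to take effect, at the instance that stayed up:

-- Run against the OTHER instance over the linked server (requires RPC Out = true)
EXEC [OtherInstance].master.dbo.sp_configure N'max server memory (MB)', N'36864'
EXEC ('RECONFIGURE;') AT [OtherInstance]
GO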

So the instance that restarts checks, sees that it is sharing resources, configures itself, configures the instance that was already there, and both instances end up running right.  When we fix machine two, we fail instance two back to machine two, SQL restarts, sees it is alone, configures max memory for itself, and removes the hold-back on machine one.

Or so I think….
