Working with XML

XML data is a universal standard for storing human readable, self-documenting data.  It is useful for configuration files, data interchange files and non-relational data in database servers.  Non-relational in that a XML document might have different sized clusters of data entries, but within a cluster, each entity (think row here) doesn’t have to have the same attributes or values.  The document could be about a business, with part of it referring to the physical plant, a second part to the employees, only one of whom we know their favorite color.  The XML structure can support that.

XML data is text and can be stored in a varchar or nvarchar field.  This might be sufficient if all we want to do is retrieve the data, but MS SQL has a XML Datatype that allows additional functions – there is a built-in XML editor to show the structure of each XML item, and there are other methods that allow querying, updating and inserting XML data.

One approach to working with XML data is to ‘flatten’ it into normal rows and columns. this is straight-forward when you are dealing with non-nested data: groups of users, each with one first name, one second name, one collection of single elements. It becomes more difficult with you have a skills element with zero to many skill elements nested in that element. This is putting you into the situation of skill column one with an entry, and for the second user, two columns for skill one and for skill two.

CROSS APPLY

By using cross-apply, we can get around this. A great article on blogs.msdn.com by Simon Ince documents this well

First the setup:

Save this XML file as as c:\users\rbeck\desktop\new.xml

<?xml version="1.0" ?>
<Root>
 <Person>
 <Name>Simon</Name>
 <Age>20</Age>
 <Skills>
 <Skill>Cooking</Skill>
 <Skill>Cleaning</Skill>
 </Skills>
 </Person>
 <Person>
 <Name>Peter</Name>
 <Age>21</Age>
 <Skills>
 <Skill>Ironing</Skill>
 </Skills>
 </Person>
</Root>

Create a table and import the data

Use MIU
CREATE TABLE XmlSourceTable
(
 RecordId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
 XmlData XML NOT NULL
)

notice the XML data type.

Import the data and display it

INSERT INTO XmlSourceTable(XmlData)
SELECT *
FROM OPENROWSET(
 BULK 'C:\users\rbeck\desktop\new.xml', SINGLE_BLOB)
AS ImportSource
GO
SELECT * FROM XmlSourceTable

RecordId XmlData
———– ——-
1 <Root><Person><Name>Simon</Name><Age>20</Age… (snip)
(1 row(s) affected)

One row for each file/xml document. If we import this twice, we get the same XML data, and RecordIDs of 1 and 2. The rest of the output below is just duplicated. If the second import was different data (but the same schema) that would just look like more (and not duplicated) data.

Query the data

SELECT
 pref.value('(Name/text())[1]', 'varchar(50)') as PersonName,
 pref.value('(Age/text())[1]', 'int') as PersonAge,
 pref.query('Skills') as PersonSkills
FROM 
 XmlSourceTable 
 CROSS APPLY
 XmlData.nodes('/Root/Person') AS People(pref)

PersonName       PersonAge         PersonSkills
———- ——— ———————-
Simon                          20                <Skills><Skill>Cooking</Skill><Skill>Clean… (snip)
Peter                           21                <Skills><Skill>Ironing</Ski… (snip)

Notice that Simon has more skills than Peter. In this case, try this:

SELECT
 pref.value('(Name/text())[1]', 'varchar(50)') as PersonName,
 sref.value('(text())[1]', 'varchar(50)') as PersonSkill
FROM 
 XmlSourceTable 
 CROSS APPLY
 XmlData.nodes('//Person') AS People(pref) 
 CROSS APPLY
 pref.nodes('Skills/Skill') AS Skill(sref)

PersonName PersonAge PersonSkill
Simon            20                Cooking
Simon            20                Cleaning
Peter             21                 Ironing

The magic is done by CROSS APPLY and by the XMLData.nodes function.

From BOL: “The APPLY operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query.” I am going to simplify that to “the nodes table-valued function is applied to each pref.value and the one pref.query – the output is concatentated left to right.

When we use xxx.query, we get back XML, even if it is not enclosed in an element. If we use xxx.value, we get back the value – we need to specify the type of data (int, varchar()) and the explict path:

“Although there is only one PersonName or PersonAge attribute in the XML instance, the static typing rules require you to explicitly specify that the path expression returns a singleton. Therefore, the additional [1] is specified at the end of the path expression” (emphasis from rcb)

If we apply the value to the skills element and a path of [1] and a varchar datatype, we get back “CookingCleaning” for Simon

pref.value('Skills[1]', 'varchar(50)') as PersonSkills

The nodes method returns an xml fragment that matches the path supplied. It puts the result into a table with the supplied name and into the column in the parenthesis:

XmlData.nodes('/Root/Person') AS People(pref)

return xml that is found below Root in the or all Person elements

another example of using cross-apply is from The Cheater’s Guide to Mastering XQuery on SQL Server – Part 1.  Unfortunately, there is no Part 2.

His data looks like this

<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
 <gesmes:subject>Reference rates</gesmes:subject>
 <gesmes:Sender>
 <gesmes:name>European Central Bank</gesmes:name>
 </gesmes:Sender>
 <Cube>
 <Cube time="2011-01-14">
 <Cube currency="USD" rate="1.3349" />
 <Cube currency="JPY" rate="110.71" />
 <Cube currency="GBP" rate="0.8422" />
 </Cube>
 <Cube time="2011-01-13">
 <Cube currency="USD" rate="1.3199" />
 <Cube currency="JPY" rate="109.5" />
 <Cube currency="GBP" rate="0.83575" />
 </Cube>
 <Cube time="2011-01-12">
 <Cube currency="USD" rate="1.2973" />
 <Cube currency="JPY" rate="108.17" />
 <Cube currency="GBP" rate="0.83155" />
 </Cube>
 </Cube>
</gesmes:Envelope>

Notice the namespace – there is a problem coming up.

His first select statement looks like this

SELECT
T.rows.value('local-name(.)','nvarchar(100)') as [TagName]
FROM
 ReferenceRates
CROSS APPLY
 xmldata.nodes('//*') T(rows)

This gets  the top level tags from the references table we created and populated with the data, select the value of the XML cross apply the nodes method to the root of the document.

Notice also the irregular tag name cube, which is used at several levels. The cross apply nodes method can be started at some other place than the root (//). For example, we can push it down to the second level of cube with

CROSS APPLY xmldata.nodes('/gesmes:Envelope/Cube//*') T(rows)

However, in his example he gets an error – by starting the search at cube\cube, we lose the namespace reference which is in the envelope element. He gets around that with the WITH XMLNAMESPACE prefix. (You can also specify the namespace in the query or nodes methods of XQuery).

WITH XMLNAMESPACES(
'http://www.gesmes.org/xml/2002-08-01' as gesmes,
DEFAULT 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'
)

He also has an example of cross apply … cross apply that looks like a pivot table – it puts each cube entry cube values (the different times, currencies and rates) as columns.

 
id TagName Time      Currency Rate     xmlfragment
A neat grid of all currency rates. by date.
1  Cube    2011-01-14 USD     1.3349  <p1:Cube xmlns:p1=”h…
1  Cube    2011-01-14 JPY     110.71  <p1:Cube xmlns:p1=”h…
1  Cube    2011-01-14 GBP     0.8422  <p1:Cube xmlns:p1=”h…
1  Cube    2011-01-13 USD     1.3199  <p1:Cube xmlns:p1=”h…
1  Cube    2011-01-13 JPY     109.5   <p1:Cube xmlns:p1=”h…
1  Cube    2011-01-13 GBP     0.83575 <p1:Cube xmlns:p1=”h…
1  Cube    2011-01-12 USD     1.2973  <p1:Cube xmlns:p1=”h…
1  Cube    2011-01-12 JPY     108.17  <p1:Cube xmlns:p1=”h…
1  Cube    2011-01-12 GBP    0.83155  <p1:Cube xmlns:p1=”h…

Here is the XQUERY standard reference from Microsoft

He also has a link to an IBM developerworks article talking about XML namespaces ().

OPENXML

Another approach is to use OPENXML – which is a rowset provider that makes XML stored in SSMS xml datatypes viewable as if it were relational table or a relational view of the xml data.

sp_xml_preparedocument puts the XML into memory and returns a handle to the location that OPENXML will then use. sp_xml_removedocument will then trash the inmemory entity.

http://thinkingcog.com/2013/10/default.aspx has an example

DECLARE @XML xml = 
'<Students>
 <Student id="1">
 <FName>Parakh</FName>
 <LName>Singhal</LName>
 <Age>30</Age>
 <Courses>
 <Course id="1">Fundamentals of Databases</Course>
 <Course id="10">Fundamentals of Networking</Course>
 <Course id="15">Fundamentals of Security</Course>
 </Courses>
 </Student>
 <Student id="2">
 <FName>Glen</FName>
 <LName>Bennet</LName>
 <Age>31</Age>
 <Courses>
 <Course id="12">Fundamentals of Data Warehousing</Course>
 <Course id="15">Fundamentals of Security</Course>
 </Courses>
 </Student> 
</Students>';
declare @pointer int
EXEC sp_XML_preparedocument @pointer OUTPUT, @XML;
SELECT
'columns'
FROM OPENXML(@pointer,'/Students/Student',2)
WITH
(StudentID int '@id', 
StudentFirstName varchar(50) 'FName', 
StudentLastName varchar(50) 'LName',
StudentAge int 'Age',
EnrolledCourse1 varchar(50) '(Courses/Course)[1]',
...
EnrolledCourseN varchar(50) '(Courses/Course)[N]'
);
EXEC sp_xml_removedocument @docpointer;

In case you are wondering, if student 100 does not have N courses, a null is returned for the missing courses. In the example, @id is an attribute of the Student element. There is also a course_id attribute, but this code does not get it.

I have modified the code to support the course id attribute. First put the item(s) into the select part of the document.

SELECT
StudentID,
StudentFirstName,
..
EnrolledCourse1,
CourseID1, <-- added
EnrolledCourse2,
CourseID2, <-- added
EnrolledCourse3,
CourseID3 <-- added

Notice that the with FROM OPENXML(@pointer,’/Students/Student’,2), you have provided the starting point to query the student element. Courses are nested inside of student, so you need to provide the augmented path to the course element in order to query the course title, which is the entire contents of the course element. Since we can have 1 to N entries (in our example 3), we need the exact path to the expected element [1],[2] or [3].

EnrolledCourse1 varchar(50) '(Courses/Course)[1]'

When we go after the attribute for the first course, we query for @id but still need the path

EnrolledCourse1 varchar(50) '(Courses/Course)[1]',
CourseID1 int '(Courses/Course[1])/@id',

and variously for all three course elements. The second student has only two courses so the ID and the Name are reutnred as null

nodes method

You can also use the nodes method rather than OPENXML. You don’t need anything like EXEC sp_xml_preparedocument and EXEC sp_xml_removedocument since the engine extracts the data from the XML directly with XQuery

select 
columns
FROM @XML.nodes('/Students/Student')
as StudentTable(Student)

We get the structure as a rowset, starting with the students/student path, store it in StudentTable as the column student. From the rowset, we select the individual values (in the very first example we also selected XML with query) using code like this (i edited the original paper to include the course_id attribute

SELECT
Student.value('@id','int') as StudentID,
Student.value('(FName/text())[1]','varchar(50)') as StudentFirstName,
Student.value('(LName/text())[1]','varchar(50)') as StudentLastName,
Student.value('(Age/text())[1]','int') as StudentAge,
Student.value('(Courses/Course/text())[1]','varchar(50)') as EnrolledCourse1,
Student.value('(Courses/Course/@id)[1]','int') as EnrolledCourse1ID, <-- added
Student.value('(Courses/Course/text())[2]','varchar(50)') as EnrolledCourse2,
Student.value('(Courses/Course/@id)[2]','int') as EnrolledCourse1ID, <-- added
Student.value('(Courses/Course/text())[3]','varchar(50)') as EnrolledCourse3
FROM @XML.nodes('/Students/Student')
as StudentTable(Student)

This blog has a stored procedure to examine the schema of an XML table

AD HOC XML File Queries

An interesting idea is suggested by Seth Delconte in SimpleTalk.  Store the XML in the file system (c:\myXML.xml).  Use OPENROWSET to read the xml into a CTE with an XML datatyle.  Or perhaps, create a function to return an XML variable that can be up to 2GB in content.

Now that we have XML data or at least an XML fragment, work with XQuery FLOR, XQuery nodes or XQuery value methods.

XML INDEXES

Still another approach is to leave the XML along and put a primary and perhaps secondard index on the XML column.  This makes XQUERY value, nodes or other methods run twice or more as fast.  Information on “Getting Started With XML Indexes” available and more on Selective XML indexes here.

FOR XML RAW

if we want to get XML format  from SSMS, we can use the for XML syntax as explained in a SimpleTalk article. Options are “for XML raw”, “for XML auto”, “for XML explicit” or “for XML Path”.

Just as a side note, from XML is an excellent way to document the format of a table if the rows and columns are running off the screen – you can return the header as XML and the first line of results also as XML and use this to describe the table.

For XML raw returns each row as an element, each element as an value expressed as an attribute of the column name. Using the simpletalk example

select * from stores for xml raw

<row StoreID=”292″>
<Survey_untyped>
<StoreSurvey>
<AnnualSales>800000</AnnualSales>
<AnnualRevenue>80000</AnnualRevenue>
<BankName>United Security</BankName>
<BusinessType>BM</BusinessType>
<YearOpened>1996</YearOpened>
<Specialty>Mountain</Specialty>
<SquareFeet>21000</SquareFeet>
<Brands>2</Brands>
<Internet>ISDN</Internet>
<NumberEmployees>13</NumberEmployees>
<Products Type=”Bikes”>
<Product>Mountain</Product>
<Product>Road</Product>
<Product>Racing</Product>
</Products>
<Products Type=”Clothes”>
<Product>Jerseys</Product>
<Product>Jackets</Product>
<Product>Shorts</Product>
</Products>
</StoreSurvey>
</Survey_untyped>

SELECT * FROM Stores for XML raw ('root')

just names the root element, in this case the case of column (survey_untyped) as ‘root’. You probably want to go the other way – make root into something more useful. If you can comparing versions, you can ‘for XML raw (‘v1′)’ et al.

And this is returning the attributes as attributes. You cAn force them into elements, for possible ease of querying, by specifying elements to the XML ROW method

 SELECT * FROM Stores for XML raw ('root'), ELEMENTS;

Further, you can wrap the entire output inside an XML element, which makes it valid XML:

SELECT * FROM Stores for XML raw ('survey'), ROOT('VERSION1')

One use for this is to take row column data, which you can query by column name or alias and convert it into valid XML.

FOR XML PATH

For XML path returns differently formatted XML – you get to indicate what the path is and name the root.  If you finished the select with

FOR XML PATH('customer'), ROOT('customers')

you would get a root of customers and a top element of customer.  The individual elements would be named as you select them with aliases or with the field name.  Or you can put in nothing at all.  There is a clever way to use XML Path to string together common separated values in a query to get a field with several values.

SELECT CAT.Name AS [Category],
 STUFF(( SELECT ',' + SUB.Name AS [text()]
 FROM Subcategory SUB
 where SUB.ID = CAT.ID
 FOR XML PATH('') 
 ), 1, 1, '' )
 AS [Sub Categories]
FROM Category CAT

this will give you cat1               subcata, subcatb, subcatc
cat2               sub2a, sub2b, sub2c

so that you no longer have to tell people that ‘it’s too difficult’ if this is the format they want.

Examples

A nice brief tutorial on XML is from Joel Lipman

He postulates an xml datatype column “Event_XML’ in a table “XML_Events” .  The root is Staff

Either of these queries will return a column that will contain ‘male’ for each staff entry. – not the XML data

SELECT 
 CAST(CAST(Event_XML AS XML).query('data(/STAFF/GENDER)') 
 AS VARCHAR(10)) AS Gender 
FROM [dbo].[XML_EVENTS]
SELECT 
   CAST(Event_XML AS XML).value('data(/STAFF/GENDER)[1]',
   'VARCHAR(10)') AS Gender 
FROM [dbo].[XML_EVENTS]

this will return the attribute financial_year for the current_employee element in staff

SELECT CAST(
   CAST(Event_XML AS XML)
   .query('data(/STAFF/CURRENT_EMPLOYEE/@financialyear)') 
   AS INT(4)) AS FinancialYear 
FROM [dbo].[XML_EVENTS]

if you look at the query, understand this is a normal t-sql query that also has support for XML via query or value (or node) with XPAth. So it is not surprising we can use a where clause

SELECT * FROM [dbo].[XML_EVENTS] WHERE 
    CAST(Event_XML AS XML)
   .value('(/STAFF/GENDER="MALE")[1]','VARCHAR(10)')='TRUE' 
AS Gender

Note that this will return all the columns, including an XML datatype which you can access with the XML editor

If we want to return the XML, use nodes as below

-- to get all below nodes from root node - STAFF 
 SELECT Event_XML.query('/STAFF') 
 FROM [dbo].[XML_EVENTS]
 -- to get all element nodes named 'EMPLOYEE_NUMBER' 
 SELECT Event_XML.query('//EMPLOYEE_NUMBER') 
 FROM [dbo].[XML_EVENTS]
 -- to get employee node with ID = 1 
 SELECT Event_XML.query('//EMPLOYEE_NUMBER[@ID = 1]') 
 FROM [dbo].[XML_EVENTS]
 -- to get employee node at position = 1 from root node 
 SELECT Event_XML.query('/STAFF/EMPLOYEE_NUMBER[position()=1]') 
 FROM [dbo].[XML_EVENTS]
 -- to get employee node at last position from root node 
 SELECT Event_XML.query('/STAFF/EMPLOYEE_NUMBER[last()]') 
 FROM [dbo].[XML_EVENTS]
 -- to return unknown nodes from root node - STAFF 
 SELECT Event_XML.query('/STAFF/*') 
 FROM [dbo].[XML_EVENTS]
 -- to return unknown nodes from XML column - Event_XML 
 SELECT Event_XML.query('//* ') 
 FROM [dbo].[XML_EVENTS]

if we want to look for strings in the elements

-- exist() returns 1 or 0 if a node name exists 
-- containing "NAME" or not respectively 
 SELECT Event_XML.exist('(/*/*[1][contains(local-name(.),"NAME")])')
 FROM [dbo].[XML_EVENTS]
-- query() returns XML branch if the node name 
-- contains string "NAME" -- returns <SURNAME>LIPMAN</SURNAME> 
 SELECT Event_XML.query('(/*/*[1][contains(local-name(.),"NAME")])')
 FROM [dbo].[XML_EVENTS]
-- query() returns XML branch if the node value contains
-- string "MAN" -- returns <SURNAME>LIPMAN</SURNAME> 
 SELECT Event_XML.query('(/*/*[1][contains(.,"MAN")])') 
 FROM [dbo].[XML_EVENTS]
-- value() returns value of node if the node name 
-- contains string "NAME" -- returns LIPMAN 
SELECT Event_XML
.value('(/*/*[1][contains(local-name(.),"NAME")])[1]','NVARCHAR(50))
FROM [dbo].[XML_EVENTS]
 -- value() returns value of node if the node value 
-- contains string "MAN" -- returns LIPMAN 
SELECT Event_XML
.value('(/*/*[1][contains(.,"MAN")])[1]','NVARCHAR(50)') 
 FROM [dbo].[XML_EVENTS]
Advertisements