XML data is a universal standard for storing human readable, self-documenting data. It is useful for configuration files, data interchange files and non-relational data in database servers. Non-relational in that a XML document might have different sized clusters of data entries, but within a cluster, each entity (think row here) doesn’t have to have the same attributes or values. The document could be about a business, with part of it referring to the physical plant, a second part to the employees, only one of whom we know their favorite color. The XML structure can support that.
XML data is text and can be stored in a varchar or nvarchar field. This might be sufficient if all we want to do is retrieve the data, but MS SQL has a XML Datatype that allows additional functions – there is a built-in XML editor to show the structure of each XML item, and there are other methods that allow querying, updating and inserting XML data.
One approach to working with XML data is to ‘flatten’ it into normal rows and columns. this is straight-forward when you are dealing with non-nested data: groups of users, each with one first name, one second name, one collection of single elements. It becomes more difficult with you have a skills element with zero to many skill elements nested in that element. This is putting you into the situation of skill column one with an entry, and for the second user, two columns for skill one and for skill two.
CROSS APPLY
By using cross-apply, we can get around this. A great article on blogs.msdn.com by Simon Ince documents this well
First the setup:
Save this XML file as as c:\users\rbeck\desktop\new.xml
<?xml version="1.0" ?>
<Root>
<Person>
<Name>Simon</Name>
<Age>20</Age>
<Skills>
<Skill>Cooking</Skill>
<Skill>Cleaning</Skill>
</Skills>
</Person>
<Person>
<Name>Peter</Name>
<Age>21</Age>
<Skills>
<Skill>Ironing</Skill>
</Skills>
</Person>
</Root>
Create a table and import the data
Use MIU
CREATE TABLE XmlSourceTable
(
RecordId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
XmlData XML NOT NULL
)
notice the XML data type.
Import the data and display it
INSERT INTO XmlSourceTable(XmlData)
SELECT *
FROM OPENROWSET(
BULK 'C:\users\rbeck\desktop\new.xml', SINGLE_BLOB)
AS ImportSource
GO
SELECT * FROM XmlSourceTable
RecordId XmlData
———– ——-
1 <Root><Person><Name>Simon</Name><Age>20</Age… (snip)
(1 row(s) affected)
One row for each file/xml document. If we import this twice, we get the same XML data, and RecordIDs of 1 and 2. The rest of the output below is just duplicated. If the second import was different data (but the same schema) that would just look like more (and not duplicated) data.
Query the data
SELECT
pref.value('(Name/text())[1]', 'varchar(50)') as PersonName,
pref.value('(Age/text())[1]', 'int') as PersonAge,
pref.query('Skills') as PersonSkills
FROM
XmlSourceTable
CROSS APPLY
XmlData.nodes('/Root/Person') AS People(pref)
PersonName PersonAge PersonSkills
———- ——— ———————-
Simon 20 <Skills><Skill>Cooking</Skill><Skill>Clean… (snip)
Peter 21 <Skills><Skill>Ironing</Ski… (snip)
Notice that Simon has more skills than Peter. In this case, try this:
SELECT
pref.value('(Name/text())[1]', 'varchar(50)') as PersonName,
sref.value('(text())[1]', 'varchar(50)') as PersonSkill
FROM
XmlSourceTable
CROSS APPLY
XmlData.nodes('//Person') AS People(pref)
CROSS APPLY
pref.nodes('Skills/Skill') AS Skill(sref)
PersonName PersonAge PersonSkill
Simon 20 Cooking
Simon 20 Cleaning
Peter 21 Ironing
The magic is done by CROSS APPLY and by the XMLData.nodes function.
From BOL: “The APPLY operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query.” I am going to simplify that to “the nodes table-valued function is applied to each pref.value and the one pref.query – the output is concatentated left to right.
When we use xxx.query, we get back XML, even if it is not enclosed in an element. If we use xxx.value, we get back the value – we need to specify the type of data (int, varchar()) and the explict path:
“Although there is only one PersonName or PersonAge attribute in the XML instance, the static typing rules require you to explicitly specify that the path expression returns a singleton. Therefore, the additional [1] is specified at the end of the path expression” (emphasis from rcb)
If we apply the value to the skills element and a path of [1] and a varchar datatype, we get back “CookingCleaning” for Simon
pref.value('Skills[1]', 'varchar(50)') as PersonSkills
The nodes method returns an xml fragment that matches the path supplied. It puts the result into a table with the supplied name and into the column in the parenthesis:
XmlData.nodes('/Root/Person') AS People(pref)
return xml that is found below Root in the or all Person elements
another example of using cross-apply is from The Cheater’s Guide to Mastering XQuery on SQL Server – Part 1. Unfortunately, there is no Part 2.
His data looks like this
<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
<gesmes:subject>Reference rates</gesmes:subject>
<gesmes:Sender>
<gesmes:name>European Central Bank</gesmes:name>
</gesmes:Sender>
<Cube>
<Cube time="2011-01-14">
<Cube currency="USD" rate="1.3349" />
<Cube currency="JPY" rate="110.71" />
<Cube currency="GBP" rate="0.8422" />
</Cube>
<Cube time="2011-01-13">
<Cube currency="USD" rate="1.3199" />
<Cube currency="JPY" rate="109.5" />
<Cube currency="GBP" rate="0.83575" />
</Cube>
<Cube time="2011-01-12">
<Cube currency="USD" rate="1.2973" />
<Cube currency="JPY" rate="108.17" />
<Cube currency="GBP" rate="0.83155" />
</Cube>
</Cube>
</gesmes:Envelope>
Notice the namespace – there is a problem coming up.
His first select statement looks like this
SELECT
T.rows.value('local-name(.)','nvarchar(100)') as [TagName]
FROM
ReferenceRates
CROSS APPLY
xmldata.nodes('//*') T(rows)
This gets the top level tags from the references table we created and populated with the data, select the value of the XML cross apply the nodes method to the root of the document.
Notice also the irregular tag name cube, which is used at several levels. The cross apply nodes method can be started at some other place than the root (//). For example, we can push it down to the second level of cube with
CROSS APPLY xmldata.nodes('/gesmes:Envelope/Cube//*') T(rows)
However, in his example he gets an error – by starting the search at cube\cube, we lose the namespace reference which is in the envelope element. He gets around that with the WITH XMLNAMESPACE prefix. (You can also specify the namespace in the query or nodes methods of XQuery).
WITH XMLNAMESPACES(
'http://www.gesmes.org/xml/2002-08-01' as gesmes,
DEFAULT 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'
)
He also has an example of cross apply … cross apply that looks like a pivot table – it puts each cube entry cube values (the different times, currencies and rates) as columns.
id TagName Time Currency Rate xmlfragment
A neat grid of all currency rates. by date.
1 Cube 2011-01-14 USD 1.3349 <p1:Cube xmlns:p1=”h…
1 Cube 2011-01-14 JPY 110.71 <p1:Cube xmlns:p1=”h…
1 Cube 2011-01-14 GBP 0.8422 <p1:Cube xmlns:p1=”h…
1 Cube 2011-01-13 USD 1.3199 <p1:Cube xmlns:p1=”h…
1 Cube 2011-01-13 JPY 109.5 <p1:Cube xmlns:p1=”h…
1 Cube 2011-01-13 GBP 0.83575 <p1:Cube xmlns:p1=”h…
1 Cube 2011-01-12 USD 1.2973 <p1:Cube xmlns:p1=”h…
1 Cube 2011-01-12 JPY 108.17 <p1:Cube xmlns:p1=”h…
1 Cube 2011-01-12 GBP 0.83155 <p1:Cube xmlns:p1=”h…
Here is the XQUERY standard reference from Microsoft
He also has a link to an IBM developerworks article talking about XML namespaces ().
OPENXML
Another approach is to use OPENXML – which is a rowset provider that makes XML stored in SSMS xml datatypes viewable as if it were relational table or a relational view of the xml data.
sp_xml_preparedocument puts the XML into memory and returns a handle to the location that OPENXML will then use. sp_xml_removedocument will then trash the inmemory entity.
http://thinkingcog.com/2013/10/default.aspx has an example
DECLARE @XML xml =
'<Students>
<Student id="1">
<FName>Parakh</FName>
<LName>Singhal</LName>
<Age>30</Age>
<Courses>
<Course id="1">Fundamentals of Databases</Course>
<Course id="10">Fundamentals of Networking</Course>
<Course id="15">Fundamentals of Security</Course>
</Courses>
</Student>
<Student id="2">
<FName>Glen</FName>
<LName>Bennet</LName>
<Age>31</Age>
<Courses>
<Course id="12">Fundamentals of Data Warehousing</Course>
<Course id="15">Fundamentals of Security</Course>
</Courses>
</Student>
</Students>';
declare @pointer int
EXEC sp_XML_preparedocument @pointer OUTPUT, @XML;
SELECT
'columns'
FROM OPENXML(@pointer,'/Students/Student',2)
WITH
(StudentID int '@id',
StudentFirstName varchar(50) 'FName',
StudentLastName varchar(50) 'LName',
StudentAge int 'Age',
EnrolledCourse1 varchar(50) '(Courses/Course)[1]',
...
EnrolledCourseN varchar(50) '(Courses/Course)[N]'
);
EXEC sp_xml_removedocument @docpointer;
In case you are wondering, if student 100 does not have N courses, a null is returned for the missing courses. In the example, @id is an attribute of the Student element. There is also a course_id attribute, but this code does not get it.
I have modified the code to support the course id attribute. First put the item(s) into the select part of the document.
SELECT
StudentID,
StudentFirstName,
..
EnrolledCourse1,
CourseID1, <-- added
EnrolledCourse2,
CourseID2, <-- added
EnrolledCourse3,
CourseID3 <-- added
Notice that the with FROM OPENXML(@pointer,’/Students/Student’,2), you have provided the starting point to query the student element. Courses are nested inside of student, so you need to provide the augmented path to the course element in order to query the course title, which is the entire contents of the course element. Since we can have 1 to N entries (in our example 3), we need the exact path to the expected element [1],[2] or [3].
EnrolledCourse1 varchar(50) '(Courses/Course)[1]'
When we go after the attribute for the first course, we query for @id but still need the path
EnrolledCourse1 varchar(50) '(Courses/Course)[1]',
CourseID1 int '(Courses/Course[1])/@id',
and variously for all three course elements. The second student has only two courses so the ID and the Name are reutnred as null
nodes method
You can also use the nodes method rather than OPENXML. You don’t need anything like EXEC sp_xml_preparedocument and EXEC sp_xml_removedocument since the engine extracts the data from the XML directly with XQuery
select
columns
FROM @XML.nodes('/Students/Student')
as StudentTable(Student)
We get the structure as a rowset, starting with the students/student path, store it in StudentTable as the column student. From the rowset, we select the individual values (in the very first example we also selected XML with query) using code like this (i edited the original paper to include the course_id attribute
SELECT
Student.value('@id','int') as StudentID,
Student.value('(FName/text())[1]','varchar(50)') as StudentFirstName,
Student.value('(LName/text())[1]','varchar(50)') as StudentLastName,
Student.value('(Age/text())[1]','int') as StudentAge,
Student.value('(Courses/Course/text())[1]','varchar(50)') as EnrolledCourse1,
Student.value('(Courses/Course/@id)[1]','int') as EnrolledCourse1ID, <-- added
Student.value('(Courses/Course/text())[2]','varchar(50)') as EnrolledCourse2,
Student.value('(Courses/Course/@id)[2]','int') as EnrolledCourse1ID, <-- added
Student.value('(Courses/Course/text())[3]','varchar(50)') as EnrolledCourse3
FROM @XML.nodes('/Students/Student')
as StudentTable(Student)
This blog has a stored procedure to examine the schema of an XML table
AD HOC XML File Queries
An interesting idea is suggested by Seth Delconte in SimpleTalk. Store the XML in the file system (c:\myXML.xml). Use OPENROWSET to read the xml into a CTE with an XML datatyle. Or perhaps, create a function to return an XML variable that can be up to 2GB in content.
Now that we have XML data or at least an XML fragment, work with XQuery FLOR, XQuery nodes or XQuery value methods.
XML INDEXES
Still another approach is to leave the XML along and put a primary and perhaps secondard index on the XML column. This makes XQUERY value, nodes or other methods run twice or more as fast. Information on “Getting Started With XML Indexes” available and more on Selective XML indexes here.
FOR XML RAW
if we want to get XML format from SSMS, we can use the for XML syntax as explained in a SimpleTalk article. Options are “for XML raw”, “for XML auto”, “for XML explicit” or “for XML Path”.
Just as a side note, from XML is an excellent way to document the format of a table if the rows and columns are running off the screen – you can return the header as XML and the first line of results also as XML and use this to describe the table.
For XML raw returns each row as an element, each element as an value expressed as an attribute of the column name. Using the simpletalk example
select * from stores for xml raw
<row StoreID=”292″>
<Survey_untyped>
<StoreSurvey>
<AnnualSales>800000</AnnualSales>
<AnnualRevenue>80000</AnnualRevenue>
<BankName>United Security</BankName>
<BusinessType>BM</BusinessType>
<YearOpened>1996</YearOpened>
<Specialty>Mountain</Specialty>
<SquareFeet>21000</SquareFeet>
<Brands>2</Brands>
<Internet>ISDN</Internet>
<NumberEmployees>13</NumberEmployees>
<Products Type=”Bikes”>
<Product>Mountain</Product>
<Product>Road</Product>
<Product>Racing</Product>
</Products>
<Products Type=”Clothes”>
<Product>Jerseys</Product>
<Product>Jackets</Product>
<Product>Shorts</Product>
</Products>
</StoreSurvey>
</Survey_untyped>
SELECT * FROM Stores for XML raw ('root')
just names the root element, in this case the case of column (survey_untyped) as ‘root’. You probably want to go the other way – make root into something more useful. If you can comparing versions, you can ‘for XML raw (‘v1′)’ et al.
And this is returning the attributes as attributes. You cAn force them into elements, for possible ease of querying, by specifying elements to the XML ROW method
SELECT * FROM Stores for XML raw ('root'), ELEMENTS;
Further, you can wrap the entire output inside an XML element, which makes it valid XML:
SELECT * FROM Stores for XML raw ('survey'), ROOT('VERSION1')
One use for this is to take row column data, which you can query by column name or alias and convert it into valid XML.
FOR XML PATH
For XML path returns differently formatted XML – you get to indicate what the path is and name the root. If you finished the select with
FOR XML PATH('customer'), ROOT('customers')
you would get a root of customers and a top element of customer. The individual elements would be named as you select them with aliases or with the field name. Or you can put in nothing at all. There is a clever way to use XML Path to string together common separated values in a query to get a field with several values.
SELECT CAT.Name AS [Category],
STUFF(( SELECT ',' + SUB.Name AS [text()]
FROM Subcategory SUB
where SUB.ID = CAT.ID
FOR XML PATH('')
), 1, 1, '' )
AS [Sub Categories]
FROM Category CAT
this will give you cat1 subcata, subcatb, subcatc
cat2 sub2a, sub2b, sub2c
so that you no longer have to tell people that ‘it’s too difficult’ if this is the format they want.
Examples
A nice brief tutorial on XML is from Joel Lipman
He postulates an xml datatype column “Event_XML’ in a table “XML_Events” . The root is Staff
Either of these queries will return a column that will contain ‘male’ for each staff entry. – not the XML data
SELECT
CAST(CAST(Event_XML AS XML).query('data(/STAFF/GENDER)')
AS VARCHAR(10)) AS Gender
FROM [dbo].[XML_EVENTS]
SELECT
CAST(Event_XML AS XML).value('data(/STAFF/GENDER)[1]',
'VARCHAR(10)') AS Gender
FROM [dbo].[XML_EVENTS]
this will return the attribute financial_year for the current_employee element in staff
SELECT CAST(
CAST(Event_XML AS XML)
.query('data(/STAFF/CURRENT_EMPLOYEE/@financialyear)')
AS INT(4)) AS FinancialYear
FROM [dbo].[XML_EVENTS]
if you look at the query, understand this is a normal t-sql query that also has support for XML via query or value (or node) with XPAth. So it is not surprising we can use a where clause
SELECT * FROM [dbo].[XML_EVENTS] WHERE
CAST(Event_XML AS XML)
.value('(/STAFF/GENDER="MALE")[1]','VARCHAR(10)')='TRUE'
AS Gender
Note that this will return all the columns, including an XML datatype which you can access with the XML editor
If we want to return the XML, use nodes as below
-- to get all below nodes from root node - STAFF
SELECT Event_XML.query('/STAFF')
FROM [dbo].[XML_EVENTS]
-- to get all element nodes named 'EMPLOYEE_NUMBER'
SELECT Event_XML.query('//EMPLOYEE_NUMBER')
FROM [dbo].[XML_EVENTS]
-- to get employee node with ID = 1
SELECT Event_XML.query('//EMPLOYEE_NUMBER[@ID = 1]')
FROM [dbo].[XML_EVENTS]
-- to get employee node at position = 1 from root node
SELECT Event_XML.query('/STAFF/EMPLOYEE_NUMBER[position()=1]')
FROM [dbo].[XML_EVENTS]
-- to get employee node at last position from root node
SELECT Event_XML.query('/STAFF/EMPLOYEE_NUMBER[last()]')
FROM [dbo].[XML_EVENTS]
-- to return unknown nodes from root node - STAFF
SELECT Event_XML.query('/STAFF/*')
FROM [dbo].[XML_EVENTS]
-- to return unknown nodes from XML column - Event_XML
SELECT Event_XML.query('//* ')
FROM [dbo].[XML_EVENTS]
if we want to look for strings in the elements
-- exist() returns 1 or 0 if a node name exists
-- containing "NAME" or not respectively
SELECT Event_XML.exist('(/*/*[1][contains(local-name(.),"NAME")])')
FROM [dbo].[XML_EVENTS]
-- query() returns XML branch if the node name
-- contains string "NAME" -- returns <SURNAME>LIPMAN</SURNAME>
SELECT Event_XML.query('(/*/*[1][contains(local-name(.),"NAME")])')
FROM [dbo].[XML_EVENTS]
-- query() returns XML branch if the node value contains
-- string "MAN" -- returns <SURNAME>LIPMAN</SURNAME>
SELECT Event_XML.query('(/*/*[1][contains(.,"MAN")])')
FROM [dbo].[XML_EVENTS]
-- value() returns value of node if the node name
-- contains string "NAME" -- returns LIPMAN
SELECT Event_XML
.value('(/*/*[1][contains(local-name(.),"NAME")])[1]','NVARCHAR(50))
FROM [dbo].[XML_EVENTS]
-- value() returns value of node if the node value
-- contains string "MAN" -- returns LIPMAN
SELECT Event_XML
.value('(/*/*[1][contains(.,"MAN")])[1]','NVARCHAR(50)')
FROM [dbo].[XML_EVENTS]