Matching Structural Similarity between XML Fragments

Problem Setup

Given a collection of XML documents that all adhere to a common schema, and an XML fragment that uses a different set of tags, find a mapping from the tags of the new XML fragment to the tags of the collection documents.

A leaf tag is one that has nothing except text as its child.

A non-leaf tag is one that has at least one non-text child tag.
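The two definitions above can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: the helper names `is_leaf` and `split_tags` and the sample document are assumptions, and an empty element is treated as a leaf because it has no child tags.

```python
import xml.etree.ElementTree as ET

def is_leaf(element):
    """True if the element has no child elements (only text, or nothing at all)."""
    return len(element) == 0

def split_tags(root):
    """Return (leaf_names, non_leaf_names) for every element under root, root included."""
    leaves, non_leaves = [], []
    for element in root.iter():  # document order, starting at root
        (leaves if is_leaf(element) else non_leaves).append(element.tag)
    return leaves, non_leaves

# Hypothetical sample document, used only for illustration.
doc = ET.fromstring(
    "<article><title>XML matching</title>"
    "<meta><date>2008-01-15</date></meta></article>"
)
leaves, non_leaves = split_tags(doc)
```

Here `title` and `date` come out as leaf tags, while `article` and `meta` are non-leaf tags.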

Using Content for matching tags

Hypothesis

Using each of the leaf tags in the XML collection as a training example, with tag name as the training class, a text classifier can classify each of the leaf tags in the test XML fragment with the correct tag name from the training schema.

Expected Results

For leaf tags

1. The classifier should give high accuracy on the training set.

2. The classifier should work well for

a. Text content tags, with accuracy comparable to that of ordinary text classification problems

b. Date & URL tags, because their values share recurring tokens that the classifier can learn

Experiment Process

The experiment consists of the following steps

1. Parse each of the tags in each of the training documents, and extract out leaf tags

2. Treat each extracted text as an individual document, with its tag name as the class label, and write it out in the Lemur Web Document input format.

3. Create a Vector Representation of each of the documents

4. Train the SVM with the training set built in the above steps

5. For the test XML fragment, extract the tags and their content, and form a vector representation of each as in step 3

6. Use the learnt model to classify each of the test fragment's tags and find suggested tag names from the training set.
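The steps above can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions, not the setup described here: it skips the Lemur file format, and it swaps the SVM for a simple cosine-similarity nearest-centroid classifier so that the example stays self-contained. All function names and the sample documents are hypothetical.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def leaf_examples(root):
    """Yield (tag_name, text) for every element without child elements (steps 1-2)."""
    for element in root.iter():
        if len(element) == 0:
            yield element.tag, (element.text or "").strip()

def to_vector(text):
    """Bag-of-words term-frequency vector over lowercased alphanumeric tokens (step 3)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(training_roots):
    """Sum the term vectors per tag name. Stand-in for SVM training (step 4)."""
    centroids = defaultdict(Counter)
    for root in training_roots:
        for tag, text in leaf_examples(root):
            centroids[tag].update(to_vector(text))
    return centroids

def suggest(centroids, test_root):
    """Return (test_tag, suggested_training_tag, score) per test leaf (steps 5-6)."""
    results = []
    for tag, text in leaf_examples(test_root):
        vec = to_vector(text)
        best = max(centroids, key=lambda name: cosine(vec, centroids[name]))
        results.append((tag, best, cosine(vec, centroids[best])))
    return results

# Hypothetical training document and test fragment.
training = [ET.fromstring(
    "<article><title>Matching XML schemas</title>"
    "<body>We study structural similarity between XML fragments "
    "using text classifiers.</body></article>"
)]
test = ET.fromstring(
    "<post><heading>Matching schemas</heading>"
    "<content>Text classifiers can measure similarity between fragments.</content></post>"
)
suggestions = suggest(train(training), test)
```

In this toy case `heading` is mapped to `title` and `content` to `body`. A real run would replace `train` and `suggest` with the SVM training and classification calls of whichever toolkit is used.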

Using Content Properties as features to improve match

Hypothesis

Adding properties of the content of each tag in the training collection as features can improve classification accuracy.

Expected Results

For leaf tags, the following properties are expected to have the stated effects.

1. Content Length: The classifier differentiates between tags based on size, like Title and Category.

2. Average, Minimum & Maximum Term Frequency: The classifier differentiates between tags based on content semantics, like Abstract & Main Text.

3. Data Type for certain specific types like date, time, URL: The classifier perfectly classifies some specific tags that have a very defined data type.

Experiment Process

To the above experiment process, add the following steps.

1. While extracting tags and content from the XML documents to build the documents and their class labels, add the following to the content before writing it in Lemur Document format

a. Depending on the size, add a size marker for one of the ranges 0, 1, 2-4, 5-32, 33-256, 257 and above

b. Calculate the average, minimum and maximum term frequencies and add them to the content

c. If the content matches a certain pattern, like URL, date or time, mark it with that data type.
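The three additions above could look like the following sketch. Several details are assumptions made for illustration: size is measured in whitespace-separated terms, the average term frequency is rounded to an integer, the token names (`SIZE_...`, `TFAVG_...`, `TYPE_...`) are invented, and the regular expressions cover only simple URL, ISO date and clock-time forms.

```python
import re
from collections import Counter

# (low, high, token) size ranges taken from the list above; 257+ is the fallback.
SIZE_BUCKETS = [(0, 0, "SIZE_0"), (1, 1, "SIZE_1"), (2, 4, "SIZE_2_4"),
                (5, 32, "SIZE_5_32"), (33, 256, "SIZE_33_256")]

# Assumed, deliberately simple data-type patterns.
TYPE_PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "DATE": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "TIME": re.compile(r"\d{2}:\d{2}(:\d{2})?"),
}

def size_token(n_terms):
    """Map a term count to its size-range token."""
    for low, high, token in SIZE_BUCKETS:
        if low <= n_terms <= high:
            return token
    return "SIZE_257_PLUS"

def property_tokens(text):
    """Return the extra feature tokens to append to a tag's content."""
    terms = text.lower().split()
    tokens = [size_token(len(terms))]
    if terms:
        freqs = Counter(terms).values()
        tokens += [f"TFAVG_{round(sum(freqs) / len(freqs))}",
                   f"TFMIN_{min(freqs)}",
                   f"TFMAX_{max(freqs)}"]
    for name, pattern in TYPE_PATTERNS.items():
        if pattern.fullmatch(text.strip()):  # the whole content must match
            tokens.append(f"TYPE_{name}")
    return tokens

sentence_tokens = property_tokens("the cat saw the dog")
date_tokens = property_tokens("2008-01-15")
```

The returned tokens would be appended to the tag's text before vectorisation, so the classifier sees them as ordinary terms.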

Using Structural Information as features to match non-leaf elements

Hypothesis

Using structural properties as features in the vector representation of the non-leaf tags can provide matches with fair accuracy.

Expected Results

Classifying the non-leaf tags in the test fragment would yield suggestions with positive confidence levels. These can then be used in conjunction with the leaf tag classification to find a better mapping for the entire XML fragment.

Experiment Process

Follow the experiment process of the first hypothesis, applied to non-leaf tags. Instead of tag content, encode the following properties as features, in the same way as in the second experiment above.

1. Find the number of child tags, c, and

a. Add the feature "number of children = c" n times.

b. Add the features "number of children = c - 1" and "number of children = c + 1" k times each.

c. Let n=5 & k=2. Note that the weights could instead follow a distribution curve, but for simplicity we keep it this way.

2. Follow the same process for the number of attributes, with n=5 & k=2.
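One reading of this weighting scheme is sketched below: the token for the exact count is repeated n times, and the tokens for the neighbouring counts are repeated k times each, so that near misses still share some features. The token names, the function names and the choice to skip a negative neighbour count are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def count_tokens(prefix, count, n=5, k=2):
    """Repeat the exact-count token n times and each neighbouring count k times."""
    tokens = [f"{prefix}_{count}"] * n
    for neighbour in (count - 1, count + 1):
        if neighbour >= 0:  # assumption: a count of -1 is meaningless, so skip it
            tokens += [f"{prefix}_{neighbour}"] * k
    return tokens

def structural_tokens(element, n=5, k=2):
    """Feature tokens for a non-leaf tag: child count, then attribute count."""
    return (count_tokens("CHILDREN", len(element), n, k)
            + count_tokens("ATTRS", len(element.attrib), n, k))

# Hypothetical non-leaf tag with two children and one attribute.
node = ET.fromstring('<article id="7"><title>t</title><body>b</body></article>')
tokens = structural_tokens(node)
```

With n=5 and k=2, a tag with two children produces five `CHILDREN_2` tokens plus two each of `CHILDREN_1` and `CHILDREN_3`, so a training tag with three children still overlaps with it, only more weakly.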

Using rules in post-classification phase to improve accuracy

Hypothesis

The suggested mapping should be the same for all instances of a given tag in the XML documents. Voting over the classifier's suggestions, together with their confidence values, can produce a correct uniform mapping.

Expected Results

A few of the classifications made by the classifier would be weak, either

1. because of a lack of positive confidence in the results, or

2. because different classifications are made for the various instances of the same type of tag.

The post-processor would use the classification results & its own rules to come up with a common, stronger suggestion for all the test instances of the same tag. This would improve the accuracy of the results.

Experiment Process

Use the above experiment to suggest tags for all the tags in the test document. Then apply the following post processing rules.

1. Identify the list of unique tag names in the Test XML Document.

2. For each tag name identified above, check whether all the suggested tag values are the same. If not, change all of them to the single value chosen as follows.

a. If there is at least one positive match: Among the tag names with positive matches, choose the one that is suggested the most.

b. If there is no tag with a positive value: Among the tag names with negative values, choose the most suggested tag.
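The post-processing rules above can be sketched as a small voting function. The input format, a list of (test tag, suggested tag, confidence) triples, is an assumption, as is the handling of edge cases the rules do not cover: a confidence of exactly zero is treated as non-positive, and ties go to the suggestion seen first.

```python
from collections import Counter, defaultdict

def unify(predictions):
    """Pick one suggested training tag per test tag name.

    predictions: iterable of (test_tag, suggested_tag, confidence) triples,
    one per tag instance in the test document.
    """
    by_tag = defaultdict(list)
    for test_tag, suggested, confidence in predictions:
        by_tag[test_tag].append((suggested, confidence))

    mapping = {}
    for test_tag, votes in by_tag.items():
        # Rule a: if any instance has a positive match, vote among those only.
        positive = [s for s, c in votes if c > 0]
        # Rule b: otherwise vote among all (non-positive) suggestions.
        pool = positive or [s for s, _ in votes]
        mapping[test_tag] = Counter(pool).most_common(1)[0][0]
    return mapping

# Hypothetical classifier output for two test tag names.
predictions = [
    ("heading", "title", 0.8), ("heading", "category", -0.1), ("heading", "category", -0.2),
    ("note", "abstract", -0.3), ("note", "body", -0.1), ("note", "body", -0.4),
]
mapping = unify(predictions)
```

In this example `heading` maps to `title` because that is its only positive match, even though `category` was suggested more often, and `note` maps to `body` by majority among its negative suggestions.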