Given an XML document collection having a common schema and a new test document, find how much does
The problem can be broken into two
generatorAgent( 1) +generatorAgent(1) generatorAgent
creator(16) +creator(1)+subject(15) subject
description(16) +description(14)+title(1)-description(1) description
language( 1) +language(1) language
title(16) +creator(1)+title(11)-title(4) title
item(15) +item(15) item
subject(15) +subject(15) subject
li(15) +li(15) li
channel( 1) +channel(1) channel
items( 1) +items(1) items
RDF( 1) +RDF(1) RDF
date(16) +date(16) date
Seq( 1) +Seq(1) Seq
link(16) +link(16) link
Class Total + - ~ Acc
> generatorAgent 1 1 0 0 1.0000
> creator 16 0 16 0 0.0000
> description 16 16 0 0 1.0000
> language 1 1 0 0 1.0000
> title 16 16 0 0 1.0000
> item 15 15 0 0 1.0000
> subject 15 15 0 0 1.0000
> li 15 15 0 0 1.0000
> channel 1 1 0 0 1.0000
> items 1 1 0 0 1.0000
> RDF 1 1 0 0 1.0000
> date 16 16 0 0 1.0000
> Seq 1 1 0 0 1.0000
> link 16 16 0 0 1.0000
> Overall 131 115 16 0 0.8779
> Overall 131 100 31 0 0.7634
> Overall 131 115 16 0 0.8779
> Overall 131 100 31 0 0.7634
> Overall 131 115 16 0 0.8779
> Overall 132 115 17 0 0.8712
> Overall 131 115 16 0 0.8779
Average Accuracy 0.838616666666667
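The Acc column in the tables above appears to be the + count divided by Total (for example 115 / 131 = 0.8779). A minimal sketch of that arithmetic follows. The function names are hypothetical, and reading the ~ column as "tags left with mixed suggestions" is an assumption, not something the output states.

```python
def parse_overall(line):
    """Parse a summary line such as '> Overall 131 115 16 0 0.8779'."""
    fields = line.lstrip("> ").split()
    total, plus, minus, tilde = (int(x) for x in fields[1:5])
    return {"total": total, "plus": plus, "minus": minus, "tilde": tilde}


def accuracy(row):
    # Acc = '+' count / Total, e.g. 115 / 131 = 0.8779 (rounded to 4 places).
    return row["plus"] / row["total"]


def is_consistent(row):
    # In the rows shown, the '+', '-' and '~' counts add up to Total.
    return row["plus"] + row["minus"] + row["tilde"] == row["total"]


row = parse_overall("> Overall 131 115 16 0 0.8779")
print(round(accuracy(row), 4), is_consistent(row))
```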
Final set of tasks due
i. Number of siblings of a node
ii. Number of siblings of parent of node
iii. Level of the tag
iv. Are children repeating
v. Is repeating
vi. Is parent repeating
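The six structural features listed above could be computed roughly as follows. This is only a sketch using Python's xml.etree.ElementTree. The function name, the feature keys and the <entry> tag in the sample are hypothetical, and "repeating" is assumed to mean "has at least one sibling with the same tag name".

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structural_features(root):
    """Return {element: feature dict} for every element under root."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}

    def is_repeating(node):
        p = parent.get(node)
        if p is None:
            return False
        return sum(1 for sibling in p if sibling.tag == node.tag) > 1

    def level(node):
        depth = 0
        while node in parent:
            node = parent[node]
            depth += 1
        return depth

    features = {}
    for node in root.iter():
        p = parent.get(node)
        gp = parent.get(p) if p is not None else None
        child_counts = Counter(child.tag for child in node)
        features[node] = {
            "siblings": len(p) - 1 if p is not None else 0,
            "parent_siblings": len(gp) - 1 if gp is not None else 0,
            "level": level(node),
            "children_repeating": any(n > 1 for n in child_counts.values()),
            "repeating": is_repeating(node),
            "parent_repeating": is_repeating(p) if p is not None else False,
        }
    return features


doc = ET.fromstring(
    "<log><entry><heading>a</heading><url>b</url></entry>"
    "<entry><heading>c</heading><url>d</url></entry></log>"
)
features = structural_features(doc)
print(features[doc[0][0]])  # features of the first <heading>
```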
i. Meta attributes like term / document frequency: Min, Max & Average
i. If all the suggestions are weak, use the suggestion with the highest confidence instead of voting.
ii. If multiple tags get the same class, and it’s a non-repeating tag, match only if the confidence is greater than a certain threshold.
iii. Check whether a tag has already been matched, and avoid re-matching it unless it is repeating.
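One possible reading of the three post-processing rules above, as a sketch only. The thresholds WEAK and MIN_CONF, the function names, and the choice to treat "repeating" as a property of the class (the training tag) are all assumptions; the notes do not fix any of them.

```python
from collections import Counter

WEAK = 0.5      # hypothetical: below this a suggestion counts as weak
MIN_CONF = 0.6  # hypothetical: threshold for contested non-repeating classes


def pick_class(suggestions, weak=WEAK):
    """suggestions: list of (class_name, confidence) for one test tag."""
    if all(conf < weak for _, conf in suggestions):
        # Rule i: all weak, so take the single most confident suggestion.
        return max(suggestions, key=lambda s: s[1])
    votes = Counter(name for name, _ in suggestions)
    winner = votes.most_common(1)[0][0]
    return winner, max(conf for name, conf in suggestions if name == winner)


def resolve(per_tag, repeating, min_conf=MIN_CONF):
    """per_tag: {test_tag: suggestions}; repeating: classes allowed to repeat."""
    picks = {tag: pick_class(s) for tag, s in per_tag.items()}
    claimed = Counter(name for name, _ in picks.values())
    mapping, used = {}, set()
    # Visit the most confident picks first so they claim a class first.
    for tag, (name, conf) in sorted(picks.items(), key=lambda kv: -kv[1][1]):
        if name not in repeating:
            if claimed[name] > 1 and conf <= min_conf:
                continue  # Rule ii: contested class, confidence too low
            if name in used:
                continue  # Rule iii: class already matched, not repeating
        mapping[tag] = name
        used.add(name)
    return mapping


suggestions = {
    "heading": [("title", 0.9), ("title", 0.8), ("subject", 0.7)],
    "category": [("title", 0.3), ("subject", 0.4)],
    "author": [("title", 0.55), ("title", 0.6)],
}
print(resolve(suggestions, repeating=set()))
```

With these made-up confidences, heading keeps title, category falls back to its most confident suggestion (subject), and author is left unmatched because title is contested and its confidence does not clear the threshold.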
i. Suggested Mappings
ii. Individual Accuracies, per tag
iii. Overall Accuracy
i. Content itself
ii. Length
iii. Null
i. Num of children
ii. Num of attributes
iii. Is a tag leaf
iv. Does a tag have repeating children
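The four per-tag features listed above are cheap to read off a parsed element. A small sketch with xml.etree.ElementTree; the function name and dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def node_features(node):
    """Per-tag features: child count, attribute count, leaf flag, repeats."""
    child_counts = Counter(child.tag for child in node)
    return {
        "num_children": len(node),
        "num_attributes": len(node.attrib),
        "is_leaf": len(node) == 0,
        "has_repeating_children": any(n > 1 for n in child_counts.values()),
    }


root = ET.fromstring(
    '<log about="x"><heading>a</heading><url>b</url><url>c</url></log>'
)
print(node_features(root))
print(node_features(root[0]))
```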
bash-2.03$ grep 'Overall' results.vanilla
> Overall 131 76 38 17 0.5802
> Overall 131 82 17 32 0.6260
> Overall 131 79 28 24 0.6031
> Overall 131 83 16 32 0.6336
> Overall 105 45 23 37 0.4286
> Overall 132 93 19 20 0.7045
> Overall 131 91 20 20 0.6947
Avg 0.615496188
bash-2.03$ grep 'Overall' results.postp
> Overall 131 84 47 0 0.6412
> Overall 131 99 32 0 0.7557
> Overall 131 83 48 0 0.6336
> Overall 131 81 50 0 0.6183
> Overall 105 45 60 0 0.4286
> Overall 132 99 33 0 0.7500
> Overall 131 82 49 0 0.6260
Avg 0.6423787
Class Total + - ~ Acc
> generatorAgent 1 0 0 1 0.0000
> creator 16 0 15 1 0.0000
> description 16 14 1 1 0.8750
> language 1 1 0 0 1.0000
> title 16 11 0 5 0.6875
> item 15 15 0 0 1.0000
> subject 15 5 0 10 0.3333
> li 15 0 0 15 0.0000
> channel 1 1 0 0 1.0000
> items 1 1 0 0 1.0000
> RDF 1 1 0 0 1.0000
> date 16 16 0 0 1.0000
> Seq 1 1 0 0 1.0000
> link 16 16 0 0 1.0000
> Overall 131 82 16 33 0.6260
Found mixed suggestions for generatorAgent. Enforcing errorReportsTo
Found mixed suggestions for creator. Enforcing subject
Found mixed suggestions for description. Enforcing description
Found mixed suggestions for title. Enforcing title
Found mixed suggestions for subject. Enforcing subject
Found mixed suggestions for li. Enforcing title
Class Total + - ~ Acc
> generatorAgent 1 0 1 0 0.0000
> creator 16 0 16 0 0.0000
> description 16 16 0 0 1.0000
> language 1 1 0 0 1.0000
> title 16 16 0 0 1.0000
> item 15 15 0 0 1.0000
> subject 15 15 0 0 1.0000
> li 15 0 15 0 0.0000
> channel 1 1 0 0 1.0000
> items 1 1 0 0 1.0000
> RDF 1 1 0 0 1.0000
> date 16 16 0 0 1.0000
> Seq 1 1 0 0 1.0000
> link 16 16 0 0 1.0000
> Overall 131 99 32 0 0.7557
i. For each document
ii. For each tag
iii. For the total set
i. Whether it is a leaf element. Done
ii. Encode how many siblings it has
iii. How many siblings its parent has
iv. The level of a tag in the tree.
i. If all the suggestions are weak, use the suggestion with the highest confidence instead of voting.
Poetry
------
log channel_()
heading title()
url title(,link,date)
posting description()
category title()
author title()
time date()
Meinem Leben
------------
log channel_()
heading title()
url link()
posting description()
category subject()
author subject()
time date()
Poetry with Structural Information
----------------------------------
log title(,item)
heading title()
url title(,link,date)
posting description()
category title()
author title()
time date()
Meinem Leben with Structural Information
----------------------------------------
log title(,item)
heading title()
url link()
posting description()
category subject()
author subject()
time date()
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:cc="http://web.resource.org/cc/"
xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://weblogs.cs.cornell.edu/aseem/archives/000312.html">
<title>Lucky Dog</title>
<link>http://weblogs.cs.cornell.edu/aseem/archives/000312.html</link>
<description>1 pm in the after noon you get a job from Lehman Brothers on Wall Street 3 pm in the Microsoft interview, you do well(not to mention the wireless keyboard you won in the raffle) 12 midnight,...</description>
<dc:subject>Updates</dc:subject>
<dc:creator>Aseem Bajaj</dc:creator>
<dc:date>2003-10-29T12:45:20-05:00</dc:date>
</item>
</rdf:RDF>
<log about="http://weblogs.cs.cornell.edu/aseem/archives/000312.html">
<heading>Lucky Dog</heading>
<url>http://weblogs.cs.cornell.edu/aseem/archives/000312.html</url>
<posting>1 pm in the after noon you get a job from Lehman Brothers on Wall Street 3 pm in the Microsoft interview, you do well (not to mention the wireless keyboard you won in the raffle) 12 midnight,...</posting>
<category>Updates</category>
<author>Aseem Bajaj</author>
<time>2003-10-29T12:45:20-05:00</time>
</log>
log RDF()
heading title()
url link()
posting description()
category date()
author description(,link)
time description()
(The colors above, which indicate errors in the results, are not part of the output; they were added manually.)
We are classifying the content of tags here rather than the documents themselves. It’s a multi-class, single-label problem.
Tags and nodes are used interchangeably here.
Each tag name in a given document is a class. Given a new test document, each of its leaf tags is judged by the type of its content.
For each leaf node in the training documents, the tag is the class and its content is the value.
The content of a given tag in all the n documents is used to form n training vectors for that class.
Unresolved issue: Tokenization of data-oriented fields is significantly different from that of content-oriented fields, and both are quite likely to occur in typical XML documents that are part data, part content. For example the content
Words are used as features.
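The setup described above (leaf tags as classes, their content as training examples, words as features) might be collected as in the sketch below. The function names are hypothetical, and whitespace splitting with lower-casing stands in for whatever tokenizer is actually used; as noted, tokenization of data-oriented fields is still unresolved.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # ElementTree writes namespaced tags as '{uri}name'; keep only 'name'.
    return tag.rsplit("}", 1)[-1]


def training_pairs(xml_documents):
    """Return (class, tokens) pairs, one per non-empty leaf tag."""
    pairs = []
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        for node in root.iter():
            text = (node.text or "").strip()
            if len(node) == 0 and text:
                # Assumed tokenizer: lower-case and split on whitespace.
                pairs.append((local_name(node.tag), text.lower().split()))
    return pairs


docs = [
    "<item><title>Lucky Dog</title><subject>Updates</subject></item>",
    "<item><title>Another Post</title><subject>Updates</subject></item>",
]
for label, tokens in training_pairs(docs):
    print(label, tokens)
```

Each of the n documents contributes one example per leaf tag, so a class such as title ends up with n training vectors.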
For each leaf tag in the test document, find the training-set tag that best matches it; together these matches identify a mapping between the two XML structures.
Id  Type  Tag
1   Name  <name/>
2   Age   <age/>, <exp/>
3   Desc  <desc/>

Class
Name   1:0.5    2:0.5
Name   23:0.7   53:0.1   1293:0.3
.
.
Desc   21:0.11  43:0.68  213:0.11  729:0.1
Desc   8:0.1    8679:0.1 1221:0.1  230:0.1  2167:0.1  836:0.5
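Rows in the second table pair a class with sparse index:weight features. A sketch of how such a row could be produced from a token list follows. It assumes 1-based word indices assigned in order of first appearance and weights equal to term counts normalized to sum to 1; both are guesses from the illustrative rows, and the function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Assign each distinct word a 1-based index in order of appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def sparse_row(label, tokens, vocab):
    """Format one training vector as 'Class index:weight index:weight ...'."""
    counts = Counter(tok for tok in tokens if tok in vocab)
    total = sum(counts.values())
    # Assumed weighting: term frequency normalized by the number of tokens.
    feats = sorted((vocab[tok], n / total) for tok, n in counts.items())
    return " ".join([label] + [f"{idx}:{weight:g}" for idx, weight in feats])


vocab = build_vocabulary([["aseem", "bajaj"], ["lucky", "dog"]])
print(sparse_row("Name", ["aseem", "bajaj"], vocab))
```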