
Finding “facts” (without scanning millions of documents)

The new version of the Ontopedia PSI server is out. There are several interesting features in this release. We introduced auto-reification of all assertions: “everything is a subject” now. In the new version, the preferred and recommended way to model web resources is to model them as first-class “subjects”. Another interesting feature is the ability to search for ‘facts’ related to various subjects.

Every assertion created in Ontopedia’s knowledge map is automatically reified as a ‘subject’. From the moment of creation, assertion-based PSIs have a regular ‘life cycle’: users can change a PSI’s default name and description, and it is also possible to deprecate a PSI and introduce a new PSI for the same subject. Of course, users can make assertions about other assertions. This feature is quite helpful, for example, for modeling changes over time (combined with time-interval scoping).
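
Here is a rough sketch of what this looks like at the XTM 2.0 level. The PSIs, topic ids and the assertion itself are hypothetical, and the referenced topics are omitted; the point is that the reifier attribute ties the association to a topic that stands for the assertion, and a scope can carry the time interval:

<!-- Hypothetical fragment: the association is reified by the topic
     "assertion-1", so the assertion itself becomes a subject that
     other assertions can be made about. -->
<topic id="assertion-1">
  <name><value>Employment assertion about Person X</value></name>
</topic>
<association reifier="#assertion-1">
  <type><topicRef href="http://psi.ontopedia.net/employed_by"/></type>
  <!-- hypothetical time-interval scope -->
  <scope><topicRef href="#during-2008"/></scope>
  <role>
    <type><topicRef href="http://psi.ontopedia.net/Employee"/></type>
    <topicRef href="#person-x"/>
  </role>
  <role>
    <type><topicRef href="http://psi.ontopedia.net/Employer"/></type>
    <topicRef href="#company-y"/>
  </role>
</association>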

Speaking of modeling resources on the web, we continue to support URI-based properties/occurrences, but the main modeling practice moving forward is based on creating explicit subjects for web resources and using associations to connect resources and other subjects. Ontopedia’s generic user interface will be optimized in upcoming releases to support the dual nature of web resources (as “subjects” and as “links”).

The next feature is related to improving ‘findability’. I think that in many cases we are looking for ‘facts’, not documents, so we are taking a first step toward providing direct access to the ‘facts’ collected in Ontopedia’s knowledge map. We use basic faceted search/navigation with three main facets: ‘Concepts’, ‘Web Resources’, and ‘Assertions’. For example, if we type ‘apple’ in Ontopedia’s search box, we find information items in all three tabs on the front search page. The most interesting tab is probably ‘Assertions’: it provides direct access to facts that include a reference to ‘apple’. Future versions of the ‘Assertions’ tab will include additional facets that allow users to ‘slice and dice’ assertions.

With this new feature, our goal is to demonstrate that Subject-centric computing can change the ‘search paradigm’ by providing direct, reliable access to ‘facts’. Of course, it will take a lot of effort to make this approach scalable. But recent enhancements in commercial and open-source faceted search engines, and achievements in creating “knowledge maps”/“smart indices”, make me believe that we are not that far from being able to directly find the ‘facts’ we are interested in.

Watching an interview about Powerset

InfoQ published an interview with Tom Preston-Werner on Powerset, GitHub, Ruby and Erlang. I really like projects that try to analyze text/resources on the web and implement “smart search”. Powerset is one of these projects. But what I like even more is the approach where we explicitly represent facts/information items using open knowledge representation standards such as Topic Maps or RDF.

Topic Maps can play the role of “knowledge middleware” that helps to integrate the various components of the “smart search puzzle”. A topic map-based index makes it possible to represent and connect subjects and resources. Explicitly representing a relatively small number of relationships (“facts”, “assertions”) between resources and subjects can dramatically change the world of smart search.

Topic Maps-based knowledge middleware is a disruptive technology because it replaces proprietary knowledge organization schemas and modules, and it allows multiple players to build various solutions that help to create or use a smart index.

The Topic Maps-based Ontopedia PSI server, for example, can represent assertions that are manually created by users or generated by algorithms. We do not have our own text analysis infrastructure, but I hope that in the future we can leverage services on the web (such as OpenCalais) that can perform text analysis on an “as needed” basis. The core capability of the Ontopedia PSI server is maintaining explicit representations of subjects that are important to people, together with the ability to maintain assertions about these subjects.

The new version of the Ontopedia PSI server can play the role of an aggregator that extracts assertions from existing topic maps/fragments hosted on other websites. Assertions from multiple sources are aggregated into one assertion set/information map/semantic index. The Ontopedia PSI server keeps track of information provenance and supports multiple truth values. The server can, for example, handle a situation where one source on the web asserts that Person X gave a Presentation P and someone else makes the opposite assertion.
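
To make this concrete, here is a minimal sketch of how such a conflict could be recorded. The element and attribute names (and the PSIs for Person X and Presentation P) are invented for illustration; this is not Ontopedia’s actual exchange format:

<!-- Hypothetical record: two sources make opposite claims about the
     same assertion; each claim carries its provenance and truth value. -->
<assertion subject="http://psi.ontopedia.net/Person_X"
           type="http://psi.ontopedia.net/did_presentation"
           object="http://psi.ontopedia.net/Presentation_P">
  <claim source="http://example.org/site-a/program.xtm" truth-value="true"/>
  <claim source="http://example.org/site-b/notes.xtm" truth-value="false"/>
</assertion>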

I think that natural language processing can play a huge role in improving search. An ideal text analysis tool should let me provide ‘clues’ about the subjects in a text. I am looking for an equivalent of the kind of ‘binding’ that is used quite often in programming these days. I would love to have the ability to provide a list of the main subjects, in the form of PSIs, to a text analysis tool (using embedded markup or attached external assertions). If I do so, I expect much more precise results. If I do not have an initial list of subjects, I expect some kind of suggestions from the text analysis tool that I can check against an existing information map.
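
Such a list of ‘clues’ could itself be shipped as a small XTM fragment of PSIs. A sketch under that assumption (the exact PSI paths are hypothetical):

<!-- Hypothetical "subject clues" fragment attached to a document;
     a text analysis tool would try to bind mentions in the text
     to these PSIs instead of guessing from scratch. -->
<topicMap xmlns="http://www.topicmaps.org/xtm/" version="2.0">
  <topic id="clue-1">
    <subjectIdentifier href="http://psi.ontopedia.net/Apple_Inc"/>
  </topic>
  <topic id="clue-2">
    <subjectIdentifier href="http://psi.ontopedia.net/Powerset"/>
  </topic>
</topicMap>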

Ontopedia (like many other Topic Maps-based projects) promotes the use of Published Subject Identifiers (PSIs) for “all thinkable” subjects. For example, there is an identifier for the TMRA 2008 conference: http://psi.ontopedia.net/TMRA_2008.
There are identifiers for each presenter and presentation. Basic relationships between various subjects are also “mapped”/explicitly represented. Each basic resource, such as a blog post, can have a small assertion set that describes its metadata (using the Dublin Core metadata vocabulary, for example) and maybe some main assertions. Traditional websites can provide combined assertion sets in XTM or RDF, which can be consumed by semantic aggregators such as the Ontopedia PSI server. Text analysis is great (when it is good enough), but even simple (semi-)manual “mapping” of subjects, resources and relationships can change the search game.
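
As a minimal sketch, such a metadata assertion set could look like this in RDF/XML with Dublin Core (the blog post URL and its details are hypothetical; the dc:subject value follows the Ontopedia PSI pattern shown above):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Hypothetical blog post about the TMRA 2008 conference. -->
  <rdf:Description rdf:about="http://example.org/blog/tmra-2008-report">
    <dc:title>TMRA 2008 trip report</dc:title>
    <dc:creator>Jane Blogger</dc:creator>
    <dc:date>2008-10-20</dc:date>
    <dc:subject rdf:resource="http://psi.ontopedia.net/TMRA_2008"/>
  </rdf:Description>
</rdf:RDF>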

When we manually try to “map” an existing resource, such as a conference website, for the first time, it can look like a complicated and time-consuming task. Mapping a website for another conference will take much less time. And, of course, in many cases it is possible to reverse the traditional “build the website, then extract the assertions” paradigm.

It is possible to build nice-looking and functional websites based on “assertion sets”. Topicmaps.com is a great example of this approach: it is driven by a topic map. Humans can enjoy the HTML-based representation of the site, and aggregators like the Ontopedia PSI server can consume the raw XTM-based representation and aggregate it with other assertion sets, such as the TMRA 2008 conference assertion set.

References

Interview with Tom Preston-Werner on Powerset, GitHub, Ruby and Erlang, InfoQ

Extending Ontopedia PSI server to handle PURLs: support for RDF, step one

I have been thinking about RDF support on the Ontopedia PSI server for quite some time. The Semantic Technology Conference that I attended this spring gave me some new ideas in this direction. I decided to follow the recommendations from Eric Miller’s and David Wood’s presentation “Persistent Identifiers for the ‘Real Web’” regarding PURLs (Persistent Uniform Resource Locators). The Ontopedia PSI server has been extended to handle PURLs.

Each Published Subject Identifier (PSI) on http://psi.ontopedia.net has an equivalent PURL on http://purl.ontopedia.net. For example, http://psi.ontopedia.net/TMRA_2008 has the corresponding PURL http://purl.ontopedia.net/TMRA_2008. What happens when we type the PURL http://purl.ontopedia.net/TMRA_2008 into our browser? The Ontopedia PURL server returns HTTP code 303 “See Other” with the “Location” header set to http://psi.ontopedia.net/TMRA_2008.

For RDF-based applications, code 303 is an indication that the URI does not correspond to a “digital resource”. Web browsers will automatically jump to http://psi.ontopedia.net/TMRA_2008, which provides a nice subject/resource description.
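
On the wire, the exchange described above looks roughly like this (headers abbreviated):

GET /TMRA_2008 HTTP/1.1
Host: purl.ontopedia.net

HTTP/1.1 303 See Other
Location: http://psi.ontopedia.net/TMRA_2008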

When we need to export RDF assertions from Ontopedia, we can do something like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://purl.ontopedia.net/TMRA_2008">
    <rdfs:label>
      TMRA 2008 (Topic Maps Research and Applications Conference)
    </rdfs:label>
    <rdfs:comment>
      Fourth International Conference on
      Topic Maps Research and Applications
    </rdfs:comment>
    <rdf:type rdf:resource="http://purl.ontopedia.net/Conference"/>
  </rdf:Description>
</rdf:RDF>

In the Topic Maps-based version we can have:

<topicMap xmlns="http://www.topicmaps.org/xtm/" version="2.0">
  <topic id="id_98c49a0d3d87f067a4ba13b6d2f6d086">
    <subjectIdentifier href="http://psi.ontopedia.net/TMRA_2008"/>
    <instanceOf>
      <topicRef href="http://psi.ontopedia.net/Conference"/>
    </instanceOf>
    <name>
      <value>
        TMRA 2008 (Topic Maps Research and Applications Conference)
      </value>
    </name>
    <occurrence>
      <type>
        <topicRef href="http://psi.ontopedia.net/Description"/>
      </type>
      <resourceData>
        Fourth International Conference on
        Topic Maps Research and Applications
      </resourceData>
    </occurrence>
  </topic>
</topicMap>

The RDF-based version uses PURLs and the Topic Maps-based version uses PSIs for the identification of subjects/resources.

Reference:

David Wood and Eric Miller, “Persistent Identifiers for the ‘Real Web’”, May 2008 (PDF)

The new version of Ontopedia PSI server

The new version of the Ontopedia PSI server is out now. It is possible to represent various types of assertions related to subjects (names, occurrences, associations). The new PSI server also allows users to record and integrate the opinions of different users. Its internal knowledge representation is optimized for paraconsistent reasoning.

I started to play with some topics that I am interested in, for example Subject-centric Computing and Apple Inc.
As with any typical Topic Maps-based system, we can easily add new subject and assertion types; we are not limited to fixed domain models. In addition, the new PSI server supports recording of assertion provenance and five truth values.

We also tried to follow Resource-Oriented Architecture: each subject, each assertion, and each subject-centric group of assertions of the same type has its own URI and “page”.

The main goal of this version is to experiment with assertion-level, subject-centric representations vs. the more traditional portal-based approach.

2008 Semantic Technology Conference: random observations

I am back from the Semantic Technology Conference. It is becoming bigger and bigger each year. This year there were more than a hundred sessions, a full day of tutorials, and a product exhibition. It was quite crowded and energizing.

Just some random observations:

– Oracle improves RDF/OWL support in the 11g database and considers RDF/OWL strategic/enabling technologies that will be leveraged in future versions of Oracle products.

– Yahoo uses RDF to organize content on various websites. It also introduced SearchMonkey, an extension to the Yahoo search platform that allows site owners to provide more detailed information about information resources.

– Consumer-oriented websites powered by semantic technologies are here. Twine, Freebase, and Powerset are good examples; more to come.

– Resource-Oriented Architecture and RDF could be a very powerful combination. More and more people understand the value of exposing data through URIs in the form of information resources. The Linked Data initiative looks quite interesting.

– Some advanced semantic applications use knowledge representation formalisms that go beyond the basic RDF/OWL model, but RDF/OWL can be used to surface/exchange information based on W3C standards. There were lots of discussions about information provenance, trust, and “semantic spam”.

– It looks like there is a workable solution (compromise) for the ‘Web’s Identity Crisis’. The idea is to reserve the HTTP 303 (“See Other”) code to indicate “concept URIs”; the 303 response should include an additional URI for a “See Other” information resource. This approach, combined with new PURL-like servers, makes it possible to keep RDF “as is” and to implement something close to the idea of Published Subject Identifiers.

– Franz demonstrated a new version of the AllegroGraph 64-bit RDFStore. Franz implemented support for Named Graphs (which can be used for representing weights, trust factors, and provenance) and incorporated geospatial and temporal libraries. Named Graphs make it possible to deal with contexts in RDF.

– Text analysis tools are becoming better and better; an interesting example is AllegroGraph. Incorporating natural language processors allows entities and relationships to be extracted with a reasonable level of precision (News Portal sample).

– Doug Lenat gave a great presentation at the conference about the history of the Cyc project. It looks like in 5-10 years we can expect “artificial intelligent assistants” with quite sophisticated reasoning abilities.