Sunday, April 13, 2008

What is "Docsum"?

Once upon a time, I had an idea. There is a lot of documentation out there: web pages, news feeds, research papers, e-mails, etc. All typed in text, all in language, all with meaning, all without structure... mostly. So I was thinking, wouldn't it be nice to have a search that, rather than returning links to documentation, presented a summary of the documentation it knew about, with references to the originating documents? The goal: just tell me about the important aspects of a topic, rather than making me go out to each result and find the sections that are relevant to the topic I am interested in.

From this idea, I started a project on SourceForge called Docsum. I haven't spent much time developing this project in a while, but I wanted to write a little about it. The process works like this: you add some sort of documentation (the demo site, for example, uses software licenses), and then enter a topic to run against the added content to form a summary. A topic entered might be "free speech". The result is a document summarizing the content in the Docsum repository pertaining to "free speech".
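To make that process concrete, here is a minimal sketch in Python of a topic-driven summary. It is not Docsum's actual code: the function names `split_sentences` and `summarize`, the simple word-count scoring, and the sample documents are all assumptions made up for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(documents, topic, max_sentences=3):
    """Return up to max_sentences (sentence, source_name) pairs.

    documents: dict mapping a source name to its plain text (hypothetical shape).
    Sentences are ranked by how many times the topic's words occur in them.
    """
    topic_words = set(re.findall(r"\w+", topic.lower()))
    scored = []
    for name, text in documents.items():
        for sentence in split_sentences(text):
            words = Counter(re.findall(r"\w+", sentence.lower()))
            score = sum(words[w] for w in topic_words)
            if score > 0:
                scored.append((score, sentence, name))
    # Highest score first; the sort is stable, so ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(sentence, name) for _, sentence, name in scored[:max_sentences]]


# Made-up sample content standing in for the added documentation.
sample_docs = {
    "license_a.txt": "This software is free. Free as in free speech, not free beer.",
    "license_b.txt": "You may copy this program. Warranty is not provided.",
}
for sentence, source in summarize(sample_docs, "free speech"):
    print(f"{sentence} [{source}]")
```

Each line of the output carries the name of the document it came from, which is the "references to the originating document" part of the idea. A real summarizer would need much better scoring than counting words.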

The current project is limited to plain text files and RSS/Atom XML. The generated summary is plain text as well, and is not pretty. But isn't that what ideas are all about... the idea?

One day, I hope to pursue this project more, but at the moment, my time is limited. Right now, the project is made up of 3 parts: the web view, the command line view, and the datastore. I wanted a good command line interface into the application, and due to the ease of using Python for text processing, all the core functionality is written in Python. The web end is simple PHP pages that interact with the Python core by running subprocesses. The datastore is MySQL, and the Python core interacts with it directly.
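As a rough illustration of that layout, here is a sketch of a Python command line core that talks to a datastore and writes plain text to standard output, so that a web page could run it as a subprocess and show the result. None of this is Docsum's real interface: the `add` and `find` commands, the `--db` option, and the table layout are hypothetical, and the standard library's `sqlite3` stands in for MySQL so the example runs without a database server.

```python
import argparse
import sqlite3


def open_store(path):
    """Open the datastore (SQLite here, as a stand-in) and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (name TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def main(argv=None):
    """Hypothetical command line entry point; returns a process exit code."""
    parser = argparse.ArgumentParser(prog="docsum-sketch")
    parser.add_argument("--db", default="docsum-sketch.db", help="datastore file")
    commands = parser.add_subparsers(dest="command", required=True)

    add = commands.add_parser("add", help="store a plain text document")
    add.add_argument("name")
    add.add_argument("file")

    find = commands.add_parser("find", help="list documents mentioning a topic")
    find.add_argument("topic")

    args = parser.parse_args(argv)
    conn = open_store(args.db)
    try:
        if args.command == "add":
            with open(args.file, encoding="utf-8") as handle:
                body = handle.read()
            with conn:  # commits the insert on success
                conn.execute(
                    "INSERT OR REPLACE INTO documents (name, body) VALUES (?, ?)",
                    (args.name, body),
                )
            print(f"added {args.name}")
        else:
            # Plain substring match; a real core would build a summary here.
            rows = conn.execute(
                "SELECT name FROM documents WHERE body LIKE ? ORDER BY name",
                (f"%{args.topic}%",),
            )
            for (name,) in rows:
                print(name)
    finally:
        conn.close()
    return 0

# When saved as a script, end the file with:  raise SystemExit(main())
```

Because everything the core produces goes to standard output as plain text, the same command serves both the terminal and a web front end that captures the subprocess output.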

