====== PHP Textcat ======
This project is the PHP TextCat extension, which aims to provide a fast, language-independent and extensible tool for categorizing texts.
[[http://github.com/crodas/phplibtextcat/tree/master|View the Project Git repository]]
===== Theory of operations =====
The underlying theory is [[http://citeseer.ist.psu.edu/68861.html|N-Gram-Based Text Categorization]]. At this point the extension offers little extensibility from PHP itself, but this is expected to change once the library is rewritten.
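The N-gram approach can be illustrated in plain PHP. The following is only a minimal sketch of the ranked-profile and "out-of-place" distance idea from the linked paper, not the extension's implementation or API: the function names ''ngram_profile'' and ''out_of_place'' are hypothetical, the profile size of 300 and the maximum n-gram length of 5 are illustrative defaults, and the code assumes the ''mbstring'' extension is available.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of the extension.

/**
 * Build a ranked profile: the $size most frequent character n-grams
 * (1 to $maxN characters long) mapped to their rank (0 = most frequent).
 */
function ngram_profile(string $text, int $maxN = 5, int $size = 300): array
{
    $counts = [];
    // Split into words on anything that is not a letter or an apostrophe.
    $words = preg_split('/[^\p{L}\']+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $padded = '_' . $word . '_';          // mark word boundaries
        $len = mb_strlen($padded);
        for ($n = 1; $n <= $maxN; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = mb_substr($padded, $i, $n);
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    arsort($counts);                           // most frequent first
    $top = array_slice(array_keys($counts), 0, $size);
    return array_flip($top);                   // n-gram => rank
}

/**
 * Out-of-place distance: sum of rank differences between a document profile
 * and a category profile; n-grams missing from the category cost $size.
 */
function out_of_place(array $document, array $category, int $size = 300): int
{
    $distance = 0;
    foreach ($document as $gram => $rank) {
        $distance += isset($category[$gram])
            ? abs($rank - $category[$gram])
            : $size;                           // maximum penalty
    }
    return $distance;
}

// Toy category profiles; real ones need far more sample text.
$profiles = [
    'english' => ngram_profile('the quick brown fox jumps over the lazy dog and the cat'),
    'spanish' => ngram_profile('el perro y el gato corren sobre la casa del zorro'),
];

$document = ngram_profile('the dog and the fox');
$scores = [];
foreach ($profiles as $name => $profile) {
    $scores[$name] = out_of_place($document, $profile);
}
asort($scores);                                // smallest distance first
echo array_key_first($scores), "\n";           // best-matching category
```

The category whose profile has the smallest distance to the document profile is taken as the match, which is why more and better samples per category tend to give more reliable rankings.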
===== How to compile =====
First fetch the code, either from the [[http://github.com/crodas/phplibtextcat/downloads|Download page]] or from the git repository:
  git clone git://github.com/crodas/phplibtextcat.git
  cd phplibtextcat
If you wish to use the development version:
  git pull origin devel
To compile the module you need the PHP development files (on Debian-based systems the ''php5-dev'' or ''php4-dev'' package, on Red Hat-based systems ''php-devel''), or a PHP built from source. Then run the following commands:
  $ phpize
  $ ./configure --with-textcat
  $ make
  $ make test
  $ make install
===== How to use =====
Before categorizing, you train the extension by feeding it sample texts. If you want to skip this step, the project ships with //knowledge// files for some common languages, found in ''samples/knowledge/''.
  textcat_train(
      "knowledge-output.lm",
      "Here goes a sample of the text",
      "Here another text",
      "And so forth"
  );
The degree of accuracy depends on the quality and quantity of the samples. If you notice that a text was assigned the wrong category, use that text as a sample when you rebuild your knowledge file.