====== PHP TextCat ======

This project is about the PHP TextCat extension, which aims to provide a fast, language-independent and extensive tool for categorizing texts.

[[http://github.com/crodas/phplibtextcat/tree/master|View the Project Git repository]]

===== Theory of operations =====

The extension is based on [[http://citeseer.ist.psu.edu/68861.html|N-Gram-Based Text Categorization]]. At this point, very little can be extended from PHP itself, but that will change once the library is rewritten.

===== How to compile =====

First fetch the code, either from the [[http://github.com/crodas/phplibtextcat/downloads|Download page]] or from the Git repository:

  git clone git://github.com/crodas/phplibtextcat.git
  cd phplibtextcat

If you wish to use the development version:

  git pull origin devel

To compile the module, you need the PHP development files (packaged as php5-dev or php4-dev on Debian-based systems, or php-devel on Red Hat-based ones) or a PHP built from source. Then follow these instructions:

  $ phpize
  $ ./configure --with-textcat
  $ make
  $ make test
  $ make install

===== How to use =====

Before it can categorize anything, the extension has to be trained by feeding it sample texts. If you want to skip this step, it comes with //knowledge// files for some common languages, which can be found in ''samples/knowledge/''.

  textcat_train(
      "knowledge-output.lm",
      "Here goes a sample of the text",
      "Here another text",
      "And so forth"
  );

The degree of accuracy depends on the quality and quantity of the samples. If you notice that a text was put in the wrong category, use that text as a sample when you rebuild your knowledge file.
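To see why the quality and quantity of samples matter, it can help to look at the N-gram ranking idea from the paper linked under //Theory of operations//. The following is a minimal pure-PHP sketch of that idea, not the extension's actual code. The function names ''ngram_profile'', ''out_of_place'' and ''categorize'' are hypothetical, the 300-entry profile limit and the penalty for unseen N-grams are assumptions, and the tokenizer only handles ASCII letters.

```php
<?php
// Hypothetical helper: build a ranked N-gram profile, most frequent first.
// Assumes ASCII text; a real implementation would need to handle encodings.
function ngram_profile(string $text, int $maxN = 5, int $limit = 300): array
{
    $counts = [];
    $words = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Pad each word so that word boundaries become part of the N-grams.
        $padded = '_' . $word . '_';
        $len = strlen($padded);
        for ($n = 1; $n <= $maxN; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = substr($padded, $i, $n);
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    $grams = array_keys($counts);
    usort($grams, function ($a, $b) use ($counts) {
        // Higher count first; ties are broken alphabetically so the order is deterministic.
        return ($counts[$b] <=> $counts[$a]) ?: strcmp($a, $b);
    });
    return array_slice($grams, 0, $limit);
}

// Hypothetical helper: "out-of-place" distance between two ranked profiles.
function out_of_place(array $doc, array $category): int
{
    $rankOf = array_flip($category);  // N-gram => rank in the category profile
    $penalty = count($category);      // assumed cost for an N-gram the category never saw
    $distance = 0;
    foreach ($doc as $rank => $gram) {
        $distance += isset($rankOf[$gram]) ? abs($rank - $rankOf[$gram]) : $penalty;
    }
    return $distance;
}

// Hypothetical helper: pick the category whose profile is closest to the text.
function categorize(string $text, array $profiles): string
{
    $doc = ngram_profile($text);
    $best = '';
    $bestDistance = PHP_INT_MAX;
    foreach ($profiles as $name => $profile) {
        $distance = out_of_place($doc, $profile);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = (string) $name;
        }
    }
    return $best;
}

// Usage: "train" two tiny categories, then categorize a new text.
$profiles = [
    'english' => ngram_profile('the quick brown fox jumps over the lazy dog and runs through the forest'),
    'spanish' => ngram_profile('el perro corre por el bosque y salta sobre la cerca del jardin'),
];
echo categorize('the lazy dog jumps over the fox', $profiles), PHP_EOL;
```

In this sketch every extra sample changes the N-gram counts, and therefore the ranks, of a category profile. That is the reason feeding a miscategorized text back in as a sample can shift later results.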