Open Source Data Mining on Linux

Thursday, 26 October 2006

A quick search for postgresql on my system using `apt-cache search postgresql` threw up `postgresql-8.1-plr – Procedural language interface between PostgreSQL 8.1 and R`. R is a popular (and very capable) open source statistical application.

That got me curious about all the data mining, statistical, and machine learning tools available on a standard Ubuntu install (I’m running Ubuntu 6.10).

So, by running `apt-cache search` with each of the keywords below, this is what I got:


  • Data mining – Nothing! (Not a good start to the experiment ;)
  • Machine Learning – ifile and libtorch3
  • Fuzzy – cl-rsm-fuzzy
  • Statistics – Quite a few actually, so I’m filtering out obvious false positives
    • cl-statistics – Common Lisp Statistics Package
    • dspam – is a scalable, fast and statistical anti-spam filter
    • ent – pseudorandom number sequence test program
    • ess – Emacs statistics mode, supporting R,S and others
    • euler – interactive mathematical programming environment
    • libbow – Bag of Words Library
    • libdspam7 – DSPAM is a scalable and statistical anti-spam filter
    • libnewmat10 – matrix manipulations C++ library
    • paw – Physics Analysis Workstation – a graphical analysis program
    • postgresql-8.1-plr – Procedural language interface between PostgreSQL 8.1 and R
    • pspp – Statistical analysis tool (ed: a take on SPSS?)
    • python-stats – A collection of statistical functions for Python
    • r-base – GNU R statistical computing language and environment
    • rkward – a KDE frontend to the R statistics language (ed:wow! this is a real find :)
    • spambayes – Python-based spam filter using statistical analysis
    • spamoracle – A statistical analysis spam filter based on Bayes’ formula


  • Neural Network
    • achilles – An artificial life and evolution simulator
    • dspam – is a scalable, fast and statistical anti-spam filter
    • genesis – general-purpose neural simulator
    • libfann1 – Fast Artificial Neural Network Library (fann)
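
The keyword searches above could also be scripted. Here is a minimal sketch in Python, assuming `apt-cache` is on the PATH and prints one `package - description` line per match. The keyword list and the function names (`parse_search_output`, `search`) are my own choices for illustration, and passing each keyword as a single search pattern is an assumption about how the search was run.

```python
import subprocess

# Keywords taken from the list above; adjust to taste.
KEYWORDS = ["data mining", "machine learning", "fuzzy",
            "statistics", "neural network"]


def parse_search_output(text):
    """Turn `apt-cache search` output into (package, description) pairs."""
    results = []
    for line in text.splitlines():
        # Package names may contain hyphens, so split on the first " - " only.
        name, sep, description = line.partition(" - ")
        if sep:
            results.append((name.strip(), description.strip()))
    return results


def search(keyword):
    """Run `apt-cache search` for one keyword and parse what it prints."""
    completed = subprocess.run(
        ["apt-cache", "search", keyword],  # assumption: one pattern per keyword
        capture_output=True, text=True, check=True,
    )
    return parse_search_output(completed.stdout)


if __name__ == "__main__":
    for keyword in KEYWORDS:
        print(keyword)
        for name, description in search(keyword):
            print("    " + name + ": " + description)
```

The results still need a manual pass to weed out false positives, as with the statistics search above.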

Of course, there are other open source tools I have tinkered with that are not mentioned above:

  • WEKA - Java Machine Learning library
  • Orange - C++/Python ML library
  • DAP - SAS equivalent (Runs SAS programs)
  • GLPK - GNU Linear Programming Kit

This is by no means an exhaustive list of open source data mining and analytics toolkits. However, it is heartening to know that this many are just two clicks away.

Also, not finding any software while searching for data mining is not very surprising. Data mining is a process that involves various tasks such as data cleaning, attribute selection, algorithm selection and, of course, presentation. Moreover, DM is an iterative process that is not tied to any one tool (contrary to what many vendors would like us to believe).
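
To make those steps concrete, here is a toy sketch in plain Python. Everything in it is made up for illustration: the data, the attribute names, and the two deliberately trivial "algorithms". It also scores the models on the same rows they were fitted on, which a real project would not do. The point is only that cleaning, attribute selection, algorithm selection and presentation are separate stages, and none of them depends on a particular tool.

```python
from statistics import mean, pvariance

# Made-up toy data: (attributes, label) pairs. None marks a missing value.
RAW = [
    ({"age": 25, "visits": 3, "country": 1}, "no"),
    ({"age": 40, "visits": 9, "country": 1}, "yes"),
    ({"age": None, "visits": 7, "country": 1}, "yes"),
    ({"age": 31, "visits": 2, "country": 1}, "no"),
    ({"age": 52, "visits": 8, "country": 1}, "yes"),
]


def clean(rows):
    """Data cleaning: drop rows that have a missing attribute value."""
    return [(x, y) for x, y in rows if None not in x.values()]


def select_attributes(rows):
    """Attribute selection: keep attributes whose values actually vary."""
    return [name for name in rows[0][0]
            if pvariance([x[name] for x, _ in rows]) > 0]


def fit_majority(rows, attrs):
    """Baseline model: always predict the most common label."""
    labels = [y for _, y in rows]
    top = max(sorted(set(labels)), key=labels.count)
    return lambda x: top


def fit_threshold(rows, attrs):
    """Toy model: predict "yes" above the mean of the first attribute."""
    name = attrs[0]
    cut = mean(x[name] for x, _ in rows)
    return lambda x: "yes" if x[name] > cut else "no"


def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def run():
    rows = clean(RAW)
    attrs = select_attributes(rows)
    candidates = {"majority": fit_majority, "threshold": fit_threshold}
    # Algorithm selection: score every candidate and keep the best one.
    scores = {name: accuracy(fit(rows, attrs), rows)
              for name, fit in candidates.items()}
    best = max(scores, key=scores.get)
    return attrs, scores, best


if __name__ == "__main__":
    attrs, scores, best = run()
    # Presentation: a plain-text summary of what was found.
    print("attributes kept:", attrs)
    print("scores:", scores)
    print("best model:", best)
```

In practice each stage gets revisited as you learn more about the data, which is where the iteration comes in.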

So, knowledge of a handful of the above tools will go a long way toward building your skills as a data mining practitioner.

Good luck.