arff is a package for reading and writing ARFF format files.
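    from arff import arffread, arffwrite
    import sys

    # Read the ARFF file named on the command line ...
    f = open(sys.argv[1])
    (name, sparse, alist, m) = arffread(f)
    # ... and write its attributes and data to standard output.
    arffwrite(sys.stdout, alist, m)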
The arff package is built on an ANTLR v3 parser/lexer specification of the ARFF format, derived from the description at http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_%283.5.1%29
There are two parts to arff. The first is the Python software, which implements functions for reading and writing ARFF format files. The other part is a formal ANTLR (http://www.antlr.org/) parser/lexer specification of the ARFF syntax. This specification can be used to automatically generate parsers and lexical analyzers for a range of language targets. These targets currently include Java, C, C#, C++, Objective C, Python, Ruby, Perl, Ada and ActionScript. Indeed, the specification was used to generate the Python code for the parser and lexical analyzer.
The arff package has two main entry points. They are:
    arffread(fstream)
    arffwrite(f, alist, m, name='Unknown', sparse=False, comment=None)
The parameters of these functions are:
f, fstream: the output stream and the input stream, respectively.
name: a string containing the relation name.
sparse: a Boolean indicating whether the data is sparse.
alist: a list of attribute type information tuples (name, typecode, rest), where
    name: a string containing the attribute name.
    typecode: an integer indicating the type of the attribute:
        0 - string and nominal type
        1 - number
        2 - date
        3 - relational
    rest: depends on the typecode:
        for typecode 0, either the empty list [], indicating 'string' type, or a list of strings from which the nominal values are taken.
        for typecode 1, the empty list [].
        for typecode 2, either the empty list [], indicating use of the default date format, or a list containing the date format string.
        for typecode 3, a list of the attribute type information tuples of the sub-attributes.
comment: a string that is printed on the output stream before anything else.
m: the list of instances; each element of m is one instance. When writing relational attributes, the value of such an attribute must be a list of the relational attribute values.
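To make the (name, typecode, rest) convention concrete, the following sketch renders an alist as ARFF `@attribute` declarations. It is not the package's own writer: the helper names `quote` and `declare` and the typecode constants are hypothetical, and the quoting rule is a simplification that may differ from what arffwrite does.

```python
import re

# Typecodes as documented above; the constant names are illustrative only.
NOMINAL_OR_STRING, NUMBER, DATE, RELATIONAL = 0, 1, 2, 3


def quote(s):
    """Quote a string only when it is not a plain token (simplified rule)."""
    if re.fullmatch(r"[A-Za-z0-9_.:-]+", s):
        return s
    return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'"


def declare(alist, indent=""):
    """Return the @attribute lines described by a list of attribute tuples."""
    lines = []
    for name, typecode, rest in alist:
        head = indent + "@attribute " + quote(name)
        if typecode == NUMBER:
            lines.append(head + " numeric")
        elif typecode == DATE:
            # rest is [] for the default format, else [format_string]
            lines.append(head + " date" + "".join(" " + quote(f) for f in rest))
        elif typecode == RELATIONAL:
            # rest is the alist of the sub-attributes
            lines.append(head + " relational")
            lines.extend(declare(rest, indent + "  "))
            lines.append(indent + "@end " + quote(name))
        elif rest:
            # typecode 0 with values: a nominal attribute
            lines.append(head + " {" + ",".join(quote(v) for v in rest) + "}")
        else:
            # typecode 0 with an empty list: a string attribute
            lines.append(head + " string")
    return lines
```

Calling `"\n".join(declare(alist))` on an alist returned by arffread should give a readable summary of the declared attributes, with sub-attributes of relational attributes indented by two spaces.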
arffwrite will try to be conservative with quoting strings.
Note that the reader and writer functions do not do any semantic checking of the data and its values.
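Since no semantic checking is done, callers who need it have to add their own. The sketch below shows one possible check for sparse data, assuming each instance is a list of (index, value) pairs with zero-based indices, as in the example output further down. The function name `check_sparse` is hypothetical, and because the representation of missing values is not described here, they are not treated specially.

```python
def check_sparse(alist, m):
    """Return a list of problems found in sparse instances (empty if none)."""
    problems = []
    for row_no, instance in enumerate(m):
        for index, value in instance:
            if not 0 <= index < len(alist):
                problems.append(
                    "instance %d: index %d out of range" % (row_no, index))
                continue
            name, typecode, rest = alist[index]
            # typecode 0 with a non-empty rest is a nominal attribute,
            # so the value must be one of the declared strings
            if typecode == 0 and rest and value not in rest:
                problems.append(
                    "instance %d: %r is not a value of %s" % (row_no, value, name))
    return problems
```

An empty result only means that these two checks passed; numeric, date and relational values are not inspected.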
arffread returns None if an error occurs. In that case an error message is printed to the standard error stream.
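Because a None result cannot be unpacked into (name, sparse, alist, m), it can help to test for it before unpacking. A minimal sketch follows; the wrapper name `read_or_fail` is hypothetical, and the reader is passed in as a parameter so that the snippet does not depend on the package being installed.

```python
def read_or_fail(reader, stream):
    """Call reader(stream) and raise an exception instead of returning None."""
    result = reader(stream)
    if result is None:
        # The reader has already reported the details on standard error.
        raise ValueError("ARFF input could not be parsed")
    name, sparse, alist, m = result
    return name, sparse, alist, m
```

With the package installed, this would be used as `name, sparse, alist, m = read_or_fail(arffread, open('test.arff'))`.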
Let the file test.arff contain the following:
    @relation sparse
    @attribute a1 string
    @attribute a2 string
    @attribute a3 {val1,val2,val3}
    @attribute a4 string
    @attribute a5 relational
    @attribute a51 numeric
    @attribute a52 string
    @attribute a53 date yyyy-MM-ddTHH:mm:ss
    @attribute a54 {'val 0','val 1'}
    @end a5
    @attribute a6 numeric
    @data
    {1 X,3 Y,4 'class A'}
    {2 W,4 'class B'}
The result of
    from arff import arffread
    from pprint import pprint

    f = open('test.arff')
    (name, sparse, alist, m) = arffread(f)
    pprint((name, sparse, alist, m))
is (comments are inserted to help understanding):
    ('sparse',  # name of the relation
     True,      # this is a sparse data set
     # the alist:
     [('a1', 0, []),
      ('a2', 0, []),
      ('a3', 0, ['val1', 'val2', 'val3']),
      ('a4', 0, []),
      ('a5', 3,  # the typecode for attribute a5
       [('a51', 1, []),  # the alist for attribute a5
        ('a52', 0, []),
        ('a53', 2, ['yyyy-MM-ddTHH:mm:ss']),
        ('a54', 0, ['val 0', 'val 1'])]),
      ('a6', 1, [])],
     # the data list of lists m
     [[(1, 'X'), (3, 'Y'), (4, 'class A')],
      [(2, 'W'), (4, 'class B')]])
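As the output shows, each sparse instance is a list of (index, value) pairs. If full-width rows are more convenient, they could be expanded along the following lines. The function name `densify` is hypothetical and not part of the package; indices are assumed to be zero-based, and the value used for entries that are absent from an instance is left to the caller through the `fill` parameter.

```python
def densify(m, n_attributes, fill=None):
    """Expand sparse instances into rows with one slot per attribute."""
    rows = []
    for instance in m:
        row = [fill] * n_attributes
        for index, value in instance:
            row[index] = value
        rows.append(row)
    return rows
```

For the example above this would be called as `densify(m, len(alist))`, giving one row of six entries per instance.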
The full grammar with semantic actions for Python is contained in the file arff.g distributed with the arff package.
The parser grammar essentially is:
    file     : header data;
    header   : '@relation' string adecls;
    adecls   : adecl (adecl)*;
    adecl    : '@attribute' string datatype;
    datatype : 'numeric' | 'integer' | 'real' | 'string'
             | 'relational' adecls '@end' string
             | date
             | '{' values '}';
    date     : 'date' (string)?;
    data     : '@data' ( (pairs)+ | (values)+ );
    pairs    : '{' pair (',' pair)* '}';
    pair     : INT value;
    values   : value (',' value)*;
    value    : '?' | FLOAT | INT | string;
    keyword  : 'numeric' | 'integer' | 'real' | 'string' | 'relational' | 'date';
    string   : QSTRING | STRING | keyword;
License: | arff in general is distributed as free (not as in beer) software under the GNU General Public License (http://www.gnu.org/licenses/gpl.txt). The ANTLR grammar file arff.g is distributed under the Apache licence version 2.0 (http://www.apache.org/licenses/LICENSE-2.0). |
---|---|
Download: | dist/ |
Homepage: | here |
This documentation was generated for arff version 1.0c.