Mathieu Blondel // Data

Amazon7

Amazon7 is a dataset that we created from raw data collected by Mark Dredze and his colleagues. It contains 1,362,109 reviews of Amazon products. Each review belongs to one of 7 categories (apparel, book, dvd, electronics, kitchen & housewares, music, video) and is represented as a 262,144-dimensional vector. The data is very sparse: only 0.04% of the feature values are non-zero. This is the dataset that we used in our paper "Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification" (see below).
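To give a feel for what these numbers imply, the short sketch below works out the approximate count of non-zero entries. The 0.04% figure is rounded, so the results are estimates only, and the 8 bytes per value used for the dense-size estimate is an assumption (float64 storage), not something stated about the files.

```python
# Figures quoted in the dataset description.
n_reviews = 1_362_109
n_features = 262_144
density = 0.04 / 100  # 0.04% non-zero (rounded, so results are approximate)

total_entries = n_reviews * n_features
approx_nonzeros = total_entries * density
approx_nonzeros_per_review = n_features * density

# Assumption: 8 bytes per value (float64) if the matrix were stored densely.
dense_gib = total_entries * 8 / 2**30

print("total entries:", total_entries)
print("approx. non-zeros:", round(approx_nonzeros))
print("approx. non-zeros per review:", round(approx_nonzeros_per_review))
print("dense size in GiB (float64 assumed):", round(dense_gib))
```

A dense representation would be far too large for most machines, which is why a sparse representation is the practical choice for this data.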

pickle

amazon7_pkl.tar.bz2 (303 MB download)

For Python users, the easiest way to load the dataset is with joblib's pickle support. Note that older versions of scikit-learn also bundled joblib as sklearn.externals.joblib; that module was removed in scikit-learn 0.23, so installing joblib directly is preferable. After decompressing the above archive, you can load the dataset as follows:

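# Prefer the standalone joblib package; fall back to the copy
# bundled with older scikit-learn releases.
try:
    import joblib
except ImportError:
    from sklearn.externals import joblib

data = joblib.load("amazon7_pkl/amazon7.pkl")
X = data["X"]  # feature matrix
y = data["y"]  # category labels
print(X.shape)
print(y.shape)
print(data["categories"])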

Alternatively, you may want to use:

amazon7_uncompressed_pkl.tar.bz2 (204 MB download, 1.5 GB uncompressed)

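The advantage is that the data can be memory-mapped (mmap) at load time, which joblib only supports for uncompressed pickles:

data = joblib.load("amazon7_uncompressed_pkl/amazon7.pkl", mmap_mode="r")

This is much faster, since the data is mapped from disk and read on demand instead of being copied into memory up front.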
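The effect of mmap_mode can be checked on a small toy array before downloading the full archive. The sketch below is only an illustration: the file name toy.pkl and the dictionary layout are made up for the example and say nothing about how the Amazon7 pickle is organised internally.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical toy data; not the Amazon7 layout.
tmp_dir = tempfile.mkdtemp()  # cleanup omitted for brevity
path = os.path.join(tmp_dir, "toy.pkl")
toy = {"X": np.arange(12, dtype=np.float64).reshape(3, 4)}

# Dump without compression so that the array can be memory-mapped later.
joblib.dump(toy, path)

# With mmap_mode="r", arrays are mapped read-only from disk
# instead of being read fully into memory.
loaded = joblib.load(path, mmap_mode="r")
X_toy = loaded["X"]

is_memmap = isinstance(X_toy, np.memmap)
row_sum = float(X_toy[1].sum())  # only the touched data needs to be read
print(is_memmap, row_sum)
```

Because the mapping is read-only, any code that needs to modify the data in place has to copy it first or use a different mmap_mode.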

svmlight / libsvm format

amazon7.bz2 (209 MB download)

For convenience, the dataset is also provided in the well-known svmlight / libsvm format.
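In this format each line holds a label followed by index:value pairs for the non-zero features. As a rough sketch, the file can be streamed with the Python standard library alone. The helper names parse_svmlight_line and iter_svmlight are hypothetical, the parser assumes a single numeric label per line, and feature indices are kept exactly as written in the file. scikit-learn users may prefer sklearn.datasets.load_svmlight_file, which returns a sparse matrix directly.

```python
import bz2


def parse_svmlight_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val}).

    Returns None for blank or comment-only lines.
    """
    line = line.split("#", 1)[0].strip()  # drop any trailing comment
    if not line:
        return None
    tokens = line.split()
    label = float(tokens[0])  # assumption: one numeric label per line
    features = {}
    for token in tokens[1:]:
        if token.startswith("qid:"):
            continue  # ranking query ids are not features
        index, value = token.split(":", 1)
        features[int(index)] = float(value)  # index kept as written
    return label, features


def iter_svmlight(path):
    """Yield (label, features) pairs from a plain or .bz2 file."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt") as handle:
        for line in handle:
            parsed = parse_svmlight_line(line)
            if parsed is not None:
                yield parsed


# Usage (assumes amazon7.bz2 is in the current directory):
# for label, features in iter_svmlight("amazon7.bz2"):
#     ...
```

Streaming line by line keeps memory use low, at the cost of being much slower than loading the pickled version.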

Citation

If you use Amazon7 in a paper, please cite both of the following papers:

Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification.
Mathieu Blondel, Kazuhiro Seki, and Kuniaki Uehara.
Machine Learning, May 2013.

Confidence-Weighted Linear Classification.
Mark Dredze, Koby Crammer, and Fernando Pereira.
International Conference on Machine Learning (ICML), 2008.