Amazon7
Amazon7 is a dataset that we created from raw data collected by Mark Dredze and his colleagues. The dataset contains 1,362,109 reviews of Amazon products. Each review belongs to one of 7 categories (apparel, book, dvd, electronics, kitchen & housewares, music, video) and is represented as a 262,144-dimensional vector. The data is very sparse: only 0.04% of the feature values are non-zero. This is the dataset that we used in our paper "Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification" (see below).
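To give a feel for what these numbers imply, here is a rough back-of-the-envelope calculation. It only uses the figures quoted above; the 0.04% density is a rounded value, so the results are approximate, and the 8-byte value size is an assumption made for illustration.

```python
# Figures quoted in the dataset description.
n_reviews = 1_362_109
n_features = 262_144
density = 0.0004  # 0.04% non-zero, rounded as quoted

# Total number of cells in the (reviews x features) matrix.
total_cells = n_reviews * n_features

# Approximate number of stored (non-zero) values overall and per review.
approx_nnz = total_cells * density
avg_nnz_per_review = n_features * density

# Size of a dense copy, ASSUMING 8-byte floating point values.
dense_terabytes = total_cells * 8 / 1e12

print("total cells:", total_cells)
print("approx. non-zeros (millions):", round(approx_nnz / 1e6))
print("approx. non-zeros per review:", round(avg_nnz_per_review))
print("dense size (TB, 8-byte values):", round(dense_terabytes, 2))
```

In other words, each review has on the order of a hundred non-zero features, which is why the data is only practical to handle in a sparse representation.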
pickle
amazon7_pkl.tar.bz2 (303 MB download)
For Python users, the easiest way to load the dataset is with joblib's pickle support. Older releases of scikit-learn also bundled joblib as sklearn.externals.joblib; that copy has been removed from recent releases, so installing joblib directly is preferable. After decompressing the above archive, you can load the dataset as follows:
try:
    import joblib
except ImportError:
    # Fallback for old scikit-learn releases that bundled joblib.
    from sklearn.externals import joblib

data = joblib.load("amazon7_pkl/amazon7.pkl")
X = data["X"]
y = data["y"]
print(X.shape)
print(y.shape)
print(data["categories"])
Alternatively, you may want to use:
amazon7_uncompressed_pkl.tar.bz2 (204 MB download, 1.5 GB uncompressed)
The advantage of the uncompressed pickle is that joblib can memory-map it when loading:
data = joblib.load("amazon7_uncompressed_pkl/amazon7.pkl", mmap_mode="r")
This is much faster because the arrays are mapped from disk on demand rather than copied into memory when the file is loaded.
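The effect of mmap_mode can be checked on a small stand-in file. The sketch below does not touch the real dataset: the file name toy.pkl and the dense toy array are hypothetical, chosen only to show that an array loaded from an uncompressed joblib pickle with mmap_mode="r" comes back as a read-only numpy.memmap.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in for the real dataset: a small dense array in a dict.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.pkl")
toy = {"X": np.arange(12, dtype=np.float64).reshape(3, 4)}

# No compression is requested, so the arrays can be memory-mapped later.
joblib.dump(toy, path)

# mmap_mode="r" maps the array from disk in read-only mode.
loaded = joblib.load(path, mmap_mode="r")
X_mm = loaded["X"]

is_memmap = isinstance(X_mm, np.memmap)
row_sum = float(X_mm[1].sum())  # only the pages that are touched get read

print(type(X_mm).__name__, X_mm.shape, row_sum)
```

Because the mapping is read-only, attempts to modify the array in place will fail; copy the parts you need (for example with np.array(...)) if you want to change them.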
svmlight / libsvm format
amazon7.bz2 (209 MB download)
For convenience, the dataset is also provided in the well-known svmlight / libsvm format.
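In this format each line holds a label followed by index:value pairs for the non-zero features. As a minimal sketch of how such a file can be read without decompressing it first, the hypothetical helper iter_svmlight below streams a bz2-compressed file using only the Python standard library. It assumes numeric labels and no qid fields, neither of which is confirmed here for amazon7.bz2, and it keeps the feature indices exactly as written in the file.

```python
import bz2


def iter_svmlight(path):
    """Yield (label, {index: value}) pairs from a bz2-compressed svmlight file.

    Hypothetical helper for illustration: assumes numeric labels and
    plain "index:value" tokens (no "qid:" fields).
    """
    with bz2.open(path, "rt") as f:
        for line in f:
            # Drop optional trailing comments and surrounding whitespace.
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            label, *pairs = line.split()
            features = {}
            for pair in pairs:
                index, value = pair.split(":", 1)
                features[int(index)] = float(value)
            yield float(label), features
```

Usage would look like: for label, features in iter_svmlight("amazon7.bz2"): ... For actual experiments, scikit-learn's sklearn.datasets.load_svmlight_file is usually the more convenient choice, since it returns a SciPy sparse matrix together with the label array and can read .bz2 files directly.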
Citation
If you use Amazon7 in a paper, please cite both of the following papers.
- Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification.
  Mathieu Blondel, Kazuhiro Seki, and Kuniaki Uehara.
  Machine Learning, May 2013.
- Confidence-Weighted Linear Classification.
  Mark Dredze, Koby Crammer, and Fernando Pereira.
  International Conference on Machine Learning (ICML), 2008.