The fetch_20newsgroups_vectorized function in the scikit-learn datasets module is a variant of fetch_20newsgroups. It loads the same 20 Newsgroups dataset but adds a preprocessing step that converts the raw text into numeric feature vectors suitable for machine learning algorithms.
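Specifically, fetch_20newsgroups_vectorized returns a version of the 20 Newsgroups dataset in which each document has already been transformed into a feature vector. According to the scikit-learn documentation, the vectors are built with the default settings of the CountVectorizer class, which converts text into a sparse matrix of token counts; with normalize=True (the default), each row is then scaled to unit norm. This is not a full TF-IDF (Term Frequency-Inverse Document Frequency) weighting. For TF-IDF features, combine fetch_20newsgroups with TfidfVectorizer or TfidfTransformer.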
The resulting dataset is useful for text classification tasks, such as predicting which newsgroup a post belongs to, where the goal is to assign each document to one of several predefined categories.
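To make the vectorization step concrete, here is a minimal sketch on a toy corpus. It assumes, based on the scikit-learn documentation, that the function applies a default CountVectorizer followed by row-wise L2 normalization; the three documents are made up for illustration and are not part of the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Toy corpus standing in for newsgroup posts (made-up data).
docs = [
    "the cat sat",
    "the cat sat on the mat",
    "dogs bark",
]

# Step 1: count tokens with the default CountVectorizer settings.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Step 2: scale each row to unit L2 norm (assumed to mirror normalize=True).
X = normalize(counts.astype(np.float64))

# Euclidean length of each row after normalization.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

print(X.shape)
print(sorted(vectorizer.vocabulary_))
print(row_norms)
```

Each row ends up with length 1, so long and short documents become comparable while the relative token frequencies within a document are preserved.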
Method signature:

    sklearn.datasets.fetch_20newsgroups_vectorized(*, subset='train', remove=(), data_home=None, download_if_missing=True, return_X_y=False, normalize=True, as_frame=False)
- subset: This parameter specifies which subset of the 20 Newsgroups dataset to load. It can be set to 'train', 'test', or 'all' (default is 'train').
- remove: This parameter specifies which kinds of text to strip from each post. It is a tuple containing any subset of ('headers', 'footers', 'quotes'); for example, remove=('headers', 'footers', 'quotes') strips all three so that a classifier cannot rely on this metadata (default is ()).
- data_home: This parameter specifies the directory where the dataset should be downloaded and cached (default is None, which means the scikit-learn data directory, normally ~/scikit_learn_data, is used).
- download_if_missing: This parameter controls whether the dataset is downloaded when it is not available locally. If False, an error is raised instead of attempting the download (default is True).
- return_X_y: This parameter controls the return type. If True, the function returns a (data, target) tuple; if False, it returns a single dictionary-like Bunch object (default is False).
- normalize: This parameter controls whether each document's feature vector is scaled to unit norm (default is True).
- as_frame: This parameter controls whether the data is returned as pandas objects (a DataFrame with sparse columns for the features) instead of a SciPy sparse matrix and NumPy array (default is False).
Note that categories and shuffle are parameters of fetch_20newsgroups, not of fetch_20newsgroups_vectorized, as the signature above shows. To load only selected categories, use fetch_20newsgroups and vectorize the text yourself.
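Parameter names and defaults can differ between scikit-learn releases, so it may be worth checking them against the installed version. The sketch below reads the defaults with the standard inspect module (no download involved) and wraps a typical call in a helper; load_split and its strip_metadata flag are hypothetical names introduced only for this example.

```python
import inspect

from sklearn.datasets import fetch_20newsgroups_vectorized

# Read the keyword defaults from the installed scikit-learn version.
# Nothing is downloaded at this point.
signature = inspect.signature(fetch_20newsgroups_vectorized)
defaults = {name: param.default for name, param in signature.parameters.items()}


def load_split(subset="train", strip_metadata=True):
    """Hypothetical helper: return (X, y) for one subset of the corpus."""
    # Stripping headers, footers and quotes removes metadata that a
    # classifier could otherwise latch onto.
    remove = ("headers", "footers", "quotes") if strip_metadata else ()
    return fetch_20newsgroups_vectorized(
        subset=subset, remove=remove, return_X_y=True
    )


print(defaults)
```

Calling X, y = load_split("test") then returns the feature matrix and label array directly; the first call downloads and caches the dataset, so it needs network access.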
Unless return_X_y=True is set, the fetch_20newsgroups_vectorized function returns a dictionary-like Bunch object whose keys include the following:
- data: A sparse matrix of shape (n_samples, n_features) that represents the preprocessed text data. Each row of the matrix corresponds to a document, and each column corresponds to a feature (i.e., a token from the vocabulary).
- target: An array of shape (n_samples,) that represents the target variable (i.e., the category label) for each document. The label is an integer between 0 and 19, corresponding to one of the 20 newsgroups.
- DESCR: A string that describes the dataset and its attributes.
- target_names: A list of length 20 that contains the names of the 20 newsgroups.
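The keys listed above can be explored with a few lines of code. The following is a minimal sketch: summarize is a hypothetical helper, and the tiny Bunch it runs on is made-up data with the same keys as the real return value, so the example works without downloading anything.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils import Bunch


def summarize(bunch):
    """Hypothetical helper: basic statistics for a vectorized text dataset."""
    n_samples, n_features = bunch.data.shape
    n_classes = len(bunch.target_names)
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "n_classes": n_classes,
        # Fraction of non-zero entries; text matrices are usually very sparse.
        "density": bunch.data.nnz / (n_samples * n_features),
        # Number of documents per class label.
        "class_counts": np.bincount(bunch.target, minlength=n_classes).tolist(),
    }


# Tiny stand-in with the same keys as the real return value (made-up data).
toy = Bunch(
    data=csr_matrix(np.array([[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]])),
    target=np.array([0, 1]),
    target_names=["alt.atheism", "comp.graphics"],
    DESCR="toy example",
)

summary = summarize(toy)
print(summary)
```

With the real dataset, the same helper can be called as summarize(fetch_20newsgroups_vectorized()).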
Here is an example code snippet to fetch the vectorized version of the 20 Newsgroups dataset:
    from sklearn.datasets import fetch_20newsgroups_vectorized

    # Fetch the vectorized version of the 20 Newsgroups dataset
    newsgroups = fetch_20newsgroups_vectorized()

    # Print the shape of the feature matrix
    print(newsgroups.data.shape)
This code fetches the preprocessed 20 Newsgroups dataset (downloading and caching it on the first run) and prints the shape of the feature matrix. Because subset defaults to 'train', the shape should be (11314, 130107), indicating that there are 11,314 documents in the training subset and that each document is represented by a vector of 130,107 features.
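Since the vectorized data is meant for text classification, a short sketch of that use may help. The example below is one possible approach, not the only one: it picks MultinomialNB as the classifier, and train_and_score and run_on_newsgroups are hypothetical helper names. The block runs a quick offline check on made-up data; run_on_newsgroups is defined but not called because it downloads the dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def train_and_score(X_train, y_train, X_test, y_test):
    """Fit a multinomial naive Bayes model and return its test accuracy."""
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


def run_on_newsgroups():
    """Hypothetical driver: downloads the data on first use (needs network)."""
    X_train, y_train = fetch_20newsgroups_vectorized(subset="train", return_X_y=True)
    X_test, y_test = fetch_20newsgroups_vectorized(subset="test", return_X_y=True)
    return train_and_score(X_train, y_train, X_test, y_test)


# Offline smoke test on made-up data: feature 0 marks class 0, feature 1 class 1.
X_toy_train = csr_matrix(np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]))
y_toy_train = np.array([0, 0, 1, 1])
X_toy_test = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))
y_toy_test = np.array([0, 1])

toy_accuracy = train_and_score(X_toy_train, y_toy_train, X_toy_test, y_toy_test)
print(toy_accuracy)
```

The 'train' and 'test' subsets share the same feature columns, so a model fitted on one can be evaluated on the other without any extra alignment step.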