sklearn.datasets.fetch_20newsgroups is a function in the scikit-learn library that downloads and returns the “20 Newsgroups” dataset.
The “20 Newsgroups” dataset is a collection of approximately 20,000 newsgroup documents, partitioned (almost) evenly across 20 different newsgroups. The version that scikit-learn fetches contains around 18,000 posts, split into a training subset and a test subset. The newsgroups cover topics including politics, religion, sports, science, and technology. The dataset was originally collected by Ken Lang in 1995 for text classification research.
Syntax
# Signature of the fetch_20newsgroups() function
sklearn.datasets.fetch_20newsgroups(
    *,
    data_home=None,
    subset='train',
    categories=None,
    shuffle=True,
    random_state=42,
    remove=(),
    download_if_missing=True,
    return_X_y=False,
)
Parameters
*
: Not a parameter. A bare * in a Python signature marks every parameter after it as keyword-only, so all arguments to this function must be passed by name.
data_home
: The directory where the dataset is cached. If not specified, scikit-learn uses its default data directory (~/scikit_learn_data).
subset
: Specifies which part of the dataset to load: "train", "test", or "all" (both subsets combined). Defaults to "train".
categories
: Specifies the categories (newsgroup names) to load. By default, all 20 newsgroups are loaded. If categories is set to a list of strings, only documents from the specified newsgroups are loaded.
shuffle
: Specifies whether to shuffle the documents before returning them. Defaults to True.
random_state
: The random seed used to shuffle the documents when shuffle is True. Defaults to 42.
remove
: A tuple containing any of "headers", "footers", and "quotes". Each entry strips that kind of text from every post: newsgroup headers, signature blocks at the end of a post, and lines quoted from other posts, respectively. No documents are dropped. By default, nothing is stripped.
download_if_missing
: Specifies whether to download the dataset if it is not already present in data_home. Defaults to True. If False and the data is not available locally, an OSError is raised.
return_X_y
: If True, returns a tuple (data, target) where data is a list of strings containing the documents and target is an array of integers containing the labels. If False (the default), returns a dictionary-like object with the fields data, target, target_names, filenames, and DESCR.
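To illustrate what the leading * in the signature means in practice, here is a minimal sketch. The function fetch_stub is a hypothetical stand-in that only mimics the keyword-only shape of the signature; it is not part of scikit-learn and loads no data.

```python
def fetch_stub(*, data_home=None, subset="train"):
    """Hypothetical stand-in that only mimics the keyword-only signature."""
    return {"data_home": data_home, "subset": subset}


# Arguments passed by name are accepted.
print(fetch_stub(subset="test"))

# A positional argument is rejected before the function body runs.
try:
    fetch_stub("test")
except TypeError as exc:
    print("TypeError:", exc)
```

The same rule applies to fetch_20newsgroups itself: a call such as fetch_20newsgroups(subset="test") works, while passing "test" positionally should fail with a TypeError.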
Return value
By default, the fetch_20newsgroups function returns a dictionary-like Bunch object whose fields can be read either as keys or as attributes:
data
: a list of strings, where each string is the text of a newsgroup post.
target
: a NumPy array of integers, where each integer is the index in target_names of the newsgroup to which the corresponding post belongs.
target_names
: a list of strings, where each string is the name of one of the loaded newsgroups (all 20 unless categories is set).
filenames
: the path of each post in the local cache.
DESCR
: a description of the dataset.
This dataset is often used as a benchmark for text classification and topic modeling algorithms in machine learning research.
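As a small illustration of how target and target_names fit together, the sketch below counts posts per newsgroup. The helper posts_per_group and the hand-made demo values are assumptions made for this example, so that it runs without downloading anything.

```python
from collections import Counter


def posts_per_group(target, target_names):
    """Return {newsgroup name: number of posts} for the given labels."""
    counts = Counter(int(label) for label in target)
    return {name: counts.get(index, 0) for index, name in enumerate(target_names)}


# Tiny hand-made stand-in for the `target` and `target_names` fields.
demo_names = ["alt.atheism", "comp.graphics", "sci.space"]
demo_target = [2, 0, 2, 1, 2]

print(posts_per_group(demo_target, demo_names))
# prints {'alt.atheism': 1, 'comp.graphics': 1, 'sci.space': 3}
```

With the real dataset, the equivalent call would be posts_per_group(newsgroups.target, newsgroups.target_names), where newsgroups is the object returned by fetch_20newsgroups().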
Explanation
The following code snippet shows how the fetch_20newsgroups() function loads the “20 Newsgroups” dataset.
from sklearn.datasets import fetch_20newsgroups

# Load the dataset
newsgroups = fetch_20newsgroups()

# Print the target names
print(newsgroups.target_names)

# Print the first document
print(newsgroups.data[0])

# Print the label of the first document
print(newsgroups.target[0])
- Line#1: Imports the fetch_20newsgroups() function from the sklearn.datasets module, which we’ll use to load the 20 Newsgroups dataset.
- Line#4: Calls fetch_20newsgroups() with no arguments, which downloads the dataset if it is not already cached and loads the training subset. The resulting data is stored in a dictionary-like object called newsgroups.
- Line#7: Prints the list of target names for the dataset, representing the 20 different newsgroups that the documents are partitioned across.
- Line#10: Prints the first document in the dataset, which is stored as a string in the data field of the newsgroups object.
- Line#13: Prints the label of the first document in the dataset, which is stored as an integer in the target field of the newsgroups object. The label is an index into target_names, so it lies between 0 and 19 and identifies the newsgroup to which the document belongs. Every document in this dataset is labeled, so the value is never -1.
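The snippet above loads the full training subset with default settings. A variant that picks a subset, restricts the newsgroups, and strips headers, footers, and quotes might look like the sketch below. The wrapper load_subset and the constant VALID_SUBSETS are hypothetical names introduced for this example; only fetch_20newsgroups and its parameters come from scikit-learn.

```python
VALID_SUBSETS = ("train", "test", "all")


def load_subset(subset="train", categories=None,
                strip=("headers", "footers", "quotes")):
    """Hypothetical wrapper: fetch one subset, optionally limited to some newsgroups."""
    if subset not in VALID_SUBSETS:
        raise ValueError(f"subset must be one of {VALID_SUBSETS}, got {subset!r}")

    # Imported lazily so the argument check above works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    # All arguments are passed by keyword, as the signature requires.
    return fetch_20newsgroups(
        subset=subset,
        categories=categories,
        remove=strip,
        shuffle=True,
        random_state=42,
    )
```

Calling load_subset("test", categories=["sci.space", "rec.autos"]) would download the data on first use and return an object whose target_names holds only the two requested newsgroups.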