Scikit-learn (also called sklearn) is a free, open-source Python library for machine learning. It provides simple and efficient tools for:
- Supervised Learning (classification, regression)
- Unsupervised Learning (clustering, dimensionality reduction)
- Model evaluation (cross-validation, metrics)
- Data preprocessing (scaling, feature extraction)
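As a rough sketch of how these four areas fit together, the snippet below chains a preprocessing step (scaling) with a supervised classifier and scores it with cross-validation. The bundled iris dataset, logistic regression, and the 5-fold setting are illustrative assumptions, not choices made by the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset that ships with scikit-learn (assumed here for illustration)
X, y = load_iris(return_X_y=True)

# Chain preprocessing and model so the scaler is refit inside each fold
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model evaluation: accuracy on each of 5 cross-validation folds
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Wrapping the scaler in the pipeline keeps the held-out fold from influencing the scaling statistics.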
Key Features:
- Built on NumPy, SciPy, and Matplotlib.
- Easy-to-use API for training models (fit(), predict()).
- Includes popular algorithms (e.g., SVM, Random Forest, Logistic Regression).
Example Use Case:
python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)          # Train
predictions = model.predict(X_test)  # Predict
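The snippet above assumes that X_train, y_train, and X_test already exist. One way to obtain them, sketched here with the bundled iris dataset and a 75/25 split (both are assumptions for illustration), is:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load features and labels, then hold out 25% of the rows for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same fit/predict calls as above, with a fixed seed for repeatability
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Fraction of held-out samples classified correctly
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.3f}")
```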
What is NLTK?
Natural Language Toolkit (NLTK) is a Python library for working with human language data (text). It’s widely used for:
- Tokenization (splitting text into words/sentences)
- Stemming/Lemmatization (reducing words to root forms)
- Stopword removal (filtering out common words like “the”)
- Part-of-speech tagging (identifying nouns, verbs, etc.)
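A minimal sketch of the first three steps is shown below. To keep it runnable without downloading any NLTK data, it assumes the regex-based wordpunct_tokenize and a tiny hand-written stop list in place of NLTK's stopwords corpus; part-of-speech tagging is left out because it needs a downloaded tagger model.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "The cats are running"

# Tokenization: split lowercased text into word and punctuation tokens
tokens = wordpunct_tokenize(text.lower())

# Stopword removal: hand-written placeholder list, not NLTK's corpus
stop_words = {"the", "are"}
content_tokens = [tok for tok in tokens if tok not in stop_words]

# Stemming: reduce each remaining word to its Porter stem
stemmer = PorterStemmer()
stems = [stemmer.stem(tok) for tok in content_tokens]

print(stems)  # ['cat', 'run']
```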
Key Features:
- Includes corpora (sample datasets) for training.
- Supports sentiment analysis and named entity recognition (NER).
- Integrates with scikit-learn for ML-based NLP.
Example Use Case:
python
import nltk
from nltk.tokenize import word_tokenize

# Download tokenizer data (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")

text = "Hello, world! This is NLP."
tokens = word_tokenize(text)  # Split into words and punctuation
print(tokens)
# Output: ['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']
How Scikit-learn and NLTK Work Together
- NLTK preprocesses text (cleaning, tokenizing).
- Scikit-learn converts text to numeric features (e.g., bag-of-words counts, TF-IDF) and trains ML models.
Example: Sentiment Analysis Pipeline
python
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

nltk.download("stopwords")

# Step 1: NLTK supplies the stopword list (kept as a list, which TfidfVectorizer expects)
stop_words = stopwords.words("english")

# Step 2: Scikit-learn for feature extraction
tfidf = TfidfVectorizer(stop_words=stop_words)
X = tfidf.fit_transform(["I love this movie!", "It was terrible."])

# Step 3: Train a classifier
model = LinearSVC()
model.fit(X, [1, 0])  # 1=positive, 0=negative
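The same division of labour can also be packaged as a single scikit-learn pipeline, so that raw strings go in and labels come out. The sketch below is one possible arrangement, not the only one: stem_tokens is a hypothetical helper, the four training sentences are made up, and wordpunct_tokenize with PorterStemmer is assumed because neither needs an NLTK data download.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def stem_tokens(text):
    """Hypothetical helper: tokenize with NLTK, keep alphabetic tokens, stem them."""
    return [stemmer.stem(tok) for tok in wordpunct_tokenize(text) if tok.isalpha()]

# Toy training data, made up for illustration; 1=positive, 0=negative
texts = [
    "I love this movie!",
    "A wonderful, moving film.",
    "It was terrible.",
    "I hated every boring minute.",
]
labels = [1, 1, 0, 0]

# NLTK handles tokenizing and stemming; scikit-learn handles TF-IDF and the classifier
pipeline = make_pipeline(
    TfidfVectorizer(tokenizer=stem_tokens, token_pattern=None),
    LinearSVC(),
)
pipeline.fit(texts, labels)

# New, unseen strings go through the same preprocessing automatically
predictions = pipeline.predict(["I loved it", "It was awful"])
print(predictions)
```

With only four training sentences the predictions are not meaningful; the point is that the fitted pipeline applies identical preprocessing at training and prediction time.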
When to Use Each
Task | Tool |
---|---|
Machine Learning (general) | Scikit-learn |
Text preprocessing | NLTK |
Deep Learning (NLP) | TensorFlow/PyTorch |
Production NLP pipelines | spaCy |
Installation
bash
pip install scikit-learn nltk
NLTK Data Download (run once in Python):
python
import nltk

nltk.download("punkt")      # For tokenizers
nltk.download("stopwords")  # For stopword lists
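Because the data only needs to be fetched once, a script can check for it before downloading. The helper below is a hypothetical sketch (ensure_nltk_data is not part of NLTK); it relies on nltk.data.find raising LookupError when a resource is missing, and the resource paths in the comments are the usual locations for these two packages.

```python
import nltk

def ensure_nltk_data(resource_path, package):
    """Download `package` only if `resource_path` is not already installed.

    Returns True if a download was attempted, False if the data was found.
    """
    try:
        nltk.data.find(resource_path)
        return False
    except LookupError:
        nltk.download(package, quiet=True)
        return True

# Typical calls (left commented out so importing this file stays offline):
# ensure_nltk_data("tokenizers/punkt", "punkt")
# ensure_nltk_data("corpora/stopwords", "stopwords")
```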
Summary
- Scikit-learn: Swiss Army knife for ML (non-deep learning).
- NLTK: NLP-focused toolkit for text processing.