Key idea

Pretend the features are independent of one another once you know the class, then multiply the per-feature probabilities. The assumption is almost always wrong, but the math becomes so simple, and so fast, that the model is surprisingly competitive on text and other discrete-feature problems.

Each ellipse is axis-aligned — that's the "naive" assumption — and the heatmap is the posterior class probability

Gaussian Naive Bayes models each class with a diagonal-covariance Gaussian: features are assumed independent within each class. That means the ellipses are always parallel to the axes, and in two dimensions the decision boundary is a conic section (it is quadratic in the features). Try the Tilted dataset to see the assumption fail: the true direction of variation is diagonal, but NB can't see it. Despite that, on high-dimensional sparse data (text, bag-of-words) the simplicity often wins anyway.

Naive Bayes asks: given the words in this email, how likely is each class (spam vs. not spam)? It computes that by pretending each word is independent of every other word — which is obviously false in real language — but then it argmaxes over classes, and the wrong probabilities often still rank the classes correctly.

It's fast, trains in a single pass over the data, handles thousands of features without breaking a sweat, and is still the default baseline for many text-classification problems.

Reach for it when

  • Text classification (spam, sentiment, topic)
  • You need a fast baseline before trying anything fancier
  • Training data is small but you have many features
  • You need online / streaming training
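
As a rough sketch of the online / streaming case, scikit-learn's `MultinomialNB.partial_fit` can be fed one mini-batch at a time. The batches, labels, and the choice of `HashingVectorizer` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: there is no vocabulary to fit, so it works on a stream.
# alternate_sign=False keeps features non-negative, as MultinomialNB expects.
vec = HashingVectorizer(n_features=2**16, alternate_sign=False, norm=None)
clf = MultinomialNB()

# Hypothetical mini-batches of (texts, labels); 1 = spam, 0 = not spam.
batches = [
    (["win cash now", "meeting at noon"], [1, 0]),
    (["free prize claim now", "lunch tomorrow?"], [1, 0]),
    (["claim your free cash", "notes from the meeting"], [1, 0]),
]

for texts, labels in batches:
    X = vec.transform(texts)
    # classes is required on the first call; passing it every time is allowed.
    clf.partial_fit(X, labels, classes=[0, 1])

pred = clf.predict(vec.transform(["free cash prize", "meeting notes"]))
print(pred)
```

Each call only updates the running counts, which is why training needs a single pass and no stored history.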

Skip it when

  • Feature interactions matter (the independence assumption costs too much)
  • Continuous features without natural Gaussian structure
  • You need calibrated probability estimates
  • You're modelling images or sequences
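from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# texts: list of raw documents; labels: matching class labels (assumed 0 = ham, 1 = spam)
vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)

# Rank words by log P(word | spam) - log P(word | ham); raw P(word | spam)
# alone would mostly surface words that are common in every class.
log_ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
top = log_ratio.argsort()[-10:][::-1]
print("Top spam words:", vec.get_feature_names_out()[top])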
Want the actual Bayes' rule?
Key idea

$$ P(y \mid \mathbf{x}) \;\propto\; P(y)\,\prod_{i=1}^{d} P(x_i \mid y) $$

  • P(y): class prior, the proportion of each class in the training data
  • P(x_i | y): per-feature likelihood, modelled by a simple distribution
  • The "naive" part is the product: features are assumed independent given y

Bayes' rule plus the conditional-independence assumption. The denominator P(x) doesn't depend on y, so it drops out when comparing classes.
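
The rule above can be sketched in a few lines of plain Python. The priors and word probabilities are made-up numbers for illustration (an excerpt of a larger vocabulary), and the calculation runs in log space to avoid underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical priors P(class) and word probabilities P(word | class).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.02, "now": 0.20},
    "ham":  {"free": 0.03, "meeting": 0.25, "now": 0.10},
}

def posterior(words):
    # log P(y) + sum_i log P(x_i | y), one score per class.
    log_joint = {
        c: math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)
        for c in prior
    }
    # Normalising is the division by P(x), done stably in log space.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {c: math.exp(v - m) / total for c, v in log_joint.items()}

post = posterior(["free", "now"])
print(post)
print(max(post, key=post.get))
```

For the message "free now" the unnormalised scores are 0.4 · 0.30 · 0.20 = 0.024 for spam and 0.6 · 0.03 · 0.10 = 0.0018 for ham, so the posterior for spam is about 0.93. The argmax would be the same without the normalisation step.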

Variants by likelihood. Multinomial NB models word counts and is the classic choice for text. Bernoulli NB models presence / absence of words (useful for short documents). Gaussian NB models continuous features as Gaussian per class. The structure is the same; only the per-feature density changes.

Laplace smoothing. If a word never appeared with class y in training, P(x_i | y) = 0 would zero the entire product. Add-one (Laplace) smoothing adds a pseudo-count of one to every word count in every class, so no estimate is exactly zero. It is equivalent to taking the posterior mean under a uniform Dirichlet prior on the categorical likelihood.
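
A small worked example of additive smoothing, using a three-word vocabulary and token list invented for illustration. The helper `smoothed_probs` is hypothetical, not a library function.

```python
from collections import Counter

def smoothed_probs(tokens, vocab, alpha=1.0):
    """P(word | class) with additive smoothing; alpha=1 is add-one (Laplace).

    Assumes every token is in vocab.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

# Hypothetical spam training tokens; "meeting" never appears with this class.
vocab = ["free", "cash", "meeting"]
spam_tokens = ["free", "cash", "free", "free"]

unsmoothed = smoothed_probs(spam_tokens, vocab, alpha=0.0)
laplace = smoothed_probs(spam_tokens, vocab, alpha=1.0)

print(unsmoothed)
print(laplace)
```

Without smoothing, "meeting" gets probability 0 and would veto the spam class for any message containing it. With add-one smoothing it gets (0 + 1) / (4 + 3) = 1/7, and the estimates still sum to one.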

Why it works despite being wrong. Naive Bayes is a biased probability estimator but a competitive ranker: the argmax over classes is often correct even when the individual probabilities are miscalibrated. For decision-making it's fine; for confidence estimates it isn't.
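
One way to see the ranking stay right while the probability goes wrong is to feed the same piece of evidence in several times, as perfectly correlated features would. The 3:1 likelihood ratio and the helper `spam_posterior` are assumptions chosen for illustration.

```python
import math

def spam_posterior(log_ratios, prior_spam=0.5):
    """P(spam | x) from per-feature log-likelihood ratios, naive-Bayes style."""
    log_odds = math.log(prior_spam / (1 - prior_spam)) + sum(log_ratios)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical evidence: one feature whose likelihood ratio favours spam 3:1.
one_signal = [math.log(3)]
# The same signal entered three times, e.g. three perfectly correlated words.
triple_counted = [math.log(3)] * 3

p_once = spam_posterior(one_signal)
p_triple = spam_posterior(triple_counted)
print(p_once, p_triple)
```

The posterior moves from 0.75 to 27/28 (about 0.96) although no new information arrived. Both values are above 0.5, so the predicted class is unchanged; only the confidence is inflated.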

Reach for it when

  • Bag-of-words text classification
  • Document filtering with online updates
  • You want a transparent baseline you can read off
  • Class priors shift between train and test (you can re-prior easily)

Skip it when

  • Probability calibration matters — the over-confidence is notorious
  • Features are heavily correlated and you can't decorrelate them
  • Continuous features that are multi-modal per class
  • Sequence structure matters (use a sequence model)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=5, ngram_range=(1, 2))),
    ("nb",    MultinomialNB(alpha=0.1)),    # additive (Lidstone) smoothing; alpha=1.0 is Laplace
])
pipe.fit(X_train_text, y_train)

y_pred = pipe.predict(X_test_text)
print(classification_report(y_test, y_pred))
Want the connection to logistic regression and the failure modes?
Log-posterior (linear in features)

$$ \log P(y \mid \mathbf{x}) \;=\; \log P(y) \;+\; \sum_{i=1}^{d} \log P(x_i \mid y) \;+\; \text{const} $$

  • The log-posterior is linear in the features (for many likelihood families)
  • Naive Bayes is therefore a linear classifier in disguise
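
The "linear classifier in disguise" claim can be checked directly for the multinomial case. This is a minimal sketch on a tiny count matrix invented for illustration; it reads the weights and biases from scikit-learn's fitted `feature_log_prob_` and `class_log_prior_` attributes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents, 3 vocabulary terms.
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 1],
              [1, 3, 2]])
y = np.array([1, 1, 0, 0])

clf = MultinomialNB(alpha=1.0).fit(X, y)

# Unnormalised log-posterior as an affine function of the counts:
# weights = log P(term | y), bias = log P(y).
W = clf.feature_log_prob_       # shape (n_classes, n_features)
b = clf.class_log_prior_        # shape (n_classes,)
scores = X @ W.T + b

# Subtracting the per-row log-sum-exp (the "const" term) gives log P(y | x).
log_post = scores - np.logaddexp.reduce(scores, axis=1, keepdims=True)
matches_proba = np.allclose(log_post, clf.predict_log_proba(X))
matches_pred = np.array_equal(clf.classes_[scores.argmax(axis=1)], clf.predict(X))
print(matches_proba, matches_pred)
```

The same weight-matrix-plus-bias form is what logistic regression learns; the two models differ in how the weights are estimated, not in the shape of the decision function.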

NB vs. logistic regression. For multinomial, Bernoulli, and shared-variance Gaussian likelihoods, both models produce decision boundaries that are linear in the features; they differ in how the weights are fit. NB is a generative model: it models P(x, y) = P(y)·P(x|y). Logistic regression is discriminative: it models P(y | x) directly. Ng & Jordan (2001) showed that NB reaches its (worse) asymptotic error faster, so NB tends to win at small N and LR at large N.

Complement NB (Rennie et al. 2003). For unbalanced text classification, estimating each class's parameters from the documents of all other classes, i.e. modelling P(x | ¬y) instead of P(x | y), and flipping signs reduces the bias toward the majority class. Implemented in sklearn as ComplementNB, it often beats MultinomialNB on real text.

Calibration. NB's probability estimates are systematically over-confident because the independence assumption treats correlated features as separate pieces of evidence, so the same signal gets counted multiple times. Post-hoc calibration (Platt scaling, isotonic regression) on a held-out set can repair the probabilities without retraining the underlying classifier.

Connection to language modelling. A unigram language model with class-conditional unigrams is multinomial NB. n-gram smoothing techniques (Good-Turing, Kneser-Ney) directly carry over.

Reach for it when

  • Limited labelled data, many features: NB approaches its asymptotic error with fewer examples
  • Streaming setting with class priors that drift
  • Interpretable baseline alongside a heavier model for ablation
  • You need very fast inference (e.g. online filtering)

Skip it when

  • Long-form text or sequence structure — n-gram NB caps at small windows
  • Continuous features are multi-modal: Gaussian NB will fit them badly
  • Your probabilities feed a downstream Bayesian computation
  • Adversarial inputs — easily exploited via crafted feature combinations
from sklearn.naive_bayes import ComplementNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# ComplementNB for imbalanced text + isotonic calibration for honest probabilities
base = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=5, sublinear_tf=True, ngram_range=(1, 2))),
    ("nb",    ComplementNB(alpha=0.1)),
])
clf = CalibratedClassifierCV(base, method="isotonic", cv=5)
clf.fit(X_train_text, y_train)   # raw documents: the pipeline vectorizes them itself

# clf.predict_proba now averages 5 pipelines, each calibrated on its own held-out fold