Linear & Logistic Regression

Regression models predict outcomes by learning relationships between input variables and a target. They are fundamental building blocks in machine learning and statistics, providing interpretable results and serving as a baseline for more complex models.

Core Concepts

Regression models come in two main flavors: linear regression for continuous outcomes and logistic regression for binary classification. Both are interpretable and serve as excellent baselines before trying more complex models.

  • Linear Regression

    Linear regression models the relationship between a dependent variable and one or more independent variables using a linear equation. It's particularly useful because:

    • It's highly interpretable: coefficients directly show each feature's effect on the outcome
    • It's computationally efficient to fit
    • It supports standard significance tests for its coefficients
    • It serves as a baseline for more complex models

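    In standard notation, the model for p features is

        \hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p

    where each coefficient \beta_j is the expected change in the outcome for a one-unit increase in x_j, holding the other features fixed.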


  • Logistic Regression

    Logistic regression, despite its name, is actually a classification algorithm that predicts binary outcomes. It's valuable because:

    • It provides probability estimates for predictions
    • It's interpretable through odds ratios
    • It can be regularized to prevent overfitting
    • It's often used as a baseline for binary classification

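    In standard notation, logistic regression passes a linear score through the sigmoid function:

        P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}

    Exponentiating a coefficient, e^{\beta_j}, gives its odds ratio: the multiplicative change in the odds of the positive class for a one-unit increase in x_j.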

Aspect             Linear Regression                      Logistic Regression
-----------------  -------------------------------------  --------------------------------
Output Type        Continuous values                      Binary probabilities
Interpretability   Direct coefficient interpretation      Odds ratio interpretation
Use Cases          Price prediction, trend analysis       Classification, risk assessment
Assumptions        Linear relationship, normal errors     Logit linearity, independence
Regularization     Ridge, Lasso, Elastic Net              Same options available

Implementation Examples
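
The three examples below are sketches rather than complete scripts: in each one, X (a pandas DataFrame of features) and y (the target) are assumed to be defined already. The first example fits ordinary least squares alongside Ridge and Lasso regression on standardized features, then evaluates on a held-out test set.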

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Prepare your data (X: numeric feature DataFrame, y: continuous target;
# both assumed to be defined already)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Linear Regression
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Lasso Regression
lasso = Lasso(alpha=1.0)
lasso.fit(X_train_scaled, y_train)

# Evaluate the plain linear model on the held-out test set
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print results
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

# Rank features by coefficient magnitude (comparable here because the
# features were standardized); the sign gives the direction of the effect
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': model.coef_
}).sort_values('coefficient', ascending=False, key=np.abs)
print(feature_importance)
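
The alpha=1.0 values above are arbitrary defaults; in practice the regularization strength is tuned rather than fixed. A minimal sketch of doing so with scikit-learn's cross-validated estimators, continuing from the variables above:

from sklearn.linear_model import RidgeCV, LassoCV

# Search a logarithmic grid of candidate regularization strengths
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas).fit(X_train_scaled, y_train)
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=42).fit(X_train_scaled, y_train)

print(f"Best Ridge alpha: {ridge_cv.alpha_}")
print(f"Best Lasso alpha: {lasso_cv.alpha_}")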

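The second example applies logistic regression to binary classification; as before, X and y are assumed to be defined, with y holding exactly two classes.
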
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Prepare your data (X: numeric feature DataFrame, y: binary target;
# both assumed to be defined already)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Preserve the class ratio in both splits
)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the model
log_reg = LogisticRegression(
    C=1.0,              # Inverse of regularization strength
    penalty='l2',       # L2 regularization
    solver='liblinear', # Algorithm for optimization
    random_state=42
)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print results
print(f"Accuracy: {accuracy:.2f}")
print(f"ROC AUC Score: {roc_auc:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance (coefficients)
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': log_reg.coef_[0]
}).sort_values('coefficient', ascending=False)

# Calculate odds ratios
feature_importance['odds_ratio'] = np.exp(feature_importance['coefficient'])
print("\nFeature Importance (Odds Ratios):")
print(feature_importance)
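
For interpretation: an odds ratio of 1.5 for a standardized feature means a one-standard-deviation increase multiplies the odds of the positive class by 1.5, while a ratio below 1 pushes predictions toward the negative class.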

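The final example captures non-linear relationships by expanding the inputs with polynomial features before fitting an ordinary linear regression. X is again assumed to be a DataFrame, so column names are available for the expanded terms.
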
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Expand the features with squares and pairwise interaction terms (degree=2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_poly, y, 
    test_size=0.2, 
    random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

# Inspect the learned terms (the features are unscaled here, so coefficient
# magnitudes are not directly comparable across terms)
feature_names = poly.get_feature_names_out(X.columns)
coefficients = pd.DataFrame({
    'feature': feature_names,
    'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)
print(coefficients)
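
The polynomial degree is itself a hyperparameter: higher degrees fit the training data better but overfit quickly. One way to choose it, sketched here with a scikit-learn Pipeline and 5-fold cross-validation over the same assumed X and y:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

for degree in [1, 2, 3, 4]:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('reg', LinearRegression()),
    ])
    # Mean R-squared across 5 folds; a drop at higher degrees signals overfitting
    scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
    print(f"degree={degree}: mean R^2 = {scores.mean():.3f}")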