Regression models predict continuous values by learning relationships between variables. They are fundamental building blocks in machine learning and statistics, providing interpretable results and serving as a baseline for more complex models.
Regression models come in two main flavors: linear regression for continuous outcomes and logistic regression for binary classification. Both are interpretable and serve as excellent baselines before trying more complex models.
Linear regression models the relationship between a dependent variable and one or more independent variables with a linear equation of the form y = β₀ + β₁x₁ + ... + βₙxₙ + ε. It is particularly useful because its coefficients can be read directly as the effect of each feature on the outcome, it is cheap to fit, and it provides a strong, interpretable baseline before moving to more complex models.
Logistic regression, despite its name, is a classification algorithm that predicts binary outcomes. It passes a linear combination of the features through the sigmoid function, σ(z) = 1 / (1 + e^(-z)), to produce a probability between 0 and 1. It is valuable because its coefficients translate into odds ratios, which keeps the model interpretable, and it serves as a dependable baseline for classification tasks.
| Aspect | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Output Type | Continuous values | Binary probabilities |
| Interpretability | Direct coefficient interpretation | Odds ratio interpretation |
| Use Cases | Price prediction, trend analysis | Classification, risk assessment |
| Assumptions | Linear relationship, normal errors | Linearity of the logit, independence |
| Regularization | Ridge, Lasso, Elastic Net | Same options available |
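The example below walks through a basic linear regression workflow in scikit-learn, including the Ridge and Lasso variants from the table. It assumes X is a pandas DataFrame of features and y a continuous target drawn from your own dataset.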
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Prepare your data (X is assumed to be a pandas DataFrame of features, y a continuous target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Linear Regression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
# Lasso Regression
lasso = Lasso(alpha=1.0)
lasso.fit(X_train_scaled, y_train)
# Model Evaluation
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print results
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
# Feature importance: coefficients of the linear model on the standardized features
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)
print(feature_importance)
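Elastic Net, the third regularization option listed in the comparison table, blends the L1 and L2 penalties. A minimal sketch, assuming the same scaled training and test splits as above:

from sklearn.linear_model import ElasticNet

# Elastic Net blends L1 and L2 penalties; l1_ratio controls the mix (0 = pure Ridge, 1 = pure Lasso)
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train_scaled, y_train)
print(f"Elastic Net R-squared Score: {elastic.score(X_test_scaled, y_test):.2f}")

The logistic regression workflow below follows the same pattern; it assumes X is a pandas DataFrame of features and y a binary (0/1) target.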
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
# Prepare your data (X is assumed to be a pandas DataFrame of features, y a binary target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Preserve the class proportions in both splits
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize and train the model
log_reg = LogisticRegression(
    C=1.0,               # Inverse of regularization strength (smaller C = stronger regularization)
    penalty='l2',        # L2 regularization
    solver='liblinear',  # Optimization algorithm; works well for smaller datasets
    random_state=42
)
log_reg.fit(X_train_scaled, y_train)
# Make predictions
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
# Print results
print(f"Accuracy: {accuracy:.2f}")
print(f"ROC AUC Score: {roc_auc:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Feature importance (coefficients)
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': log_reg.coef_[0]
}).sort_values('coefficient', ascending=False)
# Calculate odds ratios
feature_importance['odds_ratio'] = np.exp(feature_importance['coefficient'])
print("\nFeature Importance (Odds Ratios):")
print(feature_importance)
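Linear models can also capture curved relationships by expanding the inputs with polynomial terms. The final example fits ordinary linear regression on degree-2 polynomial features; as before, X is assumed to be a pandas DataFrame of numeric features and y a continuous target.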
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Create degree-2 polynomial features (X is assumed to be a pandas DataFrame of numeric features)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_poly, y,
    test_size=0.2,
    random_state=42
)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
# Map coefficients back to the expanded polynomial feature names
feature_names = poly.get_feature_names_out(X.columns)
coefficients = pd.DataFrame({
    'feature': feature_names,
    'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)
print(coefficients)
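The Pipeline import above can bundle the polynomial expansion and the regression into a single estimator, which is convenient for cross-validation and reuse. A minimal sketch under the same assumptions about X and y:

# Split the raw (un-expanded) features, then let the pipeline handle the expansion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('linreg', LinearRegression())
])
poly_model.fit(X_tr, y_tr)
print(f"Pipeline R-squared Score: {poly_model.score(X_te, y_te):.2f}")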