Summary
This blog post introduces Machine Learning in Python, focusing on setting up the programming environment, managing packages with Conda, and installing essential Machine Learning packages such as scikit-learn, pandas, numpy, and others. The post walks through installing Jupyter Lab for working with data and figures in Python, and emphasizes the importance of using pipelines in scikit-learn to streamline Machine Learning workflows and prevent common errors. It also highlights the Yellowbrick library for extending the data visualization capabilities available in scikit-learn, offering practical examples of how to effectively visualize and interpret Machine Learning model performance. Finally, the post provides troubleshooting tips for common scikit-learn errors and advice on efficient data visualization with Matplotlib and Seaborn.
Introduction
Welcome! This blog post will serve as your introduction to Machine Learning in Python. It is designed to set you up with the foundational tools and resources you will rely on throughout OMSCS 7641, and is intended as a practical crash course in setting up your environment and understanding the purpose of each tool for data science.
Together we will cover how to set up your programming environment, troubleshoot some common technical issues, and effectively present your analytical findings during our assignments. This guide is your first stepping stone toward Machine Learning coursework and research. Let's begin!
Setting Up Your Machine Learning Environment
Quick Intro to Conda
Homepage: https://www.anaconda.com/
Documentation: https://docs.anaconda.com/index.html
Conda is an indispensable tool for data scientists and Machine Learning practitioners. As a package and environment manager for Python, Conda handles the task of installing and updating packages across different environments, helping ensure that your projects remain consistent and reproducible across multiple platforms.
Installation (Graphic Interface)
The best approach is to visit the official Conda website and download the appropriate installer for your Operating System: https://www.anaconda.com/download. Execute the downloaded installer and follow the instructions presented. It is also possible to download and install Anaconda using nothing but your command line. For that, we suggest consulting the Anaconda documentation https://docs.anaconda.com/free/anaconda/install/ or referring to one of the many blog posts on the matter.
Verify and update the installation
Ensure Conda was installed successfully by typing conda --version in your terminal or command prompt. If the installation was successful, the terminal will print out the version number.
In addition, you can run conda update conda to make sure the version you have installed is indeed the latest one. Note that this updates your Conda installation itself, not the packages installed in your environments.
Create a new Conda Environment
Environments are isolated containers within your computer where packages can be installed and version controlled. In order to install our desired Python packages, we first need to set up such an environment.
First, launch your terminal or command prompt and create a new environment by executing: conda create --name cs7641 python=3.8
In this case we created a new environment named cs7641, which we will use while working on the Machine Learning course. Choosing python=3.8 ensures compatibility and stability with a wide array of ML libraries. If you encounter a framework or package that requires Python 3.10, it is possible to upgrade your environment's python, or simply start a new environment with python=3.10.
Activate your environment by typing conda activate cs7641, and you should notice that the command line now contains a reference to your environment name, i.e. (cs7641) > $
Install ML Packages
With your new environment active, it's time to install the most relevant packages for ML with Python. From your terminal, run conda install scikit-learn pandas numpy scipy matplotlib seaborn yellowbrick. This command tells conda to install some of the most commonly used packages for ML (a quick way to verify the install follows this list):
- scikit-learn for machine learning algorithms (docs: https://scikit-learn.org/stable/)
- pandas for data manipulation and analysis (docs: https://pandas.pydata.org/docs/)
- numpy for numerical and matrix operations (docs: https://numpy.org/doc/)
- scipy for scientific computing (docs: https://docs.scipy.org/doc/scipy/)
- matplotlib, seaborn, and yellowbrick for data visualization:
  - https://matplotlib.org/stable/index.html
  - https://seaborn.pydata.org/
  - https://www.scikit-yb.org/en/latest/
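Once the install finishes, you can verify everything is importable with a quick sanity check from Python. This is a minimal sketch that simply imports each package and prints its version:
import sklearn, pandas, numpy, scipy, matplotlib, seaborn, yellowbrick
# Each package exposes __version__; any ImportError means the install failed
for pkg in (sklearn, pandas, numpy, scipy, matplotlib, seaborn, yellowbrick):
    print(pkg.__name__, pkg.__version__)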
Install Jupyter Lab
Homepage: https://jupyter.org/
Documentation: https://docs.jupyter.org/en/latest/
Now we will install Jupyter Lab to provide us with a simple graphical interface for working with data and figures in Python. The most common software for this is Jupyter (Notebook or Lab); we opt for Lab due to its richer features and multi-tab interface. You can install Jupyter Lab in your conda environment by running conda install -c conda-forge jupyterlab. Here we are telling Conda to use the channel conda-forge by including the -c flag in our command.
After installing jupyterlab, we can launch it by running jupyter lab in our terminal. Please note that the terminal will start a small HTTP service that provides access to Jupyter through your default browser. Running jupyter lab starts the Jupyter Lab server and typically opens your default web browser to the Jupyter Lab interface. If for some reason the site does not open automatically, you can open a browser and manually navigate to the URL printed in the terminal after launching the service (usually something like http://localhost:8888/). You can stop the service without closing the terminal window by pressing Ctrl+C (Command+C on macOS).
Deactivating the Conda Environment
If you have completed your work in the terminal and wish to exit the class Conda environment, you can do so by typing conda deactivate to return to your default shell environment.
Dealing with Common Errors in Scikit-Learn
Preventing Errors by using Pipelines in Scikit-Learn
Our opinion is that learning to use the scikit-learn “pipeline” is a game-changer for streamlining your Machine Learning workflows. Pipelines enable you to encapsulate multiple processing steps into a single, manageable unit, ensuring your code is not only cleaner but also more robust against common data handling errors. Oftentimes errors creep into Jupyter Notebooks when cells are run out of order, or when a cell's text is changed before or after being executed, which complicates debugging. One of the best ways to get ahead of this is to use pipelines to reduce the chance of human error. By isolating each step, we also ensure that preprocessing steps are fit only on the training data, a strong safeguard against subtle data leakage or incorrect data augmentations.
# Example of a scikit-learn pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the steps in the pipeline
steps = [
    ("scaler", StandardScaler()),      # Preprocessing step
    ("rf", RandomForestClassifier()),  # Estimator step
]

# Create the pipeline
pipeline = Pipeline(steps)

# Define the parameter grid (step name + double underscore + parameter name)
param_grid = {
    "rf__n_estimators": [50, 100, 200],
    "rf__max_depth": [None, 5, 10, 15],
}

# Set up the GridSearchCV
cv = GridSearchCV(pipeline, param_grid, cv=5)

# Fit the model with GridSearchCV to find the best parameters
cv.fit(X_train, y_train)

# Best parameters found
print("Best parameters found: ", cv.best_params_)

# GridSearchCV already refits the best pipeline on the full training data
# (refit=True by default), so best_estimator_ is ready to use as-is
best_pipeline = cv.best_estimator_

# Predictions
predictions = best_pipeline.predict(X_test)
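As a quick follow-up, a minimal sketch using standard scikit-learn metrics to score the held-out test set:
from sklearn.metrics import accuracy_score, classification_report
# Evaluate the tuned pipeline on data it never saw during fitting or tuning
print("Test accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions, target_names=data.target_names))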
Troubleshooting Problems with Scikit-Learn
Troubleshooting in Scikit-Learn is an integral part of the Machine Learning process. Let's cover how to read the errors scikit-learn produces, and then review a brief list of common issues and their solutions.
Understanding the Error Output
In our Machine Learning class, encountering software errors is not a setback but part of the learning curve. We consider learning how to debug and deal with these common problems a critical part of being an effective computer scientist. When it comes to scikit-learn, you will encounter not only standard Python errors but also a new category of errors related to the ML process, so understanding how to interpret and resolve them is crucial. Let's demystify the process of troubleshooting in Scikit-Learn!
Interpreting Error Messages
Errors in Python are not cryptic codes; they are your guides! Each error message in Scikit-Learn generally includes the type of error, a description, and often, the location where it occurred. Here’s how to tackle them:
- Identify the Error Type: Python specifies the error type (e.g., ValueError, TypeError). This classification is your first clue and directs you towards the nature of the problem.
- Analyze the Description: The error message will often explain what went wrong. It might indicate that a function received an unexpected type of argument or that a required library is missing.
- Locate the Error: The message will usually point to a line number or a specific part of your code. This pinpointing is invaluable as it narrows down the area you need to review. (A worked example follows this list.)
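To make this concrete, here is a small sketch that deliberately triggers a common scikit-learn error (the sample mismatch covered in the list below) so you can practice reading each part of the message:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0]])  # 3 samples
y = np.array([0, 1])                 # only 2 labels -- mismatched on purpose

try:
    LogisticRegression().fit(X, y)
except ValueError as e:
    # The error type plus the description point straight at the problem, e.g.
    # "ValueError - Found input variables with inconsistent numbers of samples: [3, 2]"
    print(type(e).__name__, "-", e)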
Common Scikit-Learn Errors
- NaN or Infinite Values Error
  - Issue: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
  - Cause: The data contains missing (NaN) or infinite values that the model cannot process.
  - Solution: Apply data preprocessing to handle NaNs and infinite values (see the sketch after this list).
- Attribute Access on NoneType
  - Issue: AttributeError: 'NoneType' object has no attribute 'something'
  - Cause: Null object reference.
  - Solution: Ensure all objects in your code are properly initialized and not inadvertently set to None.
- Sample Number Mismatch
  - Issue: ValueError: Found input variables with inconsistent numbers of samples: [x, y]
  - Cause: Inconsistency between the number of samples in your data and labels.
  - Solution: Confirm the dimensions of your feature matrix X and target vector y.
- Index Out of Bounds
  - Issue: IndexError: Indices are out of bounds
  - Cause: Attempting to reference an array element or DataFrame row/column that doesn't exist.
  - Solution: Verify the range of your data structures and the indices you're accessing.
- Unexpected Data Types
  - Issue: ValueError: could not convert string to float: 'some_string'
  - Cause: The model expects numerical inputs, but the data contains non-numeric (e.g., categorical) data.
  - Solution: Convert categorical data into a numerical format using techniques like one-hot encoding or label encoding (see the sketch after this list).
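The following sketch, with toy data and column names of our own invention, shows how a ColumnTransformer can resolve both the NaN error and the string-to-float error in one preprocessing step:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data exhibiting both problems: a missing value and a string column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],               # NaN would trigger the first error
    "city": ["Atlanta", "Austin", "Atlanta"],  # strings cannot convert to float
})

preprocess = ColumnTransformer([
    ("impute", SimpleImputer(strategy="mean"), ["age"]),           # fill NaNs
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # one-hot encode
])
print(preprocess.fit_transform(df))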
Yellowbrick
Yellowbrick is a package that extends the visualization capabilities of Scikit-Learn by providing tools to generate many common ML graphs. Effective visualization is not just about making your data look good; it is about communicating ideas through visual information to help build the narrative of your analysis.
Feature Importance
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.model_selection import FeatureImportances

# Reuses X_train and y_train from the pipeline example above
model = RandomForestClassifier()
model.fit(X_train, y_train)
viz = FeatureImportances(model)
viz.fit(X_train, y_train)
viz.show()
Confusion Matrix to get a clear picture of where your model is succeeding and, more importantly, where it needs improvement.
from yellowbrick.classifier import ConfusionMatrix

# class_names comes from the iris dataset loaded earlier
class_names = data.target_names
cm = ConfusionMatrix(model, classes=class_names)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.show()
Learning Curves to visualize how your model learns and to diagnose issues like overfitting or underfitting.
from yellowbrick.model_selection import LearningCurve
viz = LearningCurve(model)
viz.fit(X_train, y_train)
viz.show()
Validation Curves to understand how your model’s performance changes with different parameter values.
import numpy as np
from yellowbrick.model_selection import ValidationCurve

# Example sweep (our choice of hyperparameter): vary the forest's max_depth from 1 to 10
param_range = np.arange(1, 11)
viz = ValidationCurve(model, param_name="max_depth", param_range=param_range)
viz.fit(X_train, y_train)
viz.show()
ROC-AUC Curves to assess your model's performance on classification tasks; Yellowbrick draws per-class curves for multiclass problems like iris.
from yellowbrick.classifier import ROCAUC
roc_viz = ROCAUC(model)
roc_viz.fit(X_train, y_train)
roc_viz.score(X_test, y_test)
roc_viz.show()
You can learn more by checking out the official documentation and resources:
- https://www.scikit-yb.org/en/latest/index.html
- https://www.scikit-yb.org/en/latest/quickstart.html
- https://youtu.be/nR3qyZEJQbY
- https://youtu.be/2ZKng7pCB5k
- https://www.slideshare.net/RebeccaBilbro/learning-machine-learning-with-yellowbrick
- https://towardsdatascience.com/introduction-to-yellowbrick-a-python-library-to-explain-the-prediction-of-your-machine-learning-d63ecee10ecc
- https://hersanyagci.medium.com/yellowbrick-machine-learning-visualization-a8c2e9cae78e
- https://coderzcolumn.com/tutorials/machine-learning/yellowbrick-visualize-sklearn-classification-and-regression-metrics-in-python
Common Questions about Yellowbrick
Efficient Plot Saving: Often, you will want your plots saved directly without the hassle of pop-up dialogs getting in the way. You can streamline this process by tweaking Matplotlib’s backend settings. This approach is especially useful when you’re running large amounts of code where you can’t or don’t want to display plots on-screen. Here’s how to set it up:
import matplotlib
matplotlib.rcParams["interactive"] = False
matplotlib.use("Agg")
By setting the interactive parameter to False and using the 'Agg' backend, you're instructing Matplotlib to operate in a non-interactive mode and render plots to a file instead of displaying them on-screen. This setup is perfect for saving plots programmatically without any interruption.
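With the Agg backend active, Yellowbrick visualizers can then write figures straight to disk through the outpath argument of show(); the filename below is just an example:
viz.show(outpath="learning_curve.png")  # saves the figure instead of displaying it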
Axis Label Customization: Well-labeled plots help communicate information efficiently. You will want to make sure the labels, size, color, etc. are all aligned with telling your story. Yellowbrick visualizers, built on top of Matplotlib, allow you to customize the axis labels to better describe your data and the insights you're showcasing. Here's how to personalize the axis labels to fit your narrative: after running the visualization, you can access the axes of the plot directly and update the labels:
viz.ax.set_ylabel("Custom Y-axis Label")
viz.ax.set_xlabel("Custom X-axis Label")
For scenarios where you require even more control, especially when you want to use subplots or complex layouts, it is a good idea to work directly with Matplotlib's figure and axes objects. Create your figure and axes upfront, and then pass the axes object to your visualizer:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
cm = ConfusionMatrix(model, ax=ax, classes=class_names)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.show()
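Building on this, here is a hedged sketch of a two-panel layout: two visualizers share one figure, each drawing into its own axes, and the figure is saved to a file of our choosing. Yellowbrick's finalize() renders a plot without opening a window:
import matplotlib.pyplot as plt
from yellowbrick.classifier import ConfusionMatrix, ROCAUC

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: confusion matrix
cm = ConfusionMatrix(model, ax=ax1, classes=class_names)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.finalize()

# Right panel: per-class ROC curves
roc = ROCAUC(model, ax=ax2)
roc.fit(X_train, y_train)
roc.score(X_test, y_test)
roc.finalize()

fig.savefig("model_report.png", dpi=150)  # example filename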
Matplotlib & Seaborn
Both of these packages are a crucial part of Machine Learning visualization, and we strongly encourage you to visit external resources on how to fully leverage each library. Each package is thoroughly documented, with plenty of resources available for free online. Learning how to visualize your data is a critical skill. Whether you're looking to plot complex figures or just understand your data's distribution, matplotlib is without a doubt the most powerful (and complex) workhorse for Python visualization.
- Matplotlib Documentation
- Matplotlib Tutorials
- Seaborn Documentation
- Seaborn Tutorials
- Integrating Matplotlib with Scikit-Learn
- Advanced Seaborn Plots
Featured Image created with DALL·E 3.