LLMs on Apple Silicon with MLX
May 1, 2024

Unleash the power of Apple Silicon by running large language models locally with MLX.

Running Large Language Models on Apple Silicon with MLX

In this post, we’ll explore how to leverage Apple Silicon hardware (M1, M2, M3) to run large language models locally using MLX. MLX is an open-source array framework from Apple that provides GPU acceleration through the Metal backend, letting models take advantage of Apple Silicon’s unified CPU/GPU memory for efficient execution.

Installing MLX on macOS

MLX provides GPU acceleration on Apple’s Metal backend, and LLM support is packaged in the mlx-lm Python package. To get started, follow the installation instructions in the mlx-lm package documentation.
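For reference, a typical installation is a single pip command (assuming a recent Python on an Apple Silicon Mac; a virtual environment is recommended):

    # Install the mlx-lm package from PyPI
    pip install mlx-lm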

Note: MLX is currently supported only on Macs with Apple Silicon (M-series) chips.

Loading Models with MLX

While MLX can load common Hugging Face models directly, it is recommended to use the converted and quantized models provided by the mlx-community organization. These models have been optimized for Apple Silicon; pick a quantization level that fits your device’s memory.

To load a model with MLX, follow these steps:

  1. Browse the available models on Hugging Face (for example, the mlx-community collection).

  2. Copy the model identifier from the model page in the format <author>/<model_id> (e.g., mlx-community/Meta-Llama-3-8B-Instruct-4bit).

  3. Check the model size. Models that fit entirely in your machine’s unified CPU/GPU memory tend to perform better.

  4. Launch the model server, following the mlx-lm instructions for running an OpenAI-compatible server locally (see the example request after this list):

    # Launch the model server
    mlx_lm.server --model <author>/<model_id>
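Once the server is up, you can sanity-check it with a request to its OpenAI-compatible chat completions endpoint. This sketch assumes the server’s default port (8080) and the example model above:

    # Send a test chat completion request to the local server
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mlx-community/Meta-Llama-3-8B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
      }'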

Configuring LibreChat

To use MLX with LibreChat, you’ll need to add it as a separate endpoint in the librechat.yaml configuration file; a sample configuration for the Llama-3 model is sketched below. Follow the Custom Endpoints & Configuration Guide for more details.
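The following is a minimal sketch rather than an official configuration: the field names follow LibreChat’s custom-endpoint schema, the baseURL assumes mlx_lm.server is running locally on its default port, and the apiKey value is a placeholder since the local server does not require one.

    # librechat.yaml (excerpt) -- hypothetical example endpoint
    endpoints:
      custom:
        - name: "MLX"
          apiKey: "mlx"
          baseURL: "http://localhost:8080/v1/"
          models:
            default: ["mlx-community/Meta-Llama-3-8B-Instruct-4bit"]
            fetch: false
          titleConvo: true
          titleModel: "current_model"
          modelDisplayLabel: "Apple MLX"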

With MLX, you can now enjoy the benefits of running large language models locally on your Apple Silicon hardware, unlocking new possibilities for efficient and powerful natural language processing tasks.