LLMs on Apple Silicon with MLX
Unleash the power of Apple Silicon by running large language models locally with MLX.
Running Large Language Models on Apple Silicon with MLX
In this post, we’ll explore how to leverage the power of Apple Silicon hardware (M1, M2, M3) to run large language models locally using MLX. MLX is an open-source project that enables GPU acceleration on Apple’s Metal backend, allowing you to harness the unified CPU/GPU memory for efficient model execution.
Installing MLX on macOS
MLX supports GPU acceleration on Apple’s Metal backend through the mlx-lm Python package. To get started, follow the instructions provided in the mlx-lm package installation guide.
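If you already have a recent Python environment on your Mac, installation is typically a single command from PyPI (a minimal sketch; see the installation guide if you use conda or need a specific Python version):

```bash
# Install the MLX language-model tooling (provides the mlx_lm.server command)
pip install mlx-lm
```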
Note: MLX is currently supported only on Apple Silicon (M-series) Macs.
Loading Models with MLX
While MLX supports common HuggingFace models directly, it is recommended to use the converted and quantized models provided by the mlx-community. These models are optimized for Apple Silicon, and you can pick a quantization level that matches your device’s memory and capabilities.
To load a model with MLX, follow these steps:
- Browse the available models on HuggingFace.
- Copy the model identifier from the model page in the format <author>/<model_id> (e.g., mlx-community/Meta-Llama-3-8B-Instruct-4bit).
- Check the model size. Models that fit entirely in the unified CPU/GPU memory tend to perform better.
- Follow the instructions in Run OpenAI Compatible Server Locally to launch the model server by running the command: mlx_lm.server --model <author>/<model_id>
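For example, to serve the quantized Llama-3 model mentioned above (the model name and explicit port are illustrative choices here):

```bash
# Downloads the model on first use, then serves it behind an OpenAI-compatible API
mlx_lm.server --model mlx-community/Meta-Llama-3-8B-Instruct-4bit --port 8080
```

By default, the server listens on 127.0.0.1:8080 and exposes OpenAI-compatible endpoints under /v1, which is what LibreChat connects to in the next section.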
Configuring LibreChat
To use MLX with LibreChat, you’ll need to add it as a separate endpoint in the librechat.yaml configuration file. An example configuration for the Llama-3 model is shown below. Follow the Custom Endpoints & Configuration Guide for more details.
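The snippet below is a minimal sketch of such an endpoint, assuming the server from the previous section is running on its default address (http://127.0.0.1:8080) and serving the Meta-Llama-3-8B-Instruct-4bit model; adjust the names, URL, and fields to match your setup and LibreChat version:

```yaml
# librechat.yaml — add under the existing endpoints section
endpoints:
  custom:
    - name: "MLX"
      apiKey: "mlx"                        # the local MLX server does not validate this value
      baseURL: "http://127.0.0.1:8080/v1/" # default mlx_lm.server address
      models:
        default: ["mlx-community/Meta-Llama-3-8B-Instruct-4bit"]
        fetch: false
      titleConvo: true
      titleModel: "current_model"
      modelDisplayLabel: "Apple MLX"
```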
With MLX, you can now enjoy the benefits of running large language models locally on your Apple Silicon hardware, unlocking new possibilities for efficient and powerful natural language processing tasks.