How to Use LM Studio for Local LLMs: Complete Guide
By Braincuber Team
Published on April 17, 2026
LM Studio is a cross-platform application that lets you download and run large language models locally on your machine so that your data never leaks to external servers. In this complete tutorial, you will learn how to install LM Studio, download models, chat with documents using built-in RAG, and set up a local API server.
What You'll Learn:
- Install and set up LM Studio on your computer
- Download and run LLMs locally with privacy control
- Chat with documents using built-in RAG support
- Configure model parameters for optimal performance
- Set up a local API server for application integration
What is LM Studio?
LM Studio is a cross-platform application that lets you download and run large language models locally on your machine. When everything runs on your computer, your prompts and data stay within your environment, giving you more control and better privacy.
It comes with a built-in model browser where you can search, browse, and download models directly from Hugging Face, including variants of DeepSeek, Llama, Gemma, Phi, and Mistral. No extra setup is required.
LM Studio is also great for beginners who are not comfortable working with command-line prompts. It gives you a user-friendly interface where you can select a model, adjust configuration, and start chatting right away.
You can also upload local files and chat with them. LM Studio can attach .docx, .pdf, and .txt files to chat sessions. If a document fits in context, it is added in full, and if it is very long, LM Studio uses retrieval-augmented generation (RAG) to pull relevant information from those files.
LM Studio vs Ollama
| Feature | LM Studio | Ollama |
|---|---|---|
| Interface | GUI-first, user-friendly | CLI-first, terminal-based |
| Built-in RAG | Yes, no extra setup | Requires external tools |
| MCP support | Built-in | Limited / not native |
| Model downloads | Hugging Face in-app | Using ollama pull commands |
| Ease of setup | Very beginner-friendly | Slight learning curve |
System Requirements and Model Choice
Before downloading models in LM Studio, understand what your system can handle. The model you choose directly depends on your available RAM, and picking the wrong one can slow things down or make the app unusable.
RAM Requirements by Model Size
| RAM | What You Can Run |
|---|---|
| 8GB | Small models (1B-4B parameters) |
| 16GB | Mid-sized models (7B-9B parameters) |
| 32GB+ | Larger models (13B and above) |
GPU is optional but makes a noticeable difference. If you have one, model responses become faster and smoother. NVIDIA GPUs with CUDA support work best, Apple Silicon uses Metal effectively, and AMD has partial support.
Recommended Models by Hardware
| RAM/VRAM | Recommended Models |
|---|---|
| 8GB | Qwen 2.5 3B/4B, Phi-3 Mini (3.8B), Gemma 2 2B |
| 16GB | Llama 3 8B, Gemma 2 9B, Mistral 7B, Qwen 2.5 7B |
| 24GB | Llama 3.1 8B (higher quality), Mixtral 8x7B (quantized), Qwen 2.5 14B |
| 32GB+ | Llama 3.1 70B (heavily quantized), Qwen 2.5 32B, Mixtral variants |
Understanding Quantization
You will notice different versions of the same model with labels like Q4_K_M or Q8_0. These refer to quantization levels, which indicate how many bits are used per weight, and therefore how heavily the model is compressed.
- Q4 (4-bit, more aggressive quantization): Reduces memory usage and runs faster, but you lose some quality
- Q8 (8-bit, lighter quantization): Better output quality, but requires more RAM and runs slower
If you are unsure, Q4 or Q5 is usually a safe place to start, especially on a 16GB setup.
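You can sanity-check whether a model will fit before downloading it. As a rough rule of thumb, the memory footprint is the parameter count times bits per weight divided by 8, plus some overhead for the KV cache and runtime buffers. A quick back-of-envelope sketch (the 20% overhead factor is an approximation, not an exact LM Studio figure):

```python
def estimate_model_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough memory footprint: params * (bits / 8) bytes, plus ~20% overhead
    for the KV cache and runtime buffers (approximation)."""
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead

# A 7B model at Q4 (~4 bits/weight) vs Q8 (~8 bits/weight)
print(f"7B @ Q4: ~{estimate_model_gb(7, 4):.1f} GB")
print(f"7B @ Q8: ~{estimate_model_gb(7, 8):.1f} GB")
```

This lines up with the tables above: a 7B model at Q4 needs roughly 4-5 GB, which is why it sits comfortably on a 16GB machine, while the Q8 version roughly doubles that.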
GGUF Format
Most models in LM Studio are available in GGUF format, a binary format (the successor to GGML) designed to store and run LLMs efficiently on consumer-grade hardware. A GGUF file packages the quantized weights, with high-precision values mapped to lower-bit integers, together with the tokenizer and model metadata in a single optimized file.
Installing LM Studio
To get started with LM Studio, head over to the official website and download the app. The website automatically detects your operating system and offers the relevant version for Windows, Mac, or Linux.
You might be asked to allow permissions depending on your system settings. On Mac, it becomes available as an application right after opening the downloaded installer.
Downloading Your First Model
When you open LM Studio for the first time, you land on a clean interface with the model browser. You can search for models, explore options, and begin downloading immediately.
Browsing the Discover Tab
Open LM Studio
Open LM Studio and click the search icon from the left sidebar to access the model browser.
Search for Models
Search for specific models, filter by size, and explore different options. Each model comes with a model card showing size, capabilities, and recommended use cases.
Download a Model
Download a model like Qwen 2.5 7B (Q4_K_M) for a 16GB system. It strikes a good balance between performance and quality.
Chatting with a Local LLM
Loading a Model and Configuring Parameters
Navigate to My Models
Open LM Studio and go to the My Models section from the left menu bar.
Load the Model
Click the Settings icon on the model and click Load Model.
Configure Parameters
Go to the Inference tab to access controls for context length, temperature, and system prompt.
Key Configuration Parameters
Context Length
Controls how much information the model can remember during a conversation. Higher values let you work with longer inputs but use more memory.
Temperature
Controls how creative or predictable the model is. Lower values make responses more deterministic, higher values make them more varied.
System Prompt
Sets the behavior of the model. Define how the assistant should respond, including tone, style, and role.
Max Tokens
Limits the maximum length of the model's response. Set this based on how long you expect answers to be.
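These same settings map onto fields in an OpenAI-style chat request, which becomes useful later when you call LM Studio's local API server. A sketch of how a request body might carry them (the field names follow the OpenAI chat format; the model name and helper function are placeholders, not part of LM Studio itself; context length is set when loading the model, not per request):

```python
def build_chat_request(user_prompt: str,
                       system_prompt: str = "You are a concise assistant.",
                       temperature: float = 0.7,
                       max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion body.
    The system prompt sets behavior; temperature and max_tokens
    mirror the sliders in LM Studio's Inference tab."""
    return {
        "model": "local-model",      # placeholder; use your loaded model's id
        "temperature": temperature,  # lower = more deterministic
        "max_tokens": max_tokens,    # caps response length
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

body = build_chat_request("Summarize what RAG is in one sentence.")
```

Tuning the sliders in the GUI and setting these fields in an API call are two views of the same knobs.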
Chatting with Your Documents (RAG)
One of the most useful features in LM Studio is its built-in RAG support. You can directly upload documents to the chat and start asking questions.
Setting Up Document Q&A
Open a Chat Session
Open a chat session with your loaded model.
Upload Documents
Click the + icon for attaching files. Upload documents like PDFs or text files directly into the chat.
Start Querying
Once the file is added, LM Studio prepares it automatically. Ask questions and the model retrieves relevant sections from your documents.
How RAG Works Under the Hood
Under the hood, LM Studio:
- Splits the document into smaller chunks
- Converts chunks into embeddings (numerical representations of text)
- Retrieves the most relevant chunks when you ask a question
- Passes retrieved chunks to the model along with your query
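The steps above can be sketched in miniature. In this toy version a bag-of-words similarity score stands in for real embeddings, which LM Studio computes with a dedicated embedding model; the chunk sizes and scoring here are illustrative only:

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    """Step 1: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Step 2: toy 'embedding' as a bag-of-words count vector.
    A real pipeline uses a neural embedding model instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 3: rank chunks by similarity to the query, keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = ("LM Studio runs models locally. Quantization reduces memory use. "
       "RAG retrieves relevant chunks.")
question = "How does quantization affect memory?"
top = retrieve(question, chunk(doc, size=6))

# Step 4: retrieved chunks are prepended to the query for the model
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + question
```

The key idea is that only the retrieved chunks reach the model, which is how a document far larger than the context window can still be queried.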
RAG Limitations
The model depends on its context window, so very large documents may not be fully considered at once. Retrieval quality depends on how well the document is chunked. Some answers might miss details if relevant sections are not retrieved correctly.
Running LM Studio as a Local API Server
One of the more powerful features is running LM Studio as a local API server. This lets you use your local LLM inside scripts, apps, or other tools.
Starting the Server
Enable Developer Mode
Click the Settings icon in the bottom-left corner, go to the Developer section, and turn on the Developer Mode toggle.
Access Developer Menu
Go back to the chat interface and click the Developer icon from the left menu bar.
Start the Server
Turn on the toggle next to Status to start the server. Copy the server address once it is running.
Testing with curl
Test the server with a simple curl request:
```
curl http://127.0.0.1:1234/v1/models
```
If everything is set up correctly, you will see a JSON response listing the available model.
Connecting from Python
Once your local server is running, you can treat it like any other API. The only difference is you are calling your own machine instead of OpenAI servers.
```python
from openai import OpenAI

# Point the client at the local LM Studio server instead of OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Explain how local LLMs work"}
    ],
)

print(response.choices[0].message.content)
```
Understanding the Code
base_url
Tells the code to use your local LM Studio server instead of OpenAI servers.
api_key
Can be any value. LM Studio does not enforce API key authentication.
model
Refers to the model you loaded in LM Studio.
messages
Your prompt structured as a message with role and content.
When you run this code, your request goes to localhost:1234, the model processes it, and you get a response back. This works because LM Studio follows the OpenAI API format.
Frequently Asked Questions
Is LM Studio free?
Yes, LM Studio is free to download and use. There are no subscription fees, usage limits, or paywalled features for the core app.
Is LM Studio completely offline?
Yes, once you download a model, everything runs locally. You only need an internet connection to download models from Hugging Face.
Can I use LM Studio in my own applications?
Yes, LM Studio can run as a local API server that follows the OpenAI API format. You can connect it to scripts, apps, or other tools running on your machine.
Can LM Studio handle images?
Yes, but only with vision-enabled models. If you load a multimodal model like LLaVA or Qwen-VL, you can upload images and ask questions about them.
How much RAM do I need to run LLMs locally?
It depends on the model size. 8GB can handle smaller models, 16GB is sufficient for many use cases, but 32GB or more is required for larger models.