How to run local AI model using llama.cpp

Running a local AI model is a game-changer. It means complete privacy, zero subscription fees, and an assistant that works entirely offline.

While popular apps like Ollama, LM Studio, or Faraday are great options, they can be incredibly heavy and resource-hungry. Today, we are going to learn how to run an AI model directly on your own computer using llama.cpp as it is the absolute lightest, fastest, and most efficient way to run local AI.

Step 1: Download llama.cpp

First, head over to the official GitHub releases page to grab the software: 👉 llama.cpp GitHub Releases

For this guide, I am using the standard Windows package: llama-b9333-bin-win-cpu-x64.zip. You should download the specific zip or tarball archive that matches your operating system and hardware (e.g., choose a cuda version if you have an Nvidia GPU, or macos-arm64 if you are on an Apple Silicon Mac).

Once downloaded, extract the contents of the folder somewhere easy to find, like your Desktop.

Step 2: Download Your AI Model

Next, we need the actual brains—the AI model. Local models use a highly compressed, single-file format called GGUF. You can browse thousands of pre-configured models directly on Hugging Face: 👉 Hugging Face GGUF Model Directory

For this setup, I downloaded qwen2.5-1.5b-instruct-q4_k_m.gguf from Alibaba's excellent Qwen repository: 👉 Qwen2.5-1.5B-Instruct-GGUF on Hugging Face

💡 Tip: At just 1.5 billion parameters, this model runs incredibly fast on almost any computer while remaining surprisingly smart!

Step 3: Organize and Open Your Terminal

Move or copy your downloaded .gguf model file directly into the extracted llama.cpp folder.
Open your terminal or command prompt inside that exact folder.
- On Windows: Hold Shift, right-click inside the folder, and select "Open PowerShell window here" or "Open Git Bash here".

Step 4: Run the Model

There are two distinct ways to use your local model: directly inside the terminal, or hosted locally as a sleek web-browser interface. Pick the one that fits your style!

Option A: Chat Directly in the Terminal (CLI Mode)

If you want the absolute lowest memory footprint, you can interact with the model directly inside your command line.

1. Windows (Command Prompt / PowerShell)

llama-cli.exe -m qwen2.5-1.5b-instruct-q4_k_m.gguf -c 4096 -cnv

2. Windows (Git Bash)

./llama-cli.exe -m qwen2.5-1.5b-instruct-q4_k_m.gguf -c 4096 -cnv

3. Linux / macOS

./llama-cli -m qwen2.5-1.5b-instruct-q4_k_m.gguf -c 4096 -cnv

(Note: The -c 4096 flag sets your memory context length, and -cnv starts an interactive chat conversation. If your local AI model's location is different then write the path like this: ./llama-server.exe -m "/c/Users/Your Name/My Models/qwen.gguf" -c 4096).

Option B: Launch the Browser GUI Interface (Web Server Mode)

If you prefer a clean graphical interface that looks and feels exactly like ChatGPT, you can instruct llama.cpp to spin up a local lightweight web server instead.

1. Windows (Command Prompt / PowerShell)

llama-server.exe -m qwen2.5-1.5b-instruct-q4_k_m.gguf -c 4096

2. Windows (Git Bash)

./llama-server.exe -m qwen2.5-1.5b-instruct-q4_k_m.gguf -c 4096

3. Linux / macOS

./llama-server -m qwen2.5-1.5b-instruct-q4_k_m.gguf -c 4096

Step 5: Access Your Local Web UI

If you chose Option B, leave your terminal running in the background and open your favorite web browser. Navigate to this address:

👉 http://localhost:8080

Boom! You will see a beautiful, minimalist, ChatGPT-style chat interface. You can now chat with your Qwen AI completely locally, seamlessly, and at lightning-fast speeds. When you are done, simply go back to your terminal and hit Ctrl + C to shut it down.

Bonus Tip

Here is how you can launch your local AI model easily with just one click. If you are using Windows operating system then just download this file to your Desktop. Then open it using Notepad and edit the following part and replace them with your own path then save.

:: Configuration Variables
set "LLAMA_DIR=C:\Users\Administrator\Desktop\llama"
set "MODEL_PATH=C:\Users\Administrator\Desktop\llama\qwen.gguf"
set "CONTEXT=4096"

That's it! Now double click on it and it will let you launch your AI either gui or cli mode based on your selection. 🥳

Happy offline chatting!