Building an AI Assistant to help people navigate ZIMRA Customs and Tariffs

18 Apr, 2025

Note: This post gives an overview of the system's design and how it was built. If you're looking for the full technical walkthrough, check out the Kaggle Notebook.

Understanding Zimbabwe’s customs and tariff regulations can be extremely difficult. The official documentation provided by ZIMRA (Zimbabwe Revenue Authority) is complex, full of legal terms, and hard for everyday people to access or understand.

A large portion of Zimbabwe’s workforce operates in the informal sector, often relying on importing cheaper goods from neighboring countries as a means of survival. However, the complexity of the customs system forces many businesses and individuals to rely on customs agents or, in some cases, bribe officials at border entry points to get their products through.

To make matters worse, tariff and customs regulations change frequently, making it hard for anyone to keep up.

Why Traditional Solutions Fall Short

Existing solutions often involve online forms that require users to manually input their products and categorize them themselves. This approach has several issues:

Poor user experience: Long, tedious forms can frustrate users.
Inaccurate classification: Users can easily miscategorize products.
Outdated information: Tariff and customs data can change quickly, and traditional systems struggle to stay up to date.
Technical limitations: Building flexible yet accurate rules-based systems becomes overly complex.

How Generative AI Can Help

This is where Generative AI (Gen AI) can make a real difference. With Gen AI, people can interact in natural language, removing the barrier of technical knowledge or bureaucratic jargon.

By using Retrieval-Augmented Generation (RAG), the system can fetch relevant, up-to-date information and feed it into a large language model (LLM). This grounds responses in the current customs and tariff rules rather than in whatever the model absorbed during training.

Project Goal

The primary goal of this project is to democratize access to customs and tariff information in Zimbabwe. By simplifying the language and providing accurate, up-to-date answers, we aim to empower:

Small businesses
Informal traders
Everyday citizens

What This Blog Covers

This post will showcase the solution and review the system's design and key components. Here’s a breakdown of what we’ll cover:

Architecture Overview
Document Pre-processing
Vector and Keyword Search Database Setup
LLM Setup
Sample Output
Conclusion

Architecture Overview

In this setup, the user's prompt is first pre-processed by an LLM, which extracts a structured JSON object containing relevant keywords. These keywords are used to query the Keyword Database and retrieve related documents. In parallel, the original user prompt is used for a semantic search against a Vector Database, which retrieves semantically similar documents.

Both the keyword-based and semantic results are merged to form the final context, which is then passed into the LLM to generate a response. The result is returned to the user. More information on these steps and the process will be given below.
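The flow described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's code: `extract_keywords`, `keyword_search`, and `vector_search` are hypothetical stand-ins passed in as plain callables, and the toy lambdas replace the LLM and both databases.

```python
from typing import Callable, Iterable


def retrieve_context(
    query: str,
    extract_keywords: Callable[[str], list[str]],
    keyword_search: Callable[[list[str]], Iterable[str]],
    vector_search: Callable[[str], Iterable[str]],
) -> list[str]:
    """Merge keyword and semantic results, dropping duplicates but keeping order."""
    keywords = extract_keywords(query)       # LLM step in the real system
    keyword_hits = keyword_search(keywords)  # exact-term lookup
    semantic_hits = vector_search(query)     # embedding similarity lookup

    merged: list[str] = []
    seen: set[str] = set()
    for doc in [*keyword_hits, *semantic_hits]:
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged


# Toy stand-ins (assumptions for illustration only).
context = retrieve_context(
    "How much to import 200 phones?",
    extract_keywords=lambda q: ["phones", "phone"],
    keyword_search=lambda kws: ["chapter 85"],
    vector_search=lambda q: ["chapter 85", "chapter 84"],
)
print(context)  # ['chapter 85', 'chapter 84']
```

De-duplicating while preserving order keeps the prompt short when both searches return the same chapter.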

Document Pre-processing

This section outlines the pre-processing steps applied to the documents used in this project. The two main documents were the Customs and Excise Tariff Notice 2022 Tariff Handbook and the Value Added Tax Act.

One of the biggest challenges was segmenting the Zimbabwe Customs and Excise Tariff Handbook into clean, meaningful chunks suitable for input into a large language model (LLM). The structure of the table of contents provided a way to divide the document into chapters. Chapters were chosen over sections because each section groups material under a broad category, so section-level chunks could have pulled irrelevant information into the prompt.

Additionally, certain key areas — such as rules, abbreviations, and flat-rate assessments — were critical for preserving the necessary context.
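A chapter-level split could look like the sketch below. The heading pattern (a line starting with "Chapter" followed by a number, optionally after markdown `#` marks) and the sample text are assumptions for illustration; the handbook's actual headings may differ.

```python
import re

# Assumed heading format: optional markdown '#' marks, then "Chapter <number>".
CHAPTER_HEADING = re.compile(r"^(?:#+\s*)?chapter\s+\d+\b", re.IGNORECASE | re.MULTILINE)


def split_into_chapters(text: str) -> list[str]:
    """Split text into chunks, each beginning at a chapter heading.

    Any text before the first heading is dropped.
    """
    starts = [m.start() for m in CHAPTER_HEADING.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


sample = (
    "# Chapter 84\nNuclear reactors, boilers, machinery\n"
    "# Chapter 85\nElectrical machinery and equipment\n"
)
chapters = split_into_chapters(sample)
print(len(chapters))  # 2
```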

Document segmentation overview:

Pages 2 - 10: Table of contents
Pages 11 - 16: Rules, definitions, abbreviations — essential context for the LLM
Pages 17 - 1151: Main tariff and customs chapters and sections
Pages 1152 - 1155: Conditional entry guidelines for specified goods — also relevant context
Pages 1156 - 1162: Mostly blank or irrelevant content — excluded from preprocessing
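Because Python sequences are zero-indexed, the 1-based page numbers above have to be shifted by one before they can be used as indices. A small sketch, where the `SEGMENTS` name and the `page_span` helper are hypothetical and the page numbers are copied from the list above:

```python
def page_span(first_page: int, last_page: int) -> range:
    """Convert an inclusive, 1-based page span into a zero-based range."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return range(first_page - 1, last_page)


# Page numbers copied from the segmentation overview.
SEGMENTS = {
    "table_of_contents": page_span(2, 10),
    "rules": page_span(11, 16),
    "chapters": page_span(17, 1151),
    "conditional_entry": page_span(1152, 1155),
}

print(SEGMENTS["rules"])  # range(10, 16)
```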

The pymupdf4llm library was used to convert the PDF files into markdown.

Example: Extracting Text Ranges from the PDF:

def getTextRange(data, page_range) -> str:
    """
    This function extracts text from a range of pages and combines it into a single string.

    Args:
        data (list): A list of pages or documents containing the text data.
        page_range (range): A range of page numbers to extract text from.

    Returns:
        str: A concatenated string of text from all pages in the specified range.
    """
    
    notes = ""  # Initialize an empty string to store the combined text
    
    # Loop through each page in the specified page range
    for i in page_range:
        # Access the text from the page and append it to the 'notes' string
        notes += data[i].to_dict()["text"]
    
    return notes  # Return the concatenated text

# Extract the rules pages: zero-based indices 10 to 15, i.e. pages 11 to 16 of the PDF
rules = getTextRange(data, range(10, 16))

Text Cleaning

After extraction, the text was cleaned to prepare it for indexing into the vector database. The main steps included:

Removing markdown syntax and unnecessary whitespace
Preserving numbers, as they are crucial for calculations and tax code references
Avoiding stemming and lemmatization, which might reduce semantic accuracy

Instead, we opted for punctuation removal as the primary cleaning method to simplify the text while preserving word meaning.

Example: Text Cleaning:

import string

# Build a translation table that deletes every punctuation character
mytable = str.maketrans("", "", string.punctuation)

# Remove punctuation, replace newlines with spaces, and trim leading/trailing whitespace
clean_conditional_entry = conditional_entry.translate(mytable).replace("\n", " ").strip()
clean_rules = rules.translate(mytable).replace("\n", " ").strip()

clean_rules  # in a notebook, this displays the cleaned text
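To make the effect of this cleaning step concrete, here is a small self-contained sketch applied to a made-up line (the `clean_text` name and the sample string are illustrative). Note that digits survive, but the dots inside a tariff code count as punctuation and are removed too, so `8517.13.00` becomes `85171300`.

```python
import string

PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Strip punctuation, turn newlines into spaces, and trim the ends."""
    return text.translate(PUNCTUATION_TABLE).replace("\n", " ").strip()


sample = "**8517.13.00** Smartphones:\n25%"
print(clean_text(sample))  # 85171300 Smartphones 25
```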

Vector and Keyword Search Database Setup

This project uses two types of search systems to retrieve relevant information: a vector database for semantic search and a keyword-based search for exact term matching.

Each serves a unique purpose in the pipeline:

The vector database stores embeddings of the Zimbabwe Customs and Excise Tariff Handbook and supports semantic search to find contextually similar content.
The keyword search system provides a fallback when semantic search might return loosely related or irrelevant content, helping to pinpoint exact matches for specific terms or codes.

Vector Database

For semantic search, we used ChromaDB as the vector database. Embeddings were generated with the Gemini embedding model models/text-embedding-004. This allows the system to find and return chapters or sections that are semantically similar to the user’s query, even if the exact words aren’t used.

Example: Adding Documents to the Vector Database

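import chromadb

# DB_NAME and embed_fn (the Gemini embedding function) are defined elsewhere; see the notebook
chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)

# Add each chapter as a document, using its position in the list as the ID
db.add(documents=chapters, ids=[str(i) for i in range(len(chapters))])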
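Conceptually, a semantic query embeds the user's text and ranks the stored chapters by how close their vectors are. The sketch below shows that ranking with hand-made three-dimensional vectors and plain cosine similarity. It is a toy illustration, not ChromaDB's internals, and the vectors and chapter labels are invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, docs[d]), reverse=True)
    return ranked[:k]


# Invented embeddings for illustration only.
docs = {
    "chapter 85 (electrical equipment)": [0.9, 0.1, 0.0],
    "chapter 87 (vehicles)": [0.1, 0.9, 0.1],
    "chapter 08 (fruit)": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "smartphone import duty"
print(top_k(query, docs, k=1))  # ['chapter 85 (electrical equipment)']
```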

Keyword Database

While semantic search is powerful, it can sometimes return loosely related content due to how cosine similarity measures embedding proximity. To improve accuracy and give users more control, we also implemented a keyword search mechanism.

This approach is especially useful when users are looking for specific terms, product names, or codes that fall within a well-defined category.

For an example of where keyword search outperforms semantic search (e.g., distinguishing between Google Pixel and actual fruits), check out the Kaggle Notebook.
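The post does not show how the keyword store is built, so the following is only one plausible shape for it: a small inverted index that maps each lowercase word to the chapters containing it. All names and sample chapters here are hypothetical.

```python
from collections import defaultdict


def build_index(chapters: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the ids of the chapters that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for chapter_id, text in chapters.items():
        for word in text.lower().split():
            index[word].add(chapter_id)
    return index


def keyword_search(index: dict[str, set[str]], keywords: list[str]) -> set[str]:
    """Return every chapter id that matches at least one keyword."""
    hits: set[str] = set()
    for keyword in keywords:
        hits |= index.get(keyword.lower(), set())
    return hits


chapters = {
    "85": "telephones smartphones and other electrical equipment",
    "08": "edible fruit and nuts",
}
index = build_index(chapters)
print(keyword_search(index, ["smartphones", "phone"]))  # {'85'}
```

Because matching is exact, a query term only hits chapters that literally contain it, which is why the keyword prompt asks for singular and plural forms.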

LLM Setup

The Gemini 2.0 Flash model was used throughout this project for both keyword extraction and generating responses from the processed documents.

This section outlines two key parts of the query workflow:

Keyword Prompt Generation – Converting a user’s natural language query into a structured list of keywords, which are passed to the keyword search system.
Final Query Prompt – The fully-formed prompt sent to the LLM, containing instructions and contextual information for generating a meaningful answer.

Keyword Query Prompt

The keyword search database expects a list of relevant terms to retrieve the most relevant documents. To support this, we first process the user's input by crafting a prompt that instructs Gemini to:

Extract keywords from the user’s query
Return the result in a clean, structured JSON format

This approach enables us to integrate the output directly into the keyword search system for improved context-aware retrieval.

Example Prompt (for keyword extraction)

prompt = f"""
Please extract keywords from this user query about tariffs and customs. Please extract only the keywords from products listed within the query.
Please generate keywords including variations, synonyms, and categories that the product might fall into. The keywords have to be single words and be all lowercase.
The keywords also need to include the singular and plural forms. Make the categories as broad and abstract as you can.


Please return this output as a JSON object in the form below. The data field will include all the keywords for all the products listed in the prompt. 

If no keywords are found, make the array/list the value in the JSON as an empty array/list.

{{
	'data': ['keyword']
}}


An example of a keyword extraction for the product Honda Fit will look like

{{
	'data': ['cars', 'car', 'petrol', 'diesel', 'electric', 'hybrid', 'automobile', 'vehicle', 'transportation', 'machinery']
}}

User Query: {query}

"""

Once Gemini generates the keywords, the output is cleaned and formatted as proper JSON before being passed into the keyword search module.
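That clean-up step might look like the sketch below. It assumes the model may wrap its answer in a markdown code fence and may use single quotes, as in the prompt's own example, so it falls back to `ast.literal_eval` when `json.loads` fails. The `parse_keywords` name and the sample reply are hypothetical.

```python
import ast
import json

FENCE = "`" * 3  # markdown code-fence marker


def parse_keywords(raw: str) -> list[str]:
    """Extract the 'data' keyword list from a model reply, or [] on failure."""
    text = raw.strip()
    # Drop a surrounding code fence (and optional "json" tag) if present.
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)].strip()
        if text.lower().startswith("json"):
            text = text[4:].strip()
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        try:
            payload = ast.literal_eval(text)  # tolerates single-quoted dicts
        except (ValueError, SyntaxError, TypeError):
            return []
    if not isinstance(payload, dict):
        return []
    data = payload.get("data", [])
    return [str(k).lower() for k in data] if isinstance(data, list) else []


reply = FENCE + "json\n{'data': ['cars', 'car', 'Vehicle']}\n" + FENCE
print(parse_keywords(reply))  # ['cars', 'car', 'vehicle']
```

Returning an empty list on any parse failure lets the pipeline fall back to semantic search alone.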

Final Prompt to the LLM

After retrieving relevant documents from the vector and/or keyword databases, a final prompt is constructed and sent to the Gemini model. This prompt includes:

A system message defining the model's role
The user's original query
The most relevant document chunks retrieved

Example Final Prompt

    def _build_prompt(self, base_prompt: str, results: set[str], tax: bool = False) -> str:
        """
        Constructs a final prompt for the model using available rules and documents.
        """
        header = (
            "Please act like a helpful and informative customs officer.\n"
            "Be comprehensive, include all relevant information.\n"
            "Can you outline what classification the product in the question falls in. \n"
            "If possible, could you give the taxes that apply to that specific product.\n"
            "If the question is irrelevant or the documents provided do not refer to it, stop and say that the prompt is irrelevant.\n"
            "Always give a disclaimer that you are a bot, and so this information might be wrong\n"
            f"Rules: {self.rules}\n"
            f"Additional considerations: {self.conditional_entry}\n"
        )

        if tax:
            header += f"Tax information: {self.tax}\n"

        info_section = "\n".join(["Information: " + text.replace('\n', ' ') for text in results])
        return f"{header}\nQUESTION: {base_prompt}\n\nBelow is information that will help you:\n{info_section}"

Sample Output

In this section, we evaluate the performance of the Retrieval-Augmented Generation (RAG) pipeline using a set of sample queries. The goal is to assess how well the system responds to real-world questions under two scenarios:

With tax/customs information included in the retrieved context
Without tax/customs information in the context

By comparing the model’s responses across these two scenarios, we can better understand the impact of including detailed domain-specific knowledge during inference.

Test Prompts

I want to import a 1985 Toyota Hilux. How do I do this?
How much will it cost to get me 200 Google Pixel phones imported?

Google Pixel without tax output:

Okay, I can help you with that! Let's figure out the classification and potential costs for importing 200 Google Pixel phones into Zimbabwe, based on the provided document.

Classification:

Given the provided HS code excerpts, the most relevant chapter appears to be Chapter 85: Electrical machinery and equipment and parts thereof; sound recorders and reproducers, television image and sound recorders and reproducers, and parts and accessories of such articles.

Specifically, considering the definition of "smartphones" in Note 6 of Chapter 85, and the descriptions within heading 8517, the classification would fall under:

8517.13.00 Smartphones.

Taxes & Duties:

Based on the provided document:

8517.13.00 has a general duty of 25% and a MFN duty of 25%.

Cost Estimation:

To estimate the cost, we'll need the following information:

Value of the Goods: The total value of the 200 Google Pixel phones being imported (in USD or another convertible currency). This is the price you paid for the phones.
Freight and Insurance Costs: The cost of shipping the phones to Zimbabwe, including any insurance.

Calculation:

Customs Value: Value of Goods + Freight + Insurance
Customs Duty: Customs Value * 25% (assuming MFN status applies)

Example:

Let's assume:

The total value of 200 Google Pixel phones is $60,000 USD.
Freight and insurance cost $2,000 USD.
Customs Value: $60,000 + $2,000 = $62,000 USD
Customs Duty: $62,000 * 0.25 = $15,500 USD

Important Considerations:

Other Taxes: The provided document focuses primarily on Customs Duty. You'll almost certainly have to pay Value Added Tax (VAT) in Zimbabwe as well and potentially other taxes. The current VAT rate in Zimbabwe is 14.5%. This would be calculated on the Customs Value plus the Customs Duty.
Prevailing Customs Exchange Rate: If the duty is denominated in United States dollars, with effect from the 1st December 2019 the duty leviable thereon shall be payable in Zimbabwe dollars at the prevailing customs exchange rate applicable from time to time Provided that this shall not apply to goods designated to pay duty in foreign currency
Import License: You'll likely need an import license to import these phones into Zimbabwe. Ensure you have the necessary permits and licenses before shipping.
Compliance: Ensure the phones meet any relevant Zimbabwean technical standards or regulatory approvals.
Declaration: You are required to declare the goods truthfully and accurately, and all documents should match the goods being imported.
Inspection: The customs officials may inspect the goods upon arrival in Zimbabwe.

Disclaimer:

I am an AI assistant and cannot provide definitive customs advice. This information is based solely on the provided document and is for informational purposes only. Customs regulations are subject to change, and it's essential to consult with a qualified customs broker or the Zimbabwe Revenue Authority (ZIMRA) directly for accurate and up-to-date information specific to your import.

I strongly recommend consulting with a Zimbabwean customs broker. They can provide accurate duty calculations, ensure compliance with all regulations, and help you navigate the import process smoothly.

Toyota Hilux with tax output:

Okay, let's break down the process of importing your 1985 Toyota Hilux into Zimbabwe, based on the provided customs tariff information.

Disclaimer: I am an AI and the below information might be incorrect and may not account for all fees or regulations. Always consult directly with the Zimbabwe Revenue Authority (ZIMRA) or a licensed customs broker for the most accurate and up-to-date guidance.

1. Classification of Your Vehicle

Based on the limited information available, a 1985 Toyota Hilux would likely fall under:

Heading 8704: Motor vehicles for the transport of goods.

To determine the exact sub-classification, you'd need to know the:

Engine Type: Is it compression-ignition (diesel) or spark-ignition (petrol/gasoline)?
Gross Vehicle Weight (gvw): This is the maximum operating weight/mass of the vehicle as specified by the manufacturer.
Payload: What is the maximum weight of the goods that can be transported?

Examples using the available tariff schedule:

Assuming your Hilux has a compression ignition diesel engine, and a payload exceeding 800 kg but not exceeding 1400 kg, then the commodity code is likely 87042140
Assuming your Hilux has a spark ignition (gasoline) engine, and a payload exceeding 800 kg but not exceeding 1400 kg, then the commodity code is likely 87043140

2. Import Duties and Taxes

Customs Duty:

The MFN (Most Favored Nation) rate is what would apply if importing from a country that has a trade agreement with Zimbabwe.
The "General" rate is what applies if importing from a country without such an agreement.
Based on this code, 87042140/87043140, the General rate and MFN rate would be 40%

Value Added Tax (VAT):

VAT is applied to the value of the imported goods plus the customs duty.
The current standard VAT rate in Zimbabwe needs to be determined (check the current Finance Act or consult ZIMRA). VAT is applied to the sum of the value of the Hilux + the Customs Duty.

Other Potential Taxes/Fees:

Surtax: While not explicitly mentioned in the provided documents, surtax could potentially apply. Check current regulations.
Excise Duty: Second-hand motor vehicles may be subject to special excise duty. You'll need to check if this applies to your Hilux based on its age and engine type.
Carbon Tax: Zimbabwe has a carbon tax based on engine capacity. You will need to determine what that is
Other Fees: There are likely to be other fees associated with customs processing, handling, and storage.

3. Calculating the Duty and VAT

Here's a hypothetical example:

Assumed Value of the 1985 Toyota Hilux: $2,000
Customs Duty (40%): $2,000 x 0.40 = $800
Value for VAT Calculation: $2,000 (Vehicle Value) + $800 (Customs Duty) = $2,800
Assumed VAT Rate: 14.5% (You MUST confirm the current rate!)
VAT: $2,800 x 0.145 = $406
Total Import Costs (excluding other potential fees): $800 (Customs Duty) + $406 (VAT) = $1206
Remember to convert the values to Zimbabwe dollars (ZWL) at the prevailing customs exchange rate at the time of import. However, note that duty on certain goods may be designated to be paid in foreign currency.

4. Steps to Import

Determine Eligibility: Ensure that importing a vehicle of this age is permitted under current import regulations. There might be restrictions on the age of imported vehicles to encourage newer, more fuel-efficient models.

Obtain Necessary Documents:

Original Purchase Invoice/Bill of Sale: To prove the value of the vehicle.
Vehicle Registration Documents: From the country of export.
Bill of Lading: Issued by the shipping company.
Export Documents: From the country of export.
Any other documents required by ZIMRA.

Engage a Customs Broker (Recommended): A licensed customs broker can greatly simplify the process, ensure compliance, and handle the paperwork.

Shipping: Arrange for the vehicle to be shipped to Zimbabwe.

Customs Clearance:

Submit all required documents to ZIMRA.
Have the vehicle inspected by customs officials.
Pay all applicable duties, taxes, and fees.

Vehicle Registration: Once cleared, register the vehicle in Zimbabwe.

I hope this helps, and good luck with your import!
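The arithmetic in both sample answers is easy to check in code. The sketch below follows the formulas the model states: duty is the customs value times the duty rate, and VAT is charged on the customs value plus the duty. The `import_costs` name is hypothetical, and the 40%, 25%, and 14.5% rates are simply the figures quoted in the outputs above, not verified current rates. `Decimal` is used to avoid floating-point rounding noise.

```python
from decimal import Decimal


def import_costs(customs_value: Decimal, duty_rate: Decimal, vat_rate: Decimal) -> dict[str, Decimal]:
    """Customs duty, VAT on (customs value + duty), and their sum."""
    duty = customs_value * duty_rate
    vat = (customs_value + duty) * vat_rate
    return {"duty": duty, "vat": vat, "total": duty + vat}


# Hilux example: assumed value 2,000 USD, 40% duty, 14.5% VAT.
hilux = import_costs(Decimal("2000"), Decimal("0.40"), Decimal("0.145"))
print(hilux["duty"], hilux["vat"], hilux["total"])

# Pixel example: customs value 62,000 USD at 25% duty (that answer did not compute VAT).
pixel = import_costs(Decimal("62000"), Decimal("0.25"), Decimal("0"))
print(pixel["duty"])
```

The results agree with the figures in the sample outputs: 800 duty, 406 VAT, and 1,206 in total for the Hilux, and 15,500 duty for the phones.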

Conclusion

As the sample outputs suggest, combining prompt pre-processing with both keyword and semantic (vector) search helps the system retrieve more relevant context, which in turn leads to better answers for domain-specific queries like customs and tariff information.

The Tariff RAG system performed well on the sample queries and looks promising for day-to-day use. That said, there’s still room for enhancement. Future improvements could include:

Exploring more advanced RAG architectures
Implementing an automated document update pipeline (e.g., for daily tariff changes)
Fine-tuning retrieval parameters like top-k and top-n to reduce hallucinations and improve precision
Enhancing error handling, especially for edge cases and ambiguous queries

One limitation of this project was the absence of real user interactions, which restricted testing to hypothetical scenarios. In future iterations, collecting user feedback will be crucial to understanding how people engage with the system and identifying practical areas for improvement in real-world use.