Building Production-Ready RAG Systems: From Semantic Search to Intelligent Responses

We built a system that uses retrieval-augmented generation to answer guest questions accurately by pulling real property data instead of letting the AI make things up. Here's what we learned implementing semantic search with pgvector and wiring it into GPT-4 for a vacation rental platform.
Here's a problem we kept running into: property managers answering the same questions all day long. Different guests, same questions. WiFi password. Parking location. Check-in time. Pet policy.
You can handle this manually when you've got 20 or 30 properties. At 200 properties across Airbnb, Vrbo, and your own booking site, you're either hiring a team just to answer messages or you're automating it.
Why Basic Chatbots Don't Work Here
We tried the obvious solutions first. Rule-based bots that match keywords to canned responses. Fast, cheap, useless. Guest asks "where's parking?" and gets back the parking policy PDF instead of "behind the building, spots 12-15." Or it can't handle slight variations in phrasing.
Then there's the pure LLM approach. Feed the question to GPT-4 and let it respond. This works great until you realize it's inventing WiFi passwords. The model doesn't know your properties. It doesn't have your house rules or your specific check-in procedures. So when it hits a knowledge gap, it fills in something that sounds right. Confidently wrong is worse than no answer at all.
How We Actually Built This
The fix is retrieval-augmented generation. You take the guest's question, find the relevant property information in your database, and hand both to the LLM. Now it's working from facts instead of statistical guesses.
Our implementation has four pieces working together:
Vector knowledge base sitting in Postgres
Every property detail (amenities, rules, check-in instructions, parking info) gets turned into a 1536-dimension vector using OpenAI's embedding model. These vectors represent meaning, so "where can I leave my car" semantically matches parking location data even though the exact words are different.
Search layer that actually understands questions
Guest message comes in, we convert it to a vector, run a similarity search in Postgres using pgvector. Cosine similarity finds the closest matches. Usually nails it on the first try.
Context assembly
Take what we found, mix in conversation history, add any special notes from the property manager, pull live availability if they're asking about dates. Bundle all of this together.
Response generation
GPT-4 gets the complete package and writes a natural response. Since it has the actual facts sitting right there, it stops making things up.
Implementation Details: PostgreSQL + pgvector
We chose PostgreSQL with pgvector over specialized vector databases for three reasons:
First, our data already lives in Postgres. Adding vector search capabilities to our existing database meant fewer moving parts and simpler deployments across our multi-tenant architecture.
Second, pgvector performance is excellent for our scale. With HNSW indexing, similarity searches complete in under 50ms even with 10,000+ properties per tenant.
Third, transactional consistency matters. When property data updates, we can atomically update both the relational records and the embeddings in a single transaction. No eventual consistency headaches.
The setup is straightforward:
CREATE EXTENSION vector;
CREATE TABLE property_embeddings (
    id BIGSERIAL PRIMARY KEY,
    property_id INTEGER REFERENCES properties(id),
    content_type VARCHAR(50), -- 'amenities', 'rules', 'instructions', etc.
    content TEXT,
    embedding VECTOR(1536),
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX ON property_embeddings
USING hnsw (embedding vector_cosine_ops);
Each property ends up with multiple embedding records, one per content type. This granularity helps us track which specific information piece matched the guest's question.
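The search layer described earlier is a single query against this table. Here's a minimal sketch, assuming psycopg2 with the pgvector Python adapter and the OpenAI v1 client; the function name and connection handling are illustrative, not our production code:
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_property_context(conn, property_id, question, top_k=5, min_similarity=0.75):
    # Embed the guest's question with the same model used for the property data
    question_vec = np.array(
        client.embeddings.create(
            model="text-embedding-3-small",
            input=question,
        ).data[0].embedding
    )
    with conn.cursor() as cur:
        # <=> is pgvector's cosine distance operator, so similarity = 1 - distance
        cur.execute(
            """
            SELECT content_type, content, 1 - (embedding <=> %s) AS similarity
            FROM property_embeddings
            WHERE property_id = %s
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (question_vec, property_id, question_vec, top_k),
        )
        rows = cur.fetchall()
    # Drop weak matches; 0.75 is the cutoff discussed later in this post
    return [row for row in rows if row[2] >= min_similarity]

# Usage:
# conn = psycopg2.connect("dbname=rentals")
# register_vector(conn)  # teaches psycopg2 how to send/receive VECTOR values
# matches = search_property_context(conn, 42, "where can I leave my car")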
Embedding Generation: Batch Processing Everything
When a property gets added or updated, we don't generate embeddings synchronously. That would block the API response and waste database connections waiting on OpenAI's API.
Instead, we queue embedding generation as background jobs. Our worker processes pick these up, batch multiple items together (OpenAI allows up to 2048 inputs per request), and write the results back in bulk. This approach:
- Keeps API response times under 200ms
- Reduces embedding API costs by 60% through batching
- Handles OpenAI rate limits gracefully with exponential backoff
- Makes embedding updates non-blocking for property managers
The batching code looks roughly like this:
from openai import OpenAI

client = OpenAI()  # v1 SDK client; reads OPENAI_API_KEY from the environment

def generate_embeddings_batch(items):
    texts = [item['content'] for item in items]
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [d.embedding for d in response.data]
    # Bulk insert to database (db and PropertyEmbedding are our ORM helpers)
    with db.transaction():
        for item, embedding in zip(items, embeddings):
            PropertyEmbedding.create(
                property_id=item['property_id'],
                content_type=item['content_type'],
                content=item['content'],
                embedding=embedding
            )
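The "exponential backoff" bullet above is just a small wrapper around this function. A rough sketch, with retry counts and sleep times that are illustrative rather than our exact production values:
import random
import time
import openai

def embed_with_backoff(items, max_retries=5):
    # Retry the batch on rate limits, doubling the wait each time plus jitter
    for attempt in range(max_retries):
        try:
            return generate_embeddings_batch(items)
        except openai.RateLimitError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("embedding batch still rate-limited after retries")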
Query Processing: Understanding Guest Intent
Not all guest messages are straightforward questions. Someone might say "We're arriving around 7pm and wondering about parking, also what's the WiFi?" That's three separate intents: arrival time update, parking location question, and WiFi credentials request.
We handle this in two passes:
First, send the message to GPT-4 and ask it to pull out what the guest actually wants:
- Questions being asked (WiFi, parking, checkout times)
- Requests being made (early check-in, extra towels)
- Information being provided (arrival time, special needs)
- Any date mentions
This returns structured JSON we can act on programmatically.
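The extraction prompt itself keeps evolving, but the shape of this first pass looks roughly like the sketch below; the JSON field names are illustrative, not our production schema:
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """Extract what the guest wants from the message below.
Reply with JSON only, using these keys:
"questions": questions the guest is asking,
"requests": things the guest is requesting,
"info_provided": facts the guest is sharing (e.g. arrival time),
"dates": any dates mentioned, as strings.
MESSAGE:
{message}"""

def extract_intents(message):
    # First pass: turn a free-form guest message into structured intents
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(message=message)}],
        temperature=0,
    )
    return json.loads(completion.choices[0].message.content)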
Then we run separate searches for each question. Guest asked about parking and WiFi? Two similarity searches, combine the results.
This approach means we fetch all relevant information in one round trip, even for complex multi-part messages.
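Gluing the two passes together is a short loop. A sketch reusing the hypothetical extract_intents and search_property_context helpers from the earlier snippets:
def retrieve_for_message(conn, property_id, message):
    # One similarity search per extracted question, merged into a single context
    intents = extract_intents(message)
    seen, chunks = set(), []
    for question in intents.get("questions", []):
        for content_type, content, similarity in search_property_context(conn, property_id, question):
            if content not in seen:  # crude dedup across overlapping matches
                seen.add(content)
                chunks.append((content_type, content, similarity))
    return chunks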
Context Assembly: Feeding the LLM
Once we have the relevant property data, we need to format it for GPT-4. The prompt structure matters enormously here. Too little context and the LLM lacks information. Too much and it gets confused about what's actually relevant.
Our prompt template looks like this:
You are a helpful property manager assistant responding to a guest.
PROPERTY INFORMATION:
{retrieved_property_data}
CONVERSATION HISTORY:
{last_5_messages}
GUEST'S CURRENT MESSAGE:
{current_message}
ADDITIONAL INSTRUCTIONS FROM PROPERTY MANAGER:
{custom_instructions}
Generate a helpful response to the guest's message. Stick to the
property information above. Don't make anything up. If you don't
have the info, say so.
The retrieved property data is already filtered down to what's relevant for this specific question. Usually ends up being 1500-2500 tokens, which is manageable and gives the LLM what it needs without overloading it.
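Wiring that template into the generation call is the easy part. A minimal sketch, assuming the same OpenAI v1 client as earlier and retrieval output shaped as (content_type, content, similarity) tuples; the helper names are ours for illustration:
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are a helpful property manager assistant responding to a guest.
PROPERTY INFORMATION:
{retrieved_property_data}
CONVERSATION HISTORY:
{last_5_messages}
GUEST'S CURRENT MESSAGE:
{current_message}
ADDITIONAL INSTRUCTIONS FROM PROPERTY MANAGER:
{custom_instructions}
Generate a helpful response to the guest's message. Stick to the
property information above. Don't make anything up. If you don't
have the info, say so."""

def generate_reply(chunks, history, message, custom_instructions=""):
    # Assemble the filtered context into the prompt and ask GPT-4 for a draft reply
    prompt = PROMPT_TEMPLATE.format(
        retrieved_property_data="\n\n".join(content for _, content, _ in chunks),
        last_5_messages="\n".join(history[-5:]),
        current_message=message,
        custom_instructions=custom_instructions or "None",
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content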
Handling Dates and Availability
Semantic search breaks down when guests ask about specific dates. "Can we book June 15-22?" or "Is early check-in available?" need real availability data, not embeddings.
For these questions we also query our availability calendar, a time-series structure kept completely separate from the embeddings.
When our message categorization step detects date mentions, we extract the specific dates using a simple regex pattern (handling various date formats). Then we:
- Run the semantic search to find contextually relevant information
- Query the availability table for the exact dates mentioned
- Combine both in the context sent to GPT-4
So the LLM sees both the semantic matches (house rules, parking info) and the hard data (yes/no on those specific dates). Works well enough.
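For what it's worth, the date-extraction step is nothing fancy. A deliberately simplified sketch; real messages need far more formats (numeric dates, relative dates like "next Friday"), so treat the pattern as illustrative:
import re
from datetime import datetime

MONTH_DAY = re.compile(
    r"\b(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|"
    r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
    r"\s+(\d{1,2})(?:\s*-\s*(\d{1,2}))?",
    re.IGNORECASE,
)

def extract_date_ranges(message, year=None):
    # Return (start, end) date pairs like the "June 15-22" in the example above
    year = year or datetime.now().year
    ranges = []
    for month, start_day, end_day in MONTH_DAY.findall(message):
        start = datetime.strptime(f"{month[:3]} {start_day} {year}", "%b %d %Y").date()
        end = datetime.strptime(f"{month[:3]} {end_day or start_day} {year}", "%b %d %Y").date()
        ranges.append((start, end))
    return ranges

# extract_date_ranges("Can we book June 15-22?") -> [(June 15, June 22) of the current year]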
Keeping the AI From Making Things Up
Even with RAG, LLMs can still hallucinate if you're not careful. We've implemented three safeguards:
- Explicit Instructions Against Invention: Our system prompt explicitly tells GPT-4 to only use provided information and to say "I don't have that information" rather than guessing. This simple instruction reduces hallucinations by roughly 80%.
- Source Attribution: We track which property data chunk was used for each part of the response. If an operator questions an AI response, we can show them exactly where that information came from.
- Confidence Scoring: Cosine similarity spits out a score between 0 and 1 for each match. We set a cutoff at 0.75. Below that, the match is probably garbage and we leave it out. Took some testing to find that number, but it works.
Multi-Question Responses: Deduplication Matters
Here's a subtle problem we ran into: if a guest asks "What's the WiFi password and where do I park?", and both the WiFi data and parking data mention the address, the LLM might include the address twice in its response.
We solved this with deduplication at the context level. Before sending information to GPT-4, we:
- Identify overlapping content across retrieved chunks
- Merge chunks that cover the same topic
- Remove redundant information
- Present a clean, non-repetitive context
The LLM still generates natural-sounding responses, but the input is already cleaned up so it doesn't have to figure out what to include or exclude.
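The simplest workable version of that cleanup is exact-match deduplication at the sentence level. A sketch of the idea; in practice, near-duplicate detection (e.g. comparing chunk embeddings) catches more:
import re

def dedupe_chunks(chunks):
    # Drop sentences that already appeared in an earlier retrieved chunk
    seen, cleaned = set(), []
    for content_type, content, similarity in chunks:
        kept = []
        for sentence in re.split(r"(?<=[.!?])\s+", content):
            key = sentence.strip().lower()
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence.strip())
        if kept:
            cleaned.append((content_type, " ".join(kept), similarity))
    return cleaned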
Conversation History: Context Windows That Actually Work
Early versions of our system included the entire conversation history in every request. With long threads, this quickly exceeded token limits or made responses expensive.
Now we use a sliding window approach: include only the most recent 5 messages, plus any messages that contained important information (like the guest's arrival time or special requests).
This selective history keeps token counts manageable while ensuring the LLM doesn't lose critical context. The categorization step from earlier helps here too, because we know which past messages contained information versus just pleasantries.
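A sketch of that selection logic, assuming each stored message carries a flag the categorization step set earlier; the flag name is hypothetical:
def select_history(messages, window=5):
    # Most recent messages, plus older ones the categorization step flagged as informative
    recent = messages[-window:]
    flagged = [m for m in messages[:-window] if m.get("has_key_info")]
    return flagged + recent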
Custom Instructions: Giving Operators Control
Property managers sometimes need to override or supplement the standard information. Maybe the pool is under maintenance, or they can offer early check-in for a specific booking.
Our system lets operators add custom instructions when triggering an AI response. These get appended to the context with high priority, and we explicitly instruct GPT-4 to incorporate them naturally into the response.
For example:
- Standard response: "The pool is available 8am-10pm daily."
- With custom instruction "mention pool is closed for maintenance": "Hey, just so you know, the pool's closed for maintenance right now. Should be back open next week. Normally it's 8am-10pm."
This flexibility is crucial for production use. Operators stay in control while still getting the efficiency benefits of AI.
Performance Metrics: What Actually Improved
After rolling out the RAG-based response system across our platform, we tracked several metrics:
Response Time for Operators: Dropped from 4.2 minutes average to 45 seconds. Most of that time is now just reviewing the AI suggestion before sending.
Response Accuracy: Guest follow-up questions decreased 63%, indicating responses contained the right information the first time.
After-Hours Coverage: Properties now maintain <15 minute response times even when operators are offline, using fully automated responses for common questions.
Cost Per Response: Including OpenAI API costs (embeddings + generation), each AI response costs about $0.08. Compared to 4 minutes of operator time, ROI is obvious.
Hallucination Rate: After tuning our retrieval thresholds and prompt engineering, verifiable errors dropped to under 0.5% of responses.
Scaling Considerations: Multi-Tenant Architecture
Our platform uses isolated instances per tenant (client). Each client gets their own API layer, database, and background workers. This architecture choice impacts RAG in interesting ways.
The good: Complete data isolation. Client A's property embeddings never mix with Client B's. No cross-contamination risk.
The challenging part: Embedding generation and search runs independently per tenant. More properties means more worker processes and bigger batch sizes, but we're not rebuilding anything.
We handle this with per-tenant resource allocation. Heavy clients (large property portfolios, high message volume) automatically get more worker processes and higher embedding batch sizes. Light clients stay on minimal resources.
The vector database (Postgres with pgvector) scales linearly with property count. A tenant with 1000 properties might have 15,000 embedding records (multiple content types per property), but HNSW indexing keeps search performance fast.
Where RAG Falls Short: Edge Cases We Handle Differently
RAG works great for factual retrieval, but some scenarios need different approaches:
Policy Questions: "Can I bring my dog?" requires understanding pet policies, but also involves some judgment. We retrieve the policy text but also include a note for operators to review the response before sending.
Pricing and Availability: We don't use embeddings for this at all. Dates and numbers need exact lookups, not semantic search. These queries go straight to our structured database.
Operational Issues: "The AC isn't working" needs to create a maintenance ticket, not generate a response. Our categorization step routes these to a different workflow entirely.
Subjective Questions: "Is your property family-friendly?" involves interpretation. We retrieve objective facts (number of bedrooms, presence of amenities) but let the LLM synthesize an appropriate answer rather than trying to pre-define what "family-friendly" means.
Future Improvements: What's Next
We're constantly iterating on the system. Current focus areas:
Fine-tuning for property domain: Training a specialized embedding model on vacation rental content might improve retrieval accuracy over general-purpose models.
Hybrid search: Sometimes semantic search misses exact keyword matches. "Pool hours" might not hit if we embedded it as "swimming pool schedule." Adding traditional keyword search as a fallback could help.
Response personalization: If a guest already asked about parking once, we could proactively mention it in check-in instructions. Right now we treat every message independently.
Multi-modal embeddings: Would be useful to search property photos, not just text. Guest asks "does the kitchen have a coffee maker?" and we could actually look at the kitchen photo to answer instead of hoping it's listed in the amenities.
Lower latency: Currently, response generation takes 2-4 seconds. With prompt caching and streaming, we could get this under 1 second for better operator experience.
Takeaways for Building Your Own RAG System
If you're implementing RAG for a production system, here's what matters most:
Start with clear success criteria. We went in saying "reduce average operator response time to under 60 seconds" and "get guest follow-up rate below 10%". Way better than "make responses better" which doesn't tell you if you're done.
Invest in data quality. RAG is only as good as your knowledge base. Garbage embeddings produce garbage responses. We spent significant time cleaning and structuring property data before generating embeddings.
Test retrieval separately from generation. Debug your semantic search in isolation. Make sure it returns the right content before worrying about how the LLM uses that content.
Don't over-engineer initially. Our first version used just text-embedding-ada-002 (older model) and basic cosine similarity with no fancy re-ranking or query expansion. It still provided 80% of the value.
Monitor everything. Track retrieval accuracy, response quality, hallucination rates, latency, and costs. RAG systems drift over time as data changes, and you need visibility to catch problems early.
Give humans oversight. We're doing guest communication here. Wrong information pisses people off and costs bookings. Every AI suggestion goes through an operator before it gets sent. Takes 30 seconds to review, saves you from confident mistakes.
Getting RAG Into Production vs Building a Demo
We went from property managers manually typing out hundreds of messages a day to AI handling most of the routine stuff. Operators now spend their time on the weird cases and angry guests instead of copy-pasting WiFi passwords.
The tech stack matters less than you'd think. Postgres, pgvector, OpenAI embeddings, GPT-4. You could swap pieces and get similar results. What matters more: how you structure your knowledge base, when you decide to retrieve versus generate, how you handle conversations that span multiple messages, and how you stop the AI from inventing things.
RAG works when you're answering questions from a large corpus of information that keeps changing. It stops working when you need it to work perfectly in a demo but haven't thought through data quality, performance at scale, or what happens when the AI gets something wrong.
Production systems need human oversight, especially when mistakes have consequences. Build for that from the start.
At Devslane, we specialize in building production-grade AI systems for clients across property management, fintech, healthcare, and other domains where accuracy and reliability matter. Our engineering teams have delivered RAG implementations, custom LLM integrations, and AI-powered automation for clients processing millions of interactions. If you're considering AI augmentation for your platform, let's talk about what's actually feasible.
FAQs (Frequently Asked Questions)
When should I use RAG instead of fine-tuning my model?
Use RAG when your information changes frequently or when you need to cite sources. We went with RAG because property details (WiFi passwords, parking locations, house rules) update constantly. Fine-tuning would mean retraining every time someone changes a door code. RAG lets us update the database and the AI immediately has the new information. Fine-tuning makes more sense when you need to change how the model talks or reasons, like teaching it medical terminology or legal writing style. For us, the base model already writes fine. We just needed it to access accurate property data.
Why use pgvector instead of a dedicated vector database like Pinecone or Qdrant?
Our data already lived in Postgres. Adding pgvector meant one less system to manage, one less thing to fail, and we could update both property records and embeddings in the same transaction. Search performance was fine for our scale (10,000+ properties per tenant, sub-50ms queries with HNSW indexing). Specialized vector databases are faster at the high end, but we weren't hitting those limits. The operational simplicity of keeping everything in Postgres won out. If we were building a semantic search engine handling billions of vectors, different story. For a property management system doing a few hundred searches per second, pgvector does the job.
How do you prevent the AI from hallucinating property information?
Three things worked for us. First, we explicitly tell GPT-4 in the system prompt to only use the provided context and to say "I don't have that information" rather than guess. Second, we set a confidence threshold on retrieval. If the semantic search match scores below 0.75, we don't include it in the context at all. Better to say we don't know than to work with marginally relevant information. Third, every AI response goes through an operator before it gets sent to a guest. Takes 30 seconds to review, catches the occasional weird response. The combination dropped our hallucination rate to under 0.5%.
What's the difference between semantic search and keyword search, and do you need both?
Semantic search understands meaning. A guest asks "where can I leave my car" and it matches that with parking location data even though the words are completely different. Keyword search is literal. It looks for exact words or phrases. We use semantic search as the primary method because guests phrase questions in dozens of ways. But we've seen cases where semantic search missed obvious exact matches. Someone searches "pool hours" and the system doesn't hit the chunk that says "swimming pool schedule" because it got embedded slightly differently. We're considering adding keyword search as a fallback, but haven't needed it yet. For most queries, semantic search handles the variation just fine.
How much does it cost to run RAG in production with OpenAI's APIs?
For us, about $0.08 per response including both embedding generation and GPT-4 inference. Embeddings are cheap (text-embedding-3-small is $0.00002 per 1K tokens). The GPT-4 generation is where the cost sits. We keep prompts under 2500 tokens and responses under 500 tokens, so each complete interaction runs about 3000 tokens total. At current pricing the generation accounts for nearly all of that $0.08, with the embedding lookup adding a negligible amount. We batch embedding generation for new properties (up to 2048 at once) which cuts that cost by about 60%. The ROI is obvious when you compare $0.08 to paying someone to manually type the same response for 4 minutes.
What is Retrieval-Augmented Generation (RAG) and why is it important for building production-ready systems?
Retrieval-Augmented Generation (RAG) combines semantic search with language generation to provide intelligent, accurate responses by retrieving relevant information before generating answers. It addresses limitations of basic chatbots and enhances response quality in production environments.
Why don't basic rule-based chatbots work well for complex guest interactions?
Basic rule-based chatbots rely on predefined rules and fail to handle the nuances and varied intents in guest messages, especially when questions are indirect or multi-faceted. This leads to poor user experience and inaccurate responses.
How does the system use PostgreSQL with pgvector for embedding storage and retrieval?
The system uses PostgreSQL with the pgvector extension to store vector embeddings efficiently, enabling fast similarity searches without relying on specialized vector databases. This choice balances scalability, ease of integration, and performance.
How are context and conversation history managed to improve response accuracy?
Relevant property data and prior conversation snippets are carefully assembled into context windows fed into the language model. This selective context assembly avoids overwhelming the model while preserving essential information for accurate responses.
What strategies are used to prevent AI hallucinations in RAG systems?
To reduce hallucinations, the system grounds generated responses strictly on retrieved factual data, implements deduplication for multi-question handling, and applies custom operator instructions to control output reliability.