Is Your Data Ready for AI? What Businesses Get Wrong Before They Start

One of the most common conversations I have with new clients goes something like this: they've identified a process they want to automate or a decision they want AI to support. They've seen the demos. They're convinced the technology works. They want to move quickly.

Then I ask to see the data.

What follows is usually the real project. Not the AI — the work that has to happen before AI becomes viable. Data that lives in three different systems with no reliable way to join them. Customer records with no consistent identifiers. Years of transaction history stored in PDF exports. A spreadsheet that one person maintains manually and nobody fully trusts.

This isn't unusual. It's the norm. And it's the single most overlooked factor when businesses plan AI initiatives.

Why Data Quality Makes or Breaks AI Projects

Machine learning models and AI systems don't add intelligence to bad data — they amplify whatever's already there. Biased data produces biased outputs. Incomplete data produces incomplete answers. Inconsistent data produces unpredictable behavior.

The computing adage "garbage in, garbage out" is older than modern AI, but it has never been more relevant. When you train or fine-tune a model on your business data, the model learns the patterns in that data. If the data is noisy, sparse, or poorly structured, the model learns from that noise and those gaps, and its outputs become unreliable in the same ways.

This is why data readiness — assessing whether your data is actually fit for the AI use case you have in mind — needs to happen before you commission a build, sign a vendor contract, or run a proof of concept.

What Does "Data Readiness" Actually Mean?

Data readiness isn't a single thing. It's a set of conditions that vary depending on what you're trying to do with AI. A business looking to automate document classification has different data requirements than one building a customer churn prediction model. But there are four dimensions that matter across nearly every use case.

Availability

Can you actually get to the data? This sounds basic, but it's often the first blocker. Data locked in legacy systems with no API access, vendor-controlled platforms that limit exports, or operational databases that can't be queried without disrupting production — these are access problems, not data problems, but they stop AI projects just as effectively.

Ask yourself: If I needed to pull the last two years of this data for analysis, how long would that take, and who would I need to involve?

Structure and Consistency

Structured data — rows and columns with consistent types and values — is significantly easier to work with than unstructured text, images, or audio. AI can handle unstructured data, but it requires more preprocessing work and typically more volume to get reliable results.

Consistency matters as much as structure. If your customer records use "NZ", "New Zealand", and "New Zealand (NZ)" interchangeably in the country field, downstream models will treat these as three different values. Multiplied across every field in every table, inconsistency creates significant noise that degrades model performance.

Ask yourself: If I asked two people to describe the same customer record in our system, would they describe it the same way?
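
To make the country-field example concrete, here is a minimal sketch of how you might count the distinct values in a field and collapse known variants into one canonical value. The alias table, the record layout, and the function names are all hypothetical, chosen only for illustration; a real clean-up would need a mapping built from your own data.

```python
from collections import Counter

# Hypothetical alias table: observed variants mapped to one canonical value.
COUNTRY_ALIASES = {
    "nz": "New Zealand",
    "new zealand": "New Zealand",
    "new zealand (nz)": "New Zealand",
}


def normalise_country(raw):
    """Return the canonical country name, or the trimmed input if unknown."""
    key = " ".join(raw.split()).lower()  # collapse whitespace, ignore case
    return COUNTRY_ALIASES.get(key, raw.strip())


def distinct_values(records, field):
    """Count how often each distinct value of a field appears."""
    return Counter(r[field] for r in records)


# Made-up sample records showing the kind of variation described above.
records = [
    {"customer": "A", "country": "NZ"},
    {"customer": "B", "country": "New Zealand"},
    {"customer": "C", "country": "New Zealand (NZ)"},
    {"customer": "D", "country": " new zealand "},
]

before = distinct_values(records, "country")
cleaned = [{**r, "country": normalise_country(r["country"])} for r in records]
after = distinct_values(cleaned, "country")

print(len(before), "distinct values before,", len(after), "after")
```

Running the count before any clean-up is the useful part: a field that should hold a handful of values but holds dozens is a quick signal of inconsistency.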

Volume and Recency

Different AI tasks need different amounts of data. A rule of thumb: the more nuanced the prediction or classification, the more examples you need to train on. For many supervised learning tasks, you need at least hundreds of labelled examples, and ideally thousands. For simpler tasks like document routing or anomaly detection, you may need fewer.

Recency matters separately. A churn model trained on customer behaviour from 2019 may not reflect how customers behave today. If your historical data is old, sparse, or doesn't reflect current conditions, its predictive value is limited.

Ask yourself: How many examples do I have of the outcome I'm trying to predict? And when does my data history start and stop?
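
Both questions can be answered with a quick count. The sketch below assumes records are already loaded as dictionaries, with a hypothetical `churned` outcome field and a hypothetical `last_seen` date field; the field names and sample values are invented for illustration.

```python
from collections import Counter
from datetime import date


def summarise_outcomes(records, outcome_field, date_field):
    """Count examples per outcome and report the date range covered."""
    # A record counts as labelled only if the outcome field is filled in.
    labelled = [r for r in records if r.get(outcome_field) not in (None, "")]
    counts = Counter(r[outcome_field] for r in labelled)
    dates = [r[date_field] for r in records if r.get(date_field)]
    return {
        "total": len(records),
        "labelled": len(labelled),
        "per_outcome": dict(counts),
        "first_date": min(dates) if dates else None,
        "last_date": max(dates) if dates else None,
    }


# Made-up customer records; one has no recorded outcome.
customers = [
    {"id": 1, "churned": "yes", "last_seen": date(2021, 3, 1)},
    {"id": 2, "churned": "no", "last_seen": date(2023, 11, 15)},
    {"id": 3, "churned": "no", "last_seen": date(2024, 6, 30)},
    {"id": 4, "churned": None, "last_seen": date(2022, 1, 10)},
]

summary = summarise_outcomes(customers, "churned", "last_seen")
print(summary)
```

The per-outcome counts matter more than the total: thousands of records with only a handful of examples of the outcome you care about is still a small dataset for that outcome.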

Completeness and Accuracy

Missing values and factual errors are different problems, but both degrade AI outputs. Partially complete records — customers with no purchase history, invoices with no line items, support tickets with no resolution recorded — create gaps that models have to work around. And if the data that exists is wrong (prices entered incorrectly, dates misformatted, categories misapplied), the model learns the wrong thing.

Accuracy is particularly hard to assess without domain knowledge. This is why I always recommend involving people who work with the data day-to-day early in any AI engagement — they know where the bodies are buried.

Ask yourself: Do the people who use this data daily trust it? Are there fields everyone knows are unreliable but nobody has fixed?
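
Completeness, unlike accuracy, can be measured mechanically. As a rough sketch, the function below reports the share of records in which each field is absent or blank. The support-ticket records and field names are hypothetical. Note that this says nothing about whether the values that are present are correct; that still needs the people who know the data.

```python
def missing_rate_by_field(records, fields):
    """Return the share of records where each field is absent or blank."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(
            1
            for r in records
            if r.get(field) is None or str(r.get(field)).strip() == ""
        )
        rates[field] = missing / total if total else 0.0
    return rates


# Made-up support tickets with gaps of the kind described above.
tickets = [
    {"id": 1, "category": "billing", "resolution": "refunded"},
    {"id": 2, "category": "", "resolution": None},
    {"id": 3, "category": "login", "resolution": ""},
    {"id": 4, "category": "billing"},  # resolution key missing entirely
]

rates = missing_rate_by_field(tickets, ["category", "resolution"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```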

The Most Common Data Problems I See

After working across a range of businesses in New Zealand and internationally, these are the data situations that most frequently block AI projects.

Data scattered across too many systems. CRM, accounting software, job management platform, and a folder of spreadsheets — each holding different pieces of the same customer picture, with no reliable way to connect them. Before any AI can work across this data, you need a way to unify it. This is often an integration project before it's an AI project.

No ground truth for the thing you want to predict. You want to predict which leads will convert — but your sales data doesn't reliably record why a deal was lost. You want to classify customer complaints — but nobody has ever labelled them. Building a model without labelled examples of what you're trying to detect is significantly harder and more expensive.

Data that exists only in documents. Contracts, invoices, emails, reports stored as PDFs or images. Extracting structured information from these is possible with modern AI, but it requires preprocessing infrastructure and validation that many businesses underestimate. If your core business data lives in documents rather than databases, that's the starting point.

Privacy and compliance constraints you haven't mapped. In many AI use cases, particularly anything involving customer behaviour, you'll be using personally identifiable information. If you haven't mapped what data you hold, under what consent basis, and what your obligations are under privacy law, this is something to resolve before you start feeding that data into AI systems — especially cloud-based ones.

How to Do a Basic Data Audit Before an AI Project

You don't need a data engineering team to get a working picture of your data readiness. Here's a practical starting point.

List every system that holds data relevant to the problem. Not just the obvious ones — include email, spreadsheets, and any tools used only by specific teams. This becomes your data inventory.

For each system, answer: What data is in it? How is it structured? Can I export or access it programmatically? When was it last cleaned or audited?

Identify the joins. For any AI use case that requires combining data from multiple sources, find out whether reliable joining keys exist. A customer ID that appears in both your CRM and your accounting system is what makes cross-system analysis possible. If those keys don't match up, you have a data integration problem to solve first.
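
One way to test whether a joining key is reliable is to export the key column from each system and compare the two sets. The sketch below assumes you have already exported customer IDs from a CRM and an accounting system as plain lists; the IDs and the function name are invented for illustration.

```python
def key_overlap(left_keys, right_keys):
    """Compare two lists of keys and report how well they line up."""
    left, right = set(left_keys), set(right_keys)
    matched = left & right
    return {
        "matched": len(matched),
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
        # Share of left-hand keys that have a partner on the right.
        "match_rate_left": len(matched) / len(left) if left else 0.0,
    }


# Made-up exports; note "c004" differs from "C004" only by case.
crm_ids = ["C001", "C002", "C003", "C004"]
accounting_ids = ["C001", "C003", "c004", "C005"]

report = key_overlap(crm_ids, accounting_ids)
print(report)
```

The unmatched lists are worth reading by eye. Near-misses such as a difference in letter case or leading zeros usually point to a formatting problem that is cheap to fix, while keys that are simply absent point to a real integration gap.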

Pull a sample and look at it. Not a summary report, but the actual raw records. Missing fields, inconsistent values, and data entry quirks only become visible when you look at real rows. Pick 100 records and inspect them manually. This is often more revealing than a data quality dashboard.
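
If the data can be exported to CSV, a random sample is usually better than the first 100 rows, which tend to be the oldest or the tidiest. Here is a minimal sketch using only the Python standard library. The column names and the in-memory stand-in for an export file are assumptions; in practice you would open your own exported file instead.

```python
import csv
import io
import random


def sample_rows(csv_file, n=100, seed=0):
    """Return up to n randomly chosen rows from a CSV file object."""
    rows = list(csv.DictReader(csv_file))
    rng = random.Random(seed)  # fixed seed so the same sample can be re-pulled
    return rng.sample(rows, min(n, len(rows)))


# Stand-in for a real export: 500 made-up rows built in memory.
export = io.StringIO(
    "customer_id,country,last_order\n"
    + "".join(f"C{i:04d},NZ,2024-01-{(i % 28) + 1:02d}\n" for i in range(500))
)

sample = sample_rows(export, n=100)
for row in sample[:5]:
    print(row)
```

This loads the whole file into memory, which is fine for a typical business export but would need a different approach for very large files.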

Talk to the people who use the data. Your operations team, customer service staff, or salespeople will have a working understanding of where the data is trustworthy and where it isn't. This context is invaluable and can't be discovered through technical analysis alone.

When Should You Fix Data Before Starting an AI Project?

Not always. The answer depends on what you're trying to achieve and how quickly.

If your data problems are structural — missing integration between systems, no historical records, fundamental gaps in what you've been capturing — you generally need to address these before AI can add value. Trying to build AI on top of structurally broken data is expensive and usually produces poor results that undermine confidence in the whole initiative.

If your data problems are quality issues — inconsistencies, some missing values, moderate noise — you may be able to start a proof of concept while cleaning and improving data in parallel. The proof of concept can help you understand exactly what quality level is needed, rather than trying to achieve perfection upfront.

If you're using general-purpose AI tools (not training models on your own data), your data readiness requirements are lower. Tools like AI writing assistants, summarisation tools, or chat-based interfaces don't require you to have clean historical data — they work with whatever you give them at the time. The data readiness question becomes more acute when you're doing predictive analytics, automation that depends on your historical records, or anything that involves fine-tuning or training models.

What Good Data Looks Like in Practice

I'm sometimes asked what "good enough" data looks like before starting an AI project. Here are the signals I look for.

The data you need exists in accessible systems rather than being locked in documents or manual processes. It covers a meaningful time period (typically at least 12 to 24 months for business analytics). There's a reliable identifier that connects records across systems. The people who use the data consider it trustworthy for day-to-day decisions. And there are enough examples of the outcome you're trying to model that a human could learn from them.

None of this requires perfect data. It requires data that's good enough to learn from. That bar is lower than many businesses assume — and also higher than many businesses actually meet.

Getting Help with Data Readiness

If you're planning an AI project and aren't sure whether your data is ready, a structured assessment before you commit to a build is worth the investment. Understanding your data landscape — what you have, what's missing, and what needs to change — is the work that makes everything downstream more likely to succeed.

As part of AI consulting engagements, I include a data readiness review at the start. It's often the most valuable thing I do for a client, because it either confirms they can move forward with confidence or surfaces the real work that needs to happen first — before money is spent on technology that can't yet deliver.

If you're at the planning stage, I'm happy to have a direct conversation about what you're trying to achieve and whether your data foundation is ready to support it. You can get in touch here.