SCRUB & SCALE
WHY CLEAN DATA IS THE FOUNDATION OF ENTERPRISE AI
A live demonstration for program leadership — Aerospace & Defense

Your organization is planning to implement AI. The budget is approved. The teams are excited. But there is a problem hiding in your data — and if you don’t address it first, your AI investment will underperform before it starts. This demonstration walks through exactly what that problem looks like, how to find it, and how to fix it. No coding experience required.

DATA SOURCE: Deltek Open Plan (OPP) — Integrated Master Schedule Export
INDUSTRY: Aerospace & Defense — Program Management / EVM
PROGRAM TYPE: Major Defense Acquisition Program (MDAP)
AUDIENCE: Program Leadership, Finance, Controls, and Strategy
OBJECTIVE: Demonstrate why data quality is the prerequisite for enterprise AI
01 — RAW DATA 02 — QUERY 03 — PROBLEMS 04 — AUDITOR 05 — THE FIX 06 — MONITOR
01 / THE RAW DATA

Your Integrated Master Schedule —
Raw from Deltek Open Plan

Deltek Open Plan (OPP) is the scheduling engine behind some of the largest defense programs in the world. It manages Integrated Master Schedules — the detailed, task-by-task roadmaps that define how a program is planned, resourced, and executed. It tracks thousands of individual activities across the Work Breakdown Structure (WBS), assigns resources and budgets, and records planned versus actual performance over time.

The three numbers at the heart of every OPP export are the foundation of Earned Value Management. BCWS (Budgeted Cost of Work Scheduled) is what you planned to spend by this point in time — your budget baseline. BCWP (Budgeted Cost of Work Performed) is what you actually earned by completing work — the value you’ve produced. ACWP (Actual Cost of Work Performed) is what you actually spent to produce that work. Divide BCWP by BCWS and you get the Schedule Performance Index (SPI). Divide BCWP by ACWP and you get the Cost Performance Index (CPI). These are the vital signs of a program’s health.
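These two ratios are simple enough to compute directly. A minimal sketch in Python, using illustrative dollar figures rather than program data:

```python
def spi(bcwp: float, bcws: float) -> float:
    """Schedule Performance Index: value earned divided by value planned."""
    return bcwp / bcws

def cpi(bcwp: float, acwp: float) -> float:
    """Cost Performance Index: value earned divided by actual cost."""
    return bcwp / acwp

# Illustrative task: planned $100K by now, earned $90K, spent $110K
print(round(spi(90_000, 100_000), 2))  # 0.9  -> behind schedule
print(round(cpi(90_000, 110_000), 2))  # 0.82 -> over budget
```

Both indices below 1.0 is exactly the condition the query in the next section screens for.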

This data is the primary input for any AI system that would predict schedule delays, forecast cost overruns, detect anomalies, or recommend program adjustments. The quality of this data directly determines the quality of any AI output built on top of it.

What you’re about to see is a realistic representation of what OPP export data looks like in practice — not after it has been cleaned, validated, or processed. This is the raw feed. Look closely.

This is your data. Imported directly from Open Plan. Exactly as the system exported it. Click NEXT to ask it a question.

02 / THE QUERY

Asking the Data a Question —
An Intermediate SQL Analysis

SQL — Structured Query Language — is how we ask questions of a database. Think of it as a very precise, very literal assistant. You write out exactly what you want to find, what conditions it must meet, and how you want it organized. The database returns only what matches. Every modern reporting tool, dashboard, and AI system uses SQL (or something very similar) at its foundation.

The question we’re asking is one every program manager cares about: which active tasks are simultaneously behind schedule AND over budget? In Earned Value terms, we want tasks where both the Schedule Performance Index (SPI) and Cost Performance Index (CPI) fall below 1.0. These are the tasks that need immediate attention — and they’re the first thing a predictive AI model would focus on.

This query is well-written. The logic is sound. But the data it’s running against is not clean. Watch what happens.

PROGRAM HEALTH QUERY — SENTINEL-7 IMS (SQL)
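The query from the live demo is not reproduced here; a representative version of the same question, run against an in-memory SQLite copy of the export with assumed column names (TASK_ID, STATUS, BCWS, BCWP, ACWP), might look like this:

```python
import sqlite3

# Assumed schema for illustration; a real OPP export carries many more fields.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ims (
    TASK_ID TEXT, STATUS TEXT, BCWS REAL, BCWP REAL, ACWP REAL)""")
con.executemany("INSERT INTO ims VALUES (?, ?, ?, ?, ?)", [
    ("T-001", "Active", 100.0,  90.0, 110.0),  # behind schedule AND over budget
    ("T-002", "Active", 100.0, 105.0,  95.0),  # healthy
    ("T-003", "Closed", 100.0,  80.0, 120.0),  # troubled, but not active
])

# Active tasks where both SPI and CPI fall below 1.0
rows = con.execute("""
    SELECT TASK_ID,
           ROUND(BCWP / BCWS, 2) AS SPI,
           ROUND(BCWP / ACWP, 2) AS CPI
      FROM ims
     WHERE STATUS = 'Active'
       AND BCWP / BCWS < 1.0
       AND BCWP / ACWP < 1.0
     ORDER BY SPI
""").fetchall()
print(rows)  # [('T-001', 0.9, 0.82)]
```

The logic is sound, exactly as stated above; the failures come from the data it runs against.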

    
03 / THE PROBLEMS

What’s Actually Wrong —
And Why It Matters for AI

Artificial intelligence doesn’t think. It learns patterns from examples. If those examples contain errors, inconsistencies, and gaps, the AI learns those errors as valid patterns. This is the single most underestimated risk in enterprise AI deployment — not that the AI will go rogue, but that it will quietly, confidently be wrong because the data it learned from was quietly, confidently wrong.

Below is your Integrated Master Schedule again. This time, every data quality issue has been identified and annotated. Click any highlighted cell or row to see exactly what the problem is, why it happened, and what it would do to an AI system built on this data. Watch your Data Health Score in the upper right as each issue is revealed.

Click highlighted cells to inspect individual issues
DATA HEALTH SCORE: 100 (STRONG)
04 / THE AUDITOR

The AI Readiness Auditor —
Automated Detection at Scale

A Python script is a set of instructions written in a programming language called Python — one of the most widely used languages in data science and AI. Think of it like a detailed checklist given to a very fast, very literal, very tireless assistant. You define the rules. The script applies them to every single row of data, every single time, in seconds — without missing anything, without getting tired, and without needing to understand context it wasn’t programmed to handle.

The AI Readiness Auditor below was written specifically for Deltek Open Plan exports. It codifies the 10 data quality rules we just examined into automated checks. Every function is annotated in plain English so any stakeholder can understand what the script is doing and why. This is not a black box — it is a documented, auditable, repeatable quality gate.

In practice, this script would run automatically each time a new OPP export is produced — as part of a scheduled task, a CI/CD pipeline, or a file-watcher monitoring your exports folder. It produces a structured JSON report that feeds directly into dashboards, alerts, and the AI pipeline gating system.
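The production script itself sits in the panel below; the pattern it follows, per-row rule functions rolled up into a JSON report with a pass/fail score, can be sketched in a few lines (field names, the rule subset, and the 10-points-per-finding weighting are assumptions for illustration, not the production logic):

```python
import json

def check_null_wbs(row):
    """Flag rows missing a WBS code."""
    return "NULL_WBS" if not (row.get("WBS_CODE") or "").strip() else None

def check_date_logic(row):
    """Flag finish dates that precede start dates (ISO strings compare safely)."""
    s, f = row.get("START"), row.get("FINISH")
    return "DATE_LOGIC" if s and f and f < s else None

def check_zero_acwp(row):
    """Flag active tasks that earned value but report zero actual cost."""
    if row.get("STATUS") == "Active" and row.get("BCWP", 0) > 0 and row.get("ACWP", 0) == 0:
        return "ZERO_ACWP"
    return None

CHECKS = [check_null_wbs, check_date_logic, check_zero_acwp]

def audit(rows):
    """Apply every check to every row; emit a machine-readable report."""
    findings = [{"row": i, "issue": code}
                for i, row in enumerate(rows)
                for code in (chk(row) for chk in CHECKS) if code]
    score = max(0, 100 - 10 * len(findings))  # assumed weighting
    return {"score": score, "clean": score >= 70, "findings": findings}

rows = [
    {"WBS_CODE": "1.2.4", "START": "2024-01-01", "FINISH": "2024-03-01",
     "STATUS": "Active", "BCWP": 50, "ACWP": 40},
    {"WBS_CODE": "", "START": "2024-02-01", "FINISH": "2024-01-15",
     "STATUS": "Active", "BCWP": 30, "ACWP": 0},
]
print(json.dumps(audit(rows), indent=2))  # second row trips all three checks
```

The `score >= 70` gate mirrors the clean/dirty threshold the monitoring pipeline applies later.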

AI READINESS AUDITOR — v1.0 — ai_readiness_auditor.py (Python)

    
TERMINAL — audit_runner.sh
Click RUN AUDITOR to execute the script against IMS_EXPORT_SENTINEL7.csv
05 / THE FIX

Scrubbing the Data —
Automated Correction and Human Review

Finding data problems is the first half of the solution. Fixing them is the second. But not all data problems are the same kind of problem — and this distinction is critical in a defense program environment where data drives contractual reporting, government oversight, and earned value system compliance.

The AI Readiness Auditor separates its findings into two queues: AUTO-CORRECTABLE issues, where the fix is unambiguous and a computer can safely apply it, and HUMAN REVIEW REQUIRED issues, where a decision must be made by a credentialed program controls professional before anything is changed. Automating a decision that requires human judgment — especially on CPR or IPMR data — is not a feature. It’s a liability.

AUTO-CORRECT
WBS code format — Standardize all codes to dot notation. "1-2-4-1" → "1.2.4.1"
OBS code typos — Fuzzy-match to nearest valid code (≥91% confidence only). "ENG-GCN" → "ENG-GNC"
Resource name normalization — Standardize to "Last, Initial" format. "J.Mitchell" and "john mitchell" → "Mitchell, J."
Whitespace trimming — Remove leading/trailing spaces from all text fields.
Status auto-close — Where PCT_COMPLETE = 100 AND ACTUAL_FINISH is populated, set STATUS = ‘Closed’.
REQUIRES REVIEW
Null WBS_CODE — Cannot be assumed. Must be assigned by Program Controls from the WBS dictionary.
Date logic violations — Could be a real event or a data entry error. Cannot correct without SME confirmation.
Duplicate Task IDs — Cannot determine the authoritative record without reviewing source schedules.
Zero ACWP on active tasks — May be a legitimate cost deferral, a system integration failure, or an error. Finance must adjudicate.
Implausible EVM values — May indicate a legitimate replan event or a critical error. Requires PCO or BCWS authorization review.
INTERACTIVE CORRECTION DEMO
HUMAN REVIEW QUEUE — SENTINEL-7 IMS — Generated
CORRECTION MODULE SOURCE CODE
AUTO-CORRECTION MODULE — corrections.py (Python)

      
06 / THE MONITOR

Continuous Monitoring —
Keeping Your Data AI-Ready Over Time

Data cleaning is not a one-time project. It is an ongoing operational discipline. In a live defense program, your Open Plan database is updated continuously — schedulers log progress, finance systems post actuals, subcontractors submit status. Every update is an opportunity for new bad data to enter the system. Without automated monitoring, data quality degrades silently, and the AI systems built on top of it degrade with it.

The monitoring script below is a watchdog. It runs continuously in the background, watching a designated folder where Open Plan exports are dropped. The moment a new export file appears, it automatically triggers the AI Readiness Auditor, applies auto-corrections, queues human review items, sends alerts to the program controls team, and forwards clean data to the AI pipeline — all within seconds of the file landing.

This is what a mature AI data pipeline looks like: not a one-time integration, but a self-monitoring, self-correcting quality gate that enforces data standards continuously, creates an auditable record of every data event, and ensures that the only data an AI model ever sees is data that has passed inspection.
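The production watcher is built on the watchdog library named in the architecture diagram; the core loop reduces to something like this polling sketch (folder path and handler are illustrative):

```python
import os
import tempfile

def scan_once(folder: str, seen: set) -> list:
    """One poll: return newly arrived .csv exports and mark them as seen."""
    new = [n for n in sorted(os.listdir(folder))
           if n.endswith(".csv") and n not in seen]
    seen.update(new)
    return new

def watch(folder: str, handle, poll_seconds: float = 2.0):
    """Run indefinitely: hand each new export to the auditor as it lands."""
    import time
    seen = set(os.listdir(folder))
    while True:
        for name in scan_once(folder, seen):
            handle(os.path.join(folder, name))  # audit, correct, alert, forward
        time.sleep(poll_seconds)

# Demo: drop an export into an empty temp folder and catch it on the next poll
folder = tempfile.mkdtemp()
seen = set(os.listdir(folder))
open(os.path.join(folder, "IMS_EXPORT_SENTINEL7.csv"), "w").close()
print(scan_once(folder, seen))  # ['IMS_EXPORT_SENTINEL7.csv']
```

An event-driven watchdog handler replaces the sleep loop in production, but the contract is the same: every file that lands is audited before anything downstream sees it.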

AUTOMATED DATA QUALITY PIPELINE ARCHITECTURE
Deltek Open Plan scheduled export (.csv) → File watcher (watchdog) → AI Readiness Auditor → Decision: clean or dirty?
  • Score ≥70 (CLEAN) → AI data pipeline → forecasting models, dashboards
  • Score <70 (DIRTY) → auto-correct → review queue → ⚑ alert to Program Controls
AI READINESS MONITOR — v1.0 — ai_readiness_monitor.py (Python)

    
LIVE MONITOR — SENTINEL-7 PROGRAM
MONITOR CONSOLE — /exports/opp/incoming/
Click ACTIVATE MONITOR to start the simulation
12-WEEK DATA HEALTH SCORE TREND — SENTINEL-7 PROGRAM
Annotations: Wk 5 — Subcontractor data feed integrated  |  Wk 9 — Emergency replan event  |  Wk 11 — Auto-correction pipeline deployed
WHY THIS MATTERS AT SCALE
  • A defense program with 5,000 IMS tasks and 200 weekly updates generates approximately 10,400 data change events per year.
  • Manual data quality review at current staffing levels would require an estimated 800–1,200 analyst hours per year — per program.
  • Automated monitoring reduces ongoing review effort by an estimated 70–85% after initial setup and configuration.
  • More importantly: it creates an auditable, timestamped, machine-readable quality record — essential for DCMA surveillance, DCSA compliance, EVM system certification under ANSI/EIA-748, and CPR/IPMR submission confidence.
  • When program leadership asks “can we trust this data?”, the answer becomes a dashboard, not a conversation.
THE CASE

The Bottom Line:
Data Quality Is Not an IT Problem.
It’s a Program Problem.

The demonstrations you’ve just seen are not hypothetical. Every data quality failure shown here — the null WBS codes, the duplicate task IDs, the zero-cost anomalies, the names formatted five different ways — exists in production program data today. On programs that are preparing to implement AI. On programs that are already feeding this data into dashboards that leadership trusts. The question is not whether your data has these problems. The question is whether you know about them.

WHAT WE DEMONSTRATED
01
THE RAW DATA
We looked at a realistic Deltek Open Plan IMS export and saw what program data actually looks like before any quality controls are applied.
02
THE SQL QUERY
We ran a standard EVM performance query and watched it return results that were silently wrong, incomplete, and misleading — not because the query was bad, but because the data was.
03
THE PROBLEMS
We identified 10 distinct categories of data quality failure, each with a direct, specific consequence for any AI system built on top of that data.
04
THE AUDITOR
We showed how a Python script can automatically detect every one of those 10 failure categories across an entire dataset in seconds — with full documentation and an auditable score.
05
THE FIX
We demonstrated automated correction for unambiguous issues and a disciplined human-review workflow for issues that require program controls judgment.
06
THE MONITOR
We deployed a continuous watchdog that enforces data quality standards automatically with every new export, creating a persistent, defensible quality record.
For Aerospace and Defense programs operating under DCMA surveillance, every data quality failure is a potential finding. For programs under ANSI/EIA-748 EVM compliance requirements, data integrity is not optional — it is contractually mandated. And for programs pursuing AI-enabled forecasting, anomaly detection, or decision support, data quality is not a prerequisite that can be deferred — it is the foundation on which the entire investment rests. A semantic data model — a formally defined, enforced set of rules about what your data means, how it must be structured, and what values are valid — is the infrastructure that makes AI pay off.
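A semantic data model becomes enforceable the moment it is written as code rather than prose. One lightweight sketch, a declarative rule table the auditor can apply (the fields, patterns, and valid-code set here are invented for illustration):

```python
import re

# Field-level rules; a real program's data dictionary is far larger.
RULES = {
    "WBS_CODE": lambda v: bool(re.fullmatch(r"\d+(\.\d+)*", v or "")),
    "OBS_CODE": lambda v: v in {"ENG-GNC", "ENG-SW", "MFG-INT"},
    "PCT_COMPLETE": lambda v: isinstance(v, (int, float)) and 0 <= v <= 100,
}

def validate(row: dict) -> list:
    """Return the fields of a row that violate the semantic model."""
    return [field for field, rule in RULES.items() if not rule(row.get(field))]

good = {"WBS_CODE": "1.2.4.1", "OBS_CODE": "ENG-GNC", "PCT_COMPLETE": 85}
bad  = {"WBS_CODE": "1-2-4-1", "OBS_CODE": "ENG-GCN", "PCT_COMPLETE": 120}
print(validate(good))  # []
print(validate(bad))   # ['WBS_CODE', 'OBS_CODE', 'PCT_COMPLETE']
```

Because the rules live in one table, the same definitions can drive entry validation at the source system, the audit score, and the AI pipeline gate.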
RECOMMENDED PATH FORWARD
PHASE 1
BASELINE
Conduct a full AI Readiness Audit of your current Open Plan environment. Establish your baseline Data Health Score. Identify your highest-severity issues.
PHASE 2
DEFINE
Define and document your program’s semantic data model: a data dictionary, field-level validation rules, OBS/WBS reference tables, and naming conventions. Enforce these rules at the source system level.
PHASE 3
AUTOMATE
Deploy the AI Readiness Auditor as a scheduled or event-triggered task in your environment. Integrate it with your export workflow. Establish alerting to the program controls team.
PHASE 4
ENABLE
With a trusted, validated, monitored data pipeline in place, enable AI-powered forecasting, anomaly detection, and decision support — built on a foundation that can withstand scrutiny.
QUESTIONS FOR YOUR PROGRAM CONTROLS TEAM
  1. “If we ran an AI readiness audit on our current Open Plan export today, what score would we get? Does anyone know?”
  2. “Do we have a documented data dictionary for our IMS? Are WBS and OBS codes validated at entry, or entered as free text?”
  3. “When our financial system and our scheduling system don’t agree, which one wins — and does the other get corrected automatically or manually?”
  4. “If we started feeding our IMS data to an AI forecasting model tomorrow, what’s the process for ensuring that model isn’t trained on stale, corrected, or duplicate data?”
  5. “What does our audit trail look like for data quality events? Could we produce it for a DCMA review?”
Rob Hale
Program Finance Systems Analyst
github.com/rwhale/scrub-and-scale