Acquiring Machine-Readable Data

Published Nov. 29, 2022
By Major Andrew Bowne and Captain Ryan Holte

Acquiring Machine-Readable Data for an AI-Ready Department of the Air Force

This article presents contracting and program management best practices on how to negotiate for the delivery of and rights to AI-Ready data, including sample clauses that can be used in all contracts and agreements.

Artificial Intelligence

Artificial Intelligence - Foundational Concepts For An AI Workflow (24:30). MIT Lincoln Laboratory provides an engaging look into the necessary building blocks for incorporating Artificial Intelligence and Machine Learning capabilities into operational workflows. Video may be blocked on some machines and can be accessed at: youtube.com/watch?v=RaE33j9IkN4.

Though often invisible to the human eye, artificially intelligent systems are ubiquitous in our daily lives. Artificial intelligence (AI) augments tasks ranging from the trivial, such as crunching numbers on a calculator, to the previously insurmountable, such as analyzing massive molecular data sets to create one of the most capable antibiotics in the world.[1] As a powerful technology enabler, AI is critical to national security. The Department of Defense (DoD) AI Strategy defines AI as “the ability to perform tasks that normally require human intelligence.”[2] Yet, despite the lofty biological comparisons, these intelligent systems are beholden to logical principles. Those principles vary little from commercial to DoD use cases. However, the DoD acquisition system, created initially for hardware systems, inherently creates challenges in acquiring, developing, and sustaining AI technologies.[3]

To effectively prepare for and leverage AI technologies, the DoD acquisition community and stakeholders must understand the technology and implement new data acquisition, intellectual property (IP), and contract management policies and best practices. An AI-Ready force is only possible through education that builds a foundation for technological fluency. This article provides a background on the state of the art in machine learning (ML) and introduces the elements of AI-Ready data. It then presents contracting and program management best practices on how to negotiate for the delivery of and rights to AI-Ready data, including sample clauses that can be used in all contracts and agreements. This knowledge is especially critical for program managers, contracting and agreements officers, and contract attorneys who will have to collaborate on bespoke clauses and understand the regulatory and statutory limits of negotiating for data delivery and the necessary data rights required.

Artificial Intelligence and Data Foundations

Artificial intelligence theory comprises three types of AI: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI), and Artificial Super Intelligence (ASI).[4] ANI, often called weak AI, refers to non-sentient AI that outperforms human decision-making in a particular use case and environment.[5] In situations where the input from the use case or environment changes relative to the training input (e.g., training in a snowy environment and operating in a desert), the AI is likely to fail.[6] AGI, called strong AI, allows one algorithm to apply human-like intelligence to disparate use cases.[7] To achieve ASI, the machine must attain intelligence greater than a human's.[8]

Allusions from science fiction notwithstanding, all current artificially intelligent technology is ANI or composed of many different ANI algorithms to give the appearance of AGI.[9] Because ANI is built using data that is created and aggregated by humans from a specific environment, the algorithm can be brittle and susceptible to bias.[10] The brittleness and susceptibility to bias are a manifestation of the quality of its training data.[11] In other words, AI cannot comprehend what it has never been taught.[12] If a model is trained with unrepresentative or inaccurate data, it will likely misunderstand what could appear unequivocally clear to its human counterpart, an error that could lead to incorrect and potentially unethical predictions.[13] This susceptibility to bias underscores the importance of good quality data sets.[14]

Training Quality and Machine-Readable Data

The DoD AI Strategy identifies high-impact focus areas that, if pursued, will accelerate AI proliferation across the Department.[15] Specifically, the DoD calls for delivery of AI-enabled capabilities that address key missions and for leadership in military ethics and AI safety.[16] However, if the DoD is to invest in and develop AI technology, it must also invest heavily in robust, diverse, and relevant data, commonly referred to as Training Quality Data (TQD).[17]

TQD is required for the successful application of an algorithm.[18] Machine learning, a subset of AI, builds statistical models based on data it observes and uses the model both as a hypothesis and as software that can solve problems.[19] This model continuously and iteratively trains itself using the available data to refine the algorithm that will ultimately produce a concise and environment-specific model.[20] If successful, ML can predict an event with accuracy equal to or greater than that of humans. However, without properly formatted and conditioned data, the model will fail to achieve the intended objectives.[21]
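The iterative refinement described above can be illustrated with a minimal sketch, assuming a one-variable linear model fitted by gradient descent. The function name fit_line, the learning rate, the number of passes, and the toy observations are all illustrative assumptions and are not drawn from any DoD system or from the sources cited here.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w*x + b by repeatedly nudging w and b to reduce squared error."""
    w, b = 0.0, 0.0  # start with an uninformed model
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Each pass over the data refines the model slightly.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy observations generated from y = 2x + 1 (hypothetical data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
```

With clean, consistently formatted observations the fitted values approach the underlying relationship; the same loop fed mislabeled or malformed rows would converge on the wrong model, which is the practical reason data quality matters.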

Most AI algorithms, such as deep neural networks,[22] use matrix mathematics to perform their computations.[23] As such, certain data formats are inherently preferred. In general, preparing systems to be “AI-Ready” involves collecting robust and diverse raw data and then parsing the data for subsequent ingest, scan, query, and analysis.[24] If the data is not AI-Ready, data conditioning can be the most onerous portion of the development process, taking nearly 80 percent of the total development timeline.[25] Fortunately, simple techniques applied during initial data collection and parsing can radically lessen the time required. A best practice is to ensure that the data is in an industry-standard machine-readable file format.[26] Machine-readable data is a computer’s natural language, which minimizes the work required to produce the data needed for a model. Formats such as .csv (comma-separated values) and .tsv (tab-separated values) are examples of machine-readable formats and are easily ingested by the algorithm.[27]
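To show why delimited text maps so directly onto the matrices these algorithms consume, the following is a minimal sketch using only the Python standard library. The helper name load_numeric_table and the column names altitude_m, speed_mps, and label are hypothetical, and the sketch assumes every data cell is numeric.

```python
import csv
import io


def load_numeric_table(text, delimiter=","):
    """Parse delimited text into (column_names, matrix of floats)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # first row holds the column names
    # Every remaining non-empty row becomes one row of the matrix.
    matrix = [[float(cell) for cell in row] for row in reader if row]
    return header, matrix


# Hypothetical sample data; the same content in .csv and .tsv form.
csv_text = "altitude_m,speed_mps,label\n1200,210.5,1\n900,180.0,0\n"
tsv_text = csv_text.replace(",", "\t")

columns, matrix = load_numeric_table(csv_text)
tsv_columns, tsv_matrix = load_numeric_table(tsv_text, delimiter="\t")
```

No vendor-specific reader or manual re-keying is needed: a few lines turn either file into rows and columns ready for matrix operations, whereas the same values locked in a scanned report or proprietary binary would first have to be extracted and conditioned.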

Preparing and processing collected data in an AI-Ready format by using best practices such as the above can accelerate the creation of training quality data that can be used to train algorithms and ultimately meet DoD AI Strategy requirements.

Intellectual Property and Data Rights

The DoD created the DoD Data Strategy to unleash data and ultimately advance the overall National Defense Strategy (NDS).[28] The DoD Data Strategy conveys foundational principles that, if put into action, leverage data to enable ethical AI and ML development and proliferation to meet NDS and DoD AI Strategy requirements.[29] To truly capture and safely employ the DoD Data and AI strategies, the DoD must reassess how it views intellectual property (IP) and data rights. According to the DoD Instruction (DoDI) 5010.44 IP Acquisition and Licensing guidance, weapon and information systems that the DoD acquires in support of the warfighter will become increasingly dependent on technology, such as AI, and data for all stages of a system’s lifecycle.[30]

Acquiring the appropriate license rights is vital in ensuring that all systems remain functional, sustainable, upgradable, and affordable as the DoD becomes increasingly reliant on IP-based and data-centric technology.[31] However, in addition to obtaining rights to the data, the government must seek fair treatment of all IP owners and create conditions that are conducive to contracting for technologically advanced solutions.[32] Balancing government and industry interests can be difficult, but early, consistent, and effective communication can facilitate clear expectations for all parties throughout negotiation and performance.[33]

Data rights are the license rights in technical data or computer software provided to the government incident to a contract or agreement.[34] DFARS Part 227 outlines rights in technical data and computer software.[35] Basic rights under the DFARS are license rights predicated on whether the technical data or computer software was developed with Government funds, produced under the contract when specified as an element of performance, or created with Government funds in the performance of a contract.[36]

Data rights as contemplated by DFARS apply to defined categories that may exclude important data described in this article. For example, the Court of Federal Claims granted summary judgment against the Government when it asserted it had rights over vendor lists, certainly data of a type that could be relevant to analysis and prediction via machine learning.[37] The Court held that technical data, as used in the DFARS, does not include everything a contractor provides the Government under a contract.[38] Rather, the term means “recorded information … of a scientific or technical nature.”[39] Thus, while the DFARS carves out license rights for data related to the design of an item or process, how it was manufactured or its physical and functional requirements,[40] it does not provide rights to datasets, nor the format or quality of such data. Accordingly, to enable a sufficient AI/ML pipeline for data consolidation and data conditioning, the government must consider whether specially negotiated data rights terms and conditions are necessary. The contracting team must think through the entire data and AI/ML lifecycle prior to contract award to ensure that the project’s lifecycle will have sufficient data rights. Additionally, as machine-readable data from one project may be useful as a training set for another model or play a larger role within the DAF’s data strategy, contracting should consider obtaining rights to data not strictly necessary for the project’s lifecycle.

When the DoD is acquiring AI/ML, there will be many scenarios when the government should have unlimited rights or Government Purpose Rights (GPR), or equivalent license rights. For example, if the DoD is acquiring an AI/ML tool that is trained on government owned data, then the model will inherently be produced using government assets. The government should have unlimited rights or GPR to the model via negotiated clauses adapted into the contract or agreement.

It is important to note the difference between collecting data from a system and the transmission of that data between the owner and user. For the scope of this article, the “user” is the individual program management office and the “owner” of the data is the Contractor. During the procurement process, the contracting officer or agreement officer (CO/AO)[41] must ensure that the data rights clearly indicate the extent of the license to data as it traverses each of its inherent states (in use, at rest, and in motion) throughout the lifecycle of the specific project. When a CO/AO awards a contract or agreement, understanding who owns the output data for a system is critical: data rights must be obtained in the output data, preferably unlimited rights, or the equivalent license in an Other Transaction Agreement (OTA).

The Department of the Air Force’s Data Pipeline

The Fiscal Year 2020 Industrial Capabilities Report to Congress makes evident that the DoD is the largest customer in the world.[42] The DoD distributes its funding among its 2,586 programs, which use these funds for national defense requirements; these programs are the juncture at which the DoD can enable an effective data supply chain.[43] These programs enable the DoD to own sensors of nearly every phenomenology that are gathering data from numerous environments.

The DoD does not have the manpower to independently support all the national defense requirements. Thus, the DoD executes contracts or agreements with industry to augment its capabilities to carry out its mission.[44] There are numerous types of contracts, but they are generally broken out into two categories: Federal Acquisition Regulation (FAR) based contracts and non-FAR based contracts.[45] Each type of contract and agreement has advantages and disadvantages. Regardless of the contract or agreement type, however, to implement the DoD AI and DoD Data strategies, the government must carefully assess and tailor what it requests in Contract Data Requirements Lists (CDRLs) and Data Item Descriptions (DIDs), as well as how it negotiates the license rights.

When requesting CDRLs, it is critical that the government begin requesting machine-readable data. In addition, the government must begin requiring that the data is accessible and readily usable. Although this added verbiage may seem tedious, it protects the government from receiving data that cannot be used for analysis or ML, at least not without significant cost, effort, and time.

Ensuring that data is accessible to government stakeholders is crucial for AI/ML development. While a contractor has full access to a product’s data streams, handing data over to the DoD gives the contractor an opportunity for additional profit, because the contractor could require a special access key before the data can be properly accessed. Adding data accessibility requirements into the CDRLs and DIDs ensures efficient government access to government-owned data.

Although it may seem redundant, the CO/AO should ensure that the data is readily usable. Requiring readily usable data ensures that, on top of being machine-readable, the data is free of any technical or administrative inhibitions that may affect the government’s ability to ingest the data directly into a chosen algorithm.
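One way a program office might screen a delivery for the "readily usable" quality described above is sketched below. It assumes a delimited text file that has already passed through any agreed decryption step; the function name check_readily_usable and the three specific checks are illustrative assumptions, not a standard acceptance test.

```python
import csv
import io


def check_readily_usable(raw_bytes, delimiter=","):
    """Return a list of problems that would block direct ingest (empty if none found)."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes suggest a proprietary, binary, or still-encrypted file.
        return ["file is not plain UTF-8 text"]

    # Keep only non-empty rows; the first one is treated as the header.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if len(rows) < 2:
        return ["file has no data rows beneath the header"]

    problems = []
    width = len(rows[0])
    # Row numbers count non-empty rows, with the header as row 1.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"row {row_no} has {len(row)} fields, expected {width}")
    return problems


# Hypothetical deliveries used to exercise the checks.
clean_result = check_readily_usable(b"a,b\n1,2\n3,4\n")
ragged_result = check_readily_usable(b"a,b\n1,2,3\n")
binary_result = check_readily_usable(b"\xff\xfe\x00\x01")
```

A screen of this kind catches the inhibitions at delivery, while the contractor is still obligated to correct them, rather than months later when an analyst first tries to load the file.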

These simple steps can save millions of dollars and thousands of hours that would otherwise be spent simply finding and conditioning data into a useful state. More importantly, requiring data to be collected and delivered in a machine-readable and accessible file format will give the DoD a significantly better chance when competing with peer adversaries.

While these recommendations are common sense steps that align government contracting with data collection and ML best practices, not to mention commercial contracts, they are not in wide practice in the DoD. For FAR-based contracts, requiring output data in an AI-Ready format as a deliverable may require new policy or class deviation pursuant to FAR 1.404 and DFARS 201.402. When utilizing other transaction authority, the proposed clauses in the appendix can be implemented immediately without policy change or class deviation.

Department of Defense access to machine-readable data on current contracts may be limited because of the DoD’s current variability in handling data. In some contracts, the DoD does not receive data as a deliverable or does not have adequate rights to use, modify, or disclose said data. There are two potential options for securing the same access to machine-readable, training quality output data on existing contracts. First, the government can pursue a bilateral contract modification to include AI-Ready data as a deliverable on a case-by-case basis. There is a potential cost risk associated with this approach as contractors may claim a cost increase with this request, though there is also a significant cost risk associated with not acquiring the data in an accessible, machine-readable format. Second, the government can pursue third-party contract solutions to modernize its legacy output data, such as data labelling and reformatting. This approach assumes that the program currently has access to the data and the rights to use, modify, and disclose said data for government purposes.

Adapting to a Data-Focused Contracting Strategy

Successful AI requires relevant and robust TQD. To achieve its data strategy objectives, the DoD must ensure that the proper license rights language is included in all acquisitions. The DoD has access to unprecedented TQD through the equipment and contracts supported by its acquisition system, but the government must assert the appropriate rights (i.e., Government Purpose Rights or unlimited rights or equivalent license for other transactions) to effectively use that data to develop AI.

Asserting the appropriate license rights is only part of the challenge. To enable efficient supply chain development for TQD, the DoD must require machine-readable data from all possible programs and contracts in its acquisition system (see Appendix for sample contract terms and clauses).[46] This machine-readable data will proactively enable AI development for a host of AI applications.

These technologies will continue to evolve with or without the DoD. The DoD acquisition system and its stakeholders must implement these data rights best practices and novel acquisition strategies if they wish to keep pace with commercial AI development and peer adversaries.

------APPENDIX------

Sample Terms and Clauses

The following sample terms and clauses can be used in FAR contracts or non-FAR agreements. Use of these samples in a FAR contract will require higher-level approval or class deviation; nonetheless, this proposed adaptation is consistent with the policy described in FAR 1.402. The recommended contracting language can be adopted and used in other transaction agreements without any further policy or class deviation.

The clauses below should be tailored to meet the project requirements and should be a starting point for negotiations. These clauses should be included in both solicitation and contract when it is expected that data developed during performance can become useful for future analytics, training, testing, or modeling. These clauses define the data to be collected, formatted, and delivered; the rights of the Government to data; and delivery instructions for the data.

Note: The terms in bold below are defined in the definitions section.

Data Collection and Delivery

The Performer shall collect, format, and deliver data developed under this [(Contract) (Agreement)], whether generated manually, through traditional computer software or model prediction, in accordance with the [(Contracting Officer’s) (Agreement Officer’s)] direction provided in [insert reference to task description/data delivery instructions reference]. Data collected under this [(Contract) (Agreement)] shall be delivered in a machine-readable format [JSON, .CSV, .TSV or other machine-readable file format], with data input and output formatted in tables. Data collected shall be organized in uniquely named columns. Output data shall be annotated with labels, features, and metadata included according to [insert reference to task description/data delivery instructions]. Performer shall provide data in a manner that is usable and readily accessible by the Government. No special data conditioning should be executed unless ordered by the [(Contracting Officer) (Agreement Officer)]. Data shall be protected using encryption in accordance with [insert security standard] in transit and at rest. The Government has the right to review, verify, challenge, and validate that the data meets the requirements set out in [insert reference to task description/data delivery instructions].

Data shall be delivered according to [insert reference to task description/data delivery instructions] or within [insert number of days] days of an order by the [(Contracting Officer) (Agreement Officer)]. Data shall be securely delivered on an encrypted delivery file (JSON, .XML, .RDF, .XLS, .CSV, or .TSV) [(via API) (as directed by the Delivery Schedule)].

To facilitate any potential deliveries, the Performer agrees to retain and maintain in good condition and in accordance with [insert reference to DATA REPOSITORY clause] all data generated under this [(Contract) (Agreement)] until [three (3) or insert number of years] years after completion or termination of this [(Contract) (Agreement)], or when delivery of such data is requested by the Government, whichever is sooner.
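For readers implementing the Government's right to review and validate deliveries under the sample language above, the following minimal sketch checks two of its formatting terms: uniquely named columns and the presence of required annotation columns. The function name check_delivery_columns and the column names sensor_id, timestamp, reading, and label are hypothetical; the actual required columns would come from the task description or data delivery instructions.

```python
import csv
import io
from collections import Counter


def check_delivery_columns(text, required_columns, delimiter=","):
    """Return findings about the header row of a delimited delivery (empty if none)."""
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter), [])
    names = [name.strip() for name in header]
    findings = []

    # The sample clause calls for uniquely named columns.
    duplicates = sorted(name for name, count in Counter(names).items() if count > 1)
    if duplicates:
        findings.append("duplicate column names: " + ", ".join(duplicates))
    if any(not name for name in names):
        findings.append("at least one column has no name")

    # Labels and metadata named in the delivery instructions must be present.
    missing = sorted(set(required_columns) - set(names))
    if missing:
        findings.append("missing required columns: " + ", ".join(missing))
    return findings


# Hypothetical deliveries; required columns are assumed for illustration.
good = "sensor_id,timestamp,reading,label\n7,2022-01-01T00:00:00Z,0.4,clear\n"
bad = "reading,reading,timestamp\n0.4,0.5,2022-01-01T00:00:00Z\n"

good_findings = check_delivery_columns(good, ["label", "timestamp"])
bad_findings = check_delivery_columns(bad, ["label", "timestamp"])
```

Findings of this kind give the CO/AO a concrete, objective basis for challenging a nonconforming delivery.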

Data Rights

With respect to data developed or generated under this [(Contract) (Agreement)] pursuant to [insert reference to task description/data delivery instructions], the Government shall receive [(Unlimited Rights) (Government Purpose Rights) (other negotiated license)], as defined in Article [insert reference to “DEFINITIONS” article]. [If multiple licenses to data exist in the contract or agreement, add the following.] With respect to data delivered pursuant to [insert reference to task description/data delivery instructions] under the [(Contract) (Agreement)], the Government shall receive Unlimited Rights. Notwithstanding the provision in [insert reference to provision providing less than Unlimited Rights in data], the performer agrees, with respect to data generated or developed under this [(Contract) (Agreement)], the Government may, within [three (3), or insert number of years] years after completion or termination of this [(Contract) (Agreement)], require delivery of data and receive Unlimited Rights.

Government will own the Output. Except for the licenses expressly granted in this [(Contract) (Agreement)], this [(Contract) (Agreement)] does not grant any rights and Government owns and reserves all right, title, and interest in and to Government Materials and Output. Government grants Performer a worldwide, non-exclusive license [(a)] to use, reproduce, modify, and create derivative works based on Government Materials in order to provide and support the services and provide the Output to Government [and (b) to use, reproduce, modify, and create derivative works based upon Government Materials and Output to analyze and improve Performer’s products and services].

Definitions

Data: Recorded information, regardless of form, the media on which it is recorded, or the method of recording.
Generated: The data output resulting from a recording of processed input data as required by the [(Contract) (Agreement)], such as, but not limited to, manual recording of observable phenomena, output from traditional computer programming, or model predictions from a machine learning algorithm.
Government Materials: The digital files, data, and machine learning models that Government submits to the Performer API or otherwise provides to Performer to facilitate Performer’s provision of the work ordered.
Government Purpose Rights: [Tailor DFARS 252.227-7013 or use the following definition:] The rights to use, duplicate, or disclose Data, in whole or in part and in any manner, for Government purposes only, and to have or permit others to do so for Government purposes only.
Machine Learning Output: The fields returned by a Performer machine learning model as defined in the [Statement of Work/Statement of Objectives/Task Order/Delivery Order, etc.].
Machine-readable: A form readily processable by a computer and where the individual elements of the data can be easily accessed and modified without additional costs or tools beyond those described in [(Contract) (Agreement)].
Output: Annotations and labels based upon Government Materials that are returned to Government, including through the Performer API, or a CSV or TSV file, and Machine Learning Output.
Unlimited Rights: [Tailor DFARS 252.227-7013 or use the following definition:] Rights to use, duplicate, release, or disclose, Data, in whole or in part, in any manner and for any purposes whatsoever, and to have or permit others to do so.

About the Authors

Major Andrew Bowne, USAF

(B.A., Pepperdine University; J.D., the George Washington University Law School; LL.M., The Judge Advocate General’s Legal Center and School; Ph.D. candidate, University of Adelaide) is currently assigned as the Chief Legal Counsel of the Department of the Air Force-MIT Artificial Intelligence Accelerator (DAF-MIT AIA), Cambridge, Massachusetts. He is licensed to practice in the state of California.

Captain Ryan Holte, USAF

(B.S., United States Air Force Academy) is a program manager assigned to the Space Systems Command, Los Angeles Air Force Base, California.

Edited by: Mr. Robert Klauzinski
Layout by: Thomasa Huffstutler

Endnotes

[1] Henry Kissinger et al., The Age of AI, 10 (2021).

[2] Dep’t of Defense, Summary of the 2018 Department of Defense Artificial Intelligence Strategy (2018) [hereinafter, “DoD AI Strategy”], https://media.defense.gov/2019/Feb/12/2002088963/-1/-1/1/Summary-of-DoD-AI-Strategy.pdf.

[3] See Defense Innovation Board, Software is Never Done, 1 (2019); Babak Siavoshy, The DoD Should Pilot a New Category of Software Data Rights, Anduril Blog (Mar. 2, 2022), https://blog.anduril.com/the-dod-should-pilot-a-new-category-of-software-data-rights-a949cc9aaae4.

[4] See Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach, 32–33 (4th ed., 2021).

[5] Id.

[6] Kissinger et al., supra note 1, at 81.

[7] Russell & Norvig, supra note 4, at 33.

[8] See Andreas Kaplan & Michael Haenlein, Siri, Siri, in My Hand: Who’s the Fairest in the Land? On the Interpretations, Illustrations, and Implications of Artificial Intelligence, 62 Bus. Horizons 15, 16 (2019).

[9] See id.

[10] Nicol Turner Lee et al., Algorithmic Bias Detection and Mitigation: Best Practices and Policies to Reduce Consumer Harms, Brookings (May 22, 2019), https://www.brookings.edu/research/algorithmic-bias-detection-and-mitigation-best-practices-and-policies-to-reduce-consumer-harms/; Steve Nouri, The Role Of Bias In Artificial Intelligence, Forbes (Feb. 4, 2021), https://www.forbes.com/sites/forbestechcouncil/2021/02/04/the-role-of-bias-in-artificial-intelligence/?sh=750faf3d579d.

[11] See James Manyika et al., What Do We Do About the Biases in AI?, Harv. Bus. Rev. (Oct. 25, 2019), https://hbr.org/2019/10/what-do-we-do-about-the-biases-in-ai.

[12] See Kissinger et al., supra note 1, at 79.

[13] Manyika et al., supra note 11.

[14] See id.

[15] DoD AI Strategy, supra note 2, at 11.

[16] DoD AI Strategy, supra note 2.

[17] See Michael Kanaan, T-Minus AI, 124 (2020).

[18] Vijay Gadepally & Jeremy Kepner, Simple Data Architecture Best Practices for AI Readiness, https://vijayg.mit.edu/sites/default/files/documents/DataforAIReadiness.pdf (last visited Sept. 27, 2022).

[19] Russell & Norvig, supra note 4, at 651.

[20] See Vijay Gadepally et al., AI Enabling Technologies: A Survey, https://arxiv.org/pdf/1905.03592.pdf (2019) [hereinafter “AI Enabling Technologies”].

[21] Id.

[22] Id.

[23] Jeremy Kepner & Hayden Jananthan, Mathematics of Big Data: Spreadsheets, Databases, Matrices, and Graphs (2018).

[24] Gadepally & Kepner, supra note 18.

[25] AI Enabling Technologies, supra note 20.

[26] Open Data Handbook, Machine Readable, https://opendatahandbook.org/glossary/en/terms/machine-readable (last visited Sept. 27, 2022).

[27] Gadepally & Kepner, supra note 18.

[28] Dep’t of Defense, Executive Summary: DoD Data Strategy (Sept. 20, 2020), https://media.defense.gov/2020/Oct/08/2002514180/-1/-1/0/DoD-Data-Strategy.pdf.

[29] Id.

[30] Dep’t of Defense Instruction 5010.44, Intellectual Property Acquisition and Licensing (Oct. 16, 2019), https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodi/501044p.pdf.

[31] Id.

[32] Andrew Bowne, Making the Pentagon an Even More Attractive Customer for AI Upstarts, Contract Management (Feb. 2021), https://ncmahq.org/Web/Shared_ContentCM-Magazine/INNOVATIONS--Making-the-Pentagon-an-Even-More-Attractive-Customer-for-AI-Upstarts.aspx.

[33] Id.

[34] Andrew Bowne & Benjamin McMartin, Implementing Responsible AI: Proposed Framework for Data Licensing, Geo. Mason U. Sch. of Bus., White Paper Series No. 10, 4 (Apr. 29, 2022), https://www.gmu.edu/news/2022-04/no-10-implementing-responsible-ai-proposed-framework-data-licensing.

[35] Defense Federal Acquisition Regulation Supplement 227, 48 C.F.R. pt. 227 (2022) [hereinafter “DFARS”].

[36] DFARS 252.227-7013(b)(1); DFARS 252.227-7014(b)(1); 10 U.S.C. §§ 3771–3775.

[37] Raytheon Co. v. United States, No. 19-883C, slip op. at 2 (Fed. Cl. June 15, 2022), reissued for publication (June 30, 2022).

[38] Id. at 13.

[39] Id. at 3.

[40] See id. at 19.

[41] Contracting Officer (CO) refers to the government official responsible for the award and administration of a procurement contract governed by the Federal Acquisition Regulation (FAR). Agreement Officer (AO) is the counterpart for non-FAR agreements such as other transactions governed by 10 U.S.C. §§ 4021–4023.

[42] Dep’t of Defense, Fiscal Year 2020 Industrial Capabilities Report to Congress (Jan. 2021), https://media.defense.gov/2021/Jan/14/2002565311/-1/-1/0/FY20-Industrial-Capabilities-Report.pdf.

[43] Dep’t of Defense, Program Acquisition Cost by Weapon System (Feb. 2020), https://comptroller.defense.gov/Portals/45/Documents/defbudget/fy2021/fy2021_Weapons.pdf.

[44] Heidi M. Peters, Cong. Rsch. Serv., IF10600, Defense Primer: Department of Defense Contractors (2021), https://sgp.fas.org/crs/natsec/IF10600.pdf.

[45] See Defense Acquisition U., Contracting Cone, https://aaf.dau.edu/aaf/contracting-cone/ (last visited Sept. 27, 2022).

[46] Open Data Handbook, supra note 26.