Model Pipeline

Openness Taxonomy
Figure 1. Pipeline of the components of model (1) training, (2) evaluation and (3) deployment for typical LLMs.

There are several components involved in the (1) training, (2) evaluation and (3) deployment pipeline to obtain a Large Language Model (LLM). Model developers decide whether to make each component of those pipelines private or public, with varying levels of restrictions for the latter. These are summarized in Figure 1 and detailed below.

Model training processes can be grouped into three distinct stages: pre-training, where a model is exposed to large-scale datasets composed of trillions of tokens of data, with the goal of developing fundamental skills and broad knowledge; supervised fine-tuning (SFT), which corrects for data quality issues in pre-training datasets using a smaller amount of high-quality data; and alignment, focusing on creating application-specific versions of the model by considering human preferences. Once trained, models are usually evaluated on openly available evaluation datasets (e.g., MMLU by Hendrycks et al., 2020) as well as curated benchmarks (e.g., HELM by Liang et al., 2022). Some models are also evaluated on utility-oriented proprietary datasets held internally by developers, potentially by holding out some of the SFT/alignment data from the training process (Touvron et al., 2023a). On top of utility-based benchmarking, developers sometimes create safety evaluation mechanisms to proactively stress-test the outputs of the model (e.g., red teaming via adversarial prompts). Finally, at the deployment stage, content can be generated by running the inference code with the associated model weights.

Classifying Openness

Openness Scale
Figure 2. Categorization of the levels of openness of the code and data of each model component.

To categorize the openness of each component, we introduce the scale presented in Figure 2. At the coarsest level, a fully closed component is not publicly accessible in any form. In contrast, a semi-open component is publicly accessible but with certain limitations on access or use, or it is available in a restricted manner, such as through an Application Programming Interface (API). Finally, a fully open component is available to the public without any restrictions on its use.

Further, the semi-open category comprises three subcategories, delineating varied openness levels (see Figure 2). Distinctions are made between Code (C1-C5) and Data (D1-D5) components, where C5/D5 represents unrestricted availability and C1/D1 denotes complete unavailability. Semi-open components are classified according to the license of the publicly available code/data.
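As a rough illustration of how the level labels relate to the three coarse categories, the following Python sketch maps a label such as C3 or D5 to "fully closed", "semi-open" or "fully open". The function name `coarse_category` and the string-based label format are assumptions made for this example and are not part of the original scale definition.

```python
def coarse_category(level: str) -> str:
    """Return the coarse openness category for a label such as 'C3' or 'D5'.

    Hypothetical helper: 'C' marks a code component and 'D' a data component;
    the digit is the position on the 1-5 scale.
    """
    if len(level) < 2:
        raise ValueError(f"invalid level label: {level!r}")
    kind = level[0].upper()
    rank = int(level[1:])  # raises ValueError for non-numeric input
    if kind not in ("C", "D") or not 1 <= rank <= 5:
        raise ValueError(f"invalid level label: {level!r}")
    if rank == 1:
        return "fully closed"   # C1/D1: complete unavailability
    if rank == 5:
        return "fully open"     # C5/D5: unrestricted availability
    return "semi-open"          # C2-C4 / D2-D4: the three subcategories
```

For instance, `coarse_category("D3")` falls in the semi-open band, while `coarse_category("C1")` is fully closed.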

To evaluate the licenses we introduce a point-based system where each license gets 1 point (for a total maximum of 5) for allowing each of the following:

The total number of points is indicative of a license's restrictiveness. A Highly restrictive license scores 0-1 points, aligning with openness levels of code C2 and data D2, imposing significant limitations. A Moderately restrictive license, scoring 2-3 points (code C3 and data D3), allows more flexibility but with some limitations. Licenses scoring 4 points are Slightly restrictive (code C4 and data D4), offering broader usage rights with minimal restrictions. Finally, a Restriction free license scores 5 points, indicating the highest level of openness (code C5 and data D5), permitting all forms of use, modification, and distribution without constraints.
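The mapping from points to restrictiveness tier can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `classify_license` is hypothetical, the score is taken as an already-computed integer, and the Highly restrictive tier is assumed to correspond to the second level of both scales (C2 and D2).

```python
def classify_license(points: int) -> tuple:
    """Map a license score (0-5) to (restrictiveness, code level, data level)."""
    if not 0 <= points <= 5:
        raise ValueError("license score must be between 0 and 5")
    if points <= 1:
        label, level = "Highly restrictive", 2      # assumed C2 / D2
    elif points <= 3:
        label, level = "Moderately restrictive", 3  # C3 / D3
    elif points == 4:
        label, level = "Slightly restrictive", 4    # C4 / D4
    else:
        label, level = "Restriction free", 5        # C5 / D5
    return label, f"C{level}", f"D{level}"


# Example: a license that earns 3 of the 5 possible points.
print(classify_license(3))
```

Because the tiers are defined purely by score ranges, a license earning 2 points and one earning 3 points land in the same tier even if they allow different things.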

Taxonomy

Below is a table with the classification of the components of the model pipeline according to the openness scale presented in Figure 2.