Model Pipeline
Figure 1. Pipeline of the components of model (1) training, (2) evaluation and (3) deployment for typical LLMs.
There are several components involved in the (1) training, (2) evaluation and (3) deployment pipeline to obtain a Large Language Model (LLM). Model developers decide whether to make each component of those pipelines private or public, with varying levels of restrictions for the latter. These are summarized in Figure 1, and detailed below.
Model training processes can be grouped into three distinct stages: pre-training, where a model is exposed to large-scale datasets composed of trillions of tokens of data, with the goal of developing fundamental skills and broad knowledge; supervised fine-tuning (SFT), which corrects for data quality issues in pre-training datasets using a smaller amount of high-quality data; and alignment, focusing on creating application-specific versions of the model by considering human preferences. Once trained, models are usually evaluated on openly available evaluation datasets (e.g., MMLU by Hendrycks et al., 2020) as well as curated benchmarks (e.g., HELM by Liang et al., 2022). Some models are also evaluated on utility-oriented proprietary datasets held internally by developers, potentially by holding out some of the SFT/alignment data from the training process (Touvron et al., 2023a). On top of utility-based benchmarking, developers sometimes create safety evaluation mechanisms to proactively stress-test the outputs of the model (e.g., red teaming via adversarial prompts). Finally, at the deployment stage, content can be generated by running the inference code with the associated model weights.
Classifying Openness
Figure 2. Categorization of the levels of openness of the code and data of each model component.
To categorize the openness of each component, we introduce the scale presented in Figure 2. At the coarsest level, a fully closed component is not publicly accessible in any form. In contrast, a semi-open component is publicly accessible but with certain limitations on access or use, or it is available in a restricted manner, such as through an Application Programming Interface (API). Finally, a fully open component is available to the public without any restrictions on its use.
Further, the semi-open category comprises three subcategories, delineating varied openness levels (see Figure 2). Distinctions are made between Code (C1-C5) and Data (D1-D5) components, where C5/D5 represents unrestricted availability and C1/D1 denotes complete unavailability. The classification of a semi-open component relies on the license of the publicly available code/data.
To evaluate the licenses, we introduce a point-based system in which each license gets 1 point (up to a maximum of 5) for each of the following permissions it grants:
- can use a component for research purposes (Research)
- can use a component for any commercial purposes (Commercial Purposes)
- can modify a component as desired (with notice) (Modify as Desired)
- can copyright derivative work (Copyright Derivative Work)
- publicly shared derivative work can use another license (Other license derivative work)
The total number of points is indicative of a license's restrictiveness. A Highly restrictive license scores 0-1 points, aligning with openness levels of code C2 and data D2, imposing significant limitations. A Moderately restrictive license, scoring 2-3 points (code C3 and data D3), allows more flexibility but with some limitations. Licenses scoring 4 points are Slightly restrictive (code C4 and data D4), offering broader usage rights with minimal restrictions. Finally, a Restriction free license scores 5 points, indicating the highest level of openness (code C5 and data D5), permitting all forms of use, modification, and distribution without constraints.
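
The point-based scoring and its mapping to openness levels can be illustrated with a short Python sketch. This is only one possible encoding, not an implementation from the paper: the names `LicenseTerms`, `license_points`, `restrictiveness` and `openness_level` are hypothetical, and the sketch assumes that code and data share the same level number at each tier (so that 0-1 points corresponds to C2/D2). Fully closed components (C1/D1) are not publicly available and therefore carry no license to score, so they are left out.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class LicenseTerms:
    """One flag per permission in the five-item list (field names are illustrative)."""
    research: bool                  # Research
    commercial: bool                # Commercial Purposes
    modify: bool                    # Modify as Desired
    copyright_derivative: bool      # Copyright Derivative Work
    other_license_derivative: bool  # Other license derivative work


def license_points(terms: LicenseTerms) -> int:
    """Award 1 point per granted permission, for a maximum of 5."""
    return sum(astuple(terms))


def restrictiveness(points: int) -> str:
    """Map a point total to the restrictiveness label used in the text."""
    if not 0 <= points <= 5:
        raise ValueError("points must be between 0 and 5")
    if points <= 1:
        return "Highly restrictive"
    if points <= 3:
        return "Moderately restrictive"
    if points == 4:
        return "Slightly restrictive"
    return "Restriction free"


# Assumed mapping: the level number is the same for code (C) and data (D).
_LEVEL = {
    "Highly restrictive": 2,
    "Moderately restrictive": 3,
    "Slightly restrictive": 4,
    "Restriction free": 5,
}


def openness_level(terms: LicenseTerms, kind: str) -> str:
    """Return e.g. 'C3' or 'D3' for a licensed (publicly available) component."""
    prefix = {"code": "C", "data": "D"}[kind]
    return f"{prefix}{_LEVEL[restrictiveness(license_points(terms))]}"


# A made-up license that permits research use and modification only.
example = LicenseTerms(
    research=True,
    commercial=False,
    modify=True,
    copyright_derivative=False,
    other_license_derivative=False,
)
print(license_points(example), openness_level(example, "code"))
```

Under these assumptions, the made-up license above earns 2 points and is therefore Moderately restrictive, which places a code component released under it at C3.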