Risks and Opportunities of Open-Source Generative AI

Model Pipeline

Figure 1. Pipeline of the components of model (1) training, (2) evaluation and (3) deployment for typical LLMs.

Obtaining a Large Language Model (LLM) involves several components across the (1) training, (2) evaluation and (3) deployment pipelines. Model developers decide whether to make each component of those pipelines private or public, with varying levels of restriction for the latter. These components are summarized in Figure 1 and detailed below.

Model training processes can be grouped into three distinct stages: pre-training, where a model is exposed to large-scale datasets composed of trillions of tokens of data, with the goal of developing fundamental skills and broad knowledge; supervised fine-tuning (SFT), which corrects for data quality issues in pre-training datasets using a smaller amount of high-quality data; and alignment, focusing on creating application-specific versions of the model by considering human preferences.

Once trained, models are usually evaluated on openly available evaluation datasets (e.g., MMLU by Hendrycks et al., 2020) as well as curated benchmarks (e.g., HELM by Liang et al., 2022). Some models are also evaluated on utility-oriented proprietary datasets held internally by developers, potentially by holding out some of the SFT/alignment data from the training process (Touvron et al., 2023a). On top of utility-based benchmarking, developers sometimes create safety evaluation mechanisms to proactively stress-test the outputs of the model (e.g., red teaming via adversarial prompts). Finally, at the deployment stage, content can be generated by running the inference code with the associated model weights.

Classifying Openness

Figure 2. Categorization of the levels of openness of the code and data of each model component.

To categorize the openness of each component, we introduce the scale presented in Figure 2. At the highest level, a fully closed component is not publicly accessible in any form. In contrast, a semi-open component is publicly accessible but with certain limitations on access or use, or it is available in a restricted manner, such as through an Application Programming Interface (API). Finally, a fully open component is available to the public without any restrictions on its use.

Further, the semi-open category comprises three subcategories, delineating varied openness levels (see Figure 2). Distinctions are made between Code (C1-C5) and Data (D1-D5) components, where C5/D5 represents unrestricted availability and C1/D1 denotes complete unavailability. For semi-open components, their classification relies on the license of the publicly available code/data.

To evaluate the licenses we introduce a point-based system where each license gets 1 point (for a total maximum of 5) for allowing each of the following:

can use a component for research purposes (Research)
can use a component for any commercial purposes (Commercial Purposes)
can modify a component as desired (with notice) (Modify as Desired)
can copyright derivative work (Copyright Derivative Work)
publicly shared derivative work can use another license (Other license derivative work)

The total number of points is indicative of a license's restrictiveness. A Highly restrictive license scores 0-1 points, aligning with openness levels of code C2 and data D2, imposing significant limitations. A Moderately restrictive license, scoring 2-3 points (code C3 and data D3), allows more flexibility but with some limitations. Licenses scoring 4 points are Slightly restrictive (code C4 and data D4), offering broader usage rights with minimal restrictions. Finally, a Restriction free license scores 5 points, indicating the highest level of openness (code C5 and data D5), permitting all forms of use, modification, and distribution without constraints.
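The point-based scheme above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' tooling: the class and function names are hypothetical, the permission sets in the usage lines are invented examples rather than assessments of any real license, and the mapping assumes that a Highly restrictive license corresponds to level 2 for both code and data, in line with the C1-C5 / D1-D5 scale. Level 1 (fully closed) is not produced here because the scoring applies only to components that are publicly available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicensePermissions:
    """The five permissions that each earn one point (hypothetical container)."""
    research: bool                  # can use for research purposes
    commercial: bool                # can use for any commercial purposes
    modify: bool                    # can modify as desired (with notice)
    copyright_derivative: bool      # can copyright derivative work
    other_license_derivative: bool  # shared derivatives may use another license


def license_score(p: LicensePermissions) -> int:
    """One point per granted permission, for a maximum of 5."""
    return sum([p.research, p.commercial, p.modify,
                p.copyright_derivative, p.other_license_derivative])


def restrictiveness(score: int) -> str:
    """Map a 0-5 score to the restrictiveness label used in the text."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    if score <= 1:
        return "Highly restrictive"
    if score <= 3:
        return "Moderately restrictive"
    if score == 4:
        return "Slightly restrictive"
    return "Restriction free"


# Assumed mapping from label to openness level for public components.
LEVEL = {
    "Highly restrictive": 2,
    "Moderately restrictive": 3,
    "Slightly restrictive": 4,
    "Restriction free": 5,
}


def openness_level(p: LicensePermissions, kind: str = "C") -> str:
    """Return e.g. 'C3' for code or 'D3' for data."""
    if kind not in ("C", "D"):
        raise ValueError("kind must be 'C' (code) or 'D' (data)")
    return f"{kind}{LEVEL[restrictiveness(license_score(p))]}"


# Invented permission sets, for illustration only.
research_only = LicensePermissions(True, False, False, False, False)
unrestricted = LicensePermissions(True, True, True, True, True)
print(openness_level(research_only, "D"))  # D2
print(openness_level(unrestricted, "C"))   # C5
```

Keeping the score and the label as separate steps mirrors the text: the score measures what a license allows, while the label and level are a coarser reading of that score.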

Taxonomy

Below is a table with the classification of the components of the model pipeline according to the openness scale presented in Figure 2.

Model	Release	Training Code: Pre-Training	Training Code: Fine-tuning	Training Code: Alignment	Training Data: Pre-Training	Training Data: Supervised Fine-tuning	Training Data: Alignment	Evaluation Code: General Evaluation	Evaluation Code: Automatic Safety Evaluation	Evaluation Data: Utility Internal Benchmarks	Evaluation Data: Safety Evaluation Datasets	Deployment Code: Inference	Deployment Data: Model Architecture and Weights
Anthropic 175B	2-2023	C1	C1	C1	D1	D1	D1	C1	N/A	D1	D1	C1	D1
Anthropic LM	12-2021	C1	C1	N/A	D1	N/A	N/A	C1	N/A	N/A	N/A	C1	D1
BLOOM	5-2022	C5 (Apache 2.0)	Unknown	N/A	Unknown (D3 or D4)	D5 (Apache 2.0)	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D4 (RAIL)
Cerebras-GPT	3-2023	C5 (Apache 2.0)	N/A	N/A	D5 (MIT)	N/A	N/A	C5 (Publicly available)	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
Chinchilla	3-2022	C1	C1	N/A	D1	N/A	N/A	C1	N/A	N/A	N/A	C1	D1
Claude-2	7-2023	C1	C1	C1	D1	D1	D1	C1	C1	D1	D1	C1	D2
Claude	3-2023	C1	C1	C1	D1	D1	D1	C1	N/A	D1	D1	C1	D1
CodeT5	9-2021	C5 (BSD-3)	C5 (BSD-3)	N/A	D4 (CodeT5)	N/A	N/A	C5 (BSD-3)	N/A	N/A	N/A	C5 (BSD-3)	D5 (Apache 2.0)
Command R+	3-2024	C1	C1	C1	D1	D1	D1	Unknown	Unknown	D1	D1	C4 (C4AI)	D4 (C4AI)
Command R	3-2024	C1	C1	C1	D1	D1	D1	Unknown	Unknown	D1	D1	C4 (C4AI)	D4 (C4AI)
DBRX	3-2024	C2	C2	C1	D1	Unknown	Unknown	N/A	C1	D1	D1	C3 (DBRX)	D3 (DBRX)
ERNIE 3.0	12-2021	C1	C1	N/A	D1	N/A	N/A	C1	N/A	N/A	N/A	C1	D1
FairSeq Dense	12-2021	C5 (MIT)	N/A	N/A	D5 (ComCrawl)	N/A	N/A	N/A	N/A	N/A	N/A	C5 (MIT)	D5 (MIT)
Falcon	9-2023	C1	C1	C1	D4 (ODC-By)	D1	D1	C1	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Falcon-180B Data)
Gemini	12-2023	C1	C1	C1	D1	D1	D1	C1	C1	D1	D1	C1	D2
GLaM	12-2021	C1	N/A	N/A	D1	N/A	N/A	C1	N/A	D5 (Public datasets)	N/A	C1	D1
GLM-130B	10-2022	C1	N/A	N/A	D1	N/A	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D3 (GLM-130B Data)
Gopher	12-2021	C1	C1	N/A	D1	N/A	N/A	C1	N/A	D1	D1	C1	D1
GPT-2	2-2019	C1	N/A	N/A	D1	N/A	N/A	C1	N/A	D1	N/A	C5 (Mod. MIT)	D5 (Mod. MIT)
GPT-3.5-turbo	9-2023	C1	C1	C1	D1	D1	D1	C5 (MIT)	C1	D1	D1	C1	D2
GPT-3	5-2020	C1	C1	N/A	D1	N/A	N/A	C1	N/A	D1	N/A	C1	D2
GPT-4	3-2023	C1	C1	C1	D1	D1	D1	C5 (MIT)	C1	D1	D1	C1	D2
GPT-J-6B	6-2021	C5 (Apache 2.0)	C5 (Apache 2.0)	N/A	D5 (MIT)	N/A	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
GPT-Neo	3-2021	C5 (MIT)	C5 (MIT)	N/A	D5 (MIT)	N/A	N/A	C5 (MIT)	N/A	N/A	N/A	C5 (MIT)	D5 (MIT)
GPT-NeoX-20B	2-2022	C5 (Apache 2.0)	N/A	N/A	D5 (MIT)	N/A	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
Grok-1	11-2023	C1	C1	Unknown	D1	D1	Unknown	C1	N/A	N/A	N/A	C1	D2
Jamba	3-2024	C1	C1	C1	D1	Unknown	Unknown	N/A	C1	D1	D1	C5 (Apache 2.0)	D5 (Apache 2.0)
LaMDA	1-2022	C1	C1	N/A	D1	D1	N/A	C1	C1	D1	D1	C1	D1
LLaMA-2	7-2023	C1	C1	C1	D1	D1	D1	C1	N/A	D1	N/A	C3 (LLaMA-2)	D3 (LLaMA-2)
LLaMA-3	4-2024	C1	C1	C1	D1	D1	D1	C1	C1	Unknown	Unknown	C3 (LLaMA-3)	D3 (LLaMA-3)
LLaMA	2-2023	C1	N/A	N/A	Unknown (likely D5)	N/A	N/A	C1	C1	N/A	D5 (Publicly available)	C5 (GNU GPL)	D3 (LLaMA)
Megatron-Turing	10-2021	C1	N/A	N/A	D1	N/A	N/A	C1	N/A	N/A	N/A	C1	D1
Mistral-7B	10-2023	C1	C1	N/A	D1	D1	N/A	C1	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
mT5	10-2020	C5 (Apache 2.0)	C5 (Apache 2.0)	N/A	D4 (ODC-By)	N/A	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
OpenLLaMA	6-2023	C5 (Apache 2.0)	N/A	N/A	D4 (RedPajama Data)	N/A	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
OPT	5-2022	C5 (MIT)	N/A	N/A	Unknown	N/A	N/A	C1	N/A	N/A	N/A	C5 (MIT)	D3 (OPT Data)
PaLM-2 Foundation model only	5-2023	C1	N/A	N/A	D1	N/A	N/A	C1	N/A	N/A	D5 (Publicly available)	C1	D1
PaLM	4-2022	C1	C1	N/A	D1	D1	N/A	C1	N/A	N/A	N/A	C1	D1
Phi-2	11-2023	C1	N/A	N/A	D1	N/A	N/A	C1	N/A	N/A	N/A	C5 (MIT)	D5 (MIT)
PolyCoder	2-2022	C5 (MIT)	N/A	N/A	Unknown (D3 or D4)	N/A	N/A	C5 (MIT)	N/A	N/A	N/A	C5 (CC BY-SA-4.0)	D5 (CC BY-SA-4.0)
Pythia	12-2022	C5 (Apache 2.0)	N/A	N/A	D5 (MIT)	N/A	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
Stable LM Base-Alpha Base-Alpha-Tuned	4-2023	C5 (CC BY-SA-4.0)	C5 (CC BY-SA-4.0)	N/A	D5 (CC BY-SA-4.0, The Pile + others)	D5 (CC BY-SA-4.0)	N/A	C5 (CC BY-SA-4.0)	N/A	N/A	N/A	C5 (CC BY-SA-4.0)	D5 (CC BY-SA-4.0)
T5	10-2019	C5 (Apache 2.0)	C5 (Apache 2.0)	N/A	D4 (ODC-By)	N/A	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
UL2	5-2022	C5 (Apache 2.0)	C5 (Apache 2.0)	N/A	D4 (ODC-By)	N/A	N/A	C5 (Apache 2.0)	N/A	N/A	N/A	C5 (Apache 2.0)	D5 (Apache 2.0)
XGLM	12-2021	C5 (MIT)	N/A	N/A	D5 (ComCrawl)	N/A	N/A	C5 (MIT)	C1	D5 (Public datasets)	N/A	C5 (MIT)	D5 (MIT)
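
To work with the taxonomy programmatically, a reader might parse each tab-separated row into openness levels. The sketch below is one possible way to do that, not something provided by the authors: the helper names `parse_cell` and `fully_open_fraction` are hypothetical, and treating "N/A" and "Unknown" entries as unrated is an assumption made here for simplicity. The sample row reproduces the Mistral-7B entry from the table.

```python
import re
from typing import Optional, Tuple

# Matches cells such as "C5 (Apache 2.0)", "D3 (LLaMA-2)" or a bare "C1".
CELL = re.compile(r"^([CD])([1-5])(?:\s*\((.*)\))?$")


def parse_cell(cell: str) -> Tuple[Optional[str], Optional[int], Optional[str]]:
    """Split a table cell into (kind, level, note).

    Cells without a definite level ("N/A", "Unknown", "Unknown (D3 or D4)")
    yield (None, None, None).
    """
    match = CELL.match(cell.strip())
    if match is None:
        return None, None, None
    kind, level, note = match.groups()
    return kind, int(level), note


def fully_open_fraction(row: str) -> float:
    """Fraction of rated components in a row that are level 5 (fully open).

    The first two tab-separated fields (model name and release date) are skipped.
    """
    cells = row.split("\t")[2:]
    levels = [level for _, level, _ in map(parse_cell, cells) if level is not None]
    if not levels:
        return 0.0
    return sum(level == 5 for level in levels) / len(levels)


mistral_row = "\t".join([
    "Mistral-7B", "10-2023",
    "C1", "C1", "N/A", "D1", "D1", "N/A",
    "C1", "N/A", "N/A", "N/A",
    "C5 (Apache 2.0)", "D5 (Apache 2.0)",
])
print(parse_cell("C5 (Apache 2.0)"))               # ('C', 5, 'Apache 2.0')
print(round(fully_open_fraction(mistral_row), 3))  # 0.286
```

A single fraction hides which stage is open, so a per-stage breakdown would be more informative; the sample row, for instance, is fully open only at the deployment stage.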