From 843d0b055e15982e2fc24fbed6d2a466de43ce8c Mon Sep 17 00:00:00 2001
From: medusa
Date: Fri, 14 Jun 2024 08:06:48 +0000
Subject: [PATCH] Add tech_docs/llm/transformer_llm.md

---
 tech_docs/llm/transformer_llm.md | 85 ++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100644 tech_docs/llm/transformer_llm.md

diff --git a/tech_docs/llm/transformer_llm.md b/tech_docs/llm/transformer_llm.md
new file mode 100644
index 0000000..f634966
--- /dev/null
+++ b/tech_docs/llm/transformer_llm.md
@@ -0,0 +1,85 @@
+When working with Transformers in libraries like Hugging Face's `transformers`, dictionaries and indices play crucial roles in handling data efficiently. Here is a concise explanation with context:
+
+### Dictionaries in Transformers
+
+1. **Tokenization:**
+   - Tokenizers convert text into tokens and return a dictionary-like object with keys such as `input_ids`, `attention_mask`, and (for BERT-style models) `token_type_ids`.
+   - Example:
+     ```python
+     from transformers import AutoTokenizer
+     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+     inputs = tokenizer("Hello, how are you?", return_tensors="pt")
+     print(inputs)
+     # {'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102]]),
+     #  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]),
+     #  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
+     ```
+
+2. **Model Outputs:**
+   - Model outputs are dictionary-like objects containing elements such as `logits`, `hidden_states`, and `attentions`.
+   - Example:
+     ```python
+     from transformers import AutoModelForSequenceClassification
+     model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
+     outputs = model(**inputs)
+     print(outputs)
+     # SequenceClassifierOutput(loss=None, logits=tensor([[0.2438, -0.1436]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
+     # Note: the classification head of "bert-base-uncased" is randomly initialized,
+     # so the exact logits values will differ between runs.
+     ```
+
+3. **Training and Configuration:**
+   - Training hyperparameters are grouped in a `TrainingArguments` object, a dataclass built from keyword arguments that can be converted to a plain dictionary (via `to_dict()`) and is passed to a `Trainer`.
+   - Example:
+     ```python
+     from transformers import Trainer, TrainingArguments
+     training_args = TrainingArguments(
+         output_dir='./results', num_train_epochs=3, per_device_train_batch_size=16,
+         per_device_eval_batch_size=16, warmup_steps=500, weight_decay=0.01, logging_dir='./logs'
+     )
+     ```
+
+### Indices in Transformers
+
+1. **Token Indices:**
+   - Each token in the vocabulary is assigned a unique integer index; the tokenizer maps tokens to these indices.
+   - Example:
+     ```python
+     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+     tokens = tokenizer.tokenize("Hello, how are you?")
+     token_indices = tokenizer.convert_tokens_to_ids(tokens)
+     print(token_indices)
+     # [7592, 1010, 2129, 2024, 2017, 1029]
+     # Note: tokenize() does not add the special [CLS]/[SEP] tokens (101/102)
+     # that the tokenizer adds automatically when called directly on text.
+     ```
+
+2. **Positional Encoding:**
+   - Positional indices preserve the order of tokens in a sequence, which the attention mechanism would otherwise ignore; a minimal sketch follows below.
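+
+A minimal sketch of what these positional indices look like in practice, assuming a BERT-style model from `transformers` (such models accept an optional `position_ids` argument; passing it explicitly is normally unnecessary and is done here only to make the indices visible):
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
+inputs = tokenizer("Hello, how are you?", return_tensors="pt")
+
+# Positional indices are simply 0, 1, ..., seq_len - 1, one per token.
+seq_len = inputs["input_ids"].shape[1]
+position_ids = torch.arange(seq_len).unsqueeze(0)   # shape: (1, seq_len)
+
+# If position_ids is omitted, the model constructs the same default range internally.
+outputs = model(**inputs, position_ids=position_ids)
+```
+
+In BERT-style models these indices select rows of a learned position-embedding table that is added to the token embeddings; other architectures consume position information differently (e.g., rotary or relative encodings).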
+
+### Practical Example
+
+Combining dictionaries and indices in a text classification task:
+
+1. **Tokenization:**
+   ```python
+   from transformers import AutoTokenizer, AutoModelForSequenceClassification
+   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+   text = "Transformers are powerful models for NLP tasks."
+   inputs = tokenizer(text, return_tensors="pt")
+   ```
+
+2. **Model Inference:**
+   ```python
+   # In practice, load a checkpoint fine-tuned for classification; the
+   # classification head of plain "bert-base-uncased" is randomly initialized.
+   model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
+   outputs = model(**inputs)
+   logits = outputs.logits
+   ```
+
+3. **Post-Processing:**
+   ```python
+   import torch
+   predicted_class = torch.argmax(logits, dim=1).item()
+   print(f"Predicted class: {predicted_class}")
+   ```
+
+### Summary
+
+- **Dictionaries**: Manage structured data such as tokenized inputs, model outputs, and training configurations.
+- **Indices**: Represent tokens and positions, enabling efficient encoding and decoding.
+
+Together, they facilitate efficient processing and manipulation of text data in Transformer models; the short round-trip sketch below shows both directions of the token/index mapping.
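+
+A compact round-trip sketch of that encode/decode path, using standard tokenizer methods (`convert_ids_to_tokens` and `decode`):
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+# Encode: text -> token indices (special tokens are added automatically).
+input_ids = tokenizer("Hello, how are you?")["input_ids"]
+print(input_ids)
+# e.g. [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
+
+# Indices -> tokens, one string per index.
+print(tokenizer.convert_ids_to_tokens(input_ids))
+# ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
+
+# Decode: indices -> text.
+print(tokenizer.decode(input_ids, skip_special_tokens=True))
+# hello, how are you?
+```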