When working with Transformers in libraries like Hugging Face's `transformers`, dictionaries and indices play crucial roles in handling data efficiently. Here's a concise overview of how each is used.

### Dictionaries in Transformers
1. **Tokenization:**
   - Tokenizers convert text into tokens and return a dictionary-like object with keys such as `input_ids`, `attention_mask`, and `token_type_ids`.
   - Example:

     ```python
     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
     inputs = tokenizer("Hello, how are you?", return_tensors="pt")
     print(inputs)
     # {'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102]]),
     #  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]),
     #  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
     ```
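   - The returned object is a `BatchEncoding`, which behaves like a standard Python dictionary, so individual fields can be read by key. A minimal sketch, continuing from `inputs` above:

     ```python
     # Dictionary-style access on the tokenizer output.
     print(list(inputs.keys()))     # ['input_ids', 'token_type_ids', 'attention_mask']
     print(inputs["input_ids"])     # tensor of token indices, shape (1, 8)
     ```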
2. **Model Outputs:**
   - Model outputs are dictionary-like objects containing elements such as `logits`, `hidden_states`, and `attentions`.
   - Example:

     ```python
     from transformers import AutoModelForSequenceClassification

     # bert-base-uncased has no fine-tuned classification head, so the head is
     # randomly initialized and the exact logits will differ from run to run.
     model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
     outputs = model(**inputs)
     print(outputs)
     # SequenceClassifierOutput(loss=None, logits=tensor([[0.2438, -0.1436]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
     ```
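   - The output is a `ModelOutput`, which supports dictionary-style access alongside attribute access. A minimal sketch, continuing from `outputs` above:

     ```python
     # The same tensor is reachable by attribute or by key.
     print(outputs.logits)
     print(outputs["logits"])
     print(list(outputs.keys()))    # only the fields that are not None, here ['logits']
     ```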
3. **Training and Configuration:**
   - Training arguments and configurations are grouped into objects that can be read and written like dictionaries.
   - Example:

     ```python
     from transformers import Trainer, TrainingArguments

     # These arguments are later handed to a Trainer instance.
     training_args = TrainingArguments(
         output_dir='./results',
         num_train_epochs=3,
         per_device_train_batch_size=16,
         per_device_eval_batch_size=16,
         warmup_steps=500,
         weight_decay=0.01,
         logging_dir='./logs',
     )
     ```
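   - If a plain dictionary is needed (for logging or serialization, say), the arguments can be converted with `to_dict()`. A minimal sketch, continuing from `training_args` above:

     ```python
     config_dict = training_args.to_dict()     # plain Python dict of every setting
     print(config_dict["num_train_epochs"])    # 3
     print(config_dict["weight_decay"])        # 0.01
     ```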
### Indices in Transformers
1. **Token Indices:**
   - Each token in the vocabulary is assigned a unique integer index, and the tokenizer maps tokens to these indices.
   - Example:

     ```python
     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
     tokens = tokenizer.tokenize("Hello, how are you?")
     token_indices = tokenizer.convert_tokens_to_ids(tokens)
     print(tokens)
     # ['hello', ',', 'how', 'are', 'you', '?']
     print(token_indices)
     # [7592, 1010, 2129, 2024, 2017, 1029]
     ```
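   - The mapping also runs in reverse, which is how model predictions are turned back into text. A minimal sketch, continuing from `token_indices` above:

     ```python
     # Indices back to tokens, and tokens back to a readable string.
     print(tokenizer.convert_ids_to_tokens(token_indices))   # ['hello', ',', 'how', 'are', 'you', '?']
     print(tokenizer.decode(token_indices))                   # hello, how are you?
     ```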
2. **Positional Encoding:**
   - Positional indices maintain the order of tokens in the sequence, which is crucial because the Transformer's attention mechanism is otherwise order-agnostic (see the sketch just after this list).
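
To make this concrete, here is a minimal, self-contained sketch of learned position embeddings in the style of BERT, written in plain PyTorch; the sizes are illustrative assumptions rather than values from a real checkpoint:

```python
import torch
import torch.nn as nn

seq_len, hidden_size, max_positions = 8, 16, 512                 # illustrative sizes

token_embeddings = torch.randn(1, seq_len, hidden_size)          # stand-in for token embeddings
position_ids = torch.arange(seq_len).unsqueeze(0)                 # tensor([[0, 1, 2, 3, 4, 5, 6, 7]])
position_embedding = nn.Embedding(max_positions, hidden_size)     # learned table, as in BERT

# Each position index selects a learned vector that is added to the token embedding,
# so the model can tell "you are" apart from "are you".
embeddings = token_embeddings + position_embedding(position_ids)
print(position_ids)        # the positional indices for one sequence of 8 tokens
print(embeddings.shape)    # torch.Size([1, 8, 16])
```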
### Practical Example
Combining dictionaries and indices in a text classification task:
1. **Tokenization:**

   ```python
   from transformers import AutoTokenizer, AutoModelForSequenceClassification

   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
   text = "Transformers are powerful models for NLP tasks."
   inputs = tokenizer(text, return_tensors="pt")
   ```
2. **Model Inference:**

   ```python
   model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
   outputs = model(**inputs)
   logits = outputs.logits
   ```
3. **Post-Processing:**

   ```python
   import torch

   predicted_class = torch.argmax(logits, dim=1).item()
   print(f"Predicted class: {predicted_class}")
   ```
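   The predicted index can also be turned into probabilities and a human-readable label. A minimal sketch, continuing from `logits` above (for a checkpoint without a fine-tuned head, the label names are just the placeholders from the default config, e.g. "LABEL_0"):

   ```python
   probs = torch.softmax(logits, dim=1)              # class probabilities
   label = model.config.id2label[predicted_class]    # index -> label name
   print(f"{label}: {probs[0, predicted_class].item():.3f}")
   ```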
### Summary
- **Dictionaries**: Used for managing complex data (e.g., tokenized inputs, model outputs, configurations).
- **Indices**: Used to represent tokens and positions, enabling efficient encoding and decoding.
Together, they facilitate the efficient processing and manipulation of text data in Transformer models.