When working with Transformers in libraries like Hugging Face's `transformers`, dictionaries and indices play crucial roles in handling data efficiently. Here's a concise overview of how each is used.

### Dictionaries in Transformers
1. **Tokenization:**
   - Tokenizers convert text into tokens and return a dictionary-like object with keys such as `input_ids`, `attention_mask`, and `token_type_ids`.
   - Example:

     ```python
     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
     inputs = tokenizer("Hello, how are you?", return_tensors="pt")
     print(inputs)
     # {'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102]]),
     #  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]),
     #  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
     ```
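   - The returned object is a `BatchEncoding`, which behaves like a standard Python dictionary, so individual fields can be read by key. A minimal sketch, continuing from `inputs` above:

     ```python
     # Dictionary-style access on the tokenizer output.
     print(list(inputs.keys()))     # ['input_ids', 'token_type_ids', 'attention_mask']
     print(inputs["input_ids"])     # tensor of token indices, shape (1, 8)
     ```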
2. **Model Outputs:**
   - Model outputs are dictionary-like objects containing elements such as `logits`, `hidden_states`, and `attentions`.
   - Example:

     ```python
     from transformers import AutoModelForSequenceClassification

     # bert-base-uncased has no fine-tuned classification head, so the head is
     # randomly initialized and the exact logits will differ from run to run.
     model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
     outputs = model(**inputs)
     print(outputs)
     # SequenceClassifierOutput(loss=None, logits=tensor([[0.2438, -0.1436]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
     ```
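   - The output is a `ModelOutput`, which supports dictionary-style access alongside attribute access. A minimal sketch, continuing from `outputs` above:

     ```python
     # The same tensor is reachable by attribute or by key.
     print(outputs.logits)
     print(outputs["logits"])
     print(list(outputs.keys()))    # only the fields that are not None, here ['logits']
     ```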
3. **Training and Configuration:**
   - Training arguments and configurations are grouped into objects that can be read and written like dictionaries.
   - Example:

     ```python
     from transformers import Trainer, TrainingArguments

     # These arguments are later handed to a Trainer instance.
     training_args = TrainingArguments(
         output_dir='./results',
         num_train_epochs=3,
         per_device_train_batch_size=16,
         per_device_eval_batch_size=16,
         warmup_steps=500,
         weight_decay=0.01,
         logging_dir='./logs',
     )
     ```
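   - If a plain dictionary is needed (for logging or serialization, say), the arguments can be converted with `to_dict()`. A minimal sketch, continuing from `training_args` above:

     ```python
     config_dict = training_args.to_dict()     # plain Python dict of every setting
     print(config_dict["num_train_epochs"])    # 3
     print(config_dict["weight_decay"])        # 0.01
     ```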
### Indices in Transformers
1. **Token Indices:**
   - Each token in the vocabulary is assigned a unique integer index, and the tokenizer maps tokens to these indices.
   - Example:

     ```python
     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
     tokens = tokenizer.tokenize("Hello, how are you?")
     token_indices = tokenizer.convert_tokens_to_ids(tokens)
     print(tokens)
     # ['hello', ',', 'how', 'are', 'you', '?']
     print(token_indices)
     # [7592, 1010, 2129, 2024, 2017, 1029]
     ```
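   - The mapping also runs in reverse, which is how model predictions are turned back into text. A minimal sketch, continuing from `token_indices` above:

     ```python
     # Indices back to tokens, and tokens back to a readable string.
     print(tokenizer.convert_ids_to_tokens(token_indices))   # ['hello', ',', 'how', 'are', 'you', '?']
     print(tokenizer.decode(token_indices))                   # hello, how are you?
     ```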
2. **Positional Encoding:**
   - Positional indices maintain the order of tokens in the sequence, which is crucial because the Transformer's attention mechanism is otherwise order-agnostic (see the sketch just after this list).
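
To make this concrete, here is a minimal, self-contained sketch of learned position embeddings in the style of BERT, written in plain PyTorch; the sizes are illustrative assumptions rather than values from a real checkpoint:

```python
import torch
import torch.nn as nn

seq_len, hidden_size, max_positions = 8, 16, 512                 # illustrative sizes

token_embeddings = torch.randn(1, seq_len, hidden_size)          # stand-in for token embeddings
position_ids = torch.arange(seq_len).unsqueeze(0)                 # tensor([[0, 1, 2, 3, 4, 5, 6, 7]])
position_embedding = nn.Embedding(max_positions, hidden_size)     # learned table, as in BERT

# Each position index selects a learned vector that is added to the token embedding,
# so the model can tell "you are" apart from "are you".
embeddings = token_embeddings + position_embedding(position_ids)
print(position_ids)        # the positional indices for one sequence of 8 tokens
print(embeddings.shape)    # torch.Size([1, 8, 16])
```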
### Practical Example
Combining dictionaries and indices in a text classification task:
1. **Tokenization:**

   ```python
   from transformers import AutoTokenizer, AutoModelForSequenceClassification

   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
   text = "Transformers are powerful models for NLP tasks."
   inputs = tokenizer(text, return_tensors="pt")
   ```
2. **Model Inference:**

   ```python
   model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
   outputs = model(**inputs)
   logits = outputs.logits
   ```
3. **Post-Processing:**

   ```python
   import torch

   predicted_class = torch.argmax(logits, dim=1).item()
   print(f"Predicted class: {predicted_class}")
   ```
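   The predicted index can also be turned into probabilities and a human-readable label. A minimal sketch, continuing from `logits` above (for a checkpoint without a fine-tuned head, the label names are just the placeholders from the default config, e.g. "LABEL_0"):

   ```python
   probs = torch.softmax(logits, dim=1)              # class probabilities
   label = model.config.id2label[predicted_class]    # index -> label name
   print(f"{label}: {probs[0, predicted_class].item():.3f}")
   ```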
### Summary
- **Dictionaries**: Used for managing complex data (e.g., tokenized inputs, model outputs, configurations).
- **Indices**: Used to represent tokens and positions, enabling efficient encoding and decoding.
Together, they facilitate the efficient processing and manipulation of text data in Transformer models.