add data science document

This commit is contained in:
Whisker Jones
2024-05-28 18:44:55 -06:00
parent 84ab94a2c6
commit 64c911a79a
2 changed files with 204 additions and 0 deletions

tech_docs/llm/1

@@ -0,0 +1,102 @@
### Reference Guide: Key Concepts in Data Science and AI Data Models
#### 1. Data Types
1.1 **Numerical Data**
- **Continuous Data**: Data that can take any value within a range. Examples include temperature, height, and weight.
- **Context**: Used in regression models where the prediction involves continuous values.
- **Facts & Figures**: Often requires normalization or standardization for machine learning models to perform effectively.
- **Discrete Data**: Data that can only take specific values. Examples include the number of students in a class or the number of cars in a parking lot.
- **Context**: Used in classification problems where the target variable is discrete.
- **Facts & Figures**: Can be encoded using techniques like one-hot encoding for use in machine learning models.
1.2 **Categorical Data**
- **Nominal Data**: Data that represents categories without any intrinsic order. Examples include gender, color, and nationality.
- **Context**: Used in classification tasks where categories are distinct and unordered.
- **Facts & Figures**: Often transformed using one-hot encoding or label encoding for machine learning algorithms.
- **Ordinal Data**: Data that represents categories with a meaningful order but unknown intervals between values. Examples include rankings, education levels, and satisfaction ratings.
- **Context**: Useful in ordinal regression and decision trees where the order of categories matters.
- **Facts & Figures**: Can be handled using ordinal encoding which maintains the order of categories.
1.3 **Text Data**
- **Unstructured Data**: Text data without a predefined format, such as social media posts, articles, and emails.
- **Context**: Used in natural language processing (NLP) tasks like sentiment analysis and text classification.
- **Facts & Figures**: Requires preprocessing steps like tokenization, stemming, and lemmatization. Commonly transformed into numerical vectors using techniques like TF-IDF or word embeddings (e.g., Word2Vec, GloVe).
- **Structured Data**: Text data with a predefined format, such as XML or JSON.
- **Context**: Often used in data exchange between systems and can be easily parsed and queried.
- **Facts & Figures**: Structured text data is typically easier to handle and integrate into databases and data models.
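The TF-IDF transformation mentioned above can be sketched with a hypothetical three-document mini-corpus; scikit-learn's `TfidfVectorizer` is one common implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of unstructured text.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(X.shape)  # (number of documents, vocabulary size)
```

Each row is a numerical vector weighting terms by how distinctive they are for that document, which downstream classifiers can consume directly.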
1.4 **Time Series Data**
- **Definition**: Data points collected or recorded at specific time intervals. Examples include stock prices, weather data, and sensor readings.
- **Context**: Used in forecasting, anomaly detection, and temporal pattern recognition.
- **Facts & Figures**: Requires techniques like rolling averages, exponential smoothing, and ARIMA models for analysis. Seasonality and trend decomposition are common preprocessing steps.
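A rolling average, one of the smoothing techniques listed above, can be sketched with a hypothetical week of daily sensor readings using pandas:

```python
import pandas as pd

# Hypothetical daily sensor readings indexed by date.
s = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# 3-day rolling mean smooths short-term noise; the first two
# entries are NaN because the window is not yet full.
rolling = s.rolling(window=3).mean()
print(rolling.iloc[2])  # mean of the first three values: 11.0
```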
1.5 **Image Data**
- **Definition**: Data in the form of images, typically represented as pixel values.
- **Context**: Used in computer vision tasks like image classification, object detection, and image segmentation.
- **Facts & Figures**: Requires preprocessing like resizing, normalization, and augmentation. Convolutional neural networks (CNNs) are commonly used models for image data.
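The pixel normalization step above can be sketched with a hypothetical 2x2 grayscale image in NumPy, rescaling 8-bit intensities to [0, 1] before feeding a CNN:

```python
import numpy as np

# Hypothetical 2x2 grayscale image with 8-bit pixel values.
img = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Scale pixel intensities from [0, 255] to [0, 1].
normalized = img.astype(np.float32) / 255.0
print(normalized.max())  # 1.0
```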
1.6 **Audio Data**
- **Definition**: Data in the form of sound waves, often represented as time-series data.
- **Context**: Used in tasks like speech recognition, music classification, and audio anomaly detection.
- **Facts & Figures**: Preprocessing steps include noise reduction, feature extraction (e.g., MFCCs), and normalization. Recurrent neural networks (RNNs) and CNNs are often used for audio data.
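One of the simplest normalization steps for audio, peak normalization, can be sketched with a hypothetical mono signal (real pipelines would add noise reduction and feature extraction such as MFCCs):

```python
import numpy as np

# Hypothetical mono audio signal as a float amplitude array.
signal = np.array([0.1, -0.2, 0.4, -0.1])

# Peak normalization: scale so the largest absolute amplitude is 1.0.
normalized = signal / np.max(np.abs(signal))
print(normalized)  # [ 0.25 -0.5   1.   -0.25]
```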
#### 2. Data Preprocessing Techniques
2.1 **Normalization and Standardization**
- **Normalization**: Scaling data to a fixed range, typically [0, 1] (min-max scaling).
- **Context**: Used when the feature scales vary widely.
- **Facts & Figures**: Helps improve the performance of gradient-based algorithms.
- **Standardization**: Scaling data to have a mean of 0 and a standard deviation of 1.
- **Context**: Commonly used when the data follows a Gaussian distribution.
- **Facts & Figures**: Often required for algorithms like SVM and PCA.
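The contrast between the two scalings above can be sketched on a hypothetical feature vector with NumPy:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization (min-max): rescale to [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1.
x_std = (x - x.mean()) / x.std()

print(x_norm)  # endpoints map to 0.0 and 1.0
print(round(x_std.mean(), 10), round(x_std.std(), 10))  # 0.0 1.0
```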
2.2 **Encoding Categorical Data**
- **One-Hot Encoding**: Converts categorical variables into a series of binary vectors.
- **Context**: Suitable for nominal data.
- **Facts & Figures**: Increases the dimensionality of the dataset.
- **Label Encoding**: Assigns an integer value to each category.
- **Context**: Suitable for ordinal data.
- **Facts & Figures**: Can be misleading for nominal data due to the implicit order.
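Both encodings above can be sketched on a hypothetical frame with a nominal `color` column and an ordinal `size` column, using pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],      # nominal
    "size": ["small", "large", "medium", "small"],   # ordinal
})

# One-hot encoding for the nominal column: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding that preserves the order small < medium < large.
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)

print(one_hot.columns.tolist())     # ['color_blue', 'color_green', 'color_red']
print(df["size_encoded"].tolist())  # [0, 2, 1, 0]
```

Note how one-hot encoding widens the frame by one column per category, while ordinal encoding keeps a single column but bakes in an order, which is exactly why it misleads on nominal data.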
2.3 **Handling Missing Data**
- **Imputation**: Filling missing values with mean, median, mode, or a specific value.
- **Context**: Prevents loss of data and maintains dataset size.
- **Facts & Figures**: Multiple imputation techniques can be used for more robust handling.
- **Dropping Missing Values**: Removing rows or columns with missing data.
- **Context**: Used when the amount of missing data is minimal.
- **Facts & Figures**: Can lead to a significant reduction in dataset size if not handled carefully.
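The two strategies above, imputation versus dropping, can be sketched on a hypothetical column with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 35.0, 45.0]})

# Median imputation keeps all rows instead of dropping the missing one.
median_age = df["age"].median()  # 35.0
df["age_imputed"] = df["age"].fillna(median_age)

# Alternative: drop rows with any missing value (shrinks the dataset).
dropped = df.dropna(subset=["age"])

print(df["age_imputed"].tolist())  # [25.0, 35.0, 35.0, 45.0]
print(len(dropped))                # 3
```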
2.4 **Feature Engineering**
- **Definition**: Creating new features from existing ones to improve model performance.
- **Context**: Involves domain knowledge and creativity.
- **Facts & Figures**: Techniques include polynomial features, interaction terms, and domain-specific transformations.
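A polynomial feature and an interaction term can be sketched on hypothetical housing features (area and room count are illustrative names, not from any real dataset):

```python
import numpy as np

# Hypothetical features: house area (m^2) and number of rooms.
area = np.array([50.0, 80.0, 120.0])
rooms = np.array([2.0, 3.0, 5.0])

# Polynomial feature: squared area can capture non-linear price effects.
area_sq = area ** 2

# Interaction term: area per room combines both features with domain knowledge.
area_per_room = area / rooms

print(area_sq[0])  # 2500.0
```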
#### 3. Model Evaluation Metrics
3.1 **Classification Metrics**
- **Accuracy**: The ratio of correctly predicted instances to the total instances.
- **Context**: Suitable for balanced datasets.
- **Facts & Figures**: Can be misleading for imbalanced datasets.
- **Precision, Recall, F1-Score**: Measures of a model's performance based on true positives, false positives, and false negatives.
- **Context**: Important for imbalanced datasets.
- **Facts & Figures**: F1-Score is the harmonic mean of precision and recall.
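The metrics above follow directly from confusion-matrix counts; a sketch with hypothetical counts:

```python
# Counts from a hypothetical confusion matrix.
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / total
precision = tp / (tp + fp)                          # of predicted positives, how many are real
recall = tp / (tp + fn)                             # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```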
3.2 **Regression Metrics**
- **Mean Absolute Error (MAE)**: The average of absolute errors between predicted and actual values.
- **Context**: Provides a straightforward interpretation of prediction errors.
- **Facts & Figures**: Less sensitive to outliers than squared-error metrics such as MSE.
- **Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)**: Measures of average squared errors between predicted and actual values.
- **Context**: Penalizes larger errors more than smaller ones.
- **Facts & Figures**: RMSE is the square root of MSE, making it interpretable in the same units as the target variable.
- **R-squared**: The proportion of variance in the dependent variable that is predictable from the independent variables.
- **Context**: Indicates the goodness of fit.
- **Facts & Figures**: Values typically range from 0 to 1, with higher values indicating a better fit; R-squared can be negative when the model fits worse than simply predicting the mean.
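All three regression metrics above can be computed by hand on hypothetical predictions, which makes their definitions concrete:

```python
# Hypothetical predictions vs. actual values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

n = len(y_true)
errors = [p - t for p, t in zip(y_pred, y_true)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e * e for e in errors) / n    # mean squared error
rmse = mse ** 0.5                       # same units as the target

# R-squared: 1 minus residual sum of squares over total sum of squares.
mean_y = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, round(rmse, 4), r2)  # 0.5 0.375 0.6124 0.925
```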
3.3 **Time Series Metrics**
- **Mean Absolute Percentage Error (MAPE)**: The average of absolute percentage errors between predicted and actual values.
- **Context**: Suitable for time series forecasting.
- **Facts & Figures**: Provides a scale-independent, interpretable measure of prediction accuracy, but is undefined when actual values are zero.
- **Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)**: Tools for identifying the correlation between time series observations.
- **Context**: Used in model identification and diagnostics.
- **Facts & Figures**: Helps in identifying the appropriate lag for models like ARIMA.
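MAPE, defined above, can be sketched on hypothetical forecasts (note the division by the actual value, which is why zeros in `y_true` break it):

```python
# Hypothetical actual vs. forecast values for three periods.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

# Mean of absolute percentage errors, expressed as a percentage.
mape = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true) * 100

print(round(mape, 2))  # 6.67
```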