Reference Guide: Key Concepts in Data Science and AI Data Models

1. Data Types

1.1 Numerical Data

  • Continuous Data: Data that can take any value within a range. Examples include temperature, height, and weight.
    • Context: Used in regression models where the prediction involves continuous values.
    • Facts & Figures: Often requires normalization or standardization for machine learning models to perform effectively.
  • Discrete Data: Data that can only take specific values. Examples include the number of students in a class or the number of cars in a parking lot.
    • Context: Used in classification problems where the target variable is discrete.
    • Facts & Figures: Counts can usually be used directly as numeric features; low-cardinality discrete values are sometimes treated as categories and one-hot encoded instead.

1.2 Categorical Data

  • Nominal Data: Data that represents categories without any intrinsic order. Examples include gender, color, and nationality.
    • Context: Used in classification tasks where categories are distinct and unordered.
    • Facts & Figures: Often transformed using one-hot encoding or label encoding for machine learning algorithms.
  • Ordinal Data: Data that represents categories with a meaningful order but unknown intervals between values. Examples include rankings, education levels, and satisfaction ratings.
    • Context: Useful in ordinal regression and decision trees where the order of categories matters.
    • Facts & Figures: Can be handled using ordinal encoding which maintains the order of categories.

1.3 Text Data

  • Unstructured Data: Text data without a predefined format, such as social media posts, articles, and emails.
    • Context: Used in natural language processing (NLP) tasks like sentiment analysis and text classification.
    • Facts & Figures: Requires preprocessing steps like tokenization, stemming, and lemmatization. Commonly transformed into numerical vectors using techniques like TF-IDF or word embeddings (e.g., Word2Vec, GloVe); see the sketch after this list.
  • Structured Data: Text data with a predefined, machine-parsable format, such as XML or JSON (often called semi-structured data).
    • Context: Often used in data exchange between systems and can be easily parsed and queried.
    • Facts & Figures: Structured text data is typically easier to handle and integrate into databases and data models.
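
A minimal sketch of turning unstructured text into TF-IDF vectors, assuming scikit-learn is available (the three-document corpus below is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of unstructured text (illustrative data only).
corpus = [
    "The stock market rallied today",
    "Heavy rain is expected tomorrow",
    "Markets fell on inflation news",
]

# TfidfVectorizer tokenizes and lowercases internally; stemming or
# lemmatization would need an extra step (e.g., NLTK or spaCy).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: (n_docs, vocab_size)

print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```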

1.4 Time Series Data

  • Definition: Data points collected or recorded at specific time intervals. Examples include stock prices, weather data, and sensor readings.
    • Context: Used in forecasting, anomaly detection, and temporal pattern recognition.
    • Facts & Figures: Requires techniques like rolling averages, exponential smoothing, and ARIMA models for analysis. Seasonality and trend decomposition are common preprocessing steps.
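
As a rough sketch of the smoothing and decomposition steps mentioned above, assuming pandas and statsmodels (the daily series below is synthetic):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily series with a linear trend plus weekly seasonality.
idx = pd.date_range("2024-01-01", periods=120, freq="D")
t = np.arange(120)
y = pd.Series(0.5 * t + 5 * np.sin(2 * np.pi * t / 7), index=idx)

# A 7-day rolling average smooths out the weekly cycle.
smoothed = y.rolling(window=7).mean()

# Classical decomposition separates trend, seasonal, and residual parts.
parts = seasonal_decompose(y, model="additive", period=7)
print(parts.trend.dropna().head())
```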

1.5 Image Data

  • Definition: Data in the form of images, typically represented as pixel values.
    • Context: Used in computer vision tasks like image classification, object detection, and image segmentation.
    • Facts & Figures: Requires preprocessing like resizing, normalization, and augmentation. Convolutional neural networks (CNNs) are the models most commonly used for image data.
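
A minimal preprocessing sketch using Pillow and NumPy; the file name example.jpg is a placeholder, and 224x224 is just one common CNN input size:

```python
import numpy as np
from PIL import Image

# "example.jpg" is a hypothetical path; substitute a real image file.
img = Image.open("example.jpg").convert("RGB")

# Resize to a fixed input size (224x224 is typical for many CNNs).
img = img.resize((224, 224))

# Convert to float and normalize pixel values from [0, 255] to [0, 1].
x = np.asarray(img, dtype=np.float32) / 255.0

# Simple augmentation: a horizontal flip along the width axis.
x_flipped = x[:, ::-1, :]

print(x.shape)  # (224, 224, 3)
```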

1.6 Audio Data

  • Definition: Data in the form of sound waves, typically represented as a time series of amplitude samples (a waveform).
    • Context: Used in tasks like speech recognition, music classification, and audio anomaly detection.
    • Facts & Figures: Preprocessing steps include noise reduction, feature extraction (e.g., MFCCs), and normalization. Recurrent neural networks (RNNs) and CNNs are often used for audio data.
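
A short MFCC-extraction sketch assuming the librosa library; speech.wav is a placeholder file name:

```python
import librosa

# "speech.wav" is a hypothetical path; substitute a real audio file.
y, sr = librosa.load("speech.wav", sr=16000)  # resample to 16 kHz

# Extract 13 Mel-frequency cepstral coefficients (MFCCs) per frame.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Per-coefficient standardization is a common normalization step.
mfccs = (mfccs - mfccs.mean(axis=1, keepdims=True)) / mfccs.std(axis=1, keepdims=True)

print(mfccs.shape)  # (13, n_frames)
```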

2. Data Preprocessing Techniques

2.1 Normalization and Standardization

  • Normalization: Scaling data to a range of [0, 1].
    • Context: Used when the feature scales vary widely.
    • Facts & Figures: Helps gradient-based algorithms converge faster by putting all features on a common scale.
  • Standardization: Scaling data to have a mean of 0 and a standard deviation of 1.
    • Context: Commonly used when the data follows a Gaussian distribution.
    • Facts & Figures: Often required for algorithms like SVM and PCA.
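
Both transformations are one-liners in scikit-learn; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # illustrative data

# Min-max normalization rescales each feature to [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization rescales each feature to mean 0, standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_norm)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~[0, 0] and ~[1, 1]
```

In practice, fit the scaler on the training split only and apply it to validation and test data, to avoid leaking information about held-out data into training.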

2.2 Encoding Categorical Data

  • One-Hot Encoding: Converts categorical variables into a series of binary vectors.
    • Context: Suitable for nominal data.
    • Facts & Figures: Increases the dimensionality of the dataset.
  • Label Encoding: Assigns an integer value to each category.
    • Context: Suitable for ordinal data.
    • Facts & Figures: Imposes an implicit numeric order, which can mislead models when applied to nominal data.
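
A sketch of both encodings using pandas and scikit-learn; the color/size columns are made up, and OrdinalEncoder with an explicit category order stands in for generic ordinal encoding:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue"],     # nominal: no inherent order
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
})

# One-hot encoding for nominal data: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding with an explicit category order preserves the ranking.
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = enc.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(df)
```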

2.3 Handling Missing Data

  • Imputation: Filling missing values with mean, median, mode, or a specific value.
    • Context: Prevents loss of data and maintains dataset size.
    • Facts & Figures: Multiple imputation techniques can be used for more robust handling.
  • Dropping Missing Values: Removing rows or columns with missing data.
    • Context: Used when the amount of missing data is minimal.
    • Facts & Figures: Can lead to a significant reduction in dataset size if not handled carefully.
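
A minimal sketch of both strategies on a made-up DataFrame, using scikit-learn's SimpleImputer for mean imputation:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "income": [50000.0, 60000.0, np.nan, 55000.0],
})

# Mean imputation fills each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Alternative: drop any row that contains a missing value.
df_dropped = df.dropna()

print(df_imputed)
print(len(df), "rows ->", len(df_dropped), "rows after dropping")
```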

2.4 Feature Engineering

  • Definition: Creating new features from existing ones to improve model performance.
    • Context: Involves domain knowledge and creativity.
    • Facts & Figures: Techniques include polynomial features, interaction terms, and domain-specific transformations.
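
As one concrete instance of the polynomial/interaction technique, a sketch using scikit-learn's PolynomialFeatures on made-up data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0], [4.0, 5.0]])  # two illustrative features

# Degree-2 expansion adds squares and the pairwise interaction term:
# [x1, x2] -> [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(["x1", "x2"]))
print(X_poly)
```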

3. Model Evaluation Metrics

3.1 Classification Metrics

  • Accuracy: The ratio of correctly predicted instances to the total instances.
    • Context: Suitable for balanced datasets.
    • Facts & Figures: Can be misleading for imbalanced datasets.
  • Precision, Recall, F1-Score: Measures of a model's performance based on true positives, false positives, and false negatives.
    • Context: Important for imbalanced datasets.
    • Facts & Figures: F1-Score is the harmonic mean of precision and recall.
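
All four metrics are available in scikit-learn; a quick sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # illustrative ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # illustrative predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))  # harmonic mean of the two above
```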

3.2 Regression Metrics

  • Mean Absolute Error (MAE): The average of absolute errors between predicted and actual values.
    • Context: Provides a straightforward interpretation of prediction errors.
    • Facts & Figures: Less sensitive to outliers than squared-error metrics like MSE.
  • Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): Measures of average squared errors between predicted and actual values.
    • Context: Penalizes larger errors more than smaller ones.
    • Facts & Figures: RMSE is the square root of MSE, making it interpretable in the same units as the target variable.
  • R-squared: The proportion of variance in the dependent variable that is predictable from the independent variables.
    • Context: Indicates the goodness of fit.
    • Facts & Figures: Values typically range from 0 to 1, with higher values indicating a better fit; it can be negative when a model fits worse than simply predicting the mean.
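
A quick sketch computing all four metrics with scikit-learn and NumPy on made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # illustrative actual values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])  # illustrative predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as the target variable
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```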

3.3 Time Series Metrics

  • Mean Absolute Percentage Error (MAPE): The average of absolute percentage errors between predicted and actual values.
    • Context: Suitable for time series forecasting.
    • Facts & Figures: Scale-independent and easy to interpret as a percentage, but undefined when any actual value is zero.
  • Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF): Tools that measure the correlation between a time series and its lagged values; PACF removes the influence of intermediate lags.
    • Context: Used in model identification and diagnostics.
    • Facts & Figures: Helps in identifying the appropriate lag for models like ARIMA.
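
A closing sketch, assuming NumPy and statsmodels: MAPE computed by hand, plus ACF/PACF values on a synthetic series:

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

y_true = np.array([100.0, 110.0, 120.0, 130.0])  # illustrative actuals
y_pred = np.array([102.0, 108.0, 123.0, 128.0])  # illustrative forecasts

# MAPE: mean of |error| / |actual| as a percentage (undefined if any actual is 0).
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAPE = {mape:.2f}%")

# ACF/PACF on a longer synthetic series help pick ARIMA lag orders.
series = np.sin(np.arange(60) / 3.0) + np.random.default_rng(0).normal(0, 0.1, 60)
print(acf(series, nlags=5))
print(pacf(series, nlags=5))
```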