Reference Guide: Key Concepts in Data Science and AI Data Models

1. Data Types

1.1 Numerical Data

  • Continuous Data: Data that can take any value within a range. Examples include temperature, height, and weight.
    • Context: Used in regression models where the prediction involves continuous values.
    • Facts & Figures: Often requires normalization or standardization for machine learning models to perform effectively.
  • Discrete Data: Data that can only take specific values. Examples include the number of students in a class or the number of cars in a parking lot.
    • Context: Used in classification problems where the target variable is discrete.
    • Facts & Figures: Counts can usually be used directly as numeric features; low-cardinality discrete values are sometimes treated as categories and one-hot encoded instead.

1.2 Categorical Data

  • Nominal Data: Data that represents categories without any intrinsic order. Examples include gender, color, and nationality.
    • Context: Used in classification tasks where categories are distinct and unordered.
    • Facts & Figures: Often transformed using one-hot encoding or label encoding for machine learning algorithms.
  • Ordinal Data: Data that represents categories with a meaningful order but unknown intervals between values. Examples include rankings, education levels, and satisfaction ratings.
    • Context: Useful in ordinal regression and decision trees where the order of categories matters.
    • Facts & Figures: Can be handled using ordinal encoding which maintains the order of categories.

1.3 Text Data

  • Unstructured Data: Text data without a predefined format, such as social media posts, articles, and emails.
    • Context: Used in natural language processing (NLP) tasks like sentiment analysis and text classification.
    • Facts & Figures: Requires preprocessing steps like tokenization, stemming, and lemmatization. Commonly transformed into numerical vectors using techniques like TF-IDF or word embeddings (e.g., Word2Vec, GloVe); see the sketch after this list.
  • Structured Data: Text data with a predefined, machine-parsable format, such as XML or JSON (often called semi-structured data).
    • Context: Often used in data exchange between systems and can be easily parsed and queried.
    • Facts & Figures: Structured text data is typically easier to handle and integrate into databases and data models.
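
A minimal sketch of turning unstructured text into TF-IDF vectors, assuming scikit-learn is available (the three-document corpus below is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of unstructured text (illustrative data only).
corpus = [
    "The stock market rallied today",
    "Heavy rain is expected tomorrow",
    "Markets fell on inflation news",
]

# TfidfVectorizer tokenizes and lowercases internally; stemming or
# lemmatization would need an extra step (e.g., NLTK or spaCy).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: (n_docs, vocab_size)

print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```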

1.4 Time Series Data

  • Definition: Data points collected or recorded at specific time intervals. Examples include stock prices, weather data, and sensor readings.
    • Context: Used in forecasting, anomaly detection, and temporal pattern recognition.
    • Facts & Figures: Requires techniques like rolling averages, exponential smoothing, and ARIMA models for analysis. Seasonality and trend decomposition are common preprocessing steps.
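
As a rough sketch of the smoothing and decomposition steps mentioned above, assuming pandas and statsmodels (the daily series below is synthetic):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily series with a linear trend plus weekly seasonality.
idx = pd.date_range("2024-01-01", periods=120, freq="D")
t = np.arange(120)
y = pd.Series(0.5 * t + 5 * np.sin(2 * np.pi * t / 7), index=idx)

# A 7-day rolling average smooths out the weekly cycle.
smoothed = y.rolling(window=7).mean()

# Classical decomposition separates trend, seasonal, and residual parts.
parts = seasonal_decompose(y, model="additive", period=7)
print(parts.trend.dropna().head())
```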

1.5 Image Data

  • Definition: Data in the form of images, typically represented as pixel values.
    • Context: Used in computer vision tasks like image classification, object detection, and image segmentation.
    • Facts & Figures: Requires preprocessing like resizing, normalization, and augmentation. Convolutional neural networks (CNNs) are the models most commonly used for image data.
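
A minimal preprocessing sketch using Pillow and NumPy; the file name example.jpg is a placeholder, and 224x224 is just one common CNN input size:

```python
import numpy as np
from PIL import Image

# "example.jpg" is a hypothetical path; substitute a real image file.
img = Image.open("example.jpg").convert("RGB")

# Resize to a fixed input size (224x224 is typical for many CNNs).
img = img.resize((224, 224))

# Convert to float and normalize pixel values from [0, 255] to [0, 1].
x = np.asarray(img, dtype=np.float32) / 255.0

# Simple augmentation: a horizontal flip along the width axis.
x_flipped = x[:, ::-1, :]

print(x.shape)  # (224, 224, 3)
```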

1.6 Audio Data

  • Definition: Data in the form of sound waves, typically represented as a time series of amplitude samples (a waveform).
    • Context: Used in tasks like speech recognition, music classification, and audio anomaly detection.
    • Facts & Figures: Preprocessing steps include noise reduction, feature extraction (e.g., MFCCs), and normalization. Recurrent neural networks (RNNs) and CNNs are often used for audio data.
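
A short MFCC-extraction sketch assuming the librosa library; speech.wav is a placeholder file name:

```python
import librosa

# "speech.wav" is a hypothetical path; substitute a real audio file.
y, sr = librosa.load("speech.wav", sr=16000)  # resample to 16 kHz

# Extract 13 Mel-frequency cepstral coefficients (MFCCs) per frame.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Per-coefficient standardization is a common normalization step.
mfccs = (mfccs - mfccs.mean(axis=1, keepdims=True)) / mfccs.std(axis=1, keepdims=True)

print(mfccs.shape)  # (13, n_frames)
```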

2. Data Preprocessing Techniques

2.1 Normalization and Standardization

  • Normalization: Scaling data to a range of [0, 1].
    • Context: Used when the feature scales vary widely.
    • Facts & Figures: Helps gradient-based algorithms converge faster by putting all features on a common scale.
  • Standardization: Scaling data to have a mean of 0 and a standard deviation of 1.
    • Context: Commonly used when the data follows a Gaussian distribution.
    • Facts & Figures: Often required for algorithms like SVM and PCA.
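
Both transformations are one-liners in scikit-learn; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # illustrative data

# Min-max normalization rescales each feature to [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization rescales each feature to mean 0, standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_norm)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~[0, 0] and ~[1, 1]
```

In practice, fit the scaler on the training split only and apply it to validation and test data, to avoid leaking information about held-out data into training.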

2.2 Encoding Categorical Data

  • One-Hot Encoding: Converts categorical variables into a series of binary vectors.
    • Context: Suitable for nominal data.
    • Facts & Figures: Increases the dimensionality of the dataset.
  • Label Encoding: Assigns an integer value to each category.
    • Context: Suitable for ordinal data.
    • Facts & Figures: Imposes an implicit numeric order, which can mislead models when applied to nominal data.
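
A sketch of both encodings using pandas and scikit-learn; the color/size columns are made up, and OrdinalEncoder with an explicit category order stands in for generic ordinal encoding:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue"],     # nominal: no inherent order
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
})

# One-hot encoding for nominal data: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding with an explicit category order preserves the ranking.
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = enc.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(df)
```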

2.3 Handling Missing Data

  • Imputation: Filling missing values with mean, median, mode, or a specific value.
    • Context: Prevents loss of data and maintains dataset size.
    • Facts & Figures: Multiple imputation techniques can be used for more robust handling.
  • Dropping Missing Values: Removing rows or columns with missing data.
    • Context: Used when the amount of missing data is minimal.
    • Facts & Figures: Can lead to a significant reduction in dataset size if not handled carefully.
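
A minimal sketch of both strategies on a made-up DataFrame, using scikit-learn's SimpleImputer for mean imputation:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "income": [50000.0, 60000.0, np.nan, 55000.0],
})

# Mean imputation fills each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Alternative: drop any row that contains a missing value.
df_dropped = df.dropna()

print(df_imputed)
print(len(df), "rows ->", len(df_dropped), "rows after dropping")
```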

2.4 Feature Engineering

  • Definition: Creating new features from existing ones to improve model performance.
    • Context: Involves domain knowledge and creativity.
    • Facts & Figures: Techniques include polynomial features, interaction terms, and domain-specific transformations.
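
As one concrete instance of the polynomial/interaction technique, a sketch using scikit-learn's PolynomialFeatures on made-up data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0], [4.0, 5.0]])  # two illustrative features

# Degree-2 expansion adds squares and the pairwise interaction term:
# [x1, x2] -> [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(["x1", "x2"]))
print(X_poly)
```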

3. Model Evaluation Metrics

3.1 Classification Metrics

  • Accuracy: The ratio of correctly predicted instances to the total instances.
    • Context: Suitable for balanced datasets.
    • Facts & Figures: Can be misleading for imbalanced datasets.
  • Precision, Recall, F1-Score: Measures of a model's performance based on true positives, false positives, and false negatives.
    • Context: Important for imbalanced datasets.
    • Facts & Figures: F1-Score is the harmonic mean of precision and recall.
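
All four metrics are available in scikit-learn; a quick sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # illustrative ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # illustrative predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))  # harmonic mean of the two above
```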

3.2 Regression Metrics

  • Mean Absolute Error (MAE): The average of absolute errors between predicted and actual values.
    • Context: Provides a straightforward interpretation of prediction errors.
    • Facts & Figures: Less sensitive to outliers than squared-error metrics like MSE.
  • Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): Measures of average squared errors between predicted and actual values.
    • Context: Penalizes larger errors more than smaller ones.
    • Facts & Figures: RMSE is the square root of MSE, making it interpretable in the same units as the target variable.
  • R-squared: The proportion of variance in the dependent variable that is predictable from the independent variables.
    • Context: Indicates the goodness of fit.
    • Facts & Figures: Values typically range from 0 to 1, with higher values indicating a better fit; it can be negative when a model fits worse than simply predicting the mean.
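
A quick sketch computing all four metrics with scikit-learn and NumPy on made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # illustrative actual values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])  # illustrative predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as the target variable
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```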

3.3 Time Series Metrics

  • Mean Absolute Percentage Error (MAPE): The average of absolute percentage errors between predicted and actual values.
    • Context: Suitable for time series forecasting.
    • Facts & Figures: Scale-independent and easy to interpret as a percentage, but undefined when any actual value is zero.
  • Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF): Tools that measure the correlation between a time series and its lagged values; PACF removes the influence of intermediate lags.
    • Context: Used in model identification and diagnostics.
    • Facts & Figures: Helps in identifying the appropriate lag for models like ARIMA.
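
A closing sketch, assuming NumPy and statsmodels: MAPE computed by hand, plus ACF/PACF values on a synthetic series:

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

y_true = np.array([100.0, 110.0, 120.0, 130.0])  # illustrative actuals
y_pred = np.array([102.0, 108.0, 123.0, 128.0])  # illustrative forecasts

# MAPE: mean of |error| / |actual| as a percentage (undefined if any actual is 0).
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAPE = {mape:.2f}%")

# ACF/PACF on a longer synthetic series help pick ARIMA lag orders.
series = np.sin(np.arange(60) / 3.0) + np.random.default_rng(0).normal(0, 0.1, 60)
print(acf(series, nlags=5))
print(pacf(series, nlags=5))
```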