Update docs/tech_docs/csvkit.md

This commit is contained in:
2024-04-19 15:26:23 +00:00
parent 114d8a31ef
commit ddb58a4d52


Let's dive deeper into `csvkit`, a powerful suite of command-line tools designed to handle CSV files efficiently. It provides functionality to convert, cut, clean, and query CSV data without the need for spreadsheet software or a full-fledged database system.
### Overview of `csvkit`
**`csvkit`** is an open-source suite of tools written in Python. It is widely used for data manipulation and analysis because it lets you perform complex operations on CSV files directly from the command line, which can be a big productivity boost when dealing with large datasets.
### Core Tools and Functions
Here are some of the essential tools included in `csvkit`:
1. **`csvcut`**: This tool allows you to select specific columns from a CSV file. It's particularly useful for reducing the size of large files by removing unneeded columns.
2. **`csvgrep`**: Similar to the `grep` command but optimized for CSV data, this tool lets you filter rows based on column values.
3. **`csvstat`**: Provides quick, summary statistics for each column in a CSV file. It's a handy tool for getting a quick overview and understanding the distribution of data in each column.
4. **`csvlook`**: Converts a CSV file into a format that is easy to read in the terminal, with data arranged in a table.
5. **`csvstack`**: Merges multiple CSV files that have the same columns into a single CSV file.
6. **`in2csv`**: Converts various formats (like JSON, Excel, and SQL databases) into CSV.
7. **`csvsql`**: Allows you to run SQL queries directly on CSV files and output the results in CSV format. This can also be used to create tables in a database from CSV files.
8. **`sql2csv`**: Runs SQL queries against a database and outputs the results in CSV format.
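To make the idea behind a tool like `csvstack` concrete, here is a minimal stdlib sketch of same-schema CSV merging: keep one header, then append every data row. This is only an illustration of the concept, not csvkit's implementation, and the sample data and `stack_csvs` helper are invented for the example:

```python
import csv
import io

def stack_csvs(sources):
    """Merge several same-schema CSV sources into one, keeping a single
    header row (the core idea of `csvstack`, minus its extra options).
    Assumes every source really does share the same header."""
    out = io.StringIO()
    writer = None
    for source in sources:
        rows = csv.reader(source)
        header = next(rows)
        if writer is None:
            writer = csv.writer(out, lineterminator="\n")
            writer.writerow(header)  # header written once only
        for row in rows:
            writer.writerow(row)
    return out.getvalue()

a = io.StringIO("id,name\n1,alice\n")
b = io.StringIO("id,name\n2,bob\n")
print(stack_csvs([a, b]))
```

The real `csvstack` also validates headers and can tag each row with its source file; this sketch deliberately skips those conveniences.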
### Installing `csvkit`
To install `csvkit`, you generally use Python's package installer `pip`:
```bash
pip install csvkit
```
### Practical Examples
Here's how you might use some of these tools in practical scenarios:
- **Reducing File Size**: As explained earlier, `csvcut` can be used to remove unnecessary columns, thus potentially reducing the file size:

```bash
csvcut -C 2,5,7 workSQLtest.csv > reduced_workSQLtest.csv
```
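For intuition, `csvcut -C 2,5,7` drops the 2nd, 5th, and 7th columns and keeps the rest. A minimal stdlib sketch of that column-dropping idea (the `drop_columns` helper and sample data are invented for illustration, and this is not csvkit's actual implementation):

```python
import csv
import io

def drop_columns(text, exclude):
    """Drop the 1-based column positions in `exclude` from CSV text,
    mirroring the spirit of `csvcut -C`."""
    skip = {i - 1 for i in exclude}  # convert to 0-based indices
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([v for i, v in enumerate(row) if i not in skip])
    return out.getvalue()

sample = "a,b,c\n1,2,3\n4,5,6\n"
print(drop_columns(sample, [2]))  # drops the second column, b
```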
- **Filtering Data**: Using `csvgrep` to keep only the rows where a specific column matches a particular criterion:

```bash
csvgrep -c 3 -m "SpecificValue" workSQLtest.csv > filtered_workSQLtest.csv
```
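Conceptually, this keeps the header plus every row whose third column contains the given string (csvkit's `-m` is roughly a substring match). A small stdlib sketch of that filtering idea, with an invented `grep_rows` helper and sample data:

```python
import csv
import io

def grep_rows(text, column, match):
    """Keep the header plus rows whose 1-based `column` contains `match`,
    similar in spirit to `csvgrep -c <column> -m <match>`."""
    rows = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(rows))  # header is always kept
    for row in rows:
        if match in row[column - 1]:
            writer.writerow(row)
    return out.getvalue()

sample = "id,status\n1,open\n2,closed\n3,open\n"
print(grep_rows(sample, 2, "open"))
```

The real `csvgrep` also supports regex matching (`-r`) and inverted matches (`-i`), which this sketch omits.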
- **Data Analysis**: Quickly generating statistics to understand the dataset better:

```bash
csvstat workSQLtest.csv
```
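`csvstat` reports per-column summaries such as row counts, distinct values, and frequent values. To show the flavor of what it computes, here is a tiny stdlib sketch (the `column_summary` helper and sample data are invented; `csvstat` itself reports far more, including types, nulls, and numeric statistics):

```python
import csv
import io
from collections import Counter

def column_summary(text):
    """A tiny flavor of what `csvstat` reports: per-column row count,
    number of distinct values, and the most common value."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    counters = [Counter() for _ in header]
    for row in rows:
        for counter, value in zip(counters, row):
            counter[value] += 1
    return {
        name: {
            "rows": sum(c.values()),
            "unique": len(c),
            "most_common": c.most_common(1)[0][0],
        }
        for name, c in zip(header, counters)
    }

sample = "city,year\nOslo,2020\nOslo,2021\nBergen,2020\n"
print(column_summary(sample))
```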
### Benefits of Using `csvkit`
- **Efficiency**: Operate directly on CSV files from the command line, speeding up data processing tasks.
- **Versatility**: Convert between various data formats and perform complex filtering and manipulation with simple commands.
- **Automation**: Easily integrate into scripts and pipelines for automated data processing.
### Conclusion
`csvkit` is an invaluable toolkit for anyone who frequently works with CSV files, especially in data analysis, database management, and automation tasks. Its command-line nature lets it integrate seamlessly into existing workflows, providing powerful data manipulation capabilities without the need for additional software.