Add docs/tech_docs/csvkit.md

2024-04-19 15:02:22 +00:00
parent 684fc9582f
commit 114d8a31ef

docs/tech_docs/csvkit.md Normal file

@@ -0,0 +1,69 @@
If you need to shrink a CSV file to under 10 MB, you can do it with command-line tools available on Linux. One effective approach is the `csvkit` suite: `csvcut` removes unnecessary columns, and `csvgrep` keeps or drops rows based on specific criteria. Here are a few ways to approach this:
### 1. **Install `csvkit`**
If you don't already have `csvkit` installed, you can install it via pip (Python's package manager):
```bash
pip install csvkit
```
### 2. **Check the Current File Size**
First, confirm whether the file actually exceeds 10 MB. You noted that `workSQLtest.csv` is around 2.9 MB, which is already under the limit, so it may not need resizing at all. If you have other files that need resizing, you can check their sizes using:
```bash
ls -lh <filename>
```
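To make the same check in a script rather than by eye, a minimal sketch follows. The function name `over_limit` is hypothetical, and the limit is assumed to mean 10 × 1024 × 1024 bytes; adjust it if your limit is defined in decimal megabytes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exits 0 when FILE is larger than LIMIT_MB (default 10),
# where 1 MB is assumed to be 1024 * 1024 bytes.
over_limit() {
    local file="$1"
    local limit_mb="${2:-10}"
    local size_bytes
    # wc -c with input redirection prints only the byte count;
    # the arithmetic expansion strips any padding some platforms add.
    size_bytes=$(( $(wc -c < "$file") ))
    [ "$size_bytes" -gt $(( limit_mb * 1024 * 1024 )) ]
}

# Usage (assumes the file exists):
#   over_limit workSQLtest.csv && echo "needs reducing" || echo "already under the limit"
```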
### 3. **Analyze the CSV File**
Before resizing, analyze the file to understand what data it contains, which will help you decide what to keep and what to cut:
```bash
csvstat workSQLtest.csv
```
### 4. **Reduce File Size**
Depending on the analysis, you can choose one of the following methods:
#### a. **Remove Unnecessary Columns**
If the file has columns that aren't needed, you can remove them using `csvcut`:
```bash
csvcut -C column_number_to_remove workSQLtest.csv > reduced_workSQLtest.csv
```
Replace `column_number_to_remove` with the number or name of the column you want to omit. To omit several columns, separate them with commas (for example, `2,5`).
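It helps to list the columns before deciding what to drop. The sketch below wraps both steps in one function; `drop_columns` and its argument order are hypothetical, while `csvcut -n` (list column names with their indices) and `csvcut -C` (exclude columns) are the csvkit options it relies on.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a copy of INPUT without the listed columns.
# COLUMNS is a comma-separated list of column names or 1-based indices,
# passed straight through to csvcut -C.
drop_columns() {
    local input="$1"
    local columns="$2"
    local output="$3"
    # Show the column indices and names on stderr so the selection can be checked.
    csvcut -n "$input" >&2
    # Write every column except the listed ones to the output file.
    csvcut -C "$columns" "$input" > "$output"
}

# Usage (column names here are placeholders):
#   drop_columns workSQLtest.csv "notes,comments" reduced_workSQLtest.csv
```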
#### b. **Filter Rows**
If only certain rows are needed (e.g., particular dates or entries), use `csvgrep`, which keeps the rows that match:
```bash
csvgrep -c column_name -m match_value workSQLtest.csv > filtered_workSQLtest.csv
```
Replace `column_name` and `match_value` with the appropriate column and the value you want to match. The output contains only the matching rows; to drop the matching rows instead, add the `-i` (`--invert-match`) flag.
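#### c. **Split the CSV**
If the dataset is too large and all data is essential, consider splitting the CSV into smaller parts. csvkit's documented tool list does not include a `csvsplit` command, but running `csvgrep` once per distinct value of a column has a similar effect:
```bash
# Repeat for each distinct value in column_name
csvgrep -c column_name -m match_value workSQLtest.csv > workSQLtest_match_value.csv
```
Each run writes the rows whose `column_name` contains `match_value` to its own file.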
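When no single column divides the data conveniently, splitting by row count is another option. The following is a minimal sketch that uses the standard `head`, `tail`, and `split` utilities rather than csvkit. The function name `split_csv_by_rows` and the `chunk_` prefix are hypothetical, and the approach assumes that no field contains an embedded newline, so each record occupies exactly one line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split INPUT into files of at most ROWS data rows each,
# repeating the header line at the top of every file.
# Assumes one record per line (no quoted fields with embedded newlines).
split_csv_by_rows() {
    local input="$1"
    local rows="$2"
    local prefix="${3:-chunk_}"
    local header piece
    header=$(head -n 1 "$input")
    # Split everything after the header into pieces named ${prefix}aa, ${prefix}ab, ...
    tail -n +2 "$input" | split -l "$rows" - "$prefix"
    for piece in "$prefix"??; do
        # Skip the literal pattern if split produced no pieces (header-only input).
        [ -e "$piece" ] || continue
        # Prepend the header and give the piece a .csv extension.
        { printf '%s\n' "$header"; cat "$piece"; } > "$piece.csv"
        rm -f "$piece"
    done
}

# Usage: at most 50000 data rows per file, written as part_aa.csv, part_ab.csv, ...
#   split_csv_by_rows workSQLtest.csv 50000 part_
```

Pick the row count so that each resulting file lands under the size limit, then confirm with `ls -lh`.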
### 5. **Check the New File Size**
After modifying the file, check the new file size:
```bash
ls -lh reduced_workSQLtest.csv
```
or
```bash
ls -lh filtered_workSQLtest.csv
```
Use these commands to confirm the file is now under the desired size limit.
These tools offer a powerful way to manipulate CSV files directly from the command line, allowing for quick resizing and adjustment of data files to meet specific constraints.