Add docs/tech_docs/csvkit.md

2024-04-19 15:02:22 +00:00
parent 684fc9582f
commit 114d8a31ef
1 changed files with 69 additions and 0 deletions
--- a/docs/tech_docs/csvkit.md
+++ b/docs/tech_docs/csvkit.md
@@ -0,0 +1,69 @@
+To get a CSV file under 10 MB, you can use command-line tools available on Linux. The `csvkit` suite is one effective option: `csvcut` drops unneeded columns and `csvgrep` keeps only the rows that match a given criterion. The steps below walk through the process:
+
+### 1. **Install `csvkit`**
+If you don't already have `csvkit` installed, you can install it via pip (Python's package manager):
+
+```bash
+pip install csvkit
+```
+
+### 2. **Check the Current File Size**
+First, confirm whether `workSQLtest.csv` actually exceeds 10 MB. You noted it is around 2.9 MB, in which case it is already under the limit and needs no resizing. If you have other files that may need resizing, check their sizes with:
+
+```bash
+ls -lh <filename>
+```
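+
+To turn that check into a yes/no answer, a minimal sketch can compare the byte count against the limit. The function name `is_under_limit` is hypothetical, and the default assumes that 10 MB means 10 × 1024 × 1024 bytes:
+
+```bash
+# Hypothetical helper: succeeds if the file is strictly smaller than the limit.
+# Assumes "10 MB" means 10 * 1024 * 1024 bytes; pass a second argument to override.
+is_under_limit() {
+    local file="$1"
+    local limit_bytes="${2:-$((10 * 1024 * 1024))}"
+    local size_bytes
+    size_bytes=$(wc -c < "$file") || return 2   # wc -c counts bytes
+    [ "$size_bytes" -lt "$limit_bytes" ]
+}
+```
+
+For example, `is_under_limit workSQLtest.csv && echo "no resizing needed"` prints the message only when the file is already small enough.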
+
+### 3. **Analyze the CSV File**
+Before resizing, analyze the file to understand what data it contains, which will help you decide what to keep and what to cut:
+
+```bash
+csvstat workSQLtest.csv
+```
+
+### 4. **Reduce File Size**
+Depending on the analysis, you can choose one of the following methods:
+
+#### a. **Remove Unnecessary Columns**
+If the file has columns that aren't needed, you can remove them using `csvcut`:
+
+```bash
+csvcut -C column_number_to_remove workSQLtest.csv > reduced_workSQLtest.csv
+```
+
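+Replace `column_number_to_remove` with a comma-separated list of the column numbers (1-based) or names you want to omit, for example `-C 3,5`. Running `csvcut -n workSQLtest.csv` lists each column with its number.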
+
+#### b. **Filter Rows**
+Use `csvgrep` to keep only the rows you need (for example, a particular date or category):
+
+```bash
+csvgrep -c column_name -m match_value workSQLtest.csv > filtered_workSQLtest.csv
+```
+
+Replace `column_name` and `match_value` with the column to search and the string to match. This keeps the rows whose `column_name` contains `match_value`; add `-i` to invert the match and drop those rows instead.
+
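+#### c. **Split the CSV**
+If all of the data is essential, split the CSV into smaller parts instead. `csvkit` does not ship a `csvsplit` command, so this example uses GNU `split` and assumes that no field contains an embedded newline:
+
+```bash
+# Write parts of 10,000 data rows each (part_aa.csv, part_ab.csv, ...), repeating the header in every part.
+tail -n +2 workSQLtest.csv | split -l 10000 --filter='{ head -n 1 workSQLtest.csv; cat; } > "$FILE.csv"' - part_
+```
+
+Pick a row count that keeps each part under the size limit.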
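+
+When splitting by row count to meet a size limit, the number of rows per part can be estimated from the average row size. The sketch below uses only standard tools. The function name `rows_per_part` and the 10% headroom are assumptions, and it presumes one record per line (no embedded newlines in quoted fields):
+
+```bash
+# Hypothetical helper: estimate how many rows fit in one part of at most
+# limit_bytes, based on the file's average bytes per line.
+rows_per_part() {
+    local file="$1"
+    local limit_bytes="${2:-$((10 * 1024 * 1024))}"
+    local size_bytes line_count avg rows
+    size_bytes=$(wc -c < "$file") || return 2
+    line_count=$(wc -l < "$file")
+    [ "$line_count" -gt 0 ] || return 1
+    # Average bytes per line, rounded up.
+    avg=$(( (size_bytes + line_count - 1) / line_count ))
+    # Keep 10% headroom because individual rows vary in length.
+    rows=$(( limit_bytes * 9 / 10 / avg ))
+    [ "$rows" -ge 1 ] || rows=1
+    echo "$rows"
+}
+```
+
+The result is only an estimate, so it is still worth checking the size of each part afterwards; files with very uneven row lengths may need a smaller count.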
+
+### 5. **Check the New File Size**
+After modifying the file, check the new file size:
+
+```bash
+ls -lh reduced_workSQLtest.csv
+```
+
+or 
+
+```bash
+ls -lh filtered_workSQLtest.csv
+```
+
+Use these commands to confirm the file is now under the desired size limit.
+
+Together, these commands let you trim or split a CSV file directly from the command line until it fits the size limit.