From 114d8a31ef6931cb789fd7ad65d95dcafa120101 Mon Sep 17 00:00:00 2001
From: medusa
Date: Fri, 19 Apr 2024 15:02:22 +0000
Subject: [PATCH] Add docs/tech_docs/csvkit.md

---
 docs/tech_docs/csvkit.md | 69 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100644 docs/tech_docs/csvkit.md

diff --git a/docs/tech_docs/csvkit.md b/docs/tech_docs/csvkit.md
new file mode 100644
index 0000000..a31abc6
--- /dev/null
+++ b/docs/tech_docs/csvkit.md
@@ -0,0 +1,69 @@
+If you need to shrink a CSV file to under 10 MB, you can do so with command-line tools available on Linux. One effective approach is the `csvkit` suite, specifically `csvcut` to drop unnecessary columns or `csvgrep` to filter rows by specific criteria. Here are a couple of ways you might approach this:
+
+### 1. **Install `csvkit`**
+If `csvkit` is not already installed, install it via pip (Python's package manager):
+
+```bash
+pip install csvkit
+```
+
+### 2. **Check the Current File Size**
+First, check whether `workSQLtest.csv` actually exceeds 10 MB. You noted it is around 2.9 MB, which is already under the limit, so it may not need resizing at all. If you have other files that need resizing, you can check their sizes using:
+
+```bash
+ls -lh
+```
+
+### 3. **Analyze the CSV File**
+Before resizing, analyze the file to understand what data it contains, which will help you decide what to keep and what to cut:
+
+```bash
+csvstat workSQLtest.csv
+```
+
+### 4. **Reduce File Size**
+Depending on the analysis, you can choose one of the following methods:
+
+#### a. **Remove Unnecessary Columns**
+If the file has columns that aren't needed, you can remove them using `csvcut`:
+
+```bash
+csvcut -C column_number_to_remove workSQLtest.csv > reduced_workSQLtest.csv
+```
+
+Replace `column_number_to_remove` with the numbers (or names) of the columns you want to omit, separated by commas.
+
+#### b. **Filter Rows**
+If only some rows are needed (e.g., certain dates or entries), use `csvgrep` to keep them:
+
+```bash
+csvgrep -c column_name -m match_value workSQLtest.csv > filtered_workSQLtest.csv
+```
+
+Replace `column_name` and `match_value` with the appropriate column and the value to match. `csvgrep` keeps the matching rows; add `-i` to drop them and keep everything else instead.
+
+#### c. **Split the CSV**
+If the dataset is too large and all data is essential, consider splitting the CSV into smaller parts. `csvkit` does not include a `csvsplit` command, but `csvgrep` can extract one part at a time:
+
+```bash
+csvgrep -c column_name -r '^value$' workSQLtest.csv > workSQLtest_value.csv
+```
+
+This writes the rows whose `column_name` is exactly `value` to their own file. Repeat the command once per distinct value in that column to split the whole file.
+
+### 5. **Check the New File Size**
+After modifying the file, check the new file size:
+
+```bash
+ls -lh reduced_workSQLtest.csv
+```
+
+or
+
+```bash
+ls -lh filtered_workSQLtest.csv
+```
+
+Use these commands to confirm the file is now under the desired size limit.
+
+These tools let you manipulate CSV files directly from the command line, so you can quickly trim data files to meet a size constraint.
\ No newline at end of file
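The size checks in steps 2 and 5 rely on reading `ls -lh` output by eye. As a rough sketch of how that check could be scripted, the helper below compares a file's byte count against a limit. The function name `under_limit` and the choice to treat 10 MB as 10 × 1024 × 1024 bytes are assumptions made for illustration; they are not part of `csvkit` or of the original steps.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of csvkit): succeeds when a file is
# smaller than a size limit given in bytes.

# Assumption: "10 MB" is interpreted as 10 MiB (10 * 1024 * 1024 bytes).
LIMIT_BYTES=$((10 * 1024 * 1024))

under_limit() {
    local file="$1"
    local limit="${2:-$LIMIT_BYTES}"
    local size
    # wc -c reports the byte count; the arithmetic expansion strips any padding.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -lt "$limit" ]
}

# Report on every file passed as an argument.
for f in "$@"; do
    if under_limit "$f"; then
        echo "$f: under limit"
    else
        echo "$f: too large"
    fi
done
```

Saved as, say, `check_size.sh`, it could be run as `bash check_size.sh reduced_workSQLtest.csv filtered_workSQLtest.csv` after the reduction steps to see which outputs still need trimming.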