Add docs/tech_docs/csvkit.md
If you need to shrink a CSV file to under 10 MB, you can do so with command-line tools available on Linux. One effective approach is the `csvkit` suite: use `csvcut` to drop unnecessary columns or `csvgrep` to keep only the rows that match specific criteria. The steps below walk through this:

### 1. **Install `csvkit`**
If you don't already have `csvkit` installed, you can install it via pip (Python's package manager):

```bash
pip install csvkit
```

### 2. **Check the Current File Size**
First, confirm whether `workSQLtest.csv` actually exceeds 10 MB. At the roughly 2.9 MB you noted, it is already under the limit and needs no resizing. For any other files that might need resizing, check their sizes with:

```bash
ls -lh <filename>
```
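To turn that visual check into a yes/no answer, the size can be compared against the limit in bytes. This is a minimal sketch, not part of `csvkit`: the helper name `is_under_limit` and the `LIMIT_BYTES` variable are illustrative, and it assumes "10 MB" means 10 × 1024 × 1024 bytes.

```bash
# Assumed limit: 10 MB counted as 10 * 1024 * 1024 bytes.
LIMIT_BYTES=$((10 * 1024 * 1024))

# is_under_limit FILE [LIMIT]: succeed (exit 0) when FILE is smaller than LIMIT bytes.
is_under_limit() {
  local file=$1 limit=${2:-$LIMIT_BYTES}
  local size
  size=$(wc -c < "$file")   # byte count of the file
  [ "$size" -lt "$limit" ]
}
```

For example, `is_under_limit workSQLtest.csv && echo "no resizing needed"` prints the message only when the file is below the limit. If the limit you are working against counts 1 MB as 1,000,000 bytes, pass `10000000` as the second argument.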

### 3. **Analyze the CSV File**
Before resizing, analyze the file to understand what data it contains, which will help you decide what to keep and what to cut:

```bash
csvstat workSQLtest.csv
```
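`csvcut` selects columns by number or name, so a numbered list of the header fields is handy alongside the `csvstat` output. `csvkit` can print one with `csvcut -n workSQLtest.csv`; the sketch below approximates that listing with standard tools only. The function name `list_columns` is illustrative, and it assumes the header row contains no quoted fields with embedded commas.

```bash
# list_columns FILE: print each header field with its 1-based position.
list_columns() {
  # Take the header row, strip a Windows-style carriage return if present,
  # put each field on its own line, then number the lines.
  head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | awk '{ printf "%d: %s\n", NR, $0 }'
}
```

The positions it prints are the numbers that `csvcut -C` expects when excluding columns.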

### 4. **Reduce File Size**
Depending on the analysis, you can choose one of the following methods:

#### a. **Remove Unnecessary Columns**
If the file has columns that aren't needed, you can remove them using `csvcut`:

```bash
csvcut -C column_number_to_remove workSQLtest.csv > reduced_workSQLtest.csv
```

Replace `column_number_to_remove` with the numbers (or names) of the columns you want to omit, separated by commas.
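When it is not obvious which columns to drop, a rough per-column size estimate shows where the space goes. This is a sketch using `awk` rather than `csvkit`; the function name `column_sizes` is illustrative, and it splits on every comma, so the totals are only reliable when no field contains a quoted comma or line break.

```bash
# column_sizes FILE: print "position total_length header_name" for each column.
# Lengths are character counts of the data rows (equal to bytes for ASCII data).
column_sizes() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; cols = NF; next }
    { for (i = 1; i <= NF; i++) total[i] += length($i); if (NF > cols) cols = NF }
    END { for (i = 1; i <= cols; i++) printf "%d %d %s\n", i, total[i], name[i] }
  ' "$1"
}
```

Piping the result through `sort -k2,2nr` lists the largest columns first, which are the ones whose removal saves the most space.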

#### b. **Filter Rows**
If you only need certain rows (e.g., specific dates or entries), keep them with `csvgrep`:

```bash
csvgrep -c column_name -m match_value workSQLtest.csv > filtered_workSQLtest.csv
```

Replace `column_name` and `match_value` with the column to search and the value to match. `csvgrep` keeps the rows that match; add `-i` to invert the match and drop those rows instead.
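Because `-m` matches any cell that contains the given text, and `csvgrep` keeps matching rows by default, it can help to spell out the two common cases: keep the rows whose value matches exactly, or drop them. The sketch below wraps `csvgrep` with its `-r` (regular expression) and `-i` (invert match) options. The function names `keep_rows` and `drop_rows` are illustrative, and the value is assumed to contain no regex metacharacters.

```bash
# keep_rows FILE COLUMN VALUE: output only the rows where COLUMN equals VALUE.
keep_rows() {
  # Anchoring with ^ and $ turns the substring search into an exact match.
  csvgrep -c "$2" -r "^$3\$" "$1"
}

# drop_rows FILE COLUMN VALUE: output every row except those where COLUMN equals VALUE.
drop_rows() {
  csvgrep -c "$2" -r "^$3\$" -i "$1"
}
```

For example, `drop_rows workSQLtest.csv column_name match_value > filtered_workSQLtest.csv` removes the matching rows and keeps everything else.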
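
#### c. **Split the CSV**
If the dataset is too large and all data is essential, consider splitting the CSV into smaller parts. `csvsplit` is not one of the tools that `csvkit` installs, so one option is to run `csvgrep` once for each value of a column:

```bash
# One output file per value of the chosen column
csvgrep -c column_name -m first_value workSQLtest.csv > part_first_value.csv
csvgrep -c column_name -m second_value workSQLtest.csv > part_second_value.csv
```

This writes one file per value of the specified column; repeat the command for each value you need to cover.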
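When the goal is simply to get every piece under a size limit, splitting by row count while repeating the header in each part is another option. This is a sketch built on the standard `head`, `tail`, and `split` utilities rather than on `csvkit`; the function name `split_csv`, its arguments, and the output naming are illustrative, and it assumes no field contains an embedded line break.

```bash
# split_csv FILE ROWS PREFIX: write PREFIX_1.csv, PREFIX_2.csv, ... with at most
# ROWS data rows each, every part starting with the original header row.
split_csv() {
  local file=$1 rows=$2 prefix=$3
  local header chunk n=1
  header=$(head -n 1 "$file")
  # Split everything after the header into temporary body chunks.
  tail -n +2 "$file" | split -l "$rows" - "${prefix}_body_"
  for chunk in "${prefix}_body_"*; do
    [ -e "$chunk" ] || continue   # no data rows, so nothing to write
    { printf '%s\n' "$header"; cat "$chunk"; } > "${prefix}_${n}.csv"
    rm -f "$chunk"
    n=$((n + 1))
  done
}
```

For example, `split_csv workSQLtest.csv 50000 workSQLtest_part` writes `workSQLtest_part_1.csv`, `workSQLtest_part_2.csv`, and so on. The row count of 50000 is arbitrary and would need adjusting until each part is under the limit.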

### 5. **Check the New File Size**
After modifying the file, check the new file size:

```bash
ls -lh reduced_workSQLtest.csv
```

or

```bash
ls -lh filtered_workSQLtest.csv
```

Use these commands to confirm the file is now under the desired size limit.

These tools offer a powerful way to manipulate CSV files directly from the command line, allowing for quick resizing and adjustment of data files to meet specific constraints.