# Local Data Processing With Unix Tools - Shell-based data wrangling
## Introduction
Leveraging standard Unix command-line tools for data processing is a powerful, efficient, and universally available method for handling text-based data. This guide focuses on the **Unix philosophy** of building complex data processing **pipelines** by composing small, single-purpose utilities. This approach is invaluable for ad-hoc data exploration, log analysis, and pre-processing tasks directly within the shell, often outperforming more complex scripts or dedicated software for common data wrangling operations.
Key applications include analyzing web server logs, filtering and transforming CSV/TSV files, and batch-processing any line-oriented text data.
## Core Concepts
### Streams and Redirection
At the core of Unix inter-process communication are three standard streams:
1. `stdin` (standard input): The stream of data going into a program.
2. `stdout` (standard output): The primary stream of data coming out of a program.
3. `stderr` (standard error): A secondary output stream for error messages and diagnostics.
**Redirection** controls these streams. The pipe `|` operator is the most important, as it connects one command's `stdout` to the next command's `stdin`, forming a pipeline.
```bash
# Redirect stdout to a file (overwrite)
command > output.txt
# Redirect stdout to a file (append)
command >> output.txt
# Redirect a file to stdin
command < input.txt
# Redirect stderr to a file
command 2> error.log
# Redirect stderr to stdout
command 2>&1
```
### The Core Toolkit
A small set of highly specialized tools forms the foundation of most data pipelines; the one-liners after this list show each tool in isolation.
- **`grep`**: Filters lines that match a regular expression.
- **`awk`**: A powerful pattern-scanning and processing language. It excels at columnar data, allowing you to manipulate fields within each line.
- **`sed`**: A "stream editor" for performing text transformations on an input stream (e.g., search and replace).
- **`sort`**: Sorts lines of text files.
- **`uniq`**: Reports or omits repeated lines. Often used with `-c` to count occurrences.
- **`cut`**: Removes sections from each line of files (e.g., select specific columns).
- **`tr`**: Translates or deletes characters.
- **`xargs`**: Builds and executes command lines from standard input. It bridges the gap between commands that produce lists of files and commands that operate on them.
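To make the descriptions concrete, here is one representative one-liner per tool. The file names (`input.txt`, `data.csv`) are placeholders for any line-oriented text file:
```bash
grep -i 'error' input.txt                   # keep only lines containing "error" (case-insensitive)
awk -F',' '{print $1, $3}' data.csv         # print the 1st and 3rd comma-separated fields
sed 's/foo/bar/g' input.txt                 # replace every "foo" with "bar"
sort input.txt | uniq -c                    # count how many times each distinct line appears
cut -d',' -f2 data.csv                      # extract the 2nd comma-separated column
tr '[:upper:]' '[:lower:]' < input.txt      # lowercase everything
find . -name '*.tmp' -print0 | xargs -0 rm  # delete all *.tmp files under the current directory
```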
## Key Principles
The effectiveness of this approach stems from the **Unix Philosophy**:
1. **Do one thing and do it well**: Each tool is specialized for a single task (e.g., `grep` only filters, `sort` only sorts).
2. **Write programs that work together**: The universal text stream interface (`stdin`/`stdout`) allows for near-infinite combinations of tools.
3. **Handle text streams**: Text is a universal interface, making the tools broadly applicable to a vast range of data formats.
## Implementation/Usage
Let's assume we have a web server access log file, `access.log`, with the following format:
`IP_ADDRESS - - [TIMESTAMP] "METHOD /path HTTP/1.1" STATUS_CODE RESPONSE_SIZE`
Example line:
`192.168.1.10 - - [20/Aug/2025:15:30:00 -0400] "GET /home HTTP/1.1" 200 5120`
### Basic Example
**Goal**: Find the top 5 IP addresses that accessed the server.
```bash
# This pipeline extracts, groups, counts, and sorts the IP addresses.
cat access.log | \
awk '{print $1}' | \
sort | \
uniq -c | \
sort -nr | \
head -n 5
```
**Breakdown:**
1. `cat access.log`: Reads the file and sends its content to `stdout`.
2. `awk '{print $1}'`: For each line, print the first field (the IP address).
3. `sort`: Sorts the IPs alphabetically, which is necessary for `uniq` to group them.
4. `uniq -c`: Collapses adjacent identical lines into one and prepends the count.
5. `sort -nr`: Sorts the result numerically (`-n`) and in reverse (`-r`) order to get the highest counts first.
6. `head -n 5`: Takes the first 5 lines of the sorted output.
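As a side note, the leading `cat` is not strictly required; `awk` can read the file directly, saving one process. The following is equivalent:
```bash
# Equivalent pipeline without the initial cat
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -n 5
```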
### Advanced Example
**Goal**: Calculate the total bytes served for all successful (`2xx` status code) `POST` requests.
```bash
# This pipeline filters for specific requests and sums a column.
grep '"POST ' access.log | \
grep ' 2[0-9][0-9] ' | \
awk '{total += $10} END {print total}'
```
**Breakdown:**
1. `grep '"POST ' access.log`: Filters the log for lines containing `"POST ` (the opening quote and trailing space anchor the match to the request-method field).
2. `grep ' 2[0-9][0-9] '`: Filters the remaining lines for a 2xx status code. The spaces ensure we match the status code field specifically.
3. `awk '{total += $10} END {print total}'`: For each line that passes the filters, `awk` adds the value of the 10th field (response size) to a running `total`. The `END` block executes after all lines are processed, printing the final sum.
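The same calculation can also be written as a single `awk` program, avoiding the two `grep` passes. This sketch assumes the exact field layout shown above (method in field 6, including its opening quote; status in field 9; size in field 10):
```bash
# One-pass version: filter and sum inside awk.
# "total + 0" prints 0 instead of an empty line when nothing matches.
awk '$6 == "\"POST" && $9 ~ /^2/ {total += $10} END {print total + 0}' access.log
```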
## Common Patterns
### Pattern 1: Filter-Map-Reduce
This is a functional programming pattern that maps directly to Unix pipelines; the sketch after the list below combines all three stages.
- **Filter**: Select a subset of data (`grep`, `head`, `tail`, `awk '/pattern/'`).
- **Map**: Transform each line of data (`awk '{...}'`, `sed 's/.../.../'`, `cut`).
- **Reduce**: Aggregate data into a summary result (`sort | uniq -c`, `wc -l`, `awk '{sum+=$1} END {print sum}'`).
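A minimal sketch combining the three stages against the `access.log` format described above (status code assumed in field 9, request path in field 7):
```bash
# Filter: keep only 404 responses
# Map:    extract the request path
# Reduce: count requests per path and show the 10 most common
awk '$9 == 404 {print $7}' access.log | \
sort | uniq -c | sort -nr | head -n 10
```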
### Pattern 2: Shuffling (Sort-Based Grouping)
This is the command-line equivalent of a `GROUP BY` operation in SQL. The pattern is to extract a key, sort by that key to group related records together, and then process each group.
```bash
# Example: Find the most frequent request path for each IP address.
# The key here is the IP address ($1); the request path is field $7.
awk '{print $1, $7}' access.log | \
sort | \
uniq -c | \
sort -k2,2 -k1,1nr | \
awk 'BEGIN{last=""} {if ($2 != last) {print} last=$2}'
```
This pipeline sorts first by IP (field 2 of the `uniq -c` output) and then by count in descending numeric order, and finally uses `awk` to print only the first (highest-count) line for each unique IP.
## Best Practices
- **Develop Incrementally**: Build pipelines one command at a time. After adding a `|` and a new command, run it to see if the intermediate output is what you expect.
- **Filter Early**: Place `grep` or other filtering commands as early as possible in the pipeline. This reduces the amount of data that subsequent, potentially more expensive commands like `sort` have to process.
- **Use `set -o pipefail`**: In shell scripts, this option causes a pipeline to return a failure status if *any* command in the pipeline fails, not just the last one (see the snippet after this list).
- **Prefer `awk` for Columns**: For tasks involving multiple columns, `awk` is generally more powerful, readable, and performant than a complex chain of `cut`, `paste`, and shell loops.
- **Beware of Locales**: The `sort` command's behavior is affected by the `LC_ALL` environment variable. For byte-wise sorting, use `LC_ALL=C sort`.
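A minimal script skeleton illustrating `set -o pipefail` and byte-wise sorting; the file names are placeholders:
```bash
#!/usr/bin/env bash
set -euo pipefail   # abort if any command in a pipeline fails

# Note: grep exits non-zero when it matches nothing, which pipefail will
# surface as a script failure -- decide whether that is the behavior you want.
grep 'ERROR' app.log | LC_ALL=C sort | uniq -c > error_summary.txt
```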
## Common Pitfalls
- **Forgetting to Sort Before `uniq`**: `uniq` only operates on adjacent lines. If the data is not sorted, it will not produce correct counts.
- **Overly Broad Regular Expressions**: In a regex, `.` matches any character, so a loose, unanchored pattern like ` . ` can match far more than intended. Be as specific as possible with your regex.
- **Shell Globbing vs. `grep` Regex**: The wildcards used by the shell (`*`, `?`) are different from those used in regular expressions (`.*`, `.`).
- **Word Splitting on Unquoted Variables**: In scripts, a variable whose value contains spaces is split into multiple arguments unless it is quoted (`"$file"` vs `$file`); see the snippet after this list.
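A small illustration of the word-splitting pitfall (the file name is hypothetical):
```bash
file="monthly report.txt"

wc -l $file     # unquoted: splits into two arguments ("monthly" and "report.txt") and fails
wc -l "$file"   # quoted: passed as a single argument, as intended
```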
## Performance Considerations
- **I/O is King**: These tools are often I/O-bound. Reading from and writing to disk is the slowest part. Use pipelines to avoid creating intermediate files.
- **`awk` vs. `sed` vs. `grep`**: For simple filtering, `grep` is fastest. For simple substitutions, `sed` is fastest. For any field-based logic, `awk` is the right tool and remains very fast, since the whole transformation runs in a single process rather than a shell loop.
- **GNU Parallel**: For tasks that can be broken into independent chunks (e.g., processing thousands of files), GNU `parallel` can run pipelines concurrently, dramatically speeding up the work on multi-core systems (see the sketch after this list).
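A sketch of the parallel approach, assuming GNU `parallel` is installed and that the log files live under a hypothetical `logs/` directory:
```bash
# Count 5xx responses in each log file, running one job per CPU core by default.
# --tag prefixes every output line with the file it came from;
# "|| true" keeps grep's non-zero exit (no matches) from being reported as a failure.
find logs/ -name '*.log' -print0 | \
  parallel -0 --tag "grep -c ' 5[0-9][0-9] ' {} || true"
```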
## Integration Points
- **Shell Scripting**: These tools are the fundamental building blocks for automation and data processing scripts in `bash`, `zsh`, etc.
- **Data Ingestion Pipelines**: Unix tools are often used as the first stage of an ETL process to clean, filter, and normalize raw log files before they are loaded into a database or data warehouse.
- **Other Languages**: Languages like Python (`subprocess`) and Go (`os/exec`) can invoke these command-line tools to leverage their performance and functionality without having to re-implement them.
## Troubleshooting
### Problem 1: Pipeline hangs or is extremely slow
**Symptoms:** The command prompt doesn't return, and there's no output.
**Solution:** This usually means a command such as `sort`, which must read *all* of its input before producing any output, is working through a very large data set.
1. Test your pipeline on a small subset of the data first using `head -n 1000`.
2. Use a tool like `pv` (pipe viewer) in the middle of your pipeline (`... | pv | ...`) to monitor the flow of data and see where it stalls, as shown below.
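For example, assuming `pv` is installed:
```bash
# 1. Validate the logic on a small sample first
head -n 1000 access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 5

# 2. Watch throughput to see at which stage the data stops flowing
pv access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 5
```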
### Problem 2: `xargs` fails on filenames with spaces
**Symptoms:** An `xargs` command fails with "file not found" errors for files with spaces or special characters in their names.
**Solution:** Use the "null-delimited" mode of `find` and `xargs`, which is designed to handle all possible characters in filenames safely.
```bash
# Wrong way, will fail on "file name with spaces.txt"
find . -name "*.txt" | xargs rm
# Correct, safe way
find . -name "*.txt" -print0 | xargs -0 rm
```
## Examples in Context
- **DevOps/SRE**: Quickly grepping through gigabytes of Kubernetes logs to find error messages related to a specific request ID.
- **Bioinformatics**: Processing massive FASTA/FASTQ text files to filter, reformat, or extract sequence data.
- **Security Analysis**: Analyzing `auth.log` files to find failed login attempts, group them by IP, and identify brute-force attacks, as sketched below.
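A sketch of the security-analysis case. The exact message wording and field positions in `auth.log` vary by distribution and `sshd` version, so the field index below is an assumption to verify against your own logs:
```bash
# Top source IPs for failed SSH logins.
# In a typical "Failed password for ... from <IP> port <N> ssh2" line,
# the IP sits four fields from the end, i.e. $(NF-3).
grep 'Failed password' /var/log/auth.log | \
  awk '{print $(NF-3)}' | \
  sort | uniq -c | sort -nr | head -n 10
```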
## References
- [The GNU Coreutils Manual](https://www.gnu.org/software/coreutils/manual/coreutils.html)
- [The AWK Programming Language (Book by Aho, Kernighan, Weinberger)](https://archive.org/details/pdfy-MgN0H1joIoDVoIC7)
- [Greg's Wiki - Bash Pitfalls](https://mywiki.wooledge.org/BashPitfalls)
## Related Topics
- Shell Scripting
- Regular Expressions (Regex)
- AWK Programming
- Data Wrangling