Day 12 - Finding Things and Text Processing I (grep, cut, sort, uniq, wc, tee)

2025-10-04 · 7 min read

linux, text, grep, cut, sort, uniq, wc, tee, regex, logs


Command-line text tools make it possible to filter and summarize large output quickly. This lesson focuses on grep, cut, sort, uniq, wc, and tee, and shows how to combine them into readable, reliable pipelines.

What these tools do
  • grep finds lines that match a pattern
  • cut extracts delimited fields, or byte and character ranges
  • sort orders lines by a chosen key
  • uniq collapses identical adjacent lines and can count
  • wc prints line, word, and byte totals
  • tee splits a stream so it can be saved and also piped forward
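
These tools combine naturally in a single pipeline. As a taste of what is ahead, here is a hedged sketch that assumes a hypothetical app.log whose lines look like "2025-10-02 ERROR module=auth ..."; it filters, counts, ranks, archives, and summarizes in one pass.

bash
# filter ERROR lines, count occurrences of the third field, rank the counts,
# save the ranked summary, and report how many distinct values were seen
grep -F "ERROR" app.log \
| cut -d' ' -f3 \
| sort \
| uniq -c \
| sort -nr \
| tee error_summary.txt \
| wc -l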

Prerequisites

  • Day 1 through Day 11 completed
  • Comfortable with pipes and redirection from earlier days

grep fundamentals

bash
# basic search with line numbers
grep -n "ssh" /var/log/auth.log 2>/dev/null || grep -n "sshd" /var/log/secure 2>/dev/null

# case insensitive and whole word
grep -niw "error" /var/log/syslog 2>/dev/null || grep -niw "error" /var/log/messages 2>/dev/null

# show context lines around matches
grep -nC 2 "Failed password" /var/log/auth.log 2>/dev/null

# only the matching part per line
grep -oE "[0-9]{1,3}(\.[0-9]{1,3}){3}" /var/log/auth.log 2>/dev/null | head

Common options:

  • -n line numbers, -H show file name, -R recursive, -i ignore case, -w match words
  • -E extended regex, -F fixed string search, -v invert match
  • -A N, -B N, -C N for context after, before, or both
  • --color=auto highlight matches

Binary files and performance

Use --binary-files=without-match to skip binaries. For very large trees, prefer fixed strings with -F or use faster tools such as ripgrep when available. Always try a small sample first with head.
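
A minimal sketch of those flags in use; the search strings and paths are only placeholders.

bash
# skip binary files while searching a source tree
grep -R --binary-files=without-match -n "TODO" src/ | head

# fixed-string search with -F is faster when no regex features are needed
grep -RF "connection refused" /var/log/ 2>/dev/null | head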

Regex primer for today

  • Literal characters match themselves
  • . any character, * repeat zero or more, + one or more, ? optional
  • [] character class, [^] negated class, (a|b) alternation
  • Anchors: ^ start of line, $ end of line

Examples:

bash
# lines that start with a timestamp like 2025-10-02
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" app.log

# extract IPv4 addresses
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" access.log | head

# exclude noisy lines
grep -vE "healthcheck|ELB-HealthChecker|robots.txt" access.log | head

cut for fields

cut works best on simple delimited text such as CSV or colon separated records.

bash
# system users: fields from /etc/passwd
cut -d: -f1,3,7 /etc/passwd | head

# Nginx or Apache common log format: extract remote IP and path
cut -d' ' -f1,7 /var/log/nginx/access.log 2>/dev/null | head

# CSV: choose delimiter explicitly
cut -d, -f1,3,5 data.csv | head

Notes:

  • -d sets the delimiter, default is tab
  • -f selects fields, ranges like -f2-4 are allowed
  • For complex quoting or embedded delimiters, prefer awk which is covered tomorrow
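
The range syntax mentioned above, shown on /etc/passwd:

bash
# fields 1 through 4: user name, password field, UID, GID
cut -d: -f1-4 /etc/passwd | head

# everything from field 6 onward: home directory and shell
cut -d: -f6- /etc/passwd | head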

sort and uniq for ranking

Pipelines often use sort then uniq -c to count occurrences.

bash
# top IPs in auth failures
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" /var/log/auth.log 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20

Key options:

  • sort -n numeric, -r reverse, -h human numbers like 1K 2M
  • -k choose key, -t field delimiter, -u unique
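
For instance, -h ranks human-readable sizes such as du output, and -u deduplicates during the sort itself (paths are illustrative):

bash
# largest log directories first, using human-readable units
du -sh /var/log/* 2>/dev/null | sort -hr | head

# unique, sorted list of login shells
cut -d: -f7 /etc/passwd | sort -u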

Examples with fields:

bash
# rank requests by path from an access log
cut -d' ' -f7 /var/log/nginx/access.log 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20

# sort by the 9th field (status code) numerically
sort -k9,9n -t' ' /var/log/nginx/access.log 2>/dev/null | head

Stable and locale-neutral sorting

For predictable results across systems, set LC_ALL=C to force a byte-order comparison. Example: LC_ALL=C sort.
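
A quick way to see whether the locale changes a result is to compare both runs; names.txt is a placeholder for any text file.

bash
# byte-order sort, independent of the current locale
LC_ALL=C sort names.txt > bytes.txt

# locale-aware sort for comparison
sort names.txt > locale.txt

# any differences come from locale collation rules
diff bytes.txt locale.txt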

wc for quick totals

bash
# lines, words, bytes
wc -l app.log
wc -w README.md
wc -c /bin/ls

# count matches by combining grep and wc
grep -c "ERROR" app.log
# or
grep "ERROR" app.log | wc -l

grep -c is faster because it does not print matching lines.
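
To verify this on your own data, time both variants; big.log stands in for any large file.

bash
# counts matching lines without printing them
time grep -c "ERROR" big.log

# prints every matching line, then counts them
time grep "ERROR" big.log | wc -l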

tee for saving while streaming

tee writes input to a file and also passes it through. This is useful when a long pipeline should also be archived.

bash
# capture and continue processing (make sure the target directory exists)
mkdir -p ~/logs
journalctl -u ssh 2>/dev/null \
| tee ~/logs/ssh.log \
| grep -i "authentication failure" \
| wc -l

Append with tee -a to avoid overwriting.
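
For example, a timestamped count can be appended to a running history; app.log and error_counts.log are illustrative names.

bash
# add today's error count to the history without overwriting previous runs
printf '%s %s\n' "$(date +%F)" "$(grep -c ERROR app.log)" | tee -a error_counts.log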

Putting it together: practical patterns

1) Count unique visitors per day from a web log

bash
# assumes common format with remote IP as field 1 and date like [02/Oct/2025:...]
cut -d' ' -f1,4 /var/log/nginx/access.log 2>/dev/null \
| sed 's/\[//; s/:.*$//' \
| sort \
| uniq \
| awk '{print $2}' \
| sort \
| uniq -c \
| sort -nr \
| head -20

2) Show the most frequent failing usernames in auth logs

bash
grep -E "Failed password for (invalid user )?[a-zA-Z0-9_-]+" /var/log/auth.log 2>/dev/null \
| sed -E 's/.*Failed password for (invalid user )?([a-zA-Z0-9_-]+).*/\2/' \
| sort \
| uniq -c \
| sort -nr \
| head -20

3) Extract timing data from app logs

bash
grep -oE "time_ms=[0-9]+" app.log \
| cut -d= -f2 \
| sort -n \
| awk 'NR%100==0{print "p" NR, $0}'
Order matters in pipelines

uniq only collapses adjacent duplicates, so sort must usually come first. When counting, the typical pattern is sort | uniq -c | sort -nr.
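
A tiny demonstration with inline data shows why the sort is required:

bash
# without sort, uniq -c only counts adjacent repeats
printf 'a\nb\na\na\nb\n' | uniq -c

# with sort, identical lines become adjacent and the counts are correct
printf 'a\nb\na\na\nb\n' | sort | uniq -c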

Performance and reliability tips

  • Filter early to avoid pushing huge data through the whole pipeline
  • Use -F for fixed strings in grep when regex is not needed
  • Restrict search to a subtree with grep -R path/ and add --exclude patterns for large trees
  • For multi gigabyte logs, stream with zcat or zgrep on compressed files
  • Keep intermediate files in /tmp only when necessary, otherwise stream with pipes

Examples:

bash
# search inside compressed logs
zgrep -n "timeout" /var/log/nginx/*.gz 2>/dev/null | head

# exclude vendor directories and node_modules
grep -R --exclude-dir={.git,node_modules,vendor} -n "TODO" . | head

Practical lab

  1. Create a sample dataset and practice extraction and counting.
bash
mkdir -p ~/playground/day12 && cd ~/playground/day12
cat > data.csv <<'EOF'
user,action,duration_ms
alice,login,120
bob,login,80
alice,upload,560
alice,logout,20
bob,upload,140
bob,logout,30
charlie,login,200
EOF

# top actions by frequency
cut -d, -f2 data.csv | tail -n +2 | sort | uniq -c | sort -nr

# average duration per action (quick and rough)
cut -d, -f2,3 data.csv | tail -n +2 | sort -t, -k1,1 | \
awk -F, '{sum[$1]+=$2; cnt[$1]++} END{for(k in sum) printf "%s %.2f\n", k, sum[k]/cnt[k]}' | sort
  2. Investigate authentication failures on the system log.
bash
grep -E "Failed password|authentication failure" /var/log/auth.log 2>/dev/null \
| tee failures.log \
| grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" \
| sort | uniq -c | sort -nr | head
  3. Extract HTTP status distribution from an access log.
bash
cut -d' ' -f9 /var/log/nginx/access.log 2>/dev/null | sort | uniq -c | sort -nr

Troubleshooting

  • No output appears. Add -n or -H to grep for clarity, and check file paths and permissions
  • cut returns empty fields. Confirm the delimiter and field numbers, and remember that runs of spaces produce empty fields because cut treats every single delimiter as a field boundary
  • Counts look wrong. Ensure that sort precedes uniq -c
  • Pipelines appear slow. Filter earlier, or try grep -F. Consider writing a small subset to a temp file for iteration with tee
  • Locale differences in sort. Force LC_ALL=C to avoid accent or locale specific rules when numeric or byte order is expected
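
When aligned output uses runs of spaces, squeezing them with tr -s before cut avoids those empty fields; ps output is used here purely as an illustration.

bash
# collapse repeated spaces so each column becomes exactly one field
ps aux | tr -s ' ' | cut -d' ' -f1,2,11 | head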

Next steps

Day 13 expands text tooling with find, xargs, sed, and awk for bulk edits and richer reporting. It shows safe mass renames, search and replace across files, and building quick one liners to summarize CSV and logs.