Day 12 - Finding Things and Text Processing I (grep, cut, sort, uniq, wc, tee)

2025-10-04 · 7 min read

linux, text, grep, cut, sort, uniq, wc, tee, regex, logs


Command-line text tools make it possible to filter and summarize large output quickly. This lesson focuses on grep, cut, sort, uniq, wc, and tee, and shows how to combine them into readable, reliable pipelines.

What these tools do
  • grep finds lines that match a pattern
  • cut extracts delimited fields, or byte and character ranges
  • sort orders lines by a chosen key
  • uniq collapses identical adjacent lines and can count
  • wc prints line, word, and byte totals
  • tee splits a stream so it can be saved and also piped forward
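
These tools combine naturally in a single pipeline. As a taste of what is ahead, here is a hedged sketch that assumes a hypothetical app.log whose lines look like "2025-10-02 ERROR module=auth ..."; it filters, counts, ranks, archives, and summarizes in one pass.

bash
# filter ERROR lines, count occurrences of the third field, rank the counts,
# save the ranked summary, and report how many distinct values were seen
grep -F "ERROR" app.log \
| cut -d' ' -f3 \
| sort \
| uniq -c \
| sort -nr \
| tee error_summary.txt \
| wc -l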

Prerequisites

  • Day 1 through Day 11 completed
  • Comfortable with pipes and redirection from earlier days

grep fundamentals

bash
# basic search with line numbers
grep -n "ssh" /var/log/auth.log 2>/dev/null || grep -n "sshd" /var/log/secure 2>/dev/null

# case insensitive and whole word
grep -niw "error" /var/log/syslog 2>/dev/null || grep -niw "error" /var/log/messages 2>/dev/null

# show context lines around matches
grep -nC 2 "Failed password" /var/log/auth.log 2>/dev/null

# only the matching part per line
grep -oE "[0-9]{1,3}(\.[0-9]{1,3}){3}" /var/log/auth.log 2>/dev/null | head

Common options:

  • -n line numbers, -H show file name, -R recursive, -i ignore case, -w match words
  • -E extended regex, -F fixed string search, -v invert match
  • -A N, -B N, -C N for context after, before, or both
  • --color=auto highlight matches

Binary files and performance

Use --binary-files=without-match to skip binaries. For very large trees, prefer fixed strings with -F or use faster tools such as ripgrep when available. Always try a small sample first with head.
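
A minimal sketch of those flags in use; the search strings and paths are only placeholders.

bash
# skip binary files while searching a source tree
grep -R --binary-files=without-match -n "TODO" src/ | head

# fixed-string search with -F is faster when no regex features are needed
grep -RF "connection refused" /var/log/ 2>/dev/null | head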

Regex primer for today

  • Literal characters match themselves
  • . any character, * repeat zero or more, + one or more, ? optional
  • [] character class, [^] negated class, (a|b) alternation
  • Anchors: ^ start of line, $ end of line

Examples:

bash
# lines that start with a timestamp like 2025-10-02
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" app.log

# extract IPv4 addresses
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" access.log | head

# exclude noisy lines
grep -vE "healthcheck|ELB-HealthChecker|robots.txt" access.log | head

cut for fields

cut works best on simple delimited text such as CSV or colon separated records.

bash
# system users: fields from /etc/passwd
cut -d: -f1,3,7 /etc/passwd | head

# Nginx or Apache common log format: extract remote IP and path
cut -d' ' -f1,7 /var/log/nginx/access.log 2>/dev/null | head

# CSV: choose delimiter explicitly
cut -d, -f1,3,5 data.csv | head

Notes:

  • -d sets the delimiter, default is tab
  • -f selects fields, ranges like -f2-4 are allowed
  • For complex quoting or embedded delimiters, prefer awk which is covered tomorrow
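
The range syntax mentioned above, shown on /etc/passwd:

bash
# fields 1 through 4: user name, password field, UID, GID
cut -d: -f1-4 /etc/passwd | head

# everything from field 6 onward: home directory and shell
cut -d: -f6- /etc/passwd | head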

sort and uniq for ranking

Pipelines often use sort then uniq -c to count occurrences.

bash
# top IPs in auth failures
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" /var/log/auth.log 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20

Key options:

  • sort -n numeric, -r reverse, -h human numbers like 1K 2M
  • -k choose key, -t field delimiter, -u unique
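
For instance, -h ranks human-readable sizes such as du output, and -u deduplicates during the sort itself (paths are illustrative):

bash
# largest log directories first, using human-readable units
du -sh /var/log/* 2>/dev/null | sort -hr | head

# unique, sorted list of login shells
cut -d: -f7 /etc/passwd | sort -u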

Examples with fields:

bash
# rank requests by path from an access log
cut -d' ' -f7 /var/log/nginx/access.log 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20

# sort by the 9th field (status code) numerically
sort -k9,9n -t' ' /var/log/nginx/access.log 2>/dev/null | head

Stable and locale-neutral sorting

For predictable results across systems, set LC_ALL=C to force a byte-order comparison. Example: LC_ALL=C sort.
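
A quick way to see whether the locale changes a result is to compare both runs; names.txt is a placeholder for any text file.

bash
# byte-order sort, independent of the current locale
LC_ALL=C sort names.txt > bytes.txt

# locale-aware sort for comparison
sort names.txt > locale.txt

# any differences come from locale collation rules
diff bytes.txt locale.txt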

wc for quick totals

bash
# lines, words, bytes
wc -l app.log
wc -w README.md
wc -c /bin/ls

# count matches by combining grep and wc
grep -c "ERROR" app.log
# or
grep "ERROR" app.log | wc -l

grep -c is faster because it does not print matching lines.
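
To verify this on your own data, time both variants; big.log stands in for any large file.

bash
# counts matching lines without printing them
time grep -c "ERROR" big.log

# prints every matching line, then counts them
time grep "ERROR" big.log | wc -l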

tee for saving while streaming

tee writes input to a file and also passes it through. This is useful when a long pipeline should also be archived.

bash
# capture and continue processing (make sure the target directory exists)
mkdir -p ~/logs
journalctl -u ssh 2>/dev/null \
| tee ~/logs/ssh.log \
| grep -i "authentication failure" \
| wc -l

Append with tee -a to avoid overwriting.
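
For example, a timestamped count can be appended to a running history; app.log and error_counts.log are illustrative names.

bash
# add today's error count to the history without overwriting previous runs
printf '%s %s\n' "$(date +%F)" "$(grep -c ERROR app.log)" | tee -a error_counts.log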

Putting it together: practical patterns

1) Count unique visitors per day from a web log

bash
# assumes common format with remote IP as field 1 and date like [02/Oct/2025:...]
cut -d' ' -f1,4 /var/log/nginx/access.log 2>/dev/null \
| sed 's/\[//; s/:.*$//' \
| sort \
| uniq \
| awk '{print $2}' \
| sort \
| uniq -c \
| sort -nr \
| head -20

2) Show the most frequent failing usernames in auth logs

bash
grep -E "Failed password for (invalid user )?[a-zA-Z0-9_-]+" /var/log/auth.log 2>/dev/null \
| sed -E 's/.*Failed password for (invalid user )?([a-zA-Z0-9_-]+).*/\2/' \
| sort \
| uniq -c \
| sort -nr \
| head -20

3) Extract timing data from app logs

bash
grep -oE "time_ms=[0-9]+" app.log \
| cut -d= -f2 \
| sort -n \
| awk 'NR%100==0{print "p" NR, $0}'
Order matters in pipelines

uniq only collapses adjacent duplicates, so sort must usually come first. When counting, the typical pattern is sort | uniq -c | sort -nr.
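
A tiny demonstration with inline data shows why the sort is required:

bash
# without sort, uniq -c only counts adjacent repeats
printf 'a\nb\na\na\nb\n' | uniq -c

# with sort, identical lines become adjacent and the counts are correct
printf 'a\nb\na\na\nb\n' | sort | uniq -c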

Performance and reliability tips

  • Filter early to avoid pushing huge data through the whole pipeline
  • Use -F for fixed strings in grep when regex is not needed
  • Restrict search to a subtree with grep -R path/ and add --exclude patterns for large trees
  • For multi gigabyte logs, stream with zcat or zgrep on compressed files
  • Keep intermediate files in /tmp only when necessary, otherwise stream with pipes

Examples:

bash
# search inside compressed logs
zgrep -n "timeout" /var/log/nginx/*.gz 2>/dev/null | head

# exclude vendor directories and node_modules
grep -R --exclude-dir={.git,node_modules,vendor} -n "TODO" . | head

Practical lab

  1. Create a sample dataset and practice extraction and counting.
bash
mkdir -p ~/playground/day12 && cd ~/playground/day12
cat > data.csv <<'EOF'
user,action,duration_ms
alice,login,120
bob,login,80
alice,upload,560
alice,logout,20
bob,upload,140
bob,logout,30
charlie,login,200
EOF

# top actions by frequency
cut -d, -f2 data.csv | tail -n +2 | sort | uniq -c | sort -nr

# average duration per action (quick and rough)
cut -d, -f2,3 data.csv | tail -n +2 | sort -t, -k1,1 | \
awk -F, '{sum[$1]+=$2; cnt[$1]++} END{for(k in sum) printf "%s %.2f\n", k, sum[k]/cnt[k]}' | sort
  2. Investigate authentication failures on the system log.
bash
grep -E "Failed password|authentication failure" /var/log/auth.log 2>/dev/null \
| tee failures.log \
| grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" \
| sort | uniq -c | sort -nr | head
  3. Extract HTTP status distribution from an access log.
bash
cut -d' ' -f9 /var/log/nginx/access.log 2>/dev/null | sort | uniq -c | sort -nr

Troubleshooting

  • No output appears. Add -n or -H to grep for clarity, and check file paths and permissions
  • cut returns empty fields. Confirm the delimiter and field numbers, and remember that runs of spaces produce empty fields because cut treats every single delimiter as a field boundary
  • Counts look wrong. Ensure that sort precedes uniq -c
  • Pipelines appear slow. Filter earlier, or try grep -F. Consider writing a small subset to a temp file for iteration with tee
  • Locale differences in sort. Force LC_ALL=C to avoid accent or locale specific rules when numeric or byte order is expected
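
When aligned output uses runs of spaces, squeezing them with tr -s before cut avoids those empty fields; ps output is used here purely as an illustration.

bash
# collapse repeated spaces so each column becomes exactly one field
ps aux | tr -s ' ' | cut -d' ' -f1,2,11 | head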

Next steps

Day 13 expands text tooling with find, xargs, sed, and awk for bulk edits and richer reporting. It shows safe mass renames, search and replace across files, and building quick one liners to summarize CSV and logs.