
Command line text tools make it possible to filter and summarize large outputs quickly. This lesson focuses on grep, cut, sort, uniq, wc, and tee. It shows how to combine them into readable, reliable pipelines.
- grep finds lines that match a pattern
- cut extracts delimited fields or byte or character ranges
- sort orders lines by a chosen key
- uniq collapses identical adjacent lines and can count
- wc prints line or word or byte totals
- tee splits a stream so it can be saved and also piped forward
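As a taste of how the pieces fit together, here is a sketch that uses all six tools in one pass. The log path and the fixed-string pattern are illustrative, not from this lesson's later examples:
# save the ranked list of client IPs that hit 404s, and print how many distinct IPs there were
grep -F '" 404 ' /var/log/nginx/access.log \
  | cut -d' ' -f1 \
  | sort \
  | uniq -c \
  | sort -nr \
  | tee 404-by-ip.txt \
  | wc -l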
Prerequisites
grep fundamentals
# basic search with line numbers
grep -n "ssh" /var/log/auth.log 2>/dev/null || grep -n "sshd" /var/log/secure 2>/dev/null
# case insensitive and whole word
grep -niw "error" /var/log/syslog 2>/dev/null || grep -niw "error" /var/log/messages 2>/dev/null
# show context lines around matches
grep -nC 2 "Failed password" /var/log/auth.log 2>/dev/null
# only the matching part per line
grep -oE "[0-9]{1,3}(\.[0-9]{1,3}){3}" /var/log/auth.log 2>/dev/null | headCommon options:
- -n line numbers, -H show file name, -R recursive, -i ignore case, -w match words
- -E extended regex, -F fixed string search, -v invert match
- -A N, -B N, -C N for context after, before, or both
- --color=auto highlight matches
Use --binary-files=without-match to skip binaries. For very large trees, prefer fixed strings with -F or use faster tools such as ripgrep when available. Always try a small sample first with head.
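For example, a conservative recursive search over a large tree might look like the sketch below; the directory and search string are placeholders:
# skip binary files, search a fixed string, and look at a small sample first
grep -R -F -n --binary-files=without-match "connection reset" src/ | head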
Regex primer for today
- Literal characters match themselves
- . any character, * repeat zero or more, + one or more, ? optional
- [] character class, [^] negated class, (a|b) alternation
- Anchors: ^ start of line, $ end of line
Examples:
# lines that start with a timestamp like 2025-10-02
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" app.log
# extract IPv4 addresses
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" access.log | head
# exclude noisy lines
grep -vE "healthcheck|ELB-HealthChecker|robots.txt" access.log | headcut for fields
cut works best on simple delimited text such as CSV or colon separated records.
# system users: fields from /etc/passwd
cut -d: -f1,3,7 /etc/passwd | head
# Nginx or Apache common log format: extract remote IP and path
cut -d' ' -f1,7 /var/log/nginx/access.log 2>/dev/null | head
# CSV: choose delimiter explicitly
cut -d, -f1,3,5 data.csv | head
Notes:
- -d sets the delimiter, default is tab
- -f selects fields, ranges like -f2-4 are allowed (see the examples below)
- For complex quoting or embedded delimiters, prefer awk, which is covered tomorrow
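A quick sketch of the range syntax, using /etc/passwd as a readily available delimited file:
# fields 1 through 4: user name, password field, UID, GID
cut -d: -f1-4 /etc/passwd | head
# everything from field 5 onward
cut -d: -f5- /etc/passwd | head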
sort and uniq for ranking
Pipelines often use sort then uniq -c to count occurrences.
# top IPs in auth failures
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" /var/log/auth.log 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20
Key options:
- sort -n numeric, -r reverse, -h human numbers like 1K 2M
- -k choose key, -t field delimiter, -u unique
Examples with fields:
# rank requests by path from an access log
cut -d' ' -f7 /var/log/nginx/access.log 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20
# sort by the 9th field (status code) numerically
sort -k9,9n -t' ' /var/log/nginx/access.log 2>/dev/null | head
For predictable results across systems, set LC_ALL=C for byte order comparison. Example: LC_ALL=C sort.
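A sketch of forcing the C locale on just the sort steps of a counting pipeline; the log path mirrors the earlier examples:
# count request paths with locale-independent ordering
cut -d' ' -f7 /var/log/nginx/access.log 2>/dev/null \
| LC_ALL=C sort \
| uniq -c \
| LC_ALL=C sort -nr \
| head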
wc for quick totals
# lines, words, bytes
wc -l app.log
wc -w README.md
wc -c /bin/ls
# count matches by combining grep and wc
grep -c "ERROR" app.log
# or
grep "ERROR" app.log | wc -lgrep -c is faster because it does not print matching lines.
tee for saving while streaming
tee writes input to a file and also passes it through. This is useful when a long pipeline should also be archived.
# capture and continue processing
journalctl -u ssh 2>/dev/null \
| tee ~/logs/ssh.log \
| grep -i "authentication failure" \
| wc -l
Append with tee -a to avoid overwriting.
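For instance, each run can append to an ongoing archive instead of replacing it; the unit name and file path follow the example above and may differ on your system:
# append today's failures to an archive and still report the count
journalctl -u ssh --since today 2>/dev/null \
| grep -i "authentication failure" \
| tee -a ~/logs/ssh-failures.log \
| wc -l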
Putting it together: practical patterns
1) Count unique visitors per day from a web log
# assumes common format with remote IP as field 1 and date like [02/Oct/2025:...]
cut -d' ' -f1,4 /var/log/nginx/access.log 2>/dev/null \
| sed 's/\[//; s/:.*$//' 2>/dev/null \
| sort \
| uniq \
| awk '{print $2}' 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20
2) Show the most frequent failing usernames in auth logs
grep -E "Failed password for (invalid user )?[a-zA-Z0-9_-]+" /var/log/auth.log 2>/dev/null \
| sed -E 's/.*Failed password for (invalid user )?([a-zA-Z0-9_-]+).*/\2/' \
| sort \
| uniq -c \
| sort -nr \
| head -20
3) Extract timing data from app logs
grep -oE "time_ms=[0-9]+" app.log \
| cut -d= -f2 \
| sort -n \
| awk 'NR%100==0{print "p" NR, $0}'
uniq only collapses adjacent duplicates, so sort must usually come first. When counting, the typical pattern is sort | uniq -c | sort -nr.
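A quick way to see why, using printf to fake a tiny stream:
# without sort, repeated values that are not adjacent stay separate
printf 'a\nb\na\n' | uniq -c
# with sort, duplicates become adjacent and collapse into a single count
printf 'a\nb\na\n' | sort | uniq -c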
Performance and reliability tips
- Filter early to avoid pushing huge data through the whole pipeline
- Use -F for fixed strings in grep when regex is not needed
- Restrict search to a subtree with grep -R path/ and add --exclude patterns for large trees
- For multi-gigabyte logs, stream with zcat or zgrep on compressed files
- Keep intermediate files in /tmp only when necessary, otherwise stream with pipes
Examples:
# search inside compressed logs
zgrep -n "timeout" /var/log/nginx/*.gz 2>/dev/null | head
# exclude vendor directories and node_modules
grep -R --exclude-dir={.git,node_modules,vendor} -n "TODO" . | head
Practical lab
- Create a sample dataset and practice extraction and counting.
mkdir -p ~/playground/day12 && cd ~/playground/day12
cat > data.csv <<'EOF'
user,action,duration_ms
alice,login,120
bob,login,80
alice,upload,560
alice,logout,20
bob,upload,140
bob,logout,30
charlie,login,200
EOF
# top actions by frequency
cut -d, -f2 data.csv | tail -n +2 | sort | uniq -c | sort -nr
# average duration per action (quick and rough)
cut -d, -f2,3 data.csv | tail -n +2 | sort -t, -k1,1 | \
awk -F, '{sum[$1]+=$2; cnt[$1]++} END{for(k in sum) printf "%s %.2f\n", k, sum[k]/cnt[k]}' | sort
- Investigate authentication failures on the system log.
grep -E "Failed password|authentication failure" /var/log/auth.log 2>/dev/null \
| tee failures.log \
| grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" \
| sort | uniq -c | sort -nr | head
- Extract HTTP status distribution from an access log.
cut -d' ' -f9 /var/log/nginx/access.log 2>/dev/null | sort | uniq -c | sort -nr
Troubleshooting
- No output appears. Add -n or -H to grep for clarity, and check file paths and permissions
- cut returns empty fields. Confirm the delimiter and field numbers, and that fields are not collapsed by multiple spaces
- Counts look wrong. Ensure that sort precedes uniq -c
- Pipelines appear slow. Filter earlier, or try grep -F. Consider writing a small subset to a temp file for iteration with tee (see the sketch below)
- Locale differences in sort. Force LC_ALL=C to avoid accent or locale-specific rules when numeric or byte order is expected
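A minimal sketch of that iteration trick, assuming a large app.log like the one used earlier; the sample size and temp path are arbitrary:
# work against a small, saved sample while refining the pipeline
head -n 100000 app.log | tee /tmp/app-sample.log | grep -c "ERROR"
# then iterate on the sample instead of the full file
grep -oE "time_ms=[0-9]+" /tmp/app-sample.log | cut -d= -f2 | sort -n | tail -5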
Up next: Day 13 - Find, xargs, sed, and awk
Day 13 expands text tooling with find, xargs, sed, and awk for bulk edits and richer reporting. It shows safe mass renames, search and replace across files, and building quick one liners to summarize CSV and logs.