Command line text tools make it possible to filter and summarize large outputs quickly. This lesson focuses on grep, cut, sort, uniq, wc, and tee, and shows how to combine them into readable, reliable pipelines.
- grep finds lines that match a pattern
- cut extracts delimited fields or byte or character ranges
- sort orders lines by a chosen key
- uniq collapses identical adjacent lines and can count
- wc prints line, word, or byte totals
- tee splits a stream so it can be saved and also piped forward
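To preview how these combine, here is one pipeline that touches most of them (app.log stands in for any log file, and top_errors.txt is just an example name):
# most frequent first fields of error lines, archived to a file along the way
grep -i "error" app.log | cut -d' ' -f1 | sort | uniq -c | sort -nr | tee top_errors.txt | head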
grep fundamentals
# basic search with line numbers
grep -n "ssh" /var/log/auth.log 2>/dev/null || grep -n "sshd" /var/log/secure 2>/dev/null
# case insensitive and whole word
grep -niw "error" /var/log/syslog 2>/dev/null || grep -niw "error" /var/log/messages 2>/dev/null
# show context lines around matches
grep -nC 2 "Failed password" /var/log/auth.log 2>/dev/null
# only the matching part per line
grep -oE "[0-9]{1,3}(\.[0-9]{1,3}){3}" /var/log/auth.log 2>/dev/null | head
Common options:
- -n line numbers, -H show file name, -R recursive
- -i ignore case, -w match whole words
- -E extended regex, -F fixed string search, -v invert match
- -A N, -B N, -C N for context after, before, or both
- --color=auto highlight matches
Use --binary-files=without-match to skip binaries. For very large trees, prefer fixed strings with -F, or use faster tools such as ripgrep when available. Always try a small sample first with head.
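For example, a pattern can be validated cheaply on a slice of the input before scanning everything (big.log is a hypothetical name):
# test the pattern on the first 10000 lines before committing to the full file
head -n 10000 big.log | grep -cF "timeout"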
Regex primer for today
- Literal characters match themselves
- . any character, * repeat zero or more, + one or more, ? optional
- [] character class, [^] negated class, (a|b) alternation
- Anchors: ^ start of line, $ end of line
Examples:
# lines that start with a timestamp like 2025-10-02
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" app.log
# extract IPv4 addresses
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" access.log | head
# exclude noisy lines
grep -vE "healthcheck|ELB-HealthChecker|robots.txt" access.log | head
cut for fields
cut works best on simple delimited text such as CSV or colon separated records.
# system users: fields from /etc/passwd
cut -d: -f1,3,7 /etc/passwd | head
# Nginx or Apache common log format: extract remote IP and path
cut -d' ' -f1,7 /var/log/nginx/access.log 2>/dev/null | head
# CSV: choose delimiter explicitly
cut -d, -f1,3,5 data.csv | head
Notes:
- -d sets the delimiter; the default is tab
- -f selects fields; ranges like -f2-4 are allowed (see the short demo below)
- For complex quoting or embedded delimiters, prefer awk, which is covered tomorrow
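A quick demo of delimiters and ranges, using inline sample data:
# tab is the default delimiter; -f2-4 selects a range of fields
printf 'a\tb\tc\td\te\n' | cut -f2-4        # prints b, c, d separated by tabs
printf 'a:b:c:d:e\n' | cut -d: -f2-4        # prints b:c:d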
sort and uniq for ranking
Pipelines often use sort followed by uniq -c to count occurrences.
# top IPs in auth failures
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" /var/log/auth.log 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20
Key options:
- sort -n numeric, -r reverse, -h human numbers like 1K 2M
- -k choose key, -t field delimiter, -u unique
Examples with fields:
# rank requests by path from an access log
cut -d' ' -f7 /var/log/nginx/access.log 2>/dev/null \
| sort \
| uniq -c \
| sort -nr \
| head -20
# sort by the 9th field (status code) numerically
sort -k9,9n -t' ' /var/log/nginx/access.log 2>/dev/null | head
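Two more sketches for -h and keyed numeric sorting with -t, against files that exist on most Linux systems:
# largest entries under /var/log first, comparing human readable sizes
du -sh /var/log/* 2>/dev/null | sort -hr | head
# /etc/passwd ordered by numeric UID, which is colon separated field 3
sort -t: -k3,3n /etc/passwd | head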
For predictable results across systems, set LC_ALL=C to compare in byte order. Example: LC_ALL=C sort.
wc for quick totals
# lines, words, bytes
wc -l app.log
wc -w README.md
wc -c /bin/ls
# count matches by combining grep and wc
grep -c "ERROR" app.log
# or
grep "ERROR" app.log | wc -l
grep -c is faster because it counts matching lines without printing them.
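One caveat: grep -c counts matching lines, not total matches. If a pattern can occur several times on one line, count occurrences with -o instead:
# counts every occurrence, even multiple per line
grep -o "ERROR" app.log | wc -l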
tee for saving while streaming
tee writes its input to a file and also passes it through. This is useful when a long pipeline should also be archived.
# capture and continue processing; create the target directory first
mkdir -p ~/logs
journalctl -u ssh 2>/dev/null \
| tee ~/logs/ssh.log \
| grep -i "authentication failure" \
| wc -l
Append with tee -a to avoid overwriting.
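tee is also the usual way to write to a root owned file from an unprivileged shell, because the file is opened by tee rather than by a shell redirect (the hosts entry below is made up):
# plain `sudo echo ... >> /etc/hosts` fails because the shell opens the file
echo "10.0.0.5 build-host" | sudo tee -a /etc/hosts > /dev/null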
Putting it together: practical patterns
1) Count unique visitors per day from a web log
# assumes common format with remote IP as field 1 and date like [02/Oct/2025:...]
# sort | uniq reduces the stream to unique "IP date" pairs, so printing only
# the date afterwards and counting gives unique visitors per day
cut -d' ' -f1,4 /var/log/nginx/access.log 2>/dev/null \
| sed 's/\[//; s/:.*$//' \
| sort \
| uniq \
| awk '{print $2}' \
| sort \
| uniq -c \
| sort -nr \
| head -20
2) Show the most frequent failing usernames in auth logs
grep -E "Failed password for (invalid user )?[a-zA-Z0-9_-]+" /var/log/auth.log 2>/dev/null \
| sed -E 's/.*Failed password for (invalid user )?([a-zA-Z0-9_-]+).*/\2/' \
| sort \
| uniq -c \
| sort -nr \
| head -20
3) Extract timing data from app logs
grep -oE "time_ms=[0-9]+" app.log \
| cut -d= -f2 \
| sort -n \
| awk 'NR%100==0{print "p" NR, $0}'
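The awk step above prints every 100th sorted value, which is a sample ladder rather than true percentiles. For specific percentiles, a small sketch (same time_ms= format assumed) can index into the sorted values:
grep -oE "time_ms=[0-9]+" app.log \
| cut -d= -f2 \
| sort -n \
| awk 'function idx(p, i){i=int(NR*p); return i<1 ? 1 : i}
       {a[NR]=$1}
       END{if (NR) printf "p50=%s p95=%s p99=%s\n", a[idx(0.5)], a[idx(0.95)], a[idx(0.99)]}'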
uniq only collapses adjacent duplicates, so sort must usually come first. When counting, the typical pattern is sort | uniq -c | sort -nr.
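A two line demonstration of the adjacency rule:
printf 'a\nb\na\n' | uniq -c           # 1 a, 1 b, 1 a  because duplicates are not adjacent
printf 'a\nb\na\n' | sort | uniq -c    # 2 a, 1 b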
Performance and reliability tips
- Filter early to avoid pushing huge data through the whole pipeline
- Use -F for fixed strings in grep when regex is not needed
- Restrict search to a subtree with grep -R path/ and add --exclude patterns for large trees
- For multi gigabyte logs, stream with zcat or zgrep on compressed files
- Keep intermediate files in /tmp only when necessary; otherwise stream with pipes
Examples:
# search inside compressed logs
zgrep -n "timeout" /var/log/nginx/*.gz 2>/dev/null | head
# exclude vendor directories and node_modules
grep -R --exclude-dir={.git,node_modules,vendor} -n "TODO" . | head
Practical lab
- Create a sample dataset and practice extraction and counting.
mkdir -p ~/playground/day12 && cd ~/playground/day12
cat > data.csv <<'EOF'
user,action,duration_ms
alice,login,120
bob,login,80
alice,upload,560
alice,logout,20
bob,upload,140
bob,logout,30
charlie,login,200
EOF
# top actions by frequency
cut -d, -f2 data.csv | tail -n +2 | sort | uniq -c | sort -nr
# average duration per action (quick and rough)
cut -d, -f2,3 data.csv | tail -n +2 | sort -t, -k1,1 | \
awk -F, '{sum[$1]+=$2; cnt[$1]++} END{for(k in sum) printf "%s %.2f\n", k, sum[k]/cnt[k]}' | sort
- Investigate authentication failures on the system log.
grep -E "Failed password|authentication failure" /var/log/auth.log 2>/dev/null \
| tee failures.log \
| grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" \
| sort | uniq -c | sort -nr | head
- Extract HTTP status distribution from an access log.
cut -d' ' -f9 /var/log/nginx/access.log 2>/dev/null | sort | uniq -c | sort -nr
Troubleshooting
- No output appears. Add -n or -H to grep for clarity, and check file paths and permissions (or rebuild the pipeline stage by stage, as shown below)
- cut returns empty fields. Confirm the delimiter and field numbers, and that fields are not collapsed by multiple spaces
- Counts look wrong. Ensure that sort precedes uniq -c
- Pipelines appear slow. Filter earlier, or try grep -F. Consider writing a small subset to a temp file for iteration with tee
- Locale differences in sort. Force LC_ALL=C to avoid accent or locale specific rules when numeric or byte order is expected
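When the cause is still unclear, rebuild the pipeline one stage at a time and eyeball each result (access.log stands in for any input file):
cut -d' ' -f1 access.log | head -3
cut -d' ' -f1 access.log | sort | head -3
cut -d' ' -f1 access.log | sort | uniq -c | head -3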
Next steps
Day 13 expands text tooling with find, xargs, sed, and awk for bulk edits and richer reporting. It shows safe mass renames, search and replace across files, and building quick one liners to summarize CSV and logs.