Selecting Data Items

The content for this Reading is attributed to Dr Jim Tyson.

For this section we will use the following data file:

results.txt

If the data file you are using was created with windows, as these files were, you may need to translate the format for use on a Unix like system before you use them for this tutorial – or when you are analyzing data in real life.

This is because any system that deals with text data (which is what we will use) must have a way to recognize the end of a line. On Unix like systems, the end of a line is marked by the code for NEWLINE, often written \n while on Windows systems the end of line is marked by CARRIAGE RETURN NEWLINE, often written \r\n.

This difference can cause problems because if we search for the end of a line (which we will when we search for characters in context) the Unix system will fail to find a Windows file line ending.

To remedy this we will need to translate the file. The program that does this is called dos2unix.

The command to translate a Windows file called fileone.txt to a Unix friendly called fileone.txt is

dos2unix –n fileone.txt fileone.txt

If you wish you can change the name of the output file like this

dos2unix –n fileone.txt filetwo.txt

Selecting Rows

The grep command searches a file line by line. Try the following command

grep '^R' results.txt

This will display only three rows of the file. The expression in quotes is the search string. If we wish we can direct the output of this process to a new file, like this:

grep '^R' results.txt > outputfile.txt

This command line uses the redirect output symbol. In Unix the default output destination is the screen, and it's known as stdout (when it needs naming). The default input source is the keyboard, known as stdin. So when data is coming from or going to anywhere else, we use redirection.

We use redirection with > to pass the results of a process to a new output or with < to get data from a new input.

If we want to append the data to the end of an existing file (as new rows) we use >> instead of >.

We can use a similar command line to count the rows selected, but this time let's change the grep command slightly.

grep '^[RB]' results.txt | wc -l

This command line uses the pipe symbol.

We use piping with |to pass the results of one process to another process.

If we only wanted a count of the lines that match then instead of piping the result to wc we could use the -c parameter on grep, like this:

grep -c '^[RB]' results.txt | wc -l

We have used the character class, indicated by [ and ], containing R and B and grep will succeed if it finds any of the characters in the class. We enclose this regular expression in single quotes.

Also notice that in the cases above we have used the anchor ^ to limit the match by grep to the start of a line.

The anchor $ would limit the search to matches at the end of the line.

We use grep in this way to select row data.

More About Searching

The standard form of a basic grep command is:

grep -[options] 'search expression' filename

Typically search expression is a regular expression.

The simplest type of expression is a string literal - a succession of characters each treated literally, that is to say standing for themselves and nothing else. If the string literal contains a space, we will need to surround it by single quote marks.

In our data we might look for the following:

grep 'de souza' results.txt

The next thing to learn is how to match a class of characters rather than a specific character.

Consider grep [A-Z] results.txt

This matches any uppercase alphabetic character.

Similarly grep [1-9] results.txt matches any numeric character.

In both these cases any single character of the right class causes a successful match. You can specify the class by listing as well.

Consider grep [perl] results.txt

Which matches any character from the list p,e,r,l (the order in which they are listed is immaterial).

You can combine a character class and a literal in a search string.

Consider grep 'Grade [BC]' someresults.txt

this search would find lines containing Grade B and lines containing Grade C.

Notice that combining literals and classes means I need quotes.

You can also search using special characters as wildcards.

The character . for example, used in a search stands for any single character except the newline character.

So the search grep . results.txt succeeds for every non-empty line. (If . matched the newline character it would succeed for empty lines as well).

The character * stands for zero or any number of repetitions of a character.
So grep 'a*' results.txt

matches

a

aa

aaa

and so on.

Notice the blank line there? Probably not, but it's there. This regular expression matches zero or more instances of the preceding character.

Suppose that I wish to find a string that contains any sequence of characters followed by, for example m.

The grep command would be

grep '.+m' results.txt

This is a greedy search: it is not satisfied with the very first successful match, it continues past the first match it finds to match the longest string it can.

For now we will just accept this greedy searching, but if you investigate regular expressions further you will discover that some versions have non-greedy matching strategies available.

Searching with context

We can use the extended version of grep called egrep to add some lines of context to our search results.

For example, we can find lines with surnames beginning with B and two lines before and after:

egrep –C 2 '^B' results.txt

Selecting Columns

We can also select columns.

Because this is a delimited file we can split it into columns at each delimiter - in this case a comma.

This is equivalent to selecting fields from records.

Suppose that we want to extract column two from our data. We do this with the cut command.

Here's an example:

cut -d, -f3 results.txt| head

The first nine lines of the resulting display are:

55

52

51

59

56

We can display several columns like this

cut d, -f1-4 results.txt

which displays a contiguous range of columns, or

cut -d, -f1,4 results.txt

which displays a list of separate columns.

The -d option on cut specifies the delimiter (your system will have a default if you don't specify - find out what it is!) and the -f option specifies the column or field number. We use cut in this way to select column data.

The general form of the cut command is

cut -ddelimiter -ffieldnumbers datafile

So in the examples, we specified comma as the delimiter and used fields 1 and 3 and the range of fields 1 to 3.

Selecting Columns and Rows

Suppose that we want to select just some columns for only some rows?

We do this by first selecting rows with grep and passing this to cut to select columns. You can try:

grep '^[AR]' results.txt | cut -d, -f 1,4 | less

Again, we use piping to pass the results of one process to another. You could also redirect the output to a new file.

grep '^[AR]' results.txt | cut -d, -f 1,4 > resultsa-r.txt

Sorting

Unix sorts alphabetically by default. This means that 100 comes before 11.

On Rows

You can sort with the command sort. For example

sort results.txt | less

This sorts the file in UNIX order on each character of the entire line.

The default alphanumeric sort order means that the numbers one to ten would be sorted like this

1,10,2,3,4,5,6,7,8, 9

This makes perfect sense but it can be a surprise the first time you see it.

Descending

You can sort in reverse order with the option -r. Like this

sort -r results.txt | less

Numerically

To force a numeric sort, use the option -n.

sort -n results.txt

You can use a sort on numeric data to get maximum and minimum values for a variable.

Sort then pipe to head 1 and tail 1, which will produce the first and last records in the file.

On Columns

To sort on columns you must specify a delimiter, with -t and a field number with -k.

To sort on the third column of the results data, try this

sort -n -t ',' -k4 results.txt | less

(I've used a slightly more verbose method of specifying the delimiter here).

You can select rows after sorting, like this

sort-n -t ',' -k4 results.txt | grep '^[A]' | less

Which shows those pupils with surnames beginning with A sorted on the third field of the data file.

To sort on multiple columns we use more than one -k parameter.

For example, to sort first on Maths score and then on surname we use

sort -n -t ',' -k4n -k1 results.txt | less

Finding Unique Values in Columns

Suppose that you want to know how many different values appear in a particular column.

With a little work, you can find this out using the command uniq.

Used alone, uniq tests each line against what preceded it before writing it out and ignores duplicate lines.

Before we try to use uniq we need a sorted column with some repeated values. We can use cut to extract one.

Test this first

cut -d, -f3 results.txt | less

This should list just the second column of data which has a few duplicate values.

We pass the output through sort to uniq

cut -d, -f3 results.txt | sort |uniq | less

to get data in which the adjacent duplicates have been squeezed to one.

We can now pipe this result to wc-l to get the count of unique values.

cut -d, -f3 results.txt | sort | uniq | wc -l

Next activity: Exercises - Finding things.