http://bit.ly/201909_UCL_sc

Socrative link: https://b.socrative.com/login/student/ - room: RITS

Workshop page: http://rits.github-pages.ucl.ac.uk/2019-09-25-UCL_software_carpentry/

Unix

Data needed for the workshop: http://swcarpentry.github.io/shell-novice/data/data-shell.zip

Download it to your desktop.

Acronyms

CLI - command line interface

GUI - graphical user interface

HPC - high performance computing

Bash - Bourne again shell

SSH - secure shell, used for remote access

Shell Basics

The user controls the shell by issuing commands. The shell performs each command and then gives control back to the user. This loop repeats until the shell is stopped. The available commands and their syntax vary between systems.

Here are basic commands:

clear … clear the screen
exit … terminate the shell
su … switch user (becomes the root user if no user name is given)
top … view currently running processes, updated live (press “q” to quit)
man <command> … display the manual for any command

Current/Working Directory

In the shell, there is always a current directory (or working directory). When the shell starts, this is usually the current user’s home directory. The default shell prompt (the characters written before the command) usually shows the current directory. The tilde sign (~) stands for the home directory.

The working directory can be changed and inspected with the following commands:

cd … change the current directory (usage: “cd directory_name”; if no directory name is given, the command changes to the home directory)
pwd … print working directory
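
The two commands above can be tried out safely in a scratch directory. This is only a minimal sketch: the names “demo_root” and “project” are made up for illustration, and “mktemp -d” and “basename” are standard utilities not covered in these notes.

```shell
# Work in a scratch directory so none of your real files are touched.
demo_root=$(mktemp -d)
mkdir "$demo_root/project"

cd "$demo_root/project"          # change into the new directory
inside=$(basename "$(pwd)")      # pwd prints the full path; basename keeps the last part
echo "Now in: $inside"

cd ..                            # go up one level
pwd
cd                               # no argument: back to the home directory
pwd
```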

Issuing Commands

Commands have names, arguments and options. The name determines what a command does, the arguments usually determine which objects are affected (files, directories), and options configure how the command behaves. Options are also sometimes called switches or flags, especially if they configure a binary (yes/no) setting.

Breaking up a command: the name is the first word (mandatory). It is usually followed by an arbitrarily long sequence of options and arguments; whether these are mandatory or optional depends on the command in question.

Example: “ls -F /usr”

Name: “ls” - list directory
Options: “-F” - print symbolic markers at the end of file names, classifying their type
Arguments: “/usr” - which directory to list

Manipulating Directories

Basic commands:

ls … list directory, prints the files and directories inside a directory (usage “ls directory_name”; if no directory name is given, the command lists the current directory)
mkdir … make directory, creates a new directory inside the current directory (usage “mkdir directory_name”; a path such as “~/directory_name” creates it inside the home directory instead)
rmdir … remove directory, deletes a directory within the current directory (usage “rmdir directory_name”), note: the directory to be deleted must exist and be empty, otherwise the command fails with an error
mv … move a directory (or file) or change its name (usage “mv current_name new_name”), note: the item to be moved must exist
cp … copy a directory (or file); usage is the same as for “mv”, but copying a directory requires the “-r” option
find … find files and directories (usage “find directory_name options”, where the options specify search criteria, e.g. -type or -name)
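
A short session using the directory commands listed above might look like the sketch below. The directory and file names (“thesis”, “figures”, “archive”) are hypothetical, and everything happens inside a temporary directory.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"

mkdir thesis                   # create a directory
mkdir thesis/figures           # create a directory inside it
touch thesis/draft.txt         # create an empty file

cp -r thesis thesis_backup     # -r is needed to copy a directory with its contents
mv thesis_backup archive       # rename the copy

ls -F                          # directories are marked with a trailing "/"
rmdir archive/figures          # works: this directory is empty
rmdir archive 2>/dev/null || echo "archive is not empty, so rmdir refuses"

found=$(find . -type d -name 'fig*')   # directories whose name starts with "fig"
echo "$found"
```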

The “ls” command has some useful options:

-a … list all (including hidden) files
-F … print symbolic markers at the end of file names, classifying their type (“/” at the end of an item name signifies a directory, “*” signifies an executable file, “@” signifies a link)
-h … print file sizes in human-readable format when combined with “-l” (instead of the exact number of bytes, units are used, e.g. “1M” for megabyte or “1G” for gigabyte)
--help … show documentation of the command (does not work on macOS; use “man ls” instead)
-1 … write items on separate lines
-l (lowercase “L”) … write items on separate lines, printing also last modification times, owners, groups and access rights
-r … invert sorting order (whatever it is)
-R … list subdirectories recursively with contents
-S … sort by file size, large to small
-t … sort items by modification time, newest first
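
To see a few of these options in action, the sketch below lists a small scratch directory in different ways. The file names (“.settings”, “notes.txt”, “results”) are made up for the example.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
mkdir results
touch .settings notes.txt

ls          # hidden entries (names starting with ".") are not shown
ls -a       # also shows ".", ".." and ".settings"
ls -F       # "results/" gets a trailing slash because it is a directory
ls -l       # long format: access rights, owner, size, modification time

visible=$(ls -1)                             # one entry per line
hidden_shown=$(ls -a | grep -c 'settings')   # counts lines mentioning "settings"
echo "$visible"
```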

The “cp” command options:

-r … recursive, copies a directory together with its files, subdirectories, their subdirectories etc.; if this option is not given, “cp” refuses to copy a directory and reports an error

Manipulating Files

Basic commands:

nano/gedit/vim … edit file contents in interactive editor
cat … print out the whole file (danger: do not use with large files)

Use “less file_name” (or “cat file_name | less”) to scroll through large files; press “q” to quit

rm … remove file (use “-i” to manually confirm each deletion, and “-f” to disable confirmations and error messages)
mv, cp … move and copy, respectively (for documentation, see their identical counterparts for directories)
ls *char*.pdb … list all files whose names contain “char” and end with the “.pdb” extension
wc … word count, counts lines, words and bytes in a file (use the “-l”, “-w” or “-c” flags respectively to get only some of these counts; “-m” counts characters)
sort … print sorted lines of file, where default sorting method is alphabetical, but other methods can be specified by flags (e.g. “-n” for numerical sorting)

To sort a file in place, i.e. replace the file with its sorted version, “sort filename > filename” will not work, because the shell empties the file before sort has the opportunity to read it.
Instead use “sort -o filename filename” or, in Bash, the shorthand “sort -o filename{,}” (the braces expand to the same two words).

head … print the beginning lines of a file (number of lines to print can be controlled with the -n option, default is 10)
tail … print the end lines of a file (counterpart to head)
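cut … print selected parts of each line of a file (e.g. “-d ,” sets the field delimiter to a comma and “-f 2” selects the second field)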
touch … create an empty file, or update its last modification time if the file already exists (usage “touch file_name ”)
grep … (global regular expression print), prints lines from a file that match a provided pattern, such as searching for a particular word.
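
The sketch below runs several of these file commands on a small, made-up data file (“counts.txt” is a hypothetical name). It also shows the in-place sort described above.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
# A hypothetical data file with one number per line.
printf '10\n9\n100\n2\n' > counts.txt

sort counts.txt                      # alphabetical order: 10, 100, 2, 9
sort -n counts.txt                   # numerical order:    2, 9, 10, 100
sort -n -o counts.txt counts.txt     # sort the file in place

smallest=$(head -n 1 counts.txt)     # first line of the sorted file
largest=$(tail -n 1 counts.txt)      # last line of the sorted file
line_count=$(wc -l < counts.txt)     # "<" feeds the file in, so wc prints only the number
tens=$(grep '10' counts.txt)         # lines containing "10": both 10 and 100
echo "min=$smallest max=$largest lines=$line_count"
```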

Paths

File system commands often include file or directory names. These can also be paths , so that it’s not necessary to change directories too often. The only difference between file/directory names and paths is that paths can include multiple nested directories. For instance, the path “a/b/c” points to a file “c” located in directory “b” that is located within another directory “a”. The slash (“/”) acts as a path separator or delimiter (note that this is different from Windows, where the path separator is backslash “\”).

There are some special directory names that are often used as shorthands. These are:

“.” (single dot) - points to the shell’s current/working directory
“..” (double dot) - points to the parent directory of “.”

This way, the file system can be traversed and modified efficiently. For instance, “cd ../../a” goes up two levels from the current directory and then enters the directory “a” found there.

Paths are classified as either absolute or relative. An absolute path starts with “/” and always points to the same place, whereas a relative path is interpreted starting from the current working directory. This is important because some commands require their arguments to be absolute paths, which are generally considered less prone to error (the same relative path may point to different places in different working directories). Relative paths, on the other hand, are often shorter to type.
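
As a small illustration of the difference, the sketch below reaches the same directory once with a relative and once with an absolute path. The directory layout is invented for the example, and “mkdir -p” (create missing parent directories too) is an option not covered above.

```shell
demo_root=$(mktemp -d)
mkdir -p "$demo_root/a/b/c" "$demo_root/a/x"

cd "$demo_root/a/b/c"
cd ../../x                   # relative: up two levels, then into "x"
via_relative=$(pwd)

cd "$demo_root/a/x"          # absolute: works from any working directory
via_absolute=$(pwd)

[ "$via_relative" = "$via_absolute" ] && echo "Both paths lead to the same directory"
```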

Wildcards

For convenience, the shell allows paths to contain wildcards. These are special symbols that let a single path describe multiple files, so that we don’t have to type too much.

The simplest wildcard is the asterisk “*”. It stands for any sequence of characters, including none at all. When used in a shell command, it gets expanded depending on the file system contents. For instance, if directory “a” contains files “b”, “c”, and “d”, the path “a/*” will be expanded to “a/b”, “a/c” and “a/d”. This way, we can affect all files in the “a” directory simultaneously. Wildcards can also be used repeatedly or with prefixes/suffixes. If the “a” directory additionally contains files “x1”, “x2” and “x3”, the path “a/x*” will be expanded to only the files prefixed with “x”, whereas the previous path “a/*” will be expanded to all files in the directory (prefixed with “x” or not).

There is a lot of fun to be had with the asterisk. Here are some basic examples:

* … expands to all (non-hidden) files and directories in the current directory
*.txt … expands to all files with the “txt” extension
*/* … expands to all files that are exactly one directory deep

Basic file system commands (e.g. cp, mv, rmdir, mkdir, rm and many more) accept not just one but an arbitrary number of arguments, which makes them work well with wildcards. Here’s how they work:

rm, rmdir - they remove all files/directories given, note that this is very dangerous and can potentially destroy your system if not used carefully
mkdir - creates all directories given
cp, mv - if there are N arguments (files or directories), the command treats the first N-1 arguments as items to be copied/moved and the last argument as the destination

Following up on the example above, the command “mv a/x* y” will be expanded to “mv a/x1 a/x2 a/x3 y”, a command that will move files “a/x1”, “a/x2” and “a/x3” to the “y” directory.

Asterisk is not the only wildcard available. There is also:

Question mark “?” … expanded to exactly one character of any type
Square brackets “[]” … expanded to exactly one of the characters inside the brackets (it’s possible to specify either a list without separators, e.g. “[abc]”, or a range with a dash, e.g. “[a-c]” - both of these will be expanded to either “a”, “b” or “c”)
Curly brackets “{}” … expanded to each of the comma-separated alternatives inside the brackets, e.g. “{*.txt,*.pdf}” will be expanded to anything that matches either “*.txt” or “*.pdf”
Tilde character “~” … expanded to the current user’s home directory (the same one that the shell starts in and that the “cd” command without arguments changes to)
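
The expansions described in this section can be checked with “echo”, which simply prints whatever the shell expanded the pattern to. The sketch below recreates the example directory “a” and the destination “y” in a scratch location.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
mkdir a y
touch a/b a/c a/d a/x1 a/x2 a/x3

echo a/*        # all six files in "a"
echo a/x*       # only a/x1 a/x2 a/x3
echo a/x?       # "x" followed by exactly one more character
echo a/[bc]     # a/b a/c

mv a/x* y       # same as: mv a/x1 a/x2 a/x3 y
moved=$(ls y)
remaining=$(echo a/*)
echo "Left in a: $remaining"
```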

Pipes

When commands are executed in the shell, they have a standard input and a standard output. By default, the standard input is whatever the user types on the keyboard and the standard output is displayed in the console. This can be changed using the vertical line character “|”. The vertical line creates a pipe, which connects the standard output of one command to the standard input of the next. This way, an arbitrary number of commands can be daisy-chained to achieve more advanced objectives. The standard input of the first command and the standard output of the last command stay at their defaults (keyboard and console).

Here is a simple example: “cat a | sort | wc -l”

Reading from the left to the right, this command:

Prints the contents of a file called “a”
Sorts the printed contents line-by-line alphabetically
Displays the number of lines in the sorted output

At the end of the command above, only the line count is printed to the console. Outputs of “cat” and “sort” are consumed by their successive commands chained by “|”.
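
The pipeline above can be reproduced with a three-line file. The file contents are invented for illustration; the variable names are only there so the results can be printed at the end.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
printf 'pear\napple\nfig\n' > a      # the file "a" from the example

cat a | sort                         # sorted lines are printed to the console
count=$(cat a | sort | wc -l)        # only the final line count comes out of the chain
first=$(sort a | head -n 1)          # a chain can end in any command
echo "lines: $count, first alphabetically: $first"
```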

Furthermore, it’s possible to redirect standard input/output of commands to files. This is done using the “<” and “>” characters respectively, and can (but does not have to) be used in conjunction with pipes. Following up the example above, the command “cat a >b” will print the contents of file “a”, but will effectively dump them into a file called “b” since its output is redirected to that file. Similarly, the command “sort <a” will print the sorted contents of a file “a”. Combined, the command “sort <a >b” will do the same, but save the output in file called “b”.

Note that extra care needs to be taken when redirecting standard output to files, as files can be easily overwritten this way. Sometimes, it’s safer to use “>>” instead of “>”. This works like “>”, except that the output is appended at the end of the file instead of replacing it, preserving any previous contents of the file. If the target file does not exist, both “>” and “>>” will create it. It is generally recommended to use at most a single redirection for each stream (input and output). If more redirections are used at the same time (e.g. “>” as well as “>>”, or “>” as well as “|”), no errors will be produced, but the results on the file system are likely to differ from the original expectations, as each redirection of a stream overrides the previous one (e.g. in Bash, “cat a >b >c” will create both files “b” and “c” but only “c” will contain the output of “cat a”).
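
The difference between “>” and “>>” is easiest to see by trying both on a throwaway file. This is a sketch assuming Bash (other shells, e.g. zsh, may treat multiple redirections differently); “log.txt” is a hypothetical file name.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
printf 'b\na\n' > a

sort < a > b                  # read from "a", write the sorted lines to "b"

echo 'first'  > log.txt       # ">" creates the file
echo 'second' >> log.txt      # ">>" appends: log.txt now has two lines
appended=$(wc -l < log.txt)
echo 'third'  > log.txt       # ">" again: the two earlier lines are lost

cat a > c > d                 # "c" and "d" are both created, only "d" gets the output
```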

Basic Scripting

Bash is a complete programming language, so scripts and functions of any complexity can be written in it. One simple example is a for loop:

for variable in <list of files, numerical values, etc.>
do
    command1
    command2
    ...
    commandN
done

The syntax is not remarkable in comparison to other programming and scripting languages; the keywords “for”, “in”, “do” and “done” mark out the structure of the loop, within which you can insert your own commands and variables. Variables are assigned a value with a command such as “x=2” (with no spaces around “=”), and later accessed using “$x” (e.g. “echo $x”).

As an alternative to writing a script on separate lines, semicolons “;” may be used instead of line breaks.
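
A concrete loop, as a minimal sketch: it visits every “.txt” file in a scratch directory and counts them. The file names and the variable names “filename” and “count” are arbitrary choices for this example.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
touch one.txt two.txt three.txt

count=0                          # assignment: no spaces around "="
for filename in *.txt
do
    echo "Found: $filename"
    count=$((count + 1))         # $(( )) evaluates arithmetic
done
echo "Total: $count"

# The same kind of loop on one line, with semicolons instead of line breaks:
for filename in *.txt; do echo "$filename"; done
```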

It rapidly becomes inefficient to type long and complicated scripts directly into the terminal, especially if they are likely to be useful on more than one occasion. Bash scripts can be stored in a text file, by convention with the “.sh” extension. This text file should begin with the line:

#!/usr/bin/env bash

which instructs the system to use a Bash interpreter when the file is executed directly (the exact location of bash varies between systems, e.g. “/bin/bash” on macOS, which is why “env” is used to look it up). Scripts can also be run from the CLI with the command “bash <filename>”.
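
As a rough illustration, the sketch below writes a two-line script and runs it in both ways. The file name “greet.sh” is hypothetical, the first line of the script assumes that “env” can find bash on your system, and “$1” (the first argument given to the script) is a feature not covered above.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"

# Write the script into a file; everything up to EOF is copied literally.
cat > greet.sh << 'EOF'
#!/usr/bin/env bash
echo "Hello, $1"
EOF

output=$(bash greet.sh world)    # run it by naming the interpreter explicitly
echo "$output"

chmod +x greet.sh                # alternatively, make the file executable...
./greet.sh world                 # ...and let its first line choose the interpreter
```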

Searching

The “grep” command can search files line-by-line, printing lines that match given criteria. The basic usage is: “grep regular_expression file_name”. The regular expression is a string describing which lines to print; it may contain special characters that act like wildcards. Note that these are not the same wildcards as used in shell commands, so the expression usually needs to be enclosed in single quotes to stop the shell from expanding it. Sometimes “egrep” (extended grep, equivalent to “grep -E”) is used instead of “grep” if extended regular expressions are required.

Here are some options for grepping:

-n … print line numbers
-w … match the regular expression to words only (if it matches only a part of a word, the match does not count)
-v … invert matching, i.e. print all lines that are not matched by the regular expression
-i … match in case-insensitive mode
-r … recursive matching, if the file name given is a directory, read and try to match in all files inside it
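
The options above can be compared on a three-line file. The file “notes.txt” and its contents are invented for this sketch, and “-c” (count matching lines instead of printing them) is an extra option not listed above.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
printf 'The cat sat\nconcatenate files\nA dog barked\n' > notes.txt

grep -n 'cat' notes.txt       # lines 1 and 2, prefixed with their line numbers
grep -w 'cat' notes.txt       # line 1 only: "concatenate" is not the word "cat"
grep -i 'the' notes.txt       # case-insensitive: matches "The"

no_cat=$(grep -v 'cat' notes.txt)          # lines that do NOT contain "cat"
whole_word=$(grep -c -w 'cat' notes.txt)   # number of lines with the word "cat"
echo "Without cat: $no_cat"
```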

The “find” command can search files and directories, printing paths of file names that match given criteria. The basic usage is: “find directory_name options ”, where options are used to specify the search criteria.

The “locate” command is helpful for finding files if you are not sure where they are and need to search a very broad region of the filesystem. It is faster than find because it uses a cached database, though this database may not be up-to-date (see updatedb command).
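
A small “find” search, as a sketch under made-up names (“data”, “raw”, “run1.txt” and so on). The pattern is quoted so that find, rather than the shell, interprets the asterisk.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
mkdir -p data/raw                     # -p also creates the parent "data"
touch data/raw/run1.txt data/summary.txt data/plot.png

find data -type d                     # directories only: data and data/raw
text_files=$(find data -type f -name '*.txt' | sort)   # files ending in ".txt"
echo "$text_files"
```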

Git

Please make sure you have signed up for an account at GitHub.com before the session.

Cheatsheet for reference: http://swcarpentry.github.io/git-novice/reference

Why is Version Control important?

Because it is.

Git was originally developed by Linus Torvalds in 2005 to manage development of the Linux kernel. Its commands can feel counterintuitive to anyone familiar with other version control systems.

How to Setup Git

The command “git config --list” should list your user name and email address among its output. If it does not:

Use “git config --global user.name "<full name>"” to set your name.
Use “git config --global user.email "<email address>"” to set your email address.

You can also set your default text editor using “git config --global core.editor <text editor>”. Vim is objectively the best editor.

Then, use “git config --global core.autocrlf <X>” where

X = “true” on Windows
X = “input” on Mac and Linux

so that line-endings are compatible across these platforms.

Creating a Local Repository

In your chosen directory, create a directory called recipes (“mkdir recipes”) and enter it (“cd recipes”). The following commands will be useful:

git init … initializes an empty Git repository, stored in a hidden directory called “.git”
git status … shows the current branch and which files are modified, staged or untracked (or an error if you are not in a repository)
sudo rm -rf /* … DANGER, do not run this: it deletes everything on the computer :(

But if you pushed to your Git repository, at least you'll still have that :)

git add <filename> … stages a file (new or modified), i.e. includes it in the list of changes to be committed.

git add -u … stages changes to all files Git already tracks (modifications and deletions), does not add new files
git add -A … stages all file changes, additions, and deletions.

git commit … records the staged changes as a new commit on the current branch (generally branch “master”), after opening the text editor for a “commit message” describing the changes made

Or to save time, use “git commit -m "<message>"”.

git diff … shows the difference between the working directory and the staging area (i.e. changes not yet staged with git add)

git diff --cached … shows the difference between the staging area and the latest commit
git diff HEAD~n … shows the difference between the working directory and the commit n versions before the latest one

git checkout HEAD <filename> … restores the specified files to their state in the latest commit (HEAD), discarding uncommitted changes to them (use “.” as the file name for all files in the current directory)
A .gitignore file can be created to specify what files git does not try to track.
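
Put together, a first session with the commands above might look like the following sketch. It works in a scratch directory and sets a placeholder identity for that repository only (without “--global”), so your real Git settings are untouched; the file name “guacamole.md” and the commit messages are made up, and “-q” just keeps the output short.

```shell
demo_root=$(mktemp -d)
cd "$demo_root"
mkdir recipes
cd recipes

git init -q                                  # creates the hidden ".git" directory
git config user.name "Workshop Learner"      # placeholder identity, this repository only
git config user.email "learner@example.com"

echo "2 avocados" > guacamole.md
git status                                   # guacamole.md is listed as untracked
git add guacamole.md                         # stage the new file
git commit -q -m "Start guacamole recipe"    # record it as the first commit

echo "1 lime" >> guacamole.md
git diff                                     # working directory vs staging area
git add guacamole.md
git diff --cached                            # staging area vs latest commit
git commit -q -m "Add lime"

commits=$(git rev-list --count HEAD)         # number of commits so far
echo "Commits: $commits"
```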

Working in a Remote Repository

In your chosen directory,

Markdown

Learn it it’s great (no) :(

Useful tricks/tools

Git comes with a really useful tool that makes your bash prompt display some information about the repository you are in, so you don’t have to type git status ALL the time. Follow instructions here

Combine the above with oh-my-zsh to pimp your terminal: https://hackernoon.com/how-to-trick-out-terminal-287c0e93fce0

Python

Setup

Have you installed Anaconda?
Download the data and the code we will use in the lesson and save them in the same directory (e.g., ~/Desktop/LearningPython/).

1st exercise!

Make Git aware of the directory where you saved the data and the code (i.e. turn it into a repository).
[Optional] - put that repo on github (do you remember how?)