DevOps meets scientific research

Danilo Pianini

image

Replication crisis

aka replicability crisis or reproducibility crisis

  • ongoing methodological crisis

  • the results of many scientific studies are hard or impossible to reproduce.

  • empirical reproductions are essential for the the scientific method

no reproducibility $\Rightarrow$ scientific credibility is undermined

Countermeasures

Specifically for data science and computer science

  • Make your artifacts available
    • Share code as open source (licensing)
    • Share code and data where people will find it (GitHub)
    • Share code and data where it will be archived for the foreseeable future (Zenodo)
  • Make your artifacts reproducible
    • It works on your PC? Ship your PC! (containerization)
  • Make your artifacts maintainable
    • Be ready to accept contributions and work in team (version control)
    • Always check that the software is working (continuous integration)
  • Make your artifacts reusable
    • document them appropriately (GitHub Pages)

Most techniques come from the DevOps world!

Course contents

Specifically for data science and computer science

  • Version control with git
  • Share code on GitHub
  • Continuous integration in GitHub Actions
  • Containerization via Docker
  • Archive your artifacts on Zenodo
  • Licensing
  • Create documentation on GitHub Pages

Course exam

Create a reusable artifact

  • If you have a paper that is being submitted, create the artifact for it and it counts as a valid exam (if done properly)
  • if you don’t, then a toy project is fine as well

Must-have

  • Version control
  • Appropriate License
  • Continuous Integration
  • Containerization
  • Zenodo archiving
  • GitHub Pages documentation

Development

  • Analysis of a domain
  • Design of a solution
  • Implementation
  • Testing

Operations

  • IT infrastructure
  • Deployment
  • Maintenance

Silo mentality

silos

No silos

silos

DevOps culture

  • Increased collaboration

    • Dev and Ops should exchange information and work together
  • Shared responsibility

    • A team is responsible for the (sub) product for its whole lifetime
    • No handing over projects from devs to ops
  • Autonomous teams

    • Lightweight decision making process
  • Focus on the process, not just the product

    • Promote small, incremental changes
    • Automate as much as possible
    • Leverage the right tool for the job at hand

Why bother?

  1. Risk management

    • Reduce the probability of failure
    • Detect defects before hitting the market
    • Quickly react to problems
  2. Resource exploitation

    • Use human resources for human-y work
    • Reduce time to market
    • Embrace innovation
    • Exploit emerging technologies

DevOps

  1. Principles
  2. Practices
  3. Tools

Principles inspire practices

Practices require tools

DevOps principles

(not exhaustive)

  • Collaboration
  • Reproducibility
  • Automation
  • Incrementality
  • Robustness

DevOps practices

  • Workflow organization
  • Build automation
  • Self-testing code
  • Code quality control
  • Continuous Integration
  • Continuous Delivery
  • Continuous Deployment
  • Continuous Monitoring

Version control with git

Tracking changes

Did you ever need to roll back some project or assignment to a previous version?

How did you track the history of the project?

Classic way

  1. find a naming convention for files/folders
  2. make a copy every time there is some relevant progress
  3. make a copy every time an ambitious but risky development begins

Inefficient!

  • Consumes a lot of resources
  • Requires time
  • How to tell what was in some previous releases?
  • How to cherry-pick some changes?

Fostering collaborative workflows

Did you ever need to develop some project or assignment as a team?

How did you organize the work to maximize the productivity?

Classic ways

  • One screen, many heads
    • a.k.a. one works, the other ones sleep
  • Locks: “please do not touch section 2, I’m working on that”
    • probability of arising conflicts close to 100%
  • Realtime-sharing (like google docs or overleaf)
    • okay in many cases for text documents (but with a risk of frankestein-ization)
    • disruptive with code (inconsistencies are much less tolerable in formal languages)

Version control systems

Tools meant to support the development of projects by:

  • Tracking the project history
  • Allowing roll-backs
  • Collecting meta-information on the changes
    • Authors, dates, notes…
  • Merging information produced at different stages
  • (in some cases) facilitate parallel workflows
  • Also called Source Content Management (SCM)

Distributed: Every copy of the repository contains (i.e., every developer locally have) the entire history.

Centralized: A reference copy of the repository contains the whole history; developers work on a subset of such history

Short history

  • Concurrent Versioning System (CVS) (1986): client-server (centralized model, the truth is on the server), operates on single files or repository-level, history stored in a hidden directory, uses delta compression to save space.
  • Apache Subversion (SVN) (2000): successor to CVS, still largely used (especially in businesses that struggle to renovate their processes). Centralized model (similar to CVS). Improved binary file management. Improved concurrency for the operation, still cumbersome for parallel workflows.
  • Mercurial and Git (both April 2005): decentralized version control systems (DVCSs), no “special” copy of the repository, each client stores the whole history. Highly scalable. Foster parallel work by allowing easy branching and merging. Very similar conceptually (when two succesful tools emerge at the same time with a similar model independently, it is an indication that the underlying model is “the right one” for the context).

Git is now the dominant DVCS (although Mercurial is still in use, e.g., for Python, Java, Facebook).

Intuition: the history of a project

  1. Create a new project
Initialize project
  1. Make some changes
Initialize projectMake some changes
  1. Then more and more, until the project is ready
Initialize projectMake some changesMake more changesIt's finished!

At a first glance, the history of a project looks like a line.

Except that, in the real world…

Anything that can go wrong will go wrong
$1^{st}$ Murphy’s law

If anything simply cannot go wrong, it will anyway $5^{th}$ Murphy’s law

…things go wrong

Initialize projectMake some changesMake more changesIt's *not* finished! :(fail

Rolling back changes

Go back in time to a previous state where things work

initokayerrorreject

Get the previous version and fix

Then fix the mistake

masterfix-issueinitokayerrorrejectimprovefinished

If you consider rollbacks, history is a tree!

Collaboration: diverging

Alice and Bob work together for some time, then they go home and work separately, in parallel

They have a diverging history!

alicebobtogether-1together-2together-3bob-4alice-4bob-5alice-5bob-6alice-6

Collaboration: reconciling

alicebobtogether-1together-2together-3bob-4alice-4bob-5alice-5bob-6alice-6reconciled!together-7

If you have the possibility to reconcile diverging developments, the history becomes a graph!

Reconciling diverging developments is usually referred to as merge

DVCS concepts and terminology: Repository

Project meta-data. Includes the whole project history

  • information on how to roll back changes
  • authors of changes
  • dates
  • differences between different points in time
  • and so on

Usually, stored in a hidden folder in the root folder of the project

DVCS concepts and terminology: Working Tree

(or worktree, or working directory)

the collection of files (usually, inside a root folder) that constitute the project, excluding the meta-data.

DVCS concepts and terminology: Commit

A saved status of the project.

  • Collects the changes required to transform the previous (parent) commit into the current (differential tracking)
  • Creates a snapshot of the status of the worktree (snapshotting).
  • Records metadata: parent commit, author, date, a message summarizing the changes, and a unique identifier.
  • A commit with no parent is an initial commit.
  • A commit with multiple parents is a merge commit.
alicebob0-797700ainitial1-d71384f2-b8afd293-e4538da4-502963b5-93f38906-4c35f667-9cd95f48-d44b71fmerge10-9bf013c

DVCS concepts and terminology: Branch

A named sequence of commits

default-branchbranch2branch3branch40-9bfcfe11-f1979e02-6700c6c3-5d873754-2d4a5d45-b57ad1f6-151d02f7-38c39f48-2f08a8d9-e01270f11-7720b4412-1e80f2d13-e14f14715-416a1d516-e36dd2e

If no branch has been created at the first commit, a default name is used.

DVCS concepts and terminology: Commit references

To be able to go back in time or change branch, we need to refer to commits *

  • Commit references are also referred to as tree-ishes
  • Every commit has a unique identifier, which is a valid reference
  • A branch name is a valid commit reference (points to the last commit of that branch)

A special commit name is HEAD, which refers to the current commit

  • When committing, the HEAD moves forward to the new commit

Absolute and relative references

Appending ~ and a number i to a valid tree-ish means “i-th parent of this tree-ish”

masterHEAD~8HEAD~7HEAD~6HEAD~5HEAD~4HEAD~3HEAD~2HEAD~1HEAD
b0b0~8b0~7b0~6b0~5b0~4b0~3b0~2b0~1b0

DVCS concepts and terminology: Checkout

The operation of moving to another commit

  • Moving to another branch
  • Moving back in time

Moves the HEAD to the specified target tree-ish

Project evolution example

Let us try to see what happens when ve develop some project, step by step.

  1. first commit
HEAD
default-branch
1
  1. second commit
HEAD
default-branch
2
1

four commits later

HEAD
default-branch
6
5
4
3
2
1

Oh, no, there was a mistake! We need to roll back!

checkout of C4

HEAD
default-branch
6
5
4
3
2
1
  • No information is lost, we can get back to 6 whenever we want to.
  • what if we commit now?

Branching!

HEAD
default-branch
new-branch
6
5
4
3
2
1
7
  • Okay, but there was useful stuff in 5, I’d like to have it into new-branch

Merging!

HEAD
default-branch
new-branch
6
5
4
3
2
1
7
8

Notice that:

  • we have two branches
  • 8 is a merge commit, as it has two parents: 7 and 5
  • the situation is the same regardless that is a single developer going back on the development or multiple developers working in parallel!
  • this is possible because every copy of the repository contains the entire history!

Reference DVCS: Git

De-facto reference distributed version control system

  • Distributed
  • Born in 2005 to replace BitKeeper as SCM for the Linux kernel
    • Performance was a major concern
    • Written in C
  • Developed by Linus Torvalds
    • Now maintained by others
  • Unix-oriented
    • Tracks Unix file permissions
  • Very fast
    • At conception, 10 times faster than Mercurial¹, 100 times faster than Bazaar

¹ Less difference now, Facebook vastly improved Mercurial

Funny historical introduction

Approach: terminal-first

(actually: terminal-only)

Git is a command line tool

Although graphical interfaces exsist, it makes no sense to learn a GUI:

  • they are more prone to future changes than the CLI
  • they add a level of interposition between you and the tool
  • unless they are incomplete, they expose more complexity than what we can deal with in this course
    • what do you do with a checkbox labeled “squash when merging”?
    • and what about recursively checkout submodules?
  • as soon as you learn the CLI, you become so proficient that you get slower when there is a graphical interface in-between

I am assuming minimal knowledge of the shell, please let me know NOW if you’ve never seen it

Configuration

Configuration in Git happens at two level

  • global: the default options, valid system-wide
  • repository: the options specific to a repository. They have precedence over the global settings

Strategy

Set up the global options reasonably, then override them at the repository level, if needed.

git config

The config subcommand sets the configuration options

  • when operated with the --global option, configures the tool globally
  • otherwise, it sets the option for the current repository
    • (there must be a valid repository)
  • Usage: git config [--global] category.option value
    • sets option of category to value

Configuration: main options

As said, --global can be omitted to override the global settings locally

Username and email: user.name and user.email

A name and a contact are always saved as metadata, so they need to be set up

  • git config --global user.name "Your Real Name"
  • git config --global user.email "your.email.address@your.provider"

Default editor

Some operations pop up a text editor. It is convenient to set it to a tool that you know how to use (to prevent, e.g., being “locked” inside vi or vim). Any editor that you can invoke from the terminal works.

  • git config --global core.editor nano

Default branch name

How to name the default branch. Two reasonable choices are main and master

  • git config --global init.defaultbranch master

Initializing a repository

git init

  • Initializes a new repository inside the current directory
  • Reified in the .git folder
  • The location of the .git folder marks the root of the repository
    • Do not nest repositories inside repositories, it is fragile
    • Nested projects are realized via submodules (not discussed in this course)
  • Beware of the place where you issue the command!
    • First use cd to locate yourself inside the folder that contains (or will containe the project)
      • (possibly, first create the folder with mkdir)
    • Then issue git init
    • if something goes awry, you can delete the repository by deleting the .git folder.

Staging

Git has the concept of stage (or index).

  • Changes must be added to the stage to be committed.
  • Commits save the changes included in the stage
    • Files changed after being added to the stage neet to be re-staged
  • git add <files> moves the current state of the files into the stage as changes
  • git reset <files> removes currently staged changes of the files from stage
  • git commit creates a new changeset with the contents of the stage
add
commit
reset
working directory
stage (or index)
repository

Observing the repository status

It is extremely important to understand clearly what the current state of affairs is

  • Which branch are we working on?
  • Which files have been modified?
  • Which changes are already staged?

git status prints the current state of the repository, example output:

❯ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   content/_index.md
        new file:   content/dvcs-basics/_index.md
        new file:   content/dvcs-basics/staging.png

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   layouts/shortcodes/gravizo.html
        modified:   layouts/shortcodes/today.html

Committing

  • Requires an author and an email
    • They can be configured globally (at the computer level):
      • git config --global user.name 'Your Real Name'
      • git config --global user.email 'your@email.com'
    • The global settings can be overridden at the repository level
      • e.g., you want to commit with a different email between work and personal projects
      • git config user.name 'Your Real Name'
      • git config user.email 'your@email.com'
  • Requires a message, using appropriate messages is extremely important
    • If unspecified, the commit does not get performed
    • it can be specified inline with -m, otherwise Git will pop up the default editor
      • git commit -m 'my very clear and explanatory message'
  • The date is recorded automatically
  • The commit identifier (a cryptographic hash) is generated automatically

Default branch

At the first commit, there is no branch and no HEAD.

Depending on the version of Git, the following behavior may happen upon the first commit:

  • Git creates a new branch named master
    • legacy behavior
    • the name is inherited from the default branch name in Bitkeeper
  • Git creates a new branch named master, but warns that it is a deprecated behavior
    • although coming from the Latin “magister” (teacher) and not from the “master/slave” model of asymmetric communication control, many recently prefer main as seen as more inclusive
  • Git refuses to commit until a default branch name is specified
    • modern behavior
    • Requires configuration: git config --global init.defaultbranch default-branch-name

Ignoring files

In general, we do not want to track all the files in the repository folder:

  • Some files could be temporary (e.g., created by the editor)
  • Some files could be regenerable (e.g., compiled binaries and application archives)
  • Some files could contain private information

Of course, we could just not add them, but the error is around the corner!

It would be much better to just tell Git to ignore some files.

This is achieved through a special .gitignore file.

  • the file must be named .gitignore, names like foo.gitignore or gitignore.txt won’t work
    • A good way to create/append to this file is via echo whatWeWantToIgnore >> .gitignore (multiplatform command)
  • it is a list of paths that git will ignore (unless git add is called with the --force option)
  • it is possible to add exceptions

.gitignore example

# ignore the bin folder and all its contents
bin/
# ignore every pdf file
*.pdf
# rule exception (beginning with a !): pdf files named 'myImportantFile.pdf' should be tracked
!myImportantFile.pdf

Going to a new line is more complicated than it seems

Going to a new line is a two-phased operation:

  1. Bring the cursor back to the begin of the line
  2. Bring the cursor down one line

In electromechanic teletypewriters (and in typewriters, too), they were two distinct operations:

  1. Carriage Return (bringing the carriage to its leftmost position)
  2. Line Feed (rotating the carriage of one step)

A teletypewriter

teletypewriter

  • driving them text without drivers required to explicitly send carriage return and line feed commands

Newlines in the modern world

Terminals were designed to behave like virtual teletypewriters

  • Indeed, they are still called TTY (TeleTYpewriter)
  • In Unix-like systems, they are still implemented as virtual devices
    • If you have MacOS X or Linux, you can see which virtual device backs your current terminal using tty
  • At some point, Unix decided that LF was sufficient in virtual TTYs to go to a new line
    • Probably inspired by the C language, where \n means “newline”
    • The behaviour can still be disabled
we would get
            lines
                 like these

Consequence:

  • Windows systems go to a new line with a CR character followed by an LF character: \r\n
  • Unix-like systems go to a new line with an LF character: \n
  • Old Mac systems used to go to a new line with a CR character: \r
    • Basically they decided to use a single character like Unix did, but made the opposite choice
    • MacOS X is POSIX-compliant, uses \n

Newlines and version control

If your team uses multiple OSs, it is likely that, by default, the text editors use either LF (on Unix) or CRLF

It is also very likely that, upon saving, the whole file gets rewritten with the “locally correct” line endings

  • This however would result in all the lines being changed!
  • The differential would be huge
  • Conflicts would arise everywhere!

Git tries to tackle this issue by converting the line endings so that they match the initial line endings of the file, resulting in repositories with illogically mixed line endings (depending on who created a file first) and loads of warnings about LF/CRLF conversions.

Line endings should instead be configured per file type!

.gitattributes

  • A sensible strategy is to use LF everywhere, but for Windows scripts (bat, cmd, ps1)
  • Git can be configured through a .gitattributes file in the repository root
  • Example:
* text=auto eol=lf
*.[cC][mM][dD] text eol=crlf
*.[bB][aA][tT] text eol=crlf
*.[pP][sS]1 text eol=crlf

Dealing with removal and renaming of files

  • The removal of a file is a legit change
  • As we discussed, git add adds a change to the stage
  • the change can be a removal!

git add someDeletedFile is a correct command, that will stage the fact that someDeletedFile does not exist anymore, and its deletion must be registered at the next commit.

  • File renaming is equivalent to file deletion and file creation where, incidentally, the new file has the same content of the deleted file
  • To stage the rinomination of file foo into bar:
    • git add foo bar
    • it records that foo has been deleted and bar has been created
    • Git is smart enough to understand that it is a name change, and will deal with it efficiently

Visualizing the history

Of course, it is useful to visualize the history of commits. Git provides a dedicated sub-command:

git log

  • opens a navigable interactive view of the history from the HEAD commit (the current commit) backwards
    • Press Q
  • compact visualization: git log --oneline
  • visualization of all branches: git log --all
  • visualization of a lateral graph: git log --graph
  • compact visualization of all branches with a graph: git log --oneline --all --graph

example output of git log --oneline --all --graph

* d114802 (HEAD -> master, origin/master, origin/HEAD) moar contribution
| * edb658b (origin/renovate/gohugoio-hugo-0.94.x) ci(deps): update gohugoio/hugo action to v0.94.2
|/  
* 4ce3431 ci(deps): update gohugoio/hugo action to v0.94.1
* 9efa88a ci(deps): update gohugoio/hugo action to v0.93.3
* bf32a8b begin with build slides
* b803a65 lesson 1 looks ready
* 6a85f8f ci(deps): update gohugoio/hugo action to v0.93.2
* b474d2a write more on the introductory lesson
* 8a7105e ci(deps): update gohugoio/hugo action to v0.93.1
* 6e40642 begin writing the first lesson

Referring to commits: <tree-ish>es

In git, a reference to a commit is called <tree-ish>. Valid <tree-ish>es are:

  • Full commit hashes, such as b82f7567961ba13b1794566dde97dda1e501cf88.
  • Shortened commit hashes, such as b82f7567.
  • Branch names, in which case the reference is to the last commit of the branch.
  • HEAD, a special name referring to the current commit (the head, indeed).
  • Tag names (we will discuss what a tag is later on).

Relative references

It is possible to build relative references, e.g., “get me the commit before this <tree-ish>”, by following the commit <tree-ish> with a tilde (~) and with the number of parents to get to:

  • <tree-ish>~STEPS where STEPS is an integer number produces a reference to the STEPS-th parent of the provided <tree-ish>:

    • b82f7567~1 references the parent of commit b82f7567.
    • some_branch~2 refers to the parent of the parent of the last commit of branch some_branch.
    • HEAD~3 refers to the parent of the parent of the parent of the current commit.
  • In case of merge commits (with multiple parents), ~ selects the first one

  • Selection of parents can be performed with caret in case of multiple parents (^)

Visualizing the differences

We want to see which differences a commit introduced, or what we modified in some files of the work tree

Git provides support to visualize the changes in terms of modified lines through git diff:

  • git diff shows the difference between the stage and the working tree
    • namely, what you would stage if you perform a git add
  • git diff --staged shows the difference between HEAD and the working tree
  • git diff <tree-ish> shows the difference between <tree-ish> and the working tree (stage excluded)
  • git diff --staged <tree-ish> shows the difference between <tree-ish> and the working tree, including staged changes
  • git diff <from> <to>, where <from> and <to> are <tree-ish>es, shows the differences between <from> and <to>

git diff Example output:

diff --git a/.github/workflows/build-and-deploy.yml b/.github/workflows/build-and-deploy.yml
index b492a8c..28302ff 100644
--- a/.github/workflows/build-and-deploy.yml
+++ b/.github/workflows/build-and-deploy.yml
@@ -28,7 +28,7 @@ jobs:
           # Idea: the regex matcher of Renovate keeps this string up to date automatically
           # The version is extracted and used to access the correct version of the scripts
           USES=$(cat <<TRICK_RENOVATE
-          - uses: gohugoio/hugo@v0.94.1
+          - uses: gohugoio/hugo@v0.93.3
           TRICK_RENOVATE
           )
           echo "Scripts update line: \"$USES\""

The output is compatible with the Unix commands diff and patch

Still, binary files are an issue! Tracking the right files is paramount.

Navigation of the history concretely means to move the head (in Git, HEAD) to arbitrary points of the history

In Git, this is performed with the checkout commit:

  • git checkout <tree-ish>
    • Unless there are changes that could get lost, moves HEAD to the provided <tree-ish>
    • Updates all tracked files to their version at the provided <tree-ish>

The command can be used to selectively checkout a file from another revision:

  • git checkout <tree-ish> -- foo bar baz
    • Restores the status of files foo, bar, and baz from commit <tree-ish>, and adds them to the stage (unless there are uncommitted changes that could be lost)
    • Note that -- is surrounded by whitespaces, it is not a --foo option, it is just used as a separator between the <tree-ish> and the list of files
      • the files could be named as a <tree-ish> and we need disambiguation

Detached head

Git does not allow multiple heads per branch (other DVCS do, in particular Mercurial): for a commit to be valid, HEAD must be at the “end” of a branch (on its last commit), as follows:

HEAD
master
10
9
8
7
6
5
4
3
2
1

When an old commit is checked out this condition doesn’t hold!

If we run git checkout HEAD~4:

HEAD
master
10
9
8
7
6
5
4
3
2
1

The system enters a special workmode called detached head.

When in detached head, Git allows to make commits, but they are lost!

(Not really, but to retrieve them we need git reflog and git cherry-pick, that we won’t discuss)

Branches as labels

To be able to start new development lines, we need to create a branch.

In Git, branches work like movable labels:

  • Upon creation, they are attached to the same commit HEAD refers to
  • If a new commit is made when HEAD is attached to them, they move along with HEAD

Branch creation

Branches are created with git branch branch_name

HEAD
master
10
9
8
7
6
5
4
3
2
1

⬇️ git branch new-experiment ⬇️

HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1

HEAD does not attach to the new branch by default, an explicit checkout is required.

Creating branches when in DETACHED_HEAD

Creating new branches allows to store changes made when we are in DETACHED_HEAD state.

HEAD
master
10
9
8
7
6
5
4
3
2
1

⬇️ git checkout HEAD~4 ⬇️

HEAD
master
10
9
8
7
6
5
4
3
2
1
  • DETACHED_HEAD: our changes will be discarded, unless…

➡️ Next: git branch new-experiment ➡️

Creating branches when in DETACHED_HEAD

⬇️ git branch new-experiment ⬇️

HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1

HEAD is still detached though, we need to attach it to the new branch for it to store our commits

➡️ Next: git checkout new-experiment ➡️

Creating branches when in DETACHED_HEAD

⬇️ git checkout new-experiment ⬇️

HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1
  • New commits will now be stored!

⬇️ [changes] + git add + git commit ⬇️

HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1
11

$\Rightarrow$ HEAD brings our branch forward with it!

One-shot branch creation

As you can imagine, creating a new branch and attaching HEAD to the freshly created branch is pretty common

As customary for common operations, a short-hand is provided: git checkout -b new-branch-name

  • Creates new-branch-name from the current position of HEAD
  • Attaches HEAD to new-branch-name
HEAD
master
10
9
8
7
6
5
4
3
2
1

⬇️ git checkout -b new-experiment ⬇️

HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1

Merging branches

Reunifying diverging development lines is much trickier than spawning new development lines

In other words, merging is much trickier than branching

  • Historically, with centralized version control systems, merging was considered extremely delicate and difficult
  • The distributed version control systems promoted frequent, small-sized merges, much easier to deal with
  • Conflicts can still arise!
    • what if we change the same line of code in two branches differently?

In Git, git merge target merges the branch named target into the current branch (HEAD must be attached)

Merge visual example

HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1
12
11

⬇️ git merge master ⬇️

HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1
13
12
11

Fast forwarding

Consider this situation:

HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1
  • We want new-experiment to also have the changes from C7, to C10 (to be up to date with master)
  • master contains all the commits of new-experiment
  • We don’t really need a merge commit, we can just move new-experiment to point it to C6
  • $\Rightarrow$ This is called a fast-forward
    • It is the default behavior in Git when merging branches where the target is the head plus something
HEAD
master
new-experiment
10
9
8
7
6
5
4
3
2
1

Merge conflicts

Git tries to resolve most conflicts by itself

  • It’s pretty good at it
  • but things can still require human intervention

In case of conflict on one or more files, Git marks the subject files as conflicted, and modifies them adding merge markers:

<<<<<<< HEAD
Changes made on the branch that is being merged into,
this is the branch currently checked out (HEAD).
=======
Changes made on the branch that is being merged in.
>>>>>>> other-branch-name
  • The user should change the conflicted files so that they reflect the final desired status
  • The (now fixed) files should get added to the stage with git add
  • The merge operation can be concluded through git commit
    • In case of merge, the message is pre-filled in
    • If the message is okay, git commit --no-edit can be used to use it without editing

Good practices

Avoiding merge conflicts is much better than solving them

Although they are unavoidable in some cases, they can be minimized by following a few good practices:

  • Do not track files that can be generated
    • This is harmful under many points of view, and merge conflicts are one
  • Do make many small commits
    • Each coherent change should be reified into a commit
    • Even very small changes, like modification of the whitespaces
    • Smaller commits help Git better figure out what changed and in which order, generally leading to finer grained (and easier to solve) conflicts
  • Do enforce style rules across the team
    • Style changes are legitimate changes
    • Style is often enforced at the IDE level
    • Minimal logical changes may cause widespread changes due to style modifications
  • Do pay attention to newlines
    • Different OSs use different newline characters
    • Git tries to be smart about it, often failing catastrophically

Exercise:

  • Fork the repository at: https://github.com/APICe-at-DISI/OOP-git-merge-conflict-test
  • Clone the repository locally
  • Merge the branch feature into master
  • There will be a conflict!
    • If you know Java, solve it in such a way that the program prints both the author and the number of processors
    • If you don’t, edit the file to make it appear like something that could work
  • Once the file has the look you want, complete the merge
  • Push to your fork
  • Observe that you can open a pull request
    • Do not open it, or at least not towards the original repository
    • (I won’t pull anyway ;) )

Branches as labels: deletion

Branches work like special labels that move if a commit is performed when HEAD is attached.

Also, the history tracked by git is a directed acyclic graph (each commit has a reference to its parents)

$\Rightarrow$ Branches can be removed without information loss, as far as there is at least another branch from which all the commits of the deleted branch are reachable

Safe branch deletion is performed with git branch -d branch-name (fails if there is information loss).

Branch deletion example

HEAD
master
fix/bug22
feat/serverless
10
9
8
7
6
5
4
3
2
1
11

⬇️ git branch -d fix/bug22 ⬇️

HEAD
master
feat/serverless
10
9
8
7
6
5
4
3
2
1
11

No commit is lost, branch fix/bug22 is removed

What about git branch -d feat/serverless?

It would fail with an error message, as 11 would be lost

Importing a repository

  • We can initialize an emtpy repository with git init
  • But most of the time we want to start from a local copy of an existing repository

Git provides a clone subcommand that copies the whole history of a repository locally

  • git clone URI destination creates the folder destination and clones the repository found at URI
    • If destination is not empty, fails
    • if destination is omitted, a folder with the same namen of the last segment of URI is created
    • URI can be remote or local, Git supports the file://, https://, and ssh protocols
      • ssh recommended when available
  • The clone subcommand checks out the remote branch where the HEAD is attached (default branch)

Examples:

  • git clone /some/repository/on/my/file/system destination
    • creates a local folder called destination and copies the repository from the local directory
  • git clone https://somewebsite.com/someRepository.git myfolder
    • creates a local folder called myfolder and copies the repository located at the specified URL
  • git clone user@sshserver.com:SomePath/SomeRepo.git
    • creates a local folder called SomeRepo and copies the repository located at the specified URL

Remotes

  • Remotes are the known copies of the repository that exist somewhere (usually in the Internet)
  • Each remote has a name and a URI
  • When a repository is created via init, no remote is known.
  • When a repository is imported via clone, a remote called origin is created automatically

Non-local branches can be referenced as remoteName/branchName

The remote subcommand is used to inspect and manage remotes:

  • git remote -v lists the known remotes

  • git remote add a-remote URI adds a new remote named a-remote and pointing to URI

  • git remote show a-remote displays extended information on a-remote

  • git remote remove a-remote removes a-remote (it does not delete information on the remote, it locally forgets that it exits)

Upstream branches

Remote branches can be associated with local branches, with the intended meaning that the local and the remote branch are intended to be two copies of the same branch

  • A remote branch associated to a local branch is its upstream branch
  • upstream branches can be configured via git branch --set-upstream-to=remote/branchName
    • e.g.: git branch --set-upstream-to=origin/develop sets the current branch upstream to origin/develop
  • When a repository is initialize by clone, its default branch is checked out locally with the same name it has on the remote, and the remote branch is automatically set as upstream

Actual result of git clone git@somesite.com/repo.git

local
somesite.com/repo.git
origin
HEAD
master
1
2
3
4
5
6
7
8
9
10
HEAD
master
feat/serverless
1
2
3
4
5
6
7
8
9
10
11
12
  • git@somesite.com/repo.git is saved as origin
  • The main branch (the branch where HEAD is attached, in our case master) on origin gets checked out locally with the same name
  • The local branch master is set up to track origin/master as upstream
  • Additional branches are fetched (they are known locally), but they are not checked out

Importing remote branches

git branch (or git checkout -b) can checkout remote branches locally once they have been fetched.

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
9
10
HEAD
master
feat/serverless
1
2
3
4
5
6
7
8
9
10
11
12

➡️ git checkout -b imported-feat origin/feat/serverless ➡️

⬇️ git checkout -b imported-feat origin/feat/serverless ⬇️

local
somesite.com/repo.git
origin
master
imported-feat
HEAD
7
11
12
1
2
3
4
5
6
8
9
10
HEAD
master
feat/serverless
1
2
3
4
5
6
7
8
9
10
11
12
  • A new branch imported-feat is created locally, and origin/feat/serverless is set as its upstream

Importing remote branches

  • It is customary to reuse the upstream name if there are no conflicts
    • git checkout -b feat/new-client origin/feat/new-client
  • Modern versions of Git automatically checkout remote branches if there are no ambiguities:
    • git checkout feat/new-client
    • creates a new branch feat/new-client with the upstream branch set to origin/feat/new-client if:
      • there is no local branch named feat/new-client
      • there is no ambiguity with remotes
    • Quicker if you are working with a single remote (pretty common)

Example with multiple remotes

somewherelse.org/repo.git
master
10
9
8
4
3
2
1
HEAD
somesite.com/repo.git
HEAD
4
3
2
1
7
6
5
master
feat/serverless

➡️ Next: git clone git@somesite.com/repo.git ➡️

⬇️ git clone git@somesite.com/repo.git ⬇️

local
somesite.com/repo.git
origin
HEAD
master
1
2
3
4
somewherelse.org/repo.git
master
10
9
8
4
3
2
1
HEAD
HEAD
master
feat/serverless
1
2
3
4
5
6
7

➡️ Next: git checkout -b feat/serverless origin/feat/serverless ➡️

⬇️ git checkout -b feat/serverless origin/feat/serverless ⬇️

local
somesite.com/repo.git
origin
HEAD
master
feat/serverless
1
2
3
4
5
6
7
somewherelse.org/repo.git
master
10
9
8
4
3
2
1
HEAD
HEAD
master
feat/serverless
1
2
3
4
5
6
7

➡️ Next: git remote add other git@somewhereelse.org/repo.git ➡️

⬇️ git remote add other git@somewhereelse.org/repo.git ⬇️

local
somesite.com/repo.git
origin
other
HEAD
master
feat/serverless
1
2
3
4
5
6
7
somewherelse.org/repo.git
master
10
9
8
4
3
2
1
HEAD
HEAD
master
feat/serverless
1
2
3
4
5
6
7

➡️ Next: git checkout -b other-master other/master ➡️

⬇️ git checkout -b other-master other/master ⬇️

local
somewherelse.org/repo.git
somesite.com/repo.git
HEAD
master
feat/serverless
other-master
1
2
3
4
8
9
10
5
6
7
origin
other
master
HEAD
1
2
3
4
8
9
10
HEAD
master
feat/serverless
1
2
3
4
5
6
7

Multiple remotes

You can operate with multiple remotes! Just remember: branch names must be unique for every repository

  • If you want to track origin/master and anotherRemote/master, you need two local branches with diverse names

Fetching updates

To check if a remote has any update available, git provides th git fetch subcommand.

  • git fetch a-remote checks if a-remote has any new information. If so, it downloads it.
    • Note: it does not merge it anywhere, it just memorizes its current status
  • git fetch without a remote:
    • if HEAD is attached and the current branch has an upstream, then the remote that is hosting the upstream branch is fetched
    • otherwise, origin is fetched, if present
  • To apply the updates, is then necessary to use manually use merge

The new information fetched includes new commits, branches, and tags.

Fetch + merge example

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
HEAD
master
1
2
3
4
5
6
7
8

➡️ Next: Changes happen on somesite.com/repo.git and on our repository concurrently ➡️

Fetch + merge example

⬇️ Changes happen on somesite.com/repo.git and on our repository concurrently ⬇️

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
11
12
HEAD
master
1
2
3
4
5
6
7
8
9
10

➡️ git fetch && git merge origin/master (assuming no conflicts or conflicts resolved) ➡️

Fetch + merge example

⬇️ git fetch && git merge origin/master (assuming no conflicts or conflicts resolved) ⬇️

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
11
12
13
9
10
HEAD
master
1
2
3
4
5
6
7
8
9
10

If there had been no updates locally, we would have experienced a fast-forward

git pull

Fetching the remote with the upstream branch and then merging is extremely common, so common that there is a special subcommand that operates.

git pull is equivalent to git fetch && git merge FETCH_HEAD

  • git pull remote is the same as git fetch remote && git merge FETCH_HEAD
  • git pull remote branch is the same as git fetch remote && git merge remote/branch

git pull is more commonly used than git fetch + git merge, still, it is important to understand that it is not a primitive operation

Sending local changes

Git provides a way to send changes to a remote: git push remote branch

  • sends the current branch changes to remote/branch, and updates the remote HEAD
  • if the branch or the remote is omitted, then the upstream branch is used
  • push requires writing rights to the remote repository
  • push fails if the pushed branch is not a descendant of the destination branch, which means:
    • the destination branch has work that is not present in the local branch
    • the destination branch cannot be fast-forwarded to the local branch
    • the commits on the destination branch are not a subset of the ones on the local branch

Pushing tags

By default, git push does not send tags

  • git push --tags sends only the tags
  • git push --follow-tags sends commits and then tags

Example with git pull and git push

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
HEAD
master
1
2
3
4
5
6
7
8

➡️ Next: [some changes] git add . && git commit ➡️

Example with git pull and git push

⬇️ [some changes] git add . && git commit ⬇️

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
9
HEAD
master
1
2
3
4
5
6
7
8

➡️ Next: git push ➡️

Example with git pull and git push

⬇️ git push ⬇️

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
9
HEAD
master
1
2
3
4
5
6
7
8
9
  • Everything okay! origin/master was a subset of master
  • The remote HEAD can be fast-forwarded

➡️ Next: someone else pushes a change ➡️

Example with git pull and git push

⬇️ someone else pushes a change ⬇️

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
9
HEAD
master
1
2
3
4
5
6
7
8
9
10

➡️ Next: [some changes] git add . && git commit ➡️

Example with git pull and git push

⬇️ [some changes] git add . && git commit ⬇️

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
9
11
HEAD
master
1
2
3
4
5
6
7
8
9
10

➡️ Next: git push ➡️

Example with git pull and git push

⬇️ git push ⬇️

ERROR

To somesite.com/repo.git
 ! [rejected]        master -> master (fetch first)
error: failed to push some refs to 'somesite.com/repo.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
  • master is not a superset of origin/master
    • commit 10 is in origin/master but not in master, preventing a remote fast-forward
  • How to solve?
    • (Git’s error explains it pretty well)

➡️ Next: git pull ➡️

Example with git pull and git push

⬇️ git pull (assuming no merge conflicts, or after conflict resolution) ⬇️

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
9
11
12
10
HEAD
master
1
2
3
4
5
6
7
8
9
10
  • Now master is a superset of origin/master! (all the commits in origin/master, plus 11 and 12)

➡️ Next: git push ➡️

Example with git pull and git push

⬇️ git push ⬇️

local
somesite.com/repo.git
origin
master
HEAD
1
2
3
4
5
6
7
8
9
11
12
10
HEAD
master
1
2
3
4
5
6
7
8
9
10
12
11

The push suceeds now!

Associating symbolic names to commits

It is often handful to associate some commits with a symbolic name, most of the time to assign versions.

  • e.g., identify commit 8d400c0 as version 1.2.3

Although in principle branches could be used to do so, their nature is of moving labels: when HEAD is attached, new commits move the branch forward. We would like to have branches to which HEAD cannot attach (hence, they can’t be moved from their creation point).

HEAD
master
10
9
8
7
6
5
4
3
2
1

⬇️ git checkout C4 && git branch 1.2.3 && git checkout master ⬇️

HEAD
master
1.2.3
10
9
8
7
6
5
4
3
2
1

Branches as attachable (and movable) labels

HEAD
master
1.2.3
10
9
8
7
6
5
4
3
2
1

Looks good, but if we do something like: ⬇️ git checkout 1.2.3 [some changes] git commit ⬇️

HEAD
master
1.2.3
10
9
8
7
6
5
4
3
2
1
11

Our version moved, we never want this to happen!

Tagging

The tag subcommand to create permanent labels attached to commits. Tags come in two fashions:

  • Lightweight tags are very similar to a “permanent branch”: pointers to commits that never change
  • Annotated tags (option -a) store additional information: a message, and, optionally, a signature (option -s/-u)
HEAD
master
10
9
8
7
6
5
4
3
2
1

➡️ git checkout C4 && git tag 1.2.3 ➡️

⬇️ git checkout C4 && git tag 1.2.3 ⬇️

HEAD
master
1.2.3
10
9
8
7
6
5
4
3
2
1

HEAD cannot attach to tags!

Pushing tags

Tags are not pushed by default.

To push tags, use git push --tags after a normal push.

Alternatively, use git push --follow-tags to push both commits and tags.

Centralized Version Control Systems

Centralized VCS

Decentralized VCS

Decentralized VCS

Real-world DVCS

Real-world Decentralized VCS

Git repository hosting

Several services allow the creation of shared repositories on the cloud. They enrich the base git model with services built around the tool:

  • Forks: copies of a repository associated to different users/organizations
  • Pull requests (or Merge requests): formal requests to pull updates from forks
    • repositories do not allow pushes from everybody
    • what if we want to contribute to a project we cannot push to?
      • fork the repository (we own that copy)
      • write the contribution and push to our fork
      • ask the maintainers of the original repository to pull from our fork
  • Issue tracking

Most common services

  • GitHub
    • Replaced Sourceforge as the de-facto standard for open source projects hosting
    • Academic plan
  • GitLab
    • Available for free as self-hosted
    • Userbase grew when Microsoft acquired GitHub
  • Bitbucket
    • From Atlassian
    • Well integrated with other products (e.g., Jira)

GitHub

  • Hosting for git repositories
  • Free for open source
  • Academic accounts
  • De-facto standard for open source projects
  • One static website per-project, per-user, and per-organization
    • (a feature exploited by these slides)

repositories as remotes: authentication

repositories are uniquely identified by an owner and a repository name

  • owner/repo is a name unique to every repository

supports two kind of authentications:

HTTPS – Requires authentication via token

  • The port of should include a graphical authenticator, otherwise:
  • Recommended to users with no Unix shell

Secure Shell (SSH) – Requires authentication via public/private key pair

  • Recommended to / users and to those with a working SSH installation
  • The same protocol used to open remote terminals on other systems
  • Tell Github your public key and use the private (and secret) key to authenticate

Configuration of OpenSSH for

Disclaimer: this is a “quick and dirty” way of generating and using SSH keys.

You are warmly recommended to learn how it works and the best security practices.

  1. If you don’t already have one, generate a new key pair
    • ssh-keygen
    • You can confirm the default options
    • You can pick an empty password
      • your private key will be stored unencrypted on your file system
      • please understand the associated security issues, if you don’t, use a password.
  2. Obtain your public key
    • cat ~/.ssh/id_rsa.pub
    • Looks something like:
    ssh-rsa AAAAB3Nza<snip, a lot of seemingly random chars>PIl+qZfZ9+M= you@your_hostname
    
  3. Create a new key at https://github.com/settings/ssh/new
    • Provide a title that allows you to identify the key
    • Paste your key

You are all set! Enjoy your secure authentication.

Continuous Integration

The practice of integrating code with a main development line continuously
Verifying that the build remains intact

  • Requires build automation to be in place
  • Requires testing to be in place
  • Pivot point of the DevOps practices
  • Historically introduced by the extreme programming (XP) community
  • Now widespread in the larger DevOps community

Traditional integration

integration-traditional

The Integration Hell

  • Traditional software development takes several months for “integrating” a couple of years of development
  • The longer there is no integrated project, the higher the risk

Continuous integration

integration-traditional

Microreleases and protoduction

  • High frequency integration may lead to high frequency releases
    • Possibly, one per commit
    • Of course, versioning must be appropriate…

Traditionally, protoduction is jargon for a prototype that ends up in production

  • Traditionally used with a negative meaning
    • It implied software
      • unfinished,
      • unpolished,
      • badly designed
    • Very common, unfortunately
  • This si different in a continuously integrated environment
    • Incrementality is fostered
    • Partial features are up to date with the mainline

Intensive operations should be elsewhere

  • The build process should be rich and fast
  • Operations requiring a long time should be automated
    • And run somewhere else than devs’ PCs

Continuous integration software

Software that promotes CI practices should:

  • Provide clean environments for compilation/testing
  • Provide a wide range of environments
    • Matching the relevant specifications of the actual targets
  • High degree of configurability
  • Possibly, declarative configuration
  • A notification system to alert about failures or issues
  • Support for authentication and deployment to external services

Plenty of integrators on the market

Circle CI, Travis CI, Werker, done.io, Codefresh, Codeship, Bitbucket Pipelines, GitHub Actions, GitLab CI/CD Pipelines, JetBrains TeamCity…

Core concepts

Naming and organization is variable across platforms, but in general:

  • One or more pipelines can be associated to events
    • For instance, a new commit, an update to a pull request, or a timeout
  • Every pipeline is composed of a sequence of operations
  • Every operation could be composed of sequential or parallel sub-operations
  • How many hierarchical levels are available depends on the specific platform
    • GitHub Actions: workflow $\Rightarrow$ job $\Rightarrow$ step
    • Travis CI: build $\Rightarrow$ stage $\Rightarrow$ job $\Rightarrow$ phase
  • Execution happens in a fresh system (virtual machine or container)
    • Often containers inside virtual machines
    • The specific point of the hierarchy at which the VM/container is spawned depends on the CI platform

Pipeline design

In essence, designing a CI system is designing a software construction, verification, and delivery pipeline with the abstractions provided by the selected provider.

  1. Think of all the operations required starting from one or more blank VMs
    • OS configuration
    • Software installation
    • Project checkout
    • Compilation
    • Testing
    • Secrets configuration
    • Delivery
  2. Organize them in a dependency graph
  3. Model the graph with the provided CI tooling

Configuration can grow complex, and is usually stored in a YAML file
(but there are exceptions, JetBrains TeamCity uses a Kotlin DSL).

GitHub Actions: Structure

  • Workflows react to events, launching jobs
    • Multiple workflows run in parallel, unless explicitly restricted
  • Jobs of the same workflow run a sequence of steps
    • Multiple jobs run in parallel, unless a dependency among them is explicitly declared
    • Concurrency limits can be imposed across workflows
    • They can communicate via outputs
  • Steps of the same job run sequentially
    • They can communicate via outputs

GitHub Actions: Configuration

Workflows are configured in YAML files located in the default branch of the repository in the .github/workflows folder.

One configuration file $\Rightarrow$ one workflow

For security reasons, workflows may need to be manually activated in the Actions tab of the GitHub web interface.

GitHub Actions: Runners

Executors of GitHub actions are called runners: virtual machines (hosted by GitHub) with the GitHub Actions runner application installed.

Note: the GitHub Actions application is open source and can be installed locally, creating “self-hosted runners”. Self-hosted and GitHub-hosted runners can work together.

Upon their creation, runners have a default environment, which depends on their operating system

Convention over configuration

Several CI systems inherit the “convention over configuration principle.

For instance, by default (with an empty configuration file) Travis CI builds a Ruby project using rake.

GitHub actions does not adhere to the principle: if left unconfigured, the runner does nothing (it does not even clone the repository locally).

Probable reason: Actions is an all-round repository automation system for GitHub, not just a “plain” CI/CD pipeline

$\Rightarrow$ It can react to many different events, not just changes to the git repository history

GHA: basic workflow structure

Minimal, simplified workflow structure:

# Mandatory workflow name
name: Workflow Name
on: # Events that trigger the workflow
jobs: # Jobs composing the workflow, each one will run on a different runner
    Job-Name: # Every job must be named
        # The type of runner executing the job, usually the OS
        runs-on: runner-name
        steps: # A list of commands, or "actions"
            - # first step
            - # second step
    Another-Job: # This one runs in parallel with Job-Name
        runs-on: '...'
        steps: [ ... ]

DRY with YAML

We discussed that automation / integration pipelines are part of the software

  • They are subject to the same (or even higher) quality standards
  • All the good engineering principles apply!

YAML is often used by CI integrators as preferred configuration language as it enables some form of DRY:

  1. Anchors (& / *)
  2. Merge keys (<<:)
hey: &ref
  look: at
  me: [ "I'm", 'dancing' ]
merged:
  foo: *ref
  <<: *ref
  look: to

Same as:

hey: { look: at, me: [ "I'm", 'dancing' ] }
merged: { foo: { look: at, me: [ "I'm", 'dancing' ] }, look: to, me: [ "I'm", 'dancing' ] }

GitHub Actions’ actions

GHA’s YAML parser does not support standard YAML anchors and merge keys

(it is a well-known limit with an issue report open since ages)

GHA achieves reuse via:

  • actions”: reusable parameterizable steps
    • JavaScript (working on any OS)
    • Docker container-based (linux only)
    • Composite (assemblage of other actions)
  • reusable workflows”: reusable and parameterizable jobs

Many actions are provided by GitHub directly, and many are developed by the community.

Workflow minimal example

# This is a basic workflow to help you get started with Actions
name: Example workflow

# Controls when the workflow will run
on:
  push:
    tags: '*'
    branches-ignore: # Pushes on these branches won't start a build
      - 'autodelivery**'
      - 'bump-**'
      - 'renovate/**'
    paths-ignore: # Pushes that change only these file won't start the workflow
      - 'README.md'
      - 'CHANGELOG.md'
      - 'LICENSE'
  pull_request:
    branches: # Only pull requests based on these branches will start the workflow
      - master
  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

Workflow minimal example

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  Default-Example:
    # The type of runner that the job will run on
    runs-on: macos-latest
    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@d632683dd7b4114ad314bca15554477dd762a938
      # Runs a single command using the runners shell
      - name: Run a one-line script
        run: echo Hello from a ${{ runner.os }} machine!
      # Runs a set of commands using the runners shell
      - name: Run a multi-line script
        run: |
          echo Add other actions to build,
          echo test, and deploy your project.          

Workflow minimal example

  Explore-GitHub-Actions:
    runs-on: ubuntu-latest
    steps:
      - run: echo "🎉 The job was automatically triggered by a ${{ github.event_name }} event."
      - run: echo "🐧 This job is now running on a ${{ runner.os }} server hosted by GitHub!"
      - run: echo "🔎 The name of your branch is ${{ github.ref }} and your repository is ${{ github.repository }}."
      - name: Check out repository code
        uses: actions/checkout@v4
      - run: echo "💡 The ${{ github.repository }} repository has been cloned to the runner."
      - run: echo "🖥️ The workflow is now ready to test your code on the runner."
      - name: List files in the repository
        run: ls ${{ github.workspace }}
      - run: echo "🍏 This job's status is ${{ job.status }}."
      # Steps can be executed conditionally
      - name: Skipped conditional step
        if: runner.os == 'Windows'
        run: echo this step won't run, it has been excluded!
      - run: |
          echo This is
          echo a multi-line
          echo script.          

Workflow minimal example

  Conclusion:
    runs-on: windows-latest
    # Jobs may require other jobs
    needs: [ Default-Example, Explore-GitHub-Actions ]
    # Typically, steps that follow failed steps won't execute.
    # However, this behavior can be changed by using the built-in function "always()"
    if: always()
    steps:
      - name: Run something on powershell
        run: echo By default, ${{ runner.os }} runners execute with powershell
      - name: Run something on bash
        shell: bash
        run: echo However, it is allowed to force the shell type and there is a bash available for ${{ runner.os }} too.
      

GHA expressions

GitHub Actions allows expressions to be included in the workflow file

  • Syntax: ${{ <expression> }}
  • Special rule: if: conditionals are automatically evaluated as expressions, so ${{ }} is unnecessary
    • if: <expression> works just fine

The language is rather limited, and documented at

GHA Expressions Types

Type Literal Number coercion String coercion
Null null 0 ''
Boolean true or false true: 1, false: 0 'true' or 'false'
String '...' (mandatorily single quoted) Javascript’s parseInt, with the exception that '' is 0 none
JSON Array unavailable NaN error
JSON Object unavailable NaN error

Arrays and objects exist and can be manipulated, but cannot be created

GHA Expressions Operators

  • Grouping with ( )
  • Array access by index with [ ]
  • Object deference with .
  • Logic operators: not !, and &&, or ||
  • Comparison operators: ==, !=, <, <=, >, >=

GHA Expressions Functions

Functions cannot be defined. Some are built-in, their expressivity is limited. They are documented at

https://docs.github.com/en/actions/learn-github-actions/expressions#functions

Job status check functions

  • success(): true if none of the previous steps failed
    • By default, every step has an implicit if: success() conditional
  • always(): always true, causes the step evaluation even if previous failed, but supports combinations
    • always() && <expression returning false> evaluates the expression and does not run the step
  • cancelled(): true if the workflow execution has been canceled
  • failure(): true if a previous step of any previous job has failed

The GHA context

The expression can refer to some objects provided by the context. They are documented at

https://docs.github.com/en/actions/learn-github-actions/contexts

Some of the most useful are the following

  • github: information on the workflow context
    • .event_name: the event that triggered the workflow
    • .repository: repository name
    • .ref: branch or tag that triggered the workflow
      • e.g., refs/heads/<branch> refs/tags/<tag>
  • env: access to the environment variables
  • steps: access to previous step information
    • .<step id>.outputs.<output name>: information exchange between steps
  • runner:
    • .os: the operating system
  • secrets: access to secret variables (in a moment…)
  • matrix: access to the build matrix variables (in a moment…)

Checking out the repository

By default, GitHub actions’ runners do not check out the repository

  • Actions may not need to access the code
    • e.g., Actions automating issues, projects

It is a common and non-trivial operation (the checked out version must be the version originating the workflow), thus GitHub provides an action:

      - name: Check out repository code
        uses: actions/checkout@v4

Since actions typically do not need the entire history of the project, by default the action checks out only the commit that originated the workflow (--depth=1 when cloning)

  • Shallow cloning has better performance
  • $\Rightarrow$ It may break operations that rely on the entire history!
    • e.g., the git-sensitive semantic versioning system

Also, tags don’t get checked out

Checking out the whole history

    - name: Checkout with default token
      uses: actions/checkout@v4.2.0
      if: inputs.token == ''
      with:
        fetch-depth: 0
        submodules: recursive
    - name: Fetch tags
      shell: bash
      run: git fetch --tags -f

(code from a custom action, ignore the if)

  • Check out the repo with the maximum depth
  • Recursively check out all submodules
  • Checkout all tags

Writing outputs

Communication with the runner happens via workflow commands
The simplest way to send commands is to print on standard output a message in the form:
::workflow-command parameter1={data},parameter2={data}::{command value}

In particular, actions can set outputs by printing:
::set-output name={name}::{value}

jobs:
  Build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: danysk/action-checkout@0.2.20
      - id: branch-name # Custom id
        uses: tj-actions/branch-names@v8
      - id: output-from-shell
        run: ruby -e 'puts "dice=#{rand(1..6)}"' >> $GITHUB_OUTPUT
      - run: |
          echo "The dice roll resulted in number ${{ steps.output-from-shell.outputs.dice }}"
          if ${{ steps.branch-name.outputs.is_tag }} ; then
            echo "This is tag ${{ steps.branch-name.outputs.tag }}"
          else
            echo "This is branch ${{ steps.branch-name.outputs.current_branch }}"
            echo "Is this branch the default one? ${{ steps.branch-name.outputs.is_default }}"
          fi          

Build matrix

Most software products are meant to be portable

  • Across operating systems
  • Across different frameworks and languages
  • Across runtime configuration

A good continuous integration pipeline should test all the supported combinations

  • or a sample, if the performance is otherwise unbearable

The solution is the adoption of a build matrix

  • Build variables and their allowed values are specified
  • The CI integrator generates the cartesian product of the variable values, and launches a build for each!
  • Note: there is no built-in feature to exclude some combination
    • It must be done manually using if conditionals

Build matrix in GHA

jobs:
  Build:
    strategy:
      matrix:
        os: [windows, macos, ubuntu]
        jvm_version: [8, 11, 15, 16] # Arbitrarily-made and arbitrarily-valued variables
        ruby_version: [2.7, 3.0]
        python_version: [3.7, 3.9.12]
    runs-on: ${{ matrix.os }}-latest ## The string is computed interpolating a variable value
    steps:
      - uses: actions/setup-java@v4
        with:
          distribution: 'adopt'
          java-version: ${{ matrix.jvm_version }} # "${{ }}" contents are interpreted by the github actions runner
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python_version }} 
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: ${{ matrix.ruby_version }}
      - shell: bash
        run: java -version
      - shell: bash
        run: ruby --version
      - shell: bash
        run: python --version

Private data and continuous integration

We would like the CI to be able to

  • Sign our artifacts
  • Delivery/Deploy our artifacts on remote targets

Both operations require private information to be shared

Of course, private data can’t be shared

  • Attackers may steal the identity
  • Attackers may compromise deployments
  • In case of open projects, attackers may exploit pull requests!
    • Fork your project (which has e.g. a secret environment variable)
    • Print the value of the secret (e.g. with printenv)

How to share a secret with the build environment?

Secrets

Secrets can be stored in GitHub at the repository or organization level.

GitHub Actions can access these secrets from the context:

  • Using the secrets.<secret name> context object
  • Access is allowed only for workflows generated by local events
    • Namely, no secrets for pull requests

Secrets can be added from the web interface (for mice lovers), or via the GitHub API.

#!/usr/bin/env ruby
require 'rubygems'
require 'bundler/setup'
require 'octokit'
require 'rbnacl'
repo_slug, name, value = ARGV
client = Octokit::Client.new(:access_token => 'access_token_from_github')
pubkey = client.get_public_key(repo_slug)
key = Base64.decode64(pubkey.key)
sodium_box = RbNaCl::Boxes::Sealed.from_public_key(key)
encrypted_value = Base64.strict_encode64(sodium_box.encrypt(value))
payload = { 'key_id' => pubkey.key_id, 'encrypted_value' => encrypted_value }
client.create_or_update_secret(repo_slug, name, payload)

Containerization

The problem: “it works on my machine”

The solution: ship your machine

Containers can be thought of as (but they are not) lightweight virtual machines:

  • Isolated environments
  • Contain the whole runtime
    • They use the host’s kernel, but have their own system libraries
  • Ephemeral
    • Storage and state are discarded on termination
  • Generated from an image
    • We will write images and launch containers

Containers for us

  • Simple, reproducibile configuration
  • R2-proof reproducibility (single command)
  • Isolated test environments

Docker

Docker is the most common containerization technology (standard de facto).

We will interact with Docker through its Command Line Interface

Simplified Docker CLI

docker [container|image] <command> <args>

  • [container|image] optional target
  • <command> the action to perform
  • <args> the command’s arguments

Managing images

docker image

  • ls
    • List the locally available images
  • prune
    • Remove unused images
  • pull
    • Download an image from a registry (usually, dockerhub.io)
  • rm
    • Remove one or more images

Running containers

docker [container] run <options> image

Creates and executes a new container from image

  • container” can be omitted

Options

  • -p <host>:<guest> — publishes (exposes) the container’s port <guest> to host’s port <host>

  • -v <host>:<guest>bind mount: mounts absolute path <host> into the container at absolute path <guest>

  • -name <name> — assigns a unique name to the container

  • -i — interactive mode. Required to send commands to the container

  • -t — attaches a pseudo-TTY. Use the option to have the same feel of a terminal open in the container run Create and run a new container from an image

  • -e <key>=<value> — sets environment variable <key> to <value>

  • -d — detached: returns control to the host terminal and runs in background

Managing containers

docker container

  • ls --all
    • List the local containers (including the stopped ones)
  • exec <name> <command>
    • Executes <command> in a running container named <name>
    • Supports many options of run: -i, -t, -e, -d
  • top
    • Display the running processes of a container
  • rename <old> <new>
    • Rename a container <old> to <new>
  • rm <names>
    • Remove one or more containers
  • prune
    • Remove all stopped containers

Creating images

Images are defined in Dockerfiles

  • Dockerfile $\Rightarrow$ Image $\Rightarrow$ Container
  • Dockerfiles are a list of steps that modify an image to obtain a new one
# Starting point. SCRATCH to start from empty (not recommended)
FROM imagename:version

# Run a command in the current image. Side effects will be stored
RUN some command

# Copy a file into the image
COPY file/in/host /destination/in/image

# Set an environment variable
ENV MY_VARIABLE=MYVALUE

# Change directory
WORKDIR /my/new/directory

# Configure the container to run as an executable
ENTRYPOINT ["executable", "parameter", "parameter2"]

# Default command
CMD ["executable", "parameter", "parameter2"]

Tagging images

Tagging is the operation of adding custom symbolic names (aliases) to images

docker [image] tag <source_image> <target_image>

Creates an alias for <source_image> named <target_image>

  • image” can be omitted

Tags are usually structured as name:version

  • name is typically in the form owner_name/image_name
    • official images have no owner_name/
  • version can be any string, but:
    • latest is a special version that identifies the most recent image
    • versions are normally assigned as numbers with dot separators (e.g., 3.10)
      • Typically these numbers match those of the software inside the container
        • e.g., image python:3.10 contains the Python interpreter at version 3.10
    • additional information is stored in a dash-prefixed suffix
      • e.g., image python:3.10-buster contains the Python interpreter at version 3.10 and the runtime of Debian Buster

Building images

docker [image] build <options> <directory>

Creates a new image from a directory containing a Dockerfile

  • image” can be omitted
  • if launched from where the Dockerfile located:
    • docker build .

Options

  • -t <tag> — adds tag <tag> to the image. Multiple tags can be specified.

Typical build command:

  • docker build -t my_name/my_project:latest -t my_name/my_project:1.0.0 .

Sharing images: selecting a registry and logging in

Images are fetched and stored in registries

  • The most common registry for docker is dockerhub.io
  • GitHub also has an image registry, ghcr.io

Logging into a registry

docker login <registry>

  • If <registry> is omitted, dockerhub.io is used
  • Interactive login: docker login <registry>
  • Non-interactive login
    • docker login -u <username> -p <password> <registry>
    • (better) cat <password> | docker login -u <username> --password-stdin <registry>

Sharing images: pushing images

Pushing images

docker push <image>

  • Pushes <image> to the registry the user is currently logged into
  • If your username is <user>, then the image must be tagged as <user>/<name>:<version>

Your image can now be pulled by anyone!

Archival copies

Open Science

  • Open Science is a movement that aims to make scientific research, data, and dissemination accessible to all levels of society
  • A relevant part of Open Science is accessibility and reproducibility
  • Public services exist that take snapshots of scientific artifacts and store them for the future, e.g.:

Artefacts that are not archived are at a high risk of being lost forever!

With them, away goes the possibility of reproducing the results.

Zenodo

Zenodo is a service that allows to archive and share scientific artifacts. Its key features are:

  • Permanent storage
  • DOI assignment
  • GitHub integration
  • Managed by CERN
  • No cost

Archiving a repository on Zenodo

To automatically archive a repository on Zenodo:

  1. Connect your GitHub account to Zenodo
  2. In the project list of Zenodo, enable the repository to archive
  3. Done! Every new release on GitHub will be archived on Zenodo
  • So, of course, make the releases on GitHub automatic!

Licenses

What is a license

A legal instrument used to regulate access, use, and redistribution of software

  • You generally want to retain some guarantees on the software, e.g.:
    • be recognized as the original creator;
    • decide whether or not someone else can redistribute it
    • decide under which circumstances the software can be used
    • get paid if others use it
  • Law can change greatly among countries
    • If you are not a lawyer, or you can’t pay a lawyer, don’t come up with your own license
    • A custom license is likely to be in conflict with the law of some countries
    • Better choosing something proven to work

Copyright vs. Copyleft

Legal right that grants the creator of an original work exclusive rights for its use and (re)distribution

Copyleft

Practice (not a legal right!) in which the creators surrenders some, but not all, rights under copyright law.

  • Strong: all derived works inherit the copyleft license
  • Weak: some derived work may not inherit it
  • Full: all the parts of the work are distributed under the terms of the copyleft license
  • Partial: only some parts are covered by the copyleft license.

Ownership vs. Licensing

Ownership

Possession of a copy of software.

The possession implies right to use, even if such use implies a violation of the license (e.g. for making changes to the software, or making incidental copies).

Licensing

The software is not sold, but merely “licensed”, namely permitted to be used, under the conditions of a End-user license agreement (EULA).

Proprietary vs. Free


Rights scale, public domain on the left, proprietary on the right.

Proprietary

The software publisher grants the right to use a certain number of copies under the conditions of an EULA, but does not transfer ownership of the copies to the customer. Usage of the software may be subjected to acceptance of the EULA.

Free

The software publisher grants extensive rights to modify and redistribute the software, often prohibiting rolling back such rights (strong copyleft).

Freedom: as in beer vs. as in speech

Much easier for italian speakers:

  • free as in speech: libero
  • free as in beer: gratuito

Free as in beer

Free of charge

Free as in speech

The user receives the source code of the software, is allowed to modify and redistribute it.

  • The user can be asked to pay for receiving a copy: it can be distributed behind a fee
  • The author can also ask for additional money for accessing the source code
    • but not “too much”
    • e.g., asking for a billion dollars would make the software de-facto proprietary.

Free vs. Open source

Usually together, but:

  • Open source: focus on availability of source code and right to modify and share it
  • Free: focus on freedom to use the program, modify, and share it

There are non-free open source licenses:

  • Apple Public source license 1.0 $\Rightarrow$ too generic
  • Artistic License 1.0 $\Rightarrow$ overly vague
  • Nasa Open Source Agreement

And there are free non-open source licenses as well

Unlicensed software

  • Internal business and trade secrets
    • Release is unwanted
    • Undisclosed, unavailable, unlicensed
  • Software distributed without any license
    • fully copyright protected, and therefore legally unusable
  • Public domain
    • At copyright termination, distributed unlicensed software becomes public domain: freely usable, modifiable, redistributable
    • It takes over 100 years for software released after 2008

Proprietary licenses’ constraints

  • Closed volume: the customer commits to purchase a certain number of licenses over a fixed period (mostly two years).
    • Licensing can be limited per user, CPU, concurrent usage, etc.
  • Maintenance and support: proprietary licenses usually include forms of maintenance and support
  • Warranty: proprietary licenses often include a warranty
    • Typically time limited
    • Typically with purcheasable extensions

GNU GPLv3

GNU General Public License

  • Free and open source
  • Strong copyleft: derived work must be released under a compatible license
  • Does not allow linking from non GPL-compatible licensed software!
    • If you want your software to be used as a dependency of proprietary software, don’t use this license!

GNU LGPLv3

GNU Lesser General Public License (LGPL)

  • A modification of the GPLv3 with a linking exception
    • Not an entirely different license as the previous LGPL
  • Free and open source
  • Strong copyleft: derived work must be released under a compatible license
  • Allows linking from code with a different license
  • Work linking the LGPL library (combined work) must:
    • Allow modification of the original linked library shipped with the work
    • Allow reverse engineering and debugging of the combined work
  • These requirements are often unacceptable for companies!

GNU GPL with linking exception

GNU General Public License with linking exception

  • Can be built on top of a GPL, by manually adding an exception for linking
  • Free and open source
  • Strong copyleft: derived work must be released under a compatible license
  • Allows linking from code with a different license without further restrictions
    • Combined work can be redistributed under non GPL-compatible licenses

Example exception

Linking this library statically or dynamically with other modules is making a combined work based on this library.
Thus, the terms and conditions of the GNU General Public License cover the whole combination.
As a special exception, the copyright holders of this library give you permission to link this library with
independent modules to produce an executable, regardless of the license terms of these independent modules, and to
copy and distribute the resulting executable under terms of your choice, provided that you also meet, for each linked
independent module, the terms and conditions of the license of that module. An independent module is a module which
is not derived from or based on this library. If you modify this library, you may extend this exception to your
version of the library, but you are not obliged to do so. If you do not wish to do so, delete this exception
statement from your version.

MIT License

  • Extremely permissive
  • Free and open source
  • GPL compatible
  • No copyleft: the work or a derivative can be redistributed under any other license
  • Does not protect trademarks
  • Does not protect patent claims

Apache License 2.0

  • More restrictive than MIT
  • Free and open source
  • GPL compatible
  • No copyleft: the work or a derivative can be redistributed under any other license
  • Protects trademarks
  • Allows for placing a warranty
  • Protects patent claims
  • Forces authors of derived works to state significant changes made to the software
  • If a NOTICE file with authors is provided, it must be included when redistributed. Entries can be appended.

WTFPL

  • Extremely permissive but problematic
  • A good example of how things may go wrong if you write licenses without competence
  • Free, not open source
  • No copyleft
  • Better use something else
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                   Version 2, December 2004

Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.

           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

 0. You just DO WHAT THE FUCK YOU WANT TO.

FOSS Licence compatibility

floss license network

compatibility

Creative Commons

Available rights

  • BY (Attribution) – Derivative works must credit the original author
  • SA (Share-alike) – Enables copyleft
  • NC (Non-commercial) – — Derivative work can only be used for non commercial purposes
  • ND (No derivative Works) – Free distribution and copy, but derivatives are forbidden

Valid combinations

  • CC0 – Public domain (prefer the MIT licens for a similar protection)
  • CC-BY – Attribution
  • CC-BY-SA – Attribution, Share-alike (enables copyleft)
  • CC-BY-NC – Attribution Noncommercial
  • CC-BY-NC-SA – As above, plus copyleft
  • CC-BY-ND – Attribution Noderivatives (commercially usable, but not modifiable)
  • CC-BY-NC-ND – As above, non commercial

Applying a License

  1. Create a LICENSE or COPYING plain text file in the repository with the full license text
    • Full text can be easy found online
    • Change the owner and the copyright year
  2. Apply a copyright notice header to every source file
    • Use an automated tool to do it
    • Most IDEs can deal with this
    • Header templates are usually available where the license is published

Your software is now licensed!

GitHub Pages

Documenting repositories

Documentation of a project is part of the project

  • Documentation must stay in the same repository of the project
  • However, it should be accessible to non-developers

Meet GitHub Pages

  • GitHub provides an automated way to publish webpages from Markdown text
  • Markdown is a human readable markup language, easy to learn
    • These slides are written in Markdown
    • (generation is a bit richer, but just to make the point)
  • Supports Jekyll (a Ruby framework for static website generation) out of the box
    • We will not discuss it today
    • But maybe in Software Process Engineering…

Setting up a GitHub Pages website

Two possibilities:

  1. Select a branch to host the pages
    • Create an orphan branch with git checkout --orphan <branchname>
    • Write your website, one Markdown per page
    • Push the new branch
  2. Use a docs/ folder in a root of some branch
    • Could be master or any other branch

Setting up a GitHub Pages website

Once done, enable GitHub pages on the repository settings:

Enable GitHub Pages snapshot

GitHub Pages URLs

  • Repository web-pages are available at https://<username>.github.io/<reponame>
  • User web-pages are available at https://<username>.github.io/
    • They are generated from a repository named <username>.github.io
  • Organization web-pages are available at https://<organization>.github.io/
    • They are generated from a repository named <organization>.github.io