Click the assignment 6 invitation above, clone the repository, and follow the instructions in src/assignment.jl
.
Also, recall that you may run the automated tests on your own computer:
$ julia --project -e 'using Pkg; Pkg.test()'
For each assignment, the contents of the assignment code script will be rendered as html at the bottom of the assignment description page. If you're interested in how that works, check out Literate.jl
Note: this file is runnable in its current state, but is incomplete. You can run the file from the command line, or use the VS Code julia extension to run individual lines.
A common way to interact with biological sequence data is in a format called FASTA
. This is a very simple format, where each "record" (sequence) contains:
A "header" line, always starting with >
, that contains a description of the sequence
The sequence, which can be split across multiple lines
For example, in the data/
directory of this assignment, there is a file called cov2_genomes.fasta
, which contains some genome sequences of Sars-CoV2 isolated from humans early in the pandemic. Use the shell to take a look at the first 10 lines (head
) of the file.
Now, let's work on building a julia function to help us use this file.
"""
count_records(path)
Given a path to a `fasta` file,
counts and returns the number of records it contains.
Example
≡≡≡≡≡≡≡≡≡
julia> count_records("data/cov2_genomes.fasta")
10
"""
function count_records(path)
# Your code here.
end
Hint - If you're having trouble, the easiest thing to do is probably to count lines that start with >
.
It's nice to know the number of records the file contains, but it's more useful to actually be able to read and store those records.
Let's look again at the headers for these sequences. This can be accomplished from the terminal using grep
. The simplest way to use grep
is with 2 arguments - the first argument specifies a pattern to look for, and the second is the path to a file (or group of files).
grep
will then print to the screen all of the lines from the file that match that pattern. To see all of the header lines, for example,
$ grep '>' data/cov2_genomes.fasta
Each header has 3 components, separated by |
Write a function that divides the header into these 3 components, and returns them as a tuple.
"""
fasta_header(header)
Divides a fasta header into its component parts,
removing any leading or trailing spaces.
Example
≡≡≡≡≡≡≡≡≡
julia> fasta_header(">M0002 |China|Homo sapiens")
("M0002", "China", "Homo sapiens")
julia> fasta_header("AAATTC")
Error: Invalid header (headers must start with '>')
julia> fasta_header(">Another sequence")
("Another sequence",)
julia> fasta_header(">headers| can | have| any number | of | info blocks")
("headers", "can", "have", "any number", "of", "info blocks")
"""
function fasta_header(header)
startswith(header, '>') || error("Invalid header (headers must start with '>')")
# Your code here
end
A couple of things to note about this function if you're strugling:
Notice that, even though each header must start with a >
, the first item in the returned Tuple should not
The return item should always be a Tuple, even if there's only 1 item
The split()
function can divide a String
by a delimeter.
You may need to do some searching on the internet.
Alright - let's actually parse the file. For now, the easiest thing to do is to return two vectors, one containing the parsed identifiers as tuples, and one containing the sequences as strings.
One thing to keep in mind is that the sequences in a fasta file can be spread out over multiple lines, but they'll be much easier to work with as a single string, so your function should have a way to join multiple lines (remember the join()
function?).
I've included a few example files in the "data" directory with some example files to test your functions on. The examples in the docstring below assume that your working directory is the assignment directory.
"""
function parse_fasta(path)
Reads a fasta-formated file and returns 2 vectors,
one containing parsed headers,
the other containing the entire sequence as a `String`.
Note: function does not validate DNA sequences for correctness.
Example
≡≡≡≡≡≡≡≡≡
julia> ex1 = parse_fasta("data/ex1.fasta");
julia> ex1 isa Tuple
true
julia> ex1[1]
2-element Array{Tuple{String,String},1}:
("ex1.1", "easy")
("ex1.2", "multiline")
julia> ex1[2]
2-element Array{String,1}:
"AATTATAGC"
"CGCCCCCCAGTCGGATT"
julia> ex2 = parse_fasta("data/ex2.fasta");
julia> ex2[1]
4-element Array{Tuple{String,String},1}:
("ex2.1", "oneper")
("ex2.2", "wrong")
("ex2.3", "wronger")
("ex2.4", "wrongest")
julia> ex2[2]
4-element Array{String,1}:
"ATCCGT"
"ATCGTGGaact"
"ATCGTGGaact"
"this isn't a dna string,but parse it anyway"
"""
function parse_fasta(path)
# Think through the components you need
# Does it make sense to define any containers at the beginning?
# How will you loop through the file?
# What do you need to get from each line?
end
Tips:
Build the function in parts
first, see if you can build the vector of headers, ignoring the sequences
next, see if you can collect the sequences, even if they're not joined together properly
Work on pieces of the function in the REPL. Make sure each expression does what you think it does.
For example, if you make a for
loop to go through each line of the file, add @info
statements to check the value of each variable in each loop.
We now have most of the pieces necessary to get some summary statistics of CoV2 genomes. Note: You may want to grab some code from previous assignments.
Calculate the mean sequence length and standard deviation of the CoV2 geneomes in data/cov2_genomes.fasta
.
# helper code goes here
mean_cov2_length = 0.0
std_cov2_length = 0.0
Calculate the mean and standard deviation of GC content of the CoV2 genomes in data/cov2_genomes.fasta
. GC content should be a number between 0 and 1. Note: if you already parsed the fasta file above and assigned the results to variables, there's no need to re-parse it.
# helper code goes here
mean_cov2_gc = 0.0
std_cov2_gc = 0.0
Write a function that will read any proper fasta file and returns a tuple with the mean and standard deviation of the sequence length and gc content.