Computing Programming Assignment Help
Deadline for submission: midnight on the night of the 11th March, 2016. Code
should be submitted to the Week 7 dropbox on Blackboard Learn.
In this assignment, we will undertake one of the core tasks of bioinformatics
analysis, sequence alignment. In particular, we will analyse 24 DNA
sequences that have been obtained, comparing their composition in order to
see the variation between sequences.
In most species there is variation between genomes of the members. In
some cases this variation takes the form of a single nucleotide polymorphism
(SNP). As fellow humans, our genomes will be similar, but at a particular
location, you might have a C and I have a T or you a G and I an A. However,
there are many other ways that variation can occur. My genome might be
missing a short sequence of nucleotides that is contained in yours. My
genome might contain several repeats of a sequence that is present in yours
only once, or might contain a sequence that is present in yours but with
extra nucleotides inserted in the middle of my copy.
The degree of this variation is species specific. Mammals tend to have
relatively less variation than more prolific organisms such as bacteria.
Sequence alignment is a process that takes into account these
polymorphisms, gaps and duplications to produce our best guess at how the
sequences match up, so that we can compare the genes within each
sequence.
Preamble
I. Download the file Sequence_data.fna from Blackboard Learn. Open it
up in Word or Notepad. The FNA file extension indicates that the file is
a FASTA file (FASTA is a file format used for storing DNA sequences).
Really it is just a text file, but for the FASTA format, the text has to be
arranged in certain ways. In the file you’ll see that there are 24 DNA
sequences, each about 1000 nucleotides long.
II. Go to //www.ebi.ac.uk/Tools/msa/clustalo/ and copy and paste the
contents of the file into the box. From the drop-down menu just above
the box, select DNA (by default PROTEIN comes up). Keep the rest of
the settings as default and hit the green submit button. This will start a
24-way sequence alignment. However, it may take up to a few minutes
to run depending on activity of the servers. Once it completes, you’ll
see all 24 sequences stacked up. Where one sequence is missing
some of the nucleotides present in other sequences, this is indicated
by '-'. These are called gaps. Stars are shown below the nucleotides
that are consistent across all samples.
III. Here, we will undertake a similar analysis manually so that you can see
how it is done.
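To make the FASTA layout mentioned in step I concrete, here is a minimal sketch of how such text can be split into sequences. It relies only on the general FASTA convention that each record starts with a header line beginning with '>' and that the sequence follows on one or more lines. The class name FastaSketch, the method parse and the sample headers are made up for illustration; this is not the ReadFile.java supplied on Blackboard Learn, which may work differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not the course's ReadFile.java.
class FastaSketch {

    // Splits FASTA text into sequences: a line starting with '>' opens a new
    // record, and the lines that follow it are joined into one sequence.
    static List<String> parse(String fastaText) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String line : fastaText.split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.startsWith(">")) {
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line);
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    public static void main(String[] args) {
        // Two made-up records; the first sequence is wrapped over two lines.
        String text = ">seq1 made-up header\nACGT\nTTGA\n>seq2 made-up header\nGGCA\n";
        for (String s : parse(text)) {
            System.out.println(s.length() + " " + s);
        }
    }
}
```

Running main on the two made-up records prints each sequence with its length, which is a quick way to check that wrapped lines have been joined correctly.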
1) We’re going to create a class to store our sequence data. Create a class
called sequence. This class should contain:-
a) just one variable: a private string called sequence_data. (4 marks)
b) a constructor, public sequence(String new_string) that takes a string
as a parameter and assigns it to the sequence_data variable.
(4 marks)
c) a method with the definition public void printSequence() that prints
the string sequence_data to the screen. This will help you with
debugging. (5 marks)
d) a method with the definition public char getNucleotide(int i) that is as
follows:-
public char getNucleotide(int i) {
    return sequence_data.charAt(i);
} (4 marks)
e) a method that returns the length of the sequence with the definition
public int getLength(). For this method, you can use the built-in
method that is a part of the definition of a string to obtain its length.
The length of the string sequence_data can be obtained using the
method sequence_data.length() (4 marks)
f) a method public String getSequence() that returns the string
sequence_data (4 marks)
With the exception of getNucleotide(int i), you will have to complete these
methods yourselves based on the material learned in class, on textbooks
and on your no-doubt prodigious googling skills.
2) We will now create a second class to store a table. If the length of the first
sequence is L1 and the length of the second sequence is L2 then this table
will have dimensions L1xL2. This table will describe comparisons between
all the nucleotides in the two sequences. Where they agree we will put an
entry of 1 in the table and where they disagree we will put an entry of 0.
Create a class called comparison_table. This class should contain:-
a) one variable, a private 2D array of integers called table_data. (5
marks)
b) a constructor public comparison_table(sequence a, sequence b)
that takes two objects of type sequence as parameters. This should
use two for-loops nested inside each other to test whether the i-th
nucleotide in sequence a is equal to the j-th nucleotide in sequence b.
If they are equal this method should put the number 1 in the (i,j)
element of the 2D array table_data. If they are different, it should put 0
in the (i, j) element. Use the getNucleotide method you created in
question 1 to obtain the individual nucleotides from each sequence.
(5 marks)
c) a method public void showTable() that prints the table to the screen.
(5 marks)
d) a method public int getEntry(int i, int j) that retrieves a single value
from the table. (5 marks)
e) a method public int[] getTableDim() that returns a 2-component array
containing the dimensions of the table, L1 and L2. (5 marks)
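The comparison described in question 2 can be sketched on plain strings as follows. This is an illustration under simplifying assumptions rather than the class you are asked to write: it uses String arguments and charAt directly instead of sequence objects and getNucleotide, and the names ComparisonSketch and compare are made up.

```java
// Hypothetical illustration of the 0/1 comparison table on plain strings.
class ComparisonSketch {

    // Builds an L1 x L2 table: entry (i, j) is 1 when a.charAt(i) equals
    // b.charAt(j) and 0 otherwise.
    static int[][] compare(String a, String b) {
        int[][] table = new int[a.length()][b.length()];
        for (int i = 0; i < a.length(); i++) {
            for (int j = 0; j < b.length(); j++) {
                table[i][j] = (a.charAt(i) == b.charAt(j)) ? 1 : 0;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Rows correspond to "GAT", columns to "GT".
        int[][] table = compare("GAT", "GT");
        for (int[] row : table) {
            StringBuilder line = new StringBuilder();
            for (int value : row) {
                line.append(value).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

For "GAT" against "GT" the table has three rows and two columns, with a 1 only where G meets G and where T meets T.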
3) We now need to build a second table using the data from the first table
and it is this second table that will tell us how to align the two sequences.
Create a class called alignment_table. This class should contain:-
a) one variable, a private 2D array of integers called table_data (5
marks)
b) a constructor public alignment_table(comparison_table T) that
takes an object of type comparison_table and populates the variable
table_data according to the Needleman-Wunsch algorithm.
(10 marks)
c) a method with the definition public void printSequence() that prints
the string sequence_data to the screen. This will help with debugging.
(5 marks)
d) a method public int[][] getPath() that calculates the alignment pattern
from the table_data variable according to the Needleman-Wunsch
algorithm. This method should return an Nx2 array of ints where N is
the length of the longest side of the table, i.e. the length of whichever
was the longer sequence, a or b. Each row of the array contains a
pair of numbers, the coordinates of matching nucleotides from the
sequence alignment table that contribute to the optimal sequence
alignment. By giving the array the length of the longest side of the
table, you are initializing this to have the longest possible length that
could be needed. For a given pair of sequences, not all the rows of
this table may be needed. (5 marks)
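As a rough illustration of question 3, the sketch below fills a score table from a 0/1 comparison table and then traces back through it to collect the matched coordinates. It rests on several assumptions that may differ from the version in the slides: a match scores 1, a mismatch scores 0, gaps carry no penalty, the score table has one extra row and column of zeros, and the table is filled from the top-left corner. It also returns only the matched pairs rather than a padded Nx2 array. The names NeedlemanWunschSketch, fill and tracePath are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; the scoring scheme is an assumption, not the slides'.
class NeedlemanWunschSketch {

    // Fills a (L1+1) x (L2+1) score table from a 0/1 comparison table.
    // Assumed scoring: match = 1, mismatch = 0, gap = 0.
    static int[][] fill(int[][] cmp) {
        int rows = cmp.length;
        int cols = rows == 0 ? 0 : cmp[0].length;
        int[][] score = new int[rows + 1][cols + 1]; // row 0 and column 0 stay 0
        for (int i = 1; i <= rows; i++) {
            for (int j = 1; j <= cols; j++) {
                int diagonal = score[i - 1][j - 1] + cmp[i - 1][j - 1];
                int gap = Math.max(score[i - 1][j], score[i][j - 1]);
                score[i][j] = Math.max(diagonal, gap);
            }
        }
        return score;
    }

    // Walks back from the bottom-right corner and collects the (i, j)
    // coordinates of matched nucleotides, returned in increasing order.
    static int[][] tracePath(int[][] cmp, int[][] score) {
        List<int[]> pairs = new ArrayList<>();
        int i = cmp.length;
        int j = i == 0 ? 0 : cmp[0].length;
        while (i > 0 && j > 0) {
            if (cmp[i - 1][j - 1] == 1) {
                pairs.add(new int[] {i - 1, j - 1}); // matched pair
                i--;
                j--;
            } else if (score[i - 1][j] >= score[i][j - 1]) {
                i--; // skip a nucleotide of the first sequence
            } else {
                j--; // skip a nucleotide of the second sequence
            }
        }
        Collections.reverse(pairs);
        return pairs.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        // Comparison table for "GAT" (rows) against "GT" (columns).
        int[][] cmp = { {1, 0}, {0, 0}, {0, 1} };
        int[][] score = fill(cmp);
        System.out.println("score = " + score[3][2]);
        for (int[] pair : tracePath(cmp, score)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

With this scoring the bottom-right entry counts the matched nucleotides in the best alignment. For "GAT" against "GT" that count is 2 and the matched pairs are (0, 0) and (2, 1), which is small enough to verify by hand.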
4) We finally need a fourth class that will employ the three classes above to
allow us to complete a sequence alignment. Create a class called
Calculate_alignments. This class should contain:-
a) a private array of variables of type sequence, called MySequences.
This array should have 24 members. (4 marks)
b) a private variable of type File called seq_set_file. For this variable,
use the line private File seq_set_file = new
File("Sequence_data.fna"); (4 marks)
c) a private variable path that will be an Nx2 array of integers to store the
pairs of numbers generated by the getPath() method in the
alignment_table variable. (4 marks)
d) a method public void readFile() that can read the file defined in the File
variable and that populates the array of 24 sequences with sequence
data from the file. The code for this method is available on Blackboard
Learn in the file ReadFile.java. This code can be copied and pasted
into the Calculate_alignments class.
(3 marks)
e) a method public void displayAlignment(sequence a, sequence b)
that takes two sequences and uses the information in the path variable
(the 2D array of integers) to display the two sequences on the screen,
correctly aligned with gaps inserted in the appropriate places. Gaps
should be indicated with the '-' character rather than just left blank.
(7 marks)
f) a constructor public Calculate_alignments() that contains the
following code
public Calculate_alignments()
{
    readFile();
    for (int i=1; i<24; i++) {
        comparison_table CT = new comparison_table(MySequences[0],
                MySequences[i]);
        alignment_table AT = new alignment_table(CT);
        path = AT.getPath();
        displayAlignment(MySequences[0], MySequences[i]);
        System.out.println(" ");
    }
} (3 marks)
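One possible way to turn a list of matched coordinates into the gapped display asked for in question 4 e) is sketched below. It assumes that the path contains only matched (i, j) pairs in increasing order, with no unused rows, and it simply writes the unmatched stretches between matches side by side, padding the shorter one with '-'. It works on plain strings, and the names DisplaySketch, align and appendUnmatched are made up, so treat it as an illustration rather than the required displayAlignment method.

```java
// Hypothetical sketch of building gapped rows from matched index pairs.
class DisplaySketch {

    // Returns the two gapped rows. path holds matched (i, j) index pairs in
    // increasing order; unmatched stretches are padded with '-'.
    static String[] align(String a, String b, int[][] path) {
        StringBuilder top = new StringBuilder();
        StringBuilder bottom = new StringBuilder();
        int i = 0;
        int j = 0;
        for (int[] pair : path) {
            appendUnmatched(top, bottom, a.substring(i, pair[0]), b.substring(j, pair[1]));
            top.append(a.charAt(pair[0]));
            bottom.append(b.charAt(pair[1]));
            i = pair[0] + 1;
            j = pair[1] + 1;
        }
        appendUnmatched(top, bottom, a.substring(i), b.substring(j));
        return new String[] {top.toString(), bottom.toString()};
    }

    // Writes two unmatched stretches side by side, padding the shorter with gaps.
    private static void appendUnmatched(StringBuilder top, StringBuilder bottom,
                                        String restA, String restB) {
        int width = Math.max(restA.length(), restB.length());
        for (int k = 0; k < width; k++) {
            top.append(k < restA.length() ? restA.charAt(k) : '-');
            bottom.append(k < restB.length() ? restB.charAt(k) : '-');
        }
    }

    public static void main(String[] args) {
        String[] rows = align("GAT", "GT", new int[][] { {0, 0}, {2, 1} });
        System.out.println(rows[0]);
        System.out.println(rows[1]);
    }
}
```

For "GAT" and "GT" with matches at (0, 0) and (2, 1), the two rows come out as GAT and G-T, so the gap sits opposite the unmatched A.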
Guidance
The aim of the project is to try to align the first sequence with each of the
remaining 23 sequences. However, when you are developing your code
and testing it, you won’t want to test the code with the full data set.
Instead, copy the Sequence_data.fna file, renaming the copy to something
like Sequence_data2.fna, and edit the copy so that each sequence is only
5 or 6 nucleotides long. Then edit the File declaration in the class
Calculate_alignments so that it refers to Sequence_data2.fna instead.
The reason for doing this is that each time you run your code it will take
some time to complete all the calculations and the larger the sequences
you use for testing, the longer the code will take to run. Hence, the shorter
sequences will enable you to run/test/debug the code more quickly. The
second reason for using shorter sequences is that when the sequences
are 5 or 6 letters long, you can do the calculations by hand with pen
and paper and then check that the results produced by your code agree.
This gives you confidence that your code is working correctly, before you
apply it to the full ~1000 nucleotide long sequences. Clearly it would be
unreasonable to try to calculate the alignment of full ~1000 nucleotide long
sequences by hand in order to check that your code was correct.
For a detailed explanation of the Needleman-Wunsch algorithm, see the
slides.