Friday 5 December 2008

genetics - How to create a collection of anonymous sequences for teaching and testing?

Here is the approach I ended up using, in part thanks to all the contributions here.



The associated R script is below or can be downloaded from:



BOLDS SEQUENCE RECOVERY



This creates 999 unique sequence files in plain text, with each sequence being identified to species level and few species being found across more than one sequence.



It also creates the matching answer key.



You can start at a random location to so that files change every year/group.



I used R to query the BOLDS database (Barcode of Life), to download a file and to split this huge file into separate sequences.



Here is the R script



rm(list=ls())

complete<-"http://services.boldsystems.org/eFetch.php?record_type=full&id_type=sampleid&ids=(*)&return_type=text"
write(complete, file="your location on disk")

rm(list=ls())

sequences.id<-data.frame("file.name", "recordID", "genus_name", "species_name")
write.table(x=sequences.id, file="sequences_id.csv", append=F, sep = ",", row.names=F, col.names=F)



set.seed(10)
start<-sample(1:1000, size=1)

i<-start
k<-1

while(k < 1000){

sequences<-read.delim(file=complete, skip=i, nrows=1, header=F)
sequence.compare<-read.csv(file="sequences_id.csv", skip=k-1, nrows=1, header=F)

if(! is.na(sequences$V24)){
if(as.character(sequences$V24)!=as.character(sequence.compare$V4)){
writeLines(text=as.character(sequences$V55), con=paste(k, ".txt", sep=""))
sequences.id<-c(k, sequences[,c("V3","V22", "V24")])
write.table(x=sequences.id, file="sequences_id.csv", append=T, sep = ",", row.names=F, col.names=F)
print("kept")
k<-k+1
}
}
i<-i+1
print(paste(k,"/", i))
}

No comments:

Post a Comment