Here is the approach I ended up using, in part thanks to all the contributions here.
The associated R script is below or can be downloaded from:
This creates 999 unique sequence files in plain text, with each sequence being identified to species level and few species being found across more than one sequence.
It also creates the matching answer key.
You can start at a random location to so that files change every year/group.
I used R to query the BOLDS database (Barcode of Life), to download a file and to split this huge file into separate sequences.
Here is the R script
rm(list=ls())
complete<-"http://services.boldsystems.org/eFetch.php?record_type=full&id_type=sampleid&ids=(*)&return_type=text"
write(complete, file="your location on disk")
rm(list=ls())
sequences.id<-data.frame("file.name", "recordID", "genus_name", "species_name")
write.table(x=sequences.id, file="sequences_id.csv", append=F, sep = ",", row.names=F, col.names=F)
set.seed(10)
start<-sample(1:1000, size=1)
i<-start
k<-1
while(k < 1000){
sequences<-read.delim(file=complete, skip=i, nrows=1, header=F)
sequence.compare<-read.csv(file="sequences_id.csv", skip=k-1, nrows=1, header=F)
if(! is.na(sequences$V24)){
if(as.character(sequences$V24)!=as.character(sequence.compare$V4)){
writeLines(text=as.character(sequences$V55), con=paste(k, ".txt", sep=""))
sequences.id<-c(k, sequences[,c("V3","V22", "V24")])
write.table(x=sequences.id, file="sequences_id.csv", append=T, sep = ",", row.names=F, col.names=F)
print("kept")
k<-k+1
}
}
i<-i+1
print(paste(k,"/", i))
}
No comments:
Post a Comment