Methylation Analysis
Plan of the Week: March 30 - April 5, 2026
High- level outline for the week. Adjusted daily to reflect progress of the day before
- Moving forward.
Monday - Catch up on UW-RUA and NWS Poster
Tuesday - UW-RUA, No Science
Wednesday - Biomarker Manuscript
Thursday - Biomarker ManuscriptFriday - NWS Symposium
Saturday - Biomarker Manuscript
Sunday - No Science
Plan of the Day
Granular level task list to accomplish the high- level goal outlined above
- Keep the methylation analysis moving forward
- Present at the NWS Symposium
Projects Touched Today
- DNA Methylation
- NWS Symposium
Progress Notes
Checked my methylation outputs - the deduplication, MultiQC, and output organization wrapped up at 0745.
Next step is to run the deduplication and parameter checks on the first 10k basepairs to have for later checks - not making any adjustments based on these yet.
I used the methylation extraction and qc script from ceasmaller (Sam’s repo) to create the extraction script for the next step.
Ran the methylation extraction after the parameter checks were completed.
Started the methylation extractions with my modified code. Had to rebuild tool paths after the first fail because I didn’t update them correctly in the modified script
After repeated attempts where it seems to detect methylation extraction where there was none, I realized two major things:
First, I build the stupid skip loops wrong - got too caught up in humming fe, fi, of, fum I guess.
Second, it was a blessing in disguise since I had two directories mislabeled and would have been really confused why my extractions and their reports ended up in my code or reference directories…
During my work block with KPJ, she helped me talk through the steps of what I was asking the code to do and my anticipated outcomes, and fix the loops since I was still a little stuck.
Adding to my decision/ defense of decisions log, coverage (analysis decision) and buffer size (computer decision) explanations.
Coverage is set at 5 in the scripts in all of the lab repos I have searched. I know it is number of times a read is completed at a particular site in the sequence, but I don’t know why 5 is the choice.
Coverage, a.k.a. total reads at a particular site= number of unmethylated reads + number of methylated read
It is the minimum number of reads to support the proportion of methylation. If you have 3 reads- 1 methylated and 2 not, that is not a reliable 33% methylation. If you have 5+ reads and methylation is 1 of 5, that 20% is likely more accurate.
Coverage values are a balancing act between our ‘confidence’ in the results and the amount of data we exclude based on read count.
Since these are whole genome BS sequences (not RR), I may want to play with this coverage value to compare results since there is more to work with - may be a fools errand, but maybe not. I do not know the implications of higher coverage values based on the DNA extraction and sequence quality, nor do I know if increasing coverage will knock out relevant sites of methylation since inverts are ‘normally’ methylated in a scattered way versus verts.
Buffer size is an indication to the computer how much data to hold before writing it out.
This is a space and speed balancing act; 50% buffer is just asking the computer not to write out anything until it reaching 50% of the available working memory.
Not sure if it is appropriate on Raven since I modified this code from Sam’s script that was running on Klone.
Will leave it in the script in hopes it will also help speed up the process without screwing up anyone else running stuff on Raven
A later task is to run a quick script to put all of the checksums into a table or excel file or whatever for quick side by side comparisons and to keep with all of the other metadata. This is a clean-up step, not a process step.
Created an exact duplicate of the extraction script to run with a cover 10 for comparison.
- I can run this after the extractions for cover 5 because I will need to take my time through the results to really lock in what I do and do not understand before reviewing those results in comparison.
Next snag- the sorted BAM files are not in an order recognized…
Error message in the log for 105M:
“The IDs of Read 1 (LH00469:254:22HGFVLT4:2:2361:52054:3816_1:N:0:GTTACGCA+ATGGCGAT) and Read 2 (LH00469:254:22HGFVLT4:2:2441:28449:5398_1:N:0:GTTACGCA+ATGGCGAT) are not the same. This might be the result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file using ‘samtools sort -n’ (by read name). This may also occur using samtools merge as it does not guarantee the read order. To properly merge files please use ‘samtools merge -n’ or ‘samtools cat’.”
Remember: Bismark methylation extractor requires R1 and R2 to remain adjacent; sorted BAMS are not going to work.
I fixed the script to pull the unsorted BAM files, and once it began working, I left it to get ready to go.
Finally, I fixed my lab notebook not showing up on the handbook page and not showing up in the lab feed per GH Issue #2090 guidance.
The handbook page update worked. I think I added my name to the path twice instead of once…
I can’t see if the feed that drops into Slack worked, so I will wait until tomorrow to verify in case it only pulls once a day or at specific times or whatever.
Outcomes: Products & Word Count
- Extraction Analysis (Cover 5 and 10): 2 scripts
- Methylation Analysis Details: 368 words
Today’s total: 368 words
Monthly total to date: 921 words
Annual total to date: 33,593 words
Annual target total to date: 46,500 words
Next Up: Tomorrow’s Plan
- Set April goals and attainment plan.