Methylation Analysis

Mussel Methylation

Published

April 3, 2026

Plan of the Week: March 30 - April 5, 2026

High- level outline for the week. Adjusted daily to reflect progress of the day before

Moving forward.

~~Monday - Catch up on UW-RUA and NWS Poster~~

~~Tuesday - UW-RUA, No Science~~

~~Wednesday - Biomarker Manuscript~~

~~Thursday - Biomarker Manuscript~~

Friday - NWS Symposium

Saturday - Biomarker Manuscript

Sunday - No Science

Plan of the Day

Granular level task list to accomplish the high- level goal outlined above

Keep the methylation analysis moving forward
Present at the NWS Symposium

Projects Touched Today

DNA Methylation
NWS Symposium

Progress Notes

Checked my methylation outputs - the deduplication, MultiQC, and output organization wrapped up at 0745.
- Next step is to run the deduplication and parameter checks on the first 10k basepairs to have for later checks - not making any adjustments based on these yet.
- I used the methylation extraction and qc script from ceasmaller (Sam’s repo) to create the extraction script for the next step.
Ran the methylation extraction after the parameter checks were completed.
- Started the methylation extractions with my modified code. Had to rebuild tool paths after the first fail because I didn’t update them correctly in the modified script
- After repeated attempts where it seems to detect methylation extraction where there was none, I realized two major things:
  - First, I build the stupid skip loops wrong - got too caught up in humming fe, fi, of, fum I guess.
  - Second, it was a blessing in disguise since I had two directories mislabeled and would have been really confused why my extractions and their reports ended up in my code or reference directories…
- During my work block with KPJ, she helped me talk through the steps of what I was asking the code to do and my anticipated outcomes, and fix the loops since I was still a little stuck.
Adding to my decision/ defense of decisions log, coverage (analysis decision) and buffer size (computer decision) explanations.
- Coverage is set at 5 in the scripts in all of the lab repos I have searched. I know it is number of times a read is completed at a particular site in the sequence, but I don’t know why 5 is the choice.
  - Coverage, a.k.a. total reads at a particular site= number of unmethylated reads + number of methylated read
  - It is the minimum number of reads to support the proportion of methylation. If you have 3 reads- 1 methylated and 2 not, that is not a reliable 33% methylation. If you have 5+ reads and methylation is 1 of 5, that 20% is likely more accurate.
  - Coverage values are a balancing act between our ‘confidence’ in the results and the amount of data we exclude based on read count.
  - Since these are whole genome BS sequences (not RR), I may want to play with this coverage value to compare results since there is more to work with - may be a fools errand, but maybe not. I do not know the implications of higher coverage values based on the DNA extraction and sequence quality, nor do I know if increasing coverage will knock out relevant sites of methylation since inverts are ‘normally’ methylated in a scattered way versus verts.
- Buffer size is an indication to the computer how much data to hold before writing it out.
  - This is a space and speed balancing act; 50% buffer is just asking the computer not to write out anything until it reaching 50% of the available working memory.
  - Not sure if it is appropriate on Raven since I modified this code from Sam’s script that was running on Klone.
  - Will leave it in the script in hopes it will also help speed up the process without screwing up anyone else running stuff on Raven
A later task is to run a quick script to put all of the checksums into a table or excel file or whatever for quick side by side comparisons and to keep with all of the other metadata. This is a clean-up step, not a process step.
Created an exact duplicate of the extraction script to run with a cover 10 for comparison.
- I can run this after the extractions for cover 5 because I will need to take my time through the results to really lock in what I do and do not understand before reviewing those results in comparison.
Next snag- the sorted BAM files are not in an order recognized…
- Error message in the log for 105M:
- “The IDs of Read 1 (LH00469:254:22HGFVLT4:2:2361:52054:3816_1:N:0:GTTACGCA+ATGGCGAT) and Read 2 (LH00469:254:22HGFVLT4:2:2441:28449:5398_1:N:0:GTTACGCA+ATGGCGAT) are not the same. This might be the result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file using ‘samtools sort -n’ (by read name). This may also occur using samtools merge as it does not guarantee the read order. To properly merge files please use ‘samtools merge -n’ or ‘samtools cat’.”
Remember: Bismark methylation extractor requires R1 and R2 to remain adjacent; sorted BAMS are not going to work.
I fixed the script to pull the unsorted BAM files, and once it began working, I left it to get ready to go.
Finally, I fixed my lab notebook not showing up on the handbook page and not showing up in the lab feed per GH Issue #2090 guidance.
- The handbook page update worked. I think I added my name to the path twice instead of once…
- I can’t see if the feed that drops into Slack worked, so I will wait until tomorrow to verify in case it only pulls once a day or at specific times or whatever.

Outcomes: Products & Word Count

Extraction Analysis (Cover 5 and 10): 2 scripts
Methylation Analysis Details: 368 words

Today’s total: 368 words

Monthly total to date: 921 words

Annual total to date: 33,593 words

Annual target total to date: 46,500 words

Next Up: Tomorrow’s Plan

Set April goals and attainment plan.