Methylation Analysis

Mussel Methylation
Published

April 3, 2026

Plan of the Week: March 30 - April 5, 2026

High-level outline for the week, adjusted daily to reflect the previous day's progress.

  • Moving forward.

Monday - Catch up on UW-RUA and NWS Poster

Tuesday - UW-RUA, No Science

Wednesday - Biomarker Manuscript

Thursday - Biomarker Manuscript

Friday - NWS Symposium

Saturday - Biomarker Manuscript

Sunday - No Science


Plan of the Day

Granular task list to accomplish the high-level goal outlined above.

  • Keep the methylation analysis moving forward
  • Present at the NWS Symposium

Projects Touched Today

  • DNA Methylation
  • NWS Symposium

Progress Notes

  • Checked my methylation outputs - the deduplication, MultiQC, and output organization wrapped up at 0745.

    • Next step is to run the deduplication and parameter checks on the first 10k base pairs so I have them for later checks - not making any adjustments based on these yet.

    • I used the methylation extraction and QC script from ceasmaller (Sam’s repo) as the basis for the extraction script for the next step.

  • Ran the methylation extraction after the parameter checks were completed.

    • Started the methylation extractions with my modified code. Had to rebuild the tool paths after the first failure because I didn’t update them correctly in the modified script.

    • After repeated attempts in which the script seemed to detect existing methylation extraction output where there was none, I realized two major things:

      • First, I built the stupid skip loops wrong - got too caught up in humming fe, fi, fo, fum, I guess.

      • Second, it was a blessing in disguise, since I had two directories mislabeled and would have been really confused about why my extractions and their reports ended up in my code or reference directories…

    • During my work block with KPJ, she helped me talk through the steps of what I was asking the code to do and my anticipated outcomes, and fix the loops since I was still a little stuck.
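
    A minimal sketch of the skip logic as I now understand it, written in Python rather than the shell the real script uses. Everything named here is hypothetical: the `.report.txt` suffix, the function names, and the directory layout are placeholders, not what is in my script or Sam's. The idea is to skip a sample only when a non-empty report already exists in the intended output directory, and to fail loudly if that directory is missing (which would have caught my mislabeled paths).

```python
from pathlib import Path


def needs_extraction(sample: str, out_dir: Path, suffix: str = ".report.txt") -> bool:
    """Return True when no non-empty report for this sample exists in out_dir.

    The suffix is a placeholder; the real report name depends on the tool.
    """
    report = out_dir / f"{sample}{suffix}"
    return not (report.is_file() and report.stat().st_size > 0)


def plan_runs(samples, out_dir):
    """Split samples into (to_run, to_skip) based on existing reports."""
    out_dir = Path(out_dir)
    if not out_dir.is_dir():
        # A mislabeled output path should stop the run, not be silently created.
        raise FileNotFoundError(f"Output directory does not exist: {out_dir}")
    to_run, to_skip = [], []
    for sample in samples:
        if needs_extraction(sample, out_dir):
            to_run.append(sample)
        else:
            to_skip.append(sample)
    return to_run, to_skip
```

    Printing both lists before launching anything makes it easy to see whether the skip check matches what I expect to happen for each sample.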

  • Added coverage (analysis decision) and buffer size (computer decision) explanations to my decision / defense-of-decisions log.

    • Coverage is set at 5 in the scripts in all of the lab repos I have searched. I know it is the number of reads covering a particular site in the sequence, but I don’t know why 5 is the choice.

      • Coverage, a.k.a. total reads at a particular site = number of unmethylated reads + number of methylated reads

      • It is the minimum number of reads needed to support the proportion of methylation. If you have 3 reads - 1 methylated and 2 not - that is not a reliable 33% methylation. If you have 5+ reads and methylation is 1 of 5, that 20% is likely more accurate.

      • Coverage values are a balancing act between our ‘confidence’ in the results and the amount of data we exclude based on read count.

      • Since these are whole-genome bisulfite (BS) sequences, not reduced representation (RR), I may want to play with this coverage value to compare results since there is more to work with - may be a fool's errand, but maybe not. I do not know the implications of higher coverage values given the DNA extraction and sequence quality. I also do not know if increasing coverage will knock out relevant sites of methylation, since inverts are ‘normally’ methylated in a scattered way versus verts.
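
      To make the coverage arithmetic concrete for myself, here is a small Python sketch. It is only an illustration of the definitions above, not the lab's pipeline code; the function names and the `(chrom, pos, meth, unmeth)` tuple layout are my own assumptions. The `step_size` helper shows why low coverage is coarse: at coverage n, a single read moves the estimate by 100/n percentage points.

```python
def percent_methylation(meth: int, unmeth: int) -> float:
    """Percent methylation at one site; coverage = meth + unmeth."""
    coverage = meth + unmeth
    if coverage == 0:
        raise ValueError("Site has no reads")
    return 100.0 * meth / coverage


def step_size(coverage: int) -> float:
    """Percentage points that one read shifts the estimate at this coverage."""
    return 100.0 / coverage


def filter_by_coverage(sites, min_cov: int):
    """Keep sites whose total reads (meth + unmeth) reach min_cov.

    Each site is a (chrom, pos, meth, unmeth) tuple; this layout is assumed.
    """
    return [s for s in sites if s[2] + s[3] >= min_cov]


if __name__ == "__main__":
    # Toy numbers for illustration only, not real data.
    toy_sites = [("chr1", 10, 1, 2), ("chr1", 20, 1, 4), ("chr1", 30, 6, 6)]
    for cutoff in (5, 10):
        kept = filter_by_coverage(toy_sites, cutoff)
        print(f"cutoff {cutoff}: kept {len(kept)} of {len(toy_sites)} sites")
```

      Counting how many sites survive each cutoff is the same trade-off described above: a higher cutoff gives finer steps per site but excludes more data.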

    • Buffer size is an indication to the computer how much data to hold before writing it out.

      • This is a space and speed balancing act; a 50% buffer is just asking the computer not to write out anything until it reaches 50% of the available working memory.

      • Not sure if it is appropriate on Raven since I modified this code from Sam’s script that was running on Klone. 

      • Will leave it in the script in hopes it will also help speed up the process without screwing up anyone else running stuff on Raven

  • A later task is to run a quick script to put all of the checksums into a table or Excel file or whatever for quick side-by-side comparisons and to keep with all of the other metadata. This is a clean-up step, not a process step.
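
  A rough Python sketch of what that clean-up script could look like. It assumes the checksums already live in `md5sum`-style text files (one `<hash>  <filename>` pair per line) with a `.md5` extension; both of those are assumptions about my own files that I still need to confirm, and the function and column names are placeholders.

```python
import csv
from pathlib import Path


def parse_checksum_file(path):
    """Read one md5sum-style file into a list of row dictionaries."""
    path = Path(path)
    rows = []
    for line in path.read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        rows.append({
            "checksum_file": path.name,
            "file": name.lstrip("*"),  # md5sum marks binary mode with '*'
            "checksum": digest,
        })
    return rows


def write_checksum_table(root, csv_path, pattern="*.md5"):
    """Collect every checksum file under root into a single CSV table."""
    rows = []
    for checksum_file in sorted(Path(root).rglob(pattern)):
        rows.extend(parse_checksum_file(checksum_file))
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["checksum_file", "file", "checksum"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

  A CSV opens directly in Excel, so this would cover the side-by-side comparison without needing a separate spreadsheet step.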

  • Created an exact duplicate of the extraction script to run with cover 10 for comparison.

    • I can run this after the extractions for cover 5 because I will need to take my time through the results to really lock in what I do and do not understand before reviewing those results in comparison.

  • Next snag - the sorted BAM files are not in an order the extractor recognizes…

    • Error message in the log for 105M:

    • “The IDs of Read 1 (LH00469:254:22HGFVLT4:2:2361:52054:3816_1:N:0:GTTACGCA+ATGGCGAT) and Read 2 (LH00469:254:22HGFVLT4:2:2441:28449:5398_1:N:0:GTTACGCA+ATGGCGAT) are not the same. This might be the result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file using ‘samtools sort -n’ (by read name). This may also occur using samtools merge as it does not guarantee the read order. To properly merge files please use ‘samtools merge -n’ or ‘samtools cat’.”

  • Remember: the Bismark methylation extractor requires R1 and R2 of each pair to remain adjacent, so position-sorted BAMs are not going to work.

  • I fixed the script to pull the unsorted BAM files, and once it began working, I left it running so I could get ready to go.
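
  For future me, a small Python sketch of the check the extractor is effectively doing. This is my own illustration, not Bismark's code: it takes read names in file order (for example, column 1 of `samtools view` output) and reports the first pair whose two names differ. It assumes both mates carry an identical read name, as they do in the error message above, and the function name is made up.

```python
def first_broken_pair(read_names):
    """Return the 0-based index of the first pair whose names differ.

    Reads are consumed two at a time in file order. Returns None when every
    consecutive pair shares one name, i.e. R1 and R2 are adjacent throughout.
    A dangling final read (odd count) is reported as a broken pair.
    """
    names = iter(read_names)
    for pair_index, first in enumerate(names):
        second = next(names, None)  # the mate should be the very next record
        if second != first:
            return pair_index
    return None


if __name__ == "__main__":
    # Toy read names for illustration only.
    adjacent = ["readA", "readA", "readB", "readB"]
    by_position = ["readA", "readB", "readA", "readB"]
    print(first_broken_pair(adjacent))
    print(first_broken_pair(by_position))
```

  Running something like this on the first few thousand read names of a BAM would flag a position-sorted file before committing to a full extraction run.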

  • Finally, following the guidance in GH Issue #2090, I fixed my lab notebook not showing up on the handbook page or in the lab feed.

    • The handbook page update worked. I think I added my name to the path twice instead of once…

    • I can’t see if the feed that drops into Slack worked, so I will wait until tomorrow to verify in case it only pulls once a day or at specific times or whatever.


Outcomes: Products & Word Count

  • Extraction Analysis (Cover 5 and 10): 2 scripts
  • Methylation Analysis Details: 368 words

Today’s total: 368 words

Monthly total to date: 921 words

Annual total to date: 33,593 words

Annual target total to date: 46,500 words

Next Up: Tomorrow’s Plan

  • Set April goals and attainment plan.