close
close

first Drop

Com TW NOw News 2024

Use R to apply for a local LLM with ollamar
news

Use R to apply for a local LLM with ollamar

(This article was first published on Have genetics doneand kindly contributed to R-bloggers). (You can report problems with the content of this page here)


Want to share your content on R-bloggers? Click here if you have a blog, or here if you don’t.

This is a re-post of the original article:

https://blog.stephenturner.us/p/use-r-to-prompt-a-local-llm-with

Use R to start a local LLM with ollamar: Using R to start llama3.1:70b on my laptop with Ollama + ollamar to tell me what’s interesting about a set of genes and to summarize recent bioRxiv preprints

——————–

I have the lama3.1:70b model just released by Meta using Ollama running on my MacBook Pro. Ollama makes it easy to talk to a locally running LLM in the terminal (ollama run llama3.1:70b) or via a trusted GUI with the open-web Docker container.

Here I will demonstrate the use of the ollamar package on CRAN to talk to an LLM running locally on my Mac. I will demonstrate this by asking llama3.1-70b what he thinks about a set of genes and to summarize recent preprints published on bioRxiv’s channel for scientific communication and education using the smaller+faster 8B model.

First I will msigdbr R package to retrieve the gene symbols for all genes involved in the G2/M Checkpoint in the progression of the cell cycle of MSigDB file.

library(msigdbr)
hm 

Next, I’ll load the ollamar package, and test the connection to the Ollama server.

library(ollamar)
test_connection()

If all goes well, you should see something like this:

Ollama local server running

GET http://localhost:11434/
Status: 200 OK
Content-Type: text/plain
Body: In memory (17 bytes)

The ollamar README has good documentation on how to retrieve models, an overview of available models, etc.

Next, I'll collapse the gene symbol vector I created above into a string ("CDK1, SMC2, POLA2, …"), and create a prompt that holds this list of gene symbols until the end. Finally, I'll generate() function with this prompt I just created to prompt the llama3.1:70b model and print the results.

# Collapse the symbol vector to a string ("GENE1, GENE2, GENE3")
gs_string 

The prompt returns markdown text which looks nice after rendering with RMarkdown/Quarto:

What a massive list of gene symbols!

After analyzing the list, I noticed several interesting aspects:

  1. Cell cycle and DNA replication: Many genes are involved in cell cycle regulation (e.g., CDK1, CDK4, CDC20, CDC25A, CDC45), DNA replication (e.g., ORC5, ORC6, MCM2, MCM3, MCM5), and chromosome segregation (e.g., BUB1, BUB3, CENPA, CENPE).

  2. Apoptosis and cell death: Several genes are associated with apoptosis (programmed cell death) or regulation of cell death pathways, including CASP8AP2, BIRC5, and PML.

  3. Cancer-related genes: A significant number of genes have been implicated in cancer development or progression, such as MYC, CCND1, CCNE1, CDC25A, and PLK1. Some genes are tumor suppressors (e.g., BRCA2, TP53), while others may promote tumorigenesis.

  4. Chromatin modification and epigenetics: Genes like EZH2, CUL4A, and KMT5A are involved in chromatin modification, histone regulation, or other epigenetic processes that influence gene expression.

  5. Transcriptional regulation: Many genes encode transcription factors (e.g., E2F1, MYC, HIF1A) or co-factors that regulate gene expression, including some with roles in cell cycle progression and apoptosis.

  6. Nuclear structure and function: Genes like LBR, LIG3, and NUP50 are involved in nuclear envelope organization, pore complex formation, or nucleocytoplasmic transport.

These observations suggest that this gene set is enriched for genes involved in fundamental cellular processes such as cell cycle regulation, DNA replication, apoptosis, chromatin modification, and transcriptional control. This might indicate a specific biological context or experimental condition under which these genes are co-expressed.

Would you like to know more about any of these aspects or explore further?

For kicks I did this with a random selection of 50 genes from the human genome:

allgenesets 

Llama3.1 is doing a good job HARKing about what these genes could be doing:

After analyzing the list, I noticed a few interesting things:

  1. High proportion of pseudogenes: About 20-25% of the genes on your list are labeled as "pseudogenes" (e.g., RN7SL677P, RPL27AP3, HYAL6P). Pseudogenes are inactive copies of functional genes that have lost their ability to code for proteins. They can still be transcribed and play roles in regulating gene expression.

  2. Ribosomal protein pseudogenes: Many of the pseudogenes on your list (e.g., RPL27AP3, RN7SL677P) appear to be related to ribosomal proteins, which are essential for protein synthesis.

  3. MicroRNAs and small nuclear RNAs: You have several microRNA genes (MIRs) and small nuclear RNA (snRNA) genes on your list (e.g., MIR4444-2, RNU7-160P). These types of non-coding RNAs play crucial roles in regulating gene expression.

  4. Tyrosine phosphatase and kinases: The presence of PTPN7 (a tyrosine phosphatase) and TYK2 (a Janus kinase) on your list suggests a possible connection to signaling pathways involved in cell growth, differentiation, or immune responses.

  5. Vascular endothelial growth factor A (VEGFA): This gene is crucial for angiogenesis (the formation of new blood vessels). Its presence on your list might indicate a role in vascular development or disease.

These observations are just a starting point, and further analysis would be needed to determine the significance of these genes in a specific biological context. Do you have any additional information about how this gene set was obtained or what type of study it's related to?

Above I created a single prompt with a single call to generate(). Here I’ll create many prompts based on data in columns of a data frame, and purrr::map() the generate() function over all of those prompts to create a new column of responses in my data frame.

I’ll demonstrate this by asking for a two sentence summary of the most recent bioRxiv preprints. I’ll start by pulling the latest titles and abstracts from the Scientific Communication and Education channel on bioRxiv. I use the tidyrss package to pull this information directly from the bioRxiv RSS feed. This returns a data frame with title, abstract, and other information about the most recent preprints in the feed.

library(tidyRSS)
library(tidyverse)

# Parse the feed
feed 

Here are the latest three titles (as of August 3, 2024). The abstracts that go with them (not shown) are in the feed$item_description column.

(1) "\nBiological changes, political ideology, and scientific communication shape human perceptions of pollen seasons \n" 
(2) "\nAn updated and expanded characterization of the biological sciences academic job market \n"                        
(3) "\n\"I'd like to think I'd be able to spot one if I saw one\": How science journalists navigate predatory journals \n"

Then I take the first 20 lines of the feed, extract the title and abstract with the select statement, remove the leading and trailing spaces with the first mutate, construct a prompt with the second mutate, and generate a response with llama3.1 with the last mutate with purrr::map_chr()Here I'm using the smaller/faster model llama3.1:8b (the default, unless you specify :70b).

summarized 
  head(20) |>
  select(title=item_title, abstract=item_description) |>
  mutate(across(everything(), trimws)) |>
  mutate(prompt=paste(
    "\n\nI'm going to give you a paper's title and abstract.",
    "Can you summarize this paper in 2 sentences?",
    "\n\nTitle: ", title, "\n\nAbstract: ", abstract)) |>
  mutate(response=purrr::map_chr(prompt, \(x) 
                                 generate("llama3.1",
                                          prompt=x,
                                          output="text")$response))

Here's an example prompt that would execute for each title and summary in the feed:

I'm going to give you a paper's title and abstract. Can you summarize this paper in a few sentences? 

Title:  An updated and expanded characterization of the biological sciences academic job market 

Abstract:  In the biological sciences, many areas of uncertainty exist regarding the factors that contribute to success within the faculty job market. Earlier work from our group reported that beyond certain thresholds, academic and career metrics like the number of publications, fellowships or career transition awards, and years of experience did not separate applicants who received job offers from those who did not. Questions still exist regarding how academic and professional achievements influence job offers and if candidate demographics differentially influence outcomes. To continue addressing these gaps, we initiated surveys collecting data from faculty applicants in the biological sciences field for three hiring cycles in North America (Fall 2019 to the end of May 2022), a total of 449 respondents were included in our analysis. These responses highlight the interplay between various scholarly metrics, extensive demographic information, and hiring outcomes, and for the first time, allowed us to look at persons historically excluded due to ethnicity or race (PEER) status in the context of the faculty job market. Between 2019 and 2022, we found that the number of applications submitted, position seniority, and identifying as a women or transgender were positively correlated with a faculty job offer. Applicant age, residence, first generation status, and number of postdocs, however, were negatively correlated with receiving a faculty job offer. Our data are consistent with other surveys that also highlight the influence of achievements and other factors in hiring processes. Providing baseline comparative data for job seekers can support their informed decision-making in the market and is a first step towards demystifying the faculty job market.

Now the response column in my new table contains all the responses from the LLM. I can now put the title and summary in a table and render it with pandoc.

summarized |> 
  select(title, response) |> 
  mutate(response=gsub("\n", " ", response)) |> 
  knitr::kable()