Ethan Holleman

Posts with the tag R:

Scraping leafly.com cannabis strain data

Recently a friend asked me if I knew of any databases that classify cannabis strains by the symptoms people tend to use them to relieve. I didn’t know of any, but I had heard of leafly.com, which catalogues user reviews of various cannabis strains and compiles data on their characteristics. I thought this could be a good place for them to start, so I started looking into what it would take to make a web scraper to pull down all the data leafly has compiled on hundreds of cannabis strains. It turns out it didn’t take that much.

What would recursive academic citations look like?

The way academic citations are currently measured is pretty standardized. The authors of article A accrue a citation whenever their article is directly cited in article B. But there is likely a large amount of work that was cited by article A but not by article B. The authors of that work, who indirectly contributed to article B by contributing to article A (which B cites), will not see a citation. What if, instead, citing one article triggered a recursive call all the way down the network formed by articles and their citations? Would this eventually end up citing almost all articles in a field?
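As a rough illustration of the idea, here is a minimal Python sketch that compares direct citation counts with "recursive" counts on a toy network. The graph `CITES`, the article names, and the function names are all made up for illustration, and the rule used here (each citing article credits every article it can reach, once) is just one possible way to define a recursive citation.

```python
from collections import Counter

# Hypothetical toy citation graph: each key cites the articles in its list.
# B cites A, A cites X and Y, and both X and Y cite Z.
CITES = {
    "B": ["A"],
    "A": ["X", "Y"],
    "X": ["Z"],
    "Y": ["Z"],
    "Z": [],
}


def direct_counts(cites):
    """Standard counting: one citation per article that cites you directly."""
    counts = Counter()
    for refs in cites.values():
        counts.update(set(refs))
    return counts


def reachable(cites, start):
    """All articles reachable from `start` by following citations.

    Uses an explicit stack and a `seen` set so that shared references
    (or accidental cycles in the data) are only visited once.
    """
    seen = set()
    stack = list(cites.get(start, []))
    while stack:
        node = stack.pop()
        if node in seen or node == start:
            continue
        seen.add(node)
        stack.extend(cites.get(node, []))
    return seen


def recursive_counts(cites):
    """Assumed rule: every article credits each article it can reach, once."""
    counts = Counter()
    for article in cites:
        counts.update(reachable(cites, article))
    return counts


if __name__ == "__main__":
    print("direct:   ", dict(direct_counts(CITES)))
    print("recursive:", dict(recursive_counts(CITES)))
```

In this toy graph Z is cited directly by only two articles, but under the recursive rule it is credited by all four articles upstream of it, which hints at how quickly foundational work would accumulate citations.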

Visualizing ligand docking results with PyMOL scripting and R

For the past couple of days I have been using Rosetta to run some ligand docking simulations as part of my current rotation with the Cortopassi lab. One of these simulations involved fitting a small portion of the insulin receptor (IR) that the lab is interested in into a known binding region of the Shc1 protein. Any Rosetta docking simulation requires hundreds of repetitions, which generate a large number of PDB files, each showing the final conformation of the protein and ligand at the end of a given simulation. While reading about the best way to aggregate and analyze these results, I spent a bit of time looking for ways to visualize everything Rosetta spits out.

Plotting COVID-19 Hospitalization Geospatial Data

After finding the COG-UK data I was looking around for other interesting COVID-19 datasets to play around with and build my R plotting skills. User moritz.kraemer posted this article on early case descriptions, which included a lot of geospatial data that I was interested in taking a look at. A significant number of fields were devoted to hospitalization-related measurements, so I focused on that subject for the plot below. The dataset includes patients with and without hospitalization records, so I first filtered down to just those who had hospitalization records and also had location data. This subset of patients formed subplot A.

Plotting COG-UK Data

The COVID-19 Genomics UK Consortium (COG-UK) has been collecting and sequencing thousands of COVID-19 genomes from patients in the UK and around the world. All of their data is publicly available. Here I played around with the phylogenetic tree they have created from global alignments of all the genomes they have sequenced. You can download the tree in Newick format from their data page, which also hosts the sequences and the alignment files.

Figures: Visualizing the COVID-19 phylogenetic tree by country of origin; Genome count by country (note that the y-axis of this plot is on a log scale); 16 most prevalent UK COVID-19 lineages (density plots showing the number of genomes of the 16 most prevalent lineages detected by COG-UK).