Analyzing your competitors' content can offer valuable insights into your own operations and goals. This basic Python script provides n-Gram data in seconds.
This Python script is an elementary version of a competitor content analysis. The main idea is to get a quick overview of what the content focus looks like. A lean approach is to fetch all URLs from the sitemap, parse the URL slugs, and run an n-Gram analysis on them. If you want to learn more about n-Gram analysis, also have a look at our Free N-Gram Tool. You can apply it not only to URLs but also to keywords, titles, and so on.
As a result, you get a list of the n-Grams used in the URL slugs, along with the number of pages that use each n-Gram. The analysis takes only a few seconds, even on large sitemaps, and runs in fewer than fifty lines of code.
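To make the idea concrete, here is a minimal, dependency-free sketch of what counting n-Grams over URL slugs means. The `slug_ngrams` helper and the sample slugs are hypothetical, for illustration only; the actual script below delegates this work to advertools.

```python
from collections import Counter

def slug_ngrams(slugs, n=2):
    """Count n-Grams across a list of URL slugs.

    Each slug is split on hyphens into words; an n-Gram is any
    run of n consecutive words within one slug.
    """
    counts = Counter()
    for slug in slugs:
        words = slug.split("-")
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i : i + n])] += 1
    return counts

# Hypothetical slugs, as they might appear in a sitemap
slugs = ["python-sitemap-analysis", "python-ngram-tool", "sitemap-analysis-guide"]
print(slug_ngrams(slugs, n=2).most_common(3))
```

The count per n-Gram corresponds to the number of pages whose slug contains it, which is exactly the summary the script produces.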
If you want to get deeper insights, I suggest carrying on with these approaches:
- Fetch the content of every URL in the sitemap
- Run an n-Gram analysis on the headlines
- Run an n-Gram analysis on the body content
- Extract keywords with TextRank or RAKE
- Extract named entities relevant to your SEO business
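As a taste of the first two points, here is a hedged sketch of extracting headlines from a fetched page using only the standard library, so the text can then be fed into an n-Gram analysis. The sample HTML is hypothetical; in practice you would fetch each URL first (e.g. with `requests` or an advertools crawl).

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text inside <h1>-<h3> tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines[-1] += data

# Hypothetical page content, standing in for a fetched URL
html = "<h1>N-Gram Analysis</h1><p>intro</p><h2>Why slugs matter</h2>"
parser = HeadlineParser()
parser.feed(html)
print(parser.headlines)
```

Each collected headline can be treated exactly like a slug and passed to the same n-Gram counting step.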
But let's start simple and take a first look at the data with this script. Based on your feedback, I may add more sophisticated approaches. Before you run the script, you only need to enter the sitemap URL you want to analyze. After running the script, you will find your results in sitemap_ngrams.csv. Open it in Excel or Google Sheets and have fun analyzing the data.
Here is the Python code:
```python
# Pemavor.com Sitemap Content Analyzer
# Author: Stefan Neefischer

import advertools as adv
import pandas as pd

# Provide the sitemap that should be analyzed
site = "https://searchengineland.com/sitemap_index.xml"

# Fetch all URLs from the sitemap (sitemap indexes are resolved recursively)
sitemap = adv.sitemap_to_df(site)
sitemap = sitemap.dropna(subset=["loc"]).reset_index(drop=True)

# Some sitemaps list URLs with a trailing "/", some without.
# With a trailing "/", the slug is the second-to-last path segment;
# otherwise, it is the last one.
loc = sitemap["loc"].dropna()
slugs = loc[loc.str.endswith("/")].str.split("/").str[-2].str.replace("-", " ")
slugs2 = loc[~loc.str.endswith("/")].str.split("/").str[-1].str.replace("-", " ")

# Merge the two series into one list of slugs
slugs = list(slugs) + list(slugs2)

# adv.word_frequency automatically removes stop words
word_counts_onegram = adv.word_frequency(slugs)
word_counts_twogram = adv.word_frequency(slugs, phrase_len=2)

# Combine the 1-Gram and 2-Gram counts and save them to a CSV file
output_csv = pd.concat([word_counts_onegram, word_counts_twogram], ignore_index=True)
output_csv.to_csv("sitemap_ngrams.csv", index=False)
print("csv file saved to sitemap_ngrams.csv")
```
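If you prefer to stay in Python instead of opening the CSV in a spreadsheet, you can inspect the results with pandas. The column names `word` and `abs_freq` are an assumption about the advertools output format, so the snippet below builds a small synthetic file in that shape to stay self-contained.

```python
import pandas as pd

# Synthetic stand-in for sitemap_ngrams.csv
# (column names 'word' and 'abs_freq' are assumed, not guaranteed)
df = pd.DataFrame({"word": ["ppc", "seo", "google ads"],
                   "abs_freq": [3, 12, 5]})
df.to_csv("sitemap_ngrams.csv", index=False)

# Load the results and rank n-Grams by how often they occur
top = pd.read_csv("sitemap_ngrams.csv").sort_values("abs_freq", ascending=False)
print(top.head(2)["word"].tolist())
```

Sorting by the frequency column immediately surfaces the topics a competitor writes about most.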