Adding Google Indexing to a Hexo-Based Website and Configuring SEO

Just did some SEO for the website and took the opportunity to update an old blog post from four years ago. This article covers: installing a plugin to generate a sitemap for the website, configuring the plugin to exclude pages or files that don't need to be indexed, submitting the sitemap to Google for indexing, and using robots.txt for search engine optimization (SEO).

Plugin Installation and Configuration

Installation

First, we need to install the hexo-generator-sitemap plugin to generate a sitemap for the website. Use the following command:

npm install hexo-generator-sitemap --save

Configuration

After installation, open the configuration file of the Hexo framework, ./_config.yml, and add the following settings:

./_config.yml
# Sitemap Generator
sitemap:
  path: sitemap.xml
  tags: false
  categories: false

Explanation:

  • path specifies the path for the generated sitemap. For example, if your website URL is https://siriusq.top/, the sitemap will be generated at https://siriusq.top/sitemap.xml.
  • tags is used to determine whether to include tag-related pages. If you have a large number of tags, you may want to disable this option to avoid flooding the sitemap with tag entries.
  • categories is used to decide whether to include category-related pages. Enable or disable as needed.
  • For more options, please refer to the plugin README page.
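
For reference, the generated file follows the standard sitemap protocol. With hexo-generator-sitemap the output looks roughly like the following (the URL and date here are purely illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://siriusq.top/some-post/</loc>
    <lastmod>2023-10-01</lastmod>
  </url>
  <!-- one <url> entry per post or page included in the sitemap -->
</urlset>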

Excluding Specific Pages

If you wish to exclude specific pages, such as the 404 page, from the sitemap, add sitemap: false to the front-matter block (the section delimited by --- at the top of the corresponding Markdown file). Here’s an example:

---
title: "404"
type: "404"
permalink: /404
date: 2022-04-12 20:38:42
layout: false
comments: false
sitemap: false
---

Page content

Excluding Specific File Formats

By default, the plugin already skips js and css files. However, if your website contains many JSON files, such as Live2D model configuration files, they will all be added to the sitemap during generation, which can result in a large number of entries. Since I could not find a configuration option in the plugin settings to exclude specific file formats, I chose to modify the plugin code directly XD.

Open the generator.js file located under ./node_modules/hexo-generator-sitemap/lib/ in your blog’s directory. Find the skipRenderList constant and add a '**/*.json' entry as shown below, making sure not to forget the comma at the end of the previous line:

generator.js
const skipRenderList = [
  '**/*.js',
  '**/*.css',
  '**/*.json'
];
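
To confirm the change took effect, regenerate the site and check whether any .json entries remain in the output (a quick sanity check, assuming the default public/ output directory):

hexo clean && hexo generate
grep '\.json' public/sitemap.xml   # should print nothing once the JSON files are skipped

Also keep in mind that edits under node_modules are lost whenever the package is reinstalled, so you may need to reapply this change after running npm install again.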

Add to Google Index

Currently, your website cannot be found through search engines. To enable search engine indexing, I’ll guide you through the process, using Google as an example.

  1. Go to Google Search Console, click “Start now,” and log in with your Google account.
  2. Select the “URL prefix” property type and enter your site’s URL, such as https://siriusq.top, then click “Continue.”
  3. Choose the HTML tag verification method and copy the meta tag provided by Google, which looks like this:
    <meta name="google-site-verification" content="tcRYfdvRwh-85F9TY63_0TZ4O56GfhgyNGZhgfzsdfgghfg" />
  4. Open the NexT theme configuration file, _config.next.yml. Use the search function to find google_site_verification: and paste in the content value from the meta tag you copied in the previous step. It should look like this:
    google_site_verification: tcRYfdvRwh-85F9TY63_0TZ4O56GfhgyNGZhgfzsdfgghfg
  5. Run the hexo cl and hexo g -d commands to regenerate your website’s resources and push them to GitHub (see the command sketch after this list). Check your website’s deployment status in the Actions tab of your blog repository.
  6. After deployment is complete, return to the page from step 3 and click “Verify.”
  7. Allow some time, possibly up to a month, for Google’s crawlers to index your website. Once indexing is complete, you can run a site:your-domain search on Google to confirm your pages appear in the results.
  8. I recommend manually submitting a sitemap after verification:
    • Click on “Sitemaps” in the left sidebar of Google Search Console.
    • The default URL for your sitemap is https://[your blog domain]/sitemap.xml.
    • Paste the URL and click “Submit.”
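
As promised in step 5, here is what those two Hexo commands expand to (a small sketch; it assumes deployment is already configured for your Hexo setup, for example via a deployer or GitHub Actions):

hexo clean          # same as "hexo cl": removes the generated files and cache
hexo generate -d    # same as "hexo g -d": rebuilds the site and deploys it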

robots.txt

robots.txt is a text file that tells search engine crawlers which parts of your website they may access. Place it in the root directory of your website to specify which pages or directories crawlers can visit and which they cannot.
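
How the file ends up in that root directory depends on how you build the site. With Hexo, one common approach (an assumption on my part, not necessarily how this site is deployed) is to keep robots.txt in the source/ folder so it is copied to the site root during generation, optionally listing it under skip_render in ./_config.yml so it is copied as-is:

./_config.yml
# Copy robots.txt verbatim instead of passing it through a renderer
skip_render:
  - robots.txt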

Here are some common elements and rules for robots.txt:

  1. User-agent: Specifies the name of the search engine crawler to which the rules apply. For example, “User-agent: Googlebot” indicates that the following rules apply to the Googlebot crawler. Common crawler names include:
    • Googlebot - Google’s search engine crawler.
    • Bingbot - Microsoft Bing’s search engine crawler.
    • Slurp - Yahoo’s search engine crawler.
    • DuckDuckBot - DuckDuckGo’s search engine crawler.
    • Baiduspider - Baidu’s search engine crawler.
    • YandexBot - Yandex’s search engine crawler.
  2. Disallow: Specifies the pages or directories that the crawler is not allowed to access. For example, “Disallow: /private/” means that the “/private/” directory is off-limits to crawlers.
  3. Allow: Used to override Disallow rules. If a directory is Disallow-ed but certain files or subdirectories within it need to be crawled, you can use Allow to lift the restriction.
  4. Crawl-delay: Asks the crawler to wait a specified number of seconds between requests, which helps keep the crawl rate from overloading the server. Note that not every search engine honors this directive (Google, for example, ignores it).
  5. Sitemap: Informs the search engine about the location of the sitemap, which helps search engines better understand the website’s structure.

Example of robots.txt

Here’s a simple example of a robots.txt file:

User-agent: Googlebot
Disallow: /private/
Allow: /public/
Crawl-delay: 10

User-agent: Bingbot
Disallow: /admin/

In this example, Googlebot is not allowed to access the “/private/” directory but may access the “/public/” directory, and its crawling speed is limited. For Bingbot, access to the “/admin/” directory is disallowed.

(The example above was provided by ChatGPT.)

Below is an example of my robots.txt file. It prohibits Baidu’s crawler from crawling the website and restricts other crawlers from accessing irrelevant directories and the 404 page.

User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /about/
Disallow: /archives/
Disallow: /categories/
Disallow: /css/
Disallow: /images/
Disallow: /js/
Disallow: /lib/
Disallow: /live2d-widget/
Disallow: /live2d_api/
Disallow: /page/
Disallow: /tags/
Disallow: /404.html

Sitemap: https://siriusq.top/sitemap.xml

For more information about robots.txt, please refer to the Google Search Central documentation.