The heart of the game - getting the words out of a web page

A game that is about turning words on web pages into dungeons needs to start with a function that gets all the words in a web page, and sorts them by length - we want the fiercest dragons and the most powerful swords first.

My Java class for this job took me a few days and had nearly 500 lines; I wrote an HTML scanner myself, with methods to e.g. skip over comments and HTML tags. By the end of this blog entry you'll understand why I'm too embarrassed to share it here.

So I went ahead and asked ChatGPT:

Hi, in this session we are a team of expert software developers who want to create a web based game in Javascript. First we we need a function that takes an URL, and that reads the web page of the URL and determines a list of the words of text it contains, sorted by character count. Can you write this function?

And it did.

I then asked it to create a test web page (I'm lazy):

can you now write a little test web page that tests the function, with a text field where a user can enter tue URL and then see the list of words?

And I got what I wanted. However, the first version didn't work because of a security measure in today's web browsers, the Same-Origin Policy. ChatGPT explained to me what it is and how to work around it with a server-side proxy, and it updated the JS code to route its requests through a demo server that does this. Once I had activated the server's proxy with the click of a button, the code worked beautifully.
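The proxy workaround can be sketched as follows. This is only a minimal illustration, assuming a proxy that accepts the target URL appended to its own base URL; the helper names buildProxiedUrl and fetchViaProxy are hypothetical and not part of what ChatGPT wrote.

```javascript
// Hypothetical helper: prefixes a target URL with the base URL of a CORS proxy.
// Assumes the proxy expects requests of the form <proxyBase>/<targetUrl>.
function buildProxiedUrl(proxyBase, targetUrl) {
  // The URL constructor throws a TypeError on malformed input.
  const target = new URL(targetUrl);
  if (target.protocol !== 'http:' && target.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${target.protocol}`);
  }
  // Make sure exactly one slash separates the proxy base from the target.
  const base = proxyBase.endsWith('/') ? proxyBase : proxyBase + '/';
  return base + target.href;
}

// Hypothetical helper: fetches the page through the proxy and returns its HTML.
async function fetchViaProxy(proxyBase, targetUrl) {
  const response = await fetch(buildProxiedUrl(proxyBase, targetUrl));
  if (!response.ok) {
    throw new Error(`Proxy request failed with status ${response.status}`);
  }
  return response.text();
}
```

Keeping the proxy base as a parameter makes it easy to swap the demo server for a proxy of your own later; checking response.ok turns a rejected proxy request into a clear error instead of an empty word list.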

I then noticed that I wanted to get rid of the JS code and the <style> sections in the page, so I asked ChatGPT to take care of it:

wonderful, your new code works, but it includes not only the words of the text but also javascript code on the web page. Can you refine it to ignore javascript code?

And it did.
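To illustrate the idea, here is a minimal sketch of stripping script and style blocks before removing the remaining tags. The function names are hypothetical, and the regexps are simpler than the ones ChatGPT produced. They are good enough for a game, not a real HTML parser. Replacing each match with a space rather than an empty string is a choice of this sketch, so that words from neighbouring elements are not glued together.

```javascript
// Hypothetical helper: removes <script> and <style> elements including their content.
function stripScriptsAndStyles(html) {
  return html
    // [\s\S]*? matches lazily across newlines up to the first closing tag.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, ' ')
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, ' ');
}

// Hypothetical helper: removes all remaining tags, leaving only the text between them.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, ' ');
}

const sample =
  '<p>Hello</p><script>var x = 1;</script><style>p { color: red; }</style><p>World</p>';
const text = stripTags(stripScriptsAndStyles(sample));
console.log(text.trim().split(/\s+/)); // [ 'Hello', 'World' ]
```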

Then I asked it to remove words that are using characters that are not part of the English language, and to sort descending by length, and to turn all words to lowercase.

And again, it did all that beautifully, on the first attempt, without flaw.
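Those three requirements fit into a few lines. The following is a minimal sketch with a hypothetical function name, assuming that "English" simply means the letters a to z. It requires at least one letter per word, so empty strings produced by the split do not slip through, and it uses a Set to drop duplicates.

```javascript
// Hypothetical helper: returns the unique a-z words of a text, longest first.
function sortWordsByLength(text) {
  const words = text
    .split(/\s+/)                              // split on any run of whitespace
    .filter(word => /^[a-zA-Z]+$/.test(word))  // keep only words made of a-z letters
    .map(word => word.toLowerCase());          // normalize to lowercase
  const unique = [...new Set(words)];          // drop duplicates, keep first-seen order
  // Array.prototype.sort is stable, so words of equal length keep their order.
  return unique.sort((a, b) => b.length - a.length);
}

console.log(sortWordsByLength('The fierce Dragon guards the café 42 sword'));
// [ 'fierce', 'dragon', 'guards', 'sword', 'the' ]
```

Note that the letter filter also discards words with attached punctuation, such as "sword," or "dragon.", which is acceptable for generating a dungeon but would matter for real text analysis.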

Here's the code it created:

async function getSortedWordsByCharacterCount(url) {
  const proxyUrl = 'https://cors-anywhere.herokuapp.com/';
  const response = await fetch(proxyUrl + url);
  const html = await response.text();
  const text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>|<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '') // Remove script and style tags
                   .replace(/<[^>]*>?/gm, ''); // Remove HTML tags
  const words = text.split(/\s+/) // Split text into words
                   .filter(word => /^[a-zA-Z]*$/.test(word)) // Exclude non-English words
                   .map(word => word.toLowerCase()); // Convert words to lowercase
  const wordCounts = {};

  // Count occurrence of each word
  for (const word of words) {
    if (word in wordCounts) {
      wordCounts[word]++;
    } else {
      wordCounts[word] = 1;
    }
  }

  // Sort words by character count (longest first)
  const sortedWords = Object.keys(wordCounts).sort((a, b) => {
    return b.length - a.length;
  });

  return sortedWords;
}

And its output for a Wikipedia page:


I

am

so 

impressed.

Yes, JS is very well suited to this kind of task, but still. Where I needed 450 lines of Java code, ChatGPT needed 28 lines of JS code, using VERY clever regexps where I had scanned and if-then-elsed like hell. Amazing. And it didn't sound like a newbie developer I had to teach. Rather, it sounded like an expert developer talking to a product manager.

