TidyTuesday: Pokémon

Answering data analysis questions about Pokémon characters asked by ChatGPT.
TidyTuesday
Python
pandas
Author

Jason Bernstein

Published

April 5, 2025

Introduction

The TidyTuesday dataset this week is about Pokémon characters. I uploaded the first few rows of the dataset to ChatGPT and asked it to provide easy, medium, and hard data analysis questions. This post goes through those questions.

First, let’s import the pandas library for performing the data analysis.

import pandas as pd

Then we download the data.

base_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-01"
df_raw = pd.read_csv(f"{base_url}/pokemon_df.csv")

Let’s remove some columns that are not needed.

df = df_raw.drop(columns=["id", "url_icon", "url_image"])

It’s useful to look at the first few rows of the dataset to get an idea of the variables and their values. We see that the dataset includes Pokémon names, variables like attack and defense, colors, egg groups, etc.

from IPython.display import HTML

HTML(df.head(5).to_html(index=False))
pokemon species_id height weight base_experience type_1 type_2 hp attack defense special_attack special_defense speed color_1 color_2 color_f egg_group_1 egg_group_2 generation_id
bulbasaur 1 0.7 6.9 64 grass poison 45 49 49 65 65 45 #78C850 #A040A0 #81A763 monster plant 1.0
ivysaur 2 1.0 13.0 142 grass poison 60 62 63 80 80 60 #78C850 #A040A0 #81A763 monster plant 1.0
venusaur 3 2.0 100.0 236 grass poison 80 82 83 100 100 80 #78C850 #A040A0 #81A763 monster plant 1.0
charmander 4 0.6 8.5 62 fire NaN 39 52 43 60 50 65 #F08030 NaN NaN monster dragon 1.0
charmeleon 5 1.1 19.0 142 fire NaN 58 64 58 80 65 80 #F08030 NaN NaN monster dragon 1.0

On to the questions!

Easy Questions

  1. What are the column names in the dataset?
df.columns
Index(['pokemon', 'species_id', 'height', 'weight', 'base_experience',
       'type_1', 'type_2', 'hp', 'attack', 'defense', 'special_attack',
       'special_defense', 'speed', 'color_1', 'color_2', 'color_f',
       'egg_group_1', 'egg_group_2', 'generation_id'],
      dtype='object')
  1. What is the average weight of all Pokémon?
df[["weight"]].mean()
weight    66.213172
dtype: float64
  1. How many Pokémon have a secondary type (type_2) listed?
df["type_2"].notna().sum()
np.int64(510)
  1. What are the top 5 most common egg groups (egg_group_1)?
df.groupby("egg_group_1").size().sort_values(ascending=False).head(n=5)
egg_group_1
ground     218
no-eggs    118
monster     94
water1      84
bug         80
dtype: int64
  1. How many Pokémon are of the color #F08030?
df.query("color_1 == '#F08030'").shape[0]
59

Medium Questions

  1. What’s the minimum and maximum defense among Grass-type Pokémon?
(
    df.query("type_1 == 'grass'")
    .assign(
        min_defense=lambda d: d["defense"].min(),
        max_defense=lambda d: d["defense"].max(),
    )
    .query("defense == min_defense or defense == max_defense")[["pokemon", "defense"]]
)
pokemon defense
190 sunkern 30
597 ferrothorn 131
797 kartana 131
  1. List all Pokémon with hp greater than 200.
df.query("hp > 200").loc[:, "pokemon"].tolist()
['chansey', 'blissey', 'guzzlord', 'zygarde-complete']
  1. How many unique combinations of egg_group_1 and egg_group_2 are there?
df[["egg_group_1", "egg_group_2"]].drop_duplicates().dropna().shape[0]
35
  1. Which Pokémon has the largest difference between attack and defense?
(
    df[["pokemon", "attack", "defense"]]
    .assign(diff_attack_defense=lambda x: x["attack"] - x["defense"])
    .sort_values("diff_attack_defense", ascending=False)
    .head(1)
)
pokemon attack defense diff_attack_defense
802 deoxys-attack 180 20 160
  1. Which primary type has the lowest average special_defense?
(
    df[["type_1", "special_defense"]]
    .groupby("type_1")
    .mean()
    .reset_index()
    .sort_values("special_defense")
    .head(n=1)
)
type_1 special_defense
10 ground 62.861111

Hard Questions

  1. Find the average base_experience by each combination of type_1 and type_2.
df[["type_1", "type_2", "base_experience"]].groupby(["type_1", "type_2"]).mean()
base_experience
type_1 type_2
bug electric 163.800000
fairy 111.500000
fighting 153.250000
fire 160.000000
flying 141.500000
... ... ...
water ice 179.000000
poison 111.666667
psychic 160.333333
rock 139.500000
steel 239.000000

155 rows × 1 columns

  1. Which Pokémon have all base stats (hp, attack, defense, special_attack, special_defense, speed) above the average of their respective columns?
df[
    (df["hp"] > df["hp"].mean())
    & (df["attack"] > df["attack"].mean())
    & (df["defense"] > df["defense"].mean())
    & (df["special_attack"] > df["special_attack"].mean())
    & (df["special_defense"] > df["special_defense"].mean())
    & (df["speed"] > df["speed"].mean())
]["pokemon"].head(6)
2      venusaur
5     charizard
8     blastoise
30    nidoqueen
33     nidoking
54      golduck
Name: pokemon, dtype: object
  1. Identify the Pokémon with a secondary type but no secondary egg group.
df.dropna(subset="egg_group_2").query("type_2 == type_2")["pokemon"]
0           bulbasaur
1             ivysaur
2            venusaur
5           charizard
33           nidoking
            ...      
868      altaria-mega
872      slowbro-mega
889      lopunny-mega
901      raichu-alola
930    salazzle-totem
Name: pokemon, Length: 131, dtype: object
  1. Create a new column for BMI (weight / height**2) and find the Pokémon with the highest BMI.
(
    df[["pokemon", "height", "weight"]]
    .assign(bmi=lambda x: x["weight"] / x["height"] ** 2)
    .sort_values("bmi", ascending=False)
    .head(1)
)
pokemon height weight bmi
789 cosmoem 0.1 999.9 99990.0
  1. For each generation, which Pokémon has the highest total_stats?
# Compute a total_stats column
stats_cols = ["hp", "attack", "defense", "special_attack", "special_defense", "speed"]
df["total_stats"] = df[stats_cols].sum(axis=1)

# Compute max total_stats by generation_id
df["top_stats_by_gen"] = df.groupby("generation_id")["total_stats"].transform("max")

# Filter rows where total_stats is max total_stats by generation_id
df.query("total_stats == top_stats_by_gen")[["pokemon", "generation_id", "total_stats"]]
pokemon generation_id total_stats
149 mewtwo 1.0 680
248 lugia 2.0 680
249 ho-oh 2.0 680
383 rayquaza 3.0 680
492 arceus 4.0 720
642 reshiram 5.0 680
643 zekrom 5.0 680
715 xerneas 6.0 680
716 yveltal 6.0 680
790 solgaleo 7.0 680
791 lunala 7.0 680

Summary

Overall, I found this to be an easy way to get some practice with pandas and explore a new dataset. While answering these questions, I ran some of my solutions through ChatGPT to get feedback on possible issues and improvements. For example, with the last question, ChatGPT suggested using the transform method instead of merge, which led to a simpler solution. And I learned something about Pokémon - Cosmoem has a BMI of almost 10,000!