import pandas as pd
Introduction
The TidyTuesday dataset this week is about Pokémon characters. I uploaded the first few rows of the dataset to ChatGPT and asked it to provide easy, medium, and hard data analysis questions. This post goes through those questions.
First, let’s import the pandas library for performing the data analysis.
Then we download the data.
= "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-01"
base_url = pd.read_csv(f"{base_url}/pokemon_df.csv") df_raw
Let’s remove some columns that are not needed.
= df_raw.drop(columns=["id", "url_icon", "url_image"]) df
It’s useful to look at the first few rows of the dataset to get an idea of the variables and their values. We see that the dataset includes Pokémon names, variables like attack and defense, colors, egg groups, etc.
from IPython.display import HTML
5).to_html(index=False)) HTML(df.head(
pokemon | species_id | height | weight | base_experience | type_1 | type_2 | hp | attack | defense | special_attack | special_defense | speed | color_1 | color_2 | color_f | egg_group_1 | egg_group_2 | generation_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bulbasaur | 1 | 0.7 | 6.9 | 64 | grass | poison | 45 | 49 | 49 | 65 | 65 | 45 | #78C850 | #A040A0 | #81A763 | monster | plant | 1.0 |
ivysaur | 2 | 1.0 | 13.0 | 142 | grass | poison | 60 | 62 | 63 | 80 | 80 | 60 | #78C850 | #A040A0 | #81A763 | monster | plant | 1.0 |
venusaur | 3 | 2.0 | 100.0 | 236 | grass | poison | 80 | 82 | 83 | 100 | 100 | 80 | #78C850 | #A040A0 | #81A763 | monster | plant | 1.0 |
charmander | 4 | 0.6 | 8.5 | 62 | fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | #F08030 | NaN | NaN | monster | dragon | 1.0 |
charmeleon | 5 | 1.1 | 19.0 | 142 | fire | NaN | 58 | 64 | 58 | 80 | 65 | 80 | #F08030 | NaN | NaN | monster | dragon | 1.0 |
On to the questions!
Easy Questions
- What are the column names in the dataset?
df.columns
Index(['pokemon', 'species_id', 'height', 'weight', 'base_experience',
'type_1', 'type_2', 'hp', 'attack', 'defense', 'special_attack',
'special_defense', 'speed', 'color_1', 'color_2', 'color_f',
'egg_group_1', 'egg_group_2', 'generation_id'],
dtype='object')
- What is the average weight of all Pokémon?
"weight"]].mean() df[[
weight 66.213172
dtype: float64
- How many Pokémon have a secondary type (type_2) listed?
"type_2"].notna().sum() df[
np.int64(510)
- What are the top 5 most common egg groups (egg_group_1)?
"egg_group_1").size().sort_values(ascending=False).head(n=5) df.groupby(
egg_group_1
ground 218
no-eggs 118
monster 94
water1 84
bug 80
dtype: int64
- How many Pokémon are of the color #F08030?
"color_1 == '#F08030'").shape[0] df.query(
59
Medium Questions
- What’s the minimum and maximum defense among Grass-type Pokémon?
("type_1 == 'grass'")
df.query(
.assign(=lambda d: d["defense"].min(),
min_defense=lambda d: d["defense"].max(),
max_defense
)"defense == min_defense or defense == max_defense")[["pokemon", "defense"]]
.query( )
pokemon | defense | |
---|---|---|
190 | sunkern | 30 |
597 | ferrothorn | 131 |
797 | kartana | 131 |
- List all Pokémon with hp greater than 200.
"hp > 200").loc[:, "pokemon"].tolist() df.query(
['chansey', 'blissey', 'guzzlord', 'zygarde-complete']
- How many unique combinations of egg_group_1 and egg_group_2 are there?
"egg_group_1", "egg_group_2"]].drop_duplicates().dropna().shape[0] df[[
35
- Which Pokémon has the largest difference between attack and defense?
("pokemon", "attack", "defense"]]
df[[=lambda x: x["attack"] - x["defense"])
.assign(diff_attack_defense"diff_attack_defense", ascending=False)
.sort_values(1)
.head( )
pokemon | attack | defense | diff_attack_defense | |
---|---|---|---|---|
802 | deoxys-attack | 180 | 20 | 160 |
- Which primary type has the lowest average special_defense?
("type_1", "special_defense"]]
df[["type_1")
.groupby(
.mean()
.reset_index()"special_defense")
.sort_values(=1)
.head(n )
type_1 | special_defense | |
---|---|---|
10 | ground | 62.861111 |
Hard Questions
- Find the average base_experience by each combination of type_1 and type_2.
"type_1", "type_2", "base_experience"]].groupby(["type_1", "type_2"]).mean() df[[
base_experience | ||
---|---|---|
type_1 | type_2 | |
bug | electric | 163.800000 |
fairy | 111.500000 | |
fighting | 153.250000 | |
fire | 160.000000 | |
flying | 141.500000 | |
... | ... | ... |
water | ice | 179.000000 |
poison | 111.666667 | |
psychic | 160.333333 | |
rock | 139.500000 | |
steel | 239.000000 |
155 rows × 1 columns
- Which Pokémon have all base stats (hp, attack, defense, special_attack, special_defense, speed) above the average of their respective columns?
df["hp"] > df["hp"].mean())
(df[& (df["attack"] > df["attack"].mean())
& (df["defense"] > df["defense"].mean())
& (df["special_attack"] > df["special_attack"].mean())
& (df["special_defense"] > df["special_defense"].mean())
& (df["speed"] > df["speed"].mean())
"pokemon"].head(6) ][
2 venusaur
5 charizard
8 blastoise
30 nidoqueen
33 nidoking
54 golduck
Name: pokemon, dtype: object
- Identify the Pokémon with a secondary type but no secondary egg group.
="egg_group_2").query("type_2 == type_2")["pokemon"] df.dropna(subset
0 bulbasaur
1 ivysaur
2 venusaur
5 charizard
33 nidoking
...
868 altaria-mega
872 slowbro-mega
889 lopunny-mega
901 raichu-alola
930 salazzle-totem
Name: pokemon, Length: 131, dtype: object
- Create a new column for BMI (weight / height**2) and find the Pokémon with the highest BMI.
("pokemon", "height", "weight"]]
df[[=lambda x: x["weight"] / x["height"] ** 2)
.assign(bmi"bmi", ascending=False)
.sort_values(1)
.head( )
pokemon | height | weight | bmi | |
---|---|---|---|---|
789 | cosmoem | 0.1 | 999.9 | 99990.0 |
- For each generation, which Pokémon has the highest total_stats?
# Compute a total_stats column
= ["hp", "attack", "defense", "special_attack", "special_defense", "speed"]
stats_cols "total_stats"] = df[stats_cols].sum(axis=1)
df[
# Compute max total_stats by generation_id
"top_stats_by_gen"] = df.groupby("generation_id")["total_stats"].transform("max")
df[
# Filter rows where total_stats is max total_stats by generation_id
"total_stats == top_stats_by_gen")[["pokemon", "generation_id", "total_stats"]] df.query(
pokemon | generation_id | total_stats | |
---|---|---|---|
149 | mewtwo | 1.0 | 680 |
248 | lugia | 2.0 | 680 |
249 | ho-oh | 2.0 | 680 |
383 | rayquaza | 3.0 | 680 |
492 | arceus | 4.0 | 720 |
642 | reshiram | 5.0 | 680 |
643 | zekrom | 5.0 | 680 |
715 | xerneas | 6.0 | 680 |
716 | yveltal | 6.0 | 680 |
790 | solgaleo | 7.0 | 680 |
791 | lunala | 7.0 | 680 |
Summary
Overall, I found this to be an easy way to get some practice with pandas and explore a new dataset. While answering these questions, I ran some of my solutions through ChatGPT to get feedback on possible issues and improvements. For example, with the last question, ChatGPT suggested using the transform method instead of merge, which led to a simpler solution. And I learned something about Pokémon - Cosmoem has a BMI of almost 10,000!