Image by Christina Brinza on Unsplash.

Classifying Iris Species with K-Means Clustering

Implementing the K-Means Clustering Algorithm on the Iris Dataset

If you’re anything like me, you may have spent the past couple of months diving into neural networks, only to resurface from the depths of deep learning and realize you’ve been neglecting classical machine learning.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.cluster import KMeans

df = pd.read_csv("../input/iris-flower-dataset/IRIS.csv")

# First look at the dataset table
df.head(5)

# Shape of the data (number of samples and columns)
df.shape

# Summary statistics such as mean, min, and max for each feature
df.describe()

# Count null values in each column
df.isna().sum()
# Plot distributions against features

colors = {"Iris-virginica":"purple", "Iris-setosa": "blue", "Iris-versicolor":"green"}

def showDistributions(feature1):
    # Scatter one feature against species, colored by species
    plt.figure(figsize=(30,30))
    plt.subplot(6,6,1)
    plt.scatter(df[feature1], df['species'], c=df['species'].map(colors))
    plt.title("{} distribution".format(feature1))
    plt.xlabel(feature1)
    plt.ylabel('species')
    plt.show()

showDistributions('sepal_length')
showDistributions('petal_length')
showDistributions('sepal_width')
showDistributions('petal_width')

What is the K-Means Clustering Algorithm?

K-Means clustering is an unsupervised classical machine learning algorithm that groups data points into k clusters based on the similarity of their features. It repeatedly assigns each point to its nearest cluster center (centroid) and then moves each centroid to the mean of the points assigned to it, until the assignments stop changing. Because it is unsupervised, the model never sees the species labels. It can only reveal structure in the features, which we can then compare with the true species.
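To make the assign-then-update loop concrete, here is a minimal NumPy sketch of the basic iteration (often called Lloyd's algorithm). It is only an illustration: the function name `kmeans_sketch`, the toy points, and the plain random initialization are my own assumptions, and scikit-learn's `KMeans` does more than this (k-means++ initialization and multiple restarts, for example).

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Hypothetical helper: a bare-bones K-Means loop for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random (not k-means++).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs (made-up values, not the Iris data).
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The first three points should end up in one cluster and the last three in the other. Which cluster gets id 0 and which gets id 1 depends on the random start, which is worth keeping in mind when comparing cluster ids with real labels.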

# Separate the features (x) from the species labels (y); K-Means only sees x
x = df.drop(['species'], axis=1)
y = df['species']

kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 500, n_init = 10, random_state = 0)
# fit_predict returns the cluster index (0, 1, or 2) assigned to each sample
model = kmeans.fit_predict(x)

# Looking at the centroid values generated
kmeans.cluster_centers_
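# Encode species as integers so they can sit next to the cluster ids.
# Cluster ids are arbitrary, so this mapping assumes they line up with the species as shown for this run.
species = {"Iris-versicolor": 0, "Iris-setosa": 1, "Iris-virginica": 2}

irisdf = df.copy()

irisdf["species"] = irisdf["species"].map(species)
irisdf["predicted"] = model
irisdf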
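Because cluster ids are arbitrary, a hand-written species-to-id mapping can silently mismatch if the ids come out in a different order. One way to check the clustering without relying on that order is sketched below. It is an illustration under stated assumptions rather than part of the original notebook: it uses scikit-learn's bundled copy of Iris via `load_iris` as a stand-in for `IRIS.csv` (so the species names lack the `Iris-` prefix), and the variable names are my own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# Assumption: scikit-learn's bundled Iris data stands in for IRIS.csv here.
iris = load_iris()
true_names = iris.target_names[iris.target]

clusters = KMeans(n_clusters=3, init="k-means++", max_iter=500,
                  n_init=10, random_state=0).fit_predict(iris.data)

# Contingency table: rows are true species, columns are cluster ids.
table = pd.crosstab(pd.Series(true_names, name="species"),
                    pd.Series(clusters, name="cluster"))
print(table)

# Label each cluster with the species that dominates it,
# instead of hard-coding which id belongs to which species.
dominant = table.idxmax(axis=0)
predicted_names = np.array([dominant[c] for c in clusters])
agreement = np.mean(predicted_names == true_names)

# The adjusted Rand index compares the two groupings directly and
# does not care how the cluster ids are numbered.
ari = adjusted_rand_score(iris.target, clusters)

print("Agreement with species labels:", agreement)
print("Adjusted Rand index:", ari)
```

The contingency table also shows where the clusters and species disagree, which a single agreement number hides.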

Closing Notes

This is a pretty standard dataset. With only four features, none of which is especially distinct from the others, there aren’t many patterns or insights to extract. It was nice to reacquaint myself with K-Means clustering, though! I’m looking forward to implementing more classical machine learning algorithms on other datasets.
