Skip to the content.

Overview

I need to choose a school for child and I have a problem there - I don’t know what should I look at in this process. For example there are a lot of school types: ‘Openbaar’, ‘Rooms-Katholiek’, ‘Protestants-Christelijk’, etc. As for a person who never met this system before it’s not very clear for me which one should I choose and which one would fit my child better. On the other hand I want to make a desicion based on objective data such as

As a result of all this thoughts I realized that I have a following issue: I don’t know if I need to choose a type of a school together with other characteristics, or the type can be explained by those characteristics. If the second thought is true, then I can just choose the characteristics that I want for my childs school to have and they will simply define the best type of school. In other words, is it possible to make a classifier which would identify whether a school is Openbaar or Rooms-Katholiek or has other denomination based on other characterrictics we have?

I found and pooled in a dataset with the most important information I want to know about my childs future school. Now I want to know if I need to choose a denomination of a school (and spend time to investigate what each school denomination means in Dutch culture, their comparison and so on)? Or I can just choose other characteristic (which look more clear to me) and they will define a denomination of a school. I will build several classifiers based on different ML alghorithms. Then I will try to choose the best of them and see if the best classifier works good enough to predict a denomination of a primary school. If there is no such classifier, we can conclude the denomination is a purely non-derivative feature and can be ignored while choosing a school.

Data preparation

Let’s start with reading the dataset from the file called ‘Score.csv’:

import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

data_score = pd.read_csv("./make_school_data/main/output/Score.csv", error_bad_lines=False, sep=',', encoding = "ISO-8859-1") 
print(data_score.shape)
print(data_score.columns)
data_score.head() 
(31300, 33)
Index(['SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
       'PLAATSNAAM', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
       'DENOMINATIE_VESTIGING', 'EXAMEN', 'EXAMEN_AANTAL', 'EXAMEN_GEM',
       'REKENEN_LAGER1F', 'REKENEN_1F', 'REKENEN_1S', 'REKENEN_2F',
       'LV_LAGER1F', 'LV_1F', 'LV_2F', 'TV_LAGER1F', 'TV_1F', 'TV_2F', 'VSO',
       'PRO', 'VMBO', 'VMBO_HAVO', 'HAVO', 'HAVO_VWO', 'VWO',
       'ADVIES_NIET_MOGELIJK', 'TOTAAL_ADVIES', 'LJ8', 'ZIT'],
      dtype='object')
SCHOOL_ID DATUM INSTELLINGSNAAM_VESTIGING POSTCODE_VESTIGING PLAATSNAAM GEMEENTENAAM PROVINCIE SOORT_PO DENOMINATIE_VESTIGING EXAMEN ... PRO VMBO VMBO_HAVO HAVO HAVO_VWO VWO ADVIES_NIET_MOGELIJK TOTAAL_ADVIES LJ8 ZIT
0 000AP_2015 2015 De Schanskorf 2715BT ZOETERMEER Zoetermeer Zuid-Holland Bo Gereformeerd vrijgemaakt CET ... 0 1 0 5 1 1 0 11 13 0
1 000AR_2015 2015 BS "De Maasparel" 6109AM OHE EN LAAK Maasgouw Limburg Bo Rooms-Katholiek CET ... 0 1 0 7 0 6 0 18 17 0
2 000AZ_2015 2015 De Kiezel en de Kei 2971AR BLESKENSGRAAF CA Molenwaard Zuid-Holland Bo Openbaar CET ... 1 3 5 3 2 4 0 22 19 3
3 000BA_2015 2015 OBS De Klimboom 6666EB HETEREN Overbetuwe Gelderland Bo Openbaar CET ... 1 1 0 1 4 0 0 12 24 5
4 000BB_2015 2015 Obs Letterwies 9944AR NIEUWOLDA Oldambt Groningen Bo Openbaar CET ... 0 7 0 5 0 2 0 16 15 2

5 rows × 33 columns

There are a lot of variables which define an address of a school in different ways here. We need to choose one of them. I think we can leave GEMEENTENAAM column as a variable which explains location and remove the rest of the the address columns:

columns = ['SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING', 'PLAATSNAAM', 
           'PROVINCIE']
print(data_score.shape)
data_score.drop(columns, 1, inplace = True)
print(data_score.shape)
print(data_score.columns)
data_score.head()
(31300, 33)
(31300, 27)
Index(['GEMEENTENAAM', 'SOORT_PO', 'DENOMINATIE_VESTIGING', 'EXAMEN',
       'EXAMEN_AANTAL', 'EXAMEN_GEM', 'REKENEN_LAGER1F', 'REKENEN_1F',
       'REKENEN_1S', 'REKENEN_2F', 'LV_LAGER1F', 'LV_1F', 'LV_2F',
       'TV_LAGER1F', 'TV_1F', 'TV_2F', 'VSO', 'PRO', 'VMBO', 'VMBO_HAVO',
       'HAVO', 'HAVO_VWO', 'VWO', 'ADVIES_NIET_MOGELIJK', 'TOTAAL_ADVIES',
       'LJ8', 'ZIT'],
      dtype='object')
GEMEENTENAAM SOORT_PO DENOMINATIE_VESTIGING EXAMEN EXAMEN_AANTAL EXAMEN_GEM REKENEN_LAGER1F REKENEN_1F REKENEN_1S REKENEN_2F ... PRO VMBO VMBO_HAVO HAVO HAVO_VWO VWO ADVIES_NIET_MOGELIJK TOTAAL_ADVIES LJ8 ZIT
0 Zoetermeer Bo Gereformeerd vrijgemaakt CET 11 534.3 0 0 0 0 ... 0 1 0 5 1 1 0 11 13 0
1 Maasgouw Bo Rooms-Katholiek CET 18 539.8 0 0 0 0 ... 0 1 0 7 0 6 0 18 17 0
2 Molenwaard Bo Openbaar CET 22 532.9 0 0 0 0 ... 1 3 5 3 2 4 0 22 19 3
3 Overbetuwe Bo Openbaar CET 11 536.2 0 0 0 0 ... 1 1 0 1 4 0 0 12 24 5
4 Oldambt Bo Openbaar CET 16 531.3 0 0 0 0 ... 0 7 0 5 0 2 0 16 15 2

5 rows × 27 columns

Next I would like to explore the outcome:

print(data_score['DENOMINATIE_VESTIGING'].value_counts())
plt.figure(figsize=(15,8))
print(sns.countplot(y='DENOMINATIE_VESTIGING',data=data_score, 
                    order = data_score['DENOMINATIE_VESTIGING'].value_counts().index))
Openbaar                          9951
Rooms-Katholiek                   9588
Protestants-Christelijk           7899
Algemeen bijzonder                1513
Reformatorisch                     806
Gereformeerd vrijgemaakt           534
Antroposofisch                     328
Samenwerking PC, RK                265
Islamitisch                        235
Interconfessioneel                  60
Evangelisch                         41
Hindoeistisch                       32
Samenwerking Opb., RK               15
Samenwerking Opb., PC               15
Joods                               10
Evangelische broedergemeenscha       4
Samenwerking PC, RK, Alg. Bijz       4
Name: DENOMINATIE_VESTIGING, dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)

png

From the plot and data above I see that there are several values of DENOMINATIE_VESTIGING column with relatively small amoumt of data. Let’s remove rows with a number of the denomination values smaller than 0.5% of the total observations amount.

values_to_remove = []
value_counts = data_score['DENOMINATIE_VESTIGING'].value_counts()
threshold = 0.005 * len(data_score['DENOMINATIE_VESTIGING'])
for column in data_score['DENOMINATIE_VESTIGING'].value_counts().index.tolist():
    if value_counts[column] < threshold:
        values_to_remove.append(column)
values_to_remove
['Interconfessioneel',
 'Evangelisch',
 'Hindoeistisch',
 'Samenwerking Opb., RK',
 'Samenwerking Opb., PC',
 'Joods',
 'Evangelische broedergemeenscha',
 'Samenwerking PC, RK, Alg. Bijz']
print(data_score.shape)
data_score = data_score.drop(data_score[data_score['DENOMINATIE_VESTIGING'].isin(values_to_remove)].index, 0)
data_score.shape
(31300, 27)





(31119, 27)

In the dataset there are 3 categorical columns. Before starting the modelling we need to convert all the categorical variables to numerical ones. Let’s use LabelEncoder this time:

from sklearn.preprocessing import LabelEncoder

data_score['SOORT_PO'] = LabelEncoder().fit_transform(data_score['SOORT_PO'])

data_score['GEMEENTENAAM'] = LabelEncoder().fit_transform(data_score['GEMEENTENAAM'])

data_score['DENOMINATIE_VESTIGING'] = LabelEncoder().fit_transform(data_score['DENOMINATIE_VESTIGING'])

data_score['EXAMEN'] = LabelEncoder().fit_transform(data_score['EXAMEN'])
data_score.head()
GEMEENTENAAM SOORT_PO DENOMINATIE_VESTIGING EXAMEN EXAMEN_AANTAL EXAMEN_GEM REKENEN_LAGER1F REKENEN_1F REKENEN_1S REKENEN_2F ... PRO VMBO VMBO_HAVO HAVO HAVO_VWO VWO ADVIES_NIET_MOGELIJK TOTAAL_ADVIES LJ8 ZIT
0 402 0 2 2 11 534.3 0 0 0 0 ... 0 1 0 5 1 1 0 11 13 0
1 210 0 7 2 18 539.8 0 0 0 0 ... 0 1 0 7 0 6 0 18 17 0
2 227 0 4 2 22 532.9 0 0 0 0 ... 1 3 5 3 2 4 0 22 19 3
3 271 0 4 2 11 536.2 0 0 0 0 ... 1 1 0 1 4 0 0 12 24 5
4 253 0 4 2 16 531.3 0 0 0 0 ... 0 7 0 5 0 2 0 16 15 2

5 rows × 27 columns

Building classification models

Ley’s try to build classification models using several different ML alghorithms: Random Forest, SVM, KNN, XGBoosting.

First of all, we need to standardize the data and split it into test and train subsets:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import roc_curve

pred = data_score['DENOMINATIE_VESTIGING']

data_score.drop('DENOMINATIE_VESTIGING', 1, inplace = True)

random_state = 4004

scaler = MinMaxScaler()
data = scaler.fit_transform(data_score)
    
X_train, X_test, y_train, y_test = train_test_split(data, pred, train_size=0.7, test_size=0.3, random_state=random_state)

Random forest

rfc = RandomForestClassifier(n_estimators=100, random_state = random_state)
rfc.fit(X_train, y_train)
pred_test = rfc.predict(X_test)

rfc_score = rfc.score(X_test, y_test)
print("Random Forest's accuracy is: {:.2%}".format(rfc_score))
Random Forest's accuracy is: 40.47%

K Nearest Neighbors

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
preds = knn.predict(X_test)
pred_test = knn.predict(X_test)
knn_score = knn.score(X_test, y_test)
print("KNN's accuracy is: {:.2%}".format(knn_score))
KNN's accuracy is: 34.89%

XGBoost

xgb = XGBClassifier(n_estimators=100)
xgb.fit(X_train, y_train)
pred_test = xgb.predict(X_test)
xgb_score = (pred_test == y_test).sum() / len(pred_test)
print("XGBoost's prediction accuracy is: {:.2%}".format(xgb_score))
XGBoost's prediction accuracy is: 51.86%
results = pd.DataFrame({
    'Model': ['Random Forest', 'KNN', 'XGBoost'],
    'Score': ["{:.2%}".format(rfc_score), "{:.2%}".format(knn_score), "{:.2%}".format(xgb_score)]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Model')
result_df
Score
Model
XGBoost 51.86%
Random Forest 40.47%
KNN 34.89%

We can see that none of the ML algorithm was able to build an apropriate model. The best one shows accuracy of a mere 52% which is far from perfect.

That means classes of different denominations do not differ that much from each other based on the objective criteria we have in the dataset. In other words: denominations are more or less the same if compared with objective numbers.

Therefore I can conclude that while choosing a school for my child I may ignore schools denominations and concentrate on scores, knowledge quality and other metrics which are important and familiar to me.