Since I need to choose the school most fitted for my child, I really interested in having more knowledge about primary education system in The Netherlands and to know more about school differences. I found that Dutch Ministry of education, culture and science opened some data about primary schools. It’s very scattered information placed in different files in not very useful for analysis form, so before go to analysis I want to join info I need together, clean data, deal with missing values and maybe tramsform data to the view which would be easier to read and analyse.
All the data for this part of dataset construction I got here: https://duo.nl/open_onderwijsdata/databestanden/po/leerlingen-po/bo-sbo/bo-sbo-eindscores.jsp
1. Exams scores
First I will join a few .cvs files about exam score from differetn years together:
import pandas as pd
data_score_2014 = pd.read_csv("input_data/Test_score_2014_2015.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_score_2015 = pd.read_csv("input_data/Test_score_2015_2016.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_score_2016 = pd.read_csv("input_data/Test_score_2016_2017.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_score_2017 = pd.read_csv("input_data/Test_score_2017_2018.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_score_2018 = pd.read_csv("input_data/Test_score_2018_2019.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
print(data_score_2014.columns)
data_score_2018.head()
Index(['PEILDATUM_LEERLINGEN', 'PRIKDATUM_SCORES', 'BRIN_NUMMER',
'VESTIGINGSNUMMER', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'Leerjaar_8',
'ONTHEFFING_REDEN_ND', 'CET_AANTAL', 'cet_gem', 'IEP_AANTAL', 'iep_gem',
'ROUTE8_AANTAL', 'route8_gem', 'DREMPEL_AANTAL', 'drempel_gem'],
dtype='object')
PEILDATUM_LEERLINGEN | PRIKDATUM_SCORES | BRIN_NUMMER | VESTIGINGSNUMMER | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | ... | CET_AANTAL | CET_GEM | IEP_AANTAL | IEP_GEM | ROUTE8_AANTAL | ROUTE8_GEM | DIA_AANTAL | DIA_GEM | AMN_AANTAL | AMN_GEM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20190417 | 20190725 | 00AP | 0 | Wereldwijs | 2716PH | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | ... | 11 | 538,636363636364 | 0 | 0 | 0 | 0 | ||||
1 | 20190417 | 20190725 | 00AR | 0 | BS "De Maasparel" | 6107AW | STEVENSWEERT | 1641 | Maasgouw | Limburg | ... | 17 | 537,588235294118 | 0 | 0 | 0 | 0 | ||||
2 | 20190417 | 20190725 | 00AV | 0 | De Morgenster | 3201CN | SPIJKENISSE | 1930 | Nissewaard | Zuid-Holland | ... | 12 | 532,5 | 0 | 0 | 0 | 0 | ||||
3 | 20190417 | 20190725 | 00AZ | 0 | De Stapsteen | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | ... | 0 | 14 | 84,3571428571429 | 0 | 0 | 0 | ||||
4 | 20190417 | 20190725 | 00BA | 0 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | ... | 0 | 16 | 83,625 | 0 | 0 | 0 |
5 rows × 25 columns
I noticed that the data_score_2014 dataframe have some columns named using small letters, but in other dataframes only capital letters are using, let’s change it:
data_score_2014.rename(columns = {'cet_gem': 'CET_GEM', 'iep_gem': 'IEP_GEM', 'route8_gem':'ROUTE8_GEM', 'Leerjaar_8':'LEERJAAR_8'}, inplace = True)
Now let’s merge everything in one dataframe:
data_frames = [data_score_2014, data_score_2015, data_score_2016, data_score_2017, data_score_2018]
data_score = pd.concat(data_frames).fillna(' ')
data_score.reset_index(drop=True, inplace=True)
print(data_score_2014.shape)
print(data_score_2015.shape)
print(data_score_2016.shape)
print(data_score_2017.shape)
print(data_score_2018.shape)
print(data_score.shape)
(6920, 23)
(6851, 21)
(6751, 27)
(6658, 25)
(6579, 25)
(33759, 29)
data_score.head()
PEILDATUM_LEERLINGEN | PRIKDATUM_SCORES | BRIN_NUMMER | VESTIGINGSNUMMER | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | ... | ROUTE8_AANTAL | ROUTE8_GEM | DREMPEL_AANTAL | drempel_gem | DIA_AANTAL | DIA_GEM | CESAN_AANTAL | CESAN_GEM | AMN_AANTAL | AMN_GEM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20141001 | 20150918 | 00AP | 0 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | ... | 0,00 | 0,00 | 0,00 | 0,00 | ||||||
1 | 20141001 | 20150918 | 00AR | 0 | BS "De Maasparel" | 6109AM | OHE EN LAAK | 1641 | Maasgouw | Limburg | ... | 0,00 | 0,00 | 0,00 | 0,00 | ||||||
2 | 20141001 | 20150918 | 00AV | 0 | De Morgenster | 3201CN | SPIJKENISSE | 612 | Spijkenisse | Zuid-Holland | ... | 0,00 | 0,00 | 0,00 | 0,00 | ||||||
3 | 20141001 | 20150918 | 00AZ | 0 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | ... | 0,00 | 0,00 | 0,00 | 0,00 | ||||||
4 | 20141001 | 20150918 | 00BA | 0 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | ... | 0,00 | 0,00 | 0,00 | 0,00 |
5 rows × 29 columns
Now I want to see which type of columns we have in dataset:
print(data_score.dtypes)
PEILDATUM_LEERLINGEN int64
PRIKDATUM_SCORES int64
BRIN_NUMMER object
VESTIGINGSNUMMER int64
INSTELLINGSNAAM_VESTIGING object
POSTCODE_VESTIGING object
PLAATSNAAM object
GEMEENTENUMMER int64
GEMEENTENAAM object
PROVINCIE object
SOORT_PO object
DENOMINATIE_VESTIGING object
BEVOEGD_GEZAG_NUMMER int64
LEERJAAR_8 int64
ONTHEFFING_REDEN_ND object
CET_AANTAL object
CET_GEM object
IEP_AANTAL object
IEP_GEM object
ROUTE8_AANTAL object
ROUTE8_GEM object
DREMPEL_AANTAL object
drempel_gem object
DIA_AANTAL object
DIA_GEM object
CESAN_AANTAL object
CESAN_GEM object
AMN_AANTAL object
AMN_GEM object
dtype: object
There are few columns has type ‘object’, but they should has ‘int’ or ‘float’ type, so I think there are some values which cannot be converted to numeric ones. Let’s check it. I think that such values can be considered as missing values, so let’s change it to 0:
def fill_missing_zero_value(dataframe, column):
dataframe.loc[dataframe[column] == ' ', column] = 0
def convert_to_numeric(dataframe, column):
dataframe[column] = dataframe[column].astype(str).str.replace(',','.')
dataframe[column] = pd.to_numeric(dataframe[column])
def get_numeric_column(dataframe, column):
fill_missing_zero_value(dataframe, column)
convert_to_numeric(dataframe, column)
get_numeric_column(data_score, 'ONTHEFFING_REDEN_ND')
data_score['ONTHEFFING_REDEN_ND'].astype(int)
get_numeric_column(data_score, 'CET_AANTAL')
data_score['CET_AANTAL'] = data_score['CET_AANTAL'].astype(int)
get_numeric_column(data_score, 'CET_GEM')
get_numeric_column(data_score, 'IEP_AANTAL')
data_score['IEP_AANTAL'] = data_score['IEP_AANTAL'].astype(int)
get_numeric_column(data_score, 'IEP_GEM')
get_numeric_column(data_score, 'ROUTE8_AANTAL')
data_score['ROUTE8_AANTAL'] = data_score['ROUTE8_AANTAL'].astype(int)
get_numeric_column(data_score, 'ROUTE8_GEM')
get_numeric_column(data_score, 'DREMPEL_AANTAL')
data_score['DREMPEL_AANTAL'] = data_score['DREMPEL_AANTAL'].astype(int)
get_numeric_column(data_score, 'drempel_gem')
get_numeric_column(data_score, 'DIA_AANTAL')
data_score['DIA_AANTAL'] = data_score['DIA_AANTAL'].astype(int)
get_numeric_column(data_score, 'DIA_GEM')
get_numeric_column(data_score, 'CESAN_AANTAL')
data_score['CESAN_AANTAL'] = data_score['CESAN_AANTAL'].astype(int)
get_numeric_column(data_score, 'CESAN_GEM')
get_numeric_column(data_score, 'AMN_AANTAL')
data_score['AMN_AANTAL'] = data_score['AMN_AANTAL'].astype(int)
get_numeric_column(data_score, 'AMN_GEM')
print(data_score.dtypes)
PEILDATUM_LEERLINGEN int64
PRIKDATUM_SCORES int64
BRIN_NUMMER object
VESTIGINGSNUMMER int64
INSTELLINGSNAAM_VESTIGING object
POSTCODE_VESTIGING object
PLAATSNAAM object
GEMEENTENUMMER int64
GEMEENTENAAM object
PROVINCIE object
SOORT_PO object
DENOMINATIE_VESTIGING object
BEVOEGD_GEZAG_NUMMER int64
LEERJAAR_8 int64
ONTHEFFING_REDEN_ND float64
CET_AANTAL int64
CET_GEM float64
IEP_AANTAL int64
IEP_GEM float64
ROUTE8_AANTAL int64
ROUTE8_GEM float64
DREMPEL_AANTAL int64
drempel_gem float64
DIA_AANTAL int64
DIA_GEM float64
CESAN_AANTAL int64
CESAN_GEM float64
AMN_AANTAL int64
AMN_GEM float64
dtype: object
Let’s check do we have null or NaN values:
print(data_score.isnull().sum())
print(data_score.isna().sum())
PEILDATUM_LEERLINGEN 0
PRIKDATUM_SCORES 0
BRIN_NUMMER 0
VESTIGINGSNUMMER 0
INSTELLINGSNAAM_VESTIGING 0
POSTCODE_VESTIGING 0
PLAATSNAAM 0
GEMEENTENUMMER 0
GEMEENTENAAM 0
PROVINCIE 0
SOORT_PO 0
DENOMINATIE_VESTIGING 0
BEVOEGD_GEZAG_NUMMER 0
LEERJAAR_8 0
ONTHEFFING_REDEN_ND 0
CET_AANTAL 0
CET_GEM 0
IEP_AANTAL 0
IEP_GEM 0
ROUTE8_AANTAL 0
ROUTE8_GEM 0
DREMPEL_AANTAL 0
drempel_gem 0
DIA_AANTAL 0
DIA_GEM 0
CESAN_AANTAL 0
CESAN_GEM 0
AMN_AANTAL 0
AMN_GEM 0
dtype: int64
PEILDATUM_LEERLINGEN 0
PRIKDATUM_SCORES 0
BRIN_NUMMER 0
VESTIGINGSNUMMER 0
INSTELLINGSNAAM_VESTIGING 0
POSTCODE_VESTIGING 0
PLAATSNAAM 0
GEMEENTENUMMER 0
GEMEENTENAAM 0
PROVINCIE 0
SOORT_PO 0
DENOMINATIE_VESTIGING 0
BEVOEGD_GEZAG_NUMMER 0
LEERJAAR_8 0
ONTHEFFING_REDEN_ND 0
CET_AANTAL 0
CET_GEM 0
IEP_AANTAL 0
IEP_GEM 0
ROUTE8_AANTAL 0
ROUTE8_GEM 0
DREMPEL_AANTAL 0
drempel_gem 0
DIA_AANTAL 0
DIA_GEM 0
CESAN_AANTAL 0
CESAN_GEM 0
AMN_AANTAL 0
AMN_GEM 0
dtype: int64
It looks like we don’t have NaN or null values.
Now I would like to change columns with exam’s date info to columns with just a year of the exam:
data_score['PRIKDATUM_SCORES'] = data_score['PRIKDATUM_SCORES'].map({
20150918 : 2015,
20160722 : 2016,
20170722 : 2017,
20180725 : 2018,
20190725 : 2019,
})
data_score.rename(columns = {'PRIKDATUM_SCORES': 'DATUM'}, inplace = True)
print(data_score['DATUM'].value_counts())
2015 6920
2016 6851
2017 6751
2018 6658
2019 6579
Name: DATUM, dtype: int64
I know that most common final test in the primary schools in The Netherlands is CET and suppose it shoul be most common schools choise in terms of types of final exam. I want to explore what information we have about exams students are taking, but first let’s explore do we have the rows without any info about exams:
schools_without_exams = data_score[(data_score['CET_AANTAL'] == 0) &
(data_score['DREMPEL_AANTAL'] == 0) & (data_score['AMN_AANTAL'] == 0)
& (data_score['CESAN_AANTAL'] == 0) & (data_score['DIA_AANTAL'] == 0)
& (data_score['ROUTE8_AANTAL'] == 0) & (data_score['IEP_AANTAL'] == 0)]
schools_at_least_1_exam = data_score[(data_score['CET_AANTAL'] > 0) |
(data_score['DREMPEL_AANTAL'] > 0) | (data_score['AMN_AANTAL'] > 0)
| (data_score['CESAN_AANTAL'] > 0) | (data_score['DIA_AANTAL'] > 0)
| (data_score['ROUTE8_AANTAL'] > 0) | (data_score['IEP_AANTAL'] > 0)]
print(data_score.shape)
print(schools_at_least_1_exam.shape)
print(schools_without_exams.shape)
(33759, 29)
(32253, 29)
(1506, 29)
And remove this rows (without any exams):
data_score.drop(schools_without_exams.index, 0, inplace = True)
print(data_score.shape)
(32253, 29)
Next let’s take a look do we have wrong exams data (for example, we have amount of pupils who took the exam, but score is 0 or vice versa). And if we have it, we need to remove it.
def remove_wrong_data(subset):
if subset.shape[0] > 0:
data_score.drop(subset.index, 0, inplace = True)
wrong_cet = data_score[((data_score['CET_AANTAL'] == 0) & (data_score['CET_GEM'] > 0)) |
((data_score['CET_AANTAL'] > 0) & (data_score['CET_GEM'] == 0))]
print(wrong_cet.shape)
remove_wrong_data(wrong_cet)
wrong_drempel = data_score[((data_score['DREMPEL_AANTAL'] == 0) & (data_score['drempel_gem'] > 0)) |
((data_score['DREMPEL_AANTAL'] > 0) & (data_score['drempel_gem'] == 0))]
print(wrong_drempel.shape)
remove_wrong_data(wrong_drempel)
wrong_AMN = data_score[((data_score['AMN_AANTAL'] == 0) & (data_score['AMN_GEM'] > 0)) |
((data_score['AMN_AANTAL'] > 0) & (data_score['AMN_GEM'] == 0))]
print(wrong_AMN.shape)
remove_wrong_data(wrong_AMN)
wrong_CESAN = data_score[((data_score['CESAN_AANTAL'] == 0) & (data_score['CESAN_GEM'] > 0)) |
((data_score['CESAN_AANTAL'] > 0) & (data_score['CESAN_GEM'] == 0))]
print(wrong_CESAN.shape)
remove_wrong_data(wrong_CESAN)
wrong_DIA = data_score[((data_score['DIA_AANTAL'] == 0) & (data_score['DIA_GEM'] > 0)) |
((data_score['DIA_AANTAL'] > 0) & (data_score['DIA_GEM'] == 0))]
print(wrong_DIA.shape)
remove_wrong_data(wrong_DIA)
wrong_ROUTE8 = data_score[((data_score['ROUTE8_AANTAL'] == 0) & (data_score['ROUTE8_GEM'] > 0)) |
((data_score['ROUTE8_AANTAL'] > 0) & (data_score['ROUTE8_GEM'] == 0))]
print(wrong_ROUTE8.shape)
remove_wrong_data(wrong_ROUTE8)
wrong_IEP = data_score[((data_score['IEP_AANTAL'] == 0) & (data_score['IEP_GEM'] > 0)) |
((data_score['IEP_AANTAL'] > 0) & (data_score['IEP_GEM'] == 0))]
print(wrong_IEP.shape)
remove_wrong_data(wrong_IEP)
print(data_score.shape)
(537, 29)
(68, 29)
(10, 29)
(0, 29)
(12, 29)
(115, 29)
(208, 29)
(31303, 29)
Now I want to see how many schools choose different type of exams and is there a lot of schools which choose more then two kind of test:
schools_more_then_2_exam = data_score.copy()
def count_exams(row):
count = 1 if row.CET_AANTAL > 0 else 0
count = count + 1 if row.DREMPEL_AANTAL > 0 else count
count = count + 1 if row.AMN_AANTAL > 0 else count
count = count + 1 if row.CESAN_AANTAL > 0 else count
count = count + 1 if row.DIA_AANTAL > 0 else count
count = count + 1 if row.ROUTE8_AANTAL > 0 else count
count = count + 1 if row.IEP_AANTAL > 0 else count
return count
schools_more_then_2_exam['EXAMEN_AANTAL'] = schools_more_then_2_exam.apply(lambda row: count_exams(row), axis = 1)
schools_more_then_2_exam = schools_more_then_2_exam[schools_more_then_2_exam['EXAMEN_AANTAL'] > 2]
print(schools_more_then_2_exam.shape)
schools_more_then_2_exam.head()
(3, 30)
PEILDATUM_LEERLINGEN | DATUM | BRIN_NUMMER | VESTIGINGSNUMMER | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | ... | ROUTE8_GEM | DREMPEL_AANTAL | drempel_gem | DIA_AANTAL | DIA_GEM | CESAN_AANTAL | CESAN_GEM | AMN_AANTAL | AMN_GEM | EXAMEN_AANTAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18551 | 20170415 | 2017 | 15YU | 0 | OBS Holy | 3137WD | VLAARDINGEN | 622 | Vlaardingen | Zuid-Holland | ... | 184.600000 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 3 |
25227 | 20180417 | 2018 | 15YU | 0 | OBS Holy | 3137WD | VLAARDINGEN | 622 | Vlaardingen | Zuid-Holland | ... | 208.166667 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 3 |
31810 | 20190417 | 2019 | 15YU | 0 | OBS Holy | 3137WD | VLAARDINGEN | 622 | Vlaardingen | Zuid-Holland | ... | 191.200000 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 3 |
3 rows × 30 columns
There are few rows with more then 2 type of test chosen. Let’s just remove it, then it will be possible to make a column of chosen exams in each school and remove all unnecessary columns:
#remove rows with more then 2 choosen type of tests
print(data_score.shape)
data_score.drop(schools_more_then_2_exam.index, 0, inplace = True)
print(data_score.shape)
(31303, 29)
(31300, 29)
# make columns for the test with biggest amount of pupils who took it
def most_common_exam(row):
exam_type, amount, score = '', 0, 0
if row.CET_AANTAL > 0:
exam_type,amount,score ='CET', row.CET_AANTAL, row.CET_GEM
if row.DREMPEL_AANTAL > amount:
exam_type,amount,score = 'DREMPEL', row.DREMPEL_AANTAL, row.drempel_gem
if row.AMN_AANTAL > amount:
exam_type,amount,score = 'AMN', row.AMN_AANTAL, row.AMN_GEM
if row.CESAN_AANTAL > amount:
exam_type,amount,score = 'CESAN', row.CESAN_AANTAL, row.CESAN_GEM
if row.DIA_AANTAL > amount:
exam_type,amount,score = 'DIA', row.DIA_AANTAL, row.DIA_GEM
if row.ROUTE8_AANTAL > amount:
exam_type,amount,score = 'ROUTE8', row.ROUTE8_AANTAL, row.ROUTE8_GEM
if row.IEP_AANTAL > amount:
exam_type,amount,score = 'IEP', row.IEP_AANTAL, row.IEP_GEM
return exam_type,amount,score
data_score['EXAMEN_1'], data_score['EXAMEN_1_AANTAL'], data_score['EXAMEN_1_GEM'] = zip(*data_score.apply(lambda row: most_common_exam(row), axis = 1))
data_score.head()
PEILDATUM_LEERLINGEN | DATUM | BRIN_NUMMER | VESTIGINGSNUMMER | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | ... | drempel_gem | DIA_AANTAL | DIA_GEM | CESAN_AANTAL | CESAN_GEM | AMN_AANTAL | AMN_GEM | EXAMEN_1 | EXAMEN_1_AANTAL | EXAMEN_1_GEM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20141001 | 2015 | 00AP | 0 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | CET | 11 | 534.3 |
1 | 20141001 | 2015 | 00AR | 0 | BS "De Maasparel" | 6109AM | OHE EN LAAK | 1641 | Maasgouw | Limburg | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | CET | 18 | 539.8 |
3 | 20141001 | 2015 | 00AZ | 0 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | CET | 22 | 532.9 |
4 | 20141001 | 2015 | 00BA | 0 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | CET | 11 | 536.2 |
5 | 20141001 | 2015 | 00BB | 0 | Obs Letterwies | 9944AR | NIEUWOLDA | 1895 | Oldambt | Groningen | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | CET | 16 | 531.3 |
5 rows × 32 columns
# make columns for the second chosen test
def get_second_exam(row):
if (row.CET_AANTAL > 0) & (row.EXAMEN_1 != 'CET'):
return 'CET', row.CET_AANTAL, row.CET_GEM
if (row.DREMPEL_AANTAL > 0) & (row.EXAMEN_1 != 'DREMPEL'):
return 'DREMPEL', row.DREMPEL_AANTAL, row.drempel_gem
if (row.AMN_AANTAL > 0) & (row.EXAMEN_1 != 'AMN'):
return 'AMN', row.AMN_AANTAL, row.AMN_GEM
if (row.CESAN_AANTAL > 0) & (row.EXAMEN_1 != 'CESAN'):
return 'CESAN', row.CESAN_AANTAL, row.CESAN_GEM
if (row.DIA_AANTAL > 0) & (row.EXAMEN_1 != 'DIA'):
return 'DIA', row.DIA_AANTAL, row.DIA_GEM
if (row.ROUTE8_AANTAL > 0) & (row.EXAMEN_1 != 'ROUTE8'):
return 'ROUTE8', row.ROUTE8_AANTAL, row.ROUTE8_GEM
if (row.IEP_AANTAL > 0) & (row.EXAMEN_1 != 'IEP'):
return 'IEP', row.IEP_AANTAL, row.IEP_GEM
return '', 0, 0
data_score['EXAMEN_2'], data_score['EXAMEN_2_AANTAL'], data_score['EXAMEN_2_GEM'] = zip(*data_score.apply(lambda row: get_second_exam(row), axis = 1))
data_score['EXAMEN_2'].value_counts()
31151
CET 55
IEP 55
ROUTE8 18
DREMPEL 11
DIA 6
AMN 4
Name: EXAMEN_2, dtype: int64
We can see that most of the schools in dataset (31151) has one chosen type of exam, so we can remove new columns EXAMEN_2_… because it has relatively small information.
Next we also need to make a new field which values would be unique for each school (and which we would use later to merge different dataframes). I think if we pool VESTIGINGSNUMMER, BRIN_NUMMER and DATUM columns, we would have unique value for each row:
new_column = data_score["VESTIGINGSNUMMER"].astype(str) + data_score["BRIN_NUMMER"] + "_" + data_score["DATUM"].astype(str)
data_score.insert (0, "SCHOOL_ID", new_column)
print(data_score.shape)
print(data_score.columns)
(31300, 36)
Index(['SCHOOL_ID', 'PEILDATUM_LEERLINGEN', 'DATUM', 'BRIN_NUMMER',
'VESTIGINGSNUMMER', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'LEERJAAR_8',
'ONTHEFFING_REDEN_ND', 'CET_AANTAL', 'CET_GEM', 'IEP_AANTAL', 'IEP_GEM',
'ROUTE8_AANTAL', 'ROUTE8_GEM', 'DREMPEL_AANTAL', 'drempel_gem',
'DIA_AANTAL', 'DIA_GEM', 'CESAN_AANTAL', 'CESAN_GEM', 'AMN_AANTAL',
'AMN_GEM', 'EXAMEN_1', 'EXAMEN_1_AANTAL', 'EXAMEN_1_GEM', 'EXAMEN_2',
'EXAMEN_2_AANTAL', 'EXAMEN_2_GEM'],
dtype='object')
And last (but not least) what we should do - remove all unnecessary columns:
data_score.drop('PEILDATUM_LEERLINGEN', 1, inplace = True)
data_score.drop('BRIN_NUMMER', 1, inplace = True)
data_score.drop('VESTIGINGSNUMMER', 1, inplace = True)
data_score.drop('LEERJAAR_8', 1, inplace = True)
data_score.drop('ONTHEFFING_REDEN_ND', 1, inplace = True)
data_score.drop('CET_AANTAL', 1, inplace = True)
data_score.drop('CET_GEM', 1, inplace = True)
data_score.drop('IEP_AANTAL', 1, inplace = True)
data_score.drop('IEP_GEM', 1, inplace = True)
data_score.drop('ROUTE8_AANTAL', 1, inplace = True)
data_score.drop('ROUTE8_GEM', 1, inplace = True)
data_score.drop('DREMPEL_AANTAL', 1, inplace = True)
data_score.drop('drempel_gem', 1, inplace = True)
data_score.drop('DIA_AANTAL', 1, inplace = True)
data_score.drop('DIA_GEM', 1, inplace = True)
data_score.drop('CESAN_AANTAL', 1, inplace = True)
data_score.drop('CESAN_GEM', 1, inplace = True)
data_score.drop('AMN_AANTAL', 1, inplace = True)
data_score.drop('AMN_GEM', 1, inplace = True)
data_score.drop('EXAMEN_2', 1, inplace = True)
data_score.drop('EXAMEN_2_AANTAL', 1, inplace = True)
data_score.drop('EXAMEN_2_GEM', 1, inplace = True)
data_score.rename(columns = {'EXAMEN_1': 'EXAMEN', 'EXAMEN_1_AANTAL': 'EXAMEN_AANTAL', 'EXAMEN_1_GEM':'EXAMEN_GEM', 'Leerjaar_8':'LEERJAAR_8'}, inplace = True)
print(data_score.shape)
data_score.head()
(31300, 14)
SCHOOL_ID | DATUM | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | SOORT_PO | DENOMINATIE_VESTIGING | BEVOEGD_GEZAG_NUMMER | EXAMEN | EXAMEN_AANTAL | EXAMEN_GEM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000AP_2015 | 2015 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | Bo | Gereformeerd vrijgemaakt | 41646 | CET | 11 | 534.3 |
1 | 000AR_2015 | 2015 | BS "De Maasparel" | 6109AM | OHE EN LAAK | 1641 | Maasgouw | Limburg | Bo | Rooms-Katholiek | 41797 | CET | 18 | 539.8 |
3 | 000AZ_2015 | 2015 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | Bo | Openbaar | 41638 | CET | 22 | 532.9 |
4 | 000BA_2015 | 2015 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | Bo | Openbaar | 41530 | CET | 11 | 536.2 |
5 | 000BB_2015 | 2015 | Obs Letterwies | 9944AR | NIEUWOLDA | 1895 | Oldambt | Groningen | Bo | Openbaar | 41613 | CET | 16 | 531.3 |
First part of our dataset constraction has done, so le’t write results down to temporary .csv file:
data_score.to_csv(r'score_first_result.csv', index = False)