How You Can Use Julia for Machine Learning

Julia is a high-level, dynamic, general-purpose programming language. While it can be used for many purposes, it is particularly well suited to complex numerical data analysis. This makes Julia a strong choice for implementing machine learning models. In this article, we will demonstrate how Julia can be used for machine learning.

First launched in 2012, Julia has become one of the top emerging open source programming languages to learn in 2021 and beyond. This is because Julia, by design, is a language that is ideal for complex analysis of numerical data. With the rise of data science (DS) and cloud computing, and the abundance of Big Data, Julia is becoming more and more relevant and necessary for the ML/DS community. In this article, our focus will be to demonstrate how you can get started with machine learning (ML) using Julia. Julia is astonishingly similar to Python when it comes to the syntax of the code. This makes the language quick to grasp and easy to implement for most developers.

Instead of taking you through the rather self-explanatory installation process or the basics of the language, I am going to jump-start this article by taking you directly to the good stuff. We will learn how to use Julia for machine learning by implementing a linear regression model to predict the house prices in Boston, Massachusetts, USA.

Why use Julia for machine learning?
Before jumping into the implementation part, let me list out a few reasons why you should consider using Julia for machine learning. After all, in a world where there is Python and R, why would you want to add another language to perform the same tasks? Here’s why:

  • It is open source and free to use under the MIT licence.
  • Julia is typically faster than Python and R on numerical code. Yes, you read that right. Julia compiles to native machine code and was designed from the outset for fast execution of complex mathematical computations.
  • Julia supports concurrent, parallel and distributed computing.
  • Julia can call C or Fortran code directly, without additional glue code.
  • Julia uses a just-in-time (JIT) compiler built on LLVM, sometimes described as ‘just-ahead-of-time’ (JAOT): a function is compiled to machine code the first time it is called with a given combination of argument types, just before it executes.
  • Julia has some of the most efficient libraries for floating-point calculations and linear algebra (i.e., calculations involving matrices), which are essential for machine learning.
  • Julia is supported by popular IDEs such as Visual Studio Code and execution environments such as Jupyter Notebook.
  • It is super easy to install and get started.
  • Julia has a vibrant online community that is active and increasing by the day.
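
To make the point about calling C concrete, here is a minimal sketch using Julia's built-in ccall. It assumes a platform where the C standard library's strlen is visible in the running process (typically Linux and macOS); the choice of strlen is purely illustrative.

```julia
# Call the C standard library's strlen directly from Julia, with no wrapper code.
# Assumption: strlen is resolvable in the current process (typical on Linux/macOS).
n = ccall(:strlen, Csize_t, (Cstring,), "Julia")

println("strlen returned ", Int(n))
```

The arguments to ccall are the function name, the return type, a tuple of argument types and then the actual arguments.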

Pre-requisites
Now that we have a pretty good grasp on what Julia is and what makes it ideal for machine learning, it’s time to get our hands dirty. In this section of the article, we are going to implement a simple linear regression model to help us predict the prices of houses in Boston, Massachusetts, USA.

There are a few pre-requisites that need to be taken care of before getting started with the implementation:

  • Visit https://julialang.org/ and install the language on your OS. The procedure is straightforward and Julia is available on all major platforms.
  • Visit https://jupyter.org/ and install Jupyter Notebook on your OS. Again, the procedure is easy to follow, so we will not go into the details.
  • You will also need the standard Boston house prices data set used to demonstrate linear regression. Visit https://www.kaggle.com/ to download it.

Now that we have taken care of the necessary pre-requisites for this demonstration, let’s get to the code.

Implementation
Developers familiar with Python are going to find many similarities in the syntax, structure and method of implementation in Julia. We will use a Jupyter Notebook for the compiling and execution of our code. So open up a fresh Jupyter Notebook in the desired folder and make sure to save your data set in the same location.

1. To start with, we need to load the libraries used in this example. We will deal with data from a CSV file, which requires the CSV and DataFrames libraries. Additionally, we will perform some statistical calculations. Last but not least, we will require a generalised linear model (GLM) for the implementation.

using DataFrames, CSV
using Plots
using GLM
using Statistics
using StatsPlots

2. Using the CSV and DataFrames libraries that we imported, we load the data from the data set.

# Read the file using CSV.File and convert it to a DataFrame
df = DataFrame(CSV.File("Boston.csv"))
first(df, 5) # display the first 5 rows to get an overview of the data set

Figure 1 shows the first five rows of data in the form of a DataFrame. This helps us confirm that the data has loaded properly.

Figure 1: The first five rows of the data set

3. We can now explore this data set to find out its size, i.e., the number of rows and columns. We can also use the describe method to draw up some statistical data.

# Number of rows and columns in the data set
size(df)

# Summary statistics (mean, min, median, max, etc.) for each column
describe(df)

Figure 2 shows the statistical description of the data, i.e., mean, median, min, max, etc.

Figure 2: Statistical description of the data

4. An important step before implementing the model is to divide the data set into the features and the target variable. We will store the target variable (house prices) in y and the features in X.

y = df[:, :medv]; # target values (house prices)
X = select(df, Not(:medv)); # features; select (without !) leaves df itself unchanged

5. In order to both train and test the model, we will need to divide the data set into training data and testing data. We will use 80 per cent of the data for training and the remaining 20 per cent for testing.

design_matrix = Matrix(X) # convert the feature DataFrame to a matrix
train_size = 0.80 # specifying the split ratio
num_samples = size(design_matrix, 1) # total number of samples in the data set

train_index = trunc(Int, train_size * num_samples) # index of the last training row (80% of the samples)

# Split using the desired train size
X_train = design_matrix[1:train_index, :]
X_test = design_matrix[train_index+1:end, :]

y_train = y[1:train_index]
y_test = y[train_index+1:end]

print("Dataset split into train and test")
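
The split above takes the first 80 per cent of the rows in file order. If the rows of a data set happen to be ordered in some way, a shuffled split may be preferable. The following is a minimal sketch of that variation, not part of the original walkthrough; X_demo, y_demo and the seed 42 are hypothetical stand-ins so that the snippet runs on its own.

```julia
using Random

# Hypothetical stand-ins for design_matrix and y (10 samples, 2 features)
X_demo = reshape(collect(1.0:20.0), 10, 2)
y_demo = collect(1.0:10.0)

rng = MersenneTwister(42)                  # fixed seed (arbitrary choice) for reproducibility
idx = shuffle(rng, 1:size(X_demo, 1))      # random permutation of the row indices

n_train = trunc(Int, 0.80 * length(idx))   # same 80/20 ratio as above
train_idx = idx[1:n_train]
test_idx = idx[n_train+1:end]

# Index features and targets with the same row indices so they stay aligned
X_train_demo, y_train_demo = X_demo[train_idx, :], y_demo[train_idx]
X_test_demo, y_test_demo = X_demo[test_idx, :], y_demo[test_idx]

println(size(X_train_demo), " ", size(X_test_demo))
```

Every row lands in exactly one of the two sets, and the same indices are used for the features and the targets.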

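6. As a necessary pre-processing step, we will standardise the features: the mean and standard deviation of each feature are computed on the training data and then used to scale both the training and the testing data.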

# Define a function to scale features
function scale_features(X)

    μ = mean(X, dims=1)
    σ = std(X, dims=1)

    X_norm = (X .- μ) ./ σ
    return (X_norm, μ, σ)

end

# Define a function to transform features using a given μ and σ
function transform_features(X, μ, σ)
    X_norm = (X .- μ) ./ σ
    return X_norm
end

# Scale training features
X_train_scaled, μ, σ = scale_features(X_train)

# Transform the testing features
X_test_scaled = transform_features(X_test, μ, σ)

print("Training and testing data are now transformed")
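
As a quick sanity check of this kind of scaling, the sketch below applies the same arithmetic to a tiny made-up matrix. The names X_tr, X_te, μ_tr and σ_tr and the numbers are hypothetical, chosen only so the result is easy to verify by eye.

```julia
using Statistics

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features each
X_tr = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0]
X_te = [2.5 25.0; 5.0 50.0]

μ_tr = mean(X_tr, dims=1)   # 1×2 row of column means
σ_tr = std(X_tr, dims=1)    # 1×2 row of column standard deviations

X_tr_scaled = (X_tr .- μ_tr) ./ σ_tr
X_te_scaled = (X_te .- μ_tr) ./ σ_tr   # reuse the training statistics

println(mean(X_tr_scaled, dims=1))   # approximately zero in each column
println(std(X_tr_scaled, dims=1))    # approximately one in each column
println(X_te_scaled[1, :])           # this test row equals the training means, so approximately zero
```

Reusing the training mean and standard deviation for the test rows keeps information about the test set out of the pre-processing step.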

7. We will also need to define a cost function for determining the mean squared error.

# Compute the cost (half the mean squared error) for the given theta
function compute_cost(X, y, theta)
    m = size(X, 1) # number of samples

    preds = X * theta # calculate predictions using theta
    loss = preds - y # calculate error

    # Half mean squared loss
    cost = (1/(2m)) * (loss' * loss)
    return cost

end
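
For reference, the quantity computed by compute_cost can be written in matrix form, together with its gradient with respect to theta. This is the standard least-squares result rather than something stated in the original text; here m is the number of samples, X the feature matrix (including the bias column when one is used), y the target vector and α a step size.

```latex
J(\theta) = \frac{1}{2m}\,(X\theta - y)^{\top}(X\theta - y)

\nabla_{\theta} J(\theta) = \frac{1}{m}\,X^{\top}(X\theta - y)

\theta \leftarrow \theta - \alpha\,\nabla_{\theta} J(\theta)
       = \theta - \frac{\alpha}{m}\,X^{\top}(X\theta - y)
```

The factor of 1/2 in the cost cancels the 2 that appears when differentiating the square, which is why the gradient carries 1/m rather than 2/m.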

8. The cost function requires the theta value for making the predictions from which the error can be calculated. In order to update the theta value, we require the gradient descent function.

# Gradient descent function to update the theta values
function gradient_descent(X, y, alpha, fit_intercept=true, n_iter=2000)

    m = length(y) # number of training examples

    if fit_intercept
        # Add a bias column of ones
        b = ones(m, 1)
        X = hcat(b, X)
    end

    # Initialise theta
    theta = zeros(size(X, 2))

    # Initialise the cost vector
    cost = zeros(n_iter)

    # Loop over the number of iterations
    for iter in 1:n_iter
        pred = X * theta # predictions

        # Calculate the cost for each iteration
        cost[iter] = compute_cost(X, y, theta)

        # Update theta at each iteration
        theta = theta - ((alpha / m) * X') * (pred - y)
    end
    return (theta, cost)
end
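
One way to gain confidence in a gradient descent routine like this is to compare its result with the direct least-squares solution on synthetic data. The sketch below does that with a compact re-implementation under the hypothetical name gd_sketch; the data, seed, step size and iteration count are all assumptions made for illustration.

```julia
using LinearAlgebra, Random

# Compact gradient descent for least squares (hypothetical helper, same update rule as above)
function gd_sketch(X, y, alpha, n_iter)
    m = length(y)
    theta = zeros(size(X, 2))
    for _ in 1:n_iter
        theta = theta - ((alpha / m) * X') * (X * theta - y)
    end
    return theta
end

rng = MersenneTwister(0)                     # fixed seed (arbitrary choice)
m = 200
X = hcat(ones(m), randn(rng, m, 3))          # bias column plus 3 roughly standardised features
true_theta = [1.0, 2.0, -3.0, 0.5]           # made-up coefficients
y = X * true_theta + 0.1 * randn(rng, m)     # targets with a little noise

theta_gd = gd_sketch(X, y, 0.1, 2000)
theta_ls = X \ y                             # direct least-squares solution

println(maximum(abs.(theta_gd - theta_ls)))  # largest coefficient difference; should be tiny
```

On well-scaled features the two solutions agree closely, which is one reason the scaling step matters for gradient descent.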

9. Finally, we are ready to train the model with our scaled and transformed data set.

theta, cost = gradient_descent(X_train_scaled, y_train, 0.05, true, 1000)

# plot the cost during training
plot(cost,
     label="Cost per iter",
     ylabel="Cost",
     xlabel="Number of iterations",
     title="Cost Per Iteration")

10. We will now make predictions on both the training and the testing data using our model.

# Function used for prediction on a given data set using trained theta values
function predict(X, theta, fit_intercept=true)
    m = size(X, 1)

    if fit_intercept
        # Add the same bias column of ones used during training
        b = ones(m)
        X = hcat(b, X)
    end

    predictions = X * theta

    return predictions
end

# Make predictions for both training and testing data sets
pred_train = predict(X_train_scaled, theta)
pred_test = predict(X_test_scaled, theta)

print("Predictions made on the training and testing set")

11. We have now reached the final step, where we assess how well the model fits by measuring the R-squared value of its predictions.

# The function below returns the R-squared value based on the predictions (y_pred) and the ground truth values (y_true) passed to it.

function r_squared_score(y_pred, y_true)

    # Compute the total sum of squares (SST) and the sum of squared residuals (SSR)
    sst = sum((y_true .- mean(y_true)) .^ 2)
    ssr = sum((y_pred .- y_true) .^ 2)

    r_square = 1 - (ssr / sst)

    return r_square
end

# Get the R-squared score for the training and test data sets
train_r_square = r_squared_score(pred_train, y_train)
test_r_square = r_squared_score(pred_test, y_test)

println("Training R-squared score: ", train_r_square)
println("Testing R-squared score: ", test_r_square)
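
To build some intuition for what the score means, the sketch below evaluates the same formula on a few hand-made vectors. The helper name r2_demo and the numbers are hypothetical and are not taken from the Boston data.

```julia
using Statistics

# Same formula as above, under a hypothetical name so the sketch is self-contained
function r2_demo(y_pred, y_true)
    sst = sum((y_true .- mean(y_true)) .^ 2)   # total sum of squares
    ssr = sum((y_pred .- y_true) .^ 2)         # sum of squared residuals
    return 1 - ssr / sst
end

y_true = [1.0, 2.0, 3.0, 4.0]

println(r2_demo(y_true, y_true))                  # perfect predictions give 1.0
println(r2_demo(fill(mean(y_true), 4), y_true))   # always predicting the mean gives 0.0
println(r2_demo([1.1, 1.9, 3.2, 3.8], y_true))    # close predictions give about 0.98
```

A score near 1 means the residuals are small relative to the spread of the targets, while a score near 0 means the model does no better than predicting the mean.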

Figure 3 shows the R-squared score for the model, which rounds to about 0.73. In this demonstration, we trained the linear regression model from scratch. Of course, just as in Python, you always have the option of using a ready-made model from the stats package, GLM, that we imported in step 1. That way, you can execute this entire process in four lines of code.

Figure 3: R-Square score for test data

df = DataFrame(CSV.File("Boston.csv"))
fm = @formula(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat)
linearRegressor = lm(fm, df)

Figure 4 shows us the coefficients that are derived using the linear regression model from the stats package. Let us now calculate the r-square for this model:

Figure 4: Output of regression model using stats package

# R Square value of the model
r2(linearRegressor)

From Figure 5, it is clear that it makes little difference whether you train the model from scratch or take the more convenient route of using a regression model from the stats package. The difference between the two R-squared scores, 0.73 and 0.74, is negligible.

Figure 5: R-squared value of the regression model from the stats package
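
For readers who want to try the GLM route without the Boston file, here is a minimal self-contained sketch on made-up data. The toy DataFrame and its column names x and y are hypothetical; lm, @formula, coef and r2 are GLM functions, and the prediction call is written as GLM.predict because a notebook that has already defined its own predict function would otherwise call that one instead.

```julia
using DataFrames, GLM

# Hypothetical toy data: y is roughly 2x + 1 with a small alternating offset
x = collect(1.0:10.0)
y = 2 .* x .+ 1 .+ 0.1 .* (-1.0) .^ (1:10)
toy = DataFrame(x = x, y = y)

# Fit an ordinary least-squares model with the formula interface
model = lm(@formula(y ~ x), toy)

println(coef(model))   # intercept and slope, near 1 and 2 for this data
println(r2(model))     # close to 1, since the data is nearly linear

# Predict for new rows; qualified with the module name to avoid a clash with a user-defined predict
new_rows = DataFrame(x = [11.0, 12.0])
println(GLM.predict(model, new_rows))
```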

In this article, we implemented a linear regression model using the Julia programming language to predict prices in the housing market of Boston, Massachusetts, USA. The article was written to give you a taste of Julia, and the example is only a simple demonstration of what the language can do. If you are coming from a Python background, you will find many similarities between the two languages in terms of syntax, structure and approach. This is good news, as it makes the learning curve gentle.

It is my opinion that Julia is a must-have on your resume, whether you are a practising or a prospective data scientist.
