How to Create DataFrame in Julia?

A DataFrame in Julia is a two-dimensional data structure made up of named columns, each with its own element type. DataFrames are similar to tables in a relational database, data frames in R, or pandas DataFrames in Python. They are provided by the DataFrames.jl package rather than by the Julia standard library. To create a DataFrame, call the DataFrame constructor with the data that you want to include. Below are some examples:

Create DataFrame in Julia Examples

julia> using DataFrames

julia> df = DataFrame(A = [1, 2, 3], B = ["a", "b", "c"])
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ b      │
│ 3   │ 3     │ c      │

In this example, the df variable is initialized to a DataFrame with two columns named A and B. The A column contains the numbers 1, 2, and 3, and the B column contains the strings "a", "b", and "c".
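To confirm what the constructor built, you can inspect the shape, column names, and column types. The snippet below is a minimal sketch, assuming DataFrames.jl is installed; the variable names dims, col_types, and col_names are only illustrative.

```julia
using DataFrames

df = DataFrame(A = [1, 2, 3], B = ["a", "b", "c"])

dims = size(df)                               # (number of rows, number of columns)
col_types = [eltype(c) for c in eachcol(df)]  # element type of each column
col_names = Symbol.(names(df))                # column names, normalized to Symbols
```

Converting the names with Symbol.(...) keeps the check independent of whether the installed DataFrames.jl version returns column names as strings or as symbols.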

Create a DataFrame from Dictionary in Julia

You can also create a DataFrame from a dictionary, where the keys of the dictionary become the column names and the values of the dictionary become the data for the columns. For example:

julia> data = Dict(:A => [1, 2, 3], :B => ["a", "b", "c"])
Dict{Symbol,Array{T,1} where T} with 2 entries:
  :B => ["a", "b", "c"]
  :A => [1, 2, 3]

julia> df = DataFrame(data)
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ b      │
│ 3   │ 3     │ c      │
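A Dict does not preserve insertion order, so the column order of the result does not follow the order in which the keys were written. To my knowledge, DataFrames.jl sorts the columns by name when it builds a DataFrame from a plain Dict. The sketch below illustrates this and shows one way to pick a different order explicitly; df_ordered is an illustrative name.

```julia
using DataFrames

# Keys are deliberately written out of alphabetical order.
data = Dict(:B => ["a", "b", "c"], :A => [1, 2, 3])
df = DataFrame(data)

col_names = Symbol.(names(df))   # column names after construction

# Select the columns explicitly when a specific order is required.
df_ordered = df[:, [:B, :A]]
```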

Access the DataFrame in Julia

You can access a column of a DataFrame using the df.columnname syntax or the df[!, columnname] syntax. For example:

julia> df = DataFrame(A = [1, 2, 3], B = ["a", "b", "c"])
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ b      │
│ 3   │ 3     │ c      │

julia> df.A
3-element Array{Int64,1}:
 1
 2
 3

julia> df[!, :B]
3-element Array{String,1}:
 "a"
 "b"
 "c"

In this example, the df.A and df[!, :B] expressions return the A and B columns of the df DataFrame, respectively. The dot syntax accesses a column by name as a property of the DataFrame. The [!, columnname] syntax indexes the DataFrame by column name, given here as a Symbol; the ! in the row position selects all rows and returns the stored column without copying it.
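The row selector matters when you modify the result. As a minimal sketch, assuming a DataFrames.jl version that supports the ! selector: df[!, :A] gives the vector stored in the DataFrame, while df[:, :A] gives a copy. The names no_copy and copied are illustrative.

```julia
using DataFrames

df = DataFrame(A = [1, 2, 3], B = ["a", "b", "c"])

no_copy = df[!, :A]   # the column vector stored inside df
copied  = df[:, :A]   # an independent copy of the column

copied[1] = 100       # changes only the copy; df is untouched
no_copy[2] = 200      # changes df itself, because no copy was made
```

Use the : form when you want to work on the data without risking changes to the original DataFrame.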

You can also use the df[row, column] syntax to access individual elements of a DataFrame. For example:

julia> df = DataFrame(A = [1, 2, 3], B = ["a", "b", "c"])
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ b      │
│ 3   │ 3     │ c      │

julia> df[1, :A]
1

julia> df[2, :B]
"b"

In this example, the df[1, :A] expression returns the first element of the A column, which is the number 1, and the df[2, :B] expression returns the second element of the B column, which is the string "b". The [row, column] syntax takes a row index followed by a column name or column index. The ! selector cannot be combined with a row index, because it stands in the row position and means all rows.
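Element access by row and column can be sketched with the two-argument indexing form df[row, column], where the column may be given by name or by position. This is a minimal illustration, assuming DataFrames.jl is installed; the variable names are illustrative.

```julia
using DataFrames

df = DataFrame(A = [1, 2, 3], B = ["a", "b", "c"])

first_a   = df[1, :A]    # row 1 of column A
second_b  = df[2, :B]    # row 2 of column B
by_index  = df[3, 2]     # row 3 of the second column (B), by position
first_two = df[1:2, :]   # rows 1 and 2 of every column, as a new DataFrame
```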

Get Stats of a DataFrame in Julia

You can use the describe function to get summary statistics for the columns of a DataFrame. The describe function returns a new DataFrame with one row per column of the original. By default it reports statistics such as the mean, minimum, median, and maximum, along with the number of missing values and the element type of each column. For example:

julia> df = DataFrame(A = [1, 2, 3], B = ["a", "b", "c"])
3×2 DataFrame
│ Row │ A     │ B      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ b      │
│ 3   │ 3     │ c      │

julia> describe(df)
2×8 DataFrame
│ Row │ variable │ mean    │ min │ median  │ max │ nunique │ nmissing │ eltype   │
│     │ Symbol   │ Float64 │ Any │ Float64 │ Any │ Int64   │ Int64    │ DataType │
├─────┼──────────┼─────────┼─────┼─────────┼─────┼─────────┼──────────┼──────────┤
│ 1   │ A        │ 2.0     │ 1   │ 2.0     │ 3   │         │          │ Int64    │
│ 2   │ B        │         │ a   │         │ c   │ 3       │          │ String   │

In this example, the describe function is applied to the df DataFrame, and it returns a new DataFrame with the summary statistics for each column. The mean, min, median, and max columns contain the mean, minimum, median, and maximum values for each column, respectively. The nunique column contains the number of unique values in each column, and the nmissing column contains the number of missing values in each column.
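If you only need some of the statistics, describe accepts the statistic names as extra arguments. The sketch below assumes DataFrames.jl 1.x, where the statistics are passed positionally as Symbols (older releases used a different calling convention); stats is an illustrative name.

```julia
using DataFrames

df = DataFrame(A = [1, 2, 3], B = ["a", "b", "c"])

# Request only the mean, minimum, and maximum of each column.
stats = describe(df, :mean, :min, :max)

mean_a = stats.mean[1]   # mean of the numeric column A
mean_b = stats.mean[2]   # nothing: a mean is not defined for the String column B
```

The result is itself a DataFrame, so the usual column and element access syntax applies to it.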
