First, a word on where the concept of a data frame comes from. Data frames originated in statistical software, where data from empirical research is usually organized as tables. A data frame represents tabular data: in R, it is a data structure in which each row is an observation (a case) and each column is a measurement (a variable).
A data frame is used for storing data tables. It is a list of vectors of equal length, with one vector per column.
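As a rough sketch of the "list of equal-length vectors" idea, the snippet below builds a small data frame and inspects it as a list. The data frame `people` and its values are made up for illustration and are not part of this tutorial's data.

```r
# Hypothetical example data, used only for illustration.
people <- data.frame(
  id = 1:3,
  name = c("Ann", "Bob", "Cy"),
  stringsAsFactors = FALSE
)

# A data frame is a list: one list element per column.
is_list <- is.list(people)
n_columns <- length(people)

# Each column is a vector, and all of them have the same length,
# which equals the number of rows.
column_lengths <- sapply(people, length)

print(is_list)
print(n_columns)
print(column_lengths)
```

Here `length()` applied to a data frame counts columns, not rows, which follows from the list representation.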
Characteristics of R Data Frame
Now, let’s discuss the characteristics of a data frame in R.
- The column names should be non-empty.
- The row names should be unique.
- A data frame can hold numeric, character, or factor data.
- Each column should contain the same number of data items.
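The characteristics above can be checked with a short sketch. The data frame `scores` and its values are hypothetical; the second part assumes only that `data.frame()` signals an error when column lengths cannot be matched up, which is caught here with `tryCatch()`.

```r
# Hypothetical data: columns of different types side by side.
scores <- data.frame(
  player = c("A", "B", "C"),                  # character column
  points = c(10.5, 8, 12),                    # numeric column
  level  = factor(c("low", "high", "low")),   # factor column
  stringsAsFactors = FALSE
)

# Report the type of each column.
column_classes <- sapply(scores, class)
print(column_classes)

# Columns with 3 and 2 items cannot form a data frame,
# so data.frame() raises an error, which we turn into a message.
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error: columns differ in length"
)
print(result)
```

Note that R recycles shorter columns when the longer length is an exact multiple of the shorter one, so lengths such as 4 and 2 would not trigger this error.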
Create Data Frame
# Create the data frame.
df <- data.frame(
id = 1:4,
name = c("Sam","Dan","Zack","Ryan"),
age = c(62,51,61,72),
stringsAsFactors = FALSE
)
# Print the data frame.
print(df)
Output of the above code
id name age
1 1 Sam 62
2 2 Dan 51
3 3 Zack 61
4 4 Ryan 72
Get the Structure of the R Data Frame
The structure of a data frame can be seen by using the str() function.
# To get the structure of a data frame
> str(df)
'data.frame': 4 obs. of 3 variables:
$ id : int 1 2 3 4
$ name: chr "Sam" "Dan" "Zack" "Ryan"
$ age : num 62 51 61 72
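Alongside str(), a few base R helpers return individual pieces of the same information. The sketch below recreates the tutorial's `df` so that it runs on its own; the variable names `n_rows`, `n_cols`, and `col_names` are chosen here for illustration.

```r
# Recreate the data frame from this tutorial.
df <- data.frame(
  id = 1:4,
  name = c("Sam", "Dan", "Zack", "Ryan"),
  age = c(62, 51, 61, 72),
  stringsAsFactors = FALSE
)

n_rows <- nrow(df)      # number of observations
n_cols <- ncol(df)      # number of variables
col_names <- names(df)  # column names as a character vector

print(dim(df))          # rows and columns together
print(col_names)
```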
Operations on Data Frame
Extract the first two columns
# First two columns of the data frame
> two_col <- df[, c("id", "name")]
> print(two_col)
Output of the above code
id name
1 1 Sam
2 2 Dan
3 3 Zack
4 4 Ryan
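Columns can be selected in several ways, and the result is not always a data frame. The following sketch, which recreates `df` so that it is self-contained, compares the common forms; the variable names are illustrative only.

```r
# Recreate the data frame from this tutorial.
df <- data.frame(
  id = 1:4,
  name = c("Sam", "Dan", "Zack", "Ryan"),
  age = c(62, 51, 61, 72),
  stringsAsFactors = FALSE
)

# Selecting several columns by name keeps the original column names.
by_name <- df[, c("id", "name")]

# Single brackets with one name return a one-column data frame.
one_col_df <- df["name"]

# Double brackets (or $) return the column itself as a vector.
one_col_vec <- df[["name"]]

print(by_name)
print(class(one_col_df))
print(one_col_vec)
```

Wrapping `df$id` and `df$name` in a new `data.frame()` call without naming the arguments would instead produce columns called `df.id` and `df.name`, which is why selecting by name is usually the tidier choice.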
Extract the first two rows with all columns
# Extract the first two rows with all columns
two_row_all_col <- df[1:2,]
print(two_row_all_col)
Output of the above code
id name age
1 1 Sam 62
2 2 Dan 51
Extract the 3rd and 4th rows with the 2nd and 3rd columns
# Extract the 3rd and 4th rows with the 2nd and 3rd columns.
output <- df[c(3:4),c(2:3)]
print(output)
Output of the above code
name age
3 Zack 61
4 Ryan 72
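The same subset can be written with column names instead of positions, which tends to be easier to read. The sketch below recreates `df` and also shows what happens when only one column is requested; treat it as an illustration rather than the only way to subset.

```r
# Recreate the data frame from this tutorial.
df <- data.frame(
  id = 1:4,
  name = c("Sam", "Dan", "Zack", "Ryan"),
  age = c(62, 51, 61, 72),
  stringsAsFactors = FALSE
)

# Rows 3 and 4, selected by column position and by column name.
by_position <- df[3:4, 2:3]
by_name <- df[3:4, c("name", "age")]

# Asking for a single column returns a plain vector by default.
ages_vec <- df[3:4, "age"]

# drop = FALSE keeps the result as a one-column data frame.
ages_df <- df[3:4, "age", drop = FALSE]

print(by_name)
print(ages_vec)
print(ages_df)
```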
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying the summary() function.
# To get the summary of any data
> summ <- summary(df)
> print(summ)
Output of the above code
       id           name                age
 Min.   :1.00   Length:4           Min.   :51.0
 1st Qu.:1.75   Class :character   1st Qu.:58.5
 Median :2.50   Mode  :character   Median :61.5
 Mean   :2.50                      Mean   :61.5
 3rd Qu.:3.25                      3rd Qu.:64.5
 Max.   :4.00                      Max.   :72.0
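To see where the numbers in the age column come from, the same quantities can be computed one at a time with base R functions. This is a sketch using the tutorial's age values; the variable names are illustrative.

```r
# Age values from the tutorial's data frame.
age <- c(62, 51, 61, 72)

age_range <- range(age)                                # minimum and maximum
age_mean <- mean(age)                                  # arithmetic mean
age_median <- median(age)                              # middle value
age_quartiles <- quantile(age, probs = c(0.25, 0.75))  # 1st and 3rd quartiles

print(age_range)
print(age_mean)
print(age_median)
print(age_quartiles)
```

For a character column such as name, summary() reports only the length, class, and mode, since these statistics are not defined for text.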
Built-in Data Frame
For our tutorials, we will use built-in data frames in R. For example, here is a built-in data frame in R, called mtcars.
Motor Trend Car Road Tests
Description
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Usage
mtcars
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
.....
The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell.
To retrieve the data in a cell, we enter its row and column coordinates in the single square bracket “[]” operator. The coordinates begin with the row position, followed by a comma, and end with the column position. The order is important.
To access the value in the 1st row and 4th column
> mtcars[1, 4]
[1] 110
Moreover, we can use the row and column names instead of the numeric coordinates.
> mtcars["Honda Civic", "mpg"]
[1] 30.4
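A small sketch of why the order of the coordinates matters, using the built-in mtcars data shown above. Leaving one coordinate empty selects a whole row or a whole column. The variable names are chosen here for illustration.

```r
# Row 1, column 4: horsepower (hp) of the Mazda RX4.
hp_first_car <- mtcars[1, 4]

# Swapping the coordinates gives row 4, column 1:
# miles per gallon (mpg) of the Hornet 4 Drive.
mpg_fourth_car <- mtcars[4, 1]

# An empty column position returns the whole row as a one-row data frame.
civic_row <- mtcars["Honda Civic", ]

# An empty row position returns the whole column as a vector.
mpg_column <- mtcars[, "mpg"]

print(hp_first_car)
print(mpg_fourth_car)
print(civic_row)
print(length(mpg_column))
```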
To learn more about the dataset, use the help() command.
> help(mtcars)
And to get a list of all the datasets available in R
> data()
Output
Data sets in package 'datasets':
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock
Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
mtcars Motor Trend Car Road Tests
...
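The listing can also be used from code rather than read on screen. The sketch below relies on the object returned by data(), whose `results` component is a character matrix with an `Item` column; it then previews a dataset with head() instead of printing every row. The variable names are illustrative.

```r
# Collect the dataset listing for the 'datasets' package.
available <- data(package = "datasets")$results

# The "Item" column holds the dataset names.
dataset_names <- available[, "Item"]

# Check whether a particular dataset is available.
has_mtcars <- "mtcars" %in% dataset_names
print(has_mtcars)

# Preview only the first three rows of a built-in data frame.
first_rows <- head(mtcars, 3)
print(first_rows)
```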
Conclusion
So, we have learned about the data frame and its characteristics in detail. We have also discussed the different operations on a data frame: creating one, inspecting its structure, extracting rows and columns, and summarizing its data. The examples above should make it easier to start working with data frames on your own.
This brings us to the end of this blog. We really appreciate your time.
We hope you liked it.
Do visit our page www.zigya.com/blog for more informative blogs on Data Science
Keep Reading! Cheers!
Zigya Academy