Summarizing the data into a data frame

To get a summary of the data, we can execute summary(data) and see the relevant summary for each variable. The summary is tailored to each column's data type. As you can see, numerical variables such as ID and NVotes get a quantile summary, while character variables such as RegionName get a length/class/mode description, and factor (categorical) variables such as AreaType get a count for each category. If a factor has many categories, the summary shows the ones that appear the most and groups the rest under (Other).

summary(data)
#>        ID         RegionName            NVotes          Leave
#>  Min.   :   1   Length:1070        Min.   : 1039   Min.   : 287
#>  1st Qu.: 268   Class :character   1st Qu.: 4252   1st Qu.:1698
#>  Median : 536   Mode  :character   Median : 5746   Median :2874
#>  Mean   : 536                      Mean   : 5703   Mean   :2971
#>  3rd Qu.: 803                      3rd Qu.: 7020   3rd Qu.:3936
#>  Max.   :1070                      Max.   :15148   Max.   :8316
(Truncated output)

From here, we can see that London is the region to which the most wards belong, followed by the North West and the West Midlands. We can also see that the ward with the fewest votes in all of the data had only 1,039 votes, the one with the most votes had 15,148, and the mean number of votes per ward was 5,703. We will take a deeper look into these kinds of analyses later in the chapter. For now, we'll focus on making this summary data useful for further analysis. As you may have noticed, we can't use the summary() results to make computations. We can try to save the summary into a variable, find out the variable's type, and traverse it in an appropriate way. However, if we do that, we will find that it's text data, which means that we can't use it for computations as it is:

summary <- summary(data) 
class(summary)
#> [1] "table"
summary[1]
#> [1] "Min. : 1 "
class(summary[1])
#> [1] "character"

Surely, there must be a way to get the summary data into a data frame for further analysis. This is R, so you can bet there is! The first thing we should note is that we can't directly translate the output of the summary() function into a data frame because of the non-numerical variables. These non-numerical variables produce a different summary structure, one that is not composed of the minimum, first quartile, median, mean, third quartile, and maximum values. This means that we first need to subset the data to get only the numerical variables. After all, a data frame is a rectangular structure with well-defined rows and columns; if we tried to mix numerical and non-numerical summaries in a single data frame, we would have a hard time doing so.
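To see concretely why these structures can't be mixed, consider a minimal sketch with hypothetical toy vectors: summary() returns a six-element quantile summary for numeric input, but a three-element length/class/mode description for character input, so the two don't line up as rows of a single rectangular structure:

```r
# summary() of a numeric vector: six named values (hypothetical toy data)
numeric_summary <- summary(c(1, 2, 3, 4))
names(numeric_summary)
#> [1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."

# summary() of a character vector: a different, three-element structure
character_summary <- summary(c("a", "b", "b"))
names(character_summary)
#> [1] "Length" "Class"  "Mode"
```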

To find if a column is numeric or not, we can use the is.numeric() function. For example, we can see that the Proportion column is numeric and the RegionName is not:

is.numeric(data$Proportion)
#> [1] TRUE
is.numeric(data$RegionName)
#> [1] FALSE

We can then apply is.numeric() to each column by using the sapply() function. This will give us a logical (Boolean) vector with a TRUE or FALSE value for each column, indicating whether or not it's numeric. Then we can use this logical vector to subset our data and get only the numerical columns with data[, numerical_variables]. As you can see, there are no non-numerical columns in the data_numerical object:

numerical_variables <- sapply(data, is.numeric) 
numerical_variables
#> ID RegionName NVotes Leave Residents
#> TRUE FALSE TRUE TRUE TRUE
#> Households MeanAge AdultMeanAge Age_0to4 Age_5to7
#> TRUE TRUE TRUE TRUE TRUE
(Truncated output)

data_numerical <- data[, numerical_variables]
colnames(data_numerical)
#> [1] "ID" "NVotes" "Leave" "Residents"
#> [5] "Households" "MeanAge" "AdultMeanAge" "Age_0to4"
#> [9] "Age_5to7" "Age_8to9" "Age_10to14" "Age_15"
#> [13] "Age_16to17" "Age_18to19" "Age_20to24" "Age_25to29"
(Truncated output)

Since it doesn't make much sense to get summary values for the ID variable, we can remove it from the logical vector, effectively treating it as a non-numerical variable. If we do, we must remember to recreate the data_numerical object to make sure it doesn't include the ID variable either:

numerical_variables[["ID"]] <- FALSE 
data_numerical <- data[, numerical_variables]

To create our numerical variables summary, we will first apply the summary() function we used before to each numerical column using the lapply() function. The lapply() function returns a named list, where each list member has the corresponding column name:

lapply(data[, numerical_variables], summary)
#> $NVotes
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1039 4252 5746 5703 7020 15148
#>
#> $Leave
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 287 1698 2874 2971 3936 8316
#>
#> $Residents
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1932 8288 11876 11646 14144 34098
#>
(Truncated output)

Now we need to put each member of this list together into a data frame. To do so, we will use the cbind() and do.call() functions. do.call() calls cbind() a single time, passing it every member of the list generated by lapply() as a separate argument, which binds them all together as columns. To get a reminder on how these operations work, take a look at Chapter 1, Introduction to R:

numerical_summary <- do.call(cbind, lapply(data_numerical, summary))
#> Warning in (function (..., deparse.level = 1) : number of rows of result is
#> not a multiple of vector length (arg 1)

numerical_summary
#> NVotes Leave Residents Households MeanAge AdultMeanAge Age_0to4
#> Min. 1039 287 1932 779 27.80 29.20 2.200
#> 1st Qu. 4252 1698 8288 3466 35.60 44.10 5.400
#> Median 5746 2874 11876 4938 38.70 47.40 6.300
#> Mean 5703 2971 11646 4767 38.45 46.85 6.481
#> 3rd Qu. 7020 3936 14144 5832 41.40 49.90 7.500
#> Max. 15148 8316 34098 15726 51.60 58.10 12.300
#> NA's 1039 267 1932 779 27.80 29.20 2.200

We got our results, but not so fast! We got a warning, and it looks suspicious. What does this number of rows of result is not a multiple of vector length message mean? Aha! If we take a more detailed look at the list we previously got from our lapply() call, we can see that in the case of Leave (and Proportion) we get an extra NA's entry that we don't get for any other variable. That means that when we try to use cbind() on these summaries, the extra NA's entry creates an extra space that needs to be filled. This is a problem we looked at in Chapter 1, Introduction to R.

As we saw then, R deals with it by recycling the shorter vectors in order until all spaces are filled. In our case, this means that the first element, the one corresponding to the minimum value, will be repeated in the NA's row for each column that doesn't have an NA count of its own. You can verify this by comparing the numbers in the Min. and NA's rows for variables other than Leave or Proportion (for these two, the values should actually be different).
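This recycling behavior is easy to reproduce in isolation; a minimal sketch with hypothetical vectors of lengths 3 and 4 shows the same warning and the repeated first element:

```r
# cbind() recycles the shorter vector from its first element and warns
# when the lengths don't divide evenly (hypothetical toy vectors)
short <- c(1, 2, 3)
long  <- c(10, 20, 30, 40)
cbind(short, long)
#> Warning: number of rows of result is not a multiple of vector length (arg 1)
#>      short long
#> [1,]     1   10
#> [2,]     2   20
#> [3,]     3   30
#> [4,]     1   40
```

Note how the fourth row of short is 1, the recycled first element, just like the Min. values repeated in our summary's NA's row.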

To fix it, we could just remove the extra NA's row from the resulting data frame, but that would treat the symptom rather than the warning's source. To deal with the source, we need each variable's summary to have the same number of elements before we apply cbind(). Since we already know that we have 267 missing values for the Leave variable, which in turn affects the Proportion variable, we can fix this by keeping only the complete cases, meaning the observations that don't have an NA value in any of their variables; or, put another way, by dropping every observation that contains at least one NA. Once we do that, we get our results back without any warnings:

data <- data[complete.cases(data), ] 
data_numerical <- data[, numerical_variables]
numerical_summary <- do.call(cbind, lapply(data_numerical, summary))
numerical_summary
#> NVotes Leave Residents Households MeanAge AdultMeanAge Age_0to4
#> Min. 1039 287 1932 779 28.40 30.50 2.200
#> 1st Qu. 4242 1698 8405 3486 35.70 44.10 5.400
#> Median 5739 2874 11911 4935 38.60 47.40 6.300
#> Mean 5725 2971 11739 4793 38.43 46.83 6.479
#> 3rd Qu. 7030 3936 14200 5850 41.40 49.90 7.500
#> Max. 15148 8316 34098 15726 47.30 56.10 12.300
(Truncated output)
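The behavior of complete.cases() itself can be sketched on a small hypothetical data frame: it returns TRUE only for rows with no NA values in any column, so subsetting with it drops every row containing at least one NA:

```r
# complete.cases() flags rows with no NA in any column (hypothetical data)
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "c"))
complete.cases(df)
#> [1]  TRUE FALSE FALSE
df[complete.cases(df), ]
#>   x y
#> 1 1 a
```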

If we want to get the summary values as columns and the variables as rows, we can use the rbind() function instead of cbind(). The structure we actually end up using will depend on what we want to do with it. However, we can easily change between them later if we need to:

do.call(rbind, lapply(data_numerical, summary))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> NVotes 1039.0000 4241.5000 5.739e+03 5.725e+03 7.030e+03 1.515e+04
#> Leave 287.0000 1697.5000 2.874e+03 2.971e+03 3.936e+03 8.316e+03
#> Residents 1932.0000 8405.0000 1.191e+04 1.174e+04 1.420e+04 3.410e+04
#> Households 779.0000 3486.0000 4.935e+03 4.793e+03 5.850e+03 1.573e+04
#> MeanAge 28.4000 35.7000 3.860e+01 3.843e+01 4.140e+01 4.730e+01
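Switching between the two orientations after the fact only requires a transpose with t(); a minimal sketch with hypothetical toy columns:

```r
# The rbind and cbind orientations are transposes of each other
# (hypothetical toy columns)
summaries <- lapply(list(a = 1:4, b = 5:8), summary)
by_column <- do.call(cbind, summaries)
by_row    <- do.call(rbind, summaries)
all(t(by_row) == by_column)
#> [1] TRUE
```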

Now that we have this numerical_summary object, we can use it to perform computations, such as finding the range of the proportion of votes in favor of leaving across wards (0.6681), which may be useful for interpreting the big differences among the types of wards we may find in the UK. If we want to know which wards produce this result, we can search for the wards with the lowest and highest proportions of votes:

numerical_summary["Max.", "Proportion"] - numerical_summary["Min.", "Proportion"]
#> [1] 0.6681

desired_variables <- c(
    "ID",
    "NoQuals",
    "Proportion",
    "AdultMeanAge",
    "L4Quals_plus",
    "RegionName"
)

data[which.max(data$Proportion), desired_variables]
#> ID NoQuals Proportion AdultMeanAge L4Quals_plus RegionName
#> 754 754 35.8 0.7897 48.7 13.7 L

data[which.min(data$Proportion), desired_variables]
#> ID NoQuals Proportion AdultMeanAge L4Quals_plus RegionName
#> 732 732 2.8 0.1216 31.2 44.3 EE

As you can see, this analysis already shows some interesting results. The UK ward that voted to leave the EU the most is characterized by older people (AdultMeanAge) with low education levels (NoQuals, L4Quals_plus). On the other hand, the UK ward that voted to remain in the EU the most is characterized by younger people with much higher education levels. Of course, this is not the full picture, but it's a hint about the direction in which we need to look to further understand what's going on. For now, we have found that education and age seem to be relevant variables for the analysis.